ARTICLE
Communicated by Bard Ermentrout
Asynchronous States and the Emergence of Synchrony in Large Networks of Interacting Excitatory and Inhibitory Neurons

D. Hansel
[email protected] Laboratoire de Neurophysique et de Physiologie du Système Moteur, Université René Descartes, 75270 Paris Cedex 06, France, and Interdisciplinary Center for Neural Computation, Hebrew University of Jerusalem, 91904 Jerusalem, Israel
G. Mato
[email protected] Comisión Nacional de Energía Atómica and CONICET, Centro Atómico Bariloche and Instituto Balseiro (CNEA and UNC), 8400 San Carlos de Bariloche, R.N., Argentina
We investigate theoretically the conditions for the emergence of synchronous activity in large networks, consisting of two populations of extensively connected neurons, one excitatory and one inhibitory. The neurons are modeled with quadratic integrate-and-fire dynamics, which provide a very good approximation for the subthreshold behavior of a large class of neurons. In addition to their synaptic recurrent inputs, the neurons receive a tonic external input that varies from neuron to neuron. Because of its relative simplicity, this model can be studied analytically. We investigate the stability of the asynchronous state (AS) of the network with given average firing rates of the two populations. First, we show that the AS can remain stable even if the synaptic couplings are strong. Then we investigate the conditions under which this state can be destabilized. We show that this can happen in four generic ways. The first is a saddle-node bifurcation, which leads to another state with different average firing rates. This bifurcation, which occurs for strong enough recurrent excitation, does not correspond to the emergence of synchrony. In contrast, in the three other instability mechanisms, Hopf bifurcations, which correspond to the emergence of oscillatory synchronous activity, occur. We show that these mechanisms can be differentiated by the firing patterns they generate and by their dependence on the mutual interactions of the inhibitory neurons and on the cross talk between the two populations. We also show that besides these codimension 1 bifurcations, the system can display several codimension 2 bifurcations: Takens-Bogdanov, Gavrielov-Guckenheimer, and double Hopf bifurcations.
Neural Computation 15, 1–56 (2003)
© 2002 Massachusetts Institute of Technology
1 Introduction

Synchrony can be defined in several ways. In its broader definition, the activities of two neurons are said to be synchronized if some kind of temporal correlation exists between them. In this sense, in any system of interacting neurons, some level of synchrony should exist. In this respect, observing synchrony between directly interacting neurons should not be surprising. In fact, several studies in the 1960s and 1970s were devoted to determining what the shape of the cross-correlations of a pair of neurons can tell us about their interactions (Abeles, 1990). The implicit assumption in this approach was that correlations are dominated by direct interactions between the neurons. Other types of synchronous states can emerge through cooperative effects. In these states, the correlations in the activity of a pair of neurons in the network depend weakly on their direct interaction (Hansel & Sompolinsky, 1996). Instead, they are mostly mediated by the inputs to the neurons coming from all the other neurons belonging to the part of the network with which they interact. To define this type of synchronous state, it is appropriate to consider macroscopic observables, for example, the instantaneous activity of the neurons averaged across assemblies of neurons (multi-unit activities) or local field potentials that presumably reflect electrical activity induced by many neurons. In an asynchronous state, the temporal fluctuations of these quantities (suitably normalized) are small, on the order of $1/\sqrt{M}$, where M is the size of the neuronal assembly. When synchrony emerges, these fluctuations become finite even if the population is large. Understanding the conditions for the emergence of these states in the brain is one of the goals of recent research in the field of collective neuronal dynamics.
Central issues concern the roles of the cellular properties of the neurons; the nature (excitatory, inhibitory), the strength, and the kinetics of the synaptic interactions; and the local and global architectures of the networks. The ways in which noise, heterogeneities, and spatial fluctuations of connectivity patterns affect synchrony are also among the fundamental questions investigated. One possible approach to these questions is to study cooperative dynamical states in simplified network models of local neuronal populations in the brain. Within limits, these models can be studied in depth analytically. Subsequently, these analytical investigations can be extended using numerical simulations beyond the limits that can be treated analytically. This approach has led to substantial progress in our understanding of the general mechanisms of neuronal synchrony. However, most of the theoretical studies have addressed the case of one population of neurons; much less is known about networks comprising several populations. The goal of this article is to study the emergence of synchronous activity in large networks consisting of two interacting populations of neurons, one excitatory and the other inhibitory. We investigate the conditions
under which asynchronous states (AS) become unstable due to the interactions. More specifically, we inquire: (1) Under which conditions can the network display a stable asynchronous state? Is this possible when the couplings are strong? (2) What are the different mechanisms through which the AS can become unstable in the network? (3) In what ways do they depend on the recurrent interactions among the excitatory neurons, the mutual interactions of the inhibitory neurons, and the excitatory-inhibitory feedback connections between the two populations? (4) What patterns of synchrony emerge when the AS is unstable? In particular, can one differentiate unambiguously between the regimes of parameters in which neurons spike in synchrony and the regimes in which they burst in synchrony? (5) More quantitatively, what determines the period of the synchronous oscillations and the phase shift between the oscillations in the two populations?

The article is organized as follows. The model is introduced in section 2. In section 3, the method of AS stability analysis is presented, and a spectral equation for the eigenmodes of the linearized dynamics is established. Assuming that the synaptic time constants of the excitatory and inhibitory synapses are identical simplifies the analysis of this equation. This allows us in section 4 to study the instabilities of the AS, to a large extent, completely analytically. The various bifurcations of the AS are characterized. The results of section 4 are subsequently generalized in section 5. Results from numerical simulations are presented in section 6. Section 7 is devoted to a discussion of our results.

2 The Model

2.1 The Quadratic Integrate-and-Fire Model. Near the onset of firing, it can be shown that the detailed dynamics of any type I conductance-based model can be replaced by the reduced model (Ermentrout & Kopell, 1986; Ermentrout, 1996),

$$\bar C\,\frac{dV}{dt} = A\,(V - V^*)^2 + I - I_c,$$
(2.1)
where $I_c$ is the rheobase of the neuron. Analytical formulas can be derived to compute the effective capacitance C̄, the constant A, and $V^*$. Using these formulas, one can compute these parameters numerically as functions of the biophysical parameters of the full conductance-based model (see appendixes A and B). The solution of equation 2.1 diverges in finite time. This divergence corresponds to the firing of an action potential. If one supplements equation 2.1 with the condition that the variable V has to be reset to $-\infty$ following a spike, this yields a reduced model that accurately describes the dynamics of the full conductance-based model in the limit $I \to I_c$. In
this model, the relationship between the firing rate ν and the current I is given by

$$\nu = B\sqrt{I - I_c},$$

(2.2)

where $B = \sqrt{A}/(\pi \bar C)$. When $I - I_c$ is not small, this reduction is no longer exact. However, one can generalize it heuristically as follows. We assume that action potentials are fired whenever V crosses some threshold, $V_t$ (from below), and that it is immediately reset to a value $V_r$. The relationship between the discharge rate and the external current of the model is now given by

$$\frac{1}{\nu} = \frac{1}{\pi B\sqrt{I - I_c}}\left[\arctan\!\left(\frac{\sqrt{A}\,V_t}{\sqrt{I - I_c}}\right) - \arctan\!\left(\frac{\sqrt{A}\,V_r}{\sqrt{I - I_c}}\right)\right].$$
(2.3)
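Equation 2.3 is straightforward to evaluate numerically. Below is a minimal sketch, using the WB-derived parameter values quoted later in this section and assuming (as in the fit described there) that $V_t$ and $V_r$ are measured relative to $V^*$; the function name and unit bookkeeping are ours:

```python
import math

def qif_rate_hz(I, Ic=0.1601, A=0.012875, C=0.9467, Vt=33.2, Vr=-4.6):
    """Firing rate of the QIF model, eq. 2.3; zero below the rheobase Ic.

    Units: I, Ic in muA/cm^2, A in mS/cm^2/mV, C in muF/cm^2, Vt, Vr in mV
    (relative to V*), so the interspike interval T comes out in ms.
    """
    if I <= Ic:
        return 0.0
    k = math.sqrt(A / (I - Ic))
    T = (C / math.sqrt(A * (I - Ic))) * (math.atan(Vt * k) - math.atan(Vr * k))
    return 1000.0 / T
```

With these numbers, `qif_rate_hz(2.0)` comes out close to the roughly 100 Hz that Figure 1 shows at I = 2 μA/cm².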
The two parameters, $V_t$ and $V_r$, are phenomenological. They are not determined by the exact reduction method near the firing onset, and additional constraints are required to determine them. In the following, we impose that the ν–I curve of the quadratic integrate-and-fire (QIF) model matches the ν–I curve of the full model as much as possible. It is convenient to rewrite the reduced model in terms of the dimensionless variables V̄ and Ī defined by

$$\bar V = \frac{A}{\bar C}\,t_0\,(V - V^*)$$

(2.4)

$$\bar I = \frac{A}{\bar C^2}\,t_0^2\,I,$$

(2.5)

where $t_0$ has the dimension of a time. The dynamics are governed by

$$t_0\,\frac{d\bar V}{dt} = \bar V^2 + \bar I - \bar I_c.$$
(2.6)
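The reduced dynamics, equation 2.6, can also be integrated directly with the reset rule and checked against the closed-form interspike interval. A sketch (forward Euler with an illustrative step size of our choosing; the reduced reset and threshold values V̄r = −0.626 and V̄t = 4.52 are taken from the fit described next):

```python
import math

def isi_euler(dI, vr=-0.626, vt=4.52, dt=1e-4):
    """Interspike interval of eq. 2.6 (time in units of t0) by forward Euler,
    for a suprathreshold input dI = I - Ic > 0."""
    v, t = vr, 0.0
    while v < vt:
        v += dt * (v * v + dI)   # t0 * dV/dt = V^2 + I - Ic
        t += dt
    return t

def isi_exact(dI, vr=-0.626, vt=4.52):
    """Closed-form interval, the reduced-units counterpart of eq. 2.3."""
    s = math.sqrt(dI)
    return (math.atan(vt / s) - math.atan(vr / s)) / s
```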
The dynamical variable V̄ is reset to $\bar V_r = A(V_r - V^*)t_0/\bar C$ whenever it reaches the threshold value $\bar V_t = A(V_t - V^*)t_0/\bar C$. A similar model has been used by Latham, Richmond, Nelson, and Nirenberg (2000) to study with numerical simulations a two-population network model. In this article, we use a QIF model with parameters derived from the Wang-Buzsáki (WB) conductance-based model described in appendix A. Using the approach described above, one finds that $I_c = 0.1601\ \mu A/cm^2$, $V^* = -59.5462$ mV, $\bar C = 0.9467\ \mu F/cm^2$, and $A = 0.012875\ mS/cm^2/mV$. Requiring that the ν–I curves of the reduced and the full model agree well up to firing rates as large as $\nu_{max} = 100$ spikes per second yields for the other parameters $V_r - V^* = -4.6$ mV and $V_t - V^* = 33.2$ mV ($\bar V_r = -0.626$, $\bar V_t = 4.52$).

Figure 1: Firing rate of a neuron versus the external current. Circles: Wang-Buzsáki model. Solid line: QIF model with $A = 0.012875\ mS/cm^2/mV$, $V_r - V^* = -4.6$ mV, $V_t - V^* = 33.2$ mV, $I_c = 0.1601\ \mu A/cm^2$, and $\bar C = 0.9467\ \mu F/cm^2$.

The ν–I curves of the full model and the QIF model with these parameters are plotted in Figure 1. Clearly the fit is excellent over a broad range of firing rates. The traces of the membrane potential for neurons firing at 50 spikes per second and 10 spikes per second are shown in Figures 2A and 2B, respectively, for this QIF model. For comparison, we have also plotted the traces of a WB neuron that fires at the same frequencies. For a frequency of 10 Hz, the traces are very similar. When the frequency increases, the traces of the QIF neuron deviate from those of the WB neuron.

2.2 The Two-Population Network Model. We deal with a network of two heterogeneous populations: one excitatory and the other inhibitory. The dynamics of the neurons are modeled by equation 2.6,

$$\frac{dV_{ia}}{dt} = V_{ia}^2 + I_{ia} + I_{ia}^{syn}(t) - I_a^{th},$$
(2.7)
where $i = 1, \ldots, N_a$ are the indices of the neurons in population a (a = E, I). For simplicity, the firing threshold of the neurons, $V_t$, the resetting value, $V_r$, and the rheobase currents are the same for all the neurons. Time is measured in units of $t_0 = 10$ ms. The current $I_{ia}$ represents the effect of an external stimulus that is assumed to be constant in time but different from one neuron to another. We consider the case where $N_a$ is very large and the external input currents to the neurons in population a are distributed according to a density distribution $Q_a(I)$. The synaptic current, $I_{ia}^{syn}(t)$, on the ith cell, is the sum of
Figure 2: Traces of a WB neuron (solid line) and QIF neuron (dashed line). Parameters of the model are the same as in Figure 1. (A) The neuron fires at 50 Hz. (B) The neuron fires at 10 Hz.

all contributions $I_{ij}^{ab,\,syn}$ from all the presynaptic cells on it:

$$I_{ia}^{syn}(t) = \sum_{j,b} I_{ij}^{ab,\,syn}(t).$$
(2.8)
In this work, the current $I_{ij}^{ab,\,syn}(t)$ is modeled by

$$I_{ij}^{ab,\,syn}(t) = G_{ij}^{ab,\,syn}\,s_{jb}(t),$$

(2.9)

where $G_{ij}^{ab,\,syn}$ measures the strength of the synapse that neuron j in population b makes on neuron i in population a, and $s_{jb}$ evolves with time according to

$$s_{jb}(t) = \sum_{spikes} f_b(t - t_{spike,j}),$$
(2.10)
where the summation is performed over all the spikes emitted by the presynaptic neuron with index j at times $t_{spike,j}$. The function $f_b(t)$ is

$$f_b(t) = \frac{1}{\tau_{2b} - \tau_{1b}}\left[\exp\left(-\frac{t}{\tau_{2b}}\right) - \exp\left(-\frac{t}{\tau_{1b}}\right)\right] H(t).$$
(2.11)
Here, H(t) is the Heaviside function, and the normalization is such that the integral of $f_b(t)$ is one. The characteristic times $\tau_{1b}$ and $\tau_{2b}$ are the rise and decay times of the synapse, respectively. In the case $\tau_{1b} = \tau_{2b} = \tau_b$, one obtains the so-called alpha function (Rall, 1967):

$$f_b(t) = \frac{t}{\tau_b^2}\exp\left(-\frac{t}{\tau_b}\right).$$
(2.12)
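Both kernels integrate to one, which a direct quadrature confirms. A small check, with illustrative rise and decay times of our choosing (the double-exponential kernel is coded in an algebraically equivalent form):

```python
import math

def f_double_exp(t, tau1=1.0, tau2=4.0):
    """Difference-of-exponentials kernel of eq. 2.11; zero for t < 0."""
    if t < 0:
        return 0.0
    return (math.exp(-t / tau2) - math.exp(-t / tau1)) / (tau2 - tau1)

def f_alpha(t, tau=3.0):
    """Alpha function of eq. 2.12, the tau1 -> tau2 limit of eq. 2.11."""
    return (t / tau**2) * math.exp(-t / tau) if t >= 0 else 0.0

# midpoint-rule quadrature over a horizon long compared to the decay time
dt, horizon = 0.001, 100.0
n = int(horizon / dt)
area_de = sum(f_double_exp((i + 0.5) * dt) for i in range(n)) * dt
area_al = sum(f_alpha((i + 0.5) * dt) for i in range(n)) * dt
```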
Note that $G_{ij}^{ab,\,syn}$ has the dimension of a current density and that excitatory (resp. inhibitory) interactions correspond to $G_{ij}^{ab,\,syn} > 0$ (resp. $G_{ij}^{ab,\,syn} < 0$). The synaptic current that a cell i of the ath population receives from the cells of the bth population is

$$I_{i}^{ab,\,syn}(t) = \sum_{j} I_{ij}^{ab,\,syn}(t),$$

(2.13)

where a = E, I, $i = 1, \ldots, N_a$, $j = 1, \ldots, N_b$, and $N_a$ is the number of neurons in the ath population. As a limiting case of a network with extensive connectivity, we assume full connectivity between the neurons, which simplifies considerably our analytical approach. For simplicity, we also assume that the strength of a synapse between two neurons depends only on the populations to which they belong. Therefore,

$$G_{ij}^{ab,\,syn} = G_{ab}^{syn}.$$
(2.14)
In the case of an extensively connected network, the average number of synaptic inputs per neuron varies proportionally to the size of the system, and the synaptic strengths are varied in inverse proportion to the system size, so that (see Hansel & Sompolinsky, 1996)

$$G_{ab}^{syn} = \frac{g_{ab}}{N_b},$$
(2.15)
where the normalized synaptic strengths, $g_{ab}$, are independent of the $N_a$'s. Our approach can be applied to the case where the kinetics of excitatory (resp. inhibitory) synapses are different when the postsynaptic neuron belongs to the excitatory or the inhibitory population. However, this entails
an increase in the number of parameters of the model. Therefore, in this article, we consider that the synaptic rise time and decay time depend only on the nature, excitatory or inhibitory, of the synapses, and not on the synapse target. The rise and decay times of the synapses are denoted by $\tau_{1b}$, $\tau_{2b}$, b = E, I. We neglect axonal propagation delays.

3 The Asynchronous State and Its Stability

3.1 The Distribution of Neuron Firing Rates in the Asynchronous State. In the asynchronous state, the synaptic current is the same for all neurons in a population and does not depend on time. It is given by

$$I_{ia}^{syn}(t) = \sum_{b=E,I} g_{ab}\,\nu_b \equiv I_a^{syn},$$

(3.1)
where $\nu_b$ is the average firing rate of the neurons in population b and $g_{ab}$ is a 2 × 2 matrix that characterizes the strengths of the synaptic interactions within and between the populations. Integrating the dynamical equations, equation 2.7, one calculates the firing rate of each of the neurons in the network. This yields

$$\nu_{ia} = \left[\int_{V_r}^{V_t} \frac{dV}{V^2 + I_{ia} + I_a^{syn} - I_a^{th}}\right]^{-1} = F\!\left(I_{ia} + \sum_{b=E,I} g_{ab}\,\nu_b - I_a^{th}\right),$$

(3.2)

where

$$F(I) = \frac{\sqrt{I}}{\arctan(V_t/\sqrt{I}) - \arctan(V_r/\sqrt{I})}$$
(3.3)
if I > 0 and F(I) = 0 otherwise. These equations, together with

$$\nu_a = \frac{1}{N_a}\sum_i \nu_{ia},$$

(3.4)

determine the distribution of firing rates in each population. Note that neurons with an external current smaller than $I_a^{min} = I_a^{th} - I_a^{syn}$ are below threshold and cannot fire. The distribution $P_a(\nu)$ of the firing rates of the neurons in population a in the asynchronous state can be related to the distribution of external currents via the input-output relation,

$$P_a(\nu) = Q_a\!\left(F^{-1}(\nu)\right)\left|\,(F^{-1})'(\nu)\,\right| + \delta(\nu)\int_{-\infty}^{I_a^{min}} dI\,Q_a(I),$$
(3.5)
Figure 3: (A) Gaussian distribution of external currents with standard deviation σ = 0.4. (B) Distribution of firing rates for the QIF model (parameters as in Figure 1) and the distribution of currents in A. The average firing rate is ν̄ = 30 Hz. The delta function at ν = 0 is not shown.
where the last contribution comes from the neurons that receive a total input too small to make them fire. The distribution of firing rates is plotted in Figure 3 for a gaussian distribution of currents:

$$Q_a(I) = \frac{1}{\sqrt{2\pi\sigma_a^2}}\exp\left(-\frac{(I - I_a)^2}{2\sigma_a^2}\right).$$
(3.6)
We emphasize that the distribution of firing rates can be kept constant even if the coupling strengths change. This can be done simply by adjusting the distribution of external currents with an offset that depends on the coupling and on the average firing rate of the population. In this way, one can separate the effect of the coupling strength and the effect of the average rate on the synchronization properties.
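For uncoupled neurons, the construction of section 3.1 is easy to reproduce: draw external currents from the gaussian of equation 3.6 and map them through the rate function F of equation 3.3. The sketch below works in reduced units; the mean and width of the gaussian are illustrative choices of ours, not values from the paper:

```python
import math, random

def F(I, vr=-0.626, vt=4.52):
    """Eq. 3.3 in reduced units: rate for suprathreshold input, zero otherwise."""
    if I <= 0:
        return 0.0
    s = math.sqrt(I)
    return s / (math.atan(vt / s) - math.atan(vr / s))

random.seed(1)
rates = [F(random.gauss(0.5, 0.4)) for _ in range(10000)]
# the fraction of silent neurons estimates the delta-function weight in eq. 3.5
frac_silent = sum(r == 0.0 for r in rates) / len(rates)
```

For a mean of 0.5 and width 0.4, the subthreshold mass is Φ(−1.25) ≈ 0.11, which the sampled `frac_silent` should approximate.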
3.2 Stability Analysis of the Asynchronous State. The stability of the AS in a homogeneous network of leaky integrate-and-fire neurons has been studied by Abbott and van Vreeswijk (1993). We have generalized their approach to the case of a heterogeneous network of QIF neurons (Hansel & Mato, 2001). For each of the active neurons in population a, the variable $y_{ia}$ can be defined as

$$y_{ia} = \nu_{ia}\int_{V_r}^{V_{ia}} \frac{dx}{x^2 + I_{ia} + \sum_b g_{ab}\,\nu_b},$$

(3.7)

where $\nu_{ia}$ and $\nu_a$ are the firing rate of neuron i and the average population firing rate in the AS, respectively, and $g_{ab}$ are the strengths of the synapses from population b to population a. This dynamical variable evolves in time according to

$$\frac{dy_{ia}}{dt} = \nu_{ia} + \sum_b g_{ab}\,K_b(y_{ia})\,\epsilon_b(t),$$

(3.8)

where

$$\epsilon_b(t) = \frac{1}{N_b}\sum_{j=1}^{N_b} \left(\nu_{jb}(t) - \nu_b\right)$$

(3.9)

is the deviation of the instantaneous firing rates of the neurons at time t from their value in the AS, averaged over population b, and

$$K_a(y_{ia}) = \frac{\nu_{ia}}{V_{ia}^2 + I_{ia} + \sum_b g_{ab}\,\nu_b}.$$
(3.10)
The probability density $\rho_a(y,\nu,t)$ satisfies the continuity equation,

$$\frac{\partial \rho_a(y,\nu,t)}{\partial t} = -\frac{\partial J_a(y,\nu,t)}{\partial y},$$

(3.11)

where the flux $J_a(y,\nu,t)$ is

$$J_a(y,\nu,t) = \left(\nu_a + \sum_b g_{ab}\,K_b(y)\,\epsilon_b(t)\right)\rho_a(y,\nu,t).$$

(3.12)
In the asynchronous state, $J_a(y,\nu,t) = \nu_a$. The deviations from this value, $j_a(y,\nu,t) = J_a(y,\nu,t) - \nu_a$, due to perturbations from the AS satisfy

$$\frac{\partial j_a(y,\nu,t)}{\partial t} = \sum_b g_{ab}\,K_b(y)\,\frac{d\epsilon_b(t)}{dt} - \nu_a\,\frac{\partial j_a(y,\nu,t)}{\partial y}.$$
(3.13)
For synaptic interactions modeled with a difference of exponentials (see equation 2.11), it is straightforward to show that equation 3.13 is equivalent to

$$\frac{d\epsilon_a(t)}{dt} = -\gamma_{1a}\,\epsilon_a(t) + h_a(t)$$

(3.14)

$$\frac{dh_a(t)}{dt} = -\gamma_{2a}\,h_a(t) + \gamma_{1a}\gamma_{2a}\sum_{\nu} j_a(1,\nu,t),$$

(3.15)
where $\gamma_{ka} = 1/\tau_{ka}$ for k = 1, 2. The perturbation $j_a$ has to be evaluated at y = 1 because this value corresponds to the crossing of the threshold, $V = V_t$ (see equation 3.7). The AS is stable if small perturbations from the AS eventually decay. Assuming that $j_a(y,\nu,t)$, $\epsilon_a(t)$, and $h_a(t)$ are proportional to $\exp(\lambda t)$ and integrating equation 3.13, one finds after a straightforward but tedious calculation that λ satisfies

$$\prod_{a=E,I}\left[\frac{(\lambda + \gamma_{1a})(\lambda + \gamma_{2a}) - \gamma_{1a}\gamma_{2a}\,g_{aa}\,U_a}{\gamma_{1a}\gamma_{2a}}\right] = g_{EI}\,g_{IE}\,U_I\,U_E,$$

(3.16)

where

$$U_a(\lambda) = \int_0^{\infty} d\nu\,\frac{H_a(\nu)}{1 - \exp(-\lambda/\nu)}$$
(3.17)
and

$$H_a(\nu) = \frac{P_a(\nu)\,\nu}{2F^{-1}(\nu)}\left[1 - e^{-\lambda/\nu} + \frac{\cos(A+\varphi) + \sin(A+\varphi)\,\frac{A\nu}{\lambda} - e^{-\lambda/\nu}\left(\cos\varphi + \sin\varphi\,\frac{A\nu}{\lambda}\right)}{1 + \left(\frac{A\nu}{\lambda}\right)^2}\right]$$

(3.18)

with

$$A = 2\sqrt{F^{-1}(\nu)}/\nu$$

(3.19)

$$\varphi = 2\arctan\left(V_r/\sqrt{F^{-1}(\nu)}\right).$$

(3.20)
The spectral equation, equation 3.16, determines the eigenvalues of the dynamical equations linearized around the AS. Note that this spectral equation also holds if a fraction of the neurons are subthreshold and do not fire in the AS. Indeed, an infinitesimal perturbation cannot make these neurons fire. Therefore, they do not contribute to the destabilization of the AS.
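The auxiliary variable $h_a$ in equations 3.14 and 3.15 is just a way of writing the double-exponential kernel of equation 2.11 as a pair of first-order equations: the impulse response of the two-variable filter is that kernel. A quick Euler check, with illustrative rates $\gamma_1 = 1$ and $\gamma_2 = 1/4$ of our choosing (in units of $1/t_0$):

```python
import math

g1, g2 = 1.0, 0.25          # gamma_1 = 1/tau_1, gamma_2 = 1/tau_2
dt, t_end = 1e-3, 2.0
e, h = 0.0, g1 * g2         # h(0) = g1*g2 encodes the delta-function input
for _ in range(int(round(t_end / dt))):
    e, h = e + dt * (-g1 * e + h), h + dt * (-g2 * h)

# analytic impulse response: the kernel of eq. 2.11 with tau_k = 1/g_k
kernel = g1 * g2 * (math.exp(-g2 * t_end) - math.exp(-g1 * t_end)) / (g1 - g2)
```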
3.3 The Spectral Equation at the Onset of an Instability. The AS is stable if all the eigenvalues, the solutions to equation 3.16, have a negative real part. Therefore, continuous onset of instabilities occurs when at least one of the eigenvalues crosses the imaginary axis as some parameter is changed. At this onset, λ = iμ, and singularities appear in the integral in equation 3.17. These singularities can be isolated, and one finds

$$U_a(i\mu) = \frac{1}{2}A_{1a} + \frac{\pi}{\mu}A_{2a} - i\,\frac{\mu}{2}\,A_{3a},$$

(3.21)

where

$$A_{1a} = \int_0^{\infty} d\nu\,H_a(\nu)$$

(3.22)

$$A_{2a} = \sum_{n=1}^{\infty} H_a\!\left(\frac{\mu}{2\pi n}\right)\left(\frac{\mu}{2\pi n}\right)^2$$

(3.23)

$$A_{3a} = \sum_{n=1}^{\infty} \int_{-\pi}^{\pi} dy\,\cot\!\left(\frac{y}{2}\right)\frac{H_a\!\left(\frac{\mu}{y + 2n\pi}\right)}{(y + 2n\pi)^2} + \int_0^{\pi} dy\,\cot\!\left(\frac{y}{2}\right)\frac{H_a\!\left(\frac{\mu}{y}\right)}{y^2}.$$

(3.24)

These integrals are nonsingular. In general, they have to be evaluated numerically. The onset of instabilities of the AS is determined by taking the real and imaginary parts of equation 3.16. The simultaneous solution of these equations determines μ and the coupling at the onset of instabilities as a function of the other parameters of the model. At instability onset, the synaptic currents in the unstable mode oscillate with a frequency given by the imaginary part of the eigenvalue, μ. The dephasing δ between the oscillations of the two populations is given by
$$\exp(i\delta)\,\frac{\epsilon_I}{\epsilon_E} = \frac{g_{EE}\,U_E - (\lambda/\gamma_{1E} + 1)(\lambda/\gamma_{2E} + 1)}{|g_{EI}|\,U_I}.$$
(3.25)
A positive (resp. negative) value of the phase lag δ means that the oscillation of the inhibitory population is in advance of (resp. delayed with respect to) the excitatory population. The solutions of equation 3.16 can be classified according to the temporal properties of the corresponding unstable modes of instability. If μ ≠ 0, the instability occurs through a Hopf bifurcation (Strogatz, 2000). This can happen in two ways. In the first, called a supercritical Hopf bifurcation, the AS loses stability, and simultaneously stable oscillations of the network activity appear with vanishingly small amplitude and finite frequency. The latter
is equal to μ/2π. The phase lag between the inhibitory and the excitatory population activity is δ (see equation 3.25). In the second, called a subcritical Hopf bifurcation, the AS loses its stability by coalescing with an unstable state in which the activity of the network oscillates. This is a necessary (but not sufficient) condition for having a region in parameter space where the stable AS coexists with two states of synchronous oscillations, one stable and one unstable. The stable oscillatory state persists in some regions where the AS is unstable and the unstable oscillatory state no longer exists. The values of μ and δ determined by the stability analysis now correspond to the frequency and the phase lag between the two populations in the unstable oscillatory state at the bifurcation. However, if the amplitude of the stable oscillations at the bifurcation is small, these values should also provide a good estimate of the frequency and the phase lag in the stable oscillatory state near the bifurcation. This point needs to be verified with numerical simulations (see section 6). Finally, if μ = 0, there is a saddle-node (SN) bifurcation (Strogatz, 2000). To study how the stability of the asynchronous state depends on the synaptic properties, we construct phase diagrams for a fixed distribution of the firing rates of the two populations, $P_E(\nu)$ and $P_I(\nu)$, taking as parameters the strengths of the four interactions. This implies that when the interaction strengths are changed, the external average inputs (or the firing thresholds) have to be modified accordingly to satisfy equation 3.2 for the two populations. Because of the normalization of the synaptic interactions, changing the synaptic time constants does not affect the firing-rate distribution in the AS. To study their role in the emergence of synchrony, the synaptic time constants can therefore be varied while all the other parameters are kept constant.
4 Synchrony in a Two-Population Model: The Symmetric Case

A detailed analysis of the conditions of AS stability as a function of synaptic coupling strength, synaptic kinetics, and distributions of firing rates is a formidable task. In this section, we focus on a situation where $\tau_{1E} = \tau_{1I} = \tau_1$, $\tau_{2E} = \tau_{2I} = \tau_2$, and $P_E(\nu) = P_I(\nu)$. We term this situation the symmetric case. It turns out that to a large extent, this case can be investigated analytically and that the results thus obtained are highly instructive for understanding the general properties of the two-population network.

4.1 The Instabilities of the Asynchronous State. In the symmetric case, $U_a(\lambda) = U(\lambda)$ for all a, where $U(\lambda)$ is given by equation 3.17. Therefore equation 3.16 becomes

$$(i\mu + \gamma_1)^2(i\mu + \gamma_2)^2 - \gamma_1\gamma_2\,(i\mu + \gamma_1)(i\mu + \gamma_2)\,U\,T + \gamma_1^2\gamma_2^2\,U^2\,\Delta = 0,$$
(4.1)
where U is a function of μ, $T = g_{EE} + g_{II}$, and $\Delta = g_{EE}\,g_{II} - g_{EI}\,g_{IE}$. The latter parameters are the trace and the determinant of the matrix $g_{ab}$, respectively.
Figure 4: Phase diagram of a two-population QIF network in the symmetric case. The determinant and the trace of the coupling matrix are $\Delta = g_{EE}\,g_{II} - g_{EI}\,g_{IE}$ and $T = g_{EE} + g_{II}$, respectively. The standard deviation of the current distribution is $\sigma_E = \sigma_I = 0.1$. The average firing rates in the AS are $\nu_E = \nu_I = 50$ Hz. The synaptic time constants are $\tau_{1E} = \tau_{1I} = 1$ msec, $\tau_{2E} = \tau_{2I} = 4$ msec. In the gray region, the AS is stable. See the text for an explanation of the different lines. The dotted line is the limit of L4 in the limit of infinite synaptic time constants (see section 5.3).
Therefore, one can completely describe the role of the synaptic strengths in a two-dimensional plane spanned by these two effective parameters. Let us consider the point $X_0 = (\Delta = 0, T = 0)$. A particular realization of $X_0$ is obtained in the absence of coupling. Therefore, at $X_0$, the AS is marginally stable if the system is homogeneous, that is, for $\sigma_E = \sigma_I = \sigma = 0$. If one introduces any level of heterogeneity, the AS becomes stable in some region around this point. The size of this region depends on σ (or, equivalently, on the dispersion of the firing rates in the AS). This result holds independently of the parameters and of the firing-rate distribution, provided the symmetry is preserved. It shows that there is always some domain in the (Δ, T) plane where the AS is stable. What are the boundaries of this region, and what are the instabilities on these boundaries? An example of the network phase diagram is shown in Figure 4. It was obtained by solving equation 4.1 for synaptic rise and decay times of 1 msec and 4 msec, respectively, an average firing rate of the two populations of ν = 50 Hz, and a standard deviation of the external input distribution of σ = 0.1. The distribution of firing rates for these parameters is shown in Figure 3. The region of stability of the AS is bounded by four lines.
Line L1 is deduced from equation 4.1 by taking μ = 0. It is straightforward to see that the equation of this line is given by

$$T = U(0)\,\Delta + \frac{1}{U(0)}$$
(4.2)
with

$$U(0) = \int_0^{\infty} d\nu\,\frac{P(\nu)\,\nu}{2F^{-1}(\nu)}\left[\frac{\sin(A+\varphi) - \sin(\varphi)}{A} + 1\right].$$
(4.3)
For the parameters of Figure 4, U(0) = 0.36. On this line, an SN bifurcation occurs. Above the line, the AS with $\nu_E = \nu_I = 50$ Hz is unstable through an SN bifurcation. At the instability onset, the unstable mode grows exponentially but does not oscillate. This corresponds to a change in the average firing rates of the two populations, $\nu_E$ and $\nu_I$, but not to the emergence of synchrony in the network.

Before discussing the general meaning of the transition line L2, we consider two cases:

1. $P \equiv g_{EI}\,g_{IE} = 0$, $g_{II} = 0$, $g_{EE} > 0$. The two populations are noninteracting, and the inhibitory neurons are decoupled. For such parameters, the states of the network are represented in the (Δ, T) plane by the line segment Δ = 0, T > 0. There is one instability with a real eigenvalue for the unstable mode on this segment, at the point $X_E = (0, 1/U(0))$, which belongs to L1. All other instabilities on this segment occur in the region above L1 (not shown in Figure 4). Therefore, for the parameters of the figure, the excitatory population cannot develop synchrony by itself.

2. $P = g_{EI}\,g_{IE} = 0$, $g_{EE} = 0$, $g_{II} < 0$. As in the previous case, the two populations are noninteracting, but now the excitatory neurons, rather than the inhibitory ones, are decoupled. This situation is mapped in the phase diagram onto the segment Δ = 0, T < 0. An instability with a pure imaginary eigenvalue $\lambda = i\mu_c$ occurs on this segment at the point $X_I = (0, g_c)$, where $g_c$ and $\mu_c$ are solutions to the complex equation

$$(i\mu_c + \gamma_1)(i\mu_c + \gamma_2) = \gamma_1\gamma_2\,g_c\,U(i\mu_c).$$
(4.4)
Solving this equation yields $g_c = -0.98$, $\mu_c = 3.02$, that is, $\mu_c/(2\pi\nu_I) \approx 0.96$. Therefore, at $X_I$, the mutual inhibitory I-I interactions become strong enough to destabilize the AS through a Hopf bifurcation at which synchrony emerges on the timescale of the average spiking period. Other solutions of equation 4.4 exist, but they are not relevant since they occur in the region below $X_I$, where the AS is already unstable. This mechanism, in which synchrony of neural activity emerges in one population of neurons, has been studied by several authors
(van Vreeswijk, 1996; Wang & Buzsáki, 1996; White, Chow, Ritt, Soto-Treviño, & Kopell, 1998; Neltner, Hansel, Mato, & Meunier, 2000; Golomb & Hansel, 2000; Golomb, Hansel, & Mato, 2001).

We now consider the general case, $P = g_{EI}\,g_{IE} \neq 0$, in which the two populations interact. It is straightforward to see that if T and Δ satisfy the relationship

$$T = \frac{\Delta}{g_c} + g_c,$$
(4.5)
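With the numbers quoted in the text (U(0) = 0.36 from equation 4.3 and $g_c = -0.98$ from equation 4.4), the intersection of the saddle-node line L1 (equation 4.2) and the Hopf line L2 (equation 4.5) can be solved for directly; it should land at the point A discussed below:

```python
# L1: T = U0*Delta + 1/U0 ; L2: T = Delta/gc + gc (eqs. 4.2 and 4.5)
U0, gc = 0.36, -0.98
Delta_A = (gc - 1.0 / U0) / (U0 - 1.0 / gc)
T_A = U0 * Delta_A + 1.0 / U0
```

This gives $(\Delta_A, T_A) \approx (-2.72, 1.80)$, matching the coordinates of the Gavrielov-Guckenheimer point A.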
then equation 4.1 is satisfied with $\mu = \mu_c$. Equation 4.5 defines a straight line (L2) in the (Δ, T) plane on which a Hopf bifurcation occurs. Obviously, $X_I$ belongs to this line. The intersection of L1 and L2 defines a point, A ≈ (−2.72, 1.8), where the real parts of two eigenvalues vanish. One corresponds to an SN bifurcation and the other to a Hopf bifurcation. At that point, the AS becomes unstable through a codimension 2 bifurcation called a Gavrielov-Guckenheimer bifurcation (Kuznetsov, 1998). The points on L1 and L2 to the left of A are not relevant as instability onsets, since they lie inside a region where the AS is already unstable. The pattern of synchrony into which population a (a = E, I) settles depends in general on the ratio

$$R_a = \frac{\mu}{2\pi\nu_a}.$$
(4.6)
Here $R_a$ is close to one, and neurons fire on average about one spike per period in the unstable mode. However, the phase relationship between the spikes and the collective oscillations can change from cycle to cycle. Consequently, the phase distribution is broad but unimodal. It can easily be shown that on L2,

$$\frac{\epsilon_I}{\epsilon_E} = \frac{g_{EE} - g_c}{|g_{EI}|}.$$
(4.7)
Therefore, on this line, the phase lag is constant, δ = 0; that is, at the instability onset, the activities of the two populations oscillate in phase. The point B on L2, defined by $\Delta = g_c^2 \equiv \Delta_B$, $T = 2g_c \equiv T_B$, is a multicritical point where two eigenvalues have vanishing real parts. This can be shown by expanding equation 3.16 in the vicinity of B. One finds two other critical lines, L3 and L4, tangent to L2 at B. Sufficiently close to B, these lines are given by the equation

$$T - T_B = b_1\,(\Delta - \Delta_B) + b_2\,(\Delta - \Delta_B)^2,$$
(4.8)
where

$$b_1 = \frac{1}{g_c}$$

(4.9)

$$b_2 = -\frac{|W'(\mu_c)|^2}{4g_c^3\left[\Re e\,W'(\mu_c)\right]^2}.$$

(4.10)
Here we have defined $W(\mu) = \gamma_1\gamma_2\,U(i\mu)/[(i\mu + \gamma_1)(i\mu + \gamma_2)]$, and $W'(\mu)$ is the derivative of W with respect to μ. For the parameters of Figure 4, $b_1 = -1.02$ and $b_2 = 47.9$. Hence the expansion, equation 4.8, truncated at second order, is a good approximation to L3 and L4 only very close to B. This is why B looks like a cusp in Figure 4. Beyond B, L2 is no longer relevant for the instability of the AS. We conclude that on the segment AB, the AS undergoes an instability to spike-to-spike synchrony. This instability, driven by the I-I interactions, is the extension to the two-population framework of the instability that occurs in a single population of inhibitory neurons. The interaction between the two populations makes this mechanism less efficient. Indeed, from equation 4.5, one sees that when the feedback between the two populations increases, a stronger mutual inhibitory coupling, $g_{II}$, is required for the AS to be destabilized on L2.

The point D is defined as the intersection of L4 and L1. At that point, the AS becomes unstable through a codimension 2 bifurcation called a Takens-Bogdanov bifurcation (Kuznetsov, 1998). The coordinates of D can also be obtained by expanding equation 3.16 in the limit μ → 0. One finds that at D, $\Delta = 1/U(0)^2$, $T = 2/U(0)$. The frequency of the unstable mode, μ, varies continuously on L3 and L4, as shown in Figure 5. On L4, μ decreases continuously when Δ increases, from $\mu = \mu_c = 3.02$ at B to μ = 0 at D. On L3, μ increases slowly when Δ increases. In the limit of large Δ, μ converges asymptotically to a finite value, $\mu_\infty$, and L3 is asymptotic to the straight line defined by $T = \Delta/g_\infty + g_c$, where $g_\infty$ and $\mu_\infty$ satisfy the equation

$$(i\mu_\infty + \gamma_1)(i\mu_\infty + \gamma_2) = \gamma_1\gamma_2\,g_\infty\,U(i\mu_\infty).$$
(4.11)
Note that this equation is identical to the spectral equation describing the stability of the AS in one population of neurons. For the parameters of Figure 4, μ_∞ ≈ 3.82 and g_∞ ≈ 6.5. The ratio between the frequency of the unstable mode and the average firing rate of the neurons, R ≡ R_E = R_I (see equation 4.6), is plotted in Figure 5. This ratio is larger than 1 on L3. This means that the unstable mode oscillates faster than the average frequency of the neurons. However, R increases by less than 25% on L3. Therefore, the synchrony that emerges is on the timescale of the spikes. In contrast, on L4, R varies from R ≈ 1 near B to R = 0 in the vicinity of D. This means that along L4, the pattern
Figure 5: The ratio R = μ/(2πν) versus the determinant of the coupling matrix, Δ, on lines L3 (dashed line) and L4 (dashed-dotted line) shown in Figure 4. Points B and D correspond to B and D shown in Figure 4.
of synchrony changes continuously from spike-to-spike synchrony to burst synchrony, in which the neurons have the time to fire several spikes on average during one period of oscillation of the unstable mode. The phase lag, δ, on L3 and L4 can be expressed analytically as a function of Δ, T, and g_EE. From equation 4.1 one finds that

(iμ + γ_1)(iμ + γ_2) = (γ_1 γ_2 / 2) U(iμ) [T ± √(T^2 - 4Δ)].  (4.12)
Analyzing the numerical solutions of the spectral equation, one finds that the relevant sign in equation 4.12 is negative on L4 and positive on L3. Substitution in equation 3.25 gives for the phase lag the equation

δ = ± arctan[ Im √(T^2 - 4Δ) / (g_EE - g_II) ],  (4.13)

where the positive (resp. negative) sign is for the phase lag on L4 (resp. on L3). The condition T^2 - 4Δ = 0 defines a line in the phase diagram to which B, D, and the point T = Δ = 0 belong. Therefore, T^2 - 4Δ < 0 on L3 and L4, except at B and D, where T^2 - 4Δ = 0 (see the definition of points B and D). Since g_EE - g_II > 0,

cos δ > 0.  (4.14)
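As a small numerical illustration, the phase-lag formula of equation 4.13 (as reconstructed here, δ = ±arctan[√(4Δ - T^2)/(g_EE - g_II)] when T^2 - 4Δ < 0) can be evaluated directly. The parameter values below are arbitrary illustrative choices, not values taken from the paper.

```python
import math

def phase_lag(T, Delta, gEE, gII, line="L4"):
    """Phase lag between the two populations at a Hopf instability
    (equation 4.13): delta = +/- arctan(Im sqrt(T^2 - 4*Delta) / (gEE - gII)).
    Sign convention: '+' on L4, '-' on L3. Requires T^2 - 4*Delta < 0."""
    disc = T * T - 4.0 * Delta
    if disc >= 0.0:
        raise ValueError("equation 4.13 applies only for T^2 - 4*Delta < 0")
    im_root = math.sqrt(-disc)          # Im(sqrt(T^2 - 4 Delta))
    sign = 1.0 if line == "L4" else -1.0
    return sign * math.atan(im_root / (gEE - gII))

# Illustrative values (not from the paper): gEE = 1, gII = -2, T = -1, Delta = 3.
d4 = phase_lag(-1.0, 3.0, 1.0, -2.0, line="L4")
d3 = phase_lag(-1.0, 3.0, 1.0, -2.0, line="L3")
print(d4, d3)  # opposite signs, both smaller than pi/2 in magnitude
```

Consistent with equation 4.14, the result satisfies |δ| < π/2 whenever g_EE - g_II > 0, so the two populations can never oscillate in antiphase at onset.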
The phase lag, δ, is a function of the three variables g_EE, g_II, and g_EI g_IE, and not only of the reduced variables T and Δ. In contrast to what happens on L2,
Figure 6: Phase lag of the population oscillations. Solid lines were obtained from equation 3.25. The squares are results from numerical simulations of a two-population network with N_E = N_I = 1600. The positive (resp. negative) part of the x-axis corresponds to variation of g_EE with g_II = 0 (resp. g_II with g_EE = 0). (A) In the vicinity of L3. The simulations were done keeping a fixed distance equal to 1 from L3 and with g_IE = -g_EI. The standard deviation of the current distribution is σ = 0.1. Other parameters are as in Figure 4. (B) In the vicinity of L4. The simulations were done keeping a fixed distance equal to 2 from L4 and with g_IE = -g_EI. The standard deviation of the current distribution is σ = 0.4. Other parameters are as in Figure 4.
on L3 and L4, it changes continuously and nonmonotonically. It vanishes at B and D. This is shown, for example, in Figure 6A, where δ is plotted against g_EE (with g_II = 0) along the two line segments L3 and L4, and against g_II (with g_EE = 0) for the same two lines. Equation 4.14 implies that in general, |δ| < π/2. In particular, the activity of the two populations can never be in antiphase. From equation 4.13, one
Figure 7: Phase diagram of a two-population QIF network in the symmetric case. The determinant and the trace of the coupling matrix are Δ = g_EE g_II - g_EI g_IE and T = g_EE + g_II, respectively. The standard deviation of the current distribution is σ_E = σ_I = 0.4. Other parameters are as in Figure 4. In the gray region, the AS is stable. See the text for an explanation of the different lines. The dotted line is the limit of L4 for very large synaptic time constants (see section 5.3).
sees that the maximum phase lag at the onset of instability, |δ| = π/2, is achieved only for g_EE = g_II = 0 (δ = -π/2 on L3 and δ = π/2 on L4). Finally, on line L3, in the limit of very large Δ and T, the phase lag returns to 0, as can be seen from equation 3.25 using equation 4.11. These results are also confirmed in the example of Figure 6B. The two line segments, L3 and L4, intersect at point C, where Δ(C) ≈ 7.9 and T(C) ≈ 4.3. Note that μ has different values on the two lines at C, as can be seen from Figure 5. Therefore, at C, the AS is unstable through a double-Hopf bifurcation (Kuznetsov, 1998). Whether L3 or L4 is the border of stability of the AS depends on T. This is shown in Figure 4: for T > T(C), the AS loses stability on L4, whereas for T < T(C), this occurs on L3.

4.2 The Effect of Heterogeneities. We now consider how the phase diagram for the stability of the AS is modified when the level of heterogeneities varies. Figure 7 shows the phase diagram in the symmetric case for σ_E = σ_I = 0.4, with all other parameters the same as in Figure 4. Comparing these two figures, one sees that in spite of a significant increase in the level of heterogeneities, the transition line (L1) has moved only slightly. In contrast, L2 has substantially moved toward more negative values of T and Δ. This can be easily understood: to compensate for the increased desynchronizing effect of the heterogeneities, stronger synapses are required to destabilize the AS. Line segment L4 is barely modified in the neighborhood of D, its crossing point with L1. A stronger effect occurs in the vicinity of
Figure 8: The ratio R = μ/(2πν) versus the trace of the coupling matrix, T, on line L4 shown in Figure 7. Points B and D correspond to B and D shown in Figure 7.
B, which moves toward more negative values of T and Δ. Most of the changes in line segment L4 when σ increases can be explained by this stronger effect. A more dramatic change occurs for the line segment L3. Indeed, when the heterogeneities are strong enough, L3 moves completely to the right of L4. Moreover, on L4, T increases very slowly with Δ. Therefore, all of L3 becomes located well inside the region where the AS is already unstable, and this instability scenario is now irrelevant. As in Figure 4, along line L4 the frequency of the unstable mode depends continuously on the location. However, in contrast to that case, L4 is relevant as an instability line all the way between points B and C. In Figure 8, we plot the ratio of the frequency of the emerging oscillations to the average firing rate, showing the continuous and monotonic variation of this quantity along L4, from a value slightly larger than 1 at B, corresponding to spike-to-spike synchrony, to 0 at D, in the vicinity of which the network fires in synchronous bursts. To characterize further how the instabilities of the AS depend on the level of heterogeneities, we computed the value of Δ on the four lines L1, L2, L3, L4 as functions of the heterogeneity level, σ, for fixed T, with all other parameters constant. The results are shown in Figure 9. As already noted from the comparison of Figures 4 and 7, the location of L1 depends only very weakly on the level of the heterogeneities. In contrast, line L2 is much more sensitive to this parameter. In Figure 9B, the line segment L3 is clearly much more sensitive to heterogeneities than line segment L4. Indeed, the coupling strength required on the former line increases much more rapidly with σ than on the latter. A consequence of this difference in sensitivity to heterogeneities is that, as already noted, the two line segments
Figure 9: Determinant of the coupling matrix, Δ, versus the standard deviation, σ, on the bifurcation lines L1, L2, L3, L4 for T = 1. The synaptic time constants are as in Figure 4. (A) L1 (solid line) and L2 (long-dashed line). (B) L3 (dashed line) and L4 (dashed-dotted line).
L3 and L4 interchange their positions in the (Δ, T) phase diagram when σ is large enough. This happens at σ ≈ 0.15. For σ larger than this value, line L3 is located beyond the instability line L4, in a region where the AS is already unstable. Therefore, in that case, only one instability scenario, driven by the E-I-E interaction loop, is relevant.

4.3 The Effect of the Firing Rate. For a fixed architecture and coupling strength, the average firing rate of the network depends on the external input. To investigate how the stability of the AS depends on this input, we studied the solutions of the spectral equation as a function of the average firing rate, keeping the variance of the threshold distribution constant, σ = 0.1. Figure 10A shows how the minimal coupling strength g_c required for the destabilization of the AS in one inhibitory population, determined by equa-
Figure 10: (A) The synaptic coupling, g_II, at which the AS becomes unstable in one inhibitory population versus the average firing rate of the neurons. The standard deviation of the input distribution is σ = 0.1. The synaptic time constants are as in Figure 4. (B) The determinant, Δ, on the bifurcation lines L3 (dashed line) and L4 (dashed-dotted line) versus the average firing rate of the neurons in a two-population network (symmetric case) for T = 1. Other parameters are as in Figure 4.
tion 4.4, varies with the average firing rate. One finds that g_c is a nonmonotonic function of the average firing rate. The critical coupling is minimal (g_c = -0.98) for a firing rate of ν = 50 spikes per second. Note, however, that the extremum of the curve is very shallow: already for g_c = -1.5, the AS is unstable in a broad range of average firing rates. Figure 10B plots Δ_c, the value of Δ at the instability onset on L3 and L4, for T = 1. This figure shows that on L3, Δ_c decreases monotonically with increasing firing rate. Moreover, this instability disappears for ν < ν_min = 20 spikes per second, at which value Δ_c diverges. In contrast, on L4, Δ_c is a nonmonotonic function of the average firing rate and exists at all values of the firing rate. The two lines intersect at ν ≈ 40 spikes per second. Below
this value of the firing rate, the relevant instability of the AS is on L3, whereas above this value, it is on L4.

4.4 Summary of the Results in the Symmetric Case. Our analysis of the symmetric case reveals that in a two-population heterogeneous network, the AS can lose stability in four ways. Two of these are not specific to two-population networks and are extensions, in this framework, of instabilities that already exist for one population: the rate instability, which occurs on L1, and the spike-to-spike synchrony instability, which occurs on L2. At the onset of the former type of instability, which depends only weakly on heterogeneities, no synchrony occurs. In the latter case, which is sensitive to heterogeneities, the activities in the two populations at the onset of the instability are synchronized on the timescale of the spikes and oscillate in phase. The two other types of instability, which occur on L3 and L4, are specific to two-population networks. On L3, the activities in the two populations oscillate at a frequency on the order of the average firing rate, and the inhibitory population oscillates ahead of the excitatory population. On L4, the frequency of the population activity oscillations can become very slow. This happens if both the recurrent excitation and the recurrent inhibition (the feedback between the two populations) are strong. On L4, the oscillation of the excitatory population is ahead of the oscillation of the inhibitory population. Finally, L3 and L4 differ in the way they depend on the level of heterogeneities: L3 becomes irrelevant if the heterogeneities are large or the average firing rates are small.
5 Synchrony in Two-Population Networks: The General Case

If the distributions of the firing rates of the two populations are different in the AS, or the synaptic time constants of the excitation and inhibition are not the same, the stability of the AS (at given firing-rate distributions) depends on the three synaptic coupling parameters g_EE, g_II, g_EI g_IE and also on the synaptic time constants.

5.1 The Effect of the Synaptic Coupling Strength. Figure 11A displays the phase diagram of the AS in the plane (-g_EI g_IE, g_EE) when g_II = 0. The synaptic time constants are τ_1E = 1 msec, τ_2E = 3 msec, τ_1I = 1 msec, and τ_2I = 6 msec. The average firing rate of the excitatory (resp. inhibitory) population is 20 Hz (resp. 40 Hz), and σ_E = σ_I = 0.2 for both populations. As in the symmetric case, if the recurrent excitation is strong enough, the SN instability occurs. In the plane (-g_EI g_IE, g_EE), this instability occurs on a straight line, the equation of which is
g_EE = 1/U_E(0) - g_EI g_IE U_I(0),  (5.1)
Figure 11: Phase diagram in the asymmetric case. The average firing rate of the excitatory (resp. inhibitory) population is ν_E = 20 Hz (resp. ν_I = 40 Hz). The synaptic time constants are τ_1E = τ_1I = 1 msec, τ_2E = 3 msec, τ_2I = 6 msec. The standard deviations of the input distributions are σ_E = σ_I = 0.2. The AS is stable in the gray region. In this figure, only the physical branches of the instability lines are drawn. In particular, L3, which branches off at B, is not drawn. (A) g_II = 0. A saddle-node bifurcation occurs on L1 (solid line) and a Hopf bifurcation on L4 (dashed-dotted line). (B) g_EE = 0. Hopf bifurcations occur on L2 (long-dashed line) and L4 (dashed-dotted line). The difference between these two bifurcation lines is explained in the text. At E, the instability of the AS is due to the interactions between the two populations, since g_EE = g_II = 0 at that point.
where U_a(0), a = E, I, is given by equation 4.3, replacing the distribution P by P_a, the firing-rate distribution in population a. For the parameters of the figure, one finds U_E(0) = 0.53 and U_I(0) = 0.4. When g_EE increases, this instability can be prevented by increasing the negative feedback between the two populations, measured by g_EI g_IE < 0. When this feedback is strong enough, the AS becomes unstable through a Hopf bifurcation, which occurs on the line to the right of the phase diagram. On this line, the frequency of the population rhythm varies continuously from μ ≈ 1.25 at its intersection with the x-axis, E (g_EE(E) = 0), to μ = 0 at D. The value of μ at E corresponds to a pattern in which the inhibitory (resp. excitatory) neurons fire approximately two spikes (resp. one spike) per period of the population oscillation. Near D, it corresponds to a pattern in which neurons tend to fire synchronous bursts with many spikes. The properties of the instability on this line are similar to those found on line L4 in the phase diagrams of the symmetric case. Because of these similarities, we have denoted this line by L4 here as well. Figure 11B displays the phase diagram of the model in the plane (-g_EI g_IE, g_II) when g_EE = 0. The other parameters are the same as in Figure 11A. The region of stability of the AS is bounded by two lines. One, L2, corresponds to the destabilization of the AS when g_II is strong enough. Like line L2 in Figures 4 and 7, this line corresponds to the emergence of synchrony through the I-I interaction. In contrast to the symmetric case, here μ/2π and δ are not constant on L2. However, μ/2π always remains just slightly smaller, by less than 4%, than the average firing rate of the inhibitory population, and δ is always smaller than 15% of the total period. Another difference from the symmetric case is that L2 is no longer straight. However, it deviates only slightly from a straight line.
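Using the values U_E(0) = 0.53 and U_I(0) = 0.4 quoted above, the saddle-node line of equation 5.1 (as reconstructed here) is straightforward to evaluate. The sketch below only illustrates how the critical recurrent excitation grows with the negative feedback -g_EI g_IE; the feedback value used is an arbitrary illustrative choice.

```python
def gEE_saddle_node(UE0, UI0, gEI_gIE):
    """Critical recurrent excitation on line L1 for gII = 0 (equation 5.1):
    gEE = 1/U_E(0) - gEI*gIE * U_I(0). The product gEI*gIE is negative."""
    return 1.0 / UE0 - gEI_gIE * UI0

UE0, UI0 = 0.53, 0.4                     # values quoted in the text for Figure 11
g0 = gEE_saddle_node(UE0, UI0, 0.0)      # no feedback between the populations
g1 = gEE_saddle_node(UE0, UI0, -10.0)    # with feedback gEI*gIE = -10 (illustrative)
print(g0, g1)  # the critical gEE increases when the negative feedback is stronger
```

This reproduces the qualitative statement in the text: strengthening the negative feedback loop raises the amount of recurrent excitation the network can sustain before the saddle-node (rate) instability occurs.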
Similarly to the symmetric case, the slope of L2 is negative. This means that the feedback between the two populations works against synchrony through the I-I coupling: a stronger I-I coupling is required to compensate for this feedback. For strong enough feedback, g_EI g_IE, the AS can lose stability even if g_II is small. This transition occurs on the line to the right in Figure 11B. On this line, the population frequency varies significantly. Indeed, at B, μ/2π ≈ 40 Hz, and it is two times smaller at point E. Embedded into the three-dimensional space (g_EE, g_II, g_EI g_IE), this line connects to L4 from Figure 11A. We also denote this line by L4, since the instabilities occurring on these two lines can be thought of as corresponding to a similar synchrony mechanism. Note that the slope of L4 in Figure 11B is negative. In other words, increasing g_II works against the destabilization of the AS through this mechanism.

5.2 The Effect of Synaptic Time Constants. In this section, we analyze the effect of synaptic kinetics on the stability of the AS for fixed distributions of firing rates. First, we focus on the effect of the decay times.
Figure 12: Phase diagrams in the plane (γ_2I ≡ 1/τ_2I, γ_2E ≡ 1/τ_2E) for two different strengths of the I-I interactions. The average firing rates of the excitatory and inhibitory populations are ν_E = 20 Hz and ν_I = 30 Hz, respectively. (A) g_II = 0. (B) g_II = -7.4. Other parameters are τ_1E = τ_1I = 1 msec, g_EE = 6, g_EI g_IE = -36, and σ_E = σ_I = 0.2. In the gray region, the AS is stable. On L2 (long-dashed line) and L4 (dashed-dotted line), Hopf bifurcations occur. Along L4, the frequency of the unstable mode, μ, decreases with γ_2I. For instance, in A, μ/2π varies from 11 Hz for γ_2I = 0.3 msec^-1 to 0 for γ_2I = 0. Along L2, μ remains finite, and its variation is smaller.
Figure 12 displays the phase diagram in the (γ_2I, γ_2E) plane for g_EE = 6, g_EI g_IE = -36, and two different values of g_II. We consider first the case g_II = 0. In the limit τ_2E → ∞, the AS is always stable. This is because in that limit, the excitatory input to the inhibitory population is not modulated in time; it is therefore equivalent to a positive constant current that excites the inhibitory population. The only way synchrony can occur in this case is through the I-I interactions, which we have assumed to be zero. For finite τ_2E, the AS is unstable provided τ_2I is sufficiently large. This defines a transition line γ_2E = f(γ_2I), where f is an increasing function that vanishes linearly as γ_2I → 0. On this line, which is plotted in Figure 12A, μ is a decreasing continuous
function of γ_2I. It goes to 0 in the limit γ_2I → 0. Since the decay of the inhibition cannot be faster than its rise time (see equation 2.11), μ cannot be larger than μ = 0.79, its value for τ_2I = 1 msec, on the transition line in Figure 12A. Therefore, all along the line, the transition leads to a synchronous bursting state. The bursts are short for fast inhibition and long when the inhibition is slow. In Figure 12A, the AS stabilizes for fast inhibition and slow enough excitation. In particular, the fastest excitation compatible with a stable AS (attained for very fast inhibition, with τ_1I = τ_2I = 1 msec) is τ_2E ≈ 50 msec, in the range of NMDA synaptic time constants. The phase diagram in the (γ_2I, γ_2E) plane remains qualitatively the same as in Figure 12A provided that g_II is sufficiently small. If g_II is large enough, a second instability line appears in the region of the phase diagram corresponding to fast inhibition and slow excitation. This is because for large enough g_II, the I-I interaction can induce spike-to-spike synchrony in the inhibitory population. This situation is shown in Figure 12B. Note that the transition line found for g_II = 0 (see Figure 12A) is still present but has been shifted to much higher values of γ_2E. In contrast to what happens for small values of g_II, when the inhibition is fast (τ_2I < 9 msec), the AS is destabilized for slow enough excitation. For instance, for τ_2I = 6 msec, this happens for τ_2E larger than 20 msec. Note that for the parameters of Figure 12B, the AS is stable if the excitatory and inhibitory synapses are in ranges corresponding to AMPA and GABA_A synapses, respectively.

5.3 The Limit of Slow Synapses. The spectral equation, equation 3.16, can be simplified if one assumes that the synapses have infinitely long decay time constants. We assume that γ_2E = η γ̄_2E and γ_2I = η γ̄_2I, where η ≪ 1 and γ̄_2E, γ̄_2I are finite, and we make the ansatz

λ = η λ̄.
(5.2)
Substituting in equation 3.16, expanding at leading order in η, and using the fact that ∫_0^∞ dy K(y) = F'(F^-1(ν)), one finds that at the onset of an instability λ̄ = iμ̄, with μ̄ a solution of the equation

[(iμ̄ + γ̄_2E) - γ̄_2E g_EE Ū_E] [(iμ̄ + γ̄_2I) - γ̄_2I g_II Ū_I] = g_EI g_IE γ̄_2E γ̄_2I Ū_E Ū_I,  (5.3)
where the quantities

Ū_a = ∫_0^∞ dν P_a(ν) F'(F^-1(ν))  (5.4)

do not depend on μ̄.
Equation 5.3 is quadratic in μ̄. For μ̄ = 0, one recovers the condition for the SN bifurcation obtained above (line L1). Another transition line is found for μ̄ ≠ 0, on which synchronous oscillations emerge with a low frequency, of order η ≪ 1. During one period of these oscillations, the neurons burst synchronously, since they have the time to fire several spikes in one oscillation of the population activity. If one assumes γ̄_2E = γ̄_2I = γ̄, it is convenient to define X = g_EE Ū_E + g_II Ū_I and Y = Δ Ū_I Ū_E. One finds that an instability to oscillations occurs on the line X = 2 in the (X, Y) plane and that μ̄ varies continuously on this line. Therefore, a sufficiently strong g_EE and a not too large g_II are required for burst synchrony. More generally, when γ̄_2E ≠ γ̄_2I, this onset of slow bursting occurs on a surface in the three-dimensional space (g_EE, g_II, -g_EI g_IE). One also finds here that a sufficiently strong g_EE is required for this instability. In general, one also finds that increasing g_II prevents the emergence of synchronous bursting. Equation 5.3 is also the spectral equation for the stability of the fixed-point solutions of the system of two coupled equations,

(1/γ̄_2E) dν_E/dt = -ν_E + G_E(g_EE ν_E + g_EI ν_I),  (5.5)

(1/γ̄_2I) dν_I/dt = -ν_I + G_I(g_IE ν_E + g_II ν_I),  (5.6)
where the functions G_a(x) are

G_a(x) = ∫_{-∞}^{∞} dI Q_a(I) F(I + x).  (5.7)
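The claim that, for γ̄_2E = γ̄_2I ≡ γ̄, oscillations appear on the line X ≡ g_EE Ū_E + g_II Ū_I = 2 can be checked directly on the quadratic form of equation 5.3: writing it as λ̄^2 + γ̄(2 - X) λ̄ + γ̄^2 [(1 - g_EE Ū_E)(1 - g_II Ū_I) - g_EI g_IE Ū_E Ū_I] = 0, the pair of roots is purely imaginary exactly when X = 2 (provided the constant term is positive). The numerical values below are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

def slow_synapse_roots(gEE, gII, gEIgIE, UE, UI, gamma=1.0):
    """Roots of the slow-synapse spectral equation 5.3 with
    gamma_2E = gamma_2I = gamma, written as a quadratic in lambda_bar."""
    a = gamma * (1.0 - gEE * UE)
    b = gamma * (1.0 - gII * UI)
    c = gamma**2 * gEIgIE * UE * UI
    # (lam + a)(lam + b) = c  ->  lam^2 + (a + b) lam + (a*b - c) = 0,
    # and a + b = gamma * (2 - X) with X = gEE*UE + gII*UI.
    return np.roots([1.0, a + b, a * b - c])

# Illustrative values chosen so that X = gEE*UE + gII*UI = 2:
UE = UI = 0.5
gEE, gII = 6.0, -2.0           # X = 3.0 - 1.0 = 2.0
gEIgIE = -36.0                 # strong negative feedback between the populations
lams = slow_synapse_roots(gEE, gII, gEIgIE, UE, UI)
print(lams)  # a purely imaginary pair: the onset of slow (bursting) oscillations
```

Away from X = 2 the linear term of the quadratic is nonzero, so the roots acquire a real part and the fixed point is either stable (X < 2) or unstable (X > 2), which is why strong recurrent excitation and weak I-I coupling favor this slow-bursting instability.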
Let us consider a two-population rate model in which the neurons have an input-output transfer function F(x) and receive a random external input with probability density function Q_a(I) (a = E, I). In this model, the firing rate of neuron i in population a, ν_i^a, follows the equation

(1/γ̄_2a) dν_i^a/dt = -ν_i^a + F(I_i^{a,syn} + I_i^a),
(5.8)
where the total synaptic current, I_i^{a,syn}, is given by equation 3.1 and I_i^a is the external input. It is easy to check that the dynamics of this model can be reduced to two effective degrees of freedom, the average firing rates of the two populations, ν_a(t), a = E, I, which satisfy equations 5.5 and 5.6. These two equations are nothing other than a two-population mean-field model with transfer functions G_a(x) given by equation 5.7. An example of such a function is plotted in Figure 13. Therefore, in the limit of very slow synaptic decay, the instabilities of our spiking model are the same as those of a rate model (see also Ermentrout, 1994; Bressloff & Coombes, 2000).
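In this slow-synapse limit, the network therefore behaves like the two-variable rate model of equations 5.5 and 5.6. A minimal sketch of that reduced model is given below. The sigmoidal transfer function G used here is a generic placeholder (in the paper, G_a is built from F and Q_a through equation 5.7), and the couplings, the 0.05 input scaling, and the initial conditions are all illustrative assumptions.

```python
import math

def G(x):
    # Placeholder sigmoidal transfer function (Hz); stands in for the
    # G_a of equation 5.7, which the paper builds from F and Q_a.
    return 100.0 / (1.0 + math.exp(-x))

def simulate_rate_model(gEE, gEI, gIE, gII, gam2E=0.02, gam2I=0.02,
                        dt=0.1, t_max=2000.0):
    """Forward-Euler integration of equations 5.5 and 5.6 (rates in Hz,
    time in msec, gam2a = 1/tau_2a are the slow synaptic decay rates)."""
    nuE, nuI = 10.0, 10.0
    for _ in range(int(t_max / dt)):
        dnuE = gam2E * (-nuE + G(0.05 * (gEE * nuE + gEI * nuI)))
        dnuI = gam2I * (-nuI + G(0.05 * (gIE * nuE + gII * nuI)))
        nuE += dt * dnuE
        nuI += dt * dnuI
    return nuE, nuI

# Illustrative couplings: with modest excitation and negative feedback,
# the reduced model relaxes to a fixed point, i.e., a stable AS.
nuE, nuI = simulate_rate_model(gEE=1.0, gEI=-2.0, gIE=1.0, gII=-1.0)
print(nuE, nuI)
```

With stronger recurrent excitation (pushing X past 2 in the notation above), the same two equations develop a limit cycle instead of a fixed point, which is the rate-model counterpart of the slow synchronous bursting described in the text.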
Figure 13: Input-output transfer function of the rate model that corresponds to the QIF model for σ = 0.4. Other parameters are as in Figure 1.
Finally, one can show that if the rise time of the synapses is also very slow and of the same order of magnitude as their decay time, the quadratic equation for μ̄, equation 5.3, is replaced by a quartic equation. This situation is similar to the previous one, except that now the instability to burst synchrony does not require mutual interactions between the excitatory neurons. Here, too, the spectral equation can be derived from the mean-field equations of a rate model. However, that model is now defined by four coupled equations, two for each of the two populations.

5.4 The Effect of the Average Firing Rates. The effect of the firing rates of the two populations on the stability of the AS was studied above in the symmetric case, in which the average firing rates of the two populations are necessarily equal. Here, we consider the case in which the distributions of firing rates in the two populations are different, to understand how this affects the stability of the AS. In particular, we want to investigate whether some specific relationship ("resonance") between the average firing rates of the two populations is favorable to the emergence of synchrony in the network. The SN instability (μ = 0) depends on the firing rates through U_E(0) and U_I(0). The maximal value of the recurrent excitatory feedback, g_EE, is given by

g_EE = 1/U_E(0) + |g_EI| g_IE U_I(0) / (1 + |g_II| U_I(0)).  (5.9)
The quantities U_a(0), a = E, I, are decreasing functions of the firing rate. For large firing rates, they converge to a finite value. Therefore, increasing
the firing rate of the excitatory population, or decreasing the firing rate of the inhibitory population, increases g_EE (i.e., stabilizes the AS with respect to the SN instability). In particular, the maximal value of g_EE at this onset of instability, when ν_I changes, is 1/U_E(0) + |g_EI| g_IE / |g_II|. It is reached in the limit of very small firing rates of the inhibitory neurons. This is because the sensitivity of the neurons to changes in their input increases when their firing rate decreases. Therefore, the inhibitory neurons are more efficient at low firing rates than at high firing rates at preventing the runaway of the activity of the excitatory neurons due to the excitatory feedback. In the symmetric case, we found that the interaction with an excitatory population stabilizes the AS with respect to the spike-to-spike instability generated by the I-I interactions. Figure 11B provides an example showing that this remains true in the general case. In order to gain more insight into this property, we computed the value of g_II on L2 for different values of ν_E and ν_I, with all the other parameters of the model held constant. The result for ν_I = 50 Hz is plotted in Figure 14A. In the limit ν_E → 0, g_II → g_c, where g_c is the critical value of the I-I coupling at which the AS would lose stability in the inhibitory population if it were isolated from the excitatory population. This is expected, since in this limit the excitatory population has no effect because it is inactive. In the limit of large ν_E, also, g_II → g_c. This is because in this limit, the modulation of the excitatory drive on the inhibitory population becomes small, and therefore the excitatory population has no dynamical effect on the inhibitory population. For intermediate values of the excitatory firing rate, g_II is a nonmonotonic function of ν_E. The minimum value is reached for ν_E ≈ ν_I. It is about 25% more negative than g_c. Note that the position and the depth of this extremum of g_II depend on ν_I.
As a rule, we have found that when ν_I is large, this extremum is more pronounced and is located around a value ν_E < ν_I, as depicted in Figure 14A. These results show that, in general, the feedback between the two populations disrupts the emergence of synchrony through the mutual interactions between the inhibitory neurons, unless the average frequencies of the two populations differ sufficiently. The frequencies of the two populations affect the instability on L4 in a different way, as shown in Figure 14B. Here, we have plotted the critical value of the feedback, -g_EI g_IE, between the two populations versus the frequencies ν_E and ν_I for g_EE = g_II = 0. The feedback required to destabilize the AS is minimal around ν_E = ν_I ≈ 27.5 Hz. Note that this minimum is deep but that its curvature is small. Therefore, in contrast to what happens on L2, synchrony is favored on L4 if the average frequencies of the two populations are similar. The behavior of line L3 is similar to that found for L4: the critical value of the feedback is smaller when the two populations have the same firing rate. However, in this case, the system is much more sensitive to the difference in the firing rates. For instance, a factor of 2 between ν_E and ν_I generates an increase by a factor of 10 in the critical
Figure 14: (A) The strength of the mutual inhibition, g_II, at which the AS becomes unstable on line L2, versus the average frequency of the excitatory population, ν_E. The average frequency of the inhibitory population is ν_I = 50 Hz. Other parameters are g_EE = 2 and g_EI g_IE = -4. (B) -g_EI g_IE versus the average firing rates of the two populations for which a Hopf bifurcation (L4) occurs. Here, g_EE = g_II = 0. In A and B, the synaptic time constants are τ_1E = τ_1I = 1 msec, τ_2E = 3 msec, τ_2I = 6 msec, and the standard deviations of the input distributions are σ_E = σ_I = 0.2.
feedback -g_EI g_IE (data not shown here), while the change on L4 is less than 100%. This behavior is similar to that found in the previous section: line L3 corresponds to a much less robust transition than line L4.
6 Numerical Simulations

The study of the solutions of equation 3.16 has allowed us to determine the stability region of the AS in parameter space and the different ways this state can become unstable. The goal of this section is to complete the characterization of these instabilities and of the patterns of synchrony that emerge beyond their onset. We focus on the following questions: Are the transitions at the onset of the instabilities sub- or supercritical when they occur through a Hopf bifurcation? To what extent is the imaginary part of the unstable eigenvalue on the instability lines a good estimator of the frequency of the population oscillations in the stable oscillatory state in which the network settles beyond, but close to, these lines? Is the phase lag, δ, computed from the linearized dynamics a good estimator of the actual phase lag between the oscillations of the activities of the two populations in the stable state near the instability onset? The answers to the last two questions depend critically on the nature of the Hopf bifurcation. Which of the excitatory and inhibitory populations is more synchronized near the onset of synchrony? How can their relative levels of synchrony be controlled? In principle, the answer to the first question can be determined analytically by expanding the dynamics beyond the linear order. However, this calculation is extremely involved and is beyond the scope of this article. Therefore, to clarify all these issues, we relied on numerical simulations. These were performed by integrating the dynamics of the network using a second-order Runge-Kutta scheme with a fixed time step, dt = 0.1 msec. This algorithm was supplemented by an interpolation for the determination of the firing times of the neurons (Hansel, Mato, Neltner, & Meunier, 1998; Shelley & Tao, 2001).

6.1 The Nature of the Hopf Bifurcations.
In a subcritical Hopf bifurcation, a stable fixed point, a stable limit cycle, and an unstable limit cycle can coexist in some range of the control parameter near the bifurcation. This does not happen if the bifurcation is supercritical. One can use this criterion to characterize the nature of a Hopf bifurcation in numerical simulations by checking whether the system displays hysteresis when the control parameter is changed continuously from a value where only the fixed point exists up to values where only the limit cycle is stable, and back again. We performed numerical simulations in the vicinity of the three lines L2, L3, and L4. The state of the network was characterized by computing the synchrony level χ in the two populations, as explained in appendix C. Simulations with different network sizes were compared to check that syn-
34
D. Hansel and G. Mato
chrony was not due to a nite-size effect. An example of the results of this study is shown in Figures 6.1A through 15C for s D 0.1 and network size NE D NI D 12, 560 near instability points chosen on L2, L3, and L4, respectively. No hysteresis is apparent in the vicinity of the instability points on L2 (see Figure 15A). Although a small hysteresis cannot be completely excluded, our simulations were unable to detect it. In contrast, near L3 and L4,
the network displays some hysteresis (see Figures 15B and 15C). However, this is a small effect, since the range of coupling strengths over which it occurs, as well as the values of the synchrony parameter, χ, on the left branches of the hysteresis loops, are small. When σ increases, hysteresis also appears near the instability points on L2. This effect can be seen by comparing Figures 15A and 15D, which were obtained for σ = 0.1 and σ = 0.4, respectively. However, this is a small effect, which can hardly be distinguished from sample-to-sample fluctuations if the network size is too small (result not shown). Near L4, for this value of σ, the hysteresis has shrunk and cannot be seen.

6.2 The Frequency of the Synchronous Oscillations and the Phase Lag Between the Two Populations.

Our simulations indicate that a supercritical Hopf bifurcation occurs on L4 for large levels of heterogeneity. Therefore, one expects the pattern of synchrony to be correctly predicted by our linear stability analysis near the bifurcation. This is confirmed in Figure 16A, where simulation results for the frequency of the population oscillations near L4 are plotted for the parameters of the phase diagram in Figure 7. The simulations were performed along a line parallel to L4, and the results are compared with the value predicted from the linear stability analysis. In Figure 17A, the potentials of one excitatory neuron and one inhibitory neuron are plotted for T = 5.25, D = 8. At this point, the global frequency is much smaller than the firing rate of the neurons. This means that the neurons must be firing bursts, as observed in the simulation.
Figure 15: Facing page. The level of synchrony in the excitatory population close to the Hopf bifurcation lines in the symmetric case. For each value of the bifurcation control parameter (corresponding to the x-axis), the network was simulated for 60,000 time steps. At the end of the run, the final state of the network was kept and used as the initial condition of a new simulation after the control parameter was incremented. In each figure, two lines are shown: one corresponds to a sweep with positive increments and the other to a sweep with negative increments. The standard deviation of the input is σ = 0.1. (A) The strength of the recurrent excitation is fixed, gEE = 2, and gII was changed. For each value of gII, gEI gIE is varied to keep D = 0. Predicted transition on L2 at T = −1.11 (vertical arrow). (B) The E-E and I-I couplings are fixed: gEE = gII = 0. The E-I and I-E couplings are varied with gEI = −gIE. Predicted transition on L3 at D = 1.98 (vertical arrow). (C) The E-E and I-I interactions are fixed, gEE = 5, gII = 0, and gEI and gIE are changed simultaneously with gEI = −gIE. Predicted transition on L4 at D = 7.9 (vertical arrow). (D) gII and gEI, gIE are changed such that gEI = −gIE and D = −1. Predicted transition on L2 at T = −4.2 (vertical arrow). (E) The E-E and I-I interactions are fixed: gEE = gII = 0, and gEI and gIE are changed simultaneously with gEI = −gIE. Predicted transition on L4 at D = 8.11.
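The sweep protocol described in this caption can be mimicked on the normal form of a subcritical bifurcation, where a stable fixed point and a stable limit cycle coexist and produce exactly the kind of hysteresis loop probed in Figure 15. The amplitude equation below, dr/dt = μr + r³ − r⁵, is a standard textbook example, not the network model of the article.

```python
import numpy as np

def settle(r, mu, dt=0.01, steps=20000):
    # Relax the amplitude equation dr/dt = mu*r + r**3 - r**5 to an attractor.
    for _ in range(steps):
        r += dt * (mu * r + r**3 - r**5)
    return r

def sweep(mus, r0):
    # Protocol of Figure 15: the final state at one parameter value is used
    # as the initial condition at the next one.
    rs, r = [], r0
    for mu in mus:
        r = settle(max(r, 1e-6), mu)  # small kick mimicking finite-size fluctuations
        rs.append(r)
    return np.array(rs)

mus = np.linspace(-0.5, 0.3, 17)
up = sweep(mus, r0=0.01)               # sweep with positive increments
down = sweep(mus[::-1], up[-1])[::-1]  # sweep with negative increments
# Subcritical case: the two sweeps disagree over a finite parameter range.
hysteresis = np.any(np.abs(up - down) > 0.1)
```

At μ = −0.05, the upward sweep stays on the fixed point while the downward sweep stays on the limit-cycle branch; a supercritical normal form (without the r³ term's destabilizing sign) would give identical curves.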
Figure 16: The frequency of the population activity oscillations near the onset of instability of the AS. The lines represent the frequency predicted from the solution of the spectral equation, equation 3.16. The squares are the results from simulations. A network of NE = NI = 1600 neurons was simulated for 60,000 time steps, and the oscillation frequency was obtained by evaluating the autocorrelation of the average membrane potential. (A) Parameters as in Figure 7. The simulations were done at a distance 2 away from line L4, with gIE = −gEI and keeping gII = −2. (B) Parameters as in Figure 4. The simulations were done at a distance 1 away from line L3, with gIE = −gEI and keeping gII = −2. (C) Parameters of Figure 7. The simulations were done at a distance 2 away from line L2, varying gIE = −gEI and keeping gEE = 0.
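The autocorrelation-based frequency estimate used for the squares in Figure 16 can be sketched as follows: the first local maximum of the autocorrelation of the mean potential at a strictly positive lag gives the oscillation period. The synthetic test signal below stands in for the simulated average membrane potential.

```python
import numpy as np

def population_frequency(v_mean, dt):
    """Oscillation frequency (in Hz) of a population signal sampled every
    dt msec, from the first autocorrelation peak at a positive lag."""
    x = np.asarray(v_mean) - np.mean(v_mean)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0, 1, 2, ...
    k = 1
    # Walk out to the first local maximum at a strictly positive lag.
    while k + 1 < len(ac) and not (ac[k] > ac[k - 1] and ac[k] >= ac[k + 1]):
        k += 1
    return 1000.0 / (k * dt)
```

For instance, a 40 Hz sinusoid sampled at dt = 0.1 msec is recovered to within a fraction of a hertz.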
The Hopf bifurcations that occur on L3 (for small levels of heterogeneity), as well as on L2 if the level of heterogeneity is sufficient, were found to be subcritical. Therefore, one expects that the actual oscillation frequencies and phase lags near these lines should deviate from the values obtained for the
Figure 17: Synchronous oscillations in which the neurons fire bursts of spikes. Parameters as in Figure 7 with T = 5.25 and D = 8. (A) Membrane potential of one excitatory neuron (lower trace) and one inhibitory neuron (upper trace). (B) The membrane potential of the excitatory (resp. inhibitory) neurons averaged over the whole population (lower line, resp. upper line).
unstable mode in the linear stability analysis. In spite of this, a reasonable agreement between the predictions and the simulation values was found, as shown in Figures 16B and 16C. This is because the hysteresis near L2 has a small range and a small amplitude (see, e.g., Figure 15B). In Figure 6, the phase lag between the oscillations of the activities of the two populations is plotted as a function of gEE for gII = 0 (A) and as a function of gII for gEE = 0 (B). The solid and dashed lines correspond to the predictions from our analytical results. Results from numerical simulations
performed in the vicinity of the instability lines are represented by squares. They were estimated from the position of the peak of the cross-correlation of the activities of the two populations. In agreement with the linear theory, the phase lags on L3 and L4 are nonmonotonic functions of the coupling. For instance, on line L4, the phase lag reaches a maximum of π/2 when gEE = gII = 0 and then decreases as one approaches points B or D in the phase diagram. The negative sign of the phase lag means that the oscillation of the activity of the excitatory population is ahead of the inhibitory one. Note that on L3, the lag has the opposite sign; that is, the inhibitory population oscillates ahead of the excitatory one. On line L4 near point D, the linear theory predicts that the phase lag should become small and vanish at D. This is confirmed by our numerical simulations, as shown in Figure 17B. Finally, the linear theory predicts that at the instability onset on L2, the phase lag should be zero. Therefore, δ is expected to be small near L2 when synchrony occurs. Our simulations are in line with this prediction.

6.3 The Level of Synchrony in the Two Populations.

For strong enough gII, the activity of the inhibitory neurons can synchronize as a result of their mutual coupling. If the feedback between the two populations, −gEI gIE, is weak, the dynamical state of the inhibitory population is not perturbed much. However, the excitatory population is now entrained by the inhibitory one and therefore develops synchrony. The level of synchrony in the excitatory population depends on the coupling gEI. For small gEI, one expects the excitatory population to be weakly synchronized, with a level of synchrony that increases with gEI. These expectations were confirmed by our numerical simulations. Moreover, under these conditions, the inhibitory population was always substantially more synchronized than the excitatory one (results not shown).
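The cross-correlation estimate of the phase lag used for the squares can be sketched as below. With numpy's correlation convention, a negative lag of the peak means that the excitatory signal leads the inhibitory one, matching the sign convention used in the text; the synthetic sinusoids are stand-ins for the simulated population activities.

```python
import numpy as np

def phase_lag(act_e, act_i, dt, period):
    """Phase lag (in radians) between two population signals, from the
    position of the peak of their cross-correlation; negative values mean
    the excitatory population is ahead of the inhibitory one."""
    xe = np.asarray(act_e) - np.mean(act_e)
    xi = np.asarray(act_i) - np.mean(act_i)
    cc = np.correlate(xe, xi, mode="full")
    lags = np.arange(-len(xe) + 1, len(xe))
    keep = np.abs(lags) <= int(period / dt) // 2  # restrict to +/- half a period
    best = lags[keep][np.argmax(cc[keep])]
    return 2.0 * np.pi * best * dt / period

# Two 40 Hz signals; the inhibitory one is delayed by 5 msec (E is ahead).
t = np.arange(0.0, 500.0, 0.1)
lag = phase_lag(np.sin(2 * np.pi * t / 25.0),
                np.sin(2 * np.pi * (t - 5.0) / 25.0), dt=0.1, period=25.0)
```

Here the recovered lag is close to −2π/5, as expected for a 5 msec delay within a 25 msec period.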
More generally, we found this property to hold in the vicinity of L2 even for large values of −gEI gIE. This can be seen in Figure 18, where the synchrony measures, χE and χI, are plotted against −gIE gEI for gEI = −gIE/4 (A) and for gEI = −4gIE (B). In both cases, the transition point on L2 occurs at the same value of −gIE gEI, namely, −gIE gEI ≈ 3.25. This agrees with the fact that the spectral equation depends on gEI and gIE only through their product. However, the level of synchrony in the two populations also depends on the ratio gEI/gIE. On both curves, the inhibitory population is always more synchronized than the excitatory one. The synchrony in the excitatory population varies nonmonotonically with gEI gIE, with an intermediate region where the AS is stable. This is because the feedback between the two populations has two opposing effects: on the one hand, it drives the excitatory population, and on the other, it disrupts the synchrony induced in the inhibitory population by the mutual I-I interactions. Finally, increasing gEI (for fixed gEI gIE) increases χE. This is because the larger |gEI|, the stronger the inhibitory drive on the excitatory population.
Figure 18: The synchrony levels of the excitatory (solid line) and inhibitory (dashed line) populations versus −gEI gIE. The strengths of the E-E and I-I interactions are fixed: gEE = 2, gII = −2. Other parameters: νE = νI = 50 Hz, σE = σI = 0.2. A network with NE = NI = 1600 neurons was simulated for different values of gEI at a fixed ratio gEI/gIE. (A) gEI/gIE = −1/4. (B) gEI/gIE = −4. The level of synchrony was calculated as explained in appendix C (average over 6 seconds). The theory predicts that the AS becomes unstable at −gEI gIE = 3.25 (on L2) and −gEI gIE = 5.7 (on L4), as indicated by the arrows.
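For reference, a common population-variance synchrony measure of the type defined in appendix C (the exact normalization used there may differ) can be computed as below: χ tends to 1 for fully synchronized voltages and decreases toward zero, as 1/√N, in an asynchronous state.

```python
import numpy as np

def chi(V):
    """Synchrony measure for an array V of shape (neurons, time points):
    square root of the ratio of the temporal variance of the population-
    averaged potential to the mean single-neuron temporal variance."""
    var_pop = V.mean(axis=0).var()   # variance of the instantaneous average
    var_ind = V.var(axis=1).mean()   # average single-neuron variance
    return float(np.sqrt(var_pop / var_ind))
```

Identical sinusoidal "voltages" give χ = 1, while sinusoids with independent random phases give a value of order 1/√N.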
In all our numerical simulations, we found that in the vicinity of L2, the level of synchrony is larger in the inhibitory population than in the excitatory one. This contrasts with what happens near L3, as shown in Figure 19. Here, for a given value of −gEI gIE, the relative level of synchrony depends strongly on the balance between gEI and gIE. For large |gEI| and small gIE, the synchrony in the excitatory population is higher than in the inhibitory one. The opposite happens if gIE is large and |gEI| is small. The same figure also shows that the synchrony levels near line L4 are affected in the same way by the balance between gEI and gIE. Making the excitatory-to-
Figure 19: The synchrony levels of the excitatory (solid line) and inhibitory (dashed line) populations versus −gEI gIE. The strengths of the E-E and I-I interactions are fixed: gEE = 2, gII = −2. Other parameters: νE = νI = 50 Hz, σE = σI = 0.1. A network with NE = NI = 1600 neurons was simulated for different values of gEI at a fixed ratio gEI/gIE. (A) gEI/gIE = −1/4. (B) gEI/gIE = −4. The levels of synchrony of the two populations were calculated as explained in appendix C (average over 6 seconds). The theory predicts that the AS becomes unstable at −gEI gIE = 2.96 (on L2) and −gEI gIE = 5.42 (on L3), as indicated by the arrows.
inhibitory interaction stronger makes the inhibitory population more synchronized, and vice versa.

7 Discussion

In this work, we have analyzed the synchronization properties of a fully connected two-population system. Our intention was not to present a complete model of the cortex, but a simple network model that is amenable to analytical treatment. A natural but nontrivial extension of our work would
be to study the stability of asynchronous states in networks with more structured patterns of connectivity, such as those found in the visual cortex, which include local or intracolumnar connections between neurons with similar feature preferences, vertical connections between different layers, and feedback connections from higher cortical areas. For instance, one expects that in a network with space-dependent interactions, instabilities of the AS may lead to traveling waves or other spatiotemporal patterns in which hot spots of activity move across the system (Ben-Yishai, Hansel, & Sompolinsky, 1997; Hansel & Sompolinsky, 1998; Golomb, 1998; Golomb & Ermentrout, 1999).

7.1 Three Mechanisms for Synchrony in Two-Population Networks.

7.1.1 The Mutual Inhibition Mechanism. The activity of inhibitory neurons can be synchronized on the timescale of the spikes through their mutual interactions. It has been proposed that hippocampal γ rhythms can be driven by populations of inhibitory interneurons whose activity is synchronized through this mechanism (Buzsáki & Chrobak, 1995). Synchrony in a single population of inhibitory neurons has been extensively studied in recent years, theoretically (van Vreeswijk, Abbott, & Ermentrout, 1994; Hansel, Mato, & Meunier, 1995; van Vreeswijk, 1996; Wang & Buzsáki, 1996; White et al., 1998; Chow, 1998; Chow, White, Ritt, & Kopell, 1998; Golomb & Hansel, 2000; Neltner et al., 2000; Whittington, Traub, Kopell, Ermentrout, & Buhl, 2000) and experimentally (Whittington, Traub, & Jefferys, 1995; Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996; see Traub, Jefferys, & Whittington, 1999, for a review). Here, we have generalized this mechanism to take into account the effect of a second, excitatory population. When the cross-talk interaction between the two populations is not too large, we have found that the AS becomes unstable if the mutual inhibition is sufficiently strong. Excitation plays against this destabilization in two ways.
For a given value of the feedback between the two populations (fixed value of |gEI gIE|), the strength of the mutual inhibition gII required to destabilize the AS increases with gEE. Also, for a given gEE, a larger mutual inhibition is required if the feedback between the two populations is increased. If |gEI gIE| is too strong, another mechanism for the destabilization of the AS can occur, as discussed below. These desynchronizing effects of the excitatory population are stronger when the average firing rates of the two populations are more similar. In the example shown in Figure 14, |gII| is maximal when the firing rates of the two populations are similar. Therefore, as far as the value of gII is concerned, the excitatory population has a strong influence. In contrast, we have found that the frequency of the unstable mode at the transition is mainly determined by the average firing rate of the inhibitory population and depends only weakly on the properties of the excitatory population and on the way the two populations interact. Related to this fact, we have shown that the phase shift between the two populations is very small, even
if their average firing rates are very different. In view of these features, this mechanism can arguably be said to rest on the synchronization of the inhibitory population. Our analysis focuses on the mechanisms for destabilization of the AS. We address this problem by studying local bifurcations of the AS. Therefore, we did not address the issue of the properties of the rhythms under strong synchrony. Strongly synchronized states could coexist with the weakly synchronous states or even with the AS. The only way to address this question (and the question of the size of the basins of attraction of the stable states) is with extensive numerical simulations. The fact that the frequency of the unstable mode is determined to a large extent by the average firing rate of the neurons in the AS does not contradict previous results indicating that when the synchrony is substantial, the rhythm frequency can depend on the synaptic time constant of the mutual inhibition (Whittington et al., 1995; Chow et al., 1998; White et al., 1998).

7.1.2 The Cross-Talk Mechanisms. We have demonstrated that the AS can be destabilized through two qualitatively different mechanisms by the cross-talk interactions between the two populations. In the first mechanism (on L4), the inhibitory population lags behind the excitatory one. Therefore, in the unstable mode, an increase in the activity of the excitatory population makes the inhibitory neurons increase their activity. This generates in return an increase in the inhibition of the excitatory neurons, which tends to prevent simultaneous firing of many of these neurons. Sufficiently strong feedback between the two populations will prevent the perturbation from decaying, and synchrony will emerge. Mutual excitation of the excitatory neurons favors the development of this instability, since it helps to amplify perturbations in the E population. Mutual inhibition of the inhibitory neurons disrupts the development of this instability, since it reduces the excitability of this population.
In this mechanism, synchrony is induced by the recurrent inhibition of the excitatory population. It has been investigated mostly through numerical simulations in previous studies and has been proposed as a mechanism for γ oscillations (Jefferys, Traub, & Whittington, 1996). However, as shown here, this mechanism can also lead to slow oscillations. The oscillations can be so slow that the neurons develop bursts of action potentials. In the second type of cross-talk instability (on L3), the inhibitory population tends to fire in advance with respect to the excitatory one, preventing the excitatory neurons from firing. In return, this deexcitation reduces the activity of the inhibitory neurons. This instability does not decay provided that the feedback between the two populations is sufficiently strong. The global oscillations near this instability are on the order of, or faster than, the average firing rate of the neurons. These two mechanisms are qualitatively different from the mutual inhibition mechanism. Indeed, we have found that the frequency of the population
activity as well as the phase shift between the oscillations in the two populations vary substantially and continuously on the corresponding instability surfaces. Also, the emerging oscillations in the two populations can display a large phase shift, δ. However, in all the examples we have studied, the numerical solution of the spectral equation reveals that |δ| < π/2. We did not prove this statement in general. However, we have proved that in the symmetric case, |δ| = π/2 on the instability onset lines L3 and L4 at gEE = gII = 0. We have also shown that δ has a local maximum at that point. Without the assumption of symmetry between the two populations, we have also shown that the oscillations of the two populations become in phase when the emerging rhythm becomes very slow.

7.1.3 Robustness of Synchrony to Heterogeneities. A systematic and comparative study of the robustness of the various modes of destabilization would be a very tedious task. Nevertheless, some general trends emerge from the study of the spectral equation in the symmetric case. The coupling strengths required for destabilization on L2 and L3 depend crucially on the heterogeneity level (see Figure 9). This is in contrast with what happens for L1 and L4, where the dependence is much weaker. Thus, recurrent inhibition is the most robust mechanism for synchronous oscillations. Another difference between line L3 and the other lines is that for sufficiently large σ, L3 lies completely inside the region beyond instability line L4. Therefore, when the level of heterogeneity increases, this line becomes physically irrelevant. In this sense, the mechanism corresponding to this instability line is much less robust than the other mechanisms. Similar trends were also found in all the examples we investigated after relaxing the symmetry assumption. Recent studies have suggested that because of heterogeneities, synchrony in inhibitory networks can be difficult to achieve (White et al.,
1998; Chow et al., 1998). Our study provides an example of a model in which the desynchronizing effect of large levels of heterogeneity can be overcome, provided that the external excitatory drive and the synaptic interactions are sufficiently strong. Simulation results in the framework of a conductance-based model were found to be consistent with this conclusion (Neltner et al., 2000). This is in contrast with the behavior of inhibitory networks of leaky integrate-and-fire neurons, where for a sufficiently large level of heterogeneity, the AS is always stable (Golomb et al., 2001). In those networks, for sufficiently strong coupling, the AS is stable even if the network is homogeneous. This is yet another difference with the present model. Below, we elaborate further on these differences.

7.2 Getting Rid of Synchrony. The frequent occurrence of synchronous neuronal activity in physiopathological states of the brain (see Bergman et al., 1998; McCormick & Contreras, 2001) suggests that in many situations, synchronous states of activity need to be prevented. The issue is to understand how synchrony can be prevented, in particular in systems
where massive feedback interactions are observed. Our analysis shows that even if the interactions are strong, a suitable balance between excitation and inhibition stabilizes the AS. We have proved this statement for our model in full generality and analytically under the assumption of symmetry between the two populations. Indeed, for D = T = 0, the AS is marginally stable without heterogeneities, and it becomes stable when heterogeneities are added. This implies that, besides the presence of heterogeneities, if the excitatory feedback connections, gEE, are strong, the two populations must interact strongly and the mutual interactions of the inhibitory neurons must be strong to ensure that the AS is stable. In particular, if gII is not strong enough, the system tends to fire synchronous bursts. Increasing gII reduces the excitability of the inhibitory population, preventing the synchronous bursts from developing. This desynchronizing effect of mutual inhibition also holds in the nonsymmetric case, as shown in Figure 11.
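The stabilizing role of this balance can be illustrated on a linearized two-population rate model, the limit to which the spectral equation reduces for slow synapses (see section 7.5). The gains and time constant below are illustrative, not fitted to the article's network.

```python
import numpy as np

def rate_model_eigs(gEE, gEI, gIE, gII, tau=10.0):
    """Eigenvalues of the Jacobian of a two-population rate model,
        tau * drE/dt = -rE + gEE*rE + gEI*rI + IE,
        tau * drI/dt = -rI + gIE*rE + gII*rI + II,
    linearized around its fixed point (gEI < 0 and gII < 0 are inhibitory)."""
    J = np.array([[gEE - 1.0, gEI],
                  [gIE, gII - 1.0]]) / tau
    return np.linalg.eigvals(J)

# Strong recurrent excitation with weak mutual inhibition: the fixed point
# loses stability through a complex pair (an oscillatory, Hopf-like instability).
unstable = rate_model_eigs(gEE=3.0, gEI=-3.0, gIE=3.0, gII=-0.5)
# Strengthening the mutual inhibition restabilizes the fixed point.
stable = rate_model_eigs(gEE=3.0, gEI=-3.0, gIE=3.0, gII=-3.0)
```

In this caricature, increasing |gII| makes the trace of the Jacobian negative, which is exactly the sense in which strong I-I interactions must compensate strong gEE.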
7.3 Application to the Stability of Persistent Activity. Hebb (1949) proposed that the representation of an object in short-term memory consists of the set of cortical cells that this object activates when it is presented. According to this hypothesis, the reverberation of excitation across this set is able to maintain activity even after the stimulus has been removed, thus preserving a memory of the object. In other words, the network displays multistability. The fact that persistent activity requires strong excitatory feedback poses the problem of the control of neural activity in persistent states at firing rates in the range of 10 to 30 spikes per second, as observed in experiments. Wang (1999) has argued, mostly on the basis of numerical simulations, that to achieve a stable persistent state with activities in a physiological range (10-50 Hz), recurrent excitatory synapses must be dominated by a slow component. It was subsequently proposed that NMDA synapses were required to achieve such states. Another possibility is that inhibition controls the firing rate (Rubin & Sompolinsky, 1989). However, recurrent inhibition can also induce synchrony, and this can be incompatible with low-rate persistent activity. Indeed, if the neurons fire synchronously in some short time window, the fast excitatory feedback will decay before the neurons fire their next spike. Note, however, that weak synchrony may be compatible with persistent activity. The results shown in Figure 12 indicate that strong mutual inhibitory interactions with synaptic time constants in a range corresponding to GABAA synapses stabilize the AS even if the recurrent excitation is mediated by nonsaturating AMPA synapses. Therefore, strong GABAA mutual inhibition should help achieve a stable persistent state. This is in line with the conclusions of our previous study (Hansel & Mato, 2001), in which the threshold distribution had a finite support.
Here, we have shown that this remains true for a gaussian distribution of thresholds without truncation (see Figure 3).
At first glance, this contradicts the results of Wang (1999). However, I-I interactions were not included in Wang's study. This may explain why it concluded that stable persistent activity was impossible to achieve with AMPA excitatory synapses and that NMDA synapses were necessary. One should note that recent anatomical and physiological studies have revealed that I-I connections are numerous and strong in the cortex (Sik, Penttonen, Ylinen, & Buzsáki, 1995; Tarczy-Hornoch, Martin, Jack, & Stratford, 1998; Gupta, Wang, & Markram, 2000). The role we suggest for these connections in the stability of persistent activity fits nicely with these experimental results, without excluding that NMDA synapses could also contribute to delay-period activity in cortex.

7.4 Comparison with Networks of Leaky (Linear) Integrate-and-Fire Neurons. In the absence of noise, very precise fine-tuning of the external input is required to get a LIF neuron to fire below 8 spikes per second. This is due to the logarithmic behavior of the ν-I curve near the current threshold. Furthermore, there is an exponential divergence of the function H(ν) (the equivalent of equation 3.18 for the leaky integrate-and-fire neuron; Golomb et al., 2001) when ν → 0, unless P(x) vanishes at least exponentially fast when x → 0. In this situation, U also diverges, and the spectral equation is not well defined. This divergence is due to the fact that at very low rates, the LIF model spends an exponentially long time near the threshold. This is because leaky integrate-and-fire neurons integrate their inputs in a purely passive way. As a consequence, if the external currents the neurons receive are gaussian distributed, the resulting distribution of firing rates does not vanish sufficiently fast at zero firing rate for the integrals in H(ν) to converge. To avoid this problem, a gaussian distribution of periods was chosen in Golomb et al. (2001).
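The severity of this fine-tuning can be quantified from the f-I relations of the two models. The sketch below inverts T = τ ln[I/(I − 1)] for a LIF with unit threshold and zero reset, and T = τπ/√I for a QIF with thresholds at ±∞, and compares the distance above rheobase needed to fire at 8 spikes per second; the units and normalizations are illustrative.

```python
import numpy as np

TAU = 10.0  # membrane time constant (msec); illustrative value

def lif_current_for_rate(nu_hz):
    # LIF with threshold 1 and reset 0: T = TAU * ln(I / (I - 1)), inverted for I.
    T = 1000.0 / nu_hz
    return 1.0 / (1.0 - np.exp(-T / TAU))

def qif_current_for_rate(nu_hz):
    # QIF, TAU * dV/dt = V**2 + I, thresholds at +/- infinity: T = TAU * pi / sqrt(I).
    T = 1000.0 / nu_hz
    return (TAU * np.pi / T) ** 2

# Distance above rheobase needed to fire at 8 spikes per second:
d_lif = lif_current_for_rate(8.0) - 1.0  # LIF rheobase current is 1
d_qif = qif_current_for_rate(8.0)        # QIF rheobase current is 0
```

With these parameters, the LIF needs a current within a few parts in a million of its rheobase, several orders of magnitude finer tuning than the QIF, reflecting the logarithmic versus square-root behavior of the two ν-I curves.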
In contrast, for the QIF model in the limit of very low rates, H(ν) goes as (exp(−λ/ν) − 1)P(ν)/ν. Therefore, in this limit, U is proportional to the average period of the population, which is defined even if the distribution of firing rates does not vanish at zero frequency. Another difference between the two models is that in LIF inhibitory networks, instability of the AS can lead to cluster states (van Vreeswijk, 1996; Neltner et al., 2000; Golomb & Hansel, 2000). The number of clusters thus generated can be rather large, especially at low firing rates. This does not happen in our model, since the numerical solution of the spectral equation reveals that the onsets of the clustering instabilities always lie within the regions where the AS is already unstable. These instabilities are therefore irrelevant for the AS. Of course, this does not exclude the possibility that clustering occurs far from the onset of the instabilities in our model as well. The origin of this different behavior can be clarified in the weak coupling limit. In this limit, the instabilities of the AS can be related to the Fourier expansion of the response function, Z(φ), of the neurons, where φ is the phase variable describing the position of the neuron on its limit cycle (Kuramoto, 1984; Neltner et al., 2000). Destabilization of the AS through an n-cluster
instability mode in a heterogeneous network requires a large Fourier coefficient of order n. In the LIF model, the function Z is an exponential and is discontinuous at φ = π/2. Therefore, all the Fourier coefficients are large, and they even diverge in the limit of small firing rates. One can further show that in this limit, the order of the leading unstable mode increases. In contrast, for the QIF, the amplitude of the response function diverges, but proportionally to cos(φ) (up to a constant). Therefore, the dominant mode becomes the first one. This difference has its origin in the way the membrane potential approaches the threshold potential at low rates. In the LIF, the derivative of the membrane potential just before the spike threshold goes to zero with the firing rate. This is not the case for the QIF, where this derivative is proportional to Vt^2 ≠ 0.

7.5 Relation to Previous Studies. Mechanisms of neuronal synchrony have been addressed in recent years following two different and complementary theoretical approaches. One has been to study the conditions under which fully synchronized states, or more general phase-locked states, are stable in fully connected networks of identical neurons (Wang & Rinzel, 1992; Hansel, Mato, & Meunier, 1993b; van Vreeswijk et al., 1994; Hansel et al., 1995; White et al., 1998; Chow, 1998; Crook, Ermentrout, & Bower, 1998). The other approach has been to investigate the conditions for the stability of the asynchronous state of a large neuronal system (Abbott & van Vreeswijk, 1993; Gerstner & van Hemmen, 1993; Treves, 1993; van Vreeswijk, 1996, 2000; Brunel & Hakim, 1999; Brunel, 2000; Golomb & Hansel, 2000; Neltner et al., 2000; Gerstner, 2000). The latter approach was followed here. This article aims to contribute to a comprehensive understanding of the patterns of synchrony in two-population heterogeneous neuronal networks with full connectivity.
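The Fourier-coefficient argument of section 7.4 can be checked numerically: an exponential, discontinuous response function (LIF-like) retains large high-order Fourier coefficients, whereas a response function with a single cosine mode (QIF-like, proportional to 1 − cos φ) has essentially only the first one. Both shapes below are schematic, not the exact response functions of the two models.

```python
import numpy as np

phi = np.linspace(0.0, 2.0 * np.pi, 4096, endpoint=False)
z_lif = np.exp(phi)          # exponential, discontinuous across the cycle boundary
z_qif = 1.0 - np.cos(phi)    # single-mode, QIF-like response function

def fourier_amps(z, n_max=10):
    # Magnitudes |c_1| ... |c_n_max| of the Fourier coefficients of z(phi).
    return np.abs(np.fft.rfft(z))[1:n_max + 1] / len(z)

a_lif = fourier_amps(z_lif)
a_qif = fourier_amps(z_qif)
```

For the exponential, |c_n| decays only as 1/sqrt(1 + n²), so the tenth harmonic is still more than a tenth of the first; for the cosine, all harmonics beyond the first vanish to machine precision, which is why only the first (two-cluster-free) mode can destabilize.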
In order to make the model more amenable to analytical calculations, noise was not introduced. An expected consequence of adding input noise is to enlarge the parameter domain where the AS is stable. Still, if the noise is not too strong, simulations indicate that the three instabilities may still occur for the AS. However, another type of instability can also be found if the noise is strong enough. It leads to a state in which the population oscillations are faster than the average firing rate of the neurons. In this state, the firing pattern of the neurons can be very irregular, and averaging over a large number of neurons is necessary to reveal the existence of the underlying rhythmic component in the network activity. Such patterns of synchrony, called fast oscillations, were first studied by Brunel and Hakim (1999) in sparse networks of inhibitory leaky integrate-and-fire neurons. The results of this study were subsequently generalized to the case of two populations with sparse connectivity (Brunel, 2000; N. Brunel, C. Geisler, & X.-J. Wang, private communication, Dec. 2001). More recently, Brunel and Hansel (2002) have shown that sparse connectivity is not required for fast oscillations to take place, and that they can occur in a fully connected network of inhibitory neurons in the presence
of strong input noise. Increasing the coupling destabilizes the AS, leading to a state where the neurons fire synchronously at a rate that can be significantly smaller than the frequency of the population oscillations. Therefore, when noise is also present, the AS can become unstable through four generic mechanisms. The all-to-all connectivity considered here can be thought of as the extreme case of a system in which the connectivity is high. Indeed, we have checked that our results still hold when the connectivity is random and sufficiently dense (results not shown). However, an important difference exists between the collective behavior of densely and sparsely connected networks. In densely connected networks in the absence of input noise, the neurons discharge in a very regular manner. This is because the fluctuations in the synaptic input are small, of the order of O(1/√K), where K is the connectivity. In contrast, in sparse networks with strong excitatory and inhibitory synapses, the firing of the neurons can be highly irregular even if no input noise is present (van Vreeswijk & Sompolinsky, 1996, 1998; Brunel, 2000). In this state, the stochasticity of the firing is of deterministic origin, and the fluctuations in the synaptic input remain large due to the balance between excitation and inhibition. We have shown that in the presence of an excitatory population, a stronger coupling is required to synchronize activity through mutual inhibition. A similar conclusion was reported by Traub, Jefferys, and Whittington (1997), relying on simulations of a detailed conductance-based model. These authors found that excitation can destroy a synchronous state generated by mutual inhibition. We should also remark that the desynchronizing effect of excitation is a characteristic feature of type I neurons.
For type II neurons (such as in the Hodgkin-Huxley model), excitation can have a strongly synchronizing effect, especially at low firing rates (Hansel, Mato, & Meunier, 1993b; Hansel et al., 1993a, 1995). The fact that mutual inhibition can prevent recurrent inhibition from inducing synchrony was noted by Tsodyks, Skaggs, Sejnowski, and McNaughton (1997) in the framework of a two-population rate model. In that work, strengthening the I-I interactions was shown to prevent the two populations from developing synchronous oscillatory rate modulations. Since we have seen that in the limit of infinitely slow synapses the spectral equation becomes equivalent to the spectral equation of a rate model, it is not surprising that in that limit our conclusions agree with the results of Tsodyks et al. (1997). However, mutual inhibition is a powerful way to generate synchrony on a timescale comparable to the timescale of AMPA and GABA_A interactions. Therefore, it was not clear a priori that this mechanism would still hold for these synapses. Our detailed stability analysis of the AS shows that this is indeed the case. A desynchronizing effect of mutual inhibition was also reported by Bush and Sejnowski (1996) in simulations of a two-population model of a column in primary visual cortex. It was found that when the inhibition between the
basket cells is increased, the temporal modulation of the local field potential decreases and the neurons are less bursty. However, in that work, the firing rate of the neurons changed when the mutual inhibition was increased. Another difference with our work is that in the simulations presented by Bush and Sejnowski, the neurons were intrinsic bursters. Therefore, in their case, the desynchronizing effect of mutual inhibition could also have been the consequence of a disruption of the intrinsic bursting when the inhibitory synaptic input became stronger. Excitatory cells in cortex display strong spike adaptation, whereas interneurons do not. To reduce the number of parameters of our two-population model, we assumed that the dynamics of the two populations were identical, and spike adaptation was not introduced in the excitatory dynamics. Recent theoretical studies have investigated the dynamics of excitatory networks of neurons with spike adaptation (van Vreeswijk & Hansel, 2001; Fuhrmann, Markram, & Tsodyks, 2002). In these works, it was demonstrated, mostly analytically, that excitation and spike adaptation can cooperate to create synchronous oscillatory activity and that the frequency of these oscillations depends on the kinetics of the adaptation process. In particular, if the excitatory recurrent feedback and the adaptation are sufficiently strong and slow, the network settles into a state in which the neurons fire synchronous bursts (van Vreeswijk & Hansel, 2001). How an additional population of inhibitory neurons modifies this type of synchronous state was subsequently studied, relying on numerical simulations (van Vreeswijk & Hansel, 2001). The main conclusion was that recurrent inhibition of the excitatory population (the E-I-E feedback loop) works against the emergence of the synchronous bursts but that mutual inhibition (I-I interactions) favors it.
In contrast, we have found in our model that recurrent inhibition favors the synchronous bursting that emerges on instability lines L3 and L4 but that mutual inhibition disrupts it. Keeping in mind that spike-to-spike synchrony can be induced by mutual inhibition, inhibitory interactions can thus have a spectrum of qualitatively different effects in two-population networks.

Appendix A: The Wang-Buzsáki Model

The dynamical equations for the conductance-based neuron model used in this work are (Wang & Buzsáki, 1996):
$$C\,\frac{dV}{dt} = I_{ext} - g_{Na}\, m_\infty^3 h\,(V - V_{Na}) - g_K\, n^4 (V - V_K) - g_l\,(V - V_l) + I_{syn} \qquad (A.1)$$
$$\frac{dh}{dt} = \frac{h_\infty(V) - h}{\tau_h(V)} \qquad (A.2)$$
$$\frac{dn}{dt} = \frac{n_\infty(V) - n}{\tau_n(V)} \qquad (A.3)$$
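Equations A.1 through A.3, together with the gating functions and parameter values listed in the remainder of this appendix, are sufficient to integrate the model directly. The following sketch is ours, not the authors' (a simple Euler scheme; the initial conditions and the test current are illustrative), with m taken at its instantaneous value m_∞(V) as in the WB model:

```python
import math

def wb_rates(V):
    # Gating rate functions of the WB model (equations A.4-A.9).
    am = -0.1 * (V + 35.0) / (math.exp(-0.1 * (V + 35.0)) - 1.0)
    bm = 4.0 * math.exp(-(V + 60.0) / 18.0)
    ah = 0.35 * math.exp(-(V + 58.0) / 20.0)
    bh = 5.0 / (math.exp(-0.1 * (V + 28.0)) + 1.0)
    an = -0.05 * (V + 34.0) / (math.exp(-0.1 * (V + 34.0)) - 1.0)
    bn = 0.625 * math.exp(-(V + 44.0) / 80.0)
    return am, bm, ah, bh, an, bn

def simulate_wb(Iext, T=500.0, dt=0.01):
    # Euler integration of equations A.1-A.3 for T milliseconds;
    # returns the number of spikes (upward crossings of 0 mV).
    C, gNa, VNa, gK, VK, gl, Vl = 1.0, 35.0, 55.0, 9.0, -90.0, 0.1, -65.0
    V, h, n = -64.0, 0.78, 0.09          # illustrative initial conditions
    spikes, above = 0, False
    for _ in range(int(T / dt)):
        am, bm, ah, bh, an, bn = wb_rates(V)
        m_inf = am / (am + bm)           # m is instantaneous in the WB model
        Iion = (gNa * m_inf**3 * h * (V - VNa)
                + gK * n**4 * (V - VK) + gl * (V - Vl))
        V += dt * (Iext - Iion) / C
        h += dt * (ah * (1.0 - h) - bh * h)   # = (h_inf - h) / tau_h
        n += dt * (an * (1.0 - n) - bn * n)   # = (n_inf - n) / tau_n
        if V > 0.0 and not above:
            spikes += 1
        above = V > 0.0
    return spikes

n_spikes = simulate_wb(1.0)   # repetitive firing well above threshold
```

With I_ext well above the current threshold the neuron fires repetitively, while at I_ext = 0 it stays at rest, consistent with the type I (SN) onset discussed in this appendix.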
The parameters g_Na, g_K, and g_l are the maximum conductances per unit surface for the sodium, potassium, and leak currents, respectively, and V_Na, V_K, and V_l are the corresponding reversal potentials. The capacitance per unit surface is denoted by C. The external stimulus on the neuron is represented by an external current I_ext, and the synaptic current due to the interactions with other neurons in the network is I_syn. The functions m_∞(V), h_∞(V), and n_∞(V) and the characteristic times (in milliseconds) τ_m, τ_n, τ_h are given by x_∞(V) = a_x/(a_x + b_x) and τ_x = 1/(a_x + b_x), with x = m, n, h, and

$$a_m = \frac{-0.1(V + 35)}{\exp(-0.1(V + 35)) - 1} \qquad (A.4)$$
$$b_m = 4\,\exp(-(V + 60)/18) \qquad (A.5)$$
$$a_h = 0.35\,\exp(-(V + 58)/20) \qquad (A.6)$$
$$b_h = \frac{5}{\exp(-0.1(V + 28)) + 1}. \qquad (A.7)$$
The other parameters of the sodium current are g_Na = 35 mS/cm² and V_Na = 55 mV. The delayed rectifier current is described in a similar way as in the HH model, with

$$a_n = \frac{-0.05(V + 34)}{\exp(-0.1(V + 34)) - 1} \qquad (A.8)$$
$$b_n = 0.625\,\exp(-(V + 44)/80), \qquad (A.9)$$

and g_K = 9 mS/cm² and V_K = −90 mV. The other parameters of the model are C = 1 μF/cm², g_l = 0.1 mS/cm², and V_l = −65 mV. This neuron model displays an SN bifurcation at the current threshold for repetitive firing, I_c, in the vicinity of which the relationship between the current I injected in the neuron and its firing rate behaves like $f \propto \sqrt{I - I_c}$ for I > I_c. Therefore, the neuron is able to fire at arbitrarily small rates. This is also seen in Figure 1, which represents the frequency-current relation of this model.

Appendix B: Reduction of a Conductance-Based Model to the QIF Model

We consider a type I conductance-based model. The equations of the single-neuron dynamics can be written in compact form:

$$\frac{dX}{dt} = F(X) + G(X). \qquad (B.1)$$
In the case of the WB model, X, F, and G are three-dimensional vectors with

$$X_1 = V \qquad (B.2)$$
$$X_2 = h \qquad (B.3)$$
$$X_3 = n \qquad (B.4)$$
$$F_1(X) = \left(I_c - g_{Na}\, m_\infty^3 h\,(V - V_{Na}) - g_K\, n^4 (V - V_K) - g_l\,(V - V_l)\right)/C \qquad (B.5)$$
$$F_2(X) = (h_\infty(V) - h)/\tau_h(V) \qquad (B.6)$$
$$F_3(X) = (n_\infty(V) - n)/\tau_n(V) \qquad (B.7)$$
$$G_1 = I - I_c \qquad (B.8)$$
and G_2 = G_3 = 0. Following Ermentrout (1996), one can study these dynamics in the vicinity of firing onset. One writes

$$G(X) = \epsilon\, N(X), \qquad (B.9)$$

where ε ≪ 1. For ε < 0, the neuron is not firing: it displays a stable fixed point, $X = \bar{X}(I)$. For ε > 0, the neuron fires action potentials. At ε = 0, a bifurcation occurs. The Taylor expansion of F(X) around $X^* = \bar{X}(I_c)$ is

$$F(X) = F(X^*) + A(X - X^*) + Q(X - X^*, X - X^*) + \cdots \qquad (B.10)$$
where A and Q are, respectively, the Jacobian and the Hessian matrices of F evaluated at X*. Since an SN bifurcation occurs at X*, the matrix A has one zero eigenvalue. The corresponding normalized eigenvector will be denoted by e. Similarly, A^T, the transpose of A, has a zero eigenvalue. We will denote by f the corresponding eigenvector, satisfying f · e = 1. For small ε > 0, the solution X(t) of equation B.1 can be expanded as

$$X(t) = X^* + \sqrt{\epsilon}\, z(t)\, e, \qquad (B.11)$$

where the scalar quantity z satisfies (Ermentrout & Kopell, 1986)

$$\frac{dz}{dt} = \sqrt{\epsilon}\,(g + q z^2), \qquad (B.12)$$
where g = f · N(X*) and q = f · Q(e, e). This equation can easily be solved. Assuming that g and q have the same sign, one finds

$$z(t) = \sqrt{\frac{g}{q}}\, \tan\!\left(\sqrt{gq\epsilon}\,(t - t_0)\right), \qquad (B.13)$$

where t_0 is a constant. The function z diverges whenever the argument of the tangent is an odd multiple of π/2. These divergences correspond to the firing of an action
potential in the original system (Ermentrout & Kopell, 1986). Therefore, in the limit ε → 0, the firing frequency of the neuron vanishes as

$$\nu = B \sqrt{I - I_c}, \qquad (B.14)$$

with

$$B = \frac{\sqrt{gq}}{\pi}. \qquad (B.15)$$
Coefficient B can be calculated by diagonalizing the matrix A. It turns out that, due to the particular structure of this matrix, expressions for e, f, and Q(e, e) can be derived analytically in terms of the gating functions (this fact has apparently escaped attention in previous papers). In particular, for the WB model, one finds that

$$e = K\,(1,\; h_\infty',\; n_\infty') \qquad (B.16)$$
$$f = \frac{1}{LK}\left(1,\; \tau_h \frac{\partial F_1}{\partial h},\; \tau_n \frac{\partial F_1}{\partial n}\right) \qquad (B.17)$$

where

$$K = \frac{1}{\sqrt{1 + (h_\infty')^2 + (n_\infty')^2}} \qquad (B.18)$$
$$L = 1 + \tau_h \frac{\partial F_1}{\partial h}\, h_\infty' + \tau_n \frac{\partial F_1}{\partial n}\, n_\infty'. \qquad (B.19)$$
The derivative of a function f(V) with respect to V is denoted f'. In these expressions, all the functions have to be evaluated for V, n_∞, h_∞ at the fixed point of the dynamics. Finally, we find

$$B = \frac{1}{\pi L}\sqrt{\frac{1}{2}\frac{\partial^2 F_1}{\partial V^2} + \sum_{i=2,3} X_{i\infty}'\,\frac{\partial^2 F_1}{\partial V\,\partial X_i} + \sum_{i=2,3} X_{i\infty}''\,\frac{\partial F_1}{\partial X_i} + \frac{1}{2}(n_\infty')^2\,\frac{\partial^2 F_1}{\partial n^2}}. \qquad (B.20)$$
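The normal-form picture behind equations B.12 through B.15 can be checked numerically. In this sketch (ours; g, q, ε, z_max, and the time step are arbitrary choices), a "spike" is a passage of z through +z_max followed by a reset to −z_max, standing in for the divergences of the tangent solution B.13, and the measured period is compared with the prediction T = π/√(gqε):

```python
import math

def qif_period(g=1.0, q=1.0, eps=0.01, z_max=200.0, dt=1e-4):
    # Integrate dz/dt = sqrt(eps) * (g + q z^2), equation B.12. A "spike"
    # is a passage through +z_max, after which z is reset to -z_max,
    # mimicking the divergence of the tangent solution B.13.
    z, t, t_last, periods = -z_max, 0.0, 0.0, []
    while len(periods) < 3:
        z += dt * math.sqrt(eps) * (g + q * z * z)
        t += dt
        if z >= z_max:
            periods.append(t - t_last)
            t_last, z = t, -z_max
    return sum(periods) / len(periods)

T_pred = math.pi / math.sqrt(1.0 * 1.0 * 0.01)   # pi / sqrt(g q eps)
T_num = qif_period()                             # close to T_pred
```

For g = q = 1 and ε = 0.01, the measured period agrees with π/√(gqε) to within a few percent; the small discrepancy comes from the finite z_max cutoff and the Euler step.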
The Taylor expansion of the dynamics, equation B.10, is valid in the limit of low firing rate, except during an action potential, where the variable z diverges. If I is sufficiently close to threshold, an approximate trajectory for the vector X(t) can be computed from z(t) and the vector e. This approximation is a good description of the behavior of the neuron except during the spikes, which are of short duration. It becomes exact in the limit I → I_c. Therefore, near firing onset, one can replace the full model, equation B.1, by the reduced model, equations B.11 and B.12, where all the parameters can be computed from the full model. In particular, the shape of the f-I curve near threshold is given by equation B.14. When I − I_c is not small, this reduction is no longer exact. However, one can generalize it heuristically as follows. We consider that action potentials are fired whenever z crosses (from below) some threshold, z_t. Then z is immediately reset to a value z_r, and it evolves again according to equation B.12. The two phenomenological parameters, z_t and z_r, cannot be determined by the exact reduction method detailed above. Instead, we estimate them by requiring that the f-I curve of the reduced model fits the f-I curve of the full model as well as possible.

Appendix C: A Measure for the Synchrony Level

To characterize the degree of synchrony in a population of N neurons in the simulations, we measure the temporal fluctuations of the membrane potential averaged over the population (Hansel & Sompolinsky, 1992; Golomb & Rinzel, 1993, 1994; Ginzburg & Sompolinsky, 1994). The quantity
$$\bar{V}(t) = \frac{1}{N} \sum_{i=1}^N V_i(t) \qquad (C.1)$$
is evaluated over time, and the variance $\sigma_V^2 = \langle [\bar{V}(t)]^2 \rangle_t - [\langle \bar{V}(t) \rangle_t]^2$ of its temporal fluctuations is computed, where $\langle \ldots \rangle_t$ denotes time averaging. After normalization of $\sigma_V^2$ to the population average of the single-cell membrane potential variances, $\sigma_{V_i}^2 = \langle [V_i(t)]^2 \rangle_t - [\langle V_i(t) \rangle_t]^2$, one defines χ(N),

$$\chi(N) = \frac{\sigma_V^2}{\frac{1}{N} \sum_{i=1}^N \sigma_{V_i}^2}, \qquad (C.2)$$
which is between 0 and 1. The central limit theorem implies that in the limit N → ∞, χ(N) behaves as

$$\chi(N) = \chi(\infty) + \frac{a}{N} + O\!\left(\frac{1}{N^2}\right), \qquad (C.3)$$
where a > 0 is a constant. In particular, χ(N) = 1 if the system is fully synchronized (i.e., V_i(t) = V̄(t) for all i), and χ(N) = O(1/N) if the state of the system is asynchronous. In the asynchronous state, χ(∞) = 0. More generally, the larger χ(∞) > 0 is, the more synchronized the population. In networks with more than one population of neurons, we define a synchrony measure χ for each population separately.
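The measure defined in this appendix takes only a few lines to compute. In the sketch below (ours; the voltage traces are synthetic stand-ins for simulated membrane potentials), a population of identical traces gives χ = 1 and a population of independent traces gives χ = O(1/N), as stated above:

```python
import numpy as np

def chi(V):
    # Synchrony measure of equation C.2; V has shape (N, T),
    # one membrane-potential trace per row.
    Vbar = V.mean(axis=0)             # population average, equation C.1
    sigma2_V = Vbar.var()             # variance of its temporal fluctuations
    sigma2_i = V.var(axis=1).mean()   # population-averaged single-cell variance
    return sigma2_V / sigma2_i

rng = np.random.default_rng(0)
N, T = 200, 5000
common = np.sin(np.linspace(0.0, 40.0, T))
sync = np.tile(common, (N, 1))        # identical traces: chi = 1
indep = rng.normal(size=(N, T))       # independent traces: chi = O(1/N)
chi_sync, chi_indep = chi(sync), chi(indep)
```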
Acknowledgments

This work was partially supported by ECOS. The work of G.M. was partially supported by grant PICT97 03-00000-00131 from ANPCyT and by Fundación Antorchas. Fruitful discussions with N. Brunel, B. Ermentrout, D. Golomb, H. Sompolinsky, and C. van Vreeswijk are acknowledged. We thank N. Brunel, D. Golomb, and C. Meunier for careful and critical reading of the manuscript.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev., E48, 1483–1490.
Abeles, M. (1990). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Ben-Yishai, R., Hansel, D., & Sompolinsky, H. (1997). Traveling waves and the processing of weakly tuned inputs in a cortical module. J. Comput. Neurosci., 4, 57–77.
Bergman, H., Feingold, A., Nini, A., Raz, A., Slovin, H., Abeles, M., & Vaadia, E. (1998). Physiological aspects of information processing in the basal ganglia of normal and parkinsonian primates. Trends in Neurosciences, 21, 32–38.
Bressloff, P. C., & Coombes, S. (2000). Dynamics of strongly coupled spiking neurons. Neural Comput., 12, 91–129.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comput., 11, 1621–1671.
Brunel, N., & Hansel, D. (2002). How noise affects synchronization in recurrent networks of inhibitory neurons. Unpublished manuscript.
Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns of realistic network models. J. Comput. Neurosci., 3, 91–110.
Buzsáki, G., & Chrobak, J. J. (1995). Temporal structure in spatially organized neuronal ensembles: A role for interneuron networks. Curr. Opin. Neurobiol., 5, 504–510.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica, D118, 343–370.
Chow, C. C., White, J. A., Ritt, J., & Kopell, N. (1998). Frequency control in synchronized networks of inhibitory neurons. J. Comput. Neurosci., 5, 407–420.
Crook, S. M., Ermentrout, G. B., & Bower, J. M. (1998). Spike frequency adaptation affects the synchronization properties of networks of cortical oscillators. Neural Comput., 10, 837–854.
Ermentrout, G. B. (1994). Reduction of conductance-based models with slow synapses to neural nets. Neural Comput., 6, 679–695.
Ermentrout, G. B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comput., 8, 979–1001.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Fuhrmann, G., Markram, H., & Tsodyks, M. (2002). Spike frequency adaptation and neocortical rhythms. J. Neurophysiol., 88, 761–770.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Comput., 12, 43–89.
Gerstner, W., & van Hemmen, J. L. (1993). Coherence and incoherence in a globally coupled ensemble of pulse-emitting units. Phys. Rev. Lett., 71, 312–315.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neuronal networks. Phys. Rev., E50, 3171–3191.
Golomb, D. (1998). Models of neuronal transient synchrony during propagation of activity through neocortical circuitry. J. Neurophysiol., 79, 1–12.
Golomb, D., & Ermentrout, G. B. (1999). Continuous and lurching traveling pulses in neuronal networks with delay and spatially decaying connectivity. Proc. Natl. Acad. Sci. USA, 96, 13480–13485.
Golomb, D., & Hansel, D. (2000). The number of synaptic inputs and the synchrony of large sparse neuronal networks. Neural Comput., 12, 1095–1139.
Golomb, D., Hansel, D., & Mato, G. (2001). Theory of synchrony of neuronal activity. In S. Gielen & F. Moss (Eds.), Handbook of biological physics. Amsterdam: Elsevier.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev., E48, 4810–4814.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica, D72, 259–282.
Gupta, A., Wang, Y., & Markram, H. (2000). Organizing principles for a diversity of GABAergic interneurons and synapses in the neocortex. Science, 287, 273–278.
Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178.
Hansel, D., Mato, G., & Meunier, C. (1993a). Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Europhys. Lett., 23, 367–372.
Hansel, D., Mato, G., & Meunier, C. (1993b). Phase reduction and neural modeling. Concepts in Neuroscience, 4, 192–210.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–337.
Hansel, D., Mato, G., Neltner, L., & Meunier, C. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comput., 10, 467–483.
Hansel, D., & Sompolinsky, H. (1992). Synchronization and computation in a chaotic neural network. Phys. Rev. Lett., 68, 718–721.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Comput. Neurosci., 3, 7–34.
Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks. Cambridge, MA: MIT Press.
Hebb, D. (1949). The organization of behavior. New York: Wiley.
Jefferys, J. G. R., Traub, R. D., & Whittington, M. A. (1996). Neuronal networks for induced "40 Hz" rhythms. Trends Neurosci., 19, 202–208.
Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. New York: Springer-Verlag.
Kuznetsov, Y. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer-Verlag.
Latham, P. E., Richmond, B. J., Nelson, P. G., & Nirenberg, S. (2000). Intrinsic dynamics in neuronal networks. I. Theory. J. Neurophysiol., 83, 808–827.
McCormick, D. A., & Contreras, D. (2001). On the cellular and network bases of epileptic seizures. Annu. Rev. Physiol., 64, 815–846.
Neltner, L., Hansel, D., Mato, G., & Meunier, C. (2000). Synchrony in heterogeneous networks of spiking neurons. Neural Comput., 12, 1607–1641.
Rall, W. (1967). Distinguishing theoretical synaptic potentials computed for different soma-dendritic distributions of synaptic input. J. Neurophysiol., 30, 1138–1168.
Rubin, N., & Sompolinsky, H. (1989). Neural networks with low local firing rates. Europhys. Lett., 10, 465–470.
Shelley, M. J., & Tao, L. (2001). Efficient and accurate time-stepping schemes for integrate-and-fire neuronal networks. J. Comput. Neurosci., 11, 111–119.
Sik, A., Penttonen, M., Ylinen, A., & Buzsáki, G. (1995). Hippocampal CA1 interneurons: An in vivo intracellular labeling study. J. Neurosci., 15, 6651–6665.
Strogatz, S. H. (2000). Nonlinear dynamics and chaos. Cambridge: Perseus.
Tarczy-Hornoch, K., Martin, K. A., Jack, J. J., & Stratford, K. J. (1998). Synaptic interactions between smooth and spiny neurones in layer 4 of cat visual cortex in vitro. J. Physiol., 508, 351–363.
Traub, R. D., Jefferys, J. G., & Whittington, M. A. (1997). Simulation of gamma rhythms in networks of interneurons and pyramidal cells. J. Comput. Neurosci., 4, 141–150.
Traub, R. D., Jefferys, J. G. R., & Whittington, M. A. (1999). Fast oscillations in cortical circuits. Cambridge, MA: MIT Press.
Traub, R. D., Whittington, M. A., Colling, S. B., Buzsáki, G., & Jefferys, J. G. R. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol. (London), 493, 471–484.
Treves, A. (1993). Mean field analysis of neuronal spike dynamics. Network, 4, 259–284.
Tsodyks, M. V., Skaggs, W. E., Sejnowski, T. J., & McNaughton, B. L. (1997). Paradoxical effects of external modulation of inhibitory interneurons. J. Neurosci., 17, 4382–4388.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev., E54, 5522–5537.
van Vreeswijk, C. (2000). Analysis of the asynchronous state in networks of strongly coupled oscillators. Phys. Rev. Lett., 84, 5110–5113.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
van Vreeswijk, C., & Hansel, D. (2001). Patterns of synchrony in neural networks with adaptation. Neural Comput., 13, 959–992.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1372.
Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. J. Neurosci., 19, 9587–9603.
Wang, X.-J., & Buzsáki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
Wang, X.-J., & Rinzel, J. (1992). Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Comput., 4, 84–97.
White, J. A., Chow, C. C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Whittington, M. A., Traub, R. D., Kopell, N., Ermentrout, B., & Buhl, E. H. (2000). Inhibition-based rhythms: Experimental and mathematical observations on network dynamics. Int. J. of Psychophysiology, 38, 315–336.

Received April 25, 2002; accepted June 27, 2002.
NOTE
Communicated by Erkki Oja
A Constrained EM Algorithm for Principal Component Analysis

Jong-Hoon Ahn
[email protected]

Jong-Hoon Oh
[email protected]

Department of Physics, Pohang University of Science and Technology, Pohang, Kyŏngbuk, Korea
We propose a constrained EM algorithm for principal component analysis (PCA) using a coupled probability model derived from single standard factor analysis models with isotropic noise structure. The single probabilistic PCA, especially in the case where there is no noise, can find only a vector set that is a linear superposition of the principal components, and it requires postprocessing, such as the diagonalization of symmetric matrices. By contrast, the proposed algorithm finds the actual principal components, which are sorted in descending order of eigenvalue size and require no additional calculation or postprocessing. The method is easily applied to kernel PCA. It is also shown that the new EM algorithm is derived from a generalized least-squares formulation.

1 Introduction

Principal component analysis (PCA) is a statistical technique that analyzes the covariance structure of multivariate data. It takes a data set in a given orthonormal basis in R^d and finds a new orthonormal basis in R^q with its axes ordered (d > q). This new basis is rotated in such a way that the axes are oriented along the directions in which the data have their highest variance. The principal components are the new coordinates, ordered by their corresponding variances. The reduced data best preserve the variance of the data; thus, they are widely used in data compression, image analysis, visualization, pattern recognition, regression, and time-series prediction. Situations dealing with high-dimensional data with low-dimensional structure require an efficient PCA technique as a dimensionality-reduction method, and a variety of methods for PCA have been introduced in the literature (Jolliffe, 1986). Nevertheless, the existing PCA methods had one common shortcoming: the absence of an associated probability density model. The derivation of PCA from the perspective of density estimation offers a number of important advantages, and the PCA model in the noiseless case provides a very

Neural Computation 15, 57–65 (2003)    © 2002 Massachusetts Institute of Technology
efficient method for extracting a few principal components from a high-dimensional data set (Tipping & Bishop, 1999; Roweis, 1997). Classic eigenvalue analysis is a quick and exact method for low-dimensional data, but it is unsuitable for very high-dimensional data sets: computing a complete covariance matrix is laborious and inefficient when only a few principal components are required. However, the EM algorithm for PCA has a rotational ambiguity, which means that it finds only vector sets that are linear superpositions of the principal components. In this article, we propose a coupled latent variable model for extracting the actual principal components.

2 Probabilistic Models

2.1 The Single Latent Variable Model. The single latent variable model for PCA is referred to as the probabilistic model (Tipping & Bishop, 1999). The latent variable model is linear:

$$t = Wx + \mu + \epsilon, \qquad (2.1)$$
where the q-dimensional column vector x has a unit isotropic distribution N(0, I). The noise is given as a normal distribution, ε ~ N(0, Ψ), with Ψ diagonal; the d × q matrix W contains the factor loadings; and μ is a constant vector whose maximum likelihood estimator is the mean of the given data set {t_k}. Given this formulation, t is also modeled as the normal density N(μ, Ψ + WW^T). It is assumed that the diagonal elements of the covariance matrix Ψ are identical; when they are, the factor loadings eventually become the principal axes. Tipping and Bishop showed that, for a model with an isotropic noise structure, the maximum likelihood estimator W_ML is a matrix whose columns are scaled and rotated principal eigenvectors of the sample covariance matrix S,

$$W_{ML} = U_q (\Lambda_q - \sigma^2 I)^{1/2} R, \qquad (2.2)$$
where the q column vectors in U_q are eigenvectors of S, with corresponding eigenvalues in the diagonal matrix Λ_q, and R is an arbitrary q × q orthogonal matrix. The true principal axes can be recovered when the columns of R^T are equal to the eigenvectors of the matrix W^T W. Although this postprocessing costs little, it may be unwanted from a theoretical viewpoint. In order to remove the rotational ambiguity in a probabilistic PCA model, we consider a coupled latent variable model constructed from single models under some constraints.

2.2 A Coupled Latent Variable Model. It was shown that the q-dimensional single latent variable model finds the q-dimensional principal subspace.
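The rotational ambiguity of equation 2.2, and the postprocessing that removes it, can be made concrete with a small numerical sketch (ours; the eigenstructure, noise level, and rotation are synthetic). Any orthogonal R gives a valid maximum likelihood solution, but diagonalizing W^T W recovers the principal axes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q, sigma2 = 6, 2, 0.5
U, _ = np.linalg.qr(rng.normal(size=(d, d)))   # synthetic orthonormal basis
Uq = U[:, :q]                                  # "true" principal axes
lam = np.array([5.0, 3.0])                     # leading eigenvalues of S
R, _ = np.linalg.qr(rng.normal(size=(q, q)))   # arbitrary orthogonal rotation

W = Uq @ np.diag(np.sqrt(lam - sigma2)) @ R    # equation 2.2: columns are mixtures

# Postprocessing: diagonalizing W^T W = R^T (Lambda_q - sigma^2 I) R
# undoes the rotation and recovers the ordered principal axes.
evals, evecs = np.linalg.eigh(W.T @ W)
order = np.argsort(evals)[::-1]                # descending eigenvalue order
axes = W @ evecs[:, order]
axes /= np.linalg.norm(axes, axis=0)           # recovered axes, up to sign
```

Up to sign, the recovered columns coincide with the true principal axes; this diagonalization is exactly the postprocessing step that the constrained algorithm is designed to avoid.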
Now let us consider a (q − 1)-dimensional single model prior to the q-dimensional model. This certainly recovers the (q − 1)-dimensional principal subspace. Assuming that the columns of U_q are sorted in descending order of eigenvalue size, we can constrain the q × q rotational matrix R by R_qi = 0 for i = 1, ..., q − 1. This is because a (q − 1)-dimensional principal subspace should be spanned by the set of the first (q − 1) principal axes. Because the matrix R is orthogonal, the condition leads to R_iq = 0 and R_qq = 1. Thus, we can obtain an exact qth principal vector through the coupled model constructed from two single models. We can then deduce that a coupled model constructed from single latent variable models with full dimensionality should recover the exact principal components on convergence. The ith model in the coupled model is given by

$$y_i = W I_i x + \mu + \epsilon_i, \qquad (2.3)$$
where the q × q diagonal matrix I_i contains the factor selection, with I_i(j, j) = 1 for j = 1, ..., i and I_i(j, j) = 0 for j = i + 1, ..., q. For the ith isotropic noise model, ε_i ~ N(0, σ_i² I), the coupled model implies a probability distribution over multiple y spaces for x, given by

$$p(y_1 = t, \ldots, y_q = t \mid x) = \prod_{i=1}^q p(t \mid x; i) \qquad (2.4)$$
$$= \prod_{i=1}^q (2\pi\sigma_i^2)^{-d/2} \exp\!\left(-\frac{1}{2\sigma_i^2}\,\|t - W I_i x - \mu\|^2\right), \qquad (2.5)$$
where p(t | x; i) is the conditional density for the ith single model. Note that $\int p(t, \ldots, t \mid x)\, dt \neq 1$. With the latent variables x defined by

$$p(x) = (2\pi)^{-q/2} \exp\!\left(-\frac{1}{2} x^T x\right), \qquad (2.6)$$
we obtain the marginal distribution over the coupled y spaces in the form

$$p(t, \ldots, t) = \int p(t, \ldots, t \mid x)\, p(x)\, dx = \prod_{i=1}^q (2\pi\sigma_i^2)^{-d/2}\, |M|^{-1/2} \exp\!\left(-\frac{1}{2}(t - \mu)^T C^{-1} (t - \mu)\right), \qquad (2.7)$$
where the model covariance is

$$C = \left[\sum_{i=1}^q \frac{I}{\sigma_i^2} - W Q M^{-1} Q^T W^T\right]^{-1}, \qquad (2.8)$$
where $M = I + \sum_{i=1}^q I_i^T W^T W I_i / \sigma_i^2$ and $Q = \sum_{i=1}^q I_i / \sigma_i^2$. Using Bayes' rule, the posterior distribution of the latent variables x may be calculated:

$$p(x \mid t, \ldots, t) = (2\pi)^{-q/2} |M|^{1/2} \exp\!\left(-\frac{1}{2}\,\{x - M^{-1} Q^T W^T (t - \mu)\}^T M \{x - M^{-1} Q^T W^T (t - \mu)\}\right), \qquad (2.9)$$

where the posterior covariance matrix is given by

$$M = I + \sum_{i=1}^q \frac{I_i W^T W I_i}{\sigma_i^2}. \qquad (2.10)$$
M is a q × q matrix, while C is a d × d matrix.

3 Derivation of the Constrained EM Algorithm

An expectation-maximization (EM) algorithm for PCA finds the principal subspace as the noise level σ_q² becomes infinitely small (Roweis, 1997). The coupled model for the new EM algorithm likewise becomes valid for infinitely small noises σ_i². Further, we need to consider priorities among the single models that are associated with the coupled model. By the constrained algorithm, we mean that the ith single model should be treated prior to the (i + 1)th single model so that the columns of U_q are sorted in descending order of eigenvalue size. The priorities for such models are decided by the limits

$$\frac{\sigma_i^2}{\sigma_{i+1}^2} \rightarrow 0. \qquad (3.1)$$
These conditions certainly give the constrained EM algorithm for PCA.

3.1 EM Algorithm. As the latent variable x is considered missing data in the EM approach, its expectation can be computed from the posterior density, equation 2.9, and the noise condition, equation 3.1:

$$E[x \mid t] = \int x\, p(x \mid t, \ldots, t)\, dx = M^{-1} Q^T W^T (t - \mu) \;\rightarrow\; \{L(W^T W)\}^{-1} W^T (t - \mu), \qquad (3.2)$$

where the elementwise operator L is defined as L(a_ij) = a_ij for i ≥ j and L(a_ij) = 0 for i < j. The E-step may also be represented in matrix form,

$$H = \{L(W^T W)\}^{-1} W^T V, \qquad (3.3)$$
where H is a q × n matrix whose columns are the vectors {x_k}, and V is a d × n matrix whose columns are the vectors {t_k − μ}. In the M-step, the new parameter value W_new that maximizes the expected value of the complete-data log likelihood at each iteration is determined. The part that includes W in the complete-data log likelihood $L_C = \sum_{k=1}^n \ln[p(t_k, \ldots, t_k \mid x_k)\, p(x_k)]$ is given by

$$L_W = \frac{1}{2} \operatorname{tr}\!\left[\,2\, V H^T Q\, W^T - W \sum_{i=1}^q \left(I_i\, \frac{H H^T}{\sigma_i^2}\, I_i\right) W^T\right]. \qquad (3.4)$$
If we consider condition 3.1, this is maximized by W_new as follows:

$$W_{new} = V H^T Q \left[\sum_{i=1}^q I_i\, \frac{H H^T}{\sigma_i^2}\, I_i\right]^{-1} \;\rightarrow\; V H^T \{U(H H^T)\}^{-1}, \qquad (3.5)$$
where the elementwise operator U is defined as U(a_ij) = a_ij for i ≤ j and U(a_ij) = 0 for i > j.

4 Discussion

We treat the coupled latent variable model constructed from single probabilistic models in order to remove the rotational ambiguity involved in the single probabilistic model. The constrained EM algorithm derived from it gradually recovers the actual principal components, and it is naturally applied to kernel PCA,
C
new
T
D H fU(HH
T )g ¡1
,
(4.1) (4.2)
which are derived in appendix A. This article proposes using the coupled latent variable model to perform PCA. The source vector x is considered a hidden variable that is linearly related to the observed vectors y_i. Different isotropic noises ε_i are added to the observed data vectors under different model conditions I_i, but all the observed vectors are modeled by the same vector t; this is what we mean by the coupled model. The EM algorithm is employed to fit linear combinations of these source distributions to the observed data. The parameter to be estimated is the factor loading matrix W (Dempster, Laird, & Rubin, 1977). The forms of the iteration rules are similar to those of the previous EM algorithm, where simple pseudo-inverse solutions are used for the least-squares formulation (Roweis, 1997). But it is clearly distinguished by the
lower and upper operators that leave the matrix elements unchanged or set them to zero.¹ Thus, there is virtually no additional or postprocessing cost in using the method. The cost per iteration is the same for both algorithms, whether measured by elapsed time or by total floating-point operations. Ignoring constants, they are both O(n × d × q), where n is the number of data points, d the original dimensionality, and q the reduced dimensionality. The postprocessing cost for the EM algorithm to find the orthogonal vectors is O(n × q² + d × q²), which is smaller than the cost of one iteration. The previous EM algorithm for PCA minimizes the squared error function monotonically. The constrained EM algorithm, however, shows a slightly different convergence behavior. It is also derived from the solution for minimizing a generalized squared error function, given by a weighted squared error summed over the required dimensionality with positive constants a_i:
J = Σ_{i=1}^{q} a_i ||V − W I_i H||².   (4.3)
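Equation 4.3 is straightforward to evaluate numerically. The following sketch is our own illustration, not code from the article; it assumes, as the structure of the matrix A in appendix B implies, that each model condition I_i is the q × q diagonal matrix with i leading ones.

```python
import numpy as np

def generalized_error(V, W, H, a):
    """Weighted squared error J = sum_i a_i ||V - W I_i H||^2 (equation 4.3).

    V : (d, n) data matrix, W : (d, q) factor loadings, H : (q, n) sources,
    a : (q,) positive weights. I_i is assumed (our reading) to be
    diag(1, ..., 1, 0, ..., 0) with i leading ones.
    """
    q = W.shape[1]
    J = 0.0
    for i in range(1, q + 1):
        Ii = np.diag(np.r_[np.ones(i), np.zeros(q - i)])  # model condition I_i
        R = V - W @ Ii @ H                                # residual of the i-th model
        J += a[i - 1] * np.sum(R ** 2)
    return J
```

With rapidly decaying weights a_{i+1}/a_i → 0 (condition 3.1), the first terms dominate J, which is what forces the leading components to be recovered in order.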
Using the generalized least-squares formulation and taking condition 3.1 as a_{i+1}/a_i → 0, we can obtain the same iteration rules 3.3 and 3.5, as derived in appendix B. Thus, the iterative rules B.5 and B.6 monotonically minimize the generalized squared error function 4.3. Although the proposed method for the squared cost function is generally slower than the EM algorithm while converging, the two methods finally recover the principal subspace in the same number of iterations. The number of iterations required to recover the orthogonal axes depends on the ratios of eigenvalues λ_i/λ_{i+1}.

Appendix A: Constrained EM Algorithm for Kernel PCA

The constrained EM algorithm for PCA can be naturally applied to kernel PCA (Rosipal & Girolami, 2001). The basic idea is that the factor loadings W can be expressed as

W = V_Φ C,   (A.1)

where V_Φ is the nonlinearly mapped data matrix for the observed data V ∈ R^{d×n}, and C ∈ R^{n×q} contains coefficients.

¹ The explicit forms of the operators are related to our assumption that the columns of U_q are sorted in descending order of eigenvalue size. Thus, we can obtain well-ordered principal components.

A Constrained Algorithm for PCA

Using this expression and the kernel
matrix K, we can write the matrices W^T W and W^T V_Φ as

W^T W = C^T V_Φ^T V_Φ C = C^T K C,   (A.2)

W^T V_Φ = C^T V_Φ^T V_Φ = C^T K.   (A.3)

Thus, from the E-step, equation 3.3, the matrix H ∈ R^{q×n} can be updated as

H = {L(C^T K C)}^{-1} C^T K.   (A.4)

Considering expression A.1, we can write the M-step, equation 3.5, in the matrix form V_Φ C_new = V_Φ H^T {U(H H^T)}^{-1}. The M-step then follows:

C_new = H^T {U(H H^T)}^{-1}.   (A.5)
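The kernel-space updates A.4 and A.5 can be iterated directly on a Gram matrix. The sketch below is a minimal illustration under our own assumptions (random initialization, a linear kernel so the result can be compared with ordinary PCA), not the authors' code; the L and U operators are realized with np.tril and np.triu.

```python
import numpy as np

def constrained_em_kernel_pca(K, q, n_iter=200, seed=0):
    """Iterate H = {L(C'KC)}^{-1} C'K and C <- H' {U(HH')}^{-1} (equations A.4, A.5).

    K : (n, n) kernel (Gram) matrix; q : number of components.
    L keeps the lower triangle (incl. diagonal); U keeps the upper triangle.
    """
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    C = rng.normal(size=(n, q))
    for _ in range(n_iter):
        H = np.linalg.solve(np.tril(C.T @ K @ C), C.T @ K)   # E-step (A.4)
        C = H.T @ np.linalg.inv(np.triu(H @ H.T))            # M-step (A.5)
    return C, H

# Linear kernel: the recovered subspace should match the leading PCA subspace.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) @ np.diag([3.0, 1.0, 0.3]) @ rng.normal(size=(3, 5))
X -= X.mean(axis=0)          # centered data, samples in rows (so V = X.T)
K = X @ X.T                  # linear Gram matrix K = V_phi^T V_phi
C, H = constrained_em_kernel_pca(K, q=2)
```

For a linear kernel, the rows of H span the same subspace as the top-q eigenvectors of K, which is how the sketch can be checked against ordinary PCA.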
Appendix B: The Generalized Squared Error Formulation

The treatment of the coupled model can be replaced by introducing a generalized squared error function. Our problem is to seek the weight matrix W and the hidden matrix H that minimize the error 4.3. The problem of minimizing the single squared error is a classic one that can be solved by a gradient search procedure. Similarly, the generalized squared error problem is solved by forming the gradients,

∇_H J = −Σ_{i=1}^{q} 2 a_i (W I_i)^T (V − W I_i H),   (B.1)

∇_W J = −Σ_{i=1}^{q} 2 a_i (V − W I_i H)(I_i H)^T,   (B.2)

and setting them equal to zero. This yields the necessary conditions

{Σ_{i=1}^{q} a_i (W I_i)^T} V = {Σ_{i=1}^{q} a_i (W I_i)^T (W I_i)} H,   (B.3)

V {Σ_{i=1}^{q} a_i (I_i H)^T} = W {Σ_{i=1}^{q} a_i (I_i H)(I_i H)^T}.   (B.4)
Because Σ_{i=1}^{q} a_i (W I_i)^T (W I_i) and Σ_{i=1}^{q} a_i (I_i H)(I_i H)^T are nonsingular, we can solve for H and W uniquely as

H = {Σ_{i=1}^{q} a_i (W I_i)^T (W I_i)}^{-1} (Σ_{i=1}^{q} a_i I_i) W^T V
  = {(W^T W) ⊙ A}^{-1} (Σ_{i=1}^{q} a_i I_i) W^T V
  = [(W^T W) ⊙ {(Σ_{i=1}^{q} a_i I_i)^{-1} A}]^{-1} W^T V
  = {(W^T W) ⊙ D}^{-1} W^T V,   (B.5)

and

W = V H^T {Σ_{i=1}^{q} a_i (I_i H)(I_i H)^T}^{-1}
  = V H^T (Σ_{i=1}^{q} a_i I_i) {(H H^T) ⊙ A}^{-1}
  = V H^T [(H H^T) ⊙ {A (Σ_{i=1}^{q} a_i I_i)^{-1}}]^{-1}
  = V H^T {(H H^T) ⊙ D^T}^{-1},   (B.6)

where ⊙ denotes the Hadamard product and A is the q × q matrix with entries A_{jk} = Σ_{i=max(j,k)}^{q} a_i,

A = \begin{pmatrix}
\sum_{i=1}^{q} a_i & \sum_{i=2}^{q} a_i & \sum_{i=3}^{q} a_i & \cdots & a_q \\
\sum_{i=2}^{q} a_i & \sum_{i=2}^{q} a_i & \sum_{i=3}^{q} a_i & \cdots & a_q \\
\sum_{i=3}^{q} a_i & \sum_{i=3}^{q} a_i & \sum_{i=3}^{q} a_i & \cdots & a_q \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_q & a_q & a_q & \cdots & a_q
\end{pmatrix},   (B.7)

and D is obtained from A by dividing each row by its diagonal entry, D_{jk} = A_{jk} / Σ_{i=j}^{q} a_i:

D = \begin{pmatrix}
1 & \frac{\sum_{i=2}^{q} a_i}{\sum_{i=1}^{q} a_i} & \frac{\sum_{i=3}^{q} a_i}{\sum_{i=1}^{q} a_i} & \cdots & \frac{a_q}{\sum_{i=1}^{q} a_i} \\
1 & 1 & \frac{\sum_{i=3}^{q} a_i}{\sum_{i=2}^{q} a_i} & \cdots & \frac{a_q}{\sum_{i=2}^{q} a_i} \\
\vdots & \vdots & \ddots & & \vdots \\
1 & 1 & 1 & \cdots & 1
\end{pmatrix}.   (B.8)
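The matrices A and D, and the step from the second to the fourth line of B.5, can be checked numerically. The following verification is ours, again taking I_i as the diagonal matrix with i leading ones, so that Σ_i a_i I_i is the diagonal matrix of tail sums Σ_{i≥j} a_i.

```python
import numpy as np

def build_A_D(a):
    """A_{jk} = sum_{i >= max(j,k)} a_i (B.7);
    D = A with each row divided by its diagonal entry (B.8)."""
    q = len(a)
    tail = np.cumsum(a[::-1])[::-1]        # tail[j] = a_j + ... + a_q (0-based)
    A = np.empty((q, q))
    for j in range(q):
        for k in range(q):
            A[j, k] = tail[max(j, k)]
    D = A / A.diagonal()[:, None]
    return A, D

rng = np.random.default_rng(0)
a = np.array([8.0, 4.0, 2.0, 1.0])         # decaying weights a_i
W = rng.normal(size=(6, 4))
A, D = build_A_D(a)
S = np.diag(np.cumsum(a[::-1])[::-1])      # S = sum_i a_i I_i (diagonal of tail sums)

# Identity behind B.5: {(W'W) o A}^{-1} S = {(W'W) o D}^{-1}, with o the Hadamard product
lhs = np.linalg.inv((W.T @ W) * A) @ S
rhs = np.linalg.inv((W.T @ W) * D)
```

The identity holds because left-multiplying a Hadamard product by the diagonal matrix S^{-1} is the same as row-scaling A, which is exactly the definition of D.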
These iterative rules can minimize the generalized squared error most efficiently. If we take condition 3.1 as a_{i+1}/a_i → 0, all upper elements of the D matrix approach zero, and the iteration rules become identical to equations 3.3 and 3.5.

Acknowledgments

This research was funded by the Brain Science and Engineering Research Program, Ministry of Science and Technology, in Korea.

References

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Jolliffe, I. T. (1986). Principal component analysis. New York: Springer-Verlag.
Rosipal, R., & Girolami, M. (2001). An expectation-maximization approach to nonlinear component analysis. Neural Computation, 13, 505–510.
Roweis, S. T. (1997). EM algorithms for PCA and SPCA. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 626–632). Cambridge, MA: MIT Press.
Tipping, M. E., & Bishop, C. M. (1999). Mixtures of probabilistic principal component analyzers. Neural Computation, 11, 443–482.

Received July 26, 2001; accepted June 20, 2002.
LETTER
Communicated by Emilio Salinas

Higher-Order Statistics of Input Ensembles and the Response of Simple Model Neurons

Alexandre Kuhn
[email protected]

Ad Aertsen
[email protected]

Stefan Rotter
[email protected]

Neurobiology and Biophysics, Biology III, Albert-Ludwigs-University, D-79104 Freiburg, Germany
Pairwise correlations among spike trains recorded in vivo have been frequently reported. It has been argued that correlated activity could play an important role in the brain, because it efficiently modulates the response of a postsynaptic neuron. We show here that a neuron's output firing rate critically depends on the higher-order statistics of the input ensemble. We constructed two statistical models of populations of spiking neurons that fired with the same rates and had identical pairwise correlations, but differed with regard to the higher-order interactions within the population. The first ensemble was characterized by clusters of spikes synchronized over the whole population. In the second ensemble, the size of spike clusters was, on average, proportional to the pairwise correlation. For both input models, we assessed the role of the size of the population, the firing rate, and the pairwise correlation on the output rate of two simple model neurons: a continuous firing-rate model and a conductance-based leaky integrate-and-fire neuron. An approximation to the mean output rate of the firing-rate neuron could be derived analytically with the help of shot noise theory. Interestingly, the essential features of the mean response of the two neuron models were similar. For both neuron models, the three input parameters played radically different roles with respect to the postsynaptic firing rate, depending on the interaction structure of the input. For instance, in the case of an ensemble with small and distributed spike clusters, the output firing rate was efficiently controlled by the size of the input population. In addition to the interaction structure, the ratio of inhibition to excitation was found to strongly modulate the effect of correlation on the postsynaptic firing rate.
Neural Computation 15, 67–101 (2003)   © 2002 Massachusetts Institute of Technology
1 Introduction
The brain consists of a huge number of neurons that are heavily interconnected. These cells interact by producing rapid depolarizations of their membrane, the action potentials (spikes), which are propagated along the axon and excite or inhibit postsynaptic neurons by inducing postsynaptic potentials (PSPs). The integration of PSPs can bring the postsynaptic neurons, in turn, to fire a spike. Such interactions within the neuronal network lead to temporal correlations between spike trains of different neurons. Pairwise correlations of spike trains are routinely observed in virtually all regions of the brain. They have been related to the representation of sensory information (DeCharms & Merzenich, 1996) and motor behavior (Baker, Spinks, Jackson, & Lemon, 2001), but also to more abstract processing levels (Steinmetz et al., 2000; Salinas & Sejnowski, 2001). The effect of correlated synaptic input on the neuronal response has been investigated in several recent theoretical and numerical studies (Feng & Brown, 2000; Salinas & Sejnowski, 2000; Svirskis & Rinzel, 2000; Feng & Zhang, 2001; Stroeve & Gielen, 2001). They showed that temporal correlation can have a tremendous effect on the rate and variability of the output of model neurons. Pairwise correlations are generally used to characterize the statistical dependence of parallel spike trains. However, as pointed out, for example, by Bohte, Spekreijse, and Roelfsema (2000), an ensemble of spike trains is not completely described by pairwise interactions. Thus, the effect of pairwise correlation cannot be studied without also taking into account correlations among three or more spike trains (i.e., higher-order interactions). This point is illustrated by the following example. Under the general assumption that only the timing of presynaptic spikes and not their form is important, we can model series of incoming spikes as point processes (Tuckwell, 1988; Cox & Isham, 1980).
A presynaptic ensemble of N spike trains can then be thought of as N parallel point processes. We define an interaction as the occurrence of simultaneous spikes in different spike trains. The interactions can be classified as second, third, . . . , Nth-order, corresponding to clusters of 2, 3, . . . , N simultaneous spikes. For simplicity, we assume that the ensemble is homogeneous, that is, the firing rates of all spike trains are identical and, similarly, the rates of second, third, . . . , Nth-order interactions are identical for any subset of 2, 3, . . . , N spike trains picked from the ensemble. We now consider a homogeneous ensemble of N parallel spike trains where all spikes are associated with pairwise (second-order) interactions: there are no interactions of order 3 or higher, nor are there isolated spikes, that is, spikes in one spike train without a counterpart in another one. If k is the rate of each second-order interaction, the discharge rate r of an individual spike train is r = (N − 1)k. The correlation coefficient between two point processes in a homogeneous system is the rate of interactions with simultaneous spikes on both processes, divided by the
discharge rate of an individual process (see appendix A). Thus, in the case of an ensemble with only second-order interactions, the correlation coefficient is c = 1/(N − 1). It is approximately inversely proportional to the number of presynaptic inputs. For N = 100 correlated spike trains, the correlation coefficient will be about 10^{-2}. Stronger pairwise correlations can be obtained only by introducing higher-order interactions. A single third-order interaction (or triplet), for instance, contributes to the pairwise correlations in three pairs of spike trains, while introducing only one spike per process. The same second-order interaction is obtained by having one pair of spikes in each of the three pairs of spike trains. This results in two spikes in each single process, that is, twice as many as in the triplet case, which results in a halved correlation coefficient. In a homogeneous high-dimensional system, certain combinations of rates and pairwise correlations thus imply the existence of higher-order interactions. Biological neural networks are typically not homogeneous (in the sense defined above), but very little is known about their actual interaction structure. The assessment of the statistical significance of simultaneous events across more than two neurons is a nontrivial problem. Different statistical methods have been proposed (Martignon, von Hasseln, Grün, Aertsen, & Palm, 1995; Martignon et al., 2000; Gütig, Aertsen, & Rotter, 2002). Here, we report on a study of the effects of higher-order interactions on the response of model neurons. Specifically, we constructed two different spike train ensembles and compared their effects on the response of a firing-rate (FR) neuron model and a conductance-based leaky integrate-and-fire (IF) model. For both neuron models, we assessed the effects of the pairwise correlation in the input on the output firing rate of the neuron. Analytical results were derived for the FR neuron.
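This counting argument can be checked by simulation. The sketch below (our own illustration, not from the article) realizes an ensemble whose spikes all come from second-order interactions — one independent Poisson process per pair of trains, each event producing a synchronous spike in both trains of the pair — and estimates the rate and the count correlation coefficient from binned counts; they should come out near (N − 1)k and 1/(N − 1), respectively.

```python
import numpy as np

def pairwise_only_ensemble(N, k, T, dt, seed=0):
    """Binned spike counts for N trains whose spikes all come from
    second-order interactions: one Poisson process of rate k per pair,
    each event placing a synchronous spike in both trains of the pair."""
    rng = np.random.default_rng(seed)
    nbins = int(T / dt)
    counts = np.zeros((N, nbins), dtype=np.int64)
    for i in range(N):
        for j in range(i + 1, N):
            n_ev = rng.poisson(k * T)                 # events of this pair's process
            bins = rng.integers(0, nbins, size=n_ev)  # event times, binned
            np.add.at(counts[i], bins, 1)
            np.add.at(counts[j], bins, 1)
    return counts

N, k, T, dt = 10, 2.0, 500.0, 0.001
counts = pairwise_only_ensemble(N, k, T, dt)
rate = counts.sum(axis=1).mean() / T                  # near (N - 1) * k = 18 s^-1
c_est = np.corrcoef(counts[0], counts[1])[0, 1]       # near 1 / (N - 1) = 0.111
```

The bin width stands in for the "simultaneity" of an interaction; as long as it is much smaller than 1/r, the binned correlation coefficient matches the point-process definition.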
Preliminary results have been presented elsewhere (Kuhn, Rotter, & Aertsen, 2002).

2 Two Models for Correlated Neuronal Populations
The realization x(t) of a stochastic point process consists of a sequence of points in time . . . , t_k, t_{k+1}, . . . that can be represented by a pulse train,

x(t) = Σ_k δ(t − t_k),

where δ(t) is the Dirac delta function. To reduce the number of parameters defining the input ensemble, we model spike trains as parallel Poisson point processes with homogeneous interaction structures. Moreover, we restrict our treatment to instantaneous statistical dependence, that is, the interaction among spike trains originates from simultaneous spikes only.

2.1 Single Interaction Process Model. The single interaction process (SIP) model was motivated by results of Deppisch et al. (1993) and Bohte et al. (2000). They simulated a network of integrate-and-fire neurons and
observed that it could show periods of random-looking, unsynchronized neuronal activity, but also population bursts, with almost all neurons firing synchronously. The SIP model is defined as follows. First, a single realization w_u(t) of a Poisson process with stationary rate a is generated. Each process x_i(t) in a set of N spike trains is then defined by

x_i(t) = w_u(t) + w_i(t)   (i = 1, 2, . . . , N),

where w_i(t) is an independent realization of a Poisson process with rate b. The superposition of independent Poisson processes is again a Poisson process with a rate equal to the sum of the individual rates (see Cox & Isham, 1980). Thus, the rate of x_i(t) is r = a + b. Figure 1A shows the realization w_u(t) (top) that is copied into each individual spike train x_i(t) (bottom), introducing a statistical dependence between them. We characterized the statistical dependence between two spike trains by the count correlation coefficient, c = a/(a + b). Note that the ensemble model is completely constrained by the rate r and the pairwise count correlation coefficient c. The relations a = rc and b = r(1 − c) are used for the numerical simulation of a set of spike trains with given rate and given pairwise correlation. In the literature, the pairwise dependence is frequently expressed by the cross-correlation. The cross-correlation function R(Δt) of two spike trains x_i(t) and x_j(t) is

R(Δt) = a δ(Δt) + (a + b)².

A derivation of the cross-correlation function and the count correlation coefficient is presented in appendix A. When using the SIP model as input to a point neuron, all individual spike trains x_i(t) are summed. The compound process can, in turn, be decomposed into two independent processes: a Poisson process with rate a, in which each point accounts for N simultaneous spikes, and a Poisson process of rate Nb, resulting from the sum of all independent realizations w_i(t). The dependence between spike trains in the SIP model is thus described by a single interaction process (of order N). We can write

Σ_i x_i(t) = Σ_i [w_u(t) + w_i(t)] = N w_u(t) + Σ_i w_i(t).

The decomposition into two independent processes greatly facilitates the analysis of the neuronal response. This model of correlated spike trains also appears in Feng and Brown (2000) and Stroeve and Gielen (2001).
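The SIP construction translates directly into a few lines of code. The sketch below (ours, bin-based rather than event-based) draws the common realization w_u at rate a = rc once and adds an independent component at rate b = r(1 − c) to each train, then recovers r and c from binned counts.

```python
import numpy as np

def sip_ensemble(N, r, c, T, dt, seed=0):
    """Binned SIP ensemble: x_i = w_u + w_i with w_u ~ Poisson(rate a = r*c)
    shared by all trains and w_i ~ Poisson(rate b = r*(1 - c)) independent."""
    rng = np.random.default_rng(seed)
    nbins = int(T / dt)
    a, b = r * c, r * (1.0 - c)
    w_u = rng.poisson(a * dt, size=nbins)                 # common component
    counts = w_u + rng.poisson(b * dt, size=(N, nbins))   # + independent components
    return counts

N, r, c, T, dt = 50, 20.0, 0.1, 100.0, 0.001
counts = sip_ensemble(N, r, c, T, dt)
rate = counts.mean() / dt                                 # near r = a + b
rho = np.corrcoef(counts[0], counts[1])[0, 1]             # near c = a / (a + b)
```

Because the per-bin counts of two trains share only the Poisson variable w_u, their covariance is a·dt and their variance r·dt, so the binned correlation reproduces c = a/(a + b) exactly in expectation.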
[Figure 1 appears here; panels A–D are described in the caption.]

Figure 1: Two models of spike train ensembles with identical rates and pairwise correlations but different higher-order statistics. (A) The single interaction process (SIP) model. Each spike train x_i(t) of the ensemble is composed of a fixed realization w_u(t) of a Poisson process with rate a and an independent realization of a Poisson process with rate b. (B) The multiple interaction process (MIP) model. Each spike train x_i(t) is realized by randomly deleting spikes from a fixed realization w_g(t). The generating process is a realization of a Poisson process with rate γ. (C) Fifty spike trains with a rate of 20 spikes per second and pairwise correlation coefficient 0.1, generated according to the SIP model. Their summed activity is shown below (bin size 5 ms). (D) Fifty spike trains with the same rate and pairwise correlation as in C, but generated according to the MIP model.
2.2 Multiple Interaction Process Model. In the second model of a spiking population, the interaction structure is composed of spike clusters involving only subsets of the spike trains. The multiple interaction process (MIP) model is constructed as follows. Each individual spike train is a thinned version (Cox & Isham, 1980) of one and the same realization w_g(t) of a Poisson process. Individual spike trains x_i(t) are thus produced by random deletions of pulses in w_g(t). The deletions are independent. The thinning process is repeated for all spike trains in the ensemble. If the rate of w_g(t) is γ and the probability of deletion is (1 − π), then the rate of each individual spike train is

r = γπ.

Figure 1B shows the generating process w_g(t) (top) and the resulting correlated spike trains x_1(t), . . . , x_N(t) (bottom). Note that the pulses in each individual spike train are Poisson (Snyder & Miller, 1991). The count correlation coefficient is

c = π.

Again, the rate r and correlation coefficient c together completely constrain the model, and the relations π = c and γ = r/c are used for the numerical simulation of the ensemble of spike trains. The cross-correlation function of any two spike trains x_i(t) and x_j(t) is

R(Δt) = γπ² δ(Δt) + (γπ)².

The derivation of the cross-correlation function and the count correlation coefficient is again given in appendix A. The sum of the N individual spike trains results in a pulse train similar to the generating realization w_g(t), but with each pulse scaled differently. The scalar amplitude n represents the number of (synchronous) spikes at that point in time. As the deletion of a certain pulse in w_g(t) occurs with a fixed probability and is independent for each of the N individual spike trains, the number n follows a binomial distribution:

n ~ B(n; π, N) = \binom{N}{n} π^n (1 − π)^{N−n}.

The scalar n is independent for each point in the compound process. The sum process can thus be decomposed into N + 1 independent Poisson processes (Snyder & Miller, 1991), each process containing only points with identical multiplicity. Such a process is called an interaction process of order n, with n ∈ {0, 1, . . . , N}. Evidently, the rate of the interaction process of order n is γ B(n; π, N). Figures 1C and 1D show realizations of an ensemble of 50 spike trains, generated according to the SIP and the MIP model, respectively. In both
cases, the individual input rate r and the count correlation coefficient c were identical (r = 20 spikes per second, c = 0.1). The compound activity of all spike trains is shown below the spike rasters. Although any single spike train, as well as any pair of spike trains picked from the two ensembles, would be statistically indistinguishable, the population activity is clearly different for the two models.

3 Results
The effects of both ensemble models were tested on two neuron models. To reduce the complexity of the input, we first focused on a simplified setting. We assume the input population to be composed of two subpopulations, one excitatory and one inhibitory, the excitatory and inhibitory populations having the same size and rate. The excitatory presynaptic spike trains were correlated, generated according to the SIP or the MIP model, whereas the inhibitory presynaptic spike trains were modeled as independent Poisson processes. We assessed the influence of the input parameters, and in particular of the excitatory pairwise correlation, on the mean output rate of two types of model neurons: the FR neuron and the conductance-based leaky IF neuron. The output of the FR neuron is represented by a continuous function of time, the instantaneous firing rate of the neuron, and could be treated analytically. The response of the IF neuron was studied by means of numerical simulations. In addition, the latter model was studied for inputs with different proportions of excitation and inhibition.

3.1 Response of the Firing Rate Neuron. The continuous FR neuron is widely used in neural network studies. Its membrane potential U(t) is described by
C dU(t)/dt = −U(t)/R + I(t),   (3.1)

with C the membrane capacitance, R the membrane resistance, and I(t) the input current. The membrane potential U(t) is a low-pass version of the input current I(t). The output firing rate r_o(t) of the neuron is given by the activation function, which is assumed to be an instantaneous function of the potential. Here, we used the Heaviside step function H(x),

H(x) = 0 for x < 0,  1 for x ≥ 0.   (3.2)
In this case, the output rate of the neuron is expressed by

r_o(t) = K H(U(t) − θ),   (3.3)
where θ represents the threshold and K is a constant with dimension spikes per second. Alternatively, a sigmoidal function is often used as the activation function. In our case, however, we used the step function because it greatly facilitates the analysis of the neuronal response. Since we are interested only in the qualitative behavior of the output rate, the constant K will not be considered any further, and r_o(t) will be dimensionless. An example of the response of the membrane potential U(t) to the input described above is shown in Figure 2. Synapses were modeled by current pulses. The PSPs were thus represented by an instantaneous jump of potential (positive or negative, according to the excitatory or inhibitory nature of the synapse), followed by an exponential decay toward rest, with time constant τ = RC = 10 ms. Excitatory and inhibitory PSPs had the same absolute amplitude, fixed at a value of 1. The excitatory and inhibitory input population size was N = 100 in this example, and the individual excitatory and inhibitory input firing rate was r = 20 spikes per second. The excitatory input was modeled by the SIP model (see Figure 2A) or by the MIP model (see Figure 2B), and the pairwise correlation coefficient was c = 0.4 in both cases. Observe that for both input models, the membrane potential shows large depolarizations caused by the occurrence of synchronous excitatory presynaptic spikes. These depolarizing events had, however, different characteristics for the two input models, as we will see in more detail in the following.

3.1.1 Amplitude Distribution of the Membrane Potential. The probability density function (pdf) of the amplitude distribution of the membrane potential is shown in Figure 2 next to sample time courses of the membrane potential. The histogram is an estimate generated from a 200 s simulation, with a bin size of 4 (in units of the PSP amplitude). The thick black lines represent an analytical approximation of part of the pdf.
As the output rate of the FR neuron is an instantaneous function of the membrane potential, knowledge of the pdf of the potential is sufficient to derive the mean output rate of the neuron. We now describe the main assumptions made to obtain the approximation of the pdf; a full derivation for the potential and the mean output rate is given in appendix B. The approximation of the pdf is based on the decomposition of the input into independent components (cf. section 2) and the use of a classical result of shot noise theory. The membrane potential of an FR neuron can be considered as a filtered point process. Assuming identical synapses, it can be represented by

U(t) = Σ_k h(t − t_k),

with t_k denoting the times of occurrence of synaptic events and h(t) the impulse response of the filter realized by the synapse and the membrane. For synapses modeled as current pulses, h(t) is simply the impulse response of the membrane, given by

h(t) = 0 for t < 0,  A e^{−t/τ} for t ≥ 0,   (3.4)
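Equations 3.1 to 3.4 combine into a short simulation: filter the net synaptic input with the exponential impulse response and apply the Heaviside activation. The sketch below is our own minimal forward-Euler illustration with the parameter values of Figure 2 (τ = 10 ms, N = 100, r = 20 spikes per second, θ = 30, PSP amplitude 1), not the authors' simulation code; it also contrasts independent with SIP-like excitation.

```python
import numpy as np

def fr_neuron(exc_counts, inh_counts, tau=0.01, dt=0.001, theta=30.0):
    """Membrane potential U(t) (equation 3.1 with current-pulse synapses,
    PSP amplitude 1) and mean of r_o(t) = H(U - theta) (equations 3.2-3.3, K = 1)."""
    drive = exc_counts - inh_counts              # net synaptic input per bin
    U = np.zeros(drive.size)
    decay = np.exp(-dt / tau)                    # exponential decay over one bin
    for t in range(1, drive.size):
        U[t] = U[t - 1] * decay + drive[t]       # jump-and-decay dynamics
    return U, (U >= theta).mean()                # dimensionless mean output rate

rng = np.random.default_rng(0)
nbins, dt = 200_000, 0.001
exc_ind = rng.poisson(100 * 20 * dt, size=nbins)   # 100 independent exc. inputs, 20 s^-1
inh = rng.poisson(100 * 20 * dt, size=nbins)       # 100 independent inh. inputs, 20 s^-1
U, ro = fr_neuron(exc_ind, inh)

# SIP-like excitation, c = 0.4: independent part at rate N*r*(1 - c),
# plus clusters of N = 100 simultaneous spikes at rate r*c = 8 s^-1.
exc_sip = (rng.poisson(100 * 20 * 0.6 * dt, size=nbins)
           + 100 * rng.poisson(20 * 0.4 * dt, size=nbins))
U2, ro2 = fr_neuron(exc_sip, inh)
```

With fully independent balanced input, U rarely reaches the threshold; the synchronized clusters of the SIP-like input push U far above θ and raise the mean output rate by orders of magnitude, which is the effect analyzed in the text.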
[Figure 2 appears here; panels A–D show membrane potential time courses (left) and pdfs (right).]

Figure 2: Time course (left) and probability density function (pdf, right) of the membrane potential of the firing-rate (FR) neuron. The membrane potential is expressed in units of the PSP amplitude. (A) The excitatory input to the FR neuron consisted of 100 excitatory spike trains, generated according to the SIP model (individual firing rate 20 spikes per second, pairwise correlation 0.4). The inhibitory input was composed of 100 independent Poisson processes with a rate of 20 spikes per second. The pdf estimated from a 200 s simulation of the membrane potential is shown on the right (histogram with a bin width of 4). A segment of the analytical approximation of the pdf (see text) is overlaid (thick black line). The dotted line indicates the firing threshold. (B) Same input rate and pairwise correlation as in A, but using the MIP model for excitatory inputs. (C) Approximation (see text for details) of the membrane potential with the SIP model as excitatory input. The approximation retains the large fluctuations caused by clusters of synchronous input spikes (compare with A). The estimated pdf of the approximated membrane potential (from a 200 s simulation of the approximated membrane potential, bar histogram) matches the analytically calculated pdf (thick black line, same as in A). (D) Approximation of the membrane potential using the MIP model for excitatory inputs. Again, the estimated pdf (histogram) matches the analytical expression (thick black line, same as in B).
with A being the postsynaptic potential amplitude and τ the membrane time constant. If the synaptic events follow Poisson statistics, the filtered process is called a shot noise (Papoulis, 1991). If the rate of the incoming events is large compared to the time constant of the filter, the distribution of membrane potentials is approximately gaussian (see Papoulis, 1991). In the case of our two input models, however, the interaction structure does not admit a normal approximation. The right panels in Figures 2A and 2B show that the membrane potential densities are clearly not gaussian. Gilbert and Pollak (1960) derived an exact expression for the pdf of a shot noise with exponential impulse response. They showed that for A = 1 and τ = 1, the shot noise U(t) with underlying rate of pulses λ = 1 has density

P(U) = e^{−γ} for 0 ≤ U < 1,  e^{−γ}(1 − log U) for 1 ≤ U < 2,

where γ = −∫_0^∞ e^{−t} log t dt is Euler's constant. In fact, the analytic expression of P(U) changes at U = 1, 2, 3, . . . The expression for a given segment is derived recursively from the expression for the preceding segment. The case with arbitrary parameters A, τ, and λ can be extended from the results of Gilbert and Pollak (1960) and is shown in appendix B.
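The first-segment value e^{−γ} is easy to confirm numerically: for A = τ = λ = 1, the probability mass of the shot noise on [0, 1[ equals ∫_0^1 e^{−γ} dU = e^{−γ} ≈ 0.561. The following discretized check is our own rough illustration, not part of the article.

```python
import numpy as np

def shot_noise(rate=1.0, tau=1.0, amp=1.0, T=3000.0, dt=0.002, seed=0):
    """Discretized shot noise with exponential impulse response (equation 3.4)."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    pulses = amp * rng.poisson(rate * dt, size=n)   # Poisson arrivals per bin
    U = np.zeros(n)
    decay = np.exp(-dt / tau)
    for t in range(1, n):
        U[t] = U[t - 1] * decay + pulses[t]
    return U

U = shot_noise()
burn = int(10 / 0.002)                    # discard transient (10 time constants)
mass = np.mean(U[burn:] < 1.0)            # P(0 <= U < 1); density there is e^{-gamma}
euler_gamma = float(np.euler_gamma)
```

The stationary mean λAτ = 1 and the first-segment mass e^{−γ} both come out of the same simulation, which makes it a convenient sanity check before using the piecewise density in the approximations below.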
3.1.2 Analytic Approximation of Membrane Potential Distribution. What is the use of this result for the derivation of the membrane potential density of the FR neuron in response to the SIP and MIP input models? If the proportions of excitation and inhibition are equal (i.e., the input is balanced), the neuron's output firing rate is driven by the fluctuations of the membrane potential above the threshold. Synchronous clusters of input spikes induce large excursions of the membrane potential that are likely to constitute the major contribution to the neuronal output rate. If, in addition, the nonsynchronous input spikes (excitatory or inhibitory) do not significantly contribute to the membrane potential excursions, but only to the mean membrane potential level, we can approximate the membrane potential by the filtered interaction process (i.e., the process with simultaneous spikes), together with a constant membrane potential offset due to the nonsynchronous input spikes. For the input with the SIP model, the interaction process has fixed amplitude N and rate rc, and the nonsynchronous input consists of an excitatory spike train with rate Nr(1 − c) and an inhibitory spike train with rate Nr. The resulting mean membrane potential offset is given by Campbell's theorem and amounts to −Nrcτ (see appendix B). We can write

P(U) ≈ P̃_N(U + Nrcτ),

where the pdf of the potential response to the SIP model is approximated by a shifted version of the pdf of the filtered Nth-order interaction process, P̃_N(U). The latter can be derived on the basis of Gilbert and Pollak's method.
The rigorous justification of this approximation is given in appendix B. Figure 2C shows an example of the time course of the approximated membrane potential, with the same parameters as in Figure 2A (left panel). Observe that the small fluctuations caused by the nonsynchronous input spikes disappear, but their mean level is retained. A similar approximation can be made for input defined by the MIP model. To be able to use the shot noise density directly, we make the additional assumption that the stochastic spike clusters generated by the MIP model can be approximated by the average cluster; we replace the N + 1 interaction processes of order 0 to N by a single interaction process of order (amplitude) Nc and rate r/c. The density of the potential response to the MIP model thereby becomes

P(U) ≈ P̃_{Nc}(U + Nrτ),

where P̃_{Nc}(U) is the pdf of the filtered average interaction process and −Nrτ the mean membrane potential offset, caused by the independent inhibitory input. Again, we refer to appendix B for a complete derivation of this result. An example of the time course of the approximated membrane potential response to the input with the MIP model is shown in Figure 2D (left panel). Because of the piecewise expression of the shot noise pdf, the convenience of our approximation depends on the number of segments that must be taken into account. For a membrane impulse response of amplitude A (see equation 3.4), the expression changes its form at integer multiples of A. For the SIP model (see Figure 2C, right), the neuron's response to a synchronized spike cluster has amplitude N. We used the first segment of the pdf, extending over [−Nrcτ, −N(rcτ − 1)[. (We use the conventional mathematical notation x ∈ [a, b[ to indicate a ≤ x < b.) For the parameter set considered here, the approximation of the potential holds for U ∈ [−8, 92[ in units of A.
Comparison with the histogram estimated from a 200 s simulation of the approximated membrane potential shows an excellent correspondence. For the MIP model (see Figure 2D, right), we used the first two segments. As we reduced the model's interaction structure to the average spike cluster, the neuron's response to a synchronized cluster has amplitude Nc, and the first two segments cover [−Nrτ, −N(rτ − 2c)[, which corresponds to [−20, 60[. Note that the change of the analytic form of the pdf can be recognized from the sharp bend at U = −N(rτ − c) = 20. Again, a numerical estimate from the simulation of the approximated membrane potential (see the histogram in Figure 2D, right) shows excellent correspondence. If we now compare the analytical approximation of the pdf with the pdf estimated from a simulation of the full input model (Figures 2A and 2B, right), the agreement is very good. A small discrepancy can be observed only near the lower bound of the domain considered. There, our approximation neglects the fluctuations due to the nonsynchronous input spikes that spread the membrane potential around the mean value.
3.1.3 Mean Response Rate. We can now consider the mean output rate $r_o$ for the FR neuron,

$$r_o \equiv E[r_o(t)] = \int_h^{\infty} P(U)\,dU = 1 - Q(h), \tag{3.5}$$

where $Q(x) = \int_{-\infty}^{x} P(U)\,dU$ is the cumulative probability distribution function of the membrane potential. We thus obtain an approximation of the mean output rate by integrating over the approximated density, which has to be known between $-\infty$ and h. In Figure 2, the threshold h is represented by a dotted line and was set to 30. The number of segments of the pdf that must enter the calculation of $r_o$ is thus determined by the value of the input parameters, the threshold, and the membrane time constant. For the SIP model, an approximation of the mean output rate based on the first segment of the probability distribution is

$$r_o \approx 1 - \tilde{D}\,\frac{1}{rc\tau}\left(\frac{h}{N} + rc\tau\right)^{rc\tau}, \tag{3.6}$$

with $\tilde{D} = e^{-rc\tau\gamma}/\Gamma(rc\tau)$, where $\gamma$ is Euler's constant and $\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\,dt$ is the gamma function. It is valid as long as the threshold lies within the first segment of the distribution. For larger h, additional segments of the distribution have to be included. For the MIP model, the approximation based on the first two segments is given in appendix B in equation B.2.

We used these approximations to study the output rate of the FR neuron in response to both types of input. The input was described by three parameters: the size N of the input population, the rate r of individual inputs, and the correlation coefficient c between any two individual excitatory inputs. N and r were identical for the excitatory and the inhibitory populations. We examined how the input correlation coefficient c influenced the mean output rate $r_o$ of the FR neuron. In addition, we systematically varied N and r. The results are shown in Figure 3. Analytical approximations are represented by thick gray lines, and results obtained by numerical simulations are given by black dots. Figure 3A shows the output rate $r_o$ as a function of the input correlation coefficient c for several input population sizes N = 40, 100, 500, 1000 and a fixed input rate of r = 20 spikes per second. The SIP model was used as the excitatory input ensemble. For all N, $r_o$ increased monotonically with the correlation coefficient. Indeed, as c grows larger, the rate of occurrence of simultaneous excitatory input spikes, given by rc, increases, and membrane potential excursions above threshold become more frequent, leading to an increase of the mean output rate. Nevertheless, observe that the overall increase of $r_o$ does not scale with N and tends to saturate rapidly for larger input population sizes. In the limit $N \to \infty$, we have

$$r_o \approx 1 - \frac{(rc\tau)^{rc\tau - 1}}{\Gamma(rc\tau)}\, e^{-rc\tau\gamma}.$$
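Equation 3.6 and its large-N limit can be evaluated numerically; a sketch (the function names are ours; $\gamma$ is taken to be Euler's constant, and the arguments must keep $h/N + rc\tau$ inside the first segment of the distribution). Note that the formula depends on r and τ only through the product rτ, and on h and N only through the quotient h/N:

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant (gamma in equation 3.6)

def ro_sip_first_segment(h, N, r, c, tau):
    """First-segment approximation of the FR neuron's mean output rate for
    SIP input (equation 3.6)."""
    x = r * c * tau
    D = math.exp(-x * EULER_GAMMA) / math.gamma(x)
    return 1.0 - D * (h / N + x) ** x / x

def ro_sip_limit(h, r, c, tau):
    """Limit of equation 3.6 for infinite input population size N."""
    x = r * c * tau
    return 1.0 - x ** (x - 1) * math.exp(-x * EULER_GAMMA) / math.gamma(x)
```

For example, with h = 30, N = 1000, rτ = 0.02 (relative units), and c = 0.5, the approximation gives an output rate of about 0.032.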
Higher-Order Statistics of Input Ensembles
79
This limit is plotted as a dashed line in Figure 3A. Figure 3B features the output rate of the FR neuron for identical input conditions, with an interaction structure of the excitatory population generated according to the MIP model. For small values of c, the response of the FR neuron showed a considerably stronger dependence on the ensemble size than in the SIP model. For N = 40, for instance, the initial increase of $r_o$ was smaller than with the SIP model. By contrast, for N = 1000, a small input correlation induced a much larger output rate with the MIP model. For larger N, a further increase of c led to a decrease of $r_o$. Note that for c = 1, both correlated input models are statistically identical, and the responses of the FR neuron were thus the same.

For both the SIP and MIP input models, the analytical approximations of $r_o$ for the FR neuron model were in very good agreement with the simulation results, showing that the assumptions made were appropriate. As hypothesized, the response of the neuron is driven by the occurrence of excitatory spike clusters for the largest part of the correlation range. If c approaches 0, the rate of these synchronous events tends to 0, and the approximate treatment of the SIP model degrades. The reason is that it neglects the contribution of nonsynchronous excitatory input spikes, whose rate, Nr(1 − c), increases for smaller c. This is shown in the inset of Figure 3A, depicting the output rate of the FR neuron for c = 0.005, 0.01, . . . , 0.1. Whereas the approximation tends to 0, the neuron's response to the full SIP input model has nonzero values at c = 0, for N = 1000. The success of the approximation near c = 0 thus critically depends on the values of r and N. For the response to the MIP model, we limited the approximation to the first two segments of the pdf and thus could not approximate the response of the FR neuron over the whole range of c (see Figure 3B).
For small correlation coefficients, the size of the mean synchronous spike cluster Nc is small compared to the threshold, and many segments of the shot noise density must be taken into consideration. The success of the analytical approximation here indicates that the stochastic nature of the size of synchronous clusters does not play a significant role for the mean output rate. This independence allows us to explain the observed effects in terms of the average cluster. For small values of c, the rate r/c of the mean cluster is large. The frequent occurrence of small clusters (with size proportional to c) results in their accumulation in the membrane potential, which drives the potential above threshold h (see Figure 2B) and increases the output rate. With increasing c, however, the clusters become larger but less frequent. Once their mean inter-event interval (inverse rate) exceeds the membrane time constant, they no longer accumulate, and the increase of the cluster size cannot compensate for their less frequent occurrence. For a large input population size, these two opposing effects result in a non-monotonic dependence of the output rate on the input correlation (see Figure 3B).
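The two opposing effects can be seen in the first two moments of the average-cluster approximation. For shot noise with jump amplitude A, event rate λ, and exponential decay time τ, Campbell's theorem (a standard result, not derived in the text) gives mean Aλτ and variance A²λτ/2. For the mean MIP cluster (A = Nc, λ = r/c), the mean depolarization Nrτ is therefore independent of c, while the standard deviation grows as √c; it is the fluctuations, not the mean, that change with correlation. A sketch of this bookkeeping:

```python
import math

def shot_noise_moments(A, lam, tau):
    """Campbell's theorem for shot noise with kernel A*exp(-t/tau):
    stationary mean = A*lam*tau, variance = A**2 * lam * tau / 2."""
    return A * lam * tau, A * A * lam * tau / 2.0

def mip_mean_cluster_moments(N, r, c, tau):
    """Moments of the potential driven by the average MIP cluster:
    amplitude N*c arriving at rate r/c."""
    return shot_noise_moments(N * c, r / c, tau)
```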
[Figure 3]
Figure 3: Mean output of the firing rate (FR) neuron as a function of the pairwise correlation coefficient in the population of excitatory inputs, for the two statistical input models. Relative units are used (see text). Thick gray lines are obtained from analytical approximations (see text); the black dots represent the results of simulations (duration 500 s each). (A) The excitation was generated according to the SIP model, and the inhibition was represented by a set of independent Poisson processes. Excitatory and inhibitory population size and rate were identical. The output rate of the FR neuron is shown for different input population sizes (N = 40, 100, 500, 1000). The individual input rate was held constant (r = 20 spikes per second). The analytically computed output rate of the FR neuron for infinite population size is plotted as a dashed line. Inset: Zoom-in of the graph for low pairwise correlation coefficients (0 ≤ c ≤ 0.1). (B) The excitation was generated according to the MIP model; everything else as in A. (C) Same input setting as in A, but for a fixed input population size (N = 500) and different individual input rates (r = 10, 20, 50, 90 spikes per second). (D) The excitation was generated according to the MIP model; everything else as in C. Note that for c = 0 and c = 1, the SIP and the MIP models coincide.
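The first-segment approximation for SIP input can be cross-checked by an event-driven simulation of the cluster shot noise alone: clusters of amplitude N arrive as a Poisson process of rate rc and decay with time constant τ, the potential is measured relative to its mean, and the FR neuron's output rate is the fraction of time spent above threshold. A sketch (our own cross-check, not from the text; parameter values in the usage example are illustrative and chosen so that the threshold falls inside the first segment):

```python
import math
import numpy as np

def mc_rate_sip_cluster(N, r, c, tau, h, T, rng):
    """Event-driven Monte Carlo of the cluster picture behind equation 3.6.
    Between cluster arrivals the shot noise S decays exponentially, so the
    time spent above threshold can be computed exactly per inter-event gap.
    The potential is U = S - N*r*c*tau (mean-centered), so U > h is
    equivalent to S > h + N*r*c*tau."""
    times = np.sort(rng.uniform(0.0, T, rng.poisson(r * c * T)))
    gaps = np.diff(np.append(times, T))
    level = h + N * r * c * tau
    S, above = 0.0, 0.0
    for gap in gaps:
        S += N                                   # incoming cluster of size N
        if S > level:                            # exact time above threshold
            above += min(gap, tau * math.log(S / level))
        S *= math.exp(-gap / tau)                # decay until next cluster
    return above / T
```

With N = 1000, rτ = 0.02, c = 0.5, and h = 30, the Monte Carlo estimate agrees with equation 3.6 to well within one percentage point.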
Figure 3C shows the output rate $r_o$ for a fixed population size of N = 500 and for different input rates r = 10, 20, 50, 90 spikes per second, with the SIP model being used as excitatory input ensemble. The output rate $r_o$ increased with increasing c, and more so for larger r. The dependence of $r_o$ on c saturated more strongly for larger r. The output rate of the FR neuron with identical input parameters, but using the MIP model as excitatory input, is
shown in Figure 3D. We observe a distinct difference with the response to the SIP input, particularly for small input correlations. Here, for small c, an increase of c leads to a large increase of $r_o$ due to the large N. A consecutive decrease of $r_o$ for larger c occurred for small r only. Indeed, for r = 50, 90 spikes per second, the inter-cluster interval did not become large compared to the membrane time constant, so the clusters continued to accumulate.

After having explored the effects of each of the three input parameters, we now consider how modifications of the FR neuron (more precisely, of the membrane time constant τ, the threshold h of the activation function, and the form of the activation function) could affect these results. With regard to τ, note that the output rate of the FR neuron with the SIP model (see equation 3.6) depends only on the product rτ. Thus, an increase (decrease) of the time constant has the same effect as a proportional increase (decrease) of the input rate. This is the case for the density function of any shot noise with exponential kernel (Gilbert & Pollak, 1960). Taking into consideration that the membrane potential of the FR neuron response (for both the SIP and the MIP models) can be decomposed into a sum of several elementary shot noise processes with different amplitudes and rates (see appendix B), it can be concluded that the equivalence of τ and r is valid for all input parameters of both input models. An analogous but somewhat weaker statement can be made concerning the threshold h of the activation function. The output rate of the neuron depends only on the quotient h/N (see equation 3.6). This is the case for the approximated response to both the SIP and the MIP models (see appendix B). Thus, as long as the approximation is good, the effect of an increase of h is equivalent to a proportional decrease of N. Finally, we turn to the effects of the form of the activation function.
For an arbitrary activation function, it would be necessary to know the entire density function of the potential to be able to derive the mean output rate of the FR neuron. The piecewise definition of the density function, however, makes it difficult to derive an approximation of the output rate. Nevertheless, our method would still be applicable, and an approximation of the neuronal rate could be derived, if the input clusters had a low rate. Indeed, in this case, the entire density function is well approximated by the first few segments. Note that if the activation function were linear, the mean output rate of the FR neuron would not be sensitive to any aspect of the input statistics other than the mean input rate, because the mean output of a linear system is determined by its mean input only. A nonlinear activation function is thus a prerequisite for the sensitivity of the neuron to input correlations and, a fortiori, to higher-order statistics. In turn, any sufficiently nonlinear activation function will result in different mean responses of the FR neuron to the two input models. Indeed, the response potential densities for the SIP and the MIP models differ, and the mean output rates will typically differ as well.

3.2 Response of the Integrate-and-Fire Neuron. In order to compare the effects of different input ensemble statistics under more realistic conditions, we simulated a conductance-based leaky IF neuron (Tuckwell, 1988; Koch, 1999). Its membrane potential was (on average) depolarized with respect to the resting state, and it fluctuated. Its input conductance was also increased, similarly to what is typically observed in vivo (Paré, Shink, Gaudreau, Destexhe, & Lang, 1998). The essential difference to the membrane potential dynamics of the FR neuron is that synapses here were modeled not as current pulses but as transient membrane conductance changes. The subthreshold membrane potential of the IF neuron was described by

$$C\,\frac{dU(t)}{dt} = \frac{1}{R}\,[U_r - U(t)] + G_e(t)\,[U_e - U(t)] + G_i(t)\,[U_i - U(t)].$$

$U_r$ is the resting membrane potential. $G_e(t)$ and $G_i(t)$ are the excitatory and inhibitory synaptic conductances, and $U_e$ and $U_i$ are the excitatory and inhibitory synaptic reversal potentials, respectively. We use $g_e(t)$ and $g_i(t)$ to denote the conductance changes elicited by a single excitatory or inhibitory presynaptic spike. Both excitatory and inhibitory conductance changes were modeled by an α-function (Koch, 1999; Rotter & Diesmann, 1999),

$$g_e(t) = \bar{g}_e\, a t\, e^{1-at}\, H(t), \qquad G_e(t) = \sum_k g_e(t - t_e^k),$$
$$g_i(t) = \bar{g}_i\, a t\, e^{1-at}\, H(t), \qquad G_i(t) = \sum_k g_i(t - t_i^k),$$
where $\bar{g}_e$ ($\bar{g}_i$) is the peak excitatory (inhibitory) conductance change, 1/a is the time constant of the conductance change, and H(t) is the Heaviside step function. Both types of synapses had the same time constant. The times $t_e^k$ and $t_i^k$ represent the occurrences of excitatory and inhibitory input spikes, respectively. The voltage-dependent conductances responsible for action potential generation were not included in the model. Instead, when the membrane potential reached a threshold $U_h$, an action potential was generated, and the membrane potential was clamped to a reset value $U_{reset}$ for a period $t_{refr}$. In addition, all synaptic currents were shunted during this period. This latter mechanism modeled the refractory period. The various model parameter values were chosen to be in line with the experimental literature (McCormick, Connors, Lighthall, & Prince, 1985): C = 500 pF, R = 30 MΩ, $U_r$ = −70 mV, $U_h$ = −50 mV, $U_{reset}$ = −60 mV (Troyer & Miller, 1997), and $t_{refr}$ = 2 ms. The excitatory synapses modeled fast glutamatergic (AMPA) synapses, and the inhibitory synapses modeled fast GABAergic (GABA_A) synapses. We set $U_e$ = 0 mV and $U_i$ = −70 mV (Koch, 1999). The time constant of the synaptic conductance changes was 1/a = 1 ms, and the peak conductances were $\bar{g}_e$ = 1 nS and $\bar{g}_i$ = 3.4 nS, respectively.

3.2.1 Membrane Potential Response. A depolarized membrane potential was obtained by bombarding the neuron with Poisson spike trains. We used independent excitatory and inhibitory spike trains whose rates were
adjusted such that the membrane potential fluctuated around a mean value of −54.3 mV (Léger, Stern, Aertsen, & Heck, 2002). We used an excitatory rate of 9000 spikes per second and an inhibitory rate of 5500 spikes per second, which is equivalent to 9000 independent individual excitatory inputs and 5500 independent individual inhibitory inputs, both with a rate of 1 spike per second. We refer to this input as background input. The membrane potential fluctuated with a standard deviation of 1.5 mV, and the model neuron fired with a rate of 1 spike per second under these conditions. The excitatory postsynaptic potential (EPSP) or inhibitory postsynaptic potential (IPSP) produced by an additional excitatory (inhibitory) input spike had, on average, an amplitude of 0.16 mV (−0.15 mV). This is in accordance with Matsumura, Chen, Sawaguchi, Kubota, and Fetz (1996), who found that EPSPs and IPSPs recorded intracellularly in vivo had similar amplitudes and time courses. We let two additional populations of excitatory and inhibitory inputs impinge on the neuron. Both populations had the same size, and the individual excitatory and inhibitory discharge rates were identical, as with the FR neuron. The excitatory spike trains were correlated, with parameters represented by either the SIP or the MIP model; the inhibitory inputs were always independent. Figure 4A shows a 500 ms realization of the excitatory and inhibitory population activity (bottom), with N = 100 and r = 20 spikes per second. Both the input and the IF neuron were simulated with a time step of 0.1 ms. In the figure, the number of excitatory input spikes per time step is plotted as a positive integer and the number of inhibitory input spikes per time step as a negative integer. The SIP model was used for the excitatory population, and the correlation coefficient was c = 0.1. Note that the background synaptic input is not shown. The membrane potential response of the IF neuron is plotted in the upper part of the figure.
It fluctuated randomly, characterized by occasional fast depolarizations leading to an output spike. These were mostly caused by spike clusters involving the whole correlated excitatory population. Note that the second spike in Figure 4A was due to the background of independent synaptic input. The same input parameters were used for Figure 4B, but here the MIP model replaced the SIP model. The excitatory input activity consisted of a large number of clusters of smaller size; the output spikes resulted from the summation of several input clusters. This summation mechanism was rendered stochastic by the background fluctuations of the membrane potential. Due to the reduced size of the input clusters, the depolarization preceding a response spike was slower than with the SIP model. In Figure 4C, we used the SIP model but increased the pairwise correlation coefficient to 0.4. As a consequence, the rate of occurrence of the input clusters increased, accompanied by a corresponding increase of the output rate. Figure 4D shows the response of the neuron to the MIP model, with c = 0.4. The size of the clusters increased compared to c = 0.1 (see Figure 4B), but their rate of occurrence decreased, as already explained in section 3.1.
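The conductance-based IF neuron described above can be sketched in a few lines of forward-Euler code; here the α-function synapse is realized as a cascade of two first-order filters, which reproduces $g(t) = \bar{g}\,a t\,e^{1-at}$ in the continuous-time limit. Parameter values follow the text; the function name and the (time, count) input format are our own conventions:

```python
import math
import numpy as np

def simulate_if(exc_spikes, inh_spikes, T=0.5, dt=1e-4):
    """Forward-Euler sketch of the conductance-based leaky IF neuron of
    section 3.2. Inputs are lists of (time_s, spike_count) pairs; returns
    the list of output spike times in seconds."""
    C, R = 500e-12, 30e6                     # capacitance, input resistance
    Ur, Uh, Ureset, trefr = -70e-3, -50e-3, -60e-3, 2e-3
    Ue, Ui = 0.0, -70e-3                     # reversal potentials
    a = 1.0 / 1e-3                           # 1/a = 1 ms synaptic time constant
    ge_peak, gi_peak = 1e-9, 3.4e-9          # peak conductances

    n = int(round(T / dt))
    jump_e, jump_i = np.zeros(n), np.zeros(n)
    for t, k in exc_spikes:
        jump_e[int(round(t / dt))] += k * ge_peak * math.e
    for t, k in inh_spikes:
        jump_i[int(round(t / dt))] += k * gi_peak * math.e

    U, ze, Ge, zi, Gi = Ur, 0.0, 0.0, 0.0, 0.0
    refr_left, out = 0.0, []
    for i in range(n):
        ze += jump_e[i]                      # alpha function as two-stage
        zi += jump_i[i]                      # filter: z' = -a z, G' = a(z - G)
        Ge += dt * a * (ze - Ge)
        Gi += dt * a * (zi - Gi)
        ze -= dt * a * ze
        zi -= dt * a * zi
        if refr_left > 0.0:                  # refractory: clamp and shunt
            refr_left -= dt
            U = Ureset
            continue
        U += ((Ur - U) / R + Ge * (Ue - U) + Gi * (Ui - U)) * dt / C
        if U >= Uh:
            out.append(i * dt)
            U, refr_left = Ureset, trefr
    return out
```

A synchronized SIP-like cluster of 100 excitatory spikes depolarizes the cell by several tens of millivolts from rest and reliably elicits an output spike, while an input-free run stays at the resting potential.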
[Figure 4]
Figure 4: Time course of the membrane potential of the leaky integrate-and-fire (IF) neuron (top) and the corresponding input (bottom), with the excitation generated by either the SIP or the MIP model. (A) The excitation was generated according to the SIP model, with a pairwise correlation coefficient of 0.1; the inhibition was a population of independent Poisson processes. The input population size (N = 100) and individual firing rate (r = 20 spikes per second) were the same for the excitation and the inhibition. The number of excitatory (inhibitory) input spikes per bin (size 0.1 ms) is plotted as a positive (negative) number in the lower panel. Note that the additional independent background inputs that kept the membrane potential in a depolarized state (see text) are not shown. The dotted line in the upper panel indicates the firing threshold. (B) The excitation was generated according to the MIP model. The pairwise correlation and other input parameters were identical to A. (C) The excitation was generated according to the SIP model, with an excitatory input pairwise correlation coefficient of 0.4. Other parameters as in A. (D) The excitation was generated according to the MIP model. The pairwise correlation and other input parameters were identical to C.
The output rate was larger than with c = 0.1. However, the clusters were still not large enough to reliably elicit an output spike individually.

3.2.2 Mean Response Rate. Similarly to the FR neuron, we compared the response of the IF neuron for the two different input models SIP and
MIP by studying the dependence of the output firing rate on the pairwise correlation of the excitatory input. In addition, we systematically varied N and r. Figure 5A shows the output rate $r_o$ of the IF neuron to SIP input as a function of c, for a fixed input rate r = 20 spikes per second and for different input population sizes N = 20, 40, 100, 1000. The output rate slowly increased with the input correlation coefficient. For population sizes beyond 100, the output rate $r_o$ did not increase any further. This behavior was caused by the membrane potential resetting of the IF neuron. Evidently, if the input population size is large enough that the depolarizations induced by an input cluster reach the threshold, a further increase of N has no impact on the output rate, because the rate of occurrence of the input clusters does not depend on N. In that case, the mean output rate $r_o$ of the IF neuron can be approximated by (see Murthy & Fetz, 1994)

$$r_o = \nu\,(1 - t_{refr}\, r_o), \tag{3.7}$$

where ν is the rate of input clusters. For the SIP model, we have ν = rc. This approximation is plotted as a thick gray line in Figure 5A and expresses that the output rate $r_o$ equals the rate of spike clusters ν, minus the clusters arriving while the neuron is refractory. Of course, the approximation is not valid if output spikes are (also) caused by the background bombardment of nonsynchronized excitatory spikes (as is the case for c = 0, for instance).

Figure 5B shows the response of the IF neuron as a function of c, but with the MIP model for the excitation, for r = 20 spikes per second and N = 20, 40, 100, 300, 500, 1000. For small values of c, the dependence of the output rate on c was far more strongly modulated than with the SIP model. For N = 1000, for instance, a small c generated a very large output firing rate, much larger than with the SIP model. Only for very small populations (N = 20) was the output rate smaller than with the SIP model, over the whole range of c (see Figure 5A, inset). For large N, as for the FR neuron, the dependence of $r_o$ on c was non-monotonic. The initial increase of $r_o$ with c was caused by the integration of increasingly large input clusters. However, when the clusters were large enough that a single input cluster reliably triggered an output spike, a further increase of the input correlation was detrimental to the output firing rate. Indeed, a further increase of c, resulting in even larger input clusters, causes $r_o$ to decrease, because the occurrence rate of clusters decreases with increasing c. For these larger values of c, the mean output rate $r_o$ of the IF neuron can again be approximated by equation 3.7. We assume that the stochastic nature of the spike cluster size does not play a role and replace each cluster by one of average amplitude. The rate of the mean-amplitude clusters is r/c. The approximation given by equation 3.7, with ν = r/c, is represented as a thick gray line in Figure 5B.
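Equation 3.7 is linear in $r_o$ and solves to $r_o = \nu/(1 + \nu\, t_{refr})$, writing ν for the input cluster rate. A sketch of this refractory-limited rate for the two cluster-rate choices used in the text (rc for SIP, r/c for the average MIP cluster; function names are ours):

```python
def refractory_limited_rate(nu, t_refr=2e-3):
    """Solve equation 3.7, ro = nu * (1 - t_refr * ro), for the output
    rate ro given the input cluster rate nu (spikes per second)."""
    return nu / (1.0 + nu * t_refr)

def ro_sip(r, c, t_refr=2e-3):
    return refractory_limited_rate(r * c, t_refr)    # SIP cluster rate: r*c

def ro_mip(r, c, t_refr=2e-3):
    return refractory_limited_rate(r / c, t_refr)    # mean MIP cluster rate: r/c
```

For r = 20 spikes per second and c = 0.5, the SIP cluster rate is 10 per second and the approximated output rate is about 9.8 spikes per second; the MIP cluster rate is 40 per second, giving about 37 spikes per second.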
The amplitude of the mean input spike cluster is proportional to N, and the range of validity of the approximation is larger
for large values of N. Note that for larger values of $r_o$, the approximation slightly overestimated the output rate. This is due to the finite rise time of the conductances used to model the synapses, which allowed synchronized clusters to impinge on the cell during the depolarization caused by a first cluster. Such an event happened more frequently at higher rates. Figure 5C shows the mean output rate of the IF neuron with the SIP model, for N = 500 and r = 10, 20, 50, 90 spikes per second. N was large enough that each input cluster elicited an output spike. The relation between $r_o$ and c depended on r, because the rate of occurrence of the input clusters is proportional to r. The mean output rate was approximated by equation 3.7, with cluster rate rc (thick gray lines). Again, note that the approximation is more precise for small values of $r_o$. We replaced the α-function conductance with a conductance pulse and repeated these simulations. In that case, the (small) discrepancy between the simulated output rate and the approximation, observed for larger values of $r_o$, disappeared completely (data not shown). The response to exactly the same input parameters, but with the MIP model instead of the SIP model, is shown in Figure 5D. As N was large, a non-monotonic dependence of $r_o$ on c could be observed. This was the case for all r. The approximation of the mean output rate $r_o$ (see equation 3.7, cluster rate r/c) is plotted as thick gray lines. For the different input rates r, the left bound of the range of the approximated output rate was chosen as the value of c producing the maximal output rate in the numerical simulations. The approximation gave better results for small output rates, as explained above. For large r (50, 90 spikes per second), two local maxima appeared in the simulation results for the dependence between $r_o$ and c.
For the smaller value of c, the interaction structure was thus (locally) optimal with regard to the generation of output spikes, so that both an increase and a decrease of c caused a decrease of $r_o$. To first approximation, we can make the same assumption here as for the response of the FR neuron and explain this effect in terms of the mean cluster. The mean cluster generated by the MIP model, for a value of c producing the locally maximal output rate, was not big enough to depolarize the membrane potential up to the spiking threshold alone and thus needed support from other inputs. An efficient arrangement consists of pairs of clusters whose sizes were such that, within the integration time of the membrane, they jointly caused a depolarization that could just bring the neuron to fire a spike. A pair of larger clusters would have the same effect, but it would occur less frequently, as the mean cluster rate is inversely related to the mean cluster size. Note that the positions of both peaks are shifted to higher values of c for increasing input rate r. This is because not only excitation but also inhibition increases with r, necessitating larger clusters to reach the threshold. Interestingly, this can lead to a non-monotonic dependence of $r_o$ on r. For c around 0.35, for instance, an increase of the input rate from 50 to 90 spikes per second resulted in a decreased output rate.
[Figure 5]
Figure 5: Mean output rate of the leaky integrate-and-fire (IF) neuron as a function of the pairwise correlation coefficient of the excitatory input, for the two different input ensemble statistics. Thick gray lines represent an analytical approximation of the output firing rate under the assumption of large spike clusters in the input (see text for details). (A) The excitation was generated according to the SIP model; the inhibition was a set of independent Poisson processes. Excitatory and inhibitory population size and rate were identical. The output rate is shown for different input population sizes (N = 20, 40, 100, 1000) and constant individual input rate (r = 20 spikes per second). The output rates for N = 100 and N = 1000 produced indistinguishable curves. The inset compares the response of the IF neuron for the SIP and the MIP input model as excitation, for N = 20 and r = 20 spikes per second. It shows that for small population sizes, the output rate with the SIP model was larger over the whole range of c. (B) The excitation was generated according to the MIP model. The output rate is shown for different input population sizes (N = 20, 40, 100, 300, 500, 1000) and a fixed input rate (r = 20 spikes per second). (C) Same input setting as in A, for different individual input rates (r = 10, 20, 50, 90 spikes per second) and a fixed population size (N = 500). (D) The excitation was generated according to the MIP model. All input parameters as in C. Again, note that for c = 0 and c = 1 the SIP and the MIP model coincide.
The input scenarios discussed thus far were composed of equal proportions of excitation and inhibition. The ratio of excitation and inhibition received by a cortical neuron in vivo, however, can only be roughly estimated. Anatomical data suggest that excitation amounts to more than four times the inhibition (Braitenberg & Schüz, 1998). On the other hand, it has been argued on the basis of the irregularity of the discharge of cortical neurons that they operate with equal proportions of excitation and inhibition (van Vreeswijk & Sompolinsky, 1996; Shadlen & Newsome, 1998). We define the balance of the input,

$$b = \frac{N_i r_i}{N_e r_e},$$

where $N_e$ ($N_i$) is the excitatory (inhibitory) input population size and $r_e$ ($r_i$) the firing rate of individual excitatory (inhibitory) input lines. We varied the input balance of the IF neuron by changing the amount of inhibition (the product $N_i r_i$). Note that the same variation of b could be obtained by changing either $N_e$ or $r_e$, not necessarily leading to exactly the same effects. Figure 6A shows the mean output rate of the IF neuron as a function of the excitatory pairwise correlation, for different levels of inhibition, using the SIP model for the excitatory input. The excitation was composed of $N_e$ = 300 inputs firing at $r_e$ = 20 spikes per second. The inhibition was $N_i r_i$ = 0, 1200, 3000, 6000, or 9000 spikes per second, corresponding to b = 0, 0.2, 0.5, 1, 1.5. For low levels of inhibition (b = 0, 0.2, 0.5), the output firing rate for weak pairwise correlations in the input was much larger than for the balanced input. For these values of b, an increase of c caused a decrease of the output rate. The occurrence of excitatory clusters of size $N_e$ induced large depolarizations, larger than actually necessary to reach the spiking threshold. The excess excitatory spikes hence did not contribute to generating output spikes, which had a detrimental effect on the output rate. Observe that for b = 0.5, $r_o$ decreased down to a value of 10 spikes per second at c = 0.45 and increased up to about 20 spikes per second at c = 1. Thus, the initial decrease of $r_o$, caused by spikes participating in the (too) large synchronization clusters, could be partially compensated for by the increased cluster rate. Using more inhibition than excitation (b = 1.5), no output spikes were produced for independent inputs. Due to the hyperpolarized membrane potential, the fluctuations of the potential were not large enough to induce spontaneous spiking.
Nevertheless, an increase of the input correlation and the associated occurrence of synchronized clusters brought the neuron to fire, with an output rate again tending to the maximal rate of 20 spikes per second for c = 1. This is shown by the overlap with the thick gray line, representing the approximation expressed by equation 3.7 (with cluster rate rc). The response of the IF neuron to the same input parameters, but with the MIP model instead of the SIP model for the excitatory input, is shown in Figure 6B. We again observe considerable differences with the response
[Figure 6]
Figure 6: Mean output rate of the leaky integrate-and-fire (IF) neuron as a function of the pairwise correlation coefficient of the excitatory input, for the two different input ensemble statistics, with different ratios of inhibition to excitation. Thick gray lines represent an analytical approximation of the output firing rate, assuming that output spikes are caused only by large spike clusters in the input. (A) The excitation was generated according to the SIP model. The excitatory population size was $N_e$ = 300, and the excitatory individual firing rate was $r_e$ = 20 spikes per second. The product of the inhibitory population size and rate was $N_i r_i$ = 0, 1200, 3000, 6000, 9000, corresponding to an input balance (see text) of b = 0, 0.2, 0.5, 1, 1.5. The response of the IF neuron for b = 1.5 and b = 1 differed only for very low excitatory input correlations. (B) The excitation was generated according to the MIP model. All input parameters were identical to A.
to the SIP input, particularly for small input correlations. At small values of c, a small increment induced a large increase of ro for b = 1 and 1.5. For b = 0.2, a small increase of c had little influence on ro, and for b = 0, ro decreased with c. Small values of c generated small synchronized clusters that could drive the output rate up or down, depending on the net excitation level. This is because for high inhibition levels, the membrane potential fluctuates strongly, and an increase of excitatory input synchrony results in an even higher variance of the membrane potential, leading to more frequent threshold crossings. By contrast, if the excitation level is high, the synaptic input, on average, tends to depolarize the neuron above threshold (see Shadlen & Newsome, 1998). In that case, the summation of rapidly occurring excitatory spikes is efficient, and synchrony does not greatly accelerate the depolarization of the membrane potential to threshold. For b = 0.2, ro was initially insensitive to changes in c, because the loss of spikes in the synchronized clusters was exactly compensated by the greater efficacy of synchronous input. The degree of input balance at which this effect occurs depends on the input parameters N and r. For larger c, as N was large, the effect of synchrony was detrimental for all b values. As the output spikes were then exclusively caused by clusters of synchronous
A. Kuhn, A. Aertsen, and S. Rotter
input spikes, the mean response rate was well described by the approximation expressed by equation 3.7 (with r = r/c), plotted as a thick gray line in Figure 6B. The interaction structure of the input can thus have a dramatic influence on the output firing rate. A striking example is given by the response of the IF neuron for b = 0.5. With the interaction structure of the SIP type, an increase of c from 0 provoked a decrease of the output firing rate, but with the MIP model, for the same input balance level, the same increase of correlation induced an increase of the output rate.

4 Discussion
Individual firing rates and pairwise correlations of a population of neurons do not determine the ensemble uniquely. We showed that neurons can be very sensitive to the details of the interaction structure of their input ensembles. Thus, it is difficult to draw any definitive conclusions about the effect of particular input parameters (like pairwise correlation) on a neuron's output rate, as long as physiological input ensembles are not characterized in a more complete manner, in particular, by specifying their higher-order correlation structure.

4.1 Input Ensemble Statistics. In model studies, it is often assumed that the total input or the membrane potential can be represented by a process with a gaussian probability density function (Tuckwell, 1988; Feng & Brown, 2000; Svirskis & Rinzel, 2000). Doing so has the advantage that the statistical properties of the model can be constrained easily. Nevertheless, as the third and higher cumulants are forced to be zero, this model is obviously not suited to study the impact of the full ensemble statistics. As we have shown in this work, it is crucial to take all moments of the input ensemble into account. Several studies have tackled the impact of correlated spike trains on the output of neuron models. Shadlen and Newsome (1998) and Salinas and Sejnowski (2000) generated correlated spike trains by simulating neurons with partial common input. Two sets of inputs were randomly drawn from a finite pool, resulting in a certain fraction of the inputs being used for both neurons. This introduces a correlation in the output spike trains of the two neurons. Such spike trains have broader cross-correlograms, which is more realistic than the exact synchrony used in this study. However, the higher-order interactions in such ensembles of spike trains were not considered. Another drawback is that the ensembles were not defined analytically.
Destexhe and Paré (1999) used correlated synaptic input to reproduce the neuronal properties observed in vivo. They used Poisson processes as inputs and introduced stochastic dependence by drawing a smaller number of input processes than there were synapses (similarly to Salinas & Sejnowski, 2000), a single Poisson process being used as input to several synapses.
Again, the relation between the pairwise correlation and the higher-order correlation structure was not stated explicitly. Statistically, the interaction structure of an input ensemble has a large number of degrees of freedom. We picked two extreme situations. In the SIP model, all of the doublets of spikes on any pair of spike trains occur simultaneously, resulting in spike clusters synchronized over the whole population. In the MIP model, the cluster size is a stochastic variable with a mean increasing linearly with the pairwise correlation. To realize a pairwise correlation coefficient identical to the SIP model, the average cluster rate must be higher than in the SIP model. Some studies used an input ensemble related to the MIP model (Murthy & Fetz, 1994; Bernander, Koch, & Usher, 1994) in order to investigate the effects of input synchrony. Their input ensemble differed from ours in that the rate of occurrence of spike clusters was held constant. As a consequence, they parameterized their ensemble in terms of the absolute synchrony (the size of spike clusters), and not in terms of the correlation coefficient. Interestingly, Bernander et al. (1994) also introduced a temporal jitter of the synchronization and showed that the precision of the spike correlation can strongly modulate the output firing rate, depending on the values of the input parameters. The issue of the ensemble statistics was specifically tackled by Bohte et al. (2000), who studied an input distribution with maximal entropy. Assuming homogeneous rates and homogeneous pairwise correlations, they derived a maximally flat distribution of input clusters. The size of an input cluster was defined as the number of neurons spiking within a single time bin. They showed that under the assumption of maximal entropy, spike clusters involving a large number of neurons already exist for a small pairwise correlation. The SIP and MIP models were chosen because they represent two extreme cases of input ensembles.
The question arises of how similar they are to the inputs that a real cortical neuron receives. To date, this question remains largely unanswered. Let us mention two lines of research that could help solve this problem. The first is the massively parallel recording of multiple single cell activities in vivo (Nicolelis, 1998). Statistical methods suited to the analysis of the corresponding data are beginning to be available (Gerstein & Aertsen, 1985; Martignon et al., 1995, 2000; Gütig et al., 2002). Nevertheless, the pool of neurons forming the actual presynaptic input ensemble to a single cortical neuron cannot be inferred from such studies. Therefore, a second source of answers could come from the detailed investigation of membrane potential fluctuations gained from intracellular recordings in vivo (Borg-Graham, Monier, & Frégnac, 1998; Azouz & Gray, 1999; Léger, Stern, Aertsen, & Heck, 2002). Analysis of membrane potential fluctuations or their time courses, possibly combined with simultaneous recordings from neighboring cells, could help to shed light on the interaction structure of the input ensembles seen by cortical neurons.
4.2 Dependence of Response on Input Ensemble Statistics. We used the SIP and MIP models of spike train ensembles to describe the excitatory input into neurons. The inhibitory inputs were assumed independent throughout this study. With the SIP model and for a fixed pairwise input correlation, an increase of the input population size induced only a limited increase of the output rate for two very different neuron models (see Figures 3A and 5A). This was true at all input rates and over the entire range of input correlations. Interestingly, this was the case not only for the IF neuron but also for the FR neuron, whose membrane potential did not undergo any reset following a threshold crossing. Probing the model neurons with an interaction structure generated by the MIP model, we observed that for any fixed pairwise correlation in the input, the range of output rates obtained by varying the size of the input ensemble was much larger than with the SIP model (see Figures 3B and 5B). For small input correlations and in contrast to what happened with an interaction structure of the SIP type, the output rate increased steadily with the size of the input ensemble and could thus attain much larger values. The most striking difference between the response to both interaction structures was observed for large input populations. By large, we mean that an input spike cluster involving the whole population elicits a membrane depolarization that reaches the firing threshold. With the SIP model, the rate of spike clusters was proportional to both the individual input rate and the pairwise correlation. For a large input population, the output rate of the IF neuron was thus approximately linearly related to these two input parameters (see Figure 5C). By contrast, with input statistics of the MIP type, the dependence of the IF response on the input firing rate of a large input ensemble could be non-monotonic (see Figure 5D).
Note that for the FR neuron, the variance of the membrane potential was the same with the SIP or the MIP input model. Indeed, the autocovariance function (and thereby the variance) of a linear system's output is entirely determined by the autocovariance function of its input (Papoulis, 1991). As the autocovariance of the summed input ensemble is given by the sum of the autocovariances and cross-covariances of the individual spike trains (both of which were identical for the SIP and the MIP input models), the autocovariance of the membrane potential is identical for both input ensembles. Thus, the size of the fluctuations, as measured by the variance of the membrane potential, does not alone explain the output rate response, as shown by the very different response of the FR neuron to both input ensembles (see Figure 3). This also implies that the assumption of a gaussian distribution of the membrane potential, which requires knowing only the first and second moments of the potential, may fail to predict the mean response rate. Moreover, it can be easily shown that the variance of the membrane potential increases (monotonically) with the input pairwise correlation c (see appendix B for the complete expression of the variance of the membrane potential). Thus, despite an increasing variance of the membrane potential, the response rate can decrease (see Figures 3B and 3D), depending on the higher-order interactions of the input ensemble. Many of the neuronal response properties investigated here were similar for both the FR and the IF neuron. The differences could originate from either the postspike reset mechanism built into the IF neuron or from the use of conductances to model the synapses. Tiesinga, José, and Sejnowski (2000) studied the differences in the neuronal response to an uncorrelated input ensemble when using currents or conductances to model synapses. In the case of the MIP model, we have previously shown that the mean output of a continuous FR neuron with reversal potentials can be in very good correspondence to the output rate observed in an otherwise identical neuron that included a membrane potential reset mechanism (Kuhn, Rotter, & Aertsen, 2002). Hence, depending on the higher-order statistics of the input ensemble, input parameters can have very different effects on the neuron's output rate. We studied the effects of pairwise correlation in the excitatory population across its entire range from 0 to 1. However, in experimental studies, the correlation coefficient is rarely found to be larger than 0.1 to 0.2 (Aertsen & Gerstein, 1985; Das & Gilbert, 1999; Eggermont & Smith, 1995; Nelson, Salin, Munk, Arzi, & Bullier, 1992). Considering this range of input correlations, it appears that the two different realizations of input correlations considered here impose different constraints on the way the neuronal response can be modulated. If the activity of a presynaptic network is characterized by the synchronization of large pools of neurons, the neuronal output is approximately linearly modulated by the input rate.
By contrast, if the population activity of the presynaptic network is composed of small, distributed clusters, the size of the input population constitutes an ideal parameter to modulate the output firing rate of the postsynaptic neuron over a large range. The latter case would thereby suggest an operation mode based on the dynamic recruitment of input lines.

4.3 Balance of Excitation and Inhibition. We changed the mean ratio of inhibition to excitation of the input (the input balance) and observed that the effects of pairwise correlation in the excitatory population on the output rate could be reversed. With the SIP model as excitatory input structure, the monotonic increase of the output rate as a function of the correlation, observed for equal proportions of excitation and inhibition, could change into a monotonically decreasing relation (see Figure 6A). Such modulation of the relative effect of synchrony by the net excitatory level was also observed with the MIP model as interaction structure for low values of the input correlation coefficient (see Figure 6B). This detrimental effect of input synchrony on the postsynaptic firing, at high levels of (net) excitation, has already been reported (Murthy & Fetz, 1994), albeit with a different input ensemble (see above).
Stroeve and Gielen (2001) used an input population identical to the SIP model and reported that the pairwise correlation of excitation affects the postsynaptic firing rate only if the inhibitory firing rate is sufficiently high. It must be pointed out that this statement follows from their particular choice of input parameters. They used a small input size (N = 120), which was effectively smaller compared to the input used here because the membrane potential of their model neuron was not in a depolarized state. As shown in this study, the effects of input correlation depend strongly on the size of the population. More important, their input parameters were such that a compensation effect, similar to what was described for Figure 6B (b = 0.2, c < 0.2), was in effect. This is clearly demonstrated by the irregularity of inter-spike intervals (Stroeve & Gielen, 2001) that increased with the input correlation. This shows that the neuron's firing was increasingly caused by large, randomly occurring spike clusters. As such input clusters induced membrane depolarizations beyond the firing threshold, the corresponding "loss of spikes" had to be compensated for by the increased efficacy of synchrony. Hence, as demonstrated in Figure 6B for low input correlation, it is appropriate to consider the net excitation level as a powerful means of modulating the effects of input correlation. This possibility was mentioned earlier by Salinas and Sejnowski (2000). Finally, we would like to stress that both input and output investigated here were stationary and that an important and obvious step toward more realistic conditions would be to study the impact of nonstationary input ensemble statistics.

Appendix A
In this appendix, we derive the cross-correlation function and count correlation coefficient for pairs of spike trains generated by either the SIP or the MIP model. Consider a homogeneous Poisson point process x(t) with rate E[x(t)] = λ. The autocorrelation Rxx(Δt) of this process is (Papoulis, 1991)

Rxx(Δt) = E[x(t) x(t + Δt)] = λ δ(Δt) + λ².

For two independent Poisson point processes x(t) and y(t) with rates λ and μ, respectively, the cross-correlation function Rxy(Δt) is

Rxy(Δt) = E[x(t) y(t + Δt)] = λμ.

We now consider dependent processes. Let v(t) and w(t) be defined as

v(t) = x(t) + z(t)
w(t) = y(t) + z(t),

where x(t), y(t), and z(t) are independent homogeneous Poisson processes with rates λ, μ, and ν, respectively. The cross-correlation function of v(t) and
w(t) is

Rvw(Δt) = Rxy(Δt) + Rxz(Δt) + Ryz(Δt) + Rzz(Δt)
        = ν δ(Δt) + ν² + λν + μν + λμ
        = ν δ(Δt) + (ν + λ)(ν + μ).

The count correlation coefficient for the two point processes v(t) and w(t) with respect to a fixed observation interval of duration T is defined in terms of the stochastic variables V and W giving the respective numbers of points encountered during the observation interval:

c_VW = cov(V, W) / √(var(V) var(W)).
For arbitrary point processes, the count correlation coefficient depends in general on the observation interval. For the two dependent Poisson processes v(t) and w(t) considered above, we have in particular

var(V) = var(X) + var(Z)
var(W) = var(Y) + var(Z)
cov(V, W) = var(Z),

with X, Y, and Z the stochastic variables giving the respective numbers of points in the processes x(t), y(t), and z(t), in the interval considered. For an observation interval of duration T, we have

var(X) = λT,  var(Y) = μT,  var(Z) = νT,

and thus obtain

c_VW = ν / √((ν + λ)(ν + μ)).
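This count correlation coefficient is easy to check by Monte Carlo simulation of the spike counts. The sketch below (our own illustration; all parameter values are arbitrary) draws counts for the SIP-style construction V = X + Z, W = Y + Z, and for MIP-style independent thinning of a common mother count, and compares the empirical correlations with ν/√((ν + λ)(ν + μ)) and with the copy probability c, respectively:

```python
import math
import random

random.seed(0)
n = 100_000  # number of simulated observation intervals

def poisson(mean):
    """Poisson sample via Knuth's multiplication algorithm (fine for small means)."""
    limit, k, prod = math.exp(-mean), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

def corr(xs, ys):
    """Sample correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# SIP-style counts: V = X + Z, W = Y + Z with independent Poisson parts.
lam, mu, nu = 8.0, 8.0, 2.0  # expected counts per interval (arbitrary)
Z = [poisson(nu) for _ in range(n)]
V = [poisson(lam) + z for z in Z]
W = [poisson(mu) + z for z in Z]
c_sip = corr(V, W)  # theory: nu / sqrt((nu+lam)*(nu+mu)) = 0.2

# MIP-style counts: thin a common mother count M with copy probability c.
c, mother = 0.3, 5.0
M = [poisson(mother) for _ in range(n)]
V2 = [sum(random.random() < c for _ in range(m)) for m in M]
W2 = [sum(random.random() < c for _ in range(m)) for m in M]
c_mip = corr(V2, W2)  # theory: equals c = 0.3

print(round(c_sip, 3), round(c_mip, 3))
```

With 100,000 intervals the standard error of each estimate is a few thousandths, so both empirical coefficients land close to their theoretical values.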
Note that this is independent of the observation length T. Thus, we may also simply use the term "pairwise correlation coefficient" for "count correlation coefficient." Normalizing the cross-correlation function of the two processes by the geometric mean of their respective rates, we obtain an expression of the cross-correlation as a function of the count correlation coefficient,

Rvw(Δt) / √((ν + λ)(ν + μ)) = c_VW δ(Δt) + √((ν + λ)(ν + μ)).

The count correlation coefficient thus corresponds to the net integral over the central peak in the (normalized) cross-correlation function. The general case represented by v(t) and w(t) is now readily applied to our two models of spike train ensembles. For the SIP model, we have ν = a
and λ = μ = b. Hence, the cross-correlation function and count correlation coefficient between any two spike trains are

R(Δt) = a δ(Δt) + (a + b)²,
c = a / (a + b).

For the MIP model, writing λ̂ = re/c for the rate of the generating process (see section 2.2), the rate of coincident spikes in any pair of spike trains is ν = c²λ̂ and λ = μ = cλ̂ − c²λ̂. The cross-correlation function thus is

R(Δt) = c²λ̂ δ(Δt) + (cλ̂)²,

and the count correlation coefficient equals c.

Appendix B
In this appendix, we develop an analytical approximation of the response of the FR neuron to the two models of spike train ensembles (SIP and MIP). The FR neuron was described above (see equations 3.1 through 3.3). The corresponding synaptic input current is described by a stationary stochastic process, and we are interested in the stationary output rate of the neuron. The knowledge of the stationary membrane potential distribution is sufficient to derive the mean output rate for this particular neuron model (see equation 3.5). If we use linear synapses and Poisson-driven synaptic events, the membrane potential is a shot noise. For pulse-like synaptic currents, the membrane realizes a first-order low-pass filter with time constant τ = RC (see equation 3.4). Gilbert and Pollak (1960) derived an integral equation for the shot noise distribution and solved it for several particular choices of the filter. For the filter h(t) defined in equation 3.4, with A = 1 and τ = 1, the shot noise S(t) with underlying rate λ has an amplitude distribution with density P(S) given by

P(S) = d S^(λ−1)   for 0 ≤ S < 1.

For larger values of S, P(S) is found by recursion from the integral equation,

P(S) = S^(λ−1) [ d − λ ∫₁^S P(x − 1) x^(−λ) dx ].

The constant d is given by

d = e^(−λγ) / Γ(λ),
where γ = −∫₀^∞ e^(−t) log t dt is Euler's constant and Γ(z) = ∫₀^∞ t^(z−1) e^(−t) dt is the gamma function. Thus, the analytic expression of P(S) changes at integer values of S. For τ arbitrary, the density is easily shown to be

P⋆(S) = D S^(λτ−1)               for 0 ≤ S < 1        (B.1)
P⋆(S) = D S^(λτ−1) φ(λτ, S)      for 1 ≤ S < 2,

where

φ(λτ, S) = 1 − (1/(S − 1))^(−λτ) F(λτ, λτ, 1 + λτ; 1 − S)

and F is the hypergeometric function

F(a, b, c; z) = Σ_{k=0}^∞ (a)_k (b)_k z^k / ((c)_k k!)

with (a)_k := a(a + 1) ⋯ (a + k − 1). The constant D is

D = e^(−λτγ) / Γ(λτ).
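The hypergeometric factor can be evaluated directly from its series, which converges on 1 ≤ S < 2 since |1 − S| < 1. The following sketch (our own code, not from the article) evaluates φ(λτ, S) this way and checks it against the λτ = 1 special case, where the factor reduces to 1 − log S:

```python
import math

def hyp_f(a, b, c, z, n_terms=300):
    """Truncated Gauss hypergeometric series F(a, b, c; z), valid for |z| < 1."""
    total, term = 0.0, 1.0
    for k in range(n_terms):
        total += term
        # (a)_{k+1}/(a)_k = a + k, and k! -> (k+1)! contributes 1/(k+1)
        term *= (a + k) * (b + k) / ((c + k) * (k + 1.0)) * z
    return total

def phi(lam_tau, S):
    """phi(lt, S) = 1 - (1/(S-1))**(-lt) * F(lt, lt, 1+lt; 1-S), for 1 < S < 2."""
    return 1.0 - (1.0 / (S - 1.0)) ** (-lam_tau) * hyp_f(
        lam_tau, lam_tau, 1.0 + lam_tau, 1.0 - S
    )

# For lambda*tau = 1, phi must reduce to 1 - log(S) on 1 <= S < 2.
for S in (1.1, 1.5, 1.9):
    assert abs(phi(1.0, S) - (1.0 - math.log(S))) < 1e-8
print(phi(1.0, 1.5))
```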
If λτ = 1, φ(λτ, S) = 1 − log S, and P⋆(S) simplifies to

P⋆(S) = e^(−γ)               for 0 ≤ S < 1
P⋆(S) = e^(−γ) (1 − log S)   for 1 ≤ S < 2,

which is the expression presented in Gilbert and Pollak (1960). We now formulate the approximation of the membrane potential density, P(U), for correlated excitatory and independent inhibitory input. The neuron's impulse response h(t) is given explicitly in equation 3.4. We fix the PSP amplitude at A = 1. Let us first consider the membrane response to the SIP model. Let Ne (Ni) be the number of excitatory (inhibitory) input spike trains and re (ri) the rate of each individual excitatory (inhibitory) spike train. The excitation can be decomposed into two independent Poisson processes (see section 2.1). Let UeN(t) be the filtered Ne-th-order interaction process and Ue1(t) the filtered first-order process. Ui(t) is the filtered inhibitory input spike train with total rate Ni ri. Thus, if t_k^eN, t_k^e1, and t_k^i denote the times of events in the corresponding pulse trains, we have

UeN(t) = Σ_k Ne h(t − t_k^eN)
Ue1(t) = Σ_k h(t − t_k^e1)
Ui(t)  = Σ_k −h(t − t_k^i).
Note that we chose the same absolute amplitude for the excitatory and inhibitory synaptic currents. As we are dealing with a linear system, we have U(t) = UeN(t) + Ue1(t) + Ui(t), and thus P(U) = (PeN ∗ Pe1 ∗ Pi)(U), the convolution product of the densities of the three independent components. We now use Campbell's theorem to obtain the mean and variance of each density. As ∫ h(t) dt = τ and ∫ h(t)² dt = τ/2, we have

E(UeN) = Ne a τ = Ne re c τ,         var(UeN) = Ne² a τ/2 = Ne² re c τ/2,
E(Ue1) = Ne b τ = Ne re (1 − c) τ,   var(Ue1) = Ne b τ/2 = Ne re (1 − c) τ/2,
E(Ui)  = −Ni ri τ,                   var(Ui)  = Ni ri τ/2.
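These Campbell-theorem moments make the dominance of the cluster component explicit for typical parameters. A small sketch (the parameter values, including τ = 10 ms, are our own choices for illustration):

```python
def sip_component_moments(N_e, r_e, c, Ni_ri, tau):
    """Means and variances of the three filtered SIP input components
    (Campbell's theorem, with PSP amplitude A = 1)."""
    a, b = c * r_e, (1.0 - c) * r_e  # cluster rate and independent rate per train
    mean = {"eN": N_e * a * tau, "e1": N_e * b * tau, "i": -Ni_ri * tau}
    var = {"eN": N_e**2 * a * tau / 2, "e1": N_e * b * tau / 2, "i": Ni_ri * tau / 2}
    return mean, var

# Illustration: Ne = 300, re = 20 spikes/s, c = 0.1, Ni*ri = 6000 spikes/s, tau = 0.01 s.
mean, var = sip_component_moments(300, 20.0, 0.1, 6000.0, 0.01)
# The Ne-th-order (cluster) component dominates the variance by more than an
# order of magnitude, which motivates approximating (Pe1 * Pi)(U) by a delta
# function in the step that follows.
print(var)
```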
For a large range of parameters, in particular for large excitatory populations, we have var(UeN) ≫ var(Ue1) + var(Ui), and we can approximate the convolution (Pe1 ∗ Pi)(U) by a Dirac delta function located at E(Ue1 + Ui). We obtain

P(U) ≈ PeN(U − Ne re (1 − c) τ + Ni ri τ).

The density of the membrane potential response of the FR neuron to the MIP model is approximated in a similar manner. Let Ue(t) be the filtered excitatory input. The interacting point processes in the MIP model can be decomposed into Ne + 1 independent interaction processes (see section 2.2). Let Uen(t) be the filtered nth-order interaction process, with rate B(n; c, Ne) re/c. The membrane potential of the FR neuron is then given by

U(t) = Ue(t) + Ui(t) = Σ_{n=0}^{Ne} Uen(t) + Ui(t).

The variance of Ue(t) is var(Ue) = Ne re (1 − c + cNe) τ/2. If the relation var(Ue) ≫ var(Ui) holds true, we can approximate Pi(U) by a Dirac delta function located at E(Ui) and write

P(U) ≈ Pe(U + Ni ri τ).

In addition, we approximate Ue(t) by UenN(t) = Ne c Σ_k h(t − t_k^enN), where t_k^enN are the times of pulses in the generating process (wg(t); see section 2.2). Effectively, we let the order of interaction be deterministic and approximate the Ne interaction processes by a single process with amplitude E(n) = Ne c and with the same rate re/c as the generating process. We then get

P(U) ≈ PenN(U + Ni ri τ).

The membrane potential density for both input models can thus be approximated by the expression of the shot noise density (see equation B.1), using a linear variable transformation U = kS − l. We finally obtain

P(U) ≈ (1/k) P⋆((U + l)/k),
with an appropriate choice of k, l, and of the parameters of the density function P⋆ (see equation B.1). The mean output rate ro of the FR neuron is then obtained by integrating over the tail of the potential distribution (see equation 3.5) and is expressed in a piecewise manner too. The first two terms are

ro ≈ 1 − (D/λτ) ((θ + l)/k)^(λτ)
     for −l ≤ θ < k − l,

ro ≈ 1 − (D/λτ) [ ((θ + l)/k)^(λτ) − λτ ∫_{k−l}^{θ} ((U + l)/k)^(λτ−1) (k/(U + l − k))^(−λτ) F(λτ, λτ, 1 + λτ; 1 − (U + l)/k) dU/k ]
     for k − l ≤ θ < 2k − l.
γ > 0 is the membrane leak rate and δ is the delta function. T_{i′,j} is the time when the jth spike generated by the i′th sensory neuron arrives at the ith cortical neuron. Here, the delay between the sensory and cortical layers is uniformly zero or equivalently uniformly constant. T′_{i′,j} is the jth firing time of the i′th cortical neuron, and ε is the amplitude of a feedforward spike. The ith cortical neuron receives incident spikes only from the i′th sensory neuron with i′ ∈ S_i, where |S_i| (= n′₁) is the number of randomly selected sensory neurons. Although we do not consider noise sources such as ongoing activity (Arieli, Sterkin, Grinvald, & Aertsen, 1996; Azouz & Gray, 1999) composed of spikes from neurons outside the network, they
N. Masuda and K. Aihara
would also modulate the network behavior and preferred coding schemes in a similar way to dynamical gaussian white noise (Masuda & Aihara, 2002b; van Rossum et al., 2002). The delay is τ = 2.6 ms (Abeles, 1991; Diesmann et al., 1999), and the firing rate is kept at 20 Hz (de Oliveira, Thiele, & Hoffmann, 1997; Zohary, Shadlen, & Newsome, 1994). The membrane leak rate should be around 0.05 to 0.1 ms⁻¹ (Abeles, 1982; Koch, 1999), but we use larger values to emphasize the duality of the two complementary coding schemes. We analyze the network dynamics in the high-input regime by setting n′₁ = 240 and n₂ = 30, where 100 to 200 incident spikes are necessary for a cortical neuron to fire (Gerstner et al., 1996; Shadlen & Newsome, 1998; Diesmann et al., 1999). The spike trains emitted by the cortical neurons are observed by the virtual synchronous coder and the virtual population rate coder, as shown in Figure 1. We measure the degree of synchrony by calculating syn, which is defined as the ratio of the number of synchronous firing events to the average number of single neuron spikes. A synchronous firing event is defined as spike arrivals from more than n₂/2 cortical neurons within tw = 1.5 ms (Softky & Koch, 1993; Hatsopoulos et al., 1998; Aihara & Tokuda, 2002). Full synchrony gives syn = 1, whereas fully asynchronous firing results in syn = 0. The population rate of synchronous firing usually represents the input signal more robustly but less accurately than that of asynchronous firing. The accuracy of the synchronous coding is equal to that of coding by a single neuron (Masuda & Aihara, 2002b; van Rossum et al., 2002), although the noise tolerance is better (Aihara & Tokuda, 2002). The performance of population rate coding is evaluated by the correlation coefficient corr between the input signal S(t) and the short-time population firing rate of the cortical neurons, that is, the number of spikes from all the cortical neurons in each bin of width w = 3.8 ms (Shadlen et al., 1996).
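The synchrony measure syn can be sketched as follows. This is our own minimal implementation; the text leaves the window convention open, so the use of disjoint windows of width tw and of the pooled spike count as a stand-in for the number of distinct neurons are assumptions:

```python
def syn_measure(spike_times, n_neurons, t_w, t_max):
    """Degree of synchrony: number of synchronous firing events (more than
    n_neurons/2 spike arrivals inside a window of width t_w, which equals
    the number of firing neurons when each fires at most once per window)
    divided by the average number of spikes per neuron.

    spike_times: list of per-neuron lists of spike times (ms).
    """
    # Pool all spikes and count how many fall into each disjoint t_w window.
    all_spikes = [t for train in spike_times for t in train]
    counts = [0] * (int(t_max / t_w) + 1)
    for t in all_spikes:
        counts[int(t / t_w)] += 1
    events = sum(1 for count in counts if count > n_neurons / 2)
    mean_spikes = len(all_spikes) / n_neurons
    return events / mean_spikes if mean_spikes > 0 else 0.0

# Fully synchronous volleys: 10 neurons sharing 5 spike times -> syn = 1.
volleys = [10.0, 30.0, 50.0, 70.0, 90.0]
print(syn_measure([list(volleys) for _ in range(10)], 10, 1.5, 100.0))  # 1.0
# Fully dispersed: one spike per neuron at distinct times -> syn = 0.
print(syn_measure([[7.0 * i] for i in range(10)], 10, 1.5, 100.0))  # 0.0
```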
Given

s₁(i) = (1/w) ∫_{(i−1)w}^{iw} S(t) dt

and s₂(i) = the number of spikes observed for t ∈ [(i − 1)w, iw], the correlation coefficient between the two time series {s₁(i)} and {s₂(i)} defines corr.

2.2 External Inputs. The chaotic input S(t) was generated from the Lorenz system, the dynamics of which is represented by the following equations:

dx/dt = a (10y − 10x),
dy/dt = a (28x − y − xz),                      (2.2)
dz/dt = a (−(8/3) z + xy).
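Equation 2.2, together with the scaling S(t) = 2.25 × 10⁻² + 7.00 × 10⁻⁴ x given in the text, can be integrated with a simple Euler scheme. The step size and initial condition below are our own choices, not taken from the article:

```python
def lorenz_input(n_steps, dt=0.01, a=0.03, state=(1.0, 1.0, 1.0)):
    """Euler-integrate the slowed Lorenz system (equation 2.2) and return
    the input signal S(t) = 2.25e-2 + 7.00e-4 * x."""
    x, y, z = state
    signal = []
    for _ in range(n_steps):
        dx = a * (10.0 * y - 10.0 * x)
        dy = a * (28.0 * x - y - x * z)
        dz = a * (-(8.0 / 3.0) * z + x * y)
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        signal.append(2.25e-2 + 7.00e-4 * x)
    return signal

S = lorenz_input(50_000)
# The trajectory stays on the Lorenz attractor (|x| well below 30), so the
# signal remains bounded while varying chaotically in time.
assert all(abs(s - 2.25e-2) < 30 * 7.00e-4 for s in S)
assert max(S) > min(S)
```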
Rate Coding and Temporal Coding
The continuous input signal is defined as S(t) = 2.25 × 10⁻² + 7.00 × 10⁻⁴ x and models nonrandom dynamical rules with complexity contained in complex external stimuli such as visual scenes and sounds. The change rate a is fixed at 0.03 so that the characteristic timescale of S(t) is smaller than the interspike interval. As a result, the synchronous coding cannot capture rapid signal change, and the firing rate can encode only roughly the temporal waveform of S(t). This fact is revealed by low values of corr for the synchronous firing. Population firing rates with asynchronous firing are more strongly correlated with S(t). Our results qualitatively remain the same when we apply stochastic inputs, which are often used to study information processing by neural networks (see section 3.4).

3 Selection of Effective Coding Modes
We investigate which coding mechanism is effective when values of model parameters are changed. The parameters examined here have different functional relevance and affect the choice of the optimal coding in different ways.

3.1 Shared Connectivity. Two neurons often share incident connections, especially when the neurons are geometrically or functionally close to each other (Aertsen, Gerstein, Habib, & Palm, 1989; Zohary et al., 1994; Shadlen et al., 1996). They usually share only a part of inputs, and the feedforward and also feedback coupling structure is far from all-to-all. Incomplete coupling introduces quenched noise that disperses the membrane potentials of two neurons starting from the same initial level. The shared connectivity r = n′₁/n₁, that is, the ratio of the average number of the shared connections to that of the total connections (n′₁), is biologically between 0.1 and 0.3 (Zohary et al., 1994; Shadlen et al., 1996; Shadlen & Newsome, 1998; Hatsopoulos et al., 1998; Diesmann et al., 1999). Shared connections, or common incident spikes, are likely to drive the neural population to synchrony, as do common feedback spikes delivered to many neurons (Bottani, 1995). Synchronous firing can occur theoretically even without feedback couplings (Knight, 1972; Feng, Brown, & Li, 2000), though biologically it may be produced by cooperation of feedback interactions and correlated inputs (Eckhorn et al., 1988; Hatsopoulos et al., 1998; Brecht et al., 1999). We examine how r influences network behavior and the efficiency of the two codings. As shown in Figure 3, we change the number of shared connections between the sensory layer and the cortical layer by changing n₁ and keeping all the other parameters constant. Since the simulation results depend on initial conditions, syn and corr shown in this figure, as well as in the following figures, are the averages of 50 trials.
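The shared-connectivity values used in the simulations follow directly from r = n′₁/n₁ with n′₁ = 240 held fixed; a trivial check of the three cases reported in the figure captions:

```python
# Shared connectivity r = n'_1 / n_1 for the three network sizes examined,
# with n'_1 = 240 connections per cortical neuron held fixed.
n1_prime = 240
ratios = {n1: n1_prime / n1 for n1 in (480, 960, 8000)}
print(ratios)  # {480: 0.5, 960: 0.25, 8000: 0.03}
```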
Figure 3 demonstrates that the population rate coding performs better for smaller r, whereas the neurons are more synchronous for larger r. This result holds regardless of the feedback strength within the cortical layer (see Figure 4).
110
N. Masuda and K. Aihara
Figure 3: Change in the degree of synchrony measured by syn and performance of the population rate coder measured by corr when the shared connectivity r varies. We set n₁ = n′₁/r, n₂ = 30, n′₁ = 240, ε = 0.007, γ = 40 ms⁻¹, and ε′ = 0.002.
Figure 4: Change in syn (A) and corr (B) when the feedback strength ε′ varies. n₂ = 30, n′₁ = 240, ε = 0.007, γ = 40 ms⁻¹, and ε′ = 0.002. The lines in each panel correspond to the three examined cases, with n₁ = 480, n₁ = 960, and n₁ = 8000, corresponding to r = 0.50, r = 0.25, and r = 0.03, respectively.
Correlated inputs with large r produce relatively invariant responses of the cortical neurons and interfere with efficient signal averaging, which results in biases and a loss of reliability in signal estimation. The information transmission capacity from the sensory layer to the cortical layer is decreased in this situation (Stein, 1967; Shadlen & Newsome, 1998). However, robust temporal coding with synchronous firing is possible based on coincidence detection. When r is small, cortical neurons are likely to desynchronize to participate in reliable rate coding. Effective pooling of independent information from asynchronous cortical neurons leads to better estimation of S(t) (Shadlen et al., 1996; Shadlen & Newsome, 1998). Changing r continuously influences the efficiency of the synchronous coding and the population rate coding in opposite directions.

3.2 Feedback Strength. We examine the influence of feedback coupling ε′ among cortical neurons. Feedback interactions generally compel networks
toward characteristic states such as full synchrony, clustered, and phaselocked states (Domany et al., 1994; Bressloff & Coombes, 2000). Figure 4 shows that the cortical layer is a reliable rate coder when 2 is small over a wide range of shared connectivity, whereas the cortical neurons synchronize when 2 is large. 3.3 Membrane Leak Rate. The membrane leak rate c of the cortical neurons is another modulatable parameter. Simulation studies suggest that it can vary rapidly in response to the background activity (Bernander et al., 1991). Networks of LIF neurons with large c synchronize more readily than LIF neurons with small c or nonleaky (c D 0) neurons (Knight, 1972; Bressloff & Coombes, 2000; Feng et al., 2000; Masuda & Aihara, 2001). Therefore, we expect that the preferred coding mode would change gradually from the population rate coding to the synchronous coding as c increases. Changes in ring rates cause signicant biases in the performance measures syn and corr when a in equation 2.2 is xed. Consequently, to make the ring rates xed for the sake of fair comparisons, we adjust 2 for each c in the numerical analyses. The simulation result shown in Figure 5, however, supports our conjecture only for c 2 [0.00, 0.02]. Efcient population rate coding as well as a certain level of synchrony is achieved for larger c . The simultaneous large values of syn and corr for large c are contrasted with the opposed effects on these quantities of varying r or 2 . The relevant parameter range includes the biologically plausible leak rate of c D 0.05 ¡ 0.1 ms ¡1 (Abeles, 1982; Diesmann et al., 1999; Koch, 1999). Although large c could cause signal estimation to deteriorate because of the weighted integration of the signal performed by the LIF neuron (Knight, 1972; Masuda & Aihara, 2002a), its contribution to corr is actually marginal. However, c also modulates the input preference. 
An LIF neuron with large c operates as a coincidence detector that res only when multiple spikes arrive in a short
Figure 5: Change in syn and corr when the membrane leak γ varies. n_1 = 480, n'_1 = 240, n_2 = 30, and ε_2 = 0.003. We adjust ε_1 to keep the firing rates constant.
time window (Abeles, 1991; Softky & Koch, 1993; Fujii et al., 1996; Koch, 1999). Such a cortical neuron detects the peaks of S(t), at which the sensory neurons send relatively many coincident spikes to the cortical neurons (Masuda & Aihara, 2002a). As a result, the cortical neurons fire synchronously for increased γ, and the rate of synchronous firing events encodes the amplitude of S(t). Since the temporal waveform of S(t) is effectively characterized by the location and the amplitude of its peaks, corr becomes large. Introduction of a refractory period would prevent high-frequency synchronous bursting and therefore eliminate the emergence of simultaneously large syn and corr.

3.4 Heterogeneity. Unlike the homogeneous model neurons considered above, real neurons are naturally heterogeneous. Heterogeneity drives neural populations out of synchrony (White, Chow, Ritt, Soto-Treviño, & Kopell, 1998; Tiesinga & José, 2000). We consider heterogeneity in feedforward coupling strength, membrane leak rate, and firing threshold. For each neuron or synapse, one of these parameters is assigned a value randomly according to the uniform distributions [ε_0(1 − D_ε), ε_0(1 + D_ε)], [γ_0(1 − D_γ), γ_0(1 + D_γ)], or [H_0(1 − D_H), H_0(1 + D_H)], where ε_0 = 0.007, γ_0 = 0.025 ms⁻¹, and H_0 = 1. The degrees of heterogeneity are parameterized by D_ε, D_γ, and D_H. Figure 6 shows that in all three cases, the cortical neurons prefer the synchronous coding when the network is close to homogeneous, while the population rate coding is more efficient with increased heterogeneity. Figure 6D, in comparison with Figure 6C, shows that the conclusion remains the same when a stochastic instead of chaotic input is applied. The stochas-
Figure 6: Change in syn and corr when the cortical neurons are heterogeneous. We consider heterogeneity in (A) the feedforward coupling strength ε_1, (B) the membrane leak rate γ, and (C, D) the firing threshold H of the cortical neurons. The Lorenz and filtered gaussian noise inputs are applied in A, B, C, and in D, respectively. We set n_1 = 480, n'_1 = 240, n_2 = 30, and ε_2 = 0.003. The horizontal axes show the width of the uniform distribution normalized by the average.
tic input signal is generated from the Ornstein-Uhlenbeck process, or the filtered gaussian noise (Mainen & Sejnowski, 1995; van Rossum et al., 2002), defined by

dx/dt = −x/τ_OU + ξ(t),
S(t) = 2.25 × 10⁻² + 4.00 × 10⁻⁴ x,  (3.1)
where τ_OU = 8.0 ms and ξ(t) is gaussian white noise applied at every time step dt = 0.02 ms. Heterogeneous resetting (Mar et al., 1999) and transmission delays of feedforward couplings (Gewaltig, Diesmann, & Aertsen, 2001) would introduce additional jitter. The biases in estimating S(t) that heterogeneity could cause are negligible. Actually, corr grows to fairly large values (corr ≥ 0.6) for strong heterogeneity, compared with its values when the network is homogeneous or when r or ε_2 is small. Although the firing rates of the cortical neurons differ under heterogeneity, S(t) can be estimated reliably from the collection of heterogeneous spike trains. The network states easily escape from constrained states such as full synchrony or clustered states (Mar et al., 1999; Guardiola et al., 2000). Homogeneous neurons tend to synchronize no matter how small r, ε_2, and γ are (Knight, 1972). Even when r = 0, two spike trains arriving at two cortical neurons still resemble each other because they encode the same stimulus, just with different phases. Synchronous firing of coupled oscillators is considered important (Eckhorn et al., 1988; Gray, König, Engel, & Singer, 1989; Hatsopoulos et al., 1998; Brecht et al., 1999; Steinmetz et al., 2000), and mathematical studies have explored how much heterogeneity in, for example, the firing thresholds and coupling strengths (Bottani, 1995) a stable synchronization can endure. However, it has been shown that even less than 10% heterogeneity in the firing rates easily desynchronizes neurons (White et al., 1998; Tiesinga & José, 2000). Heterogeneous neurons do not necessarily synchronize all the time. Asynchronous firing can provide accurate estimates of the instantaneous firing rate (Knight, 1972; van Rossum et al., 2002).
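As a concrete illustration, the filtered gaussian noise input of equation 3.1 can be integrated with a simple Euler scheme. The values τ_OU = 8.0 ms and dt = 0.02 ms follow the text; treating ξ(t) as unit-intensity white noise (each step receives a gaussian increment of standard deviation √dt) is our assumption, since the noise amplitude is not restated here. A minimal sketch:

```python
import math
import random

def ou_input(t_max=2000.0, dt=0.02, tau_ou=8.0, seed=0):
    """Euler scheme for dx/dt = -x/tau_ou + xi(t), then S(t) = 2.25e-2 + 4.00e-4 * x.

    Times are in ms.  xi(t) is taken as unit-intensity gaussian white noise
    (an assumption), so each step adds sqrt(dt) * N(0, 1).
    """
    rng = random.Random(seed)
    x = 0.0
    signal = []
    for _ in range(int(t_max / dt)):
        x += (-x / tau_ou) * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
        signal.append(2.25e-2 + 4.00e-4 * x)
    return signal
```

The signal fluctuates around 2.25 × 10⁻², with excursions set by the stationary standard deviation of x, √(τ_OU/2) = 2.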
4 Parameter Unification
We have seen the effects of changing several model parameters on the coding modes. Cortical neurons are likely to prefer the synchronous coding when the shared connectivity is large, strong feedbacks are present, the membrane leak is large, or the cortical neurons are fairly homogeneous. The performance as a population rate coder is highest in the opposite extremes, except when the membrane leak is changed. The leak rate determines how inputs are filtered, as well as the excitation dynamics. Selection of the coding scheme depends on the parameter values in a continuous manner. We define an order parameter σ that is essential in determining the preferred coding mode. The value of σ measures the accumulated dispersion of the membrane potentials of two neurons after the duration of a typical interspike interval. We assume that the arrival of presynaptic spikes is a Poisson process with parameter λ and that the input is approximated by a diffusion process (Lansky, 1984; Tuckwell, 1988). The membrane potential v_i of the ith neuron
follows the stochastic differential equation

dv_i(t) = (μ − γ_i v_i(t)) dt + √D dB_i(t),  (4.1)

μ = λ E_ε[ε_{i,j}],  (4.2)

D = λ E_ε[ε_{i,j}²],  (4.3)
λ = 220/(40 ms) = N/T,
where γ_i and ε_{i,j} are the leak rate of the ith cortical neuron and the amplitude of an incident spike from the sensory layer to the cortical neuron, respectively. The Brownian motion is denoted by B_i(t), E is the expectation, T is the typical interspike interval, and N is the number of incident spikes necessary for a cortical neuron to fire. We assume that ε_{i,j}, γ_i, and also H_i, the firing threshold of the ith neuron, are taken from uniform distributions on [ε_0(1 − D_ε), ε_0(1 + D_ε)], [γ_0(1 − D_γ), γ_0(1 + D_γ)], and [H_0(1 − D_H), H_0(1 + D_H)], respectively. Consequently, equations 4.2 and 4.3 reduce to

μ = λ ε_0,  (4.4)

D = λ ε_0² (1 + D_ε²/3).  (4.5)
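A minimal Euler-Maruyama simulation of the diffusion approximation, equation 4.1, can serve as a sanity check. The values λ = 5.5 ms⁻¹, ε_0 = 0.007, and γ_0 = 0.025 ms⁻¹ follow the text; the homogeneous case D_ε = 0 is assumed:

```python
import math
import random

LAM, EPS0, GAMMA = 5.5, 0.007, 0.025     # per ms; values from the text
MU = LAM * EPS0                          # equation 4.2 with homogeneous couplings
D = LAM * EPS0 ** 2                      # equation 4.3 with D_eps = 0

def endpoint_v(t_max=240.0, dt=0.05, seed=0):
    """Integrate dv = (MU - GAMMA*v) dt + sqrt(D) dB (equation 4.1), no threshold reset."""
    rng = random.Random(seed)
    v = 0.0
    for _ in range(int(t_max / dt)):
        v += (MU - GAMMA * v) * dt + math.sqrt(D * dt) * rng.gauss(0.0, 1.0)
    return v
```

Without a reset, v settles around μ/γ = 1.54, above the threshold H_0 = 1, consistent with sustained firing of the cortical neurons.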
Supposing v_i(0) = 0, it follows from equations 4.1 and 4.4 that

v_i(t) = ∫₀ᵗ e^{−γ_i(t−t′)} (λ ε_0 dt′ + √D dB_i(t′)).  (4.6)
We first consider the fluctuation of the interspike interval T caused mainly by the heterogeneous H_i and γ_i. We calculate T and its standard deviation σ_1 without considering input jitter. Substituting v_i(t) = H_i into equation 4.6 and ignoring the diffusive term, the interspike interval T is estimated by

T = −(1/γ_i) log(1 − γ_i H_i/(λ ε_0)).  (4.7)

Expansion of the right-hand side of equation 4.7 to first order in D_H and D_γ yields

T = A − (A + B)(γ_i − γ_0)/γ_0 − B(H_i − H_0)/H_0,  (4.8)

where

A = −(1/γ_0) log(1 − γ_0 H_0/(λ ε_0)) = E_{H,γ}[T],  (4.9)

and

B = −H_0/(λ ε_0 − γ_0 H_0).  (4.10)

On the basis of equation 4.8 and the assumption that H_i and γ_i are uniformly distributed, we have

σ_1² = σ_1²(γ, D_H, D_γ) ≡ E_{H,γ}[(T − A)²] = (D_γ²/3)(A + B)² + (D_H²/3) B².  (4.11)
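Equations 4.7 and 4.9 to 4.11 can be checked directly: the first-order formulas for A, B, and σ_1² should agree with the mean and variance of the exact expression, equation 4.7, averaged over the uniform parameter distributions. The values λ = 5.5 ms⁻¹, ε_0 = 0.007, γ_0 = 0.025 ms⁻¹, and H_0 = 1 follow the text; the heterogeneity levels D_γ = D_H = 0.1 are illustrative assumptions:

```python
import math

def isi_moments(lam, eps0, gamma0, H0, D_gamma, D_H):
    """A (eq. 4.9), B (eq. 4.10), and the first-order variance sigma_1^2 (eq. 4.11)."""
    A = -(1.0 / gamma0) * math.log(1.0 - gamma0 * H0 / (lam * eps0))
    B = -H0 / (lam * eps0 - gamma0 * H0)
    var = (D_gamma ** 2 / 3.0) * (A + B) ** 2 + (D_H ** 2 / 3.0) * B ** 2
    return A, B, var

def isi_exact(lam, eps0, gamma, H):
    """Equation 4.7: noise-free time for v_i to climb from 0 to the threshold H."""
    return -(1.0 / gamma) * math.log(1.0 - gamma * H / (lam * eps0))
```

Averaging isi_exact over a fine grid of (γ, H) values reproduces A and σ_1² up to the expected first-order accuracy.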
We next define σ_2, which estimates how the membrane potentials of two neurons starting from the same resting potential disperse by the time they come close to the threshold. To evaluate the difference between v_{i1}(T) and v_{i2}(T), it should be noted that B_{i1}(t) and B_{i2}(t) are not independent of each other because of the shared connectivity. For any L²([0, T])-bounded function f(t), it follows from the Itô isometry that

E_v[(∫₀ᵀ f(t) dB(t))²] = ∫₀ᵀ f(t)² dt,  (4.12)

where E_v[·] denotes the average over the probability measure of the Brownian motion. We obtain

E_v[∫₀ᵀ f(t) dB_{i1}(t) ∫₀ᵀ f(t) dB_{i2}(t)] = r ∫₀ᵀ f(t)² dt,  (4.13)

since the proportion r of the total incident spike inputs is shared by a pair of neurons (Shadlen & Newsome, 1998). We then normalize v_i(t) so that v_{i1}(T) and v_{i2}(T) can be compared with the same effective threshold. By substituting v_i(t) = H_i ṽ_i(t)/H_0 into equation 4.1, we obtain

dṽ_i(t) = (μ H_0/H_i − γ_i ṽ_i(t)) dt + (√D H_0/H_i) dB_i(t),  (4.14)
and the threshold for firing is normalized to ṽ = H_0. Using E_{H,γ,v}[ṽ_{i1}(t) − ṽ_{i2}(t)] = 0 and equations 4.5, 4.6, 4.12, and 4.13, the second-order approximation leads to

σ_2² = σ_2²(r, γ, D_ε, D_γ, D_H) ≡ E_{H,γ,v}[(ṽ_{i1}(T) − ṽ_{i2}(T))²]
 = E_{H,γ,v}[(∫₀ᵀ e^{−γ_{i1}(T−t)} (H_0/H_{i1})(λ ε_0 dt + √D dB_{i1}(t)) − ∫₀ᵀ e^{−γ_{i2}(T−t)} (H_0/H_{i2})(λ ε_0 dt + √D dB_{i2}(t)))²]
 = E_{H,γ}[(∫₀ᵀ e^{−γ_{i1}(T−t)} (H_0/H_{i1}) λ ε_0 dt − ∫₀ᵀ e^{−γ_{i2}(T−t)} (H_0/H_{i2}) λ ε_0 dt)²
  + ∫₀ᵀ e^{−2γ_{i1}(T−t)} (H_0²/H_{i1}²) D dt + ∫₀ᵀ e^{−2γ_{i2}(T−t)} (H_0²/H_{i2}²) D dt
  − 2r ∫₀ᵀ e^{−(γ_{i1}+γ_{i2})(T−t)} (H_0²/(H_{i1} H_{i2})) D dt],

where the cross term carries the factor r of equation 4.13.
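The shared-input correlation of equation 4.13 is easy to reproduce: give each Brownian increment a common component of weight √r and an independent component of weight √(1 − r). With f ≡ 1, equation 4.13 predicts E[B_{i1}(T) B_{i2}(T)] = rT. A sketch (the explicit construction is our assumption; the text only states the correlation):

```python
import math
import random

def correlated_pair(r, T=1.0, dt=0.01, rng=None):
    """Two Brownian paths whose increments share a fraction r of their variance."""
    rng = rng or random.Random(0)
    b1 = b2 = 0.0
    s = math.sqrt(dt)
    for _ in range(int(T / dt)):
        common = rng.gauss(0.0, 1.0)
        b1 += s * (math.sqrt(r) * common + math.sqrt(1.0 - r) * rng.gauss(0.0, 1.0))
        b2 += s * (math.sqrt(r) * common + math.sqrt(1.0 - r) * rng.gauss(0.0, 1.0))
    return b1, b2
```

Averaging the product b1 * b2 over many pairs recovers rT, the prediction of equation 4.13 with f ≡ 1.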
there exists c > 0 such that

f_n(r_0, θ) − f_n(r, θ) > c > 0  (3.14)

for |r − r_0| > ε, for arbitrarily small ε > 0, and we have, for any r ≠ r_0,

q(r, θ)/q(r_0, θ) = exp[n{f_n(r, θ) − f_n(r_0, θ)}] → 0,  (3.15)
as n → ∞. In this case, q(r, θ) is concentrated at r_0. In order to have a widely spread q(r), f_n(r, θ) should be of order 1/n. This gives the following theorem.

Theorem 2. When the activity distribution q(r) is not concentrated, f_n(r, θ) = O(1/n), and the interactions θ_2, ..., θ_{n−2} scale as

θ_2 = O(1/n),  θ_3 = O(1/n²),  ...,  θ_k = O(1/n^{k−1}),  k ≤ n/2,
θ_k = O(1/n^{n−k}),  k > n/2.  (3.16)
Moreover, θ satisfies, in the limit n → ∞,

H(r) = −lim_{n→∞} (1/n) Σ_j θ_j k_{j,n}(r).  (3.17)

Proof. Since f_n(r, θ) is a polynomial, for a nonconcentrated distribution q(r),

lim_{n→∞} n[f_n(r, θ) − f_n(r_0, θ)] < ∞  (3.18)

is required. This holds only when f_n(r, θ) − f_n(r_0, θ) is of order 1/n; hence all the θ_j scale as mentioned, and equation 3.16 holds.

We now consider the case when there are no higher-order interactions. The simplest one is that all the θ_j = 0 except for θ_1. The distribution p(x, θ) is then independent, and q(r) = δ(r − r_0), where r_0 = E[r], is easily determined from θ_1. The second simplest case is that all θ_i are zero except for θ_1 and θ_2, that is, θ_3 = θ_4 = ⋯ = 0. This is the case where only pairwise interactions exist.
Synchronous Firing and Higher-Order Interactions in Neuron Pool
135
The stationary distribution of the Boltzmann machine belongs to this class. The distribution is written as

q(r, θ) = exp[n{H(r) + θ_1 r + (θ̄_2/2) r(r − 1/n)} − ψ],  (3.19)

where θ̄_2 = nθ_2. The function

f_n(r) = H(r) + θ_1 r + θ̄_2 r²/2  (3.20)

has a peak in the interval [0, 1] for θ̄_2 < 0. Hence, we have a concentrated distribution. When θ̄_2 is positive and large, there can be two concentrated peaks, at r = 0 and r = 1, one of which dominates except in the symmetric case. By similar reasoning, we can prove that there is no widespread distribution q(r) when there is only a finite number of higher-order interactions.

Theorem 3. Weak higher-order interactions of almost all orders are required for realizing a widespread activity distribution.
Remark. When θ_1 = nθ̄_1 is of order n and θ_2 is of order 1, they dominate over the entropy term, and

q(r, θ) ≈ exp[(n²/2){θ̄_2 r² − θ̄_1 r} − ψ].  (3.21)

Hence, for θ̄_2 > 0, the distribution has two concentrated delta peaks, at r = 0 and r = 1. This is the case of Bohte et al. (2000).

Remark. For a symmetric distribution with E[x_i] = r̄ and Cov[x_i, x_j] = ρ, the variance of r is

V[r] = E[{(1/n) Σ_i (x_i − r̄)}²] = (1/n) V[x_i] + ((n−1)/n) ρ.  (3.22)
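Equation 3.22 can be checked with a small simulation of exchangeable binary neurons. The mixture construction below is our assumption, chosen only to realize a positive pairwise covariance: on each trial the common firing probability is r̄ + d or r̄ − d with equal probability, which gives E[x_i] = r̄ and Cov[x_i, x_j] = d² = ρ:

```python
import random

def population_rate_variance(n=50, rbar=0.3, d=0.1, trials=20000, seed=0):
    """Sample V[r] for exchangeable binary neurons with Cov[x_i, x_j] = d**2.

    Per trial, the common firing probability is rbar + d or rbar - d with equal
    probability (an illustrative mixture construction), so that E[x_i] = rbar,
    V[x_i] = rbar*(1 - rbar), and Cov[x_i, x_j] = d**2 = rho.
    """
    rng = random.Random(seed)
    rs = []
    for _ in range(trials):
        p = rbar + (d if rng.random() < 0.5 else -d)
        k = sum(1 for _ in range(n) if rng.random() < p)
        rs.append(k / n)
    m = sum(rs) / trials
    return sum((r - m) ** 2 for r in rs) / trials
```

The sample variance of r matches the prediction (1/n) V[x_i] + ((n−1)/n) ρ of equation 3.22.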
When ρ > 0, the distribution of r is not concentrated but is spread with variance ρ. This shows that whenever the covariance of x_i and x_j is a positive constant of order 1, q(r) is widespread and higher-order interactions necessarily exist.

4 Calculations of Interactions
We calculate the higher-order interactions from a distribution q(r). To this end, we introduce random variables X̃_1(x), ..., X̃_n(x) such that X̃_i(x) is equal to 1 when Σ_j x_j = i, that is, when exactly i neurons are excited, and is
136
S. Amari, H. Nakahara, S. Wu, and Y. Sakai
equal to 0 otherwise. This X̃_i is different from X_i, and we have

X̃_k = Σ x_{i_1} ⋯ x_{i_k} (1 − x_{i_{k+1}}) ⋯ (1 − x_{i_n}),  (4.1)
where the summation is taken over all permutations of the indices. We put

η_k = E[X_k],  (4.2)

η̃_k = E[X̃_k].  (4.3)
We then have the following linear relations:

η̃_i = Σ_{j=1}^n A_{ij} η_j;  η_i = Σ_{j=1}^n B_{ij} η̃_j.  (4.4)

Theorem 4. The matrices A = (A_{ij}) and B = (B_{ij}) are mutually inverse, and

A_{ij} = (−1)^{j−i} · jCi,  (4.5)

B_{ij} = jCi.  (4.6)
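Theorem 4 is easy to verify numerically: with A_{ij} = (−1)^{j−i} · jCi and B_{ij} = jCi (indices running from 1 to n, and jCi = 0 for i > j), the product AB is the identity. A sketch:

```python
from math import comb

def build_A(n):
    """A_ij = (-1)**(j - i) * C(j, i) (equation 4.5), indices 1..n."""
    return [[(-1) ** (j - i) * comb(j, i) for j in range(1, n + 1)]
            for i in range(1, n + 1)]

def build_B(n):
    """B_ij = C(j, i) (equation 4.6)."""
    return [[comb(j, i) for j in range(1, n + 1)] for i in range(1, n + 1)]

def matmul(X, Y):
    m = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]
```

Both matrices are upper triangular with unit diagonal, so mutual inversion can be confirmed exactly in integer arithmetic.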
Proof. Let us put

x̄_i = 1 − x_i,  (4.7)

so that x_i + x̄_i = 1. We then have the following relations:

x_1 ⋯ x_k = x_1 ⋯ x_k (x_{k+1} + x̄_{k+1}) ⋯ (x_n + x̄_n),  (4.8)

x_1 ⋯ x_k (1 − x_{k+1}) ⋯ (1 − x_n) = x_1 ⋯ x_k x̄_{k+1} ⋯ x̄_n.  (4.9)

By expanding the above relations and taking expectations, we obtain the relations that connect {E[x_1 ⋯ x_k]} and {E[x_1 ⋯ x_k x̄_{k+1} ⋯ x̄_n]}. In order to obtain the explicit forms of A and B, we use the identity

nCk · (n−k)Cj = nC(k+j) · (k+j)Ck.  (4.10)

Now we introduce new parameters θ̃_1, ..., θ̃_n by

θ̃_i = Σ_j B_{ji} θ_j,  θ_j = Σ_i A_{ij} θ̃_i.  (4.11)
Then we have another expression of the probability distribution p(x),

p(x, θ̃) = exp{Σ_j θ̃_j X̃_j − ψ},  (4.12)

because Σ_i θ_i X_i = Σ_i θ̃_i X̃_i.
The probability of exactly k neurons firing is

q(k/n) dr = Σ_{x: Σx_i = k} p(x) = nCk exp{θ̃_k − ψ},  (4.13)

because when exactly k neurons fire, X̃_i = 1 only when i = k, all the other X̃_i vanish, and there are nCk such firing patterns. From this, we have an explicit expression for θ̃_k,

θ̃_k = −log nCk + log[q(k/n)/q(0)],  (4.14)

because

exp{−ψ} = (1/n) q(0).  (4.15)

When q(r) is given, the corresponding higher-order interactions θ_i are given explicitly as

θ_i = Σ_j A_{ji} {−log nCj + log[q(j/n)/q(0)]}.  (4.16)
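Equations 4.14 and 4.16 can be implemented in a few lines. As a check, for independent neurons with firing probability p the number of firing neurons is binomial, and all interactions beyond θ_1 = log(p/(1 − p)) must vanish. (The probabilities q_k below stand for q(k/n) dr; the factor dr cancels in the ratios.) A sketch:

```python
import math
from math import comb

def thetas_from_q(q):
    """Higher-order interactions from the activity distribution (equations 4.14, 4.16).

    q[k] is the probability that exactly k of the n neurons fire, k = 0..n.
    """
    n = len(q) - 1
    # theta~_k = -log nCk + log(q_k / q_0)                     (equation 4.14)
    tt = [-math.log(comb(n, k)) + math.log(q[k] / q[0]) for k in range(1, n + 1)]
    # theta_i = sum_j A_ji theta~_j, with A_ji = (-1)**(i-j) C(i, j)   (equation 4.16)
    return [sum((-1) ** (i - j) * comb(i, j) * tt[j - 1] for j in range(1, i + 1))
            for i in range(1, n + 1)]
```

For a binomial q, the routine returns θ_1 = log(p/(1 − p)) and zeros for all higher orders, as expected.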
5 Entropy Maximization
Bohte et al. (2000) treated the distribution q(r) by maximizing the entropy under the condition that the first-order and second-order statistics are fixed. We touch on this problem. The entropy of p(x) is given by

−Σ p(x) log p(x) = −Σ η̃_i log η̃_i − η̃_0 log η̃_0 + Σ η̃_i log nCi,  (5.1)

where

η̃_0 = 1 − Σ_{i=1}^n η̃_i.  (5.2)

This is the negative of the dual potential of the exponential family in information geometry (Amari & Nagaoka, 2000). Therefore, the θ̃_i are calculated by

θ̃_i = ∂φ̃(η̃)/∂η̃_i = log η̃_i − log η̃_0 − log nCi.  (5.3)
The problem is written as: maximize

S = −Σ p(x) log p(x)  (5.4)

under

η_1 = nE[x_1] = const,  (5.5)

η_2 = nC2 E[x_1 x_2] = const.  (5.6)

The solution is given by the distribution

p(x) = exp{λ_1 X_1 + λ_2 X_2 − ψ},  (5.7)

which has only the pairwise correlations but no higher-order interactions. Now we rewrite the constraints:

η_1 = Σ_{j=1}^n jC1 η̃_j,  (5.8)

η_2 = Σ_{j=2}^n jC2 η̃_j.  (5.9)

The entropy is written in terms of η̃ as

S = −Σ_{i=0}^n η̃_i log η̃_i + Σ η̃_i log nCi.  (5.10)

Now we fix η_1 and η_2. The maximal entropy distribution is given from

∂S/∂η̃_i = 0,  i = 3, 4, ..., n.  (5.11)

The solution (Bohte et al., 2000) is

log D_i = ((i − 1)(i − 2)/2) log D_0 − i(i − 2) log D_1 + (i(i − 1)/2) log D_2,  (5.12)

where

D_i = η̃_i / nCi.  (5.13)
The activity distribution has two peaks in general, but the bigger peak dominates as n → ∞.
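The solution 5.12 is a log-quadratic interpolation: log D_i is the unique quadratic in i that passes through log D_0, log D_1, and log D_2 at i = 0, 1, 2. A sketch, with illustrative (assumed) values of D_0, D_1, D_2:

```python
import math

def max_entropy_D(i, D0, D1, D2):
    """Equation 5.12: log D_i = ((i-1)(i-2)/2) log D0 - i(i-2) log D1 + (i(i-1)/2) log D2."""
    log_d = ((i - 1) * (i - 2) / 2.0) * math.log(D0) \
            - i * (i - 2) * math.log(D1) \
            + (i * (i - 1) / 2.0) * math.log(D2)
    return math.exp(log_d)
```

The interpolation reproduces D_0, D_1, D_2 exactly, and for i = 3 it gives D_0 (D_2/D_1)³, since log D_i is quadratic in i.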
6 Derivation of Activity Distribution
The probability that k = nr neurons fire among the n neurons is given by

q(r) dr = Pr{r = k/n} = nCk Pr{x_1 = x_2 = ⋯ = x_k = 1, x_{k+1} = ⋯ = x_n = 0},  (6.1)

where dr = 1/n. In order to calculate

Pr{x_1 = ⋯ = x_k = 1, x_{k+1} = ⋯ = x_n = 0} = Pr{u_1, ..., u_k > 0; u_{k+1}, ..., u_n < 0},  (6.2)

we first fix ε in the expression for u_i in equation 2.4 and calculate the conditional probability. We then take the average over ε. Since u_1, ..., u_n are independent when ε is fixed,

Pr{x_1 = ⋯ = x_k = 1, x_{k+1} = ⋯ = x_n = 0}
 = E_ε[Pr{u_1, u_2, ..., u_k > 0, u_{k+1}, ..., u_n ≤ 0 | ε}]  (6.3)
 = E_ε[{Pr(u > 0 | ε)}^k {Pr(u ≤ 0 | ε)}^{n−k}],  (6.4)

where E_ε is the expectation taken with respect to the random variable ε, and Pr(· | ε) is the conditional probability conditioned on ε, that is, the probability for a fixed ε. In the above derivation, we used the homogeneity of the neural activity, that is, the symmetry of x_1, ..., x_n. Let us put

F(ε) ≡ Pr(u > 0 | ε) = (1/√(2π)) ∫_{(h−√a ε)/√(1−a)}^{∞} e^{−t²/2} dt.  (6.5)

We then have

Pr{r = k/n} = E_ε[nCk {F(ε)}^k {1 − F(ε)}^{n−k}]
 = ∫_{−∞}^{∞} nCk {F(ε)}^k {1 − F(ε)}^{n−k} (1/√(2π)) exp(−ε²/2) dε.  (6.6)
To investigate further, taking the large-n limit, we approximate equation 6.6 by an expression in a continuous variable r. In other words, we take

Pr{r = k/n} → q(r) dr  as n → ∞,  (6.7)

where dr = 1/n, and r on the right side is now a continuous variable. Then we obtain

q(r) ≃ n Pr{r = k/n}
 = n E_ε[nC_{nr} {F(ε)}^{nr} {1 − F(ε)}^{n−nr}]
 = √(n/(2π r(1−r))) ∫_{−∞}^{∞} exp[n{r log(F(ε)/r) + (1−r) log((1−F(ε))/(1−r))}] (1/√(2π)) exp(−ε²/2) dε
 = √(n/(2π r(1−r))) ∫_{−∞}^{∞} exp{n z(ε)} (1/√(2π)) exp(−ε²/2) dε,  (6.8)

where we used the approximation

nC_{nr} ≃ (1/√(2π n r(1−r))) exp(−nr log r − n(1−r) log(1−r)),

and defined

z(ε) ≡ r log(F(ε)/r) + (1−r) log((1−F(ε))/(1−r)).  (6.9)
By Laplace's approximation (also referred to as the saddle-point approximation), we get

q(r) ≃ √(n/(2π r(1−r))) · √(2π/(n |z″(ε_0)|)) · exp(n z(ε_0)) · (1/√(2π)) exp(−ε_0²/2),  (6.10)

where ε_0 is given by

ε_0 ≡ arg max z(ε).  (6.11)

Equation 6.11 is solved by setting ε_0 as the solution of

dz(ε)/dε = {r/F(ε) − (1−r)/(1−F(ε))} F′(ε) = 0,  (6.12)
and the solution is

ε_0 = F^{−1}(r).  (6.13)
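Equations 6.5 and 6.13 are straightforward to check numerically. F(ε) = 1 − Φ((h − √a ε)/√(1 − a)) is increasing in ε, so ε_0 = F^{-1}(r) can be found by bisection; the values a = 0.3 and h = 1.0 below are illustrative assumptions:

```python
import math

def F(eps, a, h):
    """Equation 6.5: F(eps) = Pr(u > 0 | eps) = 1 - Phi((h - sqrt(a)*eps) / sqrt(1 - a))."""
    z = (h - math.sqrt(a) * eps) / math.sqrt(1.0 - a)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def eps0(r, a, h, lo=-50.0, hi=50.0):
    """Equation 6.13: eps_0 = F^{-1}(r), by bisection (F is increasing in eps)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if F(mid, a, h) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With ε_0 in hand, the unnormalized activity distribution follows from equation 6.14 as exp[(h − √a ε_0)²/(2(1 − a)) − ε_0²/2].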
For r → 0, 1, we have ε_0 → ±∞ or, equivalently, F(ε) → 0, 1. By some calculations, equation 6.10 can be written as

q(r) = C exp[(h − √a ε_0)²/(2(1−a)) − ε_0²/2],  (6.14)

where C is a normalization constant. Using equation 6.13, we finally obtain the required activity distribution.

7 Conclusion
The stochastic mechanism underlying the synchronous firing of neurons is studied by using information geometry. When the probability distribution of the activity of neurons is concentrated on one value (the delta function), the fraction of firing neurons is almost the same at each time. When the distribution has two peaks, at r_1 and r_2, the activity takes the higher value r_2 when many neurons are synchronizing and the lower value r_1 at other times. When the distribution is widespread and continuous, the neuron pool shows a variety of activities. It is a new finding that higher-order interactions are necessary for generating widespread activities, and independent, pairwise, or such a limited number of interactions cannot generate a widespread distribution. We have used a simple model of a neuron pool that receives stimuli from overlapping common neurons to show that it can generate higher-order interactions and hence a widespread distribution of activity. Although the model is simple, it suggests many interesting features. One is the role of higher-order interactions in a neuron pool. This may be used to judge whether there are such interactions in experimental data. We treated only symmetric and homogeneous interactions, but the essential point is the same even in inhomogeneous networks. It is also interesting to note that even when the neuronal potentials u_i have no higher-order interactions, their nonlinear transforms can have them. We have not discussed dynamical aspects of synchronous firing such as in synfire chains, a subject for further research.

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Aertsen, A., & Arndt, M. (1993). Response synchronization in the visual cortex. Current Opinion in Neurobiology, 3(4), 586–594.
Amari, S. (2001). Information geometry on hierarchy of probability distributions. IEEE Transactions on Information Theory, 47(5), 1701–1711.
Amari, S., & Nagaoka, H. (2000). Methods of information geometry. New York: Oxford University Press.
Bohte, S. M., Spekreijse, H., & Roelfsema, P. R. (2000). The effects of pair-wise and higher-order correlations on the firing rate of a postsynaptic neuron. Neural Computation, 12, 153–179.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. M. (1989). Neuronal assemblies. IEEE Transactions on Biomedical Engineering, 36(1), 4–14.
Martignon, L., Von Hasseln, H., Grün, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events in a group of neurons. Biological Cybernetics, 73(1), 69–81.
Nakahara, H., & Amari, S. (2002). Information geometric measure for neural spikes. Neural Computation, 14, 2269–2316.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555–586.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
Received February 19, 2002; accepted July 3, 2002.
LETTER
Communicated by Bard Ermentrout
Determination of Firing Times for the Stochastic Fitzhugh-Nagumo Neuronal Model

Henry C. Tuckwell
[email protected] Department of Mathematics, University of California, Irvine, CA 92697, U.S.A., and Laboratory for Visual Neurocomputing, Riken Brain Science Institute, Wako-shi, Saitama 351-0198, Japan Roger Rodriguez
[email protected] Centre de Physique Théorique, CNRS, F13288 Marseille Cedex 9, France Frederic Y. M. Wan
[email protected] Department of Mathematics, University of California, Irvine, CA 92697, U.S.A.

We present for the first time an analytical approach for determining the time of firing of multicomponent nonlinear stochastic neuronal models. We apply the theory of first exit times for Markov processes to the Fitzhugh-Nagumo system with a constant mean gaussian white noise input, representing stochastic excitation and inhibition. Partial differential equations are obtained for the moments of the time to first spike. The observation that the recovery variable barely changes in the prespike trajectory leads to an accurate one-dimensional approximation. For the moments of the time to reach threshold, this leads to ordinary differential equations that may be easily solved. Several analytical approaches are explored that involve perturbation expansions for large and small values of the noise parameter. For ranges of the parameters appropriate for these asymptotic methods, the perturbation solutions are used to establish the validity of the one-dimensional approximation for both small and large values of the noise parameter. Additional verification is obtained with the excellent agreement between the mean and variance of the firing time found by numerical solution of the differential equations for the one-dimensional approximation and those obtained by simulation of the solutions of the model stochastic differential equations. Such agreement extends to intermediate values of the noise parameter. For the mean time to threshold, we find maxima at small noise values that constitute a form of stochastic resonance. We also investigate the dependence of the mean firing time on the initial values of the voltage and recovery variables when the input current has zero mean.

Neural Computation 15, 143–159 (2003)
© 2002 Massachusetts Institute of Technology
144
H. Tuckwell, R. Rodriguez, and F. Wan
1 Introduction
Following the first use of microelectrodes to observe the activity of individual neurons in the mammalian central nervous system, it was soon found that such activity was for the most part highly irregular and unpredictable (Burns, 1968). Many mathematical models have been proposed and analyzed whose aim is to provide quantitative theories of these phenomena (Softky & Koch, 1993; Stevens & Zador, 1998; Lee & Kim, 1999; Tuckwell, 1989, has a review of early modeling). In the past several years, there has been a vigorous renewed interest in such aspects of neuronal spiking activity, especially with regard to an understanding of information processing (Shadlen & Newsome, 1995; König, Engel, & Singer, 1996; Tiesinga, José, & Sejnowski, 2000). There have been many experimental studies of the spike trains emitted by cortical cells in response to natural and artificial stimulation (Burns, 1968; Kreiter & Singer, 1996), but it is still uncertain which models reproduce experimental details (Softky & Koch, 1993; Plesser & Gerstner, 2000). Whereas many early studies of neuronal networks were based on the assumption that there are two possible states for each element of the network, corresponding to a firing or nonfiring nerve cell, models have since become increasingly realistic and have begun to include physiological details of neurons, including synaptic currents of various kinds. Methods of research have included phase analysis, simulation, or numerical integration of systems of ordinary differential equations, as well as direct analytical approaches to problems involving stochastic model neurons with synaptic input. We have recently reported a reductionist method for nonlinear noisy single neurons and networks in which differential equations are solved for the first- and second-order moments of the model dynamical variables (Tuckwell & Rodriguez, 1998).

Although there has been considerable progress with mathematical analysis of the behaviors of model neurons, much of the research has involved simplified models. In this direction, there have been advances with classical one-dimensional models, but these have mainly been for linear integrate-and-fire models, which lead to the subthreshold potential being represented as an Ornstein-Uhlenbeck process (Plesser & Tanaka, 1997; Plesser & Gerstner, 2000). There have been only a few analytical studies including noise in nonlinear models such as the Fitzhugh-Nagumo system and the Hodgkin-Huxley system (Rodriguez, 1995; Tanabe & Pakdaman, 2001). Although several software routines are available to implement simulation studies of model neurons with both linear and nonlinear dynamics, as, for example, in Liley, Alexander, Wright, and Aldous (1999), it is always desirable to have results obtained using analytical approaches. Using the theory of multidimensional (Markov) diffusion processes, we derive equations for the moments of random variables associated with the time to reach threshold in the Fitzhugh-Nagumo system. Although this sys-
Stochastic Neuronal Firing Times
145
tem does not have as firm a basis as conductance-based Hodgkin-Huxley type models, it is relatively simple and more directly amenable to analysis. We describe several methods of finding firing times by solving associated ordinary or partial differential equations, including approximations suitable for small and large values of the noise parameter. Comparison will be made with results obtained by simulating the stochastic Fitzhugh-Nagumo system directly. In a subsequent article, we will conduct a similar analysis of the original Hodgkin-Huxley equations. The approach that we use can be employed for any conductance-based model.

2 The Stochastic Fitzhugh-Nagumo System

The Fitzhugh-Nagumo (FN) system of two differential equations has been employed to provide insight into the more complex Hodgkin-Huxley system of four equations. It shares with the latter the properties of subthreshold responses, action potentials or spikes in response to suitable stimuli, and repetitive activity (periodic solutions) in certain ranges of stimuli. There are associated Hopf bifurcations at the parameter values corresponding to the end points of these ranges. We study the following stochastic version of this model (Tuckwell, 1986, 1989; Feng & Brown, 2000), in which the components obey the equations

dX = [f(X) − Y + I] dt + σ dW,  (2.1)
dY = b(X − cY) dt,  (2.2)

where X = X(t) is the "voltage" variable, Y = Y(t) is the "recovery" variable, W is a standard (zero mean, variance t) Wiener process, σ is a constant determining the overall input noise level, I = I(t) is a deterministic input current (stimulus) that may be constant or time varying, and b and c are positive constants. The function f is a cubic,

f(x) = kx(x − a)(1 − x),  (2.3)

where 0 < a < 1. Usually one takes a < 1/2 in order to obtain suitable supra-threshold responses.

2.1 Exact Equations for the Moments of the Time to a Threshold. Let T(x, y) be the first exit time of the vector-valued process (X, Y) from an open set A in the phase space for a starting point (x, y) ∈ A. Assuming that the stimulus I is constant, so that the process is temporally homogeneous, the theory of Markov diffusion processes (Tuckwell, 1976) yields the following partial differential equations (PDEs) for the first moment
F(x, y) = E[T(x, y)] and for the second moment G(x, y) = E[T²(x, y)]:

[f(x) − y + I] ∂F/∂x + b(x − cy) ∂F/∂y + (σ²/2) ∂²F/∂x² = −1,  (2.4)

[f(x) − y + I] ∂G/∂x + b(x − cy) ∂G/∂y + (σ²/2) ∂²G/∂x² = −2F(x, y),  (2.5)
with suitable boundary conditions on the boundary of the set A. The set A will be chosen as an open rectangle Q containing (x, y). Since we are primarily interested in the time taken for the potential variable X to reach a threshold value h > 0 from the resting value X(0) = 0, the rectangle is taken to be (−a, h) × (y_1, y_2), where a is positive. The usual boundary conditions for first exits from the open rectangle Q are that F and G vanish on its boundary. In order to facilitate numerical solution, we modify these boundary conditions, but in such a way that accurate values are still obtained for the moments. The condition that F and G vanish along the line x = h is retained. For the parameter ranges and initial values of interest, most exits of (X, Y) from Q occur through the line x = h and, for the values of Y(0) considered here, away from the lines y = y_1 and y = y_2. As a → ∞, the paths are less influenced by the boundary condition at x = −a, with nearly all exits occurring at x = h rather than x = −a. Hence, one may use a reflecting condition F_x(−a, ·) = 0, where the subscript denotes partial differentiation, instead of F(−a, ·) = 0. The zero-derivative conditions F_y(·, y_1) = 0 and F_y(·, y_2) = 0 may similarly be employed. These zero-derivative conditions facilitate numerical solution of the differential equations by avoiding sharp boundary layers at these three edges of Q, and they correspond to the Euler boundary conditions in a variational formulation of the problem. The time to first threshold crossing, or time to first spike, from an initial rest point (0, y_0) is not the same as the interspike interval in nonlinear models; it has first and second moments F(0, y_0) and G(0, y_0), respectively, where y_0 ∈ (y_1, y_2).

3 Methods of Solution
We have approached the solution of the moment equations in various ways. Approximate solutions can always be obtained by simulation, as described in the next section. However, accurate numerical solution of the moment equations is a useful and perhaps necessary check on simulation results, which are subject to sampling error. We have pursued the following methods of solution: (1) a simplified approach, which considers the single-component process X; (2) approximate solution of the PDEs 2.4 and 2.5 when s is small; and (3) solution of the PDEs when s is large.
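Before turning to these, a direct numerical solution of equation 2.4 on the rectangle Q is itself instructive. The sketch below is our own illustration, not the authors' code: it discretizes the moment equation with an upwind finite-difference scheme, imposes the absorbing condition F(h, y) = 0 and the zero-derivative conditions described in section 2.1, and solves the resulting sparse linear system. The rectangle bounds (a = 1, y_1 = −1, y_2 = 2) are assumed values, since the text does not fix them.

```python
import numpy as np
from scipy.sparse import lil_matrix, csr_matrix
from scipy.sparse.linalg import spsolve

# standard parameter set from the text, with I = 1.5 and s = 1
a_cubic, b, c, k = 0.1, 0.015, 0.2, 0.5
I, s, h = 1.5, 1.0, 0.6
a_left, y1, y2 = 1.0, -1.0, 2.0      # rectangle Q = (-a_left, h) x (y1, y2); assumed bounds

def f(x):
    return k * x * (x - a_cubic) * (1.0 - x)

Nx, Ny = 120, 30
xs = np.linspace(-a_left, h, Nx)
ys = np.linspace(y1, y2, Ny)
dx, dy = xs[1] - xs[0], ys[1] - ys[0]
D = 0.5 * s**2
idx = lambda i, j: i * Ny + j

A = lil_matrix((Nx * Ny, Nx * Ny))
rhs = -np.ones(Nx * Ny)
for i in range(Nx):
    for j in range(Ny):
        m = idx(i, j)
        if i == Nx - 1:                      # absorbing edge: F(h, y) = 0
            A[m, m] = 1.0
            rhs[m] = 0.0
            continue
        P = f(xs[i]) - ys[j] + I             # drift in x
        Q = b * (xs[i] - c * ys[j])          # drift in y
        if i == 0:                           # reflecting edge: F_x(-a, y) = 0, so the
            A[m, idx(1, j)] += 2 * D / dx**2 # P*F_x term vanishes; ghost point for F_xx
            A[m, m] += -2 * D / dx**2
        else:                                # diffusion plus upwinded P*F_x
            A[m, idx(i + 1, j)] += D / dx**2 + max(P, 0) / dx
            A[m, idx(i - 1, j)] += D / dx**2 + max(-P, 0) / dx
            A[m, m] += -2 * D / dx**2 - abs(P) / dx
        if Q >= 0 and j < Ny - 1:            # upwinded Q*F_y; the zero-derivative
            A[m, idx(i, j + 1)] += Q / dy    # conditions absorb the edge terms
            A[m, m] += -Q / dy
        elif Q < 0 and j > 0:
            A[m, idx(i, j - 1)] += -Q / dy
            A[m, m] += Q / dy

F = spsolve(csr_matrix(A), rhs).reshape(Nx, Ny)
mean_T = F[np.argmin(np.abs(xs)), np.argmin(np.abs(ys - 1.0))]  # F(0, 1)
```

Replacing the right-hand side −1 by −2F and re-solving the same linear system gives the second moment G of equation 2.5.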
Stochastic Neuronal Firing Times
147
Figure 1: The value of the recovery variable is practically unchanged during the passage of X(t) from 0 to the threshold value of h = 0.6. Here the input current is I = 1.5, and the noise parameter is s = 1. The values of the remaining parameters, a = 0.1, b = 0.015, c = 0.2, k = 0.5, are the standard set throughout.

3.1 A Simplified Approach. In this first approach, we reduce the problem to a one-dimensional one. For the PDEs 2.4 and 2.5, this is tantamount to neglecting the F_y and G_y terms. The justification for this approach is the observation that during the initial stages of the interspike interval, the recovery variable Y(t) is practically unchanged and may therefore be set equal to a constant, the initial value Y(0) = y_0. This is illustrated in Figure 1. We then consider the diffusion process satisfying
$$dX = [f(X) - y_0 + I]\,dt + s\,dW. \tag{3.1}$$

Letting X(0) = x ∈ (−∞, h), we consider the exit time T_h(x), which is the time to reach the level h, identified as the threshold for an action potential, given that the voltage started below this value. The first and second moments, F(x) and G(x), respectively, of this exit time satisfy the differential equations

$$\frac{s^2}{2}\,\frac{d^2 F(x)}{dx^2} + [f(x) - y_0 + I]\,\frac{dF(x)}{dx} = -1, \tag{3.2}$$
and

$$\frac{s^2}{2}\,\frac{d^2 G(x)}{dx^2} + [f(x) - y_0 + I]\,\frac{dG(x)}{dx} = -2F(x), \tag{3.3}$$
with the boundary conditions F(h) = 0 and G(h) = 0. For numerical solution, we make the interval of initial values finite, with x ∈ (−a, h), where a > 0. As mentioned in the previous section, at the lower boundary it is advantageous to choose the zero-derivative conditions F_x(−a) = 0 and G_x(−a) = 0. The quantities of interest are the moments of the time to reach threshold from rest, F(0) and G(0). A further reduction is sometimes possible. The solutions of the second-order ordinary differential equations 3.2 and 3.3 may be approximated when s is small by the solutions of the first-order equations

$$[f(x) - y_0 + I]\,\frac{dF(x)}{dx} = -1 \tag{3.4}$$

and

$$[f(x) - y_0 + I]\,\frac{dG(x)}{dx} = -2F(x). \tag{3.5}$$
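A minimal sketch of this simplified approach in Python: equations 3.2 and 3.3 are discretized with central differences on (−a, h), the reflecting condition at −a is imposed through a ghost point, and the small-s quadrature implied by equation 3.4 is evaluated for comparison. The interval length a = 1 is an assumed value; the other parameters are the standard set quoted in the text.

```python
import numpy as np

# standard parameter set; y0 = Y(0) = 1.0; lower endpoint -a_left is assumed
a, b, c, k = 0.1, 0.015, 0.2, 0.5
I, s, h, y0 = 1.5, 1.0, 0.6, 1.0
a_left = 1.0

n = 1000
x = np.linspace(-a_left, h, n)
dx = x[1] - x[0]
drift = k * x * (x - a) * (1.0 - x) - y0 + I   # f(x) - y0 + I
D = 0.5 * s**2

# tridiagonal discretization of D F'' + drift F' (central differences)
A = np.zeros((n, n))
for i in range(1, n - 1):
    A[i, i - 1] = D / dx**2 - drift[i] / (2 * dx)
    A[i, i] = -2 * D / dx**2
    A[i, i + 1] = D / dx**2 + drift[i] / (2 * dx)
# reflecting: F'(-a_left) = 0, so the drift term drops and a ghost point closes F''
A[0, 0], A[0, 1] = -2 * D / dx**2, 2 * D / dx**2
A[-1, -1] = 1.0                                 # absorbing: F(h) = 0

rhs_F = -np.ones(n); rhs_F[-1] = 0.0
F = np.linalg.solve(A, rhs_F)                   # first moment, eq. 3.2
rhs_G = -2.0 * F; rhs_G[-1] = 0.0
G = np.linalg.solve(A, rhs_G)                   # second moment, eq. 3.3

i0 = np.argmin(np.abs(x))                       # rest point x = 0
mean_T = F[i0]
sd_T = np.sqrt(max(G[i0] - F[i0] ** 2, 0.0))

# small-s leading term from eq. 3.4: F0(0) = integral from 0 to h of dw / drift(w)
F0 = np.sum(1.0 / drift[i0:]) * dx
```

The same matrix serves for both moments because equations 3.2 and 3.3 share the same differential operator; only the right-hand side changes.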
These equations admit only a single boundary value, chosen as zero at x = h. A more accurate approach for small s is described in the next section.

3.2 Solution When s Is Small. When s is small, the PDEs 2.4 and 2.5 can be solved approximately using perturbation methods. Putting ε = s^2/2, the equation satisfied by the mean first exit time can be written as

$$\epsilon F_{xx} + L_1[F] = -1, \tag{3.6}$$

with

$$L_1[F] = P(x, y)F_x + Q(x, y)F_y, \tag{3.7}$$

and where P(x, y) = I − y + kx(x − a)(1 − x) and Q(x, y) = b(x − cy). The solution is obtained with F(h, y) = 0 and F_x(−a, y) = 0. A regular perturbation expansion is sought in powers of ε:

$$F(x, y; \epsilon) = \sum_{n=0}^{\infty} F_n(x, y)\,\epsilon^n. \tag{3.8}$$
Substitution in equation 3.6 gives first-order PDEs for the coefficients {F_n(x, y)}, which can be solved by the method of characteristics. For s ≪ 1, it is expected that the solution of equation 2.4 is

$$F(x, y; \epsilon) \approx F_0(x, y) + O(\epsilon),$$
where F_0 is the leading term in the regular perturbation expansion, equation 3.8. A similar development leads to G(x, y; ε) ≈ G_0(x, y) + O(ε). For the leading terms F_0 and G_0, the relations 3.4 and 3.5 are applicable along the characteristics of equations 2.4 and 2.5. Here, we have y = y_0(x), known from a separate determination of the characteristic projections for the linear PDEs for F and G. It follows that the vanishing-slope conditions for F and G as x → −∞ are satisfied by F_0 and G_0. For s < 1, these leading-term solutions (or the corresponding multiterm truncated perturbation series) differ by less than 5% from those obtained with the one-dimensional approach, validating the latter for such values of s.

3.3 Solution for Large s. It is convenient to rewrite equation 2.4 in dimensionless variables z = x/h, λ = s^2/(2h^2), and u = y/h. The PDE becomes
$$\lambda F_{zz} + p(z, u)F_z + b(z - cu)F_u = -1 \tag{3.9}$$

for −α ≤ z ≤ 1, where $p(z, u) = [\bar{I} + \bar{k}z(z - \bar{a})(1 - hz) - u]$ and where $\bar{I} = I/h$, $\bar{k} = kh$, $\bar{a} = a/h$, and $\alpha = a/h$. The boundary conditions are now F(1, u) = 0 and F_z(−α, u) = 0. With

$$P(z, u) = \int_0^z p(w, u)\,dw \tag{3.10}$$

$$= (\bar{I} - u)z + \bar{k}\left[-\frac{h}{4}z^4 + \frac{1}{3}(1 + h\bar{a})z^3 - \frac{1}{2}\bar{a}z^2\right], \tag{3.11}$$

we multiply throughout by $(1/\lambda)e^{P/\lambda}$ and write the PDE for F as

$$\left[e^{P/\lambda}F_z\right]_z + \frac{1}{\lambda}\left[b(z - cu)\,e^{P/\lambda}F_u\right] = -\frac{e^{P/\lambda}}{\lambda}. \tag{3.12}$$
When λ ≫ 1, a regular perturbation solution is sought of the form

$$F(z, u; \lambda) = \sum_{n=1}^{\infty} F_n(z, u)\,\lambda^{-n}. \tag{3.13}$$
With the boundary conditions for the leading term given by F_1(1, u) = 0 and F_{1,z}(−α, u) = 0, the solution for F_1 can be found as a double integral. The following expression is then obtained for the mean first exit time:

$$F(z, u; \lambda) = \frac{1}{\lambda}\int_z^1 e^{-P(\eta, u)/\lambda}\int_{-\alpha}^{\eta} e^{P(\xi, u)/\lambda}\,d\xi\,d\eta\,\big[1 + O(\lambda^{-1})\big]. \tag{3.14}$$
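The leading term of equation 3.14 can be evaluated by direct nested quadrature. A sketch, using the standard parameter set with s = 5 (so that λ ≈ 34.7) and assuming a dimensionless left boundary α = a/h with a = 1, neither of which is fixed by the text:

```python
import numpy as np
from scipy.integrate import quad

# standard parameters; s = 5 gives lambda = s^2 / (2 h^2) >> 1
a, k = 0.1, 0.5
I, s, h = 1.5, 5.0, 0.6
lam = s**2 / (2 * h**2)
alpha = 1.0 / h                        # dimensionless left boundary, assuming a = 1
Ibar, kbar, abar = I / h, k * h, a / h

def P(z, u):
    # antiderivative of p(z, u), eqs. 3.10-3.11, normalized so P(0, u) = 0
    return (Ibar - u) * z + kbar * (-(h / 4) * z**4
                                    + (1 + h * abar) * z**3 / 3
                                    - abar * z**2 / 2)

def mean_exit_time(z, u):
    # leading term of eq. 3.14 by nested quadrature
    inner = lambda eta: quad(lambda xi: np.exp(P(xi, u) / lam), -alpha, eta)[0]
    outer, _ = quad(lambda eta: np.exp(-P(eta, u) / lam) * inner(eta), z, 1.0)
    return outer / lam

T = mean_exit_time(0.0, 1.0 / h)       # start at x = 0, y = y0 = 1
```

Because the leading term absorbs the 1/λ prefactor, the result already carries the correct time units; the neglected correction is O(λ^{-1}).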
A similar analysis gives

$$G(z, u; \lambda) = \frac{1}{\lambda}\int_z^1 e^{-P(\eta, u)/\lambda}\int_{-\alpha}^{\eta} e^{P(\xi, u)/\lambda}\,F_1(\xi, u)\,d\xi\,d\eta\,\big[1 + O(\lambda^{-1})\big]. \tag{3.15}$$
In addition to yielding explicit, accurate approximations for the moments of the first passage time, the above leading-term asymptotic solutions (or more accurate multiterm truncated series solutions) provide, for large s, a validation of the one-dimensional approximation of section 3.1. The leading term of the perturbation solution for F is seen from equation 3.12 to be determined by equation 3.2. Similar remarks apply to G.

4 Results and Comparison with Simulation
We have determined the first two moments of T(x, y) by the above methods. We have additionally simulated the processes described by the evolution equations 2.1 and 2.2 using a standard strong Euler method. The use of this scheme is straightforward and is vindicated in the present problem by the close agreement of the simulation results and those obtained using the analytical framework. To apply this method, with X(kΔt) and Y(kΔt) approximated by X_k and Y_k, respectively, the simulation scheme consists of the recursive relations

$$X_{k+1} = X_k + [f(X_k) - Y_k + I]\,\Delta t + s\sqrt{\Delta t}\,N_k,$$
$$Y_{k+1} = Y_k + b[X_k - cY_k]\,\Delta t,$$

where k = 0, 1, 2, . . . and {N_k} is a sequence of independent and identically distributed standard normal random variables, obtained from computer library routines. This was done principally with the set of parameter values a = 0.1, b = 0.015, c = 0.2, and k = 0.5, for various values of s and I. The threshold value h for spikes was estimated from sample paths, with allowance for the discrete nature of the simulation. The value of Δt was pragmatically chosen between 0.00001 and 0.001, the value being determined in each case by requiring the minimum number of steps to threshold to be at least 20 and the maximum number several thousand. The initial value of X(t) was set at 0, and that of Y(t) was usually set at 1.0 in the results to be reported. Other values of Y(0) were nevertheless considered for comparison, some being reported in the final section. When I is less than about 1.3, the (expected) value of X(t) rarely attains the designated threshold value of h = 0.6. For the chosen parameters, we therefore consider mainly values of I ≥ 1.3, although the value I = 0 is discussed later in the article. The values of the noise parameter s are chosen from 0.05 to 5.0. For small
Figure 2: Mean times to reach threshold versus noise parameter s. Parameter values are given in the text. Values are determined by simulation (circles) and by numerical solution of differential equation 2.4 (+).
values of s, the number of trials in simulations was 4500, whereas only 1500 trials were employed for larger values. Plots of the means of the times for X(t) to reach h for various I, as a function of s, are shown in Figures 2 and 3. For a given value of I, results obtained by numerical solution are shown in the same figure for comparison. In Figure 2, the values of I are 1.3 and 1.5, whereas in Figure 3, the values of I are 2 and 3. The numerical results shown here were obtained using the simplified approach of section 3.1. There is excellent agreement between the two sets of results, especially allowing for sampling error in the simulation results. The magnitudes of the sampling errors (± standard error) in the simulated means are summarized as follows, with best cases first. When I = 1.3: for s = 0.05, ±0.17%; for s = 0.5, ±1.31%; for s = 0.75, ±2.70%; and for s = 5, ±3.95%. When I = 1.5: for s = 0.05, ±0.13%; for s = 0.5, ±1.20%; for s = 0.75, ±2.53%; and for s = 5, ±3.93%. When I = 2: for s = 0.05, ±0.10%; for s = 0.5, ±0.92%; for s = 0.75, ±2.2%; and for s = 5, ±4.50%. When I = 3: for s = 0.05, ±0.07%; for s = 0.5, ±0.66%; for s = 0.75, ±1.76%; and for s = 5, ±4.30%.
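The Euler scheme above, together with the percentage standard errors quoted here, can be reproduced along the following lines. This is a sketch with a reduced number of trials (500 rather than 4500) and an assumed step Δt = 0.001, using the standard parameter set with I = 1.5 and s = 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# standard parameter set from the text
a, b, c, k = 0.1, 0.015, 0.2, 0.5
I, s, h = 1.5, 1.0, 0.6
dt = 1e-3
sq = s * np.sqrt(dt)

def first_spike_time(x0=0.0, y0=1.0, t_max=200.0):
    # Euler scheme for eqs. 2.1-2.2, run until X first reaches the threshold h
    x, y, t = x0, y0, 0.0
    while t < t_max:
        x_next = x + (k * x * (x - a) * (1 - x) - y + I) * dt + sq * rng.standard_normal()
        y += b * (x - c * y) * dt
        x, t = x_next, t + dt
        if x >= h:
            return t
    return np.nan   # no crossing within t_max (rare for I >= 1.3)

times = np.array([first_spike_time() for _ in range(500)])
mean_T = np.nanmean(times)
# percentage standard error of the sample mean, as quoted in the text
se_percent = 100.0 * np.nanstd(times, ddof=1) / np.sqrt(500) / mean_T
```

With only 500 trials the standard errors come out roughly three times larger than the 4500-trial figures quoted above, as expected from the 1/sqrt(n) scaling.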
Figure 3: Mean first firing time versus s when I = 2 and I = 3. Values are determined by simulation (circles) and by numerical solution of differential equation 2.4 (+).
In general, the means from simulation are slightly larger than those obtained by solution of the differential equation. It can be clearly seen in the top part of Figure 2, where I = 1.3, that as s increases away from zero, the mean first passage time actually increases. After reaching a maximum when s is about equal to 0.25, the mean then declines, as is expected with increased variability. Similar behavior is observed for the larger values of the mean input current, except that the maximum of the mean interval occurs at larger values of s. This phenomenon is akin to stochastic resonance, well documented in nonlinear systems with noise, including neural systems such as those of Hodgkin-Huxley and Fitzhugh-Nagumo (Lee & Kim, 1999; Nozaki & Yamamoto, 1998; Massanes & Vicente, 1999). In Figures 4 and 5, we show the standard deviation of the time to reach threshold, as determined by both simulation and solution of the differential equation 2.5 for the second moment. It can be seen that there is excellent agreement between the two sets of results for all four values of the input parameters I and s. In each of the four sets of results shown, the standard deviation attains a maximum as a function of s. The magnitudes of the sampling errors in the standard deviations in representative cases are as
Figure 4: Standard deviation of time to reach threshold versus noise parameter s. Parameter values are given in the text. Values are determined by simulation (circles) and by numerical solution (+) of the differential equation 2.5.
follows. When I = 1.3: for s = 0.05, ±2.0%; and for s = 5, ±2.6%. When I = 3: for s = 0.05, ±2.2%; and for s = 5, ±4.8%. In Figure 6, we show the behavior of the coefficient of variation (CV, defined as the standard deviation divided by the mean), calculated from computer simulations, as a function of the noise parameter s for the values I = 1.3 and I = 3. The results for I = 1.5 and I = 2 are similar to these, respectively. It is seen that the CV is greater than 1 over much of the range of parameters considered.

5 Discussion
Since interest first arose in the theory of stochastic neuronal firing over 30 years ago, analytical methods have been employed to determine properties of interspike intervals in many integrate-and-fire model neurons in which the subthreshold membrane potential satisfies a linear stochastic differential equation. In this regard, nonlinear models such as Fitzhugh-Nagumo and Hodgkin-Huxley have been studied mainly using computer simulations. We have used an analytical framework for the first time in a nonlinear
Figure 5: Standard deviation of first firing time versus s when I = 2 and I = 3. Values determined by simulation (circles) and by numerical solution of the differential equation 2.5 (+).
stochastic neural model. To this end, we have reduced the task of characterizing the firing time to the solution of linear second-order partial differential equations. Some remarks are in order about the nature of these equations and our methods of solving them when the noise term is small. In such singular perturbation problems, where the highest-derivative term in the differential equation is multiplied by the small parameter, setting the small parameter equal to zero (or, equivalently, seeking the leading term of a regular perturbation solution in the small parameter) reduces the order of the equation and therefore leads to a loss of some of the components of the general solution of the equation. However, for problems with a certain combination of boundary conditions, such as prescribing the normal derivative equal to zero at negative infinity (or at some large negative value of x) and prescribing the value of the unknown (equal to zero in our case) at the opposite end of the interval, the missing solutions are irrelevant, as they would have been eliminated by the boundary condition at negative infinity anyway. That is, our less general solution, obtained after setting the small parameter s equal to zero, is made to satisfy all boundary conditions (in particular, the
Figure 6: Coefficient of variation of the time to threshold versus s for I = 1.3 and I = 3. Values are determined by simulation (circles) and by numerical solution of the differential equation 2.5 (+).
condition at negative infinity automatically), whereupon, by a uniqueness theorem for linear PDEs, we have the correct solution of the problem without obtaining the missing solutions. A previous example of this technique is contained in our extensive investigation of the Ornstein-Uhlenbeck process representation of the neuronal potential (Wan & Tuckwell, 1982).

5.1 Parameter and Initial Values: The Case I = 0. The parameters we have employed are necessarily restricted to particular ranges, and a brief consideration of other cases is warranted. The deterministic constant component of the input current, I, represents the drift or mean synaptic drive, which depends on the relative magnitudes and frequencies of excitation and inhibition. The quantities we have determined using the analytical framework are the moments of the time of emergence of the first spike. For values of I such as those employed in the calculations performed above, periodic firing occurs; that is, subsequent to the first spike, an approximately periodic spike train ensues. If the train is sufficiently regular, with a consistently defined apparent threshold, such as occurs when the noise amplitude is not excessive (see Figure 2 of Tuckwell & Rodriguez, 1998), then the methods
Figure 7: Effects of changing the initial value of the voltage variable X(0) are illustrated for a fixed value of Y(0) = 0. (Top) For the two indicated values of X(0), the variation of the mean time to a spike as a function of the noise parameter s is shown. (Bottom) Variations in the mean time to a spike as the initial value X(0) changes, for two fixed values of the noise parameter.
employed here could be applied to the determination of the moments of the associated yet different random variable, namely, the interspike interval. Such a determination will be deferred to a later article. The initial value chosen above for the dynamical variable X, to illustrate the usefulness of the analytical approach, is X(0) = 0, because it corresponds to a natural resting or equilibrium value for the unperturbed system. The value Y(0) = 1 was based on previously published results, as it was close to the values in sustained rhythmic firing for the chosen parameter values (Tuckwell & Rodriguez, 1998). However, other initial conditions are of interest, as is the case where periodic firing does not necessarily occur because the mean input current is not sufficiently large. To study these aspects, we have determined, using simulation, the statistics of the time to first spike in certain "excitable" cases when I = 0. First, we consider the effects of varying the initial value X(0) of the voltage variable for a fixed value Y(0) = 0 of the recovery variable. Some results for the mean time to first spike are shown in Figure 7. Comparison of the results
Figure 8: Effects of changing the initial value of the recovery variable Y(0) are illustrated for a fixed value of X(0) = 0. (Top) For the indicated values of Y(0), the variation of the mean time to a spike as a function of the noise parameter s is shown. (Bottom) Variations in the mean time to a spike as the initial value Y(0) changes, for the two indicated values of the noise parameter.
in the top part of this figure with those in Figures 2 and 3 indicates that the variation in this mean, as the noise parameter varies from 0.5 to 5.0, is similar for the three different values of X(0), despite the different values of I and the fact that Figures 2 and 3 correspond to the periodic firing mode with a different value of Y(0). That the mean time to a spike is moderately sensitive to changes in X(0), for fixed values of all the parameters and of Y(0), can be seen in the lower part of Figure 7. Similar remarks apply to the effects of changing Y(0), as can be gleaned from inspection of Figure 8. Noticeable here, however, is that when s = 4, the mean time to a spike increases only gradually as Y(0) increases from −0.4 to 1.0. In summary, we have investigated for the first time the determination of times to a threshold crossing in a nonlinear stochastic neuronal model, that of Fitzhugh-Nagumo, with the addition of a white gaussian noise representing stochastic excitation and inhibition. We have found moments of threshold crossing times accurately by solving the associated partial differ-
ential equations, obtained from the theory of Markov processes. The validity of the methods employed has been verified by comparison with computer simulation results. An interesting structure in the mean first passage time was noted at small values of the noise parameter. In future work, we will report our findings for the more complicated Hodgkin-Huxley system.

Acknowledgments
H. C. T. appreciates support from the Riken Brain Science Institute, Saitama, Japan.

References

Burns, B. D. (1968). The uncertain nervous system. London: Arnold.
Feng, J.-F., & Brown, D. (2000). Integrate-and-fire models with nonlinear leakage. Bulletin of Mathematical Biology, 62, 467–481.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends in Neurosciences, 19, 130–137.
Kreiter, A. K., & Singer, W. (1996). Stimulus-dependent synchronization of neuronal responses in the visual cortex of the awake macaque monkey. Journal of Neuroscience, 16, 2381–2396.
Lee, S.-G., & Kim, S. (1999). Parameter dependence of stochastic resonance in the stochastic Hodgkin-Huxley neuron. Phys. Rev. E, 60, 826–830.
Liley, D. T. J., Alexander, D. M., Wright, J. J., & Aldous, M. D. (1999). Alpha rhythm emerges from large-scale networks of realistically coupled multicompartmental model cortical neurons. Network: Comput. Neural Syst., 10, 79–92.
Massanes, S. R., & Vicente, C. J. P. (1999). Nonadiabatic resonances in a noisy Fitzhugh-Nagumo neuron model. Phys. Rev. E, 59, 4490–4497.
Nozaki, D., & Yamamoto, Y. (1998). Enhancement of stochastic resonance in a Fitzhugh-Nagumo neuronal model driven by colored noise. Phys. Lett. A, 243, 281–287.
Plesser, H. E., & Gerstner, W. (2000). Noise in integrate-and-fire neurons: From stochastic inputs to escape rates. Neural Computation, 12, 367–384.
Plesser, H. E., & Tanaka, S. (1997). Stochastic resonance in a model neuron with reset. Physics Letters A, 225, 228–234.
Rodriguez, R. (1995). Coupled Hodgkin-Huxley neurons with stochastic synaptic inputs. In J. Bertrand (Ed.), Modern group theoretical methods in physics (pp. 233–242). Norwood, MA: Kluwer.
Shadlen, M. N., & Newsome, W. T. (1995). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 1, 210–217.
Tanabe, S., & Pakdaman, K. (2001). Dynamics of moments of Fitzhugh-Nagumo neuronal models and stochastic bifurcations. Phys. Rev. E, 63, 031911.
Tiesinga, P. H. E., José, J. V., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419.
Tuckwell, H. C. (1976). On the first exit time problem for temporally homogeneous Markov processes. J. Appl. Prob., 13, 39–48.
Tuckwell, H. C. (1986). Stochastic equations for nerve membrane potential. J. Theoret. Neurobiol., 5, 87–99.
Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: SIAM.
Tuckwell, H. C., & Rodriguez, R. (1998). Analytical and simulation results for stochastic Fitzhugh-Nagumo neurons and neural networks. J. Computational Neuroscience, 5, 91–113.
Wan, F. Y. M., & Tuckwell, H. C. (1982). Neuronal firing and input variability. J. Theoret. Neurobiol., 1, 197–218.

Received August 24, 2001; accepted July 3, 2002.
LETTER
Communicated by Richard Zemel
Developmental Constraints Aid the Acquisition of Binocular Disparity Sensitivities

Melissa Dominguez
[email protected] Department of Computer Science, Universityof Rochester,Rochester,NY 14627,U.S.A. Robert A. Jacobs
[email protected] Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, U.S.A. This article considers the hypothesis that systems learning aspects of visual perception may benet from the use of suitably designed developmental progressions during training. We report the results of simulations in which four models were trained to detect binocular disparities in pairs of visual images. Three of the models were developmental models in the sense that the nature of their visual input changed during the course of training. These models received a relatively impoverished visual input early in training, and the quality of this input improved as training progressed. One model used a coarse-scale-to-multiscale developmental progression, another used a ne-scale-to-multiscale progression, and the third used a random progression. The nal model was nondevelopmental in the sense that the nature of its input remained the same throughout the training period. The simulation results show that the two developmental models whose progressions were organized by spatial frequency content consistently outperformed the nondevelopmental and random developmental models. We speculate that the superior performance of these two models is due to two important features of their developmental progressions: (1) these models were exposed to visual inputs at a single scale early in training, and (2) the spatial scale of their inputs progressed in an orderly fashion from one scale to a neighboring scale during training. Simulation results consistent with these speculations are presented. We conclude that suitably designed developmental sequences can be useful to systems learning to detect binocular disparities. The idea that visual development can aid visual learning is a viable hypothesis in need of study. 1 Introduction
Neural Computation 15, 161–182 (2003); © 2002 Massachusetts Institute of Technology

Human infants are born with limited perceptual, motor, and cognitive abilities relative to adults. Within the field of developmental psychology, there
are at least two perspectives regarding these limitations. The older and more commonplace view is that these limitations are barriers that must be overcome in order for a child to achieve adult function (Piaget, 1952). That is, they are deficiencies or immaturities that serve no positive purpose. A newer view is that these apparent inadequacies are in fact helpful, perhaps necessary, stages in development. Limited mental abilities, according to this theory, reflect simple neural representations that are useful stepping-stones or building blocks for the subsequent development of more complex representations (Turkewitz & Kenney, 1982). The idea that early developmental stages are useful or necessary precursors to more advanced stages is becoming increasingly studied in the cognitive neurosciences. The development of biological nervous systems is sometimes characterized as using a bootstrapping strategy. Greenough, Black, and Wallace (1987) speculated that asynchrony in brain development serves the useful function of stage setting. The developmental schedule for the maturation of different brain regions is staggered such that neural systems that develop relatively early provide a suitable framework for the development of later experience-sensitive systems. Harwerth, Smith, Duncan, Crawford, and von Noorden (1986) provided experimental evidence consistent with this hypothesis by performing behavioral studies of sensitive periods for visual development in monkeys. Their results suggest that these sensitive periods are organized into a hierarchy in which early visual functions requiring information processing in the earliest stages of the visual system have shorter sensitive periods than higher-level functions requiring more central neural processing. A second example is provided by the work of Shatz (1996).
Within the lateral geniculate nucleus (LGN) of adult mammals, retinal ganglion cell axons from one eye are segregated from those arising from the other eye to form a series of alternating eye-specific layers. These layers are not present initially in development. Moreover, they form during a period in which vision is not possible. Shatz argued that the development of eye-specific layers is characterized by at least two important events. First, retinal ganglion cells spontaneously show waves of activity that sweep across the retina, such that activity at nearby cells is more highly correlated than at distant cells. Second, LGN cells use a Hebb-style adaptation mechanism that sorts connections based on local correlations of activity in order to form eye-specific layers. If so, then this is a clear example in which a developmental event at an earlier visual region (spontaneous waves of activity at the retina) sets the stage for an event at a later visual region (Hebb-style adaptation at the LGN). Newport (1990) hypothesized that children use a bootstrapping strategy when attempting to learn a language. Human languages are componential systems in which small linguistic components are systematically combined to form larger linguistic structures. According to Newport's "less is more" hypothesis, the limited attentional and memorial abilities of children are useful when learning a language because they help children segment and
identify the small components that comprise the language. Elman (1993) studied an implementation of this general idea and showed that a recurrent neural network whose memory capacity was initially limited but then gradually increased during the course of training learned aspects of an artificial grammar better than a network whose memory capacity was never limited; that is, the second network's memory capacity was always equal to that of the first network at the end of training. Elman claimed that this outcome supports the idea that starting small is important to the subsequent acquisition of complex mental abilities. Rohde and Plaut (1999), however, were unable to replicate Elman's simulation results, so it is difficult to know how to interpret these results. This article considers the hypothesis that systems learning aspects of visual perception may benefit from the use of suitably designed developmental progressions during training. We report the results of simulations in which four systems were trained to detect binocular disparities in pairs of visual images. Three of the systems were developmental models in the sense that the nature of their input changed during the course of training. These systems received a relatively impoverished visual input early in training, and the quality of this input improved as training progressed. The fourth system was a nondevelopmental model; the nature of its input remained constant during the course of training. The inputs to the systems were left and right retinal images filtered with binocular energy filters tuned to various spatial frequencies. The training of the first system, referred to as the coarse-scale-to-multiscale model (model C2M), included a developmental sequence such that the system was exposed only to low-spatial-frequency information at the start of training, and information at higher spatial frequencies was added to its input as training progressed.
The training of the second system, referred to as the fine-scale-to-multiscale model (model F2M), included an analogous developmental sequence, with spatial frequency information added in the reverse order. This system received high-spatial-frequency information at the start of training, and information at lower spatial frequencies was added as training progressed. The third system, referred to as the random-developmental model (model RD), was similar to models C2M and F2M in the sense that its training included a developmental sequence. However, whereas the inputs received by models C2M and F2M at each developmental stage were organized by spatial frequency content, the inputs received by model RD at each stage were randomly selected. Finally, the fourth system, referred to as the nondevelopmental model (model ND), was not trained using a developmental sequence; it received information at all spatial frequencies throughout the training period. When comparing the three developmental models with the nondevelopmental model, there are at least two reasonable predictions that one could make about the simulation results. One prediction is that the nondevelopmental model should outperform the developmental models. The
164
M. Dominguez and R. Jacobs
nondevelopmental model received all input information throughout all stages of training, whereas the developmental models were deprived of portions of the input at certain training stages. If more information is better than less information, then the nondevelopmental model ought to perform best. This would be consistent with the traditional view of human infant development: that perceptual immaturities are barriers to be overcome. An alternative prediction, consistent with the general approach of the less-is-more hypothesis, is that the developmental models would show the better performance. If it is believed that too much information could lead a learning system in its early stages of training to form poor internal representations, then the developmental models ought to have an advantage. The nondevelopmental model had a greater number of inputs than the developmental models during the early stages of training and, thus, a greater number of modifiable weights. Because learning in neural networks is a search in weight space, the nondevelopmental model needed to perform an unconstrained search in a high-dimensional weight space. Unconstrained searches frequently lead to the acquisition of poor representations. In contrast, the developmental models initially had fewer inputs and thus fewer modifiable weights. During early stages of training, their searches in weight space were comparatively constrained. If the constraints were appropriate, this should have facilitated the acquisition of useful representations by the developmental models. When comparing the performances of the two developmental models whose stages are based on spatial frequency content (models C2M and F2M), there are at least three reasonable predictions one could make about the simulation results. One prediction is that model C2M should perform best. A motivation for this prediction comes from the field of computer vision.
Consider the task of aligning two images of a scene where the images differ due to a small horizontal offset in their viewpoints. Roughly, this is known as the stereo correspondence problem. If this were something that you had never done before, you might try to align fine details of each image. For example, you might pick a white dot in one image and repeatedly try aligning it with white dots in the other image. If the images are highly textured, such an approach would be inefficient because there is an intractable number of potential alignments that might need to be checked before the two images are properly aligned (see Figure 1A). If, however, the images are blurred, the fine details that were the source of so much confusion would be removed, leaving a smaller number of larger features in each image. The problem of finding a good alignment is now significantly easier because there is a smaller number of potential alignments (see Figure 1B). After you have gained skill at aligning blurred images, you would then have a reliable foundation that you could use to learn to align clearer images. You might attempt to match the larger image features first. Subsequent analysis of fine details would seek to remove ambiguities that arise when aligning large features instead of being a starting point for locating correspondences.
Figure 1: (A) One image from a stereo pair depicting a textured object and background. Due to the fine details in the stereo pair, solving the stereo correspondence problem is relatively difficult. (B) The image from A blurred so as to remove the fine details. It is easier to solve the stereo correspondence problem when the stereo pair is blurred because image features tend to be larger, less numerous, and more robust to noise.
This style of processing is commonplace in the computer vision literature. Systems by Marr and Poggio (1979), Quam (1986), and Barnard (1987), among many others, initially search for correspondences within a pair of low-resolution images. Low-resolution images are used initially because these images contain fewer image features, larger image features, and image features that are relatively robust to noise. Next, these systems refine their estimates of corresponding image features using information from one or more higher-resolution pairs of images. The usefulness of a coarse-to-fine processing strategy when searching for stereo correspondences suggests that a coarse-scale-to-multiscale developmental strategy might be useful for learning to detect binocular disparities. Before adopting this hypothesis, however, it is important to keep in mind the differences between these two strategies. Computer vision researchers use a coarse-to-fine strategy when searching for correspondences in individual pairs of images. In contrast, we used the coarse-scale-to-multiscale developmental sequence while training a learning system to detect binocular disparities using many pairs
of images. It is not obvious that lessons from one situation can be applied to the other situation. If we assume that human intelligence provides a guide to creating machine intelligence, then a second motivation for the prediction that model C2M should perform best is the fact that human infants show a related developmental progression. Visual acuity is often measured using a grating, which is a visual pattern whose luminance values are sinusoidally modulated. Acuity is characterized by the highest-frequency grating that is distinguishable from a solid gray pattern. Using this method, it has been found that newborns' visual acuity is extremely poor. Whereas adults with normal vision (so-called 20/20 vision) can discriminate approximately 30 cycles per degree of arc, newborns can discriminate only 1 to 2 cycles per degree, giving them a visual acuity of about 20/400. Acuity improves approximately linearly from these low levels at birth to near adult levels by around eight months of age (Norcia & Tyler, 1985). Importantly for our purposes, infants are acquiring other visual abilities during this time period; in particular, sensitivity to binocular disparities appears at around four months of age (Atkinson & Braddick, 1976; Fox, Aslin, Shea, & Dumais, 1980; Held, Birch, & Gwiazda, 1980; Petrig, Julesz, Kropfl, Baumgartner, & Anliker, 1981). We speculate that the developments of visual acuity and binocular disparity sensitivity may be related in the sense that poor acuity at an early age aids in the acquisition of disparity sensitivity later in life. An alternative prediction is that model F2M should perform better than model C2M. A motivation for this prediction is the fact that computer vision researchers often find it easier to solve the stereo correspondence problem by first extracting edge information from left and right images, a form of high-frequency bandpass filtering, and then searching for a good alignment of the images based on this information.
This strategy is useful because edges can be sparse, large, and robust to noise relative to other image features. Analogous to computer vision systems' initial use of high-frequency information, model F2M is initially trained solely with information extracted from high-frequency bandpass filters. If we are willing to assume that computer vision methods for finding stereo correspondences may provide lessons for how learning systems can learn to detect binocular disparities, then we might predict that model F2M has an advantage. A second motivation for the prediction that model F2M should perform best is the seemingly counterintuitive result that neural networks trained with input patterns that have been corrupted by noise frequently show better generalization than equivalent networks trained with input patterns that have not been corrupted (Sietsma & Dow, 1991). Training with noisy inputs has been shown analytically to be equivalent to a form of regularization (Bishop, 1995; Matsuoka, 1992; Webb, 1994). In other words, the learning process of networks trained with noisy inputs is more constrained than that of similar networks whose inputs are not corrupted by noise. If the retinal images contain noise and if model C2M tends to filter out the noise during
early stages of training, then this model will not obtain the benefits of training with noisy inputs. We would therefore expect model F2M to perform best. Another possible prediction is that models C2M and F2M perform about equally well. If the relative advantages of model C2M (being able to use fewer image features, larger image features, and more noise-resistant image features at the start of training) and the relative advantages of model F2M (initial exposure to high-frequency bandpass information, being able to obtain the benefits of training with noisy inputs) are roughly balanced, then neither model would be expected to show superior performance. Furthermore, we outlined the logic of computer vision researchers who advocate a coarse-to-fine processing strategy when analyzing stereo correspondences, and we tentatively speculated that the usefulness of this strategy suggests that the seemingly related coarse-scale-to-multiscale developmental strategy might be useful for learning to detect binocular disparities. If, however, we again make the assumption that human intelligence provides a guide to creating machine intelligence, then it is worth noting that humans often do not use a coarse-to-fine strategy when analyzing stereo correspondences. Mallot, Gillner, and Arndt (1996) found that unambiguous information at a coarse scale is not always used by observers to disambiguate finer-scale information and that observers can use unambiguous fine-scale information to disambiguate coarse-scale information, meaning that observers are using a fine-to-coarse processing strategy in these circumstances. Related findings have been reported by several other researchers (McKee & Mitchison, 1988; Mowforth, Mayhew, & Frisby, 1981; Smallman, 1995). Because human observers do not use a coarse-to-fine strategy or a fine-to-coarse strategy exclusively, we might not expect the exclusive use of a C2M developmental strategy or the exclusive use of an F2M strategy to yield a relative performance advantage.
When comparing the performances of models C2M and F2M with that of model RD, the only reasonable prediction seems to be that model RD should perform no better than the other developmental models, and it is likely that it will perform worse. We have included model RD in this article in order to demonstrate that the use of a developmental sequence does not necessarily lead to performance advantages. Instead, the benefits of a developmental sequence are found only when the sequence incorporates constraints that are useful for learning the desired behavioral task. This article reports the results of computer simulations comparing the learning performances of the developmental and nondevelopmental models on the task of estimating binocular disparities in novel pairs of images. The results show that the developmental models whose stages are based on spatial frequency content (models C2M and F2M) consistently outperformed the nondevelopmental and random developmental models. We speculate that the superior performance of these models is due to two important features of their developmental progressions: (1) these
models were exposed to visual inputs at a single scale early in training, and (2) the spatial scale of their inputs progressed in an orderly fashion from one scale to a neighboring scale. Simulation results consistent with these speculations are presented. We conclude that suitably designed developmental sequences can be useful to systems learning to detect binocular disparities and that the general idea that visual development can aid visual learning is a viable hypothesis in need of future study.

2 Developmental and Nondevelopmental Models
Figure 2 illustrates the structure of the developmental and nondevelopmental models. This structure is based on a similar architecture studied by Gray, Pouget, Zemel, Nowlan, & Sejnowski (1998). The retinal layer consisted of two one-dimensional arrays 62 pixels in length for the left and right eye images. Each retina was treated as if it were shaped like a circle; the left-most and right-most pixels were regarded as neighbors. This wraparound of the left and right edges was done to avoid edge artifacts. Although one-dimensional retinas are a simplification, their use is justified by the fact that the models were concerned only with horizontal disparities, as these are the ones that provide information about the three-dimensional configuration of the visual environment.¹ The retinal inputs were filtered using binocular energy filters. Based on neurophysiological studies, Ohzawa, DeAngelis, and Freeman (1990) proposed binocular energy filters as a way of modeling the binocular sensitivities of simple and complex cells in primary visual cortex. These filters are an extension of motion energy filters proposed by Adelson and Bergen (1985). A simple cell receives input from a pair of subunits, one for each retina. The receptive field profiles of the subunits can be described mathematically as Gabor functions:

$$g_L(x, \varphi) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right) \sin(2\pi \nu x + \varphi) \tag{2.1}$$

$$g_R(x, \varphi) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right) \sin(2\pi \nu x + \varphi + \delta\varphi) \tag{2.2}$$
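As a concrete illustration, the two subunit profiles can be sketched in a few lines of code. This is our own minimal rendering of equations 2.1 and 2.2; the function and parameter names are ours, not the authors':

```python
import math

def gabor(x, sigma, nu, phase):
    """Gabor profile: a gaussian envelope times a sinusoid (cf. Eqs. 2.1-2.2)."""
    return (math.exp(-x**2 / (2.0 * sigma**2)) / (math.sqrt(2.0 * math.pi) * sigma)
            * math.sin(2.0 * math.pi * nu * x + phase))

def left_subunit(x, sigma, nu, phi):
    # Left-eye receptive field profile, Eq. 2.1.
    return gabor(x, sigma, nu, phi)

def right_subunit(x, sigma, nu, phi, dphi):
    # Right-eye profile, Eq. 2.2: identical except for the phase offset dphi.
    return gabor(x, sigma, nu, phi + dphi)
```

With sigma = 1.0 and nu = 0.25 (the highest frequency used in the simulations), the two profiles differ only through the phase offset.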
1 Vertical disparities provide information about viewing distance and angle of gaze but by themselves do not carry information about the three-dimensional structure of the visual environment. That is, by themselves, they cannot be used for making relative or absolute depth judgments. In contrast, horizontal disparities alone can be used for making relative depth judgments and, when scaled by viewing distance information (perhaps obtained via vertical disparities), for making absolute depth judgments.

Figure 2: The developmental and nondevelopmental models shared a common structure. The bottom portion of this structure is the left and right retinal images. These images are then filtered by binocular energy filters. The outputs of these filters are the inputs to an artificial neural network that is trained to estimate the disparity present in the images. The model illustrated here is model C2M, in which low-spatial-frequency information was received during early stages of training, and information at higher frequencies was added as training progressed.

Each function is a sinusoid multiplied by a gaussian envelope. The parameter x is the distance to the center of the gaussian, σ² is the variance of the gaussian, ν is the frequency of the sinusoid, and φ and δφ are referred to as the base phase and phase offset of the sinusoid, respectively. The Gabor functions associated with the left and right retinal subunits differ in that the phase of one is offset from the phase of the other. The output of a simple cell is formed in two stages. First, the convolution of the left retinal image with the left subunit Gabor is added to the convolution of the right retinal image with the right subunit Gabor; next, this sum is half-wave rectified and squared (a negative sum is mapped to zero; a positive sum is mapped to its square). The magnitude of a simple cell's output is related to the presence of a binocular disparity of a particular size in the retinal input. Simple cells formed from subunits with different phase offsets are sensitive to disparities of different sizes (Fleet, Wagner, & Heeger, 1996; Qian, 1994). The output of a complex cell is the sum of the outputs of four simple
cells. Because the base phases of these simple cells form quadrature pairs (the base phases are 0, π/2, π, and 3π/2), the complex cell's output is relatively insensitive to the exact position of a disparity within its receptive field. In our simulations, there were 35 receptive field locations that received input from overlapping regions of the retina. At each location, there were 30 complex cells corresponding to three spatial frequencies and 10 phase offsets at each frequency. The three spatial frequencies were each separated by an octave: 0.25, 0.125, and 0.0625 cycles per pixel. The standard deviations of the Gabor functions were set to be inversely proportional to the frequency: 1.0 for 0.25 cycles per pixel, 2.0 for 0.125 cycles per pixel, and 4.0 for 0.0625 cycles per pixel. The 10 phase offsets were equally spaced over a range from 0 to π/2. The outputs of the complex cells were normalized using a softmax nonlinearity,

$$\hat{E}_i(x) = \frac{e^{E_i(x)/\tau}}{\sum_j e^{E_j(x)/\tau}}, \tag{2.3}$$
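Equation 2.3 amounts to a temperature-scaled softmax over the 10 phase-offset cells at one receptive-field location and frequency band. A minimal sketch (the function name is ours):

```python
import math

def softmax_normalize(responses, tau=0.25):
    """Softmax normalization of complex-cell outputs (cf. Eq. 2.3).

    `responses` holds the outputs of the complex cells with different
    phase offsets at one receptive-field location within one band.
    """
    # Subtracting the maximum before exponentiating improves numerical
    # stability and leaves the value of Eq. 2.3 unchanged.
    m = max(responses)
    exps = [math.exp((r - m) / tau) for r in responses]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the normalized outputs always sum to one, the cells signal relative rather than absolute contrast, consistent with the text's description.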
where E_i(x) was the initial output of the complex cell, Ê_i(x) was the normalized output, τ is a scaling parameter known as a temperature parameter (its value was set to 0.25), and j indexed the 10 complex cells with different phase offsets at a receptive field location within a single frequency band. As a result of this normalization, complex cells tended to respond to relative contrast in an image rather than absolute contrast. The normalized outputs of the complex cells were the inputs to an artificial neural network. The network had 1050 input units (the complex cells had 35 receptive field locations, and there were 30 cells at each location). The hidden layer of the network contained 32 units organized into 8 groups of 4 units each. The connectivity to the hidden units was set so that each group had a limited receptive field; a group of hidden units received inputs from seven receptive field locations at the complex cell level. The hidden units used a logistic activation function. The output layer consisted of a single linear unit; this unit's output was an estimate of the disparity present in the right and left retinal images. The weights of an artificial neural network were initialized to small random values and were adjusted during the course of training to minimize a sum of squared error cost function using a conjugate gradient optimization procedure (Press, Teukolsky, Vetterling, & Flannery, 1992). This procedure was used because it tends to converge quickly and because it has no free parameters (e.g., no learning rate or momentum parameters). Weight sharing was implemented at the hidden unit level so that corresponding units within each group of hidden units had the same incoming and outgoing weight values and a hidden unit had the same set of weight values from each receptive field location at the complex unit level. This provided the
network with a degree of translation invariance and also dramatically decreased the number of modifiable weight values in the network. It therefore decreased the number of data items needed to train the network and the amount of time needed to train the network. Models were trained and tested using separate sets of training and test data items. Training sets contained 250 randomly generated data items; test sets contained 122 data items that were generated so as to cover the range of possible binocular disparities uniformly. Training was terminated after 35 iterations through the training set in order to minimize overfitting of the training data. The results reported are based on the data items from the test set. Model C2M was trained using a coarse-scale-to-multiscale developmental sequence. This was implemented as follows. The training period was divided into three stages where the first and second stages were each 10 iterations and the third stage was 15 iterations.² During the first stage, the neural network portion of the model received only the outputs of complex cells tuned to low spatial frequencies (the outputs of the other complex cells were set to zero). During the second stage, the network received the outputs of complex cells tuned to low and medium spatial frequencies; it received the outputs of all complex cells during the third stage. The training of model F2M was identical to that of model C2M except that its training used a fine-scale-to-multiscale developmental sequence. During the first stage of training, its network received the outputs of complex cells tuned to high spatial frequencies. This network received the outputs of complex cells tuned to high and medium frequencies during the second stage and received the outputs of all complex cells during the third stage. The training of model RD also used a developmental sequence, though this sequence was generated randomly and thus was not based on the spatial frequency tuning of the complex cells.
The collection of complex cells was randomly partitioned into three equal-sized subsets with the constraint that each subset included all phase offsets at all receptive field locations. During the first stage of training, the neural network portion of the model received the outputs of only the complex cells in the first subset. It received the outputs of the cells in the first and second subsets during the second stage of training and the outputs of all complex cells during the third stage. In contrast, the training period for the nondevelopmental model was not divided into separate stages; its neural network received the outputs of all complex cells throughout the training period.
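The staged input schedules for models C2M, F2M, and ND can be summarized in a small sketch. Model RD's random partition is omitted, and the names, dictionary layout, and masking helper are our own hypothetical scaffolding, not the authors' code:

```python
# Stage boundaries follow the text: stages of 10, 10, and 15 iterations.
SCHEDULES = {
    "C2M": [("low",), ("low", "med"), ("low", "med", "high")],
    "F2M": [("high",), ("high", "med"), ("high", "med", "low")],
    "ND":  [("low", "med", "high")] * 3,
}

def stage_of(epoch):
    """Map a training epoch (0-34) to a developmental stage index."""
    if epoch < 10:
        return 0
    if epoch < 20:
        return 1
    return 2

def masked_input(outputs, model, epoch):
    """Zero out complex-cell outputs whose band is not yet switched on.

    `outputs` maps a band name ("low", "med", "high") to the list of
    complex-cell outputs for that band.
    """
    active = SCHEDULES[model][stage_of(epoch)]
    return {band: (vals if band in active else [0.0] * len(vals))
            for band, vals in outputs.items()}
```

Setting masked bands to zero mirrors the text's statement that "the outputs of the other complex cells were set to zero."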
2 The number of iterations in the training stages was roughly optimized using models C2M and F2M and the solid object data set described below. Models with stages of different sizes were tested, and the results suggest that models tended to show worse generalization performance if the number of iterations in the first and second stages was decreased or increased.
3 Data Sets and Simulation Results
The performances of the four models were evaluated on three data sets. These data sets were based on related data sets used by Gray et al. (1998). In all cases the images were gray scale with luminance values between 0 and 1 and disparities with values between 0 and 3 pixels. Ten simulations of each model on each data set were conducted. In the solid object data set, images consisted of a single light or dark object on a gray background. The object's gray-scale value was either between 0.0 and 0.1 or between 0.9 and 1.0, whereas the gray-scale value of the background was always 0.5. The location of the object was randomly chosen to be a real-valued location on the retina. The object's disparity was randomly chosen to be a real value between 0 and 3 pixels. The object's size was randomly chosen to be a real value between 10 and 25 pixels. Since the object's size, location, and disparity were all real numbers, the ends of the object could fall at a real-valued location within a pixel. In these (common) cases, the value of the partially covered pixel was interpolated between the gray-scale value of the object and that of the background in proportion to the amount of the pixel covered by the object and background. An example of a right and left image is shown in the top panel of Figure 3. Given the right and left images, the task of a model was to estimate the object's disparity. The upper left graph of Figure 4 illustrates the results. The horizontal axis gives the model, and the vertical axis gives the root mean squared error (RMSE) at the end of training on the data items from the test set (the error bars give the standard error of the mean). On average, developmental model C2M had a 16.5% smaller generalization error than the nondevelopmental model (the difference between the mean error rates is statistically significant; t = 3.77, p < 0.002 using a two-tailed t-test), and a 19.3% smaller error than the random developmental model (t = 4.60, p < 0.001).
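The solid object stimuli described above can be generated along the following lines. This is our reconstruction of the procedure; details such as how partial pixel coverage and wraparound are handled are assumptions:

```python
import random

RETINA = 62          # pixels per one-dimensional retina
BACKGROUND = 0.5     # gray-scale value of the background

def render(center, size, value):
    """Render a solid object at a real-valued position on a circular retina.

    Partially covered pixels are interpolated between object and
    background gray levels in proportion to coverage.
    """
    image = [BACKGROUND] * RETINA
    left, right = center - size / 2.0, center + size / 2.0
    for p in range(RETINA):
        cover = 0.0
        for shift in (-RETINA, 0, RETINA):   # circular retina wraparound
            lo, hi = max(p, left + shift), min(p + 1, right + shift)
            cover += max(0.0, hi - lo)
        cover = min(cover, 1.0)
        image[p] = cover * value + (1.0 - cover) * BACKGROUND
    return image

def solid_object_pair():
    """One (left, right, disparity) item for the solid object data set."""
    value = (random.uniform(0.0, 0.1) if random.random() < 0.5
             else random.uniform(0.9, 1.0))        # dark or light object
    center = random.uniform(0, RETINA)             # real-valued location
    size = random.uniform(10, 25)                  # real-valued size
    disparity = random.uniform(0, 3)               # real-valued disparity
    return (render(center, size, value),
            render(center + disparity, size, value),
            disparity)
```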
Figure 3: Examples of right and left images (top and bottom rows in each panel) from the solid object, noisy object, and planar data sets.

Figure 4: The four models' RMSE on the test set data items after training on the three data sets (the error bars give the standard error of the mean).

Developmental model F2M had a 12.2% smaller error than the nondevelopmental model (t = 23.74, p < 0.001) and a 15.2% smaller error than the random developmental model (t = 49.01, p < 0.001). Clearly, the two developmental models whose developmental progressions were organized by spatial frequency content outperformed both the nondevelopmental model and the random developmental model. A statistical comparison between models C2M and F2M shows that their performances were not significantly different. The variability in model C2M's performance was notably greater than the variability in the performances of the other models. Although it is difficult to know why, it may have to do with the properties of binocular energy units. As discussed by Qian (1994), Zhu and Qian (1996), and Fleet et al. (1996), the responses of these units may peak even when the input disparity is outside the range of disparities to which the unit was thought to be responsive. In such cases, the large response is known as a false peak. False peaks occur for nearly all stimuli, including white noise. Moreover, some of the false peaks will be significantly larger than the peak at the disparity to which the unit
was thought to be selective. False peaks tend to occur at integer multiples of the wavelength to which the unit is tuned, meaning that units tuned to low frequencies tend to have false peaks that are relatively far from each other, and units tuned to high frequencies tend to have false peaks near to each other. Consequently, units tuned to low spatial frequencies tend to have false peaks at disparities that are significantly different from the disparity to which they were thought to be responsive, whereas units tuned to high frequencies tend to have false peaks at disparities that are closer to the disparity to which they were thought to be responsive (see Figure 5). The large variability in model C2M's performance may be due to the fact that this model is sometimes misled early in training by false peaks at disparities that are far from the disparities to which units are thought to be selective. This is less likely to be as serious a problem for the other models because they are less likely than model C2M to emphasize information provided by units tuned to low spatial frequencies. The learning curves for the four models on the solid object data set are illustrated in the upper left graph of Figure 6. The horizontal axis gives the training time in epochs, and the vertical axis gives the RMSE on the test set data items. The solid black line is for model C2M, the solid gray line is for model F2M, the dashed black line is for model RD, and the dotted black line is for model ND. The learning curve for model ND falls quickly within the first few epochs of training and then is relatively flat for the remainder of the training period. In contrast, the developmental models learned relatively slowly, and models C2M and F2M eventually showed the best generalization performance.
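The false-peak geometry discussed above, peaks at integer multiples of the tuning wavelength away from the preferred disparity, can be made concrete with a small arithmetic sketch. The function and the choice of k-range are ours; the wavelengths correspond to the 0.0625 and 0.25 cycles-per-pixel filters used in the simulations:

```python
def false_peak_disparities(d, wavelength, k_max=2):
    """Disparities at which a unit tuned to disparity d can spuriously peak:
    integer multiples of the wavelength away from d (see Figure 5)."""
    return [d + k * wavelength for k in range(-k_max, k_max + 1) if k != 0]

# Low-frequency unit (wavelength 1/0.0625 = 16 pixels): false peaks far from d.
low = false_peak_disparities(d=1.0, wavelength=16.0)
# High-frequency unit (wavelength 1/0.25 = 4 pixels): false peaks near d.
high = false_peak_disparities(d=1.0, wavelength=4.0)
```

The nearest false peak of the low-frequency unit lies 16 pixels from its preferred disparity, versus 4 pixels for the high-frequency unit, which is why false peaks of low-frequency units are the more misleading ones.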
Figure 5: The top two figures illustrate Gabor filters for the left-eye and right-eye images that are tuned to a low spatial frequency (λ is the wavelength). The right-eye Gabor is phase-shifted by d units relative to the left-eye Gabor. The sum of the filter responses tends to be large when the disparity in the images is d, d − λ, or d + λ. Because the wavelength λ is relatively large, the sum can peak at input disparities (e.g., d + λ) that are far from the disparity (d) to which the sum was thought to be responsive. As illustrated in the bottom two figures, a similar situation holds for filters tuned to high spatial frequencies, but now the wavelength λ is smaller and the false peaks occur at input disparities that are significantly closer to the disparity to which the sum was thought to be responsive. Consequently, false peaks are more misleading when using filters tuned to low frequencies than when using filters tuned to high frequencies.

Figure 6: Learning curves for the four models on the three data sets.

This result is consistent with the notion described above that apparent inadequacies in performance during early development are not necessarily bad; they sometimes suggest the use of internal representations that are useful stepping-stones for the subsequent development of advanced behaviors. The images in the second data set, referred to as the noisy object data set, were meant to resemble the random dot stereograms frequently used in behavioral experiments. Images contained a noisy object against a noisy background. The gray-scale values of the object pixels and the background pixels were set to random numbers between 0 and 1. The location of the object was randomly chosen to be a real-valued location on the retina. The object's size was randomly chosen to be a real value between 10 and 25 pixels. The object's disparity was a randomly chosen integer between 0 and 3 pixels. As before, the task was to map the left and right images to an estimate of the object's disparity. An example of a left and right image is shown in the middle panel of Figure 3. The results are shown in the upper right panel of Figure 4. As before, the developmental models whose developmental progressions were organized by spatial frequency content performed best. On average, model C2M had a 7.1% smaller generalization error than model ND (t = 448.8, p < 0.001) and a 6.6% smaller error than model RD (t = 2.46, p < 0.05). Model F2M had a 4.3% smaller error than model ND (t = 1566.5, p < 0.001) and a 3.8% smaller error than model RD (this result is not statistically significant). Comparing the developmental models C2M and F2M, C2M had a 2.85% smaller error (t = 170.6, p < 0.001). The learning curves for the four models are shown in the upper right graph of Figure 6. A notable feature illustrated by this graph is that the random developmental model had highly unstable levels of performance during the course of training. The last data set, the planar data set, was different from the first two data sets. Instead of an object in front of a background, the images depicted a frontoparallel plane. The values of the left-image pixels were randomly chosen to be either 0 or 1. The right image was generated by applying an integer shift to the left image of 0, 1, 2, or 3 pixels. Given the left and right images, the task was to estimate the size of the shift. An example of a left and right image is shown in the bottom panel of Figure 3.
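The planar stimuli can be sketched as follows. The shift direction and the use of circular wraparound are our assumptions (the paper does describe each retina as circular):

```python
import random

def planar_pair(width=62):
    """One item from the planar data set: a binary left image, a right image
    that is a circular integer shift of it, and the shift size as target."""
    left = [random.choice([0, 1]) for _ in range(width)]
    shift = random.randint(0, 3)                       # integer disparity
    right = [left[(i - shift) % width] for i in range(width)]
    return left, right, shift
```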
The results are shown in the bottom graph of Figure 4. Again, the developmental models whose developmental progressions were organized by spatial frequency content tended to outperform the other models. Model C2M had a 6.7% smaller generalization error than model ND (t = 2.27, p < 0.05) and a 1.3% smaller error than model RD (this result is not statistically significant). Model F2M had an 8.3% smaller error than model ND (t = 16.84, p < 0.001) and a 3.0% smaller error than model RD (t = 7.01, p < 0.001). The performances of models C2M and F2M were not statistically different. The bottom graph of Figure 6 gives the learning curves for the four models. In summary, the simulation results using the three data sets show that the developmental models whose progressions were organized by spatial frequency content (models C2M and F2M) significantly outperformed the nondevelopmental and random-developmental models. To understand the performances of models C2M and F2M better, we analyzed their behaviors more carefully using the solid object data set. We looked at how these models performed on images with disparities of different sizes (small, midsize, and large disparities) at the end of each developmental stage (stages 1, 2, and 3). These data are given in Figure 7. The three graphs in this figure correspond to the three disparity sizes. The horizontal axis of each graph gives the stage number for each model; the vertical axis gives each model's RMSE at the end of the indicated stage on those test set data items with the indicated disparity size. For purposes of comparison, the graphs also include the data for the nondevelopmental model. Model ND did not significantly change its ability to detect disparities of a given size across training stages; its performance at detecting small, midsize, or large disparities remained nearly constant across the stages. Model C2M improved its ability to estimate disparities of all sizes as training progressed from one stage to the next.
In contrast, model F2M appears to have learned about different-sized disparities in each of the different training stages. For example, it learned a great deal about detecting small disparities in stage 1; in stages 2 and 3, it learned more about detecting midsize and large disparities. Not surprisingly, its performance at detecting small disparities degraded as its performance at detecting midsize and large disparities improved.
The question remains as to why models C2M and F2M showed superior performance. One logical possibility is that these models performed well because they received relatively few inputs early in training. This possibility can be ruled out, however, because model RD received equally few inputs early in training, yet it did not perform well. We speculate that two important features of the developmental progressions of models C2M and F2M account for their superior performances. First, these models were exposed to visual inputs at a single scale early in training. Model C2M received only coarse-scale information at the start of training; model F2M received only fine-scale information. In contrast, models RD and ND received information at all spatial scales at all stages of training.
178
M. Dominguez and R. Jacobs
[Figure 7 appears here: three panels plotting RMSE (0.0 to 2.0) against developmental stage (1 to 3) for models C2M, F2M, and ND; one panel each for small disparities (0-1 pixels), mid-size disparities (1-2 pixels), and large disparities (2-3 pixels).]
Figure 7: Performance of models C2M, F2M, and ND on images with disparities of different sizes (small, midsize, and large disparities) at the end of each developmental stage (stages 1, 2, and 3). Training and test data items came from the solid object data set.
We conjecture that it might be advantageous for a learning system to receive inputs at a single scale early in training because this allows the system to combine and compare input features without the need to compensate for the fact that these features could be at different spatial scales. Second, models C2M and F2M might be at an advantage because the spatial scale of their inputs progressed in an orderly fashion from one scale to a neighboring scale. Consequently, when these models received inputs at a new spatial scale, this new scale was close to a scale with which the models were already familiar. If it is the case that this second feature of models C2M and F2M is important, then this leads to an interesting prediction. We predict that developmental models whose progressions do not proceed in an orderly manner from one scale to a neighboring scale ought to show poor performance. To test this prediction, two additional models were created and tested on the solid object data set. These models had developmental stages based
[Figure 8 appears here: bar plot of RMSE (roughly 0.6 to 1.4) on the solid object data set for models C2M, F2M, C-CF-CMF, and F-CF-CMF.]
Figure 8: RMSE of models C2M, F2M, C-CF-CMF, and F-CF-CMF on the test items from the solid object data set.
on spatial frequency content (like models C2M and F2M), but the addition of new frequency bands to their inputs at each stage did not proceed in an orderly manner. The first new model is referred to as model C-CF-CMF. It received the outputs of complex cells tuned to a low spatial frequency early in training. In stage 2, it received the outputs of complex cells tuned to low and high frequencies, and it received the outputs of cells tuned to low, medium, and high frequencies in stage 3. The second new model, referred to as model F-CF-CMF, was similar to model C-CF-CMF, but it started with high-frequency information. That is, it received the outputs of complex cells tuned to a high spatial frequency early in training. It received the outputs of cells tuned to low and high frequencies in stage 2, and the outputs of all complex cells in stage 3.
Figure 8 shows the performances of models C2M, F2M, C-CF-CMF, and F-CF-CMF at the end of training on the solid object data set. In accord with our prediction, models C-CF-CMF and F-CF-CMF showed very poor performance. These data are consistent with the conjecture that it is advantageous to a developmental system for the spatial scale of its inputs to progress in an orderly fashion from one scale to a neighboring scale.
4 Summary and Conclusion
This article has considered the hypothesis that systems learning aspects of visual perception may benefit from the use of suitably designed developmental progressions during training. We reported the results of simulations in which different systems were trained to detect binocular disparities in pairs of visual images. Three of the systems were developmental models in the sense that the nature of their input changed during the course of training. These systems received a relatively impoverished visual input early in training, and the quality of this input improved as training progressed. The fourth system was a nondevelopmental model; the nature of its input remained constant during the course of training. The results show that the developmental models whose stages were based on spatial frequency content, models C2M and F2M, consistently outperformed the nondevelopmental and random-developmental models. We speculate that the superior performance of these models is due to two important features of their developmental progressions: (1) these models were exposed to visual inputs at a single scale early in training, and (2) the spatial scale of their inputs progressed in an orderly fashion from one scale to a neighboring scale. Simulation results consistent with these speculations were presented.
We conclude that suitably designed developmental sequences can be useful to systems learning to detect binocular disparities, in accord with the less-is-more view of development. Moreover, the idea that visual development can aid visual learning is a viable hypothesis in need of study. With relatively few exceptions, the relationship between development and learning has been ignored by the neural computation community. We believe that this is unfortunate. It is well known that systems learn best when they are suitably constrained through the use of domain knowledge. Learning systems are inherently faced with the bias-variance dilemma (Geman, Bienenstock, & Doursat, 1995). Systems with little or no bias are often capable of learning many different sets of training items.
Unfortunately, they tend to interpolate in unpredictable ways and thus generalize poorly to novel data items. In contrast, systems that are constrained through the use of domain knowledge and thus have large bias are not able to learn as wide a variety of training sets. However, they tend to show better generalization performance and less variable generalization performance when exposed to those training sets that they can adequately learn. The design of appropriate developmental progressions through the use of domain knowledge provides researchers with an effective means of biasing their learning systems so as to enhance their performances.
Acknowledgments
This work was supported by NSF graduate fellowship DGE9616170 and by NIH research grant R01-EY13149.
References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2, 284–299.
Atkinson, J., & Braddick, O. (1976). Stereoscopic discrimination in infants. Perception, 5, 29–38.
Barnard, S. T. (1987). Stereo matching by hierarchical, microcanonical annealing (Tech. Rep. 414). Menlo Park, CA: Artificial Intelligence Center, SRI International.
Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7, 108–116.
Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 43, 71–99.
Fleet, D. J., Wagner, H., & Heeger, D. J. (1996). Neural encoding of binocular disparity: Energy models, position shifts, and phase shifts. Vision Research, 36, 1839–1857.
Fox, R., Aslin, R. N., Shea, S. L., & Dumais, S. T. (1980). Stereopsis in human infants. Science, 207, 323–324.
Geman, S., Bienenstock, E., & Doursat, R. (1995). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Gray, M. S., Pouget, A., Zemel, R. S., Nowlan, S. J., & Sejnowski, T. J. (1998). Reliable disparity estimation through selective integration. Visual Neuroscience, 15, 511–528.
Greenough, W. T., Black, J. E., & Wallace, C. S. (1987). Experience and brain development. Child Development, 58, 539–559.
Harwerth, R. S., Smith, E. L., Duncan, G. C., Crawford, M. L. J., & von Noorden, G. K. (1986). Multiple sensitive periods in the development of the primate visual system. Science, 232, 235–238.
Held, R., Birch, E., & Gwiazda, J. (1980). Stereoacuity in human infants. Proceedings of the National Academy of Sciences USA, 77, 5572–5574.
Mallot, H. A., Gillner, S., & Arndt, P. A. (1996). Is correspondence search in human stereo vision a coarse-to-fine process? Biological Cybernetics, 74, 95–106.
Marr, D., & Poggio, T. (1979). A computational theory of human stereo vision. Proceedings of the Royal Society of London B, 204, 301–328.
Matsuoka, K. (1992). Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics, 22, 436–440.
McKee, S. P., & Mitchison, G. J. (1988). The role of retinal correspondence in stereoscopic matching. Vision Research, 28, 1001–1012.
Mowforth, P., Mayhew, J. E. W., & Frisby, J. P. (1981). Vergence eye movements made in response to spatial-frequency-filtered random-dot stereograms. Perception, 10, 299–304.
Newport, E. L. (1990). Maturational constraints on language learning. Cognitive Science, 14, 11–28.
Norcia, A., & Tyler, C. (1985). Spatial frequency sweep VEP: Visual acuity during the first year of life. Vision Research, 25, 1399–1408.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041.
Petrig, B., Julesz, B., Kropfl, W., Baumgartner, G., & Anliker, M. (1981). Development of stereopsis and cortical binocularity in human infants: Electrophysiological evidence. Science, 213, 1402–1405.
Piaget, J. (1952). The origins of intelligence in children. New York: International Universities Press.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Computation, 6, 390–404.
Quam, L. H. (1986). Hierarchical warp stereo (Tech. Rep. 402). Menlo Park, CA: Artificial Intelligence Center, SRI International.
Rohde, D. L. T., & Plaut, D. C. (1999). Language acquisition in the absence of explicit negative evidence: How important is starting small? Cognition, 72, 67–109.
Shatz, C. J. (1996). Emergence of order in visual system development. Proceedings of the National Academy of Sciences USA, 93, 602–608.
Sietsma, J., & Dow, R. J. F. (1991). Creating artificial neural networks that generalize. Neural Networks, 4, 67–79.
Smallman, H. S. (1995). Fine-to-coarse scale disambiguation in stereopsis. Vision Research, 34, 2971–2982.
Turkewitz, G., & Kenney, P. A. (1982). Limitations on input as a basis for neural organization and perceptual development: A preliminary statement. Developmental Psychobiology, 15, 357–368.
Webb, A. R. (1994). Functional approximation by feed-forward networks: A least-squares approach to generalization. IEEE Transactions on Neural Networks, 5, 363–371.
Zhu, Y., & Qian, N. (1996). Binocular receptive field profiles, disparity tuning, and characteristic disparity. Neural Computation, 8, 1611–1641.

Received November 14, 2001; accepted June 4, 2002.
LETTER
Communicated by Terrence Sejnowski
A Quantied Sensitivity Measure for Multilayer Perceptron to Input Perturbation Xiaoqin Zeng
[email protected]
Daniel S. Yeung
[email protected]
Department of Computing, Hong Kong Polytechnic University, Hong Kong

The sensitivity of a neural network's output to its input perturbation is an important issue with both theoretical and practical value. In this article, we propose an approach to quantify the sensitivity of the most popular and general feedforward network: the multilayer perceptron (MLP). The sensitivity measure is defined as the mathematical expectation of the output deviation due to an expected input deviation with respect to overall input patterns in a continuous interval. Based on the structural characteristics of the MLP, a bottom-up approach is adopted. A single neuron is considered first, and algorithms with approximately derived analytical expressions that are functions of the expected input deviation are given for the computation of its sensitivity. Then another algorithm is given to compute the sensitivity of the entire MLP network. Computer simulations are used to verify the derived theoretical formulas. The agreement between theoretical and experimental results is quite good. The sensitivity measure can be used to evaluate the MLP's performance.

1 Introduction
Theoretically, an articial neural network is aimed at realizing a mapping between its inputs and outputs, so the sensitivity of a neural network’s output to its input perturbation is an important measure for evaluating its performance, such as error tolerance and generalization ability. Obviously, a quantied sensitivity measure will have both theoretical and practical value. How to dene and quantify an appropriate sensitivity for neural networks is still a problem in need of exploration. In the past decade, a variety of studies on the sensitivity of neural networks have emerged. For feedforward neural networks, there are two kinds of sensitivity that differ in the target function being studied: the sensitivity with respect to an objective (error) function and the sensitivity with respect to the neural networks’ output function (Engelbrecht, Fletcher, & Cloete, 1999). These two sensitivity measures seem in general to be different. The sensitivity of the error function is closely related to the process of training Neural Computation 15, 183–212 (2003)
° c 2002 Massachusetts Institute of Technology
184
X. Zeng and D. Yeung
and thus would be useful for training related issues, whereas the output sensitivity has been found useful in a number of problems both in and after training. This article focuses on the sensitivity with respect to the neural networks’ output function. Stevenson, Winter, and Widrow (1990) made use of a hypersphere as the mathematical model in analyzing the sensitivity of Adalines and Madalines. They dened sensitivity as the probability of erroneous output of Madalines under the assumption that input and weight perturbations are small and the number of Adalines per layer is sufciently large. Alippi, Piuri, and Sami (1995) generalized the hypersphere model by also considering the sensitivity as the probability of erroneous output of the feedforward multilayered neural networks with multiple-step symmetric activation functions. But their results are applicable only to networks with discrete neuron inputs and outputs and thus are not suitable for multilayer perceptrons (MLPs) with continuous inputs and outputs. Cheng and Yeung (1999) again used a geometrical technique to investigate the sensitivity of neocognitrons by modifying the hypersphere model. Unfortunately, the hypersphere model they used fails in the case of the MLP because its input vectors generally do not span a hypersphere surface or volume. PichÂe (1995) employed a statistical rather than a geometrical argument to analyze the effects of weight errors in Madalines. He made the assumption that inputs and weights as well as their errors are all independently and identically distributed with mean zero. Based on such a statistical model and under the condition that both input and weight errors are small enough, PichÂe derived an analytical expression for the sensitivity as the ratio of the variance of the output error to the variance of the output. 
Recently, Yeung and Sun (2002) generalized Piché's method by deriving a universal expression of the MLP's sensitivity for antisymmetric squashing activation functions, without any restriction on the magnitude of input and weight perturbations. However, this approach is applicable only to an ensemble of networks, not to any individual one, because the assumptions of the statistical model are too strong.

There are basically two approaches to studying the sensitivity of the MLP. The analytical approach defines the sensitivity as a partial derivative of the output with respect to the input. For example, Hashem (1992) and Fu and Chen (1993) computed the partial derivative one layer at a time, starting from the output layer and proceeding backward (using backward chaining) toward the input layer. But this approach is applicable only when the input deviation tends to zero. The other approach is statistically based. For example, Choi and Choi (1992) introduced a statistical sensitivity measure for the MLP with differentiable activation functions. The sensitivity is defined as the ratio of the standard deviation of the output errors to the standard deviation of the weight or input error under the condition that the latter tends to zero. They derived a formula for the sensitivity of a single-output MLP under a given input pattern. However, these two approaches are useful only to measure the output deviation with respect to a given input pattern. They are not
MLP’s Sensitivity to Input Perturbation
185
able to measure the expected output deviation with respect to the overall input patterns, an important factor affecting the performance measurement of networks like MLPs.

For our study, instead of using a hypersphere as described in Stevenson et al. (1990), Alippi et al. (1995), and Cheng and Yeung (1999), we employed a hyperrectangle as the mathematical model to represent an MLP's input space, because the input usually falls into a continuous interval. Our approach is different from those of the other studies and offers certain advantages over them. For example, it does not demand that the input deviation be very small, as the derivative approach requires, and the mathematical expectation used in our approach reflects the network's output deviation more directly and exactly than the variance does, for which some assumptions must be made, as in Choi and Choi (1992), Piché (1995), and Yeung and Sun (2002). Moreover, our result is applicable to an MLP that deals with infinite input patterns, which is an advantage of the MLP over other discrete feedforward networks like Madalines. In Zeng and Yeung (2001), the sensitivity for an ensemble of MLPs due to input and weight perturbations was investigated, but this article is mainly concerned with an individual MLP with fixed weights and thus a fixed network architecture. We note that the sensitivity analysis of an ensemble of MLPs with respect to weight perturbations provides some useful design guidelines prior to the hardware implementation of an MLP, whereas the sensitivity measure of a trained MLP to input perturbations is directly related to its application performance.

The arrangement of this article is as follows. The MLP model is briefly described in section 2. The definitions of sensitivity are given in section 3 for different granules of the MLP, from neuron to layer and finally to the entire network.
Sections 4 and 5 deal with the computation of the sensitivity, in which algorithms based on some analytical expressions are given for the computation of the sensitivity of a single neuron and the sensitivity of the entire MLP. Simulation results that support the theoretical results are presented in section 6. Section 7 concludes the article.
2 The Multilayer Perceptron Model

2.1 Notation. Generally, an MLP can have L layers, and each layer l (1 ≤ l ≤ L) has n_l (n_l ≥ 1) neurons. The form n_0-n_1-···-n_L is used to represent an MLP with a certain structural configuration, in which each n_l (0 ≤ l ≤ L) not only indicates the number of neurons (n_0 is an exception, which indicates the dimension of the input) but also stands for a layer from left to right, including the input layer: n_0 stands for the input layer and n_L for the output layer. Since the number of neurons in layer l − 1 is equal to the output dimension of that layer, which is also equal to the input dimension of layer l, the input dimension of layer l is n_{l−1}. For a neuron i (1 ≤ i ≤ n_l)
in layer l, its input vector is X^l = (x_1^l, …, x_{n_{l−1}}^l)^T, its weight vector is W_i^l = (w_{i1}^l, …, w_{i n_{l−1}}^l)^T, and its output is y_i^l = f(X^l · W_i^l), where f(·) is an activation function having the sum of weighted inputs as its variable. For each layer l, all of its neurons have the same input vector X^l. Its weight set is W^l = {W_1^l, …, W_{n_l}^l}, and its output vector is Y^l = (y_1^l, …, y_{n_l}^l)^T. For the entire MLP, its input is the vector X^1 (or Y^0), its weight set is W = W^1 ∪ ··· ∪ W^L, and its output is Y^L. Let ΔX^l = (Δx_1^l, …, Δx_{n_{l−1}}^l)^T and ΔY^l = (Δy_1^l, …, Δy_{n_l}^l)^T be the corresponding deviations of the input and output, respectively.

2.2 Structure. Neurons, as cells, in an MLP are organized into layers by linking them together. Links exist only between neurons of two adjacent layers; there is no link between neurons in the same layer or in any two nonadjacent layers. All neurons in a layer are fully linked from all the neurons in the immediately preceding layer and are fully linked to all the neurons in the immediately succeeding layer. At each layer except the input layer, the inputs of each neuron are the outputs of the neurons in the previous layer, X^l = Y^{l−1} (l ≥ 1). The commonly used logistic function is chosen as the activation function, that is,
\[
f(x) = \frac{1}{1 + e^{-x}}, \tag{2.1}
\]

with the following characteristics:

\[
0 \le f(x) \le 1, \tag{2.2}
\]
\[
f(x) + f(-x) = 1, \tag{2.3}
\]
\[
f(x') < f(x'') \quad \text{if } x' < x''. \tag{2.4}
\]
3 Denition of the Sensitivity
The main issue addressed by the sensitivity study is what the effects of input deviation will be on the MLP network's output. For the case of a single neuron, the most direct and natural way to express the output deviation arising from input deviation is the difference of its deviated and nondeviated output,

\[
\Delta y_i^l = g(X^l, \Delta X^l, W_i^l) = f\big((X^l + \Delta X^l) \cdot W_i^l\big) - f\big(X^l \cdot W_i^l\big). \tag{3.1}
\]
Since W_i^l is fixed, Δy_i^l can be easily computed if the values of X^l and ΔX^l are known. But in most cases, especially in the performance evaluation of a network, it would not be appropriate to compute Δy_i^l for a specific input. Rather, it would be more desirable to find the relationship between
the output deviation Δy_i^l and its cause, the input deviation ΔX^l, in an ensemble sense by taking overall inputs into consideration. It is obvious that equation 3.1 reflects the relationship between Δy_i^l and ΔX^l. The problem is to define a sensitivity that keeps this relationship without reference to X^l. Our approach is to consider the mathematical expectation of Δy_i^l with respect to all possible values of X^l, that is, to treat X^l as a statistical variable. Furthermore, since the input deviation is usually randomly oriented, it is proper to replace ΔX^l in equation 3.1 by its expected absolute value, denoted as ΔX^l = (Δx_1^l, Δx_2^l, …, Δx_{n_{l−1}}^l) = (E(|Δx_1^l|), E(|Δx_2^l|), …, E(|Δx_{n_{l−1}}^l|)). The following definitions use a bottom-up way to define the sensitivity of different granules in the MLP: first the sensitivity of a single neuron, then the sensitivity of a layer, and finally the sensitivity of the entire MLP.

Definition 1. For any input vector X^l ∈ [a_1, b_1] × [a_2, b_2] × ··· × [a_{n_{l−1}}, b_{n_{l−1}}] ⊆ [0, 1]^{n_{l−1}}, its corresponding expected deviation ΔX^l, and a given weight vector W_i^l ∈ R^{n_{l−1}}, the sensitivity of neuron i in layer l is defined as the mathematical expectation of the absolute value of Δy_i^l with respect to X^l, which is expressed as

\[
s_i^l = E\big(|g(X^l)|\big) = E\big(\big|f((X^l + \Delta X^l) \cdot W_i^l) - f(X^l \cdot W_i^l)\big|\big). \tag{3.2}
\]
Obviously, s_i^l is a function of ΔX^l, independent of any specific X^l. Because of equation 2.2, the interval of the input of all layers except the input layer is [0, 1]^{n_{l−1}} (l ≥ 2), and one can normalize the input of the first layer to be in [0, 1]^{n_0}. So the assumption X^l ∈ [a_1, b_1] × [a_2, b_2] × ··· × [a_{n_{l−1}}, b_{n_{l−1}}] ⊆ [0, 1]^{n_{l−1}} in the sensitivity definition is practically acceptable.

Definition 2. For any input vector X^l ∈ [a_1, b_1] × [a_2, b_2] × ··· × [a_{n_{l−1}}, b_{n_{l−1}}] ⊆ [0, 1]^{n_{l−1}}, its corresponding expected deviation ΔX^l, and a given weight vector set W^l = {W_i^l | W_i^l ∈ R^{n_{l−1}}; 1 ≤ i ≤ n_l}, the sensitivity of layer l is defined as

\[
S^l = (s_1^l, \ldots, s_{n_l}^l)^T. \tag{3.3}
\]
ΔX^l in fact can be regarded as the sensitivity of the previous layer; that is, ΔX^l = S^0 = (E(|Δx_1^1|), E(|Δx_2^1|), …, E(|Δx_{n_0}^1|)) when l = 1, and ΔX^l = S^{l−1} when l > 1. It is noted that S^l is a vector, which consists of the sensitivities of all the neurons in layer l. Since sensitivity is generally agreed to be a scalar rather than a vector, one may choose, for instance, norm(S^l), max(S^l), avg(S^l), or min(S^l) as the representation of a layer's sensitivity, depending on the nature of the application involved. One popular strategy is to choose the one with a smaller value if the application requires a higher precision.
Definition 3. The sensitivity of an MLP is the sensitivity of its output layer, that is,

\[
S = S^L = (s_1^L, \ldots, s_{n_L}^L)^T. \tag{3.4}
\]
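Definitions 1 through 3 suggest a simple bottom-up computation: estimate each neuron's sensitivity from the expected deviation of its layer's input, then feed the resulting vector S^l forward as ΔX^{l+1}. A minimal sketch of this chaining for a toy 2-2-1 network follows (our illustration; the weights are hypothetical, Monte Carlo sampling stands in for the article's analytical integrals, and hidden-layer inputs are simply assumed uniform on [0, 1]^n, a simplification of the [a, b] ⊆ [0, 1] intervals in the definitions):

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron_s(w, dx, n_samples=50_000, seed=0):
    """Monte Carlo estimate of one neuron's sensitivity (equation 3.2),
    with the layer's input assumed uniform on [0, 1]^n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = [rng.random() for _ in w]
        u = sum(xj * wj for xj, wj in zip(x, w))
        v = sum((xj + dxj) * wj for xj, dxj, wj in zip(x, dx, w))
        total += abs(logistic(v) - logistic(u))
    return total / n_samples

def mlp_sensitivity(weights, dx0):
    """Definitions 2 and 3: compute S^l neuron by neuron and feed it
    forward as the expected input deviation of layer l + 1; the result
    for the last layer is S = S^L."""
    dx = dx0
    for layer in weights:          # each layer is a list of weight vectors
        dx = [neuron_s(w, dx) for w in layer]
    return dx

# Toy 2-2-1 MLP with hypothetical weights and a 5% expected input deviation.
w = [[[0.7, -1.2], [1.1, 0.4]],    # hidden layer: two neurons
     [[1.3, -0.9]]]                # output layer: one neuron
print(mlp_sensitivity(w, [0.05, 0.05]))
```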
4 Computation of the Sensitivity for a Single Neuron
In this section, we derive formulas for the computation of equation 3.2, that is, the sensitivity of a single neuron. For simplicity, the superscript that marks the neuron's layer and the subscript that marks the neuron's order in the layer are omitted, because only single neurons are considered and their positions in the network are irrelevant. Hence, s, X, ΔX, and W are used in place of s_i^l, X^l, ΔX^l, and W_i^l in this and the next section. Clearly, equation 3.2 can be expressed as

\[
s = \int \cdots \int_V Q(X)\, \big| f((X + \Delta X) \cdot W) - f(X \cdot W) \big| \, dV, \tag{4.1}
\]

where Q(X) is the density function and V is the region [a_1, b_1] × ··· × [a_n, b_n] spanned by X. By assuming that the components of X are independent of each other, and thus that each can be regarded as a coordinate axis, equation 4.1 can be further expressed as

\[
s = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} Q(X)\, \big| f((X + \Delta X) \cdot W) - f(X \cdot W) \big| \, dx_1\, dx_2 \cdots dx_n. \tag{4.2}
\]
Furthermore, by equation 2.4, namely, f(x) being a monotonically increasing function, and the fact that Q(X) ≥ 0, the absolute value sign can be moved outside the integrals. Thus,

\[
s =
\begin{cases}
\displaystyle \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} Q(X) f((X + \Delta X) \cdot W)\, dx_1 \cdots dx_n - \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} Q(X) f(X \cdot W)\, dx_1 \cdots dx_n, & \text{if } \Delta X \cdot W \ge 0, \\[2ex]
\displaystyle \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} Q(X) f(X \cdot W)\, dx_1 \cdots dx_n - \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} Q(X) f((X + \Delta X) \cdot W)\, dx_1 \cdots dx_n, & \text{if } \Delta X \cdot W < 0,
\end{cases}
\]

that is,

\[
s = \left| \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} Q(X) f((X + \Delta X) \cdot W)\, dx_1 \cdots dx_n - \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} Q(X) f(X \cdot W)\, dx_1 \cdots dx_n \right|. \tag{4.3}
\]
Now, replacing the inner product by a summation and f(·) by the logistic function, and letting s = Σ_{j=1}^{n} Δx_j w_j, equation 4.3 becomes

\[
s = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} Q(X)\, f\Big(\textstyle\sum_{j=1}^{n} (x_j + \Delta x_j) w_j\Big)\, dx_1\, dx_2 \cdots dx_n - \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} Q(X)\, f\Big(\textstyle\sum_{j=1}^{n} x_j w_j\Big)\, dx_1\, dx_2 \cdots dx_n
\]
\[
= \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} Q(X) \Big(1 + e^{-\sum_{j=1}^{n} x_j w_j - s}\Big)^{-1} dx_1\, dx_2 \cdots dx_n - \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} Q(X) \Big(1 + e^{-\sum_{j=1}^{n} x_j w_j}\Big)^{-1} dx_1\, dx_2 \cdots dx_n. \tag{4.4}
\]
Therefore, the computation of the sensitivity is essentially reduced to the calculation of the following integral,

\[
I(A, B, W, s) = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} Q(X) \Big(1 + e^{-\sum_{j=1}^{n} x_j w_j - s}\Big)^{-1} dx_1\, dx_2 \cdots dx_n, \tag{4.5}
\]

where A = {a_1, a_2, …, a_n} and B = {b_1, b_2, …, b_n} are the sets of lower and upper bounds of the integral, and W = {w_1, w_2, …, w_n} is the set of weights, with b_j > a_j and w_j ≠ 0 for 1 ≤ j ≤ n. The following algorithm can be used to compute the sensitivity for a single neuron:

Algorithm: NEURON SENSITIVITY(A, B, Q(X), ΔX, W)

1. s = ΔX · W;
2. I(A, B, W, s) ≈ ∫_{a_1}^{b_1} ∫_{a_2}^{b_2} ··· ∫_{a_n}^{b_n} Q(X) (1 + e^{−Σ_{j=1}^{n} x_j w_j − s})^{−1} dx_1 dx_2 ··· dx_n;
3. I(A, B, W, 0) ≈ ∫_{a_1}^{b_1} ∫_{a_2}^{b_2} ··· ∫_{a_n}^{b_n} Q(X) (1 + e^{−Σ_{j=1}^{n} x_j w_j})^{−1} dx_1 dx_2 ··· dx_n;
4. s = |I(A, B, W, s) − I(A, B, W, 0)| is the sensitivity for a single neuron.

The following sections present the derivation of an approximate solution for equation 4.5 under the assumption that Q(X) = 1 (X being uniformly distributed in the hyperrectangle), and an algorithm for its computation.

4.1 Approximate Solution for Integration of the Logistic Function.
Since no primitive function can be found for the integrand in equation 4.5, the Taylor series,

\[
\frac{1}{1 + x} \approx \sum_{k=0}^{p} (-1)^k x^k, \tag{4.6}
\]

under the convergence condition |x| < 1, is employed to approximate the integrand. Let i(X) denote the integrand and i^p(X) denote its Taylor expansion of p terms. By substituting e^{−Σ_{j=1}^{n} x_j w_j − s} for x in equation 4.6, the integrand can be approximately expressed as

\[
i(X) \approx i^p(X) = \sum_{k=0}^{p} (-1)^k e^{-k \sum_{j=1}^{n} x_j w_j - ks}, \tag{4.7}
\]
where |e^{−Σ_{j=1}^{n} x_j w_j − s}| < 1, namely Σ_{j=1}^{n} x_j w_j + s > 0, is required. But how should i(X) be handled when Σ_{j=1}^{n} x_j w_j + s ≤ 0? It is quite clear that when X satisfies Σ_{j=1}^{n} x_j w_j + s < 0, i^p(X) will certainly be divergent. Fortunately, the logistic function's characteristic in expression 2.3, f(−x) = 1 − f(x), can be used to solve this problem. When X satisfies Σ_{j=1}^{n} x_j w_j + s = 0, i^p(X) oscillates between 0 and 1; that is, i^p(X) = 0 when p is an odd number, say p_1, and i^p(X) = 1 when p is an even number, say p_2, whereas at such an X the integrand actually has i(X) = 1/2. Thus, it is necessary to amend the imprecision arising from X satisfying Σ_{j=1}^{n} x_j w_j + s = 0 in V. This can be done by calculating equation 4.5 twice, once with p_1 and once with p_2, and then taking the average of the two
results. Now equation 4.5 can be rewritten as

\[
I \approx \int \cdots \int_V i^p(X)\, dV =
\begin{cases}
\displaystyle \int \cdots \int_V \sum_{k=0}^{p} (-1)^k e^{-k \sum_{j=1}^{n} x_j w_j - ks}\, dV, & \text{if } \sum_{j=1}^{n} x_j w_j + s > 0, \quad \text{(4.8a)} \\[2ex]
\displaystyle \frac{1}{2} \left( \int \cdots \int_V i^{p_1}(X)\, dV + \int \cdots \int_V i^{p_2}(X)\, dV \right), & \text{if } \sum_{j=1}^{n} x_j w_j + s = 0, \quad \text{(4.8b)} \\[2ex]
\displaystyle \int \cdots \int_V \left[ 1 - \sum_{k=0}^{p} (-1)^k e^{k \sum_{j=1}^{n} x_j w_j + ks} \right] dV, & \text{if } \sum_{j=1}^{n} x_j w_j + s < 0. \quad \text{(4.8c)}
\end{cases}
\]
From a geometric point of view, the hyperplane P that is defined by Σ_{j=1}^{n} x_j w_j + s = 0 can be regarded as a dividing plane. It divides the n-dimensional space into three parts: the points on P, the points below P, and the points above P. The points on P satisfy Σ_{j=1}^{n} x_j w_j + s = 0, and each of the other two parts satisfies either Σ_{j=1}^{n} x_j w_j + s > 0 or Σ_{j=1}^{n} x_j w_j + s < 0. They are two opposite cases: if one satisfies Σ_{j=1}^{n} x_j w_j + s > 0, the other must satisfy Σ_{j=1}^{n} x_j w_j + s < 0, and vice versa. Now the problem is how to find out whether P cuts V, and if it does, how equations 4.8a and 4.8c can be computed. A solution to this problem is proposed below. The idea is to locate the intersection points of P and those parallel lines that are the extensions of the edges of V. Each line must have one and only one intersection point with P under the assumption w_j ≠ 0 for 1 ≤ j ≤ n. Then, by comparing the coordinates on a given axis of those intersection points with the lower bound and the upper bound of V on that axis, it is easy to identify the intersection situations between P and V. It is known that V has 2^{n−1} edges parallel to a given coordinate axis. Take the x_n-axis as an example. Each edge parallel to the x_n-axis can be determined by assigning the n − 1 coordinates (x_1, …, x_{n−1}) with either x_j = a_j or x_j = b_j. Thus, the x_n-coordinate for a
X. Zeng and D. Yeung
given line's intersection point can be calculated by

$$x_n = -\left( \sum_{j=1}^{n-1} x_j w_j + s \right) \Big/ w_n \tag{4.9}$$
with its $n-1$ fixed coordinates. This will yield a set $E_n = \{e_{n,1}, e_{n,2}, \ldots, e_{n,2^{n-1}}\}$ representing the $2^{n-1}$ edges. Note that during the calculation of $E_n$, an order is set to select the $2^{n-1}$ edges, which are represented by the second subscript of the elements of $E_n$. This means that the second subscript uniquely corresponds to an edge and thus to its $n-1$ fixed coordinates, denoted as $(x_1, \ldots, x_{n-1})$ or $\{x_1, \ldots, x_{n-1}\}$ for $e_{n,i}$ ($1 \le i \le 2^{n-1}$). This property will be used later. Let $e_{n,s} = \min\{e_{n,1}, e_{n,2}, \ldots, e_{n,2^{n-1}}\}$ and $e_{n,l} = \max\{e_{n,1}, e_{n,2}, \ldots, e_{n,2^{n-1}}\}$, and compare the four values $e_{n,s}$, $e_{n,l}$, $a_n$, and $b_n$. In the following, we present five possible cases together with their corresponding formulas for calculating equation 4.8. Only three-dimensional figures are used for the illustration.

4.1.1 Case 1: $e_{n,s} \ge b_n$ or $e_{n,l} \le a_n$. In this case, as illustrated by Figure 1, P does not cut V. Hence, in V, only one convergent condition of equation 4.8 is satisfied. If $\sum_{j=1}^{n} x_j w_j + s > 0$, then

$$
\begin{aligned}
I &\approx \int_V \cdots \int \sum_{k=0}^{p} (-1)^k e^{-k \sum_{j=1}^{n} x_j w_j - ks}\, dV\\
&= \sum_{k=0}^{p} (-1)^k \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} e^{-k \sum_{j=1}^{n} x_j w_j - ks}\, dx_1\, dx_2 \cdots dx_n\\
&= \sum_{k=0}^{p} (-1)^k I_1(-k, s, A, B, W).
\end{aligned} \tag{4.10}
$$

Otherwise, $\sum_{j=1}^{n} x_j w_j + s < 0$, and it follows that

$$I \approx \sum_{k=1}^{p} (-1)^{k+1} I_1(k, s, A, B, W). \tag{4.11}$$
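Step one of the intersection test, computing the $2^{n-1}$ values $e_{n,i}$ of equation 4.9 and taking their minimum and maximum, can be sketched as follows (an illustrative helper, not the authors' code; the example hyperplane is made up):

```python
from itertools import product

def edge_intersections(a, b, w, s):
    """For each of the 2^(n-1) edges of V = [a1,b1] x ... x [an,bn] parallel
    to the x_n-axis, return the x_n-coordinate where the extended edge meets
    the hyperplane sum_j x_j w_j + s = 0 (equation 4.9)."""
    n = len(w)
    assert all(wj != 0 for wj in w)  # each line then meets P exactly once
    e = []
    for corner in product(*[(a[j], b[j]) for j in range(n - 1)]):
        xn = -(sum(corner[j] * w[j] for j in range(n - 1)) + s) / w[n - 1]
        e.append(xn)
    return e

# Example: unit square, hyperplane x1 + x2 - 3 = 0 lies entirely above V,
# so e_{n,s} >= b_n and case 1 applies (P does not cut V).
e = edge_intersections([0.0, 0.0], [1.0, 1.0], [1.0, 1.0], -3.0)
e_s, e_l = min(e), max(e)
assert e == [3.0, 2.0] and e_s >= 1.0
```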
Figure 1: P does not cut V.

4.1.2 Case 2: $e_{n,s} \ge a_n$ and $e_{n,l} \le b_n$. In this case, as illustrated by Figure 2, P cuts V by the side and divides V into two parts, $V_1$ and $V_2$. Supposing $\sum_{j=1}^{n} x_j w_j + s < 0$ in $V_1$ and $\sum_{j=1}^{n} x_j w_j + s > 0$ in $V_2$, it follows that

$$
\begin{aligned}
I &\approx \int_{V_1} \cdots \int \sum_{k=0}^{p} (-1)^k \left( 1 - e^{k \sum_{j=1}^{n} x_j w_j + ks} \right) dx_1\, dx_2 \cdots dx_n\\
&\quad + \int_{V_2} \cdots \int \sum_{k=0}^{p} (-1)^k e^{-k \sum_{j=1}^{n} x_j w_j - ks}\, dx_1\, dx_2 \cdots dx_n\\
&= \sum_{k=1}^{p} (-1)^{k+1} \int_{V_1} \cdots \int e^{k \sum_{j=1}^{n} x_j w_j + ks}\, dx_1\, dx_2 \cdots dx_n\\
&\quad + \sum_{k=0}^{p} (-1)^k \int_{V_2} \cdots \int e^{-k \sum_{j=1}^{n} x_j w_j - ks}\, dx_1\, dx_2 \cdots dx_n\\
&= \sum_{k=1}^{p} (-1)^{k+1} |I_2(k, s, A, B - \{b_n\}, W)|
 + \sum_{k=0}^{p} (-1)^k |I_2(-k, s, B, A - \{a_n\}, W)|,
\end{aligned} \tag{4.12}
$$

Figure 2: P cuts the four parallel sides of V.
where

$$
\int_{V_1} \cdots \int e^{k \sum_{j=1}^{n} x_j w_j + ks}\, dx_1\, dx_2 \cdots dx_n
= \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_{n-1}}^{b_{n-1}}
  \int_{a_n}^{-\left( \sum_{j=1}^{n-1} x_j w_j + s \right)/w_n}
  e^{k \sum_{j=1}^{n} x_j w_j + ks}\, dx_1\, dx_2 \cdots dx_n, \tag{4.13}
$$

and

$$
\int_{V_2} \cdots \int \sum_{k=0}^{p} (-1)^k e^{-k \sum_{j=1}^{n} x_j w_j - ks}\, dx_1\, dx_2 \cdots dx_n
= \sum_{k=0}^{p} (-1)^k \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_{n-1}}^{b_{n-1}}
  \int_{-\left( \sum_{j=1}^{n-1} x_j w_j + s \right)/w_n}^{b_n}
  e^{-k \sum_{j=1}^{n} x_j w_j - ks}\, dx_1\, dx_2 \cdots dx_n. \tag{4.14}
$$

Similarly, the formula for I with $\sum_{j=1}^{n} x_j w_j + s > 0$ in $V_1$ and $\sum_{j=1}^{n} x_j w_j + s < 0$ in $V_2$ is

$$
I \approx \sum_{k=0}^{p} (-1)^k |I_2(-k, s, A, B - \{b_n\}, W)|
 + \sum_{k=1}^{p} (-1)^{k+1} |I_2(k, s, B, A - \{a_n\}, W)|. \tag{4.15}
$$
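The idea behind equations 4.12 through 4.15, namely using the expansion that converges on each side of P, can be checked numerically with a toy two-dimensional example (ours, not from the paper): pick the expansion by the sign of $\sum_j x_j w_j + s$ at each sample point and compare the averaged result with the exact logistic integrand.

```python
import math, random

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def series(u, p):
    # Use the expansion that converges on each side of the hyperplane
    # (equations 4.8a and 4.8c).
    if u >= 0:
        return sum((-1) ** k * math.exp(-k * u) for k in range(p + 1))
    return sum((-1) ** k * (1 - math.exp(k * u)) for k in range(p + 1))

# V = [0,1]^2 with the hyperplane x1*w1 + x2*w2 + s = 0 cutting V by the side.
w, s, p = (1.0, 1.0), -1.0, 40
rng = random.Random(1)
N = 20_000
exact = approx = 0.0
for _ in range(N):
    x1, x2 = rng.random(), rng.random()
    u = x1 * w[0] + x2 * w[1] + s
    exact += logistic(u)
    approx += series(u, p)
exact /= N
approx /= N
# The piecewise series integrates to (nearly) the same value as the
# exact logistic integrand over V; only a thin band around P contributes
# a truncation error.
assert abs(exact - approx) < 0.05
```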
4.1.3 Case 3: $e_{n,s} = a_n$ and $e_{n,l} > b_n$, or $e_{n,s} < a_n$ and $e_{n,l} = b_n$. In this case, as illustrated by Figure 3, P cuts a prism from V, and there is a special intersection point that is a vertex of V. We assume that the point $C(c_1, c_2, \ldots, c_n)$ is the lower vertex of V on the edge corresponding to $e_{n,s}$ and the point $D(d_1, d_2, \ldots, d_n)$ is the upper vertex of V on the edge corresponding to $e_{n,l}$. Then either C or D is the special intersection point, and it must be true that $c_n = a_n$ and $d_n = b_n$. The other $n-1$ coordinates of C and D can be determined by $e_{n,s}$ and $e_{n,l}$, as mentioned above. The coordinates of these two points are critical in setting the boundary for the integral in calculating equation 4.8. For simplicity of expression, they are also regarded as two sets: $C = \{c_1, \ldots, c_n\}$ and $D = \{d_1, \ldots, d_n\}$. Referring to the results in section 4.2, three regions $V_1$, $V_2$, and $V_3$ are used to replace V, as indicated in Figures 4, 5, and 6. It is obvious that $V_1$ is suitable for equation 4.24 and $V_2$ for 4.23. But $V_3$, as Figure 7 illustrates, needs to be extended with $2^{n-1} - 2$ small prisms in order to be suitable for equation 4.24. In fact, this situation may be true for $V_1$ too. In order to give a general solution, equation 4.24 is first used to calculate the integral by regarding $V_1$ and $V_3$ as prisms, and then the results are adjusted by subtracting the integral in each small prism under the condition that the integrand expansion in that small prism is convergent. The integrand expansion in each of the small prisms will diverge if $V_1$ and $V_3$ do not need such a prism for the extension. Thus, according to whether $c_n = e_{n,s}$ or $d_n = e_{n,l}$, and the satisfaction of the convergent conditions of equations 4.8a and 4.8c in $V_1$, $V_2$, and $V_3$, four groups of formulas for different conditions can be derived. Below, the derivation for only one group is given: $c_n = e_{n,s}$, $\sum_{j=1}^{n} x_j w_j + s > 0$ in $V_1$, and $\sum_{j=1}^{n} x_j w_j + s < 0$ in $V_2$ and $V_3$. Those for the other three groups can be found in Zeng (2001).

Figure 3: P cuts a prism from V.
Figure 4: The region of V1.
Figure 5: The region of V2.
Figure 6: The region of V3.
Figure 7: The extended region of V3.
$$
\begin{aligned}
I &\approx \int_{V_1} \cdots \int \sum_{k=0}^{p} (-1)^k e^{-k \sum_{j=1}^{n} x_j w_j - ks}\, dx_1\, dx_2 \cdots dx_n\\
&\quad + \int_{V_2} \cdots \int \sum_{k=0}^{p} (-1)^k \left( 1 - e^{k \sum_{j=1}^{n} x_j w_j + ks} \right) dx_1\, dx_2 \cdots dx_n\\
&\quad + \int_{V_3} \cdots \int \sum_{k=0}^{p} (-1)^k \left( 1 - e^{k \sum_{j=1}^{n} x_j w_j + ks} \right) dx_1\, dx_2 \cdots dx_n\\
&= \sum_{k=0}^{p} (-1)^k \int_{V_1} \cdots \int e^{-k \sum_{j=1}^{n} x_j w_j - ks}\, dx_1\, dx_2 \cdots dx_n\\
&\quad + \sum_{k=1}^{p} (-1)^{k+1} \int_{V_2} \cdots \int e^{k \sum_{j=1}^{n} x_j w_j + ks}\, dx_1\, dx_2 \cdots dx_n\\
&\quad + \sum_{k=1}^{p} (-1)^{k+1} \int_{V_3} \cdots \int e^{k \sum_{j=1}^{n} x_j w_j + ks}\, dx_1\, dx_2 \cdots dx_n\\
&= \sum_{k=0}^{p} (-1)^k \left( |I_3(-k, s, (C - \{c_n\}) \cup \{d_n\}, W)| - \sum_{h=1,\, h \ne s,\, h \ne l}^{2^{n-1}} d_h \right)\\
&\quad + \sum_{k=1}^{p} (-1)^{k+1} |I_2(k, s, C, D - \{d_n\}, W)|\\
&\quad + \sum_{k=1}^{p} (-1)^{k+1} \left( |I_3(k, s, D, W)| - \sum_{h=1,\, h \ne s,\, h \ne l}^{2^{n-1}} e_h \right),
\end{aligned} \tag{4.16}
$$
where

$$
\begin{aligned}
\int_{V_1} \cdots \int & e^{-k \sum_{j=1}^{n} x_j w_j - ks}\, dx_1\, dx_2 \cdots dx_n\\
&= \int_{c_1}^{-\left( \sum_{j=2}^{n-1} c_j w_j + d_n w_n + s \right)/w_1}
   \int_{c_2}^{-\left( x_1 w_1 + \sum_{j=3}^{n-1} c_j w_j + d_n w_n + s \right)/w_2} \cdots
   \int_{c_{n-1}}^{-\left( \sum_{j=1}^{n-2} x_j w_j + d_n w_n + s \right)/w_{n-1}}
   \int_{d_n}^{-\left( \sum_{j=1}^{n-1} x_j w_j + s \right)/w_n}\\
&\qquad e^{-k \sum_{j=1}^{n} x_j w_j - ks}\, dx_1\, dx_2 \cdots dx_n
 \; - \sum_{h=1,\, h \ne s,\, h \ne l}^{2^{n-1}} d_h,
\end{aligned} \tag{4.17}
$$

$$
\int_{V_2} \cdots \int \sum_{k=0}^{p} (-1)^k \left( 1 - e^{k \sum_{j=1}^{n} x_j w_j + ks} \right) dx_1\, dx_2 \cdots dx_n
= \sum_{k=1}^{p} (-1)^{k+1} \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_{n-1}}^{b_{n-1}}
  \int_{a_n}^{-\left( \sum_{j=1}^{n-1} x_j w_j + s \right)/w_n}
  e^{k \sum_{j=1}^{n} x_j w_j + ks}\, dx_1\, dx_2 \cdots dx_n, \tag{4.18}
$$

and

$$
\begin{aligned}
\int_{V_3} \cdots \int & e^{k \sum_{j=1}^{n} x_j w_j + ks}\, dx_1\, dx_2 \cdots dx_n\\
&= \int_{-\left( \sum_{j=2}^{n} d_j w_j + s \right)/w_1}^{d_1}
   \int_{-\left( x_1 w_1 + \sum_{j=3}^{n} d_j w_j + s \right)/w_2}^{d_2} \cdots
   \int_{-\left( \sum_{j=1}^{n-2} x_j w_j + d_n w_n + s \right)/w_{n-1}}^{d_{n-1}}
   \int_{-\left( \sum_{j=1}^{n-1} x_j w_j + s \right)/w_n}^{d_n}\\
&\qquad e^{k \sum_{j=1}^{n} x_j w_j + ks}\, dx_1\, dx_2 \cdots dx_n
 \; - \sum_{h=1,\, h \ne s,\, h \ne l}^{2^{n-1}} e_h.
\end{aligned} \tag{4.19}
$$

Further, $d_h$ denotes the integral of $e^{-k \sum_{j=1}^{n} x_j w_j - ks}$ over the h-th small prism:

$$
d_h = \begin{cases}
|I_3(-k, s, (C - \{c_h, c_n\}) \cup \{d_h, d_n\}, W)|, & \text{if it is convergent for large } p,\\
0, & \text{if it is divergent for large } p,
\end{cases} \tag{4.20}
$$

and $e_h$ denotes the integral of $e^{k \sum_{j=1}^{n} x_j w_j + ks}$ over the h-th small prism:

$$
e_h = \begin{cases}
|I_3(k, s, (D - \{d_h\}) \cup \{c_h\}, W)|, & \text{if it is convergent for large } p,\\
0, & \text{if it is divergent for large } p.
\end{cases} \tag{4.21}
$$

Figure 8: V is divided into two smaller rectangles.
Figure 9: V is extended to a bigger rectangle with two extra rectangles.
In a similar way, the other three groups of formulas can be derived.

4.1.4 Case 4: $e_{n,s} > a_n$ and $e_{n,l} > b_n$, or $e_{n,s} < a_n$ and $e_{n,l} < b_n$. In this case, P cuts V as Figure 8 illustrates. It can be reduced to cases 1 and 3 by cutting V with the hyperplane $x_n = e_{n,s}$ or $x_n = e_{n,l}$ into two smaller hyperrectangles.

4.1.5 Case 5: $e_{n,s} < a_n$ and $e_{n,l} > b_n$. This case, as Figure 9 illustrates, can be reduced to cases 2 and 3 by considering a bigger hyperrectangle first, as case 2 does, and then dealing with two extra hyperrectangles on the top and bottom, as case 3 does.

4.2 Three Integral Formulas Used in the Derivations. Below are three integration formulas that are used in the derivations of equations 4.10 to
4.21. These formulas can be established by mathematical induction on n. It is assumed that k is an integer and s is a real number. These formulas all have the same integrand, $e^{k \sum_{j=1}^{n} x_j w_j + ks}$, but different integration regions. Each region represents a closed subregion within the hyperrectangle V.

Integration Formula 1.

$$
I_1(k, s, A, B, W) = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n}
e^{k \sum_{j=1}^{n} x_j w_j + ks}\, dx_1\, dx_2 \cdots dx_n
= \begin{cases}
\displaystyle \prod_{j=1}^{n} (b_j - a_j), & \text{if } k = 0,\\[1ex]
\displaystyle e^{ks} \prod_{j=1}^{n} \left( e^{k b_j w_j} - e^{k a_j w_j} \right) / (k w_j), & \text{if } k \ne 0.
\end{cases} \tag{4.22}
$$

The integration region of this formula is the entire V, which is enclosed by $2n$ mutually perpendicular planes, $x_j = a_j$ and $x_j = b_j$ ($1 \le j \le n$), as Figure 10 illustrates.

Integration Formula 2.

$$
\begin{aligned}
I_2(k, s, A, B - \{b_n\}, W)
&= \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_{n-1}}^{b_{n-1}}
   \int_{a_n}^{-\left( \sum_{j=1}^{n-1} x_j w_j + s \right)/w_n}
   e^{k \sum_{j=1}^{n} x_j w_j + ks}\, dx_1\, dx_2 \cdots dx_n\\
&= \begin{cases}
\displaystyle -\left( \left( \sum_{j=1}^{n-1} w_j (b_j + a_j)/2 \right) + a_n w_n + s \right)
   \left( \prod_{j=1}^{n-1} (b_j - a_j) \right) \Big/ w_n, & \text{if } k = 0,\\[1ex]
\displaystyle \prod_{j=1}^{n-1} (b_j - a_j) \Big/ (k w_n)
 \; - \; e^{k a_n w_n + ks} \prod_{j=1}^{n-1} \left( e^{k b_j w_j} - e^{k a_j w_j} \right)
   \Big/ \left( k^n \prod_{j=1}^{n} w_j \right), & \text{if } k \ne 0.
\end{cases}
\end{aligned} \tag{4.23}
$$

The integration region of this formula is the lower part of V cut by the hyperplane P, $\sum_{j=1}^{n} x_j w_j + s = 0$, through the side parallel to an axis, say the $x_n$-axis, as Figure 11 illustrates. This formula is also suitable for the upper part of V by adjusting the integration limits. Furthermore, if one is interested only in the magnitude of the integral, the condition $b_j > a_j$ can be removed, because the lower limit and the upper limit for each variable in the integral can be arbitrarily exchanged without affecting the absolute value of the integral.
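As a quick sanity check (our addition, not in the original text), the closed form of $I_1$ in equation 4.22 can be compared against a midpoint-rule evaluation of the same integral in two dimensions; the weights and bounds below are arbitrary:

```python
import math

def I1(k, s, a, b, w):
    # Closed form of equation 4.22 for the integral of
    # exp(k * (sum_j x_j w_j + s)) over the hyperrectangle V.
    if k == 0:
        return math.prod(bj - aj for aj, bj in zip(a, b))
    val = math.exp(k * s)
    for aj, bj, wj in zip(a, b, w):
        val *= (math.exp(k * bj * wj) - math.exp(k * aj * wj)) / (k * wj)
    return val

# Midpoint-rule check in n = 2 dimensions.
a, b, w, k, s = [0.0, 0.0], [1.0, 2.0], [1.0, -0.5], 1, 0.3
m = 200
h1, h2 = (b[0] - a[0]) / m, (b[1] - a[1]) / m
num = sum(
    math.exp(k * ((a[0] + (i + 0.5) * h1) * w[0]
                  + (a[1] + (j + 0.5) * h2) * w[1] + s)) * h1 * h2
    for i in range(m) for j in range(m)
)
assert abs(num - I1(k, s, a, b, w)) < 1e-3
```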
Figure 10: The entire V is the integration region.
The previous formula, 4.22, and the next formula, 4.24, also possess this characteristic.

Integration Formula 3.

$$
\begin{aligned}
I_3(k, s, A, W)
&= \int_{a_1}^{-\left( \sum_{j=2}^{n} a_j w_j + s \right)/w_1}
   \int_{a_2}^{-\left( x_1 w_1 + \sum_{j=3}^{n} a_j w_j + s \right)/w_2} \cdots
   \int_{a_{n-1}}^{-\left( \sum_{j=1}^{n-2} x_j w_j + a_n w_n + s \right)/w_{n-1}}
   \int_{a_n}^{-\left( \sum_{j=1}^{n-1} x_j w_j + s \right)/w_n}
   e^{k \sum_{j=1}^{n} x_j w_j + ks}\, dx_1\, dx_2 \cdots dx_n\\
&= \begin{cases}
\displaystyle (-1)^n \left( \sum_{j=1}^{n} a_j w_j + s \right)^{n}
   \Big/ \left( n!\, \prod_{j=1}^{n} w_j \right), & \text{if } k = 0,\\[1ex]
\displaystyle \sum_{l=1}^{n-1} (-1)^{n+1} \left( \sum_{j=1}^{n} a_j w_j + s \right)^{l}
   \Big/ \left( k^{n-l}\, l!\, \prod_{j=1}^{n} w_j \right)
 + (-1)^{n+1} \left( 1 - e^{k \sum_{j=1}^{n} a_j w_j + ks} \right)
   \Big/ \left( k^{n} \prod_{j=1}^{n} w_j \right), & \text{if } k \ne 0.
\end{cases}
\end{aligned} \tag{4.24}
$$

The integration region of this formula is a prism part of V, as Figure 12 illustrates, which is enclosed by P together with the n mutually perpendicular planes $x_j = a_j$ ($1 \le j \le n$). This formula is suitable for any prism part of a
Figure 11: The lower part of V is the integration region.
Figure 12: The prism part of V is the integration region.
hyperrectangle cut by a hyperplane. The hyperplane part in the prism is its base, and the coordinate $(a_1, a_2, \ldots, a_n)$ indicates the vertex not on the base.

4.3 Algorithm for the Computation of the Integral. Based on the above derivations, the following algorithm outlines the computation of equation 4.5 with $Q(X) = 1$:
Algorithm: I(A, B, W, s)

1. Calculate $e_{n,i} = -\left( \sum_{j=1}^{n-1} x_j w_j + s \right)/w_n$ for $1 \le i \le 2^{n-1}$, with either $x_j = a_j$ or $x_j = b_j$;

2. Find $s$ and $l$ satisfying $e_{n,s} = \min\{e_{n,1}, \ldots, e_{n,2^{n-1}}\}$ and $e_{n,l} = \max\{e_{n,1}, \ldots, e_{n,2^{n-1}}\}$;

3. Branch to the following five cases according to the values of $e_{n,s}$, $e_{n,l}$, $a_n$, and $b_n$. In every sum over $h$ below, $h$ runs from 1 to $2^{n-1}$ with $h \ne s$ and $h \ne l$; each $d_h$ or $e_h$ equals the stated $|I_3(\cdot)|$ term when the integrand expansion in the h-th small prism is convergent for large $p$, and 0 when it is divergent (compare equations 4.20 and 4.21).

case 1. $e_{n,s} \ge b_n$ or $e_{n,l} \le a_n$:
If $\sum_{j=1}^{n} x_j w_j + s > 0$ then $I \approx \sum_{k=0}^{p} (-1)^k I_1(-k, s, A, B, W)$;
If $\sum_{j=1}^{n} x_j w_j + s < 0$ then $I \approx \sum_{k=1}^{p} (-1)^{k+1} I_1(k, s, A, B, W)$;

case 2. $e_{n,s} \ge a_n$ and $e_{n,l} \le b_n$:
If $\sum_{j=1}^{n} x_j w_j + s < 0$ in $V_1$ and $\sum_{j=1}^{n} x_j w_j + s > 0$ in $V_2$ then
$I \approx \sum_{k=1}^{p} (-1)^{k+1} |I_2(k, s, A, B - \{b_n\}, W)| + \sum_{k=0}^{p} (-1)^k |I_2(-k, s, B, A - \{a_n\}, W)|$;
If $\sum_{j=1}^{n} x_j w_j + s > 0$ in $V_1$ and $\sum_{j=1}^{n} x_j w_j + s < 0$ in $V_2$ then
$I \approx \sum_{k=0}^{p} (-1)^k |I_2(-k, s, A, B - \{b_n\}, W)| + \sum_{k=1}^{p} (-1)^{k+1} |I_2(k, s, B, A - \{a_n\}, W)|$;

case 3. $e_{n,s} = a_n$ and $e_{n,l} > b_n$, or $e_{n,s} < a_n$ and $e_{n,l} = b_n$:
Let $C = \{c_1, \ldots, c_n\} = \{x_1, \ldots, x_{n-1}\}_s \cup \{a_n\}$ and $D = \{d_1, \ldots, d_n\} = \{x_1, \ldots, x_{n-1}\}_l \cup \{b_n\}$;

If $c_n = e_{n,s}$, $\sum_{j=1}^{n} x_j w_j + s > 0$ in $V_1$, and $\sum_{j=1}^{n} x_j w_j + s < 0$ in $V_2$ and $V_3$, then
$I \approx \sum_{k=0}^{p} (-1)^k \left( |I_3(-k, s, (C - \{c_n\}) \cup \{d_n\}, W)| - \sum_h d_h \right) + \sum_{k=1}^{p} (-1)^{k+1} |I_2(k, s, C, D - \{d_n\}, W)| + \sum_{k=1}^{p} (-1)^{k+1} \left( |I_3(k, s, D, W)| - \sum_h e_h \right)$,
where $d_h = |I_3(-k, s, (C - \{c_h, c_n\}) \cup \{d_h, d_n\}, W)|$ and $e_h = |I_3(k, s, (D - \{d_h\}) \cup \{c_h\}, W)|$;

If $c_n = e_{n,s}$, $\sum_{j=1}^{n} x_j w_j + s < 0$ in $V_1$, and $\sum_{j=1}^{n} x_j w_j + s > 0$ in $V_2$ and $V_3$, then
$I \approx \sum_{k=1}^{p} (-1)^{k+1} \left( |I_3(k, s, (C - \{c_n\}) \cup \{d_n\}, W)| - \sum_h d_h \right) + \sum_{k=0}^{p} (-1)^k |I_2(-k, s, C, D - \{d_n\}, W)| + \sum_{k=0}^{p} (-1)^k \left( |I_3(-k, s, D, W)| - \sum_h e_h \right)$,
where $d_h = |I_3(k, s, (C - \{c_h, c_n\}) \cup \{d_h, d_n\}, W)|$ and $e_h = |I_3(-k, s, (D - \{d_h\}) \cup \{c_h\}, W)|$;

If $d_n = e_{n,l}$, $\sum_{j=1}^{n} x_j w_j + s > 0$ in $V_1$, and $\sum_{j=1}^{n} x_j w_j + s < 0$ in $V_2$ and $V_3$, then
$I \approx \sum_{k=0}^{p} (-1)^k \left( |I_3(-k, s, (D - \{d_n\}) \cup \{c_n\}, W)| - \sum_h d_h \right) + \sum_{k=1}^{p} (-1)^{k+1} |I_2(k, s, D, C - \{c_n\}, W)| + \sum_{k=1}^{p} (-1)^{k+1} \left( |I_3(k, s, C, W)| - \sum_h e_h \right)$,
where $d_h = |I_3(-k, s, (D - \{d_h, d_n\}) \cup \{c_h, c_n\}, W)|$ and $e_h = |I_3(k, s, (C - \{c_h\}) \cup \{d_h\}, W)|$;

If $d_n = e_{n,l}$, $\sum_{j=1}^{n} x_j w_j + s < 0$ in $V_1$, and $\sum_{j=1}^{n} x_j w_j + s > 0$ in $V_2$ and $V_3$, then
$I \approx \sum_{k=1}^{p} (-1)^{k+1} \left( |I_3(k, s, (D - \{d_n\}) \cup \{c_n\}, W)| - \sum_h d_h \right) + \sum_{k=0}^{p} (-1)^k |I_2(-k, s, D, C - \{c_n\}, W)| + \sum_{k=0}^{p} (-1)^k \left( |I_3(-k, s, C, W)| - \sum_h e_h \right)$,
where $d_h = |I_3(k, s, (D - \{d_h, d_n\}) \cup \{c_h, c_n\}, W)|$ and $e_h = |I_3(-k, s, (C - \{c_h\}) \cup \{d_h\}, W)|$;

case 4. $e_{n,s} > a_n$ and $e_{n,l} > b_n$, or $e_{n,s} < a_n$ and $e_{n,l} < b_n$: Reduce to case 1 and case 3;

case 5. $e_{n,s} < a_n$ and $e_{n,l} > b_n$: Reduce to case 2 and case 3.

5 Computation of the Sensitivity for the Entire MLP
This section discusses the computation of the sensitivity for an entire MLP. From a global point of view, the sensitivity of an MLP reflects its output deviation, that is, its output layer's output deviation, due to the first layer's input deviation. As the inputs of each neuron in a layer are the outputs of all neurons in its immediately preceding layer, the first layer's input deviation will be propagated through all internal layers to influence the output layer's output. The sensitivity of the output layer can be computed neuron by neuron from the first layer to the last layer. In more detail, $S^1$ can be computed with $\Delta X^1$ and $W^1$ by NEURON SENSITIVITY$(A, B, Q(X^1), \Delta X^1, W_i^1)$ ($1 \le i \le n_1$), then $S^2$ with $\Delta X^2 = S^1$ and $W^2$, and so on to, finally, $S^L$ with $\Delta X^L = S^{L-1}$ and $W^L$. Logically, since the input space and the distribution of a neuron in layer $l$ ($2 \le l \le L$) may be different from those of the neurons in its previous layers, the input bounds $A$ and $B$ and the density function $Q(X^l)$ for each layer $l$ ($2 \le l \le L$) need to be adjusted in the computation. An algorithm to compute the MLP's sensitivity is given below. The expected input deviation $\Delta X^1$, that is, $S^0$, and the computed sensitivities $S^l$ ($1 \le l \le L$) form a two-dimensional array. By controlling the mapping between the iteration index and the arrays' subscripts, it is easy to compute the sensitivity for an MLP.

Algorithm: MLP SENSITIVITY$(A, B, Q(X^1), \Delta X^1, W)$

1. Put the given expected input deviation $\Delta X^1$ (i.e., $S^0$) and the weight set $W$ into array forms;
2. Compute the sensitivity neuron by neuron and layer by layer:
   For $l = 1$ step 1 to $L$
      If $l > 1$ then adjust $A$, $B$, and $Q(X^l)$;
      For $i = 1$ step 1 to $n_l$
         Call NEURON SENSITIVITY$(A, B, Q(X^l), S^{l-1}, W_i^l)$ and save the result $s_i^l$;
3. $S^L$ is the computed sensitivity for the MLP.

In this algorithm, the adjustment of the input bounds for layer $l$ ($2 \le l \le L$) can easily be done with the bounds and the weight information of layer $l-1$ because of the linear feature of the summation of weighted inputs and the monotonically increasing feature of the sigmoid function. But the adjustment of the density function is generally very complex. A reasonable distribution function to employ as an approximation is the uniform distribution; other distribution functions would most likely make the integration of equation 4.5 virtually impossible. Our experimental results presented in section 6 support the use of the uniform distribution as a reasonably good one.

6 Experimental Verification
To verify the derived theoretical results, a number of experiments have been conducted. Again, a single neuron is considered first. In the experiments, for simplicity and without losing the generality of the spatial dimension and direction, eight neurons with four different input dimensions from $n = 2$ to $n = 5$ and different weight values are used. Each neuron has either all positive weight values or separate positive and negative weight values. As for the case of the neurons with all negative weight values, it could be reduced to the case of all positive elements because of equation 2.3. Some weight values for these neurons are given in Table 1. Both the theoretical computations and the computer simulations are based on the data given by this table. The theoretical results, that is, the sensitivities, for each neuron are computed with the algorithm NEURON SENSITIVITY$(A, B, Q(X), \Delta X, W)$ with the parameters $A = \{0\}$, $B = \{1\}$, $Q(X) = 1$, and the elements of $\Delta X$ all identical and ranging from 0 to 0.1 with a step size of 0.01. The simulation results, that is, the average output deviations for each neuron, are obtained with respect to more than 100 million inputs selected randomly and uniformly from the interval $[0, 1]^n$ ($2 \le n \le 5$). Theoretical and simulation results are shown in Figures 13 through 16. It is clear that the agreement between the two results is very good. The experiments on the networks were carried out for three MLPs: 2-2-1, 2-3-1, and 2-2-2-1. Their weight values are given in Table 2. Similar to the verification of the individual neurons, the theoretical computations and the computer simulations were run with the given weight sets, and the expected input deviations ranged from 0 to 0.1 with a step size of 0.01. The sensitivity for each network is approximately computed by the algorithm MLP SENSITIVITY$(A, B, Q(X^1), \Delta X^1, W)$ with
Table 1: Weight Dimension and Weight Values for Neurons.

            n    w(1)   w(2)   w(3)   w(4)   w(5)
  Neuron 1  2    1.0    2.0
  Neuron 2  2    1.0   -2.0
  Neuron 3  3    1.0    2.0    3.0
  Neuron 4  3    1.0   -2.0    3.0
  Neuron 5  4    1.0    2.0    3.0    4.0
  Neuron 6  4    1.0   -2.0    3.0   -4.0
  Neuron 7  5    1.0    2.0    3.0    4.0    5.0
  Neuron 8  5    1.0   -2.0    3.0   -4.0    5.0
Figure 13: Sensitivity of the neurons with two-dimensional weight to expected input deviation.
$A = \{0\}$, $B = \{1\}$, $Q(X^1) = 1$. They are not adjusted during the computations. The average output deviations for each network are with respect to more than 100 million inputs selected randomly and uniformly from the interval $[0, 1]^2$. The theoretical and simulation results are shown in Figure 17.

7 Conclusion
We have proposed a quantified measure and its computation for the sensitivity of the MLP to its input perturbation. The theoretical results obtained
Figure 14: Sensitivity of the neurons with three-dimensional weight to expected input deviation.
Figure 15: Sensitivity of the neurons with four-dimensional weight to expected input deviation.
Figure 16: Sensitivity of the neurons with five-dimensional weight to expected input deviation.

Table 2: Three Sets of Weight Values.

  2-2-1:    w(1;1,1) = -5.0,  w(1;1,2) = 6.0,  w(1;2,1) = -3.0,  w(1;2,2) = 4.0;
            w(2;1,1) = -1.0,  w(2;1,2) = 2.0
  2-3-1:    w(1;1,1) = 1.0,   w(1;1,2) = -2.0, w(1;2,1) = 3.0,   w(1;2,2) = -4.0,
            w(1;3,1) = 5.0,   w(1;3,2) = -6.0;
            w(2;1,1) = 7.0,   w(2;1,2) = -8.0, w(2;1,3) = 9.0
  2-2-2-1:  w(1;1,1) = 10.0,  w(1;1,2) = -9.0, w(1;2,1) = 8.0,   w(1;2,2) = -7.0;
            w(2;1,1) = -5.0,  w(2;1,2) = 6.0,  w(2;2,1) = -3.0,  w(2;2,2) = 4.0;
            w(3;1,1) = -1.0,  w(3;1,2) = 2.0
are in reasonable agreement with the simulation results. The sensitivity measure is something like an analog rule that network users can use as a tool to evaluate their networks' performance. For example, Lee and Oh (1994) indicated that in pattern classification applications, their sensitivity (misclassification probability) result could be applied to select the optimal weight set among many weight sets acquired through a number of learning
Figure 17: Sensitivity of the three MLPs to expected input deviation.
trials. Among these weight sets, an MLP with the optimal set will have the best generalization capability and the best input noise immunity. Zurada, Malinowski, and Usui (1997) made use of analytical sensitivity to delete redundant inputs to an MLP for pruning its architecture. Engelbrecht and Cloete (1999) also used analytical sensitivity to dynamically select training patterns. Some theoretical issues as well as applications of the MLP, such as improving error tolerance, measuring generalization ability, and pruning the network architecture, would benefit from this study.

Acknowledgments
This research work is supported by a Hong Kong CERG 5114/97E.

References

Alippi, C., Piuri, V., & Sami, M. (1995). Sensitivity to errors in artificial neural networks: A behavioral approach. IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Applications, 42(6), 358-361.
Cheng, A. Y., & Yeung, D. S. (1999). Sensitivity analysis of neocognitron. IEEE Transactions on Systems, Man, and Cybernetics-Part C: Applications and Reviews, 29(2), 238-249.
Choi, J. Y., & Choi, C. H. (1992). Sensitivity analysis of multilayer perceptron with differentiable activation functions. IEEE Transactions on Neural Networks, 3(1), 101-107.
Engelbrecht, A. P., & Cloete, I. (1999). Incremental learning using sensitivity analysis. In Proceedings of IEEE IJCNN'99 (Vol. 2, pp. 1350-1355). Washington, DC.
Engelbrecht, A. P., Fletcher, L., & Cloete, I. (1999). Variance analysis of sensitivity information for pruning multilayer feedforward neural networks. In Proceedings of IEEE IJCNN'99 (Vol. 3, pp. 1829-1833). Washington, DC.
Fu, L., & Chen, T. (1993). Sensitivity analysis for input vector in multilayer feedforward neural networks. In Proceedings of IEEE International Conference on Neural Networks (Vol. 1, pp. 215-218). San Francisco.
Hashem, S. (1992). Sensitivity analysis for feedforward artificial neural networks with differentiable activation functions. In Proceedings of IJCNN'92 (Vol. 1, pp. 419-424). Baltimore, MD.
Lee, Y., & Oh, S. H. (1994). Input noise immunity of multilayer perceptrons. ETRI Journal, 16(1), 35-43.
Piché, S. W. (1995). The selection of weight accuracies for madalines. IEEE Transactions on Neural Networks, 6(2), 432-445.
Stevenson, M., Winter, R., & Widrow, B. (1990). Sensitivity of feed forward neural networks to weight errors. IEEE Transactions on Neural Networks, 1(1), 71-80.
Yeung, D. S., & Sun, X. (2002). Using function approximation to analyze the sensitivity of MLP with antisymmetric squashing activation function. IEEE Transactions on Neural Networks, 13(1), 34-44.
Zeng, X. (2001). Output sensitivity of MLPs derived from statistical expectation. Unpublished doctoral dissertation, Hong Kong Polytechnic University.
Zeng, X., & Yeung, D. S. (2001). Sensitivity analysis of multilayer perceptron to input and weight perturbations. IEEE Transactions on Neural Networks, 12(6), 1358-1366.
Zurada, J. M., Malinowski, A., & Usui, S. (1997). Perturbation method for deleting redundant inputs of perceptron networks.
Neurocomputing, 14, 177–193. Received July 24, 2001; accepted May 30, 2002.
LETTER
Communicated by Hagai Attias
Variational Mixture of Bayesian Independent Component Analyzers R. A. Choudrey
[email protected] S. J. Roberts
[email protected] University of Oxford, Robotics Research, Oxford, U.K. There has been growing interest in subspace data modeling over the past few years. Methods such as principal component analysis, factor analysis, and independent component analysis have gained in popularity and have found many applications in image modeling, signal processing, and data compression, to name just a few. As applications and computing power grow, more and more sophisticated analyses and meaningful representations are sought. Mixture modeling methods have been proposed for principal and factor analyzers that exploit local gaussian features in the subspace manifolds. Meaningful representations may be lost, however, if these local features are nongaussian or discontinuous. In this article, we propose extending the gaussian analyzers mixture model to an independent component analyzers mixture model. We employ recent developments in variational Bayesian inference and structure determination to construct a novel approach for modeling nongaussian, discontinuous manifolds. We automatically determine the local dimensionality of each manifold and use variational inference to calculate the optimum number of ICA components needed in our mixture model. We demonstrate our framework on complex synthetic data and illustrate its application to real data by decomposing functional magnetic resonance images into meaningful—and medically useful—features. 1 Introduction
The goal of pattern analysis and recognition is to extract information from some data. In order for this information to be useful, the distribution of data must be represented in some meaningful way. In many cases, insight may be gained by dividing the data into self-similar areas and analyzing each of these clusters under some informative framework, for example, using some understanding of the assumed data-generating process. One such method is to model the data as being produced by a mixture of data generators (also called analyzers), where each component generator is responsible for generating a particular cluster. The problems to overcome in this mixture Neural Computation 15, 213–252 (2003)
© 2002 Massachusetts Institute of Technology
R. Choudrey and S. Roberts
modeling are to decide how many generators are needed, where to place them, and how to adjust them to best represent the data. Mixtures of gaussians (MoG) are widely used throughout the fields of machine learning and statistics for data modeling, where each generator is a gaussian density. Despite their popularity, however, MoGs suffer from two serious drawbacks. The first is that as the dimensionality S of the problem space increases, the size of each covariance matrix, $S^2$, becomes prohibitively large. This can be dealt with by assuming isotropic gaussians (i.e., ignoring the covariance structure), but this greatly reduces the flexibility of the model class. This problem has been solved by Tipping and Bishop (1999), who replaced each gaussian with a probabilistic principal component analyzer (PCA). This allowed the dimensionality of each covariance to be effectively reduced while maintaining the richness of the model class. This was formulated under a maximum-likelihood framework, which, although efficient, does not allow one to infer the optimum number of generators needed. This mixture was modified into a mixture of factor analyzers (FA) by Ghahramani and Beal (2000), where Bayesian inference was used to infer the optimum number of analyzers. The second problem with MoGs is that each component is a gaussian, a strong assumption that is often violated in many natural clustering problems (Lee, Lewicki, & Sejnowski, 2000). Although MoGs are capable of modeling any distribution given enough components, the problem still remains of automatically grouping gaussians that together describe some larger-scale feature. It is this second problem that we address in this article. A solution is reached by extending the mixtures of probabilistic PCA/FA model to an independent component analysis (ICA) mixture model. We improve on previous work on mixtures of ICA by Lee et al.
(2000) and Penny and Roberts (2001) by incorporating a very flexible ICA model that can generate arbitrary densities using MoGs and by bringing the formalism into the Bayesian arena. We use Bayesian inference to infer the optimum number of ICAs needed and to determine their ideal dimensionalities automatically.

1.1 ICA. Independent component analysis seeks to extract salient features and structure from data assumed to be a linear mixture of independent underlying (hidden) features
x D As,
(1.1)
where x is an observed S-dimensional data vector, s is the L-dimensional source vector, and A is the S × L bases or mixing matrix.¹ An ICA model formulated in this way is well posed only if L ≤ S; otherwise, there are effectively more unknowns than equations in the model. The overcomplete

¹ The term mixing refers to the linear mixing in the ICA model; mixture refers to the stochastic generating method in a mixture model.
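Equation 1.1 describes the generative model; a minimal sketch of sampling from it (our toy example with uniform sources and a made-up mixing matrix, not code from the article) looks like this:

```python
import random

def generate_ica_data(A, n_samples, seed=0):
    """Draw independent non-gaussian (here: uniform) sources s and mix them
    as x = A s (equation 1.1); A is S x L, s is L-dimensional."""
    rng = random.Random(seed)
    S, L = len(A), len(A[0])
    data = []
    for _ in range(n_samples):
        s = [rng.uniform(-1.0, 1.0) for _ in range(L)]  # independent sources
        x = [sum(A[i][l] * s[l] for l in range(L)) for i in range(S)]
        data.append(x)
    return data

# A 3-sensor, 2-source example (S = 3 > L = 2, so the model is well posed).
A = [[1.0, 0.5],
     [0.3, 1.0],
     [0.8, 0.2]]
X = generate_ica_data(A, 1000)
assert len(X) == 1000 and len(X[0]) == 3
```

The goal of ICA is then to recover both A and the source samples s from X alone.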
Figure 1: The underlying and observed data densities.
problem (L ≥ S) is an area of ongoing research and will not be dealt with here. Which of A or s represents the features depends on one's view of the data. If the data are considered to be constructed from varying amounts of L static features, then the columns of A represent the L static basis vectors and s represents the amount of each basis used for a given data vector. For example, if the data represent an image, that image may be considered a mixture of underlying (maybe fewer) independent edges, textures, and so forth. The aim of ICA in this case is to "unmix" the data set and recover these representative features. If the data are time varying, they can be seen as a linear mixing of underlying (possibly lower-dimensional) signals represented by s. For example, the data may be signals received at S microphones at a cocktail party. These signals are then simply mixings of L people talking. ICA is used to blindly (i.e., without access to s and A) separate the sensor signals into the underlying source signals s. This is known as blind source separation (BSS) in the signal processing literature. In either case, ICA can be recast as data density modeling. The data density p(x) is a linear transform (i.e., scaling, rotation, and possibly higher-dimensional projection) of an unknown, nongaussian manifold: the source density p(s) (see Figure 1). The goal of ICA is to recover the source density and estimate the transform matrix. ICA has traditionally been performed in the noiseless limit (Bell & Sejnowski, 1995; Lee, Girolami, Bell, & Sejnowski, 1998), with limited flexibility in the source density model and with noise often being dealt with as an extra source. More recently, however, Attias (1999b) extended ICA by using a more flexible source model and incorporating full covariance noise explicitly into the ICA framework.
The model, which Attias dubbed independent factor analysis (IFA), was subsequently learned through a maximum likelihood expectation-maximization (EM) algorithm. Bayesian
R. Choudrey and S. Roberts
formalisms have also been introduced, most notably (under a variational framework) by Attias (1999a), Lappalainen (1999), Choudrey, Penny, and Roberts (2000), and Miskin and MacKay (2000). For a comprehensive review of ICA, see Hyvärinen (1999) or Roberts and Everson (2001).

1.2 ICA Mixture Model. ICA was originally developed as a solution to the BSS problem in auditory signal analysis by Hérault and Jutten (1986). Since then, the BSS approach has been extended to human electroencephalograms (EEGs) (Makeig, Bell, Jung, & Sejnowski, 1996), financial data analysis (Back & Weigend, 1998), and telecommunications (Ristaniemi & Joutsensalo, 1999), among many other applications. Recently, however, ICA's remit has broadened to one of representation in general. The way data are presented very much influences what patterns are found and how much information can be extracted from them. A primary goal in pattern recognition, then, is to find some intrinsic coordinate system from which structure is easier to extract. In this sense, ICA is a higher-order extension of second-order methods such as PCA and FA. PCA and FA find representations of data that preserve the maximum amount of variance by effectively finding an intrinsic coordinate system for the covariance matrix. Projections of the data onto the bases found lead to decorrelation. ICA seeks a loftier goal: a representation in which the data are maximally statistically independent. In this way, ICA aims to repackage data, which may contain structure across its constituent dimensions, into independent signals that contain structure only within components, thereby making patterns (hopefully) more cogent. This application of ICA has led to interesting work in the representation of text corpora (Isbell & Viola, 1999; Kolenda, Hansen, & Sigurdsson, 2000), of images such as faces (Bartlett, Lades, & Sejnowski, 1998), and of natural scenes (Bell & Sejnowski, 1997).
This, in turn, has hinted at possible links between ICA and coding schemes in the primary visual cortex (Olshausen & Field, 1997; Lewicki & Olshausen, 1999), although we stress that this connection has generated some controversy (van Hateren & van der Schaaf, 1998). It is this representational viewpoint that motivates our extension of ICA into a mixture of ICAs. Decomposing and representing data using ICA assumes that the whole data distribution is adequately described by one coordinate frame. This may not be appropriate for many problems, however. In the BSS problem, ICA assumes each data signal carries a constant mixture of source signals. Consider the scenario proposed by Lee, Lewicki, and Sejnowski (1999). There are two people talking to each other, but never at the same time, and there is music in the background. This cacophony is picked up by two microphones. At any one time, the microphones pick up a mixture of one voice and the music or the other voice and the music, but never all three at the same time. Clearly, in this case, standard ICA is an inadequate model. As Lee et al. (1999) showed, a mixture of ICAs is a more appropriate generative model. More generally, if the sensor density in Figure 1 consists of
various self-similar, nongaussian manifolds, enforcing a single global representation is not appropriate and will produce a suboptimal representation. A more intuitive way of representing the data is to use a number of local coordinate frames, each constructed under local conditions. This leads to a mixture of independent component analyzers. ICA mixture models were first formulated by Lee et al. (1999) using the extended Infomax algorithm (Lee, Girolami, & Sejnowski, 1999) to switch the source model between subgaussian and supergaussian regimes. The model was learned through a combination of maximum likelihood and gradient ascent. Although well demonstrated, the source model could switch only between Laplacian and bimodal densities and thus lacked flexibility. Penny and Roberts (2001) relaxed this constraint by using generalized exponential sources that can model a wide variety of kurtoses by the adjustment of a parameter. A basic method of model selection was also incorporated using the Bayesian information criterion (BIC) to infer the number of hidden sources (i.e., the intrinsic dimensionality of the local manifold). Due to the problem formulation used, only two-dimensional or higher manifolds could be modeled. Although more flexible, the densities could only be unimodal, and the learning scheme was again maximum likelihood. In this article, we present an ICA mixture model trained using Bayesian methods. In line with Attias (1999b), Lappalainen (1999), Choudrey et al. (2000), and Miskin and MacKay (2000), we choose a fully adaptable factorial MoG as the source model for our ICA components, allowing us to recover arbitrary source densities. Essentially, each ICA component will model self-similar areas (the features) as a mixture of gaussian subfeatures.
To overcome the heavy computational load associated with Bayesian learning, we use the variational framework reviewed by Jaakkola and Jordan (2000) to make assumptions about the posterior, giving tractability to the Bayesian model. The variational framework is a whole class of methods for bounding intractable integrals. In particular, we use free-energy minimization (Hinton & van Camp, 1993), or ensemble learning (MacKay, 1995a), a variational method for Bayesian parameter estimation. We further extend the ensemble learning results of Choudrey et al. (2000) and Miskin and MacKay (2000) by using an MoG posterior density in a similar fashion to Attias (1999b), rather than a single gaussian, letting us capture much richer posteriors. This, in turn, leads to greater robustness under uncertainty (Choudrey & Roberts, 2001). The variational Bayesian method is carried through to the ICA mixture model, allowing model comparison, incorporation of prior knowledge of the mixture process, and, by integrating out nuisance parameters, control of model complexity and thus overfitting. By monitoring the variational free energy of our models, we can compare model assumptions, allowing us, for example, to infer the optimum number of components needed in our ICA mixture model (Attias, 1999a). We also employ automatic relevance determination (ARD) (MacKay, 1995b) to suppress unsupported sources and thus effectively infer the number of latent dimensions of each
ICA component as part of the learning process. This leads to the variational Bayesian independent component analyzers mixture model (vbICA-MM).

2 The Model

The probability of generating a data vector $x^n$ from a C-component mixture model given assumptions $\mathcal{M}$ is

$$p(x^n \mid \mathcal{M}) = \sum_{c=1}^{C} p(c \mid \mathcal{M}_0)\, p(x^n \mid \mathcal{M}_c, c). \tag{2.1}$$
A data vector is generated by choosing one of the C components stochastically under $p(c \mid \mathcal{M}_0)$ and then drawing from $p(x^n \mid \mathcal{M}_c, c)$. $\mathcal{M} = \{\mathcal{M}_0, \mathcal{M}_1, \ldots, \mathcal{M}_C\}$ is the vector of component model assumptions, $\mathcal{M}_c$, and assumptions about the mixture process, $\mathcal{M}_0$. The assumptions represent everything that essentially defines the model: values of fixed parameters, model structure, details of the component switching method, any prior information, and so forth. $p(x^n \mid \mathcal{M})$ is known as the evidence for model $\mathcal{M}$ and quantifies the likelihood of the observed data under model $\mathcal{M}$. The variable c indicates which component of the mixture model is chosen to generate a given data vector x. If $p(c \mid \mathcal{M}_0)$ is a vector of probabilities and each component $p(x^n \mid \mathcal{M}_c, c)$ is a gaussian, then equation 2.1 simply describes an MoG. If the MoG is adapted through a maximum likelihood approach, then $\mathcal{M}$ represents a list of point estimates for the corresponding parameters. In our mixture model, however, each component has a nongaussian density derived from the ICA model presented in the next section, and $\mathcal{M}$ represents assumptions concerning the distribution of possible parameter values. Figure 2 shows a generative model for a data vector x.

2.1 The ICA Model. In common with ICA in the literature, we choose a generative model to work with. The observed variables, x, of dimension S are modeled as a linear combination of statistically independent latent variables, $s_c$, of dimension $L_c$, with added gaussian noise,

$$x = A_c s_c + y_c + e_c, \tag{2.2}$$

where $y_c$ is an S-dimensional bias vector, $e_c$ is S-dimensional additive noise, and c indexes the cth ICA model. In signal processing nomenclature, S is the number of (observed) sensors, and $L_c$ is the number of latent (hidden) sources. Equation 2.2 acts as a complete description for cluster c in the data density. The bias vector, $y_c$, defines the position of the cluster in the S-dimensional data space, $A_c$ describes its orientation, and $s_c$ describes the underlying manifold. Using ICA, data produced by mixing gaussian sources cannot be
Figure 2: ICA mixture model. (Mixture assumptions $\mathcal{M}_0$ select, via the indicator c, among ICA components 1 through C with assumptions $\mathcal{M}_1, \ldots, \mathcal{M}_C$, generating x.)
separated, as they only have 2-moment distributions, so can never be more than decorrelated. The noise, $e_c$, is assumed to be zero-mean gaussian and isotropic,

$$p(e_c \mid 0, \lambda_c, c) = \mathcal{N}(e_c; 0, \lambda_c I), \tag{2.3}$$
where $\lambda_c I$ is the precision.² The noise essentially absorbs any (spherical) gaussianity present in the cluster. The probability of observing data vector $x^n$ under component c is then given by

$$p(x^n \mid \theta_c, c) = \left(\frac{\lambda_c}{2\pi}\right)^{S/2} \exp[-E_c], \tag{2.4}$$

where $\theta_c = \{A_c, s_c^n, \lambda_c\}$ and

$$E_c = \frac{\lambda_c}{2}\, (x^n - A_c s_c^n - y_c)^T (x^n - A_c s_c^n - y_c).$$

Since the sources $s_c = \{s_{c,1}, \ldots, s_{c,i}, \ldots, s_{c,L_c}\}$ are by definition mutually independent, the distribution over $s_c$ for data point n can be written as

$$p(s_c^n \mid \mathcal{M}_{s_c}, c) = \prod_{i=1}^{L_c} p(s_{c,i}^n \mid \mathcal{M}_{s_c,i}, c), \tag{2.5}$$
² Conventional notation specifies covariance rather than precision. We have found precisions easier to manipulate in the math, so have abused the notation somewhat to make subsequent equations easier to read.
where the product runs over the $L_c$ sources of component c and $\mathcal{M}_{s_c}$ is the vector of source model assumptions. $p(s_c^n \mid \mathcal{M}_{s_c}, c)$ is the source model for ICA component c. Traditionally, a reciprocal-cosh (Bell & Sejnowski, 1995) or unimodal density (Miskin & MacKay, 2000) has been used. These, however, have limited flexibility in representing kurtosis or capturing multiple modes. Methods have been proposed for switching between super- and subgaussian regimes (Girolami, 1998; Lee et al., 2000; Lee, Lewicki et al., 1999), although these depend on heuristic stability analyses. More recently, Attias (1999b) introduced an MoG source model into his ICA formalism, potentially allowing any distribution to be modeled. This is the route we take.

2.2 ICA Source Model. The choice of a flexible and mathematically tractable source model is crucial if a wide variety of source distributions are to be modeled; in particular, the source model should be capable of encompassing both super- and subgaussian distributions (distributions with positive and negative kurtosis, respectively) and complex multimodal distributions. One such distribution is a factorized mixture of one-dimensional gaussians with $L_c$ factors (i.e., sources) and $m_i$ components per source (see Figure 3),
$$p(s_c^n \mid \varphi_c, c) = \prod_{i=1}^{L_c} \sum_{q_i=1}^{m_i} p(q_i^n = q_i \mid \pi_i, c)\, p(s_{c,i}^n \mid \Theta_{c,i}, c)$$
$$= \prod_{i=1}^{L_c} \sum_{q_i=1}^{m_i} \pi_{i,q_i}\, \mathcal{N}(s_{c,i}^n; \mu_{i,q_i}, \beta_{i,q_i}), \tag{2.6}$$
where, for brevity, the ICA component subscript c has been dropped from parameters that can be seen to belong to ICA c from context. From now on, all subscripted parameters should be assumed to belong to the cth ICA model unless otherwise stated. Equation 2.6 essentially describes the local features of cluster c: $\mu_{i,q_i}$ is the position of feature $q_i$ with regard to the cluster center, $\beta_{i,q_i}$ is its size, and $\pi_{i,q_i}$ its prominence with regard to other features. The mixture proportions $\pi_{i,q_i} = p(q_i^n = q_i \mid \pi_i)$ are the prior probabilities of choosing component $q_i$ of the ith source (of the cth ICA model). $q_i^n$ is a variable indicating which component of the ith source is chosen for generating $s_{c,i}^n$ and takes on values of $\{q_i = 1, \ldots, q_i = m_i\}$ (where $m_i$, of course, depends on ICA model c). The mean and precision of gaussian $q_i$ in source i are $\mu_{i,q_i}$ and $\beta_{i,q_i}$, respectively. The parameters of source i are $\Theta_{c,i} = \{\pi_{c,i}, \mu_{c,i}, \beta_{c,i}\}$, where boldface type indicates the vector of $m_i$ parameters. The complete parameter set of the source model is $\varphi_c = \{\Theta_{c,1}, \Theta_{c,2}, \ldots, \Theta_{c,L_c}\}$.
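The generative process of equations 2.1, 2.2, and 2.6 can be sketched numerically. The sketch below draws one data vector from a toy two-component model (choose an ICA component, choose a gaussian state per source, sample the sources, mix, and add isotropic noise); all parameter values are illustrative, not from the paper:

```python
import numpy as np

# Illustrative sampler for the generative model of equations 2.1, 2.2, and 2.6:
# pick ICA component c, pick a gaussian state q_i for each source, sample the
# sources from the chosen gaussians, then mix and add isotropic noise.
# All numbers are made-up toy values.
rng = np.random.default_rng(1)

kappa = np.array([0.5, 0.5])           # p(c | M_0), C = 2 components
S = 2                                  # number of sensors

def sample_x(A, y, lam, pi, mu, beta):
    """Draw one data vector from a single ICA component.
    pi, mu, beta: (L, m) MoG source parameters (beta holds precisions)."""
    L, m = pi.shape
    s = np.empty(L)
    for i in range(L):
        q = rng.choice(m, p=pi[i])                       # state q_i ~ pi_i
        s[i] = rng.normal(mu[i, q], 1.0 / np.sqrt(beta[i, q]))
    e = rng.normal(0.0, 1.0 / np.sqrt(lam), size=S)      # N(0, lam^-1 I) noise
    return A @ s + y + e                                 # equation 2.2

# One toy component: L = 2 sources, m = 2 gaussians per source.
A = np.array([[1.0, 0.5], [0.0, 1.0]])
y = np.zeros(S)
pi = np.full((2, 2), 0.5)
mu = np.array([[-2.0, 2.0], [-1.0, 1.0]])
beta = np.full((2, 2), 4.0)            # precision 4 -> standard deviation 0.5

c = rng.choice(len(kappa), p=kappa)    # choose a component under p(c | M_0)
x = sample_x(A, y, lam=100.0, pi=pi, mu=mu, beta=beta)
print(x.shape)  # (2,)
```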
Figure 3: MoG source model. (Source s is generated by selecting one of gaussians 1 through m via indicator q, under assumptions $\mathcal{M}_{s,0}$ and $\mathcal{M}_{s,1}, \ldots, \mathcal{M}_{s,m}$.)
The complete collection of possible source states is denoted $q_c = \{q_{c,1}, q_{c,2}, \ldots, q_{c,m}\}$ and runs over all $m = \prod_i m_i$ possible combinations of source states. The probability of state $q_c^n$ being chosen and generating source vector $s_c^n$ is

$$p(s_c^n, q_c^n \mid \varphi_c, c) = \prod_{i=1}^{L_c} p(q_i^n = q_i \mid \pi_i, c)\, p(s_{c,i}^n \mid q_i, \mu_{i,q_i}, \beta_{i,q_i}, c) \tag{2.7}$$
$$= p(q_c^n \mid \pi_c, c)\, p(s_c^n \mid q_c^n, \varphi_c, c),$$
where $\pi_c = \{\pi_{c,1}, \pi_{c,2}, \ldots, \pi_{c,L_c}\}$ (similarly for $\mu$ and $\beta$). Note that the product of $L_c$ one-dimensional MoGs in equation 2.6 is equivalent to a single MoG in $L_c$-dimensional space with m states. By integrating and summing over the hidden variables, $\{s_c, q_c\}$, the likelihood of the IID data $X = \{x^1, x^2, \ldots, x^N\}$ given the model parameters $\Theta_c = \{A_c, y_c, \lambda_c, \varphi_c\}$ can now be written as

$$p(X \mid \Theta_c, c) = \prod_{n=1}^{N} \sum_{q=1}^{m} \int p(x^n, s_c^n, q_c^n \mid \Theta_c, c)\, ds_c, \tag{2.8}$$
where $ds_c = \prod_i ds_{c,i}$. If we stipulate a form for $p(c \mid \mathcal{M}_0)$ in equation 2.1,

$$p(X \mid \mathcal{M}) = \sum_{c=1}^{C} p(c \mid \kappa)\, p(X \mid \Theta_c, c), \tag{2.9}$$
where $p(c \mid \kappa) = \{p(c{=}1) = \kappa_1, p(c{=}2) = \kappa_2, \ldots, p(c{=}C) = \kappa_C\}$, then equations 2.4 through 2.6 can be substituted into equation 2.9 to yield a maximum likelihood model. This can then be learned through an iterative process such as gradient descent (Lee et al., 2000) or the EM algorithm (Penny & Roberts, 2001). In this article, however, we take the Bayesian route by also integrating out the parameters $\{\kappa, \Theta_c\}$ in equation 2.9.

3 Bayesian Inference and Variational Learning
The maximum likelihood approach to learning the parameters of a model is well documented (for an introduction, see Pearlmutter & Parra, 1996; Cardoso, 1997; and Attias, 1999b), as are the pitfalls (e.g., overfitting, no quantification of uncertainty in the model, no model comparison). We choose to take the Bayesian approach and integrate out the parameters $\{\kappa, \Theta_c\}$ and hidden variables $\{s_c, q_c\}$. This allows us to take our assumptions one stage higher; rather than assume point estimates for the parameters (very harsh and specific), we assume prior distributions over all possible parameter values. This is much more flexible because it allows us to tailor the distributions to avoid overfitting the data, reduce the parameter search space to reasonable values, and code our inevitable uncertainty in the parameters into the model. First, we will state the prior distributions over the hidden variables and model parameters.

3.1 The Priors. The priors are chosen to be appropriately conjugate to allow tractability. All boldface parameters without sub- or superscripts indicate the collection of C parameter vectors. The prior over the ICA mixture indicator variables, $c = \{c^1, c^2, \ldots, c^N\}$, simply factorizes over the N data vectors:
$$p(c \mid \kappa) = \prod_{n=1}^{N} \kappa_{c^n}. \tag{3.1}$$
The prior over the ICA mixture coefficients $\kappa$ is a Dirichlet:

$$p(\kappa) = \frac{\Gamma\!\left(\sum_{c'} \iota_{c'}\right)}{\prod_{c=1}^{C} \Gamma(\iota_c)} \prod_{c=1}^{C} \kappa_c^{\iota_c - 1}. \tag{3.2}$$
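As a small illustration, the Dirichlet prior of equation 3.2 can be sampled directly; the concentration parameters below are arbitrary toy values, not those used in the paper:

```python
import numpy as np

# Sampling the Dirichlet prior over the mixture coefficients kappa
# (equation 3.2). The concentration parameters iota are illustrative.
rng = np.random.default_rng(2)

iota = np.array([5.0, 5.0, 5.0])   # symmetric prior over C = 3 components
kappa = rng.dirichlet(iota)

# Any draw is a valid probability vector over the C mixture components.
print(kappa.shape, round(float(kappa.sum()), 6))  # (3,) 1.0
```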
Because of source independence, it follows that the distribution over the MoG component indicator variables, $p(q_c \mid \pi_c, c)$, is a product over all $\pi_{i,q_i^n}$, where i indexes the sources and n the data:

$$p(q_c \mid \pi_c, c) = \prod_{n=1}^{N} \prod_{i=1}^{L_c} \pi_{i,q_i^n}. \tag{3.3}$$
For a given state q, the distribution over the sources is

$$p(s_c \mid q_c, \varphi_c, c) = \prod_{n=1}^{N} \prod_{i=1}^{L_c} \mathcal{N}(s_{c,i}^n; \mu_{i,q_i}, \beta_{i,q_i}). \tag{3.4}$$
The prior over the source model (MoG) parameters is a product of priors over $\pi_c$, $\mu_c$, and $\beta_c$:

$$p(\varphi) = \prod_{c=1}^{C} p(\pi_c)\, p(\mu_c)\, p(\beta_c). \tag{3.5}$$
The prior over the mixture proportions, $\pi_c$, for the ith source is a Dirichlet with parameters $r_{i,q_i}$:

$$p(\pi) = \prod_{c=1}^{C} \prod_{i=1}^{L_c} \frac{\Gamma\!\left(\sum_{q_i'} r_{i,q_i'}\right)}{\prod_{q_i=1}^{m_i} \Gamma(r_{i,q_i})} \prod_{q_i=1}^{m_i} \pi_{i,q_i}^{r_{i,q_i} - 1}. \tag{3.6}$$
The prior over each MoG mean, $\mu_{i,q_i}$, is a gaussian with mean $m_{i0}$ and precision $t_{i0}$, and the prior over the associated precision, $\beta_{i,q_i}$, is a gamma with width and scale parameters $b_{i0}$ and $c_{i0}$, respectively:

$$p(\mu) = \prod_{c=1}^{C} \prod_{i=1}^{L_c} \prod_{q_i=1}^{m_i} \mathcal{N}(\mu_{i,q_i}; m_{i0}, t_{i0}), \tag{3.7}$$
$$p(\beta) = \prod_{c=1}^{C} \prod_{i=1}^{L_c} \prod_{q_i=1}^{m_i} \mathcal{G}(\beta_{i,q_i}; b_{i0}, c_{i0}), \tag{3.8}$$

where we define the gamma distribution as

$$\mathcal{G}(w; b, c) = \frac{1}{\Gamma(c)\, b^c}\, w^{c-1} \exp\!\left(-\frac{w}{b}\right). \tag{3.9}$$
The prior over the bias vector, $y_c = \{y_1, y_2, \ldots, y_S\}$, is a product over S zero-mean gaussians with precisions $t_{y_j}$:

$$p(y) = \prod_{c=1}^{C} \prod_{j=1}^{S} \mathcal{N}(y_{c,j}; 0, t_{y_j}). \tag{3.10}$$
The prior over the sensor noise precision, $\lambda_c$, is a gamma distribution with width and scale parameters $b_{\lambda_c}$ and $c_{\lambda_c}$:

$$p(\lambda) = \prod_{c=1}^{C} \mathcal{G}(\lambda_c; b_{\lambda_c}, c_{\lambda_c}). \tag{3.11}$$
The prior over each element of the mixing matrix, $A_{ji}$, is a zero-mean gaussian with precision $\alpha_i$ for each column:

$$p(A) = \prod_{c=1}^{C} \prod_{i=1}^{L_c} \prod_{j=1}^{S} \mathcal{N}(a_{ji}; 0, \alpha_i). \tag{3.12}$$
By monitoring the precisions $\alpha_c = \{\alpha_1, \ldots, \alpha_{L_c}\}$, the relevance of each source may be automatically determined (ARD). If $\alpha_i$ is large, column i of $A_c$ will be close to zero, indicating that source i is irrelevant. Finally, the prior over each $\alpha_i$ is a gamma distribution with parameters $b_{\alpha_i}$ and $c_{\alpha_i}$:

$$p(\alpha) = \prod_{c=1}^{C} \prod_{i=1}^{L_c} \mathcal{G}(\alpha_{c,i}; b_{\alpha_i}, c_{\alpha_i}). \tag{3.13}$$
Figure 4 shows the graphical model for the variational Bayes ICA (vbICA) model. Bayesian inference in such a model is computationally intensive and often intractable. An important and efficient tool in approximating posterior distributions is the variational method (see Jordan, Ghahramani, Jaakkola, & Saul, 1999, for an excellent tutorial). In particular, we take the variational Bayes approach detailed by Attias (1999a) and Jaakkola and Jordan (2000).

3.2 Variational Bayesian Learning. Consider the log evidence for data X,

$$\log p(X) = \log \frac{p(X, W)}{p(W \mid X)}, \tag{3.14}$$
which simply follows from Bayes' rule, and where the explicit dependence on model assumptions has been dropped for brevity. The term W is the vector of all hidden variables and unknown parameters. Because the log evidence does not depend on W, this can be rewritten as

$$\log p(X) = \int p'(W) \log \frac{p(X, W)}{p(W \mid X)}\, dW$$
$$= \int p'(W) \log \frac{p(X, W)}{p'(W)}\, dW + \int p'(W) \log \frac{p'(W)}{p(W \mid X)}\, dW$$
$$= F[W] + \mathrm{KL}[p' \| p], \tag{3.15}$$

where $p'(W)$ is some approximation to the posterior $p(W \mid X)$ and

$$F[W] = \langle \log p(X, W) \rangle_{p'(W)} + H[p'(W)], \tag{3.16}$$
$$\mathrm{KL}[p' \| p] = \int p'(W) \log \frac{p'(W)}{p(W \mid X)}\, dW. \tag{3.17}$$
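The decomposition in equations 3.15 through 3.17 can be verified exactly on a toy problem with a discrete hidden variable W; the probabilities below are arbitrary:

```python
import math

# Toy verification of equation 3.15 for a discrete hidden variable W:
# log p(X) = F[W] + KL[p'||p] for ANY approximating posterior p'(W),
# so F is a lower bound on the log evidence. All numbers are arbitrary.
p_joint = {0: 0.12, 1: 0.03, 2: 0.25}        # p(X, W) for one fixed X
evidence = sum(p_joint.values())
log_evidence = math.log(evidence)

posterior = {w: p / evidence for w, p in p_joint.items()}   # p(W | X)
p_approx = {0: 0.5, 1: 0.2, 2: 0.3}          # some crude approximation p'(W)

# F = <log p(X, W)>_{p'} + H[p']   (equation 3.16)
F = sum(q * (math.log(p_joint[w]) - math.log(q)) for w, q in p_approx.items())
# KL[p' || p]                      (equation 3.17)
KL = sum(q * math.log(q / posterior[w]) for w, q in p_approx.items())

assert abs(F + KL - log_evidence) < 1e-12    # equation 3.15 holds exactly
assert F <= log_evidence                     # F is a lower bound
```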
Figure 4: Graphical model of vbICA. Circles represent random variables, and boxes denote parameters not governed by distributions (initially set by hand).
$H[p'(W)]$ is the entropy of $p'(W)$. The first term in equation 3.15, F, is known as the mean-field bound to the log evidence. Alternatively, $-F$ is called the Helmholtz energy or variational free energy. The second term is the Kullback-Leibler (KL) divergence, a pseudodistance that measures the difference between two densities. This term is strictly positive, which means that F is a strict lower bound on the log evidence. By maximizing F, not only do we minimize the KL divergence between the approximating and true posteriors, we also implicitly integrate out the unknowns W. By choosing an appropriate form for the approximation $p'(W)$, we perform tractable Bayesian learning. Since F is a strict lower bound on the model (log) evidence, a wide variety of models and assumptions can be compared and contrasted by calculating F for each model. The higher F is, the higher the likelihood of the data under that model, and therefore the better that model is at explaining the data.

3.3 Variational Learning for vbICA-MM. In our model, $W = \{c, s, q, \kappa, \Theta\}$. By choosing $p'(W)$ such that it factorizes, terms in each hidden variable can be maximized individually. We choose the following factorization,

$$p'(W) = p'(c)\, p'(s_c \mid q_c, c)\, p'(q_c \mid c)\, p'(\kappa)\, p'(y)\, p'(\lambda)\, p'(A)\, p'(\alpha)\, p'(\varphi), \tag{3.18}$$
where $p'(\varphi) = p'(\pi)\, p'(\mu)\, p'(\beta)$ and $p'(a \mid b)$ is the approximating density of $p(a \mid b, X)$. We will also stipulate that the posteriors over the sources factorize such that

$$p'(s_c, q_c \mid c) = \prod_{i=1}^{L_c} p'(q_i \mid c)\, p'(s_{c,i} \mid q_i, c). \tag{3.19}$$
This additional factorization allows efficient scaling of computation with the number of hidden sources, with little loss of accuracy (Miskin & MacKay, 2000). The term $p'(s_{c,i} \mid q_i, c)$ in equation 3.19 implies a mixture posterior source density for source i in ICA component c,

$$p'(s_{c,i}^n \mid c) = \sum_{q_i=1}^{m_i} p'(q_i^n = q_i \mid c)\, p'(s_{c,i}^n \mid q_i^n, c) \tag{3.20}$$
$$= \sum_{q_i=1}^{m_i} \hat{c}_{i,q_i}^n\, \mathcal{N}(s_{c,i}^n; \hat{m}_{i,q_i}^n, \hat{\beta}_{i,q_i}^n), \tag{3.21}$$
where a posterior MoG density follows from Bayes' rule. By substituting $p(X, W)$ and equation 3.18 into equation 3.16, we obtain expressions for the bound, F, to our model,

$$F_{\mathrm{tot}} = F_{\mathrm{mix}} + \sum_{c=1}^{C} F_{\mathrm{ICA}_c}, \tag{3.22}$$

where

$$F_{\mathrm{mix}} = F[c, \kappa], \qquad F_{\mathrm{ICA}_c} = F[s_c, q_c, \lambda_c, A_c, \alpha_c, \varphi_c]. \tag{3.23}$$
The terms in equation 3.23 can be further factorized into contributions from each parameter. By monitoring subsets of F, we can infer the effect of local assumptions; for example, $F_{\mathrm{ICA}_c}$ can be used instead of ARD to infer the most likely number of sources in ICA component c. One may now proceed by specifying functional forms for each of the approximating posteriors and using these in equation 3.16, as shown by Lappalainen (1999). As shown by MacKay (1995a), however, there is no need to specify functional forms for (all) the approximating posteriors. The functional forms "fall out" of the maximization process, helped by the factorized form of $p'(W)$ and the conjugacy of the priors. The optimal form for each posterior is simply given by

$$p'(W_k) \propto p(W_k)\, \exp\!\left[\langle \log p(X, W) \rangle_{\prod_{l \neq k} p'(W_l)}\right], \tag{3.24}$$
where the index k refers to the kth parameter in W. If $p'(s_c \mid q_c, c) = p'(s_c)$, free-form optimization gives the ICA algorithms first presented by Choudrey et al. (2000). This factorization gives a gaussian posterior over s. Using equation 3.18, however, is less restrictive and, as discussed below, more robust for an ICA mixture model. The posterior expectations of the data likelihood, equation 2.3, under the mixture density, equation 3.21, are given by
mi X qi D 1 mi X qi D 1
p0 ( qni D qi | c)hsc,i n | qi , ci
(3.25)
2 p0 ( qni D qi | c)hsnc,i | qi , ci,
(3.26)
where n p0 (qni D qi | c) D cOi,q i
(3.27)
hsnc,i | qni , ci D mO ni,qi
(3.28)
2 hsnc,i | qni , ci D (mO ni,qi ) 2 C
1 . n O b i,q i
(3.29)
The advantages of an MoG posterior over the sources instead of a single gaussian (Choudrey et al., 2000; Miskin & MacKay, 2000) are twofold. First, it allows us to capture arbitrary densities, something not possible under a gaussian posterior. Second, an MoG posterior is more robust under uncertainty, especially in the context of an ICA mixture model. To understand this more clearly, consider the update equations for $p'(s_{c,i}^n \mid c)$ and $p'(s_{c,i}^n \mid q_i^n, c)$ presented below (see the appendix for notation),

$$\hat{m}_i^n \propto \sum_{q_i=1}^{m_i} \hat{c}_{i,q_i}^n \langle \beta_{i,q_i} \rangle \langle \mu_{i,q_i} \rangle + \hat{g}_c^n \langle \lambda_c \rangle \sum_{j=1}^{S} \langle a_{ji} \rangle \left( x_j^n - \langle \hat{x}_{j,k \neq i}^n \mid c \rangle - \langle y_j \rangle \right) \tag{3.30}$$
$$\hat{m}_{i,q_i}^n \propto \langle \beta_{i,q_i} \rangle \langle \mu_{i,q_i} \rangle + \hat{g}_c^n \langle \lambda_c \rangle \sum_{j=1}^{S} \langle a_{ji} \rangle \left( x_j^n - \langle \hat{x}_{j,k \neq i}^n \mid q_k^n, c \rangle - \langle y_j \rangle \right), \tag{3.31}$$
228
R. Choudrey and S. Roberts
where the explicit dependence on c has been dropped for brevity. Equation 3.30 is the update under a gaussian source posterior (the posterior gaussian mean), and equation 3.31 is the update under an MoG posterior (the posterior mean for component $q_i$). The important feature to note is that the updates consist of prior (i.e., current) information from the MoG source model plus new data information. This information is combined and fed back up to the MoG source models. The MoGs then use this information to update their parameters. In uncertain situations (i.e., high noise), $\langle \lambda_c \rangle$ tends toward 0, drastically downweighting the data term. In the gaussian posterior case, the MoG source model gets fed a weighted average, across MoG components, of the current component parameters; this leads the MoG components to update their parameters toward common values. If there are few data or little data support, the MoGs will evolve toward the centroids of their respective densities rather than staying static. Over a number of iterations, they will effectively become the same single gaussian. If the source posterior is itself an MoG, then each component of the posterior MoG is responsible for its equivalent in the source prior MoG. In equation 3.31, as $\langle \lambda_c \rangle \to 0$, separate information (essentially the current parameter values) is fed back to each component $q_i$. In this case, if there are few data or little data support, the source MoGs will remain static. The difference is more acute in an ICA mixture model context. As well as by the noise estimate, $\langle \lambda_c \rangle$, the data term is also weighted by the current ICA's responsibility for the current datum, $\hat{g}_c^n$. If ICA model c has little or no responsibility, its source parameters should remain static until it observes data deemed under its jurisdiction. Again, this is possible only under a mixture posterior distribution for the sources.

4 Implementing vbICA-MM
The measure F is maximized using the free-form approach of equation 3.24. All the derived posteriors require solving a set of coupled parameter update equations (see the appendix). In practice, this is best achieved by first initializing the posterior component responsibilities, $p'(c)$, using these to initialize each ICA component, and then beginning learning on each ICA component. These components are then used to calculate the new posterior responsibilities, and the learning process is repeated until convergence. These steps may be conveniently implemented in the algorithm shown in pseudocode form in Table 1. Once trained, the model can be used to reconstruct hidden source signals (to within a scaling and permutation) given a data set and the (now fixed) model parameter distributions by calculating $\langle c \rangle$, $\langle q_c \rangle$, and $\langle s_c \rangle$ under their respective posteriors.

Table 1: Pseudocode for vbICA-MM Updates.

    initialize;
    WHILE (ΔF(ica-mm) > tolerance)
        FOR every ICA component
            WHILE (ΔF(ica) > tolerance)
                update ICA observation model by cycling through
                    equations A.12–A.17 until convergence;
                update ICA source model by cycling through
                    equations A.18–A.24 until convergence;
                update ICA noise model using equations A.25–A.28;
                calculate F_new^(ica);
                calculate ΔF(ica) = |F_new^(ica) − F_old^(ica)|;
            END WHILE;
        END FOR;
        update ICA-MM indicator probabilities and parameters
            using equations A.29–A.30;
        calculate F_new^(ica-mm);
        calculate ΔF(ica-mm) = |F_new^(ica-mm) − F_old^(ica-mm)|;
    END WHILE;

4.1 Priors and Initialization. Broad priors have little or no effect on the results because they allow the data to "do the talking." Such priors allow parameters and hidden variables to remain within sensible boundaries while encoding very little knowledge of what values they might take. Allowing the priors to become too broad, however, can reduce their regularizing abilities, leading to implausible magnitudes and dangers of overfitting. Narrow priors, on the other hand, encode strong assumptions about possible parameter and variable values and have a discernible effect on the inference process. If the priors are too narrow, the model becomes inflexible, and its ability to learn is compromised. For the ICA mixture model presented in this article, we examined a wide variety of priors ($10^{-6} \le b \le 10^6$ and $10^{-6} \le c \le 10^6$ for all gamma distributions, scale parameter = 5–5000 for all Dirichlets, and precision = $10^{-6}$–$10^6$ for gaussians). For gamma distributions, the variance is given by $b^2 c$ and the mean by $bc$. Large values of c give distributions that are highly peaked around the mode, tending toward gaussianity. Values of $c < 1$ give no mode and are more like exponentials; consequently, they encourage small values for the random variate while curtailing large values beyond the mean. We found $1 \le b \le 1000$ with $bc \approx 1$ a useful range for gamma distributions. Gamma distributions govern scale parameters, in this case precisions. Values of $bc > 10$ have a narrowing effect on the posteriors, making them overconfident. Values of $bc < 10^{-1}$ with $c > 1$ have a smoothing effect on the posteriors, losing detail in the process. Similarly, we have found precisions of the order of $10^0$–$10^{-3}$ a good range. A Dirichlet scale parameter acts as a pseudocount, so high values are very constraining. Appropriate values depend on the number of data vectors: if N is the number of data vectors, then values between 0.01N and 0.1N stop components from dying while allowing the data to set the posterior values accurately.
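The nested update scheme of Table 1 can be sketched as a Python skeleton. The component interface (the update_* methods and free_energy) is hypothetical, standing in for the appendix update equations, and DummyICA exists only to make the skeleton runnable:

```python
# Skeleton of the nested update scheme in Table 1. The component interface is
# a placeholder; a real implementation would substitute the appendix updates.
def fit_vbica_mm(data, components, update_responsibilities, tol=1e-4, max_iter=200):
    F_outer_old = float("-inf")
    for _ in range(max_iter):                        # outer WHILE loop
        for ica in components:                       # FOR every ICA component
            F_inner_old = float("-inf")
            for _ in range(max_iter):                # inner WHILE loop
                ica.update_observation_model(data)   # equations A.12-A.17
                ica.update_source_model(data)        # equations A.18-A.24
                ica.update_noise_model(data)         # equations A.25-A.28
                F_inner = ica.free_energy(data)
                if abs(F_inner - F_inner_old) < tol:
                    break
                F_inner_old = F_inner
        update_responsibilities(data, components)    # equations A.29-A.30
        F_outer = sum(ica.free_energy(data) for ica in components)
        if abs(F_outer - F_outer_old) < tol:
            break
        F_outer_old = F_outer
    return F_outer


class DummyICA:
    """Toy stand-in component whose free energy converges geometrically to 0."""
    def __init__(self):
        self._F = -10.0
    def update_observation_model(self, data):
        self._F *= 0.5
    def update_source_model(self, data):
        pass
    def update_noise_model(self, data):
        pass
    def free_energy(self, data):
        return self._F


F = fit_vbica_mm(None, [DummyICA(), DummyICA()], lambda d, cs: None)
print(F < 0 and abs(F) < 1e-3)  # True
```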
If the algorithm were recast as an on-line scheme, these values would have to be reexamined, because the results would necessarily become more sensitive to the choice of priors. The choice of initialization is also important. As with any other ICA formalism, the vbICA-MM algorithm finds a local maximum in the parameter space, and results depend on initialization. There are many ways to initialize; we recommend that readers explore the different options available to work out which is most suitable for them. We have investigated two initialization schemes: random and singular value decomposition (SVD). We initialize the model by first running C-component k-means on the data. The class responsibilities generated are used to partition the data into C segments, which are used to initialize C ICA models either randomly or by using SVD on each segment. If SVD is used, then the first $L_c$ columns of the SVD data matrix are used to initialize $L_c$ MoGs using k-means, while residual singular values are used to estimate the noise. The first $L_c$ columns of the SVD projection matrix are used to initialize the mixing matrix. We have found random initialization the most useful overall, particularly for modeling clusters that are best described by nonorthogonal axes. Initializing using SVD (tantamount to initializing at the PCA solution)
yields a quicker result, but vbICA-MM can have problems with resolving directions in clusters exhibiting high correlation, often staying at or close to the PCA result. This is discussed further in section 6.1. One must also choose the number of components in each source MoG. Although this can be determined by monitoring the contribution to F by each source ($F[\Theta_{c,i}]$) as a function of the number of components, this is time-consuming. We have found that three components are enough for most data sets, although five should be used for image data.

5 Results
We first demonstrate the versatility of the vbICA-MM algorithm on two- and three-dimensional synthetic data, each grouped into three classes. The algorithm's ability to model intricate, highly nongaussian data is highlighted, as is its structure determination using ARD and the mean-field bound, F, to the log evidence. We then use vbICA-MM on real functional magnetic resonance imaging (fMRI) data to decompose the images into interpretable, independent features and compare the results with previous methods. We set vague priors ($b = 1000$ and $c = 10^{-3}$ for all gamma distributions, scale parameter = 5 for all Dirichlets, and precision = $10^{-3}$ for gaussians) to encode poor prior knowledge for both the synthetic and real data sets.

5.1 Test Data. We tested the vbICA-MM algorithm on two-dimensional and three-dimensional synthetic test data drawn from three classes and with 10% added gaussian noise. We set three components per source MoG and commenced learning using the algorithm presented in Table 1. Training continued until F changed by less than 0.01% or the number of iterations reached 200. Typically, the number of inner iterations for each ICA component starts high (approximately 100–200 for the nonorthogonal cluster presented below and fewer than 100 for the orthogonal clusters) and falls to only a handful. In the 2D case, the underlying source manifolds were two-dimensional. Figure 5a shows the three clusters: one gaussian cluster, one nongaussian cluster described by orthogonal directions, and one highly correlated cluster made up of six subclusters. For this data set, vbICA-MM was initialized randomly. Training took nine iterations of the vbICA-MM algorithm. Figure 5b shows the partition by vbICA-MM trained on 3000 points (1000 from each cluster) compared with the original data; only 12 data points (0.4%) were assigned incorrectly. The first deviations of the underlying MoG components are shown as ellipses.
Note how the gaussian source is deemed one-dimensional, while the two-dimensional structure of the other classes has been clearly captured. A gaussian source is unidentifiable from gaussian noise in ICA. Its principal direction is captured by the single source, while its spread is absorbed by the (gaussian) noise variance. The vbICA-MM data density model in Figure 5e shows the multimodal structure captured within
R. Choudrey and S. Roberts
the clusters. In comparison, an MoG (see Figure 5c) and the generalized exponential decorrelating ICA mixture model (gedecICA-MM) presented by Penny and Roberts (2001) (see Figure 5d) show no structure within the classes. Consequently, vbICA-MM gives a more efficient representation of the true data density, with little probability mass wasted in regions of low density within each cluster. The data density of the 3D test data is shown in Figure 6a. The three clusters are intrinsically one-, two-, and three-dimensional. We initialized vbICA-MM using SVD and trained on 1500 points, 500 drawn from each cluster. Training took five iterations of the vbICA-MM algorithm. Figure 6b depicts the model captured by vbICA-MM, showing accurate representation of the clusters and cluster structures. The Hinton diagrams (Hinton, 1989) of the mixing matrices in Figures 6c through 6e show how vbICA-MM has used ARD to infer the latent dimensionalities correctly. Unsupported columns in the matrices have been suppressed by small ARD coefficients (inverse α_i's shown in Figure 7a), effectively switching off unnecessary sources in the source reconstructions below. Another way of inferring the latent dimensionality of each ICA component is to monitor the free energy of the model. As implied by equation 3.23, this is equivalent to monitoring the contribution each ICA component makes to the overall free energy of the model. Figure 7b is a plot of the mean-field bound (i.e., the negative of the free energy) for each ICA component, F_{ICA_c}. Each curve has had its maximum subtracted so that the trends in Figure 7b are more apparent. These curves confirm the latent dimensionality inferred by ARD. Plotting the overall mean-field bound, F, across components in Figure 7c correctly infers three clusters. There is a further peak at five components (and, indeed, progressively shallower ones at seven and nine) as ICAs latch on to different subclusters in the 3D cube.
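The ARD pruning described above can be mimicked in a few lines: mixing-matrix columns whose relevance (the inverse ARD coefficient 1/α_i) is negligible relative to the largest one are switched off. The function name and the relative threshold are illustrative choices, not part of the paper's algorithm.

```python
import numpy as np

def prune_sources(A, alpha, rel_thresh=1e-3):
    # Relevance of source i is the inverse ARD precision 1/alpha_i;
    # drop columns whose relevance is tiny relative to the largest.
    relevance = 1.0 / np.asarray(alpha, dtype=float)
    keep = relevance >= rel_thresh * relevance.max()
    return A[:, keep], keep
```

In the variational scheme itself, the suppressed columns are driven toward zero by the posterior rather than deleted outright; hard pruning of this kind is just a convenient post hoc summary.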
It must be noted, however, that F is a bound to the log evidence, so these further peaks disappear when exponentiated to yield the evidence bound. In comparison, a Bayesian MoG infers 21 clusters, while the BIC in gedecICA-MM predicts 9. Figure 8 shows how well the supported sources' multimodal densities have been modeled by the MoGs. The ability of vbICA-MM to capture undulating manifolds is fundamental in correctly inferring the true number of clusters. As a comparison, Figure 7c also plots the Bayesian information criterion (BIC) approximation to the log evidence found using MAP estimates (BIC is the negative of minimum description length modeling; see Rissanen, 1994, for an introduction). Interestingly, this also picks out the correct model order, with a higher score for the log evidence. Attias (1999a) showed that the BIC is equivalent to variational Bayes in the infinite data limit. The BIC achieves a higher score, presumably because it is a much simpler model, with no model averaging, although we must admit surprise at the difference. We must point out, however, that Penny, Roberts, and Everson (2001) have suggested that penalized likelihood is more applicable to ICA than other models due to the likely violation of the independence assumption in overcomplex models,
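For reference, the BIC score compared above approximates the log evidence by penalizing the MAP log likelihood with half the parameter count times log N; a minimal sketch (higher is better under this sign convention):

```python
import math

def bic(log_likelihood, n_params, n_samples):
    # BIC as a log-evidence approximation:
    # log p(X) ~ log L_MAP - (k / 2) * ln(N).
    return log_likelihood - 0.5 * n_params * math.log(n_samples)
```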
Variational Mixture of Bayesian Independent Component Analyzers
Figure 5: Two-dimensional test data results. (a) Data density. (b) vbICA-MM partition. (c) Gaussian mixture model. (d) gedec-ICA mixture model. (e) vbICA mixture model. Contours plot 0.001 ≤ p(x) ≤ 1 in 250 intervals.
Figure 6: Three-dimensional test data results. (a) Data density. (b) vbICA-MM isosurface of p(x) = 0.01. (c) 1D cluster. (d) 2D cluster. (e) 3D cluster.
leading to a penalizing effect on the data likelihood itself. This suggests that a full variational integration is not necessary in the high data–low noise limit and that penalized likelihood MAP learning is adequate. With more noisy data, however, Penny et al. (2001) showed that model order selection using penalized likelihood quickly breaks down, so in uncertain situations, a full Bayesian approach would be more robust.
Figure 7: Structure determination using ARD and negative free energy. (a) ARD coefficients, by source number and ICA component. (b) F_ICA for each ICA component (1d, 2d, 3d) against number of sources. (c) Negative free energy F against number of components; solid: vbICA-MM negative free energy, dotted: BIC using MAP estimates.
Figure 8: (Top) Original source distributions. (Bottom) Learned distributions. (a) ICA 1. (b) ICA 2. (c) ICA 3.
5.2 Real Data. We used the vbICA, variational Bayesian mixture of factor analyzers (vbMFA), gedecICA-MM, and vbICA-MM algorithms on fMRI images of a slice through a tumor patient's brain. Data were collected using both T2 and proton density (PD) spin sequences, which are used directly to form a two-dimensional feature space. Two 100 × 100 pixel images were
vectorized, and 2000 samples were randomly drawn from the subsequent 2D data. Models with a range of latent dimensions were trained on the 2000 data vectors, and the most likely models were then used to unmix the complete data set of 10,000 points into the corresponding independent features. Figure 9 shows the two original images used, along with the features extracted by vbICA and the gedecICA mixture model. Due to the inherent limitations of ICA, no more than two features could be extracted from the fMRI images. The vbICA algorithm favored a two-source ICA model, shown in Figure 9b. The left-hand feature is simply a copy of one of the original images. The right-hand image has separated out a local feature: cerebrospinal fluid. The vbMFA algorithm infers an eight-component model with F = −14813.75, giving the features in Figure 9c. The BIC of the gedecICA-MM chose a four-component model with two sources per ICA component, giving the eight overall features in Figure 9d. Although the gedecICA-MM has managed to separate out the cerebrospinal fluid, most features are simply scaled copies of each other, and therefore the gedecICA-MM overrepresents the fMRI images. We trained one-component to five-component models using vbICA-MM, with each ICA component having the maximum two allowed sources. The negative free energy (F) plot in Figure 10a shows that a two-component model is preferred,3 while the Hinton plots in Figures 10b and 10c imply one-source and two-source ICA components. Compared to vbMFA, vbICA-MM had F = −3618.8, a substantial improvement. The three features extracted by the most likely model are presented in Figure 11, along with their learned distributions. The single source of the first ICA component, shown in Figure 11a, is global "background" brain tissue detail. The second ICA component represents more local features, where the central part of Figure 11b is, once again, the cerebrospinal fluid.
The two-tone background demarcates white and gray brain matter and has not been separated out, unlike some of the features shown in Figure 9. The reasons for this are not yet clear. More interestingly, however, the second source has extracted two dark blobs. These are the tumors, which neither vbICA, nor vbMFA, nor gedecICA-MM picked out. The features' respective MoGs can be interpreted as the distribution of gray-scale values in each feature. The tumors' distribution is heavily peaked around darker values, with the left-hand tail capturing the lighter gray-level information. The other distributions similarly represent lighter to darker gray values from left to right. The ability of vbICA-MM to capture multimodal feature distributions is pivotal in allowing the complex background distribution to be separated from the rest. This representation is thus much more interpretable and efficient than either simple ICA or the gedecICA-MM.
3 The BIC preferred a one-component model.
Figure 9: fMRI images and ICA, vbMFA, and gedecICA-MM extracted features. (a) Original data. (b) ICA features. (c) vbMFA features. (d) gedecICA-MM features.
Figure 10: Inferred latent structure for fMRI images. (a) vbICA-MM negative free energy F against number of components. (b) 1D cluster. (c) 2D cluster.
Figure 11: vbICA-MM features extracted from fMRI images and respective ICA component source distributions. (a) Background. (b) Cerebrospinal fluid. (c) Tumors. (d) ICA 1. (e) ICA 2.
Figure 12 illustrates how the original data density has been modeled and partitioned by both gedecICA-MM and vbICA-MM. The lighter cluster in Figure 12d is described by the 1D ICA component, while the darker area is the 2D ICA cluster. vbICA-MM has managed to partition the data density into two clusters that are not linearly separable.

6 Discussion
In this article, we have presented an algorithm for modeling multidimensional, nongaussian data distributions. The vbICA-MM algorithm splits the distribution into self-similar areas and uses ICA components to learn representations of these areas. The ICAs model these local cluster manifolds by forming independent directions in the underlying distribution. Automatic relevance determination selects the appropriate dimensionality of each manifold, and a Bayesian learning scheme allows the optimum number of ICA components to be inferred. We have demonstrated the algorithm on two-dimensional and three-dimensional data by faithfully modeling intricate, discontinuous clusters while simultaneously inferring their intrinsic dimensions, something not possible under previous ICA mixture models. We
Figure 12: vbICA-MM model for fMRI images.
have shown how it is possible to compare assumptions underlying various models, in particular, using the negative free energy, F, to ascertain the correct number of clusters. We have used vbICA-MM to decompose real fMRI images into interpretable, self-similar features, blindly and automatically.

6.1 Problems to Overcome. Although we have comprehensively demonstrated the versatility and robustness of the vbICA-MM algorithm, there are practical issues to address, in particular, initialization and speed. As discussed in section 4.1, the choice of initialization can have an effect on the final solution. SVD initialization always starts at the same place for
a given data set and always ends at the same solution. In practice, we have found that different random initializations generally lead to the same solution as well. However, if parts of the data density are highly correlated, the ICA component most responsible for that area will stay close to the PCA solution if initialized using SVD. Different random initializations may lead to different solutions or may converge onto the PCA/FA solution. ICA can be seen as a generalization of principal component analysis and factor analysis. As shown by Attias (1999b), noiseless ICA with gaussian sources is equivalent to PCA, while FA is the same with nonisotropic noise. This means the PCA/FA solution is a subset of all possible ICA solutions for a given data set. As a particular cluster becomes more and more correlated (or, equivalently, its independent directions become less orthogonal), the second moment starts to dominate, so the cluster becomes easier to describe by a gaussian density. This means the PCA/FA maxima in the ICA solution space become more prominent, so more and more initialization points will lead to them. The nonorthogonal cluster in Figure 5 was chosen because, empirically, it seems to be the limit of correlation and nonorthogonality that vbICA-MM can resolve into independent directions. This particular cluster has a correlation of −0.67, and we needed on average five random initializations before vbICA-MM latched on, as indicated by the significantly higher value of F after one iteration for this initialization. More investigation is needed into initialization strategies to overcome this. One possible solution is to decorrelate, or whiten, each cluster identified by the k-means step and then commence learning on these whitened data. We have had encouraging results using this admittedly rather ad hoc method.
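The ad hoc preprocessing suggested above can be sketched as follows: each cluster found by the k-means step is separately transformed to zero mean and identity covariance (PCA whitening) before learning resumes. The eps guard against near-zero eigenvalues is our addition.

```python
import numpy as np

def whiten(X, eps=1e-9):
    # Rotate into the eigenbasis of the sample covariance and rescale
    # each direction by 1/sqrt(eigenvalue), giving identity covariance.
    Xc = X - X.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = evecs / np.sqrt(evals + eps)  # column j scaled by eigenvalue j
    return Xc @ W
```

Applied per cluster, this removes exactly the second-moment structure that pulls the ICA component toward the PCA/FA solution, leaving the higher-order structure for the source MoGs to capture.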
Also, the factorial approximation in equation 3.19 is no longer a good approximation for highly correlated data, so dropping this will also solve much of the problem, albeit with a corresponding reduction in speed. Speed is the other problem. Bayesian learning is inherently slower than non-Bayesian methods due to the extra parameters that have to be estimated. In fact, Bayesian inference can be shown to be NP-hard (Cooper, 1990). The variational approximation brings tractability to the Bayesian formulation and is speedier than the nondeterministic sampling regime, but it still remains understandably slower than its non-Bayesian counterparts. Interesting work by Valpola and Pajunen (2000) on speeding up Bayesian ICA may provide a basis for a faster vbICA-MM algorithm. Also, model-order selection for the number of ICA components can become prohibitive if a large number are required. Some progress may be made in this area if, for example, some sort of birth-death (Roberts, Holmes, & Denison, 2001) and split-merge (Ueda, Nakano, Ghahramani, & Hinton, 1999) criteria could be enforced, allowing active generating, culling, splitting, and merging of components during the learning process. This is an area of ongoing research.

6.2 Further Applications. Interesting data are, almost by definition, not gaussian distributed. For data distributions that consist of self-similar nongaussian clusters, we believe this model will provide a better representation than current methods, such as gaussian mixture modeling, mixtures of factor analyzers, and plain-vanilla ICA. The kinds of data that can benefit from such an analysis range from the prosaic to the provocative. The most obvious area is nonstationary blind source separation, as alluded to in section 1.2. Standard ICA is now the de facto method for blindly separating mixtures of signals. It has been shown to be very robust, effective, and, in a probabilistic formulation, highly interpretable. However, ICA is essentially limited to stationary mixings of independent sources, which number fewer than the sensors picking up the observations. ICA mixture models are an important step in overcoming these restrictions. Sudden changes in the mixing amounts will be encoded by vbICA-MM as a switching between ICA components. In the cocktail party problem, for example, this would include different voices cutting in and out over a background cacophony. Provided the number of voices (plus background) active at any one time does not exceed the total number of sensors, vbICA-MM can capture more sources than sensors. This was demonstrated in the fMRI example, where three features were extracted from 2D data. Furthermore, the independence assumption at the core of ICA holds within components but is not necessary between components. Sources that are not independent will be represented as sources within different ICA components by vbICA-MM. The application of vbICA-MM to enhanced blind source separation does not have to stop at auditory signals. The multimodal nature of the source distributions lends itself readily to image separation and other BSS problems. Clustering in general is another area that naturally comes under vbICA-MM's umbrella. Mixture modeling is the definitive formalism for finding clusters in data distributions.
Interesting clusters are very often not gaussian, so mixture models based on gaussianity will generally not model such clusters well. The synthetic clusters in section 5 were nongaussian and also contained structure within themselves. This confused both a gaussian mixture model and the mixture of factor analyzers model. The former did not model the data accurately, while the latter failed to infer the correct number of clusters. Lee et al. (2000) have shown that their ICA mixture model performed better at clustering the Iris data of Fisher (1936) than those based on gaussian methods. The grouping of documents into semantic clusters is currently a very active area of research (see Landaur, Foltz, & Laham, 1998, for an introduction). It is highly unlikely that such man-made corpora are intrinsically gaussian in nature. The reader may have noticed that in the use of ICA as a representative tool, we have verged on the evangelical. This, we strongly believe, is where ICA and its extensions will prove to be indispensable. Lee, Lewicki et al. (1999) have already shown that ICA mixture models based on Laplacian source densities are over 20% more efficient than PCA at coding natural scenes and images of newspaper text, and more efficient than standard
ICA. As ICA is only beginning to leave its BSS beginnings behind, this area is as yet largely unexplored.

6.3 Future Directions. There are a number of modifications that could be made to vbICA-MM to take into account further information that one may have on the data. For example, the source models could be restricted to being the same dimensionality for all ICA components if it is expected that the data lie on some dimensionally consistent manifold. Indeed, the source models could be shared among a number, if not all, of the ICA components if different regions of the data space are thought to be different projections of some handful of underlying manifolds. Temporal information may be introduced if the mixture process prior is conditioned across samples as in, for example, a Markov process, leading to a Bayesian formulation of hidden Markov ICA (Penny, Roberts, & Everson, 2000). vbICA-MM leads to a whole family of ICA mixture models depending on which assumptions are coded into the model. Because of the Bayesian formulation, the effectiveness of these models can be quantitatively compared. Again, this is an area of active research.

Appendix: Update Equations
The parameter posteriors are given by

$$p'(\mathbf{s}_c \mid \mathbf{q}_c, c) = \prod_{n=1}^{N} \prod_{i=1}^{L_c} \mathcal{N}\bigl(s^n_{c,i};\, \hat{m}^n_{i,q_i},\, \hat{\beta}^n_{i,q_i}\bigr) \tag{A.1}$$

$$p'(\mathbf{A}) = \prod_{c=1}^{C} \prod_{i=1}^{L_c} \prod_{j=1}^{S} \mathcal{N}\bigl(a_{ji};\, \hat{m}_{a_{ji}},\, \hat{\alpha}_{ji}\bigr) \tag{A.2}$$

$$p'(\boldsymbol{\alpha}) = \prod_{c=1}^{C} \prod_{i=1}^{L_c} \mathcal{G}\bigl(\alpha_i;\, \hat{b}_{\alpha_i},\, \hat{c}_{\alpha_i}\bigr) \tag{A.3}$$

$$p'(\mathbf{q}_c \mid c) = \prod_{n=1}^{N} \prod_{i=1}^{L_c} \hat{c}^n_{i,q_i} \tag{A.4}$$

$$p'(\boldsymbol{\pi}) = \prod_{c=1}^{C} \prod_{i=1}^{L_c} \frac{\Gamma\bigl(\sum_{q_i'} \hat{\lambda}_{i,q_i'}\bigr)}{\prod_{q_i=1}^{m_i} \Gamma(\hat{\lambda}_{i,q_i})} \prod_{q_i=1}^{m_i} \pi_{i,q_i}^{\hat{\lambda}_{i,q_i}-1} \tag{A.5}$$

$$p'(\boldsymbol{\mu}) = \prod_{c=1}^{C} \prod_{i=1}^{L_c} \prod_{q_i=1}^{m_i} \mathcal{N}\bigl(\mu_{i,q_i};\, \hat{m}_{i,q_i},\, \hat{\tau}_{i,q_i}\bigr) \tag{A.6}$$

$$p'(\boldsymbol{\beta}) = \prod_{c=1}^{C} \prod_{i=1}^{L_c} \prod_{q_i=1}^{m_i} \mathcal{G}\bigl(\beta_{i,q_i};\, \hat{b}_{i,q_i},\, \hat{c}_{i,q_i}\bigr) \tag{A.7}$$

$$p'(\mathbf{y}) = \prod_{c=1}^{C} \prod_{j=1}^{S} \mathcal{N}\bigl(y_j;\, \hat{m}_{y_j},\, \hat{\tau}_{y_j}\bigr) \tag{A.8}$$

$$p'(\boldsymbol{\Lambda}) = \prod_{c=1}^{C} \mathcal{G}\bigl(\Lambda_c;\, \hat{b}_{\Lambda_c},\, \hat{c}_{\Lambda_c}\bigr) \tag{A.9}$$

$$p'(c) = \prod_{n=1}^{N} \prod_{c=1}^{C} \hat{\gamma}^n_c \tag{A.10}$$

$$p'(\boldsymbol{\kappa}) = \frac{\Gamma\bigl(\sum_{c'} \hat{\kappa}_{c'}\bigr)}{\prod_{c=1}^{C} \Gamma(\hat{\kappa}_c)} \prod_{c=1}^{C} \kappa_c^{\hat{\kappa}_c - 1}. \tag{A.11}$$

Updated distribution parameters are hatted versions of the original parameters, $\langle a \rangle$ are expectations taken with regard to $p'(a)$, and $\langle a \mid b \rangle$ are expectations with regard to $p'(a \mid b)$.

A.1 Observation Model.
A.1.1 $p'(\mathbf{s}_c \mid \mathbf{q}_c, c)$.

$$\hat{m}^n_{i,q_i} = \frac{1}{\hat{\beta}^n_{i,q_i}} \Biggl[ \langle \beta_{i,q_i} \rangle \langle \mu_{i,q_i} \rangle + \hat{\gamma}^n_c \langle \Lambda_c \rangle \sum_{j=1}^{S} \langle a_{ji} \rangle \bigl( x^n_j - \langle \hat{x}^n_{j,k \neq i} \mid q^n_k, c \rangle - \langle y_j \rangle \bigr) \Biggr] \tag{A.12}$$

$$\hat{\beta}^n_{i,q_i} = \langle \beta_{i,q_i} \rangle + \hat{\gamma}^n_c \langle \Lambda_c \rangle \sum_{j=1}^{S} \langle a^2_{ji} \rangle, \tag{A.13}$$

where

$$\hat{x}^n_{j,k \neq i} = \sum_{k \neq i}^{L_c} a_{jk} s^n_{c,k}, \qquad \langle \hat{x}^n_{j,k \neq i} \mid c \rangle = \sum_{k \neq i}^{L_c} \langle a_{jk} \rangle \langle s^n_{c,k} \mid c \rangle, \qquad \langle \hat{x}^n_{j,k \neq i} \mid q^n_k, c \rangle = \sum_{k \neq i}^{L_c} \langle a_{jk} \rangle \langle s^n_{c,k} \mid q^n_k, c \rangle,$$

and define

$$\hat{x}^n_j = \sum_{i=1}^{L_c} a_{ji} s^n_{c,i}, \qquad \langle \hat{x}^n_j \mid c \rangle = \sum_{i=1}^{L_c} \langle a_{ji} \rangle \langle s^n_{c,i} \mid c \rangle, \qquad \langle \hat{x}^n_j \mid q^n_i, c \rangle = \sum_{i=1}^{L_c} \langle a_{ji} \rangle \langle s^n_{c,i} \mid q^n_i, c \rangle.$$
The expectations for the sources $s$ are given by the expressions in section 3.3. In practice, equation A.12 has to be iterated for every $i$ a number of times until $\hat{m}^n_{i,q_i}$ converges, as it depends on every other $k \neq i$.

Note the intuitive form of equations A.12 and A.13. Source $i$ "sees" data $x^n_j$ at sensor $j$ and works out what it can "explain away" given information in the rest of the model, that is, $\hat{x}^n_{j,k \neq i}$. The residual information is then used to update its own parameters in a bid to explain what is "left over." This is analogous to the message-passing algorithms of Pearl (1988), whereby a node ($s_{c,i}$) updates its belief (encapsulated in $\hat{m}^n_{i,q_i}$ and $\hat{\beta}^n_{i,q_i}$) using messages passed to it by its parents (the MoG expectations), its children (the sensor data), and all other parents of its children (i.e., all other sources, quantified by $\hat{x}_{j,k \neq i}$). This use of information from a variable's Markov blanket runs through all the update equations.

A.1.2 $p'(\mathbf{A})$.

$$\hat{m}_{a_{ji}} = \frac{\langle \Lambda_c \rangle}{\hat{\alpha}_{ji}} \sum_{n=1}^{N} \hat{\gamma}^n_c \langle s^n_{c,i} \mid c \rangle \bigl( x^n_j - \langle \hat{x}^n_{j,k \neq i} \mid c \rangle - \langle y_j \rangle \bigr) \tag{A.14}$$

$$\hat{\alpha}_{ji} = \langle \alpha_i \rangle + \langle \Lambda_c \rangle \sum_{n=1}^{N} \hat{\gamma}^n_c \langle s^{n\,2}_{c,i} \mid c \rangle \tag{A.15}$$
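Because each mean update depends on the current values for every other source $k \neq i$, updates such as A.12 and A.14 are run as a small inner fixed-point loop. A generic sketch with illustrative names (not the paper's code):

```python
def fixed_point(update, x0, tol=1e-8, max_iter=100):
    # Iterate mutually dependent updates until successive values
    # change by less than tol in every coordinate.
    x = list(x0)
    for _ in range(max_iter):
        x_new = update(x)
        if max(abs(a - b) for a, b in zip(x_new, x)) < tol:
            return x_new
        x = x_new
    return x
```

When the coupled update is a contraction, this typically converges in only a handful of inner iterations.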
In practice, equation A.14 has to be iterated for every $i$ a number of times until $\hat{m}_{a_{ji}}$ converges.

A.1.3 $p'(\boldsymbol{\alpha}_c)$.

$$\hat{b}_{\alpha_i} = \left( \frac{1}{b_{\alpha_i}} + \frac{1}{2} \sum_{j=1}^{S} \langle a^2_{ji} \rangle \right)^{-1} \tag{A.16}$$

$$\hat{c}_{\alpha_i} = c_{\alpha_i} + \frac{S}{2} \tag{A.17}$$
A.2 Source Model.

A.2.1 $p'(\mathbf{q}_c \mid c)$.

$$c^n_{i,q_i} = \tilde{\pi}_{i,q_i} \left( \frac{\tilde{\beta}_{i,q_i}}{\hat{\beta}^n_{i,q_i}} \right)^{1/2} \exp\left[ \frac{1}{2} \bigl( \hat{\beta}^n_{i,q_i} \hat{m}^{n\,2}_{i,q_i} - \langle \beta_{i,q_i} \rangle \langle \mu^2_{i,q_i} \rangle \bigr) \right] \tag{A.18}$$

$$\hat{c}^n_{i,q_i} = \frac{c^n_{i,q_i}}{\sum_{q_i'} c^n_{i,q_i'}}, \tag{A.19}$$

where

$$\tilde{\pi}_{i,q_i} = \exp\left[ \Psi(\hat{\lambda}_{i,q_i}) - \Psi\Bigl( \sum_{q_i'} \hat{\lambda}_{i,q_i'} \Bigr) \right], \qquad \tilde{\beta}_{i,q_i} = \hat{b}_{i,q_i} \exp[\Psi(\hat{c}_{i,q_i})],$$

and where equation A.19 ensures that $\sum_{q_i} \hat{c}^n_{i,q_i} = 1$. $\Psi(\cdot)$ is the digamma function.
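The quantity $\tilde{\pi}_{i,q_i}$ above is the exponentiated expected log of a Dirichlet-distributed proportion. A sketch, with a finite-difference digamma standing in for a library implementation:

```python
import math

def digamma(x, h=1e-6):
    # Central-difference approximation to the derivative of ln Gamma.
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def expected_log_pi(lam):
    # pi~_k = exp[ psi(lam_k) - psi(sum_k' lam_k') ] for a
    # Dirichlet(lam) posterior over mixing proportions.
    total = sum(lam)
    return [math.exp(digamma(l) - digamma(total)) for l in lam]
```

By Jensen's inequality, these values are slightly smaller than the posterior mean proportions $\hat{\lambda}_k / \sum_{k'} \hat{\lambda}_{k'}$, so they need not sum to one.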
A.2.2 $p'(\boldsymbol{\pi})$.

$$\hat{\lambda}_{i,q_i} = \lambda_{i,q_i} + \sum_{n=1}^{N} \hat{c}^n_{i,q_i} \tag{A.20}$$
A.2.3 $p'(\boldsymbol{\mu})$.

$$\hat{m}_{i,q_i} = \frac{1}{\hat{\tau}_{i,q_i}} \left( \tau^0_i m^0_i + \langle \beta_{i,q_i} \rangle \sum_{n=1}^{N} \hat{c}^n_{i,q_i} \langle s^n_{c,i} \mid q^n_i, c \rangle \right) \tag{A.21}$$

$$\hat{\tau}_{i,q_i} = \tau^0_i + \langle \beta_{i,q_i} \rangle \sum_{n=1}^{N} \hat{c}^n_{i,q_i} \tag{A.22}$$

A.2.4 $p'(\boldsymbol{\beta})$.

$$\hat{b}_{i,q_i} = \left( \frac{1}{b^0_i} + \frac{1}{2} \tilde{\sigma}_{i,q_i} \right)^{-1} \tag{A.23}$$

$$\hat{c}_{i,q_i} = c^0_i + \frac{1}{2} \sum_{n=1}^{N} \hat{c}^n_{i,q_i}, \tag{A.24}$$

where we define the average variance of component $q_i$ in source $i$,

$$\tilde{\sigma}_{i,q_i} = \sum_{n=1}^{N} \hat{c}^n_{i,q_i} \bigl( \langle s^{n\,2}_{c,i} \mid q^n_i, c \rangle - 2 \langle \mu_{i,q_i} \rangle \langle s^n_{c,i} \mid q^n_i, c \rangle + \langle \mu^2_{i,q_i} \rangle \bigr).$$
A.3 Noise Model.

A.3.1 $p'(\mathbf{y})$.

$$\hat{m}_{y_j} = \frac{\langle \Lambda_c \rangle}{\hat{\tau}_{y_j}} \sum_{n=1}^{N} \hat{\gamma}^n_c \bigl( x^n_j - \langle \hat{x}^n_j \mid c \rangle \bigr) \tag{A.25}$$

$$\hat{\tau}_{y_j} = \tau_{y_j} + \langle \Lambda_c \rangle \sum_{n=1}^{N} \hat{\gamma}^n_c \tag{A.26}$$

A.3.2 $p'(\boldsymbol{\Lambda})$.

$$\hat{b}_{\Lambda_c} = \left[ \frac{1}{b_{\Lambda_c}} + \frac{1}{2} \sum_{j=1}^{S} \sum_{n=1}^{N} \hat{\gamma}^n_c \bigl\langle (x^n_j - \hat{x}^n_j - y_j)^2 \bigr\rangle \right]^{-1} \tag{A.27}$$

$$\hat{c}_{\Lambda_c} = c_{\Lambda_c} + \frac{S}{2} \sum_{n=1}^{N} \hat{\gamma}^n_c \tag{A.28}$$
A.4 ICA Mixture Update.

A.4.1 $p'(c)$.

$$\gamma^n_c = \tilde{\kappa}_c \tilde{\Lambda}_c^{S/2} \prod_{j=1}^{S} \exp\left[ -\frac{\tilde{\Lambda}_c}{2} \bigl\langle (x^n_j - \hat{x}^n_j - y_j)^2 \bigr\rangle \right] \tag{A.29}$$

$$\hat{\gamma}^n_c = \frac{\gamma^n_c}{\sum_{c'=1}^{C} \gamma^n_{c'}}, \tag{A.30}$$

where

$$\tilde{\kappa}_c = \exp\left[ \Psi(\hat{\kappa}_c) - \Psi\Bigl( \sum_{c'} \hat{\kappa}_{c'} \Bigr) \right], \qquad \tilde{\Lambda}_c = \hat{b}_{\Lambda_c} \exp[\Psi(\hat{c}_{\Lambda_c})],$$

and where equation A.30 ensures that $\sum_c \hat{\gamma}^n_c = 1$.

A.4.2 $p'(\boldsymbol{\kappa})$.

$$\hat{\kappa}_c = \kappa_c + \sum_{n=1}^{N} \hat{\gamma}^n_c \tag{A.31}$$
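In practice, the normalization in equation A.30 is best computed from log values, since the exponents in A.29 can be large and negative; the log-sum-exp trick below is standard numerical practice rather than part of the paper.

```python
import math

def normalize_responsibilities(log_gamma):
    # gamma_hat_c = gamma_c / sum_c' gamma_c', computed stably by
    # subtracting the maximum log value before exponentiating.
    m = max(log_gamma)
    w = [math.exp(g - m) for g in log_gamma]
    s = sum(w)
    return [v / s for v in w]
```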
The relevant moments are given by

$$\langle a \rangle = \mathrm{mean}(a) \tag{A.32}$$

$$\langle a^2 \rangle = \mathrm{mean}(a)^2 + \mathrm{variance}(a).$$

Note that for a mixture with only one ICA component, $\hat{\gamma}^n_c = 1$ for all $n$, and equations A.12 through A.24 reduce to the update equations for a single ICA model.

Acknowledgments
We thank Iead Rezek, Peter Sykacek, and Evangelos Roussos for perceptive comments and discussions. Thanks to Mike Gibbs for solving computer-related problems. Thanks also to the anonymous referees for very helpful and constructive criticism.

References

Attias, H. (1999a). Learning parameters and structure of latent variable models by variational Bayes. In Electronic Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-1999). Available on-line at: http://www2.sis.pitt.edu/~dsl/UAI/uai.html.
Attias, H. (1999b). Independent factor analysis. Neural Computation, 11, 803–851.
Back, A., & Weigend, A. (1998). A first application of independent component analysis to extracting structure from stock returns. International Journal of Neural Systems, 8(4), 473–484.
Bartlett, M., Lades, H., & Sejnowski, T. (1998). Independent components representations for face recognition. In Proceedings of the SPIE Symposium on Electronic Imaging: Science and Technology: Conference on Human Vision and Electronic Imaging III (pp. 528–539). San Jose, CA: SPIE.
Bell, A., & Sejnowski, T. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A., & Sejnowski, T. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Cardoso, J.-F. (1997). Infomax and maximum likelihood for blind source separation. IEEE Letters on Signal Processing, 4, 112–114.
Choudrey, R., Penny, W., & Roberts, S. (2000). An ensemble learning approach to independent component analysis. In Proceedings of Neural Networks for Signal Processing X. Sydney, 2000.
Choudrey, R., & Roberts, S. (2001). Variational Bayesian independent component analysis with flexible sources (Tech. Rep. No. PARG-01-05). Available on-line at: http://www.robots.ox.ac.uk/sjrob/pubs.html. Oxford: Oxford University.
Cooper, G. (1990). Computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42(3), 393–405.
Fisher, R. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.
Ghahramani, Z., & Beal, M. (2000). Variational inference for Bayesian mixtures of factor analysers. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 449–455). Cambridge, MA: MIT Press.
Girolami, M. (1998). An alternative perspective on adaptive independent component analysis algorithms. Neural Computation, 10(8), 2103–2114.
Hérault, J., & Jutten, C. (1986). Space or time adaptive signal processing by neural models. In J. Denker (Ed.), Proceedings AIP Conference: Neural Networks for Computing (Vol. 151, pp. 206–211). Snowbird, UT: AIP.
Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.
Hinton, G., & van Camp, D. (1993). Keeping neural networks simple by minimising the description length of the weights. In Proceedings of COLT-93. Santa Cruz, CA.
Hyvärinen, A. (1999). Survey on independent component analysis. Neural Computing Surveys, 2, 94–128. Available on-line at: http://www.icsi.berkeley.edu/jagota/NCS.
Isbell, B., & Viola, P. (1999). Restructuring sparse high dimensional data for effective retrieval. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 480–486). Cambridge, MA: MIT Press.
Jaakkola, T., & Jordan, M. (2000). Bayesian parameter estimation via variational methods. Statistics and Computing, 10, 25–37.
Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L. (1999). An introduction to variational methods for graphical models. In M. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.
Kolenda, T., Hansen, L., & Sigurdsson, S. (2000). Independent component analysis in text. In M. Girolami (Ed.), Advances in independent component analysis. Berlin: Springer-Verlag.
Landaur, T., Foltz, P., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25, 259–284.
Lappalainen, H. (1999). Ensemble learning for independent component analysis. In Proceedings of the First International Workshop on Independent Component Analysis (pp. 7–12). Aussois, France.
Lee, T., Girolami, M., Bell, A., & Sejnowski, T. (1998). A unifying information-theoretic framework for independent component analysis. International Journal of Mathematical and Computer Modelling, 39, 1–21.
Lee, T., Girolami, M., & Sejnowski, T. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and supergaussian sources. Neural Computation, 11(2), 409–433.
Lee, T., Lewicki, M., & Sejnowski, T. (1999). ICA mixture models for unsupervised classification and automatic context switching. In International Workshop on Independent Component Analysis (pp. 209–214). Aussois, France.
Lee, T., Lewicki, M., & Sejnowski, T. (2000). ICA mixture models for unsupervised classification of non-gaussian classes and automatic context switching
in blind signal separation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(10), 1–12.
Lewicki, M., & Olshausen, B. (1999). A probabilistic framework for the adaptation and comparison of image codes. Journal of the Optical Society of America, 16(7), 1587–1601.
MacKay, D. (1995a). Developments in probabilistic modelling with neural networks—ensemble learning. In Proceedings of the Third Annual Symposium on Neural Networks (pp. 191–198). Nijmegen, Netherlands: Springer-Verlag.
MacKay, D. (1995b). Probable networks and plausible predictions—a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6, 469–505.
Makeig, S., Bell, A., Jung, T.-P., & Sejnowski, T. (1996). Independent component analysis of electroencephalographic data. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 145–151). Cambridge, MA: MIT Press.
Miskin, J., & MacKay, D. (2000). Application of ensemble learning to infra-red imaging. In Proceedings of ICA2000. Helsinki, Finland.
Olshausen, B., & Field, D. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems (2nd ed.). San Mateo, CA: Morgan Kaufmann.
Pearlmutter, B., & Parra, L. (1996). A context-sensitive generalization of ICA. In ICONIP '96 (pp. 151–157). Berlin: Springer-Verlag.
Penny, W., & Roberts, S. (2001). Mixtures of independent component analysers. In Artificial Neural Networks—ICANN 2001 (pp. 527–534). Vienna, Austria.
Penny, W., Roberts, S., & Everson, R. (2000). Hidden Markov independent components analysis. In M. Girolami (Ed.), Advances in independent component analysis. Norwell, MA: Kluwer.
Penny, W., Roberts, S., & Everson, R. (2001). ICA: Model order selection and dynamic source models. In S. Roberts & R. Everson (Eds.), Independent component analysis: Principles and practice (pp. 299–314). Cambridge: Cambridge University Press.
Rissanen, J. (1994). MDL modeling—an introduction. In P. Grassberger & J.-P. Nadal (Eds.), From statistical physics to statistical inference and back (pp. 95–104). Norwell, MA: Kluwer.
Ristaniemi, T., & Joutsensalo, J. (1999). On the performance of blind source separation in CDMA downlink. In Proceedings of ICA '99 (pp. 437–442). Aussois, France.
Roberts, S., & Everson, R. (Eds.). (2001). Independent component analysis: Principles and practice. Cambridge: Cambridge University Press.
Roberts, S., Holmes, C., & Denison, D. (2001). Minimum entropy data partitioning using reversible jump Markov chain Monte Carlo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(8), 909–915.
Tipping, M., & Bishop, C. (1999). Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2), 443–482.
Ueda, N., Nakano, R., Ghahramani, Z., & Hinton, G. (1999). SMEM algorithm for mixture models. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Valpola, H., & Pajunen, P. (2000). Fast algorithms for Bayesian independent component analysis. In Proceedings of ICA2000 (pp. 233–237). Helsinki, Finland.
van Hateren, J., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London, Series B, 265, 359–366.

Received October 15, 2001; accepted July 8, 2002.
Communicated by Paul Bressloff
LETTER
Interspike Interval Correlations, Memory, Adaptation, and Refractoriness in a Leaky Integrate-and-Fire Model with Threshold Fatigue

Maurice J. Chacron
[email protected] Department of Physics, University of Ottawa, Ottawa, Canada K1N 6N5
Khashayar Pakdaman
[email protected] Inserm, U444, 75571 Paris Cedex 21, France
André Longtin
[email protected] Department of Physics, University of Ottawa, Ottawa, Canada K1N 6N5
Neuronal adaptation as well as interdischarge interval correlations have been shown to be functionally important properties of physiological neurons. We explore the dynamics of a modified leaky integrate-and-fire (LIF) neuron, referred to as the LIF with threshold fatigue, and show that it reproduces these properties. In this model, the postdischarge threshold reset depends on the preceding sequence of discharge times. We show that in response to various classes of stimuli, namely, constant currents, step currents, white gaussian noise, and sinusoidal currents, the model exhibits new behavior compared with the standard LIF neuron. More precisely, (1) step currents lead to adaptation, that is, a progressive decrease of the discharge rate following the stimulus onset, while in the standard LIF, no such patterns are possible; (2) a saturation in the firing rate occurs in certain regimes, a behavior not seen in the LIF neuron; (3) interspike intervals of the noise-driven modified LIF under constant current are correlated in a way reminiscent of experimental observations, while those of the standard LIF are independent of one another; (4) the magnitude of the correlation coefficients decreases as a function of noise intensity; and (5) the dynamics of the sinusoidally forced modified LIF are described by iterates of an annulus map, an extension to the circle map dynamics displayed by the LIF model. Under certain conditions, this map can give rise to sensitivity to initial conditions and thus chaotic behavior.
Neural Computation 15, 253–278 (2003)
© 2002 Massachusetts Institute of Technology
254
M. Chacron, K. Pakdaman, and A. Longtin
1 Introduction

Since the seminal work of Adrian and Zotterman (1926), adaptation has been demonstrated to be a key property of neurons. For example, photoreceptor neurons have to cope with very different light intensities (e.g., day and night) and must adapt to each (Kuffler, FitzHugh, & Barlow, 1957). Moreover, this adaptation can sometimes give rise to interspike interval (ISI) correlations in experimental recordings (Kuffler et al., 1957; Goldberg, Adrian, & Smith, 1964; Yamamoto & Nakahama, 1983; Teich & Lowen, 1994; Schäfer, Braun, Peters, & Bretschneider, 1995; Longtin & Racicot, 1998; Chacron, Longtin, St-Hilaire, & Maler, 2000; Neiman & Russell, 2001; Liu & Wang, 2001; Brandman & Nelson, 2002). Information-theoretic approaches have recently shown that neural information transfer can either increase or decrease as a result of correlations between spikes (Abbott & Dayan, 1999; Panzeri, Petersen, Schultz, Lebedev, & Diamond, 2001; Tiesinga, Fellous, José, & Sejnowski, 2002) or ISIs (Chacron, Longtin, & Maler, 2001a). Also, correlations between neuronal inputs can increase or decrease information transfer (Abbott & Dayan, 1999; Panzeri et al., 2001).

One prototype sensory system in which the role of ISI correlations has been addressed is the electrosensory system of weakly electric fish. Negative ISI correlations have been found in P-type electroreceptors (Longtin & Racicot, 1998). A modeling study of these receptors has further shown that these correlations enhance not only information transfer but signal detection as well (Chacron et al., 2001a). It is clear from those studies that ISI correlations depend on various biophysical model parameters, as well as on noise and periodic forcing. The goal of this article is to provide a foundation for understanding how ISI correlations arise from the interplay of these factors in the context of a simple model.
The leaky integrate-and-fire model (LIF) is one of the most elementary spiking models and has been widely used to gain a better understanding of information processing in neurons (see, e.g., Tuckwell, 1988). One feature that makes this model particularly suitable for theoretical analysis is that it is analytically tractable and retains two key neuronal properties of type I membranes: the all-or-none response and the postdischarge refractoriness. The tractability of the standard LIF is in part due to the fact that following a discharge, the dynamics of the LIF depends on only the present and future inputs. In other words, after each firing, the LIF has no memory of the past: inputs and discharge times prior to the last one bear no influence on the next discharge times. This property has been a key ingredient in the analysis of the response of the standard LIF to inputs such as sinusoidal current or white gaussian noise. Indeed, it allows the standard LIF to be described by iterates of orientation-preserving circle maps (Rescigno, Stein, Purple, & Poppele, 1970; Keener, Hoppensteadt, & Rinzel, 1981; Coombes & Bressloff, 1999; Pakdaman, 2001). Moreover, the discharge times evoked by
Memory and Adaptation in a Model Neuron
255
gaussian white noise stimulation form a renewal process (for recent reviews on the noisy LIF, see Lansky & Sato, 1999; Ricciardi, Di Crescenzo, Giorno, & Nobile, 1999; Pakdaman, Tanabe, & Shimokawa, 2001). The lack of memory about past discharges, which constitutes an advantage in the theoretical analysis of the standard LIF, is also one of the shortcomings of this model. Indeed, this property can eliminate the correlation between successive interspike intervals (ISIs) of the LIF. For instance, the ISIs evoked by additive gaussian white noise stimulation of this model are independent from one another. As mentioned, this is not the case experimentally for many neurons. Moreover, the LIF model assumes that the threshold to firing is constant in time. There is experimental evidence that the threshold for firing is not constant but depends on the past spiking history of the neuron (Azouz & Gray, 1999). Thus, a physiologically plausible modification to the LIF neuron would be to make the threshold a dynamical variable also.

In this article, we study a modified LIF neuron in which the threshold is a dynamical variable that models threshold fatigue after an action potential. Using step currents as stimuli, we show that the model can give rise to adaptation (i.e., a gradual change in firing rate), a feature that is absent from the standard LIF model. Under certain conditions, we show that the threshold fatigue can lead to a saturation in the firing rate of the neuron as the depolarizing current is increased. This feature is also absent from the LIF model, where the firing rate diverges as a function of input current. We then show that the modified LIF incorporating threshold fatigue can reproduce ISI correlations similar to those found in experimental recordings.
The choice of the threshold dynamics is guided by previous observations that ISI correlations can result from the progressive accumulation of refractory effects (Stein, 1965; Weiss, 1966; Geisler & Goldberg, 1966; Holden, 1976; Gerstner & van Hemmen, 1992; Chacron et al., 2000, 2001a; Chacron, Longtin, & Maler, 2001b; Liu & Wang, 2001). If the neuron fires two spikes in a relatively short time interval, then the next action potential usually occurs after a longer time interval. Implementing threshold fatigue, that is, decreasing excitability during high-frequency firing, in the LIF should thus be a candidate for increasing the ISI correlations in a satisfactory way. Our investigations provide another confirmation of this point. Although more detailed ionic models of neurons such as the Hodgkin-Huxley model (Hodgkin & Huxley, 1952) are capable of adaptation and ISI correlations with the addition of suitable ionic channels, such models seldom are analytically tractable due to their complexity. In contrast, our model retains enough of the simplicity of the LIF model to remain analytically tractable for the most part. Various integrate-and-fire-like models with mechanisms analogous to threshold fatigue have been previously used, for instance:

- In the analysis of mutually inhibiting neuronal models (Reiss, 1962; Wilson & Waldron, 1968; for an account of these studies, see MacGregor & Lewis, 1977)
- A stochastic model where an inhibitory conductance is being reset to a value dependent on past history to mimic the possible summation of hyperpolarizing afterpotentials following each discharge (Geisler & Goldberg, 1966; Wehmeier, Dong, Koch, & van Essen, 1989; Treves, 1996; Liu & Wang, 2001)

- Periodically and irregularly forced neuronal models (Segundo, Perkel, Wyman, Hegstad, & Moore, 1968)

- Self-exciting models and interacting excitatory assemblies (Vibert, Pakdaman, & Azmy, 1994; Pakdaman & Vibert, 1995; Pakdaman, Vibert, Boussard, & Azmy, 1996)

- An LIF model in which the threshold decayed exponentially and was incremented by a fixed amount immediately after an action potential (Geisler & Goldberg, 1966; Holden, 1976; Chacron et al., 2000, 2001a, 2001b; Liu & Wang, 2001)

Although the model used in this study bears similarities with previous ones, one key difference resides in the introduction of a memory parameter to adjust the level of fatigue. Indeed, previous studies dealt with neuron models with a fixed level of fatigue, whereas this one is concerned with a systematic analysis of the influence of this factor. This program is carried out in three stages involving the description of the response of the modified LIF to (1) constant and step current stimulation, (2) white gaussian noise, and (3) sinusoidal forcing. The first two stages are concerned with adaptation and the ISI correlations due to fatigue. The last stage examines how these factors affect the response of the modified model to other inputs. The particular selection of sinusoidal forcing is motivated by its relevance for understanding information processing in extrinsically forced neurons (Chacron et al., 2000).
Furthermore, comparison of our results with the extensive studies of sinusoidally forced LIFs and their variants (Rescigno et al., 1970; Glass & Mackey, 1979; Keener et al., 1981; Alstrøm & Levinsen, 1988; Coombes, 1999; Coombes & Bressloff, 1999; Pakdaman, 2001) provides the basis for the determination of specific contributions of threshold fatigue to the dynamics of the model.

2 The Model

We consider the model given by the following equations and firing rule:

dv/dt = −v/τ_v + I(t)        if v(t) < s(t),    (2.1)

ds/dt = (s_r − s)/τ_s        if v(t) < s(t),    (2.2)

v(t^+) = v_0                 if v(t) = s(t),    (2.3)

s(t^+) = s_0 + W(s(t), a)    if v(t) = s(t),    (2.4)
Figure 1: Voltage (black solid line) and threshold (gray solid line) time series obtained with the model. An action potential occurs when voltage and threshold are equal. The firing times t_n thus satisfy v(t_n) = s(t_n). Immediately after an action potential, the voltage is reset to zero while the threshold is set to a value s(t_n^+) = s_0 + W(s(t_n), a).
where v is the voltage, s is the threshold, I(t) is the stimulation current, τ_v and τ_s are the time constants for voltage and threshold, respectively, and s_r is the value at which the threshold stabilizes in the absence of firing. Firing occurs when the voltage reaches the threshold. Following this, the voltage is reset to v_0 (see equation 2.3), and the threshold is set to s(t^+) = s_0 + W(s(t), a), where s_0 is a parameter and W is a monotonically increasing function of s and a with W(s, 0) = 0. Here a is a positive parameter controlling the memory in the model. We will mostly look at the case W(s, a) = W_1(s, a) ≡ as, but other nonlinear forms could be used for W. Throughout, we assume that v_0 ≤ 0 < s_r ≤ s_0. The model dynamics are graphically illustrated in Figure 1. Equation 2.4 with W = W_1 represents threshold fatigue, with a being the memory parameter. Indeed, when a = 0, the threshold value immediately after a discharge is independent of the threshold value at that discharge, so that future discharges bear no memory of the firing history. The case where a = 0 and s_0 = s_r corresponds to the standard LIF with constant threshold. Conversely, when W = W_1 and a = 1, one recovers the threshold fatigue
implemented in previous studies, that is, after each discharge, the threshold is raised by a fixed amount s_0 (Segundo et al., 1968; Vibert et al., 1994; Pakdaman & Vibert, 1995; Pakdaman et al., 1996; Chacron et al., 2000, 2001a, 2001b; Liu & Wang, 2001). Other models used a conductance that was decremented by a fixed amount immediately after an action potential to achieve similar effects (Wehmeier et al., 1989; Treves, 1996; Liu & Wang, 2001). Liu and Wang (2001) have shown that the latter approach gave qualitatively similar results to our model with W = W_1 and a = 1. Previous studies were thus limited to the case a = 1. Generally, increasing a from zero corresponds to increasing the degree of fatigue in the model and, consequently, its dependence on its past. The following sections examine how this parameter affects the correlation between consecutive ISIs and the response of the model to various stimuli.

3 Results

3.1 Constant and Step Current Stimulation. Throughout this section, we assume W = W_1 and that I(t) = μ is constant. Two situations arise in response to such stimuli. When μτ_v < s_r, the stimulus is subthreshold, the membrane potential v(t) stabilizes at μτ_v, and no firing occurs. Conversely, when μτ_v > s_r, the model generates sustained firing. The remainder of this section analyzes the corresponding discharge pattern. This is done through the construction of a map relating consecutive postdischarge threshold values. Let v(u) be the solution to equation 2.1 with v(0) = v_0, and let s(u, S) be the solution to equation 2.2 with s(0, S) = S. These are given by

v(u) = (v_0 − μτ_v) e^{−u/τ_v} + μτ_v    (3.1)

s(u, S) = (S − s_r) e^{−u/τ_s} + s_r.    (3.2)
Let us assume that a discharge occurred at time t_n and resulted in the postdischarge threshold s(t_n^+) ≡ S_n^+. The next firing takes place after a time interval Δ_{n+1} ≡ t_{n+1} − t_n such that

s(Δ_{n+1}, S_n^+) = v(Δ_{n+1}) and s(u, S_n^+) > v(u) for all 0 ≤ u < Δ_{n+1}.    (3.3)

This firing yields the postdischarge threshold S_{n+1}^+ at time t_{n+1}^+:

S_{n+1}^+ = s_0 + a s(Δ_{n+1}, S_n^+)    (3.4)
          = s_0 + a v(Δ_{n+1})    (3.5)
          = s_0 + a[(S_n^+ − s_r) e^{−Δ_{n+1}/τ_s} + s_r].    (3.6)
From these equations, the relation between the nth and (n+1)th ISIs Δ_n and Δ_{n+1} is readily derived as:

Δ_n = −τ_v ln { [((v_0 − μτ_v) e^{−Δ_{n+1}/τ_v} + μτ_v − s_r) e^{Δ_{n+1}/τ_s} + s_r − s_0 − aμτ_v] / [a(v_0 − μτ_v)] }.    (3.7)
While equation 3.7 can be used to discuss the relation between successive ISIs, we prefer to focus on the relation between successive values of the postdischarge threshold. Indeed, for constant stimulation, from the ISI Δ_n one can unambiguously derive S_n^+, the nth postdischarge threshold, and conversely, given S_n^+, there is a unique ISI Δ_n, so that using either the ISIs or the postdischarge thresholds yields the same information about the dynamics of the model. However, for the class of stimuli treated in section 3.2, this is not necessarily so; different values of S_n may yield the same Δ_n. This in turn makes it more difficult to obtain the relation between Δ_n and Δ_{n+1}. For this reason, the following paragraphs describe the relation between S_n^+ and S_{n+1}^+. Equation 3.3 uniquely defines Δ_{n+1} = Δ_{n+1}(S_n^+) as a function of S_n^+, so that we can rewrite equation 3.4 as

S_{n+1}^+ = s_0 + a s(Δ_{n+1}(S_n^+), S_n^+) ≡ F(S_n^+).    (3.8)
This relation implies that the postdischarge threshold S_n^+ after the nth firing entirely determines the next postdischarge threshold S_{n+1}^+. In other words, the dynamics of the LIF with threshold fatigue receiving a constant stimulation is determined by the iterates of the map F. This map is defined on the interval [s_0 + as_r, +∞). In general, Δ(S), and consequently F(S), cannot be explicitly written in closed form because equation 3.3 cannot be solved analytically for arbitrary τ_s. However, the geometrical properties of F that determine the behavior of the sequence {S_n^+} can be determined without such knowledge. For fixed u, the function s(u, S) is monotonic increasing, so that if S > S′, then s(u, S) cannot intersect the voltage prior to s(u, S′). In other words, S > S′ implies that Δ(S) > Δ(S′), which means that the larger the postdischarge threshold is, the longer the following interspike interval. Given that v_0 ≤ 0 < μτ_v (the postdischarge potential is reset below the resting voltage) and that v(u) is monotonic increasing, we have that Δ(S) > Δ(S′) implies v(Δ(S)) > v(Δ(S′)). This relation together with equation 3.5 and a > 0 yields that F(S) > F(S′) whenever S > S′, that is, F is monotonic increasing. This, combined with the observation that F(s_0 + as_r) > s_0 + as_r and F(S) → s_0 + aμτ_v as S → +∞, ensures that for any S_1^+ ∈ [s_0 + as_r, +∞), the sequence S_n^+ is either increasing or decreasing and converges to some value in [s_0 + as_r, s_0 + aμτ_v]. Finally, all sequences converge to the same value, that is, F has a unique fixed point, because F is concave. This property comes from the fact
that S → s(u, S) is contracting, that is, |s(u, S) − s(u, S′)| = e^{−u/τ_s} |S − S′| < |S − S′| for u > 0, and v(u) is concave, so that F′, the derivative of F, is monotonic decreasing. In summary, we have established that the dynamics of the modified LIF with constant current is governed by

S_{n+1}^+ = F(S_n^+),    (3.9)

where F, which maps [s_0 + as_r, +∞) onto [s_0 + as_r, s_0 + aμτ_v), is a concave monotonic increasing function with a unique fixed point that we denote by S*. Thus, for all S_1^+ < S*, the sequence S_n^+ increases and converges to S*, and, conversely, for all S_1^+ > S*, the sequence S_n^+ decreases toward S*. The first consequence of the above result is that in response to a constant stimulation, the modified LIF stabilizes at a periodic firing with constant interspike intervals Δ* given by

Δ* = τ_v ln [ a(v_0 − μτ_v) / (S* − s_0 − aμτ_v) ].    (3.10)
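Because Δ(S) has no closed form, the map F must be evaluated numerically. The following is an illustrative sketch, not the authors' code: the parameter values, the bisection solver for equation 3.3, and the iteration count are my own choices. It iterates F to its fixed point S* and checks the result against equation 3.10.

```python
import numpy as np

# Illustrative parameters satisfying v_0 <= 0 < s_r <= s_0 and mu*tau_v > s_r.
tau_v, tau_s, a = 1.0, 2.0, 1.0
mu, v0, s_r, s_0 = 1.0, 0.0, 0.1, 1.0

def v(u):                        # eq. 3.1: voltage after reset
    return (v0 - mu * tau_v) * np.exp(-u / tau_v) + mu * tau_v

def s(u, S):                     # eq. 3.2: threshold decaying from S
    return (S - s_r) * np.exp(-u / tau_s) + s_r

def Delta(S, hi=100.0, tol=1e-12):
    """First passage time solving s(Delta, S) = v(Delta) (eq. 3.3), by bisection.
    s starts above v and the two curves are monotone, so the crossing is unique."""
    lo = 0.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if s(mid, S) > v(mid) else (lo, mid)
    return 0.5 * (lo + hi)

def F(S):                        # eq. 3.8: next postdischarge threshold
    return s_0 + a * s(Delta(S), S)

S = s_0 + a * s_r                # left end of the domain of F
for _ in range(200):             # monotone convergence to the fixed point S*
    S = F(S)

# eq. 3.10 recovers the stationary ISI from S*; it agrees with Delta(S*).
Delta_star = tau_v * np.log(a * (v0 - mu * tau_v) / (S - s_0 - a * mu * tau_v))
print(S, Delta_star, Delta(S))
```

Starting instead from a large S_1^+ would converge to the same S* from above, illustrating the uniqueness of the fixed point.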
We remark that since Δ(S) is independent of a, the map F defined in equation 3.8 is an increasing function of a. A consequence of this observation is that the value of the equilibrium postdischarge threshold S* increases with a. Given that Δ* = Δ(S*), the firing period also increases with a. In other words, the stationary firing slows down as the factor representing threshold fatigue increases. We now study how Δ* varies with the input current μ. Although it is not possible in general to solve equation 3.7 to find Δ* as a function of μ, it is possible to solve for μ as a function of Δ*. We get

μ(Δ*) = [−s_0 + s_r − s_r e^{Δ*/τ_s} + v_0 e^{Δ*/τ_s} e^{−Δ*/τ_v} − a v_0 e^{−Δ*/τ_v}] / [τ_v (a − e^{Δ*/τ_s} + e^{Δ*/τ_s} e^{−Δ*/τ_v} − a e^{−Δ*/τ_v})].    (3.11)
The derivative of μ with respect to Δ* is always negative under the assumptions v_0 ≤ 0 < s_r ≤ s_0, a > 0, and Δ* ≥ τ_s ln(a). We will now show that the ISI Δ* can never be less than τ_s ln(a). If a ≤ 1, this condition is trivially satisfied since τ_s ln(a) ≤ 0 and Δ* > 0. We thus concentrate on the case a > 1. The denominator of equation 3.11 is zero when Δ* = τ_s ln(a), so the function μ(Δ*) has a pole at this point. Since the function μ(Δ*) is monotonically decreasing and continuously differentiable on the range (τ_s ln(a), +∞) and lim_{Δ*→+∞} μ(Δ*) = s_r/τ_v > 0, we have that lim_{Δ*→τ_s ln(a)} μ(Δ*) = +∞. Thus, as we increase μ from s_r/τ_v to +∞, the ISI Δ* decreases from ∞ to τ_s ln(a). The ISI Δ* can thus never be less than τ_s ln(a) if a > 1. This has important implications for the model dynamics. In particular, it implies that the stationary ISI Δ* remains greater than τ_s ln(a) as we increase μ to arbitrarily large values.
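The divergence of the required current can be checked numerically from equation 3.11. A small sketch (the parameter values are illustrative, not from the paper) evaluates the current needed to sustain a given stationary ISI and shows μ blowing up as Δ* approaches the floor τ_s ln(a):

```python
import numpy as np

# Illustrative parameters with a > 1, where rate saturation is expected.
tau_v, tau_s = 1.0, 8.0
v0, s_r, s_0, a = 0.0, 0.1, 1.0, 4.0

def mu_of_Delta(D):
    """Closed-form mu(Delta*) of eq. 3.11."""
    Ev, Es = np.exp(-D / tau_v), np.exp(D / tau_s)
    num = -s_0 + s_r - s_r * Es + v0 * Es * Ev - a * v0 * Ev
    den = tau_v * (a - Es + Es * Ev - a * Ev)
    return num / den

floor = tau_s * np.log(a)        # the stationary ISI can never drop below this
for D in (floor * 1.5, floor * 1.1, floor * 1.01, floor * 1.001):
    print(f"Delta* = {D:8.4f}  requires mu = {mu_of_Delta(D):12.2f}")
# mu diverges as Delta* -> tau_s*ln(a): the stationary firing rate saturates
# at 1/(tau_s*ln(a)) no matter how strong the constant current is.
```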
Moreover, the fact that μ(Δ*) is a decreasing function of Δ* implies that the firing frequency 1/Δ* is always a monotonically increasing function of μ. However, if a > 1, the firing frequency will saturate to a finite value given by (τ_s ln(a))^{−1}. As mentioned above, this feature is absent from the LIF model (a = 0, s_r = s_0), where the firing rate diverges as a function of the input current. This type of behavior is commonly seen in experimental recordings and is usually associated with absolute refractoriness. Note, however, that the saturation in firing rate does not imply that an absolute refractory period is present in the model. Indeed, the model can give rise to an ISI smaller than τ_s ln(a) following a sufficiently high current pulse or step (see Figure 2C). Thus, we show that even in the absence of an absolute refractory period in the classical sense, threshold fatigue can give rise to a saturation in the firing frequency under constant current input. This has important implications; it may be that the saturation in firing rate seen in experimental recordings under constant depolarizing current is due to threshold fatigue rather than the absolute refractory period of the neuron. A modification δμ of the value of μ leads to a transient modification of the discharge rate of the model. Notably, when δμ > 0, the sequence S_n increases from the previous value of S* toward the new value, resulting in a progressive lengthening of intervals to the new stationary ISI. Note that this ISI will be shorter than the old one. If initially μτ_v < s_r is subthreshold and the new value is suprathreshold, the transient regime in which the ISIs are shorter than in the stationary regime corresponds to neuronal adaptation, that is, a transient frequency increase at the onset of stimulation. The above results hold for all a > 0. However, some distinctions exist between the different values of a that clarify the role of this parameter.
One point is that for a ≤ 1, the map F is contracting, that is, F′(S) < 1 for all S, but this is not necessarily so for a > 1, where the map may be expanding in some intervals of S. This means that effects such as discharge rate adaptation tend to be more pronounced in maps with large a than in those with small a. Figure 2A shows the response of the model to a step current going from the subthreshold regime to the suprathreshold regime. Note that the ISIs progressively increase to the new equilibrium value, as is seen in experimental recordings (Adrian & Zotterman, 1926). Figure 2B shows the response of the model when the value of the current before the step was set to be in the suprathreshold regime with all other parameters unchanged. Figure 2C shows the response of the model to the same step as in Figure 2B but with a increased. Note the faster rate of adaptation. The rate at which the model adapts to a step current can thus be varied by changing the parameter a.

3.2 Gaussian White Noise Stimulation. Throughout this section, we assume W = W_1 and that I(t) = μ + σξ(t), where μ is constant and ξ is white gaussian noise with unit intensity. In this situation, the voltage and threshold of the model are stochastic processes that depend on the particular noise
Figure 2: Response of the model to a step increase in current. (A) The current goes from subthreshold (μ = 0.1) to suprathreshold (μ = 0.9) at t = 50. Note the adaptation in the threshold and the progressive lengthening of ISIs to the new equilibrium value. Parameter values used were τ_s = 2, τ_v = 1, a = 1, s_r = 0.2, s_0 = 0.1, v_0 = 0. (B) The current goes from μ = 0.4 to μ = 0.9 with all other parameters unchanged. (C) Illustration of the effects of increased a. μ goes from 0.4 to 0.9 but a = 10 with all other parameters unchanged. Note the increased rate of adaptation.
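A step protocol of this kind can be reproduced with a short Euler sketch. This is not the authors' code: the integration step and the parameter values (chosen so that v_0 ≤ 0 < s_r ≤ s_0 holds) are my own illustrative choices.

```python
import numpy as np

def step_response(mu0=0.4, mu1=0.9, t_step=50.0, T=100.0, dt=1e-3,
                  tau_v=1.0, tau_s=2.0, a=1.0, s_r=0.1, s_0=1.0, v_0=0.0):
    """Euler sketch of the threshold-fatigue LIF (eqs. 2.1-2.4 with W = a*s)
    driven by a suprathreshold-to-suprathreshold current step at t_step."""
    v, s, spikes = v_0, s_0, []
    for k in range(int(T / dt)):
        t = k * dt
        mu = mu0 if t < t_step else mu1
        v += dt * (-v / tau_v + mu)      # eq. 2.1: leaky integration
        s += dt * (s_r - s) / tau_s      # eq. 2.2: threshold relaxes to s_r
        if v >= s:                       # firing rule
            spikes.append(t)
            v, s = v_0, s_0 + a * s      # eqs. 2.3 and 2.4: reset and fatigue
    return np.array(spikes)

spikes = step_response()
post = np.diff(spikes[spikes > 50.0])
# Adaptation: right after the step the ISIs are short, then they lengthen
# progressively toward the new (shorter-than-before) stationary interval.
print(post[0], post[-1])
```

Raising a in this sketch makes the post-step transient more pronounced, in line with the role of the memory parameter described above.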
realization. The interspike intervals are defined as the first passage times (FPTs) of the voltage through the threshold. When a = 0, that is, in the absence of threshold fatigue, the ISIs are independent and identically distributed random variables. We denote by g(t | s_0) the probability density function (pdf) of these variables, that is, the function g(t | s_0) is the FPT pdf of the Ornstein-Uhlenbeck process g,

dg/dt = −g/τ_v + μ + σξ(t) with g(0) = v_0,    (3.12)
through the threshold s(t) = (s_0 − s_r) exp(−t/τ_s) + s_r. The FPT pdf g contains all information concerning the point process formed by the discharge times of the model. When a > 0, the description of the discharge times needs to take into account the variations in the postdischarge threshold. In the same way as for constant stimulation, the key point in the description of the behavior of the model is in establishing the relation between consecutive postdischarge thresholds. In this case, given that the stimulation is noise, the description involves the construction of a Markov chain rather than a map. This is detailed in the following. The conditional pdf of S_{n+1}^+ given S_n^+ can be written as

P_1(u | S_n^+) = g( τ_s ln[ a(S_n^+ − s_r) / (u − s_0 − as_r) ] | S_n^+ ) · τ_s / (u − s_0 − as_r),    (3.13)

where as before g represents the FPT pdf of the Ornstein-Uhlenbeck process, equation 3.12, through the threshold s(t, S_n^+) = (S_n^+ − s_r) exp(−t/τ_s) + s_r. The pdf P_1 defines the first-order transition probabilities of an irreducible Markov chain. We denote by h*(S) the stationary distribution of this chain; in other words, in the long run, the probability for the postdischarge threshold to take a value within (S, S + dS) is given by h*(S) dS. The characteristics of the point process formed by the discharge times of the model are then defined in terms of this stationary distribution together with g and P_1. For instance, the pdf of the ISI distribution g*(t) is given by

g*(t) = ∫ g(t | S) h*(S) dS.    (3.14)

We are interested in the serial correlation coefficient of ISIs defined by

ρ_p = (⟨Δ_n Δ_{n+p}⟩ − ⟨Δ_n⟩²) / (⟨Δ_n²⟩ − ⟨Δ_n⟩²),    (3.15)
where ⟨Δ_n Δ_{n+p}⟩, ⟨Δ_n²⟩, and ⟨Δ_n⟩ are the expectations of Δ_n Δ_{n+p}, Δ_n², and Δ_n, respectively. For n large, the last two quantities are determined by ⟨Δ_n⟩ = ∫ t g*(t) dt and ⟨Δ_n²⟩ = ∫ t² g*(t) dt. For the first term, the expectation of the product of two intervals, we first consider the case where n is large and p = 1. Given S_n^+, the postdischarge threshold after the nth discharge, the ISI between the nth and (n+1)th firings, that is, Δ_n, is distributed according to g(t | S_n^+). The knowledge of S_n^+ and Δ_n uniquely determines S_{n+1}^+ as

S_{n+1}^+ = s_0 + a[(S_n^+ − s_r) e^{−Δ_n/τ_s} + s_r] ≡ f(S_n^+, Δ_n).    (3.16)
This knowledge in turn yields that the ISI between the (n+1)th and (n+2)th firings, Δ_{n+1}, has pdf g(t | S_{n+1}^+) = g(t | f(S_n^+, Δ_n)). Combining these with
the fact that for n large, S_n^+ is distributed according to h* yields

⟨Δ_n Δ_{n+1}⟩ = ∫_S ∫_Δ ∫_{Δ′} Δ Δ′ g[Δ′ | f(S, Δ)] g[Δ | S] h*(S) dΔ′ dΔ dS.    (3.17)

In a similar way, we derive

⟨Δ_n Δ_{n+p}⟩ = ∫_S ∫_Δ ∫_{S′} ∫_{Δ′} Δ Δ′ g[Δ′ | S′] P_{p−1}[S′ | f(S, Δ)] g[Δ | S] h*(S) dΔ′ dS′ dΔ dS,    (3.18)

where P_{p−1} is the (p−1)th transition pdf, that is, P_0(u | S) = δ(u − S), the Dirac function at S, and P_{k+1}(u | S) = ∫_{S′} P_1(u | S′) P_k(S′ | S) dS′. In the absence of threshold fatigue, when a = 0, the postdischarge threshold is a fixed value s_0, so that P_1(u | S) = δ(u − s_0), h*(S) = δ(S − s_0), and f(S, Δ) = s_0. Substituting these into equation 3.18 yields that for p ≥ 1, ⟨Δ_n Δ_{n+p}⟩ = ⟨Δ_n⟩⟨Δ_{n+p}⟩ and consequently that ρ_p = 0.

The analysis of the previous paragraphs establishes that the response to gaussian white noise of the modified LIF with threshold fatigue is not described by a renewal process when a > 0. This is in contrast with the standard LIF, whose response to white gaussian noise can be described by a renewal process. It also revealed that the modified model with noisy forcing was characterized by a Markov chain relating consecutive postdischarge thresholds. Finally, it showed that the dependence of a postdischarge threshold on the previous value induces correlation between ISIs. Once it is theoretically established that ISIs of the modified model are correlated with one another, we examine the nature of this dependence. For this purpose, rather than using the integral expressions obtained previously, it is appropriate to use numerical simulations of the model. Indeed, the integrals require the computation of the FPT pdf of the Ornstein-Uhlenbeck process through an exponential boundary. Given that no general analytical expression is available for this quantity, derivation of the correlations from the integrals can be computationally more demanding than estimating the same quantities from simulations. Figure 3A shows the ISI distribution g*(t) obtained when the system is driven by gaussian white noise for a = 1, and Figure 3B shows the ISI correlation coefficients ρ_p. The estimated ρ_1 is negative. Serial correlation coefficients at higher lags are close to zero, illustrating the rapid return of the system to equilibrium.
The fact that ρ_1 is negative and that it is the only coefficient substantially different from zero implies that ISIs shorter (longer) than average will be followed by ISIs longer (shorter) than average. Figure 3C shows the ISI distribution obtained for a = 4. We can see that increasing a increases the mean ISI. Also, the magnitude of the coefficient ρ_1
Figure 3: (A) ISI distribution obtained for a = 1 in the presence of gaussian white noise of standard deviation 0.1. (B) Correlation coefficients ρ_j as a function of lag. Note that only ρ_1 = −0.38 is negative and that all coefficients are zero for higher lags. (C) ISI distribution obtained for a = 4. (D) Correlation coefficients. Note that ρ_1 = −0.48 is lower than for a = 1. Other parameter values were τ_s = 8, τ_v = 1, μ = 1, s_r = 0, s_0 = 1.
increases with a, which is consistent with the prediction that more memory will give rise to more pronounced ISI correlations. The magnitude of the serial correlation coefficient at lag one depends on the slope of the map S → F(S) and on the strength of the noise. For high noise intensity, it is expected that the noise will wash out the correlations induced by the map. We thus expect an increase in ρ_1 as a function of noise intensity. This is shown in Figure 4, where noise activates the deterministic properties of the map. The value of ρ_1 has already been linked to the detectability of weak signals in neurons as well as their information transfer capabilities (Chacron et al., 2001a).
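The negative lag-one correlation is easy to reproduce in simulation. The following Euler-Maruyama sketch is my own (the coarse time step and modest spike count are chosen for speed; the other parameters follow the a = 1 case above); it estimates the lag-one serial correlation coefficient of the noise-driven model.

```python
import numpy as np

rng = np.random.default_rng(1)
tau_v, tau_s, a = 1.0, 8.0, 1.0
mu, sigma, v_0, s_r, s_0 = 1.0, 0.1, 0.0, 0.0, 1.0
dt, n_spikes = 5e-3, 1500
sq = sigma * np.sqrt(dt)

v, s, t, t_last, isis = v_0, s_0, 0.0, 0.0, []
while len(isis) < n_spikes:
    # Euler-Maruyama step of dv = (-v/tau_v + mu) dt + sigma dW
    v += dt * (-v / tau_v + mu) + sq * rng.standard_normal()
    s += dt * (s_r - s) / tau_s
    t += dt
    if v >= s:                               # firing
        isis.append(t - t_last)
        t_last, v, s = t, v_0, s_0 + a * s   # reset and threshold fatigue

isis = np.array(isis)
d = isis - isis.mean()
rho_1 = np.mean(d[:-1] * d[1:]) / np.mean(d * d)   # lag-one serial correlation
print(rho_1)  # clearly negative: short ISIs tend to be followed by long ones
```

Setting a = 0 in this sketch removes the fatigue and drives the estimate toward zero, recovering the renewal behavior of the standard LIF.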
Figure 4: ρ_1 as a function of the noise standard deviation. ρ_1 exhibits a minimum for a noise intensity around 0.2. It is at this noise level that the noise is most effective at perturbing the map without itself destroying the ISI correlations. Parameter values were the same as in Figure 3 with a = 1. Twenty thousand ISIs were used in each case.
To understand how the deterministic properties of the map can give rise to ISI correlations in the presence of noise, we study the response of the model to perturbations without noise (σ = 0). We assume that the stationary regime has been reached, that the model fires periodically, and that a single sufficiently large pulsatile stimulation is delivered a time lapse θ < Δ* after a discharge, to make the neuron fire ahead of time. This results in a shortened ISI with length θ, and consequently the postdischarge threshold S(θ^+) is larger than S*. Following the perturbation, the sequence S_n progressively decreases toward S*. Given that the interspike interval Δ(S) is a monotonic increasing function of S, the sudden increase in S at the time of the perturbation and its progressive decrease indicate that the shortened interval is followed by one that is longer than the period and that subsequent intervals progressively decrease toward the stationary value. If the system returns quasi-instantaneously to equilibrium, then only ρ_1 will be significantly negative. However, if the system takes a long time to return to equilibrium after the perturbation, then coefficients at higher lags can also be negative. We concentrate on the regime at which
Memory and Adaptation in a Model Neuron
267
the voltage has almost reached its asymptotic value μτ_v before firing; this occurs when τ_s ≫ τ_v. We start from equation 3.3, in which we set v(Δ_{n+1}) = μτ_v:

s(Δ_{n+1}, S_n^+) = μτ_v.  (3.19)

The stationary threshold value is then S* = s_0 + aμτ_v, and equation 3.10 gives us the stationary ISI Δ*:

Δ* = τ_s ln[(s_0 + aμτ_v − s_r) / (μτ_v − s_r)].  (3.20)
Let us suppose that a voltage perturbation advances or delays an action potential, thus causing an ISI θ that can be smaller or greater than Δ*. Then the new value of the threshold immediately after the action potential is given by

S_θ^+ = s_0 + a[s_r + (s_0 + aμτ_v − s_r) exp(−θ/τ_s)].  (3.21)
The next ISI, in the absence of further perturbations, will then be given by

θ_next = τ_s ln[(S_θ^+ − s_r) / (μτ_v − s_r)].  (3.22)
Equations 3.21 and 3.22 give us the relation between θ_next and θ:

θ_next = τ_s ln{[s_0 − s_r + a(s_r + (s_0 + aμτ_v − s_r) exp(−θ/τ_s))] / (μτ_v − s_r)}.  (3.23)
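As a quick numerical check of equations 3.20 and 3.23 (using illustrative parameter values that are assumptions, chosen only to satisfy s_0 > s_r > 0), one can verify that Δ* is a fixed point of the map and that the slope there is negative:

```python
import math

# Illustrative parameters (assumed values, not taken from the article).
mu, tau_v, tau_s = 1.0, 1.0, 10.0
s0, s_r, a = 0.5, 0.2, 1.0

X = s0 + a * mu * tau_v - s_r   # recurring combination s0 + a*mu*tau_v - s_r

def f(theta):
    """Map theta -> theta_next of equation 3.23."""
    return tau_s * math.log((s0 - s_r + a * (s_r + X * math.exp(-theta / tau_s)))
                            / (mu * tau_v - s_r))

D_star = tau_s * math.log(X / (mu * tau_v - s_r))   # stationary ISI, equation 3.20

print(f(D_star) - D_star)   # ~0: Delta* is a fixed point of the map
h = 1e-6
slope = (f(D_star + h) - f(D_star - h)) / (2 * h)
print(slope)                # negative, as implied by equation 3.24
```

The finite-difference slope agrees with the closed-form derivative of equation 3.24 evaluated at Δ*.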
The derivative of the map θ_next = f(θ) is given by

f′(θ) = −a(s_0 + aμτ_v − s_r) exp(−θ/τ_s) / {s_0 − s_r + a[s_r + (s_0 + aμτ_v − s_r) exp(−θ/τ_s)]}  (3.24)

and is negative for s_0 > s_r > 0 and aμτ_v > 0. We now consider the effect of a on the ISI θ_next. If a = 0, we have

Δ* = θ_next = τ_s ln[(s_0 − s_r) / (μτ_v − s_r)],  (3.25)

that is, θ_next is independent of θ, since f′(θ) = 0. This is because the system returns instantaneously to equilibrium, and there is no memory extending beyond one ISI. We now consider the effect of a > 0: θ_next increases with decreasing θ. Thus, a perturbation that causes θ to be larger (smaller)
than Δ* will make θ_next smaller (larger) than Δ*. Furthermore, increasing a increases θ_next. We illustrate this with numerical examples. Figure 5 shows a time series in which a perturbation of the voltage was applied at t = 1099. The perturbation increased the voltage and caused an action potential to occur prematurely. The ISI θ is thus shorter than Δ*, and consequently S_θ^+ (see equation 3.21) is greater than S*. Since the threshold starts from a higher value, it takes a longer time to decay to the value μτ_v. As such, the next ISI θ_next is greater than Δ*. Similarly, had θ been longer than Δ*, S_θ^+ would have been smaller than S*, and consequently θ_next would have been smaller than Δ*. The actual map θ_next = f(θ) is shown in Figure 6A and has a negative slope, as expected. Thus, the deterministic properties of the map lead to a perturbed ISI longer (shorter) than average being followed by an ISI shorter (longer) than average. Figure 6B illustrates the dependence of θ_next on a for different values of θ. As discussed, θ_next increases with a. The rate at which θ_next increases is, however, greater for low θ than for high θ. This shows how the deterministic properties of the map in response to perturbations can give rise to ISI correlations in the presence of noise.

Figure 5: Voltage (black solid line) and threshold (gray solid line) time series. A perturbation in voltage was applied at t = 1098. The perturbation caused an action potential to occur earlier than expected. As such, the ISI θ was shorter than Δ*, and the value of the threshold immediately after that action potential, S_θ^+, was higher than the threshold equilibrium value S*. Consequently, the threshold took a longer time to decay to μτ_v, and the next ISI θ_next was longer than Δ*. Parameter values were the same as in Figure 4.

Figure 6: (A) ISI θ_next as a function of θ (solid black line). The slope of the map is negative, as expected (see the text). (B) ISI θ_next as a function of a for different values of θ. Increasing a increases θ_next. Parameters have the same values as in Figure 5, except a.

3.3 Sinusoidal Stimulation. In this section, we consider the case where the stimulation is sinusoidal, I(t) = μ + a sin(ωt), with a ≥ 0. We show that the response of the LIF with threshold fatigue to such inputs is described by the iterates of an annulus map. Such maps can exhibit complex dynamics that can even be chaotic. Through numerical simulations, we illustrate the occurrence of such regimes in the periodically forced LIF with threshold fatigue. Assuming that a firing took place at time t_n, we denote by v(t, t_n) and s(t, t_n, S_n^+) the voltage and threshold at time t, given the initial conditions v(t_n, t_n) = v_0 and s(t_n, t_n, S_n^+) = S_n^+. Provided t ≥ t_n and no discharge takes place on
[t_n, t], we have

v(t, t_n) = μτ_v (1 − exp[(t_n − t)/τ_v]) + [aτ_v/(1 + ω²τ_v²)] [sin(ωt) − ωτ_v cos(ωt)] − [aτ_v/(1 + ω²τ_v²)] exp[(t_n − t)/τ_v] [sin(ωt_n) − ωτ_v cos(ωt_n)],  (3.26)

s(t, t_n, S_n^+) = s_r + (S_n^+ − s_r) exp[(t_n − t)/τ_s].  (3.27)

If μτ_v + aτ_v/√(1 + ω²τ_v²) ≤ s_r, then v(t, t_n) < s(t, t_n, S_n^+) for all t ≥ t_n (i.e., no sustained firing occurs), and, conversely, if

μτ_v + aτ_v/√(1 + ω²τ_v²) > s_r,  (3.28)
then there exists some finite t_{n+1}, with t_{n+1} > t_n, such that v(t_{n+1}, t_n) = s(t_{n+1}, t_n, S_n^+). In other words, inequality 3.28 is a necessary and sufficient condition for sustained firing of the model. When this inequality holds, the LIF with threshold fatigue generates an infinite sequence of firing times {t_i}, i = 1, 2, …, with t_n → ∞ as n → ∞. We assume that this is the case throughout the remainder of this section. The condition for sustained firing is independent of the initial voltage and threshold values; however, the sequence of discharge times depends on these quantities. In the following, we describe the relation between successive discharge times and postdischarge thresholds through the construction of an annulus map. The firing time t_{n+1} is given by

t_{n+1} = F_1(t_n, S_n^+) := inf{t : t > t_n, v(t, t_n) = s(t, t_n, S_n^+)},  (3.29)

and the threshold S_{n+1}^+ immediately after this firing is given by

S_{n+1}^+ = F_2(t_n, S_n^+) := s_0 + W(s_r + (S_n^+ − s_r) exp[−(t_{n+1} − t_n)/τ_s], a).  (3.30)
We define the discharge map as F: (t_n, S_n^+) → (t_{n+1} = F_1(t_n, S_n^+), S_{n+1}^+ = F_2(t_n, S_n^+)) on [0, ∞) × [s_0 + W(s_r, a), s_0 + W(V_M, a)], where V_M = μτ_v + aτ_v/√(1 + ω²τ_v²). For neural computation, it is of interest to know where spikes occur in relation to the periodic forcing (see Hopfield, 1995). Given that the input current I is T-periodic with T = 2π/ω, we define the firing phase φ_n ∈ [0, 2π) as

φ_n = ω(t_n − T·int[t_n/T]),  (3.31)

where int[·] denotes the integer part.
We can thus associate to F a map f = (f_1, f_2) on the annulus [0, 2π) × [s_0 + W(s_r, a), s_0 + W(V_M, a)]:

f_1(φ_n, S_n^+) = (2π/T) [F_1((T/2π) φ_n, S_n^+) modulo T],  (3.32)

f_2(φ_n, S_n^+) = F_2((T/2π) φ_n, S_n^+).  (3.33)

The behavior of the periodically forced LIF with threshold fatigue is completely characterized by the iterates of the discharge map F and its associated annulus map f, in the sense that, given the initial discharge time t_1 and the corresponding postdischarge threshold S_1^+, the subsequent discharge times and postdischarge thresholds are determined iteratively as (t_{n+1}, S_{n+1}^+) = F(t_n, S_n^+). We concentrate on the annulus map f and the sequence of firing phases and postdischarge thresholds (φ_i, S_i^+), i = 1, 2, …. When a = 0, the postdischarge threshold is always reset to s_0, so that the annulus map reduces to a map on the circle of radius s_0. Furthermore, when s_r = s_0, that is, when dealing with the standard LIF without threshold fatigue, this circle map, when restricted to its range, is orientation preserving (Pakdaman, 2001). In such a situation, the system can mainly display phase locking and quasi-periodic behavior, as well as strange nonchaotic dynamics on a parameter set of zero measure. It cannot produce chaos (Coombes, 1999). However, when a > 0, no such restrictions hold. Although we did not observe chaotic behavior with W = W_1 for the region of parameter space considered in this article, it is possible to observe sensitivity to initial conditions and chaotic dynamics when a nonlinear form is taken for W. We observed chaotic behavior when we chose W(s, a) = W_2(s, a) ≡ exp(as) − 1 (data not shown). Through systematic numerical simulations, we have observed that such dynamics arise readily when a is large and that the sensitivity to initial conditions is associated with a positive leading Lyapunov exponent. The details of this analysis will be described in another study (Chacron, Pakdaman, & Longtin, 2002).
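The discharge map F of equations 3.29 and 3.30 can be iterated numerically from the closed forms 3.26 and 3.27. The sketch below is illustrative only: it assumes the linear form W = W_1, taken here as W(s, a) = a·s, a voltage reset v_0 = 0, and parameter values chosen for convenience (none of these are the article's values); the forcing amplitude is written amp to keep it distinct from the memory parameter.

```python
import math

# Illustrative parameters (assumptions, not the article's values).
mu, tau_v, tau_s = 1.0, 1.0, 2.0
s0, s_r = 0.5, 0.2
a_mem = 1.0            # memory parameter a in W(s, a) = a*s (linear W1, assumed)
amp, omega = 0.5, 2.0  # forcing I(t) = mu + amp*sin(omega*t)
T = 2 * math.pi / omega
A = amp * tau_v / (1 + omega**2 * tau_v**2)

def v(t, tn):
    """Closed-form voltage of equation 3.26 (v0 = 0 assumed)."""
    e = math.exp((tn - t) / tau_v)
    return (mu * tau_v * (1 - e)
            + A * (math.sin(omega * t) - omega * tau_v * math.cos(omega * t))
            - A * e * (math.sin(omega * tn) - omega * tau_v * math.cos(omega * tn)))

def s(t, tn, Sn):
    """Closed-form threshold of equation 3.27."""
    return s_r + (Sn - s_r) * math.exp((tn - t) / tau_s)

def next_firing(tn, Sn, dt=1e-3):
    """F = (F1, F2): step to the next crossing v = s (eq. 3.29), then reset (eq. 3.30)."""
    t = tn + dt
    while v(t, tn) < s(t, tn, Sn):
        t += dt
    S_next = s0 + a_mem * s(t, tn, Sn)   # W1(s, a) = a*s (assumed linear form)
    return t, S_next

# Sustained-firing condition, inequality 3.28
assert mu * tau_v + amp * tau_v / math.sqrt(1 + omega**2 * tau_v**2) > s_r

tn, Sn = 0.0, s0 + a_mem * mu * tau_v
phases = []
for _ in range(30):
    tn, Sn = next_firing(tn, Sn)
    phases.append(omega * (tn - T * int(tn / T)))   # firing phase, equation 3.31
print(phases[-5:])   # firing phases in [0, 2*pi)
```

The crossing search uses simple fixed-step scanning for clarity; a bisection refinement would give spike times to higher precision.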
4 Discussion

In this study, we have shown that a simple single-neuron model that accounts for the memory seen in several classes of neurons has surprisingly rich dynamical behavior. Previous studies involved either a = 0 (Geisler & Goldberg, 1966; Kretzberg, Egelhaaf, & Warzecha, 2001) or a = 1 (Geisler & Goldberg, 1966; Holden, 1976; Vibert et al., 1994; Pakdaman & Vibert, 1995; Pakdaman et al., 1996; Chacron et al., 2000, 2001a, 2001b; Liu & Wang, 2001). We instead varied a ∈ [0, ∞) continuously and studied its effects. Different values of the parameter a (measuring the amount of memory in the system) gave rise to qualitatively different regimes. The rate and degree of adaptation to a step in current were shown to vary with a. Adaptation is commonly seen in biological neurons; we have found that this interesting
feature could be reproduced by our model and that the rate of adaptation depended on the amount of memory. Another common feature of biological neurons is the saturation of the firing rate as the input current is increased. Our model reproduces this feature when a > 1, in contrast with the LIF neuron, for which the firing rate diverges as a function of the input current. In experimental recordings, this saturation is usually thought to be associated with an absolute refractory period or with network effects. Our results show that an absolute refractory period is not necessary to obtain such an effect. In fact, this prediction could be verified experimentally in vitro if the firing-rate saturation due to adaptation is lower than the firing-rate saturation due to the absolute refractory period. Under the influence of perturbations and noise, we have shown that the model can give rise to negative ISI correlations. The value and rate of decay of these correlations depend on the parameter a and on the noise strength and can be varied to give rise to very different regimes. Expressions were derived for the ISI correlation coefficients. The model was also shown to give rise to ISI correlations similar to those seen in experimental data (Longtin & Racicot, 1998; Chacron et al., 2000). Further, the value of the correlation coefficient at lag 1 was shown to increase as a function of noise intensity. These correlations were shown to increase the detectability of weak signals in neurons (Chacron et al., 2001a). There thus might be an optimal noise intensity at which signal detection is maximal. ISI correlations have been observed in experimental data taken from neurons in different sensory systems. The advantages of a negative correlation coefficient at lag 1 for stimulus detection, through long-term spike train regularization, as well as for stimulus encoding have been outlined in another study (Chacron et al., 2001a).
Moreover, a similar firing mechanism has already been used successfully to model electroreceptors of weakly electric fish that display this negative serial correlation coefficient at lag 1 (Longtin & Racicot, 1998; Chacron et al., 2000, 2001a, 2001b) and could potentially be used to model other neurons that also display negative ISI correlations (Kuffler et al., 1957; Goldberg et al., 1964; Geisler & Goldberg, 1966). In fact, models similar to our own have already been used to model several classes of neurons in the visual system (Keat, Reinagel, Reid, & Meister, 2001), cortical neurons (Liu & Wang, 2001), and electroreceptor afferents (Brandman & Nelson, 2002). Results similar to our own can be obtained by having adaptation in the reset value of the membrane potential. Geisler and Goldberg (1966) showed that an adapting reset could give rise to ISI correlations. However, as mentioned, there is experimental evidence that the threshold for firing depends on the spiking history (Azouz & Gray, 1999), while there is, to our knowledge, no such evidence for the reset value. We thus believe a dynamic threshold to be more physiologically realistic than a dynamic reset.
Although our model contains two timescales, it cannot reproduce the bursting dynamics seen in many neurons within the parameter range considered in this study. This is because we considered only threshold fatigue after an action potential and not facilitation. Our model is thus different from the integrate-and-fire-or-burst model considered by Coombes, Owen, and Smith (2001), which also has two timescales. It has, however, been shown that the addition of a facilitation current to our model with a = 1 could give rise to bursting dynamics (Chacron et al., 2001b). The response of the model to perturbations and noise will carry over in the presence of sinusoidal forcing (Chacron et al., 2000). However, it is then impossible to obtain explicit analytical expressions for the ISI under voltage perturbations. Furthermore, under sinusoidal forcing, we have shown that our model can be described by a map defined on an annulus. While this may seem a simple extension of the many studies that have described the dynamics of sinusoidally forced standard LIFs by circle maps (Rescigno et al., 1970; Keener et al., 1981; Coombes & Bressloff, 1999; Pakdaman, 2001), it has major implications for the dynamics of the system. Indeed, the circle map associated with the periodically forced standard LIF is one-to-one and orientation preserving, so that it produces only one of the three following regimes: phase-locked, quasi-periodic, or strange nonchaotic behavior (Keener, 1980). For fixed parameters, it can produce neither a mix of these nor chaos. In contrast, the annulus map associated with the LIF model with threshold fatigue is not constrained by such limitations and may very well lead to more complex dynamics (Le Calvez, 2000), in agreement with our numerical investigations (details will be described in a future work). Also, introducing threshold fatigue as in our model yields an adapting sequence of ISIs in the presence of periodic forcing.
In contrast, models in which periodic forcing is taken into account by sinusoidally modulating the threshold or the voltage reset value see an abrupt change in ISI for a step increase in bias current. It should be noted that the annulus map is, in general, dissipative and not conservative. This system is thus different from the so-called standard map studied by various authors (e.g., Meiss, 1992). Our results suggest that neurons with strong adaptation are more prone to displaying complex dynamics than those that either do not adapt or adapt more slowly. Such sensitivity to initial conditions may confer some advantages, as chaotic systems have the potential to transmit much more information about time-varying stimuli than nonchaotic ones, depending on the Lyapunov exponents of the system (Abarbanel, 1996). Further studies are needed to investigate such possible functional roles of chaos in signal processing in nervous systems. In this study, we have considered how the addition of threshold fatigue affects the spike train of a single LIF neuron. However, it is clear that adaptation and ISI correlations could also be due to network or synaptic dynamics. To expand on this point, we now discuss the possible biological mechanisms that the threshold fatigue could model.
Cumulative inactivation of sodium channels (Mickus, Jung, & Spruston, 1999) could, for example, give rise to neural adaptation and ISI correlations. However, a fast spike-activated, slowly inactivating negative current could also give rise to similar effects: likely candidates would be members of the Kv family of potassium currents (Wang, Gan, Forsythe, & Kaczmarek, 1998). Another candidate could be a calcium-activated potassium current such as I_AHP. This current was shown to be present in cortical neurons (Madison & Nicoll, 1984). Liu and Wang (2001) have shown that a model with an I_AHP current is qualitatively equivalent to our threshold fatigue model with W = W_1 (i.e., a linear threshold reset) and a = 1. It is further known that most ionic currents have voltage-dependent conductances. The parameter s_0 represents the amount of reset that is voltage independent. The parameter a in our model could then represent the degree of dependence on the past firing times of the ionic channel that the dynamic threshold models. Increasing a would thus lead to a more history-dependent conductance. Long-term depression at a synapse could also give rise to relative refractoriness (Hausser & Roth, 1997). In fact, a depressing synapse produces negative spike train correlations (Goldman, Maldonado, & Abbott, 2002). Our model thus bears some similarity to models of short-term plasticity (Fuhrmann, Segev, Markram, & Tsodyks, 2002; Goldman et al., 2002). Furthermore, it is known that many synapses are voltage dependent. The parameter a in our model could then represent the voltage dependence of synaptic depression. It should, however, be noted that the recovery time constant of the neurotransmitter at typical synapses is usually in the range of hundreds, if not thousands, of milliseconds (von Gersdorff, Schneggenburger, Weis, & Neher, 1997).
Although neural adaptation usually occurs on much shorter timescales, our model could in principle still be used to account for such phenomena if the threshold time constant is sufficiently long. Moreover, it has been observed that neural adaptation is in many ways similar to recurrent inhibition (Ermentrout, Pascal, & Gutkin, 2001). Our single-neuron model could therefore describe network effects. It has thus been shown that threshold fatigue can be similar to recurrent inhibition, synaptic depression, or an intrinsic I_AHP current responsible for adaptation.

Acknowledgments

We thank C. Laing, B. Doiron, J. Benda, B. Lindner, and L. Maler for useful discussions. This research was supported by NSERC (M. J. C. and A. L.).

References

Abarbanel, H. D. I. (1996). Analysis of observed chaotic data. New York: Springer-Verlag.
Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comp., 11, 91–101.
Adrian, E. D., & Zotterman, Y. (1926). The impulse produced by sensory nerve endings. Part 2: The response of a single end-organ. J. Physiol. (Lond.), 61, 151–171.
Alstrøm, P., & Levinsen, M. T. (1988). Phase-locking structure of "integrate-and-fire" models with threshold modulation. Phys. Lett. A, 128, 187–192.
Azouz, R., & Gray, C. M. (1999). Cellular mechanisms contributing to response variability of cortical neurons in vivo. J. Neurosci., 19, 2209–2223.
Brandman, R., & Nelson, M. E. (2002). A simple model of long-term spike train regularization. Neural Comp., 14, 1507–1544.
Chacron, M. J., Longtin, A., & Maler, L. (2001a). Negative interspike interval correlations increase the neuronal capacity for encoding time-dependent stimuli. J. Neurosci., 21, 5328–5343.
Chacron, M. J., Longtin, A., & Maler, L. (2001b). Simple models of bursting and non-bursting P-type electroreceptors. Neurocomputing, 38, 129–139.
Chacron, M. J., Longtin, A., St.-Hilaire, M., & Maler, L. (2000). Suprathreshold stochastic firing dynamics with memory in P-type electroreceptors. Phys. Rev. Lett., 85, 1576–1579.
Chacron, M. J., Pakdaman, K., & Longtin, A. (2002). Chaotic firing in the sinusoidally forced leaky integrate-and-fire model with threshold fatigue. Unpublished manuscript.
Coombes, S. (1999). Lyapunov exponents and mode-locked solutions for integrate-and-fire dynamical systems. Phys. Lett. A, 255, 49–57.
Coombes, S., & Bressloff, P. C. (1999). Mode locking and Arnold tongues in integrate-and-fire neural oscillators. Phys. Rev. E, 60, 2086–2096.
Coombes, S., Owen, M. R., & Smith, G. D. (2001). Mode locking in a periodically forced integrate-and-fire-or-burst neuron model. Phys. Rev. E, 64, 041914.
Ermentrout, B., Pascal, M., & Gutkin, B. (2001). The effects of spike frequency adaptation and negative feedback on the synchronization of neural oscillators. Neural Comp., 13, 1285–1310.
Fuhrmann, G., Segev, I., Markram, H., & Tsodyks, M. (2002). Coding of temporal information by activity-dependent synapses. J. Neurophysiol., 87, 140–148.
Geisler, C. D., & Goldberg, J. M. (1966). A stochastic model of the repetitive activity of neurons. Biophys. J., 7, 53–69.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Glass, L., & Mackey, M. C. (1979). A simple model for phase locking of biological oscillators. J. Math. Biol., 7, 339–352.
Goldberg, J. M., Adrian, H. O., & Smith, F. D. (1964). Response of neurons of the superior olivary complex of the cat to acoustic stimuli of long duration. J. Neurophysiol., 27, 706–749.
Goldman, M. S., Maldonado, P., & Abbott, L. F. (2002). Redundancy reduction and sustained firing with stochastic depressing synapses. J. Neurosci., 22, 584–591.
Hausser, M., & Roth, A. (1997). Dendritic and somatic glutamate receptor channels in rat cerebellar Purkinje cells. J. Physiol. (Lond.), 501, 77–95.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–544.
Holden, A. V. (1976). Models of the stochastic activity of neurones. New York: Springer-Verlag.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Keat, J., Reinagel, P., Reid, R. C., & Meister, M. (2001). Predicting every spike: A model for the response of visual neurons. Neuron, 30, 803–817.
Keener, J. P. (1980). Chaotic behavior in piecewise continuous difference equations. Trans. Amer. Math. Soc., 261, 589–604.
Keener, J. P., Hoppensteadt, F. C., & Rinzel, J. (1981). Integrate-and-fire models of nerve membrane response to oscillatory input. SIAM J. Appl. Math., 41, 816–823.
Kretzberg, J., Egelhaaf, M., & Warzecha, A. K. (2001). Membrane potential fluctuations determine the precision of spike timing and synchronous activity: A model study. J. Comput. Neurosci., 10, 79–97.
Kuffler, S. W., Fitzhugh, R., & Barlow, H. B. (1957). Maintained activity in the cat's retina in light and darkness. J. Gen. Physiol., 40, 683–702.
Lansky, P., & Sato, S. (1999). The stochastic diffusion models of nerve membrane depolarization and interspike interval generation. J. Peripher. Nerv. Syst., 4, 27–42.
Le Calvez, P. (2000). Dynamical properties of diffeomorphisms of the annulus and of the torus. Providence, RI: American Mathematical Society.
Liu, Y. H., & Wang, X. J. (2001). Spike frequency adaptation of a generalized leaky integrate-and-fire neuron. J. Comput. Neurosci., 10, 25–45.
Longtin, A., & Racicot, D. M. (1998). Spike train patterning and forecastability. Biosystems, 40, 111–118.
MacGregor, R. J., & Lewis, E. R. (1977). Neural modeling. New York: Plenum Press.
Madison, D. V., & Nicoll, R. A. (1984). Control of the repetitive discharge of rat CA1 pyramidal neurones in vitro. J. Physiol., 354, 319–331.
Meiss, J. D. (1992). Symplectic maps, variational principles, and transport. Rev. Mod. Phys., 64, 795–848.
Mickus, T., Jung, H. Y., & Spruston, N. (1999). Properties of slow cumulative sodium channel inactivation in rat hippocampal CA1 pyramidal neurons. Biophys. J., 76, 846–860.
Neiman, A., & Russell, D. F. (2001). Stochastic biperiodic oscillations in the electroreceptors of paddlefish. Phys. Rev. Lett., 86, 3443–3446.
Pakdaman, K. (2001). The periodically forced leaky integrate-and-fire model. Phys. Rev. E, 63, 1907–1912.
Pakdaman, K., Tanabe, S., & Shimokawa, T. (2001). Coherence resonance and discharge time reliability in neurons and neuronal models. Neural Networks, 14, 895–905.
Pakdaman, K., & Vibert, J. F. (1995). Modeling excitatory networks. Journal of SICE, 34, 788–793.
Pakdaman, K., Vibert, J. F., Boussard, E., & Azmy, N. (1996). Single neuron with recurrent excitation: Effect of the transmission delay. Neural Networks, 9, 797–818.
Panzeri, S., Petersen, R. S., Schultz, S. R., Lebedev, M., & Diamond, M. E. (2001). The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 29, 769–777.
Reiss, R. F. (1962). A theory and simulation of rhythmic behavior due to reciprocal inhibition in small nerve nets. Proceedings of the 1962 AFIPS Spring Joint Computer Conference, 21, 171–194.
Rescigno, A., Stein, R. B., Purple, R. L., & Popele, R. E. (1970). A neuronal model for the discharge patterns produced by cyclic inputs. Bull. Math. Biophys., 32, 337–353.
Ricciardi, L. M., Di Crescenzo, D., Giorno, V., & Nobile, A. G. (1999). An outline of theoretical and algorithmic approaches to first passage time problems with applications to biological modelling. Mathematica Japonica, 50, 247–322.
Schäfer, K., Braun, H. A., Peters, C., & Bretschneider, F. (1995). Periodic firing pattern in afferent discharges from electroreceptor organs of catfish. Eur. J. Physiol., 429, 378–385.
Segundo, J. P., Perkel, D. H., Wyman, H., Hegstad, H., & Moore, G. P. (1968). Input-output relations in computer-simulated nerve cells. Kybernetik, 4, 157–175.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophys. J., 5, 173–194.
Teich, M. C., & Lowen, S. B. (1994). Fractal patterns in auditory nerve spike trains. IEEE Eng. Med. Biol. Mag., 13, 197–202.
Tiesinga, P. H. E., Fellous, J. M., José, J. V., & Sejnowski, T. J. (2002). Information transfer in entrained cortical neurons. Network, 13, 41–66.
Treves, A. (1996). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology: I. Cambridge: Cambridge University Press.
Vibert, J. F., Pakdaman, K., & Azmy, N. (1994). Inter-neural delay modification synchronizes biologically plausible neural networks. Neural Networks, 7, 589–607.
von Gersdorff, Schneggenburger, R., Weis, S., & Neher, E. (1997). Presynaptic depression at a calyx synapse: The small contribution of metabotropic glutamate receptors. J. Neurosci., 17, 8137–8146.
Wang, L. Y., Gan, L., Forsythe, I. D., & Kaczmarek, L. K. (1998). Contribution of the Kv3.1 potassium channel to high-frequency firing in mouse auditory neurones. J. Physiol. (Lond.), 509, 183–194.
Wehmeier, W., Dong, D., Koch, C., & van Essen, D. (1989). Modeling the mammalian visual system. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.
Weiss, T. F. (1966). A model of the peripheral auditory system. Kybernetik, 3, 153–175.
Wilson, D. M., & Waldron, I. (1968). Models for the generation of the motor output pattern in flying locusts. Proceedings of the IEEE, 56, 1058–1064.
Yamamoto, A., & Nakahama, H. (1983). Stochastic properties in somatosensory cortex and mesencephalic reticular formation during sleep-waking states. J. Neurophysiol., 49, 1182–1198.

Received April 9, 2002; accepted July 8, 2002.
LETTER
Communicated by Roderick V. Jensen
Reliability of Spike Timing Is a General Property of Spiking Model Neurons Romain Brette
[email protected] Centre de MathÂematiques et de Leurs Applications,Ecole Normale SupÂerieurede Cachan, 94230 Cachan, France, and INSERM U483, UniversitÂe Pierre et Marie Curie, 75005 Paris, France
Emmanuel Guigon
INSERM U483, Université Pierre et Marie Curie, 75005 Paris, France
The responses of neurons to time-varying injected currents are reproducible on a trial-by-trial basis in vitro, but when a constant current is injected, small variances in interspike intervals across trials add up, eventually leading to a high variance in spike timing. It is unclear whether this difference is due to the nature of the input currents or to the intrinsic properties of the neurons. Neuron responses can fail to be reproducible in two ways: dynamical noise can accumulate over time and lead to a desynchronization over trials, or several stable responses can exist, depending on the initial condition. Here we show, through simulations and theoretical considerations, that for a general class of spiking neuron models, which includes, in particular, the leaky integrate-and-fire model as well as nonlinear spiking models, aperiodic currents, contrary to periodic currents, induce reproducible responses, which are stable under noise, changes in initial conditions, and deterministic perturbations of the input. We provide a theoretical explanation for aperiodic currents that cross the threshold.

1 Introduction

The responses of neurons to dynamic stimuli have been shown to exhibit high reliability in vitro (Mainen & Sejnowski, 1995; Hunter, Milton, Thomas, & Cowan, 1998; Fellous et al., 2001; Beierholm, Nielsen, Ryge, Alstrøm, & Kiehn, 2001) and in vivo (Berry, Warland, & Meister, 1997; Nowak, Sanchez-Vives, & McCormick, 1997; Reich, Victor, Knight, Ozaki, & Kaplan, 1997; Berry & Meister, 1998; Buracas, Zador, DeWeese, & Albright, 1998; Bair & Koch, 1996). In this case, spike timing is reproducible on a trial-by-trial basis up to a precision of 1 ms or less, even a long time after stimulus onset (1 s in Mainen & Sejnowski, 1995). Consequently, in a variety of neurons and for time-varying stimuli, spike times convey more information about the stimulus than the spike count alone does (Victor & Purpura, 1996;

Neural Computation 15, 279–308 (2003)
© 2002 Massachusetts Institute of Technology
280
R. Brette and E. Guigon
Figure 1: Three cases regarding the reproducibility of neuron responses to a given stimulus. (A) Small variances in ISIs accumulate, so that after some time, spikes are not synchronized between trials. (B) Spike timing is reliable even in the long run but depends on the initial state of the neuron—here, two stable solutions coexist. (C) Spike timing is reliable; noise does not accumulate over time.
de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Buracas et al., 1998; Reich, Mechler, Purpura, & Victor, 2000; Reinagel & Reid, 2000). Rate-modulated Poisson processes do not account for these properties (Victor & Purpura, 1996; Baddeley et al., 1997; de Ruyter van Steveninck et al., 1997; Berry et al., 1997; Nowak et al., 1997; Reich, Victor, & Knight, 1998; Kara, Reinagel, & Reid, 2000; Reich et al., 2000). On the other hand, stationary stimuli elicit spike trains with very coarse precision (Mainen & Sejnowski, 1995; Berry et al., 1997; de Ruyter van Steveninck et al., 1997; Buracas et al., 1998). Mainen and Sejnowski (1995) injected a constant current into a neuron and compared the output spike trains over repeated trials. At the beginning, spikes occurred at the same time in all trials, but then small variances in interspike intervals (ISIs) began to accumulate, so that spikes eventually were totally desynchronized over trials. It is not clear why this does not happen with time-varying currents. What happens when the responses of a neuron to a given stimulus are compared over repeated trials, knowing that each trial is different due to intrinsic noise? Three cases may occur, as illustrated in Figure 1:

Case 1: Small variances in ISIs accumulate, so that eventually spikes tend to desynchronize over trials.
Reliability of Spike Timing
281
Case 2: Spike times are reproducible in the long run but depend on the initial state of the neuron (e.g., on the membrane potential at the onset of the stimulus); that is, several stable responses exist. Case 3: Spike times are reproducible in the long run and do not depend on the initial state of the neuron. Experimentally, case 1 has been shown to occur for constant input currents, whereas time-varying stimuli induce reproducible responses (Mainen & Sejnowski, 1995), although the experimental procedures used in these studies do not allow us to distinguish between cases 2 and 3. Only in the last case can the responses be considered truly reproducible. In this article, we show that for a general class of spiking neuron models, which includes, in particular, the leaky integrate-and-re model (Lapicque, 1907; Knight, 1972) as well as nonlinear spiking models, all three cases can occur if the input current is periodic, while aperiodic currents induce reproducible responses. In addition to numerical simulations, we put forth a theoretical explanation of this property for aperiodic currents that oscillate around threshold. The conditions required for our explanation are not fullled by the nonleaky integrate-and-re model—also called perfect integrator—which is never reliable. 2 Materials and Methods 2.1 What Is Reliability? A neuron is said to be reliable when its responses to a given stimulus are reproducible. Thus, reliability is a priori stimulus dependent. The experimental results of Mainen and Sejnowski (1995) show that with a constant current, the responses of a neuron in different trials are initially synchronous, but then small variances in ISIs add up and the spikes become desynchronized. If the intrinsic noise of neurons were lower, it would simply take the spikes longer to desynchronize, but in the end, the result would be the same. However, with nonconstant currents resembling synaptic inputs (low-pass ltered gaussian noise), desynchronization does not happen. 
In both cases, intrinsic noise is low at the timescale of a single ISI, but it eventually leads to a desynchronization only in the case of a constant current. Thus, we propose the following definition of reproducibility: responses to a stimulus are considered reproducible if the spike time jitter remains of the order of the intrinsic noise, however long we wait, or, equivalently, if any given precision can be reached when the level of noise is low enough. This definition allows us to distinguish the two cases mentioned above. However, neuron responses can also fail to be reproducible if spike timing depends on the initial condition, that is, the membrane potential at stimulus onset. Therefore, we shall consider in the definition the asymptotic spike time jitter induced not only by intrinsic noise but also by the dispersion of initial conditions.
R. Brette and E. Guigon
Our aim is to show that depending on the stimulus, all three cases mentioned in section 1 can occur. Furthermore, we shall show that a broad class of models has reproducible responses to aperiodic stimuli. We pay particular attention to leaky models, in a general sense, as defined below. However, we also introduce and study other types of models, the perfect integrator and nonleaky models, to show that the nonreliability of the perfect integrator is the exception rather than the rule.

2.2 Models.

2.2.1 The General Model. We consider spiking neuron models, where the membrane potential V evolves according to the following one-dimensional differential equation,

dV/dt = f(V, t),     (2.1)
with the spike modeled as follows: when the potential reaches a threshold Vt, it is reset to Vr. This general definition includes many different models, such as the leaky integrator (Lapicque, 1907; Knight, 1972) and the perfect integrator (Knight, 1972). The equation need not be linear. Such models are not always reliable, as the example of the perfect integrator shows. However, we shall show that a very broad class of models of this kind exhibits reproducible responses.

2.2.2 Leaky Models. One example of such a model is the leaky integrator (Lapicque, 1907; Knight, 1972), where the membrane potential V evolves according to the following equation:

τ dV/dt = −V + RI(t),     (2.2)
where τ is the membrane time constant, R is the resistance, and I(t) is the input current. We generalize the notion of leak by defining a leaky model as a model of type 2.1 satisfying the following condition:

∂f/∂V < 0.
This includes the case when τ and R vary in time and also includes nonlinear equations. Basic mathematical properties of leaky models are summarized in the appendix. In our simulations, we used a leaky integrator with τ = 33 ms, R = 200 MΩ, Vt = 15 mV, and Vr = −5 mV, and a nonlinear leaky model defined by

τ dV/dt = −aV³ + RI(t),     (2.3)
Reliability of Spike Timing
with the same values for τ, R, Vt, and Vr, and a = 4444 V⁻² (this is for scaling reasons). This corresponds to a neuron model with rectification, that is, with a voltage-dependent leak. We chose these parameter values so as to obtain figures of the same magnitude as in experimental results. However, the actual values of the parameters have no influence on the results, since two sets of parameters actually define the same model (up to a change of variables).

2.2.3 Nonleaky Models. We shall show that the leak is not necessary to obtain reliable spike timing. We simulated an example of a nonleaky model, described by the following equation,

τ dV/dt = V I(t) + k,     (2.4)
where I(t) is the input current, with Vt = 1, Vr = 0, τ = 33 ms, and k = 1. This model is leaky only if I(t) < 0 for all t. Otherwise, the solutions of equation 2.4 diverge exponentially in all intervals where I(t) > 0. However, we shall see that this does not prevent the model from having reliable spike timing. Equation 2.4 does not correspond to a neuron model; we chose it only to illustrate our explanation of reliability.

2.2.4 Perfect Integrator. As an example of a nonreliable model, we have taken the common nonleaky integrate-and-fire model,

τ dV/dt = f(t),     (2.5)
where f(t) is an input function. We used τ = 12.5 ms (in order to have comparable timescales), Vt = 1, and Vr = 0. We shall see in the simulations that this model is never reliable. Mathematically, the unreliability of this model results from the fact that the distance between two solutions is constant modulo 1, and thus dynamical noise always accumulates (Knight, 1972).

2.3 Stimuli. We tested the reproducibility of model responses to various periodic and aperiodic currents. We also assessed the influence of current parameters, such as amplitude and mean, on spike timing. For this purpose, we used the following generic method: we define a basis function B(t) as one with values between −1 and 1, such as a sinusoidal wave, and then derive a family of currents by a parameterized affine change of variables, for example, I(t) = Ith + 0.5 + pB(t) with p ∈ [0, 1], where Ith is the input current threshold (the minimum constant current eliciting a response). In this example, we obtain a family of currents with a fixed DC part and whose amplitude varies from 0 to 1.
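This family construction and its effect on a model neuron can be sketched in a few lines of Python. This is our own illustration (forward Euler on equation 2.2), not the authors' code; the 50 pA DC offset, the amplitude scale, and the picoampere units are illustrative assumptions.

```python
import numpy as np

def lif_spikes(I_pA, dt=5e-4, tau=0.033, R=200e6, Vt=0.015, Vr=-0.005):
    """Forward-Euler leaky integrator (equation 2.2) with threshold Vt
    and reset Vr; I_pA is the sampled input current in picoamperes.
    Returns the list of spike times in seconds."""
    V, spikes = 0.0, []
    for n, I in enumerate(I_pA):
        V += dt / tau * (-V + R * I * 1e-12)
        if V >= Vt:
            spikes.append(n * dt)
            V = Vr
    return spikes

# Affine family built from a sinusoidal basis B(t) in [-1, 1]; the
# threshold current is I_th = Vt / R = 75 pA. The 50 pA DC offset and
# the amplitude scale are illustrative choices, not the paper's values.
t = np.arange(0.0, 2.0, 5e-4)
B = np.sin(2 * np.pi * 20 * t)
I_th = 75.0
for p in (0.0, 0.5, 1.0):
    I = I_th + 50.0 + 50.0 * p * B   # touches the threshold current at p = 1
    spikes = lif_spikes(I)
```

For p < 1 the current stays strictly above the threshold current; at p = 1 it grazes it once per cycle, which is the regime the paper distinguishes from the suprathreshold one.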
Figure 2: Construction of a 2D spike density figure. (A) Basis function. A family of currents is defined with a parameter controlling the current mean. (B) Trials for a given parameter. Every short vertical line represents a spike. (C) PSTH computed from 2000 trials with the parameter used in B. (D) The PSTH is converted into gray rasters, with the higher the firing rate, the darker the lines. (E) Gray rasters for all parameters are collected and arranged vertically.
In most simulations, we chose the change of variables so that the current is always above threshold for p ∈ [0, 0.5] and oscillates around it for p ∈ [0.5, 1]. Subsequently, we want to display the reliability of model responses for all the parameters of a given family of currents on a single figure. How this is done is illustrated by Figure 2. For a given basis function (see Figure 2A), we compute the outputs of a large number of trials for each parameter value (see Figure 2B) and then compute a poststimulus time histogram (PSTH) using small time bins (see Figure 2C), which we convert into a line of gray rasters (see Figure 2D). Arranging these lines vertically, we get a two-dimensional spike density figure (see Figure 2E), where histograms for all parameter values between 0 and 1 are represented simultaneously, with parameter on
the vertical axis and time on the horizontal axis. This allows us to visualize not only the reliability of spike timing for every parameter value, but also the effect of parameters on spike timing and reliability. Thus, we assess the stability of spike timing under noise as well as under deterministic perturbations. For periodic currents, we used sine waves, and for aperiodic currents, we used two types of basis functions: (1) a triangle-shaped function, with the slope of each line chosen according to a random uniform distribution between two values, and (2) a noisier function: a low-pass filtered gaussian noise that we distorted so that its magnitude distribution over time was uniform between −1 and 1 (see, e.g., section 4.2.2). This was done to distinguish the case of an input current always above threshold from that of an input current oscillating around it. The specific shape of the input currents did not influence the results.

2.4 Measure of Precision. Precision of spike timing was assessed from PSTHs. Most previous measures relied on the identification of “events” (Mainen & Sejnowski, 1995), defined as peaks in the PSTH, whose width is an estimate of the precision. However, assessing the precision only in events might lead to an overestimate, as observed in Tiesinga, Fellous, & Sejnowski (2002a), since the periods when the PSTH is not peaked enough are discarded. Other measures are based on the variance of the PSTH about the mean output frequency (Hunter et al., 1998), but this measure is misleading when the output frequency varies over time. The measure we propose here is based on the entropy of distributions (Borst & Theunissen, 1999). We cannot directly compute the entropy from a PSTH, because it is not a normalized probability distribution. We choose the successive ISIs of one trial to divide the PSTH into successive windows. The models we use have the property that for every other trial, the neuron spikes exactly once in each of these windows.
Equivalently, we can split the PSTH into successive windows containing 2000 spikes (the number of trials). We can evaluate the precision of spike timing in any window from the PSTH using the following formula,

H = −∑_i p(i) log₂ p(i),

where p(i) is the number of spikes in time bin i divided by the number of trials. H is the entropy of the distribution of spike times in the window and represents the number of bits required on average to express in which time bin a neuron spikes. One advantage over event-based measures is that if there are m stable solutions, each PSTH window contains m peaks, which gives a higher entropy. However, this quantity depends on the size of the time bin used for the PSTH. We can translate it into a time measure that does not depend on the size of the time bin in the following way. An entropy of H bits is the entropy of a uniform distribution over 2^H time bins, that is, over
an interval of size dt·2^H, where dt is the length of the time bin, which gives the following formula:

dt exp(−∑_i p(i) log p(i)).     (2.6)
As dt tends to 0, this quantity tends to

exp(−∫ p(t) log p(t) dt),

where p(t) represents the spike time density. Thus, if dt is small enough, the precision measure given by formula 2.6 does not depend on the length of the time bin. As an example, this measure gives a precision of Δ for a uniform distribution over an interval [t − Δ/2, t + Δ/2] and a precision of 4σ for a gaussian distribution with standard deviation σ. We used formula 2.6 to compute the precision in each window and averaged over all successive windows, discarding the first 2.5 s of the 10 s runs (to take the time of convergence into account). In the results, we shall say that neuron responses have a high precision if their measure of precision given by formula 2.6 is small.

2.5 Details of Simulations. Input currents were sampled at a rate depending on the required resolution—2000 Hz in most simulations with the leaky integrator. Equations for the linear leaky integrator and the perfect integrator were solved by exact integration. The equations of the two other models—the nonlinear and the nonleaky ones—were integrated by the Euler method (these were simulated mostly for illustration, and only on a short timescale). Spike timing was computed by bisection with a precision of 1 µs using the following observation: over a single current sampling step (1/2000 s), equation 2.1 reduces to an autonomous scalar equation because the input current is constant, which means that V(t) is monotonic over this duration, and the time when the potential hits threshold can then be computed unambiguously and quickly by bisection. A small noise was added to the potential at each time step of integration. The type of noise used is irrelevant because it tends to a gaussian noise as the time step tends to 0. Therefore, for computational ease, we used a uniform noise with an amplitude ranging from 2.5 mV·s^(−1/2) to 3.5 mV·s^(−1/2).
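The bisection step described above can be sketched as follows. This is our own sketch, not the authors' code; the closed-form V(t) used to test it is the standard solution of the linear leaky integrator over a step of constant current.

```python
import math

def crossing_time(V, t0, t1, Vt, tol=1e-6):
    """Bisection for the time at which a solution V(t), monotonically
    increasing on [t0, t1] with V(t0) < Vt <= V(t1), first reaches
    threshold Vt. V is any callable; tol is the precision in seconds."""
    while t1 - t0 > tol:
        tm = 0.5 * (t0 + t1)
        if V(tm) < Vt:
            t0 = tm
        else:
            t1 = tm
    return 0.5 * (t0 + t1)

# Over one sampling step the current is constant, so the linear leaky
# integrator has the closed form V(t) = R*I + (V0 - R*I) * exp(-t/tau).
tau, RI, V0 = 0.033, 0.030, 0.0        # R*I = 30 mV, start from rest
V = lambda t: RI + (V0 - RI) * math.exp(-t / tau)
t_spike = crossing_time(V, 0.0, 0.2, 0.015)   # first crossing of 15 mV
```

Because V(t) is monotonic on the step, the bracket [t0, t1] always contains exactly one crossing, so the bisection is unambiguous.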
Reliability of spike timing was assessed by 2000 repeated trials for every tested model and input current over an extended period of time—10 s for the leaky integrator—and PSTHs were computed with a 0.5 ms time bin. Running many trials avoided the need to smooth the PSTHs.
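To make the precision measure of section 2.4 concrete, here is a minimal sketch of our own (not the authors' code): the entropy of one PSTH window and its conversion to a timescale via formula 2.6, checked against the uniform and gaussian examples given above.

```python
import numpy as np

def precision(counts, n_trials, dt):
    """Entropy-based precision of one PSTH window (formula 2.6):
    dt * 2**H, where H is the entropy in bits of the normalized
    spike-count distribution. counts[i] is the number of spikes in
    time bin i; each trial contributes one spike to the window."""
    p = np.asarray(counts, dtype=float) / n_trials
    p = p[p > 0]                        # 0 log 0 = 0 by convention
    H = -np.sum(p * np.log2(p))         # entropy in bits
    return dt * 2.0 ** H

# A uniform distribution over a 20 ms interval gives precision 20 ms;
# a gaussian with standard deviation s gives about 4*s (exactly
# s*sqrt(2*pi*e) ~ 4.13*s in the dt -> 0 limit).
dt = 0.001
uniform = np.full(20, 100)              # 2000 spikes over 20 bins
x = np.arange(-0.05, 0.05, dt)
gauss = np.exp(-x**2 / (2 * 0.005**2))  # sigma = 5 ms
gauss = 2000 * gauss / gauss.sum()
```

The measure is bin-size independent once dt is small relative to the jitter, which is why the paper can quote precisions in milliseconds while using 0.5 ms bins.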
3 Theoretical Predictions

3.1 Periodic Currents. Responses to periodic currents highlight the two difficulties that may prevent neuron responses from being reproducible: (1) a small noise may accumulate over time, or (2) several stable responses may coexist. Because periodically driven leaky models are related to the dynamical theory of homeomorphisms of the circle (Denjoy, 1932; Coddington & Levinson, 1955; Arnold, 1961; Herman, 1977; Keener, 1980; Veerman, 1989), their mathematical properties are well known (Knight, 1972; Keener, Hoppensteadt, & Rinzel, 1981). However, the reliability of periodically driven neuron models is a recent issue (Tiesinga, 2002; Tiesinga, Fellous, & Sejnowski, 2002b). We describe the available theoretical results and explain how they relate to our problem. The dynamics of leaky models depend on the value of the rotation number, defined as the ratio of the input frequency to the output spike rate: it is rational if and only if the output pattern of spikes is (asymptotically) periodic, the period being a multiple of the input period. When this pattern is stable under perturbations, this is called phase locking, which has also been studied experimentally (Rescigno, Stein, Purple, & Poppele, 1970; Ascoli, Barbi, Chillemi, & Petracchi, 1977; Guttman, Feldman, & Jakobsson, 1980; Koppl, 1997). There is p : q phase locking when a stable pattern of p spikes is produced every q periods of the input. Thus, when the model is phase-locked, a small noise does not accumulate over time. However, several stable responses may coexist. As noted in Tiesinga (2002), if a p : q phase-locked solution is shifted by a multiple of the input period, a new solution is obtained, so that at least q distinct stable solutions exist. Thus, regarding reliability, cases 2 and 3 can appear.
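The rotation number can be estimated numerically as the ratio of the input frequency to the measured output rate. The following sketch uses the sinusoidal current family that appears later in Figure 6; the forward-Euler scheme, the time step, and the one-second transient cutoff are our own choices, so the estimate is only meaningful up to discretization error.

```python
import numpy as np

def rotation_number(p, f_in=20.0, dt=1e-4, T=5.0,
                    tau=0.033, R=200e6, Vt=0.015, Vr=-0.005):
    """Estimate the rotation number (input frequency over output rate)
    of a leaky integrator driven by the sinusoidal current family of
    Figure 6: I(t) = 85 + 40*(1 - p) + 30*sin(2*pi*f_in*t) pA.
    Forward Euler; the first second is discarded as a transient."""
    t = np.arange(0.0, T, dt)
    I = (85 + 40 * (1 - p) + 30 * np.sin(2 * np.pi * f_in * t)) * 1e-12
    V, n_spikes = 0.0, 0
    for k in range(len(t)):
        V += dt / tau * (-V + R * I[k])
        if V >= Vt:
            if t[k] > 1.0:
                n_spikes += 1
            V = Vr
    if n_spikes == 0:
        return float("inf")
    return f_in * (T - 1.0) / n_spikes
```

On a plateau of the rotation number (phase locking), this estimate is constant over a range of parameters; where it varies continuously, cases 1, 2, and 3 alternate as described in the text.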
When the rotation number is irrational, under some regularity conditions, the dynamics of the model are topologically equivalent to the dynamics of a model with constant input, which means that a small noise always accumulates over time, however small it may be. Thus, case 1 occurs. When the input current is parameterized, for example, by its frequency and mean, the rotation number is a continuous function of the parameters, which implies that cases 1, 2, and 3 occur in any area of the parameter space where the rotation number is not constant. Thus, there is no systematic relationship between current parameters and reliability. However, while the set of parameters where case 1 occurs has positive measure if the current is always above threshold (Herman, 1977), it has measure zero, and thus is not observable, if the current oscillates around threshold (Keener, 1980; see Veerman, 1989, for a rigorous mathematical proof). To sum up, cases 2 and 3 occur in any area where the current oscillates around threshold; cases 1,
Figure 3: A current below threshold results in an interval where no spiking is possible. Arrows represent whether the vector field points up or down at threshold.
2, and 3 occur in any area where the current is above threshold (unless the rotation number is constant). The (asymptotic) spike time jitter is an increasing function of the level of intrinsic noise, which tends to 0 as the noise variance tends to 0 when the model is phase-locked and to a positive number when it is not. Quantitatively, this function depends on various detailed aspects of the dynamics, such as the Lyapunov exponent (Tiesinga, 2002).

3.2 Aperiodic Currents. In the case of aperiodic currents, such as filtered gaussian noise, we will see in the numerical results that neither case 1 nor case 2 arises. Other studies suggest that the responses are reliable when the current is above threshold (Pakdaman & Tanabe, 2001). Here, we explain why this should also be so when the input current oscillates around threshold. Our explanation relies on the construction of a set of possible spike times, which is done with no noise on the dynamics. In the appendix, we prove that this set is stable under noise and deterministic perturbations. Let t1 be a time when the input current goes below threshold. A solution arriving at threshold at time t1 cannot cross it and will hit threshold again only later, at time t2, as shown in Figure 3. The area defined by this solution and the line V = Vt cannot be reached by any other solution, since trajectories cannot cross. Therefore, no spike can occur between t1 and t2, which happens every time the current goes below threshold, and so such time intervals can be removed from the possible spike times. The area defined by the solutions starting at reset between t1 and t2 cannot be reached by any solution starting before t1, since such a solution would spike between t1 and
Figure 4: Construction of unreachable time intervals. As in Figure 3, time is on the horizontal axis and membrane potential on the vertical axis. The colored areas are unreachable by any solution. Thus, the sample solution, shown as a dashed line, always remains in the white area. The simulation was 500 ms long with a leaky integrator.
t2. The spike times of these forbidden solutions define a whole sequence of unreachable spike time intervals. Therefore, every time the input current goes below threshold, a whole sequence of intervals is removed from the possible spike times, as shown in Figure 4. We can see that the first time the current goes below threshold, a red area that no solution can reach is created near threshold. The solutions starting from the resulting forbidden time interval are in red. The next time the current goes below threshold occurs outside the red area and creates a new sequence of yellow stripes. The third time occurs inside the red area, and the result is displayed in blue. Thus, a solution starting at time 0 always remains in the white area, as shown by the dashed black line. One can see that the white area gradually gets smaller and smaller. Based on the above considerations, we predict that when the current oscillates around threshold, the set of possible spike times eventually gets very narrow, so that spike timing should be reproducible in the long run, even if the potential is not fixed at reset at stimulus onset. For the general model given by equation 2.1, “the current goes below threshold” means that the vector field points downward at threshold, that is, f(Vt, t) < 0. In the construction described above, we implicitly assumed that a solution is at reset potential at time t only when it spikes at time t, which is not necessarily so in the general case of equation 2.1. For example, in Figure 4, one might imagine a solution starting at time 0, going below reset, and then going up into the first red stripe. This cannot happen if we assume f(Vr, t) > 0 for all t. Another sufficient condition to avoid this problem is to assume that the model is leaky, as discussed in the appendix. Neither of these conditions is fulfilled by the perfect integrator. Indeed, in this model, when the vector field points downward at threshold, it also does so at reset.
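The construction described above can be sketched numerically. The following is our own forward-Euler illustration, not the authors' implementation; the dip stimulus is an artificial example chosen so that the current goes below the threshold current Vt/R exactly once.

```python
import numpy as np

def forbidden_intervals(I, dt, n_seq=4, tau=0.033, R=200e6,
                        Vt=0.015, Vr=-0.005):
    """Sketch of the forbidden-interval construction for a leaky
    integrator (forward Euler). I is the sampled current in amperes.
    Starting from the first time the vector field points downward at
    threshold (-Vt + R*I(t) < 0), solutions launched at threshold and
    at reset delimit intervals in which no spike can occur."""
    def run(V0, i0):
        # Integrate from (i0, V0) until V reaches threshold; return the
        # crossing index, or None if the end of the stimulus is reached.
        V = V0
        for i in range(i0 + 1, len(I)):
            V += dt / tau * (-V + R * I[i])
            if V >= Vt:
                return i
        return None

    i1 = next((i for i in range(len(I)) if -Vt + R * I[i] < 0), None)
    if i1 is None:
        return []                        # current never goes below threshold
    a, b = i1, run(Vt, i1)               # graze threshold at t1, return at t2
    intervals = []
    while b is not None and len(intervals) < n_seq:
        intervals.append((a * dt, b * dt))
        a, b = run(Vr, a), run(Vr, b)    # next interval from reset solutions
        if a is None:
            break
    return intervals

# Illustrative stimulus: 150 pA with a dip to 0 pA (threshold is 75 pA).
dt = 5e-4
I = np.full(2000, 150e-12)
I[400:500] = 0.0
intervals = forbidden_intervals(I, dt)
```

Each dip of the current below threshold seeds one such sequence; in the paper, overlaying the sequences produced by every dip is what makes the set of reachable spike times shrink.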
The numerical simulations show that this model is never reliable. We briefly describe the numerical construction of the set of reachable spike times, as illustrated in Figure 4. First, find the first time t1 when the input current goes below threshold. This is the starting point of the first
forbidden interval. Then run the solution starting from threshold at time t1 and stop when it hits threshold again at time t2. The interval I = [t1, t2] is the first forbidden interval, where no spike can occur. Then run the solution starting from reset at time t1 (resp. t2) and stop when it hits threshold at time t3 (resp. t4), so as to obtain the next forbidden interval [t3, t4]. By repeating the process, we get a whole sequence of intervals where no spike can occur. In the remaining set of times, the current goes again below threshold. So find the first time when it does so after t2, and repeat the procedure. Thus, when the algorithm ends, we get the remaining set of reachable spike times, and we shall see that this set eventually gets very small. Since we chose to display the reliability of model responses for all parameters of a given family of currents (see Figure 5A) on a single figure, we computed the set of reachable spike times for each parameter (see Figures 5B and 5C) and displayed it in a two-dimensional white and black figure, with time on the horizontal axis and parameter on the vertical axis (see Figure 5D). Black dots represent reachable spike times. The explanation above does not apply to the case when the input current is always above threshold. However, in our numerical results, responses seem to be reproducible in both cases, although the required intrinsic noise for reproducibility is significantly lower than for currents crossing the threshold. Our explanation does not rule out the possibility of multiple stable solutions, but this does not seem to happen in our simulations.

4 Numerical Results

4.1 Periodic Currents. The results regarding periodic stimuli are displayed in Figure 6, which illustrates the theoretical results. A leaky integrator is driven by a 20 Hz sinusoidal current (see Figure 6A), whose mean is a decreasing linear function of the parameter.
For parameters p < 0.5, the current is always above threshold, whereas for p > 0.5, the current oscillates around threshold. The spike density figure (see Figure 6B) was computed without noise, with random initial potential for every trial, so that phase locking appears clearly. Adding dynamical noise only blurs this figure; it has no qualitative influence on the results. The model was run for 5 s. When phase locking occurs, all the stable solutions can be observed in the last 250 ms of the runs (see Figure 6B). Phase locking appears for a given parameter value when the density line for this value consists of a finite number of black dots only, that is, when the corresponding PSTH consists of a finite number of isolated peaks. Conversely, there is no phase locking when the density line consists of gray shades, that is, when the corresponding PSTH is a smooth distribution. As expected, responses are always phase-locked for p > 0.5, whereas they are not always so for p < 0.5. Mean firing rate and precision were computed for every parameter value with noise-free dynamics, as seen in Figure 6C (solid lines), where phase locking corresponds to plateaus in the firing rate and to drops in precision
Figure 5: Construction of the set of possible spike times for all parameters of a given family. (A) Basis function. (B) For a given parameter, we compute the times when no spike may occur (dark areas are not reachable). (C) This is converted into a line of white and black rasters, with black ones corresponding to possible spike times. (D) Collecting all these lines for all parameters and arranging them vertically, we get a two-dimensional white and black figure of possible spike times, with time on the horizontal axis and parameter on the vertical axis.
measure. Precision was also computed with a small noise on the dynamics (1 mV·s^(−1/2)), which results in a smoothed version of the noise-free precision plot, where only 1:1 phase locking appears clearly (see Figure 6C, dashed line). There is no trivial relationship between precision and parameter. The precision is better for p > 0.5, where the model is always phase-locked (the current oscillates around threshold), than for p < 0.5. The best precision is obtained for 1:1 phase locking, as observed in Hunter et al. (1998). We show the results for three parameter values (see Figures 6D–6F), which illustrate the three possible cases of reliability. The simulations were done with a small dynamical noise. Case 1: Noise accumulates. For p = 0, responses were not reproducible. After 5 s, the PSTH was flat, and individual trials were not synchronized (see Figure 6D). The model behaves in the same way as with a constant current. Case 2: Multiple stable solutions. For p = 1, the responses are phase-locked, as shown in the spike density figure (see Figure 6B). The spike rate is 13.3 Hz (see Figure 6C), which is two-thirds of the input frequency. Thus, there is 2:3 phase locking; the model neuron spikes twice every three periods, as shown in the individual trials (see Figure 6F). The density figure (see Figure 6B) and the PSTH (see Figure 6F) indicate six spikes every three periods, which means there are three stable solutions, resulting also in a higher sensitivity to noise (see Figure 6C, dashed line). Case 3: One stable solution. For p = 0.5, the spike rate is 20 Hz (see Figure 6C): the model is 1:1 phase-locked. There is only one stable solution, as seen in the individual trials (see Figure 6E). Thus, responses are truly reproducible, and precision is very high (see Figure 6D). In these simulations, case 3 occurs in only about 30% of the parameter range (the 1:1 phase-locking area).

4.2 Aperiodic Currents.

4.2.1 Noise Does Not Accumulate over Time. In Figure 7, we assessed the reliability of a noisy leaky integrator in response to a random triangle wave, whose mean is constant and whose amplitude varies linearly with the parameter (the current was constant for p = 0). For all trials, the membrane potential was at rest at stimulus onset; thus, we do not distinguish between cases 2 (several stable solutions) and 3 (one stable solution). The input current is always above threshold for p < 0.5 and oscillates around threshold for p > 0.5.

Figure 6: Facing page. Reliability of a leaky integrator driven by a periodic input current. The model is a linear leaky integrator. (A) Input current is a 20 Hz sine wave I(t) = 85 + 40(1 − p) + 30 sin(40πt) pA, where p is the parameter. The results are shown from 4.75 s after the onset of the stimulus.
(B) The model was run 2000 times for every parameter value (400 values between 0 and 1), with initial potential drawn uniformly between threshold and reset, but no noise on the dynamics, so that phase locking appears clearly. (Adding noise results in a blurred version of this figure.) The results are displayed as a spike density figure. (C) Firing rate was computed as a function of the parameter, with no noise on the dynamics. Precision was computed with no noise on the dynamics (solid line) and with a noise of 1 mV·s^(−1/2) (dashed line). Phase locking appears for parameter values below 0.5 as drops in precision. (D–F) For three different parameter values (0, 0.5, 1), we show the results of 10 trials of the model with noise on the dynamics (2.5 mV·s^(−1/2)) and random initial potential, and also show the PSTH, computed from 2000 trials. Note the different scales for firing rate in these graphs.
Trials were 10 s long. The spike density figure (see Figure 7B) shows that when the input current oscillated around threshold (p > 0.5), spikes were synchronized in the long run, and the actual spike times corresponded with our computational prediction of possible spike times (see Figure 7C). Tight synchronization can be clearly seen in the 10 individual trials computed for parameter p = 0.8 in Figure 7E, and the PSTH is very peaked. Thus, noise did not accumulate (case 2 or 3). Spiking seems less precise when the input current is always above threshold (p < 0.5), as shown in the spike density figure (see Figure 7B) and the precision plot (see Figure 7F, solid line). Individual trials for parameter p = 0.1 (see Figure 7D) show synchronized spikes at the beginning of the simulation, but not in the last 500 ms of the 10 s runs. This is reflected by the PSTH as a steady decrease in the height of the peaks during the first 500 ms and as a flat distribution during the last 500 ms. However, we also computed the precision of long runs (200 s) with a lower noise on the dynamics (0.7 mV·s^(−1/2)) (see Figure 7F, dashed line). This duration was enough for noise to accumulate in the case of a constant current (p = 0), but it did not lead to a desynchronization of trials for parameters above 0.2. Thus, it seems that in contrast with periodic currents, responses to aperiodic currents are also reproducible (case 2 or 3) when the current is above threshold, as suggested by other studies (Jensen, 1998; Pakdaman & Tanabe, 2001), though the level of tolerable noise is lower than when the current oscillates around threshold. The type of current used (e.g., gaussian instead of triangle wave) did not alter the results.

4.2.2 There Is Only One Stable Solution. In Figure 8, we assessed the reliability of a noisy leaky integrator in response to a random triangle wave,

Figure 7: Facing page. Reliability of a leaky integrator with initial potential at rest. The model is a linear leaky integrator.
(A) Input current is I(t) = 150 + 150pB(t) pA, where p is the parameter and B(t) is a triangular wave taking values between −1 and 1, with rising and falling times drawn uniformly between 10 ms and 50 ms. We show the results for the first and last 500 ms of the 10 s simulations. (B) The model was run 2000 times for every parameter value (400 values between 0 and 1), with membrane potential initially at rest value, and noise on the dynamics (3.5 mV·s^(−1/2)). The results are displayed as a spike density figure. (C) The set of possible spike times was computed for every parameter value. (D, E) For parameter values p = 0.1 (D) and p = 0.8 (E), 10 individual trials were extracted and displayed, along with the PSTH computed from 2000 trials. (F) Mean firing rate does not vary much with the parameter. The precision was computed as a function of the parameter from the last 7.5 s of 10 s runs with a noise of 3.5 mV·s^(−1/2) on the dynamics (solid line, 400 parameter values), and from the last 10 s of 200 s runs with a noise of 0.7 mV·s^(−1/2) (dashed line, 10 parameter values).
whose mean is constant and whose amplitude varies linearly with the parameter. For every trial, the membrane potential was drawn at random uniformly between rest and threshold, so that we could distinguish between cases 2 (several stable solutions) and 3 (one stable solution). When the input current oscillates around threshold (p > 0.5), we can see from the spike density figure (see Figure 8B) that spike timing becomes precise after some time. In the individual trials for parameter p = 0.95 (see Figure 8E), spikes initially occur at different times, depending on the membrane potential at onset, and eventually they become synchronized over trials, whatever the initial potential, which corresponds to case 3 and was predicted by the computation of possible spike times (see Figure 8C). The results are shown only for the first 500 ms, but the behavior after 9.5 s is similar to the one shown in Figure 7. When the input current is always above threshold (p < 0.5), responses are not reproducible with this magnitude of intrinsic noise, as the results of individual trials for p = 0.1 show (see Figure 8D). However, we found that with a lower intrinsic noise, responses were reproducible in the long run for small parameter values even after 200 s (precision of less than 6 ms for p > 0.2).

4.2.3 Spike Times Are Stable Under Deterministic Perturbations. Do spike times in the long run change when the input current is increased? The spike times of a neuron driven by a constant current are not stable under deterministic perturbations: if the (constant) ISI is T and a slight decrease of the current makes it T + δ, then the difference between the times of the nth spikes is nδ, leading to a desynchronization of the two patterns of spikes after enough time. In Figure 9, we simulated a noisy leaky integrator driven by a random triangle wave, whose mean and amplitude increased linearly with the parameter. The current oscillated around threshold for every parameter value.
For every trial, the membrane potential was drawn at random uniformly between rest and threshold. The mean firing rate, computed over 10 s, increased from 22 to 35 Hz with the parameter (see Figure 9D), but spike times did not vary significantly with the parameter, even after 750 ms (see Figures 9B and 9C).

Figure 8: Facing page. Reliability of a leaky integrator with random initial potential. The model is a linear leaky integrator. (A) Input current is I(t) = 150 + 150pB(t) pA, where p is the parameter and B(t) is a normalized filtered gaussian noise with time constant 10 ms. (B) The model was run 2000 times for every parameter value (400 values between 0 and 1), with initial potential drawn uniformly between rest and threshold, and noise on the dynamics (2.5 mV·s^(−1/2)). The results are displayed for the first 500 ms as a spike density figure. (C) The set of possible spike times was computed for every parameter value. (D, E) For parameter values p = 0.1 (D) and p = 0.95 (E), 10 individual trials were extracted and displayed, along with the PSTH computed from 2000 trials.
Reliability of Spike Timing
297
298
R. Brette and E. Guigon
Thus, an increase in firing rate creates additional spikes rather than a shift of all spikes. Precision was almost constant for all parameter values, around 3 ms (see Figure 9D). Thus, when the input current oscillates around threshold, responses are stable under significantly large deterministic perturbations.

4.2.4 Nonlinear and Nonleaky Models Are Also Reliable. In Figure 10, we assessed the reliability of a noisy leaky nonlinear model driven by equation 2.3 (see Figures 10B and 10C) and a noisy nonleaky model described by equation 2.4 (see Figures 10D and 10E), driven by a Gaussian input current whose mean is constant and amplitude varies linearly with the parameter. For every trial, the membrane potential was drawn at random uniformly between rest and threshold. Again, responses were reproducible (case 3) when the input current oscillated around threshold (p > 0.5).

4.2.5 The Perfect Integrator Is Never Reliable. As a counterexample, we simulated a noisy perfect integrator (Knight, 1972; see Figure 11), driven by a Gaussian current whose mean is constant and amplitude varies linearly with the parameter. For all trials, the membrane potential was at rest at stimulus onset. The hypotheses required for our theoretical explanation are not fulfilled by this model. We can see in Figure 11B that whether the input current is above threshold (p < 0.5) or oscillates around it (p > 0.5), the noise accumulates and leads to a complete desynchronization of spikes over trials (case 1). Figure 11C shows that the set of possible spike times does not get smaller over time, and Figure 11D that the precision is always above 13 ms, which is half the average ISI.

5 Discussion

We identified two ways in which neuron responses can fail to be reproducible: dynamical noise can accumulate over time and lead to a desynchronization over trials, or several stable responses can exist, depending on the initial condition.
These two cases can occur when the model neuron is driven by a periodic current, but the former can occur only if the current is above threshold (Keener et al., 1981). In contrast, we showed that responses of a general class of spiking neuron models to aperiodic currents oscillating around threshold are reproducible: spike timing in the long run is stable under noise and deterministic perturbation and does not depend on the initial condition. We provided a theoretical explanation based on a geometrical construction that shows that the set of possible spike times becomes smaller and smaller over time. Our numerical results suggest that responses to aperiodic currents above threshold are also reproducible, which confirms other studies (Pakdaman & Tanabe, 2001), but the level of tolerable noise seems lower in this case.
Figure 9: Stability of spike timing under deterministic perturbations. The model is a linear leaky integrator. (A) Input current is I(t) = 112.5 + (75 + 37.5p)B(t) + 37.5p pA, where p is the parameter and B(t) is a triangular wave taking values between −1 and 1, with rising and falling times drawn uniformly between 10 ms and 50 ms. Thus, the mean input current increases with the parameter from 112.5 pA to 150 pA. (B) The model was run 2000 times for every parameter value (400 values between 0 and 1), with initial potential drawn uniformly between rest and threshold and noise on the dynamics (2.5 mV·s^{−1/2}). The results are displayed for the first 750 ms as a spike density figure. (C) The set of possible spike times was computed for every parameter value. (D) Mean firing rate and precision were computed from 10 s simulations.
Figure 10: Reliability of a nonlinear and a nonleaky model. We assessed the reliability of two models that differ from the linear leaky integrator: a leaky nonlinear model driven by equation 2.3 (B, C) and a nonleaky model driven by equation 2.4 (D, E). (A) Input current is I(t) = 150 + 150pB(t) pA for the leaky nonlinear model and I(t) = 0.5 + 3pB(t) for the nonleaky model (unit is irrelevant), where p is the parameter and B(t) is a normalized filtered Gaussian noise with time constant 10 ms. (B) The nonlinear leaky model was run 2000 times for every parameter value (400 values between 0 and 1), with initial potential drawn uniformly between rest and threshold, and noise on the dynamics (2.5 mV·s^{−1/2} for the leaky model). The results are displayed for the first 500 ms as a spike density figure. (C) The set of possible spike times was computed for every parameter value. (D, E) The same simulations were done for the nonleaky model, with noise on the dynamics of 0.1 V·s^{−1/2}.
Figure 11: The perfect integrator is not reliable. The model is a perfect integrator driven by equation 2.5. (A) Input current is I(t) = 0.5 + pB(t) (units are not significant here), where p is the parameter and B(t) is a normalized filtered Gaussian noise with time constant 10 ms. We show the results for the first and last 500 ms of the 10 s simulations. (B) The model was run 2000 times for every parameter value (400 values between 0 and 1), with membrane potential initially at rest value and noise on the dynamics (0.27 V·s^{−1/2}). The results are displayed as a spike density figure. (C) The set of possible spike times was computed for every parameter value. (D) Mean firing rate is almost constant, around 40 Hz, while precision is always above 13 ms, which is very high, knowing that this is about half the mean ISI.
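The failure mode of the perfect integrator, accumulation of intrinsic noise, can be checked directly: because successive interspike intervals are independent first-passage times, the jitter of the nth spike time grows with n. A minimal sketch with illustrative parameters (threshold 1, reset 0, constant drive); the function name and all values are our own assumptions, not the paper's:

```python
import numpy as np

def perfect_integrator_spikes(I0, sigma, T=1.6, dt=2e-4, rng=None):
    """Spike times of a noisy perfect integrator dV/dt = I0 + noise,
    with threshold 1 and reset to 0 (Euler-Maruyama integration)."""
    rng = rng or np.random.default_rng()
    v, spikes = 0.0, []
    for k in range(int(T / dt)):
        v += dt * I0 + sigma * np.sqrt(dt) * rng.standard_normal()
        if v >= 1.0:
            spikes.append(k * dt)
            v = 0.0
    return spikes

rng = np.random.default_rng(0)
trials = [perfect_integrator_spikes(40.0, 0.3, rng=rng) for _ in range(100)]
t5 = [tr[4] for tr in trials]    # 5th spike time in each trial
t50 = [tr[49] for tr in trials]  # 50th spike time in each trial
```

Across trials, the spread of the 50th spike time is markedly larger than that of the 5th: the jitter never saturates, which is the desynchronization of case 1.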
We showed that for aperiodic currents that cross the threshold, noise does not accumulate over time. In other words, in this case, the asymptotic spike time jitter is a function of the level of intrinsic noise that tends to 0 as the noise variance tends to 0, which is not generally so for periodic currents. This result is not a consequence of particular factors such as the leak, the amount of fluctuations in the input current, or its frequency spectrum. However, quantitatively, the ratio of the asymptotic spike time jitter to the noise variance depends on various characteristics of the input current (and, more generally, of the differential equation; Hunter et al., 1998; Fellous et al., 2001; Beierholm et al., 2001; Tiesinga, 2002; Tiesinga et al., 2002b). Such a relationship is relevant only when responses are reproducible in the way defined above. In particular, it would be illusory to search for a formula describing the precision as a function of a parameter of the input in the case of periodic currents above threshold. Indeed, when varying the parameter, the model alternates between phase locking, where precision is close to 0, and structural instability, where spike timing is imprecise even with an arbitrarily small intrinsic noise (besides being dependent on the initial condition). However, when the model is phase-locked, the (quantitative) precision depends on dynamical properties such as the Lyapunov exponent and the distance to the boundary of the Arnold tongue (Tiesinga, 2002).

Appendix: Stability of the Set of Possible Spike Times

A leaky model is described by a one-dimensional differential equation,

dV/dt = f(V, t),   (A.1)
where V is the membrane potential, with the spike modeled as follows: when the potential reaches a threshold V_t, it is reset to V_r. The leak is described by the following condition on f:

∂f/∂V < 0.
This inequality implies that the distance between any two solutions of equation A.1 (without spikes) always decreases and tends to zero; hence the term leak. Since we are interested in the sequence of spike times, it is natural to introduce the map Q: ℝ → ℝ that gives the time of the first spike following a given spike time; that is, Q(t) is the time when a run starting from reset potential at time t hits threshold. We shall call this map the spike map. Thus, the sequence of spike times of a run with first spike at time t is just t, Q(t), Q^2(t), .... The spike map was first introduced in Rescigno et al. (1970) in the case when the input current is periodic and studied further in Knight (1972) and Keener et al. (1981).
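For a concrete model, the spike map Q and its iterates can be computed numerically. A minimal sketch for a linear leaky integrator dV/dt = (−V + I(t))/τ with threshold 1 and reset 0; the parameter values and function names are illustrative assumptions:

```python
import numpy as np

def spike_map(t0, current, dt=1e-4, tau=0.01, v_reset=0.0,
              v_thresh=1.0, t_max=1.0):
    """Numerical spike map Q: integrate the leaky model from reset at time t0
    until the potential first reaches threshold; return that time, or None
    if no spike occurs within t_max."""
    v, t = v_reset, t0
    while t < t0 + t_max:
        v += dt * (-v + current(t)) / tau
        t += dt
        if v >= v_thresh:
            return t
    return None

def spike_train(t0, current, n_spikes):
    """Iterate the spike map to get the sequence Q(t0), Q^2(t0), ..."""
    times, t = [], t0
    for _ in range(n_spikes):
        t = spike_map(t, current)
        if t is None:
            break
        times.append(t)
    return times
```

With a constant suprathreshold current, the spike map reduces to a shift by a constant ISI, which is the periodic-firing baseline against which the aperiodic case is compared.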
One can easily show that for every t, f(V_t, Q(t)) ≥ 0, which means there can be no spike if the vector field points downward at threshold, that is, in the case of the linear leaky integrator, if the input current is below the current threshold. This result is the first cornerstone of the construction of the set of possible spike times, and it holds also for nonleaky spiking models, including the perfect integrator. It allows us to remove intervals from the set of possible spike times, as described in section 3.2. In the construction of this set, we also remove the image of these intervals by iterates of Q. However, this is correct only if Q is injective. For leaky models, one can show that Q is strictly increasing on its range (see Keener et al., 1981, for the case of periodic currents). This is also true if f(V_r, t) > 0 for all t, which is the case for the nonleaky models we simulated. For periodic currents, that is, when f is periodic with respect to t, the fact that Q is strictly increasing on its range means, in the general case, that it is a lift of an orientation-preserving circle map. If Q is surjective, that is, if the current is always above threshold, then Q is a lift of a homeomorphism of the circle. The dynamics of both noninvertible orientation-preserving circle maps and homeomorphisms of the circle are well known, as we described in section 3.1. Our aim is to prove that the set of possible spike times constructed in section 3.2 is stable under noise and deterministic perturbations. We assume that the duration of ISIs is bounded, so that there is an M > 0 such that Q(t) − t < M for all t. For mathematical convenience, we consider here that a run may hit threshold without spiking, as long as it does not cross it. Consider runs defined on ℝ (not only ℝ⁺), with infinitely many spikes on both ℝ⁺ and ℝ⁻. Denote by R ⊂ ℝ the set of spike times of all these runs, that is, the set of possible spike times constructed from t = −∞.
Now consider a perturbation of this system defined in the following way: take ε > 0, representing the amount of perturbation; the spike following a spike at time t_n occurs at time t_{n+1} such that |t_{n+1} − Q(t_n)| ≤ ε. This accounts for an uncertainty that may be random or deterministic. Denote by R_ε the set of spike times of all possible perturbed runs for a given ε > 0. We shall prove the following stability theorem:

Theorem 1. For any compact set K such that K ∩ R = ∅, there is an ε > 0 such that K ∩ R_ε = ∅.

In other words, any compact set, such as an interval, that is not reachable under the original deterministic dynamical system is not reachable either if a small perturbation is added to the dynamics. Therefore, adding a small noise to the dynamics should have a small influence on the set of spike times. In the same way, a slight change of the vector field should cause a small perturbation of spike times.
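The contraction behind this stability result can be observed numerically: two runs of a leaky model started at very different potentials end up spiking at (nearly) the same times when the drive is aperiodic and crosses threshold. A minimal sketch; the drive (a sum of two incommensurate sines) and all parameter values are illustrative assumptions:

```python
import numpy as np

def spike_times(v0, drive, dt=1e-4, tau=0.01, v_thresh=1.0):
    """Spike times of a deterministic leaky integrator started at potential
    v0 and driven by the pre-sampled current `drive` (threshold 1, reset 0)."""
    v, out = v0, []
    for k in range(len(drive)):
        v += dt * (drive[k] - v) / tau
        if v >= v_thresh:
            out.append(k * dt)
            v = 0.0
    return out

# An aperiodic drive that oscillates around threshold
# (two incommensurate frequencies).
t = np.arange(0, 0.6, 1e-4)
drive = 1.2 + 0.8 * np.sin(2 * np.pi * 7.3 * t) \
            + 0.5 * np.sin(2 * np.pi * 11.7 * t)
a = spike_times(0.0, drive)   # run started from rest
b = spike_times(0.99, drive)  # run started just below threshold
```

The early spikes of the two runs differ, but the late spikes nearly coincide: the set of possible spike times has shrunk, independently of the initial condition.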
First, let us formalize the considerations above. Considering that a run may hit threshold without spiking, as long as it does not cross it, means in the present case that if the spike map is discontinuous at t, then the spike following a spike at time t may occur at either time Q(t−) or Q(t+). Now choose ε > 0 as the level of uncertainty; if there is a spike at time t, then the next spike will occur within distance ε of either Q(t−) or Q(t+), that is, the set of possible times for the next spike is [Q(t−) − ε, Q(t−) + ε] ∪ [Q(t+) − ε, Q(t+) + ε]. Formally, this framework may be expressed in terms of a set-valued dynamical system. We refer the reader to Aubin and Frankowska (1990) for more details about set-valued analysis. A sequence (t_n) of spike times agreeing with our considerations is one that satisfies

t_{n+1} ∈ F_ε(t_n),   (A.2)

where F_ε(t) = [Q(t−) − ε, Q(t−) + ε] ∪ [Q(t+) − ε, Q(t+) + ε]. Thus, F_ε is a set-valued map that we write F_ε: ℝ ⇝ ℝ, as in Aubin and Frankowska (1990). We introduce a few definitions from set-valued analysis:

Definition 1. The graph of a set-valued map F: X ⇝ Y is the set of points (x, y) ∈ X × Y such that y ∈ F(x).

The inverse of a set-valued map F: X ⇝ Y is the set-valued map F^{−1}: Y ⇝ X defined by x ∈ F^{−1}(y) if and only if y ∈ F(x). If A ⊂ ℝ, then

F(A) = ∪_{t∈A} F(t).

If F: ℝ ⇝ ℝ and G: ℝ ⇝ ℝ are two set-valued maps, then the composition G ∘ F: ℝ ⇝ ℝ is the set-valued map defined by G ∘ F(t) = G(F(t)). The notation F^n stands for the n-fold composition F ∘ F ∘ ⋯ ∘ F.

The intersection of F: ℝ ⇝ ℝ and G: ℝ ⇝ ℝ is the set-valued map F ∩ G: ℝ ⇝ ℝ defined by F ∩ G(t) = F(t) ∩ G(t).

Thus, F_ε^n(t) is the set of possible times for the nth spike following a spike at time t, given an uncertainty ε. The set of possible times for the nth spike of any run is F_ε^n(ℝ). Consider a run defined on ℝ, with infinitely many spikes on ℝ⁻. Every spike of this run must occur within the set F_ε^n(ℝ), for any integer n. Thus, the spike times are contained in the set

R(F_ε) = ∩_{n>0} F_ε^n(ℝ).

Conversely, for any time t ∈ R(F_ε), we can find a sequence of spike times satisfying equation A.2. Thus, R(F_ε) is the set of spike times of all such runs.
In order to prove theorem 1, we need another definition:

Definition 2. Let F: X ⇝ Y be a set-valued map. F is said to be compact on compact sets (CC) if for any sequence x_n ∈ X converging to x, any sequence y_n ∈ F(x_n) has a cluster point y ∈ F(x).

Equivalently, F is CC if its graph is closed and it maps bounded sets to bounded sets. This is a generalization of continuity for set-valued maps. One may note that when F is a single-valued map, it is equivalent to continuity. Let us state a few basic results regarding CC maps:

Proposition 1.
1. If F: X ⇝ Y and G: Y ⇝ Z are CC, then so is G ∘ F.
2. If F_λ: X ⇝ Y, λ ∈ I, is a family of CC set-valued maps, then the intersection ∩_{λ∈I} F_λ is also CC.
3. Let (λ, x) ↦ F_λ(x), from I × X to Y, be a CC set-valued map, and let K ⊂ X be a compact set. Then λ ↦ F_λ(K) is CC.

Theorem 1 follows from the following lemma:

Lemma 1. The set-valued map ε ∈ ℝ⁺ ↦ F_ε^n(ℝ) ∩ K is CC.

Proof of Theorem 1. It follows from lemma 1 and proposition 1 that the set-valued map ε ↦ R(F_ε) ∩ K is CC; hence, its domain is closed, that is, the set of ε such that R(F_ε) ∩ K ≠ ∅ is closed. Therefore, if R(F_0) ∩ K = ∅, then for ε small enough, R(F_ε) ∩ K = ∅.

Proof of Lemma 1. Note that F_ε^n(ℝ) ∩ K = F_ε^n(F_ε^{−n}(K)). It follows from proposition 1 that (ε, t) ↦ F_ε^n(t) is CC. Therefore, it is sufficient to prove that (ε, t) ↦ F_ε^{−n}(t) is CC, and since its graph is closed, we need to prove only that it maps bounded sets to bounded sets. That is, we must prove that
for any A > 0 and any compact interval I, the set

∪_{ε≤A} F_ε^{−n}(I) = F_A^{−n}(I)

is bounded. Since the function t ↦ Q(t) − t is bounded, we have

lim_{t→−∞} Q^n(t) = −∞.

Therefore, Q^n maps an unbounded set of ℝ⁻ to an unbounded set. Hence, F_A^{−n}(I) must be bounded.

Acknowledgments

We thank K. Pakdaman and J.-P. Aubin for many helpful discussions and Christine Lienhart and Marc Maier for revising our English.

References

Arnold, V. I. (1961). Small denominators. I. Mapping the circle onto itself. Izv. Akad. Nauk. SSSR Ser. Mat., 25, 21–86.
Ascoli, C., Barbi, M., Chillemi, S., & Petracchi, D. (1977). Phase-locked responses in the Limulus lateral eye: Theoretical and experimental investigation. Biophys. J., 19(3), 219–240.
Aubin, J. P., & Frankowska, H. (1990). Set-valued analysis. Boston: Birkhäuser.
Baddeley, R., Abbott, L., Booth, M., Sengpiel, F., Freeman, T., Wakeman, E., & Rolls, E. (1997). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. Lond. B Biol. Sci., 264(1389), 1775–1783.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Comp., 8(6), 1185–1202.
Beierholm, U., Nielsen, C., Ryge, J., Alstrom, P., & Kiehn, O. (2001). Characterization of reliability of spike timing in spinal interneurons during oscillating inputs. J. Neurophysiol., 86(4), 1858–1868.
Berry, M., & Meister, M. (1998). Refractoriness and neural precision. J. Neurosci., 18(6), 2200–2211.
Berry, M., Warland, D., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. USA, 94(10), 5411–5416.
Borst, A., & Theunissen, F. (1999). Information theory and neural coding. Nat. Neurosci., 2, 947–957.
Buracas, G., Zador, A., DeWeese, M., & Albright, T. (1998). Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 20(5), 959–969.
Coddington, E., & Levinson, N. (1955). Theory of ordinary differential equations. New York: McGraw-Hill.
de Ruyter van Steveninck, R., Lewen, G., Strong, S., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
Denjoy, A. (1932). Sur les courbes définies par les équations différentielles à la surface du tore. J. Math. Pures Appl., 9(11), 333–375.
Fellous, J. M., Houweling, A., Modi, R., Pao, R., Tiesinga, P., & Sejnowski, T. (2001). Frequency dependence of spike timing reliability in cortical pyramidal cells and interneurons. J. Neurophysiol., 85, 1782–1787.
Guttman, R., Feldman, L., & Jakobsson, E. (1980). Frequency entrainment of squid axon membrane. J. Membrane Biol., 56, 9–18.
Herman, M. (1977). Mesure de Lebesgue et nombre de rotation. Berlin: Springer-Verlag.
Hunter, J., Milton, J., Thomas, P., & Cowan, J. (1998). Resonance effect for neural spike time reliability. J. Neurophysiol., 80(3), 1427–1438.
Jensen, R. (1998). Synchronization of randomly driven nonlinear oscillators. Phys. Rev. E, 58(6), R6907–R6910.
Kara, P., Reinagel, P., & Reid, R. (2000). Low response variability in simultaneously recorded retinal, thalamic, and cortical neurons. Neuron, 27(3), 635–646.
Keener, J. (1980). Chaotic behavior in piecewise continuous difference equations. Trans. Amer. Math. Soc., 261(2), 589–604.
Keener, J., Hoppensteadt, F., & Rinzel, J. (1981). Integrate-and-fire models of nerve membrane response to oscillatory input. SIAM J. Appl. Math., 41, 503–517.
Knight, B. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Koppl, C. (1997). Phase locking to high frequencies in the auditory nerve and cochlear nucleus magnocellularis of the barn owl, Tyto alba. J. Neurosci., 17(9), 3312–3321.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarisation. J. Physiol. Pathol. Gen., 9, 620–635.
Mainen, Z., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Nowak, L., Sanchez-Vives, M., & McCormick, D. (1997). Influence of low and high frequency inputs on spike timing in visual cortical neurons. Cereb. Cortex, 7(6), 487–501.
Pakdaman, K., & Tanabe, S. (2001). Random dynamics of the Hodgkin-Huxley neuron model. Phys. Rev. E, 64(5 Pt 1), 050902.
Reich, D., Mechler, F., Purpura, K., & Victor, J. (2000). Interspike intervals, receptive fields, and information encoding in primary visual cortex. J. Neurosci., 20(5), 1964–1974.
Reich, D., Victor, J., & Knight, B. (1998). The power ratio and the interval map: Spiking models and extracellular recordings. J. Neurosci., 18, 10090–10104.
Reich, D., Victor, J., Knight, B., Ozaki, T., & Kaplan, E. (1997). Response variability and timing precision of neuronal spike trains in vivo. J. Neurophysiol., 77(5), 2836–2841.
Reinagel, P., & Reid, R. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20(14), 5392–5400.
Rescigno, A., Stein, R., Purple, R., & Poppele, R. (1970). A neuronal model for the discharge patterns produced by cyclic inputs. Bull. Math. Biophys., 32, 337–353.
Tiesinga, P. (2002). Precision and reliability of periodically and quasiperiodically driven integrate-and-fire neurons. Phys. Rev. E, 65, 041913.
Tiesinga, P., Fellous, J. M., & Sejnowski, T. (2002a). Attractor reliability reveals deterministic structure in neuronal spike trains. Neural Comput., 14(7), 1629–1650.
Tiesinga, P., Fellous, J. M., & Sejnowski, T. (2002b). Spike-time reliability of periodically driven integrate-and-fire neurons. Neurocomputing, 44–46, 195–200.
Veerman, J. (1989). Irrational rotation numbers. Nonlinearity, 2(3), 419–428.
Victor, J., & Purpura, K. (1996). Nature and precision of temporal coding in visual cortex: A metric-space analysis. J. Neurophysiol., 76(2), 1310–1326.
Received March 15, 2002; accepted July 5, 2002.
LETTER
Communicated by Terrence Sejnowski
The Dynamic Neural Filter: A Binary Model of Spatiotemporal Coding Brigitte Quenet
[email protected] Laboratoire d’Electronique, Ecole Superieure de Physique et Chimie Industrielles, Paris 75005, France David Horn
[email protected]
School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel

We describe and discuss the properties of a binary neural network that can serve as a dynamic neural filter (DNF), which maps regions of input space into spatiotemporal sequences of neuronal activity. Both deterministic and stochastic dynamics are studied, allowing the investigation of the stability of spatiotemporal sequences under noisy conditions. We define a measure of the coding capacity of a DNF and develop an algorithm for constructing a DNF that can serve as a source of given codes. On the basis of this algorithm, we suggest using a minimal DNF capable of generating observed sequences as a measure of complexity of spatiotemporal data. This measure is applied to experimental observations in the locust olfactory system, whose reverberating local field potential provides a natural temporal scale allowing the use of a binary DNF. For random synaptic matrices, a DNF can generate very large cycles, thus becoming an efficient tool for producing spatiotemporal codes. The latter can be stabilized by applying to the parameters of the DNF a learning algorithm with suitable margins.

1 Introduction
Neural Computation 15, 309–329 (2003); © 2002 Massachusetts Institute of Technology

Three types of neural network paradigms (Hertz, Krogh, & Palmer, 1991; Peretto, 1992) were developed two decades ago. Two of them were based on supervised learning, in order to construct feedback attractor networks (Hopfield, 1982) for associative memory and feedforward multilayer perceptrons (Rumelhart, Hinton, & Williams, 1986) for functional representation. The third used unsupervised learning to produce self-organized maps (Kohonen, 1982). These three conventional paradigms of neural computation were formulated in terms of mathematical neurons that are very different from biological reality, and it is still unclear how, if at all, biological neural
networks perform any of them. Over the past decade, interest in theoretical neuroscience (Dayan & Abbott, 2001) has shifted to analyzing neural networks of spiking neurons with dynamical synapses in the hope of getting closer to the natural roles and purposes of the relevant biological networks. We will argue that it is nevertheless timely to consider yet another model of binary neurons with discrete temporal dynamics and show its applicability to a biological system. An interesting issue in neuroscience is the question of spatiotemporal coding. Its existence has been demonstrated (Wehr & Laurent, 1996) in the locust olfactory system, where the spatiotemporal behavior of projection neurons encodes the odor presented to its receptor neurons. This transformation from odor input to spatiotemporal activity occurs in the antennal lobe, that is, the first module of the olfactory system. This system may therefore be regarded as a dynamic neural filter that turns spatial information distributed over its many glomeruli, fed by the receptor neurons, into specific spatiotemporal outputs. Although this is a complicated biological system, it has an important simplifying feature that allows it to be represented by a simplified model of mathematical neurons: the fact that the activity of the projection neurons is limited to temporal bins defined by an oscillatory local field potential. Hence, a model with binary neurons obeying Hopfield-Little dynamics (Peretto, 1992) can provide a valid first-order approximation of the spatiotemporal behavior of that system. We study a recurrent model of this kind using an asymmetric coupling matrix that allows for the generation of large temporal sequences. The novelty of our approach is that we use this model as a dynamic neural filter (DNF), relating input space to spatiotemporal behavior of the recurrent network.
As such, it does not correspond to any of the three paradigms noted above, it does not come to rest at fixed points, and it is not necessarily based on supervised learning. We have presented this model in a previous work (Quenet, Horn, Dreyfus, & Dubois, 2001) and demonstrated how it can be used to generate the spatiotemporal behavior of projection neurons observed by Wehr and Laurent (1996). Here we discuss several fundamental issues concerning this model, such as its stability and coding capacity. Moreover, we provide an algorithm for constructing a DNF that can generate a given set of spatiotemporal data. This algorithm is used to define and characterize the complexity of the data set of Wehr and Laurent (1996). It should be emphasized that the latter is used only as a relevant example, and we do not claim that the DNF is a realistic representation of the biological reality. Nonetheless, as in the conventional three paradigms, it can be used as a schematic model that will subserve further detailed modeling of biological systems.

2 The Model

2.1 Short Biological Motivation. The antennal lobe of insects serves as the first stage of olfactory computation, using as input signals of chemical
receptors and transforming them into outputs of projection neurons (PN) to the next stage of the olfactory tract (the mushroom body). In the locust, there are about 800 excitatory PN cells (Laurent and Naraghi, 1994) and a slightly smaller number of inhibitory local interneurons (LN) that form dendrodendritic interactions with the PNs and with themselves, and receive receptor information as well. The output of this system is that of the PNs only; hence, we will consider our model as a rough sketch of PNs, with interactions that should be regarded as effective interactions mediated by the LNs. The antennal lobe exhibits a reverberating field potential (LFP) at a frequency of about 20 Hz. This LFP reverberates for a few cycles (up to tens of them) and then quiets down until it begins to repeat such behavior. (See Laurent et al., 2001, for details.) The PNs fire within the up-phase of the LFP; hence the latter may be regarded as defining an effective clock, with time bins of, for example, 30 ms of activity and 10 ms of rest. This motivates us to use a simplified model with discrete temporal dynamics.

2.2 The Recurrent Network. In our model, a binary neuron i, representing a PN, may either fire, n_i = 1, or be quiescent, n_i = 0, in a given temporal bin of the LFP. There are N neurons in the model obeying the following Hopfield-Little dynamics,
n_i(t+1) = H(h_i(t+1)) = H( Σ_j w_{ij} n_j(t) + R_i − θ_i ),   (2.1)

where w_{ij} is the synaptic coupling matrix, R_i is the external constant input (specifying odor activation), and θ_i is the threshold. H is the Heaviside step function, taking the value 0 for nonpositive arguments and 1 for positive ones. This model can be readily generalized to account for the presence of noise by replacing the deterministic rule with the stochastic one,

Prob[n_i(t+1) = 1] = 1 / (1 + e^{−h_i(t+1)/ε}),   (2.2)

where ε is the noise parameter. We will study the stability of the results of the deterministic dynamics to the noise introduced by the stochastic dynamics.

2.3 Deterministic Dynamics. In order to analyze the one-step dynamics of equation 2.1, we define a Lyapunov function L that describes just this one step, in the sense that, considering all dynamical states at time t+1, it will be minimal for the state that is a solution to these dynamics. Let us define the initial and final states,
n_i^I = n_i(t),   n_i^F = n_i(t+1),   (2.3)
relevant to the one step of equation 2.1. Clearly, I and F are just two of the 2^N states that the system of neurons possesses. We can prove that

L^{JI} = −Σ_{ij} w_{ij} n_i^J n_j^I − Σ_i n_i^J (R_i − θ_i)   (2.4)

obtains its unique minimum on the state J = F. Consider

k_i^{JI} = −(2n_i^J − 1) h_i^I = −(2n_i^J − 1) ( Σ_j w_{ij} n_j^I + R_i − θ_i ).   (2.5)

For J = F, this quantity is negative for every i, following the deterministic dynamics. For any other J, there will be elements i for which the sign will flip. Hence, Σ_i k_i^{JI} will be minimal for J = F. We note in equation 2.5 that k_i^{JI} contains terms that are dependent on n_i^J, to be denoted by 2 l_i^{JI}, where

l_i^{JI} = −n_i^J ( Σ_j w_{ij} n_j^I + R_i − θ_i ),   (2.6)

and terms that are independent of n_i^J. Since the latter are common for all J, the minimum of Σ_i k_i^{JI} will be reached for the same J as the minimum of L^{JI} = Σ_i l_i^{JI}, the Lyapunov function of equation 2.4.

2.4 Stochastic Dynamics. The one-step Lyapunov function, equation 2.4, plays an important role in the stochastic dynamics.¹ As we will see, the probability of obtaining a state J after starting from a state I can be written in terms of this function:

P(J | I) = e^{−L^{JI}/ε} / Σ_K e^{−L^{KI}/ε}.   (2.7)

To prove this result, we start from the probability of obtaining the state F, that is, the one following from the deterministic dynamics. It is straightforward to show that

P(n_i^F | I) = 1 / (1 + e^{k_i^{FI}/ε});   (2.8)
1 It was applied to a model with symmetric synaptic couplings by Quenet, Cerny, Dreyfus, and Lutz (1997).
hence,

P(F | I) = Π_i P(n_i^F | I).   (2.9)

Any other state J differs from the state F by some flips of neuronal states, for example, n_m^J = 1 − n_m^F. In such a case, P(J | I) will be the probability of obtaining the corrupted state F, having the wrong digit in its mth place, which is P(F | I) e^{k_m^{FI}/ε}. But at this location m,

k_m^{FI} = (1/2)(k_m^{FI} − k_m^{JI}) = l_m^{FI} − l_m^{JI}.   (2.10)

Since for all other neurons l_i^{FI} = l_i^{JI}, it follows that

P(J | I) = P(F | I) e^{(L^{FI} − L^{JI})/ε}.   (2.11)

This will be true for any state regardless of how many flips occur, because of their independent nature. Hence, equation 2.7 is proved.
This will be true for any state regardless of how many ips occur because of their independent nature. Hence, equation 2.7 is proved. The stochastic dynamics of equations 2.1 and 2.2 lead to a homogeneous and irreducible Markov process; the transition matrix is time invariant, and all elements of the transition matrix, T JI D P ( J | I ) ,
(2.12)
are strictly positive. Such a Markov process has a stationary probability distribution, p ( I ), that is, an eigenvector of the transition matrix with the highest eigenvalue, 1: X (2.13) T JI p ( I ) D p ( J ) . I
From this stationary probability distribution, obtained asymptotically in the evolution of the system, one can calculate the asymptotic probabilities of the transitions p ( I ! J ) : p (I ! J ) D P ( J | I ) p (I ) .
(2.14)
The stationary Markov process has an entropy rate (Cover & Thomas, 1991), X HD¡ (2.15) p (I ) T JI log2 T JI , IJ
that serves as a lower bound on the expected dimension of any binary code of the Markov process. Since in our case all states are described by vectors of length N, it follows that
H · N.
(2.16)
The upper limit is approached when the noise parameter 2 is high and all transition matrix elements tend to be equal.
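For small N, the transition matrix, its stationary distribution, and the entropy rate can be computed exactly by enumerating all 2^N states. A minimal sketch, assuming the stochastic rule of equation 2.2 with threshold θ = 1/2; the function names and the choice of ε are illustrative:

```python
import itertools
import numpy as np

def transition_matrix(w, R, theta=0.5, eps=0.5):
    """Stochastic transition matrix T[J, I] = P(J | I) for a binary DNF of
    N neurons (2^N states), built from the per-neuron firing probabilities
    of equation 2.2; columns are product measures over independent neurons."""
    N = w.shape[0]
    states = [np.array(s) for s in itertools.product([0, 1], repeat=N)]
    T = np.zeros((len(states), len(states)))
    for i, I in enumerate(states):
        h = w @ I + R - theta
        p = 1.0 / (1.0 + np.exp(-h / eps))  # P(n_k = 1 | I), eq. 2.2
        for j, J in enumerate(states):
            T[j, i] = np.prod(np.where(J == 1, p, 1.0 - p))
    return T

def stationary_and_entropy(T):
    """Stationary distribution (eigenvector of eigenvalue 1, eq. 2.13) and
    entropy rate H = -sum_{IJ} p(I) T[J,I] log2 T[J,I] (eq. 2.15)."""
    vals, vecs = np.linalg.eig(T)
    k = np.argmin(np.abs(vals - 1.0))
    p = np.real(vecs[:, k])
    p /= p.sum()
    H = -np.sum(p[None, :] * T * np.log2(T))
    return p, H
```

Raising `eps` flattens every column of T toward the uniform distribution, and H approaches the upper bound N of equation 2.16.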
3 Numerical Examples
We propose to view the model described in the previous section as a filter relating the input space R_i to spatiotemporal sequences of the recurrent network. In this section, we demonstrate numerically how the DNF acts. For simplicity, we choose both w_{ij} and R_i to have positive and negative integer values, while fixing h_i = 1/2. This means that R_i will now effectively represent both the input and the threshold of neuron i.

3.1 Mapping of Input Space. Once the matrix w is given, there is a restricted range of interest for R:

-\sum_j w_{ij} H(w_{ij}) \le R_i \le -\sum_j w_{ij} H(-w_{ij}) + 1.   (3.1)
Outside this range, the dynamics become trivial because the input determines the neuronal values directly.

As a first example of deterministic dynamics, we study an N = 2 system because it can be easily mapped in an exhaustive manner. We choose the synaptic matrix

w = \begin{pmatrix} 1 & 2 \\ -2 & -1 \end{pmatrix}

and use an initial null state n_i(0) = 0. The dynamics generate 14 different sequences, including the four (2^N) fixed points, several two-cycles, and one four-cycle. In Figure 1, we display the coding zones, that is, the ranges of R corresponding to a single temporal sequence. The overall range of R space is chosen to be somewhat larger than the range of equation 3.1. In its center, we find that many different sequences are produced (small coding zones), while in the periphery, a single sequence dominates. Figure 2 shows the length of each sequence on the same R plane. We define it as the length of the sequence the model generates before one of its states is repeated. This definition of length takes into account both the order of the cycle and the transition time it takes to reach it. The interesting (i.e., long) sequences appear near the center of R space. The periphery is dominated by fixed points and two-cycles.

Building on this experience, we next study an N = 5 system with the synaptic matrix

w = \begin{pmatrix} 0 & -2 & -5 & -3 & 0 \\ 6 & 2 & 8 & -14 & 0 \\ 1 & 1 & 0 & -2 & 1 \\ -4 & 6 & 1 & 1 & 3 \\ 4 & -1 & 2 & -4 & 0 \end{pmatrix}.
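To make the iteration concrete, here is a minimal sketch (ours, not the authors' code; equation 2.1 is assumed to be the threshold update n_i(t+1) = Θ(Σ_j w_ij n_j(t) + R_i − h_i) with h_i = 1/2, and all function names are illustrative). Run at the R-space point (10, −10) used in section 3.3, with the remaining inputs at the centers of their equation 3.1 ranges, the five-neuron matrix enters a six-cycle whose first four generated states are those labeled 17, 22, 30, and 32 in section 3.3 (labels in the 1 + Σ_i n_i 2^{5−i} code of section 3.2).

```python
import numpy as np

def dnf_run(w, R, h=0.5):
    """Iterate n(t+1) = Theta(w @ n(t) + R - h) from the null state until a state
    repeats (the sequence has entered its cycle); return the visited states."""
    w = np.asarray(w, dtype=float)
    R = np.asarray(R, dtype=float)
    n = np.zeros(len(R), dtype=int)
    visited = [tuple(n)]
    while True:
        n = (w @ n + R - h > 0).astype(int)    # Heaviside step
        if tuple(n) in visited:
            return visited
        visited.append(tuple(n))

def input_range(w):
    """Per-neuron range of interest for R_i from equation 3.1: (lo_i, hi_i)."""
    w = np.asarray(w, dtype=float)
    return -np.clip(w, 0, None).sum(axis=1), -np.clip(w, None, 0).sum(axis=1) + 1

w5 = np.array([[ 0, -2, -5,  -3, 0],
               [ 6,  2,  8, -14, 0],
               [ 1,  1,  0,  -2, 1],
               [-4,  6,  1,   1, 3],
               [ 4, -1,  2,  -4, 0]])

lo, hi = input_range(w5)
centers = (lo + hi) / 2.0                      # centers for R_3, R_4, R_5: 0, -3, 0
states = dnf_run(w5, [10, -10, *centers[2:]])
labels = [1 + int("".join(map(str, s)), 2) for s in states]
# first four generated states: 17, 22, 30, 32 (cf. section 3.3); a six-cycle follows

# Asymmetry parameter of section 5.2 for this matrix:
a = (w5 * w5.T).sum() / (w5 ** 2).sum()        # approximately -0.4067 (footnote 4)
```

The last line checks the asymmetry parameter a defined in section 5.2, whose footnote quotes a = −0.4067 for this matrix.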
Figure 1: Ranges of input space in a two-neuron problem that lead to identical spatiotemporal behavior, defined as coding zones. The label of each code, or spatiotemporal behavior, varies from 1 to 14 in this problem.
Now much larger sequences can be generated, and we would like to see the ensuing mapping of input space and test the stability of the sequences to noise. In order to be able to view the results, we choose R_3, R_4, and R_5 to lie in the centers of their expected ranges, as specified by equation 3.1, and study the system in the R_1-R_2 plane, limited to their relevant ranges. First, let us view, in Figure 3, the coding zones of different sequences in R space. Their average size is 10; that is, there is an order of magnitude of information compression from R space to the spatiotemporal sequences. The lengths of the latter are displayed in Figure 4. They include many cycles of lengths 4, 5, and 6.

The coding zones are the analogs of basins of attraction, which signify domains of initial conditions that lead to the same attractor. Here, we stay with a fixed initial condition (the null state) and search for domains in input space that lead to the same temporal sequence (transition into a cycle). Note the large number of different sequences obtained in this problem. It is exponentially larger than the number of fixed points that we are accustomed to in attractor neural networks (describable by the same dynamics with symmetric synaptic matrices). We estimate, on the basis of simulations, that the total number of sequences in the five-dimensional R space is of the order of 3800. In the two-dimensional section of R space shown in Figure 3, we find 38 different sequences, whose lengths are displayed in Figure 4.
Figure 2: Sequence length of the N = 2 problem mapped on the two-dimensional input space.

Table 1: Six Neighboring Sequences in a System with Five Neurons, Displayed over Seven Time Steps.

 R2    DE   DH7   t=1    2     3     4     5     6     7
-15     0     0    17    22     6     8     3    17    22
-12     2     1    17    22    14     8     3    17    22
 -8     4     2    17    22    14    16     3    17    22
 -3     5    10    17    30    16     3    17    30    16
  2     6    11    25    30    16     3    17    30    16
  8     7     7    25    30    16    11     3    17    30
3.2 Distances Between Codes. The mapping of R space into spatiotemporal codes can also serve to define a distance between different codes. It is thus interesting to compare different sequences generated over a neighborhood of R space, as seen in Table 1. All of these sequences can be read off as we move along the R_2 axis at fixed R_1 = 4. The sequences are ordered according to increasing value of R_2, where we indicate the central value of the coding zone, and the states for each time bin are written in the binary representation 1 + \sum_{i=1}^{5} n_i 2^{5-i}. The columns D_E and D_{H7} describe two distances, defined below, from the first sequence.
Figure 3: Coding zones of spatiotemporal sequences of an N = 5 problem, plotted on a plane defined by two of the input variables. Other inputs are held constant at their central values.
As one moves along from the first sequence to the second, and then to the third, only one state changes over the observed range of T = 7. Moreover, this change corresponds to the flipping of just one neuron (n_2 at t = 3 between the first and second sequences). Proximity in R may therefore be related to proximity in the identity of states appearing in two sequences, or proximity in the sense of Hamming distance between all neurons at all time steps. Thus, we define three distances between sequences: D_E is the "edit distance," defined as the number of insertions and deletions needed to change one sequence into another, limiting oneself to the natural length of each sequence, that is, until one of the states reappears. D_H is the Hamming difference between the two spatiotemporal patterns of neural activities. Clearly, this grows with the length of the sequence. In the example of Table 1, where we limit ourselves to seven time steps, we list the corresponding seven-step Hamming distance, D_{H7}. Finally, we may define D_R, the Euclidean distance between the centers of the coding zones in R space. Table 1 lists the values of the D_E and D_{H7} distances from the first sequence in the table.

Figure 4: Sequence length of the 38 sequences mapped out in Figure 3.

Using 18 sequences of this problem, including the six shown in Table 1, we find a high correlation (0.87) between D_E and D_R and a lower correlation (0.56) between D_{H7} and D_R. The latter signifies the fact that whereas the Hamming distance correlates well with R distance for some sequences, there are cases where several neurons flip as one moves from one code to the next. In Table 1, this occurs between the third and the fourth sequences and between the fifth and the sixth ones. Note, however, that in the fourth sequence, the state 30 replaces two states, 22 and 14, of the third sequence. Therefore, D_E between the two will be only 3. This explains in what sense the two sequences should be regarded as close to one another, as also borne out by the short distance in R.

These results can also be studied from the viewpoint of the individual neurons. Since the first line has the lowest value of R_2, none of its states involves an active neuron number 2; that is, all have n_2 = 0. Moving up to R_2 = -12, n_2 is activated in state 14. At the next stage, it is activated in state 16. Further increases of R_2 lead to activation of neuron number 2 in the states 30 and 25. Thus, we observe a gradual increase in the number of states that contain an active neuron number 2.

Since the volume of state space is relatively moderate, 32, we find that in spite of the change in the identity of the state (for example, a change of 6 to 14 at step t = 3 between the first and the second rows), the next state remains 8, and the rest of the sequence is reiterated. This leads to the good correlation with D_E. We believe that this
property may be useful in DNF applications to the study of problems such as gene sequences, where low-N models may apply.

The correlation between R distance and edit distance may be lost at large N. The reason is that the volume of state space grows like 2^N. Hence, once a change occurs in one state, the next transition may point to another state in this large space, and the ensuing sequence may change completely. Numerical trials in an N = 50 model have shown this to be the case. Thus, the correlation of nearby sequences, as defined by proximity in R space, may be lost in large neural networks.

3.3 Stability Under Stochastic Dynamics. Next, let us ask how robust these sequences are against noise. We choose a noise level of 0.5, where considerable effects may be expected, since this is the lowest absolute value that the potentials h_i can take. As an example, let us look at the probability of a correct four-step sequence,

P_4 = P(F_4 | F_3)\, P(F_3 | F_2)\, P(F_2 | F_1)\, P(F_1 | 1),   (3.2)
where 1 designates the initial null state. This is the product of the probabilities of obtaining the correct first four states of the appropriate deterministic sequence at a given point in R space. The results for the problem at hand are plotted in Figure 5 on the same grid of R space as in the two previous figures. We see that a few sequences are relatively stable, but others have a low probability of being correctly recovered during the first four steps of this Markov process.

To exemplify the origins of instability, we compare two six-cycles, at locations (10, -10) and (10, 15) of Figures 4 and 5. In Figure 6 we plot the histograms of the h_i values (over all neurons for the first four steps of the dynamics) for these two cases. Clearly, at noise level 1/2, the occupancy of the bins at h = ±1/2 will determine the instability following from equation 2.2. Since the sequence at (10, -10) has three elements in these bins, it is reduced to P_4 = 0.36. All other sequences in Figure 5 have three elements or more in these bins. The sequence at (10, 15) has five elements in these bins, as can be seen in Figure 6; hence, its retrieval probability is reduced to P_4 = 0.18.

Let us remark at this point on the Markov chain properties discussed in section 2.4 and, in particular, the stationary probability distribution p(I) of equation 2.13. Using the R-space point (10, -10) as an example, we find that the following states are the most probable ones: 17, 22, 30, 32. Their respective probabilities are 0.106, 0.175, 0.173, and 0.169. In the problem at hand, we start with state 1 as the initial condition, looking for the sequence 1 → 17 → 22 → 30 → 32. The relevant transition probabilities for the terms in equation 3.2 are 0.98 × 0.95 × 0.73 × 0.53, leading to P_4 = 0.36. Obviously, the major reduction of the probability occurs at the first two steps, from state 1 to 17 and from 17 to 22. This is also where the three appearances of
Figure 5: Probability of recovering correctly the first four time steps in the sequences of Figures 3 and 4 when stochastic dynamics is used with noise level 0.5.
h = ±1/2 occur: two of them affect the transition from 1 to 17, and one occurs in the transition from 17 to 22.

3.4 Entropy of the Markov Chains. Finally, we show in Figure 7 an analysis of the entropy of the different Markov chains of this system. This is done for a noise level of 0.5 and can be compared with the P_4 results in Figure 5. The Markov chains with low entropy are the ordered ones, leading to a relatively high probability of correct retrieval of the first four states. The chains with high entropy are the disordered ones, leading to a low probability of correct retrieval.

4 The Inverse Problem
The previous section exhibited examples of what may be called the direct DNF problem: given a matrix w and a set of R values, solve the dynamics of equation 2.1 and generate the resulting sequence. We now wish to pose an inverse problem: given a set of several sequences, find a corresponding DNF, that is, values of w and R that can produce this set. There are two parts to this question. First, does a solution exist? Second,
Figure 6: Histograms of h_i values of two six-cycles in the N = 5 problem.

Table 2: Six Spatiotemporal Sequences Defined for Four Time Steps in a System with Four Neurons.

 t\k     1       2       3       4       5       6
  1     1100    1000    1110    1000    1011    1000
  2     1110    1100    1111    1010    1000    1110
  3     1101    1101    0111    0110    1110    0111
  4     0001    0001    0011    0111    1111    0001
if the problem is soluble, find an example. This may be expanded into a search for all possible solutions, which lies outside the scope of this work.

4.1 The Existence Problem. Suppose a given data set comprises K different spatiotemporal sequences of length T each, presented as activities of N neurons. An example is shown in Table 2 for K = 6, T = 4 in a system with N = 4 neurons. We wish to find out if there exists a DNF such that this set of sequences is elicited by K different inputs R_i^k, k = 1, ..., K.
Figure 7: Entropy of the Markov chains corresponding to the sequences of Figures 3 and 4 when stochastic dynamics is used with noise level 0.5.
The neuronal values may be labeled n_i^k(t), t = 1, ..., T. Each time step, for each sequence, defines a state of the system:

|k, t\rangle = \{n_i^k(t)\}_{i=1,\ldots,N}.   (4.1)
Twenty-four such states can be seen in Table 2, some of them identical, yet no identical states appear within a single sequence. Each state |k, t\rangle follows from the previous state |k, t-1\rangle through the deterministic dynamics. The initial state |k, 0\rangle is chosen as the null state, n_i^k = 0 for all i = 1, ..., N and k = 1, ..., K.

Consider a single neuron i in this system. We may then reduce the question posed here to a perceptron problem for this neuron. The neuron is given KT initial vectors (counting states from t = 0 to t = T - 1) and its required output values n_i^k(t) (from t = 1 to t = T, for all k = 1, ..., K). It has to perform calculation 2.1 given a different input R_i^k for each of the K sequences. To take account of the K different inputs (biases), we extend the vector states of length N to new vectors of length N + K through the concatenation

|k, t) = |k, t\rangle \times \{\delta_{a,k}\}_{a=1,\ldots,K},   (4.2)

where \delta_{a,k} is the Kronecker delta, taking the value 1 for a = k and 0 otherwise. The extension by K bias axes allows us to represent the
deterministic dynamics, for each neuron i, as KT perceptron inequalities in an (N + K)-dimensional system. These can be strictly met (Hertz et al., 1991) for N obeying

A: \quad KT \le N + K \quad \text{or} \quad K(T - 1) \le N.   (4.3)

In the large-N limit, one can apply the Cover result (Cover, 1965), saying that a solution may be found for

B: \quad KT \le 2(N + K) \quad \text{or} \quad \frac{1}{2} K(T - 2) \le N.   (4.4)
For the example of Table 2, the strict condition A implies N ≥ 18, and condition B leads to N ≥ 6. This could mean one should extend the system shown in Table 2 by at least two, if not more, neurons. Nonetheless, one can obtain a solution for N = 4. This is due to the fact that the states in this table were not chosen in a random and independent fashion.²

Note also the example discussed in section 3.1. Thirty-three of the 38 sequences in Figure 4 have length 4 or more. Counting only the first four states, we find 22 different sequences, many more than conditions A and B would suggest. This is so because we did not ask in section 3.1 for the matrix that can produce an arbitrary choice of states appearing in K sequences of length 4, but counted how many different sequences were produced by a given matrix. These sequences turned out to be quite similar to one another, as described in section 3.2. Thus, conditions A and B should be regarded as sufficiency conditions, assuring us that (A) a solution must exist and (B) it may be found for lower N. With more effort, one can try to solve the inverse problem, that is, find an explicit matrix w, for yet smaller N.

4.2 Perceptron Learning. From our approach to the existence problem, it should be quite evident that if a solution exists, it may be obtained using a straightforward extension of the Rosenblatt algorithm (Rosenblatt, 1962; Hertz et al., 1991). We describe it in this section, realizing explicitly the abstract vectors in N + K dimensions described in the previous section. Let us start by defining, for each neuron i, an (N + K)-dimensional vector of perceptron weights \vec{w}_i,
(\vec{w}_i)_j = w_{ij} \quad \text{for } j = 1, \ldots, N,
(\vec{w}_i)_{N+k} = R_i^k - h_i \quad \text{for } k = 1, \ldots, K.   (4.5)
² The rationale for the choice of states in Table 2 is discussed in section 5.3.

The sequence states of interest will be represented by vectors \vec{x}_{i,k}(t), for neuron i, living in the same (N + K)-dimensional space but carrying further
indices of k and t as follows:

(\vec{x}_{i,k}(t))_j = n_j^k(t)\,(2 n_i^k(t+1) - 1) \quad \text{for } j = 1, \ldots, N,\; t = 0, \ldots, T-1,
(\vec{x}_{i,k}(t))_{N+a} = \delta_{a k}\,(2 n_i^k(t+1) - 1) \quad \text{for } a = 1, \ldots, K.   (4.6)

In other words, the vectors \vec{x}_{i,k}(t) represent the states weighted by a positive or negative sign, depending on whether the target neuron i at the next time step will be 1 or 0, respectively. With these definitions, all constraints of the inverse problem can be written as KT perceptron inequalities:

\vec{w}_i \cdot \vec{x}_{i,k}(t) > 0 \quad \text{for all } k = 1, \ldots, K,\; t = 0, 1, \ldots, T-1.   (4.7)
We propose to use the perceptron learning rule (Hertz et al., 1991),

\Delta \vec{w}_i = \eta\, \vec{x}_{i,k}(t)\, H(-\vec{w}_i \cdot \vec{x}_{i,k}(t)),   (4.8)

while iterating the system, time and again, over all KT states. The Heaviside function guarantees that \vec{w}_i gets modified at a given iteration by the vectors \vec{x}_{i,k}(t) that do not satisfy the inequality. This algorithm converges (Hertz et al., 1991) if the system of inequalities is soluble. Moreover, one can use the same algorithm to require stability under stochastic dynamics by insisting on a margin M in the perceptron inequalities:

\vec{w}_i \cdot \vec{x}_{i,k}(t) > M.   (4.9)

This leads to the learning rule:

\Delta \vec{w}_i = \eta\, \vec{x}_{i,k}(t)\, H(-\vec{w}_i \cdot \vec{x}_{i,k}(t) + M).   (4.10)
In section 3, where we used integer values of w and R and kept all h_i = 1/2, we worked with an implicit margin of size 1/2. This is the reason that, testing the system with stochastic dynamics at noise level 1/2, we found considerable effects. Insisting on a larger M, if the data allow it, will guarantee stability over larger ranges of the noise. Given a situation of the type encountered in Figure 5, one may want to improve the robustness of a possible code, such as the one located at (10, 15) in the R plane. This can be attempted by applying the algorithm of the previous equation with a suitably chosen M. The size of M is, of course, limited by the data. (For a general discussion of optimal margin selection, see Vapnik, 1995.) If the algorithms are implemented with η = 1, they lead to integer values of w and R − h. An arbitrary choice of η < 1 can lead to continuous values. It is then advisable to keep a finite M to ensure some robustness to noise. Finally, we note that one may apply in a similar manner the Ho-Kashyap procedure (Duda, Hart, & Stork, 2001), which indicates, if it does not converge, that a given set of data is not linearly separable.
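The whole scheme can be sketched compactly (our illustration, not the authors' code; all names are ours). Each neuron i learns an (N + K)-dimensional weight vector over the patterns of equation 4.6, using rule 4.8 with η = 1, on the K = 6, T = 4 sequences of Table 2, for which the text states an N = 4 solution exists; the learned w and R^k − h are then checked by running the deterministic dynamics from the null state.

```python
import numpy as np

# Table 2 sequences (columns k = 1..6), each starting from the null state at t = 0.
seqs = [["1100", "1110", "1101", "0001"],
        ["1000", "1100", "1101", "0001"],
        ["1110", "1111", "0111", "0011"],
        ["1000", "1010", "0110", "0111"],
        ["1011", "1000", "1110", "1111"],
        ["1000", "1110", "0111", "0001"]]
K, T, N = len(seqs), len(seqs[0]), 4
bits = lambda s: np.array([int(b) for b in s], dtype=float)

def patterns(i):
    """KT extended vectors x of equation 4.6 for neuron i (state part + K bias axes)."""
    xs = []
    for k in range(K):
        prev = np.zeros(N)                         # |k, 0> is the null state
        for t in range(T):
            sign = 2.0 * int(seqs[k][t][i]) - 1.0  # +1 if the target bit is 1, else -1
            x = np.concatenate([sign * prev, np.zeros(K)])
            x[N + k] = sign                        # Kronecker-delta bias component
            xs.append(x)
            prev = bits(seqs[k][t])
    return xs

# Perceptron rule, equation 4.8 with eta = 1: update whenever w . x <= 0.
W = np.zeros((N, N + K))
xs = [patterns(i) for i in range(N)]
for _ in range(10000):
    changed = False
    for i in range(N):
        for x in xs[i]:
            if W[i] @ x <= 0.0:
                W[i] += x
                changed = True
    if not changed:                                # all inequalities 4.7 satisfied
        break

def run(k):
    """Deterministic dynamics with the learned weights and biases R_i^k - h_i."""
    n, out = np.zeros(N), []
    for _ in range(T):
        n = (W[:, :N] @ n + W[:, N + k] > 0).astype(float)
        out.append("".join(str(int(b)) for b in n))
    return out
```

Replacing the update condition by `W[i] @ x <= M` implements the margin version of equations 4.9 and 4.10, buying robustness against the stochastic dynamics.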
5 Capacity and Complexity

5.1 Capacity of Codes of Given Length. The DNF produces sets of spatiotemporal codes. Its capacity could be measured by the number of codes it can produce. This is a very large number and may not be very meaningful if we consider an application with stochastic dynamics. Moreover, what would really be interesting is a measure of the capacity of random codes, expected to be very distant from one another. Clearly, two codes that are close will not constrain the system. If one is learned, some other close-by codes will be produced automatically for nearby R values.

Suppose we look for the number of codes of length T such that the same state does not appear twice, either within the same code or among the different codes. This would be the case if the states are chosen randomly from all 2^N possibilities of an N-neuron system. Following the reasoning of section 4.1, this number can be expected to reach

C_T \approx \frac{2^N}{T - 2}   (5.1)
if criterion B is applied. C_T may be viewed as a capacity measure of a DNF. Note that the last equation leads to C_T = 1 for T = 2^N + 2. This means that a DNF of order N may be expected to accommodate a cycle of length 2^N + 2. The latter may then be used to construct the DNF.

5.2 Random Synaptic Weights. There exist synaptic matrices that do not have the ability to generate large cycles. It is well known that symmetric matrices w lead to fixed points or two-cycles, and antisymmetric matrices may lead to four-cycles at most (Peretto, 1992). In general, however, a system of N neurons may generate large cycles, limited by the total number of states, 2^N. Gutfreund, Reger, and Young (1988) have shown that for R_i = 0, large cycles are generated when the values of w_{ij} are chosen randomly from a gaussian distribution centered at zero and the thresholds are h_i = (1/2)\sum_j w_{ij} \approx 0.³

³ Related questions within the framework of Ising spin systems have been studied by Crisanti and Sompolinsky (1988).

This was further studied numerically by Nützel (1991), who pointed out that large cycles can be obtained for some range of the asymmetry

a = \frac{\sum_{ij} w_{ij} w_{ji}}{\sum_{ij} w_{ij}^2}

around the point 0 of random asymmetry.⁴

⁴ This definition of asymmetry differs from the parameter used by Gutfreund et al. (1988) and Nützel (1991), but it shares the relevant characteristics of varying between 1 and −1, with the extremes characterizing the symmetric and antisymmetric cases. The completely asymmetric case corresponds to a = 0. The example of the five-dimensional matrix in section 3 has a = −0.4067.

The average size of the cycles grows as an exponential in N, with an exponent that increases as the asymmetry
Table 3: Binary Odor Coding.

 t\k    1     2     3     4     5     6
  1     11    10    11    10    10    10
  2     11    11    11    10    10    11
  3     11    11    01    01    11    01
  4     00    00    00    01    11    00
approaches zero. A system with a similar structure but slightly different dynamics was studied by McGuire, Littlewort, and Rafelski (1991).

Working with nonzero R values in our DNF, we noted that the interesting large cycles occur around the center of the relevant range of R. This is where the average effect of the inputs cancels the average effect of the synaptic weights, in agreement with the observations of Gutfreund et al. (1988). Thus, we expect, for any given N, that matrices w that have large capacities C_T for large T have a random-like distribution of synaptic values. We are, however, content even with cycles of order N, as is the case in our examples in section 3. Note that in the locust antennal lobe (see section 2.1), hundreds of neurons participate in the real biological system. Hence, it is quite reasonable for it not to display cyclic behavior over the few LFP bins over which it is measured.

5.3 An Example: The Wehr-Laurent Experiment. We return now to the example that motivated our investigation. Wehr and Laurent (1996) display in Figure 3 of their article the responses of two specific neurons to nine mixtures of odors. Significant results were obtained during the first four reverberations of the local field potential. They can be presented as six different binary codings, shown in Table 3, three of which appeared twice in the nine odor mixtures. Each of the six columns specifies the state of activity of the observed pair of neurons for one of the six spatiotemporal codes. Looking at the first column, it is clear that it cannot be produced by a DNF, where states at time t are determined by states at t − 1, unless the DNF possesses at least two hidden neurons. Hence, we conclude that this should be treated as a T = 4, K = 6 problem. Looking for a system with C_4 \approx 6, we know we can do it with N = 18 (using condition A), but may expect to be able to do it with N = 6 (using condition B) or less. It turns out we can do it with N = 4.
A set of states of four neurons that accommodates the two observed neurons of Table 3 is given in Table 2. The latter was constructed by choosing the states of the hidden neurons by trial and error until all perceptron inequalities were satisfied, ensuring the existence of DNFs with matrices w and inputs R^k that can generate this table. Thus, we are able to implement the binary odor coding of Wehr and Laurent (1996) in an N = 4 model.
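As a quick consistency check (our sketch, which assumes the two observed neurons of Table 3 are the first two of the four neurons in Table 2), projecting the Table 2 states onto the first two neurons recovers the Table 3 codes exactly:

```python
table2 = [["1100", "1110", "1101", "0001"],
          ["1000", "1100", "1101", "0001"],
          ["1110", "1111", "0111", "0011"],
          ["1000", "1010", "0110", "0111"],
          ["1011", "1000", "1110", "1111"],
          ["1000", "1110", "0111", "0001"]]
table3 = [["11", "11", "11", "00"],
          ["10", "11", "11", "00"],
          ["11", "11", "01", "00"],
          ["10", "10", "01", "01"],
          ["10", "10", "11", "11"],
          ["10", "11", "01", "00"]]
# Keep only the two observed neurons; the hidden neurons 3 and 4 are dropped.
projected = [[state[:2] for state in seq] for seq in table2]
# projected == table3: the hidden-neuron construction preserves the observed code
```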
5.4 Minimal DNF as a Measure of Complexity. The example of the Wehr-Laurent experiment raises the possibility of using the DNF model to define the complexity of a spatiotemporal data set. This degree of complexity can be characterized by the minimal number of neurons N needed to accommodate the data within a DNF. From the discussion in the previous paragraph, we conclude that the Wehr-Laurent problem of Table 3 has DNF complexity of degree 4.⁵

6 Discussion
The problem of odor spatiotemporal encoding was recently reviewed in detail by Laurent et al. (2001). As a theoretical model, these authors suggest "winnerless competition," a concept that was expanded in Rabinovich et al. (2000, 2001). Basing their intuition on Lotka-Volterra equations, they point out that for suitably chosen parameters, there exist heteroclinic orbits connecting all N attractors of the N-dimensional system, which are very sensitive to external inputs. As a neural implementation, they studied systems of FitzHugh-Nagumo neurons that exhibit spatiotemporal sensitivity to external inputs, but made no attempt to fit a particular data set such as that of Wehr and Laurent (1996).

In comparison, we use a binary model with discrete dynamics, but we nevertheless claim that it is relevant to the observed data. The reason is that the periodic LFP in the antennal lobe provides a temporal scale that accounts for specific time bins and allows discrete coding, as observed by Wehr and Laurent (1996). With a DNF model that fits these data, one can envisage a model of spiking neurons capable of mimicking the experimental results (Quenet, Dubois, Sirapian, Dreyfus, & Horn, in press). This can be done by imposing overall periodic inhibition that allows for action potentials during periodic time windows and by fitting synaptic delay parameters to match the same windows.

Recently, Friedrich and Laurent (2001) observed spatiotemporal odor representations in the olfactory bulb of zebrafish. This system also has a reverberating local field potential and mitral cells that fire in coincidence with it. The authors studied the response of this system to similar odors, characterized by small changes in the molecular structures of the relevant chemicals. One of their interesting results is that the correlation between the temporal patterns of similar odors over the mitral cells decreases with time (between the first and second 500 ms after odor presentation).
This contradicts the expectation that short distances in R space lead to high correlations between the spatiotemporal sequences. The reason is the large value of N in
⁵ Obviously, this should be regarded as a mathematical statement and may have nothing to do with the biological reality. In the antennal lobe, one finds excitation of hundreds of PNs during each odor presentation.
this system. Since the volume of state space is 2^N, once a change in a sequence occurs, it may lead to a new sequence uncorrelated with the old one.

Our model can be compared with recent work by Maass, Natschläger, and Markram (in press), who propose a recurrent neural network that performs what they call "liquid computation." Their network is composed of spiking neurons and is structured in two stages: (1) a filter transforming spatiotemporal input into spatiotemporal behavior of the network, and (2) the activation of a readout map. Thus, this network projects spatiotemporal input into an output representation that is spatial in nature. Although these authors put special emphasis on real-time learning based on perturbations, the common feature of their approach and ours is the use of a filter projecting from an input space to an output space. Whereas in our model the input space is spatial and the output space is spatiotemporal, their model goes in the opposite direction. In general, however, both models can connect spatiotemporal spaces to one another, and both use neural networks as filters, performing computations that differ from the conventional major paradigms.

Our model is based on binary mathematical neurons, whereas Maass et al. (in press) deal with more realistic neural behavior. Nevertheless, the advantage of our DNF is that its structure can be made explicit, thus allowing a full understanding of its operation and capabilities. Among other things, we can use this understanding to solve an inverse problem: construct a DNF that produces a given repertoire of spatiotemporal data.

Acknowledgments
We thank G. Dror for advice and assistance.

References

Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Elect. Comp., 14, 326–334.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Crisanti, A., & Sompolinsky, H. (1988). Dynamics of spin systems with randomly asymmetric bonds: Ising systems and Glauber dynamics. Phys. Rev. A, 37, 4865–4874.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.
Friedrich, R. W., & Laurent, G. (2001). Dynamic optimization of odor representations by slow temporal patterning of mitral cell activity. Science, 291, 889–894.
Gutfreund, H., Reger, D. J., & Young, A. P. (1988). The nature of attractors in an asymmetric spin glass with deterministic dynamics. J. Phys. A: Math. Gen., 21, 2775–2797.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Nat. Acad. Sci. USA, 79, 2554–2558.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cyb., 43, 59–69.
Laurent, G., & Naraghi, M. (1994). Odorant-induced oscillations in the mushroom bodies of the locust. J. Neuroscience, 14, 2993–3004.
Laurent, G., Stopfer, M., Friedrich, R. W., Rabinovich, M. I., Volkovskii, A., & Abarbanel, H. D. I. (2001). Odor encoding as an active dynamical process: Experiments, computation and theory. Ann. Rev. Neurosci., 24, 263–297.
Maass, W., Natschläger, T., & Markram, H. (in press). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation.
McGuire, P. C., Littlewort, G. C., & Rafelski, J. (1991). Brainwashing random asymmetric "neural" networks. Physics Letters A, 160, 255–260.
Nützel, K. (1991). The length of attractors in asymmetric random neural networks with deterministic dynamics. J. Phys. A: Math. Gen., 24, L151–L157.
Peretto, P. (1992). An introduction to the modeling of neural networks. Cambridge: Cambridge University Press.
Quenet, B., Cerny, V., Dreyfus, G., & Lutz, A. (1997). A dynamic model of key feature extraction: The example of olfaction. Unpublished manuscript, ESPCI.
Quenet, B., Dubois, R., Sirapian, S., Dreyfus, G., & Horn, D. (in press). Two-step modeling of spatiotemporal olfactory data: From binary to Hodgkin-Huxley neurons. Biosystems.
Quenet, B., Horn, D., Dreyfus, G., & Dubois, R. (2001). Temporal coding in an olfactory oscillatory model. Neurocomputing, 38–40, 831–836.
Rabinovich, M. I., Huerta, R., Volkovskii, A., Abarbanel, H. D. I., Stopfer, M., & Laurent, G. (2000). Dynamical coding of sensory information with competitive networks. J. Physiol. (Paris), 94, 465–471.
Rabinovich, M. I., Volkovskii, A., Lecanda, P., Huerta, R., Abarbanel, H. D. I., & Laurent, G. (2001). Dynamical encoding by networks of competing neuron groups: Winnerless competition. Phys. Rev. Lett., 87, 068102.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by back-propagating errors. Nature, 323, 533–536.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wehr, M., & Laurent, G. (1996). Odor encoding by temporal sequences of firing in oscillating neural assemblies. Nature, 384, 162–166.

Received November 20, 2001; accepted July 30, 2002.
LETTER
Communicated by Kwabena Boahen
Modeling Short-Term Synaptic Depression in Silicon

Malte Boegerhausen
[email protected] Pascal Suter
[email protected] Shih-Chii Liu
[email protected]
Institute of Neuroinformatics, University of Zurich and ETH Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland

We describe a model of short-term synaptic depression that is derived from a circuit implementation. The dynamics of this circuit model is similar to the dynamics of some theoretical models of short-term depression, except that the recovery dynamics of the variable describing the depression is nonlinear and also depends on the presynaptic frequency. The equations describing the steady-state and transient responses of this synaptic model are compared to the experimental results obtained from a fabricated silicon network consisting of leaky integrate-and-fire neurons and different types of short-term dynamic synapses. We also show experimental data demonstrating the possible computational roles of depression. One possible role of a depressing synapse is that the input can quickly bring the neuron up to threshold when the membrane potential is close to the resting potential.

1 Introduction
Short-term synaptic dynamics has been observed in different parts of the cortical system (Stratford, Tarczy-Hornoch, Martin, Bannister, & Jack, 1998; Varela et al., 1997; Markram, Wang, & Tsodyks, 1998; Reyes et al., 1998). The functionality of the short-term dynamics has been implicated in various cortical models (Senn, Segev, & Tsodyks, 1998; Chance, Nelson, & Abbott, 1998; Matveev & Wang, 2000). The properties of a network with dynamic synapses have been investigated by Tsodyks, Pawelzik, and Markram (1998) and Maass and Zador (1999), and the use of such a network in time-series processing by Maass and Zador (1999) and Natschlager, Maass, and Zador (2001). The introduction of these dynamic synapses into hardware implementations of recurrent neuronal networks allows a wide range of operating regimes, especially in the case of time-varying inputs. In this work, we describe a model that was derived from a circuit implementation of short-term depression and facilitation. The dynamics of this
Neural Computation 15, 331–348 (2003)
© 2002 Massachusetts Institute of Technology
circuit, initially proposed by Rasche and Hahnloser (2001), was not analyzed in their work. Our work is presented in two parts. We first describe the model of this synaptic circuit and compare the differential equations describing the depression with the equations of one of the theoretical models frequently used in network simulations (Abbott, Sen, Varela, & Nelson, 1997; Varela et al., 1997). We analyze the transient and steady-state responses of this synaptic model in response to inputs of different statistical distributions. Next, we compare the theoretical results with experimental responses obtained from a silicon network of leaky integrate-and-fire neurons together with different short-term dynamic synapses. We also show experimental data from the chip that demonstrate the possible computational roles of depression. From these experiments, we postulate that one possible role of depression is to bring the neuron's response up to threshold quickly if the membrane potential of the neuron is close to the resting potential. We also mapped a proposed cortical model of direction selectivity that uses depressing synapses onto this chip. The results are qualitatively similar to the results obtained in the original work (Chance et al., 1998). The similarity of the circuit responses to the responses from Abbott and colleagues' model means that we can use these VLSI networks of integrate-and-fire (IF) neurons as an alternative to computer simulations of dynamical networks composed of large numbers of IF neurons using synapses with different time constants. The outputs of such networks can also be used as an interface with wetware. An infrastructure for a reprogrammable, reconfigurable, multichip neuronal system is being developed, along with a user-defined interface, so that the system is easily accessible to a naive user.

2 Comparisons Between Models of Depression
Two theoretical models describing synaptic depression and facilitation used frequently in network simulations are the phenomenological model of Tsodyks and Markram (1997) and the model from Abbott et al. (1997). Because these models are equivalent if the inactivation synaptic time constant in Tsodyks and Markram's model is small compared with the recovery time constant of the depression, we compare the circuit model only with the model from Abbott and colleagues. Here, we describe only the circuit model for synaptic depression. The equivalent model for facilitation is described elsewhere (Liu, in press).

2.1 Theoretical Model for Depression. In the model from Abbott et al. (1997), the synaptic strength is described by gD(t), where D is a variable between 0 and 1 that describes the amount of depression (D = 1 means no depression) and g is the maximum synaptic strength. The recovery
dynamics of D is

$$\tau_d \frac{dD}{dt} = 1 - D, \tag{2.1}$$

where τ_d is the recovery time of the depression. The update equation for D right after a spike at time t = t_sp is

$$D(t_{sp}^{+}) = d\,D(t_{sp}^{-}), \tag{2.2}$$

where d (d < 1) is the amount by which D is decreased right after the spike and t_sp is the time of the spike. The average steady-state value of depression for a regular spike train with a rate r is

$$D_{ss} = \frac{1 - e^{-1/(r\tau_d)}}{1 - d\,e^{-1/(r\tau_d)}}. \tag{2.3}$$
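As a numerical check on these three relations, the recovery of equation 2.1 and the update of equation 2.2 can be iterated over a regular spike train and compared with the closed form of equation 2.3. This is a minimal sketch; the rate and parameter values below are illustrative, not taken from the text:

```python
import math

def simulate_abbott_Dss(rate, tau_d, d, n_spikes=500):
    """Iterate the Abbott et al. (1997) model over a regular spike train:
    multiplicative decrement d at each spike (eq. 2.2), then exponential
    recovery toward 1 over one inter-spike interval (eq. 2.1).
    Returns D just before the next spike, after the train has settled."""
    isi = 1.0 / rate
    D = 1.0
    for _ in range(n_spikes):
        D *= d                                        # eq. 2.2 at the spike
        D = 1.0 - (1.0 - D) * math.exp(-isi / tau_d)  # eq. 2.1 over one ISI
    return D

def closed_form_Dss(rate, tau_d, d):
    """Steady-state depression from equation 2.3."""
    e = math.exp(-1.0 / (rate * tau_d))
    return (1.0 - e) / (1.0 - d * e)
```

After a few hundred spikes the iterated value agrees with equation 2.3 to machine precision, which confirms that D_ss is the pre-spike fixed point of the recovery/update map.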
2.2 Circuit Model of Depressing Synapse. In this model of synaptic depression derived from the circuit implementation, the equation that describes the recovery dynamics of the depressing variable D is nonlinear. This nonlinearity comes about because there is no compact circuitry to implement a linear resistor whose value can be adjusted easily after fabrication. A transistor can act as a linear resistor as long as the terminal voltages satisfy certain criteria. Additional circuitry would be needed to satisfy these criteria over a wide range of voltages, and it would increase the size of the circuit. One alternative is to replace the exponential dynamics in equation 2.1 with diode dynamics, which is easily obtained with a single diode-connected transistor. The current through the diode has a nonlinear decay instead of the exponential decay obtained with a linear resistor. Because of the different dynamics, the equation describing the recovery of D (derived from the circuit in the region where the transistor operates in the subthreshold region, that is, where the current is exponential in the gate voltage of the transistor) can be formulated as
$$\frac{dD}{dt} = M\left(1 - D^{1/\kappa}\right), \tag{2.4}$$

where 1/M is the equivalent of τ_d in equation 2.1 and κ (a transistor parameter) is less than 1. The maximum value of D is 1. The update equation remains as before:

$$D(t_{sp}^{+}) = d\,D(t_{sp}^{-}). \tag{2.5}$$
2.2.1 Circuit. Equations 2.4 and 2.5 are derived from the circuit in Figure 1. The operation of this circuit is described in the caption, and the detailed analysis leading to the differential equations for D is described in the
Figure 1: Schematic for a depressing synapse circuit and responses to a regular input spike train. (a) Depressing synapse circuit. The voltage V_a determines the synaptic conductance g, while the synaptic term gD or I_syn is exponential in the voltage V_x. The subcircuit consisting of transistors M1, M2, and M3 controls the dynamics of I_syn. The presynaptic input goes to the gate terminal of M3, which acts like a switch. When there is a presynaptic spike, a quantity of charge (determined by V_d) is removed from the node V_x. In between spikes, V_x recovers to the voltage V_a through the diode-connected transistor M1. When there is no spike, V_x is around V_a. When the presynaptic input comes from a regular spike train, V_x decreases with each spike and recovers between spikes. It reaches a steady-state value, as shown in b. During the spike, transistor M4 turns on, and the synaptic weight current I_syn charges up the membrane potential of the neuron through the current-mirror circuit consisting of M6, M7, and the capacitor C2. We can convert the I_syn current source into a synaptic current I_d with some gain and a "time constant" by adjusting the voltage V_gain. The decay dynamics of I_d is given by

$$I_d(t) = \frac{I_{syn}(t = t_{sp})}{1 + \frac{A\,I_{syn}(t = t_{sp})}{Q_T}\,(t - t_{sp})},$$

where Q_T = C_2 U_T and A = e^{κ(V_dd − V_gain)/U_T}. In a normal synapse circuit (i.e., without short-term dynamics), V_x is controlled by an external bias voltage. (b) Input spike train at a frequency of 20 Hz (bottom curve) and the corresponding response V_x (top curve) of the circuit for V_d = 0.3 V. The diode-connected transistor M1 has nonlinear dynamics. This dynamics is discussed in the text. The recovery time of the variable D varies depending on the distance of the present value of V_x from V_a. The recovery rate of D increases for a larger difference between V_x and V_a.
appendix. The voltage Vx in Figure 1 codes for gD. The conductance g is set by Va , while the dynamics of D is set by both Vd and Va . The time taken for the present value of Vx to return to the value of Va is determined by the current dynamics of the diode-connected transistor M1 and Va . As shown in the appendix, the recovery time constant (1 / M) of D is set by Va .
The synaptic weight is described by the current I_syn in Figure 1a,

$$I_{syn}(t) = I_{on}\, e^{\kappa V_x/U_T} = g\,D(t), \tag{2.6}$$

where g = I_on e^{κV_a/U_T} is the synaptic strength, D(t) is e^{κ(V_dd − V_a)/U_T} I_op/I_rf(t), and I_rf = I_op e^{κ(V_dd − V_x)/U_T}. The recovery time constant (1/M) of D is set by V_a (M = (I_op κ/Q_T) e^{−(1−κ)(V_dd − V_a)/U_T}).

The synaptic current I_d to the neuron is then a current source I_syn, which lasts for the duration of the pulse width of the presynaptic spike. However, we can set a longer time constant for the synaptic current through V_gain. The equation describing this dependence (that is, the equation of the current in a current-mirror circuit) is given in the caption of Figure 1. It is difficult to compute a closed-form solution of equation 2.4 for an arbitrary value of κ (a transistor parameter that is less than 1). This value also changes under different operating conditions and between transistors fabricated in different processes. Hence, we solve for D(t) in the case of κ = 0.5, given that the last spike occurred at t = t₀:¹

$$\frac{dD}{dt} = M(1 - D^2) \;\Rightarrow\; D(t) = \frac{D(t_0) + D(t_0)\,e^{2Mt} - 1 + e^{2Mt}}{-D(t_0) + D(t_0)\,e^{2Mt} + 1 + e^{2Mt}} = \frac{D(t_0)\cosh(Mt) + \sinh(Mt)}{\cosh(Mt) + D(t_0)\sinh(Mt)}. \tag{2.7}$$
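Equation 2.7 can be checked against a direct numerical integration of dD/dt = M(1 − D²). In this sketch, M = 2.2 s⁻¹ is borrowed from the Figure 3 caption; the initial value and time span are illustrative:

```python
import math

def D_closed_form(t, D0, M):
    """Equation 2.7: recovery of D for kappa = 0.5, a time t after the last spike."""
    c, s = math.cosh(M * t), math.sinh(M * t)
    return (D0 * c + s) / (c + D0 * s)

def D_euler(t, D0, M, steps=100000):
    """Forward-Euler integration of dD/dt = M * (1 - D^2), for comparison."""
    D, h = D0, t / steps
    for _ in range(steps):
        D += h * M * (1.0 - D * D)
    return D
```

The two agree to within the Euler step error; for small Mt and D₀ near 0, the closed form also reduces to the linear trajectory D(t) ≈ Mt + D₀ discussed in the text.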
When D is far from its recovered value of 1, we can approximate its recovery dynamics by dD/dt = M (irrespective of κ), and solving for D(t), we get D(t) = Mt + D(t₀). In this regime, D(t) follows a linear trajectory. Note that the same is true of equation 2.1 when t ≪ τ_d.

3 Neuron Circuit
The dynamics of the neuron circuit of Figure 2 is similar to that of a leaky IF neuron with a constant leak. The circuit was previously described in Indiveri (2000) and Liu et al. (2001). It is a modified version of previous designs (Mead, 1989; Van Schaik, 2001) and also includes the circuitry that models firing-rate adaptation (Boahen, 1997a, 1997b) frequently seen in pyramidal

¹ Note that if κ = 1, equation 2.4 reduces to equation 2.1.
Figure 2: Schematic of the leaky integrate-and-fire neuron circuit. In this figure, V_refr sets the refractory period, V_thresh sets the threshold voltage, V_pw sets the pulse width of the spike, V_leak sets the leak current I_leak, and V_Ca and V_t set the output spike adaptation dynamics.
cells in the cortex. The equation describing the depolarization of the soma potential V_m is

$$C_m \frac{dV_m(t)}{dt} = i(t) - I_{leak} - I_{ahp}, \qquad V_m(t) < V_{thresh}, \tag{3.1}$$
where i(t) is the synaptic current to the soma, I_leak is the leakage current, and I_ahp is the after-hyperpolarization potassium (K) current, which causes the adaptation in the firing rate of the cells. The synaptic current can be set to a point current source, that is, i(t) = I_d δ(t − t_j^k), or it can have a "time constant" and gain that is determined by the current-mirror circuit consisting of M6, M7, and C2 in Figure 1. The time constant and gain are controlled by the voltage V_gain. The decay dynamics of I_d is of the form

$$I_d(t) = \frac{I_{syn}(t = t_{sp})}{1 + \frac{A\,I_{syn}(t = t_{sp})}{Q_T}\,(t - t_{sp})},$$

where Q_T = C_2 U_T and A = e^{κ(V_dd − V_gain)/U_T}.

The operation of the circuit is as follows. When V_m(t) increases above V_thresh at t = t_s (t_s is the time of the output spike), it increases by a step increment determined by the amplitude of the spike and the capacitive divider set by C1 and C_m. The output V_o becomes active at this time and turns on the discharging current path through transistors M5 and M6. The time during which V_o remains high, T_P, depends on the time taken for V_m to discharge below V_thresh. In this design, the pulse width T_P is determined by the rate at which V_m is discharged, which in turn depends on the difference between the input current I_d, the leak current I_leak, and the current I_pw. In other designs, V_m is reset immediately below V_thresh when V_o becomes active because either the input current is blocked from charging the membrane or
the current I_pw is much larger than the input current. The refractory period, T_R, is determined by V_refr, which keeps V_o high so that I_d cannot charge up the membrane. The spike output is taken from the node V̄_o. We ignore the I_ahp current in determining the dependence of the firing rate on the input current. The time needed for the neuron to charge up to threshold with the input current is

$$T_I = \frac{C_m V_{thresh}}{i(t) - I_{leak}}.$$

The spiking rate is given by

$$r = \frac{1}{T_I + T_P + T_R}.$$
4 Comparison Between Models
We compare the two models by looking at how D changes in response to a Poisson-distributed train whose frequency varied from 40 Hz to 1 Hz, as shown in Figure 3. We used a simple linear differential equation to describe the dynamics of the membrane potential V_m,

$$\tau_m \frac{dV_m(t)}{dt} = R\,i(t) - V_m(t),$$

where τ_m is the membrane time constant and i(t) is the synaptic current. We ran an optimization algorithm on the parameters in the two models so that the least-square error between the excitatory postsynaptic potential (EPSP) outputs of both models was at a minimum. In this case, the EPSP responses were identical (see Figure 3b), and the corresponding D values (see Figure 3c) were almost identical except in the region where D was close to its maximum value. We performed the same comparison with Tsodyks and Markram's model, and the results were similar. Hence, the circuit model can be used to describe short-term synaptic depression in a network simulation. However, the nonlinear recovery dynamics of the circuit model leads to a different functional dependence of the average steady-state EPSP on the frequency of a regular input spike train.

5 Transient Response
The data in the figures in the remainder of this article are obtained from a fabricated silicon network of aVLSI IF neurons of the type described in section 3 together with different types of synapses. We first measured the transient response of the neuron when stimulated by a 10 Hz spike train through the depressing synapse. We tuned the parameters of the synapse and the leak current so that the membrane potential did not build up to threshold. These data are shown in Figure 4a for a regular train and in Figure 4b for a Poisson-distributed train. The measured EPSP amplitude, ΔV_m, can be fitted with equation 2.6,

$$\Delta V_m(t) = \frac{\delta t}{C_m}\,\left(I_{syn}(t) - I_{leak}\right) = \frac{\delta t}{C_m}\,\left(g\,D(t) - I_{leak}\right), \tag{5.1}$$
Figure 3: Comparison between the outputs of the two models of depression. An optimization algorithm was used to determine the parameters of the models so that the least-square error between the EPSPs from the two models was at a minimum. (a) Poisson-distributed input with an initial frequency of 40 Hz and an end frequency of 1 Hz. (b) The EPSP responses of both models were identical. (c) The corresponding D values were almost identical except in the region where D is close to 1. Parameters used in the simulations: τ_m = 20 ms, d = 0.6, κ = 0.7, M = 2.20 s⁻¹.
where C_m is the membrane capacitance, I_leak is the leak current, δt is the pulse width of the presynaptic spike, and D(t) is computed using equation 2.7. Equation 5.1 fits the measured data well (see Figure 4).

6 Steady-State Response
The equation describing the dependence of the steady-state value of D on the presynaptic frequency can easily be determined in the case of a regular spiking input of rate r by using equations 2.5 and 2.7. From these equations, we get

$$D_{ss} = \frac{(-1+d)\left(1 + e^{2M/r}\right) + \sqrt{4d\left(-1 + e^{2M/r}\right)^2 + (1-d)^2\left(1 + e^{2M/r}\right)^2}}{2d\left(-1 + e^{2M/r}\right)}. \tag{6.1}$$
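Equation 6.1 can be verified by iterating the update (equation 2.5) and closed-form recovery (equation 2.7) maps to their pre-spike fixed point. A sketch, with M and d taken from the Figure 3 caption and illustrative rates:

```python
import math

def recover(D0, t, M):
    """Equation 2.7 (kappa = 0.5): recovery of D over a time t."""
    c, s = math.cosh(M * t), math.sinh(M * t)
    return (D0 * c + s) / (c + D0 * s)

def Dss_iterated(rate, M, d, n_spikes=1000):
    """Pre-spike steady state: decrement by d, recover for one ISI, repeat."""
    D = 1.0
    for _ in range(n_spikes):
        D = recover(d * D, 1.0 / rate, M)
    return D

def Dss_eq_6_1(rate, M, d):
    """Closed-form steady state from equation 6.1."""
    E = math.exp(2.0 * M / rate)
    num = (d - 1.0) * (1.0 + E) + math.sqrt(
        4.0 * d * (E - 1.0) ** 2 + (1.0 - d) ** 2 * (1.0 + E) ** 2)
    return num / (2.0 * d * (E - 1.0))
```

The iterated fixed point and the closed form agree to machine precision, and both decrease with increasing input rate, as expected for depression.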
Figure 4: Transient responses to (a) a 10 Hz regular spike train and (b) a 10 Hz Poisson-distributed train. The input is the bottom curve of each plot. In a, the amplitude of the EPSP decreases with each incoming input spike, clearly showing the effect of synaptic depression. In b, the EPSP amplitude depends on the occurrence of the previous spike. The asterisks are the fits of the circuit model to the peak value of each EPSP. The fits give a d value of 0.79.
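The fit shown in Figure 4a can be sketched by chaining equations 2.5, 2.7, and 5.1 along a regular train. Here d = 0.79 is the fitted value from the caption and M = 2.2 s⁻¹ is borrowed from Figure 3; the electrical parameters (g, I_leak, δt, C_m) are hypothetical placeholders, not the values fitted to the chip:

```python
import math

def recover(D0, t, M):
    """Equation 2.7 (kappa = 0.5)."""
    c, s = math.cosh(M * t), math.sinh(M * t)
    return (D0 * c + s) / (c + D0 * s)

def epsp_amplitudes(rate, M, d, g, i_leak, dt_pulse, c_m, n_spikes=10):
    """Peak EPSP amplitude at each spike of a regular train, using
    eq. 5.1: dV_m = (dt/C_m)(g D - I_leak), with D decremented by eq. 2.5
    at each spike and recovered by eq. 2.7 between spikes."""
    D, amps = 1.0, []
    for _ in range(n_spikes):
        amps.append(dt_pulse / c_m * (g * D - i_leak))  # eq. 5.1 at this spike
        D = recover(d * D, 1.0 / rate, M)               # eqs. 2.5 + 2.7 to the next
    return amps

# 10 Hz regular train, as in Figure 4a (illustrative electrical parameters)
amps = epsp_amplitudes(rate=10.0, M=2.2, d=0.79,
                       g=1e-9, i_leak=1e-11, dt_pulse=1e-5, c_m=1e-12)
```

The successive amplitudes decrease toward a steady value, reproducing the qualitative envelope of Figure 4a.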
If we use the reduced dynamics expression dD/dt = M, we replace equation 2.7 with the linear solution D(t) = Mt + D(t₀) and obtain a simpler expression for D_ss:

$$D_{ss} = \frac{M}{(1-d)\,r}. \tag{6.2}$$
This equation shows that the steady-state D, and hence the steady-state EPSP amplitude, is inversely dependent on the presynaptic rate r. The form of the curve is similar to the results obtained by Abbott et al. (1997), where the data can be fitted with equation 2.3. From the chip, we measured the steady-state EPSP amplitudes using the two input spike distributions over a frequency range of 3 Hz to 50 Hz in steps of 1 Hz. Each frequency interval lasted 15 s, and the EPSP amplitude was averaged over the last 5 s to obtain the steady-state value. Four separate trials were performed; the resulting mean and variance of the measurements are shown in Figure 5 for both the regular train and the Poisson-distributed train. We fitted the experimental data using equations 5.1 and 6.1. The parameters from the fits to the data in Figure 5a were then used to generate the fitted curve to the data from the Poisson-distributed train. The values from the fits give recovery time constants from 1 to 3 s and D_ss values varying between 0.02 and 0.04.
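A quick comparison of equations 6.1 and 6.2 shows where the reduced expression can be trusted: it is accurate when the rate is high enough that D_ss stays small, and it fails at low rates, where it can even exceed the maximum value of 1. A sketch with M and d as in the Figure 3 caption:

```python
import math

def Dss_full(rate, M, d):
    """Equation 6.1."""
    E = math.exp(2.0 * M / rate)
    num = (d - 1.0) * (1.0 + E) + math.sqrt(
        4.0 * d * (E - 1.0) ** 2 + (1.0 - d) ** 2 * (1.0 + E) ** 2)
    return num / (2.0 * d * (E - 1.0))

def Dss_reduced(rate, M, d):
    """Equation 6.2: D_ss = M / ((1 - d) r)."""
    return M / ((1.0 - d) * rate)
```

At 200 Hz the two agree to well under a percent, while at 5 Hz the reduced form overshoots badly, which is consistent with its derivation from the linear (far-from-recovered) regime of the recovery dynamics.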
Different computational roles have been proposed for networks that incorporate synaptic depression. In this section, we describe some measurements that illustrate the postulated roles of depression.

7.1 Gain Control. Depressing synapses have been implicated in cortical gain control (Abbott et al., 1997). A depressing synapse acts like a transient detector for changes in frequency (or a first-derivative filter), similar to the way in which a photoreceptor responds with a higher gain to contrast and is less responsive to background illumination. A synapse with short-term depression responds equally to equal percentage rate changes in its input at different firing rates. We demonstrate the gain-control mechanism of short-term depression by measuring the neuron's response to step changes in input frequency from 10 Hz to 20 Hz to 40 Hz. Each step is the same relative change in input frequency. These results are shown in Figure 6a for a regular train and in Figure 6b for a Poisson-distributed train. Each frequency epoch lasted 3 s, so the synaptic strength should have reached steady state before the next step increase in input frequency. In both panels of Figure 6, the neuron was stimulated by the input (bottom curve) through a depressing synapse (top curve) and a nondepressing synapse (middle curve). We tuned the parameters of
Figure 5: Dependence of the steady-state EPSP amplitude on the input frequency for different values of depression. The data were measured from the fabricated circuit. The solid lines are fits using equations 5.1 and 6.1. The parameters for the fits are determined using the data in a. (a) Steady-state EPSP amplitude versus frequency for a regular train. From the retrieved values of d (between 0.2 and 0.6), we find that D_ss varies between 0.02 and 0.04. (b) Steady-state EPSP amplitude versus frequency for a Poisson-distributed train.
both types of synapses in Figure 6b so that the steady-state firing rate of the neuron was similar during the 20 Hz input period. Figure 6a clearly shows the transient increase in the firing rate of a neuron stimulated through a depressing synapse right after each step increase in input frequency and the subsequent adaptation of its firing rate to a steady-state value. The steady-state firing rate of the neuron with a depressing synapse is less dependent on the absolute input frequency than the firing rate of the neuron stimulated through the nondepressing synapse. In the latter case, the firing rate of the neuron is approximately linear in the input rate. The data shown in Figure 6b were obtained using a Poisson-distributed input. One obvious difference between the responses of the depressing and nondepressing synapses is that the neuron quickly reached threshold for a 10 Hz input in the depressing-synapse case, while the neuron remained subthreshold in the nondepressing-synapse case until the input increased to 20 Hz. This suggests that a depressing synapse can be used to drive a neuron quickly to threshold when the membrane potential of the neuron is far from the threshold.

7.2 Phase Advance. Another feature of the depressing synapse is the phase advance of the membrane response of the neuron. This advance increases when the depression increases (see Figure 7). This feature was exploited in a model that described the direction-selective responses of visual cortical neurons (Chance et al., 1998). In this model, the neuron was driven by the output of a set of lateral geniculate nucleus (LGN) cells through depressing synapses and through an adjacent, spatially shifted set of LGN cells through nondepressing synapses. We have attempted the same experiment using a neuron on our chip by using spikes recorded from an LGN
Figure 6: Facing page. Response of the neuron to step changes in input frequency (bottom curve) when stimulated through a depressing synapse (top curve) and a nondepressing synapse (middle curve). The neuron was stimulated for three frequency intervals (10 Hz, 20 Hz, 40 Hz) lasting 3 s each. (a) Response of the neuron to a regular spiking input. The steady-state firing rate of the neuron increased almost linearly with the input frequency when stimulated through the nondepressing synapse. In the depressing-synapse curve, there is a transient increase in the neuron's firing rate before the rate adapts to steady state. The steady-state output is less dependent on the absolute input frequency. (b) Response of the neuron to a Poisson-distributed input. The parameters for both types of synapses were tuned so that the steady-state firing rates were about the same at the end of each frequency interval for both synapses. Notice that during the 10 Hz interval, the neuron quickly built up to threshold when stimulated through the depressing synapse, but it stayed subthreshold when driven through the nondepressing synapse until the input frequency reached 20 Hz.
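The gain-control behavior of Figure 6 can also be seen in the circuit model alone: iterate the per-spike efficacy D through the 10, 20, and 40 Hz steps and compare the steady-state synaptic drive, rate × D, across steps. A sketch with M and d as in the Figure 3 caption; the spike counts per step are illustrative:

```python
import math

def recover(D0, t, M):
    """Equation 2.7 (kappa = 0.5)."""
    c, s = math.cosh(M * t), math.sinh(M * t)
    return (D0 * c + s) / (c + D0 * s)

def drive_trace(rates, spikes_per_step, M, d):
    """Per-spike efficacy D for a regular train whose rate steps through
    `rates`, as in Figure 6. Returns a list of (rate, D) pairs."""
    D, trace = 1.0, []
    for r in rates:
        for _ in range(spikes_per_step):
            trace.append((r, D))
            D = recover(d * D, 1.0 / r, M)  # eq. 2.5 update + eq. 2.7 recovery
    return trace

trace = drive_trace([10.0, 20.0, 40.0], 60, M=2.2, d=0.6)
# At each step onset D transiently exceeds its new steady state, while the
# steady-state drive rate*D varies far less than the fourfold rate change.
```

This reproduces the two qualitative features in Figure 6a: a transient overshoot right after each step and a steady-state output that is only weakly dependent on the absolute input rate.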
Figure 7: Phase advance in the membrane response for different levels of depression in the synapse. The neuron was driven through the synapse by a Poisson-distributed input depicted at the bottom of the figure. The neuron shows the largest phase advance when the synapse has the largest depression value (V_d = 0.3 V has the largest d value in the plot).
cell in the cat visual cortex during stimulation with a drifting sinusoidal grating (courtesy of K. Martin). An example of the direction-selective response is shown in Figure 8 (Liu, 2001). The direction-selective results were qualitatively similar to the data in Chance et al. (1998).

8 Conclusion
We described a model of synaptic depression derived from a circuit implementation. This circuit model has nonlinear recovery dynamics for the depression variable, in contrast to current theoretical models of dynamic synapses. However, it gives qualitatively similar results when compared to the model of Abbott and colleagues (1997). Measured data from a chip with aVLSI IF neurons and dynamic synapses show that this network can be used to simulate the responses of dynamic networks with short-term dynamical synapses. Experimental results suggest that depressing synapses can be used to drive a neuron quickly up to threshold if its membrane potential is at the resting potential. The silicon networks provide an alternative to computer simulation of spike-based processing models with synapses of different time constants because they run in real time and the computational time does not scale with the size of the neuronal network.
Figure 8: Response to a 1 Hz drifting sinusoidal grating. The input spikes from the LGN cell are depicted at the bottom of the curve. The top curve shows the response of the neuron when the grating drifted in the preferred direction. The sharp excursions at the top of the potential are the output spikes of the neuron. The middle curve shows the response to the stimulus in the null direction. The membrane potential did not build up to threshold.
Appendix
The current equation for a p-type field-effect transistor (pFET) in subthreshold, where the bulk of the transistor is at the voltage V_dd, is I_op e^{κ(V_dd−V_g)/U_T} (e^{−(V_dd−V_s)/U_T} − e^{−(V_dd−V_d)/U_T}), where V_g is the voltage at the gate, and V_d and V_s are the voltages at the drain and source terminals of the transistor; κ describes the efficiency of the gate in controlling the current in the channel and is less than 1 for a transistor operating in subthreshold. The voltages in the exponents are normalized by the thermal voltage U_T. The differential equation for the current in a diode-capacitor circuit (e.g., M1 and C in Figure 1) when there is no input current is given by

$$I_{op}\, e^{\kappa(V_{dd}-V_x)/U_T}\left(e^{-(V_{dd}-V_a)/U_T} - e^{-(V_{dd}-V_x)/U_T}\right) = C\,\frac{dV_x}{dt}. \tag{A.1}$$
By introducing the current variable I_rf = I_op e^{κ(V_dd−V_x)/U_T}, we solve for the differential dI_rf:

$$dI_{rf} = -\frac{\kappa}{U_T}\, I_{rf}\, dV_x \;\Rightarrow\; dV_x = -\frac{U_T}{\kappa}\,\frac{dI_{rf}}{I_{rf}}.$$

By changing variables from V_x to I_rf in equation A.1 and writing Q_T = C U_T, we get

$$Q_T\, d\!\left(\frac{1}{\kappa I_{rf}}\right) = \left(e^{-(V_{dd}-V_a)/U_T} - \left[\frac{I_{op}}{I_{rf}}\right]^{1/\kappa}\right) dt. \tag{A.2}$$

By substituting D = e^{κ(V_dd−V_a)/U_T} I_op/I_rf into equation A.2, we get

$$\frac{dD}{dt} = \frac{I_{op}\,\kappa}{Q_T}\, e^{-(1-\kappa)(V_{dd}-V_a)/U_T}\,\left(1 - D^{1/\kappa}\right) = M\left(1 - D^{1/\kappa}\right), \tag{A.3}$$

where M = (I_op κ/Q_T) e^{−(1−κ)(V_dd−V_a)/U_T}. In steady state, D = 1. The update dynamics for D can be inferred from the dynamics of I_r, which are described by
$$I_r(t_n^{+}) = (1 + a)\, I_r(t_n^{-}), \tag{A.4}$$

where t_n^+ is the time right after the nth spike and t_n^− is the time right before the nth spike. The factor a is determined by a voltage parameter and the pulse width of the presynaptic spike. The corresponding equation for D is

$$D(t_n^{+}) = d\, D(t_n^{-}), \tag{A.5}$$

where d is 1/(1 + a) in equation 2.5.²
Acknowledgments
This work was supported in part by the Swiss National Foundation Research SPP grant. We also acknowledge Kevan Martin, Pamela Baker, and Ora Ohana for many discussions on dynamic synapses.

² In equation A.5, for d = 1, a has to be 0. But a is never zero, due to leakage currents. To compute d, we first solve for 1 + a = e^{κIδt/Q_T}, where I is set by the voltage parameter V_d, δt is the pulse width of the input spike, and κ is a transistor parameter. The smallest I is the leakage current, so for this value, we approximate a = κIδt/Q_T.
References

Abbott, L., Sen, K., Varela, J., & Nelson, S. (1997). Synaptic depression and cortical gain control. Science, 275, 220–223.
Boahen, K. A. (1997a). The retinomorphic approach: Pixel-parallel adaptive amplification, filtering, and quantization. Analog Integrated Circuits and Signal Processing, 13, 53–68.
Boahen, K. A. (1997b). Retinomorphic vision systems: Reverse engineering the vertebrate retina. Unpublished doctoral dissertation, California Institute of Technology, Pasadena, CA.
Chance, F., Nelson, S., & Abbott, L. (1998). Synaptic depression and the temporal response characteristics of V1 cells. Journal of Neuroscience, 18, 4785–4799.
Indiveri, G. (2000). Modeling selective attention using a neuromorphic aVLSI device. Neural Computation, 12, 2857–2880.
Liu, S.-C. (2001, Oct. 19). Simple cortical modeling with aVLSI spiking neurons and dynamic synapses. Paper presented at the ZNZ Symposium, University of Zurich.
Liu, S.-C. (in press). Analog VLSI circuits for short-term dynamic synapses. EURASIP Journal on Applied Signal Processing: Special Issue.
Liu, S.-C., Kramer, J., Indiveri, G., Delbrück, T., Burg, T., & Douglas, R. (2001). Orientation-selective aVLSI spiking neurons. Neural Networks: Special Issue, 14(6/7), 629–643.
Maass, W., & Zador, A. (1999). Computing and learning with dynamic synapses. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 157–178). Cambridge, MA: MIT Press.
Markram, H., Wang, Y., & Tsodyks, M. (1998). Differential signaling via the same axon of neocortical pyramidal neurons. Proc. Natl. Acad. Sci. USA, 95, 5323–5328.
Matveev, V., & Wang, X. (2000). Differential short-term synaptic plasticity and transmission of complex spike trains: To depress or to facilitate? Cerebral Cortex, 10, 1143–1153.
Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.
Natschlager, T., Maass, W., & Zador, A. (2001). Efficient temporal processing with biologically realistic dynamic synapses. Network, 12, 75–87.
Rasche, C., & Hahnloser, R. (2001). Silicon synaptic depression. Biological Cybernetics, 84, 57–62.
Reyes, A., Lujan, R., Rozov, A., Burnashev, N., Somogyi, P., & Sakmann, B. (1998). Target-cell-specific facilitation and depression in neocortical circuits. Nature Neuroscience, 1, 279–285.
Senn, W., Segev, I., & Tsodyks, M. (1998). Reading neuronal synchrony with depressing synapses. Neural Computation, 10, 815–819.
Stratford, K., Tarczy-Hornoch, K., Martin, K., Bannister, N., & Jack, J. (1998). Excitatory synaptic inputs to spiny stellate cells in cat visual cortex. Nature, 382, 258–261.
Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94(2), 719–723.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Computation, 10, 821–835.
Van Schaik, A. (2001). Building blocks for electronic spiking neural networks. Neural Networks, 14(6/7), 617–628.
Varela, J., Sen, K., Gibson, J., Fost, J., Abbott, L., & Nelson, S. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. Journal of Neuroscience, 17, 7926–7940.

Received June 14, 2002; accepted July 24, 2002.
LETTER
Communicated by Hagai Attias
Dictionary Learning Algorithms for Sparse Representation

Kenneth Kreutz-Delgado
[email protected] Joseph F. Murray
[email protected] Bhaskar D. Rao
[email protected] Electrical and Computer Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California 92093-0407, U.S.A. Kjersti Engan
[email protected] Stavanger University College, School of Science and Technology Ullandhaug, N-4091 Stavanger, Norway Te-Won Lee
[email protected] Terrence J. Sejnowski
[email protected] Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute, La Jolla, California 92037, U.S.A. Algorithms for data-driven learning of domain-specic overcomplete dictionaries are developed to obtain maximum likelihood and maximum a posteriori dictionary estimates based on the use of Bayesian models with concave/Schur-concave (CSC) negative log priors. Such priors are appropriate for obtaining sparse representations of environmental signals within an appropriately chosen (environmentally matched) dictionary. The elements of the dictionary can be interpreted as concepts, features, or words capable of succinct expression of events encountered in the environment (the source of the measured signals). This is a generalization of vector quantization in that one is interested in a description involving a few dictionary entries (the proverbial “25 words or less”), but not necessarily as succinct as one entry. To learn an environmentally adapted dictionary capable of concise expression of signals generated by the environment, we develop algorithms that iterate between a representative set of sparse representations found by variants of FOCUSS and an update of the dictionary using these sparse representations. Experiments were performed using synthetic data and natural images. For complete dictionaries, we demonstrate that our algorithms have imNeural Computation 15, 349–396 (2003)
° c 2002 Massachusetts Institute of Technology
350
K. Kreutz-Delgado, J. Murray, B. Rao, K. Engan, T. Lee, and T. Sejnowski
proved performance over other independent component analysis (ICA) methods, measured in terms of signal-to-noise ratios of separated sources. In the overcomplete case, we show that the true underlying dictionary and sparse sources can be accurately recovered. In tests with natural images, learned overcomplete dictionaries are shown to have higher coding efciency than complete dictionaries; that is, images encoded with an overcomplete dictionary have both higher compression (fewer bits per pixel) and higher accuracy (lower mean square error). 1 Introduction
FOCUSS, which stands for FOCal Underdetermined System Solver, is an algorithm designed to obtain suboptimally (and, at times, maximally) sparse solutions to the following m × n underdetermined linear inverse problem¹ (Gorodnitsky, George, & Rao, 1995; Rao & Gorodnitsky, 1997; Gorodnitsky & Rao, 1997; Adler, Rao, & Kreutz-Delgado, 1996; Rao & Kreutz-Delgado, 1997; Rao, 1997, 1998),

y = Ax,    (1.1)

for known A. The sparsity of a vector is the number of zero-valued elements (Donoho, 1994) and is related to the diversity, the number of nonzero elements:

sparsity = #{x[i] = 0},
diversity = #{x[i] ≠ 0},
diversity = n − sparsity.

Since our initial investigations into the properties of FOCUSS as an algorithm for providing sparse solutions to linear inverse problems in relatively noise-free environments (Gorodnitsky et al., 1995; Rao, 1997; Rao & Gorodnitsky, 1997; Gorodnitsky & Rao, 1997; Adler et al., 1996; Rao & Kreutz-Delgado, 1997), we now better understand the behavior of FOCUSS in noisy environments (Rao & Kreutz-Delgado, 1998a, 1998b) and as an interior point-like optimization algorithm for optimizing concave functionals subject to linear constraints (Rao & Kreutz-Delgado, 1999; Kreutz-Delgado & Rao, 1997, 1998a, 1998b, 1998c, 1999; Kreutz-Delgado, Rao, Engan, Lee, & Sejnowski, 1999a; Engan, Rao, & Kreutz-Delgado, 2000; Rao, Engan, Cotter, & Kreutz-Delgado, 2002). In this article, we consider the use of the FOCUSS algorithm in the case where the matrix A is unknown and must be learned. Toward this end, we first briefly discuss how the use of concave (and

¹ For notational simplicity, we consider the real case only.
Schur-concave) functionals enforces sparse solutions to equation 1.1. We also discuss the choice of the matrix, A, in equation 1.1 and its relationship to the set of signal vectors y for which we hope to obtain sparse representations. Finally, we present algorithms capable of learning an environmentally adapted dictionary, A, given a sufficiently large and statistically representative sample of signal vectors, y, building on ideas originally presented in Kreutz-Delgado, Rao, Engan, Lee, and Sejnowski (1999b), Kreutz-Delgado, Rao, and Engan (1999), and Engan, Rao, and Kreutz-Delgado (1999). We refer to the columns of the full row-rank m × n matrix A,

A = [a_1, . . . , a_n] ∈ ℝ^(m×n), n ≫ m,    (1.2)

as a dictionary, and they are assumed to be a set of vectors capable of providing a highly succinct representation for most (and, ideally, all) statistically representative signal vectors y ∈ ℝ^m. Note that with the assumption that rank(A) = m, every vector y has a representation; the question at hand is whether this representation is likely to be sparse. We call the statistical generating mechanism for signals, y, the environment and a dictionary, A, within which such signals can be sparsely represented an environmentally adapted dictionary. Environmentally generated signals typically have significant statistical structure and can be represented by a set of basis vectors spanning a lower-dimensional submanifold of meaningful signals (Field, 1994; Ruderman, 1994). These environmentally meaningful representation vectors can be obtained by maximizing the mutual information between the set of these vectors (the dictionary) and the signals generated by the environment (Comon, 1994; Bell & Sejnowski, 1995; Deco & Obradovic, 1996; Olshausen & Field, 1996; Zhu, Wu, & Mumford, 1997; Wang, Lee, & Juang, 1997). This procedure can be viewed as a natural generalization of independent component analysis (ICA) (Comon, 1994; Deco & Obradovic, 1996). As initially developed, this procedure usually results in obtaining a minimal spanning set of linearly independent vectors (i.e., a true basis). More recently, the desirability of obtaining "overcomplete" sets of vectors (or "dictionaries") has been noted (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000; Coifman & Wickerhauser, 1992; Mallat & Zhang, 1993; Donoho, 1994; Rao & Kreutz-Delgado, 1997). For example, projecting measured noisy signals onto the signal submanifold spanned by a set of dictionary vectors results in noise reduction and data compression (Donoho, 1994, 1995).
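The sparsity and diversity counts defined after equation 1.1 are straightforward to compute; the following is an illustrative sketch (the example vector and variable names are ours, not the paper's):

```python
import numpy as np

# Example vector (ours): n = 5 components, two of them nonzero.
x = np.array([0.0, 1.3, 0.0, -0.7, 0.0])
n = x.size
sparsity = int(np.count_nonzero(x == 0))   # #{x[i] == 0}
diversity = int(np.count_nonzero(x))       # #{x[i] != 0}
print(sparsity, diversity)                 # diversity == n - sparsity
```

In floating-point practice one would count entries below a small threshold rather than exact zeros; the exact test is kept here only to mirror the definitions.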
These dictionaries can be structured as a set of bases from which a single basis is to be selected to represent the measured signal(s) of interest (Coifman & Wickerhauser, 1992) or as a single, overcomplete set of individual vectors from within which a vector, y, is to be sparsely represented (Mallat & Zhang, 1993; Olshausen & Field, 1996; Lewicki & Sejnowski, 2000; Rao & Kreutz-Delgado, 1997). The problem of determining a representation from a full row-rank overcomplete dictionary, A = [a_1, . . . , a_n], n ≫ m, for a specific signal measurement, y, is equivalent to solving an underdetermined inverse problem, Ax = y, which is nonuniquely solvable for any y. The standard least-squares solution to this problem has the (at times) undesirable feature of involving all the dictionary vectors in the solution² (the "spurious artifact" problem) and does not generally allow for the extraction of a categorically or physically meaningful solution. That is, it is not generally the case that a least-squares solution yields a concise representation allowing for a precise semantic meaning.³ If the dictionary is large and rich enough in representational power, a measured signal can be matched to a very few (perhaps even just one) dictionary words. In this manner, we can obtain concise semantic content about objects or situations encountered in natural environments (Field, 1994). Thus, there has been significant interest in finding sparse solutions, x (solutions having a minimum number of nonzero elements), to the signal representation problem. Interestingly, matching a specific signal to a sparse set of dictionary words or vectors can be related to entropy minimization as a means of elucidating statistical structure (Watanabe, 1981). Finding a sparse representation (based on the use of a "few" code or dictionary words) can also be viewed as a generalization of vector quantization where a match to a single "code vector" (word) is always sought (taking "code book" = "dictionary").⁴ Indeed, we can refer to a sparse solution, x, as a sparse coding of the signal instantiation, y.

1.1 Stochastic Models. It is well known (Basilevsky, 1994) that the stochastic generative model

y = Ax + ν,    (1.3)

can be used to develop algorithms enabling coding of y ∈ ℝ^m via solving the inverse problem for a sparse solution x ∈ ℝ^n for the undercomplete (n < m) and complete (n = m) cases. In recent years, there has been a great deal of interest in obtaining sparse codings of y with this procedure for the overcomplete (n > m) case (Mallat & Zhang, 1993; Field, 1994). In our earlier work, we have shown that given an overcomplete dictionary, A (with the columns of A comprising the dictionary vectors), a maximum a posteriori (MAP) estimate of the source vector, x, will yield a sparse coding of y in the low-noise limit if the negative log prior, −log(P(x)), is concave/Schur-concave (CSC) (Rao, 1998; Kreutz-Delgado & Rao, 1999), as discussed below.

² This fact comes as no surprise when the solution is interpreted within a Bayesian framework using a gaussian (maximum entropy) prior.
³ Taking "semantic" here to mean categorically or physically interpretable.
⁴ For example, n = 100 corresponds to 100 features encoded via vector quantization ("one column = one concept"). If we are allowed to represent features using up to four columns, we can encode (100 choose 1) + (100 choose 2) + (100 choose 3) + (100 choose 4) = 4,087,975 concepts, showing a combinatorial boost in expressive power.
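The combinatorial count in footnote 4 can be verified with the standard library (the variable name is ours):

```python
from math import comb

# Concepts expressible by choosing up to four of n = 100 dictionary
# columns, as in footnote 4.
concepts = sum(comb(100, k) for k in range(1, 5))
print(concepts)  # 4087975
```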
For P(x) factorizable into a product of marginal probabilities, the resulting code is also known to provide an independent component analysis (ICA) representation of y. More generally, a CSC prior results in a sparse representation even in the nonfactorizable case (with x then forming a dependent component analysis, or DCA, representation). Given independently and identically distributed (i.i.d.) data, Y = Y_N = (y_1, . . . , y_N), assumed to be generated by the model 1.3, a maximum likelihood estimate, Â_ML, of the unknown (but nonrandom) dictionary A can be determined as (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000)

Â_ML = arg max_A P(Y; A).

This requires integrating out the unobservable i.i.d. source vectors, X = X_N = (x_1, . . . , x_N), in order to compute P(Y; A) from the (assumed) known probabilities P(x) and P(ν). In essence, X is formally treated as a set of nuisance parameters that in principle can be removed by integration. However, because the prior P(x) is generally taken to be supergaussian, this integration is intractable or computationally unreasonable. Thus, approximations to this integration are performed that result in an approximation to P(Y; A), which is then maximized with respect to A. A new, better approximation to the integration can then be made, and this process is iterated until the estimate of the dictionary A has (hopefully) converged (Olshausen & Field, 1996). We refer to the resulting estimate as an approximate maximum likelihood (AML) estimate of the dictionary A (denoted here by Â_AML). No formal proof of the convergence of this algorithm to the true maximum likelihood estimate, Â_ML, has been given in the prior literature, but it appears to perform well in various test cases (Olshausen & Field, 1996). Below, we discuss the problem of dictionary learning within the framework of our recently developed log-prior model-based sparse source vector learning approach that for a known overcomplete dictionary can be used to obtain sparse codes (Rao, 1998; Kreutz-Delgado & Rao, 1997, 1998b, 1998c, 1999; Rao & Kreutz-Delgado, 1999). Such sparse codes can be found using FOCUSS, an affine scaling transformation (AST)–like iterative algorithm that finds a sparse locally optimal MAP estimate of the source vector x for an observation y. Using these results, we can develop dictionary learning algorithms within the AML framework and for obtaining a MAP-like estimate, Â_MAP, of the (now assumed random) dictionary, A, assuming in the latter case that the dictionary belongs to a compact submanifold corresponding to unit Frobenius norm. Under certain conditions, convergence to a local minimum of a MAP-loss function that combines functions of the discrepancy e = (y − Ax) and the degree of sparsity in x can be rigorously proved.

1.2 Related Work. Previous work includes efforts to solve equation 1.3 in the overcomplete case within the maximum likelihood (ML) framework.
An algorithm for finding sparse codes was developed in Olshausen and Field (1997) and tested on small patches of natural images, resulting in Gabor-like receptive fields. In Lewicki and Sejnowski (2000), another ML algorithm is presented, which uses the Laplacian prior to enforce sparsity. The values of the elements of x are found with a modified conjugate gradient optimization (which has a rather complicated implementation), as opposed to the standard ICA (square mixing matrix) case, where the coefficients are found by inverting the A matrix. The difficulty that arises when using ML is that finding the estimate of the dictionary A requires integrating over all possible values of the joint density P(y, x; A) as a function of x. In Olshausen and Field (1997), this is handled by assuming the prior density of x is a delta function, while in Lewicki and Sejnowski (2000), it is approximated by a gaussian. The fixed-point FastICA algorithm (Hyvärinen, Cristescu, & Oja, 1999) has also been extended to generate overcomplete representations. The FastICA algorithm can find the basis functions (columns of the dictionary A) one at a time by imposing a quasi-orthogonality condition and can be thought of as a "greedy" algorithm. It also can be run in parallel, meaning that all columns of A are updated together. Other methods to solve equation 1.3 in the overcomplete case have been developed using a combination of the expectation-maximization (EM) algorithm and variational approximation techniques. Independent factor analysis (Attias, 1999) uses a mixture of gaussians to approximate the prior density of the sources, which avoids the difficulty of integrating out the parameters X and allows different sources to have different densities. In another method (Girolami, 2001), the source priors are assumed to be supergaussian (heavy-tailed), and a variational lower bound is developed that is used in the EM estimation of the parameters A and X. It is noted in Girolami (2001) that the mixtures used in independent factor analysis are more general than may be needed for the sparse overcomplete case, and they can be computationally expensive as the dimension of the data vector and the number of mixtures increase. In Zibulevsky and Pearlmutter (2001), the blind source separation problem is formulated in terms of a sparse source underlying each unmixed signal. These sparse sources are expanded into the unmixed signal with a predefined wavelet dictionary, which may be overcomplete. The unmixed signals are linearly combined by a different mixing matrix to create the observed sensor signals. The method is shown to give better separation performance than ICA techniques. The use of learned dictionaries (instead of dictionaries chosen a priori) is suggested.
2 FOCUSS: Sparse Solutions for Known Dictionaries

2.1 Known Dictionary Model. A Bayesian interpretation is obtained from the generative signal model, equation 1.3, by assuming that x has the
parameterized (generally nongaussian) probability density function (pdf),

P_p(x) = Z_p^(−1) e^(−c_p d_p(x)),   Z_p = ∫ e^(−c_p d_p(x)) dx,    (2.1)

with parameter vector p. Similarly, the noise ν is assumed to have a parameterized (possibly nongaussian) density P_q(ν) of the same form as equation 2.1 with parameter vector q. It is assumed that x and ν have zero means and that their densities obey the property d(x) = d(|x|), for | · | defined component-wise. This is equivalent to assuming that the densities are symmetric with respect to sign changes in the components of x, x[i] ← −x[i], and therefore that the skews of these densities are zero. We also assume that d(0) = 0. With a slight abuse of notation, we allow the differing subscripts q and p to indicate that d_q and d_p may be functionally different as well as parametrically different. We refer to densities like equation 2.1, for suitable additional constraints on d_p(x), as hypergeneralized gaussian distributions (Kreutz-Delgado & Rao, 1999; Kreutz-Delgado et al., 1999). If we treat A, p, and q as known parameters, then x and y are jointly distributed as P(x, y) = P(x, y; p, q, A). Bayes' rule yields

P(x | y; p, q, A) = (1/b) P(y | x) · P(x) = (1/b) P_q(y − Ax) · P_p(x),    (2.2)

b = P(y) = P(y; p, q, A) = ∫ P(y | x) · P_p(x) dx.    (2.3)

Usually the dependence on p and q is notationally suppressed, for example, b = P(y; A). Given an observation, y, maximizing equation 2.2 with respect to x yields a MAP estimate x̂. This ideally results in a sparse coding of the observation, a requirement that places functional constraints on the pdfs, and particularly on d_p. Note that b is independent of x and can be ignored when optimizing equation 2.2 with respect to the unknown source vector x. The MAP estimate equivalently is obtained from minimizing the negative logarithm of P(x | y), which is

x̂ = arg min_x d_q(y − Ax) + λ d_p(x),    (2.4)

where λ = c_p / c_q and d_q(y − Ax) = d_q(Ax − y) by our assumption of symmetry. The quantity λ^(−1) is interpretable as a signal-to-noise ratio (SNR). Furthermore, assuming that both d_q and d_p are concave/Schur-concave (CSC) as defined in section 2.4, the term d_q(y − Ax) in equation 2.4 will encourage sparse
residuals, e = y − Ax̂, while the term d_p(x) encourages sparse source vector estimates, x̂. A given value of λ then determines a trade-off between residual and source vector sparseness. This most general formulation will not be used here. Although we are interested in obtaining sparse source vector estimates, we will not enforce sparsity on the residuals but instead, to simplify the development, will assume the q = 2 i.i.d. gaussian measurement noise case (ν gaussian with known covariance σ² · I), which corresponds to taking

c_q d_q(y − Ax̂) = (1/(2σ²)) ‖y − Ax̂‖².    (2.5)

In this case, problem 2.4 becomes

x̂ = arg min_x (1/2) ‖y − Ax‖² + λ d_p(x).    (2.6)

In either case, we note that λ → 0 as c_p → 0, which (consistent with the generative model, 1.3) we refer to as the low-noise limit. Because the mapping A is assumed to be onto, in the low-noise limit, the optimization, equation 2.4, is equivalent to the linearly constrained problem,

x̂ = arg min_x d_p(x)   subject to   Ax = y.    (2.7)
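As a concrete illustration of the trade-off in equation 2.6, the sketch below evaluates the objective for two exact solutions of Ax = y, one sparse and one dense; the matrix, the vectors, and the helper name are illustrative choices of ours, not the paper's:

```python
import numpy as np

def map_objective(A, y, x, lam, p=0.5):
    """0.5*||y - Ax||^2 + lam * d_p(x), with d_p(x) = sum_i |x_i|^p."""
    r = y - A @ x
    return 0.5 * r @ r + lam * np.sum(np.abs(x) ** p)

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])          # overcomplete: m = 2, n = 3
y = np.array([1.0, 1.0])
x_sparse = np.array([0.0, 0.0, 1.0])     # one nonzero, A @ x == y
x_dense = np.array([1.0, 1.0, 0.0])      # two nonzeros, A @ x == y
# For p <= 1 the CSC penalty prefers the sparser exact solution.
print(map_objective(A, y, x_sparse, 0.1) <
      map_objective(A, y, x_dense, 0.1))  # True
```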
In the low-noise limit, no sparseness constraint need be placed on the residuals, e = y − Ax̂, which are assumed to be zero. It is evident that the structure of d_p(·) is critical for obtaining a sparse coding, x̂, of the observation y (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999). Throughout this article, the quantity d_p(x) is always assumed to be CSC (enforcing sparse solutions to the inverse problem 1.3). As noted, and as will be evident during the development of dictionary learning algorithms below, we do not impose a sparsity constraint on the residuals; instead, the measurement noise ν will be assumed to be gaussian (q = 2).

2.2 Independent Component Analysis and Sparsity-Inducing Priors.
An important class of densities is given by the generalized gaussians, for which

d_p(x) = ‖x‖_p^p = Σ_{k=1}^{n} |x[k]|^p,    (2.8)

for p > 0 (Kassam, 1982). This is a special case of the larger ℓ_p class (the p-class) of functions, which allows p to be negative in value (Rao & Kreutz-Delgado, 1999; Kreutz-Delgado & Rao, 1997). Note that this function has the special property of separability,

d_p(x) = Σ_{k=1}^{n} d_p(x[k]),
which corresponds to factorizability of the density P_p(x),

P_p(x) = Π_{k=1}^{n} P_p(x[k]),

and hence to independence of the components of x. The assumption of independent components allows the problem of solving the generative model, equation 1.3, for x to be interpreted as an ICA problem (Comon, 1994; Pham, 1996; Olshausen & Field, 1996; Roberts, 1998). It is of interest, then, to consider the development of a large class of parameterizable separable functions d_p(x) consistent with the ICA assumption (Rao & Kreutz-Delgado, 1999; Kreutz-Delgado & Rao, 1997). Given such a class, it is natural to examine the issue of finding a best fit within this class to the "true" underlying prior density of x. This is a problem of parametric density estimation of the true prior, where one attempts to find an optimal choice of the model density P_p(x) by an optimization over the parameters p that define the choice of a prior from within the class. This is, in general, a difficult problem, which may require the use of Monte Carlo, evolutionary programming, or stochastic search techniques. Can the belief that supergaussian priors, P_p(x), are appropriate for finding sparse solutions to equation 1.3 (Field, 1994; Olshausen & Field, 1996) be clarified or made rigorous? It is well known that the generalized gaussian distribution arising from the use of equation 2.8 yields supergaussian distributions (positive kurtosis) for p < 2 and subgaussian (negative kurtosis) for p > 2. However, one can argue (see section 2.5 below) that the condition for obtaining sparse solutions in the low-noise limit is the stronger requirement that p ≤ 1, in which case the separable function d_p(x) is CSC. This indicates that supergaussianness (positive kurtosis) alone is necessary but not sufficient for inducing sparse solutions. Rather, sufficiency is given by the requirement that −log P_p(x) ≈ d_p(x) be CSC.
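The distinction drawn above can be seen numerically: for two vectors with the same ℓ2 norm, the measure of equation 2.8 with p ≤ 1 ranks the sparser vector as less diverse, while p = 2 cannot separate them (a sketch with example vectors of our choosing):

```python
import numpy as np

def d_p(x, p):
    """Diversity measure of equation 2.8: sum_k |x[k]|^p."""
    return np.sum(np.abs(x) ** p)

sparse = np.array([1.0, 0.0])                  # one nonzero
dense = np.array([1.0, 1.0]) / np.sqrt(2.0)   # same l2 norm, two nonzeros
print(d_p(sparse, 0.5) < d_p(dense, 0.5))      # True: p <= 1 favors sparsity
print(np.isclose(d_p(sparse, 2.0), d_p(dense, 2.0)))  # True: p = 2 is blind
```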
We have seen that the function d_p(x) has an interpretation as (the negative logarithm of) a Bayesian prior or as a penalty function enforcing sparsity in equation 2.4, where d_p(x) should serve as a "relaxed counting function" on the nonzero elements of x. Our perspective emphasizes that d_p(x) serves both of these goals simultaneously. Thus, good regularizing functions, d_p(x), should be flexibly parameterizable so that P_p(x) can be optimized over the parameter vector p to provide a good parametric fit to the underlying environmental pdf, and the functions should also have analytical properties consistent with the goal of enforcing sparse solutions. Such properties are discussed in the next section.

2.3 Majorization and Schur-Concavity. In this section, we discuss functions that are both concave and Schur-concave (CSC functions; Marshall & Olkin, 1979). We will call functions, d_p(·), which are CSC, diversity functions, anticoncentration functions, or antisparsity functions. The larger the value of the
CSC function d_p(x), the more diverse (i.e., the less concentrated or sparse) the elements of the vector x are. Thus, minimizing d_p(x) with respect to x results in less diverse (more concentrated or sparse) vectors x.

2.3.1 Schur-Concave Functions. A measure of the sparsity of the elements of a solution vector x (or the lack thereof, which we refer to as the diversity of x) is given by a partial ordering on vectors known as the Lorentz order. For any vector in the positive orthant, x ∈ ℝ_+^n, define the decreasing rearrangement (x_[1], . . . , x_[n]), x_[1] ≥ · · · ≥ x_[n] ≥ 0, and the partial sums (Marshall & Olkin, 1979; Wickerhauser, 1994),

S_x[k] = Σ_{i=1}^{k} x_[i],   k = 1, . . . , n.

We say that y majorizes x, y ≻ x, iff for k = 1, . . . , n,

S_y[k] ≥ S_x[k];   S_y[n] = S_x[n],

and the vector y is said to be more concentrated, or less diverse, than x. This partial order defined by majorization then defines the Lorentz order. We are interested in scalar-valued functions of x that are consistent with majorization. Such functions are known as Schur-concave functions, d(·) : ℝ_+^n → ℝ. They are defined to be precisely the class of functions consistent with the Lorentz order,

y ≻ x  ⇒  d(y) < d(x).

In words, if y is less diverse than x (according to the Lorentz order), then d(y) is less than d(x) for d(·) Schur-concave. We assume that Schur-concavity is a necessary condition for d(·) to be a good measure of diversity (antisparsity).

2.3.2 Concavity Yields Sparse Solutions. Recall that a function d(·) is concave on the positive orthant ℝ_+^n iff (Rockafellar, 1970)

d((1 − γ)x + γy) ≥ (1 − γ) d(x) + γ d(y),   ∀x, y ∈ ℝ_+^n, ∀γ, 0 ≤ γ ≤ 1.

In addition, a scalar function is said to be permutation invariant if its value is independent of rearrangements of its components. An important fact is that for permutation invariant functions, concavity is a sufficient condition for Schur-concavity:

concavity + permutation invariance ⇒ Schur-concavity.

Now consider the low-noise sparse inverse problem, 2.7. It is well known that, subject to linear constraints, a concave function on ℝ_+^n takes its minima
on the boundary of ℝ_+^n (Rockafellar, 1970), and as a consequence these minima are sparse. We take concavity to be a sufficient condition for a permutation invariant d(·) to be a measure of diversity and obtain sparsity as constrained minima of d(·). More generally, a diversity measure should be somewhere between Schur-concave and concave. In this spirit, one can define almost concave functions (Kreutz-Delgado & Rao, 1997), which are Schur-concave and (locally) concave in all n directions but one, and which are also good measures of diversity.

2.3.3 Separability, Schur-Concavity, and ICA. The simplest way to ensure that d(x) be permutation invariant (a necessary condition for Schur-concavity) is to use functions that are separable. Recall that separability of d_p(x) corresponds to factorizability of P_p(x). Thus, separability of d(x) corresponds to the assumption of independent components of x under the model 1.3. We see that from a Bayesian perspective, separability of d(x) corresponds to a generative model for y that assumes a source, x, with independent components. With this assumption, we are working within the framework of ICA (Nadal & Parga, 1994; Pham, 1996; Roberts, 1998). We have developed effective algorithms for solving the optimization problem 2.7 for sparse solutions when d_p(x) is separable and concave (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999). It is now evident that relaxing the restriction of separability generalizes the generative model to the case where the source vector, x, has dependent components. We can reasonably call an approach based on a nonseparable diversity measure d(x) a dependent component analysis (DCA). Unless care is taken, this relaxation can significantly complicate the analysis and development of optimization algorithms. However, one can solve the low-noise DCA problem, at least in principle, provided appropriate choices of nonseparable diversity functions are made.

2.4 Supergaussian Priors and Sparse Coding. The p-class of diversity measures for 0 < p ≤ 1 results in sparse solutions to the low-noise coding problem, 2.7. These separable and concave (and thus Schur-concave) diversity measures correspond to supergaussian priors, consistent with the folk theorem that supergaussian priors are sparsity-enforcing priors. However, taking 1 ≤ p < 2 results in supergaussian priors that are not sparsity enforcing. Taking p to be between 1 and 2 yields a d_p(x) that is convex and therefore not concave. This is consistent with the well-known fact that for this range of p, the pth root of d_p(x) is a norm. Minimizing d_p(x) in this case drives x toward the origin, favoring concentrated rather than sparse solutions. We see that if a sparse coding is to be found based on obtaining a MAP estimate to the low-noise generative model, 1.3, then, in a sense, supergaussianness is a necessary but not sufficient condition for a prior to be sparsity enforcing. A sufficient condition for obtaining a sparse MAP coding is that the negative log prior be CSC.
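The majorization test of section 2.3.1, and its consistency with a Schur-concave diversity measure such as d_p with p = 0.5, can be checked directly from the partial-sum definition (an illustrative implementation; the function name and example vectors are ours):

```python
import numpy as np

def majorizes(y, x):
    """y majorizes x (y > x in the Lorentz order): the partial sums of the
    decreasing rearrangement of y dominate those of x, with equal totals."""
    Sy = np.cumsum(np.sort(y)[::-1])
    Sx = np.cumsum(np.sort(x)[::-1])
    return bool(np.all(Sy >= Sx) and np.isclose(Sy[-1], Sx[-1]))

concentrated = np.array([3.0, 0.0, 0.0])
spread = np.array([1.0, 1.0, 1.0])
print(majorizes(concentrated, spread))   # True: more concentrated
print(majorizes(spread, concentrated))   # False
# Consistent with Schur-concavity: d_p(concentrated) < d_p(spread) for p = 0.5.
print(np.sum(concentrated ** 0.5) < np.sum(spread ** 0.5))  # True
```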
2.5 The FOCUSS Algorithm. Locally optimal solutions to the known-dictionary sparse inverse problems in gaussian noise, equations 2.6 and 2.7, are given by the FOCUSS algorithm. This is an affine-scaling transformation (AST)-like (interior point) algorithm originally proposed for the low-noise case, 2.7, in Rao and Kreutz-Delgado (1997, 1999) and Kreutz-Delgado and Rao (1997), and extended by regularization to the nontrivial noise case, equation 2.6, in Rao and Kreutz-Delgado (1998a), Engan et al. (2000), and Rao et al. (2002). In these references, it is shown that the FOCUSS algorithm has excellent behavior for concave functions d_p(·) (which include the CSC concentration functions). For such functions, FOCUSS quickly converges to a local minimum, yielding a sparse solution to problems 2.7 and 2.6. One can quickly motivate the development of the FOCUSS algorithm appropriate for solving the optimization problem 2.6 by considering the problem of obtaining the stationary points of the objective function. These are given as solutions, x*, to

Aᵀ(Ax* − y) + λ ∇_x d_p(x*) = 0.    (2.9)

In general, equation 2.9 is nonlinear and cannot be explicitly solved for a solution x*. However, we proceed by assuming the existence of a gradient factorization,

∇_x d_p(x) = α(x) Π(x) x,    (2.10)

where α(x) is a positive scalar function and Π(x) is symmetric, positive-definite, and diagonal. As discussed in Kreutz-Delgado and Rao (1997, 1998c) and Rao and Kreutz-Delgado (1999), this assumption is generally true for CSC sparsity functions d_p(·) and is key to understanding FOCUSS as a sparsity-inducing interior-point (AST-like) optimization algorithm.⁵ With the gradient factorization 2.10, the stationary points of equation 2.9 are readily shown to be solutions to the (equally nonlinear and implicit) system,

x* = (AᵀA + β(x*) Π(x*))⁻¹ Aᵀ y    (2.11)
   = Π⁻¹(x*) Aᵀ (β(x*) I + A Π⁻¹(x*) Aᵀ)⁻¹ y,    (2.12)

where β(x) = λ α(x) and the second equation follows from identity A.18. Although equation 2.12 is also not generally solvable in closed form, it does

⁵ This interpretation, which is not elaborated on here, follows from defining a diagonal positive-definite affine scaling transformation matrix W(x) by the relation Π(x) = W⁻²(x).
suggest the following relaxation algorithm,

x̂ ← Π⁻¹(x̂) Aᵀ (β(x̂) I + A Π⁻¹(x̂) Aᵀ)⁻¹ y,    (2.13)

which is to be iterated repeatedly until convergence. Taking β ≡ 0 in equation 2.13 yields the FOCUSS algorithm, proved in Kreutz-Delgado and Rao (1997, 1998c) and Rao and Kreutz-Delgado (1999) to converge to a sparse solution of equation 2.7 for CSC sparsity functions d_p(·). The case β ≠ 0 yields the regularized FOCUSS algorithm, which will converge to a sparse solution of equation 2.6 (Rao, 1998; Engan et al., 2000; Rao et al., 2002). More computationally robust variants of equation 2.13 are discussed elsewhere (Gorodnitsky & Rao, 1997; Rao & Kreutz-Delgado, 1998a). Note that for the general regularized FOCUSS algorithm, 2.13, we have β(x̂) = λ α(x̂), where λ is the regularization parameter in equation 2.4. The function β(x) is usually generalized to be a function of x̂, y, and the iteration number. Methods for choosing λ include the quality-of-fit criterion, the sparsity criterion, and the L-curve (Engan, 2000; Engan et al., 2000; Rao et al., 2002). The quality-of-fit criterion attempts to minimize the residual error y − Ax (Rao, 1997), which can be shown to converge to a sparse solution (Rao & Kreutz-Delgado, 1999). The sparsity criterion requires that a certain number of elements of each x_k be nonzero. The L-curve method adjusts λ to optimize the trade-off between the residual and the sparsity of x. The plot of d_p(x) versus d_q(y − Ax) has an L shape, the corner of which provides the best trade-off. The corner of the L-curve is the point of maximum curvature and can be found by a one-dimensional maximization of the curvature function (Hansen & O'Leary, 1993). A hybrid approach known as the modified L-curve method combines the L-curve method on a linear scale and the quality-of-fit criterion, which is used to place limits on the range of λ that can be chosen by the L-curve (Engan, 2000).
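The relaxation 2.13 is compact enough to sketch in full. The version below is a minimal implementation of ours, not the authors' code: it holds β constant (rather than letting β(x̂) = λα(x̂) vary with x̂, y, and the iteration number) and uses the p-class gradient factorization, for which Π⁻¹(x̂) = diag(|x̂|^(2−p)):

```python
import numpy as np

def focuss(A, y, p=0.5, beta=1e-8, iters=30):
    """Sketch of the regularized FOCUSS relaxation, equation 2.13."""
    m, _ = A.shape
    x = A.T @ np.linalg.solve(A @ A.T, y)   # minimum-norm initial guess
    for _ in range(iters):
        pi_inv = np.abs(x) ** (2.0 - p)     # diagonal of Pi^{-1}(x)
        G = (A * pi_inv) @ A.T + beta * np.eye(m)
        x = pi_inv * (A.T @ np.linalg.solve(G, y))
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 20))           # overcomplete: m = 10, n = 20
x_true = np.zeros(20)
x_true[[3, 11]] = [1.0, -2.0]               # 2-sparse source
x_hat = focuss(A, A @ x_true)
```

Taking beta = 0 recovers the low-noise FOCUSS iteration for equation 2.7; the small constant beta here plays the role of the regularization term and keeps the inner solve well posed.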
The modified L-curve method was shown to have good performance, but it requires a one-dimensional numerical optimization step for each $x_k$ at each iteration, which can be computationally expensive for large vectors.

3 Dictionary Learning

3.1 Unknown, Nonrandom Dictionaries. The MLE framework treats parameters to be estimated as unknown but deterministic (nonrandom). In this spirit, we take the dictionary, $A$, to be the set of unknown but deterministic parameters to be estimated from the observation set $Y = Y_N$. In particular, given $Y_N$, the maximum likelihood estimate $\hat{A}_{ML}$ is found by maximizing the likelihood function $L(A \mid Y_N) = P(Y_N; A)$. Under the assumption that
K. Kreutz-Delgado, J. Murray, B. Rao, K. Engan, T. Lee, and T. Sejnowski
the observations are i.i.d., this corresponds to the optimization
$$\hat{A}_{ML} = \arg\max_{A} \prod_{k=1}^{N} P(y_k; A), \qquad (3.1)$$
with
$$P(y_k; A) = \int P(y_k, x; A)\, dx = \int P(y_k \mid x; A)\, P_p(x)\, dx = \int P_q(y_k - Ax)\, P_p(x)\, dx. \qquad (3.2)$$
Defining the sample average of a function $f(y)$ over the sample set $Y_N = (y_1, \ldots, y_N)$ by
$$\langle f(y) \rangle_N = \frac{1}{N} \sum_{k=1}^{N} f(y_k),$$
the optimization 3.1 can be equivalently written as
$$\hat{A}_{ML} = \arg\min_{A}\, -\langle \log P(y; A) \rangle_N. \qquad (3.3)$$
Note that $P(y_k; A)$ is equal to the normalization factor $\beta$ already encountered, but now with the dependence of $\beta$ on $A$ and the particular sample, $y_k$, made explicit. The integration in equation 3.2 is in general intractable, and various approximations have been proposed to obtain an approximate maximum likelihood estimate, $\hat{A}_{ML}$ (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000). In particular, the following approximation has been proposed (Olshausen & Field, 1996),
$$P_p(x) \approx \delta\left( x - \hat{x}_k(\hat{A}) \right), \qquad (3.4)$$
where
$$\hat{x}_k(\hat{A}) = \arg\max_{x} P(y_k, x; \hat{A}), \qquad (3.5)$$
for $k = 1, \ldots, N$, assuming a current estimate, $\hat{A}$, for $A$. This approximation corresponds to assuming that the source vector $x_k$ for which $y_k = A x_k$ is known and equal to $\hat{x}_k(\hat{A})$. With this approximation, the optimization 3.3 becomes
$$\hat{A}_{ML} = \arg\min_{A} \langle d_q(y - A\hat{x}) \rangle_N, \qquad (3.6)$$
which is an optimization over the sample average $\langle \cdot \rangle_N$ of the functional 2.4 encountered earlier. Updating our estimate for the dictionary,
$$\hat{A} \leftarrow \hat{A}_{ML}, \qquad (3.7)$$
we can iterate the procedure (3.5)-(3.6) until $\hat{A}_{ML}$ has converged, hopefully (at least in the limit of large $N$) to the maximum likelihood estimate $\hat{A}_{ML} = \hat{A}_{ML}(Y_N)$. The estimate $\hat{A}_{ML}(Y_N)$ has well-known desirable asymptotic properties in the limit $N \to \infty$. Performing the optimization in equation 3.6 for the $q = 2$ i.i.d. gaussian measurement noise case ($\nu$ gaussian with known covariance $\sigma^2 \cdot I$) corresponds to taking
$$d_q(y - \hat{A}\hat{x}) = \frac{1}{2\sigma^2} \left\| y - \hat{A}\hat{x} \right\|^2 \qquad (3.8)$$
in equation 3.6. In appendix A, it is shown that we can readily obtain the unique batch solution,
$$\hat{A}_{ML} = \Sigma_{y\hat{x}}\, \Sigma_{\hat{x}\hat{x}}^{-1}, \qquad (3.9)$$
where
$$\Sigma_{y\hat{x}} = \frac{1}{N} \sum_{k=1}^{N} y_k\, \hat{x}_k^T, \qquad \Sigma_{\hat{x}\hat{x}} = \frac{1}{N} \sum_{k=1}^{N} \hat{x}_k\, \hat{x}_k^T. \qquad (3.10)$$
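As a sanity check on the batch solution 3.9-3.10, the following sketch (a toy under the simplifying assumption that the sources are known exactly and the data are noise-free) recovers a random dictionary by least squares, and also runs the gradient iteration of equation 3.11 below, which avoids the $n \times n$ inversion:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, N = 8, 5, 500
A_true = rng.standard_normal((m, n))
X = rng.standard_normal((n, N))
Y = A_true @ X                        # noise-free data with known sources

# Batch least-squares solution (equations 3.9-3.10)
S_yx = (Y @ X.T) / N
S_xx = (X @ X.T) / N
A_batch = S_yx @ np.linalg.inv(S_xx)

# Gradient-descent alternative (equation 3.11) with step-size mu
A_hat = rng.standard_normal((m, n))
mu = 0.5
for _ in range(200):
    E = A_hat @ X - Y                 # errors e_k stacked as columns
    A_hat -= mu * (E @ X.T) / N
```

Both estimates converge to `A_true` here because the sources are known; in the actual algorithm the $\hat{x}_k$ are themselves estimates.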
Appendix A derives the maximum likelihood estimate of $A$ for the ideal case of known source vectors $X = (x_1, \ldots, x_N)$:
$$\text{Known source vector case:} \qquad A_{ML} = \Sigma_{yx}\, \Sigma_{xx}^{-1},$$
which is, of course, not actually computable, since the actual source vectors are assumed to be hidden. As an alternative to using the explicit solution 3.9, which requires an often prohibitive $n \times n$ inversion, we can obtain $\hat{A}_{ML}$ iteratively by gradient descent on equations 3.6 and 3.8,
$$\hat{A}_{ML} \leftarrow \hat{A}_{ML} - \mu\, \frac{1}{N} \sum_{k=1}^{N} e_k\, \hat{x}_k^T, \qquad e_k = \hat{A}_{ML}\, \hat{x}_k - y_k, \qquad k = 1, \ldots, N, \qquad (3.11)$$
for an appropriate choice of the (possibly adaptive) positive step-size parameter $\mu$. The iteration 3.11 can be initialized as $\hat{A}_{ML} = \hat{A}$.

A general iterative dictionary learning procedure is obtained by nesting the iteration 3.11 entirely within the iteration defined by repeatedly solving equation 3.5 every time a new estimate, $\hat{A}_{ML}$, of the dictionary becomes
available. However, performing the optimization required in equation 3.5 is generally nontrivial (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000). Recently, we have shown how the use of the FOCUSS algorithm results in an effective algorithm for performing the optimization required in equation 3.5 for the case when $\nu$ is gaussian (Rao & Kreutz-Delgado, 1998a; Engan et al., 1999). This approach solves equation 3.5 using the affine-scaling transformation (AST)-like algorithms recently proposed for the low-noise case (Rao & Kreutz-Delgado, 1997, 1999; Kreutz-Delgado & Rao, 1997) and extended by regularization to the nontrivial noise case (Rao & Kreutz-Delgado, 1998a; Engan et al., 1999). As discussed in section 2.5, for the current dictionary estimate, $\hat{A}$, a solution to the optimization problem, 3.5, is provided by the repeated iteration,
$$\hat{x}_k \leftarrow \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \left( \beta(\hat{x}_k)\, I + \hat{A}\, \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \right)^{-1} y_k, \qquad (3.12)$$
$k = 1, \ldots, N$, with $\Pi(x)$ defined as in equation 3.18, given below. This is the regularized FOCUSS algorithm (Rao, 1998; Engan et al., 1999), which has an interpretation as an AST-like concave function minimization algorithm. The proposed dictionary learning algorithm alternates between iteration 3.12 and iteration 3.11 (or the direct batch solution given by equation 3.9 if the inversion is tractable). Extensive simulations show the ability of the AST-based algorithm to completely recover an unknown $20 \times 30$ dictionary matrix $A$ (Engan et al., 1999).

3.2 Unknown, Random Dictionaries. We now generalize to the case where the dictionary $A$ and the source vector set $X = X_N = (x_1, \ldots, x_N)$ are jointly random and unknown. We add the requirement that the dictionary is known to obey the constraint
$$A \in \mathcal{A} = \text{compact submanifold of } \mathbb{R}^{m \times n}.$$
A compact submanifold of $\mathbb{R}^{m \times n}$ is necessarily closed and bounded. On the constraint submanifold, the dictionary $A$ has the prior probability density function $P(A)$, which in the sequel we assume has the simple (uniform on $\mathcal{A}$) form,
$$P(A) = c\, \chi(A \in \mathcal{A}), \qquad (3.13)$$
where $\chi(\cdot)$ is the indicator function and $c$ is a positive constant chosen to ensure that
$$\int_{\mathcal{A}} P(A)\, dA = 1.$$
The dictionary $A$ and the elements of the set $X$ are also all assumed to be mutually independent, $P(A, X) = P(A)\, P(X) = P(A)\, P_p(x_1) \cdots P_p(x_N)$.
With the set of i.i.d. noise vectors $(\nu_1, \ldots, \nu_N)$ also taken to be jointly random with, and independent of, $A$ and $X$, the observation set $Y = Y_N = (y_1, \ldots, y_N)$ is assumed to be generated by the model 1.3. With these assumptions, we have
$$P(A, X \mid Y) = \frac{P(Y \mid A, X)\, P(A, X)}{P(Y)} = \frac{c\, \chi(A \in \mathcal{A})\, P(Y \mid A, X)\, P(X)}{P(Y)}$$
$$= \frac{c\, \chi(A \in \mathcal{A})}{P(Y)} \prod_{k=1}^{N} P(y_k \mid A, x_k)\, P_p(x_k) = \frac{c\, \chi(A \in \mathcal{A})}{P(Y)} \prod_{k=1}^{N} P_q(y_k - A x_k)\, P_p(x_k), \qquad (3.14)$$
using the facts that the observations are conditionally independent and that $P(y_k \mid A, X) = P(y_k \mid A, x_k)$. The joint MAP estimates $(\hat{A}_{MAP}, \hat{X}_{MAP}) = (\hat{A}_{MAP}, \hat{x}_{1,MAP}, \ldots, \hat{x}_{N,MAP})$ are found by maximizing the a posteriori probability density $P(A, X \mid Y)$ simultaneously with respect to $A \in \mathcal{A}$ and $X$. This is equivalent to minimizing the negative logarithm of $P(A, X \mid Y)$, yielding the optimization problem,
$$(\hat{A}_{MAP}, \hat{X}_{MAP}) = \arg\min_{A \in \mathcal{A},\, X} \left\langle d_q(y - Ax) + \lambda\, d_p(x) \right\rangle_N. \qquad (3.15)$$
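A direct way to read equation 3.15 is as a scalar score that can be evaluated for any candidate pair $(A, X)$. The sketch below assumes the gaussian $q = 2$ case of equation 3.8 (with $\sigma = 1$) and the $\ell_p$ diversity measure $d_p(x) = \sum_i |x[i]|^p$; the function name `map_cost` is ours, not the paper's:

```python
import numpy as np

def map_cost(A, X, Y, lam=1e-2, p=1.0, sigma=1.0):
    """Sample-average MAP objective of equation 3.15, assuming the q = 2
    gaussian case (equation 3.8) and d_p(x) = sum_i |x[i]|**p."""
    N = Y.shape[1]
    dq = 0.5 / sigma**2 * np.sum((Y - A @ X) ** 2)  # residual term
    dp = np.sum(np.abs(X) ** p)                     # diversity (sparsity) term
    return (dq + lam * dp) / N

rng = np.random.default_rng(11)
A = rng.standard_normal((8, 16))
X_sparse = np.zeros((16, 20))
X_sparse[0, :] = 1.0                 # diversity 1 per column, exact fit below
X_dense = rng.standard_normal((16, 20)) * 0.25
Y = A @ X_sparse

cost_sparse = map_cost(A, X_sparse, Y)
cost_dense = map_cost(A, X_dense, Y)
```

An exactly fitting sparse $X$ pays only the small diversity penalty, while a dense non-fitting $X$ pays heavily through the residual term, which is exactly the trade-off the joint minimization balances.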
Note that this is a joint minimization of the sample average of the functional 2.4 and, as such, is a natural generalization of the single optimization (with respect to the set of source vectors) previously encountered in equation 3.6. By finding joint MAP estimates of $A$ and $X$, we obtain a problem that is much more tractable than that of finding the single MAP estimate of $A$ (which involves maximizing the marginal posterior density $P(A \mid Y)$). The requirement that $A \in \mathcal{A}$, where $\mathcal{A}$ is a compact and hence bounded subset of $\mathbb{R}^{m \times n}$, is sufficient for the optimization problem 3.15 to avoid the degenerate solution,⁶
$$y_k = A x_k, \quad k = 1, \ldots, N, \quad \text{with } \|A\| \to \infty \text{ and } \|x_k\| \to 0. \qquad (3.16)$$
This solution is possible for unbounded $A$ because $y = Ax$ is almost always solvable for $x$ (learned overcomplete $A$'s are generically onto), and for any solution pair $(A, x)$, the pair $(\alpha A, \alpha^{-1} x)$ is also a solution for any $\alpha > 0$. This fact shows that the inverse problem of finding a solution pair $(A, x)$ is generally ill posed

⁶ Here $\|A\|$ is any suitable matrix norm on $A$.
unless $A$ is constrained to be bounded (as we have explicitly done here) or the cost functional is chosen to ensure that bounded $A$'s are learned (e.g., by adding a term monotonic in the matrix norm $\|A\|$ to the cost function in equation 3.15). A variety of choices for the compact set $\mathcal{A}$ are available. Obviously, since different choices of $\mathcal{A}$ correspond to different a priori assumptions on the set of admissible matrices, $A$, the choice of this set can be expected to affect the performance of the resulting dictionary learning algorithm. We will consider two relatively simple forms for $\mathcal{A}$.

3.3 Unit Frobenius-Norm Dictionary Prior. For the i.i.d. $q = 2$ gaussian measurement noise case of equation 3.8, algorithms that provably converge (in the low step-size limit) to a local minimum of equation 3.15 can be readily developed for the very simple choice,
$$\mathcal{A}_F = \{ A \mid \|A\|_F = 1 \} \subset \mathbb{R}^{m \times n}, \qquad (3.17)$$
where $\|A\|_F$ denotes the Frobenius norm of the matrix $A$, $\|A\|_F^2 = \operatorname{tr}(A^T A)$, and it is assumed that the prior $P(A)$ is uniformly distributed on $\mathcal{A}_F$ as per condition 3.13. As discussed in appendix A, $\mathcal{A}_F$ is simply connected, and there exists a path in $\mathcal{A}_F$ between any two matrices in $\mathcal{A}_F$. Following the gradient factorization procedure (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999), we factor the gradient of $d(x)$ as
$$\nabla d(x) = \alpha(x)\, \Pi(x)\, x, \qquad \alpha(x) > 0, \qquad (3.18)$$
where it is assumed that $\Pi(x)$ is diagonal and positive definite for all nonzero $x$. For example, in the case where $d(x) = \|x\|_p^p$,
$$\Pi^{-1}(x) = \operatorname{diag}\left( |x[i]|^{2-p} \right). \qquad (3.19)$$
Factorizations for other diversity measures $d(x)$ are given in Kreutz-Delgado and Rao (1997). We also define $\beta(x) = \lambda \alpha(x)$. As derived and proved in appendix A, a learning law that provably converges to a minimum of equation 3.15 on the manifold 3.17 is then given by
$$\frac{d}{dt}\hat{x}_k = -V_k \left\{ \left( \hat{A}^T \hat{A} + \beta(\hat{x}_k)\, \Pi(\hat{x}_k) \right) \hat{x}_k - \hat{A}^T y_k \right\},$$
$$\frac{d}{dt}\hat{A} = -\mu \left( \delta\hat{A} - \operatorname{tr}\!\left( \hat{A}^T \delta\hat{A} \right) \hat{A} \right), \qquad \mu > 0, \qquad (3.20)$$
for $k = 1, \ldots, N$, where $\hat{A}$ is initialized so that $\|\hat{A}\|_F = 1$, the $V_k$ are $n \times n$ positive definite matrices, and the "error" $\delta\hat{A}$ is
$$\delta\hat{A} = \left\langle e(\hat{x})\, \hat{x}^T \right\rangle_N = \frac{1}{N} \sum_{k=1}^{N} e(\hat{x}_k)\, \hat{x}_k^T, \qquad e(\hat{x}_k) = \hat{A}\hat{x}_k - y_k, \qquad (3.21)$$
which can be rewritten in the perhaps more illuminating form (cf. equations 3.9 and 3.10),
$$\delta\hat{A} = \hat{A}\, \Sigma_{\hat{x}\hat{x}} - \Sigma_{y\hat{x}}. \qquad (3.22)$$
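The role of the trace term in equation 3.20 can be checked numerically: subtracting $\operatorname{tr}(\hat{A}^T \delta\hat{A})\hat{A}$ makes the update direction orthogonal to $\hat{A}$ in the trace inner product, so a small step leaves the Frobenius norm unchanged to first order. A sketch with arbitrary random data, assuming the $\delta\hat{A}$ of equations 3.21-3.22:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, N = 6, 10, 50
A_hat = rng.standard_normal((m, n))
A_hat /= np.linalg.norm(A_hat)       # unit Frobenius norm initialization

X_hat = rng.standard_normal((n, N))
Y = rng.standard_normal((m, N))

# Error term (equations 3.21-3.22): dA = A Sxx - Syx
S_yx = (Y @ X_hat.T) / N
S_xx = (X_hat @ X_hat.T) / N
dA = A_hat @ S_xx - S_yx

# Projection onto the tangent space of the unit Frobenius sphere (eq. 3.20)
dA_proj = dA - np.trace(A_hat.T @ dA) * A_hat

# A small projected step preserves the unit norm to first order.
step = 1e-4
A_new = A_hat - step * dA_proj
```

The orthogonality is exact here because $\|\hat{A}\|_F = 1$; the residual norm drift is only of order `step**2`, which is what the periodic renormalization of footnote 7 mops up.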
A formal convergence proof of equation 3.20 is given in appendix A, where it is also shown that the right-hand side of the second equation in 3.20 corresponds to projecting the error term $\delta\hat{A}$ onto the tangent space of $\mathcal{A}_F$, thereby ensuring that the derivative of $\hat{A}$ lies in the tangent space. Convergence of the algorithm to a local optimum of equation 3.15 is formally proved by interpreting the loss functional as a Lyapunov function whose time derivative along the trajectories of the adapted parameters $(\hat{A}, \hat{X})$ is guaranteed to be negative definite by the choice of parameter time derivatives shown in equation 3.20. As a consequence of the LaSalle invariance principle, the loss functional will decrease in value, and the parameters will converge to the largest invariant set for which the time derivative of the loss functional is identically zero (Khalil, 1996).

Equation 3.20 is a set of coupled (between $\hat{A}$ and the vectors $\hat{x}_k$) nonlinear differential equations that correspond to simultaneous, parallel updating of the estimates $\hat{A}$ and $\hat{x}_k$. This should be compared to the alternated, separate (nonparallel) update rules 3.11 and 3.12 used in the AML algorithm described in section 3.1. Note also that (except for the trace term) the right-hand side of the dictionary learning update in equation 3.20 is of the same form as for the AML update law given in equation 3.11 (see also the discretized version of equation 3.20 given in equation 3.28 below). The key difference is the additional trace term in equation 3.20. This difference corresponds to a projection of the update onto the tangent space of the manifold 3.17, thereby ensuring a unit Frobenius norm (and hence boundedness) of the dictionary estimate at all times and avoiding the ill-posedness problem indicated in equation 3.16. It is also of interest to note that choosing $V_k$ to be the positive definite matrix,
$$V_k = \gamma_k \left( \hat{A}^T \hat{A} + \beta(\hat{x}_k)\, \Pi(\hat{x}_k) \right)^{-1}, \qquad \gamma_k > 0, \qquad (3.23)$$
in equation 3.20, followed by some matrix manipulations (see equation A.18 in appendix A), yields the alternative algorithm,
$$\frac{d}{dt}\hat{x}_k = -\gamma_k \left\{ \hat{x}_k - \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \left( \beta(\hat{x}_k)\, I + \hat{A}\, \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \right)^{-1} y_k \right\}, \qquad (3.24)$$
with $\gamma_k > 0$. In any event (regardless of the specific choice of the positive definite matrices $V_k$, as shown in appendix A), the proposed algorithm outlined here converges to a solution $(\hat{x}_k, \hat{A})$, which satisfies the implicit and nonlinear relationships,
$$\hat{x}_k = \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \left( \beta(\hat{x}_k)\, I + \hat{A}\, \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \right)^{-1} y_k, \qquad \hat{A} = \Sigma_{y\hat{x}} \left( \Sigma_{\hat{x}\hat{x}} - cI \right)^{-1} \in \mathcal{A}_F, \qquad (3.25)$$
for the scalar $c = \operatorname{tr}(\hat{A}^T \delta\hat{A})$. To implement the algorithm 3.20 (or the variant using equation 3.24) in discrete time, a first-order forward difference approximation at time $t = t_l$ can be used,
$$\frac{d}{dt}\hat{x}_k(t_l) \approx \frac{\hat{x}_k(t_{l+1}) - \hat{x}_k(t_l)}{t_{l+1} - t_l} = \frac{\hat{x}_k[l+1] - \hat{x}_k[l]}{\Delta_l}. \qquad (3.26)$$
Applied to equation 3.24, this yields
$$\hat{x}_k[l+1] = (1 - \mu_l)\, \hat{x}_k[l] + \mu_l\, \hat{x}_k^{FOCUSS}[l],$$
$$\hat{x}_k^{FOCUSS}[l] = \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \left( \beta(\hat{x}_k)\, I + \hat{A}\, \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \right)^{-1} y_k, \qquad \mu_l = \gamma_k \Delta_l \geq 0. \qquad (3.27)$$
Similarly, discretizing the $\hat{A}$-update equation yields the $\hat{A}$-learning rule, equation 3.28, given below. More generally, taking $\mu_l$ to have a value between zero and one, $0 \leq \mu_l \leq 1$, yields an updated value $\hat{x}_k[l+1]$ that is a linear interpolation between the previous value $\hat{x}_k[l]$ and $\hat{x}_k^{FOCUSS}[l]$. When implemented in discrete time with $\mu_l = 1$, the resulting Bayesian learning algorithm has the form of a combined iteration in which we loop over the operations,
$$\hat{x}_k \leftarrow \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \left( \beta(\hat{x}_k)\, I + \hat{A}\, \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \right)^{-1} y_k, \qquad k = 1, \ldots, N,$$
and
$$\hat{A} \leftarrow \hat{A} - c \left( \delta\hat{A} - \operatorname{tr}\!\left( \hat{A}^T \delta\hat{A} \right) \hat{A} \right), \qquad c > 0. \qquad (3.28)$$
We call this FOCUSS-based, Frobenius-normalized dictionary-learning algorithm the FOCUSS-FDL algorithm. Again, this merged procedure should be compared to the separate iterations involved in the maximum likelihood approach given in equations 3.11 and 3.12. Equation 3.28, with $\delta\hat{A}$ given by equation 3.21, corresponds to performing a finite step-size gradient descent
on the manifold $\mathcal{A}_F$. This projection in equation 3.28 of the dictionary update onto the tangent plane of $\mathcal{A}_F$ (see the discussion in appendix A) ensures the well-behavedness of the MAP algorithm.⁷ The specific step-size choice $\mu_l = 1$, which results in the first equation in equation 3.28, is discussed at length for the low-noise case in Rao and Kreutz-Delgado (1999).

3.4 Column-Normalized Dictionary Prior. Although mathematically very tractable, the unit Frobenius-norm prior, equation 3.17, appears to be somewhat too loose, judging from the simulation results given below. In simulations with the Frobenius norm constraint $\mathcal{A}_F$, some columns of $A$ can tend toward zero, a phenomenon that occurs more often for highly overcomplete $A$. This problem can be understood by remembering that we are using the diversity measure $d_p(x)$, $p \neq 0$, which penalizes terms in $x$ with large magnitudes. If a column $a_i$ has a small relative magnitude, the weight of its coefficient $x[i]$ can be large, and it will be penalized more than a column with a larger norm. This leads to certain columns being underused, which is especially problematic in the overcomplete case. An alternative, and more restrictive, form of the constraint set $\mathcal{A}$ is obtained by enforcing the requirement that the columns $a_i$ of $A$ each be normalized (with respect to the Euclidean 2-norm) to the same constant value (Murray & Kreutz-Delgado, 2001). This constraint can be justified by noting that $Ax$ can be written as the nonunique weighted sum of the columns $a_i$,
$$Ax = \sum_{i=1}^{n} a_i\, x[i] = \sum_{i=1}^{n} \left( \frac{a_i}{\alpha_i} \right) (\alpha_i\, x[i]) = A' x',$$
for any $\alpha_i > 0$, $i = 1, \ldots, n$, showing that a column-wise ambiguity remains even after the overall unit Frobenius-norm normalization has occurred, since one can now Frobenius-normalize the new matrix $A'$. Therefore, consider the set of matrices on which has been imposed the column-wise constraint
$$\mathcal{A}_C = \left\{ A \;\middle|\; \|a_i\|^2 = a_i^T a_i = \frac{1}{n},\; i = 1, \ldots, n \right\}. \qquad (3.29)$$
The set $\mathcal{A}_C$ is an $mn - n = n(m-1)$-dimensional submanifold of $\mathbb{R}^{m \times n}$. Note that every column of a matrix in $\mathcal{A}_C$ has been normalized to the value $\frac{1}{\sqrt{n}}$.

⁷ Because of the discrete-time approximation in equation 3.28, and even more generally because of numerical round-off effects in equation 3.20, a renormalization, $\hat{A} \leftarrow \hat{A} / \|\hat{A}\|_F$, is usually performed at regular intervals.
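The inclusion of the column-normalized matrices in the unit Frobenius manifold, stated around equation 3.29, is easy to verify numerically: if every column has norm $1/\sqrt{n}$, the squared column norms sum to 1, so the Frobenius norm is 1. A small check:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 6, 12
A = rng.standard_normal((m, n))

# Normalize every column to norm 1/sqrt(n) (equation 3.29)
A_c = A / (np.sqrt(n) * np.linalg.norm(A, axis=0))

col_norms = np.linalg.norm(A_c, axis=0)
fro_norm = np.linalg.norm(A_c)   # equals 1, so A_c lies in A_F as well
```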
In fact, any constant value for the column normalization can be used (including unit normalization), but, as shown in appendix B, the particular normalization of $\frac{1}{\sqrt{n}}$ results in $\mathcal{A}_C$ being a proper submanifold of the $(mn - 1)$-dimensional unit Frobenius manifold $\mathcal{A}_F$,
$$\mathcal{A}_C \subset \mathcal{A}_F,$$
indicating that a tighter constraint on the matrix $A$ is being imposed. Again, it is assumed that the prior $P(A)$ is uniformly distributed on $\mathcal{A}_C$ in the manner of equation 3.13. As shown in appendix B, $\mathcal{A}_C$ is simply connected. A learning algorithm for the constraint $\mathcal{A}_C$ is derived in appendix B, following much the same approach as in appendix A. Because the derivation of the $\hat{x}_k$ update to find sparse solutions does not depend on the form of the constraint $\mathcal{A}$, only the $\hat{A}$ update in algorithm 3.28 needs to be modified. Each column $a_i$ is now updated independently (see equation B.17),
$$a_i \leftarrow a_i - c\, (I - \hat{a}_i\, \hat{a}_i^T)\, \delta a_i, \qquad i = 1, \ldots, n, \qquad (3.30)$$
where $\delta a_i$ is the $i$th column of $\delta\hat{A}$ in equation 3.21. We call the resulting column-normalized dictionary-learning algorithm the FOCUSS-CNDL algorithm. The implementation details of the FOCUSS-CNDL algorithm are presented in section 4.2.

4 Algorithm Implementation
The dictionary learning algorithms derived above are an extension of the FOCUSS algorithm, used for obtaining sparse solutions to the linear inverse problem $y = Ax$, to the case where dictionary learning is also required. We refer to these algorithms generally as FOCUSS-DL algorithms, with the unit Frobenius-norm prior-based algorithm denoted by FOCUSS-FDL and the column-normalized prior-based algorithm by FOCUSS-CNDL. In this section, the algorithms are stated in the forms implemented in the experimental tests, where it is shown that the column-normalization-based algorithm achieves higher performance in the overcomplete dictionary case.

4.1 Unit Frobenius-Norm Dictionary Learning Algorithm. We now summarize the FOCUSS-FDL algorithm derived in section 3.3. For each of the data vectors $y_k$, we update the sparse source vectors $\hat{x}_k$ using the FOCUSS algorithm:
$$\Pi^{-1}(\hat{x}_k) = \operatorname{diag}\left( |\hat{x}_k[i]|^{2-p} \right),$$
$$\hat{x}_k \leftarrow \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \left( \lambda_k I + \hat{A}\, \Pi^{-1}(\hat{x}_k)\, \hat{A}^T \right)^{-1} y_k \qquad \text{(FOCUSS)}, \qquad (4.1)$$
where $\lambda_k = \beta(\hat{x}_k)$ is the regularization parameter. After updating the $N$ source vectors $\hat{x}_k$, $k = 1, \ldots, N$, the dictionary $\hat{A}$ is reestimated,
$$\Sigma_{y\hat{x}} = \frac{1}{N} \sum_{k=1}^{N} y_k\, \hat{x}_k^T, \qquad \Sigma_{\hat{x}\hat{x}} = \frac{1}{N} \sum_{k=1}^{N} \hat{x}_k\, \hat{x}_k^T,$$
$$\delta\hat{A} = \hat{A}\, \Sigma_{\hat{x}\hat{x}} - \Sigma_{y\hat{x}}, \qquad \hat{A} \leftarrow \hat{A} - c \left( \delta\hat{A} - \operatorname{tr}\!\left( \hat{A}^T \delta\hat{A} \right) \hat{A} \right), \qquad c > 0, \qquad (4.2)$$
where $c$ controls the learning rate. For the experiments in section 5, the data block size is $N = 100$. During each iteration, all training vectors are updated using equation 4.1, with a corresponding number of dictionary updates using equation 4.2. After each update of the dictionary $\hat{A}$, it is renormalized to have unit Frobenius norm, $\|\hat{A}\|_F = 1$. The learning algorithm is a combined iteration, meaning that the FOCUSS algorithm is allowed to run for only one iteration (not until full convergence) before the $\hat{A}$ update step. This means that during early iterations, the $\hat{x}_k$ are in general not sparse. To facilitate learning $A$, the covariances $\Sigma_{y\hat{x}}$ and $\Sigma_{\hat{x}\hat{x}}$ are calculated with sparsified $\hat{x}_k$ that have all but the $\tilde{r}$ largest elements set to zero. The value of $\tilde{r}$ is usually set to the largest desired number of nonzero elements, but this choice does not appear to be critical. The regularization parameter $\lambda_k$ is taken to be a monotonically increasing function of the iteration number,
$$\lambda_k = \lambda_{max} \left( \tanh\left( 10^{-3} \cdot (\text{iter} - 1500) \right) + 1 \right). \qquad (4.3)$$
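The combined iteration of equations 4.1-4.2 can be collected into a short numpy sketch. This is a simplified reading of the algorithm, not the authors' code: it uses a constant regularization $\lambda = \lambda_{max}$ rather than the schedule 4.3, a pseudoinverse initialization, no covariance sparsification, and a few final source refits so the returned pair is self-consistent:

```python
import numpy as np

def focuss_fdl(Y, n, iters=200, p=1.0, c=1.0, lam=2e-3, seed=0):
    """Minimal sketch of combined FOCUSS-FDL learning (equations 4.1-4.2)."""
    rng = np.random.default_rng(seed)
    m, N = Y.shape
    A = rng.standard_normal((m, n))
    A /= np.linalg.norm(A)                      # ||A||_F = 1
    X = np.linalg.pinv(A) @ Y                   # pseudoinverse initialization
    eps = 1e-12

    def focuss_pass(A, X):
        # One FOCUSS relaxation step per data vector (equation 4.1)
        for k in range(N):
            Pi_inv = np.abs(X[:, k]) ** (2.0 - p) + eps
            G = lam * np.eye(m) + (A * Pi_inv) @ A.T
            X[:, k] = Pi_inv * (A.T @ np.linalg.solve(G, Y[:, k]))

    for _ in range(iters):
        focuss_pass(A, X)
        # Dictionary step on the unit Frobenius sphere (equation 4.2)
        S_yx = (Y @ X.T) / N
        S_xx = (X @ X.T) / N
        dA = A @ S_xx - S_yx
        A = A - c * (dA - np.trace(A.T @ dA) * A)
        A /= np.linalg.norm(A)                  # renormalize (footnote 7)
    for _ in range(3):
        focuss_pass(A, X)                       # refit sources to the final A
    return A, X
```

The column-by-column FOCUSS pass mirrors the block structure of the text: one relaxation step per training vector, then one dictionary update.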
While this choice of $\lambda_k$ does not have the optimality properties of the modified L-curve method (see section 2.5), it does not require a one-dimensional optimization for each $\hat{x}_k$ and so is much less computationally expensive. This is discussed further below.

4.2 Column-Normalized Dictionary Learning Algorithm. The improved version of the algorithm, called FOCUSS-CNDL, which provides increased accuracy especially in the overcomplete case, was proposed in Murray and Kreutz-Delgado (2001). The three key improvements are column normalization, which restricts the learned $\hat{A}$; an efficient way of adjusting the regularization parameter $\lambda_k$; and reinitialization to escape from local optima. The column-normalized learning algorithm discussed in section 3.4 and derived in appendix B is used. Because the $\hat{x}_k$ update does not depend on the constraint set $\mathcal{A}$, the FOCUSS algorithm in equation 4.1 is used to
update $N$ vectors, as discussed in section 4.1. After every $N$ source vectors are updated, each column of the dictionary is updated as
$$a_i \leftarrow a_i - c\, (I - \hat{a}_i\, \hat{a}_i^T)\, \delta a_i, \qquad i = 1, \ldots, n, \qquad (4.4)$$
where $\delta a_i$ are the columns of $\delta\hat{A}$, which is found using equation 4.2. After updating each $a_i$, it is renormalized to $\|a_i\|^2 = 1/n$ by
$$a_i \leftarrow \frac{a_i}{\sqrt{n}\, \|a_i\|}, \qquad (4.5)$$
which also ensures that $\|\hat{A}\|_F = 1$, as shown in section B.1. The regularization parameter $\lambda_k$ may be set independently for each vector in the training set, and a number of methods have been suggested, including quality of fit (which requires a certain level of reconstruction accuracy), sparsity (which requires a certain number of nonzero elements), and the L-curve, which attempts to find an optimal trade-off (Engan, 2000). The L-curve method works well, but it requires solving a one-dimensional optimization for each $\lambda_k$, which becomes computationally expensive for large problems. Alternatively, we use a heuristic method that allows the trade-off between error and sparsity to be tuned for each application, while letting each training vector $y_k$ have its own regularization parameter $\lambda_k$ to improve the quality of the solution,
$$\lambda_k = \lambda_{max} \left( 1 - \frac{\| y_k - \hat{A}\hat{x}_k \|}{\| y_k \|} \right), \qquad \lambda_k, \lambda_{max} > 0. \qquad (4.6)$$
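The FOCUSS-CNDL dictionary step and the heuristic 4.6 can be sketched as follows. The data here are random, so only the constraint-preserving structure is meaningful; `a_unit` stands for the unit-normalized column $\hat{a}_i$ of equation 4.4, which is our reading of the hat notation:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, N = 8, 16, 100
A = rng.standard_normal((m, n))
A /= np.sqrt(n) * np.linalg.norm(A, axis=0)    # columns on the A_C manifold
X = rng.standard_normal((n, N))
Y = rng.standard_normal((m, N))

# Error term from equation 4.2
dA = A @ ((X @ X.T) / N) - (Y @ X.T) / N
c = 0.01

# Per-column update (equation 4.4) followed by renormalization (equation 4.5)
for i in range(n):
    a_unit = A[:, i] / np.linalg.norm(A[:, i])
    a_new = A[:, i] - c * (dA[:, i] - a_unit * (a_unit @ dA[:, i]))
    A[:, i] = a_new / (np.sqrt(n) * np.linalg.norm(a_new))

# Heuristic regularization (equation 4.6): an accurately represented vector
# receives a lambda near lam_max, pushing it toward a sparser solution.
lam_max = 2e-3
x_fit = np.linalg.lstsq(A, Y[:, 0], rcond=None)[0]   # accurate representation
lam_0 = lam_max * (1.0 - np.linalg.norm(Y[:, 0] - A @ x_fit)
                   / np.linalg.norm(Y[:, 0]))
```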
For data vectors that are represented accurately, $\lambda_k$ will be large, driving the algorithm to find sparser solutions. If the SNR can be estimated, we can set $\lambda_{max} = (\text{SNR})^{-1}$. The optimization problem, 3.15, is concave when $p \leq 1$, so there will be multiple local minima. The FOCUSS algorithm is guaranteed to converge only to one of these local minima, but in some cases, it is possible to detect when that has happened by noticing that the sparsity is too low. Periodically (after a large number of iterations), the sparsity of the solutions $\hat{x}_k$ is checked, and if it is found to be too low, $\hat{x}_k$ is reinitialized randomly. The algorithm is also sensitive to initial conditions, and prior information may be incorporated into the initialization to help convergence to the global solution.

5 Experimental Results
Experiments were performed using complete dictionaries ($n = m$) and overcomplete dictionaries ($n > m$) on both synthetically generated data and
natural images. Performance was measured in a number of ways. With synthetic data, performance measures include the SNR of the recovered sources $\hat{x}_k$ compared to the true generating sources and a comparison of the learned dictionary with the true dictionary. For images of natural scenes, the true underlying sources are not known, so the accuracy and efficiency of the image coding are measured instead.

5.1 Complete Dictionaries: Comparison with ICA. To test the FOCUSS-FDL and FOCUSS-CNDL algorithms, simulated data were created following the method of Engan et al. (1999) and Engan (2000). The dictionary $A$ of size $20 \times 20$ was created by drawing each element $a_{ij}$ from a normal distribution with $\mu = 0$, $\sigma^2 = 1$ (written as $N(0, 1)$), followed by a normalization to ensure that $\|A\|_F = 1$. Sparse source vectors $x_k$, $k = 1, \ldots, 1000$, were created with $r = 4$ nonzero elements, where the $r$ nonzero locations are selected at random (uniformly) from the 20 possible locations. The magnitudes of each nonzero element were also drawn from $N(0, 1)$ and limited so that $|x_{kl}| > 0.1$. The input data $y_k$ were generated using $y = Ax$ (no noise was added). For the first iteration of the algorithm, the columns of the initialization estimate, $\hat{A}_{init}$, were taken to be the first $n = 20$ training vectors $y_k$. The initial $x_k$ estimates were then set to the pseudoinverse solution $\hat{x}_k = \hat{A}_{init}^T (\hat{A}_{init} \hat{A}_{init}^T)^{-1} y_k$. The constant parameters of the algorithm were set as follows: $p = 1.0$, $c = 1.0$, and $\lambda_{max} = 2 \times 10^{-3}$ (low noise, assumed SNR $\approx 27$ dB). The algorithms were run for 200 iterations through the entire data set, and during each iteration, $\hat{A}$ was updated after every 100 data vectors $\hat{x}_k$. To measure performance, the SNR between the recovered sources $\hat{x}_k$ and the true sources $x_k$ was calculated. Each element $x_k[i]$ for fixed $i$ was considered as a time-series vector with 1000 elements, and the SNR$_i$ for each was found using
$$\text{SNR}_i = 10 \log_{10} \left( \frac{\| x_k[i] \|^2}{\| x_k[i] - \hat{x}_k[i] \|^2} \right). \qquad (5.1)$$
The final SNR is found by averaging SNR$_i$ over the $i = 1, \ldots, 20$ vectors and over 20 trials of the algorithms. Because the dictionary $A$ is learned only to within a scaling factor and column permutations, the learned sources must be matched with the corresponding true sources and scaled to unit norm before the SNR calculation is done. The FOCUSS-FDL and FOCUSS-CNDL algorithms were compared with extended ICA (Lee, Girolami, & Sejnowski, 1999) and FastICA⁸ (Hyvärinen

⁸ Matlab and C versions of extended ICA can be found online at http://www.cnl.salk.edu/~tewon/ICA/code.html. Matlab code for FastICA can be found at http://www.cis.hut.fi/projects/ica/fastica/.
Figure 1: Comparison between FOCUSS-FDL, FOCUSS-CNDL, extended ICA, and FastICA on synthetically generated data with a complete dictionary $A$, size $20 \times 20$. The SNR was computed between the recovered sources $\hat{x}_k$ and the true sources $x_k$. The mean SNR and standard deviation were computed over 20 trials.
et al., 1999). Figure 1 shows the SNR for the tested algorithms. The SNR for FOCUSS-FDL is 27.7 dB, which is a 4.7 dB improvement over extended ICA; for FOCUSS-CNDL, the SNR is 28.3 dB. The average run time was 4.3 minutes for FOCUSS-FDL/CNDL, 0.10 minute for FastICA, and 0.19 minute for extended ICA on a 1.0 GHz Pentium III Xeon computer.

5.2 Overcomplete Dictionaries. To test the ability to recover the true $A$ and $x_k$ solutions in the overcomplete case, dictionaries of size $20 \times 30$ and $64 \times 128$ were generated. The diversity $r$ was set either to fixed values (4 and 7) or uniformly at random (5–10 and 10–15). The elements of $A$ and the sources $x_k$ were created as in section 5.1. The parameters were set as follows: $p = 1.0$, $c = 1.0$, $\lambda_{max} = 2 \times 10^{-3}$. The algorithms were run for 500 iterations through the entire data set, and during each iteration, $\hat{A}$ was updated after every 100 data vectors $\hat{x}_k$. As a measure of performance, we count the number of columns of $A$ that were matched during learning. Because $A$ can be learned only to within column permutations and sign and scale changes, the columns are normalized so that $\|\hat{a}_i\| = \|a_j\| = 1$, and $\hat{A}$ is rearranged columnwise so that $\hat{a}_j$ is given the index of the closest match in $A$ (in the minimum two-norm sense). A match is counted if
$$1 - |a_i^T \hat{a}_i| < 0.01. \qquad (5.2)$$
Table 1: Synthetic Data Results.

Algorithm    | Size of A | Diversity, r | Learned A Columns: Avg / SD / % | Learned x: Avg / SD / %
FOCUSS-FDL   | 20 × 30   | 7     | 25.3 / 3.4 / 84.2   | 675.9 / 141.0 / 67.6
FOCUSS-CNDL  | 20 × 30   | 7     | 28.9 / 1.6 / 96.2   | 846.8 / 97.6 / 84.7
FOCUSS-CNDL  | 64 × 128  | 7     | 125.3 / 2.1 / 97.9  | 9414.0 / 406.5 / 94.1
FOCUSS-CNDL  | 64 × 128  | 5–10  | 126.3 / 1.3 / 98.6  | 9505.5 / 263.8 / 95.1
FOCUSS-FDL   | 64 × 128  | 10–15 | 102.8 / 4.5 / 80.3  | 4009.6 / 499.6 / 40.1
FOCUSS-CNDL  | 64 × 128  | 10–15 | 127.4 / 1.3 / 99.5  | 9463.4 / 330.3 / 94.6
Similarly, the number of matching $\hat{x}_k$ is counted (after rearranging their elements in accordance with the indices of the rearranged $\hat{A}$); a match is counted if
$$1 - |x_i^T \hat{x}_i| < 0.05. \qquad (5.3)$$
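The match criterion 5.2 can be implemented as below. This sketch greedily takes, for each true column, the best normalized match over all learned columns (insensitive to sign and scale); the paper instead rearranges $\hat{A}$ columnwise first, so counts could differ slightly in edge cases:

```python
import numpy as np

def count_matched_columns(A_true, A_learned, tol=0.01):
    """Count columns of A_true recovered in A_learned up to permutation,
    sign, and scale (the matching criterion of equation 5.2)."""
    At = A_true / np.linalg.norm(A_true, axis=0)
    Al = A_learned / np.linalg.norm(A_learned, axis=0)
    matched = 0
    for i in range(At.shape[1]):
        # best match over learned columns, insensitive to sign
        if np.max(np.abs(At[:, i] @ Al)) > 1.0 - tol:
            matched += 1
    return matched

rng = np.random.default_rng(7)
A = rng.standard_normal((20, 30))
perm = rng.permutation(30)
signs = rng.choice([-1.0, 1.0], size=30)
A_est = (A * signs)[:, perm] * 3.0    # permuted, sign- and scale-changed copy
```

Applied to `A_est`, which differs from `A` only by the allowed ambiguities, every column should count as matched, while an unrelated random matrix should match none.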
If the data are generated by an $A$ that is not column normalized, other measures of performance need to be used to compare $x_k$ and $\hat{x}_k$. The performance is summarized in Table 1, which compares FOCUSS-FDL with the column-normalized algorithm (FOCUSS-CNDL). For the $20 \times 30$ dictionary, 1000 training vectors were used, and for the $64 \times 128$ dictionary, 10,000 were used. Results are averaged over four or more trials. For the $64 \times 128$ matrix and $r = 10$–15, FOCUSS-CNDL is able to recover 99.5% (127.4/128) of the columns of $A$ and 94.6% (9463/10,000) of the solutions $x_k$ to within the tolerance given above. This is a clear improvement over FOCUSS-FDL, which learns only 80.3% of the $A$ columns and 40.1% of the solutions $x_k$. Learning curves for one of the trials of this experiment (see Figure 2) show that most of the columns of $A$ are learned quickly, within the first 100 iterations, and that the diversity of the solutions drops to the desired level. Figure 2b shows that it takes somewhat longer to learn the $x_k$ correctly and that reinitialization of the low-sparsity solutions (at iterations 175 and 350) helps to learn additional solutions. Figure 2c shows the diversity at each iteration, measured as the average number of elements of each $\hat{x}_k$ that are larger than $1 \times 10^{-4}$.

5.3 Image Data Experiments. Previous work has shown that learned basis functions can be used to code data more efficiently than traditional Fourier or wavelet bases (Lewicki & Olshausen, 1999). The algorithm for finding overcomplete bases in Lewicki and Olshausen (1999) is also designed to solve problem 1.1 but differs from our method in a number of ways, including using only the Laplacian prior ($p = 1$) and using conjugate gradient optimization for finding sparse solutions (whereas we use the FOCUSS algorithm). It is widely believed that overcomplete representations
Figure 2: Performance of the FOCUSS-CNDL algorithm with overcomplete dictionary $A$, size $64 \times 128$. (a) Number of correctly learned columns of $A$ at each iteration. (b) Number of sources $x_k$ learned. (c) Average diversity (n-sparsity) of the $\hat{x}_k$. The spikes in b and c indicate where some solutions $\hat{x}_k$ were reinitialized because they were not sparse enough.
are more efficient than complete bases, but in Lewicki and Olshausen (1999), the overcomplete code was less efficient for image data (measured in bits-per-pixel entropy), and it was suggested that different priors could be used to improve the efficiency. Here, we show that our algorithm is able to learn more efficient overcomplete codes for priors with $p < 1$. The training data consisted of 10,000 $8 \times 8$ image patches drawn at random from black-and-white images of natural scenes. The parameter $p$ was varied from 0.5 to 1.0, and the FOCUSS-CNDL algorithm was trained for 150 iterations. The complete dictionary ($64 \times 64$) was compared with the $2\times$ overcomplete dictionary ($64 \times 128$). Other parameters were set as follows: $c = 0.01$, $\lambda_{max} = 2 \times 10^{-3}$. The coding efficiency was measured using the entropy (bits per pixel) method described in Lewicki and Olshausen (1999). Figure 3 plots the entropy versus reconstruction error (root mean square error, RMSE) and shows that when $p < 0.9$, the entropy is lower for the overcomplete representation at the same RMSE. An example of coding an entire image is shown in Figure 4. The original test image (see Figure 4a) of size $256 \times 256$ was encoded using the learned dictionaries. Patches from the test image were not used during training. Table 2 gives results for low- and high-compression cases. In both cases, coding with the overcomplete dictionary ($64 \times 128$) gives higher compression (fewer bits per pixel) and lower error (RMSE). For the high-compression case (see
Figure 3: Comparing the coding efficiency of complete and 2× overcomplete representations on 8 × 8 pixel patches drawn from natural images. The points on the curve are the results from different values of p: at the bottom right, p = 1.0, and at the top left, p = 0.5. For smaller p, the overcomplete case is more efficient at the same level of reconstruction error (RMSE).
Figure 4: Image compression using complete and overcomplete dictionaries. Coding with an overcomplete dictionary is more efficient (fewer bits per pixel) and more accurate (lower RMSE). (a) Original image of size 256 × 256 pixels. (b) Compressed with 64 × 64 complete dictionary to 0.826 bits per pixel at RMSE = 0.329. (c) Compressed with 64 × 128 overcomplete dictionary to 0.777 bits per pixel at RMSE = 0.328.
Table 2: Image Compression Results.

    Dictionary Size    p     Compression (bits/pixel)    RMSE     Average Diversity
    64 × 64           0.5    2.450                       0.148    17.3
    64 × 128          0.6    2.410                       0.141    15.4
    64 × 64           0.5    0.826                       0.329     4.5
    64 × 128          0.6    0.777                       0.328     4.0
Figures 4b and 4c), the 64 × 128 overcomplete dictionary gives compression of 0.777 bits per pixel at error 0.328, compared with 0.826 bits per pixel at error 0.329 for the 64 × 64 complete dictionary. The amount of compression was selected by adjusting λ_max (the upper limit of the regularization parameter): λ_max = 0.02 for high compression and λ_max = 0.002 for low compression.

6 Discussion and Conclusions
We have applied a variety of tools and perspectives (including ideas drawn from Bayesian estimation theory, nonlinear regularized optimization, and the theory of majorization and convex analysis) to the problem of developing algorithms capable of simultaneously learning overcomplete dictionaries and solving sparse source-vector inverse problems. The test experiment described in section 5.2 is a difficult problem designed to determine whether the proposed learning algorithm can solve for the known true solutions for A and the sparse source vectors x_k of the underdetermined inverse problem y_k = A x_k. Such testing, which does not appear to be regularly done in the literature, shows how well an algorithm can extract stable and categorically meaningful solutions from synthetic data. The ability to perform well on such test inverse problems would appear to be at least a necessary condition for an algorithm to be trustworthy in domains where a physically or biologically meaningful sparse solution is sought, as occurs in biomedical imaging, geophysical seismic sounding, and multitarget tracking. The experimental results presented in section 5.2 show that the FOCUSS-DL algorithms can recover the dictionary and the sparse source vectors. This is particularly gratifying when one considers that little or no optimization of the algorithm parameters has been done. Furthermore, the convergence proofs given in the appendixes show convergence only to local optima, whereas one expects many local optima because of the concave prior and the generally multimodal form of the cost function. One should note that algorithm 3.28 was constructed precisely with the goal of solving inverse problems of the type considered here, and therefore
one must be careful when comparing the results given here with other algorithms reported in the literature. For instance, the mixture-of-gaussians prior used in Attias (1999) does not necessarily enforce sparsity. While other algorithms in the literature might perform well on this test experiment, to the best of our knowledge, possibly comparably performing algorithms such as Attias (1999), Girolami (2001), Hyvärinen et al. (1999), and Lewicki and Olshausen (1999) have not been tested on large, overcomplete matrices to determine their accuracy in recovering A, and so any comparison along these lines would be premature. In section 5.1, the FOCUSS-DL algorithms were compared to the well-known extended ICA and FastICA algorithms in a more conventional test with complete dictionaries. Performance was measured in terms of the accuracy (SNR) of the recovered sources x_k, and both FOCUSS-DL algorithms were found to have significantly better performance (albeit with longer run times). We have also shown that the FOCUSS-CNDL algorithm can learn an overcomplete representation that can encode natural images more efficiently than complete bases learned from data (which in turn are more efficient than standard nonadaptive bases, such as Fourier or wavelet bases; Lewicki & Olshausen, 1999). Studies of the human visual cortex have shown a higher degree of overrepresentation of the fovea compared with other mammals, which suggests an interesting connection between overcomplete representations and visual acuity and recognition abilities (Popovic & Sjöstrand, 2001). Because the coupled dictionary-learning and sparse-inverse-solving algorithms are merged and run in parallel, it should be possible to run the algorithms in real time to track dictionary evolution in quasistationary environments once the algorithm has essentially converged.
One way to do this would be to constantly present randomly encountered new signals, y_k, to the algorithm at each iteration instead of the original training set. One also has to ensure that the dictionary-learning algorithm remains sensitive to the new data so that dictionary tracking can occur. This would be done by an appropriate adaptive filtering of the current dictionary estimate driven by the new-data-derived corrections, similar to techniques used in the adaptive filtering literature (Kalouptsidis & Theodoridis, 1993).
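The merged dictionary-learning and sparse-solving loop can be sketched in NumPy. This is an illustrative sketch only: the function name, the step size `mu`, the regularization `beta`, and the simple one-pass schedule are assumptions for demonstration, not the tuned FOCUSS-CNDL settings used in the experiments. The source step uses the relaxation x = Π⁻¹Aᵀ(βI + AΠ⁻¹Aᵀ)⁻¹y with Π⁻¹ = diag(|x|^(2−p)), and the dictionary step is a gradient step followed by renormalization of every column to norm 1/√n.

```python
import numpy as np

def focuss_cndl_step(A, Y, X, p=0.8, beta=1e-3, mu=0.05):
    """One illustrative coupled iteration (a sketch, not the paper's exact schedule).

    A: (m, n) dictionary, Y: (m, N) data, X: (n, N) current source estimates.
    Source step: x = Pi^{-1} A^T (beta I + A Pi^{-1} A^T)^{-1} y, with
    Pi^{-1} = diag(|x|^{2-p}) from the affine-scaling factorization of the
    l_p diversity measure. Dictionary step: gradient step on the error term
    dA = <e(x) x^T>, then renormalization of each column to ||a_i|| = 1/sqrt(n).
    """
    m, n = A.shape
    X_new = np.empty_like(X)
    for k in range(Y.shape[1]):
        Pi_inv = np.diag(np.abs(X[:, k]) ** (2.0 - p) + 1e-12)  # keep invertible at zeros
        G = beta * np.eye(m) + A @ Pi_inv @ A.T
        X_new[:, k] = Pi_inv @ A.T @ np.linalg.solve(G, Y[:, k])
    dA = (A @ X_new - Y) @ X_new.T / Y.shape[1]
    A = A - mu * dA
    A = A / (np.sqrt(n) * np.linalg.norm(A, axis=0, keepdims=True))
    return A, X_new

# toy demo: column norms stay at 1/sqrt(n) after an update
rng = np.random.default_rng(0)
m, n, N = 8, 16, 50
A0 = rng.standard_normal((m, n))
A0 /= np.sqrt(n) * np.linalg.norm(A0, axis=0, keepdims=True)
A1, X1 = focuss_cndl_step(A0, rng.standard_normal((m, N)),
                          0.1 * rng.standard_normal((n, N)))
assert np.allclose(np.linalg.norm(A1, axis=0), 1 / np.sqrt(n))
```

In a tracking setting, each call would be fed freshly encountered vectors y_k rather than a fixed training set, as discussed above.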
Appendix A: The Frobenius-Normalized Prior Learning Algorithm
Here we provide a derivation of algorithm 3.20–3.21 and prove that it converges to a local minimum of equation 3.15 on the manifold A_F = {A : ‖A‖_F = 1} ⊂ ℝ^{m×n} defined in equation 3.17. Although we focus on the development of the learning algorithm on A_F, the derivations in sections A.2 and A.3, and the beginning of section A.4, are done for a general constraint manifold 𝒜.
A.1 Admissible Matrix Derivatives.
A.1.1 The Constraint Manifold A_F. In order to determine the structural form of admissible derivatives, Ȧ = (d/dt)A, for matrices belonging to A_F,⁹ it is useful to view A_F as embedded in the finite-dimensional Hilbert space of matrices, ℝ^{m×n}, with inner product

$$ \langle A, B\rangle = \mathrm{tr}(A^T B) = \mathrm{tr}(B^T A) = \mathrm{tr}(AB^T) = \mathrm{tr}(BA^T). $$

The corresponding matrix norm is the Frobenius norm,

$$ \|A\| = \|A\|_F = \sqrt{\mathrm{tr}\,A^T A} = \sqrt{\mathrm{tr}\,AA^T}. $$

We will call this space the Frobenius space and the associated inner product the Frobenius inner product. It is useful to note the isometry A ∈ ℝ^{m×n} ⟺ 𝐀 = vec(A) ∈ ℝ^{mn}, where 𝐀 is the mn-vector formed by stacking the columns of A. Henceforth, bold type represents the stacked version of a matrix (e.g., 𝐁 = vec(B)). The stacked vector 𝐀 belongs to the standard Hilbert space ℝ^{mn}, which we shall henceforth refer to as the stacked space. This space has the standard Euclidean inner product and norm,

$$ \langle\mathbf{A}, \mathbf{B}\rangle = \mathbf{A}^T\mathbf{B}, \qquad \|\mathbf{A}\| = \sqrt{\mathbf{A}^T\mathbf{A}}. $$

It is straightforward to show that ⟨A, B⟩ = ⟨𝐀, 𝐁⟩ and ‖A‖ = ‖𝐀‖. In particular, we have A ∈ A_F ⟺ ‖A‖ = ‖𝐀‖ = 1. Thus, the manifold in equation 3.17 corresponds to the (mn − 1)-dimensional unit sphere in the stacked space, ℝ^{mn} (which, with a slight abuse of notation, we will continue to denote by A_F). It is evident that A_F is simply connected, so that a path exists between any two elements of A_F and, in particular, between any initial value of the dictionary, A_init ∈ A_F, used to initialize a learning algorithm, and a desired target value, A_final ∈ A_F.¹⁰

⁹ Equivalently, we want to determine the structure of elements Ȧ of the tangent space, TA_F, to the smooth manifold A_F at the point A.

¹⁰ For example, for 0 ≤ t ≤ 1, take the t-parameterized path

$$ A(t) = \frac{(1 - t)A_{\mathrm{init}} + tA_{\mathrm{final}}}{\|(1 - t)A_{\mathrm{init}} + tA_{\mathrm{final}}\|}. $$
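As a quick numerical sanity check of the footnoted path (an illustrative sketch with random stand-in matrices), every point of the normalized interpolation between two unit-Frobenius-norm dictionaries again has unit Frobenius norm, so the path indeed stays in A_F:

```python
import numpy as np

rng = np.random.default_rng(1)
A_init = rng.standard_normal((4, 6))
A_init /= np.linalg.norm(A_init)        # ||A_init||_F = 1
A_final = rng.standard_normal((4, 6))
A_final /= np.linalg.norm(A_final)      # ||A_final||_F = 1

def path(t):
    # t-parameterized path between A_init and A_final, 0 <= t <= 1
    M = (1 - t) * A_init + t * A_final
    return M / np.linalg.norm(M)        # renormalize back onto the sphere

# every sampled point lies on the unit Frobenius sphere
assert all(abs(np.linalg.norm(path(t)) - 1.0) < 1e-12
           for t in np.linspace(0.0, 1.0, 11))
```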
A.1.2 Derivatives on A_F: The Tangent Space TA_F. Determining the form of admissible derivatives on equation 3.17 is equivalent to determining the form of admissible derivatives on the unit ℝ^{mn}-sphere. On the unit sphere, we have the well-known fact that

$$ A \in \mathcal{A}_F \;\Longrightarrow\; \frac{d}{dt}\|\mathbf{A}\|^2 = 2\mathbf{A}^T\dot{\mathbf{A}} = 0 \;\Longrightarrow\; \dot{\mathbf{A}} \perp \mathbf{A}. $$

This shows that the general form of 𝐀̇ is 𝐀̇ = Λ𝐐, where 𝐐 is arbitrary and

$$ \Lambda = \left(I - \frac{\mathbf{A}\mathbf{A}^T}{\|\mathbf{A}\|^2}\right) = (I - \mathbf{A}\mathbf{A}^T) \tag{A.1} $$

is the stacked-space projection operator onto the tangent space of the unit ℝ^{mn}-sphere at the point 𝐀 (note that we used the fact that ‖𝐀‖ = 1). The projection operator Λ is necessarily idempotent, Λ = Λ². Λ is also self-adjoint, Λ = Λ*, where the adjoint operator Λ* is defined by the requirement that

$$ \langle\Lambda^*\mathbf{Q}_1, \mathbf{Q}_2\rangle = \langle\mathbf{Q}_1, \Lambda\mathbf{Q}_2\rangle \quad \text{for all } \mathbf{Q}_1, \mathbf{Q}_2 \in \mathbb{R}^{mn}, $$

showing that Λ is an orthogonal projection operator. In this case, Λ* = Λᵀ, so that self-adjointness corresponds to Λ being symmetric. One can readily show that an idempotent, self-adjoint operator is nonnegative, which in this case corresponds to the symmetric, idempotent operator Λ being a positive semidefinite matrix. This projection can be easily rewritten in the Frobenius space,

$$ \dot{\mathbf{A}} = \Lambda\mathbf{Q} = \mathbf{Q} - \langle\mathbf{A}, \mathbf{Q}\rangle\mathbf{A} \;\Longleftrightarrow\; \dot{A} = L\,Q = Q - \langle A, Q\rangle A = Q - \mathrm{tr}(A^T Q)\,A. \tag{A.2} $$
Of course, this result can be derived directly in the Frobenius space using the fact that

$$ A \in \mathcal{A}_F \;\Longrightarrow\; \frac{d}{dt}\|A\|^2 = 2\langle A, \dot{A}\rangle = 2\,\mathrm{tr}(A^T\dot{A}) = 0, $$

from which it is directly evident that

$$ \dot{A} \in T\mathcal{A}_F \text{ at } A \in \mathcal{A}_F \;\Longleftrightarrow\; \langle A, \dot{A}\rangle = \mathrm{tr}(A^T\dot{A}) = 0, \tag{A.3} $$

and therefore Ȧ must be of the form¹¹

$$ \dot{A} = L\,Q = Q - \frac{\mathrm{tr}(A^T Q)}{\mathrm{tr}(A^T A)}\,A = Q - \mathrm{tr}(A^T Q)\,A. \tag{A.4} $$

¹¹ It must be the case that L = I − |A⟩⟨A|/‖A‖² = I − |A⟩⟨A|, using the physicist's "bra-ket" notation (Dirac, 1958).
One can verify that L is idempotent and self-adjoint and is therefore a nonnegative, orthogonal projection operator. It is the orthogonal projection operator from ℝ^{m×n} onto the tangent space TA_F. In the stacked space (with some additional abuse of notation), we represent the quadratic form for positive semidefinite symmetric matrices W as ‖𝐀‖²_W = 𝐀ᵀW𝐀. Note that this defines a weighted norm if, and only if, W is positive definite, which might not be the case. In particular, when W = Λ, the quadratic form ‖𝐀‖²_Λ is only positive semidefinite. Finally, note from equation A.4 that for all A ∈ A_F,

$$ L\,Q = 0 \;\Longleftrightarrow\; Q = cA, \quad \text{with } c = \mathrm{tr}(A^T Q). \tag{A.5} $$
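The defining properties of the tangent-space projector L, tangency, idempotency, and the kernel characterization of equation A.5, are easy to confirm numerically. The following is an illustrative NumPy sketch with random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 7))
A /= np.linalg.norm(A)                   # A on the unit Frobenius sphere A_F
Q = rng.standard_normal((5, 7))

def L(Q):
    # orthogonal projection onto the tangent space T A_F (eq. A.4, with ||A|| = 1)
    return Q - np.trace(A.T @ Q) * A

LQ = L(Q)
assert abs(np.trace(A.T @ LQ)) < 1e-12   # tangency: <A, L Q> = 0 (eq. A.3)
assert np.allclose(L(LQ), LQ)            # idempotency: L^2 = L
assert np.allclose(L(2.7 * A), 0.0)      # kernel: L Q = 0 when Q = cA (eq. A.5)
```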
A.2 Minimizing the Loss Function over a General Manifold 𝒜. Consider the Lyapunov function,

$$ V_N(X, A) = \langle d_q(y - Ax) + \lambda d_p(x)\rangle_N, \quad A \in \mathcal{A}, \tag{A.6} $$

where 𝒜 is some arbitrary but otherwise appropriately defined constraint manifold associated with the prior, equation 3.13. Note that this is precisely the loss function to be minimized in equation 3.15. If we can determine smooth parameter trajectories (i.e., a parameter-vector adaptation rule (Ẋ, Ȧ)) such that along these trajectories V̇_N(X, A) ≤ 0, then as a consequence of the La Salle invariance principle (Khalil, 1996), the parameter values will converge to the largest invariant set (of the adaptation rule viewed as a nonlinear dynamical system) contained in the set

$$ C = \{(X, A) \mid \dot{V}_N(X, A) \equiv 0 \text{ and } A \in \mathcal{A}\}. \tag{A.7} $$

The set C contains the local minima of V_N. With some additional technical assumptions (generally dependent on the choice of adaptation rule), the elements of C will contain only local minima of V_N. Assuming the i.i.d. q = 2 gaussian measurement noise case of equation 3.8,¹² the loss (Lyapunov) function to be minimized is then

$$ V_N(X, A) = \left\langle \frac{1}{2}\|Ax - y\|^2 + \lambda d_p(x)\right\rangle_N, \quad A \in \mathcal{A}, \tag{A.8} $$

which is essentially the loss function to be minimized in equation 3.15.¹³

¹² Note that in the appendix, unlike the notation used in equations 3.8 et seq., the "hat" notation has been dropped. Nonetheless, it should be understood that the quantities A and X are unknown parameters to be optimized over in equation 3.15, while the measured signal vectors Y are known.

¹³ The factor 1/2 is added for notational convenience and does not materially affect the derivation.
Suppose for the moment, as in equations 3.4 through 3.9, that X is assumed to be known, and note that then (ignoring constant terms depending on X and Y) V_N can be rewritten as

$$ V_N(A) = \langle\mathrm{tr}\,(Ax - y)(Ax - y)^T\rangle_N = \langle\mathrm{tr}(Axx^T A^T) - 2\,\mathrm{tr}(Axy^T) + \mathrm{tr}(yy^T)\rangle_N, $$

so that, up to a constant,

$$ V_N(A) = \mathrm{tr}\{A S_{xx} A^T - 2A S_{xy}\}, $$

with S_{xx} and S_{xy} = S_{yx}^T defined as in equation 3.10. Using standard results from matrix calculus (Dhrymes, 1984), we can show that V_N(A) is minimized by the solution 3.9. This is done by setting

$$ \frac{\partial}{\partial A}V_N(\hat{A}) = 0, $$

and using the identities (valid for W symmetric)

$$ \frac{\partial}{\partial A}\,\mathrm{tr}(AWA^T) = 2AW \quad\text{and}\quad \frac{\partial}{\partial A}\,\mathrm{tr}(AB) = B^T. $$

This yields (assuming that S_{xx} is invertible)

$$ \frac{\partial}{\partial A}V_N(\hat{A}) = 2\hat{A}S_{xx} - 2S_{xy}^T = 2(\hat{A}S_{xx} - S_{yx}) = 0 \;\Longrightarrow\; \hat{A} = S_{y\hat{x}}S_{\hat{x}\hat{x}}^{-1}, $$
which is equation 3.9, as claimed. For S_{xx} nonsingular, the solution is unique and globally optimal. This is, of course, a well-known result. Now return to the general case, equation A.8, where both X and A are unknown. For the data indexed by k = 1, …, N, define the quantities d_k = d_p(x_k), e(x) = Ax − y, and e_k = Ax_k − y_k. The loss function and its time derivative can be written

$$ V_N(X, A) = \left\langle \frac{1}{2}e^T(x)e(x) + \lambda d_p(x)\right\rangle_N, $$

$$ \dot{V}_N(X, A) = \langle e^T(x)\dot{e}(x) + \lambda\nabla^T d_p(x)\,\dot{x}\rangle_N = \langle e^T(x)(A\dot{x} + \dot{A}x) + \lambda\nabla^T d_p(x)\,\dot{x}\rangle_N. $$

Then, to determine an appropriate adaptation rule, note that

$$ \dot{V}_N = T_1 + T_2, \tag{A.9} $$
where

$$ T_1 = \langle(e^T(x)A + \lambda\nabla^T d_p(x))\,\dot{x}\rangle_N = \frac{1}{N}\sum_{k=1}^N (e_k^T A + \lambda\nabla^T d_k)\,\dot{x}_k \tag{A.10} $$

and

$$ T_2 = \langle e^T(x)\dot{A}x\rangle_N = \frac{1}{N}\sum_{k=1}^N e_k^T\dot{A}x_k. \tag{A.11} $$

Enforcing the separate conditions

$$ T_1 \le 0 \quad\text{and}\quad T_2 \le 0 \tag{A.12} $$

(as well as the additional condition that A ∈ 𝒜) will be sufficient to ensure that V̇_N ≤ 0 on 𝒜. In this case, the solution-containing set C of equation A.7 is given by

$$ C = \{(X, A) \mid T_1(X, A) \equiv 0,\; T_2(X, A) \equiv 0 \text{ and } A \in \mathcal{A}\}. \tag{A.13} $$
Note that if A is known and fixed, then T₂ ≡ 0, and only the first condition of equation A.12 (which enforces learning of the source vectors x_k) is of concern. Contrariwise, if source vectors x_k that ensure e(x_k) = 0 are fixed and known, then T₁ ≡ 0, and the second condition of equation A.12 (which enforces learning of the dictionary matrix A) is at issue.

A.3 Obtaining the x_k Solutions with the FOCUSS Algorithm. We now develop the gradient factorization-based derivation of the FOCUSS algorithm, which provides estimates of x_k while satisfying the first convergence condition of equation A.12. The constraint manifold 𝒜 is still assumed to be arbitrary. To enforce the condition T₁ ≤ 0 and derive the first adaptation rule given in equation 3.20, we note that we can factor ∇d_k = ∇d(x_k) as (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999)

$$ \nabla d_k = \alpha_k\Pi_k x_k, $$

with α_k = α_{x_k} > 0 and Π_{x_k} = Π_k positive definite and diagonal for all nonzero x_k. Then, defining β_k = λα_k > 0 and selecting an arbitrary set of (adaptable) symmetric positive-definite matrices V_k, we choose the learning rule

$$ \dot{x}_k = -V_k\{A^T e_k + \lambda\nabla d_k\} = -V_k\{(A^T A + \beta_k\Pi_k)x_k - A^T y_k\}, \quad k = 1, \ldots, N, \tag{A.14} $$
which is the adaptation rule for the state estimates x_k = x̂_k given in the first line of equation 3.20. With this choice, we obtain

$$ T_1 = -\langle\|A^T e(x) + \lambda\nabla d(x)\|_V^2\rangle_N = -\frac{1}{N}\sum_{k=1}^N \|A^T e_k + \lambda\nabla d_k\|_{V_k}^2 = -\frac{1}{N}\sum_{k=1}^N \|(A^T A + \beta_k\Pi_k)x_k - A^T y_k\|_{V_k}^2 \le 0, \tag{A.15} $$

as desired. Assuming convergence to the set A.13 (which will be seen to be the case after we show below how to ensure that we also have T₂ ≤ 0), we will asymptotically obtain (reintroducing the "hat" notation to now denote converged parameter estimates)

$$ \|(\hat{A}^T\hat{A} + \beta_k\Pi_k)\hat{x}_k - \hat{A}^T y_k\|_{V_k}^2 \equiv 0, \quad k = 1, \ldots, N, $$

which is equivalent to

$$ \hat{x}_k = (\hat{A}^T\hat{A} + \beta_k\Pi_k)^{-1}\hat{A}^T y_k, \quad k = 1, \ldots, N, \tag{A.16} $$

at convergence. This is also equivalent to the condition given in the first line of equation 3.25, as shown below. Exploiting the fact that the V_k in equation A.14 are arbitrary (subject to the symmetry and positive-definiteness constraint), let us make the specific choice shown in equation 3.23,

$$ V_k = \gamma_k(A^T A + \beta_k\Pi_k)^{-1}, \quad \gamma_k > 0, \quad k = 1, \ldots, N. \tag{A.17} $$
Also note the (trivial) identity,

$$ (A^T A + \beta\Pi)\Pi^{-1}A^T = A^T(A\Pi^{-1}A^T + \beta I), $$

which can be recast nontrivially as

$$ (A^T A + \beta\Pi)^{-1}A^T = \Pi^{-1}A^T(\beta I + A\Pi^{-1}A^T)^{-1}. \tag{A.18} $$

With equations A.17 and A.18, the learning rule A.14 can be recast as

$$ \dot{x}_k = -\gamma_k\{x_k - \Pi_k^{-1}A^T(\beta_k I + A\Pi_k^{-1}A^T)^{-1}y_k\}, \quad k = 1, \ldots, N, \tag{A.19} $$

which is the alternative learning algorithm, 3.24. At convergence (when T₁ ≡ 0), we have the condition shown in the first line of equation 3.25,

$$ \hat{x}_k = \Pi_k^{-1}\hat{A}^T(\beta_k I + \hat{A}\Pi_k^{-1}\hat{A}^T)^{-1}y_k. \tag{A.20} $$
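The identity A.18, and with it the agreement between the n × n solve of equation A.16 and the m × m solve of equation A.20, can be checked numerically. An illustrative sketch with random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, beta = 5, 8, 0.1
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
Pi = np.diag(rng.uniform(0.5, 2.0, size=n))   # positive-definite diagonal Pi
Pi_inv = np.diag(1.0 / np.diag(Pi))

# identity A.18: (A^T A + beta Pi)^{-1} A^T = Pi^{-1} A^T (beta I + A Pi^{-1} A^T)^{-1}
lhs = np.linalg.inv(A.T @ A + beta * Pi) @ A.T
rhs = Pi_inv @ A.T @ np.linalg.inv(beta * np.eye(m) + A @ Pi_inv @ A.T)
assert np.allclose(lhs, rhs)

# hence the fixed points of eqs. A.16 and A.20 coincide
assert np.allclose(lhs @ y, rhs @ y)
```

For m < n (the overcomplete case), the right-hand form is the cheaper one, since it inverts an m × m rather than an n × n matrix.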
This also follows from the convergence condition A.16 and the identity A.18, showing that the result A.20 is independent of the specific choice of V_k > 0. Note from equations A.14 and A.15 that T₁ ≡ 0 also results in ẋ_k ≡ 0 for k = 1, …, N, so that we will have converged to constant values x̂_k which satisfy equation A.20. For the case of known, fixed A, the learning rule derived here will converge to sparse solutions x_k and, when discretized as in section 3.3, yields equation 3.27, which is the known-dictionary FOCUSS algorithm (Rao & Kreutz-Delgado, 1998a, 1999). Note that the derivation of the sparse source-vector learning algorithm here, which enforces the condition T₁ ≤ 0, is entirely independent of any constraints placed on A (such as, for example, the unit Frobenius-norm and column-norm constraints considered in this article) and of the form of the A-learning rule. Thus, alternative choices of constraints placed on A, as considered in Murray and Kreutz-Delgado (2001) and described in appendix B, will not change the form of the x_k-learning rule derived here. Of course, because the x_k-learning rule is strongly coupled to the A-learning rule, algorithmic performance and speed of convergence may well be highly sensitive to the conditions placed on A and the specific A-learning algorithm used.

A.4 Learning the Dictionary A.
A.4.1 General Results. We now turn to the enforcement of the second convergence condition, T₂ ≤ 0, and the development of the dictionary adaptation rule shown in equation 3.20. First, as in equation 3.21, we define the error term δA as

$$ \delta A = \langle e(x)x^T\rangle_N = \frac{1}{N}\sum_{k=1}^N e(x_k)x_k^T = A S_{xx} - S_{yx}, \tag{A.21} $$

using the fact that e(x) = Ax − y. Then, from equation A.11, we have

$$ T_2 = \langle e^T(x)\dot{A}x\rangle_N = \langle\mathrm{tr}(xe^T(x)\dot{A})\rangle_N = \mathrm{tr}(\langle xe^T(x)\rangle_N\,\dot{A}) = \mathrm{tr}(\delta A^T\dot{A}) = \delta\mathbf{A}^T\dot{\mathbf{A}}. \tag{A.22} $$

So far, these steps are independent of any specific constraints that may be placed on A, other than that the manifold 𝒜 be smooth and compact. With A constrained to lie on a specified smooth, compact manifold, to ensure correct learning behavior, it is sufficient to impose the constraint that Ȧ lies in the tangent space to the manifold and the condition that T₂ ≤ 0.

A.4.2 Learning on the Unit Frobenius Sphere, A_F. To ensure that T₂ ≤ 0 and that Ȧ is in the tangent space of the unit sphere in the Frobenius space ℝ^{mn}, we take

$$ \dot{\mathbf{A}} = -\mu\Lambda\,\delta\mathbf{A} \;\Longleftrightarrow\; \dot{A} = -\mu L\,\delta A = -\mu(\delta A - \mathrm{tr}(A^T\delta A)A), \quad \mu > 0, \tag{A.23} $$
which is the adaptation rule given in equation 3.20. With this choice, and using the positive semidefiniteness of Λ, we have

$$ T_2 = -\mu\|\delta\mathbf{A}\|_\Lambda^2 \le 0, $$

as required. Note that at convergence, the condition T₂ ≡ 0 yields Ȧ ≡ 0, so that we will have converged to constant values for the dictionary elements, and

$$ 0 = L\,\delta\hat{A} = L(\hat{A}S_{\hat{x}\hat{x}} - S_{y\hat{x}}) \;\Longrightarrow\; \delta\hat{A} = (\hat{A}S_{\hat{x}\hat{x}} - S_{y\hat{x}}) = c\hat{A}, \tag{A.24} $$

from equation A.5, where c = tr(Âᵀ δÂ). Thus, the steady-state solution is

$$ \hat{A} = S_{y\hat{x}}(S_{\hat{x}\hat{x}} - cI)^{-1} \in \mathcal{A}_F. \tag{A.25} $$
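Both claims about the Frobenius-sphere rule, tangency of Ȧ (so the constraint ‖A‖_F = 1 is maintained) and nonpositivity of T₂, can be confirmed numerically. An illustrative sketch (the random matrix `dA_err` stands in for the error term δA):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 6))
A /= np.linalg.norm(A)                  # A on the unit Frobenius sphere
dA_err = rng.standard_normal((4, 6))    # stand-in for the error term dA (eq. A.21)

mu = 0.5
A_dot = -mu * (dA_err - np.trace(A.T @ dA_err) * A)   # rule A.23

assert abs(np.trace(A.T @ A_dot)) < 1e-12   # tangency: d/dt ||A||^2 = 0 (eq. A.3)
T2 = np.trace(dA_err.T @ A_dot)             # T2 = tr(dA^T A-dot) (eq. A.22)
assert T2 <= 1e-12                          # descent: T2 = -mu ||dA||_Lambda^2 <= 0
```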
Note that equations A.20 and A.25 are the steady-state values given earlier in equation 3.25.

Appendix B: Convergence of the Column-Normalized Learning Algorithm
The derivation and proof of convergence of the column-normalized learning algorithm, applicable to learning members of A_C, is accomplished by appropriately modifying key steps of the development given in appendix A. As in appendix A, the standard Euclidean norm and inner product apply to column vectors, while the Frobenius norm and inner product apply to m × n matrix elements of ℝ^{m×n}.

B.1 Admissible Matrix Derivatives.
B.1.1 The Constraint Manifold A_C. Let e_i = (0 ⋯ 0 1 0 ⋯ 0)ᵀ ∈ ℝⁿ be the canonical unit vector whose components are all zero except for the value 1 in the ith location. Then

$$ I = \sum_{i=1}^n e_i e_i^T \quad\text{and}\quad a_i = Ae_i. $$

Note that

$$ \|a_i\|^2 = a_i^T a_i = (Ae_i)^T Ae_i = e_i^T A^T Ae_i = \mathrm{tr}(e_i e_i^T A^T A) = \mathrm{tr}(M_i^T A) = \langle M_i, A\rangle, $$

where

$$ M_i \triangleq Ae_i e_i^T = a_i e_i^T = [0\;0\;\cdots\;0\;a_i\;0\;\cdots\;0] \in \mathbb{R}^{m\times n}. \tag{B.1} $$

Note that only the ith column of M_i is nonzero and is equal to a_i, the ith column of A. We therefore have that

$$ A = \sum_{i=1}^n M_i. \tag{B.2} $$
Also, for i, j = 1, …, n,

$$ \langle M_i, M_j\rangle = \mathrm{tr}(M_i^T M_j) = \mathrm{tr}(e_i e_i^T A^T Ae_j e_j^T) = \mathrm{tr}(e_i^T A^T Ae_j e_j^T e_i) = \|a_i\|^2\delta_{i,j}, \tag{B.3} $$

where δ_{i,j} is the Kronecker delta. Note, in particular, that ‖a_i‖ = ‖M_i‖. Let A ∈ A_C, where A_C is the set of column-normalized matrices, as defined by equation 3.29. A_C is an (mn − n) = n(m − 1)-dimensional submanifold of the Frobenius space ℝ^{m×n}, as each of the n columns of A is normalized,

$$ \|a_i\|^2 = \|M_i\|^2 \equiv \frac{1}{n}, $$

and

$$ \|A\|^2 = \mathrm{tr}(A^T A) = \mathrm{tr}\Bigl(\sum_{i=1}^n M_i^T\sum_{j=1}^n M_j\Bigr) = \sum_{i=1}^n \|a_i\|^2 = \frac{n}{n} = 1, $$

using linearity of the trace function and property B.3. It is evident, therefore, that A_C ⊂ A_F, for A_F defined by equation 3.17. We can readily show that A_C is simply connected, so that a continuous path exists between any matrix A = A_init ∈ A_C and any other column-normalized matrix A′ = A_final ∈ A_C. Indeed, let A and A′ be such that

$$ A = [a_1, \ldots, a_n], \quad A' = [a_1', \ldots, a_n'], \quad \|a_i\| = \|a_j'\| = 1/\sqrt{n}, \quad \forall i, j = 1, \ldots, n. $$

There is obviously a continuous path entirely in A_C from A ∈ A_C to the intermediate matrix [a′₁, a₂, …, a_n] ∈ A_C. Similarly, there is a continuous path entirely in A_C from [a′₁, a₂, …, a_n] to [a′₁, a′₂, …, a_n], and so on to [a′₁, …, a′_n] = A′. To summarize, A_C is a simply connected (nm − n)-dimensional submanifold of the simply connected (nm − 1)-dimensional manifold A_F, and they are both submanifolds of the (nm)-dimensional Frobenius space ℝ^{m×n}.

B.1.2 Derivatives on A_C: The Tangent Space TA_C. For convenience, for i = 1, …, n, define

$$ \hat{a}_i = \frac{a_i}{\|a_i\|} = \sqrt{n}\,a_i, \quad \|\hat{a}_i\| = 1, $$

and

$$ \hat{M}_i = \frac{M_i}{\|M_i\|} = \sqrt{n}\,M_i = \hat{a}_i e_i^T, \quad \|\hat{M}_i\| = 1. $$
Note that equation B.3 yields

$$ \langle\hat{M}_i, \hat{M}_j\rangle = \mathrm{tr}(\hat{M}_i^T\hat{M}_j) = \delta_{i,j}. \tag{B.4} $$

For any A ∈ A_C we have, for each i = 1, …, n,

$$ 0 = \frac{d}{dt}\|a_i\|^2 = \frac{d}{dt}a_i^T a_i = \frac{d}{dt}e_i^T A^T Ae_i = 2e_i^T A^T\dot{A}e_i = 2\,\mathrm{tr}(e_i e_i^T A^T\dot{A}) = 2\,\mathrm{tr}(M_i^T\dot{A}), $$

or

$$ \langle M_i, \dot{A}\rangle = \mathrm{tr}(M_i^T\dot{A}) = 0 \quad\text{for } A \in \mathcal{A}_C, \quad i = 1, \ldots, n. \tag{B.5} $$

In fact,

$$ \dot{A} \in T\mathcal{A}_C \text{ at } A \in \mathcal{A}_C \;\Longleftrightarrow\; \langle\hat{M}_i, \dot{A}\rangle = \mathrm{tr}(\hat{M}_i^T\dot{A}) = 0 \quad\text{for } i = 1, \ldots, n. \tag{B.6} $$

Note from equations B.2 and B.5 that

$$ \langle A, \dot{A}\rangle = \Bigl\langle\sum_{i=1}^n M_i, \dot{A}\Bigr\rangle = 0, $$

showing that (see equation A.3) TA_C ⊂ TA_F, as expected from the fact that A_C ⊂ A_F. An important and readily proven fact is that for each i = 1, …, n,

$$ \mathrm{tr}(\hat{M}_i^T\dot{A}) = 0 \;\Longleftrightarrow\; \dot{A} = P_i Q_i \tag{B.7} $$

for some matrix Q_i and projection operator P_i defined by

$$ P_i Q \triangleq Q - \hat{M}_i\langle\hat{M}_i, Q\rangle = Q - \hat{M}_i\,\mathrm{tr}(\hat{M}_i^T Q). \tag{B.8} $$
From equation B.4, it can be shown that the projection operators commute,

$$ P_i P_j = P_j P_i, \quad i \ne j, \quad i, j = 1, \ldots, n, \tag{B.9} $$

and are idempotent,

$$ P_i^2 = P_i, \quad i = 1, \ldots, n. \tag{B.10} $$

Indeed, it is straightforward to show from equation B.4 that for all Q,

$$ P_i P_j Q = P_j P_i Q = Q - \hat{M}_i\langle\hat{M}_i, Q\rangle - \hat{M}_j\langle\hat{M}_j, Q\rangle, \quad\text{for all } i \ne j, \tag{B.11} $$
Dictionary Learning Algorithms for Sparse Representation
391
and

$$ P_i^2 Q = Q - \hat{M}_i\langle\hat{M}_i, Q\rangle = P_i Q, \quad i = 1, \ldots, n. \tag{B.12} $$

It can also be shown that

$$ \langle P_i Q_1, Q_2\rangle = \langle Q_1, P_i Q_2\rangle, $$

showing that P_i is self-adjoint, P_i = P_i*. Define the operator

$$ P = P_1 \cdots P_n. \tag{B.13} $$

Note that because of the commutativity of the P_i, the order of multiplication on the right-hand side of equation B.13 is immaterial, and it is easily determined that P is idempotent, P² = P. By induction on equation B.11, it is readily shown that

$$ PQ = Q - \hat{M}_1\langle\hat{M}_1, Q\rangle - \cdots - \hat{M}_n\langle\hat{M}_n, Q\rangle = Q - \sum_{i=1}^n\hat{M}_i\langle\hat{M}_i, Q\rangle. \tag{B.14} $$

Either from the self-adjointness and idempotency of each P_i, or directly from equation B.14, it can be shown that P itself is self-adjoint, P = P*,

$$ \langle PQ_1, Q_2\rangle = \langle Q_1, PQ_2\rangle. $$

Thus, P is the orthogonal projection operator from ℝ^{m×n} onto TA_C. As a consequence, we have the key result that

$$ \dot{A} \in T\mathcal{A}_C \text{ at } A \;\Longleftrightarrow\; \dot{A} = PQ \tag{B.15} $$
for some matrix Q ∈ ℝ^{m×n}. This follows from the fact that, given Q, the right-hand side of equation B.6 is true for Ȧ = PQ. On the other hand, we can take Q = Ȧ in equation B.15 if the right-hand side of equation B.6 is true for Ȧ. Note that for the TA_F-projection operator L given by equation A.2, we have that

$$ P = LP = PL, $$

consistent with the fact that TA_C ⊂ TA_F. Let q_i, i = 1, …, n, be the columns of Q = [q₁, …, q_n]. The operation Q′ = P_j Q corresponds to

$$ q_i' = q_i, \quad i \ne j, \quad\text{and}\quad q_j' = (I - \hat{a}_j\hat{a}_j^T)q_j, $$

while the operation Q′ = PQ corresponds to

$$ q_i' = (I - \hat{a}_i\hat{a}_i^T)q_i, \quad i = 1, \ldots, n. $$
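The columnwise form of the projection P makes it cheap to apply without ever forming the M̂_i. The following NumPy sketch (random stand-ins, illustrative only) applies q′_i = (I − â_iâ_iᵀ)q_i to every column and checks idempotency and tangency:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 4, 6
A = rng.standard_normal((m, n))
A /= np.sqrt(n) * np.linalg.norm(A, axis=0)   # column norms 1/sqrt(n): A in A_C
A_unit = A / np.linalg.norm(A, axis=0)        # unit columns a-hat_i

def P(Q):
    # columnwise tangent projection: q'_i = (I - a-hat_i a-hat_i^T) q_i
    return Q - A_unit * np.sum(A_unit * Q, axis=0)

Q = rng.standard_normal((m, n))
PQ = P(Q)
assert np.allclose(P(PQ), PQ)                    # idempotency: P^2 = P (cf. eq. B.12)
assert np.allclose(np.sum(A * PQ, axis=0), 0.0)  # tangency: <M_i, PQ> = a_i^T (PQ)_i = 0
```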
B.2 Learning on the Manifold A_C. The development of equations A.6 through A.22 is independent of the precise nature of the constraint manifold 𝒜 and can be applied here to the specific case 𝒜 = A_C. To ensure that T₂ ≤ 0 in equation A.22 and that Ȧ ∈ TA_C, we can use the learning rule

$$ \dot{A} = -\mu P\,\delta A = -\mu\Bigl(\delta A - \sum_{i=1}^n\hat{M}_i\,\mathrm{tr}(\hat{M}_i^T\delta A)\Bigr), \tag{B.16} $$

for μ > 0 and δA given by equation A.21. With the ith column of δA denoted by δa_i, this corresponds to the learning rule

$$ \dot{a}_i = -\mu(I - \hat{a}_i\hat{a}_i^T)\,\delta a_i, \quad i = 1, \ldots, n. \tag{B.17} $$

With rule B.16, we have

$$ T_2 = \langle\delta A, \dot{A}\rangle = -\mu\langle\delta A, P\delta A\rangle = -\mu\langle\delta A, P^2\delta A\rangle = -\mu\langle P\delta A, P\delta A\rangle = -\mu\|P\delta A\|^2 \le 0, $$

where we have explicitly shown that the idempotency and self-adjointness of P correspond to its being a nonnegative operator. Thus, we will have convergence to the largest invariant set for which Ȧ ≡ 0, which from equation B.16 is equivalent to the set for which

$$ P\delta A = \delta A - \sum_{i=1}^n\hat{M}_i\langle\hat{M}_i, \delta A\rangle \equiv 0. \tag{B.18} $$
This, in turn, is equivalent to

$$ \delta A = \sum_{i=1}^n\hat{M}_i\langle\hat{M}_i, \delta A\rangle = \sum_{i=1}^n nM_i\langle M_i, \delta A\rangle = \sum_{i=1}^n c_i M_i, \tag{B.19} $$

with

$$ c_i \triangleq n\langle M_i, \delta A\rangle = n\,a_i^T\delta a_i, \quad i = 1, \ldots, n. $$

An equivalent statement to equation B.19 is

$$ \delta a_i = c_i a_i, \quad i = 1, \ldots, n. $$

Defining the diagonal matrix C = diag[c₁, …, c_n] and recalling the definitions A.21 and B.1, we obtain from equation B.19 that

$$ AS_{xx} - S_{yx} = \delta A = AC, $$

which can be solved as

$$ A = S_{yx}(S_{xx} - C)^{-1}. \tag{B.20} $$

This is the general form of the solution found by the A_C-learning algorithm.
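A practical consequence of rule B.17 is that every column velocity is orthogonal to its own column, so the column norms (and hence membership in A_C) are preserved to first order by a discrete update step. An illustrative NumPy sketch (`dA_err` is a random stand-in for δA, and the step size is an assumption):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 5, 8
A = rng.standard_normal((m, n))
A /= np.sqrt(n) * np.linalg.norm(A, axis=0)   # A in A_C: ||a_i|| = 1/sqrt(n)
dA_err = rng.standard_normal((m, n))          # stand-in for the error term dA

A_unit = A / np.linalg.norm(A, axis=0)
mu = 1e-4
# rule B.17 applied to every column: a-dot_i = -mu (I - a-hat_i a-hat_i^T) da_i
A_dot = -mu * (dA_err - A_unit * np.sum(A_unit * dA_err, axis=0))

assert np.allclose(np.sum(A * A_dot, axis=0), 0.0)    # a_i^T a-dot_i = 0 for all i
norms = np.linalg.norm(A + A_dot, axis=0)             # one Euler step
assert np.allclose(norms, 1.0 / np.sqrt(n), atol=1e-6)
```

In practice, an explicit renormalization of the columns after each discrete step compensates for the second-order drift.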
Acknowledgments
This research was partially supported by NSF grant CCR-9902961. K. K.-D. also acknowledges the support provided by the Computational Neurobiology Laboratory, Salk Institute, during a 1998–1999 sabbatical visit.

References

Adler, J., Rao, B., & Kreutz-Delgado, K. (1996). Comparison of basis selection methods. In Proceedings of the 30th Asilomar Conference on Signals, Systems, and Computers. Pacific Grove, CA: IEEE.

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.

Basilevsky, A. (1994). Statistical factor analysis and related methods: Theory and applications. New York: Wiley.

Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Coifman, R., & Wickerhauser, M. (1992). Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, IT-38(2), 713–718.

Comon, P. (1994). Independent component analysis: A new concept? Signal Processing, 36, 287–314.

Deco, G., & Obradovic, D. (1996). An information-theoretic approach to neural computing. Berlin: Springer-Verlag.

Dhrymes, P. (1984). Mathematics for econometrics (2nd ed.). New York: Springer-Verlag.

Dirac, P. A. M. (1958). The principles of quantum mechanics. Oxford: Clarendon Press.

Donoho, D. (1994). On minimum entropy segmentation. In C. Chui, L. Montefusco, & L. Puccio (Eds.), Wavelets: Theory, algorithms, and applications (pp. 233–269). San Diego, CA: Academic Press.

Donoho, D. (1995). De-noising by soft-thresholding. IEEE Trans. on Information Theory, 41, 613–627.

Engan, K. (2000). Frame based signal representation and compression. Unpublished doctoral dissertation, Stavanger University College, Norway.

Engan, K., Rao, B., & Kreutz-Delgado, K. (1999). Frame design using FOCUSS with method of optimal directions (MOD). In Proc. NORSIG-99. Trondheim, Norway: IEEE.

Engan, K., Rao, B. D., & Kreutz-Delgado, K. (2000). Regularized FOCUSS for subset selection in noise. In Proc.
NORSIG 2000, Sweden, June, 2000 (pp. 247–250). Kolmarden, Sweden: IEEE.

Field, D. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–599.

Girolami, M. (2001). A variational method for learning sparse and overcomplete representations. Neural Computation, 13, 2517–2532.

Gorodnitsky, I., George, J., & Rao, B. (1995). Neuromagnetic source imaging with FOCUSS: A recursive weighted minimum norm algorithm. Journal of Electroencephalography and Clinical Neurophysiology, 95(4), 231–251.
Gorodnitsky, I., & Rao, B. (1997). Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm. IEEE Trans. Signal Processing, 45(3), 600–616.

Hansen, P. C., & O'Leary, D. P. (1993). The use of the L-curve in the regularization of discrete ill-posed problems. SIAM J. Sci. Comput., 14, 1487–1503.

Hyvärinen, A., Cristescu, R., & Oja, E. (1999). A fast algorithm for estimating overcomplete ICA bases for image windows. In Proc. Int. Joint Conf. on Neural Networks (IJCNN'99) (pp. 894–899). Washington, DC.

Kalouptsidis, T., & Theodoridis, S. (1993). Adaptive system identification and signal processing algorithms. Upper Saddle River, NJ: Prentice Hall.

Kassam, S. (1982). Signal detection in non-gaussian noise. New York: Springer-Verlag.

Khalil, H. (1996). Nonlinear systems (2nd ed.). Upper Saddle River, NJ: Prentice Hall.

Kreutz-Delgado, K., & Rao, B. (1997). A general approach to sparse basis selection: Majorization, concavity, and affine scaling (full report with proofs) (Center for Information Engineering Report No. UCSD-CIE-97-7-1). San Diego: University of California. Available on-line at: http://dsp.ucsd.edu/~kreutz.

Kreutz-Delgado, K., & Rao, B. (1998a). A new algorithm and entropy-like measures for sparse coding. In Proc. 5th Joint Symp. Neural Computation. La Jolla, CA.

Kreutz-Delgado, K., & Rao, B. (1998b). Gradient factorization-based algorithm for best-basis selection. In Proceedings of the 8th IEEE Digital Signal Processing Workshop. New York: IEEE.

Kreutz-Delgado, K., & Rao, B. (1998c). Measures and algorithms for best basis selection. In Proceedings of the 1998 ICASSP. New York: IEEE.

Kreutz-Delgado, K., & Rao, B. (1999). Sparse basis selection, ICA, and majorization: Towards a unified perspective. In Proc. 1999 ICASSP. Phoenix, AZ.

Kreutz-Delgado, K., Rao, B., & Engan, K. (1999). Novel algorithms for learning overcomplete dictionaries. In Proc. 1999 Asilomar Conference. Pacific Grove, CA: IEEE.
Kreutz-Delgado, K., Rao, B., Engan, K., Lee, T.-W., & Sejnowski, T. (1999a). Convex/Schur-convex (CSC) log-priors and sparse coding. In Proc. 6th Joint Symposium on Neural Computation. Pasadena, CA.

Kreutz-Delgado, K., Rao, B., Engan, K., Lee, T.-W., & Sejnowski, T. (1999b). Learning overcomplete dictionaries: Convex/Schur-convex (CSC) log-priors, factorial codes, and independent/dependent component analysis (I/DCA). In Proc. 6th Joint Symposium on Neural Computation. Pasadena, CA.

Lee, T.-W., Girolami, M., & Sejnowski, T. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation, 11(2), 417–441.

Lewicki, M. S., & Olshausen, B. A. (1999). A probabilistic framework for the adaptation and comparison of image codes. J. Opt. Soc. Am. A, 16(7), 1587–1601.

Lewicki, M. S., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.

Mallat, S., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. ASSP, 41(12), 3397–3415.
Dictionary Learning Algorithms for Sparse Representation
Marshall, A., & Olkin, I. (1979). Inequalities: Theory of majorization and its applications. San Diego, CA: Academic Press.
Murray, J. F., & Kreutz-Delgado, K. (2001). An improved FOCUSS-based learning algorithm for solving sparse linear inverse problems. In Conference Record of the 35th Asilomar Conference on Signals, Systems and Computers. Pacific Grove, CA: IEEE.
Nadal, J.-P., & Parga, N. (1994). Nonlinear neurons in the low noise limit: A factorial code maximizes information transfer. Network, 5, 565–581.
Olshausen, B., & Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B., & Field, D. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Pham, D. (1996). Blind separation of instantaneous mixture of sources via an independent component analysis. IEEE Trans. Signal Processing, 44(11), 2768–2779.
Popovic, Z., & Sjöstrand, J. (2001). Resolution, separation of retinal ganglion cells, and cortical magnification in humans. Vision Research, 41, 1313–1319.
Rao, B. (1997). Analysis and extensions of the FOCUSS algorithm. In Proceedings of the 1996 Asilomar Conference on Circuits, Systems, and Computers (Vol. 2, pp. 1218–1223). New York: IEEE.
Rao, B. (1998). Signal processing with the sparseness constraint. In Proc. 1998 ICASSP. New York: IEEE.
Rao, B. D., Engan, K., Cotter, S. F., & Kreutz-Delgado, K. (2002). Subset selection in noise based on diversity measure minimization. Manuscript submitted for publication.
Rao, B., & Gorodnitsky, I. (1997). Affine scaling transformation based methods for computing low complexity sparse solutions. In Proc. 1996 ICASSP (Vol. 2, pp. 1783–1786). New York: IEEE.
Rao, B., & Kreutz-Delgado, K. (1997). Deriving algorithms for computing sparse solutions to linear inverse problems. In J. Neyman (Ed.), Proceedings of the 1997 Asilomar Conference on Circuits, Systems, and Computers. New York: IEEE.
Rao, B., & Kreutz-Delgado, K. (1998a). Basis selection in the presence of noise. In Proceedings of the 1998 Asilomar Conference. New York: IEEE.
Rao, B., & Kreutz-Delgado, K. (1998b). Sparse solutions to linear inverse problems with multiple measurement vectors. In Proceedings of the 8th IEEE Digital Signal Processing Workshop. New York: IEEE.
Rao, B., & Kreutz-Delgado, K. (1999). An affine scaling methodology for best basis selection. IEEE Trans. Signal Processing, 1, 187–202.
Roberts, S. (1998). Independent component analysis: Source assessment and separation, a Bayesian approach. IEE (Brit.) Proc. Vis. Image Signal Processing, 145(3), 149–154.
Rockafellar, R. (1970). Convex analysis. Princeton, NJ: Princeton University Press.
Ruderman, D. (1994). The statistics of natural images. Network: Computation in Neural Systems, 5, 517–548.
Wang, K., Lee, C., & Juang, B. (1997). Selective feature extraction via signal decomposition. IEEE Signal Processing Letters, 4(1), 8–11.
Watanabe, S. (1981). Pattern recognition as a quest for minimum entropy. Pattern Recognition, 13(5), 381–387.
K. Kreutz-Delgado, J. Murray, B. Rao, K. Engan, T. Lee, and T. Sejnowski
Wickerhauser, M. (1994). Adapted wavelet analysis from theory to software. Wellesley, MA: A. K. Peters.
Zhu, S., Wu, Y., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9, 1627–1660.
Zibulevsky, M., & Pearlmutter, B. A. (2001). Blind source separation by sparse decomposition in a signal dictionary. Neural Computation, 13(4), 863–882.
Received August 6, 2001; accepted June 4, 2002.
LETTER
Communicated by Christian Wehrhahn
Spatiochromatic Receptive Field Properties Derived from Information-Theoretic Analyses of Cone Mosaic Responses to Natural Scenes

Neural Computation 15, 397–417 (2003)
© 2002 Massachusetts Institute of Technology

Eizaburo Doi
[email protected]
Institute for Neural Computation, University of California, San Diego, La Jolla, CA 92093, U.S.A., and Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan

Toshio Inui
[email protected]
Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan

Te-Won Lee
[email protected]
Institute for Neural Computation, University of California, San Diego, La Jolla, CA 92093, U.S.A., and Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037, U.S.A.

Thomas Wachtler
[email protected]
Neurobiology and Biophysics, Institute of Biology III, Albert-Ludwigs-Universität, 79104 Freiburg, Germany

Terrence J. Sejnowski
[email protected]
Institute for Neural Computation, University of California, San Diego, La Jolla, CA 92093, U.S.A., Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037, U.S.A., and Department of Biology, University of California, San Diego, La Jolla, CA 92093, U.S.A.

Neurons in the early stages of processing in the primate visual system efficiently encode natural scenes. In previous studies of the chromatic properties of natural images, the inputs were sampled on a regular array, with complete color information at every location. However, in the retina, cone photoreceptors with different spectral sensitivities are arranged in a mosaic. We used an unsupervised neural network model to analyze the statistical structure of retinal cone mosaic responses to calibrated color natural images. The second-order statistical dependencies derived from the covariance matrix of the sensory signals were removed in the first stage
of processing. These decorrelating filters were similar to type I receptive fields in parvo- or konio-cellular LGN in both spatial and chromatic characteristics. In the subsequent stage, the decorrelated signals were linearly transformed to make the output as statistically independent as possible, using independent component analysis. The independent component filters showed luminance selectivity with simple-cell-like receptive fields, or had strong color selectivity with large, often double-opponent, receptive fields, both of which were found in the primary visual cortex (V1). These results show that the “form” and “color” channels of the early visual system can be derived from the statistics of sensory signals.

1 Introduction
New algorithms have recently been developed to test the hypothesis that sensory signals are transformed in the early stages of sensory processing to reduce the redundancy of the inputs (Barlow, 1961; Field, 1994). The spatial properties of ganglion cells in the retina and thalamocortical neurons in the lateral geniculate nucleus (LGN) are consistent with algorithms that use second-order statistics to decorrelate natural scenes (Atick, 1992; Atick & Redlich, 1993), and the localized and oriented simple cells in the primary visual cortex (V1) reduce higher-order statistics (Olshausen & Field, 1996; Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998). These studies used gray-scale natural images. Natural color images have also been analyzed using RGB inputs (Tailor, Finkel, & Buchsbaum, 2000; Hoyer & Hyvarinen, 2000) and LMS inputs with the spectral sensitivities of the human L-, M-, and S-cone photoreceptors (Wachtler, Lee, & Sejnowski, 2001). In these studies, the images were sampled on a regular grid, with the color at each location represented as a vector of three elements. In the retina, cone photoreceptors are arranged in a mosaic, with only one cone at each location. Thus, chromatic information in the retina is extracted by comparing the responses of different cone types at different locations, which requires spatial interaction. This is not the case for RGB or LMS color images, where the information for chromaticity is fully available at each pixel. Here, we consider a model that incorporates the cone mosaic found in the trichromatic foveal region of primates. We adopt a hierarchical model that consists of a decorrelating stage corresponding to LGN and a subsequent independent component analysis (ICA) stage corresponding to V1 (Bell & Sejnowski, 1997). Applying biological and computational constraints at each level of processing allows a better comparison to the properties of neurons in the early visual system.

2 Methods

2.1 Cone Mosaic Responses.
A small cone mosaic patch was used to model the trichromatic foveal region of the primate retina (see Figure 1). To
obtain estimates of the cone mosaic responses to natural scenes, we took natural images using a 3CCD camera and converted them into LMS images (see the appendix for details). By scanning the data set randomly with the cone mosaic, 507,904 samples were generated.

2.2 Model.
2.2.1 Computational Goal. We assume that the early stages of visual processing are linear,

u = Wx,   (2.1)
where x = (x_1, . . . , x_n)^T are the input signals to the visual system (namely, the cone mosaic responses), W = [w_1, . . . , w_n]^T is a square matrix consisting of receptive fields w_i, and u = (u_1, . . . , u_n)^T is the output, corresponding to the neural activities (Olshausen & Field, 1996; Bell & Sejnowski, 1997). The computational goal of the system is assumed to be the reduction of the redundancies or statistical dependencies among the output elements. This coding scheme is known as minimum entropy coding or factorial coding (Atick, 1992; Field, 1994) and can be implemented by ICA (Bell & Sejnowski, 1995; Lee, 1998; Girolami, 1999; Hyvarinen, Karhunen, & Oja, 2001). A linear generative model for the input data x is given by

x = As = s_1 a_1 + · · · + s_n a_n,   (2.2)
where A = [a_1, . . . , a_n] is a set of basis functions and s_1, . . . , s_n are their coefficients, called sources or causes. Given sources s that are statistically independent of each other, equation 2.2 assumes that the input x is generated by a linear mixture of statistically independent causes s. The goal of ICA is to find a matrix W that unmixes x so that u = s, namely, W = A^{-1} (up to permutation and scaling). The assumption of statistical independence among the sources s and the general evidence of high redundancy in the sensory input x mean that the basis functions a_i introduce statistical dependency into the input. Equation 2.2 means that when a source s_i is activated (i.e., takes a nonzero value), more than one sensor will concurrently have nonzero values according to the pattern of a_i, which leads to statistical dependency in the sensory signals x.

2.2.2 Hierarchical Implementation. We followed the approach of Bell and Sejnowski (1997) in decomposing the linear transformation W into decorrelation and ICA, W = W_I W_Z (see Figure 2): the first operator W_Z decorrelates the input x such that the covariance matrix of its output u_Z satisfies

⟨u_Z u_Z^T⟩ = ⟨(W_Z x)(W_Z x)^T⟩ = I,   (2.3)
Figure 2: Three-stage model of the early visual system. This is a three-layer feedforward neural network. The first stage corresponds to the photoreceptor layer in the retina, as shown in Figure 1. The subsequent stages are assumed to reduce redundancy in the sensory input in a sequential manner.
where I is the identity matrix. This decorrelation, which normalizes the output variance, is also known as the whitening transformation or zero-phase filtering (Bell & Sejnowski, 1997). The second operator W_I makes the decorrelated signals W_Z x statistically as independent as possible (ICA for whitened data). This decomposition is justified as follows. First, the overlap of cone
Figure 1: Facing page. Properties of the input. (a) Cone mosaic model. This is based on anatomical data from old world monkeys. It consists of 217 photoreceptors of L- (red), M- (green), and S-cone (blue) types. L- and M-cones are arranged randomly (Mollon & Bowmaker, 1992), whereas S-cones are arranged regularly (de Monasterio, McCrane, Newlander, & Schein, 1985; Szel, Diamantstein, & Rohlich, 1988). The cone ratio is L : M = 1 : 1 (Mollon & Bowmaker, 1992; Baylor et al., 1987), and S-cones occupy about 3% (de Monasterio et al., 1985; Mollon & Bowmaker, 1992). In this model, L : M : S = 112 : 98 : 7. (b) Spectral sensitivities of cone photoreceptors. Here we used those of human cone photoreceptors (Stockman & Sharpe, 2000). (c–e) An example of cone mosaic output compared with RGB and gray-scale natural images. (c) RGB image. (d) Gray-scale image. M-cone responses are shown for reference. Larger responses correspond to brighter pixels. (e) Cone mosaic responses. The red hexagon shows the window of the cone mosaic shown in a; thus, the image inside is the response pattern of the cone mosaic. This image was prepared by concatenating several response patches of the same cone mosaic. The pixels are arranged in a hexagonal grid instead of a square grid to mimic the closely packed structure of the retinal cone mosaic shown in a. They were generated by averaging four pixels of square-grid images with a one-pixel offset between each row. (f) Examples of natural images. Sixty-two natural images were used.
spectral sensitivities (especially between L- and M-cones) produces a strong correlation between elements of the input, as is the case of gray-scale natural images. This correlation could be removed as a rst step toward a statistically independent representation. This operation restricts the solution of the next stage W I to be in the orthogonal matrix manifold. Second, spatial properties of receptive elds of retinal ganglion cells and LGN cells have been suggested to be explained by decorrelation, using gray-scale data sets (Atick, 1992; Atick & Redlich, 1993; Bell & Sejnowski, 1997). 2.2.3 Learning Algorithms. W Z is a whitening matrix that satises equation 2.3 with an additional degree of freedom in the choice of whitening matrix: if W Z is a whitening matrix, UW Z is also a whitening matrix for arbitrary orthogonal matrices U . Here we assume that the whitening matrix is symmetric, W Z T D W Z , which makes it unique. For gray-scale natural images, this whitening yields a set of zero-phase lters, which atten the spatial-frequency (1/f) spectrum of shift-invariant data and show concentric center-surround receptive eld organization (Atick, 1992; Atick & Redlich, 1993). Such a matrix can be derived by matrix square root of the inverse of the input covariance matrix (Golub & Loan, 1996; Bell & Sejnowski, 1997) or using a more biologically plausible learning algorithm (Atick & Redlich, 1993). Since both are equivalent, we used the former method for convenience. The ICA algorithm we used can be derived using information maximization or maximum likelihood estimation (Bell & Sejnowski, 1995; Cardoso, 1997) with the natural gradient (Amari, Cichocki, & Yang, 1996),
ΔW_I ∝ [I − φ(u)u^T] W_I,   (2.4)

where φ(u) = −∂ log p(u)/∂u, p(u) = ∏_i p(u_i) is the probability density of u, and ΔW_I is the change of weights, computed iteratively until it converges to zero. Note that this requires a density model of p(u_i). We use the extended infomax algorithm for ICA, which assumes different p(u_i)'s for supergaussian and subgaussian densities (Lee, Girolami, & Sejnowski, 1999). Thus, the results do not require the coefficients to have sparse distributions, unlike some previous methods (Olshausen & Field, 1996; Bell & Sejnowski, 1997).
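The two learning steps just described can be sketched numerically. In the snippet below (our own illustrative Python/NumPy sketch: the toy data, dimensions, and learning rate are assumptions, not the paper's cone-mosaic data), the symmetric whitening matrix is computed as the matrix square root of the inverse covariance, and a single natural-gradient step of equation 2.4 is applied; only the supergaussian branch of the extended infomax nonlinearity, φ(u) = u + tanh(u), is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 100_000

# Toy correlated data standing in for cone mosaic responses (illustrative only):
# a random linear mixture of independent supergaussian (Laplacian) sources.
x = rng.normal(size=(n, n)) @ rng.laplace(size=(n, T))
x -= x.mean(axis=1, keepdims=True)

# Symmetric whitening: W_Z = C^{-1/2}, the unique symmetric whitening matrix,
# obtained from the eigendecomposition of the covariance C.
C = (x @ x.T) / T
d, E = np.linalg.eigh(C)
W_Z = E @ np.diag(d ** -0.5) @ E.T          # matrix square root of C^{-1}
u_Z = W_Z @ x
# Equation 2.3: the whitened outputs have identity covariance.
assert np.allclose((u_Z @ u_Z.T) / T, np.eye(n), atol=1e-8)

# One natural-gradient step of equation 2.4, with the supergaussian choice
# phi(u) = u + tanh(u) of the extended infomax algorithm (the sign-switching
# subgaussian branch is omitted here for brevity).
W_I, eta = np.eye(n), 0.01
u = W_I @ u_Z
phi = u + np.tanh(u)
W_I = W_I + eta * (np.eye(n) - (phi @ u.T) / T) @ W_I
```

Iterating the last three statements until the update term vanishes would give the converged unmixing matrix W = W_I W_Z.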
3 Results

3.1 Decorrelating Stage. All units in the decorrelating stage showed concentric center-surround receptive field organization with the center region driven by a single cone (see Figure 3). Receptive fields of type I neurons of parvo- or konio-cellular LGN (pLGN/kLGN) of the macaque have these properties (Lennie & D’Zmura, 1988; in their review, as in this study, the properties of retinal ganglion cells were not distinguished from those of LGN cells). They can be further classified on the basis of the center-driving
Figure 3: Receptive fields of the decorrelating stage. Each panel shows a weighting pattern (receptive field organization) from the cone mosaic to a certain decorrelating unit. Here we show two examples for each of the L-, M-, and S-center types. Bright and dark circles show positive and negative weights; red, green, and blue colors correspond to weights of L-, M-, and S-cones; and the circle size indicates the amplitude of the weight.
cone type, leading to three types of units that we refer to as L-center, M-center, and S-center. All units were classified as color-selective units using the criterion of Hanazawa, Komatsu, and Murakami (2000): a Color Selectivity Index (CSI) greater than 0.5 (see Figure 4b).¹ This matches the lack of luminance or achromatic units in pLGN/kLGN (Lennie & D’Zmura, 1988; Hanazawa et al., 2000). Figure 4a illustrates color selectivity profiles in comparison with those of pLGN/kLGN neurons, showing agreement between the model and the physiological data. Figure 4c is the histogram of their color tuning directions, segregated into three clusters as in the physiological data. The color tuning directions of the L-center and M-center types were almost completely opposite (−6.3 ± 13.5 and 180.8 ± 1.9, respectively). This is also confirmed in Figure 4a, where contours of the responses were aligned in the same orientation but in the reverse order. This indicates that both the L- and M-center types were responsible for the same r-g channel,

¹ (Color Selectivity Index) = [(max response) − (min response)] / [max abs(response)]. In the physiological study of Hanazawa et al. (2000), the response is defined by [mean discharge ratio] − [baseline activity]. Therefore, negative values are possible.
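The index defined in the footnote can be written as a small function; the response vectors below are hypothetical numbers chosen only to illustrate the two regimes, not measured data.

```python
import numpy as np

def color_selectivity_index(responses):
    """CSI = [(max response) - (min response)] / [max abs(response)].

    With the response convention of Hanazawa et al. (2000) (mean discharge
    minus baseline), responses can be negative, so CSI can exceed 1;
    CSI > 1 implies color opponency (a positive response to one color
    and a negative response to another).
    """
    r = np.asarray(responses, dtype=float)
    return (r.max() - r.min()) / np.abs(r).max()

# Illustrative response sets (hypothetical numbers):
assert color_selectivity_index([0.9, 1.0, 0.8]) < 0.5    # weakly selective unit
assert color_selectivity_index([1.0, -0.7, 0.2]) > 1.0   # opponent, highly selective unit
```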
while the S-center type was responsible for the y-b channel, in the color-opponent stage. Regarding the L-center and M-center types, there was no significant cone-type specificity in the receptive field organization. In fact, they showed close similarity to the whitening filters derived from a single-cone-type mosaic (here we used M-cones, so that the data are gray-scale): w_Z^T w_Z,gray / (‖w_Z‖ ‖w_Z,gray‖) = 0.995 ± 0.005. This suggests that their receptive field organization is determined mainly by spatial factors, not by cone-type-dependent chromatic factors. This also means that the significant color selectivity of the L- and M-center types is mainly due to the weighting ratio between the center and the surround, itself mainly due to the spatial factor, not to cone-type-specific, antagonistic wiring. These results are reasonable, since the majority of the input signals come from L- and M-cones, whose spectral sensitivities are highly overlapping. Therefore, the statistics of cone mosaic data are close to those of gray-scale (M-cone only) data. The S-center type shows S-cone specificity, with larger receptive fields than the L- and M-center types (see Figure 3). This is consistent with the large dendritic arborization of the small bistratified retinal ganglion cells, which are distinct from midget and parasol ganglion cells and belong to the konio-cellular (S-cone) pathway (Dacey, 2000). Our results show that the S-cone weights in the surround are inhibitory. This is reasonable, since the difference of correlated signals is oriented along the uncorrelated axis. The larger size of S-center
Figure 4: Facing page. Chromatic properties of the decorrelating stage. (a) Color selectivity. Model data (upper row) were derived from the same units illustrated in the lower row of Figure 3. Physiological data (lower row) are from pLGN/kLGN neurons. This diagram shows the responses of a certain unit to a set of color stimuli distributed in both MacLeod-Boynton (MB) and CIE 1931 xy (CIE-xy) chromaticity coordinates. For illustration purposes, a CIE-xy chromaticity diagram was used. Open and filled circles show positive and negative responses, respectively; their diameters indicate response amplitudes; + corresponds to the chromaticity of the color stimuli. It is important to note that all color stimuli were adjusted to have the same luminance (isoluminant stimuli), and thus any difference among the responses of a certain unit can be attributed to its color selectivity. Contours of the responses are also plotted: from thick to thin, 80%, 60%, 40%, and 20% of [maximum response] − [minimum response]. Axes of color orientations defined in MB chromaticity coordinates are shown as well (0, 90, 180, 270 degrees). For the analysis of second-layer units, the stimuli were a uniform pattern to mimic the diffuse light in the physiological experiment. (b) Distribution of the color selectivity index. (c) Distribution of color tuning directions. This direction is defined by the maximum slope of the responses. Physiological data are from pLGN/kLGN. The color of the histogram is identical to that of b. All physiological data are from Hanazawa, Komatsu, and Murakami (2000).
type receptive fields should be due to the sparse arrangement of S-cones in the cone mosaic, since in the case of the L- and M-center types, the cones next to the center respond in a highly correlated way with the center cone. While the S-cone antagonism between the center and the surround is reasonable from the redundancy-reduction point of view, physiological studies have shown only spatially coextensive yellow-blue cells in retinal ganglion cells (Dacey, 2000).

3.2 ICA Stage. In the ICA stage, two main types of receptive fields emerged (see Figures 5 and 6): the majority (94.9%) of luminance type (achromatic, color nonselective) and the others (4.6%) of color-selective type. Note that this analysis was based on W, not W_I. The luminance units were similar to those derived using gray-scale samples, showing simple-cell-like receptive fields with high spatial-frequency selectivity (see Figure 5a; Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998). This suggests that the luminance units represent spatial information (without color). Responses to isoluminant color stimuli are almost constant (see Figure 6a), and the color selectivity index was less than 0.5 for almost all luminance units (0.28 ± 0.14; see Figure 6b), showing low color selectivity. The low value of this index means that the spectral selectivity was almost identical to the luminosity function V_λ,² consistent with physiological data from simple cells in V1 (Lennie & D’Zmura, 1988; Lennie, Krauskopf, & Sclar, 1990). The color-selective units were characterized by cone-type-specific antagonistic regions in their receptive fields and were classified into two subtypes: a Y/B type with yellow-blue selectivity and an R/G type with red-green selectivity. The
Figure 5: Facing page. Receptive fields of the ICA stage. (a) Receptive field organization. Here, each panel shows a weighting pattern from the cone mosaic to a certain ICA unit. Two examples for each of the luminance, Y/B, and R/G types are shown. The optimal bar stimulus for each unit is shown just below its receptive field. The optimal bar is defined as the bar that maximally matches the weighting pattern. A set of bars was prepared with 1–3 width, 1–3 length, and 22 orientations, for arbitrary locations. To find the color-opponent regions of the color-selective types, weighting patterns were modified as follows: +L + M − S for the Y/B type; +L − M for the R/G type (S-cone is neglected). (b) Basis functions. These are the dual vectors of the filters shown in a.

² Due to the use of isoluminant color stimuli, the responses of the luminosity function to the color stimuli are identical. The reverse is also true: any spectral sensitivity can be written as the sum of the luminosity function and its residual, S(λ) = aV_λ(λ) + Δ(λ), and the response to an isoluminant stimulus i(λ) is ∫S(λ)i(λ)dλ = a∫V_λ(λ)i(λ)dλ + ∫Δ(λ)i(λ)dλ, where a ≡ ∫S(λ)i(λ)dλ / ∫V_λ(λ)i(λ)dλ and a∫V_λ(λ)i(λ)dλ is constant by definition. If the left side is constant over all isoluminant color stimuli, Δ(λ) = 0; thus, S(λ) = aV_λ(λ).
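The argument in footnote 2 can be checked in a discretized form, replacing integrals over λ with dot products; the sensitivity vector, the stimuli, and the scale factor below are illustrative stand-ins, not measured spectra.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 8                                    # coarse wavelength discretization

V = rng.uniform(0.5, 1.5, size=n)        # stand-in for the luminosity function V_lambda

# Isoluminant stimuli: random spectra rescaled so that the luminance V . i = 1.
stim = rng.uniform(size=(20, n))
stim /= (stim @ V)[:, None]

# If S = a V (zero residual), responses to all isoluminant stimuli are identical.
S = 2.0 * V
responses = stim @ S
assert np.allclose(responses, 2.0)

# A generic nonzero residual Delta breaks this constancy.
Delta = rng.normal(size=n)
responses2 = stim @ (2.0 * V + Delta)
assert responses2.std() > 1e-3
```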
Figure 6: Chromatic properties of the ICA stage. (a) Color selectivity, derived from the units shown in the middle row of Figure 5a. For physiological data, see Figure 3A in Hanazawa et al. (2000) for the R/G type; data corresponding to luminance or Y/B units are not illustrated there, although they were reported. (b) Distribution of the color selectivity index. (c) Distribution of color tuning directions. Here, units whose CSI is greater than 0.5 are included in the analysis, in accordance with the physiological study. The color of the histogram is identical to that of b. All physiological data are taken from Hanazawa et al. (2000).
color selectivity index was much larger than 0.5 (larger than 1) for all color-selective units (1.43 ± 0.16 for Y/B, 1.70 ± 0.01 for R/G; see also Figure 6b), indicating that they are highly color selective. Note that CSI > 1 indicates the existence of color opponency, which yields a positive response to one color and a negative response to another. These color-selective subtypes may correspond to the two color channels suggested in psychophysical studies (Lennie & D’Zmura, 1988). However, this does not match the physiological studies that showed distributed tuning over the color axes (see the right panel in Figure 6c; see also Figure 2 in De Valois, Cottaris, Elfar, Mahon, & Wilson, 2000). Most color-selective units showed orientation selectivity. Although the orientation selectivity of color-selective cells in V1 is controversial (Michael, 1978; Lennie et al., 1990; Conway, 2001; Johnson, Hawken, & Shapley, 2001), our results suggest that it may be expected from redundancy reduction of cone mosaic responses. In addition to luminance units and color-selective units, we obtained a basis function showing a uniform pattern over the cone mosaic, referred to as the DC unit. Its receptive field organization was affected by the boundary effect. In a previous study, this component was excluded from the analysis by subtracting the average from each image patch (Hoyer & Hyvarinen, 2000). Color-selective units had larger receptive fields than luminance units: the size of the optimal bar is 4.42 ± 3.41 (in units of cone size) for luminance units, and 36.30 ± 6.73 and 74.62 ± 25.52 for Y/B and R/G units, respectively. This is consistent with the low spatial-frequency selectivity of the color mechanism (Mullen, 1985), as expected, since color information is derived from the comparison of at least two or three types of cones, but there is only a single cone at each retinal location. Thus, chromatic information requires averaging over several cones, leading to lower spatial-frequency selectivity.
Note also that the percentage of color-selective units was significantly lower than that of luminance units, indicating that the dimensionality assigned to chromatic information is much less than that assigned to luminance information. Our analysis thus far of the properties of the receptive fields w_i suggests that ICA yields two types of units: luminance units responsible for encoding spatial information without color, and color-selective units responsible for color information without fine spatial information. We can rewrite the data-generative model of the cone mosaic responses (see equation 2.2) as

x = Σ_{i ∈ Form} u_i a_i + Σ_{j ∈ Color} u_j a_j,   (3.1)

where we assumed that Form consists of the luminance and DC units and Color consists of the color-selective units. Recall that the response u_i represents the presence of the basis function a_i in the input x. Some examples of the basis functions are shown in Figure 5b. Regarding luminance units, the basis functions show cone-type-irrespective edge patterns, exactly as in the case of gray-scale images. Note that even the locations of S-cones are filled in as if they
were part of the edge surface, while S-cones scarcely contribute to their receptive field organization. Regarding color-selective units, the basis functions show cone-type-specific patterns, that is, color information. These results suggest that the statistically independent patterns in the cone mosaic responses are a large number of cone-type-unspecific edge patterns and a small number of cone-type-specific edge patterns, along with a spatially uniform pattern for each type. In Figure 7, we visualize these different types of information in separate images (Color is divided into R/G and Y/B for convenience). It is clear that the form information is extracted by the luminance and DC units (see Figure 7c), and the color information is extracted by the Y/B (see Figure 7d) and R/G units (see Figure 7e) from the cone mosaic responses (see Figure 7b). In this study, all resulting independent components had supergaussian distributions. When histogram equalization was used as a model for the cone nonlinearity instead of the physiological model (see the appendix), one subgaussian component emerged whose basis function showed a DC component. This is consistent with other studies using gray-scale images (van Hateren & van der Schaaf, 1998) or LMS images (Wachtler et al., 2001), both of which applied the logarithm function as the cone nonlinearity.

3.3 PCA Comparison. The cone mosaic responses were also analyzed by principal component analysis (PCA) to compare the results with those derived from ICA. As shown in Figure 8, PCA yielded nonlocal filters, as in the case of gray-scale images (Olshausen & Field, 1996; Bell & Sejnowski, 1997). There were only two components that showed cone-type specificity; both showed S-cone specificity. This means that the spatial variation (variance)
Figure 7: Facing page. Illustration of the information content in the Luminance, Y/B, and R/G channels. (a) RGB image. (b) Cone mosaic responses. This is the same as Figure 1e but for another natural scene. (c) Reconstruction from luminance and DC units. For each small patch of the cone mosaic frame, the image x̂ was reconstructed by x̂ = Σ_i u_i a_i, where i indicates luminance and DC units. These patches were tiled over the whole image. (d) Reconstruction from Y/B units. As in c above, but i indicates Y/B units. We also postprocessed the image to represent it using pseudocolor: L- and M-cones are regarded as contributing to the yellow direction (i.e., R- and G-pixels with like sign; B-pixel with opposite sign); the S-cone is regarded as the blue direction (B-pixel with like sign; R- and G-pixels with opposite sign). To avoid dark pixels, with color hard to see, the base color is set to gray. (e) Reconstruction from R/G units. As in d, except that i indicates R/G units. In postprocessing, L- and S-cones are regarded as contributing to the red direction (i.e., R-pixel with like sign, G-pixel with opposite sign); the M-cone is regarded as the green direction (G-pixel with like sign; R-pixel with opposite sign). Any B-pixel is set to 0. As in d, the base color is set to gray. (f) Superimposition of the luminance, R/G, and Y/B information above to see form and color information simultaneously.
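The patchwise reconstruction used in Figure 7, x̂ = Σ_i u_i a_i over a chosen subset of units, amounts to summing the corresponding columns of the basis matrix; in this sketch the basis, the responses, and the Form/Color index sets are random, hypothetical stand-ins for the learned quantities.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6

# Illustrative basis A and responses u (random stand-ins, not learned filters).
A = rng.normal(size=(n, n))
u = rng.laplace(size=n)
x = A @ u

# Hypothetical partition of the units into a Form set (luminance + DC units)
# and a Color set, as in equation 3.1.
form, color = [0, 1, 2, 3], [4, 5]

x_form  = A[:, form]  @ u[form]    # sum over i in Form  of u_i a_i
x_color = A[:, color] @ u[color]   # sum over j in Color of u_j a_j

# The two partial reconstructions add back up to the full input x.
assert np.allclose(x_form + x_color, x)
```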
common to all cone types dominates the chromatic variation specific to each cone type (especially for the L- and M-cone types). When LMS natural images are analyzed by PCA, Y/B and R/G components are as common as black-white components (Ruderman, Cronin, & Chiao, 1998). This demonstrates that it is not straightforward for the visual system to extract color information from the cone mosaic responses.
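For reference, PCA itself reduces to an eigendecomposition of the response covariance, with components ordered by the variance they capture; the toy data below (sources with well-separated variances mixed by a random rotation) merely stand in for the cone mosaic responses.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T = 5, 20_000

# Toy correlated data: true variances 5, 4, 3, 2, 1 mixed by a random
# rotation (illustrative stand-in for cone mosaic responses).
R, _ = np.linalg.qr(rng.normal(size=(n, n)))
x = R @ (np.sqrt(np.arange(n, 0, -1))[:, None] * rng.normal(size=(n, T)))

# PCA: eigendecomposition of the covariance, sorted by decreasing variance.
C = np.cov(x)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal components are orthonormal, and projecting onto them yields
# decorrelated outputs whose variances are non-increasing.
assert np.allclose(eigvecs.T @ eigvecs, np.eye(n))
y = eigvecs.T @ x
v = np.var(y, axis=1)
assert np.all(np.diff(v) <= 1e-9)
```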
Figure 8: Principal components of the cone mosaic responses. Here we show the 1st, 5th, 11th, 18th, 25th, and 120th principal components, as indicated on each panel. Cone type is not indicated by color, but otherwise these are the same illustrations as in Figures 3 and 5a. Cone-type specificity can be seen in only two units, both of which show S-cone specificity; the 18th component is one of them. The 1st component is exactly the same as the DC-type basis function in the ICA stage.
4 Discussion

4.1 Multiplexing and Separating Stages. It has been suggested that in the pLGN/kLGN, luminance information is conveyed with red-green information in the r-g opponent channel and that these "multiplexed" signals are "separated" into red-green and luminance information in the following stages, possibly in V1 (Ingling & Martinez-Uriegas, 1983; Lennie & D'Zmura, 1988; Lennie et al., 1990). Our results reproduce this two-stage model. We emphasize that this is an emergent property derived from redundancy reduction and not put in by hand as in previous model studies (Billock, 1991; De Valois & De Valois, 1993; Kingdom & Mullen, 1995). These results suggest mechanisms by which the visual system could reduce redundancy in a sequence of stages.

4.2 Learning Visual Information Processing. It is possible that the color selectivity properties of neurons in the visual system are the end result of
evolution, and that during development they are implemented without visual experience. All of the properties of the cells in our model were obtained with unsupervised algorithms. This suggests that some of the properties of cells in both pLGN/kLGN and V1 could be acquired by unsupervised learning from visual input during development. Other properties of cortical neurons, such as orientation and direction selectivity, depend on visual experience during development (Blakemore & Cooper, 1970; Blakemore & van Sluyters, 1975). Studies of natural scene statistics also support this idea (Olshausen & Field, 1996; Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998). There are several psychophysical studies investigating the plasticity of color vision during development (Teller, 1998; Knoblauch, Vital-Durand, & Barbur, 2001); however, the physical basis of plasticity in color vision has not yet been examined. Our results suggest that color selectivity, like orientation selectivity, may depend on the statistics of sensory signals. They also suggest that, in general, the visual coding of attributes such as form and color (Livingstone & Hubel, 1988) can be understood as a result of redundancy reduction.

4.3 Relation to Previous Studies. Other studies applying ICA to RGB images (Tailor et al., 2000; Hoyer & Hyvärinen, 2000) used spectral sensitivities unlike those found in the human retina, which precludes direct comparisons to physiological or psychophysical data. We used spectral sensitivities of human cone photoreceptors, enabling us to obtain results under more realistic conditions. A similar approach was used earlier (Wachtler et al., 2001; Lee, Wachtler, & Sejnowski, 2002), but without sampling of a trichromatic cone mosaic.
We examined the additional constraint that spatial separability of chromatic signals imposes on the human visual system and found that, in contrast with the results of previous studies, (1) the cone-type specificities for color vision can be learned under the realistic and difficult condition imposed by the cone mosaic sampling; (2) the spectral sensitivity of simple-cell-like units is along the luminance axis, not the black-white axis; (3) the percentage of luminance (simple-cell-like) units is large (>90%), not low (about 1/3); and (4) the luminance units show higher spatial-frequency selectivity than color-selective units.

5 Conclusion
The goal of this study was to analyze the cone mosaic responses to natural scenes and to find an efficient coding of spatiochromatic signals in the visual system. Our results suggest that a hierarchical redundancy reduction model explains the spatiochromatic properties of neurons in the early stages of processing in the visual system. Although our model used more realistic inputs than previous approaches (Tailor et al., 2000; Hoyer & Hyvärinen, 2000; Wachtler et al., 2001), it still assumes a linear transformation. Thus, the current model cannot capture
complex properties such as nonlinear combination of cone signals (De Valois et al., 2000) or contextual interactions (Wachtler, Sejnowski, & Albright, 1999). Our model nevertheless approximates the early processing stages of the visual system. The similarities between our results and the known properties of the visual system further strengthen the hypothesis that redundancy reduction is an organizing principle in the visual system.

Appendix: Preparation of Cone Responses to Natural Scenes
Images of 62 natural scenes were obtained using a 3CCD digital still camera system (HC-2500, Fujifilm, Japan). Each RGB image (1000 × 1280 pixels, 10 bits for each of R, G, and B) was corrected for gamma calibration, iris size, and exposure time. These linear RGB data, whose spectral sensitivities were also measured, were transformed into LMS cone responses using a 3 × 3 matrix. This matrix was derived by minimizing the squared errors for 170 surface reflectance functions (Vrhel, Gershon, & Iwan, 1994) rendered by four kinds of CIE daylight functions (D50, D55, D65, and D75). The resulting errors were 0.23%, 0.14%, and 0.04% for the L-, M-, and S-cones, respectively. The approximation accuracy was also evaluated with a different set of reflectance functions (ColorChecker, Macbeth, U.S.A.), and the estimation errors were reasonably small: 0.39%, 0.23%, and 0.03% for the L-, M-, and S-cones, respectively. These linear cone response data were further transformed with an empirical cone nonlinearity (Baylor, Nunn, & Schnapf, 1987) given by

r_nl = 1 − exp(−k · r_l),    (A.1)
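Equation A.1 is straightforward to reproduce. The sketch below uses made-up linear responses and assumes the per-scene constant k = ln 2 / median(r_l); since the nonlinearity is monotone, this choice places the median of the nonlinear response at 0.5, consistent with the normalization described in the text:

```python
import numpy as np

def cone_nonlinearity(r_l):
    """Saturating cone nonlinearity r_nl = 1 - exp(-k * r_l) (equation A.1).

    k is chosen per scene and cone type; here k = ln(2) / median(r_l),
    an assumed convention that pins the median of r_nl at 0.5 (a monotone
    map sends the median of r_l to the median of r_nl).
    """
    r_l = np.asarray(r_l, dtype=float)
    k = np.log(2.0) / np.median(r_l)
    return 1.0 - np.exp(-k * r_l)

rng = np.random.default_rng(0)
r_l = rng.random(1001) * 3.0        # made-up positive linear responses
r_nl = cone_nonlinearity(r_l)       # values lie in [0, 1)
```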
where r_l denotes the linear cone responses, r_nl the nonlinearly transformed cone responses, whose range is [0, 1], and the parameter k is determined so that the median of r_nl is 0.5 for each scene and for each cone type. We did not apply any preprocessing of the data such as low-pass filtering (Olshausen & Field, 1996) or dimensionality reduction using PCA (van Hateren & van der Schaaf, 1998).

Acknowledgments
E.D. was supported by JSPS Research Fellowships for Young Scientists (no. 200103140), T.I. was supported by a Grant-in-Aid for Scientific Research (A)(2) from MEXT (no. 11309007), and T.J.S. was supported by the Howard Hughes Medical Institute. We thank Akitoshi Hanazawa and Hidehiko Komatsu for allowing us to use their figures and for their helpful comments, Toshikazu Wada for help with the camera calibration, Kyoto Botanical Gardens for assistance with natural image collection, and Greg Horwitz for his helpful comments on the manuscript. Materials such as the color natural images and corresponding cone responses, and the results of all units of decorrelation, ICA, and PCA, are available online at http://inc2.ucsd.edu/~edoi/pub/.
References

Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251.
Atick, J. J., & Redlich, A. N. (1993). Convergent algorithm for sensory receptive field development. Neural Computation, 5, 45–60.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Baylor, D. A., Nunn, B. J., & Schnapf, J. L. (1987). Spectral sensitivity of cones of the monkey Macaca fascicularis. Journal of Physiology, 390, 145–160.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Billock, V. A. (1991). The relationship between simple and double-opponent cells. Vision Research, 31, 33–42.
Blakemore, C., & Cooper, G. F. (1970). Development of the brain depends on the visual environment. Nature, 228, 477–478.
Blakemore, C., & van Sluyters, R. C. (1975). Innate and environmental factors in the development of the kitten's visual cortex. Journal of Physiology, 248, 663–716.
Cardoso, J. F. (1997). Infomax and maximum likelihood for source separation. IEEE Signal Processing Letters, 4, 112–114.
Conway, B. R. (2001). Spatial structure of cone inputs to color cells in alert macaque primary visual cortex (V-1). Journal of Neuroscience, 21, 2768–2783.
Dacey, D. M. (2000). Parallel pathways for spectral coding in primate retina. Annual Review of Neuroscience, 23, 743–775.
de Monasterio, F. M., McCrane, E. P., Newlander, J. K., & Schein, S. J. (1985). Density profile of blue-sensitive cones along the horizontal meridian of macaque retina. Investigative Ophthalmology and Visual Science, 26, 289–302.
De Valois, R. L., Cottaris, N. P., Elfar, S. D., Mahon, L. E., & Wilson, J. A. (2000). Some transformations of color information from lateral geniculate nucleus to striate cortex. PNAS, 97, 4997–5002.
De Valois, R. L., & De Valois, K. K. (1993). A multi-stage color model. Vision Research, 33, 1053–1065.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Girolami, M. (1999). Self-organizing neural networks: Independent component analysis and blind source separation. New York: Springer-Verlag.
Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore: Johns Hopkins University Press.
Hanazawa, A., Komatsu, H., & Murakami, I. (2000). Neural selectivity for hue and saturation of colour in the primary visual cortex of the monkey. European Journal of Neuroscience, 12, 1753–1763.
Hoyer, P. O., & Hyvärinen, A. (2000). Independent component analysis applied to feature extraction from color and stereo images. Network, 11, 191–210.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Ingling, C. R., Jr., & Martinez-Uriegas, E. (1983). The relationship between spectral sensitivity and spatial sensitivity for the primate r-g X-channel. Vision Research, 23, 1495–1500.
Johnson, E. N., Hawken, M. J., & Shapley, R. (2001). The spatial transformation of color in the primary visual cortex of the macaque monkey. Nature Neuroscience, 4, 409–416.
Kingdom, F. A. A., & Mullen, K. T. (1995). Separating colour and luminance information in the visual system. Spatial Vision, 9, 191–219.
Knoblauch, K., Vital-Durand, F., & Barbur, J. L. (2001). Variation of chromatic sensitivity across the life span. Vision Research, 41, 23–36.
Lee, T.-W. (1998). Independent component analysis: Theory and applications. Norwell, MA: Kluwer.
Lee, T.-W., Girolami, M., & Sejnowski, T. J. (1999). Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural Computation, 11, 417–441.
Lee, T.-W., Wachtler, T., & Sejnowski, T. J. (2002). Color opponency is an efficient representation of spectral properties in natural scenes. Vision Research, 42, 2095–2103.
Lennie, P., & D'Zmura, M. (1988). Mechanisms of color vision. CRC Critical Reviews in Neurobiology, 3, 333–400.
Lennie, P., Krauskopf, J., & Sclar, G. (1990). Chromatic mechanisms in striate cortex of macaque. Journal of Neuroscience, 10, 649–669.
Livingstone, M., & Hubel, D. (1988). Segregation of form, color, movement, and depth: Anatomy, physiology, and perception. Science, 240, 740–749.
Michael, C. R. (1978). Color vision mechanisms in monkey striate cortex: Simple cells with dual opponent-color receptive fields. Journal of Neurophysiology, 41, 1233–1249.
Mollon, J. D., & Bowmaker, J. K. (1992). The spatial arrangement of cones in the primate fovea. Nature, 360, 677–679.
Mullen, K. T. (1985). The contrast sensitivity of human color vision to red-green and blue-yellow chromatic gratings. Journal of Physiology, 359, 381–400.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Ruderman, D. L., Cronin, T. W., & Chiao, C.-C. (1998). Statistics of cone responses to natural images: Implications for visual coding. Journal of the Optical Society of America A, 15, 2036–2045.
Stockman, A., & Sharpe, L. T. (2000). Spectral sensitivities of the middle- and long-wavelength sensitive cones derived from measurements in observers of known genotype. Vision Research, 40, 1711–1737.
Szel, A., Diamantstein, T., & Rohlich, P. (1988). Identification of the blue-sensitive cones in the mammalian retina by anti-visual pigment antibody. Journal of Comparative Neurology, 273, 593–602.
Tailor, D. R., Finkel, L. H., & Buchsbaum, G. (2000). Color-opponent receptive fields derived from independent component analysis of natural images. Vision Research, 40, 2671–2676.
Teller, D. Y. (1998). Spatial and temporal aspects of infant color vision. Vision Research, 38, 3275–3282.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265, 359–366.
Vrhel, M. J., Gershon, R., & Iwan, L. S. (1994). Measurement and analysis of object reflectance spectra. Color Research and Application, 19, 4–19.
Wachtler, T., Lee, T.-W., & Sejnowski, T. J. (2001). Chromatic structure of natural scenes. Journal of the Optical Society of America A, 18, 65–77.
Wachtler, T., Sejnowski, T. J., & Albright, T. D. (1999). Interactions between stimulus and background chromaticities in macaque primary visual cortex. Investigative Ophthalmology and Visual Science, 40, 641.

Received April 23, 2002; accepted July 23, 2002.
LETTER
Communicated by Erkki Oja
Linear Geometric ICA: Fundamentals and Algorithms

Fabian J. Theis
[email protected]
Institute of Biophysics, University of Regensburg, Germany

Andreas Jung
[email protected]
Institute for Theoretical Physics, University of Regensburg, Germany

Carlos G. Puntonet
[email protected]
Department of Architecture and Computer Technology, University of Granada, Spain

Elmar W. Lang
elmar.lang@biologie.uni-regensburg.de
Institute of Biophysics, University of Regensburg, Germany

Geometric algorithms for linear independent component analysis (ICA) have recently received some attention due to their pictorial description and their relative ease of implementation. The geometric approach to ICA was first proposed by Puntonet and Prieto (1995). We will reconsider geometric ICA in a theoretic framework, showing that fixed points of geometric ICA fulfill a geometric convergence condition (GCC), which the mixed images of the unit vectors satisfy too. This leads to a conjecture claiming that in the nongaussian unimodal symmetric case, there is only one stable fixed point, implying the uniqueness of geometric ICA after convergence. Guided by the principles of ordinary geometric ICA, we then present a new approach to linear geometric ICA based on histograms, observing a considerable improvement in separation quality of different distributions and a sizable reduction in computational cost, by a factor of 100, compared to the ordinary geometric approach. Furthermore, we explore the accuracy of the algorithm depending on the number of samples and the choice of the mixing matrix, and compare geometric algorithms with classical ICA algorithms, namely, Extended Infomax and FastICA. Finally, we discuss the problem of high-dimensional data sets within the realm of geometrical ICA algorithms.

1 Introduction
Given a random vector, independent component analysis (ICA) tries to find its statistically independent components. This idea can also be used to solve the blind source separation (BSS) problem, which is, given only

Neural Computation 15, 419–439 (2003)
© 2002 Massachusetts Institute of Technology
F. Theis, A. Jung, C. Puntonet, and E. Lang
the mixtures of some underlying independent source signals, to separate the mixed signals (henceforth called sensor signals), thus recovering the original sources. In contrast to correlation-based transformations such as principal component analysis (PCA), ICA renders the output signals as statistically independent as possible by evaluating higher-order statistics. The idea of ICA was first expressed by Jutten, Hérault, Comon, and Sorouchyari (1991), while the term ICA was later coined by Comon (1994). However, the field became popular only with the seminal paper by Bell and Sejnowski (1995), who elaborated on the Infomax principle first advocated by Linsker (1989, 1992). Many other ICA algorithms have been proposed, with the FastICA algorithm (Hyvärinen, 1999) being the most efficient among them. Recently, geometric ICA algorithms have received further attention due to their relative ease of implementation (Puntonet & Prieto, 1995). They have been applied successfully to the analysis of real-world biomedical data (Bauer, Puntonet, Rodriguez-Alvarez, & Lang, 2000) and have been extended as well to nonlinear ICA problems (Puntonet, Alvarez, Prieto, & Prieto, 1999).

2 Basics
In linear BSS, a random vector X: Ω → Rⁿ called the sensor signal is given; it originates from an unknown independent random vector S: Ω → Rⁿ, which will be denoted as the source signal, via a mixing process with an unknown mixing matrix A ∈ Gl(n), that is, X = A ∘ S. Note that we assume as many sensors as sources. Here, Ω denotes a fixed probability space and Gl(n) := {W ∈ Mat(n × n; R) | det(W) ≠ 0} the general linear group of Rⁿ. Only the sensor signal is observable, and the task is to recover A and thereby S = A⁻¹ ∘ X. Uniqueness of the solutions can be ascertained if we allow at most one of the source variables S_i := π_i ∘ S, where π_i: Rⁿ → R denotes the projection onto the ith coordinate, to be gaussian. Then any solution to the BSS problem, that is, any B ∈ Gl(n) such that B ∘ X is independent, is equivalent to A⁻¹, where equivalent means that B can be written as B = LPA⁻¹ with an invertible diagonal matrix (scaling matrix) L ∈ Gl(n) and an invertible matrix with unit vectors in each row (permutation matrix) P ∈ Gl(n) (Comon, 1994). Vice versa, any matrix B that is equivalent to A⁻¹ solves the BSS problem, since the transformed mutual information is invariant under scaling and permutation of coordinates.

3 Geometric Considerations
The basic idea of the geometric separation method lies in the fact that in the source space, {s1, . . . , s_l} ⊂ Rⁿ, where the si represent a fixed number of samples of the source vector S with zero mean, the data clusters along the axes of the coordinate system are transformed by A into data clusters along
Figure 1: A two-dimensional scatter plot of a mixture of two Laplacian signals with identical variance. The signals have been mixed by a matrix A mapping the unit vectors onto vectors inclined at angle ai to the x1-axis. The ragged line along the circle is the histogram of the observations after projection onto the circle, plotted in polar coordinates. Dashed lines show the borders of the receptive fields: the borders of the circle sections that lie closest to the angles ai or −ai, respectively.
transformed coordinate axes through the origin. The detection of these n new axes allows determining a demixing matrix B that is equivalent to A⁻¹ (see Figure 1). We now consider the learning process to be terminated and describe precisely how to recover the matrix A then, after the axes, which span the observation space, have been extracted from the data successfully. Let

L := {(x1, . . . , xn) ∈ Rⁿ | ∃i such that xi > 0 and xj = 0 for all j ≠ i}
be the set of positive coordinate axes, and denote with L′ := AL the image of this set under the transformation A. Note that due to A being bijective, L′ intersects the unit (n − 1)-sphere, S^{n−1} := {x ∈ Rⁿ | |x| = 1}, in exactly n distinct points {p1, . . . , pn}, and that those pi's form a basis of Rⁿ. Define the matrix M_{p1,...,pn} ∈ Gl(n) to be the linear mapping of ei onto pi for i = 1, . . . , n, that is, M_{p1,...,pn} = (p1 | ··· | pn). This matrix thus effects the linear coordinate change from the standard coordinates (ei) to the new basis (pi). Note that for this coordinate transformation, the following lemma holds:

Lemma 1. For any permutation σ ∈ Sn, the two matrices M_{p1,...,pn} and M_{pσ(1),...,pσ(n)} are equivalent.
Now we can state the following theorem:

Theorem 1 (Uniqueness of the Geometric Method). The matrix M_{p1,...,pn} is equivalent to A.

Proof. By construction of M_{p1,...,pn}, we have M_{p1,...,pn}(ei) = pi = f(ei) = Aei/|Aei|, so there exists a λi ∈ R\{0} such that M_{p1,...,pn}(ei) = λi Aei. Defining Λ such that Λ(ei) := λi ei yields an invertible diagonal matrix Λ ∈ Gl(n) such that M_{p1,...,pn} = AΛ. This shows the claim.

Corollary 1. The matrix M_{p1,...,pn}⁻¹ solves the BSS problem.
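Corollary 1 is easy to check numerically. The sketch below (the mixing angles and matrix are invented for the demo, not taken from the paper) builds M_{p1,p2} from the normalized images of the unit vectors and verifies that its inverse demixes, i.e., that B A is a diagonal scaling matrix:

```python
import numpy as np

# Hypothetical 2-D mixing matrix; columns are A e_1 and A e_2.
a1, a2 = 0.3, 1.2                     # made-up mixing angles
A = np.array([[np.cos(a1), np.cos(a2)],
              [np.sin(a1), np.sin(a2)]])

# p_i = A e_i / |A e_i|: intersections of A L with the unit circle.
P = A / np.linalg.norm(A, axis=0)     # columns p_1, p_2

M = P                                 # M_{p1,p2} = (p_1 | p_2) maps e_i to p_i
B = np.linalg.inv(M)                  # Corollary 1: M^{-1} solves BSS

D = B @ A                             # should be an invertible diagonal matrix
```

Since M = AΛ with Λ = diag(1/|Aeᵢ|), the product B A equals Λ⁻¹, a positive diagonal matrix, which is exactly the scaling indeterminacy of BSS.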
4 The Ordinary Geometric Algorithm
For now, we will restrict ourselves to the two-dimensional case for simplicity; extensions to higher dimensions are discussed in section 11. Let S: Ω → R² be an independent two-dimensional Lebesgue-continuous random vector describing the source pattern distribution; its density function is denoted by ρ: R² → R. As S is independent, ρ factorizes: ρ(x, y) = ρ1(x) ρ2(y), with ρi: R → R denoting the corresponding marginal source density functions. As a further simplification, we will assume the source variables S_i to have zero mean, E(S) = 0, and to be distributed symmetrically, that is, ρi(x) = ρi(−x) for x ∈ R and i = 1, . . . , n. To ensure stability of the geometric algorithm, we have to assume that the signal distributions are nongaussian and unimodal. In practice, these restrictions are often met at least approximately.
As above, let X denote the sensor signal vector and A the invertible mixing matrix such that X = A ∘ S. Without loss of generality, assume that A is of the form

A = ( cos a1  cos a2
      sin a1  sin a2 ),

where ai ∈ [0, π) denote two angles. The ordinary geometric learning algorithm (Puntonet & Prieto, 1995) for symmetric distributions in its simplest form then is as follows: Pick four starting vectors w1, w′1, w2, and w′2 on S¹ such that wi and w′i are opposite each other (i.e., wi = −w′i for i = 1, 2) and w1 and w2 are linearly independent vectors in R². Usually one takes the unit vectors w1 = e1 and w2 = e2. Furthermore, fix a learning rate γ: N → R with (Cottrell, Fort, & Pagès, 1994) γ(t) > 0, Σ_{n∈N} γ(n) = ∞, and Σ_{n∈N} γ(n)² < ∞. Then iterate the following step until an appropriate abort condition has been met: Choose a sample x(t) ∈ R² according to the distribution of X. If x(t) = 0, pick a new one. Note that this case happens with probability zero since the probability density function (pdf) ρ_X of X is assumed to be continuous. Project x(t) onto the unit sphere to get y(t) := x(t)/|x(t)|. Let i be in {1, 2} such that wi or w′i is the vector closest to y with respect to the Euclidean metric. Then update wi(t) according to the following update rule,

wi(t + 1) := pr( wi(t) + γ(t) (y(t) − wi(t)) / |y(t) − wi(t)| ),

where pr: R²\{0} → S¹ represents the projection onto the unit sphere, and w′i(t + 1) := −wi(t + 1). The other two w's are not moved in this iteration. In Figures 2 and 3, the learning algorithm has been visualized on the sphere and after the projection onto [0, π). This weight update rule resembles unsupervised competitive learning rules used in many clustering algorithms like k-means, vector quantization, or Kohonen's self-organizing maps, but with the modifications that the step size along the direction of a sample does not depend on distance and the learning process takes place on S¹, not in R.
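The iteration above can be sketched in code. This is our own toy implementation, not the authors' software: the learning rate γ(t) = 1/t is one admissible choice satisfying the stated conditions, and each pair wᵢ, w′ᵢ = −wᵢ is folded into a single vector by flipping the sample when the antipode is the winner:

```python
import numpy as np

def geometric_ica_2d(X, n_iter=20000, seed=0):
    """Ordinary geometric learning rule for 2-D symmetric sources.

    X: 2 x N array of zero-mean sensor samples.
    Returns a 2 x 2 array whose rows are the learned unit vectors w_1, w_2;
    the antipodes w_i' = -w_i are kept implicitly.
    """
    rng = np.random.default_rng(seed)
    W = np.eye(2)                          # start at w_1 = e_1, w_2 = e_2
    for t in range(1, n_iter + 1):
        x = X[:, rng.integers(X.shape[1])]
        if not np.any(x):                  # x = 0 happens with probability zero
            continue
        y = x / np.linalg.norm(x)
        # winner: the index i whose w_i or -w_i lies closest to y
        d = np.minimum(np.linalg.norm(W - y, axis=1),
                       np.linalg.norm(W + y, axis=1))
        i = int(np.argmin(d))
        if np.linalg.norm(W[i] + y) < np.linalg.norm(W[i] - y):
            y = -y                         # update the nearer antipode instead
        diff = y - W[i]
        norm = np.linalg.norm(diff)
        if norm < 1e-12:
            continue
        # unit step toward the sample, then project back onto S^1
        W[i] = W[i] + (1.0 / t) * diff / norm
        W[i] /= np.linalg.norm(W[i])
    return W
```

For supergaussian (e.g., Laplacian) sources, the rows of the returned matrix should approach the mixed axes pᵢ, so stacking them as columns and inverting yields a demixing matrix as in Corollary 1.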
5 Formal Model of the Geometric Algorithm
Now we present a formal theoretical framework for geometric ICA that will be used in the next section to formulate a proper convergence condition. First, we show, using the symmetry of S, that it is in fact not necessary to have two vectors wi and w′i moving around on the same axis. Indeed, we should not speak of vectors but of lines in R², so the wi's would be in
Figure 2: Visualization of the geometric algorithm with starting points w1(0) and w2(0) and end points w1(∞) and w2(∞). Dash-dotted lines mark receptive field borders.
the real projective space RP¹ = S¹/∼, where ∼ identifies antipodal points. This is the manifold of all one-dimensional subvector spaces of R². A metric is defined by setting d([x], [y]) := min{|x − y|, |x + y|} for [x], [y] ∈ RP¹. Alternatively, one can picture the w's in S¹₊ := (S¹ ∩ {(x1, x2) ∈ R² | x2 ≥ 0})/∼, where ∼ identifies the two points (1, 0) and (−1, 0). Let f: S¹ → S¹₊ represent the canonical projection. Furthermore, it is useful to introduce polar coordinates θ: S¹₊ → [0, π) on S¹₊ with the stratification θ′: R → S¹₊ such that θ′ ∘ θ = id, where id denotes the identity. Let χ := θ ∘ θ′: R → [0, π) be the modulo π map. We are interested in the projected random sensor signal vector pr ∘ X: Ω → S¹, so after cutting open the circle S¹ and identification of opposite points, we want to approximate the transformed
Figure 3: Plot of the density ρ_Y of a mixture of two Laplacian signals with identical variance. The weight adaptation by the geometric algorithm is also visualized. See also Figure 2.
random variable Y := θ ∘ f ∘ pr ∘ X: Ω → [0, π) in a suitable manner. Note that using the symmetry of ρ, the density function ρ_Y of the transformed sensor signal Y can be calculated from the density ρ_X of the original sensor signal X by

ρ_Y(θ) = 2 |det A|⁻¹ ∫₀^∞ ρ(A⁻¹ (r cos θ, r sin θ)ᵀ) r dr.
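A quick Monte Carlo sanity check of this formula (our own illustration, not from the paper): for A = id and two independent standard gaussian sources, ρ is rotationally symmetric, so the inner integral does not depend on θ and the formula predicts ρ_Y ≡ 2 · (1/2π) · 1 = 1/π, i.e., Y is uniform on [0, π):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two independent standard gaussian sources, A = id.
X = rng.standard_normal((2, 100000))
Y = np.arctan2(X[1], X[0]) % np.pi     # theta o f o pr o X, values in [0, pi)

# Every bin of the normalized histogram should be close to 1/pi ~ 0.318.
hist, _ = np.histogram(Y, bins=8, range=(0.0, np.pi), density=True)
```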
Then the geometric learning algorithm induces the following discrete Markov process W(t): Ω → R², defined recursively by W(0) = (w1, w2) and

W(t + 1) = χ²( W(t) + γ(t) q( (Y(t), Y(t)) − W(t) ) ),

where χ² denotes χ applied to each component,

q(x, y) := (sgn(x), 0)  if |y| ≥ |x|,
           (0, sgn(y))  if |x| > |y|,

and Y(0), Y(1), . . . is a sequence of independent and identically distributed random variables Ω → R with the same distribution as Y. These random
variables will be needed to represent the independence of the successive sampling experiments. Note that the modulo π map χ guarantees that W(t + 1) ∈ [0, π). Indeed, this is just winner-takes-all learning with a signum function in R, but taking into account the fact that we have to stay in [0, π). Note that the metric used here is the planar metric, which obviously is equivalent to the metric on S¹₊ induced by the Euclidean metric on S¹ ⊂ R². We furthermore can assume that after enough iterations, there is one point a ∈ S¹ that will not be traversed anymore, and without loss of generality, we assume a to be 0 (otherwise, cut S¹ open at a and project along this resulting arc), so that the above algorithm simplifies to the planar case with the recursion rule W(t + 1) = W(t) + γ(t) q((Y(t), Y(t)) − W(t)). Without the sign function, and with the additional fact that the probability distribution of Y is log concave, it has been shown (Cottrell & Fort, 1987; Ritter & Schulten, 1988; Benaïm, Fort, & Pagès, 1998) that the process W(t) converges to a unique constant fixed-point process W ≡ w ∈ R² such that

∫_{b1(Fi)}^{wi} ρ_Y(θ) dθ = ∫_{wi}^{b2(Fi)} ρ_Y(θ) dθ

for i = 1, 2, where Fi := F(wi) := {θ ∈ [0, π) | χ(|θ − wi|) ≤ χ(|θ − wj|) for all j ≠ i}, with Fi = χ([b1(Fi), b2(Fi)]), denotes the receptive field of wi and the bj(Fi) designate the receptive field borders. However, it is not clear how to generalize the proof to the geometric case, especially because we do not have (and do not want) log concavity of Y, as this would lead to a unique fixed point. Therefore, we will assume convergence in a sense stated in the following section.

6 Limit Points of the Geometric Algorithm
In this section, we study the end points of geometric ICA, so we will assume that the algorithm has already converged. The idea is to formulate a condition that the end points will have to satisfy and to show that the solutions are among them.

Definition 1 (Geometric Convergence Condition). Two angles λ1, λ2 ∈ [0, π) satisfy the geometric convergence condition (GCC) if they are the medians of Y restricted to their receptive fields, that is, if λi is the median of ρ_Y|F(λi).
Definition 2. A constant random vector Ŵ ≡ (ŵ1, ŵ2) ∈ R² is called a fixed point of geometric ICA in the expectation if E(q((Y, Y) − Ŵ)) = 0.

Hence, the expectation of a Markov process W(t) starting at a fixed point of geometric ICA will not be changed by the geometric update rule, because E(W(t + 1)) = E(W(t)) + γ(t) E(q((Y(t), Y(t)) − W(t))) = E(W(t)).
Theorem 2. Given that the geometric algorithm converges to a constant random vector W(∞) ≡ (w1(∞), w2(∞)), then W(∞) is a fixed point of geometric ICA in the expectation if and only if the wi(∞) satisfy the GCC.

Proof. Assume W(∞) to be a fixed point of geometric ICA in the expectation. Without loss of generality, let [b1, b2] be the receptive field of w1(∞) such that bi ∈ [0, π). Since W(∞) is a fixed point of geometric ICA in the expectation, we have E(χ_{[b1,b2]}(Y(t)) sgn(Y(t) − w1(∞))) = 0, where χ_{[b1,b2]} denotes the characteristic function of that interval. But this means

∫_{b1}^{w1(∞)} (−1) ρ_Y(θ) dθ + ∫_{w1(∞)}^{b2} ρ_Y(θ) dθ = 0,

and therefore ∫_{b1}^{w1(∞)} ρ_Y(θ) dθ = ∫_{w1(∞)}^{b2} ρ_Y(θ) dθ, so w1(∞) satisfies GCC. The same calculation for w2(∞) shows one direction of the claim. The other direction follows by simply reading the above proof backward, which completes the proof.
As before, let pi := Aei be the transformed unit vectors, and let qi := θ ∘ f ∘ pr(pi) ∈ [0, π) be the corresponding angles for i = 1, 2.

Theorem 3. The transformed angles qi satisfy GCC.
Proof. Because of the symmetry of the claim, it is enough to show that q1 satisfies GCC. Without loss of generality, let 0 < a1 < a2 < π, using the symmetry of ρ. Then, due to construction, qi = ai. Let b1 := (a1 + a2)/2 − π/2 and b2 := b1 + π/2. Then the receptive field of q1 can be written (modulo π) as F(q1) = [b1, b2]. Therefore, we have to show that q1 = a1 is the median of ρ_Y restricted to F(q1), which means

∫_{b1}^{a1} ρ_Y(θ) dθ = ∫_{a1}^{b2} ρ_Y(θ) dθ.

We will reduce this to the orthogonal standard case A = id by transforming the integral as follows:

∫_{b1}^{a1} ρ_Y(θ) dθ = 2 |det A|⁻¹ ∫_{b1}^{a1} dθ ∫₀^∞ r dr ρ(A⁻¹ (r cos θ, r sin θ)ᵀ)
                     = 2 |det A|⁻¹ ∫_K dx dy ρ(A⁻¹ (x, y)ᵀ),

where K := {(x, y) ∈ R² | b1 ≤ arctan(y/x) ≤ a1} denotes the cone of opening angle a1 − b1 starting from angle b1. Using the transformation formula, we continue

∫_{b1}^{a1} ρ_Y(θ) dθ = 2 ∫_{A⁻¹(K)} dx dy ρ(x, y).

Now note that the transformed cone A⁻¹(K) is a cone ending at the x-axis of opening angle π/4, because A is linear; therefore, we are left with the following integral:

∫_{b1}^{a1} ρ_Y(θ) dθ = 2 ∫₀^∞ dx ∫_{−x}^{0} dy ρ(x, y) = 2 ∫₀^∞ dx ∫₀^{x} dy ρ(x, −y)
                     = 2 ∫₀^∞ dx ∫₀^{x} dy ρ(x, y) = ∫_{a1}^{b2} ρ_Y(θ) dθ,
where we have used the same calculation for [a1, b2] as for [b1, a1] at the last step. This completes the proof of the theorem.

Combining both theorems, we have therefore shown:

Theorem 4. Let W be the set of fixed points of geometric ICA in the expectation. Then there exists (ŵ1, ŵ2) ∈ W such that M_{ŵ1,ŵ2}⁻¹ solves the BSS problem. The stable fixed points in W can be found by the geometric ICA algorithm.
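Theorem 3, and with it the GCC, can be checked numerically. In this sketch (the mixing angles and sample size are invented for the demo), we mix two Laplacian sources, restrict the projected angle Y to the receptive field of each mixed angle aᵢ using the borders b1 = (a1 + a2)/2 − π/2 and b2 = b1 + π/2 from the proof, and confirm that aᵢ is the median there:

```python
import numpy as np

rng = np.random.default_rng(0)

a1, a2 = 0.3, 1.2                      # made-up mixing angles, 0 < a1 < a2 < pi
A = np.array([[np.cos(a1), np.cos(a2)],
              [np.sin(a1), np.sin(a2)]])

S = rng.laplace(size=(2, 200000))      # symmetric, unimodal, nongaussian sources
X = A @ S
Y = np.arctan2(X[1], X[0]) % np.pi     # projected angles in [0, pi)

# Receptive-field borders from the proof of theorem 3.
b1 = (a1 + a2) / 2.0 - np.pi / 2.0
b2 = b1 + np.pi / 2.0

# Shift Y into [b1, b1 + pi) so both fields become plain intervals.
Ys = (Y - b1) % np.pi + b1
med1 = np.median(Ys[(Ys >= b1) & (Ys <= b2)])   # field of a1
med2 = np.median(Ys[Ys > b2])                   # field of a2
# GCC: med1 ~ a1 and med2 ~ a2 up to Monte Carlo error
```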
Furthermore, we believe that in the special case of unimodal, symmetric, and nongaussian signals, the set W consists of only two elements: a stable and an unstable fixed point, where the stable fixed point will be found by the algorithm:

Conjecture 1. Assume that the sources Si are unimodal, symmetric, and nongaussian. Then there are only two fixed points of geometric ICA in the expectation.
We can prove this conjecture for the special case of two sources with identical distributions that are nicely super- or subgaussian in the sense that $\rho_Y$ has only four extremal points.

Theorem 5. Assume that the sources $S_i$ are unimodal and symmetric and that $\rho_1 = \rho_2$. Assume further that $\rho_Y|_{[0, \pi)}$ with $A = \mathrm{id}$ has exactly two local maxima and two local minima. Then there exist only two fixed points of geometric ICA in the expectation.
Proof. The same calculation as in the proof of theorem 3 shows that we can assume $A = \mathrm{id}$ without loss of generality. Then by theorem 2 and by $\rho_1 = \rho_2$, the two pairs $\{0, \frac{\pi}{2}\}$ and $\{\frac{\pi}{4}, \frac{3\pi}{4}\}$ satisfy the GCC. We have to show that no other pair fulfills this condition.

First, note that the symmetry of $\rho_1$ and $\rho_2$ shows that $\rho_Y(\frac{n\pi}{2} - \theta) = \rho_Y(\frac{n\pi}{2} + \theta)$ for $n \in \mathbb{Z}$ and $\theta \in \mathbb{R}$, and $\rho_1 = \rho_2$ even induces $\rho_Y(\frac{n\pi}{4} - \theta) = \rho_Y(\frac{n\pi}{4} + \theta)$. By assumption, $\rho_Y$ has only two maxima and two minima in $[0, \pi)$; the above equations then show that those have to lie at $\{0, \frac{\pi}{4}, \frac{\pi}{2}, \frac{3\pi}{4}\}$.

Now we claim that for $b_1 \neq \frac{n\pi}{4}$, $n \in \mathbb{Z}$, the median of $\rho_Y|_{[b_1, b_1 + \frac{\pi}{2}]}$ does not equal $b_1 + \frac{\pi}{4}$. Note that this claim proves theorem 5.

For the proof of the claim, consider the smallest integer $p \in \mathbb{Z}$ such that $c_1 := \frac{p\pi}{4} > b_1$, and set $c_2 := c_1 + \frac{\pi}{4}$. Let $b_2 := b_1 + \frac{\pi}{2}$ and $a := b_1 + \frac{\pi}{4}$. We have to show that the median of $\rho_Y|_{[b_1, b_2]}$ does not equal $a$. A visualization of these relations and the following definitions is given in Figure 4.

Using the symmetry noted above, we have for $d_1 := c_1 + (c_1 - b_1) = 2c_1 - b_1$ that
$$C_1 := \int_{b_1}^{c_1} \rho_Y = \int_{c_1}^{d_1} \rho_Y.$$
Note that $d_1 \neq a$, or else $\frac{p\pi}{2} = 2c_1 = a + b_1 = 2b_1 + \frac{\pi}{4}$, which contradicts the assumption on $b_1$.
Linear Geometric ICA
429
Figure 4: Visualization of the proof of theorem 5. $a$, $b_i$, $c_i$, and $d_i$ are defined in the text.
Without loss of generality, let $a > d_1$; the other case can be proven similarly. Setting $d_2 := c_2 + (c_2 - a) = 2c_2 - a$, symmetry again shows that
$$C_2 := \int_a^{c_2} \rho_Y = \int_{c_2}^{d_2} \rho_Y \qquad \text{and} \qquad C_3 := \int_{d_1}^{a} \rho_Y = \int_{d_2}^{b_2} \rho_Y;$$
the second equation follows because $c_2 - d_1 = b_2 - c_2$ and $c_2 - a = d_2 - c_2$. Now assume, in contradiction to our claim, that $a$ is the median of $\rho_Y|_{[b_1, b_2]}$. Then we have
$$2C_1 + C_3 = \int_{b_1}^{a} \rho_Y = \int_{a}^{b_2} \rho_Y = 2C_2 + C_3,$$
and therefore $C_1 = C_2$. As shown above, $\rho_Y(c_1)$ and $\rho_Y(c_2)$ are the only extremal points of $\rho_Y|_{[b_1, b_2]}$. Without loss of generality, let $\rho_Y(c_1)$ be a maximum; then $\rho_Y(c_2)$ has to be a minimum. But this means that
$$C_1 = \int_{b_1}^{c_1} \rho_Y > (c_1 - b_1)\,\rho_Y(a) \ge \int_a^{c_2} \rho_Y = C_2,$$
which contradicts the above. This completes the proof of the theorem.

Conjecture 1 states that there are only two fixed points of geometric ICA. In fact, we claim that of those two, only one fixed point is stable in the sense that slight perturbations of the initial conditions preserve convergence. Then, depending on the kurtosis of the sources, either the stable (supergaussian case) or the unstable (subgaussian case) fixed point represents the image of the unit vectors. This is stated in the following conjecture.

Conjecture 2. Assume that the sources $S_i$ are unimodal, symmetric, and nongaussian. Then by conjecture 1, there are only two fixed points $(\hat{w}_1, \hat{w}_2)$ and
$(\tilde{w}_1, \tilde{w}_2)$ of geometric ICA in the expectation. We claim:

i. There is only one stable fixed point $(\hat{w}_1, \hat{w}_2)$.

ii. If the sources are supergaussian, $M_{\hat{w}_1, \hat{w}_2}^{-1}$ solves the BSS problem.

iii. If the sources are subgaussian, $M_{\tilde{w}_1, \tilde{w}_2}^{-1}$ solves the BSS problem.
7 Update Rules Without Sign Functions
We have shown that the geometric update step requires the signum function as follows:
$$w_i(t+1) = w_i(t) + \eta(t)\,\mathrm{sgn}\bigl(y(t) - w_i(t)\bigr).$$
Then (normally), the $w_i$ converge to the medians in their receptive fields. Note that the medians do not have to coincide with any maxima of the sensor signal density distribution on the sphere, as shown in Figure 5. Therefore, in general, an algorithm searching for the maxima of the distribution (Prieto, Prieto, Puntonet, Canas, & Martin-Smith, 1999) will not end at the medians, which are the correct images of the unit vectors under the given mixing transformation. Only under special restrictions on the sources (the same supergaussian distribution for each component, as, for example, in speech signals) do the medians correspond to the maxima, so that a maximum-searching algorithm converges to the correct fixed points of geometric ICA.

8 Histogram-Based Algorithm: FastGeo
So far in geometric ICA, mostly on-line algorithms with competitive learning rules as in section 4 have been used (Puntonet & Prieto, 1995). As shown above, they search for points satisfying the GCC. In the following, we establish a new geometric algorithm based on the GCC alone. For this, let $n = 2$ and
$$A := \begin{pmatrix} \cos(a_1) & \cos(a_2) \\ \sin(a_1) & \sin(a_2) \end{pmatrix}. \tag{8.1}$$
Theorem 3 shows that the vectors $(\cos(a_i), \sin(a_i))^\top$ satisfy the GCC. Therefore, the vectors $w_i$ will converge to the medians in their receptive fields. This enables us to compute these positions directly using a search on the histogram of $Y$ (see Figure 6), which reduces the computation time by a factor of about 100 or more. In the FastGeo algorithm, we scan through the different receptive fields and test the GCC. In practice, this means discretizing the distribution $f_Y$ of $Y$ using a given bin size $b > 0$ and then testing the $\pi/b$ different receptive fields. The algorithm will be formulated more precisely in the following.
Figure 5: Projected density distribution $\rho_Y$ of a mixture of two Laplacian signals with different variances, with the mixing matrix mapping the unit vectors $e_i$ to $(\cos a_i, \sin a_i)^\top$ for $i = 1, 2$. Dark line: theoretical density function; gray line: histogram of a mixture of 10,000 samples.
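The median-seeking behavior of the sign-based update rule from section 7 can be verified directly in one dimension. The following minimal sketch is an illustration only; the scalar setting and the Robbins-Monro step-size schedule are assumptions on our part, not taken from the text:

```python
import numpy as np

# Sign-based geometric update for scalar data:
#   w(t+1) = w(t) + eta(t) * sgn(y(t) - w(t)),
# a stochastic approximation whose equilibrium is the median of y.
rng = np.random.default_rng(1)
samples = rng.laplace(loc=2.0, scale=1.0, size=200_000)  # median = 2.0

w = 0.0
for t, y in enumerate(samples):
    eta = 2.0 / (100.0 + t)      # decreasing learning rate (assumed schedule)
    w += eta * np.sign(y - w)
# w now sits near 2.0, the median of the input distribution
```

The decreasing step size is what makes the estimate settle; with a constant $\eta$, $w$ would keep fluctuating around the median.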
For simplicity, let us assume that the cumulative distribution function $F_Y$ of $Y$ is invertible; this means that $F_Y$ is nowhere constant. Define a function
$$m : [0, \pi) \longrightarrow \mathbb{R}, \qquad \theta \longmapsto \frac{l_1(\theta) + l_2(\theta)}{2} - \left(\theta + \frac{\pi}{2}\right), \tag{8.2}$$
where
$$l_i(\theta) := F_Y^{-1}\!\left(\frac{F_Y\bigl(\theta + i\frac{\pi}{2}\bigr) + F_Y\bigl(\theta + (i-1)\frac{\pi}{2}\bigr)}{2}\right) \tag{8.3}$$
is the median of $Y|_{[\theta + (i-1)\frac{\pi}{2},\, \theta + i\frac{\pi}{2}]}$ for $i = 1, 2$.

Lemma 2. Let $\theta$ be a zero of $m$ in $[0, \pi)$. Then the $l_i(\theta)$ satisfy the GCC.
Figure 6: Probability density function $f_Y$ of $Y$ from Figure 1 with the mixing angles $a_i$ and their receptive fields $F_i$ for $i = 1, 2$.

Proof. By definition, $\left[\frac{l_1(\theta) + l_2(\theta)}{2} - \frac{\pi}{2},\ \frac{l_1(\theta) + l_2(\theta)}{2}\right]$ is the receptive field of $l_1(\theta)$. Since $m(\theta) = 0$, the starting point of this interval is $\theta$, because $\theta = \frac{l_1(\theta) + l_2(\theta)}{2} - \frac{\pi}{2}$. Hence, the receptive field of $l_1(\theta)$ is $[\theta, \theta + \frac{\pi}{2}]$, and by construction $l_1(\theta)$ is the median of $Y$ restricted to this interval. The claim for $l_2(\theta)$ follows analogously.
Algorithm 1 (FastGeo). Find the zeros of $m$.
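A compact sketch of this procedure under the definitions above; the function names, grid resolution, and the sample-median estimator are our choices, not the paper's. It scans candidate angles $\theta_0$, locates sign changes of $m$, and keeps the zero whose medians lie in the highest-density bins (appropriate for supergaussian sources):

```python
import numpy as np

def arc_median(theta, lo):
    """Median of the projected angles theta (mod pi) inside the arc [lo, lo + pi/2)."""
    t = np.mod(theta - lo, np.pi)
    sel = np.sort(t[t < np.pi / 2])
    return lo + sel[len(sel) // 2]

def m(theta0, theta):
    """Sample estimate of the function m from equation 8.2."""
    l1 = arc_median(theta, theta0)
    l2 = arc_median(theta, theta0 + np.pi / 2)
    return 0.5 * (l1 + l2) - (theta0 + np.pi / 2)

def fastgeo(Y, n_grid=360, n_bins=180):
    """Recover the two mixing angles of a 2-D mixture Y (shape 2 x n_samples)."""
    theta = np.mod(np.arctan2(Y[1], Y[0]), np.pi)       # project onto [0, pi)
    hist, _ = np.histogram(theta, bins=n_bins, range=(0.0, np.pi))
    dens = lambda x: hist[int(np.mod(x, np.pi) / np.pi * n_bins) % n_bins]
    grid = np.linspace(0.0, np.pi / 2, n_grid, endpoint=False)  # m has period pi/2
    vals = np.array([m(t0, theta) for t0 in grid])
    best, best_score = None, -1.0
    for i in range(n_grid):
        if vals[i] * vals[(i + 1) % n_grid] < 0:        # sign change: zero of m
            t0 = grid[i]
            l1, l2 = arc_median(theta, t0), arc_median(theta, t0 + np.pi / 2)
            score = dens(l1) + dens(l2)                 # pick the high-density zero
            if score > best_score:
                best, best_score = (np.mod(l1, np.pi), np.mod(l2, np.pi)), score
    return best
```

For two supergaussian (e.g., Laplacian) sources mixed with column angles $a_1, a_2$, the returned pair approaches $\{a_1, a_2\}$ modulo $\pi$ as the sample count grows.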
$m$ always has at least two zeros, which represent the stable and the unstable fixed point of the ordinary geometric algorithm. In practice, we extract the fixed point that gives the proper demixing matrix $A^{-1}$ by picking $\theta_0$ such that $f_Y(l_1(\theta_0)) + f_Y(l_2(\theta_0))$ is maximal. For unimodal and supergaussian source distributions, conjecture 2 claims that this results in a stable fixed point. For subgaussian sources, choosing $\theta_0$ with $f_Y(l_1(\theta_0)) + f_Y(l_2(\theta_0))$ minimal induces the corresponding demixing matrix. Hence, one advantage of this histogram-based algorithm is that, without any modifications, we can solve the ICA problem for subgaussian signals too. Furthermore, the
sophisticated parameter choice of the ordinary geometric algorithm is no longer necessary; only one parameter, the bin size, has to be chosen. In practice, one sometimes notices that, due to the discretization of the distribution, the approximated distribution has a rather noisy shape on small scales. This can split zeros of $m$ into multiple zeros close together. Therefore, a useful improvement of convergence can be achieved by smoothing the distribution with a kernel function of sufficiently small half-width. This smoothing should preferably be performed during the discretization of the original distribution.

9 Accuracy
In this section, we consider the dependence of the FastGeo ICA algorithm on the number of samples after the bin size $b$ has been fixed. As seen in the previous section, the accuracy of the histogram-based algorithm then depends only on the distribution of the samples $X$ (resp. $Y$); that is, we can estimate the error made by approximating the mixing matrix $A$ from a finite number of samples. In the following, we present some results of test runs made with this algorithm.

When choosing two arbitrary angles $a_i \in [0, \pi)$, $i = 1, 2$, for the mixing matrix $A$, we define $a$ as the distance between these two angles modulo $\frac{\pi}{2}$. This gives an angle in the range between 0 and $\frac{\pi}{2}$. In Figure 7, the relative accuracy of the recovered angles, $\frac{\Delta a}{a} = \frac{|a_i - a_i^{\mathrm{recovered}}|}{a}$, is given as a function of the angle $a$ for a fixed number of samples. The resulting graph is reasonably constant over a wide range of $a$ ($a > 10°$), demonstrating that the estimate of the $a_i$ is robust, and it gets only slightly worse for small values of $a$. Note that the distortions around the origin are due to the finite bin size; increasing the number of bins increases the accuracy for small $a$, but also the computational effort.

To investigate the relation between the error $\Delta a$ and the number of samples, we examine for different $a$ the dependence of the standard deviation of $\Delta a$ on the number of samples (see Table 1), where we chose the following mixing matrix:
$$A := \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix}. \tag{9.1}$$
The table entry gives the standard deviation of the nondiagonal terms after normalizing each column of the mixing matrix, so that the diagonal elements are unity. For comparison, we also calculated the performance index E1 (Amari, Cichocki, & Yang, 1996).
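For reference, a common form of the Amari cross-talking index can be computed as follows. The exact normalization of $E_1$ used in the paper may differ, so this version is an assumption rather than the paper's definition:

```python
import numpy as np

def amari_index(P):
    """Cross-talking error of the combined matrix P = W @ A (demixing times
    mixing). It is zero exactly when P is a scaled permutation matrix,
    that is, when the sources are perfectly separated."""
    P = np.abs(np.asarray(P, dtype=float))
    row_err = (P / P.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    col_err = (P / P.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return float(row_err.sum() + col_err.sum())

print(amari_index(np.eye(2)))                   # 0.0: perfect separation
print(amari_index([[1.0, 0.5], [0.5, 1.0]]))    # 2.0: strong cross talk
```

The index is invariant under permutation and rescaling of the recovered channels, which is exactly the ambiguity inherent to BSS.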
Figure 7: Mixture of 1,000 samples of two Laplacian source signals with identical variances. Plotted are the mean, standard deviation, and 95% confidence interval of $\Delta a / a$ versus $a$, calculated from 100 runs for each angle $a$.

Table 1: Standard Deviations of the Nondiagonal Terms and the Performance Index E1 for Different Numbers of Samples.

Number of Samples    Standard Deviation    Index E1
1000                 0.033                 0.18
10,000               0.013                 0.07
100,000              0.007                 0.038
10 Examples
In this section, we compare geometric ICA algorithms with other ICA algorithms: the Extended Infomax algorithm (Lee, Girolami, & Sejnowski, 1999), which is based on the classical Infomax algorithm (Bell & Sejnowski, 1995), and the FastICA algorithm (Hyvärinen & Oja, 1997). As geometric algorithms, we use the ordinary geometric algorithm from section 4 (Puntonet & Prieto, 1995) and the FastGeo algorithm from above. Calculations were made on a PIII-850 PC using Matlab 6.0. In our first example, we consider a mixture of two Laplacian signals. The results of the different algorithms are shown in Table 2. For each algorithm,
Table 2: Comparison of Time per Run and Cross-Talking Error of ICA Algorithms for a Random Mixture of Two Laplacian Signals.

Algorithm                   Elapsed Time [s]    Index E1
Extended Infomax            11.1                0.072 ± 0.002
FastICA (pow3 = default)    0.068               0.076 ± 0.004
FastICA (tanh)              0.11                0.052 ± 0.001
FastICA (gauss)             0.12                0.048 ± 0.001
Ordinary Geometric          > 60                0.18 ± 0.10
FastGeo                     0.84                0.110 ± 0.071

Note: Means and standard deviations were taken over 1000 runs (100 runs for Extended Infomax and Ordinary Geometric) with 10,000 samples and uniformly distributed mixing matrix elements.
Table 3: Comparison of Time per Run and Cross-Talking Error of ICA Algorithms for a Random Mixture of Two Sound Signals with 22,000 Samples.

Algorithm                   Elapsed Time [s]    Index E1
Extended Infomax            41.2                0.058 ± 0.002
FastICA (pow3 = default)    0.14                0.050 ± 0.005
FastICA (tanh)              0.24                0.022 ± 0.001
FastICA (tanh)              0.26                0.019 ± 0.001
Ordinary Geometric          > 60                0.49 ± 0.29
FastGeo                     0.89                0.136 ± 0.087

Note: Means and standard deviations were taken over 1000 runs (100 runs for Extended Infomax and Ordinary Geometric) with uniformly distributed mixing matrix elements.
we measure the mean elapsed CPU time per run and the mean cross-talking error E1 with its standard deviation. The Extended Infomax and FastICA algorithms perform best in terms of accuracy, and in terms of computational speed, FastICA lives up to its name, followed by FastGeo, then Extended Infomax, and then the ordinary geometric algorithm, each separated by roughly an order of magnitude. The last algorithm lacks accuracy and also shows some convergence problems, whereas FastGeo lies between the geometric algorithm and FastICA and Infomax regarding accuracy.

The second example deals with real-world data: two audio signals (one speech and one music signal; see Table 3). The results are similar to the Laplacian toy example. FastICA outperforms the other algorithms in terms of speed. The accuracy of Extended Infomax and FastICA is comparable, and FastGeo is slightly worse (by a factor of about 4) but faster than Extended Infomax. The ordinary geometric algorithm is both slower and less accurate, mainly because of convergence problems.
11 Higher Dimensions
So far we have explicitly considered two-dimensional data sets only. In real-world problems, however, the sensor signals are usually high-dimensional data sets, for example, EEG data with 21 dimensions. Therefore, it would be desirable to generalize geometric algorithms to higher dimensions. The ordinary geometric algorithm can easily be translated to higher-dimensional cases, but one faces serious problems in the explicit calculations. In order to approximate higher-dimensional pdfs, it becomes necessary to have an exponentially growing number of samples available, as will be shown.

Assume a uniformly distributed random vector, and consider the fraction of samples in a ball $B^{d-1}$ of radius $q$ on the unit sphere $S^{d-1} \subset \mathbb{R}^d$. Let $B^d := \{x \in \mathbb{R}^d \mid |x| \le 1\}$ and $S^{d-1} := \{x \in \mathbb{R}^d \mid |x| = 1\}$. The volume of $B^d$ can be calculated as
$$\mathrm{vol}(B^d) = \frac{\pi^{d/2}}{(\frac{d}{2})!} =: c_d.$$
It follows for $d > 3$ that
$$\frac{\text{number of samples in the ball}}{n} = \frac{\mathrm{vol}(B^{d-1})\, q^{d-1}}{\mathrm{vol}(S^{d-1})} = \frac{c_{d-1}\, q^{d-1}}{d\, c_d} \le q^{d-1}.$$
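This scaling can be made concrete with a few lines of code; the target count of 1,000 samples per patch and the radius $q = 0.1$ below are illustrative choices, not values from the text:

```python
import math

def ball_fraction(q, d):
    """Fraction c_{d-1} q^(d-1) / (d c_d) of uniformly distributed samples
    on the sphere S^(d-1) that land in a surface patch of radius q."""
    c = lambda n: math.pi ** (n / 2) / math.gamma(n / 2 + 1)  # volume of unit ball
    return c(d - 1) * q ** (d - 1) / (d * c(d))

# Samples needed so that roughly 1,000 points fall into one patch of radius 0.1:
for d in (2, 3, 4, 5):
    print(d, round(1000 / ball_fraction(0.1, d)))
```

The required sample count grows by roughly a factor of $1/q = 10$ with every added dimension, which is the exponential blow-up referred to above.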
Obviously, the fraction of samples in the ball decreases like $q^{d-1}$ for $q < 1$, which is the interesting case. To retain the same accuracy when estimating the medians, this decrease must be compensated by an exponential growth in the number of samples. For three dimensions, we found a good approximation of the demixing matrix using 100,000 samples. In four dimensions, the mixing matrix could not be reconstructed correctly, even with larger numbers of samples.

A different approach for higher dimensions has been taken by Bauer et al. (2000), where $A$ has been calculated using the $\frac{d(d-1)}{2}$ projections of $X$ from $\mathbb{R}^d$ onto $\mathbb{R}^2$ along the different coordinate axes, reconstructing the multidimensional matrix from the two-dimensional solutions. However, this approach works satisfactorily only if the mixing matrix $A$ is close to the unit matrix up to permutation and scaling.

12 Conclusion
The geometric ICA algorithm has been studied in a concise theoretical framework. The fixed points of the geometric ICA learning algorithm have been examined in detail. We have introduced the GCC, which has to be fulfilled by the fixed points of the learning algorithm. We further showed that it is also fulfilled by the mixed unit vectors spanning the sensor signal space. Hence, geometric ICA can solve the BSS problem. We then gave two conjectures for the unimodal case, where the fixed-point property is expected to be very rigid.
We then presented a new algorithm for linear geometric ICA (FastGeo) based on histograms that is both robust and computationally much more efficient than the ordinary geometric ICA algorithm. The accuracy of the algorithm in estimating the relevant medians of the underlying data distributions, when varying both the mixing matrices and the sample numbers, has been explored quantitatively, showing rather robust performance. When comparing FastGeo with classical ICA algorithms and the ordinary geometric one, we noticed that FastGeo performs only slightly worse than the classical ones in terms of accuracy and better than the ordinary geometric one. In terms of speed, FastGeo falls between FastICA and Extended Infomax and is much faster than the first geometric approaches, which also suffer from severe convergence problems. Furthermore, the fact that geometric algorithms, and especially FastGeo, are very easy to implement makes FastGeo a good choice even in comparison with the classical ICA algorithms in practical two-dimensional applications.

We also considered the problem of high-dimensional data sets with respect to the geometric algorithms and discussed how projections to low-dimensional subspaces could solve this problem for a special class of mixing matrices.

Simulations with nonsymmetric and nonunimodal distributions have shown promising results so far, indicating that the new algorithm will perform well for almost any distribution. This is the subject of ongoing research in our group. In future work, besides treating nonsymmetric sources S, the two conjectures will have to be proven in full, and the Kohonen convergence proof will have to be translated into the above model. In addition, the histogram-based algorithm could be extended to the nonlinear case, similar to Puntonet et al. (1999), using multiple centered spheres for projection, on whose surfaces the projected data histograms could then be evaluated.
Finally, we are experimenting with the FastGeo algorithm for the overcomplete case, where we are currently able to detect three or more sources in only two mixtures. This can be useful for higher-dimensional cases as in section 11.
Acknowledgments
This research was supported by the Deutsche Forschungsgemeinschaft (DFG) in the Graduiertenkolleg “Nonlinearity and Nonequilibrium in Condensed Matter.” We thank Christoph Bauer and Tobias Westenhuber for suggestions and comments on the ordinary geometric algorithm and Michaela Lautenschlager and Stefanie Ulsamer for their helpful comments in trying to prove convergence.
References

Amari, S., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.

Bauer, C., Puntonet, C., Rodriguez-Alvarez, M., & Lang, E. (2000). Separation of EEG signals with geometric procedures. In C. Fyfe (Ed.), Engineering of intelligent systems (Proc. EIS'2000) (pp. 104–108). Millet, Alberta, Canada: ICSC Press.

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.

Benaim, M., Fort, J.-C., & Pagès, G. (1998). Convergence of the one-dimensional Kohonen algorithm. Adv. Appl. Prob., 30, 850–869.

Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287–314.

Cottrell, M., & Fort, J.-C. (1987). Etude d'un processus d'auto-organisation. Annales de l'Institut Henri Poincaré, 23(1), 1–20. (In French)

Cottrell, M., Fort, J.-C., & Pagès, G. (1994). Two or three things that we know about the Kohonen algorithm. In M. Verleysen (Ed.), Proc. ESANN'94, European Symp. on Artificial Neural Networks (pp. 235–244). Brussels: D Facto Conference Services.

Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.

Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.

Jutten, C., Hérault, J., Comon, P., & Sorouchiary, E. (1991). Blind separation of sources, parts I, II, and III. Signal Processing, 24, 1–29.

Lee, T.-W., Girolami, M., & Sejnowski, T. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and supergaussian sources. Neural Computation, 11, 417–441.

Linsker, R. (1989). An application of the principle of maximum information preservation to linear systems. In D. Touretzky (Ed.), Advances in neural information processing systems, 1. San Mateo, CA: Morgan Kaufmann.

Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4, 691–702.

Prieto, A., Prieto, B., Puntonet, C., Canas, A., & Martin-Smith, P. (1999). Geometric separation of linear mixtures of sources: Application to speech signals. In J. F. Cardoso, Ch. Jutten, & Ph. Loubaton (Eds.), Independent component analysis and signal separation (Proc. ICA'99) (pp. 295–300). Gières, France: Imp. de Ecureuils.

Puntonet, C., Alvarez, M., Prieto, A., & Prieto, B. (1999). Separation of speech signals for nonlinear mixtures. Lecture Notes in Computer Science, 1607, 665–673.

Puntonet, C., & Prieto, A. (1995). An adaptive geometrical procedure for blind separation of sources. Neural Processing Letters, 2.
Ritter, H., & Schulten, K. (1988). Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability, and dimension selection. Biological Cybernetics, 60, 59–71.

Received December 14, 2001; accepted July 3, 2002.
LETTER
Communicated by Geoffrey Hinton
Equivalence of Backpropagation and Contrastive Hebbian Learning in a Layered Network

Xiaohui Xie
[email protected]
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

H. Sebastian Seung
[email protected]
Howard Hughes Medical Institute and Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

Backpropagation and contrastive Hebbian learning are two methods of training networks with hidden neurons. Backpropagation computes an error signal for the output neurons and spreads it over the hidden neurons. Contrastive Hebbian learning involves clamping the output neurons at desired values and letting the effect spread through feedback connections over the entire network. To investigate the relationship between these two forms of learning, we consider a special case in which they are identical: a multilayer perceptron with linear output units, to which weak feedback connections have been added. In this case, the change in network state caused by clamping the output neurons turns out to be the same as the error signal spread by backpropagation, except for a scalar prefactor. This suggests that the functionality of backpropagation can be realized alternatively by a Hebbian-type learning algorithm, which is suitable for implementation in biological networks.

1 Introduction
Neural Computation 15, 441–454 (2003)
© 2002 Massachusetts Institute of Technology

Backpropagation and contrastive Hebbian learning (CHL) are two supervised learning algorithms for training networks with hidden neurons. They are of interest because they are generally applicable to wide classes of network architectures. In backpropagation (Rumelhart, Hinton, & Williams, 1986b, 1986a), an error signal for the output neurons is computed and propagated back into the hidden neurons through a separate teacher network. Synaptic weights are updated based on the product of the error signal and network activities. CHL updates the synaptic weights based on the steady states of neurons in two different phases: one with the output neurons clamped to the desired values and the other with the output neurons free (Movellan, 1990; Baldi & Pineda, 1991). Clamping the output neurons causes the hidden neurons to change their activities, and this change constitutes the basis for the CHL update rule.

CHL was originally formulated for the Boltzmann machine (Ackley, Hinton, & Sejnowski, 1985) and was extended later to deterministic networks (Peterson & Anderson, 1987; Hinton, 1989), in which case it can be interpreted as a mean-field approximation of the Boltzmann machine learning algorithm. However, this interpretation is not necessary, and CHL can be formulated purely for deterministic networks (Movellan, 1990; Baldi & Pineda, 1991).

Compared to backpropagation, CHL appears to be quite different. Backpropagation is typically implemented in feedforward networks, whereas CHL is implemented in networks with feedback. Backpropagation is an algorithm driven by error, whereas CHL is a Hebbian-type algorithm, with update rules based on the correlation of pre- and postsynaptic activities. CHL has been shown to be equivalent to backpropagation for networks with only a single layer (Movellan, 1990), and there has been some other work to relate CHL to the general framework of backpropagation (Hopfield, 1987; O'Reilly, 1996; Hinton & McClelland, 1988). However, a direct link between them for networks with hidden neurons has been lacking.

To investigate the relationship between these two algorithms, we consider a special network for which CHL and backpropagation are equivalent. This is a multilayer perceptron to which weak feedback connections have been added and whose output neurons are linear. The equivalence holds because in CHL, clamping the output neurons at their desired values causes the hidden neurons to change their activities, and this change turns out to be equal to the error signal spread by backpropagation, except for a scalar factor.

2 The Learning Algorithms
In this section, we describe the backpropagation and CHL algorithms. Backpropagation is in the standard form, implemented in a multilayer perceptron (Rumelhart et al., 1986b). CHL is formulated in a layered network with feedback connections between neighboring layers of neurons. It is an extension of the typical CHL algorithm formulated for recurrent symmetric networks (Movellan, 1990).

2.1 Backpropagation. Consider a multilayer perceptron with $L + 1$ layers of neurons and $L$ layers of synaptic weights (see Figure 1A). The activities of the kth layer of neurons are denoted by the vector $x_k$, their biases by the vector $b_k$, and the synaptic connections from layer $k - 1$ to layer $k$ by the matrix $W_k$. All neurons in the kth layer are assumed to have the same transfer function $f_k$, but this transfer function may vary from layer to layer. In particular, we will be interested in the case where $f_L$ is linear, though the other transfer functions may be nonlinear. In the basic definition, $f_k$ acts on a scalar and returns a scalar. However, we will generally use $f_k$ to act on a
Figure 1: Diagram of the network structures of (A) the multilayer perceptron and (B) the layered network with feedback connections. Layer 0 is the input, layer L is the output, and the others are hidden layers. The forward connections are the same for both networks. In B, there are feedback connections between neighboring layers of neurons.
vector, in which case it returns a vector, operating component by component. $f_k'$ is the derivative of $f_k$ with respect to its argument; when $f_k'$ acts on a vector, it likewise returns a vector, applying $f_k'$ in each component. We assume that $f_k$ is monotonically increasing. Backpropagation learning is implemented by repeating the following steps for each example in a training set of input-output pairs:

1. In the forward pass,
$$x_k = f_k(W_k x_{k-1} + b_k) \tag{2.1}$$
is evaluated for $k = 1$ to $L$, thereby mapping the input $x_0$ to the output $x_L$.
2. The desired output $d$ of the network, provided by some teacher, is compared with the actual output $x_L$ to compute an error signal,
$$y_L = D_L (d - x_L). \tag{2.2}$$
The matrix $D_k \equiv \mathrm{diag}\{f_k'(W_k x_{k-1} + b_k)\}$ is defined by placing the components of the vector $f_k'(W_k x_{k-1} + b_k)$ in the diagonal entries of a matrix.

3. The error signal is propagated backward from the output layer by evaluating
$$y_{k-1} = D_{k-1} W_k^\top y_k \tag{2.3}$$
for $k = L$ down to 2.

4. The weight update
$$\Delta W_k = \eta\, y_k x_{k-1}^\top \tag{2.4}$$
is made for $k = 1$ to $L$, where $\eta > 0$ is a parameter controlling the learning rate.

2.2 Contrastive Hebbian Learning. To formulate CHL, we consider a modified network in which, in addition to the feedforward connections from layer $k - 1$ to layer $k$, there are also feedback connections between neighboring layers (see Figure 1B). The feedback connections are assumed to be symmetric with the feedforward connections, except that they are multiplied by a positive factor $\gamma$. In other words, the matrix $\gamma W_k^\top$ contains the feedback connections from layer $k$ back to layer $k - 1$. CHL is implemented by repeating the following steps for each example of the training set:
1. The input layer $x_0$ is held fixed, and the dynamical equations
$$\frac{dx_k}{dt} + x_k = f_k\bigl(W_k x_{k-1} + \gamma W_{k+1}^\top x_{k+1} + b_k\bigr) \tag{2.5}$$
for $k = 1$ to $L$ are run until convergence to a fixed point. The case $k = L$ is defined by setting $x_{L+1} = 0$ and $W_{L+1} = 0$. Convergence to a fixed point is guaranteed under rather general conditions, to be shown later. This is called the free state of the network and is denoted by $\check{x}_k$ for the kth layer neurons.

2. The anti-Hebbian update,
$$\Delta W_k = -\eta\, \gamma^{k-L}\, \check{x}_k \check{x}_{k-1}^\top, \tag{2.6}$$
is made for $k = 1, \ldots, L$.
3. The output layer $x_L$ is clamped at the desired value $d$, and the dynamical equation 2.5 for $k = 1$ to $L - 1$ is run until convergence to a fixed point. This is called the clamped state and is denoted by $\hat{x}_k$ for the kth layer neurons.

4. The Hebbian update,
$$\Delta W_k = \eta\, \gamma^{k-L}\, \hat{x}_k \hat{x}_{k-1}^\top, \tag{2.7}$$
is made for $k = 1, \ldots, L$.

Alternatively, the weight updates could be combined and made after both clamped and free states are computed:
$$\Delta W_k = \eta\, \gamma^{k-L} \bigl(\hat{x}_k \hat{x}_{k-1}^\top - \check{x}_k \check{x}_{k-1}^\top\bigr). \tag{2.8}$$
This form is the one used in our analysis. This version of CHL should look familiar to anyone who knows the conventional version, implemented in symmetric networks. It will be derived in section 4, but first we prove its equivalence to the backpropagation algorithm.

3 Equivalence in the Limit of Weak Feedback
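Before turning to the formal proof, the claimed equivalence can be checked numerically. The sketch below is illustrative rather than the authors' implementation: the layer sizes, the tanh hidden nonlinearity, and the relaxation loop are assumptions. It compares the backpropagation update (equations 2.1-2.4) with the combined CHL update (equation 2.8) for a network with one hidden layer ($L = 2$), linear output, and weak feedback $\gamma$:

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, n2 = 4, 5, 3                     # input, hidden, and linear output sizes
W1 = rng.normal(scale=0.5, size=(n1, n0)); b1 = rng.normal(size=n1)
W2 = rng.normal(scale=0.5, size=(n2, n1)); b2 = rng.normal(size=n2)
x0, d = rng.normal(size=n0), rng.normal(size=n2)
gamma, eta = 1e-3, 0.1                   # weak feedback, learning rate

# Backpropagation (equations 2.1-2.4) with tanh hidden units, linear output.
x1 = np.tanh(W1 @ x0 + b1)
x2 = W2 @ x1 + b2
y2 = d - x2                              # D_L is the identity for a linear output
y1 = (1.0 - x1**2) * (W2.T @ y2)         # y_1 = D_1 W_2^T y_2
dW2_bp, dW1_bp = eta * np.outer(y2, x1), eta * np.outer(y1, x0)

# Contrastive Hebbian learning (equations 2.5-2.8).
x1f, x2f = x1.copy(), x2.copy()          # free phase: relax to a fixed point
for _ in range(200):
    x1f = np.tanh(W1 @ x0 + b1 + gamma * (W2.T @ x2f))
    x2f = W2 @ x1f + b2
x1c = np.tanh(W1 @ x0 + b1 + gamma * (W2.T @ d))   # clamped phase: x_2 = d
dW2_chl = eta * (np.outer(d, x1c) - np.outer(x2f, x1f))          # gamma^(2-L) = 1
dW1_chl = eta / gamma * (np.outer(x1c, x0) - np.outer(x1f, x0))  # gamma^(1-L)

# As gamma -> 0, the CHL updates approach the backpropagation updates.
```

Shrinking $\gamma$ further shrinks the residual difference between the two updates proportionally, in line with the $O(\gamma)$ correction of equation 3.4.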
Next, we prove that CHL in equation 2.8 is equivalent to the backpropagation algorithm in equation 2.4, provided that the feedback is sufficiently weak and the output neurons are linear. In notation, $x_k$, $\hat{x}_k$, and $\check{x}_k$ represent the kth layer activities of the feedforward network, the clamped state, and the free state, respectively. We consider the case of weak feedback connections, $\gamma \ll 1$, and use $\approx$ to mean that terms of higher order in $\gamma$ have been neglected and $\sim$ to denote the order. The proof consists of the following four steps:

1. Show that the difference between the feedforward and free states is of order $\gamma$ in each component,
$$\delta\check{x}_k \equiv \check{x}_k - x_k \sim \gamma, \tag{3.1}$$
for all $k = 1, \ldots, L$.

2. Show that in the limit of weak feedback, the difference between the clamped and free states satisfies the iterative relationship
$$\delta x_k \equiv \hat{x}_k - \check{x}_k = \gamma D_k W_{k+1}^\top \delta x_{k+1} + O(\gamma^{L-k+1}), \tag{3.2}$$
for $k = 1, \ldots, L - 1$, and $\delta x_L = d - \check{x}_L$.
3. Show that if the output neurons are linear, $\delta x_k$ is related to the error signal in backpropagation through
$$\delta x_k = \gamma^{L-k}\, y_k + O(\gamma^{L-k+1}). \tag{3.3}$$

4. Show that the CHL update can be approximated by
$$\Delta W_k = \eta\, y_k x_{k-1}^\top + O(\gamma). \tag{3.4}$$
In the CHL algorithm, clamping the output layer causes changes in the output neurons to spread backward to the hidden layers because of the feedback connections. Hence, the clamped state differs from the free state over the entire network, including the hidden neurons. Equation 3.2 states that $\delta x_k$ decays exponentially with distance from the output layer of the network. This is because the feedback is weak, so that $\delta x_k$ is reduced from $\delta x_{k+1}$ by a factor of $\gamma$. Remarkably, as indicated in equation 3.3, the difference between the clamped and free states is equivalent to the error signal $y_k$ computed in the backward pass of backpropagation, except for a factor of $\gamma^{L-k}$, when the output neurons are linear. Moreover, this factor annihilates the factor of $\gamma^{k-L}$ in the CHL rule of equation 2.8, resulting in the update rule equation 3.4.

3.1 Proof. To prove the first step, we start from the steady-state equation of the free phase,
xL k D fk ( W kxL k¡1 C bk C c W TkC 1 xL kC 1 ) ,
(3.5)
for k D 1, . . . , L ¡ 1. Subtracting this equation from equation 2.1 and performing Taylor expansion, we derive d xL k ´ xL k ¡ x k
D fk ( W k xL k¡1 C bk C c
(3.6) W TkC 1 xL kC 1 )
¡ fk ( W k xk¡1 C bk )
(3.7)
D D k W kd xL k¡1 C c D k W kTC 1 xL k C 1 C O (kW kd xL k¡1 C c W kTC 1 xL k C 1 k2 ) (3.8)
for all hidden layers, and d xL L D DL WLd xL L¡1 C O (kWLd xL L¡1 k2 ) for the output layer. In equation 3.7, the expansion is done around W kx k¡1 C bk. Since the zeroth layer is xed with the input, d xL 0 D 0, under the above iterative relationships, we must have d xL k » c in each component, that is, d xL k is in the order of c , for all k D 1, . . . , L. To prove equation 3.2, we compare the xed-point equations of the clamped and free states, f ¡1 ( xL k ) D W kxL k¡1 C bk C c W TkC 1 xL kC 1
(3.9)
f ¡1 ( xO k ) D W kxO k¡1 C bk C c W TkC 1 xO kC 1 ,
(3.10)
Backpropagation and Contrastive Hebbian Learning
447
for $k = 1, \ldots, L-1$, where $f^{-1}$ is the inverse function of $f$ in each component. Subtract them, and perform a Taylor expansion around $\check{x}_k$. Recall the definition $\delta x_k \equiv \hat{x}_k - \check{x}_k$. We have
$$W_k \delta x_{k-1} + \gamma W_{k+1}^T \delta x_{k+1} = f^{-1}(\hat{x}_k) - f^{-1}(\check{x}_k) \quad (3.11)$$
$$= J_k \delta x_k + O(\|\delta x_k\|^2), \quad (3.12)$$
where the matrix $J_k \equiv \mathrm{diag}\{\partial f^{-1}(\check{x}_k)/\partial \check{x}_k\}$. Since $\check{x}_k - x_k \sim \gamma$, to leading order in $\gamma$ the matrix $J_k$ can be approximated by $J_k \approx \mathrm{diag}\{\partial f^{-1}(x_k)/\partial x_k\} = D_k^{-1}$, and $J_k \delta x_k = D_k^{-1} \delta x_k + O(\gamma \|\delta x_k\|)$. Substituting this back into equation 3.12, we get
$$\delta x_k = D_k(W_k \delta x_{k-1} + \gamma W_{k+1}^T \delta x_{k+1}) + O(\gamma \|\delta x_k\|) + O(\|\delta x_k\|^2). \quad (3.13)$$
Let us start from $k = 1$. Since the input is fixed ($\delta x_0 = 0$), we have $\delta x_1 = \gamma D_1 W_2^T \delta x_2 + O(\gamma \|\delta x_1\|)$, and therefore $\delta x_1 = \gamma D_1 W_2^T \delta x_2 + O(\gamma^2 \|\delta x_2\|)$. Next, we check $k = 2$, in which case $\delta x_2 = D_2 W_2 \delta x_1 + \gamma D_2 W_3^T \delta x_3 + O(\gamma \|\delta x_2\|)$. Substituting $\delta x_1$, we have $\delta x_2 = \gamma D_2 W_3^T \delta x_3 + O(\gamma \|\delta x_2\|)$, and therefore $\delta x_2 = \gamma D_2 W_3^T \delta x_3 + O(\gamma^2 \|\delta x_3\|)$. Following this iteratively, we have
$$\delta x_k = \gamma D_k W_{k+1}^T \delta x_{k+1} + O(\gamma^2 \|\delta x_{k+1}\|), \quad (3.14)$$
for all $k = 1, \ldots, L-1$. Notice that $\delta x_L = d - \check{x}_L = d - x_L + O(\gamma)$ is of order 1. Hence, $\delta x_{L-1} = \gamma D_{L-1} W_L^T \delta x_L + O(\gamma^2)$, and iteratively, we have
$$\delta x_k = \gamma D_k W_{k+1}^T \delta x_{k+1} + O(\gamma^{L-k+1}), \quad (3.15)$$
for $k = 1, \ldots, L-1$. Therefore, equation 3.2 is proved. This equation indicates that $\delta x_k \sim \gamma\, \delta x_{k+1}$ and $\delta x_k \sim \gamma^{L-k}$ in each component. If the output neurons are linear, then $y_L = \delta x_L + O(\gamma)$. Consequently, $\delta x_k = \gamma^{L-k} y_k + O(\gamma^{L-k+1})$ for all $k = 1, \ldots, L$. Finally, the weight update rule of CHL follows:
$$\Delta W_k = \eta \gamma^{k-L}(\hat{x}_k \hat{x}_{k-1}^T - \check{x}_k \check{x}_{k-1}^T) \quad (3.16)$$
$$= \eta \gamma^{k-L} \delta x_k \check{x}_{k-1}^T + \eta \gamma^{k-L} \check{x}_k \delta x_{k-1}^T + \eta \gamma^{k-L} \delta x_k \delta x_{k-1}^T \quad (3.17)$$
$$= \eta \gamma^{k-L} \delta x_k \check{x}_{k-1}^T + O(\gamma) \quad (3.18)$$
$$= \eta\, y_k x_{k-1}^T + O(\gamma). \quad (3.19)$$
The last approximation is made because $\check{x}_{k-1} - x_{k-1} \sim \gamma$. This result shows that the CHL algorithm in the layered network with linear output neurons is identical to backpropagation as $\gamma \to 0$.
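This equivalence is easy to probe numerically. The sketch below is not code from the paper: the two-layer sizes, random weights, and the fixed-point iteration schedule are illustrative assumptions. It settles a small network with weak feedback in both phases and compares the hidden-layer CHL update direction $\gamma^{-1}(\hat{x}_1 - \check{x}_1)x_0^T$ against the backpropagation direction $y_1 x_0^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda u: 1.0 / (1.0 + np.exp(-u))              # sigmoid hidden layer
n0, n1, n2 = 4, 5, 3                                # layer sizes (illustrative)
W1 = 0.5 * rng.standard_normal((n1, n0)); b1 = 0.5 * rng.standard_normal(n1)
W2 = 0.5 * rng.standard_normal((n2, n1)); b2 = 0.5 * rng.standard_normal(n2)
x0 = rng.standard_normal(n0)                        # fixed input
d = rng.standard_normal(n2)                         # desired (linear) output
gamma = 1e-3                                        # weak feedback strength

def settle(x2_clamped=None, iters=500):
    """Relax the two-layer network to a fixed point (free or clamped phase)."""
    x1 = f(W1 @ x0 + b1)
    x2 = W2 @ x1 + b2 if x2_clamped is None else x2_clamped
    for _ in range(iters):
        x1 = f(W1 @ x0 + b1 + gamma * (W2.T @ x2))  # weak top-down feedback
        if x2_clamped is None:
            x2 = W2 @ x1 + b2                       # linear output layer
    return x1, x2

x1_free, x2_free = settle()
x1_clamp, _ = settle(x2_clamped=d)

# CHL hidden-layer update direction: gamma^{k-L} times (clamped - free) outer product
chl = (1.0 / gamma) * np.outer(x1_clamp - x1_free, x0)

# Backprop update direction for the same weights (squared error, linear output)
x1_ff = f(W1 @ x0 + b1)
y2 = d - (W2 @ x1_ff + b2)
y1 = x1_ff * (1 - x1_ff) * (W2.T @ y2)              # D1 = diag(f'(u1)) for sigmoid
bp = np.outer(y1, x0)

print(np.max(np.abs(chl - bp)))                     # gap shrinks like O(gamma)
```

Shrinking $\gamma$ further shrinks the gap, consistent with the $O(\gamma)$ correction in equation 3.19.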
Remark. For nonlinear output neurons, $\delta x_L$ of CHL is different from $y_L$ in equation 2.2 computed in backpropagation. However, they are within 90 degrees of each other if viewed as vectors. Moreover, if the activation function for the output neurons is taken to be the sigmoidal function, that is, $f_L(z) = 1/(1 + \exp(-z))$, the CHL algorithm is equivalent to backpropagation based on the cost function
$$-d^T \log(\check{x}_L) - (1-d)^T \log(1 - \check{x}_L), \quad (3.20)$$
since in this case, $y_L = d - \check{x}_L$. In the above, the function $\log$ acts on a vector and returns a vector as well.

4 Contrastive Function
The CHL algorithm stated in section 2.2 can be shown to perform gradient descent on a contrastive function, defined as the difference of the network's Lyapunov functions between the clamped and free states (Movellan, 1990; Baldi & Pineda, 1991). Suppose $E(x)$ is a Lyapunov function of the dynamics in equation 2.5. Construct the contrastive function $C(W) \equiv E(\hat{x}) - E(\check{x})$, where $\hat{x}$ and $\check{x}$ are steady states of the whole network in the clamped and free phases, respectively, and $W \equiv \{W_1, \ldots, W_L\}$. For simplicity, let us first assume that $E(x)$ has a unique global minimum in the range of $x$ and no local minima. According to the definition of Lyapunov functions, $\check{x}$ is the global minimum of $E$, and so is $\hat{x}$, but under the extra constraint that the output neurons are clamped at $d$. Therefore, $C(W) = E(\hat{x}) - E(\check{x}) \geq 0$, and it achieves zero if and only if $\check{x} = \hat{x}$, that is, when the output neurons reach the desired values. Performing gradient descent on $C(W)$ leads to the CHL algorithm.

On the other hand, if $E(x)$ does not have a unique minimum, $\check{x}$ and $\hat{x}$ may be only local minima. However, the above discussion still holds, provided that $\hat{x}$ is in the basin of attraction of $\check{x}$ under the free-phase dynamics. This imposes some constraints on how to reset the initial state of the network after each phase. One strategy is to let the clamped phase settle to the steady state first and then run the free phase without resetting the hidden neurons. This guarantees that $C(W)$ is always nonnegative and constitutes a proper cost function.

Next, we introduce a Lyapunov function for the network dynamics in equation 2.5,
$$E(x) = \sum_{k=1}^{L} \gamma^{k-L} \left[\mathbf{1}^T \bar{F}_k(x_k) - x_k^T W_k x_{k-1} - b_k^T x_k\right], \quad (4.1)$$
where the function $\bar{F}_k$ is defined so that $\bar{F}_k'(x) = f_k^{-1}(x)$, and $x \equiv \{x_1, \ldots, x_L\}$ represents the states of all layers of the network. Equation 4.1 is extended from
Lyapunov functions previously introduced for recurrent networks (Hopfield, 1984; Cohen & Grossberg, 1983). For $E(x)$ to be a Lyapunov function, it must be nonincreasing under the dynamics equation 2.5. This can be shown by
$$\dot{E} = \sum_{k=1}^{L} \left(\frac{\partial E}{\partial x_k}\right)^T \dot{x}_k \quad (4.2)$$
$$= \sum_{k=1}^{L} \gamma^{k-L} \left[f_k^{-1}(x_k) - W_k x_{k-1} - \gamma W_{k+1}^T x_{k+1} - b_k\right]^T \dot{x}_k \quad (4.3)$$
$$= -\sum_{k=1}^{L} \gamma^{k-L} \left[f_k^{-1}(x_k) - f_k^{-1}(\dot{x}_k + x_k)\right]^T \left[x_k - (\dot{x}_k + x_k)\right] \quad (4.4)$$
$$\leq 0, \quad (4.5)$$
where the last inequality holds because $f_k$ is monotonic, as we have assumed. Therefore, $E(x)$ is nonincreasing under the dynamics and stationary if and only if the network is at a fixed point. Furthermore, with appropriately chosen $f_k$, such as sigmoid functions, $E(x)$ is also bounded below, in which case $E(x)$ is a Lyapunov function.

Given the Lyapunov function, we can form the contrastive function $C(W)$ and derive the gradient-descent algorithm on $C$ accordingly. The derivative of $E(\hat{x})$ with respect to $W_k$ is
$$\frac{dE(\hat{x})}{dW_k} = \frac{\partial E}{\partial W_k} + \sum_{j} \frac{\partial E}{\partial \hat{x}_j} \frac{\partial \hat{x}_j}{\partial W_k} \quad (4.6)$$
$$= \frac{\partial E}{\partial W_k} \quad (4.7)$$
$$= -\gamma^{k-L} \hat{x}_k \hat{x}_{k-1}^T, \quad (4.8)$$
where the second equality holds because $\partial E/\partial \hat{x}_j = 0$ for all $j$ at the steady states. Similarly, we derive
$$\frac{dE(\check{x})}{dW_k} = -\gamma^{k-L} \check{x}_k \check{x}_{k-1}^T. \quad (4.9)$$
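The derivative formulas 4.8 and 4.9, and the contrastive gradient they combine into, can be verified with a finite-difference test: settle both phases, evaluate $C(W)$, and compare its numerical derivative with respect to a single weight against the outer-product formula. The sketch below uses illustrative assumptions not taken from the text: a two-layer network with sigmoid hidden units and a linear output, so that $\bar{F}_k$ is the binary-entropy and quadratic antiderivative, respectively.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda u: 1.0 / (1.0 + np.exp(-u))
n0, n1, n2 = 3, 4, 2
W1 = 0.5 * rng.standard_normal((n1, n0)); b1 = 0.5 * rng.standard_normal(n1)
W2 = 0.5 * rng.standard_normal((n2, n1)); b2 = 0.5 * rng.standard_normal(n2)
x0 = rng.standard_normal(n0); d = rng.standard_normal(n2)
gamma = 0.1

def settle(W1, clamp=False, iters=3000):
    x1 = f(W1 @ x0 + b1)
    x2 = d if clamp else W2 @ x1 + b2
    for _ in range(iters):
        x1 = f(W1 @ x0 + b1 + gamma * (W2.T @ x2))
        if not clamp:
            x2 = W2 @ x1 + b2
    return x1, x2

def energy(W1, x1, x2):
    # Lyapunov function 4.1 with L = 2: entropy term for the sigmoid layer,
    # quadratic term for the linear output layer.
    F1 = np.sum(x1 * np.log(x1) + (1 - x1) * np.log(1 - x1))
    return (F1 - x1 @ (W1 @ x0) - b1 @ x1) / gamma \
        + 0.5 * x2 @ x2 - x2 @ (W2 @ x1) - b2 @ x2

def contrastive(W1):                 # C(W) = E(clamped) - E(free)
    return energy(W1, *settle(W1, clamp=True)) - energy(W1, *settle(W1))

# Analytic gradient for the hidden-layer weights (k = 1, L = 2)
x1c, _ = settle(W1, clamp=True)
x1f, _ = settle(W1)
analytic = -(1.0 / gamma) * (np.outer(x1c, x0) - np.outer(x1f, x0))

# Central finite difference on one entry of W1
eps, (i, j) = 1e-5, (0, 0)
Wp, Wm = W1.copy(), W1.copy()
Wp[i, j] += eps; Wm[i, j] -= eps
numeric = (contrastive(Wp) - contrastive(Wm)) / (2 * eps)
print(numeric, analytic[i, j])       # the two values agree closely
```

The agreement relies on both phases being settled to a fixed point before the energies are compared, since only there do the state-dependent terms in the total derivative vanish.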
Combining equations 4.8 and 4.9, we find that the derivative of $C(W)$ with respect to $W_k$ reads
$$\frac{dC}{dW_k} = \frac{dE(\hat{x})}{dW_k} - \frac{dE(\check{x})}{dW_k} = -\gamma^{k-L} \left(\hat{x}_k \hat{x}_{k-1}^T - \check{x}_k \check{x}_{k-1}^T\right). \quad (4.10)$$
With a suitable learning rate, gradient descent on $C(W)$ leads to the CHL algorithm in equation 2.8.
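The monotonicity claim in equations 4.2 to 4.5 can also be checked by direct numerical integration. In the sketch below (same illustrative two-layer assumptions: sigmoid hidden layer, linear output layer, and an Euler step chosen for the sketch), $E$ is evaluated along a trajectory of the relaxation dynamics:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda u: 1.0 / (1.0 + np.exp(-u))
n0, n1, n2 = 3, 4, 2
W1 = rng.standard_normal((n1, n0)); b1 = rng.standard_normal(n1)
W2 = rng.standard_normal((n2, n1)); b2 = rng.standard_normal(n2)
x0 = rng.standard_normal(n0)
gamma = 0.1

def energy(x1, x2):
    # bar{F}_k is the antiderivative of f_k^{-1}: entropy form for the
    # sigmoid hidden layer, quadratic form for the linear output layer.
    F1 = np.sum(x1 * np.log(x1) + (1 - x1) * np.log(1 - x1))
    E1 = F1 - x1 @ (W1 @ x0) - b1 @ x1
    E2 = 0.5 * x2 @ x2 - x2 @ (W2 @ x1) - b2 @ x2
    return E1 / gamma + E2           # gamma^{k-L} weighting with L = 2

x1 = f(rng.standard_normal(n1))      # random initial hidden state in (0, 1)
x2 = rng.standard_normal(n2)
dt, energies = 0.01, []
for _ in range(3000):                # Euler integration of the dynamics 2.5
    energies.append(energy(x1, x2))
    dx1 = f(W1 @ x0 + b1 + gamma * (W2.T @ x2)) - x1
    dx2 = (W2 @ x1 + b2) - x2
    x1, x2 = x1 + dt * dx1, x2 + dt * dx2

print(energies[0], energies[-1])     # E decreases toward the fixed point
```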
5 Equivalence of Cost Functions
In section 3, we proved that the CHL algorithm in the layered network with linear output neurons is equivalent to backpropagation in the weak-feedback limit. Since both algorithms perform gradient descent on some cost function, the equivalence of the update rules implies that their cost functions should be equal, up to a multiplicative or additive constant. Next, we demonstrate this directly by comparing the cost functions of the two algorithms.

The backpropagation learning algorithm is gradient descent on the squared difference, $\|d - x_L\|^2/2$, between the desired and actual outputs of the network. For the CHL algorithm, the cost function is the difference of the Lyapunov functions between the clamped and free states, as shown in the previous section. After reordering, it can be written as
$$C = \sum_{k=1}^{L} \gamma^{k-L} \left[\mathbf{1}^T\left(\bar{F}_k(\hat{x}_k) - \bar{F}_k(\check{x}_k)\right) - \delta x_k^T (W_k \check{x}_{k-1} + b_k) - \delta x_{k-1}^T W_k^T \hat{x}_k\right]. \quad (5.1)$$
Recall that $\delta x_k \sim \gamma^{L-k}$. Therefore, the $\delta x_k$ term above, multiplied by the factor $\gamma^{k-L}$, is of order 1, whereas the $\delta x_{k-1}$ term multiplied by the same factor is of order $\gamma$ and can thus be neglected in the leading-order approximation. After this, we get
$$C = \sum_{k=1}^{L} \gamma^{k-L} \left[\mathbf{1}^T\left(\bar{F}_k(\hat{x}_k) - \bar{F}_k(\check{x}_k)\right) - \delta x_k^T (W_k \check{x}_{k-1} + b_k)\right] + O(\gamma). \quad (5.2)$$
If the output neurons are linear ($f_L(x) = x$), then $\bar{F}_L(x) = x^T x/2$ and $W_L \check{x}_{L-1} + b_L = \check{x}_L$. Substituting these into $C$ and separating the terms of the output and hidden layers, we derive
$$C = \frac{1}{2}\left(\hat{x}_L^T \hat{x}_L - \check{x}_L^T \check{x}_L\right) - \delta x_L^T \check{x}_L + \sum_{k=1}^{L-1} \gamma^{k-L}\, \delta x_k^T \left[f_k^{-1}(\check{x}_k) - W_k \check{x}_{k-1} - b_k\right] + O(\gamma)$$
$$= \frac{1}{2}\|d - \check{x}_L\|^2 + O(\gamma), \quad (5.3)$$
where the second term, with the sum, vanishes because of the fixed-point equations. In conclusion, to leading order in $\gamma$, the contrastive function in CHL is equal to the squared-error cost function of backpropagation. This demonstration of the equality of the cost functions provides another perspective on the equivalence between the two forms of learning algorithms.
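The leading-order equality is easy to observe numerically. In the sketch below (illustrative two-layer network with a sigmoid hidden layer and a linear output layer; sizes and constants are assumptions of the sketch), the contrastive function evaluated at settled states is compared with the squared error:

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda u: 1.0 / (1.0 + np.exp(-u))
n0, n1, n2 = 3, 4, 2
W1 = 0.5 * rng.standard_normal((n1, n0)); b1 = 0.5 * rng.standard_normal(n1)
W2 = 0.5 * rng.standard_normal((n2, n1)); b2 = 0.5 * rng.standard_normal(n2)
x0 = rng.standard_normal(n0); d = rng.standard_normal(n2)
gamma = 1e-3

def settle(clamp=False, iters=500):
    x1 = f(W1 @ x0 + b1)
    x2 = d if clamp else W2 @ x1 + b2
    for _ in range(iters):
        x1 = f(W1 @ x0 + b1 + gamma * (W2.T @ x2))
        if not clamp:
            x2 = W2 @ x1 + b2
    return x1, x2

def energy(x1, x2):
    F1 = np.sum(x1 * np.log(x1) + (1 - x1) * np.log(1 - x1))
    return (F1 - x1 @ (W1 @ x0) - b1 @ x1) / gamma \
        + 0.5 * x2 @ x2 - x2 @ (W2 @ x1) - b2 @ x2

C_val = energy(*settle(clamp=True)) - energy(*settle())
x2_free = settle()[1]
print(C_val, 0.5 * np.sum((d - x2_free) ** 2))   # equal up to O(gamma)
```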
So far, we have always assumed that the output neurons are linear. If this is not true, how different is the cost function of CHL from that of backpropagation? Repeating the above derivation, we get the cost function of CHL for nonlinear output neurons,
$$C = \mathbf{1}^T \bar{F}_L(\hat{x}_L) - \mathbf{1}^T \bar{F}_L(\check{x}_L) - \delta x_L^T f_L^{-1}(\check{x}_L) + O(\gamma) \quad (5.4)$$
$$= -\mathbf{1}^T \bar{F}_L(\check{x}_L) - \delta x_L^T f_L^{-1}(\check{x}_L) + \mathbf{1}^T \bar{F}_L(d) + O(\gamma). \quad (5.5)$$
With a sigmoidal activation function for the output neurons, $\bar{F}_L(z) = z \log(z) + (1-z)\log(1-z)$. Substituting this into equation 5.5, we find
$$C = -d^T \log(\check{x}_L) - (1-d)^T \log(1 - \check{x}_L) + \mathbf{1}^T \bar{F}_L(d) + O(\gamma), \quad (5.6)$$
which is the same as the cost function in equation 3.20 in the small-$\gamma$ limit, up to a constant difference.

6 Simulation
In this section, we use backpropagation and the CHL algorithm to train a 784-10-10 three-layer network to perform handwritten digit recognition. The data are abridged from the MNIST database, containing 5000 training examples and 1000 testing examples. We use the sigmoidal function $f_k(x) = 1/(1+\exp(-x))$ for the hidden and output layers. The backpropagation algorithm is based on the cost function in equation 3.20. We simulate the CHL algorithm in the layered network with three different feedback strengths: $\gamma = 0.05$, 0.1, and 0.5. The label of an input example is determined by the index of the largest output. After each epoch of on-line training, the classification and squared errors for both training and testing examples are computed and plotted in Figure 2. The classification error is defined as the percentage of examples classified incorrectly. The squared error is defined as the mean squared difference between the actual and desired outputs over all training or testing examples. The desired output is 1 on the output neuron whose index matches the label and 0 on the other output neurons. The learning curve of the CHL algorithm is very similar to that of backpropagation for $\gamma = 0.05$ and 0.1, and it deviates from the backpropagation learning curve for $\gamma = 0.5$ (see Figure 2). In the simulations, we find that the overall simulation time of the CHL algorithm is not significantly longer than that of backpropagation. This is because the layered network tends to converge to a steady state quickly in the case of weak feedback.

7 Discussion
We have shown that backpropagation in multilayer perceptrons can be equivalently implemented by the CHL algorithm if weak feedback is added.
Figure 2: Comparison of performance between backpropagation and the CHL algorithm. The two algorithms are used to train a 784-10-10 three-layer network to perform handwritten digit recognition. The CHL algorithm is used to train the three-layer network with feedback $\gamma = 0.05$, 0.1, and 0.5 added. (A, B) Classification and squared errors for training examples. (C, D) The same for test examples.
This is demonstrated from two perspectives: evaluating the two algorithms directly and comparing their cost functions. The essence behind this equivalence is that CHL effectively extracts the error signal of backpropagation from the difference between the clamped and free steady states.

The equivalence between CHL and backpropagation in layered networks holds mathematically in the limit of weak feedback. This, however, does not imply that in engineering applications we should substitute CHL for backpropagation to train neural networks. In networks with many hidden layers, the difference between the clamped and free states in the first few layers becomes very small in the weak-feedback limit, and therefore CHL will not be robust against noise during training. In practice, when CHL is used to train layered networks with many hidden layers, the feedback strength should not be chosen too small, in which case the approximation of CHL to the backpropagation algorithm will be inaccurate.
The investigation of the relationship between backpropagation and CHL is motivated by research looking for biologically plausible learning algorithms. It is believed by many that backpropagation is not biologically realistic. However, in an interesting study of coordinate transforms in the posterior parietal cortex of monkeys, Zipser and Andersen (1988) showed that hidden neurons in a network model trained by backpropagation share very similar properties with real neurons recorded from that area. This work prompted the search for a learning algorithm that has similar functionality to backpropagation (Crick, 1989; Mazzoni, Andersen, & Jordan, 1991) and at the same time is biologically plausible. CHL is a Hebbian-type learning algorithm, relying only on pre- and postsynaptic activities. The equivalent implementation of backpropagation by CHL suggests that CHL could be a possible solution to this problem.

Mazzoni et al. (1991) also proposed a biologically plausible learning rule as an alternative to backpropagation. Their algorithm is a reinforcement-type learning algorithm, which is usually slow, has large variance, and depends on global signals. In contrast, the CHL algorithm is a deterministic algorithm, which could be much faster than reinforcement learning algorithms. However, a disadvantage of CHL is its dependence on special network structures, such as the layered network in our case. Whether either algorithm is used by biological systems is an important question that needs further investigation in experiments and theory.

Acknowledgments
We acknowledge helpful discussions with J. J. Hopfield and S. Roweis. We thank J. Movellan for suggesting the error function for nonlinear output neurons.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Baldi, P., & Pineda, F. (1991). Contrastive learning and neural oscillations. Neural Computation, 3, 526–545.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. on Systems, Man, and Cybernetics, 13, 815–826.
Crick, F. (1989). The recent excitement about neural networks. Nature, 337, 129–132.
Hinton, G. E. (1989). Deterministic Boltzmann learning performs steepest descent in weight-space. Neural Computation, 1, 143–150.
Hinton, G. E., & McClelland, J. (1988). Learning representations by recirculation. In D. Z. Anderson (Ed.), Neural Information Processing Systems. New York: American Institute of Physics.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Hopfield, J. J. (1987). Learning algorithms and probability distributions in feed-forward and feed-back networks. Proc. Natl. Acad. Sci. USA, 84, 8429–8433.
Mazzoni, P., Andersen, R. A., & Jordan, M. I. (1991). A more biologically plausible learning rule for neural networks. Proc. Natl. Acad. Sci. USA, 88, 4433–4437.
Movellan, J. (1990). Contrastive Hebbian learning in the continuous Hopfield model. In D. Touretzky, J. Elman, T. Sejnowski, & G. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School (pp. 10–17). San Mateo, CA: Morgan Kaufmann.
O'Reilly, R. (1996). Biologically plausible error-driven learning using local activation differences: The generalized recirculation algorithm. Neural Computation, 8, 895–938.
Peterson, C., & Anderson, J. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995–1019.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986a). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986b). Learning representations by back-propagating errors. Nature, 323, 533–536.
Zipser, D., & Andersen, R. A. (1988). A back-propagation programmed network that simulates response properties of a subset of posterior parietal neurons. Nature (London), 331, 679–684.
Received January 25, 2002; accepted August 1, 2002.
LETTER
Communicated by Maxwell Stinchcombe
Indexed Families of Functionals and Gaussian Radial Basis Functions Irwin W. Sandberg
[email protected]
Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712, U.S.A.

We report on results concerning the capabilities of gaussian radial basis function networks in the setting of inner product spaces that need not be finite dimensional. Specifically, we show that important indexed families of functionals can be uniformly approximated, with the approximation uniform also with respect to the index. Applications are described concerning the classification of signals and the synthesis of reconfigurable classifiers.

1 Introduction
Radial basis functions are of interest in connection with a variety of approximation problems in the neural networks area and in other areas as well. Much is understood about the properties of these functions (Park & Sandberg, 1991, 1993; Hartman, Keeler, & Kowalski, 1990; Lo, 1972; Parzen, 1962). It is known (Park & Sandberg, 1993), for example, that arbitrarily good approximation in $L_1(\mathbb{R}^n)$ of a general $f \in L_1(\mathbb{R}^n)$ is possible using uniform smoothing factors and radial basis functions generated in a certain natural way from a single $g$ in $L_1(\mathbb{R}^n)$ if and only if $g$ has a nonzero integral. As another example, Hartman et al. (1990) proved that gaussian radial basis functions can uniformly approximate arbitrarily well any continuous real functional defined on a compact convex subset of $\mathbb{R}^n$.

In Sandberg (2001), an approximation result is proved concerning gaussian radial basis functions in a general inner product space, and two applications are given. In particular, it is shown that gaussian radial basis functions defined on $\mathbb{R}^n$ can in fact uniformly approximate arbitrarily well, over all of $\mathbb{R}^n$, any continuous real functional $f$ on $\mathbb{R}^n$ that meets the condition that
$$\lim_{\|x\| \to \infty} |f(x)| = 0.$$
This generalizes the result in Hartman et al. (1990) because, by the Lebesgue-Urysohn extension theorem (Stone, 1962), sometimes attributed to Tietze, any continuous real functional defined on a bounded closed subset of $\mathbb{R}^n$

Neural Computation 15, 455–468 (2003) © 2002 Massachusetts Institute of Technology
456
I. Sandberg
Figure 1: Classification network.
can be extended so that it is defined and continuous on all of $\mathbb{R}^n$ and meets the above condition.¹

The second application concerns the problem of classifying signals, which is of interest in several application areas (e.g., signal detection, computer-assisted medical diagnosis, automatic target identification). Typically, we are given a finite number $m$ of pairwise disjoint sets $C_1, \ldots, C_m$ of signals, and we would like to synthesize a system that maps the elements of each $C_j$ into a real number $a_j$ so that the numbers $a_1, \ldots, a_m$ are distinct. Sandberg (1994) (see also Sandberg et al., 2001) showed that the structure displayed in Figure 1 can perform this classification for any prescribed distinct $a_1, \ldots, a_m$, assuming only that the $C_j$ are compact subsets of a real normed linear space. In the figure, $x$ is the signal to be classified, $y_1, \ldots, y_k$ are functionals that can be taken to be linear,² $h$ denotes a continuous memoryless nonlinear mapping that, for example, can be implemented by a neural network having one hidden nonlinear layer, the block labeled $Q$ is a quantizer that for each $j$ maps numbers in the interval $(a_j - 0.25\rho,\, a_j + 0.25\rho)$ into $a_j$, where $\rho = \min_{i \neq j} |a_i - a_j|$, and $w$ is the output of the classifier.³ Sandberg (2001) showed that a similar classification problem can be solved in a very different way using just a generalized gaussian radial basis function network and a quantizer of the type in Figure 1. The network structure considered can also classify signals $x$ whose critical property is that the norm of $x$ is sufficiently large.
¹ For example, let a continuous $f_0$ defined on a bounded closed subset $A$ of $\mathbb{R}^n$ be given, and let $A$ be contained in an open ball $B$ centered at the origin of $\mathbb{R}^n$. Then, since the complement $C$ of $B$ with respect to $\mathbb{R}^n$ is closed (and thus $A \cup C$ is closed), by the Lebesgue-Urysohn extension theorem there is a continuous extension $f$ of $f_0$ defined on $\mathbb{R}^n$ such that $f(x) = 0$, $x \in C$.
² Examples are given in Sandberg (1994) and Sandberg et al. (2001).
³ The proof given in Sandberg (1994) and Sandberg et al. (2001) expands on a brief remark in Sandberg (1991) that any continuous real functional on a compact subset of a real normed linear space can be uniformly approximated arbitrarily well using only a feedforward neural network with a linear-functional input layer and one memoryless hidden nonlinear (e.g., sigmoidal) layer.
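The quantizer $Q$ is simple to state in code. In the sketch below, the target values $a_j$ are placeholders chosen purely for illustration; $Q$ snaps a number to $a_j$ when it falls within $0.25\rho$ of it and is undefined (here, `None`) otherwise:

```python
# Targets a_j are illustrative; any distinct reals work. rho is the minimum
# pairwise gap between targets, as in the definition of the quantizer.
targets = [1.0, 2.0, 4.0]
rho = min(abs(a - b) for i, a in enumerate(targets)
          for j, b in enumerate(targets) if i != j)

def Q(value):
    """Quantizer of Figure 1: map values in (a_j - 0.25*rho, a_j + 0.25*rho) to a_j."""
    for a in targets:
        if abs(value - a) < 0.25 * rho:
            return a
    return None  # outside every quantization interval

print(Q(1.1), Q(3.9), Q(3.0))  # -> 1.0 4.0 None
```

The gap of $0.5\rho$ between adjacent intervals is what makes the classifier robust to approximation error in the stage feeding $Q$.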
Families of Functionals
457
This article too is concerned with the capabilities of gaussian radial basis functions. Here we show that important indexed families $\{f(a, \cdot): a \in A\}$ of functionals of the type addressed in Sandberg (2001) can be uniformly approximated, with the approximation uniform also with respect to the index. More specifically, we consider a setting in which $A$ denotes, for example, the index set $[0,1]$, and $X_2$ (we avoid the use of $X$ or $X_1$ because they are used differently in the following section) is taken to be $\mathbb{R}^n$ or a certain more general type of subset of an inner product space $S_2$ that may be, for instance, a function space. We show that large families of continuous real functionals $f$ on $A \times X_2$ can be uniformly approximated arbitrarily well using gaussian radial basis functions, in the sense that given $\epsilon > 0$, there are a positive integer $q$, real functions $c_1, \ldots, c_q$ defined on $A$, positive numbers $b_1, \ldots, b_q$, and elements $u_1, \ldots, u_q$ of a certain subset of $S_2$ such that
$$\left| f(a, x) - \sum_{k=1}^{q} c_k(a) \exp\{-b_k \|x - u_k\|^2\} \right| < \epsilon \quad (1.1)$$
for all $a \in A$ and $x \in X_2$. This holds under conditions requiring, roughly, that $f$ is continuous and vanishes at infinity uniformly in the index (i.e., given $c > 0$, there is a $\xi > 0$ such that $|f(a, x)| < c$ for $\|x\| > \xi$ and all $a \in A$).

Such $f$'s arise naturally. For example, let $p$ be an arbitrary positive integer, and let $f_1, f_2, \ldots, f_p$ be $p$ continuous real functionals on $\mathbb{R}^n$ with the property that
$$\lim_{\|x\| \to \infty} |f_q(x)| = 0$$
for each $q$. Let $I$ denote $\{1, 2, \ldots, p\}$, and let $A$ be any bounded closed real interval containing $I$. There are many functions $f$ from $A \times \mathbb{R}^n$ to $\mathbb{R}$ such that $f(q, x) = f_q(x)$ for all $(q, x) \in I \times \mathbb{R}^n$. The collection of these functions is of interest with regard to the possibility of synthesizing an easily reconfigurable network that can approximate any one of the $p$ functionals, with "easily reconfigurable" meaning essentially that one simply changes the point $a$ at which the coefficients in equation 1.1 are evaluated. An example of such an $f: A \times \mathbb{R}^n \to \mathbb{R}$ is given by the Lagrange-type interpolation relation,
$$f(a, x) = \sum_{q=1}^{p} c_q(a) f_q(x), \quad (1.3)$$
in which
$$c_q(a) = \frac{\prod_{\ell \in I \setminus \{q\}} (a - \ell)}{\prod_{\ell \in I \setminus \{q\}} (q - \ell)}.$$
This $f$ satisfies the two conditions mentioned above for approximation in the sense of equation 1.1. To see this, note that since each $c_q$ is continuous on $A$ and each $f_j$ is continuous on $\mathbb{R}^n$, $f$ is continuous on $A \times \mathbb{R}^n$,⁴ and that given $c > 0$, we may choose $\xi$ so that for all $j$,
$$\max_q \max_a |c_q(a)| \cdot |f_j(x)| < c/p, \quad \|x\| > \xi,$$
in which case we have
$$|f(a, x)| \leq \sum_{q=1}^{p} |c_q(a) f_q(x)| \leq \max_q \max_a |c_q(a)| \sum_{j=1}^{p} |f_j(x)| < c$$
for $\|x\| > \xi$ and all $a \in A$.

Motivated by the connection described in Sandberg (2001) between classification problems and the implementation of functionals as networks, this leads to the conclusion that there are classifiers on $\mathbb{R}^n$ that are reconfigurable in a simple manner. (See theorem 4 for an example of a precise statement along these lines, and see theorem 7 for a related proposition.) The results in section 2 show that this conclusion holds also for a large collection of classification problems in which the objects to be classified are drawn from a function space, and a specific result is given that addresses the case in which the objects are Lipschitz continuous functions defined on the $n$-dimensional interval $[0,1]^n$.

There are applications of the material in section 2 to problems other than classification problems. In particular, because of the close relation between the problem of approximating shift-invariant nonlinear input-output maps, in which inputs and outputs depend on a time or space variable, and the problem of approximating functionals (see, for instance, Sandberg et al., 2001), the results in section 2 are of direct interest concerning the problem of approximating indexed families of input-output maps. In this connection, cases in which the elements of $\{f(a, \cdot): a \in A\}$ are defined on a compact set are also addressed in section 2. The problem of approximating indexed families of maps arises also in connection with shift-varying nonlinear input-output maps, in which time or a space variable plays the role of the index (Sandberg et al., 2001; Sandberg, 1998). It comes up in addition in connection with studies of adaptive systems.⁵

⁴ Regard each $c_q$ and each $f_j$ as defined on $A \times \mathbb{R}^n$ (with any of the usual product metrics; Sutherland, 1975), and use the proposition that the sum and product of two continuous $\mathbb{R}$-valued functions on a metric space are continuous.
⁵ See, in particular, Lo and Bassu (2002), in which the focus is on approximation using sigmoidal functions and maps defined on compact subsets of finite-dimensional spaces.
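The interpolation coefficients $c_q$ defined above are easy to check numerically. A quick sketch (the choice $p = 4$ is illustrative) verifies the defining property $c_q(j) = \delta_{qj}$ on the integer points of $I$ and evaluates the coefficients off the grid:

```python
import math

p = 4                        # illustrative number of functionals
I = range(1, p + 1)

def c(q, a):
    """Lagrange coefficient c_q(a): equals 1 at a = q and 0 at other points of I."""
    num = math.prod(a - l for l in I if l != q)
    den = math.prod(q - l for l in I if l != q)
    return num / den

# Interpolation property: c_q(j) = 1 if q == j, else 0
for q in I:
    for j in I:
        assert abs(c(q, j) - (1.0 if q == j else 0.0)) < 1e-12

print([c(q, 2.5) for q in I])   # coefficients at an off-grid index a = 2.5
```

Evaluating $f(a, x) = \sum_q c_q(a) f_q(x)$ at an off-grid $a$ blends the $p$ functionals, which is exactly the "reconfiguration by moving the index" described above.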
2 Approximation of Indexed Families

2.1 Preliminaries. We first describe a theorem in Sandberg (2001) that we will use. Let $S$ be a real inner-product space⁶ (i.e., a real pre-Hilbert space) with inner product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$ derived in the usual way from $\langle \cdot, \cdot \rangle$. Let $X$ be a metric space whose points are a subset of the points of $S$, and denote the metric in $X$ by $d$. With $V$ any convex subset of $S$ such that $\{\|\cdot + v\|: v \in V\}$ separates the points of $X$ (i.e., such that for $x$ and $y$ in $X$ with $x \neq y$, there is a $v \in V$ for which $\|x + v\| \neq \|y + v\|$), and with $P$ any nonempty subset of $(0, \infty)$ that is closed under addition, let $\mathcal{X}_0$ denote the set of functions $g$ defined on $X$ that have the representation
$$g(x) = \alpha \exp\{-\beta \|x - v\|^2\}, \quad (2.1)$$
in which $\alpha \in \mathbb{R}$, $\beta \in P$, and $v \in V$. Let $\mathcal{X}$ stand for the set of continuous functions $f$ from $X$ to the reals $\mathbb{R}$ with the property that for each $\epsilon > 0$, there is a compact subset $X_{f,\epsilon}$ of $X$ such that $|f(x) - f(y)| < \epsilon$ for $x, y \notin X_{f,\epsilon}$. And let $\mathcal{X}_1$ denote the family of $f$'s in $\mathcal{X}$ such that for each $f$ and each $\epsilon > 0$, there is a compact subset $X_{f,\epsilon}$ of $X$ such that $|f(x)| < \epsilon$ for $x \notin X_{f,\epsilon}$. The result in Sandberg (2001) is the following:

Theorem 1. Assume that $X$ is locally compact but not compact⁷ and that $X$ is such that $\mathcal{X}_0 \subset \mathcal{X}_1$. Then for each $f \in \mathcal{X}_1$ and each $\epsilon > 0$, there are a positive integer $q$, real numbers $a_1, \ldots, a_q$, numbers $b_1, \ldots, b_q$ belonging to $P$, and elements $v_1, \ldots, v_q$ of $V$ such that
$$\left| f(x) - \sum_{k=1}^{q} a_k \exp\{-b_k \|x - v_k\|^2\} \right| < \epsilon \quad (2.2)$$
for all $x \in X$. Concerning the hypothesis that $\mathcal{X}_0 \subset \mathcal{X}_1$: since $\alpha \exp\{-\beta\|x - v\|^2\} \to 0$ as $\|x\| \to \infty$ for any $\beta > 0$, we have $\mathcal{X}_0 \subset \mathcal{X}_1$ when for each $c > 0$ there is a compact subset $X_c$ of $X$ such that $\|x\| \geq c$ for $x \in X$ with $x \notin X_c$.
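In the simplest setting $X = \mathbb{R}$ with the usual norm, the flavor of theorem 1 can be illustrated by fitting a functional that vanishes at infinity with a finite gaussian sum. The sketch below is not the constructive content of the theorem; the centers $v_k$, the common width, the target function, and the least-squares fit are all illustrative choices:

```python
import numpy as np

# Target functional on R vanishing at infinity (illustrative choice)
f = lambda x: x * np.exp(-x ** 2 / 2)

centers = np.linspace(-4, 4, 17)     # v_k: gaussian centers (illustrative)
b = 2.0                              # common width parameter (illustrative)
xs = np.linspace(-6, 6, 400)
Phi = np.exp(-b * (xs[:, None] - centers[None, :]) ** 2)

a, *_ = np.linalg.lstsq(Phi, f(xs), rcond=None)   # coefficients a_k
err = np.max(np.abs(Phi @ a - f(xs)))
print(err)                           # small uniform error on the sample grid
```

Note that the theorem guarantees uniform approximation over all of $X$, not merely on a bounded sample grid as in this sketch; the decay of both the target and the gaussians at infinity is what makes the global statement possible.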
2.2 Product Spaces and Indexed Families. Let $S_1$ be a real inner-product space with inner product $\langle \cdot, \cdot \rangle_1$ and norm $\|\cdot\|_1$, and let $S_2$ be a second real inner-product space with inner product $\langle \cdot, \cdot \rangle_2$ and norm $\|\cdot\|_2$. The norms in $S_1$ and $S_2$ are derived from their respective inner products in the usual way. Take $S$ (in section 2.1) to be the inner-product space $S_1 \times S_2$ with inner product $\langle \cdot, \cdot \rangle$ given by
$$\langle (a_1, b_1), (a_2, b_2) \rangle = \langle a_1, a_2 \rangle_1 + \langle b_1, b_2 \rangle_2.$$
Let $(X_1, d_1)$ and $(X_2, d_2)$ be two metric spaces whose points are subsets of the points of $S_1$ and $S_2$, respectively. (The metrics $d_1$ and $d_2$ are not necessarily derived in the usual way from $\|\cdot\|_1$ and $\|\cdot\|_2$.) Take $X$ to be the metric space $X_1 \times X_2$ with the metric given by
$$d[(a_1, b_1), (a_2, b_2)] = d_1(a_1, a_2) + d_2(b_1, b_2).$$
Recall that $V$ is a convex subset of $S$ such that $\{\|\cdot + v\|: v \in V\}$ separates the points of $X$, and that $P$ is any nonempty subset of $(0, \infty)$ that is closed under addition. For $v \in S$, we use $v_1$ and $v_2$ to denote the components of $v$ belonging to $S_1$ and $S_2$, respectively. One of our main results concerning the approximation of families of functionals is the following.

Theorem 2. Assume that $(X_1, d_1)$ and $(X_2, d_2)$ are locally compact but that $(X_1, d_1)$ or $(X_2, d_2)$, or both, are not compact, and that $X$ is such that $\mathcal{X}_0 \subset \mathcal{X}_1$. Then for each $f \in \mathcal{X}_1$ and each $\epsilon > 0$, there are a positive integer $q$, real numbers $a_1, \ldots, a_q$, numbers $b_1, \ldots, b_q$ belonging to $P$, and elements $v_1, \ldots, v_q$ of $V$ such that
$$\left| f((a, b)) - \sum_{k=1}^{q} a_k \exp\{-b_k \|a - v_k^1\|_1^2\} \exp\{-b_k \|b - v_k^2\|_2^2\} \right| < \epsilon \quad (2.3)$$
for all $(a, b) \in X$.

In what follows, we often use the following hypothesis.

A.1. For each $c > 0$, there is a compact subset $X_c$ of $(X_2, d_2)$ such that $\|x\|_2 \geq c$ for $x \in X_2$ with $x \notin X_c$.

We give two examples. Assumption A.1 is met if $X_2 = S_2 = \mathbb{R}^n$, in which $\mathbb{R}^n$ stands for the linear space of real $n$-vectors (as suggested earlier, $n$ is an arbitrary positive integer), the inner product and corresponding norm $\|\cdot\|_2$ on $\mathbb{R}^n$ are the usual ones, and the metric $d_2$ associated with $X_2$ is derived in the usual way from $\|\cdot\|_2$.

For our other example, take $S_2$ to be the inner-product space of real continuous functions defined on the $n$-dimensional interval $[0,1]^n$, with the inner product and corresponding norm on $S_2$ specified by
$$\langle x, y \rangle = \int_{[0,1]^n} x(w)\, y(w)\, dw.$$
With $\ell$ a positive number and with some norm on $\mathbb{R}^n$, take $X_2$ to be the set of all elements of $S_2$ that are Lipschitz continuous with Lipschitz constant $\ell$, and let $d_2$ be the usual uniform (i.e., max) metric. By a version of the Arzelà-Ascoli theorem (Sutherland, 1975), closed bounded subsets of $(X_2, d_2)$ are compact. In addition, $(X_2, d_2)$ is locally compact because the closure of any open ball in $(X_2, d_2)$ is compact, and $(X_2, d_2)$ is unbounded. By the equicontinuity of $X_2$, it follows that for each $c > 0$, there is a compact subset $X_c$ of $(X_2, d_2)$ such that $\|x\|_2 \geq c$ for $x \in X_2$ with $x \notin X_c$. More specifically, let $c$ be given and set $\delta = \sup_{a_1, a_2 \in [0,1]^n,\ x \in X_2} |x(a_1) - x(a_2)|$, which is finite. The subset $X_c$ of $(X_2, d_2)$ given by $\{x \in X_2: \max_a |x(a)| \leq c + \delta\}$ is compact, and we have
$$\int_{[0,1]^n} |x(w)|^2\, dw > c^2$$
for $x \in X_2$ with $x \notin X_c$. This shows that A.1 is met.
With regard to slightly more general examples, we note that if A.1 is met, then A.1 is met also if $(X_2, d_2)$ is replaced with $(\tilde{X}_2, d_2)$, where $\tilde{X}_2$ is any nonempty unbounded closed subset of $(X_2, d_2)$. Other examples can be given in which the elements of $X_2$ are not necessarily continuous.

Let $A$ be a bounded closed subinterval of $\mathbb{R}$.

Theorem 3. Suppose that A.1 holds, and let $U$ be any convex subset of $S_2$ such that $\{\|\cdot + u\|_2: u \in U\}$ separates the points of $X_2$. Let $f$ be a continuous function from $(A, |\cdot|) \times (X_2, d_2)$⁹ to $\mathbb{R}$ such that for each $c > 0$, there is a compact subset $X_c$ of $(X_2, d_2)$ for which
$$|f(a, x)| < c \quad (a \in A,\ x \notin X_c). \quad (2.4)$$
Then for each $\epsilon > 0$ there are a positive integer $q$, real numbers $a_1, \ldots, a_q$, numbers $b_1, \ldots, b_q$ belonging to $P$, elements $\alpha_1, \ldots, \alpha_q$ of $A$, and elements $u_1, \ldots, u_q$ of $U$ such that
$$\left| f(a, x) - \sum_{k=1}^{q} a_k \exp\{-b_k |a - \alpha_k|^2\} \exp\{-b_k \|x - u_k\|_2^2\} \right| < \epsilon$$
for all $a \in A$ and $x \in X_2$.

Proof. We first show that for each $\epsilon > 0$ there is a closed ball in $(A \times X_2, d)$ outside of which $|f(a, x)| < \epsilon$. Specifically, let $\epsilon$ be given, and select a compact subset $\mathcal{X}_2$ of $(X_2, d_2)$ so that $|f(a, x)| < \epsilon$ for all $x \notin \mathcal{X}_2$ and all $a \in A$. Choose any $(a^0, x^0) \in A \times X_2$, and let $r$ be the radius of a closed ball $B^0(r)$ in $(X_2, d_2)$ centered at $x^0$ such that $B^0(r)$ contains the points of $\mathcal{X}_2$. Observe that for $(a, x) \in A \times X_2$ but outside the closed ball in $(A \times X_2, d)$ of radius $r + \max_{a_1, a_2 \in A} |a_1 - a_2|$ centered at $(a^0, x^0)$, we have $x \notin B^0(r)$ and thus $x \notin \mathcal{X}_2$.

Next, we make use of the hypothesis that for each $c > 0$ there is a compact subset $X_c$ of $(X_2, d_2)$ such that $\|x\|_2 \geq c$ for $x \in X_2$ with $x \notin X_c$, and let
9 Here, as suggested, | ¢ | denotes the metric associated with the absolute-value norm. In accord with the material in section 2.2, the metric on (A, | ¢ | ) £ (X2 , d2 ) is the sum of the two metrics.
Families of Functionals
463
$\xi > 0$ be given. Select $(a^0, b^0) \in A \times X_2$, and choose $r > \max_{a \in A} |a - a^0|$ so that

$$\{x \in X_\xi\} \subset \{b \in X_2 : d_2(b, b^0) \le r - \max_{a \in A} |a - a^0|\}.$$

Let $B$ denote the closed ball in $(A \times X_2, d)$ of radius $r$ centered at $(a^0, b^0)$. Then for $(a, b) \in A \times X_2$ with $(a, b) \notin B$, we have $|a - a^0| + d_2(b, b^0) > r$ and thus $d_2(b, b^0) > r - \max_{a \in A} |a - a^0|$, implying that $b \notin X_\xi$. Therefore, for $(a, b) \in A \times X_2$ with $(a, b) \notin B$, we have

$$\|(a, b)\| = (\|a\|_1^2 + \|b\|_2^2)^{1/2} \ge \|b\|_2 \ge \xi,$$

showing that $X_0 \subset X_1$. Finally, let $C$ denote a closed ball in (or any closed bounded subset of) $(A \times X_2, d)$, and let $\{(a_k, b_k)\}_{k=0}^{\infty}$ be a sequence in $C$. By the compactness of $A$ and the compactness of closed bounded subsets of $(X_2, d_2)$, there is a point $(a_\infty, b_\infty) \in A \times X_2$ and a subsequence $\{(a'_k, b'_k)\}_{k=0}^{\infty}$ of $\{(a_k, b_k)\}_{k=0}^{\infty}$ such that $|a'_k - a_\infty| \to 0$ as $k \to \infty$ and $d_2(b'_k, b_\infty) \to 0$ as $k \to \infty$, and thus such that $d[(a'_k, b'_k), (a_\infty, b_\infty)] \to 0$ as $k \to \infty$. Since $C$ is closed, $(a_\infty, b_\infty) \in C$. Thus, by the equivalence of compactness and sequential compactness in metric spaces, $C$ is compact.¹⁰

In the case of either of the two examples given concerning the hypothesis A.1, $U$ can be taken to be $X_2$, or even any subset of $X_2$ that contains an open ball in $(X_2, d_2)$ (see the comments following theorem 1). In theorem 3, a key assumption is that $f$ is continuous on the product space $(A, |\cdot|) \times (X_2, d_2)$. It is not enough that $f$ be continuous separately in each of its two variables. A sufficient condition that is often useful is given in the appendix.

Our next result concerns classification.

Theorem 4. Let $U$ be any convex subset of $S_2$ such that $\{\|\cdot + u\|_2 : u \in U\}$ separates the points of $X_2$, suppose that A.1 holds, let $d_2$ be derived in the usual way from a norm $\|\cdot\|_{d_2}$, and let $p$ be a positive integer. For each $\ell \in \{1, \dots, p\}$, let $C_1(\ell), \dots, C_{m(\ell)}(\ell)$ be pairwise disjoint compact subsets of $(X_2, d_2)$, in which $m(\ell)$ is a positive integer, and let $C_0(\ell) = \{x \in X_2 : \|x\|_{d_2} \ge \xi_\ell\}$, where $\xi_\ell > \max_{j \ne 0} \sup\{\|x\|_{d_2} : x \in C_j(\ell)\}$. Also for each $\ell \in \{1, \dots, p\}$, let $a_0(\ell), a_1(\ell), \dots, a_{m(\ell)}(\ell)$ be distinct real numbers with $a_0(\ell) = 0$ and, with $r_\ell = \min_{i \ne j} |a_i(\ell) - a_j(\ell)|$, let $Q_\ell : \bigcup_j (a_j(\ell) - 0.25 r_\ell,\; a_j(\ell) + 0.25 r_\ell) \to \mathbb{R}$ be specified by $Q_\ell(a) = a_j(\ell)$ for $a \in (a_j(\ell) - 0.25 r_\ell,\; a_j(\ell) + 0.25 r_\ell)$ and each $j$. Then there are a positive integer

¹⁰ Theorem 3 can be easily extended to cover the case in which $A$ is instead a bounded closed $n$-dimensional subinterval of $\mathbb{R}^n$, in which $n$ is an arbitrary positive integer. In that case, $|\cdot|$ in the theorem denotes the euclidean norm on $\mathbb{R}^n$, and $A = A_1 \times \cdots \times A_n$, in which the $A_j$ are closed bounded subintervals of $\mathbb{R}$. The proof is essentially the same.
$q$, real numbers $a_1, \dots, a_q$, numbers $\beta_1, \dots, \beta_q$ belonging to $P$, elements $\alpha_1, \dots, \alpha_q$ of $[1, p]$, and elements $u_1, \dots, u_q$ of $U$ such that, for each $\ell \in \{1, \dots, p\}$,

$$Q_\ell\!\left( \sum_{k=1}^{q} a_k \exp\{-\beta_k |\ell - \alpha_k|^2\}\, \exp\{-\beta_k \|x - u_k\|_2^2\} \right) = a_j(\ell)$$

for $x \in C_j(\ell)$ with $j = 0, 1, \dots, m(\ell)$.

Proof. For each $\ell \in \{1, \dots, p\}$, let $g_\ell : \bigcup_{j=0}^{m(\ell)} C_j(\ell) \to \mathbb{R}$ be defined by $g_\ell(x) = a_j(\ell)$ for $x \in C_j(\ell)$ and each $j$. Observe that each $\bigcup_{j=0}^{m(\ell)} C_j(\ell)$ is closed in $(X_2, d_2)$, and that each $g_\ell$ is continuous (note that the distance between pairs of the $C_j(\ell)$ is positive). Using the Lebesgue-Urysohn extension theorem (Stone, 1962), let $f_\ell$ be a continuous extension of $g_\ell$ to $(X_2, d_2)$ for each $\ell$. Note that for every $\ell$, $f_\ell(x) = 0$ for $\|x\|_{d_2} \ge \xi_\ell$. Proceeding as indicated in section 1, and with $A = [1, p]$, construct a continuous function $f$ from $(A, |\cdot|) \times (X_2, d_2)$ to $\mathbb{R}$ such that $f(\ell, x) = f_\ell(x)$ for each $\ell$, and $f(a, x) = 0$ for $a \in A$ and $x$ outside the compact set $\{x \in X_2 : \|x\|_{d_2} \le \max_\ell \xi_\ell\}$. Thus, by theorem 3, for each $\epsilon > 0$ there are a positive integer $q$, real numbers $a_1, \dots, a_q$, numbers $\beta_1, \dots, \beta_q$ belonging to $P$, elements $\alpha_1, \dots, \alpha_q$ of $A$, and elements $u_1, \dots, u_q$ of $U$ such that

$$\left| f(a, x) - \sum_{k=1}^{q} a_k \exp\{-\beta_k |a - \alpha_k|^2\}\, \exp\{-\beta_k \|x - u_k\|_2^2\} \right| < \epsilon \dots$$

…for each $\epsilon > 0$, there are a positive integer $q$, real numbers $a_1, \dots, a_q$, numbers $\beta_1, \dots, \beta_q$ belonging to $P$, and elements $v_1, \dots, v_q$ of $V$ such that

$$\left| f(a, b) - \sum_{k=1}^{q} a_k \exp\{-\beta_k \|a - v_k^1\|_1^2\}\, \exp\{-\beta_k \|b - v_k^2\|_2^2\} \right| < \epsilon \dots$$

…for each $\epsilon > 0$, there are a positive integer $q$, real numbers $a_1, \dots, a_q$, numbers $\beta_1, \dots, \beta_q$ belonging to $P$, and elements $v_1, \dots, v_q$ of $V$ such that

$$\left| f(x) - \sum_{k=1}^{q} a_k \exp\{-\beta_k \|x - v_k\|^2\} \right| < \epsilon \dots$$

…for each $\epsilon > 0$, there are a positive integer $q$, real numbers $a_1, \dots, a_q$, numbers $\beta_1, \dots, \beta_q$ belonging to $P$, elements $\alpha_1, \dots, \alpha_q$ of $A$, and elements $u_1, \dots, u_q$ of $U$ such that

$$\left| f(a, b) - \sum_{k=1}^{q} a_k \exp\{-\beta_k |a - \alpha_k|^2\}\, \exp\{-\beta_k \|b - u_k\|_2^2\} \right| < \epsilon \dots$$

…for each $\epsilon > 0$, there are positive constants $\delta_1$ and $\delta_2$ such that $d_3[f(a, b), f(a^0, b)] < \epsilon$ for $d_1(a, a^0) < \delta_1$ and $d_2(b, b^0) < \delta_2$. Under these conditions, $f$ is continuous on $(Y, d)$. To see this, let $(a^0, b^0) \in Y_1 \times Y_2$ and $\epsilon_1 > 0$ be given.
Choose $\delta > 0$ so that $d_3[f(a^0, b), f(a^0, b^0)] < \epsilon_1/2$ for $d_2(b, b^0) < \delta$, and so that $d_3[f(a, b), f(a^0, b)] < \epsilon_1/2$ for $d_1(a, a^0) < \delta$ and $d_2(b, b^0) < \delta$. Then observe that for $d[(a, b), (a^0, b^0)] < \delta$, we have

$$d_3[f(a, b), f(a^0, b^0)] \le d_3[f(a, b), f(a^0, b)] + d_3[f(a^0, b), f(a^0, b^0)] < \epsilon_1,$$

showing that $f$ is continuous. (A simple example of an $f$ that meets the conditions given here is $f$ defined by equation 1.3.) We note also that if $f$ is continuous on $(Y, d)$ and $(Y, d)$ is compact, then $f$ is uniformly continuous on $(Y, d)$ and therefore is both continuous separately in its second variable and continuous on $(Y_1, d_1)$ locally uniformly with respect to $(Y_2, d_2)$.

References

Hartman, E. J., Keeler, J. D., & Kowalski, J. M. (1990). Layered neural networks with gaussian hidden units as universal approximators. Neural Computation, 2, 210–215.
Lainiotis, D. (1976). Partitioning: A uniform framework for adaptive systems (parts I and II). Proceedings of the IEEE, 64, 1126–1197.
Lo, J. T. (1972). Finite dimensional sensor orbits and optimal nonlinear filtering. IEEE Transactions on Information Theory, IT-18, 583–588.
Lo, J. T., & Bassu, B. (2002). Adaptive multilayer perceptrons with long- and short-term memories. IEEE Transactions on Neural Networks, 13, 22–33.
Mukherjea, A., & Pothoven, K. (1978). Real and functional analysis. New York: Plenum Press.
Park, J., & Sandberg, I. W. (1991). Universal approximation using radial basis function networks. Neural Computation, 3, 246–257.
Park, J., & Sandberg, I. W. (1993). Approximation and radial-basis function networks. Neural Computation, 5, 305–316.
Parzen, E. (1962). On estimation of a probability density function and mode. Annals of Mathematical Statistics, 33, 1065–1076.
Sandberg, I. W. (1991). Structure theorems for nonlinear systems. Multidimensional Systems and Signal Processing, 2, 267–286. (Errata in vol. 3, no. 1, p. 101, 1992.)
Sandberg, I. W. (1994). General structures for classification. IEEE Transactions on Circuits and Systems-I, 41, 372–376.
Sandberg, I. W. (1998). Notes on uniform approximation of time-varying systems on finite time intervals. IEEE Transactions on Circuits and Systems-I, 45, 863–865.
Sandberg, I. W. (2001). Gaussian radial basis functions and inner-product spaces. Circuits, Systems, and Signal Processing, 20, 635–642. (Errata in vol. 21, no. 1, p. 123, 2002.)
Sandberg, I. W., Lo, J. T., Fancourt, C., Principe, J., Katagiri, S., & Haykin, S. (2001). Nonlinear dynamical systems: Feedforward neural network perspectives. New York: Wiley.
Stone, M. H. (1962). A generalized Weierstrass approximation theorem. In R. C. Buck (Ed.), Studies in modern analysis. Englewood Cliffs, NJ: Prentice Hall.
Sutherland, W. A. (1975). Introduction to metric and topological spaces. Oxford: Clarendon Press.
Received April 17, 2002; accepted July 17, 2002.
LETTER
Communicated by John Platt
Efficient Greedy Learning of Gaussian Mixture Models

J. J. Verbeek
[email protected] N. Vlassis
[email protected]
B. Kröse
[email protected]
Informatics Institute, University of Amsterdam, 1098 SJ Amsterdam, The Netherlands

This article concerns the greedy learning of gaussian mixtures. In the greedy approach, mixture components are inserted into the mixture one after the other. We propose a heuristic for searching for the optimal component to insert. In a randomized manner, a set of candidate new components is generated. For each of these candidates, we find the locally optimal new component and insert it into the existing mixture. The resulting algorithm resolves the sensitivity to initialization of state-of-the-art methods, like expectation maximization, and has running time linear in the number of data points and quadratic in the (final) number of mixture components. Due to its greedy nature, the algorithm can be particularly useful when the optimal number of mixture components is unknown. Experimental results comparing the proposed algorithm to other methods on density estimation and texture segmentation are provided.

1 Introduction
This article concerns the learning (fitting the parameters) of a mixture of gaussian distributions (McLachlan & Peel, 2000). Mixture models form an expressive class of models for density estimation. Applications in a wide range of fields have emerged: density estimation in unsupervised problems, clustering, estimating class-conditional densities in supervised learning settings, situations where the data are partially supervised, and situations where some observations have missing values. The most popular algorithm for learning mixture models is the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). For a given finite data set $X_n$ of $n$ observations and an initial mixture $f_0$, the algorithm provides a means to generate a sequence of mixture models $\{f_i\}$ with nondecreasing log likelihood on $X_n$. The EM algorithm is known to converge to a locally optimal solution. However, convergence to a globally optimal solution is not guaranteed. The log likelihood of the given data set under the found mixture distribution is highly dependent on the initial mixture $f_0$.

Neural Computation 15, 469–485 (2003)
© 2002 Massachusetts Institute of Technology
470
J. J. Verbeek, N. Vlassis, and B. Kröse
The standard procedure to overcome the high dependence on initialization is to start the EM algorithm from several random initializations and use the mixture yielding maximum likelihood on the data. In this article, we present a method to learn finite gaussian mixtures, based on the approximation result of Li and Barron (2000), which states that we can quickly approximate any target density with a finite mixture model, as compared to the best we can do with any (possibly nonfinite) mixture from the same component class. This result also holds if we learn the mixture models in a greedy manner: start with one component and optimally add new components one after the other. However, locating the optimal new component boils down to finding the global maximum of a log-likelihood surface. Li (1999) proposed to grid the parameter space to locate the global maximum, but this is infeasible when learning mixtures in high-dimensional spaces. Here, we propose a search heuristic to locate the global maximum. Our search procedure selects the best member from a set of candidate new components that is dynamic in the sense that it depends on the mixture learned so far. Our search procedure for a new component has running time $O(n)$. For learning a $k$-component mixture, this results in a run time of $O(nk^2)$ if the complete mixture is updated with EM after each insertion. If the mixture is not updated between the component insertions (the approximation result still holds in this case), the run time would be $O(nk)$. The article is organized as follows. In section 2, we recapitulate the definition and EM learning of gaussian mixtures. Section 3 discusses greedy learning of gaussian mixtures. Our component insertion procedure is discussed in section 4, the core of the article. In section 5, we present experimental results on two tasks: modeling of artificially generated data drawn from several types of gaussian mixtures and texture segmentation.
The experiments compare the performance of the new algorithm with the performance of several other methods to learn gaussian mixtures. Section 6 presents conclusions and a discussion.

2 Gaussian Mixtures and the EM Algorithm
A gaussian mixture (GM) is defined as a convex combination of gaussian densities. A gaussian density in a $d$-dimensional space, characterized by its mean $\mu \in \mathbb{R}^d$ and $d \times d$ covariance matrix $C$, is defined as

$$\varphi(x; \theta) = (2\pi)^{-d/2} \det(C)^{-1/2} \exp\left(-(x - \mu)^\top C^{-1} (x - \mu)/2\right), \tag{2.1}$$

where $\theta$ denotes the parameters $\mu$ and $C$. A $k$-component GM is then defined as

$$f_k(x) = \sum_{j=1}^{k} \pi_j\, \varphi(x; \theta_j), \quad \text{with} \quad \sum_{j=1}^{k} \pi_j = 1 \ \text{ and, for } j \in \{1, \dots, k\},\ \pi_j \ge 0. \tag{2.2}$$
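The two definitions above translate directly into code. The following is our own minimal sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

# Minimal sketch of equations 2.1 and 2.2: evaluate a gaussian density and
# a k-component mixture f_k at the rows of x.
def gauss_pdf(x, mu, C):
    """Equation 2.1: d-dimensional gaussian density at each row of x."""
    d = mu.shape[0]
    diff = x - mu
    Cinv = np.linalg.inv(C)
    quad = np.einsum("ni,ij,nj->n", diff, Cinv, diff)  # (x-mu)^T C^-1 (x-mu)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(C) ** (-0.5)
    return norm * np.exp(-quad / 2)

def mixture_pdf(x, pis, mus, Cs):
    """Equation 2.2: convex combination of gaussian components."""
    return sum(p * gauss_pdf(x, m, C) for p, m, C in zip(pis, mus, Cs))

# Two-component example in R^2 with equal mixing weights.
x = np.zeros((1, 2))
pis = [0.5, 0.5]
mus = [np.zeros(2), np.ones(2)]
Cs = [np.eye(2), np.eye(2)]
print(mixture_pdf(x, pis, mus, Cs))
```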
Efcient Greedy Learning of Gaussian Mixture Models
471
The $\pi_j$ are called the mixing weights and $\varphi(x; \theta_j)$ the components of the mixture. As compared to nonparametric density estimation, mixture density estimation (or semi-parametric density estimation) allows for efficient evaluation of the density, since only relatively few density functions have to be evaluated. Furthermore, by using few mixture components, a smoothness constraint on the resulting density is enforced, allowing for robust density estimates. The EM algorithm (Dempster et al., 1977) enables us to update the parameters of a given $k$-component mixture with respect to a data set $X_n = \{x_1, \dots, x_n\}$ with all $x_i \in \mathbb{R}^d$, such that the likelihood of $X_n$ is never smaller under the new mixture. The updates for a GM can be accomplished by iterative application of the following equations for all components $j \in \{1, \dots, k\}$:

$$P(j \mid x_i) := \pi_j\, \varphi(x_i; \theta_j) / f_k(x_i), \tag{2.3}$$

$$\pi_j := \sum_{i=1}^{n} P(j \mid x_i) / n, \tag{2.4}$$

$$\mu_j := \sum_{i=1}^{n} P(j \mid x_i)\, x_i / (n \pi_j), \tag{2.5}$$

$$C_j := \sum_{i=1}^{n} P(j \mid x_i)\, (x_i - \mu_j)(x_i - \mu_j)^\top / (n \pi_j). \tag{2.6}$$
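A hedged sketch of one EM iteration implementing equations 2.3 through 2.6 (the function and variable names are ours; no numerical safeguards are included):

```python
import numpy as np

# One EM iteration for a gaussian mixture, following eqs. 2.3-2.6.
def em_step(X, pis, mus, Cs):
    n, d = X.shape
    k = len(pis)
    # E-step, eq. 2.3: responsibilities P(j | x_i), normalized over j.
    R = np.empty((n, k))
    for j in range(k):
        diff = X - mus[j]
        quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(Cs[j]), diff)
        coef = pis[j] * (2 * np.pi) ** (-d / 2) * np.linalg.det(Cs[j]) ** -0.5
        R[:, j] = coef * np.exp(-quad / 2)
    R /= R.sum(axis=1, keepdims=True)
    # M-step, eqs. 2.4-2.6: mixing weights, means, covariances.
    new_pis = R.sum(axis=0) / n
    new_mus = [R[:, j] @ X / (n * new_pis[j]) for j in range(k)]
    new_Cs = []
    for j in range(k):
        diff = X - new_mus[j]
        new_Cs.append((R[:, j, None] * diff).T @ diff / (n * new_pis[j]))
    return new_pis, new_mus, new_Cs
```

Iterating `em_step` never decreases the log likelihood of the data, in line with the convergence property of EM stated above.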
The EM algorithm is not guaranteed to lead us to the best solution, where "best" means the solution yielding maximal likelihood on $X_n$ among all maxima of the likelihood. (We implicitly assume that the likelihood is bounded by restricting the parameter space, and hence the maximum likelihood estimator is known to exist; Lindsay, 1983.) The good thing is that if we are close to the global optimum of the parameter space, then it is very likely that by using EM, we obtain the globally optimal solution.

3 Greedy Learning of Gaussian Mixtures
In this section, we discuss the greedy approach to learning GMs. The basic idea is simple. Instead of starting with a (random) configuration of all components and improving on this configuration with EM, we build the mixture component-wise. We start with the optimal one-component mixture, whose parameters are trivially computed. Then we repeat two steps until a stopping criterion is met: (1) insert a new component, and (2) apply EM until convergence. The stopping criterion can implement the choice of a prespecified number of components, or it can be any model complexity selection criterion (like minimum description length, the Akaike information criterion, or cross-validation).
3.1 Motivation. The intuitive motivation for the greedy approach is that the optimal component insertion problem involves a factor $k$ fewer parameters than the original problem of finding a near-optimal starting configuration. We therefore expect it to be an easier problem. Of course, the intuitive assumption is that we can find the optimal (with respect to log likelihood) $(k+1)$-component mixture by local search if we start the search from the mixture obtained by optimally inserting a new component into the optimal $k$-component mixture. We recently applied a similar strategy to the $k$-means clustering problem (Likas, Vlassis, & Verbeek, in press) and obtained encouraging results. Two recent theoretical results provide complementary motivation. The Kullback-Leibler (KL) divergence between two densities $f$ and $g$ is defined as
$$D(f \parallel g) = \int_{V} f(x) \log[f(x)/g(x)]\, dx, \tag{3.1}$$

where $V$ is the domain of the densities $f$ and $g$ (see Cover & Thomas, 1991, for details). Li and Barron (2000) showed that for an arbitrary probability density function $f$, there exists a sequence $\{f_i\}$ of finite mixtures such that $f_k(x) = \sum_{i=1}^{k} \pi_i\, \varphi(x; \theta_i)$ achieves KL divergence $D(f \parallel f_k) \le D(f \parallel g_P) + c/k$ for every $g_P = \int \varphi(x; \theta)\, P(d\theta)$. Hence, the difference between the KL divergence achievable by $k$-component mixtures and the KL divergence achievable by any (possibly nonfinite) mixture from the same family of components tends to zero at a rate of $c/k$, where $c$ is a constant that depends not on $k$ but only on the component family. Furthermore, Li and Barron (2000) showed that this bound is achievable by employing the greedy scheme discussed in section 3.2. This tells us that we can quickly approximate any density by the greedy procedure. Therefore, we might expect the results of the greedy procedure, as compared to methods that initialize all components at once, to differ more when fitting mixtures with many components. The sequence of mixtures generated by the greedy learning method can conveniently be used to guide a model selection process in the case of an unknown number of components. Recently, a result of "almost" concavity of the log likelihood of a data set under the maximum likelihood $k$-component mixture, as a function of the number of components $k$, was presented (Cadez & Smyth, 2000). The result states that the first-order Taylor approximation of the log likelihood of the maximum likelihood models as a function of $k$ is concave under very general conditions. Hence, if we use a penalized log-likelihood model selection criterion based on a penalty term that is concave or linear in $k$, then the penalized log likelihood is almost concave. This implies that if the "almost" concave turns out to be concave, then there is only one peak in the penalized log likelihood, and hence this peak can be relatively easily identified with a greedy construction method.
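Equation 3.1 can be illustrated with a quick Monte Carlo check against the known closed form for two univariate gaussians. This sketch, and all names in it, are ours rather than the paper's:

```python
import numpy as np

# Monte Carlo estimate of the KL divergence (equation 3.1),
#   D(f || g) = E_{x~f}[log f(x) - log g(x)],
# for two 1-d gaussians, where the closed form is
#   log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 s2^2) - 1/2.
rng = np.random.default_rng(0)
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0

def log_gauss(x, m, s):
    return -0.5 * np.log(2 * np.pi * s**2) - (x - m)**2 / (2 * s**2)

x = rng.normal(m1, s1, size=200_000)          # samples from f
kl_mc = np.mean(log_gauss(x, m1, s1) - log_gauss(x, m2, s2))
kl_exact = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(kl_mc, kl_exact)
```

The two values agree to a few decimal places, which is the usual sanity check for a Monte Carlo KL estimate.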
3.2 A General Scheme for Greedy Learning of Mixtures. Let $L(X_n, f_k) = \sum_{i=1}^{n} \log f_k(x_i)$ (we will just write $L_k$ if no confusion arises) denote the log likelihood of the data set $X_n$ under the $k$-component mixture $f_k$. The greedy learning procedure above can be summarized as follows:

1. Compute the optimal (yielding maximal log likelihood) one-component mixture $f_1$. Set $k := 1$.

2. Find the optimal new component $\varphi(x; \theta^*)$ and corresponding mixing weight $\alpha^*$,

$$\{\theta^*, \alpha^*\} = \arg\max_{\{\theta, \alpha\}} \sum_{i=1}^{n} \log[(1 - \alpha) f_k(x_i) + \alpha\, \varphi(x_i; \theta)], \tag{3.2}$$

while keeping $f_k$ fixed.

3. Set $f_{k+1}(x) := (1 - \alpha^*) f_k(x) + \alpha^*\, \varphi(x; \theta^*)$ and $k := k + 1$.

4. Optional: Update $f_k$ using EM until convergence.

5. If a stopping criterion is met, then quit; else go to step 2.

Clearly, the crucial step is the component insertion (step 2), which will be the topic of the next section. Let us make a few remarks before we turn to that section. First, the stopping criterion in step 5 can be used to force the algorithm to find a mixture of a prespecified number of components. Step 5 may also implement any kind of model selection criterion. Second, note that step 4 may also be implemented by algorithms other than EM. Furthermore, step 4 is not needed to obtain the approximation result of Li and Barron (2000). Third, note the similarity to the vertex direction method (VDM) (Böhning, 1995; Lindsay, 1983; Wu, 1978) to learn GMs. Indeed, VDM may be regarded as a specific choice for implementing the search needed for step 2, as we explain in the discussion in Section 6.

4 Efficient Search for New Components

Suppose we have obtained a $k$-component mixture $f_k$, and the quest is to find the component characterized by equation 3.2. It is easily shown that if we fix $f_k$ and $\theta$, then $L_{k+1}$ is concave as a function of $\alpha$ only, allowing efficient optimization. For example, Vlassis and Likas (2002) used a second-order Taylor approximation. However, $L_{k+1}$ as a function of $\theta$ can have multiple maxima. Hence, we have to perform a global search among the new components in order to identify the optimum, since no constructive method to find the global maximum is known. In our new algorithm, the global search for the optimal new component is achieved by starting $km$ partial EM searches from initial parameter pairs $\{(\theta_i, \alpha_i)\}$, $1 \le i \le km$. By partial EM search, we mean that we fix $f_k$ and optimize $L_{k+1}$ over $\theta$ and $\alpha$ only. The use of partial EM searches is dictated
by the general scheme above, for which the results of Li and Barron (2000) hold. One may wonder why we would use such partial searches, since we might as well optimize over all parameters of the resulting $(k+1)$-component mixture. The answer is that if we updated all parameters, then the number of computations needed in each of the $km$ individual searches would be $O(nk)$ instead of $O(n)$. Each partial search starts with a different initial configuration. After these multiple partial searches, we end up with $km$ candidate new components together with their mixing weights. We pick the candidate component $\varphi(x; \hat{\theta})$ that maximizes the likelihood when mixed into the previous mixture by a factor $\hat{\alpha}$, as in equation 3.2. Then, in step 3 of the general algorithm, in place of the (unknown) optimal parameter pair $(\theta^*, \alpha^*)$, we use $(\hat{\theta}, \hat{\alpha})$. Note that a simple uniform quantization of the parameter space, as implemented for VDM for mixtures of univariate gaussian distributions in DerSimonian (1986), is not feasible in general since the number of quantization levels grows exponentially with the dimension of the parameter space. In the following section, we discuss step 2 as proposed in Vlassis and Likas (2002) and its drawbacks. In section 4.2, we present our improved search procedure.

4.1 Every Data Point Generates a Candidate. Vlassis and Likas (2002) proposed using $n$ candidate components. (Henceforth, we refer to this method as VL.) Every data point is the mean of a corresponding candidate. All candidates have the same covariance matrix $\sigma^2 I$, where $\sigma$ is set to a theoretically justifiable value. For each candidate component, the mixing weight $\alpha$ is set to the value maximizing the second-order Taylor approximation of $L_{k+1}$ around $\alpha = 1/2$. The candidate yielding the highest log likelihood when inserted into the existing mixture $f_k$ is selected. The selected component is then updated using partial EM steps before it is actually inserted into $f_k$ to give $f_{k+1}$.

Note that for every component insertion that is performed, all point candidates are considered. There are two main drawbacks to this method. First, using $n$ candidates in each step results in a time complexity of $O(n^2)$ for the search, which is unacceptable in most applications. $O(n^2)$ computations are needed since the likelihood of every data point under every candidate component has to be evaluated. Second, by using fixed small (referring to the size of the determinant of the covariance matrix) candidate components, the method keeps inserting small components in high-density areas, while in low-density areas, the density is modeled poorly. We observed this experimentally. Larger components that give greater improvement of the mixture are not among the candidates, nor are they found by the EM procedure following the component insertion. Working on improvements of the algorithm in Vlassis and Likas (2002), we noticed that both better and faster performance could be achieved by using a smaller set of candidates tailored to the existing mixture. In the
next section, we discuss how we can generate the candidates in a data- and mixture-dependent manner.

4.2 Generating Few Good Candidates. Based on our experience with the above method, we propose a new search heuristic for finding the optimal new mixture component. Two observations motivate the new search method:

1. The size (i.e., the determinant of the covariance matrix) of the new component should in general be smaller than the size of the existing components of the mixture.

2. As the existing mixture $f_k$ contains more components, the search for the optimal new component should become more thorough.

In our search strategy, we account for both observations by (1) setting the size of the initial candidates to a value related to (and in general smaller than) the size of the components in the existing mixture and (2) increasing the number of candidates linearly with $k$. For each insertion problem, our method constructs a fixed number of candidates per existing mixture component. Based on the posterior distributions, we partition the data set $X_n$ into $k$ disjoint subsets $A_i = \{x \in X_n : i = \arg\max_j P(j \mid x)\}$. If one is using EM for step 4 (of the general scheme given above), then the posteriors $P(i \mid x)$ are available directly since we have already computed them for the EM updates of the $k$-component mixture. For each set $A_i$ ($i = 1, \dots, k$), we construct $m$ candidate components. In our experiments we used $m = 10$ candidates, but more can be used, trading run time for better performance. In fact, as in Smola and Schölkopf (2000), we can compute confidence bounds of the type: "With probability at least $1 - \delta$, the best split among $m$ uniformly randomly selected splits is among the best $\epsilon$ fraction of all splits, if $m \ge \lceil \log \delta / \log(1 - \epsilon) \rceil$." To generate new candidates from $A_i$, we pick uniformly at random two data points $x_l$ and $x_r$ in $A_i$. Then we partition $A_i$ into two disjoint subsets $A_{il}$ and $A_{ir}$. For elements of $A_{il}$, the point $x_l$ is closer than $x_r$, and vice versa for $A_{ir}$. The mean and covariance of the sets $A_{il}$ and $A_{ir}$ are used as parameters for two candidate components. The initial mixing weights for candidates generated from $A_i$ are set to $\pi_i / 2$. To obtain the next two candidates, we draw new $x_l$ and $x_r$, until the desired number of candidates is reached.

The initial candidates can easily be replaced by candidates that yield higher likelihood when mixed into the existing mixture. To obtain these better candidates, we perform the partial EM searches, starting from the initial candidates. Each iteration of the $km$ partial updates takes $O(mnk)$ computations, since we have to evaluate the likelihood of each datum under each of the $mk$ candidate components. In the following section, we discuss how we can reduce the amount of computation needed by a factor $k$, resulting in only $O(mn)$ computations per iteration of the partial updates.
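The candidate-generation step just described can be sketched as follows (our own function and variable names; the subsequent partial EM refinement is omitted):

```python
import numpy as np

# Sketch of section 4.2's candidate generation: for the points A_i assigned
# to existing component i, repeatedly draw two anchors x_l, x_r, split A_i
# by nearest anchor, and use each half's mean/covariance as a candidate.
# Initial mixing weights are pi_i / 2, as in the text.
def candidates_from_subset(Ai, pi_i, m=10, rng=None):
    rng = rng or np.random.default_rng()
    cands = []
    while len(cands) < m:
        xl, xr = Ai[rng.choice(len(Ai), size=2, replace=False)]
        closer_l = (np.linalg.norm(Ai - xl, axis=1)
                    <= np.linalg.norm(Ai - xr, axis=1))
        for part in (Ai[closer_l], Ai[~closer_l]):
            if len(part) < 2:      # need at least 2 points for a covariance
                continue
            mu = part.mean(axis=0)
            C = np.cov(part.T)
            cands.append((mu, C, pi_i / 2))
    return cands[:m]
```

Each candidate is a (mean, covariance, weight) triple ready to be scored by mixing it into the existing mixture.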
Figure 1: The construction of a four-component mixture distribution.
We stop the partial updates if the change in log likelihood of the resulting $(k+1)$-component mixtures drops below some threshold or if some maximal number of iterations is reached. After these partial updates, we set the new component $\varphi(x; \theta_{k+1})$ to the candidate that maximizes the log likelihood when mixed into the existing mixture with its corresponding mixing weight. An example is given in Figure 1, which depicts the evolution of a solution for artificially generated data. The mixtures $f_1, \dots, f_4$ are depicted; each component is shown as an ellipse whose axes are the eigenvectors of the covariance matrix, with radii of twice the square root of the corresponding eigenvalues.
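Putting sections 3.2 and 4 together, a deliberately simplified one-dimensional toy version of the greedy loop might look as follows. The candidate search here (random data points as means, a small grid over $\alpha$, a heuristic variance) merely stands in for the partial EM machinery described above, and all names are ours:

```python
import numpy as np

# Toy 1-d sketch of the greedy scheme (steps 1-5 of section 3.2). The
# component search is simplified: candidate means are random data points,
# alpha is chosen from a small grid, and no EM refinement is performed.
rng = np.random.default_rng(0)

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def greedy_gmm_1d(x, k_max, n_cand=20):
    comps = [(x.mean(), x.var(), 1.0)]            # step 1: optimal f_1
    for _ in range(k_max - 1):
        fk = sum(w * gauss(x, m, v) for m, v, w in comps)
        best, best_ll = None, -np.inf
        for mu in rng.choice(x, size=n_cand):     # step 2: candidate search
            var = x.var() / len(comps)            # heuristic candidate size
            for alpha in (0.1, 0.3, 0.5):
                ll = np.sum(np.log((1 - alpha) * fk
                                   + alpha * gauss(x, mu, var)))
                if ll > best_ll:
                    best, best_ll = (mu, var, alpha), ll
        mu, var, alpha = best                     # step 3: insert component
        comps = ([(m, v, w * (1 - alpha)) for m, v, w in comps]
                 + [(mu, var, alpha)])
    return comps

x = np.concatenate([rng.normal(-5, 1, 200), rng.normal(5, 1, 200)])
comps = greedy_gmm_1d(x, k_max=2)
print(len(comps), sum(w for _, _, w in comps))
```

The mixing weights remain a convex combination after every insertion, mirroring step 3 of the general scheme.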
partial updates for a candidate from Ai purely on the data in Ai . The update equations are: aw ( xj I m kC 1 , C kC 1 ) (1 ¡ a) fk ( xj ) C aw ( xj I m kC 1 , C kC 1 ) P j2Ai p ( k C 1 | xj ) xj D P ( ) j2Ai p k C 1 | xj P ( )( )( )> j2Ai p k C 1 | xj xj ¡ m k C 1 xj ¡ m k C 1 P D j2Ai p ( k C 1 | xj ) P j2Ai p ( k C 1 | xj )
p ( k C 1 | xj ) D m kC 1 C kC 1
aD
N
(4.1) (4.2)
(4.3)
(4.4)
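A sketch of one pass of equations 4.1 through 4.4 follows (our own names; `N` is the total number of data points, and `fk_vals` holds the fixed mixture densities $f_k(x_j)$ on $A_i$):

```python
import numpy as np

# One partial EM update (equations 4.1-4.4): only the candidate's
# parameters (mu, C, alpha) change, the existing mixture f_k is held
# fixed, and all sums run over the subset A_i only.
def partial_em_step(Ai, fk_vals, mu, C, alpha, N):
    d = Ai.shape[1]
    diff = Ai - mu
    quad = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(C), diff)
    phi = (2 * np.pi) ** (-d / 2) * np.linalg.det(C) ** -0.5 \
          * np.exp(-quad / 2)
    # eq. 4.1: posterior of the candidate component on A_i
    p = alpha * phi / ((1 - alpha) * fk_vals + alpha * phi)
    s = p.sum()
    mu_new = p @ Ai / s                                   # eq. 4.2
    diff = Ai - mu_new
    C_new = (p[:, None] * diff).T @ diff / s              # eq. 4.3
    alpha_new = s / N                                     # eq. 4.4
    return mu_new, C_new, alpha_new
```

Because every sum runs only over $A_i$, each candidate's update costs $O(|A_i|)$, which is the source of the factor-$k$ saving described above.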
4.4 Time Complexity Analysis. The total run time of the algorithm is $O(k^2 n)$ (or $O(kmn)$ if $k < m$). This is due to the updates of the mixtures $f_i$, which cost $O(ni)$ computations each if we use EM for these updates. Therefore, summing over all mixtures, we get $O(\sum_{i=1}^{k} ni) = O(nk^2)$. This is a factor $k$ slower than the standard EM procedure. The run times of our experiments confirm this analysis; the new method performed on average about $k/2$ times slower than standard EM. However, if we use EM to learn a sequence of mixtures consisting of $1, \dots, k$ components, then the run time would also be $O(nk^2)$. If we do not use the speed-up, the run time would become $O(k^2 mn)$, since in that case the component allocation step for the $(i+1)$st mixture component would cost $O(imn)$. This is a factor $m$ more than the preceding EM steps; hence, the total run time goes up by a factor $m$.

5 Experimental Results
In this section, we present results of two experiments comparing the results obtained with $k$-means-initialized EM, the VL method discussed in section 4.1, split-and-merge EM (SMEM; Ueda, Nakano, Ghahramani, & Hinton, 2000), and the new greedy approach presented here. The SMEM algorithm checks after convergence of EM whether the log likelihood can be improved by splitting one mixture component in two and merging two other mixture components. To initialize SMEM, we also used the $k$-means initialization, which makes use of the results of applying the $k$-means or generalized Lloyd algorithm to the data (Gersho & Gray, 1992). The means are initialized as the centroids of the Voronoi regions (VRs) found and the covariance matrices as the covariance matrices of the data in the VRs. The mixing weights were initialized as the proportion of the data present in each VR. The centroids of the $k$-means algorithm were initialized as a uniformly randomly selected subset of the data.
5.1 Artificial Data. In this experiment, we generated data sets of 400 points in R^d with d ∈ {2, 5}. The data were drawn from a gaussian mixture of k ∈ {4, 6, 8, 10} components. The separation of the components was chosen as c ∈ {1, 2, 3, 4}. A separation of c means (Dasgupta, 1999)

∀ i ≠ j :  ‖μ_i − μ_j‖² ≥ c · max{trace(C_i), trace(C_j)}.   (5.1)
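The c-separation condition just defined can be checked directly when generating test mixtures; a minimal numpy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def is_c_separated(means, covs, c):
    """Check equation 5.1: every pair of component means must satisfy
    ||mu_i - mu_j||^2 >= c * max(trace(C_i), trace(C_j))."""
    k = len(means)
    for i in range(k):
        for j in range(i + 1, k):
            d2 = float(np.sum((means[i] - means[j]) ** 2))
            if d2 < c * max(np.trace(covs[i]), np.trace(covs[j])):
                return False
    return True
```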
For each mixture configuration, we generated 50 data sets. We allowed a maximum eccentricity (i.e., the largest singular value of the covariance matrix over the smallest) of 15 for each component. Also, for each generated mixture, we generated a test set of 200 points not presented to the mixture learning algorithm. In the comparison below, we evaluate the difference in log likelihood of the test sets under the mixtures provided by the different methods. In Table 1, we show experimental results comparing our greedy approach to VL, SMEM, and several runs of normal EM initialized with k-means. The average of the difference in log likelihood on the test set is given for different characteristics of the generating mixture. The results show that the greedy algorithm performs comparably to SMEM and generally outperforms VL. Both greedy methods and SMEM outperform the k-means initialization by far.

5.2 Texture Segmentation. In this experiment, the task is to cluster a set of 16 × 16 = 256 pixel image patches. The patches are extracted from 256 × 256 Brodatz texture images; four of these textures are shown in Figure 2. Because different patches from the same texture roughly differ from each other only in a limited number of degrees of freedom (translation, rotation, scaling, and lightness), patches from the same texture are assumed to form clusters in the 256-dimensional patch space. We conducted experiments where the number of textures from which patches were extracted ranged over k ∈ {2, . . . , 6}. For each value of k, we created 100 data sets of 500k patches each by picking 500 patches at random from each of k randomly selected textures. The experiment consisted of learning a gaussian mixture model with as many components as different textures involved. We compared our greedy approach to the same methods as in the previous experiment. To speed up the experiment, we projected the data into a linear subspace with principal component analysis.
We used a 50-dimensional subspace or a lower than 50-dimensional space if that could capture at least 80% of the data variance. Once the mixture model was learned, we segmented the patches by assigning each patch to the mixture component with the highest posterior probability for this patch. To evaluate the quality of the different segmentations discovered by the different learning methods, we considered how informative the found segmentations are. We measure how predictable the texture label of a patch is given the cluster of the
Efficient Greedy Learning of Gaussian Mixture Models
Table 1: Detailed Exposition of Log-Likelihood Differences.

Vlassis and Likas
                  d = 2                           d = 5
          c=1    c=2    c=3    c=4        c=1    c=2    c=3    c=4
k = 4     0.03   0.00   0.01   0.01       0.05   0.02  −0.01   0.04
k = 6     0.01   0.03   0.05   0.02       0.04   0.04   0.13   0.13
k = 8     0.00   0.04   0.06   0.03       0.03   0.07   0.13   0.04
k = 10    0.01   0.06   0.07   0.03       0.04   0.05   0.11   0.16

Best of k runs of k-means initialization EM
                  d = 2                           d = 5
          c=1    c=2    c=3    c=4        c=1    c=2    c=3    c=4
k = 4     0.03   0.14   0.13   0.43       0.02   0.28   0.36   0.37
k = 6     0.03   0.16   0.34   0.52       0.09   0.35   0.48   0.54
k = 8     0.02   0.23   0.38   0.55       0.12   0.45   0.75   0.82
k = 10    0.06   0.27   0.45   0.62       0.13   0.49   0.75   0.83

SMEM
                  d = 2                           d = 5
          c=1    c=2    c=3    c=4        c=1    c=2    c=3    c=4
k = 4    −0.01   0.00   0.00   0.00      −0.02   0.01   0.00   0.00
k = 6    −0.02   0.00   0.02   0.02      −0.06  −0.01  −0.01   0.00
k = 8    −0.02   0.00   0.00   0.03      −0.05  −0.04   0.00   0.02
k = 10   −0.00   0.00   0.03   0.04      −0.01   0.04   0.01   0.05

Distance to generating mixture
                  d = 2                           d = 5
          c=1    c=2    c=3    c=4        c=1    c=2    c=3    c=4
k = 4     0.04   0.03   0.03   0.02       0.16   0.13   0.14   0.11
k = 6     0.07   0.06   0.05   0.04       0.28   0.22   0.19   0.18
k = 8     0.10   0.07   0.07   0.09       0.45   0.33   0.32   0.42
k = 10    0.13   0.12   0.10   0.12       0.58   0.50   0.45   0.51

Notes: The upper three tables give the log likelihood of a method subtracted from the log likelihood obtained using our greedy approach. The bottom table gives averages of the log likelihood under the generating mixture minus the log likelihood given by our method.
class. This can be done using the conditional entropy (Cover & Thomas, 1991) of the texture label given the cluster label. In Table 2, the average conditional entropies (over 50 experiments) are given. The results of the greedy method and SMEM are roughly comparable, although the greedy approach was much faster. Both provide generally more informative segmentations than those obtained using the k-means initialization. Note that VL fails to produce informative clusters in this experiment, probably because the high dimensionality of the data renders spherical candidate components rather unfit.
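The informativeness measure used above can be computed in a few lines; a small numpy sketch of the conditional entropy H(texture label | cluster label), in bits (our own helper, not code from the paper):

```python
import numpy as np

def conditional_entropy(labels, clusters):
    """H(label | cluster) in bits: averages, over clusters, the entropy
    of the texture labels falling inside each cluster. Lower values
    mean a more informative segmentation."""
    labels, clusters = np.asarray(labels), np.asarray(clusters)
    n, h = len(labels), 0.0
    for c in np.unique(clusters):
        in_c = labels[clusters == c]
        p_c = len(in_c) / n
        _, counts = np.unique(in_c, return_counts=True)
        p = counts / counts.sum()
        h -= p_c * np.sum(p * np.log2(p))
    return h
```

A perfect segmentation gives 0, while totally uninformative clusters give log2(k), which is the "Uniform" line of Table 2.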
Figure 2: Several Brodatz textures. The white squares indicate the patch size.
Table 2: Average Conditional Entropy for Different Values of k.

            k = 2   k = 3   k = 4   k = 5   k = 6
Greedy      0.20    0.34    0.48    0.53    0.61
k-means     0.22    0.36    0.74    0.80    0.94
SMEM        0.19    0.21    0.46    0.56    0.68
VL          0.93    1.52    2.00    2.29    2.58
Uniform     1       1.59    2       2.32    2.58

Notes: The bottom line gives the conditional entropy if the clusters are totally uninformative. Boldface is used to identify the method with minimum conditional entropy for each k.
6 Discussion and Conclusion

6.1 Discussion. Both VDM (Böhning, 1995; Lindsay, 1983; Wu, 1978) and the proposed method are instantiations of the more general greedy scheme given in section 3.2 (although VDM skips step 4 of the scheme). VDM makes use of the directional derivative D_{f_k}(w) of the log likelihood for the current mixture f_k, where
D_{f_k}(w) = lim_{a→0} [ L(X_n, (1 − a) f_k + a w) − L(X_n, f_k) ] / a.
VDM proceeds by picking the w* that maximizes D_{f_k}(w) and inserting this in the mixture with a factor a* such that a* = arg max_a L(X_n, (1 − a) f_k + a w*). Using the directional derivative at the current mixture can be seen as an instantiation of step 2 of the general scheme. The optimization over both w and a is replaced by an optimization of D_{f_k}(w) over w (typically implemented by gridding the parameter space) and then an optimization over a. Note that moving in the direction of maximum D_{f_k}(w) does not guarantee that we move in the direction of maximum improvement of the log likelihood once we optimize over a subsequently. Recently, several novel methods to learn mixtures (of gaussian densities) were proposed (Dasgupta, 1999; Figueiredo & Jain, 2002; Ueda et al., 2000). The first tries to overcome the difficulties of learning mixtures of gaussian densities in high-dimensional spaces. By projecting the data to a lower-dimensional subspace (which is chosen uniformly at random), finding the means in that space, and then projecting them back, the problems of high dimensionality are reduced. The last two methods try to overcome the dependence of EM on the initial configuration, as does our method. In Ueda et al. (2000), split-and-merge operations are applied to local optima solutions found by applying EM. These operations constitute jumps in the parameter space that allow the algorithm to jump from a local optimum to a better region in the parameter space. By then reapplying EM, a better (hopefully global) optimum is found. The latter method and the greedy approach are complementary; they can be combined. The split-and-merge operations can be incorporated in the greedy algorithm by checking after each component insertion whether a component i can be removed such that the resulting log likelihood is greater than the likelihood before the insertion. If so, the component i is removed and the algorithm continues. If not, the algorithm proceeds as usual.
However, in experimental studies on artificially generated data, we found that this combination gave little improvement over the individual methods. An important benefit of our new method over Ueda et al. (2000) is that the new algorithm produces a sequence of mixtures that can be used to perform model complexity selection as the mixtures are learned. For example, a kurtosis-based selection criterion, like the one in Vlassis and Likas (1999), can be used here. We also note that our algorithm executes faster
than SMEM, which has a run time of O(nk³). Figueiredo and Jain (2002) proposed starting with a large number of mixture components and successively annihilating components with small mixing weights. This approach can be characterized as pruning a given mixture, whereas our approach can be characterized as growing a mixture. As compared to both these methods, ours does not need an initialization scheme, since the optimal one-component mixture can be found analytically. Also, Bayesian methods have been used to learn gaussian mixtures. For example, in Richardson and Green (1997), a reversible-jump Markov chain Monte Carlo (MCMC) method is proposed. There, the MCMC allows jumps between parameter spaces of different dimensionality (i.e., parameter spaces for mixtures consisting of differing numbers of components). However, the experimental results that Richardson and Green (1997) reported indicate that such sampling methods are rather slow compared to constructive maximum likelihood algorithms. It is reported that about 160 sweeps per second are performed on a SUN Sparc 4 workstation. The experiments involve 200,000 sweeps, resulting in about 20 minutes of run time. Although the 200,000 sweeps are not needed for reliable results, this contrasts sharply with the 2.8 seconds and 5.7 seconds of run time (allowing respectively about 480 and 960 sweeps) of the standard EM and our greedy EM in a similar experimental setting, also executed on a SUN Sparc 4 workstation.

6.2 Conclusion. We proposed an efficient greedy method to learn mixtures of gaussian densities that has run time O(k²n + kmn) for n data points, a final number of k mixture components, and m candidates per component. The experiments we conducted have shown that our greedy algorithm generally outperforms the k-means initialized EM. In experiments on artificial data, we observed that SMEM performs in general comparably to the new method in terms of the quality of the solutions found.
The results on natural data were comparable to those obtained with SMEM if SMEM was initialized with the k-means algorithm. An important benefit of the greedy method, compared to both SMEM and normal EM, is the production of a sequence of mixtures, which obviates the need for initialization and facilitates model selection. As compared to VL, the O(n²k²) time complexity has been reduced by a factor n; the somewhat arbitrary choice of spherical candidate components with fixed variance has been replaced by a search for candidate components that depends on the data and the current mixture; and experiments suggest that if the methods yield different performance, then the new method generally outperforms the old one.

Appendix: A Lower Bound for Fast Partial EM
In the insertion step, we have a two-component mixture problem. The first component is the mixture f_k, which we keep fixed in the partial EM steps.
The second component is the new component w, which has a mixing weight a. The initial k + 1 component mixture is given by f_{k+1} = a w + (1 − a) f_k. Using the maximization-maximization view of EM (Neal & Hinton, 1998), we can define a lower bound F on the actual log likelihood:

Σ_i log f_{k+1}(x_i) ≥ F(Q, θ) = Σ_i log f_{k+1}(x_i) − Σ_i D_KL( Q_i ‖ p(· | x_i) ),   (A.1)
where D_KL denotes the Kullback-Leibler divergence and p(· | x_i) denotes the posterior probability over the mixture components. The distributions Q_i are the responsibilities used in EM. We write q_i for the responsibility of the new component and (1 − q_i) for the responsibility of f_k for data point x_i. The above lower bound can be made to match the log likelihood by setting the responsibilities equal to the posteriors p(s | x_i). However, in our case, when considering a candidate based on component j, we fix the Q_i for all points outside A_j such that ∀ x_i ∉ A_j : q_i = 0. It is easy to see from equation A.1 that the Q_i for which x_i ∈ A_j should match the posterior probability on the new component in order to maximize F. Note that the closer the assumption that data points outside A_j have zero posterior on the new component is to being true, the tighter the lower bound becomes. If the data outside A_j indeed have zero posterior on the new component, the lower bound equals the log likelihood after the partial E-step. For the M-step, we use the alternative decomposition of F,

F(Q, θ) = Σ_i [ E_{Q_i} log p(x_i, s) + H(Q_i) ],   (A.2)
where we used the variable s to range over the two mixture components, f_k and w. Note that the entropies H(Q_i) remain fixed in the M-step since they do not depend on θ. The first term can be written as

Σ_i { q_i [log w(x_i) + log a] + (1 − q_i) [log f_k(x_i) + log(1 − a)] }   (A.3)

= Σ_{i∈A_j} q_i [log w(x_i) + log a] + Σ_i (1 − q_i) [log f_k(x_i) + log(1 − a)].   (A.4)
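A single pass of this restricted ("partial") EM can be sketched as follows; `Xj` holds only the points in A_j, `fk_Xj` their density under the fixed mixture f_k, and the names are our own illustration rather than the authors' code:

```python
import numpy as np

def gauss_pdf(X, m, C):
    # multivariate gaussian density, evaluated row-wise
    d = X.shape[1]
    diff = X - m
    expo = -0.5 * np.sum(diff @ np.linalg.inv(C) * diff, axis=1)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(C))

def partial_em_step(Xj, fk_Xj, n, a, m, C):
    """One partial EM step for the two-component problem of the
    appendix. Points outside A_j keep responsibility q_i = 0, so only
    the |A_j| points enter the first sum of equation A.4; n is the
    total number of data points."""
    # partial E-step: responsibilities of the new component on A_j
    num = a * gauss_pdf(Xj, m, C)
    q = num / (num + (1 - a) * fk_Xj)
    # partial M-step: maximize equation A.4 in a, m, C
    a = q.sum() / n
    m = (q[:, None] * Xj).sum(axis=0) / q.sum()
    diff = Xj - m
    C = (q[:, None] * diff).T @ diff / q.sum()
    return a, m, C
```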
Note that in equation A.4, the new mixture component w occurs only in terms for data in A_j. Setting the derivative of equation A.4 to zero with respect to the parameters of the new component (mixing weight a, mean m, and covariance matrix C) gives the update equations in section 4.3. Note that from equation A.4, F can also be evaluated using only the points in A_j if we stored the complete data log likelihood under f_k in advance.

Acknowledgments
We thank Aristidis Likas for useful discussions and a referee for pointing out the confidence bounds used in Smola and Schölkopf (2000). This research
is supported by the Technology Foundation STW (project AIF4997), applied science division of NWO, and the technology program of the Dutch Ministry of Economic Affairs. Software implementing the algorithm in MATLAB is available from J. J. V. by e-mail.
References

Böhning, D. (1995). A review of reliable maximum likelihood algorithms for semiparametric mixture models. J. Statist. Plann. Inference, 47, 5–28.
Cadez, I. V., & Smyth, P. (2000). On model selection and concavity for finite mixture models. In Proc. of Int. Symp. on Information Theory (ISIT). Available on-line at: http://www.ics.uci.edu/~icadez/publications.html.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Dasgupta, S. (1999). Learning mixtures of gaussians. In Proc. IEEE Symp. on Foundations of Computer Science. New York.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1), 1–38.
DerSimonian, R. (1986). Maximum likelihood estimation of a mixing distribution. J. Roy. Statist. Soc. C, 35, 302–309.
Figueiredo, M. A. T., & Jain, A. K. (2002). Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 381–396.
Gersho, A., & Gray, R. M. (1992). Vector quantization and signal compression. Norwell, MA: Kluwer.
Li, J. Q. (1999). Estimation of mixture models. Unpublished doctoral dissertation, Yale University.
Li, J. Q., & Barron, A. R. (2000). Mixture density estimation. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Likas, A., Vlassis, N., & Verbeek, J. J. (in press). The global k-means clustering algorithm. Pattern Recognition.
Lindsay, B. G. (1983). The geometry of mixture likelihoods: A general theory. Ann. Statist., 11(1), 86–94.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan (Ed.), Learning in graphical models (pp. 355–368). Norwood, MA: Kluwer.
Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. Roy. Statist. Soc. B, 59(4), 731–792.
Smola, A. J., & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. of Int. Conf. on Machine Learning (pp. 911–918). San Mateo, CA: Morgan Kaufmann.
Ueda, N., Nakano, R., Ghahramani, Z., & Hinton, G. E. (2000). SMEM algorithm for mixture models. Neural Computation, 12, 2109–2128.
Vlassis, N., & Likas, A. (1999). A kurtosis-based dynamic approach to gaussian mixture modeling. IEEE Trans. Systems, Man, and Cybernetics, Part A, 29, 393–399.
Vlassis, N., & Likas, A. (2002). A greedy EM algorithm for gaussian mixture learning. Neural Processing Letters, 15(1), 77–87.
Wu, C. (1978). Some algorithmic aspects of the theory of optimal designs. Annals of Statistics, 6, 1286–1301.

Received November 13, 2001; accepted June 27, 2002.
LETTER
Communicated by John Platt
SMO Algorithm for Least-Squares SVM Formulations

S.S. Keerthi
[email protected]
Department of Mechanical Engineering, National University of Singapore, Singapore 117576

S.K. Shevade
[email protected]
Genome Institute of Singapore, National University of Singapore, Singapore 117528

This article extends the well-known SMO algorithm of support vector machines (SVMs) to least-squares SVM formulations that include LS-SVM classification, kernel ridge regression, and a particular form of regularized kernel Fisher discriminant. The algorithm is shown to be asymptotically convergent. It is also extremely easy to implement. Computational experiments show that the algorithm is fast and scales efficiently (quadratically) as a function of the number of examples.

1 Introduction
This article concerns least-squares formulations of support vector machines (SVMs). These formulations, which provide interesting and useful alternatives to SVMs, have been proposed for both regression and classification. Saunders, Gammerman, and Vovk (1998) proposed the kernel version of ridge regression and empirically demonstrated its usefulness. Suykens and Vandewalle (1999) gave an interesting and useful extension of the idea to classification and called their method least-squares SVM (LS-SVM). A careful empirical study by Van Gestel et al. (2000) has shown that LS-SVM is comparable to SVM in terms of generalization performance. Recently, Van Gestel et al. (2002) also established the equivalence of LS-SVM with a particular form of regularized kernel Fisher discriminant (KFD) method (Mika, Rätsch, & Müller, 2001). While SVMs are characterized by convex quadratic programming problems involving inequality constraints, kernel ridge regression and LS-SVMs are formulated using equality constraints only. Hence, their solution is obtained by solving a system of linear equations (of the order of the number of training examples) involving the dense kernel matrix. To tackle large-scale problems, Suykens, Lukas, Van Dooren, De Moor, and Vandewalle (1999) and Van Gestel et al. (2000) effectively employed the conjugate-gradient method. In this article, we propose a new iterative algorithm for these problems. It is based on the solution of the Wolfe dual problem using ideas similar to those of the SMO (sequential minimal optimization) algorithm for SVMs (Platt, 1998; Keerthi, Shevade, Bhattacharyya, & Murthy, 2001). The algorithm is extremely easy to implement and scales efficiently (quadratically) as a function of the number of examples. The scaling is much better than that of the CG algorithm. The article is organized as follows. In section 2, we formulate the problem and develop the Wolfe dual. Optimality conditions for the dual are derived in section 3. A useful condition for terminating numerical algorithms solving the least-squares formulations is also given. These conditions form the basis for the SMO algorithm developed in section 4, where we also prove that the algorithm is asymptotically convergent. In section 5, we discuss the conjugate gradient method of Suykens et al. (1999) and show how the termination criterion derived in section 3 can be easily applied to this method. Computational experiments describing the effectiveness of the algorithm are given in section 6. Section 7 briefly discusses the solution of KFD. Section 8 extends the SMO algorithm to least-squares formulations without the threshold parameter, followed by concluding remarks in section 9. The appendix gives a pseudocode for the SMO algorithm.

Neural Computation 15, 487–507 (2003) © 2002 Massachusetts Institute of Technology

2 Problem Formulation
Throughout, we will use x to denote the input vector of the classification/regression problem and z to denote the feature space vector related to x by the transformation z = φ(x). As in all other kernel designs, we do not assume φ to be known; all computations will be done using only the kernel function,

k(x, x̂) = φ(x) · φ(x̂),   (2.1)
where “·” denotes the inner product in the z space. Let {(x_i, y_i)}_{i=1}^m denote the training set, where x_i is the ith input pattern and y_i is the corresponding target value. For classification problems (two classes), y_i takes only two possible values (say, 1 if x_i is in class 1 and −1 if x_i is in class 2), while for regression, y_i takes any real value. Let z_i = φ(x_i). In kernel ridge regression¹ and classification by LS-SVMs, the following optimization problem is solved:

min E = (1/2)‖w‖² + (C/2) Σ_i ξ_i²   s.t.   y_i − (w · z_i − b) = ξ_i  ∀i.   (2.2)
¹ Saunders et al. (1998) leave out b from their formulation. This actually leads to a simpler problem. We will discuss its solution in section 8.
Define

w̃ = ( w ; √C ξ ),   z̃_i = ( z_i ; (1/√C) e_i )  ∀i,   (2.3)

where ξ is the m-dimensional vector containing the ξ_i's and, for every i, e_i is the m-dimensional vector in which the ith component is 1 and all other components are 0. Define k̃(x_i, x_j) = z̃_i · z̃_j. By equations 2.1 and 2.3,

k̃(x_i, x_j) = k(x_i, x_j) + (1/C) δ_ij,   (2.4)

where δ_ij = 1 if i = j and 0 otherwise. It is easy to see that equation 2.2 transforms to

min E = (1/2)‖w̃‖²   s.t.   y_i − (w̃ · z̃_i − b) = 0  ∀i.   (2.5)
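In matrix terms, equation 2.4 just adds 1/C to the diagonal of the kernel matrix. A small numpy sketch (the RBF kernel is an illustrative choice; the paper leaves k generic):

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=1.0):
    # k(x, x') = exp(-gamma * ||x - x'||^2), an illustrative kernel
    sq = np.sum(X ** 2, axis=1)
    return np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))

def kernel_tilde(K, C):
    # equation 2.4: k~(x_i, x_j) = k(x_i, x_j) + delta_ij / C
    return K + np.eye(len(K)) / C
```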
The Lagrangian for this problem is

L = (1/2)‖w̃‖² + Σ_i α_i [ y_i − (w̃ · z̃_i − b) ].   (2.6)
The KKT optimality conditions are given by

∇_w̃ L = w̃ − Σ_i α_i z̃_i = 0,   (2.7)

∂L/∂b = Σ_i α_i = 0.   (2.8)
Note that because equation 2.5 has only equality constraints, the α_i's are not constrained to be nonnegative. Using equation 2.7, w̃ can be expressed as a function of the α_i's:

w̃(α) = Σ_i α_i z̃_i.   (2.9)
Let us now apply Wolfe duality theory to equation 2.5. The Wolfe dual corresponds to the maximization of L subject to equations 2.7 and 2.8, with w̃, b, and the α_i's as variables. Using equations 2.8 and 2.9, we can simplify the Wolfe dual to

max f(α) = −(1/2)‖w̃(α)‖² + Σ_i α_i y_i   s.t.   Σ_i α_i = 0.   (2.10)
Note that ‖w̃(α)‖² = Σ_i Σ_j α_i α_j k̃(x_i, x_j). Hence, equation 2.10 is a simple convex quadratic programming problem involving a single equality constraint and no nonnegativity constraints. Once the α_i's are obtained by solving equation 2.10, the primal variables, w and the ξ_i's, can be determined using equations 2.9 and 2.3. The determination of b will be addressed in the next section.

3 Optimality Conditions for the Dual
To derive proper stopping conditions for algorithms that solve the dual and also determine the threshold parameter b, it is important to write down the optimality conditions for the dual. The Lagrangian for the problem in equation 2.10 is

L̄ = −(1/2)‖w̃(α)‖² + Σ_i α_i y_i + β Σ_i α_i.   (3.1)
Define

F_i = −∂f/∂α_i = w̃(α) · z̃_i − y_i = Σ_j α_j k̃(x_i, x_j) − y_i,   (3.2)
where f is defined in equation 2.10. The KKT conditions for the dual problem are

∂L̄/∂α_i = β − F_i = 0  ∀i.   (3.3)
Define

b_up = max_i F_i,   i_up = arg max_i F_i,
b_low = min_i F_i,   i_low = arg min_i F_i.   (3.4)
Then optimality conditions will hold at a given α iff

b_low = b_up.   (3.5)
In equations 3.2 and 3.4, note that F_i, b_up, i_up, b_low, and i_low are all functions of α. The functional dependencies have not been noted explicitly to avoid notational clutter. Using equations 3.3, 3.2, and 2.5, it is easy to see the close relationship between the threshold parameter b in the primal problem and the multiplier β. In particular, at optimality, b and β are identical. Therefore, in the rest of the article, b and β will denote the same quantity.
We will say that an index pair (i, j) defines a violation at α if

F_i ≠ F_j.   (3.6)
Thus, optimality conditions will hold at α iff there does not exist any index pair (i, j) that defines a violation. Suppose (i, j) satisfies equation 3.6 at some α. Then it is possible to achieve an increase in f (while maintaining the equality constraint, Σ_k α_k = 0) by adjusting α_i and α_j only. To see this, let us define the following:

α̃_i(t) = α_i − t,   α̃_j(t) = α_j + t,   α̃_k(t) = α_k  ∀k ≠ i, j,   (3.7)

and

ψ(t) = f(α̃(t)).   (3.8)
Then it is easy to verify that

ψ′(t) = F_i − F_j,   (3.9)

where F_i and F_j are evaluated at α̃(t). Since, by equation 3.6, F_i − F_j ≠ 0 at t = 0, an increase in ψ is possible by choosing t suitably away from 0. Since in numerical solution it takes too much computing time to compute the exact optimal solution, there is a need to define approximate stopping conditions that guarantee a specified closeness to the solution of the primal problem, equation 2.2. We consider a stopping condition based on duality gaps. This has been used effectively by Smola and Schölkopf (1998) in SVM regression. Suppose α is any vector that satisfies Σ_i α_i = 0, b is any real number, w = Σ_i α_i z_i, and
ξ_i = y_i − (w · z_i − b) = b + α_i/C − F_i,   (3.10)
where F_i is as defined by equation 3.2. Note that (w, b, ξ) satisfies the constraints of the primal, equation 2.2, and α satisfies the constraints of the dual, equation 2.10. Let E denote the primal cost corresponding to (w, b, ξ), E* be the optimal primal cost of equation 2.2, f denote the dual objective function value given by α, and f* be the optimal dual objective function value. By Wolfe duality (Fletcher, 1987), we have

E ≥ E* = f* ≥ f.   (3.11)

Although E* (= f*) is unknown, equation 3.11 allows us to obtain guaranteed bounds using the duality gap (a computable quantity) defined by

D_gap = E − f = Σ_i [ α_i ( F_i − α_i/(2C) ) + (C/2) ξ_i² ].   (3.12)
D_gap is a nonnegative quantity that becomes zero iff optimality is reached. Thus, if {α^r} is a sequence that converges to α*, the solution of equation 2.10, then the continuity of the D_gap function implies that {D_gap^r} goes to zero, where D_gap^r is the value of D_gap at α^r. Also, for any nontrivial problem, E* = f* > 0. Thus, it is appropriate to choose ε > 0 and use the stopping criterion

D_gap ≤ ε f.   (3.13)

More important, if equation 3.13 is satisfied, then we can obtain a guaranteed bound for the primal problem using equation 3.11:

E − E* ≤ D_gap ≤ ε f ≤ ε E.   (3.14)
Thus, if ε = 10^−p and equation 3.13 is satisfied, then the closeness of E to E* has more than p decimal digits of accuracy. These ideas are used in interior point methods of linear programming (Vanderbei, 1994) and have also been employed in SVMs (Smola & Schölkopf, 1998). In our implementation of the SMO algorithm derived in the next section, we employ equation 3.13 as the stopping criterion. In section 5, we explain how to use equation 3.13 to efficiently stop the conjugate gradient algorithm for LS-SVM proposed by Suykens et al. (1999). At a given α, we need a method for determining a good value of b for use in the above calculations as well as for use in the classifier function defined by

o(x) = w · φ(x) − b = Σ_i α_i k(x, x_i) − b.   (3.15)

One possibility is to take

b = (b_low + b_up)/2,   (3.16)

where b_up and b_low are as defined in equation 3.4. An alternative method is to choose b to minimize D_gap after fixing w at w(α). This is equivalent to minimizing Σ_i ξ_i², which can be done in closed form to get

b = (1/m) Σ_i ( F_i − α_i/C ).   (3.17)
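Given the α_i's and the cached F_i's, the duality-gap test is a few lines; a sketch using the b of equation 3.17 (function name ours):

```python
import numpy as np

def duality_gap(alpha, F, C):
    """Equation 3.12, with b chosen by equation 3.17 (which minimizes
    the gap for fixed w) and xi recovered from equation 3.10."""
    b = np.mean(F - alpha / C)                 # equation 3.17
    xi = b + alpha / C - F                     # equation 3.10
    return np.sum(alpha * (F - alpha / (2 * C)) + 0.5 * C * xi ** 2)
```

The stopping rule of equation 3.13 is then `duality_gap(alpha, F, C) <= eps * f`.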
4 SMO Algorithm
In this section, we give the SMO algorithm for solving equation 2.10. A basic step consists of starting with a point α and optimizing only two variables,
α_i and α_j, to form the new point α^new. Consider equations 3.7 and 3.8. Given equation 3.9, the natural choice is i = i_up and j = i_low, so as to make |ψ′(0)| as large as possible. Using the notations of section 3, we can write the optimization problem and the resulting solution as

t* = arg max_t ψ(t)   and   α^new = α̃(t*).   (4.1)
The function ψ(t) is a simple quadratic function of t,

ψ(t) = ψ(0) + t ψ′(0) + (t²/2) ψ″(0),   (4.2)
where ψ′(0) is given by equation 3.9, and

ψ″(0) = η = 2 k̃(x_i, x_j) − k̃(x_i, x_i) − k̃(x_j, x_j).   (4.3)
Thus, equation 4.1 can be easily solved in closed form:

t* = −ψ′(0)/ψ″(0) = −(F_i − F_j)/η.   (4.4)
The SMO algorithm can now be described. It is very much along the lines of the simplified version of the SMO algorithm for SVMs given by Chang and Lin (2001).

SMO Algorithm for Equation 2.10

1. Choose any α⁰, and set r = 0.
2. If α^r satisfies equation 3.5, stop. If not, set α = α^r, choose i = i_up, j = i_low, and solve equation 4.1.
3. Let α^{r+1} = α^new, r := r + 1, and go back to step 2.
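A minimal numpy sketch of this loop, with the F-cache update of equation 4.10 and an illustrative tolerance on b_up − b_low standing in for the exact test of equation 3.5 (the paper's own implementation stops on the duality gap, equation 3.13):

```python
import numpy as np

def smo_lssvm(K, y, C, tol=1e-8, max_iter=10000):
    """SMO for the dual of equation 2.10. K is the plain kernel
    matrix; the 1/C ridge of equation 2.4 is added internally."""
    Kt = K + np.eye(len(y)) / C                       # equation 2.4
    alpha = np.zeros(len(y))
    F = -y.astype(float)                              # F_i = -y_i at alpha = 0
    for _ in range(max_iter):
        i, j = int(np.argmax(F)), int(np.argmin(F))   # i_up, i_low
        if F[i] - F[j] < tol:                         # approximate equation 3.5
            break
        eta = 2 * Kt[i, j] - Kt[i, i] - Kt[j, j]      # equation 4.3
        t = -(F[i] - F[j]) / eta                      # equation 4.4
        alpha[i] -= t                                 # equation 3.7
        alpha[j] += t
        F += t * (Kt[j] - Kt[i])                      # equation 4.10
    b = 0.5 * (F.max() + F.min())                     # equation 3.16
    return alpha, b
```

Predictions then follow equation 3.15, o(x) = Σ_i α_i k(x, x_i) − b.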
We now establish the asymptotic convergence of the algorithm. The absence of hard boundaries in the optimization makes the proof of convergence much simpler than the corresponding proofs for the SMO algorithm for SVMs. Putting equation 4.4 in equation 4.2 and simplifying, we get

f(α^{r+1}) − f(α^r) = ψ(t*) − ψ(0) = −(t*)² η / 2.   (4.5)

Using equation 2.4, we see that η ≤ −2/C. Also, |t*| = ‖α^{r+1} − α^r‖/√2. Thus, equation 4.5 implies a sufficient increase in f in each basic step of the SMO algorithm:

f(α^{r+1}) − f(α^r) ≥ (1/(2C)) ‖α^{r+1} − α^r‖².   (4.6)
Because of equation 2.4, −f is a positive definite quadratic form. Hence, the solution of equation 2.10 is unique. Call this point α*. We show the following result.

Lemma 1. {α^r} converges to α*.
Proof. Since the algorithm increases f at each step and f is bounded above, {f(α^r)} is a convergent sequence. By equation 4.6, we get that {α^{r+1} − α^r} converges to 0. Because −f is a positive definite quadratic form, the set {α | f(α) ≥ f(α⁰)} is a compact set. Since {α^r} lies in this set, it is a bounded sequence and therefore has at least one limit point. Let {α^{r(s)}}_{s≥0} denote a convergent subsequence and ᾱ denote the limit point to which it converges. For any r ≥ 0, let i(r) = i_up(α^r) and j(r) = i_low(α^r), the two indices chosen for optimization at the rth step. Since ψ′(t*) = 0 for t* given by equation 4.1, we get from equation 3.9 that

F_{i(r)}(α^{r+1}) − F_{j(r)}(α^{r+1}) = 0.   (4.7)

Since there is only a finite number of indices, there exists at least one pair (i₁, j₁) such that i₁ = i(r(s)) and j₁ = j(r(s)) for infinitely many s. Let us restrict ourselves to this subsequence. To keep notations simple, we rename the subsequence and take

i₁ = i(r(s)) = i_up(α^{r(s)})   and   j₁ = j(r(s)) = i_low(α^{r(s)})   ∀s ≥ 0.

Since b_up and b_low are continuous functions of α, we also get

b_up(ᾱ) − b_low(ᾱ) = lim_{s→∞} [ b_up(α^{r(s)}) − b_low(α^{r(s)}) ]
                   = lim_{s→∞} [ F_{i₁}(α^{r(s)}) − F_{j₁}(α^{r(s)}) ]
                   = lim_{s→∞} [ P(s) + Q(s) + R(s) ],   (4.8)

where

P(s) = F_{i₁}(α^{r(s)}) − F_{i₁}(α^{r(s)+1}),
Q(s) = F_{i₁}(α^{r(s)+1}) − F_{j₁}(α^{r(s)+1}),
R(s) = F_{j₁}(α^{r(s)+1}) − F_{j₁}(α^{r(s)}).   (4.9)

Since {α^{r(s)+1} − α^{r(s)}} converges to 0, lim_{s→∞} P(s) = 0 and lim_{s→∞} R(s) = 0. By equation 4.7, Q(s) = 0 ∀s. Thus, equation 4.8 yields b_up(ᾱ) − b_low(ᾱ) = 0. By equation 3.5, ᾱ is a solution of equation 2.10; that is, ᾱ = α*. Strict convexity of −f implies that the sequence {α^r} itself converges to α*. This completes the proof.
The implementation of the SMO algorithm is extremely simple. A pseudocode is given in the appendix. For efficiency, a cache is maintained for all the F_i's. At each basic step, when only two indices, say i₁ and i₂, change their values from α_{i₁}^old and α_{i₂}^old to α_{i₁}^new and α_{i₂}^new, the F_i's can be updated as

F_i^new = F_i^old + (α_{i₁}^new − α_{i₁}^old) k̃(x_{i₁}, x_i) + (α_{i₂}^new − α_{i₂}^old) k̃(x_{i₂}, x_i)  ∀i.   (4.10)
Also, the updating of the dual objective function can be written, using equations 4.4 and 4.5, as

    f^new = f^old − η (t*)² / 2,    (4.11)

where t* = α_{i₂}^new − α_{i₂}^old = α_{i₁}^old − α_{i₁}^new.

The algorithm can be initialized with α = 0. For this choice, F_i = −y_i for all i and f(α) = 0. When the regularization parameter C is tuned (say, using k-fold cross validation), we usually come across a situation in which the solution of equation 2.10 is available for some C and equation 2.10 needs to be solved again for δC, where δ > 0. In such a situation, we can do alpha seeding (DeCoste & Wagstaff, 2000), that is, use the solution at C to initialize the variables for solving the problem at δC. We give the details below. Suppose α is the solution at C, and f, F_i ∀ i are available. To start the algorithm for δC, we can use the same α. For this choice, the F_i for δC can be initialized using
    F_i := F_i − ((δ − 1)/(δC)) α_i  ∀ i.    (4.12)
After the F_i are computed, f can be obtained using

    f = (1/2) Σ_i α_i (y_i − F_i).    (4.13)
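Written out, one pass of the method is just the pair update of equation 4.1 plus the cache update of equation 4.10. The following NumPy sketch illustrates this on a small synthetic problem; the data, the Gaussian kernel with σ² = 1, and the form k̃(x_i, x_j) = k(x_i, x_j) + δ_ij/C are illustrative assumptions, not details fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem (not from the paper): 20 points, Gaussian kernel.
m, C = 20, 1.0
X = rng.normal(size=(m, 2))
y = np.where(X[:, 0] > 0.0, 1.0, -1.0)

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                 # Gaussian kernel, sigma^2 = 1 (assumed)
Kt = K + np.eye(m) / C                # assumed form k~(x_i, x_j) = k(x_i, x_j) + delta_ij / C

alpha = np.zeros(m)
F = Kt @ alpha - y                    # F_i = (K~ alpha)_i - y_i; equals -y_i at alpha = 0

for _ in range(50000):
    i_up, i_low = int(np.argmax(F)), int(np.argmin(F))   # maximally violating pair
    if F[i_up] - F[i_low] < 1e-10:    # stop when all F_i are (nearly) equal
        break
    eta = 2.0 * Kt[i_up, i_low] - Kt[i_up, i_up] - Kt[i_low, i_low]   # eta < 0
    t = -(F[i_up] - F[i_low]) / eta   # Newton step size, t > 0
    alpha[i_low] += t                 # move along e_low - e_up,
    alpha[i_up] -= t                  # which preserves sum(alpha) = 0
    F += t * (Kt[:, i_low] - Kt[:, i_up])   # O(m) F-cache update, cf. equation 4.10

b = F.mean()                          # at optimality, F_i = b for all i
```

At convergence the F_i are all (nearly) equal, and their common value is the threshold b; the same α and b can be checked against a direct solve of the optimality system.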
An alternative is to initialize with δα for solving the problem at δC. Update formulas similar to equations 4.12 and 4.13 can be derived for this case too. Both initializations above work effectively and give similar performance.

5 Conjugate Gradient Algorithm
Suykens et al. (1999) devised an efficient method for obtaining the solution of LS-SVM using the conjugate gradient technique (Golub & Van Loan, 1996). This method is well suited for problems with a large number of examples. Let K̃ be the m × m matrix whose (i, j)th element is k̃(x_i, x_j). The steps of their method are as follows.²

² The notations and description of the method given here are different from those given by Suykens et al. (1999).
CG Method.

1. Solve

       K̃γ = e  and  K̃ν = y    (5.1)

   for γ and ν using the conjugate gradient method (twice), where e is a vector with all elements equal to 1 and y is a vector whose ith component is y_i.

2. Compute b = −γ·y / γ·e, and set α = ν + bγ.

Although the conjugate gradient algorithm solves a system of m linear equations in m steps, for efficiency it is important to stop the algorithm much earlier than m steps. Hamers, Suykens, and De Moor (2001) suggest stopping the algorithm when one of three criteria is satisfied; for details, see their report. These criteria are not directly related to the solution of the problem in equation 2.2 and hence cannot be used efficiently to stop the algorithm with a prescribed guarantee on the primal solution, such as satisfying equation 3.14. Below we explain how D_gap can be computed easily in the CG method so that equation 3.14 can be satisfied by using equation 3.13.

Although the linear systems in equation 5.1 are decoupled, they are solved together for efficient use of kernel computations. At any stage of the conjugate gradient method applied to them, γ and ν, which approximately satisfy equation 5.1, are available. The residue vectors, r_γ = e − K̃γ and r_ν = y − K̃ν, play a crucial role in the computations and are also available. Step 2 of the CG method can be used to construct approximate b and α. Using that α, it is possible to compute F (the vector containing the F_i's) as

    F = K̃α − y = b(e − r_γ) − r_ν.    (5.2)
Then D_gap can be obtained using equations 3.10 and 3.12. However, there is a difficulty associated with the use of D_gap computed in this way for ensuring equation 3.14. Recall from section 3 that to derive equation 3.14, an important requirement is that the constraint Σ_i α_i = 0 has to be satisfied. Since γ and ν are only approximate solutions of equation 5.1, this condition is usually not satisfied. Fortunately, this issue is easily taken care of if we replace step 2 of the CG method with this alternate implementation:

    Compute b = −ν·e / γ·e and set α = ν + bγ.    (5.3)

What we have done is replace γ·y with ν·e in the calculation of b. Note that when γ and ν satisfy equation 5.1, we have γ·y = ν·e. Once step 2 is replaced by equation 5.3, application of equations 5.2, 3.10, 3.12, and 3.13 guarantees equation 3.14.
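A small NumPy sketch makes the point concrete: a plain textbook CG routine (not the stopping rules of Hamers et al., 2001) is run at a deliberately loose tolerance on a synthetic symmetric positive definite stand-in for K̃, and the corrected step 2 still makes Σ_i α_i vanish exactly, while F comes from the residues via equation 5.2 with no extra kernel evaluations.

```python
import numpy as np

def conj_grad(A, rhs, tol=1e-2, max_iter=500):
    """Plain conjugate gradient for symmetric positive definite A.
    Returns the approximate solution and the final residue rhs - A x."""
    x = np.zeros_like(rhs)
    r = rhs.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = A @ p
        step = rs / (p @ Ap)
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, r

rng = np.random.default_rng(1)
m = 30
M = rng.normal(size=(m, m))
Kt = M @ M.T / m + np.eye(m)       # synthetic s.p.d. stand-in for the kernel matrix K~
e = np.ones(m)
y = rng.choice([-1.0, 1.0], size=m)

# Step 1: solve the two decoupled systems of equation 5.1 (here only approximately).
gamma, r_gamma = conj_grad(Kt, e)  # r_gamma = e - K~ gamma
nu, r_nu = conj_grad(Kt, y)        # r_nu    = y - K~ nu

# Corrected step 2, equation 5.3: sum(alpha) = 0 exactly, even for approximate gamma, nu.
b = -(nu @ e) / (gamma @ e)
alpha = nu + b * gamma

# F from the residues, equation 5.2: F = K~ alpha - y = b (e - r_gamma) - r_nu.
F = b * (e - r_gamma) - r_nu
```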
Table 1: Computational Costs for SMO and CG Algorithms (α = 0 Initialization).

              Banana          Image             Splice          Waveform
              σ² = 1.8221     σ² = 2.7183       σ² = 29.9641    σ² = 24.5325
    log₁₀ C   SMO      CG     SMO       CG      SMO    CG       SMO    CG
    -4        0.90     0.32   9.30      3.38    4.73   1.00     0.93   0.32
    -3        0.83     0.48   7.48      5.07    4.35   2.00     0.91   0.48
    -2        0.53     0.64   5.17      8.05    2.94   3.00     0.52   0.64
    -1        0.50     1.28   4.84      20.28   3.13   6.00     0.42   1.12
     0        0.70     2.24   7.05      49.01   3.07   13.00    0.54   2.40
     1        3.53     4.00   33.65     131.82  5.33   30.00    2.39   5.76
     2        31.58    7.84   254.13    410.67  11.45  65.00    10.51  14.56
     3        320.42   14.40  1908.62   1275.95 29.88  89.00    21.62  23.36
     4        3317.90  32.32  11825.40  4096.56 99.90  109.00   28.33  27.36

Note: Each unit corresponds to 10⁶ kernel evaluations.
As in the SMO algorithm, alpha seeding can be done for the CG method also, as pointed out by Hamers et al. (2001). Efficient update formulas for r_ν, r_γ, and the dual objective function can be written down similar to equations 4.12 and 4.13.

6 Numerical Experiments on Classification Problems

The SMO and CG methods were implemented in Fortran and executed on a Pentium IV 1.5 GHz machine running on the Windows platform. The gaussian kernel k(x, x̄) = exp(−‖x − x̄‖²/(2σ²)) was used. For both methods we used the termination criterion in equation 3.13 with ε = 10⁻⁶. We empirically compared the two methods in terms of computational cost, measured using the number of kernel evaluations, since the bulk of the effort in both algorithms resides in computing kernels. Four benchmark data sets were used for this purpose: Banana, Image, Splice, and Waveform. Detailed information about these data sets can be found in Rätsch (1999). For each data set, we fixed σ at a suitable value that gives good generalization performance and varied C over a wide range.

In the first experiment, we tried the following nine C values: 10^i, i = −4, −3, …, 3, 4. For each C, the methods were initialized with α = 0. The computational costs associated with the four data sets are given in Table 1 as functions of C.

In the second experiment, we tested the effect of alpha seeding on the computational cost. C was sequentially varied from 2⁻¹⁰ to 2¹⁰ in multiples of 2. Thus, there were 21 C values. For C = 2⁻¹⁰, the methods were initialized with α = 0. For all other C values, alpha seeding was used.³ The computational costs associated with the four data sets are given in Table 2 as functions of C.

Table 2: Computational Costs for SMO and CG Algorithms with Alpha Seeding.

             Banana          Image             Splice          Waveform
             σ² = 1.8221     σ² = 2.7183       σ² = 29.9641    σ² = 24.5325
    log₂ C   SMO      CG     SMO      CG       SMO    CG       SMO     CG
    -10      0.83     0.48   7.44     5.07     4.37   2.00     0.91    0.48
    -9       0.62     0.32   6.15     3.38     3.50   1.00     0.78    0.32
    -8       0.47     0.32   5.54     5.07     3.34   2.00     0.70    0.32
    -7       0.51     0.48   4.56     5.07     2.92   2.00     0.50    0.48
    -6       0.44     0.64   4.69     6.76     2.96   3.00     0.39    0.48
    -5       0.44     0.80   4.09     10.14    2.75   3.00     0.40    0.64
    -4       0.42     0.96   3.98     13.52    2.56   4.00     0.42    0.80
    -3       0.42     1.12   3.79     18.59    2.70   5.00     0.37    0.96
    -2       0.43     1.28   3.82     21.97    2.40   6.00     0.38    1.12
    -1       0.45     1.60   4.38     28.73    2.32   8.00     0.39    1.44
     0       0.60     2.08   5.83     38.87    2.43   10.00    0.49    1.92
     1       0.90     2.24   8.71     54.08    2.74   14.00    0.73    2.56
     2       1.47     2.88   13.95    72.67    3.19   17.00    1.09    3.68
     3       2.61     3.04   24.16    98.02    3.62   23.00    1.82    4.96
     4       4.84     4.00   43.70    138.58   3.91   32.00    2.99    6.40
     5       9.31     5.12   81.07    197.73   4.09   37.00    4.70    8.96
     6       18.42    5.44   149.06   265.33   4.04   52.00    7.09    11.84
     7       36.56    7.04   278.62   378.56   4.17   65.00    9.92    14.72
     8       72.40    8.64   518.92   549.25   4.18   73.00    12.78   17.44
     9       146.95   9.76   955.57   763.88   4.04   82.00    14.91   20.64
     10      297.37   12.16  1717.04  1098.50  3.77   88.00    15.83   23.36

Note: Each unit corresponds to 10⁶ kernel evaluations.

For the SMO algorithm, the increase in cost at large C values is sharp. See, for instance, the growth in the cost of SMO after C = 10² for the Banana and Image data sets. In the case of the Image data set, while SMO was doing quite a bit better than CG at low and medium C values, it became very expensive at large C values. In most practical problems, the optimal value of C where the best generalization occurs (say, the value at which the k-fold cross-validation error is smallest) is usually not high. For instance, in the case of the Image data set, five-fold cross-validation error was smallest around C = 24.16. Thus, when searching for the optimal C, if large C values can be carefully avoided, then the SMO algorithm will work very efficiently. It is also worth noting that for both the SMO and CG methods, the effectiveness of the alpha seeding diminishes at large C values. Since large C values correspond to large costs

³ In section 4, we mentioned two alpha seedings: (1) set α to the optimal α of the previous C value, and (2) set α to δ times the optimal α of the previous C value. Our experiment showed that both initializations give nearly the same cost.
Table 3: Ratio of the Cross-Validation Cost of CG and the SMO Algorithm.

    Banana           Image            Splice            Waveform
    (σ² = 1.8221)    (σ² = 2.7183)    (σ² = 29.9641)    (σ² = 24.5325)
    1.89             2.62             5.77              1.34
for both methods, such values should be avoided (unless really needed) in the C tuning procedure to maintain efficiency.

On the Image and Splice data sets, SMO was quite a bit faster than CG at the middle C values. Since the "optimal" C usually occurs in this range, this is an advantage in favor of the SMO method. On the Banana and Waveform data sets, the difference between the two methods was not much in this range of C values.

When C is tuned using cross validation, we usually start with a low C value, say 2⁻¹⁰, increase C in steps (say, in multiples of 2), and stop when the cross-validation error clearly shows no further decrease. We compared the CG and SMO methods with respect to the total cost associated with this process. To do this, we used the same four data sets of Table 2, while also fixing σ² at the values given in that table. Fivefold cross validation was employed. Table 3 gives the ratio of the computational efforts associated with the two methods. The SMO method is clearly faster in all cases, so the SMO algorithm proposed here can be viewed as an excellent alternative method for solving LS-SVMs.

To see how the SMO algorithm for LS-SVM scales to large-size problems, we did a scaling experiment on the Adult and Web data sets by gradually increasing the training set size in eight steps and observing the training cost. (These data sets are available on-line from Platt's SMO page: http://www.research.microsoft.com/~jplatt/smo.html.) The Adult data set was trained with C = 1.0 and σ² = 10, while the Web data set was trained with C = 5.0 and σ² = 10. Figures 1 and 2 give plots of the computational costs of the two algorithms as functions of the problem size. The SMO algorithm for LS-SVM scales well on both data sets, with scaling exponents of 2.017 and 2.006, respectively, for the Adult and Web data sets. For the CG algorithm, the corresponding scaling exponents were 2.385 and 2.496. This is due to the fact that the number of conjugate gradient iterations needed to achieve equation 3.13 slowly rises as the number of examples increases. The results indicate that the SMO algorithm is better suited for large problems.

Next, we briefly compared the generalization performance of SVM and LS-SVM. Fivefold cross validation was used to tune the hyperparameters involved in the problem formulations (that is, C and σ), and the test set error was obtained using the optimal hyperparameter values for each formulation. The search for optimal hyperparameters was done on a 20 × 20 uniform grid in the (log C, log σ) space. The fractions of test set errors are given in Table 4. It is clear that the generalization capabilities of both
Figure 1: Variation of computational cost with training set size for the Adult data set. [Log-log plot of the number of kernel evaluations versus training set size for the CG and SMO algorithms.]

Table 4: Generalization Performances of LS-SVM and SVM on Benchmark Data Sets.

    Data Set    LS-SVM    SVM
    Banana      .1161     .1180
    Image       .0208     .0228
    Splice      .1012     .0989
    Waveform    .1089     .1076
methods are comparable. This observation is consistent with that made by Van Gestel et al. (2000).

7 Solution of KFD

Mika, Rätsch, Weston, Schölkopf, and Müller (1999) first proposed the idea of (regularized) KFD. Mika et al. (2001) gave a quadratic programming
Figure 2: Variation of computational cost with training set size for the Web data set. [Log-log plot of the number of kernel evaluations versus training set size for the CG and SMO algorithms.]
formulation of KFD employing general regularizers. For the case of the K-regularizer, that is, ‖w‖², Van Gestel et al. (2002) established the equivalence⁴ between LS-SVM and KFD. In this section, we give an alternative to their way of establishing the equivalence between the KFD and LS-SVM formulations. Our approach clearly brings out the relationship between the dual variables and the threshold parameter associated with the two formulations. The quadratic programming formulation of Mika et al. (2001) is:

    min   (1/2)‖ŵ‖² + (C/2) Σ_i ξ̂_i²
    s.t.  y_i − (ŵ·z_i − b̂) = ξ̂_i  ∀ i,    (7.1)
          Σ_{i∈I₊} ξ̂_i = 0  and  Σ_{i∈I₋} ξ̂_i = 0,

⁴ By equivalence, we mean that the normal directions of the optimal hyperplanes associated with the two formulations are the same.
where I₊ and I₋ are the index sets corresponding to the two classes and, as usual, y_i takes the value 1 if example i is in class 1 and −1 otherwise. The Lagrangian for this problem is

    L̂ = (1/2)‖ŵ‖² + (C/2) Σ_i ξ̂_i² + Σ_i α̂_i [y_i − (ŵ·z_i − b̂) − ξ̂_i] + μ₊ Σ_{i∈I₊} ξ̂_i + μ₋ Σ_{i∈I₋} ξ̂_i,

where the α̂_i's, μ₊, and μ₋ are Lagrange multipliers. Setting the gradients of L̂ with respect to ŵ, b̂, and the ξ̂_i's to zero and using them together with the equalities in equation 7.1 yields the following square system of equations for solving the α̂_i's, b̂, μ₊, and μ₋:

    Σ_i α̂_i = 0,
    Σ_j K_ij α̂_j + (1/C) α̂_i = b̂ + μ_i/C + y_i  ∀ i,    (7.2)
    Σ_{i∈I₊} α̂_i = m₊ μ₊  and  Σ_{i∈I₋} α̂_i = m₋ μ₋,

where μ_i is μ₊ if y_i = 1 and μ₋ if y_i = −1; m₊ is the number of examples in class 1 (the positive class); m₋ is the number of examples in class 2 (the negative class); and m = m₊ + m₋. These equations imply that m₊ μ₊ + m₋ μ₋ = 0. Let us define μ using the equation

    μ = m₊ μ₊ = −m₋ μ₋.    (7.3)
If we define

    a = 1 + (μ/(2C)) (1/m₊ + 1/m₋)  and  b = (1/a) [b̂ + (μ/(2C)) (1/m₊ − 1/m₋)],    (7.4)

then the right-hand side of each of the first-level equations in equation 7.2, that is, b̂ + μ_i/C + y_i, can be rewritten as a(b + y_i). If we also define

    a_i = α̂_i / a  ∀ i,    (7.5)

it is easy to see that the a_i and b are obtained by solving

    Σ_j K_ij a_j + (1/C) a_i − b = y_i  ∀ i,
    Σ_i a_i = 0.    (7.6)

Using the a_i's, we can get μ by solving

    Σ_{i∈I₊} a_i = μ / a = μ / [1 + (μ/(2C)) (1/m₊ + 1/m₋)].    (7.7)
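Equations 7.3 through 7.7 amount to a short recipe: solve the LS-SVM system 7.6, recover μ from equation 7.7 (which becomes linear in μ once the denominator is cleared), and read off the KFD quantities from equations 7.3 to 7.5. A NumPy sketch on hypothetical data, with a dense linear solve standing in for the SMO algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
m = 24
X = rng.normal(size=(m, 3))
y = np.where(rng.random(m) < 0.5, 1.0, -1.0)
y[0], y[1] = 1.0, -1.0                     # make sure both classes are present
C = 2.0

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                      # Gaussian kernel, sigma^2 = 1 (assumed)

# Step 1: equation 7.6, the LS-SVM optimality system (dense solve stands in for SMO).
A = np.zeros((m + 1, m + 1))
A[:m, :m] = K + np.eye(m) / C
A[:m, m] = -1.0
A[m, :m] = 1.0
sol = np.linalg.solve(A, np.append(y, 0.0))
a_i, b = sol[:m], sol[m]

# Step 2: equation 7.7, cleared of its denominator, is linear in mu.
mp, mn = int((y > 0).sum()), int((y < 0).sum())
c = (1.0 / mp + 1.0 / mn) / (2.0 * C)
S = a_i[y > 0].sum()                       # left-hand side of equation 7.7
mu = S / (1.0 - S * c)

# Step 3: equations 7.3-7.5 then give the KFD quantities.
a_scale = 1.0 + mu * c                                            # a in equation 7.4
alpha_hat = a_scale * a_i                                         # equation 7.5
mu_p, mu_n = mu / mp, -mu / mn                                    # equation 7.3
b_hat = a_scale * b - (mu / (2.0 * C)) * (1.0 / mp - 1.0 / mn)    # b in equation 7.4
```

The output can be checked against the system in equation 7.2: Σ_i α̂_i = 0, the first-level equations hold with μ_i ∈ {μ₊, μ₋}, and the class sums of the α̂_i equal m₊μ₊ and m₋μ₋.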
Then equations 7.3, 7.4, and 7.5 yield μ₊, μ₋, a, b̂, and the α̂_i's. Since equation 7.6 is nothing but the solution of the instance of equation 2.2 corresponding to LS-SVM, it follows that the solution of equation 7.1 can be obtained by solving LS-SVM; this can be done numerically using the SMO algorithm of section 4. The two equality constraints in the last row of equation 7.1 contribute only a scaling of w and a shift in b relative to those given by the LS-SVM solution. Thus, if the b and b̂ associated with equations 2.2 and 7.1 are determined in a uniform way (say, using the class posterior probabilities obtained in a post-processing step; Platt, 1999), then the classifiers obtained by using these two formulations are identical.

8 Primal Problem Without b
In kernel ridge regression (Saunders et al., 1998) and Bayesian designs (Van Gestel et al., 2002), it is usual to consider the primal problem, equation 2.2, without the b parameter.⁵ When b is absent, the equality constraint Σ_i α_i = 0 disappears from the dual, equation 2.10, and the optimality conditions for the dual problem become

    F_i = 0  ∀ i.    (8.1)
Because of the absence of the equality constraint, the SMO algorithm can simply modify one α_i at a time. If the indices are chosen sequentially for updating, then the SMO algorithm is precisely the SOR algorithm (Bertsekas, 1999; Hamers et al., 2001). The efficiency of this SOR algorithm can be significantly improved if we modify it so that at any given instance, the index that gives the best increase in f value is chosen. This can be neatly done by maintaining a cache for the F_i's. Like the sequential SOR algorithm, the modified algorithm also requires only O(m) computations and kernel evaluations in each step. However, the modified algorithm requires many fewer steps to achieve the termination criterion.

Let us now work out the details of the modified algorithm. Suppose at a given α, an index i is considered for updating. Define

    α̃_i(t) = α_i + t;  α̃_j(t) = α_j  ∀ j ≠ i;  and  w(t) = f(α̃(t)).    (8.2)

With t* and α^new as defined in equation 4.1, it is easy to check that

    t* = F_i / k̃(x_i, x_i)  and  f(α^new) − f(α) = F_i² / (2 k̃(x_i, x_i)).    (8.3)
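In code, the modified algorithm is only a few lines. The sketch below uses a synthetic positive definite matrix in place of K̃ and adopts the sign convention F_i = y_i − (K̃α)_i, so that the one-variable step t* = F_i / k̃(x_i, x_i) of equation 8.3 drives every F_i toward zero, as equation 8.1 requires; the data and tolerance are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 25
M = rng.normal(size=(m, m))
Kt = M @ M.T / m + np.eye(m)      # synthetic s.p.d. stand-in for K~
y = rng.choice([-1.0, 1.0], m)

diag = np.diag(Kt).copy()         # k~(x_i, x_i), precomputed once and stored
alpha = np.zeros(m)
F = y.copy()                      # F_i = y_i - (K~ alpha)_i; optimality is F_i = 0

for _ in range(20000):
    gain = F * F / diag           # predicted increase in f, cf. equation 8.3
    i = int(np.argmax(gain))      # greedy choice instead of sequential SOR sweeps
    if gain[i] < 1e-20:
        break
    t = F[i] / diag[i]            # one-variable Newton step, equation 8.3
    alpha[i] += t
    F -= t * Kt[:, i]             # O(m) update of the F-cache
```

Without the equality constraint, the optimum solves K̃α = y, so the result is easy to verify against a direct solve.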
⁵ If b is present and the term c b²/2 is added to the cost function in equation 2.2, that is equivalent to a primal problem without b if the constant 1/c is added to the kernel function k.
Thus, at each step of the SMO algorithm, we can choose the i that has the largest value of F_i² / (2 k̃(x_i, x_i)). Since a cache is maintained for the F_i's, and k̃(x_i, x_i), i = 1, …, m can be precomputed and stored, the above step can be done efficiently. Convergence of the resulting SMO algorithm can be established along the lines given in section 4. The proofs actually become much simpler because of the simpler problem and algorithm associated with the absence of b.

The solution of the problem in equation 2.10 without the equality constraint corresponds to the solution of K̃α = y, so the solution can also be obtained directly using the conjugate gradient method. Duality gap ideas discussed in section 3 can be easily adapted (simply omit b in all calculations) to give efficient stopping criteria for both the SMO and CG methods.

9 Conclusion
We have given a new algorithm for least-squares formulations of SVMs (LS-SVMs and kernel ridge regression), proved its convergence, and discussed implementation aspects. Our algorithm also applies to a particular form of the regularized kernel Fisher discriminant method. The algorithm is robust in the sense that on the many complex data sets we have tried, there was not a single case of failure. It is also fast and scales well to large-size problems. It provides a better alternative to the conjugate gradient algorithm developed by Hamers et al. (2001), especially for large problems.

Appendix: Pseudocode for the SMO Algorithm
Notes for the Pseudocode

1. kernel(i,j) refers to k̃(x_i, x_j) given by equation 2.4.
2. The α_i are stored in the alpha array, while the F_i are stored in the Fcache array.
3. ε in equation 3.13 is referred to as epsilon. 10⁻⁶ is a suitable value for it.
4. eps is used to check if sufficient change has occurred in each step. It can be set to 10⁻¹².
5. The function fabs returns the absolute value of a real number.
6. Any line in the pseudocode that begins with the character % is treated as a comment.
procedure takeStep(i1, i2)
    alph1 = alpha{i1}; alph2 = alpha{i2};
    y1 = y{i1}; y2 = y{i2};
    F1 = Fcache{i1}; F2 = Fcache{i2};
    gamma = alph1 + alph2;
    k11 = kernel(i1,i1); k22 = kernel(i2,i2); k12 = kernel(i1,i2);
    eta = 2*k12 - k11 - k22;
    a2 = alph2 - (F1-F2)/eta;
    if (fabs(a2-alph2) < eps*(a2+alph2+eps)) return 0;
    a1 = gamma - a2;
    alpha{i1} = a1; alpha{i2} = a2;
    Update the DualObjectiveFunction using (4.11).
    Update Fcache{i} for all i in I using (4.10).
    Compute (i_low, b_low) and (i_up, b_up) using (3.4).
    return 1;
endprocedure

procedure ComputeDualityGap()
    DualityGap = 0;
    Set b using (3.16) or (3.17).
    loop over all i
        xi = b + (alpha{i}/C) - Fcache{i};
        DualityGap += alpha{i}*(Fcache{i} - (0.5*alpha{i}/C)) + C*xi*xi/2;
    return DualityGap;
endprocedure

procedure InitializeVariables(delta)
    if delta is non-zero
        % delta denotes the factor by which C is changed.
        for every i, reset Fcache{i} using (4.12).
        compute DualObjectiveFunction using (4.13).
    else
        initialize alpha{i}=0 for all i.
        set Fcache{i}=-y{i} for all i and DualObjectiveFunction to 0.
    Compute (i_low, b_low) and (i_up, b_up) using (3.4).
endprocedure

procedure SMO()
    % If the solution is available at some value of C=C1 and the solution is
    % to be found at C=C2, then set delta := C2/C1. Otherwise, set delta := 0.
    InitializeVariables(delta);
    numChanged = 1;
    DualityGap = ComputeDualityGap();
    while (DualityGap > epsilon*DualObjectiveFunction && numChanged != 0)
        numChanged = takeStep(i_up, i_low);
        DualityGap = ComputeDualityGap();
    % If the procedure stops with numChanged=0, it means that
    % epsilon is too small.
endprocedure
References

Bertsekas, D. P. (1999). Nonlinear programming (2nd ed.). Belmont, MA: Athena Scientific.

Chang, C. C., & Lin, C. J. (2001). LIBSVM: A library for support vector machines. Available on-line at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.

DeCoste, D., & Wagstaff, K. (2000). Alpha seeding for support vector machines. Paper presented at the International Conference on Knowledge Discovery and Data Mining, Boston.

Fletcher, R. (1987). Practical methods of optimization (2nd ed.). New York: Wiley.

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore: Johns Hopkins University Press.

Hamers, B., Suykens, J., & De Moor, B. (2001). A comparison of iterative methods for least squares support vector machine classifiers (Internal Rep. 01-110). Leuven, Belgium: ESAT-SISTA, K. U. Leuven.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001). Improvements to Platt's SMO algorithm for SVM design. Neural Computation, 13, 637–649.

Mika, S., Rätsch, G., & Müller, K. R. (2001). A mathematical programming approach to the kernel Fisher algorithm. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 591–597). Cambridge, MA: MIT Press.

Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K. R. (1999). Fisher discriminant analysis with kernels. In Y. H. Hu, J. Larsen, E. Wilson, & S. Douglas (Eds.), Neural networks for signal processing IX (pp. 41–48). New York: IEEE.

Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support vector machines (Tech. Rep. MSR-TR-98-14). Redmond, WA: Microsoft Research.

Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In A. Smola, B. Schölkopf, & D. Schuurmans (Eds.), Advances in large margin classifiers. Cambridge, MA: MIT Press.

Rätsch, G. (1999). [Benchmark repository.] Available on-line at: http://ida.first.gmd.de/~raetsch/data/benchmarks.htm.

Saunders, C., Gammerman, A., & Vovk, V. (1998). Ridge regression learning algorithm in dual variables. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98). Madison, WI.

Smola, A., & Schölkopf, B. (1998). A tutorial on support vector regression (NeuroColt2 Tech. Rep. NC2-TR-1998-030). Berlin: ESPRIT Working Group in Neural and Computational Learning II.

Suykens, J., Lukas, L., Van Dooren, P., De Moor, B., & Vandewalle, J. (1999). Least squares support vector machine classifiers: A large scale algorithm. In Proceedings of the European Conference on Circuit Theory and Design (ECCTD'99) (pp. 839–842). Stresa, Italy.

Suykens, J., & Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters, 9, 293–300.
Vanderbei, R. J. (1994). LOQO: An interior point code for quadratic programming (Tech. Rep. SOR-94-15). Princeton, NJ: Statistics and Operations Research, Princeton University.

Van Gestel, T., Suykens, J., Baesens, B., Viaene, S., Vanthienen, J., Dedene, G., De Moor, B., & Vandewalle, J. (2000). Benchmarking least squares support vector machine classifiers (Internal Rep. 00-37). Leuven, Belgium: ESAT-SISTA, K. U. Leuven.

Van Gestel, T., Suykens, J., Lanckriet, G., Lambrechts, A., De Moor, B., & Vandewalle, J. (2002). A Bayesian framework for least squares support vector machine classifiers. Neural Computation, 14, 1115–1147.

Received May 9, 2002; accepted July 8, 2002.
ARTICLE
Communicated by Bard Ermentrout
Synchronization in Networks of Excitatory and Inhibitory Neurons with Sparse, Random Connectivity

Christoph Börgers
[email protected]
Department of Mathematics, Tufts University, Medford, MA 02155, U.S.A.
Nancy Kopell
[email protected]
Department of Mathematics and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.
In model networks of E-cells and I-cells (excitatory and inhibitory neurons, respectively), synchronous rhythmic spiking often comes about from the interplay between the two cell groups: the E-cells synchronize the I-cells and vice versa. Under ideal conditions (homogeneity in relevant network parameters and all-to-all connectivity, for instance), this mechanism can yield perfect synchronization. We find that approximate, imperfect synchronization is possible even with very sparse, random connectivity. The crucial quantity is the expected number of inputs per cell. As long as it is large enough (more precisely, as long as the variance of the total number of synaptic inputs per cell is small enough), tight synchronization is possible. The desynchronizing effect of random connectivity can be reduced by strengthening the E→I synapses. More surprising, it cannot be reduced by strengthening the I→E synapses. However, the decay time constant of inhibition plays an important role. Faster decay yields tighter synchrony. In particular, in models in which the inhibitory synapses are assumed to be instantaneous, the effects of sparse, random connectivity cannot be seen.

1 Introduction

In networks of E-cells (excitatory neurons) and I-cells (inhibitory neurons), synchronous, rhythmic spiking often results from the interplay between the two cell groups, with the E-cells synchronizing the I-cells and vice versa. This mechanism, called PING (pyramidal interneuronal network gamma; Whittington, Traub, Kopell, Ermentrout, & Buhl, 2000) or γ-II (Tiesinga, Fellous, José, & Sejnowski, 2001), has been observed in modeling studies, and there are reasons to believe that some experimentally observed gamma rhythms are in fact based on this mechanism (Traub, Jefferys, & Whittington, 1999; Whittington et al., 2000; Tiesinga et al., 2001).

Neural Computation 15, 509–538 (2003)
© 2003 Massachusetts Institute of Technology
510
C. Börgers and N. Kopell
[Raster plots of spike times in panels A and B: 400 E-cells (top) and 100 I-cells (bottom), 0 ≤ t ≤ 200.]
Figure 1: PING in E-I networks with (A) all-to-all connectivity and (B) sparse connectivity.
The conditions under which PING occurs are not completely understood, though we discuss them heuristically in section 6. We focus here on the effects of random connectivity on PING. This effect is illustrated by Figure 1. Details about this figure will be given later. For now, it suffices to say that Figure 1A shows the emergence of PING in a model E/I network with all-to-all connectivity and without any kind of heterogeneity in network parameters. The figure indicates spike times. Both cell groups (E and I) synchronize tightly. If the connectivity is made sparse and random, Figure 1A turns into Figure 1B. The two cell groups now fire spike volleys of brief but positive duration. (The spike volleys of the I-cells are so brief that they appear to have zero duration in the plot.) The main goal of this article is to analyze the durations and shapes of these volleys and their dependence on network parameters.

Synchronization in the presence of random connectivity has been studied previously for excitatory networks (Barkai, Kanter, & Sompolinsky, 1990), inhibitory networks (Brunel & Hakim, 1999; Wang & Buzsáki, 1996), and E/I networks (Brunel, 2000; Bush & Sejnowski, 1996; Golomb & Hansel, 2000; Hansel & Mato, 2001; van Vreeswijk & Sompolinsky, 1996, 1998; Wang, Golomb, & Rinzel, 1995). This literature is aimed at understanding either the stability of the asynchronous state or transitions from asynchrony to rhythms and vice versa. We do not consider these issues here, but focus instead on a detailed understanding of the near-synchronous state.

In section 2, we review the theta model (Ermentrout & Kopell, 1986; Gutkin & Ermentrout, 1998; Hoppensteadt & Izhikevich, 1997), an idealization of a large class of conductance-based neuronal models. Our arguments and simulations in this article are based on this model. We also introduce
Synchronization in Sparse, Random Networks
511
our model of synapses in section 2 and describe the connectivity of our model networks. In section 3, we present numerical experiments demonstrating that the desynchronizing effect of sparseness and randomness in the connectivity primarily originates from the variance in the number of inputs per cell, not from the randomness per se. The synchronization of a population of cells by an inhibitory input pulse is analyzed in section 4, first assuming that all cells receive the same input pulse, and then, motivated by the result of section 3, assuming that different cells receive input pulses of different strengths. Similarly, the synchronization of a population of cells by an excitatory input pulse is analyzed in section 5, first assuming that all cells receive the same input pulse, then assuming that different cells receive input pulses of different strengths. The results of sections 4 and 5 are combined in section 6 to analyze PING in E/I networks. In section 7, we summarize our results and put them into the context of other recent work on the same subject.

2 Review of Theta Neurons

2.1 Equation of a Single Theta Neuron. In the Hodgkin-Huxley model, a periodically spiking space-clamped neuron is represented by a point moving on a limit cycle in a four-dimensional phase space. Analogously, in the theta model (Ermentrout & Kopell, 1986; Gutkin & Ermentrout, 1998; Hoppensteadt & Izhikevich, 1997), a neuron is represented by a point P = (cos θ, sin θ) moving on the unit circle S¹. In the absence of synaptic coupling, the differential equation governing the motion is

    dθ/dt = (1/τ)(1 − cos θ) + I(1 + cos θ).    (2.1)

Here, I should be thought of as an input "current," measured in radians per unit time. The time constant τ > 0 is needed to make equation 2.1 dimensionally correct; it is analogous to a membrane time constant. When I < 0, equation 2.1 has the two fixed points

    θ₀± = ±2 arccos(1/√(1 − τI)).

θ₀⁻ is stable, and θ₀⁺ is unstable. For I < 0, the vector field on the circle is shown in the left panel of Figure 2A. If θ is perturbed slightly from θ₀⁻, it returns to θ₀⁻. However, if θ is raised beyond θ₀⁺, a large excursion occurs, with the point P moving around the entire circle while θ increases to θ₀⁻ + 2π. The stable fixed point θ₀⁻ is the analog of the stable equilibrium of a neuron. The unstable fixed point θ₀⁺ is the analog of a spiking threshold. As I approaches 0, the fixed points approach each other. A saddle-node bifurcation occurs when I = 0. The two fixed points come together at θ = 0 (see the center panel of Figure 2A). When I > 0, dθ/dt > 0 for all t, so there is no fixed point (see the right panel of Figure 2A).
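These fixed points are easy to verify numerically; the drive I = −0.05 below is an arbitrary subthreshold choice for illustration, not a value taken from the article.

```python
import math

tau, I = 1.0, -0.05                 # arbitrary subthreshold drive (I < 0)

def rhs(theta):
    """Right-hand side of equation 2.1."""
    return (1.0 - math.cos(theta)) / tau + I * (1.0 + math.cos(theta))

# The fixed points are theta0+ = +theta0 and theta0- = -theta0:
theta0 = 2.0 * math.acos(1.0 / math.sqrt(1.0 - tau * I))
# rhs vanishes at both, and changes sign from - to + across the
# unstable fixed point theta0+ (threshold behavior).
```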
Figure 2: Theta model. (A) Vector field on the circle for I < 0, I = 0, and I > 0. (B) sin θ as a function of time, 0 ≤ t ≤ 100.
The transition from I < 0 to I > 0 is the analog of the transition from excitability to spiking in a neuron. Neuronal models are called of type I if this transition involves a saddle-node bifurcation on a limit cycle and of type II if it involves a subcritical Hopf bifurcation (Ermentrout, 1996; Gutkin & Ermentrout, 1998; Rinzel & Ermentrout, 1998). This classification goes back to Hodgkin (1948). Thus, the theta model is a type I neuronal model. It has been shown to be canonical, in the sense that other type I models can be reduced to it by coordinate transformations (Ermentrout & Kopell, 1986; Hoppensteadt & Izhikevich, 1997).

If 0 < I ≪ 1/τ, the motion is much slower near θ = 0, that is, near the "ghost" of the fixed point annihilated in the saddle-node bifurcation, than elsewhere. This is illustrated in Figure 2B, which shows sin θ as a function of t for τ = 1, I = 0.02. As the point P = (cos θ, sin θ) moves slowly past (1, 0), sin θ changes slowly. As P moves rapidly around the circle, sin θ rapidly rises to 1, then falls to −1, then returns to values slightly below 0. To some extent, the graph of sin θ resembles the voltage trace of a spiking neuron.¹

¹ Some authors think of −cos θ, not sin θ, as the "voltage-like" quantity in the theta model. Which of the two we call voltage-like is of no consequence for this article, and neither is voltage-like in any precise sense. However, we prefer to think of sin θ as the voltage-like quantity for the following aesthetic reason. Consider a theta neuron driven slightly above threshold. Near the ghost of the equilibrium point (θ = 0), sin θ is slowly increasing, while −cos θ has a local minimum. On the other hand, for a Hodgkin-Huxley-type neuron of type I driven slightly above threshold, the membrane potential V is slowly
Synchronization in Sparse, Random Networks
When θ crosses (2l − 1)π, l integer, with dθ/dt > 0, we say that the neuron spikes. If I > 0 and −π ≤ θ1 ≤ θ2 ≤ π, the time it takes for θ to rise from θ1 to θ2 equals

∫_{θ1}^{θ2} dθ / [(1 − cos θ)/τ + I(1 + cos θ)] = √(τ/I) · [arctan( tan(θ/2) / √(τI) )]_{θ1}^{θ2}.   (2.2)
Setting θ1 = −π and θ2 = π in this formula, we find that the period equals

P = π √(τ/I).   (2.3)
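As a quick numerical sanity check on equation 2.3, one can integrate the theta equation directly and measure the period. The sketch below is not from the article; the step size and parameter values are illustrative choices:

```python
import math

def theta_period(I, tau=1.0, dt=1e-4):
    """Measure the period of dtheta/dt = (1 - cos theta)/tau + I*(1 + cos theta),
    I > 0, as the time for theta to travel from -pi to pi (one full revolution)."""
    theta, t = -math.pi, 0.0
    while theta < math.pi:
        theta += dt * ((1 - math.cos(theta)) / tau + I * (1 + math.cos(theta)))
        t += dt
    return t

I = 0.02
print(theta_period(I), math.pi * math.sqrt(1.0 / I))  # both close to 22.2
```

Halving I should lengthen the measured period by a factor of √2, consistent with P ∝ 1/√I.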
We denote the time it takes for θ to rise from π/2 to 3π/2 by W and call it the spike width. Applying formula 2.2 with (θ1, θ2) = (π/2, π) and (θ1, θ2) = (−π, −π/2) and adding the results, we find

W = (π − 2 arctan(1/√(τI))) √(τ/I).   (2.4)
For physiological realism, we wish to ensure W/P ≪ 1. By equations 2.3 and 2.4,

W/P = 1 − (2/π) arctan(1/√(τI)).

Therefore, W/P ≪ 1 means the same as τI ≪ 1. Since arctan(1/ε) = π/2 − ε + O(ε²) as ε → 0, equation 2.4 implies W ≈ 2τ when τI ≪ 1. Thus, in the parameter regime of interest to us, τ is approximately half the spike width. Motivated by this discussion and by the fact that spike widths in real neurons are on the order of milliseconds, we set τ = 1 for the remainder of this article, think of time as measured in milliseconds, and always consider input currents I ≪ 1.

2.2 Synapses Between Theta Neurons. We model synapses by adding time-dependent input currents to equation 2.1. When the presynaptic neuron spikes, the postsynaptic neuron receives an input current (positive or negative, depending on whether the synapse is excitatory or inhibitory), which jumps to its maximum value instantaneously and then decays exponentially. This is the model of section 2.1.5 of Izhikevich (2000) (see also Ermentrout, 1996, for a discussion of synapses between theta neurons).
Thus, a network of N coupled theta neurons is described by a system of differential equations of the form

dθ_j/dt = 1 − cos θ_j + (I_j + Σ_{i=1}^{N} α_i g_ij s_ij)(1 + cos θ_j),   1 ≤ j ≤ N.

I_j denotes the external input to the jth neuron, which can be positive or negative. The constant α_i equals +1 or −1, depending on whether neuron i is excitatory (E) or inhibitory (I). The constant g_ij ≥ 0 measures the strength of the synapse from neuron i to neuron j, and s_ij = s_ij(t) is the synaptic gating variable associated with this synapse. The value of s_ij always lies between 0 and 1. It jumps to 1 when neuron i spikes. Between spikes of neuron i, it decays exponentially, following the differential equation

ds_ij/dt = −s_ij/τ_ij.

The decay time constants τ_ij are positive. The jumps of s_ij occurring when neuron i spikes cause difficulties in the numerical simulation of the network and are not physiologically realistic. In our simulations, we therefore replace the jumps by rapid but smooth rises, letting s_ij be governed by a differential equation of the form

ds_ij/dt = −s_ij/τ_ij + e^{−η(1 + cos θ_i)} (1 − s_ij)/τ_R,
with τ_R = 0.1 and η = 5. The term e^{−η(1 + cos θ_i)} (1 − s_ij)/τ_R is very close to zero unless θ_i ≈ (2l − 1)π, l integer, and drives s_ij toward 1 rapidly when θ_i ≈ (2l − 1)π. The parameter τ_R is reminiscent of a synaptic rise time. Figure 3 shows s_ij for τ_ij = 2 and 10, with the input of the presynaptic neuron chosen so that its period equals 25.

2.3 Sparse, Random E/I Networks. Throughout this article, the decay time constants τ_ij are assumed to depend on the type of i only (excitatory or inhibitory), so there are two distinct values of τ_ij, denoted τ_E and τ_I. We also assume that all excitatory neurons receive the same constant external drive I_E, and all inhibitory neurons receive the same constant external drive I_I. In later sections, we consider networks of coupled theta neurons, including N_E = 4N/5 excitatory and N_I = N/5 inhibitory neurons. These proportions are motivated by the fact that in large portions of the cortex, there are about four times more excitatory than inhibitory neurons (Braitenberg & Schüz, 1998). The strengths g_ij of the synapses are chosen at random. For a given network, they are chosen once and for all; they do not depend on time.

Figure 3: Synaptic gating variable s as a function of time, with decay time constants τ = 2 and τ = 10.

To describe the choice of the g_ij, the following notation is useful. Let E ⊆ {1, …, N} denote the set of all indices of excitatory neurons, and similarly I the set of all indices of inhibitory neurons. For i ∈ E and j ∈ I, we define

g_ij = (g_EI / (p_EI N_E)) w_ij,   (2.5)

with

w_ij = 1 with probability p_EI, and w_ij = 0 otherwise,   (2.6)
where g_EI ≥ 0 and p_EI ∈ (0, 1] are constants. For j ∈ I, we define

w_Ej = Σ_{i∈E} w_ij.   (2.7)

Note that w_Ej is a binomially distributed random variable. We also define

g_Ej = Σ_{i∈E} g_ij = (g_EI / (p_EI N_E)) w_Ej.   (2.8)
From the formulas for the mean and standard deviation of binomially distributed random variables, we see that g_Ej has mean g_EI and standard deviation

σ_EI = g_EI √((1 − p_EI)/(p_EI N_E)).   (2.9)

By the central limit theorem, g_Ej is approximately normally distributed if p_EI N_E is large. Assuming p_EI ≪ 1, as is physiologically realistic (Braitenberg & Schüz, 1998), equation 2.9 shows that

σ_EI / g_EI ≈ 1/√(p_EI N_E).   (2.10)
The left-hand side of equation 2.10 is the coefficient of variation (the standard deviation divided by the mean) of g_Ej, j ∈ I. The expression p_EI N_E appearing on the right-hand side is the expected number of excitatory inputs per I-cell. Formulas analogous to 2.5 through 2.10 apply to the I → E, E → E, and I → I synapses. In particular,

σ_IE = g_IE √((1 − p_IE)/(p_IE N_I)),   (2.11)

and therefore

σ_IE / g_IE ≈ 1/√(p_IE N_I)   (2.12)
for p_IE ≪ 1. The left-hand side of equation 2.12 is the coefficient of variation of g_Ij, j ∈ E. The expression p_IE N_I appearing on the right-hand side is the expected number of inhibitory inputs per E-cell. Several authors have pointed out that p_EI N_E and p_IE N_I are much more important than p_EI, N_E, p_IE, and N_I in isolation (Golomb & Hansel, 2000; Tiesinga, Fellous, José, & Sejnowski, 2002; Wang et al., 1995).

3 Loss of Tight Synchrony Is Attributable to Variance in the Number of Inputs per Cell

Before considering networks with sparse, random connectivity, we consider one with all-to-all connectivity (p_EI = p_IE = 1.0). Figure 1A shows an example of a simulation with

I_E = 0.1, I_I = 0, g_EI = g_IE = 0.25, g_EE = g_II = 0, τ_E = 2, τ_I = 10.   (3.1)
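Before turning to the simulations, the weight statistics of section 2 (equations 2.5 through 2.10) are easy to check by direct sampling. The sketch below, which is not from the article, draws the Bernoulli variables w_ij and compares the empirical mean and standard deviation of g_Ej with equation 2.9; the parameter values are illustrative, not those of the figures:

```python
import math
import random

def sample_gEj(gEI, pEI, NE, n_cells, rng):
    """Total excitatory weight onto each of n_cells I-cells (equations 2.5-2.8):
    g_Ej = gEI/(pEI*NE) times the number of successful Bernoulli(pEI) draws out of NE."""
    scale = gEI / (pEI * NE)
    return [scale * sum(rng.random() < pEI for _ in range(NE))
            for _ in range(n_cells)]

rng = random.Random(0)
gEI, pEI, NE = 0.25, 0.1, 400
g = sample_gEj(gEI, pEI, NE, 5000, rng)
mean = sum(g) / len(g)
sd = math.sqrt(sum((x - mean) ** 2 for x in g) / (len(g) - 1))
print(mean, sd, gEI * math.sqrt((1 - pEI) / (pEI * NE)))  # sd should match eq. 2.9
```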
Figure 1A shows spike times, with the horizontal axis indicating time and the vertical axis cell index. Each of the two cell groups (E and I) synchronizes very rapidly, with the synchronous population spikes of the I-cells slightly lagging behind those of the E-cells. The synchronization mechanism seen in Figure 1A is PING, briefly described in section 1 and discussed in more detail in section 6.1.

We comment briefly on our parameter choices. A neuron driven with I_E = 0.1 spikes periodically with an interspike interval equal to π/√I ≈ 9.93. Since we think of time as measured in milliseconds, this corresponds to a frequency of (1000/9.93) Hz ≈ 100 Hz. Thus, the E-cells are driven so hard that they would spike above gamma frequency if they were not subject to any inhibition. The I-cells are driven at threshold; they do not spike without additional excitatory input, but any excitatory input, regardless of how weak, will make them spike. The values of g_EI and g_IE can be varied considerably without any qualitative change in Figure 1A. However, for small values of g_EI (roughly < 0.1), two or more population spikes of the E-cells occur before the I-cells respond. For large values of g_EI (roughly > 0.7), two or more population spikes of the I-cells occur in response to a population spike of the E-cells. For small values of g_IE (roughly < 0.1), the E-cells are not synchronized. For large values of g_IE, the rhythm is slow but not qualitatively different from that in Figure 1A. For simplicity, we assume here that there are no E → E or I → I synapses: g_EE = g_II = 0. In our experience, E → E synapses (with a brief synaptic decay time such as τ_E = 2) do not affect PING rhythms much. However, I → I synapses are crucial in some parameter regimes; this will be discussed in section 6.1.
Our choices of τ_E = 2 and τ_I = 10 are motivated by the decay time constants of excitatory synapses involving AMPA receptors (approximately 2 ms) and inhibitory synapses involving GABA_A receptors (approximately 10 ms); recall that we think of time as measured in milliseconds.

When p_EI and p_IE are reduced from 1.0 to 0.5, Figure 1A turns into Figure 1B. The two cell groups now fire spike volleys of positive durations. As stated in section 1, the main goal of this article is to understand the durations and shapes of these volleys. In the network underlying Figure 1B, each E-cell receives input from a random number of I-cells. The expected value of this number is 50. Similarly, each I-cell receives input from a random number of E-cells, with expectation 200. We now repeat the simulation of Figure 1B with a network in which the connectivity is still sparse and random, but the variance in the number of inputs per cell has been eliminated. That is, each E-cell receives input from a random set of exactly 50 I-cells, and each I-cell receives input from a random set of exactly 200 E-cells. The network is like that of Figure 1B in all other regards. The result is shown in Figure 4A; tight synchrony is restored.

When the number of inputs per cell is fixed, tight synchronization is possible even for much sparser networks. Figure 4B shows a simulation
Figure 4: Simulation of E-I networks with sparse, random connectivity but without variance in the number of inputs per cell. (A) Parameters as in Figure 1B, but with variance in numbers of inputs per cell eliminated. (B) Only four excitatory inputs per I-cell and one inhibitory input per E-cell. (C) Same as B in a larger network. (D) Same as B, but with only one excitatory input per I-cell.
in which each E-cell receives input from one I-cell, and each I-cell receives input from four E-cells. Although synchronization is achieved a little less rapidly than in Figure 4A, it becomes tight within a few oscillation periods. A four times larger network, again with one inhibitory input into each E-cell and four excitatory inputs into each I-cell, shows similar behavior (see Figure 4C).
If the number of excitatory inputs per I-cell is reduced again, from four to one, there appears to be no more synchronization (see Figure 4D). Thus, there appears to be a minimum number of inputs per cell needed for synchronization, a "percolation threshold." However, this number is very small in our simulations. In previous work on different models, similar thresholds have been found (Golomb & Hansel, 2000; Wang & Buzsáki, 1996; Wang et al., 1995). In all three of these references, asynchrony was found to become unstable when the number of inputs per cell exceeded a threshold value independent of network size. A cell in a human brain typically receives input from thousands of other cells (Braitenberg & Schüz, 1998). Our numerical experiments suggest that with so many synapses, the impact of sparse, random connectivity on synchronization is attributable to the variance in the number of inputs per cell, not to the percolation threshold. We will assume this to be the case from now on.

4 Synchronization of a Population of Theta Neurons by a Single Strong Inhibitory Pulse

4.1 Synchronization by an Inhibitory Pulse of Uniform Strength. We consider a population of N identical, uncoupled neurons with common constant external drive above threshold, receiving a common inhibitory synaptic input pulse at time 0. Following section 2, we model this population by the equations

dθ_j/dt = (1 − cos θ_j) + (I − g s(t))(1 + cos θ_j),   1 ≤ j ≤ N,   (4.1)
with I > 0, g > 0, and

s(t) = e^{−t/τ_I} for t > 0, and s(t) = 0 for t ≤ 0,

with τ_I > 0. If the inhibitory pulse is strong, it brings the population close to synchrony. An example is shown in Figure 5A, which displays results of a simulation with

N = 100, I = 0.05, g = 0.25, τ_I = 10.   (4.2)
The synchronization brought about by the inhibitory pulse at time 0 is immediate and nearly perfect. To understand the synchronization shown in Figure 5A, consider the initial value problem

dθ/dt = 1 − cos θ + (I − g e^{−t/τ_I})(1 + cos θ),   t > 0,   (4.3)
θ(0) = θ0,   (4.4)
Figure 5: Synchronization by a single inhibitory pulse. (A) Inhibitory pulse of uniform strength. (B) Inhibitory pulse of nonuniform strength. (C) Phase portrait for equations 4.5 and 4.6 with I = 0.05, τ_I = 10, with the stable river indicated boldly. (D) Distribution of spikes within the first volley following t = 0, in a simulation identical to that in B, but with 1000 neurons: predicted (solid line) and actual (bars).
with −π < θ0 < π. We define

J(t) = I − g e^{−t/τ_I}.

Equations 4.3 and 4.4 can then be rewritten in the form

dθ/dt = 1 − cos θ + J(1 + cos θ),   (4.5)
dJ/dt = −(J − I)/τ_I,   (4.6)
θ(0) = θ0,   (4.7)
J(0) = I − g.   (4.8)
The phase portrait for the two-dimensional dynamical system 4.5 and 4.6 is shown in Figure 5C for I = 0.05 and τ_I = 10. The dashed line in Figure 5C is the nullcline dθ/dt = 0. The nullcline dJ/dt = 0 is the horizontal line J = I, the upper edge of the window shown in Figure 5C. The figure should be extended periodically in θ with period 2π. The flow is upward, in the direction of increasing J. The most striking feature of Figure 5C is the existence of strongly attracting and strongly repelling trajectories. Trajectories of this kind exist in many systems of ordinary differential equations and are called rivers (Diener, 1985a, 1985b). The figure reveals a stable river, that is, a trajectory (θ_s, J_s) that is attracting in forward time, indicated as a bold line in Figure 5C, with (θ_s, J_s) → (−π, −∞) as t → −∞, dθ_s/dt > 0, and

J_s(t) = I − g e^{−t/τ_I}   (4.9)

for all t. We denote by T the time when θ_s(T) = π, and define

J* = J_s(T) ∈ (0, I).   (4.10)

Equations 4.9 and 4.10 imply

T = τ_I ln g − τ_I ln(I − J*).   (4.11)
Note that the phase portrait depends on the parameters I and τ_I but not on g. Therefore, the value of J* depends on I and τ_I but not on g. The synchronization seen in Figure 5A can be understood from Figure 5C in the following way. For g sufficiently large (that is, J(0) sufficiently negative), and for θ0 sufficiently far from π, (θ(t), J(t)) is rapidly attracted to (θ_s(t), J_s(t)). At the time when θ = π, we therefore have J ≈ J*, that is, t ≈ T. Thus, the first spike after time zero occurs approximately at time T. For θ0 sufficiently close to π, θ(t) quickly passes through π and is then rapidly attracted to (θ_s(t) + 2π, J_s(t)). When θ(t) reaches 3π, J ≈ J* and therefore t ≈ T. Thus, a spike occurs soon after time zero, followed by a spike approximately at time T. Only for values of θ0 in a narrow transition regime is (θ(t), J(t)) attracted to neither (θ_s(t), J_s(t)) nor (θ_s(t) + 2π, J_s(t)).

Equation 4.11 gives the time of the first approximately synchronous population spike. Since we have no explicit formula for J*, equation 4.11 is not an explicit formula for T. However, since J* does not depend on g, equation 4.11 does give the precise dependence of T on g. This dependence will be of primary interest to us in the remainder of section 4. For later reference, we note how equation 4.11 is modified when the synchronizing inhibitory pulse arrives not at time 0 but at some time T0:

T = T0 + τ_I ln g − τ_I ln(I − J*).   (4.12)
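The g-dependence in equation 4.11 can be verified directly: since J* does not depend on g, doubling g should delay the first spike by exactly τ_I ln 2. A minimal numerical sketch, not from the article (the step size and the choice θ0 = 0 are illustrative):

```python
import math

def first_spike_time(I, g, tau_I, theta0=0.0, dt=1e-3, t_max=200.0):
    """Integrate equations 4.3-4.4 and return the first time theta crosses pi."""
    theta, t = theta0, 0.0
    while t < t_max:
        J = I - g * math.exp(-t / tau_I)
        theta += dt * (1 - math.cos(theta) + J * (1 + math.cos(theta)))
        t += dt
        if theta >= math.pi:
            return t
    return float("inf")

I, tau_I = 0.05, 10.0
T1 = first_spike_time(I, g=0.25, tau_I=tau_I)
T2 = first_spike_time(I, g=0.50, tau_I=tau_I)
print(T2 - T1, tau_I * math.log(2))  # both should be close to 6.93
```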
4.2 Approximate Synchronization by an Inhibitory Pulse of Nonuniform Strength. Motivated by section 3, we are interested in the effects of variable synaptic strengths. We therefore let the constant g in equation 4.1 depend on j:

dθ_j/dt = (1 − cos θ_j) + (I − g_j s(t))(1 + cos θ_j),   1 ≤ j ≤ N.

That is, different neurons receive inhibitory pulses of different strengths. We assume that the g_j are independent, normally distributed random variables, with mean ḡ > 0 and standard deviation σ_g > 0. If ḡ is large enough and σ_g is small enough, the population is still synchronized approximately. This is illustrated in Figure 5B, which shows results of a simulation similar to that of Figure 5A, with

N = 100, I = 0.05, ḡ = 0.25, σ_g = 0.025, τ_I = 10.   (4.13)
(Compare these parameters with those in equations 4.2.) Instead of the nearly synchronous population spikes of Figure 5A, we now see spike volleys of brief but positive durations. To analyze the durations of these spike volleys, let us consider the initial value problem 4.3 and 4.4 with a random g > 0. Let ρ_g = ρ_g(γ), γ > 0, be the probability density of g, and let ḡ > 0 and σ_g > 0 be its mean and standard deviation. Let

X = ln g.   (4.14)

Combining equations 4.11 and 4.14,

T = τ_I X − τ_I ln(I − J*).   (4.15)
The only random quantity on the right-hand side of equation 4.15 is X. We will discuss its distribution first. Let ρ_X = ρ_X(ξ), −∞ < ξ < ∞, be the probability density of X. For −∞ < a < b < ∞,

∫_a^b ρ_X(ξ) dξ = P(X ∈ (a, b)) = P(ln g ∈ (a, b)) = P(g ∈ (e^a, e^b)) = ∫_{e^a}^{e^b} ρ_g(γ) dγ = ∫_a^b e^ξ ρ_g(e^ξ) dξ.

Therefore,

ρ_X(ξ) = e^ξ ρ_g(e^ξ)   (4.16)
for all ξ. For small σ_g, the standard deviation of X is

σ_X ≈ ln′(ḡ) σ_g = σ_g/ḡ.   (4.17)
From equations 4.15 and 4.16, we see that the probability density function of T is

ρ_T(t) = (1/τ_I) e^{(t + τ_I ln(I − J*))/τ_I} ρ_g(e^{(t + τ_I ln(I − J*))/τ_I}).   (4.18)

For later reference, we note how formula 4.18 changes if the approximately synchronizing inhibitory pulse arrives not at time 0 but at some time T0:

ρ_T(t) = (1/τ_I) e^{(t − T0 + τ_I ln(I − J*))/τ_I} ρ_g(e^{(t − T0 + τ_I ln(I − J*))/τ_I}).   (4.19)
Using equations 4.15 and 4.17, we see that the standard deviation of T is

σ_T = τ_I σ_X ≈ τ_I σ_g/ḡ   (4.20)

for small σ_g. We think of σ_T as a measure of the duration of the spike volleys. Thus, the duration of the spike volleys is proportional to the product of τ_I, the decay time constant of inhibition, and the coefficient of variation σ_g/ḡ of g.

To verify these results computationally, we return to the example of Figure 5B. Strictly speaking, the preceding discussion does not apply to this example, since g, which is assumed to be normally distributed, is not guaranteed to be positive. However, formulas 4.16, 4.18, and 4.20 are well defined if ρ_g is a normal density. Since ḡ = 0.25 and σ_g = 0.025, the probability of g ≤ 0 is extremely small. We therefore expect equations 4.16, 4.18, and 4.20 to hold with good accuracy. We define T^(j) to be the time of the spike of the jth neuron within the first nearly synchronous spike volley. We set
T̂ = (1/N) Σ_{j=1}^{N} T^(j)   and   σ̂_T = √( Σ_{j=1}^{N} (T^(j) − T̂)² / (N − 1) ).   (4.21)

Here and for the remainder of this article, hats indicate results obtained from numerical simulations. In the example of Figure 5B, we find

σ̂_T ≈ 1.02.

We see that σ̂_T is indeed close to τ_I σ_g/ḡ, which equals 1.0 in this example. If we double τ_I in this experiment, σ̂_T rises from 1.02 to 2.04, in agreement with equation 4.20. To verify equation 4.18 numerically, we must know the value of τ_I ln(I − J*). Taking expectations on both sides of equation 4.15, we find

τ_I ln(I − J*) = τ_I E(X) − E(T) = τ_I E(ln g) − E(T).
For sufficiently small σ_g, this implies

τ_I ln(I − J*) ≈ τ_I ln ḡ − E(T),

suggesting the approximation

τ_I ln(I − J*) ≈ τ_I ln ḡ − T̂.   (4.22)

Figure 5D shows the density ρ_T, as defined in equation 4.18, using the approximation 4.22, with τ_I = 10, and assuming that ρ_g is a normal density with ḡ = 0.25 and σ_g = 0.025. The histogram in Figure 5D indicates the actual spike time density, determined from the numerical simulation. The agreement between the theoretical prediction and the actual spike time distribution is excellent. For later reference, we note how equation 4.22 changes when the inhibitory pulse arrives not at time 0 but at some time T0. From equation 4.12, we then obtain

τ_I ln(I − J*) ≈ τ_I ln ḡ + T0 − T̂.   (4.23)
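Equation 4.20 can also be checked without integrating the differential equation at all: by equation 4.15, each g maps to a spike time τ_I ln g plus a constant, so it suffices to sample g and take the standard deviation of τ_I ln g. A sketch, not from the article, with the parameters of Figure 5B (the sample size and seed are arbitrary choices):

```python
import math
import random

rng = random.Random(1)
tau_I, g_mean, sigma_g = 10.0, 0.25, 0.025

# The additive constant -tau_I*ln(I - J*) in equation 4.15 drops out of the
# standard deviation. g <= 0 has negligible probability here (10 sigma below the mean).
T = [tau_I * math.log(rng.gauss(g_mean, sigma_g)) for _ in range(100000)]
m = sum(T) / len(T)
sd = math.sqrt(sum((t - m) ** 2 for t in T) / (len(T) - 1))
print(sd, tau_I * sigma_g / g_mean)  # equation 4.20 predicts sigma_T ~ 1.0
```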
5 Synchronization of a Population of Theta Neurons by a Single Strong Excitatory Pulse

5.1 The Synchronous Population Spike Triggered by an Excitatory Pulse of Uniform Strength. We next consider a population of N identical, uncoupled neurons with common constant external drive below or at threshold, receiving a common excitatory synaptic input pulse at time 0. We model this situation by the equations

dθ_j/dt = (1 − cos θ_j) + (I + g s(t))(1 + cos θ_j),   1 ≤ j ≤ N,   (5.1)

with I ≤ 0, g > 0, and

s(t) = e^{−t/τ_E} for t > 0, and s(t) = 0 for t ≤ 0,

with τ_E > 0. If the excitatory pulse is strong, it triggers a nearly synchronous population spike soon after time 0. To analyze this in more detail, we consider the initial value problem

dθ/dt = 1 − cos θ + (I + g e^{−t/τ_E})(1 + cos θ),   t > 0,   (5.2)
θ(0) = −2 arccos(1/√(1 − I)).   (5.3)
Figure 6: (A) Approximately synchronous population spike triggered by a single nonuniform excitatory pulse. (B) Time T between arrival of excitatory pulse and spike triggered by it, as a function of the strength g of the pulse, for (I, τ_E) = (0, ∞) (solid line), (0, 5) (dashed line), (0, 2) (dash-dotted line), (−0.01, 2) (circles). (C) ∂T/∂g as a function of g, for the parameter values of B.
Recall from section 2 that for I < 0, the right-hand side of equation 5.3 represents the stable fixed point of the equation dθ/dt = 1 − cos θ + I(1 + cos θ). Thus, we are considering the response of a neuron at rest to an excitatory synaptic pulse. We denote by T the first time at which the neuron spikes, with T = ∞ if there is no spike at all. We have no general analytic expression for T as a function of I, τ_E, and g. However, it is easy to see that for fixed I and τ_E, T is a strictly decreasing function of g, with lim_{g→∞} T = 0 and lim_{g→g_c+} T = ∞ for some g_c ≥ 0 (see Figure 6B). For τ_E = ∞, T can be computed using formula 2.2, with I replaced by I + g and τ = 1. The formula becomes particularly simple for I = 0; in that case,

T = (π/2) · (1/√g).   (5.4)
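Equation 5.4 is easy to confirm by direct integration of equation 5.2 with I = 0 and the pulse held constant (τ_E = ∞). A minimal sketch, not from the article (the step size is an illustrative choice):

```python
import math

def time_to_pi(g, dt=1e-5):
    """Time for theta to rise from 0 (the rest state for I = 0) to pi under
    dtheta/dt = 1 - cos(theta) + g*(1 + cos(theta)), i.e. equation 5.2 with
    I = 0 and a non-decaying pulse (tau_E = infinity)."""
    theta, t = 0.0, 0.0
    while theta < math.pi:
        theta += dt * (1 - math.cos(theta) + g * (1 + math.cos(theta)))
        t += dt
    return t

for g in (0.25, 1.0):
    print(time_to_pi(g), (math.pi / 2) / math.sqrt(g))  # equation 5.4
```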
The approximation τ_E = ∞ is accurate as long as e^{−T/τ_E} ≈ 1, since then the exponential decay in equation 5.2 can be neglected over the time interval [0, T]. Since T → 0 as g → ∞, this means that the assumption τ_E = ∞ is accurate for sufficiently large g. The assumption I = 0 is accurate when |I|/g is sufficiently small. So this assumption too is accurate for sufficiently large g. Figure 6B shows T as a function of g, for various values of I and τ_E, demonstrating that equation 5.4 approximates T reasonably well over a large range of parameter values.

5.2 The Approximately Synchronous Population Spike Triggered by an Excitatory Pulse of Nonuniform Strength. If the synaptic strength in equation 5.1 depends on j, that is, if different neurons receive excitatory pulses of different strengths, the equations are

dθ_j/dt = (1 − cos θ_j) + (I + g_j s(t))(1 + cos θ_j),   1 ≤ j ≤ N.

Figure 6A shows that the resulting population spike is not perfectly synchronous. (But notice that Figure 6A shows a brief time window only; the synchronization is not perfect but fairly tight.) In the simulation underlying this figure,

N = 100, I = 0, ḡ = 0.25, σ_g = 0.025, τ_E = 2.   (5.5)
To analyze the duration of the spike volley triggered by an excitatory pulse of nonuniform strength, we consider the initial value problem 5.2, 5.3 with a random g > g_c. If σ_g is small, the standard deviation of T is

σ_T ≈ |∂T/∂g| σ_g.   (5.6)

For τ_E = ∞ and I = 0,

∂T/∂g = −(π/4) g^{−3/2}   (5.7)

by equation 5.4. Figure 6C shows ∂T/∂g as a function of g, for various values of I and τ_E. The figure confirms that the right-hand side of equation 5.7 approximates ∂T/∂g reasonably well for large values of g. Combining equations 5.6 and 5.7 yields

σ_T ≈ (π/4) · σ_g/(ḡ√ḡ).   (5.8)
For illustration, we return to the example of Figure 6A (I = 0, τ_E = 2, ḡ = 0.25, σ_g = 0.025). We define T^(j) to be the time of the spike of the jth neuron, and define T̂ and σ̂_T as in equation 4.21. Numerically, we find

σ̂_T ≈ 0.270.

To evaluate the right-hand side of equation 5.6, we approximate ∂T/∂g numerically. We find ∂T/∂g ≈ −10.30 for the parameter values of Figure 6A. The approximation of equation 5.6, based solely on the assumption that σ_g is so small that the relation between σ_T and σ_g is approximately linear, proves fairly accurate here; it yields

σ_T ≈ 0.256.

The assumption τ_E = ∞, which underlies equation 5.8, degrades the accuracy, but by less than a factor of two:

σ_T ≈ 0.157.

6 The PING Synchronization Mechanism

6.1 PING in Fully Connected E-I Networks. We return to the example of Figure 1A. Each of the cell groups (E and I) synchronizes rapidly, with the population spikes of the inhibitory neurons slightly lagging behind those of the excitatory ones. We state, in a nonrigorous way, based on numerical experience and heuristics, conditions that are sufficient to induce firing patterns as in Figure 1A:

Condition 1: The E-cells receive external input significantly above their spiking threshold.

Condition 2: The E → I synapses are so strong and have so short a rise time that a surge in spiking of the E-cells quickly triggers a surge in spiking of the I-cells.

Condition 3: The I-cells spike only in response to the E-cells.

Condition 4: The I → E synapses are so strong that a population spike of the I-cells approximately synchronizes the E-cells.

If these four conditions are satisfied, synchronous rhythmic spiking develops as follows (Whittington et al., 2000; Tiesinga et al., 2001). Initial activity in the E-cells triggers activity in the I-cells. This inhibits activity in the E-cells, thereby removing the drive to the I-cells. A period of low activity in both E- and I-cells results. When the inhibition wears off and the E-cells spike again, they are closer to synchrony than previously because of the mechanism described in section 4.1. The spiking of the E-cells causes spiking of the I-cells, closer to synchrony than previously because of the mechanism of section 5.1.
The cycle now repeats.
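The loop just described can be sketched end to end in a few dozen lines. The sketch below is not the article's code: it uses the jump-to-1 gating model of section 2.2 (not the smoothed version used in the article's simulations), a small all-to-all network, and the parameters of equation 3.1; the population sizes, time step, and random seed are illustrative choices:

```python
import math
import random

def simulate_ping(NE=40, NI=10, IE=0.1, II=0.0, gEI=0.25, gIE=0.25,
                  tauE=2.0, tauI=10.0, dt=0.01, t_end=300.0, seed=0):
    """All-to-all E-I network of theta neurons; each cell's gating variable
    jumps to 1 on a spike and decays exponentially (section 2.2, jump model)."""
    rng = random.Random(seed)
    thE = [rng.uniform(-math.pi, math.pi) for _ in range(NE)]
    thI = [rng.uniform(-math.pi, math.pi) for _ in range(NI)]
    sE = [0.0] * NE   # gating of E -> I synapses, one per E-cell
    sI = [0.0] * NI   # gating of I -> E synapses, one per I-cell
    spikesE = [[] for _ in range(NE)]
    spikesI = [[] for _ in range(NI)]
    decE, decI = math.exp(-dt / tauE), math.exp(-dt / tauI)
    t = 0.0
    while t < t_end:
        inpE = IE - gIE * sum(sI) / NI   # inhibition onto each E-cell
        inpI = II + gEI * sum(sE) / NE   # excitation onto each I-cell
        for j in range(NE):
            thE[j] += dt * (1 - math.cos(thE[j]) + inpE * (1 + math.cos(thE[j])))
            if thE[j] >= math.pi:        # spike: wrap theta, set gating to 1
                thE[j] -= 2 * math.pi
                sE[j] = 1.0
                spikesE[j].append(t)
        for j in range(NI):
            thI[j] += dt * (1 - math.cos(thI[j]) + inpI * (1 + math.cos(thI[j])))
            if thI[j] >= math.pi:
                thI[j] -= 2 * math.pi
                sI[j] = 1.0
                spikesI[j].append(t)
        sE = [s * decE for s in sE]
        sI = [s * decI for s in sI]
        t += dt
    return spikesE, spikesI

spikesE, spikesI = simulate_ping()
t_ref = spikesE[0][-1]   # last spike of E-cell 0
spread = max(min(abs(x - t_ref) for x in s) for s in spikesE)
print(len(spikesE[0]), len(spikesI[0]), spread)  # E-cells end up tightly clustered
```

Replacing the jump gating by the smooth rise of section 2.2, or drawing the g_ij at random as in section 2.3, turns this sketch into the networks studied in the figures.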
Figure 7: Illustration of conditions 2–4 from section 6.1. (A) PING is lost when E → I synapses become too weak. (B) PING is lost as a result of too much drive to the I-cells. (C) Rhythm is restored by adding I → I synapses. (D) PING is lost when I → E synapses become too weak.
Condition 1 is evidently needed to drive activity. We discuss conditions 2 through 4 in more detail and present numerical results illustrating what happens when they are violated. When condition 2 is violated, that is, when the E → I synapses are weak, a pattern such as the one shown in Figure 7A often develops: the E-cells and the I-cells still synchronize, but the E-cells spike several times between population spikes of the I-cells. The parameters in Figure 7A are as in Figure 1A, except that g_EI has been reduced from 0.25 to 0.05.
Condition 3 is violated if the drive to the I-cells becomes too strong but can be restored by introducing I → I synapses. We illustrate this with the following numerical experiment. In Figure 1A, I_I = 0. If we raise I_I to 0.05, the figure changes dramatically, as shown in Figure 7B. Here, I_I is strong enough to drive asynchronous activity in the I-cells that suppresses the E-cells altogether. Condition 3, and with it the rhythm, is restored by setting g_II = 0.25 (see Figure 7C). We note that in the example of Figure 7C, the I-cells would spike synchronously even without the E-cells. Thus, the role of the I-I synapses is to synchronize the I-cells, replacing nearly constant inhibition by phasic inhibition, which allows the E-cells to fire. One might therefore consider the rhythm in Figure 7C an "ING" or "γ-I" rhythm (see section 7). The parameter regime investigated here is similar to that of Figure 7d of Tiesinga et al. (2001). There as here, asynchronous activity of the I-cells suppresses activity in the E-cells altogether, while synchronous activity of the I-cells permits firing of the E-cells. However, in Figure 7C, the E-cells do play an important role in setting the frequency of the rhythm. In the absence of the E-cells, the rhythm would be much slower. Thus, the firing of the I-cells in Figure 7C does come in response to the firing of the E-cells, as in PING.

Figure 7D shows what may happen when condition 4 is violated, that is, when the I → E synapses are weak. The parameters in Figure 7D are as in Figure 1A, except that g_IE has been reduced from 0.25 to 0.05. The inhibitory synapses no longer suffice to synchronize the E-cells. As a result, the I-cells, which were synchronized by the E-cells in Figure 1A, are no longer synchronized either. A mathematical examination of conditions 1 through 4 will be the subject of future publications.

6.2 PING in Sparsely, Randomly Connected E-I Networks. When p_EI and p_IE are reduced from 1.0 to 0.5, Figure 1A turns into Figure 1B.
In analyzing the population spikes of the E-cells in Figure 1B, we make the simplifying assumption that the population spikes of the I-cells are perfectly synchronous. Since the I-cells are in fact fairly tightly synchronized in Figure 1B, this is a good approximation, at least for the parameters used in Figure 1B. When the I-cells spike, all E-cells receive inhibitory pulses. However, different neurons receive inputs of different strengths because of the random connectivity. The resulting approximately synchronous spike volley of the E-cells can be analyzed using section 4.2. We focus on one particular spike volley of the E-cells, say, the first one following t = 100 in Figure 1B. We define T_E^(j) to be the time of the spike of the jth E-cell during this volley. We define

T̂_E = (1/N_E) Σ_j T_E^(j)   and   σ̂_E = √( Σ_j (T_E^(j) − T̂_E)² / (N_E − 1) ).
From equations 2.11 and 4.20, we nd the prediction s ¾O E ¼ ¿I
1 ¡ pIE : pIE NI
(6.1)
For the first spike volley of the E-cells following t = 100 in Figure 1B, we find numerically

σ̂_E = 1.18.

The prediction of equation 6.1 is remarkably accurate:

τ_I √((1 − p_IE)/(p_IE N_I)) ≈ 1.22.

We expect the shape of the spike volley to be approximately described by equation 4.19. To evaluate equation 4.19, we must evaluate T0 − τ_I ln(I − J*). (Recall that T0 is the time at which the synchronizing inhibitory pulse arrives, here the time of the inhibitory population spike immediately preceding the excitatory population spike under consideration.) Following equation 4.23, we use the approximation

T0 − τ_I ln(I − J*) ≈ T̂_E − τ_I ln g_IE.

The predicted and actual spike time distributions are shown in Figure 8. The agreement is good.

We now consider the first spike volley of the I-cells following t = 100 in Figure 1B. We define T_I^(j) to be the time of the spike of the jth I-cell during this volley, and

T̂_I = (1/N_I) Σ_j T_I^(j)   and   σ̂_I = √( Σ_j (T_I^(j) − T̂_I)² / (N_I − 1) ).

From equations 2.9 and 5.8, we find the prediction

σ̂_I ≈ (π/4) · (1/√g_EI) · √((1 − p_EI)/(p_EI N_E)).   (6.2)
For the rst spike volley of the I-cells following t D 100 in Figure 1B, we nd numerically ¾O I D 0:151 : The prediction of equation 6.2 is somewhat inaccurate, as was to be expected, because it is based on three rather substantial idealizations: ¿E D 1, perfect
Synchronization in Sparse, Random Networks
Figure 8: Distribution of spikes within the first spike volley following t = 100 in Figure 1B: predicted (solid line) and actual (bars).
synchrony of the E-cells, and the assumption that the I-cells return to rest between the spike volleys of the E-cells (see equation 5.3). However, the discrepancy is still not greater than a factor of two:

$$\frac{\pi}{4\sqrt{g_{EI}}} \sqrt{\frac{1 - p_{EI}}{p_{EI} N_E}} \approx 0.0785.$$
We note that it would be possible to relax the assumption of perfect synchrony in the E-cells, since we have fairly precise information about the durations and even the shapes of the spike volleys of the E-cells. We do not pursue this here, since our goal is qualitative insight, not precise quantitative information.

7 Discussion

We have analyzed the effects of sparse, random connectivity on the PING synchronization mechanism. In particular, we have derived approximate formulas for the durations of the spike volleys, equations 6.1 and 6.2. To make these formulas as transparent as possible, let us use the approximations $1 - p_{IE} \approx 1$ and $1 - p_{EI} \approx 1$, and write

$$M_{EI} = p_{EI} N_E$$
for the expected number of excitatory inputs per inhibitory cell and

$$M_{IE} = p_{IE} N_I$$

for the expected number of inhibitory inputs per excitatory cell. Equations 6.1 and 6.2 then become

$$\hat{\sigma}_E \approx \frac{\tau_I}{\sqrt{M_{IE}}} \qquad (7.1)$$

and

$$\hat{\sigma}_I \approx \frac{\pi}{4\sqrt{g_{EI}}\,\sqrt{M_{EI}}}. \qquad (7.2)$$
An interesting feature of these formulas is their lack of symmetry. The time constant τ_I appears in equation 7.1, but the time constant τ_E does not appear in equation 7.2. Similarly, g_EI appears in equation 7.2, but g_IE does not appear in equation 7.1.

If a theta neuron were driven with a constant drive equal to g_EI, it would spike periodically with a period that we call P_EI. From equation 2.3,

$$P_{EI} = \frac{\pi}{\sqrt{g_{EI}}}.$$

Using this in equation 7.2, we find

$$\hat{\sigma}_I \approx \frac{1}{4} \frac{P_{EI}}{\sqrt{M_{EI}}}. \qquad (7.3)$$
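The relations among equations 6.1, 6.2, and 7.1 to 7.3 can be sketched numerically. The function names and parameter values below are our own illustrative assumptions, not the exact ones behind Figure 1B:

```python
import math

def sigma_E_full(tau_I, p_IE, N_I):
    """Equation 6.1: width of the E-cell spike volleys."""
    return tau_I * math.sqrt((1 - p_IE) / (p_IE * N_I))

def sigma_E_simple(tau_I, M_IE):
    """Equation 7.1, with M_IE = p_IE * N_I and 1 - p_IE ~ 1."""
    return tau_I / math.sqrt(M_IE)

def sigma_I_simple(g_EI, M_EI):
    """Equation 7.2, with M_EI = p_EI * N_E and 1 - p_EI ~ 1."""
    return math.pi / (4 * math.sqrt(g_EI) * math.sqrt(M_EI))

def P_EI(g_EI):
    """Period of a theta neuron under constant drive g_EI (equation 2.3)."""
    return math.pi / math.sqrt(g_EI)

# Illustrative parameters: sparse connectivity p_IE = 0.05, so that
# 1 - p_IE ~ 1 and the simplified formula tracks the full one.
tau_I, g_EI = 10.0, 0.25
p_IE, N_I = 0.05, 1000          # M_IE = 50 inhibitory inputs per E-cell
M_IE, M_EI = p_IE * N_I, 50.0

# Equation 7.3 rewrites equation 7.2 in terms of the period P_EI.
assert abs(sigma_I_simple(g_EI, M_EI)
           - 0.25 * P_EI(g_EI) / math.sqrt(M_EI)) < 1e-12
```

For these parameters the simplified volley width (equation 7.1) agrees with the full expression (equation 6.1) to within a few percent, since 1 − p_IE = 0.95 ≈ 1.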
Let P denote the period of the PING rhythm. In the simulations of this article,

$$P_{EI} \ll \tau_I < P, \qquad M_{EI} > M_{IE} \gg 1. \qquad (7.4)$$
The inequality M_IE ≫ 1 is certainly realistic (Braitenberg & Schüz, 1998). The inequality τ_I < P holds for a gamma rhythm if the inhibitory synapses are mediated by GABA_A, since the typical period of gamma oscillations is about 25 ms and the decay time constant of GABA_A synapses is about 10 ms. It appears that PING without P_EI ≪ τ_I is not possible with realistic values of τ_I; more will be said on this point. Combining 7.1, 7.3, and 7.4, we find

$$\hat{\sigma}_I \ll \hat{\sigma}_E \ll P.$$

Thus, our theory predicts that for PING in realistic parameter regimes, sparse and random connectivity will not prevent either E- or I-cells from synchronizing fairly tightly, but the I-cells will synchronize much more tightly
[Raster plots of E cells (top) and I cells (bottom) versus time t.]

Figure 9: PING in an E-I network with sparse, random connectivity, with nearly instantaneous synapses.
than the E-cells. Our computational results are in agreement with these conclusions (see Figure 1B).

It is possible to obtain PING without the condition P_EI ≪ τ_I if τ_I is made unrealistically small. To compensate, one must then also make g_IE unrealistically large. In fact, in many modeling studies, synapses have been assumed to act instantaneously, that is, with zero rise and decay times (Brunel, 2000; Brunel & Hakim, 1999; Izhikevich, 1999; Mirollo & Strogatz, 1990; Tiesinga & Sejnowski, 2001); this amounts to taking a limit in which simultaneously τ_I → 0 and g_IE → ∞. (In Brunel, 2000, and Brunel & Hakim, 1999, there is a delay between the spiking of the presynaptic neuron and its effect on the postsynaptic neuron; however, this delay is not relevant for this discussion.) In some respects, the behavior of instantaneous synapses is not very different from that of synapses with more realistic time courses. However, equations 7.1 and 7.2 predict that the effects of random connectivity will be reduced dramatically by making the synapses instantaneous. To verify this numerically, we repeat the simulation of Figure 1B with τ_I reduced by a factor of 50 and g_IE raised by a factor of 50. This closely mimics instantaneous synapses but is computationally simpler. The result is shown in Figure 9. As predicted, synchronization is much tighter than in Figure 1B, even though the randomness in the connectivity is the same in Figure 9 as in Figure 1B. The rhythm is also accelerated; this is a result of the reduction in τ_I.
Our theory for the spike volleys of the E-cells is precise enough to allow accurate predictions not just of the durations of the volleys but even of their shapes, as shown in Figure 8. A refinement of the theory could be obtained by taking into account the positive durations of the spike volleys of the I-cells, perhaps using ideas similar to those of section 3.2 of Tiesinga and Sejnowski (2001). (Recall that in section 4, the duration of the spike volleys of the I-cells was assumed to be negligible.) However, our numerical results indicate that in our parameter regime, neglecting the durations of the spike volleys of the I-cells gives an excellent approximation.

Tiesinga et al. (2002) discuss a related problem, using numerical simulation primarily, and with an emphasis on information-theoretic ideas. They also present experimental results. For instance, Figure 8b of their article illustrates how the jitter in the spike times of a rat hippocampal neuron decreases when the quantity n_pre (our p_IE N_I) increases, in rough qualitative agreement with equation 6.1. (Rough qualitative agreement is the best that can be expected here, in view of the idealized nature of our theory.)

Our theory for the spike volleys of the I-cells is less accurate than that for the excitatory ones. However, since the synchronization of the I-cells is typically quite tight and brought about by a crude mechanism (a burst of excitation triggers an almost immediate, and therefore almost synchronous, response of the I-cells), a precise theory for the spike volleys of the I-cells is of less interest here. To create such a theory, we would need to study how the duration of an excitatory input spike volley is related to the duration of an output spike volley triggered by it. (Recall that in section 5, the duration of the spike volleys of the E-cells was assumed to be negligible.) This issue is centrally important in the study of synfire chains (Diesmann, Gewaltig, & Aertsen, 1999).
Figure 3c of Diesmann et al. (1999) shows that the strength of the input spike volley is crucial. Strong, loosely synchronous input spike volleys can trigger tightly synchronous output spike volleys. Using a different way of measuring synchrony, the relation between input synchronization and output synchronization was also studied by Burkitt and Clark (2001).

Combining ideas from the preceding two paragraphs, a more accurate overall theory of PING may be created as follows. First, the approximate spike time distribution within the spike volleys of the E-cells, computed based on section 4, is taken into account in approximating the spike time distribution within the spike volleys of the I-cells. This yields a refinement of section 5, which in turn can be taken into account in approximating the spike time distribution within the spike volleys of the E-cells, leading to a refinement of section 4. Iterating this process, one may obtain increasingly accurate approximations to the spike time distributions within spike volleys. However, such an improved theory of PING is beyond the scope of this article.

While working on this project, we carried out far more simulations than have been presented here. Our conclusions hold over a wide range of parameters. For instance, E→E synapses with g_EE = 0.25 have little effect on Figure 1B. Weak I→I synapses have little effect, except for the point made at the end of section 6.1: such synapses can make gamma rhythms possible in cases when the external drive to the I-cells is fairly strong. Strengthening I→I synapses often leads to a transition to a different synchronization mechanism, called ING by Whittington et al. (2000) and γ-I by Tiesinga et al. (2001). In ING, the I-cells are driven externally, not by the E-cells, and synchronize not only the E-cells but also themselves. For ING, the widths and shapes of the spike volleys of both E- and I-cells can be approximated using the ideas of section 4.1. Figure 10 shows a simulation with

$$g_{IE} = g_{II} = 0.25, \quad g_{EI} = g_{EE} = 0, \quad I_E = I_I = 0.1, \quad \tau_I = 10, \quad p_{IE} = p_{II} = 0.5.$$

[Raster plots of E cells (top) and I cells (bottom) versus time t.]

Figure 10: ING in an E-I network with sparse, random connectivity.
As our theory would predict, the volleys of the E- and I-cells are now of equal duration. In future work, we will investigate ING in sparse, random networks in more detail, including in particular the effects of E→I synapses.

Throughout this article, we have used the theta model. As remarked in section 2.1, the theta model is canonical for type I neuronal models in the sense that other type I models can be reduced to it by coordinate transformations (Ermentrout & Kopell, 1986; Hoppensteadt & Izhikevich, 1997). We therefore expect the picture to be qualitatively similar for all type I models,
even though the details of our calculations, and in particular the details of the centrally important Figure 5C, do depend on our choice of model. Whether and how our results generalize to type II models remains to be explored.

It would also be interesting to explore the effects of spike adaptation on our analysis. Synchronization in sparsely, randomly connected networks with spike adaptation has been studied by van Vreeswijk and Hansel (2001). They discuss synchronization of bursts (not individual spikes) via adaptation (not inhibition) and observe that strong I→E synapses desynchronize bursts, in contrast with our regime, in which strong I→E synapses synchronize spikes. Their model and analysis are so different from ours that a detailed comparison would be a major endeavor, but the article certainly suggests studying effects of adaptation in our model in the future.

We have analyzed synchronization by common input for the purpose of better understanding PING and ING. We remark, however, that synchronization by common input is also of neurobiological interest by itself (Usrey & Reid, 1999). We have shown that even very sparse input can synchronize and have analyzed the desynchronizing effect of heterogeneity in input strength.

Acknowledgments

We are grateful to Marcelo Camperi for stimulating discussions during the early stages of this work and to Steve Epstein for carefully reading the manuscript and providing helpful criticism. We thank the referees for interesting and useful comments. C. B. thanks the Center for BioDynamics at Boston University for a productive and pleasant visit during the spring of 2001, when this project was begun. His work was also supported by an equipment grant from Tufts University's Marshall Fund for Biomedical Research. N. K. received support from NSF grant DMS-9706694 and NIH grant MH47150.

References

Barkai, E., Kanter, I., & Sompolinsky, H. (1990). Properties of sparsely connected excitatory neural networks. Phys. Rev. A, 41, 590–597.
Braitenberg, V., & Schüz, A. (1998). Cortex: Statistics and geometry of neuronal connectivity (2nd ed.). New York: Springer-Verlag.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comp. Neurosci., 8, 183–208.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comp., 11, 1621–1671.
Burkitt, A. N., & Clark, G. M. (2001). Synchronization of the neural response to noisy periodic synaptic input. Neural Computation, 13, 2639–2672.
Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns in realistic network models. J. Comp. Neurosci., 3, 91–110.
Diener, F. (1985a). Propriétés asymptotiques des fleuves. C. R. Acad. Sci. Paris, 302, 55–58.
Diener, M. (1985b). Détermination et existence des fleuves en dimension 2. C. R. Acad. Sci. Paris, 301, 899–902.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Ermentrout, G. B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comp., 8, 879–1001.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Golomb, D., & Hansel, D. (2000). The number of synaptic inputs and the synchrony of large sparse neuronal networks. Neural Comp., 12, 1095–1139.
Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Comp., 10, 1047–1065.
Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178.
Hodgkin, A. L. (1948). The local changes associated with repetitive action in a non-medullated axon. J. Physiol. (London), 107, 165–181.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Izhikevich, E. M. (1999). Class 1 neural excitability, conventional synapses, weakly connected networks, and mathematical foundations of pulse-coupled models. IEEE Transactions on Neural Networks, 10, 499–507.
Izhikevich, E. M. (2000). Neural excitability, spiking, and bursting. Int. J. Bifurcation and Chaos, 10, 1171–1266.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662.
Rinzel, J., & Ermentrout, G. B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 251–292). Cambridge, MA: MIT Press.
Tiesinga, P. H. E., Fellous, J.-M., José, J. V., & Sejnowski, T. J. (2001). Computational model of carbachol-induced delta, theta, and gamma oscillations in the hippocampus. Hippocampus, 11, 251–274.
Tiesinga, P. H. E., Fellous, J.-M., José, J. V., & Sejnowski, T. J. (2002). Information transfer in entrained cortical neurons. Network: Comput. Neural Syst., 13, 41–66.
Tiesinga, P. H. E., & Sejnowski, T. J. (2001). Precision of pulse-coupled networks of integrate-and-fire neurons. Network: Comput. Neural Syst., 12, 215–233.
Traub, R. D., Jefferys, J. G. R., & Whittington, M. A. (1999). Fast oscillations in cortical circuits. Cambridge, MA: MIT Press.
Usrey, W. M., & Reid, R. C. (1999). Synchronous activity in the visual system. Annual Review of Physiology, 61, 435–456.
van Vreeswijk, C., & Hansel, D. (2001). Patterns of synchrony in neural networks with spike adaptation. Neural Comp., 13, 959–992.
van Vreeswijk, C. A., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C. A., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comp., 10, 1321–1372.
Wang, X.-J., & Buzsáki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
Wang, X.-J., Golomb, D., & Rinzel, J. (1995). Emergent spindle oscillations and intermittent burst firing in a thalamic model: Specific neuronal mechanisms. Proc. Natl. Acad. Sci. USA, 92, 5577–5581.
Whittington, M. A., Traub, R. D., Kopell, N., Ermentrout, B., & Buhl, E. H. (2000). Inhibition-based rhythms: Experimental and mathematical observations on network dynamics. Int. J. of Psychophysiol., 38, 315–336.

Received February 11, 2002; accepted July 30, 2002.
NOTE
Communicated by Emilio Salinas
The Effects of Input Rate and Synchrony on a Coincidence Detector: Analytical Solution

Shawn Mikula
[email protected]

Ernst Niebur
[email protected]

Zanvyl Krieger Mind/Brain Institute and Department of Neuroscience, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
Neural Computation 15, 539–547 (2003) © 2003 Massachusetts Institute of Technology

We derive analytically the solution for the output rate of the ideal coincidence detector. The solution is for an arbitrary number of input spike trains with identical binomial count distributions (which includes Poisson statistics as a special case) and identical arbitrary pairwise cross-correlations, from zero correlation (independent processes) to complete correlation (identical processes).

1 Introduction

Individual neurons typically receive thousands of inputs that vary temporally with respect to both mean rate and cross-correlation. Much experimental evidence underscores the importance of mean input rate and cross-correlation as determinants of neural firing, though the relative contributions of each are not so clear (Abeles, 1982, 1991; Softky & Koch, 1992, 1993; Shadlen & Newsome, 1994; Alonso, Usrey, & Reid, 1996; Decharms & Merzenich, 1996; König, Engel, & Singer, 1996; de Oliveira, Thiele, & Hoffmann, 1997; Riehle, Grün, Diesmann, & Aertsen, 1997; Steinmetz et al., 2000; Salinas & Sejnowski, 2000; Niebur, Hsiao, & Johnson, 2002; Williams & Stuart, 2002). Given the importance of rate and cross-correlation in neural coding, determining how inputs modulated with respect to both affect neural firing is significant if the impact of these different neural codes is to be fully appreciated. Employing simple probabilistic and combinatorial methods, we derive the exact analytical solution for the output rate of a coincidence detector receiving an arbitrary number of inputs modulated with respect to both mean rate and cross-correlation.

2 Binomial Spike Trains with Specific Cross-Correlation

In this section, we introduce a systematic method for the generation of an arbitrary number of spike trains with specified pair-wise mean cross-correlations and firing rates.¹ Spikes are distributed according to binomial counting statistics in each spike train. Mean firing rates and cross-correlations are the same for all spike trains (or all pairs of spike trains, respectively), but they can be chosen independently of each other. The algorithm that we describe here is as suitable for the analytical computations in this report as for the implementation of such spike trains in numerical simulations. We emphasize that although we use this particular algorithm to illustrate the derivation of the output rate of the coincidence detector, our results are of general validity, since the properties of the spike trains are entirely specified by their correlation coefficient and the statistics of the individual spike trains. However, no higher-order effects (correlations of order 3 and higher; see, e.g., Bohte, Spekreijse, & Roelfsema, 2000) are included.

First, we establish some notation. Let m be the number of input spike trains, each one having n time bins. All bins are of equal length Δt, chosen sufficiently small so that each contains a maximum of one spike; that is, each bin is guaranteed to contain either one or zero spikes. The probability that a spike is found in any given time bin is p; no spike is therefore found with probability 1 − p. Bins in any given spike train are independent, which implies that all the following analysis can be limited to a single time bin. Note that the physiologically important Poisson statistics are the special case of low rates and long spike trains.

Generation of m spike trains with specific correlation starts by generating m + 1 independent spike trains with the desired mean firing rate (which is just p/Δt). Let us assume we want a correlation coefficient q (with 0 ≤ q ≤ 1) between each and any pair of the m spike trains. We designate spike train number m + 1 as the reference spike train.
In order to introduce the correlation coefficient q between spike trains 1, …, m, we will switch, with a probability $\sqrt{q}$, the state of a time bin in each of these spike trains to that found in the reference spike train. This yields a mean correlation coefficient of q between any two of the spike trains 1, …, m without changing the mean firing rate.² No further use is made of the reference spike train.
¹ The method proposed here is not the only possible way to introduce correlations between spike trains. For instance, one could add spikes generated by a common Poisson process to each individual spike train (Stroeve & Gielen, 2001), or start from a common spike train and generate two different spike trains by removing spikes independently. The method presented here has the advantage of being straightforward and efficient and of generating spike trains with controlled rates and correlation coefficients, both determined in a direct way.

² Since all spike trains, including the reference spike train, have the same mean firing rate, changing the state of any bin to the state of any bin in another spike train does not change the probability of this bin's containing a spike. Therefore, this manipulation does not change the mean firing rate of the spike train.
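The generation procedure of section 2 can be sketched as follows. The function name `correlated_spike_trains` is ours, and the implementation assumes, per the text, that each bin of each train is switched to the reference train's state with probability √q:

```python
import random

def correlated_spike_trains(m, n, p, q, rng=None):
    """Generate m binomial spike trains of n bins, spike probability p
    per bin, and pairwise correlation coefficient q, via the
    reference-train switching procedure of section 2."""
    rng = rng or random.Random()
    sq = q ** 0.5  # per-bin switching probability is sqrt(q)
    # Generate m + 1 independent trains; the last is the reference train.
    trains = [[1 if rng.random() < p else 0 for _ in range(n)]
              for _ in range(m + 1)]
    reference = trains[m]
    for train in trains[:m]:
        for t in range(n):
            if rng.random() < sq:
                train[t] = reference[t]
    return trains[:m]  # the reference train is discarded
```

With q = 1 every bin is switched, so all m returned trains coincide with the reference train; with q = 0 no bin is switched and the trains remain the original independent ones, while the mean firing rate p/Δt is preserved in either case.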
3 Coincidence Detection for Vanishing Input Correlation

The coincidence detector is a computational unit that fires if the number of input spikes received within a given time bin equals or exceeds the threshold, θ. It is instructive to first derive a solution for independent inputs (q = 0) with mean rate p/Δt. Using the binomial distribution and elementary combinatorics, we find that the probability of obtaining exactly j coincident input spikes from a set of m input spike trains with probability p for a spike (to exist within each time bin) is

$$P(j) = \binom{m}{j} p^j (1 - p)^{m-j}. \qquad (3.1)$$
Since we are interested only in cases where the coincidence detector receives at least θ coincident input spikes, the probability for the coincidence detector to produce an output spike is

$$P_{out}(p, m, \theta; q = 0) = \sum_{j=\theta}^{m} \binom{m}{j} p^j (1 - p)^{m-j}. \qquad (3.2)$$
Dividing equation 3.2 by Δt, we obtain the mean output rate of a coincidence detector in the case of vanishing input correlation:

$$N_{out}(p, m, \theta; q = 0) = \frac{1}{\Delta t} \sum_{j=\theta}^{m} \binom{m}{j} p^j (1 - p)^{m-j}. \qquad (3.3)$$
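Equation 3.3 translates directly into code; a minimal sketch (function name ours):

```python
from math import comb

def output_rate_q0(p, m, theta, dt):
    """Equation 3.3: mean output rate for independent inputs (q = 0).
    The binomial tail probability of at least theta coincident spikes
    in one bin, divided by the bin width dt."""
    p_out = sum(comb(m, j) * p**j * (1 - p)**(m - j)
                for j in range(theta, m + 1))
    return p_out / dt
```

For θ = 0 the tail sum equals 1, so the unit fires in every bin and the rate is simply 1/Δt; for θ = m only the all-spikes event counts, giving p^m/Δt.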
4 Finite Correlation

Let us continue to the case of input spikes that vary with respect to both mean rate and cross-correlation. In the notation introduced in section 3, we have to compute $P_{out}(p, m, \theta, q)$, the probability that the coincidence detector produces an output spike (in this time bin) when it receives m input spike trains with cross-correlation q. Conceptually, it is helpful to start with independent input spike trains, q = 0, and then apply the modification of the result as the procedure described in section 2 is applied to generate input spike trains with finite q. Thus, the generation of our correlated input spike trains has two parts: (1) the generation of independent spike trains and (2) the imposition of our cross-correlation procedure to introduce cross-correlation between the input spike trains.
Without loss of generality, we treat $P_{out}(p, m, \theta, q)$ as the following joint probability, again derived from the binomial distribution:

$$P_{out}(p, m, \theta, q) = \sum_{j=\theta}^{m} \binom{m}{j} p^j (1-p)^{m-j} C_1(j, q) + \sum_{j=0}^{\theta-1} \binom{m}{j} p^j (1-p)^{m-j} C_2(j, q). \qquad (4.1)$$
In this equation, both $C_1(j, q)$ and $C_2(j, q)$ are conditional probabilities; in addition to j and q, they depend on θ and m, but these arguments are suppressed to alleviate the notation. Specifically, $C_1(j, q)$ is the probability of obtaining at least θ coincident input spikes given that initially (i.e., before applying our cross-correlation procedure), there were j ≥ θ coincident input spikes. The factor $C_2(j, q)$ is the probability of obtaining at least θ coincident input spikes given that initially there were j < θ. From their definition, it is obvious that $C_1(j, 0) \equiv 1$ and $C_2(j, 0) \equiv 0$, which will later be found to be the case explicitly (see equations 4.2 and 4.4). As it must be, equation 3.2 is thus just the special case of q = 0.

Equation 4.1 expresses the joint probabilities for a two-part experiment: the generation of independent spike trains and the correlation procedure applied to these spike trains that is described in section 2. Let us first consider $C_2$, the conditional probability for a suprathreshold event (that is, among the m input neurons, θ or more have a spike) occurring in the (whole) experiment given that initially (after completion of only the first part) it was subthreshold. This can happen only if the reference spike train has a spike in the time bin under consideration since otherwise, a subthreshold event can never be converted to a suprathreshold event (if the state of any neuron is switched, it is switched to that of the reference spike train). This occurs with a probability p.³ Since by assumption j < θ neurons already have a spike (and their state will not be influenced even if a switch should occur, since the reference spike train by assumption also has a spike), we have to compute the probability that a sufficient number of the remaining (m − j) neurons that do not spike before application of the correlation procedure will be switched. Since all switches are independent and occur with a probability $\sqrt{q}$, we can expect i switches among the available (m − j) neurons with a probability $\binom{m-j}{i} (\sqrt{q})^i (1 - \sqrt{q})^{(m-j)-i}$. This gives rise to a suprathreshold event if i ≥ θ − j

³ Formally, the procedure could also be viewed as a three-part experiment: (1) the generation of spike trains, (2) the correlation procedure described in section 2 for any reference spike train, and (3) the drawing of either a spike or a nonspike in the reference spike train, with probabilities p and (1 − p), respectively. We have simplified the analysis somewhat by taking only those parts in equations 4.2 and 4.3 that actually contribute to the output of the coincidence detector.
since in this case, the total number of spikes is j + i ≥ θ. We therefore have to sum the probabilities for all cases with i ≥ θ − j and obtain $C_2(j, q)$ in equation 4.1 as

$$C_2(j, q) = \sum_{i=\theta-j}^{m-j} \binom{m-j}{i} (\sqrt{q})^i (1 - \sqrt{q})^{m-j-i} \, p. \qquad (4.2)$$
Now we turn to the derivation of $C_1(j, q)$ in equation 4.1, which is the probability that an initially (i.e., at q = 0) suprathreshold event remains suprathreshold after cross-correlation. This is more difficult to compute directly, and therefore we compute its complement, that is, one minus this probability. The latter is the probability of switching a suprathreshold event to a subthreshold event. Assuming now j ≥ θ, this probability is found by an argument completely analogous to that leading to equation 4.2 as

$$\sum_{i=j-\theta+1}^{j} \binom{j}{i} (\sqrt{q})^i (1 - \sqrt{q})^{j-i} (1 - p). \qquad (4.3)$$

It can be seen that this is the probability that i (with i > j − θ) input spikes get switched to the value in the reference spike train during the generation of correlated spike trains. This expression also takes into account the requirement that the bin in the reference train contains no spike (from which the (1 − p) term results). Therefore, we have the following:

$$C_1(j, q) = 1 - \sum_{i=j-\theta+1}^{j} \binom{j}{i} (\sqrt{q})^i (1 - \sqrt{q})^{j-i} (1 - p). \qquad (4.4)$$
This completes the computation of equation 4.1 for $P_{out}(q)$. To compute the mean output rate of the coincidence detector $N_{out}(p, m, \theta, q)$ as a function of input rate p/Δt, cross-correlation q, and threshold θ, we collect terms and multiply equation 4.1 by the inverse of the time bin width, Δt, to yield the following:

$$N_{out}(p, m, \theta, q) = \frac{1}{\Delta t} \left[ \sum_{j=\theta}^{m} \binom{m}{j} p^j (1-p)^{m-j} \left( 1 - \sum_{i=j-\theta+1}^{j} \binom{j}{i} (\sqrt{q})^i (1-\sqrt{q})^{j-i} (1-p) \right) \right.$$
$$\left. + \sum_{j=0}^{\theta-1} \binom{m}{j} p^j (1-p)^{m-j} \sum_{i=\theta-j}^{m-j} \binom{m-j}{i} (\sqrt{q})^i (1-\sqrt{q})^{m-j-i} \, p \right]. \qquad (4.5)$$
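Equation 4.5 can be evaluated directly. The sketch below (function names ours) implements C₁ and C₂ from equations 4.4 and 4.2 and combines them as in equation 4.1:

```python
from math import comb, sqrt

def output_rate(p, m, theta, q, dt):
    """Equation 4.5: mean output rate of the coincidence detector for m
    binomial input trains with per-bin spike probability p, threshold
    theta, pairwise correlation q, and bin width dt."""
    sq = sqrt(q)  # per-bin switching probability

    def C1(j):
        # Equation 4.4: probability that j >= theta initial spikes
        # remain suprathreshold after the correlation procedure.
        return 1.0 - sum(comb(j, i) * sq**i * (1 - sq)**(j - i) * (1 - p)
                         for i in range(j - theta + 1, j + 1))

    def C2(j):
        # Equation 4.2: probability that j < theta initial spikes are
        # lifted to at least theta by switches toward the reference train.
        return sum(comb(m - j, i) * sq**i * (1 - sq)**(m - j - i) * p
                   for i in range(theta - j, m - j + 1))

    total = sum(comb(m, j) * p**j * (1 - p)**(m - j) * C1(j)
                for j in range(theta, m + 1))
    total += sum(comb(m, j) * p**j * (1 - p)**(m - j) * C2(j)
                 for j in range(theta))
    return total / dt
```

Setting q = 0 recovers equation 3.3; setting q = 1 gives N_out = p/Δt, the regime in which each synchronous input volley produces exactly one output spike. Scanning q at p = 0.1, m = 100, θ = 15 reproduces the inverted-U behavior of Figure 1b.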
We have thus obtained our main result: the exact analytical solution for the output rate of a coincidence detector receiving an arbitrary number of
inputs with identical binomial statistics and modulated with respect to both mean rate and cross-correlation. The validity of equation 4.5 is confirmed by simulation (not shown).

5 Example

An illustration of equation 4.1 for a coincidence detector receiving m = 100 binomial input spike trains and with a threshold θ = 15 is shown in Figure 1a. For low values of cross-correlation, the output rate of the coincidence detector is sigmoidal as a function of the input rate, and it rapidly saturates as the input rate increases. For increasing cross-correlation, this behavior approaches more and more a linear increase with input rate, which is, of course, nothing but the solution in which each synchronous volley of incoming action potentials leads to exactly one output spike. In Figure 1b, we show a cut through Figure 1a for the case of constant probability p = 0.1, corresponding to a firing rate of 50 Hz for Δt = 2 ms, which is physiologically realistic in many cases. For this low probability, the binomial distribution of input spike trains can be reasonably well approximated by a Poisson distribution.

Bernander, Koch, and Usher (1994) emphasized that increasing synchrony does not necessarily lead to an increased output rate, since synchronous volleys with more spikes than required to overcome the threshold are wasted. They demonstrated by way of simulation of different neuron models that the output firing rate as a function of input correlation in this case takes on the shape of an inverted U. Figure 1b shows an example of this behavior in the case of our exact solution. While synchrony in this example always increases the efficacy of the incoming firing rates beyond that obtained for independent spike trains, this increase reaches a maximum for small q, after which the output rate decreases again. For this value of p, the trivial solution in which one input volley generates exactly one output spike (i.e., output rate ≡ p) is obtained for all q ≳ 0.1.
Figure 1: Facing page. The plots illustrate the main equation (4.1) derived in this article for a coincidence detector receiving m = 100 binomial input spike trains with a threshold θ = 15. (a) The probability for the coincidence detector to fire an action potential per unit time (see equation 4.1). (b) An iso-frequency (p = 0.1) slice through the surface of a, showing the output probability of the coincidence detector as a function of input cross-correlation. At this relatively low input frequency (corresponding to 50 Hz for Δt = 2 ms), the input spike trains are approximately Poisson. Note the inverted-U shape of the plot, whereby more highly cross-correlated inputs result in a lower output due to inefficient allocation of input spikes. The horizontal dashed line shows p = 0.1.
6 Discussion

In this article, we have derived analytically the solution for the output firing rate of an ideal coincidence detector as a function of the input rate and the correlation between input spike trains. Our derivation was based on a simple model, and it is important to make explicit the key assumptions of that model: (1) the bins in any given input spike train are independent, (2) the coincidence detector "neuron" integrates over a very short time window, and (3) there is no inhibition. The first condition means that there might be additional effects that depend on the structure of the individual spike trains; the second condition makes this model memoryless, which serves to differentiate it from integrator models with long time constants; and the last condition means that the applicability of our model to neurons that receive strong, structured inhibitory input may be limited. All of these conditions are significant simplifications; for instance, it is well known (for a recent example, see Tiesinga, Fellous, & Sejnowski, 2002) that neurons are not memoryless processes (assumption 1 for incoming spike trains and assumption 2 for the neural processing). We feel that these simplifications are justified in the interest of obtaining an analytical solution.

Others have noted that output frequency varies with the correlation of input spike trains in a nontrivial and frequently nonmonotonic way. In particular, the existence of a maximum, as shown in Figure 1b, has been predicted by Bernander et al. (1994), based on simulations. We can compute the location of this maximum from our solution by differentiating equation 4.5 with respect to the cross-correlation q and setting the right-hand side equal to zero. The computation is straightforward but not particularly illuminating. Instead, we plotted iso-frequency curves through the surface given by equation 4.5 to obtain the position of this maximum (not shown).
We observed that the prominence of this peak increases with increasing number of inputs and that it is confined to low input rates (the Poisson regime), disappearing altogether above a certain input rate, where monotonically decreasing iso-frequency curves are obtained. This general behavior can be seen in Figure 1a. These observations lead us to infer that correlations modulate firing rate best when the number of inputs is high and the input rates are low, which corresponds to the most neurally plausible scenario. Thus, our results may have relevance for synchrony-based models of neural computing. In sum, we believe our derivation will prove useful both as a theoretical example highlighting the power of combinatorial methods and for explaining the relative contributions of rate versus synchrony in neural coding.

Acknowledgments

This work was supported by the National Science Foundation under a CAREER grant to E.N. (grant 9876271).
References

Abeles, M. (1982). Local cortical circuits. New York: Springer-Verlag.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Alonso, J. M., Usrey, W. M., & Reid, R. C. (1996). Precisely correlated firing in cells of the lateral geniculate nucleus. Nature, 383(6603), 815–819.
Bernander, O., Koch, C., & Usher, M. (1994). The effect of synchronized inputs at the single neuron level. Neural Computation, 6(4), 622–641.
Bohte, S. M., Spekreijse, H., & Roelfsema, P. R. (2000). The effects of pair-wise and higher-order correlations on the firing rate of a postsynaptic neuron. Neural Computation, 12, 153–179.
deCharms, R. C., & Merzenich, M. M. (1996). Primary cortical representation of sounds by the coordination of action potential timing. Nature, 381, 610–613.
de Oliveira, S. C., Thiele, A., & Hoffmann, K. P. (1997). Synchronization of neuronal activity during stimulus expectation in a direction discrimination task. J. Neurosci., 17(23), 9248–9260.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends in Neurosciences, 19(4), 130–137.
Niebur, E., Hsiao, S. S., & Johnson, K. O. (2002). Synchrony: A neuronal mechanism for attentional selection? Current Opinion in Neurobiology, 12(2), 190–194.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. Journal of Neuroscience, 20(16), 6193–6209.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.
Softky, W., & Koch, C. (1992). Cortical cells should fire regularly but do not. Neural Computation, 4(5), 643–646.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13(1), 334–350.
Steinmetz, P. N., Roy, A., Fitzgerald, P., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Stroeve, S., & Gielen, S. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural Computation, 13, 2005–2029.
Tiesinga, P. H. E., Fellous, J.-M., & Sejnowski, T. J. (2002). Attractor reliability reveals deterministic structure in neuronal spike trains. Neural Computation, 14, 1629–1650.
Williams, S. R., & Stuart, G. J. (2002). Dependence of EPSP efficacy on synapse location in neocortical pyramidal neurons. Science, 295, 1907–1910.

Received May 2, 2002; accepted July 30, 2002.
NOTE
Communicated by Arjen van Ooyen
A Theoretical Model of Axon Guidance by the Robo Code Geoffrey J. Goodhill
[email protected] Department of Neuroscience, Georgetown University Medical Center, Washington, D.C. 20007, U.S.A.
After crossing the midline, different populations of commissural axons in Drosophila target specific longitudinal pathways at different distances from the midline. It has recently been shown that this choice of lateral position is governed by the particular combination of Robo receptors expressed by these axons, presumably in response to a gradient of Slit released by the midline. Here we propose a simple theoretical model of this combinatorial coding scheme. The principal results of the model are that purely quantitative rather than qualitative differences between the different Robo receptors are sufficient to account for the effects observed following removal or ectopic expression of specific Robo receptors, and that the steepness of the Slit gradient in vivo must exceed a certain minimum for the results observed experimentally to be consistent.

1 Introduction

A crucial process in the construction of a nervous system is the establishment of appropriate connectivity. This consists of two stages: the guidance of axons to target regions, largely independent of neural activity, followed by the activity-dependent refinement of synaptic connections. Much progress has recently been made in understanding the mechanisms of axon guidance in response to molecular cues in the developing nervous system (Tessier-Lavigne & Goodman, 1996; Mueller, 1999; Wilkinson, 2001; Song & Poo, 2001). An emerging theme is that a limited number of evolutionarily conserved axon guidance molecules are reused in many different contexts. When it is necessary to generate several different outcomes in the same region of space and time, an obvious strategy is combinatorial coding. For instance, two guidance molecules could simultaneously attract three populations of axons to different targets if population 1 expressed only one type of receptor, population 2 expressed only the other type, and population 3 expressed both types.
Since axons can also be attracted or repelled depending on their internal state (Song, Ming, & Poo, 1997), for N guidance molecules, the number of separate populations that could in principle simultaneously be guided to different locations is of order 3^N. About 100 guidance molecules are now known, allowing many more potential outcomes than there are total
Neural Computation 15, 549–564 (2003)
© 2003 Massachusetts Institute of Technology
connections in any biological nervous system. Combinatorial coding is thus a very efficient coding strategy and is used by nervous systems in contexts ranging from axon guidance (Shirasaki & Pfaff, 2002) to information coding by spike trains (Diesmann, Gewaltig, & Aertsen, 1999) and odor coding in the olfactory system (Malnic, Hirono, Sato, & Buck, 1999). A particularly well-studied model system in axon guidance is the growth of commissural axons to, across, and beyond the midline (Kaprielian, Imondi, & Runko, 2000). Commissural axons are initially attracted toward the midline (e.g., of the insect nerve cord or the vertebrate spinal cord) by a gradient of Netrin released by midline glia (Serafini et al., 1994; Kennedy, Serafini, de la Torre, & Tessier-Lavigne, 1994; Mitchell et al., 1996). On reaching the midline, they lose their responsiveness to Netrin but become responsive instead to the repellent molecule Slit, which binds to the Robo family of receptors (Brose et al., 1999; Kidd, Bland, & Goodman, 1999). In Drosophila, this family currently consists of the three members Robo, Robo2, and Robo3, each of which plays a particular role in enabling commissural axons to cross the midline and not then recross (Rajagopalan, Nicolas, Vivancos, Berger, & Dickson, 2000; Simpson, Kidd, Bland, & Goodman, 2000). After crossing, axons project to specific lateral pathways before turning longitudinally. Recent results from Drosophila (Rajagopalan, Vivancos, Nicolas, & Dickson, 2000; Simpson, Bland, Fetter, & Goodman, 2000) have focused on three specific longitudinal Fas-II positive tracts, referred to as medial, intermediate, and lateral. In this case, the choice of pathway depends on a combinatorial code of Robo receptors expressed by axons after they have crossed the midline—the so-called Robo code.
Axons that normally turn in the medial tract express only Robo, axons that normally turn in the intermediate tract express Robo and Robo3, and axons that normally turn in the lateral tract express Robo, Robo2, and Robo3 (see Figure 1A). Although a gradient of Slit emanating from the midline has not yet been directly observed, it has been hypothesized that this gradient drives axons to specific lateral positions based on their combinatorial code of Robo receptors (Rajagopalan, Vivancos, Nicolas, & Dickson, 2000; Simpson, Bland, Fetter, & Goodman, 2000). This hypothesis is supported by loss- and gain-of-function experiments (Rajagopalan, Vivancos, Nicolas, & Dickson, 2000; Simpson, Bland, Fetter, & Goodman, 2000). When particular Robo receptors are deleted, axons often project to a more medial fascicle than normal, whereas when particular Robo receptors are ectopically expressed, axons often project to a more lateral fascicle than normal. The shifts induced by these genetic perturbations are always discrete jumps between fascicles: these axons do not generally turn between fascicles. Both Rajagopalan, Vivancos, Nicolas, and Dickson (2000) and Simpson, Bland, Fetter, and Goodman (2000) argue that the Robo code targets axons to a general region; then local cues take over to target a particular tract within that region. Although these different Robo receptors all mediate repulsion by Slit, the specific signal transduction pathways they activate inside the growth
Figure 1: (A) Schematic diagram of the Robo code. Three different populations of commissural axons (dashed lines) are shown projecting to three different lateral fascicles: M (medial), I (intermediate), and L (lateral). (B) Parameters of the model. x0 = distance of the medial fascicle from the midline; 2d = spacing of the fascicles. S(M), S(I), and S(L) refer to the Slit concentrations at the medial, intermediate, and lateral fascicles, respectively.
cone may be different (discussed in Rajagopalan, Vivancos, Nicolas, & Dickson, 2000). One possibility is that each type of Robo receptor signals qualitatively different information regarding levels of Slit, which would be a combinatorial code in the sense most analogous to the other examples cited above. However, an alternative possibility is that the differences between types of Robo receptors are simply quantitative; each generates the same type of repulsive signal to particular levels of Slit, differing only in degree. Roughly speaking, this is consistent with the experimental results that the more Robo receptors an axon expresses, the more repulsion and thus the further its lateral projection. By analogy with levels of measurement in statistics (Krzanowski, 1988), we refer to the first possibility as a categorical model and the second as a ratio model. In the categorical model, only the categories of Robo receptors expressed by a given axon determine where it projects to, whereas in the ratio model, only the relative proportions matter. The categorical model is consistent with many of the experimental results. For instance, overexpressing Robo does not generally cause a lateral shift (Simpson, Bland, Fetter, & Goodman, 2000). However, a purely categorical model is insufficient to explain some results: for instance, axons that normally project laterally still do so in the robo/robo3 double mutant (Simpson, Bland, Fetter, & Goodman, 2000), and medially projecting axons that project to the intermediate zone when robo3 is added can be "supershifted" to the lateral zone by the addition of two copies of the robo3 gene (Rajagopalan, Vivancos, Nicolas, & Dickson, 2000). Results such as these led both Simpson, Bland, Fetter, and Goodman (2000) and Rajagopalan, Vivancos, Nicolas, and Dickson (2000) to propose a mixed categorical-ratio model, where Robo acts as a category (only the presence of Robo is important, not its precise amount) and Robo2 and Robo3 act in a more graded way. To explore these coding issues in a more quantitative way, we have developed a simple mathematical model of the Robo code based on a pure ratio and a mixed categorical-ratio approach. We show that a pure ratio model could in fact be consistent with the data if the Slit gradient is steep enough. This leads to experimentally testable predictions regarding how the outcome of Robo overexpression experiments might depend on the steepness of the Slit gradient. In addition, for the mixed categorical-ratio model, we show that even this version is consistent with the data only when the Slit gradient is sufficiently steep.

2 Mathematical Model

2.1 Form of the Slit Gradient. There is no direct evidence concerning the shape of the Slit gradient in vivo. Theoretically, one approach would be to model it as free diffusion from a line source in an infinite three-dimensional volume (Crank, 1975; Carslaw & Jaeger, 1959), analogous to a recent model of axon guidance by a target-derived diffusible factor (Goodhill, 1997, 1998). However, data from other systems suggest the diffusion might not be free (Hiramoto, Hiromi, Giniger, & Hotta, 2000). Given this uncertainty, it seems sensible to choose a very generic shape with a parameterized steepness for the Slit gradient. We therefore assume it to be a decaying exponential, S(x) = C e^(−x/α), where C gives the concentration of Slit at the midline, x is distance from the midline, and α gives the rate of decrease moving away from the midline. For small α, Slit is tightly concentrated around the midline, whereas for larger α, it decays more gradually.

2.2 Generation of Repulsion: Linear Model.
In the linear version of the model, we assume that the amount of repulsion induced in growth cones by Slit is given by S(x) × R, where R is a sum of the amounts of repulsion transduced by each Robo receptor separately (Nakamoto et al., 1996; Goodhill & Richards, 1999). That is, R = r for axons normally projecting to the medial fascicle, R = r + r3 for intermediate axons, and R = r + r3 + r2 for lateral axons, where r, r3, and r2 give the "net" amount of repulsion due to each Robo receptor (for the mixed categorical-ratio model, R is assumed to depend on only r2 and r3). The values of the r's that are consistent with the data come out of the model. For experiments where a second copy of a gene is inserted, this net repulsion is assumed to be doubled. That is, the net repulsion is assumed to be equal to the product of the number of receptors times the strength of repulsive signaling through a single receptor of that type. For Robo overexpression experiments, where the degree of amplification of r is uncertain, we introduce a multiplication
parameter n. The values of n that are consistent with the data emerge from the model. Each axon is assumed to be repelled by Slit until the amount of repulsion S(x) × R reaches a certain threshold T that is the same for all axons. Axons are then assumed to be attracted by local cues toward the nearest of the medial, intermediate, or lateral fascicles. The model presented here concerns only how axons target a particular general area based on their Robo code and does not explicitly address how local cues refine their positions within these general areas (we return to this in section 4). This means that the positions that axons reach defined purely by Slit repulsion can be anywhere in the region of a particular fascicle, provided they are nearer to that fascicle than to neighboring fascicles. Fascicles are assumed to be evenly spaced, distance 2d apart, with the medial fascicle distance x0 from the midline (see Figure 1B). We also assume that for axons to project to the medial fascicle, their turning position must be more than x0 − d from the midline, and for axons to project to the lateral fascicle, their turning position must be less than x0 + 5d from the midline. Based on data such as Figure 5A of Simpson, Bland, Fetter, and Goodman (2000), we chose x0 = 4, d = 1 (in arbitrary units), so that the three fascicles run at distances 4, 6, and 8 from the midline (absolute distances are not indicated in Simpson, Bland, Fetter, & Goodman, 2000, or Rajagopalan, Vivancos, Nicolas, & Dickson, 2000). These units set the scale for α. The constraints imposed by the experimental data can now be expressed as a set of inequalities that must all be simultaneously satisfied. As an example, consider the normal targeting of axons projecting to the intermediate fascicle. The equation for targeting to position x is

S(x)(r + r3) = T.    (2.1)
Substituting the exponential form for S gives e^(−x/α)(r + r3) = T/C. The ratio T/C sets the absolute scale for the r's. However, since we are interested only in the relative values of the r's, we can without loss of generality set T/C = 1. This yields a value for x of

x = α log(r + r3).    (2.2)
This target position must be closer to the intermediate tract than either the medial or lateral tracts. Thus we require (cf. Figure 1) that x0 + d < α log(r + r3) < x0 + 3d, or equivalently

e^((x0 + d)/α) < r + r3 < e^((x0 + 3d)/α).
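This targeting rule is easy to check numerically. The sketch below implements the linear model's turning position (equation 2.2, with T/C = 1) together with the nearest-fascicle assignment; x0 and d follow the article, while alpha and the r values are illustrative choices made here, not quantities fitted in the article.

```python
import math

# x0 and d follow the article; the fascicles then run at x = 4, 6, 8.
X0, D = 4.0, 1.0
FASCICLES = {"M": X0, "I": X0 + 2 * D, "L": X0 + 4 * D}

def turning_position(R, alpha):
    """Linear model: repulsion S(x)*R falls to the threshold (T/C = 1) at
    x = alpha * log(R), cf. equation 2.2."""
    return alpha * math.log(R)

def target_fascicle(R, alpha):
    """Assign the axon to the nearest fascicle, as local cues would."""
    x = turning_position(R, alpha)
    return min(FASCICLES, key=lambda f: abs(FASCICLES[f] - x))

# Illustrative (not fitted) repulsion strengths and gradient steepness:
r, r3, r2, alpha = 5.0, 15.0, 20.0, 2.0
print(target_fascicle(r, alpha))            # medial axons: R = r
print(target_fascicle(r + r3, alpha))       # intermediate: R = r + r3
print(target_fascicle(r + r3 + r2, alpha))  # lateral: R = r + r3 + r2
```

With these example values the three populations land in the medial, intermediate, and lateral fascicles, respectively, so the full set of inequalities can be satisfied simultaneously for a sufficiently steep gradient.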
Table 1 shows a set of inequalities that can be derived by this approach from the experimental data of Rajagopalan, Vivancos, Nicolas, and Dickson (2000) and Simpson, Bland, Fetter, and Goodman (2000) (see particularly the schematic Figure 4 in Simpson, Bland, Fetter, & Goodman, 2000). Cases labeled "crossing defect" are assumed to represent problems with crossing rather than with the lateral termination zone and are not considered further in our analysis. In case 9 in Table 1 (axons that normally project to the lateral fascicle in the robo2 mutant), the axons seem undecided between the intermediate and lateral fascicles. This was not included in our analysis, and we return to this issue in section 4. Not all of these inequalities are independent. For instance, if α log(r3) > x0 + d (index 5 in Table 1), then it is certainly true that α log(r + r3) > x0 + d (index 2 in Table 1), and the latter constraint is superfluous. However, we explicitly include all the constraints in Table 1, since this makes it easier to see which piece of datum each is based on and what would happen if each were deleted or altered.

2.3 Generation of Repulsion: Nonlinear Model. A more realistic picture of binding kinetics than the linear version presented above is

[RL] = [R][L] / ([L] + Kd),
where [R] is the receptor concentration, [L] is the ligand concentration, [RL] is the concentration of the product, and Kd is the dissociation constant (Gutfreund, 1995). When [L] ≪ Kd, this simplifies to the linear version (with Kd absorbed into the threshold T). The three Robo receptors have similar Kd's for binding Slit, in the range 10 to 40 nM (Simpson, Kidd, Bland, & Goodman, 2000), and we consider these as equal. The analog of equation 2.1 is then

[S(x) / (S(x) + Kd)] (r + r3) = T.

Substituting the exponential form for S(x) and dividing by C gives

[e^(−x/α) / (e^(−x/α) + 1/c̄)] (r + r3) = T,    (2.3)

where c̄ = C/Kd, the normalized Slit concentration at the midline. Again, T simply sets the overall scale of the r's, so we can set it to 1 without loss of generality. The analog to equation 2.2 is now

x = α log[c̄(r + r3 − 1)],

and an analogous set of constraints to Table 1 can be generated (not shown).
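The nonlinear turning rule can be sketched in the same way (again with illustrative, not fitted, parameter values, and T = 1):

```python
import math

def turning_position_nonlinear(R, alpha, cbar):
    """Saturating-kinetics model: repulsion falls to the threshold T = 1 at
    x = alpha * log(cbar * (R - 1)), where cbar = C/Kd is the normalized
    Slit concentration at the midline (the analog of equation 2.2).
    Defined only for R > 1; x is positive only when cbar * (R - 1) > 1."""
    return alpha * math.log(cbar * (R - 1))
```

For large c̄ (midline Slit concentration far above Kd), all turning positions shift laterally, while the position still increases monotonically with the total repulsion R, as in the linear model.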
[Table 1: the 29 experimental cases, each listing the genotype (wild type; single or double loss of robo, robo2, or robo3; Robo overexpression nr; or added copies of robo2 or robo3), the affected population (medial, intermediate, or lateral), the total repulsion R for that case, and the resulting inequality constraint on the r's.]
θ1. On the other hand, when the postsynaptic neuron is emitting at a low rate, its depolarization is fluctuating around the rest potential, and hence below θ2, so PR would be an increasing function of νpost. We have argued that the synaptic device behaves in a Hebbian way. In other words, a synapse between two neurons, both emitting at a high rate, tends to be potentiated (PT > 0, PR ≈ 1), while a synapse connecting a high-rate presynaptic neuron to a low-rate postsynaptic one tends to be depressed (PT > 0, PR ≈ 0). In fact, as we shall see, a dynamic synapse naturally extends the plasticity scenario: synapses in neural populations where one would expect LTP may undergo LTD, and vice versa. This may be due to fluctuations or to the width of the emission rate distribution in a functionally uniform neural population. When the average rates are high and close to each other in the pre- and postsynaptic cells, or if there is a very large difference between the two, deviations are few, and Hebb is maintained to a very good accuracy. But if the average rates are closer, as would, for example, be the case when one population of neurons is stimulated and another is in enhanced delay activity (the substrate for the generation of context correlations; Miyashita, 1988; Brunel, 1996), or of pair-associate representations (Sakai & Miyashita, 1991; Erickson & Desimone, 1999), LTP and LTD probabilities for the synapses connecting the two populations would both be significant. The one for LTP would be higher, to allow the association.
D. Amit and G. Mongillo
The structuring in that population will not go on indenitely (creating representation collapse), but will saturate at a level determined by the ratio of the two probabilities, a very welcome modication that will be explored in depth elsewhere (Mongillo & Amit, 2001a). 4 Simulations 4.1 Single Cell Model. As a model of the spiking neuron we use the linear integrate-and-re (LIF) model of Fusi and Mattia (1999) because it has been found to be consistent with the behavior of real cells in noisy conditions (Rauch, La Camera, Luscher, ¨ Senn, & Fusi, 2002) and also because of its analytical tractability. Despite its simplicity, this neural model is able to reproduce much of the phenomenology of networks of RC integrate-and-re neurons (Fusi & Mattia, 1999; Mongillo & Amit, 2001b). It is characterized by a ring threshold µ, a (postspike) reset potential Vr , a constant leakage current ¯I , and a refractory period ¿arp . The evolution of the membrane depolarization below threshold is ( ¡¯I C I.t/ if V.t/ > 0 (4.1) VP D 0 if V.t/ D 0 and VP < 0: I.t/ is the afferent current charging the cell’s membrane. When V crosses the threshold, a spike is emitted, and the depolarization is reset to Vr and kept constant for ¿arp milliseconds. If I.t/ is stationary, gaussian, and independent at different times (delta correlated), the distribution of the depolarization P.V/ and the mean emission rate º are calculated to be (Fusi & Mattia, 1999) ± ²i º h ¹ 1 ¡ exp ¡2 2 .µ ¡ V/ for V 2 [0; µ] (4.2) P.V/ D ¹ ¾ µ ³ ´¶¡1 ¾ 2 2¹µ ¡ 2¹µ ¾2 ¡ 1 C (4.3) º ´ 8.¹; ¾ 2 / D ¿arp C e ; 2¹2 ¾ 2 for Vr D 0. The hypotheses on the statistics of the afferent current are quite well satised in a large spiking network (Amit & Brunel, 1997b). This model does not include adaptation effects, essential for tting neural response dynamics of real cells (Rauch et al., 2002), limiting the increase in neural emission rates during sustained stimulation. 
In our simulations, this simplification is corrected by hand (see section 4.2.5).

4.2 The Learning Process.

4.2.1 The Network Architecture. The network (modeling a cortical module) is composed of NE pyramidal cells and NI interneurons. Each neuron receives, on average, CE synaptic contacts from excitatory and CI from inhibitory neurons inside the network, and Cext excitatory synaptic contacts
Spike-Driven Synaptic Dynamics Generating Working Memory States
representing external afferents (Amit & Brunel, 1997b). The current resulting from the activation of the external synapses represents noise from the rest of the cortex and selective afferents due to stimuli. When the network is set up, the afferent presynaptic neighbors of a given neuron are selected as follows: for each (postsynaptic) neuron (excitatory or inhibitory), one selects the presynaptic neighbors, independently and at random, by a binary process in which a presynaptic excitatory (inhibitory) neuron is a neighbor with probability CE/NE (CI/NI). The existing inhibitory synapses, as well as the excitatory synapses onto interneurons, are assigned continuous values, each drawn from a gaussian distribution with its preassigned mean and variance. These synapses remain fixed throughout the simulation. The plastic excitatory-excitatory synapses (prior to training) are distributed independently and randomly, with P(Jij(0) = Jp) = Cp0 = 0.25. This is equivalent to a distribution for excitatory-excitatory synapses with mean JEE/θ = Cp0 Jp + (1 − Cp0)Jd = 0.022 and variance Δ²JEE/θ² = Cp0(1 − Cp0)(Jp² + Jd²) = 0.00063. Numerical values correspond to the parameters in Table 1. Furthermore, to each synapse is associated a transmission delay δ, which represents the time needed by the presynaptic spike to affect the postsynaptic depolarization. In our simulations, we chose δ = 1 ms for all the synapses. This is the unstructured state of the network, which is supposed to sustain spontaneous activity. The complete list of parameters used in the simulations is reported in Table 1.

4.2.2 The Dynamics. The simulation consists of a numerical integration of the discretized dynamical equations of both the neurons and the plastic synapses. The temporal step Δt is chosen to be shorter than the time between two successive afferent spikes. In our simulations, we take Δt = 0.05 ms. The initial distribution of depolarization in the network is set uniform.
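The random connectivity and initial synaptic matrix of section 4.2.1 can be sketched in a few lines; the sizes below are scaled down from Table 1 for a quick illustration, only the plastic E→E matrix is built, and the exclusion of self-connections is an assumption of this sketch:

```python
import numpy as np

# Scaled-down sizes (Table 1 uses NE = 5000, CE = 380).
NE, CE = 500, 38
Jp, Jd, Cp0 = 0.057, 0.01, 0.25   # efficacies in units of theta

rng = np.random.default_rng(1)

# Each E->E contact exists independently with probability CE/NE, ...
connected = rng.random((NE, NE)) < CE / NE
np.fill_diagonal(connected, False)   # no self-connections (sketch assumption)
# ... and each existing plastic synapse starts at Jp with probability Cp0.
J_ee = np.where(rng.random((NE, NE)) < Cp0, Jp, Jd) * connected

# Mean and variance of the efficacy distribution before structuring:
mean_eff = Cp0 * Jp + (1 - Cp0) * Jd           # ~0.022, as in the text
var_eff = Cp0 * (1 - Cp0) * (Jp**2 + Jd**2)    # ~0.00063, as in the text
```

The empirical connection probability and the two moments of the efficacy distribution can then be checked against the values quoted above.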
Spikes begin to be emitted due to the external excitatory afferents. All neurons receive a nonselective external current Ii(ext) = Jext ρ(t), where ρ(t) is a Poisson process with mean Cext νext per unit time. νext is taken to be of the order of the rate of the excitatory neurons in the spontaneous state and is kept fixed. At each Δt, the afferent external current is Jext ρ(t)Δt. The initial distribution of the neural depolarization is found not to be important. The network reaches its stationary state (depending on the synaptic matrix and the level of external signal) within short relaxation times (see Figure 6 and the discussion in section 5.1). The network is allowed to evolve freely (i.e., without stimulus presentation) for about 100 ms, to reach stationary spontaneous activity, and then the training stage begins. The depolarization of all neurons is sequentially updated according to equation 4.1. If Vj(t + Δt) > θ, a spike is delivered to all postsynaptic neurons, and the depolarization is reset to Vj = Vr. The spike adds to the depolarization of postsynaptic neuron i, at time t + Δt + δ, the value of the synaptic efficacy connecting neuron j to i, if they are connected.
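A minimal sketch of one such update for a single neuron (omitting refractoriness and transmission delays; the parameter values follow Table 1):

```python
def lif_step(V, I_aff, dt=0.05, beta_I=0.0113, theta=1.0, V_reset=0.0):
    """One Euler step of equation 4.1: constant leakage -beta_I, a reflecting
    barrier at V = 0, and a spike with reset when V crosses theta.
    I_aff is the total afferent charge arriving during this step.
    Returns (new depolarization, spike emitted?)."""
    V = max(V - beta_I * dt + I_aff, 0.0)
    if V > theta:
        return V_reset, True
    return V, False
```

In the full simulation this step is applied to every neuron at every Δt, with the spikes it reports feeding the synaptic updates described above.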
Table 1: Parameters Used in the Simulations.

Network parameters
  NE          Number of pyramidal cells                    5000
  NI          Number of interneurons                       1250
  CE          Number of recurrent E connections per cell   380
  CI          Number of recurrent I connections per cell   120
  Cext        Number of connections from outside           380
  νext [Hz]   Spike rate at external synapses              5
  p           Number of selective populations              5
  f           Fraction of cells responding to a stimulus   0.1

Single cell parameters
  θ [θ]          Spike emission threshold     1
  Vr/θ           Postspike reset potential    0
  βI/θ [ms⁻¹]    Leakage current              0.0113
  τarp [ms]      Absolute refractory period   2

Synaptic parameters
  Jext/θ    EPSP produced by external afferent       0.019
  JEI/θ     IPSP amplitude on pyramidal cells        0.063
  JIE/θ     EPSP amplitude on interneurons           0.025
  JII/θ     IPSP amplitude on interneurons           0.055
  Jp/θ      EPSP produced by potentiated synapse     0.057
  Jd/θ      EPSP produced by depressed synapse       0.01
  Cp0       Fraction of Jp in unstructured network   0.25
  δ [ms]    Synaptic delay                           1

Synaptic dynamics parameters
  θX          Threshold for synaptic transition    0.374
  θ1/θ        Threshold for upregulation of X      0.7
  θ2/θ        Threshold for downregulation of X    0.5
  α [ms⁻¹]    Drift toward 0                       0.0067
  β [ms⁻¹]    Drift toward 1                       0.01
  a           Amplitude of up jump                 0.17
  b           Amplitude of down jump               0.14

Notes: Units are given in brackets. Before structuring, the mean and variance of the excitatory efficacy in the excitatory population are given by JEE/θ = Cp0 Jp + (1 − Cp0)Jd = 0.022 and Δ²JEE/θ² = Cp0(1 − Cp0)(Jp² + Jd²) = 0.00063.
Moreover, if the emitting neuron is excitatory, all plastic synapses connecting it to other excitatory neurons are updated according to equations 3.1, 3.2, and 3.4. The level of the postsynaptic depolarization, which determines the direction of the jump in X(t) according to equation 3.4, is read at the time of the presynaptic emission (see section 3.1). If Xij(t + Δt) (the internal variable of the plastic synapse connecting neuron j to i) crosses the threshold θX, the efficacy Jij(t + Δt) is suitably modified, producing either LTP or LTD.

4.2.3 Statistics of Stimuli. The p stimuli to be used in training are set up when the simulation is initialized and are kept fixed. Each stimulus corresponds to a pool of f NE visually responsive excitatory neurons. f is the coding level of the stimuli and is chosen low (f ≪ 1). These pools are selected nonoverlapping, that is, the neurons have a perfectly sharp tuning curve, and pf < 1. They are p consecutive groups of f NE neurons. Although this is not a realistic constraint, it is very useful in allowing the monitoring of the complex double dynamics by mean-field theory and in greatly simplifying the analysis of the learning dynamics. This fact, however, does not render the process trivial (see section 5.1). Moreover, for low memory loading levels, p ≪ 1/f, it is a good approximation to the case of an unconstrained random choice of stimuli (Amit & Brunel, 1997b). At higher loading levels, one must confront the full issue of the network memory capacity, whose details are beyond this study. The presentation of a stimulus is expressed by an increase in the rates of the external afferents to the selective cells (the corresponding pool). The rate of spikes arriving at these neurons is increased by a factor ge > 1. ρk(t) is still a Poisson process, but with mean ge Cext νext dt, where the index k runs over the visually responsive population. External currents to the other excitatory neurons are unaltered. Similarly, the afferents to all interneurons increase their activation rates by a factor gi > 1 (Miller et al., 1996). Accordingly, the neurons selective to the stimulus presented emit at elevated rates. In other words, a stimulus elicits a visual response in the same subset of cells whenever it is presented. There is no noise in the process of presentation. In a more realistic situation, the subset of cells excited may be slightly different on different presentations of the same stimulus. Accordingly, we carried out simulations with a noisy version of the stimuli. In these simulations, each stimulus was considered a prototype of a specific set of f NE cells.
In each presentation, a fraction 1 − f1 of the prototype was excited, as well as a fraction f2 of all other (1 − f)NE excitatory cells. f1 is the noise level, and f1 = f2 = 0 is the noiseless case. To have a constant (on average) number of stimulated cells in each presentation, we chose f2 so that f2 = f1 f/(1 − f).

4.2.4 Training Protocol. For the training protocol, the set of stimuli is repeatedly presented to the network. Each stimulus appears for Tstim milliseconds. Then it is removed, and following a delay period of Tdelay, during which none of the populations is stimulated, another stimulus is presented. The presentation sequence of the stimuli is either kept fixed (1 → ··· → p → 1) or generated by choosing each stimulus to be presented, independently and randomly, with probability 1/p. All along the simulation, during and between stimulus presentations, the neural and synaptic dynamics are free and are described by equations 3.4 and 4.1.

4.2.5 Control of Visual Response. As structuring takes place, the increasing average recurrent excitatory synaptic efficacy causes a significant increase of the neural emission rates, at parity of external signal (contrast, ge). Such an increase is doubly undesirable: first, because it is not observed experimentally (Erickson & Desimone, 1999), and second, and more relevant for the present purposes, because it distorts the learning process. As things stand, it seems to be an artifact of the simplicity of the single cell dynamics, as well as of the synaptic transmission model, rather than of the learning process. In a more realistic situation, this problem is resolved by the adaptation features of the neurons and of the synaptic transmission (Tsodyks & Markram, 1997). This tendency of increased visual response during training is partially balanced by the stimulation of the inhibitory neurons during stimulus presentation. The enhanced firing of the interneurons limits the visual response of the pyramidal cells via the hyperpolarizing currents. Moreover, it enhances the synaptic depression process between stimulated and unstimulated neurons, because the stronger hyperpolarizing currents, arising from stimulated interneurons, decrease the mean depolarization level of the neurons of the unstimulated population. This enhances the depression of the synapses afferent on them. The stimulation of inhibitory neurons, and its effects, appears to be consistent with experimental findings (Steele & Mauk, 1999). However, the activity of the inhibitory population cannot grow beyond a certain level, because the interneurons also inhibit each other. On the other hand, inhibitory contacts among interneurons cannot be weakened too much, or the emission rate of the inhibitory population becomes so high as to suppress the activity of the excitatory cells completely. When inhibition was no longer sufficient, we artificially kept the rate of the stimulated neurons approximately constant during the learning process.
The emission rates during stimulus presentation are suggested by the properties of the single synapse (see section 3.2) and are chosen at the start of the simulation. When, as a consequence of synaptic structuring, these rates exceed the initial level by more than 15%, stimulus presentation is interrupted. The synaptic structuring reached is used to calculate, by mean-field analysis (see the appendix), a new level of contrast (g_e) for the stimuli, keeping g_i fixed. g_e is calculated to produce the original rate of visual response, and the simulation resumes.

4.2.6 Observables Monitored. During the simulation, various observables related to the collective neural activity as well as the synaptic dynamics are sampled. To describe the evolution of the structuring among excitatory neurons in the network, we define functional neural populations and functional synaptic populations. The neural populations are of excitatory neurons: the population corresponding to one of the p stimuli; when a particular stimulus is presented, the union of the p - 1 populations corresponding to the other stimuli, not now recalled; and the population of background cells. In every phase of the simulation, the neurons in each of these populations receive the same average external input and see, on average, the same synaptic structure.
Spike-Driven Synaptic Dynamics Generating Working Memory States
579
The functional synaptic populations are the synapses connecting pairs of neurons within a functional neural population or between two different functional neural populations. A synaptic population is therefore fully defined once the functional properties of its pre- and postsynaptic neurons are specified. The state of any synaptic population (its structuring) is quantified by the fraction, C_p, of potentiated synapses in that population. In practice, we monitor only two structuring parameters, C_p. One of them is C_p^HH, between neurons in one of the p selective populations. It is one parameter and not p, since we observe very small variability of the structuring corresponding to the different stimuli, and so we monitor their average. The other is C_p^HL, between neurons belonging to a selective population and those in the background or in other, unstimulated populations. Again, the average over p such populations is monitored. Recall that only synapses with high-rate presynaptic neurons have a significant transition probability. We do not distinguish between postsynaptic neurons in the background and neurons of nonselected populations, because their rates are rather close. This is obvious before the network is significantly structured. It remains so after structuring, due to the enhanced inhibitory activity. Most of the (nonstimulated) neurons emit at very low rates, and the distributions of frequencies in those populations are quite similar. This was checked by carrying out a set of simulations and comparing the structuring of the synaptic population between visually responsive and background neurons with the structuring between visually responsive neurons and neurons responsive to a different stimulus. Moreover, we do not consider the structuring between background, or unselected, neurons and other populations, because the presynaptic rates are low, and so structuring is negligible, as confirmed by the simulations.
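In code, the two monitored structuring parameters amount to averaging a binary potentiated/depressed synaptic state over the relevant synaptic populations. A minimal sketch; the sizes, the contact probability, and the random unstructured state below are illustrative assumptions, not the parameters of Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: N_E excitatory neurons, p nonoverlapping
# selective pools of f*N_E neurons each.
N_E, p, f = 1000, 4, 0.05
pool_size = int(f * N_E)
pools = [np.arange(k * pool_size, (k + 1) * pool_size) for k in range(p)]

# Binary synaptic state: True = potentiated (J_p), False = depressed (J_d),
# defined only where a contact exists (connectivity mask c).
c = rng.random((N_E, N_E)) < 0.1             # contact probability (assumed)
potentiated = rng.random((N_E, N_E)) < 0.25  # unstructured state, Cp0 = 0.25

def structuring(pre_idx, post_idx):
    """Fraction C_p of potentiated synapses from pre_idx onto post_idx."""
    contacts = c[np.ix_(post_idx, pre_idx)]
    return potentiated[np.ix_(post_idx, pre_idx)][contacts].mean()

# Cp^HH: within each selective pool, averaged over the p pools.
cp_hh = np.mean([structuring(pool, pool) for pool in pools])
# Cp^HL: from each selective pool onto all other excitatory cells.
rest = [np.setdiff1d(np.arange(N_E), pool) for pool in pools]
cp_hl = np.mean([structuring(pool, r) for pool, r in zip(pools, rest)])
print(cp_hh, cp_hl)  # both near Cp0 = 0.25 before any structuring
```

During learning, the same two averages are resampled after each presentation, which yields curves like those of Figure 3.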
Because the selective neural populations are nonoverlapping, the functional synaptic populations defined above are also nonoverlapping. This is true also for the synaptic populations between selected neurons and the rest, because even when the synapses of the two populations share postsynaptic neurons, the presynaptic neurons are different. The synaptic structuring reached in one synaptic population is therefore not disrupted by the presentation of a different stimulus, and this fact simplifies the analysis of synaptic structuring. The level of synaptic structuring required for WM states may be estimated by mean-field analysis (see the appendix). When synaptic structuring reaches the level theoretically estimated, the training stage is interrupted. Each stimulus is presented for 150 ms and then removed, and the network evolves for 1 second in the absence of further stimulation. If, following the removal, the network exhibits a WM state for all stimuli, the learning stage is terminated. Otherwise, it is resumed. If stimulus presentation continues after WM activity has appeared, the structuring reaches an asymptotic level, and further presentations do not affect the synaptic matrix. The asymptotic
structuring level expresses a detailed balance between potentiating and depressing processes (see section 6.1).

5 Results

5.1 The Structuring Process. The simulations presented here are for nonoverlapping stimuli. The parameters are given in Table 1. Even in this basic case, the process of synaptic structuring, and the consequent appearance of WM states, is not trivial. Indeed, various instabilities tend to appear during learning. The most common instabilities encountered in our simulations were oscillatory behavior, uncontrolled growth of the network global activity, and depression where one expects potentiation, or vice versa. The essential underlying reason is that to reach stable, selective WM states, synapses must not only be potentiated among visually selective neurons; this potentiation must also be accompanied by adequate depression of the synapses from selective to nonselective cells (Brunel, 2000). On the other hand, with the synaptic dynamics proposed here, any population of synapses exhibits potentiations as well as depressions, as we explain below. Depression of synapses that should be potentiated can prevent the formation of WM states, because the fraction of potentiated synapses remains too low, at fixed J_p and J_d. Often these effects appear when the level of visual response becomes too high. As a consequence, the rate of synaptic modifications increases appreciably, and hence the effects of unwanted synaptic transitions, occurring due to the wide distribution of rates inside the neural populations, are amplified. Alternatively, the ratio between the probabilities of potentiation and depression could vary, again provoking unwanted synaptic modifications. Another source of unwanted plasticity is the correlation between the stimulated and the unstimulated pyramidal cells. Due to the increased inhibitory activity, the activity in the background population is quite low.
The spike emission of a neuron belonging to the background population is provoked principally by the spikes afferent from the visually responsive population. As a consequence, spike emission in a presynaptic neuron of the selective population consistently tends to precede the emission of the postsynaptic neuron if the latter belongs to the background population (Rubin, Lee, & Sompolinsky, 2001). Thus, despite the low emission rate, the postsynaptic depolarization can often be found at a high level when the synapse is activated by presynaptic emission. This could provoke LTP instead of the appropriate LTD. However, this effect is negligible when the network is sufficiently large and the single EPSP is small with respect to the threshold for spike emission. Hence, in a way, the results we report represent a demonstration that in the rich and complex space of parameters of the synapses and the neurons there exists a zone in which the fundamental structuring for working memory can take place. It would have been more satisfactory had the scenario been
more robust, in the sense of self-organization, so that learning would induce neural activity conducive to structuring. But the complexity of the situation limits us to one step at a time. Figure 3 reports the average fraction of potentiated synapses in both the HH and HL synaptic populations as a function of the number of presentations per stimulus. C_p increases monotonically with stimulus presentation in the HH populations, while the fraction of potentiated synapses decreases with stimulus presentation in the HL populations. Figure 4 reports the average number (averaged over stimuli) of synaptic potentiations and depressions per presentation in both the HH and HL synaptic populations. Note that there are depressions in the HH populations as well as potentiations in the HL populations, due to the wide distribution of rates in each
Figure 3: Synaptic structuring: C_p in the two monitored synaptic populations as a function of the number of presentations (nonoverlapping populations). Horizontal line: the unstructured initial state (C_p0 = 0.25). C_p^HH (between visually responsive neurons) increases monotonically with stimulus presentation; C_p^HL (from visually responsive to nonresponsive neurons) decreases monotonically. The values of C_p^HH and C_p^HL are averaged over all functionally equivalent synaptic populations. The interpopulation variability is included in the circles representing the points. Synaptic structuring allowing for WM is reached after 105 presentations of each stimulus. See also Figure 7. The parameters are given in Table 1.
Figure 4: Number of synaptic transitions (stimulus average) per presentation versus presentation number, for the first 20 presentations. Squares represent the number of potentiations and circles the number of depressions. (Left) HH population: the number of potentiations is greater than the number of depressions. (Right) HL population: the number of depressions is greater than the number of potentiations. The difference between the absolute numbers of transitions in the two panels is due to the greater number of synapses in the HL population. Parameters are as in Figure 3.
population (see Figure 5), which produces a finite probability that synapses within a selective population of neurons find a high-rate presynaptic neuron and a low-rate postsynaptic neuron. Similarly, there is a finite probability that synapses between selective and nonselective populations find two high-rate neurons. However, with the parameters of Table 1 and the relevant rates, the number of "correct" transitions is overwhelmingly greater than the number of "wrong" transitions. The outcome is an increase of C_p^HH and a strengthening of the mean synaptic efficacy among selective neurons, and a monotonic decrease of C_p^HL and a depression of the mean synaptic efficacy from selective to nonselective neurons. In this situation, we recover the classical Hebb rule. As a control, we have verified that similar results are obtained when the set of stimuli is noisy. We carried out simulations with f_1 = 0.1 and f_1 = 0.2. Recall that 1 - f_1 is the probability that a neuron of the prototype is effectively activated by its presentation in the presence of noise. The only noticeable, and expected, effect was that WM states require more presentations to appear: 140 to 150 presentations per stimulus, instead of the 100 to 110 required in the noiseless case.

5.1.1 Appearance of Working Memory. Working memory (selective delay activity) states appear for C_p^HH = 0.875 and C_p^HL = 0.0045. The appearance of WM states is related not only to the level of synaptic structuring, that is, the values of C_p in the various synaptic populations, but also to the other
Figure 5: Population distribution of spike rates during stimulation. (Left) Stimulated population; the mean emission rate is ν_stim = 60 Hz. (Right) Unstimulated pyramidal cells; the mean emission rate is ν_bg = 0.9 Hz. Despite the low mean emission rate, neurons are found with high rates. Note that the high-frequency tail of the background distribution overlaps with the stimulated distribution. This could impair the synaptic structuring. Recurrent inhibition is effective in reducing this overlap. See the text for details.
parameters of the network, principally the ratio J_p/J_d between the potentiated and depressed synaptic efficacies. For example, decreasing J_p once the structuring level allowing for WM states is reached prevents the appearance of WM states. On the other hand, increasing J_p could destabilize spontaneous activity. All network parameters, such as connectivity, mean synaptic efficacies, level of synaptic structuring for WM states, and so forth, should be chosen to lie in a biologically plausible range. Starting from an unstructured synaptic matrix (C_p0 = 0.25), 100 to 150 presentations per stimulus are required to reach stable WM, that is, selective delay states. Figure 6 shows the neural activity in a visually responsive population before, during, and following the presentation of a stimulus, at various levels of synaptic structuring. What is plotted is the number of spikes emitted per neuron, in bins of 5 ms, averaged over the selective populations and normalized to Hz (dividing by the bin size). The transients into the various stationary states, spontaneous activity, WM, or stimulated state, are
Figure 6: Selective population activity during a trial at three structuring stages: population-averaged neural activity (in 5 ms bins) converted to Hz (see text), before, during, and following stimulation. Arrows indicate the 150 ms stimulation interval. Note the short transients from one stationary state to another. Mean-field predictions of the mean emission rates, ν_th, during stimulation (ST), spontaneous activity (SA), and working memory (WM) activity are compared with the results of the simulation. (A) Fifty-five presentations per stimulus; C_p^HH = 0.755, C_p^HL = 0.0245. ST: ν_th = 51.5 Hz, ν_sim = 44.5 ± 1.5 Hz; SA: ν_th = 1.3 Hz, ν_sim = 1.21 ± 0.21 Hz. (B) Seventy presentations per stimulus; C_p^HH = 0.8, C_p^HL = 0.016. ST: ν_th = 51.5 Hz, ν_sim = 45.7 ± 2.9 Hz; SA: ν_th = 1.3 Hz, ν_sim = 1.17 ± 0.18 Hz. Horizontal dashed lines: average emission rate during stimulus presentation and SA. In both A and B there is no WM: the population returns to the spontaneous state after removal of the stimulus. (C) Appearance of a WM state: 105 presentations per stimulus; C_p^HH = 0.875 and C_p^HL = 0.0045. WM: ν_th = 27 Hz, ν_sim = 24.3 ± 4.25 Hz. Horizontal dashed line: average emission rate during WM activity. Large fluctuations are due to strong finite-size effects. See text for details.
short, on the order of 50 to 100 ms. In Figures 6A and 6B, the level of synaptic structuring is still too low, and the stimulated population returns to spontaneous activity when the stimulus is removed. In Figure 6C, one observes a WM state, in which the cells of the selective populations emit at elevated rates (24.3 Hz) following the removal of the stimulus. Mean-field predictions for the mean emission rates during spontaneous and WM activity (dashed horizontal lines) are in good agreement with the simulations. During WM activity, we observed very large fluctuations of the average emission rate. This is due to finite-size effects; to reduce the duration of the simulations with the double dynamics, the selective populations were taken relatively small, and, given the low connectivity, a neuron receives only about 38 connections from other neurons in the same selective population (see Table 1). Due to the small number of connections from other selective cells and the stochastic nature of spike emission, the current afferent on a cell has large fluctuations (of relative order 1/√N, where N is the number of presynaptic neurons) around its mean, which produce fluctuations in the emission rate.

6 Estimates of the Structuring Process

6.1 Population Dynamics of the Structuring Level. We define q_+ as the probability that a depressed synapse undergoes LTP during one stimulus presentation and q_- as the probability of depression of a potentiated synapse per presentation. It is expected that q_+ > q_- in an HH synaptic population, and vice versa in an HL population. The fraction of potentiated synapses evolves, as a function of the number of stimulus presentations, according to

C_p(n+1) = C_p(n)[1 - q_-(n)] + q_+(n)[1 - C_p(n)],   (6.1)
where C_p(n) represents the fraction of potentiated synapses in a given synaptic population after n presentations of the same stimulus. The dependence of q_+ and q_- on n takes into account the possibility that they could vary along the learning process. If neither the frequencies of the visual response nor the statistics of the emission process varies appreciably during learning, q_+ and q_- remain approximately constant. In this case, for nonoverlapping populations, after a large number of presentations,

C_p^HH = q_+^HH / (q_+^HH + q_-^HH)   for HH populations,
C_p^HL = q_+^HL / (q_+^HL + q_-^HL)   for HL populations.   (6.2)
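With constant transition probabilities, equation 6.1 is a linear recursion whose fixed point is equation 6.2, which can be checked numerically. A minimal sketch; the q values are merely illustrative, of the order of those fitted in Figure 7:

```python
def iterate_cp(cp0, q_plus, q_minus, n_steps):
    """Iterate equation 6.1 with constant transition probabilities."""
    cp = cp0
    for _ in range(n_steps):
        cp = cp * (1.0 - q_minus) + q_plus * (1.0 - cp)
    return cp

q_plus, q_minus = 0.023, 0.043        # illustrative values
cp_inf = q_plus / (q_plus + q_minus)  # asymptotic value, equation 6.2

cp = iterate_cp(cp0=0.25, q_plus=q_plus, q_minus=q_minus, n_steps=500)
assert abs(cp - cp_inf) < 1e-9  # the iteration converges to the fixed point
```

The recursion contracts by a factor 1 - q_+ - q_- per presentation, so convergence to the fixed point is exponential in n.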
Note that when the selective populations overlap, the structuring of a given population could change even if its preferred stimulus is not presented. In
this case, equation 6.1 becomes more complicated, and one must also take into account the structuring due to the presentation of different stimuli. We do not expand on this issue here. In our case (see, e.g., Figure 4), the number of wrong transitions (depressions instead of potentiations in the selective populations, and potentiations between selective and nonselective populations) is rather small. We neglect these transitions, and equation 6.1 becomes

C_p(n+1) = C_p(n) + q_+(n)[1 - C_p(n)]   for HH populations,
C_p(n+1) = C_p(n)[1 - q_-(n)]            for HL populations.   (6.3)
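For constant q_+ and q_-, equations 6.3 integrate to closed form: C_p^HH(n) = 1 - (1 - C_p0)(1 - q_+)^n and C_p^HL(n) = C_p0(1 - q_-)^n. A sketch, using C_p0 and the q values quoted in the Figure 7 caption:

```python
# Closed-form solutions of equations 6.3 for constant transition
# probabilities (q values as in the Figure 7 caption).
def cp_hh(n, cp0=0.25, q_plus=0.0233):
    # HH branch: C_p(n) = 1 - (1 - C_p0) * (1 - q_+)**n
    return 1.0 - (1.0 - cp0) * (1.0 - q_plus) ** n

def cp_hl(n, cp0=0.25, q_minus=0.0426):
    # HL branch: C_p(n) = C_p0 * (1 - q_-)**n
    return cp0 * (1.0 - q_minus) ** n

# The recursion and the closed form agree step by step.
cp = 0.25
for n in range(1, 21):
    cp = cp + 0.0233 * (1.0 - cp)  # equation 6.3, HH branch
    assert abs(cp - cp_hh(n)) < 1e-12
```

These are the exponentials that are fitted to the measured C_p in Figure 7.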
The dashed curves in Figure 7 are a least-squares fit of C_p in the HH and HL populations, measured in the simulation during the first 20 stimulations, by the solution of equation 6.3 with constant q_+ and q_-. The degree of agreement between the curves indicates that these probabilities are approximately constant during learning, partly because the rates under stimulation are kept approximately constant.

6.2 Population Transition Probabilities and Transition Numbers. The probability of an LTP transition in a homogeneous population of synapses, per stimulus presentation of duration T, can be estimated as follows.3 Let c be the probability of a synaptic contact from a neuron in the presynaptic population to a neuron in the postsynaptic one. The population of synapses connects neurons in two populations: one of N_pre presynaptic neurons and one of N_post postsynaptic ones (the physical neurons in the two populations can, of course, be the same). The number of potentiations in such a synaptic population, per presentation of a stimulus, is a random variable,

N_pot = Σ_{i,j} c_ij · γ_ij,
where the index j runs over the presynaptic neurons and i over the postsynaptic ones. If there exists a synapse between the two neurons, c_ij = 1; otherwise c_ij = 0. γ_ij = 1 if the synapse was potentiated during the stimulation; otherwise γ_ij = 0. Hence, γ_ij = 1 with probability P_LTP(ν_i, ν_j, T) · δ(J_ij - J_d), where J_ij is the synaptic efficacy before the stimulation and δ(·) is 1 if its argument is zero, and zero otherwise. The mean number of potentiations is

⟨N_pot⟩ = Σ_{i,j} c_ij δ(J_ij - J_d) · P_LTP(ν_i, ν_j, T).   (6.4)
The sum on the right-hand side can be converted to a sum over pairs of activities (ν_i, ν_j). Let P_post(pre)(ν) be the probability of finding a neuron with

3 The reasoning is analogous for LTD.
Figure 7: Population dynamics description of synaptic structuring during the initial 20 presentations per stimulus. (Left) Average C_p in HH populations. (Right) Average C_p in HL populations. Dashed lines represent the solution of equation 6.3, with q_+ = 0.0233 and q_- = 0.0426, constant (independent of n). Error bars are included in the circles. Horizontal lines: unstructured state of the network.
rate ν in the post-(pre-)synaptic neural population, respectively. The probability of having a depressed synapse with a presynaptic neuron at rate ν_j afferent on a postsynaptic neuron at rate ν_i is c(1 - C_p)P_pre(ν_j)P_post(ν_i). Hence, when the number of neurons is large, equation 6.4 becomes

⟨N_pot⟩ ≈ c(1 - C_p) N_pre N_post ∫∫ dν_i dν_j P_LTP(ν_i, ν_j, T) P_pre(ν_j) P_post(ν_i).   (6.5)

Correspondingly, the mean number of synaptic depressions is

⟨N_dep⟩ ≈ c C_p N_pre N_post ∫∫ dν_i dν_j P_LTD(ν_i, ν_j, T) P_pre(ν_j) P_post(ν_i).   (6.6)
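The double integral in equation 6.5 can be evaluated by Monte Carlo sampling over the two rate distributions. The sketch below uses an assumed gaussian rate distribution and an assumed sigmoidal form for P_LTP; both are illustrative stand-ins for the measured distributions and for the probability obtained from the synapse model:

```python
import numpy as np

rng = np.random.default_rng(1)

def p_ltp(nu_i, nu_j):
    """Illustrative LTP probability per presentation as a function of
    postsynaptic (nu_i) and presynaptic (nu_j) rates: a sigmoidal toy
    form, significant only when both rates are high."""
    return 0.05 / ((1.0 + np.exp(-(nu_j - 40.0) / 5.0))
                   * (1.0 + np.exp(-(nu_i - 40.0) / 5.0)))

c, cp = 0.1, 0.25    # contact probability and current structuring (assumed)
n_pre = n_post = 50  # pool sizes (illustrative)

# Sample rates from a stimulated-like distribution (mean ~60 Hz, cf. Figure 5).
nu_pre = rng.normal(60.0, 10.0, size=100_000)
nu_post = rng.normal(60.0, 10.0, size=100_000)

# Monte Carlo estimate of the double integral: this is q_+ for the population.
q_plus = p_ltp(nu_post, nu_pre).mean()
# Equation 6.5: mean number of potentiations per presentation.
mean_n_pot = c * (1.0 - cp) * n_pre * n_post * q_plus
print(q_plus, mean_n_pot)
```

With the measured rate distributions and the P_LTP obtained from the single-synapse simulation in place of these stand-ins, the same average yields the theoretical q_+^th quoted below.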
If, during learning, the integrals on the right-hand side of equations 6.5 and 6.6 do not vary significantly, they can be identified, respectively, with q_+ and q_-. In that case, it is straightforward to obtain equations 6.3 from equations 6.5 and 6.6. In order to compare the theoretical estimates with the values of q_+ and q_- obtained from simulations, we have to estimate P_LTP and P_LTD over the range of frequencies observed in the simulation. For P_LTP, the rate intervals of both pre- and postsynaptic activity cover the distribution in the stimulated population. Similarly, for P_LTD, ν_pre varies over the emission rates of stimulated neurons, while ν_post varies over the rates of unstimulated neurons (see, e.g., Figure 5). To obtain the currents giving rise to the observed emission rates, we proceeded as follows. In the stimulated population, the neurons operate in a signal-dominated regime (Fusi & Mattia, 1999), with small fluctuations. We
assume that the variance of the current is constant within the population and is given by mean-field analysis (see the appendix). The mean current corresponding to a given frequency is simply obtained (for constant σ) by inverting the transduction function, equation 4.3, which is a monotonic function of μ. In the unstimulated population, the neurons operate in the noise-dominated regime (Fusi & Mattia, 1999); spike emission is due to sporadic large fluctuations in the afferent current, whose mean tends to hyperpolarize the membrane. Mean emission rates are very low, and the corresponding population distribution of rates is peaked around the mean-field result. We assume that all neurons in the population are in this state of activity. Mean-field theory then provides the mean and variance of the currents. We proceed by simulating the coupled neural and synaptic dynamics, as described in Fusi et al. (2000), to obtain P_LTP and P_LTD. The extent to which the assumptions used are justified in the real network can be judged by comparing the theoretical estimates, equations 6.5 and 6.6, with the probabilities observed during the simulation: q_+^th = 0.017 versus q_+^sim = 0.023 ± 0.005, while for LTD we have q_-^th = 0.038 versus q_-^sim = 0.043 ± 0.009.

6.3 Estimating Lifetime of Synaptic Structure. The synaptic structure may be affected by synaptic transitions provoked by spontaneous or selective delay activity. This may cause progressive erasure of the stored memories in the absence of stimulation. The mean lifetime of the acquired synaptic structure depends on the probability of having such synaptic transitions. After stable WM states appeared, a separate set of simulations was carried out in order to check the stability of the synaptic structure against both spontaneous and selective delay activity. First, we checked the stability of the synaptic structure when the network exhibits spontaneous activity. Simulations as long as 60 seconds were carried out, with no stimulation.
There were no modifications in the synaptic matrix. Similarly, a WM state was elicited by presenting the corresponding stimulus for 150 ms, and the network then evolved freely for 4 seconds. The difference in duration with respect to the case of spontaneous activity is due to the shorter lifetime of the WM state, a consequence of finite-size effects. Again, there were no synaptic modifications. The probability of a synaptic transition during spontaneous activity is very small, because the average rate, about 1 Hz, is ≪ 1/τ_p. Thus, the number of repetitions needed to obtain a reasonable estimate of these probabilities becomes enormous. To obtain a rough estimate of q in spontaneous activity, we ran simulations to obtain P_LTP for neurons at a rate of 20 Hz and then rescaled this probability to rates of 1 Hz. To obtain the first, we ran 10⁶ repetitions of the single-synapse simulation, of T = 400 ms each, with ν_pre ≡ ν_post = 20 Hz. No transition occurred. The default estimate of the transition probability is then P_LTP(20, 20, T = 400) ∼ 10⁻⁶. In other words, the mean time one has to wait until the synapse makes a spontaneous transition (at 20 Hz) is about 4 days. Next, we approximately rescale this probability to 1 Hz.
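The rescaling developed in the following paragraphs rests on the fact that, for a Poisson process, the probability of an n-spike burst in a short window scales as ν^n. This can be checked directly; n = 3 and the 20 ms window are the values used below, and the factor e^{-νT'} ≤ 1 makes the ratio at most 20³:

```python
import math

def p_n_spikes(n, T, nu):
    """Probability of exactly n spikes of a Poisson process of rate nu
    (in Hz) within a window of T seconds."""
    lam = nu * T
    return math.exp(-lam) * lam ** n / math.factorial(n)

T = 0.020                      # 20 ms burst window
p20 = p_n_spikes(3, T, 20.0)   # three spikes at 20 Hz
p1 = p_n_spikes(3, T, 1.0)     # three spikes at 1 Hz

# For nu*T << 1 the exponential factors are near 1, so the ratio is
# bounded by (20/1)**3 = 8000, the factor used in the text.
ratio = p20 / p1
print(ratio, ratio <= 20 ** 3)
```

Dividing the measured P_LTP(20, 20, 400) by this factor gives the rough 1 Hz estimate used for the lifetime.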
For n up-jumps to provoke LTP within a time interval T′, one must have

n a - β T′ ≥ θ_X,   (6.7)
where a is the single up-jump amplitude and β is the synaptic refresh current (see equations 3.2 and 3.4). For fixed n, the longest time interval within which the presynaptic spikes must arrive and yet be able to produce LTP is T_max(n) = (n a - θ_X)/β. For the parameters of Table 1, three spikes, all provoking up-jumps, within 20 ms is the most probable burst that can produce LTP, considering the low emission rates of both pre- and postsynaptic neurons. Hence, to provoke the transition, such a burst must occur within the 400 milliseconds over which we measured the transition probabilities above. During spontaneous activity, the statistics of the emission process of the neurons is well approximated by a Poisson process (Amit & Brunel, 1997b); then

P_3(T′ = 20; ν = 20) ≤ 20³ · P_3(T′ = 20; ν = 1),

where P_n(T′; ν) is the probability that n presynaptic spikes occur within a time interval T′ when the emission rate is ν. Neglecting other factors, such as the dependence of the probability of an up-jump on the postsynaptic emission rate, we obtain

q_SA(400) = P_LTP(1, 1, 400) ∼ 20⁻³ P_LTP(20, 20, 400),

leading to a mean lifetime of the order of 10 years.

7 Discussion

The principal result of this study is a feasibility test. The spike-driven synaptic dynamics introduced in Fusi et al. (2000), for a suitable choice of the parameters, implements rate-dependent plasticity and exhibits both long-term potentiation and long-term homosynaptic depression under diverse experimental stimulation protocols.4 It is not to be excluded that a synaptic device in natural conditions behaves more like the synapse discussed here than as in the special protocols in which precise timing is observed. Moreover, starting from an unstructured synaptic state, the synaptic dynamics is able to drive the network into a structured state, sustaining selective delay activity, or working memory.
The generated synaptic structure is robust against spontaneous and delay activity in the absence of stimulation. The question of the stability of the acquired synaptic structure, and hence of the related neural activity, is thus also addressed. The appearance of WM states, following the repetitive presentation of a set of stimuli, is not simply a direct consequence of the rate-dependent

4 This synaptic dynamics does not generate long-term heterosynaptic depression.
plasticity implemented by the synaptic model. Several constraints must be met during the structuring stage:

• The effects of LTP should be adequately balanced by LTD, to avoid instabilities along the learning process. This balance between the probabilities of LTP and LTD imposes constraints on the synaptic and neural parameters. Further work is needed to relate such constraints to the neural and synaptic parameters in a simple form.

• The frequency distributions in the stimulated and unstimulated populations should not overlap significantly. Alternatively, the synapse should be highly sensitive to small variations of the frequencies in the overlap region (Del Giudice & Mattia, 2001). Indeed, this overlap could be a source of unwanted plasticity that in turn impairs the structuring process. We have found that recurrent inhibition can play a significant role in separating the distributions, reducing the high-rate tail of the frequency distribution of the unstimulated populations.

A significant by-product, mentioned here only briefly, is the resulting extension of the Hebbian plasticity rule. A realistic synaptic model would generate both LTP and LTD in any population of synapses. In the particular situation (parameters) considered in detail here, the difference between "right" and "wrong" transitions is so large that the classical Hebbian picture ensues. But when one deals with the emergent coupling between two neural populations whose mean rates are not so different (as between stimulus-driven cells and delay-activity cells), the extension of the Hebbian scenario promises saturation in structuring, which is an essential ingredient in maintaining independent representations alongside WM.

7.1 Back to Timing-Dependent Plasticity. The sharp cutoff in the time difference between the arrival of pre- and postsynaptic spikes, observed in experiment (Bi & Poo, 1998), could be a consequence of the experimental procedure.
In the model presented here, the two thresholds (see equation 3.4) can be chosen to reproduce the results of Markram et al. (1997) when the two neurons connected by the synapse operate in deterministic conditions. This may, in fact, be the case in the experiments mentioned. In a deterministic situation, one can consider the two neurons stimulated by a constant noiseless current. The evolution of the depolarization is deterministic. Thus, there is a relation between the value of V at a given instant and the time elapsed since the emission of the last spike. If one chooses the two synaptic thresholds θ_1 and θ_2 of equation 3.4 to be

θ_1 = θ - T_1 μ,   θ_2 = T_2 μ,   (7.1)
where μ is the total current, the following picture emerges. Upon the emission of a presynaptic spike, the postsynaptic neuron has V < θ_2 only if it
Spike-Driven Synaptic Dynamics Generating Working Memory States
emitted a spike at most T_2 before, and the synaptic activation causes downregulation. Similarly, the postsynaptic neuron has V > θ_1 only if the time of the presynaptic spike emission precedes the postsynaptic spike by at most T_1. It will then be followed by an upregulation of the synapse. If this is the case, one should conclude that up- and downregulation depend on the level of postsynaptic depolarization and the timing effect is only a consequence of the experimental setup, as argued above. Recent experimental findings seem to corroborate this hypothesis (Sjöström et al., 2001). Recently, Abbott and Song (1999) and Rubin et al. (2001) studied the behavior of a synaptic model into which the exact temporal relation, observed experimentally by Bi and Poo (1998), is built in. Efficacy is modified according to the temporal interval Δt = t_post − t_pre, where t_pre and t_post are the times of the pre- and postsynaptic spike emission. When Δt > 0, efficacy is upregulated, and efficacy is downregulated for Δt < 0. Such a synaptic mechanism tends to potentiate synapses connecting correlated neurons, for example, a presynaptic neuron that consistently fires before the postsynaptic one. On the other hand, presynaptic inputs that are not causally correlated with postsynaptic firing are weakened. The overall effect is the convergence of the synaptic distribution to an asymptotic stationary distribution. Depending on the update rule, the equilibrium distribution can be either unimodal or bimodal (Rubin et al., 2001). However, in both cases, the asymptotic distribution is largely insensitive to the firing rates of pre- and postsynaptic cells. Therefore, such a model of plastic synapse is unable to structure the synaptic matrix to sustain selective delay activity.

7.2 Open Issues.
The collective behavior of coupled homogeneous neural and synaptic populations has been described in a compact form, in terms of probabilities of potentiation P_LTP and depression P_LTD, functions of the pre- and postsynaptic emission rates. Here, these functions were obtained from the detailed model of the plastic synapse. It is not clear whether such a description could be more general, so as to be qualitatively independent of the detailed neural and synaptic dynamics. A priori, this seems plausible as a general consequence of the stochasticity of the neural activity. If so, we could characterize the properties of P_LTP and P_LTD allowing for successful structuring, regardless of the specific model of the synapse. Of course, the construction of a detailed synaptic model, which matches both available experimental data and theoretical desiderata, should be the natural conclusion of such a study. The constraints on the transition probabilities become tighter when the issue of network capacity is considered. If stimuli are sparsely coded (f ≪ 1) and the probabilities of transition are low (slow learning), a necessary condition to recover the optimal storage capacity is that the number of potentiations approximately balances the number of depressions in each presentation, requiring q₋ ∼ f q₊ (Amit & Fusi, 1994; Brunel, Carusi, & Fusi,
D. Amit and G. Mongillo
1998). This ensures that the difference between C_p in HH populations and C_p in HL populations, reached asymptotically, is maximal, depending on the number of stimuli to be learned. However, the appearance of WM states is not guaranteed. It depends not only on the level of structuring but also on the other network parameters. It remains to be seen whether the network exhibits WM activity in a biologically plausible range of parameters. The behavior of the synaptic device is manipulable. By choosing suitably the parameters a, b, α, β, and θ_X, the balance constraint may be satisfied. Simulations should be carried out to check the behavior of large networks at high loading levels. These require very efficient algorithms, such as that of Mattia and Del Giudice (2000). But little is guaranteed; the results reported in Amit and Fusi (1994) and Brunel et al. (1998) were obtained under a simplifying hypothesis: the neurons are binary, and the existence of the attractors is checked by using a signal-to-noise ratio analysis. A further step toward networks of spiking neurons was made by Herz and colleagues (for a review, see Herz, 1996). They studied the capacity of a network with analog neurons (no spiking) but with a symmetric connectivity matrix. The situation may be quite different when one deals with a recurrent network of spiking neurons, like that studied here.

Appendix: Mean Field for Nonoverlapping Populations

The mean-field analysis developed allows calculating the mean firing frequencies in each of the populations in stationary (asynchronous) states of the network as a function of the instantaneous synaptic structure and the level of the external signal. The instantaneous synaptic structuring is described by the fraction of potentiated synapses in the various synaptic populations. Plastic synapses are divided by an incoming stimulus into three homogeneous populations, according to pre- and postsynaptic activities:

- High pre- and postsynaptic activity.
It is expected that in this population, the fraction of potentiated synapses increases to C_p^HH (> C_p^0).

- High presynaptic and low postsynaptic activity. The fraction of potentiated synapses decreases to C_p^HL (< C_p^0).

- Low presynaptic activity. The probability of a synaptic transition is negligible; thus, the fraction of potentiated synapses is unaltered.

Following the presentation of a stimulus that has been previously "learned," the network divides into four functionally different populations of cells: cells belonging to the population activated by the stimulus (denoted sel); cells representing other stimuli, not activated (denoted +); cells not responsive to any stimulus (denoted bg); and interneurons (denoted I).
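The capacity condition q₋ ∼ f q₊ quoted in section 7.2 follows from simple bookkeeping over the synaptic populations just defined; a minimal sketch (the numerical values of f and q₊ are hypothetical, not taken from the article):

```python
# Sketch (not from the article): with coding level f << 1, a fraction f**2 of
# plastic synapses has both pre and post active (LTP candidates, prob. q_plus),
# and a fraction f*(1-f) has pre active, post inactive (LTD candidates, q_minus).
# Balancing expected potentiations and depressions per presentation gives
# q_minus = f*q_plus/(1-f) ~ f*q_plus for small f.
f = 0.05          # coding level (hypothetical value)
q_plus = 0.01     # LTP probability at an eligible synapse (hypothetical value)
q_minus = f * q_plus / (1.0 - f)      # chosen to balance exactly

n_pot = f**2 * q_plus                 # expected potentiations per synapse
n_dep = f * (1.0 - f) * q_minus       # expected depressions per synapse
assert abs(n_pot - n_dep) < 1e-15
print(q_minus, n_pot, n_dep)
```

For f ≪ 1 the correction 1/(1 − f) is negligible, recovering the stated condition q₋ ∼ f q₊.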
The mean firing rates in each of the four neural populations are given by mean-field equations (Amit & Brunel, 1997a),

    ν_sel = Φ(μ_sel, σ²_sel),  ν₊ = Φ(μ₊, σ²₊),  ν_bg = Φ(μ_bg, σ²_bg),  ν_I = Φ(μ_I, σ²_I).
The statistics of the afferent currents are calculated as functions of the synaptic structuring. The recurrent currents are

    μ_sel = −β I^(e) + C_E [ f J₊ ν_sel + f(p−1) J₋ ν₊ + (1−fp) J₀ ν_bg ] − C_I J_EI ν_I
    μ₊  = −β I^(e) + C_E [ f J₋ ν_sel + f (J₊ + (p−2) J₋) ν₊ + (1−fp) J₀ ν_bg ] − C_I J_EI ν_I
    μ_bg = −β I^(e) + C_E [ f J₋ ν_sel + f(p−1) J₋ ν₊ + (1−fp) J₀ ν_bg ] − C_I J_EI ν_I
    μ_I  = −β I^(i) + C_E J_IE [ f ν_sel + f(p−1) ν₊ + (1−fp) ν_bg ] − C_I J_II ν_I,    (A.1)

where

    J₊ = C_p^HH J_p + (1 − C_p^HH) J_d
    J₋ = C_p^HL J_p + (1 − C_p^HL) J_d
    J₀ = C_p^0 J_p + (1 − C_p^0) J_d.    (A.2)
The variances of the recurrent currents are

    σ²_sel = C_E [ f ΔJ₊ ν_sel + f(p−1) ΔJ₋ ν₊ + (1−fp) ΔJ₀ ν_bg ] + C_I J²_EI ν_I
    σ²₊  = C_E [ f ΔJ₋ ν_sel + f (ΔJ₊ + (p−2) ΔJ₋) ν₊ + (1−fp) ΔJ₀ ν_bg ] + C_I J²_EI ν_I
    σ²_bg = C_E [ f ΔJ₋ ν_sel + f(p−1) ΔJ₋ ν₊ + (1−fp) ΔJ₀ ν_bg ] + C_I J²_EI ν_I
    σ²_I  = C_E J²_IE [ f ν_sel + f(p−1) ν₊ + (1−fp) ν_bg ] + C_I J²_II ν_I,

where ΔJ₊ = C_p^HH J²_p + (1 − C_p^HH) J²_d, and analogously for ΔJ₋ and ΔJ₀. The external currents depend on the level of the external signal:

    μ^(ext)_sel = g_e C_ext J_ext ν_ext,    σ²^(ext)_sel = g_e C_ext J²_ext ν_ext
    μ^(ext)₊ ≡ μ^(ext)_bg = C_ext J_ext ν_ext,    σ²^(ext)₊ ≡ σ²^(ext)_bg = C_ext J²_ext ν_ext
    μ^(ext)_I = g_i C_ext J_ext ν_ext,    σ²^(ext)_I = g_i C_ext J²_ext ν_ext.
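As a concrete illustration, the mean recurrent currents of equations A.1 and A.2 can be evaluated directly; every numerical value below is a hypothetical placeholder, not a parameter set from the article:

```python
# Sketch of the mean recurrent currents of equation A.1, with the structured
# efficacies of equation A.2. All numerical values are hypothetical placeholders.
f, p = 0.05, 10              # coding level, number of stored stimuli
CE, CI = 1000, 250           # numbers of excitatory / inhibitory afferents
Jp, Jd = 0.5, 0.1            # potentiated / depressed efficacies
J_EI, J_IE, J_II = 0.2, 0.15, 0.2
beta_Ie = beta_Ii = 0.0      # the -beta*I terms of equation A.1
Cp_HH, Cp_HL, Cp0 = 0.9, 0.05, 0.3   # fractions of potentiated synapses

# Equation A.2: structured mean efficacies
J_plus  = Cp_HH * Jp + (1 - Cp_HH) * Jd
J_minus = Cp_HL * Jp + (1 - Cp_HL) * Jd
J_zero  = Cp0   * Jp + (1 - Cp0)   * Jd

def currents(nu_sel, nu_plus, nu_bg, nu_I):
    """Mean recurrent currents (equation A.1) for given population rates."""
    common_I = CI * J_EI * nu_I
    mu_sel = -beta_Ie + CE * (f*J_plus*nu_sel + f*(p-1)*J_minus*nu_plus
                              + (1 - f*p)*J_zero*nu_bg) - common_I
    mu_plus = -beta_Ie + CE * (f*J_minus*nu_sel
                               + f*(J_plus + (p-2)*J_minus)*nu_plus
                               + (1 - f*p)*J_zero*nu_bg) - common_I
    mu_bg = -beta_Ie + CE * (f*J_minus*nu_sel + f*(p-1)*J_minus*nu_plus
                             + (1 - f*p)*J_zero*nu_bg) - common_I
    mu_I = -beta_Ii + CE * J_IE * (f*nu_sel + f*(p-1)*nu_plus
                                   + (1 - f*p)*nu_bg) - CI*J_II*nu_I
    return mu_sel, mu_plus, mu_bg, mu_I

print(currents(0.040, 0.003, 0.003, 0.010))  # rates in spikes/ms, hypothetical
```

Since J₊ > J₋ after structuring, the selective population receives the largest recurrent drive, which is what makes a delay-activity fixed point possible.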
In a steady state, the density function of V will also be stationary. It is given by equation 4.2. Both the mean and the variance of the afferent
currents are linear functions of ν; hence, P(V) is parameterized only by the mean emission rates. In other words, there is a one-to-one correspondence between the emission rates and the distribution of the depolarization. Here we use the theory to determine (1) the range of C_p^HH and C_p^HL for which the network is able to sustain selective delay activity states, (2) the mean emission rate of stimulated populations as a function of the instantaneous level of synaptic structuring, and (3) to compare simulation results in stationary states with theory.

Acknowledgments

We thank Stefano Fusi for having brought to our attention Sjöström et al. (2001). This study was supported by a Center of Excellence grant, "Changing Your Mind," from the Israel Science Foundation and by a Center of Excellence grant, "Statistical Mechanics and Complexity," of the INFM, Rome.

References

Abbott, L. F., & Song, S. (1999). Temporally asymmetric Hebbian learning and neuronal response variability. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Amit, D. J. (1995). The Hebbian paradigm reintegrated: Local reverberations as internal representations. Behavioral and Brain Sciences, 18, 617–626.
Amit, D. J. (1998). Simulation in neurobiology—Theory or experiment? TINS, 21, 231–237.
Amit, D. J., & Brunel, N. (1997a). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Brunel, N. (1997b). Dynamics of a recurrent network of spiking neurons before and following learning. Network: Computation in Neural Systems, 8, 373–404.
Amit, D. J., & Fusi, S. (1994). Learning in neural networks with material synapses. Neural Computation, 6, 957–982.
Bear, M. F. (1996). A synaptic basis for memory storage in the cerebral cortex. Proc. Natl. Acad. Sci. USA, 93, 13453–13459.
Bi, G. Q., & Poo, M. M. (1998).
Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength and cell type. Journal of Neuroscience, 18, 10464–10472.
Brunel, N. (1996). Hebbian learning of context in recurrent neural networks. Neural Computation, 8, 1677–1710.
Brunel, N. (2000). Persistent activity and the single cell f-I curve in a cortical network model. Network, 11, 261–280.
Brunel, N., Carusi, F., & Fusi, S. (1998). Slow stochastic Hebbian learning of classes of stimuli in a recurrent neural network. Network, 9, 123–152.
Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X. J. (2000). Synaptic mechanisms and neural dynamics underlying spatial working memory in a cortical network model. Cerebral Cortex, 10, 910–923.
Del Giudice, P., & Mattia, M. (2001). Long and short-term synaptic plasticity and the formation of working memory: A case study. Neurocomputing, 38, 1175–1180.
Dudek, S. M., & Bear, M. F. (1992). Homosynaptic long-term depression in area CA1 of hippocampus and effects of NMDA receptor blockade. Proc. Natl. Acad. Sci. USA, 89, 4363–4367.
Dudek, S. M., & Bear, M. F. (1993). Bidirectional long-term modification of synaptic effectiveness in the adult and immature hippocampus. Journal of Neuroscience, 13, 2910–2918.
Erickson, C. A., & Desimone, R. (1999). Responses of macaque perirhinal neurons during and after visual stimulus association learning. Journal of Neuroscience, 23, 10404–10416.
Fusi, S., & Mattia, M. (1999). Collective behavior of networks with linear (VLSI) integrate-and-fire neurons. Neural Computation, 11, 643–652.
Fusi, S., Annunziato, M., Badoni, D., Salamon, A., & Amit, D. J. (2000). Spike-driven synaptic plasticity: Theory, simulation, VLSI implementation. Neural Computation, 12, 2227–2258.
Herz, A. V. M. (1996). Global analysis of recurrent neural networks. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks III. New York: Springer-Verlag.
Kritzer, M. F., & Goldman-Rakic, P. S. (1995). Intrinsic circuit organization of the major layers and sublayers of the dorsolateral prefrontal cortex in the rhesus monkey. Journal of Comparative Neurology, 359, 131–143.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Computation, 12, 2305–2329.
Miller, E. K., Erickson, C., & Desimone, R. (1996).
Neural mechanisms of working memory in prefrontal cortex of macaque. Journal of Neuroscience, 16, 5154–5167.
Miyashita, Y. (1988). Neural correlate of visual associative long-term memory in the primate temporal cortex. Nature, 335, 817–820.
Mongillo, G., & Amit, D. J. (2001a). Spike driven synaptic dynamics—a self-saturating Hebbian paradigm. Neural Plasticity, 8, 188.
Mongillo, G., & Amit, D. J. (2001b). Oscillations and irregular emission in networks of linear spiking neurons. Journal of Computational Neuroscience, 11, 249–261.
Petersen, C. C. H., Malenka, R. C., Nicoll, R. A., & Hopfield, J. J. (1998). All-or-none potentiation at CA3-CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732–4737.
Rauch, A., La Camera, G., Lüscher, H. R., Senn, W., & Fusi, S. (2002). Neocortical pyramidal cells respond as integrate-and-fire neurons to in vivo–like input currents. Manuscript submitted for publication.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Physical Review Letters, 86, 364–366.
Sakai, K., & Miyashita, Y. (1991). Neural organization for the long-term memory of paired associates. Nature, 354, 152–155.
Sjöström, P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Cooperativity and frequency dependence of spike timing–dependent plasticity in visual cortical layer-5 pairs. In Proceedings of the 31st SFN Meeting. San Diego.
Steele, P. M., & Mauk, M. D. (1999). Inhibitory control of LTP and LTD: Stability of synapse strength. Journal of Neurophysiology, 81, 1559–1566.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
van Vreeswijk, C. A., & Sompolinsky, H. (1996). Chaos in neural networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.

Received December 13, 2001; accepted August 2, 2002.
LETTER
Communicated by David Horn
A Stochastic Method to Predict the Consequence of Arbitrary Forms of Spike-Timing-Dependent Plasticity

Hideyuki Câteau
[email protected]
Core Research for the Evolutional Science and Technology Program, JST, Tokyo 194-8610, Japan
Tomoki Fukai
[email protected]
Department of Engineering, Tamagawa University and CREST, JST, Tokyo 194-8610, Japan
Synapses in various neural preparations exhibit spike-timing-dependent plasticity (STDP) with a variety of learning window functions. The window functions determine the magnitude and the polarity of synaptic change according to the time difference of pre- and postsynaptic spikes. Numerical experiments revealed that STDP learning with a single-exponential window function resulted in a bimodal distribution of synaptic conductances as a consequence of competition between synapses. A slightly modified window function, however, resulted in a unimodal distribution rather than a bimodal distribution. Since various window functions have been observed in neural preparations, we develop a rigorous mathematical method to calculate the conductance distribution for any given window function. Our method is based on the Fokker-Planck equation to determine the conductance distribution and on the Ornstein-Uhlenbeck process to characterize the membrane potential fluctuations. Demonstrating that our method reproduces the known quantitative results of STDP learning, we apply the method to the type of STDP learning found recently in the CA1 region of the rat hippocampus. We find that this learning can result in nearly optimized competition between synapses. Meanwhile, we find that the type of STDP learning found in the cerebellum-like structure of electric fish can result in all-or-none synapses: either all the synaptic conductances are maximized, or none of them becomes significantly large. Our method also determines the window function that optimizes synaptic competition.

1 Introduction

Plasticity that depends on the interval t = t_post − t_pre between pre- and postsynaptic spikes has been observed in layer II/III or V pyramidal neurons, spiny stellate cells of neocortex, Xenopus tectum, excitatory and inhibitory cells of
Neural Computation 15, 597–620 (2003)
© 2003 Massachusetts Institute of Technology
H. Caˆ teau and T. Fukai
the hippocampus, and the cerebellum-like structure of electric fish (Abbott & Nelson, 2000; Levy & Steward, 1983; Gustafsson, Wigström, Abraham, & Huang, 1987; Debanne, Gähwiler, & Thompson, 1994, 1998; Bell, Han, Sugawara, & Grant, 1997; Markram, Lübke, Frotscher, & Sakmann, 1997; Magee & Johnston, 1997; Bi & Poo, 1998; Zhang, Tao, Holt, Harris, & Poo, 1998; Egger, Feldmeyer, & Sakmann, 1999; Feldman, 2000; Nishiyama, Hong, Mikoshiba, Poo, & Kato, 2000). This form of plasticity is termed spike-timing-dependent plasticity (STDP). The window function, for instance,

    G(t) = {  A₊ e^(−α₊ t)     (t ≥ 0),
           { −A₋ e^(−α₋ |t|)   (t < 0),    (1.1)

specifies that a synapse is potentiated by A₊ e^(−α₊ t) when a postsynaptic spike follows a presynaptic spike (t_post ≥ t_pre) and is depressed by −A₋ e^(−α₋ |t|) when a postsynaptic spike precedes a presynaptic spike (t_post < t_pre). Using this window function, Song, Miller, and Abbott (2000) simulated STDP learning of excitatory synaptic conductances mediating a thousand inputs to one model neuron (a leaky integrator). They examined how excitatory synaptic conductances on a model neuron changed due to STDP learning evoked by Poissonian inputs under the condition that the total depression is slightly larger than the total potentiation (i.e., A₋/α₋ > A₊/α₊). As a result, they found that the final distribution P_∞(w) of synaptic conductances w was bimodal, with conductances concentrated around either zero or the maximal value. Since synaptic conductances are divided into "winners" and "losers," we regard this phenomenon as synaptic competition. On the other hand, van Rossum, Bi, and Turrigiano (2000) and Rubin, Lee, and Sompolinsky (2001) examined STDP learning using a single-exponential window function that employed a multiplicative update rule w → w(1 + Δw) instead of the additive update rule w → w + Δw assumed in Song et al. (2000).
Noting that this update rule is equivalent to replacing the window function G(t) by wG(t) (Kistler & van Hemmen, 2000; van Rossum et al., 2000; Rubin et al., 2001; Gütig, Aharonov, Rotter, & Sompolinsky, 2002), they showed that multiplicative learning led to a gaussian conductance distribution, representing no synaptic competition. These theoretical studies of STDP learning were conducted using Fokker-Planck equations (Risken, 1996), which generally describe how the distribution of stochastically changing variables (the conductance, for STDP) evolves in time. To determine a critical coefficient of the Fokker-Planck equation, the drift term, van Rossum et al. (2000) simplified the calculations by adopting a piecewise linear neuronal response. Rubin et al. (2001) simplified the calculations by assuming a reasonable functional form for the correlation between pre- and postsynaptic spikes. The Fokker-Planck equations thus approximated clarified the qualitative differences between additive and multiplicative STDP learning. Levy, Horn, Meilijson, and Ruppin (2001) modeled STDP learning by introducing a relaxation dynamics of w
Stochastic Method to Analyze STDP
in addition to the usual synaptic update. They constructed the corresponding Fokker-Planck equation and derived an exponential distribution in the no-relaxation limit. They also conducted simulations of mutually connected neurons in their STDP learning framework and obtained bimodal conductance distributions for window functions whose total depression is smaller than the total potentiation. Similar numerical simulations of a network of Hodgkin-Huxley neurons were conducted without the synaptic relaxation dynamics (Kitano, Câteau, & Fukai, 2002). In the simulations, bimodal conductance distributions were obtained when the total depression is slightly larger than the total potentiation, as was the case in Song et al. (2000). In this study, we develop an analytical method to calculate the final distribution P_∞(w) for an arbitrary window function G(t). Such a general framework is useful for analyzing the learning procedure for the many different window functions observed in experiments (Abbott & Nelson, 2000). To describe the evolution of synaptic conductances, we adopt the Fokker-Planck equation (van Rossum et al., 2000; Rubin et al., 2001; Levy et al., 2001). To evaluate the drift term analytically, we take the nonlinearity of neural spiking into account by using the Ornstein-Uhlenbeck process for the membrane fluctuations (Tuckwell, 1988; Risken, 1996; Câteau & Fukai, 2001). With this careful evaluation of the drift term, which has not previously been attempted, we derive P_∞(w) as an explicit function of model parameters, such as the size and the width of the window function and the firing threshold. Our method allows us to reproduce the previously known quantitative characteristics of STDP learning (Song et al., 2000), such as the linear relationship between pre- and postsynaptic firing rates and the characteristic dependencies of the postsynaptic firing rate and coefficient of variation on the amplitude ratio A₋/A₊.
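For concreteness, the single-exponential window of equation 1.1 and its total area can be coded directly; the parameter values follow the Song et al. setting quoted later in Figure 2 (A₊ = 0.005, A₋ = 0.00525, α₊ = α₋ = 1), with time in units of τ_m:

```python
import math

# Exponential STDP window of equation 1.1, time in units of tau_m.
A_plus, A_minus = 0.005, 0.00525      # values from the Song et al. setting
alpha_plus = alpha_minus = 1.0

def G(t):
    """Synaptic change for a pre/post pair separated by t = t_post - t_pre."""
    if t >= 0:
        return A_plus * math.exp(-alpha_plus * t)       # causal: potentiation
    return -A_minus * math.exp(-alpha_minus * abs(t))   # acausal: depression

# Total area S0 = int G(t) dt over all t; for this window it is
# A_plus/alpha_plus - A_minus/alpha_minus, negative and small here.
S0 = A_plus / alpha_plus - A_minus / alpha_minus
print(S0)  # -0.00025
```

This slightly negative S₀ is exactly the regime in which the analysis below predicts synaptic competition.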
Our method also enables us to derive a corresponding formula for an arbitrary window function. Our method proves quite generally that synaptic competition occurs only when the total depression is slightly larger than the total potentiation, that is, when the total area S₀ = ∫_{−∞}^{∞} G(t) dt is negative and small. This criterion is valid if the mean postsynaptic spike interval T is greater than the width of the window function (moderate firing condition) and the synaptic update is additive. Note that since the criterion is derived for Poissonian input spike trains, it does not necessarily apply to network simulations. Our method also proves that synaptic conductances exhibit either a gaussian distribution or a saturated distribution, in which conductances are concentrated around the maximal value, when the total depression is smaller than the total potentiation (i.e., S₀ > 0). Thus, the total area S₀ plays the most crucial role in determining P_∞(w). If the moderate firing condition is not obeyed, the total area should be replaced with the restricted area ∫_{−T}^{T} G(t) dt in the above statements. To demonstrate the wide applicability of our method, we apply our method to two specific window functions that are similar to those observed
in the rat hippocampal CA1 slice preparation (Nishiyama et al., 2000) and in the cerebellum-like structure of the electric fish (Bell et al., 1997). Our method predicts that the CA1-type learning (i.e., the window function has depressing effects also in the causal domain, t_post > t_pre) results in almost maximal competition between synapses when the total depression is slightly larger than the total potentiation. Our method also predicts that the electric fish–type learning can induce an all-or-none development of synaptic conductances when the total depression is slightly smaller than the total potentiation. The results of our analyses can be summarized as the following general properties of additive STDP learning:

- The correlation between pre- and postsynaptic spikes increases only through STDP learning that leads to efficient synaptic competition.

- A depression component in the late causal domain of the window function enhances synaptic competition.

- Synaptic competition is optimal when the causal part of the window function has the same shape as the nonaccidental correlation function of pre- and postsynaptic spikes, C(t).

2 The Fokker-Planck Equation for the Conductance Distribution

This section explains how the Fokker-Planck equation is used to elucidate the synaptic conductance distribution and how to derive the Fokker-Planck equation corresponding to any given STDP learning episode. This article focuses on additive STDP learning, although our formulation also applies to multiplicative STDP learning when an appropriate w-dependence of the window function is included. Throughout, we consider STDP learning of excitatory synapses assuming nonplastic inhibition. As discussed later, the extension of our formalism to STDP learning of inhibitory synapses is straightforward. We assume that a neuron receives synaptic inputs from thousands of excitatory presynaptic neurons. Accordingly, the neuron occasionally fires a spike.
The stochastic changes in the thousands of synaptic conductances w are described by the following Fokker-Planck equation for the conductance distribution P(w, t) (Tuckwell, 1988; Risken, 1996):

    ∂P(w, t)/∂t = −(∂/∂w)[A(w)P(w, t)] + (1/2)(∂²/∂w²)[B(w)P(w, t)].    (2.1)

Here, time is measured in units of the membrane time constant (i.e., t = time/τ_m). The drift term and the diffusion term are calculated as A(w) = ∫ dr r T̄(w, r) and B(w) = ∫ dr r² T̄(w, r), respectively, from the transition probability T̄(w, r) that the synaptic update w → w + r occurs. As in Song et al. (2000), we assume that the synaptic conductance is confined to 0 ≤ w ≤ w_max: if a synaptic update makes w go beyond these boundary values, w is reset to zero or w_max. This
is equivalent to solving equation 2.1 with a reflecting boundary condition, that is, J(0, t) = J(w_max, t) = 0, where the probability flux is

    J(w, t) = A(w)P(w, t) − (1/2)(∂/∂w)[B(w)P(w, t)].    (2.2)
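A minimal numerical sketch of equation 2.1 with the zero-flux boundaries of equation 2.2, using an explicit finite-difference scheme; the drift and diffusion functions passed in at the bottom are simple placeholders, not the expressions derived later in the article:

```python
import numpy as np

# Explicit finite-difference integration of the Fokker-Planck equation 2.1,
# dP/dt = -d(A P)/dw + (1/2) d^2(B P)/dw^2, written in flux form (eq. 2.2)
# so that the reflecting boundaries J(0) = J(w_max) = 0 conserve probability.
def evolve_fp(A, B, n=200, w_max=1.0, dt=1e-4, steps=5000):
    w = np.linspace(0.0, w_max, n + 1)
    dw = w[1] - w[0]
    P = np.exp(-0.5 * ((w - 0.8) / 0.1) ** 2)   # initial gaussian bump
    P /= np.trapz(P, w)
    a, b = A(w), B(w)
    for _ in range(steps):
        BP = b * P
        # flux J = A P - (1/2) d(B P)/dw, evaluated at cell interfaces
        J = 0.5 * (a[:-1] * P[:-1] + a[1:] * P[1:]) - np.diff(BP) / (2 * dw)
        J = np.concatenate(([0.0], J, [0.0]))   # zero flux at both boundaries
        P = P - dt * np.diff(J) / dw
    return w, P

# Placeholder linear drift/diffusion, chosen only to exercise the scheme.
w, P = evolve_fp(A=lambda x: -0.05 + 0.2 * x, B=lambda x: 0.01 + 0.01 * x)
print(P.sum() * (w[1] - w[0]))   # total probability stays close to 1
```

Because the update is written in terms of interface fluxes with the boundary fluxes pinned to zero, the discrete total probability is conserved exactly, mirroring the role of the reflecting boundary condition in the article.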
Hereafter, we set w_max = 1 and measure the potential v in units of the maximal conductance. Since the transition amplitude r is related to the time interval t = t_post − t_pre via the window function G(t), A(w) and B(w) can be reexpressed as follows:

    A(w) = ∫_{−∞}^{∞} dt G(t) T(w, t)  and  B(w) = ∫_{−∞}^{∞} dt G(t)² T(w, t).    (2.3)

Here, T(w, t) = T̄(w, G(t)) G′(t) is the probability that pre- and postsynaptic spikes are paired with the interval t. The Fokker-Planck equation depends on the shape of the window function G(t), as indicated in equation 2.3. To define STDP learning precisely, we must specify the rule by which each presynaptic spike is paired with individual spikes in a postsynaptic spike train. In Song et al. (2000), each presynaptic spike was paired with every postsynaptic spike, even in cases in which the postsynaptic spikes were separated by arbitrarily large intervals. In this article, however, to make the analytical study more tractable, each presynaptic spike is paired only with the last and the next postsynaptic spikes. This rule is based on the assumption that a postsynaptic spike "resets" the internal state of a neuron almost completely. We must await future experiments to reveal which pairing rule is more relevant to actual synapses. We note that the two rules are essentially the same if the postsynaptic interspike interval (ISI) is larger than the width of the window function (the time difference for which G(t) has a significant size). In this case, pairings beyond the postsynaptic ISI contribute negligibly to each synaptic update. We call this condition the moderate firing condition because the postsynaptic firing rate cannot be too high for the condition to be met. We carry out all the calculations, except some in appendix A, under this moderate firing condition. The activity regulation inherent in STDP learning (Song et al., 2000) tends to achieve a moderate postsynaptic firing rate at the late stage of learning, validating our analysis of the steady state based on the moderate firing condition. Now we describe T(w, t): the probability of having pre- and postsynaptic spikes with interval t. We denote the presynaptic (input) and postsynaptic (output) firing rates as λ_in and λ_out, respectively.
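The nearest-spike pairing rule described above (each presynaptic spike paired only with the last and the next postsynaptic spikes) can be sketched directly; the window parameters and the spike trains in the usage line are illustrative assumptions:

```python
import bisect
import math

# Additive STDP update under the nearest-spike pairing rule: each presynaptic
# spike is paired only with the last and the next postsynaptic spikes.
def G(t, A_plus=0.005, A_minus=0.00525):
    # single-exponential window (equation 1.1), time in units of tau_m
    return A_plus * math.exp(-t) if t >= 0 else -A_minus * math.exp(-abs(t))

def stdp_update(w, pre_spikes, post_spikes, w_max=1.0):
    post = sorted(post_spikes)
    for t_pre in pre_spikes:
        i = bisect.bisect_left(post, t_pre)
        if i > 0:                        # last postsynaptic spike (t_post < t_pre)
            w += G(post[i - 1] - t_pre)  # acausal pairing -> depression
        if i < len(post):                # next postsynaptic spike (t_post >= t_pre)
            w += G(post[i] - t_pre)      # causal pairing -> potentiation
    return min(max(w, 0.0), w_max)       # conductance confined to [0, w_max]

w = stdp_update(0.5, pre_spikes=[1.0, 4.0], post_spikes=[0.5, 1.2, 5.0])
print(w)
```

Pairing with every postsynaptic spike, as in Song et al. (2000), would instead loop over the whole train; under the moderate firing condition the two rules give nearly identical updates because the extra pairings fall in the tails of G(t).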
In the following argument, it is understood that a presynaptic spike occurs at t = 0, and a postsynaptic spike occurs at time t. In the acausal domain (t < 0), pre- and postsynaptic spikes can be regarded as independent events. Hence, T(w, t) is equal to
an accidental correlation resulting from the product of the pre- and postsynaptic firing rates. In the causal domain (t > 0), in which pre- and postsynaptic spikes are causally related, the correlation acquires a small additional term proportional to C(t) as follows (see appendix A for detailed calculations):

    T(w, t) = { λ_in λ_out               (t < 0),
              { λ_in λ_out (1 + wC(t))   (t ≥ 0).    (2.4)
The nonaccidental correlation C(t) represents the effect that a single presynaptic spike at an arbitrarily chosen synapse advances a postsynaptic spike. We can calculate this effect by regarding the other presynaptic spikes as background. Hence, we obtain

    C(t) = −(1/λ_out) ∫_{−∞}^{θ} dv φ′(v) f(v, θ, t)    (2.5)
(see appendix A for the derivations). Here, φ(v) is the probability of finding the membrane potential at v, and f(v, θ, t) is the first passage time distribution—the probability of the membrane potential going from v to the threshold θ in time t (Tuckwell, 1988; Risken, 1996). For an integrate-and-fire neuron, the fluctuations can be characterized by a stochastic process called the Ornstein-Uhlenbeck process—Brownian motion with leak effects—which depends on the mean (m) and the variance (s²) of the total synaptic input to the neuron (Tuckwell, 1988). For the Ornstein-Uhlenbeck process, φ(v) has been calculated analytically (Brunel, Chance, Fourcaud, & Abbott, 2001), and an efficient algorithm for numerically calculating f(v, θ, t) has been developed (Plesser & Tanaka, 1997). Thus, C(t) can be numerically evaluated as in Figure 1. Since a presynaptic spike at the referenced excitatory synapse always advances a postsynaptic spike, the probability of having a postsynaptic spike is enhanced in the neighborhood of t = 0 (C(t) > 0); this enhancement must be compensated by a decrease in the probability for large t (C(t) < 0). This compensation is implied by the probability conservation ∫_0^∞ C(t) dt = 0, which can be shown directly from equation 2.5. Note that |C(t)| ≪ 1 in almost every instance except in the neighborhood of t = 0, for C(t) represents the impact of just a single presynaptic spike on the postsynaptic activity. From equations 2.3 and 2.4, we can express the drift and diffusion terms for an arbitrary window function as follows:

    A(w)/(λ_in λ_out) = ∫_{−∞}^{∞} dt G(t) + w ∫_0^∞ dt G(t) C(t),    (2.6)

    B(w)/(λ_in λ_out) = ∫_{−∞}^{∞} dt G(t)² + w ∫_0^∞ dt G(t)² C(t).    (2.7)
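Equations 2.6 and 2.7 can be evaluated by straightforward quadrature once G(t) and C(t) are supplied. The C(t) used below is a crude placeholder that merely respects the conservation constraint ∫_0^∞ C(t) dt = 0, not the first-passage-time expression of equation 2.5:

```python
import numpy as np

# Drift (2.6) and diffusion (2.7) terms by numerical quadrature. The window is
# the single-exponential one of equation 1.1; C(t) is a toy placeholder with
# zero total area, standing in for the first-passage-time result of eq. 2.5.
A_plus, A_minus = 0.005, 0.00525

def G(t):
    return np.where(t >= 0, A_plus * np.exp(-t), -A_minus * np.exp(-np.abs(t)))

def C(t):
    # positive near t = 0, negative tail, int_0^inf C(t) dt = 0
    return 2.0 * np.exp(-4.0 * t) - 0.5 * np.exp(-t)

t_all = np.linspace(-50, 50, 40001)
t_pos = np.linspace(0, 50, 20001)
S0 = np.trapz(G(t_all), t_all)                 # total window area
S2 = np.trapz(G(t_all) ** 2, t_all)
I1 = np.trapz(G(t_pos) * C(t_pos), t_pos)      # causal overlap of G and C
I2 = np.trapz(G(t_pos) ** 2 * C(t_pos), t_pos)

def A(w):  # equation 2.6, up to the factor lambda_in * lambda_out
    return S0 + w * I1

def B(w):  # equation 2.7, up to the factor lambda_in * lambda_out
    return S2 + w * I2

print(A(0.0), A(1.0))  # drift is negative at w = 0 and positive at w = 1
```

With S₀ slightly negative and the causal overlap I₁ positive, the drift pushes small conductances toward 0 and large ones toward 1, which is the competitive regime discussed in the text.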
[Figure 1: curve of C(t) (vertical axis, ticks 0.05–0.2) versus time/τ_m (horizontal axis, ticks 1–5).]

Figure 1: The plot of the nonaccidental correlation C(t) between pre- and postsynaptic spikes versus t = time/τ_m. Given this function, the Fokker-Planck equation corresponding to any given STDP learning can be calculated. The normalized form C₀(t) = sC(t)/√2 depends on only two parameters, x₀ = √2(v₀ − m)/s and x₁ = √2(θ − m)/s, where v₀ is the reset potential after firing. We set x₀ = −1, x₁ = 1.5, and s = 7 here, which are typical values when synapse competition occurs.
Here, the higher-order terms in the w-expansion were dropped since C(t) is small, as explained above. (Since we can show that C(t) ∼ 1/√t in the neighborhood of t = 0, the contributions of this region to the integrations are negligible.) The stationary synaptic distribution P_∞(w), which is reached by integrating the Fokker-Planck equation up to t = ∞, can be obtained by setting the right-hand side of equation 2.1 equal to zero: ∂P(w, t)/∂t = 0. Thus, from the above expressions, the final synaptic distribution is given as

    P_∞(w) = N₀ ( e^w (1 + βw)^(−1/β − w₀) )^γ,    (2.8)

with

    w₀ = −∫_{−∞}^{∞} dt G(t) / ∫_0^∞ dt C(t)G(t),

    β = ∫_0^∞ dt C(t)G(t)² / ∫_{−∞}^{∞} dt G(t)²,

    γ = ∫_0^∞ dt C(t)G(t) / ∫_0^∞ dt C(t)G(t)².    (2.9)
604
H. Caˆ teau and T. Fukai
(see appendix B for details). Therefore, we must solve the self-consistency equations given as m D °1 .m; s2 / and s2 D °2 .m; s2 /. 3 Analytical Expressions of the Synaptic Distributions for Single-Exponential Window Functions If the causal part of the window function is given by a single-exponential function, all the integrations dening the drift and diffusion terms in equation 2.9 reduce to the Laplace transform of C.t/, which can be expressed by R1 the Laplace transform fO.v; µ; ®/ D 0 dt e¡®t f .v; µ; t/ through equation 2.5. Unlike f .v; µ; t/ itself, fO.v; µ; ®/ has been analytically calculated (Tuckwell, 1988; Risken, 1996). Thus, the drift and diffusion terms can be analytically calculated as A¡ AC C C A.w/=.¸in ¸out / D ¡ ®¡ ®C A2¡ A2 C C C B.w/=.¸in ¸out / D 2®¡ 2®C
p p
2AC h.x1 ; ®C / w; s.®C C 1/
2A2C h.x1 ; 2®C / w; s.2®C C 1/
(3.1) (3.2)
where

$$h(x_1, \alpha) = \int_0^{\infty} dy\, y^{\alpha} e^{-(y-x_1)^2/2} \Big/ \int_0^{\infty} dy\, y^{\alpha-1} e^{-(y-x_1)^2/2}$$

(see appendix C). This quantity depends on α and x₁ = √2(θ − m)/s, which is a normalized distance between m and the threshold θ. The window function studied in Song et al. (2000) corresponds to α₊ = τ_m/τ₊ = 1 and α₋ = τ_m/τ₋ = 1, and in this case, the final distribution of synapses takes a particularly simple form:

$$P_\infty(w) = n_0 \exp\bigl(a(w - w_0)^2\bigr), \tag{3.3}$$
with

$$a = \frac{2A_+}{A_+^2 + A_-^2}\left(\frac{\theta - m}{s^2} + \frac{2\,e^{-(\theta-m)^2/s^2}}{s\sqrt{\pi}\,\bigl(1 + \mathrm{erf}((\theta - m)/s)\bigr)}\right),$$

$$a w_0 = 4(A_- - A_+)/(A_+^2 + A_-^2).$$
In equation 3.3, the exponent of P_∞(w) was obtained up to O(w²). This expression is informative because the dependencies of the final distribution on each model parameter are manifest. From equation 3.3, P_∞(w) is expected to show two sharp peaks at w = 0 and w = 1 if 0 < w₀ < 1. For the STDP window function adopted in Song et al. (2000), the temporal evolution of the conductance distribution starting with a gaussian distribution is shown in Figure 2 by numerically integrating the Fokker-Planck equation defined by equations 2.1, 3.1, and 3.2. As expected from analytical expression 3.3, the final distribution of synapses exhibits two sharp peaks.
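For the single-exponential window, equations 3.1 and 3.2 (as we read them) make the drift and diffusion explicit linear functions of w. A minimal sketch with the Figure 1 and 2 parameters, assuming λ_in λ_out = 1 (this factor only rescales time): the drift is negative at w = 0 and has a positive slope, so it crosses the w-axis inside (0, 1), which is the flow structure behind the two peaks.

```python
import numpy as np
from scipy.integrate import quad

# Parameters as in Figures 1 and 2 of the text; the overall factor
# lambda_in * lambda_out is set to 1 (an assumption; it rescales time only).
A_plus, A_minus, a_plus, a_minus = 0.005, 0.00525, 1.0, 1.0
x1, s = 1.5, 7.0

def h(x1, a):
    """h(x1, alpha) of equation 3.1: ratio of two truncated-Gaussian moments."""
    num, _ = quad(lambda y: y**a * np.exp(-(y - x1)**2 / 2), 0, np.inf)
    den, _ = quad(lambda y: y**(a - 1) * np.exp(-(y - x1)**2 / 2), 0, np.inf)
    return num / den

def A(w):   # drift term, equation 3.1
    return (-A_minus / a_minus + A_plus / a_plus
            + np.sqrt(2) * A_plus * h(x1, a_plus) / (s * (a_plus + 1)) * w)

def B(w):   # diffusion term, equation 3.2
    return (A_minus**2 / (2 * a_minus) + A_plus**2 / (2 * a_plus)
            + np.sqrt(2) * A_plus**2 * h(x1, 2 * a_plus) / (s * (2 * a_plus + 1)) * w)

# A(0) < 0 and the slope is positive, so A crosses zero at w0 in (0, 1):
# the flow is toward 0 below w0 and toward 1 above it, hence bimodality.
slope = np.sqrt(2) * A_plus * h(x1, a_plus) / (s * (a_plus + 1))
w0 = (A_minus / a_minus - A_plus / a_plus) / slope
```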
Stochastic Method to Analyze STDP
605
Figure 2: Temporal evolution of the conductance distribution obtained by numerical integration of the Fokker-Planck equation (axes: w, time/τ_m, and P(w, t)). The window function used here is the same as that used in Song et al. (2000, equation 1.1) with A₊ = 0.005, A₋ = 0.00525, and α₊ = α₋ = 1. Synaptic conductances were initially distributed around the value w = 0.8 with standard deviation σ = 0.1 in a gaussian manner. The postsynaptic neuron model is defined by the membrane time constant τ_m, firing threshold θ in units of maximal excitatory conductance, the reset potential v₀, R (ratio of the inhibitory synaptic conductance to the maximal excitatory conductance), the number of excitatory and inhibitory inputs (N, N′), and their firing rates (λ_in, λ′_in). We used the settings τ_m = 20 ms, θ = 20, v₀ = 0, R = 1, and N = 2 × 10⁴ throughout this article. We set N′λ′_in = 80 and λ_in/τ_m = 10 Hz to obtain the three-dimensional plot.
The previous numerical study (Song et al., 2000) showed that λ_out depended linearly on λ_in with a very gentle slope. It was also shown that λ_out is a monotonically decreasing concave curve as a function of A₋/A₊, and the coefficient of variation (CV) of postsynaptic spikes is a monotonically increasing convex curve approaching unity as a function of the same ratio. With the present method, λ_out and CV can be calculated as

$$\frac{1}{\lambda_{out}} = \begin{cases} 2\sqrt{\pi}\,u\!\left(\dfrac{\theta-m}{s}\right) - \sqrt{\pi}\displaystyle\int_{m/s}^{(\theta-m)/s} \bigl(1-\mathrm{erf}(y)\bigr)\exp(y^2)\,dy & (m < \theta) \\[3mm] \sqrt{\pi}\displaystyle\int_{(\theta-m)/s}^{m/s} \bigl(1-\mathrm{erf}(y)\bigr)\exp(y^2)\,dy & (\theta \le m), \end{cases} \tag{3.4}$$
with u(x) = ∫₀ˣ exp(y²) dy, and

$$\mathrm{CV} = \sqrt{M_2 - M_1^2}\,\big/\,M_1, \tag{3.5}$$
with M_j = ∫₀^∞ t^j f(0, θ, t) dt. We note that m and s² depend on A₋/A₊, since they are determined by solving the self-consistency equations for each value of A₋/A₊ (see the explanation following equation 2.9). Plots of λ_out and CV versus A₋/A₊ shown in Figure 3 prove that the method successfully reproduces all of these unique parameter dependencies.

4 Conductance Distributions for Two Examples of STDP Learning

Having validated our method, we now examine STDP learning that has not been examined in detail in the literature.

4.1 CA1-Type Window Function. We consider a window function whose potentiation and depression components decay exponentially.
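The mean-rate formula of section 3 can be checked directly by quadrature. The sketch below implements equation 3.4 as we read it (the helper names are ours); it verifies that the two branches agree at m = θ and that the output rate increases (so 1/λ_out decreases) with the mean input m.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

def u(x):
    """u(x) = int_0^x exp(y**2) dy, as defined after equation 3.4."""
    val, _ = quad(lambda y: np.exp(y**2), 0, x)
    return val

def inv_rate(m, s, theta):
    """1/lambda_out (time in units of tau_m) from equation 3.4."""
    g = lambda y: (1.0 - erf(y)) * np.exp(y**2)
    if m < theta:
        tail, _ = quad(g, m / s, (theta - m) / s)
        return 2.0 * np.sqrt(np.pi) * u((theta - m) / s) - np.sqrt(np.pi) * tail
    val, _ = quad(g, (theta - m) / s, m / s)
    return np.sqrt(np.pi) * val
```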
When the potentiation part of the window function G(t) (t > 0) is multiplied by (1 − w) (Kistler & van Hemmen, 2000; van Rossum et al., 2000; Rubin et al., 2001), a large w-proportional term appears in the drift term from the first term in the right-hand side of equation 2.6. The coefficient of this term (the first term in equation 2.6) is generally large and always negative because w is multiplied by a negative number and 1 − w is multiplied by a positive number. Therefore, the drift term will have a negative slope unless the w²-proportional term originating in the second term in equation 2.6 becomes exceptionally large for some reason. Consequently, the final distribution quite generally becomes either a gaussian-like distribution (0 < w₀ < 1) or a saturated one (w₀ < 0 or 1 < w₀), where w₀ is the intersection of the drift term with the w-axis (see Figures 5C and 5D). Recently, it was demonstrated numerically that a gaussian-like distribution changes into a bimodal distribution when a continuous shift of the learning rule occurs, that is, from a multiplicative rule to an additive rule (Gütig et al., 2002).

6 Crucial Determinant of the Final Conductance Distribution

Let us consider the general condition for obtaining a bimodal, gaussian, or saturated conductance distribution. In order for the final distribution to be bimodal, which leads to synaptic competition, the constant term of A(w) in equation 2.6 (the total area S₀ encircled by the window function) must be negative, and the line representing A(w) must intersect the w-axis at about w = 0.5 (see Figure 4) with a positive slope. For the latter condition to be satisfied, the constant term and the coefficient of the w-proportional term in A(w) must be comparable. However, this can be the case only when the net potentiation and depression effects represented by G(t) are highly compensated to make the constant term small, because C(t) is, in general, small, as explained following

Figure 5: Facing page. Electric fish-type window function. (A) A window function similar to that observed in the electric fish cerebellum-like structure (Bell et al., 1997) was constructed (see the text for its precise definition), and the associated STDP learning was examined using the same procedures and the same parameter values, with the exception N′λ′_in = 100 (the number times the rate of inhibitory inputs), used to construct Figures 3 and 4. (B) Unlike the previous cases, the Δm versus m plot for a fixed value of s² revealed one unstable and two stable fixed points. As for s², a unique fixed point was always found. (C) Small initial conductances (w is distributed around 0.2 with σ = 0.1) caused m to be fixed at a small value, resulting in the concentration near w = 0 (γ = 1360, β = 0.441, w₀ = 0.0377), whereas (D) large initial conductances (w is distributed around 0.8 with σ = 0.1) caused m to be fixed at a large value, resulting in the concentration near w = 1 (γ = 1150, β = 0.00254, w₀ = −7.73). (E) Despite the large difference between this type of learning and the single-exponential, or CA1-type, learning, the λ_out versus λ_in plot is linear. A(w) was scaled to facilitate visualization of the dotted line.
equation 2.5. Meanwhile, if the total area S₀ is positive, the final distribution becomes either a gaussian or a saturated distribution, depending on whether the line representing A(w) intersects the segment 0 < w < 1. These considerations show that the total area S₀ is another crucial determinant of the final conductance distribution.

7 General Properties of Additive STDP Learning

Song et al. (2000) showed that STDP learning defined by their window function increased the correlation between pre- and postsynaptic spikes. Is the correlation always enhanced by STDP learning? In fact, we can derive the following general rule: The correlation between pre- and postsynaptic spikes increases when STDP learning is designed to lead to strong synaptic competition. When synaptic competition occurs, ∫_{−∞}^{∞} G(t) dt + w ∫₀^∞ G(t)C(t) dt intersects near the middle point of the segment 0 < w < 1 with a positive slope. In fact, the upward flow of w to the right of the intersection and the downward flow of w to the left of the intersection result in the synaptic competition. Therefore, if synaptic competition is strong, ∫₀^∞ G(t)C(t) dt must be a large, positive value. For this integral to be a large, positive value, G(t) must be positive when C(t) is positive and G(t) must be negative when C(t) is negative. This implies that the connections between positively correlated neuron pairs (C(t) > 0) are strengthened (G(t) > 0), and the connections between negatively correlated neuron pairs (C(t) < 0) are weakened (G(t) < 0). The overall correlation thereby increases, leading to the first property, noted above. This property suggests that the correlation between pre- and postsynaptic spikes may not be enhanced for STDP learning that does not imply strong synaptic competition. The electric fish-type learning is a typical example of noncompetitive STDP learning. The correlation between pre- and postsynaptic spikes does not increase through this type of STDP learning.
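The classification in section 6 depends only on the sign of S₀ and on where (and with what slope) the drift line crosses the w-axis. A small decision helper captures it (the function name and the exact boundary handling are ours):

```python
def classify_drift(S0, c1):
    """Classify the final conductance distribution from the linearized
    drift A(w) = S0 + c1 * w (section 6).  S0 is the total area of the
    window function; c1 is the coefficient of the w-proportional term.
    """
    w0 = -S0 / c1 if c1 != 0 else None        # intersection with the w-axis
    if S0 < 0 and c1 > 0 and w0 is not None and 0 < w0 < 1:
        return "bimodal"      # flow away from w0: competition, peaks at 0 and 1
    if c1 < 0 and w0 is not None and 0 < w0 < 1:
        return "gaussian"     # stable fixed point inside (0, 1)
    return "saturated"        # drift pushes essentially all mass to one bound
```

With an additive Song et al.-type window the drift line gives the bimodal case, while a negative slope (as in the multiplicative rules cited above) yields the gaussian or saturated cases.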
As shown in Figure 1, C(t) changes its sign at some point t = t₀. The intersection t₀ depends on the two parameters x₁ = √2(θ − m)/s and x₀ = √2(v₀ − m)/s (v₀: the reset potential). When synaptic competition occurs, x₁ and x₀, respectively, take positive and negative values of the order of unity. For such parameter values, we typically have t₀ = time/τ_m ≈ 1 (τ_m: membrane time constant). In other words, C(t) changes its sign at time ≈ τ_m = 10–20 ms. Therefore, a depression component (G(t) < 0) in the late causal domain (t₀ < t) contributes positively to synaptic competition, implying the second property: A depression component in the late causal domain (t₀ < t) of the window function enhances synaptic competition. This enhancement of synaptic competition is understood to result from the removal of less correlated synaptic connections. If we note that the slope ∫₀^∞ G(t)C(t) dt measures the strength of synaptic competition and that ∫₀^∞ G(t)C(t) dt is maximized for G(t) ∝ C(t), we find
the third property: Synaptic competition is optimal when the causal part of a window function has the same shape as C(t).

8 Discussion

A recent experiment describing plasticity in the rat hippocampal slice preparation of CA1 (Nishiyama et al., 2000) revealed a window function whose causal part was similar to C(t). Meanwhile, our analysis has revealed that the optimal window function for synaptic competition is proportional to C(t). This suggests that the window function of the hippocampal CA1 pyramidal neurons may be designed to enhance synaptic competition in this area, that is, the competition between inputs from the CA3 neurons. This result raises the interesting possibility that memory traces loaded into CA3 are highly selected when they are passed to CA1 for further processing required for consolidation in long-term memory storage. It is worth summarizing the differences between our method and previous ones. van Rossum et al. (2000) and Rubin et al. (2001) revealed the qualitative features of STDP learning with approximate drift and diffusion terms that were evaluated by assuming a piecewise linear neuron, a rectangular window function, simplified forms of the correlation functions, and other mathematical simplifications. We have taken the nonlinearity of neural firing into account by using the Ornstein-Uhlenbeck process for membrane fluctuations. The previous work stated that m and σ² are determined by solving the self-consistency equations. However, they did not need to solve these equations explicitly because they focused on the qualitative aspects of STDP learning. We have solved the self-consistency equations explicitly (see Figures 4B and 5B) and conducted quantitative analyses (see Figure 3). This article has also evaluated the Laplace-transformed correlation to be Ĉ(α) ≈ (θ − m)/s² for the window function used in Song et al. (2000), which confirmed the assumption by Rubin et al.
(2001) that this quantity should be a monotonically decreasing function of λ_out. Note that λ_out is roughly proportional to m. Levy et al. (2001) defined STDP learning as a stochastic process consisting of the usual synaptic update and the relaxation of synaptic conductance. Therefore, in the corresponding Fokker-Planck equation, the average change of w had an additional component arising from the relaxation. Their analysis of the Fokker-Planck equation yielded exponential distributions rather than bimodal ones in the no-relaxation limit. In their analysis, however, the average change of w originating from the correlation between pre- and postsynaptic spikes was not treated explicitly. The necessary condition for justifying this approximation is not obvious. Their definition of STDP is by itself biologically interesting and may be tested by experiment. If the relaxation effect exists in STDP learning, it might dramatically influence the distribution of synapses.
Kempter, Gerstner, and van Hemmen (2001) derived what is conceptually similar to equation 2.5, which defines C(t), and discussed its approximate structure. A great advantage of our approach is that we can evaluate the equation exactly, as in Figure 1, by using the Ornstein-Uhlenbeck process. From the obtained expression of equation 2.5, we derived the three general properties of STDP learning. We can show that C(t) → 1/√t for t → 0 from Ĉ(α) → 1/√α for α → ∞ (see appendix C); that is, C(t) changes steeply around t = 0. Since several experimentally observed window functions display a steep change around t = 0 (Abbott & Nelson, 2000), our analysis predicts that any modulation of a spike-generation mechanism or backpropagation mechanism must have a significant impact on STDP learning. Under the moderate firing condition assumed throughout this study, the acausal part of G(t) appears in the Fokker-Planck equation only through the first part of the total integral S₀ = ∫_{−∞}^{0} G(t) dt + ∫₀^∞ G(t) dt. This implies that the shape of the acausal part is unimportant under this condition. However, when this condition is violated, S₀ must be replaced by

$$\int_{-\infty}^{0} G(t)\left(1 - \int_0^{|t|} dt'\, f(v_0, \theta, t')\right) dt + \int_0^{\infty} G(t)\, dt$$

(see appendix A). Therefore, the explicit shape of the acausal part of the window function becomes important. Since the second factor in the integrand approaches zero for |t| much greater than the average interspike interval T = 1/λ_out, the integral is nearly equal to ∫_{−T}^{T} G(t) dt. Our method is applicable to STDP learning of inhibitory synapses provided that appropriate changes are made in the calculations. First, the self-consistency equations must be changed to

$$m = N\lambda_{in} R - N'\lambda'_{in} \int_0^1 dw\, w P(w, t), \qquad s^2 = N\lambda_{in} R^2 + N'\lambda'_{in} \int_0^1 dw\, w^2 P(w, t),$$
where R now represents the size of the random inputs mediated by (nonplastic) excitatory synapses. Second, equation 2.4 must be replaced with

$$T(w, t) = \begin{cases} \lambda_{in}\lambda_{out} & t < 0, \\ \lambda_{in}\lambda_{out}\bigl(1 - wC(t)\bigr) & t \ge 0, \end{cases}$$

since the timing of postsynaptic spikes is negatively correlated with that of presynaptic spikes mediated by inhibitory synapses. Here, we do not repeat the detailed calculations for STDP learning of inhibitory synapses. In summary, this study provided a powerful analytical tool to study how a group of synapses on a postsynaptic neuron changes according to STDP learning with an arbitrary window function. We have identified the window function that optimizes competition between synapses. For a more complete understanding of STDP, it is important to clarify how STDP learning reorganizes a recurrent network of neurons in an activity-dependent way. Such
an issue has been studied numerically by several authors (Amemori & Ishii, 1998; Levy et al., 2001; Kitano et al., 2002). Besides asynchronous activity at a moderate firing rate, a synfire chain-like activity (Abeles, 1991; Diesmann, Gewaltig, & Aertsen, 1999; Câteau & Fukai, 2001) was self-organized under a certain condition. How to apply our method to STDP learning in mutually connected neurons is open for further study.

Appendix A

This appendix calculates T(w, t) for negative and positive t. Suppose that f(v₀, θ, t) denotes the ISI distribution of a postsynaptic neuron with reset point v₀ and firing threshold θ. If a check mark is indicated on a time axis every time a postsynaptic neuron fires, the lengths of the resulting time intervals on the time axis will be distributed according to f(v₀, θ, t).

A.1 Acausal Domain. For t = t_post − t_pre < 0, pre- and postsynaptic spikes are not causally related. Therefore, T(w, t) is estimated by supposing that a presynaptic spike occurs at any time point on the checked axis considered above with equal probability. Specifically, T(w, t) is equal to the probability that an arbitrarily chosen point (presynaptic spike) on the checked axis is located at distance |t| from the last check mark (postsynaptic spike). Our pairing rule (see section 2) takes only the last check mark into account. Among all the intervals on the checked axis, only ones longer than |t| can accommodate a point representing a presynaptic spike considered above. Hence, we find that T(w, t)/λ_in ∝ ∫_{|t|}^{∞} dt′ f(v₀, θ, t′). Normalizing it for a given presynaptic firing event, ∫_{−∞}^{0} dt′ T(w, t′)/λ_in = 1, determines T(w, t) as T(w, t) = λ_in λ_out (1 − ∫₀^{|t|} dt′ f(v₀, θ, t′)). The term ∫₀^{|t|} dt′ f(v₀, θ, t′) signifies that the contribution from pairings beyond the postsynaptic ISI was removed. Under the moderate firing condition, this term can be neglected, because when this term is large, G(t) becomes small.

A.2 Causal Domain.
For t = t_post − t_pre ≥ 0, pre- and postsynaptic spikes are causally related: a presynaptic spike advances the occurrence of a postsynaptic spike. The postsynaptic membrane potential v fluctuates due to random excitatory and inhibitory inputs from thousands of presynaptic neurons. When a presynaptic neuron fires a spike, the membrane potential v of a postsynaptic neuron abruptly increases to v + w due to the synaptic input from a synapse with conductance w. This neuron elicits a spike at time t with probability f(v + w, θ, t). Summing this probability over all possible values of v with probability weight φ(v) (the probability of finding the membrane potential at v) results in the total probability of a postsynaptic spike occurring at time t after a presynaptic spike, T(w, t) = λ_in ∫_{−∞}^{θ−w} dv φ(v) f(v + w, θ, t).
A change of variable and an expansion of this in w up to the linear order result in

$$T(w, t)/\lambda_{in} \simeq \int_{-\infty}^{\theta} dv\, \varphi(v) f(v, \theta, t) - w \int_{-\infty}^{\theta} dv\, \varphi'(v) f(v, \theta, t).$$
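Both appendix A expressions can be sanity-checked numerically. Below, the acausal result is checked with an illustrative gamma-shaped ISI density (an assumption; any normalized ISI density works), using the identity ∫_{−∞}^{0} T(w, t)/λ_in dt = λ_out E[ISI] = 1. The causal linearization is then checked with stand-in choices of our own: a gaussian φ(v) and a smooth surrogate for f(v, θ, t) at fixed t.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import gamma

# --- A.1: acausal domain.  T(w,t)/lam_in = lam_out*(1 - int_0^|t| f) for
# t < 0, so integrating over all t < 0 must give exactly 1 (one postsynaptic
# partner per presynaptic spike under the pairing rule).
isi = gamma(a=3.0, scale=5.0)                    # illustrative ISI density, mean 15
lam_out = 1.0 / isi.mean()
T_acausal = lambda t: lam_out * isi.sf(abs(t))   # valid for t < 0
total, _ = quad(T_acausal, -np.inf, 0)

# --- A.2: causal domain.  Check that int phi(v - w) f(v) dv is reproduced
# at linear order by int phi*f - w * int phi'*f for small w.
theta, m, s = 1.5, 0.0, 1.0
phi  = lambda v: np.exp(-((v - m) ** 2) / s**2) / (s * np.sqrt(np.pi))
dphi = lambda v: -2.0 * (v - m) / s**2 * phi(v)
f_t  = lambda v: np.exp(v)                       # smooth surrogate for f(v, theta, t)

def exact(w):
    val, _ = quad(lambda v: phi(v) * f_t(v + w), -np.inf, theta - w)
    return val

def linearized(w):
    t0, _ = quad(lambda v: phi(v) * f_t(v), -np.inf, theta)
    t1, _ = quad(lambda v: dphi(v) * f_t(v), -np.inf, theta)
    return t0 - w * t1
```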
The second term on the right-hand side of this equation represents how much the probability of postsynaptic firing increases due to one presynaptic input of size w. This probability increase resulting from only a single input is understood to be small if we consider that 10 to 100 presynaptic inputs are required to elicit neural firing. In fact, the Laplace transform of the above expression calculated in equation 3.1 shows that the linear term in w is about 1/s times smaller than the constant term, where s (the standard deviation of the membrane potential fluctuations) is of the same order as θ (the number of excitatory postsynaptic potentials needed to fire a neuron from rest). Therefore, neglecting higher-order terms in w is a good approximation. The first term on the right-hand side equals T(w = 0, t), which is the firing probability when no specific input other than the background is specified. Therefore, this first term equals λ_out. We note that φ(v) and f(v, θ, t) appearing in the above expression can be calculated analytically (see Brunel et al., 2001) and numerically (Plesser & Tanaka, 1997), respectively.

Appendix B

This appendix explains that membrane potential fluctuations can be described by the Ornstein-Uhlenbeck process (Tuckwell, 1988; Risken, 1996). When a neuron receives N random excitatory inputs with their sizes distributed as P(w, t) (0 ≤ w ≤ w_max = 1) and N′ random inhibitory inputs with their sizes being R in units of maximal excitatory conductance, at average rates of λ_in and λ′_in, respectively, the membrane potential fluctuations are characterized by the Ornstein-Uhlenbeck process defined by the stochastic differential equation dv = −(v − m) dt + s dW. Here, m and s² are written as

$$m = N\lambda_{in}\int_0^1 dw\, w P(w, t) - RN'\lambda'_{in}, \qquad s^2 = N\lambda_{in}\int_0^1 dw\, w^2 P(w, t) + R^2 N'\lambda'_{in}.$$
Since φ(v) and f(v, θ, t) depend on m and s², the drift and diffusion terms in the Fokker-Planck equation also depend on m and s² (see appendix A), and, consequently, its solution P(w, t) also depends on m and s². Therefore, the above equations imply self-consistency equations for m and s² of the form m = γ₁(m, s²) and s² = γ₂(m, s²).
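A quick Euler-Maruyama check of the Ornstein-Uhlenbeck description (a sketch; the step size, horizon, and particular values of m and s are arbitrary choices): for dv = −(v − m) dt + s dW, the stationary mean should be m and the stationary variance s²/2.

```python
import numpy as np

# Euler-Maruyama simulation of the Ornstein-Uhlenbeck process of appendix B,
# dv = -(v - m) dt + s dW, with time measured in units of tau_m.
rng = np.random.default_rng(0)
m, s = -3.0, 7.0
dt, n_steps, n_paths = 0.01, 4000, 2000

v = np.full(n_paths, m)                 # start at the stationary mean
for _ in range(n_steps):
    v += -(v - m) * dt + s * np.sqrt(dt) * rng.standard_normal(n_paths)

est_mean, est_var = v.mean(), v.var()   # should approach m and s**2 / 2
```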
Appendix C

This appendix shows how to calculate the Laplace transform of C(t) analytically. It is known that φ(v) and f̂(v, θ, α) solve backward and forward Kolmogorov equations (Risken, 1996):

$$\left(\frac{s^2}{2}\frac{\partial^2}{\partial v^2} - (v - m)\frac{\partial}{\partial v}\right)\hat f(v, \theta, \alpha) = \alpha \hat f(v, \theta, \alpha),$$

$$\left(\frac{s^2}{2}\frac{\partial^2}{\partial v^2} + \frac{\partial}{\partial v}\bigl((v - m)\,\cdot\bigr)\right)\varphi(v) = -\lambda_{out}\,\delta(v - v_0).$$

This implies that F(w, α) = ∫_{−∞}^{θ} dv φ(v − w) f̂(v, θ, α) satisfies a linear differential equation,

$$\left(\alpha + w\frac{\partial}{\partial w}\right)F(w, \alpha) = \left\{\frac{s^2}{2}\hat f{}'(\theta, \theta, \alpha) - (\theta - m)\hat f(\theta, \theta, \alpha)\right\}\varphi(\theta - w) - \frac{s^2}{2}\varphi'(\theta - w) - \lambda_{out}\hat f(w + v_0, \theta, \alpha).$$

This is solved to give

$$F(w, \alpha)/\lambda_{out} = \frac{1 - \hat f(v_0, \theta, \alpha)}{\alpha} + \frac{\hat f{}'(\theta, \theta, \alpha) - \hat f{}'(v_0, \theta, \alpha)}{\alpha + 1}\,w + \cdots.$$

Under the moderate firing condition, f̂(v₀, θ, α) ≪ 1 and f̂′(v₀, θ, α) ≪ f̂′(θ, θ, α), we have

$$F(w, \alpha)/\lambda_{out} = \frac{1}{\alpha} + \frac{\hat f{}'(\theta, \theta, \alpha)}{\alpha + 1}\,w.$$

Meanwhile, f̂(v, θ, α) of the Ornstein-Uhlenbeck process defined by (m, s²) is expressed using the corresponding function f̂₀(x, x₁, α) of the standard Ornstein-Uhlenbeck process defined by (m, s²) = (0, 2) as

$$\hat f(v, \theta, \alpha) = \hat f_0\bigl(x = \sqrt{2}(v - m)/s,\; x_1 = \sqrt{2}(\theta - m)/s,\; \alpha\bigr).$$

Correspondingly, we have f̂′(θ, θ, α) = (√2/s) f̂₀′(x₁, x₁, α), where f̂₀(x, x₁, α) is calculated as f̂₀(x, x₁, α) = ψ(x, α)/ψ(x₁, α), using a solution to

$$\left(\frac{\partial^2}{\partial x^2} - x\frac{\partial}{\partial x} - \alpha\right)\psi(x, \alpha) = 0 \quad \text{with } \psi(0, \alpha) = 1,\ \psi(x, 0) = 1. \tag{C.1}$$
The solution can be obtained as

$$\psi(x, \alpha) = \frac{1}{\Gamma(\alpha/2)}\sum_{n=0}^{\infty} \frac{2^{n/2}\,\Gamma(\alpha/2 + n/2)\,x^n}{n!}. \tag{C.2}$$

The series expansion in terms of the gamma function Γ(x) can be transformed into an integral form to give ψ(x, α) = e^{x²/2} Ψ(x, α)/(Γ(α/2) 2^{α/2−1}), where Ψ(x, α) = ∫₀^∞ dy y^{α−1} e^{−(y−x)²/2}. With the recursion formula derived from equations C.1 and C.2, ∂ψ(x, α)/∂x = √2 Γ((α + 1)/2)/Γ(α/2) · ψ(x, α + 1), we obtain

$$\hat f_0'(x_1, x_1, \alpha) = h(x_1, \alpha) = \int_0^{\infty} dy\, y^{\alpha} e^{-(y - x_1)^2/2} \Big/ \int_0^{\infty} dy\, y^{\alpha - 1} e^{-(y - x_1)^2/2}.$$
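The ratio h(x₁, α) is cheap to evaluate by quadrature. The sketch below also compares it with the large-α estimate that closes this appendix, rewritten in the standardized variable x₁ = √2(θ − m)/s; the conversion h(x₁, α) ≈ (x₁ + √(x₁² + 4(α − 1)))/2 is our algebra.

```python
import numpy as np
from scipy.integrate import quad

def h(x1, a):
    """h(x1, alpha) = f0_hat'(x1, x1, alpha), a ratio of two quadratures.
    The integrands are negligible beyond y ~ x1 + 30 for the alpha used here,
    so a finite upper limit keeps the adaptive quadrature well behaved."""
    num, _ = quad(lambda y: y**a * np.exp(-((y - x1) ** 2) / 2.0), 0, 50, limit=200)
    den, _ = quad(lambda y: y**(a - 1) * np.exp(-((y - x1) ** 2) / 2.0), 0, 50, limit=200)
    return num / den

def h_large_alpha(x1, a):
    """Large-alpha form of h, obtained by rewriting the appendix's estimate
    of f_hat'(theta, theta, alpha) in terms of x1 (our algebra)."""
    return 0.5 * (x1 + np.sqrt(x1**2 + 4.0 * (a - 1.0)))
```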
From this expression, we can show

$$\hat f{}'(\theta, \theta, \alpha) \simeq \Bigl(\theta - m + \sqrt{(\theta - m)^2 + 2s^2(\alpha - 1)}\Bigr)\Big/ s^2$$

for large α.

References

Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience, 3, 1178–1183.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Amemori, K., & Ishii, S. (1998). Self-organizing network learning for submillisecond temporal coded information. In S. Usui & T. Omori (Eds.), Proceedings of the Fifth International Conference on Neural Information Processing (Vol. 3, pp. 1285–1288). Burke, VA: IOS Press.
Bell, C. C., Han, V. G., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.
Bi, G.-Q., & Poo, M.-M. (1998). Activity-induced synaptic modifications in hippocampal culture: Dependence on spike timing, synaptic strength and cell type. Journal of Neuroscience, 18, 10464–10472.
Brunel, N., Chance, F. S., Fourcaud, N., & Abbott, L. F. (2001). Effects of synaptic noise and filtering on the frequency response of spiking neurons. Physical Review Letters, 86, 2186–2189.
Câteau, H., & Fukai, T. (2001). The Fokker-Planck approach to the pulse packet propagation in synfire chain. Neural Networks, 14, 675–685.
Debanne, D., Gahwiler, B. H., & Thompson, S. M. (1994). Asynchronous pre- and postsynaptic activity induces associative long-term depression in area CA1 of the rat hippocampus in vitro. Proceedings of the National Academy of Sciences (USA), 91, 1148–1152.
Debanne, D., Gahwiler, B. H., & Thompson, S. M. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. Journal of Physiology (London), 507, 237–247.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Egger, V., Feldmeyer, D., & Sakmann, B. (1999). Coincidence detection and efficacy changes in synaptic connections between spiny stellate neurons of the rat barrel cortex. Nature Neuroscience, 2, 1098–1105.
Feldman, D. E. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Gustafsson, B., Wigstrom, H., Abraham, W. C., & Huang, Y.-Y. (1987). Long-term potentiation in the hippocampus using depolarizing current pulses as the conditioning stimulus to single volley synaptic potentials. Journal of Neuroscience, 7, 774–780.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2002). Learning input correlations through non-linear temporally asymmetric Hebbian plasticity. Unpublished manuscript, Albert-Ludwigs-University.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2741.
Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with the timing of pre- and postsynaptic action potentials. Neural Computation, 12, 385–405.
Kitano, K., Câteau, H., & Fukai, T. (2002). Self-organization of memory activity through spike-timing-dependent plasticity. NeuroReport, 13, 795–798.
Levy, N., Horn, D., Meilijson, I., & Ruppin, E. (2001). Distributed synchrony in a cell assembly of spiking neurons. Neural Networks, 14, 815–824.
Levy, W. B., & Steward, O. (1983). Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neuroscience, 8, 791–797.
Magee, J. C., & Johnston, D. (1997). A synaptically controlled, associative signal for Hebbian plasticity in hippocampal neurons. Science, 275, 209–213.
Markram, H., Lubke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Nishiyama, M., Hong, K., Mikoshiba, K., Poo, M.-M., & Kato, K. (2000). Calcium stores regulate the polarity and input specificity of synaptic modification. Nature, 408, 584–588.
Plesser, H. E., & Tanaka, S. (1997). Stochastic resonance in a model neuron with reset. Physics Letters A, 225, 228–237.
Risken, H. (1996). The Fokker-Planck equation. New York: Springer-Verlag.
Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Physical Review Letters, 86, 364–367.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Tuckwell, H. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
van Rossum, M. C. W., Bi, G.-Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. Journal of Neuroscience, 20, 8812–8821.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.

Received January 25, 2002; accepted August 1, 2002.
LETTER
Communicated by Peter Latham
Permitted and Forbidden Sets in Symmetric Threshold-Linear Networks

Richard H. R. Hahnloser
[email protected] H. Sebastian Seung
[email protected] Howard Hughes Medical Institute, Department of Brain and Cognitive Sciences, MIT E25-210, Cambridge, MA 02139, U.S.A.
Jean-Jacques Slotine
[email protected] Department of Mechanical Engineering and Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
The richness and complexity of recurrent cortical circuits is an inexhaustible source of inspiration for thinking about high-level biological computation. In past theoretical studies, constraints on the synaptic connection patterns of threshold-linear networks were found that guaranteed bounded network dynamics, convergence to attractive fixed points, and multistability, all fundamental aspects of cortical information processing. However, these conditions were only sufficient, and it remained unclear which were the minimal (necessary) conditions for convergence and multistability. We show that symmetric threshold-linear networks converge to a set of attractive fixed points if and only if the network matrix is copositive. Furthermore, the set of attractive fixed points is nonconnected (the network is multiattractive) if and only if the network matrix is not positive semidefinite. There are permitted sets of neurons that can be coactive at a stable steady state and forbidden sets that cannot. Permitted sets are clustered in the sense that subsets of permitted sets are permitted and supersets of forbidden sets are forbidden. By viewing permitted sets as memories stored in the synaptic connections, we provide a formulation of long-term memory that is more general than the traditional perspective of fixed-point attractor networks. There is a close correspondence between threshold-linear networks and networks defined by the generalized Lotka-Volterra equations.
Neural Computation 15, 621–638 (2003)
© 2003 Massachusetts Institute of Technology
622
R. Hahnloser, H. Seung, and J. Slotine
1 Introduction

Recurrent networks of neurons are believed to play an important role in the perceptual interpretation and memorization of sensory events. The dynamics of simplified networks are often studied in a regime where, as a response to constant sensory stimulation, neural activities converge to a steady state. Such networks can efficiently implement complex nonlinear mappings between input stimulation and response activity. In a very general framework, we characterize the geometry of potentially multiple stable, steady responses and show that in addition to the dependence on synaptic connections, steady responses also depend on the inherent ambiguity of the sensory input. A Lyapunov function can be used to prove that a fixed point of a set of differential equations is stable. By extension, for some differential equations that have many distinct fixed points, Lyapunov-like functions can be constructed with multiple local minima, each corresponding to a stable fixed point. These local minima were previously interpreted as memorized patterns that can be recalled by suitable initialization. Lyapunov theory applies mainly to symmetric networks in which neurons have monotonic activation functions (Hopfield, 1984; Cohen & Grossberg, 1983). Here we show that the restriction of activation functions to threshold-linear ones can yield new insights into the computational behavior of recurrent networks. The results we present elaborate on previous work (Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000; Hahnloser & Seung, 2001). We present three main theorems about the neural responses to constant inputs. The first theorem provides necessary and sufficient conditions on the synaptic weight matrix for the existence of a set of globally attractive fixed points. These conditions can be expressed in terms of copositivity, a concept from quadratic programming and linear complementarity theory.
Alternatively, they can be expressed in terms of certain eigenvalues and eigenvectors of submatrices of the synaptic weight matrix, making a connection to linear systems theory. The theorem guarantees that the network will produce a steady-state response to any constant input. We regard this response as the computational output of the network, and its characterization is the topic of the second and third theorems. In the second theorem, we introduce the idea of permitted and forbidden sets. Under certain conditions on the synaptic weight matrix, we show that there exist sets of neurons that are "forbidden" by the recurrent synaptic connections from being coactivated at a stable steady state, no matter what input is applied. Other sets are "permitted," in the sense that they can be stably coactivated for some input. Our main result is that a network is conditionally multiattractive if and only if it possesses a forbidden set; that is, there exists an input for which there is a nonconnected set of attractive fixed points, each of which can be retrieved by suitable initial conditions.
Permitted and Forbidden Sets
623
Hence, we find that forbidden sets and conditional multiattractiveness are inseparable concepts. The conceptual importance of permitted and forbidden sets suggests a new way of thinking about memory in neural networks. When an input is applied, the network must select a set of active neurons, and this selection is constrained to be one of the permitted sets. Therefore, the permitted sets can be regarded as memories stored in the synaptic connections. The memory storage in threshold-linear networks is limited by the topological properties of the memories. Our third theorem states that there are constraints on the groups of permitted and forbidden sets that can be stored by a network. No matter which learning algorithm is used to store memories, active neurons cannot arbitrarily be divided into permitted and forbidden sets, because subsets of permitted sets have to be permitted and supersets of forbidden sets have to be forbidden. In the appendix, we show that symmetric threshold-linear networks have the same set of attractive fixed points as do networks defined by the generalized Lotka-Volterra equations.

2 Threshold-Linear Network Equations

Our theory is applicable to the network dynamics

$$\frac{dx_i}{dt} + x_i = \Big[ b_i + \sum_j W_{ij} x_j \Big]^+,$$
(2.1)
where $[u]^+ = \max\{u, 0\}$ is a (half-wave) rectification nonlinearity and the synaptic weight matrix is symmetric, $W_{ij} = W_{ji}$. The dynamics can also be written in a more compact matrix-vector form as $\dot{x} + x = [b + Wx]^+$. The state of the network is $x$. An input to the network is an arbitrary vector $b$. An output of the network is a steady state $x$ ($\dot{x} = 0$) in response to $b$. The existence of outputs and their relationship to the input are determined by the synaptic weight matrix $W$. Threshold-linear units were introduced to neural modeling by Hartline and Ratliff (1958). Under certain circumstances, networks of conductance-based model neurons can be reduced to nonspiking dynamics like equation 2.1 (Ermentrout, 1994). In this context, the term in square brackets is interpreted as a spike rate, whereas $x_i$ is interpreted as a synaptic activation variable.

Definition 1. A vector $v$ is said to be nonnegative, $v \ge 0$, if all of its components are nonnegative. Similarly, a vector is said to be positive if all of its components are positive. The nonnegative orthant $\{v : v \ge 0\}$ is the set of all nonnegative vectors.
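The dynamics in equation 2.1 are straightforward to explore numerically. The sketch below is our illustration, not code from the paper; the two-neuron weights and inputs are arbitrary choices. It integrates the rectified dynamics with forward Euler:

```python
# Illustrative sketch (not from the paper): forward-Euler integration of
# the threshold-linear dynamics dx_i/dt + x_i = [b_i + sum_j W_ij x_j]^+.
def simulate(W, b, x0, dt=0.01, steps=20000):
    """Integrate dx/dt = -x + [b + W x]^+ and return the final state."""
    n = len(b)
    x = list(x0)
    for _ in range(steps):
        new_x = []
        for i in range(n):
            drive = b[i] + sum(W[i][j] * x[j] for j in range(n))
            rect = max(drive, 0.0)  # the [.]^+ rectification
            new_x.append(x[i] + dt * (-x[i] + rect))
        x = new_x
    return x

# Two mutually inhibiting neurons, W = (0, -1; -1, 0), unequal inputs.
W = [[0.0, -1.0], [-1.0, 0.0]]
b = [1.0, 0.2]
x = simulate(W, b, [0.1, 0.1])
# The steady state solves x = [b + W x]^+; here the strongly driven
# neuron suppresses the weak one, so x converges near (1, 0).
```

The final state satisfies the fixed-point condition $x = [b + Wx]^+$ up to the integration tolerance.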
It can be shown that any trajectory of equation 2.1 starting in the nonnegative orthant remains in the nonnegative orthant. Therefore, for simplicity, we will consider initial conditions that are confined to the nonnegative orthant.

3 Convergence to an Equilibrium

The following theorem establishes necessary and sufficient conditions for a symmetric network of the form in equation 2.1 to converge always to a steady state.

Definition 2. Let $A$ be an $N \times N$ matrix, and let $\Sigma = (i_1, \ldots, i_k)$ be a set of $k$ ordered indices, $1 \le i_1 < i_2 < \cdots < i_k \le N$. The matrix $A^\Sigma$ is said to be a $k \times k$ submatrix of $A$ if $A^\Sigma_{uv} = A_{i_u i_v}$ for all $u, v$. The matrix $A^\Sigma$ can be constructed from $A$ simply by removing from $A$ all rows and columns not indexed by $\Sigma$.

Theorem 1. Assuming that $W$ is symmetric, the following statements are equivalent:
1. All positive eigenvectors of all submatrices of $I - W$ have positive eigenvalues.

2. The matrix $I - W$ is copositive; that is, $x^T (I - W) x > 0$ for all nonnegative $x$, except $x = 0$.
3. For all constant $b$ and all initial conditions, the dynamics converge to an equilibrium point.

Proof.

• Statement 2 follows from statement 1. We will assume that $I - W$ is not copositive and show that there exists a positive eigenvector of a submatrix of $I - W$ with nonpositive eigenvalue. Let $v^*$ be the minimum of $Q = v^T (I - W) v$ over nonnegative $v$ on the unit sphere, $v^* = \arg\min_{v \ge 0} G$, where $G = Q - \alpha v^T v$ ($\alpha$ is a Lagrange multiplier). Define by $\Sigma = (i_1, \ldots, i_k)$ the $k$ positive components of $v^*$, $v^*_{i \in \Sigma} > 0$. The gradient $\nabla G$ has to vanish in these components, and so for all $i \in \Sigma$, we have the Kuhn-Tucker conditions $\alpha v^*_i = v^*_i - \sum_j W_{ij} v^*_j = v^*_i - \sum_{j \in \Sigma} W_{ij} v^*_j$. Thus, the vector $v^*_{i \in \Sigma}$ is a positive eigenvector of the matrix $(I - W)^\Sigma$ with eigenvalue $\alpha$. And because the minimum $Q(v^*) = \alpha$ is nonpositive by assumption, we find that the eigenvalue of this positive eigenvector is nonpositive.
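For symmetric $2 \times 2$ matrices, copositivity can be checked in closed form: $M = (a, b; b, c)$ is strictly copositive if and only if $a > 0$, $c > 0$, and $b > -\sqrt{ac}$, a standard criterion from the copositivity literature. The sketch below is our illustration (the example networks are not from the paper):

```python
import math

# Closed-form test of strict copositivity for a symmetric 2x2 matrix
# M = [[a, b], [b, c]]: x^T M x > 0 for all nonnegative x != 0
# iff a > 0, c > 0, and b > -sqrt(a * c).
def is_copositive_2x2(a, b, c):
    return a > 0 and c > 0 and b > -math.sqrt(a * c)

# Mutual inhibition W = (0, -2; -2, 0): I - W = [[1, 2], [2, 1]] is
# copositive (so the dynamics converge) even though its eigenvalues
# are 3 and -1, so I - W is not positive definite.
print(is_copositive_2x2(1.0, 2.0, 1.0))   # True

# Mutual excitation W = (0, 2; 2, 0): I - W = [[1, -2], [-2, 1]] is not
# copositive; along the nonnegative direction x = (1, 1) one gets
# x^T (I - W) x = -2, and the activity can grow without bound.
print(is_copositive_2x2(1.0, -2.0, 1.0))  # False
```

The first example previews a theme of the paper: copositivity is strictly weaker than positive definiteness, which is exactly the room in which multiple attractors live.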
• Statement 3 follows from statement 2. We show that given the copositivity of $I - W$, we can find a Lyapunov-like function that is strictly decreasing under the dynamics in equation 2.1 and is zero only at steady states. By the copositivity of $I - W$, the function $L = \frac{1}{2} x^T (I - W) x - b^T x$ is lower bounded and radially unbounded in the nonnegative orthant, $x \ge 0$. For all fixed $b$, $L$ is nonincreasing under the network dynamics:
$\dot{L} = \dot{x}^T (x - Wx - b) = -(x - [Wx + b]^+)^T (x - Wx - b) = -(x - y^+)^T (x - y) \le 0$, where $y = Wx + b$. This last inequality follows from the fact that all components of $x$ and $y$ contribute to the decrease of $L$: (1) if a component $i$ satisfies $y_i \ge 0$, we have that $y^+_i = y_i$ and so $-(x_i - y^+_i)(x_i - y_i) = -(x_i - y_i)^2 \le 0$; (2) if a component $j$ satisfies $y_j < 0$, we have that $-(x_j - y^+_j)(x_j - y_j) = -x_j (x_j - y_j) \le 0$ (because $x_j \ge 0$). All equilibria of $L$ correspond to steady states of the dynamics of equation 2.1. From $\dot{L} = -(x - y^+)^T (x - y) = 0$, it follows that for every component $i$, we have that either $(x_i - y^+_i) = 0$ or $(x_i - y_i) = 0$. If $x_i - y^+_i = 0$, then by definition $\dot{x}_i = 0$; if $x_i - y_i = 0$, then we must have that $y_i = y^+_i$, which by nonnegativity of $x_i$ implies $\dot{x}_i = 0$. Hence, for all $b$ and all initial conditions, the network converges to an equilibrium.

• Statement 1 follows from statement 2. By contradiction, assume there is a positive eigenvector $v^\Sigma$ of a submatrix $(I - W)^\Sigma$ with eigenvalue $\lambda \le 0$. Define the vector $v$ by $v_i = v^\Sigma_i$ if $i \in \Sigma$ and $v_i = 0$ otherwise. Then $v^T (I - W) v = \lambda v^T v \le 0$, and therefore $I - W$ is not copositive.
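The monotone decrease of $L$ can also be observed numerically. In this sketch (our illustration; the weights and input are arbitrary choices), $L = \frac{1}{2} x^T (I - W) x - b^T x$ is evaluated along a forward-Euler trajectory of equation 2.1:

```python
# Illustrative check (not from the paper) that the Lyapunov-like function
# L = 0.5 x^T (I - W) x - b^T x decreases along trajectories of
# dx/dt = -x + [b + W x]^+.
def energy(W, b, x):
    """L = 0.5 * x^T (I - W) x - b^T x."""
    n = len(b)
    quad = sum(x[i] * (x[i] - sum(W[i][j] * x[j] for j in range(n)))
               for i in range(n))
    return 0.5 * quad - sum(b[i] * x[i] for i in range(n))

def trajectory_energies(W, b, x, dt=0.01, steps=5000):
    n = len(b)
    energies = [energy(W, b, x)]
    for _ in range(steps):
        x = [x[i] + dt * (-x[i] + max(b[i] + sum(W[i][j] * x[j]
             for j in range(n)), 0.0)) for i in range(n)]
        energies.append(energy(W, b, x))
    return energies

# Mutual-inhibition pair: I - W is copositive but not positive
# semidefinite, so L is bounded below on x >= 0 but is not convex.
W = [[0.0, -2.0], [-2.0, 0.0]]
b = [1.0, 1.0]
Ls = trajectory_energies(W, b, [0.9, 0.1])
assert all(Ls[k + 1] <= Ls[k] + 1e-9 for k in range(len(Ls) - 1))
```

With the small step size, the discrete trajectory inherits the monotone descent of the continuous flow; the trajectory settles in the local minimum near $(1, 0)$, where $L = -0.5$.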
• Statement 2 follows from statement 3. By contradiction, assume that $I - W$ is not copositive. From the equivalence of statement 2 with statement 1, we know that the positive components $\Sigma$ of the minimum $v^* = \arg\min_{v \ge 0} G$ correspond to a positive eigenvector of the submatrix $(I - W)^\Sigma$ with nonpositive eigenvalue $\alpha \le 0$. According to the Kuhn-Tucker conditions, minimization also implies that the gradient $\nabla G$ restricted to the vanishing components $Z$ of $v^*$ ($v^*_{i \in Z} = 0$) has to be nonnegative: $-\alpha v^*_i + v^*_i - \sum_j W_{ij} v^*_j = -\sum_j W_{ij} v^*_j \ge 0$ for $i \in Z$. Thus, by choosing inputs $b_\Sigma = v^*_\Sigma$ and $b_Z = 0$ as well as initial conditions $x_\Sigma(0) = v^*_\Sigma$ and $x_Z(0) = 0$, we have constructed a nonconverging trajectory: if $\alpha < 0$, this trajectory is given by $x_\Sigma(t) = (e^{-\alpha t} + \frac{1}{\alpha}) v^*_\Sigma$ and $x_Z(t) = 0$; if $\alpha = 0$, it is given by $x_\Sigma(t) = t v^*_\Sigma$ and $x_Z(t) = 0$.

The function $L$ that was constructed for the proof of theorem 1 is locally a Lyapunov function in the sense that the stable fixed points of the dynamics correspond to local minima of $L$. Conversely, we also have that all local minima of $L$ correspond to stable fixed points. We will again make use of $L$ in the next section, where we characterize the geometry of attractive fixed points. The fact that $L$ is a quadratic function will be particularly useful. Theorem 1 states sufficient and necessary conditions that guarantee bounded dynamics, the absence of limit cycles, and the absence of chaos. The first two conditions for convergence in theorem 1 are best appreciated by comparison with the conditions for convergence of a purely linear network constructed from equation 2.1 by dropping the rectification nonlinearity. Let us first consider statement 1: in a linear network, all eigenvalues of $I - W$ have to be positive to ensure convergence. In theorem 1, only eigenvalues of nonnegative eigenvectors have to be positive to ensure convergence. Eigenvalues of eigenvectors with both positive and negative components may well be nonpositive, a fact that will turn out to be important for multistability, as we shall see. But all submatrices of $I - W$ must be considered in theorem 1, because different sets of feedback connections are active for different sets of neurons that are above threshold. Let us now consider statement 2: in a linear network, $I - W$ would have to be positive definite to ensure asymptotic stability, but because of the rectification, in theorem 1 positive definiteness is replaced by the weaker condition of copositivity. Although theorem 1 implies convergence to a fixed point, it does not imply that there is only a single fixed point. As we shall see, for example, there can be an attractive line of fixed points (a line attractor) toward which the dynamics must converge. Nevertheless, theorem 1 excludes line attractor networks of the integrator type (Seung, 1996), that is, where there exists a positive eigenvector with zero eigenvalue, such as in $\dot{x} + x = [x + b]^+$. In such networks, a positive input is indefinitely integrated, resulting in unbounded behavior.

Lemma 1. For any nonnegative vector $v \ge 0$, there exists an input $b$ such that $v$ is a steady state of equation 2.1 with input $b$.

Proof. Choose $b = (I - W) v$.

Lemma 1 states that any nonnegative vector can be realized as a fixed point. Sometimes this fixed point is stable, such as in symmetric and copositive networks in which only a single neuron is active. Indeed, the submatrix of $I - W$ corresponding to a single active neuron corresponds to a diagonal element, which according to statement 1 of theorem 1 must be positive. Hence, it is always possible to activate only a single neuron at an asymptotically stable fixed point. However, as will become clear from theorem 2, not all nonnegative vectors can be realized as an asymptotically stable fixed point.
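Lemma 1's construction can be verified directly: with $b = (I - W) v$ for a nonnegative $v$, we get $b + Wv = v \ge 0$, so the rectification is inactive and $v$ solves $v = [b + Wv]^+$. A small sketch (our illustration; $W$ and $v$ are arbitrary choices):

```python
# Illustrative check of lemma 1 (not from the paper): with b = (I - W) v,
# the nonnegative vector v satisfies v = [b + W v]^+ and is thus a
# steady state of equation 2.1.
def fixed_point_input(W, v):
    """Return b = (I - W) v."""
    n = len(v)
    return [v[i] - sum(W[i][j] * v[j] for j in range(n)) for i in range(n)]

W = [[0.0, -1.0], [-1.0, 0.0]]
v = [0.5, 1.5]                      # arbitrary nonnegative target state
b = fixed_point_input(W, v)         # here b = (2.0, 2.0)
rhs = [max(b[i] + sum(W[i][j] * v[j] for j in range(2)), 0.0)
       for i in range(2)]
assert rhs == v                     # v is indeed a steady state
```

Whether this steady state is also stable depends on the eigenvalues of the relevant submatrix of $I - W$, which is the subject of the next section.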
4 Forbidden and Permitted Sets

In this section, we present a geometrical description of the set of attractive fixed points by introducing the notion of forbidden and permitted sets of neurons. Recall first the classical definition of stability in the sense of Lyapunov.

Definition 3. A steady state $x$ is stable if for all balls $B$ with $x \in B$, there exists a ball $A$ with $x \in A$, such that for any initial condition $x(0) \in A$, we have that $x(t) \in B$ for all times $t \ge 0$. A steady state is asymptotically stable if it is stable and there is a ball of initial conditions that converge to it.
Definition 4. A neuron is said to be activated if its activity is nonzero.
Definition 5. A set of neurons is permitted if the neurons can be coactivated at a stable steady state for some input $b$. A set of neurons is forbidden if it is not permitted, that is, if the neurons cannot be coactivated at a stable steady state no matter what the input $b$.

According to these definitions, permitted and forbidden sets are network characteristics that are independent of the input. Their significance can best be appreciated by comparison to linear systems. In a linear network, any set of neurons is permitted if and only if all eigenvalues of $I - W$ are nonnegative (the linear system is stable). This means that either all sets of neurons are permitted or none. For a threshold-linear network, this is not the case. It is possible that some sets are permitted, whereas others are forbidden. Whether a particular set is permitted depends on the eigenvalues of $(I - W)^\Sigma$. For example, that a given set $\Sigma$ is forbidden means that $(I - W)^\Sigma$ must have at least one negative eigenvalue. And by copositivity of $I - W$, we know that the corresponding eigenvector cannot be nonnegative and so must have both positive and negative components.

Definition 6. A threshold-linear network is said to be conditionally multiattractive if there exists an input $b$ such that the set of stable equilibrium points is disconnected.

Multiattractiveness is a statement about the topology of stable fixed points. The word conditional refers to the fact that multiattractiveness exists only for certain inputs, but not for others. For example, consider a network and a particular input for which all stable fixed points lie on a line. The network is not multiattractive for this input, because from any point of the line, any other point of the line can be reached by staying on the line only (i.e., a line is connected). However, it may well be that for another input, the network has two stable fixed points that are separate from each other, which would characterize the network as conditionally multiattractive.
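A minimal instance of conditional multiattractiveness can be simulated directly. In this sketch (our illustration, not an example from the paper), $W = (0, -2; -2, 0)$ makes $I - W = (1, 2; 2, 1)$ copositive but with eigenvalue $-1$ along the mixed-sign direction $(1, -1)$, so the two-neuron set is forbidden; with input $b = (1, 1)$, two initial conditions retrieve two disconnected attractors:

```python
# Illustrative demonstration (not from the paper): same W and b, two
# initial conditions, two disconnected attractors.
def simulate(W, b, x, dt=0.01, steps=20000):
    n = len(b)
    for _ in range(steps):
        x = [x[i] + dt * (-x[i] + max(b[i] + sum(W[i][j] * x[j]
             for j in range(n)), 0.0)) for i in range(n)]
    return x

W = [[0.0, -2.0], [-2.0, 0.0]]   # strong mutual inhibition
b = [1.0, 1.0]                   # identical inputs to both neurons
xa = simulate(W, b, [0.9, 0.1])  # neuron 1 ahead -> settles near (1, 0)
xb = simulate(W, b, [0.1, 0.9])  # neuron 2 ahead -> settles near (0, 1)
# The forbidden set {1, 2} is never active at a stable steady state;
# only the permitted single-neuron sets are retrieved.
```

For a weaker inhibition such as $W = (0, -1; -1, 0)$, the same input instead yields a connected line of fixed points, so the disconnectedness here really does hinge on the forbidden set.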
To prove our theorems about conditional multiattractiveness and forbidden sets, we will need the following lemma about the minimization of convex functions over convex sets (Bertsekas, 1995).

Definition 7. A set $C$ is convex if $\alpha x + (1 - \alpha) y \in C$ for all $x, y \in C$ and $0 \le \alpha \le 1$ (if any line between two points remains in $C$).

Definition 8. A scalar function $f$ is convex if $f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y)$ for all $x, y \in C$ and $0 \le \alpha \le 1$ (if any line between two points does not pass below the function $f$).
Lemma 2. (a) If $C$ is a convex set (such as the nonnegative orthant) and $f$ a convex function on $C$, then a local minimum of $f$ is also a global minimum. (b) The set of global minima of $f$ is convex (and thus connected).

Proof. (a) Let $x$ be a local minimum of $f$ but not a global minimum. Then there exists a $y$ such that $f(y) < f(x)$. From the convexity of $f$, we have that $f(\alpha x + (1 - \alpha) y) \le \alpha f(x) + (1 - \alpha) f(y) < f(x)$ for every $\alpha \in [0, 1)$. This contradicts the assumption that $x$ is a local minimum. (b) Assume there exist two global minima $x, y \in C$ and $\alpha \in [0, 1]$ such that $z = \alpha x + (1 - \alpha) y$ is not a global minimum, $f(z) > f(x) = f(y)$. By convexity of $f$, we have that $f(z) \le \alpha f(x) + (1 - \alpha) f(y) = f(x)$, contradicting the assumption that $z$ is not a global minimum.

Definition 9. A symmetric $N \times N$ matrix $A$ is positive semidefinite if for all vectors $x$ we have that $x^T A x \ge 0$.

If a matrix is positive semidefinite, then all its eigenvalues are nonnegative. We will use the fact that if $I - W$ is a positive semidefinite matrix, then for all inputs $b$, the Lyapunov-like energy function $L$ is convex (this follows from the fact that $L$ is quadratic). We will also need the following interlacing theorem for eigenvalues of symmetric matrices.

Lemma 3. Let $A$ be a symmetric $N \times N$ matrix with real eigenvalues $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_N$, and let $A'$ be an $(N - 1) \times (N - 1)$ submatrix of $A$. Then the eigenvalues $\eta_1 \le \cdots \le \eta_{N-1}$ of $A'$ interlace the eigenvalues of $A$: $\lambda_1 \le \eta_1 \le \lambda_2 \le \cdots \le \eta_{N-1} \le \lambda_N$.

Proof. The proof is given in Horn and Johnson (1985).

As a corollary of lemma 3, if $A$ has $k$ negative eigenvalues, then $A'$ must have at least $k - 1$ negative eigenvalues. A symmetric threshold-linear network has a unique stable steady state if and only if $I - W$ is positive definite (all eigenvalues are positive; Feng & Hadeler, 1996). As an extension, the following theorem contains necessary and sufficient conditions for symmetric threshold-linear networks to be conditionally multiattractive.

Theorem 2.
Assume that the matrix $I - W$ is symmetric and copositive. The following statements are equivalent:

1. The matrix $I - W$ is not positive semidefinite.

2. There exists a forbidden set.
3. The network is conditionally multiattractive.
Proof.

• Statement 2 follows from statement 1. $I - W$ is not positive semidefinite, and so $I - W$ has a negative eigenvalue. Consider a steady state with all neurons active. Because of this negative eigenvalue, the steady state cannot be stable, and so the set of all neurons is forbidden.

• Statement 1 follows from statement 2. Let $\Sigma$ be the forbidden set. The smallest eigenvalue of $(I - W)^\Sigma$ must be negative. By the interlacing theorem, the smallest eigenvalue of $I - W$ must be negative as well, and so $I - W$ is not positive semidefinite.
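The interlacing theorem just invoked can be illustrated in the smallest nontrivial case. For a symmetric $2 \times 2$ matrix $(a, b; b, c)$, the eigenvalues are $(a + c)/2 \pm \sqrt{((a - c)/2)^2 + b^2}$, and each diagonal element (a $1 \times 1$ principal submatrix) lies between them. A sketch (our illustration, not from the paper):

```python
import math

# Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]] in closed form.
def eig2(a, b, c):
    mean = 0.5 * (a + c)
    r = math.sqrt((0.5 * (a - c)) ** 2 + b ** 2)
    return mean - r, mean + r

# Cauchy interlacing (lemma 3): every 1x1 principal submatrix, i.e.,
# each diagonal element, lies between the two eigenvalues.
a, b, c = 1.0, 2.0, 1.0          # I - W for the mutual-inhibition pair
lo, hi = eig2(a, b, c)           # here (-1.0, 3.0)
assert lo <= a <= hi and lo <= c <= hi
# As in the proof above: the full matrix has one negative eigenvalue
# (the pair is forbidden), while each single-neuron submatrix
# (a = c = 1 > 0) has none, so the single-neuron sets are permitted.
```

The corollary of lemma 3 follows the same pattern: deleting one row and column can lose at most one negative eigenvalue.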
• Statement 1 follows from statement 3. Suppose that statement 1 is false. Choose an input $b$, and let $X$ be a nonconnected set of attractive fixed points of the network. By Lyapunov stability, $X$ corresponds to the set of local minima of $L$. Because $I - W$ is positive semidefinite, $L$ is a convex function. By lemma 2, the local minima of $L$ in the nonnegative orthant are connected, contradicting our initial assumption that $X$ is nonconnected.

• Statement 3 follows from statement 1. The plan is to construct an input for which the network is multiattractive. $I - W$ is not positive semidefinite, and so must have at least one negative eigenvalue. If $I - W$ has more than one negative eigenvalue, then define a sequence of subsets $\Sigma_N \supset \Sigma_{N-1} \supset \cdots \supset \Sigma_k = \Sigma$ by removing neurons one by one until only one of the $k$ eigenvalues of $(I - W)^\Sigma$ is negative. By the corollary of the interlacing theorem, such a set $\Sigma$ can always be found. The forbidden set $\Sigma$ contains at least $k \ge 2$ neurons, because by copositivity, any single-neuron set is permitted. Denote the negative eigenvalue of $(I - W)^\Sigma$ by $\lambda$, and let $v$ be the corresponding eigenvector. Without loss of generality, assume that $v = (v^p, v^n)$ has $q$ nonnegative and $k - q$ negative components, $v^p_i = v_i \ge 0$ for $i = 1, \ldots, q$ and $v^n_i = v_{q+i} < 0$ for $i = 1, \ldots, k - q$. Define a nonnegative vector $y = \left( \frac{v^p}{\|v^p\|^2}, -\frac{v^n}{\|v^n\|^2} \right)$ that is orthogonal to $v$, and define the input to the network as $b_i = y_i$ for $i \in \Sigma$ and $b_i = -K \ll 0$ for $i \notin \Sigma$. The value $K$ has to be sufficiently large to prevent neurons not belonging to $\Sigma$ from being activated during the dynamics (how to construct $K$ is shown after the next paragraph). Define a hyperplane $H$ passing through $b$, spanned by the $k - 1$ stable eigenvectors orthogonal to $v$. The hyperplane divides $N_\Sigma$ (the nonnegative orthant restricted to $\Sigma$) into two disjoint regions. The function $L$ restricted to this hyperplane is convex, so by lemma 2, it has a connected set of local minima that are also global minima.
Choose an arbitrary minimum $z = \arg\min_{x \in H} L(x)$, and define two initial conditions of the dynamics, both contained in $N_\Sigma$ but on opposite sides of the hyperplane: $x^1_i = z_i + \epsilon v_i$, $x^2_i = z_i - \epsilon v_i$ for $i \in \Sigma$, and $x^1_i = 0$, $x^2_i = 0$ for $i \notin \Sigma$, with $\epsilon \ll 1$. Because $L(x^1) = L(x^2) = L(z) + \frac{\lambda}{2} \epsilon^2 < L(z)$ and because $L$ is nonincreasing along
trajectories, the two trajectories will not cross the hyperplane, but will converge toward two sets of attractive fixed points $X_1$ and $X_2$ on either side of the hyperplane.

How to choose $K$: Define a region $C \subset N_\Sigma$ by $C = \{x \ge 0 \mid x \in N_\Sigma, L(x) < L(z)\}$. Because $L$ is radially unbounded, $C$ is contained in a ball of radius $R$ centered at the origin. By choosing $K = R \|W\| + 1$ (where $\|\cdot\|$ denotes an arbitrary matrix norm), we guarantee that $\sum_j W_{ij} x_j < K$ for all $i \notin \Sigma$ and all $x \in C$, and thus neurons that are initially inactive remain inactive for all times.

It remains to be shown that $X_1$ and $X_2$ are sets of fixed points that are not connected, neither by a curve confined within $N_\Sigma$ nor by a curve extending beyond $N_\Sigma$. First, $X_1$ and $X_2$ are not connected by a curve $\Gamma_1$ within $N_\Sigma$, because this curve would have to cross the hyperplane, implying that there exists a point $u \in H$ for which $L(u) < L(x^1) < L(z)$, contradicting the assumption that $z$ is a global minimum of $L$ restricted to $H$. Second, $X_1$ and $X_2$ are not connected by a curve $\Gamma_2$ extending beyond $N_\Sigma$, because the strongly inhibitory inputs prevent the existence of fixed points in a small neighborhood around $N_\Sigma$. By contradiction, assume there exists a fixed point $q \notin N_\Sigma$ on a curve $c$ connecting $X_1$ and $X_2$ and satisfying $\|q - p\| < \frac{1}{\|W\|}$, where $p \in c$ and $p \in N_\Sigma$. Because $q \notin N_\Sigma$, at least one component $i \notin \Sigma$ of the steady state $q$ must be positive, $q_i > 0$. Thus, the total input to neuron $i$ must be positive: $0 < \sum_j W_{ij} q_j - K = \sum_j W_{ij} (q_j - p_j + p_j) - K \le \|W\| \|q - p\| + \sum_j W_{ij} p_j - K < 1 + \|W\| R - K = 0$, a contradiction. In conclusion, we have constructed an input $b$ for which there exist two nonconnected sets $X_1$ and $X_2$ of attractive fixed points (see Figure 1).

How does our new concept of multiattractiveness compare to the older concept of multistability?
Given our definition of multiattractiveness, multistability is a subtly more general concept: whereas a single line attractor is multistable (it has multiple stable fixed points), it is not multiattractive. On the other hand, any multiattractive system is necessarily also multistable, implying that the difference between multistability and multiattractiveness is that the latter excludes single line (and surface) attractors. Previous results about stability of rectified linear networks were concerned with either necessary and sufficient conditions for monostability (Feng & Hadeler, 1996; Hadeler & Kuhn, 1987) or sufficient conditions for multistability (Hahnloser, 1998; Wersing, Beyn, & Ritter, 2001). As an extension to these results, theorem 2 contains a sufficient and necessary condition for multiattractiveness: the existence of a forbidden set. To give an intuitive understanding of the equivalence of conditional multiattractiveness and forbidden sets, let us consider a one-dimensional system of the form $\dot{x} = -f(x)$. If such a system is to have two stable fixed points, then the function $F(x) = \int^x f(y)\, dy$ must have two separate local minima. And by the smoothness of $f$, it must be that $F$ has a local maximum between the two local minima; that is, the system must have an instability. A similar
Figure 1: Proof of theorem 2. The nonnegative orthant $N_\Sigma$ is two-dimensional and is drawn in the horizontal plane. The hyperplane $H$ is a line dividing $N_\Sigma$ into two disjoint regions (shaded areas). Because $\Sigma$ is a forbidden set, the two sets of stable fixed points $X_1$ and $X_2$ must lie on the boundary of $N_\Sigma$. They are not connected, neither by a trajectory $\Gamma_1$ (straight line) passing inside $N_\Sigma$ and intersecting $H$ in $u$, nor by a trajectory $\Gamma_2$ passing outside $N_\Sigma$ (along axis $i$), passing by points $p \in N_\Sigma$ and $q \notin N_\Sigma$.
principle also exists for two-dimensional systems, in which case $f$ is a two-dimensional vector field. The index theorem says that the contour integral of $f$ along a closed curve $C$ equals the sum of indices of the isolated fixed points inside $C$ (Strogatz, 1994). Maxima and minima have an index of 1, and saddles have an index of $-1$. Hence, if $f$ is radially unbounded, then the index of a contour integral along $C$ equals 1. And if $C$ is to contain two minima, then it must also contain a saddle: $1 = 2 - 1$. Hence, two-dimensional systems are multistable only if they contain an instability. Unfortunately, there is no similarly general result for multistability in dynamical systems of more than two dimensions. But we show here that for rectified linear networks, the intuitive generalization of the index theorem (the equivalence of multistability and instability) is true in arbitrary dimensions.

Using theorem 2 and previous results, we can summarize all possible geometries of stable steady states in symmetric networks:

• $I - W$ is positive definite: There exists a single globally attractive fixed point.

• $I - W$ is copositive and positive semidefinite but not positive definite: Multiple steady states can exist in the form of a line (or surface) attractor. For example, the two-neuron network given by $W = (0, -1; -1, 0)$
results in eigenvalues 0 and 2 of $I - W$, and so the two-neuron set is permitted. Let $b = (1; 1)$; it follows that there is a line of attractive fixed points given by $x_1 + x_2 = 1$ and $x_1 \ge 0$, $x_2 \ge 0$. The fixed points are connected, and so the network is not multiattractive for this input. By theorem 2, the network is not conditionally multiattractive. Recall that no surface attractor of the integrator type is possible for copositive networks.

• $I - W$ is copositive but not positive semidefinite: There may exist multiple, nonconnected sets of attractive fixed points, dependent on the input. These might either be separate points or nonintersecting line or surface attractors.

The following theorem characterizes the relationship between forbidden and permitted sets.

Theorem 3. Assume that the matrix $I - W$ is symmetric and copositive. Any subset of a permitted set is permitted. Any superset of a forbidden set is forbidden.

Proof. Use the interlacing theorem. If the smallest eigenvalue of a symmetric matrix is nonnegative, then so are the smallest eigenvalues of all its submatrices. And if the smallest eigenvalue of a symmetric submatrix is negative, then so is the smallest eigenvalue of the original matrix.

The hierarchical grouping of permitted sets holds only for symmetric networks and may be violated in nonsymmetric networks. Consider the two-neuron network defined by $W = (2, -1; 4, -2)$. The matrix $I - W$ has the unique eigenvalue 1, and so the two neurons form a permitted set. However, the first diagonal element of $I - W$ is $-1$, and so the first neuron, as a subset of a permitted set, forms a forbidden set, in disagreement with theorem 3. What happens in this network is that due to the strong self-excitation, the activity of the first neuron diverges until it necessarily activates the second neuron, the inhibition of which has a stabilizing effect.
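The nonsymmetric counterexample can be simulated directly. With $W = (2, -1; 4, -2)$ and, for instance, input $b = (1, 0)$ (our choice; the paper does not specify an input), $(I - W) x = b$ gives the fixed point $x = (3, 4)$, at which the linearization $W - I$ has the double eigenvalue $-1$, so the pair is stably coactivated even though neuron 1 alone is forbidden:

```python
# Illustrative simulation (the input b is our choice, not the paper's)
# of the nonsymmetric network W = (2, -1; 4, -2). Neuron 1 alone is
# forbidden (self-excitation 2 > 1), but the pair is permitted:
# with b = (1, 0), (I - W) x = b has the solution x = (3, 4).
def simulate(W, b, x, dt=0.01, steps=30000):
    n = len(b)
    for _ in range(steps):
        x = [x[i] + dt * (-x[i] + max(b[i] + sum(W[i][j] * x[j]
             for j in range(n)), 0.0)) for i in range(n)]
    return x

W = [[2.0, -1.0], [4.0, -2.0]]
b = [1.0, 0.0]
x = simulate(W, b, [0.1, 0.0])
# The runaway growth of neuron 1 recruits neuron 2, whose inhibition
# stabilizes the pair near (3, 4).
```

The simulation traces exactly the mechanism described above: unchecked self-excitation is rescued by the feedback inhibition of the recruited neuron.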
5 An Example: The Ring Network

Symmetric threshold-linear networks with local excitation and more global inhibition have been studied in the past as models for cortical processing (Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995; Douglas, Koch, Mahowald, Martin, & Suarez, 1995). Here we argue that invariances of neural responses to sensory stimulation in these models arise from the existence of forbidden sets. We also show that many of these networks exhibit spurious permitted sets, consisting of sets of noncontiguous neurons. Let the synaptic matrix of a 10-neuron ring network be translationally invariant. The connection between neurons $i$ and $j$ is given by $W_{ij} = -\beta + \alpha_0 \delta_{ij} + \alpha_1 (\delta_{i,j+1} + \delta_{i+1,j}) + \alpha_2 (\delta_{i,j+2} + \delta_{i+2,j})$, where $\beta$ quantifies global inhibition, $\alpha_0$ self-excitation, $\alpha_1$ first-neighbor lateral excitation, and $\alpha_2$ second-neighbor lateral excitation.

Figure 2: (Left) Output of a ring network of 10 neurons to uniform input (random initial condition). (Right) The 9 parent permitted sets (x-axis: neuron number; y-axis: set number). White means that a neuron belongs to a set, and black means that it does not. Left-right and translation-symmetric parent permitted sets of the ones shown have been excluded (they can be obtained by translation of the sets shown). The first parent permitted set (first row from the bottom) corresponds to the output on the left.

In Figure 2, we numerically computed the permitted sets of this network, with the parameters taken from Hahnloser et al. (2000), that is, $\alpha_0 = 0$, $\alpha_1 = 1.1$, $\alpha_2 = 1$, and $\beta = 0.55$. The permitted sets were determined by diagonalizing the $2^{10}$ submatrices of $I - W$ and classifying the eigenvalues. In agreement with theorem 1, we find that all positive eigenvectors have positive eigenvalues, and so the network has a set of globally attractive fixed points. (For a sufficient but not necessary criterion for stability of a similar ring network, see Wersing et al., 2001.) We classified sets of neurons $\Sigma$ into permitted and forbidden sets according to the eigenvalues of $(I - W)^\Sigma$. In Figure 2, we depict the parent permitted sets (sets that have no permitted supersets). Consistent with the finding that such ring networks can explain contrast-invariant tuning of V1 cells (Ben-Yishai et al., 1995) and multiplicative response modulation of parietal cells (Salinas & Abbott, 1996), we find that there are no permitted sets that consist of more than five contiguous active neurons. For example, as a model for parietal cells, any combination of positive visual and positive eye-position inputs retrieves a permitted set consisting of a group of contiguous neurons. Because the resulting activity pattern in the network is localized, the tuning widths of visual responses are limited, making the combined eye-position and visual responses look invariant in shape. It can be shown that in order to retrieve one of the spurious permitted sets in Figure 2, some inputs must be negative, a rather irrelevant case, because cortical inputs are usually excitatory.
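The full permitted-set computation requires diagonalizing all $2^{10}$ submatrices, but the forbidden full set already shows up in a plain simulation. The sketch below is our illustration; a tiny cosine bias (our addition) replaces the paper's random initial condition so that the outcome is deterministic:

```python
import math

# Illustrative simulation of the 10-neuron ring with the parameters of
# Hahnloser et al. (2000): alpha0 = 0, alpha1 = 1.1, alpha2 = 1,
# beta = 0.55. A near-uniform input is applied; the small cosine bias
# (our addition, not in the paper) pins the location of the bump.
N, beta, a0, a1, a2 = 10, 0.55, 0.0, 1.1, 1.0
W = [[-beta
      + (a0 if i == j else 0.0)
      + (a1 if (i - j) % N in (1, N - 1) else 0.0)
      + (a2 if (i - j) % N in (2, N - 2) else 0.0)
      for j in range(N)] for i in range(N)]
b = [1.0 + 0.01 * math.cos(2 * math.pi * i / N) for i in range(N)]

x = [0.1] * N
dt = 0.01
for _ in range(30000):
    x = [x[i] + dt * (-x[i] + max(b[i] + sum(W[i][j] * x[j]
         for j in range(N)), 0.0)) for i in range(N)]

active = [i for i in range(N) if x[i] > 1e-6]
# The set of all 10 neurons is forbidden, so the steady state cannot
# keep every neuron active: the activity relaxes to a localized bump.
```

The same code with the bias removed and a random initial condition reproduces the qualitative behavior of Figure 2 (left): the winning bump location is then selected by the noise.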
6 Discussion

In earlier network models of associative memory (Hopfield, 1982; Hinton & Sejnowski, 1986), computational inputs were encoded as initial conditions to the neural dynamics and memories as attractive fixed points. After applying an input in these models, the dynamics converge to a steady state, which represents the retrieved memory in response to that input. At the steady state, all information about the input is lost. In threshold-linear networks, the situation is different; the attractive fixed points are input dependent, meaning that a small change in the input leads to a small change in the location of the retrieved steady state. Thus, in these networks, memories cannot be represented by steady states. We have shown that memories in threshold-linear networks can be defined in terms of permitted sets of neurons, that is, sets of neurons that can be stably coactivated at a steady state. This offers a new concept of pattern memorization that fulfills the criterion of input independence also for the case of threshold-linear networks. Although memories are input independent, the retrieval of a memory is strongly constrained by the input. A typical input will not allow for the retrieval of arbitrary stored permitted sets. This comes from the fact that the existence of nonconnected sets of stable fixed points depends not just on whether there exists a forbidden set but also on the input (see theorem 2). Generally, multiattractiveness in the ring network is possible only when more than a single neuron is excited. Notice that threshold-linear networks can behave as traditional attractor networks when the inputs are represented as initial conditions of the dynamics. For example, by fixing $b = 1$ and initializing a copositive network with some input, the permitted sets unequivocally determine the stable fixed points. Thus, in this case, the notion of permitted sets is no different from fixed-point attractors.
However, the hierarchical grouping of permitted sets (see theorem 3) becomes irrelevant, since there can be only one attractive fixed point per hierarchical group defined by a parent permitted set. We would like to discuss the limits within which we believe the network dynamics studied here are approximations of real biological neural networks. The fact that threshold-linear neurons are nonsaturating is a simplification of real neurons that loses its validity when interspike intervals are close to the refractory time of action potentials or when postsynaptic receptors become saturated. Unfortunately, for saturating nonlinearities, we are unable to derive any formal and input-independent results about the computations performed by a network. The reason is that the effect of a saturating nonlinearity is to induce an activity-dependent change in the slopes of transfer functions. As a result, depending on the network architecture, some eigenvalues may decrease, whereas others may increase as firing rates change, making stability criteria subtly activity dependent and hard to characterize. The symmetry of connections is a simplification that
is valid on average when excitatory neurons are recurrently connected (as is common, e.g., in cortex; Douglas et al., 1995) and when inhibitory neurons are nonspecific and fast acting (in which case, inhibitory firing rates can be expressed in terms of excitatory firing rates, the inhibitory neurons can be eliminated from the network, and one ends up with symmetric connections). Last but not least, the notion of forbidden sets does not have to be incompatible with the nonzero spontaneous (background) firing rates of cortical neurons. We would like to argue that the spontaneous background firing of cortical neurons may involve the rapid switching of permitted sets, so that on a larger timescale, all neurons seem to be simultaneously active, whereas in reality they are not. The switching of permitted sets may be driven, for example, by the dynamics of long-range cortico-cortical inputs, which reflect the behavioral state of an animal and its expectations about the sensory environment; switching may also involve the intrinsic noisiness of synapses (such as their release probability). The fact that permitted sets never have forbidden subsets implies a hierarchical organization of symmetric networks. As a model for perception, if neurons are feature detectors, then different parent permitted sets may represent different objects. Certain combinations of features will not be perceived as a recognizable object, because the features are encoded by neurons forming a forbidden set. If not all but just a subset of the features of an object is present, such as may occur during occlusion, for example, then a subset of the neurons encoding the object may still be coactivated, because they still form a permitted set. But how would a network encode, for example, the letter P as a separate object from the letter R and not as a subset thereof? Are symmetric networks not a suitable medium for encoding such object pairs?
We do not believe so, because cortical feature detectors have receptive fields comprising both excitatory and inhibitory subfields. In this way, they can signal the absence of a feature as well as its presence. For example, a neuron that responds to a dark vertical bar surrounded by white flanks will respond to the letter P but not to the letter R, because the slanted bar in R excites the inhibitory subfield, causing the neuron to be silent. It is thus highly plausible that the encoding of the letter P is based on the activation of sensory neurons that will not be activated for the letter R, and so the parent permitted sets of P and R will not be subsets of each other. The representational strategy of excitatory and inhibitory receptive subfields makes it possible in principle to store just about any collection of objects as an antichain of subsets, that is, as a collection of mutually exclusive sets. For a simple procedure on how to store a given family of possibly overlapping patterns as permitted sets, see Xiaohui, Hahnloser, and Seung (2002).

Appendix: The Generalized Lotka-Volterra Equations

In many previous studies, the following generalized Lotka-Volterra equations were used to model neuron dynamics, instead of the rectified linear
636
R. Hahnloser, H. Seung, and J. Slotine
equation 2.1 (Fukai & Tanaka, 1997; Rabinovich et al., 2001):

\frac{dx_i}{dt} = x_i \left( -x_i + b_i + \sum_j W_{ij} x_j \right). (A.1)

The generalized Lotka-Volterra equation A.1 has a similar property to the rectified linear equation, which is that once initialized to nonnegative values, neural activities x_i remain nonnegative for all future times. The reason for nonnegativity is that the nonnegative orthant cannot be left, because for all inputs b, we have \dot{x}_i = 0 whenever x_i = 0. Here we show that given the same connection matrix W and input b, the rectified linear dynamics of equation 2.1 and the generalized Lotka-Volterra dynamics of equation A.1 have the same set of stable fixed points. This correspondence of attractors will follow from theorem 4, establishing that the quadratic function L that was a Lyapunov-like function for equation 2.1 is also a Lyapunov-like function for equation A.1.

Theorem 4. If W is symmetric and I - W copositive, then for all constant b and initial conditions, the generalized Lotka-Volterra equation A.1 converges to an equilibrium. Furthermore, for all b, the set of attractive fixed points is identical to that of the threshold-linear equation 2.1.

Proof. Because I - W is copositive, L = \frac{1}{2} x^T (I - W) x - b^T x is bounded below. The dynamics of equation A.1 can be written as \dot{x}_i = -x_i \, \partial L / \partial x_i. From this, it follows that L is nonincreasing along trajectories:

\dot{L} = \sum_i \frac{\partial L}{\partial x_i} \dot{x}_i = -\sum_i x_i \left( \frac{\partial L}{\partial x_i} \right)^2 \le 0.

Fixed points of equation A.1 correspond to stationary points of L and vice versa: from \dot{x}_i = 0, it follows that either x_i = 0 or \partial L / \partial x_i = 0, and so \dot{L} = 0. Similarly, from \dot{L} = 0, it follows that \dot{x}_i = 0. Hence, for all b, the dynamics of equation A.1 converge to a steady state. Furthermore, because L is a Lyapunov-like function for both equations 2.1 and A.1, it follows that their sets of attractive fixed points must be identical. The two networks in equations 2.1 and A.1 have the same Lyapunov-like function.
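Theorem 4 is easy to illustrate numerically. The sketch below (assumed example values, not code from the papers) integrates both the rectified-linear dynamics of equation 2.1 and the Lotka-Volterra dynamics of equation A.1 by forward Euler for a small symmetric W with I - W copositive; both settle at the same fixed point.

```python
# Numerical check of theorem 4 (a sketch, not the authors' code):
# the rectified-linear network (2.1) and the generalized Lotka-Volterra
# network (A.1) share the same attractive fixed points when W is
# symmetric and I - W is copositive. W and b below are made-up values.

def relu(v):
    return [max(0.0, s) for s in v]

def simulate(deriv, x0, dt=0.01, steps=20000):
    x = list(x0)
    for _ in range(steps):
        dx = deriv(x)
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return x

W = [[0.0, -0.5], [-0.5, 0.0]]   # symmetric lateral inhibition
b = [1.0, 0.6]

def total_input(x):              # b + W x
    return [b[i] + sum(W[i][j] * x[j] for j in range(2)) for i in range(2)]

def rectified(x):                # equation 2.1: dx/dt = -x + [b + W x]^+
    u = relu(total_input(x))
    return [u[i] - x[i] for i in range(2)]

def lotka_volterra(x):           # equation A.1: dx/dt = x * (-x + b + W x)
    u = total_input(x)
    return [x[i] * (-x[i] + u[i]) for i in range(2)]

fp_rect = simulate(rectified, [0.2, 0.2])
fp_lv = simulate(lotka_volterra, [0.2, 0.2])
```

For these values the interior fixed point solves x = b + Wx, giving (14/15, 2/15); both integrations reach it from the same positive initial condition.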
It follows that theorems 2 and 3 about conditional multiattractiveness and the grouping of permitted sets also apply to the generalized Lotka-Volterra equations, making these networks practically identical. The only differences between them are their basins of attraction, their speed of convergence, and the pathological unstable fixed points x_i = 0 of the generalized Lotka-Volterra equation (recall that the origin is always a fixed point of equation A.1 because it cannot be left by changing b). In practical applications, however, the origin is not a worrisome fixed point. An approximation to a threshold-linear network was recently implemented in terms of a silicon circuit using analog very large-scale integration (aVLSI) technology (Hahnloser et al., 2000). There, neural activities are represented by electrical currents. The equations describing the current dynamics were shown to be the generalized Lotka-Volterra equation A.1. The origin is not a pathological fixed point in this physical system because thermal noise causes the dynamics to leave the repulsive origin for positive b.

Acknowledgments

We thank Misha Tsodyks for his critical reading of the manuscript.

References

Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Bertsekas, D. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Cohen, M., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man and Cybernetics, 13, 288–307.
Douglas, R., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Ermentrout, B. (1994). Reduction of conductance-based models with slow synapses to neural nets. Neural Computation, 6, 679–695.
Feng, J., & Hadeler, K. (1996). Qualitative behaviour of some simple networks. J. Phys. A, 29, 5019–5033.
Fukai, T., & Tanaka, S. (1997). A simple neural network exhibiting selective activation of neuronal ensembles: From winner-take-all to winners-share-all. Neural Computation, 9(1), 77–97.
Hadeler, K., & Kuhn, D. (1987). Stationary states of the Hartline-Ratliff model. Biological Cybernetics, 56, 411–417.
Hahnloser, R. H. (1998). About the piecewise analysis of networks of linear threshold neurons. Neural Networks, 11, 691–697.
Hahnloser, R., Sarpeshkar, R., Mahowald, M., Douglas, R., & Seung, H. (2000). Digital selection and analog amplification coexist in a cortex-inspired silicon circuit. Nature, 405, 947–951.
Hahnloser, R. H., & Seung, H. (2001). Permitted and forbidden sets in symmetric threshold-linear networks. In Proceedings of NIPS 2001—Neural Information Processing Systems: Natural and Synthetic (Vol. 13, pp. 217–223). Cambridge, MA: MIT Press.
Hartline, H. K., & Ratliff, F. (1958). Spatial summation of inhibitory influence in the eye of Limulus and the mutual interaction of receptor units. J. Gen. Physiol., 41, 1049–1066.
Hinton, G., & Sejnowski, T. (1986). Learning and relearning in Boltzmann machines. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing (pp. 448–453). Cambridge, MA: MIT Press.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79(8), 2554–2558.
Hopfield, J. J. (1984). Neurons with graded response have collective properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Horn, R., & Johnson, C. (1985). Matrix analysis. Cambridge: Cambridge University Press.
Rabinovich, M., Volkovskii, A., Lecanda, P., Huerta, R., Abarbanel, H., & Laurent, G. (2001). Dynamical encoding by networks of competing neuron groups: Winnerless competition. Physical Review Letters, 87(6), 068102 (1–4).
Salinas, E., & Abbott, L. (1996). A model of multiplicative neural responses in parietal cortex. Proc. Natl. Acad. Sci. USA, 93, 11956–11961.
Seung, H. S. (1996). How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93, 13339–13344.
Strogatz, S. (1994). Nonlinear dynamics and chaos. Reading, MA: Addison-Wesley.
Wersing, H., Beyn, W.-J., & Ritter, H. (2001). Dynamical stability conditions for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 13(8), 1811–1825.
Xiaohui, X., Hahnloser, R. H., & Seung, H. S. (2002). Selectively grouping neurons in recurrent networks of lateral inhibition. Neural Computation, 14(11), 2627–2646.

Received April 29, 2002; accepted August 2, 2002.
LETTER
Communicated by Richard Hahnloser
Multistability Analysis for Recurrent Neural Networks with Unsaturating Piecewise Linear Transfer Functions

Zhang Yi
[email protected] College of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, People’s Republic of China, and Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576
K. K. Tan
[email protected] T. H. Lee
[email protected] Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576
Multistability is a property necessary in neural networks to enable certain applications (e.g., decision making), where monostable networks can be computationally restrictive. This article focuses on the analysis of multistability for a class of recurrent neural networks with unsaturating piecewise linear transfer functions. It deals fully with the three basic properties of a multistable network: boundedness, global attractivity, and complete convergence. This article makes the following contributions: conditions based on local inhibition are derived that guarantee boundedness of some multistable networks, conditions are established for global attractivity, bounds on global attractive sets are obtained, complete convergence conditions for the network are developed using novel energy-like functions, and simulation examples are employed to illustrate the theory thus developed.

1 Introduction

The dynamical properties of recurrent neural networks play a primary role in their applications. The dynamical analysis of recurrent neural networks, especially convergence analysis, has attracted extensive interest and attention. So far, many of the reported conditions for the convergence of neural networks center on monostability. However, monostable networks are computationally restrictive; they cannot deal with important neural computations (Hahnloser, 1998), such as those necessary in decision making, where multistable networks become necessary. Recently, there has been increasing interest in multistability analysis for
Neural Computation 15, 639–662 (2003)
© 2003 Massachusetts Institute of Technology
640
Z. Yi, K. Tan, and T. Lee
neural networks (see Hahnloser, 1998; Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000; Wersing, Beyn, & Ritter, 2001). Multistable networks are significantly different from monostable ones, and their analysis also merits different treatment. Monostable networks are restricted to one equilibrium point, whereas a multistable network can have multiple equilibrium points. Conditions for monostability are usually not applicable to multistable networks. In our view, there are three basic properties of a multistable network. The first property is boundedness. It is interesting to derive conditions that not only guarantee boundedness but also allow networks to possess more than one equilibrium point. The second property is that of attractivity. It is necessary to establish whether there exist compact sets into which all the trajectories of a network will be attracted and remain bounded indefinitely. Clearly, all the equilibrium points and limit cycles (if any) must be located in the attracting compact sets. The third property relates to complete convergence. It is necessary to know if every trajectory of a network converges to the equilibrium set. In the dynamical analysis of recurrent neural networks, the activation functions are important factors that affect the dynamics. Various activation functions have been used for neural networks. Recently, the studies reported in Ben-Yishai, Lev Bar-Or, and Sompolinsky (1995), Douglas, Koch, Mahowald, Martin, and Suarez (1995), Hahnloser (1998), Hahnloser et al. (2000), Wersing, Beyn, and Ritter (2001), and Wersing, Steil, and Ritter (2001) focused on a class of recurrent neural networks with unsaturating linear threshold activation functions (LT networks). This class of neural networks has potential in many applications. Hahnloser et al. (2000) demonstrated an efficient silicon design for LT networks and discussed the coexistence of analog amplification and digital selection in network circuits.
This transfer function is also more appropriate for recurrent neural networks (Douglas et al., 1995). Since the unsaturating piecewise linear activation functions are unbounded, more complex dynamic properties may exist in the networks. As far as we know, there have been few results on this class of neural networks originating from a general dynamic analysis, although some boundedness conditions have been reported in Wersing, Beyn, and Ritter (2001). In this article, we study the recurrent neural networks described by the following nonlinear differential equation,

\dot{x}_i(t) = -x_i(t) + \sum_{j=1}^{n} w_{ij} \sigma(x_j(t)) + h_i, \quad (i = 1, \dots, n), (1.1)

or its equivalent vector form,

\dot{x}(t) = -x(t) + W \sigma(x(t)) + h,

for all t \ge 0, where each x_i denotes the activity of neuron i, W = (w_{ij})_{n \times n} is a real n \times n matrix, each of its elements w_{ij} denotes the synaptic weights
Multistability of Recurrent Neural Networks
641
Figure 1: The activation function \sigma(s).
and represents the strength of the synaptic connection from neuron j to neuron i, x = (x_1, \dots, x_n)^T \in R^n, and h \in R^n denotes external inputs. For any x \in R^n, \sigma(x) = (\sigma(x_1), \dots, \sigma(x_n))^T, and the activation function \sigma is defined as follows:

\sigma(s) = \max\{0, s\}, \quad s \in R.

The activation function \sigma is an unsaturating piecewise linear function, which is continuous, unbounded, and nondifferentiable. Figure 1 shows this function. We address the three dynamical properties for network 1.1. Conditions based on local inhibition will be derived to guarantee boundedness and also allow the network to have multiple equilibrium points. Under such boundedness conditions, we will show that the network possesses compact sets that attract all the trajectories of the network. Explicit inequalities to calculate the global attractive sets will be given. In addition, if the weight matrix of the network possesses some kind of symmetry, we will show that network 1.1 has the property of complete convergence. Wersing, Beyn, and Ritter (2001) studied the boundedness of the following neural networks:

\dot{y}(t) = -y(t) + \sigma(W y(t) + h) (1.2)

for t \ge 0. Based on local inhibition of the network, they established some interesting and elegant boundedness conditions. The network in equation 1.2 has some differences from the network described in equation 1.1. Dynamically, network 1.1 is more general than network 1.2. In fact, if we let x(t) = W y(t) + h, then network 1.2 can easily be transformed into network 1.1. The networks have the following relationship,

\dot{y}(t) = -y(t) + \sigma(x(t)),

for t \ge 0. For any T \ge 0, we have

\|y(t)\| \le \|y(T)\| e^{-(t-T)} + \int_T^t e^{-(t-s)} \|\sigma(x(s))\| \, ds, (1.3)
for t \ge T. By equation 1.3, the boundedness, attractivity, and convergence of network 1.1 can easily be translated to network 1.2. However, network 1.1 cannot be transformed into 1.2 while retaining the dynamic properties, except when the matrix W is invertible (Wersing, Beyn, & Ritter, 2001). In fact, if W is not invertible, the equilibrium points of networks 1.1 and 1.2 do not have a one-to-one correspondence under the above transform. In many applications, it may not be reasonable to assume that the matrix W is invertible. Many neural systems exhibiting short-term memory are modeled by noninvertible networks, such as the oculomotor integrator (Seung, 1996) or the head-direction system (Zhang, 1996). In these networks, precise tuning of synaptic connections is necessary to produce line attractors along which the dynamics are marginally stable. In addition, the attractivity issue was not studied by Wersing, Beyn, and Ritter (2001). We believe that attractivity is an important property for multistable networks. The rest of this article is organized as follows. Preliminaries are given in section 2. Boundedness and attractivity analysis of network 1.1 are given in section 3. In section 4, we discuss the complete convergence of the network. Simulations and illustrative examples are given in section 5. Finally, section 6 presents the conclusions.

2 Preliminaries

Since the activation function \sigma(\cdot) is an important factor that affects the dynamics of the network, we begin with important properties that will be useful in the subsequent multistability analysis.

Lemma 1.
s \sigma(s) = \sigma^2(s) = 2 \int_0^s \sigma(\theta) \, d\theta,

for any s \in R.

Proof. If s \ge 0, then

2 \int_0^s \sigma(\theta) \, d\theta = 2 \int_0^s \theta \, d\theta = s^2 = \sigma^2(s) = s \sigma(s).

If s < 0, then \sigma(\theta) = 0 for all s \le \theta \le 0. Thus,

2 \int_0^s \sigma(\theta) \, d\theta = 0 = s \sigma(s) = \sigma^2(s).

Lemma 2. [\sigma(s) - \sigma(\theta)]^2 \le (s - \theta)[\sigma(s) - \sigma(\theta)] for all s, \theta \in R.
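Before turning to the proof, both lemmas are easy to spot-check numerically for the rectifier \sigma(s) = max{0, s}; the sketch below is an illustration only.

```python
# Spot-check of lemma 1 (s*sigma(s) = sigma(s)^2 = 2*int_0^s sigma)
# and lemma 2 ([sigma(s)-sigma(t)]^2 <= (s-t)[sigma(s)-sigma(t)]).
def sigma(s):
    return max(0.0, s)

def two_int_sigma(s, n=20000):
    # left Riemann sum for 2 * integral of sigma(theta) from 0 to s
    step = s / n
    return 2.0 * sum(sigma(i * step) * step for i in range(n))

points = [-2.0, -0.5, 0.0, 0.5, 2.0]
lemma1_ok = all(
    abs(s * sigma(s) - sigma(s) ** 2) < 1e-12
    and abs(s * sigma(s) - two_int_sigma(s)) < 1e-3
    for s in points
)

grid = [k / 4.0 for k in range(-12, 13)]       # grid over [-3, 3]
lemma2_ok = all(
    (sigma(s) - sigma(t)) ** 2 <= (s - t) * (sigma(s) - sigma(t)) + 1e-12
    for s in grid for t in grid
)
```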
Proof. For any s, \theta \in R, we have

\sigma(s) - \sigma(\theta) = \sigma(s - \theta + \theta) - \sigma(\theta) \le \sigma(s - \theta).

Without loss of generality, we assume that s \ge \theta. Then \sigma(s) \ge \sigma(\theta), and so

[\sigma(s) - \sigma(\theta)]^2 \le \sigma(s - \theta)[\sigma(s) - \sigma(\theta)] \le (s - \theta)[\sigma(s) - \sigma(\theta)].

Denote in this article a norm in R^n by

\|x\| = \sqrt{\sum_{i=1}^{n} |x_i|^2},

for any x \in R^n. Then, from lemmas 1 and 2, it follows that

x^T \sigma(x) = \sigma^T(x) \sigma(x) = 2 \sum_{i=1}^{n} \int_0^{x_i} \sigma(\theta) \, d\theta,

and

\|\sigma(x) - \sigma(y)\|^2 \le (x - y)^T [\sigma(x) - \sigma(y)],

for all x, y \in R^n.

Definition 1. Network 1.1 is bounded if each of its trajectories is bounded.

Definition 2. Let S be a compact subset of R^n. We denote the \epsilon-neighborhood of S by S_\epsilon. The compact set S is called a global attractive set of network 1.1 if for any \epsilon > 0, all trajectories of that network ultimately enter and remain in S_\epsilon.

Let f(t) be a continuous function defined on [0, +\infty); the upper limit of f(t) is defined as

\limsup_{t \to +\infty} f(t) = \lim_{t \to +\infty} \left[ \sup_{\theta \ge t} f(\theta) \right].

Clearly, if f(t) is upper-bounded, its upper limit exists. If \limsup_{t \to +\infty} f(t) = \xi < +\infty, then for any \epsilon > 0, there exists a T \ge 0 such that f(t) \le \xi + \epsilon for t \ge T. If every trajectory x(t) of network 1.1 satisfies

\limsup_{t \to +\infty} |x_i(t)| \le s_i < +\infty,
then, clearly, S = \{x \mid |x_i| \le s_i, (i = 1, \dots, n)\} is a global attractive set of network 1.1. A vector x^* \in R^n is called an equilibrium point of network 1.1 if it satisfies

-x^* + W \sigma(x^*) + h = 0.

We use S^e to denote the equilibrium set of network 1.1.

Definition 3. Network 1.1 is said to be completely convergent if the equilibrium set S^e is not empty and every trajectory x(t) of that network converges to S^e, that is,

dist(x(t), S^e) = \min_{x^* \in S^e} \|x(t) - x^*\| \to 0

as t \to +\infty.

Throughout this article, for any constant c \in R, we denote c^+ = \max(0, c). We also denote h^+ = (h_1^+, \dots, h_n^+)^T. For any matrix A = (a_{ij})_{n \times n}, we denote A^+ = (a_{ij}^+), where a_{ij}^+ = a_{ii} if i = j and a_{ij}^+ = \max(0, a_{ij}) if i \ne j. We will use \lambda_{min}(A) and \lambda_{max}(A) to denote the smallest and largest eigenvalues of A, respectively.

3 Boundedness and Global Attractivity

In this section, conditions based on local inhibition will be derived that guarantee boundedness and allow network 1.1 to have the property of multistability. Moreover, explicit expressions will be provided, locating the compact sets that globally attract all the trajectories of network 1.1. Let us first prove a lemma that will be useful in the proof of the main theorems.

Lemma 3. Let M \ge 0 and T \ge 0 be constants. If x_i(t) \le M, (i = 1, \dots, n), for t \ge T, then we have

|x_i(t)| \le |x_i(T)| e^{-(t-T)} + M \sum_{j=1}^{n} |w_{ij}| + |h_i|, \quad (i = 1, \dots, n),

for all t \ge T.
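The c^+, h^+, and A^+ notation of section 2 transcribes directly into code; a minimal sketch with a made-up matrix:

```python
# Direct transcription of the notation above: c^+ rectifies a scalar,
# h^+ rectifies a vector componentwise, and A^+ keeps the diagonal of A
# while rectifying only the off-diagonal entries. (Example values are made up.)
def cplus(c):
    return max(0.0, c)

def hplus(h):
    return [cplus(v) for v in h]

def Aplus(A):
    n = len(A)
    return [[A[i][j] if i == j else cplus(A[i][j]) for j in range(n)]
            for i in range(n)]

A = [[-1.0, 2.0, -3.0],
     [0.5, -0.2, -0.7],
     [-4.0, 1.0, 0.3]]
Ap = Aplus(A)   # diagonal (-1, -0.2, 0.3) preserved, off-diagonals rectified
```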
Proof. Since x_i(t) \le M, (i = 1, \dots, n), for t \ge T, then by the definition of the activation function \sigma(\cdot), it clearly holds that \sigma(x_i(t)) \le M for all t \ge T. From network 1.1, we have

x_i(t) = x_i(T) e^{-(t-T)} + \int_T^t e^{-(t-s)} \left[ \sum_{j=1}^{n} w_{ij} \sigma(x_j(s)) + h_i \right] ds, \quad (i = 1, \dots, n),

for t \ge T. Then it follows that

|x_i(t)| \le |x_i(T)| e^{-(t-T)} + M \sum_{j=1}^{n} |w_{ij}| + |h_i|, \quad (i = 1, \dots, n),

for all t \ge T.

Lemma 3 shows that if a trajectory of network 1.1 is upper-bounded, then it is also lower-bounded. This property will help us to prove the following theorem.

Theorem 1.
If there exist constants \alpha_i > 0, (i = 1, \dots, n), such that

\gamma_i = 1 - w_{ii} - \frac{1}{\alpha_i} \sum_{j=1, j \ne i}^{n} \alpha_j w_{ij}^+ > 0, \quad (i = 1, \dots, n),

then network 1.1 is bounded. Moreover,

S = \left\{ x \,\middle|\, |x_i| \le \Pi \cdot \sum_{j=1}^{n} |w_{ij}| + |h_i|, (i = 1, \dots, n) \right\}

is a global attractive set of network 1.1, where

\Pi = \max_{1 \le j \le n} \{\alpha_j\} \cdot \max_{1 \le j \le n} \left\{ \frac{h_j^+}{\gamma_j} \right\}.
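Before the proof, the quantities in the theorem are easy to evaluate for a concrete case. The sketch below uses the two-dimensional W and h of example 5.1 in section 5 (the choice \alpha_1 = \alpha_2 = 1 is ours):

```python
# Checking the hypotheses of theorem 1 and evaluating the attractive
# set it predicts. W and h come from example 5.1; alpha is our choice.
W = [[0.0, -1.0], [-0.6, 0.4]]
h = [0.8, 0.48]
alpha = [1.0, 1.0]
n = len(W)

gamma = [1.0 - W[i][i]
         - sum(alpha[j] * max(0.0, W[i][j]) for j in range(n) if j != i)
         / alpha[i]
         for i in range(n)]
theorem1_applies = all(g > 0 for g in gamma)      # gamma = (1, 0.6)

Pi = max(alpha) * max(max(0.0, h[j]) / gamma[j] for j in range(n))
bounds = [Pi * sum(abs(W[i][j]) for j in range(n)) + abs(h[i])
          for i in range(n)]   # |x_i| <= bounds[i] describes S
```

For these values Pi = 0.8, and the predicted attractive set is |x_1| <= 1.6, |x_2| <= 1.28, matching the computation carried out in section 5.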
Proof. Since \sigma(\cdot) \ge 0, it follows from network 1.1 that

\dot{x}_i(t) \le -x_i(t) + w_{ii} \sigma(x_i(t)) + \sum_{j=1, j \ne i}^{n} w_{ij}^+ \sigma(x_j(t)) + h_i^+, \quad (i = 1, \dots, n), (3.1)

for t \ge 0. Define

v_i(t) = \frac{x_i(t)}{\alpha_i}, \quad (i = 1, \dots, n),
for t \ge 0. Then, from equation 3.1, we have

\dot{v}_i(t) \le -v_i(t) + w_{ii} \sigma(v_i(t)) + \frac{1}{\alpha_i} \left( \sum_{j=1, j \ne i}^{n} \alpha_j w_{ij}^+ \sigma(v_j(t)) + h_i^+ \right), \quad (i = 1, \dots, n), (3.2)

for t \ge 0. For any x(0) \in R^n, denote

m = \|x(0)\| \cdot \max_{1 \le i \le n} \left\{ \frac{1}{\alpha_i} \right\} + \max_{1 \le i \le n} \left\{ \frac{h_i^+}{\alpha_i \gamma_i} \right\} + 1.

We will prove that

v_i(t) < m, \quad (i = 1, \dots, n), (3.3)

for all t \ge 0. Otherwise, since v_i(0) < m, (i = 1, \dots, n), there must exist some i and a time t_1 > 0 such that v_i(t_1) = m and

v_j(t) < m for j = i, 0 \le t < t_1; \quad v_j(t) \le m for j \ne i, 0 \le t \le t_1.

Thus, it must follow that \dot{v}_i(t_1) \ge 0. However, from equation 3.2, we have

\dot{v}_i(t_1) \le -m + w_{ii} m + \frac{1}{\alpha_i} \left( \sum_{j=1, j \ne i}^{n} \alpha_j w_{ij}^+ m + h_i^+ \right) = -m \cdot \gamma_i + \frac{h_i^+}{\alpha_i} < 0.

This is a contradiction that proves equation 3.3 to be true. From equation 3.3, it follows that

x_i(t) < \max_{1 \le j \le n} \{\alpha_j\} \cdot m = M, \quad (i = 1, \dots, n),

for all t \ge 0. By lemma 3, we have

|x_i(t)| \le |x_i(0)| e^{-t} + M \sum_{j=1}^{n} |w_{ij}| + |h_i| \le |x_i(0)| + M \sum_{j=1}^{n} |w_{ij}| + |h_i|, \quad (i = 1, \dots, n),

for t \ge 0. This proves that network 1.1 is bounded.
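The boundedness just proved can be observed directly in simulation. A sketch under assumed example values (the W and h of example 5.1 in section 5): a forward-Euler trajectory started well outside the predicted attractive set enters |x_1| <= 1.6, |x_2| <= 1.28 and stays there.

```python
# Numerical illustration of theorem 1's conclusion for assumed example
# values (W, h of example 5.1); not code from the paper.
relu = lambda v: [max(0.0, s) for s in v]
W = [[0.0, -1.0], [-0.6, 0.4]]
h = [0.8, 0.48]

dt = 0.001
x = [5.0, -4.0]                # start outside S
tail = []                      # samples collected after the transient
for step in range(30000):      # integrate to t = 30
    s = relu(x)
    x = [x[i] + dt * (-x[i] + sum(W[i][j] * s[j] for j in range(2)) + h[i])
         for i in range(2)]
    if step >= 15000:          # discard the first 15 time units
        tail.append(list(x))

enters_S = all(abs(p[0]) <= 1.6 and abs(p[1]) <= 1.28 for p in tail)
```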
Next, we will prove the global attractivity under these conditions. Denote

\limsup_{t \to +\infty} v_i(t) = \xi_i, \quad (i = 1, \dots, n).

Since each v_i(t) is upper-bounded, \xi_i < +\infty, (i = 1, \dots, n). We will prove that

\xi_i \le \max_{1 \le j \le n} \left\{ \frac{h_j^+}{\gamma_j} \right\}, \quad (i = 1, \dots, n). (3.4)

For convenience in the subsequent discussion and without loss of generality, we assume that \xi_l = \max_{1 \le j \le n} \{\xi_j\}. If \xi_l \le 0, clearly equation 3.4 is true. In what follows, we consider the case that \xi_l > 0. Suppose equation 3.4 is not true:

\xi_l > \max_{1 \le j \le n} \left\{ \frac{h_j^+}{\gamma_j} \right\}. (3.5)
We will prove that this assumption leads to a contradiction that will then confirm that equation 3.4 is true. Clearly, we can choose a small \epsilon > 0 such that

\xi_l > \max \left\{ \epsilon, \; \max_{1 \le j \le n} \frac{2\epsilon(1 + |w_{jj}|) + h_j^+}{\gamma_j} \right\} = k.

For this choice of \epsilon, by the basic property of the upper limit, there exists a t_2 \ge 0 such that

v_j(t) \le \xi_j + \epsilon \le \xi_l + \epsilon, \quad (j = 1, \dots, n),

for all t \ge t_2. We claim that there exists a t_3 \ge t_2 such that

\dot{v}_l(t) < 0 (3.6)

for all t \ge t_3. If equation 3.6 is not true, there must exist a t_4 > t_2 such that

\dot{v}_l(t_4) \ge 0, \quad v_l(t_4) \ge \xi_l - \epsilon.

However, from equation 3.2, we have

\dot{v}_l(t_4) \le -(\xi_l - \epsilon) + w_{ll} \xi_l + |w_{ll}| \epsilon + \frac{1}{\alpha_l} \sum_{j=1, j \ne l}^{n} \alpha_j w_{lj}^+ \sigma(\xi_j + \epsilon) + h_l^+
\le -(\xi_l - \epsilon) + w_{ll} \xi_l + |w_{ll}| \epsilon + \frac{1}{\alpha_l} \sum_{j=1, j \ne l}^{n} \alpha_j w_{lj}^+ (\xi_l + \epsilon) + h_l^+
\le -\gamma_l \xi_l + 2\epsilon(1 + |w_{ll}|) + h_l^+
\le -\gamma_l k + 2\epsilon(1 + |w_{ll}|) + h_l^+
< 0,

which is a contradiction, thus proving that equation 3.6 is true. Equation 3.6 shows that v_l(t) is monotonically decreasing. Thus, the limit of v_l(t) exists, that is,

\lim_{t \to +\infty} v_l(t) = \limsup_{t \to +\infty} v_l(t) = \xi_l.
Then there exists a t_5 \ge t_3 such that

\xi_l - \epsilon < v_l(t) < \xi_l + \epsilon, \quad v_j(t) < \xi_j + \epsilon < \xi_l + \epsilon, \quad (j = 1, \dots, n),

for all t \ge t_5. From equation 3.2, it follows for t \ge t_5 that

\dot{v}_l(t) \le -(\xi_l - \epsilon) + w_{ll} \xi_l + |w_{ll}| \epsilon + \frac{1}{\alpha_l} \sum_{j=1, j \ne l}^{n} \alpha_j w_{lj}^+ \sigma(\xi_l + \epsilon) + h_l^+
\le -\gamma_l \xi_l + 2\epsilon(1 + |w_{ll}|) + h_l^+
\le -\gamma_l k + 2\epsilon(1 + |w_{ll}|) + h_l^+.

Thus,

v_l(t) \le v_l(t_5) - (t - t_5)[\gamma_l k - 2\epsilon(1 + |w_{ll}|) - h_l^+] \to -\infty \quad \text{as } t \to +\infty.

This contradicts \xi_l > 0. Therefore, equation 3.5 does not hold, and instead, equation 3.4 is true. By equation 3.4, we have

\limsup_{t \to +\infty} x_i(t) = \alpha_i \xi_i \le \max_{1 \le j \le n} \{\alpha_j\} \cdot \max_{1 \le j \le n} \left\{ \frac{h_j^+}{\gamma_j} \right\} = \Pi, \quad (i = 1, \dots, n).
For any \eta > 0, there exists a t_6 \ge 0 such that

x_i(t) \le \Pi + \eta, \quad (i = 1, \dots, n),

for t \ge t_6. By lemma 3, it follows that

|x_i(t)| \le |x_i(t_6)| e^{-(t-t_6)} + (\Pi + \eta) \sum_{j=1}^{n} |w_{ij}| + |h_i|

for t \ge t_6. Thus, we have

\limsup_{t \to +\infty} |x_i(t)| \le (\Pi + \eta) \sum_{j=1}^{n} |w_{ij}| + |h_i|.

Letting \eta \to 0, it follows that

\limsup_{t \to +\infty} |x_i(t)| \le \Pi \sum_{j=1}^{n} |w_{ij}| + |h_i|, \quad (i = 1, \dots, n).
This shows that S is a global attractive set of network 1.1.

The local inhibition-based conditions of theorem 1 are identical to those of Wersing, Beyn, and Ritter (2001) for neural network 1.2. As far as we know, Wersing, Beyn, and Ritter (2001) are the first authors to obtain conditions for boundedness of neural networks based on local inhibition. Such conditions are quite interesting and elegant. However, they did not study attractivity, which is an important dynamic property for multistability of recurrent neural networks.

Theorem 2.
If

\Omega = I - \left[ \frac{W^T + W}{2} \right]^+

is a positive definite matrix, then network 1.1 is bounded. Moreover,

S = \left\{ x \,\middle|\, \|x\| \le \sqrt{\lambda_{max}((W^T W)^+)} \cdot \frac{\|h^+\|}{\lambda_{min}(\Omega)} + \|h\| \right\}

is a global attractive set of network 1.1.

Proof. Define a differentiable function

V(t) = \sum_{j=1}^{n} \int_0^{x_j(t)} \sigma(\theta) \, d\theta

for t \ge 0. Obviously, we have V(t) \ge 0 for all t \ge 0.
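Before following the derivative computation, note that the hypothesis of theorem 2 itself is straightforward to check numerically. A sketch for an assumed 2-by-2 example (the W and h of example 5.1 in section 5):

```python
import math

# Checking the hypothesis of theorem 2 and evaluating the radius of
# its attractive set for assumed example values (W, h of example 5.1).
W = [[0.0, -1.0], [-0.6, 0.4]]
h = [0.8, 0.48]

def plus(A):   # keep diagonal, rectify off-diagonal entries
    return [[A[i][j] if i == j else max(0.0, A[i][j]) for j in range(2)]
            for i in range(2)]

def eig_sym2(M):   # eigenvalues of a symmetric 2x2 matrix [[a,b],[b,c]]
    a, b, c = M[0][0], M[0][1], M[1][1]
    m, r = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    return m - r, m + r

S = [[(W[i][j] + W[j][i]) / 2.0 for j in range(2)] for i in range(2)]
Omega = [[(1.0 if i == j else 0.0) - plus(S)[i][j] for j in range(2)]
         for i in range(2)]
lam_min = eig_sym2(Omega)[0]
positive_definite = lam_min > 0           # theorem 2 applies if True

WtW = [[sum(W[k][i] * W[k][j] for k in range(2)) for j in range(2)]
       for i in range(2)]
lam_max = eig_sym2(plus(WtW))[1]

def norm(v):
    return math.hypot(v[0], v[1])

hp = [max(0.0, v) for v in h]
radius = math.sqrt(lam_max) * norm(hp) / lam_min + norm(h)
```

For these values Omega = diag(1, 0.6) is positive definite and the radius evaluates to roughly 2.61.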
Calculating the derivative of V(t) along the trajectories of equation 1.1, we have

\dot{V}(t) = -\sigma^T(x(t)) x(t) + \sigma^T(x(t)) W \sigma(x(t)) + \sigma^T(x(t)) h
\le \sigma^T(x(t)) \left[ \frac{W^T + W}{2} - I \right] \sigma(x(t)) + \sigma^T(x(t)) h^+
\le \sigma^T(x(t)) \left( \left[ \frac{W^T + W}{2} \right]^+ - I \right) \sigma(x(t)) + \sigma^T(x(t)) h^+
\le -\lambda_{min}(\Omega) \cdot \sigma^T(x(t)) \sigma(x(t)) + \sigma^T(x(t)) h^+
\le -\frac{\lambda_{min}(\Omega)}{2} \sigma^T(x(t)) \sigma(x(t)) + \frac{\|h^+\|^2}{2 \lambda_{min}(\Omega)}

for all t \ge 0. By lemma 1, we have

\dot{V}(t) \le -\lambda_{min}(\Omega) V(t) + \frac{\|h^+\|^2}{2 \lambda_{min}(\Omega)}

for all t \ge 0 and

V(t) \le V(0) e^{-\lambda_{min}(\Omega) t} + \frac{\|h^+\|^2}{2 \lambda_{min}(\Omega)} \int_0^t e^{-\lambda_{min}(\Omega)(t-s)} \, ds
\le V(0) e^{-\lambda_{min}(\Omega) t} + \frac{\|h^+\|^2}{2 [\lambda_{min}(\Omega)]^2}

for all t \ge 0. By lemma 1, we have

V(0) = \sum_{j=1}^{n} \int_0^{x_j(0)} \sigma(\theta) \, d\theta = \frac{x^T(0) \sigma(x(0))}{2} \le \frac{\|x(0)\|^2}{2}

and

\|\sigma(x(t))\|^2 \le \|x(0)\|^2 e^{-\lambda_{min}(\Omega) t} + \frac{\|h^+\|^2}{[\lambda_{min}(\Omega)]^2} = \phi(t) (3.7)

for all t \ge 0. Moreover, we have

\lim_{t \to +\infty} \sqrt{\phi(t)} = \frac{\|h^+\|}{\lambda_{min}(\Omega)}. (3.8)

From equation 1.1, we have

x(t) = x(0) e^{-t} + \int_0^t e^{-(t-s)} [W \sigma(x(s)) + h] \, ds (3.9)

for t \ge 0. Then, from equations 3.9 and 3.7, we have

\|x(t)\| \le \|x(0)\| e^{-t} + \int_0^t e^{-(t-s)} \|W \sigma(x(s))\| \, ds + \|h\|
\le \|x(0)\| e^{-t} + \int_0^t e^{-(t-s)} \sqrt{\sigma^T(x(s)) (W^T W) \sigma(x(s))} \, ds + \|h\|
\le \|x(0)\| e^{-t} + \int_0^t e^{-(t-s)} \sqrt{\sigma^T(x(s)) ((W^T W)^+) \sigma(x(s))} \, ds + \|h\|
\le \|x(0)\| e^{-t} + \sqrt{\lambda_{max}((W^T W)^+)} \int_0^t e^{-(t-s)} \|\sigma(x(s))\| \, ds + \|h\|
\le \|x(0)\| e^{-t} + \sqrt{\lambda_{max}((W^T W)^+)} \int_0^t e^{-(t-s)} \sqrt{\phi(s)} \, ds + \|h\|
= \Phi(t)

for t \ge 0. Since

\lim_{t \to +\infty} \Phi(t) = \sqrt{\lambda_{max}((W^T W)^+)} \lim_{t \to +\infty} \int_0^t e^{-(t-s)} \sqrt{\phi(s)} \, ds + \|h\|
= \sqrt{\lambda_{max}((W^T W)^+)} \lim_{t \to +\infty} \sqrt{\phi(t)} + \|h\|,

then, using equation 3.8, we have

\lim_{t \to +\infty} \Phi(t) = \sqrt{\lambda_{max}((W^T W)^+)} \cdot \frac{\|h^+\|}{\lambda_{min}(\Omega)} + \|h\|.

Thus, it follows that

\limsup_{t \to +\infty} \|x(t)\| \le \sqrt{\lambda_{max}((W^T W)^+)} \cdot \frac{\|h^+\|}{\lambda_{min}(\Omega)} + \|h\|.

This shows that S globally attracts network 1.1.

In theorems 1 and 2, if h_i \le 0, (i = 1, \dots, n), the global attractive sets become S = \{x \mid |x_i| \le |h_i|, (i = 1, \dots, n)\} and S = \{x \mid \|x\| \le \|h\|\}, respectively. In particular, if h_i = 0, (i = 1, \dots, n), the attracting sets contain only the origin; that is, network 1.1 can have only the
origin as its unique equilibrium point, and all trajectories converge to the origin. The conditions attached to the weight matrix of network 1.1 in theorems 1 and 2 are different. For example, consider the matrices

W_1 = \begin{pmatrix} 0 & 6 & 7 \\ -4 & 0 & 5 \\ -5 & -3 & 0 \end{pmatrix}, \quad W_2 = \begin{pmatrix} 0 & 5 & 0.5 \\ -7 & 0 & 2 \\ 0.5 & -4 & 0 \end{pmatrix}.

Matrix W_1 satisfies the conditions of theorem 1 but not theorem 2, and matrix W_2 satisfies the conditions of theorem 2 but not theorem 1. Hence, although both theorems relate to boundedness and global attractivity, they can be used complementarily in different situations.

4 Complete Convergence

In this section, we study the complete convergence of network 1.1, assuming that the connection matrix W is symmetric in some sense. The general method to study complete convergence of neural networks is to construct suitable energy functions. However, for network 1.1, since the activation function \sigma(\cdot) is not differentiable, it is difficult to construct energy functions such as those used in Chua and Yang (1988), Cohen and Grossberg (1983), Hopfield (1984), Wersing, Beyn, and Ritter (2001), and Wu and Chua (1997). To counter this nondifferentiability problem, we first associate network 1.1 with another differential equation. Through this equation, a novel energy-like function can be constructed and the conditions for complete convergence derived accordingly. For any x_0 \in R^n, we associate network 1.1 with the following equivalent differential equation,

\dot{z}(t) = -z(t) + \sigma(W z(t) + h + (x_0 - h) e^{-t}), (4.1)

for t \ge 0.

Lemma 4. For any x_0 \in R^n, the solution x(t) of network 1.1 starting from x_0 can be represented as

x(t) = W z(t) + h + (x_0 - h) e^{-t},

where z(t) is the solution of equation 4.1 starting from the origin.

Proof. For any x_0 \in R^n, let x(t) be the solution of equation 1.1 starting from x_0. Consider the following differential equation,

\dot{z}(t) = -z(t) + \sigma(x(t)), (4.2)

for all t \ge 0.
From equations 1.1 and 4.2, it follows that

\frac{d}{dt} [x(t) - h - W z(t)] = -[x(t) - h - W z(t)]

for t \ge 0. Solving this equation, we have

x(t) = (x_0 - h - W z(0)) e^{-t} + W z(t) + h

for t \ge 0, where z(0) \in R^n is any initial vector of equation 4.2. Taking z(0) = 0 in particular—that is, z(t) is the solution of equation 4.2 starting from the origin—then

x(t) = (x_0 - h) e^{-t} + W z(t) + h (4.3)

for t \ge 0. The result now follows by substituting equation 4.3 into 4.2.

Lemma 5. For any x_0 \in R^n, the solution z(t) of equation 4.1 starting from the origin satisfies z_i(t) \ge 0, (i = 1, \dots, n), for all t \ge 0.

Proof. Since z(0) = 0, then from equation 4.1, we have

z_i(t) = \int_0^t e^{-(t-s)} \sigma\left( \sum_{j=1}^{n} w_{ij} z_j(s) + h_i + (x_i^0 - h_i) e^{-s} \right) ds

for t \ge 0 and (i = 1, \dots, n). Since \sigma(\cdot) \ge 0 always, clearly z_i(t) \ge 0 for t \ge 0.

Lemma 6. Suppose there exists a diagonal positive definite matrix D such that the matrix DW is symmetric. Given any x_0 \in R^n, if the solution z(t) of equation 4.1 starting from the origin is bounded, then

\lim_{t \to +\infty} \dot{z}(t) = 0.

Proof. Construct an energy-like function

E(t) = \frac{1}{2} z^T(t) D (I - W) z(t) - h^T D z(t) - (x_0 - h)^T D \left( z(t) e^{-t} + \int_0^t z(s) e^{-s} \, ds \right) (4.4)

for all t \ge 0. Clearly, E(0) = 0.
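The monotone decrease of E(t) established below can be watched numerically. A sketch under assumed example values (the W, h of example 5.1 in section 5, x_0 chosen arbitrarily, and D = diag(1, 5/3), for which DW is symmetric): integrate equation 4.1 from the origin by forward Euler, accumulate the integral term of equation 4.4, and track E.

```python
import math

# Tracking the energy-like function E(t) of equation 4.4 along a
# forward-Euler trajectory of equation 4.1 (assumed example values).
W = [[0.0, -1.0], [-0.6, 0.4]]
h = [0.8, 0.48]
D = [1.0, 5.0 / 3.0]           # diagonal entries of D; DW is symmetric
x0 = [0.5, -0.3]               # arbitrary initial condition of network 1.1
relu = lambda v: [max(0.0, s) for s in v]

def energy(z, integ, t):
    ImWz = [z[i] - sum(W[i][j] * z[j] for j in range(2)) for i in range(2)]
    quad = 0.5 * sum(D[i] * z[i] * ImWz[i] for i in range(2))
    lin = -sum(D[i] * h[i] * z[i] for i in range(2))
    tail = -sum((x0[i] - h[i]) * D[i] * (z[i] * math.exp(-t) + integ[i])
                for i in range(2))
    return quad + lin + tail

dt, steps = 0.001, 5000
z, integ, t = [0.0, 0.0], [0.0, 0.0], 0.0
energies = [energy(z, integ, t)]
for _ in range(steps):
    drive = [sum(W[i][j] * z[j] for j in range(2)) + h[i]
             + (x0[i] - h[i]) * math.exp(-t) for i in range(2)]
    s = relu(drive)
    integ = [integ[i] + z[i] * math.exp(-t) * dt for i in range(2)]  # left rule
    z = [z[i] + dt * (-z[i] + s[i]) for i in range(2)]
    t += dt
    energies.append(energy(z, integ, t))
```

Up to the discretization error of the Euler step, the recorded values start at E(0) = 0 and never increase, as equation 4.5 predicts.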
By the assumption that z(t) is bounded, it is easy to see that E(t) is also bounded. By lemma 5, z_i(t) \ge 0, (i = 1, \dots, n), for all t \ge 0; we have \sigma(z(t)) = z(t) for all t \ge 0. Denote D = diag(d_1, \dots, d_n); then d_i > 0, (i = 1, \dots, n). Let d = \min_{1 \le i \le n}(d_i). It follows from equations 4.4 and 4.1 and lemma 2 that

\dot{E}(t) = [z(t) - ((x_0 - h) e^{-t} + W z(t) + h)]^T D [-z(t) + \sigma((x_0 - h) e^{-t} + W z(t) + h)]
= [z(t) - ((x_0 - h) e^{-t} + W z(t) + h)]^T D [-\sigma(z(t)) + \sigma((x_0 - h) e^{-t} + W z(t) + h)]
\le -d \cdot \|\sigma(z(t)) - \sigma((x_0 - h) e^{-t} + W z(t) + h)\|^2
= -d \cdot \|z(t) - \sigma((x_0 - h) e^{-t} + W z(t) + h)\|^2
= -d \cdot \|\dot{z}(t)\|^2 (4.5)
for all t \ge 0. Thus, E(t) is monotonically decreasing. Since E(t) is bounded, there must exist a constant E_0 such that

\lim_{t \to +\infty} E(t) = E_0 > -\infty.

From equation 4.5,

\int_0^{+\infty} \|\dot{z}(s)\|^2 \, ds = \lim_{t \to +\infty} \int_0^t \|\dot{z}(s)\|^2 \, ds
\le \lim_{t \to +\infty} \int_0^t \frac{-\dot{E}(s)}{d} \, ds
= -\frac{1}{d} \lim_{t \to +\infty} E(t)
= \frac{-E_0}{d} < +\infty.

Since z(t) is bounded, from equation 4.1 it follows that \dot{z}(t) is bounded. Then z(t) is uniformly continuous on [0, +\infty). Again from equation 4.1, it follows that \|\dot{z}(t)\|^2 is also uniformly continuous on [0, +\infty). Thus, it must hold that

\lim_{t \to +\infty} \|\dot{z}(t)\|^2 = 0.

Theorem 3. Suppose there exists a diagonal positive definite matrix D such that DW is symmetric. If network 1.1 is bounded, then the network is completely convergent.
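Before the proof, a numerical sketch of what complete convergence looks like, under assumed example values (the W, h of example 5.1 in section 5, where D = diag(1, 5/3) makes DW symmetric): every trajectory settles at some solution of -x + W sigma(x) + h = 0, and different initial conditions can settle at different equilibria — the multistability the theorem allows.

```python
# Illustration of theorem 3 for assumed example values: trajectories of
# network 1.1 converge to the equilibrium set, but not all to the same point.
relu = lambda v: [max(0.0, s) for s in v]
W = [[0.0, -1.0], [-0.6, 0.4]]
h = [0.8, 0.48]

def run(x, dt=0.001, steps=30000):   # forward Euler to t = 30
    for _ in range(steps):
        s = relu(x)
        x = [x[i] + dt * (-x[i] + sum(W[i][j] * s[j] for j in range(2)) + h[i])
             for i in range(2)]
    return x

def residual(x):                     # size of -x + W sigma(x) + h
    s = relu(x)
    return max(abs(-x[i] + sum(W[i][j] * s[j] for j in range(2)) + h[i])
               for i in range(2))

a = run([3.0, -2.0])                 # settles near (0.8, 0)
b = run([0.1, 0.7])                  # stays at another equilibrium
```

For these values every point with x_1 + x_2 = 0.8 and x_1, x_2 >= 0 solves the equilibrium condition, so the equilibrium set is a continuum; which point a trajectory reaches depends on where it starts.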
Proof. By lemma 4, for any $x_0 \in \mathbb{R}^n$, the solution of equation 1.1 starting from this point can be represented as
$$x(t) = (x_0 - h)e^{-t} + Wz(t) + h \tag{4.6}$$
for $t \ge 0$, where $z(t)$, which depends on $x_0$, is the solution of equation 4.1 starting from the origin. Let us first show that $z(t)$ is bounded. By the assumption that $x(t)$ is bounded, there exists a constant $K > 0$ such that $\|x(t)\| \le K$ for all $t \ge 0$. From equation 4.2, we have
$$z(t) = \int_0^t e^{-(t-s)} \sigma(x(s))\, ds$$
for $t \ge 0$. Thus,
$$\|z(t)\| \le \int_0^t e^{-(t-s)} \|\sigma(x(s))\|\, ds \le \int_0^t e^{-(t-s)} K\, ds \le K$$
for $t \ge 0$. This shows that $z(t)$ is bounded. Using lemma 6, we have $\lim_{t \to +\infty} \dot{z}(t) = 0$. Then, from equation 4.6, we have
$$\lim_{t \to +\infty} \dot{x}(t) = W \lim_{t \to +\infty} \dot{z}(t) = 0.$$
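The representation 4.6 can also be checked numerically: integrating equation 1.1 for $x(t)$ alongside the auxiliary system $\dot z = -z + \sigma(x)$, $z(0) = 0$, the right-hand side of 4.6 reproduces $x(t)$. A minimal forward-Euler sketch, assuming the unsaturating piecewise linear transfer function $\sigma(u) = \max(0, u)$; the particular $W$, $h$, $x_0$, step size, and horizon are arbitrary choices for illustration:

```python
import math

# Arbitrary 2-D instance of network 1.1 (values chosen for illustration only).
W = [[0.0, -1.0], [-0.6, 0.4]]
h = [0.8, 0.48]
x0 = [2.0, -1.0]

def sigma(v):
    # Unsaturating piecewise linear transfer function sigma(u) = max(0, u).
    return [max(0.0, u) for u in v]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

dt, steps = 1e-3, 3000
x, z = x0[:], [0.0, 0.0]
for n in range(steps):
    s = sigma(x)
    Ws = matvec(W, s)
    # Equation 1.1: x' = -x + W sigma(x) + h;  auxiliary system: z' = -z + sigma(x).
    x = [x[i] + dt * (-x[i] + Ws[i] + h[i]) for i in range(2)]
    z = [z[i] + dt * (-z[i] + s[i]) for i in range(2)]

# Right-hand side of equation 4.6 at t = steps*dt: (x0 - h) e^{-t} + W z(t) + h.
t = steps * dt
Wz = matvec(W, z)
rhs = [(x0[i] - h[i]) * math.exp(-t) + Wz[i] + h[i] for i in range(2)]
err = max(abs(x[i] - rhs[i]) for i in range(2))
print(err)  # small, of the order of the Euler discretization error
```

The two sides agree up to the discretization error of the Euler scheme.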
Since $x(t)$ is bounded, every sequence $\{x(t_m)\}$ with $t_m \to +\infty$ must contain a convergent subsequence. Let $\{x(t_m)\}$ be any such convergent subsequence. There exists an $x^* \in \mathbb{R}^n$ such that
$$\lim_{t_m \to +\infty} x(t_m) = x^*.$$
Then, from equation 1.1,
$$-x^* + W\sigma(x^*) + h = \lim_{t_m \to +\infty} \dot{x}(t_m) = 0.$$
Thus, $x^* \in S^e$. This shows that $S^e$ is not empty and that any convergent subsequence of $\{x(t)\}$ converges to a point of $S^e$. Next, we use the method in Liang and Wang (2000) to prove
$$\lim_{t \to +\infty} \mathrm{dist}(x(t), S^e) = 0. \tag{4.7}$$
Suppose equation 4.7 is not true. Then there exists a constant $\epsilon_0 > 0$ such that for any $T \ge 0$, there exists a $\bar{t} \ge T$ that satisfies $\mathrm{dist}(x(\bar{t}), S^e) \ge \epsilon_0$. From this and by the boundedness of $x(t)$, we can choose a convergent subsequence $\{x(\bar{t}_m)\}$, where $\lim_{\bar{t}_m \to +\infty} x(\bar{t}_m) = x^\dagger \in S^e$, such that
$$\mathrm{dist}(x(\bar{t}_m), S^e) \ge \epsilon_0, \quad (m = 1, 2, \dots).$$
Letting $\bar{t}_m \to +\infty$, we have
$$\mathrm{dist}(x^\dagger, S^e) \ge \epsilon_0 > 0,$$
which contradicts $\mathrm{dist}(x^\dagger, S^e) = 0$ since $x^\dagger \in S^e$. This proves that equation 4.7 is true. Network 1.1 is thus completely convergent.

5 Simulation Examples

In this section, we provide examples of simulation results to illustrate and verify the theory developed.

5.1 Example 1. Consider the two-dimensional neural network
# " # " 0 xP 1 .t/ x1 .t/ D¡ C ¡0:6 xP 2 .t/ x2 .t/
for t ¸ 0. Clearly, " 0 WD ¡0:6
¡1 0:4
#" # " # 0:8 ¾ .x1 .t// C 0:48 ¾ .x2 .t//
(5.1)
# ¡1 : 0:4
Let us rst use theorem 1 to show the boundedness of this network. C Taking ®1 D ®2 D 1, since wC 12 D w21 D 0, it is easy to calculate that °1 D 1 > 0 and °2 D 0:6 > 0. Thus, the conditions of theorem 1 are fully satised. Using theorem 1, the network is bounded, and there exists a compact set to attract the network. By simple calculations, we obtain the global attractive set (" # ) x1 jx1 j · 1:6; jx2 j · 1:28 : SD x2 Dene a diagonal positive denite matrix " # 1 0 DD : 0 53 Then we have " DW D
0 ¡1
¡1
¡ 23
# ;
which is symmetric. By theorem 3, this network is completely convergent. Thus, every trajectory of equation 5.1 must converge to the equilibrium set. The equilibrium set of the network must be located in the attractive set $S$. In fact, it is not difficult to calculate the equilibrium set of equation 5.1 as
$$S^e = \left\{ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \,\middle|\, x_1 + x_2 = 0.8,\ x_1 \ge 0,\ x_2 \ge 0 \right\}.$$
Clearly, $S^e \subset S$. The boundedness and global attractivity of equation 5.1 can also be checked by theorem 2. If we invoke theorem 2, we can calculate another global attractive set:
$$\bar{S} = \{x \mid \|x\| \le 2.6077\}.$$
This also shows that theorem 2 does not imply monostability. Figure 2 shows the simulation results for global attractivity and complete convergence of network 5.1 for 250 trajectories originating from randomly selected initial points. It can be observed that every trajectory converges to the equilibrium set. Clearly, this network has a line attractor $S^e$. The square-contained part in Figure 2 is the global attractive set $S$ calculated by using theorem 1. The circle-contained part is the global attractive set $\bar{S}$ calculated by using theorem 2.

5.2 Example 2. Consider the following three-dimensional neural network
$$\dot{x}(t) = -x(t) + W\sigma(x(t)) + h \tag{5.2}$$
for $t \ge 0$, where $x = (x_1, x_2, x_3)^T$, $h = (-0.2, -0.8, 0.3)^T$, and
$$W = \begin{bmatrix} 0 & 1 & 0.5 \\ -3 & 0 & 2 \\ 0.5 & -4 & 0 \end{bmatrix}.$$
It is not difficult to show that there does not exist any diagonal matrix $D$ such that $DW$ becomes symmetric. Hence, we cannot use theorem 3 to ascertain whether the network is completely convergent. However, we can use theorem 2 to show the boundedness and global attractivity of the network. We have
$$I - \left( \frac{W + W^T}{2} \right)^+ = \begin{bmatrix} 1 & 0 & -0.5 \\ 0 & 1 & 0 \\ -0.5 & 0 & 1 \end{bmatrix}.$$
Figure 2: Global attractivity and complete convergence of network 5.1.
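The convergence to the line attractor illustrated in Figure 2 can be reproduced with a short simulation of network 5.1. A minimal sketch, assuming $\sigma(u) = \max(0, u)$; the forward-Euler step size, time horizon, and initial points are arbitrary choices:

```python
# Simulate network 5.1 and check convergence to the line attractor
# S^e = { x : x1 + x2 = 0.8, x1 >= 0, x2 >= 0 }.
W = [[0.0, -1.0], [-0.6, 0.4]]
h = [0.8, 0.48]

def sigma(u):
    # Unsaturating piecewise linear transfer function.
    return max(0.0, u)

def simulate(x0, dt=1e-3, T=30.0):
    x = list(x0)
    for _ in range(int(T / dt)):
        s = [sigma(x[0]), sigma(x[1])]
        Ws = [W[i][0] * s[0] + W[i][1] * s[1] for i in range(2)]
        x = [x[i] + dt * (-x[i] + Ws[i] + h[i]) for i in range(2)]
    return x

finals = [simulate(x0) for x0 in ([2.0, -1.0], [-1.0, 3.0], [0.1, 0.1])]
for x in finals:
    # Each trajectory ends on the equilibrium line, inside the attractive set S.
    print(round(x[0] + x[1], 3), abs(x[0]) <= 1.6 and abs(x[1]) <= 1.28)
```

Each start converges to a (generally different) point of the line $x_1 + x_2 = 0.8$, consistent with convergence to the equilibrium set rather than to a single equilibrium point.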
Its eigenvalues are 0.5, 1.0, and 1.5. Thus, it is a positive definite matrix. By theorem 2, the network is bounded, and a global attractive set can be located as
$$S = \{x \mid \|x\| \le 3.3528\}.$$
Figure 3 shows the simulation results in three-dimensional space for 50 trajectories originating from randomly selected initial points. Figures 4, 5, and 6 show the projections of the trajectories onto the phase planes $(x_1, x_2)$, $(x_1, x_3)$, and $(x_2, x_3)$, respectively. The parts contained in the circles in Figures 4 through 6 are the projections of the global attractive set $S$ onto the corresponding phase planes. The boundedness of this network cannot be checked by theorem 1. In fact, we have
$$I - W^+ = \begin{bmatrix} 1 & -1 & -0.5 \\ 0 & 1 & -2 \\ -0.5 & 0 & 1 \end{bmatrix},$$
which has a negative eigenvalue and thus violates the conditions of theorem 1.
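Both spectral claims (the eigenvalues 0.5, 1.0, and 1.5 of $I - ((W + W^T)/2)^+$, and the negative eigenvalue of $I - W^+$) can be confirmed through the characteristic polynomial. A small pure-Python check, assuming $M^+$ denotes the elementwise positive part:

```python
# Check the spectra of the two matrices appearing in example 2.
W = [[0.0, 1.0, 0.5], [-3.0, 0.0, 2.0], [0.5, -4.0, 0.0]]

def pos(M):
    # Elementwise positive part M+.
    return [[max(0.0, e) for e in row] for row in M]

def det3(M):
    a, b, c = M[0]; d, e, f = M[1]; g, h, i = M[2]
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def char_poly(M, lam):
    # det(M - lam * I)
    A = [[M[i][j] - (lam if i == j else 0.0) for j in range(3)] for i in range(3)]
    return det3(A)

# I - ((W + W^T)/2)^+ has eigenvalues 0.5, 1.0, 1.5.
S = pos([[(W[i][j] + W[j][i]) / 2 for j in range(3)] for i in range(3)])
M1 = [[(1.0 if i == j else 0.0) - S[i][j] for j in range(3)] for i in range(3)]
eigs_ok = all(abs(char_poly(M1, lam)) < 1e-12 for lam in (0.5, 1.0, 1.5))
print(eigs_ok)  # True

# I - W^+ has a negative eigenvalue: the characteristic polynomial
# changes sign on (-5, 0), so a real root lies in that interval.
P = pos(W)
M2 = [[(1.0 if i == j else 0.0) - P[i][j] for j in range(3)] for i in range(3)]
sign_change = char_poly(M2, -5.0) * char_poly(M2, 0.0) < 0
print(sign_change)  # True
```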
Figure 3: Boundedness and global attractivity of network 5.2.
Figure 4: Projection of trajectories of equation 5.2 on the $(x_1, x_2)$ phase plane.
Figure 5: Projection of trajectories of equation 5.2 on the $(x_2, x_3)$ phase plane.
From the simulation results, it can be observed that this network seems to have the monostability property; however, we are not able to prove this point. It is worth noting that this network does not satisfy existing stability conditions, for example, the global stability conditions recently reported in Forti (1994), Liang and Si (2001), Yi, Heng, and Fu (1999), Yi, Heng, and Leung (2001), and Yi, Heng, and Vadakkepat (2002). Apart from the presented simulations, we have extensively simulated the boundedness, global attractivity, and complete convergence of networks belonging to network model 1.1. All the results verify the theory developed. In fact, certain global attractive sets approach the critical bounds containing the equilibrium points.

6 Conclusion

In this article, we have studied the multistability of recurrent neural networks with unsaturating piecewise linear transfer functions. We have addressed three basic dynamical problems for this class of networks: boundedness, global attractivity, and complete convergence. Local inhibition-based conditions for boundedness and global attractivity are established. Under
Figure 6: Projection of trajectories of equation 5.2 on the $(x_1, x_3)$ phase plane.
these conditions, the global attractive sets of the networks are obtained. With symmetry conditions attached to the networks' weight matrix, complete convergence is analyzed by showing that each trajectory converges to the equilibrium set. We believe that the complete convergence results of this article could be improved to show that each trajectory converges to an equilibrium point instead of the equilibrium set. However, we could not prove this, and it remains an interesting problem to study.

Acknowledgments

We thank the reviewers for their many helpful comments and suggestions to improve the quality and readability of this article.

References

Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Chua, L. O., & Yang, L. (1988). Cellular neural networks: Theory. IEEE Trans. Circuits Systems, 35, 1257–1272.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Systems, Man, Cybernetics, 13, 815–826.
Douglas, R., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Forti, M. (1994). On global asymptotic stability of a class of nonlinear systems arising in neural network theory. J. Differential Equations, 113, 246–264.
Hahnloser, R. L. T. (1998). On the piecewise analysis of linear threshold neural networks. Neural Networks, 11, 691–697.
Hahnloser, R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., & Seung, H. S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405, 947–951.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Liang, X. B., & Si, J. (2001). Global exponential stability of neural networks with globally Lipschitz continuous activations and its applications to linear variational inequality problem. IEEE Trans. Neural Networks, 12, 349–359.
Liang, X. B., & Wang, J. (2000). A recurrent neural network for nonlinear optimization with a continuously differentiable objective function and bound constraints. IEEE Trans. Neural Networks, 11, 1251–1262.
Seung, H. S. (1996). How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93, 13339–13344.
Wersing, H., Beyn, W. J., & Ritter, H. (2001). Dynamical stability conditions for recurrent neural networks with unsaturating piecewise linear transfer functions. Neural Computation, 13, 1811–1825.
Wersing, H., Steil, J. J., & Ritter, H. (2001). A competitive layer model for feature binding and sensory segmentation. Neural Computation, 13, 357–387.
Wu, C. W., & Chua, L. O. (1997). A more rigorous proof of complete stability of cellular neural networks. IEEE Trans. Circuits and Systems I, 44, 370–371.
Yi, Z., Heng, P. A., & Fu, A. W. C. (1999). Estimate of exponential convergence rate and exponential stability for neural networks. IEEE Trans. Neural Networks, 10, 1487–1493.
Yi, Z., Heng, P. A., & Leung, K. S. (2001). Convergence analysis of cellular neural networks with unbounded delay. IEEE Trans. Circuits and Systems I, 48, 680–687.
Yi, Z., Heng, P. A., & Vadakkepat, P. (2002). Absolute periodicity and absolute stability of delayed neural networks. IEEE Trans. Circuits and Systems I, 49, 256–261.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neurosci., 16, 2112–2126.

Received February 14, 2002; accepted September 4, 2002.
LETTER
Communicated by Bruno Olshausen
Simple-Cell-Like Receptive Fields Maximize Temporal Coherence in Natural Video

Jarmo Hurri
jarmo.hurri@hut.fi
Aapo Hyvärinen
aapo.hyvarinen@hut.fi
Neural Networks Research Centre, Helsinki University of Technology, 02015 HUT, Finland
Recently, statistical models of natural images have shown the emergence of several properties of the visual cortex. Most models have considered the nongaussian properties of static image patches, leading to sparse coding or independent component analysis. Here we consider the basic time dependencies of image sequences instead of their nongaussianity. We show that simple-cell-type receptive fields emerge when temporal response strength correlation is maximized for natural image sequences. Thus, temporal response strength correlation, which is a nonlinear measure of temporal coherence, provides an alternative to sparseness in modeling simple-cell receptive field properties. Our results also suggest an interpretation of simple cells in terms of invariant coding principles, which have previously been used to explain complex-cell receptive fields.

1 Introduction

The functional role of simple cells has puzzled scientists since the structure of their receptive fields was first mapped by Hubel and Wiesel in the 1950s (Palmer, 1999). The first hypothesis concerning their role was based on their visual appearance, that is, their similarity to edges and bars. The second major theory was the local spatial frequency analysis theory, which is based on generally applicable signal processing principles. The current view of the functionality of sensory neural networks emphasizes learning and the relationship between the structure of the cells and the statistical properties of the information they process (see, e.g., Field, 1994; Simoncelli & Olshausen, 2001). A major advance was achieved when Olshausen and Field (1996) showed that simple-cell-like receptive fields emerge when sparse coding is applied to natural image data. Similar results were obtained with independent component analysis (ICA) shortly after (Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998).
In the case of image data, ICA is closely related to sparse coding (Hyvärinen, Karhunen, & Oja, 2001; Olshausen & Field, 1997).

Neural Computation 15, 663–691 (2003)
© 2003 Massachusetts Institute of Technology
In this article, we show that an alternative principle, temporal coherence (Becker, 1993; Földiák, 1991; Kayser, Einhäuser, Dümmer, König, & Körding, 2001; Mitchison, 1991; Stone, 1996; Wiskott & Sejnowski, 2002), leads to the emergence of simple-cell receptive fields from natural image sequences. This finding is significant because it means that temporal coherence provides a complementary theory to sparse coding as a computational principle behind the formation of simple-cell receptive fields. The results also link the theory of achieving invariance by temporal coherence (Földiák, 1991) to real-world visual data and measured properties of the visual system. Whereas previous research has focused on establishing this link for complex cells, we show that such a connection exists even at the simple-cell level. Temporal coherence is based on the idea that when processing temporal input, the representation should change as little as possible over time. Földiák (1991) was one of the first to suggest the usefulness of temporal coherence in computational neuroscience. He developed a two-layer network that was able to learn to identify a fixed feature, such as a line with a fixed orientation, even if the way the feature was expressed in the data changed, for example, if the line was translated. Földiák used temporal coherence as a tool to learn translation invariances: artificially generated input data were temporally coherent (consecutive input frames contained translated versions of a line with the same orientation), and by using competition and short-term memory, the output was also taught to be temporally coherent. This associated translated versions of a feature with each other. Several other researchers have studied temporal coherence and other forms of coherence, such as coherence with respect to different views of the same scene.
For example, in Becker and Hinton (1992) and Stone (1996), surface depth was discovered from stereograms by a multiple-layer nonlinear network. In Becker and Hinton (1992), this was done by using stereograms representing different views of the same randomly generated scene and maximizing the mutual information between outputs. In Stone (1996), learning was achieved by using a temporal sequence of slightly different stereograms and maximizing the temporal smoothness of the output while preserving variability in the output. In these studies, the input data sets were generated so that there was an underlying coherent parameter in the data, and the objective was to find that parameter by using coherence. Therefore, the main result was the demonstration of the usefulness of coherence using simulated data. The contribution of this article is to show that when the input consists of natural image sequences, the linear filters that maximize temporal response strength correlation are similar to simple-cell receptive fields. We first describe temporal response strength correlation, which is a measure of temporal coherence, and an algorithm capable of optimizing this measure. In section 3, we apply this algorithm to natural image sequences. In addition to the main results, we describe several control experiments that were made to ensure the validity and novelty of our results. Finally, in section 4, we give an intuitive explanation of why optimization of the objective function produces such results, discuss the spatiotemporal and nonlinear (nonnegative) extensions of the model, and conclude by discussing the implications of this work.

2 Temporal Response Strength Correlation

In the basic model, we restrict ourselves to linear spatial models of simple cells. Linear simple-cell models are commonly used in studies concerning the connections between visual input statistics and simple-cell receptive fields (Bell & Sejnowski, 1997; Olshausen & Field, 1996; van Hateren & van der Schaaf, 1998), because linearity seems to approximately characterize most simple cells (DeAngelis, Ohzawa, & Freeman, 1993b). Extensions of this basic framework are discussed in sections 4.4 and 4.5. The model uses a set of spatial filters (vectors) $w_1, \dots, w_K$ to relate input to output. Let the signal vector $x(t)$ denote the input of the system at time $t$. A vectorization of image patches can be done by scanning images column-wise into vectors. For windows of size $N \times N$, this yields vectors of dimension $N^2$. The output of the $k$th filter at time $t$, denoted by the signal $y_k(t)$, is given by the dot product
$$y_k(t) = w_k^T x(t). \tag{2.1}$$
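The column-wise vectorization and the filter response of equation 2.1 can be sketched as follows (the patch values and the filter are arbitrary toy data for illustration):

```python
# Vectorize an N x N patch column-wise and compute y_k(t) = w_k^T x(t).
N = 3
patch = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]  # patch[row][col], toy values

# Column-wise scan: all rows of column 0, then column 1, and so on.
x = [patch[r][c] for c in range(N) for r in range(N)]  # dimension N^2 = 9

w_k = [0.0] * (N * N)
w_k[4] = 1.0  # toy filter that picks out the center pixel

y_k = sum(wi * xi for wi, xi in zip(w_k, x))
print(x, y_k)  # x = [1, 4, 7, 2, 5, 8, 3, 6, 9]; y_k = 5.0
```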
Let matrix $W = [w_1, \dots, w_K]^T$ denote a matrix with all the filters as rows. Then the input-output relationship can be expressed in vector form by
$$y(t) = Wx(t), \tag{2.2}$$
where the signal vector $y(t) = [y_1(t), \dots, y_K(t)]^T$. Temporal response strength correlation, the objective function, is defined by
$$f(W) = \sum_{k=1}^{K} E_t\{ g(y_k(t))\, g(y_k(t - \Delta t)) \}, \tag{2.3}$$
where the nonlinearity $g$ is strictly convex, even (rectifying), and differentiable. The symbol $\Delta t$ denotes a delay in time. The nonlinearity $g$ measures the strength (amplitude) of the response of the filter and emphasizes large responses over small ones (see section 4). Examples of choices for this nonlinearity are $g_1(x) = x^2$, which measures the energy of the response, and $g_2(x) = \ln \cosh x$, which is a robust version of $g_1$. A set of filters with a large temporal response strength correlation is such that the same filters often respond strongly at consecutive time points, outputting large (either positive or negative) values. This means that the same filters will respond strongly over short periods of time, thereby expressing temporal coherence of a population code.
To keep the outputs of the filters bounded, we enforce the unit variance constraint on each of the output signals $y_k(t)$; that is, we enforce $E_t\{y_k^2(t)\} = w_k^T C_x w_k = 1$ for all $k$, where the matrix $C_x = E_t\{x(t)x^T(t)\}$. Additional constraints are needed to keep the filters from converging to the same solution. Standard methods (Hyvärinen et al., 2001) are either to force the set of filters to be orthogonal or to force their outputs to be uncorrelated; we choose the latter. This introduces the additional constraints $w_i^T C_x w_j = 0$, $i = 1, \dots, K$, $j = 1, \dots, K$, $j \ne i$. These uncorrelatedness constraints limit the number of filters $K$ we can find to $K \le N^2$. The unit variance constraints and the uncorrelatedness constraints can be expressed by the single matrix equation
$$W C_x W^T = I. \tag{2.4}$$
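The objective 2.3 and the constraint 2.4 can be sketched numerically. In the sketch below, whitening turns 2.4 into a row-orthonormality constraint on a matrix $B$, which a gradient-projection loop could then maintain by symmetric orthogonalization after each update; all data, dimensions, and the lag are arbitrary toy choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image sequence": a 3-D signal with temporally smooth components.
T, n, K = 5000, 3, 2
t = np.arange(T)
X = np.stack([np.sin(0.05 * t), np.cos(0.03 * t), rng.standard_normal(T)], axis=1)
X += 0.1 * rng.standard_normal((T, n))

Cx = X.T @ X / T                               # C_x = E{x x^T}
evals, evecs = np.linalg.eigh(Cx)
V = evecs @ np.diag(evals ** -0.5) @ evecs.T   # whitening matrix C_x^{-1/2}

# Any B with orthonormal rows gives W = B V satisfying W C_x W^T = I.
B = rng.standard_normal((K, n))
B = np.linalg.qr(B.T)[0].T                     # orthonormalize the rows
# (Inside an optimization loop, the symmetric orthogonalization
#  B <- (B B^T)^{-1/2} B would serve the same projection purpose.)
Wf = B @ V
assert np.allclose(Wf @ Cx @ Wf.T, np.eye(K), atol=1e-8)

def f(Wf, lag=1):
    # Temporal response strength correlation, eq. 2.3, with g = ln cosh.
    Y = X @ Wf.T
    G = np.log(np.cosh(Y))
    return sum(np.mean(G[lag:, k] * G[:-lag, k]) for k in range(K))

print(f(Wf))
```

In whitened coordinates the constraint is plain orthonormality, which is what allows the gradient projection algorithm described next to operate with a simple orthogonalization step.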
Note that if we use the nonlinearity $g(x) = x^2$ and $\Delta t = 0$, the objective function becomes $f(W) = \sum_{k=1}^{K} E_t\{y_k^4(t)\}$. In this case, the optimization of the objective function under the unit variance constraint is equivalent to optimizing the sum of the kurtoses of the outputs. Kurtosis is a commonly used measure in sparse coding. Similarly, in the case of the nonlinearity $g(x) = \ln \cosh x$ and $\Delta t = 0$, the objective function can be interpreted as a nonquadratic measure of the nongaussianity of the filter outputs. We return to this issue in section 3. Thus, the receptive fields are learned in our model by maximizing the objective function 2.3 under the constraint 2.4. The optimization algorithm used for this constrained optimization problem is a variant of the gradient projection method of Rosen (for the original algorithm, see Luenberger, 1969). The optimization approach employs whitening, that is, a temporary change of coordinates, to transform the constraint 2.4 into an orthonormality constraint. Then a gradient projection algorithm employing optimal symmetric orthogonalization can be used. See the appendix for details.

3 Experiments on Natural Image Sequences

3.1 Data Collection. The natural image sequences used as data were a subset of those used in van Hateren and Ruderman (1998). The original data set consisted of 216 monochrome, noncalibrated video clips of 192 seconds each, taken from television broadcasts. More than half of the videos feature wildlife; the rest show various topics, such as sports and movies. The sampling frequency was 25 frames per second, and each frame was block-averaged to a resolution of 128 × 128 pixels. For our experiments, this data set was pruned to remove the effect of human-made objects and artifacts. First, many of the videos feature human-made objects, such as houses and furniture. Such videos were removed from the data set, leaving us with 129 videos.
Some of these 129 videos had been grabbed from television broadcasts, and there was a wide black bar of height 15 pixels at the top of each image, probably because the original broadcast had been in wide-screen format. Our sampling procedure never took samples from this topmost part of the videos. If these artifacts were not removed from the data set, the static ICA results computed for comparison showed longer horizontal and vertical receptive fields than results obtained from scenes without the artifacts. The final, preprocessed (see below) data set consisted of 200,000 pairs of consecutive 11 × 11 image windows (patches) at the same spatial position, but $\Delta t$ milliseconds apart from each other. Depending on the experiment, $\Delta t$ varied between 40 ms and 960 ms. However, because of the temporal filtering used in preprocessing, initially 200,000 longer image sequences with a duration of $\Delta t + 400$ ms, and the same spatial size 11 × 11, were sampled at the same sampling rate. A second data set was needed for computing the corresponding (static) ICA solution for comparison. This data set consisted of 200,000 11 × 11 images sampled from the same video data.

3.2 Preprocessing. The preprocessing in the main experiment consisted of three steps: temporal decorrelation, subtraction of the local mean, and normalization. (The same preprocessing steps were applied in the control experiments. Whenever preprocessing was varied in a control experiment, it is explained separately below.) Temporal decorrelation can be motivated in two ways. First, it can be motivated biologically as a model of temporal processing at the lateral geniculate nucleus (Dong & Atick, 1995). Second, for $\Delta t = 0$, the objective function can be interpreted as a measure of sparseness. Therefore, it is important to rule out the possibility that there is barely any change over short intervals in video data, since this would imply that our results could be explained in terms of sparse coding or ICA.
To make the distinction between temporal response strength correlation and measures of sparseness clear, temporal decorrelation was applied because it enhances temporal changes. Note, however, that this still does not remove all of the static part of the video, an issue addressed in the control experiments that follow. To investigate the effect of temporal decorrelation, we first examined the histogram of the distances between subsequent image windows separated by $\Delta t = 40$ ms, shown in Figure 1A. (This histogram also shows that there are indeed large changes between subsequent time points even without temporal decorrelation.) The local mean has been removed from these windows, and they have been normalized (see below); note that 2 is the maximal distance because of normalization. Temporal decorrelation was performed with a temporal filter, shown in Figure 1B. The Fourier magnitude of the filter was computed by inverting the amplitude spectrum of the input data. A Wiener filter approach was used in order not to amplify high-frequency noise. (We determined the filter directly from data; see Dong & Atick, 1995, for an analytic solution.) Noise power was estimated by assuming that it equals the signal power at 5.5 Hz, a value also used in Dong and Atick (1995). Phases were determined by adding a minimum-energy-delay constraint on the filter (see Oppenheim & Schafer, 1975). The filter length was originally 2500 ms but was truncated to 400 ms, because this part contains over 99% of the filter energy.

Figure 1: Temporal decorrelation enhances temporal changes. (A) Distribution of Euclidean distances between consecutive samples at the same spatial position, but 40 ms apart from each other, without temporal decorrelation. Note that two is the maximum value because of normalization. (B) The temporally decorrelating filter. (C) Distribution of Euclidean distances between consecutive samples at the same spatial position, but 40 ms apart from each other, after temporal decorrelation. (D) Distribution of Euclidean distances between consecutive samples at the same spatial position, but 120 ms apart from each other, after temporal decorrelation. Note that removal of the DC component and normalization were also performed in all of these cases.

When short image sequences of length $\Delta t + 400$ ms are filtered with this temporal filter, the resulting sequences are of length $\Delta t$. From each of these temporally filtered short sequences, two windows separated by $\Delta t$ were taken. Figure 1C shows the distribution of the distances between such temporally decorrelated windows, after removal of the local mean and normalization (see below), when $\Delta t = 40$ ms. This histogram shows that temporal decorrelation enhances temporal changes in the data. Figure 1D shows the distribution of the distances between temporally decorrelated windows, after removal of the local mean and normalization, when $\Delta t = 120$ ms. In this case, consecutive windows are, on average, quite far from each other when measured with the Euclidean norm. In fact, the peak of the histogram is
approximately at 1.4, which for normalized vectors means that a large fraction of the consecutive windows are approximately orthogonal to each other.

After temporal decorrelation, the local mean (DC component) was subtracted from each window. This reduces the dimensionality of the data by one, so with image window size $N = 11$, the number of filters $K \le 120$. Finally, the sample vectors (vectorized windows) were normalized to unit Euclidean norm, which can be considered a form of contrast gain control (Carandini, Heeger, & Movshon, 1997; Heeger, 1992). Note that no spatial low-pass filtering or dimensionality reduction was performed during preprocessing. In the case of the second data set, which was used to compute the corresponding static ICA solution, preprocessing consisted of removal of the local mean, followed by normalization. Temporal decorrelation was not performed here, since it has no meaning for static image data.

3.3 Results: Temporally Coherent Filters of Natural Image Sequences. The main experiment consisted of running the symmetric gradient projection algorithm 50 times using different random initial values, followed by a quantitative analysis of these results, as well as of results obtained with a corresponding ICA algorithm. In this experiment, $\Delta t$ was 40 ms. The number of extracted filters was set at the maximum value $K = 120$. The nonlinearity $g$ in objective function 2.3 was chosen to be $g(x) = \ln \cosh x$ because of its robustness against outliers (Hyvärinen et al., 2001). Figure 2 shows the resulting filters (i.e., rows of matrix $W$) of the first run. The filters have been ordered according to $E_t\{g(y_k(t))\, g(y_k(t - \Delta t))\}$, that is, according to their "contribution" to the final objective value (filters with the largest values at top left). The filters resemble Gabor filters. They are localized and oriented and have different scales. These are the main features of simple-cell receptive fields (Palmer, 1999).
When compared qualitatively with earlier reported results obtained with sparse coding and independent component analysis (Bell & Sejnowski, 1997; Olshausen & Field, 1996; van Hateren & van der Schaaf, 1998), our results show a larger variety of spatial scales. To compare the results quantitatively, we extracted a corresponding set of 50 ICA separation matrices using the symmetric fixed-point ICA (FastICA) algorithm with the robust nonlinearity tanh (Hyvärinen et al., 2001). This algorithm is a symmetric version of the one used in van Hateren and van der Schaaf (1998). The symmetric nature of the algorithm facilitates extracting a balanced set of filters (Hyvärinen et al., 2001). The filters maximizing temporal response strength correlation were compared against the rows of the separating matrix (the ICA filters), since these filters are the natural counterparts in ICA (van Hateren & van der Schaaf, 1998). Figure 3 shows the ICA filters obtained from the first run. In the quantitative comparison of the results, we measured the most important properties of the receptive fields. The results are shown in Figure 4. The measured properties were peak spatial frequency (see Figures 4A and
Figure 2: Temporally coherent filters of natural image sequences, given by the first run of the main experiment. The filters were estimated from natural image sequences by optimizing temporal response strength correlation with the symmetric gradient projection algorithm (here the nonlinearity is $g(x) = \ln \cosh x$). The filters have been ordered according to $E_t\{g(y_k(t))\, g(y_k(t - \Delta t))\}$, that is, according to their contribution to the final objective value (filters with the largest values at top left).
4B; note the logarithmic scale and the units of cycles per pixel), peak orientation (see Figures 4C and 4D), spatial frequency bandwidth (see Figures 4E and 4F), and orientation bandwidth (see Figures 4G and 4H). See van Hateren and van der Schaaf (1998) for definitions of these measures. Although there are some differences, the most important observation here is the similarity of the histograms. This supports the idea that ICA/sparse coding and temporal coherence are complementary theories, in that both result in the emergence of simple-cell-like receptive fields. As for the differences, the results obtained using temporal response strength correlation have a slightly
Figure 3: For comparison, ICA filters estimated from natural image sequences using the symmetric fixed-point algorithm. Note that these filters have a much less smooth appearance than in most published ICA results; this is because, for comparison, we show here the filters and not the basis vectors, and further, no low-pass filtering or dimension reduction was applied in the preprocessing.
smaller number of high-frequency receptive fields. Also, temporal response strength correlation seems to produce receptive fields that are somewhat more localized with respect to both spatial frequency and orientation. When the results are compared against those in van Hateren and van der Schaaf (1998), the most important difference is the peak at zero bandwidth in Figures 4E and 4F. This difference is probably a result of the fact that no dimensionality reduction, antialiasing, or noise reduction was performed here, which results in the appearance of very small, checkerboard-like receptive fields. This effect is more pronounced in ICA, which also explains the stronger peak at the 45-degree angle in Figure 4D.
J. Hurri and A. Hyvärinen
Figure 4: Comparison of properties of receptive fields obtained by optimizing temporal response strength correlation (left column, histograms A, C, E, and G) and estimating ICA filters (right column, histograms B, D, F, and H). See the text for details.
3.4 Control Experiments. To ensure the novelty and validity of our results, we made eight control experiments. All other aspects, except those specifically mentioned here, were the same in these experiments as in the main experiment.

3.4.1 Control Experiment I: No Temporal Decorrelation. In the main experiment, we used temporally decorrelated data. However, as was seen in Figure 1, there is considerable temporal change in natural video data even without temporal decorrelation. Therefore, it is reasonable to ask whether temporal decorrelation is necessary for achieving results like those shown in Figure 2. To answer this question, the algorithm was run for data that were not temporally decorrelated. The results are shown in Figure 5A. Although the filters with highest-frequency components seem to be somewhat less localized, the results remain qualitatively very similar to those in Figure 2 in that they also resemble Gabor filters. This suggests that the simple-cell-like properties of the results of the main experiment are not a consequence of temporal decorrelation.

3.4.2 Control Experiments II–IV: Longer Δt. In the main experiment, consecutive samples were separated by 40 ms. Does the phenomenon found in the main experiment hold only for very small Δt? To answer this question, we examined the case Δt = 120 ms in control experiment II. As can be seen in Figure 1D, with this time separation, preprocessed consecutive sample windows are typically much farther from each other than when Δt = 40 ms. Figure 5B shows the results of applying the symmetric gradient projection algorithm to this data. Although the receptive fields are now larger than in the main experiment, the results are still qualitatively similar. This implies that the discovered relationship applies to short-time natural image sequences in general, not only for some specific value of Δt. It seems natural that there would be an upper limit for which the results are no longer qualitatively similar to those in Figure 2.
This is indeed the case. The degree of spatial localization decreases when Δt increases. This can already be seen to some degree in Figure 5B and is more pronounced in the results of control experiment III, in which Δt = 480 ms (see Figure 5C). In the 480 ms case, most of the filters are poorly localized, resembling Fourier basis vectors corresponding to different frequencies. As Δt becomes large enough, spatial localization disappears, and filters also start to lose their orientation selectivity. This is illustrated in Figure 5D for Δt = 960 ms (control experiment IV).

3.4.3 Control Experiment V: Randomly Selected Consecutive Windows. Control experiment V was made to ensure that the results reflect the dynamics of natural image sequences, and not just the relationship between any arbitrary image patches. Instead of using samples separated by Δt, random image samples were chosen as consecutive window pairs. No temporal
Figure 5: Results of control experiments I–IV. (A) Results of control experiment I in which no temporal decorrelation was performed. (B) Results of control experiment II in which Δt = 120 ms. (C) Results of control experiment III in which Δt = 480 ms. (D) Results of control experiment IV in which Δt = 960 ms.
decorrelation was done here, since random window pairs do not have a temporal relationship. Figure 6A shows the resulting spatial filters, which correspond to noise patterns, indicating that the original results in Figure 2 do reflect natural image sequence dynamics.

Figure 6: Results of control experiments V–VIII. (A) Results of control experiment V in which consecutive window pairs were chosen randomly. (B) Results of control experiment VI in which the static part of natural image sequences was removed altogether by employing Gram-Schmidt orthogonalization to consecutive windows. (C) Results of control experiment VII in which observer (camera) movement was compensated by using a tracking mechanism. (D) Results of control experiment VIII in which ordinary linear correlation between output values was maximized.

3.4.4 Control Experiment VI: Complete Removal of the Static Part of Video. In most experiments, the data were temporally decorrelated, leading to enhanced temporal change. For example, note from Figure 1D that in the case Δt = 120 ms, after preprocessing (including the normalization step), the peak of the histogram of distances between preprocessed consecutive windows is at approximately 1.4. Remembering that the samples have been normalized, this indicates that a large part of the preprocessed sample consists of consecutive windows that are approximately orthogonal to each other. This supports the idea that our results are not a consequence of the static part of natural image sequences.

However, since preprocessing does not remove the static part of the video completely, we made another control experiment (control experiment VI), with Δt = 120 ms, in which the static part was removed altogether. This was done by modifying the preprocessing step so that no temporal decorrelation was done; instead, after removal of the local mean, consecutive windows
were orthonormalized using the Gram-Schmidt procedure. This removes the static part, the part present already at time t − Δt, from the window at time t completely. No other form of temporal filtering was performed in this experiment. Figure 6B shows the results of this control experiment. The resulting filters are still localized and oriented and have different scales. This shows that the qualitative nature of our results is not a consequence of the static part of natural image sequences. The filters are more localized than in the corresponding experiment with temporal decorrelation, probably because removal of the static part is more likely to remove large features from the data (see section 4). If the same orthogonalization procedure is applied in the case Δt = 40 ms, the results (not shown) are even more localized. This suggests that here too, as in the case of temporally decorrelated data, the degree of spatial localization is a function of Δt.

3.4.5 Control Experiment VII: Compensation of Observer Movement. Control experiment VII was made to study the role of observer (camera) movement. To compensate for this movement, a simple correlation-based tracking mechanism was implemented into the sampling procedure. Tracking was applied before temporal filtering (temporal decorrelation), so each 440 ms (= Δt + 400 ms) sequence was tracked. Let x_1 be the first vectorized sample window in a 440 ms sequence, and X_n be the set of candidate windows in the nth video frame of the sequence, differing at most 10 pixels from the spatial position of the first window x_1. The nth window x_n was chosen by x_n = arg max_{x ∈ X_n} x_{n−1}^T x / (‖x_{n−1}‖ ‖x‖). The results, shown in Figure 6C, are qualitatively similar to the original results, showing that the main results are not a consequence of observer movement. The number of low-frequency receptive fields seems to be smaller in the control results. This change is probably caused by decreased large-scale movement (see section 4).
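The tracking rule above can be sketched numerically as follows (a minimal illustration; the function and variable names are ours, not from the paper):

```python
import numpy as np

# Sketch of the correlation-based tracking rule
# x_n = argmax_{x in X_n} x_{n-1}^T x / (||x_{n-1}|| ||x||):
# among the candidate windows of the next frame, pick the one whose
# normalized correlation with the previous window is highest.
def track_next_window(prev_window, candidates):
    prev = np.asarray(prev_window, dtype=float)
    prev = prev / np.linalg.norm(prev)
    scores = [prev @ np.asarray(c, dtype=float) / np.linalg.norm(c)
              for c in candidates]
    return int(np.argmax(scores))   # index of the selected window in X_n
```

In the paper the candidates are windows displaced by at most 10 pixels from the previous window's position; here any list of flattened windows can be passed in.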
3.4.6 Control Experiment VIII: Ordinary Linear Correlation. Finally, the purpose of control experiment VIII was to show that higher-order correlation is indeed needed for the emergence of simple-cell-like filters. To study this, we computed the optimal filter solutions for maximizing linear correlation f_ℓ(w_k) = E_t{y_k(t) y_k(t − Δt)} (see also Mitchison, 1991).¹ The unit variance constraint is used here again, so the problem is equivalent to minimizing E_t{(y_k(t) − y_k(t − Δt))²} with the same constraint. A

¹ Note that this objective function is defined for single filters. A similar single-unit rule for optimizing E_t{g(y_k(t)) g(y_k(t − Δt))} can be defined and optimized for temporal response strength correlation. The optima for this objective function are similar to those in Figure 2, but the receptive fields are more elongated. In addition, there are problems with obtaining a complete basis because a deflationary algorithm (Hyvärinen et al., 2001), in which first-extracted solutions dominate, has to be used. In the case of linear correlation, we do not have this problem since a closed-form solution can be found.
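Both the equivalence and the closed-form solution can be checked numerically. The sketch below (our own illustration, with made-up toy data and variable names) verifies that, for unit-variance outputs, maximizing E_t{y(t)y(t − Δt)} is the same as minimizing E_t{(y(t) − y(t − Δt))²}, and that in whitened coordinates the minimizer is an eigenvector of the covariance of temporal differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 8
x = np.cumsum(rng.standard_normal((n, d)), axis=0)  # temporally smooth toy data
x -= x.mean(axis=0)

# Whiten: C_x = E D E^T, z(t) = D^{-1/2} E^T x(t), so the unit-variance
# constraint on y = u^T z becomes simply ||u|| = 1.
evals, E = np.linalg.eigh(x.T @ x / n)
z = x @ E / np.sqrt(evals)

# Identity behind the equivalence: for unit-variance y,
# E{(y(t) - y(t-dt))^2} = 2 - 2 E{y(t) y(t-dt)}.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
y = z @ u
corr = np.mean(y[1:] * y[:-1])
sq_diff = np.mean((y[1:] - y[:-1]) ** 2)
assert abs(sq_diff - (2 - 2 * corr)) < 1e-2

# Closed-form minimizer: the eigenvector of the covariance of whitened
# temporal differences with the smallest eigenvalue.
dz = np.diff(z, axis=0)
M = dz.T @ dz / len(dz)
mvals, mvecs = np.linalg.eigh(M)
u_best = mvecs[:, 0]
y_best = z @ u_best
assert np.mean(np.diff(y_best) ** 2) <= np.mean(np.diff(y) ** 2)
```

The remaining eigenvectors, in order of increasing eigenvalue, give the subsequent minima under the decorrelation constraint, matching the principal-component-like behavior described below.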
closed-form solution can be derived from necessary Karush-Kuhn-Tucker (KKT) conditions and an additional zero DC constraint. Let C_x = EDE^T, from which the DC eigenvector corresponding to zero eigenvalue has been dropped out. The necessary conditions are fulfilled by those eigenvectors of E D^{−1} E^T E_t{(x(t) − x(t − Δt))(x(t) − x(t − Δt))^T} that correspond to nonzero eigenvalues. Analysis of sufficient KKT conditions reveals that the filters fulfilling the first-order conditions behave as in the case of principal components; that is, after the minimizing filter is found and removed from the set of eigenvectors, another eigenvector from the set will be the next minimum, assuming that the output of the next selected filter has to be uncorrelated with the outputs of the previously selected ones. The eigenvector corresponding to the smallest eigenvalue is the global minimum, and the next minimum is the one with the next smallest eigenvalue. Figure 6D shows the resulting filters, sorted according to the corresponding eigenvalue (smallest eigenvalue top left). These filters resemble Fourier basis vectors, and not simple-cell receptive fields. Thus, we see that emergence of localized receptive fields requires the use of a nonlinear temporal correlation measure.

4 Discussion

4.1 Temporal Coherence of Large Responses. Temporal response strength correlation, as defined by equation 2.3, does not explicitly measure the rate of change of the output signal. Therefore, it is important to examine what type of temporal coherence the objective function measures. In order to do this, consider three different temporally uncorrelated signals, y_1(t), y_2(t), and y_3(t), depicted in Figure 7. (Note that in the main experiment, the output signals y_k(t) are temporally uncorrelated because they are spatially linearly filtered from temporally decorrelated input data.) All of these signals have unit energy and a gaussian marginal distribution.
Signals y_2(t) and y_3(t) have been created by reordering the samples of y_1(t) so that they contain two intervals of high amplitude (see the figure caption for details). Signal y_3(t) has the largest temporal response strength correlation of these signals, as measured by equation 2.3. This is because the objective function emphasizes the temporal coherence of large amplitudes. This is not true for an arbitrary measure of amplitude correlation. For example, for g(x) = √|x| (concave on (0, ∞)), signal y_2(t) has a larger measure than y_3(t) (f(y_2(t)) ≈ 0.73, f(y_3(t)) ≈ 0.71).

4.2 Temporal Coherence vs. Sparseness. We saw in the previous section that our objective function gives large values when both y_k(t) and y_k(t − Δt) have large amplitudes, thus emphasizing the correlation of large activations. This property must not, however, be confused with sparseness. Sparseness of y_k(t) means that very large amplitudes, as well as very small ones, are relatively common. It is thus a property of the marginal distribution of
Figure 7: Temporal response strength correlation emphasizes temporal coherence of large amplitudes. Three temporally uncorrelated signals y(t) with unit energy and gaussian marginal distributions (A, C, E) and their rectified magnitudes ln cosh y(t) (B, D, F). The signals in C and E were obtained by rearranging the time indices of the signal in A. This was done so that the two intervals of high amplitude in these signals contain the samples with the largest amplitudes, in random order. In C, the total length of the intervals is half of the total signal length; in E, this ratio is 1/5. The signal in E has the largest temporal response strength correlation, as measured by equation 2.3 (f(y_1(t)) ≈ 0.13, f(y_2(t)) ≈ 0.23, f(y_3(t)) ≈ 0.29). This illustrates the fact that the objective function emphasizes the temporal correlation of large amplitudes. See the text for discussion.
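The effect the caption describes can be reproduced with a small numerical sketch (our own illustration, not code from the paper): reordering a unit-energy gaussian signal so that its large amplitudes become adjacent leaves the marginal distribution unchanged but raises f as measured with g(x) = ln cosh x.

```python
import numpy as np

def g(x):
    return np.log(np.cosh(x))

def strength_corr(y, dt=1):
    # Single-signal version of equation 2.3: E_t{ g(y(t)) g(y(t - dt)) }
    return np.mean(g(y[dt:]) * g(y[:-dt]))

rng = np.random.default_rng(0)
y1 = rng.standard_normal(20_000)
y1 /= np.sqrt(np.mean(y1 ** 2))       # unit energy, gaussian marginal

# Reorder the same samples so that similar amplitudes become adjacent;
# temporal coherence of large responses rises, sparseness does not change.
y_grouped = y1[np.argsort(np.abs(y1))]
assert strength_corr(y_grouped) > strength_corr(y1)
```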
Figure 8: Examples of transformations inducing approximate local translations. The local area is marked with a dashed square. (A) Planar rotation. (B) Bending.
y_k(t). The temporal correlation of large amplitudes says nothing about their frequency or any other aspect of the marginal distribution of y_k(t). All the signals in Figure 7 (on the left) have a gaussian marginal distribution, yet the signals vary considerably in their temporal coherence, as measured by our objective function. Our measure of temporal coherence could indirectly measure the marginal distribution (sparseness) only where there is little change in the data, that is, if the static part of the image sequence dominates. However, this possibility was ruled out by the use of temporal decorrelation, which ensures that there is quite a large amount of change in the data, as shown in Figure 1. Ultimately, control experiment VI showed that the use of temporal coherence produced similar results even if the static part was removed completely, thus proving that the principle of temporal coherence is distinct from sparse coding.

4.3 An Intuitive Explanation of Results.

4.3.1 The Importance of Translation in Natural Image Sequences. The most universal local visual properties of objects are edges and lines, so we limit the discussion here to their dynamic properties. Objects can undergo a number of transformations in image sequences: translation, rotation, occlusion, and, for nonrigid objects, deformation. A transformation in the 3D space can induce a different transformation in an image sequence. For example, a translation toward the camera induces a change of object size in the image sequence. Our hypothesis is that for local edges and lines and during short time intervals, most 3D object transformations result in local translations of these elements in image sequences. This is, of course, true for 3D translations of objects. Figure 8 illustrates this phenomenon for two other transformations: a planar rotation and bending of an object. Note that the effect illustrated in Figure 8A is even more pronounced if object rotation is not purely planar.
4.3.2 Why Gabor-Like Filters Maximize Correlation of Square-Rectified Responses. In order to demonstrate the correlation of square-rectified responses at consecutive time points, we will consider the interaction of features and filters in one dimension (orthogonal to the orientation of the filter). This allows us to consider the effect of local translations in a simplified setting. Figure 9 illustrates, in a simplified case, why the temporal response strengths of lines and edges correlate positively as a result of Gabor-like filter structure. Prototypes of two different types of image elements (the profiles of a line and an edge), which both have a zero DC component, are shown in the topmost row of the figure. The leftmost column shows the profiles of three different filters with unit norm and zero DC component: a Gabor-like filter, a sinusoidal (Fourier basis-like) filter, and an impulse filter. The rest of the figure shows the square-rectified responses of the filters to the inputs as functions of spatial displacement of the input. Consider the rectified response of the Gabor-like filter to the line and the edge. The squared response at time t − Δt (spatial displacement zero) is strongly positively correlated with the response at time t, even if the line or edge is displaced slightly. This shows how small local translations of basic image elements still yield large values of temporal response strength correlation for Gabor-like filters. If you compare the responses of the Gabor-like filter to the responses of the sinusoidal filter, you can see that the responses to the sinusoidal filter are typically much smaller. This leads to a lower value of our measure of temporal response strength correlation, which emphasizes large values. Also, while the response of an impulse filter to an edge correlates quite strongly over small spatial displacements, when the input consists of a line, even a very small displacement will take the correlation to almost zero.
Thus, we can see that when considering three important classes of filters (filters that are maximally localized in space, maximally localized in frequency, or localized in both), the optimal filter is a Gabor-like filter localized in both space and frequency. If the filter is maximally localized in space, it fails to respond over small spatial displacements of very localized image elements. If the filter is maximally localized in frequency, its responses to the localized image features are not strong enough. Figure 10 shows why we need nonlinear correlations instead of linear ones: raw output values might correlate either positively or negatively, depending on the displacement. Thus, we see why ordinary linear correlation is not maximized for Gabor-like filters, whereas the rectified (nonlinear) correlation is.
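This three-filter comparison can be sketched numerically in one dimension (our own construction; the filter shapes are illustrative approximations of those in Figure 9):

```python
import numpy as np

n = 64
pos = np.arange(n) - n // 2

# Three unit-norm, zero-DC filters: localized in both space and frequency
# (Gabor-like), only in frequency (sinusoid), only in space (impulse).
gabor = np.exp(-pos ** 2 / 18.0) * np.cos(2 * np.pi * pos / 8.0)
sinusoid = np.cos(2 * np.pi * pos / 8.0)
impulse = np.zeros(n)
impulse[n // 2] = 1.0
filters = {}
for name, f in (("gabor", gabor), ("sinusoid", sinusoid), ("impulse", impulse)):
    f = f - f.mean()                  # zero DC component
    filters[name] = f / np.linalg.norm(f)

def sq_response(filt, shift):
    # Square-rectified response to a thin line displaced by `shift` pixels.
    line = np.zeros(n)
    line[n // 2 + shift] = 1.0
    return float(filt @ line) ** 2

# The Gabor's squared response survives a 1-pixel displacement of the line,
# the impulse filter's collapses, and the sinusoid's is uniformly weak.
g0, g1 = sq_response(filters["gabor"], 0), sq_response(filters["gabor"], 1)
i0, i1 = sq_response(filters["impulse"], 0), sq_response(filters["impulse"], 1)
s0 = sq_response(filters["sinusoid"], 0)
assert g1 > 0.3 * g0
assert i1 < 0.01 * i0
assert s0 < 0.5 * g1
```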
4.3.3 Emergence of Simple-Cell-Like Filters. Future research is needed to provide a detailed analysis of which properties of natural image sequence data are needed for the emergence of simple-cell-like filters. However, at
Figure 9: A simplified illustration of why a Gabor-like filter, localized in both space and frequency, yields larger values of temporal response strength correlation than a filter localized only in space or only in frequency. Top row: cross sections of a line (left) and an edge (right) as functions of spatial position. Leftmost column: cross sections of three filters with unit norm and zero DC component: a Gabor-like filter (top), a sinusoidal filter (middle), and an impulse filter (bottom). The other plots in the figure show the responses of the filters to the inputs as a function of spatial displacement of the input. The Gabor-like filter yields fairly large positively correlated values for both types of input. The sinusoidal filter yields small response values. The impulse filter yields fairly large positively correlated values when the input consists of an edge, but when the input consists of a line, even a small displacement yields a correlation of almost zero.
Figure 10: A simplified illustration of why nonlinear correlation is needed for the emergence of the phenomenon. Raw response values of the Gabor-like filter to the line and edge may correlate positively or negatively, depending on the displacement. (See Figure 9 for an explanation of the layout of the figure.)
this point, we can provide a set of hypotheses as to why oriented, localized filters with multiple scales emerge from the data:

Orientation. The filters are oriented because the contours (edges and lines) in image sequences are oriented. Because our objective function emphasizes strong responses, the filters need to be matched to the dominant features in the data (see Figure 9).

Localization. As illustrated in Figure 9, the features are limited in width because lines and edges are mostly limited in width as well, and because short-time translations are very local. This second point is supported by the results of control experiments II through IV (see Figures 5B–5D), which showed that when Δt is increased, localization decreases. The features are limited in length for the same reason that they are limited in ICA or sparse coding: contours in real images have some curvature. Even a weak curvature makes the match (dot product) with an elongated Gabor quite weak and gives poor temporal coherence (Hoyer & Hyvärinen, 2002).

Multiple scales. The filters respond to multiple scales for two reasons. The first reason is the scale invariance of natural image sequences. This explanation is supported by the results of control experiment VI (see Figure 6B), in which the static part of image sequences was removed completely. The results exhibit a narrower range of different scales because removal of the
static part of the sequences is more likely to remove large features. The second reason is that the features move with many different velocities in the data. This is supported by the results of control experiment VII (see Figure 6C), in which tracking was used to compensate for observer movement. The results exhibit a narrower range of different scales, because tracking is likely to reduce the velocity of moving features in the data.

4.4 The Case of Spatiotemporal Receptive Fields. Due to limited computational resources, we are currently unable to estimate the most temporally coherent spatiotemporal receptive fields. However, argumentation similar to the intuitive explanation provided above can be given to illustrate a similar phenomenon in the spatiotemporal case. Figure 11 illustrates a case in which a vertical line is moving in the image sequence. The response of a simple-cell-like motion-selective spatiotemporal filter (DeAngelis, Ohzawa, & Freeman, 1993a), whose spatiotemporal position and orientation match the initial position of the line and its direction of movement, is large in magnitude at consecutive time points. This illustrates how large temporal response strength correlation could arise in the case of spatiotemporal receptive fields.

4.5 The Case of a Nonlinear (Nonnegative) Cell Model. Linear filters with negative coefficients or negative-valued data have signed outputs. Their widespread use as simple-cell models is often motivated by an interpretation in which the term simple cell in fact refers to a unit of two cells. These two positive-output cells, with reversed polarities, are modeled using a single linear filter with signed output.² This two-cell approach will be explained in detail below. At this point, notice that this coupling is very different from complex-cell pooling of simple cells.
In complex-cell pooling, the coupled cells respond to similar features at different spatial positions, whereas two cells with opposite polarities respond to similar features at the same spatial position. For single simple cells, a more realistic basic model of the mapping from input to output via a receptive field is a combination of linear filtering and a nonlinearity called half-wave rectification (Heeger, 1992). Using the same notation as above, the output of the cell, y_k(t), is computed by

y_k(t) = max{0, w_k^T x(t)},   (4.1)

instead of the purely linear input-output relationship, equation 2.1. In this model, the output of a cell is never negative.

² Another possibility would be to consider the negative and positive values as changes from a maintained firing rate. However, simple cells have a low maintained firing rate, which makes this approach undesirable (Heeger, 1992).
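Equation 4.1 can be written as a one-line sketch (our own naming; a batch of windows can be passed as columns):

```python
import numpy as np

# Half-wave rectified cell model of equation 4.1: y_k(t) = max{0, w_k^T x(t)}.
# The linear response is clipped at zero, so the output is never negative.
def halfwave_output(w, x):
    return np.maximum(0.0, np.asarray(w) @ np.asarray(x))
```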
Figure 11: An illustration of how temporal response strength correlation could be exhibited by the outputs of simple-cell-like spatiotemporal receptive fields. A phenomenon analogous to translation in the spatial case can be observed in the spatiotemporal case. Let x and y denote the horizontal and vertical spatial coordinates, respectively, and let t denote the temporal coordinate. (A) The spatiotemporal trace (solid line) of a moving vertical line is shown here in the x–t coordinate system. The plot is similar for all y-coordinates because the moving line is vertical. Two different overlapping spatiotemporal input windows, separated by a small time difference, are also marked: one with a dashed line and the other with a dotted line. (B) A simple-cell-like spatiotemporal receptive field, with position and orientation that match the initial position of the line and its direction of movement, responds strongly to the moving line. Here the spatiotemporal filter has been superimposed over the dashed temporal window. White indicates large positive values in the filter, dark indicates large negative values, and middle gray indicates zero values. (C) When the same spatiotemporal receptive field, at the same spatial position, is applied to the same input a moment later (dotted spatiotemporal input window), the response is still strong, but the sign changes. Therefore, the temporal response strength correlation of the outputs of the simple-cell-like spatiotemporal receptive field would be large for this kind of input.
The purely linear model 2.1 combines the outputs of two such positive-output simple cells with reversed polarities. This is implemented so that the positive output values correspond to the output of one cell, and the negative values correspond to the output of another cell, with an otherwise similar receptive field except for a change of the sign of all the connection weights. The exact way an input pattern is mapped into a response in such a model is as follows. Let y_{k,1}(t) and y_{k,2}(t) denote the outputs of two cells with reversed polarities. Their outputs are given according to equation 4.1 by y_{k,1}(t) = max{0, w_k^T x(t)} and y_{k,2}(t) = max{0, −w_k^T x(t)}. The overall output
is defined as

y_k(t) = y_{k,1}(t) − y_{k,2}(t).   (4.2)
It is straightforward to show that this model is equivalent to a purely linear model, that is, that y_{k,1}(t) − y_{k,2}(t) = w_k^T x(t). Therefore, one might interpret our model as measuring the temporal coherence of this two-cell unit, where the cells have similar receptive fields with reversed polarities. A large value of the objective function indicates that either of the two cells responds strongly to the input. As mentioned above, using the linear model is a common approach. The same model is used, for example, in Bell and Sejnowski (1997), Olshausen and Field (1996), and van Hateren and van der Schaaf (1998), which makes the comparison of results obtained with these different models feasible. However, a natural question is whether the same principle applies to the outputs of individual, half-wave rectified simple cells. To address this issue, we computed the optimal solution for half-wave rectified cell outputs; that is, we replaced the original linear input-output relationship (see equation 2.1) with the half-wave rectified relationship (see equation 4.1) and computed the optimal solution for this model. The same constraint (see equation 2.4) was used in this case for simplicity. Instead of using the temporally decorrelated data set, we used the orthonormalized data set with Δt = 120 ms, also used in control experiment VI (see section 3.4.4). This is because by using the orthonormalized data set, we excluded the possibility of obtaining simple-cell-like receptive fields because of the static part of the video. This is important here since the static part might yield positive responses at consecutive time instances. The results of this experiment are shown in Figure 12A. Although the features are not as well defined as before, the resulting filters still show orientation, localization, and different spatial scales.
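The two-cell equivalence is easy to confirm numerically (a sketch with made-up dimensions):

```python
import numpy as np

# With y_{k,1}(t) = max{0, w^T x(t)} and y_{k,2}(t) = max{0, -w^T x(t)},
# the combined output of equation 4.2 equals the purely linear response.
rng = np.random.default_rng(0)
w = rng.standard_normal(16)
x = rng.standard_normal((16, 100))    # 100 input windows as columns
lin = w @ x
y1 = np.maximum(0.0, lin)             # on-cell
y2 = np.maximum(0.0, -lin)            # reversed-polarity off-cell
assert np.allclose(y1 - y2, lin)
```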
What is the reason for the emergence of such features even in this case, when the negative part of the linear filter response has been discarded and the static part of the video has been removed completely? First, as in the purely linear case, the objective function emphasizes the temporal correlation of large responses, and oriented and localized filters respond strongly to lines and edges. But how does temporal correlation arise? An illustration of a possible explanation is shown in Figure 12B. When a Gabor-like filter is applied to a line, the half-wave rectified output still correlates positively over time, but more weakly and over longer spatial displacements. Finally, let us note that some form of temporal coherence of simple-cell outputs is implicitly assumed in many studies. The firing rate of a simple cell is typically assumed to code for the output of a linear filter, possibly after some simple nonlinear transformations, such as half-wave rectification. This requires that the output of the linear filter has some temporal coherence. If the output of the linear filter changes quite randomly, the firing rate cannot
Figure 12: Temporal response strength correlation for a half-wave rectified (nonlinear, nonnegative) cell model. (A) The results of running the algorithm for a nonlinear cell model with half-wave rectification of cell output. (B) When a simple-cell-like filter is applied to an input containing a moving line, the half-wave rectified output is still correlated positively over time, but the correlation is weaker, and the spatial displacement needed for the correlation is larger. Cross sections of a line (left) and a filter (middle), and the half-wave rectified output (right). See also Figure 9.
provide much information on the output, because then the firing rate is very noisy when computed over short time intervals. Thus, maximization of temporal coherence could have a relation to maximization of the efficiency of a code based on firing rates.

4.6 Implications of Our Results. In this article, we have shown that simple-cell-type receptive fields maximize temporal response strength correlation at cell output when the input consists of natural image sequences. Temporal response strength correlation, or temporal correlation of rectified cell outputs, can be considered as a measure of temporal coherence. Our findings have several important implications. First, temporal coherence provides an alternative or complementary theory to sparse coding as a computational principle behind the formation of simple-cell receptive fields. The application of either one of these principles results in the emergence of simple-cell properties from natural data. Second, Földiák and others (Földiák, 1991; Kayser et al., 2001; Kohonen, Kaski, & Lappalainen, 1997; Wiskott & Sejnowski, 2002) have proposed that invariant visual representations, such as those found in complex cells, can be found by maximizing temporal coherence. (For an alternative complex-cell model using sparse coding, see Hyvärinen & Hoyer, 2001.) Our results show that this principle is applicable to the visual system even on the level of simple cells, which usually are not considered as invariant detectors. Although in some complex-cell models (Kayser et al., 2001; Kohonen et al., 1997) simple-cell receptive fields are obtained as by-products, the learning is strongly modulated by the complex cells and therefore is very different from our model, which considers only the statistics of simple-cell outputs. Moreover, the simple-cell receptive fields learned in Kayser et al. (2001) and Kohonen et al. (1997) do not seem to have the important properties of spatial localization and multiresolution (different scales). Furthermore, whereas most earlier research results linking temporal coherence and properties of the visual system have been based on theoretical considerations and simulated data, the results published in this article have been computed from natural image sequence data. To our knowledge, this is the first time that localized and oriented receptive fields, with different scales, have been shown to emerge from natural data using the principle of temporal coherence.
A step like this is important for any theory that tries to explain the structure and functionality of sensory neural networks using the statistical properties of natural input data.
Appendix: Details of the Symmetric Gradient Projection Algorithm

This appendix gives a detailed description of the optimization algorithm. First, the maximization of objective function 2.3 under constraint 2.4 can be made easier by employing whitening, a temporary change of coordinates, so that constraint 2.4 is transformed into an orthonormality constraint. Let C_x = EDE^T denote the eigenvalue decomposition of matrix C_x. If an eigenvalue of C_x is zero, then that eigenvalue can be dropped out of the eigenvalue matrix D, and the corresponding eigenvector can be removed from the eigenvector matrix E. This was the case in our experiments, because preprocessing reduced the dimensionality of the input data x(t) by one when the DC component was removed from the data. This dimensionality reduction also means that the number of filters that can be extracted is K ≤ N² − 1.
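The whitening step with removal of zero eigenvalues can be sketched in NumPy as follows. This is an illustrative reconstruction, not the authors' code; the data matrix X, its dimensions, and the tolerance are assumptions made for the sketch.

```python
import numpy as np

# Illustrative sketch of PCA whitening with removal of zero eigenvalues.
# X holds the input vectors x(t) as columns; names are hypothetical.
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 2000))
X -= X.mean(axis=0, keepdims=True)          # remove the DC component of each sample

Cx = X @ X.T / X.shape[1]                   # sample covariance matrix
evals, E = np.linalg.eigh(Cx)               # Cx = E D E^T
keep = evals > 1e-10                        # drop the (near-)zero eigenvalue
D, E = evals[keep], E[:, keep]

Z = np.diag(D ** -0.5) @ E.T @ X            # z(t) = D^{-1/2} E^T x(t)

# After whitening, the sample covariance of z(t) is the identity.
print(np.allclose(Z @ Z.T / Z.shape[1], np.eye(int(keep.sum())), atol=1e-8))
```

Removing the DC component leaves exactly one zero eigenvalue here (the constant direction), so the whitened data has dimension one less than the input.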
J. Hurri and A. Hyvärinen
Defining the matrix

U = WED^{1/2},   (A.1)

constraint 2.4 can be expressed as an orthonormality constraint,

UU^T = I.   (A.2)

This simpler constraint will be easier to handle in the optimization algorithm below. To express objective function 2.3 using the same transformed filter matrix U, we have to solve equation A.1 for matrix W. This is equivalent to solving the matrix equation E^T D^{1/2} W^T = U^T. In the case where the DC component has been removed, this is an underdetermined set of linear equations. By imposing the additional zero-DC constraint on the filters in the rows of W, the solution is given by

W = UD^{−1/2} E^T   (A.3)

(see Hurri, 1997, for details). Substituting this into equation 2.2 gives

y(t) = UD^{−1/2} E^T x(t) = Uz(t),   (A.4)

where the signal vector z(t) = D^{−1/2} E^T x(t). This transformation from the input data x(t) to z(t) is called PCA whitening (Hyvärinen et al., 2001). The above shows that after the input data x(t) are whitened, we can optimize

f(U) = Σ_{k=1}^{K} E_t{g(y_k(t)) g(y_k(t − Δt))},   (A.5)

where the output y(t) is given by equation A.4, with respect to orthonormality constraint A.2. When the solution to this problem is found, the solution to the original problem is given by equation A.3.

The actual optimization algorithm used for this constrained optimization problem is a variant of the gradient projection method of Rosen (for the original algorithm, see Luenberger, 1969). Let α(m) be a nonnegative decreasing sequence of real numbers that converges to zero (the initial step length α(0) is changed adaptively to speed up convergence). Let U(0) be a random orthonormal matrix, and let U(n) be the value of matrix U at iteration n. The algorithm finds a new candidate point by projecting the matrix U(n) + α(m) df(U(n))/dU onto the constraint surface defined by equation A.2. If the candidate point is not an improvement, m is increased by one to find a new candidate point.

The critical step in the algorithm is the projection onto the constraint surface. This is achieved by optimal symmetric orthogonalization. Let A
be a square matrix with full rank. Then B = A(A^T A)^{−1/2} is the nearest orthonormal matrix to A with respect to the Frobenius matrix norm (see Fan & Hoffman, 1955, theorem 1, for a generalized proof of this). This is exactly the property required to achieve the projection step.

The resulting algorithm for maximizing temporal response strength correlation (TRSC) is shown below. The input arguments of the program are the whitened data vectors z(t), convergence tolerance ε, and initial step length α(0) (in this version, α(m) = α(0)/2^m). Adaptation of the initial step length α(0) is not included in this description. The algorithm assumes convergence when the objective function changes very little between successive steps. In our experiments, convergence tolerance ε varied between 10^{−3} and 10^{−6}. Note that, denoting U = [u_1 ··· u_K]^T, the kth row of df(U)/dU is given by the transpose of

∂f(U)/∂u_k = ∂/∂u_k E_t{g(y_k(t)) g(y_k(t − Δt))}
           = E_t{g′(y_k(t)) g(y_k(t − Δt)) z(t) + g(y_k(t)) g′(y_k(t − Δt)) z(t − Δt)}.

funct U = maxTRSC(z(t), ε, α(0))
    U(0) := rand()   comment: a random initial starting point
    U(0) := symmetricOrth(U(0))
    fOld := 0
    n := 0
    while f(U(n)) − fOld > ε
        fOld := f(U(n))
        m := 0
        while f(symmetricOrth(U(n) + (α(0)/2^m) df(U(n))/dU)) ≤ fOld
            m := m + 1
        end
        U(n + 1) := symmetricOrth(U(n) + (α(0)/2^m) df(U(n))/dU)
        n := n + 1
    end
    U := U(n)
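For concreteness, the procedure above can be sketched in NumPy. This is not the authors' implementation: the nonlinearity g(y) = y², the iteration caps, and all parameter values are assumptions made so the sketch is self-contained and always terminates.

```python
import numpy as np

def symmetric_orth(A):
    """B = A (A^T A)^(-1/2): nearest row-orthonormal matrix in Frobenius norm."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def trsc(U, Z, g):
    """Objective A.5 with unit lag: sum_k E_t{ g(y_k(t)) g(y_k(t-1)) }, y = Uz."""
    Y = U @ Z
    return float(np.sum(np.mean(g(Y[:, 1:]) * g(Y[:, :-1]), axis=1)))

def grad_trsc(U, Z, g, gp):
    """Row k: E_t{ g'(y_k(t)) g(y_k(t-1)) z(t) + g(y_k(t)) g'(y_k(t-1)) z(t-1) }."""
    Y = U @ Z
    Y0, Y1 = Y[:, 1:], Y[:, :-1]
    T = Y0.shape[1]
    return ((gp(Y0) * g(Y1)) @ Z[:, 1:].T + (g(Y0) * gp(Y1)) @ Z[:, :-1].T) / T

def max_trsc(Z, K, eps=1e-5, alpha0=1.0, max_iter=1000,
             g=lambda y: y * y, gp=lambda y: 2 * y, seed=0):
    rng = np.random.default_rng(seed)
    U = symmetric_orth(rng.standard_normal((K, Z.shape[0])))
    f_old = 0.0
    for _ in range(max_iter):
        f_cur = trsc(U, Z, g)
        if f_cur - f_old <= eps:          # objective changed very little: stop
            break
        f_old = f_cur
        G, m = grad_trsc(U, Z, g, gp), 0
        # halve the step until the projected candidate improves on f_old
        # (m is capped here so the sketch cannot loop forever)
        while m < 50 and trsc(symmetric_orth(U + alpha0 / 2**m * G), Z, g) <= f_old:
            m += 1
        U = symmetric_orth(U + alpha0 / 2**m * G)
    return U
```

symmetric_orth exploits the SVD identity A(AᵀA)^{−1/2} = UVᵀ, which avoids forming the matrix square root explicitly.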
funct B = symmetricOrth(A)
    B := A(A^T A)^{−1/2}

Acknowledgments

We thank Hans van Hateren for permission to use the natural image sequences and for providing us with the program used to measure receptive field properties. We also thank Patrik Hoyer for helpful comments.
References

Becker, S. (1993). Learning to categorize objects using temporal coherence. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 361–368). San Mateo, CA: Morgan Kaufmann.
Becker, S., & Hinton, G. (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356), 161–163.
Bell, A., & Sejnowski, T. J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). Linearity and normalization in simple cells of the macaque primary visual cortex. Journal of Neuroscience, 17(21), 8621–8644.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1993a). Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. I. General characteristics and postnatal development. Journal of Neurophysiology, 69(4), 1091–1117.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1993b). Spatiotemporal organization of simple-cell receptive fields in the cat's striate cortex. II. Linearity of temporal and spatial summation. Journal of Neurophysiology, 69(4), 1118–1135.
Dong, D. W., & Atick, J. (1995). Temporal decorrelation: A theory of lagged and nonlagged responses in the lateral geniculate nucleus. Network: Computation in Neural Systems, 6(2), 159–178.
Fan, K., & Hoffman, A. J. (1955). Some metric inequalities in the space of matrices. Proceedings of the American Mathematical Society, 6, 111–116.
Field, D. (1994). What is the goal of sensory coding? Neural Computation, 6(4), 559–601.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2), 194–200.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–198.
Hoyer, P. O., & Hyvärinen, A. (2002). A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12), 1593–1605.
Hurri, J. (1997).
Independent component analysis of image data. Master's thesis, Helsinki University of Technology. Available on-line: http://www.cis.hut.fi/jarmo/publications/.
Hyvärinen, A., & Hoyer, P. O. (2001). A two-layer sparse coding model learns simple and complex cell receptive fields and topography from natural images. Vision Research, 41(18), 2413–2423.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Kayser, C., Einhäuser, W., Dümmer, O., König, P., & Körding, K. (2001). Extracting slow subspaces from natural videos leads to complex cells. In G. Dorffner, H. Bischof, & K. Hornik (Eds.), Artificial neural networks—ICANN 2001 (pp. 1075–1080). New York: Springer-Verlag.
Kohonen, T., Kaski, S., & Lappalainen, H. (1997). Self-organized formation of various invariant-feature filters in the adaptive-subspace SOM. Neural Computation, 9(6), 1321–1344.
Luenberger, D. G. (1969). Optimization by vector space methods. New York: Wiley.
Mitchison, G. (1991). Removing time variation with the anti-Hebbian differential synapse. Neural Computation, 3(3), 312–320.
Olshausen, B. A., & Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
Olshausen, B. A., & Field, D. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Oppenheim, A., & Schafer, R. (1975). Digital signal processing. Upper Saddle River, NJ: Prentice Hall.
Palmer, S. E. (1999). Vision science—Photons to phenomenology. Cambridge, MA: MIT Press.
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1216.
Stone, J. (1996). Learning visual parameters using spatiotemporal smoothness constraints. Neural Computation, 8(7), 1463–1492.
van Hateren, J. H., & Ruderman, D. L. (1998). Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265(1412), 2315–2320.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London B, 265(1394), 359–366.
Wiskott, L., & Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770.

Received November 14, 2001; accepted August 23, 2002.
LETTER
Communicated by Michael Schmitt
Continuous-Time Symmetric Hopfield Nets Are Computationally Universal

Jiří Šíma
[email protected] Institute of Computer Science, Academy of Sciences of the Czech Republic, P.O. Box 5 182 07 Prague 8, Czech Republic
Pekka Orponen
[email protected]
Laboratory for Theoretical Computer Science, Helsinki University of Technology, P.O. Box 5400, FIN-02015 HUT, Finland
We establish a fundamental result in the theory of computation by continuous-time dynamical systems by showing that systems corresponding to so-called continuous-time symmetric Hopfield nets are capable of general computation. As is well known, such networks have very constrained Lyapunov-function controlled dynamics. Nevertheless, we show that they are universal and efficient computational devices, in the sense that any convergent synchronous fully parallel computation by a recurrent network of n discrete-time binary neurons, with in general asymmetric coupling weights, can be simulated by a symmetric continuous-time Hopfield net containing only 18n + 7 units employing the saturated-linear activation function. Moreover, if the asymmetric network has maximum integer weight size w_max and converges in discrete time t*, then the corresponding Hopfield net can be designed to operate in continuous time Θ(t*/ε) for any ε > 0 such that w_max 2^{12n} ≤ ε 2^{1/ε}. In terms of standard discrete computation models, our result implies that any polynomially space-bounded Turing machine can be simulated by a family of polynomial-size continuous-time symmetric Hopfield nets.

1 Introduction

In recent years, a number of studies have sought to understand the computational characteristics of "natural" dynamical systems. The results achieved include universal computation results for several types of ordinary differential equations (Asarin & Maler, 1994; Branicky, 1994), partial differential equations (Omohundro, 1984), and discrete iterations (Koiran, Cosnard, & Garzon, 1994; Moore, 1990; Siegelmann & Sontag, 1995). One much studied class of systems are those that are defined by various neural network models (Golden, 1996; Haykin, 1999; Siegelmann, 1999). This interest is motivated

Neural Computation 15, 693–733 (2003)
© 2003 Massachusetts Institute of Technology
partly by the quest to understand the fundamental limits and possibilities of practical neurocomputing, and partly by the realization that despite their formal simplicity, neural networks are computationally quite powerful and thus may serve as a useful reference model for investigating more complicated systems. In general, the computational capabilities of discrete-time systems are fairly well understood, but in the area of continuous-time systems, much work remains to be done. (For an overview of the issues, see Moore, 1998, and Orponen, 1997a.)

Continuous-time recurrent neural networks are an attractive class of computational models with applications in control, optimization, and signal processing (Cichocki & Unbehauen, 1993; Medsker & Jain, 2000). Probably the best-known and most widely used continuous-time recurrent network model is that popularized by Hopfield (1984) and often called the continuous-time Hopfield model. The dynamics of this model had actually been analyzed earlier by Cohen and Grossberg (1983) in a more general setting, but because of the affinity to the very influential discrete-time binary-state version of the model (Hopfield, 1982), this additive special case of the Cohen-Grossberg equations has become associated with Hopfield's name. As practical neural networks, proposed uses of Hopfield nets include associative memory (Hopfield, 1984) and fast approximate solution of combinatorial optimization problems (Hopfield & Tank, 1985), and designs exist for implementing them in analog electrical (Hopfield, 1984) and optical (Stoll & Lee, 1988) hardware.

In this article, we prove a fundamental result concerning the computational power of continuous-time symmetric Hopfield nets. It is well known (Cohen & Grossberg, 1983; Hopfield, 1984) that the dynamics of any Hopfield net with a symmetric coupling weight matrix is governed by a Lyapunov, or energy, function. This is a bounded function defined on the state-space of a network, whose values are properly decreasing along any nonconstant trajectory of the network's dynamics. In particular, such a symmetric network always converges from any initial state toward some stable equilibrium state. This is a very useful property for obtaining guaranteed behavior in practical applications, but it would at first sight seem to limit the networks' general dynamical capabilities severely. For instance, nondamping oscillations of the network state, which seem to be an essential prerequisite of general computation, obviously cannot be created under this constraint (i.e., strictly speaking, continuous-time Hopfield nets cannot simulate even a single alternating bit), whereas such oscillations are easily obtained in networks with asymmetric coupling weights. Nevertheless, we shall show that infinite oscillations are the only feature of general-purpose digital computation that cannot be reproduced in continuous-time symmetric Hopfield nets. More precisely, we prove that any converging discrete-time computation described by a recurrent network of n ≥ 1 binary neurons working under synchronous fully parallel mode and with in general asymmetric coupling weights can be embedded in a
continuous-time Hopfield net of 18n + 7 units with a symmetric coupling weight matrix and using the saturated-linear activation function. In particular, the simulation of t* discrete update steps can be achieved in continuous time Θ(t*/ε), for any ε > 0 such that w_max 2^{12n} ≤ ε 2^{1/ε}, where w_max ≥ 1 is the maximum integer weight size of the discrete network. An experimental validation of this result appeared previously (Šíma & Orponen, 2000).

A key observation is that any terminating computation by a discrete-time deterministic network with state-space {0,1}^n must converge within 2^n steps. A basic technique used in our proof is the construction of a symmetric continuous-time clock network—a simulated (n + 2)-bit binary counter—that, using 8n + 7 units, produces a sequence of 2^n well-controlled oscillations (generated by the simulated second least significant counter bit) before it converges. This sequence of clock pulses is used to drive the rest of the network, where each discrete neuron is simulated by a symmetrically coupled subnetwork of 10 continuous-time units. The continuous-time clock is already of some interest from a dynamical systems perspective, because it provides, to our knowledge, the first known example of a continuous-time Lyapunov system whose convergence time grows exponentially in the system dimension (Šíma & Orponen, 2001a).

A similar, but considerably simpler, construction was used in the discrete-time setting (Orponen, 1996) to prove the computational equivalence of symmetric and convergent asymmetric binary neural networks. Our construction also provides the technique for improving this discrete-time simulation by reducing its overhead from a quadratic number of units to the asymptotically optimal linear size (Šíma, Orponen, & Antti-Poika, 2000). The original idea for the discrete-time clock network used by Orponen (1996), and on which our current construction is based, stems from the result by Goles and Martínez (1989). Another related work (Orponen, 1997b) concerns the simulation of discrete-time binary networks by continuous-time asymmetric networks that do not adhere to the Lyapunov property.

It is quite easy to see (Lepley & Miller, 1983; Orponen, 1996) that families of polynomial-size discrete neural networks are computationally equivalent to (nonuniform) polynomially space-bounded Turing machines; more precisely, they compute the complexity class PSPACE/poly (Balcázar, Díaz, & Gabarró, 1995). By the result in this article, we now know that continuous-time symmetric Hopfield nets are also at least as powerful; given any polynomially space-bounded Turing machine, we can construct a family of polynomial-size continuous-time symmetric Hopfield nets (one network for each input length) for simulating it. This is, to our knowledge, the first result concerning the computational power of continuous-time symmetric networks, and it is somewhat surprising that they turn out to be computationally universal in this complexity-limited sense. It remains an open question whether this is also an upper bound on the power of such networks.

A related line of study concerns the computational capabilities of finite discrete-time analog-state neural networks (Siegelmann, 1999). Here it
is known that the computational power of asymmetric networks using the saturated-linear activation function increases with the Kolmogorov complexity of the real weight parameters (Balcázar, Gavaldà, & Siegelmann, 1997). With integer weights, such networks coincide with binary networks and so are equivalent to finite automata (Horne & Hush, 1996; Indyk, 1995; Šíma & Wiedermann, 1998), while with rational weights, arbitrary Turing machines can be simulated (Indyk, 1995; Siegelmann & Sontag, 1995). With arbitrary real weights, the networks can even have "super-Turing" computational capabilities (Siegelmann & Sontag, 1994). To some extent, similar results also apply to discrete-time symmetric analog networks, which are computationally equivalent to their asymmetric counterparts when an external clock pulse sequence is provided (Šíma et al., 2000). On the other hand, it is known that any amount of analog noise reduces the computational power of discrete-time analog recurrent networks to that of finite automata (Casey, 1996; Maass & Orponen, 1998) or even less (Maass & Sontag, 1999; Siegelmann, Roitershtein, & Ben-Hur, 2000).

A preliminary version of this article appeared as an extended abstract (Šíma & Orponen, 2001b). This article is organized as follows. After a brief review of the basic definitions in section 2, our main construction of the symmetric continuous-time Hopfield net simulating a given discrete-time binary neural network is outlined in section 3, where its dynamics is informally explained. The formal verification of this construction, which has the form of a rather tedious case analysis, is given in section 4. In section 5, a numerical simulation example witnessing the validity of the construction is presented. Section 6 concludes with some open problems.

2 Preliminaries

We first specify the model of a finite discrete-time binary-state recurrent neural network (DRNN).
Such a network consists of n simple computational binary units, or neurons, indexed as 1, …, n, that are connected into a generally cyclic directed graph, or architecture. Each edge (i, j) in this graph, leading from neuron i to j, is labeled with an integer coupling weight w(i, j) = w_{ji} ∈ Z. Note that for binary neurons, integer weights can be assumed without loss of generality (see Parberry, 1994). The absence of an edge in the architecture indicates a zero weight between the respective neurons, and vice versa. An instantaneous state of a DRNN at time t ≥ 0 is described by a vector y^{(t)} = (y_1^{(t)}, …, y_n^{(t)}) ∈ {0,1}^n composed of the current binary outputs (states) y_j^{(t)} from the particular neurons j = 1, …, n. We consider here only the synchronous fully parallel dynamics, in which the evolution of the network state y^{(t)} is determined for discrete time instants t = 0, 1, …, as follows. At the beginning of a computation, the network is placed in some initial state y^{(0)}, which may include an external input. At discrete time t ≥ 0, each neuron j = 1, …, n collects its local binary inputs from the states y_i^{(t)} ∈ {0,1} of
incident neurons i and determines its integer excitation,

ξ_j^{(t)} = Σ_{i=0}^{n} w_{ji} y_i^{(t)},   j = 1, …, n,   (2.1)

as the respective weighted sum of these inputs. Moreover, sum 2.1 includes an integer bias w_{j0} ∈ Z local to each neuron j, which is formally modeled as a coupling weight from an additional neuron with index 0 and constant unit output y_0^{(t)} ≡ 1. At the next instant t + 1, the new network state y^{(t+1)} is computed by applying an activation function to all excitations componentwise,

y_j^{(t+1)} = H(ξ_j^{(t)}),   j = 1, …, n,   (2.2)

where the Heaviside or threshold function,

H(ξ) = 1 for ξ ≥ 0;  H(ξ) = 0 for ξ < 0,   (2.3)

is employed.
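The update rule 2.1–2.3 is straightforward to state in code. A minimal sketch (illustrative, not from the article; the two-neuron example weights are hypothetical):

```python
import numpy as np

def drnn_step(W, b, y):
    """One synchronous fully parallel update, eqs. 2.1-2.3.
    W[j, i] = w_ji (integer weights), b[j] = w_j0 (bias), y = binary state."""
    xi = W @ y + b                  # excitations, eq. 2.1
    return (xi >= 0).astype(int)    # Heaviside threshold, eqs. 2.2-2.3

def run_drnn(W, b, y0):
    """Iterate until a fixed point; a converging deterministic n-neuron
    network must reach one within 2^n steps."""
    y = np.asarray(y0, dtype=int)
    for t in range(2 ** len(y)):
        y_next = drnn_step(W, b, y)
        if np.array_equal(y_next, y):
            return y, t
        y = y_next
    return y, 2 ** len(y)           # did not converge (e.g., it oscillates)

# Hypothetical 2-neuron example: neuron 1 is constantly on; neuron 2
# turns on once neuron 1 is on.
W = np.array([[0, 0],
              [1, 0]])
b = np.array([0, -1])
print(run_drnn(W, b, [0, 0]))
```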
is employed. We also denote by wmax D
max
jD1;:::;nIiD0;:::;n
jwji j
(2.4)
the maximum weight size in the network.

Similarly, a finite continuous-time analog neural network is composed of m analog units that operate (in our case) with the saturated-linear activation function

σ(ξ) = 1 for ξ ≥ 1;  σ(ξ) = ξ for 0 < ξ < 1;  σ(ξ) = 0 for ξ ≤ 0.   (2.5)

Hence, the states y_p(t) of analog units p = 1, …, m at time t ≥ 0 are real numbers within the interval [0, 1], and the coupling weights (including biases), denoted by v(p, q) ∈ R for edges from p to q, are reals as well. The computational dynamics of a continuous-time network is defined for every real t > 0, for example, by the following system of ordinary differential equations in the real variables y_1, …, y_m ∈ [0, 1]:

dy_p/dt (t) = −y_p(t) + σ(ξ_p(t)),   p = 1, …, m,   (2.6)

where

ξ_p(t) = Σ_{q=0}^{m} v(q, p) y_q(t)   (2.7)
is the real-valued excitation for unit p at time t ≥ 0 (cf. equation 2.1), including the bias v(0, p) ∈ R, which is associated with an additional formal variable y_0 whose value is constantly y_0(t) ≡ 1 in time. The initial network state y(0) ∈ [0, 1]^m then determines the boundary conditions for system 2.6. In particular, we shall consider the continuous-time (symmetric) Hopfield networks, whose architecture is an undirected graph with symmetric weights,

v(p, q) = v(q, p)   for every 1 ≤ p, q ≤ m.   (2.8)

The dynamics of a continuous-time symmetric Hopfield net described by system 2.6 is controlled by the following Lyapunov, or energy, function, introduced by Cohen and Grossberg (1983):

E(y) = −(1/2) Σ_{p=1}^{m} Σ_{q=1}^{m} v(q, p) y_q y_p − Σ_{p=1}^{m} v(0, p) y_p + Σ_{p=1}^{m} ∫_0^{y_p} σ^{−1}(y) dy.   (2.9)
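As a numerical illustration (a sketch, not from the article), one can integrate system 2.6 by forward Euler for a small random symmetric network and check that the energy 2.9 is nonincreasing along the trajectory; for the saturated-linear σ and states inside (0, 1), the integral term reduces to Σ_p y_p²/2. All parameter values are assumptions of the sketch.

```python
import numpy as np

def sigma(x):                          # saturated-linear activation, eq. 2.5
    return np.clip(x, 0.0, 1.0)

def energy(V, b, y):                   # eq. 2.9, with sigma^{-1}(y) = y on (0, 1)
    return -0.5 * y @ V @ y - b @ y + 0.5 * np.sum(y ** 2)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
V = (A + A.T) / 2                      # symmetric weights, eq. 2.8
b = rng.standard_normal(4)             # biases v(0, p)

y = rng.uniform(0.2, 0.8, size=4)      # initial state inside (0, 1)^4
dt, energies = 1e-3, []
for _ in range(20000):                 # integrate up to t = 20
    energies.append(energy(V, b, y))
    xi = V @ y + b                     # excitations, eq. 2.7
    y = y + dt * (-y + sigma(xi))      # forward Euler step of eq. 2.6

# E(y(t)) should decrease (up to discretization error) toward an equilibrium
print(energies[0], energies[-1])
```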
The characteristic properties of function E(y) are that it is bounded on the system's state-space [0, 1]^m and that it is properly decreasing (i.e., dE/dt < 0) along any nonconstant trajectory of the network's dynamics. It then follows that the continuous-time Hopfield net always converges, from any initial state, toward some stable equilibrium state with dy_p/dt = 0 for all p = 1, …, m.

3 Constructing the Continuous-Time Hopfield Net

We now show how to simulate the convergent computations of a given DRNN N on a somewhat bigger continuous-time Hopfield network H. In our simulation, the binary output values 0 and 1 of discrete neurons in N will be represented by excitations 2.7 of corresponding analog units in H that are below the lower saturation threshold of 0 or above the upper saturation threshold of 1, respectively, for activation function 2.5. For brevity, we simply say that a unit p is saturated at 0 or 1 at time t if its excitation satisfies ξ_p(t) ≤ 0 or ξ_p(t) ≥ 1, respectively. We say that p is unsaturated when 0 < ξ_p(t) < 1. Note that we use the excitations ξ_p(t) of continuous-time units p ∈ H rather than their actual states y_p(t) to represent the binary values, since starting at any point within the open interval (0, 1), the outputs y_p(t) of the units saturated at 0 or 1 converge only to the limit values 0 or 1, respectively, and never reach these boundaries (see point 1 in lemma 2 below). In fact, this is the main difference between the discrete dynamics 2.2 and its simulation within the continuous dynamics 2.6. The following theorem summarizes the result:

Theorem 1. Any fully parallel computation by a recurrent neural network N of n ≥ 1 binary neurons with asymmetric integer weights of maximum size w_max ≥ 1,
converging within t* discrete update steps, can be simulated by a continuous-time symmetric Hopfield network H with m = 18n + 7 analog units employing the saturated-linear activation function, within continuous time Θ(t*/ε) for any real ε > 0 such that

w_max 2^{12n} ≤ ε 2^{1/ε}.   (3.1)
Proof. Since the DRNN N defines a deterministic dynamical system over the state-space {0,1}^n, any converging computation by N must terminate within t* ≤ 2^n discrete steps. To achieve the simulation of N by Hopfield net H, we first construct in section 3.1 a continuous-time symmetric clock subnetwork C = C_{n+1} with 8n + 7 units that simulates a discrete (n + 2)-bit binary counter. When the network C is initialized in the zero state, its interface unit x_1 will exhibit a sequence of 2^n well-controlled oscillations before C converges. Each "clock pulse" from x_1 ∈ C is then exploited to "drive" a simulation of one parallel discrete update iteration by a simulating subnetwork S of size 10n units that is constructed in section 3.2. In particular, S = ∪_{j=1}^{n} G_j consists of n continuous-time symmetric gate subnetworks G_j (j = 1, …, n), each having 10 units, for simulating one discrete neuron j ∈ N. We first discuss the operation of the continuous-time Hopfield net H = C ∪ S intuitively; its correctness will be verified formally in section 4.

3.1 The Clock Network. The construction of the continuous-time symmetric clock network C = C_{n+1} simulating an (n + 2)-bit binary counter of "order (n + 1)" will be described by induction on n. The induction starts with the 2-bit counter network C_1 presented in Figure 1. The undirected edges connecting units in this graph are labeled with the corresponding symmetric weights, whereas the oriented edges drawn without an originating unit correspond to the biases. Assume that initially all of the states of the units indicated in this network are zero. The least significant counter unit c_0 of "order 0" has positive bias v(0, c_0) = ε > 0, corresponding to its initial positive excitation. Because of its feedback coupling v(c_0, c_0) = 1 + ε > 1, the state of c_0 gradually grows from the initial 0 toward 1. Eventually c_0 saturates at 1, at which point we say that the unit c_0 becomes active, or fires. Recall that we associate the simulated discrete counter behavior with the excitations of the units rather than their outputs. The external state of c_0 evolves smoothly, converging to 1, and exhibits no abrupt "firing" transitions. Thus, c_0 simply implements counting from 0 to 1, as required. This trick of gradual transition from 0 to 1, formally described in lemma 3, is used repeatedly throughout our construction of C.

The remaining six units c_1, a_1, x_1, b_1, d_1, z_1 in the 2-bit counter network C_1 (see Figure 1) are of order 1. They function similarly to the corresponding units of higher orders, so we will describe only the inductive construction for general order k > 1. The eventual interconnection of the clock network C to the simulating subnetwork S, that is, to the gate subnetworks G_j
Figure 1: A 2-bit continuous-time counter C_1.
(j = 1, …, n), is taken into account in the weight v(a_1, x_1) = W + 4, which includes a large, positive integer parameter,

W = nu + 28 (n + (3/2) Σ_{j=1}^{n} |w_{j0}|) > 0,   (3.2)

where

u = 28 max_{j=1,…,n} (−Σ_{1≤i≤n: w_{ji}≤0} w_{ji} , Σ_{1≤i≤n: w_{ji}≥0} w_{ji}) ≥ 0,   (3.3)

and w_{ji} are the weights of the original DRNN N to be simulated. This large weight is needed because the interface unit x_1 transfers the pulses generated by the clock C to all of the gate subnetworks G_j, and its operation within C must not be affected by feedback effects from S (see section 3.2).

For the induction step depicted in Figure 2, assume that an "order (k − 1)" counter network C_{k−1} (1 < k ≤ n + 1) has been constructed, containing the first k counter units c_0, …, c_{k−1} ∈ C_{k−1}, together with auxiliary units a_ℓ, x_ℓ, b_ℓ, d_ℓ, z_ℓ ∈ C_{k−1} (ℓ = 1, …, k − 1), and for k > 2 also x′_ℓ, b′_ℓ (ℓ = 2, …, k − 1), for a total of

m_k = |C_{k−1}| = k + 5(k − 1) + 2(k − 2) = 8k − 9   (3.4)
units. Then the next counter unit c_k is connected to all the m_k units p ∈ C_{k−1} via unit weights v(p, c_k) = 1, which, together with its bias v(0, c_k) = −m_k + ε and feedback weight v(c_k, c_k) = 1 + ε, cause c_k to fire shortly after all these units are active (see lemma 3). This includes the first k active counter bits c_0, …, c_{k−1}, which means that the simulated counting from 0 to 2^k − 1 has been accomplished, and, hence, the next counter bit c_k must fire.

In addition, unit c_k is connected to a sequence of seven auxiliary units a_k, x′_k, b′_k, x_k, b_k, d_k, z_k with feedback weights v(a_k, a_k) = v(x′_k, x′_k) = v(b′_k, b′_k) = v(x_k, x_k) = v(b_k, b_k) = v(d_k, d_k) = v(z_k, z_k) = 1 + ε, which are, one by one, activated after c_k fires (see lemma 3). This is implemented by the weights v(c_k, a_k) = m_k, v(a_k, x′_k) = −v(x′_k, x_1) + 1, v(x′_k, b′_k) = v(x_k, b_k) = v(b_k, d_k) = 1, v(b′_k, x_k) = V_k + v(x′_k, x_1), v(d_k, z_k) = V_k − m_k, and biases v(0, a_k) = −m_k + ε, v(0, x′_k) = v(0, x_k) = v(0, d_k) = −1 + ε, v(0, b′_k) = v(0, b_k) = −1 + ε/3, v(0, z_k) = m_k − V_k + ε, where v(x′_k, x_1) < 0 is in its absolute value a sufficiently large, negative weight and V_k > 0 is a sufficiently large, positive parameter, whose values will be determined by formulas 3.5 and 3.8, respectively, so that units x′_k, x_k, z_k are not directly influenced by the computations occurring in units from C_{k−1} except via c_k. The purpose of the auxiliary units a_k, b′_k, b_k, d_k is only to slow down the continuous-time state flow in order to synchronize the computation. The units x′_k, x_k are used to reset all the lower-order units in C_{k−1} back to values near 0 by saturating them at 0 (see lemma 2, point 2b) after c_k fires, which is consistent with the correct counter computation. The units that saturate at 0 are then called passive. To achieve this effect, x′_k is connected to x_1 via a large, negative weight,

v(x′_k, x_1) = −S_k^{x_1} − (12u + 39)n,   (3.5)

and similarly, x_k is linked with each p ∈ C_{k−1}\{x_1} via the negative weight

v(x_k, p) = −S_k^p,   (3.6)

where parameter

S_k^p = ⌈v(c_k, p) + Σ_{q∈C_{k−1}: v(q,p)>0} v(q, p)⌉,   p ∈ C_{k−1},   (3.7)

exceeds the positive influence of units in C_{k−1} ∪ {c_k} on p, and the term −(12u + 39)n in equation 3.5 balances the total positive influence on the interface unit x_1 originating from units A in the gate subnetworks G_j (see section 3.2). Now, the value of the parameter

V_k = 1 − v(x′_k, x_1) − Σ_{p∈C_{k−1}\{x_1}} v(x_k, p)   (3.8)
J. Šíma and P. Orponen
Figure 2: Inductive construction of C_k.
is determined so that unit x_k fires after b'_k is activated, in spite of the negative contributions through weights 3.6 from p ∈ C_{k−1}\{x_1} to the excitation of x_k. Note that the interface unit x_1 ∈ C_{k−1} is first suppressed separately by x'_k (see lemma 4), while the remaining units in C_{k−1}\{x_1} are active. In particular, units a_1 and b_1 incident on x_1, which are of the same order 1, remain activated after x_1 becomes passive because of active c_1 and d_1, respectively, and unit c_0 is connected with x_1 via a negative weight (see Figure 1). Furthermore, units in C_{k−1} of order greater than 1 are not directly influenced by passive x_1. Thus, only after x_1 becomes passive and b'_k, x_k are activated are the other units from C_{k−1}\{x_1} reset by x_k. Finally, unit z_k balances the negative influence of x'_k, x_k on C_{k−1} so that the first k counter bits can again count from 0 to 2^k − 1, but now with c_k being
Continuous-Time Symmetric Hopfield Nets
active. This is achieved by exact positive weights

$$v(z_k, p) = \begin{cases} -v(x'_k, x_1) - 1 & \text{for } p = x_1 \\ -v(x_k, p) - 1 & \text{for } p \in C_{k-1} \setminus \{x_1\}, \end{cases} \qquad (3.9)$$

in which −v(x'_k, x_1) and −v(x_k, p) eliminate the influence of x'_k and x_k, respectively, on p ∈ C_{k−1}, whereas −1 compensates for v(c_k, p) = 1 for each p ∈ C_{k−1}. Clearly, units p ∈ C_{k−1} cannot reversely activate z_k, since their maximal contribution to the excitation of z_k,

$$\sum_{p \in C_{k-1}} v(p, z_k) = -m_k - v(x'_k, x_1) - \sum_{p \in C_{k-1} \setminus \{x_1\}} v(x_k, p) = V_k - m_k - 1, \qquad (3.10)$$
according to equations 3.9 and 3.8, cannot overcome its bias v(0, z_k) = m_k − V_k + ε. This completes the induction step of the counter network construction.

3.2 The Simulating Network. The high-level scheme of the simulating subnetwork S is depicted in Figure 3. Network S consists of a sequence of five layers, each linked to the subsequent one. A signal is propagated through these layers (see point 2b of lemma 2) in only one direction (from top to bottom in Figure 3), although the respective connections are symmetric. For this purpose, a device used previously for the symmetric implementation of feedforward networks (Parberry, 1994) is applied here. In particular, the absolute values of weights and biases in the symmetric network are arranged in a decreasing sequence, which prevents the signal from backpropagating. However, to make recurrent computation possible, the "bottom" layer is further connected to the "top" layer, but only by weights of small absolute values that need additional strong support from the clock C in order to transmit a signal. Therefore, Figure 3 also shows the clock subnetwork C that was constructed in section 3.1. In particular, interface unit x_1 of C is connected to the first layer, denoted by A = {α_j, β_j; j = 1, …, n}, by large positive weights, and at the same time x_1 is linked to the fourth layer B = {φ_j, χ_j; j = 1, …, n} by negative weights of large absolute values (see Figure 1). Within the simulated (n+2)-bit binary counter C_{n+1}, unit x_1 is of the second least significant order 1 and thus fires 2^n times. To each discrete computational step of network N, which converges after t* ≤ 2^n updates, corresponds a two-phase continuous-time state dynamics of S, controlled by one oscillation of x_1 ∈ C. In particular, the next simulated state of N is computed in the first phase, when unit x_1 is passive. In this phase, the first layer A is locked; no signal can pass through A, since there is no support from C. At the same time, the negative influence of x_1 on the fourth layer B is suppressed. The third layer of S serves as a memory that encodes the current binary state y^{(t)} of the
Figure 3: A high-level scheme of the simulation. [Five layers of S, linked to the clock C with interface unit x_1: 1. feedback transfer A = {α_j, β_j; j = 1, …, n}; 2. memory driver {ζ_j, η_j; j = 1, …, n}; 3. memory {π_j, ρ_j; j = 1, …, n}, storing y^{(t)}; 4. memory driver B = {φ_j, χ_j; j = 1, …, n}; 5. memory {ψ_j, ω_j; j = 1, …, n}, storing y^{(t+1)}.]
simulated network N at discrete-time instant t. This memory is properly initialized with the initial state y^{(0)} of N. The new state y^{(t+1)} of N at each time instant t+1 is then computed from y^{(t)} in the subsequent fourth layer by using the original weights of N, and it is further stored in the following fifth memory layer. This completes the first phase, and S is stabilized until x_1 fires. In the second phase, when unit x_1 is active, the new simulated state of N replaces the old one. The fourth layer B is now locked due to the large negative weights from x_1, while the first layer A is unlocked by the positive support from C. The new state y^{(t+1)} is transmitted from the fifth
memory layer back to the first layer in a cycle, and further transferred to the subsequent second layer, which updates the contents of the third memory layer with the new state y^{(t+1)}. Thus, one simulation step is completed, and the simulating subnetwork S is stabilized until x_1 becomes passive again. Now, the detailed implementation of the simulating subnetwork S will be presented. As indicated in Figure 3, each of the five layers in S is composed of n pairs of continuous-time units, one pair for each discrete neuron j in the network N to be simulated. Each system of five unit pairs (one from each layer) associated with the same neuron j forms the symmetric gate subnetwork G_j with 10 units for simulating discrete neuron j ∈ N, as depicted in Figure 4. Besides the definition of the weights in G_j, this figure again shows the interface clock unit x_1 ∈ C that is linked to each G_j. In what follows, we informally describe the dynamics of G_j (j = 1, …, n) that simulates the discrete-time updates of neuron j ∈ N. For simplicity of presentation, we assume (see point 2b of lemma 2) that particular units in the continuous-time network G_j approximately follow the discrete update rule 2.1–2.3 when their inputs are saturated for the duration of a certain period. At the beginning of the simulation, all units in G_j are placed in zero initial states, except possibly for units π_j, ρ_j ∈ G_j that doubly encode the initial state of discrete neuron j ∈ N,

$$y_{\pi_j}(0) = y_{\rho_j}(0) = y_j^{(0)}, \qquad (3.11)$$

which means that both units π_j, ρ_j are initially saturated at value y_j^{(0)} ∈ {0, 1}. The first phase of a general simulation step corresponding to the update of the state of neuron j at discrete-time instant t+1 starts when the interface unit x_1 ∈ C becomes passive. Consequently, units α_j, β_j ∈ A saturate at 0 and remain passive due to their negative biases (see Figure 4). Only then (see lemma 4) are the states of units φ_j, χ_j ∈ B allowed to evolve, since the influence of the negative weights from x_1 disappears with passive x_1. Assume now that the simulated state y_j^{(t)} ∈ {0, 1} of neuron j ∈ N at discrete-time instant t is doubly represented by continuous-time units π_j, ρ_j ∈ G_j, which are both saturated at value y_j^{(t)}. In particular, units π_j, ρ_j are either both passive, due to their negative biases and passive ζ_j, if y_j^{(t)} = 0, or they are saturating each other at 1 by their positive coupling weight v(π_j, ρ_j) while η_j is passive, if y_j^{(t)} = 1.
The value corresponding to the new state y_j^{(t+1)} ∈ {0, 1} of neuron j is computed by unit φ_j ∈ G_j, based on its connections to the units π_i ∈ G_i (i = 1, …, n) that represent the simulated states y_i^{(t)} of neurons i ∈ N at the previous discrete-time instant t. Therefore, these connections are labeled with symmetric weights v(π_i, φ_j) = 28w_{ji}, and unit φ_j has bias v(0, φ_j) = 28w_{j0} + 14, where the w_{ji} are the original weights of N. The large coefficients are used to guarantee that φ_j is not influenced by ψ_j, ω_j while the original
Figure 4: A continuous-time gate network G_j simulating the discrete neuron j.
function of j is preserved (Parberry, 1994). Similarly, unit χ_j computes the negation of y_j^{(t+1)} from units ρ_i ∈ G_i by using the opposite weights v(ρ_i, χ_j) = −28w_{ji} and bias v(0, χ_j) = −28w_{j0} − 14, which also prevent χ_j from being influenced by ψ_j, ω_j. Note that because units π_i, ρ_i have biases v(0, π_i) = v(0, ρ_i) = −u − 1, where u is the large nonnegative parameter defined by formula 3.3, the state evolution of φ_j, χ_j cannot reversely affect the dynamics of units π_i, ρ_i, respectively. In addition, units φ_j, χ_j ∈ G_j are never simultaneously active; either they are both passive, or each of them represents the negation of the binary state encoded by the other. Therefore, the positive parameter W introduced in equation 3.2 and implanted in weight v(a_1, x_1) is sufficient for
activating the clock unit x_1 ∈ C regardless of the total negative influence on x_1 from the units in B (see Figure 1). Furthermore, units φ_j, χ_j store the new state y_j^{(t+1)} into both ψ_j and ω_j in such a way that either φ_j ensures the activation of ψ_j, ω_j through positive weights, if y_j^{(t+1)} = 1, or χ_j arranges for their resetting by means of negative weights, if y_j^{(t+1)} = 0. Consequently, continuous-time units ψ_j, ω_j store the simulated state y_j^{(t+1)} even if φ_j, χ_j later become passive, since they are either both passive due to their negative biases, if y_j^{(t+1)} = 0, or they are saturating each other at 1 via their positive coupling weight v(ψ_j, ω_j), if y_j^{(t+1)} = 1. Since units α_j, β_j ∈ A are temporarily locked, the gate network G_j becomes stable until unit x_1 is activated. In the second phase, interface unit x_1 ∈ C becomes active and locks units φ_j, χ_j ∈ B by saturating them at 0 before releasing units α_j, β_j ∈ A (see lemma 4). Unit α_j then receives y_j^{(t+1)} from unit ψ_j, as the positive support from x_1 balances the negative bias of α_j so that the small positive weight v(ψ_j, α_j) can have an influence on α_j. Similarly, unit β_j computes the negation of the binary state y_j^{(t+1)} received from ω_j by means of the support from x_1. State y_j^{(t+1)} and its negation are further propagated in G_j from α_j, β_j to ζ_j, η_j, respectively, and these units then update the simulated state of neuron j stored in π_j, ρ_j with its new value y_j^{(t+1)}. Since the negative weights v(x_1, φ_j) = v(x_1, χ_j) = −3(u + 28|w_{j0}| + 28)/2 from x_1 overcome the total positive influence on φ_j, χ_j that originates in their incident units, network G_j is temporarily stabilized until x_1 becomes passive again and the computation of the new discrete state y_j^{(t+2)} is initiated. Recall also that units α_j, β_j ∈ G_j are never simultaneously active, which implies that the term −(12u + 39)n in definition 3.5 of v(x'_k, x_1) (k > 1) suffices to suppress the total positive influence from A on x_1 (see Figure 1). This completes the construction of the simulating subnetwork S = ∪_{j=1}^n G_j.
4 Formal Verification

Now the correct state evolution of the continuous-time Hopfield network H described in section 3 needs to be verified. This is achieved by a sequence of lemmas analyzing the behavior of the corresponding system of differential equations 2.6. Lemma 1 first upper bounds the maximum sum of the absolute values of the weights incident on any unit in H. Lemma 2 then describes explicitly the continuous-time state evolution of saturated units. An analysis of how the decreasing defects (i.e., the distances from limit values in the states of saturated units) affect the excitation of any other unit reveals that the units in H actually approximate the discrete update rule 2.1–2.3 of corresponding binary neurons after a certain transient time, provided that the incident saturated units stay saturated. Furthermore, the transfer of activity in the clock C from a unit to a subsequent one, when all the incident units are saturated, will be analyzed explicitly in lemma 3. (But note that the dynamics of unit c_0 at time t = 0 slightly differs from this analysis.) A crucial fact for the proof of theorem 1 is that the duration of this transfer turns out to be constant and unaffected by any initial defect. This introduces a "discrete time" into the clock operation and enables the clock to synchronize the simulation. In point 2 of lemma 3, the result is also partially generalized to the case when some of the incident units may become unsaturated. Finally, lemma 4 will verify that the clock interface unit x_1 ∈ C correctly synchronizes the simulation by network S; locking the units in A and B always precedes unlocking the units in B and A, respectively.

Lemma 1. For any unit p ∈ H in the Hopfield network constructed in section 3, the sum of the absolute values of its incident weights (excluding its local bias) is upper bounded by

$$\Delta_p = \sum_{q=1}^{m} |v(q, p)| < \varepsilon 2^{1/\varepsilon}. \qquad (4.1)$$
Proof. Consider first the input weights to the clock interface unit x_1 ∈ C, as indicated in Figures 1, 2, and 4:

$$\Delta_{x_1} = v(x_1,x_1) + |v(c_0,x_1)| + v(a_1,x_1) + v(b_1,x_1) + \sum_{k=2}^{n+1}\bigl(v(c_k,x_1) + |v(x'_k,x_1)| + v(z_k,x_1)\bigr) + \sum_{j=1}^{n}\bigl(v(\alpha_j,x_1) + v(\beta_j,x_1) + |v(\phi_j,x_1)| + |v(\chi_j,x_1)|\bigr), \qquad (4.2)$$

which reduces to

$$\Delta_{x_1} = 3W + 2(12u+39)n + 9 + \varepsilon + 2\sum_{k=2}^{n+1} |v(x'_k, x_1)| \qquad (4.3)$$
by using definitions 3.9 and 3.2. The weights v(x'_k, x_1) defined by formulas 3.5 and 3.7 can be expressed recursively,

$$v(x'_k, x_1) = v(x'_{k-1}, x_1) - \sum_{q \in C_{k-1} \setminus C_{k-2};\, v(q,x_1) > 0} v(q, x_1) = v(x'_{k-1}, x_1) - v(c_{k-1}, x_1) - v(z_{k-1}, x_1) = 2v(x'_{k-1}, x_1), \qquad (4.4)$$
for k = 3, …, n+1, according to Figure 2 and equation 3.9, whereas

$$v(x'_2, x_1) = -\lceil v(c_2,x_1) + v(x_1,x_1) + v(a_1,x_1) + v(b_1,x_1) \rceil - (12u+39)n = -(W + (12u+39)n + 8) \qquad (4.5)$$

follows from Figure 1. Hence,

$$v(x'_k, x_1) = -2^{k-2}\,(W + (12u+39)n + 8) \qquad (4.6)$$

for k = 2, …, n+1, which can be substituted into formula 4.3:

$$\Delta_{x_1} = 2^{n+1}(W + (12u+39)n + 8) + W - 7 + \varepsilon. \qquad (4.7)$$
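As a cross-check on this substitution, the geometric sum of the doubling weights from 4.6 can be verified numerically to carry 4.3 into 4.7 (the values of W, u, n, and ε below are arbitrary stand-ins for the parameters fixed by definitions 3.1 through 3.3):

```python
# Cross-check that summing |v(x'_k, x1)| = 2**(k-2) * (W + (12u+39)n + 8)
# over k = 2..n+1 in formula 4.3 reproduces the closed form 4.7.
def delta_x1_by_sum(W, u, n, eps):
    base = W + (12 * u + 39) * n + 8
    s = sum(2 ** (k - 2) * base for k in range(2, n + 2))
    return 3 * W + 2 * (12 * u + 39) * n + 9 + eps + 2 * s

def delta_x1_closed(W, u, n, eps):
    base = W + (12 * u + 39) * n + 8
    return 2 ** (n + 1) * base + W - 7 + eps

for n in range(1, 8):
    assert delta_x1_by_sum(100, 30, n, 0.05) == delta_x1_closed(100, 30, n, 0.05)
```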
Furthermore, as indicated in Figure 2, the input weights to the clock unit z_{n+1} ∈ C of the highest order n+1 give

$$\Delta_{z_{n+1}} = v(z_{n+1}, z_{n+1}) + v(d_{n+1}, z_{n+1}) + \sum_{p \in C_n} v(p, z_{n+1}) = 2V_{n+1} - 2m_{n+1} + \varepsilon = 2V_{n+1} - 16n + 2 + \varepsilon \qquad (4.8)$$
according to equations 3.10 and 3.4. Parameter V_{n+1} will be computed recursively, starting with

$$V_2 = 1 - v(x'_2, x_1) - \sum_{p \in C_1 \setminus \{x_1\}} v(x_2, p) = 1 + \sum_{p \in C_1} \left\lceil v(c_2, p) + \sum_{q \in C_1;\, v(q,p) > 0} v(q, p) \right\rceil + (12u+39)n = 2W + (12u+39)n + 48, \qquad (4.9)$$
which follows from equations 3.5 through 3.8 and Figure 1. For 2 < k ≤ n+1, definition 3.8 of V_k can be rewritten as

$$V_k = 1 - v(x'_k, x_1) - \sum_{p \in C_{k-2} \setminus \{x_1\}} v(x_k, p) - \sum_{p \in C_{k-1} \setminus C_{k-2}} v(x_k, p). \qquad (4.10)$$
Similarly as in equation 4.4, a recursive formula for the weights v(x_k, p) for p ∈ C_{k−2}\{x_1} (k > 2) defined by formulas 3.6 and 3.7 can be derived,

$$v(x_k, p) = v(x_{k-1}, p) - \sum_{q \in C_{k-1} \setminus C_{k-2};\, v(q,p) > 0} v(q, p) = v(x_{k-1}, p) - v(c_{k-1}, p) - v(z_{k-1}, p) = 2v(x_{k-1}, p), \qquad (4.11)$$
according to Figure 2 and equation 3.9. By introducing formulas 4.4 and 4.11 into equation 4.10 and using definition 3.8, the recursive formula for V_k is obtained:

$$V_k = 2V_{k-1} - 1 - \sum_{p \in C_{k-1} \setminus C_{k-2}} v(x_k, p). \qquad (4.12)$$

The weights v(x_k, p) for p ∈ C_{k−1}\C_{k−2} = {c_{k−1}, a_{k−1}, x'_{k−1}, b'_{k−1}, x_{k−1}, b_{k−1}, d_{k−1}, z_{k−1}} in equation 4.12 can be calculated from Figure 2 and from definitions 3.6 and 3.7, in which ⌈v(c_k, p) + v(p, p)⌉ = 3, as follows:

$$-v(x_k, c_{k-1}) = 3 + v(a_{k-1}, c_{k-1}) + \sum_{q \in C_{k-2};\, v(q, c_{k-1}) > 0} v(q, c_{k-1}) = 2m_{k-1} + 3 \qquad (4.13)$$
$$-v(x_k, a_{k-1}) = 3 + v(c_{k-1}, a_{k-1}) + v(x'_{k-1}, a_{k-1}) = -v(x'_{k-1}, x_1) + m_{k-1} + 4 \qquad (4.14)$$
$$-v(x_k, x'_{k-1}) = 3 + v(a_{k-1}, x'_{k-1}) + v(b'_{k-1}, x'_{k-1}) = -v(x'_{k-1}, x_1) + 5 \qquad (4.15)$$
$$-v(x_k, b'_{k-1}) = 3 + v(x'_{k-1}, b'_{k-1}) + v(x_{k-1}, b'_{k-1}) = V_{k-1} + v(x'_{k-1}, x_1) + 4 \qquad (4.16)$$
$$-v(x_k, x_{k-1}) = 3 + v(b'_{k-1}, x_{k-1}) + v(b_{k-1}, x_{k-1}) = V_{k-1} + v(x'_{k-1}, x_1) + 4 \qquad (4.17)$$
$$-v(x_k, b_{k-1}) = 3 + v(x_{k-1}, b_{k-1}) + v(d_{k-1}, b_{k-1}) = 5 \qquad (4.18)$$
$$-v(x_k, d_{k-1}) = 3 + v(b_{k-1}, d_{k-1}) + v(z_{k-1}, d_{k-1}) = V_{k-1} - m_{k-1} + 4 \qquad (4.19)$$
$$-v(x_k, z_{k-1}) = 3 + v(d_{k-1}, z_{k-1}) + \sum_{q \in C_{k-2};\, v(q, z_{k-1}) > 0} v(q, z_{k-1}) = 2V_{k-1} - 2m_{k-1} + 2, \qquad (4.20)$$

where formula 3.10 has been employed in equation 4.20. Weights 4.13 through 4.20 are summed up as

$$-\sum_{p \in C_{k-1} \setminus C_{k-2}} v(x_k, p) = 5V_{k-1} + 31, \qquad (4.21)$$

which is plugged into formula 4.12:

$$V_k = 7V_{k-1} + 30. \qquad (4.22)$$
It follows from equations 4.9 and 4.22 that

$$V_{n+1} = 7^{n-1}(2W + (12u+39)n + 53) - 5, \qquad (4.23)$$

which gives

$$\Delta_{z_{n+1}} = 2 \cdot 7^{n-1}(2W + (12u+39)n + 53) - 16n - 8 + \varepsilon \qquad (4.24)$$
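The closed form 4.23 can be checked numerically against the recurrence 4.22 with the base case 4.9 (the values of W and u below are arbitrary stand-ins for the parameters fixed by definitions 3.2 and 3.3):

```python
# Numeric check: iterating V_k = 7*V_{k-1} + 30 (eq. 4.22) from
# V_2 = 2W + (12u+39)n + 48 (eq. 4.9) matches the closed form 4.23.
def v_by_recurrence(W, u, n):
    v = 2 * W + (12 * u + 39) * n + 48   # V_2
    for _ in range(3, n + 2):            # V_3, ..., V_{n+1}
        v = 7 * v + 30
    return v

def v_closed(W, u, n):
    return 7 ** (n - 1) * (2 * W + (12 * u + 39) * n + 53) - 5

for n in range(1, 10):
    assert v_by_recurrence(250, 40, n) == v_closed(250, 40, n)
```

The shift by the fixed point −5 of the map V → 7V + 30 is what produces the "+53" and "−5" in 4.23.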
according to equation 4.8. Observe from Figure 4 that the maximum value of Δ_p among the units p ∈ S within the simulating subnetwork is reached by α_j, β_j ∈ S, that is,

$$\Delta_{\alpha_j} = \Delta_{\beta_j} = 16u + 51 < \Delta_{x_1}, \qquad j = 1, \ldots, n, \qquad (4.25)$$

which is still less than the Δ_{x_1} associated with the clock interface unit x_1 ∈ C according to equation 4.7. On the other hand, the maximum value of Δ_p among the units p ∈ C within the clock subnetwork is reached by the unit z_{n+1} ∈ C of the highest order n+1 for n ≥ 2; for example, compare formula 4.8 to

$$\Delta_{x_{n+1}} = 2V_{n+1} + 2v(x'_{n+1}, x_1) + 1 + \varepsilon = 2V_{n+1} - 2^n(W + (12u+39)n + 8) + 1 + \varepsilon < \Delta_{z_{n+1}} \qquad (4.26)$$

computed by formula 4.6, while Δ_{x_1} > Δ_{z_2} dominates for n = 1 according to equations 4.7 and 4.24. Hence,

$$\Delta_p \leq \max(\Delta_{z_{n+1}}, \Delta_{x_1}) \qquad (4.27)$$
for every unit p ∈ H. Clearly,

$$\varepsilon \leq 0.0625 \qquad (4.28)$$

follows from assumption 3.1, in which n ≥ 1 and w_max ≥ 1, and ε2^{1/ε} is decreasing for ε > 0. In addition, the following two upper bounds,

$$W \leq 42n^2 w_{\max} + 42n w_{\max} + 42n \qquad (4.29)$$
$$u \leq 28n w_{\max}, \qquad (4.30)$$

are obtained from definitions 3.2 and 3.3, respectively, which together with equations 4.7 and 4.24 and inequality 4.28 ensure that condition 4.27 implies

$$\Delta_p < 12 \cdot 7^n\,(10n^2 w_{\max} + 11n w_{\max} + 3n + 2), \qquad (4.31)$$

which, for n ≥ 1, gives

$$\Delta_p < w_{\max} 2^{12n} \leq \varepsilon 2^{1/\varepsilon} \qquad (4.32)$$

according to assumption 3.1.
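The domination step from 4.31 to 4.32 can be spot-checked numerically for small n and a few illustrative values of w_max ≥ 1:

```python
# Spot check that the bound 4.31 is dominated by w_max * 2**(12n),
# as inequality 4.32 claims for n >= 1.
def bound_431(n, wmax):
    return 12 * 7 ** n * (10 * n * n * wmax + 11 * n * wmax + 3 * n + 2)

for n in range(1, 11):
    for wmax in (1, 5, 1000):
        assert bound_431(n, wmax) < wmax * 2 ** (12 * n)
```

For n = 1 and w_max = 1, for instance, 4.31 gives 2184, comfortably below 2^{12} = 4096; the gap then widens rapidly, since 7^n grows far slower than 2^{12n}.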
Lemma 2.
1. Let p ∈ H be a unit saturated at b ∈ {0, 1} with a defect

$$\delta_p(t) = |b - y_p(t)| \qquad (4.33)$$

for the duration of a continuous-time interval τ = [t_0, t_f] for some t_0 ≥ 0. Then the state dynamics of p converging toward value b can be explicitly solved as

$$y_p(t) = |b - \delta_p e^{-(t-t_0)}| \qquad (4.34)$$

for t ∈ τ, where δ_p = δ_p(t_0) is p's initial defect.

2a. Let Q ⊆ H be a subset of units saturated for the duration of time interval τ = [t_0, t_f]. Then the dynamics of the excitation ξ_p(t) for any unit p ∈ H can be described as

$$\xi_p(t) = v(0, p) + \sum_{q \in Q;\, \xi_q(t) \geq 1} v(q, p) + \sum_{q \in H \setminus Q} v(q, p)\, y_q(t) + \Lambda_{pQ}\, e^{-(t-t_0)} \qquad (4.35)$$

for t ∈ τ, where

$$\Lambda_{pQ} = \sum_{q \in Q;\, \xi_q(t_0) \leq 0} v(q, p)\,\delta_q - \sum_{q \in Q;\, \xi_q(t_0) \geq 1} v(q, p)\,\delta_q \qquad (4.36)$$

is the initial total weighted defect of Q affecting ξ_p(t_0).

2b. In addition, let t_f > t_0 + t_1, where

$$t_1 = \frac{\ln 2}{\varepsilon}, \qquad (4.37)$$

and assume that the respective weights in H satisfy either

$$v(0, p) + \sum_{q \in Q;\, \xi_q(t_0) \geq 1} v(q, p) + \sum_{q \in H \setminus Q;\, v(q,p) > 0} v(q, p) < -\varepsilon \qquad (4.38)$$

or

$$v(0, p) + \sum_{q \in Q;\, \xi_q(t_0) \geq 1} v(q, p) + \sum_{q \in H \setminus Q;\, v(q,p) < 0} v(q, p) > 1 + \varepsilon. \qquad (4.39)$$

Then p is saturated at either 0 or 1, respectively, for the duration of time interval [t_0 + t_1, t_f].

Proof. 1. According to dynamics 2.5 and 2.6, the state y_p(t) of unit p saturated at b ∈ {0, 1} for t ∈ τ is independent of the outputs from the remaining units, and
its continuous-time dynamics is described by a differential equation,

$$\frac{dy_p}{dt}(t) = -y_p(t) + b, \qquad (4.40)$$

with a boundary condition,

$$y_p(t_0) = |b - \delta_p|, \qquad (4.41)$$

obtained from formula 4.33 for the initial defect δ_p. Hence, its explicit solution 4.34 follows.

2a. The excitation ξ_p(t) of unit p defined by formula 2.7 is split among the contributions from the units outside Q and from those in Q saturated at 0 and 1, whose dynamics for t ∈ τ is given by equation 4.34:

$$\xi_p(t) = v(0, p) + \sum_{q \in H \setminus Q} v(q, p)\, y_q(t) + \sum_{q \in Q;\, \xi_q(t) \leq 0} v(q, p)\,\delta_q e^{-(t-t_0)} + \sum_{q \in Q;\, \xi_q(t) \geq 1} v(q, p)\bigl(1 - \delta_q e^{-(t-t_0)}\bigr). \qquad (4.42)$$
By introducing the initial total weighted defect 4.36 into formula 4.42, the dynamics 4.35 follows.

2b. The defect term Λ_{pQ} e^{−(t−t_0)} in equation 4.35 vanishes quickly as time proceeds, and its absolute value can be bounded for t ∈ [t_0 + t_1, t_f] as follows,

$$|\Lambda_{pQ}\, e^{-(t-t_0)}| \leq \Delta_p\, e^{-t_1} < \varepsilon, \qquad (4.43)$$

by using lemma 1 and equation 4.37. Hence, unit p is saturated at either 0 or 1 for the duration of the time interval [t_0 + t_1, t_f] when condition 4.38 or 4.39, respectively, is assumed in equation 4.35.

Lemma 3.
1. Consider a situation as depicted in Figure 5, where a clock unit p ∈ C with fractional part of bias ε' ∈ {ε, ε/3} and feedback weight v(p, p) = 1 + ε is supposed to receive a signal from the preceding clock unit o ∈ C, activate itself, and further transfer the signal to a subsequent clock unit r ∈ C with bias fraction ε and v(r, r) = 1 + ε, via weight v(p, r) ≥ 1. Let all the units incident on p, r, excluding p, r, be saturated for the duration of some sufficiently large time interval τ = [t_0, t_f] (e.g., t_f > t_0 + t_2, where t_2 is defined by formula 4.49 below), starting at a time t_0 > 0 when ξ_p(t_0) = 0. Assume that the initial defects meet

$$\delta_p + \Lambda_{rQ} < \varepsilon \qquad (4.44)$$
Figure 5: Activity transfer from unit p to unit r in the clock subnetwork C.
for Q = H\{p}. Further assume that the respective weights satisfy

$$v(0, p) + \sum_{q \in Q;\, \xi_q(t_0) \geq 1} v(q, p) = \varepsilon' \qquad (4.45)$$

$$v(0, r) + \sum_{q \in Q;\, \xi_q(t_0) \geq 1} v(q, r) = \varepsilon - v(p, r). \qquad (4.46)$$
Then p is unsaturated with the state dynamics

$$y_p(t) = \frac{\varepsilon'\bigl(e^{\varepsilon(t-t_0)} - 1\bigr)}{\varepsilon(1+\varepsilon)} - \frac{\varepsilon' + \Lambda_{pQ}\, e^{-(t-t_0)}}{1+\varepsilon} \qquad (4.47)$$

exactly for the duration of time interval (t_0, t_0 + t'_1), where

$$t'_1 = \frac{\ln(1 + \varepsilon/\varepsilon')}{\varepsilon} \qquad (4.48)$$

(note t'_1 = t_1 for ε' = ε and t'_1 = 2t_1 for ε' = ε/3), and r is saturated at 0. In addition, p is saturated at 1 for the duration of time interval [t_0 + t'_1, t_0 + t_2] and remains further saturated independently of the output from r, while r unsaturates from 0 at time t_0 + t_2, where

$$t_2 = \ln \frac{v(p,r)\Bigl((\varepsilon + \varepsilon')\bigl(1 + \tfrac{\varepsilon}{\varepsilon'}\bigr)^{1/\varepsilon} + \Lambda_{pQ}\Bigr) - (1+\varepsilon)\Lambda_{rQ}}{\varepsilon(1+\varepsilon)} \;\geq\; t'_1. \qquad (4.49)$$
2. Consider a situation as depicted in Figure 6, where a clock unit p ∈ C with bias v(0, p) = −1 + ε and feedback weight v(p, p) = 1 + ε is supposed to receive a signal from the preceding clock unit o ∈ C, activate itself, and further transfer the signal to a subsequent clock unit r ∈ C with v(0, r) = −1 + ε/3 and v(r, r) = 1 + ε, via weight v(p, r) = 1, while the units in a given subset R ⊆ H, incident on p (with no connections to r), may unsaturate after p unsaturates from 0. Let all the units
Figure 6: Activity transfer from unit p to r when units q ∈ R unsaturate.
incident on p, r, excluding p, r, and R, be saturated for the duration of a sufficiently large time interval τ = [t_0, t_f] (e.g., at least until r unsaturates from 0), starting at a time t_0 > 0 when ξ_p(t_0) = 0. In Figure 6,

$$B_{t_0} = \{q \in B;\ \xi_q(t_0) \geq 1\} \subseteq S \qquad (4.50)$$

denotes the subset of units from the simulating subnetwork S that are saturated at 1 at time t_0. Assume that the initial defects meet

$$\delta_r < \varepsilon 2^{-1/\varepsilon} \qquad (4.51)$$

$$\Lambda_{rQ'} < \varepsilon 2^{-1/\varepsilon} \qquad (4.52)$$

for Q' = H\(R ∪ {p}), and

$$(1+\varepsilon)\delta_p + \sum_{q \in R;\, \xi_q(t_0) \leq 0} v(q, p)\,\delta_q - \sum_{q \in R;\, \xi_q(t_0) \geq 1} v(q, p)\,\delta_q \leq \varepsilon 2^{-1/\varepsilon} \qquad (4.53)$$
outside Q0. Further, assume that the respective weights satisfy v.0; p/ C
X q2Q0 I»q .t0 /¸1
v.q; p/ C
X q2RIv.q;p/ ¡" 0 ¡" ¡ ¢ " 1=" 0 C1pQ ¡ .1C"/1rQ v.p; r/ ." C " / 1C "0
(4.75)
where formulas 4.74 and 4.49 have been employed, which further reduces to

$$(1+\varepsilon)(\varepsilon+\varepsilon')\left(\varepsilon\left(1+\frac{\varepsilon}{\varepsilon'}\right)^{1/\varepsilon} + \Lambda_{rQ}\right) < v(p,r)\left((\varepsilon+\varepsilon')^2\left(1+\frac{\varepsilon}{\varepsilon'}\right)^{1/\varepsilon} + (\varepsilon+\varepsilon')\Lambda_{pQ} - \varepsilon(1+\varepsilon)\delta_r\right). \qquad (4.76)$$
Ï õ ma and P. Orponen J. S´
720
According to assumption 4.44 and equation 4.59, it suffices to show inequality 4.76 with Λ_{rQ} and Λ_{pQ} replaced by ε and −ε' − (1+ε)ε, respectively. In addition, v(p, r) ≥ 1 and δ_r ≤ 1, which leads to

$$(1+\varepsilon)(\varepsilon+\varepsilon')\left(\varepsilon\left(1+\frac{\varepsilon}{\varepsilon'}\right)^{1/\varepsilon} + \varepsilon\right) < (\varepsilon+\varepsilon')^2\left(1+\frac{\varepsilon}{\varepsilon'}\right)^{1/\varepsilon} - \varepsilon'(\varepsilon+\varepsilon') - \varepsilon(1+\varepsilon)(1+\varepsilon+\varepsilon'), \qquad (4.77)$$
which holds due to inequality 4.28. This completes the argument for unit p to be saturated at 1 after r becomes unsaturated from 0.

2. Note that unit o saturates at 1 according to case 1 of this lemma before p is unsaturated from 0 at time instant t_0. The excitation of unit p derived from equation 4.35 can be lower bounded as

$$\xi_p(t) \geq \varepsilon + (1+\varepsilon)\,y_p(t) + \Lambda_{pQ'}\, e^{-(t-t_0)}$$
(4.78)
for t ∈ τ by assumption 4.54. According to dynamics equation 2.6, this also provides the following lower bound on the state derivative of unsaturated p:

$$\frac{dy_p}{dt}(t) \geq \varepsilon y_p(t) + \varepsilon + \Lambda_{pQ'}\, e^{-(t-t_0)}.$$
Moreover, in the beginning of time interval ¿ , the state evolution of unit p is determined by formula 4.47 before the rst unit q 2 R becomes unsaturated, since assumption 4.45 coincides with assumption 4.54 due to " 0 D " and »q .t0 / ¸ 1 for all q 2 R with v.q; p/ < 0 (see the table in Figure 6). Also, the initial total weighted defect 1pQ0 for Q0 D H n.R [ fpg/ can be expressed in terms of 1pQ D ¡" ¡ .1 C "/±p for Q D H nfpg from equation 4.59 as follows, 1pQ0 D ¡" ¡ .1 C "/±p ¡
X q2RI»q .t0 /·0
v.q; p/±q C
X
v.q; p/±q ; (4.80)
q2RI»q .t0 /¸1
according to denition 4.36. Hence, 1pQ0 ¸ ¡".1 C 2¡1=" /
(4.81)
by assumption 4.53. By introducing inequality 4.81 into formula 4.79, the state derivative of unsaturated p is further lower bounded as dyp dt
.t/ ¸ "yp .t/ C " ¡ ".1 C 2¡1=" /e¡.t¡t0 / :
(4.82)
Since εy_p(t) ≥ 0, it follows that

$$\frac{dy_p}{dt}(t) \geq \varepsilon - \varepsilon^2 > 0$$
(4.83)
for t ≥ t_0 + t_d, where

$$t_d = \ln \frac{1 + 2^{-1/\varepsilon}}{\varepsilon},$$
provided that p is still unsaturated. This implies that yp .t/ for t ¸ t0 C td grows at least as fast as the straight line with equation ." ¡ " 2 /.t ¡ t0 ¡ td / ¡ y D 0
(4.85)
until unit p saturates at 1. Hence, p saturates at 1 certainly before t0 Ctd Cts < t0 C 2t1 where ts D
1 " ¡ "2
(4.86)
because »p .t/ > yp .t/ from equation 2.6 due to p’s state derivative 4.83 is positive for t ¸ t0 C td . In addition, it will be proved that the subsequent unit r is saturated at 0 at least until p saturates at 1. Excitation »r .t/ D ¡1 C
" C yp .t/ C 1rQ0 e¡.t¡t0 / 3
(4.87)
of unit r saturated at 0 can be derived from equation 4.35 by assumption 4.55. Let ty > 0 be the least local time instant at which yp .t0 C ty / D 1 ¡
" ¡ 1rQ0 e¡ty 3
(4.88)
when r is still saturated at 0 since »r .t0 C ty / D 0 follows from equation 4.87. Thus, it sufces to prove that the excitation of unit p can be lower bounded at time instant t0 C ty as ± ² " »p .t0 C ty / ¸ " C .1 C "/ 1 ¡ ¡ 1rQ0 e¡ty C 1pQ0 e¡ty ¸ 1 3
(4.89)
according to inequality 4.78. By substituting the error bounds 4.52 and 4.81, this reduces to 5 ¡ " ¡ 3.1 C .2 C "/2¡1=" /e¡ty ¸ 0;
(4.90)
Ï õ ma and P. Orponen J. S´
722
which holds for condition 4.28 due to e¡ty < 1, ensuring that unit p is already saturated at 1 at time instant t0 C ty . Finally, it must be checked that unit p remains saturated at 1 after t0 C ty when unit r may become unsaturated from 0. Inequality 4.78 now reads as ± ±" ² ² C 1rQ0 e¡ty e¡.t¡t0 ¡ty / C yr .t/ »p .t/ ¸ " C .1 C "/ 1 ¡ 3 C .1pQ0 ¡ ±r /e¡.t¡t0 / ;
(4.91)
since the state dynamics of unit p saturated at 1 is controlled by equation 4.34. In order to prove that »p .t/ ¸ 1 for all t 2 .t0 C ty ; tf ], it sufces to show 6 ¡ .1 C "/e¡.t¡t0 ¡ty / ¡ 3.1 C .3 C "/2¡1=" /e¡.t¡t0 / ¸ 0
(4.92)
according to inequality 4.91 in which yr .t/ ¸ 0 and the error bounds 4.51, 4.52, and 4.81 have been applied. Inequality 4.92 follows from e¡.t¡t0 / · e¡.t¡t0 ¡ty / · 1 and condition 4.28. This completes the argument for unit p to be saturated at 1 after r becomes unsaturated from 0. Lemma 4. All the units in A [ B µ S are saturated at 0 when the state of the clock interface unit x1 2 C equals 2=3. 1. The state of unit x1 unsaturated from 0 at time t0 ¸ 0 (i.e., »x1 .t0 / D 0) after reaching value yx1 .t0 C t0` / D 2=3 from below at time t0 C t0` , remains further increasing at least until x1 is again unsaturated from 1 by the next x0k unit (2 · k · n C 1). 2. Let unit x0k (2 · k · n C 1) cease its saturation at 0 at time t0 ¸ 0 when »x0k .t0 / D 0. Following this, let unit x1 unsaturate from 1 immediately after time instant t0 C tu > t0 , where »x1 .t0 C tu / D 1, and reach its state value yx1 .t0 C t` / D 2=3 from above at time t0 C t` (tu < t` ). Let 1x1 Q1 > ¡" for Q1 D
H nfx0k g
(4.93)
and
1x1 Q01 > ¡"2¡1="
(4.94)
for Q01 D H n.At0 [ B [ fx1 ; x0k g/ where At0 D fq 2 AI »q .t0 / ¸ 1g:
(4.95)
Further assume that X
±x1 < "2¡1="
(4.96)
¡1="
v.q; x1 /±q < "2
(4.97)
v.q; x1 /±q > ¡"2¡1=" :
(4.98)
q2At0
X q2B
Continuous-Time Symmetric Hopeld Nets
723
Then the state of unit x1 keeps decreasing after time t0 C t` at least until x1 is again unsaturated from 0 by a1 . Proof. It can rst be easily checked from Figure 4 that the excitation of all ®j ; ¯j ; Áj ; Âj 2 A [ B for j D 1; : : : ; n is negative when yx1 D 2=3: »®j · v.0; ®j / C v.Ãj ; ®j / C v.³j ; ®j / C v.x1 ; ®j / ¢ »¯j · v.0; ¯j / C v.´j ; ¯j / C v.x1 ; ¯j / ¢
2 D ¡2 3
2 D ¡2 3
»Áj · v.0; Áj / C u C v.Ãj ; Áj / C v.!j ; Áj / C v.x1 ; Áj / ¢ »Âj · v.0; Âj / C u C v.x1 ; Âj / ¢
(4.99) (4.100)
2 · ¡2 3
(4.101)
2 · ¡42: 3
(4.102)
1. Obviously, yx1 .t/ is further increasing for t ¸ t0 Ct` since .dyx1 =dt/.t/ > 0 for yx1 .t0 C t` / D 2=3 according to inequality 4.82, and hence, unit x1 then saturates at 1. 2. Excitation »x1 .t0 C tu / D W C .12u C 39/n C 3 C 2" C v.x0k ; x1 /yx0k .t0 C tu / C 1x1 Q1 e¡tu D 1
(4.103)
of unit x1 at t0 C tu is derived from equation 4.35 for Q1 where v.0; x1 /C
X
q2Q1 I»q .t/¸1
v.q; x1 / D v.0; x1 /Cv.x1 ; x1 /Cv.c0 ; x1 /Cv.a1 ; x1 / C v.b1 ; x1 / C v.ck ; x1 / C
X v.q; x1 / q2At0
D W C .12u C 39/n C 3 C 2"
(4.104)
follows from Figures 1 and 2, denition 3.9, and the fact that ®j 2 At0 iff ¯j 2 = At0 , for all j D 1; : : : ; n (see section 3.2). Equation 4.103 gives yx0k .t0 C tu / ¸
W C .12u C 39/n C 2 C " ¡v.x0k ; x1 /
(4.105)
by assumption 4.93 and v.x0k ; x1 / < 0. We already know from inequality 4.82 where e¡.t¡t0 / < 1 that dyx0
k
dt
.t/ ¸ "yx0k .t/ ¡ "2¡1=" :
(4.106)
Ï õ ma and P. Orponen J. S´
724
We prove that the state yx0k .t/ of unit x0k keeps increasing after x1 is unsaturated from 1 for t > t0 C tu . Since yx0k is continuous, it sufces to show that its derivative satises dyx0
k
dt
.t0 C tu / > 0
(4.107)
at time instant t0 C tu according to inequality 4.106, which further reduces to dyx0
k
dt
.t0 C tu / ¸
".W C .12u C 39/n C 2 C "/ ¡ "2¡1=" > 0 ¡v.x0k ; x1 /
(4.108)
by using condition 4.105. Inequality 4.108 follows from ¡v.x0k ; x1 / < "21=" according to lemma 1. Similarly, excitation X »x1 .t/ D v.0; x1 / C v.q; x1 / C .1 C "/yx1 .t/ C v.x0k ; x1 /yx0k .t/ C
X q2At0
q2Q01 I»q .t/¸1
v.q; x1 /yq .t/ C
X q2B
v.q; x1 /yq .t/ C 1x1 Q01 e¡.t¡t0 / (4.109)
of unit x1 for time t ¸ t0 is obtained from equation 4.35 for Q01 . Hence, the derivative of unsaturated x1 for t ¸ t0 C tu can be expressed as dyx1 dt
.t/ D v.0; x1 / C C
X q2At0
X q2Q01 I»q .t/¸1
v.q; x1 / C "yx1 .t/ C v.x0k ; x1 /yx0k .t/
v.q; x1 /yq .t/C
X q2B
v.q; x1 /yq .t/C1x1 Q01 e¡.t¡t0 / (4.110)
according to equation 2.6. Denote by 0 < t0d < t` the least local time instant such that dyx1 dt
.t0 C t0d / D 0:
(4.111)
Note that t0d > tu since unit x1 is still saturated at 1 at time t0 C tu when its state yx1 is increasing according to equation 4.34. The values dening the boundary time instants (printed in bold) for the variables associated with units x0k ; x1 and their signs or dynamics are summarized in Table 2. Now, the change in values of particular terms in the state derivative 4.110 of unit x1 will be estimated between time instants t0 C t` and t0 C t0d . Thus, yx1 .t0 C t0d / > yx1 .t0 / D 1 ¡ ±x1 > 1 ¡ "2¡1=" >
2 D yx1 .t0 C t` / 3
(4.112)
Continuous-Time Symmetric Hopeld Nets
725
Table 2: The review of the boundary time instants for units x′_k, x1.

[The table lists, at the boundary time instants t = t0, t0 + tu, t0 + t′_d, and t0 + t_ℓ, the values and signs of ξ_{x′_k}(t) and dy_{x′_k}/dt (with dy_{x′_k}/dt = 0 at t0 + tu), of ξ_{x1}(t) (> 1, then = 1), of y_{x1}(t) (= 1 − δ_{x1}, then = 1 − δ_{x1}e^{−tu}, down to 2/3), and of dy_{x1}/dt (> 0, then = 0).]

[…]

… for p = b′_k, 2 ≤ k ≤ n + 1;  … for p = b1;  … for p = b_k, 2 ≤ k ≤ n + 1;  … otherwise.    (4.119)
It follows from Figures 1 and 2 that v(o′, r) ≥ 0. In fact, only v(z1, c_k) = 1 and v(z_k, c1) > 0 for k = 2, …, n + 1 are positive, while the remaining pairs o′, r are unconnected, corresponding to v(o′, r) = 0. In addition, v(p, r) ≥ 1, and hence the defect in formula 4.44 can be upper bounded as follows:

δ_p + Δ_{r,Q} ≤ v(p, r)δ_p + Δ_{r,Q} + v(o′, r)δ_{o′} = Δ_{r,Q2},    (4.120)

where Q2 = H \ {o′}, according to definition 4.36. Assume first that unit r ≠ x1 is not connected to any unit in S, which gives

Δ_{r,Q2} = Δ_{r,Q′_2},    (4.121)

where Q′_2 = Q2 ∩ C. For o′ ≠ b′_k (2 ≤ k ≤ n + 1) and o′ ≠ b_k (1 ≤ k ≤ n + 1), all the units in Q′_2 are saturated when o′ is being activated, which takes time t1. On the other hand, the activation of o′ = b′_k, b_k takes time 2t1 due to its bias
v(0, o′) = −1 + ε/3, while simultaneously units in R ∩ C ⊆ Q′_2 (for particular cases of R, see the table in Figure 6, in which o′ now corresponds to r) saturate within a time period of length t1 according to point 2b of lemma 2. Note that the corresponding unit, x′_k or x_k, preceding o′ = b′_k or o′ = b_k, respectively, is already saturated at 1 by point 2 of lemma 3 before o′ unsaturates from 0. Furthermore, all the units in Q′_2 are saturated for the duration of the next t1 period, certainly before unit p from formula 4.44 unsaturates. It follows from inequality 4.43 that Δ_{r,Q′_2} < ε, which implies condition 4.44 according to inequality 4.120 and equation 4.121. Similarly, for r = x1 (i.e., p = a1), it will be shown below that all the units in Q2 that are connected to r, especially units in A ∪ B_{t0} ⊆ R from the simulating subnetwork S, are saturated before o′ = c1 unsaturates, which clearly suffices for proving condition 4.44. Analogously, the stronger defect bounds 4.51 through 4.53 in point 2 of lemma 3 are met for p = x′_k (2 ≤ k ≤ n + 1) and p = x_k (1 ≤ k ≤ n + 1), since according to point 1 of lemma 3, the transient time 2t1 that is sufficient to decrease the underlying defects due to inequality 4.43 is guaranteed by the successive separate (all the units in C are saturated but one) activations of the preceding two units c_k, a_k with v(c_k, r) = v(a_k, r) = 0 before unit p unsaturates from 0. It remains only to check inequality 4.53 for p = x1 when A ∪ B_{t0} ⊆ R, which again follows from the fact that units in A ∪ B_{t0} ⊆ S are saturated before o′ = c1 unsaturates, whose proof is given below.

Now, the synchronization of the simulating subnetwork S will be analyzed. The first phase of any simulated discrete step is delimited by the period during which the clock interface unit x1 ∈ C is passive. Recall that unit x1 is reset by x′_k for some 2 ≤ k ≤ n + 1, which saturates at 1 by point 2 of lemma 3 before the subsequent unit b′_k is unsaturated from 0. The activation of b′_k then takes time 2t1 by point 1 of lemma 3, of which the first period of length t1 suffices for x1 to saturate at 0 according to point 2b of lemma 2. Thus, the second t1 period of activating b′_k already belongs to the first simulation phase. This phase then also includes the successive separate activations of units b_k, d_k, z_k, c′, c1, a1, each of length t1, except for the activation of b_k, which requires time 2t1 according to point 1 of lemma 3. Altogether, the length of the first phase is certainly at least 8t1. Note that in this consideration, the length of activating x_k is omitted, for which only an upper bound is given in point 2 of lemma 3. Actually, the first two t1 periods of the first simulation phase are sufficient to stabilize the simulating subnetwork S, that is, to saturate all its units, even before d_k unsaturates from 0. In particular, according to lemma 4, units in A are and remain saturated at 0 after y_{x1} = 2/3, while units in B_{t0} only then saturate at 1 by point 2b of lemma 2 within the first t1 period, long before o′ = c1 unsaturates, as required above. The following second t1 period ensures that the units in the second and fifth layers of S saturate (see Figure 3). Similarly, the second phase of any simulated discrete step is defined as the period during which the clock interface unit x1 ∈ C is active. Recall that unit x1 is activated by a1 and saturates at 1 by point 2 of lemma 3 before the
subsequent unit b1 becomes unsaturated from 0. Thus, the second simulation phase includes the successive separate activations of units b1, d1, z1, c′, c_k, a_k for some 2 ≤ k ≤ n + 1, each of length t1, except for the activation of b1, which requires time 2t1 according to point 1 of lemma 3. Altogether, the length of the second phase is certainly at least 7t1. In fact, already the first three t1 periods of the second simulation phase are sufficient to stabilize the simulating subnetwork S, that is, to saturate all its units, even before z1 unsaturates from 0. In particular, according to lemma 4, units in B are and remain saturated at 0 after y_{x1} = 2/3, while units in A_{t0} only then saturate at 1 by point 2b of lemma 2 within the first t1 period. The following two t1 periods ensure that the units in the second layer of S, followed by the third-layer units, saturate according to point 2b of lemma 2 (see Figure 3). Obviously, the length of the second simulation phase, which is lower bounded by 7t1, also guarantees the defect bounds 4.93, 4.94, and 4.96 through 4.98 in point 2 of lemma 4. Finally, the lower bound Ω(t*/ε) on the total simulation time follows immediately from the previous time analysis and equation 4.37. In addition, every unsaturated unit in H saturates within time at most O(t1) according to point 2b of lemma 2 and lemma 3. Also, during the simulation, all the units in H can simultaneously be saturated for a period of at most t1 before the respective defects decrease below ε according to inequality 4.43 and the next unit unsaturates. This implies the corresponding upper bound O(t*/ε) and completes the proof of the theorem.

5 A Simulation Example

A computer program, HNGEN, has been created to automate the construction from theorem 1. The input for HNGEN is a text file containing the asymmetric weights and biases of a given DRNN N, as well as its initial state. The program generates the system of differential equations 2.6 describing the dynamics of the continuous-time symmetric Hopfield net H corresponding to the DRNN N to be simulated, together with its boundary conditions, in the form of a FORTRAN subroutine. This FORTRAN procedure is then presented to a solver from the NAG library (Numerical Algorithms Group, 2002) that provides a numerical solution for this system. By using HNGEN, the underlying construction has been successfully tested on several examples. For instance, consider the simple three-neuron cycle DRNN N depicted in Figure 7, initiated in a state where neuron 1 has output 1 and neurons 2 and 3 have output 0. The computation of the network then consists simply of propagating the unit signal around the cycle. Implementing this system with the HNGEN generator results in a system of 61 differential equations describing the dynamics of a continuous-time symmetric Hopfield net H with 61 units. Figure 8 shows the numerical evolution of the states corresponding to the clock interface unit x1 and the three units π1, π2, π3 that represent the binary states of the original discrete neurons from N, for
Figure 7: A three-neuron cycle DRNN.
a period of eight (2³) simulated discrete steps, confirming the correctness of the simulation. A parameter value of ε = 0.3 was used in this numerical simulation, showing that the theoretical bound on ε in theorem 1, which would require about ε < 0.024 in the present case, is actually quite conservative.

6 Conclusion and Open Problems

We have proved that an arbitrary convergent discrete-time recurrent neural network can be simulated by a symmetric continuous-time Hopfield net with only a linear increase in network size. The existence of Lyapunov functions for Hopfield nets precludes the use of unbounded oscillations in such a simulation; nevertheless, we are able to base the construction on the bounded but exponentially long sequence of pulses generated by the continuous-time clock subnetwork. From the point of view of understanding analog computation in general, this technique is somewhat unsatisfying, since we are still basically discretizing the continuous-time computation. It would be most interesting to develop theoretical tools (e.g., complexity measures, reductions, universal computation) for "natural" continuous-time computations that exclude the use of discretizing oscillations. First steps in this direction have recently been taken (Ben-Hur, Siegelmann, & Fishman, 2002; Gori & Meer, 2001). Another challenge for further research is to prove upper bounds on the power of continuous-time systems. Note that in the case of discrete-time analog-state neural networks, a single fixed-size network with rational-number parameters can be computationally universal, that is, able to simulate a universal Turing machine on arbitrary inputs (Siegelmann & Sontag, 1995). Can this strong universality result be generalized to continuous-time systems? Also, we have established an exponential lower bound on
Figure 8: Continuous-time symmetric simulation of the three-neuron cycle DRNN for ε = 0.3.
the convergence time of symmetric continuous-time Hopfield nets (Šíma & Orponen, 2001a). Can a matching upper bound be proved, or the lower bound be increased?
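To make the simulation example of section 5 concrete, the discrete network being simulated can be stepped directly. The sketch below is an illustrative reconstruction only: the cycle weights and thresholds are assumed values chosen so that a single unit signal propagates around the three-neuron cycle, as the text describes; they are not read off Figure 7.

```python
def step(state, weights, thresholds):
    """One synchronous update of a binary-threshold DRNN."""
    n = len(state)
    return [1 if sum(weights[i][j] * state[j] for j in range(n)) >= thresholds[i]
            else 0
            for i in range(n)]

# Assumed cyclic connectivity: each neuron fires iff its predecessor fired.
W = [[0, 0, 1],   # neuron 1 listens to neuron 3
     [1, 0, 0],   # neuron 2 listens to neuron 1
     [0, 1, 0]]   # neuron 3 listens to neuron 2
THETA = [1, 1, 1]

state = [1, 0, 0]                 # neuron 1 starts at 1, the others at 0
trajectory = [state]
for _ in range(6):
    state = step(state, W, THETA)
    trajectory.append(state)
# The unit signal visits each neuron in turn and returns after three steps.
```

Under the construction of theorem 1, each such discrete step is expanded into the two continuous-time simulation phases analyzed above, which is how this 3-unit network grows into the 61-unit Hopfield net H of Figure 8.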
Acknowledgments

J. Š. was partially supported by grants GA AS CR B2030007 and GA CR No. 201/02/1456. P. O. was partially supported by grant 81120 from the Academy of Finland.
References

Asarin, E., & Maler, O. (1994). On some relations between dynamical systems and transition systems. In Proceedings of the 21st ICALP'94 International Colloquium on Automata, Languages, and Programming, LNCS 820 (pp. 59–72). Berlin: Springer-Verlag.
Balcázar, J. L., Díaz, J., & Gabarró, J. (1995). Structural complexity I (2nd ed.). Berlin: Springer-Verlag.
Balcázar, J. L., Gavaldà, R., & Siegelmann, H. T. (1997). Computational power of neural networks: A characterization in terms of Kolmogorov complexity. IEEE Transactions on Information Theory, 43(4), 1175–1183.
Ben-Hur, A., Siegelmann, H. T., & Fishman, S. (2002). A theory of complexity for continuous time systems. Journal of Complexity, 18(1), 51–86.
Branicky, M. (1994). Analog computation with continuous ODEs. In Proceedings of the PhysComp'94 Workshop on Physics and Computation (pp. 265–274). Los Alamitos, CA: IEEE Computer Society Press.
Casey, M. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6), 1135–1178.
Cichocki, A., & Unbehauen, R. (1993). Neural networks for optimization and signal processing. Chichester: Wiley.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13(5), 815–826.
Golden, R. M. (1996). Mathematical methods for neural network analysis and design. Cambridge, MA: MIT Press.
Goles, E., & Martínez, S. (1989). Exponential transient classes of symmetric neural networks for synchronous and sequential updating. Complex Systems, 3(6), 589–597.
Gori, M., & Meer, K. (2001). A step towards a complexity theory for dynamical systems (Tech. Rep. ALCOMFT-TR-01-191). Department of Computer Science, University of Aarhus, Denmark.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Hopfield, J. J., & Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52(3), 141–152.
Horne, B. G., & Hush, D. R. (1996). Bounds on the complexity of recurrent neural network implementations of finite state machines. Neural Networks, 9(2), 243–252.
Indyk, P. (1995). Optimal simulation of automata by neural nets. In Proceedings of the 12th STACS'95 Annual Symposium on Theoretical Aspects of Computer Science, LNCS 900 (pp. 337–348). Berlin: Springer-Verlag.
Koiran, P., Cosnard, M., & Garzon, M. (1994). Computability with low-dimensional dynamical systems. Theoretical Computer Science, 132, 113–128.
Lepley, M., & Miller, G. (1983). Computational power for networks of threshold devices in an asynchronous environment (Unpublished manuscript). Cambridge, MA: Department of Mathematics, MIT.
Maass, W., & Orponen, P. (1998). On the effect of analog noise in discrete-time analog computations. Neural Computation, 10(5), 1071–1095.
Maass, W., & Sontag, E. D. (1999). Analog neural nets with gaussian or other common noise distributions cannot recognize arbitrary regular languages. Neural Computation, 11(3), 771–782.
Medsker, L. R., & Jain, L. C. (Eds.). (2000). Recurrent neural networks: Design and applications. Boca Raton, FL: CRC Press.
Moore, C. (1990). Unpredictability and undecidability in physical systems. Physical Review Letters, 64(20), 2354–2357.
Moore, C. (1998). Finite-dimensional analog computers: Flows, maps, and recurrent neural networks. In Proceedings of the 1st International Conference on Unconventional Models of Computation (pp. 59–71). Berlin: Springer-Verlag.
Numerical Algorithms Group. (2002). Fortran library routine D02PCF. Available on-line: http://www.nag.co.uk/numeric//manual/pdf/D02/d02pcf.pdf.
Omohundro, S. (1984). Modelling cellular automata with partial differential equations. Physica, 10D, 128–134.
Orponen, P. (1996). The computational power of discrete Hopfield nets with hidden units. Neural Computation, 8(2), 403–415.
Orponen, P. (1997a). A survey of continuous-time computation theory. In D.-Z. Du & K.-I. Ko (Eds.), Advances in algorithms, languages, and complexity (pp. 209–224). Dordrecht: Kluwer Academic Publishers.
Orponen, P. (1997b). The computational power of continuous time neural networks. In Proceedings of the 24th SOFSEM'97 Seminar on Current Trends in Theory and Practice of Informatics, LNCS 1338 (pp. 86–103). Berlin: Springer-Verlag.
Parberry, I. (1994). Circuit complexity and neural networks. Cambridge, MA: MIT Press.
Siegelmann, H. T. (1999). Neural networks and analog computation: Beyond the Turing limit. Boston: Birkhäuser.
Siegelmann, H. T., Roitershtein, A., & Ben-Hur, A. (2000). Noisy neural networks and generalizations. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems (NIPS '99), 12 (pp. 335–341). Cambridge, MA: MIT Press.
Siegelmann, H. T., & Sontag, E. D. (1994). Analog computation via neural networks. Theoretical Computer Science, 131(2), 331–360.
Siegelmann, H. T., & Sontag, E. D. (1995). Computational power of neural networks. Journal of Computer and System Sciences, 50(1), 132–150.
Šíma, J., & Orponen, P. (2000). A continuous-time Hopfield net simulation of discrete neural networks. In Proceedings of the 2nd NC'2000 International ICSC Symposium on Neural Computing (pp. 36–42). Wetaskiwin, Canada: ICSC Academic Press.
Šíma, J., & Orponen, P. (2001a). Exponential transients in continuous-time symmetric Hopfield nets. In Proceedings of the 11th ICANN'2001 Conference on Artificial Neural Networks, LNCS 2130 (pp. 806–813). Berlin: Springer-Verlag.
Šíma, J., & Orponen, P. (2001b). Computing with continuous-time Liapunov systems. In Proceedings of the 33rd STOC'2001 Annual ACM Symposium on Theory of Computing (pp. 722–731). New York: ACM Press.
Šíma, J., Orponen, P., & Antti-Poika, T. (2000). On the computational complexity of binary and analog symmetric Hopfield nets. Neural Computation, 12(12), 2965–2989.
Šíma, J., & Wiedermann, J. (1998). Theory of neuromata. Journal of the ACM, 45(1), 155–178.
Stoll, H. M., & Lee, L.-S. (1988). A continuous-time optical neural network. In Proceedings of the IEEE International Conference on Neural Networks (Vol. II, pp. 373–384). New York: IEEE Press.

Received May 29, 2002; accepted August 22, 2002.
ARTICLE
Communicated by Rajesh P. N. Rao
Modeling Reverse-Phi Motion-Selective Neurons in Cortex: Double Synaptic-Veto Mechanism

Chun-Hui Mo
[email protected] Division of Biology, and Division of Chemistry, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Christof Koch
[email protected] Division of Biology, Division of Engineering and Applied Science, California Institute of Technology, Pasadena, CA 91125, U.S.A.
Reverse-phi motion is the illusory reversal of the perceived direction of movement when the stimulus contrast is reversed in successive frames. Livingstone, Tsao, and Conway (2000) showed that direction-selective cells in the striate cortex of the alert macaque monkey reversed their excitatory and inhibitory regions when two bars of different contrast were flashed sequentially during a two-bar interaction analysis. While correlation or motion energy models predict the reverse-phi response, it is unclear how neurons can accomplish this. We carried out detailed biophysical simulations of a direction-selective cell model implementing a synaptic shunting scheme. Our results suggest that a simple synaptic-veto mechanism with strong direction selectivity for normal motion cannot account for the observed reverse-phi motion effect. Given the nature of reverse-phi motion, a direct interaction between the ON and OFF pathways, missing in the original shunting-inhibition model, is essential to account for the reversal of the response. We here propose a double synaptic-veto mechanism in which ON excitatory synapses are gated by both delayed ON inhibition at their null side and delayed OFF inhibition at their preferred side. The converse applies to OFF excitatory synapses. Mapping this scheme onto the dendrites of a direction-selective neuron permits the model to respond best to normal motion in its preferred direction and to reverse-phi motion in its null direction. Two-bar interaction maps showed reversed excitation and inhibition regions when two bars of different contrast were presented.

Neural Computation 15, 735–759 (2003) © 2003 Massachusetts Institute of Technology

1 Introduction

Reverse-phi motion was first demonstrated by Anstis (1970; Anstis & Rogers, 1975). Subjects perceived the reverse direction of motion when the contrast of a moving object reverses in the second frame of a two-frame shift experiment. A repetitive four-stroke cycle of reverse-phi motion gives a strong illusion of unidirectional apparent motion (Anstis & Rogers, 1986). The reverse-phi illusion involves the short-range motion process pathway (Chubb & Sperling, 1989). Random dot cinematogram (RDC) studies suggest that Dmax, the maximum distance dots can move from one frame to the other while still preserving the sense of motion, is comparable for reverse-phi and normal motion, compatible with the notion that the same short-range direction-selective mechanism is most likely responsible for both normal and reverse-phi motion (Sato, 1989). Reverse-phi-like effects have also been reported in electrophysiological experiments on direction-selective complex cells in cat striate cortex (Emerson, Citron, Vaughn, & Klein, 1987), the H1 cell in the fly's lobula plate (Egelhaaf & Borst, 1992), and the optical tract of the wallaby (Ibbotson & Clifford, 2001). Recent recordings from direction-selective cells in the alert monkey show that cells in cortical areas V1 and MT reverse their facilitation and suppression regions in the two-bar interaction map when two bars of different contrast are presented (Livingstone, Tsao, & Conway, 2000; Conway & Livingstone, 2001). This implies that these cells respond to reverse-phi motion in the reversed direction. Space-time plots of reverse-phi motion show energy in the reverse direction (see Figure 1A). Although the bar moves to the right, the left motion energy unit aligns better with the stimuli and extracts more motion energy than the right motion energy unit. Therefore, both the motion energy model (Adelson & Bergen, 1985) and the equivalent Reichardt model (Reichardt, 1961; Santen & Sperling, 1985) can account for reverse-phi motion. At the core of the Reichardt detector is a correlation step that involves multiplication between inputs. Mathematically, if the sign of one of the inputs being multiplied is reversed, as in the case of reverse-phi motion, the sign of the final product is also reversed. However, there is no experimental evidence that either a single neuron or a neural network can perform a clean, four-quadrant multiplication operation. Even for a single identifiable cell that performs multiplication, such as the LGMD neuron in the locust's visual system (Hatsopoulos, Gabbiani, & Laurent, 1995; Gabbiani, Mo, & Laurent, 2001), it is unlikely that the sign of the cell's output could be reversed for an input of reversed sign, given the multiple half-wave rectification mechanisms and narrow operating ranges of most biophysical processes involved. It is therefore of interest to study a direction-selective mechanism that can be implemented by neurons and can account for both normal and reverse-phi motion. The first computational step in visual processing is half-wave rectification and separation into ON and OFF channels. It is unknown whether direction selectivity is generated by nonlinear interactions between these half-wave rectified signals or between simple cells that carry the reconstructed full-wave signal. DL-2-amino-4-phosphonobutyric acid (APB) reversibly blocks the ON response in the mammalian retina (Schiller, 1982), but the detection of motion direction is largely unaffected in rabbit, cat, and monkey in electrophysiological experiments (Knapp & Mistler, 1983; Horton & Sherk, 1983; Schiller, Sandel, & Maunsell, 1986). Similarly, existing direction-selective models treat ON and OFF inputs separately (Koch & Poggio, 1985; Suarez, Koch, & Douglas, 1995; Rao & Sejnowski, 2000). However, the nature of reverse-phi motion stimuli suggests an interaction between the ON and OFF channels. Different rectification schemes affect the ON-OFF interaction differently, and therefore carefully constructed reverse-phi stimuli have been used in experiments to separate first-order, second-order, and third-order motion (Lu & Sperling, 1999; Mather & Murdoch, 1999). The requirement for ON-OFF interactions constrains cellular models of direction selectivity. V1 direction-selective cells show slanted excitatory regions in their receptive field map (Livingstone, 1998). The asymmetry in the summation of excitatory inputs at the dendrites alone was not sufficient to account for the directional response, based on a modeling study by Anderson, Binzegger, Kahana, Martin, and Segev (1999). Therefore, asymmetrical delayed inhibition is likely to be the mechanism that underlies direction selectivity. Such a mechanism based on shunting inhibition (i.e., an increase in a chloride-based GABA_A conductance that reverses close to the cell's resting potential) was proposed for the cortex by Koch and Poggio (1985). Recent experiments in the retina provide evidence in favor of at least some nonlinear interactions between excitatory and shunting inhibitory inputs taking place within the dendrites of direction-selective ganglion cells (Taylor, He, Levick, & Vaney, 2000; for a dissenting view, see Borg-Graham, 2001). Large conductance changes that reverse around the cell's resting potential have been observed in V1 during visual stimulation (Anderson, Carandini, & Ferster, 2000; Borg-Graham, Monier, & Frégnac, 1998).

Figure 1: Space-time plots of normal and reverse-phi motion, and connectivity diagram of the model that accounts for direction and reverse-phi selectivity. (A) Space-time plot of a one-dimensional white bar moving from left to right in normal motion (left panel) and reverse-phi motion (right panel). A right motion energy unit aligns well with the normal motion plot but triggers a much reduced response for reverse-phi motion. Instead, reverse-phi motion strongly stimulates a left motion energy unit. (Adapted from Fig. 16 in Adelson & Bergen, 1985.) (B) Connectivity diagram of the normal and reverse-phi motion direction-selective model. Input to LGN neurons comes from a one-dimensional array of 179 pixels. The intensities from those pixels were summed through difference-of-gaussians spatial filters onto LGN cells. There is one ON center and one OFF center geniculate cell at each of six spatially offset locations. Each of the middle four ON center LGN cells provides excitatory input to one branch of the model cell dendrites, while the ON center LGN cell immediately to the right and the OFF center LGN cell immediately to the left provide delayed on-the-path inhibition. The converse connection scheme for the OFF center LGN cells' excitatory inputs is not shown.
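The divisive character of shunting inhibition described above can be illustrated with a passive single-compartment steady state. This is a minimal sketch, not the compartmental model of this article: the leak conductance is an assumed value, while the synaptic conductances borrow the AMPA and GABA magnitudes quoted in section 2, with the inhibitory reversal potential placed at rest.

```python
def steady_state_vm(g_leak, e_leak, g_exc=0.0, e_exc=0.0, g_inh=0.0, e_inh=-60.0):
    """Steady-state membrane potential (mV) of a passive compartment:
    the conductance-weighted mean of the reversal potentials."""
    g_total = g_leak + g_exc + g_inh
    return (g_leak * e_leak + g_exc * e_exc + g_inh * e_inh) / g_total

G_LEAK, E_REST = 10.0, -60.0  # assumed leak conductance (nS) and rest (mV)

v_rest = steady_state_vm(G_LEAK, E_REST)
v_exc = steady_state_vm(G_LEAK, E_REST, g_exc=2.5)              # excitation alone
v_shunt = steady_state_vm(G_LEAK, E_REST, g_inh=6.0)            # shunt alone
v_both = steady_state_vm(G_LEAK, E_REST, g_exc=2.5, g_inh=6.0)  # excitation + shunt

epsp = v_exc - v_rest            # 12 mV depolarization
epsp_vetoed = v_both - v_rest    # about 8 mV: divisively reduced
shunt_effect = v_shunt - v_rest  # 0 mV: the shunt alone is invisible
```

Because the inhibitory battery sits at rest, the shunt alone produces no potential change, yet it divides the excitatory drive; this is the nonlinearity that the veto schemes discussed in this article exploit.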
Here we show how a double synaptic-veto mechanism, derived from the traditional asymmetrical delayed shunting inhibition model, can account for both normal and reverse-phi motion direction selectivity.

2 Methods

We followed a two-step compartmental simulation strategy using the program NEURON (Hines & Carnevale, 1997). We first investigated the performance of an idealized neuron (see Figure 4B) before we implemented our synaptic assembly on a reconstructed V1 cell (see Figure 7A). The idealized cell morphology is shown in Figure 4B. Eight dendrites (width, 0.5 µm; length, 100 µm) were directly connected to the soma (width, 16 µm; length, 16 µm). Each dendrite had 20 compartments, for a total of 180 compartments. The dendrites were passive, while the cell body contained a number of voltage-dependent currents that give rise to fast Hodgkin-Huxley-like action potentials. The biophysical parameters were as follows: Ra = 250 Ω·cm, Cm = 0.5 µF/cm², Eleak = −60 mV, Rm = 10 kΩ·cm², gNa = 0.024 S/cm², gK = 0.020 S/cm², ENMDA = 0 mV, gNMDA = 2.5 nS, τNMDA,on = 0.1 ms, τNMDA,off = 80 ms, EAMPA = 0 mV, gAMPA = 2.5 nS, τAMPA,on = 0.1 ms, τAMPA,off = 2 ms, EGABA = −60 mV, gGABA = 6.0 nS, τGABA,on = 1 ms, τGABA,off = 80 ms. Synaptic input was modeled using the point process in NEURON (adopted from Archie & Mel, 2000). We adapted a layer 4 stellate cell model from Mainen and Sejnowski (1996; see Figure 7A). The model cell contained sodium channels at the soma and dendrites, as well as fast potassium channels at the soma and the axon. Both calcium- and voltage-dependent slow potassium channels and high-threshold calcium channels were present at the soma and dendrites. All passive and active parameters were the same as those used in the original paper. All synapse parameters were the same as described above, except EGABA = −70 mV and gGABA = 6.9 nS. The input connection scheme for both models is shown in Figure 1B. The spatial resolution of the stimulus was 1 minute of arc and the temporal resolution 0.1 ms. The lateral geniculate nucleus (LGN) layer was modeled as a transfer function. Stimuli projected onto the LGN layer were processed by two 6 × 1 arrays of spatial filters modeling ON center and OFF center cells that covered 3 degrees of visual space. The spatial kernel was a difference-of-gaussians (DOG) function (adapted from Wörgötter & Koch, 1991). The gaussian kernel was G(x) = (K/2πσ²) exp(−x²/2σ²), with σ_center = 10.6 minutes, σ_surround = 31.8 minutes, and K_center/K_surround = 17/16. Stimuli were first passed through these spatial filters and then through low-pass temporal filters, with a delay between the center and the surround components (τ_center = 10 ms, τ_surround = 20 ms, delay_surround = 3 ms). The resulting values were scaled to give a time-dependent, stimulus-driven LGN instantaneous firing rate with a maximum value of 200 Hz. A background firing rate of 5 Hz was added, and all negative values were set to 0 (half-wave rectification). OFF center cells were modeled as the reverse of their ON center counterparts.
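The LGN stage described above can be summarized computationally. The following sketch is a simplified reimplementation for illustration, not the authors' code: the parameter values follow the text, but the coarse one-point-per-minute grid, the first-order low-pass filters, and the absolute K scale are assumptions.

```python
import math

# Parameters from the text (minutes of arc and ms); grid details are assumed.
SIGMA_C, SIGMA_S = 10.6, 31.8          # center / surround widths (min of arc)
K_C, K_S = 17.0, 16.0                  # K_center / K_surround = 17/16
TAU_C, TAU_S, DELAY_S = 10.0, 20.0, 3.0  # temporal constants and surround delay (ms)
DT = 0.1                               # temporal resolution (ms)

def gaussian(x, k, sigma):
    """Gaussian kernel G(x) = (K / 2*pi*sigma^2) * exp(-x^2 / 2*sigma^2)."""
    return (k / (2.0 * math.pi * sigma ** 2)) * math.exp(-x ** 2 / (2.0 * sigma ** 2))

def dog_response(stimulus, positions, center_x):
    """Spatially filter a 1-D stimulus with the center and surround gaussians."""
    center = sum(gaussian(x - center_x, K_C, SIGMA_C) * s
                 for x, s in zip(positions, stimulus))
    surround = sum(gaussian(x - center_x, K_S, SIGMA_S) * s
                   for x, s in zip(positions, stimulus))
    return center, surround

def lgn_rate(center_trace, surround_trace, background=5.0, peak=200.0):
    """Low-pass filter center and delayed surround drives, take their
    difference, scale into a firing rate, and half-wave rectify."""
    delay_steps = int(DELAY_S / DT)
    yc = ys = 0.0
    rates = []
    for t, c in enumerate(center_trace):
        s = surround_trace[t - delay_steps] if t >= delay_steps else 0.0
        yc += (DT / TAU_C) * (c - yc)   # first-order low-pass, center
        ys += (DT / TAU_S) * (s - ys)   # first-order low-pass, surround
        rates.append(max(background + peak * (yc - ys), 0.0))  # rectified rate
    return rates
```

An OFF center cell would simply be driven by the negated center-surround difference, mirroring "the reverse of their ON center counterparts."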
Each of the middle four ON center LGN cells provided the excitatory input to one branch of the model cell dendrites, while the ON center LGN cell immediately to the right (preferred side) and the OFF center LGN cell immediately to the left (null side) provided delayed inhibition. Each of the middle four OFF center LGN cells provided excitatory input to one branch of the model cell dendrites, while the OFF center LGN cell immediately to the right and the ON center LGN cell immediately to the left provided delayed inhibition. The delay was 12 ms. There were 8 excitatory synapses and 16 inhibitory synapses in the model. Excitatory synapses were mapped to the dendritic compartment 60 µm away from the soma; the same type of inhibitory input was located 50 µm, and a different type of inhibition 40 µm, away from the cell body. Given that shunting inhibition was on the direct path between the excitation and the soma, it could effectively and specifically veto the excitatory input to that branch while only minimally affecting the excitatory input from neighboring branches (Koch, Poggio, & Torre, 1982). For the layer 4 stellate cell, eight triplets of excitatory-inhibitory-inhibitory synapses were mapped to eight separate terminal branches. This arrangement was replicated four times and mapped onto 32 terminal branches. The total synaptic count was 32 excitatory and 64 inhibitory synapses.

3 Results

We started by testing how well the reverse-phi effect is explained by the original asymmetric delayed inhibition model of Barlow and Levick (1965), as implemented with shunting inhibition (Torre & Poggio, 1978; Poggio & Torre, 1978; Koch et al., 1982). We then carried out compartmental simulations in NEURON using an idealized dendritic geometry to prove our concept and to compare the model against experimental data. Finally, we mapped our synaptic connection scheme onto a more realistic cortical cell morphology and demonstrated how it could account for both normal and reverse-phi direction selectivity.

3.1 Asymmetric Delayed Shunting Inhibition Model. The traditional Barlow and Levick (1965) inhibition-based scheme is shown in Figure 2A. Inhibition was assumed to be of the shunting type (that is, with an inhibitory reversal potential around the local resting potential) so that synaptic interactions are restricted to local branches. We plotted the time course of the excitatory and inhibitory inputs to the model from two adjacent cells in the LGN input layer (see Figure 2B). When a white bar moved in a normal fashion in the preferred direction, there was a temporal shift between the excitatory and inhibitory inputs (see Figure 2B.a), while in the null direction, excitation and inhibition overlapped and therefore cancelled each other (see Figure 2B.c). If the same bar moved in the reverse-phi fashion, there was little difference in the temporal alignment of excitatory and inhibitory inputs between preferred- and null-direction movement (see Figures 2B.e–h). For normal motion

Figure 2 (facing page): Normal shunting inhibition cannot account for reverse-phi motion. (A) The asymmetrical delayed inhibition scheme results in direction selectivity. ON excitation was gated by a delayed ON inhibition at its preferred side, and conversely for OFF excitation (symbols as in Figure 1). (B) Excitatory and inhibitory inputs to the model cell when a bright thin bar moves across its receptive field at 10 degrees per second in the preferred normal (PN) motion direction (a, b), the null normal (NN) motion direction (c, d), the preferred reverse-phi (PR) motion direction (e, f), and the null reverse-phi (NR) motion direction (g, h). Solid lines: excitatory inputs from ON center cells (a, c, e, g) and OFF center cells (b, d, f, h). Dashed lines: delayed inhibitory inputs (20 ms delay), plotted as negative. Inputs were calculated as the stimuli passed through the spatial-temporal filters described in section 2. (C) The direction index DI of the eight-armed dendritic model cell for different inhibitory delay times and contrast reversal rates. Over a wide range of inhibitory input delay times, the model cell was direction selective to normal motion but only weakly so to reverse-phi motion.
Modeling Reverse-Phi Motion
C. Mo and C. Koch
only, the ON branch received significant input, while for reverse-phi motion, the input was spread between the ON and OFF channels. The combined areas under the excitation or inhibition curves for both branches were much smaller for reverse-phi motion than for normal motion. Because of the low-pass filtering and the rectification inherent in the LGN layer, both the excitatory and inhibitory inputs to the model cell during reverse-phi motion were reduced. We compared the direction index (DI) of the model cell for different stimulus reversal rates and delay factors for the inhibition (see Figure 2C). DI was calculated as the response in the preferred direction minus the response in the null direction, divided by the sum of the two. The response in each direction was calculated as the sum of excitation minus inhibition, with all negative values set to zero. Thus, DI ranges from 0 (for a nondiscriminating system) to 1 (for a system that does not respond at all to null-direction motion). The model was direction selective to normal motion for a wide range of delay values but only weakly direction selective to reverse-phi motion. DI for reverse-phi motion increased when the stimulus reversal rate increased, but the direction preference was the same as that for normal motion, contrary to the experimental data. Therefore, it appears that a traditional inhibition scheme cannot easily account for reverse-phi motion. Given the nature of reverse-phi motion, a direct nonlinear interaction between the ON and OFF branches, missing in traditional schemes, is needed to account for reverse-phi motion-direction selectivity. Here we propose a "reversed" shunting inhibition scheme (see Figure 3A), in which an ON excitation is gated by a delayed OFF inhibition at its null side and an OFF excitation is gated by a delayed ON inhibition at its null side (see Poggio, 1982; Koch & Poggio, 1987, for other synaptic logic models involving ON-OFF interaction).
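The direction-index recipe just described can be written compactly; this minimal helper (the function name is ours) follows the stated procedure of rectifying the excitation-minus-inhibition sum in each direction before forming the preferred/null contrast:

```python
def direction_index(pref_exc, pref_inh, null_exc, null_inh):
    """DI = (P - N) / (P + N), where P and N are the rectified sums of
    excitation minus inhibition in the preferred and null directions.
    DI ranges from 0 (no discrimination) to 1 (no null-direction response)."""
    p = max(pref_exc - pref_inh, 0.0)
    n = max(null_exc - null_inh, 0.0)
    if p + n == 0.0:
        return 0.0            # undefined for a silent cell; report 0
    return (p - n) / (p + n)

# Delayed inhibition overlaps excitation only in the null direction:
print(direction_index(10.0, 2.0, 10.0, 10.0))   # -> 1.0 (fully selective)
# Without that overlap, the two directions respond identically:
print(direction_index(10.0, 2.0, 10.0, 2.0))    # -> 0.0 (no selectivity)
```

Negative DI values, as reported later for reverse-phi motion, simply indicate that the nominal null direction drives the cell more strongly.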
Not surprisingly, this model responds equally strongly to both directions of normal motion, as inhibition is activated only by a bar of the opposite contrast from that of excitation. However, there is a difference in the temporal alignment of excitatory and inhibitory inputs between the

Figure 3: Facing page. The reversed shunting inhibition scheme was direction selective to reverse-phi motion. (A) Reversed asymmetrical delayed-inhibition scheme. ON excitation was gated by a delayed OFF inhibition at its preferred side, and the converse for OFF excitation. (B) Excitatory and inhibitory inputs to the model cell when a bright thin bar moves across its receptive field at 10 degrees per second in the preferred normal (PN) motion direction (a,b), the null normal (NN) motion direction (c,d), the preferred reverse-phi (PR) motion direction (e,f), and the null reverse-phi (NR) motion direction (g,h). Solid lines: Excitatory inputs from ON center cells (a,c,e,g) and OFF center cells (b,d,f,h). Dashed lines: Delayed inhibitory inputs (20 ms delay), plotted as negative. (C) The DI calculated for reverse-phi motion. The model was not direction selective to normal motion, but when the inhibitory input delay was optimal, it was direction selective to reverse-phi motion.
preferred- and null-direction reverse-phi movement. When the inhibitory delay matches the stimulus reversal rate, DI is close to 1. DI decreases when the stimulus reversal rate increases. We conclude that the ON-OFF and OFF-ON vetoing scheme can discriminate the direction of reverse-phi motion. Given that the experimental data demonstrated that reverse-phi motion in the cell's null direction elicited more vigorous responses than reverse-phi motion in the preferred direction (DI < 0), the delayed inhibitory input needs to reside at the excitatory input's preferred side instead of its null side.

3.2 Double Synaptic-Veto Mechanism.

A traditional shunting inhibition scheme can account for normal motion-direction selectivity, while a "reversed" shunting inhibition scheme can account for reverse-phi motion-direction selectivity. To account for both, we need to combine the two synaptic schemes. One way to achieve this is to construct a model cell with four dendritic branches, two of which implement a traditional shunting inhibition scheme and the other two a "reversed" one. Such a connection scheme is selective to both types of motion; however, the "traditional" branches offer a nongated path for reverse-phi excitatory inputs, while the "reversed" branches offer a nongated path for normal-motion excitation. Such nongated excitatory inputs result in a high background level of depolarization at the soma and therefore in low DI values. We instead propose a double synaptic-veto mechanism that combines the two synaptic schemes in a more sophisticated way at the microcircuitry level (see Figure 4A).
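The combination of the two schemes can be sketched as branch-level AND-NOT logic (our abstraction; delays and spatial sides are schematic): an excitatory input drives the soma only when neither of its two shunting gates, one fed by the same contrast type and one by the opposite type, is concurrently active.

```python
def branch_output(exc, same_type_gate, opposite_type_gate):
    """Double synaptic veto on one dendritic branch: excitation reaches
    the soma only when neither shunting gate is active (AND-NOT logic)."""
    return exc and not (same_type_gate or opposite_type_gate)

# Preferred-direction normal motion: neither gate coincides with excitation.
assert branch_output(True, False, False)

# Null-direction normal motion: the delayed same-type inhibition overlaps.
assert not branch_output(True, True, False)

# Preferred-direction reverse-phi motion: the opposite-type gate overlaps.
assert not branch_output(True, False, True)
```

This is only the logical skeleton; in the compartmental model the "gates" are graded shunting conductances rather than Boolean switches.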
In the new connection scheme, an ON excitatory synapse is gated both by a delayed ON inhibition at its null side (see Figure 1B, right side of the cell) and by a delayed OFF inhibition at its preferred side (see Figure 1B, left side of the cell), and an OFF excitatory synapse is gated both by a delayed OFF inhibition at its null side and by a delayed ON inhibition at its preferred side. Same-type inhibition at the null side vetoes null-direction normal motion, while different-type inhibition at the preferred side vetoes preferred-direction reverse-phi motion. We mapped this triplet of synapses (one excitatory and two shunting inhibitory) onto the

Figure 4: Facing page. A double synaptic-veto mechanism can account for both normal and reverse-phi motion-direction selectivity. (A) Connection scheme of the double synaptic-veto mechanism. ON excitation is gated by a delayed ON inhibition at its preferred side and a delayed OFF inhibition at its null side; the converse is true for OFF excitation. (B) Mapping of the double synaptic-veto scheme onto the eight-armed cable model with spiking at the cell body. Each excitation-inhibition-inhibition triplet is mapped onto its own dendritic branch, with the excitation at the far side of the cell body. (C) Model cell's response to a bright bar moving at 10 degrees per second across its receptive field. The model cell responds best when a normal motion stimulus moves in its preferred direction and a reverse-phi motion stimulus moves in its null direction. The stimulus reversal rate is 75 Hz.
abstract cell model with a soma and eight dendritic branches (see Figure 4B), creating a simulacrum with four direction-selective subunits, each of which implements the synaptic connections of Figure 4A (the detailed connection scheme is explained in section 2). The model responded best to normal motion in its preferred direction (DI = 1) and to reverse-phi motion in its null direction (DI = −1; see Figure 4C). Note that the amplitude of the response to normal motion (6 spikes) was twice as large as the amplitude to reverse-phi motion (3 spikes), reflecting the fact that both the excitatory and inhibitory inputs for normal motion stimuli were stronger. In a broad range of model parameters (Cm = 0.5 to 1 µF/cm2, Eleak = −70 to −60 mV, inhibitory input delay = 8 to 15 ms, gAMPA + gNMDA = 1 to 10 nS, gAMPA/gNMDA = 0.1 to 10, and gGABA/(gAMPA + gNMDA) = 1 to 10), the simulation showed direction selectivity for both types of motion, but only specific parameter sets resulted in a high DI. When DI was small, null-direction normal motion or preferred-direction reverse-phi motion also elicited spikes, but the timing of the first spike was late compared to the case of preferred-direction movement (data not shown). Even if gNMDA was set to zero, the cell responded differentially to null- and preferred-direction motion. Since sodium and potassium channels were placed only at the soma, voltage-dependent dendritic currents were not required for the model's direction selectivity. However, NMDA currents increased the somatic voltage during preferred-direction movement and thus increased DI. The inhibitory synapses were always located between the excitatory input and the spike-triggering zone at the cell body, thereby fulfilling the "on-the-path" condition (Koch et al., 1982). In most cases, the inhibitory conductance change needs to be only slightly larger than the excitatory conductance change to achieve a "veto" effect.
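The on-the-path veto can be illustrated with a single-compartment steady state (a sketch with our own illustrative conductances, not the NEURON model's parameters): because the GABA reversal potential sits at rest, the inhibitory conductance divides the excitatory depolarization rather than hyperpolarizing the cell, so an inhibitory conductance only slightly larger than the excitatory one already cuts the response substantially.

```python
def steady_state_vm(g_e, g_i, g_leak=1.0, E_e=60.0, E_i=0.0, E_leak=0.0):
    """Steady-state membrane potential (mV relative to rest) of a passive
    compartment with excitatory, shunting-inhibitory, and leak conductances.
    E_i = E_leak = 0 models shunting (silent) inhibition."""
    return (g_leak * E_leak + g_e * E_e + g_i * E_i) / (g_leak + g_e + g_i)

v_exc_alone = steady_state_vm(g_e=2.0, g_i=0.0)   # 40 mV depolarization
v_vetoed    = steady_state_vm(g_e=2.0, g_i=3.0)   # shunted down to 20 mV
assert v_vetoed < v_exc_alone
```

With a hyperpolarizing E_i instead, the same g_i would drive the compartment negative and spread its effect beyond the local branch, which is why the scheme relies on shunting inhibition.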
To test whether shunting inhibition was required for the direction selectivity we observed, we decreased the GABA channel reversal potential from −60 mV to −90 mV in 5 mV steps. At the same time, we decreased the amplitude of gGABA accordingly, so that the model always responded to a preferred-direction normal-motion stimulus with six spikes and to a null-direction reverse-phi stimulus with three spikes. DI for both types of motion decreased when the GABA channel reversal potential was decreased. At −90 mV, the model responded with five spikes to null-direction normal-motion stimuli and with three spikes to preferred-direction reverse-phi stimuli. Thus, the direction selectivity was lost.

3.3 Receptive Field and Two-Bar Interaction Maps.

The direction-selective cell shows a slant in the space-time plot (see Figure 5A). A recent study of direction-selective cells in awake monkey V1 shows an excitatory region on the cell's preferred side and an inhibitory region on the cell's null side, consistent with our connection scheme (Livingstone, 1998). To compare the experimental data with our model, we incorporated a synaptic noise source (AMPA conductance only) at the soma to achieve a reasonably
Figure 5: Comparison of the space-time response mapping of a monkey V1 cell and the model cell. (A) Poststimulus time histograms (PSTHs) obtained from a layer 5/6 complex cell in primary visual cortex of the alert macaque monkey in response to flashed bars, presented in random order at a series of positions across the cell's receptive field (from Livingstone, 1998, Fig. 3A). This cell's preferred direction was from the bottom toward the top. Flash bar duration = 56 ms; interstimulus delay = 100 ms; 75 stimulus presentations. (B) PSTHs obtained from the model neuron in response to flashed bars at 20 spatial locations across its receptive field. The model's preferred direction is from the bottom toward the top. It shows a decrease in the response onset time and an increase in the response transiency, as does the V1 complex cell.
high background firing rate and then presented flashing bar stimuli at 20 different spatial locations across the receptive field (see Figure 5B). There was good agreement between the experimental data and the response of our eight-arm model. Both space-time plots showed a progressive shortening of the response onset time and a more transient response going from the cell's preferred side to its null side. In our model, the shortening of the response onset time and the increase in response transiency were due to the asymmetrical delayed inhibition and the basic properties of the integrate-and-fire neurons of the LGN layer. The response onset time was determined mainly by the excitatory input, since the inhibitory input was relatively small and delayed. Going from the preferred to the null side, the bar moves from the edge to the center of the receptive field of the first LGN cell that provides excitatory input, increasing the excitatory input amplitude and decreasing the time needed to charge the membrane to fire the first spike. The response transiency is determined primarily by how quickly inhibition can overcome excitation and shut off the response. Moving toward the null side, the excitatory input strength decreases while the inhibitory input strength increases, as does the response transiency. The slant of the excitatory and inhibitory regions in the space-time plot is related to the cell's velocity tuning (Livingstone, 1998;
McLean & Palmer, 1989; Reid, Soodak, & Shapley, 1991). Since we did not include any synaptic delay between the retina and V1, the model responds much earlier to the visual stimulus than actual striate cortex neurons do. The periodic firing is caused by the deterministic synaptic input. Random noise channels at the soma are responsible for the background firing and the jitter of spikes around the peak firing time. Rao and Sejnowski (2000) showed a similar space-time plot for their direction-selective model. In their network model, the decrease of response onset time was due to the increase of the excitatory synaptic input strength from the cell at the preferred side toward the model cell itself. This effect is expected for any direction-selective model based on asymmetrical inhibition. The experimental data also show reversed excitatory and inhibitory regions in the two-bar interaction maps of direction-selective cells in V1 and MT when opposite-contrast bars, instead of same-contrast bars, were presented (MT: see Figure 6A, from Livingstone, 2001, Fig. 1; V1: Livingstone,
personal communication, April 2002). To compare with the experimental data, we tested our model's response to two sequentially presented bars (see Figure 6B). The x-axis corresponds to the position of the first flashed bar and the y-axis to the position of the second flashed bar. The diagonal line corresponds to both bars being presented at the same spatial location. A nondirection-selective cell should have excitatory regions both above and below the diagonal line, while a direction-selective cell should have excitatory regions that are asymmetric with respect to the diagonal line. When the two bars are of the same contrast, the excitatory region of our model cell's response map is mostly above the diagonal, where the second presenting

Figure 6: Facing page. Two-bar interaction maps. (A) Interaction maps for two MT cells in the alert monkey (Fig. 1 in Livingstone et al., 2001). (B) The same interaction map for the eight-armed model neuron. Pairs of bars (8 arc minutes in width) were flashed sequentially for 13 ms each at different spatial locations. Spikes were counted for 100 ms from the start of the first bar flash. The x-axis is the first flashed bar's position, and the y-axis is the second bar's position. The diagonal line is where the two bars were presented at the same spatial location. From left to right, Panel 1: Both bars were white. Areas above the diagonal line, where the second bar position was more positive than the first bar position, were more active than the areas below the diagonal line. Panel 2: Both bars were black. Panel 3: The first bar was black and the second bar was white. Areas below the diagonal line were more active than the areas above the diagonal line. Panel 4: The first bar was white, and the second was black. Panel 5: The same-contrast conditions minus the inverted-contrast conditions (1 + 2 − 3 − 4). The model neuron preferred two same-contrast bars flashed sequentially in its preferred direction and two inverted-contrast bars flashed sequentially in its null direction.
Given the symmetrical inputs, the white-to-white and black-to-black model interactions are identical, as are the black-to-white and white-to-black ones. (C) Space-time two-bar interaction map of the model cell, following the technique pioneered by Emerson et al. (1987). The reference bar was presented at time 0 at four different locations across the receptive field. For each reference bar position, the probing bar was presented at different locations and times relative to the reference bar. Spikes were counted for 100 ms from the start of the first bar flash. The four maps were added together to give a position-invariant ds-dt map of the two-bar interactions. From left to right, Panel 1: Both bars were white. Areas along the diagonal line, where the two-bar presentation sequence matched the preferred direction and speed, were more active than areas orthogonal to the diagonal line, where the sequence matched the null direction. Panel 2: Both bars were black. Panel 3: The reference bar was black, and the probing bar was white. Panel 4: The reference bar was white, and the probing bar was black. Panel 5: The same-contrast conditions minus the inverted-contrast conditions. The model neuron preferred two same-contrast bars flashed sequentially in its preferred direction and two inverted-contrast bars flashed sequentially in its null direction.
bar's position is located more toward the null side of the first presented bar. However, when the two bars have opposite signs of contrast, the excitatory region is mostly below the diagonal, where the second presented bar's position is located more toward the preferred side of the first bar, signaling a reversal in the direction preference of the model. There is no difference between the white-white and black-black plots, given the symmetric ON and OFF inputs the model receives (the same is true for the black-white and white-black plots). The difference in the excitatory and inhibitory regions between the same- and inverted-contrast bar presentations is most clearly evident in the "same minus inverted" panel of Figure 6B: excitation is mostly above and inhibition mostly below the diagonal. Note that excitation was stronger at the model cell's preferred side (minus side of the spatial scale), while inhibition was stronger at the model cell's null side (plus side of the spatial scale). This was again due to the spatial asymmetry of excitation and inhibition. Although the model is direction selective under both presentations, direction selectivity was higher for same-contrast bar presentations than for different-contrast bar presentations. This translates into a weaker response to reverse-phi than to normal motion. The experimental data from MT cells in Figure 6A show the same overall trend but with a much larger receptive field and better overall position invariance across the receptive field. There were a few major differences between the empirically determined MT cell response maps (see Figure 6A) and our V1 model cell responses (see Figure 6B). Figure 6A shows a diagonal organization, while Figure 6B shows a more circular organization. Part of this difference can be explained by the receptive-field size difference between MT and V1 cells.
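The two-bar mapping procedure can be sketched with a toy stand-in for the cell (the response function and all numbers here are ours, purely illustrative): spikes are counted for each ordered pair of flash positions and entered into a matrix with the first bar's position on the x-axis and the second's on the y-axis.

```python
import numpy as np

positions = np.arange(-5, 6)   # flash positions across the receptive field

def toy_response(x1, x2, c1, c2):
    """Illustrative spike count: same-contrast sequences are facilitated in
    the preferred direction, opposite-contrast sequences in the reverse."""
    fwd = 0 < (x2 - x1) <= 2   # second bar stepped in the preferred direction
    bwd = 0 < (x1 - x2) <= 2   # second bar stepped in the null direction
    boosted = fwd if c1 == c2 else bwd   # contrast inversion flips preference
    return 1.0 + (1.0 if boosted else 0.0)

def interaction_map(c1, c2):
    m = np.zeros((len(positions), len(positions)))
    for i, x1 in enumerate(positions):        # x-axis: first bar position
        for j, x2 in enumerate(positions):    # y-axis: second bar position
            m[j, i] = toy_response(x1, x2, c1, c2)
    return m

same = interaction_map(+1, +1)       # white-white: active above the diagonal
inverted = interaction_map(+1, -1)   # white-black: active below the diagonal
diff = same - inverted               # "same minus inverted" nonlinear map
```

In `diff`, positive entries sit above the diagonal and negative ones below, reproducing the qualitative layout of the Panel 5 maps.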
The organizational difference can also arise from differences in the number and density of direction-selective subunits along the preferred direction and from the extent of the nonlinear boost at the final output stage. Some V1 direction-selective complex cells do show a circular interaction region, while other V1 complex cells reveal a more diagonal organization (M. Livingstone, personal communication, April 2002). In addition, the diagonal region in the model (see Figure 6B) is above baseline, whereas it appears to be below baseline in the data (see Figure 6A). This elevation is likely due to the reverse-correlation technique used in the experiment and to further nonlinear excitation mechanisms that are missing from our model. Despite these differences, the "same minus inverted" maps (the rightmost panels in Figures 6A and 6B), which demonstrate the nonlinear interactions, do show a substantial amount of similarity. Finally, the cross-shaped backdrop in Figure 6B does not appear in the experimental data. Again, this is due to the difference in receptive-field size between the recorded MT cell and our V1 model cell. If Figure 6A were evaluated within a larger spatial range (for example, within +/−10 degrees), the same cross-shaped background would appear (M. Livingstone, personal communication, April 2002). We also generated space-time two-bar interaction maps (see Figure 6C) to compare with the experimental data published by Emerson et al. (1987) on
complex cells in cat striate cortex. The x-axis corresponds to the time of the flashed probe bar relative to the time of presentation of the flashed reference bar. The y-axis corresponds to the position of the flashed probe bar relative to the position of the flashed reference bar (for more details, see Emerson et al., 1987). A direction-selective cell should have obliquely oriented excitatory regions (Figure 2 in Emerson et al., 1987). When the two bars are of the same contrast, the excitatory region of our model cell's response map is mostly found along the diagonal, where the second presented bar's position is located more toward the null side of the first presented bar. However, when the two bars have opposite contrast signs, the excitatory region mostly flanks the diagonal area, where the second presented bar's position is located more toward the preferred side of the first bar, signaling a reversal in the direction preference of the model. The difference in the excitatory and inhibitory regions between the same- and inverted-contrast bar presentations is most clearly evident in the "same minus inverted" panel of Figure 6C: excitation lies mostly along the diagonal line, while inhibition mostly flanks the diagonal area. Note that excitation is stronger at the model cell's preferred side (minus side of the spatial scale), while inhibition is stronger at the model cell's null side (plus side of the spatial scale). This is due to the spatial asymmetry of excitation and inhibition. The nonlinear facilitation observed in the experiments by Emerson and colleagues may derive from a specific excitatory directional interaction that is not addressed in our model, or from generic nondirectional facilitation (such as network feedback or active dendritic channels that are masked by directional suppression).

3.4 Layer 4 Stellate Cell Model.
Although all of our parameters are physiologically plausible and the model displayed direction selectivity over a broad range of parameters, we wanted to ensure that the effect we observed was not due to the model cell's cable structure. We therefore mapped our double synaptic-veto arrangement onto an anatomically correct layer 4 stellate cell model (Mainen & Sejnowski, 1996; see Figure 7A). All active and passive parameters of the original model cell remained unchanged. The new model's response to different types of motion is illustrated in Figure 7B. The stellate cell showed direction selectivity for normal motion and the opposite selectivity for reverse-phi motion, with DI = 1 in both cases. We also tested our connection scheme on the layer 5 pyramidal cell model from Mainen and Sejnowski (1996). The pyramidal cell model showed the same directional preference when we mapped our synaptic triplets onto basal dendrites alone or onto basal and apical dendrites together (data not shown).

Figure 7: Mapping the double synaptic-veto mechanism onto a layer 4 stellate cell model caused it to respond differentially to normal as well as reverse-phi motion. (A) Input synapse locations on a layer 4 stellate cell. (Cell model from Mainen & Sejnowski, 1996.) (B) Stellate cell model's response to a bright bar moving at 10 degrees per second across its receptive field. The model responds best when a normal motion stimulus moves in its preferred direction and a reverse-phi motion stimulus moves in its null direction, as do many V1 cells in the macaque monkey (Livingstone et al., 2000).

4 Discussion

We demonstrate here that our double synaptic-veto mechanism can account for the reversal of direction selectivity in reverse-phi motion. Our biophysical simulations are obviously a mere proof of concept that such a scheme might be implemented in a plausible manner by cortical cells. Given the large number of degrees of freedom of any detailed biophysical simulation and the few constraints, beyond order-of-magnitude estimates of the relevant parameters, little else is possible at this time. However, the fact that different cell models with distinct dendritic morphologies and voltage-dependent currents can be driven by the same synaptic arrangement to replicate the experimental data in terms of direction selectivity shows that our double synaptic-veto mechanism is not implausible from a physiological point of view. This scheme makes several predictions that can be evaluated using extracellular recordings. First, the nondirection-selective zone for reverse-phi motion should reside at the cell's null side instead of the preferred side. Because of our inhibitory synaptic connection scheme, the nondiscriminating zone of direction-selective cells demonstrated experimentally in retinal and cortical neurons (Livingstone, 1998; He, Jin, & Masland, 1999) should also reverse its location for reverse-phi stimuli. Second, the response amplitude to normal motion in the preferred direction should be larger than to reverse-phi motion in the null direction (see Figures 4C and 7B). The inputs from the LGN layer to the model cell are weaker and spread into both ON and OFF channels for reverse-phi stimuli. Third, DI for reverse-phi motion should be more sensitive to parameter tuning than for normal motion, especially to the inhibitory input delay. Our input time-course analysis (see Figures 2 and 3) suggests that the DI of reverse-phi motion is quite sensitive to the reversal rate: the higher the reversal rate, the weaker the input signals feeding into the direction-selective cell.
This last property might show up in appropriate psychophysical studies as an increase in the motion detection threshold (an increase of the percentage of motion coherence required, or a decrease of Dmax). In this study, we assume that direction selectivity is generated at the single-cell level in a feedforward manner, without the aid of local feedback circuits. There are extensive feedback interactions among V1 cells, and these feedback currents are likely to be important for sharpening directional tuning (Douglas, Koch, Mahowald, Martin, & Suarez, 1995; Maex & Orban, 1996). As stated above, our models respond in an appropriate direction-selective manner to both types of motion over a broad parameter range, although DI was not always high. When DI was low, the response onset time for preferred-direction motion was shorter than for null-direction motion. A network could use this difference in response onset time to increase DI if we assume that neurons with the same direction preference have more excitatory feedback connections among themselves. In fact, this difference alone is enough to generate direction selectivity through network interactions (Maex & Orban, 1996; Suarez et al., 1995). Excitatory feedback might also help to produce balanced response amplitudes for normal and reverse-phi motion. In psychophysics experiments, human subjects have the same detection threshold for normal and reverse-phi motion
(Sato, 1989). Monkey V1 cells' response amplitudes to normal and reverse-phi motion are also comparable (M. Livingstone, personal communication, April 2002). In our simulation, although DI = 1 for normal motion and DI = −1 for reverse-phi motion, our model responds with many more spikes to normal motion because of the low-pass nature and half-wave rectification of our vision system. This difference could be decreased by network interactions if the feedforward input triggers the cortical response and sets the directional bias, while the network itself determines the amplitude of the response. Of course, the use of shunting inhibition to compute normal and reverse-phi direction selectivity does not rule out additional biophysical mechanisms that sharpen this selectivity (Mel, 1993; Archie & Mel, 2000; Mel, Ruderman, & Archie, 1998), such as facilitation in the preferred direction (Emerson et al., 1987). We use the terms excitation and inhibition rather than facilitation or suppression in this article. Facilitation and suppression usually refer to the nonlinear part of a cell's response. We did not isolate linear responses from nonlinear responses in our analysis, although the "same minus inverted" maps (see Figures 6B and 6C) are plots of the nonlinear interaction and thus show facilitation and suppression (Emerson et al., 1987; Livingstone, Pack, & Born, 2001). The nonlinear suppression in the null direction comes from shunting inhibition. There is no significant nonlinear facilitation in our model other than that of the NMDA synaptic inputs, which might provide positive nonlinear interaction in the real neuron. Because the excitatory input for reverse-phi motion spreads between the ON and OFF channels, such facilitation could further increase the difference in response amplitudes.
Excitatory feedback, missing in our feedforward model, might underlie the facilitation in the preferred direction and might fill in the gap in response amplitude between normal and reverse-phi motion. In the traditional inhibition-based direction-selective scheme, shunting inhibition is not strictly required if the neuron does not possess subunit structures. However, shunting inhibition is required for our double synaptic-veto mechanism in order to restrict the interaction to local branches. As stated in section 3, direction selectivity for both normal and reverse-phi motion decreases to almost zero when the inhibitory channel reversal potential is decreased from −60 mV to −90 mV. Our double synaptic-veto scheme requires branch-specific computations that cannot be achieved with nonshunting inhibition. Archie and Mel (2000) demonstrated disparity tuning and reverse-phi-like effects without shunting inhibition in a similar modeling study. The nonlinear interaction underlying disparity tuning in their model is clustering (a nonlinear excitatory interaction). The basic nonlinear directional interaction in our model is between excitation and shunting inhibition; shunting inhibition is required for the on-path veto and direction selectivity. We used only one type of GABA synapse, with a long off-ramp (80 ms), in our simulations. This long off-ramp is necessary only to shut off the long-lasting NMDA currents. If we set the NMDA conductance to zero and
the GABA synapse off-ramp to 2 ms, the model is still direction selective. In real neurons, fast and slow inhibition coexist. The key to our double synaptic gating mechanism is that the excitatory inputs are half-wave rectified and carry separate ON or OFF signals, while the inhibitory input carries both ON and OFF signals (in a spatially segregated manner). In our model, inhibition originates (via interneurons) from LGN ON and OFF center cells. The inhibition could, in principle, also be supplied by a cortical simple cell with spatially offset ON and OFF regions within its receptive field. The separation of ON and OFF channels occurs at the very first mammalian visual processing stage, in the retina (Rodieck, 1998). Within the mammalian visual system, it is not until V1 that these two channels combine their information. Since the detection of reverse-phi requires interaction between the ON and OFF channels, and the simple cell might be the first place where this interaction occurs, Livingstone et al. (2001) proposed that direction selectivity in the monkey might arise from interactions between two simple cells that carry the full-wave signal. In the visual system of the fly, the reverse-phi effect observed in higher-order visual cells (Egelhaaf & Borst, 1992) has been used to support the argument that there is no ON-OFF channel separation. Although ON-OFF interaction is necessary to account for reverse-phi motion, our results suggest that only the inhibition needs to carry the full-wave signal. The excitation to the direction-selective cell can originate from a direct geniculate projection. The synaptic triplet arrangement proposed here for our double synaptic-veto scheme requires rather sophisticated synapse placement during development. The two inhibitory ON and OFF inputs need to synapse close to their associated excitatory input, or between this input and the cell body, in order to veto excitation effectively (Koch et al., 1982).
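The input-signal convention just described, half-wave rectified excitation carrying separate ON or OFF signals while inhibition draws on both channels, can be sketched as follows (a schematic of the convention, not the model's LGN filter stage):

```python
def split_on_off(contrast):
    """Half-wave rectify a signed contrast signal into ON and OFF channels."""
    on  = max(contrast, 0.0)    # responds to luminance increments
    off = max(-contrast, 0.0)   # responds to luminance decrements
    return on, off

# A white bar drives only the ON channel; a black bar only the OFF channel.
assert split_on_off(+0.8) == (0.8, 0.0)
assert split_on_off(-0.8) == (0.0, 0.8)

def full_wave(contrast):
    """Inhibition samples both channels (full-wave), here simply as a sum."""
    on, off = split_on_off(contrast)
    return on + off

assert full_wave(-0.8) == 0.8   # responds regardless of contrast sign
```

In the actual circuit the two inhibitory signals remain spatially segregated rather than summed; the sum here only illustrates that inhibition, unlike excitation, responds to both contrast polarities.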
This is in contrast to traditional shunting inhibition schemes that require only the pairing of one inhibitory process with each excitatory one. It is possible that activity-dependent, temporally asymmetric Hebbian learning rules (Markram, Lubke, Frotscher, & Sakmann, 1997) might be used to establish such specific connection schemes, as shown for normal direction selectivity by Rao and Sejnowski (2000). It remains a major challenge to understand how direction-selective neurons with subunit structures could be established in an unsupervised manner on the basis of such learning rules. Reverse-phi motion stimuli do not appear to be a common feature of natural spatiotemporal scenes. It therefore remains unclear why cortical cells should invert their direction selectivity for reverse-phi motion. We believe that the synaptic circuitry responsible for detecting reverse-phi motion has to be established as a by-product of developing normal direction selectivity rather than through a stand-alone training process. If the inputs to direction-selective cortical neurons have a significant background firing rate, then OFF inhibition at the preferred side of ON excitation is released when a white bar moves in the preferred direction and thus helps to increase direction preference. If the inhibition comes from simple cells, it naturally
C. Mo and C. Koch
contains spatially offset ON and OFF signals. The challenge is to align the ON and OFF regions properly with excitatory inputs. We are investigating learning rules that could account for our double synaptic-veto mechanism. The "prediction and sequence learning" mechanism proposed in Montague and Sejnowski (1994) and Montague, Dayan, Person, and Sejnowski (1995) might play an important role in the establishment of the reverse-phi motion-detection circuit.

Acknowledgments

This project originated in a joint discussion with Tomaso Poggio and Margaret Livingstone. We thank both for their help and criticism. We also thank Kevin Archie, Bartlett Mel, and Terry Sejnowski for providing us with cell models, help, and criticism. This research was supported by grants from the NSF-sponsored Engineering Research Center at Caltech, the National Institutes of Health, and the National Institute of Mental Health.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2, 284–299.
Anderson, J. C., Binzegger, T., Kahana, O., Martin, K. A. C., & Segev, I. (1999). Dendritic asymmetry cannot account for directional responses of neurons in visual cortex. Nature Neuroscience, 2, 820–824.
Anderson, J. S., Carandini, M., & Ferster, D. (2000). Orientation tuning of input conductance, excitation, and inhibition in cat primary visual cortex. J. Neurophysiology, 84, 909–926.
Anstis, S. M. (1970). Phi movement as a subtractive process. Vision Research, 10, 1411–1430.
Anstis, S. M., & Rogers, B. J. (1975). Illusory reversal of visual depth and movement during changes of contrast. Vision Research, 15, 957–961.
Anstis, S. M., & Rogers, B. J. (1986). Illusory continuous motion from oscillating positive-negative patterns: Implications for motion perception. Perception, 15, 627–640.
Archie, K. A., & Mel, B. W. (2000). A model for intradendritic computation of binocular disparity. Nature Neuroscience, 3, 54–63.
Barlow, H.
B., & Levick, W. R. (1965). The mechanism of directional selectivity in the rabbit's retina. J. Physiology, 173, 477–504.
Borg-Graham, L. J. (2001). The computation of directional selectivity in the retina occurs presynaptic to the ganglion cell. Nature Neuroscience, 4, 176–183.
Borg-Graham, L. J., Monier, C., & Frégnac, Y. (1998). Visual input evokes transient and strong shunting inhibition in visual cortical neurons. Nature, 393, 369–373.
Chubb, C., & Sperling, G. (1989). Two motion perception mechanisms revealed through distance-driven reversal of apparent motion. Proc. Natl. Acad. Sci. USA, 86, 2985–2989.
Conway, B. R., & Livingstone, M. S. (2001). Space-time maps and two-bar interactions of direction-selective cells in macaque V-1. Manuscript submitted for publication.
Douglas, R. J., Koch, C., Mahowald, M., Martin, K. A. C., & Suarez, H. H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Egelhaaf, M., & Borst, A. (1992). Are there separate ON and OFF channels in fly motion vision? Visual Neuroscience, 8, 151–164.
Emerson, R. C., Citron, M. C., Vaughn, W. J., & Klein, S. A. (1987). Nonlinear directionally selective subunits in complex cells of cat striate cortex. J. Neurophysiology, 58, 33–65.
Gabbiani, F., Mo, C., & Laurent, G. (2001). Invariance of angular threshold computation in a wide-field looming-sensitive neuron. J. Neuroscience, 21, 314–329.
Hatsopoulos, N., Gabbiani, F., & Laurent, G. (1995). Elementary computation of object approach by a wide-field visual neuron. Science, 270, 1000–1003.
He, S., Jin, Z. F., & Masland, R. H. (1999). The nondiscriminating zone of directionally selective retinal ganglion cells: Comparison with dendritic structure and implications for mechanism. J. Neuroscience, 19, 8049–8055.
Hines, M. L., & Carnevale, N. T. (1997). The NEURON simulation environment. Neural Computation, 9, 1179–1209.
Horton, J. C., & Sherk, H. (1983). Receptive field properties in the cat's lateral geniculate nucleus in the absence of ON-center retinal input. J. Neuroscience, 4, 374–380.
Ibbotson, M. R., & Clifford, C. W. G. (2001). Interactions between ON and OFF signals in directional motion detectors feeding the NOT of the wallaby. J. Neurophysiol., 86, 997–1005.
Knapp, A. G., & Mistler, L. A. (1983). Response properties of cells in rabbit's lateral geniculate nucleus during reversible blockade of retinal ON-center channel. J. Neurophysiology, 50, 1236–1245.
Koch, C., & Poggio, T. (1985). The synaptic veto mechanism: Does it underlie direction and orientation selectivity in the visual cortex? In D. Rose & V.
Dobson (Eds.), Models of the visual cortex (pp. 408–419). New York: Wiley.
Koch, C., & Poggio, T. (1987). Biophysics of computation: Neurons, synapses, and membranes. In G. M. Edelman, W. E. Gall, & W. M. Cowan (Eds.), Synaptic function (pp. 637–697). New York: Wiley.
Koch, C., Poggio, T., & Torre, V. (1982). Retinal ganglion cells: A functional interpretation of dendritic morphology. Philosophical Transactions of the Royal Society B, 298, 227–264.
Livingstone, M. S. (1998). Mechanisms of direction selectivity in macaque V1. Neuron, 20, 509–526.
Livingstone, M. S., Pack, C. C., & Born, R. T. (2001). 2-D substructure of MT receptive-fields. Neuron, 30, 781–793.
Livingstone, M. S., Tsao, D. Y., & Conway, B. R. (2000). What happens if it changes contrast when it moves? Society for Neuroscience Abstracts, 26, 162.6.
Lu, Z. L., & Sperling, G. (1999). Second-order reversed phi. Perception and Psychophysics, 61, 1075–1088.
Maex, R., & Orban, G. A. (1996). Model circuit of spiking neurons generating directional selectivity in simple cells. J. Neurophysiol., 75, 1515–1545.
Mainen, Z. F., & Sejnowski, T. J. (1996). Influence of dendritic structure on firing pattern in model neocortical neurons. Nature, 382, 363–366.
Markram, H., Lubke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Mather, G., & Murdoch, L. (1999). Second-order processing of four-stroke apparent motion. Vision Research, 39, 1795–1802.
McLean, J., & Palmer, L. A. (1989). Contribution of linear spatiotemporal receptive field structure to velocity selectivity of simple cells in area 17 of cat. Vision Res., 29, 675–679.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiol., 70, 1086–1101.
Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998). Translation-invariant orientation tuning in visual "complex" cells could derive from intradendritic computations. J. Neuroscience, 18, 4325–4334.
Montague, P. R., Dayan, P., Person, C., & Sejnowski, T. J. (1995). Bee foraging in uncertain environments using predictive Hebbian learning. Nature, 377, 725–728.
Montague, P. R., & Sejnowski, T. J. (1994). The predictive brain: Temporal coincidence and temporal order in synaptic learning mechanisms. Learning and Memory, 1, 1–33.
Poggio, T. (1982). Visual algorithms (AI Memo No. 683). Cambridge, MA: MIT Artificial Intelligence Laboratory.
Poggio, T., & Torre, V. (1978). A new approach to synaptic interactions. In R. Heim & G. Palm (Eds.), Theoretical approaches to complex systems (pp. 89–115). Berlin: Springer-Verlag.
Rao, R. P. N., & Sejnowski, T. J. (2000). Predictive sequence learning in recurrent neocortical circuits. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press.
Reichardt, W. (1961).
Autocorrelation: A principle for the evaluation of sensory information by the nervous system. In W. A. Rosenblith (Ed.), Sensory communication. New York: Wiley.
Reid, R. C., Soodak, R. E., & Shapley, R. M. (1991). Directional selectivity and spatio-temporal structure of receptive fields of simple cells in cat striate cortex. J. Neurophysiol., 66, 505–529.
Rodieck, R. W. (1998). The first steps in seeing. Sunderland, MA: Sinauer Associates.
van Santen, J. P. H., & Sperling, G. (1985). Elaborated Reichardt detectors. J. Opt. Soc. Am. A, 2, 300–320.
Sato, T. (1989). Reversed apparent motion with random dot patterns. Vision Res., 29, 1749–1758.
Schiller, P. H. (1982). Central connections of the retinal ON and OFF pathways. Nature, 297, 580–583.
Schiller, P. H., Sandell, J. H., & Maunsell, J. H. R. (1986). Functions of the ON and OFF channels of the visual system. Nature, 322, 824–825.
Suarez, H., Koch, C., & Douglas, R. (1995). Modeling direction selectivity of simple cells in striate visual cortex within the framework of the canonical microcircuit. J. Neuroscience, 15, 6700–6719.
Taylor, W. R., He, S., Levick, W. R., & Vaney, D. I. (2000). Dendritic computation of direction selectivity by retinal ganglion cells. Science, 289, 2347–2350.
Torre, V., & Poggio, T. (1978). A synaptic mechanism possibly underlying directional selectivity to motion. Proc. Roy. Soc. Lond. B, 202, 409–416.
Wörgötter, F., & Koch, C. (1991). A detailed model of the primary visual pathway in the cat: Comparison of afferent excitatory and intracortical inhibitory connection schemes for orientation selectivity. J. Neuroscience, 11, 1959–1979.

Received April 3, 2002; accepted October 7, 2002.
LETTER
Communicated by Steven Nowlan
Visual Development and the Acquisition of Motion Velocity Sensitivities

Robert A. Jacobs
[email protected] Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, U.S.A.
Melissa Dominguez
[email protected] Department of Computer Science, University of Rochester, Rochester, NY 14627, U.S.A.
We consider the hypothesis that systems learning aspects of visual perception may benefit from the use of suitably designed developmental progressions during training. Four models were trained to estimate motion velocities in sequences of visual images. Three of the models were developmental models in the sense that the nature of their visual input changed during the course of training. These models received a relatively impoverished visual input early in training, and the quality of this input improved as training progressed. One model used a coarse-to-multiscale developmental progression (it received coarse-scale motion features early in training, and finer-scale features were added to its input as training progressed), another model used a fine-to-multiscale progression, and the third model used a random progression. The final model was nondevelopmental in the sense that the nature of its input remained the same throughout the training period. The simulation results show that the coarse-to-multiscale model performed best. Hypotheses are offered to account for this model's superior performance, and simulation results evaluating these hypotheses are reported. We conclude that suitably designed developmental sequences can be useful to systems learning to estimate motion velocities. The idea that visual development can aid visual learning is a viable hypothesis in need of further study.

Neural Computation 15, 761–781 (2003)

1 Introduction

With relatively few exceptions, relationships between development and learning have largely been ignored by the neural computation community. This is surprising because development may be nature's way of biasing biological learning systems so that they achieve better performance, and it may represent an effective means for engineers to bias machine learning systems. It is well known in the machine learning literature that learning
© 2003 Massachusetts Institute of Technology
R. Jacobs and M. Dominguez
systems are inherently faced with the bias-variance dilemma (Geman, Bienenstock, & Doursat, 1995). Systems with little or no bias tend to interpolate in unpredictable ways and thus have highly variable generalization performance. Systems with larger bias, in contrast, tend to show better generalization performance when exposed to those training sets that they can adequately learn. We speculate that development may be an effective means of adding suitable bias to a system, thereby enhancing the generalization performance of that system. In previous work, we studied the effects of different types of developmental sequences on the performances of systems trained to estimate the binocular disparities present in pairs of visual images (Dominguez & Jacobs, in press-a, in press-b). The systems consisted of three components. The first component was a pair of right-eye and left-eye images. For example, the images may have depicted a light or dark object against a gray background. The second component was a set of binocular energy filters, which are widely used to model the binocular sensitivities of simple and complex cells in primary visual cortex of primates (Ohzawa, DeAngelis, & Freeman, 1990). Based on local patches of the right-eye and left-eye images, each filter acted as a disparity feature detector at a coarse, medium, or fine scale depending on whether the filter was tuned to a low, medium, or high spatial frequency, respectively. The third component was an artificial neural network. The outputs of the binocular energy filters were the inputs to this network. The network was trained to estimate the disparity of the object, which was defined as the amount that the object was shifted between the right-eye and left-eye images. A nondevelopmental system was compared to three developmental systems. The network of the nondevelopmental system received the outputs of all binocular energy filters throughout the entire training period.
The networks of the developmental systems, in contrast, were trained in three stages. The network of the coarse-to-multiscale system received the outputs of binocular energy filters tuned to a low spatial frequency during the first training stage. It received the outputs of filters tuned to low and medium spatial frequencies during the second training stage, and it received the outputs of all filters during the third training stage. The network of the fine-to-multiscale system was trained in an analogous way, though its filters were added in the opposite order: it received the outputs of filters tuned to a high frequency during the first training stage, and the outputs of lower-frequency filters were added during subsequent stages. The network of the random developmental model was also trained in stages, though its inputs were chosen at random at each stage and thus were not organized by spatial frequency content. The results show that the coarse-to-multiscale and fine-to-multiscale systems consistently outperformed the nondevelopmental and random developmental systems. The fact that they outperformed the nondevelopmental system is important because this demonstrates that models that undergo
Visual Development and Motion Velocity Sensitivities
a developmental maturation can acquire a more advanced perceptual ability than one that does not. The fact that they outperformed the random developmental system is important because this demonstrates that not all developmental sequences can be expected to provide performance benefits. To the contrary, only sequences whose characteristics are matched to the task should lead to superior performance. In conjunction with other results (not described here), these findings suggest that the systems most successful at learning to detect binocular disparities are those that are exposed to visual inputs at a single scale early in training and for which the resolution of their inputs progresses in an orderly fashion from one scale to a neighboring scale during the course of training. At a more general level, these results suggest that the idea that visual development aids visual learning is a viable hypothesis in need of further study. This article studies this hypothesis in the context of visual motion velocity estimation. There are enough similarities between the tasks of binocular disparity estimation and motion velocity estimation that it is reasonable to believe that a developmental approach that is useful for the former task may also be advantageous for the latter task. However, the differences between the two tasks mean that this belief needs to be checked. Indeed, our simulations show that the two tasks do not yield the same pattern of results. Although a developmental approach to the velocity estimation task is shown to be beneficial, it is not the case that all developmental progressions that lead to performance advantages on the disparity estimation task also lead to advantages on the velocity estimation task. In particular, a coarse-to-multiscale developmental system outperformed nondevelopmental and random developmental systems on the velocity estimation task, but a fine-to-multiscale system did not.
We hypothesize that the performance advantage of the coarse-to-multiscale system relative to the fine-to-multiscale system is due to the fact that the coarse-to-multiscale system learned to make greater use of motion energy filters tuned to a low spatial frequency. Analyses suggest that coarse-scale motion features are more informative for the velocity estimation task than fine-scale features.

2 Developmental and Nondevelopmental Systems

The structure of the developmental and nondevelopmental systems is illustrated in Figure 1. The input to each system was a sequence of 88 retinal images, where each image was a one-dimensional array 40 pixels in length. This sequence depicted an object moving at a constant velocity in front of a stationary background. The retinal array was treated as if it were shaped like a circle in the sense that the left-most and right-most pixels were regarded as neighbors. This wraparound of the left and right edges was done to avoid edge artifacts. Although a one-dimensional retina is a simplification, its use is justified by the need to keep the simulation times within reason. The sequence of retinal images was filtered using motion energy filters.
Figure 1: The developmental and nondevelopmental systems shared a common structure. The input to the systems was a sequence of retinal images, filtered by motion energy filters. The set of filters can be partitioned into subsets tuned to low, medium, and high spatial frequencies. The outputs of the filters were the inputs to an artificial neural network that was trained to estimate the object velocity depicted in the image sequence.
Based on neurophysiological results, Adelson and Bergen (1985) proposed motion energy filters as a way of modeling the motion sensitivities of simple and complex cells in primary visual cortex. A sequence of one-dimensional images can be represented using a two-dimensional array where one dimension encodes space and the other dimension encodes time. In this case, motion energy filters are two-dimensional filters that extract motion information in local patches of the spatiotemporal space. The receptive field profile of a simple cell can be described mathematically as a Gabor function, which is a sinusoid multiplied by a gaussian envelope. A quadrature pair of such functions with even and odd phases, tuned to leftward (−) and rightward (+) directions of motion, is given by

$$g_e^{\pm} = \frac{1}{2\pi\sigma_x\sigma_t}\,\exp\!\left(-\frac{x^2}{2\sigma_x^2}-\frac{t^2}{2\sigma_t^2}\right)\cos(\omega_x x \pm \omega_t t), \tag{2.1}$$

$$g_o^{\pm} = \frac{1}{2\pi\sigma_x\sigma_t}\,\exp\!\left(-\frac{x^2}{2\sigma_x^2}-\frac{t^2}{2\sigma_t^2}\right)\sin(\omega_x x \pm \omega_t t), \tag{2.2}$$
where $x$ and $t$ are the spatial and temporal distances to the center of the gaussian, $\sigma_x^2$ and $\sigma_t^2$ are the spatial and temporal variances of the gaussian, and $\omega_x$ and $\omega_t$ are the spatial and temporal frequencies of the sinusoids. The ratio $\omega_t/\omega_x$ determines the orientation of a Gabor function in the spatiotemporal space, which in turn determines the velocity sensitivity of the function. The activity of a simple cell is given by the square of the convolution of the cell's receptive field profile with the spatiotemporal pattern, denoted $f$. For example, $(f \ast g_e^{+})^2$ is the activity of a simple cell with even phase that is sensitive to rightward motion. The activities of simple cells with even and odd phases are summed in order to form the activity of a complex cell. This sum is known as a motion energy. Thus, leftward and rightward motion energies are given by

$$E^{-} = (f \ast g_e^{-})^2 + (f \ast g_o^{-})^2, \tag{2.3}$$

$$E^{+} = (f \ast g_e^{+})^2 + (f \ast g_o^{+})^2. \tag{2.4}$$
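The quadrature-pair motion energy computation of equations 2.1 through 2.4 can be sketched numerically. This is our own illustration, not the authors' code: frequencies here are in radians per sample, and the filter extent, parameter values, and function names are assumptions.

```python
import numpy as np

def gabor_pair(sigma_x, sigma_t, omega_x, omega_t, direction=+1, half=10):
    """Even- and odd-phase spatiotemporal Gabor filters (equations 2.1-2.2).
    direction=+1 selects one sign of omega_t, i.e., one direction of motion."""
    x = np.arange(-half, half + 1)[:, None]          # space offsets
    t = np.arange(-half, half + 1)[None, :]          # time offsets
    env = np.exp(-x**2 / (2 * sigma_x**2) - t**2 / (2 * sigma_t**2))
    env /= 2 * np.pi * sigma_x * sigma_t
    phase = omega_x * x + direction * omega_t * t
    return env * np.cos(phase), env * np.sin(phase)  # g_e, g_o

def motion_energy(f, g_e, g_o):
    """Complex-cell motion energy (equations 2.3-2.4): squared even- and
    odd-phase simple-cell responses, summed. The convolution is evaluated
    at a single receptive-field location."""
    return np.sum(f * g_e) ** 2 + np.sum(f * g_o) ** 2
```

As a sanity check, a drifting grating matched to one direction yields far more energy in that direction's filter pair than in the opposite one, and shifting the grating's phase leaves the energy nearly unchanged, which is the phase insensitivity discussed next.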
Because a complex cell's activity is based on the combined properties of simple cells with even and odd phases, this activity is phase insensitive, meaning that this value is relatively insensitive to the exact position of a motion within the complex cell's receptive field. In our simulations, we used a subset of the possible receptive-field locations in the two-dimensional (40 pixels × 88 time frames) spatiotemporal space. This subset formed a 20 × 4 uniform grid located in the middle of the space such that receptive fields were centered on odd-numbered pixels and odd-numbered time frames. An advantage of this choice of locations was that edge artifacts were avoided because all receptive fields fell entirely within the spatiotemporal space. Fifteen complex cells corresponding to three spatial frequencies and five temporal frequencies were centered at each receptive-field location. All cells were tuned to rightward motion because we restricted our data sets to include only objects that were moving to the right. The parameter values of these cells are shown in Table 1. The spatial and temporal frequencies were each separated by an octave. Importantly, temporal frequencies were chosen so that the set of cells at each spatial frequency had the same pattern of velocity tunings. Specifically, the sets tuned to low, medium, and high spatial frequencies had velocity tunings of 0.25, 0.5, 1.0, 2.0, and 4.0 pixels per time frame. This was achieved by correlating the spatial and temporal frequency tunings of the cells: low-spatial-frequency cells were tuned to a comparatively low range of temporal frequencies, and high-spatial-frequency cells were tuned to a high range of temporal frequencies.¹

¹ Recent neuroscientific studies indicate that the preferred velocity of velocity-tuned neurons in area MT of primates is largely independent of spatial frequency (Perrone & Thiele, 2001). Simoncelli and Heeger (2001) speculated that a velocity-tuned MT neuron pools the outputs of a set of V1 cells whose spatial and temporal frequency tunings are positively correlated.

Table 1: Parameter Values for the Complex Cells at Each Receptive-Field Location.

Spatial Frequency (cycles per pixel) | Temporal Frequency (cycles per time frame) | Velocity (pixels per time frame)
0.0625 | 0.015625 | 0.25
0.0625 | 0.03125  | 0.5
0.0625 | 0.0625   | 1.0
0.0625 | 0.125    | 2.0
0.0625 | 0.25     | 4.0
0.125  | 0.03125  | 0.25
0.125  | 0.0625   | 0.5
0.125  | 0.125    | 1.0
0.125  | 0.25     | 2.0
0.125  | 0.5      | 4.0
0.25   | 0.0625   | 0.25
0.25   | 0.125    | 0.5
0.25   | 0.25     | 1.0
0.25   | 0.5      | 2.0
0.25   | 1.0      | 4.0

Spatial Frequency (cycles per pixel) | Spatial Standard Deviation (σx)
0.0625 | 4.0
0.125  | 2.0
0.25   | 1.0

Temporal Frequency (cycles per time frame) | Temporal Standard Deviation (σt)
0.015625 | 16.0
0.03125  | 8.0
0.0625   | 4.0
0.125    | 2.0
0.25     | 1.0
0.5      | 0.5
1.0      | 0.25

Note: The temporal frequencies were chosen so that the same velocity tunings existed at each spatial frequency.

A cell's spatial and
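Since velocity tuning is the ratio of temporal to spatial frequency, the velocity entries of Table 1 can be regenerated in a few lines. This is an illustrative check of our own, not part of the original simulations.

```python
# Octave-spaced temporal frequencies are chosen per spatial frequency so
# that every band covers the same velocities: v = temporal f / spatial f.
spatial_freqs = [0.0625, 0.125, 0.25]          # cycles per pixel

def band_velocities(sf):
    temporal_freqs = [0.25 * sf * 2**k for k in range(5)]  # five octaves
    return [tf / sf for tf in temporal_freqs]  # pixels per time frame

for sf in spatial_freqs:
    print(sf, band_velocities(sf))             # 0.25 through 4.0 for every band
```

Because the lowest temporal frequency in each band is pinned to 0.25 times the spatial frequency, all three bands share the velocity tunings 0.25 to 4.0 pixels per time frame, exactly as the table's note states.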
temporal standard deviations were set to be inversely proportional to its spatial and temporal frequencies, respectively. The outputs of the complex cells within each spatial frequency band were normalized using a softmax nonlinearity,

$$\hat{E}(\omega_{x_i}, \omega_{t_j}) = \frac{e^{E(\omega_{x_i}, \omega_{t_j})/\tau}}{\sum_k e^{E(\omega_{x_i}, \omega_{t_k})/\tau}}, \tag{2.5}$$
where $\omega_{x_i}$ and $\omega_{t_j}$ are the spatial and temporal frequencies to which a complex cell was tuned, $E(\omega_{x_i}, \omega_{t_j})$ was the initial output of the complex cell, $\hat{E}(\omega_{x_i}, \omega_{t_j})$ was the normalized output, $\tau$ is a scaling parameter (its value was set to 0.001), and $\omega_{t_k}$, $k = 1, \ldots, 5$, indexed the five temporal frequencies corresponding to spatial frequency $\omega_{x_i}$. As a result of this normalization, complex cells tended to respond to relative contrast in the image sequence rather than absolute contrast (Heeger, 1992; Nowlan & Sejnowski, 1994). The normalized outputs of the complex cells were the inputs to an artificial neural network. The network had 1200 input units (the complex cells had 80 receptive-field locations, and there were 15 cells at each location). The network's hidden layer contained 18 hidden units, which were organized into three groups of 6 units each. The connectivity of the hidden units was set so that each group had a limited receptive field and neighboring groups had overlapping receptive fields. A group of hidden units received inputs from 32 receptive-field locations at the complex cell level, and the receptive fields of neighboring groups overlapped by 8 receptive-field locations. The hidden units used a logistic activation function. The output layer consisted of a single linear unit; this unit's output was an estimate of the object velocity depicted in the sequence of retinal images. The weights of an artificial neural network were initialized to small, random values and were adjusted during the course of training to minimize a sum of squared error cost function using a conjugate gradient optimization procedure (Press, Teukolsky, Vetterling, & Flannery, 1992). Advantages of this procedure are that it tends to converge quickly and has no free parameters (e.g., no learning rate or momentum parameters).
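Equation 2.5 is a standard softmax with temperature $\tau$. A sketch of how it might be applied to the five temporal-frequency channels of one band follows; the helper is our own, and it subtracts the maximum before exponentiating for numerical stability, which leaves the result mathematically unchanged.

```python
import numpy as np

def normalize_band(E, tau=0.001):
    """Softmax normalization of equation 2.5, applied across the five
    temporal-frequency channels that share one spatial frequency.
    E: raw complex-cell outputs at one receptive-field location."""
    z = E / tau
    w = np.exp(z - z.max())     # subtracting the max avoids overflow
    return w / w.sum()
```

With $\tau = 0.001$ the normalization is nearly winner-take-all: the channel with the largest raw energy dominates the band's output, so the population signals which velocity tuning responded most strongly rather than the stimulus's absolute contrast.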
Weight sharing was implemented at the hidden unit level so that corresponding units within each group of hidden units had the same incoming and outgoing weight values, and so that a hidden unit had the same set of weight values from each receptive-field location at the complex unit level. This provided the network with a degree of translation invariance and dramatically decreased the number of modifiable weight values in the network. It therefore decreased the number of data items and the amount of time needed to train the network. Models were trained and tested using separate sets of training and test data items. Each set contained 250 randomly generated items. Training was terminated after 100 iterations through the training set. The results reported are based on the data items from the test set. Three developmental systems and one nondevelopmental system were simulated. The coarse-to-multiscale system (model C2M) was trained using a coarse-to-multiscale developmental sequence implemented as follows. The training period was divided into three stages. During the first stage, the neural network portion of the model received only the outputs of complex cells tuned to the low spatial frequency (the outputs of other complex cells were set to zero). During the second stage, the network received the outputs of complex cells tuned to low and medium spatial frequencies; it received
the outputs of all complex cells during the third stage. The training of the fine-to-multiscale system (model F2M) was identical to that of model C2M except that its training used a fine-to-multiscale developmental sequence. During the first stage of training, its network received the outputs of complex cells tuned to the high spatial frequency. This network received the outputs of complex cells tuned to high and medium spatial frequencies during the second stage and received the outputs of all complex cells during the third stage. The training of the random developmental system (model RD) also used a developmental sequence, though this sequence was generated randomly and thus was not based on the spatial frequency tunings of the complex cells. The collection of complex cells was randomly partitioned into three equal-sized subsets, with the constraint that each subset included one-third of the cells at each receptive-field location. During the first stage of training, the neural network portion of the model received the outputs of the complex cells only in the first subset. It received the outputs of the cells in the first and second subsets during the second stage of training and the outputs of all complex cells during the third stage. In contrast, the training period of the nondevelopmental system (model ND) was not divided into separate stages; its neural network received the outputs of all complex cells throughout the entire training period.

3 Data Sets and Simulation Results

The performances of the four models were evaluated on two data sets. In all cases, the images were gray scale with luminance values between 0 and 1, and motion velocities were rightward with magnitudes between 0 and 4 pixels per time frame. Fifteen simulations of each model on each data set were conducted. In the solid object data set, images consisted of a moving light or dark object in front of a stationary gray background.
The object's gray-scale values were randomly chosen to be in the range from 0.0 to 0.1 or from 0.9 to 1.0, whereas the gray-scale value of the background was always 0.5. The size of the object was randomly chosen to be an integer between 6 and 12 pixels, its initial location was a randomly chosen pixel on the retina, and its velocity was randomly chosen to be a real value between 0 and 4 pixels per time frame. Because the velocity was real-valued, the boundaries between an object's internal gray-scale values, or the boundaries between the ends of an object and the background, could fall at a real-valued location within a retinal pixel. In these (common) cases, the luminance value of a pixel was appropriately linearly interpolated. Given a sequence of images, the task of a model was to estimate the object's velocity. The top portion of Figure 2 gives an example of 10 frames of an image sequence from the solid object data set. The bar graph in Figure 3 illustrates the results. The horizontal axis of each graph gives the model, and the vertical axis gives the root mean squared
Figure 2: (Top) Ten frames of an image sequence from the solid object data set. (Bottom) Ten frames of an image sequence from the noisy object data set.
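A minimal sketch of how a solid-object sequence with a circular retina and sub-pixel boundaries might be generated. The function name, default parameters, and the supersampling scheme are our own; the article specifies linear interpolation of boundary pixels, which supersampling approximates.

```python
import numpy as np

def solid_object_sequence(n_frames=88, n_pixels=40, size=8, start=0.0,
                          velocity=1.5, fg=0.05, bg=0.5, oversample=10):
    """Object of luminance fg drifting rightward over a gray background.
    The retina is circular (left and right edges are neighbors); pixels
    straddling an object boundary take intermediate luminance values."""
    frames = np.full((n_frames, n_pixels), bg)
    fine = np.arange(n_pixels * oversample) / oversample  # sub-pixel grid
    for k in range(n_frames):
        left = (start + velocity * k) % n_pixels          # object's left edge
        inside = ((fine - left) % n_pixels) < size        # wraparound test
        frames[k] = np.where(inside, fg, bg).reshape(
            n_pixels, oversample).mean(axis=1)            # average sub-pixels
    return frames
```

Averaging the sub-pixel samples within each retinal pixel gives boundary pixels luminance values between the object's and the background's, and the modulo arithmetic implements the wraparound of the left and right edges.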
error (RMSE) on the data items from the test set at the end of training (the error bars give the standard error of the mean). The labels for the developmental models C2M, F2M, and RD include a number. Recall that the training of these models was divided into three training stages (or developmental stages). The number in the label gives the length of developmental stages 1 and 2 (the length of developmental stage 3 can be calculated using the fact that the entire training period lasted 100 iterations). For example, the label C2M-5 corresponds to a version of model C2M in which the rst stage was 5 iterations, the second stage was 5 iterations, and the third stage was 90 iterations. In regard to model RD, we simulated four versions of this model (RD-5, RD-10, RD-20, and RD-30). Only the version that performed best is included in the graph. Model C2M signicantly outperformed all other models. Its bestperforming version was C2M-20, which had an 11.5% smaller generaliza-
770
R. Jacobs and M. Dominguez
[Figure 3 bar chart: RMSE (vertical axis, 0.40–0.55) for models ND, RD-20, C2M-5 through C2M-30, and F2M-5 through F2M-30 on the solid object data set.]
Figure 3: The RMSE on the test set data items for model ND, the best-performing version of model RD, and different versions of models C2M and F2M after training on the solid object data set. The error bars give the standard error of the mean.
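The model comparisons in this section are two-tailed two-sample t-tests on per-run test-set RMSE (28 degrees of freedom is consistent with a pooled-variance test over 15 runs per model, which is our inference; the sample values below are invented). A minimal sketch of the pooled statistic:

```python
import math

def pooled_t(sample_a, sample_b):
    """Two-sample t statistic with pooled variance; df = n_a + n_b - 2."""
    na, nb = len(sample_a), len(sample_b)
    ma = sum(sample_a) / na
    mb = sum(sample_b) / nb
    # Unbiased sample variances of each group.
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    # Pooled variance assumes the two groups share a common variance.
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (ma - mb) / math.sqrt(sp2 * (1.0 / na + 1.0 / nb))
    return t, na + nb - 2
```

In practice one would compare the statistic against the t distribution with the returned degrees of freedom to obtain the reported p values.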
tion error than model ND (the difference between the mean performances of these models is statistically significant; t = 2.50, p < .02 using a two-tailed t-test with 28 degrees of freedom). In addition, C2M-20 had a 9.6% smaller error than the best version of model F2M (t = 3.57, p < .01) and a 7.2% smaller error than the best version of model RD (t = 2.30, p < .05). The images in the second data set, referred to as the noisy object data set, were meant to resemble random-dot kinematograms frequently used in behavioral experiments. Images contained a noisy object that was moving to the right and a noisy background that was stationary. The gray-scale values of the object pixels and the background pixels were set to random numbers between 0 and 1. The size of the object was randomly chosen to be an integer between 6 and 12 pixels, its initial location was a randomly chosen pixel on the retina, and its velocity was randomly chosen to be an integer between 0 and 4 pixels per time frame. As before, the task was to map an image sequence to an estimate of an object velocity. The bottom portion of Figure 2 gives an example of 10 frames of an image sequence from the noisy object data set. The results are shown in Figure 4. Model C2M once again outperformed the other models. Relative to model ND, all versions of model C2M showed
[Figure 4 bar chart: RMSE (vertical axis, 0.65–0.80) for models ND, RD-20, C2M-5 through C2M-30, and F2M-5 through F2M-30 on the noisy object data set.]
Figure 4: The RMSE on the test set data items for model ND, the best-performing version of model RD, and different versions of models C2M and F2M after training on the noisy object data set. The error bars give the standard error of the mean.
superior performance (ND vs. C2M-5: t = 2.69, p < .02; ND vs. C2M-10: t = 2.78, p < .01; ND vs. C2M-20: t = 3.03, p < .01; ND vs. C2M-30: t = 4.14, p < .001). The best-performing version of model C2M was C2M-30. On average, it had an 8.9% smaller generalization error than model ND (t = 4.14, p < .001), a 6.1% smaller error than the best version of model F2M (t = 2.95, p < .01), and a 4.3% smaller error than the best version of model RD (t = 2.10, p < .05). Taken as a whole, the simulation results are interesting for a number of reasons. Most important for our purposes, the fact that model C2M outperformed model ND demonstrates that a model that undergoes a developmental maturation can acquire a more advanced perceptual ability than one that does not. The fact that model F2M performed similarly to or worse than model ND, and worse than model C2M, is important because this demonstrates that not all developmental sequences provide performance benefits. It is tempting to hypothesize that only sequences whose characteristics are matched to the task should lead to superior performance. However, the finding that the best version of model RD outperformed model ND is inconsistent with this hypothesis because it is difficult to understand why a random developmental progression would be well matched to a ve-
[Figure 5 bar charts: RMSE (vertical axis, 0.0–2.0) for Random, Coarse, Fine, and C,M,F networks after 10 epochs of training on the (left) solid object data set and (right) noisy object data set.]
Figure 5: Performances of several networks after 10 epochs of training. Their RMSE on the test set data items are shown for the (left) solid object data set and (right) noisy object data set.
locity estimation task. Future work will need to examine this unexpected finding. To understand these results better, we conducted additional simulations. We compared the performances of several different types of neural networks on the solid object and noisy object data sets. These networks were not developmental systems; each received the same set of inputs throughout the entire training period. Some networks received only the outputs of the low-spatial-frequency motion energy filters (labeled “coarse” networks), only the outputs of medium-frequency filters (labeled “medium” networks), or only the outputs of high-frequency filters (labeled “fine” networks). Other networks received the outputs of multiple subsets of filters; for example, C,M networks received the outputs of low- and medium-frequency filters, and C,M,F networks received the outputs of all filters (identical to model ND). The inputs to networks labeled “random” were the outputs of a randomly selected one-third of the motion filters. Furthermore, we examined the performances of these networks at a relatively early point in their training periods (after 10 epochs) and also at the end of training (after 100 epochs). The results at a relatively early point in training are shown in Figure 5. The results at the end of training on the solid object and noisy object data sets are shown in Figures 6 and 7, respectively. For our purposes, the most important result to emerge from the data in these figures is that motion features at different scales are not equally informative for the task of velocity estimation. To the contrary, coarse motion features (i.e., the outputs of low-spatial-frequency motion energy fil-
[Figure 6 bar chart: RMSE (vertical axis, 0.0–2.0) for Random, Coarse, Medium, Fine, C,M, M,F, and C,M,F networks after 100 epochs of training on the solid object data set.]
Figure 6: The RMSE of several networks after 100 epochs of training on the solid object data set.
ters) are more informative than medium-scale motion features, which in turn are more informative than fine motion features. Evidence for this includes the facts that the coarse network outperformed the medium network, and the medium network outperformed the fine network. In addition, the C,M network performed better than the M,F network. These performance rankings were found using both solid object and noisy object data sets and were found at an early point in training and at the end of training. Based on these results, we can speculate as to why model C2M showed the best performance and why model F2M showed comparatively poor performance. Because the only inputs to model C2M at the start of training were the outputs of the coarse motion energy filters and because these outputs were the only set that it received at all stages of training, it is likely that this model made extensive use of these signals. The analyses suggest that the coarse motion signals are the most informative for the velocity estimation task. Model C2M’s extensive use of these highly informative signals presumably allowed it to achieve a high level of performance. In contrast, model F2M is likely to have made greater use of the outputs of fine motion energy filters because these outputs were its only inputs at the start of
[Figure 7 bar chart: RMSE (vertical axis, 0.0–2.0) for Random, Coarse, Medium, Fine, C,M, M,F, and C,M,F networks after 100 epochs of training on the noisy object data set.]
Figure 7: The RMSE of several networks after 100 epochs of training on the noisy object data set.
training and the only set that it received at all stages of training. The analyses suggest that the fine motion signals are the least informative for the velocity estimation task. Model F2M’s extensive use of these least informative signals presumably is responsible for this model’s comparatively poor performance. In earlier work, we found that the most successful systems at learning a binocular disparity estimation task were those that (1) received inputs at a single-frequency scale early in training and (2) for which the resolution of their inputs progressed in an orderly fashion from one scale to a neighboring scale during the course of training (Dominguez & Jacobs, in press-a, in press-b). Condition 1 allowed a system to combine and compare input features at an early training stage without the need to compensate for the fact that these features could be at different spatial scales. If condition 2 was satisfied, when a system received inputs at a new spatial scale, it was close to a scale with which the system was already familiar. To test the importance on the velocity estimation task for the resolution of a system’s inputs to progress in an orderly fashion from one scale to a neighboring scale, we compared the performances of five systems. Three of the systems—ND, C2M-20, and F2M-20—were described above; the other
[Figure 8 bar charts: RMSE (vertical axis) for models ND, C2M-20, F2M-20, C-CF-CMF-20, and F-CF-CMF-20 on the (left) solid object data set (0.40–0.55) and (right) noisy object data set (0.65–0.80).]
Figure 8: The RMSE of several systems on the test set data items for the (left) solid object data set and (right) noisy object data set.
two systems are new. The neural network of model C-CF-CMF-20 received coarse motion features during the first developmental stage; coarse and fine motion features during the second developmental stage; and coarse, medium, and fine features during the third stage. The neural network of model F-CF-CMF-20 received fine motion features in the first stage; coarse and fine features in the second stage; and coarse, medium, and fine features during the final developmental stage. Because the inputs to systems C-CF-CMF-20 and F-CF-CMF-20 did not proceed from one scale to a neighboring scale, we predicted that these systems would perform poorly. The results on the solid object and noisy object data sets are shown in Figure 8. On the solid object data set, C2M-20 outperformed C-CF-CMF-20, and F2M-20 outperformed F-CF-CMF-20. On the basis of these data, we conclude that it is important for a system’s inputs to proceed from one scale to a neighboring scale. On the noisy object data set, the results are different. The performances of C2M-20 and C-CF-CMF-20 are similar, and F2M-20 performed worse than F-CF-CMF-20. We do not believe that these results are necessarily inconsistent with the hypothesis that we are considering. Instead, the results may indicate that on this task, it is more important for a system to receive the outputs of the low-spatial-frequency motion filters as early in training as possible than it is for a system to receive inputs whose resolution changes in an orderly manner. Overall, we believe that our results imply that it is moderately, but not highly, important for a developmental system learning to estimate motion velocities to receive inputs whose reso-
lution progresses in an orderly fashion from one scale to a neighboring scale during training.2 We have presented analyses suggesting that coarser-scale motion features are more informative than finer-scale features on the velocity estimation task. Several factors may help explain this finding. As Weiss and Adelson (1998) discussed, motion signals tend to be less ambiguous when the stimulus is viewed for a long duration and more ambiguous when the stimulus is viewed for a short duration. Their reasoning applies to the activities of motion energy filters with receptive fields in the spatiotemporal domain. Coarse-scale filters tend to have larger receptive fields than those of fine-scale filters. Consequently, there is less ambiguity in the activities of coarse-scale filters relative to the activities of fine-scale filters. In addition, coarse-scale filters have large, overlapping receptive fields and thus form a high-resolution coarse code of the spatiotemporal space (Milner, 1974; Hinton, 1981; Ballard, 1986). This code could provide a network with accurate information as to the location of the moving object at each moment in time. For example, the activities of these filters may have coded with high accuracy the fact that the moving object was at location A at time t_A and location B at time t_B. If so, a network could have easily learned to estimate the object velocity accurately by calculating (B − A)/(t_B − t_A). In contrast, fine-scale filters have smaller, less overlapping receptive fields, which form a lower-resolution coarse code.3 An interesting prediction follows from this line of reasoning about the advantages of filters with large, overlapping receptive fields. In general, filters with larger receptive fields tend to be tuned to slow velocities, whereas
2 Dominguez and Jacobs (in press-a, in press-b) found that this was a highly important factor for a developmental system learning to estimate binocular disparities, whereas we find on a motion velocity estimation task that it is only a moderately important factor. Unfortunately, it is difficult to compare our current and prior findings in a detailed way, for a number of reasons. First, whereas the disparity estimation task involved two images (left-eye and right-eye images), the motion estimation task involved a sequence of 88 images. Second, differences between binocular energy filters and motion energy filters make the two sets of simulations difficult to compare. For instance, binocular energy filters make extensive use of the relative phases of filters, whereas motion energy filters do not. As a second example, the binocular energy filters that we used earlier were one-dimensional (we were interested only in horizontal disparities, not vertical disparities), whereas the motion energy filters used here were two-dimensional.
3 Analyses by Hinton (1981) and Ballard (1986) provide an understanding of the relationship between receptive-field size and resolution in the case of binary units that each become active when a stimulus falls within its receptive field. If D is the diameter of a unit’s receptive field, k is the dimensionality of the space to be represented, and N is the desired number of just noticeable differences in each dimension (i.e., the desired resolution), then the required number of units is N^k / D^(k−1). For a fixed number of units, a high-resolution code (that is, one with a large N) requires units with large receptive fields (fields with a large D), whereas a low-resolution code can be achieved with units with small receptive fields.
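The coarse-coding relationship in footnote 3 is easy to check numerically. The function names here are our own; the second function simply inverts the formula to show that, for a fixed unit budget, larger receptive fields buy higher resolution:

```python
def units_required(N, k, D):
    # Binary units needed for resolution N in each of k dimensions
    # when receptive fields have diameter D: N**k / D**(k - 1).
    return N ** k / D ** (k - 1)

def achievable_resolution(units, k, D):
    # Inverse of the formula above: the resolution N attainable with a
    # fixed number of units grows with receptive-field diameter D.
    return (units * D ** (k - 1)) ** (1.0 / k)
```

For example, in two dimensions, 1000 units with diameter-10 fields achieve the same resolution (N = 100) that would require 10,000 units if each unit covered a single location.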
[Figure 9 bar chart: RMSE (vertical axis, 0.35–0.65) for models ND, RD-10, RD-20, C2M-10, C2M-20, F2M-10, and F2M-20 at slow (s) and fast (f) object velocities on the solid object data set.]
Figure 9: The RMSE of several systems on the test set data items from the solid object data set with slow (s) or fast (f) object velocities.
filters with smaller receptive fields tend to be tuned to fast velocities. Consequently, we predict that all models should be better at estimating slow velocities than fast velocities. However, we expect that this tendency will be comparatively strong for model C2M, because the receptive fields of low-spatial-frequency filters tuned to slow velocities are much bigger than those of low-frequency filters tuned to fast velocities. In contrast, we expect that this tendency will be weak for model F2M, because the receptive fields of high-frequency filters tuned to slow velocities are only mildly bigger than those of high-frequency filters tuned to fast velocities. To test this prediction, we evaluated the accuracy of seven systems at estimating slow (0.0–1.333 pixels per frame) and fast (2.667–4.0 pixels per frame) object velocities after training on the solid object data set. The results are shown in Figure 9. The horizontal axis gives both the system and the object velocity (s = slow, f = fast); the vertical axis gives the RMSE. All systems are generally good at estimating both slow and fast object velocities in the sense that they all show subpixel accuracy. Most important for our current purposes, models ND, RD, and C2M were more accurate at estimating
slow velocities than at estimating fast velocities. This trend was strongest for model C2M, in agreement with the predicted outcome. Also in agreement with our prediction is the fact that model F2M showed roughly equal accuracy at estimating slow and fast velocities.

4 Conclusion

We have compared four models on a visual motion velocity estimation task. Three of the models were developmental in the sense that the nature of their visual input changed during the course of training. Model C2M used a coarse-to-multiscale developmental progression, meaning that it received coarse-scale motion features early in training and finer-scale features were added to its input as training progressed, model F2M used a fine-to-multiscale progression, and model RD used a random progression. The final model, model ND, was nondevelopmental in the sense that the nature of its input remained the same throughout the training period. The simulation results show that model C2M performed best, and model F2M often performed worst. The fact that model C2M outperformed model ND is important because this demonstrates that a model that undergoes a developmental maturation can acquire a more advanced perceptual ability than one that does not. The fact that model F2M performed similarly to or worse than model ND, and worse than model C2M, is important because this demonstrates that not all developmental sequences provide performance benefits. It is tempting to hypothesize that only sequences whose characteristics are matched to the task should lead to superior performance. However, the finding that the best version of model RD outperformed model ND is inconsistent with this hypothesis because it is difficult to understand why a random developmental progression would be well matched to a velocity estimation task. In general, we conclude that the idea that visual development can aid visual learning is a viable hypothesis in need of further study.
Model C2M’s superior performance is interesting because its developmental progression resembles that of human infants. The spatial acuity of newborns is roughly one-fifteenth to one-thirtieth that of adults with normal eyesight. In other words, newborns are sensitive only to low spatial frequencies; they cannot see fine details. Acuity improves approximately linearly from these low levels at birth to near adult levels by about eight months of age (Norcia & Tyler, 1985). Our simulation results suggest that this developmental sequence may provide important functional benefits for the acquisition of motion velocity estimation. Model C2M’s superior performance is also interesting because this finding is broadly consistent with a theory of child development known as the “less is more” hypothesis (Newport, 1990). According to this view, cognitive, perceptual, and motor limitations early in infancy (such as being able to perceive visually only low spatial frequencies) are helpful, perhaps nec-
essary, stages in development. Limited mental abilities reflect simple neural representations, which are useful stepping-stones or building blocks for the subsequent development of more complex representations (Turkewitz & Kenney, 1982). Consistent with the “less is more” hypothesis, our view of perceptual development is that early developmental periods set the stage for later periods in the sense that early periods bias the performance of a biological system during these later periods. Although machine learning researchers rarely use the term development when referring to their learning algorithms, many researchers have pursued a “stage-setting” strategy—for example:

• Systems that perform clustering frequently are trained in two stages. During the first stage, the K-means algorithm is used to locate the cluster centers approximately. These centers are then used to initialize the mean vectors of a mixture of normal distributions, which is trained using an expectation-maximization (EM) algorithm during a second training period (Bishop, 1995).

• Bayesian systems are frequently trained in two stages. In the first stage, point estimates of a system’s parameters are obtained using a maximum likelihood estimation algorithm such as the EM algorithm. These point estimates are then used to initialize a Markov chain Monte Carlo (MCMC) sampler such as a Gibbs sampler. The initialization of MCMC samplers in this manner can lead to dramatic speed-ups in their convergence (Peng, Jacobs, & Tanner, 1996).

• Jordan (1994) proposed that probabilistic decision trees (also known as hierarchical mixtures of experts; Jordan & Jacobs, 1994) be trained in two stages. In the first stage, a deterministic decision tree is trained using, for example, the CART or C4.5 algorithms. The resulting tree is then used to initialize both the shape and the parameters of a probabilistic decision tree, which is trained using the EM algorithm in a second training stage.
• Neural networks are sometimes trained in two stages. During the first stage, a genetic algorithm is used to identify a good network structure and a good set of initial weights. The backpropagation algorithm is then used to modify these initial weights in a second stage of training (Belew, McInerney, & Schraudolph, 1991).

These examples highlight the fact that machine learning researchers are exploring multistaged strategies for biasing their learning systems so as to enhance their performances. We believe that a developmental approach based on biological principles as presented here represents a promising, but understudied, method for suitably biasing learning systems.
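The first of these two-stage strategies (K-means locating centers, then EM refining a gaussian mixture) can be sketched end to end. The one-dimensional setting and every numerical choice below are illustrative, not taken from any of the cited works:

```python
import math
import random

def kmeans_1d(xs, k, iters=20):
    # Stage 1: crude K-means to locate the cluster centers approximately.
    centers = sorted(random.sample(xs, k))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def em_gmm_1d(xs, means, iters=50):
    # Stage 2: EM for a k-component gaussian mixture, with the component
    # means initialized from the K-means centers.
    k = len(means)
    pis = [1.0 / k] * k
    vars_ = [1.0] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in xs:
            ps = [pis[j] * math.exp(-(x - means[j]) ** 2 / (2 * vars_[j]))
                  / math.sqrt(2 * math.pi * vars_[j]) for j in range(k)]
            s = sum(ps)
            resp.append([p / s for p in ps])
        # M-step: re-estimate mixing weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            pis[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            vars_[j] = sum(r[j] * (x - means[j]) ** 2
                           for r, x in zip(resp, xs)) / nj + 1e-6
    return pis, means, vars_
```

Running K-means first gives EM a sensible starting point, which is exactly the stage-setting role described in the bullet above.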
Acknowledgments

We thank the anonymous reviewers for their comments and suggestions, which resulted in important improvements to this article. This work was supported by NIH research grant R01-EY13149.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Journal of the Optical Society of America A, 2, 284–299.
Ballard, D. H. (1986). Cortical connections and parallel processing: Structure and function. Behavioral and Brain Sciences, 9, 67–120.
Belew, R. K., McInerney, J., & Schraudolph, N. N. (1991). Evolving networks: Using the genetic algorithm with connectionist learning. In Proceedings of the Second Artificial Life Conference. Reading, MA: Addison-Wesley.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Dominguez, M., & Jacobs, R. A. (in press-a). Does visual development aid visual learning? In P. Quinlan (Ed.), Connectionist models of development. East Sussex, UK: Psychology Press.
Dominguez, M., & Jacobs, R. A. (in press-b). Developmental constraints aid the acquisition of binocular disparity sensitivities. Neural Computation.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1–58.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–197.
Hinton, G. E. (1981). Shape representation in parallel systems. In A. Drina (Ed.), Proceedings of the Seventh International Joint Conference on Artificial Intelligence. Menlo Park, CA: Morgan Kaufmann.
Jordan, M. I. (1994). A statistical approach to decision tree modeling. In M. Warmuth (Ed.), Proceedings of the Seventh Annual ACM Conference on Learning Theory. New York: ACM Press.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Milner, P. M. (1974). A model for visual shape recognition. Psychological Review, 81, 521–535.
Newport, E. L. (1990). Maturational constraints on language learning. Cognitive Science, 14, 11–28.
Norcia, A., & Tyler, C. (1985). Spatial frequency sweep VEP: Visual acuity during the first year of life. Vision Research, 25, 1399–1408.
Nowlan, S. J., & Sejnowski, T. J. (1994). Filter selection model for motion segmentation and velocity integration. Journal of the Optical Society of America A, 11, 3177–3200.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041.
Peng, F., Jacobs, R. A., & Tanner, M. A. (1996). Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. Journal of the American Statistical Association, 91, 953–960.
Perrone, J. A., & Thiele, A. (2001). Speed skills: Measuring the visual speed analyzing properties of primate MT neurons. Nature Neuroscience, 4, 526–532.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Simoncelli, E. P., & Heeger, D. J. (2001). Representing retinal image speed in visual cortex. Nature Neuroscience, 4, 461–462.
Turkewitz, G., & Kenney, P. A. (1982). Limitations on input as a basis for neural organization and perceptual development: A preliminary statement. Developmental Psychobiology, 15, 357–368.
Weiss, Y., & Adelson, E. H. (1998). Slow and smooth: A Bayesian theory for the combination of local motion signals in human vision (Center for Biological and Computational Learning Paper No. 158). Cambridge, MA: Massachusetts Institute of Technology.

Received May 13, 2002; accepted October 16, 2002.
LETTER
Communicated by Zoubin Ghahramani
Modeling Cross-Modal Enhancement and Modality-Specific Suppression in Multisensory Neurons

Paul E. Patton
[email protected] Beckman Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.
Thomas J. Anastasio
[email protected] Beckman Institute and Department of Molecular and Integrative Physiology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, U.S.A.
Cross-modal enhancement (CME) occurs when the neural response to a stimulus of one modality is augmented by another stimulus of a different modality. Paired stimuli of the same modality never produce supra-additive enhancement but may produce modality-specific suppression (MSS), in which the response to a stimulus of one modality is diminished by another stimulus of the same modality. Both CME and MSS have been described for neurons in the deep layers of the superior colliculus (DSC), but their neural mechanisms remain unknown. Previous investigators have suggested that CME involves a multiplicative amplifier, perhaps mediated by N-methyl-D-aspartate (NMDA) receptors, which is engaged by cross-modal but not modality-specific input. We previously postulated that DSC neurons use multisensory input to compute the posterior probability of a target using Bayes’ rule. The Bayes’ rule model reproduces the major features of CME. Here we use simple neural implementations of our model to simulate both CME and MSS and to argue that multiplicative processes are not needed for CME, but may be needed to represent input variance and covariance. Producing CME requires only weighted summation of inputs and the threshold and saturation properties of simple models of biological neurons. Multiplicative nodes allow accurate computation of posterior target probabilities when the spontaneous and driven inputs have unequal variances and covariances. Neural implementations of the Bayes’ rule model account better than the multiplicative amplifier hypothesis for the effects of pharmacological blockade of NMDA receptors on the multisensory responses of DSC neurons. The neural implementations also account for MSS, given only the added hypothesis that input channels of the same modality have more spontaneous covariance than those of different modalities.
Neural Computation 15, 783–810 (2003)
© 2003 Massachusetts Institute of Technology
784
P. Patton and T. Anastasio
1 Introduction

Multisensory integration occurs at numerous sites in the mammalian nervous system (Stein & Meredith, 1993) and has been studied extensively in the deep layers of the superior colliculus (DSC) of cats and monkeys (Meredith & Stein, 1986b; Wallace, Wilkinson, & Stein, 1996; Wallace, Meredith, & Stein, 1998). Enhancement is a form of multisensory integration in which the response to a stimulus of one modality is augmented by a stimulus of another modality (Meredith & Stein, 1986b; Stein & Meredith, 1993). Cross-modal enhancement (CME) was thought to be mediated by multiplicative interactions among inputs of different modalities, perhaps through N-methyl-D-aspartate (NMDA) receptors present in DSC neurons (Meredith & Stein, 1986b; Stein, Huneycutt, & Meredith, 1988; Meredith, Wallace, & Stein, 1992; Stein & Meredith, 1993; Binns & Salt, 1996; Binns, 1999). CME contrasts sharply with the responses of multisensory neurons to paired stimuli of the same modality presented within their excitatory receptive fields. Such presentations never produce robust enhancement of the sort seen in CME and sometimes result in modality-specific suppression (MSS), in which the response to a stimulus of one modality is reduced by a second stimulus of the same modality (Stein & Meredith, 1993). The neural mechanisms of CME and MSS remain unknown. Open questions concern whether CME requires multiplicative interactions, the possible role of NMDA receptors, and the paradox of MSS. In previous work we modeled CME on an abstract level using Bayes’ rule (Anastasio, Patton, & Belkacem-Boussaid, 2000). Here we use neural implementations of the Bayes’ rule model to explore mechanisms of multisensory integration in DSC and other neurons. The DSC receives visual, auditory, and somatosensory inputs organized as a multimodal topographic map (Middlebrooks & Knudsen, 1984; Meredith & Stein, 1990; Clemo & Stein, 1991).
Activation of cells at a particular location in this map can result in an orienting movement toward the corresponding location in space containing the stimulus source (Wurtz & Goldberg, 1972; Peck, 1990). Individual DSC neurons may receive input from two, or occasionally three, different sensory modalities (Meredith et al., 1992; Wallace et al., 1996). Receptive fields for multiple modalities are large and spatially coincident. CME in multisensory DSC neurons occurs when stimuli of different modalities are presented together within their excitatory receptive fields (Meredith & Stein, 1996; Kadunce, Vaughan, Wallace, & Stein, 2001). Enhancement can be expressed as a percentage of the largest single stimulus response SR_max,

\[ \%E = \frac{CR - SR_{\max}}{SR_{\max}} \times 100\%, \tag{1.1} \]
where CR is the neural response to the combined stimuli. CME exhibits the property of inverse effectiveness, wherein weak unimodal stimuli, when
A Neural Model of Cross-Modal Enhancement
785
combined, produce a larger %E than do stronger stimuli (Meredith & Stein, 1986b; Wallace & Stein, 1994; Wallace et al., 1996). Multimodal neurons exhibiting CME have also been described in cat extraprimary sensory cortex (Wallace, Meredith, & Stein, 1992; Jiang, Lepore, Ptito, & Guillemot, 1994; Stein & Wallace, 1996), and cross-modal enhancement phenomena have been noted in the behavioral responses of cats (Stein et al., 1988; Stein, Meredith, Huneycutt, & McDade, 1989). MSS observed for DSC neurons can be expressed as negative percentage enhancement. Modality-specific suppressive interactions have also been reported for neurons in a number of primate cortical visual structures (Sato, 1989, 1995; Reynolds, Chelazzi, & Desimone, 1999). We have proposed previously that DSC neurons use stochastic sensory input to compute the posterior probability of a target using Bayes’ rule (Anastasio et al., 2000). We showed that CME and inverse effectiveness arise as straightforward consequences of this hypothesis. The initial study did not address the issue of how neurons might implement the needed computation. That posterior probabilities are a natural outcome of neural computation is well known to specialists in neural networks and pattern classification (Richard & Lippmann, 1991; Duda, Hart, & Stork, 2001). We apply these results in a neurobiological context and extend them to account for experimental observations. To illustrate how the Bayes’ rule model of CME (Anastasio et al., 2000) might be implemented by neurons, we consider three simple perceptron models. The simplest involves a single perceptron with sensory input channels that are Poisson distributed and conditionally independent. This model demonstrates that enhancement can be produced using only properties generically present in biological neurons and that no special multiplicative interactions are needed.
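The enhancement measure of equation 1.1, used throughout this letter, is trivial to express in code (a minimal sketch; the function name is ours):

```python
def percent_enhancement(cr, sr_max):
    # Equation 1.1: combined response CR relative to the largest
    # single-stimulus response SRmax; negative values correspond to MSS.
    return (cr - sr_max) / sr_max * 100.0
```

A combined response of 15 impulses against a best unimodal response of 10 gives %E = 50; a combined response of 8 gives %E = −20, i.e., suppression.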
The second, an augmented perceptron model, computes posterior probabilities under the more general circumstances in which inputs are multivariate gaussian. We show that it accounts better than the multiplicative interaction hypothesis for experimental findings on the effects of NMDA receptor blockade on the responses of DSC neurons. The third implementation extends the Bayes' rule model to provide an explanation for the otherwise puzzling phenomenon of MSS. The augmented perceptron produces suppression given neurobiologically plausible assumptions that include covariance among inputs of like modality in the absence of sensory stimulation.

2 A Perceptron Model with Poisson Inputs Can Simulate CME

We first consider a model in which the sensory input likelihoods are Poisson distributed and conditionally independent given the target. Under these conditions, a single perceptron is capable of computing the posterior probability of a target and simulating CME. The target is represented as binary random variable T (target absent, T = 0; target present, T = 1). The prior
P. Patton and T. Anastasio
probability distribution of the target is assigned arbitrarily as P(T = 1) = 0.1 and P(T = 0) = 0.9. According to the Bayes' rule model (Anastasio et al., 2000), the posterior probability of a target is computed by using sensory input to modify this prior probability. Input from each modality i is modeled as random variable M_i (i = 1, 2, ..., k), which can assume any discrete nonnegative value m_i. The m_i represent numbers of neural impulses per unit time (firing rate). The input likelihoods, which are conditional distributions of the inputs given the target, are modeled as Poisson densities:

P(M_i = m_i | T = t) = λ_it^{m_i} e^{−λ_it} / m_i!   (2.1)
The Poisson density was used in the original formulation of the Bayes' rule model (Anastasio et al., 2000) because it requires the fewest assumptions and reasonably approximates neuronal firing-rate distributions (Gabbiani & Koch, 1998). The mean and variance of a Poisson distribution are equal and are specified by parameter λ. The input likelihoods under spontaneous (target absent, P(M_i = m_i | T = 0)) or driven (target present, P(M_i = m_i | T = 1)) conditions are described using equation 2.1 with means λ_i0 and λ_i1, respectively (λ_i1 > λ_i0) (see Figure 1B). The posterior target probability is

Figure 1: Facing page. (A) A perceptron model of multisensory integration. The model receives two conditionally independent Poisson-distributed sensory inputs, M_1 = V and M_2 = A, having synaptic weights w_v and w_a. The bias input b is fixed and might represent tonic inhibitory input or biophysical properties intrinsic to a neuron. The sigma node Σ computes the weighted sum u of the inputs (see equation 2.5). The output is passed through the logistic squashing function f(u) (see equation 2.6). (B) The likelihoods of observing discrete numbers of impulses per unit time (0.25 s) are shown for sensory input M_i under spontaneous (P(M_i = m_i | T = 0), spont, dot-dashed line) and stimulus-driven (P(M_i = m_i | T = 1), driven, solid line) conditions. The likelihoods are given by Poisson distributions (see equation 2.1) with λ_i0 = 2 and λ_i1 = 6. (C) The bimodal perceptron model simulates CME. The response is equal to the posterior probability of a target as computed by the perceptron (f(u) = P(T = 1 | m)) using equations 2.5, 2.6, and 2.8. For both modalities V and A, the input likelihoods are assigned spontaneous mean λ_v0 = λ_a0 = 2 and driven mean λ_v1 = λ_a1 = 6. In the cross-modal both-driven case (stars), both modalities are driven by a target stimulus. Inputs V = v and A = a are set equal and varied over a range of impulses per unit time. In the modality-specific spontaneous-driven case (circles), one modality is driven, while the other is fixed at the spontaneous mean. The difference between the two plots represents CME. The dashed line indicates prior target probability P(T = 1). The both-driven and spontaneous-driven plots represent slices through the full 2D posterior. (D) Cross-modal enhancement for the perceptron model is computed using equation 1.1, where CR and SRmax values are taken from the both-driven and spontaneous-driven curves of C.
given using Bayes' rule:

P(T = 1 | m) = P(m | T = 1) P(T = 1) / P(m)   (2.2)
In this equation, m is a vector representing the activity m_i of each sensory modality M_i. We assume initially that the likelihoods for the various modalities are conditionally independent given the target. According to this simplifying assumption, the visibility of a target indicates nothing about its audibility, and vice versa. The joint likelihoods P(m | T = 1) and P(m | T = 0) can then be computed as

P(m | T = t) = ∏_{i=1}^{k} P(M_i = m_i | T = t)   (2.3)
The unconditional joint probability P(m) is given by the principle of total probability:

P(m) = P(T = 1) P(m | T = 1) + P(T = 0) P(m | T = 0)   (2.4)
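As a concreteness check, equations 2.1 through 2.4 can be combined directly. The sketch below uses the illustrative values quoted in the text (prior P(T = 1) = 0.1, spontaneous mean 2, driven mean 6 per channel); it is a reimplementation for illustration, not the authors' code:

```python
import math

def poisson(m, lam):
    """Poisson likelihood P(M_i = m | T), equation 2.1."""
    return lam ** m * math.exp(-lam) / math.factorial(m)

def posterior(m, lam0=(2, 2), lam1=(6, 6), prior1=0.1):
    """Posterior target probability P(T = 1 | m), equations 2.2-2.4,
    assuming conditionally independent channels (equation 2.3)."""
    like1 = math.prod(poisson(mi, l) for mi, l in zip(m, lam1))
    like0 = math.prod(poisson(mi, l) for mi, l in zip(m, lam0))
    evidence = prior1 * like1 + (1 - prior1) * like0   # equation 2.4
    return prior1 * like1 / evidence                   # equation 2.2

both = posterior((6, 6))    # both channels driven at the driven mean
single = posterior((6, 2))  # one driven, one at the spontaneous mean
```

Driving both channels raises the posterior far above the single-driven value, which is the seed of CME in this model.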
We will use the perceptron as the basis for our model of a DSC neuron. The perceptron computes the weighted sum u of its inputs, where each input m_i has an associated synaptic weight w_i:

u = b + ∑_{i=1}^{k} w_i m_i   (2.5)
The constant bias input b might represent the biophysical properties of a neuron or a tonic input. The weighted input sum is passed through the logistic squashing function,

f(u) = 1 / (1 + e^{−u})   (2.6)
which approximates the threshold and saturation properties of real neurons. We seek a formula for u such that the value of the logistic function f(u) is equal to the posterior probability of the target given inputs m (f(u) = P(T = 1 | m)). We substitute equation 2.6 for f(u) and substitute Poisson distributions 2.1 for the input likelihoods in equations 2.2, 2.3, and 2.4. In solving for u, we obtain:

u = ln[P(T = 1) / P(T = 0)] + ∑_{i=1}^{k} [ln(λ_i1 / λ_i0) m_i + (λ_i0 − λ_i1)]   (2.7)
Equation 2.7 is linear with respect to m_i. Thus, given the basic neuronal nonlinearities of threshold and saturation as represented by the squashing function, the posterior probability is computed by the weighted sum of sensory inputs. Multiplying inputs together is not necessary. This finding contrasts with previous ideas about the neural basis of CME (see section 5). Equation 2.7 can be rearranged to yield the input weights w_i and bias b for a simple perceptron (see Figure 1A):

w_i = ln(λ_i1 / λ_i0)  and  b = ln[P(T = 1) / P(T = 0)] + ∑_{i=1}^{k} (λ_i0 − λ_i1)   (2.8)
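A quick numerical check (illustrative parameters, not the authors' code): with weights and bias set from equation 2.8, the perceptron of equations 2.5 and 2.6 reproduces the exact Bayes posterior of equations 2.1 through 2.4, since u is exactly the log posterior odds:

```python
import math

LAM0, LAM1, P1 = (2.0, 2.0), (6.0, 6.0), 0.1  # spontaneous/driven means, prior

w = [math.log(l1 / l0) for l0, l1 in zip(LAM0, LAM1)]                  # eq 2.8
b = math.log(P1 / (1 - P1)) + sum(l0 - l1 for l0, l1 in zip(LAM0, LAM1))

def f(m):
    """Perceptron output, equations 2.5 and 2.6."""
    u = b + sum(wi * mi for wi, mi in zip(w, m))
    return 1.0 / (1.0 + math.exp(-u))

def bayes(m):
    """Exact posterior via equations 2.1-2.4."""
    def like(lams):
        return math.prod(l ** mi * math.exp(-l) / math.factorial(mi)
                         for mi, l in zip(m, lams))
    num = P1 * like(LAM1)
    return num / (num + (1 - P1) * like(LAM0))
```

For any integer inputs m, f(m) and bayes(m) agree to floating-point precision.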
The bias contains a term related to the target prior. Because λ_i1 > λ_i0, the bias term will be negative whenever P(T = 1) < P(T = 0). This finding has interesting implications for how the target prior might be represented neurobiologically (see section 5). We propose that the response of a DSC neuron is proportional to posterior target probability. A perceptron with weights set according to equation 2.8 was used to simulate the results of a bimodal CME experiment. We consider visual V and auditory A inputs, but the simulation is valid for any combination of two sensory modalities. To simulate the response of a bimodal DSC neuron to cross-modal stimulation, modalities M_1 = V and M_2 = A were set equal and varied from 0 to 25 impulses per unit time (both-driven). To simulate a bimodal DSC neuron's response to modality-specific stimulation, one input was varied from 0 to 25 impulses per unit time, while the other was held fixed at its spontaneous mean (spontaneous-driven). The results are illustrated in Figures 1C and 1D. A single perceptron is thus sufficient to implement the Bayes' rule model of CME for inputs that are Poisson distributed and conditionally independent given the target.

3 An Augmented Perceptron Model and the Role of NMDA Receptors

As a more general form of our model, we consider a case in which sensory inputs are represented as gaussian distributions, and we drop the assumption that inputs from different modalities are conditionally independent given the target. Input to a DSC neuron from a single sensory modality is derived by combining signals from many different sensory neurons. According to the classical central limit theorem, the distribution of the combined input will be gaussian, provided that the individual signals are independent (Fristedt & Gray, 1997). We assume that input from a single modality will be approximately gaussian, despite possible dependencies between signals carried by individual neurons of the same modality (see below).
We model the multimodal input and dependencies between inputs of different
modalities using a multivariate gaussian distribution (Duda et al., 2001):

P(m | T = t) = (1 / ((2π)^{k/2} |Σ_t|^{1/2})) exp[−(1/2)(m − μ_t)^T Σ_t^{−1} (m − μ_t)]   (3.1)

In this equation, k is the number of modalities, and μ_t is a vector containing the means of the distributions associated with each modality when T = t. The symbol T denotes the transpose operation. Σ_t is a k × k covariance matrix whose diagonal terms are the variances associated with individual modalities and whose off-diagonal terms are the covariances among modalities. The model classifies input patterns into two categories: T = 0 and T = 1. For this simple case, u can be expressed as a discriminant function (Duda et al., 2001):

u = g(m) = ln[P(T = 1 | m) / P(T = 0 | m)] = ln[P(m | T = 1) / P(m | T = 0)] + ln[P(T = 1) / P(T = 0)]
  = ln P(m | T = 1) + ln P(T = 1) − [ln P(m | T = 0) + ln P(T = 0)]   (3.2)
Substituting equation 3.1 and simplifying, we obtain:

u = (1/2)[−m^T (Σ_1^{−1} − Σ_0^{−1}) m + 2m^T (Σ_1^{−1} μ_1 − Σ_0^{−1} μ_0) − μ_1^T Σ_1^{−1} μ_1 + μ_0^T Σ_0^{−1} μ_0] + ln[(|Σ_0|^{1/2} / |Σ_1|^{1/2}) (P(T = 1) / P(T = 0))]   (3.3)
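The equivalence asserted by equations 3.2 and 3.3 can be checked numerically. The sketch below (hypothetical two-channel parameters) passes the expanded discriminant through the logistic function and compares it with the posterior computed directly from the gaussian likelihoods:

```python
import math
import numpy as np

P1 = 0.1
mu0, mu1 = np.array([2.0, 2.0]), np.array([6.0, 6.0])
S0 = np.array([[5.0, 0.1], [0.1, 5.0]])   # spontaneous covariance matrix
S1 = np.array([[6.0, 2.8], [2.8, 6.0]])   # driven covariance matrix

def gauss(m, mu, S):
    """Multivariate gaussian density, equation 3.1."""
    d = m - mu
    norm = (2 * np.pi) ** (len(m) / 2) * math.sqrt(np.linalg.det(S))
    return math.exp(-0.5 * d @ np.linalg.inv(S) @ d) / norm

def u(m):
    """The discriminant of equation 3.3."""
    i0, i1 = np.linalg.inv(S0), np.linalg.inv(S1)
    return (-0.5 * m @ (i1 - i0) @ m
            + m @ (i1 @ mu1 - i0 @ mu0)
            - 0.5 * (mu1 @ i1 @ mu1 - mu0 @ i0 @ mu0)
            + 0.5 * math.log(np.linalg.det(S0) / np.linalg.det(S1))
            + math.log(P1 / (1 - P1)))

def bayes(m):
    """Posterior from equation 2.2 with gaussian likelihoods."""
    n = P1 * gauss(m, mu1, S1)
    return n / (n + (1 - P1) * gauss(m, mu0, S0))
```

For any input m, 1 / (1 + exp(−u(m))) matches bayes(m) up to floating-point error.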
As before, the subscript 1 refers to an input driven by the target stimulus (T = 1), and 0 to one active spontaneously in the absence of a stimulus (T = 0). Because of the quadratic and multiplicative terms involved, this formula cannot, in general, be implemented by a single perceptron. A multilayered perceptron is capable of approximating any continuous function (Cybenko, 1989). Such a network, given m as input and appropriate synaptic weights, should be capable of supplying an approximation of u in equation 3.3 to a logistic output node representing a DSC neuron. Another plausible implementation involves the introduction of pi nodes, or product units (Rumelhart, Hinton, & McClelland, 1986; Durbin & Rumelhart, 1989), to create an augmented-perceptron model (see Figure 2A). Pi nodes multiply their inputs together just as sigma nodes add them. Neurobiologically, they represent possible voltage-sensitive processes in the dendritic arbor of a neuron (Durbin & Rumelhart, 1989; Koch & Poggio, 1992). The augmented-perceptron model can be expressed as

u = b + ∑_{i=1}^{k} w_i m_i + ∑_{i=1}^{k} ∑_{j=i}^{k} ρ_ij (m_i m_j)   (3.4)
The first two terms on the right-hand side are identical with those of the simple perceptron, equation 2.5. The last term specifies the sigma-pi function with weights ρ_ij. For the case of two sensory inputs V = v and A = a, equation 3.4 can be rewritten as

u = b + w_v v + w_a a + ρ_v v² + ρ_a a² + ρ_va va   (3.5)
Using equation 3.3, the constants ρ_v, ρ_a, ρ_va, w_v, w_a, and b can be specified as

ρ_v = (1/2)(σ_a0²/β_0 − σ_a1²/β_1),  ρ_a = (1/2)(σ_v0²/β_0 − σ_v1²/β_1),  ρ_va = σ_va1²/β_1 − σ_va0²/β_0   (3.6)

w_v = (μ_v1 σ_a1² − μ_a1 σ_va1²)/β_1 + (μ_a0 σ_va0² − μ_v0 σ_a0²)/β_0   (3.7)

w_a = (μ_a1 σ_v1² − μ_v1 σ_va1²)/β_1 + (μ_v0 σ_va0² − μ_a0 σ_v0²)/β_0   (3.8)

b = [−σ_a1² μ_v1² + 2σ_va1² μ_v1 μ_a1 − σ_v1² μ_a1²]/(2β_1) + [σ_a0² μ_v0² − 2σ_va0² μ_v0 μ_a0 + σ_v0² μ_a0²]/(2β_0) + ln[(β_0/β_1)^{1/2} P(T = 1)/P(T = 0)]   (3.9)

In these equations, β_0 = σ_v0² σ_a0² − σ_va0⁴ and β_1 = σ_v1² σ_a1² − σ_va1⁴. The terms σ_v0², σ_a0², σ_v1², and σ_a1² are variances for the spontaneous and driven cases for each modality, and σ_va0² and σ_va1² are the covariances for the spontaneous and driven conditions, respectively. An augmented perceptron as specified by equations 3.5 through 3.9 can compute posterior target probability and thereby simulate CME (see Figures 2B and 2C). Note that as for the Poisson-input model, CME arises by summation in a nonlinear neuron. When the spontaneous and driven variances and the spontaneous and driven covariances are equal (σ_v0² = σ_v1², σ_a0² = σ_a1², and σ_va0² = σ_va1²), the pi-node terms disappear (ρ_v = ρ_a = ρ_va = 0) and the computation can again be performed by a simple perceptron. The multiplicative terms associated with the pi nodes compensate for differences between spontaneous and driven variance and covariance rather than produce CME. Targets vary greatly in the efficacy with which they drive different sensory input neurons. It seems likely that the variance of a sensory input of any modality should be larger in the driven than in the spontaneous case.
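As a consistency sketch (using the section's illustrative two-channel parameters, not the authors' code), the closed-form coefficients of equations 3.6 through 3.9 can be checked against the matrix form of equation 3.3; the variable names below (sv0 for σ_v0², sva1 for σ_va1², and so on) are ours:

```python
import math
import numpy as np

# Means, variances, and covariances from the text; prior P(T = 1) = 0.1.
mv0 = ma0 = 2.0; mv1 = ma1 = 6.0
sv0 = sa0 = 5.0; sv1 = sa1 = 6.0
sva0, sva1 = 0.1, 2.8
P1 = 0.1

beta0 = sv0 * sa0 - sva0 ** 2   # beta_0 = |Sigma_0|
beta1 = sv1 * sa1 - sva1 ** 2   # beta_1 = |Sigma_1|

rv = 0.5 * (sa0 / beta0 - sa1 / beta1)                                    # eq 3.6
ra = 0.5 * (sv0 / beta0 - sv1 / beta1)
rva = sva1 / beta1 - sva0 / beta0
wv = (mv1 * sa1 - ma1 * sva1) / beta1 + (ma0 * sva0 - mv0 * sa0) / beta0  # eq 3.7
wa = (ma1 * sv1 - mv1 * sva1) / beta1 + (mv0 * sva0 - ma0 * sv0) / beta0  # eq 3.8
b = ((-sa1 * mv1**2 + 2 * sva1 * mv1 * ma1 - sv1 * ma1**2) / (2 * beta1)
     + (sa0 * mv0**2 - 2 * sva0 * mv0 * ma0 + sv0 * ma0**2) / (2 * beta0)
     + math.log(math.sqrt(beta0 / beta1) * P1 / (1 - P1)))                # eq 3.9

def u_sigma_pi(v, a):
    """Augmented perceptron, equation 3.5."""
    return b + wv * v + wa * a + rv * v**2 + ra * a**2 + rva * v * a

def u_matrix(v, a):
    """Equation 3.3 in matrix form, for comparison."""
    m = np.array([v, a], dtype=float)
    mu_0, mu_1 = np.array([mv0, ma0]), np.array([mv1, ma1])
    S0 = np.array([[sv0, sva0], [sva0, sa0]])
    S1 = np.array([[sv1, sva1], [sva1, sa1]])
    i0, i1 = np.linalg.inv(S0), np.linalg.inv(S1)
    return (-0.5 * m @ (i1 - i0) @ m + m @ (i1 @ mu_1 - i0 @ mu_0)
            - 0.5 * (mu_1 @ i1 @ mu_1 - mu_0 @ i0 @ mu_0)
            + 0.5 * math.log(np.linalg.det(S0) / np.linalg.det(S1))
            + math.log(P1 / (1 - P1)))
```

The two functions agree for any (v, a), confirming that the sigma-pi coefficients are just the expanded quadratic discriminant.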
In the two-channel model, we set the spontaneous variances σ_v0² = σ_a0² = 5 and the driven variances σ_v1² = σ_a1² = 6. It also seems likely that two sensory inputs of different modalities should covary more under driven than under spontaneous conditions. Inputs of different modalities are segregated in modality-specific pathways prior to their arrival in the DSC (Sparks & Hartwich-Young, 1989; Wallace, Meredith, & Stein, 1993). Such input channels are unlikely to interact under spontaneous conditions, and sponta-
Figure 2: Facing page. (A) An augmented-perceptron model of multisensory integration. Sensory inputs are gaussian distributed. The model receives input from two sensory modalities: M_1 = V and M_2 = A. Each provides weighted input to summing node Σ with weights w_v and w_a. The summing node also receives a fixed bias input b. The set of three pi nodes Π multiplies both modalities by themselves and by each other. The pi nodes provide inputs to the summing node with weights ρ_v, ρ_a, and ρ_va. The output is passed through the squashing function f(u). (B) The augmented-perceptron model simulates CME. Input likelihoods are gaussian distributed. For both modalities V and A, the spontaneous likelihood mean is μ_v0 = μ_a0 = 2, and the driven likelihood mean is μ_v1 = μ_a1 = 6. Spontaneous and driven variances for both modalities are σ_v0² = σ_a0² = 5 and σ_v1² = σ_a1² = 6. The spontaneous and driven covariances are σ_va0² = 0.1 and σ_va1² = 2.8. The responses are equal to posterior target probability computed using the augmented perceptron. The cross-modal both-driven curve (stars) simulates a case in which both modalities are driven by a stimulus. Inputs V = v and A = a are set equal and varied from 0 to 25 impulses per unit time. The modality-specific spontaneous-driven curve (circles) simulates a case where only one modality is driven. The driven input varies from 0 to 25 impulses per unit time, while the input representing the nondriven modality was held fixed at the mean of the spontaneous likelihood distribution. The both-driven and spontaneous-driven plots represent slices through the full 2D posterior. The no-pi both-driven (asterisks) and no-pi spontaneous-driven (triangles) curves illustrate the inaccurate probability computation performed for the both-driven and spontaneous-driven cases, respectively, by a network in which the pi nodes have been removed by setting ρ_v = ρ_a = ρ_va = 0.
(C) Cross-modal enhancement (see equation 1.1) is plotted for the intact augmented perceptron (stars) and for a version from which the pi nodes have been removed (asterisks). (D) The amount by which removal of the pi nodes reduces the augmented-perceptron response to input is illustrated for the both-driven (asterisks) and spontaneous-driven (circles) cases. In each case, reduction R is computed using

R = [P(T = 1 | m) − L(m)] / P(T = 1 | m) × 100%,

where P(T = 1 | m) is computed using the intact network, and L(m) is the corresponding value computed by the lesioned network following removal of the pi nodes.
neous covariance between channels of different modalities should be low. In contrast, the sensory attributes of natural targets are likely to covary, and channels of different modalities will covary as they are driven by these targets. Accordingly, in the two-channel model, we set spontaneous covariance σ_va0² = 0.1 and driven covariance σ_va1² = 2.8. The specific variance and covariance values chosen are consistent with these plausible assumptions and produce an accurate simulation of neurophysiological findings on DSC neurons when used to parameterize the two-channel augmented perceptron. An augmented-perceptron model DSC neuron receiving inputs of two modalities (see Figure 2A) is studied in the both-driven case, in which both modalities are driven equally by a stimulus, and in the spontaneous-driven case, in which one modality is driven by a stimulus while the other is spontaneously active. As the number of impulses increases, target probability increases from nearly zero to nearly one, and it increases faster in the both-driven than in the spontaneous-driven case (see Figure 2B). CME reaches its highest values in the region where the spontaneous-driven responses are still small (see Figures 2B and 2C). Presumably, experimental effects on CME would be studied in this region, which will be referred to as the enhancement zone. Binns and Salt (1996) investigated the effects of pharmacologically blocking NMDA receptors on the responses of DSC neurons to modality-specific and cross-modal stimuli. We postulate that NMDA receptors may serve the functional role assigned to pi nodes in the augmented-perceptron model. To investigate this hypothesis, the effects on the model of removing the pi nodes (i.e., setting ρ_v = ρ_a = ρ_va = 0) were compared with the effects reported by Binns and Salt (1996). Following pi-node removal, the model no longer accurately computes the posterior probability of a target (see Figure 2B).
The effect of pi-node removal is to reduce the both-driven and spontaneous-driven responses to a given input below the values computed by the intact model (see Figures 2B and 2D). Removal of the pi nodes does not eliminate enhancement but reduces it within the enhancement zone (see Figure 2C). These effects are produced over a wide range of model variance and covariance parameters. Binns and Salt (1996) also report that blockade of NMDA receptors in DSC neurons produces reductions in both spontaneous-driven and both-driven responses. For the chosen parameters, the percentage by which the model response is reduced is greater for both-driven than for spontaneous-driven input within the enhancement zone (see Figure 2D), as also observed experimentally. The experimental data are thus more consistent with our hypothesis than with the view that NMDA receptors function as a multiplicative amplifier for CME (see section 5).
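The lesion comparison can be sketched as follows (illustrative, using the section's two-channel parameters, not the authors' code). Removing the pi nodes amounts to dropping the quadratic term of equation 3.3 while keeping the linear weights and bias of equations 3.7 through 3.9, and the reduction R is computed as in Figure 2D:

```python
import math
import numpy as np

P1 = 0.1
mu0, mu1 = np.array([2.0, 2.0]), np.array([6.0, 6.0])
S0 = np.array([[5.0, 0.1], [0.1, 5.0]])   # spontaneous covariance matrix
S1 = np.array([[6.0, 2.8], [2.8, 6.0]])   # driven covariance matrix
i0, i1 = np.linalg.inv(S0), np.linalg.inv(S1)

def response(m, lesioned=False):
    """Augmented-perceptron posterior (equation 3.3); lesioned=True
    zeroes the pi-node (quadratic) contribution."""
    m = np.asarray(m, dtype=float)
    quad = 0.0 if lesioned else -0.5 * m @ (i1 - i0) @ m
    u = (quad + m @ (i1 @ mu1 - i0 @ mu0)
         - 0.5 * (mu1 @ i1 @ mu1 - mu0 @ i0 @ mu0)
         + 0.5 * math.log(np.linalg.det(S0) / np.linalg.det(S1))
         + math.log(P1 / (1 - P1)))
    return 1.0 / (1.0 + math.exp(-u))

def reduction(m):
    """Percentage reduction R caused by pi-node removal."""
    p, l = response(m), response(m, lesioned=True)
    return (p - l) / p * 100.0

r_both = reduction((4.0, 4.0))    # both channels driven
r_spont = reduction((4.0, 2.0))   # one driven, one at spontaneous mean
```

With these parameters, the both-driven reduction exceeds the spontaneous-driven reduction inside the enhancement zone, mirroring the Binns and Salt (1996) pattern described above.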
4 Multiple Input Channels and Modality-Specific Suppression

The augmented-perceptron model can be extended to account for the fact that cross-modal stimuli produce robust enhancement of DSC neuron re-
sponses, whereas paired modality-specific stimuli do not (Stein & Meredith, 1993). As an implementation of the Bayes' rule model (Anastasio et al., 2000), the augmented perceptron accounts for CME as an increase in the posterior target probability computed by a DSC neuron due to integration of a second input of a different modality. The extension to MSS is based on the intuition that a second input should increase target probability less if it covaries more with the first input and that spontaneous covariance, in particular, is likely to be stronger between inputs of the same modality. Analysis shows this intuition to be correct. Surprisingly, given specific but neurobiologically plausible assumptions concerning the inputs, it can even account for the counterintuitive phenomenon of MSS. Bayes' rule can thus account for both CME and MSS, and both can be implemented using the same augmented perceptron. For convenience, we refer to visual and auditory modalities in what follows, although the result also applies to other modality combinations. A DSC neuron's visual receptive field is produced by combining input from multiple sources, many of which are known to have much smaller receptive fields than those of DSC neurons (Olson & Graybiel, 1987; Sparks & Hartwich-Young, 1989; Meredith & Stein, 1990). These visual neurons would have separate but overlapping receptive fields within the receptive field of the DSC neuron. We assume that the paired modality-specific stimuli used to produce MSS will activate separate but overlapping sets of visual neurons. We will refer to the set of visual neurons that can be activated by the first stimulus of the pair as channel V and those activated by the second stimulus as channel X. Receptive field overlap will cause the V and X channels to covary under driven conditions. Many of the neurons that provide visual input to the DSC are spontaneously active (see the discussion in Anastasio et al., 2000).
Overlap in the sets of visual neurons that comprise channels V and X will cause them to covary under spontaneous conditions. Overlap of spontaneously silent neurons will contribute additional covariance under driven conditions. In addition to the two visual channels, we assume a third channel A representing auditory input. This three-channel case is treated using an augmented perceptron (see Figure 3A). The means, variances, and covariances for the multivariate gaussians, equation 3.1, representing the spontaneous and driven likelihoods of V, X, and A, are set according to the assumptions made above, which are that means, variances, and covariances are higher under driven than under spontaneous conditions and that covariances, especially spontaneous covariances, are higher between modality-specific than between cross-modal channels. The parameters in the augmented-perceptron model can be computed using equation 3.3 as for the two-channel case above. The model can simulate the finding that cross-modal input pairs (V and A) produce CME, whereas modality-specific input pairs (V and X) do not (see Figures 3B and 3C). To illustrate the explanatory power of the model, the modality-specific case simulated in Figures 3B and 3C is actually one of suppression (MSS).
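The three-channel experiment can be sketched directly from the likelihoods (equations 2.2 and 3.1), using the parameter values quoted for Figure 3; this is an illustrative reimplementation, not the authors' code:

```python
import math
import numpy as np

P1 = 0.1
mu0, mu1 = np.full(3, 2.0), np.full(3, 6.0)   # channel order (v, x, a)
S0 = np.array([[2.0, 1.6, 0.1],    # within-modality spontaneous covariance 1.6
               [1.6, 2.0, 0.1],    # cross-modal spontaneous covariance 0.1
               [0.1, 0.1, 2.0]])
S1 = np.array([[6.0, 3.6, 2.8],    # driven covariances 3.6 (VX), 2.8 (VA, XA)
               [3.6, 6.0, 2.8],
               [2.8, 2.8, 6.0]])

def gauss(m, mu, S):
    """Multivariate gaussian density, equation 3.1 (k = 3)."""
    d = m - mu
    norm = (2 * np.pi) ** 1.5 * math.sqrt(np.linalg.det(S))
    return math.exp(-0.5 * d @ np.linalg.inv(S) @ d) / norm

def posterior(m):
    m = np.asarray(m, dtype=float)
    n = P1 * gauss(m, mu1, S1)
    return n / (n + (1 - P1) * gauss(m, mu0, S0))

def pct_enh(cr, sr_max):
    """Equation 1.1: %E = (CR - SRmax) / SRmax x 100."""
    return (cr - sr_max) / sr_max * 100.0

va, vx = [], []
for d in range(1, 16):
    v_alone, a_alone = posterior((d, 2, 2)), posterior((2, 2, d))
    va.append(pct_enh(posterior((d, 2, d)), max(v_alone, a_alone)))
    vx.append(pct_enh(posterior((d, d, 2)), v_alone))  # V alone = X alone here
```

Over the driven range, the cross-modal (VA) pairing produces large positive enhancement, while the modality-specific (VX) pairing dips below zero, i.e. MSS.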
In the model, MSS arises over a specific but neurobiologically plausible range of parameters. Figure 4 shows the effects of varying spontaneous and driven covariance between the V and X channels (σ_vx0² and σ_vx1²) on peak CME and MSS. For the simple case considered, all three modalities have equal spontaneous means (μ_v0 = μ_x0 = μ_a0 = 2), driven means (μ_v1 = μ_x1 = μ_a1 = 6), spontaneous variances (σ_v0² = σ_x0² = σ_a0² = 2), and driven variances (σ_v1² = σ_x1² = σ_a1² = 6). Driven covariances between cross-modal channels are larger than spontaneous covariances (σ_va0² = σ_xa0² = 0.1; σ_va1² = σ_xa1² = 2.8). These parameters are the same as those of Figure 3. A real-valued multivariate gaussian distribution, equation 3.1, exists only when |Σ_t| > 0. This condition is met whenever σ_v² σ_a² σ_x² + 2σ_va² σ_vx² σ_ax² >

Figure 3: Facing page. (A) An augmented-perceptron model of multisensory integration for a three-channel case: M_1 = V, M_2 = X, and M_3 = A. Each channel provides weighted input to summing node Σ with weights w_v, w_x, and w_a. The summing node also receives a fixed bias input b. Each modality multiplies itself and each of the other modalities on the set of six pi nodes Π. The pi nodes provide inputs to the summing node with weights ρ_v, ρ_x, ρ_a, ρ_vx, ρ_xa, and ρ_va. The output of the summing node is passed through the squashing function f(u). (B) A three-channel augmented-perceptron model is used to simulate both CME and MSS. The input likelihoods are multivariate gaussians. The spontaneous input likelihoods for all three channels have means μ_v0 = μ_x0 = μ_a0 = 2 and variances σ_v0² = σ_x0² = σ_a0² = 2. Driven input likelihoods have means μ_v1 = μ_x1 = μ_a1 = 6 and variances σ_v1² = σ_x1² = σ_a1² = 6. Within-modality covariances σ_vx0² = 1.6 and σ_vx1² = 3.6 are higher than the corresponding between-modality covariances σ_va0² = σ_xa0² = 0.1 and σ_va1² = σ_xa1² = 2.8. The responses are equal to the posterior probability computed by the augmented perceptron.
Plots for four different input conditions are shown. For each condition, a driven channel is one where the input varies over the range from 0 to 15 impulses per unit time (abscissa), and a spontaneous channel is one where the input is held fixed at the spontaneous mean. The V or X alone condition (circles) simulates an experiment in which a single visual stimulus is presented. One of the visual channels, V or X, is driven, while the other visual channel and the auditory channel are spontaneous. The A alone case (triangles) simulates an experiment in which an auditory stimulus is presented alone. Channel A is driven, and V and X are spontaneous. The V and X condition (squares) simulates an experiment in which two visual stimuli are presented simultaneously in a DSC neuron's visual receptive field. Channels V and X are driven, and A is spontaneous. The V and A case (stars) simulates an experiment in which a visual and an auditory stimulus are presented simultaneously. Channels V and A are driven, and X is spontaneous. Each plot constitutes a slice through the full 3D posterior. (C) Percentage enhancement is computed as described for Figure 1D. Presentation of two stimuli of different modalities (V and A) produces positive enhancement or CME, while presentation of two stimuli of the same modality (V and X) produces negative enhancement or MSS.
σ_v² σ_ax⁴ + σ_a² σ_vx⁴ + σ_x² σ_va⁴. The chart in Figure 4 is bounded at the right and on the bottom by a region for which no real-valued multivariate gaussian distribution exists. Peak CME (white bars) is always greater than 200% (asterisks) for spontaneous covariance values less than 0.6 (not shown). Suppression (black bars) becomes evident when the spontaneous covariance σ_vx0² approaches the spontaneous variance σ_v0² = σ_x0² = σ_a0² = 2, but is relatively uninfluenced by the driven covariance σ_vx1². Similar relationships are observed when the variances for different modalities are unequal. This
parametric analysis further implicates spontaneous covariance as a factor causing suppression of one input channel by another. Graphical analysis illustrates how spontaneous covariance between input channels can lead to suppression. It also shows that spontaneous covariance is necessary for suppression but not sufficient, because spontaneous covariance leads to suppression only when driven variance is greater than spontaneous variance. Multivariate gaussian input likelihood distributions (P(m | T = t); see equation 3.1) for the three-channel model are plotted in Figure 5, for a series of different variance and covariance conditions. Figure 6 plots posterior target probabilities (P(T = 1 | m); see equation 2.2) for the same series of conditions. The heavy dashed curve in Figure 5 indicates the intersection of the spontaneous and driven likelihood distributions, P(m | T = 0) = P(m | T = 1). For inputs falling on this curve, the posterior probability of a target is equal to its prior probability, P(T = 1 | m) = P(T = 1). The same intersection curve is again shown in Figure 6, which plots posterior target probabilities. In the white region of Figure 6, P(T = 1 | m) is nearly 0, and in the gray region nearly 1. In each panel of Figures 5 and 6, the dotted diagonal line indicates the case where channels V and X are held equal and driven over the range from 0 to 15, as in Figure 3. The dot-dashed horizontal line indicates the case where V alone is driven from 0 to 15, and X is held fixed at its spontaneous mean. The dot-dashed vertical line indicates the case where X alone is driven from 0 to 15, with V remaining fixed at its spontaneous mean. In all cases, A channel input is held fixed at its spontaneous mean μ_a0 = 2.

Figure 4: Facing page. Spontaneous covariance is associated with MSS in the three-channel augmented-perceptron model.
Spontaneous and driven covariance (σ_vx0² and σ_vx1²) between the V and X channels are varied, while all other parameters are as specified for Figure 3. Each of the rectangles corresponds to one combination of σ_vx0² and σ_vx1², as indicated. VX enhancement was computed by comparing the combined response CR and the maximum single-modality response SRmax using equation 1.1. For CR, V and X were both driven by a stimulus over the range from 0 to 30, while A was held fixed at its spontaneous mean. For SRmax, V alone was driven over the range from 0 to 30, and X and A were held fixed at their spontaneous means. The maximum percentage enhancement over the range was computed and is indicated for each covariance combination by the height of a white bar above the horizontal line representing zero enhancement. Maximum enhancements exceeding 200% are indicated by a white bar with an asterisk. Spontaneous covariances less than 0.6 (not shown) produced enhancements greater than 200%. The minimum enhancement is computed for inputs that are larger than those at which the maximum occurred. If the minimum observed was less than zero, it is indicated as a black bar extending downward from the horizontal zero-enhancement line. The depth of the bar indicates suppression on a scale of 0 to −200%.
Figure 5: Multivariate gaussian input likelihood distributions for the three-channel model under spontaneous and stimulus-driven conditions for different settings of variance and covariance. In each panel, spontaneous and driven likelihood is plotted as a function of V and X channel input. A channel input is held fixed at its spontaneous mean (a = μ_a0 = 2). Each plot represents a slice through a 3D likelihood distribution. Likelihood value is indicated by contour lines and shading, where darker shading indicates a higher likelihood. In each panel of Figures 5 and 6, the diagonal dotted line indicates v = x, which corresponds to the V- and X-driven case of Figure 3. The horizontal dot-dashed line corresponds to the V-alone case, where v varies from 0 to 15 and x = a = μ_x0 = μ_a0 = 2. The vertical dot-dashed line corresponds to the X-alone case. The heavy dashed curve in each panel indicates the points of intersection of the spontaneous and driven likelihood distributions (P(m | T = 0) = P(m | T = 1)). (A) The spontaneous and driven covariances between all channels
Figure 5 (cont.): Facing page. (A) The spontaneous and driven covariances are zero (σ²_va0 = σ²_xa0 = σ²_vx0 = 0; σ²_va1 = σ²_xa1 = σ²_vx1 = 0). All other parameters are as specified in Figure 3. The concentric contour lines about the point v = 2, x = 2 indicate the spontaneous likelihood distribution. From outermost to innermost, the contour lines indicate P(m | T = 0) = 0.005, 0.01, 0.015, and 0.02, respectively. The concentric contour lines about the point v = 6, x = 6 indicate the driven likelihood distribution. From outermost to innermost, P(m | T = 1) = 2 × 10^-4, 4 × 10^-4, 6 × 10^-4, 8 × 10^-4, and 1 × 10^-3, respectively. The corresponding posterior target probability distribution is shown in Figure 6A. (B) The spontaneous and driven covariances are nonzero and assigned the same values as for Figure 3 (σ²_va0 = σ²_xa0 = 0.1, σ²_vx0 = 1.6, σ²_va1 = σ²_xa1 = 2.8, and σ²_vx1 = 3.6). All other parameters are also as specified in Figure 3. The concentric contour lines centered about v = 2, x = 2 indicate the spontaneous likelihoods P(m | T = 0) = 0.005, 0.01, 0.015, 0.02, 0.025, 0.03, and 0.035. The second set of concentric contour lines indicates the driven likelihoods P(m | T = 1) = 2 × 10^-4, 4 × 10^-4, 6 × 10^-4, 8 × 10^-4, 0.001, and 0.0012. The corresponding target posterior probability distribution is shown in Figure 6B. (C) The spontaneous and driven covariances between the V, X, and A channels are zero, but the variances are larger for the V channel than for the X and A channels (σ²_v0 = 8, σ²_x0 = σ²_a0 = 2, σ²_v1 = 16, σ²_x1 = σ²_a1 = 6). Distribution means are as in Figure 3. The concentric contour lines about v = 2, x = 2 indicate the spontaneous likelihoods P(m | T = 0) = 0.002, 0.004, 0.006, 0.008, and 0.01. The concentric contour lines centered about v = 6, x = 6 indicate the driven likelihoods P(m | T = 1) = 1 × 10^-4, 2 × 10^-4, 3 × 10^-4, 4 × 10^-4, 5 × 10^-4, and 6 × 10^-4. The corresponding target posterior probability distribution is shown in Figure 6C. (D) The spontaneous and driven covariances are nonzero as in Figures 3 and 5B (σ²_va0 = σ²_xa0 = 0.1, σ²_vx0 = 1.6, σ²_va1 = σ²_xa1 = 2.8, and σ²_vx1 = 3.6), but the spontaneous variance is larger than the driven variance (σ²_v0 = σ²_x0 = σ²_a0 = 8; σ²_v1 = σ²_x1 = σ²_a1 = 6). Distribution means are again as for Figure 3. The concentric contour lines about v = 2, x = 2 indicate the spontaneous likelihoods P(m | T = 0) = 0.0005, 0.001, 0.0015, 0.002, and 0.0025. The second set of concentric contour lines indicates the driven likelihoods P(m | T = 1) = 2 × 10^-4, 4 × 10^-4, 6 × 10^-4, 8 × 10^-4, 0.001, 0.0012, 0.0014, and 0.0016. The corresponding target posterior probability distribution is shown in Figure 6D.
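The posterior surfaces described in these captions follow from Bayes' rule applied to the gaussian likelihoods. As a rough sketch of the calculation, the code below uses the zero-covariance parameters of Figure 5C (so the channels factor as conditionally independent gaussians) together with the prior P(T = 1) = 0.1 given with Figure 6. The A-channel means (2 spontaneous, 6 driven) are an assumption here, and since the full Figure 3 parameter set is not reproduced above, the resulting numbers will not exactly match the figure panels.

```python
import math

def gauss(x, mean, var):
    """Univariate gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def posterior(m, means0, vars0, means1, vars1, prior1=0.1):
    """P(T = 1 | m) by Bayes' rule, channels conditionally independent."""
    l0 = 1.0  # spontaneous likelihood P(m | T = 0)
    l1 = 1.0  # driven likelihood P(m | T = 1)
    for x, mu0, s0, mu1, s1 in zip(m, means0, vars0, means1, vars1):
        l0 *= gauss(x, mu0, s0)
        l1 *= gauss(x, mu1, s1)
    return prior1 * l1 / (prior1 * l1 + (1.0 - prior1) * l0)

# Figure 5C parameters (zero covariance; A-channel means assumed 2 and 6)
means0, vars0 = (2.0, 2.0, 2.0), (8.0, 2.0, 2.0)   # spontaneous
means1, vars1 = (6.0, 6.0, 6.0), (16.0, 6.0, 6.0)  # driven

p_both = posterior((7.0, 7.0, 2.0), means0, vars0, means1, vars1)    # V and X driven
p_v_only = posterior((7.0, 2.0, 2.0), means0, vars0, means1, vars1)  # V alone
```

Driving both channels pushes the posterior toward one, while driving V alone leaves it much lower, which is the qualitative pattern Figure 6C displays.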
In Figure 5A the driven variances are equal, the spontaneous variances are equal in all three channels, and driven variance is greater than spontaneous variance, as in Figure 3. Unlike Figure 3, there is no covariance between any of the channels in Figure 5A. Under these circumstances, the spontaneous (T = 0) and driven (T = 1) likelihood contours (see Figure 5A), and their associated posterior probability contours (see Figure 6A), are circular. The heavy vertical line in Figure 6A connects a point along the V and X both-driven line with the corresponding point along the V-only,
spontaneous-driven line. The both-driven point has posterior probability near one, while the spontaneous-driven point has posterior probability near zero, indicating very large enhancement. For the circular case, Figure 6A illustrates that both-driven posteriors will always be higher than the corresponding spontaneous-driven posteriors and, consequently, suppression cannot occur. In Figures 5B and 6B, the driven variances are equal, the spontaneous variances are again equal in all three channels, and driven variance is greater than spontaneous variance. Cross-modal channels are assigned a very small spontaneous covariance and a larger driven covariance. The V and X channels are assigned a much larger spontaneous covariance than the cross-modal channels and a larger driven covariance. All parameters have the
same values as for Figure 3. Due to nonzero covariance, the spontaneous and driven likelihood contours (see Figure 5B) are elliptical and elongated in the diagonal direction. The place of intersection of the spontaneous and driven likelihoods, and the posterior probability contours (see Figure 6B), are similarly elliptical and diagonally elongated. This effect is due primarily to the large VX spontaneous covariance. The heavy vertical line in Figure 6B connects a both-driven point having posterior probability near zero with a corresponding spontaneous-driven point having posterior probability near one, indicating suppression. Figure 6B illustrates how suppression can arise from the Bayes' rule model and be simulated by a neural implementation of it when modality-specific channels exhibit spontaneous covariance and driven variance exceeds spontaneous variance. Figure 6: Facing page. Posterior target probability (P(T = 1 | m)) is plotted as a function of V and X channel input for the three-channel model described in Figure 3. Posterior target probability is indicated in each panel by shading and contour lines. White indicates P(T = 1 | m) values near zero, and gray indicates P(T = 1 | m) values near one. Contour lines are at 0.1 unit intervals. The heavy dotted curve in each panel indicates the input values for which the posterior and prior target probabilities are equal (P(T = 1 | m) = P(T = 1) = 0.1), which is the same as the P(m | T = 0) = P(m | T = 1) curve in the corresponding panels of Figure 5. The dotted and dot-dashed lines are as described in Figure 5. (A) The likelihood parameters are as given for Figure 5A. An input parameter combination that produces enhancement of 1075% is shown by the heavy vertical line terminated with circular dots. The V and X both-driven posterior probability P(T = 1 | V = 6, X = 6, A = 2) = 0.94 (upper dot) and the V alone posterior P(T = 1 | V = 6, X = 2, A = 2) = 0.08 (lower dot).
(B) The likelihood parameters are as given for Figure 5B. An input parameter combination that produces enhancement of −83.3% (suppression) is shown by the heavy vertical line terminated with circular dots. The V and X both-driven posterior probability P(T = 1 | V = 5.8, X = 5.8, A = 2) = 0.16 (upper dot) and the V alone posterior P(T = 1 | V = 5.8, X = 2, A = 2) = 0.96 (lower dot). (C) The likelihood parameters are as given for Figure 5C, with different variances for the V and X channels. An input parameter combination that produces enhancement of only 40.3% is shown by the heavy vertical line terminated with circular dots. The V and X both-driven posterior probability P(T = 1 | V = 7, X = 7, A = 2) = 0.94 (upper dot) and the V alone posterior P(T = 1 | V = 7, X = 2, A = 2) = 0.67 (lower dot). (D) The likelihood parameters are as described in Figure 5D, in which the spontaneous variance is larger than the driven variance. Here, the V and X both-driven posterior probability reaches a maximum value of only 0.27, and thus only two contour lines are visible. An input parameter combination that produces enhancement of 8337% is shown by the heavy vertical line terminated with circular dots. The V and X both-driven posterior probability P(T = 1 | V = 10, X = 10, A = 2) = 0.27 (upper dot) and the V alone posterior P(T = 1 | V = 10, X = 2, A = 2) = 0.0032 (lower dot).
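The enhancement percentages quoted in these captions are consistent with the conventional multisensory enhancement index, 100 × (both-driven − single-driven) / single-driven, applied to the posterior probabilities rather than to firing rates. A quick check in code (an illustration of the arithmetic, not the authors' implementation):

```python
def enhancement(p_both, p_single):
    """Percentage enhancement of the both-driven response over the
    single-driven response; negative values indicate suppression."""
    return 100.0 * (p_both - p_single) / p_single

# Posterior pairs read off Figures 6A-6D (upper dot, lower dot):
print(round(enhancement(0.94, 0.08), 1))    # Figure 6A: 1075.0
print(round(enhancement(0.16, 0.96), 1))    # Figure 6B: -83.3 (suppression)
print(round(enhancement(0.94, 0.67), 1))    # Figure 6C: 40.3
print(round(enhancement(0.27, 0.0032), 1))  # Figure 6D: 8337.5 (quoted as 8337%)
```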
P. Patton and T. Anastasio
Covariance is necessary for suppression. Elongation of the spontaneous or driven likelihoods by itself, without covariance, can decrease enhancement but cannot produce suppression. The spontaneous and driven covariances are set to zero in Figure 5C, but the likelihoods are elongated along the X channel axis (ordinate) by setting the V variance greater than the X variance under both spontaneous and driven conditions. The equal-likelihood (see Figure 5C) and posterior probability (see Figure 6C) contours are also elongated along the X channel axis. The heavy vertical line in Figure 6C connects a both-driven point with a corresponding spontaneous-driven point that have almost equal posterior probabilities, indicating very little enhancement. Increasing V variance further would bring the posterior probability contours closer to parallel with the X channel axis. This would further reduce enhancement but never lead to suppression. Suppression requires elongation combined with diagonalization of the likelihood distributions (see Figures 5B and 6B), which can be produced only by covariance. Covariance is necessary but not sufficient for suppression. Suppression requires spontaneous covariance (with or without driven covariance) and driven variance greater than spontaneous variance. In Figures 5D and 6D, channel covariances are as in Figures 5B and 6B, and the spontaneous variances are equal and the driven variances are equal for all three channels. Unlike the case of Figures 5B and 6B, the driven variance in Figures 5D and 6D is smaller than the spontaneous variance. Now the heavy dashed curve indicating equal spontaneous and driven likelihood (see Figure 5D) and equal prior and posterior probability (see Figure 6D) forms a closed ellipse containing the point specified by the driven V and X means. The spontaneous-driven lines never intersect this ellipse. Within the ellipse, the posterior probability of a target never exceeds 0.27.
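The role of covariance can be checked numerically. The sketch below uses a two-channel (V, X) reduction with hypothetical parameters chosen only to satisfy the conditions stated above: spontaneous means 2 with variance 2, driven means 6 with variance 6 (driven variance greater than spontaneous), prior 0.1, and a spontaneous covariance of 1.6, close to the spontaneous variance. None of these values are taken from the authors' Figure 3 parameter set.

```python
import math

def gauss2(m, mu, var, cov):
    """Bivariate gaussian density with equal variance in both channels."""
    dv, dx = m[0] - mu[0], m[1] - mu[1]
    det = var * var - cov * cov
    q = (var * dv * dv - 2.0 * cov * dv * dx + var * dx * dx) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def posterior(m, cov0, cov1, prior1=0.1):
    """P(T = 1 | m) for a two-channel model with hypothetical parameters:
    spontaneous mean 2, variance 2; driven mean 6, variance 6."""
    l0 = gauss2(m, (2.0, 2.0), 2.0, cov0)  # spontaneous likelihood
    l1 = gauss2(m, (6.0, 6.0), 6.0, cov1)  # driven likelihood
    return prior1 * l1 / (prior1 * l1 + (1.0 - prior1) * l0)

both, single = (5.8, 5.8), (5.8, 2.0)

# Without covariance, the both-driven posterior dominates (enhancement only);
# with spontaneous covariance 1.6 (close to the variance 2), the ordering
# reverses, reproducing modality-specific suppression.
p_both_ind, p_single_ind = posterior(both, 0.0, 0.0), posterior(single, 0.0, 0.0)
p_both_cov, p_single_cov = posterior(both, 1.6, 3.6), posterior(single, 1.6, 3.6)
```

With these numbers, the zero-covariance posteriors order as both-driven > single-driven, and introducing the near-variance spontaneous covariance flips the ordering, exactly the qualitative effect argued in the text.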
The heavy vertical line in Figure 6D connects a both-driven point having posterior probability 0.27 with a spontaneous-driven point having posterior probability near zero, producing sizable enhancement only because of the small magnitude of the latter value. In summary, suppression can occur between two channels when driven variance is larger than spontaneous variance and spontaneous covariance approaches the value of spontaneous variance (with or without driven covariance). These conditions are very likely to be met by channels of the same modality that provide input to neurons in the DSC and elsewhere in the brain (see above). It is very likely that driven variance will be higher than spontaneous variance, because driven variance reflects the huge variability in stimulus effectiveness, in addition to the stochastic properties of the nervous system that determine spontaneous variance. It is also very likely that input channels of the same modality will covary under both spontaneous and driven conditions, because they may share some of the same input axons. The seemingly paradoxical phenomenon of MSS arises naturally from the Bayes' rule model of CME, given these neurobiologically plausible conditions of input channel variance and covariance. The three-channel neural
implementation presented in Figure 3 illustrates an example in which CME and MSS can arise in the same model neuron. The implementation can be extended to any number of cross-modal and modality-specific channels. A two-channel, single-modality model was also investigated, and suppression was found to occur under similar parametric circumstances.

5 Discussion

CME is often referred to as supra-additive or multiplicative, since a neuron's response to multimodal stimuli is often greater than the sum of the responses to the single-modality stimuli presented alone (Meredith & Stein, 1986b; Stein et al., 1988; Meredith et al., 1992; Stein & Meredith, 1993). It has been suggested that CME requires a special nonlinear amplifier, such as an ion channel, that responds to multimodal but not unimodal input (Stein & Meredith, 1993; Binns & Salt, 1996; Binns, 1999). The implementations we have examined here suggest a different view of cross-modal enhancement. We have previously argued that CME arises as a consequence of DSC neurons computing the posterior probability of a target using Bayes' rule (Anastasio et al., 2000). Here we use neural implementations of the Bayes' rule model to address issues of neurobiological mechanism. A single perceptron is capable of computing the target posterior when the inputs are conditionally independent and Poisson distributed, or gaussian distributed without the assumption of conditional independence, provided the variances and covariances for the spontaneous likelihood distribution are equal to those for the driven likelihood distribution. When this equality is not satisfied, an augmented perceptron, which allows inputs to be multiplied together, is needed to compute the posterior probability of a target. In either case, CME itself is produced by weighted summation in a nonlinear model neuron. Multiplication of inputs is not needed to produce CME. Instead, it serves a function not envisioned by the multiplicative amplifier hypothesis.
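This division of labor can be made concrete with a small sketch: a plain perceptron passes a bias plus weighted sum through a logistic nonlinearity, while the augmented perceptron adds pi nodes that multiply pairs of inputs before the sum. The weight values below are purely illustrative; in the paper the weights and bias are derived from the likelihood parameters (equation 3.9), which are not reproduced here.

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

def perceptron(m, w, b):
    """Plain perceptron: bias plus weighted sum of the channel inputs."""
    return logistic(b + sum(wi * mi for wi, mi in zip(w, m)))

def augmented_perceptron(m, w, rho, b):
    """Augmented perceptron: adds pi nodes that multiply pairs of inputs.
    rho maps channel-index pairs (i, j) to pi-node weights."""
    net = b + sum(wi * mi for wi, mi in zip(w, m))
    net += sum(r * m[i] * m[j] for (i, j), r in rho.items())
    return logistic(net)

# Hypothetical weights for a (V, X, A) unit; the negative rho[(0, 1)] plays
# the role of the negative-weight VX pi node associated with suppression.
w, b = (1.0, 1.0, 0.5), -8.0
rho = {(0, 1): -0.15}

m_both = (6.0, 6.0, 2.0)  # V and X both driven, A at its spontaneous level
p_plain = perceptron(m_both, w, b)
p_aug = augmented_perceptron(m_both, w, rho, b)
```

With both V and X driven, the negative product term pulls the augmented unit's output below that of the plain perceptron, which is the mechanism by which a negative-weight pi node can turn summation-driven enhancement into suppression.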
It serves to compensate for differences between spontaneous and driven variance and covariance, allowing computation of the posterior target probability when these parameters are unequal. It has been suggested that NMDA-sensitive glutamate conductances present in DSC neurons might provide a multiplicative amplifier for CME (Stein & Meredith, 1993; Binns & Salt, 1996; Binns, 1999). Binns and Salt (1996) tested this hypothesis by applying the NMDA receptor antagonist AP5 to multisensory DSC neurons in the cat during cross-modal and modality-specific stimulation. The results were equivocal with respect to the amplifier hypothesis, since AP5 did not selectively eliminate CME. Instead, application of AP5 resulted in a reduction of both modality-specific and cross-modal responses, with the percentage reduction being greater for the latter. It is possible to reconcile these findings with the multiplicative hypothesis by proposing, for example, that NMDA receptors involved in cross-modal responses have different channel kinetics from those involved in modality-specific responses (Binns, 1999). Our model, however, can account for the findings without any such special assumptions. The function of the pi nodes in the augmented perceptron model might be implemented by dendritic voltage-sensitive conductances in DSC neurons (Koch & Poggio, 1992; Mel, 1993). We propose that rather than serving as multiplicative amplifiers for CME, NMDA receptors in DSC neurons might perform the function of these pi nodes. Simulations in which pi nodes were removed from the model, over a suitable range of parameters, yielded results similar to those observed in the NMDA channel blockade studies. Removing the pi nodes from the augmented-perceptron model similarly produced a reduction in response to both modality-specific and cross-modal stimuli, with the percentage reduction being greater for the latter (see Figure 2C). Our model provides a straightforward explanation for the NMDA channel blockade findings, whereas the amplifier model of CME does not. In addition to multimodal interactions, Stein and Meredith (1993) have examined interactions between two stimuli of the same modality. Given the assumptions that driven variances are greater than the spontaneous variances and that significant spontaneous covariance should occur between modality-specific but not between cross-modal input channels, the augmented-perceptron implementation simulated the absence of enhancement for modality-specific stimuli and the presence of enhancement for cross-modal stimuli. Furthermore, our model proved capable of simulating the counterintuitive finding of modality-specific suppression. For suppression to occur in the model, the spontaneous covariance between the V and X channels must be less than, but close to, the spontaneous variances. When these conditions are met, the VX pi node invariably has a negative weight (ρ_vx < 0).
Neurobiologically, a pi node with a negative weight might correspond, for example, to an inhibitory interneuron with a nonlinear conductance, perhaps produced by NMDA receptors. Depressive interactions, in which the neural response to a cross-modal stimulus is less than that elicited by a single stimulus, can occur when the two stimuli are sufficiently spatially disparate (Meredith & Stein, 1986a, 1986b, 1996; Kadunce, Vaughan, Wallace, Benedek, & Stein, 1997; Kadunce et al., 2001). Pairs of modality-specific stimuli can also produce such effects when one of the two is outside the excitatory receptive field (Kadunce et al., 1997). This is distinct from the modality-specific suppression discussed above, in which both stimuli are presented within the excitatory receptive field. Depression due to spatial disparity appears to result from competitive interactions involved in target selection (Munoz & Istvan, 1998; Findlay & Walker, 1999) rather than target detection, which is the focus of this study. Our results do not preclude competitive interaction as a possible explanation for modality-specific suppression. They do demonstrate that modality-specific suppression can be simulated using the Bayes' rule model of CME and neural implementations thereof and is therefore a plausible consequence of target detection processes.
CME in the DSC apparently depends on cortical-descending input. The DSC of the cat receives descending cortical input from certain areas of extraprimary sensory cortex, including the anterior ectosylvian cortex (AES) and the lateral suprasylvian sulcus (Berson & McIlwain, 1983; Harting, Updyke, & Van Lieshout, 1992; Wallace et al., 1993). Inactivation of these areas, and of AES in particular, has been shown to selectively abolish CME, while preserving multisensory responsiveness in DSC neurons (Wallace & Stein, 1994; Jiang, Wallace, Jiang, Vaughan, & Stein, 2001). The model suggests that the cortical input could produce CME simply by adding a component to the DSC response to subcortical input, but it offers no insight into how the cortical and subcortical contributions may be partitioned. We are exploring a new model that subsumes the implementations proposed here within a broader model of the development of CME, using intrinsic Hebbian mechanisms and both ascending and cortical-descending inputs to the DSC. For the case of conditionally independent, Poisson-distributed inputs, the bias b is negative whenever P(T = 1) < P(T = 0) and λ_i0 < λ_i1. For the more general case, the value of b is influenced by all of the other parameters in the model (see equation 3.9). A number of restricted cases were examined, and it was found that b was negative for all but a small portion of parameter space (near asymptotes) whenever a set of plausible parameter requirements was maintained. The inhibitory bias might correspond to some biophysical mechanism intrinsic to DSC neurons or to an extrinsic source of tonic inhibition. The DSC contains inhibitory interneurons and receives extrinsic inhibitory input from the substantia nigra, zona incerta, contralateral colliculus, and a variety of brainstem structures (Karabelas & Moschovakis, 1985; Ficalora & Mize, 1989; Appell & Behan, 1990; Mize, 1992; May, Sun, & Hall, 1997).
Both substantia nigra and zona incerta cells are tonically active (Hikosaka & Wurtz, 1983; Ma, 1996). Some substantia nigra cells respond to visual or auditory stimulation with a decrease or cessation of firing. Because the bias contains a term related to the target prior probability, it is possible that the sensory responses of nigral cells could modulate CME in the DSC according to changes in the prior expectation of targets.

Acknowledgments

We thank Jeffrey Bilmes, Ehtibar Dzhafarov, Jesse Reichler, and Liudmila Yafremava for helpful discussions and comments on the manuscript prior to submission. We also thank anonymous reviewers of previous versions of this article for useful comments. This work was supported by NSF grants IBN 92-21823 and IBN 00-80789, ONR grant N00014-01-0249, and a grant from the Critical Research Initiatives of the State of Illinois, all to T. J. A. The code used to generate the results presented here will be made available at our web site: http://csn.beckman.uiuc.edu/.
References

Anastasio, T. J., Patton, P. E., & Belkacem-Boussaid, K. (2000). Using Bayes' rule to model multisensory enhancement in the superior colliculus. Neural Computation, 12, 997–1019.
Appell, P. P., & Behan, M. (1990). Sources of subcortical GABAergic projections to the superior colliculus in the cat. J. Comp. Neurol., 302(1), 143–158.
Berson, D. M., & McIlwain, J. T. (1983). Visual cortical inputs to the deep layers of cat's superior colliculus. J. Neurophysiol., 50, 1143–1155.
Binns, K. E. (1999). The synaptic pharmacology underlying sensory processing in the superior colliculus. Prog. Neurobiol., 59(2), 129–159.
Binns, K. E., & Salt, T. E. (1996). Importance of NMDA receptors for multimodal integration in the deep layers of the cat superior colliculus. J. Neurophysiol., 75(2), 920–930.
Clemo, H. R., & Stein, B. E. (1991). Receptive field properties of somatosensory neurons in the cat superior colliculus. J. Comp. Neurol., 314(3), 534–544.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.
Durbin, R., & Rumelhart, D. E. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1, 133–142.
Ficalora, A. S., & Mize, R. R. (1989). The neurons of the substantia nigra and zona incerta which project to the cat superior colliculus are GABA immunoreactive: A double label study using GABA immunocytochemistry and lectin retrograde transport. Neurosci., 29, 567–581.
Findlay, J. M., & Walker, R. (1999). A model of saccade generation based on parallel processing and competitive inhibition. Behav. Brain Sci., 22, 661–721.
Fristedt, B., & Gray, L. (1997). A modern approach to probability theory. Boston: Birkhäuser.
Gabbiani, F., & Koch, C. (1998). Principles of spike train analysis. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed., pp. 312–360). Cambridge, MA: MIT Press.
Harting, J. K., Updyke, B. V., & Van Lieshout, D. P. (1992). Corticotectal projections in the cat: Anterograde transport studies of twenty-five cortical areas. J. Comp. Neurol., 324(3), 379–414.
Hikosaka, O., & Wurtz, R. H. (1983). Visual and oculomotor functions of the monkey substantia nigra pars reticulata. I. Relation of visual and auditory responses to saccades. J. Neurophysiol., 49, 1254–1267.
Jiang, H., Lepore, F., Ptito, M., & Guillemot, J.-P. (1994). Sensory interactions in the anterior ectosylvian cortex of cats. Exp. Brain Res., 101, 385–396.
Jiang, W., Wallace, M. T., Jiang, H., Vaughan, J. W., & Stein, B. E. (2001). Two cortical areas mediate multisensory integration in superior colliculus neurons. J. Neurophysiol., 85, 506–522.
Kadunce, D. C., Vaughan, J. W., Wallace, M. T., Benedek, G., & Stein, B. E. (1997). Mechanisms of within- and cross-modality suppression in the superior colliculus. J. Neurophysiol., 78(6), 2834–2847.
Kadunce, D. C., Vaughan, J. W., Wallace, M. T., & Stein, B. E. (2001). The influence of visual and auditory receptive field organization on multisensory integration in the superior colliculus. Exp. Brain Res., 139, 303–310.
Karabelas, A. B., & Moschovakis, A. K. (1985). Nigral inhibitory termination on efferent neurons of the superior colliculus: An intracellular horseradish peroxidase study in the cat. J. Comp. Neurol., 239(3), 309–329.
Koch, C., & Poggio, T. (1992). Multiplying with synapses and neurons. In T. McKenna, J. L. Davis, & S. F. Zornetzer (Eds.), Single neuron computation (pp. 315–345). San Diego, CA: Academic Press.
Ma, T. P. (1996). Saccade-related omnivectorial pause neurons in the primate zona incerta. NeuroReport, 7, 2713–2716.
May, P. J., Sun, W., & Hall, W. C. (1997). Reciprocal connections between the zona incerta and the pretectum and superior colliculus of the cat. Neurosci., 77(4), 1091–1114.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiol., 70(3), 1086–1101.
Meredith, M. A., & Stein, B. E. (1986a). Spatial factors determine the activity of multisensory neurons in cat superior colliculus. Brain Res., 365(2), 350–354.
Meredith, M. A., & Stein, B. E. (1986b). Visual, auditory, and somatosensory convergence on cells in superior colliculus results in multisensory integration. J. Neurophysiol., 56(3), 640–662.
Meredith, M. A., & Stein, B. E. (1990). The visuotopic component of the multisensory map in the deep laminae of the cat superior colliculus. J. Neurosci., 10(11), 3727–3742.
Meredith, M. A., & Stein, B. E. (1996). Spatial determinants of multisensory integration in cat superior colliculus neurons. J. Neurophysiol., 75(5), 1843–1857.
Meredith, M. A., Wallace, M. T., & Stein, B. E. (1992). Visual, auditory and somatosensory convergence in output neurons of the cat superior colliculus: Multisensory properties of the tecto-reticulo-spinal projection. Exp. Brain Res., 88(1), 181–186.
Middlebrooks, J. C., & Knudsen, E. I. (1984). A neural code for auditory space in the cat's superior colliculus. J. Neurosci., 4(10), 2621–2634.
Mize, R. R. (1992). The organization of GABAergic neurons in the mammalian superior colliculus. Prog. Brain Res., 90, 219–248.
Munoz, D. P., & Istvan, P. J. (1998). Lateral inhibitory interactions in the intermediate layers of the monkey superior colliculus. J. Neurophysiol., 79(3), 1193–1209.
Olson, C. R., & Graybiel, A. M. (1987). Ectosylvian visual area of the cat: Location, retinotopic organization, and connections. J. Comp. Neurol., 261, 277–294.
Peck, C. K. (1990). Neuronal activity related to head and eye movements in cat superior colliculus. J. Physiol. (Lond.), 421, 79–104.
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. Journal of Neuroscience, 19(5), 1736–1753.
Richard, M. D., & Lippmann, R. P. (1991). Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Computation, 3, 461–483.
Rumelhart, D. E., Hinton, G. E., & McClelland, J. L. (1986). A general framework for parallel distributed processing. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1, pp. 45–76). Cambridge, MA: MIT Press.
Sato, T. (1989). Interactions of visual stimuli in the receptive fields of inferior temporal neurons in awake macaques. Exp. Brain Res., 77, 23–30.
Sato, T. (1995). Interactions between two different visual stimuli in the receptive fields of inferior temporal neurons in macaques during matching behaviors. Exp. Brain Res., 105, 209–219.
Sparks, D. L., & Hartwich-Young, R. (1989). The deep layers of the superior colliculus. In R. H. Wurtz & M. Goldberg (Eds.), The neurobiology of saccadic eye movements (Vol. 3, pp. 213–255). Amsterdam: Elsevier.
Stein, B. E., Huneycutt, W. S., & Meredith, M. A. (1988). Neurons and behavior: The same rules of multisensory integration apply. Brain Res., 448(2), 355–358.
Stein, B. E., & Meredith, M. A. (1993). The merging of the senses. Cambridge, MA: MIT Press.
Stein, B. E., Meredith, M. A., Huneycutt, W. S., & McDade, L. (1989). Behavioral indices of multisensory integration: Orientation to visual cues is affected by auditory stimuli. J. Cog. Neurosci., 1(1), 12–24.
Stein, B. E., & Wallace, M. T. (1996). Comparisons of cross-modality integration in midbrain and cortex. Prog. Brain Res., 112, 289–299.
Wallace, M. T., Meredith, M. A., & Stein, B. E. (1992). Integration of multiple sensory modalities in cat cortex. Exp. Brain Res., 91(3), 484–488.
Wallace, M. T., Meredith, M. A., & Stein, B. E. (1993). Converging influences from visual, auditory, and somatosensory cortices onto output neurons of the superior colliculus. J. Neurophysiol., 69(6), 1797–1809.
Wallace, M. T., Meredith, M. A., & Stein, B. E. (1998). Multisensory integration in the superior colliculus of the alert cat. J. Neurophysiol., 20(2), 1006–1010.
Wallace, M. T., & Stein, B. E. (1994). Cross-modal synthesis in the midbrain depends on input from cortex. J. Neurophysiol., 71(1), 429–432.
Wallace, M. T., Wilkinson, L. K., & Stein, B. E. (1996). Representation and integration of multiple sensory inputs in primate superior colliculus. J. Neurophysiol., 76(2), 1246–1266.
Wurtz, R. H., & Goldberg, M. E. (1972). Activity of superior colliculus in behaving monkey. III. Cells discharging before eye movements. J. Neurophysiol., 35(4), 575–586.

Received May 24, 2000; accepted September 11, 2002.
LETTER
Communicated by Wulfram Gerstner
Event-Driven Simulation of Spiking Neurons with Stochastic Dynamics
Jan Reutimann
[email protected] Michele Giugliano
[email protected] Stefano Fusi
[email protected] Computational Neuroscience, Institute of Physiology, University of Bern, 3012 Bern, Switzerland
We present a new technique, based on a proposed event-based strategy (Mattia & Del Giudice, 2000), for efficiently simulating large networks of simple model neurons. The strategy was based on the fact that interactions among neurons occur by means of events that are well localized in time (the action potentials) and relatively rare. In the interval between two of these events, the state variables associated with a model neuron or a synapse evolved deterministically and in a predictable way. Here, we extend the event-driven simulation strategy to the case in which the dynamics of the state variables in the inter-event intervals are stochastic. This extension captures both the situation in which the simulated neurons are inherently noisy and the case in which they are embedded in a very large network and receive a huge number of random synaptic inputs. We show how to effectively include the impact of large background populations into neuronal dynamics by means of the numerical evaluation of the statistical properties of single-model neurons under random current injection. The new simulation strategy allows the study of networks of interacting neurons with an arbitrary number of external afferents and inherent stochastic dynamics.

Neural Computation 15, 811–830 (2003)

1 Introduction

Most of the observed in vivo cortical phenomena are believed to be the expression of the collective dynamics of large populations of interacting neurons. The richness and the complexity of these phenomena call for powerful new techniques for investigating in a systematic way the emergent behavior of large networks of neurons. In this respect, computer simulations are becoming an increasingly important tool to model realistic cortical modules (10^4–10^5 neurons and 10^7–10^9 synapses), to relate single-neuron and synapse properties to the collective dynamics of large networks, and to
© 2003 Massachusetts Institute of Technology
check the validity of the numerous assumptions underlying the theories of neural networks as dynamical systems. Traditional fixed time-step (synchronous) computer simulations are based on numerical methods for solving a set of coupled differential equations that describe the dynamics of the basic elements of the networks (the neurons and the synapses). The time axis is uniformly discretized, and each state variable is updated at every time step. Under such conditions, the algorithmic complexity usually scales with the number of elements contained in the network, multiplied by the number of time steps into which the simulated interval is divided. Such an approach also sets a cutoff on the temporal resolution of the simulation (the minimal time step), which is imposed on all the state variables, no matter whether they are characterized by completely different timescales. This represents an obvious drawback when slow and fast dynamic variables coexist, as all of them are updated at the very same pace, determined by the fastest variable. When the number of neurons and synapses approaches realistic numbers for a cortical column (~10^5), the traditional techniques become impracticable, and new alternative simulation strategies should be devised. One possible solution takes natural inspiration from the way neurons exchange information and interact: most of the dynamical variables are coupled for relatively short time intervals only, during the emission and the transmission of action potentials. The emission of a spike is a relatively rare event (typical cortical frequencies being of the order of several Hz) and well localized in time (i.e., the width of a spike is of the order of 1 ms) with respect to the overall simulation times.
Interestingly, it can be observed that within the interval between two consecutive relevant events (e.g., spikes, as in our case, although our approach can be generalized to other classes of events), the neurons and the synapses are essentially isolated, and each state variable is usually described by a deterministic, uncoupled differential equation. Therefore, given the initial state at the beginning of the interval and the length of the interval, the final state can be determined without iteratively updating the state variable at every time step. This approach is usually known as event driven; it constitutes the fundamental principle of modern operating systems and has already been applied successfully to large-scale simulations of networks of spiking neurons (Watts, 1994; Delorme, Gautrais, van Rullen, & Thorpe, 1999; Giugliano, 2000; Mattia & Del Giudice, 2000). The CPU time of these simulations depends mostly on the frequency of occurrence of relevant events, not on the time simulated. If the relevant events are rare, as spikes are in the cortex, then the event-driven approach reduces the computational load dramatically. One drawback of this strategy manifests itself when the neuronal dynamics does not evolve deterministically between the emission of two successive spikes, for example, when the network to be studied is embedded in a very large population of neurons that provide an external, noisy background activity. This spike activity is usually meant to emulate the sensory
Event-Driven Simulation of Spiking Neurons
inputs or, in general, the inputs coming from other areas of the brain (Amit & Brunel, 1997a; Wang, 1999). This interaction, at least to a first approximation, is modeled as a one-way connection (feedforward input), in the sense that the local simulated activity of the network of neurons does not affect the background activity that represents the external input. Each simulated (internal) neuron receives synaptic inputs from a group of M_ext external neurons. The external population is assumed to be large, and the overlap among the external groups corresponding to different internal neurons is usually small. Hence, the total synaptic currents generated by the external neurons and injected into different internal neurons can be considered statistically independent. These inputs provide the external stimulation and allow the network to sustain the spontaneous activity observed in cortical recordings (Amit & Brunel, 1997a). In the event-based simulation approach of Mattia and Del Giudice (2000), such external activity was implemented by generating, for each of the N neurons in the simulated network, a large number of additional random presynaptic spike events, fired on average at a frequency ν_ext by each of the M_ext neurons of the external background populations. This gives rise to an additional algorithmic complexity that can be quantified as O(N · ν_ext · M_ext). As a consequence, the increased realism introduced into the simulated neuronal dynamics by appropriately increasing the size of the individual background populations has to be paid for with a considerable portion of the CPU load, substantially slowing the event-driven approach. Moreover, as the size of the external population increases, the external spikes can very quickly saturate the list of events that have to be handled by the simulator, and the inter-event interval can be as short as a few nanoseconds (it scales as O(1/(N · ν_ext · M_ext))).
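To make the scaling concrete, consider a back-of-the-envelope estimate (the network sizes and rates below are hypothetical, chosen only to illustrate the orders of magnitude):

```python
# Event load when every background spike is an explicit event
# (all numbers below are illustrative, not taken from the article).
N = 10_000       # simulated (internal) neurons
M_ext = 10_000   # external neurons per internal neuron
nu_ext = 5.0     # mean external firing rate [Hz]

events_per_second = N * nu_ext * M_ext       # O(N * nu_ext * M_ext)
mean_interevent = 1.0 / events_per_second    # [s], O(1/(N * nu_ext * M_ext))

print(f"{events_per_second:.1e} external events per simulated second")
print(f"mean inter-event interval: {mean_interevent * 1e9:.1f} ns")
```

With these (modest) numbers the simulator would already have to queue half a billion background events per simulated second, spaced nanoseconds apart.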
In particular, such short time intervals can easily produce round-off errors (see Press, Flannery, Teukolsky, & Vetterling, 1986) and require double-precision floating-point variables to encode the times of occurrence of the events. In this article, we propose a possible solution to these problems. If the synaptic background activity is irregular, as in the in vivo phenomena to be modeled, the impact of the external neurons on each simulated neuron of the network can be approximated by a continuous random current injection with appropriate statistical properties, directly related to ν_ext and M_ext (Gerstein & Mandelbrot, 1964; Tuckwell, 1988). Furthermore, by numerically solving the master equation that governs the density distribution of the state variables, it is possible to predict the behavior of each neuron under such a noisy stimulating current during those time intervals in which no synaptic input (relevant event) from the internal simulated neurons occurs. The same approach can be applied to the case in which the state variables are inherently stochastic because of some source of noise related to the single-neuron dynamics. Here we focus on an example in which the dynamics are deterministic and the simulated neurons are embedded in a very large network. If the
J. Reutimann, M. Giugliano, and S. Fusi
(excitatory or inhibitory) postsynaptic potentials evoked by a single presynaptic spike are small compared to the threshold for triggering postsynaptic action potentials (i.e., the diffusion approximation), the density equation can be reduced to a simpler Fokker-Planck equation, which in many cases can be solved analytically (Ricciardi, 1977). These theoretical approaches have already proved to constitute a powerful tool for studying network properties in stationary conditions, such as in studies on spontaneous activity and on the persistent stimulus-selective delay activity observed in the cortex (Amit & Brunel, 1997b; Yakovlev, Fusi, Berman, & Zohary, 1998; Wang, 1999; Fusi & Mattia, 1999; Brunel, 2000), or to study the network response to particular transient or oscillatory inputs (for a review, see Knight, 1972; Nykamp & Tranchina, 2000; Gerstner & Kistler, 2002). By combining these strategies with an event-driven simulation approach, we show in this article how to implement the global background activity, effectively reducing its computational complexity to O(N · ν_ext).

2 The Simulation Algorithm

2.1 The General Idea. In the event-driven approach described in Mattia and Del Giudice (2000), the dynamics of the neurons and the synapses were updated only on the arrival of a spike. Therefore, the action potential was the relevant event that triggered an update of the state variables. The new state was then computed on the basis of two elements: (1) the initial state, following the previous update, and (2) the time passed since the previous update. Because the dynamics of both synapses and neurons were deterministic in this interval, the final state could be determined for an arbitrary time interval by the solution of the differential equations governing the dynamics of the neurons. In our case, the development of the neuronal state variables in the interspike interval is not deterministic.
In particular, the external spikes are not explicitly treated as synaptic events. Instead, we assume that the external activity is highly irregular and that the total synaptic current coming from outside can be described as a stochastic process with a known distribution. In a traditional fixed time-step simulation, at each step the external current would have to be randomly generated, independently for each neuron, and the state variables affected by it would have to be updated. Therefore, the dynamics of the state variables would be nondeterministic, and the final state could not be determined from the initial condition and the time since the previous update. However, given the distribution of the external current, it is possible to determine the distribution of the state variables as a function of time. It is as if each state variable starts from some determined initial value at the preceding update time (when the distribution is a delta function) and then evolves along all the possible trajectories determined by the different realizations of the input currents. This packet of trajectories would evolve until another relevant event occurs. At this
time, the packet should be reduced to a single trajectory, which is the one that has actually been followed by the neuron. The advantage is that it is not necessary to integrate iteratively, and presumably many times, the state variables during the interval between two relevant events. Deciding at the end what the actual trajectory was is completely equivalent to what would be obtained in a fixed time-step approach, provided that the trajectory is chosen according to the correct distribution of the state variables.

2.2 An Example: Networks of Integrate-and-Fire Neurons. The approach for the proposed new event-driven algorithm is quite general and can be applied to a large class of neuronal models and synaptic interactions (see also Mattia & Del Giudice, 2000; Giugliano, 2000). In order to illustrate the algorithm, we will focus on a network of randomly connected integrate-and-fire (IF) model neurons, whose state is fully determined by a single internal variable, the membrane potential V. In section 4, we show how to generalize this approach to more complex models. Formally, the differential equation that governs the dynamics of the membrane voltage below threshold for a generic neuron can be written as follows:

dV/dt = −L(V, t) + I_int(t) + I_noise(t),   V_min ≤ V < V_θ,   (2.1)
where L(V, t) is a generic leakage term. We will show two examples: (1) the case in which L(V, t) = (V − V_rest)/τ_m, where V_rest is the resting potential and τ_m is the passive membrane time constant, and (2) the case in which the leak is constant: L(V, t) = λ. The two models respond in the same way to noisy currents, provided that the dynamics of the neuron with the constant leakage are complemented by the condition that the membrane voltage is limited from below by a rigid barrier at V_rest. Interestingly, the stationary response function of these two model neurons to noisy currents can be fitted to the response of cortical pyramidal cells measured in in vitro experiments (Rauch, La Camera, Lüscher, Senn, & Fusi, 2002). I_int(t) represents the net charging current generated by synaptic inputs from other neurons within the simulated network, and I_noise(t) is the noisy current that in our case represents the external background activity, coming from the feedforward neurons. We assume that each internal neuron receives the external input from a different group of M_ext external neurons. If V reaches a threshold V_θ, the neuron emits a spike, and V is reset to a value V_reset < V_θ, where it remains during an absolute refractory period τ_arp. A rigid (reflecting) barrier V_min was included to restrict the membrane potential to the interval V_min ≤ V(t) < V_θ. Although unusual compared to the IF dynamics described in the literature, such a choice is reasonable, as the physiological range for V in a real neuron is bounded by the ionic concentration ratios across the membrane.
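For reference, the first model (leaky IF with reflecting barrier, driven by a Gaussian white-noise current) can be simulated with the brute-force fixed-time-step scheme that the event-driven algorithm is designed to avoid. A minimal Euler-Maruyama sketch, with default parameter values echoing the second input regime of Figure 3 (μτ_m = 16.5 mV, σ²τ_m = 27 mV²); this is an illustration, not the algorithm proposed in the article:

```python
import random

def simulate_lif_fixed_step(T=2.0, dt=1e-4, tau_m=20e-3, V_rest=-70e-3,
                            V_theta=-50e-3, V_reset=-70e-3, V_min=-80e-3,
                            tau_arp=2e-3, mu=16.5e-3 / 20e-3,
                            sigma2=27e-6 / 20e-3, seed=0):
    """Fixed-time-step (Euler-Maruyama) integration of equation 2.1 with
    L(V, t) = (V - V_rest)/tau_m and a Gaussian white-noise input of
    mean mu and variance sigma2 per unit time.  Units: volts, seconds.
    Every state variable is touched at every step -- exactly the
    per-step updating the event-driven scheme replaces."""
    rng = random.Random(seed)
    V, t, spikes = V_rest, 0.0, []
    refractory_until = -1.0
    while t < T:
        if t >= refractory_until:
            V += dt * ((V_rest - V) / tau_m + mu) \
                 + rng.gauss(0.0, (sigma2 * dt) ** 0.5)
            V = max(V, V_min)          # reflecting barrier at V_min
            if V >= V_theta:           # threshold crossing: emit a spike
                spikes.append(t)
                V = V_reset
                refractory_until = t + tau_arp
        t += dt
    return spikes
```

Note the cost: T/dt update steps per neuron regardless of how many spikes actually occur.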
If the spikes are point processes (i.e., their time width is zero), I_int(t) can be written as a sum of Dirac deltas,

I_int(t) = Σ_k J_k δ(t − t_k),

where the sum extends over all the afferent spikes. The efficacy of each spike is given by J_k. In the interval between two successive synaptic inputs, I_int(t) is 0, and V follows a random walk induced by the noisy input I_noise(t). If the external spikes are statistically independent, the total noisy current impinging on a generic neuron of the network can be well approximated by a Gauss-distributed, delta-correlated stochastic process characterized by a mean μ and a variance σ² per unit time (Amit & Tsodyks, 1991; Gerstein & Mandelbrot, 1964; Tuckwell, 1988). μ and σ² are explicitly related to the mean firing rate ν_ext of the external population, its probability of feedforward connectivity c_ext, its size M_ext, and the average synaptic strength J_ext, according to the following equations:

μ = ν_ext · c_ext · M_ext · J_ext,
σ² = ν_ext · c_ext · M_ext · J_ext².   (2.2)
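Equation 2.2 translates directly into code; a small helper (the numbers in the example are hypothetical, chosen only for illustration):

```python
def background_stats(nu_ext, c_ext, M_ext, J_ext):
    """Mean and variance per unit time of the diffusion approximation
    to the external background current (equation 2.2)."""
    mu = nu_ext * c_ext * M_ext * J_ext
    sigma2 = nu_ext * c_ext * M_ext * J_ext ** 2
    return mu, sigma2

# Hypothetical example: 10,000 external neurons firing at 5 Hz, full
# feedforward connectivity, 0.1 mV synaptic efficacy.
mu, sigma2 = background_stats(nu_ext=5.0, c_ext=1.0, M_ext=10_000, J_ext=0.1)
# -> mu = 5000 mV/s, sigma2 ≈ 500 mV²/s (up to float rounding)
```

Note that M_ext enters the simulation only through these two numbers, which is the source of the speed-up quantified in section 3.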
As a consequence, the temporal evolution of the membrane potential V in the intervals between two consecutive internal spikes can be regarded as a stochastic process whose statistical properties are completely described by P_V(V, t), representing the probability density of having V(t) = V (Ricciardi, 1977; Tuckwell, 1988). Given that the neuron starts from V(t_0) = V_0 at time t_0 (the time of the last relevant event), P_V(V, t) describes the packet of depolarization trajectories generated by the external input when the neuron had a particular initial value V_0. Before the arrival of the next spike emitted by one of the network neurons, it might happen that the external synaptic current drives the voltage across the threshold V_θ. These crossings occur at any time t at a rate P_fpt(t_0, V_0; t), which expresses the probability density of having a first passage time at t, given the initial state V = V_0 at time t_0. These probabilities can be computed numerically by solving the diffusion equations that govern P_V(V, t) (see appendix A). A solution is shown in Figure 1, where the temporal evolution of the probability density of V and the corresponding interspike interval distribution are plotted. Note that there is an overall decrease of the distribution in time, because the fraction of neurons that cross the firing threshold is eliminated (we are interested in the first passage time only). As is apparent from inspection of the interspike interval distribution, there is at any time a nonzero probability of crossing the firing threshold at V_θ = −50 mV.

2.3 The Extended Event-Driven Algorithm. Unlike in the fully deterministic case, where the state of a neuron is influenced only by events having taken place before the current time, we now have to consider the possibility of having uncertain events: predictions about the neuron's future that may or may not happen. By drawing a random number according to the
Figure 1: Numerical solution of the Fokker-Planck equation for the linear IF neuron (see appendix A): the temporal evolution of the probability density of V (surface plot) and the interspike interval distribution (thick line) are shown. Neuron parameters: τ_m = 20 ms, V_rest = −70 mV, V_θ = −50 mV, V_min = −80 mV, τ_arp = 2 ms, V_0 = −70 mV, μτ_m = 14.5 mV, and σ²τ_m = 10.2 mV².
probability distribution given by P_fpt, we obtain the predicted time (T_1) of a first threshold crossing. However, if the state of the neuron is influenced in any way before T_1, a new prediction has to be made based on the current state, and the old prediction becomes irrelevant. To describe our algorithm in some detail, we consider a generic neuron, whose initial condition is V(t_0) = V_0 (see Figure 2), embedded in a network. First, we draw a pseudorandom number according to the distribution given by P_fpt(t_0, V_0; T_1), which yields time T_1 as the predicted time of the next threshold crossing according to our neuron model (see Figure 2, A1). Two possible situations might now occur, and they can be outlined as follows:

1. No presynaptic (internal) spike arrives in the time interval [t_0, T_1] from other neurons within the network. At time t = T_1, the neuron emits a spike (as determined in the first step) and enters the absolute
Figure 2: An illustration of the extended algorithm (see section 2.3). Column 1: Initial event in three different situations. Column 2: Corresponding next event. Initial voltage is V(t_0) = V_0 in all cases (A1, B1, C1). (A1) First-passage time T_1 is determined randomly according to P_fpt(t_0, V_0). As no further event occurs, a spike is emitted at time T_1. (A2) After the absolute refractory period τ_arp, the voltage is set to V_reset, and the next first-passage time T_2 is determined according to P_fpt(T_1 + τ_arp, V_reset; T_2). (B1) After T_1 has been determined (as in A1), a synaptic event occurs at time t_1^sp (t_0 < t_1^sp < T_1). The current membrane voltage V_1 = V(t_1^sp) is chosen randomly according to P_V(t_0, V_0; t_1^sp) and subsequently updated by the synaptic contribution ΔV_1. T_1 is now obsolete. (B2) Since the membrane voltage remains below threshold, the next first-passage time T_2 is drawn according to P_fpt(t_1^sp, V_1 + ΔV_1; T_2). (C1) Same as B1, but ΔV_1 is large enough to cause a threshold crossing and emission of a spike. (C2) After the refractory period, the next first-passage time is determined randomly according to P_fpt(t_1^sp + τ_arp, V_reset).
refractory period, after which the neuron continues at the first step using V_0^new = V_reset and t_0^new = T_1 + τ_arp (see Figure 2, A2). The spike emission leads to the creation of further events, which represent the synaptic interactions that will take place in response to this spike after some axonal delay period.

2. A single internal spike arrives at time t_1^sp (t_0 < t_1^sp < T_1). At this point, the membrane voltage V_1 is drawn from P_V(t_0, V_0; V, t_1^sp) and subsequently updated according to the synaptic strength ΔV_1 = J associated with the incoming spike. Again, two scenarios are possible, depending on whether the new membrane potential (V_1 + ΔV_1) is (a) below threshold, in which case the neuron continues at step 1 using V_0^new = V_1 + ΔV_1 and t_0^new = t_1^sp (see Figure 2B), or (b) equal to or above threshold V_θ, in which case the neuron emits a spike and the membrane voltage is reset to V_reset during the absolute refractory period, after which the neuron continues at step 1 using V_0^new = V_reset and t_0^new = t_1^sp + τ_arp (see Figure 2C). As before, the spike emission results in the generation of new events to be processed later in the simulation.

This is repeated for the whole simulation time, or until no more events are to be processed. A critical situation occurs when the statistics of the background activity change, because this also affects the probability distributions. In order to propagate the change to the whole network, all neurons must immediately be visited, and the current state variables must be updated according to P_V, using the old statistics. Then, using P_fpt with the new statistics, spontaneous spike times are determined as usual, and the simulation continues as described above. It is clear from this procedure that the statistics of the background activity can change in time only in discrete steps rather than in a continuous way. This is an inherent limitation of the event-driven approach. However, most physiological situations can be adequately described by piecewise constant (discontinuous) changes in the statistical properties of the external background populations. Examples of such changes are given in Figures 3B and 4. For an extension to more complex transients or periodic noisy inputs, see section 4.

2.4 Implementation of the Event List.
As a consequence of the event-based simulation approach, a large number of events have to be processed in chronological order. The strategies for handling the events have to be chosen with care, given that a lack of efficiency could lower the performance of the algorithm considerably. The problem reduces to maintaining a single ordered list with random insertions (when creating new events), repetitive deletion of the lowest element
(when the next event is processed), and occasional random deletions (when an uncertain event becomes irrelevant). An appropriate data structure for the storage of the events is given by the class of balanced trees, which ensure good performance: searching for one element out of E costs O(log(E)) operations. However, maintaining the balance of the tree causes considerable overhead, especially in our situation, where repetitive deletions occur at one place (Wirth, 1986). In our implementation, we used the "skip list" data structure, a probabilistic alternative to balanced trees based on linked lists (Pugh, 1990; see also appendix B). In this structure, balance is maintained as a consequence of the probabilistic nature of the structure, and the insertion and deletion algorithms are as simple as those for linked lists. Therefore, skip lists strongly reduce the overhead of the event handling, while offering the performance benefits of balanced trees.

2.5 Look-up Tables. In the general case, the processing of an event requires a partial differential equation (e.g., equation A.1) to be solved numerically, which is time-consuming. Although this computational load does not directly depend on the assumed size of the external populations (M_ext), and therefore does not affect the run-time performance scaling of our algorithm with respect to M_ext, it nevertheless reduces the overall performance of the simulation dramatically. However, we can exploit the fact that the parameters governing the evolution of the probability distributions can change only in a discontinuous manner, and prepare tables of all needed distributions off-line. These tables can be stored and used during the simulation, reducing this computationally delicate task to simple table look-ups.
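With the densities tabulated, drawing a spike time or a voltage reduces to inverse-transform sampling over a discrete grid; a minimal sketch (the grid and probabilities below are placeholders for the numerically solved densities, not values from the article):

```python
import bisect
import random

class TabulatedDistribution:
    """Inverse-transform sampling from a tabulated density such as
    P_fpt or P_V.  In the real algorithm the probabilities come from
    off-line solutions of the Fokker-Planck equation; here they are
    placeholders."""
    def __init__(self, grid, probabilities):
        total = sum(probabilities)
        self.grid = grid
        self.cdf = []
        acc = 0.0
        for p in probabilities:
            acc += p / total       # cumulative, normalized to 1
            self.cdf.append(acc)

    def sample(self, rng=random):
        # Find the first grid point whose cumulative mass exceeds a
        # uniform draw; clamp the index to guard against float round-off.
        i = bisect.bisect_left(self.cdf, rng.random())
        return self.grid[min(i, len(self.grid) - 1)]

# e.g. a first-passage-time table on a 0.5 ms grid (placeholder values)
fpt = TabulatedDistribution(grid=[0.5, 1.0, 1.5, 2.0],
                            probabilities=[0.1, 0.4, 0.3, 0.2])
T1 = fpt.sample()   # predicted next threshold-crossing time [ms]
```

Each table look-up is thus O(log of the table size), independent of M_ext.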
For a specific example, refer again to Figure 1, where such an off-line evaluated solution has been graphically represented: the time-varying probability density distribution P_V and the corresponding interspike interval distribution density P_fpt are plotted for a particular V_0 = −70 mV. The density distribution represented by the surface starts as a Dirac delta located at V = V_0, and as time goes by with no internal event occurring, it drifts toward the asymptotic value (μτ_m = 14.5 mV) and diffuses at a rate σ (with σ²τ_m = 10.2 mV²). Accordingly, the first-passage-time probability density distribution P_fpt first increases and then slowly decreases to zero for large time intervals. Of course, different initial conditions V_0 yield different profiles of P_V and P_fpt.

3 Performance Evaluation

The proposed algorithm provides fast simulations of large networks. In order to check the simulation results for correctness, the emergent behaviors were compared to the mean-field theory predictions.
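The analytical benchmark used in such comparisons is the standard stationary transfer function of the leaky IF neuron under the diffusion approximation (Ricciardi, 1977; Amit & Brunel, 1997a). A hedged sketch of it, for reference; note that it ignores the reflecting barrier at V_min, which section 3.1 cites as one source of the small discrepancies:

```python
import math

def stationary_rate(mu_tau, sigma2_tau, tau_m=20e-3, tau_arp=2e-3,
                    V_rest=-70.0, V_theta=-50.0, V_reset=-70.0, n=20_000):
    """Stationary firing rate [Hz] of a leaky IF neuron driven by
    Gaussian white noise: midpoint quadrature of the first-passage-time
    formula 1/nu = tau_arp + tau_m*sqrt(pi) * Int e^{u^2}(1 + erf(u)) du,
    integrated from (V_reset - V_rest - mu*tau_m)/sqrt(sigma^2*tau_m)
    to (V_theta - V_rest - mu*tau_m)/sqrt(sigma^2*tau_m).
    mu_tau [mV], sigma2_tau [mV^2], voltages [mV], times [s].
    The reflecting barrier at V_min is ignored here."""
    s = math.sqrt(sigma2_tau)
    lo = (V_reset - V_rest - mu_tau) / s
    hi = (V_theta - V_rest - mu_tau) / s
    du = (hi - lo) / n
    integral = sum(math.exp(u * u) * (1.0 + math.erf(u)) * du
                   for u in (lo + (i + 0.5) * du for i in range(n)))
    return 1.0 / (tau_arp + tau_m * math.sqrt(math.pi) * integral)
```

This is the single-neuron transfer function only; the network prediction quoted below additionally folds the recurrent input into μ and σ² self-consistently.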
3.1 Numerical Accuracy. The tabular discretizations introduce numerical approximations that influence the accuracy of the simulations. The fact that a higher table resolution produces more accurate results leads to a trade-off between accuracy and table resolution (which directly affects memory consumption). This is illustrated in Figure 3: the different lines in Figure 3B correspond to different table resolutions. The mean frequency response to noisy external inputs with stationary statistics is compared to the one predicted analytically by the mean-field theory (see Amit & Brunel, 1997a). The deviations from the predicted frequency in Figure 3B (23.6 Hz in the high-noise regime) are plotted in Figure 3C as a function of table resolution. Of the resolutions tested in Figure 3C, the best results were obtained with 0.1 mV and 0.5 ms in the voltage and time domains, respectively, and the number of elements contained in the corresponding table was of the order of 10^6. Figure 3D demonstrates that the accuracy is very sensitive to the discretization of the voltage and less sensitive to the time resolution. Minor discrepancies in the global firing rate were expected, since an additional reflecting lower barrier for the depolarization of the single neuron was set at V = V_min, slightly altering the IF dynamics compared to those employed in the theory (see equation A.1). The achieved accuracy was nevertheless satisfactory, and the typical transient overshoots in the collective mean firing rate, expected after a sudden increase in the statistics of the background activity, were captured. We also compared directly the simulations and the theoretical prediction of the transient response of a population of coupled neurons to a stepwise change in the statistics of the external current (see Figure 4).
In this particular case, we used the IF neuron with a constant leakage, because the analytical solution of the network dynamics during transients is simpler (Mattia & Del Giudice, 2002). Note that changing the neuron model affects only the generation of the look-up tables, as all the information about the neural dynamics is contained within these tables. Figure 4 shows simulations of a network of 5000 interacting inhibitory neurons. At time t = 0 (2 seconds after the start of the simulation), the statistics of the external current are changed from (μ = 125 mV/s, σ = 5.15 mV/s) to (μ = 700 mV/s, σ = 12.20 mV/s). The transient response of the simulation follows very closely the theoretical prediction reported in Mattia and Del Giudice (2002). In order to maximize accuracy and minimize memory usage, nonuniform discretizations and efficient interpolation methods can be used. In particular, voltages near the threshold and short time intervals should be represented with a refined discretization. In this last example, we obtained the exact steady-state values of the population frequency, which reflects the fact that we used nonuniform, high-resolution discretizations of the underlying tables (minimal resolution of 0.1 mV at the reflecting barrier V_min, and maximal resolution of 0.01 mV near the threshold V_θ).
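A nonuniform discretization of this kind is straightforward to set up; a sketch (the grid-construction details and the width of the refined band are our own illustrative choices, not taken from the article):

```python
import bisect

def voltage_grid(V_min=-80.0, V_theta=-50.0, coarse=0.1, fine=0.01,
                 fine_band=1.0):
    """Voltage grid [mV] with `coarse` spacing over most of the range
    and `fine` spacing within `fine_band` mV of the threshold, echoing
    the resolutions quoted in the text (0.1 mV -> 0.01 mV)."""
    grid, v = [], V_min
    while v < V_theta - fine_band:
        grid.append(v)
        v += coarse
    while v <= V_theta:
        grid.append(v)
        v += fine
    return grid

def interp(grid, values, x):
    """Piecewise-linear interpolation of a function tabulated on a
    sorted (possibly nonuniform) grid."""
    i = min(max(bisect.bisect_right(grid, x), 1), len(grid) - 1)
    x0, x1 = grid[i - 1], grid[i]
    w = (x - x0) / (x1 - x0)
    return (1.0 - w) * values[i - 1] + w * values[i]
```

Refining only near V_θ keeps the accuracy where first-passage probabilities are most sensitive while adding far fewer table entries than a uniformly fine grid.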
Figure 3: Simulation of a population of IF neurons for two different regimes of the external input (μ_1 τ_m = 14.5 mV, σ_1² τ_m = 10.2 mV², μ_2 τ_m = 16.5 mV, σ_2² τ_m = 27 mV²). (A) Raster plot of 30 of 500 neurons and the corresponding instantaneous network firing rate (B, average over 100 runs) compared to the analytical prediction (dotted line) of the mean-field theory (see text; Amit & Tsodyks, 1991; Brunel, 2000; Fusi & Mattia, 1999). The three lines correspond to the three different look-up table resolutions that were used. (C) Estimate of the inaccuracies induced by the discretizations of the look-up tables, measured in the last 200 ms of B. (D) Sensitivity of the inaccuracies to changes in table resolution (v: change in voltage axis [0.4 mV × 0.5 ms, 0.2 mV × 0.5 ms, 0.1 mV × 0.5 ms; from right to left]; t: change in time axis [0.1 mV × 2 ms, 0.1 mV × 1 ms, 0.1 mV × 0.5 ms]).
Figure 4: Transient response to a step change in the statistics of the external current: simulation versus theory. For t < 0, the network is in an asynchronous stationary state with mean emission rate ν = 0.2 Hz. At t = 0, an instantaneous increase of the external current from (μ = 125 mV/s, σ = 5.15 mV/s) to (μ = 700 mV/s, σ = 12.20 mV/s) drives the activity toward a new stable state with ν = 20 Hz. The solid black line is the mean of the activity from 10 simulations of a coupled network (5000 inhibitory LIF neurons; see the text). The thick gray line is the theoretical prediction. (Used with permission from Mattia & Del Giudice, 2002.)
3.2 Run-Time Performance. All the simulations were performed on a Pentium III 850 MHz system running Linux, and they confirmed that excellent run-time performance can be achieved if the computation of the probabilistic dynamics (i.e., the on-line probability table search and look-up, and the consequent random number extractions) is fast enough. Figure 5A reports a benchmark comparing our algorithm with the event-driven approach of Mattia and Del Giudice (2000). Both simulations were performed over a simulated time of 1 second with a network of 1000 IF neurons, characterized by a weak average internal synaptic coupling (J = 0.1 mV) and by a low connectivity probability (c = 0.1). In order to account for differences in the output frequency, the execution times of both simulation approaches were normalized and plotted as a function of the mean number of spikes received by each neuron from the external neurons (background activity, F_ext) divided by the number of spikes received from the other simulated neurons (internal activity, F):

F_ext / F = (M_ext · c_ext · ν_ext) / (N · c · ν),

where c and c_ext are set to 0.1 and 1, respectively, ν_ext is calculated from the statistics of the background populations, and ν is the fraction of simulated neurons that emit a spike per unit time. The ratio M_ext/N was varied systematically (0.1, 1, 10, 100, 1000). The figure demonstrates that the execution time of the extended event-driven approach scales very well (does not increase) with the size of the background populations. Only in the case of very few external spikes per unit time does the approach of Mattia and Del Giudice perform better than ours, which can be explained by the lower time required to process a single event: only one random number has to be generated, compared to at least two in the extended algorithm (one each for P_V and P_fpt, and one for each newly inserted event in the event list). The execution time is also affected by the size N of the network to be simulated, scaling with the square of N (see Figure 5B), and for a given network size, it increases almost linearly with the output frequency, as expected from the analysis of the traditional event-driven simulation paradigm. As demonstrated by the simulation results, the main strength of the proposed approach is therefore that the external populations are not modeled explicitly, resulting in an execution time independent of the number of external neurons (M_ext) and of the ratio of external to internal spikes: O(N · ν_ext). As a consequence, M_ext enters the simulation only as a parameter in the expressions for the mean μ and the variance σ² of the external current (see equation 2.2).

Figure 5: Execution performance of the proposed algorithm. (A) Comparison of the execution times of the state-of-the-art event-driven approach (as described in Mattia & Del Giudice, 2000; circles) and of the extended event-driven approach (crosses), as a function of the ratio between the number of external and internal spikes (see section 3). The execution time is given per simulated second and averaged over five runs of 4 s simulation time each. (B) The execution time scales with the square of the number of simulated neurons (c = 0.05, mean output frequency 4 Hz). (C) Execution time scales almost linearly with output frequency (c = 0.05, J = 0.1 mV).

4 Discussion

The considerable speed-up of simulation times makes our approach suitable for the investigation of emergent collective behaviors in large-scale networks of neurons and extends the perspectives of the event-driven simulation approach.
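The computational saving just summarized can be made concrete with a toy event count (the numbers are illustrative; only the ratio matters):

```python
# Events per simulated second: explicit background spikes versus the
# diffusion approximation (illustrative numbers, not from the article).
N, nu_ext, c_ext, M_ext = 1000, 5.0, 1.0, 100_000

explicit = N * nu_ext * c_ext * M_ext   # O(N * nu_ext * M_ext), explicit background
diffusion = N * nu_ext                  # O(N * nu_ext), proposed algorithm

print(f"speed-up factor from removing background events: {explicit / diffusion:.0f}")
```

The ratio equals c_ext · M_ext, so the gain grows without bound as the background population grows.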
In our tests, a simulation of 1000 neurons with a connectivity probability of 0.1 was running in real time (i.e., 1 s of the simulated network was executed in 1 s, while the mean frequency was 4 Hz), no matter how large the external population was. (The software implementing the described algorithm can be obtained from the authors upon request.) It is important to stress that our approach does not imply any approximation, provided that the number of independent external afferents to be mimicked is large enough and that the solution of the density equation is accurate. Our simulations produce the same results as any traditional fixed-time-step approach in which the external afferents are statistically independent. In this, our strategy differs from
that of Knight (1972), who suggested studying the whole network dynamics by solving the density equations that govern the assembly of neurons. His strategy relies strongly on the assumption that the large number of neurons in the simulated network lose their identity and, hence, that the network state can be fully described by the probability density function of the state variables. In many cases, this represents a good approximation, although finite-size effects should be added artificially to the dynamic equations of the probability density function to account for the finite size of the network under consideration (see Brunel & Hakim, 1999; Brunel, 2000; Mattia & Del Giudice, 2002). Moreover, the weak statistical correlations introduced by the synaptic couplings, which cannot be modeled in a density approach, must be neglected. In our case, all of these elements emerge naturally from the simulation, which thus also constitutes a useful tool to check the hypotheses of the density approach. Our simulation strategy can easily be extended to events that are not well localized in time. For instance, when the synaptic currents generated by the arrival of presynaptic spikes are not delta correlated, the depolarization V cannot be updated instantaneously: a persistent current affects the state variable of the neuron throughout the duration of the synaptic input. If the synaptic current jumps instantaneously upon the arrival of a spike and then decays exponentially with a time constant τ, we have (see also Srinivasan & Chiel, 1993; Destexhe, Mainen, & Sejnowski, 1994; Lytton, 1996; Giugliano, 2000; Nykamp & Tranchina, 2001)

$$I_{\mathrm{int}}(t) = \sum_k J_k\, e^{-(t-t_k)/\tau} = e^{-t/\tau} \sum_k J_k\, e^{t_k/\tau} = e^{-t/\tau}\, I_{\mathrm{sp}},$$
where I_int, the total synaptic current, is updated when a spike arrives (a term is added to the sum I_sp) and decays exponentially during the interspike intervals. Such dynamic behavior can be introduced in our approach by defining an effective external mean current μ_eff(t):

$$\mu_{\mathrm{eff}}(t) = \mu(t) + e^{-t/\tau}\, I_{\mathrm{sp}}.$$

The density equations should then be solved for every I_sp. This makes the look-up tables for the probability density functions more complicated (it adds a dimension corresponding to I_sp), but it provides the possibility of having more realistic currents. Another interesting extension of our model concerns situations in which there is a time-varying periodic change in the background statistics (i.e., an oscillatory, noisy input current) or a transient with a complex temporal profile that is limited in time. Such a situation can be simulated without continuously changing the statistics of the input current, which would require a frequent update of all the state variables of the neurons even in the absence of incoming spikes (see section 2.3). The solution is to add a time-dependent state variable describing the neuron's position within
Event-Driven Simulation of Spiking Neurons
827
the phase θ0 of the period. At the update time, the neuronal state variable would be updated according to the appropriate look-up table, that is, the one corresponding to the neuronal dynamics under the time-varying stimulus starting at the update time. As an example, P_fpt(t0, V0; t) would then become P_fpt(t0, V0, θ0; t), and P_V(t0, V0; V, t) would have to be extended to P_V(t0, V0, θ0; V, t). The extra cost of this approach is an added dimension in the look-up tables, whose extension depends on the time correlation length of the input signal, the length of the temporal profile (the period in the case of a cyclic input), and the time resolution. However, the simulation performance in terms of execution time is not impaired. In general, simulations with more complex model neurons or synapses also require additional dimensions in the probability density function. Moreover, solving high-dimensional partial differential equations is difficult, and the look-up tables would require a huge amount of memory. This is a major limitation of our approach, because the problem becomes intractable even with a few state variables. Nevertheless, our strategy (like most event-driven approaches) is efficient when the problem and the model are already well defined, the relevant internal variables are known, and extensive simulations are needed to explore the parameter space.

Appendix A: Fokker-Planck Equation

The probability density function P_V(V, t) that describes the statistics of the dynamics during the intervals between two successive spikes evolves in the voltage-time domain according to the Fokker-Planck equation (Cox & Miller, 1965),

$$\frac{\partial}{\partial t} P_V(V,t) = -\frac{\partial}{\partial V}\,\Phi(V,t), \qquad (A.1)$$

where $\Phi(V,t) = -\frac{\sigma^2}{2}\,\frac{\partial}{\partial V} P_V(V,t) + \big(\mu - L(V,t)\big)\, P_V(V,t)$. In the most general case, equation A.1 must be complemented by the initial condition $P_V(V,t_0) = \delta(V - V_0)$ and by appropriate boundary conditions restricting, in the case of the leaky IF neuron, the potential
• at V = V_min, in terms of a reflecting barrier, since no neuron can have a depolarization that passes below V_min:

$$\Phi(V_{min}, t) = 0 \quad \forall t; \qquad (A.2)$$

• at V = V_θ, in terms of an absorbing barrier, since neurons that cross this threshold are absorbed and leave the interval (V_min, V_θ):

$$P_V(V_\theta, t) = 0 \quad \forall t. \qquad (A.3)$$
Finally, at any time t, the probability density P_V(V, t) must satisfy a normalization condition,

$$\int_{V_{min}}^{V_\theta} P_V(V,t)\, dV + \int_{t_0}^{t} \nu(t')\, dt' = 1, \qquad (A.4)$$

where $\nu(t) = \Phi(V_\theta, t) = P_{fpt}(t) = -\frac{\sigma^2}{2}\,\frac{\partial}{\partial V} P_V(V,t)\Big|_{V_\theta}$ represents the rate at which neurons cross the threshold V_θ. We note that equation A.4 accounts for the deflating time course of the surface P_V(V, t), sketched in Figure 1: with time, the realizations of the process V(t) leave the integration interval (V_min, V_θ) after absorption at the threshold, and they irreversibly accumulate in the term $\int_{t_0}^{t} \nu(t')\, dt'$. By preliminarily choosing a fixed set of values for (μ, σ) and discretizing the interval (V_min, V_θ) into a finite number of bins, the numerical integration of equation A.1 determines the desired time-dependent probability density distributions $P_V(\mu, \sigma^2, t_0, V_0, t_1; V_1)$ and $P_{fpt}(\mu, \sigma^2, t_0, V_0; t)$, as a function of (μ, σ) and given that V(t_0) = V_0.
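To make the discretization concrete, the numerical integration of equation A.1 can be sketched as an explicit, conservative finite-difference scheme. This is an illustrative sketch, not the authors' implementation: the drift term assumes a linear leak L(V) = V/τ, and all parameter values (μ, σ, grid, time step) are made up for the example.

```python
import numpy as np

def solve_fp(mu=0.5, sigma=0.5, tau=1.0, v_min=-1.0, v_th=1.0,
             v0=0.0, d_v=0.02, dt=1e-4, t_end=0.5):
    """Explicit conservative scheme for eq. A.1 with a reflecting barrier
    at v_min (A.2) and an absorbing barrier at v_th (A.3)."""
    n = int(round((v_th - v_min) / d_v))
    centers = v_min + (np.arange(n) + 0.5) * d_v     # cell midpoints
    p = np.zeros(n)
    p[np.argmin(np.abs(centers - v0))] = 1.0 / d_v   # delta initial condition
    d_coef = 0.5 * sigma ** 2                        # diffusion coefficient
    faces = v_min + np.arange(1, n) * d_v            # interior cell faces
    absorbed = 0.0
    for _ in range(int(round(t_end / dt))):
        p_mid = 0.5 * (p[:-1] + p[1:])
        flux = (mu - faces / tau) * p_mid - d_coef * np.diff(p) / d_v
        # Reflecting barrier: zero flux at v_min (A.2).  Absorbing barrier:
        # p = 0 just beyond v_th (A.3), so the outgoing flux there is nu(t).
        nu = (mu - v_th / tau) * 0.5 * p[-1] + d_coef * p[-1] / d_v
        full_flux = np.concatenate(([0.0], flux, [nu]))
        p -= dt * np.diff(full_flux) / d_v           # continuity, eq. A.1
        absorbed += nu * dt
    return centers, p, absorbed
```

Because the update subtracts flux differences, the probability remaining in (v_min, v_th) plus the absorbed fraction stays equal to 1, which is exactly the normalization condition A.4; the outgoing flux at the threshold plays the role of the firing rate ν(t).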
Appendix B: Skip List

Skip lists are a data structure introduced by W. Pugh. They can be used in place of balanced trees (Wirth, 1986), and they use probabilistic balancing rather than strictly enforced balancing. The underlying data structure is an ordered linked list of length n. When searching a linked list, we may have to visit all the nodes of the list. If every second node also has a pointer to the node two ahead of it in the list, we have to examine no more than ⌈n/2⌉ + 1 nodes, since we can skip over half the nodes and go back only one node in case we skipped over the element we were looking for. If every (2^i)th node has a pointer 2^i nodes ahead, the number of nodes that must be examined can be reduced to ⌈log2 n⌉ while only doubling the number of pointers. A node that has k forward pointers is called a level k node. In a skip list, the levels of nodes are chosen randomly; the probabilities of choosing a certain level follow a geometric series (50% are level 1, 25% level 2, 12.5% level 3, and so on). A node's ith forward pointer, instead of pointing 2^(i-1) nodes ahead, points to the next node of level i or higher. Thus, insertions and deletions require only local modifications; the level of a node, chosen randomly when the node is inserted, never has to be changed. In summary, the data structure is a linked list with extra pointers that skip over intermediate nodes, hence the name. Skip lists are balanced by consulting a random number generator and do not require rearranging of the structure, as is the case for balanced trees. Furthermore, the simplicity of skip list algorithms (only local modifications) makes them easier to implement and provides significant constant-factor speed improvements over
balanced and self-adjusting tree algorithms. A more detailed analysis of the algorithmic performance can be found in Pugh (1990).

Acknowledgments

We thank Maurizio Mattia for providing his software and assistance and for very useful discussions and remarks on the manuscript. This work was supported by SNF grants 21-57076.99 and 31-61335.00 and by the Human Frontier Science Foundation (LT00561/2001-B).

References

Amit, D. J., & Brunel, N. (1997a). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cereb. Cortex, 7(3), 237–252.
Amit, D. J., & Brunel, N. (1997b). Dynamics of a recurrent network of spiking neurons before and following learning. Network: Computation in Neural Systems, 8, 373–404.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural networks retrieving at low spike rates I: Substrate-spikes, rates and neuronal gain. Network, 2(3), 259–274.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comp. Neurosci., 8, 183–208.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comp., 11(7), 1621–1671.
Cox, D. R., & Miller, H. D. (1965). The theory of stochastic processes. London: Methuen.
Delorme, A., Gautrais, J., van Rullen, R., & Thorpe, S. (1999). SpikeNET: A simulator for modeling large networks of integrate and fire neurons. Neurocomputing, 26/27, 989–996.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Comp., 6, 14–18.
Fusi, S., & Mattia, M. (1999). Collective behavior of networks with linear (VLSI) integrate and fire neurons. Neural Computation, 11, 633–652.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Gerstner, W., & Kistler, W. (2002).
Spiking neuron models: Single neurons, populations, plasticity. Cambridge: Cambridge University Press.
Giugliano, M. (2000). Synthesis of generalized algorithms for the fast computation of synaptic conductances with Markov kinetic models in large network simulations. Neural Comput., 12(4), 771–799.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. Journal of General Physiology, 59, 734–766.
Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations. Neural Comp., 8, 501–509.
Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Comput., 12(10), 2305–2329.
Mattia, M., & Del Giudice, P. (2002). Population dynamics of interacting spiking neurons. Phys. Rev. E, 66(5), 051917.
Nykamp, D. Q., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. J. Comp. Neurosci., 8, 19–50.
Nykamp, D. Q., & Tranchina, D. (2001). A population density approach that facilitates large-scale modeling of neural networks: Extension to slow inhibitory synapses. Neural Comp., 13, 511–546.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.
Pugh, W. (1990). Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6), 668–676.
Rauch, A., La Camera, G., Lüscher, H.-R., Senn, W., & Fusi, S. (2002). Neocortical pyramidal cells respond as integrate-and-fire neurons to in vivo-like input currents. Manuscript submitted for publication.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Srinivasan, R., & Chiel, H. J. (1993). Fast calculation of synaptic conductances. Neural Comp., 5, 200–204.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology, Vol. 2: Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. J. Neurosci., 19(21), 9587–9603.
Watts, L. (1994). Event-driven simulation of networks of spiking neurons. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 927–934). San Mateo, CA: Morgan Kaufmann.
Wirth, N. (1986). Algorithms and data structures. Upper Saddle River, NJ: Prentice Hall.
Yakovlev, V., Fusi, S., Berman, E., & Zohary, E. (1998). Inter-trial neuronal activity in infero-temporal cortex: A putative vehicle to generate long-term associations. Nature Neuroscience, 1(4), 310–317.

Received May 3, 2002; accepted October 1, 2002.
Communicated by Rajesh P. N. Rao
LETTER
Isotropic Sequence Order Learning

Bernd Porr
[email protected]

Florentin Wörgötter
[email protected]

Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland
In this article, we present an isotropic unsupervised algorithm for temporal sequence learning. No special reward signal is used, such that all inputs are completely isotropic. All input signals are bandpass filtered before converging onto a linear output neuron. All synaptic weights change according to the correlation of bandpass-filtered inputs with the derivative of the output. We investigate the algorithm in an open- and a closed-loop condition, the latter being defined by embedding the learning system into a behavioral feedback loop. In the open-loop condition, we find that the linear structure of the algorithm allows analytically calculating the shape of the weight change, which is strictly heterosynaptic and follows the shape of the weight change curves found in spike-timing-dependent plasticity. Furthermore, we show that synaptic weights stabilize automatically when no more temporal differences exist between the inputs, without additional normalizing measures. In the second part of this study, the algorithm is placed in an environment that leads to a closed sensor-motor loop. To this end, a robot is programmed with a prewired retraction reflex in response to collisions. Through isotropic sequence order (ISO) learning, the robot achieves collision avoidance by learning the correlation between its early range-finder signals and the later occurring collision signal. Synaptic weights stabilize at the end of learning, as theoretically predicted. Finally, we discuss the relation of ISO learning to other drive reinforcement models and to the commonly used temporal difference learning algorithm. This study is followed up by a mathematical analysis of the closed-loop situation in the companion article in this issue, "ISO Learning Approximates a Solution to the Inverse-Controller Problem in an Unsupervised Behavioral Paradigm" (pp. 865–884).
1 Introduction

A central goal of every autonomous agent is to maintain homoeostasis (Ashby, 1956), without which it will eventually disintegrate ("die"). A generic way to achieve this is by reacting to a disturbance of the homoeostasis with a closed-loop negative feedback mechanism (a reflex), which will

Neural Computation 15, 831–864 (2003)
© 2003 Massachusetts Institute of Technology
compensate for the disturbance by means of a (motor) reaction. Thus, the simplest form of sensible autonomous behavior can be obtained by designing an agent whose (re-)actions are reflex based (Brooks, 1989). This type of behavior is found even in rather primitive animals like amoebas, which retract their filopodia when encountering a potentially damaging chemical gradient. Such sensor-motor reflex loops represent typical feedback reaction systems, because a reflex will always be elicited only after a sensor event has already been encountered, as the word feedback implies. The reaction delay, which is unavoidably associated with every reflex loop, can in the worst case even lead to fatal situations. Thus, in any kind of improved behavior, the acting agent will try to avoid reflexes, for example, by predicting one sensor event from another, earlier occurring event (at a different sensor). This takes place when predicting pain from the heat that radiates from a hot surface in order to prevent a retraction reflex by means of an anticipatory avoidance reaction. In this example, heat radiation and pain are causally related. Many other similar causal relations exist during the life of an animal, for example, between smell and taste when foraging or between vision and touch when exploring. In all of these cases, a temporal sequence of sensor events occurs, which needs to be learned in order to avoid reflex reactions to the later event. Thus, temporal sequence learning is a dominant aspect of animal behavior. It requires a late event, which serves as a reference to which the earlier event temporally relates. The goal is to learn this specific temporal relation and turn reactive into proactive behavior.
In artificial systems, temporal sequence learning can be achieved, for example, by classical Hebbian learning (Hebb, 1949) in combination with delays (Levy & Minai, 1993), by differential Hebbian learning (Kosco, 1986; Klopf, 1986), or by the very influential temporal difference (TD) reinforcement learning algorithm (Sutton, 1988; Montague, Dayan, & Sejnowski, 1993; Dayan & Sejnowski, 1994; Abbott & Blum, 1996; Dayan, Kakade, & Montague, 2000; Rao & Sejnowski, 2001; Haruno, Wolpert, & Kawato, 2001; Schultz & Suri, 2001). In TD learning, the "later event" is represented by a designated reference signal (mostly a reward or punishment signal) to which the prediction of the learner is explicitly compared. The reference signal thus represents an explicitly defined, so-called evaluative feedback for the learning, which stops when prediction and reward match. This may pose a problem, as pointed out by Klopf (1988), who emphasized that evaluative feedback cannot exist in autonomously acting agents, which normally cannot rely on any external, evaluative (teacher-like) signal. Klopf's differential Hebbian algorithm is indeed nonevaluative and belongs to the so-called class of drive reinforcement models. This issue, however, is still rather controversial. Klopf's arguments are convincing, yet evidence exists that dopamine could indeed serve as such a possible reward-like reference signal in the brains of higher mammals (Schultz, Dayan, & Montague, 1997; Schultz & Suri, 2001), which can respond to complex learning situations
such as instrumental (operant) conditioning. Less complex forms of learning, such as basic classical conditioning, however, can be observed even in very simple creatures (for example, Aplysia), which do not have a reward system (Kandel et al., 1983). One aspect of the current study, therefore, is to design an algorithm in which sequence order learning takes place in a reward-free, unsupervised way by means of a temporal Hebb learning rule that is isotropic with respect to the inputs (hence the name ISO learning, which stands for isotropic sequence order learning).1 Thus, the algorithm is strictly based on the causal relation between its inputs, which in reality is often given by the "properties of the world," as described by the examples above. The reference signal is just the latest occurring signal (which often has the highest initial synaptic weight), a situation that can change during learning. The article is organized in the following way. First, we introduce the algorithm in an open-loop paradigm. Its linear structure allows an analytical treatment of some of its main characterizing features. More complex aspects are addressed with simulations. In this part of the study, it will become clear that all input lines are mathematically equivalent in our algorithm. Furthermore, we will show that the algorithm performs strictly heterosynaptic learning. A detailed comparison of ISO learning with other algorithms is given in appendix B. In the second part of this study, we embed our algorithm in a behavioral loop by means of a robot experiment. This creates a self-referential system and leads to stability. As a consequence of the fact that learning is heterosynaptic, we find that synaptic weights self-stabilize as soon as the reflex input becomes silent.
One central aspect of this and the companion article, "ISO Learning Approximates a Solution to the Inverse-Controller Problem in an Unsupervised Behavioral Paradigm," is to show that unsupervised open-loop ISO learning inherently turns into a reference-based system as soon as it is embedded into a nonevaluative environment that leads to a closed sensor-motor loop. This could be expected from the results of Klopf (1988), but we will show analytically in the companion article that such a closed-loop system creates, by means of the learning process, a forward model of its environment. Temporal sequence learning using the ISO learning algorithm can therefore be understood as finding a solution to the specific inverse-controller problem of replacing a reflex by its forward model.
1 The self-referential structure of this abbreviation is meant to hint at the self-referential behavioral loop introduced in the second part of this article.
Figure 1: The basic circuit in the time domain.
2 Open Loop: ISO Learning

First, we describe the algorithm itself and its characteristics without behavioral feedback. We consider a system of N + 1 linear filters h receiving inputs x and producing outputs u. The filters connect with corresponding weights ρ to one output unit v (see Figure 1). In section 1, we emphasized that all input lines of our algorithm are mathematically equivalent. It should be remembered, however, that functionally there are often distinctive differences among them. As a consequence, we will use x0 to denote the one unit that will later represent the reflex pathway. This has no mathematical consequences and is done only for convenience. The output v is then given as

$$v = \rho_0 u_0 + \sum_{k=1}^{N} \rho_k u_k. \qquad (2.1)$$
Learning (weight change) takes place according to a differential Hebb rule,

$$\frac{d}{dt}\rho_j = \mu\, u_j\, v', \qquad \mu \ll 1, \qquad (2.2)$$

where the weight change depends on the correlation between u_j and the derivative of v. An extensive discussion of how this rule relates to TD learning and to other differential Hebbian learning rules, as introduced by Klopf (1986, 1988) and Kosco (1986), is given in section 4 and appendix B. Here, we note only that other differential Hebbian learning rules use filtering and derivatives in different pathways than ISO learning does (see Figure 12 for circuit diagrams). All weights can change (also ρ0). The constant μ is adjusted such that all weight changes occur on a much longer timescale (i.e., very slowly) as
compared to the decay of the responses u. Thereby the system operates in the steady-state condition. In general, the system that we consider shall operate in continuous time (e.g., with neuronal rate codes), and it shall be able to handle continuous input functions x(t) of arbitrary shape. The transfer function h shall be that of a bandpass that transforms a δ-pulse input into a damped oscillation (see Figure 2A) and is specified in the Laplace domain,

$$h(t) \leftrightarrow H(s) = \frac{1}{(s+p)(s+p^*)}, \qquad (2.3)$$

where p* represents the complex conjugate of p = a + ib. It is important to note that such a bandpass is stable only if the poles of H(s), namely −p and −p*, lie in the left complex half-plane; otherwise, an amplified oscillation is obtained. Real and imaginary parts of p are given by

$$a := \mathrm{Re}(p) = \pi f / Q, \qquad (2.4)$$

$$b := \mathrm{Im}(p) = \sqrt{(2\pi f)^2 - a^2}, \qquad (2.5)$$
where f is the frequency of the oscillation. The damping characteristic of the resonator is reflected by Q > 0.5; small values of Q lead to strong damping. The use of resonators (bandpass filters) is motivated by biology, because oscillatory neuronal responses (Traub, 1999) and bandpass-filtered response characteristics (at virtually all sensory front ends, cell membranes [Shepherd, 1990], and ion channels like NMDA) are very prevalent in neuronal systems. Several examples of the utilization of such bandpass-filtered responses are provided by Grossberg and Schmajuk (1989) with their spectral timing model, which they have used in different applications (Grossberg, 1995; Grossberg & Merrill, 1996). Thus, the main idea is to use a neuron that receives bandpass-filtered sensor signals at its inputs and generates a motor output. Later, one of these bandpasses (h0) has the special task of providing the input for a reflex-like reaction. The other bandpass-filtered sensor signals are candidates for generating an earlier motor reaction through learning.

2.1 Analytical Findings: Open-Loop Condition.

2.1.1 Timing Dependence of Weight Change. Here we address the question of how the timing between the input signals influences the weight change. In order to perform analytical calculations, we introduce two restrictions,
Figure 2: Input functions and the initial weight change for t = 0 according to equations 2.13 and 2.14. (a) The inputs x, the impulse responses u for a choice of two different resonators h, and the derivative of the output v′. (b) The initial weight change Δρ1(T)|_{t=0} for H1 = H0, Q = 1, f = 0.01 (arbitrary units) and (c) for resonators with different frequencies f0 = 0.01, f1 = 0.02 but with the same Q = 1. The solid lines in b and c represent the analytical solutions derived from equations 2.13 and 2.14, and the dots the simulation results from the numerical integration of equation 2.9 with the same parameters for f and Q. For that purpose, the two filters H0 and H1 receive two different inputs, x1(t) = δ(t) and x0(t) = δ(t − T). This pulse sequence was repeated every 2000 time steps. After 400,000 time steps, the weight ρ1 was measured and plotted against the temporal difference T. The learning rate was set to μ = 0.001. (d) Schematic explanation of the mutual weight change at a strong (A) and a weak synapse (B) with two subsequent delta pulses at the inputs x1 and x0. For further explanations, see the text.
which we use often throughout the theoretical parts of this article:

1. We will consider only two resonators; thus, N = 1.

2. Accordingly, we have to deal with only two input functions, x0 and x1, and we define them as (delayed) δ-pulses:

$$x_0(t) = \delta(t - T), \qquad T \ge 0, \qquad (2.6)$$

$$x_1(t) = \delta(t). \qquad (2.7)$$
The first restriction is necessary because the analytical treatment of the case N > 1 is very intricate and largely impossible. Concerning the second
restriction, we note that the theory of signal decomposition allows composing any causal input function from δ-pulses. Thus, the second constraint is not really a restriction. The delay T ensures a well-defined causal relation between both inputs, where x0 (the latter of the two) is the timing reference (the reflex input). Especially the section on the robot implementation will show that the algorithm (with N > 1) is very robust with respect to variations in T. In general, we use as an initial condition ρ0 = 1 and ρ1 = 0. For the analytical treatment, we consider only the weight change at ρ1. (In fact, we later show that the algorithm normally operates in a domain where ρ0 changes very little.) Because we assume steady state, we can rewrite the product in the learning rule (see equation 2.2) as a correlation integral between input and output:

$$\rho_1 \rightarrow \rho_1 + \Delta\rho_1, \qquad (2.8)$$

$$\Delta\rho_1(T) = \mu \int_0^\infty u_1(T+\tau)\, v'(\tau)\, d\tau. \qquad (2.9)$$

Similar to other approaches (Oja, 1982), we compute the weight change for the initial development of the weights as soon as learning starts, because this is indicative of the continuation of the learning. Therefore, we assume ρ1(t) = 0 for t = 0, and equation 2.9 turns into

$$\Delta\rho_1(T)|_{t=0} = \mu \int_0^\infty u_1(T+\tau)\, u_0'(\tau)\, d\tau. \qquad (2.10)$$
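Equation 2.10 can also be evaluated numerically. The sketch below is illustrative, not the authors' code: it uses the impulse response h(t) = (1/b) e^{−at} sin(bt) that follows from equations 2.3 to 2.5, delta-pulse inputs as in equations 2.6 and 2.7, and made-up parameter values.

```python
import numpy as np

def impulse_response(t, f=0.1, q=1.0):
    # Damped oscillation from eqs. 2.3-2.5: h(t) = exp(-a t) sin(b t) / b
    a = np.pi * f / q
    b = np.sqrt((2 * np.pi * f) ** 2 - a ** 2)
    return np.where(t >= 0.0, np.exp(-a * t) * np.sin(b * t) / b, 0.0)

def delta_rho1(T, mu=1e-3, dt=0.01, t_max=200.0):
    """Initial weight change (eq. 2.10) for x1(t) = delta(t), x0(t) = delta(t - T),
    with rho_0 = 1 and rho_1 = 0; pulses are shifted so both occur at t >= 0."""
    t = np.arange(0.0, t_max, dt)
    u1 = impulse_response(t - max(0.0, -T))   # early or late, depending on sign of T
    u0 = impulse_response(t - max(0.0, T))
    v_prime = np.gradient(1.0 * u0, dt)       # v = rho_0 u_0, so v' = u_0'
    return mu * np.sum(u1 * v_prime) * dt
```

For T > 0 (x1 precedes x0), delta_rho1 comes out positive; for T < 0 it is negative, reproducing the biphasic, STDP-like shape of the curve in Figure 2b.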
In simple cases (e.g., for h0 = h1), this integral can be solved directly. A general solution, which can be extended to cover more than two inputs, requires applying the Laplace transform, using the notational convention x(t) ↔ X(s) for a transformation pair of functions in the time and the Laplace domain. The linearity of our system allows solving the integral in equation 2.10 analytically, which is possible with the help of Plancherel's theorem (see appendix A for this rather little-known theorem). Applying it together with the shift theorem x(t − t0) ↔ X(s) e^{−t0 s} to equation 2.10, we get

$$\Delta\rho_1 = \mu \frac{1}{2\pi} \int_{-\infty}^{+\infty} H_1(-i\omega)\,\big[\,i\omega\, e^{-Ti\omega}\, H_0(i\omega)\,\big]\, d\omega \qquad (2.11)$$

$$\quad\;\; = \mu \frac{1}{2\pi} \int_{-\infty}^{+\infty} H_1(i\omega)\,\big[\,-i\omega\, e^{Ti\omega}\, H_0(-i\omega)\,\big]\, d\omega. \qquad (2.12)$$

Note that the symmetry of Plancherel's theorem is broken due to the exponential term. Equation 2.11 represents a Fourier transform, and equation 2.12 is its inverse. Both integrals can be evaluated with the method of
residues. Equation 2.12, however, offers the advantage that we can neglect the right complex half-plane, because it leads to contributions for negative time (i.e., t < 0) only (McGillem & Cooper, 1984; Stewart, 1960). Thus, of the four residues (poles) for H1 and H0, only those of H1 need to be considered, because those of H0 have flipped their sign in equation 2.12 and now appear in the right complex half-plane. We get as the final result,

$$\Delta\rho_1(T)|_{t=0} = \mu\, \frac{b_1 M \cos(b_1 T) + (a_1 P + 2 a_0 |p_1|^2)\sin(b_1 T)}{b_1\,(P + 2 a_1 a_0 + 2 b_1 b_0)(P + 2 a_1 a_0 - 2 b_1 b_0)}\; e^{-T a_1}, \qquad T \ge 0, \qquad (2.13)$$

$$\Delta\rho_1(T)|_{t=0} = \mu\, \frac{b_0 M \cos(b_0 T) + (a_0 P + 2 a_1 |p_0|^2)\sin(b_0 T)}{b_0\,(P + 2 a_0 a_1 + 2 b_0 b_1)(P + 2 a_0 a_1 - 2 b_0 b_1)}\; e^{T a_0}, \qquad T < 0, \qquad (2.14)$$

where M = |p1|² − |p0|² and P = |p1|² + |p0|². If we assume identical resonators, H0 = H1 = H, we get

$$\Delta\rho_1(T)|_{t=0} = \mu\, \frac{1}{4ab}\, \sin(bT)\, e^{-aT}, \qquad (2.15)$$
which is identical to the impulse response of the resonator itself, apart from a different scaling factor. The corresponding weight change curves are plotted in Figures 2b and 2c. The curves show that synaptic weights are strengthened if the presynaptic signal arrives before the postsynaptic signal, and vice versa. The biological relevance of the learning curves becomes especially clear in the case H0 = H1. This learning curve with identical resonators is similar to the curves obtained in neurophysiological experiments exploring spike-timing-dependent synaptic plasticity (STDP or temporal Hebb; Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Zhang, Tao, Holt, Harris, & Poo, 1998; Abbott & Nelson, 2000; Fu et al., 2002).2 Furthermore, we find for this case (see Figure 2b) that the location of the maximum of the learning curve, T_opt, falls in the interval

$$\frac{\lambda}{2\pi} < T_{opt} < \frac{\lambda}{4}, \qquad \frac{1}{2} < Q < \infty, \qquad (2.16)$$
where λ = 1/f is the wavelength of the resonator. The isotropic setup of the algorithm in principle also leads to weight changes at ρ0. It is, however, evident that the change in ρ0 is (very) small when the contributions from the other inputs ρk, k ≥ 1, are small. This is most easily seen by considering Figure 2d, which shows a situation that arises

2 In order to reproduce STDP in a biophysical model, the signals x and u require a different interpretation involving NMDA conductances and backpropagating action potentials. This is the topic of a follow-up study currently in preparation (Saudargiene, Porr, & Wörgötter, 2003).
after some learning by using the standard initial conditions. The size of the synapses depicts the momentarily existing weight values. The input sequence is such that a weight increase arises at synapse B from the influence of input line A onto line B (+T in the learning curve), whereas a weight decrease occurs at synapse A due to the inverse causal (−T) influence of input line B onto line A. The degree of change is depicted by the plus and minus signs, showing that the decrease of A is smaller than the increase of B. For two similar inputs, a simple rule of thumb is that the weight change Δρ roughly follows the weight value of the other input scaled by the learning rate μ, while the sign of the change depends on the temporal sequence of events:

$$\Delta\rho_{\mathrm{early\ input}} \approx \mu\, \rho_{\mathrm{late\ input}}, \qquad (2.17)$$

$$\Delta\rho_{\mathrm{late\ input}} \approx -\mu\, \rho_{\mathrm{early\ input}}. \qquad (2.18)$$
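This rule of thumb can be checked numerically. The sketch below is illustrative (made-up parameters; the resonator is the damped oscillation h(t) = (1/b) e^{−at} sin(bt) from equations 2.3 to 2.5), with a strong late (reflex) input and a weak early input and one pulse pair per trial.

```python
import numpy as np

# Numerical sketch of the mutual weight changes of Figure 2d / eqs. 2.17-2.18.
f, q, dt, delay, mu = 0.1, 1.0, 0.01, 2.0, 1e-3
a = np.pi * f / q
b = np.sqrt((2 * np.pi * f) ** 2 - a ** 2)
t = np.arange(0.0, 200.0, dt)
h = np.exp(-a * t) * np.sin(b * t) / b
u_early = h                                     # x1(t) = delta(t)
u_late = np.interp(t - delay, t, h, left=0.0)   # x0(t) = delta(t - delay)

rho_late, rho_early = 1.0, 0.2                  # strong reflex, weak predictor
v_prime = np.gradient(rho_late * u_late + rho_early * u_early, dt)
d_late = mu * np.sum(u_late * v_prime) * dt     # weight change per eq. 2.2
d_early = mu * np.sum(u_early * v_prime) * dt
```

Here d_early is positive and proportional to ρ_late, while d_late is negative and proportional to the small ρ_early, so the decrease is the smaller of the two, matching the plus and minus signs of Figure 2d.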
As a result, the strong input roughly maintains its strength, while the contributions from the other inputs are small. This is the typical case when learning is guided by a strong reflex and the organism has the task of building up predictive pathways that should be weaker but more precise in order to prevent the disturbance. We note that the above analytical results can be extended to cover the most general system structure as represented in Figure 1 with N > 1. Equation 2.1 turns into

$$V(s) = \sum_{k=0}^{N} \rho_k U_k(s), \qquad (2.19)$$
keeping it in the Laplace domain, because then we can directly obtain

$$\Delta\rho_j(T) = \mu \frac{1}{2\pi} \int_{-\infty}^{+\infty} -i\omega\, V(-i\omega)\, U_j(i\omega)\, d\omega, \qquad (2.20)$$
which is the general form of equation 2.9 in the Laplace domain. It should be noted that for all Δρ_j, this integral can still be evaluated analytically in the same way as in the special case with two resonators. In the following equations, we will always use the index j for the input weights and k for the summation of the output signal v.

2.1.2 Weight Change When x0 Becomes Zero. In this section, we address the question of weight development when the reference input (reflex) becomes silent (x0 = 0) at some point during learning. This is motivated by the cases discussed in section 1, where the goal of learning is to avoid (late, painful, damaging) reflex reactions. Thus, setting x0 = 0 corresponds to the condition in which the reflex has successfully been avoided. Note that we are now left with just one input (x1), asking if its synaptic weight will continue
to change. This would correspond to a situation of homosynaptic learning (e.g., homosynaptic long-term potentiation; Guo-Quing & Poo, 1998). This section will show that our algorithm does not perform homosynaptic learning. Instead, the synaptic weight of x1 stabilizes as soon as x0 = 0. Thus, ISO learning is purely heterosynaptic. We use the same two restrictions as above and start with equation 2.20, inserting equation 2.19 into it. We set x0 = 0 ↔ X0 = 0, and the weight change becomes

$$\Delta\rho_j = \mu \frac{1}{2\pi} \sum_{k=1}^{N} \rho_k \int_{-\infty}^{+\infty} -i\omega\, H_k(-i\omega)\, H_j(i\omega)\, d\omega. \qquad (2.21)$$

For N = 1, we get

$$\Delta\rho_1 = \mu\, \rho_1 \frac{1}{2\pi} \int_{-\infty}^{+\infty} -i\omega\, H_1(-i\omega)\, H_1(i\omega)\, d\omega \qquad (2.22)$$

$$\quad\;\; = -\mu\, \rho_1 \frac{i}{2\pi} \int_{-\infty}^{+\infty} \omega\, |H_1(i\omega)|^2\, d\omega. \qquad (2.23)$$
H_1(iω)H_1(−iω) = |H_1(iω)|² is valid since the transfer functions can always be expressed as products of complex-conjugate pole pairs. Multiplying H_1(iω) with H_1(−iω) leads to products of a complex number with its conjugate counterpart, which yields its squared absolute value. Since all transfer functions are symmetrical in relation to the real axis, the frequency response |H_1(iω)|² is also symmetrical, which leads to a symmetrical factor |H_1(iω)|² in equation 2.23. Due to the factor ω in equation 2.23, the entire integrand becomes antisymmetrical, and thus the integral is zero.³ Thus, the weights stabilize if only x_1 is active. This result can be summarized in a rather intuitive way: With N = 1 and x_0 = 0, there is an input signal only at x_1. The weight change in that case is a correlation of a damped sine wave with its derivative, which is a damped cosine wave. The correlation of a sine with a cosine is always zero. We have not attempted to calculate the behavior of the weights for N > 1, which is very tedious, if not impossible. Instead, we will show simulation results for this later. However, the above argument can be extended by the Fourier theorem of wave decomposition to more inputs, because each sine wave from a resonator is multiplied by its cosine counterpart. Thus, we also expect for N > 1 a zero correlation and a stop of the weight development as soon as x_0 = 0.

³ In a practical application (e.g., a digital infinite impulse response filter), this is true only if the frequency responses of the input X_1 and the transfer function H_1 vanish for high frequencies to avoid the integral's becoming ill defined (∞ − ∞). In other words, the transfer functions must contain a low-pass term. This reflects the aspect that the time course of the input functions must be predictable (Kalman filter property).
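The intuitive summary above can be checked numerically: the correlation of a damped sine wave with its own derivative integrates to (essentially) zero, which is why the weight stops changing once only x_1 is active. A minimal sketch in Python; the decay rate and frequency are arbitrary illustrative values, not parameters from the article:

```python
import numpy as np

# Damped sine standing in for a resonator response u1 when only x1 fires;
# the decay rate a and frequency b are arbitrary illustrative values.
a, b, dt = 0.05, 0.5, 0.01
t = np.arange(0.0, 400.0, dt)
u = np.exp(-a * t) * np.sin(b * t)

# Its time derivative is (up to the decay term) a damped cosine.
du = np.gradient(u, t)

# With x0 = 0, the ISO weight change is proportional to the integral of
# u * u', which equals [u^2 / 2] at the boundaries and hence vanishes.
integral = float(np.sum(u * du) * dt)
print(integral)  # close to zero
```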
Isotropic Sequence Order Learning
841
Figure 3: Simulation results with a circuit with two inputs, hence N = 1 (see Figure 1). Input pulse sequences were repeated every 100 time steps, the first starting at zero. Both resonators had values of Q_{0,1} = 1 and f_{0,1} = 0.1. The other parameters were μ = 0.01 and T = 2. Results for (a) t = 0 and (b) t = 900.
2.2 Simulations: Open-Loop Condition. In this section, we perform simulations with the neuronal circuit from Figure 1. The simulations have the purpose of validating the theoretical results from the previous section and exploring more complex situations (especially N > 1) that are not analytically treatable. The simulations were performed under Linux on an Athlon processor using C++. The resonators were implemented as time-discrete infinite impulse response filters in the z-domain. We used the impulse-invariant transformation from the s-plane to the z-plane and calculated the coefficients for the filters according to McGillem and Cooper (1984). We used normalized time steps, resulting in normalized filter frequencies in the range f = [0, …, 0.5]. In all applications, we used frequencies less than or equal to f_max = 0.1 in order to avoid sampling artifacts.

2.2.1 One Filter in the Predictive Pathway: N = 1. As before, we begin with the simplest case, N = 1: one resonator in the reflex pathway x_0 and one resonator in the predictive pathway x_1, and use both restrictions (1, 2) noted previously.

Signal shape. Figure 3a shows for t = 0 the δ-pulses at x_{0,1} and the responses u_0 and u_1 from the resonators H_0 and H_1, respectively. Before learning, the output v is identical to the signal u_0 because the weights were set to ρ_0 = 1 and ρ_1 = 0. The actual weight change of ρ_1 is caused by repeated pairing of the δ-pulses at x_0 and x_1. The result after nine pairings is depicted in Figure 3b. The comparison between Figures 3a and 3b shows that the onset of the output v has shifted toward the earlier event x_1. Before learning, it was identical to the resonator response u_0 in the reflex pathway. After learning, the output is a superposition of both signals u_{0,1}, which leads to an onset that occurs together with the early onset of u_1.
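A resonator of this kind can be sketched as a second-order IIR filter obtained by the impulse-invariant transformation. The sketch below is not the authors' C++ implementation; the pole parametrization a = πf/Q, b = √((2πf)² − a²) and the coefficient formulas are assumptions based on the standard impulse-invariant mapping of a complex-conjugate pole pair:

```python
import math

def resonator_coefficients(f, q):
    """Impulse-invariant IIR coefficients for a resonator with normalized
    frequency f and quality Q. The pole parametrization a = pi*f/Q and
    b = sqrt((2*pi*f)**2 - a**2) is an assumption, not taken verbatim
    from McGillem and Cooper (1984)."""
    a = math.pi * f / q
    b = math.sqrt((2.0 * math.pi * f) ** 2 - a ** 2)
    g = math.exp(-a) * math.sin(b) / b      # input gain (unit time step)
    a1 = 2.0 * math.exp(-a) * math.cos(b)   # feedback coefficients
    a2 = -math.exp(-2.0 * a)
    return g, a1, a2

def resonator(x, f, q):
    """Run the filter y[n] = g*x[n-1] + a1*y[n-1] + a2*y[n-2] over x."""
    g, a1, a2 = resonator_coefficients(f, q)
    y, y1, y2, x1 = [], 0.0, 0.0, 0.0
    for xn in x:
        yn = g * x1 + a1 * y1 + a2 * y2
        y.append(yn)
        y2, y1, x1 = y1, yn, xn
    return y

# A delta pulse into a resonator with f = 0.1 and Q = 1 (the Figure 3
# values) yields a damped sine wave that decays over a few cycles.
u = resonator([1.0] + [0.0] * 199, 0.1, 1.0)
```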
Thus, the circuit is able to "detect" the δ-pulse at x_1 as a predictor of the δ-pulse at x_0.

Learning curve. Using the same setup, we can vary the interval T and plot the change of ρ_1 in dependence of T for the initial learning step (i.e., for t = 0 after one correlation). This was simulated using identical resonators H_0 = H_1 but also with different resonators H_0 ≠ H_1. The results are shown together with the analytical findings in Figures 2b and 2c, having used the same parameters in both the simulation and the analytical calculation. Thus, the analytically calculated weight change curves are reproduced by the simulation results.

Development of ρ_0. In all cases discussed so far, both weights were allowed to change, and substantial changes in ρ_1 were found for about 10 to 50 pairings, while we have claimed that ρ_0 remains stable. An easy intuition why this basically holds can be gained by using the rule of thumb defined in equations 2.17 and 2.18. From this, it is clear that the change of ρ_0 remains tiny for a prolonged time in our setup because ρ_1 equals zero at the beginning and μ is very small. In simulations, we found that ρ_0 starts to change by more than 1% only after about 50,000 learning steps when using a standard learning rate of μ = 0.001 and ρ_1 = 0, ρ_0 = 1 as the usual initial conditions. Note that in the robot experiments shown later, the learning goal is reached after not more than 20 pairings. During this time, the change in ρ_0 is minuscule. Several other relevant cases could occur.

• Another interesting initial condition would be setting the weights to the same initial values (e.g., ρ_0 = ρ_1 = 0.5). This will still lead to a weight growth at ρ_1 (until about learning step 100,000), but now ρ_0 will drop from the beginning. Functionally, this could be interpreted as a situation where the reflex input becomes weaker, while the anticipatory pathway continues to take over.
This could reflect a situation where the reflex has not been used for a long time, because then it is reasonable to allow the reflex to disappear, leading to ρ_0 = 0. The only measure that has to be taken is to stop the weight from changing its sign by keeping it at zero.

• In conditions where the reflex saves the organism from life-threatening situations, the weight ρ_0 can always be set to a fixed value.

• In conditions where we have multiple synaptic weights of similar strength (i.e., N > 1), we can expect that the system's development will be dominated by stimulus-sequence-induced symmetry-breaking effects. This can lead to rather complex patterns, which would require a more detailed analysis (and is beyond the scope of this article).
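The heterosynaptic behavior described in this section (ρ_1 grows through repeated pairings while ρ_0 barely moves) can be reproduced with a short simulation. This is a self-contained illustrative sketch, not the original code: the resonator is modeled directly by an assumed damped-sine impulse response, and all parameter values are merely representative of the setups above:

```python
import numpy as np

def damped_sine(f, q, n):
    """Illustrative resonator impulse response h(t) = exp(-a t) * sin(b t),
    with a = pi*f/Q and b = sqrt((2*pi*f)**2 - a**2) (an assumed
    parametrization, not taken verbatim from the article)."""
    t = np.arange(n, dtype=float)
    a = np.pi * f / q
    b = np.sqrt((2 * np.pi * f) ** 2 - a ** 2)
    return np.exp(-a * t) * np.sin(b * t)

f, q, T, mu, steps, n = 0.01, 1.0, 15, 0.001, 200, 1000
h = damped_sine(f, q, n)

# x1 = delta(t) (predictive) and x0 = delta(t - T) (reflex): u1 leads u0.
u1 = h.copy()
u0 = np.zeros(n)
u0[T:] = h[:n - T]

rho0, rho1 = 1.0, 0.0                # initial conditions as in the text
for _ in range(steps):               # repeated pairings of the delta pulses
    v = rho0 * u0 + rho1 * u1        # output of the linear unit
    dv = np.gradient(v)              # derivative of the output
    rho1 += mu * np.sum(u1 * dv)     # ISO rule: weight change ~ u_j * v'
    rho0 += mu * np.sum(u0 * dv)

print(rho0, rho1)  # rho0 stays close to 1; rho1 has grown heterosynaptically
```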
Figure 4: Simulated development of the weight ρ_1 for the case of two inputs (N = 1). Parameters were f_{0,1} = 0.01 and Q_{0,1} = 1. The inputs are triggered at a temporal difference of T = 15: x_0 = δ(t − T) and x_1 = δ(t). The pairing of the delta pulses is repeated every 2000 time steps. The learning rate is set to (a) μ = 0.001 and (b) μ = 0.01.
Weight stabilization for x_0 = 0. The analytical results (see equation 2.22) predict that ρ_1 should stabilize as soon as x_0 = 0. This, however, also requires that the learning rate μ is zero, which in reality cannot be ultimately achieved. The following simulation results show the effect of the learning rate on the development of the weights and compare the analytically obtained result with those obtained for more realistic situations. The simulation to test this was performed in the following way. First, we triggered the two resonators with paired δ-pulses. Then the input x_0 was switched off (i.e., x_0 = 0) at t = 400,000, and only the input x_1 was still active. Figure 4 shows the weight development of ρ_1 over time for two different learning rates μ. With a low learning rate, the weight ρ_1 approximately stabilizes when the input x_0 is switched off (see Figure 4a), whereas with a higher learning rate, the weight continues to grow. Weight stabilization can be very desirable during learning, but so is a high learning rate. These conflicting demands therefore lead to a trade-off, which needs to be taken care of in practical applications (like the robot application).

2.2.2 More Than One Filter in the Predictive Pathway. The setup with only one resonator (N = 1) in the predictive pathway has the disadvantage that there is only one specific temporal interval T_opt where learning (weight change) is at the maximal rate. The use of an array of resonators with different frequencies in the predictive pathway removes this disadvantage (see the inset in Figure 5). The system should now be able to learn more than only one time interval properly. We have set up such a system with an array of 10 resonators in the predictive pathway. We triggered this array with the same δ-pulse (x_1 = δ(t)). The reflex pathway was triggered by a delayed
δ-pulse (x_0 = δ(t − T); T = 10). The initial condition for learning was set to ρ_0 = 1; ρ_k = 0, k ≥ 1, as before.
Signal shape. Figure 5 shows the resonator responses u_k scaled with their momentarily existing weights ρ_k (top) at time t = 390,000 during learning. The scaled response of u_0 (a, dashed line) is still the biggest at this time. The diagram also shows the output signal v and its derivative during the learning process (also t = 390,000, bottom). Additionally, the output signal generated when silencing the input x_0 is shown (c, dotted line, bottom, t = 400,000). The output v is a superposition of all resonator outputs. It can be seen that it has a first and a second maximum (marked with 1 and 2 in Figure 5). The second maximum is due to the resonator response from the reflex pathway u_0 and vanishes when the input x_0 is switched off (see the dotted curve in c). The first maximum is generated by superposition of the responses ρ_k u_k, k > 0 (i.e., all except u_0). In general, we have observed that this superposition process will always try to generate the first maximum as close as possible to x_0. This can be understood as the ongoing amplification of an initially existing asymmetry in the system, in the following way. At the first learning step, the derivative of v is zero before x_0 and then follows the shape of the v′ curve, as shown in the diagram. Thus, there is one resonator response whose shape matches the v′ curve best (best positive correlation). Obviously, it is that particular resonator that has its maximum at (or closest to) the maximum of the v′ curve (second cusp; the first is still zero). For this resonator, we get the highest correlation result (see equation 2.9) and, thus, the strongest weight growth at the beginning of learning. The other weights grow less strongly, and their growth rate is approximately (inversely) related to the distance of their resonator maximum from x_0. As a consequence, we get a distribution of weight values that follows the shape outlined by the y-position of the resonator maxima, as shown in the top panel by the dots on the curves.
Superposition of these weighted responses thus leads to a maximum of v at x_0. This line of argumentation continues to hold for the following learning steps, because the theoretical results suggest that the contribution of the correlation of the first part of the v′ curve (first cusp) with the u_k, k > 0, which would correspond to homosynaptic learning, is zero in all cases (see equations 2.21–2.23), thereby not affecting the weight change. Thus, weight change continues to follow the distribution of the maxima in Figure 5a. The resonator with the lowest frequency f_l determines the longest delay that can be learned, T_max = 1/f_l. Equivalently, the shortest delay is T_min = 1/f_h, where f_h is the frequency of the resonator with the highest frequency. Within the range [T_min, T_max], any T causes an output with a maximum that always coincides with the location of x_0, provided there are enough resonators to allow for a sufficiently accurate superposition process.
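As a concrete example of this range, and assuming the parametrization reconstructed from the Figure 5 caption (f_k = k·f_0, f_0 = 0.01, k = 1, …, 10; normalized time steps), the learnable interval can be computed directly:

```python
# Temporal range covered by a resonator filter bank. The parametrization
# f_k = k * f0 with f0 = 0.01 and k = 1..10 is an assumption reconstructed
# from the Figure 5 caption; time is in normalized steps.
f0, n = 0.01, 10
freqs = [k * f0 for k in range(1, n + 1)]
t_min = 1.0 / max(freqs)  # shortest learnable delay, T_min = 1/f_h
t_max = 1.0 / min(freqs)  # longest learnable delay,  T_max = 1/f_l
print(t_min, t_max)       # roughly 10 and 100 time steps
```

With these assumed values, the bank covers delays from roughly 10 to 100 time steps, so the T = 10 used in the experiment lies at the lower edge of the range.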
Figure 5: Multiple filters (N = 10) in the predictive pathway: (a) Filter responses, (b) the neuronal circuit, and (c) its output during learning and after learning. The neuronal circuit (b) consists of a filter bank where the filter frequencies are set to f_k = k·f_0, k ≥ 1, and f_0 = 0.01. The learning rate was set to μ = 0.0005 and Q = 1. The filter bank gets two different inputs: x_1(t) = δ(t) (predictive pathway) and x_0(t) = δ(t − T) (reflex pathway), T = 10. The delta pulses are repeated every 2000 time steps. After 400,000 time steps, x_0 is set to zero. The contribution of the signals u_k ρ_k to the output v triggered by x_1(t) is called HV and is marked by the shaded box in b. The weighted resonator responses ρ_k u_k after learning are shown in a. The output signals during learning (time step 390,000) and after learning (after time step 400,000) are shown in c.
Learning curve. As in the case of only two resonators, the dependence of the weight change on the temporal distance T can be explored. Now, however, we have to monitor N changeable weights. For this experiment, we have chosen the same standard setup using paired δ-pulses with a temporal delay of T, but now we use 15 resonators (N = 15) in the predictive pathway. Their frequencies are chosen such that 10 resonators have a frequency that is higher and 5 resonators one that is lower than f_0 (see Figure 6). Every second weight change curve is shown in Figure 6 for t = 0, where we varied
Figure 6: Weight changes Δρ_j dependent on the temporal distance T with a filter bank of resonators (N = 15) set up as in Figure 5b. The filter frequencies are set to f_k = k·f_0/5, k ≥ 1, with f_0 = 0.01 and Q = 1. The learning rate was set to μ = 0.0001. The case f_0 = f_k is marked with a thick line and reproduces the curve in Figure 2b. The filter bank gets two different inputs: x_1(t) = δ(t) (predictive pathway) and x_0(t) = δ(t − T) (reflex pathway). The delta pulses are repeated every 2000 time steps. After 400,000 time steps, the weight ρ_j was measured and plotted against the temporal difference T. Only every second curve is plotted.
T from −150 to 150. Every curve in this diagram represents one weight ρ_k of a specific resonator h_k in dependence of T. The curve plotted with the thick line belongs to the resonator h_k that has the same frequency as the resonator h_0; hence, f_k = f_0. The other weight change curves belong to resonators in the predictive pathway that have different frequencies compared to f_0. It can be seen that every weight change curve has a specific T where weight change is maximal, or (in support of the argument used to explain the first maximum in Figure 5) the other way around: for specific values of T and large N, there always exists one particular resonator that shows maximum weight change. Another interesting result is that the weight change curve with f_k = f_0 is identical to the weight change curve with only one resonator (see Figure 6). The fact that both weight change curves are the same is due to the linearity of our model.
Figure 7: Weight change of multiple resonators (N = 10) in dependence of the learning rate. The neuronal circuit (see Figure 5b) consists of a filter bank where the filter frequencies are set to f_k = 0.1/k, k ≥ 1, where the index k is also used as a label for the different curves in this figure (Q = 1 in both cases). The filter bank gets two different inputs: x_1(t) = δ(t) (predictive pathway) and x_0(t) = δ(t − T) (reflex pathway) with T = 10. The delta pulses are repeated every 2000 time steps. After 400,000 time steps, x_0 was set to zero. The learning rate was set to (a) μ = 0.0001 and (b) μ = 0.001.
In summary, in an array of different resonators, every resonator is responsible only for a specific and limited range of temporal intervals, so that such an array is able to cover a wide range of different temporal intervals. The weight change curves for the different weights give precise information as to which resonator yields the maximum contribution to the output signal.

Weight stabilization for x_0 = 0. Next, we ask whether the weights also stabilize in a multiresonator setup if the reflex pathway x_0 becomes zero (compare to Figure 7). We use the same setup as before (N = 10 and paired δ-pulses with T = 10). The test was performed in the same way as above by setting x_0 to zero at time t = 400,000. Figure 7 shows that the weights stabilize in the limit of μ → 0. Thus, we find again that the crucial parameter for an approximate weight stabilization is the learning rate μ, which is too high in Figure 7b. Because of the complexity of the mathematics, we were not able to give robust analytical arguments for weight stabilization in the multiresonator case. We could argue only that the individual resonator responses (sine waves) should be orthogonal to the derivative of the output (cosine wave) as soon as x_0 = 0 (see the dashed curve in Figure 6), leading to a zero value of the correlation integral. The experimental findings in Figure 7 support this notion. Thus, in the multiresonator case, we obtain the desired property of weight stabilization in the limit of μ → 0. Homosynaptic learning does not take place even with more than two resonators.
Let us in the context of N > 1 also briefly consider more than one predictive pathway with several sensors that operate independently. In this case, the isotropy of the algorithm leads to the situation that learning will continue between those sensor inputs even after the reference (reflex) input x_0 has become silent. Weight changes of the other (nonreflex) weights, however, will normally remain small, because after learning their absolute values are small, which leads only to minor cross influences, as will be shown in the robot example (see Figure 11). Thus, even in such a situation, weights will be (approximately) stable. Another stabilizing factor arises if we place the learning algorithm in a closed behavioral loop (see also the next section). In the closed-loop paradigm analyzed in our companion article, we found that a perturbation of the weight ρ_1 (which disturbs the final condition x_0 = 0) leads to a counterforce that reestablishes the original weight. Taken together, these arguments show that weights might indeed undergo small drifts after removal of the reflex, but these drifts do not lead to a divergence. This is supported by the robot experiments, where we never observed weight divergence even after prolonged runs.

3 Closed Loop: The Robot Experiment

The task in this robot experiment is collision avoidance. The built-in reflex behavior is a retraction reaction after the robot has hit an obstacle (see Figure 8, solid pathway). This represents a typical feedback mechanism with the desired state that the signal at the collision sensor should remain zero. In order to prevent the robot from leaving the desired state, it can use other sensor modalities that can predict a looming collision. In our case, this is achieved with range finders (see Figure 8, dashed pathway). The learning algorithm has the task of learning the existing temporal correlation between the range finder and the collision sensor signals.
After learning, the robot can generate a motor reaction in response to the range finder signals and thereby avoid the retraction reflex. Functionally, the reflex will be eliminated, and the "predictive pathway" takes over after learning. Up to this point, the algorithm had been treated in a pure open-loop condition, where learning was entirely unsupervised. The robot experiments create a situation where the behavioral reaction influences the sensor inputs, thereby creating a closed-loop situation (see Figure 8). Unsupervised learning thereby turns into something that we would call self-referenced learning, in order to distinguish it from reinforcement learning, which requires an explicitly defined reference signal (punishment or reward) that is not present in closed-loop ISO learning. The theoretical treatment of this situation in our companion article will clarify that these two situations are fundamentally different. The robot's circuit diagram is shown in Figure 9; a detailed description, which includes a list of the robot's control parameters, is given in
Figure 8: Simple sensor-motor feedback with prediction, made explicit with the example of collision avoidance. The solid lines depict a prewired reflex loop that exists before learning. This reflex loop performs a reflex reaction, in this case a retraction reaction (motor response), when the collision sensor (reflex-eliciting signal) has been triggered. The task is to learn that the earlier range finder signal (predictive signal, dashed pathway) can be used to generate an earlier motor reaction to prevent the collision (reflex).
appendix C. The robot has three collision sensors and two range finders. All signals are filtered by bandpass filters and converge onto two neurons, which generate two different motor outputs: one controls the robot's speed and the other the robot's steering angle. The speed of the robot is set to a fixed value and its steering to zero, so that the undisturbed robot drives straight forward. The built-in retraction behavior is generated by the dotted pathways, where the collision sensors trigger highly damped sine waves in the corresponding resonators. This signal is sign-inverted and directly transmitted to the motors. Essentially, it consists of just a single half-wave, which leads to the retraction reaction. The weights are initially set to minus one and effectively do not change during learning, so that the retraction behavior always remains the same. The dotted collision sensor pathways with their strong weights that determine the motor output are, together with the arising behavioral feedback, equivalent to the reflex loop discussed in Figure 8. The range finder signals (solid lines) react at a distance of about 15 cm from an obstacle and are therefore able to predict a collision. However, the temporal delay between the range finder signal and the collision signal is variable and depends on the actual motion trajectory of the robot. In order to cope with a rather wide range of temporal delays, we used the same approach as in section 2.2.2 and implemented two resonator filter banks, which get their signals from the two range finders. Filter banks consist of 10
Figure 9: Robot circuit: The robot consists of three collision sensors (CS), two range finders (RF), and two output neurons for speed (ds) and steering angle (dφ). These output neurons represent simple summation circuits (indicated by Σ). The robot has a reflex behavior established by the signals from the collision sensors (dotted lines), which are fed into four bandpass filters H_0 with f_0 = 1 Hz and Q_0 = 0.6. The output of the bandpass filters is summed at the neurons for speed (ds) and steering angle (dφ). The corresponding weights are adjusted in such a way that the robot performs an appropriate retraction reaction if either of the collision sensors is triggered. The task of learning is to use the signals from the range finders (RF) to predict the trigger of the collision sensor (CS). The two signals from the left and the right range finder are fed into two filter banks with N = 10 resonators with frequencies of f_k = 1 Hz/k, k ≥ 1, and Q = 1 throughout. The 20 signals from the two filter banks converge on both the speed neuron and the neuron responsible for the steering angle. The learning rate was μ = 0.00002. L depicts the implementation of the learning rule (see equation 2.2).
resonators covering approximately a temporal interval between 50 ms and 500 ms. These resonator signals converge onto both the speed neuron and the steering neuron. Their weights are initially set to zero. Depending on the initial conditions, the robot found different solutions to avoid obstacles. One solution, for example, is that after learning, the robot simply stops in front of an obstacle or slightly oscillates back and forth. This
type of behavior may look trivial but is entirely compatible with the learning goal of avoiding obstacles. More commonly, we observed a different type of solution, where the robot continuously drives around and uses mainly its steering to generate avoidance movements. Other solutions do not seem to be possible and have not been observed. Furthermore, we observed that the robot always found one of these solutions after sufficiently long learning. Figure 10 shows episodes of the robot behavior and its signals for one selected example trajectory. The signals shown in Figures 10c and 10d correspond to a situation where the robot still collides with the walls. Corresponding collision points are marked in Figure 10a by c and d. As expected, learning leads to a change of the temporal relation between the range finder signal and the collision signal. This can be seen from the different lengths of T depicted in Figures 10c and 10d and is due to the learned motor output, which is increasingly dominated by the range finder signal. This supports the filter bank approach, which we have used in the robot experiment. Finally, Figure 10e depicts a situation where the robot has learned to avoid the obstacles (CS = 0). Note that the low-pass component of the bandpass filters smoothes the rather noisy range finder signals, which substantially adds to the robustness of the algorithm. Furthermore, pure noise signals are not correlated to other sensor signals and do not contribute to learning. The change of the weights in the robot example shall now be compared with the results from the simulations. We find that the weights approximately stabilize in our robot experiments (see Figure 11). Their actual values depend on the solution found. The situation in the robot experiment, however, is more complicated than in the simulations shown earlier, because the ds- and dφ-neurons get signals from more than two sensors at the same time.
Thus, often triplets of temporal correlations exist; for example, during a slanted wall approach, we first obtain a signal from the right, then one from the left range finder, and finally one from the right collision sensor. After successful learning, the collision sensor remains silent, but we are left with sequences of range finder events. Thus, learning continues, though at a smaller rate, even after the last collision. As a central observation, this shows that our system continues to operate without a designated reference signal (because x_0 is zero now). Learning continues between the remaining inputs. This can be seen in Figure 11 when looking at the development of the weight from the left range finder to dφ, which continues to change after the last collision has occurred (at t = 85 s). Ultimately, the earlier of the two range finder signals would dominate, but this will lead to a stable situation only for very simple (e.g., circular) trajectories, where an unchanging relation between both range finder signals is forced on the robot. An equivalent reward-retrieval situation has also been simulated. These results shall not be presented here in order to limit the length of this article, but they can be viewed on-line at http://www.cn.stir.ac.uk/predictor/animat/.
Figure 10: (a) Manually reconstructed robot movement trace in an arena (240 cm × 200 cm) with three obstacles (shaded) at the onset of learning. Motors were not entirely balanced, leading to a curved start of the trajectory. Many collisions (solid circles show forward collisions and dashed circles backward collisions) occur, and trapping at obstacles happens. After a collision, a fast reflex-like retraction and turning reaction is elicited. (b) Robot movement trace after successful learning of the temporal correlation between signals at RF and CS. No more collisions occur; the trajectory is smooth. A complete movie of this trial can be viewed at http://www.cn.stir.ac.uk/predictor/real—movie 1. (c–e) Signals at RF (top), CS (middle), and motor control signal ds (bottom) for different learning stages. (c) Signals occurring at the early collision marked c in part a. A stereotyped motor reaction is elicited in response to the CS signal. (d) Signals occurring at the late collision d. Motor reactions occur in response to RF but are not sufficient to avoid the collision. When it occurs, a strong motor reaction is again elicited. (e) Signals occurring at the curve marked e in part b. Smooth motor reactions occur in response to RF; CS remains silent because no collision occurs.
Figure 11: (a) Complete motor signal traces for ds and dφ and (b) development of the synaptic weights for the same trial as in Figure 10.
4 Discussion

In this study, we have developed an isotropic algorithm for sequence order learning (ISO learning) in which learning relies only on the temporal order of its inputs. This has the advantage that all input signals are treated equally and that learning takes place between all of them. Thus, it represents a form of unsupervised sequence order learning.

4.1 Basic Properties. ISO learning is driven only by the temporal relation between pre- and postsynaptic signals. As a consequence, our learning rule is related to learning based on spike-timing-dependent plasticity (STDP; Gerstner, Kempter, van Hemmen, & Wagner, 1996; Gerstner, Kreiter, Markram, & Herz, 1997; Markram et al., 1997; Zhang et al., 1998; Bi & Poo, 1998; Roberts, 1999; Xie & Seung, 2000; Kistler & van Hemmen, 2000; Song, Miller, & Abbott, 2000; Song & Abbott, 2001; Fu et al., 2002), but our algorithm uses time-continuous functions and not spike trains as input signals. The measured curves for STDP are based on the relation between individual (pre- and postsynaptic) spikes. Curves with different characteristic shapes have been observed (Abbott & Nelson, 2000), such as one similar to our Figure 2, but inverted versions of it have also been measured. Thus, the question arises how these curves relate to the average firing rate of the neurons (Gerstner et al., 1997). This question was specifically addressed in the studies of Roberts (1999) and Xie and Seung (2000), who found that the temporal derivative of the postsynaptic impulse rate directly relates to the temporal Hebbian STDP curve, or with sign inversion to the anti-Hebbian curve. This shows that a computational link exists between rate-based ISO learning and the spike-based STDP results, because we found that the Hebbian STDP curve is obtained analytically by integrating the ISO learning rule over time (see equations 2.10 through 2.14).
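This computational link can be illustrated numerically: under the ISO rule, the initial weight change as a function of the pulse separation T is positive when the predictive input leads and negative when it lags, which is the antisymmetric, STDP-like shape. A self-contained sketch (the resonator parametrization is an assumption; all values are illustrative):

```python
import numpy as np

def damped_sine(f, q, n):
    """Assumed resonator impulse response h(t) = exp(-a t) * sin(b t)."""
    t = np.arange(n, dtype=float)
    a = np.pi * f / q
    b = np.sqrt((2 * np.pi * f) ** 2 - a ** 2)
    return np.exp(-a * t) * np.sin(b * t)

h = damped_sine(0.01, 1.0, 2000)   # predictive response u1
dh = np.gradient(h)

def weight_change(T):
    """Initial ISO weight change for pulse separation T: the correlation
    of u1 = h(t) with the derivative of the reflex response u0 = h(t - T)."""
    u0p = np.zeros_like(h)
    if T >= 0:
        u0p[T:] = dh[: len(h) - T]
    else:
        u0p[:T] = dh[-T:]
    return float(np.sum(h * u0p))

# Positive T (x1 precedes x0) gives potentiation; negative T gives
# depression: the antisymmetric, STDP-like curve discussed in the text.
print(weight_change(15), weight_change(-15))
```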
In the second part of this study, we introduced a closed-loop situation by means of behavioral feedback. We implemented a primary reflex loop, which is distinguished from all other inputs only by the fact that it initially carries the largest synaptic weight. In general, such closed reflex loops have the disadvantage that any reaction will occur only after an incoming sensor event. This inherent disadvantage of feedback loops leads to a general objective for improving animal behavior, which is to find a mechanism that prevents the reflex (Palm, 2000; Wolpert & Ghahramani, 2000). Sequence order learning can achieve this by creating earlier, anticipatory actions. In addition, we have shown that weights stabilize as soon as the reflex has been successfully avoided. Due to the isotropy of the inputs, any other input line can take on the role of the reference signal during learning, and the initial reflex can even be unlearned or reduced in strength, a situation observed in many physiological reflexes.

4.2 Practical Aspects. A convenient aspect of ISO learning that leads to a very limited computational effort is the use of infinite impulse response filters in our approach. With such filters, it is possible to generate a smooth and long-lasting response when using only two resonators. Such a response can bridge a very long temporal difference T and is therefore able to generate a basic predictive reaction. Additional filters contribute to increasingly precise timing. Thus, the basic temporal correlation between X_1 and X_0 can be established by one filter (N = 1) and then improved by adding more and more filters. In other approaches (Sutton & Barto, 1981, 1988; Klopf, 1988), delay lines are often used for the predictive input X_1. The discrete structure of these algorithms requires many more delay elements as compared to the analog operating ISO learning, because they need a delay element for every unit time step.
Thus, the computational effort is much higher if a broad temporal range has to be covered.

4.3 Evaluative versus Nonevaluative Models. A fundamental difference exists between reference-based (reward-based) algorithms (e.g., TD learning) and so-called drive reinforcement algorithms, such as differential Hebbian learning and ISO learning. Probably the most influential method for reference-based temporal sequence learning is the TD learning algorithm (Sutton, 1988). TD learning has the goal of generating an output v, which predicts a reference (reward r) with the help of its (sensorial) input signals x. This goal is achieved by minimizing a prediction error δ between reference and output. Thus, learning relies on the predefined reference, which acts like a teacher signal in supervised learning. The direct comparison between the two algorithms shows that the reference (reward) pathway and the error calculation of TD learning are replaced by the reflex pathway in our algorithm. Mathematically, the reflex pathway is not functionally distinct from the other pathways in ISO learning;
however, it drives the output with an initially strong weight. As described in section 2.2.1, the strongest input dominates the learning behavior of the other inputs and weights. Klopf (1988) called this drive reinforcement learning. In appendix B, we compare the different drive reinforcement models in the literature. Here, we just note that our algorithm belongs to the class of purely unsupervised learning algorithms, as opposed to TD learning, which is supervised, using the reference (reward) as teaching signal. The initially strong weight in the reflex pathway of ISO learning can be interpreted as a boundary condition, preventing the output from becoming arbitrary. Introducing boundary conditions is typical practice in unsupervised, especially Hebbian, learning (Miller, 1996). The structural differences between our learning algorithm and TD learning suggest different neuronal substrates. The TD learning circuit consists of two components: the predictive circuit and the error signal circuit. Usually, these two circuits are identified with different neuronal subsystems: the error circuit with the dopamine system and the predictive circuit with cortical or other dopamine-modulated brain areas. Strong supporting evidence is found in Schultz et al. (1997), and it is also known that reward-based learning plays a substantial role in animal behavior, such as during instrumental (operant) conditioning paradigms and action planning (Dayan & Abbott, 2001). Our algorithm, on the other hand, suggests only one neuronal circuit because all pathways are equivalent, as supported by Hauber, Bohn, and Grietler (2001). It is conceivable that such a system coexists with the reward-based learning systems, because in an autonomous agent, any reward-based system needs to be bootstrapped by first correlative experiences, such as those used by our system to drive learning.
In the one-circuit scenario, our learning rule, based on the temporal relation between pre- and postsynaptic signals, would have to be represented by internal neuronal variables like NMDA dynamics or the Ca2+ concentration. The direct relation of our learning rule to the shape of the STDP curves (see Figure 2b) indicates that it should be relatively straightforward to redesign our model into a biophysically more realistic one, which directly relies on such internal neuronal variables and uses spike trains as inputs. This has recently been attempted by Rao and Sejnowski (2001) using the TD learning algorithm, but the relation between TD learning and STDP is less direct, and, accordingly, the transition between those two models is a bit more intricate (Dayan, 2002). When talking to specialists in the field of temporal sequence learning, we were asked to explain to what degree our learning rule is different from the one used in TD learning. This aspect is quite technical, and we refer readers to appendix B for an in-depth discussion.

4.4 Closed-Loop Condition. Hebbian learning rules like the one used here belong to the class of unsupervised learning rules. Unsupervised learning seems to be the obvious choice for creating the first and earliest stages of autonomous behavior, because it does not require external (teacher-like) knowledge. Instead, it relies purely on self-organization based on the correlation structure of the inputs. Such unguided self-organization processes, however, can also lead to a situation where nonsensical correlations are learned, leading in the end to an undesired network behavior. The standard solution to avoid this problem is the introduction of boundary conditions, which keep the self-organization process within sensible margins. In practice, this is done either heuristically by the network designer or, as a better choice, boundary conditions are introduced such that they intrinsically (and in a natural way) represent the structure of the problem to which the self-organization process is applied. In the case of our unsupervised temporal sequence learning algorithm, the same is achieved by embedding the learning circuit in an environment that leads to a closed-loop situation. The causal relation that naturally exists between many different pairs of sensor events (e.g., pain follows heat, taste follows smell) as described in section 1 creates an implicit boundary condition for our algorithm by using the latest incoming event (the one that drives the reflex) as the temporal reference for learning. The environment has two properties in our model: it provides feedback, and it contains disturbances, but very clearly it does not provide any reward or any other teaching signal. Klopf (1988) called this feedback loop nonevaluative since there is nothing in the environment that evaluates the organism's performance. Instead, here ISO learning becomes self-referenced (von Foerster, 1960; Maturana and Varela, 1980): the actions of the learner influence its own learning without any evaluation process.
Robotics is the discipline that can clarify the concepts of autonomous behavior and interaction with a complex environment quite naturally (Brooks, 1997). In the field of temporal sequence learning, Verschure has been working for over 10 years on robot applications (Verschure & Pfeifer, 1992; Verschure & Voegtlin, 1998). In his words, every organism undergoes three steps of development: prewired reflex (fixed connections), adaptive control (classical Hebbian learning of sequences of sensor inputs), and reflective, contextual control (goal-oriented learning). In Verschure's terminology, adaptive control has no goals but builds up temporal associations with "proximal" and "distal" sensors. At the stage of reflective control, a goal is introduced in the form of a reward or punishment when, for example, an object has successfully been found. Our study shows that this distinction may be too rigid. The behavioral pattern observed in our robot seems to be punishment guided, which would place it at the advanced level of reflective control. The unsupervised, nonevaluative, but self-referenced structure of the robot's interaction with the environment, however, places it at the simpler level of adaptive control. This shows that autonomous agents can develop rather complex behavioral patterns by means of simple nested feedback loop systems, without having to evaluate their own behavior. Of course, we would not argue against the importance of higher learning schemes, and it is also quite sensible to distinguish between increasingly higher levels, starting with nonevaluative schemes, which are surpassed by evaluative (reward and punishment) schemes, which are followed by contextual learning and finally by the different stages of cognitive learning. But it seems advisable to treat these different stages in a less separatist way, allowing for a broader transition range between them. In our companion article, we derive a theoretical treatment of the closed-loop ISO learning situation. We will show analytically that the predictive pathway learns to approximate the inverse controller of the reflex pathway, thereby creating a forward model of the control situation.

Appendix A: Plancherel's Theorem

This theorem is not widely known, so we state it here as

$$\int_0^{\infty} f_1(t)\, f_2(t)\, dt = \frac{1}{2\pi} \int_{-\infty}^{+\infty} F_1(i\omega)\, F_2(-i\omega)\, d\omega \quad \text{(A.1)}$$

$$= \frac{1}{2\pi} \int_{-\infty}^{+\infty} F_1(-i\omega)\, F_2(i\omega)\, d\omega, \quad \text{(A.2)}$$
where F is the Laplace transform of f (Stewart, 1960). If we set f1 = f2 = f, it becomes the more commonly used theorem of Parseval.

Appendix B: Comparison Between TD, ISO Learning, and Other Differential Hebbian Learning Algorithms

One has to distinguish drive reinforcement models such as those by Sutton and Barto (1981), Klopf (1986, 1988), and Kosko (1986) (to which ISO learning also belongs) from reference-based reinforcement models such as TD learning (Sutton, 1988; Dayan & Sejnowski, 1994). We first discuss the differences between the different drive reinforcement models, which are shown in Figures 12a through 12c. The central difference between ISO learning and the other techniques is that in ISO learning, all inputs are filtered before they are summed at the output neuron. In the other algorithms, the inputs are summed in an unfiltered way. A low-pass filter is applied only to the conditioned stimulus when it enters the learning pathway (i.e., before the correlator ×). This leads to a fundamentally different behavior of ISO learning because, as a result, ISO learning produces an orthogonal behavior between input and output (after filtering, the inputs resemble sine waves; thus, the derivative of the output resembles a sum of cosines; see Figure 2). This orthogonal behavior, which crucially relies on the filtering of all pathways, leads to the inherent and desired
Figure 12: Comparison of (a–c) three drive reinforcement algorithms and (d) TD learning in Laplace notation. Transfer functions are denoted as H and K, and the derivative operator as s. X0 represents the unconditioned and X1 the conditioned input. The amplifier symbol denotes the changing synaptic weight. Note that c is drawn with a fixed weight at X0 to make it more easily comparable to the other diagrams. All models use a derivative of the postsynaptic signal in order to control the weight change. Both Sutton and Barto models (a, d) use low-pass filters K only in the conditioned pathway. Klopf's model (b) is identical to model (a) with the exception of an additional temporal derivative at x1. Only in ISO learning are all inputs filtered, which together with the output derivative generates orthogonal behavior, leading to weight stabilization. For further explanation, see the text.
weight stabilization property of ISO learning, which does not arise in the other drive reinforcement algorithms without additional measures taken. Furthermore, we note that Klopf has introduced an additional derivative at the conditioned input, because he focuses on signal changes. TD learning belongs to yet another category of sequence learning algorithms. The difference arises from the fact that TD is evaluative (reference based; see Figure 12d), whereas the drive reinforcement models operate in a nonevaluative way. We have already discussed this elementary difference
at great length. Here, we focus on the aspect that TD learning also uses some kind of derivative, which suggests a strong structural similarity between the TD and ISO learning methods. In spite of this apparent similarity, however, our approach is more strongly related to Kalman filtering than to TD learning. This is due to the combination of linear filtering and applying a derivative. The predictive property of our algorithm thereby arises from the fact that every low-pass filtered function is smooth, which leads to the situation that its derivative linearly predicts its future development. This property of low-pass filtered signals is well known in signal theory and is mainly used in Kalman filter theory (Bozic, 1994). TD learning instead calculates a temporal difference error δ (similar to the famous δ-rule by Widrow & Hoff, 1960) by means of subtracting subsequent output values from each other and relating this error value to the reward: δ(t) = r(t) + v(t + 1) − v(t). The second group of terms seems to be related to the derivative used in our approach. This mathematical similarity, however, carries a distinctively different interpretation, which can be understood as follows: The goal of TD learning is that the output v(t) at any point in time should predict the total remaining reward,

$$v(t) = \sum_{s \geq t}^{T} r(s), \quad \text{(B.1)}$$
at the end of learning. Take the example of a rat exploring a maze where at each intersection a decision about a turn has to be made, creating a temporal sequence of events. Each turn leads to a different reward (e.g., food) to be picked up along the way. This clarifies the concept of the total remaining reward until the end of the maze is reached at T. Furthermore, it is known that the total remaining reward can be iteratively approximated using the next following prediction value v(t + 1) to yield something like the total remaining expected reward:

$$\sum_{s \geq t}^{T} r(s) \approx r(t) + v(t + 1) =: e(t, t + 1). \quad \text{(B.2)}$$
During learning, this total remaining expected reward e is compared with its actual prediction v in order to define the prediction error δ. Thus, δ(t) = e(t, t + 1) − v(t), leading to the apparent similarity of the resulting temporal difference terms v(t + 1) − v(t) in TD learning with the derivative we used. From this interpretation, however, it is quite clear that the term v(t + 1) arises only in conjunction with r(t). This kind of conjunction cannot be found in our algorithm because it is reward free. Furthermore, the structure of TD learning is acausal, looking forward in time using v(t + 1) to calculate δ(t). This and the reward-based structure of TD learning make it rather difficult to associate it with STDP, as attempted by Rao and Sejnowski (2001)
(see Dayan, 2002, for discussion). Thus, the formal similarities between both rules do not seem to warrant treating them as equal. These interpretations continue to hold for the time-continuous version of TD learning designed by Doya (2000).
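The role of the conjunction r(t) + v(t + 1) can be seen in a minimal tabular sketch (our own toy example with made-up numbers, not from the article): iterating the TD update drives every v(t) toward the total remaining reward of equation B.1, here a single reward of 1 delivered at the last step of a five-step sequence.

```python
import numpy as np

# Five-step episode with a single reward at the end (hypothetical numbers).
r = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
v = np.zeros(6)            # v[5] is the terminal prediction, fixed at 0
alpha = 0.1                # learning rate

for _ in range(1000):      # repeated presentations of the sequence
    for t in range(5):
        delta = r[t] + v[t + 1] - v[t]   # delta(t) = r(t) + v(t+1) - v(t)
        v[t] += alpha * delta

# After learning, v(t) approximates the total remaining reward (eq. B.1):
# every v[t], t = 0..4, is close to 1.
```

The update only ever pairs v(t + 1) with r(t); removing the reward channel removes the teaching signal, which is exactly the conjunction that has no counterpart in ISO learning.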
Appendix C: The Robot

• Hardware. A modified commercial robot ("rug warrior," 16 cm diameter) was used. Two active wheels are driven by DC motors, and steering is achieved through different DC levels. The average speed was adjusted to 0.45 m/s using a control parameter c = 0.6. In order to detect mechanical contact, the robot has three microswitches, CS_l, CS_r, CS_b, in a triangular configuration (see Figure 9). Visual signals are generated by two multiplexed, infrared-emitting, active range finders RF_l, RF_r with an angle of 70 degrees between them. Infrared reflection is detected by an infrared sensor centered between the emitters, which operates in synchrony with them. The detection range was adjusted to 0.5 to 15.0 cm. Interfacing between robot and computer is done tethered via a conventional I/O card.

• Sensor characteristics. Sensor signals are bandpass filtered, as in many biological systems. This is achieved by feeding the raw signal into a bandpass filter with transfer function h(t). The output functions of the bandpass filters are denoted as u_k and normalized to one. The bandpass characteristics h_r(t) of all collision sensors are identical, with Q = 0.6 and f = 1 Hz. The signal from each vision sensor is fed in parallel into a filter bank of 10 bandpass filters. Its frequencies are set to f_k = 10k Hz, k = 1, 2, ..., 10; Q is set to 1.0 throughout. The filter bank approach assures that large and varying temporal intervals between vision sensor and collision sensor signals can be covered.

• Neuronal circuitry. The robot has two neurons: one controls the speed ds, the other the steering angle dφ. Normal operation is straightforward motion (ds = const, dφ = 0). Both neurons receive inputs from all sensors in a direct feedforward connectivity.

• Unconditioned retraction reaction. The unconditioned retraction reaction uses only the collision sensor signals.
These signals drive the output neurons in such a way that an avoidance movement with a motion vector pointing away from the site of stimulation is elicited.

• Learning. All bandpass filter outputs u_k from the collision and the vision sensors converge onto both neurons, where they are summed according to their synaptic weights ρ_k. The change of the weights is achieved by learning rule equation 2.2. The constant μ is set to 0.00002.

• Neuronal output (resulting from the unconditioned retraction reaction and learning). The output of the neurons is defined as ds = c − ρ_0^{ds} [h_r(t) ∗ (CS_l + CS_r − CS_b)] + l_{ds} and dφ = ρ_0^{dφ} [h_r(t) ∗ (CS_l − CS_r)] + l_{dφ}. The asterisk
denotes a convolution operation. The variables l_{ds} and l_{dφ} represent the total sum of all learned contributions that converge onto the ds- and dφ-neuron, respectively. Learning follows equation 2.2. The synaptic weights of the unconditioned reaction are initially set to ρ_0^{ds} = 0.15 and ρ_0^{dφ} = −0.5.

Acknowledgments

We are grateful to Christian von Ferber, Leslie Smith, Richard Reeve, and the members of the CCCN seminar for their helpful comments during various stages of this work. This study was supported by grants from SHEFC RDG INCITE and by the European funding ECOVISION.

References

Abbott, L., & Blum, K. (1996). Functional significance of long-term potentiation for sequence learning and prediction. Cereb. Cortex, 6, 406–416.
Abbott, L., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience (Suppl.), 3, 1178–1179.
Ashby, W. R. (1956). An introduction to cybernetics. London: Methuen.
Bi, G.-q., & Poo, M.-m. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18(24), 10464–10472.
Bozic, S. M. (1994). Digital and Kalman filtering: An introduction to discrete-time filtering and optimum linear estimation. London: E. Arnold.
Brooks, R. A. (1989). How to build complete creatures rather than isolated cognitive simulators. In K. VanLehn (Ed.), Architectures for intelligence (pp. 225–239). Hillsdale, NJ: Erlbaum.
Brooks, R. A. (1997). Intelligence without representation. In H. John (Ed.), Mind design II (pp. 395–420). Cambridge, MA: MIT Press.
Dayan, P. (2002). Matters temporal. Trends in Cognitive Sciences, 6(3), 105–106.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Dayan, P., Kakade, S., & Montague, P. R. (2000). Learning and selective attention. Nature Neuroscience (Suppl.), 3, 1218–1223.
Dayan, P., & Sejnowski, T. (1994). TD(λ) converges with probability 1. Mach. Learn., 14(3), 295–301.
Doya, K. (2000).
Reinforcement learning in continuous time and space. Neural Networks, 12(1), 219–245.
Fu, Y.-X., Djupsund, K., Gao, H., Hayden, B., Shen, K., & Dan, Y. (2002). Temporal specificity in the cortical plasticity of visual space representation. Science, 296, 1999–2003.
Gerstner, W., Kempter, R., van Hemmen, L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstner, W., Kreiter, A. K., Markram, H., & Herz, A. V. (1997). Neural codes: Firing rates and beyond. Proc. Natl. Acad. Sci. USA, 94, 12740–12741.
Grossberg, S. (1995). A spectral network model of pitch perception. J. Acoust. Soc. Am., 98(2), 862–879.
Grossberg, S., & Merrill, J. (1996). The hippocampus and cerebellum in adaptively timed learning, recognition and movement. J. Cogn. Neurosci., 8, 257–277.
Grossberg, S., & Schmajuk, N. (1989). Neural dynamics of adaptive timing and temporal discrimination during associative learning. Neural Networks, 2, 79–102.
Guo-Quing, B., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampus neurons. J. Neurosci., 18(24), 10464–10472.
Haruno, M., Wolpert, D. M., & Kawato, M. (2001). Mosaic model for sensorimotor learning and control. Neural Comp., 13, 2201–2220.
Hauber, W., Bohn, I., & Grietler, C. (2001). NMDA, but not dopamine D2 receptors in the rat nucleus accumbens are involved in guidance of the instrumental behaviour by stimuli predicting reward magnitude. J. Neurosci., 20(16), 6282–6288.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological study. New York: Wiley-Interscience.
Kandel, E., Abrams, T., Bernier, L., Carew, T., Hawkins, R., & Schwartz, J. (1983). Classical conditioning and sensitization share aspects of the same molecular cascade in Aplysia. Cold Spring Harb. Symp. Quant. Biol., 48(2), 821–830.
Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with the timing of pre- and postsynaptic action potentials. Neural Comp., 12, 385–405.
Klopf, A. H. (1986). A drive-reinforcement model of single neuron function. In J. S. Denker (Ed.), Neural networks for computing: AIP conference proceedings (Vol. 151). New York: American Institute of Physics.
Klopf, A. H. (1988). A neuronal model of classical conditioning. Psychobiol., 16(2), 85–123.
Kosko, B. (1986). Differential Hebbian learning. In J. S. Denker (Ed.), Neural networks for computing: AIP conference proceedings (Vol. 151). New York: American Institute of Physics.
Levy, W. B., & Minai, A. A. (1993). Sequence learning in a single trial.
In Proceedings of the 1993 INNS World Congress on Neural Networks II (pp. 505–508). Hillsdale, NJ: Erlbaum.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Maturana, H., & Varela, F. J. (1980). Autopoiesis and cognition: The realization of the living. Dordrecht: Reidel.
McGillem, C. D., & Cooper, G. R. (1984). Continuous and discrete signal and system analysis. New York: CBS Publishing.
Miller, K. D. (1996). Receptive fields and maps in the visual cortex: Models of ocular dominance and orientation columns. In E. Domany, J. van Hemmen, & K. Schulten (Eds.), Models of neural networks III (pp. 55–78). New York: Springer-Verlag.
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1993). Foraging in an uncertain environment using predictive Hebbian learning. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15(3), 267–273.
Palm, W. J. (2000). Modeling, analysis and control of dynamic systems. New York: Wiley.
Rao, R. P., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Comp., 13, 2221–2237.
Roberts, P. D. (1999). Temporally asymmetric learning rules: I. Differential Hebbian learning. J. Comput. Neurosci., 7(3), 235–246.
Saudargiene, A., Porr, B., & Wörgötter, F. (2003). Biophysical evaluation of a linear model for temporal sequence learning: ISO-Learning revisited. Manuscript in preparation.
Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Schultz, W., & Suri, R. E. (2001). Temporal difference model reproduces anticipatory neural activity. Neural Comp., 13(4), 841–862.
Shepherd, G. M. (Ed.). (1990). The synaptic organisation of the brain. New York: Oxford University Press.
Song, S., & Abbott, L. (2001). Column and map development and cortical remapping through spike-timing dependent plasticity. Neuron, 32, 339–350.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Stewart, J. L. (1960). Fundamentals of signal theory. New York: McGraw-Hill.
Sutton, R. (1988). Learning to predict by the method of temporal differences. Machine Learning, 3(1), 9–44.
Sutton, R., & Barto, A. (1981). Toward a modern theory of adaptive networks: Expectation and prediction. Psychol. Review, 88, 135–170.
Sutton, R., & Barto, A. (1988).
Simulation of anticipatory responses in classical conditioning by a neuron-like adaptive element. Behav. Brain Res., 4(3), 221–235.
Traub, R. D. (1999). Fast oscillations in cortical circuits. Cambridge, MA: MIT Press.
Verschure, P. F., & Pfeifer, R. (1992). Categorization, representations, and the dynamics of system-environment interaction: A case study in autonomous systems. In H. Roitblat, J. Meyer, & S. Wilson (Eds.), Proceedings of the Second International Conference on Simulation of Adaptive Behaviour (pp. 210–217). Cambridge, MA: MIT Press.
Verschure, P., & Voegtlin, T. (1998). A bottom-up approach towards the acquisition, retention, and expression of sequential representations: Distributed adaptive control III. Neural Networks, 11, 1531–1549.
von Foerster, H. (1960). On self-organizing systems and their environments. In M. Yovits & S. Cameron (Eds.), Self-organizing systems (pp. 31–50). New York: Pergamon Press.
Widrow, B., & Hoff, M. (1960). Adaptive switching circuits. IRE WESCON Convention Record, 4, 96–104.
Wolpert, D. M., & Ghahramani, Z. (2000). Computational principles of movement neuroscience. Nature Neuroscience (Suppl.), 3, 1212–1217.
Xie, X., & Seung, S. (2000). Spike-based learning rules and stabilization of persistent neural activity. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 199–208). Cambridge, MA: MIT Press.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.

Received April 12, 2002; accepted October 28, 2002.
LETTER
Communicated by Rajesh P. N. Rao
ISO Learning Approximates a Solution to the Inverse-Controller Problem in an Unsupervised Behavioral Paradigm

Bernd Porr
[email protected] Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland
Christian von Ferber
[email protected] Theoretical Polymer Physics, Freiburg University, 79104 Freiburg, Germany
Florentin Wörgötter
[email protected] Department of Psychology, University of Stirling, Stirling FK9 4LA, Scotland
In "Isotropic Sequence Order Learning" (pp. 831–864 in this issue), we introduced a novel algorithm for temporal sequence learning (ISO learning). Here, we embed this algorithm into a formal nonevaluating (teacher-free) environment, which establishes a sensor-motor feedback. The system is initially guided by a fixed reflex reaction, which has the objective disadvantage that it can react only after a disturbance has occurred. ISO learning eliminates this disadvantage by replacing the reflex-loop reactions with earlier anticipatory actions. In this article, we analytically demonstrate that this process can be understood in terms of control theory, showing that the system learns the inverse controller of its own reflex. Thereby, this system is able to learn a simple form of feedforward motor control.

1 Introduction

In our companion article in this issue, "Isotropic Sequence Order Learning," we introduced a novel, linear, and unsupervised algorithm for temporal sequence learning, which we called ISO learning. ISO learning has the special feature that all sensor inputs are completely isotropic, which means that any input can drive the learning behavior. We used the algorithm to generate robot behavior by means of sensor inputs and motor actions. While the organism transforms sensor events into motor actions, the environment passively performs the opposite and forms together with the organism a closed sensor-motor feedback loop system. Here, we explain the system-theoretical consequences of this.

Neural Computation 15, 865–884 (2003)
© 2003 Massachusetts Institute of Technology
ISO learning is completely unsupervised, and the output is self-organized. Unsupervised temporal sequence learning, however, usually leads, without additional measures taken, to rather undesired situations for the organism, since it can learn arbitrary behavioral patterns. A fixed reflex loop prevents arbitrariness by defining an initial behavioral goal (Verschure & Voegtlin, 1998). A reflex, however, is a typical reaction, which will always occur only after its eliciting sensor event (Wolpert & Ghahramani, 2000). ISO learning leads to the functional elimination of the reflex loop by using predictive sensorial cues and generating appropriate anticipatory actions to prevent the triggering of the reflex. We will see that these qualitative observations can be embedded in a control-theoretical framework. In the field of control theory, a reflex loop is represented by a fixed feedback loop (McGillem & Cooper, 1984; D'Azzo, 1988; Nise, 1992; Palm, 2000). Feedback loops try to maintain a desired state by comparing the actual input values with a predefined state and adjusting the output so that the desired state is optimally maintained. The main advantage of a feedback loop is that the controller needs only very limited knowledge about the relation between input and output (the environment). Consider the typical example of a thermostat-controlled central heating system. There, it is necessary only to measure the temperature at the thermostat and use this to control the furnace; it is not necessary to know how much fuel needs to be burned to get a certain temperature increase. Even this is not enough to control the heating, because the temperature increase also depends on the existing inside-outside temperature gradient and maybe on more elusive parameters. In general, only in idealized situations does there exist sufficient prior knowledge to control a system without feedback, thus by means of pure feedforward control.
The central advantage of such an (ideal) feedforward controller, however, is that it acts without the feedback-induced delay. The sometimes fatally damaging sluggishness of feedback systems makes this a highly desirable feature. As a consequence, engineers try to replace feedback controllers with their equivalent feedforward controllers wherever possible, thereby trying to solve the famous inverse-controller problem (Nise, 1992). In this study, we analytically prove that ISO learning approximates the inverse controller of a reflex when embedded in a behavioral situation where the reflex represents the reference for self-organized predictive learning. The article is organized in the following way. Very briefly, we summarize the main equations from our other article in this issue. Then we introduce the necessary terminology from control theory by means of discussing the reflex-loop situation. After that, we show which shape a transfer function must take in order to approximate the inverse controller of the reflex. In the next step, we demonstrate that the set of functions used in ISO learning can indeed approximate this transfer function. Finally, we show why the actual learning process converges to the correct solution.
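The contrast between the two schemes can be sketched in a few lines (a toy example with made-up numbers, not from the article): a proportional feedback controller reaches the set point without knowing the plant gain, but it corrects a disturbance only after the error has appeared, whereas an ideal feedforward controller cancels the disturbance within the same step, at the price of needing an exact plant model.

```python
def feedback_step(x, setpoint, k_p, plant_gain, disturbance=0.0):
    # Feedback: the controller sees only the error, not the plant gain.
    u = k_p * (setpoint - x)
    return x + plant_gain * u + disturbance

def feedforward_step(x, plant_gain_model, disturbance):
    # Ideal feedforward: cancel the (known) disturbance through an exact
    # inverse model of the plant; no error ever appears at the sensor.
    u = -disturbance / plant_gain_model
    return x + plant_gain_model * u + disturbance

# The feedback loop converges to the set point for a range of unknown
# plant gains, but a sudden disturbance is corrected only over subsequent
# steps; the feedforward controller suppresses it within the same step.
x = 0.0
for _ in range(200):
    x = feedback_step(x, setpoint=20.0, k_p=0.5, plant_gain=1.5)
```

The delayed disturbance rejection of the feedback loop is exactly the "feedback-induced delay" that the learned inverse controller is meant to remove.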
2 The ISO Learning Algorithm: A Brief Summary

The system consists of N + 1 linear filters h receiving inputs x and producing outputs u. The filters connect with corresponding weights ρ to one output unit v (see Figure 1). The output v(t) in the time domain and its transformed equivalent V(s) in the Laplace domain are given as

$$v(t) = \rho_0 u_0 + \sum_{k=1}^{N} \rho_k u_k \;\longleftrightarrow\; V(s) = \rho_0 U_0 + \underbrace{\sum_{k=1}^{N} \rho_k U_k}_{H_V}. \quad \text{(2.1)}$$
The transfer functions h shall be those of bandpass filters, which transform a δ-pulse input into a damped oscillation. They are specified in the time and in the Laplace domain by

$$h(t) = \frac{1}{b}\,e^{at}\sin(bt) \quad\leftrightarrow\quad H(s) = \frac{1}{(s+p)(s+p^*)}, \tag{2.2}$$

where p* represents the complex conjugate of the pole p = a + ib, with

$$a := \operatorname{Re}(p) = -\pi f / Q, \qquad b := \operatorname{Im}(p) = \sqrt{(2\pi f)^2 - a^2}. \tag{2.3}$$
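As a sanity check on equations 2.2 and 2.3, the following short sketch (our own illustration, not part of the original articles) computes the pole of such a bandpass resonator and its impulse response; the chosen values f = 1 and Q = 0.9 are arbitrary:

```python
import math

def pole(f, Q):
    """Pole p = a + ib of the bandpass resonator, following eq. 2.3."""
    a = -math.pi * f / Q                              # Re(p): decay rate (negative)
    b = math.sqrt((2.0 * math.pi * f) ** 2 - a ** 2)  # Im(p): oscillation rate
    return a, b

def h(t, f, Q):
    """Impulse response h(t) = (1/b) e^{at} sin(bt) of eq. 2.2, for t >= 0."""
    a, b = pole(f, Q)
    return (1.0 / b) * math.exp(a * t) * math.sin(b * t)

a, b = pole(1.0, 0.9)  # arbitrary example values
```

Since a < 0, a δ-pulse input indeed yields a damped oscillation: h starts at zero, rings at frequency b/2π, and decays with time constant 1/|a|.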
f is the frequency of the oscillation and Q the damping characteristic. Learning takes place according to

$$\frac{d}{dt}\rho_j = \mu\, u_j\, v', \qquad \mu \ll 1, \tag{2.4}$$
Figure 1: The neuronal circuit in the Laplace domain. The shaded area marks the connections of the weights ρ_k with k ≥ 1 onto the neuron. The overall contribution from these inputs is called H_V.
B. Porr, C. von Ferber, and F. Wörgötter
where v′ is the temporal derivative of v. (For a comparison of ISO learning with other models of temporal sequence learning, see appendix B in the companion article in this issue.) Note that μ is very small. The integral form of this learning rule is given in the Laplace domain by

$$\Delta\rho_j = \frac{\mu}{2\pi}\int_{-\infty}^{\infty} -i\omega\, V(-i\omega)\, U_j(i\omega)\, d\omega \qquad\text{with } U = XH. \tag{2.5}$$
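A minimal discrete-time sketch of the rule in equation 2.4 may help make it concrete. This is our own illustration: the damped-sine input traces and all constants are hypothetical stand-ins, not the paper's setup. Only ρ₁ is plastic, and its change per step is μ u₁ v′ dt, with v′ approximated by a finite difference:

```python
import math

mu, dt, steps = 0.005, 0.01, 2000

def trace(t0):
    """Hypothetical damped-sine input starting at time t0 (stands in for u)."""
    return lambda t: math.exp(-(t - t0)) * math.sin(5.0 * (t - t0)) if t >= t0 else 0.0

u0 = trace(1.0)  # reflex input, arrives late
u1 = trace(0.5)  # predictive input, arrives earlier

rho0, rho1 = 1.0, 0.0  # reflex weight fixed; predictive weight learns
v_prev = 0.0
for n in range(steps):
    t = n * dt
    v = rho0 * u0(t) + rho1 * u1(t)  # output unit, as in eq. 2.1
    v_dot = (v - v_prev) / dt        # finite-difference approximation of v'
    rho1 += mu * u1(t) * v_dot * dt  # Euler step of the ISO rule, eq. 2.4
    v_prev = v
```

Because u₁ precedes u₀, the correlation of u₁ with v′ is systematically nonzero, so ρ₁ drifts away from zero; without feedback onto the input (the open-loop case discussed below), nothing would terminate this growth.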
Note that we use indices k to denote outputs (e.g., when associated with v), while indices j denote inputs (e.g., associated with u). (See the companion article for a complete description of the ISO learning algorithm and its properties.)

3 Analytical Treatment of the Closed-Loop Condition

3.1 Reflex Loop Behavior. Every closed-loop control situation with negative feedback has a so-called desired state; the goal of the control mechanism is to maintain (or reach) this state as well and as fast as possible. In our model, we assume that the desired state of the reflex feedback loop is unchanging and defined by the properties of the reflex loop. We define it as X₀ = 0. First, we discuss the system without learning. Figure 2a shows the situation of a learner embedded into a very simple but generic (i.e., unspecified) formal environment, which has a transfer function P₀. This learner is able to react to an input only by means of a reflex. A possible set of signals that can occur in such a system is shown in Figure 2b. First, the disturbance signal d deviates from zero; then the input x₀ senses this change, x₀ ≠ 0; and finally the motor output v can generate a reaction in order to restore the desired state x₀ = 0. Thus, there is always a reaction delay in such a system.

3.2 Augmenting the Reflex by Temporal Sequence Learning. In this section, we show that the ISO learning algorithm can approximate the inverse controller of the reflex. Figure 3 shows how the same disturbance D elicits a sequence of sensor events: it enters the outer loop, arriving at X₁ filtered by the environment (P₁), while it arrives at X₀ only after a delay T. The goal of learning is to generate a transfer function H_V that compensates for the disturbance. The inner structure of H_V given by the ISO learning setup is depicted in Figure 1. The environmental transfer function P₀₁ closes the outer loop.

3.2.1 General Condition.
The reflex loop defines the goal of the feedforward controller: that there should always be zero input at X₀. Thus, first we must show what shape the transfer function of the predictive pathway H_V (see Figures 1 and 3) takes when we assume that X₀ = 0 holds. This is the necessary condition, which needs to be obeyed in order to obtain an
Figure 2: (a) Fixed reflex loop. The organism transfers a sensor event X₀ into a motor response V with the help of the transfer function H₀. The environment turns the motor response V again into a sensor event X₀ with the help of the transfer function P₀. In the environment, there exists the disturbance D, which adds its signal at ⊕ to the reflex loop. (b) Possible temporal signal shapes occurring in the reflex loop when a disturbance d ≠ 0 happens. The desired state is x₀ := 0. The disturbance d is filtered by P₀, appears at x₀, and is then transferred into a compensation signal at v, which eliminates the disturbance at ⊕.
appropriate H_V. It generally applies regardless of the learning algorithm used. In the following, we omit the function argument s where possible. Then we can write

$$X_0 = P_0\,[V + De^{-sT}] \tag{3.1}$$

as the reflex pathway and

$$X_1 = \frac{P_1 D + P_1 P_{01} X_0 H_0}{1 - P_1 P_{01} H_V} \tag{3.2}$$
$$H_V = \sum_{k=1}^{N}\rho_k H_k \tag{3.3}$$

as the predictive pathway (see Figure 3). Eliminating X₁ and V, we get

$$X_0 = e^{-sT} D + H_V\,\frac{P_1 D + P_1 P_{01} X_0 H_0}{1 - P_1 P_{01} H_V}. \tag{3.4}$$
Figure 3: Schematic diagram of the augmented closed-loop feedback mechanism, which now contains a secondary loop representing ISO learning. (a) H₀ and P₀ form the inner feedback loop shown in Figure 2. The new aspect is the input line X₁, which gets its signal via the transfer function P₁ from the disturbance D. The inner feedback loop receives a delayed version (T) of the disturbance D. The adaptive controller H_V has the task of using the signal X₁, which is earlier than x₀ and thus "predicts" the disturbance D at X₀, to generate an appropriate reaction at V that prevents a change at X₀. (b) A schematic timing diagram for the situation after successful learning when a disturbance has occurred. The output v sharply coincides with the disturbance d and prevents a major change at the input x₀.
Solving for X₀ = 0 leads to

$$H_V = \sum_{k=1}^{N}\rho_k H_k \tag{3.5}$$
$$\;\;\; = -\frac{P_1^{-1}\, e^{-sT}}{1 - P_{01}\, e^{-sT}}. \tag{3.6}$$
The transfer function H_V is the overall transfer function of the predictive pathway. Equation 3.5 demands that the weights ρ_k be adjusted in such a way that equation 3.6 is obtained at the end of learning.

Equation 3.6 requires interpretation. First, we consider the numerator and remember that the learning goal is to achieve X₀ = 0. This requires compensating the disturbance D. The disturbance, however, enters the organism only after having been filtered by the environmental transfer function P₁. Thus, compensation of D requires undoing this filtering by the term P₁⁻¹, which is the inverse transfer function of the environment (hence, "inverse controller"). The second term e^{-sT} in equation 3.6 compensates the delay between the signal at X₁ and that at X₀, when the disturbance actually enters the inner feedback loop.

Now we discuss the relevance of the denominator, showing that it can generally be neglected. Transfer functions are fully described by their poles and zero crossings. Poles very strongly affect the behavior of a system, while zero crossings are phase factors, which do not alter its general transfer characteristic (Stewart, 1960; Blinchikoff, 1976; McGillem & Cooper, 1984; Terrien, 1992; Palm, 2000). As a consequence, following methods from control theory, any transfer function may be reduced to only those terms that contain poles or zero crossings by neglecting all other components (Sollecito & Reque, 1981; Nise, 1992). Thus, we rewrite equation 3.6 as

$$H_V = -P_1^{-1} e^{-sT}\,\frac{1}{1 - P_{01}\, e^{-sT}} \tag{3.7}$$
and analyze it to determine whether the second term produces additional poles for H_V. This would happen if 1 − P₀₁e^{-sT} = 0 held, which is equivalent to P₀₁ = e^{sT}. The term e^{sT}, however, is meaningless; it represents a time-inverted delay and thus an entity that violates causality. As a result, there are no additional poles for H_V, and in the following we are allowed to set P₀₁ := 0 without loss of generality, thereby neglecting only possible changes in phase relationships. Thus, the behavior of H_V is, apart from phase terms, entirely determined by¹

$$H_V = -P_1^{-1}\, e^{-sT}. \tag{3.8}$$
Equation 3.8 represents the necessary condition for the learning, and we ask in the next two sections whether our specific algorithm is sufficient to achieve this.

3.2.2 Solutions in the Steady-State Case X₀ = 0. Here we show by construction that for one resonator, there already exists a solution that approximates equation 3.8 to the second order. Results for a fourth-order approximation have been obtained numerically, showing that the approximation continues to improve. Thus, first we limit the discussion to the case of only two resonators, H₀ and H₁ (i.e., N = 1). The case with more resonators will be reintroduced at the end of this section. We specify which parameters the resonator H₁ in the outer loop must have in order to satisfy the learning goal. At first, we set P₁ = 1, looking at the case when the environment does not alter the shape of the disturbance (but see below). Considering equation 3.8, we have to solve

$$-e^{-sT} = \rho_1 H_1. \tag{3.9}$$

¹ Readers who are less familiar with control theory may find it useful to think about P₀₁ in a different way. P₀₁ represents how the environmental transfer of the reaction of the system will influence the sensor X₁. Often this influence is plainly zero from the beginning (or the connecting path can be decoupled by an appropriate system design). For example, for a predictively acting external temperature sensor X₁, the change of the temperature of the environment due to the heating of a room is totally insignificant.
The resonator H₁ has two parameters, f₁ = 1/T₁ and Q₁, and together with its weight ρ₁, we are looking for three parameters to solve this equation. The left-hand side of equation 3.9 can now be developed into a Taylor series,

$$-e^{-sT} = -\frac{1}{e^{sT}} = -\frac{1}{1 + sT + \frac{1}{2}s^2T^2 + \cdots} \approx \frac{-2T^{-2}}{2T^{-2} + 2sT^{-1} + s^2}, \tag{3.10}$$
and the right-hand side of equation 3.9 has to be explicitly written out according to equations 2.2 and 2.3:

$$\rho_1 H_1(s) = \frac{\rho_1}{(s+p)(s+p^*)} = \frac{\rho_1}{\underbrace{pp^*}_{(2\pi f_1)^2} + s\underbrace{(p+p^*)}_{-2\pi f_1/Q_1} + s^2}. \tag{3.11}$$
We can now compare the coefficients of equation 3.10 with equation 3.11 and get for the parameters

$$\rho_1 = -\frac{2}{T^2}, \qquad f_1 = \pm\frac{1}{\pi T}\sqrt{\frac{1}{2}}, \qquad Q_1 = \frac{1}{\sqrt{2}}. \tag{3.12}$$

This result shows that for all T, there exists a resonator H₁ with a weight ρ₁ that approximates e^{-sT} to the second order. The result for f can be interpreted in the context of the companion article in this issue. We remember that X₀ = 0 and hence V = ρ₁H₁X₁. If we consider a pure δ-pulse input at X₁, as in the simulations in the companion article, we receive the impulse response of the resonator h₁(t) at the output; thus:

$$v(t) = \rho_1\,\frac{1}{b_1}\sin(b_1 t)\,e^{a_1 t} \tag{3.13}$$
$$\;\;\; = \rho_1 T\sin\!\left(\frac{t}{T}\right) e^{-t/T} \tag{3.14}$$
$$\;\;\; = -\frac{2}{T}\sin\!\left(\frac{t}{T}\right) e^{-t/T}. \tag{3.15}$$
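The coefficient comparison can be checked numerically. The sketch below (our own check, with an arbitrarily chosen T) confirms that the parameters of equation 3.12 give a = −1/T and b = 1/T via equation 2.3, reproduce the denominator coefficients of equation 3.10, and that v(t) of equation 3.15 integrates to −1:

```python
import math

T = 1.5  # arbitrary delay, for illustration

# Parameters from eq. 3.12
rho1 = -2.0 / T ** 2
f1 = 1.0 / (math.sqrt(2.0) * math.pi * T)
Q1 = 1.0 / math.sqrt(2.0)

# Pole components via eq. 2.3
a = -math.pi * f1 / Q1                              # should equal -1/T
b = math.sqrt((2.0 * math.pi * f1) ** 2 - a ** 2)   # should equal 1/T

# Denominator coefficients of eq. 3.11 vs. the Taylor form of eq. 3.10
pp = (2.0 * math.pi * f1) ** 2  # p p*      -> 2/T^2
ps = -2.0 * a                   # -(p + p*) -> 2/T

# v(t) from eq. 3.15; its integral over [0, inf) should be -1
def v(t):
    return -(2.0 / T) * math.sin(t / T) * math.exp(-t / T)

dt = 1e-3
integral = sum(v(n * dt) * dt for n in range(200000))  # Riemann sum up to t = 200
```
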
This function has its maximum at t_max^{(2)} = T·arctan(1). We can assume that this is approximately equal to t_max^{(2)} ≈ T.² This, however, is indicative of a response maximum that occurs exactly at the moment when the input x₀ is to be expected. We refer readers to Figure 5 in the companion article, where this type of behavior has been observed in the simulations. We found that during learning, the output always has its first maximum at the location where x₀ occurs (or would have occurred). The strength of the resonator response, equation 3.14, is determined by the weight ρ₁, which is adjusted in such a way that the resulting integral (see equation 3.15) becomes ∫₀^∞ v(t) dt = −1, so that it has the same energy as the δ-pulse of the disturbance D and therefore optimally counteracts it. The shape of the disturbance in the form of the δ-pulse obviously cannot be reproduced by one or two resonators, but the energy (i.e., the effect) is preserved.

The final stable value of ρ₁ is the main difference between the open-loop case and the closed-loop case. While in the open-loop case the weight ρ₁ grows endlessly due to the lack of feedback (see the simulations in the companion article), in the closed-loop condition the weight ρ₁ converges to a specific value at the moment when x₀ = 0 has been achieved. As a consequence, the experimentally observed behavior of the algorithm leads to a function H_V with similar properties as that obtained from the second-order Taylor approximation. For all practical purposes, N needs to be found by resolving the trade-off between the actually needed precision for t_max → T and hardware and software engineering constraints (costs). The robot experiment in the companion article in this issue demonstrates that in a real-world application, a few resonators (N = 10) suffice to obtain the desired obstacle avoidance behavior after learning.

Now we have to consider more complex transfer functions for P₁.
Up to this point, we have set P₁ = 1, which means that the disturbance basically reaches the input X₁ unfiltered, which is in general not the case. Due to specific sensor properties and due to properties of the environment, the disturbance reaches the input X₁ in a filtered form. All of these changes can be subsumed, from the organism's point of view, by the function P₁ (and the

² The relation t_max^{(2)} ≈ T could be confirmed because we performed the same Taylor approximation with N = 2 (leading to a fourth-order Taylor approximation):

$$-e^{-sT} = \rho_1 H_1(s) + \rho_2 H_2(s) \tag{3.16}$$
$$-\frac{1}{1 + sT + \frac{1}{2}s^2T^2 + \frac{1}{6}s^3T^3 + \frac{1}{24}s^4T^4} \approx \frac{\rho_1}{(s+p_1)(s+p_1^*)} + \frac{\rho_2}{(s+p_2)(s+p_2^*)}. \tag{3.17}$$

The resulting set of equations (from comparing the coefficients) has been solved numerically, and we received a solution that leads to t_max^{(4)} = 0.978T. This suggests that t_max = T is correct in the limit of N → ∞.
same applies to P₀). We recall that we have used a Taylor approximation of equation 3.9 and matched it with the sum of resonators to obtain the coefficients. This, however, allows concluding that any transfer function P₁ of the shape

$$P_1 = \frac{(s+z_0)(s+z_0^*)\cdots(s+z_n)(s+z_n^*)}{(s+p_0)(s+p_0^*)\cdots(s+p_m)(s+p_m^*)} \tag{3.18}$$
can still (together with the delay term −e^{-sT}) be approximated by a sum of resonators, because this sum continues to take the shape of a rational function similar to that in equation 3.18.³ Such a shape of P₁, however, covers all generic combinations of high- and low-pass characteristics. Hence, it represents a standard passive transfer function. In addition, we can normally assume that the environment does not actively interfere with signal transmission in such a system and can therefore, with great likelihood, be represented by equation 3.18. Thus, we can argue that an appropriate approximation of the complete equation 3.8 will be found in almost all natural situations. The robot application shown in the companion article in this issue supports this notion experimentally.

3.2.3 Convergence Properties. The previous section has shown that it is possible to construct approximate solutions of equation 3.8 using resonators so that X₀(s) → 0. Here, we address the problem of whether the learning rule will actually converge onto such a solution. Conventional techniques for deriving a learning rule by calculating the partial derivatives with respect to the weights and finding the minimum fail in our case because ISO learning is linear. As a consequence, the derivatives are constant, and a minimum cannot be found. An approach that leads to success, however, is to apply perturbation theory instead. Let us first treat the system very generally, without making a priori assumptions about the characteristics of the H_k. In so doing, we can employ perturbation analysis with the nice aspect that we will not make any assumption about the size of the perturbation. Thus, proof of stability against such a perturbation is equivalent to a proof of convergence. For real resonators, this will be a little different, though, as we will see.
³ Note that we are even able to approximate the zero crossings of equation 3.18, since we have a sum of resonator responses. If we calculate the overall transfer function of a sum of resonators (H₁ + H₂ + ⋯), we automatically also get zero crossings, which can be identified with the zero crossings in equation 3.18. Thus, the approximation, including the phase terms, is correct.

Let us assume that we have found a set of weights ρ_k, k > 0, that solves equation 3.8, and we know that the development of the weights follows equation 2.5. Now we perturb the system, substituting ρ_j in equation 2.5 with ρ_j + δρ_j = ρ̃_j. In order to ensure stability, we must prove that the
perturbation is counteracted by the weight change; thus, we must solve equation 3.8 hoping to find

$$\Delta\rho_j \sim -\delta\rho_j. \tag{3.19}$$

Note that this would guarantee convergence because we know that μ is small, which prevents oscillations. After some calculations (see the appendix), we arrive at

$$\Delta\rho_j = \frac{\mu}{2\pi}\sum_{k=1}^{N}\delta\rho_k \int_{-\infty}^{\infty} \frac{-i\omega\,|X_1|^2\, H_k^-}{1 - \rho_0 P_0^- H_0^-}\, H_j^+\, d\omega, \tag{3.20}$$
where we use the superscripts + and − for the function arguments +iω and −iω. This result is still general in the sense that we are not necessarily dealing with resonator functions, so at the moment we are still free to make some reasonable assumptions about the set of H_k. Let us thus assume orthogonality, given by

$$0 = \int_{-\infty}^{\infty} |X_1|^2\, H_j^+ H_k^-\, \frac{-i\omega}{1 - \rho_0 P_0^- H_0^-}\, d\omega \qquad\text{for } k \neq j, \tag{3.21}$$
and we get

$$\Delta\rho_j = \delta\rho_j\,\frac{\mu}{2\pi}\int_{-\infty}^{\infty} |X_1^+|^2\, |H_j^+|^2\, \frac{-i\omega}{1 - \rho_0 P_0^- H_0^-}\, d\omega. \tag{3.22}$$
In order to prove that the integral in equation 3.22 will be negative (ensuring convergence), the inner (reflex) loop, which is determined by ρ₀H₀P₀, needs to be considered. Note that this loop must at least be stable; otherwise, the system would not be functional to begin with. A theoretical result from the literature (Sollecito & Reque, 1981) supports the notion that the integral in question is negative as long as the stability of ρ₀H₀P₀ is guaranteed. Let us try to spell this rather general argument out more concretely. (In the appendix, we rigorously prove convergence for the important case of unity feedback.) By use of Plancherel's theorem (Stewart, 1960), we transfer the integral in equation 3.22 into the time domain and get

$$\Delta\rho_j = \mu\,\delta\rho_j \int_0^{\infty} a_{x*h}(t)\, f'(t)\, dt, \tag{3.23}$$

where we call a_{x*h}(t) the autocorrelation function of x₁(t) * h_j(t), which is the inverse transform of |X₁⁺H_j⁺|² (the asterisk denotes a convolution). We note
that the remaining term in equation 3.22, −iω/(1 − ρ₀P₀⁻H₀⁻), contains the derivative operator −iω in the numerator. Thus, f′(t) in equation 3.23 is the temporal derivative of the impulse response f(t), the inverse transform of 1/(1 − ρ₀P₀⁻H₀⁻).
Now we must ask what the most general condition is for the reflex loop (defined by ρ₀H₀P₀) to be stable. For a concrete stability analysis, knowledge of P₀ would be required, which normally cannot be obtained. We can, however, in general assume that P₀, being an environmental transfer function, should again behave passively and follow equation 3.18. Furthermore, we know that the environment delays the transmission from the motor output to the sensor input. Thus, P₀ must be dominated by a low-pass characteristic, without which it would be unstable.⁴ As a consequence, we can in general state that the fraction 1/(1 − ρ₀P₀H₀) is dominated by the characteristic of a (nonstandard) high pass. It follows that its derivative has a very high negative value at t = 0 (ideally −∞) and vanishes soon after. The autocorrelation a is positive around t = 0. Thus, the integral in question will remain negative as long as the duration of the disturbance D remains short. As an important special case, we find that this especially holds if we assume a delta-pulse disturbance at t = 0, corresponding to x₁(t) = δ(t). Thus, for an orthogonal set of H_k, we have found that ISO learning will converge if P₀ is dominated by a low-pass characteristic and if we use a disturbance D with a short duration.

Finally, we have to prove that equation 3.22 is zero in the equilibrium-state case, where the feedback loop is no longer needed. Then X₀ = 0, the reflex term ρ₀H₀P₀ effectively drops out, and the denominator becomes one. We get

$$\Delta\rho_j = \frac{\mu}{2\pi}\,\delta\rho_j \int_{-\infty}^{\infty} -i\omega\, |X_1^+|^2\, |H_j^+|^2\, d\omega. \tag{3.24}$$
This integral is antisymmetric and thus zero, as required. In the companion article in this issue, we discussed that the synaptic weights in the open-loop condition stabilize as soon as we explicitly set X₀ = 0, arriving at the same equation (compare equation 2.21 in the companion article). In the closed-loop condition used here, this is obtained in a natural way, as the result of implicitly eliminating the reflex during the learning process.

3.2.4 Matching the Theoretical Convergence Properties to the Practical Approach. In this section, we now use real resonator functions for H_k and H_j (see equations 2.2 and 2.3). Normally, the transfer functions of the resonators are not orthogonal, but we will show by numerical integration that the system still behaves properly.

⁴ Note that the unity feedback condition, treated in the appendix, represents the simplest possible stable reflex loop. Its low-pass characteristic is reduced to a mere delay in this case.
Here, we use the unity feedback condition defined in the appendix in order to work with a concrete example that is initially stable, and we get for equation 3.20

$$\Delta\rho_j = \frac{\mu}{2\pi}\sum_{k=1}^{N}\delta\rho_k \int_{-\infty}^{\infty} \frac{-i\omega\, H_j^+ H_k^-}{1 - \rho_0\, e^{i\omega\tau}}\, d\omega, \tag{3.25}$$
where we have set D = 1, which represents a δ-function as the disturbance. Figure 4a shows the numerically obtained results for Δρ_j as defined in equation 3.25 in the case of a perturbation. We note that the resonators are not orthogonal, since we have nonzero contributions for nearly all j ≠ k. The system, however, still compensates for perturbations and thus converges, for the following reason. First, consider Figure 4a, which represents how the system reacts to a perturbation, and look at the diagonal. We find that the values of the integral (see equation 3.25) are negative on the diagonal. This means that any perturbation at ρ_j will exert a counterforce onto itself and, consequently, compensate the perturbation. However, the off-diagonal elements k ≠ j are nonzero, so we have to discuss them and argue why this does not interfere with the compensation process. Thus, the question of stability must be rephrased as the question of how a perturbation at one given weight ρ_k will influence the other weight(s). Most important, we observe that the absolute value of the integral (see Figure 4a) is substantially smaller than one everywhere else. This shows that any perturbation at index k will reenter the system at index j only in a strongly damped way. This process leads to a decay of any perturbation through further iterations. This strictly holds for two paired indices j and k. However, even for the complete sum in equation 3.25, which describes all cross-interference terms, we can argue that perturbations will be eliminated. This is true as long as the sum remains below one, which is realistic given the small and sign-alternating values of the integral surface. From this, we realize that strict orthogonality as defined in equation 3.21 is not necessary to ensure convergence. This constraint can be relaxed to the requirement that the absolute value of the sum in equation 3.25 (or equation 3.20, respectively) remain below one.
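The diagonal (j = k) behavior of equation 3.25 can be reproduced with a short quadrature. This sketch is our own reconstruction using the Figure 4 parameters (ρ₀ = −0.9, τ = 1, Q = 0.9) and one resonator frequency picked from the scanned range 0.01–0.1:

```python
import cmath
import math

rho0, tau, Q, f = -0.9, 1.0, 0.9, 0.05  # Figure 4 parameters; f is one sample

a = -math.pi * f / Q
b = math.sqrt((2.0 * math.pi * f) ** 2 - a ** 2)
p = complex(a, b)

def H(s):
    """Resonator transfer function of eq. 2.2."""
    return 1.0 / ((s + p) * (s + p.conjugate()))

def integrand(w):
    """j = k integrand of eq. 3.25 under unity feedback."""
    s = complex(0.0, w)
    return -s * abs(H(s)) ** 2 / (1.0 - rho0 * cmath.exp(complex(0.0, w * tau)))

# Simple Riemann quadrature over a symmetric window; the tail falls off as w^-3
W, n = 50.0, 40001
dw = 2.0 * W / (n - 1)
total = sum(integrand(-W + i * dw) for i in range(n)) * dw
```

The imaginary parts cancel by symmetry, and the real part comes out negative for these values, which is the self-stabilizing diagonal counterforce described above.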
Thus, for all practical purposes, we can concentrate on the behavior of the diagonal elements, even without having to employ an orthogonal set of H. Figure 4b shows the equilibrium case with ρ₀X₀P₀ = 0. We note that in this case the integral is zero for k = j, which is in accordance with the theory. Since we are in equilibrium, we do not expect any weight changes.

4 Discussion

In this article, we have focused on finding a mathematically motivated interpretation of the results from feedback-loop (self-referential) based ISO
Figure 4: Numerical integration of equation 3.25. The disturbance D and the reflex loop delay τ were both set to one. The frequencies of the resonators H_k and H_j were varied from 0.01 to 0.1 in steps of 0.001. The value of Q was set to Q = 0.9 for both resonators. The weight of the reflex loop was ρ₀ = −0.9.
learning. We were able to show that such a system approximates the inverse controller of the reflex. The theoretical results are at some critical points rather nicely linked to the experimental findings shown in the companion article, which support the validity of the theory. Most of the technical aspects, like necessary assumptions (e.g., the orthogonality problem), have already been discussed in the previous sections. Therefore, we restrict the discussion here to more general problems.

The inverse controller problem belongs to the most famous problems in engineering. Typical solutions are always based on an intrinsic model (a so-called forward model) of the to-be-controlled system. In contrast, our approach is model-free because it is based on learning. Furthermore, engineered forward models have the central disadvantage that they will fail if something unexpected happens. Thus, control engineers use their forward controllers only in conjunction with the feedback-loop controller on which the forward model was originally based. The same strategy is pursued in a natural way in our setup. The double-loop structure of Figure 3 clearly shows that the reflex will again take over if the outer loop fails. In contrast to engineered systems, however, this will lead to a continuation of the learning process, such that the system will continue to improve throughout its lifetime.

A frequently addressed problem in biology is the control of voluntary limb movements, for example, in the arm movement models developed by Haruno, Wolpert, and Kawato (2001) and others. These authors also employ forward models (inverse controllers) to address problems of limb control in a mixed-model approach (Wolpert & Ghahramani, 2000).
The idea that forward models are involved in motor control has been explored, for example, by Grüsser (1986), who tried to explain the stability of the visual percept during voluntary eye movements by means of an internal representation of the motor command, called "efference copy" or "corollary discharge" (von Uexküll, 1926). By now, clear evidence exists for such a general mechanism; the details of how it is implemented, however, are still under debate. A discussion of this is beyond the scope of this article, but our theoretical results suggest that sequence-order learning can provide a method by which forward models can be generically designed (i.e., learned). It is conceivable that this observation is not restricted to our specific algorithm but also holds for other temporal sequence learning algorithms like TD learning.

The models by Wolpert and Ghahramani (2000), Haruno et al. (2001), and others have in common that they use supervised learning schemes, usually TD learning, to learn the forward model. As we stated in section 1, the goal of our two letters in this issue is to provide an unsupervised temporal sequence learning algorithm for autonomous behavior. An organism that is autonomous cannot rely on external rewards. Internal rewards are possible, but if we treat autonomy seriously, then even an internal reward originates in the last instance from a sensor input. Verschure and Voegtlin (1998) used the same paradigm: a (Hebb-like) unsupervised learning algorithm together
with a reflex as a reference. A novel aspect of our work is that we have taken the environment explicitly into account and introduced it as a nonevaluative structure. Thereby, the organism reacts only to the relevant parts of the environment's structure, and in the theoretical treatment, only those aspects of the environment have to be taken into account that either establish the reflex loop or can be used to supersede it. In that sense, the organism acquires not arbitrary sensorial information but useful information: useful in the sense of helping it to supersede the reflex (von Glasersfeld, 1996).

The definition of autonomy has been based on the aspect of (un)predictability of behavior (Ford & Hayes, 1995). It is interesting to consider how our system fits into this framework. The acquisition of additional useful sensorial information enables the organism to predict unwanted changes in the environment. Thus, for the organism, predicting the reflex leads to more behavioral security compared to the situation when it had to rely entirely on the reflex reaction. However, the gain of security for the organism will lead to an increase of uncertainty observed in the environment. What this means can be understood by reconsidering the robot experiment shown in the companion article. As long as the robot has only its reflex behavior, it is absolutely predictable for an observer. From the moment learning eliminates the reflex, the robot's behavior becomes more and more unpredictable. Although the robot achieves its goal (obstacle avoidance), it cannot be predicted how the robot actually does so. It is specifically this duality of certainty versus uncertainty (depending on the point of view, actor versus observer) that is central to the definitions of autonomy. Such principles have also been identified as the basis for the emergence of social behavior (Luhmann, 1995).
Our two articles in this issue are meant to provide an alternative framework for temporal sequence learning, which by its linear structure provides better access to analytical treatment than existing techniques do. In addition, we believe that the ISO learning algorithm could have significant commercial potential, because it can, in a model-free way, solve various inverse controller problems, which should be of relevance for different applied control situations. Two questions immediately arise that should be addressed by future research: Which modifications have to be made to implement an "attraction" case as opposed to the demonstrated "avoidance" case? And is there a way to implement ISO learning using spike trains and biophysically modeled neurons? These issues extend beyond the scope of this article and are topics for further investigation.

Appendix

In this appendix, we give the detailed equations for the convergence proof and derive the proof in a rigorous way for the so-called unity feedback condition.
A.1 Detailed Equations. We continue after equation 3.19. We need to define U and V. U is easy:

$$U_j = X_j H_j = \begin{cases} X_0 H_0 & \text{for } j = 0 \\ X_1 H_j & \text{for } j > 0. \end{cases} \tag{A.1}$$
V is more complicated. From the definition, we have

$$V = \rho_0 X_0 H_0 + X_1 \sum_{k=1}^{N}\rho_k H_k, \tag{A.2}$$
and from above, we know (see equation 3.1) that

$$X_0 = P_0\,[V + De^{-sT}]. \tag{A.3}$$
Thus, we get for V,

$$V = \rho_0 P_0 [V + De^{-sT}] H_0 + X_1 \sum_{k=1}^{N}\rho_k H_k \tag{A.4}$$
$$\;\;\; = \rho_0 P_0 H_0 V + \rho_0 P_0 H_0 De^{-sT} + X_1 \sum_{k=1}^{N}\rho_k H_k, \tag{A.5}$$

resulting in

$$V = \frac{\rho_0 P_0 H_0 De^{-sT} + X_1 \sum_{k=1}^{N}\rho_k H_k}{1 - \rho_0 P_0 H_0}. \tag{A.6}$$
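The algebra from equation A.2 to A.6 can be spot-checked numerically: plug arbitrary (hypothetical) complex numbers in for the transfer functions at one fixed s, form the closed-form V of equation A.6, and confirm that it satisfies the implicit relation A.4:

```python
# Arbitrary complex stand-ins for the quantities at one fixed s (hypothetical values)
rho0 = -0.5
P0, H0 = complex(0.3, -0.2), complex(0.8, 0.1)
D, expT = complex(1.2, 0.4), complex(0.9, -0.3)  # expT plays the role of e^{-sT}
X1, S = complex(0.7, 0.2), complex(-0.4, 0.6)    # S plays the role of sum_k rho_k H_k

# Closed form of eq. A.6
V = (rho0 * P0 * H0 * D * expT + X1 * S) / (1.0 - rho0 * P0 * H0)

# Residual of the implicit relation, eq. A.4: V = rho0 P0 [V + D e^{-sT}] H0 + X1 S
residual = V - (rho0 * P0 * (V + D * expT) * H0 + X1 * S)
```
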
Substituting ρ_j → ρ_j + δρ_j, we get

$$\tilde V = \frac{\rho_0 P_0 H_0 De^{-sT} + X_1 \sum_{k=1}^{N}\rho_k H_k + X_1 \sum_{k=1}^{N}\delta\rho_k H_k}{1 - \rho_0 P_0 H_0} \tag{A.7}$$
$$\;\;\; = V + \frac{X_1 \sum_{k=1}^{N}\delta\rho_k H_k}{1 - \rho_0 P_0 H_0}. \tag{A.8}$$
Calculating the weight change is done using equation 2.5:

$$\Delta\tilde\rho_j = \frac{\mu}{2\pi}\int_{-\infty}^{\infty} -i\omega\left[V^- + \frac{X_1^-\sum_{k=1}^{N}\delta\rho_k H_k^-}{1 - \rho_0 P_0^- H_0^-}\right] X_1^+ H_j^+\, d\omega, \tag{A.9}$$

where we have introduced the abbreviations + and − for the function arguments +iω and −iω.
We realize that the first part of this integral describes the equilibrium-state condition and can be dropped; thus,

$$\Delta\rho_j = \frac{\mu}{2\pi}\sum_{k=1}^{N}\delta\rho_k \int_{-\infty}^{\infty} \frac{-i\omega\,|X_1|^2\, H_k^-}{1 - \rho_0 P_0^- H_0^-}\, H_j^+\, d\omega, \tag{A.10}$$

where for X₁ we have made use of the fact that for transfer functions in general we can write Y⁺Y⁻ = |Y|², and we have reached equation 3.20 of the main text.

A.2 Introducing the Unity Feedback Loop Restriction. The basic (critical) property of a reflex loop is its delay characteristic. This property underlies the conceptual necessity for temporal sequence learning and is essential for any relevant mathematical treatment. The specific characteristics of some of the transfer functions, on the other hand, are secondary and can therefore be simplified. Thus, we will use the so-called unity feedback loop assumption to capture this property. It is defined by

$$\rho_0 \in \;]-1, 0[ \tag{A.11}$$
$$H_0 := 1 \tag{A.12}$$
$$P_0 := e^{-s\tau}. \tag{A.13}$$
The reflex loop is thus entirely determined by its gain $\rho_0$ and by the delay $\tau$ (not to be confused with T), which is the delay between the motor output V and the sensor input $X_0$. The range of $\rho_0$ defined by equation A.11 results from the demand that the reflex should be a negative feedback loop and that it must be stable. In addition, we assume that the transfer function $P_1$ of the predictive pathway represents unfiltered throughput, given by

$$P_1 := 1. \tag{A.14}$$
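The stability requirement behind the range in equation A.11 can be illustrated with a minimal discrete simulation of the delayed feedback loop. This is a sketch under an added assumption (one simulation step per loop delay τ and an impulse-like disturbance), not the paper's continuous-time treatment:

```python
# v[n] = rho0 * v[n-1] + d[n]: the output fed back after one loop delay,
# scaled by the loop gain rho0; d is a short disturbance at n = 0.
def simulate_loop(rho0, n_steps=200, disturbance=1.0):
    v, trace = 0.0, []
    for n in range(n_steps):
        d = disturbance if n == 0 else 0.0
        v = rho0 * v + d
        trace.append(v)
    return trace

stable = simulate_loop(-0.5)     # rho0 in ]-1, 0[: response decays to zero
unstable = simulate_loop(-1.5)   # |rho0| > 1: response grows without bound
assert abs(stable[-1]) < 1e-12
assert abs(unstable[-1]) > 1e6
```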
Finally, we assume that the disturbance D should be short, with a duration shorter than $\tau$ (otherwise, the loop would become unstable), and that it can be developed into a product series of conjugate zeros and poles (e.g., low-, band-, or high-pass characteristics). Thereby, D also takes on the properties of a typical transfer function.

A.3 Convergence for Unity Feedback. In the main text, we arrived at equation 3.22,

$$\Delta\rho_j = \frac{\mu \, \delta\rho_j}{2\pi} \int_{-\infty}^{\infty} |X_1^{+}|^2 \, |H_j^{+}|^2 \, \frac{-i\omega}{1 - \rho_0 P_0^{-} H_0^{-}} \, d\omega \tag{A.15}$$
ISO Learning Approximates an Inverse Controller
and we have to prove that this integral is negative. This can be shown directly for unity feedback. Thus, equation A.15 turns into

$$\Delta\rho_j = \frac{\mu \, \delta\rho_j}{2\pi} \int_{-\infty}^{\infty} \underbrace{\frac{-i\omega}{1 - \rho_0 e^{i\omega\tau}}}_{-i\omega F(-i\omega)} \; \underbrace{|D H_j|^2}_{A(i\omega)} \, d\omega. \tag{A.16}$$
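The kernel F in equation A.16 expands as a geometric series of delayed terms, $1/(1 - \rho_0 e^{-s\tau}) = \sum_n \rho_0^n e^{-s n \tau}$; since $\rho_0 \in \, ]-1, 0[$, the weights $\rho_0^n$ alternate in sign, starting positive. A quick numerical check at an assumed sample point:

```python
# Geometric-series expansion of F(s) = 1/(1 - rho0 * e^{-s*tau}); the series
# converges because |rho0 * e^{-s*tau}| < 1 for Re(s) >= 0 and |rho0| < 1.
import cmath

rho0, tau, s = -0.6, 1.0, 0.4 + 1.3j   # assumed sample values
closed_form = 1.0 / (1.0 - rho0 * cmath.exp(-s * tau))
series = sum((rho0 ** n) * cmath.exp(-s * n * tau) for n in range(200))
assert abs(closed_form - series) < 1e-12
```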
As in the main text, we apply Plancherel's theorem to equation A.16 in order to transfer the integral back into the time domain and prove that it is negative. We get

$$\Delta\rho_j = \mu \, \delta\rho_j \int_{0}^{\infty} a(t) \, f'(t) \, dt. \tag{A.17}$$
The function F(s) of equation A.16 is given by the transformation pair

$$F(s) = \frac{1}{1 - \rho_0 e^{-s\tau}} \;\leftrightarrow\; f(t) = (-1)^n \, \delta(t - n\tau), \qquad n = 0, 1, 2, \ldots, \tag{A.18}$$
where f represents an alternating $\delta$-function at $t = 0, \tau, 2\tau, \ldots$, which starts with a positive delta pulse (Doetsch, 1961). Thus, together with $-i\omega$, the complete term $-i\omega \, \frac{1}{1 - \rho_0 e^{i\omega\tau}}$ represents $f'(t)$, hence the temporal derivative of f. The other term A(s) of equation A.16 is given by

$$A(s) = |D H_j|^2 \;\leftrightarrow\; a(t) = \Phi[d(t) * h_j(t)], \tag{A.19}$$
where the asterisk denotes a convolution and $\Phi$ the autocorrelation function. As a consequence of the above findings, we have to discuss the integral in equation A.17 specified by the time functions in equations A.18 and A.19. The integral should be negative to ensure stability. We know that D is short-lived, with a duration shorter than $\tau$, without which the loop system would be unstable to begin with. Thus, we can restrict the discussion of the integral to t = 0. We know that the autocorrelation function a has a positive maximum at t = 0 and that the derivative $f'$ of a delta pulse at zero approaches $-\infty$ for $t \to 0$, $t > 0$. As a consequence, the integral is negative, as required for convergence.

Acknowledgments

We are grateful to Leslie Smith and to the members of the CCCN seminar for their helpful comments during various stages of this work. This study was supported by grants from SHEFC RDG INCITE and by the European funding ECOVISION. UK patent pending.
References

Blinchikoff, H. J. (1976). Filtering in the time and frequency domain. New York: Wiley.

D'Azzo, J. J. (1988). Linear control system analysis and design. New York: McGraw-Hill.

Doetsch, G. (1961). Guide to the applications of the Laplace and z-transforms. New York: Van Nostrand Reinhold.

Ford, K. M., & Hayes, P. J. (Eds.). (1995). Android epistemology. Cambridge, MA: MIT Press.

Grüsser, O. (1986). Interaction of efferent and afferent signals in visual perception: A history of ideas and experimental paradigms. Acta Psychol., 63, 3–21.

Haruno, M., Wolpert, D. M., & Kawato, M. (2001). Mosaic model for sensorimotor learning and control. Neural Comp., 13, 2201–2220.

Luhmann, N. (1995). Social systems. Stanford, CA: Stanford University Press.

McGillem, C. D., & Cooper, G. R. (1984). Continuous and discrete signal and system analysis. New York: CBS Publishing.

Nise, N. S. (1992). Control systems engineering. New York: Cummings.

Palm, W. J. (2000). Modeling, analysis and control of dynamic systems. New York: Wiley.

Sollecito, W., & Reque, S. (1981). Stability. In J. Fitzgerald (Ed.), Fundamentals of system analysis. New York: Wiley.

Stewart, J. L. (1960). Fundamentals of signal theory. New York: McGraw-Hill.

Therrien, C. (1992). Discrete random signals and statistical signal processing. Upper Saddle River, NJ: Prentice Hall.

Verschure, P., & Voegtlin, T. (1998). A bottom-up approach towards the acquisition, retention, and expression of sequential representations: Distributed adaptive control III. Neural Networks, 11, 1531–1549.

von Glasersfeld, E. (1996). Learning and adaptation in constructivism. In L. Smith (Ed.), Critical readings on Piaget (pp. 22–27). London: Routledge.

von Uexküll, B. J. J. (1926). Theoretical biology. London: Kegan Paul, Trubner.

Wolpert, D. M., & Ghahramani, Z. (2000). Computational principles of movement neuroscience. Nature Neuroscience, 3(Suppl.), 1212–1217.

Received April 12, 2002; accepted November 1, 2002.
LETTER
Communicated by Randall Beer
Localization of Function via Lesion Analysis

Ranit Aharonov
[email protected] Center for Neural Computation, Hebrew University, Jerusalem, Israel
Lior Segev
[email protected] School of Computer Sciences, Tel-Aviv University, Tel-Aviv, Israel
Isaac Meilijson
[email protected] School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv, Israel
Eytan Ruppin
[email protected] School of Computer Sciences, Tel-Aviv University, Tel-Aviv, Israel
This article presents a general approach for employing lesion analysis to address the fundamental challenge of localizing functions in a neural system. We describe functional contribution analysis (FCA), which assigns contribution values to the elements of the network such that the ability to predict the network's performance in response to multilesions is maximized. The approach is thoroughly examined on neurocontroller networks of evolved autonomous agents. The FCA portrays a stable set of neuronal contributions and accurate multilesion predictions that are significantly better than those obtained based on the classical single lesion approach. It is also used for a detailed synaptic analysis of the neurocontroller connectivity network, delineating its main functional backbone. The FCA provides a quantitative way of measuring how the network functions are localized and distributed among its elements. Our results question the adequacy of the classical single lesion analysis traditionally used in neuroscience and show that using lesioning experiments to decipher even simple neuronal systems requires a more rigorous multilesion analysis.

Neural Computation 15, 885–913 (2003) © 2003 Massachusetts Institute of Technology

1 Introduction

One of the fundamental challenges in understanding a nervous system is to characterize how its various functions are localized and distributed among the system elements. These elements may be single neurons,
neuronal assemblies, or even cortical areas, depending on the scale on which one chooses to analyze the system. Even simple nervous systems are capable of performing multiple and unrelated tasks, often in parallel. Each task recruits some of the elements of the system, and often the same element participates in several tasks. This poses a difficult challenge when one attempts to identify the roles of the network elements and assess their contributions to the different tasks. The localization of specific tasks in a nervous system is conventionally done in neuroscience by recording the activity of the system elements during behavior or by measuring the deficit in performance after lesioning specific elements. Obviously, inferring significance from recording unit activity during behavior is problematic because correlation of neuronal activity with performance does not necessarily identify causality. For example, it is possible that some areas raise their activity while a given task is performed not because they significantly contribute to the processing of that task, but rather because they are activated by other regions that do play an important role in this task processing. To try to uncover the units and regions whose activation really underlies a specific function or behavior, neuroscientists have traditionally adopted a system analysis approach, where the normal operating modes of a system are studied by lesioning and perturbing its processing in many possible ways. These abnormal manipulations are aimed at providing deeper insights into the system's normal, regular operation. The vast majority of lesion studies in neuroscience have been single lesion studies, in which only one network component is abolished at a time (Squire, 1992; Farah, 1996). Such single lesions are very limited in their ability to reveal the significance of elements that interact in complex ways.
A straightforward conceptual example demonstrating this is when two elements have a high degree of redundancy with respect to the processing of a function to which they equally contribute. Then, lesioning either element alone will not reveal its true significance, since no reduction in the performance of that function will occur. The problematic and limited value of single lesion analysis has already been widely noted in the neuroscience literature (Sprague, 1966; Farah, 1990; Sitton, Mozer, & Farah, 2000; Young, Hilgetag, & Scannell, 2000). One classical example discussed is that of the paradoxical lesioning effect (Sprague, 1966). In this paradigmatic case, lesioning area A alone is harmful, but lesioning area A given that area B is lesioned is beneficial, hence the apparent "paradox." Importantly, it demonstrates that looking at a single lesion alone may be misleading, as the beneficial influence of an area depends on the general state of the system. Finally, current lesioning studies in neuroscience have yielded mostly qualitative measures, lacking the ability to quantify precisely the contribution of a unit to the performance of the system and, perhaps even more important, to predict the effect of new, multiple-site lesions.

Addressing these challenges, this article presents a functional contribution analysis (FCA). The FCA framework gives a rigorous, operative definition for the neurons' contributions to the system's performance in various tasks and an algorithm to measure them using multilesion analysis. Acknowledging that single lesions are insufficient for localizing functions in neural systems, the FCA framework assumes an existing data set of multiple lesions to a neural system and their corresponding system performance scores in a given set of tasks. The FCA uses these data to calculate the elements' contributions to the various tasks, yielding an accurate prediction of the performance of a neural system when a new, untested, multiple lesion damage is inflicted on it. Thus, the FCA harnesses the full power of the lesion approach to learn about the localization of functions in the intact, unperturbed network and predict its response to damage. Although still infrequent, multiple lesion studies are now being performed with increasing frequency in neuroscientific studies of reversible inactivation (Lomber & Payne, 2001). However, these data are still of fairly limited magnitude. Hence, the development and study of the FCA method was primarily done in a theoretical modeling framework: that of neurally driven evolved autonomous agents (EAAs; Ruppin, 2002). EAA agents are software programs living in a simulated virtual environment that autonomously perform typical animat tasks like gathering food, navigating, evading predators, and seeking prey and mating partners. Each agent is controlled by an artificial neural network "brain." This network receives and processes sensory inputs from the surrounding environment and governs the agent's behavior through the activation of the motors controlling its actions. It is developed using genetic algorithms that apply some of the essential ingredients of inheritance and selection to a population of agents that undergo evolution.
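The paradoxical lesioning effect discussed above can be made concrete with a toy performance table; the numbers below are invented for illustration and do not come from any experiment:

```python
# Hypothetical performance of a two-area system under the four lesion states;
# a 1 means the area is intact, a 0 means it is lesioned (invented values).
perf = {
    (1, 1): 1.0,   # both areas intact
    (0, 1): 0.4,   # lesioning A alone is harmful
    (1, 0): 0.2,   # lesioning B alone is even more harmful
    (0, 0): 0.5,   # yet lesioning A on top of a lesioned B is beneficial
}
# a single lesion study concludes that removing A hurts ...
assert perf[(0, 1)] < perf[(1, 1)]
# ... but given that B is lesioned, removing A improves performance ("paradox")
assert perf[(0, 0)] > perf[(1, 0)]
```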
In recent years, much progress has been made in finding ways to evolve EAAs that successfully cope with diverse behavioral tasks (Floreano & Mondada, 1996; Scheier, Pfeifer, & Kuniyoshi, 1998; Kodjabachian & Meyer, 1998; Floreano & Mondada, 1998; Gomez & Miikkulainen, 1997), employing networks composed of several dozen neurons and hundreds of synapses (see Meyer & Guillot, 1994; Kodjabachian & Meyer, 1998; Yao, 1999; Guillot & Meyer, 2000, 2001). EAAs are a promising model system for studying neural processing and developing methods for its analysis (see Ruppin, 2002, for a detailed discussion of this issue). They are less biased than conventional neural networks used in neuroscience modeling, as their architecture is typically not predesigned but emergent. Their potential is further substantiated by their feasibility: the small size of the evolved networks and the simplicity of their environments and behavioral tasks, coupled with the full information available regarding the model dynamics, form conditions that help make the analysis of the network's dynamics an amenable task. Furthermore, numerous EAA studies have already shown that networks emerging in EAA systems can manifest interesting biological-like characteristics and provide additional computational insights (Cangelosi & Parisi, 1997; Ijspeert, Hallam, & Willshaw, 1999; Aharonov-Barki, Beker, & Ruppin, 2001).
Identifying the importance of elements of a system to varying tasks is often used as a basis for claims concerning the degree of localization versus distribution of computation in that system (Wu, Cohen, & Falk, 1994). The distributed representation approach hypothesizes that computation emerges from the joint interaction of numerous simple elements (McClelland, Rumelhart, & the PDP Research Group, 1986; Hopfield, 1982). The local representation hypothesis suggests that activity in single neurons represents specific and well-defined concepts (the grandmother cell notion) or performs specific computations (see Barlow, Bartholomew, Bremner, & Brunk, 1972). This fundamental question of distributed versus localized computation in nervous systems has attracted ample attention, spanning from Lashley's seminal paper (Lashley, 1929) to today's ongoing controversies (Downing, Jiang, Shuman, & Kanwisher, 2001; Haxby et al., 2001; Cohen & Tong, 2001). However, there seems to be a lack of a precise and rigorous definition for these terms (Thorpe, 1995). The ability of the FCA to quantify the contribution of elements to tasks yields, as a "by-product," a precise determination of the level of distribution of processing in the network. This article presents a new tool for localizing functions in biological and artificial neural networks and introduces a quantitative approach for studying the long-debated issue of local versus distributed computation in the brain. It capitalizes on the availability of full information that makes EAA models a good test-bed for studying neural processing. The article is organized as follows. Section 2 describes the functional contribution approach and provides the necessary definitions. In section 3, we apply the FCA to evolved neurocontrollers, demonstrating its capabilities to unravel the underlying neuronal contributions.
We examine the implications of the type of lesioning method employed and show the FCA's superiority over the classical neuroscience single lesion approach. Section 4 studies the localization of different tasks with the FCA, and section 5 shows that the FCA can be used for a detailed synaptic analysis of the underlying network connections. Our results and their implications are discussed in section 6.

2 The Functional Contribution Approach

The FCA aims to compute the localization of a task from a series of multilesion experiments. In each such experiment, a lesioning configuration specifies which of the agent's neurocontroller units (neurons, synapses, or any higher-order module) are lesioned. Given this lesion, the agent is run in the environment in which it was evolved, and its performance (i.e., its fitness or any other well-defined performance measure) is measured. The FCA uses these data to compute a contribution vector $c = (c_1, \ldots, c_N)$, where $c_i$ is the contribution of element i to the task in question and N is the number of elements in the network. The goal of the FCA is to assign the contribution values that provide the best prediction of the agent's performance, in terms of mean squared error (MSE), under all possible multisite lesions.
2.1 Definitions. Suppose that a multilesion experiment is performed in which a set of elements in the network is lesioned and the neurocontroller network then performs a certain task. The result of this experiment is described by the pair $\{m, p_m\}$, where the lesion configuration vector m has $m_i = 0$ if element i was lesioned and 1 if it was left intact, and $p_m$ is the corresponding performance of the lesioned network, divided by the baseline performance of the fully intact network. For a system of N elements, there are $2^N$ such possible multilesion experiments, that is, $2^N$ possible lesioning configurations m. The underlying idea of our definition is that the contributions of the units are those values that allow the most accurate prediction of performance following lesions of any degree to the system. Within the FCA framework, given a configuration m of lesioned and intact units, the predicted performance $\tilde p_m$ of the network is the sum of the contribution values of the intact units ($m \cdot c$), as evaluated by a nondecreasing function f,

$$\tilde p_m = f(m \cdot c). \tag{2.1}$$
The function f is a nondecreasing piecewise polynomial. It is nondecreasing to reflect the notion that beneficial elements (those whose lesioning results in performance deterioration) should have positive contribution values and that negative values indicate elements that hinder performance. Given these definitions, we seek to find the pair $\{c, f\}$ that minimizes the mean squared prediction error

$$E = \frac{1}{2^N} \sum_{\{m\}} (\tilde p_m - p_m)^2. \tag{2.2}$$
The vector c that minimizes this error is the contribution vector for the task tested, and the corresponding f is its adjoint performance prediction function. Since we can arbitrarily choose the scale of c and maintain the same prediction by modifying f, we normalize c such that $\sum_{i=1}^{N} |c_i| = 1$. Thus, the optimal contribution vector c and its accompanying performance prediction function f minimize the MSE of predicted versus actual performance over all possible lesioning configurations. The FCA is related to two known modeling and prediction methods: generalized linear modeling and projection pursuit regression. The generalized linear model (GLM; see McCullagh & Nelder, 1989) approach can be succinctly defined by the relation $g(\tilde p_m) = m \cdot c$, where g is the transfer function, akin to the performance prediction function f in the FCA formulation (if $g = f^{-1}$, the above equation reduces to equation 2.1). In GLM, the contribution vector c can be computed using a closed formula, which results in a much faster and deterministic computation than that performed within the FCA (see section 2.2). However, GLM cannot be used to predict
performance, since in general g is not reversible; many lesioning configurations can lead to the same performance (e.g., complete failure of the agent). Projection pursuit regression (PPR; see Huber, 1985) has a formulation very similar to the FCA. PPR approximates the response surface g by a sum of k ridge functions, $g(x) \approx \sum_{j=1}^{k} f_j(c_j \cdot x)$, which, together with the projection directions $c_j$, are found in an iterative, greedy manner, minimizing the distance from a target random variable. The FCA is hence a special case of PPR in which k = 1 and f is nondecreasing. Using only one ridge function and a nondecreasing f enables one to preserve the intuitive meaning of c as the vector of contribution values.

2.2 The FCA Algorithm. The contribution vector c and the prediction function f for a task are computed using a training set of lesioning configurations and their corresponding performance levels in the task at hand. The FCA is a gradient descent search algorithm that iteratively updates f and c until it reaches a local minimum of the training error E′,

$$E' = \frac{1}{n} \sum_{m \in M} [f(m \cdot c) - p_m]^2, \tag{2.3}$$
where M is the training set and n is its size. The steps of the FCA are:

1. Choose a random initial normalized contribution vector c for the task, and compute f as in step 4.

2. Compute c. Using the current f, compute new values of c by performing a gradient descent on the error E′ (see equation 2.3), searching for c values that minimize E′ under the current f.

3. Renormalize c, such that $\sum_{i=1}^{N} |c_i| = 1$.

4. Compute f. Given the current c, perform isotonic regression on the pairs $\{m \cdot c, p_m\}$ in the training set. Use a smoothing spline on the result of the regression to obtain a new f.

Steps 2 through 4 are repeated a fixed number of times. In Segev, Aharonov, Meilijson, and Ruppin (2002), an algorithm for the efficient selection of the training set in FCA is presented. This adaptive lesion selection algorithm is based on active learning (Cohn, Ghahramani, & Jordan, 1995; Engelbrecht & Cloete, 1999) and selects the next lesioning configuration based on the information contained in the current set of configurations. It is shown to reduce the required training set size considerably. For simplicity, throughout most of this article, the training set used is the full configuration set. However, utilizing the adaptive algorithm (or even random selection), a relatively small training set suffices to achieve a low MSE on the test set (see sections 3.3 and 5).
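The steps above can be sketched in a few lines. This is a minimal illustration of equations 2.1 through 2.3, not the authors' code, and it makes one simplifying assumption: f is fixed to the identity, so step 4 (isotonic regression plus spline smoothing of f) is omitted and only c is fitted:

```python
# Minimal FCA sketch: fit contribution values c by gradient descent on E'.
import random

random.seed(0)

def predict(m, c):
    # equation 2.1 with f = identity: predicted performance is m . c
    return sum(mi * ci for mi, ci in zip(m, c))

def fit_contributions(data, n_units, lr=0.05, iters=2000):
    # gradient descent on the training error E' (equation 2.3)
    c = [random.uniform(-0.1, 0.1) for _ in range(n_units)]
    for _ in range(iters):
        for i in range(n_units):
            grad = sum(2 * (predict(m, c) - p) * m[i] for m, p in data) / len(data)
            c[i] -= lr * grad
    norm = sum(abs(ci) for ci in c) or 1.0
    return [ci / norm for ci in c]       # renormalize so that sum_i |c_i| = 1

# toy network: unit 0 contributes 0.75, unit 1 contributes 0.25, unit 2 nothing
true_c = [0.75, 0.25, 0.0]
configs = [(a, b, d) for a in (0, 1) for b in (0, 1) for d in (0, 1)]
data = [(m, predict(m, true_c)) for m in configs]   # all 2^N lesion experiments
c = fit_contributions(data, 3)
assert all(abs(ci - ti) < 0.01 for ci, ti in zip(c, true_c))
```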
2.3 Experimental Protocol. The FCA algorithm is stochastic and, as such, is dependent on both the initial conditions and the random choices made in the gradient descent step. To reduce stochasticity, each of the MSE values reported in this article is a mean of 10 FCA runs. A single run consists of 10 trials, each initialized with a different random contribution vector. The result of the trial with the lowest MSE on the training set is chosen for that run. We note, however, that the sensitivity to initial conditions is rather small in most cases. For each initial condition, the FCA executes a fixed number I of iterations of updating f and c (see section 2.2).¹ Throughout the article, I = 150, except for the analysis of agent S22, where I = 300. All FCA results presented are the mean and standard deviation of 10 FCA runs (i.e., a total of 10 × 10 initial condition choices). The MSE values are normalized by dividing them by the variance of the agent's performances $p_m$ in the test set. Thus, a normalized MSE of q means that the FCA explains a fraction $R^2 = 1 - q$ of the variance. That is, if one were to predict for any configuration a performance level equal to the mean performance of the agent, the normalized MSE would equal 1, corresponding to $R^2 = 0$.

3 FCA of Evolutionary Neurocontrollers

Using evolutionary simulations, Aharonov-Barki et al. (2001) have previously developed autonomous agents controlled by fully recurrent artificial neural networks. High performance levels were attained by agents performing simple lifelike tasks of foraging and navigation. Aharonov-Barki et al. (2001) used classical neuroscience methods to analyze the evolved neurocontrollers. Here, the FCA is developed and studied in this environment and is applied to the analysis of EAA neurocontrollers, demonstrating several aspects of the approach.

3.1 The EAA Environment. The agents live in a virtual grid arena of size 30 × 30 cells surrounded by walls (see Figure 1).
"Poison" is randomly scattered all over the arena (consuming this resource results in a negative reward). "Food," the consumption of which results in a positive reward, is randomly scattered in a restricted 10 × 10 "food zone" in a corner of the arena. The agent's goal is to find and eat as many food items as possible during its life, while ignoring the poison items. The agents are equipped with a set of sensors, motors, and a fully recurrent neurocontroller. The neurocontroller is coded in the genome and evolved; the sensors and motors are given and constant.

¹ The number of iterations chosen is such that the MSE converges. Note that a further drift of the contribution values may exist if one continues to iterate the algorithm beyond this point. However, an appropriate renormalization exists, which reconstructs the contribution values obtained at this point.
The initial population consists of 100 agents equipped with random neurocontrollers. Each agent is evaluated in its own environment, which is initialized with 250 poison items and 30 food items. The life cycle of an agent (an epoch) is 150 time steps, in each of which one motor action takes place. At the beginning of an epoch, the agent is introduced to the environment at a random location and orientation. At the end of its life cycle, each agent is assigned a general performance score (fitness), calculated as the number of food items minus the number of poison items it consumes, divided by 30, so that the maximum possible fitness is 1. Since the agent's lifetime is short, this is practically an unattainable score. Simulations last for a large number of generations, ranging between 5000 and 30,000. Details concerning the simulations and genetic methods are given in Aharonov-Barki et al. (2001). The agent's neurocontroller consists of 10 to 22 internal recurrently connected McCulloch-Pitts binary neurons (the number of neurons is fixed within a given simulation experiment). In addition to the internal neurons, the agent has five input neurons connected to the five basic sensors, collectively called the somatosensor (see Figure 1). Four of these sensors encode the presence of a resource (food or poison, without distinction between the two), a wall, or an empty cell in the cell that the agent occupies and in the three cells directly in front of it. The fifth sensor is a "smell" sensor, which can differentiate between food and poison underneath the agent but gives a random reading if the agent is in an empty cell. In some simulations, agents were equipped with an extra position sensor returning the location ((x, y) coordinates) of the agent in the arena. Four of the internal neurons are motor neurons. They initiate movement forward, a turn left or right, and control the state of the mouth (open or closed).
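The fitness score defined above can be sketched as a one-liner; the sample values are illustrative, not taken from the paper's experiments:

```python
# Fitness as described: (food items eaten - poison items eaten) / 30
def fitness(food_eaten, poison_eaten, total_food=30):
    return (food_eaten - poison_eaten) / total_food

assert fitness(12, 3) == 0.3     # a good agent
assert fitness(2, 5) == -0.1     # net poison consumption gives negative fitness
```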
In each step, a sensory reading occurs, network activity is then updated (network updating is synchronous), and a motor action is taken according to the resulting activity in the designated output neurons. Importantly, eating takes place only if the agent is neither moving nor turning. Thus, eating is a costly process, requiring a time step with no other movement in a lifetime of limited time steps. The task is especially difficult when the position sensor is not provided, because only partial and local sensory information about the environment is available to the agent. Thus, navigating to the food zone must rely only on the local environment and not on global position cues. As shown previously (Aharonov-Barki et al., 2001), algorithms employing no memory are quite poor in this task. Aharonov-Barki et al. (2001) studied two types of agents: an SP-type agent possessing both a somatosensor and a position sensor and an S-type agent possessing a somatosensor only. For both types of agents, the most successful strategy relied on a switch between two behavioral modes: exploration and grazing. The exploration mode consists of moving in straight lines, ignoring resources in the sensory field that are not directly under or in front of the agent, and turning at walls. The grazing mode consists of turning to resources to the right or left in order to examine them, turning at
Figure 1: An outline of the grid arena and the agent's neurocontroller. For illustration purposes, the borders of the food zone are marked, but they are invisible to the agent. The agent is marked by a small arrow on the grid, whose direction indicates its orientation. The T-shaped marking in front of the agent denotes the grid cells that it senses. The output neurons control four motors: move forward, turn left, turn right, and open mouth. Eating can take place only when the agent stands still and its mouth is open. Output neurons and interneurons are fully connected to each other. (After Aharonov-Barki et al., 2001.)
walls, and maintaining the agent's location on the grid in a relatively small, restricted region. In both modes, the agent consumes food items that it finds. The exploration mode is mostly observed when the agent is out of the food zone, allowing it to explore the environment and find the food zone. Inside the food zone, however, the agents almost always display grazing behavior, which results in efficient consumption of food. Examining the networks of successful agents equipped with a position sensor (SP-type) has revealed an important common feature (Aharonov-Barki et al., 2001): certain interneurons have a position-sensitive response. One of these neurons typically fires outside the food zone and remains quiescent inside the food zone and in its close vicinity. Clamping this position-sensitive cell and measuring the behavioral mode of the agent has revealed that it acts as a command neuron. When this command neuron is active, the agent assumes an exploration behavior, regardless of where it is in the environment, whereas clamping it to a silent state results in grazing behavior. Thus, the place-dependent activity map of the command neuron serves as an accurate predictor of the agent's behavior; where the map values are low, the agent will graze, and where they are high, it will explore. In the following sections, we apply the FCA to one such agent, whose network consists of 10 neurons (SP10). Similar to the SP case, in agents possessing only the somatosensor and lacking a position sensor (S-type), at least one neuron whose activity is place dependent was also identified. We study such agents in subsequent sections: one whose network consists of 22 neurons (S22) and another with a 10-neuron network (S10). These command neurons emerge even though no explicit information about location is available to S-type agents. Moreover, their activity is locked not to the absolute position in the arena but rather to the location of the food zone, which may vary across experiments (Aharonov-Barki et al., 2001). Aharonov-Barki et al. have previously demonstrated that this computational ability is based on a spontaneously emerging stochastic memory mechanism. The "knowledge" of location is actually a form of memory of the time elapsed since the last eating event (Aharonov-Barki et al., 2001). As will become clear from the analysis, this memory mechanism relies on feedback from the motor neurons, supplying information about the agent's eating events. The FCA approach requires a well-defined quantitative measure for performance on each task analyzed. We consider three tasks throughout this work: the general survival task of maximizing fitness, as defined above, and two subtasks that correspond to the two emergent behaviors of successful agents, exploration and grazing. Exploration performance, $p_e$, is measured by randomly placing the agent in the arena and measuring the number of steps t elapsed before the agent reaches the food zone for the first time. If the agent fails to reach the food zone in fewer than 1000 steps, t is set to 1000.
The exploration performance is computed by

$$p_e = \frac{T - t}{T} - \frac{\pi_b / S}{t / T}, \tag{3.1}$$
where $\pi_b$ is the number of poison items consumed before reaching the food zone, S = 30 is the total number of food items in the arena, and T = 150 is the original number of steps in an epoch. The first term is concerned with the speed of reaching the food zone and becomes negative for t > T. The second term penalizes poison consumption, which should be avoided in any behavioral mode, and is normalized to be consistent with the general performance normalization. Grazing performance, $p_g$, is measured by placing the agent in the arena for T time steps. Denote by s the total number of food sources eaten and by $\pi_a$ the number of poison items consumed after reaching the food zone for the first time; then,
.s ¡ ¼a /=S ; .T ¡ t/=T
(3.2)
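As a concrete illustration, the two performance measures can be sketched in a few lines of Python. This follows our reading of equations 3.1 and 3.2; all function and variable names are our own:

```python
def exploration_performance(t, poison_before, S=30, T=150, t_max=1000):
    """Exploration score p_e, equation 3.1 (our reading of the formula).

    t is the number of steps until the food zone is first reached
    (capped at t_max and assumed >= 1); poison_before is the number of
    poison items eaten before reaching the food zone."""
    t = min(t, t_max)
    return (T - t) / T - (poison_before / S) / (t / T)


def grazing_performance(t, food_eaten, poison_after, S=30, T=150):
    """Grazing score p_g, equation 3.2: food minus poison eaten after
    reaching the food zone, normalized by the time spent inside it."""
    return ((food_eaten - poison_after) / S) / ((T - t) / T)
```

For example, an agent that reaches the food zone at t = 75 having eaten 3 poison items scores p_e = 0.5 − 0.2 = 0.3.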
Localization of Function via Lesion Analysis
where t is measured as before. The numerator is consistent with the general performance normalization, and the denominator normalizes by the time spent inside the food zone.

3.2 Selecting the Lesioning Method. An important determinant of a lesioning analysis is the manner in which the lesioning itself is performed. We have studied two main lesioning methods: biological lesioning and stochastic lesioning. Biological lesioning refers to the classical method employed in most neuroscience lesioning experiments, where the lesioned component is ablated or cooled (i.e., completely silenced). In our model, accordingly, in biological lesioning, a lesioned neuron is deactivated completely by cutting all its output connections to the rest of the network. In contrast, stochastic lesioning is performed by making the firing pattern of the lesioned component random rather than by completely silencing its output. That is, at every time step, a lesioned neuron fires with probability equal to its overall mean firing rate in normal behavior, independent of its input field. This ensures that the lesioning does not affect the mean field of other neurons, affecting only the information content received from the lesioned neuron.2 In both methods, when lesioning motor neurons, we do not alter the activity transmitted to the motors themselves. This enables us to isolate the role of the motor units in the computation of the recurrent controller networks (i.e., their contribution to other neurons), without completely immobilizing the agent. Before advancing to study the FCA in depth, we ran a few experiments to gauge the accuracy of the FCA analysis obtained with both lesioning methods. Figure 2A compares the test MSE obtained by the FCA on the controller of agent S10 in the general, grazing, and exploration tasks, once employing biological lesioning (black bars) and once using stochastic lesioning (light-colored bars).
As is evident, the MSE obtained from biological lesioning is significantly poorer than that obtained when using stochastic lesioning. Indeed, as Figures 2B and 2C demonstrate, the contribution values of the neurons computed by the two lesioning methods differ, showing that the poorer prediction capability of biological lesioning stems from a real difference in the significance assigned to the system elements. These results testify to the crucial importance of selecting the lesioning method. Specifically, they demonstrate that the classical ablation (biological) lesioning conventionally used in animal lesioning experiments might be questionable. A possible explanation for this phenomenon is that when a neuron is silenced, it transfers other unlesioned neurons into a regime in which they cannot “fire” because their mean field is far below threshold. Thus, the effective part of the network that is lesioned is much larger than
2 Similarly, in the synaptic analysis, a lesioned synapse transmits a stochastic version of the presynaptic activity.
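The two lesioning methods described above can be sketched as follows. This is a minimal illustration; the weight-matrix convention (row i holds neuron i's outgoing synapses) and all names are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)


def biological_lesion(weights, neuron):
    """Ablation: silence the neuron by cutting all its output connections.
    We assume row i of `weights` holds neuron i's outgoing synapses."""
    w = weights.copy()
    w[neuron, :] = 0.0
    return w


def stochastic_lesion_activity(mean_rate, n_steps):
    """Stochastic lesioning: the lesioned neuron fires at each step with
    probability equal to its normal mean firing rate, independent of its
    input field, so the mean field seen by other neurons is preserved."""
    return rng.random(n_steps) < mean_rate
```

Note that the biological lesion alters the input fields of all downstream neurons, whereas the stochastic lesion preserves them on average and removes only the information carried by the lesioned neuron's activity.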
R. Aharonov, L. Segev, I. Meilijson, and E. Ruppin
Figure 2: Comparison of biological and stochastic lesioning. (A) MSE obtained by biological versus stochastic lesioning for agent S10 in the general, grazing, and exploration tasks. Contribution values obtained using biological lesioning (B) and stochastic lesioning (C) for agent S10 in the general fitness task.
what was intended and considered “lesioned” by the FCA. This results in a situation where lesions percolate to other network elements, disabling and isolating specific elements and creating dispersed effective lesions. Stochastic lesioning also spreads the damage from lesioned neurons to neighboring neurons, but to a much lesser extent. As a result, biological lesioning is much more difficult to analyze, because the difference between the direct lesioning configuration used by the FCA and the effective one is much larger, causing FCA-like algorithms to be erratic and less accurate. Biological lesioning is hence not fit for accurate lesion analysis (at least in the FCA framework, which is fairly general and employs multiunit lesions)
even in very simple neurocontrollers. In view of these results, we have chosen to employ stochastic lesioning in the rest of the article. This method is superior for the analysis of animat agents. We shall return to discuss these two lesioning methods from a neuroscience perspective in section 6.

3.3 FCA of Agents’ Performance. We consider first the performance obtained in the general task: maximizing fitness by consuming food and avoiding poison. We study the three agents described in section 3.1: the two S-type agents (S10 and S22) and the SP-type agent (SP10). We apply the FCA to these neurocontrollers by performing lesioning experiments and testing the performance levels of the agents. For each agent, a small training set was chosen using the adaptive algorithm (see section 2.2) or a random selection, and the FCA was applied 10 times (see section 2.3). The result of the FCA is the contribution vector c and the performance prediction function f. The prediction capability of the FCA is measured by computing the MSE on a test set consisting of all 2^10 configurations in S10 and SP10, and of 30,000 random configurations in S22.3 The normalized MSEs on these test sets were all below 0.07, demonstrating that the FCA succeeded in finding a pair {c, f} that brings the error in equation 2.2 to a very low value even when trained on a very limited subset of the 2^N configuration space. Figures 3A, 3C, and 3D depict the contribution vectors computed for the three agents. Each panel depicts the mean and standard deviation of the contributions of the different neurons of the agent over the 10 runs of the FCA. Importantly, the FCA converges to similar minima of the error in all runs, yielding consistent measures of the contributions.
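The prediction scheme can be sketched in code as follows. Here m is a binary lesioning configuration, c the contribution vector, and f the fitted prediction function; the 1 = intact convention and all names are our assumptions:

```python
import numpy as np


def predict_performance(m, c, f):
    """FCA prediction for a lesioning configuration m (1 = intact,
    0 = lesioned in our convention): the predicted score is f(m . c),
    cf. equation 2.1."""
    return f(np.dot(m, c))


def normalized_mse(configs, scores, c, f):
    """Mean squared error of the FCA prediction over a test set of
    lesioning configurations and their measured performance scores."""
    preds = np.array([predict_performance(m, c, f) for m in configs])
    return float(np.mean((preds - np.asarray(scores)) ** 2))
```

A toy check: with an identity f and c = (0.5, 0.5), lesioning half the network predicts half the performance.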
As expected from previous receptive field analysis of the agents (Aharonov-Barki et al., 2001), we find that the command neurons (neuron 5 in S10, neuron 6 in SP10, and neuron 8 in S22) receive significant and positive contributions in all three agents. Interestingly, in the two S-type agents, some of the motor neurons (neurons 1–4) receive high contribution values, testifying to the importance of the recurrency from the output units.4 This is consistent with the notion that the feedback from the motor units is required for signaling eating events, enabling the agents to switch to and remain in grazing mode. In SP10, where location information is available to the agent, this recurrent connection from the motor neurons is not significant. The inset in each panel plots the performance prediction function obtained by the FCA. All 10 performance prediction functions look similar (as can be seen for the case of S10 in Figure 3B), so one was chosen for clarity of exposition. The

3 Since obtaining the full 2^22 configuration set is impossible, we chose a large enough set such that almost any randomly chosen set of that size results in practically the same error.
4 Recall that under lesioning, the motor actions governed by an output neuron are kept intact; only the information transfer to the rest of the network is disturbed.
Figure 3: Contribution values of the agents’ neurons. The mean and standard deviation of the contribution values computed by the FCA are plotted in each panel (vertical axis). (A) Agent S10. FCA trained on 45 lesioning examples obtained by the adaptive lesioning algorithm. (B) The corresponding 10 performance prediction functions. The x-axis is m · c; the y-axis is the predicted performance f(m · c) (see equation 2.1). (C) Agent SP10. FCA trained on 45 lesioning examples obtained by the adaptive lesioning algorithm. (D) Agent S22. FCA trained on 99 random lesioning examples. The inset in panels A, C, and D depicts one performance prediction function. Neurons 1–4 (on the horizontal axis) are the motor neurons (see Figure 1).
similar form of the prediction functions again testifies to the convergence of the FCA to similar minima of the prediction error.

3.4 Comparison to the Single Lesion Approach. An important aspect of the FCA is the integration of information obtained from multiple lesions, going beyond the limited information available in single lesions
alone. Here, we compare the FCA approach to the single lesion approach using the evolved agents. To study the significance of using multiple lesions, we compute the null hypothesis: we calculate the contribution values relying only on the performance levels obtained when single units are lesioned. The contribution of neuron i is taken to be the decrease in performance due to lesioning that neuron alone, normalized as before.5 The black bars in Figure 4A depict the contributions computed from single lesions for the general performance of agent S10. The light bars describe the contributions computed by the FCA, using all possible lesioning configurations. As is evident, the contributions assigned by the FCA differ significantly from those obtained with knowledge of single lesion results. Do the contribution values computed by the FCA using multiple lesions describe the system more accurately than those inferred from single lesions? The obvious criterion is prediction ability. In principle, the single lesion approach implies a linear prediction function, which would lead to very inaccurate performance predictions. Thus, to obtain a fair comparison, we take the contribution vector found by the single lesion analysis and optimally fit its corresponding prediction function to the test set. This ensures that any difference in prediction capabilities stems from a difference in the contributions themselves. Figures 4B and 4C compare the predicted versus actual performance, using single lesions and the FCA, respectively. Clearly, single lesions do not reveal the true contributions of the neurons, testifying to the existence of some form of interaction between the effects of the different units on the network function. Figure 5 compares the test MSE obtained for several agents and tasks using single lesion information (black bars) and the FCA-derived contributions (light bars).
These results demonstrate again that in some agents and tasks, complex interactions exist, yielding suboptimal contributions when considering only single lesions. The information from multiple lesions in such networks is indeed essential for revealing the true contributions of the network elements. The more complex, possibly nonlinear interactions are modeled well by the FCA because, although it is a linear model, it is generalized by the nonlinear prediction function.6 However, many forms of interactions obviously cannot be modeled with the basic FCA. A classical example of such an interaction was given by Sprague (1966) in the neuroscience literature, showing that deficits in orienting toward a stimulus,
5 c_i = (1 - p_i) \big/ \sum_{i=1}^{N} |1 - p_i|, where p_i denotes the performance when neuron i is lesioned alone.
6 A simple example of such nonlinear interactions is a two-unit system that functions perfectly unless both units are lesioned. The basic FCA models such a case by assigning both units a contribution value of 0.5, with f(0) = 0 and f(0.5) = f(1) = 1.
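The single lesion baseline of footnote 5 is straightforward to compute; a minimal sketch (function name ours):

```python
import numpy as np


def single_lesion_contributions(p):
    """Null-hypothesis contributions from single lesions (footnote 5):
    c_i = (1 - p_i) / sum_j |1 - p_j|, where p_i is the performance
    measured when neuron i alone is lesioned."""
    p = np.asarray(p, dtype=float)
    return (1.0 - p) / np.abs(1.0 - p).sum()
```

By construction, the contributions are proportional to the single-lesion performance drops and sum (in absolute value) to 1.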
Figure 4: Comparison between the single lesion approach and the FCA on agent S10. (A) Contribution values obtained by the two methods. (B) Predicted versus actual performance on the test set using single lesion information. (C) Predicted versus actual performance on the test set using the FCA.
resulting from large cortical visual lesions, can be reversed through additional removal of contralateral areas. This has become known as the Sprague effect, or paradoxical lesioning (Hilgetag, Lomber, & Payne, 2000; Lomber & Payne, 2001). To address these challenges, we have generalized the FCA to include high-dimensional, context-dependent interactions. This nonlinear generalization is based on the usage of compound elements, which are specific combinations of a few simple elements. A preliminary discussion of the high-dimensional FCA appears in Segev et al. (2002), and a more thorough treatment will be given in a separate article. Note that for agent
Figure 5: Comparison of the MSE obtained by the single lesion approach and the FCA. Train and test sets are the full 2^10 configuration set for S10 and SP10. For S22, the training set is a 5000-configuration set, and the test set consists of 20,000 configurations. All predictions used a prediction function optimally fitted to the test set (see the text). In most tasks, the standard deviation of the MSE is too small to be seen.
SP10, single lesion information suffices to achieve accurate contributions, testifying to the simpler functional organization of the network.7
4 Localization of Several Tasks

In the previous section, we discussed the results of applying the FCA to the overall performance measure of the agent: its fitness. We showed how the FCA can be used to find the contribution of each neuron to the task of maximizing fitness. However, as discussed in section 3.1, the EAAs perform two additional subtasks: grazing and exploration. In this section, we show how the FCA can be used to localize these tasks in parallel.

7 However, it may be that localization of a specific task in this agent still requires an FCA analysis.
Figure 6: The contribution matrix. The network consists of N elements and performs K different functional tasks. The entry C_{ik} is the contribution of element i to task k.
Consider an agent with a neurocontroller network of interconnected neurons that performs a set of K different functional tasks. Addressing the question of which elements contribute to which tasks, it is natural to think in terms of a contribution matrix, where C_{ik} is the contribution of element i to task k, as shown in Figure 6. The elements studied may be neurons, synapses, or any higher-order module. As before, the data analyzed for computing the contribution matrix are gathered by inflicting a series of multiple lesions on the agent’s neurocontroller network. Under each lesion, the resulting performance of the agent in the different tasks is measured. Given these data, the FCA finds the contribution values C_{ik} that provide the best performance prediction on average over all possible multisite lesions (the kth column of the matrix is the contribution vector of task k as described in section 2). Following the spirit of Lashley (1929), the localization L_k of task k can now be defined as a deviation from equipotentiality along column k (e.g., L_1 in Figure 6), and similarly, S_i, the specialization of neuron i, is the deviation from equipotentiality along row i of the matrix (e.g., S_2 in Figure 6). More formally, L_k is the standard deviation of column k of the contribution matrix divided by the maximal possible standard deviation,
L_k = \frac{\operatorname{std}(C_{\cdot k})}{\sqrt{(N-1)/N^2}}.   (4.1)
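The denominator of equation 4.1 is the largest standard deviation a column can attain. A short check (our derivation), assuming the column entries are nonnegative and normalized to sum to 1, with the extreme reached when a single neuron carries the whole contribution:

```latex
% Fully localized column: C_{\cdot k} = (1, 0, \ldots, 0), with mean 1/N.
\operatorname{var}(C_{\cdot k})
  = \frac{1}{N}\left[\left(1 - \frac{1}{N}\right)^{2}
      + (N - 1)\left(\frac{1}{N}\right)^{2}\right]
  = \frac{N - 1}{N^{2}},
\qquad
\operatorname{std}(C_{\cdot k}) = \sqrt{\frac{N - 1}{N^{2}}},
```

so the fully localized column yields L_k = 1, while a perfectly uniform column yields L_k = 0.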
Figure 7: Contribution values for grazing and exploration. (A) Grazing task. (B) Exploration task.
Note that L_k is in the range [0, 1], where L_k = 0 indicates full distribution and L_k = 1 indicates localization of the task to one neuron alone. Similarly, if neuron i is highly specialized for a certain task, its row C_{i·} will deviate strongly from a uniform distribution, and thus we define S_i, the specialization of neuron i, as
S_i = \begin{cases} 0, & \text{if } C_{i\cdot} = 0, \\ \operatorname{std}(|C_{i\cdot}|) \big/ \sqrt{(K-1)/K^2}, & \text{otherwise.} \end{cases}   (4.2)
Note that S_i uses the absolute values of the contributions to reflect the intuition that the specialization of a unit is determined more by the magnitude of its contribution than by its sign. Again, S_i = 1 indicates maximal specialization of neuron i. We note, however, that in principle, other measures can be defined. The important point is that the contribution matrix enables one to define such quantitative measures to capture qualitative notions. We concentrate on one agent (S10) and consider the two tasks that correspond to the two emergent behaviors of successful agents: exploration and grazing (as defined in section 3.1). The FCA is performed separately for each task, yielding a contribution vector and a prediction function for that task. Figure 7 depicts the contribution values computed by the FCA for the two tasks. As is evident, the localization of the two tasks within the network differs considerably. Notably, neuron 10, which is an interneuron, contributes positively to grazing behavior but hinders exploration.
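Both indices are simple to compute from the contribution matrix; a sketch, following equation 4.1 and our reading of equation 4.2 (population standard deviation, as in the normalization; names ours):

```python
import numpy as np


def localization(C):
    """L_k, equation 4.1: the std of each column of the N x K
    contribution matrix, over the maximal std sqrt((N - 1) / N**2)."""
    N = C.shape[0]
    return C.std(axis=0) / np.sqrt((N - 1) / N**2)


def specialization(C):
    """S_i in the spirit of equation 4.2 (our reading): zero for a
    neuron whose contributions all vanish, otherwise the std of the
    absolute row values over the maximal std sqrt((K - 1) / K**2)."""
    N, K = C.shape
    A = np.abs(C)
    out = np.zeros(N)
    active = A.sum(axis=1) > 0
    out[active] = A[active].std(axis=1) / np.sqrt((K - 1) / K**2)
    return out
```

With K = 2 the normalization is 1/2, so S_i is simply twice the standard deviation of the two absolute contributions; for neuron 5 of Table 1, for instance, the contributions 0.2228 and 0.0740 give std 0.0744 and hence S_5 = 0.1488.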
Table 1: Localization and Specialization in Agent S10.

Neuron    Grazing    Exploration      S_i
  1        0.1696       0.5034      0.3338
  2        0.2916       0.2344      0.0571
  3        0.0878       0.1324      0.0445
  4        0.0000       0.0000      0.0000
  5        0.2228       0.0740      0.1488
  6        0.0015       0.0006      0.0009
  7       −0.0001       0.0001      0.0002
  8        0.0007       0.0001      0.0006
  9        0.0001       0.0001      0.0001
 10        0.2257      −0.0547      0.2804

 L_k       0.3684       0.5323
 L̃_k       0.1698       0.4691

Notes: Each entry is the contribution of the corresponding neuron to one of the two subtasks, grazing and exploration. The internal part of the table is thus the contribution matrix (see Figure 6) of agent S10. The localization, L_k, the effective localization, L̃_k, and the specialization, S_i, are displayed for each task and neuron. The entries in bold type correspond to the five significant neurons in the network: those with nonvanishing contribution values. The effective localization is computed using those entries only (see the text).
Given these data, we can now construct the contribution matrix of the agent, depicted in Table 1, along with the measures of localization and specialization computed from it. Since the size of the evolved neurocontrollers was set arbitrarily, a number of neurons do not participate in any task the agent performs, and hence the effective size of the network, considering only neurons with nonvanishing contributions, is smaller. This should be taken into account when considering task localization in the network. Thus, the effective localization L̃_k of the different tasks performed by the agent (see Table 1) is computed only over the nonvanishing neurons (neurons 1, 2, 3, 5, and 10 in this agent), after normalizing the sum of the absolute values of their contributions to 1. In this agent, exploration is more localized than grazing. Note that although interneuron 10 has contribution values of absolute magnitude similar to those of neuron 5, its specialization is double that of the latter, as it contributes positively to grazing behavior while actually hindering exploration.
5 Synaptic Analysis

The FCA is not limited to a specific level of analysis, and thus the same network can be studied concomitantly on the neuronal and synaptic levels. In this section, we apply the FCA to agent S10, analyzed above, to determine the contribution values of its synapses and, hence, the underlying structure of its synaptic network. The neuronal analysis revealed that 5 of the 10 neurons are significant. Thus, as a first step, we consider a reduced network consisting of only the significant neurons and their synapses. Recall, however, that insignificance of a motor neuron testifies only to the insignificance of its outgoing synapses, not to the insignificance of its activity insofar as it controls the motor. Thus, the reduced network of agent S10 contains the six indispensable neurons (the four motor neurons, neuron 5, and neuron 10) and all the synapses connecting them (excluding the outgoing synapses of neuron 4, which has a vanishing contribution value). The reduced network has a total of 30 synapses.8 The agent driven by this reduced network still attains a very high level of performance, 0.36, which is 90% of the performance obtained by the original agent, driven by the fully connected intact neurocontroller. The next step is to use the FCA to find the backbone of the network, that is, the significant synapses within the reduced neurocontroller. Applying the FCA to this synaptic analysis, with a training set consisting of 2000 randomly chosen configurations, the normalized MSE on a randomly chosen test set of 20,000 configurations is 0.05. The training set is extremely small relative to the full configuration space (2^30 configurations). Nevertheless, the FCA managed to reach a fairly low test error, testifying to the potential scalability of the method. Figure 8A depicts the synaptic contribution values obtained by the FCA.
The right part of the figure (marked “Neuron”) depicts the contribution values of the neurons of agent S10, as computed by the FCA on the neuronal level. Figure 8B depicts the weight values of the agent’s neurocontroller, normalized such that the sum of the absolute values of the synaptic weights is 1, also accompanied by the neuronal contributions computed by the FCA. Evidently, the network structure emerging from the synaptic contributions (see Figure 8A) differs quite significantly from that inferred directly from the synaptic weights (see Figure 8B; correlation of 0.31). Is the synaptic significance obtained by the FCA more accurate than that derived from the raw synaptic weights themselves? To answer this question, we lesion the synapses incrementally and measure the deterioration of the agent’s performance. Figure 9 depicts the performance of the
8 We consider here only the internal network, not the synapses coming from the input neurons, as lesioning those synapses causes damage so severe that the agent malfunctions completely.
Figure 8: Comparison of synaptic contributions computed by the FCA and the synaptic weights. (A) The FCA-computed contributions of the synapses and neurons (right row) of the reduced network. (B) The weights of the reduced network synapses, normalized such that the sum of the absolute values of these weights is 1. Neurons 1, 2, 3, 5, and 10 are recurrently connected. Neuron 4 is the motor neuron (controlling the mouth), whose outgoing synapses are insignificant.
Figure 9: Agent performance as a function of lesioning degree. Performance of agent S10 as a function of the number of synapses lesioned, for three different methods of incremental synaptic pruning. The performance is given relative to the performance of the original fully intact network. All methods start from the reduced network consisting of 30 synapses, that is, with 70 internal synapses lesioned. By contribution: for each of 10 FCA runs, synapses are incrementally lesioned in ascending order of their contribution value. The line depicts the mean and standard deviation of the measured performance levels over the 10 runs. By absolute weight: performance levels following lesioning in ascending order of absolute synaptic weight. Random: mean and standard deviation of performance levels over 100 random orders of synaptic lesions.
agent as a function of the number of lesioned synapses for three methods of incremental lesioning, starting from the intact reduced network of 30 internal synapses. In the first lesioning experiment, we use the synaptic contribution values computed by the FCA and lesion in ascending order of significance. The graph depicts the mean and standard deviation using contribution values from 10 FCA runs. In the second lesioning experiment, we use the absolute weight as a measure of importance and lesion in ascending order of the absolute value of the weights. In the third experiment, we lesion incrementally by choosing the next synapse to be lesioned at random. The graph depicts the mean and standard deviation over 100 such choices.
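The incremental pruning experiments amount to a simple loop; a minimal sketch, where the significance scores and the performance-evaluation callback are placeholders of our own:

```python
import numpy as np


def pruning_curve(scores, performance_fn):
    """Lesion synapses one at a time in ascending order of a significance
    score (FCA contribution, absolute weight, or a random score), and
    record the agent's performance after each lesion, as in Figure 9."""
    order = np.argsort(scores)                  # least significant first
    intact = np.ones(len(scores), dtype=bool)
    curve = []
    for idx in order:
        intact[idx] = False                     # cut this synapse
        curve.append(performance_fn(intact))
    return curve
```

Passing a random score vector to the same routine reproduces the random-pruning baseline.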
Figure 10: Network backbone. (A) The nine synapses of the internal network with the largest contribution values. (B) Similarly for the largest synaptic weights (by absolute value). Dashed lines denote negative values. The thickness of the lines scales with their absolute value and is normalized to have the same sum in each panel.
As is evident, the FCA is much better at identifying the significant synapses of the network. For example, leaving only 15 synapses in the entire internal network (15% of the total internal network and 50% of the reduced network) salvages 80% of the original performance level. In contrast, using the weights as a significance measure to choose the 15 important synapses leaves only 50% of the original performance.9 Not surprisingly, choosing 15 synapses randomly completely destroys the agent’s performance. In summary, the FCA proves very reliable in identifying the significant network backbone and can be used to select synapses for pruning a neural network. Using the FCA-computed synaptic contributions, a minimal effective connectivity can be inferred to simplify the analysis of the network. This is especially important for fully recurrent networks, where understanding the structure becomes a daunting task. Looking at Figure 8A, one observes that there are nine synapses with relatively large contribution values compared to the rest. Figure 10A depicts the network formed by these nine largest synaptic contributions, and correspondingly, Figure 10B depicts the nine synapses with the largest weight values (in absolute value). Although part of the resulting backbone is similar, there are two important differences. First, although the synaptic weights imply that the connections into neuron 4 are important, the FCA correctly reveals that this is not the case. Second, the FCA analysis clearly shows the central role of the command neuron (neuron 5), a feature of the network much less apparent when considering the synaptic

9 Using the actual weight value rather than the absolute value achieves even lower performance at all levels of synaptic lesioning.
strengths. The functional role of the command neuron in these networks has been studied in detail in Aharonov-Barki et al. (2001). Examining the network backbone of Figure 10A, one can postulate that the synapse from neuron 1 (forward step) to neuron 2 (left turn) is responsible for the move-and-turn behavior observed in grazing mode. Moreover, the synapses from the turn motor neurons (2 and 3) to the command neuron (neuron 5) are clearly important, as expected from previous analysis. As discussed above, the network functional connectivity emerging from the FCA analysis is more accurate in terms of being able to predict the network performance. Note that although only nine synapses receive a high contribution value (see Figure 8), the performance of the network with only these synapses intact is still rather low (see Figure 9). This is because other synapses are important in certain combinations, improving the performance when they are intact.10 However, the nine high-contribution synapses are strongly beneficial in many lesioning configurations and hence are picked up by the basic FCA.

6 Discussion

This article presents a functional contribution analysis that addresses the fundamental challenge of localization of function in neural networks. The FCA is based on the assignment of contribution values to the basic elements of the network, such that the ability to predict the network’s performance in response to multilesions is maximized. The algorithm is thoroughly examined in an EAA environment. The results testify that:

• In the recurrent yet simple EAA neurocontrollers studied here, the basic version of the FCA suffices to portray a stable set of contributions and yields fairly accurate multilesion predictions. These are significantly better than those obtained with a single lesion analysis method. The FCA may be efficiently trained on a relatively small subset of all possible multilesion configurations.
It can be used to localize the several subtasks that the agent performs and to compute the corresponding localization and specialization indices.

• FCA analysis on the synaptic level testifies to the potential scalability of the approach to larger systems. It also provides an important method for synaptic pruning and for visualizing the effective backbone of the network architecture.

These results should be viewed as an encouraging first step in the attempt to understand information processing in neural systems. Several important issues and limitations of the current FCA arise from this work and deserve considerable attention in the future. First and foremost, the accurate analysis of complex networks requires high-dimensional descriptions involving compound elements that are composed of several basic units (neurons or synapses) that interact together. Such high-dimensional analysis is essential, for example, for an accurate account of subnetworks displaying the “paradoxical lesioning” effect. Second, the FCA makes one think about the “correct” or “best” way to perform the lesions in the network. Our results indicate that the conventional way of lesioning currently employed in neuroscience, that is, completely silencing the lesioned units, is essentially inadequate for general multilesion studies. In this study, we have focused on stochastic lesioning, but is it the best one possible? Finally, the FCA provides important and useful information about function localization and the true effective network size, but it does not otherwise advance our understanding of how neural information processing takes place. The way in which information is represented, encoded, and manipulated on the neural, assembly, and network levels remains an open question. The FCA has potential significance beyond the scope of EAAs. Multilesion analysis algorithms like the FCA will become an essential tool in neuroscience for the analysis of reversible inactivation experiments, combining reversible neural cooling deactivation with behavioral testing of animals. These methods alleviate many of the problematic aspects of the classical single lesioning technique (ablation), enabling the acquisition of reliable data from multiple lesions of different configurations (for a review, see Lomber, 1999). If the underlying network of brain regions studied is simple enough, then perhaps the “biological” ablation method of lesioning currently employed in these experiments will suffice.

10 High-dimensional FCA is necessary to reveal these combinations.
However, our study suggests that if these networks are more complex, ablation lesions may result in excessive perturbations of the system and fail to reveal its normal operation (the problematic nature of analyzing lesioning data subject to such excessive lesioning has also been discussed in Young et al., 2000). If that turns out to be the case, then other lesioning methods that cause smaller perturbations (much like stochastic lesioning) should be employed. FCA-like algorithms could also be used for the analysis of transcranial magnetic stimulation studies, which aim to induce multiple transient lesions and study their cognitive effects (see Walsh & Cowey, 2000, for a review). They could also be used for studying the functional efficacy of synaptic projections on different components of the neuronal dendritic tree, in parallel to recent studies examining this question with other methods (London, Shribman, Hausser, Larkum, & Segev, 2002). In both cases, stochastic lesioning may prove to be the preferred lesioning method, employed both experimentally and in the FCA analysis. Another potentially interesting application is the analysis of functional imaging data. One direction is to assess the contributions of each element to the others, that is, extending previous effective connectivity studies employing linear models (Friston, Frith, & Frackowiak, 1993) with high-dimensional FCA. The second direction involves the application of the FCA to the meta-analysis of
Localization of Function via Lesion Analysis
functional imaging data. This will require the development and study of a continuous version of the current binary FCA representations, where results of various activation studies of similar tasks are viewed as different "multiple-lesion" probes of the task at hand. Finally, the FCA studied in this article has focused on studying each task separately and then assembling this information and presenting it in a contribution matrix. The latter, however, could serve as an object of further elaborate analysis beyond the computation of localization and specialization indices. This analysis could involve a singular value decomposition of the contribution matrix, reducing it to a lower dimension and revealing possible hidden similarities in the localization of different tasks, in a manner analogous to the latent semantic analysis used in information retrieval applications (Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990). In summary, the FCA provides a rigorous method for determining function localization, yielding useful insights into the organization of animat and animate nervous systems.

Acknowledgments

We acknowledge the valuable contributions and suggestions made by Hanoch Gutfreund, Nira Dyn, and Matan Ninio and the technical help provided by Oran Singer. This research has been supported by the Adams Super Center for Brain Studies in Tel Aviv University and the Israel Science Foundation. R. A. is supported by the Horowitz Foundation.

References

Aharonov-Barki, R., Beker, T., & Ruppin, E. (2001). Emergence of memory-driven command neurons in evolved artificial agents. Neural Computation, 13, 691–716.
Barlow, R., Bartholomew, D., Bremner, J., & Brunk, H. (1972). Statistical inference under order restrictions. New York: Wiley.
Cangelosi, A., & Parisi, D. (1997). A neural network model of Caenorhabditis elegans: The circuit of touch sensitivity. Neural Processing Letters, 6, 91–98.
Cohen, J. D., & Tong, F. (2001). The face of controversy.
Science, 293(5539), 2405–2407.
Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1995). Active learning with statistical models. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 705–712). Cambridge, MA: MIT Press. Available on-line: citeseer.nj.nec.com/article/cohn95active.html.
Deerwester, S., Dumais, T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the Society for Information Science, 41(6), 391–407.
Downing, P. E., Jiang, Y., Shuman, M., & Kanwisher, N. (2001). A cortical area selective for visual processing of the human body. Science, 293(5539), 2470–2473.
R. Aharonov, L. Segev, I. Meilijson, and E. Ruppin
Engelbrecht, A., & Cloete, I. (1999). Incremental learning using sensitivity analysis. In Proceedings of the IEEE International Joint Conference on Neural Networks. Washington, D.C. Available on-line: citeseer.nj.nec.com/engelbrecht99incremental.html.
Farah, M. J. (1990). Visual agnosia. Cambridge, MA: MIT Press.
Farah, M. J. (1996). Is face recognition "special"? Evidence from neuropsychology. Behavioral Brain Research, 76, 181–189.
Floreano, D., & Mondada, F. (1996). Evolution of homing navigation in a real mobile robot. IEEE Transactions on Systems, Man, and Cybernetics—Part B, 26(3), 396–407.
Floreano, D., & Mondada, F. (1998). Evolutionary neurocontrollers for autonomous mobile robots. Neural Networks, 11(7–8), 1461–1478.
Friston, K. J., Frith, C. D., & Frackowiak, R. S. J. (1993). Time-dependent changes in effective connectivity measured with PET. Human Brain Mapping, 1, 69–79.
Gomez, F., & Miikkulainen, R. (1997). Incremental evolution of complex general behavior. Adaptive Behavior, 5(3/4), 317–342.
Guillot, A., & Meyer, J.-A. (2000). From SAB94 to SAB2000: What's new, animat? In Proceedings of the Sixth International Conference on Simulation of Adaptive Behavior. Cambridge, MA: MIT Press.
Guillot, A., & Meyer, J.-A. (2001). The animat contribution to cognitive systems research. Journal of Cognitive Systems Research, 2, 157–165.
Haxby, J. V., Gobbini, M. I., Furey, M. L., Ishai, A., Schouten, J. L., & Pietrini, P. (2001). Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science, 293(5539), 2425–2430.
Hilgetag, C. C., Lomber, S. G., & Payne, B. R. (2000). Neural mechanisms of spatial attention in the cat. Neurocomputing, 38, 1281–1287.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554–2558.
Huber, P. J. (1985). Projection pursuit. Annals of Statistics, 13(2), 435–475.
Ijspeert, A.
J., Hallam, J., & Willshaw, D. (1999). Evolving swimming controllers for a simulated lamprey with inspiration from neurobiology. Adaptive Behavior, 7, 151–172.
Kodjabachian, J., & Meyer, J. A. (1998). Evolution and development of neural controllers for locomotion, gradient-following and obstacle-avoidance in artificial insects. IEEE Transactions on Neural Networks, 9(5), 796–812.
Lashley, K. S. (1929). Brain mechanisms and intelligence. Chicago: University of Chicago Press.
Lomber, S. G. (1999). The advantages and limitations of permanent or reversible deactivation techniques in the assessment of neural function. Journal of Neuroscience Methods, 86, 109–117.
Lomber, S. G., & Payne, B. R. (2001). Task-specific reversal of visual hemineglect following bilateral reversible deactivation of posterior parietal cortex: A comparison with deactivation of the superior colliculus. Visual Neuroscience, 18(3), 487–499.
London, M., Shribman, A., Hausser, M., Larkum, M., & Segev, I. (2002). The information efficacy of a synapse. Nature Neuroscience, 5(4), 333–340.
McClelland, J. L., Rumelhart, D. E., & the PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 2: Psychological and biological models. Cambridge, MA: MIT Press.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). London: Chapman and Hall.
Meyer, J.-A., & Guillot, A. (1994). From SAB90 to SAB94: Four years of animat research. In D. Cliff, P. Husbands, J.-A. Meyer, & S. K. Wilson (Eds.), Proceedings of the Third International Conference on Simulation of Adaptive Behavior. Cambridge, MA: MIT Press.
Ruppin, E. (2002). Evolutionary autonomous agents: A neuroscience perspective. Nature Reviews Neuroscience, 3, 132–141.
Scheier, C., Pfeifer, R., & Kuniyoshi, Y. (1998). Embedded neural networks: Exploiting constraints. Neural Networks, 11(7–8), 1551–1569.
Segev, L., Aharonov, R., Meilijson, I., & Ruppin, E. (2002). Localization of function in neurocontrollers. In Proceedings of the International Conference on the Simulation of Adaptive Behavior (SAB2002). Cambridge, MA: MIT Press.
Sitton, M., Mozer, M., & Farah, M. J. (2000). Superadditive effects of multiple lesions in a connectionist architecture: Implications for the neuropsychology of optic aphasia. Psychological Review, 107, 709–734.
Sprague, J. M. (1966). Interaction of cortex and superior colliculus in mediation of visually guided behavior in the cat. Science, 153, 1544–1547.
Squire, L. R. (1992). Memory and the hippocampus: A synthesis of findings with rats, monkeys, and humans. Psychological Review, 99, 195–231.
Thorpe, S. (1995). Localized versus distributed representations. In M. A. Arbib (Ed.), Handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Walsh, V., & Cowey, A. (2000). Transcranial magnetic stimulation and cognitive neuroscience. Nature Reviews Neuroscience, 1, 73–79.
Wu, J., Cohen, L. B., & Falk, C. X. (1994). Neuronal activity during different behaviors in Aplysia: A distributed organization?
Science, 263, 820–822.
Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE, 87(9), 1423–1447.
Young, M. P., Hilgetag, C. C., & Scannell, J. W. (2000). On imputing function to structure from the behavioural effects of brain lesions. Phil. Trans. R. Soc. Lond. B, 355, 147–161.

Received March 15, 2002; accepted October 4, 2002.
Communicated by Dan Lee
LETTER
The Concave-Convex Procedure A. L. Yuille
[email protected] Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115, U.S.A.
Anand Rangarajan
[email protected] Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, U.S.A.
The concave-convex procedure (CCCP) is a way to construct discrete-time iterative dynamical systems that are guaranteed to decrease global optimization and energy functions monotonically. This procedure can be applied to almost any optimization problem, and many existing algorithms can be interpreted in terms of it. In particular, we prove that all expectation-maximization algorithms and classes of Legendre minimization and variational bounding algorithms can be reexpressed in terms of CCCP. We show that many existing neural network and mean-field theory algorithms are also examples of CCCP. The generalized iterative scaling algorithm and Sinkhorn's algorithm can also be expressed as CCCP by changing variables. CCCP can be used both as a new way to understand, and prove the convergence of, existing optimization algorithms and as a procedure for generating new algorithms.

1 Introduction

This article describes a simple geometrical concave-convex procedure (CCCP) for constructing discrete-time dynamical systems that are guaranteed to decrease almost any global optimization or energy function monotonically. Such discrete-time systems have advantages over standard gradient descent techniques (Press, Flannery, Teukolsky, & Vetterling, 1986) because they do not require estimating a step size and empirically often converge rapidly. We first illustrate CCCP by giving examples of neural network, mean-field, and self-annealing (which relate to Bregman distances; Bregman, 1967) algorithms, which can be reexpressed in this form. As we will show, the entropy terms arising in mean-field algorithms make it particularly easy to apply CCCP. CCCP has also been applied to develop an algorithm that minimizes the Bethe and Kikuchi free energies and whose empirical convergence is rapid (Yuille, 2002). Neural Computation 15, 915–936 (2003)
© 2003 Massachusetts Institute of Technology
A. Yuille and A. Rangarajan
Next, we prove that many existing algorithms can be directly reexpressed in terms of CCCP. This includes expectation-maximization (EM) algorithms (Dempster, Laird, & Rubin, 1977) and minimization algorithms based on Legendre transforms (Rangarajan, Yuille, & Mjolsness, 1999). CCCP can be viewed as a special case of variational bounding (Rustagi, 1976; Jordan, Ghahramani, Jaakkola, & Saul, 1999) and related techniques including lower-bound and upper-bound minimization (Luttrell, 1994), surrogate functions, and majorization (Lange, Hunter, & Yang, 2000). CCCP gives a novel geometric perspective on these algorithms and yields new convergence proofs. Finally, we reformulate other classic algorithms in terms of CCCP by changing variables. These include the generalized iterative scaling (GIS) algorithm (Darroch & Ratcliff, 1972) and Sinkhorn's algorithm for obtaining doubly stochastic matrices (Sinkhorn, 1964). Sinkhorn's algorithm can be used to solve the linear assignment problem (Kosowsky & Yuille, 1994), and CCCP variants of Sinkhorn can be used to solve additional constraint problems.

We introduce CCCP in section 2 and prove that it converges. Section 3 illustrates CCCP with examples from neural networks, mean-field theory, self-annealing, and EM. In section 4, we prove the relationships between CCCP and the EM algorithm, Legendre transforms, and variational bounding. Section 5 shows that other algorithms such as GIS and Sinkhorn can be expressed in CCCP by a change of variables.

2 The Basics of CCCP

This section introduces the main results of CCCP and summarizes them in three theorems. Theorem 1 states the general conditions under which CCCP can be applied, theorem 2 defines CCCP and proves its convergence, and theorem 3 describes an inner loop that may be necessary for some CCCP algorithms.

Theorem 1 shows that any function, subject to weak conditions, can be expressed as the sum of a convex and a concave part (this decomposition is not unique) (see Figure 1).
This will imply that CCCP can be applied to almost any optimization problem.

Theorem 1. Let $E(\vec{x})$ be an energy function with bounded Hessian $\partial^2 E(\vec{x})/\partial\vec{x}\,\partial\vec{x}$. Then we can always decompose it into the sum of a convex function and a concave function.

Proof. Select any convex function $F(\vec{x})$ with positive definite Hessian with eigenvalues bounded below by $\epsilon > 0$. Then there exists a positive constant $\lambda$ such that the Hessian of $E(\vec{x}) + \lambda F(\vec{x})$ is positive definite and hence $E(\vec{x}) + \lambda F(\vec{x})$ is convex. Hence, we can express $E(\vec{x})$ as the sum of a convex part, $E(\vec{x}) + \lambda F(\vec{x})$, and a concave part, $-\lambda F(\vec{x})$.
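The proof of theorem 1 is constructive and easy to check numerically. The sketch below is our own toy example (not one from the paper): it takes $E(x) = \sin(x)$, whose second derivative is bounded, chooses $F(x) = x^2/2$ and $\lambda = 1$, and verifies on a grid that $E + \lambda F$ is convex, $-\lambda F$ is concave, and the two parts sum back to $E$.

```python
import numpy as np

# E(x) = sin(x): E''(x) = -sin(x) is bounded in [-1, 1].
# Choose F(x) = x^2/2 (Hessian 1) and lambda = 1, so (E + lam*F)'' >= 0.
lam = 1.0
xs = np.linspace(-10.0, 10.0, 2001)
E_pp = -np.sin(xs)                  # second derivative of E
vex_pp = E_pp + lam                 # second derivative of the convex part E + lam*F
cave_pp = -lam * np.ones_like(xs)   # second derivative of the concave part -lam*F

assert np.all(vex_pp >= 0) and np.all(cave_pp <= 0)
assert np.allclose(vex_pp + cave_pp, E_pp)   # the decomposition is exact
```

Any sufficiently large $\lambda$ works; the decomposition is indeed not unique.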
Figure 1: Decomposing a function into convex and concave parts. The original function (top panel) can be expressed as the sum of a convex function (bottom left panel) and a concave function (bottom right panel).
Theorem 2 defines the CCCP procedure and proves that it converges to a minimum or a saddle point of the energy function. (After completing this work, we found that a version of theorem 2 appeared in an unpublished technical report: Geman, 1984.)

Theorem 2. Consider an energy function $E(\vec{x})$ (bounded below) of form $E(\vec{x}) = E_{vex}(\vec{x}) + E_{cave}(\vec{x})$, where $E_{vex}(\vec{x})$, $E_{cave}(\vec{x})$ are convex and concave functions of $\vec{x}$, respectively. Then the discrete iterative CCCP algorithm $\vec{x}^t \mapsto \vec{x}^{t+1}$ given by

$$\vec{\nabla} E_{vex}(\vec{x}^{t+1}) = -\vec{\nabla} E_{cave}(\vec{x}^{t}) \quad (2.1)$$

is guaranteed to monotonically decrease the energy $E(\vec{x})$ as a function of time and hence to converge to a minimum or saddle point of $E(\vec{x})$ (or even a local maximum if it starts at one). Moreover,

$$E(\vec{x}^{t+1}) = E(\vec{x}^{t}) - \frac{1}{2}\,(\vec{x}^{t+1} - \vec{x}^{t})^{T}\,\{\vec{\nabla}\vec{\nabla} E_{vex}(\vec{x}^{*}) - \vec{\nabla}\vec{\nabla} E_{cave}(\vec{x}^{**})\}\,(\vec{x}^{t+1} - \vec{x}^{t}), \quad (2.2)$$

for some $\vec{x}^{*}$ and $\vec{x}^{**}$, where $\vec{\nabla}\vec{\nabla} E(.)$ is the Hessian of $E(.)$.
Figure 2: A CCCP algorithm illustrated for convex minus convex. We want to minimize the function in the left panel. We decompose it (right panel) into a convex part (top curve) minus a convex term (bottom curve). The algorithm iterates by matching points on the two curves that have the same tangent vectors. See the text for more detail. The algorithm rapidly converges to the solution at x = 5.0.
Proof. The convexity and concavity of $E_{vex}(.)$ and $E_{cave}(.)$ mean that $E_{vex}(\vec{x}_2) \ge E_{vex}(\vec{x}_1) + (\vec{x}_2 - \vec{x}_1)\cdot\vec{\nabla} E_{vex}(\vec{x}_1)$ and $E_{cave}(\vec{x}_4) \le E_{cave}(\vec{x}_3) + (\vec{x}_4 - \vec{x}_3)\cdot\vec{\nabla} E_{cave}(\vec{x}_3)$, for all $\vec{x}_1, \vec{x}_2, \vec{x}_3, \vec{x}_4$. Now set $\vec{x}_1 = \vec{x}^{t+1}$, $\vec{x}_2 = \vec{x}^{t}$, $\vec{x}_3 = \vec{x}^{t}$, $\vec{x}_4 = \vec{x}^{t+1}$. Using the algorithm definition ($\vec{\nabla} E_{vex}(\vec{x}^{t+1}) = -\vec{\nabla} E_{cave}(\vec{x}^{t})$), we find that $E_{vex}(\vec{x}^{t+1}) + E_{cave}(\vec{x}^{t+1}) \le E_{vex}(\vec{x}^{t}) + E_{cave}(\vec{x}^{t})$, which proves the first claim. The second claim follows by computing the second-order terms of the Taylor series expansion and applying Rolle's theorem.

We can get a graphical illustration of this algorithm by the reformulation shown in Figure 2. Think of decomposing the energy function $E(\vec{x})$ into $E_1(\vec{x}) - E_2(\vec{x})$, where both $E_1(\vec{x})$ and $E_2(\vec{x})$ are convex. (This is equivalent to decomposing $E(\vec{x})$ into a convex term $E_1(\vec{x})$ plus a concave term $-E_2(\vec{x})$.) The algorithm proceeds by matching points on the two terms that have the same tangents. For an input $\vec{x}^{0}$, we calculate the gradient $\vec{\nabla} E_2(\vec{x}^{0})$ and find the point $\vec{x}^{1}$ such that $\vec{\nabla} E_1(\vec{x}^{1}) = \vec{\nabla} E_2(\vec{x}^{0})$. We next determine the point $\vec{x}^{2}$ such that $\vec{\nabla} E_1(\vec{x}^{2}) = \vec{\nabla} E_2(\vec{x}^{1})$, and repeat.

The second statement of theorem 2 can be used to analyze the convergence rates of the algorithm by placing lower bounds on the (positive semidefinite) matrix $\{\vec{\nabla}\vec{\nabla} E_{vex}(\vec{x}^{*}) - \vec{\nabla}\vec{\nabla} E_{cave}(\vec{x}^{**})\}$. Moreover, if we can bound this below by a matrix $B$, then we obtain $E(\vec{x}^{t+1}) - E(\vec{x}^{t}) \le -(1/2)\{\vec{\nabla} E_{vex}^{-1}(-\vec{\nabla} E_{cave}(\vec{x}^{t})) - \vec{x}^{t}\}^{T} B \{\vec{\nabla} E_{vex}^{-1}(-\vec{\nabla} E_{cave}(\vec{x}^{t})) - \vec{x}^{t}\} \le 0$, where $\vec{\nabla} E_{vex}^{-1}$ is the inverse of $\vec{\nabla} E_{vex}(.)$. We can therefore think of $(1/2)\{\vec{\nabla} E_{vex}^{-1}(-\vec{\nabla} E_{cave}(\vec{x}^{t})) - \vec{x}^{t}\}^{T} B \{\vec{\nabla} E_{vex}^{-1}(-\vec{\nabla} E_{cave}(\vec{x}^{t})) - \vec{x}^{t}\}$ as an auxiliary function (Della Pietra, Della Pietra, & Lafferty, 1997).
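To make the iteration of theorem 2 concrete, here is a minimal one-dimensional sketch (our own example, with a hypothetical function, not the one plotted in the paper's figures). We minimize $E(x) = x^4 - 8x^2$ with the decomposition $E_{vex}(x) = x^4$, $E_{cave}(x) = -8x^2$; equation 2.1 becomes $4x_{t+1}^3 = 16\,x_t$, which has the closed-form solution $x_{t+1} = (4x_t)^{1/3}$, and the energy decreases monotonically toward the minimum at $x = 2$.

```python
E = lambda x: x**4 - 8 * x**2        # E = E_vex + E_cave, minima at x = +/- 2
step = lambda x: (4 * x) ** (1 / 3)  # solves grad E_vex(x_next) = -grad E_cave(x)

x = 0.5                              # any positive starting point converges to x = 2
energies = []
for _ in range(30):
    energies.append(E(x))
    x = step(x)

# Monotone decrease (theorem 2) and convergence to the minimum
assert all(b <= a + 1e-12 for a, b in zip(energies, energies[1:]))
assert abs(x - 2.0) < 1e-6
```

Note that no step size is needed: each update jumps directly to the point whose convex-part gradient matches the concave-part gradient, exactly the tangent-matching picture described above.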
We can extend theorem 2 to allow for linear constraints on the variables $\vec{x}$, for example, $\sum_i c_i^{\mu} x_i = \alpha^{\mu}$, where the $\{c_i^{\mu}\}, \{\alpha^{\mu}\}$ are constants. This follows directly because properties such as convexity and concavity are preserved when linear constraints are imposed. We can change to new coordinates defined on the hyperplane defined by the linear constraints. Then we apply theorem 1 in this coordinate system.

Observe that theorem 2 defines the update as an implicit function of $\vec{x}^{t+1}$. In many cases, as we will show in section 3, it is possible to solve for $\vec{x}^{t+1}$ analytically. In other cases, we may need an algorithm, or inner loop, to determine $\vec{x}^{t+1}$ from $\vec{\nabla} E_{vex}(\vec{x}^{t+1})$. In these cases, we will need the following theorem, where we reexpress CCCP in terms of minimizing a time sequence of convex update energy functions $E^{t+1}(\vec{x}^{t+1})$ to obtain the updates $\vec{x}^{t+1}$ (i.e., at the $t$th iteration of CCCP, we need to minimize the energy $E^{t+1}(\vec{x}^{t+1})$).

Theorem 3. Let $E(\vec{x}) = E_{vex}(\vec{x}) + E_{cave}(\vec{x})$, where $\vec{x}$ is required to satisfy the linear constraints $\sum_i c_i^{\mu} x_i = \alpha^{\mu}$, where the $\{c_i^{\mu}\}, \{\alpha^{\mu}\}$ are constants. Then the update rule for $\vec{x}^{t+1}$ can be formulated as setting $\vec{x}^{t+1} = \arg\min_{\vec{x}} E^{t+1}(\vec{x})$ for a time sequence of convex update energy functions $E^{t+1}(\vec{x})$ defined by

$$E^{t+1}(\vec{x}) = E_{vex}(\vec{x}) + \sum_i x_i \frac{\partial E_{cave}}{\partial x_i}(\vec{x}^{t}) + \sum_{\mu} \lambda_{\mu} \left\{ \sum_i c_i^{\mu} x_i - \alpha^{\mu} \right\}, \quad (2.3)$$

where the Lagrange parameters $\{\lambda_{\mu}\}$ impose the linear constraints.

Proof. Direct calculation.
The convexity of $E^{t+1}(\vec{x})$ implies that there is a unique minimum $\vec{x}^{t+1} = \arg\min_{\vec{x}} E^{t+1}(\vec{x})$. This means that if an inner loop is needed to calculate $\vec{x}^{t+1}$, then we can use standard techniques such as conjugate gradient descent.

An important special case is when $E_{vex}(\vec{x}) = \sum_i x_i \log x_i$. This case occurs frequently in our examples (see section 3). We will show in section 5 that $E^{t}(\vec{x})$ can be minimized by a CCCP algorithm.

3 Examples of CCCP

This section illustrates CCCP by examples from neural networks, mean-field algorithms, self-annealing (which relates to Bregman distances; Bregman, 1967), EM, and mixture models. These algorithms can be applied to a range of problems, including clustering, combinatorial optimization, and learning.

3.1 Example 1. Our first example is a neural net or mean-field Potts model. These have been used for content addressable memories (Waugh &
Westervelt, 1993; Elfadel, 1995) and have been applied to clustering for unsupervised texture segmentation (Hofmann & Buhmann, 1997). An original motivation for them was based on a convexity principle (Marcus & Westervelt, 1989). We now show that algorithms for these models can be derived directly using CCCP.

Example. Discrete-time dynamical systems for the mean-field Potts model attempt to minimize discrete energy functions of form $E[V] = (1/2)\sum_{i,j,a,b} C_{ijab} V_{ia} V_{jb} + \sum_{ia} \theta_{ia} V_{ia}$, where the $\{V_{ia}\}$ take discrete values $\{0, 1\}$ with linear constraints $\sum_i V_{ia} = 1, \forall a$. Mean-field algorithms minimize a continuous effective energy $E_{eff}[S; T]$ to obtain a minimum of the discrete energy $E[V]$ in the limit as $T \mapsto 0$. The $\{S_{ia}\}$ are continuous variables in the range $[0, 1]$ and correspond to (approximate) estimates of the mean states of the $\{V_{ia}\}$ with respect to the distribution $P[V] = e^{-E[V]/T}/Z$, where $T$ is a temperature parameter and $Z$ is a normalization constant. As described in Yuille and Kosowsky (1994), to ensure that the minima of $E[V]$ and $E_{eff}[S; T]$ all coincide (as $T \mapsto 0$), it is sufficient that $C_{ijab}$ be negative definite. Moreover, this can be attained by adding a term $-K \sum_{ia} V_{ia}^2$ to $E[V]$ (for sufficiently large $K$) without altering the structure of the minima of $E[V]$. Hence, without loss of generality, we can consider $(1/2)\sum_{i,j,a,b} C_{ijab} V_{ia} V_{jb}$ to be a concave function.

We impose the linear constraints by adding a Lagrange multiplier term $\sum_a p_a \{\sum_i V_{ia} - 1\}$ to the energy, where the $\{p_a\}$ are the Lagrange multipliers. The effective energy is given by

$$E_{eff}[S] = (1/2)\sum_{i,j,a,b} C_{ijab} S_{ia} S_{jb} + \sum_{ia} \theta_{ia} S_{ia} + T \sum_{ia} S_{ia} \log S_{ia} + \sum_a p_a \left\{ \sum_i S_{ia} - 1 \right\}. \quad (3.1)$$
We decompose $E_{eff}[S]$ into a convex part $E_{vex} = T\sum_{ia} S_{ia}\log S_{ia} + \sum_a p_a\{\sum_i S_{ia} - 1\}$ and a concave part $E_{cave}[S] = (1/2)\sum_{i,j,a,b} C_{ijab} S_{ia} S_{jb} + \sum_{ia}\theta_{ia} S_{ia}$. Taking derivatives yields $\frac{\partial}{\partial S_{ia}} E_{vex}[S] = T(1 + \log S_{ia}) + p_a$ and $\frac{\partial}{\partial S_{ia}} E_{cave}[S] = \sum_{j,b} C_{ijab} S_{jb} + \theta_{ia}$. Applying CCCP by setting $\frac{\partial E_{vex}}{\partial S_{ia}}(S^{t+1}) = -\frac{\partial E_{cave}}{\partial S_{ia}}(S^{t})$ gives $T\{1 + \log S^{t+1}_{ia}\} + p_a = -\sum_{j,b} C_{ijab} S^{t}_{jb} - \theta_{ia}$. We solve for the Lagrange multipliers $\{p_a\}$ by imposing the constraints $\sum_i S^{t+1}_{ia} = 1, \forall a$. This gives a discrete update rule:

$$S^{t+1}_{ia} = \frac{e^{(-1/T)\{\sum_{j,b} C_{ijab} S^{t}_{jb} + \theta_{ia}\}}}{\sum_c e^{(-1/T)\{\sum_{j,b} C_{cjab} S^{t}_{jb} + \theta_{ca}\}}}. \quad (3.2)$$
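The example 1 update is easy to exercise numerically. The sketch below is our own illustration (hypothetical sizes, random couplings, and a single per-label normalization constraint; the quadratic coefficients are flattened into one negative definite matrix `C` so that the quadratic term is genuinely concave). Each step minimizes the theorem 3 surrogate in closed form, and we assert the monotone energy decrease that theorem 2 guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, T = 5, 3, 0.5                         # units i, labels a, temperature
M = rng.normal(size=(n * q, n * q))
C = -(M @ M.T) / (n * q)                    # negative definite: quadratic term is concave
theta = rng.normal(size=(n, q))

def energy(S):
    s = S.ravel()
    return 0.5 * s @ C @ s + (theta * S).sum() + T * (S * np.log(S)).sum()

def cccp_step(S):
    g = (C @ S.ravel()).reshape(n, q) + theta    # gradient of the concave part
    logits = -g / T
    logits -= logits.max(axis=0, keepdims=True)  # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=0, keepdims=True)      # imposes sum_i S_ia = 1 for each a

S = np.full((n, q), 1.0 / n)
energies = [energy(S)]
for _ in range(40):
    S = cccp_step(S)
    energies.append(energy(S))

assert np.allclose(S.sum(axis=0), 1.0)
assert all(b <= a + 1e-9 for a, b in zip(energies, energies[1:]))
```

As the text notes, the entropy term is what makes the constrained minimization solvable in closed form (a softmax), so no inner loop is needed here.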
3.2 Example 2. The next example concerns mean-field methods to model combinatorial optimization problems such as the quadratic assignment problem (Rangarajan, Gold, & Mjolsness, 1996; Rangarajan et al., 1999) or the traveling salesman problem. It uses the same quadratic energy function as example 1 but adds extra linear constraints. These additional constraints prevent us from expressing the update rule analytically and require an inner loop to implement theorem 3.

Example. Mean-field algorithms attempt to minimize discrete energy functions of form $E[V] = \sum_{i,j,a,b} C_{ijab} V_{ia} V_{jb} + \sum_{ia}\theta_{ia} V_{ia}$ with linear constraints $\sum_i V_{ia} = 1, \forall a$ and $\sum_a V_{ia} = 1, \forall i$. This differs from the previous example because we need to add an additional constraint term $\sum_i q_i \{\sum_a S_{ia} - 1\}$ to the effective energy $E_{eff}[S]$ in equation 3.1, where the $\{q_i\}$ are Lagrange multipliers. This constraint term is also added to the convex part of the energy $E_{vex}[S]$, and we apply CCCP. Unlike the previous example, it is no longer possible to express $S^{t+1}$ as an analytic function of $S^{t}$. Instead, we resort to theorem 3. Solving for $S^{t+1}$ is equivalent to minimizing the convex cost function:

$$E^{t+1}[S^{t+1}; p, q] = T\sum_{ia} S^{t+1}_{ia}\log S^{t+1}_{ia} + \sum_a p_a\left\{\sum_i S^{t+1}_{ia} - 1\right\} + \sum_i q_i\left\{\sum_a S^{t+1}_{ia} - 1\right\} + \sum_{ia} S^{t+1}_{ia}\,\frac{\partial E_{cave}}{\partial S_{ia}}(S^{t}). \quad (3.3)$$
It can be shown that minimizing $E^{t+1}[S^{t+1}; p, q]$ can also be done by CCCP (see section 5.3). Therefore, each step of CCCP for this example requires an inner loop, which can itself be solved by a CCCP algorithm.

3.3 Example 3. Our next example is self-annealing (Rangarajan, 2000). This algorithm can be applied to the effective energies of examples 1 and 2 provided we remove the "entropy term" $\sum_{ia} S_{ia}\log S_{ia}$. Hence, self-annealing can be applied to the same combinatorial optimization and clustering problems. It can also be applied to the relaxation labeling problems studied in computer vision, and indeed the classic relaxation algorithm (Rosenfeld, Hummel, & Zucker, 1976) can be obtained as an approximation to self-annealing by performing a Taylor series approximation (Rangarajan, 2000). It also relates to linear prediction (Kivinen & Warmuth, 1997). Self-annealing acts as if it has a temperature parameter that it continuously decreases or, equivalently, as if it has a barrier function whose strength is reduced automatically as the algorithm proceeds (Rangarajan, 2000). This relates to Bregman distances (Bregman, 1967), and, indeed, the original derivation of self-annealing involved adding a Bregman distance to the energy function, followed by taking Legendre transforms (see section 4.2). As we now show, however, self-annealing can be derived directly from CCCP.

Example. In this example of self-annealing for quadratic energy functions, we use the effective energy of example 1 (see equation 3.1) but remove the entropy term $T\sum_{ia} S_{ia}\log S_{ia}$. We first apply both sets of linear constraints on the $\{S_{ia}\}$ (as in example 2). Next we apply only one set of constraints (as in example 1). Decompose the energy function into convex and concave parts by adding and subtracting a term $\gamma\sum_{i,a} S_{ia}\log S_{ia}$, where $\gamma$ is a constant. This yields

$$E_{vex}[S] = \gamma\sum_{ia} S_{ia}\log S_{ia} + \sum_a p_a\left(\sum_i S_{ia} - 1\right) + \sum_i q_i\left(\sum_a S_{ia} - 1\right),$$
$$E_{cave}[S] = \frac{1}{2}\sum_{i,j,a,b} C_{ijab} S_{ia} S_{jb} + \sum_{i,a}\theta_{ia} S_{ia} - \gamma\sum_{ia} S_{ia}\log S_{ia}. \quad (3.4)$$
Applying CCCP gives the self-annealing update equations:

$$S^{t+1}_{ia} = S^{t}_{ia}\, e^{(1/\gamma)\{-\sum_{jb} C_{ijab} S^{t}_{jb} - \theta_{ia} - p_a - q_i\}}, \quad (3.5)$$
where an inner loop is required to solve for the $\{p_a\}, \{q_i\}$ to ensure that the constraints on $\{S^{t+1}_{ia}\}$ are satisfied. This inner loop is a small modification of the one required for example 2 (see section 5.3). Removing the constraints $\sum_i q_i\left(\sum_a S_{ia} - 1\right)$ gives us an update rule (compare example 1),

$$S^{t+1}_{ia} = \frac{S^{t}_{ia}\, e^{(-1/\gamma)\{\sum_{jb} C_{ijab} S^{t}_{jb} + \theta_{ia}\}}}{\sum_c S^{t}_{ca}\, e^{(-1/\gamma)\{\sum_{jb} C_{cjab} S^{t}_{jb} + \theta_{ca}\}}}, \quad (3.6)$$
which, by expanding the exponential by a Taylor series, gives the equations for relaxation labeling (Rosenfeld et al., 1976; see Rangarajan, 2000, for details).

3.4 Example 4. Our final example is the elastic net (Durbin & Willshaw, 1987; Durbin, Szeliski, & Yuille, 1989) in the formulation presented in Yuille (1990). This is an example of constrained mixture models (Jordan & Jacobs, 1994) and uses an EM algorithm (Dempster et al., 1977).

Example. The elastic net (Durbin & Willshaw, 1987) attempts to solve the traveling salesman problem (TSP) by finding the shortest tour through a set of cities at positions $\{\vec{x}_i\}$. The net is represented by a set of nodes at positions $\{\vec{y}_a\}$, and the algorithm performs steepest descent on a cost function $E[\vec{y}]$.
This corresponds to a probability distribution $P(\vec{y}) = e^{-E[\vec{y}]}/Z$ on the node positions, which can be interpreted (Durbin et al., 1989) as a constrained mixture model (Jordan & Jacobs, 1994). The elastic net can be reformulated (Yuille, 1990) as minimizing an effective energy $E_{eff}[S, \vec{y}]$, where the variables $\{S_{ia}\}$ determine soft correspondence between the cities and the nodes of the net. Minimizing $E_{eff}[S, \vec{y}]$ with respect to $S$ and $\vec{y}$ alternatively can be reformulated as a CCCP algorithm. Moreover, this alternating algorithm can also be reexpressed as an EM algorithm for performing maximum a posteriori estimation of the node variables $\{\vec{y}_a\}$ from $P(\vec{y})$ (see section 4.1).

The elastic net can be formulated as minimizing an effective energy (Yuille, 1990):

$$E_{eff}[S, \vec{y}] = \sum_{ia} S_{ia}(\vec{x}_i - \vec{y}_a)^2 + \gamma\sum_{a,b}\vec{y}_a^{\,T} A_{ab}\,\vec{y}_b + T\sum_{i,a} S_{ia}\log S_{ia} + \sum_i \lambda_i\left(\sum_a S_{ia} - 1\right), \quad (3.7)$$
where the $\{A_{ab}\}$ are components of a positive definite matrix representing a spring energy and the $\{\lambda_i\}$ are Lagrange multipliers that impose the constraints $\sum_a S_{ia} = 1, \forall i$. By setting $E[\vec{y}] = E_{eff}[S^{*}(\vec{y}), \vec{y}]$, where $S^{*}(\vec{y}) = \arg\min_S E_{eff}[S, \vec{y}]$, we obtain the original elastic net cost function $E[\vec{y}] = -T\sum_i\log\sum_a e^{-|\vec{x}_i - \vec{y}_a|^2/T} + \gamma\sum_{a,b}\vec{y}_a^{\,T} A_{ab}\,\vec{y}_b$ (Durbin & Willshaw, 1987). $P[\vec{y}] = e^{-E[\vec{y}]}/Z$ can be interpreted (Durbin et al., 1989) as a constrained mixture model (Jordan & Jacobs, 1994).

The effective energy $E_{eff}[S, \vec{y}]$ can be decreased by minimizing it with respect to $\{S_{ia}\}$ and $\{\vec{y}_a\}$ alternatively. This gives update rules,

$$S^{t+1}_{ia} = \frac{e^{-|\vec{x}_i - \vec{y}^{\,t}_a|^2/T}}{\sum_b e^{-|\vec{x}_i - \vec{y}^{\,t}_b|^2/T}}, \quad (3.8)$$
$$\sum_i S^{t+1}_{ia}(\vec{y}^{\,t+1}_a - \vec{x}_i) + \gamma\sum_b A_{ab}\,\vec{y}^{\,t+1}_b = 0, \quad \forall a, \quad (3.9)$$
where the $\{\vec{y}^{\,t+1}_a\}$ can be computed from the $\{S^{t+1}_{ia}\}$ by solving the linear equations.

To interpret equations 3.8 and 3.9 as CCCP, we define a new energy function $E[S] = E_{eff}[S, \vec{y}^{\,*}(S)]$, where $\vec{y}^{\,*}(S) = \arg\min_{\vec{y}} E_{eff}[S, \vec{y}]$ (which can be obtained by solving the linear equation 3.9 for $\{\vec{y}_a\}$). We decompose $E[S] = E_{vex}[S] + E_{cave}[S]$, where

$$E_{vex}[S] = T\sum_{ia} S_{ia}\log S_{ia} + \sum_i\lambda_i\left(\sum_a S_{ia} - 1\right),$$
$$E_{cave}[S] = \sum_{ia} S_{ia}\,|\vec{x}_i - \vec{y}^{\,*}_a(S)|^2 + \gamma\sum_{ab}\vec{y}^{\,*}_a(S)\cdot\vec{y}^{\,*}_b(S)\,A_{ab}. \quad (3.10)$$
It is clear that $E_{vex}[S]$ is a convex function of $S$. It can be verified algebraically that $E_{cave}[S]$ is a concave function of $S$ and that its first derivative is $\frac{\partial}{\partial S_{ia}} E_{cave}[S] = |\vec{x}_i - \vec{y}^{\,*}_a(S)|^2$ (using the definition of $\vec{y}^{\,*}(S)$ to remove additional terms). Applying CCCP to $E[S] = E_{vex}[S] + E_{cave}[S]$ gives the update rule

$$S^{t+1}_{ia} = \frac{e^{-|\vec{x}_i - \vec{y}^{\,*}_a(S^{t})|^2/T}}{\sum_b e^{-|\vec{x}_i - \vec{y}^{\,*}_b(S^{t})|^2/T}}, \qquad \vec{y}^{\,*}(S^{t}) = \arg\min_{\vec{y}} E_{eff}[S^{t}, \vec{y}], \quad (3.11)$$

which is equivalent to the alternating algorithm described above (see equations 3.8 and 3.9). More understanding of this particular CCCP algorithm is given in the next section, where we show that it is a special case of a general result for EM algorithms.

4 EM, Legendre Transforms, and Variational Bounding

This section proves that two standard algorithms can be expressed in terms of CCCP: all EM algorithms and a class of algorithms using Legendre transforms. In addition, we show that CCCP can be obtained as a special case of variational bounding and of equivalent methods known as lower-bound maximization and surrogate functions.

4.1 The EM Algorithm and CCCP. The EM algorithm (Dempster et al., 1977) seeks to estimate a variable $\vec{y}^{\,*} = \arg\max_{\vec{y}}\log\sum_{\{V\}} P(\vec{y}, V)$, where $\{\vec{y}\}, \{V\}$ are variables that depend on the specific problem formulation (we will soon illustrate them for the elastic net). The distribution $P(\vec{y}, V)$ is usually conditioned on data, which, for simplicity, we will not make explicit. Hathaway (1986) and Neal and Hinton (1998) showed that EM is equivalent to minimizing the following effective energy with respect to the variables $\vec{y}$ and $\hat{P}(V)$,
$$E_{em}[\vec{y}, \hat{P}] = -\sum_V \hat{P}(V)\log P(\vec{y}, V) + \sum_V \hat{P}(V)\log\hat{P}(V) + \lambda\left\{\sum_V \hat{P}(V) - 1\right\}, \quad (4.1)$$
where $\lambda$ is a Lagrange multiplier.

The EM algorithm proceeds by minimizing $E_{em}[\vec{y}, \hat{P}]$ with respect to $\hat{P}(V)$ and $\vec{y}$ alternatively:

$$\hat{P}^{t+1}(V) = \frac{P(\vec{y}^{\,t}, V)}{\sum_{\hat{V}} P(\vec{y}^{\,t}, \hat{V})}, \qquad \vec{y}^{\,t+1} = \arg\min_{\vec{y}}\, -\sum_V \hat{P}^{t+1}(V)\log P(\vec{y}, V). \quad (4.2)$$
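The coupled updates of equation 4.2 are easy to exercise on a toy model. The sketch below is our own illustration, with hypothetical data: a two-component one-dimensional Gaussian mixture with fixed, equal mixing weights and unit variance, so that only the means play the role of $\vec{y}$. We record $E_{em}$ after each half-step; each half-step exactly minimizes $E_{em}$ in one block of variables, so the recorded values never increase.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data drawn from two clusters; we estimate the two means ("y")
x = np.concatenate([rng.normal(-2.0, 1.0, 40), rng.normal(3.0, 1.0, 60)])
mu = np.array([-1.0, 1.0])
pi_ = np.array([0.5, 0.5])              # mixing weights held fixed (assumption)

def log_p(mu):
    # log P(y, V) per data point n and component k, unit variance
    return np.log(pi_) - 0.5 * np.log(2 * np.pi) - 0.5 * (x[:, None] - mu) ** 2

def e_em(mu, R):
    # E_em = sum_V P_hat(V) log P_hat(V) - sum_V P_hat(V) log P(y, V)
    return float(np.sum(R * (np.log(R) - log_p(mu))))

vals = []
for _ in range(20):
    lp = log_p(mu)
    R = np.exp(lp - lp.max(axis=1, keepdims=True))
    R /= R.sum(axis=1, keepdims=True)                   # E-step: P_hat ~ P(y^t, V)
    vals.append(e_em(mu, R))
    mu = (R * x[:, None]).sum(axis=0) / R.sum(axis=0)   # M-step for the means
    vals.append(e_em(mu, R))

assert all(b <= a + 1e-9 for a, b in zip(vals, vals[1:]))
```

At convergence, $-E_{em}$ is (up to a constant) the log likelihood, which is why these updates implement maximum likelihood estimation.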
These update rules are guaranteed to lower $E_{em}[\vec{y}, \hat{P}]$ and give convergence to a saddle point or a local minimum (Dempster et al., 1977; Hathaway, 1986; Neal & Hinton, 1998).

For example, this formulation of the EM algorithm enables us to rederive the effective energy for the elastic net and show that the alternating algorithm is EM. We let the $\{\vec{y}_a\}$ correspond to the positions of the nodes of the net and the $\{V_{ia}\}$ be binary variables indicating the correspondences between cities and nodes (related to the $\{S_{ia}\}$ in example 4). $P(\vec{y}, V) = e^{-E[\vec{y}, V]/T}/Z$, where $E[\vec{y}, V] = \sum_{ia} V_{ia}|\vec{x}_i - \vec{y}_a|^2 + \gamma\sum_{ab}\vec{y}_a\cdot\vec{y}_b A_{ab}$, with the constraint that $\sum_a V_{ia} = 1, \forall i$. We define $S_{ia} = \hat{P}(V_{ia} = 1), \forall i, a$. Then $E_{em}[\vec{y}, S]$ is equal to the effective energy $E_{eff}[\vec{y}, S]$ in the elastic net example (see equation 3.7). The update rules for EM (see equation 4.2) are equivalent to the alternating algorithm to minimize the effective energy (see equations 3.8 and 3.9).

We now show that all EM algorithms are CCCP. This requires two intermediate results, which we state as lemmas.

Lemma 1. Minimizing $E_{em}[\vec{y}, \hat{P}]$ is equivalent to minimizing the function $E[\hat{P}] = E_{vex}[\hat{P}] + E_{cave}[\hat{P}]$, where $E_{vex}[\hat{P}] = \sum_V \hat{P}(V)\log\hat{P}(V) + \lambda\{\sum_V \hat{P}(V) - 1\}$ is a convex function and $E_{cave}[\hat{P}] = -\sum_V \hat{P}(V)\log P(\vec{y}^{\,*}(\hat{P}), V)$ is a concave function, where we define $\vec{y}^{\,*}(\hat{P}) = \arg\min_{\vec{y}} -\sum_V \hat{P}(V)\log P(\vec{y}, V)$.

Proof. Set $E[\hat{P}] = E_{em}[\vec{y}^{\,*}(\hat{P}), \hat{P}]$, where $\vec{y}^{\,*}(\hat{P}) = \arg\min_{\vec{y}} -\sum_V \hat{P}(V)\log P(\vec{y}, V)$. It is straightforward to decompose $E[\hat{P}]$ as $E_{vex}[\hat{P}] + E_{cave}[\hat{P}]$ and verify that $E_{vex}[\hat{P}]$ is a convex function. To determine that $E_{cave}[\hat{P}]$ is concave requires showing that its Hessian is negative semidefinite. This is performed in lemma 2.

Lemma 2. $E_{cave}[\hat{P}]$ is a concave function and $\frac{\partial E_{cave}}{\partial\hat{P}(V)} = -\log P(\vec{y}^{\,*}(\hat{P}), V)$, where $\vec{y}^{\,*}(\hat{P}) = \arg\min_{\vec{y}} -\sum_V \hat{P}(V)\log P(\vec{y}, V)$.
Proof. We first derive consequences of the definition of $\hat y^*$, which will be required when computing the Hessian of $E_{cave}$. The definition implies:

\sum_V \hat P(V) \frac{\partial}{\partial \hat y_\mu} \log P(\hat y^*, V) = 0, \quad \forall \mu,   (4.3)

\sum_V \sum_\nu \frac{\partial \hat y^*_\nu}{\partial \hat P(\tilde V)} \hat P(V) \frac{\partial^2}{\partial \hat y_\mu \partial \hat y_\nu} \log P(\hat y^*, V) + \frac{\partial}{\partial \hat y_\mu} \log P(\hat y^*, \tilde V) = 0, \quad \forall \mu,   (4.4)

where the first equation is an identity that is valid for all $\hat P$ and the second equation follows by differentiating the first with respect to $\hat P(\tilde V)$. Moreover, since $\hat y^*$ is a minimum of $-\sum_V \hat P(V) \log P(\hat y, V)$, we also know that the
926
A. Yuille and A. Rangarajan
matrix $\sum_V \hat P(V)\, \partial^2 \log P(\hat y^*, V)/\partial \hat y_\mu \partial \hat y_\nu$ is negative definite. (We use the convention that $\partial \log P(\hat y^*, V)/\partial \hat y_\mu$ denotes the derivative of the function $\log P(\hat y, V)$ with respect to $\hat y_\mu$ evaluated at $\hat y = \hat y^*$.)
We now calculate the derivatives of $E_{cave}[\hat P]$ with respect to $\hat P$. We obtain:

\frac{\partial E_{cave}}{\partial \hat P(\tilde V)} = -\log P(\hat y^*(\hat P), \tilde V) - \sum_V \sum_\mu \hat P(V) \frac{\partial \hat y^*_\mu}{\partial \hat P(\tilde V)} \frac{\partial}{\partial \hat y_\mu} \log P(\hat y^*, V) = -\log P(\hat y^*, \tilde V),   (4.5)

where we have used the definition of $\hat y^*$ (see equation 4.3) to eliminate the second term on the right-hand side. This proves the first statement of the lemma. To prove the concavity of $E_{cave}$, we compute its Hessian:

\frac{\partial^2 E_{cave}}{\partial \hat P(V) \partial \hat P(\tilde V)} = -\sum_\nu \frac{\partial \hat y^*_\nu}{\partial \hat P(\tilde V)} \frac{\partial}{\partial \hat y_\nu} \log P(\hat y^*, V).   (4.6)

By using the definition of $\hat y^*(\hat P)$ (see equation 4.4), we can reexpress the Hessian as:

\frac{\partial^2 E_{cave}}{\partial \hat P(V) \partial \hat P(\tilde V)} = \sum_{\bar V, \mu, \nu} \frac{\partial \hat y^*_\nu}{\partial \hat P(\tilde V)} \frac{\partial \hat y^*_\mu}{\partial \hat P(V)} \hat P(\bar V) \frac{\partial^2 \log P(\hat y^*, \bar V)}{\partial \hat y_\mu \partial \hat y_\nu}.   (4.7)
It follows that $E_{cave}$ has a negative semidefinite Hessian, and hence $E_{cave}$ is concave, recalling that $\sum_V \hat P(V)\, \partial^2 \log P(\hat y^*, V)/\partial \hat y_\mu \partial \hat y_\nu$ is negative definite.

Theorem 4. The EM algorithm for $P(\hat y, V)$ can be expressed as a CCCP algorithm in $\hat P(V)$ with $E_{vex}[\hat P] = \sum_V \hat P(V) \log \hat P(V) + \lambda\{\sum_V \hat P(V) - 1\}$ and $E_{cave}[\hat P] = -\sum_V \hat P(V) \log P(\hat y^*(\hat P), V)$, where $\hat y^*(\hat P) = \arg\min_{\hat y} -\sum_V \hat P(V) \log P(\hat y, V)$. After convergence to $\hat P^*(V)$, the solution is calculated to be $\hat y^{**} = \arg\min_{\hat y} -\sum_V \hat P^*(V) \log P(\hat y, V)$.

Proof. The update rule for $\hat P$ determined by CCCP is precisely that specified by the EM algorithm. Therefore, we can run the CCCP algorithm until it has converged to $\hat P^*(\cdot)$ and then calculate the solution $\hat y^{**} = \arg\min_{\hat y} -\sum_V \hat P^*(V) \log P(\hat y, V)$.

Finally, we observe that Geman's technical report (1984) gives an alternative way of relating EM to CCCP for a special class of probability distributions. He assumes that $P(\hat y, V)$ is of the form $e^{\hat y \cdot \vec\phi(V)}/Z$ for some functions
The Concave-Convex Procedure
927
$\vec\phi(\cdot)$. He then proves convergence of the EM algorithm to estimate $\hat y$ by exploiting his version of theorem 2. Interestingly, he works with convex and concave functions of $\hat y$, while our results are expressed in terms of convex and concave functions of $\hat P$.

4.2 Legendre Transformations. The Legendre transform can be used to reformulate optimization problems by introducing auxiliary variables (Mjolsness & Garrett, 1990). The idea is that some of the formulations may be more effective and computationally cheaper than others. We concentrate on Legendre minimization (Rangarajan et al., 1996, 1999) instead of the Legendre min-max approach emphasized in Mjolsness and Garrett (1990). In the latter, the introduction of auxiliary variables converts the problem to a min-max problem where the goal is to find a saddle point. By contrast, in Legendre minimization (Rangarajan et al., 1996), the problem remains a minimization one (and so it becomes easier to analyze convergence).

In theorem 5, we show that Legendre minimization algorithms are equivalent to CCCP provided we first decompose the energy into a convex plus a concave part. The CCCP viewpoint emphasizes the geometry of the approach and complements the algebraic manipulations given in Rangarajan et al. (1999). (Moreover, the results of this article show the generality of CCCP, while, by contrast, Legendre transform methods have been applied only on a case-by-case basis.)

Definition 1. Let $F(\vec x)$ be a convex function. For each value $\vec y$, let $F^*(\vec y) = \min_{\vec x}\{F(\vec x) + \vec y \cdot \vec x\}$. Then $F^*(\vec y)$ is concave and is the Legendre transform of $F(\vec x)$.

Two properties can be derived from this definition (Strang, 1986):
Property 1. $F(\vec x) = \max_{\vec y}\{F^*(\vec y) - \vec y \cdot \vec x\}$.

Property 2. $F(\cdot)$ and $F^*(\cdot)$ are related by $\frac{\partial F^*}{\partial \vec y}(\vec y) = \{\frac{\partial F}{\partial \vec x}\}^{-1}(-\vec y)$ and $-\frac{\partial F}{\partial \vec x}(\vec x) = \{\frac{\partial F^*}{\partial \vec y}\}^{-1}(\vec x)$. (By $\{\frac{\partial F^*}{\partial \vec y}\}^{-1}(\vec x)$ we mean the value $\vec y$ such that $\frac{\partial F^*}{\partial \vec y}(\vec y) = \vec x$.)
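Definition 1 and the two properties can be checked numerically for a simple convex function. For $F(x) = x^2$, definition 1 gives $F^*(y) = \min_x\{x^2 + yx\} = -y^2/4$, attained at $x = -y/2$, and both properties reduce to identities involving $y = -2x$. The sketch below is our own illustration (all function names are hypothetical), verifying the transform on a grid.

```python
# For F(x) = x^2, definition 1 gives F*(y) = min_x {x^2 + y*x} = -y^2/4.
# Check definition 1 and properties 1 and 2 numerically.

def F(x):
    return x * x

def Fstar(y):
    # closed form of min_x {F(x) + y*x} for F(x) = x^2
    return -y * y / 4.0

GRID = [i / 100.0 for i in range(-500, 501)]

def check_definition(y):
    # brute-force min_x {F(x) + y*x} over the grid should match -y^2/4
    numeric = min(F(x) + y * x for x in GRID)
    return abs(numeric - Fstar(y)) < 1e-3

def check_property1(x):
    # F(x) = max_y {F*(y) - y*x}; the maximum sits at y = -2x
    numeric = max(Fstar(y) - y * x for y in GRID)
    return abs(numeric - F(x)) < 1e-3

def check_property2(y, eps=1e-6):
    # dF*/dy at y, numerically, vs {dF/dx}^{-1}(-y): dF/dx = 2x, so the x
    # with 2x = -y is x = -y/2, and both sides should equal -y/2.
    dFstar = (Fstar(y + eps) - Fstar(y - eps)) / (2 * eps)
    return abs(dFstar - (-y / 2.0)) < 1e-6
```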
The Legendre minimization algorithms (Rangarajan et al., 1996, 1999) exploit Legendre transforms. The optimization function $E_1(\vec x)$ is expressed as $E_1(\vec x) = f(\vec x) + g(\vec x)$, where $g(\vec x)$ is required to be a concave function. This is equivalent to minimizing $E_2(\vec x, \vec y) = f(\vec x) + \vec x \cdot \vec y + \hat g(\vec y)$, where $\hat g(\cdot)$ is the inverse Legendre transform of $g(\cdot)$. Legendre minimization consists of minimizing $E_2(\vec x, \vec y)$ with respect to $\vec x$ and $\vec y$ alternately.

Theorem 5. Let $E_1(\vec x) = f(\vec x) + g(\vec x)$ and $E_2(\vec x, \vec y) = f(\vec x) + \vec x \cdot \vec y + h(\vec y)$, where $f(\cdot)$ and $h(\cdot)$ are convex functions and $g(\cdot)$ is concave. Then applying CCCP to $E_1(\vec x)$
928
A. Yuille and A. Rangarajan
is equivalent to minimizing $E_2(\vec x, \vec y)$ with respect to $\vec x$ and $\vec y$ alternately, where $g(\cdot)$ is the Legendre transform of $h(\cdot)$. This is equivalent to Legendre minimization.

Proof. We can write $E_1(\vec x) = f(\vec x) + \min_{\vec y}\{g^*(\vec y) + \vec x \cdot \vec y\}$, where $g^*(\cdot)$ is the Legendre transform of $g(\cdot)$ (identify $g(\cdot)$ with $F^*(\cdot)$ and $g^*(\cdot)$ with $F(\cdot)$ in definition 1 and property 1). Thus, minimizing $E_1(\vec x)$ with respect to $\vec x$ is equivalent to minimizing $\hat E_1(\vec x, \vec y) = f(\vec x) + \vec x \cdot \vec y + g^*(\vec y)$ with respect to $\vec x$ and $\vec y$. (Alternatively, we can set $g^*(\vec y) = h(\vec y)$ in the expression for $E_2(\vec x, \vec y)$ and obtain a cost function $\hat E_2(\vec x) = f(\vec x) + g(\vec x)$.) Alternating minimization over $\vec x$ and $\vec y$ gives (i) $\partial f/\partial \vec x = -\vec y$ to determine $\vec x^{t+1}$ in terms of $\vec y^t$, and (ii) $\partial g^*/\partial \vec y = -\vec x$ to determine $\vec y^t$ in terms of $\vec x^t$, which, by property 2 of the Legendre transform, is equivalent to setting $\vec y^t = \partial g/\partial \vec x(\vec x^t)$. Combining the two stages gives CCCP:

\frac{\partial f}{\partial \vec x}(\vec x^{t+1}) = -\frac{\partial g}{\partial \vec x}(\vec x^t).

4.3 Variational Bounding. In variational bounding, the original objective function to be minimized is replaced by a new objective function that satisfies the following requirements (Rustagi, 1976; Jordan et al., 1999). Equivalent techniques are known as surrogate functions and majorization (Lange et al., 2000) or as lower-bound maximization (Luttrell, 1994). These techniques are more general than CCCP, and it has been shown that algorithms like EM can be derived from them (Minka, 1998; Lange et al., 2000). (This, of course, does not by itself imply that EM can be derived from CCCP.)

Let $E(\vec x)$, $\vec x \in R^D$, be the original objective function that we seek to minimize. Assume that we are at a point $\vec x^{(n)}$ corresponding to the $n$th iteration. If we have a function $E_{bound}(\vec x)$ that satisfies the following properties (see Figure 3),

E(\vec x^{(n)}) = E_{bound}(\vec x^{(n)}),   (4.8)

E(\vec x) \le E_{bound}(\vec x),   (4.9)

then the next iterate $\vec x^{(n+1)}$ is chosen such that $E_{bound}(\vec x^{(n+1)}) \le E(\vec x^{(n)})$, which implies

E(\vec x^{(n+1)}) \le E(\vec x^{(n)}).   (4.10)

Consequently, we can minimize $E_{bound}(\vec x)$ instead of $E(\vec x)$ after ensuring that $E(\vec x^{(n)}) = E_{bound}(\vec x^{(n)})$. We now show that CCCP is equivalent to a class of variational bounding provided we first decompose the objective function $E(\vec x)$ into a convex and a concave part and then bound the concave part by its tangent plane (see Figure 3).
Figure 3: (Left) Variational bounding bounds a function $E(\vec x)$ by a function $E_{bound}(\vec x)$ such that $E(\vec x^{(n)}) = E_{bound}(\vec x^{(n)})$. (Right) We decompose $E(\vec x)$ into convex and concave parts, $E_{vex}(\vec x)$ and $E_{cave}(\vec x)$, and bound $E_{cave}(\vec x)$ by its tangent plane at $\vec x^*$. We set $E_{bound}(\vec x)$ equal to $E_{vex}(\vec x)$ plus this tangent plane.
Theorem 6. Any CCCP algorithm to extremize $E(\vec x)$ can be expressed as variational bounding by first decomposing $E(\vec x)$ into a convex function $E_{vex}(\vec x)$ plus a concave function $E_{cave}(\vec x)$ and then, at each iteration starting at $\vec x^t$, setting

E^t_{bound}(\vec x) = E_{vex}(\vec x) + E_{cave}(\vec x^t) + (\vec x - \vec x^t) \cdot \frac{\partial E_{cave}}{\partial \vec x}(\vec x^t).

Proof. Since $E_{cave}(\vec x)$ is concave, we have $E_{cave}(\vec x^t) + (\vec x - \vec x^t) \cdot \partial E_{cave}(\vec x^t)/\partial \vec x \ge E_{cave}(\vec x)$ for all $\vec x$. Therefore, $E^t_{bound}(\vec x)$ satisfies equations 4.8 and 4.9 for variational bounding. Minimizing $E^t_{bound}(\vec x)$ with respect to $\vec x$ gives $\partial E_{vex}(\vec x^{t+1})/\partial \vec x = -\partial E_{cave}(\vec x^t)/\partial \vec x$, which is the CCCP update rule.
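Theorem 6 can be illustrated on a one-dimensional example of our own devising. Take $E_{vex}(x) = x^2$ and $E_{cave}(x) = -2\sqrt{1+x^2}$ (concave, since $\sqrt{1+x^2}$ is convex); the CCCP update $2x^{t+1} = 2x^t/\sqrt{1+(x^t)^2}$ then has a closed form. The sketch below (all names hypothetical) checks that the tangent-plane bound satisfies equations 4.8 and 4.9 and that the iterates monotonically decrease $E$ toward its minimum at $x = 0$.

```python
import math

# Toy energy E(x) = Evex(x) + Ecave(x) with
#   Evex(x)  = x^2                (convex)
#   Ecave(x) = -2*sqrt(1 + x^2)   (concave)
# The CCCP update solves Evex'(x_{t+1}) = -Ecave'(x_t).

def evex(x):
    return x * x

def ecave(x):
    return -2.0 * math.sqrt(1.0 + x * x)

def decave(x):
    # derivative of Ecave
    return -2.0 * x / math.sqrt(1.0 + x * x)

def energy(x):
    return evex(x) + ecave(x)

def ebound(x, xt):
    # Evex(x) plus the tangent plane of Ecave at xt (equations 4.8, 4.9)
    return evex(x) + ecave(xt) + (x - xt) * decave(xt)

def cccp_step(xt):
    # closed-form solution of 2*x = -Ecave'(xt)
    return xt / math.sqrt(1.0 + xt * xt)

def run_cccp(x0, iters=200):
    xs = [x0]
    for _ in range(iters):
        xs.append(cccp_step(xs[-1]))
    return xs
```

The bound touches the energy at the expansion point and lies above it elsewhere, so each minimization of the bound cannot increase $E$, exactly as in equation 4.10.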
Note that the formulation of CCCP given by theorem 3, in terms of a sequence of convex update energy functions, is already in variational bounding form.

5 CCCP by Changes in Variables

This section gives examples of algorithms that are not CCCP in their original variables but can be transformed into CCCP by a change of coordinates. We first show that both generalized iterative scaling (GIS; Darroch & Ratcliff, 1972) and Sinkhorn's algorithm (Sinkhorn, 1964) can be formulated as CCCP. Then we obtain CCCP generalizations of Sinkhorn's algorithm, which can minimize many of the inner-loop convex update energy functions defined in theorem 3.
5.1 Generalized Iterative Scaling. This section shows that the GIS algorithm (Darroch & Ratcliff, 1972) for estimating the parameters of probability distributions can also be expressed as CCCP. This gives a simple convergence proof for the algorithm.

The parameter estimation problem is to determine the parameters $\vec\lambda$ of a distribution $P(\vec x; \vec\lambda) = e^{\vec\lambda \cdot \vec\phi(\vec x)}/Z[\vec\lambda]$ so that $\sum_{\vec x} P(\vec x; \vec\lambda) \vec\phi(\vec x) = \vec h$, where $\vec h$ are observation data (with components indexed by $\mu$). This can be expressed as finding the minimum of the convex energy function $\log Z[\vec\lambda] - \vec h \cdot \vec\lambda$, where $Z[\vec\lambda] = \sum_{\vec x} e^{\vec\lambda \cdot \vec\phi(\vec x)}$ is the partition function. All problems of this type can be converted to a standard form where $\phi_\mu(\vec x) \ge 0$, $\forall \mu, \vec x$; $h_\mu \ge 0$, $\forall \mu$; $\sum_\mu \phi_\mu(\vec x) = 1$, $\forall \vec x$; and $\sum_\mu h_\mu = 1$ (Darroch & Ratcliff, 1972). From now on, we assume this form.

The GIS algorithm is given by

\lambda^{t+1}_\mu = \lambda^t_\mu - \log h^t_\mu + \log h_\mu, \quad \forall \mu,   (5.1)

where $h^t_\mu = \sum_{\vec x} P(\vec x; \vec\lambda^t) \phi_\mu(\vec x)$. It is guaranteed to converge to the (unique) minimum of the energy function $\log Z[\vec\lambda] - \vec h \cdot \vec\lambda$ and hence gives a solution to $\sum_{\vec x} P(\vec x; \vec\lambda) \vec\phi(\vec x) = \vec h$ (Darroch & Ratcliff, 1972).
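The GIS update, equation 5.1, can be exercised on a tiny example. The sketch below is our own construction, not from Darroch and Ratcliff; the feature table PHI, the chosen "true" parameters, and all function names are hypothetical. It uses four states and three features already in the standard form (rows of PHI are nonnegative and sum to 1), with the target $\vec h$ generated from a known parameter vector so that a solution is guaranteed to exist.

```python
import math

# Feature vectors phi[x] for states x = 0..3: nonnegative, each row sums
# to 1 -- the standard form assumed for GIS.
PHI = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
    [0.4, 0.4, 0.2],
]

def model_expectations(lam):
    # h^t_mu = sum_x P(x; lambda) phi_mu(x), with P(x; lambda) prop. to
    # exp(lambda . phi(x))
    ws = [math.exp(sum(l * f for l, f in zip(lam, phi))) for phi in PHI]
    z = sum(ws)
    probs = [w / z for w in ws]
    return [sum(p * phi[mu] for p, phi in zip(probs, PHI))
            for mu in range(3)]

def gis(h, lam, iters=5000):
    # update rule 5.1: lambda_mu <- lambda_mu - log h^t_mu + log h_mu
    for _ in range(iters):
        ht = model_expectations(lam)
        lam = [l - math.log(htm) + math.log(hm)
               for l, htm, hm in zip(lam, ht, h)]
    return lam
```

Starting from $\vec\lambda = 0$, the model expectations converge to the target $\vec h$, as the convergence guarantee above asserts.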
We now show that GIS can be reformulated as CCCP, which gives a simple convergence proof of the algorithm.

Theorem 7. We can express GIS as a CCCP algorithm in the variables $\{r_\mu = e^{\lambda_\mu}\}$ by decomposing the cost function $E[\vec r]$ into a convex term $-\sum_\mu h_\mu \log r_\mu$ and a concave term $\log Z[\{\log r_\mu\}]$.
Proof. Formulate the problem as finding the $\vec\lambda$ that minimizes $\log Z[\vec\lambda] - \vec h \cdot \vec\lambda$. This is equivalent to minimizing the cost function $E[\vec r] = \log Z[\{\log r_\mu\}] - \sum_\mu h_\mu \log r_\mu$ with respect to $\{r_\mu\}$, where $r_\mu = e^{\lambda_\mu}$, $\forall \mu$. Define $E_{vex}[\vec r] = -\sum_\mu h_\mu \log r_\mu$ and $E_{cave}[\vec r] = \log Z[\{\log r_\mu\}]$. It is straightforward to verify that $E_{vex}[\vec r]$ is a convex function (recall that $h_\mu \ge 0$, $\forall \mu$). To show that $E_{cave}$ is concave, we compute its Hessian:

\frac{\partial^2 E_{cave}}{\partial r_\mu \partial r_\nu} = -\frac{\delta_{\mu\nu}}{r_\nu^2} \sum_{\vec x} P(\vec x; \vec r) \phi_\nu(\vec x) + \frac{1}{r_\mu r_\nu} \sum_{\vec x} P(\vec x; \vec r) \phi_\nu(\vec x) \phi_\mu(\vec x) - \frac{1}{r_\mu r_\nu} \Big\{\sum_{\vec x} P(\vec x; \vec r) \phi_\nu(\vec x)\Big\} \Big\{\sum_{\vec x} P(\vec x; \vec r) \phi_\mu(\vec x)\Big\},   (5.2)

where $P(\vec x; \vec r) = e^{\sum_\mu (\log r_\mu) \phi_\mu(\vec x)}/Z[\{\log r_\mu\}]$.
The third term is clearly negative semidefinite. To show that the sum of the first two terms is negative semidefinite requires proving that

\sum_{\vec x} P(\vec x; \vec r) \sum_\nu (\zeta_\nu/r_\nu)^2 \phi_\nu(\vec x) \ge \sum_{\vec x} P(\vec x; \vec r) \sum_{\mu,\nu} (\zeta_\mu/r_\mu)(\zeta_\nu/r_\nu) \phi_\nu(\vec x) \phi_\mu(\vec x)   (5.3)

for any set of $\{\zeta_\mu\}$. This follows by applying the Cauchy-Schwarz inequality to the vectors $\{(\zeta_\nu/r_\nu)\sqrt{\phi_\nu(\vec x)}\}$ and $\{\sqrt{\phi_\nu(\vec x)}\}$, recalling that $\sum_\mu \phi_\mu(\vec x) = 1$, $\forall \vec x$.

We now apply CCCP by setting $\frac{\partial}{\partial r_\nu} E_{vex}[\vec r^{t+1}] = -\frac{\partial}{\partial r_\nu} E_{cave}[\vec r^t]$. We calculate $\frac{\partial E_{vex}}{\partial r_\mu} = -h_\mu/r_\mu$ and $\frac{\partial E_{cave}}{\partial r_\mu} = (1/r_\mu) \sum_{\vec x} P(\vec x; \vec r) \phi_\mu(\vec x)$. This gives

\frac{1}{r^{t+1}_\mu} = \frac{1}{r^t_\mu} \frac{1}{h_\mu} \sum_{\vec x} P(\vec x; \vec r^t) \phi_\mu(\vec x),   (5.4)
which is the GIS algorithm after setting $r_\mu = e^{\lambda_\mu}$, $\forall \mu$.

5.2 Sinkhorn's Algorithm. Sinkhorn's algorithm was designed to make matrices doubly stochastic (Sinkhorn, 1964). We now show that it can be reformulated as CCCP. In the next section, we describe how Sinkhorn's algorithm, and variations of it, can be used to minimize convex energy functions such as those required for the inner loop of CCCP (see theorem 3).

We first introduce Sinkhorn's algorithm. Recall that an $n \times n$ matrix $\Theta$ is a doubly stochastic matrix if all its rows and columns sum to 1. Matrices are strictly positive if all their elements are positive. Then Sinkhorn's theorem states:

Theorem (Sinkhorn, 1964). Given a strictly positive $n \times n$ matrix $M$, there exists a unique doubly stochastic matrix $\Theta = EMD$, where $D$ and $E$ are strictly positive diagonal matrices (unique up to a scaling factor). Moreover, the iterative process of alternately normalizing the rows and columns of $M$ to each sum to 1 converges to $\Theta$.

Theorem 8. Sinkhorn's algorithm is CCCP with a cost function $E[\vec r] = E_{vex}[\vec r] + E_{cave}[\vec r]$, where

E_{vex}[\vec r] = -\sum_a \log r_a, \qquad E_{cave}[\vec r] = \sum_i \log \sum_a r_a M_{ia},   (5.5)

where the $\{r_a\}$ are the diagonal elements of $E$ and the diagonal elements of $D$ are given by $1/\{\sum_a r_a M_{ia}\}$.
Proof. It is straightforward to verify (Kosowsky & Yuille, 1994) that Sinkhorn's algorithm is equivalent to minimizing an energy function $\hat E[\vec r, \vec s] = -\sum_a \log r_a - \sum_i \log s_i + \sum_{ia} M_{ia} r_a s_i$ with respect to $\vec r$ and $\vec s$ alternately, where $\{r_a\}$ and $\{s_i\}$ are the diagonal elements of $E$ and $D$. We calculate $E(\vec r) = \hat E[\vec r, \vec s^*(\vec r)]$, where $\vec s^*(\vec r) = \arg\min_{\vec s} \hat E[\vec r, \vec s]$. It is a direct calculation that $E_{vex}[\vec r]$ is convex. The Hessian of $E_{cave}[\vec r]$ can be calculated to be

\frac{\partial^2 E_{cave}[\vec r]}{\partial r_a \partial r_b} = -\sum_i \frac{M_{ia} M_{ib}}{\{\sum_c r_c M_{ic}\}^2},   (5.6)

which is negative semidefinite. The CCCP algorithm is

\frac{1}{r^{t+1}_a} = \sum_i \frac{M_{ia}}{\sum_c r^t_c M_{ic}},   (5.7)
O s] with respect to r and s, which corresponds to one step of minimizing E[r; and hence is equivalent to Sinkhorn’s algorithm. This algorithm in theorem 8 has similar form to an algorithm proposed by Saul and Lee (2002) which generalizes GIS (Darroch & Ratcliff, 1972) to mixture models. This suggests that Saul and Lee’s algorithm is also CCCP. 5.3 Linear Constraints and Inner Loop. We now derive CCCP algorithms to minimize many of the update energy functions that can occur in the inner loop of CCCP algorithms. These algorithms are derived using similar techniques to those used by Kosowsky and Yuille (1994) to rederive Sinkhorn’s algorithm; hence, they can be considered generalizations of Sinkhorn. The linear assignment problem can also be solved using these methods (Kosowsky & Yuille, 1994). From theorem 3, the update energy functions for the inner loop of CCCP are given by E D E.E xI ¸/
X i
xi log xi C
X i
xi a i C
X ¹
¸¹
Á X i
! ¹ ci xi
¡ ®¹ :
(5.8)
Theorem 9. Update energy functions of form 5.8 can be minimized by a CCCP algorithm for the dual variables $\{\lambda_\mu\}$, provided the linear constraints satisfy the conditions $\alpha_\mu \ge 0$, $\forall \mu$, and $c^\nu_i \ge 0$, $\forall i, \nu$. The algorithm is

\frac{\alpha_\mu}{r^{t+1}_\mu} = e^{-1} \sum_i e^{-a_i} \frac{c^\mu_i}{r^t_\mu} e^{\sum_\nu c^\nu_i \log r^t_\nu}.   (5.9)
Proof. First, by scaling the constraints, we can require that $\sum_\nu c^\nu_i \le 1$, $\forall i$. Then we calculate the (negative) dual energy function to be $\hat E[\vec\lambda] = -E[\vec x^*(\vec\lambda); \vec\lambda]$, where $\vec x^*(\vec\lambda) = \arg\min_{\vec x} E[\vec x; \vec\lambda]$. It is straightforward to calculate

x^*_i(\vec\lambda) = e^{-1 - a_i - \sum_\mu \lambda_\mu c^\mu_i}, \ \forall i; \qquad \hat E[\vec\lambda] = \sum_i e^{-1 - a_i - \sum_\mu \lambda_\mu c^\mu_i} + \sum_\mu \alpha_\mu \lambda_\mu.   (5.10)
To obtain CCCP, we set $\lambda_\mu = -\log r_\mu$, $\forall \mu$, and define

E_{vex}[\vec r] = -\sum_\mu \alpha_\mu \log r_\mu, \qquad E_{cave}[\vec r] = e^{-1} \sum_i e^{-a_i} e^{\sum_\mu c^\mu_i \log r_\mu}.   (5.11)
It is clear that $E_{vex}[\vec r]$ is a convex function. To verify that $E_{cave}[\vec r]$ is concave, we differentiate it twice:

\frac{\partial E_{cave}}{\partial r_\nu} = e^{-1} \sum_i e^{-a_i} \frac{c^\nu_i}{r_\nu} e^{\sum_\mu c^\mu_i \log r_\mu},

\frac{\partial^2 E_{cave}}{\partial r_\nu \partial r_\tau} = -e^{-1} \sum_i e^{-a_i} \delta_{\nu\tau} \frac{c^\nu_i}{r_\nu^2} e^{\sum_\mu c^\mu_i \log r_\mu} + e^{-1} \sum_i e^{-a_i} \frac{c^\nu_i c^\tau_i}{r_\nu r_\tau} e^{\sum_\mu c^\mu_i \log r_\mu}.   (5.12)
To ensure that this is negative semidefinite, it is sufficient to require that $\sum_\nu c^\nu_i \zeta_\nu^2/r_\nu^2 \ge \{\sum_\nu c^\nu_i \zeta_\nu/r_\nu\}^2$ for any set of $\{\zeta_\nu\}$. This will always be true provided that $c^\nu_i \ge 0$, $\forall i, \nu$, and $\sum_\nu c^\nu_i \le 1$, $\forall i$. Applying CCCP to $E_{vex}[\vec r]$ and $E_{cave}[\vec r]$ gives the update algorithm, equation 5.9.

An important special case of equation 5.8 is the energy function

E_{eff}[S; p, q] = \sum_{ia} A_{ia} S_{ia} + \sum_a p_a \Big(\sum_i S_{ia} - 1\Big) + \sum_i q_i \Big(\sum_a S_{ia} - 1\Big) + (1/\beta) \sum_{ia} S_{ia} \log S_{ia},   (5.13)

where we have introduced a new parameter $\beta$. As Kosowsky and Yuille (1994) showed, the minima of $E_{eff}[S; p, q]$ at sufficiently large $\beta$ correspond to the solutions of the linear assignment problem, whose goal is to select the permutation matrix $\{\Pi_{ia}\}$ that minimizes the energy $E[\Pi] = \sum_{ia} \Pi_{ia} A_{ia}$, where $\{A_{ia}\}$ is a set of assignment values. Moreover, the CCCP algorithm for this case is directly equivalent to Sinkhorn's algorithm once we identify $\{e^{-1-\beta A_{ia}}\}$ with the components of $M$, $\{e^{-\beta p_a}\}$ with the diagonal elements of $E$, and $\{e^{-\beta q_i}\}$ with the diagonal elements of $D$ (see the statement of Sinkhorn's theorem). Therefore, Sinkhorn's algorithm can be used to solve the linear assignment problem (Kosowsky & Yuille, 1994).
6 Conclusion

CCCP is a general principle for constructing discrete-time iterative dynamical systems for almost any energy minimization problem. We have shown that many existing discrete-time iterative algorithms can be reinterpreted in terms of CCCP, including EM algorithms, Legendre minimization algorithms, Sinkhorn's algorithm, and generalized iterative scaling. Alternatively, CCCP can be seen as a special case of variational bounding, lower-bound maximization (upper-bound minimization), and surrogate functions. CCCP gives a novel geometrical way of understanding, and proving convergence of, existing algorithms. Moreover, CCCP can also be used to construct novel algorithms. See, for example, recent work (Yuille, 2002) in which CCCP was used to construct a double-loop algorithm to minimize the Bethe and Kikuchi free energies (Yedidia, Freeman, & Weiss, 2000).

CCCP is a design principle and does not specify a unique algorithm. Different choices of the decomposition into convex and concave parts will give different algorithms with, presumably, different convergence rates. It would be interesting to explore the effectiveness of different decompositions.

There are interesting connections between our results and those known to mathematicians. After much of this work was done, we obtained an unpublished technical report by Geman (1984) that states theorem 2 and has results for a subclass of EM algorithms. There also appear to be similarities to the work of Tuy (1997), who has shown that any arbitrary closed set is the projection of a difference of two convex sets in a space with one more dimension. Byrne (2000) has also developed an interior point algorithm for minimizing convex cost functions that is equivalent to CCCP and has been applied to image reconstruction.

Acknowledgments

We thank James Coughlan and Yair Weiss for helpful conversations. James Coughlan drew Figure 1 and suggested the form of Figure 2. Max Welling gave useful feedback on an early draft of this article.
Two reviews gave useful comments and suggestions. We thank the National Institutes of Health for grant number RO1-EY 12691-01. A.L.Y. is now at the University of California at Los Angeles in the Department of Psychology and the Department of Statistics. A.R. thanks the NSF for support with grant IIS 0196457.

References

Bregman, L. M. (1967). The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. U.S.S.R. Computational Mathematics and Mathematical Physics, 7, 200-217.
Byrne, C. (2000). Block-iterative interior point optimization methods for image reconstruction from limited data. Inverse Problems, 14, 1455-1467.
Darroch, J. N., & Ratcliff, D. (1972). Generalized iterative scaling for log-linear models. Annals of Mathematical Statistics, 43, 1470-1480.
Della Pietra, S., Della Pietra, V., & Lafferty, J. (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1-13.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39, 1-38.
Durbin, R., Szeliski, R., & Yuille, A. L. (1989). An analysis of an elastic net approach to the traveling salesman problem. Neural Computation, 1, 348-358.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using an elastic net method. Nature, 326, 689.
Elfadel, I. M. (1995). Convex potentials and their conjugates in analog mean-field optimization. Neural Computation, 7, 1079-1104.
Geman, D. (1984). Parameter estimation for Markov random fields with hidden variables and experiments with the EM algorithm (Working paper no. 21). Department of Mathematics and Statistics, University of Massachusetts at Amherst.
Hathaway, R. (1986). Another interpretation of the EM algorithm for mixture distributions. Statistics and Probability Letters, 4, 53-56.
Hofmann, T., & Buhmann, J. M. (1997). Pairwise data clustering by deterministic annealing. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 19(1), 1-14.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37, 183-233.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181-214.
Kivinen, J., & Warmuth, M. (1997). Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132, 1-64.
Kosowsky, J. J., & Yuille, A. L. (1994). The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 7, 477-490.
Lange, K., Hunter, D. R., & Yang, I. (2000). Optimization transfer using surrogate objective functions (with discussion). Journal of Computational and Graphical Statistics, 9, 1-59.
Luttrell, S. P. (1994). Partitioned mixture distributions: An adaptive Bayesian network for low-level image processing. IEEE Proceedings on Vision, Image and Signal Processing, 141, 251-260.
Marcus, C., & Westervelt, R. M. (1989). Dynamics of iterated-map neural networks. Physical Review A, 40, 501-509.
Minka, T. P. (1998). Expectation-maximization as lower bound maximization (Tech. Rep.). Available on-line: http://www.stat.cmu.edu/~minka/papers/learning.html.
Mjolsness, E., & Garrett, C. (1990). Algebraic transformations of objective functions. Neural Networks, 3, 651-669.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models. Cambridge, MA: MIT Press.
Press, W. H., Flannery, B. R., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical recipes. Cambridge: Cambridge University Press.
Rangarajan, A. (2000). Self-annealing and self-annihilation: Unifying deterministic annealing and relaxation labeling. Pattern Recognition, 33, 635-649.
Rangarajan, A., Gold, S., & Mjolsness, E. (1996). A novel optimizing network architecture with applications. Neural Computation, 8, 1041-1060.
Rangarajan, A., Yuille, A. L., & Mjolsness, E. (1999). A convergence proof for the softassign quadratic assignment algorithm. Neural Computation, 11, 1455-1474.
Rosenfeld, A., Hummel, R., & Zucker, S. (1976). Scene labelling by relaxation operations. IEEE Transactions on Systems, Man, and Cybernetics, 6, 420-433.
Rustagi, J. (1976). Variational methods in statistics. San Diego, CA: Academic Press.
Saul, L. K., & Lee, D. D. (2002). Multiplicative updates for classification by mixture models. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.
Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. Annals of Mathematical Statistics, 35, 876-879.
Strang, G. (1986). Introduction to applied mathematics. Wellesley, MA: Wellesley-Cambridge Press.
Tuy, H. (1997, Nov. 20). Mathematics and development. Speech presented at Linköping Institute of Technology, Linköping, Sweden. Available on-line: http://www.mai.liu.se/Opt/MPS/News/tuy.html.
Waugh, F. R., & Westervelt, R. M. (1993). Analog neural networks with local competition: I. Dynamics and stability. Physical Review E, 47, 4524-4536.
Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2000). Generalized belief propagation. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 689-695). Cambridge, MA: MIT Press.
Yuille, A. L. (1990). Generalized deformable models, statistical physics and matching problems. Neural Computation, 2, 1-24.
Yuille, A. L., & Kosowsky, J. J. (1994). Statistical physics algorithms that converge. Neural Computation, 6, 341-356.
Yuille, A. L. (2002). CCCP algorithms to minimize the Bethe and Kikuchi free energies. Neural Computation, 14, 1691-1722.
Received April 25, 2002; accepted August 30, 2002.
LETTER
Communicated by Bard Ermentrout
An Analysis of Synaptic Normalization in a General Class of Hebbian Models

Terry Elliott
[email protected]
Department of Electronics and Computer Science, University of Southampton, Highfield, Southampton, SO17 1BJ, U.K.
In standard Hebbian models of developmental synaptic plasticity, synaptic normalization must be introduced in order to constrain synaptic growth and ensure the presence of activity-dependent, competitive dynamics. In such models, multiplicative normalization cannot segregate afferents whose patterns of electrical activity are positively correlated, while subtractive normalization can. It is now widely believed that multiplicative normalization cannot segregate positively correlated afferents in any Hebbian model. However, we recently provided a counterexample to this belief by demonstrating that our own neurotrophic model of synaptic plasticity, which can segregate positively correlated afferents, can be reformulated as a nonlinear Hebbian model with competition implemented through multiplicative normalization. We now perform an analysis of a general class of Hebbian models under general forms of synaptic normalization. In particular, we extract conditions on the forms of these rules that guarantee that such models possess a fixed-point structure permitting the segregation of all but perfectly correlated afferents. We find that the failure of multiplicative normalization to segregate positively correlated afferents in a standard Hebbian model is quite atypical.

1 Introduction

Activity-dependent competition among afferent cells, leading to the segregation of afferents' arbors on their target structures, is a ubiquitous feature of the developing vertebrate nervous system (Purves & Lichtman, 1985). In order to construct a mathematical model of neuronal development that includes such competitive dynamics, it is well known that a standard Hebbian growth rule by itself is insufficient. This is because such a rule typically leads to the unconstrained growth of all afferents on all of their target cells.
In order to implement competition, developmental models frequently employ postsynaptic normalization, a procedure by which it is assumed that postsynaptic or target cells constrain their total afferent input to remain at or below some limit. Thus, one afferent's gain is the other afferents' loss, and competition is achieved at the same time as constraining growth.

Neural Computation 15, 937-963 (2003)
© 2003 Massachusetts Institute of Technology
938
T. Elliott
The classic form of synaptic normalization is multiplicative normalization (von der Malsburg, 1973). Under multiplicative normalization, a standard Hebbian growth rule is modified to include a decay term proportional to the strength (or efficacy or weight) of the afferent input, the constant of proportionality being chosen so that there is no overall synaptic growth at each target cell. However, it is well known that while multiplicative normalization in a standard Hebbian model does segregate afferents whose patterns of electrical activity are anticorrelated, it does not segregate positively correlated afferents (but see von der Malsburg & Willshaw, 1981). It is likely, however, that a model of neuronal development should be able to segregate positively correlated afferents. For example, the development of ocular dominance columns (Hubel & Wiesel, 1962) in kittens occurs in the presence of correlated vision in both eyes. More generally, from a theoretical viewpoint, it is desirable that self-organizing topographic feature maps (Kohonen, 1995) should be able to develop appropriate structures even when the input patterns are positively correlated. This deficiency of multiplicative normalization led to the introduction of subtractive normalization (Goodhill & Barrow, 1994; Miller & MacKay, 1994). Under subtractive normalization, the standard Hebbian growth rule is modified so that a decay term independent of the particular afferent input is introduced, the decay term again being chosen so that the total synaptic input to a target cell remains fixed. Unlike multiplicative normalization, subtractive normalization does segregate positively correlated afferents. From a biological perspective, however, subtractive normalization is difficult to motivate because if synaptic decay processes underlie normalization, then it is rather improbable that these processes should be independent of the local concentrations of any of the substances that together contribute to synaptic strength.
In contrast, while multiplicative normalization may have its own problems, at least its decay processes are concentration dependent. Thus, we are impaled on the horns of a dilemma: one horn corresponding to a normalization rule that does segregate positively correlated afferents but is biologically implausible, the other corresponding to a normalization rule that does not segregate positively correlated afferents but is not biologically implausible.

Some models that can segregate positively correlated afferents resolve this dilemma by employing significant modifications of the standard Hebbian rule (Bienenstock, Cooper, & Munro, 1982; Harris, Ermentrout, & Small, 1997). In our own attempt to overcome these problems, we developed an alternative developmental model inspired by biological data implicating a class of molecular factors, the neurotrophic factors, in activity-dependent synaptic competition (for review, see McAllister, Katz, & Lo, 1999). This model does segregate positively correlated afferents (Elliott & Shadbolt, 1998). Recently, we have shown that the model can be reformulated as a nonlinear Hebbian growth rule with competition implemented through multiplicative synaptic normalization (Elliott & Shadbolt, 2002).

Normalization in Hebbian Models
939

This result explicitly contradicts the now widespread belief that multiplicative normalization cannot segregate positively correlated afferents in any Hebbian model.

Synaptic normalization was not imposed in our original model (Elliott & Shadbolt, 1998) but rather emerges dynamically (Elliott & Shadbolt, 2002). This emergence hinges on the fact that the model's dynamics decouple into two subspaces, with the competitive dynamics residing in a subspace in which the total synaptic input to target cells (a synaptic scale parameter) plays no role. We argued, moreover, that any model whose competitive dynamics exhibit such scale independence can be recast as a general Hebbian growth rule coupled with some form of synaptic normalization, and that therefore synaptic normalization appears to be an acceptable, abstract mechanism for implementing competitive dynamics in a Hebbian model.

We presented our results as an existence proof of a model using multiplicative normalization that can segregate all but perfectly correlated afferents (Elliott & Shadbolt, 2002). We did not perform a more general analysis seeking to explicate the reasons for our model's capacity to segregate positively correlated afferents under multiplicative normalization. To this end, we now perform an analysis of a general class of Hebbian growth rules coupled with general forms of synaptic normalization. In particular, we derive a set of conditions on the forms of these rules that guarantee that these models possess a fixed-point structure permitting the segregation of all but perfectly correlated afferents. Surprisingly, we find that these conditions are rather easily satisfied. Indeed, the conditions are so easily satisfied that the failure of multiplicative normalization to segregate positively correlated afferents under a standard Hebbian rule is seen to be quite unusual and atypical.
2 Analysis of Models

We consider the dynamics of synaptic competition on a pointwise basis, at single target cells with no coupling between target cells. We may therefore restrict the analysis to a consideration of just one target cell. We consider n afferent cells innervating this target cell. Let letters such as i and j label afferent cells, so that, for example, i = 1, ..., n. Let the number of synapses (or synaptic efficacy or synaptic weight) between afferent cell i and the target cell be denoted by v_i, and let the electrical activity of this afferent cell be denoted by a_i. We take a_i ≥ 0 ∀i. Let the vectors v and a be given by v = (v_1, ..., v_n)^T and a = (a_1, ..., a_n)^T, where T denotes the transpose. Let the variable t denote time, and let a dot over a variable denote differentiation with respect to time, so that, for example, v̇_i = dv_i/dt. We consider a general class of synaptic growth rules of the form

v̇_i = π(v_i, a_i) Π(a·v),  (2.1)

where a·v denotes the dot product between a and v, a·v = Σ_i a_i v_i. We assume that the functions π and Π are both nonnegative, that is, π(x, y) ≥ 0
940
T. Elliott
∀x, y and Π(x) ≥ 0 ∀x. The term π(v_i, a_i) represents a function of a purely presynaptic element, while the term Π(a·v) represents a function of a purely postsynaptic element. We have made the simplifying but standard assumption that the postsynaptic response of the target cell is some function of a·v, the afferent input weighted by the number of synapses (or synaptic efficacy). The form of the synaptic growth rule in equation 2.1 is just that of a product between a presynaptic term and a postsynaptic term, and thus is Hebbian in character. It is well known that such synaptic growth rules typically result in unconstrained synaptic growth, with each v_i growing unboundedly, and that synaptic normalization must be introduced to constrain the growth and ensure the presence of competitive dynamics in the models. We consider a general implementation of synaptic normalization. In particular, we require that

Σ_i f(v_i) = 1,  (2.2)

where f is some invertible function. Typically, modelers take f(x) = x or f(x) = x², but we shall consider more general cases. The growth rule in equation 2.1 is then modified so as to ensure that the dynamics respect equation 2.2. We achieve this by replacing equation 2.1 by

v̇_i = π(v_i, a_i) Π(a·v) − g(v_i) A,  (2.3)

where g is some function and the time-dependent quantity A is selected so that

(d/dt) Σ_i f(v_i) = Σ_i f'(v_i) v̇_i = 0,  (2.4)

where a prime always denotes differentiation with respect to the argument of the associated function. Thus,

A = Π(a·v) [Σ_j f'(v_j) π(v_j, a_j)] / [Σ_j f'(v_j) g(v_j)],  (2.5)

and our final equation for the evolution of the v_i is given by

v̇_i = Π(a·v) [ π(v_i, a_i) − g(v_i) Σ_j f'(v_j) π(v_j, a_j) / Σ_j f'(v_j) g(v_j) ].  (2.6)

We assume that the functions f and g are both nonnegative, that is, f(x) ≥ 0, g(x) ≥ 0 ∀x. Because f is invertible, f' is of constant sign, so the denominator in equation 2.6 cannot vanish.
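The defining property of equation 2.6 is that it holds the constraint Σ_i f(v_i) = 1 invariant exactly, whatever functions are chosen. A minimal numerical sketch of this, under illustrative assumptions not prescribed by the text (f(x) = x², g(x) = x, ρ(a) = a, σ(v) = v, Π(x) = x, and uniformly random activity patterns):

```python
import numpy as np

# Hedged sketch of equation 2.6: Euler integration with illustrative
# choices of f, g, pi, and Pi, checking that sum_i f(v_i) = 1 is
# conserved by construction of the normalization term A (equation 2.5).
f   = lambda v: v**2          # constraint function (illustrative)
fp  = lambda v: 2.0 * v       # f'
g   = lambda v: v             # normalization term (illustrative)
Pi  = lambda x: x             # postsynaptic function (illustrative)
pre = lambda v, a: a * v      # pi(v, a) = rho(a) sigma(v), rho(a) = a, sigma(v) = v

rng = np.random.default_rng(0)
n = 4
v = rng.uniform(0.3, 0.7, n)
v /= np.sqrt(np.sum(f(v)))    # start on the constraint surface sum f(v_i) = 1

dt = 1e-3
for _ in range(5000):
    a = rng.uniform(0.0, 1.0, n)               # one afferent activity pattern
    A = Pi(a @ v) * np.sum(fp(v) * pre(v, a)) / np.sum(fp(v) * g(v))
    v += dt * (pre(v, a) * Pi(a @ v) - g(v) * A)

constraint = np.sum(f(v))     # remains 1 up to O(dt) integration error
```

The cancellation in equation 2.4 is exact for the continuous-time dynamics; any residual drift in `constraint` is purely Euler discretization error.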
Finally, we make the assumption that the presynaptic function π(v_i, a_i) is separable as a function of its two arguments. In particular, we assume that

π(v_i, a_i) = ρ(a_i) σ(v_i).  (2.7)
Such an assumption simplifies our analysis and guarantees that π(0, a_i) = 0 provided that σ(0) = 0, as will later be seen to be required. As we assume that the function π is nonnegative, we assume that this is achieved by both ρ and σ being nonnegative, that is, ρ(x) ≥ 0, σ(x) ≥ 0 ∀x. We analyze equation 2.6 in order to find conditions on the functions Π, ρ, σ, f, and g such that solutions exist in which all but one of the v_i are zero. Such solutions correspond to final, segregated states in which all other afferent input has been competitively eliminated in an activity-dependent manner. We insist that these solutions correspond to strictly stable fixed points of equation 2.6 for all but perfectly correlated afferent activity patterns. We also examine the possible existence of unsegregated final states in which two or more afferents innervate the target cell equally. When such states are fixed points of the dynamics, we find that the strict stability of the segregated states forces all such unsegregated fixed points to be unstable. Our analysis thus consists of a fixed-point analysis around these segregated and unsegregated states, determining conditions on the functions Π, ρ, σ, f, and g such that these states embody the required activity-dependent, competitive dynamics. We address the question of the possible existence of oscillatory solutions in section 3. We proceed, for concreteness, in steps of increasing generality. First, we consider the minimal multiplicative model in which all five functions Π, ρ, σ, f, and g are the identity function, that is, Π(x) = x, and so forth. We then consider the more general multiplicative model in which f and g are the identity function but Π, ρ, and σ are unknown. Finally, we consider the fully general model in which all five functions are unknown.

2.1 Minimal Multiplicative Model. For the minimal model, we take Π, ρ, σ, f, and g all to be the identity function; they leave their arguments unchanged.
Equation 2.6 then becomes

v̇_i = v_i (a·v) [a_i − (a·v)],  (2.8)

with Σ_i v_i = 1. We average this equation over the ensemble of afferent activity patterns. Defining the symmetric matrix C with components

C_ij = ⟨a_i a_j⟩,  (2.9)

where the angle brackets ⟨⟩ denote ensemble averaging, and dropping, for notational convenience, these brackets around the v_i, we then have

v̇_i = v_i [(Cv)_i − v·Cv].  (2.10)
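A direct integration of equation 2.10 illustrates the resulting competitive, winner-take-all dynamics; the correlation matrix below, with positively correlated afferents and a dominant diagonal (C_ii > C_ji), is an illustrative assumption:

```python
import numpy as np

# Hedged sketch: Euler-integrate the ensemble-averaged minimal model,
# dv_i/dt = v_i [ (Cv)_i - v.Cv ], and watch one afferent win.
C = np.array([[1.0, 0.6, 0.6],
              [0.6, 1.0, 0.6],
              [0.6, 0.6, 1.0]])   # C_ii > C_ji: segregation condition holds

v = np.array([0.34, 0.33, 0.33])  # small asymmetry seeds the competition
dt = 0.05
for _ in range(20000):
    Cv = C @ v
    v += dt * v * (Cv - v @ Cv)

# v converges to a segregated state: the initially favored afferent
# approaches 1, the others approach 0, and sum v_i = 1 is preserved.
```

The dynamics have the form of replicator equations, so the simplex Σ_i v_i = 1 is invariant even under the Euler discretization.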
Consider the point u = (0, ..., 0, 1, 0, ..., 0)^T, where only the ith element of u, u_i, is unity, that is, u_j = δ_ij, the Kronecker delta. Expand about this segregated point by writing v = u + δv, where n·δv = 0, with n = (1, ..., 1)^T. For j ≠ i, after linearizing, we obtain

δv̇_j = δv_j (C_ji − C_ii),  (2.11)

with the equation for δv̇_i following from δv̇_i = −Σ_{j≠i} δv̇_j. The points u, i = 1, ..., n, are therefore all fixed and are stable provided that C_ii ≥ C_ji ∀i, j ≠ i. These points are strictly stable if C_ii > C_ji ∀i, j ≠ i, and for our results to be applicable to the full nonlinear case, we require that the strict inequality holds. This condition is simply the statement that each entry on the diagonal of C is the largest in the corresponding column (and also row, since C is symmetric). If the condition C_ii > C_ji ∀i, j ≠ i cannot be met for reasonable and natural patterns of afferent activity, then the minimal multiplicative model is incapable of exhibiting the required competitive dynamics, because it would lack the required fixed-point structure. What are the characteristic features of those ensembles that do satisfy this condition? Let letters such as α and β label members of the ensemble, so that, for example, a^α is the αth activity pattern in an ensemble of m patterns. The correlation matrix C is then given by

C = (1/m) Σ_{α=1}^m a^α a^{αT},  (2.12)

and the condition C_ii > C_ji ∀i, j ≠ i becomes

Σ_{α=1}^m (a_i^α)² > Σ_{α=1}^m a_i^α a_j^α  ∀i, j ≠ i.  (2.13)

Let the vector A_i = (a_i^1, ..., a_i^m)^T be the vector formed from all the values of afferent i's activity over the whole ensemble. We then require that

A_i · A_i > A_i · A_j  ∀i, j ≠ i.  (2.14)
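The condition can be verified directly for a translation-invariant ensemble; the Gaussian-bump basic pattern below is an illustrative assumption:

```python
import numpy as np

# Hedged sketch: build an unbiased ensemble by centering one basic
# activity pattern on each of n afferents in turn (with wraparound),
# then check the segregation condition A_i . A_i > A_i . A_j of
# equation 2.14.
n = 8
base = np.exp(-0.5 * (np.arange(n) - n // 2) ** 2)         # illustrative bump
patterns = np.array([np.roll(base, k) for k in range(n)])  # m = n patterns

A = patterns.T   # row i = afferent i's activity across the whole ensemble

ok = all(A[i] @ A[i] > A[i] @ A[j]
         for i in range(n) for j in range(n) if j != i)
# ok is True: equal norms plus distinct (non-proportional) vectors give
# the strict inequality, exactly as argued in the text.
```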
What is the form of the set of vectors A_i, i = 1, ..., n, that guarantees that this condition is satisfied? If |A_i| = |A_j| and A_i ≠ A_j ∀i ≠ j, then this condition is always satisfied. In particular, if all afferents are treated in an unbiased manner, so that the ensemble of activity patterns that they experience is invariant under a uniform translation in the space of afferents, then the various vectors A_i will all be related through simple permutations of their components, so that, for i ≠ j, A_i = R_ij A_j for some permutation matrix R_ij. Translation invariance of the ensemble of afferent activity patterns is an assumption widely made in the modeling literature and is therefore a reasonable one. Such ensembles are generated, for example, by constructing a basic pattern of activity and applying it by centering it on each afferent in turn. For simplicity, in what follows, we shall assume that an ensemble of afferent activity patterns is generated from only one such basic pattern, although because ensemble averaging is a linear function over the ensemble, any results that apply to one such basic pattern will also apply to any number of basic patterns. We now consider the possible existence of an unsegregated state in which l > 1 afferents innervate the target cell equally in the final state. For unbiased activity patterns, unsegregated but unbalanced final states of innervation cannot exist and are therefore not considered. Without loss of generality, we may consider the state

u = (1/l)(1, ..., 1, 0, ..., 0)^T,  (2.15)

with l components equal to 1 followed by n − l components equal to 0,
that is, u_i = 1/l, i ≤ l, and u_i = 0, i > l. Expand about this state as usual, writing v = u + δv with n·δv = 0. For this state to be a fixed point, C must satisfy

(Cu)_i = u·Cu,  i ≤ l.  (2.16)

Defining C̃ to be the upper l×l submatrix of C and n_l to be the l-dimensional vector with all components unity, this equation simply states that if u is a fixed point of the dynamics, then n_l must be an eigenvector of C̃ with eigenvalue (1/l) n_l·C̃n_l. Because C̃_ij ≥ 0 ∀i, j, this eigenvalue is positive and is, in fact, the largest eigenvalue of the matrix C̃. Provided that equation 2.16 is satisfied, we then obtain for l < n

δv̇_i = (1/l)[(Cδv)_i − 2δv·Cu],  i ≤ l,
δv̇_i = δv_i [(Cu)_i − u·Cu],  i > l,  (2.17)

and for l = n

δv̇ = (1/n) Cδv  (2.18)

as the linearized equations for the evolution of δv about u. Consider first the case l < n. For i > l, the matrix determining the linear evolution is diagonal, with no coupling in the lower left (n − l) × l block of the matrix to the other v_i's, i ≤ l, so n − l eigenvalues can be read off immediately. The remaining l eigenvalues are just the eigenvalues of the upper l × l submatrix of the full matrix, which we denote by Ĉ. Using equation 2.16 to simplify the terms in δv_i, i ≤ l, in the expression δv·Cu, we have that

Ĉ_ij = (1/l)[C̃_ij − 2u·Cu].  (2.19)
Thus, n_l is an eigenvector of Ĉ with negative eigenvalue −(1/l)² n_l·C̃n_l if it is an eigenvector of C̃. Moreover, if C_ii > C_ji ∀i, j ≠ i, as required by the strict stability of segregated fixed points, then we also have Ĉ_ii > Ĉ_ji ∀i, j ≠ i. Given these conditions, it is easy to prove that Ĉ has a positive eigenvalue. As Ĉ is symmetric, its eigenvalues can be characterized variationally, and, in particular, the largest eigenvalue not associated with the eigenvector n_l is the maximum of w·Ĉw/w·w over the set of nonzero vectors w orthogonal to n_l. Consider a vector w with ith component +1 and jth component −1, with all other components zero. Clearly n_l·w = 0, and we have that w·Ĉw/w·w = (1/2)(Ĉ_ii + Ĉ_jj) − Ĉ_ij. Provided that Ĉ_ii > Ĉ_ji ∀i, j ≠ i, as required by the strict stability of the segregated fixed points, then w·Ĉw/w·w > 0, and so Ĉ has a positive eigenvalue. Thus, if the point u is a fixed point, then it is an unstable fixed point provided that the segregated fixed points are strictly stable. These arguments are clearly valid for any submatrix C̃, not just the upper submatrices. For the l = n case, C has a positive eigenvalue associated with the eigenvector n. However, n·δv ≡ 0, so the sign of this eigenvalue is irrelevant to the stability analysis. The matrix C is, in fact, positive semidefinite, for if e is an eigenvector with eigenvalue λ, then

λ|e|² = e^T C e = (1/m) Σ_{α=1}^m |e·a^α|² ≥ 0,

so all eigenvalues are nonnegative. [Footnote 1: An identical argument shows that C̃ is positive semidefinite. However, the matrix Ĉ manifestly is not, as it possesses the negative eigenvalue −(1/l)² n_l·C̃n_l.] To establish the existence of at least one further, strictly positive eigenvalue, the same variational argument as above shows that C has a positive eigenvalue associated with an eigenvector orthogonal to n. Indeed, if n − 1 eigenvalues of C were identically zero, then C would be a rank 1 matrix with C ∝ nn^T, and such a matrix can be formed only by taking a^α ∝ n ∀α, that is, all afferents' activities are perfectly correlated. Thus, if the unsegregated state with l = n is a fixed point, then it is an unstable fixed point. For which values of l are the states defined by equation 2.15 guaranteed to be fixed points? Provided that afferents are treated in an unbiased manner, so that C_ii = C_jj ∀i ≠ j, that is, |A_i| = |A_j| ∀i ≠ j, then the l = 2 and the l = n states are guaranteed to be fixed points. For l = 2, this is because any submatrix C̃ will always be of the form

C̃ = [ A  B ; B  A ],  (2.20)

and thus (1, 1)^T is always an eigenvector, satisfying the condition for the l = 2 state to be a fixed point. For l = n, n is always an eigenvector of C if

(1/m) Σ_{β=1}^m a_i^β = (1/n) Σ_{j=1}^n a_j^α  ∀i, α,  (2.21)
that is, if the mean of any pattern's activity over all afferents equals the mean of any afferent's activity over all patterns. Such a condition is always satisfied for the unbiased patterns discussed above, so the l = n state is always a fixed point. The 2 < l < n states, however, are in general not fixed points, except for very particular choices for the patterns of afferent activity. In summary, the minimal multiplicative model is guaranteed to possess segregated fixed points, and these are guaranteed to be strictly stable provided that C_ii > C_ji ∀i, j ≠ i. This condition is satisfied by ensembles that treat all the afferents in an unbiased manner. The minimal multiplicative model may also possess many unsegregated fixed points (potentially 2^n − n − 1 of them), but the strict stability of the segregated fixed points guarantees that any such unsegregated fixed points are unstable.

2.2 General "Linear" Multiplicative Model. We now turn to a consideration of a wider class of multiplicative models, in which the functions Π, ρ, and σ are unknown, but in which f and g are still the identity functions, so that the linear sum Σ_i v_i = 1 is maintained by the subtraction of a term v_i A from the unconstrained growth rule. This is why we refer to this class of models as general "linear" multiplicative models. Our aim is to derive conditions on the functions Π, ρ, and σ that guarantee that this class of models will exhibit the required fixed-point structure. Our basic equation is then just

v̇_i = Π(a·v) [π(v_i, a_i) − v_i Σ_j π(v_j, a_j)],  (2.22)

where Σ_i v_i = 1 and, as usual, π(v_i, a_i) = ρ(a_i)σ(v_i). Consider, as before, the point u with components u_j = δ_ij, a segregated state in which all afferents but afferent i have been eliminated. Expand about this point as before by writing v = u + δv, where n·δv = 0. For u to be a fixed point of the equations after ensemble averaging, we must have that

σ(0)⟨Π(a_i)ρ(a_j)⟩ = 0  ∀j ≠ i.  (2.23)
As the term in angle brackets is not guaranteed to be zero, we are forced to require that

σ(0) = 0.  (2.24)

This is the reason for our assuming separability of the function π(v_i, a_i). With σ(0) = 0, after linearizing we obtain, for j ≠ i,

δv̇_j = δv_j Π(a_i)[σ'(0)ρ(a_j) − σ(1)ρ(a_i)],  (2.25)

and the equation for δv̇_i can be obtained by using δv̇_i = −Σ_{j≠i} δv̇_j. For the stability of this point, we require the ensemble average of the right-hand side (RHS) to be nonpositive. Thus, if

σ'(0) ≤ σ(1)  (2.26)

and

⟨Π(a_i)ρ(a_j)⟩ ≤ ⟨Π(a_i)ρ(a_i)⟩  ∀j ≠ i,  (2.27)

then this segregated fixed point is guaranteed to be stable. A strict inequality in equation 2.27 will ensure strictly stable segregated fixed points and is required for the applicability of our results to the full, nonlinear case. Defining the matrix D with components D_ij given by

D_ij = ⟨Π(a_i)ρ(a_j)⟩,  (2.28)
we must find conditions on the functions Π and ρ such that D_ii > D_ij ∀i, j ≠ i. Notice that the matrix D is not, in general, symmetric. For the unbiased ensembles of afferent activity discussed above, we show in the appendix that if Π and ρ are either both monotonic increasing or both monotonic decreasing, then this condition is satisfied. We now consider the states corresponding to the unsegregated states defined by equation 2.15. For these states to be fixed points, we require that

⟨Π(a·u)[ρ(a_i) − (1/l) Σ_{j=1}^l ρ(a_j)]⟩ = 0,  i ≤ l.  (2.29)
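The condition D_ii > D_ij can be probed numerically for monotone choices of Π and ρ on an unbiased ensemble; the particular functions and the bump-shaped basic pattern below are illustrative assumptions:

```python
import numpy as np

# Hedged sketch: estimate D_ij = <Pi(a_i) rho(a_j)> over an unbiased,
# translation-invariant ensemble for monotonic increasing Pi and rho,
# and check the segregation condition D_ii > D_ij of equation 2.28.
Pi  = lambda x: x**2          # monotonic increasing on a >= 0 (illustrative)
rho = lambda x: np.sqrt(x)    # monotonic increasing (illustrative)

n = 6
base = np.exp(-0.5 * (np.arange(n) - n // 2) ** 2)
patterns = np.array([np.roll(base, k) for k in range(n)])   # unbiased ensemble

# D_ij = (1/m) sum_alpha Pi(a_i^alpha) rho(a_j^alpha)
D = Pi(patterns).T @ rho(patterns) / len(patterns)

ok = all(D[i, i] > D[i, j] for i in range(n) for j in range(n) if j != i)
# ok is True here, consistent with the appendix result quoted in the text.
```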
With ρ a constant function, this condition is satisfied for any l, but we exclude this possibility because the presynaptic response function should depend on afferent activity. When Π(x) and ρ(x) are the identity function, the condition reduces to that for the minimal multiplicative model discussed above. Unless l = n, however, in general this condition is never satisfied for general functions Π and ρ except for very particular patterns of afferent activity. In particular, suppose that the ensemble consists exclusively of a set of nonsingleton subensembles in which a·u is constant over a given subensemble. Then, for equation 2.29 to be satisfied, we require that within each subensemble, ⟨ρ(a_i)⟩ = ⟨ρ(a_j)⟩ ∀i, j ≠ i, and i, j ≤ l. This requirement is met provided that all pairs of afferents for i ≤ l are treated identically within the subensemble. For l = n and unbiased activity patterns, it should be noted that these conditions are guaranteed to be satisfied, so that for unbiased patterns, the l = n state is always a fixed point. For l < n, a class of patterns that satisfies equation 2.29 is one generated from a basic pattern of afferent activity together with all its component permutations. Define μ = a·u, let ⟨⟩_μ denote an average over the subensemble with a·u = μ, and let Σ_μ denote the sum over all such subensembles. Writing ⟨ρ(a_i)⟩_μ = ρ̄_μ, a constant for i ≤ l, we have that (1/l) Σ_{j=1}^l ρ(a_j) = ρ̄_μ when a is in the subensemble with a·u = μ. Then, expanding about these points as usual by writing v = u + δv where n·δv = 0, after some algebra and repeated use of Σ_{j=1}^l δv_j = −Σ_{j=l+1}^n δv_j, for i ≤ l we obtain

δv̇_i = σ(1/l) Σ_μ Π'(μ)⟨ρ(a_i)a⟩_μ·δv
  + δv_i [σ'(1/l) − lσ(1/l)] Σ_μ Π(μ)ρ̄_μ
  − σ(1/l) Σ_μ Π'(μ)ρ̄_μ Σ_{j=l+1}^n ⟨a_j − μ⟩_μ δv_j
  + (1/l) Σ_μ Π(μ) Σ_{j=l+1}^n [σ'(1/l)ρ̄_μ − σ'(0)⟨ρ(a_j)⟩_μ] δv_j,  (2.30)

and for i > l we have

δv̇_i = δv_i Σ_μ Π(μ)[σ'(0)⟨ρ(a_i)⟩_μ − lσ(1/l)ρ̄_μ]  (2.31)

as the linearized equations for the evolution of δv_i after ensemble averaging. n − l of the eigenvalues of the matrix determining the linearized dynamics can be read off immediately, and the remaining l eigenvalues are given by the eigenvalues of the matrix Ê with components Ê_ij given by

Ê_ij = σ(1/l) Σ_μ Π'(μ)⟨ρ(a_i)a_j⟩_μ + δ_ij [σ'(1/l) − lσ(1/l)] Σ_μ Π(μ)ρ̄_μ,  i, j ≤ l.  (2.32)
The matrix Ê is not, in general, symmetric. However, as with Ĉ, it is easy to see that n_l is an eigenvector of Ê and has eigenvalue

λ₁(Ê) = σ(1/l) Σ_μ lμΠ'(μ)ρ̄_μ + [σ'(1/l) − lσ(1/l)] Σ_μ Π(μ)ρ̄_μ.  (2.33)

When l = n, n is an eigenvector of Ê, but n·δv ≡ 0, so the sign of λ₁(Ê) is irrelevant to the stability analysis. However, we can prove the existence of another eigenvalue not associated with n_l that is positive, so that we do not need to consider the two cases l = n and l < n separately. Unlike Ĉ, we cannot characterize Ê's other eigenvalues variationally. However, Tr Ê − λ₁(Ê) gives the sum of the other eigenvalues, and if this sum is positive, then at least one positive eigenvalue of Ê not associated with n_l must exist. We have that

Tr Ê = σ(1/l) Σ_μ Π'(μ) Σ_{i=1}^l ⟨ρ(a_i)a_i⟩_μ + l[σ'(1/l) − lσ(1/l)] Σ_μ Π(μ)ρ̄_μ,  (2.34)

and so

Tr Ê − λ₁(Ê) = σ(1/l) Σ_μ Π'(μ) Σ_{i=1}^l Cov(ρ(a_i), a_i)_μ + (l − 1)[σ'(1/l) − lσ(1/l)] Σ_μ Π(μ)ρ̄_μ,  (2.35)

where Cov(ρ(a_i), a_i)_μ = ⟨[ρ(a_i) − ρ̄_μ][a_i − μ]⟩_μ. The second term on the RHS is positive if

(1/l)σ'(1/l) ≥ σ(1/l),  1 < l ≤ n.  (2.36)

If ρ is monotonic increasing (decreasing), then ρ(a_i) and a_i, i = 1, ..., l, positively (negatively) co-vary. Thus, the first term on the RHS is positive if Π'ρ' ≥ 0, the same condition that guarantees that the segregated fixed points are strictly stable. Clearly, these arguments go through for any nonsingleton subset of afferents, not just the subset i = 1, ..., l. Hence, when Π and ρ are both monotonic increasing or both monotonic decreasing, and when equation 2.36 is satisfied, Ê has a positive eigenvalue not associated with n_l, and so any unsegregated fixed point states are all unstable.
For the general "linear" multiplicative model to exhibit the correct dynamics, corresponding to the presence of strictly stable, segregated fixed points and any unsegregated fixed points being unstable, we thus have derived a number of conditions on the functions Π, ρ, and σ based on the assumption that all afferents are treated in an unbiased manner. To examine the stability of the unsegregated states, a stronger assumption was necessary. Without this stronger assumption, the unsegregated states are not, in fact, fixed points, except when l = n, in which case the stronger assumption reduces to the unbiased assumption. For the segregated points to be fixed points, we require that σ(0) = 0. In order to guarantee the strict stability of the segregated fixed points, we require that Π'ρ' ≥ 0, that is, Π and ρ must both be monotonic increasing or both monotonic decreasing, and we must also have that σ'(0) ≤ σ(1). For any unsegregated fixed point to be unstable, in addition to needing Π'ρ' ≥ 0, we must also have that (1/n)σ'(1/n) ≥ σ(1/n) for any positive integer n, since we can have any number of afferents. We can recast the various conditions on the function σ in a more intuitive manner. The sequence {1/n | n ∈ ℕ⁺} possesses an accumulation point at zero. We can therefore regard equation 2.36 as an equation in a continuous variable in the neighborhood of zero. We then have the simple differential inequality σ'(x)/σ(x) ≥ 1/x, or σ(x) ≥ Ax, where A is some positive constant. Equation 2.36 therefore says that the function σ must grow at least as fast as linearly in its argument. Assuming that σ is monotonic increasing and extending σ(x) ≥ Ax over the whole interval [0, 1], this condition then also implies that σ'(0) ≤ σ(1), which is necessary for the stability of the segregated fixed points. We should stress that we must be careful in the application of the condition σ(x) ≥ Ax. For example, it is illegitimate to argue that the constant function σ(x) = A satisfies this condition on the interval [0, 1] and therefore should lead to acceptable dynamics. For if σ is a nonzero constant, then σ' = 0, and equation 2.36 is violated for all positive n, and the requirement that σ(0) = 0 is also violated. The correct interpretation of σ(x) ≥ Ax is, rather, that σ grows at least as fast as linearly in its argument. In summary, we can list the three derived conditions, and their origins, on the functions Π, ρ, and σ:

1. σ(0) = 0 (segregated states are fixed points)
2. σ(x) ≥ Ax (instability of unsegregated fixed points)
3. Π'(x)ρ'(x) ≥ 0 ∀x (stability of segregated points)

These conditions guarantee that the general "linear" multiplicative model possesses the correct fixed-point structure.
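The three conditions can be exercised concretely. With the illustrative choices Π(x) = x, ρ(x) = x, and σ(x) = x² (so that σ(0) = 0, σ'(x)/σ(x) = 2/x ≥ 1/x, and Π'ρ' = 1 ≥ 0 all hold), the ensemble-averaged form of equation 2.22 segregates positively correlated afferents:

```python
import numpy as np

# Hedged sketch: ensemble-averaged equation 2.22 with the illustrative
# choices Pi(x) = x, rho(x) = x, sigma(x) = x**2:
#   dv_i/dt = sigma(v_i) (Cv)_i - v_i sum_j sigma(v_j) (Cv)_j,
# which preserves sum_i v_i = 1.
sigma = lambda v: v**2

C = np.array([[1.0, 0.7, 0.7],
              [0.7, 1.0, 0.7],
              [0.7, 0.7, 1.0]])   # positively correlated, C_ii > C_ji

v = np.array([0.35, 0.33, 0.32])
dt = 0.05
for _ in range(40000):
    Cv = C @ v
    v += dt * (sigma(v) * Cv - v * np.sum(sigma(v) * Cv))

# The afferent with the initial advantage takes over: v -> (1, 0, 0).
```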
2.3 Fully General Model. Finally, we consider the fully general model in equation 2.6,

v̇_i = Π(a·v) [ π(v_i, a_i) − g(v_i) Σ_j f'(v_j)π(v_j, a_j) / Σ_j f'(v_j)g(v_j) ],  (2.37)

with Σ_i f(v_i) = 1, where f is some invertible function, and, as usual, π(v_i, a_i) = ρ(a_i)σ(v_i). Although messier, the analysis of the fully general model is very similar to that of the general "linear" multiplicative model. We therefore do not labor the analysis but rather state the key results. We continue to assume that the ensemble of afferent activity patterns treats all afferents in an unbiased manner. Consider the segregated state u, u_j = δ_ij. We require that f(1) + (n − 1)f(0) = 1. This must hold for any positive integer value of n, so we must have that f(0) = 0 and f(1) = 1. This segregated state is a fixed point of the full dynamics if

σ(0) = 0,  (2.38)
g(0) = 0.  (2.39)

Expanding about this point as usual, we obtain

δv̇_j = δv_j Π(a_i)[σ'(0)ρ(a_j) − σ(1)(g'(0)/g(1))ρ(a_i)]  (2.40)

as the linearized evolution equation for δv_j, j ≠ i. As for the general "linear" multiplicative model, this point is therefore guaranteed strictly stable if Π and ρ are monotonic (either both increasing or both decreasing). However, the previous condition σ'(0) ≤ σ(1) becomes

σ'(0)/σ(1) ≤ g'(0)/g(1).  (2.41)
Now consider the unsegregated states u defined by u_i = γ, i ≤ l, and u_i = 0, i > l, where lf(γ) = 1, so that γ = f⁻¹(1/l), this being the reason for assuming invertibility of f. As before, these states are fixed points if all afferents are treated in an unbiased manner and if the ensemble decomposes into subensembles in which a·u is constant. After some algebra, we obtain, for i ≤ l,

δv̇_i = σ(γ) Σ_μ Π'(μ)⟨ρ(a_i)a⟩_μ·δv
  + δv_i [σ'(γ) − σ(γ)g'(γ)/g(γ)] Σ_μ Π(μ)ρ̄_μ
  − σ(γ) Σ_μ Π'(μ)ρ̄_μ Σ_{j=l+1}^n ⟨a_j − μ/(lγ)⟩_μ δv_j
  + (1/(lg(γ)f'(γ))) Σ_μ Π(μ) Σ_{j=l+1}^n [Gρ̄_μ − H⟨ρ(a_j)⟩_μ] δv_j,  (2.42)

where

G = σ(γ)f'(0)g'(0) − f'(γ)σ(γ)g'(γ) + f'(γ)σ'(γ)g(γ),  (2.43)
H = g(γ)f'(0)σ'(0),  (2.44)

and for i > l we have

δv̇_i = δv_i Σ_μ Π(μ)[σ'(0)⟨ρ(a_i)⟩_μ − g'(0)(σ(γ)/g(γ))ρ̄_μ]  (2.45)

as the linearized evolution equations for the δv_i. As before, these points are unstable if Π'ρ' ≥ 0. But the condition in equation 2.36 for the general "linear" multiplicative model becomes

σ'(γ)/σ(γ) ≥ g'(γ)/g(γ),  (2.46)

with γ = f⁻¹(1/l), 1 < l ≤ n. There can be any number of afferents, so this condition must hold for any positive n. As the sequence {f⁻¹(1/n) | n ∈ ℕ⁺} possesses an accumulation point at f⁻¹(0) = 0, we can, as before, regard this as an equation in a continuous variable in the neighborhood of this point, and hence we have σ'(x)/σ(x) ≥ g'(x)/g(x), or σ(x) ≥ Ag(x), where A is some positive constant. Thus, σ must grow at least as fast as g, modulo an overall constant. As in the earlier discussion of equation 2.36 for the general "linear" multiplicative model, we must be careful in the application of σ(x) ≥ Ag(x). It is illegitimate, for example, to extremize the RHS of the inequality on the interval [0, 1] and set σ to the constant function whose value is this extremum, for equation 2.46 would then be violated, as would the requirement that σ(0) = 0. Again, the correct interpretation is that σ grows at least as fast as the function g. In contrast to the specific case g(x) = x considered for the general "linear" multiplicative model, if σ and g are both monotonic increasing and we extend σ(x) ≥ Ag(x) over the whole interval [0, 1], then this does not imply, in general, that equation 2.41 is automatically satisfied. However, when both σ and g are of the same functional form (e.g., both simple powers of their arguments or both exponential functions of their arguments), then σ(x) ≥ Ag(x) does imply that equation 2.41 is satisfied.
In summary, assuming unbiased afferent activity patterns, the conditions on the functions Π, ρ, σ, and g that guarantee that the fully general model exhibits the required fixed-point dynamics are simply

1. σ(0) = 0, g(0) = 0
2. σ(x) ≥ Ag(x)
3. Π'(x)ρ'(x) ≥ 0

There are no conditions on f, save that it is invertible and that it satisfies f(0) = 0 and f(1) = 1.

2.4 Particular Models and Special Subclasses. We can employ our results from the above analyses to understand some of the particular models typically employed in the literature and to examine some special subclasses of models in which elegant simplifications occur.

2.4.1 Standard Hebbian Models. A standard Hebbian model is almost always stated with a presynaptic function of the very restricted form π(v_i, a_i) = a_i, that is, σ(x) = 1 and ρ(x) = x. Of course, σ(x) = 1 violates the requirement that σ(0) = 0, which guarantees that segregated states correspond to fixed points. Indeed, it is well known that the standard form of a Hebbian model does not exhibit a fixed-point structure. Nevertheless, the overall dynamics can still be understood by examining the linearized equations on the assumption that ad hoc devices, such as freezing synapses at the minimum or maximum desired values, will give fixed-point-like behavior, and thus our other criteria are applicable. Under multiplicative normalization (von der Malsburg, 1973), g(x) = x. In this case, the condition in equation 2.46 is violated. Under subtractive normalization (Goodhill & Barrow, 1994; Miller & MacKay, 1994), however, g(x) = 1. In this case, the condition in equation 2.46 is respected. Hence, we would expect multiplicative normalization to fail, in general, to segregate afferents in a standard Hebbian model. In fact, it is well known that while multiplicative normalization can segregate anticorrelated afferents in a standard Hebbian model, it cannot segregate positively correlated afferents. Our criteria apply if a model possesses a fixed-point structure corresponding to the segregation of all but perfectly correlated afferents. As multiplicative normalization in a standard Hebbian model cannot segregate all but perfectly correlated afferents, our criteria correctly confirm its failure.
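The contrast between the two normalizations of the ensemble-averaged standard Hebbian rule can be seen directly in simulation; the correlation values, time step, and clipping bounds below are illustrative assumptions:

```python
import numpy as np

# Hedged sketch: the standard Hebbian rule (sigma(x) = 1, rho(x) = x,
# Pi(x) = x), ensemble averaged, under the two normalizations:
#   multiplicative (g(x) = x):  dv/dt = Cv - (n . Cv) v
#   subtractive    (g(x) = 1):  dv/dt = Cv - mean(Cv)
# Synapses are clipped to [0, 1], the ad hoc device mentioned in the text.
C = np.array([[1.0, 0.7, 0.7],
              [0.7, 1.0, 0.7],
              [0.7, 0.7, 1.0]])   # positively correlated afferents
dt = 0.01

v_mult = np.array([0.36, 0.33, 0.31])
v_sub  = v_mult.copy()
for _ in range(20000):
    Cv = C @ v_mult
    v_mult = np.clip(v_mult + dt * (Cv - np.sum(Cv) * v_mult), 0.0, 1.0)
    Cv = C @ v_sub
    v_sub  = np.clip(v_sub + dt * (Cv - np.mean(Cv)), 0.0, 1.0)

# v_mult relaxes to the unsegregated uniform state (1/3, 1/3, 1/3);
# v_sub segregates: the initially favored afferent saturates at 1.
```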
In contrast, our criteria establish that subtractive normalization in a standard Hebbian model should be able to segregate all but perfectly correlated afferents, and this is indeed the case. Notice that these results are independent of the form of the function f and, in particular, of whether we enforce Σ_i v_i = 1 or Σ_i v_i² = 1.

2.4.2 Nonstandard Hebbian Models. If the pair of functions (σ(x), g(x)) define both the presynaptic response as a function of the number of synapses and the form of the term that enforces synaptic normalization, then a standard Hebbian model under multiplicative normalization can be written as the (1, x) model. The failure of the (1, x) model in general to segregate afferents led to subtractive normalization, which is the (1, 1) model. One of our criteria for the capacity of a model to segregate all but perfectly correlated afferents is that σ(x) ≥ Ag(x), interpreted as meaning that σ(x) must grow at least as fast as g(x). Thus, rather than moving from the (1, x) model to the (1, 1) model, we can instead move to the (x, x) model. Provided that the other criteria are satisfied, the (x, x) model will segregate all but perfectly correlated afferents. Moreover, it satisfies the fixed-point requirements σ(0) = 0, g(0) = 0, unlike the (1, 1) model. We have shown that our neurotrophic model of synaptic plasticity (Elliott & Shadbolt, 1998) can be reformulated as a nonlinear Hebbian model under multiplicative normalization (Elliott & Shadbolt, 2002), where the nonlinearity resides in the postsynaptic function Π. This model is precisely an (x, x) model. The neurotrophic model exhibits two distinct parameter regimes, one in which segregation of all but perfectly correlated afferents does occur and one in which segregation never occurs. These two regimes were shown to correspond, respectively, to Π' > 0 and Π' < 0 (ρ' > 0 in both regimes). Hence, the criteria derived earlier give a more general understanding of our neurotrophic model by showing that the two distinct behaviors are not specific to the model but are, instead, universally valid. When the postsynaptic function Π is not monotonic increasing but ρ is monotonic increasing, the capacity for segregation breaks down. For n = 2 afferents, the minimal multiplicative model, which is the simplest (x, x) model, is identical to Swindale's model for the development of ocular dominance columns (Swindale, 1980).
Let subscripted letters such as $x$ and $y$ denote different target cells, and let target cells be coupled through the lateral interaction function $\Delta_{xy}$. Then the afferent input to target cell $x$ is $c_x = \mathbf{a} \cdot \mathbf{v}_x$, and the full postsynaptic response can be written as

$\Pi_x = \sum_y \Delta_{xy}\, c_y$.  (2.47)
Defining $v_x$ through $v_{xi} = \frac{1}{2}(1 + v_x)$ for either of the two afferents $i$, and assuming that both afferents are treated in an unbiased manner so that $C_{11} = C_{22}$, we then obtain

$\dot{v}_x = \frac{1}{2}(C_{11} - C_{12})(1 - v_x^2) \sum_y \Delta_{xy}\, v_y$,  (2.48)

which, up to an overall multiplicative factor, is precisely Swindale's model. Thus, Swindale's model is formally identical to the minimal multiplicative model with $n = 2$ afferents.
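To illustrate (a sketch, not code from the article): integrating equation 2.48 on a one-dimensional ring of target cells, with an assumed Mexican-hat lateral interaction $\Delta$ and anticorrelated afferents so that $C_{11} - C_{12} > 0$ (all parameter values below are illustrative), drives near-uniform initial weights into saturated ocular-dominance-like stripes with $v_x \to \pm 1$:

```python
import numpy as np

# Euler integration of equation 2.48,
#   dv_x/dt = k (1 - v_x^2) * sum_y D[x, y] * v_y,   k = (C11 - C12)/2,
# on a ring of N cells with an assumed difference-of-gaussians kernel D.
rng = np.random.default_rng(0)
N, k, dt, steps = 32, 0.5, 0.05, 4000

d = np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])
d = np.minimum(d, N - d)                      # distance on the ring
D = np.exp(-d**2 / (2 * 2.0**2)) - 0.5 * np.exp(-d**2 / (2 * 6.0**2))

v = rng.uniform(-0.1, 0.1, N)                 # near-unsegregated start
for _ in range(steps):
    v += dt * k * (1 - v**2) * (D @ v)
    v = np.clip(v, -1.0, 1.0)                 # guard against Euler overshoot
# v now sits in saturated stripes of alternating sign
```

The DC mode decays (the kernel's spatial sum is negative here), so the pattern that grows has zero mean, giving alternating domains of $v_x \approx +1$ and $v_x \approx -1$.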
954
T. Elliott
2.4.3 The $(x^a, x)$ Model with $a \le 1$. We have just discussed the $(1, x)$ model, which is the standard Hebbian model, and the $(x, x)$ model, a nonstandard Hebbian model, both employing multiplicative normalization. The latter model possesses fixed points corresponding to the segregation of all but perfectly correlated afferents, while the former can segregate only anticorrelated afferents. If we consider an intermediate model of the form $(x^a, x)$, $0 \le a \le 1$, then as $a$ approaches unity, we might expect this model to be able to segregate increasingly strongly correlated afferents. We now obtain a measure of the maximum strength of the correlations that this intermediate model can segregate. We assume that the model is minimal in all other respects, that is, $\Pi$, $\rho$, and $f$ are all the identity functions. This model, after ensemble averaging, is then just

$\dot{v}_i = \sigma(v_i)(C\mathbf{v})_i - v_i\, \mathbf{v}^{\mathrm{T}} C \boldsymbol{\sigma}$,  (2.49)

where $\boldsymbol{\sigma} = (\sigma(v_1), \ldots, \sigma(v_n))^{\mathrm{T}}$, with $\sigma(x) = x^a$, and $C_{ij} = \langle a_i a_j \rangle$ as usual. Although for $a \ne 0$, $\sigma(x) = x^a$ does satisfy $\sigma(0) = 0$, for $a < 1$, $\sigma'(0)$ is undefined, and hence the segregated states cannot be analyzed simply. In our analyses above, we have found that the unsegregated states are always unstable if the segregated states are strictly stable. Because we cannot examine the stability of the segregated states, we instead impose the existence of a completely unsegregated fixed-point state (i.e., $l = n$) and determine when its stability reverses. Although the presence of a stable unsegregated state does not guarantee the presence of unstable segregated states, it does give an indication of when the dynamics of the model undergo qualitative change. For simplicity, we consider just $n = 2$ afferents and expand about the unsegregated state by writing $v_i = \frac{1}{2} + \delta v_i$. Because $\delta v_1 = -\delta v_2$, we have only one independent equation, which, after linearization, is

$2^a\, \delta\dot{v}_1 = \delta v_1 \left[ a(C_{11} + C_{12}) - 2C_{12} \right]$,  (2.50)

where we have made the usual assumption that both afferents are treated in an unbiased manner, or $C_{11} = C_{22}$. The dynamics of the model change when

$\dfrac{a}{2 - a} = \dfrac{C_{12}}{C_{11}}$.  (2.51)
Consider, for concreteness, the four binary activity vectors $(0, 0)^{\mathrm{T}}$, $(0, 1)^{\mathrm{T}}$, $(1, 0)^{\mathrm{T}}$, and $(1, 1)^{\mathrm{T}}$, occurring with probabilities $p/2$, $(1 - p)/2$, $(1 - p)/2$, and $p/2$, respectively. The parameter $p \in [0, 1]$ is the probability that the two afferents' activities are equal. Then the correlation matrix is simply

$C = \dfrac{1}{2}\begin{pmatrix} 1 & p \\ p & 1 \end{pmatrix}$,  (2.52)
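Both the correlation matrix of equation 2.52 and the stability boundary of equation 2.51 are easy to confirm numerically. The sketch below (not from the article) builds $C$ from the stated ensemble and recovers the sign change of the growth rate in equation 2.50 by finite-difference linearization of the $(x^a, x)$ flow, equation 2.49:

```python
import numpy as np

# <a_i a_j> for the binary ensemble (0,0), (0,1), (1,0), (1,1) with
# probabilities p/2, (1-p)/2, (1-p)/2, p/2; should equal (1/2)[[1, p], [p, 1]].
def correlation_matrix(p):
    patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    probs = np.array([p / 2, (1 - p) / 2, (1 - p) / 2, p / 2])
    return np.einsum("k,ki,kj->ij", probs, patterns, patterns)

assert np.allclose(correlation_matrix(0.3), 0.5 * np.array([[1, 0.3], [0.3, 1]]))

# Growth rate of the antisymmetric perturbation about v = (1/2, 1/2) in the
# (x^a, x) model (2.49); its sign should flip at p = a/(2 - a) (equation 2.51).
def flow(v, C, a):
    sigma = v ** a
    return sigma * (C @ v) - v * (v @ C @ sigma)

def growth_rate(a, p, eps=1e-6):
    C = correlation_matrix(p)
    v0 = np.array([0.5, 0.5])
    dv = np.array([eps, -eps])            # the single independent mode
    return (flow(v0 + dv, C, a) - flow(v0, C, a))[0] / eps

a = 2.0 / 3.0                             # predicted critical p = a/(2 - a) = 1/2
assert growth_rate(a, 0.4) > 0            # p below threshold: unsegregated unstable
assert growth_rate(a, 0.6) < 0            # p above threshold: unsegregated stable
```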
Normalization in Hebbian Models
955
so that $C_{12}/C_{11} \equiv p$. Thus, for $p > a/(2 - a)$, the unsegregated state is strictly stable, while for $p < a/(2 - a)$, the unsegregated state is unstable. For the particular case $a = 0$, $p > 0$ guarantees stability, while for $a = 1$, $p < 1$ guarantees instability. For intermediate values of $a$, intermediate values of $p$ are obtained. In particular, for $a = \frac{2}{3}$, the critical value of $p$ is $\frac{1}{2}$. Hence, for $a > \frac{2}{3}$, the unsegregated fixed point starts to become unstable for values of $p$ corresponding to positively correlated patterns of afferent activity, and so it may become possible to segregate such afferents.

2.4.4 Special Subclasses. The form of the general model in equation 2.6 is rather cumbersome, particularly the term
$\dfrac{\sum_j f'(v_j)\, \pi(v_j, a_j)}{\sum_j f'(v_j)\, g(v_j)}$  (2.53)

on the RHS. In some subclasses, this term simplifies, the simplest being the selection $f(x) = x$ and $g(x) = x$, corresponding to the general "linear" multiplicative model already discussed. In this case, the denominator collapses to unity, and we are left with $\sum_j \pi(v_j, a_j)$. If $f(x)$ is homogeneous of degree $a + 1$, then by Euler's theorem, $x f'(x) = (a + 1) f(x)$. Thus, if $g(x) = x$, then the denominator again collapses, as it becomes $\sum_j (a + 1) f(v_j) \equiv a + 1$, by definition. Of course, for a function of one variable, the only homogeneous function of degree $a + 1$ is $f(x) = x^{a+1}$ (and multiples thereof). In this case, equation 2.6 reduces to

$\dot{v}_i = \Pi(\mathbf{a} \cdot \mathbf{v}) \left[ \pi(v_i, a_i) - v_i \sum_j v_j^a\, \pi(v_j, a_j) \right]$.  (2.54)
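The collapse of the denominator rests only on Euler's theorem for homogeneous functions; a quick numerical confirmation (a sketch, with arbitrary illustrative values):

```python
import numpy as np

# Euler's theorem for f(x) = x^(a+1):  x f'(x) = (a+1) f(x).
def f(x, a):
    return x ** (a + 1)

def fprime(x, a, h=1e-6):
    return (f(x + h, a) - f(x - h, a)) / (2 * h)   # central difference

for a in (0.0, 0.5, 1.0):
    for x in (0.2, 0.7, 1.3):
        assert abs(x * fprime(x, a) - (a + 1) * f(x, a)) < 1e-6

# Hence the denominator sum_j f'(v_j) g(v_j) of equation 2.53 with g(x) = x
# equals (a+1) sum_j f(v_j), a constant once sum_j f(v_j) is normalized.
v = np.array([0.2, 0.3, 0.5])
a = 0.5
denom = np.sum(fprime(v, a) * v)
assert abs(denom - (a + 1) * np.sum(f(v, a))) < 1e-6
```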
The general "linear" multiplicative model is the particular case $a = 0$.

3 Discussion

We have performed an analysis of a general class of Hebbian growth rules under a general form of synaptic normalization, characterized by equation 2.6, and extracted a set of conditions that guarantee that any particular model will exhibit the required fixed-point structure characterizing activity-dependent, competitive dynamics. That is, the conditions ensure the presence of strictly stable fixed points corresponding to segregated states in which all afferents but one have been eliminated, and when fixed points corresponding to unsegregated states in which two or more afferents innervate their target with equal strengths exist, these fixed points are unstable. These conditions, by construction, are independent of the size of afferent correlations and are therefore valid for all but perfectly correlated afferents.
Other than the particular functional form of the rule in equation 2.6, our major assumption in deriving these conditions was that all afferents are treated in an unbiased manner. Such an assumption corresponds to the requirement that the ensemble of afferent activity patterns is translation invariant in the space of afferents and is widely assumed in the modeling literature. Unbiased activity patterns correspond, for example, to normal patterns of vision during ocular dominance column development, in which afferents from either eye have a roughly equal chance of winning the competitive process. In contrast, biasing the patterns of activity toward one or more afferents would affect the fixed points and correspond to abnormal developmental processes such as ocular dominance column development under monocular deprivation, in which the closed eye has little chance of winning the competition. For biased inputs, it is to be expected that all segregated fixed points remain stable but that the flow toward the segregated state in which the favored input remains is more rapid than the flow toward the other segregated fixed points. If two or more inputs are equally favored, then the same analysis as above would apply to competition between this subset of afferents. Because we are interested here in normal developmental processes, we have restricted our analysis to unbiased afferent activity patterns. We have implicitly assumed that the segregated and unsegregated fixed points entirely characterize the dynamics of the models analyzed above. Such an assumption would be justified if the dynamics defined by equation 2.6 are curl free, demonstrating the existence of a Lyapunov function and thus proving convergence to a stable state (Wiskott & Sejnowski, 1998). However, in general, these dynamics are not curl free, thus admitting the possibility of limit cycles or oscillatory solutions. In general, it would be difficult to rule out such behavior.
However, numerical solutions of equation 2.6 for a range of possible models, including the minimal multiplicative model, show that these solutions are dominated by the fixed points analyzed above and that convergence to the segregated fixed points always occurs (unpublished observations). While such observations are not conclusive, they do suggest that for this class of models, oscillatory solutions are not present, or at least not generally observed. We have analyzed competitive dynamics on a pointwise basis only, at single target cells. Introducing multiple target cells coupled through lateral interactions generally renders such analyses very difficult, if not intractable. Within the context of our earlier neurotrophic model (Elliott & Shadbolt, 1998), some analysis of coupled multiple target cells has been performed (unpublished observations). Introducing a lateral interaction function reduces the maximum size of afferent correlations below which afferent segregation is possible. This is a continuous process, so that this maximum size reduces continuously as a function of the lateral interaction strength. We expect similar results within the context of a general Hebbian rule and a general form of synaptic normalization. For only weak lateral interactions,
we would expect to be able to continue to segregate strongly correlated afferents, although the maximum correlation value below which segregation is possible will be reduced a little below unity. As lateral interaction strength is increased, it should cease to be possible to segregate positively correlated afferents, with the segregation of only anticorrelated afferents being possible. For stronger lateral interactions still, afferent segregation should break down entirely. For pointwise dynamics, provided that $\sigma(0) = 0$ and $g(0) = 0$, ensuring that segregated states are fixed points and that $\sigma(x)$ grows at least as fast as $g(x)$, the required competitive dynamics are guaranteed to be present in a model if $\Pi'(x)\rho'(x) \ge 0$ for all $x$, that is, if both $\Pi$ and $\rho$ are either monotonic increasing or monotonic decreasing. While the local dynamics of models may differ, their global dynamics, by construction, are identical. In the presence of multiple target cells and interactions among them, the global dynamics of a model determine the final structure of the resulting neuronal map, while the local dynamics determine how the model reaches that state. We may therefore regard all models in which $\Pi$ and $\rho$ are both monotonic increasing or both monotonic decreasing as equivalent and interchangeable from a mathematical point of view. From a biological point of view, while the local dynamics of a model are important and can be used to resolve models by comparing their predictions against experiment, nevertheless, if, for example, two target cells possess differing but monotonic increasing postsynaptic response functions, then their responses are essentially equivalent, differing only in local detail. It is this equivalence that ensures that the global dynamics are identical.
Mathematically, we are therefore justified in restricting attention to a consideration of the simplest and most natural form of monotonic $\Pi$ and $\rho$, and these are arguably the forms $\Pi(x) = x$ and $\rho(x) = x$, the identity function in both cases. This equivalence between models having monotonic increasing or monotonic decreasing $\Pi$ and $\rho$ raises issues about the evolvability and importance of activity-dependent, competitive dynamics in neuronal systems. Under the assumption that such neuronal dynamics have selective value and indeed have been a target for selection during evolution, it is natural to ask whether these dynamics are easy or difficult to evolve. From the analysis, the resulting conditions that guarantee the presence of the required fixed-point structure for all but perfectly correlated afferents are, in fact, rather easily satisfied. Thus, it would appear that evolution could satisfy these requirements without difficulty and that activity-dependent neuronal competition should therefore be easily evolvable. But this argument is double edged, for the relative ease with which these requirements may be satisfied raises the possibility that evolution satisfied them accidentally and that activity-dependent, competitive neuronal dynamics have, or at least had, no adaptive advantage. Certainly, the importance of the neuroanatomical structures to which neuronal competition characteristically gives rise has been questioned (Purves, Riddle, & LaMantia, 1992), and we may broaden this
questioning, in the light of the ease with which our criteria may be satisfied, to ask whether activity-dependent neuronal competition is itself important to the functioning of the vertebrate nervous system. Indeed, recent experimental data challenge the very existence of activity-dependent competition during developmental processes in some systems that hitherto have been regarded as archetypal examples of such dynamics (Crowley & Katz, 1999, 2000). Nevertheless, given the apparent ubiquity of activity-dependent neuronal competition, it is tempting to dismiss this line of reasoning, but its very ubiquity may rather be an indication of the inevitability of the emergence of competition. Despite this, it remains a possibility that even if activity-dependent neuronal competition is an accidental discovery, subsequently it has been put to good use. For multiplicative normalization, for which $g(x) = x$, our criteria imply that provided $\Pi'\rho' \ge 0$, it suffices to take $\sigma(x) = x^a$, $a \ge 1$, in order to guarantee the presence of the required fixed-point structure for all but perfectly correlated afferents. Thus, the failure of multiplicative normalization in a standard Hebbian rule in which $\sigma(x) = 1$ is actually quite atypical. It thus seems odd that the possibility of expanding the standard Hebbian rule to include a more general presynaptic term has evaded attention. We believe that the reason for this pivots centrally on interpretation. Most modelers have hitherto considered synaptic plasticity in an anatomically fixed network of variable-strength synapses. In contrast, our models have sought explicitly to address anatomical plasticity: the formation of new synapses and the removal of existing ones (Elliott & Shadbolt, 1998). In our models, the variable $v_i$ denotes the (scaled) number of synapses between afferent $i$ and its target, but in anatomically fixed models, it denotes synaptic efficacy (or strength or weight).
In anatomically fixed models, application of the Hebbian rule gives the standard, well-known form, $\dot{v}_i = a_i \Pi$. In our models, however, we can regard the Hebbian rule as a change per synapse, and therefore the total change in the number of synapses is given by the standard Hebbian rule multiplied by the total number of synapses supported by an afferent: $\dot{v}_i = a_i v_i \Pi$. Thus, under an anatomical interpretation, we automatically have $\sigma(x) = x$, not $\sigma(x) = 1$. Indeed, this argument suggests that of all the possible forms for $\sigma(x)$ in the models considered above, the most natural choice, for a model of anatomical plasticity, is the linear form $\sigma(x) = x$. Given the linear form, the most natural choice for the normalizing function $g(x)$ is then also the linear form $g(x) = x$, corresponding to multiplicative normalization. As noted earlier, Swindale's model of ocular dominance column formation (Swindale, 1980) is formally identical to the minimal multiplicative model with $n = 2$ afferents. Although Swindale introduced the term $1 - v_x^2$ in an ad hoc fashion, the equivalence shows that it can be derived formally and owes its origin to the underlying multiplicative normalization. Oja's principal component analyzer (Oja, 1982) is also very similar to the minimal multiplicative model, the only difference being that in the latter, a factor
of $v_i$ multiplies both the growth and the decay terms on the RHS, while in Oja's model, it is found only on the decay term. The models discussed here could be described as winner-take-all models at the afferent cell level. The conditions for winner-take-all dynamics at the target cell level have also been studied (Grossberg, 1987, 1988; Yuille, 1989). In summary, we have performed an analysis of a general class of Hebbian models under a general form of synaptic normalization and extracted conditions that ensure that the models possess a fixed-point structure allowing the segregation of all but perfectly correlated afferents. The resulting criteria are surprisingly easily satisfied, requiring that the pre- and postsynaptic functions $\rho$ and $\Pi$ are either both monotonic increasing or both monotonic decreasing and that the function $\sigma$ grows at least as fast as the function $g$. Competitive dynamics are therefore easily achieved in an infinite class of essentially equivalent models.

Appendix: Derivation of Conditions on $\Pi$ and $\rho$

In section 2.2 we defined the matrix $D$ with components $D_{ij} = \langle \Pi(a_i)\rho(a_j) \rangle$ and, in order to guarantee the presence of strictly stable segregated fixed points, required that $D$ satisfies $D_{ii} > D_{ij}$ $\forall i,\ j \ne i$. For ensembles of afferent activity that treat the afferents in an unbiased manner, we now derive conditions on the functions $\Pi$ and $\rho$ that guarantee that these strict inequalities are satisfied. For ensemble element $\alpha$, $\alpha = 1, \ldots, m$, define $\Pi_i^\alpha = \Pi(a_i^\alpha)$ and $\rho_i^\alpha = \rho(a_i^\alpha)$; the corresponding $m$-dimensional vectors $\Pi_i = (\Pi_i^1, \ldots, \Pi_i^m)^{\mathrm{T}}$ and $\rho_i = (\rho_i^1, \ldots, \rho_i^m)^{\mathrm{T}}$ are vectors of functions of the activity of afferent $i$ over the whole ensemble. Define, as before, $A_i = (a_i^1, \ldots, a_i^m)^{\mathrm{T}}$. The requirement that $D_{ii} > D_{ij}$ $\forall i,\ j \ne i$ then becomes
$\Pi_i \cdot (\rho_i - \rho_j) > 0 \quad \forall i,\ j \ne i.$
(A.1)
For ensembles that treat the afferents in an unbiased manner, the vectors $A_i$ and $A_j$ are related, as observed earlier, through a simple permutation of elements, so that $A_j = R A_i$, where $R$ is some permutation matrix, its characteristic feature being that on any row or column, there is precisely one entry of unity and zero elsewhere. Similarly, $\Pi_j = R\Pi_i$ and $\rho_j = R\rho_i$. Thus, dropping the index $i$ and defining

$E_R = \Pi \cdot \rho - \Pi \cdot R\rho,$
(A.2)
we require conditions on $\Pi$ and $\rho$ that guarantee that $E_R > 0$ $\forall R \ne I$. We may assume, without loss of generality, that the components of $A$, $a^\alpha$, $\alpha = 1, \ldots, m$, are ordered so that $a^1 \ge a^2 \ge \cdots \ge a^m$. Suppose that $\tilde{A}$ represents an unordered vector, with $\tilde{\Pi}$ and $\tilde{\rho}$ having components $\Pi(\tilde{a}^\alpha)$ and $\rho(\tilde{a}^\alpha)$, respectively. Let $A$ represent an ordered vector, with the vectors
$\Pi$ and $\rho$ defined as before. Let the matrix $R_1$ be the permutation operator that orders the components of $\tilde{A}$, so that $A = R_1 \tilde{A}$. Then

$\tilde{\Pi} \cdot \tilde{\rho} - \tilde{\Pi} \cdot R\tilde{\rho} = \Pi \cdot R_1^{\mathrm{T}} R_1 \rho - \Pi \cdot R_1^{\mathrm{T}} R R_1 \rho$  (A.3)
$= \Pi \cdot \rho - \Pi \cdot R_2 \rho,$  (A.4)

since $R_1^{\mathrm{T}} R_1 = I$ and $R_2 = R_1^{\mathrm{T}} R R_1$ is some other permutation operator. Thus, we may assume that the components of $A$ are ordered without loss of generality provided that we can establish that $E_R > 0$ for all permutation matrices $R \ne I$. Define the new variables $p^\alpha = \Pi^\alpha - \Pi^{\alpha+1}$, $r^\alpha = \rho^\alpha - \rho^{\alpha+1}$, $\alpha = 1, \ldots, m - 1$, and $p^m = \sum_{\alpha=1}^{m} \Pi^\alpha$ and $r^m = \sum_{\alpha=1}^{m} \rho^\alpha$. The vectors $\mathbf{p} = (p^1, \ldots, p^m)^{\mathrm{T}}$ and $\mathbf{r} = (r^1, \ldots, r^m)^{\mathrm{T}}$ are related to the vectors $\Pi$ and $\rho$ through a matrix $T$ by $\mathbf{p} = T\Pi$ and $\mathbf{r} = T\rho$, where $T$ can be easily written down and has the inverse
$T^{-1} = \dfrac{1}{m}
\begin{pmatrix}
m-1 & m-2 & m-3 & \cdots & 2 & 1 & 1 \\
-1 & m-2 & m-3 & \cdots & 2 & 1 & 1 \\
-1 & -2 & m-3 & \cdots & 2 & 1 & 1 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \vdots \\
-1 & -2 & -3 & \cdots & -(m-2) & 1 & 1 \\
-1 & -2 & -3 & \cdots & -(m-2) & -(m-1) & 1
\end{pmatrix}.$  (A.5)
Writing $E_R = E_R(\Pi, \rho)$ instead in terms of these new variables, $E_R = E_R(\mathbf{p}, \mathbf{r})$, we can obtain the functional form of $E_R(\mathbf{p}, \mathbf{r})$ from $E_R(\Pi, \rho)$. Because $E_R$ is a product in the $\Pi^\alpha$'s and $\rho^\alpha$'s, it will similarly be a product in the $p^\alpha$'s and $r^\alpha$'s. Hence, we can write
$E_R = \mathbf{p} \cdot S\mathbf{r},$
(A.6)
where the elements of $S$, $S_{\alpha\beta}$, are simply given by

$S_{\alpha\beta} = \dfrac{\partial^2 E_R}{\partial p^\alpha\, \partial r^\beta},$
(A.7)
and these partial derivatives can be obtained from all the partial derivatives $\partial^2 E_R / \partial \Pi^\alpha\, \partial \rho^\beta$ by repeated application of the chain rule for partial differentiation, using $\partial \Pi^\alpha / \partial p^\beta = \partial \rho^\alpha / \partial r^\beta = (T^{-1})_{\alpha\beta}$. For the permutation matrix $R$, let the $\alpha$th row have an entry of unity in the $s_\alpha$th column and zero in all other columns. For $R$ to be a permutation matrix, the set $\{s_1, \ldots, s_m\}$ must be a permutation of the set $\{1, \ldots, m\}$. Then we have that

$E_R = \sum_{\alpha=1}^{m} \Pi^\alpha \left( \rho^\alpha - \rho^{s_\alpha} \right).$
(A.8)
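The claimed positivity of $E_R$ is easy to confirm by brute force for small $m$. The sketch below (with illustrative monotone choices of $\Pi$ and $\rho$, not from the article) checks every permutation for $m = 5$ and also cross-checks the rearranged form derived below as equation A.10 against the direct form A.8:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
m = 5
a = np.sort(rng.uniform(0, 1, m))[::-1]      # a^1 >= a^2 >= ... >= a^m
Pi = np.exp(a)                               # an arbitrary monotonic increasing Pi
rho = a ** 3 + a                             # an arbitrary monotonic increasing rho

def E_direct(s):
    # equation A.8; s[al] plays the role of s_alpha (0-based indices)
    return float(np.sum(Pi * (rho - rho[list(s)])))

def E_rearranged(s):
    # equation A.10, with the bracket of equation A.9
    total = 0.0
    for al in range(m - 1):
        for be in range(m - 1):
            count = sum(1 for mu in range(al + 1) if s[mu] <= be)
            bracket = min(al + 1, be + 1) - count
            assert bracket >= 0              # nonnegative, as shown by A.9
            total += (Pi[al] - Pi[al + 1]) * bracket * (rho[be] - rho[be + 1])
    return total

for s in itertools.permutations(range(m)):
    E = E_direct(s)
    assert abs(E - E_rearranged(s)) < 1e-12  # A.8 and A.10 agree
    if s == tuple(range(m)):
        assert abs(E) < 1e-12                # R = I gives E_R = 0
    else:
        assert E > 0                         # strictly positive otherwise
```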
It is easy to see that $\partial E_R / \partial p^m = 0$ and $\partial E_R / \partial r^m = 0$, because $\sum_\alpha (\rho^\alpha - \rho^{s_\alpha}) \equiv 0$ and $\sum_\alpha (\Pi^\alpha - \Pi^{s_\alpha}) \equiv 0$. Hence, $E_R(\mathbf{p}, \mathbf{r})$ is independent of $p^m$ and $r^m$, and so $S_{\alpha m} = 0 = S_{m\alpha}$ $\forall \alpha$. After some algebra, we can obtain the other partial derivatives in equation A.7: for $\alpha < m$ and $\beta < m$,

$\dfrac{\partial^2 E_R}{\partial p^\alpha\, \partial r^\beta} = \min\{\alpha, \beta\} - \sum_{\mu=1}^{\alpha} \sum_{\nu=1}^{\beta} \delta_{\nu s_\mu}.$  (A.9)

It is straightforward to see that $\sum_{\mu=1}^{\alpha} \sum_{\nu=1}^{\beta} \delta_{\nu s_\mu} \le \min\{\alpha, \beta\}$ by considering the two cases $\alpha < \beta$ and $\alpha \ge \beta$ separately. Hence, $\partial^2 E_R / \partial p^\alpha\, \partial r^\beta \ge 0$ $\forall \alpha, \beta$. Pulling all this together, we may finally write $E_R$ in the form
$E_R = \sum_{\alpha=1}^{m-1} \sum_{\beta=1}^{m-1} \left( \Pi^\alpha - \Pi^{\alpha+1} \right) \left[ \min\{\alpha, \beta\} - \sum_{\mu=1}^{\alpha} \sum_{\nu=1}^{\beta} \delta_{\nu s_\mu} \right] \left( \rho^\beta - \rho^{\beta+1} \right),$  (A.10)
where the term in square brackets on the RHS is nonnegative. $E_R$ is therefore guaranteed positive if $(\Pi^\alpha - \Pi^{\alpha+1})(\rho^\beta - \rho^{\beta+1}) > 0$, $\forall \alpha < m$, $\forall \beta < m$. Thus, either both $\Pi^\alpha - \Pi^{\alpha+1}$ and $\rho^\beta - \rho^{\beta+1}$ are always positive, or both are always negative. But the sequence $a^1, \ldots, a^m$ was assumed, without loss of generality, to be ordered, and therefore monotone. Hence, $E_R$ is guaranteed positive if the functions $\Pi$ and $\rho$ are either both monotonic increasing or both monotonic decreasing. Of course, a constant $\Pi$ or $\rho$ forces $E_R = 0$, but we exclude this possibility by considering only nonconstant $\Pi$ and $\rho$. A second, rather more direct, geometric proof that $E_R > 0$ can be given. If the functions $\Pi$ and $\rho$ are both monotonic increasing, and assuming without loss of generality that $a^1 \ge a^2 \ge \cdots \ge a^m$, then the vectors $\Pi$ and $\rho$ both lie in the $m$-dimensional infinite wedge defined by $x^1 \ge x^2 \ge \cdots \ge x^m \ge 0$, where the $x^\alpha$ are coordinates in $m$-dimensional Euclidean space. This wedge is constructed by taking the positive hyperquadrant defined by the hyperplanes $x^\alpha = 0$, $\forall \alpha$, and removing the regions corresponding to $x^\alpha < x^{\alpha+1}$, $\alpha = 1, \ldots, m - 1$. Thus, the hyperplanes $x^\alpha = x^{\alpha+1}$, $\alpha = 1, \ldots, m - 1$, contain the new faces of this region. Any permutation matrix $R$ can be decomposed into a product of fundamental permutation matrices, these latter matrices exchanging only adjacent pairs of components. Thus, consider the matrix $R_\alpha$, $\alpha = 1, \ldots, m - 1$, which exchanges only the $\alpha$th and $(\alpha+1)$th components of a vector. $R_\alpha$ acting on the vector $\rho$ thus serves to reflect $\rho$ in the hyperplane $x^\alpha = x^{\alpha+1}$ containing one of the faces of the wedge. Such an operation moves $\rho$ out of the wedge by the same angle, relative to the hyperplane $x^\alpha = x^{\alpha+1}$, that $\rho$ is within the wedge. Hence, $\Pi \cdot \rho > \Pi \cdot R_\alpha \rho$.
Repeated applications of further $R_\alpha$'s continue to reflect $\rho$ in these hyperplanes, either keeping it outside the wedge by the same angle, relative to some hyperplane $x^\alpha = x^{\alpha+1}$ for some $\alpha < m$, that $\rho$ is within
the wedge, or returning it within the wedge necessarily back to $\rho$. Hence, $\Pi \cdot \rho > \Pi \cdot R\rho$ for any $R \ne I$, fundamental or otherwise. A similar argument applies if $\Pi$ and $\rho$ are both monotonic decreasing. In addition to monotonicity of $\Pi$ and $\rho$ guaranteeing that $E_R$ is positive, setting $\Pi(x) = \rho(x)$ $\forall x$ also achieves this. In this case, $\Pi$ and $\rho$ may be nonmonotonic. However, the analysis of $E_O$ requires that $\Pi$ and $\rho$ are monotonic even if $\Pi = \rho$.

Acknowledgments

I thank the Royal Society for the support of a University Research Fellowship. I am indebted to the reviewers for providing many valuable suggestions for dramatically broadening the scope of the results derived here and am especially grateful for their supplying the variational proof for the existence of positive eigenvalues of symmetric matrices.

References

Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Crowley, J. C., & Katz, L. C. (1999). Development of ocular dominance columns in the absence of retinal input. Nature Neurosci., 2, 1125–1130.
Crowley, J. C., & Katz, L. C. (2000). Early development of ocular dominance columns. Science, 290, 1321–1324.
Elliott, T., & Shadbolt, N. R. (1998). Competition for neurotrophic factors: Mathematical analysis. Neural Comp., 10, 1939–1981.
Elliott, T., & Shadbolt, N. R. (2002). Multiplicative synaptic normalization and a non-linear Hebb rule underlie a neurotrophic model of competitive synaptic plasticity. Neural Comp., 14, 1311–1322.
Goodhill, G. J., & Barrow, H. G. (1994). The role of weight normalization in competitive learning. Neural Comp., 6, 255–269.
Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cog. Sci., 11, 23–63.
Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms and architectures. Neural Networks, 1, 17–61.
Harris, A. E., Ermentrout, G. B., & Small, S. L. (1997). A model of ocular dominance column development by competition for trophic support. Proc. Natl. Acad. Sci. U.S.A., 94, 9944–9949.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol., 160, 106–154.
Kohonen, T. (1995). Self-organizing maps. Heidelberg: Springer-Verlag.
McAllister, A. K., Katz, L. C., & Lo, D. C. (1999). Neurotrophins and synaptic plasticity. Annu. Rev. Neurosci., 22, 295–318.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Comp., 6, 100–126.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Math. Biol., 15, 267–273.
Purves, D., & Lichtman, J. W. (1985). Principles of neural development. Sunderland, MA: Sinauer.
Purves, D., Riddle, D. R., & LaMantia, A.-S. (1992). Iterated patterns of brain circuitry (or how the cortex gets its spots). Trends Neurosci., 15, 362–368.
Swindale, N. V. (1980). A model for the formation of ocular dominance stripes. Proc. Roy. Soc. Lond. Ser. B, 208, 243–264.
von der Malsburg, C. (1973). Self-organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100.
von der Malsburg, C., & Willshaw, D. J. (1981). Differential equations for the development of topographical nerve fibre projections. SIAM-AMS Proceedings, 13, 39–47.
Wiskott, L., & Sejnowski, T. J. (1998). Constrained optimization for neural map formation: A unifying framework for weight growth and normalization. Neural Comp., 10, 671–716.
Yuille, A. L. (1989). A winner-take-all mechanism based on presynaptic inhibition feedback. Neural Comp., 1, 334–347.

Received February 8, 2002; accepted August 22, 2002.
ARTICLE
Communicated by Roger Brockett
Estimating a State-Space Model from Point Process Observations Anne C. Smith
[email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114, U.S.A.
Emery N. Brown
[email protected] Neuroscience Statistics Research Laboratory, Department of Anesthesia and Critical Care, Massachusetts General Hospital, Boston, MA 02114, U.S.A., and Division of Health Sciences and Technology, Harvard Medical School/Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.
A widely used signal processing paradigm is the state-space model. The state-space model is defined by two equations: an observation equation that describes how the hidden state or latent process is observed and a state equation that defines the evolution of the process through time. Inspired by neurophysiology experiments in which neural spiking activity is induced by an implicit (latent) stimulus, we develop an algorithm to estimate a state-space model observed through point process measurements. We represent the latent process modulating the neural spiking activity as a gaussian autoregressive model driven by an external stimulus. Given the latent process, neural spiking activity is characterized as a general point process defined by its conditional intensity function. We develop an approximate expectation-maximization (EM) algorithm to estimate the unobservable state-space process, its parameters, and the parameters of the point process. The EM algorithm combines a point process recursive nonlinear filter algorithm, the fixed interval smoothing algorithm, and the state-space covariance algorithm to compute the complete data log likelihood efficiently. We use a Kolmogorov-Smirnov test based on the time-rescaling theorem to evaluate agreement between the model and point process data. We illustrate the model with two simulated data examples: an ensemble of Poisson neurons driven by a common stimulus and a single neuron whose conditional intensity function is approximated as a local Bernoulli process.
Neural Computation 15, 965–991 (2003)
© 2003 Massachusetts Institute of Technology
1 Introduction A widely used signal processing paradigm in many fields of science and engineering is the state-space model. The state-space model is defined by two equations: an observation equation that defines what is being measured or observed and a state equation that defines the evolution of the process through time. State-space models, also termed latent process models or hidden Markov models, have been used extensively in the analysis of continuous-valued data. For a linear gaussian observation process and a linear gaussian state equation with known parameters, the state-space estimation problem is solved using the well-known Kalman filter. Many extensions of this algorithm to both nongaussian, nonlinear state equations and nongaussian, nonlinear observation processes have been studied (Ljung & Söderström, 1987; Kay, 1988; Kitagawa & Gersch, 1996; Roweis & Ghahramani, 1999; Ghahramani, 2001). An extension that has received less attention, and the one we study here, is the case in which the observation model is a point process. This work is motivated by a data analysis problem that arises from a form of the stimulus-response experiments used in neurophysiology. In the stimulus-response experiment, a stimulus under the control of the experimenter is applied, and the response of the neural system, typically the neural spiking activity, is recorded. In many experiments, the stimulus is explicit, such as the position of a rat in its environment for hippocampal place cells (O'Keefe & Dostrovsky, 1971; Wilson & McNaughton, 1993), velocity of a moving object in the visual field of a fly H1 neuron (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991), or light stimulation for retinal ganglion cells (Berry, Warland, & Meister, 1997). In other experiments, the stimulus is implicit, such as for a monkey executing a behavioral task in response to visual cues (Riehle, Grün, Diesmann, & Aertsen, 1997) or trace conditioning in the rabbit (McEchron, Weible, & Disterhoft, 2001).
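The linear gaussian case mentioned above is solved exactly by the standard Kalman filter; a minimal scalar sketch (the model and all parameter values are illustrative assumptions, not from the article):

```python
import numpy as np

# Scalar linear gaussian state-space model and its Kalman filter:
#   state:        x_t = a x_{t-1} + w_t,   w_t ~ N(0, Q)
#   observation:  y_t = x_t + e_t,         e_t ~ N(0, R)
rng = np.random.default_rng(0)
a, Q, R, T = 0.9, 0.1, 1.0, 300

x = np.zeros(T)
y = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.normal(scale=np.sqrt(Q))
    y[t] = x[t] + rng.normal(scale=np.sqrt(R))

xhat, P = 0.0, 1.0                        # posterior mean and variance at t = 0
Ps = []
for t in range(1, T):
    xpred = a * xhat                      # predict step
    Ppred = a * a * P + Q
    K = Ppred / (Ppred + R)               # Kalman gain
    xhat = xpred + K * (y[t] - xpred)     # update step
    P = (1.0 - K) * Ppred
    Ps.append(P)
```

The posterior variance recursion is data independent and converges to the steady-state solution of the associated Riccati equation.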
The neural spiking activity in implicit stimulus experiments is frequently analyzed by binning the spikes and plotting the peristimulus time histogram (PSTH). When several neurons are recorded in parallel, cross-correlations or unitary events analysis (Riehle et al., 1997; Grün, Diesmann, Grammont, Riehle, & Aertsen, 1999) have been used to analyze synchrony and changes in firing rates. Parametric model-based statistical analysis has been performed for the explicit stimuli of hippocampal place cells using position data (Brown, Frank, Tang, Quirk, & Wilson, 1998). However, specifying a model when the stimulus is latent or implicit is more challenging. State-space models suggest an approach to developing a model-based framework for analyzing stimulus-response experiments when the stimulus is implicit. We develop an approach to estimating state-space models observed through a point process. We represent the latent (implicit) process modulating the neural spiking activity as a gaussian autoregressive model driven by an external stimulus. Given the latent process, neural spiking activity
is characterized as a general point process defined by its conditional intensity function. We will be concerned here with estimating the unobservable state or latent process, its parameters, and the parameters of the point process model. Several approaches have been taken to the problem of simultaneous state estimation and model parameter estimation, the latter being termed system identification (Roweis & Ghahramani, 1999). In this article, we present an approximate expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) to solve this simultaneous estimation problem. The approximate EM algorithm combines a point process recursive nonlinear filter algorithm, the fixed interval smoothing algorithm, and the state-space covariance algorithm to compute the complete data log likelihood efficiently. We use a Kolmogorov-Smirnov test based on the time-rescaling theorem to evaluate agreement between the model and point process data. We illustrate the algorithm with two simulated data examples: an ensemble of Poisson neurons driven by a common stimulus and a single neuron whose conditional intensity function is approximated as a local Bernoulli process.

2 Theory

2.1 Notation and the Point Process Conditional Intensity Function. Let $(0, T]$ be an observation interval during which we record the spiking activity of $C$ neurons. Let $0 < u_1^c < u_2^c$ [...] the prior distribution in the $t$th step ($t > 1$) is assumed to be of the gaussian form with the mean value being $\hat{x}_{t-1}$,

$P_t(x) = \dfrac{1}{\sqrt{2\pi}\,\tau_t} \exp\left[ -\dfrac{(x - \hat{x}_{t-1})^2}{2\tau_t^2} \right],$  (3.1)
where the gaussian width \tau_t^2 reflects the confidence in the previous estimate; its value will be determined later. The plausibility of using a gaussian prior will be justified later by the consistency of the final results (in the sense of making the best use of the available information) and by the fact that it can be easily implemented in a neural circuit (see section 4). The posterior distribution of x in the t-th step is

P_t(x \mid \mathbf{r}_t) = \frac{P(\mathbf{r}_t \mid x)\, P_t(x)}{P(\mathbf{r}_t)},   (3.2)

and the MAP solution is obtained by solving

\nabla \ln P(\hat{x}_t \mid \mathbf{r}_t) = \nabla \ln P(\mathbf{r}_t \mid \hat{x}_t) - (\hat{x}_t - \hat{x}_{t-1})/\tau_t^2 = 0,   (3.3)

where \nabla k(x) \equiv dk(x)/dx. Let us calculate the decoding accuracy step by step. In the first step, since no prior knowledge is available, MLI is used; its estimator is determined by

\hat{x}_1 = \arg\max_x \ln P(\mathbf{r}_1 \mid x).   (3.4)
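For intuition about the MAP recursion in equation 3.3, it can be worked out in a scalar gaussian toy model (our illustration; the article's setting is a population code, and the noise variance below is an assumption of this sketch). With a quadratic log likelihood, solving equation 3.3 gives a precision-weighted average of the new reading and the previous estimate, and with the optimal prior widths \tau_t^2 derived later in this section, t MAP steps reproduce a one-step MLI applied to all t readings:

```python
import numpy as np

rng = np.random.default_rng(0)

sigma2 = 0.25          # observation noise variance (assumption of this toy model)
x_true = 1.3
T = 8
r = x_true + np.sqrt(sigma2) * rng.standard_normal(T)   # one noisy reading per step

# Step 1: maximum likelihood (eq. 3.4); with a single gaussian reading, x_hat = r_1.
x_hat = r[0]
for t in range(2, T + 1):
    tau2 = sigma2 / (t - 1)    # optimal prior width, tau_t^2 = 1/[(t-1)(-grad grad ln P)]
    # Solving eq. 3.3 for a quadratic log likelihood: precision-weighted average.
    x_hat = (r[t - 1] / sigma2 + x_hat / tau2) / (1 / sigma2 + 1 / tau2)

# With the optimal widths, t steps of MAP equal one-step MLI on all t readings:
assert abs(x_hat - r.mean()) < 1e-9
```

The recursion here is x̂_t = ((t−1) x̂_{t−1} + r_t)/t, i.e., a running mean, which is why the final assertion holds.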
The solution of MLI has been analyzed in detail in the literature for different correlation structures (Wu et al., 2002). It is known that when the correlation covers a wide range of the neural population, MLI is not asymptotically, or quasi-asymptotically (for an unfaithful model), efficient. In these cases, the decoding error of MLI follows a Cauchy-type distribution (whose variance does not exist) instead of a gaussian one. How to choose a suitable performance measure in such situations is an open question and will be addressed in a future publication. In this study, since our focus is on Bayesian decoding, we consider only cases in which MLI is asymptotically or quasi-asymptotically efficient, and we use the variance as the performance measure.^3 Thus, according to Wu et al. (2001, 2002), the variance of the decoding error of MLI, in the large N limit, can be written as

\omega_1^2 = \frac{\langle (\nabla \ln P(\mathbf{r}_1 \mid x))^2 \rangle}{\langle -\nabla\nabla \ln P(\mathbf{r}_1 \mid x) \rangle^2},   (3.5)

^3 Details of when MLI is asymptotically or quasi-asymptotically efficient are provided in Wu et al. (2002).
Sequential Bayesian Decoding
where the average is over the true distribution Q(\mathbf{r}_1 \mid x). We also note that for the correlation structures considered here, that is, those ensuring the efficiency of MLI, -\nabla\nabla \ln P(\mathbf{r} \mid x) can be adequately approximated by a (positive) constant, according to the law of large numbers (Wu et al., 2001, 2002). This property is used in the calculation below.

Let us calculate the decoding error in the second step. Suppose \hat{x}_2 is close enough to x. Expanding \nabla \ln P(\mathbf{r}_2 \mid \hat{x}_2) around x in equation 3.3, we obtain

\nabla \ln P(\mathbf{r}_2 \mid x) + \nabla\nabla \ln P(\mathbf{r}_2 \mid x)(\hat{x}_2 - x) - (\hat{x}_2 - \hat{x}_1)/\tau_2^2 = 0,   (3.6)

where terms of order (\hat{x}_2 - x)^2 and higher are neglected. The variable \hat{x}_1 can be decomposed as \hat{x}_1 = x + \epsilon_1, with \epsilon_1 a random number (varying from trial to trial) drawn from the gaussian distribution of zero mean and variance \omega_1^2. With this notation, equation 3.6 becomes

\hat{x}_2 - x = \frac{\nabla \ln P(\mathbf{r}_2 \mid x) + \epsilon_1/\tau_2^2}{1/\tau_2^2 - \nabla\nabla \ln P(\mathbf{r}_2 \mid x)}.   (3.7)

We define a constant,

\alpha_2 = \tau_2^2 \left( -\nabla\nabla \ln P(\mathbf{r}_2 \mid x) \right),   (3.8)

and a random variable,

R = \frac{\nabla \ln P(\mathbf{r}_2 \mid x)}{-\nabla\nabla \ln P(\mathbf{r}_2 \mid x)}.   (3.9)

It is easy to check that R follows a gaussian distribution of zero mean (for the unbiased estimator considered here) and variance \omega_1^2. Combining equations 3.7 through 3.9, we get

\hat{x}_2 - x = \frac{\alpha_2 R + \epsilon_1}{1 + \alpha_2},   (3.10)
whose variance is

\omega_2^2 = \frac{1 + \alpha_2^2}{(1 + \alpha_2)^2}\, \omega_1^2,   (3.11)

where the independence of \epsilon_1 and R, which holds because \mathbf{r}_1 and \mathbf{r}_2 are sampled independently, has been used. Since (1 + \alpha_2^2)/(1 + \alpha_2)^2 < 1 holds for any positive \alpha_2, the decoding accuracy in the second step is guaranteed to improve. The minimum value of \omega_2^2 is

\omega_2^2 = \frac{1}{2}\, \omega_1^2,   (3.12)
S. Wu, D. Chen, M. Niranjan, and S. Amari

attained when \alpha_2 = 1, that is, when the optimal value of \tau_2^2 is

\tau_2^2 = \frac{1}{-\nabla\nabla \ln P(\mathbf{r}_2 \mid x)}.   (3.13)
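The optimization behind equations 3.11 through 3.13, and its general-t counterpart stated below (equations 3.14, 3.15, and A.4), can be checked numerically. This is our own sanity check, not part of the article; the second loop assumes \omega_t^2 = \omega_1^2/t already holds at step t, which is the inductive hypothesis of the appendix:

```python
import numpy as np

alpha = np.linspace(0.01, 10, 100000)

# Step 2 (eq. 3.11): the factor multiplying omega_1^2.
gain2 = (1 + alpha**2) / (1 + alpha)**2
i = np.argmin(gain2)
assert abs(alpha[i] - 1.0) < 1e-3     # optimum at alpha_2 = 1
assert abs(gain2[i] - 0.5) < 1e-6     # omega_2^2 = omega_1^2 / 2 (eq. 3.12)

# General step: assuming omega_t^2 = omega_1^2 / t, minimize the next-step factor.
for t in range(1, 6):
    gain = (1.0 / t + alpha**2) / (1 + alpha)**2
    j = np.argmin(gain)
    assert abs(alpha[j] - 1.0 / t) < 1e-3        # optimal alpha_{t+1} = 1 / t
    assert abs(gain[j] - 1.0 / (t + 1)) < 1e-6   # omega_{t+1}^2 = omega_1^2 / (t + 1)
```

The closed-form check is immediate: the derivative of (1/t + \alpha^2)/(1 + \alpha)^2 is 2(\alpha - 1/t)/(1 + \alpha)^3, so the minimum sits at \alpha = 1/t with value 1/(t + 1).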
Note that although the random variable \mathbf{r}_2 appears in equation 3.13, \tau_2^2 is approximately a constant, as explained above. When a faithful model is used, P(\mathbf{r} \mid x) = Q(\mathbf{r} \mid x), and -\nabla\nabla \ln Q(\mathbf{r} \mid x) is approximately equal to the Fisher information for the correlation structures we consider; \tau_2^2 is hence the variance of the decoding error in the first step. Following the same idea, the decoding accuracy of SBD at any step t, for t > 2, is calculated to be (see the appendix)

\omega_t^2 = \frac{\omega_{t-1}^2 + \alpha_t^2 \omega_1^2}{(1 + \alpha_t)^2},   (3.14)

whose value can be obtained iteratively starting from step 1. The minimum value of \omega_t^2 is

\omega_t^2 = \omega_1^2 / t,   (3.15)

attained when \alpha_t = 1/(t-1) or, equivalently, \tau_t^2 = 1/[(t-1)(-\nabla\nabla \ln P)].

Remark. It is interesting that SBD, when the optimal values of \{\tau_t^2\} are used, achieves the same decoding accuracy as an MLI that uses all the neural signals entering SBD in one step.^4 Indeed, this is the best result any unbiased estimator can achieve (in the case of a faithful model, it is the Cramér-Rao bound). However, SBD is not a simple replacement for this one-step MLI and has at least two computationally desirable properties. One is that it saves memory: in SBD, only the single value of the immediately previous estimate needs to be remembered, whereas the one-step MLI, to achieve the same accuracy, must store M \times N signals for M steps. The other is that SBD adapts more readily to a changing environment. Both advantages are valuable for information processing in the brain.

4 Network Implementation of SBD

So far, the performance of SBD has been analyzed theoretically. This section investigates its implementation in a neural circuit, in the sense that all

^4 For SBD of M steps, M \times N neuronal signals are used, where N is the number of neurons. A one-step MLI can achieve the same decoding accuracy if all these signals are available. The trick is to average the M signals associated with each neuron, obtaining a new firing-rate value whose noise level is decreased to 1/\sqrt{M}.
operations described by equation 2.5 for MLI and equation 2.4 for MAP are performed by the dynamics of a biologically plausible recurrent network. This study explores the possibility, from a mechanistic point of view, that the brain could use SBD. Since the new contribution of this work relates to the network implementation of MAP, without loss of generality, we use a model as simple as possible for MLI. Generalization to more complicated cases is straightforward. The encoding process we consider is one in which the fluctuations in neuronal activities are independent gaussian noises, that is,

Q(\mathbf{r} \mid x) = \frac{1}{Z} \exp\left[ -\frac{\rho}{2\sigma^2} \int (r_c - f_c(x))^2 \, dc \right],   (4.1)

where \rho is the neuronal density and Z the normalization factor, and the tuning function is gaussian,

f_c(x) = \frac{1}{\sqrt{2\pi}\, a} \exp\left[ -\frac{(c - x)^2}{2a^2} \right].   (4.2)
Also, a faithful model is used in all steps of decoding:

P(\mathbf{r} \mid x) = Q(\mathbf{r} \mid x).   (4.3)
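As a numerical sanity check (ours, not the article's), this gaussian encoding model can be simulated directly: the step-1 MLI error variance can be compared with the large-N expression from equation 3.5, which for independent noise reduces to \sigma^2 / \sum_c f'_c(x)^2, and a second MAP step with the optimal prior width can be checked to halve it, as in equations 3.12 and 3.13. The grid search and parameter values below are choices of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

a, sigma2, x_true = 1.0, 0.0025, 0.0
c = np.linspace(-3, 3, 101)                        # preferred stimuli
grid = np.linspace(-0.5, 0.5, 1001)                # candidate stimulus values

def f(x):                                          # gaussian tuning function, eq. 4.2
    return np.exp(-(c - x) ** 2 / (2 * a**2)) / np.sqrt(2 * np.pi * a**2)

templates = np.stack([f(x) for x in grid])         # (1001, 101)

# eq. 3.5 for this independent-noise model: omega_1^2 = sigma^2 / sum_c f'_c(x)^2.
fprime = -(c - x_true) / a**2 * f(x_true)
omega1_theory = sigma2 / np.sum(fprime**2)
tau2 = omega1_theory     # optimal prior width = first-step error variance (eq. 3.13)

x1, x2 = [], []
for _ in range(1500):
    r1 = f(x_true) + np.sqrt(sigma2) * rng.standard_normal(c.size)
    r2 = f(x_true) + np.sqrt(sigma2) * rng.standard_normal(c.size)
    ll1 = -np.sum((r1 - templates) ** 2, axis=1) / (2 * sigma2)   # log likelihood
    xh1 = grid[np.argmax(ll1)]                                    # step 1: MLI, eq. 3.4
    ll2 = -np.sum((r2 - templates) ** 2, axis=1) / (2 * sigma2) \
          - (grid - xh1) ** 2 / (2 * tau2)                        # step 2: MAP prior term
    xh2 = grid[np.argmax(ll2)]
    x1.append(xh1)
    x2.append(xh2)

assert 0.75 < np.var(x1) / omega1_theory < 1.3    # eq. 3.5 holds empirically
assert 0.38 < np.var(x2) / np.var(x1) < 0.65      # eq. 3.12: variance roughly halves
```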
4.1 Implementation of MLI. For the above model setting, MLI in the first step solves

\hat{x}_1 = \arg\max_x \int r_c f_c(x) \, dc,   (4.4)

where the condition \int f_c(x)^2 \, dc = \mathrm{const} has been used. The network implementation of MLI can be intuitively understood as a template-matching process. The template is the tuning function, f_c(x), and the pattern to be matched is the noisy population activity, \{r_c\}. The matching process adjusts the position of the template (the value of x, that is, the peak of the template) so that its overlap with \{r_c\} is largest. The final position of the template gives the estimate. We next identify three elements that are essential in the network design:

Requirement 1: The steady state of the network must have the same shape as the tuning function, in order to generate the correct template for matching.

Requirement 2: When no stimulus is present, the steady states of the network should be neutrally stable on a one-dimensional attractor (hereafter called the line attractor), parameterized by all possible stimulus values. This keeps the network ready to decode the whole range of stimuli. A point attractor does not work in this case.
Requirement 3: An external input of suitable form that contains the stimulus information drifts the template to the position of maximum overlap with the noisy population activity.

Taking these three requirements into account, the following network dynamics is constructed. Let U_c denote the (average) internal state of the neurons at c, and W_{c,c'} the recurrent connection weight from neurons at c to those at c'. The neural excitation is governed by

\frac{dU_c}{dt} = -U_c + \int W_{c,c'} O_{c'} \, dc' + I_c,   (4.5)

where

O_c = \frac{U_c^2}{1 + \mu \int U_{c'}^2 \, dc'}   (4.6)

is the activity of the neurons at c and I_c is the external input arriving at c. The recurrent interactions in the first step are chosen to be

W_{c,c'}(1) = \exp\left[ -\frac{(c - c')^2}{2a^2} \right],   (4.7)

which ensures that when there is no external input (I_c = 0), the network has a one-parameter family of nontrivial steady states,

\tilde{O}_c(z) = D_1 \exp\left[ -\frac{(c - z)^2}{2a^2} \right],   (4.8)

with z a free parameter, and the corresponding internal state is

\tilde{U}_c = D_2 \exp\left[ -\frac{(c - z)^2}{4a^2} \right],   (4.9)
where the coefficients D_1 and D_2 are easily determined. The variable z denotes the peak position of the steady state \{\tilde{O}_c\}. Equations 4.8 and 4.9, parameterized by z, define a line attractor for the network, on which the system is neutrally stable. This follows from the translation invariance of the recurrent interactions. Note that the line attractor has the same shape as the tuning function; therefore, requirements 1 and 2 are satisfied. When the external input I_c is included, the network typically no longer has these good properties. However, if I_c is sufficiently small, the steady state of the network has approximately the same shape as equation 4.8 (the deviation is of second order in I_c), and its position is determined by the maximum overlap between I_c and \tilde{O}_c(z) (Pouget & Zhang, 1997; Wu et al., 2002).^5 This property is used to guarantee requirement 3.

^5 In this study, we consider I_c to be a small, constant flow. Alternatively, as in Pouget et al. (1998) and Deneve et al. (1999), I_c can be treated as transient, triggering the
Thus, if I_c = \epsilon r_c, with \epsilon a sufficiently small positive number, the peak position of the steady state, that is, the network estimate, is

\hat{z}_1 = \arg\max_z \int r_c \tilde{O}_c(z) \, dc,   (4.10)

which has the same form as equation 4.4. We say that the network implements MLI.

4.2 Implementation of MAP. Let us study the implementation of MAP in the second step. From equations 2.4, 3.1, and 4.1, the MAP solution is written as

\hat{x}_2 = \arg\max_x \left[ \int r_c f_c(x) \, dc - \frac{(x - \hat{x}_1)^2}{2\tau_2^2} \right].   (4.11)

Compared with equation 4.4, equation 4.11 has one more term, corresponding to the contribution of the prior knowledge. Therefore, in order to implement MAP, one more requirement is needed in addition to the three above:

Requirement 4: There is a mechanism that propagates the prior knowledge obtained from the previous decoding step to the current one.

We find that this requirement can be satisfied through short-term modification of the recurrent weights according to the Hebbian learning rule. After the first step of decoding, the recurrent interaction has changed by a small amount,

W_{c,c'}(2) \approx W_{c,c'}(1) + \eta\, \tilde{O}_c(\hat{z}_1) \tilde{O}_{c'}(\hat{z}_1),   (4.12)
where \eta is a small, positive number representing the Hebbian learning rate, and \tilde{O}_c(\hat{z}_1) is the neural activity in the first step. A justification of this formula is given in section 4.3. With the modified recurrent interactions, the net input from other neurons to the neuron at c is

\int W_{c,c'}(2) O_{c'} \, dc' = \int W_{c,c'}(1) O_{c'} \, dc' + \eta\, \tilde{O}_c(\hat{z}_1) \int \tilde{O}_{c'}(\hat{z}_1) O_{c'} \, dc'
\approx \int W_{c,c'}(1) O_{c'} \, dc' + \nu\, \tilde{O}_c(\hat{z}_1),   (4.13)
network to be at a noisy state \{r_c\} initially; after relaxation, the network reaches the desired state. Both scenarios are likely to be approximations of the real situation (made for mathematical tractability), in which I_c is strong initially and subsequently decays to zero.
where \nu is a small constant. To obtain this approximation, the following facts have been used: (1) the initial state of the neurons in the second decoding step is \tilde{O}_c(\hat{z}_1); (2) the neural activity O_c during the decoding process lies between \tilde{O}_c(\hat{z}_1) and \tilde{O}_c(\hat{z}_2), where \hat{z}_2 is the steady peak position in the second step; and (3) (\hat{z}_1 - \hat{z}_2)^2 / 2a^2 \ll 1, since neurons are widely tuned, as seen in data (i.e., a is large), whereas two consecutive estimates are close to each other. These facts ensure that the approximation \int \tilde{O}_{c'}(\hat{z}_1) O_{c'} \, dc' \approx \mathrm{const} is good enough.

Substituting equation 4.13 into 4.5, we see that, in effect, we can treat the network weights as unchanged while the external input becomes I'_c = \epsilon r_c + \nu \tilde{O}_c(\hat{z}_1). Therefore, the decoding result in the second step is obtained by maximizing the overlap between I'_c and \tilde{O}_c(z), which gives

\hat{z}_2 = \arg\max_z \left[ \int r_c \tilde{O}_c(z) \, dc + \frac{\nu}{\epsilon} \int \tilde{O}_c(\hat{z}_1) \tilde{O}_c(z) \, dc \right].   (4.14)

The first term on the right-hand side of equation 4.14 corresponds to MLI. Let us now look at the contribution of the second term, which is transformed to

\int \tilde{O}_c(\hat{z}_1) \tilde{O}_c(z) \, dc = B \exp\left[ -\frac{(\hat{z}_1 - z)^2}{4a^2} \right]
\approx -B \frac{(z - \hat{z}_1)^2}{4a^2} + \text{terms not depending on } z,   (4.15)
where B is a constant. In this calculation, the condition (\hat{z}_1 - z)^2 / 4a^2 \ll 1 is used for the same reasons given above. Comparing equations 4.11 and 4.15, we see that the second term in equation 4.15 plays exactly the same role as the prior knowledge in MAP. Thus, the network implements MAP. The value of the Hebbian learning rate \eta can be adjusted to match the optimal choice of \tau_2^2 given by the theory.

4.3 Implementation of Sequential Decoding. Following the same idea as in the previous section, we can implement MAP at any step t > 1 by modifying the network weights according to

W_{c,c'}(t) = W_{c,c'}(1) + \eta_{t-1}\, \tilde{O}_c(\hat{z}_{t-1}) \tilde{O}_{c'}(\hat{z}_{t-1}),   (4.16)
where \eta_{t-1} is the learning rate during step t - 1, whose value can be suitably chosen to match the optimal \tau_t^2. It is easy to check that equation 4.16 propagates prior knowledge of the form of equation 3.1. The modification rule, equation 4.16, can be understood as an approximation of a continuous process of Hebbian learning. Let us consider a
simple dynamical picture. The network dynamics in each step of SBD is decomposed into two phases: reading out and learning. In the reading-out phase, as described in sections 4.1 and 4.2, the network weights are approximately unchanged, and the neurons interact with each other, evolving to the steady state. In the learning phase, by contrast, the neural state is approximately fixed, and the network weights change according to the Hebbian learning rule. This two-stage approximation can essentially be achieved by properly balancing the convergence speeds of the neural state and the network weights through the choice of time constants. In the learning phase, we consider the dynamics of W_{c,c'} in step t - 1 to be governed by

\frac{dW_{c,c'}}{dt} = -\lambda_1 \left( W_{c,c'} - W_{c,c'}(1) \right) + \lambda_2\, \tilde{O}_c(\hat{z}_{t-1}) \tilde{O}_{c'}(\hat{z}_{t-1}),   (4.17)

where the constant \lambda_1 determines how fast W_{c,c'} forgets its history, and the constant \lambda_2 = \lambda_1 \eta_{t-1} reflects how quickly it adapts to the new neural state. The solution of equation 4.17 is

W_{c,c'}(t) = W_{c,c'}(1) + \eta_{t-1}\, \tilde{O}_c(\hat{z}_{t-1}) \tilde{O}_{c'}(\hat{z}_{t-1}) + \left[ W_{c,c'}(t-1) - W_{c,c'}(1) - \eta_{t-1}\, \tilde{O}_c(\hat{z}_{t-1}) \tilde{O}_{c'}(\hat{z}_{t-1}) \right] e^{-\lambda_1 T},   (4.18)
where T is the time duration of one decoding step. Therefore, if \lambda_1 T is large enough (i.e., \lambda_1 or T is sufficiently large), the bracketed term in W_{c,c'}(t) can be neglected, and we recover equation 4.16. This condition may not be perfectly satisfied in reality, and a deviation from it will result in prior knowledge from more than one previous step being used in MAP, leading to suboptimal performance. Here, for simplicity, we assume that equation 4.16 holds exactly.

5 Simulation Experiments

In this section, simulations are carried out to confirm the theoretical analysis. Two scenarios are considered: in one, the stimulus value is fixed; in the other, the stimulus changes slowly over time.

5.1 Constant Stimulus. The simulation experiment uses 101 neurons uniformly distributed in the region [-3, 3], with the true stimulus being zero. Figure 1 shows typical states of the network before and after decoding. The steady state of the network becomes smooth after the noise in the initial state is cleaned up, and it has the same shape as the tuning function. Figure 2 compares the minimum decoding errors of the network with those predicted by the theory in section 3. Here, the learning rates are optimally adjusted. Figure 2 shows that the estimation error decreases with decoding steps and agrees well with the theoretical analysis.
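A minimal version of this constant-stimulus simulation can be sketched with a discretized form of equations 4.5 through 4.7, sums over 101 neurons replacing the integrals. The initial condition, integration scheme, and tolerance below are choices of this sketch rather than details given in the text:

```python
import numpy as np

rng = np.random.default_rng(3)

# Discretized network, eqs. 4.5-4.7: 101 neurons on [-3, 3], sums replacing integrals.
N, a, mu, eps, sigma2 = 101, 1.0, 1.0, 0.01, 0.0025
c = np.linspace(-3, 3, N)
W = np.exp(-(c[:, None] - c[None, :]) ** 2 / (2 * a**2))    # eq. 4.7

f = np.exp(-c**2 / (2 * a**2)) / np.sqrt(2 * np.pi * a**2)  # tuning curves at x = 0
r = f + np.sqrt(sigma2) * rng.standard_normal(N)            # noisy population response
I = eps * r                                                 # small external input

# Start from a noisy bump, as in Figure 1; Euler-integrate eq. 4.5 to the steady state.
U = np.exp(-c**2 / (4 * a**2)) + 0.05 * rng.standard_normal(N)
dt = 0.05
for _ in range(4000):
    O = U**2 / (1 + mu * np.sum(U**2))                      # divisive normalization, eq. 4.6
    U += dt * (-U + W @ O + I)

O = U**2 / (1 + mu * np.sum(U**2))
z_hat = c[np.argmax(O)]        # peak of the smooth steady bump = network estimate
assert abs(z_hat) < 0.3        # near the true stimulus x = 0
```

As in Figure 1, the relaxed activity profile is a smooth bump with the shape of the tuning function, and its peak position serves as the decoded stimulus.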
[Figure 1: Population activity versus preferred stimulus, showing the initial state, the steady state, and the tuning function.]

Figure 1: Typical states of the recurrent network before and after decoding. The tuning function (i.e., the mean population activity) is also shown in order to compare its shape with that of the steady state (proportionally scaled). The parameters are a = 1, \mu = 1, \sigma^2 = 0.0025, \epsilon = 0.01, and T = 10, where T is the evolution time of the network.
5.2 Time-Varying Stimulus. This part of the simulation investigates the case in which the stimulus changes slowly during decoding. At each decoding step, the stimulus increases by a small amount, \Delta x, compared with the previous step. At step t, the stimulus value is (t-1)\Delta x (zero in step 1), and the neural population involved in decoding consists of the 101 neurons uniformly distributed in the region [-3 + (t-1)\Delta x,\ 3 + (t-1)\Delta x] (only neurons that are active enough contribute to decoding). Because of the movement of the stimulus, the previous estimate differs on average from the current value, the difference being the amount of change between two consecutive steps. This implies that the prior knowledge is biased. However, if the change of the stimulus is small enough, it is conceivable that the previous estimate still provides useful information, even though the gain from the prior knowledge cannot be as large as when the stimulus is fixed.

A justification can also be given from the viewpoint of the network implementation. When the stimulus changes, the population of neurons involved in decoding changes accordingly, whereas the Hebbian learning that happened in the previous step modified the network weights of the old population. If the two populations of neurons overlap a lot,
[Figure 2: Decoding accuracy versus decoding step, for the network and the theory.]

Figure 2: Comparing the minimum decoding error of the network with that predicted by the theory. The learning rates at each step are \eta_1 = 100, \eta_2 = 200, \eta_3 = 300, and \eta_4 = 400. Other parameters are a = 1, \mu = 1, \sigma^2 = 0.0004, \epsilon = 0.1, and T = 20.
corresponding to the small change of the stimulus, it is understandable that the previous learning still has a positive effect. Without loss of generality, we assume that there is no prior knowledge of the motion of the stimulus, and hence the same form of gaussian prior is used. Also, since the prior knowledge is biased in this case, the MAP decoding is in general biased as well. The variance is then no longer a suitable measure of decoding accuracy, and we use the mean squared error, \langle (\hat{x} - x)^2 \rangle, instead.

In Figure 3, the network performance is compared with that of MLI. The network outperforms MLI in the first three steps; after that, its error jumps abruptly to a high level. This is understandable. Since the estimator wrongly assumes that the stimulus is stationary, its confidence in the previous estimate is built up step by step. This is reflected in the decreasing width of the gaussian prior or, equivalently, the increasing Hebbian learning rate. As a result, the estimate depends more and more on the biased prior knowledge, and eventually the error blows up. In practice, this could be overcome either by restarting SBD after several steps or by using information about the stimulus change (it is reasonable to believe that the brain can sense this change when it is large enough).
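The bias-variance trade-off just described can be captured in a back-of-envelope model (ours, not the article's network): if the optimal stationary-prior weights are used, the estimate is effectively the running mean of the single-step readings (as in section 3), so its variance falls as \sigma^2/t while the squared bias grows as [\Delta x (t-1)/2]^2. The value \sigma^2 = 0.01 below is illustrative; \Delta x = 0.06 matches Figure 3:

```python
# Analytic MSE of SBD under a stationarity-assuming prior, with a drifting stimulus.
sigma2 = 0.01        # single-step decoding error variance (illustrative assumption)
dx = 0.06            # per-step stimulus drift, as in Figure 3

def mse(t):
    # Running mean of t readings centered at 0, dx, ..., (t-1)*dx, judged against
    # the current stimulus (t-1)*dx: variance sigma2/t plus squared bias dx*(t-1)/2.
    return sigma2 / t + (dx * (t - 1) / 2) ** 2

assert mse(2) < mse(1)    # early steps: the prior still helps
assert mse(10) > mse(1)   # later steps: the biased prior makes the error blow up
```

This reproduces the qualitative shape of Figure 3: an initial improvement over MLI followed by a sharp deterioration once the accumulated bias dominates.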
[Figure 3: Decoding accuracy versus decoding step, for the network and MLI.]

Figure 3: Comparing the performance of the network-implemented SBD with that of MLI using no prior knowledge. The learning rates are \eta_1 = 100, \eta_2 = 200, and \eta_3 = 300; other parameters are a = 1, \mu = 1, \sigma^2 = 0.001, \epsilon = 0.1, T = 20, and \Delta x = 0.06.
6 Conclusion and Discussion

This study investigates, in the framework of population coding, a sequential Bayesian decoding paradigm called SBD. It consists of a sequence of decoding steps, with the prior knowledge at each step being a gaussian function of the estimate from the previous one. Two aspects of SBD have been studied. First, we analyzed the performance of SBD, obtaining the optimal form of the gaussian prior that best (fully) uses the available information. Second, we investigated its implementation in a biologically plausible recurrent network, in the sense that all decoding operations are performed by the network dynamics. The simulation results confirm the analysis.

It is shown that the minimum variance of SBD at step t is 1/t of its value in the first step. This may not appear to be a large improvement at first glance and may invite skepticism about the practical value of SBD. However, we argue that the same data can be seen in a different light. Typically, the decoding error of SBD has a gaussian distribution. Either the variance or the survival probability (i.e., the probability of the error being larger than a threshold) can serve as a performance measure. In practice, however, the survival probability seems more functionally meaningful,
as it counts the chance of making a wrong decision in a discrimination-related task, instead of registering the possibly tiny difference between two estimates, as the variance does. Suppose the threshold is chosen to be 3 (a different value does not change the conclusion): the survival probability is 0.1336 for variance 4 and 0.0027 for variance 1. Thus, the probability of an error larger than 3 falls to about 2 percent of its initial value as the variance decreases from 4 to 1. This is achieved after applying SBD for only four steps (see Figure 2).

The line attractor and Hebbian learning are two critical elements of the network design. The former enables the network to perform template matching with the tuning function as the template. This template-matching operation is the same as that in MLI and MAP. The line-attractor property comes from the translational invariance of the network interactions and has been shown to be involved in several brain functions (Amari, 1977; Seung, 1996; Zhang, 1996; Deneve, Latham, & Pouget, 2002). Hebbian learning is the mechanism that propagates prior knowledge between decoding steps. One may question the feasibility of Hebbian learning on the short timescale required by SBD. We believe this is not a problem, considering that synaptic modification can happen on a timescale of milliseconds (Bi & Poo, 1998) and that only very weak modifications are needed in each step. Indeed, this work reveals that short-term plasticity of the recurrent interactions can play an active role in information processing, a point that so far has not been well studied.

To illustrate the network implementation of SBD, we have studied only the simplest case, in which neural activities are independent in the encoding process and a faithful model is used in the decoding. Generalization to more complicated cases, such as correlated responses and unfaithful decoding, is straightforward.
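The quoted survival probabilities follow directly from the gaussian tail and can be reproduced with the complementary error function:

```python
import math

def survival(threshold, variance):
    # P(|error| > threshold) for a zero-mean gaussian error.
    return math.erfc(threshold / math.sqrt(2 * variance))

p4 = survival(3.0, 4.0)   # first-step error variance 4
p1 = survival(3.0, 1.0)   # variance 1, reached after four SBD steps

assert abs(p4 - 0.1336) < 5e-4
assert abs(p1 - 0.0027) < 5e-4
assert 0.019 < p1 / p4 < 0.021   # the error probability falls to ~2% of its initial value
```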
This is evident from the observation that the implementation of MLI and the propagation of prior knowledge are mechanistically separate. It implies that the same mechanism for propagating prior knowledge always applies, provided that the recurrent network implements MLI in the first step. For instance, the network model in Deneve et al. (1999) achieves unfaithful MLI when the biologically more plausible multiplicative correlation structure is considered in the encoding process. This model can easily be generalized to implement SBD by including weight modification according to equation 4.16.

There are nevertheless incomplete parts in this study. For example, in order to match the theoretical analysis of MAP, we consider a discrete procedure for network weight modification, without carefully analyzing its implementation in the continuous time domain. Moreover, the network model we use is rather simple and does not distinguish excitatory and inhibitory neurons or presynaptic and postsynaptic connections. In order to fully understand the role of Hebbian learning in Bayesian inference, more detailed modeling work is needed. We are working in this direction.
Appendix: Decoding Accuracy of SBD

The decoding accuracy of SBD at any step t, for t > 2, can be calculated similarly to that in step 2. As in equation 3.7, we obtain

\hat{x}_t - x = \frac{\alpha_t R_t + \epsilon_{t-1}}{1 + \alpha_t},   (A.1)

where \alpha_t = \tau_t^2 (-\nabla\nabla \ln P(\mathbf{r}_t \mid x)), R_t = \nabla \ln P(\mathbf{r}_t \mid x) / (-\nabla\nabla \ln P(\mathbf{r}_t \mid x)), and \epsilon_{t-1} = \hat{x}_{t-1} - x. R_t and \epsilon_{t-1} are independent random variables following gaussian distributions of zero mean and variances \omega_1^2 and \omega_{t-1}^2, respectively. The variance of \hat{x}_t is then

\omega_t^2 = \frac{\omega_{t-1}^2 + \alpha_t^2 \omega_1^2}{(1 + \alpha_t)^2},   (A.2)

whose value can be calculated iteratively starting from step 2. The minimum of \omega_t^2 can be obtained inductively. Assume that the minimum variance at step t is

\omega_t^2 = \omega_1^2 / t,   (A.3)
attained when \alpha_t = 1/(t-1) or, equivalently, \tau_t^2 = 1/[(t-1)(-\nabla\nabla \ln P)]. This holds for t = 2, as proved in section 3. At step t + 1, we have

\omega_{t+1}^2 = \frac{\omega_t^2 + \alpha_{t+1}^2 \omega_1^2}{(1 + \alpha_{t+1})^2} = \frac{1/t + \alpha_{t+1}^2}{(1 + \alpha_{t+1})^2}\, \omega_1^2,   (A.4)
whose minimum value is \omega_{t+1}^2 = \omega_1^2 / (t + 1), attained when \alpha_{t+1} = 1/t. Thus, the assumption is confirmed.

Acknowledgments

We thank Peter Dayan and Kechen Zhang for their constructive discussions on the network dynamics and one of the anonymous reviewers for pointing out mistakes in our original manuscript. We also thank the two anonymous reviewers for their valuable comments when this work was first submitted to Advances in Neural Information Processing Systems.
References

Amari, S. (1977). Dynamics of pattern formation in lateral-inhibition type neural fields. Biological Cybernetics, 27, 77–87.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Computation, 4, 196–210.
Barlow, H. (2001). Redundancy reduction revisited. Network: Computation in Neural Systems, 12, 241–253.
Bi, G., & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Dayan, P., & Yu, A. (2002). ACh, uncertainty, and cortical inference. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Deneve, S., Latham, P., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nature Neurosci., 2, 740–745.
Deneve, S., Latham, P., & Pouget, A. (2002). Efficient computation and cue integration with noisy population codes. Nature Neurosci., 4, 826–831.
Georgopoulos, A. P., Kalaska, J. F., Caminiti, R., & Massey, J. T. (1982). On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex. J. Neurosci., 2, 1527–1537.
Knill, D. C., & Richards, W. (1996). Perception as Bayesian inference. Cambridge: Cambridge University Press.
Maunsell, J. H. R., & Van Essen, D. C. (1983). Functional properties of neurons in middle temporal visual area of the macaque monkey. I. Selectivity for stimulus direction, speed, and orientation. J. Neurophysiology, 49, 1127–1147.
Mumford, D. (1992). On the computational architecture of the neocortex. II. The role of cortico-cortical loops. Biol. Cybern., 66, 241–251.
Oram, M. W., Földiák, P., Perrett, D. I., & Sengpiel, F. (1998). The ideal homunculus: Decoding neural population signals. Trends Neurosci., 21, 259–265.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biological Cybernetics, 58, 35–49.
Pouget, A., Dayan, P., & Zemel, R. (2000). Information processing with population codes. Nature Reviews Neuroscience, 1, 125–132.
Pouget, A., & Sejnowski, T. (1997). Spatial transformations in the parietal cortex using basis functions. J. Cognitive Neuroscience, 9, 222–237.
Pouget, A., & Zhang, K. (1997). Statistically efficient estimation using cortical lateral connections. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 97–103). Cambridge, MA: MIT Press.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Computation, 10, 373–401.
Rao, R. P. N., & Ballard, D. H. (1999). Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2, 79–87.
Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. J. Comp. Neurosci., 1, 89–107.
Salinas, E., & Abbott, L. F. (1995). Transfer of coded information from sensory to motor networks. J. Neurosci., 15, 6461–6474.
Seung, S. (1996). How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93, 13339–13344.
Seung, S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90, 10749–10753.
Simoncelli, E. P., & Olshausen, B. P. (2001). Natural image statistics and neural representation. Annu. Rev. Neurosci., 24, 1193–1216.
Wu, S., & Amari, S. (2002). Neural implementation of Bayesian inference in population codes. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Wu, S., Amari, S., & Nakahara, H. (2002). Population coding and decoding in a neural field: A computational study. Neural Computation, 14, 999–1026.
Wu, S., Nakahara, H., & Amari, S. (2001). Population coding with correlation and an unfaithful model. Neural Computation, 13, 775–797.
Zemel, R., Dayan, P., & Pouget, A. (1998). Probabilistic interpolation of population codes. Neural Computation, 10, 403–430.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neuroscience, 16, 2112–2126.
Zhang, K., Ginzburg, I., McNaughton, B., & Sejnowski, T. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal cells. J. Neurophysiol., 79, 1017–1044.
Received April 25, 2002; accepted November 7, 2002.
LETTER
Communicated by Michael Jordan
Learning Coefficients of Layered Models When the True Distribution Mismatches the Singularities Sumio Watanabe
[email protected] Precision and Intelligence Laboratory, Tokyo Institute of Technology, Midori-ku, Yokohama, 226-8503 Japan
Shun-ichi Amari
[email protected] Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako-shi, Saitama, 351-0198, Japan
Hierarchical learning machines such as layered neural networks have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, with the result that the conventional learning theory of regular statistical models does not hold. Recently, it was proved that if the parameter of the true distribution is contained in the singularities of the learning machine, the generalization error in Bayes estimation is asymptotically equal to \lambda/n, where 2\lambda is smaller than the dimension of the parameter and n is the number of training samples. However, the constant \lambda strongly depends on the local geometrical structure of the singularities; hence, the generalization error has not yet been clarified when the true distribution is almost but not completely contained in the singularities. In this article, in order to analyze such cases, we study the Bayes generalization error under the condition that the Kullback distance of the true distribution from the distribution represented by the singularities is proportional to 1/n, and we show two results. First, if the dimension of the parameter from inputs to hidden units is not larger than three, then there exists a region of true parameters such that the generalization error is larger than that of the corresponding regular model. Second, if the dimension from inputs to hidden units is larger than three, then for an arbitrary true distribution, the generalization error is smaller than that of the corresponding regular model.

1 Introduction

1.1 Background of the Problem. Almost all hierarchical learning machines, such as multilayer perceptrons, gaussian mixtures, Boltzmann machines, Bayesian networks, and reduced-rank approximations, are nonidentifiable statistical models. In other words, the mapping from the parameter

Neural Computation 15, 1013–1033 (2003) © 2003 Massachusetts Institute of Technology
S. Watanabe and S. Amari
to the probability distribution is not one-to-one. Nonidentifiability creates singularities in the parameter space. In this article, a parameter is called a singularity if and only if the determinant of its Fisher information matrix is equal to zero. At singularities, the log-likelihood function cannot be approximated by any quadratic form of the parameter. In fact, neither does the asymptotic normality of the maximum likelihood estimator hold, nor does the Bayes a posteriori distribution converge to the normal distribution (Amari, Park, & Ozeki, 2002; Hagiwara, 2002; Hartigan, 1985; Rusakov & Geiger, in press; Watanabe, 1995; Watanabe & Amari, forthcoming). In nonidentifiable learning machines, the asymptotic generalization error of the maximum likelihood method is still unknown, and statistical model selection methods such as Akaike's information criterion, minimum description length, and the Bayesian information criterion have no theoretical foundation in artificial neural networks. In general, it is well known that the maximum likelihood estimator of a nonidentifiable learning machine often diverges, and even if it remains in a finite region, its generalization error becomes far larger than that of a regular statistical model. It is conjectured that the maximum likelihood method is not appropriate for artificial neural networks. On the other hand, in spite of the mathematical difficulty of nonidentifiable models, the Bayes generalization error has been clarified using algebraic geometry, under the condition that the true distribution is contained in the singularities (Watanabe, 1999, 2001a). It has been mathematically clarified that nonidentifiable machines have far smaller generalization errors when Bayesian estimation is applied. In this article, we also study Bayes estimation for nonidentifiable learning machines.
Although the Bayes generalization error has already been clarified under the condition that the true distribution is strictly contained in the singularities, it is still unknown what happens if the true distribution mismatches the singularities. This article studies the Bayes generalization error of a nonidentifiable learning machine when the true distribution is almost but not completely contained in the singularities.

1.2 The Fact That Has Already Been Clarified. Let us introduce Bayes estimation and its generalization error. Let p(y | x, w) be a conditional probability density function of a real output y for a given M-dimensional real input x and a d-dimensional real parameter w. The parameter set W is assumed to be an open subset of d-dimensional Euclidean space. The training set D_n consists of n empirical samples,

$$D_n = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\},$$

which are independently taken from q(x) p(y | x, w_0), where q(x) is some probability distribution on the input space. The parameter w_0 is referred to as the true parameter.
Learning Coefficients of Singular Models
The Bayes generalization error is defined by

$$G(n) = E\left[\log \frac{p(y_{n+1} \mid x_{n+1}, w_0)}{p(y_{n+1} \mid x_{n+1}, D_n)}\right], \tag{1.1}$$
where E denotes the expectation value over all training sets and testing samples D_{n+1}, with D_{n+1} = D_n ∪ {(x_{n+1}, y_{n+1})}, and p(y | x, D_n) is the Bayes predictive distribution of y for a new input x, using a given a priori distribution φ(w) and the training set D_n. In recent research (Watanabe, 1999, 2001a), it was proved that if the true parameter w_0 is fixed independent of the number of training samples, then

$$G(n) = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right)$$

holds asymptotically (n → ∞), where λ is a positive rational number. In this article, λ is referred to as a learning coefficient. It is well known that if a learning machine is a regular statistical model, which has a positive definite Fisher information matrix for an arbitrary w, then λ = d/2, where d is the dimension of the parameter space. On the other hand, if the set of true parameters,

$$W_0 = \{\, w \in W \,;\; p(y \mid x, w) = p(y \mid x, w_0) \;\; (\forall x, y) \,\},$$

contains a manifold whose dimension is larger than zero, then the rank of the Fisher information matrix is smaller than d. In such cases, the Bayes generalization error has remained unknown, even in mathematical statistics. The mathematical difficulty is caused by the fact that the set W_0 is not a manifold but a union of manifolds. For example, if the learner and the true distribution are, respectively, multilayer perceptrons with H and H_0 hidden units (H_0 < H), then the set of true parameters W_0 has quite complicated singularities. Recently, even in such cases, the Bayes generalization error has been clarified. The Kullback distance from the true parameter w_0 to a parameter w is defined by

$$D(w_0 \,\|\, w) = \int\!\!\int q(x)\, p(y \mid x, w_0) \log \frac{p(y \mid x, w_0)}{p(y \mid x, w)} \, dx\, dy.$$

We introduce the zeta function of the Kullback distance and the prior distribution φ(w),

$$\zeta(z) = \int D(w_0 \,\|\, w)^z\, \varphi(w)\, dw \qquad (z \in \mathbf{C}).$$

Then ζ(z) is a holomorphic function of z in the region Re(z) > 0, and it is proved that ζ(z) can be analytically continued to a meromorphic function
on the entire complex plane, and that its poles are all real, negative, rational numbers. Let (−λ_1) be the largest pole of ζ(z). Then if the true parameter w_0 is fixed — in other words, if w_0 does not depend on n — then λ = λ_1; hence, the asymptotic expansion

$$G(n) = \frac{\lambda_1}{n} + o\!\left(\frac{1}{n}\right) \tag{1.2}$$
holds (Watanabe, 1999, 2001a). When both a learning machine and a true distribution are given, λ_1 can be calculated by applying the desingularization theorem (Hironaka, 1964; Atiyah, 1970) of algebraic geometry to the set W_0. It has been shown that if the set {w ∈ W_0 ; φ(w) > 0} is not empty, then λ_1 is smaller than d/2. If we employ Jeffreys' prior, then λ_1 ≥ d/2 (Watanabe, 2001b). In general, λ_1 depends on the set of true parameters W_0.

Example. If a three-layer neural network with M input units, H hidden units, and one output unit,

$$p(y \mid x, w) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} \Big( y - \sum_{h=1}^{H} a_h \tanh(b_h \cdot x + c_h) \Big)^{\!2} \right),$$

is applied, where w = (a_h, b_h, c_h), and if the true distribution is represented by H_0 hidden units, then the inequality

$$\lambda \le \frac{1}{2} \big\{ (M+2) H_0 + (H - H_0) \big\}$$

holds (Watanabe, 2001c, 2001d). This is far smaller than d/2 = (M+2)H/2. Recently, the Bayes generalization errors of mixture models and of the reduced-rank approximation have also been clarified (Yamazaki & Watanabe, 2002; Watanabe & Watanabe, in press).

1.3 Purpose of This Letter. Summing up the preceding section: Bayes estimation makes the learning coefficient smaller if the true parameter is contained in the singularities. However, in practical applications, we often employ an artificial neural network that almost, but not completely, attains the true distribution. Moreover, in a model selection procedure comparing simple and complex learning machines, we have to decide whether the complex ones are redundant. In such cases, it is ambiguous whether we can assume that the true parameter is contained in the singularities. We need a more precise theory that clarifies the phenomenon in the case when the true distribution is almost but not completely contained in the singularities. In this letter, in order to clarify the effect of singularities when the true parameter is in the neighborhood of singularities, we propose a new scaling
method of the true parameter. We study the problem based on the assumption that the Kullback distance from the singularities to the true distribution is equal to c/n, where n is the number of training samples and c is a constant. This scaling enables us to study the case when the true distribution is almost but not completely contained in the singularities. The scaling makes the true distribution dependent on the number of training samples; hence, equation 1.2 cannot be applied in this case. We show two results. First, if the number of parameters from inputs to hidden units is not larger than three, then there exists c > 0 such that the generalization error is larger than d/2n, where d is the dimension of the parameter. Second, otherwise, for an arbitrary c ≥ 0, the generalization error is smaller than d/2n. These results indicate that in a small neural network, singularities make the Bayes generalization error larger if the true parameter mismatches the singularities, whereas in a large neural network, singularities contribute to precise prediction in Bayesian estimation, independent of the place of the true parameter.

2 A Hierarchical Model

Since singularities in multilayer learning machines have quite complex geometrical structures, we must use advanced knowledge from modern algebraic geometry to treat them in a general manner (Watanabe, 2001a). In this article, we study a simple hierarchical model. Even in this simple model, a universal phenomenon caused by singularities in hierarchical learning machines can be found. Let us consider the learning problem:

Learning Machine:

$$p(y \mid x, a, b) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} \big( y - a f(b, x) \big)^2 \right), \tag{2.1}$$

True Distribution:

$$q(y \mid x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} \Big( y - \frac{a_0 f(b_0, x)}{\sqrt{n}} \Big)^{\!2} \right), \tag{2.2}$$

where y ∈ R^1 is an output and x ∈ R^M is an input with probability distribution q(x). The parameter space is defined by {(a, b) ∈ R^1 × R^N}. Because the true distribution q(y | x) depends on n, the formula in equation 1.2 cannot be applied. It should be emphasized that the distance between the true distribution and the singularities is measured with the scaling 1/n. This scaling enables us to extract the effect of the singularities clearly, by balancing the function approximation error with the statistical estimation error.
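To make the scaling concrete, here is a small sampling sketch. It is not from the paper: it assumes the illustrative choice f(b, x) = b · x with standard normal inputs (so that e_j(x) = x_j is orthonormal under q), and simply draws a training set from the true distribution 2.2.

```python
import numpy as np

def sample_training_set(n, a0, b0, rng):
    """Draw D_n = {(x_i, y_i)} from q(x) q(y | x) of equation 2.2, with q(x)
    standard normal and f(b, x) = b . x; the regression term shrinks as 1/sqrt(n)."""
    x = rng.standard_normal((n, len(b0)))
    y = (a0 / np.sqrt(n)) * (x @ b0) + rng.standard_normal(n)
    return x, y

rng = np.random.default_rng(0)
x, y = sample_training_set(1000, a0=2.0, b0=np.array([1.0, 0.0, 0.0]), rng=rng)
# The Kullback-scaled signal a0^2 ||b0||^2 / (2n) is tiny here, comparable to
# the O(1/n) statistical estimation error -- which is the point of the scaling.
```

With this scaling the data look almost like pure noise for any single n, yet the signal never becomes asymptotically negligible relative to the estimation error.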
In this article, we assume that f(b, x) ≡ 0 is equivalent to b = 0. Then

$$W^* \equiv \{\, (a, b) \,;\; p(y \mid x, a, b) = p(y \mid x, 0, 0) \;\; (\forall x, y) \,\} = \{ a = 0 \} \cup \{ b = 0 \}.$$

Therefore, W* is not a manifold but a union of manifolds, and (0, 0) is a singularity of W*. The Kullback distance from the distribution p(y | x, 0, 0) to the true distribution q(y | x) is in proportion to 1/n:

$$\int q(x)\, p(y \mid x, 0, 0) \log \frac{p(y \mid x, 0, 0)}{q(y \mid x)} \, dx\, dy = \frac{a_0^2}{2n} \int f(b_0, x)^2\, q(x)\, dx.$$

In statistics, this scaling is often used to compare the powers of several methods of testing hypotheses. For simplicity of calculation, we assume that the a priori distribution φ(a, b) satisfies the following conditions: φ(a, b) is a continuously differentiable function of a, and ψ(b) ≡ φ(0, b) is a compactly supported function of b with ψ(0) > 0. Let D_n = {(x_i, y_i) ; i = 1, 2, …, n} be a set of training samples independently taken from q(x) q(y | x). The Bayes a posteriori distribution p(a, b | D_n) and the Bayes predictive distribution p(y | x, D_n) are, respectively, defined by

$$p(a, b \mid D_n) = \frac{1}{C}\, \varphi(a, b) \prod_{i=1}^{n} p(y_i \mid x_i, a, b),$$

$$p(y \mid x, D_n) = \int p(y \mid x, a, b)\, p(a, b \mid D_n)\, da\, db,$$

where C is a normalizing constant. The generalization error G(n) of the Bayes estimation is defined by equation 1.1. We assume that the learning machine satisfies the condition

$$f(b, x) = \sum_{j=1}^{J} f_j(b)\, e_j(x),$$

where {e_j(x) ; j = 1, 2, …, J} is a set of orthonormal functions,

$$\int e_i(x)\, e_j(x)\, q(x)\, dx = \delta_{ij}. \tag{2.3}$$
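Condition 2.3 can be checked directly for a concrete basis. The following quick numerical check is our own illustration (not from the paper), using q(x) uniform on [0, 1] and e_j(x) = √2 cos(jπx):

```python
import math

def inner(i, j, n=20000):
    """Midpoint-rule approximation of the integral of e_i(x) e_j(x) q(x) dx,
    with q uniform on [0, 1] and e_j(x) = sqrt(2) * cos(j * pi * x)."""
    total = 0.0
    for k in range(n):
        x = (k + 0.5) / n
        total += 2.0 * math.cos(i * math.pi * x) * math.cos(j * math.pi * x)
    return total / n

# The Gram matrix is delta_ij up to quadrature error.
print(inner(1, 1), inner(2, 2), inner(1, 2), inner(2, 3))
```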
For simplicity of the mathematical proof, J is assumed to be finite. Then it follows that

$$\| f(b) \|^2 \equiv \sum_{j=1}^{J} f_j(b)^2 = \int f(b, x)^2\, q(x)\, dx.$$

For a given sequence u = {u_j}, we use the notation

$$u \cdot f(b) = \sum_{j=1}^{J} u_j f_j(b).$$
We have the following theorem.

Theorem 1. Assume that

$$\int \frac{\psi(b)}{\| f(b) \|}\, db < \infty. \tag{2.4}$$

Then the Bayes generalization error can be asymptotically expanded as

$$G(n) = \frac{\lambda(a_0, b_0)}{n} + o\!\left(\frac{1}{n}\right).$$

Here λ(a_0, b_0) is independent of n and defined by

$$2\lambda(a_0, b_0) = 1 + a_0^2 \| f(b_0) \|^2 - \sum_{j=1}^{J} a_0 f_j(b_0)\, E_g\!\left[ \frac{\partial \log Z(g)}{\partial g_j} \right],$$

where g = {g_j} is a random variable subject to the J-dimensional gaussian distribution whose average and covariance matrix are, respectively, zero and the identity, E_g denotes the expectation value over g, and

$$Z(g) = \int \exp\big(L(g, b)\big)\, \frac{\psi(b)}{\| f(b) \|}\, db,$$

$$L(g, b) = \frac{ \big\{ (g + a_0 f(b_0)) \cdot f(b) \big\}^2 }{ 2 \| f(b) \|^2 }.$$

Proof. By using the log-likelihood ratio function

$$r(y \mid x, a, b) = \log \frac{q(y \mid x)}{p(y \mid x, a, b)},$$
the Bayes generalization error is equal to

$$G(n) = E\left[ -\log \frac{ \int \exp\big( -\sum_{i=1}^{n+1} r(y_i \mid x_i, a, b) \big)\, \varphi(a, b)\, da\, db }{ \int \exp\big( -\sum_{i=1}^{n} r(y_i \mid x_i, a, b) \big)\, \varphi(a, b)\, da\, db } \right].$$

Let us introduce the rescaled parameter α = √n a and define the average ⟨S(α, b)⟩ of a function S(α, b) by

$$\langle S(\alpha, b) \rangle = \frac{ \int e^{-E_1(\alpha, b)}\, S(\alpha, b)\, \varphi\!\left( \frac{\alpha}{\sqrt{n}}, b \right) d\alpha\, db }{ \int e^{-E_1(\alpha, b)}\, \varphi\!\left( \frac{\alpha}{\sqrt{n}}, b \right) d\alpha\, db },$$

where

$$E_1(\alpha, b) = \frac{1}{2n} \sum_{i=1}^{n} d(\alpha, b, x_i)^2 - \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \varepsilon_i\, d(\alpha, b, x_i),$$

$$d(\alpha, b, x) = \alpha f(b, x) - a_0 f(b_0, x).$$

Here

$$\varepsilon_i \equiv y_i - \frac{a_0 f(b_0, x_i)}{\sqrt{n}}$$

are independent samples taken from the standard normal distribution. Then the Bayes generalization error G(n) is equal to

$$G(n) = E\left[ -\log \left\langle \exp\left( -\frac{e(\alpha, b)}{\sqrt{n}} \right) \right\rangle \right],$$

where

$$e(\alpha, b) = \frac{1}{2\sqrt{n}}\, d(\alpha, b, x_{n+1})^2 - \varepsilon_{n+1}\, d(\alpha, b, x_{n+1}).$$

It follows that

$$G(n) = E\left[ -\log\left( 1 - \frac{ \langle e(\alpha, b) \rangle }{ \sqrt{n} } + \frac{ \langle e(\alpha, b)^2 \rangle }{ 2n } + O\!\left( \frac{1}{n^{3/2}} \right) \right) \right].$$

Then, by the equation

$$\frac{1}{\sqrt{n}}\, E[\langle e(\alpha, b) \rangle] = \frac{1}{2n}\, E[\langle e(\alpha, b)^2 \rangle] + O\!\left( \frac{1}{n^{3/2}} \right),$$
we obtain

$$G(n) = \frac{1}{2n}\, E[\langle e(\alpha, b) \rangle^2] + O\!\left( \frac{1}{n^{3/2}} \right). \tag{2.5}$$
In other words, the learning coefficient is given by

$$2\lambda = \lim_{n \to \infty} E[\langle e(\alpha, b) \rangle^2],$$

provided the above convergence holds. Let us prove its convergence and calculate the limiting value. By using assumption 2.3 and the central limit theorem, the following convergence in law holds:

$$E_1(\alpha, b) \to \frac{1}{2} \sum_{j=1}^{J} \big( \alpha f_j(b) - a_0 f_j(b_0) \big)^2 - \sum_{j=1}^{J} g_j \big( \alpha f_j(b) - a_0 f_j(b_0) \big) = E_2(\alpha, b) + c_0,$$

where {g_j} is the J-dimensional gaussian random variable whose average and covariance matrix are, respectively, zero and the identity, E_2(α, b) is the function defined by

$$E_2(\alpha, b) = \frac{1}{2} \sum_{j=1}^{J} \alpha^2 f_j(b)^2 - \sum_{j=1}^{J} \alpha f_j(b) \big( g_j + a_0 f_j(b_0) \big),$$

and c_0 is a random variable that is constant as a function of (α, b). Also, we define the average ⟨·⟩_∞ of S(α, b) by

$$\langle S(\alpha, b) \rangle_\infty = \frac{ \int e^{-E_2(\alpha, b)}\, S(\alpha, b)\, \psi(b)\, d\alpha\, db }{ \int e^{-E_2(\alpha, b)}\, \psi(b)\, d\alpha\, db }.$$

Then it follows that

$$2\lambda(a_0, b_0) = \lim_{n \to \infty} E[\langle e(\alpha, b) \rangle^2] = \lim_{n \to \infty} E[\langle d(\alpha, b, x_{n+1}) \rangle^2] = \lim_{n \to \infty} E\left[ \sum_{j=1}^{J} \big\langle \alpha f_j(b) - a_0 f_j(b_0) \big\rangle^2 \right] = \sum_{j=1}^{J} E_g\Big[ \big\{ \langle \alpha f_j(b) \rangle_\infty - a_0 f_j(b_0) \big\}^2 \Big],$$
where E_g denotes the expectation value over g. Hence,

$$2\lambda(a_0, b_0) = \sum_{j=1}^{J} E_g\left[ \left\{ \frac{1}{Z} \frac{\partial Z}{\partial g_j} - a_0 f_j(b_0) \right\}^2 \right] = \sum_{j=1}^{J} E_g\left[ \left\{ \frac{1}{Z} \frac{\partial Z}{\partial g_j} \right\}^2 \right] - 2 a_0 \sum_{j=1}^{J} f_j(b_0)\, E_g\left[ \frac{1}{Z} \frac{\partial Z}{\partial g_j} \right] + \sum_{j=1}^{J} a_0^2 f_j(b_0)^2,$$

where

$$Z(g) = \int e^{-E_2(\alpha, b)}\, \psi(b)\, d\alpha\, db = \int \exp\left( \frac{ \big\{ \sum_{j=1}^{J} f_j(b)\, (g_j + a_0 f_j(b_0)) \big\}^2 }{ 2 \| f(b) \|^2 } \right) \frac{ \psi(b)\, db }{ \| f(b) \| }.$$
By using the identities

$$\left\{ \frac{1}{Z} \frac{\partial Z}{\partial g_j} \right\}^2 = \frac{1}{Z} \frac{\partial^2 Z}{\partial g_j^2} - \frac{\partial}{\partial g_j} \left\{ \frac{1}{Z} \frac{\partial Z}{\partial g_j} \right\}, \tag{2.6}$$

$$\sum_{j=1}^{J} \frac{\partial^2 Z}{\partial g_j^2} = Z + \sum_{j=1}^{J} \big( g_j + a_0 f_j(b_0) \big) \frac{\partial Z}{\partial g_j}, \tag{2.7}$$
where equation 2.7 is shown by partial integration over α. Finally, by using the identity

$$E_g\left[ \frac{\partial}{\partial g_j} \left\{ \frac{1}{Z} \frac{\partial Z}{\partial g_j} \right\} \right] = E_g\left[ g_j\, \frac{1}{Z} \frac{\partial Z}{\partial g_j} \right],$$
we obtain theorem 1.

Theorem 1 shows that if a_0 = 0, then λ(a_0, b_0) = 1/2, which coincides with the general theory for the case when the true parameter is contained in the singularities. In fact, if a_0 = 0, the zeta function of the Kullback information,

$$\zeta(z) = \int a^{2z} \| b \|^{2z}\, \varphi(a, b)\, da\, db,$$

has its largest pole at z = −1/2.
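The pole bookkeeping can be spelled out with an explicit prior. The sketch below is our own illustration (not from the paper): for a prior uniform on [0, 1] × {‖b‖ ≤ 1}, the a-integral of a^{2z} contributes a pole at z = −1/2 and the radial integral of r^{2z+N−1} contributes one at z = −N/2, so the largest pole — hence λ — does not depend on N; the regular case with a locally quadratic Kullback distance gives λ = d/2 by the same computation.

```python
from fractions import Fraction

def largest_pole_singular(N):
    """Poles of zeta(z) = int a^{2z} ||b||^{2z} phi(a,b) da db for a prior
    uniform on [0,1] x {||b|| <= 1}: 1/(2z+1) from a, const/(2z+N) from ||b||."""
    return max(Fraction(-1, 2), Fraction(-N, 2))

def largest_pole_regular(d):
    """Regular case, D(w0||w) ~ ||w - w0||^2: the radial integral of
    r^{2z+d-1} gives a single pole at z = -d/2."""
    return Fraction(-d, 2)

for N in (2, 3, 6):
    assert -largest_pole_singular(N) == Fraction(1, 2)   # lambda = 1/2, any N
    assert -largest_pole_regular(N) == Fraction(N, 2)    # lambda = d/2
print("singular: lambda = 1/2; regular: lambda = d/2")
```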
Remark. The partition function Z(g) is characterized by the partial differential equation 2.7, which can be rewritten as

$$\big( \Delta - (g + a_0 f(b_0)) \cdot \nabla - 1 \big)\, Z = 0,$$

where Δ and ∇ are taken with respect to g. This fact indicates that the generalization error has a mathematical connection to the theory of differential equations.

3 Effect of Singularities

In order to study the effect of singularities, we adopt the simple learning machine

$$p(y \mid x, a, b) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} \Big( y - \sum_{j=1}^{N} a b_j e_j(x) \Big)^{\!2} \right), \tag{3.1}$$
where a ∈ R^1, b ∈ R^N, x ∈ R^M (N > 1). We also assume that ψ(b) depends only on the norm ‖b‖; that is, ψ(b) can be rewritten as ψ(‖b‖). Note that if N = 1, then condition 2.4 in theorem 1 does not hold. Hence, we study the cases N ≥ 2. If the true regression function is y = 0, then the set of true parameters is {(a, b) ; a = 0 or b = 0}.

Remark 1. By using the reparameterization w_i = a b_i, the learning machine, equation 3.1, results in

$$p(y \mid x, w) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} \Big( y - \sum_{j=1}^{N} w_j e_j(x) \Big)^{\!2} \right).$$

This learner is a regular statistical model, which has the learning coefficient λ(w_0) = N/2 for an arbitrary w_0. Therefore, by comparing the learning coefficient λ(a_0, b_0) with N/2, we can clarify the effect of singularities.

Remark 2. If the true distribution is p(y | x, 0, 0), then the learning coefficient of the machine in equation 3.1 is λ = 1/2, as shown in theorem 1. If the true distribution is p(y | x, a*, b*), which is independent of n and satisfies a* b* ≠ 0, then

$$D(a^*, b^* \,\|\, a, b) = \frac{1}{2} \sum_{j=1}^{N} (a b_j - a^* b_j^*)^2.$$
Hence, the largest pole of the zeta function

$$\zeta(z) = \int D(a^*, b^* \,\|\, a, b)^z\, \varphi(a, b)\, da\, db$$

is z = −N/2, and the learning coefficient is N/2.

For the learning machine in equation 3.1, we have a more explicit form of the learning coefficient.

Theorem 2. If N ≥ 2 and the true distribution is equation 2.2, the learning machine given by equation 3.1 has the learning coefficient

$$2\lambda(a_0, b_0) = 1 + E_g\left[ \big( a_0^2 \| b_0 \|^2 + a_0 b_0 \cdot g \big) \frac{Y_N(g)}{Y_{N-2}(g)} \right], \tag{3.2}$$

where

$$Y_N(g) = \int_0^{\pi/2} \sin^N\!\theta\, \exp\left( -\frac{1}{2} \| a_0 b_0 + g \|^2 \sin^2\!\theta \right) d\theta.$$
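Since sin^N θ ≤ sin^{N−2} θ on [0, π/2], the ratio Y_N(g)/Y_{N−2}(g) appearing in equation 3.2 always lies strictly between 0 and 1. A quick numerical check (our own sketch, using midpoint quadrature):

```python
import math

def Y(N, T, n=2000):
    """Y_N of theorem 2 by midpoint quadrature on [0, pi/2]; T = ||a0*b0 + g||."""
    h = (math.pi / 2) / n
    return sum(math.sin((k + 0.5) * h) ** N
               * math.exp(-0.5 * T**2 * math.sin((k + 0.5) * h) ** 2)
               for k in range(n)) * h

for N in (2, 3, 4, 6):
    for T in (0.0, 1.0, 5.0, 10.0):
        r = Y(N, T) / Y(N - 2, T)
        assert 0.0 < r < 1.0   # the ratio in equation 3.2 is a contraction factor
print("Y_N / Y_{N-2} always lies in (0, 1)")
```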
Proof. We introduce the general polar coordinates b = (r, Ω). The function Z(g) in theorem 1 is given by

$$Z(g) = \int dr\, d\Omega\, \exp\left( \frac{ \{ (g + a_0 b_0) \cdot \Omega \}^2 }{2} \right) \psi(r)\, r^{N-2}.$$

Since ∫ dΩ represents the integration with the homogeneous measure of the (N−1)-dimensional sphere, Z(g) is independent of the direction of g + a_0 b_0. Therefore, we can assume g + a_0 b_0 = ‖g + a_0 b_0‖ × (1, 0, …, 0) without loss of generality. We represent Ω = b/r by

$$b_i / r = \sin\theta_1 \cdots \sin\theta_{i-1} \cos\theta_i \quad (1 \le i \le N-1), \qquad b_N / r = \sin\theta_1 \cdots \sin\theta_{N-1},$$

where 0 ≤ θ_i ≤ π (1 ≤ i ≤ N−2) and 0 ≤ θ_{N−1} < 2π. Then

$$\{ (g + a_0 b_0) \cdot \Omega \}^2 = \| g + a_0 b_0 \|^2 \cos^2\!\theta_1$$

and

$$dr\, d\Omega = dr \prod_{i=1}^{N-1} \sin^{N-i-1}\!\theta_i\, d\theta_i.$$
Therefore,

$$Z(g) = \mathrm{const.} \int_0^{\pi/2} \sin^{N-2}\!\theta_1\, \exp\left( \frac{ \| a_0 b_0 + g \|^2 \cos^2\!\theta_1 }{2} \right) d\theta_1.$$

It follows that

$$a_0 b_0 \cdot \frac{\partial Z(g)}{\partial g} = \big( \| a_0 b_0 \|^2 + a_0 b_0 \cdot g \big)\, \hat{Z}(g)\, Z(g),$$

where

$$\hat{Z}(g) = \frac{ \int_0^{\pi/2} \sin^{N-2}\!\theta_1 \cos^2\!\theta_1\, \exp\big( \frac{1}{2}\| a_0 b_0 + g \|^2 \cos^2\!\theta_1 \big)\, d\theta_1 }{ \int_0^{\pi/2} \sin^{N-2}\!\theta_1\, \exp\big( \frac{1}{2}\| a_0 b_0 + g \|^2 \cos^2\!\theta_1 \big)\, d\theta_1 } = 1 - \frac{ \int_0^{\pi/2} \sin^{N}\!\theta_1\, \exp\big( -\frac{1}{2}\| a_0 b_0 + g \|^2 \sin^2\!\theta_1 \big)\, d\theta_1 }{ \int_0^{\pi/2} \sin^{N-2}\!\theta_1\, \exp\big( -\frac{1}{2}\| a_0 b_0 + g \|^2 \sin^2\!\theta_1 \big)\, d\theta_1 } = 1 - \frac{Y_N(g)}{Y_{N-2}(g)}.$$
By applying this equation to theorem 1, we obtain theorem 2.
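Equation 3.2 is straightforward to evaluate numerically. The following sketch is our own (not the authors' code); it uses midpoint quadrature for Y_N and Monte Carlo sampling of g, and recovers the two limiting behaviors: 2λ = 1 on the singularity, and 2λ approaching N far from it.

```python
import numpy as np

def Y(N, T, ntheta=400):
    """Y_N(g) of theorem 2 by midpoint quadrature on [0, pi/2]; T is a vector
    of values of ||a0*b0 + g||, one per Monte Carlo sample."""
    theta = (np.arange(ntheta) + 0.5) * (np.pi / 2) / ntheta
    s = np.sin(theta)
    return (s**N * np.exp(-0.5 * T[:, None]**2 * s**2)).sum(axis=1) * (np.pi / 2) / ntheta

def two_lambda(N, t, nsamples=10000, seed=0):
    """2*lambda(a0, b0) from equation 3.2, with ||a0*b0|| = t."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((nsamples, N))
    v = np.zeros(N); v[0] = t               # take a0*b0 = (t, 0, ..., 0) w.l.o.g.
    T = np.linalg.norm(v + g, axis=1)
    term = (t**2 + t * g[:, 0]) * Y(N, T) / Y(N - 2, T)
    return 1.0 + term.mean()

print(two_lambda(4, 0.0))    # exactly 1: lambda = 1/2 on the singularity
print(two_lambda(4, 10.0))   # close to N - (N-1)(N-3)/t^2 = 3.97
```

Sweeping t = |a_0|‖b_0‖ for each N reproduces the curves of Figure 1.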
Unfortunately, the function λ(a_0, b_0) in equation 3.2 cannot be represented by any classical analytic function. Figure 1 shows the value of λ(a_0, b_0) given by equation 3.2, obtained by numerical calculation for the cases N = 2, 3, …, 6. The horizontal and vertical axes show, respectively, |a_0|‖b_0‖ and 2λ(a_0, b_0)/N. The generalization error is smaller than that of the corresponding regular statistical model if and only if 2λ(a_0, b_0)/N < 1. In all cases 2 ≤ N ≤ 6, λ(a_0, b_0) converges to half the dimension, N/2, as |a_0|‖b_0‖ → ∞. If N = 2 or N = 3, λ(a_0, b_0) becomes larger than N/2 when the true parameter mismatches the singularities: for N = 2, λ(a_0, b_0) > N/2 whenever |a_0|‖b_0‖ > 2.8; for N = 3, λ(a_0, b_0) > N/2 only in the interval 3.8 < |a_0|‖b_0‖ < 6.8. On the other hand, if N ≥ 4, the learning coefficient λ(a_0, b_0) is always smaller than N/2, even when the true parameter is not contained in the singularities. If the dimension of the parameter is large, singularities therefore make the Bayes generalization error smaller than that of regular statistical models, independent of the place of the true parameter.
Figure 1: Learning coefficients 2λ(a_0, b_0)/N plotted against |a_0|‖b_0‖ for N = 2, 3, 4, 5, 6, together with the reference line 2λ/N = 1.
This result can be analyzed more precisely by an asymptotic expansion.

Theorem 3. The learning coefficient λ(a_0, b_0) can be asymptotically expanded as |a_0|‖b_0‖ → ∞:

$$2\lambda(a_0, b_0) = N - \frac{(N-1)(N-3)}{a_0^2 \| b_0 \|^2} + o\!\left( \frac{1}{a_0^2 \| b_0 \|^2} \right).$$

The coefficient of 1/(a_0^2 ‖b_0‖^2) is positive if N = 2, whereas it is negative if N ≥ 4. When N = 3, the coefficient is equal to zero.

Proof. Using the notation T = ‖a_0 b_0 + g‖, the function Y_N(g) in theorem 2 can be rewritten as

$$Y_N(g) = \frac{1}{T^{N+1}} \int_0^{\infty} \theta_T(x)\, \frac{x^N}{\sqrt{1 - x^2 / T^2}}\, \exp\left( -\frac{x^2}{2} \right) dx,$$
where θ_T(x) is equal to one if 0 ≤ x < T and zero otherwise. Then the convergence

$$\lim_{T \to \infty} T^4 \left\{ \frac{ \theta_T(x) }{ \sqrt{1 - x^2 / T^2} } - \left( 1 + \frac{x^2}{2 T^2} \right) \right\} = \frac{3}{8}\, x^4$$

holds for each x. By applying Lebesgue's convergence theorem to the integration over x, we have an asymptotic expansion for ‖a_0 b_0‖ → ∞,

$$Y_N(g) = \frac{C_N}{\| a_0 b_0 + g \|^{N+1}} + \frac{C_{N+2}}{2 \| a_0 b_0 + g \|^{N+3}} + o\!\left( \frac{1}{\| a_0 b_0 \|^{N+3}} \right),$$

where C_N = 2^{(N-1)/2} Γ((N+1)/2). Then

$$2\lambda(a_0, b_0) = 1 + E_g\left[ \big( a_0^2 \| b_0 \|^2 + a_0 b_0 \cdot g \big)\, \frac{ \dfrac{C_N}{\| a_0 b_0 + g \|^{N+1}} + \dfrac{C_{N+2}}{2 \| a_0 b_0 + g \|^{N+3}} }{ \dfrac{C_{N-2}}{\| a_0 b_0 + g \|^{N-1}} + \dfrac{C_N}{2 \| a_0 b_0 + g \|^{N+1}} } \right] + o\!\left( \frac{1}{\| a_0 b_0 \|^2} \right).$$

By expanding this expression with respect to 1/‖a_0 b_0‖, we obtain

$$2\lambda(a_0, b_0) = N + \frac{N-1}{\| a_0 b_0 \|^2}\, E_g\left[ \frac{2\, (a_0 b_0 \cdot g)^2}{\| a_0 b_0 \|^2} + 1 - \| g \|^2 \right] + o\!\left( \frac{1}{\| a_0 b_0 \|^2} \right). \tag{3.3}$$
The relation E_g[(b_0 · g)^2] = ‖b_0‖^2 completes the proof of theorem 3.

4 Discussion

4.1 Training Errors and Stochastic Complexity. By using the true conditional probability density function q(y | x) and the Bayes predictive distribution p(y | x, D_n), we define the average Bayes training error by

$$T(n) = E\left[ \frac{1}{n} \sum_{k=1}^{n} \log \frac{ q(y_k \mid x_k) }{ p(y_k \mid x_k, D_n) } \right].$$
Then, under the same condition as theorem 1, we can prove, by the same method as theorem 1, the result

$$T(n) = \frac{\mu(a_0, b_0)}{n} + o\!\left( \frac{1}{n} \right),$$

where

$$\mu(a_0, b_0) = \lambda(a_0, b_0) - E_g\left[ \sum_{j=1}^{J} g_j\, \frac{\partial \log Z(g)}{\partial g_j} \right].$$

Then, under the same condition as theorem 2, we obtain

$$2\mu(a_0, b_0) = 1 - 2N + E_g\left[ \big( a_0^2 \| b_0 \|^2 + 3\, a_0 b_0 \cdot g + 2 \| g \|^2 \big)\, \frac{Y_N(g)}{Y_{N-2}(g)} \right],$$

which is shown in Figure 2. In particular, this coefficient has the asymptotic expansion, as |a_0|‖b_0‖ → ∞,

$$2\mu(a_0, b_0) = -N + \frac{(N-1)^2}{a_0^2 \| b_0 \|^2} + o\!\left( \frac{1}{a_0^2 \| b_0 \|^2} \right).$$

These results show that the coefficients of the training errors are larger than −N/2 for an arbitrary true parameter. Note that the symmetry of the generalization error and the training error,

$$\lambda(a_0, b_0) + \mu(a_0, b_0) = 0,$$

does not hold in a neighborhood of the singularities. In order to estimate the generalization error based on the training error, we have to construct a new learning theory, since the statistical asymptotic theory of regular statistical models does not hold in neural networks.

It is well known that Bayesian model selection is often carried out by minimization of the stochastic complexity (Akaike, 1980). The stochastic complexity is defined by

$$F(D_n) = -\log \int \prod_{i=1}^{n} p(y_i \mid x_i, w)\, \varphi(w)\, dw.$$
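The training-error coefficient 2μ(a_0, b_0) of this subsection can be checked numerically in the same way as 2λ. The sketch below is our own (not the authors' code), combining midpoint quadrature for Y_N with Monte Carlo sampling of g:

```python
import numpy as np

def Y(N, T, ntheta=400):
    """Y_N(g) of theorem 2 by midpoint quadrature; T holds one value of
    ||a0*b0 + g|| per Monte Carlo sample."""
    theta = (np.arange(ntheta) + 0.5) * (np.pi / 2) / ntheta
    s = np.sin(theta)
    return (s**N * np.exp(-0.5 * T[:, None]**2 * s**2)).sum(axis=1) * (np.pi / 2) / ntheta

def two_mu(N, t, nsamples=10000, seed=0):
    """2*mu(a0, b0) with ||a0*b0|| = t, from the training-error formula."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((nsamples, N))
    v = np.zeros(N); v[0] = t
    T = np.linalg.norm(v + g, axis=1)
    term = (t**2 + 3.0 * t * g[:, 0] + 2.0 * (g**2).sum(axis=1)) * Y(N, T) / Y(N - 2, T)
    return 1.0 - 2.0 * N + term.mean()

# Far from the singularity, the value should approach -N + (N-1)^2 / t^2.
print(two_mu(4, 10.0))
```

Sweeping t for each N reproduces the curves of Figure 2.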
Figure 2: Coefficients of training errors 2μ(a_0, b_0)/N plotted against |a_0|‖b_0‖ for N = 2, 3, 4, 5, 6.
The minimization of F(D_n) is equivalent to that of

$$F'(D_n) = -\log \int \exp\left( -\sum_{i=1}^{n} \log \frac{ q(y_i \mid x_i) }{ p(y_i \mid x_i, w) } \right) \varphi(w)\, dw.$$

Let us apply the minimum stochastic complexity criterion to our problem, equation 2.2. If the Kullback distance from the singularities to the true distribution is in proportion to c/n, then, for an arbitrary (a_0, b_0), the smaller model y = 0 is selected with probability 1 − ε, where ε > 0 goes to zero as n tends to infinity. This fact shows that the minimum stochastic complexity criterion is not optimal with respect to the minimum generalization error.

4.2 The Reason That Singularities Are Important in Neural Network Research. Let us discuss the reason that singularities are important in neural network learning theory. In a multilayer neural network, singularities correspond to the parameters that represent smaller networks. The neighborhood of the singularities corresponds to the probability distributions that can be almost approximated by smaller networks. On the other hand, in practical applications, the true probability distribution is not contained in any finite parametric model. However, if the number of training samples is small, then the neighborhood of the singularities is chosen with large probability in Bayes estimation; in other words, smaller networks are employed in the early stage of learning. As the number of training samples increases, larger networks are gradually chosen (Watanabe, 2001c). This phenomenon can be understood as a phase transition from the statistical physics point of view. The bias-variance problem is mathematically equivalent to the energy-entropy problem. The most interesting state of learning is found just when a simpler network begins to change into a more complex one. In order to describe such a state clearly, we need the scaling in which the Kullback information is in proportion to 1/n. If the Kullback information is smaller than this order, then the learning machine cannot distinguish the true parameter from the singularities. If it is larger, then the learning machine recognizes the true parameter as a general point. In this article, if the dimension of the input layer is smaller than four, then the generalization error momentarily becomes larger. It should be emphasized that such a phenomenon cannot be found without the proposed scaling of the Kullback information. We expect that our scaling will clarify the deeper mathematical structure of neural network learning.

4.3 Future Study. First, let us consider why hierarchical structure is needed in statistical estimation. In this article, we compared a simple layered model with the corresponding regular statistical model. When we employ a linear learner,
$$y = \sum_{j=1}^{N} b_j e_j(x),$$

in some applications with N ≥ 4, we can expect more precise statistical estimation by extending it to the hierarchical model

$$y = \sum_{j=1}^{N} a b_j e_j(x)$$

and using Bayesian estimation. This fact shows that even if the original statistical model has a linear parameter and does not need any hierarchical structure, we can obtain more precise statistical estimation by creating singularities. We expect that this is one reason that hierarchical structure is needed in practical neural information processing systems.

Second, singularities do not always make the Bayes generalization error smaller. In our models, the condition N ≥ 4 is needed to ensure that the singularities make the estimation more precise. We do not yet have any explanation of why such a condition is needed. In statistics, it is well known
that if x_1, x_2, …, x_n are independently taken from the d-dimensional normal distribution with mean μ and identity covariance matrix, then Stein's estimator,

$$\mu^* = \left( 1 - \frac{d-2}{\big\| \sum_{i=1}^{n} x_i \big\|^2} \right) \frac{1}{n} \sum_{i=1}^{n} x_i,$$

has smaller average square error than the sample average (1/n) Σ_{i=1}^{n} x_i if and only if d ≥ 3 (Stein, 1956). In this article, we have shown that the same phenomenon is observed in hierarchical learning machines with singularities. In statistics, there are research results on the relation between Stein's estimator and Bayesian estimation (Efron & Morris, 1973; Kass & Steffey, 1989). It is important future work for neural network learning theory to clarify the mathematical connection between such research and singularity theory.

Finally, we can generalize the result in theorem 1 even to the case J = ∞, based on more mathematical preparation. Moreover, for a general f(b, x), we can derive the same asymptotic expansion as theorem 3. That is,
$$2\lambda(a_0, b) = \min(J, M) + \frac{\gamma}{|a_0|^2} + o\!\left( \frac{1}{|a_0|^2} \right)$$

holds in general when |a_0| → ∞, where M is the dimension of b. However, it is still unknown what γ represents. In theorem 3, it seems that (N−1)(N−3), the coefficient of 1/(a_0^2 ‖b_0‖^2), reflects some geometrical index of the singularities. Future study should clarify the relation between the algebraic geometrical structure and the coefficients of the asymptotic expansion as |a_0| → ∞.

5 Conclusion

The effect of singularities when the true parameter mismatches them has been clarified, based on the new scaling method. Singularities make the Bayes generalization error small if the dimension of the inputs to hidden units is larger than three. We expect that this research will serve as a basis for clarifying the reason that neural information processing systems need hierarchical structures.

Acknowledgments

This work was supported by the Ministry of Education, Science, Sports, and Culture in Japan, grant-in-aid for scientific research 12680370.
References

Akaike, H. (1980). Likelihood and Bayes procedure. In J. M. Bernald (Ed.), Bayesian statistics (pp. 143–166). Valencia, Spain: University Press.
Amari, S., Park, H., & Ozeki, T. (2002). Geometrical singularities in the neuromanifold of multilayer perceptrons. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Atiyah, M. F. (1970). Resolution of singularities and division of distributions. Comm. Pure and Appl. Math., 13, 145–150.
Efron, B., & Morris, C. (1973). Stein's estimation rule and its competitors—an empirical Bayes approach. Journal of the American Statistical Association, 68, 117–130.
Hagiwara, K. (2002). On the problem in model selection of neural network regression in overrealizable scenario. Neural Computation, 14, 1979–2002.
Hartigan, J. A. (1985). A failure of likelihood ratio asymptotics for normal mixtures. In Proceedings of the Berkeley Conference in Honor of J. Neyman and J. Kiefer, 2 (pp. 807–810). Monterey, CA: Wadsworth Advanced Books, and Hayward, CA: Institute of Mathematical Statistics.
Hironaka, H. (1964). Resolution of singularities of an algebraic variety over a field of characteristic zero. Annals of Mathematics, 79, 109–326.
Kass, R. E., & Steffey, D. (1989). Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). Journal of the American Statistical Association, 84, 717–726.
Rusakov, D., & Geiger, D. (in press). Asymptotic model selection for naive Bayesian networks. In Proceedings of UAI'02. San Mateo, CA: Morgan Kaufmann.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proc. of the 3rd Berkeley Symp. on Mathematical Statistics and Probability, 1 (pp. 197–206). Berkeley: University of California.
Watanabe, K., & Watanabe, S. (in press). Upper bounds of Bayes generalization errors in reduced rank approximation. IEICE Transactions, D-II-86J.
Watanabe, S. (1995). A generalized Bayesian framework for neural networks with singular Fisher information matrices. International Symposium on Nonlinear Theory and Its Applications, 2, 207–210.
Watanabe, S. (1999). Algebraic analysis for singular statistical estimation. In Lecture notes in computer science, 1720 (pp. 39–50). Berlin: Springer-Verlag.
Watanabe, S. (2001a). Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13(4), 899–933.
Watanabe, S. (2001b). Algebraic information geometry for learning machines with singularities. In T. K. Leen, T. G. Dietterich, & K.-R. Müller (Eds.), Advances in neural information processing systems, 13 (pp. 329–336). Cambridge, MA: MIT Press.
Watanabe, S. (2001c). Algebraic geometrical methods for hierarchical learning machines. International Journal of Neural Networks, 14(8), 1049–1060.
Watanabe, S. (2001d). Learning efficiency of redundant neural networks in Bayesian estimation. IEEE Trans. on Neural Networks, 12(6), 1475–1486.
Watanabe, S., & Amari, S. (forthcoming). The effect of singularities in a learning machine when the true parameters do not lie on singularities. In Advances in neural information processing systems, 15.
Yamazaki, K., & Watanabe, S. (2002). Resolution of singularities in mixture models and the upper bounds of the stochastic complexity. In Proceedings of the International Conference on Neural Information Processing.

Received September 3, 2002; accepted December 6, 2002.
LETTER
Communicated by Alain Destexhe
Gamma Rhythmic Bursts: Coherence Control in Networks of Cortical Pyramidal Neurons

Toshio Aoyagi
[email protected]

Takashi Takekawa
[email protected] Department of Applied Analysis and Complex Dynamical Systems, Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan
Tomoki Fukai
[email protected] Department of Information-Communication Engineering, Tamagawa University, Machida, Tokyo 606-8501, Japan
Much evidence indicates that synchronized gamma-frequency (20–70 Hz) oscillation plays a significant functional role in the neocortex and hippocampus. The chattering neuron is a possible neocortical pacemaker for the gamma oscillation. Based on our recent model of chattering neurons, here we study how gamma-frequency bursting is synchronized in a network of these neurons. Using a phase oscillator description, we first examine how two coupled chattering neurons are synchronized. The analysis reveals that an incremental change of the bursting mode, such as from singlet to doublet, always accompanies a rapid transition from antisynchronous to synchronous firing. The state transition occurs regardless of what changes the bursting mode. Within each bursting mode, the neuronal activity undergoes a gradual change from synchrony to antisynchrony. Since the sensitivity to Ca²⁺ and the maximum conductance of the Ca²⁺-dependent cationic current, as well as the intensity of input current, systematically control the bursting mode, these quantities may be crucial for the regulation of the coherence of local cortical activity. Numerical simulations demonstrate that modulations of the calcium sensitivity and the amplitude of the cationic current can induce rapid transitions between synchrony and asynchrony in a large-scale network of chattering neurons. The rapid synchronization of chattering neurons is shown to synchronize the activities of regular-spiking pyramidal neurons at the gamma frequencies, as may be necessary for selective attention or binding processing in object recognition.
© 2003 Massachusetts Institute of Technology. Neural Computation 15, 1035–1061 (2003)
1 Introduction

In sensory perception, learning and memory, motor control, and attention, cortical neurons are known to exhibit synchronized oscillations in the gamma-frequency range (20–70 Hz) (Gray & McCormick, 1996; Tallon-Baudry, Bertrand, Delpuech, & Pernier, 1997). These rhythmic cortical activities are clearly stimulus and task specific and therefore are thought to engage in higher cognitive functions (Buzsaki, Leung, & Vanderwolf, 1983; Steriade, McCormick, & Sejnowski, 1993). Clarifying the cellular origin of the gamma oscillation would give better insight into its functional role. In the hippocampus, synchronous gamma oscillation is considered to occur through the GABA-A-mediated mutual inhibition among interneurons (Buhl, Halasy, & Somogyi, 1994; Cobb, Buhl, Halasy, Paulsen, & Somogyi, 1995; Freund & Buzsaki, 1996). Several theoretical studies have also addressed the synchronization among inhibitory neurons (Wang & Buzsaki, 1996; White, Banks, Pearce, & Kopell, 2000). In the neocortex, it is known that a class of pyramidal neurons termed fast rhythmic bursting cells (Steriade, Timofeev, Durmuller, & Grenier, 1998) or chattering cells (Gray & McCormick, 1996) generates fast rhythmic bursts in the gamma-frequency range. This raises the possibility that the chattering cell is the neocortical pacemaker that synchronizes local cortical activities at the gamma frequencies. Few theoretical studies, however, have addressed the mechanism of synchronization among chattering cells. How synchronization is realized in a network of chattering neurons depends heavily on the ionic mechanism underlying the gamma-frequency bursting of the neurons. That mechanism, however, remains controversial.
Some experimental results suggested that a persistent sodium current plays a crucial role in the generation of rhythmic bursts (Azouz, Jensen, & Yaari, 1996; Brumberg, Nowak, & McCormick, 2000; Llinas, Grace, & Yarom, 1991; Mantegazza, Franceschetti, & Avanzini, 1998). On the basis of these results, computational models were proposed to show that rhythmic bursting at 20 to 40 Hz occurs through the so-called ping-pong mechanism, that is, the electrotonic interplay between soma and dendrites (Williams & Stuart, 1999; Wang, 1999). A similar backpropagation mechanism was also employed in modeling the gamma-frequency bursting of the pyramidal cells in the electrosensory lateral line lobe of an electric fish (Doiron, Longtin, Turner, & Maler, 2001). Other experimental studies have suggested that a subtype of the Ca²⁺-dependent nonselective cationic current contributes decisively to the generation of the gamma-frequency bursts in neocortical neurons (Kang, Okada, & Ohmori, 1998). We have modeled chattering neurons by incorporating the calcium-dependent cationic current, with the reversal potential being approximately −45 mV (Kang et al., 1998), and have successfully described the burst generation in the entire gamma-frequency range (20–70 Hz) (Aoyagi, Kang, Terada, Kaneko, & Fukai, 2002; Aoyagi, Terada, Kang, Kaneko, & Fukai, 2001).
In this article, we study how the activities of our model chattering neurons (Aoyagi et al., 2002, 2001) are synchronized or desynchronized when they are coupled through AMPA glutamate receptor-mediated mutual excitation. To study the synchronization phenomena analytically, we apply the phase-oscillator analysis to a system of two chattering neurons exhibiting two, three, or more spikes per burst (doublet, triplet, …). Then the essential results of the phase-oscillator analysis are confirmed by numerical simulations in a large-scale network of chattering neurons. We also study whether synchronous firing of chattering neurons may set up synchronous activity in a network consisting of both chattering and regular-spiking pyramidal neurons. In order to elucidate the basic properties of synchronization among pyramidal neurons, we ignore the effects of inhibitory interneurons in the simulations. Synchronous activity in a more realistic model of the local cortical circuitry will be analyzed elsewhere.

2 The Model

2.1 Single-Neuron Model. Our model of a chattering neuron has a single compartment and includes only the ionic currents essential for generating fast rhythmic bursts (Aoyagi et al., 2002, 2001) (see Figure 1A). The time evolution of the membrane potential, V, is described as
Cm dV/dt = −INa − IK − ICa − Icat − ISK − IL + Iapp,  (2.1)
where Cm = 1 μF/cm² is the membrane capacitance and Iapp is an applied current or stimulus. The leak current is given by IL = gL(V − VL), with gL = 0.13 mS/cm² and VL = −68.8 mV. Further details are described below.

2.2 Voltage-Dependent Currents. The voltage-dependent currents are described by the Hodgkin-Huxley formalism (Hodgkin & Huxley, 1952): a gating variable x (x = m for activation and x = h for inactivation) satisfies the first-order kinetics dx/dt = Φx[αx(V)(1 − x) − βx(V)x], with Φx being the temperature factor. The fast spike-generating sodium current is given by INa = gNa m³h(V − ENa), with αm = −0.1(V + 25)/(exp(−0.1(V + 25)) − 1), βm = 4 exp(−(V + 50)/12), αh = 0.07 exp(−(V + 42)/10), and βh = 1/(exp(−0.1(V + 12)) + 1). The delayed rectifier is given by IK = gK m⁴(V − EK), with αm = −0.01(V + 26)/(exp(−0.1(V + 26)) − 1) and βm = 0.125 exp(−(V + 36)/25). The high-voltage-activated calcium current (Kay & Wong, 1987) is given by ICa = m² IGHK, with αm = 1.6/(1 + exp(−0.072(V − 5))) and βm = 0.02(V + 8.69)/(exp((V + 8.69)/5.36) − 1). The Goldman-Hodgkin-Katz current IGHK is given by IGHK = Pmax V([Ca²⁺]in − [Ca²⁺]out ξ)/(1 − ξ), with ξ = exp(−2FV/RT). The parameter values used here are gNa = 130, gK = 35 (mS/cm²), ENa = 55, EK = −97 (mV), F = 96.5
Figure 1: Schematic representation of the chattering neuron model and the typical response to a short current pulse. (A) The ionic channels included in this model are the minimum necessary for replicating the fast rhythmic bursting of chattering neurons based on the present Ca²⁺-dependent mechanism. For the calcium dynamics, we incorporate three processes: the influx via voltage-gated Ca²⁺ channels, extrusion via the Ca²⁺ pump, and buffering of calcium. (B) As the maximum conductance of the cationic current, gcat, is decreased, the amplitude of a hump-like depolarizing afterpotential (DAP) is decreased until it finally vanishes at gcat = 0. (C) A decrease in gSK results in a transient burst of spikes. In B and C, the gray traces represent the responses at the typical parameter values employed throughout the present simulations. The same scale bars apply to both figures.
C/mol, T = 293 K, R = 8.31 J/(K·mol), and Pmax = 0.01 μA/(μM·mV·cm²). For all temperature factors, a single value is used (Φx = 10).

2.3 Calcium-Dependent Currents. Our neuron model has two calcium-dependent currents: a small-conductance potassium current ISK (Kohler et al., 1996) and a calcium-dependent nonselective cationic current Icat. The SK current is a well-known calcium-activated potassium current that contributes to the afterhyperpolarization potentials (AHP). The calcium-dependent cationic current in cortical pyramidal neurons was recently studied electrophysiologically and pharmacologically (Kang et al., 1998). The current was shown to generate the depolarization afterpotential (DAP), on top of which a burst of spikes was generated. The calcium-dependent currents are described as Iy = gy my (V − Ey) (where y represents SK or cat), and the kinetics of the gating variables my are given by dmy/dt = (m∞([Ca²⁺]in) − my)/τy([Ca²⁺]in), with m∞ = [Ca²⁺]in/([Ca²⁺]in + Kd,y) and τy = φy/([Ca²⁺]in + Kd,y). There are two functionally important differences between the two currents. One is in the reversal potential: Ecat = −42 mV, while ESK = EK = −97 mV. The reversal potential of the cationic current, Ecat, can be calculated from the permeability ratio PNa/PK = 0.16 (Kang et al., 1998). The other difference is in the calcium concentration at half-maximal activation, which indicates the sensitivity or affinity to calcium: Kd,cat = 15 μM and Kd,SK = 0.4 μM (Kohler et al., 1996; Xia et al., 1998). We set the other parameter values as gSK = 0.85, gcat = 108 (mS/cm²), and φcat = φSK = 2.8 (μM·msec) to mimic the burst firing patterns observed in experiments (Gray & McCormick, 1996; Steriade et al., 1998; Kang et al., 1998).

2.4 Calcium Dynamics. The calcium dynamics of the model chattering neuron is determined by calcium entry via ICa, slow pump extrusion, and fast buffering, as follows (Robertson, Johnson, & Potter, 1981):

d[Ca²⁺]in/dt = −η ICa + k₋[B]Oc − k₊[Ca²⁺]in[B](1 − Oc) − gpump [Ca²⁺]in/([Ca²⁺]in + Km,pump),

dOc/dt = −k₋Oc + k₊[Ca²⁺]in(1 − Oc).  (2.2)
Here, Oc represents the fraction of the binding sites that are occupied by Ca²⁺, and [B] is the total concentration of the binding sites. The parameter values are set as k₋ = 0.3 msec⁻¹, k₊ = 0.1 msec⁻¹μM⁻¹, Km,pump = 0.75 μM, gpump = 3.6 μM/msec, [B] = 30 μM, and η = 0.027. The values are within biologically reasonable ranges.

2.5 Coupling Dynamics. The AMPA receptor-mediated excitatory synaptic current is modeled as

Isyn = gsyn α(t − τdelay)(V − Esyn),  (2.3)
where Esyn is the reversal potential and τdelay is the synaptic transmission delay (Rall, 1967). The function α(t), which represents the time evolution of the synaptic current evoked by a presynaptic spike arriving at t = 0, is given by a dual exponential function:

α(t) = (1/(τ₂ − τ₁)) [exp(−t/τ₂) − exp(−t/τ₁)].  (2.4)
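A minimal sketch of equation 2.4 and of the temporal summation over presynaptic spikes described in the text (the time constants τ₁ = 0.2 msec and τ₂ = 5.0 msec and the delay of 2 msec are the values used in the simulations; the function names are ours):

```python
import math

TAU1, TAU2 = 0.2, 5.0  # rise and decay time constants (msec)

def alpha_fn(t):
    # Dual exponential conductance waveform, equation 2.4; zero before the spike.
    if t < 0.0:
        return 0.0
    return (math.exp(-t / TAU2) - math.exp(-t / TAU1)) / (TAU2 - TAU1)

def summed_conductance(t, spike_times, g_syn=0.05, delay=2.0):
    # Total synaptic conductance: linear summation of the alpha waveforms
    # evoked by each presynaptic spike, shifted by the transmission delay.
    return g_syn * sum(alpha_fn(t - ts - delay) for ts in spike_times)

# The waveform rises from zero, peaks within about a millisecond, and decays:
print(alpha_fn(0.0), alpha_fn(1.0), alpha_fn(20.0))
```

Because the summation is purely linear, this description ignores the saturation captured by kinetic synapse models, which is exactly the simplification the authors justify below.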
When multiple presynaptic spikes occur in succession, the total change in the conductance is given by a temporal summation of the α(t)'s evoked by the individual presynaptic spikes. We could have used a different model of synapse, such as the kinetic model (Destexhe, Mainen, & Sejnowski, 1994), that describes the saturation behavior of the synaptic responses to multiple spikes more accurately. Results of preliminary simulations, however, have confirmed that both models of synapses give no qualitative difference in the performance of synchronization. Therefore, we employ the alpha function-based description, since it reduces the computational load significantly. Typically, we use the following parameter values in simulations: τ₁ = 0.2 msec, τ₂ = 5.0 msec, gsyn = 0.05 mS/cm², Esyn = 0 mV, and τdelay = 2 msec.

2.6 Externally Applied Current. In elucidating the basic properties of a single chattering neuron, we set Iapp as a constant external current and let the neuron fire periodically. However, in investigating the synchronization behaviors of the neurons, we regard Iapp as a temporal summation of many excitatory synaptic inputs mediated by the AMPA glutamate receptors and adopt the following equation,

Iapp = gapp(V − Eapp),  (2.5)

where the summed effective synaptic conductance, gapp, may be constant (Tiesinga, Jose, & Sejnowski, 2000). The reversal potential is fixed at Eapp = 0 mV. In some cases where we are interested in more realistic situations, we assume that the synapse (gapp) is activated by a Poisson spike train of a constant mean rate.

2.7 Numerical Methods. Numerical computations were performed on Pentium PCs and DEC Alpha machines running the Linux operating system. The software for the computations was written in C++. The ordinary differential equations were integrated using the formula of Dormand and Prince, with an adaptive step size that ensures an error tolerance of 10⁻⁶ (Dormand & Prince, 1980). To examine the validity of the numerical results, some simulations were conducted with smaller error tolerances. In addition, we repeated some simulations using the Fehlberg-Runge-Kutta method and found no significant differences in the results obtained by the two numerical algorithms.
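As an illustration of integrating one component of the model, the fragment below advances the buffer-occupancy equation of equation 2.2 with [Ca²⁺]in held fixed, using a fixed-step fourth-order Runge-Kutta scheme (a simplification of the adaptive Dormand-Prince integrator used in the paper). The rate constants k₋ = 0.3 msec⁻¹ and k₊ = 0.1 msec⁻¹μM⁻¹ are those of section 2.4; the function names and the clamped-calcium setup are ours. With [Ca²⁺]in clamped at a constant C, Oc relaxes to the steady state k₊C/(k₊C + k₋).

```python
K_MINUS, K_PLUS = 0.3, 0.1  # unbinding (1/msec) and binding (1/(msec*uM)) rates

def doc_dt(oc, ca):
    # dOc/dt = -k_minus*Oc + k_plus*[Ca]in*(1 - Oc)  (part of equation 2.2)
    return -K_MINUS * oc + K_PLUS * ca * (1.0 - oc)

def integrate_oc(oc0, ca, t_end, dt=0.01):
    # Classical fixed-step RK4; the paper instead used the adaptive
    # Dormand-Prince pair with a 1e-6 error tolerance.
    oc, t = oc0, 0.0
    while t < t_end:
        k1 = doc_dt(oc, ca)
        k2 = doc_dt(oc + 0.5 * dt * k1, ca)
        k3 = doc_dt(oc + 0.5 * dt * k2, ca)
        k4 = doc_dt(oc + dt * k3, ca)
        oc += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        t += dt
    return oc

# With [Ca2+]in clamped at 1 uM, Oc relaxes toward 0.1/(0.1 + 0.3) = 0.25:
print(integrate_oc(0.0, 1.0, 50.0))
```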
3 Results

3.1 Single Neuron. In our model neuron, gamma rhythmic bursting arises from successive activation and inactivation of the Ca²⁺-dependent cationic current and the SK current, if the magnitudes of both currents are appropriately adjusted. The details regarding the single-neuron responses were reported in previous work (Aoyagi et al., 2002). Here, we briefly review the essential properties of the gamma rhythmic bursting in this model. In particular, we show how modulations of the Ca²⁺-dependent nonselective cationic current affect the bursting pattern of the model neuron. Many experiments have reported cholinergic modulations of the Ca²⁺-dependent nonselective cationic current in hippocampal neurons (Caeser, Brown, Gahwiler, & Knöpfel, 1993; Fraser & MacVicar, 1996; Guerineau, Bossu, Gahwiler, & Gerber, 1995) and the resultant induction of hippocampal gamma oscillations (Fisahn et al., 2002). In this study, modulations of the cationic current are crucial for regulating the coherence of the network activity, as shown later. Figure 1B shows the responses to a short current pulse when the maximal conductance of the Ca²⁺-dependent cationic current or the Ca²⁺-dependent potassium current is varied. The gray traces represent the neuronal responses for the typical parameter values used in the numerical simulations. A depolarizing afterpotential (DAP) follows each action potential and is essential for the gamma-rhythmic bursting. Without the Ca²⁺-dependent cationic current, no DAP is observed (the black trace). A decrease in the maximum conductance of the SK current enhances the amplitude of the DAP, eventually generating a burst of spikes (see Figure 1C). In the simulations shown below, the values of the parameters were tuned such that the amplitude of the DAP is consistent with the results of electrophysiological studies (Gray & McCormick, 1996; Kang, 1997; Haj-Dahmane & Andrade, 1997).
The responses of the model neuron to long-lasting depolarizing current pulses are quite similar to the chattering patterns observed in real neurons (Kang & Kayano, 1994; Gray & McCormick, 1996; Steriade et al., 1998; Brumberg et al., 2000). The bursting frequency (the inverse of the interburst interval) increases with the intensity of the injected current Iapp (see Figure 2A), until the bursting behavior is finally replaced by tonic firing. Most importantly, the model neuron exhibits stable doublet firing (two spikes per burst) over a wide range of the current intensity. For a given intensity of the injected current, the model neuron displays an increasing number of intraburst spikes as Kd,cat is decreased (see Figure 2C). Note that the lower the value of Kd,cat, the more sensitive the cationic current is to Ca²⁺. Although the number of intraburst spikes varies in this way, the bursting frequency remains almost constant against the modulations of the cationic current. The behavior of the neuron model is depicted in Figures 2B and 2D in terms of interburst frequency and intraburst ISI (interspike interval in a
Figure 2: Typical responses of the model neuron to prolonged current pulses. (A) An increase in the current intensity results in a decrease in the interburst interval (equivalently, an increase in the bursting frequency). (B) The number of spikes per burst, the bursting frequency, and the intraburst ISI are plotted as functions of the current intensity. (C) A decrease in Kd,cat, which implies an increased sensitivity of the cationic current to Ca²⁺, increases the number of spikes per burst without significantly affecting the bursting frequency. (D) The number of intraburst spikes, the bursting frequency, and the interval between every successive spike pair in a burst are plotted as functions of Kd,cat.
burst). As the current intensity is increased, the interburst frequency increases almost linearly from 20 Hz to 70 Hz (see Figure 2B, middle), which constitutes the entire gamma-frequency band (Gray & McCormick, 1996). By contrast, the intraburst spike interval remains almost constant, specifically, less than 3 msec, except for weak injected current (see Figure 2B, bottom). Thus, the current intensity has little effect on the intraburst ISI. Figure 2D displays the dependence of the number of intraburst spikes and the bursting frequency on Kd,cat. Within each bursting mode (defined by each fixed number of intraburst spikes), the interburst frequency is a weakly increasing function of Kd,cat (see Figure 2D, middle). At the points where the number of spikes per burst increases, the interburst frequency falls discontinuously to a lower frequency. Consequently, the bursting frequency changes little over a wide range of Kd,cat. To summarize, we have two parameters to control the burst firing in the chattering neuron model: the interburst frequency and the number of intraburst spikes are almost independently regulated by the current intensity and the calcium sensitivity of the cationic current, respectively. By simulations, we have found that a decrease in the maximal conductance of the cationic current and an increase in Kd,cat produce a similar change in bursting behavior. Therefore, we modulate the value of Kd,cat but not that of gcat in most of the following simulations.

3.2 A System of Two Coupled Neurons

3.2.1 Phase Reduction Technique. Our primary interests are in the synchronization phenomena appearing in large-scale networks of chattering neurons. As a first step, we consider a network of two identical neurons coupled symmetrically with excitatory synapses. The so-called phase reduction technique provides a useful framework for analyzing the synchronization of the coupled neurons. In the following, we briefly describe this technique.
(Readers who are not interested in the mathematical details of this technique may skip this section.) Any system of two coupled neurons can be reduced to a system of simple coupled phase oscillators if the synaptic coupling is weak, that is, if the periodic orbit of an isolated neuron is kept unchanged by the coupling. In the reduced model, the state of each neuron can be characterized by a single phase variable φi, which is defined along the orbit and describes the timing of the periodic spikes of neuron i (i = 1, 2). The reduced phase equations then take the general form (Kuramoto, 1984; Ermentrout & Kopell, 1984)

dφ₁/dt = ω + Γ(φ₂ − φ₁),
dφ₂/dt = ω + Γ(φ₁ − φ₂),  (3.1)
where ω is the intrinsic firing frequency of an uncoupled neuron. The interaction function Γ(φ) is determined by the specific form of the synaptic coupling and the dynamical nature of the periodic orbit of an isolated neuron. For the synaptic coupling given in equation 2.3, the interaction function is given by

Γ(φ) = (1/T) ∫₀ᵀ ZV(t)(−gsyn α(t + φ − τdelay))(V(t) − Esyn) dt,  (3.2)

where T is the period of repetitive firing. The function ZV(t) is the voltage component of the phase response function and describes how an external perturbation advances or delays the phase in the reduced system (Ermentrout, 1996; Hansel, Mato, & Meunier, 1995). Since the interaction function in equation 3.1 depends only on the phase difference between the two neurons, the phase dynamics in equation 3.1 can be rewritten in terms of Δφ = φ₂ − φ₁ as

dΔφ/dt = Γodd(Δφ),  (3.3)

where Γodd(φ) = Γ(−φ) − Γ(φ). Any mode of phase locking in this system corresponds to a solution of the equation Γodd(Δφ) = 0. There are always two obvious solutions, Δφ = 0 and Δφ = π, which define a synchronous firing state and an antisynchronous firing state, respectively. A solution Δφ can be reached in a steady state if it satisfies the stability condition

Γ′odd(Δφ) < 0,  (3.4)

where the prime denotes differentiation with respect to Δφ.

3.2.2 Typical Simulation Results: A Transition Between Synchrony and Antisynchrony. Figure 3 depicts the typical behavior of the two-neuron network when the bursting mode is changed. Here, the bursting mode is controlled by altering the value of Kd,cat in each neuron. The magnitude of gapp in equation 2.5 is set as gapp = 0.068 mS/cm², such that the individual neurons, when decoupled, discharge periodically. As shown in Figure 3A, the neurons fire antisynchronously when they exhibit singlet firing. If the calcium sensitivity of the cationic current is slightly increased, the bursting mode changes to doublet firing, and at the same time, a synchronous state becomes stable (see Figure 3B). If the calcium sensitivity is increased further, the synchronous state becomes unstable and the coupled system again evolves into the antisynchronous state (see Figure 3C). Interestingly, when the bursting pattern is changed from doublet firing to triplet firing with a slight additional increase in the calcium sensitivity, a similar transition from almost antisynchrony to synchrony occurs (see Figure 3D). The phase reduction technique shows more clearly that the network undergoes the changes in the firing mode and the transitions between synchrony and antisynchrony simultaneously.
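The fixed-point analysis of equations 3.3 and 3.4 can be sketched numerically. The interaction function below is a toy truncated Fourier series (its coefficients are our invention, not derived from the chattering-neuron model); the sketch locates the stable phase differences as the downward zero crossings of Γodd:

```python
import math

def gamma(phi):
    # Toy interaction function (hypothetical coefficients, illustration only).
    return -0.3 * math.sin(phi) + 0.2 * math.cos(phi) + 0.1 * math.sin(2 * phi)

def gamma_odd(dphi):
    # Odd part, equation 3.3: Gamma(-dphi) - Gamma(dphi).
    return gamma(-dphi) - gamma(dphi)

def stable_phase_differences(n=20000):
    # Scan [0, 2*pi) for sign changes of Gamma_odd with negative slope
    # (the stability condition, equation 3.4), refining roots by bisection.
    roots = []
    step = 2 * math.pi / n
    for i in range(n):
        a, b = i * step, (i + 1) * step
        fa, fb = gamma_odd(a), gamma_odd(b)
        if fa > 0.0 >= fb:  # downward zero crossing => stable fixed point
            for _ in range(60):
                m = 0.5 * (a + b)
                if gamma_odd(m) > 0.0:
                    a = m
                else:
                    b = m
            roots.append(0.5 * (a + b))
    return roots

print(stable_phase_differences())
```

For this toy function the only stable locked state is Δφ = π, that is, antisynchrony; the in-phase solution Δφ = 0 exists but fails the slope condition of equation 3.4.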
3.2.3 The Reduced Phase Model of Coupled Chattering Neurons. Applying the phase oscillator analysis explained above, we examined the stable value of the phase difference between the spikes of two coupled chattering neurons. In Figure 3E, the results of the analysis are given for the cases considered in Figures 3A and 3B. The top panel in Figure 3E displays the voltage component of the phase response functions over one period, and the bottom panel displays the odd part of the interaction function derived from the obtained phase response function and the form of the synaptic coupling. In Figure 3E, we can see a rapid switch of the stable state from antisynchrony to synchrony when singlet spiking is replaced by doublet bursting. A similar transition takes place when the doublet bursting mode is replaced by the triplet bursting mode (see Figure 3F). Both transitions from antisynchrony to synchrony accompany a sudden emergence of a narrow peak at φ ≈ 0 in ZV(φ). These results are consistent with the general observations that a synchronous state is usually stable if ZV(φ) has a significant peak near zero phase and that an antisynchronous state tends to be stable if the peak vanishes. To study the relationship between the bursting mode and the stable network states, we calculated the interaction function Γ(φ) while varying the value of Kd,cat to change the number of spikes per burst. With this manipulation, the bursting pattern of the individual chattering neurons changes from singlet firing to quartet firing, until it finally reaches high-frequency tonic firing at Kd,cat < 8 μM. The obtained values of the equilibrium phase difference are plotted in Figure 3G as a function of Kd,cat. The results clearly show a close relationship between the sudden exchanges of stability and the changes in the bursting mode. Figure 3G also shows that the prediction of the reduced phase model is in fair agreement with the result obtained by direct numerical simulations of the full original model.
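The interaction function used in this section can be approximated by direct quadrature of equation 3.2. In the sketch below, ZV and V are toy periodic stand-ins (our invention, not the model's actual phase response or voltage trace), and the alpha waveform of equation 2.4 is wrapped to one firing period to represent the periodic synaptic input:

```python
import math

T = 25.0               # firing period (msec), in the gamma range
GSYN, ESYN = 0.05, 0.0
TAU1, TAU2, DELAY = 0.2, 5.0, 2.0

def alpha_fn(t):
    # Dual exponential waveform of equation 2.4, made T-periodic.
    t = t % T
    return (math.exp(-t / TAU2) - math.exp(-t / TAU1)) / (TAU2 - TAU1)

def z_v(t):
    # Toy phase response function (our invention, not the model's ZV).
    return 1.0 + math.cos(2 * math.pi * t / T)

def v(t):
    # Toy membrane potential waveform (our invention), in mV.
    return -60.0 + 30.0 * math.cos(2 * math.pi * t / T)

def interaction(phi, n=2000):
    # Riemann-sum approximation of the integral in equation 3.2.
    dt = T / n
    acc = 0.0
    for i in range(n):
        t = i * dt
        acc += z_v(t) * (-GSYN * alpha_fn(t + phi - DELAY)) * (v(t) - ESYN)
    return acc * dt / T

def interaction_odd(dphi):
    # Gamma_odd of equation 3.3.
    return interaction(-dphi) - interaction(dphi)

# Gamma_odd is odd by construction, so it vanishes at zero phase lag:
print(abs(interaction_odd(0.0)) < 1e-12)
```

In the actual analysis, ZV(t) and V(t) come from the chattering-neuron limit cycle, and the resulting Γodd changes shape abruptly whenever the bursting mode changes.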
Although the theory of phase reduction is strictly valid only in the weak-coupling limit, we have found numerically that it captures the qualitative behavior of the coupled neuron system for larger coupling as well.

3.2.4 Effects of Other Model Parameters. The bursting mode can be changed not only by altering the calcium sensitivity of the cationic current but also by altering other model parameters, such as the maximal conductance of the cationic current (gcat) and the intensity of an applied current (gapp; see equation 2.5). In this section, we show that a change in the bursting mode, regardless of its cause, results in a transition between synchrony and antisynchrony. For this purpose, we calculated the values of the stable phase difference between the spikes of a neuron pair as functions of several model parameters. Figure 4A depicts the stable phase difference as a function of the stimulus current intensity, gapp. As gapp is increased, the bursting mode changes from singlet firing to doublet firing, eventually reaching tonic firing. At the boundary between singlet and doublet firing, the stable state switches from the antisynchronous state to the synchronous state.
Such a sudden exchange of stability also occurs when the bursting mode is changed by altering either gcat (see Figure 4B) or gSK (see Figure 4C). These results strongly indicate that the exchange of stability between synchrony and antisynchrony is always induced by some change in the bursting mode, regardless of what causes the change.
So far, the spike transmission delay between the neuron pair was assumed to have a constant value, τdelay = 2 msec. Next, we investigated the effect of the synaptic delay on the stable phase difference. Figure 4D plots the stable phase differences as functions of the synaptic delay, τdelay, for two values of Kd,cat, one yielding singlet firing and one yielding doublet bursting. Interestingly, in a physiologically reasonable range of synaptic delays (0.5–4.5 msec), perfectly synchronous firing is established by doublet bursting, whereas perfectly antisynchronous firing is obtained for singlet firing. Thus, the synaptic delay enhances the separation between the synchronous doublet firing and the antisynchronous singlet firing in the coupled chattering neurons.

3.2.5 Robustness of Switching Phenomena Against Noisy Stimuli. In reality, cortical neurons are innervated by a large number of synaptic inputs. Therefore, we use an equation similar to equation 2.3 instead of equation 2.5 to describe the external stimuli generated by random synaptic inputs. The activation times of the synapses are determined according to an independent Poisson process with mean rate λ. We denote the maximum conductance of these synapses as ginj and that of the recurrent excitatory synapses as gsyn to
Figure 3: Facing page. The results of the phase oscillator analysis of a coupled pair of chattering neurons. The synaptic coupling is excitatory and reciprocal (gsyn = 0.05 mS/cm²). (A) After a transient period of several gamma cycles, the two neurons exhibiting singlet firing are antisynchronized; that is, their spikes are mutually phase-locked with a relative phase difference of about π. Here, Kd,cat = 17 μM. (B) When the cationic current is slightly more sensitive to Ca²⁺ (Kd,cat = 16.4 μM), the neuron pair exhibits doublet firing and the spikes are in almost in-phase synchrony. (C) A further increase of the calcium sensitivity (Kd,cat = 12 μM) turns the doublet firing antisynchronous. (D) At Kd,cat = 11.4 μM, the neuron pair makes a transition to triplet firing. The neuron pair, initially discharging antisynchronously, evolves into a synchronously firing state, but the transient period is rather long. (E) The phase response function ZV(φ) (top) and the odd component of the interaction function Γodd(φ) (bottom) are displayed for the singlet firing and the doublet firing shown in A and B, respectively. (F) ZV(φ) (top) and Γodd(φ) (bottom) are displayed for the doublet firing and the triplet firing shown in C and D, respectively. These functions change drastically at the points where the bursting mode of each neuron changes, leading to a transition in the stable relative phase difference. (G) The stable phase difference between the nearest pairs of spikes of the two neurons is displayed as a function of Kd,cat. The results obtained by direct numerical simulations of the full original coupled neurons are also shown (×). The four cases displayed in A–D correspond to the four states designated by the arrows. A change in the bursting mode always triggers a rapid exchange between synchronous firing and antisynchronous firing.
Figure 4: Roles of various parameters in determining the stable phase difference. (A–C) Increasing the maximal conductance of the effective external current (gapp), the cationic current (gcat), or the SK current (gSK) induces an upward transition in the bursting mode. As in the previous case, a rapid transition from synchronous firing to antisynchronous firing occurs at every point where the number of spikes per burst increases. (D) The stable phase differences are plotted against the synaptic delay τdelay for singlet firing (gray circles) and doublet firing (black circles). In B–D, the vertical arrows indicate the typical parameter values used throughout the simulations.
avoid confusion. In the following simulations, the mean firing rate and the maximal conductance were set as ν = 68 spikes per msec and ginj = 0.001 mS/cm². The left panels in Figure 5 display the time evolutions of the membrane potentials of two neurons at various levels of the calcium sensitivity of the cationic current. In the right panels, the respective spike cross-correlograms are displayed. In Figures 5A and 5B, neuronal responses are shown on either side of the boundary between the singlet firing state and the doublet firing state in the parameter space. The central peak in the cross-correlogram in Figure 5B appears at almost zero phase lag, which indicates coincident firing of the two neurons with spike doublets. In the case of singlet firing in Figure 5A, in contrast, the peaks are displaced by finite phase lags of about
Gamma Rhythmic Bursts
[Figure 5 panels A–D: voltage traces of the two neurons (scale bars: 50 msec, 60 mV) and cross-correlograms (lag axis from −40 to 40 msec) at Kd,cat = 18.0, 16.2, 12.4, and 12.0 µM.]
Figure 5: Transitions between antisynchrony and synchrony in a pair of chattering neurons responding to random synaptic input currents. The voltage traces of the two neurons (left) and the cross-correlograms between the neuron pair (right) are shown for the values of Kd,cat specified in the figure. The cross-correlograms (bin size = 1 msec) in A and C show finite phase lags (~10 msec) between the two neurons, while those in B and D show no phase lag. These changes indicate the stability exchange between antisynchrony and synchrony, as in the case of a constant applied stimulus.
10 msec, which indicates that the two neurons fire in almost antiphase. Figures 5C and 5D display the voltage traces and the cross-correlograms on either side of the boundary between the doublet bursting state and the triplet bursting state. In this case, doublet bursting shows antiphase synchrony, and triplet bursting shows synchrony. Thus, as was the case for the constant applied current, the stability exchange occurs between in-phase synchrony and antiphase synchrony at the boundary where the bursting mode of the individual neurons is also changed. The phenomenon is robust as long as the randomness of the synaptic inputs is not too strong.
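The cross-correlograms of Figure 5 are simple histograms of pairwise spike-time differences. The following self-contained sketch is illustrative only; the 1-msec bin size matches the value quoted above, and the ±40 msec lag window is an assumption taken from the figure axes:

```python
import numpy as np

def cross_correlogram(spikes_a, spikes_b, bin_ms=1.0, window_ms=40.0):
    """Histogram of spike-time differences t_b - t_a within +/- window_ms.
    A central peak at zero lag indicates synchronous firing; peaks
    displaced by a finite lag (e.g., ~10 msec) indicate antiphase firing."""
    diffs = [tb - ta
             for ta in spikes_a
             for tb in spikes_b
             if -window_ms <= tb - ta <= window_ms]
    edges = np.arange(-window_ms, window_ms + bin_ms, bin_ms)
    counts, _ = np.histogram(diffs, bins=edges)
    return counts, edges

# Two identical spike trains (times in msec) give a central peak at zero lag.
counts, edges = cross_correlogram([10.0, 20.0, 30.0], [10.0, 20.0, 30.0])
```

Applied to the simulated voltage traces, spike times would first be extracted by threshold crossing before computing the correlogram.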
3.3 Large-Scale Excitatory Networks. In this section, we investigate the dynamics of a large-scale network of chattering model neurons. It is well known that in the weak coupling limit, a globally coupled network of identical neurons cannot achieve full synchrony if a synchronous state is unstable in two coupled neurons of the same type (Okuda, 1993). Given this fact, and on the basis of the previous results for two coupled neurons, we may expect that the coherence of the activity in a large network can be enhanced or suppressed by changing the intensity of an external input or the intrinsic properties of the cationic current, such as the calcium sensitivity and the maximum conductance. To see how the coherence of activity is regulated in a large-scale network by, for instance, modulations of the calcium sensitivity, we simulated fully connected excitatory networks of 100 to 300 neurons. As for the external stimuli to the individual neurons, independent Poisson spike trains of mean rate ν = 132 spikes/msec were given to the summed effective synapses having a maximal conductance of ginj = 0.0005 mS/cm². Wherever necessary, the conductance of the recurrent synaptic connections is specified in figure captions.

3.3.1 Rapid Synchrony-Asynchrony Transitions in a Large-Scale Network of Chattering Neurons. Figures 6A and 6B display a simulated rastergram and a spike histogram of 100 chattering neurons (bin width is 1 msec). From 500 msec to 1000 msec, the sensitivity of the cationic current to Ca²⁺ was raised to a high level (Kd,cat = 16.4 µM). Otherwise, Kd,cat = 18 µM. At this low level of calcium sensitivity, the neurons show singlet firing, and the timing of the spikes is asynchronous. The asynchronously firing state can be regarded as a counterpart of the antisynchronously firing state in two coupled neurons. In the period of high calcium sensitivity, the neurons exhibit highly coherent doublet firing, as expected from the synchronous doublet firing of a neuron
Figure 6: Facing page. Switching the coherence of activity in a large-scale network of uniformly connected chattering neurons. gsyn = 0.002 mS/cm². (A) Rastergram of 100 chattering neurons. The value of Kd,cat is lowered from 500 msec to 1000 msec. (B) The spike histogram (bin size = 1 msec) shows increased spike counts during this period. (C) The time evolution of the overall network coherence C(T) shows an apparent increase during the same period. Δ = 100 msec in equation 3.5, which defines C(T). (D) A rastergram is displayed for a network of 300 chattering neurons, where Kd,cat (µM) is modulated only between neurons 100 and 200 during the same period. gsyn = 0.0005 mS/cm². (E) Spike histograms are shown for the neurons belonging to the two groups (labeled A and B) in D. (F) The time evolution of the overall network coherence in group B shows a gradual increase during the period of enhanced calcium sensitivity. (G) How the steady values of C(T) are correlated with the averaged number of intraburst spikes is shown as a function of Kd,cat. Here, gsyn = 0.0005 mS/cm². In the simulations, C(T) was calculated in the steady state following an initial transient behavior.
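The external drive described above is a set of independent Poisson spike trains fed through effective synapses. As an illustration of the input generation only (the conductance dynamics of the model's synapses are not reproduced here), a minimal sketch that draws one homogeneous Poisson train from exponential inter-spike intervals:

```python
import numpy as np

def poisson_spike_train(rate, duration, rng=None):
    """Homogeneous Poisson spike times on [0, duration): successive
    inter-spike intervals are i.i.d. exponential with mean 1/rate."""
    rng = np.random.default_rng() if rng is None else rng
    times = []
    t = rng.exponential(1.0 / rate)
    while t < duration:
        times.append(t)
        t += rng.exponential(1.0 / rate)
    return np.array(times)

# With rate r and duration d, the expected spike count is r * d.
train = poisson_spike_train(rate=100.0, duration=10.0,
                            rng=np.random.default_rng(0))
```

Each neuron in the simulated network would receive its own independent train of this kind, so that any coherence in the output reflects the recurrent coupling rather than shared input.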
[Figure 6 panels A–G: rastergrams of 100 and 300 chattering neurons with Kd,cat switched between 18.0 and 16.4 µM, spike histograms, time evolutions of C(T) over 0–1500 msec, and the steady-state C(T) and spikes per burst plotted against Kd,cat from 18 down to 8 µM.]
pair. The coherence of the network activity increases very quickly, within a few gamma cycles of the onset of the high-calcium-sensitivity period. To quantify the degree of synchronization, we define the normalized cross-correlation of the membrane voltages over the duration Δ at time T, C_ij(T), as

C_ij(T) = [∫_{T−Δ}^{T} (V_i(t) − V̄_i)(V_j(t) − V̄_j) dt] / √( ∫_{T−Δ}^{T} (V_i(t) − V̄_i)² dt · ∫_{T−Δ}^{T} (V_j(t) − V̄_j)² dt ),   (3.5)
where V̄_i = (1/Δ) ∫_{T−Δ}^{T} V_i(t) dt. We average C_ij(T) over all neuron pairs to measure the coherence over the entire network: C(T) = ⟨C_ij(T)⟩_ij. We set Δ to 100 msec, which makes C(T) sensitive to rapid changes in the coherence. The time evolution of C(T) is displayed in Figure 6C, which manifests the rapid changes at the moments that the calcium sensitivity is altered. These rapid changes in C(T) indicate rapid transitions between synchrony and asynchrony at these moments.

3.3.2 Selective Induction of Synchronization in a Subgroup of Chattering Neurons. Gamma oscillation is considered to induce coherent activity in a relatively localized area of the cortex (Kopell, Ermentrout, Whittington, & Traub, 2000; von Stein & Sarnthein, 2000). Therefore, it is intriguing to examine whether a selective enhancement of the calcium sensitivity of the cationic current within a local subgroup of chattering neurons can induce highly coherent activity in the subgroup. In Figure 6D, only those neurons with indices 100 to 200 (labeled B in the figure) were set in the state of high calcium sensitivity from 500 msec to 1000 msec. The other neurons remained in the state of low calcium sensitivity during the entire period of the simulations. Note that all the neurons were mutually connected, as in the previous case. During the period of the high sensitivity to Ca²⁺, only the neurons belonging to the selected subgroup showed doublet firing, while the others continued to show singlet firing. The spike histogram (bin width = 1 msec) of the selected subgroup showed increased spike counts during the period, indicating an increased degree of synchronization. Similar histograms of the other subgroups (the 100-neuron ensemble, labeled A) showed almost no increase of the spike count during the same period (see Figure 6E). Correspondingly, we can see that the coherence of the neuronal activity is enhanced in subgroup B, but not in subgroup A, during this period (see Figure 6F).
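Equation 3.5 and the pairwise average translate directly into code. The following is a minimal sketch under an assumed discretization: the integrals become sums over the samples in the window of length Δ ending at T (the time step cancels in the ratio):

```python
import numpy as np

def network_coherence(V, T_idx, delta_idx):
    """C(T): normalized cross-correlations C_ij(T) of the windowed,
    mean-subtracted voltages, averaged over all neuron pairs.
    V has shape (n_neurons, n_timesteps)."""
    W = V[:, T_idx - delta_idx:T_idx]       # window of length delta
    W = W - W.mean(axis=1, keepdims=True)   # subtract the window mean V_bar_i
    n = W.shape[0]
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            denom = np.sqrt(np.sum(W[i] ** 2) * np.sum(W[j] ** 2))
            if denom > 0:
                total += np.sum(W[i] * W[j]) / denom
                pairs += 1
    return total / pairs

# Identical traces give C(T) = 1; a perfectly anti-correlated pair gives -1.
t = np.linspace(0.0, 1.0, 200)
v = np.sin(2 * np.pi * 5 * t)
c_sync = network_coherence(np.vstack([v, v, v]), T_idx=200, delta_idx=100)
c_anti = network_coherence(np.vstack([v, -v]), T_idx=200, delta_idx=100)
```

Because each C_ij(T) is a correlation coefficient, C(T) is bounded by 1, and the short window (Δ = 100 msec) makes it responsive to the rapid switches visible in Figure 6C.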
Since both the intergroup recurrent synapses and the intragroup ones have the same magnitudes and the same connectivity, the differences in the coherence of activity between the two subgroups originate from the differences in the sensitivity to Ca²⁺ of the constituent neurons. Similar synchrony-asynchrony transitions could be induced in a subgroup of neurons by selectively changing the intensity of the external current to the subgroup (results not shown).

3.3.3 Correlated Changes in the Bursting Mode and the Activity Coherence. The results shown in Figure 6 indicate that in the large-scale network, the abrupt state transitions between synchrony and asynchrony are strongly correlated with the change in the bursting mode, as was the case for the similar transitions between synchrony and antisynchrony in pairs of chattering neurons (see Figure 3G). To prove this, we have calculated the equilibrium values of C(T), that is, the overall coherence of the network activity, as a function of Kd,cat defined over the same range as in Figure 3G. The results are shown in Figure 6G together with the averaged number of spikes per burst over all neurons in the network (bottom trace).
It can be clearly seen that the value of C(T) abruptly increases from almost zero to peak levels whenever the value of Kd,cat falls below the critical values at which the averaged number of intraburst spikes is increased. This implies that abrupt transitions from asynchrony to synchrony occur when the number of intraburst spikes is increased in large-scale networks of chattering neurons.

3.3.4 Regular-Spiking Neurons Synchronized by a Small Ensemble of Chattering Neurons. It is likely that the ratio of chattering neurons in the entire population of neocortical pyramidal neurons is not very large. Chattering neurons belong to a minority of the neocortical neuronal subtypes. Therefore, it is important to study whether the rapid synchronization of chattering neurons can synchronize the activity of other regular-spiking pyramidal neurons. Here, we show some preliminary results of simulations of a network consisting of both chattering and regular-spiking pyramidal neurons. The mixture network consisted of 25 chattering neurons and 75 regular-spiking pyramidal neurons. Both types of neurons were uniformly connected by excitatory synaptic connections with gsyn = 0.002 mS/cm². The regular-spiking neuron was modeled similarly to the chattering neuron, except that the cationic current was absent in the former. The mean firing rates of the external inputs were 108 spikes/msec for the chattering neurons and 132 spikes/msec for the regular-spiking neurons. The synaptic conductance of the external inputs was set as ginj = 0.0005 mS/cm². A typical result is shown in Figure 7. The figure demonstrates that the chattering neurons, which are the minority in the network, can synchronize the activities of the regular-spiking pyramidal neurons and, consequently, the activity of the whole network. It should be noted that the regular-spiking neurons could fire only asynchronously in the absence of chattering neurons (results not shown).
In Figure 7, however, the transition from asynchrony to synchrony is rather slow in comparison with the result shown in Figure 6. In fact, in the present case, C(T) reached a peak level only after almost 500 msec from the onset of doublet spiking in the chattering neurons (bottom traces). How is the performance of the network influenced by the ratio of chattering neurons in the whole neural population? To see this, we calculated C(T) separately for the individual subnetworks of chattering neurons and regular-spiking neurons while varying the population ratio of the chattering neurons (Pch). We also measured how long it took for the individual subnetworks to develop synchronously firing states and to desynchronize these states. As shown in Figure 8A, the subnetworks show phase transition-like behaviors. There is a critical value of Pch below which an increase of Pch results in an improved coherence of the activity in both subnetworks. In this region of the parameter values, the larger the value of Pch, the faster are both synchronization (see Figure 8B) and desynchronization (see Figure 8C), although these changes occur only slowly, in 500 msec to 1 sec.
[Figure 7 panels A–C: rastergram of 100 neurons (25 CH, 75 RS), spike histograms (spikes/msec) for the RS and CH groups, and C(T) traces for the two groups over 0–2500 msec.]
Figure 7: Entrainment of regular-spiking (RS) pyramidal neurons by synchronous firing of chattering (CH) neurons. (A) In the rastergram, the first 25 neurons are chattering neurons, and the other 75 neurons are regular-spiking neurons. In the chattering neurons, the value of Kd,cat, which was otherwise set as Kd,cat = 18 µM, was changed to Kd,cat = 16.5 µM during the period indicated by the arrow labeled S. This modulation induces the typical transition from singlet firing to doublet firing in the discharge pattern of the chattering neurons. (B) Spike histograms (bin width = 1 msec) of the regular-spiking and chattering neurons display slow increases in the spike count in the period S. In addition, the desynchronization following this period occurs rapidly among the chattering neurons but rather slowly among the regular-spiking neurons. (C) The overall coherence C(T) shows that the synchronization of the regular-spiking neurons follows slightly behind that of the chattering neurons and that the desynchronization of the regular-spiking neurons lasts long after that of the chattering neurons.
Unexpectedly, if Pch is greater than the critical value, the overall performance in synchronization was significantly degraded. The subnetwork of chattering neurons could rapidly fall into a synchronously firing state, but the subnetwork of regular-spiking neurons never achieved synchrony (in Figure 8A, Pch > 60%). This implies that the chattering neurons failed to entrain the regular-spiking neurons at the gamma frequencies. We have found by numerical simulation that this critical value of Pch can be made as small as about 60% with relatively strong recurrent excitatory connections. Making the recurrent connections stronger, however, achieved only the state that appears in the region Pch > 60% in Figure 8A: only chattering neurons, but not regular-spiking neurons, can be synchronized with such stronger recurrent connections. Further studies are required to clarify how this somewhat unexpected behavior appears in the mixture network of chattering and regular-spiking neurons. We do not discuss this issue in detail, as Pch is very likely to be much less than 60% in the neocortical local circuitry. We speculate that synchronization of chattering neurons that occurs too rapidly, which is achievable only for relatively large values of Pch, leaves the regular-spiking neurons far behind, in an asynchronously firing state.

4 Discussion

Our analysis has revealed that the degree of synchrony in a pair of weakly coupled chattering neurons can be modulated by the intensity of the external input and the properties of the calcium-dependent cationic current, such as the maximum conductance and the sensitivity to Ca²⁺. Changing the intensity of the external current changes the interburst interval (bursting frequency) and the number of spikes per burst (the bursting mode), whereas changing the maximum conductance or the calcium sensitivity of the cationic current changes the bursting mode without significantly affecting the interburst interval.
We have revealed that a change in the bursting mode systematically modulates the coherence of neuronal firing regardless of which parameter underlies the change. In a two-neuron network (see Figure 4), antisynchronous firing is abruptly replaced by synchronous firing whenever the neuron pair makes a transition to a higher bursting mode, from doublet firing to triplet firing, and so on. Similarly, asynchronous firing is replaced by synchronous firing in a uniformly connected large-scale network of chattering neurons when they enter a higher bursting mode. Moreover, if only a subgroup of chattering neurons exhibits the transition to a higher bursting mode, the asynchrony-to-synchrony transition also occurs only in that subgroup. The selected neural ensemble is rapidly desynchronized when the modulation is turned off (see Figure 6A). In in vitro experiments, a repetitive depolarization of a cortical pyramidal neuron changes the response from a regular spiking pattern to gamma-frequency rhythmic bursting (Kang et al., 1998; Brumberg et al., 2000). In our neuron model, a similar change in the firing pattern was

[Figure 8 panels A–C: C(T), synchronization time, and desynchronization time (msec) plotted against Pch (%) for the chattering (bold) and regular-spiking (dashed) groups; a vertical arrow marks the ratio used in Figure 7.]

induced by a slight increase of the maximum conductance or the sensitivity to Ca²⁺ of the Ca²⁺-dependent nonselective cationic current (Aoyagi et al., 2001). Although the physiological mechanism of the changes in the in vitro firing pattern is not known yet, modulations by certain neuroactive substances, in particular cholinergic modulation, may play an active role. In fact, many studies demonstrate the enhancement of the Ca²⁺-dependent cationic current by activation of muscarinic receptors (Caeser et al., 1993; Constanti & Libri, 1992; Fraser & MacVicar, 1996; Guerineau et al., 1995). Recently, it has been shown that activation of muscarinic M1 receptors modulates a Ca²⁺-dependent nonselective cationic current as well as a hyperpolarization-activated cationic current (I_h) in hippocampal CA3 pyramidal neurons (Fisahn et al., 2002). The activation of the M1 receptors was required for the induction of hippocampal gamma oscillations. Several modeling studies have attempted to show how neuroactive substances, including acetylcholine, may affect the activity of hippocampal neurons and networks (Pinsky & Rinzel, 1994; Menschik & Finkel, 1999; Tiesinga, Fellous, Jose, & Sejnowski, 2001). As in the hippocampus, cholinergic enhancement of the Ca²⁺-dependent nonselective cationic current in chattering neurons may be crucial for the induction of neocortical gamma oscillations. This study, however, has suggested that a network of chattering neurons may not suffice for entraining a majority of pyramidal neurons at the gamma frequencies. Modeling studies showed that AMPA-type excitatory glutamatergic synapses tend to desynchronize rather than synchronize regular-spiking pyramidal neurons (Hansel et al., 1995; Vreeswijk & Abbott, 1994). In addition, it was theoretically argued that neurons with a nonnegative phase response function (type I) cannot achieve in-phase synchrony (Ermentrout, 1996). Therefore, whether the synchronous firing of chattering neurons, which belong to a minority group in the cortical network, can synchronize the activity of other pyramidal neurons is not a trivial issue. The results of our simulations remain inconclusive. If the population ratio of chattering neurons to regular-spiking neurons is appropriate, the bursting of chattering neurons synchronizes slowly, in a transient time of several hundred milliseconds, to entrain the activities of the regular-spiking pyramidal neurons. If the synchronization of chattering neurons occurs much more
Figure 8: Facing page. Entrainment of regular-spiking neurons at various population ratios of chattering neurons Pch in a large-scale network. gsyn = 0.002 mS/cm². In the figures, bold curves are plotted for chattering neurons, and dashed curves are for regular-spiking neurons. (A) The averaged overall coherence C(T) in the equilibrium state was calculated as a function of Pch for the populations of chattering neurons and of regular-spiking neurons. C(T) was obtained by averaging C(T) over an interval of 1000 msec following an initial transient period. The error bars display the standard deviations over 10 trials. There are two distinct regions. At Pch > 60%, chattering neurons may rapidly evolve into a synchronously firing state, but they fail to entrain regular-spiking neurons; at Pch < 60%, chattering neurons can entrain regular-spiking neurons, but the synchronization occurs rather slowly. (B) The time spent by each neuron group in evolving into the coherent state clearly indicates the phase transition-like behavior. At Pch > 60%, the convergence time for regular-spiking neurons is not shown, since they could not be synchronized. (C) The times spent by the neuron groups in desynchronizing the once-established coherent states are plotted. In the figures, the vertical arrows indicate the population ratio employed in Figure 7.
rapidly, which seems to be realistic, the entrainment does not occur (see Figure 8). How can we remedy the difficulties in synchronizing the entire network activity? It is likely that some interneurons should be incorporated into our network of pyramidal neurons. In fact, the results of recent experiments have suggested that networks of interneurons, which are often connected simultaneously by electrical and GABAergic synapses, may enhance the synchrony of cortical activity (Galarreta & Hestrin, 1999, 2001). In order to describe the synchronization phenomena in the local cortical circuitry more accurately, we must clarify how this dual synaptic wiring facilitates synchrony among interneurons and how interneurons cooperate with chattering neurons in synchronizing and desynchronizing cortical activity. These topics are under investigation. Our results have indicated that a diversity in the intrinsic bursting patterns of neurons is essential and important for a speedy regulation of the coherence among pyramidal cell activities. It seems intriguing to study whether the correlated occurrence of the change in the bursting mode and the asynchrony-synchrony transition is merely a specific feature of the model or a generic feature widely seen in the synchronization of bursting neurons. For instance, is this correlation between the two phenomena also found with the burst firing generated by the so-called Ping-Pong mechanism (Wang, 1999)? What mathematical type of bursting mechanism (Izhikevich, 2000) leads to the correlation? These questions remain for further study. In conclusion, we have shown that modulations of the Ca²⁺-dependent cationic current in chattering neurons allow these neurons to synchronize their spikes within a few cycles of the gamma oscillation. The rapid establishment of synchrony at the gamma frequencies in sensory and motor cortices may represent stimulus-driven attentional influences of higher cognitive centers (Engel, Fries, & Singer, 2001).
Although we must await further studies to make a conclusive statement on the role of the gamma oscillation, the chattering neuron is a promising candidate for the cortical pacemaker of the oscillation.

Acknowledgments

We thank Takeshi Kaneko and Youngnam Kang for valuable comments and suggestions. We also thank Y. Kuramoto for helpful discussions. T.A. is supported by the Japanese Grant-in-Aid for Science Research Fund from the Ministry of Education, Science and Culture.

References

Aoyagi, T., Kang, Y., Terada, N., Kaneko, T., & Fukai, T. (2002). The role of Ca²⁺-dependent cationic current in generating gamma-frequency rhythmic bursts: Theoretical study. Neuroscience, 115, 1127–1138.
Aoyagi, T., Terada, N., Kang, Y., Kaneko, T., & Fukai, T. (2001). A bursting mechanism of chattering neurons based on Ca²⁺-dependent cationic currents. Neurocomputing, 38–40, 93–98.
Azouz, R., Jensen, M., & Yaari, Y. (1996). Ionic basis of spike afterdepolarization and burst generation in adult rat hippocampal CA1 pyramidal cells. J. Physiol., 492, 211–223.
Brumberg, J., Nowak, L., & McCormick, D. (2000). Ionic mechanisms underlying repetitive high-frequency burst firing in supragranular cortical neurons. J. Neurosci., 20, 4829–4843.
Buhl, E. H., Halasy, K., & Somogyi, P. (1994). Diverse sources of hippocampal unitary inhibitory postsynaptic potentials and the number of synaptic release sites. Nature, 368, 823–828.
Buzsaki, G., Leung, L., & Vanderwolf, C. (1983). Cellular bases of hippocampal EEG in the behaving rat. Brain Res. Rev., 6, 139–171.
Caeser, M., Brown, D. A., Gahwiler, B. H., & Knöpfel, T. (1993). Characterization of a calcium-dependent current generating a slow afterdepolarization of CA3 pyramidal cells in rat hippocampal slice cultures. Eur. J. Neurosci., 5, 560–569.
Cobb, S., Buhl, E., Halasy, K., Paulsen, O., & Somogyi, P. (1995). Synchronization of neuronal activity in hippocampus by individual GABAergic interneurons. Nature, 378, 75–78.
Constanti, A., & Libri, V. (1992). Trans-ACPD induces a slow post-stimulus inward tail current (IADP) in guinea-pig olfactory cortex neurons in vitro. Eur. J. Pharmacol., 216, 463–464.
Destexhe, A., Mainen, Z., & Sejnowski, T. (1994). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Computation, 6, 14–18.
Doiron, B., Longtin, A., Turner, R., & Maler, L. (2001). Model of gamma frequency burst discharge generated by conditional backpropagation. J. Neurophysiol., 86, 1523–1545.
Dormand, J., & Prince, P. (1980). A family of embedded Runge-Kutta formulae. J. Comp. Appl. Math., 6, 19–26.
Engel, A., Fries, P., & Singer, W. (2001). Dynamic predictions: Oscillations and synchrony in top-down processing. Nat. Rev. Neurosci., 2, 704–716.
Ermentrout, G. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comput., 8, 979–1001.
Ermentrout, G., & Kopell, N. (1984). Frequency plateaus in a chain of weakly coupled oscillators. SIAM J. Math. Anal., 15, 215–237.
Fisahn, A., Yamada, M., Duttaroy, A., Gan, J., Deng, C., McBain, C., & Wess, J. (2002). Muscarinic induction of hippocampal gamma oscillations requires coupling of the M1 receptor to two mixed cation currents. Neuron, 33(4), 615–624.
Fraser, D. D., & MacVicar, B. A. (1996). Cholinergic-dependent plateau potential in hippocampal CA1 pyramidal neurons. J. Neurosci., 16, 4113–4128.
Freund, T. F., & Buzsaki, G. (1996). Interneurons of the hippocampus. Hippocampus, 6, 347–470.
Galarreta, M., & Hestrin, S. (1999). A network of fast-spiking cells in the neocortex connected by electrical synapses. Nature, 402, 72–75.
Galarreta, M., & Hestrin, S. (2001). Spike transmission and synchrony detection in networks of GABAergic interneurons. Science, 292, 2295–2299.
Gray, C., & McCormick, D. (1996). Chattering cells: Superficial pyramidal neurons contributing to the generation of synchronous oscillations in the visual cortex. Science, 274, 109–113.
Guerineau, N. C., Bossu, J. L., Gahwiler, B. H., & Gerber, U. (1995). Activation of a nonselective cationic conductance by metabotropic glutamatergic and muscarinic agonists in CA3 pyramidal neurons of the rat hippocampus. J. Neurosci., 15, 4395–4407.
Haj-Dahmane, S., & Andrade, R. (1997). Calcium-activated cation nonselective current contributes to the fast afterdepolarization in rat prefrontal cortex neurons. J. Neurophysiol., 78, 1983–1989.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 192–210.
Hodgkin, A., & Huxley, A. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Izhikevich, E. (2000). Neural excitability, spiking and bursting. Int. J. Bifurcation and Chaos, 10, 1171–1266.
Kang, Y. (1997). Depolarization-induced slow potentiation of depolarizing afterpotential. NeuroReport, 8, 1509–1514.
Kang, Y., & Kayano, F. (1994). Electrophysiological and morphological characteristics of layer VI pyramidal cells in the motor cortex. J. Neurophysiol., 72, 578–559.
Kang, Y., Okada, T., & Ohmori, H. (1998). A phenytoin-sensitive cationic current participates in generating afterdepolarization and burst afterdischarge in rat neocortical pyramidal cells. Eur. J. Neurosci., 10, 1363–1375.
Kay, A., & Wong, R. (1987). Calcium current activation kinetics in isolated pyramidal neurons of the CA1 region of the mature guinea-pig hippocampus. J. Physiol., 392, 603–616.
Kohler, M., Hirschberg, B., Bond, C., Kinzie, J., Marrion, N., Maylie, J., & Adelman, J. (1996). Small-conductance, calcium-activated potassium channels from mammalian brain. Science, 273, 1709–1714.
Kopell, N., Ermentrout, B., Whittington, M., & Traub, R. (2000). Gamma rhythms and beta rhythms have different synchronization properties. Proc. Natl. Acad. Sci. USA, 97, 1867–1872.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. Berlin: Springer-Verlag.
Llinas, R., Grace, A., & Yarom, Y. (1991). In vitro neurons in mammalian cortical layer 4 exhibit intrinsic activity in the 10 to 50 Hz frequency range. Proc. Natl. Acad. Sci. USA, 88, 897–901.
Mantegazza, M., Franceschetti, S., & Avanzini, G. (1998). Anemone toxin (ATX II)-induced increase in persistent sodium current: Effects on the firing properties of rat neocortical pyramidal neurons. J. Physiol. (Lond.), 507, 105–116.
Menschik, E., & Finkel, L. (1999). Cholinergic neuromodulation and Alzheimer's disease: From single cells to network simulations. Prog. Brain Res., 121, 19–45.
Okuda, K. (1993). Variety and generality of clustering in globally coupled oscillators. Physica D, 63, 424–436.
Pinsky, P., & Rinzel, J. (1994). Intrinsic and network rhythmogenesis in a reduced Traub model for CA3 neurons. J. Comput. Neurosci., 1, 39–60.
Rall, W. (1967). Distinguishing theoretical synaptic potentials computed for different soma-dendritic distributions of synaptic inputs. J. Neurophysiol., 30, 1138–1168.
Robertson, S., Johnson, J., & Potter, J. (1981). The time course of Ca²⁺ exchange with calmodulin, troponin, and myosin in response to transient increases in Ca²⁺. Biophys. J., 34, 559–569.
Steriade, M., McCormick, D., & Sejnowski, T. (1993). Thalamocortical oscillations in the sleeping and aroused brain. Science, 262, 679–685.
Steriade, M., Timofeev, I., Durmuller, N., & Grenier, F. (1998). Dynamic properties of corticothalamic neurons and local cortical interneurons generating fast rhythmic (30–40 Hz) spike bursts. J. Neurophysiol., 79, 483–490.
Tallon-Baudry, C., Bertrand, O., Delpuech, C., & Pernier, J. (1997). Oscillatory gamma-band (30–70 Hz) activity induced by a visual search task in humans. J. Neurosci., 17, 722–734.
Tiesinga, P., Fellous, J., Jose, J., & Sejnowski, T. (2001). Computational model of carbachol-induced delta, theta, and gamma oscillations in the hippocampus. Hippocampus, 11, 251–274.
Tiesinga, P., Jose, J., & Sejnowski, T. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419.
von Stein, A., & Sarnthein, J. (2000). Different frequencies for different scales of cortical integration: From local gamma to long range alpha/theta synchronization. Int. J. Psychophysiol., 38, 301–313.
Vreeswijk, C., & Abbott, L. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
Wang, X. (1999). Fast burst firing and short-term synaptic plasticity: A model of neocortical chattering neurons. Neuroscience, 89, 347–362.
Wang, X., & Buzsaki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
White, J., Banks, M., Pearce, R., & Kopell, N. (2000). Networks of interneurons with fast and slow gamma-aminobutyric acid type A (GABAA) kinetics provide substrate for mixed gamma-theta rhythm. Proc. Natl. Acad. Sci. USA, 97, 8128–8133.
Williams, S., & Stuart, G. (1999). Mechanisms and consequences of action potential burst firing in rat neocortical pyramidal neurons. J. Physiol., 521, 467–482.
Xia, X., Fakler, B., Rivard, A., Wayman, G., Johnson-Pais, T., Keen, J. E., Ishii, T., Hirschberg, B., Bond, C., Lutsenko, S., Maylie, J., & Adelman, J. (1998). Mechanism of calcium gating in small-conductance calcium-activated potassium channels. Nature, 395, 503–507.
Received May 13, 2002; accepted November 26, 2002.
LETTER
Communicated by David Mumford
Manhattan World: Orientation and Outlier Detection by Bayesian Inference

James M. Coughlan
[email protected]

A. L. Yuille
[email protected]

Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115, U.S.A.
This letter argues that many visual scenes are based on a "Manhattan" three-dimensional grid that imposes regularities on the image statistics. We construct a Bayesian model that implements this assumption and estimates the viewer orientation relative to the Manhattan grid. For many images, these estimates are good approximations to the viewer orientation (as estimated manually by the authors). These estimates also make it easy to detect outlier structures that are unaligned to the grid. To determine the applicability of the Manhattan world model, we implement a null hypothesis model that assumes that the image statistics are independent of any three-dimensional scene structure. We then use the log-likelihood ratio test to determine whether an image satisfies the Manhattan world assumption. Our results show that if an image is estimated to be Manhattan, then the Bayesian model's estimates of viewer direction are almost always accurate (according to our manual estimates), and vice versa.

1 Introduction

There has been growing interest in the statistics of natural images (see Huang & Mumford, 1999, for a recent review). Much of the interest in these statistics lies in their usefulness for quantitatively describing the regularities of images. Image statistics are also useful for solving visual inference problems. They can be used to design statistical edge detectors (Konishi, Yuille, Coughlan, & Zhu, 1999, 2003), determine statistical image invariants (Chen, Belhumeur, & Jacobs, 2000), and determine semantic categories (Oliva & Torralba, 2001). It is plausible that the receptive fields of neurons adapt to these statistics and exploit them for making inferences about the world (Balboa & Grzywacz, 2000). This article investigates the Manhattan world assumption (Coughlan & Yuille, 1999, 2000), where statistical image regularities arise from the geometrical structure of the scene being viewed.
It assumes that the scene has a natural Cartesian x, y, z coordinate system and that the image statistics will be determined by the alignment of the viewer with respect to this system.

© 2003 Massachusetts Institute of Technology. Neural Computation 15, 1063–1088 (2003)
1064
J. Coughlan and A. Yuille
Figure 1: The Ames room, a geometrically distorted room that nevertheless appears rectangular from a special viewpoint. Despite appearances, the two people are the same size.
The Manhattan world assumption is plausible for indoor and outdoor city scenes. As we will show, it also applies to some scenes in the country and even to some paintings. Informal evidence that human observers use a form of the Manhattan world assumption is provided by the Ames room illusion (see Figure 1), where the observers appear to erroneously make this assumption, thereby perceiving the sizes of objects in the room as grotesquely distorted. The Ames room is actually constructed in a shape that strongly violates the Manhattan assumption, but human observers, and our model (see the top row of Figure 18), interpret the room as if it had a Cartesian structure. In particular, we demonstrate a Bayesian statistical model that exploits the Manhattan world assumption to determine the orientation of the viewer relative to the grid. This enables us to determine the orientation of the viewer in a scene, indoor or outdoor, from a single image. It might, for example, be used as part of the "reptilian layer" of a vision system in accordance with the active vision philosophy (Blake & Yuille, 1992). It gives a calibration method that is an alternative to well-established techniques in computer vision based on calibration patterns (Faugeras, 1993) or motion flow (Hartley & Zisserman, 2000). It is related to methods for calibration by estimating vanishing points (see the review in Deutscher, Isard, & MacCormick, 2002), but these methods require multiple images or manual labeling of detected lines. The Bayesian model also allows us to detect outlier edges that are
Manhattan World
1065
not aligned to the dominant structures in the scene and may simplify object detection. We evaluate our model by comparing its estimates of the viewer orientation with estimates made manually by the authors (see Deutscher et al., 2002, for recent comparisons of Manhattan-type algorithms to alternative methods and ground truth). This is demonstrated by figures in the text (enabling the reader to make his or her own judgment). In addition, we construct a null hypothesis model that assumes that the image statistics are independent of the three-dimensional scene structure. This enables us to determine if an image obeys the Manhattan world assumption by comparing the evidence of the Manhattan and the null hypothesis models.

This letter is organized as follows. In sections 2 and 3, we describe the geometry of the problem and the connection between three-dimensional structures in the scene and the corresponding properties of the image that are measured. Section 4 describes our statistical model of images. Section 5 describes the full Bayesian model of Manhattan world, and section 6 shows experimental results applying this model to a range of images and uses the null hypothesis model to estimate if an image is Manhattan. In section 7, we describe the application of the Manhattan model to finding outlier objects unaligned to the Manhattan grid. Section 8 presents a consistency check of the accuracy of the prior model. Finally, section 9 summarizes the letter.

2 The Geometry of the Problem

There has been an enormous amount of work studying the geometry of computer vision (Faugeras, 1993; Mundy & Zisserman, 1992). Techniques from projective geometry have been applied to finding the vanishing points (Brillault-O'Mahony, 1991; Lutton, Maître, & Lopez-Krahe, 1994; see Shufelt, 1999, for a recent review and analysis of these techniques). This work, however, has typically proceeded through the stages of edge detection, Hough transforms, and finally the calculation of the geometry.
Alternatively, a sequence of images over time can be used to estimate the geometry (Torr & Zisserman, 1998). In this article, we demonstrate that accurate results can be obtained from a single image directly, without the need for techniques such as edge detection and Hough transforms, by exploiting the statistical regularities of scenes.

For completeness, we give the basic geometry. The camera orientation is defined by a set of camera axes aligned to the camera body, which has been rotated relative to the Manhattan xyz-axis system. We define the camera axes by three unit vectors a, b, and c: a and b specify the horizontal and vertical directions, respectively, along the film plane in xyz coordinates, and c is the unit vector that points along the line of sight of the camera, so that a × b = −c. The projection from a three-dimensional point r = (x, y, z) to two-dimensional film coordinates (u, v) (centered at the physical center
of the film) is given by

u = f (r · a)/(r · c),    v = f (r · b)/(r · c),    (2.1)
where f is the focal length of the camera. Here the film coordinates are chosen such that the u-axis is aligned to a and the v-axis is aligned to b.

The camera orientation relative to the Manhattan xyz-axis system may be specified by three Euler angles α, β, and γ. We can think of starting with the vectors c, a, and b aligned to the x, −y, z axes and applying three successive rotations to the coordinate frame defined by these vectors (i.e., active transformations rather than passive coordinate transformations). The first angle is the compass angle (or azimuth) α, which rotates the camera about the z-axis, yielding a transformed coordinate system x′y′z′. Next, the camera is rotated around the y′-axis by the elevation angle β, yielding the next coordinate system x″y″z″. This has the effect of elevating the line of sight from the xy plane. Finally, a twist angle γ applies a rotation γ about the x″-axis, producing a twist in the plane of the film. We use Ψ to denote all three angles (α, β, γ) of the camera orientation. (In previous work, we made the assumption that β = γ = 0 and only allowed the compass angle α to vary, which was a reasonable approximation for many images in our database.)

To derive expressions for a, b, and c as functions of Ψ, we follow a standard procedure (Mathews & Walker, 1970) of constructing rotation matrices Rx(γ), Ry(β), Rz(α) and defining the overall rotation matrix R(Ψ) = Rx″(γ)Ry′(β)Rz(α) = Rz(α)Ry(β)Rx(γ). The resulting expression is given by
    [  cos α cos β    −sin α cos γ + cos α sin β sin γ     sin α sin γ + cos α sin β cos γ  ]
    [  sin α cos β     cos α cos γ + sin α sin β sin γ    −cos α sin γ + sin α sin β cos γ  ]
    [ −sin β           cos β sin γ                          cos β cos γ                     ]    (2.2)
We then obtain a = R(Ψ)(0, −1, 0)^T, b = R(Ψ)(0, 0, 1)^T, and c = R(Ψ)(1, 0, 0)^T.

3 Mapping Manhattan Lines to Image Gradients

Straight lines in the x, y, z directions in the Manhattan world project to straight lines in the image plane. Our Bayesian model (see section 5) will use estimates of the orientations of these lines on the image plane, provided
Figure 2: Examples of projection. (Left) Geometry of an x line projected onto the (u, v) image plane. θ is the normal orientation of the line in the image. (Right) The vanishing point due to two square boxes aligned to the Manhattan grid. In both cases, the camera is assumed to point in a horizontal direction, and so the x vanishing point lies on the u-axis.
by image gradient information, to find the camera orientation that is most consistent with these orientation measurements. We now derive the orientation in the image plane of an x, y, or z line projected at any pixel location (u, v) as a function of camera orientation Ψ. The calculation proceeds by first calculating the x, y, z vanishing point positions in the image plane as a function of Ψ. The resulting (u, v) coordinates for the x, y, z vanishing points are (f a_x/c_x, f b_x/c_x), (f a_y/c_y, f b_y/c_y), and (f a_z/c_z, f b_z/c_z), respectively. Next, it is a straightforward calculation to show that a point in the image at (u, v) with intensity gradient direction (cos θx, sin θx) is consistent with an x line, in the sense that it points to the vanishing point, if tan θx = (u c_x − f a_x)/(f b_x − v c_x). This calculation is for an ideal edge for which the intensity gradient direction (cos θx, sin θx) points exactly perpendicularly to the x vanishing point; our Bayesian model will exploit the fact that the orientation of true intensity gradients fluctuates about the ideal direction (see section 5) by modeling these fluctuations statistically. Observe also that this equation is unaffected by adding ±π to θx, and so it does not depend on the polarity of the edge. We get similar expressions for y and z lines: tan θy = (u c_y − f a_y)/(f b_y − v c_y) for intensity gradient direction (cos θy, sin θy) and tan θz = (u c_z − f a_z)/(f b_z − v c_z) for intensity gradient direction (cos θz, sin θz). See Figure 2 for an illustration of this geometry.
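The geometry above is compact enough to sketch directly in code. The following Python fragment (function names are ours, not the authors' implementation) builds R(Ψ) from equation 2.2, derives the camera axes a, b, c, and evaluates the predicted normal orientation of a Manhattan line at a pixel via tan θ_k = (u c_k − f a_k)/(f b_k − v c_k):

```python
import math

def rotation(alpha, beta, gamma):
    """Overall rotation R(Psi) = Rz(alpha) Ry(beta) Rx(gamma) of eq. 2.2 (radians)."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [[ca * cb, -sa * cg + ca * sb * sg,  sa * sg + ca * sb * cg],
            [sa * cb,  ca * cg + sa * sb * sg, -ca * sg + sa * sb * cg],
            [-sb,      cb * sg,                 cb * cg]]

def camera_axes(psi):
    """Camera axes a, b, c: the images of (0,-1,0), (0,0,1), (1,0,0) under R(Psi)."""
    R = rotation(*psi)
    apply = lambda v: [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]
    return apply([0.0, -1.0, 0.0]), apply([0.0, 0.0, 1.0]), apply([1.0, 0.0, 0.0])

def predicted_orientation(u, v, f, a_k, b_k, c_k):
    """Predicted normal orientation theta of a Manhattan line at pixel (u, v):
    tan(theta) = (u*c_k - f*a_k) / (f*b_k - v*c_k), given the axis components k."""
    return math.atan2(u * c_k - f * a_k, f * b_k - v * c_k)
```

For Ψ = (0, 0, 0) the camera looks down the x-axis, so a = (0, −1, 0), b = (0, 0, 1), and c = (1, 0, 0); the identity a × b = −c holds for any Ψ, which is a convenient sanity check on the matrix.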
Figure 3: Poff(y) (left) and Pon(y) (right), the empirical histograms of edge responses off and on edges, respectively. Here the response y = |∇I| is quantized to take 20 values and is shown on the horizontal axis. Note that the peak of Poff(y) occurs at a lower edge response than the peak of Pon(y). These distributions were very consistent for a range of images.
4 Pon and Poff: Characterizing Edges Statistically

A key element of our approach is that we do not use a binary edge map. Such edge maps make premature decisions based on too little information. The poor quality of some of the images we used—underexposed and overexposed—makes edge detection particularly difficult. Our algorithm showed a significant decrease in performance when we adapted it to run on edge maps, unless the edge detection threshold was varied from image to image. Instead, we use the power of statistics. Following work by Konishi et al. (1999, 2003), we determine probabilities Pon(E_u) and Poff(E_u) for the response E_u of an edge filter at position u in the image, conditioned on whether we are on or off an edge. These distributions were learned by Konishi et al. for the Sowerby image database, which contains 100 presegmented images (see Konishi et al., 1999, 2003, for the similarity of these statistics from image to image). The more different Pon is from Poff, the easier the edge detection becomes (see Figure 3). A suitable measure of difference is the Chernoff information (Cover & Thomas, 1991), C(Pon, Poff) = −min_{0≤λ≤1} log Σ_y Pon^λ(y) Poff^{1−λ}(y). This is motivated by theoretical studies of the detectability of edge contours (Yuille & Coughlan, 2000; Yuille, Coughlan, Wu, & Zhu, 2001). Moreover, empirical studies (Konishi et al., 1999, 2003) showed that the Chernoff information for this task correlates strongly with other performance measures based on the ROC curve. Konishi et al. tested a variety of edge filters and ranked them by their effectiveness based on their Chernoff information. For this project, we chose a very simple edge detector, |∇G_{σ=1} ∗ I|—the magnitude of the gradient
Figure 4: Distribution Pang(·) of edge orientation error (displayed as an unnormalized histogram, modulo 180 degrees). Observe the strong peak at 0 degrees, indicating that the image gradient direction at an edge is usually very close to the true normal orientation of the edge. We modeled this distribution using a simple box function.
of the gray-scale image I filtered by a gaussian G_{σ=1} with standard deviation σ = 1 pixel unit—which has a Chernoff of 0.26 nats or 0.37 bits (1 bit = ln 2 nats ≈ 0.69 nats). More effective edge detectors are available; for example, the gradient at multiple scales using color has a Chernoff of 0.51 nats or 0.74 bits. But we do not need these more sophisticated detectors. We extend the work of Konishi et al. by putting probability distributions on how accurately the edge filter gradient estimates the true perpendicular direction of the edge (see Figure 4). These were learned for this data set by measuring the true orientations of straight-line edges and comparing them to those estimated from the image gradients. This gives us distributions on the magnitude and direction of the intensity gradient, Pon(E_u, φ_u | θ) and Poff(E_u, φ_u), where θ is the true normal orientation of the edge and φ_u is the gradient direction measured at point u. We make a factorization assumption that Pon(E_u, φ_u | θ) = Pon(E_u) Pang(φ_u − θ) and Poff(E_u, φ_u) = Poff(E_u) U(φ_u). Pang(·) (with argument evaluated modulo 2π
and normalized to 1 over the range 0 to 2π) is based on experimental data (see Figure 4) and is peaked about 0 and π. In practice, we use a simple box function model: Pang(δθ) = (1 − ε)/4τ if δθ is within angle τ of 0 or π, and ε/(2π − 4τ) otherwise (i.e., the chance of an angular error greater than ±τ is ε). In our experiments, ε = 0.1 and τ = 6 degrees. By contrast, U(·) = 1/2π is the uniform distribution.

5 Bayesian Model

We now devise a Bayesian model that combines knowledge of the three-dimensional geometry of the Manhattan world with statistical knowledge of edges in images. The model assumes that while the majority of pixels in the image convey no information about camera orientation, most of the pixels with high edge responses arise from the presence of x, y, z lines in the three-dimensional scene. The edge orientations measured at these pixels provide constraints on the camera angle, and although the constraining evidence from any single pixel is weak, the Bayesian model allows us to pool the evidence over all pixels (both on and off edges), yielding a sharp posterior distribution on the camera orientation. An important feature of the Bayesian model is that it does not force us to decide prematurely which pixels are on and off edges (or whether an on pixel is due to x, y, or z), but allows us to sum over all possible interpretations of each pixel.

5.1 Evidence at One Pixel. The image data (E_u, φ_u) at pixel u are explained by one of five models m_u: m_u = 1, 2, 3 mean the data are generated by an edge due to an x, y, z line, respectively, in the scene; m_u = 4 means the data are generated by a random edge (not due to an x, y, z line); and m_u = 5 means the pixel is off-edge. The prior probability P(m_u) of each of the edge models was estimated empirically to be 0.02, 0.02, 0.02, 0.04, 0.9 for m_u = 1, 2, ..., 5 (see section 8).
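The box-function orientation model of section 4 is simple enough to state exactly; a short Python sketch (function names are ours), using the experimental values ε = 0.1 and τ = 6 degrees:

```python
import math

EPS, TAU = 0.1, math.radians(6.0)  # epsilon and tau values used in the experiments

def p_ang(dtheta, eps=EPS, tau=TAU):
    """Box-function model of edge-orientation error, evaluated modulo 2*pi:
    mass 1 - eps spread uniformly within +/-tau of 0 or pi, eps elsewhere."""
    d = dtheta % (2 * math.pi)
    near_peak = min(d, 2 * math.pi - d) <= tau or abs(d - math.pi) <= tau
    return (1 - eps) / (4 * tau) if near_peak else eps / (2 * math.pi - 4 * tau)

def u_ang(phi):
    """Uniform orientation distribution U(.) for random and off-edge models."""
    return 1 / (2 * math.pi)
```

The normalization works out because the two peak windows (around 0 and π, each of total width 2τ) carry mass 4τ · (1 − ε)/(4τ) = 1 − ε, and the remainder carries mass (2π − 4τ) · ε/(2π − 4τ) = ε.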
Using the factorization assumption mentioned before, we assume the probability of the image data has two factors, one for the magnitude of the edge strength and another for the edge direction:

P(E_u, φ_u | m_u, Ψ, u) = P(E_u | m_u) P(φ_u | m_u, Ψ, u),    (5.1)
where P(E_u | m_u) equals Poff(E_u) if m_u = 5 or Pon(E_u) if m_u ≠ 5. Also, P(φ_u | m_u, Ψ, u) equals Pang(φ_u − θ(Ψ, m_u, u)) if m_u = 1, 2, 3 or U(φ_u) if m_u = 4, 5. Here, θ(Ψ, m_u, u) is the predicted normal orientation of lines determined by the equations tan θx = (u c_x − f a_x)/(f b_x − v c_x) for x lines, tan θy = (u c_y − f a_y)/(f b_y − v c_y) for y lines, and tan θz = (u c_z − f a_z)/(f b_z − v c_z) for z lines. The structure of the probability distribution for all the variables relevant to one pixel (i.e., E_u, φ_u, m_u, Ψ) is graphically depicted in the Bayes net shown in Figure 5.
Figure 5: Bayes net for all variables pertaining to a single pixel. The net represents the structure of the joint probability of the image gradient direction φ_u and magnitude E_u at pixel u, the assignment variable m_u at that pixel, and the camera orientation Ψ. (The dependence of φ_u on the pixel location u is not shown since we assume u is known.)
In summary, the edge strength probability is modeled by Pon for models 1 through 4 and by Poff for model 5. For models 1, 2, and 3, the edge orientation is modeled by a distribution that is peaked about the appropriate orientation of an x, y, z line predicted by the compass angle at pixel location u; for models 4 and 5, the edge orientation is assumed to be uniformly distributed from 0 through 2π. Rather than decide on a particular model at each pixel, we marginalize over all five possible models (i.e., creating a mixture model):

P(E_u, φ_u | Ψ, u) = Σ_{m_u=1}^{5} P(E_u, φ_u | m_u, Ψ, u) P(m_u).    (5.2)
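Equation 5.2 is a five-term mixture at each pixel. A minimal Python sketch (names are ours; the Pon, Poff, and Pang values are assumed to be supplied by the section 4 models):

```python
import math

PRIOR = [0.02, 0.02, 0.02, 0.04, 0.90]  # P(m) for m = 1..5: x, y, z line, random edge, off-edge
UNIFORM = 1.0 / (2.0 * math.pi)         # U(phi), used for models 4 and 5

def pixel_evidence(p_on, p_off, p_ang_xyz):
    """Marginal likelihood of one pixel's data (eq. 5.2): the sum over the five
    models of P(E, phi | m, Psi, u) P(m). p_on / p_off are the edge-magnitude
    likelihoods Pon(E_u) / Poff(E_u); p_ang_xyz are Pang(phi_u - theta_k) for
    the predicted x, y, z line orientations at this pixel."""
    terms = [p_on * p_ang_xyz[0] * PRIOR[0],   # m = 1: x line
             p_on * p_ang_xyz[1] * PRIOR[1],   # m = 2: y line
             p_on * p_ang_xyz[2] * PRIOR[2],   # m = 3: z line
             p_on * UNIFORM * PRIOR[3],        # m = 4: random edge
             p_off * UNIFORM * PRIOR[4]]       # m = 5: off-edge
    return sum(terms)
```

When all three Pang values equal the uniform density, the evidence collapses to U(φ) · (0.1 Pon + 0.9 Poff), that is, the null hypothesis form of section 6.4; an edge strongly aligned with a predicted line direction raises the evidence above that baseline.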
In this way, we can determine evidence about the camera orientation Ψ at each pixel without knowing which of the five model categories the pixel belongs to.

5.2 Evidence over All Pixels: Finding the MAP. To combine evidence over all pixels in the image, denoted by {E_u, φ_u}, we assume that the image data are conditionally independent across all pixels, given the camera orientation Ψ:

P({E_u, φ_u} | Ψ) = Π_u P(E_u, φ_u | Ψ, u).    (5.3)
Conditional independence is a key assumption of the Manhattan model (depicted in the Bayes net in Figure 6). By neglecting coupling of image
Figure 6: Bayes net for all variables in the Manhattan model. The net represents the structure of the joint probability of the image gradient vector (E_{u_ij}, φ_{u_ij}) at pixel u_ij (the pixel at row i and column j), the assignment variable m_{u_ij} at that pixel, and the camera orientation Ψ. The box represents the entire image, with an image gradient vector and assignment variable at each pixel location. The structure of the net graphically illustrates the assumption that the image gradient vectors are conditionally independent from pixel to pixel.
gradients at neighboring pixels, the conditional independence assumption makes an approximation that yields a model for which maximum a posteriori (MAP) inference is tractable (see equation 5.4). Note also that the conditional independence assumption provides a way of combining evidence across pixels without an explicit grouping process (e.g., by which pixels could be grouped into straight line segments). Conditional independence is, of course, an approximation of the form used in many statistical inference algorithms. Indeed, there exist theoretical studies (Yuille et al., 2001) proving that for certain types of problems, approximate models can give results almost as good as the correct models. We did implement a Manhattan model that included spatial interactions, but the results did not improve significantly, and the model was far slower to implement.
The posterior distribution on the camera orientation is thus given by Π_u P(E_u, φ_u | Ψ, u) P(Ψ)/Z, where Z is a normalization factor and P(Ψ) is a uniform prior on the camera orientation. To find the MAP estimate, we need to maximize the log posterior term (ignoring Z, which is independent of Ψ):

log[P({E_u, φ_u} | Ψ) P(Ψ)] = log P(Ψ) + Σ_u log [ Σ_{m_u=1}^{5} P(E_u, φ_u | m_u, Ψ, u) P(m_u) ].    (5.4)
We denote the MAP estimate by Ψ*. To find the MAP, our algorithm evaluates the log posterior numerically for a quantized set of Ψ values. (The conditional independence assumption makes the form of the log posterior simple enough that it can be evaluated for any given Ψ value by summing over only five terms for each pixel.) One such set of quantized values that works for a range of images is given by searching over all combinations of α from −45 to 45 degrees in increments of 4 degrees, elevation β from −40 to 40 degrees in increments of 2 degrees, and twist γ from −4 to 4 degrees in increments of 2 degrees. In our preliminary work (Coughlan & Yuille, 1999), we assumed that the camera was pointed horizontally, so we effectively set β = 0 and γ = 0 and searched for all α from −45 to 45 degrees in increments of 2 degrees. A coarse-to-fine search strategy was employed to speed up the search for the MAP estimate, which succeeded for most images for which the true values of β and γ were close to 0. The first stage of the search was to find the best value of α from −45 to 45 degrees in increments of 4 degrees, while setting β = 0 and γ = 0. The best value of α that was obtained, α_c, was used to initialize a medium-scale search, which searched over all (α, β, γ) of the form (α_c + iΔα_m, jΔβ_m, kΔγ_m), where i, j, k ∈ {−1, 0, 1} and Δα_m = 2 degrees, Δβ_m = 5 degrees, and Δγ_m = 5 degrees. The best Euler angles thus obtained, (α_m, β_m, γ_m), were then used to initialize a fine-scale search, which searched over all (α, β, γ) of the form (α_m, β_m + jΔβ_f, γ_m + kΔγ_f), where j, k ∈ {−2, −1, 0, 1, 2} and Δβ_f = 2.5 degrees and Δγ_f = 2.5 degrees.

We should mention the issue of algorithmic speed. At present, the algorithm takes half a minute on a Pentium 3, using the coarse-to-fine search strategy, on images of size 640 × 480. Optimizing the code and subsampling the image will allow the algorithm to work significantly faster.
Other techniques involve rejecting image pixels where the edge detector response is so low that there is no realistic chance of an edge being present. This would mean that at least 70% of the image pixels could be removed from the computation. We observe that the algorithm is entirely parallelizable. Stochastic gradient-descent techniques may also be employed for signicant speedups (Deutscher et al., 2002).
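The three-stage coarse-to-fine schedule described above can be sketched as follows, with a placeholder log_posterior callable standing in for equation 5.4 (angles in degrees; names and structure are ours, not the authors' code):

```python
def coarse_to_fine(log_posterior):
    """Three-stage search for the MAP camera orientation.
    log_posterior is any callable taking (alpha, beta, gamma) in degrees."""
    def argmax(candidates):
        return max(candidates, key=lambda p: log_posterior(*p))

    # Stage 1: coarse sweep of the compass angle alpha (step 4 deg), beta = gamma = 0.
    a_c, _, _ = argmax([(a, 0, 0) for a in range(-45, 46, 4)])
    # Stage 2: medium grid around the coarse estimate
    # (alpha step 2 deg, beta and gamma steps 5 deg).
    a_m, b_m, g_m = argmax([(a_c + 2 * i, 5 * j, 5 * k)
                            for i in (-1, 0, 1)
                            for j in (-1, 0, 1)
                            for k in (-1, 0, 1)])
    # Stage 3: fine grid over elevation and twist (steps 2.5 deg), compass fixed.
    return argmax([(a_m, b_m + 2.5 * j, g_m + 2.5 * k)
                   for j in range(-2, 3) for k in range(-2, 3)])
```

With a posterior peaked near the horizontal (β, γ close to 0), the final estimate lands within the stage grid spacing of the peak, which matches the regime in which the authors report the strategy succeeding.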
6 Experimental Results for Determining Manhattan Structure

Our model has been tested on four data sets of images: indoor scenes, outdoor city scenes, outdoor rural scenes, and miscellaneous non-Manhattan scenes. Images from the first two data sets were taken by an unskilled photographer unfamiliar with the goals of the study, the outdoor rural scenes were obtained from a database of scenes of English countryside, and the non-Manhattan scenes were downloaded from the Web. We tested our model in two ways. First, we compared the vanishing points estimated by the algorithm to our own manual estimates. Second, we implemented a null hypothesis model and used a log-likelihood test to estimate whether an image does, or does not, obey the Manhattan world assumption. The null hypothesis model removes the image intensity dependence on any three-dimensional scene structure. Our experiments show that the algorithm's estimates are usually close to the manual estimates for the first three domains (see sections 6.1, 6.2, and 6.3). Moreover, images in these domains are almost always estimated as obeying the Manhattan world assumption, while the miscellaneous images are typically estimated as not obeying the assumption (see section 6.4).

6.1 Estimating Viewpoint for Indoor Scenes. On this data set, the camera was held roughly horizontal, but no special attempt was made to align it precisely. The camera was set on automatic, so some images are over- or underexposed. A total of 25 images were tested. Since the camera was held roughly horizontal, we set β = 0 and γ = 0 and searched for the optimal value of α (the results are similar when searching simultaneously over all three camera angles α, β, and γ). On 23 images, the angle estimated by the algorithm was within 5 degrees of our manual estimate. On two images, the orientation of the camera was far from horizontal, and the estimation was poor. Examples of successes, demonstrating the range of images used, are shown in Figures 7 and 8.
The log posteriors for typical images, plotted as a function of α, are shown in Figure 9 and are sharply peaked.

6.2 Estimating Viewpoint for Outdoor City Scenes. We next tested the accuracy of estimation on outdoor city scenes. Again we used 25 test images (taken by the same photographer as for the indoor scenes). In these scenes, the vast majority of the results (22) were accurate up to 10 degrees (with respect to our own manual estimates). On three of the images, the angles were worse than 10 degrees. Inspection of these images showed that the log posterior had multiple peaks, one peak corresponding to the true compass angle (to within 10 degrees), as well as false peaks, which were higher. The false peaks typically corresponded to the presence of structured objects in the scene (e.g., stairway railings), which did not align to Manhattan structure.
Figure 7: Estimates of the compass angle and geometry obtained by our algorithm. The estimated orientations of the x and y lines are indicated by the black line segments drawn on the input image. At each point on a subgrid, two such segments are drawn: one for x and one for y. (Left) Observe how the x directions align with the wall on the right-hand side and with features parallel to this wall. The y lines align with the wall on the left (and objects parallel to it). (Right) Observe that the x; y directions align with the appropriate walls despite the poor quality of the image (i.e., underexposed).
Figure 8: Another indoor (left) scene, its exterior (center), and another indoor scene (right). Same conventions as in Figure 7. The vanishing points are estimated to within 5 degrees (perfectly adequate for our purposes). Note the poor quality of the indoor image (i.e., overexposed). (Indoor 23,8 and Outdoor 12.)
Figure 9: The log posteriors as a function of α (from −45 to 45 degrees along the horizontal axis) for images Indoor 17 (left) and Indoor 15 (right). These results are typical for both the indoor and outdoor data sets. (For these plots, it was assumed that the camera is roughly horizontal, so only the angle α needs to be varied.)
On 22 of the 25 images, however, the algorithm gave estimates accurate to 10 degrees (compared with our manual estimates). See Figure 10 for a representative set of images on which the algorithm was successful. We also demonstrate the algorithm on two aerial views of cities downloaded from the Web (see Figure 11). On these images, we searched for all three camera angles α, β, and γ simultaneously, which was necessary since the camera was tilted from the horizontal.

6.3 Estimating Viewpoint for Outdoor Rural Scenes. We also applied the Manhattan model to less structured scenes in the English countryside (see Coughlan & Yuille, 2000, for a first report). Figure 12 shows two images of roads in rural scenes and two fields. These images come from the Sowerby database. Some scenes (see Figure 13) contain so little Manhattan structure that the Manhattan model may base its inference on chance alignments of various parts of the scene. The next four images (see Figure 14) were either downloaded from the Web or digitized (the painting). These are the Midwest broccoli field, the Parthenon ruins, the painting of the French countryside, and the ski scene. On all of the images in this section, the Manhattan algorithm searched for all three camera angles α, β, and γ simultaneously. In almost all of these cases, the Manhattan model makes reasonable inferences despite the absence of strong Manhattan structure, and in some cases despite the absence of strong straight-line edges.

6.4 Manhattan World and the Null Hypothesis. We now propose a test to determine whether an image obeys the Manhattan world assumption.
Figure 10: Results on four outdoor images. Same conventions as in Figure 7. Observe the accuracy of the x, y projections in these varied scenes despite the poor quality of some of the images.
Our previous results (see sections 6.1, 6.2, and 6.3) show that on several image classes, we can detect accurate vanishing points (with respect to our manual estimates), but there are many images, such as underwater images, for which the Manhattan world assumption is highly implausible. We proceed by constructing a null hypothesis model and then use model selection to estimate whether an image obeys the Manhattan world assumption. The null hypothesis model is constructed by modifying our Manhattan model of equation 5.1 by setting P(φ_u | m_u, Ψ, u) = U(φ_u), where U(·) is the uniform distribution. This modification yields our null hypothesis model:

P_null({E_u, φ_u}) = Π_u [P(edge) Pon(E_u) + P(not-edge) Poff(E_u)] U(φ_u),    (6.1)
where P(edge) = Σ_{i=1}^{4} P(m_i) = 0.1 and P(not-edge) = P(m_5) = 0.9. In other words, we no longer distinguish between different types of edges and no longer assume that the image statistics reflect any three-dimensional scene structure. To do model comparison for an image with statistics {E_u, φ_u}, we compute the evidence log Pmanhat({E_u, φ_u}) = log Σ_Ψ P({E_u, φ_u} | Ψ) P(Ψ) of the Man-
Figure 11: Aerial views of Manhattan (left) and Vancouver (right). Note that the camera is tilted from the horizontal in both cases.
Figure 12: Results on rural images in England without strong Manhattan structure. Same conventions as in Figure 7. Two images of roads in the countryside (left) and two images of fields (right).
Figure 13: Example of a scene with low Manhattan structure. The hill silhouettes are mistakenly interpreted as x lines.
Figure 14: Results on an American Midwest broccoli field (top left), the ruins of the Parthenon (top right), a digitized painting of the French countryside (bottom left), and a ski scene (bottom right).
Figure 15: We plot the (normalized) log-likelihood ratio (1/N) log(Pmanhat({E_u, φ_u})/P_null({E_u, φ_u})) of the (approximated) Manhattan model with respect to the null hypothesis (where N is the image size). The vertical axis is the log-likelihood ratio, and the horizontal axis is the index label of the images. We indicate indoor, outdoor, Sowerby, and miscellaneous images by circles, crosses (×), triangles, and plusses (+), respectively.
hattan model and subtract the evidence log Pnull .fEEuE g/ of the null model. This gives the log-likelihood ratio between the two models. We approxE ¤ /P.9 E ¤ /, where imate the evidence by log Pmanhat .fEuE g/ ¼ log P.fEEuE g j 9 ¤ E D arg max E P.fEEuE g j 9/P. E E This approximation is a lower bound to 9 9/. 9 the true evidence, and we argue that it is a good approximation because E E (see Figof the sharpness of the peaks in the posterior P.fEEuE g j 9/P. 9/ ure 9; this gure plots the log posterior, so the posterior is considerably sharper). The experimental results—the plots of log Pmanhat .fE uE g/=Pnull .fEuE g/ in Figure 15—show that all the images reported in sections 6.1, 6.2, and 6.3 satisfy the Manhattan world assumption (according to our model selection test). It is therefore not surprising that the algorithm’s estimates of the vanishing point were accurate for these images. The gure also shows a plausible trend: indoor images best satisfy the Manhattan world assumption, fol-
Manhattan World
1081
6.4
6.6
6.8
7
7.2
7.4
7.6
7.8 0
10
20
30
40
50
E uE g/, evaluated Figure 16: Evidence per pixel (approximated), 1=N log.Pmanhat .fE in three domains: indoor images (labeled by o’s), outdoor urban images (labeled by x’s), and miscellaneous images (labeled by +’s). A representative image from each domain is shown below. The trend is that the evidence per pixel decreases, on average, as we go from images with strong Manhattan structure to images without Manhattan structure.
lowed by outdoor images and then rural images. The gure also shows 10 miscellaneous images that are not expected to satisfy the Manhattan world assumption. These images include an underwater image (see Figure 16), and almost all have log likelihoods than are less than zero and considerably lower than those for indoor, outdoor, and rural scenes. The main exception is the fourth image, which is a photograph of a helicopter (see Figure 13) and where the hill silhouettes are mistakenly interpreted as horizontal x lines.
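Under the conditional independence assumption, the model selection test boils down to comparing summed per-pixel log-likelihoods, normalized by the image size. A minimal numerical sketch, using illustrative stand-in densities rather than the paper's actual edge statistics:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalized_llr(logp_model, logp_null):
    # Under conditional independence, the image log-likelihood is the sum of
    # per-pixel terms; normalize by the number of pixels N.
    return (logp_model.sum() - logp_null.sum()) / logp_model.size

# Toy per-pixel statistic E in [0, 1]: the "structured" model has density
# p(E) = 0.2 + 1.6 E; the null model is uniform on [0, 1] (log-density 0).
def sample_structured(n):
    # Inverse-CDF sampling from F(E) = 0.2 E + 0.8 E^2.
    p = rng.uniform(0.0, 1.0, n)
    return (-0.2 + np.sqrt(0.04 + 3.2 * p)) / 1.6

E = sample_structured(20_000)
llr = normalized_llr(np.log(0.2 + 1.6 * E), np.zeros_like(E))
# For data drawn from the structured model, llr is positive (it estimates the
# KL divergence from the null), mirroring the positive scores in Figure 15.
```

Images violating the assumption would instead produce negative scores, as for the miscellaneous images above.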
It is also interesting to plot the evidence, $\log P_{\rm manhat}(\{\vec E_{\vec u}\})$, for the Manhattan model alone (see Figure 16). (As above, we approximate this sum by the dominant contribution given by $\vec\Psi^*$.) The evidence is useful for indicating trends to determine which classes of images fit the Manhattan world assumption. To avoid biases caused by different image sizes, we normalize the evidence by the image size and plot $L/N$, where $L$ is the evidence. (Our conditional independence assumption, if correct, implies that the evidence scales linearly with the number of pixels.) Observe that the data are very high dimensional, and so the probability of any single image will be very small.
7 Outliers in Manhattan World

We now describe how the Manhattan world model may help the task of detecting target objects in background clutter. To perform such a task effectively requires modeling the properties of the background clutter in addition to those of the target object. It has recently been appreciated (Ratches, Walters, Buser, & Guenther, 1997) that simple models of background clutter based on gaussian probability distributions are often inadequate and that better performance can be obtained using alternative probability models (Zhu, Lanterman, & Miller, 1998). The Manhattan world assumption gives an alternative, and complementary, way of probabilistically modeling background clutter. The background clutter will correspond to the regular structure of buildings and roads, and its edges will be aligned to the Manhattan grid. The target object, however, is assumed to be unaligned (at least in part) to this grid. Therefore, many of the edges of the target object will be assigned to model 4 by the algorithm. This enables us to simplify the detection task significantly by removing all edges in the images except those assigned to model 4. (Of course, further processing is required to group these outlier edges into coherent targets.) This idea is demonstrated in Figures 17 and 18, where the targets are a bike, a robot, and people. At each pixel, our algorithm decides whether the most likely interpretation is model 4 or any of the other models. More precisely, it decides outlier if $P(m_{\vec u} = 4 \mid \vec E_{\vec u}, \vec\Psi^*) / P(m_{\vec u} \neq 4 \mid \vec E_{\vec u}, \vec\Psi^*) > T$ and nonoutlier otherwise, where $T \approx 0.4$ and $\vec\Psi^*$ is the MAP estimate of the camera orientation. Observe how most of the edges in the image are eliminated as target object candidates because of their alignment to the Manhattan grid. The bike, robot, and people stand out as outliers to the grid.
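The outlier decision is a per-pixel threshold on a posterior ratio. A small sketch; the array layout and function name are ours:

```python
import numpy as np

def outlier_mask(posterior, T=0.4):
    """posterior: array of shape (..., 5) holding P(m_u = k | E_u, Psi*) for
    the five models m = 1, ..., 5; the outlier model m = 4 sits at index 3.
    Decide 'outlier' where P(m = 4 | ...) / P(m != 4 | ...) > T."""
    p_out = posterior[..., 3]
    return p_out / (1.0 - p_out) > T

# Example: one pixel dominated by the outlier model, one by the non-edge model.
post = np.array([[0.10, 0.10, 0.10, 0.60, 0.10],
                 [0.05, 0.05, 0.05, 0.05, 0.80]])
mask = outlier_mask(post)  # [True, False]
```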
The Ames room (see the top row of Figure 18) is a geometrically distorted room that is constructed so as to give the false impression that it is built on a Cartesian coordinate frame when viewed from a special viewpoint. Human observers assume that the room is indeed Cartesian despite all other visual cues to the contrary. This distorts the apparent size of objects so that, for example, humans in different parts of the room appear to have very different
Figure 17: Bikes (top row) and robots (bottom row) as outliers in Manhattan world. The original image (left) and the edge maps (center) computed as $\log P_{\rm on}(E_{\vec u}) / P_{\rm off}(E_{\vec u})$ (see Konishi et al., 1999), displayed as a gray-scale image where black is high and white is low. In the right column, we show the edges assigned to model 4 (i.e., the outliers) in black. Observe that the edges of the bike and robot are now highly salient (and would make detection simpler) because most of them are unaligned to the Manhattan grid.
Figure 18: People in Manhattan world: the Ames room twins (top) and a coauthor (bottom). Same conventions as in Figure 17. The Ames room actually violates the Manhattan assumption, but human observers, and our algorithm, interpret it as if it satisfied the assumption. In fact, despite appearances, the two people in the Ames room are really the same size.
sizes. In fact, a human walking across the room will appear to change size dramatically. Our algorithm, like human observers, interprets the room as being Cartesian, and this helps identify the humans in the room as outlier edges unaligned to the Cartesian reference system. This simple example illustrates a method of modeling background clutter that we refer to as scene clutter because it is effectively the same as defining a probability model for the entire scene. Observe that scene clutter models require external variables (in this case, the camera orientation $\vec\Psi$) to determine the orientation of the viewer relative to the scene axes. These variables must be estimated to help distinguish between target and clutter. This differs from standard models used for background clutter (Ratches et al., 1997; Zhu et al., 1998), where no external variable is used.

8 Consistency Check of the Prior

Ideally, the parameters defining the model assignment prior $P(m_{\vec u})$ should be learned from the ground-truth classification of pixels in Manhattan-type images into the five model categories. However, no such ground-truth information is available, so we determined the parameters on the basis of both measurements and guesswork. In the segmentations supplied with the Sowerby data set images, about 10% of all pixels are edges. It is plausible to assume that 40% of all edges are outliers (model 4) and that x, y, and z edges occur in roughly equal proportions. Hence we arrived at the prior frequencies 0.02, 0.02, 0.02, 0.04, 0.9 for $m_{\vec u} = 1, 2, \ldots, 5$. In this section, we describe a consistency check for the prior frequencies by estimating the frequencies of each model assignment using the above prior and conditioning on image data. Our results (see Table 1) show that the estimated frequencies of the different edge types are similar to our prior frequencies. This is only a crude check because, clearly, our estimated frequencies will be biased toward our choice of prior frequencies.
Our check is based on the following observation: assume a Bayesian model $P(X \mid Y) = P(Y \mid X) P(X) / P(Y)$, where $Y$ is the observable and $X$ is the variable to be estimated. Let $\{Y_i\}$ be a set of $M$ samples from $P(Y)$. Then

$$\frac{1}{M} \sum_i P(X \mid Y_i) = \frac{1}{M} \sum_i \frac{P(Y_i \mid X) P(X)}{P(Y_i)} \to \sum_Y P(Y) \frac{P(Y \mid X) P(X)}{P(Y)} = P(X)$$

as the number of samples $M$ tends to infinity. Hence, the posterior averaged over a set of representative samples should converge to the prior. Now because our images are large and we are assuming conditional independence, it is plausible that we get a sufficient number of samples from each image to ensure that the estimated edge frequencies in each image are roughly equal to the prior frequencies (this is an ergodic, or self-averaging, assumption). We can compute the empirical frequency $h_k$ of each model assignment $k$ ($h_1$ is the empirical frequency of x lines, $h_2$ the empirical frequency of y lines, and so on) in an image. We define $h_k$ as the mean frequency of model $k$ in the image, conditioned on all the image data and the MAP estimate of
Table 1: Empirical Estimates of Prior Model Frequencies for 23 Indoor Images.

Image    1      2      3      4      5      6      7      8      9      10     11     12
h1       0.015  0.018  0.027  0.018  0.032  0.021  0.016  0.025  0.025  0.026  0.025  0.020
h2       0.023  0.019  0.014  0.032  0.036  0.030  0.028  0.045  0.044  0.030  0.025  0.028
h3       0.024  0.072  0.051  0.019  0.017  0.056  0.035  0.065  0.070  0.033  0.063  0.068
h4       0.025  0.038  0.033  0.031  0.035  0.036  0.033  0.054  0.056  0.055  0.035  0.038
h5       0.912  0.854  0.875  0.900  0.879  0.856  0.888  0.812  0.806  0.857  0.852  0.800

Image    13     14     15     16     17     18     19     20     21     22     23
h1       0.017  0.032  0.029  0.026  0.033  0.022  0.026  0.010  0.018  0.018  0.023
h2       0.028  0.055  0.026  0.027  0.044  0.020  0.022  0.028  0.056  0.024  0.020
h3       0.057  0.035  0.028  0.052  0.066  0.036  0.052  0.044  0.083  0.033  0.108
h4       0.035  0.065  0.041  0.035  0.065  0.043  0.041  0.026  0.060  0.033  0.049
h5       0.862  0.813  0.875  0.859  0.793  0.879  0.859  0.892  0.783  0.892  0.800
the camera orientation:

$$h_k = \frac{1}{N} \left\langle \sum_{\vec u} \delta_{k, m_{\vec u}} \right\rangle_{P(\{m_{\vec u}\} \mid \{\vec E_{\vec u}\}, \vec\Psi^*)} = \frac{1}{N} \sum_{\vec u} P(m_{\vec u} = k \mid \vec E_{\vec u}, \vec u, \vec\Psi^*),$$
where $P(m_{\vec u} = k \mid \vec E_{\vec u}, \vec u, \vec\Psi^*) = P(m_{\vec u} = k)\, P(\vec E_{\vec u} \mid m_{\vec u} = k, \vec u, \vec\Psi^*) / Z$ and $Z = \sum_{k'=1}^{5} P(m_{\vec u} = k')\, P(\vec E_{\vec u} \mid m_{\vec u} = k', \vec u, \vec\Psi^*)$. We show results on 23 indoor images in Table 1. The empirical frequencies are fairly consistent with the prior model frequencies, although in many of the images, the frequency of z lines appears to be higher than the x and y frequencies. Averaging the empirical frequencies across an entire domain might be a useful way to adapt the Manhattan prior to new domains.

Another useful consistency check on the prior is to calculate the MAP estimate of the model assignment at each pixel conditioned on the entire image. To evaluate this, we calculate the most likely model assignment at each pixel given the image data there and the global MAP estimate $\vec\Psi^*$ of the camera orientation. More precisely, at each pixel $\vec u$, we find the value of $m_{\vec u}$ that maximizes $P(m_{\vec u} \mid \vec E_{\vec u}, \vec\Psi^*)$. We show results on two images in Figure 19, which demonstrate that pixels are being classified appropriately.

9 Summary and Conclusions

We developed a Bayesian model for exploiting the Manhattan world assumption and showed that it gave good estimates of vanishing points (e.g., viewpoint) as compared to manual estimation. In addition, the model is able to detect outlier edges, which are not aligned to the Manhattan world grid and may be useful for detecting objects. We formulated a null hypothesis model and used model comparison to test whether an image obeyed the Manhattan world assumption. This demonstrated that the Manhattan
Figure 19: Detection of x, y, z lines. In each row, the original image on the far left is followed by images showing the locations of pixels inferred to be on x, y, and z lines, respectively, from center left to far right.
world assumption applies to a range of images, rural and otherwise, in addition to urban scenes. Moreover, the simplicity of the algorithm makes it suitable for implementation by an artificial retina (Burgi, 2002). Our work adds to the growing literature on the statistics of natural images and is perhaps the first to determine statistical regularities that depend explicitly on the three-dimensional scene structure. We expect that there are many further image statistical regularities of this type that might be exploited by biological and artificial vision systems. More recently, Deutscher et al. (2002) have made use of the Manhattan world assumption as a component of a surveillance system (Isard & MacCormick, 2001). Deutscher et al. have implemented a stochastic search algorithm that is faster than the algorithms we use in this article. In addition, they extended the Manhattan formulation to estimate focal length as well as camera orientation and demonstrated reasonable accuracy, although the approach was less accurate than standard methods requiring calibration patterns or motion estimation.

Acknowledgments

We acknowledge funding from NSF with award number IRI-9700446, support from the Smith-Kettlewell core grant, and support from the Center for Imaging Sciences with Army grant ARO DAAH049510494. It is a pleasure to acknowledge e-mail conversations with Song Chun Zhu about scene clutter. We gratefully acknowledge the use of the Sowerby image data set at
British Aerospace and thank Andy Wright for bringing it to our attention. We thank two referees for helpful feedback, in particular, the suggestion that we should introduce a null hypothesis model to compare with the Manhattan model.

References

Balboa, R., & Grzywacz, N. M. (2000). The minimal local-asperity hypothesis of early retinal lateral inhibition. Neural Computation, 12, 1485–1517.
Blake, A., & Yuille, A. L. (Eds.). (1992). Active vision. Cambridge, MA: MIT Press.
Brillault-O'Mahony, B. (1991). New method for vanishing point detection. Computer Vision, Graphics, and Image Processing, 54(2), 289–300.
Burgi, P.-Y. (2002). Bayesian algorithms for autonomous vision systems (Tech. Rep. 1040). Neuchâtel, Switzerland: Swiss Centre for Electronics and Microtechnology.
Chen, H., Belhumeur, P., & Jacobs, D. (2000). In search of illumination invariants. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos, CA: IEEE Computer Society.
Coughlan, J., & Yuille, A. L. (1999). Manhattan world: Compass direction from a single image by Bayesian inference. In Proceedings International Conference on Computer Vision. Los Alamitos, CA: IEEE Computer Society.
Coughlan, J., & Yuille, A. L. (2000). The Manhattan world assumption: Regularities in scene statistics which enable Bayesian inference. In Proceedings Neural Information Processing Systems. Cambridge, MA: MIT Press.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley Interscience Press.
Deutscher, J., Isard, M., & MacCormick, J. (2002). Automatic camera calibration from a single Manhattan image. In Proceedings of the European Conference on Computer Vision (pp. 175–188). New York: Springer-Verlag.
Faugeras, O. D. (1993). Three-dimensional computer vision. Cambridge, MA: MIT Press.
Hartley, R., & Zisserman, A. (2000). Multiple view geometry in computer vision. Cambridge: Cambridge University Press.
Huang, J., & Mumford, D. (1999). Statistics of natural images and models. In Proceedings Computer Vision and Pattern Recognition. Los Alamitos, CA: IEEE Computer Society.
Isard, M., & MacCormick, J. (2001). BraMBLe: A Bayesian multiple-blob tracker. In Proc. 8th Int. Conf. Computer Vision (Vol. 2, pp. 34–41). Los Alamitos, CA: IEEE Computer Society.
Konishi, S., Yuille, A. L., Coughlan, J. M., & Zhu, S. C. (1999). Fundamental bounds on edge detection: An information theoretic evaluation of different edge cues. In Proc. Int'l Conf. on Computer Vision and Pattern Recognition. Los Alamitos, CA: IEEE Computer Society.
Konishi, S. M., Yuille, A. L., Coughlan, J. M., & Zhu, S. C. (2003). Statistical edge detection: Learning and evaluating edge cues. Pattern Analysis and Machine Intelligence, 25(1), 57–74.
Lutton, E., Maître, H., & Lopez-Krahe, J. (1994). Contribution to the determination of vanishing points using Hough transform. IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(4), 430–438.
Mathews, J., & Walker, R. L. (1970). Mathematical methods of physics. Menlo Park, CA: Benjamin/Cummings.
Mundy, J. L., & Zisserman, A. (Eds.). (1992). Geometric invariants in computer vision. Cambridge, MA: MIT Press.
Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.
Ratches, J. A., Walters, C. P., Buser, R. G., & Guenther, B. D. (1997). Aided and automatic target recognition based upon sensory inputs from image forming systems. IEEE Trans. on PAMI, 19(9), 1004–1019.
Shufelt, J. A. (1999). Performance evaluation and analysis of vanishing point detection techniques. IEEE Trans. on PAMI, 21(3), 282–288.
Torr, P., & Zisserman, A. (1998). Robust computation and parameterization of multiple view relations. In Proceedings of the International Conference on Computer Vision (pp. 727–732). Los Alamitos, CA: IEEE Computer Society.
Yuille, A. L., & Coughlan, J. M. (2000). Fundamental limits of Bayesian inference: Order parameters and phase transitions for road tracking. Pattern Analysis and Machine Intelligence, 22(2), 160–173.
Yuille, A. L., Coughlan, J. M., Wu, Y.-N., & Zhu, S. C. (2001). Order parameters for minimax entropy distributions: When does high level knowledge help? International Journal of Computer Vision, 41(1/2), 9–33.
Zhu, S. C., Lanterman, A., & Miller, M. I. (1998). Clutter modeling and performance analysis in automatic target recognition. In Proceedings Workshop on Detection and Classification of Difficult Targets (pp. 477–496). Redstone Arsenal, AL.

Received February 8, 2002; accepted December 2, 2002.
LETTER
Communicated by Michael Jordan
Kernel-Based Nonlinear Blind Source Separation

Stefan Harmeling
harmeli@first.fhg.de

Andreas Ziehe
ziehe@first.fhg.de

Motoaki Kawanabe
nabe@first.fhg.de
Fraunhofer FIRST.IDA, 12489 Berlin, Germany

Klaus-Robert Müller
klaus@first.fhg.de
Fraunhofer FIRST.IDA, 12489 Berlin, Germany, and University of Potsdam, Department of Computer Science, 14482 Potsdam, Germany
We propose kTDSEP, a kernel-based algorithm for nonlinear blind source separation (BSS). It combines complementary research fields: kernel feature spaces and BSS using temporal information. This yields an efficient algorithm for nonlinear BSS with invertible nonlinearity. Key assumptions are that the kernel feature space is chosen rich enough to approximate the nonlinearity and that signals of interest contain temporal information. Both assumptions are fulfilled for a wide set of real-world applications. The algorithm works as follows: First, the data are (implicitly) mapped to a high (possibly infinite)–dimensional kernel feature space. In practice, however, the data form a smaller submanifold in feature space (even smaller than the number of training data points), a fact that has already been used by, for example, reduced set techniques for support vector machines. We propose to adapt to this effective dimension as a preprocessing step and to construct an orthonormal basis of this submanifold. The latter dimension-reduction step is essential for making the subsequent application of BSS methods computationally and numerically tractable. In the reduced space, we use a BSS algorithm that is based on second-order temporal decorrelation. Finally, we propose a selection procedure to obtain the original sources from the extracted nonlinear components automatically. Experiments demonstrate the excellent performance and efficiency of our kTDSEP algorithm for several problems of nonlinear BSS and for more than two sources.
Neural Computation 15, 1089–1124 (2003). © 2003 Massachusetts Institute of Technology
1 Introduction

The problem of nonlinear blind source separation (BSS) is a challenging research task, and several methods have been proposed in the literature (Burel, 1992; Valpola, Giannakopoulos, Honkela, & Karhunen, 2000; Lee, Koehler, & Orglmeister, 1997; Lin, Grier, & Cowan, 1997; Yang, Amari, & Cichocki, 1998; Marques & Almeida, 1999; Pajunen, Hyvärinen, & Karhunen, 1996; Pajunen & Karhunen, 1997; Fyfe & Lai, 2000; Taleb & Jutten, 1999; Hyvärinen, Karhunen, & Oja, 2001). Fruitful applications are conceivable in, among others, the fields of telecommunications, array processing, vision, biomedical data analysis, and acoustic source separation, where nonlinearities can occur in the mixing process. Nonlinear BSS has recently been applied to data from industrial pulp processing (Hyvärinen et al., 2001). In nonlinear BSS we observe a mixed signal,

$$x[t] = f(s[t]), \qquad (1.1)$$

where $x[t] := [x_1[t], \ldots, x_n[t]]^\top$ and $s[t] := [s_1[t], \ldots, s_n[t]]^\top$ are $n \times 1$ column vectors with $t = 1, \ldots, T$, and $f$ is a nonlinear invertible function from

$$\alpha_\Phi^\top \Phi_x^\top \Phi_x \beta_\Phi = \alpha_\Xi^\top \Xi^\top \Xi \beta_\Xi = \alpha_\Xi^\top \beta_\Xi, \qquad (2.3)$$
which states the remarkable property that the dot product of two linear combinations of the columns of $\Phi_x$ in $\mathcal F$ coincides with the dot product of the corresponding coefficient vectors with respect to the orthonormal basis. The projection onto the chosen basis points $v_1, \ldots, v_d$ is computed via the kernel trick,

$$\Psi_x[t] := (\Phi_v^\top \Phi_v)^{-1/2}\, \Phi_v^\top\, \Phi(x[t]) = \begin{bmatrix} k(v_1, v_1) & \cdots & k(v_1, v_d) \\ \vdots & \ddots & \vdots \\ k(v_d, v_1) & \cdots & k(v_d, v_d) \end{bmatrix}^{-1/2} \begin{bmatrix} k(v_1, x[t]) \\ \vdots \\ k(v_d, x[t]) \end{bmatrix}, \qquad (2.4)$$
which is a real-valued $d \times 1$ vector representing a projected data point. Note that $(\Phi_v^\top \Phi_v)^{-1/2}$ can be omitted if the subsequent BSS procedure contains a whitening step. Regarding the computational costs of this projection, we have to evaluate the kernel function $O(d^2 n) + O(dTn)$ times, and equation 2.4 requires $O(d^3)$ multiplications, where $n$ denotes the dimension of the input space. Again, note that $d$ is much smaller than $T$. Furthermore, after projection, the storage requirements are reduced since we do not have to hold the full $T \times T$ kernel matrix but only a $d \times T$ matrix.

2.1.2 Choosing Vectors for the Basis in $\mathcal F$. So far, we have assumed as given some points $v_1, \ldots, v_d$ that fulfill equation 2.1, and we presented the beneficial properties of our construction. In fact, the vectors $v_1, \ldots, v_d$ are roughly analogous to a reduced set in the support vector world (Schölkopf et al., 1999). Note, however, that often we can only approximately fulfill equation 2.1, that is,

$$\mathrm{span}(\Phi_v) \approx \mathrm{span}(\Phi_x), \qquad (2.5)$$

for example, for an RBF kernel, $\mathrm{span}(\Phi_x)$ is $T$-dimensional, but $\mathrm{span}(\Phi_v)$ is by definition $d$-dimensional (for further discussion, see appendix A; Williams & Seeger, 2000; Bach & Jordan, 2001). Several options exist to achieve this approximation. The points $v_1, \ldots, v_d$ have to be chosen such that in equation 2.4, the inversion of the kernel matrix $K_v$, whose entries are $(K_v)_{ij} := (\Phi_v^\top \Phi_v)_{ij} = k(v_i, v_j)$,
is numerically stable.⁶ We try to find a set of points such that the condition number of its corresponding kernel matrix is below a certain threshold, and if we add one more point, the condition number is above it. Since we cannot check all possible combinations, we randomly sample $d$ points (for fixed $d$) repeatedly, say $r$ (for example, 100) times. Roughly speaking, we perform this sampling for different $d$ until we find $d$ points $v_1, \ldots, v_d$ that are linearly independent in feature space (more precisely, the corresponding condition number is below the threshold), but we cannot find $d + 1$ points with the same property (i.e., the condition number for those points is above the threshold). This is done by computing the kernel matrix in $O(d^2 n)$ time, yielding an overall cost of $O(dr)[O(d) + O(d^2 n)] = O(d^3 r n)$, where $n$ denotes the dimension of the input space. This procedure determines $d$ and the points $v_1, \ldots, v_d$. Running k-means clustering (with $k = d$), costing $O(Tdn)$, in input space is another way to pick such points. Our experience shows that both approaches work well as long as $d$ is chosen large enough.

2.2 Finding a Basis via Kernel PCA. Another, more direct method to obtain the low-dimensional subspace is kernel PCA (Schölkopf et al., 1998). Such a subspace is optimal with respect to the reconstruction error in feature space; however, the computational costs are slightly increased (see Figure 13). For simplicity, we assume that the data are centered in feature space.⁷ To perform kernel PCA, we need to find the eigenvectors and eigenvalues of the covariance matrix $\frac{1}{T} \Phi_X \Phi_X^\top$. Denoting the diagonal matrix with the eigenvalues $\lambda_1, \ldots, \lambda_T$ along the diagonal as $\Lambda$, the eigenvectors $E = [e_1, \ldots, e_T]$ of the kernel matrix $\frac{1}{T} \Phi_X^\top \Phi_X$ fulfill

$$\left( \frac{1}{T} \Phi_X^\top \Phi_X \right) E = E \Lambda,$$

which immediately implies

$$\left( \frac{1}{T} \Phi_X \Phi_X^\top \right) (\Phi_X E) = (\Phi_X E) \Lambda.$$

So $\lambda_1, \ldots, \lambda_T$ are the eigenvalues of $\frac{1}{T} \Phi_X \Phi_X^\top$ with corresponding eigenvectors $\Phi_X E$.
Normalizing the first $d$ eigenvectors yields a $d$-dimensional orthonormal basis,

$$\Xi := \Phi_X E_d (T \Lambda_d)^{-1/2},$$

⁶ The condition number of a matrix is the ratio between its largest and smallest singular values.
⁷ The kernel matrix $K$ with entries $k(x[i], x[j])$ can be easily centered (Schölkopf et al., 1998) by $K \mapsto K - 1_T K - K 1_T + 1_T K 1_T$, with $1_T$ being the $T \times T$ matrix with all entries equal to $1/T$.
with $E_d := [e_1, \ldots, e_d]$, $\Lambda_d$ being the diagonal matrix with $\lambda_1, \ldots, \lambda_d$ along the diagonal, and $(T \Lambda_d)^{-1/2}$ ensuring orthonormality. This basis enables us to parameterize the signals $\Phi(x[t])$ in feature space as real-valued $d$-dimensional signals,

$$\Psi_x[t] := \Xi^\top \Phi(x[t]) = (T \Lambda_d)^{-1/2} E_d^\top \Phi_X^\top \Phi(x[t]) = \frac{1}{\sqrt T} \begin{bmatrix} \frac{1}{\sqrt{\lambda_1}} & & 0 \\ & \ddots & \\ 0 & & \frac{1}{\sqrt{\lambda_d}} \end{bmatrix} \begin{bmatrix} e_1^\top \\ \vdots \\ e_d^\top \end{bmatrix} \begin{bmatrix} k(x[1], x[t]) \\ \vdots \\ k(x[T], x[t]) \end{bmatrix}, \qquad (2.6)$$
which are calculated conveniently using the kernel trick. Since kernel PCA involves solving the eigenvalue problem for a large matrix whose size depends on the amount of data, $O(T^3)$, we typically apply kernel PCA to a subset of the original data set if $T$ becomes large (Mika, 1998; Schölkopf et al., 1999).

3 Nonlinear Blind Source Separation

Clearly, BSS algorithms cannot be applied directly in full feature space without the proposed reduction step. They would need to solve a $T$-dimensional BSS problem, which is intractable. A further problem is that manipulating such a $T \times T$ matrix can easily become numerically unstable, and even overfitting might occur (Hyvärinen, Särelä, & Vigário, 1999). In the previous section, we mapped the signals $x[t]$ from input space onto signals $\Psi_x[t]$ in a $d\,(\ll T)$-dimensional parameter space (see Figure 1). This was done either by random sampling or k-means clustering using equation 2.4 or by applying kernel PCA together with equation 2.6. Now we are in a situation in which the nonlinear problem in input space has been transformed to a linear problem in parameter space, where we can apply linear BSS methods. In particular, we propose to use TDSEP, a second-order BSS technique that relies on time-shifted covariance matrices of the mapped signals $\Psi_x[t]$, thereby exploiting the assumed time structure of the unknown sources (see appendix B for a discussion of why not to use kurtosis-based techniques). We briefly describe the TDSEP algorithm (details can be found in Ziehe & Müller, 1998; see also Belouchrani et al., 1997). For the signals $\Psi_x[t]$ in parameter space, we compute the symmetrized time-lagged covariance matrices

$$R_\tau = \frac{1}{2(T - \tau)} \sum_{t=1}^{T - \tau} \left( (\Psi_x[t] - \mu_\Psi)(\Psi_x[t + \tau] - \mu_\Psi)^\top + (\Psi_x[t + \tau] - \mu_\Psi)(\Psi_x[t] - \mu_\Psi)^\top \right),$$

with $\mu_\Psi := \frac{1}{T} \sum_{t=1}^{T} \Psi_x[t]$. Then we find a matrix $W$ that simultaneously diagonalizes⁸ several of these matrices $R_{\tau_i}$; that is, the matrices $W R_{\tau_i} W^\top$, $i = 1, \ldots, m$, should become approximately diagonal. $W$ is the sought-after demixing matrix. The extracted $d$ nonlinear components are $y[t] := W \Psi_x[t]$.

For brevity, we write $s_1$ instead of $s_1[t]$ (i.e., $s_1$ is a signal). In general, the quasi sources up to degree $m$ are all monomials of the form $s_1^{m_1} s_2^{m_2}$ for $0 \le m_1, m_2 \le m$. Accordingly, $q$ is the vector containing all those monomials. Most quasi sources are pairwise correlated. For two independent signals $s_1$ and $s_2$, the correlation between arbitrary monomials in $s_1$ and $s_2$ is
$$\mathrm{corr}(s_1^{k_1} s_2^{m_1},\, s_1^{k_2} s_2^{m_2}) = \frac{\mathrm{cov}(s_1^{k_1} s_2^{m_1},\, s_1^{k_2} s_2^{m_2})}{\sqrt{\prod_{i=1,2} \mathrm{var}(s_1^{k_i} s_2^{m_i})}} = \frac{E\{s_1^{k_1 + k_2}\}\, E\{s_2^{m_1 + m_2}\} - E\{s_1^{k_1}\}\, E\{s_1^{k_2}\}\, E\{s_2^{m_1}\}\, E\{s_2^{m_2}\}}{\sqrt{\prod_{i=1,2} \left( E\{s_1^{2 k_i}\}\, E\{s_2^{2 m_i}\} - \left( E\{s_1^{k_i}\}\, E\{s_2^{m_i}\} \right)^2 \right)}}.$$
Since for symmetrically distributed signals $s_1$ and $s_2$ (with mean zero and variance one) the odd moments are zero,

$$E\{s^k\} = 0 \quad \text{if $k$ is odd},$$

two quasi sources $s_1^{k_1} s_2^{m_1}$ and $s_1^{k_2} s_2^{m_2}$ are uncorrelated in most cases:

$$\mathrm{corr}(s_1^{k_1} s_2^{m_1},\, s_1^{k_2} s_2^{m_2}) = 0 \quad \text{if $k_1 + k_2$ is odd or $m_1 + m_2$ is odd}. \qquad (4.1)$$

⁸ We use the algorithm described in Cardoso and Souloumiac (1996). See also Cardoso (1999).
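Equation 4.1 is easy to check numerically with two independent, symmetric, zero-mean sources; the gaussian and uniform choices below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
s1 = rng.standard_normal(200_000)        # symmetric about zero
s2 = rng.uniform(-1.0, 1.0, 200_000)     # symmetric about zero

def corr(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

# k1 + k2 = 1 + 2 is odd -> uncorrelated quasi sources (equation 4.1).
c_odd = corr(s1 * s2, s1 ** 2 * s2)
# Both exponent sums even -> generically correlated.
c_even = corr(s1 ** 2, s1 ** 2 * s2 ** 2)
```

With enough samples, `c_odd` is close to zero while `c_even` is clearly positive, matching the grouping of quasi sources described next.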
This follows easily from the above expression for the correlation, using the fact that if the sum of two integers is odd, then one of the summands must be odd as well. Therefore, the quasi sources for two signals can be collocated into four groups with no correlations among the groups; for example, for the quasi sources up to degree 2, the four groups are (see Figure 3)

$$\{s_1^2 s_2^2,\, s_1^2,\, s_2^2\}, \quad \{s_1^2 s_2,\, s_2\}, \quad \{s_1 s_2^2,\, s_1\}, \quad \{s_1 s_2\}.$$

We will use these findings to reconstruct the extracted components for an easy example. Consider two sinusoidal source signals $s[t] = [s_1[t], s_2[t]]^\top$ that are nonlinearly mixed by

$$x[t] = A\, (s_1[t], s_2[t])^\top + c\, s_1[t]\, s_2[t],$$

with

$$A = \begin{bmatrix} -1.2173 & -1.1283 \\ -0.0412 & -1.3493 \end{bmatrix} \quad \text{and} \quad c = \begin{bmatrix} -0.2611 \\ 0.9535 \end{bmatrix}$$

(mixture taken from Molgedey & Schuster, 1994). Running our algorithm with a polynomial kernel of degree 4,

$$k(a, b) = (a^\top b + 1)^4,$$
Figure 3: Most quasi sources are pairwise correlated. The right panel shows the covariance matrix of the quasi sources up to degree 2, the lower left panel up to degree 4, and the upper left panel up to degree 8. Note that the quasi sources can always be collocated into four groups.
[Figure 4 panels: the extracted signals $y_7$, $y_4$, $y_1$, $y_9$ (left), their best-matching single quasi sources (middle; corr = 0.992 for $s_1$, 0.988 for $s_2$, and 0.538 and 0.603 for the higher-order monomials), and reconstructions from combinations of quasi sources (right; corr = 0.995, 0.990, 0.909, 0.960).]
Figure 4: The extracted signals in the left panels (only four are shown) are matched with single quasi sources in the middle panels and combinations of subgroups of quasi sources (right panels).
Kernel-Based Nonlinear Blind Source Separation

we have to consider the quasi sources up to degree 4: all possible products of s1, s1^2, s1^3, s1^4 and their counterparts in s2. Using equation 4.1, these quasi sources can also be arranged into four groups with no correlations among the groups. As examples, we explain four of the extracted signals using these quasi sources: y7, y4, y1, y9, shown in the left panels of Figure 4. The middle panels show the best-matching quasi sources. Note that the true sources, s1 and s2, have a very high correlation with their left neighbors, y7 and y4, respectively. The other extracted signals, y1 and y9, do not have a very high correlation with any of the quasi-source signals. The best fits, s1^4 s2 and s1^4 s2^4, are plotted in the two lower middle panels. The extracted signals can be explained better with linear combinations of subsets of mutually correlated quasi sources. Therefore, we combined all quasi sources that are correlated with s1^4 s2^4 to reconstruct y9. The result is shown in the lower right panel, which reaches a good fit (corr = 0.960). The same holds for y1 and the other extracted signals not shown in the figure. Note that for y7 and y4, which match s1 and s2 reasonably well, additional quasi sources do not improve the result notably. Empirically, we have seen that the extracted components can be explained by linear combinations of higher-order monomials of the sources. This knowledge suggests several options for selecting the signals of interest: the signals built from higher-order monomials are very peaky and therefore have, after proper normalization, a lower variance and also a lower description length (Rissanen, 1978) than the signals of interest. However, whether methods based on these intuitions work in practice depends very much on the signals considered. The goal is to identify the sought-after signals s1 and s2 among the other extracted signals. If, for example, s1 has a much higher description length than s2, there might be a problem: probably s1^2 will have a large description length as well, and a method based on these principles is therefore more likely to prefer s1^2 over s2. This explains why such selection procedures can fail easily, which is in accordance with our experience from running a number of experiments with different signals.

4.2 Selection by Rerunning the Algorithm. An algorithmic trick that worked well in all our experiments is to apply the algorithm twice. This trick is motivated by work on assessing reliability (Meinecke, Ziehe, Kawanabe, & Müller, 2002, in press; Müller, Vigario, Meinecke, & Ziehe, in press). Intuitively, the idea is to look for the most reliable components among the extracted signals, that is, the components that appear again after reiteration of the algorithm. For this, we repeat the algorithm with the same parameters (kernel choice, d, τ, …), but instead of sending x[t] into the feature space, we start with the d-dimensional demixed results y[t], map them to the feature space, reduce the dimensionality,⁹ and demix with TDSEP, which yields y′[t]. The sought-after components of y[t] are the ones that are matched best by the components y′[t] of the second run of TDSEP in feature space. Why does this selection process find the right signals? As we saw in the previous section, most of the undesired signals are linear combinations of peaky higher-order monomials of the source signals, which can have very large values. Before the signals are mapped to feature space, they are scaled such that their absolute maximum is one (see the first paragraph of section 2). This is done by dividing each signal by its absolute maximum.
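The two ingredients of this selection, scaling to unit absolute maximum and matching components across the two runs by correlation, can be sketched as follows (helper names are ours; signals stored as rows of numpy arrays):

```python
import numpy as np

def scale_to_unit_max(Y):
    """Scale each signal (row) so that its absolute maximum is one."""
    return Y / np.abs(Y).max(axis=1, keepdims=True)

def reliability(Y, Y2):
    """For each component of the first run (rows of Y), the largest
    absolute correlation with any component of the second run (Y2)."""
    # z-score rows so that row dot products become correlation coefficients
    Z = (Y - Y.mean(1, keepdims=True)) / Y.std(1, keepdims=True)
    Z2 = (Y2 - Y2.mean(1, keepdims=True)) / Y2.std(1, keepdims=True)
    C = np.abs(Z @ Z2.T) / Y.shape[1]   # |correlation| matrix
    return C.max(axis=1)                # reliability score per component
```

Components of the first run whose best absolute correlation with some component of the second run is close to one are kept as the sought-after sources; order and sign of the components do not matter.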
The effect of this rescaling is that very large peaky signals are penalized; their variance is decreased more than the variance of signals that are less peaky. By doing so, we bias the desired signals to appear again with high correlations after another nonlinear demixing. This method works very well: all experiments documented in this article successfully used this selection method. Our experience shows that this selection fails only in cases where the sources are not recovered at all, that is, where the demixing failed, which means that there is nothing to select from.

⁹ The kernel function can be used with signals of arbitrary dimension.

5 Experiments

Nonlinearities appear in different contexts; for example, amplifier saturation results in difficult nonlinearities. Sensors, too, can have nonlinearities, which have a disadvantageous influence on the recorded signals. However,
S. Harmeling, A. Ziehe, M. Kawanabe, and K. Müller
real-world signals have the drawback that the ground truth, the true source signals, is not known. Since we want to demonstrate the performance of our algorithm, we consider in this article well-defined, controlled situations in which we can compare the results of the algorithm to the true sources.

5.1 Deterministic Artificial Data. In the first experiment, we generate 2000 data points from two sinusoidal signals s[t] = [s1[t], s2[t]]^T that have different frequencies (s1[t] = sin(0.05πt), s2[t] = sin(0.021πt) with t = 1, …, 2000). These source signals are nonlinearly mixed (see the left panel in the first row of Figure 5) by

x1[t] = e^{s1[t]} − e^{s2[t]}
x2[t] = e^{−s1[t]} + e^{−s2[t]}.

We use a polynomial kernel of degree 9,

k(a, b) = (a^T b + 1)^9,

which induces a feature space of all monomials up to degree 9. Applying k-means clustering to 500 randomly chosen input vectors, we determine vectors v1, …, v20 in input space, shown as + in the left panel in the first row of Figure 5. Projecting onto the feature-space images of these vectors reduces the dimension to 20. As the second step, we apply TDSEP (with time shifts τ = 0, …, 7) to those 20-dimensional mapped signals Ψx[t]. We obtain 20 components, among which we select two components, as described in the previous section. Their scatter plots and their waveforms are shown in the right panels in the first and third rows of Figure 5. For comparison, we plot in the left panels of the second and fourth rows the results of applying linear TDSEP directly to the nonlinearly mixed signals x[t]. In this simple example, linear TDSEP already reaches a high correlation to the true sources (corr(y1^lin, s1) = 0.9716, corr(y2^lin, s2) = 0.9716), but as we can see in the scatter plots shown in Figure 5, linear BSS fails to recover the right shape, in contrast to our nonlinear method, which recovers the shape of the scatter plot almost perfectly (corr(y20, s1) = 0.9998, corr(y5, s2) = 0.9999). We study the same mixture with two sinusoidal signals that have almost the same frequencies (s1[t] = sin(0.0045πt), s2[t] = sin(0.005πt) with t = 1, …, 2000). Figure 6 shows the results after running our algorithm with the same parameters as in the previous case. We see that even two signals with almost the same frequencies are separated.

5.2 Speech Data: Bent. In another experiment, we nonlinearly mix two speech signals s[t] = [s1[t], s2[t]]^T (each with 20,000 samples, sampling rate 8 kHz, each ranging between −1 and +1) by

x1[t] = −(s2[t] + 1) cos(π s1[t])
x2[t] = 1.5 (s2[t] + 1) sin(π s1[t]).
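Both mixing maps are easy to reproduce; a sketch of the generating equations (using the sinusoids of section 5.1, which also stand in here for the ±1-ranged speech signals of the bent mixture):

```python
import numpy as np

t = np.arange(1, 2001)

# section 5.1: two sinusoids of different frequency, exponentially mixed
s1 = np.sin(0.05 * np.pi * t)
s2 = np.sin(0.021 * np.pi * t)
x1 = np.exp(s1) - np.exp(s2)
x2 = np.exp(-s1) + np.exp(-s2)

# section 5.2: the "bent" mixture for signals ranging between -1 and +1
u1 = -(s2 + 1) * np.cos(np.pi * s1)
u2 = 1.5 * (s2 + 1) * np.sin(np.pi * s1)
```

No linear map of (x1, x2) can undo the exponentials, which is why linear TDSEP can recover waveforms with high correlation while still missing the shape of the scatter plot.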
Figure 5: Deterministic artificial data. Scatter plots and waveforms of the nonlinear mixture and the nonlinear demixing (first and third rows) and of the linear demixing and the true source signals (second and fourth rows).
Figure 6: Deterministic artificial data with very close frequencies. Scatter plots and waveforms of the nonlinear mixture and the nonlinear demixing (first and third rows) and of the linear demixing and the true source signals (second and fourth rows).
We employ a gaussian radial basis function (RBF) kernel,

k(a, b) = exp(−‖a − b‖² / (2σ²)),

which induces a feature space where each direction measures the similarity to one of the training points. We can set σ² = 1/2 and use

k(a, b) = exp(−‖a − b‖²)
without loss of generality, if the signals are scaled in an appropriate way. Projecting onto the feature-space images of the vectors v1, …, v20

(piano music, scientific utterance, cembalo music, street noise, cello music, funk music, and political speech, each with 20,000 samples, sampling rate 8 kHz) by two steps:

1. Scale the signals between −1 and 1, that is, such that they are contained inside the centered hypercube with side length 2. Rotate that cube such that its main diagonal (which has length 2√7) is aligned with the first axis. This operation can be done by some orthogonal 7 × 7 matrix A.

2. Rotate around different planes by an angle that depends on the first component of the vector s̄[t] = As[t], which we denote by s̄1[t]. More
Figure 14: Scatter plots for different values of d: (left) artificial data, (middle) speech data (bent), and (right) speech data (twisted).
Figure 15: Stochastic artificial data. Scatter plots and waveforms of the nonlinear mixture and the nonlinear demixing (first and third rows) and of the linear demixing and the true source signals (second and fourth rows).
precisely, for a vector s̄[t], we define a 7 × 7 matrix,

B(s̄[t]) =
  [ 0      0        0       0        0       0        0      ]
  [ 0      0       s̄1[t]    0        0       0        0      ]
  [ 0    −s̄1[t]    0       0        0       0        0      ]
  [ 0      0        0       0       s̄1[t]    0        0      ]
  [ 0      0        0     −s̄1[t]    0       0        0      ]
  [ 0      0        0       0        0       0       s̄1[t]   ]
  [ 0      0        0       0        0     −s̄1[t]    0      ]

and use it to define a rotation matrix (employing the matrix exponential),

C(s̄[t]) = e^{π B(s̄[t])},

that rotates along the second and third planes, along the fourth and fifth planes, and along the sixth and seventh planes by the angle π s̄1[t]. Note that C(s̄[t]) is an orthogonal matrix that is continuous in s̄[t].
The complete mixture¹¹ then reads

x[t] = C(As[t]) As[t].
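The rotation construction above can be sketched numerically (helper names are ours; the matrix exponential is approximated by a truncated Taylor series, and since B is skew-symmetric, the result is orthogonal and rotates the (2,3), (4,5), and (6,7) planes by the angle π·s̄1):

```python
import numpy as np

def B(sbar1):
    """Skew-symmetric generator with sbar1 in the (2,3), (4,5), (6,7) planes."""
    M = np.zeros((7, 7))
    for i in (1, 3, 5):          # 0-based indices of the rotated planes
        M[i, i + 1] = sbar1
        M[i + 1, i] = -sbar1
    return M

def expm_taylor(M, terms=40):
    """Truncated Taylor series for the matrix exponential (sketch)."""
    out, term = np.eye(len(M)), np.eye(len(M))
    for k in range(1, terms):
        term = term @ M / k
        out = out + term
    return out

C = expm_taylor(np.pi * B(0.3))
# C is orthogonal (C.T @ C = I), which is what makes the mixture invertible
```

The 2 × 2 blocks of C are plane rotations by π·s̄1; the first coordinate, which carries s̄1 itself, is left untouched.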
Linear TDSEP is not able to demix x[t]. Listening to the linearly demixed signals reveals that each component contains contributions of at least two sources (and they are distorted as well). Their correlations with the true sources are below 0.82 (corr(y2^lin, s1) = 0.7805, corr(y1^lin, s2) = 0.7415, corr(y3^lin, s3) = 0.6663, corr(y4^lin, s4) = 0.7481, corr(y5^lin, s5) = 0.7550, corr(y6^lin, s6) = 0.5929, corr(y7^lin, s7) = 0.8130). For kTDSEP, we apply k-means clustering to 500 randomly chosen input vectors and obtain d = 40 vectors v1, …, v40. The mapped signals Φ(x[t]) in the feature space (induced by a gaussian RBF kernel k(a, b) = exp(−‖a − b‖²)) are projected onto the images of those 40 vectors. Applying TDSEP (with time shifts τ = 0, …, 30) to the resulting 40 signals Ψx[t] and using the selection procedure described above (see Figure 16), we find the seven sought-after sources. These seven nonlinearly demixed signals not only have high correlations with the true sources (corr(y1, s1) = 0.9432, corr(y6, s2) = 0.9496, corr(y24, s3) = 0.9394, corr(y16, s4) = 0.9218, corr(y5, s5) = 0.9508, corr(y22, s6) = 0.9142, corr(y4, s7) = 0.9402), but their waveforms also match the true sources very well (see the right panels of Figure 17). Note that the complexity of the algorithm depends mostly on the choice of d. Even to demix seven nonlinearly mixed sources, the most time-consuming

¹¹ This mixture is invertible because of the orthogonality and continuity of the matrices involved.
Figure 16: More than two sources. The 17 largest correlations (roughly between 0.7 and 0.95) between the demixed signals y[t] and the signals obtained after applying kTDSEP again to y[t]. The numbers at the foot of the bars (6, 5, 4, 24, 1, 16, 22, 8, 15, 21, 23, 17, 34, 30, 32, 26, 31) are the indices of the corresponding signals in y[t]. The seven leftmost bars indicate the sought-after sources.
part of the algorithm is to simultaneously diagonalize 31 time-shifted covariance matrices of size 40 × 40, which can be done very fast (Cardoso & Souloumiac, 1996). On a 600 MHz Pentium laptop, the Matlab calculation for this experiment with seven sources took less than 6 minutes.

6 Conclusion

Our work combines three interesting ideas: kernel feature spaces, techniques for dimensionality reduction, and blind source separation. The first two enable us to construct an orthonormal basis of the low-dimensional subspace in kernel feature space F where the data lie. This technique establishes a highly useful (scalar-product-preserving) isomorphism between the image of the data points in F and a d-dimensional space

(a^T b + c)^p, the feature space is finite-dimensional. For example, for homogeneous kernels, that is, c = 0, the dimensionality can be calculated by the formula

(n + p − 1)! / (p! (n − 1)!),

with n being the dimensionality of the mixed signals (taken from Mika, 1998). For instance, at n = 3 and for a polynomial kernel of degree p = 5, the feature space is 21-dimensional, which can also be seen by plotting the largest eigenvalues of the corresponding kernel matrix (see Figure 18, left panel). We clearly see the gap between the twenty-first and the remaining eigenvalues, which should actually be zero.¹² Obviously, in this case we can fulfill equation 2.1 with d = 21. For a gaussian RBF kernel,

k(a, b) = exp(−‖a − b‖²),

the feature space is infinite-dimensional. However, T data points are always contained in a T-dimensional subspace, the one spanned by the mapped points themselves; since the corresponding eigenvalues decay exponentially fast (see Figure 18, right panel), the data in feature space can be approximated very well by a much lower-dimensional subspace (for more detailed discussions of this issue, see Williams & Seeger, 2000; Bach & Jordan, 2001). Therefore, we can approximate equation 2.1 for suitable d.

Appendix B: Kurtosis-Based ICA in Feature Space?

Exploiting the temporal structure of the source signals is essential for the good performance of our method. However, for nonlinear feature extraction, it would be desirable to construct nonlinear kernel ICA methods where the source signals are assumed to be independent in time. We have tried applying standard linear ICA algorithms such as JADE and FastICA (with deflation) in feature space but could not get successful results. One reason might be that general nonlinear ICA problems (see equation 1.1) have nonunique solutions, as explained in Hyvärinen and Pajunen (1999). This indeterminacy in kernel Hilbert space has not been clearly understood yet
Those eigenvalues are nonzero for numerical reasons.
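The dimensionality formula above is the binomial coefficient C(n + p − 1, p); a quick check (function name is ours, homogeneous polynomial kernel with c = 0 as in the text):

```python
from math import comb

def poly_feature_dim(n, p):
    """Dimension of the space of degree-p monomials in n variables:
    (n + p - 1)! / (p! (n - 1)!) = C(n + p - 1, p)."""
    return comb(n + p - 1, p)

print(poly_feature_dim(3, 5))  # 21, matching the eigenvalue gap in Figure 18
```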
Figure 18: The largest eigenvalues of the kernel matrix on a logarithmic scale: (left) for the polynomial kernel of degree 5 and (right) for the gaussian RBF kernel.
and is beyond the scope of this article. The other reason is that some of the linear ICA contrast functions are not appropriate in feature space, which we examine here. As an example, we show that by simply applying kurtosis-based ICA methods (deflation) in feature space, we cannot extract the original sources. For this, we consider an example discussed in Harmeling et al. (2001):

x1 = a11 s1 + a12 s2 + b1 s1 s2
x2 = a21 s1 + a22 s2 + b2 s1 s2.

Let us take a polynomial kernel of order 2,

k(a, b) = (a^T b + 1)^2,

which introduces a six-dimensional feature space F represented by

x1, x2, x1^2, x2^2, x1 x2, 1.

Kurtosis-based ICA methods pick linear mixtures y of these six basis directions that are local maxima of the scaled kurtosis,

κ̃(y) = (E[(y − μ)^4] − 3 E[(y − μ)^2]^2) / E[(y − μ)^2]^2 = E[(y − μ)^4] / E[(y − μ)^2]^2 − 3,   (B.1)

where μ = E[y]. From the bilinear form of the ICA model, we see that any vector in the feature space F can be represented as a linear combination of
nine quasi sources,¹³ which are polynomials of s1 and s2:

s1, s2, s1^2, s2^2, s1 s2, s1^2 s2, s1 s2^2, s1^2 s2^2, 1.

For simplicity, we will use the following transformed quasi sources instead, where σi^2 = E[si^2] are the variances of the signals:

s1, s2, s1^2 − σ1^2, s2^2 − σ2^2, s1 s2, (s1^2 − σ1^2) s2, s1 (s2^2 − σ2^2), (s1^2 − σ1^2)(s2^2 − σ2^2), 1.   (B.2)
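The scaled-kurtosis contrast of equation B.1 is a one-liner; a sketch (signals as numpy arrays):

```python
import numpy as np

def scaled_kurtosis(y):
    """Equation B.1: E[(y - mu)^4] / E[(y - mu)^2]^2 - 3."""
    m = y - y.mean()
    return (m ** 4).mean() / (m ** 2).mean() ** 2 - 3.0
```

A uniformly distributed signal gives about −1.2, while very peaky signals, such as the higher-order monomials of the sources, give large positive values, which is why kurtosis-based deflation tends to pick them up instead of the sources.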
In general, only the pairs

α1 s1 + α2 (s1^2 − σ1^2)   and   β1 s2 + β2 (s2^2 − σ2^2),   αi, βi ∈ ℝ,
s_i(x) = 1 for x = i, γ for |x − i| = 1, and 0 otherwise,   (3.5)
with discrete spatial coordinate x, stimulus location i, and extent of stimulus overlap γ (see Figure 4B; x, i ∈ {1, 2, …, Ns}, γ = 0.5) and with temporal structure according to equation 3.3. Two different strengths of wave-based neural interaction are used: strong (κ = 5 ≫ σf, interaction mode) and negligible (κ = 0.01 ≪ σf, SOM mode). All other model parameters are set as in case 1 of section 3.1. The results are compared with those of the SOM algorithm applied with the same parameter values (neighborhood width according to equation 2.7; the model parameters v, κ, and σk expressing TOM dynamics and interaction
The Time-Organized Map Algorithm
not used). The state of learning is described quantitatively by the map error, defined as the minimal average deviation of the weight vectors wk(n) from the optimal weight vectors wk,opt and Swk,opt:

E(n) = (1/Nc) min( Σ_{k=1}^{Nc} ‖wk(n) − wk,opt‖ , Σ_{k=1}^{Nc} ‖wk(n) − Swk,opt‖ ).   (3.6)
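Equation 3.6 translates directly into code; a sketch (weight vectors stored as rows of a numpy array; the reversal S of the neuron numbering becomes a row flip):

```python
import numpy as np

def map_error(W, W_opt):
    """Equation 3.6: minimal average deviation of the weight vectors
    from the optimal map, minimized over the two chain orientations."""
    forward = np.linalg.norm(W - W_opt, axis=1).sum()
    backward = np.linalg.norm(W - W_opt[::-1], axis=1).sum()  # S reverses numbering
    return min(forward, backward) / len(W)
```

Taking the minimum over the two orientations makes the error zero for a perfectly ordered map regardless of the direction in which the chain learned the stimuli.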
Optimal weight vectors follow from symmetry considerations: wk,opt = sk follows from the homogeneous spatial and temporal stimulus distances, and Swk,opt = s_{Nc−k+1} follows from wk,opt by the initial invariance of the chain of neurons under direction inversion (k = 1, …, Nc; S reverses the neuron numbering; boundary conditions neglected).

3.2.2 Results. In all three cases (SOM, TOM in SOM mode, and TOM in interaction mode), the average map error decreases approximately monotonically with time (average over 50 runs). Moreover, Figure 4 shows:

1. Temporal stimulus relations facilitate topography learning. The average map error decreases faster and reaches a smaller final value with lower standard deviation for TOM learning in interaction mode relative to TOM learning in SOM mode and also relative to SOM learning.

2. At the beginning of learning (n ≤ 10^4), SOM learning progresses faster than TOM learning. This is due to the larger number of weight vectors adapted in each learning step of SOM learning compared with TOM learning.

3. In spite of SOM's seemingly advantageous initial learning, TOM learning in SOM mode yields less fractured maps in our example than SOM learning (7 and 17 fractured maps out of 50 runs for TOM and SOM, respectively). This may be due to fewer homogeneous weight vectors along the neural chain of the TOM algorithm, resulting from the restriction of learning to the adaptation of only one weight vector per learning step. Fewer homogeneous weight vectors may be favorable because they allow a more extensive exploration of the space of neural mappings, yielding a higher probability of finding the globally optimal solution (i.e., a map without fractures).

The facilitation of learning by temporal stimulus distances was verified statistically by a chi-square test for homogeneity (Sachs, 1999). Fifty (43, 33) of 50 learning processes were successful for the TOM algorithm in interaction mode (the TOM algorithm in SOM mode, the SOM algorithm), resulting in topographic maps without fractures.
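These success counts fully determine the chi-square statistics of the pairwise comparisons; a sketch of the standard 2 × 2 homogeneity test (function name is ours):

```python
def chi2_homogeneity(succ_a, succ_b, n=50):
    """Chi-square test for homogeneity of two success probabilities,
    n runs per algorithm: N (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    a, b = succ_a, n - succ_a          # successes / failures, algorithm A
    c, d = succ_b, n - succ_b          # successes / failures, algorithm B
    return 2 * n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

print(round(chi2_homogeneity(50, 43), 1))  # 7.5  (interaction mode vs. SOM mode)
print(round(chi2_homogeneity(50, 33), 1))  # 20.5 (interaction mode vs. SOM)
print(round(chi2_homogeneity(43, 33), 1))  # 5.5  (SOM mode vs. SOM)
```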
Accordingly, the hypothesis of equal success probabilities for TOM learning in interaction mode and (1) TOM learning in SOM mode (χ² = 7.5) and (2) SOM learning (χ² = 20.5) is rejected at the significance level 1 − α = 0.99 (χ²_{1;α=0.01} = 6.6). The hypothesis of equal success probabilities for TOM learning in SOM mode and SOM
J. Wiemer
learning (χ² = 5.5) is rejected only at the significance level 1 − α = 0.95 (χ²_{1;α=0.05} = 3.8).

3.3 Competition of Stimulus Topologies. In this section, TOM learning is analyzed with deviating spatially and temporally defined stimulus topologies. Competition of stimulus topologies results: one topology dominates, or both topologies yield an intermediate representation in which the two stimulus topologies are intermingled.

3.3.1 Experimental Set-Up. In order to capture the characteristics of TOM learning, spatiotemporal stimuli are generated that constitute a generic toy example: spatial stimuli follow from permuting the coordinates and numbers of the stimuli defined in equation 3.5 such that the temporally and spatially ordered stimuli form patterns that can easily be distinguished by visual inspection, as in Figure 5B (ISIs set according to equation 3.3; for details, see Wiemer, 2001). In accordance with the more difficult learning task of deviating topologies, the number of neurons is reduced, the number of learning steps is increased, and the other model parameters are adjusted appropriately (Ns = Nc = σk = σ0 = 1/σf = 10, v = Vs = 1, α = 0.01, nf = 5·10^6, n′f = 0.9·nf).³

3.3.2 Results.
Our experiment yields:
1. Topography according to temporal topology. Temporally defined topology can be transferred into topographic structure even if it deviates from the topology given by the spatial structure of the stimuli (see Figure 5C, bottom row, left). Here, weight vectors correspond to presented stimuli and are ordered along the chain of processing neurons according to temporal stimulus distances (see Figure 5B, top).

2. Topography according to spatial topology. Spatially defined topology is learned if the interactions between wave and feedforward activation are weak, as in SOM mode (see Figure 5C, bottom, right). Here, weight vectors correspond to presented stimuli and are ordered along the chain of processing neurons according to spatial stimulus distances (see Figure 5B, bottom).

3. Topography of mixed topologies. Intermediate interaction strengths yield maps whose structures lie between the above two cases (κ = 2 in Figure 5C, middle). Here, weight vectors correspond to averaged presented stimuli and are ordered according to temporal and spatial stimulus distances.

³ The results of this section are also obtained with the same parameters as in the previous two sections if only the parameter of stimulus overlap, γ, is reduced.
Figure 5: Competition of topologies. Self-organization leads to different topographies depending on the strength κ of stimulus interaction. (A) Temporal course of the average spatial (left) and temporal (right) map error. (B) Spatial structure of the applied stimuli, ordered according to temporal and spatial topology. (C) Representative emerged maps for different interactions (κ = 5, 2, 0.01) and at different instants during learning (n = 2·10^6, 5·10^6).
4. Weight vectors average according to different topologies. The top and bottom rows of Figure 5C present different instants during self-organization: n = 2·10^6 at noise level σnoise ≈ 1.3 and n = 5·10^6 at σnoise = 0.1, respectively. The weight vectors are more (top row) and less (bottom row) strongly averaged stimuli. Averaging is performed over those stimuli that are represented by neighboring neurons. If the interaction is strong, neighboring neurons average over temporally neighboring stimuli (top row, left). If the interaction is weak, neighboring neurons average over spatially neighboring stimuli (top row, right). Intermediate interaction yields intermediate results (top row, middle). In all cases, the learned type of topography affects the set of resulting weight vectors as long as the level of noise is sufficiently high.⁴

The learning results shown for κ = 5, 2, and 0.01 in Figure 5C are typical: they were obtained in 14, 20, and 20 out of 20 learning processes, respectively. Figure 5A suggests the division of TOM learning into two phases (average over 20 runs): an initial ordering phase (up to about n = 10^6), in which the randomly structured initial mapping is adapted toward the representation of the presented stimuli and toward both topologies, and a following convergence phase, in which, depending on the interaction strength, learning is dominated by spatial or temporal topology. With regard to spatial (temporal) topology, the average map error decreases to its lowest value for κ = 0.01 (κ = 5) and increases slightly for κ = 5 (κ = 0.01) during the convergence phase (see Figure 5A, left (right)).

Let us return to the important issue of how the starting point of wave propagation is selected. The TOM algorithm is designed so that topographies corresponding to temporally defined topology are stable fixed points for sufficiently small noise. This is fulfilled only if wave propagation starts from the locus of maximal feedforward activation. For example, TOM learning with κ = 5 and σnoise = 1 destroys the perfect temporal topology shown in Figure 5C if wave propagation starts from the learning neuron. First, due to noise, the weight vectors of nonoptimally determined learning neurons are adapted toward presented stimuli. Then nonoptimal learning neurons serve as starting points for wave propagation and yield inadequate time-to-space transformations with further inadequate weight vector changes. Thereby, the noise of successive learning steps adds up and leads to a destruction of the temporal topography.

3.4 Application to Seminatural Spatiotemporal Stimuli.
This section shows an exemplary application of the TOM algorithm to seminatural spatiotemporal stimuli, that is, natural spatial stimuli with systematic temporal relations set by hand. In addition to the results obtained in the previous sections, we show how spatial topology and temporal topology supplement each other to form a topographic structure integrating features of both topologies.

3.4.1 Experimental Set-Up. A CCD camera system (PULNIX TM-765, 16 mm, f 1.4) was used to take pictures of 10 people from five different

⁴ Noise in the TOM algorithm refers to random, normally distributed shifts within a given neural topology. Accordingly, the level of noise in TOM learning expresses the degree of cooperation or redundancy of adjacent neurons: how similar and interchangeable their responses are.
Figure 6: Seminatural spatiotemporal stimuli. (A) Spatial structure of the five stimuli of one person. Pictures were centered, cut to about square shape, and edge filtered. Temporal distances were set proportional to viewing-angle differences; see the text for details. (B) Euclidean distances of averaged views in the form of the corresponding distance matrix (white (black): minimum (maximum) distance). The Euclidean distance of averaged views varies approximately monotonically with the difference in viewing angle. (C) Distance matrix of single views. The monotonic relationship between distance and difference in viewing angle is not generally preserved at the level of single views.
perspectives (left profile, left half profile, front, right half profile, and right profile). The images were centered, cut to about square shape (54 rows × 55 columns), and filtered by a Canny edge detector (see Figure 6A for an exemplary series of stimuli obtained for one person). For averaged views, the spatial Euclidean distance varies monotonically with the difference in viewing angle (see the corresponding distance matrix in Figure 6B). However, this relationship is not preserved at the level of single views (see Figure 6C): the diagonal 5 × 5 matrix structure of Figure 6B, in which distance increases monotonically with the difference in viewing angle, is not reproduced by the 5 × 5 submatrices comparing the views of different people.
The stimuli are presented to the TOM algorithm as follows:

1. Every tenth learning step, a person is randomly drawn.

2. Ten stimuli of that person are randomly drawn and successively presented to the network.

3. The ISIs between these stimuli are set proportional to their differences in viewing angle,

isi_n = (4/π) |φ_n − φ_{n−1}|,   (3.7)

where φ_n, φ_{n−1} are the viewing angles in radians of the stimuli s_n, s_{n−1}. The ISIs between different groups of 10 stimuli are set to infinity,

isi_{10·m+1} = ∞,   (3.8)
for m ∈ ℕ, m < nf/10. Thereby, the ISIs vary systematically between views of the same person, but views of different people are not temporally linked. All model parameters are set in accordance with the numerical experiments presented in the previous sections (Nc = σk = σ0 = κ = 5, v = 1, α = 0.01, nf = 10^5, and n′f = 0.9·nf). The results of self-organization are evaluated by visualizing weight vectors and feedforward activations and by determining which neurons are maximally activated by averaged views (e.g., averaging all left profiles).

3.4.2 Results. The TOM algorithm yields the following two major results:
1. View-specific neurons. Neural weight vectors emerge that correspond to averaged views (see Figure 7A). Averaged left profiles are coded in neuron 1, averaged left half profiles in neuron 2, and so on (see Figure 7B). The view specificity of neurons is also clearly visible at the single-stimulus level of neural feedforward activation (see Figure 7C).

2. Viewing-angle topography. The view-specific neurons are ordered according to viewing angle, that is, according to temporally defined topology.

The results are remarkable for three reasons. First, due to the stimuli's spatial Euclidean distances presented in Figure 6C, the results cannot be obtained by the SOM algorithm. Second, they confirm the results of section 3.3 that temporal topology dominates the self-organization of topography for sufficiently strong interaction. Third, they show that spatial topology supplements temporal topology in the process of topography formation: temporal topology does not entirely determine the resulting topographic structure but allows the self-organization of different structures. The remaining degrees of freedom within topography from temporal topology are reduced by spatial topology. Although views of different people are not temporally linked, stimuli of different people with the same viewing angle are aligned. For example, left-profile and right-profile stimuli are only rarely mixed (see Figure 7C). This results from the spatial stimulus structure: on average, right-profile stimuli are closer to other right-profile stimuli than to left-profile stimuli.

Figure 7: Spatiotemporal topography from seminatural spatiotemporal stimuli. The application of seminatural spatiotemporal stimuli to the TOM algorithm yields a topographic structure that incorporates aspects of temporal and spatial topology. (A) Final weight vectors correspond to views averaged over people. (B, C) Final topographic coding of averaged and single views, respectively.

4 Discussion

The biological relevance of our numerical simulations is stressed in section 4.1. Promising future numerical experiments and model developments are discussed in section 4.2.
4.1 Biological Significance. The spatiotemporal correlations implemented in section 3.1 by equation 3.4 as a linear increase of speed along the sensory layer are motivated by the visual flux resulting mainly from self-motion and the projection geometry of objects onto the retinas: average stimulus velocities are small in the fovea and increase monotonically toward the periphery (Gibson, 1950; Lappe & Rauschecker, 1995). The position-dependent speed of stimuli yields a nonlinear topographic mapping between stimuli and neurons. The logarithmic structure shown in Figure 3B (right) is analogous to the retinotopy in the primary visual cortex of monkeys (Schwartz, 1980; Gattass, Sousa, & Rosa, 1987; Fritsches & Rosa, 1996) and also of humans (DeYoe et al., 1996; Tootell et al., 1998). The analogy comprises two features. First, neural magnification is high in the fovea, where stimulus speed is small, and low in the periphery, where stimulus speed is high. Second, RF size varies the other way around: foveal neurons possess small RFs, whereas peripheral neurons possess large RFs. The latter feature is not only experimentally observed in biological systems; it is also reasonable from a functional perspective: neurons representing regions with high stimulus speed must have large RFs in order to capture information about the spatial structure of stimuli. Facilitation of topography learning was demonstrated in section 3.2. For example, this phenomenon could occur in the self-organization of topography in early sensory cortices. Stimuli along sensory surfaces typically extend over many neighboring sensors, activating local regimes of sensors (Szepesvari, Balazs, & Lőrincz, 1994). The degree of overlap of these regimes determines the spatial nearness of stimuli. Distinctly and systematically varying spatial Euclidean distances result. Cortical maps that express the order of sensors can be learned solely on the basis of such spatial stimuli (Willshaw & von der Malsburg, 1976; Szepesvari et al., 1994).
However, as natural stimuli develop continuously, map formation in early sensory cortices can additionally profit from temporal stimulus relations. This was shown by example in Figure 4: temporal stimulus distances may facilitate the learning process, similar to the facilitation observed in the numerical experiment. In addition, they may deform the resulting topographic maps, similar to the nonlinear map of Figure 3B (right). The toy example of section 3.3 is biologically significant because it stresses the possible role of temporal stimulus relations in topographic stimulus representation. Based on generic stimuli, the numerical experiments show that elementary, biologically plausible mechanisms, like Hebbian learning and neural dynamics from local interactions, are sufficient to transfer temporal stimulus relations into neural topography. Accordingly, the TOM algorithm extends self-organization from its dependence on purely spatial stimuli to its more natural dependence on spatiotemporal stimuli. The numerical experiment presented in section 3.4 shows how the TOM algorithm is applicable to natural stimuli (e.g., high-dimensional visual stimuli). The obtained topographic representation of faces is analogous
The Time-Organized Map Algorithm
to the representation found by Wang, Tanaka, and Tanifuji (1996) in the inferotemporal cortex of monkeys. The results fit well with the more general hypothesis that complex three-dimensional objects are learned and represented as collections of two-dimensional views (Bülthoff, Edelman, & Tarr, 1995; Logothetis & Pauls, 1995; Wallis & Bülthoff, 1999). The TOM algorithm allows these approaches to link to the general framework of spatiotemporal self-organization. Note that the image preprocessing used is biologically plausible: edge detection is a well-known biological process performed in the retina (Hubel, 1995), and centering of images is performed within the “what path” of visual stimulus processing, yielding neural selectivities in higher visual cortex that are invariant under spatial stimulus shifts (Tanaka, 1997; Ullman, 2000).

It has been speculated that many more cortical topographies exist in addition to those that are already known, which are mainly limited to early cortical areas (Ritter, 1990; Kohonen & Hari, 1999). The TOM algorithm offers a new perspective for the search for topographies, stressing the importance of temporal stimulus relations. It is straightforward to predict topographies in those cortical areas that process stimuli with systematic temporal relations. For example, in the primary motor cortex, topographic structure has been found only on a larger spatial scale (of about 1 cm) than the topographic structure in the primary somatosensory cortex (spatial scale of about 1 mm). The primary motor cortex may be topographically structured on the same spatial scale as the primary somatosensory cortex, preserving the temporal topology of muscle activities in movements. Finally, the TOM approach may prove valuable in the context of the learning of place cells in rats. These are nerve cells found in the hippocampus representing places of the animal's explored environment (Gothard, Skaggs, & McNaughton, 1996).
The application of the TOM algorithm is reasonable here because temporal relations between represented places are given by typical traveling times of the rat. However, there is currently no evidence for a topographic arrangement of place cells (McNaughton, personal communication, 2000). Accordingly, the TOM algorithm can be applied only in its nontopographic interpretation, stressing that neural waves propagate along neural connections. If the neural connections are not locally and randomly distributed, the resulting space of wave propagation does not correspond to anatomical positions in the brain. This may be the case in rat hippocampus.

The TOM algorithm stems from an earlier model developed in order to explain findings of neurobiological learning experiments (see Wiemer, Spengler, Joublin, Stagge, & Wacquant, 1998, and Wiemer, Spengler, et al., 2000). In these studies, the idea of two opposing time-dependent processes was introduced: integration as a reduction of representational distance and segregation as an increase of representational distance. Both processes were based on wave propagation and spatiotemporal interaction comparable to steps 2 and 3 of the TOM algorithm. In spite of the essential differences between the earlier model and the TOM algorithm that were pointed out in
J. Wiemer
sections 2.2 and 3.3.2, the experimental data simulated by the earlier model can also be explained by the TOM algorithm. The application of the TOM algorithm to the experimental data and the interpretation of its numerical results are straightforward and analogous. The experimental verification of the newly predicted features could be a significant step toward understanding how functionally meaningful cortical topographies, which may establish the basis for powerful biological information processing, emerge by self-organization.

4.2 Outlook. The TOM algorithm was applied to three artificial generic sets and one seminatural set of spatiotemporal stimuli. This allowed us to show typical learning behaviors of the algorithm that depend systematically on the strength of interaction between stimuli. It is now straightforward to apply the TOM algorithm to entirely natural sets of stimuli:

• Natural spatiotemporal stimuli. TOM learning could be applied to more complex natural stimuli that are not restricted to spatial structure, thereby generalizing the term natural stimulus as used in Andres, Mallot, and Giefing (1992), Olshausen and Field (1996), Mayer, Herrmann, Bauer, and Geisel (1998), and Wiemer, Burwick, and von Seelen (2000), and that are not seminatural in the sense of section 3.4. Instead, natural temporal sequences of spatial activity patterns could be used. Such sequences could, for example, be obtained from recordings of sounds, visual scenes, or movement patterns in accordance with typical animal behavior (Einhäuser, Kayser, König, & Körding, 2002).

• Spatiotemporal stimuli with noisy temporal stimulus relations. The effect of noise in temporal stimulus relations on the self-organization process could be analyzed. On the one hand, only exact temporal stimulus distances were calculated and applied in the above numerical experiments; that is, no noise was applied to time-to-space relations.
On the other hand, noise was explicitly added as a necessary separate step of the algorithm, in the form of neural shifts. To some extent, the two different sources of noise are interchangeable.

It is essential to continue to relate the resulting topographic maps to biological findings. Further numerical and biological experiments are required to answer the question of whether the concept of time-to-space transformations is applied by the brain in order to establish dynamic topographic stimulus representations. Concerning the further development of the TOM algorithm, two issues constitute the main challenges for future work. First is the flexible linking of timescales. The timescales of stimulus dynamics and neural dynamics should be less rigidly linked to each other. In the simulations, neural wave velocity was a critical system parameter linking temporal stimulus distances with spatial representational distances. The resulting neural representations
correspond to the processing of stimuli at one fixed timescale. This is functionally reasonable in cases where canonic timescales exist, as in the average visual flux (Wiemer, 2001). In other cases, such as the representation of objects in higher visual cortex, the timescales of cortical dynamics and stimulus dynamics should be less rigidly linked because of the variability of object speed: neural wave velocity has to be context dependent. Such velocity modulations could be achieved by inputs from cortical areas dealing with stimulus context or motion perception, such as the prefrontal cortex (Asaad & Miller, 2000) or the medial temporal and medial superior temporal cortex (Orban et al., 1992), respectively.

The second challenge is the functional integration of processes. It is reasonable to incorporate time-dependent feedforward integration in the TOM algorithm. In biological systems, this may be realized by Hebbian learning and additional dynamics, such as delay dynamics of gating neurons (Becker, 1999) or sensory decay dynamics (Wiemer et al., 1998; Wiemer, Spengler, et al., 2000), or more directly by a learning rule that superimposes past stimuli (Edelman & Weinshall, 1991; Földiák, 1991; Wallis & Bülthoff, 1999). Thereby, stimuli may additionally be associated according to their temporal distances. The two processes, active extrapolating time-to-space transformation and passive interpolating superimposition of stimuli, can be combined to facilitate elaborate stimulus processing in biological as well as artificial systems.

5 Conclusion

We have presented a new algorithm, denoted the time-organized map (TOM), for the self-organization of topography in neural networks based on spatiotemporal signals. The algorithm transfers temporal signal distances into spatial signal distances in topographic neural representations and establishes signal representatives in the form of averaged neural weight vectors.
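The core time-to-space transfer can be sketched in a few lines (a deliberately reduced illustration with assumed values, not the full TOM update with wave propagation, spatiotemporal interaction, and noisy neural shifts): consecutive stimuli are anchored at map coordinates advanced by wave velocity times the inter-stimulus interval, and each fixed map unit then represents the average of the stimuli anchored nearest to it.

```python
import numpy as np

# Reduced sketch of the time-to-space transfer (illustrative values; the
# published TOM algorithm additionally uses wave propagation, spatiotemporal
# interaction, and noisy neural shifts).
rng = np.random.default_rng(0)
stimuli = rng.uniform(0, 1, size=8)                    # stimulus values
dts = np.array([1.0, 2.0, 0.5, 1.0, 3.0, 0.5, 1.0])   # inter-stimulus times
wave_velocity = 2.0

# temporal distances become spatial representational distances
coords = np.concatenate([[0.0], np.cumsum(wave_velocity * dts)])
assert np.allclose(np.diff(coords), wave_velocity * dts)

# signal representatives as averaged weights of fixed map units
units = np.linspace(0.0, coords[-1], 12)
winner = np.array([np.argmin(np.abs(units - c)) for c in coords])
weights = np.array([stimuli[winner == k].mean() if np.any(winner == k)
                    else np.nan for k in range(len(units))])
```

Stimuli separated by long intervals thus land on distant units, while rapid successions crowd onto neighboring units, which is the deformation discussed above for the retinotopic example.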
The time-to-space transfer is functionally reasonable if the relatedness of signals can be deduced from their closeness in time, for example, due to underlying physical processes transferring one signal into the next. In these cases, the TOM algorithm can be applied in order to generate useful topographic signal representations. The learning capacity of the algorithm was illustrated by four examples of self-organization based on different sets of spatiotemporal stimuli. The TOM algorithm can serve as (1) a model for the interpretation and prediction of experimental data concerning topographies in biological neural networks and (2) a tool for the analysis and visualization of spatiotemporal signals.

We have developed the TOM algorithm on biological grounds. After simulating reorganizations experimentally observed in primary somatosensory cortex (Wiemer, Spengler, et al., 2000), we further developed the approach to generate the approximately logarithmic retinotopic mapping of
1168
J. Wiemer
primate primary visual cortex (see Figure 3) and to generate the topographic structure of face representation in inferotemporal cortex (see Figure 7). The TOM algorithm generalizes our perspective on cortical self-organization, pointing toward:

• Universal self-organization. In accordance with cortex universality, a unified self-organization process can generate cortical topographies that express spatial, temporal, and mixed spatiotemporal stimulus topologies. The strength with which current and former stimuli interact in the network determines which stimulus topology is transferred into topography.

• Active stimulus representation. Cortical topographies do not just passively receive and represent stimuli. They apply dynamics to embed stimuli in their temporal context in order to predict future stimuli. This can improve the speed and success rate of stimulus recognition.

• Relevance of temporal topology. We predict time-organized topographies in cortical areas representing signals with systematic temporal relations, such as views of three-dimensional objects in higher visual cortex and elementary movements in primary motor cortex.

Acknowledgments

I thank Stefan Schneider, Christian Igel, and Mark Toussaint for valuable comments on the manuscript, Thomas Burwick for fruitful initial discussions, and Thomas Bücher for technical support on the CCD camera system. I also thank the reviewers for their constructive criticism. The work was supported by grant DFG SE 251/41-1.

References

Andres, M., Mallot, H. A., & Giefing, G. J. (1992). Self-organization of binocular receptive fields. In I. Alexander (Ed.), Artificial Neural Networks II, Proceedings of ICANN 92. Amsterdam: Elsevier Science.
Asaad, W. F., Rainer, G., & Miller, E. K. (2000). Task-specific neural activity in the primate prefrontal cortex. J. Neurophysiol., 84(1), 451–459.
Becker, S. (1999). Implicit learning in 3D object recognition: The importance of temporal context. Neural Comput., 11(2), 347–374.
Bringuier, V., Chavane, F., Glaeser, L., & Frégnac, Y. (1999). Horizontal propagation of visual activity in the synaptic integration field of area 17 neurons. Science, 283, 695–699.
Bülthoff, H., Edelman, S., & Tarr, M. (1995). How are three-dimensional objects represented in the brain? Cereb. Cortex, 5, 247–260.
Buonomano, D., & Merzenich, M. (1998). Cortical plasticity: From synapses to maps. Annu. Rev. Neurosci., 21, 149–186.
Chappell, G., & Taylor, J. (1993). The temporal Kohonen map. Neural Networks, 6, 441–445.
DeYoe, E., Carman, G., Bandettini, P., Glickman, S., Wieser, J., Cox, R., Miller, D., & Neitz, J. (1996). Mapping striate and extrastriate visual areas in human cerebral cortex. Proc. Natl. Acad. Sci. USA, 93, 2382–2386.
Edelman, S., & Weinshall, D. (1991). A self-organizing multiple-view representation of 3D objects. Biol. Cybern., 64, 209–219.
Einhäuser, W., Kayser, C., König, P., & Körding, K. (2002). Learning the invariance properties of complex cells from their responses to natural stimuli. Eur. J. Neurosci., 15(3), 475–486.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Comput., 3, 194–200.
Fritsches, K., & Rosa, M. (1996). Visuotopic organisation of striate cortex in the marmoset monkey (Callithrix jacchus). J. Comp. Neurol., 372, 264–282.
Garaschuk, O., Hanse, E., & Konnerth, A. (1998). Developmental profile and synaptic origin of early network oscillations in the CA1 region of rat neonatal hippocampus. J. Physiol., 507, 219–236.
Gattass, R., Sousa, A., & Rosa, M. (1987). Visual topography of V1 in the cebus monkey. J. Comp. Neurol., 259, 529–548.
Georgopoulos, A., Schwartz, A., & Kettner, R. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Gibson, J. (1950). The perception of the visual world. Boston: Houghton Mifflin.
Gothard, K., Skaggs, W., & McNaughton, B. (1996). Dynamics of mismatch correction in the hippocampal ensemble code for space: Interaction between path integration and environmental cues. J. Neurosci., 16, 8027–8040.
Horio, K., & Yamakawa, T. (2001). Feedback self-organizing map and its application to spatio-temporal pattern classification. Int. J. Computational Intelligence and Applications, 1, 1–18.
Hubel, D. (1995). Eye, brain, and vision. New York: Freeman.
Kangas, J. (1990). Time-delayed self-organizing maps. In Proceedings of the International Joint Conference on Neural Networks (pp. 331–336). San Diego, CA.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern., 43, 59–69.
Kohonen, T. (1997). Self-organizing maps (2nd ed.). New York: Springer-Verlag.
Kohonen, T., & Hari, R. (1999). Where the abstract feature maps of the brain might come from. Trends Neurosci., 22, 135–139.
Koskela, T., Varsta, M., Heikkonen, J., & Kaski, K. (1998). Time series prediction using recurrent SOM with local linear models. Int. J. Knowledge-Based Intelligent Engineering Systems, 2, 60–68.
Krekelberg, B., & Taylor, J. (1996). Nitric oxide in cortical map formation. J. Chem. Neuroanat., 10, 191–196.
Lappe, M., & Rauschecker, J. (1995). Motion anisotropies and heading detection. Biol. Cybern., 72, 261–277.
Lo, Z., & Bavarian, B. (1991). On the rate of convergence in topology preserving neural networks. Biol. Cybern., 65, 55–63.
Logothetis, N., & Pauls, J. (1995). Psychophysical and physiological evidence for viewer-centered object representations in the primate. Cereb. Cortex, 5, 270–288.
Mallot, H. (1985). An overall description of retinotopic mapping in the cat's visual cortex areas 17, 18, and 19. Biol. Cybern., 52, 45–51.
Martinetz, T., & Schulten, K. (1993). Topology representing networks. Neural Networks, 7, 507–522.
Mayer, N., Herrmann, M., Bauer, H.-U., & Geisel, T. (1998). A cortical interpretation of ASSOMs. In ICANN'98, Proceedings (pp. 961–966). New York: Springer-Verlag.
Obermayer, K., Blasdel, G., & Schulten, K. (1992). Statistical-mechanical analysis of self-organization and pattern formation during the development of visual maps. Phys. Rev. A, 45, 7568–7589.
Olshausen, B., & Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Orban, G., Lagae, L., Verri, A., Raiguel, S., Xiao, D., Maes, H., & Torre, V. (1992). First-order analysis of optical flow in monkey brain. Proc. Natl. Acad. Sci. USA, 89, 2595–2599.
Prechtl, J., Cohen, L., Pesaran, B., Mitra, P., & Kleinfeld, D. (1997). Visual stimuli induce waves of electrical activity in turtle cortex. Proc. Natl. Acad. Sci. USA, 94, 7621–7626.
Riesenhuber, M., Bauer, H.-U., & Geisel, T. (1996). Analyzing phase transitions in high-dimensional self-organizing maps. Biol. Cybern., 75, 397–407.
Ritter, H. (1990). Self-organizing maps for internal representations. Psychol. Res., 52, 128–136.
Roland, P., & Zilles, K. (1998). Structural divisions and functional fields in the human cerebral cortex. Brain Res. Brain Res. Rev., 26, 87–105.
Sachs, L. (1999). Angewandte Statistik: Anwendung statistischer Methoden. Berlin: Springer-Verlag.
Schreiner, E. (1995). Order and disorder in auditory cortical maps. Curr. Opin. Neurobiol., 5(4), 489–496.
Schwartz, E. (1980). Computational anatomy and functional architecture of striate cortex. Vision Res., 20, 645–669.
Seidemann, E., Arieli, A., Grinvald, A., & Slovin, H. (2002). Dynamics of depolarization and hyperpolarization in the frontal cortex and saccade goal. Science, 295, 862–865.
Spengler, F., Hilger, T., & Wiemer, J. (2001). Cortical plasticity underlying tactile stimulus learning (Tech. Rep. 2001-01). Ruhr-Universität Bochum. Available on-line: http://www.neuroinformatik.ruhr-uni-bochum.de/ALL/PUBLICATIONS/IRINI/irinis2001 d.html.
Szepesvári, C., Balázs, L., & Lőrincz, A. (1994). Topology learning solved by extended objects: A neural network model. Neural Comput., 6, 441–458.
Tanaka, K. (1997). Mechanisms of visual object recognition: Monkey and human studies. Curr. Opin. Neurobiol., 7, 523–529.
Tanifuji, M., Sugiyama, T., & Murase, K. (1994). Horizontal propagation of excitation in rat visual cortical slices revealed by optical imaging. Science, 266, 1057–1059.
Taylor, C. (1993). Na⁺ currents that fail to inactivate. Trends Neurosci., 16, 455–460.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Tootell, R., Hadjikhani, N., Vanduffel, W., Liu, A., Mendola, J., Sereno, M., & Dale, A. (1998). Functional analysis of primary visual cortex (V1) in humans. Proc. Natl. Acad. Sci., 95, 818–824.
Ullman, S. (2000). High-level vision: Object recognition and visual cognition. Cambridge, MA: MIT Press.
Voegtlin, T. (2000). Context quantization and contextual self-organizing maps. In Proceedings of the IJCNN'2000 (Vol. 6, pp. 20–25).
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Biol. Cybern., 14, 85–100.
Wallis, G., & Bülthoff, H. (1999). Learning to recognize objects. Trends Cogn. Sci., 3, 22–31.
Wang, G., Tanaka, K., & Tanifuji, M. (1996). Optical imaging of functional organization in the monkey inferotemporal cortex. Science, 272, 1665–1668.
Wang, X., Merzenich, M., Sameshima, K., & Jenkins, W. (1995). Remodelling of hand representation in adult cortex determined by timing of tactile stimulation. Nature, 378, 71–75.
Wiemer, J. (2001). Learning topography in neural networks: Towards a better understanding of cortical topography. Aachen: Shaker-Verlag. Available on-line: http://www-brs.ub.ruhr-uni-bochum.de/netahtml/HSS/Diss/WiemerJanC/.
Wiemer, J., Burwick, T., & von Seelen, W. (2000). Self-organizing maps for visual feature representation based on natural binocular stimuli. Biol. Cybern., 82, 97–110.
Wiemer, J., Spengler, F., Joublin, F., Stagge, P., & Wacquant, S. (1998). A model of cortical plasticity: Integration and segregation based on temporal input patterns. In ICANN'98, Proceedings (pp. 367–372). New York: Springer-Verlag.
Wiemer, J., Spengler, F., Joublin, F., Stagge, P., & Wacquant, S. (2000). Learning cortical topography from spatiotemporal stimuli. Biol. Cybern., 82, 173–187.
Wiemer, J., & von Seelen, W. (2002). Topography from time-to-space transformations. Neurocomputing, 44–46, 1017–1022. Willshaw, D., & von der Malsburg, C. (1976). How patterned neural connections can be set up by self-organization. Proc. R. Soc. Lond. B Biol. Sci., 194, 431–445. Received January 25, 2002; accepted November 25, 2002.
LETTER
Communicated by Michael Cohen
New Conditions on Global Stability of Cohen-Grossberg Neural Networks

Wenlian Lu
[email protected]

Tianping Chen*
[email protected]

Laboratory of Nonlinear Science, Institute of Mathematics, Fudan University, Shanghai 200433, P.R. China
In this letter, we discuss the dynamics of Cohen-Grossberg neural networks. We provide a new and relaxed set of sufficient conditions for the Cohen-Grossberg networks to be absolutely globally stable and globally exponentially stable. We also provide an estimate of the rate of convergence.

1 Introduction

Research on recurrently connected neural networks (RCNN) is an important topic in neural network theory. Among these networks, the Cohen-Grossberg neural networks were first proposed in the pioneering work of Cohen and Grossberg (1983). They can be modeled as follows:

\[
\frac{dx_i}{dt} = -a_i(x_i)\left[ b_i(x_i) - \sum_{j=1}^{n} t_{ij}\, g_j(x_j(t)) + J_i \right], \qquad i = 1, \dots, n. \tag{1.1}
\]
This type of network has been widely studied in recent years and has found applications in many areas. The success of many of these applications relies heavily on understanding the underlying dynamic behavior of these networks; a thorough analysis of this dynamics is a necessary step toward a practical design of RCNN networks. We point out here that the Cohen-Grossberg neural networks include the Hopfield neural networks (Hopfield, 1984; Hopfield & Tank, 1986) as a special case. The latter can be described as

\[
\frac{dx_i}{dt} = -d_i x_i + \sum_{j=1}^{n} t_{ij}\, g_j(x_j(t)) + I_i, \qquad i = 1, \dots, n. \tag{1.2}
\]

* Corresponding author.
Neural Computation 15, 1173–1189 (2003) © 2003 Massachusetts Institute of Technology
Research on the dynamic behaviors of RCNN networks dates back to the early days of neural network science. For example, multistable and oscillatory behaviors were studied by Amari (1971, 1972) and Wilson and Cowan (1972). Chaotic behaviors were investigated by Sompolinsky and Crisanti (1988). Hopfield (1984) and Hopfield and Tank (1986) looked into the dynamic stability of symmetrically connected networks and showed their practical applicability to optimization problems. It should also be noted that Cohen and Grossberg (1983) presented more rigorous analytical results on the global stability of RCNN networks.

The global stability of symmetrically connected networks is relatively straightforward to analyze, and by now most of the results have been well established. The local and global stability of asymmetrically connected networks has also been studied. For example, Michel, Farrel, and Porod (1989) and Michel and Gray (1990), as well as Yang and Dillon (1994), obtained sufficient conditions for local exponential stability and for the existence and uniqueness of an equilibrium point. However, they did not address the issue of global stability of the networks. In practice, the topic of global stability is of more importance than that of local stability, as pointed out by Hirsch (1989), who also gave some sufficient conditions for global stability. Kelly (1990) applied contraction mapping theory to obtain some sufficient conditions for global stability. Matsuoka (1992) generalized some of Hirsch's and Kelly's results by use of a new Lyapunov function. Kaszkurewicz and Bhaya (1994) proved that the diagonal stability of the interconnection matrix implies the existence and uniqueness of an equilibrium point and the global stability at the equilibrium. Forti, Manetti, and Marini (1992), Forti (1994), and Forti and Tesi (1995) proposed three main features a neural network should possess.
Among them, the important one is that neural networks should be absolutely stable. They also pointed out that the negative semidefiniteness of the interconnection matrix guarantees the global stability of the Hopfield networks, and they proved absolute global stability for a class G of activation functions. Fang and Kincaid (1996) applied matrix measure theory to get some sufficient conditions for global and local stability. Joy (1999, 2000) discussed the stability of delayed neural networks by using Lyapunov functions and Lyapunov diagonally stable (LDS) conditions on the matrix. Chen and Amari (2001a, 2001b) introduced a new approach to address the global stability of Hopfield neural networks. Chen, Lu, and Amari (2002) proved exponential stability under less restrictive conditions and also obtained an accurate estimate of the rate of convergence.

The analysis of the dynamical behavior of the Cohen-Grossberg RCNN networks was discussed first by Cohen and Grossberg (1983) and later studied further by Grossberg (1988), in which some global limit property with a symmetric connection matrix T = (t_{ij})_{i,j=1}^n was given. Lin and Zou (2002) presented some sufficient conditions and the convergence rate for exponential stability of the Cohen-Grossberg networks with asymmetric connection
matrices. However, the signs of the entries of the connection matrix T were neglected in the conditions of the theorems given in their article, and as a result the difference between excitatory and inhibitory effects might have been ignored. In addition, the activation functions were restricted to bounded functions. The purpose of this letter is to generalize all the aforementioned results to the Cohen-Grossberg networks, requiring neither the boundedness condition on the activation functions nor the positive lower boundedness condition on the amplification functions.

This letter is organized as follows. In section 2, we give a model description and some prerequisite results. In section 3, we discuss the existence and uniqueness of the equilibrium point of the Cohen-Grossberg networks. In section 4, we obtain the conditions for the global absolute stability of the Cohen-Grossberg networks. In section 5, we provide the convergence rate. We conclude in section 6.

2 Model Description and Some Preliminaries

In this article, we investigate the Cohen-Grossberg neural network model, which can be described by the following system of ordinary differential equations (ODEs):

\[
\frac{dx_i}{dt} = -a_i(x_i)\left[ b_i(x_i) - \sum_{j=1}^{n} t_{ij}\, g_j(x_j(t)) + J_i \right], \qquad i = 1, \dots, n, \tag{2.1}
\]
where the interconnection matrix (t_{ij})_{i,j=1}^n is not required to be symmetric. Throughout, we assume that the gain functions a_i(s), the activation functions b_i(s), and the signal functions g_i(s), i = 1, …, n, are locally Lipschitzian. Thus, the trajectory of system 2.1 exists and is unique for any given initial values.

Definition 1. Class Ḡ of functions: Let G = diag(G_1, G_2, …, G_n), where G_i > 0, i = 1, 2, …, n. Then g(x) = (g_1(x_1), g_2(x_2), …, g_n(x_n))^T is said to belong to Ḡ if the functions g_i(s) satisfy 0 ≤ [g_i(ξ) − g_i(η)]/(ξ − η) ≤ G_i for all ξ ≠ η, i = 1, 2, …, n.

Definition 2. Class D̄ of functions: Let D = diag(D_1, D_2, …, D_n), where D_i > 0, i = 1, 2, …, n. Then b(x) = (b_1(x_1), b_2(x_2), …, b_n(x_n))^T is said to belong to D̄ if the functions b_i(s) satisfy [b_i(ξ) − b_i(η)]/(ξ − η) ≥ D_i for all ξ ≠ η, i = 1, 2, …, n.

Definition 3. The Cohen-Grossberg neural network, equation 2.1, is said to be absolutely globally asymptotically (exponentially) stable if any solution of the equation converges to an equilibrium point (exponentially) as t goes to +∞, for any activation functions in Ḡ and any appropriate functions in D̄.
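A short simulation makes the setup concrete. The functions below are our own example choices satisfying the stated assumptions, not functions from the letter: a_i(s) = 1 + 0.5/(1 + s²) > 0, b_i(s) = 2s (class D̄ with D_i = 2), and g_i(s) = tanh(s) (class Ḡ with G_i = 1 and g_i(0) = 0), with an asymmetric T.

```python
import numpy as np

# Example instance of system 2.1 (functions are illustrative choices):
# a_i(s) = 1 + 0.5/(1+s^2) > 0,  b_i(s) = 2s (class D-bar, D_i = 2),
# g_i(s) = tanh(s) (class G-bar, G_i = 1, g_i(0) = 0).
n = 3
T = np.array([[ 0.5, -0.3,  0.2],
              [ 0.1,  0.4, -0.2],
              [-0.3,  0.2,  0.3]])      # asymmetric interconnection matrix
J = np.array([0.1, -0.2, 0.05])

def a(x): return 1.0 + 0.5 / (1.0 + x**2)
def b(x): return 2.0 * x
g = np.tanh

# forward-Euler integration of dx_i/dt = -a_i(x_i)[b_i(x_i) - (T g(x))_i + J_i]
x = np.array([1.0, -1.0, 0.5])
h = 1e-3
for _ in range(20000):
    x = x + h * (-a(x) * (b(x) - T @ g(x) + J))

# at an equilibrium the bracket vanishes: b(x*) - T g(x*) + J = 0
assert np.max(np.abs(b(x) - T @ g(x) + J)) < 1e-6
```

For this T, DG⁻¹ − T = 2I − T is Lyapunov diagonally stable (with P = I), so the convergence seen here is the behavior the later theorems guarantee.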
Definition 4 (Hershkowitz, 1992). A matrix C ∈ R^{n×n} is said to be Lyapunov diagonally stable (LDS) if there exists a diagonal matrix P = diag{P_1, P_2, …, P_n}, P_i > 0, i = 1, 2, …, n, such that (PC)^s = [PC + (PC)^T]/2 is positive definite.

Definition 5. An n × n real symmetric matrix C is said to satisfy C > 0 if C is positive definite.

We denote ‖x‖_2 = (Σ_{i=1}^n x_i²)^{1/2}, g(x) = (g_1(x_1), g_2(x_2), …, g_n(x_n))^T, b(x) = (b_1(x_1), b_2(x_2), …, b_n(x_n))^T, A(x) = diag{a_1(x_1), a_2(x_2), …, a_n(x_n)}, and T = (t_{ij})_{n×n}. Moreover, by adjusting J_i, i = 1, …, n, we may always assume b_i(0) = 0 and g_j(0) = 0 for i, j = 1, …, n.

3 Existence and Uniqueness of the Equilibrium Point

In this section, we prove the existence and uniqueness of the equilibrium point of the Cohen-Grossberg networks, equation 2.1, under very relaxed conditions. Before the proof, let us recall the following lemmas, given in Forti and Tesi (1995):

Lemma 1. A continuous map H(x): R^n → R^n is a homeomorphism if:

1. H(x) is injective.
2. lim_{‖x‖→∞} ‖H(x)‖ = ∞.
Lemma 2. If D and G are positive definite diagonal matrices, T is an n × n matrix, and DG^{-1} − T ∈ LDS, then for each n × n diagonal matrix K such that 0 ≤ K ≤ G, we have

\[
\det(-TK + D) \ne 0. \tag{3.1}
\]
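The LDS condition of Definition 4 is easy to verify numerically once a candidate diagonal P is given (finding a valid P in general is a separate problem; the matrices below are illustrative examples of ours, not taken from the letter):

```python
import numpy as np

# Check whether a given diagonal P witnesses C in LDS, i.e. whether
# (PC)^s = [PC + (PC)^T] / 2 is positive definite (Definition 4).
def is_lds_witness(C, P_diag):
    PC = np.diag(P_diag) @ C
    sym = (PC + PC.T) / 2.0
    return bool(np.all(np.linalg.eigvalsh(sym) > 0.0))

# illustrative data: D = 2I, G = I, and an asymmetric T
D = np.diag([2.0, 2.0])
G_inv = np.diag([1.0, 1.0])
T = np.array([[0.5, -1.0],
              [1.0,  0.5]])
C = D @ G_inv - T                     # DG^{-1} - T

assert is_lds_witness(C, np.array([1.0, 1.0]))   # P = I works here
```

Note that the skew-symmetric part of T drops out of (PC)^s when P = I, which is why strongly asymmetric couplings can still satisfy the condition.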
Theorem 1. Suppose that g ∈ Ḡ, b(x) ∈ D̄, a_i(s) > 0, and DG^{-1} − T ∈ LDS. Then the Cohen-Grossberg neural network, equation 2.1, has a unique equilibrium point.

Proof. Let H(x) = (H_1(x), H_2(x), …, H_n(x))^T : R^n → R^n, where

\[
H_i(x) = b_i(x_i) - \sum_{j=1}^{n} t_{ij}\, g_j(x_j) + J_i. \tag{3.2}
\]

To prove the existence and uniqueness of the equilibrium point, we need only prove that H(x) is a homeomorphism. It is straightforward to observe
that the homeomorphic property of H(x) is equivalent to that of its parallel translation H̄(x),

\[
\bar{H}_i(x) = b_i(x_i) - b_i(0) - \sum_{j=1}^{n} t_{ij}\,[\,g_j(x_j) - g_j(0)\,], \tag{3.3}
\]

being a homeomorphism. Therefore, without loss of generality, we assume that

\[
J_i = 0, \qquad \forall\, i = 1, \dots, n. \tag{3.4}
\]
First, let us prove that H(x) is injective. Otherwise, there would exist x ≠ y ∈ R^n such that H(x) − H(y) = 0. Thus,

\[
[\,b_i(x_i) - b_i(y_i)\,] - \sum_{j=1}^{n} t_{ij}\,[\,g_j(x_j) - g_j(y_j)\,] = 0, \tag{3.5}
\]

which can be rewritten as

\[
B(x - y) - TK(x - y) = (B - TK)(x - y) = 0, \tag{3.6}
\]

where B = diag{b̄_1, b̄_2, …, b̄_n},

\[
\bar{b}_i =
\begin{cases}
D_i, & x_i = y_i, \\[4pt]
\dfrac{b_i(x_i) - b_i(y_i)}{x_i - y_i}, & x_i \ne y_i,
\end{cases} \tag{3.7}
\]
and K = diag{K_1, K_2, …, K_n}, where

\[
K_i =
\begin{cases}
0, & x_i = y_i, \\[4pt]
\dfrac{g_i(x_i) - g_i(y_i)}{x_i - y_i}, & x_i \ne y_i.
\end{cases}
\]

Since g ∈ Ḡ, b(x) ∈ D̄, and DG^{-1} − T ∈ LDS, we have B ≥ D and 0 ≤ K ≤ G. By Lemma 2, we have BG^{-1} − T ∈ LDS and det(TK − B) ≠ 0. Hence x = y, which implies that H(x) is injective.

Second, let us prove that

\[
\lim_{\|x\| \to \infty} \|H(x)\| = \infty, \tag{3.8}
\]
where ‖x‖ = ‖x‖_2. Otherwise, there would exist a sequence {x^p}_{p=1}^∞ such that lim_{p→∞} ‖x^p‖ = ∞ and, for all p,

\[
\|H(x^p)\| \le M, \tag{3.9}
\]
where 0 < M < +∞ is a constant. Therefore, there exists a fixed i_0 such that |x_{i_0}^p| → ∞ as p → ∞ and ‖H_{i_0}(x^p)‖ ≤ M for all p. Since H_{i_0}(x^p) = b_{i_0}(x_{i_0}^p) − Σ_{j=1}^n t_{i_0 j} g_j(x_j^p) and b_{i_0}(x_{i_0}^p) → ∞, we have that |Σ_{j=1}^n t_{i_0 j} g_j(x_j^p)| is unbounded. Hence, there exists a subsequence {x^{p_k}}_{k=1}^∞ such that

\[
\|g(x^{p_k})\| \to \infty, \qquad k \to \infty. \tag{3.10}
\]

Since DG^{-1} − T ∈ LDS, we can find P = diag{P_1, P_2, …, P_n} > 0 and ε > 0 such that [P(DG^{-1} − T)]^s ≥ εI > 0. Therefore, we have

\[
g(x)^T P H(x) = \sum_{i=1}^{n} \Bigl( P_i\, g_i(x_i)\, b_i(x_i) - \sum_{j=1}^{n} P_i\, g_i(x_i)\, t_{ij}\, g_j(x_j) \Bigr)
\ge g(x)^T \bigl[P(DG^{-1} - T)\bigr]^s g(x) \ge \varepsilon \|g(x)\|_2^2. \tag{3.11}
\]
This implies

\[
\varepsilon \|g(x)\|_2^2 \le g(x)^T P H(x) \le \|g(x)\|_2\, \|P\|_2\, \|H(x)\|_2 \tag{3.12}
\]

and

\[
\|g(x^{p_k})\|_2 \le \frac{1}{\varepsilon}\, \|P\|_2\, \|H(x^{p_k})\|_2, \tag{3.13}
\]

which contradicts equations 3.9 and 3.10.

In summary, lim_{‖x‖→∞} ‖H(x)‖ = ∞, so H(x) must be a homeomorphism, and
the equilibrium point exists and is unique.

4 Global Asymptotic Absolute Stability

In this section, we address global asymptotic absolute stability. Before doing so, let us recall the following basic lemma, whose details can be found in Miller and Michel (1982). Consider the autonomous system of ODEs:

\[
\frac{dx}{dt} = f(x). \tag{4.1}
\]
Denition 6. w.x/: Rn ! R is dened to be radially unbounded if the following conditions are satised: 1. w.o/ D 0
2. w.x/ > 0 for x 2 Rn ¡ fog 3. w.x/ ! 1 as jxj ! 1,
where o D .0; 0; : : : ; 0/T .
Cohen-Grossberg Neural Networks
1179
Lemma 3 (Theorem 11.14 of Miller & Michel, 1982). Assume that there exists a continuous, differentiable, positive denite and radially unbounded function L: Rn ! R, such that: 1.
dL.x/ dt
· 0 for all x 2 Rn
2. The origin is the only invariant subset of the set » ¼ n dL.x/ D0 : ED x2R I dt Then the equilibrium x D o of equation 4.1 is asymptotically stable for Rn . Theorem 2. Z
1 0
N ai .s/ > 0 and Suppose b.s/ 2 D, g.s/ 2 G,
½ d½ D 1 ai .½/
i D 1; : : : ; n:
(4.2)
If DG¡1 ¡ T 2 LDS, then equation 2.1 has a unique equilibrium point x? , and lim kx.t/ ¡ x? k D 0;
t!1
(4.3)
that is, the Cohen-Grossberg neural network, equation 2.1, is absolutely globally asymptotically stable. Proof. Let x? be the unique equilibrium point of the networks 2.1 given in the previous section and P D diagfP1 ; P2 ; : : : ; Pn g be the diagonal matrix such that [P.DG¡1 ¡ T/]s > 0 and u D .u1 ; u2 ; : : : ; un /T
g? .u/ D .g?1 .u1 /; g?2 .u2 /; : : : ; g?n .un //T b? .u/ D .b?1 .u1 /; b?2 .u2 /; : : : ; b?n .un //T
(4.4)
where ui .t/ D xi .t/ ¡ x?i
g?i .s/ D gi .s C x?i / ¡ gi .x?i /
b?i .s/ D bi .s C x?i / ¡ bi .x?i /:
(4.5)
Then system 2.1 can be rewritten as 2 3 n X dui .t/ ¤ 4 ? ? D ¡ai .ui C xi / bi .ui / ¡ tij gj .uj /5 : dt jD1
(4.6)
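The unique equilibrium x* appearing in theorem 2 can be probed numerically. In the sketch below, every concrete choice (b_i(x) = D_i x, g_i = tanh so G = I, the values of D and T, and the damped iteration) is our own illustrative assumption, not a specification from the article; T is chosen so that DG⁻¹ − T is strictly diagonally dominant, a standard sufficient condition for the LDS property.

```python
import numpy as np

# Illustrative sketch (our own toy choices, not from the article):
# b_i(x) = D_i * x, g_i = tanh (so G = I), and T chosen so that
# D G^{-1} - T is strictly diagonally dominant.
rng = np.random.default_rng(0)
D = np.array([2.0, 2.5, 3.0])
T = np.array([[0.5, -0.3, 0.2],
              [0.1,  0.4, -0.2],
              [-0.2, 0.3, 0.5]])

def H(x):
    """H(x) = b(x) - T g(x); its zeros are the equilibria of network 2.1."""
    return D * x - T @ np.tanh(x)

def solve_equilibrium(x0, tol=1e-13, max_iter=10_000):
    """Damped iteration x <- x - 0.5 * H(x)/D; a contraction for this T."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        step = H(x) / D
        x = x - 0.5 * step
        if np.max(np.abs(step)) < tol:
            break
    return x

# Widely separated initial conditions all land on the same root,
# consistent with H being a homeomorphism (unique equilibrium x*).
roots = [solve_equilibrium(rng.uniform(-10, 10, size=3)) for _ in range(20)]
spread = max(np.linalg.norm(r - roots[0]) for r in roots)
print("max spread between recovered equilibria:", spread)
```

For this odd g and linear b the unique zero happens to sit at the origin; under the same dominance condition, adding a constant input shifts the root without breaking uniqueness.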
Let us define a Lyapunov function:

V₁(k, u) = Σ_{i=1}^n ∫₀^{u_i} ρ/a_i(ρ + x*_i) dρ + k Σ_{i=1}^n P_i ∫₀^{u_i} g*_i(ρ)/a_i(ρ + x*_i) dρ.   (4.7)

First,

V₁(k, u) ≥ Σ_{i=1}^n ∫₀^{u_i} ρ/a_i(ρ + x*_i) dρ.   (4.8)

Hence, by equation 4.2, V₁(k, u) is radially unbounded. Second, direct differentiation gives

dV₁(k, u)/dt = uᵀ[−(b*(u) − Tg*(u))] + k g*ᵀ(u) P [−b*(u) + Tg*(u)]   (4.9)
  ≤ −uᵀDu + uᵀTg*(u) + k g*ᵀ(u) [P(−DG⁻¹ + T)]^s g*(u)   (4.10)
  = −uᵀDu + uᵀTg*(u) − (1/4) g*ᵀ(u) TᵀD⁻¹T g*(u)   (4.11)
    + (1/4) g*ᵀ(u) TᵀD⁻¹T g*(u) + k g*ᵀ(u) [P(−DG⁻¹ + T)]^s g*(u).   (4.12)

Under the assumptions made in theorem 2, we observe that

g*_i(s) b*_i(s) ≥ g*_i(s) D_i s ≥ D_i G_i⁻¹ (g*_i(s))²,   (4.13)
s b*_i(s) ≥ D_i s².   (4.14)

As a result, we have

dV₁(k, u)/dt ≤ −{ D^{1/2}u − (1/2)D^{−1/2}Tg*(u) }ᵀ { D^{1/2}u − (1/2)D^{−1/2}Tg*(u) }
  + g*ᵀ(u) { (1/4) TᵀD⁻¹T + k [P(−DG⁻¹ + T)]^s } g*(u)
  < 0,   (4.15)

whenever k is sufficiently large and g*(u) ≠ 0. On the other hand, if g*(u) = 0 and u ≠ 0, then we have

dV₁(k, u)/dt ≤ −uᵀDu < 0.   (4.16)
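The decrease established in 4.9 through 4.16 can also be watched numerically. Taking a_i ≡ 1, b_i(x) = D_i x, and g_i = tanh (so that x* = 0 and g*_i = g_i), the function 4.7 reduces to V₁(k, u) = (1/2)‖u‖² + k Σ_i P_i log cosh(u_i). All numerical values in the sketch are our own illustrative assumptions:

```python
import numpy as np

# Sketch: with a_i = 1, b_i(x) = D_i x, g_i = tanh (x* = 0, g*_i = g_i),
# the Lyapunov function 4.7 reduces to
#   V1(k, u) = 0.5 * ||u||^2 + k * sum_i P_i * log(cosh(u_i)).
# Hypothetical parameters; P = I and k = 1 suffice for this diagonally
# dominant T.
D = np.array([2.0, 2.5, 3.0])
T = np.array([[0.5, -0.3, 0.2],
              [0.1,  0.4, -0.2],
              [-0.2, 0.3, 0.5]])
P = np.ones(3)
k = 1.0

def V1(u):
    return 0.5 * u @ u + k * np.sum(P * np.log(np.cosh(u)))

def trajectory(u0, dt=1e-3, steps=5000):
    """Euler integration of the shifted system 4.6, recording V1."""
    u, vals = np.array(u0, float), []
    for _ in range(steps):
        vals.append(V1(u))
        u = u + dt * (-(D * u - T @ np.tanh(u)))
    return np.array(vals)

vals = trajectory([2.0, -3.0, 1.5])
print("V1 monotonically decreasing:", bool(np.all(np.diff(vals) <= 1e-12)))
```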
In summary, if u ≠ 0 and k is sufficiently large, then dV₁(k, u)/dt < 0. By lemma 3, the equilibrium point x* of network 2.1 is globally asymptotically stable; note that the proof requires only a_i(s) > 0 and equation 4.2.

The following corollary is a direct extension of theorem 2 and the theory of LDS matrices and M-matrices.

Corollary 1. Suppose b(s) ∈ D, g(s) ∈ Ḡ, a_i(s) > 0, and equation 4.2 holds. If there exist constants ξ_i > 0, i = 1, …, n, such that any one of the following inequalities holds:

D_j ξ_j − G_j ( ξ_j t_{jj} + Σ_{i=1, i≠j}^n ξ_i |t_{ij}| ) > 0,  j = 1, …, n,   (4.20)

ξ_i (D_i − G_i t_{ii}) − Σ_{j=1, j≠i}^n ξ_j G_j |t_{ij}| > 0,  i = 1, …, n,   (4.21)

ξ_i (D_i − G_i t_{ii}) − (1/2) Σ_{j=1, j≠i}^n ( ξ_i G_i |t_{ij}| + ξ_j G_j |t_{ji}| ) > 0,  i = 1, …, n,   (4.22)

then there is a unique equilibrium point x* ∈ Rⁿ such that

lim_{t→∞} ‖x(t) − x*‖ = 0.   (4.23)

In fact, under each one of the previous three conditions, the matrix D − MG is an M-matrix, where M = (m_{ij})_{i,j=1}^n with entries

m_{ii} = t_{ii},  m_{ij} = |t_{ij}|,  i ≠ j.   (4.24)
By the property of M-matrices, DG⁻¹ − M is also an M-matrix (see the appendix). Therefore, there exists P = diag{P₁, P₂, …, Pₙ}, with P_i > 0 for i = 1, 2, …, n, such that

P_i (D_i G_i⁻¹ − t_{ii}) − (1/2) Σ_{j=1, j≠i}^n ( P_i |t_{ij}| + P_j |t_{ji}| ) > 0,  i = 1, 2, …, n.   (4.25)

By Gershgorin's theorem (see Golub & Van Loan, 1990), all the eigenvalues of {P[DG⁻¹ − T]}^s are positive, so {P[DG⁻¹ − T]}^s is positive definite. Corollary 1 is thus a direct consequence of theorem 2.

5 Further Results on Exponential Stability

In this section, we discuss exponential stability under the hypothesis that 0 < α̲ ≤ a_i(u), i = 1, …, n.

Theorem 3. Suppose b(s) ∈ D, g(s) ∈ Ḡ, 0 < α̲ ≤ a_i(u), and

∫₀^∞ ρ/a_i(ρ) dρ = ∞,  i = 1, …, n.   (5.1)

If there exists a constant α such that min_i D_i > α > 0 and (D − αI)G⁻¹ − T ∈ LDS, then the neural network 2.1 is absolutely globally asymptotically stable with exponential convergence rate α̲α/2. This means that for every trajectory x(t), there exists a unique equilibrium point x* and a constant M(x(0)) > 0 such that

‖x(t) − x*‖₂ ≤ M(x(0)) e^{−α̲αt/2}.   (5.2)

Moreover, if g_i(s), i = 1, …, n, are differentiable, then for every trajectory x(t), there exist constants t₀(g) > 0 and W(x(t₀), g) > 0, which depend on g, such that

‖x(t) − x*‖₂ ≤ W(x(t₀), g) e^{−α̲αt}  for t > t₀(g).   (5.3)
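Before turning to the proof, the bound 5.2 can be checked on a toy network. Everything concrete here is our own illustrative assumption: a_i(u) = 1 + 0.5 sin²u (so α̲ = 1 and a_i ≤ 3/2), b_i(x) = D_i x with min_i D_i = 2 > α = 1, g_i = tanh (G = I), and T scaled so that (D − αI)G⁻¹ − T is strictly diagonally dominant, a standard sufficient condition for the LDS property. The guaranteed decay rate is then α̲α/2 = 0.5.

```python
import numpy as np

# Hypothetical parameters chosen to satisfy theorem 3; the theorem then
# guarantees ||x(t) - x*||_2 <= M * exp(-0.5 * t) along every trajectory.
D = np.array([2.0, 2.5, 3.0])
T = 0.5 * np.array([[0.5, -0.3, 0.2],
                    [0.1,  0.4, -0.2],
                    [-0.2, 0.3, 0.5]])

def a(x):                        # amplification functions, bounded in [1, 1.5]
    return 1.0 + 0.5 * np.sin(x) ** 2

def simulate(x0, dt=1e-3, t_end=6.0):
    """Euler integration of dx/dt = -a(x) * (D x - T tanh(x)); x* = 0 here."""
    ts = np.arange(0.0, t_end, dt)
    x, norms = np.array(x0, float), []
    for _ in ts:
        norms.append(np.linalg.norm(x))
        x = x + dt * (-a(x) * (D * x - T @ np.tanh(x)))
    return ts, np.array(norms)

ts, norms = simulate([1.5, -2.0, 1.0])
# Empirical decay rate over [1, 5]; should exceed the guaranteed 0.5.
i1, i2 = np.searchsorted(ts, 1.0), np.searchsorted(ts, 5.0)
rate = (np.log(norms[i1]) - np.log(norms[i2])) / (ts[i2] - ts[i1])
print("measured decay rate:", rate, " guaranteed lower bound: 0.5")
```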
Proof. Let x* be the unique equilibrium point of network 2.1; let T, u, g*(u), and b*(u) be as defined before, and let

A*(u) = diag{ a₁(u₁ + x*₁), a₂(u₂ + x*₂), …, aₙ(uₙ + x*ₙ) }.

Then the system can be rewritten as

du(t)/dt = −A*(u) [ b*(u) − Tg*(u) ].   (5.4)

The condition of this theorem implies that we can find P = diag{P₁, P₂, …, Pₙ} > 0 and β = α + ε, ε > 0, such that {P[(D − βI)G⁻¹ − T]}^s > 0. To move further, let us define a Lyapunov function,

V(k, u) = (1/2) Σ_{i=1}^n u_i² + k Σ_{i=1}^n P_i ∫₀^{u_i} g*_i(ρ)/a_i(ρ + x*_i) dρ.   (5.5)

Direct differentiation gives

dV(k, u)/dt = uᵀ[−A*(u)(b*(u) − Tg*(u))] + k g*ᵀ(u) P [−b*(u) + Tg*(u)]
  ≤ −uᵀA*(u)(D − βI)u + uᵀA*(u)Tg*(u)
    + k g*ᵀ(u) {P[−(D − βI)G⁻¹ + T]}^s g*(u) − βuᵀA*(u)u − βk g*ᵀ(u)Pu
  = −uᵀA*(u)(D − βI)u + uᵀA*(u)Tg*(u)
    − (1/4) g*ᵀ(u) Tᵀ A*(u)(D − βI)⁻¹ T g*(u)
    + (1/4) g*ᵀ(u) Tᵀ A*(u)(D − βI)⁻¹ T g*(u)
    + k g*ᵀ(u) {P[−(D − βI)G⁻¹ + T]}^s g*(u) − βuᵀA*(u)u − βk g*ᵀ(u)Pu
  ≤ −{ (A*(u))^{1/2}(D − βI)^{1/2}u − (1/2)(A*(u))^{1/2}(D − βI)^{−1/2}Tg*(u) }ᵀ
     { (A*(u))^{1/2}(D − βI)^{1/2}u − (1/2)(A*(u))^{1/2}(D − βI)^{−1/2}Tg*(u) }
    + g*ᵀ(u) { (1/4) Tᵀ A*(u)(D − βI)⁻¹ T + k {P[−(D − βI)G⁻¹ + T]}^s } g*(u)
    − βuᵀA*(u)u − βk g*ᵀ(u)Pu.   (5.6)
By theorem 2, the system is globally stable; hence, along each trajectory, a_i(s) ≤ ᾱ, i = 1, …, n, for some constant ᾱ. This, combined with the assumption (D − αI)G⁻¹ − T ∈ LDS, leads to the fact that there exists a sufficiently large k > 0 such that, for all u ∈ Rⁿ,

(1/4) Tᵀ A*(u)(D − βI)⁻¹ T + k {P[−(D − βI)G⁻¹ + T]}^s < 0.   (5.7)

Subsequently,

dV(k, u)/dt < −βuᵀA*(u)u − βk g*ᵀ(u)Pu.   (5.8)

Because g_i is monotone, we have

g*_i(s)s ≥ ∫₀^s g*_i(ρ) dρ ≥ α̲ ∫₀^s g*_i(ρ)/a_i(ρ + x*_i) dρ,

which leads to

dV(k, u)/dt ≤ −βα̲ [ (1/2) uᵀu + k Σ_{i=1}^n P_i ∫₀^{u_i} g*_i(ρ)/a_i(ρ + x*_i) dρ ]
  ≤ −α̲α V(k, u).   (5.9)

Therefore,

(1/2) Σ_{i=1}^n u_i²(t) ≤ V(k, u) ≤ V(k, u(0)) e^{−α̲αt},   (5.10)

and there exists M(x(0)) > 0 such that

‖u(t)‖₂ ≤ M(x(0)) e^{−α̲αt/2}.

Moreover, because g_i(s) is differentiable, we have

g*_i(s) = g*_i′(0)s + o(|s|),
g*_i(s)s = g*_i′(0)s² + o(|s|²),
∫₀^s g*_i(ρ) dρ = (1/2) g*_i′(0)s² + o(|s|²),   (5.11)
which leads to

g*_i(s)s = 2 ∫₀^s g*_i(ρ) dρ + o(|s|²)
  ≥ 2α̲ ∫₀^s g*_i(ρ)/a_i(ρ + x*_i) dρ + o(|s|²)   (5.12)

for i = 1, …, n. It follows that there exists t₀ > 0 such that

dV(k, u)/dt ≤ −2βα̲ [ (1/2) uᵀu + k Σ_{i=1}^n P_i ∫₀^{u_i} g*_i(ρ)/a_i(ρ + x*_i) dρ ]
    − εα̲ uᵀu + o(‖u(t)‖₂²)
  ≤ −2α̲α V(k, u)   (5.13)

holds whenever t > t₀, since |u(t)| converges to zero exponentially. Therefore,

(1/2) Σ_{i=1}^n u_i²(t) ≤ V(k, u) ≤ V(k, u(t₀)) e^{−2α̲α(t−t₀)},   (5.14)

and there exists a constant W(x(t₀)) > 0 such that

‖u(t)‖₂ ≤ W(x(t₀)) e^{−α̲αt}   (5.15)

for t > t₀. Theorem 3 is proved.

Corollary 2. Replacing the assumption of theorem 3 that the g_i(x) are differentiable, assume instead that all the g_i(x) are convex functions. Then, by the proof of theorem 3, we can conclude that

‖x(t) − x*‖₂ ≤ √(2V(k, u(t₀))) e^{−α̲α(t−t₀)}   (5.16)

holds for all t > t₀.

Remark 2. The exponential convergence rate obtained from this analysis is independent of the upper bound of a_i(u). This is in contrast with the result in Lin and Zou (2002), which requires upper boundedness in its analysis.

Remark 3. Although the exponential convergence rate of network 2.1 is related to the lower bound α̲ of a_i, the asymptotic stability of network 2.1 is independent of α̲.
Remark 4. The assumption that the g_i(x), i = 1, …, n, are differentiable can be relaxed to the assumption that the right and left derivatives of g_i(x), i = 1, …, n, exist. This relaxation is significant. For example, monotonically increasing piecewise linear functions satisfy this condition.

Remark 5. Let g′_{i{±}}(x) denote the right and left derivatives, respectively. Under the assumption that for any ε > 0 there is δ > 0 such that

| [g_i(x ± h) − g_i(x)]/(±h) − g′_{i{±}}(x) | < ε,  0 < h < δ,  i = 1, …, n,   (5.17)

the conclusion of theorem 3 still holds.

Corollary 3. Suppose b(s) ∈ D, g(s) ∈ Ḡ, 0 < α̲ ≤ a_i(u), and equation 5.1 holds. If there exist constants ξ_i > 0, i = 1, …, n, such that any one of the following inequalities holds:

D_j ξ_j − G_j ( ξ_j t_{jj} + Σ_{i=1, i≠j}^n ξ_i |t_{ij}| ) > αξ_j,  j = 1, …, n,   (5.19)

ξ_i (D_i − G_i t_{ii}) − Σ_{j=1, j≠i}^n ξ_j G_j |t_{ij}| > αξ_i,  i = 1, …, n,   (5.20)

ξ_i (D_i − G_i t_{ii}) − (1/2) Σ_{j=1, j≠i}^n ( ξ_i G_i |t_{ij}| + ξ_j G_j |t_{ji}| ) > αξ_i,  i = 1, …, n,   (5.21)

then there is a unique equilibrium point x* ∈ Rⁿ such that

x(t) − x* = O(e^{−α̲αt}).   (5.22)
6 Conclusion

We have investigated the absolute stability and the exponential stability of Cohen-Grossberg neural networks, and we have given an explicit estimate of the exponential convergence rate.
Appendix

Definition 7. A nonsingular n × n real matrix A is said to be a monotone matrix (M-matrix) if every component of A⁻¹ is nonnegative (see Horn & Johnson, 1991).

Let C = (c_{ij}) ∈ R^{n×n} be a nonsingular matrix with c_{ij} ≤ 0, i, j = 1, 2, …, n, i ≠ j. Then all the following statements are equivalent:

1. C is an M-matrix; that is, all the successive principal minors of C are positive.

2. Cᵀ is an M-matrix, where Cᵀ is the transpose of C.

3. The real part of every eigenvalue of C is positive.

4. There exists a vector ξ = [ξ₁, ξ₂, …, ξₙ]ᵀ with all ξ_i > 0, i = 1, 2, …, n, such that every component of ξᵀC is positive or every component of Cξ is positive.

5. There is a positive diagonal matrix P such that CP + PCᵀ is positive definite.

6. For any two diagonal matrices P = diag[P₁, P₂, …, Pₙ] and Q = diag[Q₁, Q₂, …, Qₙ], where P_i > 0, Q_i > 0, i = 1, 2, …, n, PCQ is an M-matrix.

Acknowledgments

We are very grateful to the reviewers for their helpful comments and suggestions. This research has been supported by the National Science Foundation of China under grants 69982003 and 60074005.

References

Amari, S. (1971). Characteristics of randomly connected threshold element networks and neural systems. Proc. IEEE, 59, 35–47.

Amari, S. (1972). Characteristics of random nets of analog neuron-like elements. IEEE Trans. Syst., Man, Cybern., 2, 643–657.

Chen, T., & Amari, S. (2001a). Stability of asymmetric Hopfield networks. IEEE Trans. Neural Networks, 12, 159–163.

Chen, T., & Amari, S. (2001b). New theorems on global convergence of some dynamical systems. Neural Networks, 14, 251–255.

Chen, T., Lu, W. L., & Amari, S. (2002). Global convergence rate of recurrently connected neural networks. Neural Computation, 14, 2947–2958.

Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. Syst., Man, Cybern., 13, 815–826.
Fang, Y., & Kincaid, T. G. (1996). Stability analysis of dynamical neural networks. IEEE Trans. Neural Networks, 7, 996–1006.

Forti, M. (1994). On global asymptotic stability of a class of nonlinear systems arising in neural network theory. J. Diff. Equations, 113, 246–264.

Forti, M., Manetti, S., & Marini, M. (1992). A condition for global convergence of a class of symmetric neural networks. IEEE Trans. Circuits and Systems, 39, 480–483.

Forti, M., & Tesi, A. (1995). New conditions for global stability of neural networks with application to linear and quadratic programming problems. IEEE Trans. Circuits Syst. I: Fundam. Theory Appl., 42, 354–366.

Golub, G. H., & Van Loan, C. F. (1990). Matrix computations (2nd ed.). Baltimore, MD: Johns Hopkins University Press.

Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17–61.

Hershkowitz, D. (1992). Recent directions in matrix stability. Linear Algebra and Its Applications, 171, 161–186.

Hirsch, M. (1989). Convergent activation dynamics in continuous time networks. Neural Networks, 2, 331–349.

Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.

Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625–633.

Horn, R. A., & Johnson, C. R. (1991). Topics in matrix analysis. Cambridge: Cambridge University Press.

Joy, M. P. (1999). On the global convergence of a class of functional differential equations with applications in neural network theory. Journal of Mathematical Analysis and Applications, 232, 61–81.

Joy, M. P. (2000). Results concerning the absolute stability of delayed neural networks. Neural Networks, 13, 613–616.

Kaszkurewicz, E., & Bhaya, A. (1994). On a class of globally stable neural circuits. IEEE Trans. Circuits Syst. I: Fundam. Theory Appl., 41, 171–174.

Kelly, D. G. (1990). Stability in contractive nonlinear neural networks. IEEE Trans. Biomed. Eng., 37, 231–242.

Lin, W., & Zou, X. (2002). Exponential stability of Cohen-Grossberg neural networks. Neural Networks, 15, 415–422.

Matsuoka, K. (1992). Stability conditions for nonlinear continuous neural networks with asymmetric connection weights. Neural Networks, 5, 495–500.

Michel, A. N., Farrell, J. A., & Porod, W. (1989). Qualitative analysis of neural networks. IEEE Trans. Circuits Syst., 36, 229–243.

Michel, A. N., & Gray, D. L. (1990). Analysis and synthesis of neural networks with lower block triangular interconnecting structure. IEEE Trans. Circuits Syst., 37, 1267–1283.

Miller, R. K., & Michel, A. N. (1982). Ordinary differential equations. Orlando, FL: Academic Press.

Sompolinsky, H., & Crisanti, A. (1988). Chaos in random neural networks. Phys. Rev. Lett., 61, 259–262.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24.

Yang, H., & Dillon, T. (1994). Exponential stability and oscillation of Hopfield graded response neural networks. IEEE Trans. Neural Networks, 5, 719–729.

Received July 18, 2002; accepted November 13, 2002.
ARTICLE
Communicated by Jonathan Victor
Estimation of Entropy and Mutual Information Liam Paninski
[email protected] Center for Neural Science, New York University, New York, NY 10003, U.S.A.
We present some new results on the nonparametric estimation of entropy and mutual information. First, we use an exact local expansion of the entropy function to prove almost sure consistency and central limit theorems for three of the most commonly used discretized information estimators. The setup is related to Grenander's method of sieves and places no assumptions on the underlying probability measure generating the data. Second, we prove a converse to these consistency theorems, demonstrating that a misapplication of the most common estimation techniques leads to an arbitrarily poor estimate of the true information, even given unlimited data. This "inconsistency" theorem leads to an analytical approximation of the bias, valid in surprisingly small sample regimes and more accurate than the usual 1/N formula of Miller and Madow over a large region of parameter space. The two most practical implications of these results are negative: (1) information estimates in a certain data regime are likely contaminated by bias, even if "bias-corrected" estimators are used, and (2) confidence intervals calculated by standard techniques drastically underestimate the error of the most common estimation methods. Finally, we note a very useful connection between the bias of entropy estimators and a certain polynomial approximation problem. By casting bias calculation problems in this approximation theory framework, we obtain the best possible generalization of known asymptotic bias results. More interesting, this framework leads to an estimator with some nice properties: the estimator comes equipped with rigorous bounds on the maximum error over all possible underlying probability distributions, and this maximum error turns out to be surprisingly small. We demonstrate the application of this new estimator on both real and simulated data.
1 Introduction

The mathematical theory of information transmission represents a pinnacle of statistical research: the ideas are at once beautiful and applicable to a remarkably wide variety of questions. While psychologists and neurophysiologists began to apply these concepts almost immediately after their introduction, the past decade has seen a dramatic increase in the popularity of information-theoretic analysis of neural data.

Neural Computation 15, 1191–1253 (2003) © 2003 Massachusetts Institute of Technology

It is unsurprising that these methods have found applications in neuroscience; after all, the theory shows that certain concepts, such as mutual information, are unavoidable when one asks the kind of questions neurophysiologists are interested in. For example, the capacity of an information channel is a fundamental quantity when one is interested in how much information can be carried by a probabilistic transmission system, such as a synapse. Likewise, we should calculate the mutual information between a spike train and an observable signal in the world when we are interested in asking how much we (or any homunculus) could learn about the signal from the spike train.

In this article, we will be interested not in why we should estimate information-theoretic quantities (see Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997, and Cover & Thomas, 1991, for extended and eloquent discussions of this question) but rather in how well we can estimate these quantities at all, given finite independently and identically distributed (i.i.d.) data. One would think this question would be well understood; after all, applied statisticians have been studying this problem since the first appearance of Shannon's papers, over 50 years ago (Weaver & Shannon, 1949). Somewhat surprisingly, though, many basic questions have remained unanswered.

To understand why, consider the problem of estimating the mutual information, I(X; Y), between two signals X and Y. This estimation problem lies at the heart of the majority of applications of information theory to data analysis; to make the relevance to neuroscience clear, let X be a spike train, or an intracellular voltage trace, and Y some behaviorally relevant, physically observable signal, or the activity of a different neuron. In these examples and in many other interesting cases, the information estimation problem is effectively infinite-dimensional. By the definition of mutual information, we require knowledge of the joint probability distribution P(X, Y) on the range spaces of X and Y, and these spaces can be quite large—and it would seem to be very difficult to make progress here in general, given the limited amount of data one can expect to obtain from any physiological preparation.

In this article, we analyze a discretization procedure for reducing this very hard infinite-dimensional learning problem to a series of more tractable finite-dimensional problems. While we are interested particularly in applications to neuroscience, our results are valid in general for any information estimation problem. It turns out to be possible to obtain a fairly clear picture of exactly how well the most commonly used discretized information estimators perform, why they fail in certain regimes, and how this performance can be improved. Our main practical conclusions are that the most common estimators can fail badly in common data-analytic situations and that this failure is more dramatic than has perhaps been appreciated in the literature. The most common procedures for estimating confidence intervals, or error bars, fail even more dramatically in these "bad" regimes. We suggest a new approach here and prove some of its advantages.

The article is organized as follows. In section 2, we define the basic regularization procedure (or, rather, formalize an intuitive scheme that has been widely used for decades). We review the known results in section 3 and then go on in section 4 to clarify and improve existing bias and variance results, proving consistency and asymptotic normality results for a few of the most commonly used information estimators. These results serve mainly to show when these common estimators can be expected to be accurate and when they should be expected to break down. Sections 5 and 6 contain the central results of this article. In section 5, we show, in a fairly intuitive way, why these common estimators perform so poorly in certain regimes and exactly how bad this failure is. These results lead us, in section 6, to study a polynomial approximation problem associated with the bias of a certain class of entropy estimators; this class includes the most common estimators in the literature, and the solution to this approximation problem provides a new estimator with much better properties. Section 7 describes some numerical results that demonstrate the relevance of our analysis for physiological data regimes. We conclude with a brief discussion of three extensions of this work: section 8.1 examines a surprising (and possibly useful) degeneracy of a Bayesian estimator, section 8.2 gives a consistency result for a potentially more powerful regularization method than the one examined in depth here, and section 8.3 attempts to place our results in the context of estimation of more general functionals of the probability distribution (that is, not just entropy and mutual information). We attach two appendixes. In appendix A, we list a few assorted results that are interesting in their own right but did not fit easily into the flow of the article.
In appendix B, we give proofs of several of the more difficult results, deferred for clarity's sake from the main body of the text. Throughout, we assume little previous knowledge of information theory beyond an understanding of the definition and basic properties of entropy (Cover & Thomas, 1991); however, some knowledge of basic statistics is assumed (see, e.g., Schervish, 1995, for an introduction).

This article is intended for two audiences: applied scientists (especially neurophysiologists) interested in using information-theoretic techniques for data analysis and theorists interested in the more mathematical aspects of the information estimation problem. This split audience could make for a somewhat split presentation: the correct statement of the results requires some mathematical precision, while the demonstration of their utility requires some more verbose explanation. Nevertheless, we feel that the intersection between the applied and theoretical communities is large enough to justify a unified presentation of our results and our motivations. We hope readers will agree and forgive the length of the resulting article.
2 The Setup: Grenander's Method of Sieves

Much of the inherent difficulty of our estimation problem stems from the fact that the mutual information,

I(X; Y) ≡ ∫_{X×Y} dP(x, y) log [ dP(x, y) / d(P(x) × P(y)) ],

is a nonlinear functional of an unknown joint probability measure, P(X, Y), on two arbitrary measurable spaces X and Y. In many interesting cases, the "parameter space"—the space of probability measures under consideration—can be very large, even infinite-dimensional. For example, in the neuroscientific data analysis applications that inspired this work (Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998), X could be a space of time-varying visual stimuli and Y the space of spike trains that might be evoked by a given stimulus. This Y could be taken to be a (quite large) space of discrete (counting) measures on the line, while X could be modeled as an (even larger) space of generalized functions.

Theorem (McDiarmid's inequality). If f satisfies condition 3.1, then for any ε > 0,

P( |f(x₁, …, x_N) − E(f(x₁, …, x_N))| > ε ) ≤ 2e^{ −2ε² / Σ_{j=1}^N c_j² }.   (3.2)

The condition says that by changing the value of the coordinate x_j, we cannot change the value of the function f by more than some constant c_j.
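As a quick numerical sanity check of 3.2 (a sketch using our own toy example, not one from the article): take f to be the sample mean of N Bernoulli(0.3) variables, for which each c_j = 1/N, so the bound reads P(|f − Ef| > ε) ≤ 2 exp(−2Nε²).

```python
import numpy as np

# Monte Carlo check of the bounded-differences bound 3.2 on a toy function
# (our example): f = mean of N Bernoulli(0.3) draws. Changing one x_j moves
# f by at most c_j = 1/N, so sum_j c_j^2 = 1/N and the bound is
#   P(|f - E f| > eps) <= 2 * exp(-2 * N * eps^2).
rng = np.random.default_rng(1)
N, eps, trials = 200, 0.05, 20_000

f = rng.binomial(N, 0.3, size=trials) / N      # 20,000 realizations of f
tail = np.mean(np.abs(f - 0.3) > eps)          # empirical tail probability
bound = 2 * np.exp(-2 * N * eps ** 2)
print(f"empirical P = {tail:.4f}  <=  McDiarmid bound = {bound:.4f}")
```

The bound is loose here (as concentration bounds typically are), but it holds uniformly in the underlying distribution, which is the point of the inequality.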
The usefulness of the theorem is a result of both the ubiquity of functions f satisfying condition 3.1 (and the ease with which we can usually check the condition) and the exponential nature of the inequality, which can be quite powerful if Σ_{j=1}^N c_j² satisfies reasonable growth conditions. Antos and Kontoyiannis (2001) pointed out that this leads easily to a useful bound on the variance of the MLE for entropy:

Theorem (Antos & Kontoyiannis, 2001). (a) For all N, the variance of the MLE for entropy is bounded above:

Var(Ĥ_MLE) ≤ (log N)² / N.   (3.3)

(b) Moreover, by McDiarmid's inequality, 3.2,

P( |Ĥ_MLE − E(Ĥ_MLE)| > ε ) ≤ 2e^{ −Nε² / (2(log N)²) }.   (3.4)

Note that although this inequality is not particularly tight—while it says that the variance of Ĥ_MLE necessarily dives to zero with increasing N, the true variance turns out to be even smaller than the bound indicates—the inequality is completely universal, that is, independent of m or p. For example, Antos and Kontoyiannis (2001) use it in the context of countably infinite m. In addition, it is easy to apply this result to other functionals of p_N (see section 6 for one such important generalization).

3.3 Ĥ_MLE Is Negatively Biased Everywhere. Finally, for completeness, we mention the following well-known fact:

E_p(Ĥ_MLE) ≤ H(p),   (3.5)

where E_p(·) denotes the conditional expectation given p. We have equality in the above expression only when H(p) = 0; in words, the bias of the MLE for entropy is negative everywhere unless the underlying distribution p is supported on a single point. This is all a simple consequence of Jensen's inequality; a proof was recently given in Antos and Kontoyiannis (2001), and we will supply another easy proof below. Note that equation 3.5 does not imply that the MLE for mutual information is biased upward everywhere, as has been claimed elsewhere; it is easy to find distributions p such that E_p(Î_MLE) < I(p). We will discuss the reason for this misunderstanding below.

It will help to keep Figure 1 in mind. This figure gives a compelling illustration of perhaps the most basic fact about Ĥ_MLE: the variance is small and the bias is large until N ≫ m. This qualitative statement is not new; however, the corresponding quantitative statement—especially the fact that
[Figure 1: Sampling distributions of the MLE, P(Ĥ_est), for p uniform, m = 500, shown in four panels at N = 10, 100, 500, and 1000; horizontal axis Ĥ_est in bits.]

Figure 1: Evolution of sampling distributions of MLE: fixed m, increasing N. The true value of H is indicated by the dots at the bottom right corner of each panel. Note the small variance for all N and the slow decrease of the bias as N → ∞.
Ĥ_MLE in the statement can be replaced with any of the three most commonly used estimators—appears to be novel. We will develop this argument over the next four sections and postpone discussion of the implications for data analysis until the conclusion.

4 The N ≫ m Range: The Local Expansion

The unifying theme of this section is a simple local expansion of the entropy functional around the true value of the discrete measure p, a variant of what
is termed the "delta method" in the statistics literature. This expansion is similar to one used by previous authors; we will be careful to note the extensions provided by the current work. The main idea, outlined, for example, in Serfling (1980), is that any smooth function of the empirical measures p_N (e.g., any of the three estimators for entropy introduced above) will behave like an affine function with probability approaching one as N goes to infinity. To be more precise, given some functional f of the empirical measures, we can expand f around the underlying distribution p as follows:

f(p_N) = f(p) + df(p; p_N − p) + r_N(f, p, p_N),

where df(p; p_N − p) denotes the functional derivative (Frechet derivative) of f with respect to p in the direction p_N − p, and r_N(f, p, p_N) the remainder. If f is sufficiently smooth (in a suitable sense), the differential df(p; p_N − p) will be a linear functional of p_N − p for all p, implying

df(p; p_N − p) ≡ df( p; (1/N) Σ_{j=1}^N δ_j − p ) = (1/N) Σ_j df(p; δ_j − p);

that is, df(p; p_N − p) is the average of N i.i.d. variables, which implies, under classical conditions on the tail of the distribution of df(p; δ_j − p), that N^{1/2} df(p; p_N − p) is asymptotically normal. If we can prove that N^{1/2} r_N(f, p, p_N) goes to zero in probability (that is, the behavior of f is asymptotically the same as the behavior of a linear expansion of f about p), then a CLT for f follows. This provides us with a more flexible approach than the method outlined in section 3.1 (recall that that method relied on a CLT for the underlying empirical measures p_N, and such a CLT does not necessarily hold if m and p are not fixed).

Let us apply all this to H:

Ĥ_MLE(p_N) = H(p_N) = H(p) + dH(p; p_N − p) + r_N(H, p, p_N)
  = H(p) + Σ_{i=1}^m (p_i − p_{N,i}) log p_i + r_N(H, p, p_N).   (4.1)

A little algebra shows that

r_N(H, p, p_N) = −D_KL(p_N; p),

where D_KL(p_N; p) denotes the Kullback-Leibler divergence between p_N, the empirical measure, and p, the true distribution. The sum in equation 4.1 has mean 0; by linearity of expectation, then,

E_p(Ĥ_MLE) − H = −E_p( D_KL(p_N; p) ),   (4.2)
and since D_KL(p_N; p) ≥ 0, where the inequality is strict with positive probability whenever p is nondegenerate, we have a simple proof of the nonpositive bias of the MLE. Another (slightly more informative) approach will be given in section 6.

The second useful consequence of the local expansion follows by the next two well-known results (Gibbs & Su, 2002):

0 ≤ D_KL(p_N; p) ≤ log( 1 + χ²(p_N; p) ),   (4.3)

where

χ² ≡ Σ_{i=1}^m (p_{N,i} − p_i)² / p_i

denotes Pearson's chi-square functional, and

E_p( χ²(p_N; p) ) = ( |supp(p)| − 1 ) / N  for all p,   (4.4)

where |supp(p)| denotes the size of the support of p, the number of points with nonzero p-probability. Expressions 4.2 through 4.4, with Jensen's inequality, give us rigorous upper and lower bounds on B(Ĥ_MLE), the bias of the MLE:

Proposition 1.

−log( 1 + (m − 1)/N ) ≤ B(Ĥ_MLE) ≤ 0,

with equality iff p is degenerate. The lower bound is tight as N/m → 0, and the upper bound is tight as N/m → ∞.

Here we note that Miller (1955) used a similar expansion to obtain the 1/N bias rate for m fixed, N → ∞. The remaining step is to expand D_KL(p_N; p):

D_KL(p_N; p) = (1/2) χ²(p_N; p) + O(N⁻²),   (4.5)

if p is fixed. As noted in section 3.1, this expansion of D_KL does not converge for all possible values of p_N; however, when m and p are fixed, it is easy to show, using a simple cutoff argument, that this "bad" set of p_N has an asymptotically negligible effect on E_p(D_KL). The formula for the mean of the chi-square statistic, equation 4.4, completes Miller's and Madow's original proof (Miller, 1955); we have

B(Ĥ_MLE) = −(m − 1)/(2N) + o(N⁻¹),   (4.6)

if m is fixed and N → ∞. From here, it easily follows that Ĥ_MM and Ĥ_JK both have o(N⁻¹) bias under these conditions (for Ĥ_MM, we need only show that m̂ → m sufficiently rapidly, and this follows by any of a number of exponential inequalities [Dembo & Zeitouni, 1993; Devroye et al., 1996]; the statement for Ĥ_JK can be proven by direct computation). To extend these kinds of results to the case when m and p are not fixed, we have to generalize equation 4.6. This desired generalization of Miller's result does turn out to be true, as we prove (using a completely different technique) in section 6.

It is worth emphasizing that E_p(χ²(p_N; p)) is not constant in p; it is constant on the interior of the m-simplex but varies discontinuously on the boundary. This was the source of the confusion about the bias of the MLE for information,

Î_MLE(x, y) ≡ Ĥ_MLE(x) + Ĥ_MLE(y) − Ĥ_MLE(x, y).

When p(x, y) has support on the full m_x m_y points, the 1/N bias rate is indeed given by (m_x m_y − m_x − m_y + 1)/(2N), which is positive whenever m_x and m_y are greater than 1. However, p(x, y) can be supported on as few as max(m_x, m_y) points, which means that the 1/N bias rate of Î_MLE can be negative. It could be argued that this reduced-support case is nonphysiological; however, a simple continuity argument shows that even when p(x, y) has full support but places most of its mass on a subset of its support, the bias can be negative even for large N, even though the asymptotic bias rate in this case is positive.

The simple bounds of proposition 1 form about half of the proof of the following two theorems, the main results of this section. They say that if m_{S,N} and m_{T,N} grow with N, but not too quickly, the "sieve" regularization works, in the sense that the sieve estimator is almost surely consistent and asymptotically normal and efficient on a √N scale.
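Equation 4.6 and proposition 1 are easy to check by simulation. The following sketch, with our own toy choices (m = 20, N = 200, p uniform), compares the empirical bias of the plug-in estimator with the −(m − 1)/(2N) prediction and with the Miller-Madow corrected value:

```python
import numpy as np

# Sketch (our toy parameters): empirical bias of the plug-in (MLE) entropy
# estimator on a uniform distribution over m bins, compared with the
# Miller-Madow prediction B = -(m-1)/(2N) of equation 4.6 and with the
# bias-corrected estimator H_MM = H_MLE + (m-1)/(2N).
rng = np.random.default_rng(2)
m, N, trials = 20, 200, 5_000
H_true = np.log(m)                                   # entropy in nats

def h_mle(counts):
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

ests = np.array([h_mle(np.bincount(rng.integers(0, m, N), minlength=m))
                 for _ in range(trials)])
bias_mle = ests.mean() - H_true
bias_mm = bias_mle + (m - 1) / (2 * N)               # residual after correction
print(f"plug-in bias = {bias_mle:.4f}, predicted = {-(m - 1) / (2 * N):.4f}, "
      f"Miller-Madow residual bias = {bias_mm:.4f}")
```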
The power of these results lies in their complete generality: we place no constraints whatsoever on either the underlying probability measure, p(x, y), or the sample spaces X and Y. Note that the theorems are true for all three of the estimators defined above (i.e., Ĥ below—and in the rest of the article, unless otherwise noted—can be replaced by Ĥ_MLE, Ĥ_JK, or Ĥ_MM); thus, all three common estimators have the same 1/N variance rate, σ², as defined below. In the following, σ_{X,Y} is the joint σ-algebra of X × Y on which the underlying probability distribution p(X, Y) is defined, σ_{S_N,T_N} is the (finite) σ-algebra generated by S_N and T_N, and H_N denotes the N-discretized entropy, H(S_N(X)). The σ-algebra condition in theorem 1 is merely a technical way of saying that S_N and T_N asymptotically retain all of the data in the sample (x, y) in the appropriate measure-theoretic sense; see appendix A for details.

Theorem 1 (Consistency). If m_{S,N} m_{T,N} = o(N) and σ_{S_N,T_N} generates σ_{X,Y}, then Î → I a.s. as N → ∞.
Estimation of Entropy and Mutual Information
Theorem 2 (Central limit). Let

σ_N² ≡ Var(−log p_{T_N}) ≡ Σ_{i=1}^m p_{T_N,i} (−log p_{T_N,i} − H_N)².

If m_N ≡ m = o(N^{1/2}), and

lim inf_{N→∞} N^{1−α} σ_N² > 0

for some α > 0, then (N/σ_N²)^{1/2} (Ĥ − H_N) is asymptotically standard normal.
The following lemma is the key to the proof of theorem 1, and is interesting in its own right:

Lemma 1. If m = o(N), then Ĥ → H_N a.s.
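Lemma 1 is easy to illustrate numerically. The following Python fragment (ours, not part of the original article; the uniform target distribution and the m = ⌊√N⌋ sieve schedule are arbitrary illustrative choices) tracks |Ĥ_MLE − H_N| as N grows with m = o(N):

```python
import math
import random

def h_mle(counts, n):
    """Plug-in (MLE) entropy estimate, in nats, from bin counts."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def mle_error(n, seed=0):
    """|H_hat - H_N| for the uniform distribution on m = floor(sqrt(N)) bins,
    so that m = o(N) as required by lemma 1."""
    m = int(n ** 0.5)
    rng = random.Random(seed)
    counts = [0] * m
    for _ in range(n):
        counts[rng.randrange(m)] += 1
    return abs(h_mle(counts, n) - math.log(m))

# The error shrinks as N grows:
errors = [mle_error(n) for n in (10**3, 10**4, 10**5)]
```

With this schedule the dominant error term is the (m − 1)/(2N) ≈ 1/(2√N) bias of equation 4.6, so the error decays on an N^{−1/2} scale.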
Note that σ_N² in the statement of the CLT (theorem 2) is exactly the variance of the sum in expression 4.1, and corresponds to the asymptotic variance derived originally in Basharin (1959) by a similar local expansion. We also point out that σ_N² has a specific meaning in the theory of data compression (where σ_N² goes by the name of "minimal coding variance"; see Kontoyiannis, 1997, for more details).

We close this section with some useful results on the variance of Ĥ. We have, under the stated conditions, that the variance of Ĥ is of order σ_N²/N asymptotically (by the CLT), and strictly less than C log(N)²/N for all N, for some fixed C (by the result of Antos & Kontoyiannis, 2001). It turns out that we can "interpolate," in a sense, between the (asymptotically loose but good for all N) p-independent bound and the (asymptotically exact but bad for small N) p-dependent gaussian approximation. The trick is to bound the average fluctuations in Ĥ when randomly replacing one sample, instead of the worst-case fluctuations, as in McDiarmid's bound. The key inequality is due to Steele (1986):

Theorem (Steele's inequality). If S(x_1, x_2, …, x_N) is any function of N i.i.d. random variables, then

var(S) ≤ (1/2) Σ_{i=1}^N E(S − S_i)²,

where S_i = S(x_1, x_2, …, x′_i, …, x_N) is given by replacing the x_i with an i.i.d. copy.
For S = Ĥ, it turns out to be possible to compute the right-hand side explicitly; the details are given in appendix B. It should be clear even without any computation that the bound so obtained is at least as good as the C log(N)²/N guaranteed by McDiarmid; it is also easy to show, by the linear expansion technique employed above, that the bound is asymptotically tight under conditions similar to those of theorem 2.

Thus, σ(p)² plays the key role in determining the variance of Ĥ. We know σ² can be zero for some p, since Var(−log p_i) is zero for any p uniform on any k points, k ≤ m. On the other hand, how large can σ² be? The following proposition provides the answer; the proof is in appendix B.

Proposition 2. max_p σ² ~ (log m)².
This leads us to define the following bias-variance balance function, valid in the N ≫ m range:

V/B² ≈ N (log m)² / m².

If V/B² is large, variance dominates the mean-square error (in the "worst-case" sense), and bias dominates if V/B² is small. It is not hard to see that if m is at all large, bias dominates until N is relatively huge (recall Figure 1). (This is just a rule of thumb, of course, not least because the level of accuracy desired, and the relative importance of bias and variance, depend on the application. We give more precise formulas for the bias and variance—in particular, valid for all values of N and m—in the following.)

To summarize, the sieve method is effective and the asymptotic behavior of Ĥ is well understood for N ≫ m. In this regime, if V/B² > 1, classical (Cramer-Rao) effects dominate, and the three most common estimators (Ĥ_MLE, Ĥ_MM, and Ĥ_JK) are approximately equivalent, since they share the same asymptotic variance rate. But if V/B² < 1, bias plays a more important role, and estimators that are specifically designed to reduce the bias become competitive; previous work has demonstrated that Ĥ_MM and Ĥ_JK are effective in this regime (Panzeri & Treves, 1996; Strong et al., 1998). In the next section, we turn to a regime that is much more poorly understood, the (not uncommon) case when N ~ m. We will see that the local expansion becomes much less useful in this regime, and a different kind of analysis is required.

5 The N ~ m Range: Consequences of Symmetry

The main result of this section is as follows: if N/m is bounded, the bias of Ĥ remains large while the variance is always small, even if N → ∞. The
basic idea is that entropy is a symmetric function of p_i, 1 ≤ i ≤ m, in that H is invariant under permutations of the points {1, …, m}. Most common estimators of H, including Ĥ_MLE, Ĥ_MM, and Ĥ_JK, share this permutation symmetry (in fact, one can show that there is some statistical justification for restricting our attention to this class of symmetric estimators; see appendix A). Thus, the distribution of Ĥ_MLE(p_N), say, is the same as that of Ĥ_MLE(p′_N), where p′_N is the rank-sorted empirical measure (for concreteness, define "rank-sorted" as "rank-sorted in decreasing order").

This leads us to study the limiting distribution of these sorted empirical measures (see Figure 2). It turns out that these sorted histograms converge to the "wrong" distribution under certain circumstances. We have the following result:

Theorem 3 (Convergence of sorted empirical measures; inconsistency). Let P be absolutely continuous with respect to Lebesgue measure on the interval [0, 1], and let p = dP/dm be the corresponding density. Let S_N be the m-equipartition of [0, 1], p′ denote the sorted empirical measure, and N/m → c, 0 < c < ∞. Then:

a. p′ → p′_{c,∞} in L1, a.s., with ‖p′_{c,∞} − p‖₁ > 0. Here p′_{c,∞} is the monotonically decreasing step density with gaps between steps j and j + 1 given by

∫₀¹ dt e^{−cp(t)} (cp(t))^j / j!.

b. Assume p is bounded. Then Ĥ − H_N → B_{c,Ĥ}(p) a.s., where B_{c,Ĥ}(p) is a deterministic function, nonconstant in p. For Ĥ = Ĥ_MLE,

B_{c,Ĥ}(p) = h(p′_{c,∞}) − h(p) < 0,
where h(·) denotes differential entropy. In other words, when the sieve is too fine (N ~ m), the limit sorted empirical histogram exists (and is surprisingly easy to compute) but is not equal to the true density, even when the original density is monotonically decreasing and of step form. As a consequence, Ĥ remains biased even as N → ∞. This in turn leads to a strictly positive lower bound on the asymptotic error of Ĥ over a large portion of the parameter space. The basic phenomenon is illustrated in Figure 2.

We can apply this theorem to obtain simple formulas for the asymptotic bias B(p, c) for special cases of p: for example, for the uniform distribution U ≡ U([0, 1]),

B_{c,Ĥ_MLE}(U) = log(c) − e^{−c} Σ_{j=1}^∞ [c^{j−1}/(j − 1)!] log(j);
Figure 2: "Incorrect" convergence of sorted empirical measures. Each left panel shows an example unsorted m-bin histogram of N samples from the uniform density (unsorted, unnormalized), with N/m = 1 (m = N = 100, 1000, and 10000) and N increasing from top to bottom. Ten sorted sample histograms (sorted, normalized) are overlaid in each right panel, demonstrating the convergence to a nonuniform limit. The analytically derived p′_{c,∞} is drawn in the final panel but is obscured by the sample histograms.
B_{c,Ĥ_MM}(U) = B_{c,Ĥ_MLE}(U) + (1 − e^{−c})/(2c);

B_{c,Ĥ_JK}(U) = 1 + log(c) − e^{−c} Σ_{j=1}^∞ [c^{j−1}/(j − 1)!] (j − c) log(j).
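These series are easy to evaluate and to check by simulation. The Python fragment below (ours, not part of the original article; the 80-term truncation and the Monte Carlo settings are arbitrary illustrative choices) evaluates the three formulas and compares the c = 1 prediction with a direct draw:

```python
import math
import random

def b_mle(c, terms=80):
    """Asymptotic MLE bias at the uniform density, N/m -> c (nats)."""
    s = sum(c ** (j - 1) / math.factorial(j - 1) * math.log(j)
            for j in range(1, terms))
    return math.log(c) - math.exp(-c) * s

def b_mm(c, terms=80):
    """Miller-Madow bias at the uniform density."""
    return b_mle(c, terms) + (1 - math.exp(-c)) / (2 * c)

def b_jk(c, terms=80):
    """Jackknife bias at the uniform density."""
    s = sum(c ** (j - 1) / math.factorial(j - 1) * (j - c) * math.log(j)
            for j in range(1, terms))
    return 1 + math.log(c) - math.exp(-c) * s

# In the undersampled limit c -> 0, the corrections buy a fixed number of
# nats: B_MM - B_MLE -> 1/2 and B_JK - B_MLE -> 1.
d_mm = b_mm(0.01) - b_mle(0.01)
d_jk = b_jk(0.01) - b_mle(0.01)

# Monte Carlo check at c = 1 (N = m): the realized H_hat - H_N should sit
# near the deterministic limit b_mle(1).
n = m = 10**5
rng = random.Random(1)
counts = [0] * m
for _ in range(n):
    counts[rng.randrange(m)] += 1
h_hat = -sum((k / n) * math.log(k / n) for k in counts if k > 0)
mc_bias = h_hat - math.log(m)
```

At c = 1 the formula gives roughly −0.57 nats of bias even as N → ∞, the inconsistency asserted by theorem 3.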
To give some insight into these formulas, note that B_{c,Ĥ_MLE}(U) behaves like log(N) − log(m) as c → 0, as expected given that Ĥ_MLE is supported on
[0, log(N)] (recall the lower bound of proposition 1); meanwhile,

B_{c,Ĥ_MM}(U) ~ B_{c,Ĥ_MLE}(U) + 1/2

and

B_{c,Ĥ_JK}(U) ~ B_{c,Ĥ_MLE}(U) + 1

in this c → 0 limit. In other words, in the extremely undersampled limit, the Miller correction reduces the bias by only half a nat, while the jackknife gives us only twice that. It turns out that the proof of the theorem leads to good upper bounds on the approximation error of these formulas, indicating that these asymptotic results will be useful even for small N. We examine the quality of these approximations for finite N and m in section 7.

This asymptotically deterministic behavior of the sorted histograms is perhaps surprising, given that there is no such corresponding deterministic behavior for the unsorted histograms (although, by the Glivenko-Cantelli theorem, van der Vaart & Wellner, 1996, there is well-known deterministic behavior for the integrals of the histograms). What is going on here? In crude terms, the sorting procedure "averages over" the variability in the unsorted histograms. In the case of the theorem, the "variability" at each bin turns out to be of a Poisson nature, in the limit as m, N → ∞, and this leads to a well-defined and easy-to-compute limit for the sorted histograms. To be more precise, note that the value of the sorted histogram at bin k is greater than t if and only if the number of (unsorted) p_{N,i} with p_{N,i} > t is at least k (remember that we are sorting in decreasing order). In other words,

p′_N = F_N^{−1},

where F_N is the empirical "histogram distribution function,"

F_N(t) ≡ (1/m) Σ_{i=1}^m 1(p_{N,i} < t),
and its inverse is defined in the usual way. We can expect these sums of indicators to converge to the sums of their expectations, which in this case are given by

E(F_N(t)) = (1/m) Σ_i P(p_{N,i} < t).

Finally, it is not hard to show that this last sum can be approximated by an integral of Poisson probabilities (see appendix B for details). Something similar happens even if m = o(N); in this case, under similar conditions on p, we would expect each p_{N,i} to be approximately gaussian instead of Poisson.
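The Poisson picture is easy to verify directly. In the Python fragment below (ours, not from the original article; the uniform density, c = 1, and the bin count are arbitrary illustrative choices), the fraction of the m bins receiving exactly j of the N = cm samples settles onto the Poisson(c) mass function — and these occupancy fractions are exactly the step widths of the limit density p′_{c,∞} of theorem 3:

```python
import math
import random

c, m = 1.0, 20000
n = int(c * m)
rng = random.Random(2)
counts = [0] * m
for _ in range(n):
    counts[rng.randrange(m)] += 1

# Fraction of bins with occupancy j, versus the Poisson(c) mass e^{-c} c^j / j!:
observed = [sum(1 for k in counts if k == j) / m for j in range(4)]
poisson = [math.exp(-c) * c ** j / math.factorial(j) for j in range(4)]
```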
To compute E(Ĥ) now, we need only note the following important fact: each Ĥ is a linear functional of the "histogram order statistics,"

h_j ≡ Σ_{i=1}^m 1(n_i = j),
where n_i ≡ N p_{N,i} is the unnormalized empirical measure. For example,

Ĥ_MLE = Σ_{j=0}^N a_{Ĥ_MLE,j,N} h_j,

where

a_{Ĥ_MLE,j,N} = −(j/N) log(j/N),

while

a_{Ĥ_JK,j,N} = N a_{Ĥ_MLE,j,N} − [(N − 1)/N] [(N − j) a_{Ĥ_MLE,j,N−1} + j a_{Ĥ_MLE,j−1,N−1}].
Linearity of expectation now makes things very easy for us:

E(Ĥ) = Σ_{j=0}^N a_{Ĥ,j,N} E(h_j)
     = Σ_j Σ_{i=1}^m a_{j,N} P(n_i = j)
     = Σ_j a_{j,N} Σ_i C(N, j) p_i^j (1 − p_i)^{N−j}.  (5.1)
We emphasize that the above formula is exact for all N, m, and p; again, the usual Poisson or gaussian approximations to the last sum lead to useful asymptotic bias formulas. See appendix B for the rigorous computations.

For our final result of this section, let p and S_N be as in the statement of theorem 3, with p bounded, and N = O(m^{1−α}), α > 0. Then some easy computations show that P(∃i : n_i > j) → 0 for all j > α^{−1}. In other words, with high probability, we have to estimate H given only 1 + α^{−1} numbers, namely {h_j}_{0≤j≤α^{−1}}, and it is not hard to see, given equation 5.1 and the usual Bayesian lower bounds on minimax error rates (see, e.g., Ritov &
Bickel, 1990), that this is not enough to estimate H(p). We have, therefore:

Theorem 4. If N ~ O(m^{1−α}), α > 0, then no consistent estimator for H exists.
By Shannon's discrete formula, a similar result holds for mutual information.

6 Approximation Theory and Bias

The last equality, expression 5.1, is key to the rest of our development. Letting B(Ĥ) denote the bias of Ĥ, we have:

B(Ĥ) = [Σ_{j=0}^N a_{j,N} Σ_{i=1}^m C(N, j) p_i^j (1 − p_i)^{N−j}] − [−Σ_{i=1}^m p_i log(p_i)]
     = Σ_i p_i log(p_i) + Σ_i Σ_j a_{j,N} C(N, j) p_i^j (1 − p_i)^{N−j}
     = Σ_i [p_i log(p_i) + Σ_j a_{j,N} C(N, j) p_i^j (1 − p_i)^{N−j}].
If we define the usual entropy function,

H(x) = −x log x,

and the binomial polynomials,

B_{j,N}(x) ≡ C(N, j) x^j (1 − x)^{N−j},

we have

−B(Ĥ) = Σ_i [H(p_i) − Σ_j a_{j,N} B_{j,N}(p_i)].
In other words, the bias is the m-fold sum of the difference between the function H and a polynomial of degree N; these differences are taken at the points p_i, which all fall on the interval [0, 1]. The bias will be small, therefore, if the polynomial is close, in some suitable sense, to H. This type of polynomial approximation problem has been extensively studied (Devore & Lorentz, 1993), and certain results from this general theory of approximation will prove quite useful.
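This representation also makes the bias exactly computable for any given p: one simply sums the pointwise gaps H(p_i) − Σ_j a_{j,N} B_{j,N}(p_i). A Python sketch (ours; the uniform p, m = 10, and N = 1000 are arbitrary illustrative choices, and the binomial weights are evaluated via lgamma for numerical stability):

```python
import math

def log_binom_pmf(j, n, p):
    """log of B_{j,N}(p) = C(N,j) p^j (1-p)^(N-j), computed via lgamma."""
    return (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(p) + (n - j) * math.log(1 - p))

def exact_mle_bias(p_vec, n):
    """B(H_MLE) = sum_i [sum_j a_{j,N} B_{j,N}(p_i) - H(p_i)], exactly."""
    bias = 0.0
    for p in p_vec:
        e_h = sum(-(j / n) * math.log(j / n) * math.exp(log_binom_pmf(j, n, p))
                  for j in range(1, n + 1))          # the j = 0 term has a_{0,N} = 0
        bias += e_h - (-p * math.log(p))
    return bias

m, n = 10, 1000
bias = exact_mle_bias([1 / m] * m, n)
miller = -(m - 1) / (2 * n)    # the leading term, equation 4.6
```

For these well-sampled values the exact bias and the −(m − 1)/(2N) rate agree closely; the agreement degrades as N/m shrinks, which is the subject of theorem 5 and section 6.1.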
Given any continuous function f on the interval, the Bernstein approximating polynomials of f, B_N(f), are defined as a linear combination of the binomial polynomials defined above:

B_N(f)(x) ≡ Σ_{j=0}^N f(j/N) B_{j,N}(x).
Note that for the MLE,

a_{j,N} = H(j/N);

that is, the polynomial appearing in equation 5.1 is, for the MLE, exactly the Bernstein polynomial for the entropy function H(x). Everything we know about the bias of the MLE (and more) can be derived from a few simple general facts about Bernstein polynomials. For example, we find the following result in Devore and Lorentz (1993):

Theorem (Devore & Lorentz, 1993, theorem 10.4.2). If f is strictly concave on the interval, then

B_N(f)(x) < B_{N+1}(f)(x) < f(x), 0 < x < 1.

Clearly, H is strictly concave, and B_N(H)(x) and H are continuous, hence the bias is everywhere nonpositive; moreover, since

B_N(H)(0) = H(0) = 0 = H(1) = B_N(H)(1),

the bias is strictly negative unless p is degenerate. Of course, we already knew this, but the above result makes the following, less well-known proposition easy:

Proposition 3. For fixed m and nondegenerate p, the bias of the MLE is strictly decreasing in magnitude as a function of N.

(Couple the local expansion, equation 4.1, with Cover & Thomas, 1991, chapter 2, problem 34, for a purely information-theoretic proof.)

The second useful result is given in the same chapter:

Theorem (Devore & Lorentz, 1993, theorem 10.3.1). If f is bounded on the interval, differentiable in some neighborhood of x, and has second derivative f″(x) at x, then

lim_{N→∞} N (B_N(f)(x) − f(x)) = f″(x) x(1 − x)/2.
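Both facts are easy to see numerically for f = H. A Python sketch (ours; the point x = 0.3 and the degrees are arbitrary choices, and the binomial weights are computed via lgamma to avoid overflow at large N):

```python
import math

def bernstein(f, n, x):
    """Evaluate the degree-n Bernstein polynomial B_N(f)(x), 0 < x < 1."""
    total = 0.0
    for j in range(n + 1):
        logw = (math.lgamma(n + 1) - math.lgamma(j + 1)
                - math.lgamma(n - j + 1)
                + j * math.log(x) + (n - j) * math.log(1 - x))
        total += f(j / n) * math.exp(logw)
    return total

def H(x):
    return 0.0 if x in (0.0, 1.0) else -x * math.log(x)

x = 0.3
# Theorem 10.4.2: B_N(H)(x) increases monotonically toward H(x) from below:
vals = [bernstein(H, n, x) for n in (10, 11, 12)]
# Theorem 10.3.1: N (B_N(H)(x) - H(x)) -> H''(x) x(1-x)/2 = -(1-x)/2:
n = 4000
scaled = n * (bernstein(H, n, x) - H(x))   # should approach -0.35 at x = 0.3
```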
This theorem hints at the desired generalization of Miller's original result on the asymptotic behavior of the bias of the MLE:

Theorem 5. If m > 1 and N min_i p_i → ∞, then

lim [N/(m − 1)] B(Ĥ_MLE) = −1/2.
The proof is an elaboration of the proof of the above theorem 10.3.1 of Devore and Lorentz (1993); we leave it for appendix B. Note that the convergence stated in the theorem given by Devore and Lorentz (1993) is not uniform for f = H, because H(x) is not differentiable at x = 0; thus, when the condition of the theorem is not met (i.e., min_i p_i = O(1/N)), more intricate asymptotic bias formulas are necessary. As before, we can use the Poisson approximation for the bins with N p_i → c, 0 < c < ∞, and an o(1/N) approximation for those bins with N p_i → 0.

6.1 "Best Upper Bounds" (BUB) Estimator. Theorem 5 suggests one simple way to reduce the bias of the MLE: make the substitution
a_{j,N} = −(j/N) log(j/N)  →  a_{j,N} − H″(j/N) (j/N)(1 − j/N)/(2N)
                            = −(j/N) log(j/N) + (1 − j/N)/(2N).  (6.1)
This leads exactly to a version of the Miller-Madow correction and gives another angle on why this correction fails in the N ~ m regime: as discussed above, the singularity of H(x) at 0 is the impediment.

A more systematic approach toward reducing the bias would be to choose a_{j,N} such that the resulting polynomial is the best approximant of H(x) within the space of N-degree polynomials. This space corresponds exactly to the class of estimators that are, like Ĥ, linear in the histogram order statistics. We write this correspondence explicitly:

{a_{j,N}}_{0≤j≤N} ↔ Ĥ_{a,N},

where we define Ĥ_{a,N} ≡ Ĥ_a to be the estimator determined by a_{j,N}, according to

Ĥ_a = Σ_{j=0}^N a_{j,N} h_j.
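In this notation, the substitution of equation 6.1 is just a shift of the coefficient vector {a_{j,N}}, and that shift contributes a constant: since Σ_j h_j = m (every bin has some occupancy count, possibly zero) and Σ_j j h_j = N, the added term is exactly (m − 1)/(2N) — a Miller-Madow-type correction computed with the full m rather than the number of occupied bins. A short Python check of this identity (ours; the sample settings are arbitrary):

```python
import random
from collections import Counter

rng = random.Random(4)
n, m = 200, 12
counts = Counter(rng.randrange(m) for _ in range(n))
# Histogram order statistics h_j:
h = [sum(1 for i in range(m) if counts[i] == j) for j in range(n + 1)]

# The coefficient shift (1 - j/N)/(2N) from equation 6.1, summed against h_j,
# equals (1/2N)(sum_j h_j - (1/N) sum_j j h_j) = (m - 1)/(2N):
correction = sum((1 - j / n) / (2 * n) * h[j] for j in range(n + 1))
```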
Clearly, only a small subset of estimators has this linearity property; the h_j-linear class comprises an (N + 1)-dimensional subspace of the m^N-dimensional
space of all possible estimators. (Of course, m^N overstates the case quite a bit, as this number ignores various kinds of symmetries we would want to build into our estimator—see propositions 10 and 11—but it is still clear that the linear estimators do not exhaust the class of all reasonable estimators.) Nevertheless, this class will turn out to be quite useful.

What sense of "best approximation" is right for us? If we are interested in worst-case results, uniform approximation would seem to be a good choice: that is, we want to find the polynomial that minimizes

M(Ĥ_a) ≡ max_x |H(x) − Σ_j a_{j,N} B_{j,N}(x)|.

(Note that the best approximant in this case turns out to be unique—Devore & Lorentz, 1993—although we will not need this fact below.) A bound on M(Ĥ_a) obviously leads to a bound on the maximum bias over all p:

max_p |B(Ĥ_a)| ≤ m M(Ĥ_a).
However, the above inequality is not particularly tight. We know, by Markov's inequality, that p cannot have too many components greater than 1/m, and therefore the behavior of the approximant for x near x = 1 might be less important than the behavior near x = 0. Therefore, it makes sense to solve a weighted uniform approximation problem: minimize

M*(f, Ĥ_a) ≡ sup_x f(x) |H(x) − Σ_j a_{j,N} B_{j,N}(x)|,

where f is some positive function on the interval. The choice f(x) = m thus corresponds to a bound of the form

max_p |B(Ĥ_a)| ≤ c*(f) M*(f, Ĥ_a),
with the constant c*(f) equal to one here. Can we generalize this? According to the discussion above, we would like f to be larger near zero than near one, since p can have many small components but at most 1/x components greater than x. One obvious candidate for f, then, is f(x) = 1/x. It is easy to prove that

max_p |B(Ĥ_a)| ≤ M*(1/x, Ĥ_a),

that is, c*(1/x) = 1 (see appendix B). However, this f gives too much weight to small p_i; a better choice is

f(x) = m for x < 1/m;  f(x) = 1/x for x ≥ 1/m.
For this f, we have:

Proposition 4. max_p |B(Ĥ_a)| ≤ c*(f) M*(f, Ĥ_a), with c*(f) = 2.
See appendix B for the proof.

It can be shown, using the above bounds combined with a much deeper result from approximation theory (Devore & Lorentz, 1993; Ditzian & Totik, 1987), that there exists an a_{j,N} such that the maximum (over all p) bias is O(m/N²). This is clearly better than the O(m/N) rate offered by the three most popular Ĥ. We even have a fairly efficient algorithm to compute this estimator (a specialized descent algorithm developed by Remes; Watson, 1980). Unfortunately, the good approximation properties of this estimator are a result of a delicate balancing of large, oscillating coefficients a_{j,N}, and the variance of the corresponding estimator turns out to be very large. (This is predictable, in retrospect: we already know that no consistent estimator exists if m ~ N^{1+α}, α > 0.) Thus, to find a good estimator, we need to minimize bounds on bias and variance simultaneously; we would like to find Ĥ_a to minimize

max_p (B_p(Ĥ_a)² + V_p(Ĥ_a)),

where the notation for bias and variance should be obvious enough.
where the notation for bias and variance should be obvious enough. We have max.Bp .HO a /2 C Vp .HO a // · max Bp .HO a /2 C max Vp .HO a / p
p
p
· .c¤ . f /M¤ . f; HO a //2 C max Vp .HO a /; p
(6.2)
and at least two candidates for easily computable uniform bounds on the variance. The rst comes from McDiarmid: Proposition 5. Var.HO a / < N max .ajC1 ¡ aj /2 : 0·j 2 spikes each; m D 46 .
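For the MLE coefficients this bound is explicit: the largest coefficient gap is a_1 − a_0 = log(N)/N, so the bound is (log N)²/N, the McDiarmid-type rate quoted in section 4. The Python sketch below (ours; the settings are arbitrary illustrative choices) computes the bound and compares it with a Monte Carlo estimate of the actual variance, which sits far below it for a uniform p:

```python
import math
import random

def a_mle(j, n):
    return 0.0 if j == 0 else -(j / n) * math.log(j / n)

n, m = 1000, 100
# Proposition 5 bound, N max_j (a_{j+1} - a_j)^2, for the MLE coefficients:
bound = n * max((a_mle(j + 1, n) - a_mle(j, n)) ** 2 for j in range(n))

def h_hat(rng):
    counts = [0] * m
    for _ in range(n):
        counts[rng.randrange(m)] += 1
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

rng = random.Random(5)
samples = [h_hat(rng) for _ in range(200)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
```

The slack is expected: this worst-case bound cannot see that σ² = 0 for the uniform distribution, which is exactly the motivation for the sharper Steele-type bound above.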
while a monkey moved its hand according to a stationary, two-dimensional, filtered gaussian noise process. We show results for 11 cells, simultaneously recorded during a single experiment, in Figure 10; note, however, that we are estimating the entropy of single-cell spike trains, not the full multicell spike train. (For more details on the experimental procedures, see Paninski, Fellows, Hatsopoulos, & Donoghue, 1999, 2003.)

With real data, it is, of course, impossible to determine the true value of H, and so the detailed error calculations performed above are not possible here. Nevertheless, the behavior of these estimators seems to follow the trends seen in the simulated data. We see the consistent slow increase in our estimate as we move from Ĥ_MLE to Ĥ_MM to Ĥ_JK, and then a larger jump as we move to Ĥ_BUB. This is true even though the relevant timescales (roughly defined as the correlation time of the stimulus) in the two experiments differed by about three orders of magnitude. Similar results were obtained for both the real and simulated data using a variety of other discretization parameters (data not shown). Thus, as far as can be determined, our conclusions about the behavior of these four estimators, obtained using the analytical and numerical techniques described above, seem to be consistent with results obtained using physiological data.

In all, we have that the new estimator performs quite well in a uniform sense. This good performance is especially striking in, but not limited to, the case when N/m is O(1). We emphasize that even at the points where H = Ĥ = 0 (and therefore the three most common estimators perform well, in a trivial sense), the new estimator performs reasonably; by construction, Ĥ_BUB never exhibits blowups in the expected error like those seen with Ĥ_MLE, Ĥ_MM, and Ĥ_JK.
Given the fact that we can easily tune the bias of the new estimator at points where H ≈ 0, by adjusting λ₀, Ĥ_BUB appears to be a robust and useful new estimator. We offer Matlab code, available on-line at http://www.cns.nyu.edu/~liam, to compute the exact bias and Steele variance bound for any Ĥ_a, at any distribution p, if the reader is interested in more detailed investigation of the properties of this class of estimator.

8 Directions for Future Work

We have left a few important open problems. Below, we give three somewhat freely defined directions for future work, along with a few preliminary results.

8.1 Bayes. All of our results here have been from a minimax, or "worst-case," point of view. As discussed above, this approach is natural if we know very little about the underlying probability measure. However, in many cases, we do know something about this underlying p. We might know that the spike count is distributed according to something like a Poisson distribution or that the responses of a neuron to a given set of stimuli can be
fairly well approximated by a simple dynamical model, such as an IF cell. How do we incorporate this kind of information in our estimates? The fields of parametric and Bayesian statistics address this issue explicitly. We have not systematically explored the parametric point of view—this would entail building a serious parametric model for spike trains and then efficiently estimating the entropy at each point in the parameter space—although this approach has been shown to be powerful in a few select cases. The Bayesian approach would involve choosing a suitable a priori distribution on spike trains and then computing the corresponding MAP or conditional mean estimator; this approach is obviously difficult as well, and we can give only a preliminary result here.

Wolpert and Wolf (1995) give an explicit formula for the Bayes' estimate of H and related statistics in the case of a uniform prior on the simplex. We note an interesting phenomenon relevant to this estimator: as m increases, the distribution on H induced by the flat measure on the simplex becomes concentrated around a single point, and therefore the corresponding Bayes' problem becomes trivial as m → ∞, quite the opposite of the situation considered in the current work. (Nemenman, Shafee, & Bialek, 2002, independently obtained a few interesting results along these lines.) The result is interesting in its own right; its proof shares many of the features (concentration of measure and symmetry techniques) of our main results in the preceding sections.

More precisely, we consider a class of priors determined by the following "sort-difference" procedure: fix some probability measure P on the unit interval. Choose m − 1 independent samples distributed according to P, sort the samples in ascending order, and call the sorted samples {x_i}; the prior is given by the successive differences of the sorted samples. We have:

Theorem 6. P(|H − E(H)| > a) = O(e^{−C m^{1/3} a²}),

for any constant a > 0 and some constant C.
In fact, it is possible to prove much more: the uniform measure on the simplex (and more generally, any prior induced by the sort-difference procedure, under some conditions on the interval measure P) turns out to induce an asymptotically normal prior on H, with variance decreasing in m. We can calculate the asymptotic mean of this distribution by using linearity of expectation and symmetry techniques like those used in section 5. In the following, assume for simplicity that P is equivalent to Lebesgue measure (that is, P is absolutely continuous with respect to Lebesgue measure, and vice versa); this is a technical condition that can be relaxed at the price of slightly more complicated formulas. We have the following: Theorem 7.
P(H) is asymptotically normal, with

Var(H) ~ 1/m,

and asymptotic mean calculated as follows. Let q be the sorted, normalized density corresponding to a measure drawn according to the prior described above; define

F_p(v) ≡ ∫₀^v du ∫₀¹ dt p(t)² e^{−u p(t)},

and q′_∞ ≡ F_p^{−1}, where the inverse is taken in a distributional sense. Then ‖q − q′_∞‖₁ → 0 in probability and

E(H) → h(q′_∞) + log(m),

where h(·) denotes differential entropy.

F_p above is the cumulative distribution function of the p-mixture of exponentials with rate p(t)^{−1} (just as p′_{c,∞} in theorem 3 was defined as the inverse cumulative distribution function (c.d.f.) of a mixture of Poisson distributions). If P is uniform, for example, we have that ‖q − q′_∞‖₁ → 0 in probability, where

q′_∞(t) = −log(t),
and

H(q) → h(q′_∞) + log(m) = log m + ∫₀¹ dt log(t) log(−log(t))
in probability.

8.2 Adaptive Partitioning. As emphasized in section 1, we have restricted our attention here to partitions, "sieves," S and T, which do not depend on the data. This is obviously a strong condition. Can we obtain any results without this assumption? As a start, we have the following consistency result, stated in terms of the measure of the richness of a partition introduced by Vapnik and Chervonenkis (1971), Δ_N(A_F) (the shatter coefficient of the set of allowed partitions, defined in the appendix; m is, as in the preceding, the maximal number of elements per partition, F; Devroye et al., 1996):

Theorem 8. If log Δ_N(A_F) = o(N/(log m)²) and F generates σ_{x,y} a.s., then Î is consistent in probability; Î is consistent a.s. under the slightly stronger condition

Σ_N Δ_N(A_F) e^{−N/(log m)²} < ∞.
Note the slower allowed rate of growth of m. In addition, the conditions of this theorem are typically harder to check than those of theorem 1. For example, it is easy to think of reasonable partitioning schemes that do not generate σ_{x,y} a.s.: if the support of P is some measurable proper subset of X × Y, this is an unreasonable condition. We can avoid this problem by rephrasing the condition in terms of σ_{x,y} restricted to the support of P (this, in turn, requires placing some kind of topology on X × Y, which should be natural enough in most problems).

What are the benefits? Intuitively, we should gain in efficiency: we are putting the partitions where they do the most good (Darbellay & Vajda, 1999). We also gain in applicability, since in practice, all partition schemes are data driven to some degree. The most important application of this result, however, is to the following question in learning theory: How do we choose the most informative partition? For example, given a spike train and some behaviorally relevant signal, what is the most efficient way to encode the information in the spike train about the stimulus? More concretely, all things being equal, does encoding temporal information, say, preserve more information about a given visual stimulus than encoding spike rate information? Conversely, does encoding the contrast of a scene, for example, preserve more information about a given neuron's activity than encoding color? Given m code words, how much information can we capture about what this neuron is telling us about the scene? (See, e.g., Victor, 2000b, for recent work along these lines.)
The formal analog to these kinds of questions is as follows (see Tishby, Pereira, & Bialek, 1999, and Gedeon, Parker, & Dimitrov, 2003, for slightly more general formulations). Let F and G be classes of "allowed" functions on the spaces X and Y. For example, F and G could be classes of partitioning operators (corresponding to the discrete setup used here) or spaces of linear projections (corresponding to the information-maximization approach to independent component analysis). Then, given N i.i.d. data pairs in X × Y, we are trying to choose f_N ∈ F and g_N ∈ G in such a way that I(f_N(x); g_N(y)) is maximized. This is where results like theorem 8 are useful; they allow us to place distribution-free bounds on

P( sup_{f∈F, g∈G} |I(f(x); g(y)) − Î(f_N(x); g_N(y))| > ε ),  (8.1)
that is, the probability that the set of code words that looks optimal given N samples is actually ε-close to optimal. Other (distribution-dependent) approaches to the asymptotics of quantities like equation 8.1 come from the theory of empirical processes (see, e.g., van der Vaart & Wellner, 1996). More work in this direction will be necessary to rigorously answer the bias and variance problems associated with these "optimal coding" questions.

8.3 Smoothness and Other Functionals. We end with a slightly more abstract question. In the context of the sieve method analyzed here, are entropy and mutual information any harder to estimate than any other given functional of the probability distribution? Clearly, there is nothing special about H (and, by extension, I) in the case when m and p are fixed; here, classical methods lead to the usual N^{−1/2} rates of convergence, with a prefactor that depends on only m and the differential properties of the functional H at p; the entire basic theory goes through if H is replaced by some other arbitrary (smooth) functional. There are several reasons to suspect, however, that not all functionals are the same when m is allowed to vary with N. First, and most obvious, Ĥ is consistent when m = o(N) but not when m ~ N; simple examples show that this is not true for all functionals of p (e.g., many linear functionals on m can be estimated given fewer than N samples, and this can be extended to weakly nonlinear functionals as well). Second, classical results from approximation theory indicate that smoothness plays an essential role in approximability; it is well known, for example, that the best rate in the bias polynomial approximation problem described in section 6 is essentially determined by a modulus of continuity of the function under question (Ditzian & Totik, 1987), and moduli of continuity pop up in apparently very different functional estimation contexts as well (Donoho & Liu, 1991; Jongbloed,
2000). Thus, it is reasonable to expect that the smoothness of H, especially as measured at the singular point near 0, should have a lot to do with the difficulty of the information estimation problem. Finally, basic results in learning theory (Devroye et al., 1996; van der Vaart & Wellner, 1996; Cucker & Smale, 2002) emphasize the strong connections between smoothness and various notions of learnability. For example, an application of theorem II.2.3 of Cucker and Smale (2002) gives the exact rate of decay of our L2 objective function, equation 6.3, in terms of the spectral properties of the discrete differential operator D, expressed in the Bernstein polynomial basis; however, it is unclear at present whether this result can be extended to our final goal of a useful asymptotic theory for the upper L∞ bound, equation 6.2.

A few of the questions we would like to answer more precisely are as follows. First, we would like to have the precise minimax rate of the information estimation problem; thus far, we have only been able to bound this rate between m ~ o(N) (see theorem 1) and m ~ N^{1+α}, α > 0 (see theorem 4). Second, how close does Ĥ_BUB come to this minimax rate? Indeed, does this estimator require fewer than m samples to learn the entropy on m bins, as Figure 3 seems to indicate? Finally, how can all of this be generalized for other statistical functionals? Is there something like a single modulus of continuity that controls the difficulty of some large class of these functional estimation problems?

9 Conclusions

Several practical conclusions follow from the results presented here; we have good news and bad news. First, the bad news.

• Past work in which N/m was of order 1 or smaller was most likely contaminated by bias, even if the jackknife or Miller correction was used. This is particularly relevant for studies in which multiple binning schemes were compared to investigate, for example, the role of temporal information in the neural code.
We emphasize for future studies that m and N must be provided for readers to have confidence in the results of entropy estimation.

• Error bars based on sample variance (or resampling techniques) give very bad confidence intervals if m and N are large. That is, confidence intervals based on the usual techniques do not contain the true value of H or I with high probability. Previous work in the literature often displays error bars that are probably misleadingly small. Confidence intervals should be of size
$$\sim B(H;\, N/m) + N^{-1/2}\log(\min(m, N)),$$
where the bias term B can be calculated using techniques described in section 5.
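To make the first bad-news point concrete, here is a small simulation (our illustration, not code from the paper; the estimator implementations follow the standard definitions) of the three most common estimators in the hard regime N/m = 1. All three stay visibly below the true entropy log m:

```python
import math, random

def h_mle(counts, N):
    # plug-in (MLE) entropy estimate from a histogram
    return -sum((n / N) * math.log(n / N) for n in counts if n > 0)

def h_miller(counts, N):
    # Miller-Madow correction: add (observed support size - 1) / 2N
    m_obs = sum(1 for n in counts if n > 0)
    return h_mle(counts, N) + (m_obs - 1) / (2 * N)

def h_jackknife(counts, N):
    # leave-one-out jackknife of the plug-in estimate
    loo = sum(n * h_mle([c - (j == i) for j, c in enumerate(counts)], N - 1)
              for i, n in enumerate(counts) if n > 0)
    return N * h_mle(counts, N) - (N - 1) / N * loo

random.seed(0)
m = N = 100                        # the hard regime N/m ~ 1
reps = 200
sums = [0.0, 0.0, 0.0]
for _ in range(reps):
    counts = [0] * m
    for _ in range(N):
        counts[random.randrange(m)] += 1   # uniform p on m bins
    for k, h in enumerate((h_mle, h_miller, h_jackknife)):
        sums[k] += h(counts, N)
avg_mle, avg_mm, avg_jk = (s / reps for s in sums)
print(round(math.log(m), 3), round(avg_mle, 3), round(avg_mm, 3), round(avg_jk, 3))
```

On a run like this, the residual downward bias of all three estimators is a substantial fraction of log m, which is exactly the regime the warning above is about.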
Estimation of Entropy and Mutual Information
1235
Now the good news:

• This work has given us a much better understanding of exactly how difficult the information estimation problem is and what we can hope to accomplish using nonparametric techniques, given physiologically plausible sample sizes.

• We have obtained rigorous (and surprisingly general) results on bias, variance, and convergence of the most commonly employed estimators, including the best possible generalization of Miller's well-known 1/N bias rate result. Our analysis clarifies the relative importance of minimizing bias or variability depending on N and m, according to the bias-variance balance function introduced at the end of section 4.

• We have introduced a promising new estimator, one that comes equipped with built-in, rigorous confidence intervals. The techniques used to derive this estimator also lead to rigorous confidence intervals for a large class of other estimators (including the three most common $\hat H$).

Appendix A: Additional Results

A.1 Support. One would like to build an estimator that takes values strictly in some nice set around the true H, say, an interval containing H whose length shrinks as the number of samples, N, increases. This would give us strong "error bars" on our estimate of H; we would be absolutely certain that our estimate is close to the true H. The MLE for entropy has support on $[0, \log(\min(N, m))]$. A simple variational argument shows that any estimator, T, for H on m bins is inadmissible if T takes values outside $[0, \log m]$. Similarly, any estimator, T, for I on $m_S \times m_T$ bins is inadmissible if T takes values outside $[0, \log(\min(m_S, m_T))]$. It turns out that this is the best possible, in a sense: there do not exist any nontrivial estimators for entropy that are strictly greater or less than the unknown H. In fact, the following is true:

Proposition 7. (a) There is no nontrivial estimator T with $T(\omega) \le H$ for all samples ω and all underlying distributions p; (b) similarly, there is no nontrivial estimator with $T(\omega) \ge H$.

Proof. For a: if T is nontrivial, there is some sample $\omega_0$, occurring with positive probability under some p, such that $T(\omega_0) > 0$. By choosing H such that $0 < H < T(\omega_0)$, we force a contradiction. The proof for b is similar.
A similar result obviously holds for mutual information.

A.2 Bias. It turns out that no unbiased estimators for H or I exist in the discrete setting. This fact seems to be known among information theorists, but we have not seen it stated in the literature. The proof is quite short, so we provide it here.

Proposition 8.
No unbiased estimator for entropy or mutual information exists.
Proof. For any estimator T of the entropy of a multinomial distribution, we can write down the mean of T:
$$E(T) = \sum_{\omega \in \{1,\ldots,m\}^N} P(\omega)\, T(\omega),$$
where $\{1,\ldots,m\}^N$ is the sample space (i.e., each ω, as above, corresponds to an m-ary sequence of length N). Since $\omega_j$ is drawn i.i.d. from the discrete distribution p, P(ω) is given by
$$P(\omega) = \prod_{j=1}^{N} p_{\omega_j},$$
and so the mean of T is a polynomial function of the multinomial probabilities $p_i$. The entropy, on the other hand, is obviously a nonpolynomial function of the $p_i$. Hence, no unbiased estimator exists. The proof for I is identical.

The next (easy) proposition provides some more detail. The proof is similar to that of proposition 7 and is therefore omitted.

Proposition 9. (a) If T is a nonnegatively biased estimator for the entropy of a multinomial distribution on m bins, with $T(\omega) \in [0, \log m]$ for all $\omega \in \{1,\ldots,m\}^N$, then $T(\omega) = \log m$ for all $\omega \in \{1,\ldots,m\}^N$. (b) If T is a nonpositively biased estimator for the mutual information of a multinomial distribution on $m_S, m_T$ bins, with $T(\omega) \in [0, \log(\min(m_S, m_T))]$, then $T(\omega) = 0$ for all $\omega \in \Omega$. (c) If T is a nonnegatively biased estimator for the mutual information of a multinomial distribution on $m_S, m_T$ bins, with $T(\omega) \in [0, \log(\min(m_S, m_T))]$,
then $T(\omega) = \log(\min(m_S, m_T))$ for all $\omega \in \Omega$.

A.3 Minimax Properties of σ-Symmetric Estimators. Let the error metric $D(T, \theta)$ be nice: convex in T, jointly continuous in T and θ, positive away from $T = \theta$, and bounded below. (The metrics given by $D(T, \theta) \equiv (T - \theta)^p$, $1 \le p < \infty$, are good examples.) The following result partially justifies our focus throughout this article on estimators that are permutation symmetric (denoted σ-symmetric in the following).

Proposition 10. If the error metric D is nice, then a σ-symmetric minimax estimator exists.
Proof. Existence of a minimax estimator (see also Schervish, 1995): whenever $\max_\theta E_\theta(D)$ is a continuous function of the estimator T, a minimax estimator exists, since T can be taken to vary over a compact space (namely, $[0, \log m]^{m^N}$). But $\max_\theta E_\theta(D)$ is continuous in T whenever E(D) is jointly continuous in T and θ. This is because E(D) is uniformly continuous in θ and T, since, again, θ and T vary over compact spaces. E(D) is jointly continuous in θ and T by the continuity of D and the fact that E(D) is defined by a finite sum.

Existence of a symmetric minimax estimator: this is actually a special case of the Hunt-Stein theorem (Schervish, 1995). Any asymmetric minimax estimator, T, in the current setup achieves its maximum, $\max_\theta E_\theta(D)$, by the arguments above. However, the corresponding symmetrized estimator, $T_\sigma(\omega) = (1/|\sigma|) \sum_\sigma T(\sigma(\omega))$, has expected error which is less than or equal to $\max_\theta E_\theta(D)$, as can be seen after a rearrangement and an application of Jensen's inequality. Therefore, $T_\sigma$ is minimax (and obviously symmetric).

A.4 Insufficiency of Symmetric Estimators. The next result is perhaps surprising.

Proposition 11. The MLE is not sufficient. In fact, the empirical histograms are minimal sufficient; thus, no σ-symmetric estimator is sufficient.

Proof. A simple example suffices to prove the first statement. Choose as a prior on p:
$$P(p(1) = \epsilon,\; p(2) = 1 - \epsilon) = 0.5,$$
$$P(p(1) = 0,\; p(2) = 1) = 0.5,$$
for some ε > 0. For this P, $H(p) \to \hat H \to \{n_i\}$ does not form a Markov chain; the symmetry of $\hat H$ discards information about the true underlying H (namely, observation of a 1 tells us something very different than does observation of a 2). This property is clearly shared by any symmetric estimator. The fact that the empirical histograms are minimal sufficient follows, for example, from Bahadur's theorem (Schervish, 1995), and the fact that the empirical histograms are complete sufficient statistics.

In other words, any σ-symmetric estimator necessarily discards information about H, even though H itself is σ-symmetric. This indicates the importance of priors; the nonparametric minimax approach taken here (focusing strictly on symmetric estimators for a large part of the work, as justified by proposition 10) should be considered only a first step. To be more concrete, in many applications, it is natural to guess that the underlying measure p has some continuity properties; therefore, estimators that take advantage of some underlying notion of continuity (e.g., by locally smoothing the observed distributions before estimating their entropy) should be expected to perform better (on average, according to this mostly continuous prior) than the best σ-symmetric estimator, which necessarily discards all topological structure in the underlying space X. (See, e.g., Victor, 2002, for recent work along these lines.)

Appendix B: Proofs

We collect some deferred proofs here. To conserve space, we omit some of the easily verified details. The theorems are restated for convenience.

B.1 Consistency.

Statement (Theorem 1). If $m_{S,N}\, m_{T,N} = o(N)$ and $\sigma_{S_N, T_N}$ generates $\sigma_{X,Y}$, then $\hat I \to I$ a.s. as $N \to \infty$.

Theorem 1 is a consequence of the following lemma:

Statement (Lemma 1). If $m = o(N)$, then $\hat H \to H_N$ a.s.

Proof. First, by the exponential bound of Antos and Kontoyiannis, expression 3.4, and the Borel-Cantelli lemma, $\hat H_N \to H_N$ a.s. if the (nonrandom) function $E(\hat H_N) \uparrow H_N$.
This convergence in expectation is a consequence of the local expansion for the bias of the MLE, expression 4.2, and proposition 1 of section 4.

Proof of Theorem 1. First, some terminology: by $\hat I \to I$ a.s., we mean that if $I = \infty$, $p((\hat I_N < c)\ \text{i.o.}) = 0$ for all $c < \infty$, and if $I < \infty$, $p((|\hat I_N - I| > \epsilon)\ \text{i.o.}) = 0$
for all ε > 0. ("I.o." stands for "infinitely often.") In addition, we call the given σ-algebra on $X \times Y$ (the family of sets on which the probability measure P(X, Y) is defined) $\sigma_{X,Y}$, and the sub-σ-algebra generated by S and T $\sigma_{S,T}$. Now, the proof: it follows from Shannon's formula for mutual information in the discrete case that
$$|\hat I(S_N, T_N) - I(S_N, T_N)| \le |\hat H(S) - H(S)| + |\hat H(T) - H(T)| + |\hat H(S,T) - H(S,T)|.$$
Thus, the lemma gives $\hat I_N \to I(S_N, T_N)$ a.s. whenever $m_S m_T / N \to 0$. It remains only to show that the (nonrandom) function $I(S_N, T_N) \to I$; this follows from results in standard references, such as Billingsley (1965) and Kolmogorov (1993), if either
$$\sigma_{S_1,T_1} \subseteq \sigma_{S_2,T_2} \subseteq \cdots \subseteq \sigma_{S_N,T_N} \subseteq \cdots \quad \text{and} \quad \sigma_{X,Y} = \bigcup_N \sigma_{S_N,T_N},$$
or
$$\sup_{A \in \sigma_{S_N,T_N},\, B \in \sigma_{X,Y}} \rho(A, B) \to 0,$$
where $\rho(A, B) \equiv P(AB^c \cup A^c B)$. If either of these conditions holds, we say that $\sigma_{S_N,T_N}$ generates $\sigma_{X,Y}$.

B.2 Central Limit Theorem.

Statement (Theorem 2). Let
$$\sigma_N^2 \equiv \mathrm{Var}(-\log p_{T_N}) \equiv \sum_{i=1}^m p_{T_N,i}\, (-\log p_{T_N,i} - H_N)^2.$$
If $m_N \equiv m = o(N^{1/2})$, and
$$\liminf_{N \to \infty} N^{1-\alpha} \sigma_N^2 > 0$$
for some $\alpha > 0$, then $\bigl(\frac{N}{\sigma_N^2}\bigr)^{1/2}(\hat H - H_N)$ is asymptotically standard normal.
Proof. The basic tool, again, is the local expansion of $H_{MLE}$, expression 4.1. We must first show that the remainder term becomes negligible in probability on a $\sqrt{N}$ scale, that is,
$$\sqrt{N}\, D_{KL}(p_N; p) = o_p(1).$$
This follows from the formula for $E_p(D_{KL}(p_N; p))$, then Markov's inequality and the nonnegativity of $D_{KL}$. So it remains only to show that $dH(p; p_N - p)$ is asymptotically normal. Here we apply a classical theorem on the asymptotic normality of double arrays of infinitesimal random variables:

Lemma. Let $\{x_{j,N}\}$, $1 \le N \le \infty$, $1 \le j \le N$, be a double array of rowwise i.i.d. random variables with zero mean and variance $\sigma_N^2$, with distribution $p(x, N)$ and satisfying $\sigma_N^2 = 1/N$ for all N. Then $\sum_{j=1}^N x_{j,N}$ is asymptotically normal, with zero mean and unit variance, iff $\{x_{j,N}\}$ satisfy the Lindeberg (vanishing tail) condition: for all $\epsilon > 0$,
$$\sum_{j=1}^N \int_{|x| > \epsilon} x^2\, dp(x, N) = o(1). \qquad (B.1)$$

The conditions of the theorem imply the Lindeberg condition, with $\{x_{j,N}\}$ replaced by $\frac{1}{\sqrt{N\sigma^2}}(dH(p; \delta_j - p) - H)$. To see this, note that the left-hand side of equation B.1 becomes, after the proper substitutions,
$$\frac{1}{\sigma^2} \sum_{j:\, |(-\log p_j) - H| > \epsilon(N\sigma^2)^{1/2}} p_j \log^2 p_j,$$
or
$$\frac{1}{\sigma^2} \Biggl( \sum_{j:\, p_j > e^{\epsilon(N\sigma^2)^{1/2} + H}} p_j \log^2 p_j \;+\; \sum_{j:\, p_j < e^{H - \epsilon(N\sigma^2)^{1/2}}} p_j \log^2 p_j \Biggr).$$
The number of terms in the sum on the left is less than or equal to
$$e^{-\epsilon(N\sigma^2)^{1/2} - H}.$$
Since the summands are bounded uniformly, this sum is o(1). On the other hand, the sum on the right has at most m terms, so under the conditions of the theorem, this term must go to zero as well, and the proof is complete.

We have proven the above a.s. and $\sqrt{N}$ consistency theorems for $\hat H_{MLE}$ only; the extensions to $\hat H_{MM}$ and $\hat H_{JK}$ are easy and are therefore omitted.
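The content of theorem 2 can be checked numerically (this sketch is ours, not the paper's): in the $m = o(N^{1/2})$ regime, with a nonuniform p so that $\sigma_N^2 > 0$, the standardized error $(N/\sigma_N^2)^{1/2}(\hat H_{MLE} - H)$ should look approximately standard normal.

```python
import math, random

def mle_entropy(counts, N):
    # plug-in (MLE) entropy estimate from a histogram
    return -sum((n / N) * math.log(n / N) for n in counts if n > 0)

random.seed(0)
m, N, reps = 10, 10000, 200            # m well below N^{1/2}
p = [i / 55.0 for i in range(1, 11)]   # nonuniform, so sigma_N^2 > 0
H = -sum(pi * math.log(pi) for pi in p)
sigma2 = sum(pi * (-math.log(pi) - H) ** 2 for pi in p)

zs = []
for _ in range(reps):
    counts = [0] * m
    for s in random.choices(range(m), weights=p, k=N):
        counts[s] += 1
    zs.append(math.sqrt(N / sigma2) * (mle_entropy(counts, N) - H))

mean = sum(zs) / reps
std = math.sqrt(sum((z - mean) ** 2 for z in zs) / (reps - 1))
print(round(mean, 2), round(std, 2))   # roughly 0 and 1
```

The small negative offset of the sample mean reflects the $O(m/N)$ bias, which is negligible on the $\sqrt{N}$ scale under the conditions of the theorem.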
B.3 Variance Bounds à la Steele.

For the σ-symmetric statistic $H_a(\{x_j\}) = \sum_j a_{j,N} h_{j,N}$, Steele's inequality reads:
$$\mathrm{Var}(H_a) \le \frac{N}{2}\, E\bigl( (H_a(\{x_j\}) - H_a(x_1, \ldots, x_{N-1}, x_N'))^2 \bigr),$$
where $x_N'$ is a sample drawn independently from the same distribution as $x_j$. The linear form of $H_a$ allows us to exactly compute the right-hand side of the above inequality. We condition on a given histogram, $\{n_i\}_{i=1,\ldots,m}$:
$$E\bigl( (H_a(\{x_j\}) - H_a(x_1, \ldots, x_{N-1}, x_N'))^2 \bigr) = \sum_{\{n_i\}} p(\{n_i\})\, E\bigl( (H_a(\{x_j\}) - H_a(x_1, \ldots, x_{N-1}, x_N'))^2 \mid \{n_i\} \bigr).$$
Now we rewrite the inner expectation on the right-hand side:
$$E\bigl( (H_a(\{x_j\}) - H_a(x_1, \ldots, x_{N-1}, x_N'))^2 \mid \{n_i\} \bigr) = E\bigl( (D_- + D_+)^2 \mid \{n_i\} \bigr),$$
where
$$D_- \equiv a_{n_{x_N}-1,N} - a_{n_{x_N},N}$$
is the change in $\sum_j a_{j,N} h_{j,N}$ that occurs when a random sample is removed from the histogram $\{n_i\}$, according to the probability distribution $\{n_i\}/N$, and $D_+$ is the change in $\sum_j a_{j,N} h_{j,N}$ that occurs when a sample is randomly (and conditionally independently, given $\{n_i\}$) added back to the $x_N$-less histogram $\{n_i\}$, according to the true underlying measure $p_i$. The necessary expectations are as follows. For $1 \le j \le N$, define $D_j \equiv a_{j-1} - a_j$. Then
$$E(D_-^2 \mid \{n_i\}) = \sum_i \frac{n_i}{N} D_{n_i}^2,$$
$$E(D_+^2 \mid \{n_i\}) = \sum_i p_i \Bigl( \frac{n_i}{N} D_{n_i}^2 + \Bigl(1 - \frac{n_i}{N}\Bigr) D_{n_i+1}^2 \Bigr),$$
and
$$E(D_+ D_- \mid \{n_i\}) = E(D_+ \mid \{n_i\})\, E(D_- \mid \{n_i\}) = -\Bigl( \sum_i \frac{n_i}{N} D_{n_i} \Bigr)\Bigl( \sum_i p_i \Bigl( \frac{n_i}{N} D_{n_i} + \Bigl(1 - \frac{n_i}{N}\Bigr) D_{n_i+1} \Bigr) \Bigr).$$
Taking expectations with respect to the multinomial measure $p(\{n_i\})$, we have
$$E(D_-^2) = \sum_{i,j} \frac{j}{N} D_j^2\, B_j(p_i),$$
$$E(D_+^2) = \sum_{i,j} \Bigl( \frac{j}{N} D_j^2 + \Bigl(1 - \frac{j}{N}\Bigr) D_{j+1}^2 \Bigr) p_i\, B_j(p_i), \qquad (B.2)$$
and
$$E(D_+ D_-) = -\sum_{i,i',j,k} \frac{j}{N} D_j \Bigl( \frac{k}{N} D_k + \Bigl(1 - \frac{k}{N}\Bigr) D_{k+1} \Bigr) p_{i'}\, B_{j,k}(p_i, p_{i'}),$$
where $B_j$ and $B_{j,k}$ denote the binomial and trinomial polynomials, respectively:
$$B_j(t) \equiv \binom{N}{j} t^j (1-t)^{N-j}; \qquad B_{j,k}(s,t) \equiv \binom{N}{j,k} s^j t^k (1-s-t)^{N-j-k}.$$
The obtained bound,
$$\mathrm{Var}(H_a) \le \frac{N}{2}\bigl( E(D_-^2) + 2E(D_- D_+) + E(D_+^2) \bigr),$$
may be computed in $O(N^2)$ time. For a more easily computable ($O(N)$) bound, note that $E(D_-^2) = E(D_+^2)$, and apply Cauchy-Schwarz to obtain
$$\mathrm{Var}(H_a) \le 2N\, E(D_-^2).$$
Under the conditions of theorem 2, this simpler bound is asymptotically tight to within a factor of two. Proposition 6 is proven with devices identical to those used in obtaining the bias bounds of proposition 4 (note the similarity of equations 5.1 and B.2).

B.4 Convergence of Sorted Empirical Measures.

Statement (Theorem 3). Let P be absolutely continuous with respect to Lebesgue measure on the interval [0,1], and let $p = dP/dm$ be the corresponding density. Let $S_N$ be the m-equipartition of [0,1], $p'$ denote the sorted empirical measure, and
$N/m \to c$, $0 < c < \infty$. Then:

a. $p' \to p'_{c,\infty}$ in $L_1$, a.s., with $\|p'_{c,\infty} - p\|_1 > 0$. Here $p'_{c,\infty}$ is the monotonically decreasing step density with gaps between steps j and j + 1 given by
$$\int_0^1 dt\, e^{-cp(t)}\, \frac{(cp(t))^j}{j!}.$$

b. Assume p is bounded. Then $\hat H - H_N \to B_{c,\hat H}(p)$ a.s., where $B_{c,\hat H}(p)$ is a deterministic function, nonconstant in p. For $\hat H = \hat H_{MLE}$,
$$B_{c,\hat H}(p) = h(p') - h(p) < 0,$$
where $h(\cdot)$ denotes differential entropy.

Proof. We will give only an outline. By convergence in $L_1$, a.s., we mean that $\|p - p_N\|_1 \to 0$ a.s. By McDiarmid's inequality, for any distribution q, $\|q - p_N\|_1 \to E(\|q - p_N\|_1)$ a.s. Therefore, convergence to some $p'$ in probability in $L_1$ implies almost sure convergence in $L_1$. In addition, McDiarmid's inequality hints at the limiting form of the ordered histograms. We have
$$\mathrm{sort}\Bigl(\frac{n_i}{N}\Bigr) \to E\Bigl(\mathrm{sort}\Bigl(\frac{n_i}{N}\Bigr)\Bigr)\ \text{a.s.}, \quad \forall\, 1 \le i \le m.$$
Of course, this is not quite satisfactory, since both $\mathrm{sort}(n_i/N)$ and $E(\mathrm{sort}(n_i/N))$ go to zero for most i. Thus, we need only to prove convergence of $\mathrm{sort}(n_i/N)$ to $p'$ in $L_1$ in probability. This follows by an examination of the histogram order statistics $h_{n,j}$. Recall that these $h_{n,j}$ completely determine $p_n$. In addition, the $h_{n,j}$ satisfy a law of large numbers:
$$\frac{1}{m} h_{n,j} \to \frac{1}{m} E(h_{n,j}) \quad \forall j \le k,$$
for any finite k. (This can be proven, for example, using McDiarmid's inequality.) Let us rewrite the above term more explicitly:
$$\frac{1}{m} E(h_{n,j}) = \frac{1}{m} \sum_{i=1}^m E(\mathbf{1}(n_i = j)) = \frac{1}{m} \sum_{i=1}^m \binom{N}{j} p_i^j (1 - p_i)^{N-j}.$$
Now we can rewrite this sum as an integral:
$$\frac{1}{m} \sum_{i=1}^m \binom{N}{j} p_i^j (1 - p_i)^{N-j} = \binom{N}{j} \int_0^1 dt\, p_n(t)^j (1 - p_n(t))^{N-j}, \qquad (B.3)$$
where $p_n(t)$ is the step-function approximation to the density p on the m-equipartition. Since $\int_0^1 dt\, \mathbf{1}(p' = 0) > 0$ (a positive fraction of bins remains empty in the $N/m \to c$ limit), $\|p - p'\|_1$ is obviously bounded away from zero. Regarding the final claim of the theorem, the convergence of $\hat H$ to $E(\hat H)$ follows by previous considerations. We need prove only that $h(p') < h(p)$. After some rearrangement, this is a consequence of Jensen's inequality.
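For the uniform density $p \equiv 1$ on [0,1], the gaps in the limiting step density $p'_{c,\infty}$ reduce to the Poisson weights $e^{-c} c^j / j!$; that is, a fraction $e^{-c} c^j / j!$ of the m bins holds exactly j samples. A quick simulation (ours, for illustration) shows the agreement:

```python
import math, random

random.seed(1)
m = 2000
c = 2.0
N = int(c * m)                         # the N/m -> c regime of theorem 3
counts = [0] * m
for _ in range(N):
    counts[random.randrange(m)] += 1   # N samples from the uniform density on [0,1]

# fraction of bins holding exactly j samples vs. the Poisson weight e^{-c} c^j / j!
for j in range(5):
    empirical = counts.count(j) / m
    predicted = math.exp(-c) * c ** j / math.factorial(j)
    print(j, round(empirical, 3), round(predicted, 3))
```

In particular, a fraction near $e^{-c}$ of the bins is empty, which is the source of the $\|p' - p\|_1 > 0$ claim above.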
B.5 Sum Inequalities.

For $f(p) = 1/p$, we have the following chain of implications:
$$\sup_p \Bigl| \frac{g(p)}{p} \Bigr| = c \;\Rightarrow\; |g(p)| \le cp \;\Rightarrow\; \sum_{i=1}^m g(i) \le c \sum p = c.$$
For the f described in section 6.1,
$$f(p) = \begin{cases} m & p < 1/m, \\ 1/p & p \ge 1/m, \end{cases}$$
we have
$$\sup_p |f(p)\, g(p)| = c \;\Rightarrow\; \sum_i g(i) \le 2c,$$
since
$$\sum_i g(i) = \sum_{i:\, p_i m \ge 1} g(i) + \sum_{i:\, p_i m < 1} g(i),$$
and each sum on the right is bounded by c (the first using $|g| \le cp$ on $p \ge 1/m$, the second using $|g| \le c/m$ on the at most m bins with $p < 1/m$).

B.6 Asymptotic Bias Rate.

Statement (Theorem 5). If $m > 1$ and $N \min_i p_i \to \infty$, then
$$\lim_N \frac{N}{m-1}\, B(\hat H_{MLE}) = -\frac{1}{2}.$$
Proof. As stated above, the proof is an elaboration of the proof of theorem 10.3.1 of Devore and Lorentz (1993). We use a second-order expansion of the entropy function:
$$H(t) = H(x) + (t - x)H'(x) + (t - x)^2 \Bigl( \frac{1}{2} H''(x) + h_x(t - x) \Bigr),$$
where h is a remainder term. Plugging in, we have
$$-t \log t = -x \log x + (t - x)(-1 - \log x) + (t - x)^2 \Bigl( \frac{1}{2}\frac{-1}{x} + h_x(t - x) \Bigr).$$
After some algebra,
$$(t - x)^2 h_x(t - x) = t - x + t \log \frac{x}{t} + \frac{1}{2x}(t - x)^2.$$
After some more algebra (mostly recognizing the mean and variance formulae for the binomial distribution), we see that
$$N\bigl( B_N(H)(x) - H(x) \bigr) = H''(x)\, \frac{x(1 - x)}{2} + R_N(x),$$
where
$$R_N(x) \equiv \frac{1 - x}{2} - N \sum_{j=0}^{N} B_{j,N}(x)\, \frac{j}{N} \log \frac{j}{Nx}.$$
The proof of theorem 10.3.1 in Devore and Lorentz (1993) proceeds by showing that $R_N(x) = o(1)$ for any fixed $x \in (0, 1)$. We need to show that $R_N(x) = o(1)$ uniformly for $x \in [x_N, 1 - x_N]$, where $x_N$ is any sequence such that $Nx_N \to \infty$. This will prove the theorem, because the bias is the sum of $B_N(H)(x) - H(x)$ at m points on this interval. The uniform estimate essentially follows from the delta method (somewhat like Miller and Madow's original proof, except in one dimension instead of m): use the fact that the sum in the definition of $R_N(x)$ converges to the expectation (with appropriate cutoffs) of the function $t \log \frac{t}{x}$ with respect to the gaussian distribution with mean x and variance $\frac{x(1-x)}{N}$. We spend the rest of the proof justifying the above statement.

The sum in the definition of $R_N(x)$ is exactly the expectation of the function $t \log \frac{t}{x}$ with respect to the binomial(N, x) distribution (in a slight abuse of the usual notation, we mean a binomial random variable divided by N, that is, rescaled to have support on [0, 1]). The result follows if a second-order expansion for $t \log \frac{t}{x}$ at x converges at an o(1/N) rate in $Bin_{N,x}$-expectation, that is, if
$$E_{Bin_{N,x}} \Bigl[ N \Bigl( t \log \frac{t}{x} - (t - x) - \frac{1}{2x}(t - x)^2 \Bigr) \Bigr] \equiv E_{Bin_{N,x}}\, g_{N,x}(t) = o(1),$$
for $x \in [x_N, 1 - x_N]$. Assume, wlog, that $x_N \to 0$; in addition, we will focus on the hardest case and assume $x_N = o(N^{-1/2})$. We break the above expectation into four parts:
$$E_{Bin_{N,x}}\, g_{N,x}(t) = \int_0^{a x_N} g\, dBin_{N,x} + \int_{a x_N}^{x_N} g\, dBin_{N,x} + \int_{x_N}^{b_N} g\, dBin_{N,x} + \int_{b_N}^{1} g\, dBin_{N,x},$$
where $0 < a < 1$ is a constant and $b_N$ is a sequence we will specify below. We use Taylor's theorem to bound the integrands near $x_N$ (this controls the middle two integrals) and use exponential inequalities to bound the binomial measures far from $x_N$ (this controls the first and the last integrals). The inequalities are due to Chernoff (Devroye et al., 1996): let B be $Bin_{N,x}$, and let a, $b_N$, and $x_N$ be as above. Then
$$P(B < a x_N) < e^{aNx_N - Nx_N - Nx_N a \log a}, \qquad (B.5)$$
$$P(B > b_N) < e^{Nb_N - Nx_N - Nb_N \log \frac{b_N}{x_N}}. \qquad (B.6)$$
Simple calculus shows that
$$\max_{t \in [0, a x_N]} |g_{N,x}(t)| = g_{N,x}(0) = \frac{Nx}{2}.$$
We have that the first integral is o(1) iff
$$aNx_N - Nx_N - Nx_N a \log a + \log(Nx_N) \to -\infty.$$
We rearrange:
$$aNx_N - Nx_N - Nx_N a \log a + \log(Nx_N) = Nx_N\bigl( a(1 - \log a) - 1 \bigr) + \log(Nx_N).$$
Since $a(1 - \log a) < 1$ for all $a \in (0, 1)$, the bound follows. Note that this is the point where the condition of the theorem enters; if $Nx_N$ remains bounded, the application of the Chernoff inequality becomes useless and the theorem fails. This takes care of the first integral.

Taylor's bound suffices for the second integral:
$$\max_{t \in [a x_N, x_N]} |g_{N,x}(t)| < \max_{u \in [a x_N, x_N]} \Bigl| \frac{N(t - x)^3}{6u^2} \Bigr|,$$
from which we deduce
$$\Bigl| \int_{a x_N}^{x_N} g\, dBin_{N,x} \Bigr| < \frac{N((1 - a)x_N)^4}{6(a x_N)^2} = o(1),$$
by the assumption on $x_N$. The last two integrals follow by similar methods once the sequence $b_N$ is fixed. The third integral dies if $b_N$ satisfies the following condition (derived, again, from Taylor's theorem):
$$\frac{N(b_N - x_N)^4}{6 x_N^2} = o(1),$$
or, equivalently,
$$b_N - x_N = o\Bigl( \frac{x_N^{1/2}}{N^{1/4}} \Bigr);$$
choose $b_N$ as large as possible under this constraint, and use the second Chernoff inequality to place an o(1) bound on the last integral.
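Theorem 5 is easy to probe numerically (the following sketch is ours; since each bin count is Binomial(N, p_i), the bias of the plug-in estimator can be computed exactly). On a uniform distribution, the scaled bias $(N/(m-1))\, B(\hat H_{MLE})$ approaches $-1/2$ as $N \min_i p_i$ grows:

```python
import math

def binom_pmf(N, j, p):
    # binomial pmf computed in log space to avoid overflow at large N
    return math.exp(math.lgamma(N + 1) - math.lgamma(j + 1) - math.lgamma(N - j + 1)
                    + j * math.log(p) + (N - j) * math.log(1 - p))

def mle_bias(p, N):
    """Exact bias E[H_MLE] - H: each bin count n_i is Binomial(N, p_i)."""
    H = -sum(pi * math.log(pi) for pi in p)
    EH = sum(binom_pmf(N, j, pi) * (-(j / N) * math.log(j / N))
             for pi in p for j in range(1, N + 1))
    return EH - H

m = 20
p = [1.0 / m] * m
scaled = [(N / (m - 1)) * mle_bias(p, N) for N in (200, 800, 3200)]
print([round(s, 4) for s in scaled])   # should drift toward -0.5
```

The convergence is from below here: higher-order terms in the bias expansion are also negative for the uniform distribution.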
B.7 Bayes Concentration.

Statement (Theorem 6). If p is bounded away from zero, then H is normally concentrated with rate $m^{1/3}$, that is,
$$p(|H - E(H)| > a) = O(e^{-C m^{1/3} a^2}),$$
for any constant $a > 0$ and some constant C.

Proof. We provide only a sketch. The idea is that H almost satisfies the bounded difference condition, in the following sense: there do exist points $x \in [0,1]^m$ such that
$$\sum_{i=1}^m (\Delta H(x_i))^2 > m \epsilon_m^2,$$
say, where
$$\Delta H(x_i) \equiv \max_{x_i, x_i'} |H(x_1, \ldots, x_i, \ldots, x_m) - H(x_1, \ldots, x_i', \ldots, x_m)|,$$
but the set of such x (call the set A) is of decreasing probability. If we modify H so that $H' = H$ on the complement of A and let $H' = E(H \mid x \in A^c)$ on A, that is,
$$H'(x) = \begin{cases} H(x) & x \in A^c, \\ \frac{1}{P(A^c)} \int_{A^c} P(x) H(x) & x \in A, \end{cases}$$
then we have that
$$P(|H' - E(H')| > a) < e^{-a^2 (m \epsilon_m^2)^{-1}}$$
and
$$P(H' \ne H) = P(A).$$
We estimate P(A) as follows:
$$P(A) \le \int_{[0,1]^m} \mathbf{1}\bigl( \max_i \Delta H(x_i) > \epsilon_m \bigr) \prod_{i=1}^m p(x_i)\, dx_i \le \int \mathbf{1}\bigl( \max_i (x_{i+2} - x_i) > \epsilon_m \bigr)\, dp^m(x_i) \approx \int dt\, e^{\log p - \epsilon_m p m}. \qquad (B.7)$$
The first inequality follows by replacing the $L_2$ norm in the bounded difference condition with an $L_1$ norm; the second follows from some computation
and the smoothness of H(x) with respect to changes in single $x_i$. The last approximation is based on an approximation-in-measure-by-nice-functions argument similar to the one in the proof of theorem 3, along with the well-known asymptotic equivalence (up to constant factors), as $N \to \infty$, between the empirical process associated with a density p and the inhomogeneous Poisson process of rate Np. We estimate $|E(H) - E(H')|$ with the following hacksaw:
$$|E(H) - E(H')| = \Bigl| \int_A p(x)(H(x) - H'(x)) + \int_{A^c} p(x)(H(x) - H'(x)) \Bigr| = \Bigl| \int_A p(x)(H(x) - H'(x)) + 0 \Bigr| \le P(A) \log m.$$
If $p > c > 0$, the integral in equation B.7 is asymptotically less than $c e^{-m \epsilon_m c}$; the rate of the theorem is obtained by a crude optimization over $\epsilon_m$.

The proof of the CLT in theorem 7 follows upon combining previous results in this article with a few powerful older results. Again, to conserve space, we give only an outline. The asymptotic normality follows from McLeish's martingale CLT (Chow & Teicher, 1997) applied to the martingale $E(H \mid x_1, \ldots, x_i)$; the computation of the asymptotic mean follows by methods almost identical to those used in the proof of theorem 3 (sorting and linearity of expectation, effectively), and the asymptotic variance follows upon combining the formulas of Darling (1953) and Shao and Hahn (1995) with an approximation-in-measure argument similar, again, to that used to prove theorem 3. See also Wolpert and Wolf (1995) and Nemenman et al. (2002) for applications of Darling's formula to a similar problem.

B.8 Adaptive Partitioning.

Statement (Theorem 8). If $\log \Delta_N(\mathcal{A}_F) = o\bigl( \frac{N}{(\log m)^2} \bigr)$ and F generates $\sigma_{X,Y}$ a.s., $\hat I$ is consistent in probability; $\hat I$ is consistent a.s. under the slightly stronger condition
$$\sum_N \Delta_N(\mathcal{A}_F)\, e^{-\frac{N}{(\log m)^2}} < \infty.$$

The key inequality, unfortunately, requires some notation. We follow the terminology in Devroye et al. (1996), with a few obvious modifications. We take, as usual, $\{x_j\}$ as i.i.d. random variables in some probability space $(\Omega, \mathcal{G}, P)$. Let F be a collection of partitions of Ω, with $\mathcal{P}$ denoting a given partition. $2^{\mathcal{P}}$ denotes, as usual, the "power set" of a partition, the set of all sets that can be built up by unions of sets in $\mathcal{P}$. We introduce the class of
sets $\mathcal{A}_F$, defined as the class of all sets obtained by taking unions of sets in a given partition, $\mathcal{P}$. In other words,
$$\mathcal{A}_F \equiv \{ A : A \in 2^{\mathcal{P}},\ \mathcal{P} \in F \}.$$
Finally, the Vapnik-Chervonenkis "shatter coefficient" of the class of sets $\mathcal{A}_F$, $\Delta_N(\mathcal{A}_F)$, is defined as the number of sets that can be picked out of $\mathcal{A}_F$ using N arbitrary points $\omega_j$ in Ω:
$$\Delta_N(\mathcal{A}_F) \equiv \max_{\{\omega_j\} \in \Omega^N} |\{ \{\omega_j\} \cap A : A \in \mathcal{A}_F \}|.$$
The rate of growth in N of $\Delta_N(\mathcal{A}_F)$ provides a powerful index of the richness of the family of partitions $\mathcal{A}_F$, as the following theorem (a kind of uniform LLN) shows; p here denotes any probability measure and $p_N$, as usual, the empirical measure.

Theorem (Lugosi & Nobel, 1996). Following the notation above, for any $\epsilon > 0$,
$$P\Bigl( \sup_{\mathcal{P} \in F} \sum_{A \in \mathcal{P}} |p_N(A) - p(A)| > \epsilon \Bigr) \le 8 \Delta_N(\mathcal{A}_F)\, e^{-N\epsilon^2 / 512}.$$

Thus, this theorem is useful if $\Delta_N(\mathcal{A}_F)$ does not grow too quickly with N. As it turns out, $\Delta_N(\mathcal{A}_F)$ grows at most polynomially in N under various easy-to-check conditions. Additionally, $\Delta_N(\mathcal{A}_F)$ can often be computed using straightforward combinatorial arguments, even when the number of distinct partitions in F may be uncountable. (See Devroye et al., 1996, for a collection of instructive examples.)

Proof. Theorem 8 is proven by a Borel-Cantelli argument, coupling the above VC inequality of Lugosi and Nobel with the following easy inequality, which states that the entropy functional H is "almost $L_1$ Lipschitz":
$$|H(p) - H(q)| \le H_2(2\|p - q\|_1) + 2\|p - q\|_1 \log(m - 1),$$
where $H_2(x) \equiv -x \log x - (1 - x)\log(1 - x)$ denotes the usual binary entropy function on [0, 1]. We leave the details to the reader.

Acknowledgments

Thanks to S. Schultz, R. Sussman, and E. Simoncelli for many interesting conversations; B. Lau and A. Reyes for their collaboration on collecting the in vitro data; M. Fellows, N. Hatsopoulos, and J. Donoghue for their collaboration on collecting the primate MI data; and I. Kontoyiannis, T. Gedeon,
and A. Dimitrov for detailed comments on previous drafts. This work was supported by a predoctoral fellowship from the Howard Hughes Medical Institute.

References

Antos, A., & Kontoyiannis, I. (2001). Convergence properties of functional estimates for discrete distributions. Random Structures and Algorithms, 19, 163–193.
Azuma, K. (1967). Weighted sums of certain dependent variables. Tohoku Mathematical Journal, 3, 357–367.
Basharin, G. (1959). On a statistical estimate for the entropy of a sequence of independent random variables. Theory of Probability and Its Applications, 4, 333–336.
Beirlant, J., Dudewicz, E., Györfi, L., & van der Meulen, E. (1997). Nonparametric entropy estimation: An overview. International Journal of the Mathematical Statistics Sciences, 6, 17–39.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Billingsley, P. (1965). Ergodic theory and information. New York: Wiley.
Buracas, G., Zador, A., DeWeese, M., & Albright, T. (1998). Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 5, 959–969.
Carlton, A. (1969). On the bias of information estimates. Psychological Bulletin, 71, 108–109.
Chernoff, H. (1952). A measure of asymptotic efficiency for tests of hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23, 493–509.
Chow, Y., & Teicher, H. (1997). Probability theory. New York: Springer-Verlag.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Cucker, F., & Smale, S. (2002). On the mathematical foundations of learning. Bulletins of the American Mathematical Society, 39, 1–49.
Darbellay, G., & Vajda, I. (1999). Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory, 45, 1315–1321.
Darling, D. (1953). On a class of problems related to the random division of an interval. Annals of Mathematical Statistics, 24, 239–253.
Dembo, A., & Zeitouni, O. (1993). Large deviations techniques and applications. New York: Springer-Verlag.
Devore, R., & Lorentz, G. (1993). Constructive approximation. New York: Springer-Verlag.
Devroye, L., Györfi, L., & Lugosi, G. (1996). A probabilistic theory of pattern recognition. New York: Springer-Verlag.
Ditzian, Z., & Totik, V. (1987). Moduli of smoothness. Berlin: Springer-Verlag.
Donoho, D., & Liu, R. (1991). Geometrizing rates of convergence. Annals of Statistics, 19, 633–701.
Efron, B., & Stein, C. (1981). The jackknife estimate of variance. Annals of Statistics, 9, 586–596.
Gedeon, T., Parker, A., & Dimitrov, A. (2003). Information distortion and neural coding. Manuscript submitted for publication.
Gibbs, A., & Su, F. (2002). On choosing and bounding probability metrics. International Statistical Review, 70, 419–436.
Grenander, U. (1981). Abstract inference. New York: Wiley.
Jongbloed, G. (2000). Minimax lower bounds and moduli of continuity. Statistics and Probability Letters, 50, 279–284.
Kolmogorov, A. (1993). Information theory and the theory of algorithms. Boston: Kluwer.
Kontoyiannis, I. (1997). Second-order noiseless source coding theorems. IEEE Transactions on Information Theory, 43, 1339–1341.
Lugosi, G., & Nobel, A. B. (1996). Consistency of data-driven histogram methods for density estimation and classification. Annals of Statistics, 24, 687–706.
McDiarmid, C. (1989). On the method of bounded differences. In J. Siemons (Ed.), Surveys in combinatorics (pp. 148–188). Cambridge: Cambridge University Press.
Miller, G. (1955). Note on the bias of information estimates. In H. Quastler (Ed.), Information theory in psychology II-B (pp. 95–100). Glencoe, IL: Free Press.
Nemenman, I., Shafee, F., & Bialek, W. (2002). Entropy and inference, revisited. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing, 14. Cambridge, MA: MIT Press.
Paninski, L., Fellows, M., Hatsopoulos, N., & Donoghue, J. (1999). Coding dynamic variables in populations of motor cortex neurons. Society for Neuroscience Abstracts, 25, 665.9.
Paninski, L., Fellows, M., Hatsopoulos, N., & Donoghue, J. (2003). Temporal tuning properties for hand position and velocity in motor cortical neurons. Manuscript submitted for publication.
Paninski, L., Lau, B., & Reyes, A. (in press). Noise-driven adaptation: In vitro and mathematical analysis. Neurocomputing. Available on-line: http://www.cns.nyu.edu/~liam/adapt.html.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network: Computation in Neural Systems, 7, 87–107.
Prakasa Rao, B. (2001). Cramer-Rao type integral inequalities for general loss functions. TEST, 10, 105–120.
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1992). Numerical recipes in C. Cambridge: Cambridge University Press.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Ritov, Y., & Bickel, P. (1990). Achieving information bounds in non- and semiparametric models. Annals of Statistics, 18, 925–938.
Schervish, M. (1995). Theory of statistics. New York: Springer-Verlag.
Serfling, R. (1980). Approximation theorems of mathematical statistics. New York: Wiley.
Shao, Y., & Hahn, M. (1995). Limit theorems for the logarithm of sample spacings. Statistics and Probability Letters, 24, 121–132.
Steele, J. (1986). An Efron-Stein inequality for nonsymmetric statistics. Annals of Statistics, 14, 753–758.
Strong, S., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80, 197–202.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Allerton Conference on Communication, Control, and Computing. Urbana, IL: University of Illinois Press.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Computation, 7, 399–407.
van der Vaart, A., & Wellner, J. (1996). Weak convergence and empirical processes. New York: Springer-Verlag.
Vapnik, V. N., & Chervonenkis, A. J. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280.
Victor, J. (2000a). Asymptotic bias in information estimates and the exponential (Bell) polynomials. Neural Computation, 12, 2797–2804.
Victor, J. (2000b). How the brain uses time to represent and process visual information. Brain Research, 886, 33–46.
Victor, J. (2002). Binless strategies for estimation of information from neural data. Physical Review E, 66, 51903–51918.
Watson, G. (1980). Approximation theory and numerical methods. New York: Wiley.
Weaver, W., & Shannon, C. E. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.
Wolpert, D., & Wolf, D. (1995). Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52, 6841–6854.

Received June 3, 2002; accepted November 27, 2002.
REVIEW
Communicated by Klaus Obermayer
A Taxonomy for Spatiotemporal Connectionist Networks Revisited: The Unsupervised Case

Guilherme de A. Barreto
[email protected]
Aluizio F. R. Araújo
[email protected]
Department of Electrical Engineering, University of São Paulo, São Carlos, SP, Brazil
Stefan C. Kremer
[email protected] Guelph Natural Computation Group, Department of Computing and Information Science, University of Guelph, Guelph, Ontario N1G 2W1, Canada
Spatiotemporal connectionist networks (STCNs) comprise an important class of neural models that can deal with patterns distributed in both time and space. In this article, we widen the application domain of the taxonomy for supervised STCNs recently proposed by Kremer (2001) to the unsupervised case. This is possible through a reinterpretation of the state vector as a vector of latent (hidden) variables, as proposed by Meinicke (2000). The goal of this generalized taxonomy is then to provide a nonlinear generative framework for describing unsupervised spatiotemporal networks, making it easier to compare and contrast their representational and operational characteristics. Computational properties, representational issues, and learning are also discussed, and a number of references to the relevant source publications are provided. It is argued that the proposed approach is simple and more powerful than the previous attempts from a descriptive and predictive viewpoint. We also discuss the relation of this taxonomy with automata theory and state-space modeling and suggest directions for further work.

Neural Computation 15, 1255–1320 (2003). © 2003 Massachusetts Institute of Technology

1 Introduction

The vast majority of neural models have been concerned with the learning of static mappings. Consequently, a network so trained has a static structure that maps, at a given instant in time, an input vector x onto an output vector y. This form of static input-output mapping is well suited for pattern recognition applications (e.g., optical character recognition), where both the input vector x and the output vector y represent spatial patterns that are independent of time. We know, however, that time is important in many cognitive tasks encountered in practice that involve decision making
and behavioral responses to spatiotemporal stimuli, such as vision, speech, signal processing, and motor control (Maass, 1997b; Ballard, Salgian, Rao, & McCallum, 1998). Several examples of such time-varying patterns are also found in engineering and system modeling. Hence, the ability to encode, store, recognize, and recall spatiotemporal patterns is one of the most crucial characteristics of an intelligent system, biological or artificial. Static neural networks do not handle both temporal and spatial properties of the input patterns because they process the current pattern without regard to past data. To overcome such a limitation, a number of connectionist networks for temporal sequence processing have been proposed over the past decade and a half. These can be broadly partitioned into two classes based on their learning mechanisms: supervised and unsupervised. Supervised learning is based on the principle that the learner is provided with input stimuli and desired responses during training. Most of the neural network models for temporal processing are supervised ones, such as the multilayer perceptron (MLP), trained with temporal versions of gradient-based learning algorithms (Pearlmutter, 1995). Nonadaptive temporal extensions of the Hopfield associative memory model are also common (Hertz, Krogh, & Palmer, 1991; Herz, 1995).1 Such neural models are usually referred to as dynamic or recurrent networks. An extensive survey of supervised recurrent models can be found in Kremer (2001) and Kolen and Kremer (2001), who define a spatiotemporal connectionist network (STCN) as a parallel distributed information processing structure that is capable of dealing with input data presented across time and space. Unsupervised or self-organizing models have also attracted great attention because of a number of interesting computational properties, such as feature extraction and redundancy reduction of unlabeled data (Becker & Plumbey, 1996).
1 By nonadaptive networks, we mean models for which the connection weights are prewired, that is, determined beforehand by the network designer and not through a "learning" process. For convenience, we also consider this type of network as belonging to the class of supervised models.

Roughly speaking, unsupervised learning is characterized by the absence of an external teacher or guiding signal that indicates the right response or behavior to be pursued. In this case, learning becomes possible because there is redundancy in the input data stream (Barlow, 1961, 1989). Redundancy is the property of the input stimuli that distinguishes them from random noise. Since most naturally occurring phenomena are not random, any redundancy that occurs in the distribution of the data set implies that this set does not fill the data space uniformly. Thus, redundancy in the data suggests the existence of structure in the data. With this in mind, one possible goal of unsupervised learning is to classify the forms the redundancy takes and the methods of handling it. For instance, methods for redundancy reduction try to represent the input data using components (features) that are as independent of each other as possible. Such a representation seems to be very useful for later processing stages and is of great interest in cognitive modeling and machine learning applications. Unsupervised learning also finds applications in industrial settings, where a typical task involves the control of the movements of robot manipulators (Araújo & Barreto, 2002a). In these scenarios, unsupervised algorithms usually train much faster than supervised ones, making them an appealing choice for real-time modeling and control. Despite potential advantages of using unsupervised learning to process spatiotemporal data, recent surveys on this topic (Kaski, Kangas, & Kohonen, 1998; Hinton & Sejnowski, 1999; Barreto & Araújo, 2001a) cover only a narrow range of models and applications dealing with time-related tasks. The main goal of this article is to review how unsupervised STCNs deal with dynamic patterns from the viewpoint of the taxonomy recently proposed by Kremer (2001) to describe supervised STCNs. The approach of this article combines Kremer's taxonomy with a general definition of latent variables proposed by Meinicke (2000) in order to widen the range of applicability of Kremer's taxonomy to the unsupervised learning domain. To present our ideas, we adopt the common framework and notation of Kremer's previous work. Thus, by focusing on underlying design decisions that any temporal sequence processing system must address rather than on the mechanics of a particular architecture, we want to show that Kremer's taxonomy can also be applied to analyze and design unsupervised spatiotemporal networks. In order to do so, instantiations are given for these design decisions, considering the main approaches. Such a strategy allows us to compare and contrast the representational and operational characteristics of each model. We also provide a discussion on the relationship between Kremer's taxonomy and automata and dynamic system theories. The remainder of the article is organized as follows.
A general description of unsupervised learning and applications is carried out in section 2. We briefly present some concepts related to temporal sequence processing tasks in section 3. In section 4, we present four previous attempts to describe unsupervised STCNs, discuss their limitations, and introduce the main concepts of the taxonomy proposed by Kremer (2001) in section 5. In section 6 we introduce a generalization of Kremer's taxonomy to the unsupervised learning domain. This is possible due to a reinterpretation of the concept of state as a latent (hidden) variable. From section 7 to section 10, the generalized taxonomy is used with the objective of providing a unified conceptual tool to analyze and design unsupervised STCNs. In section 11, we show how the generalized taxonomy can be used to analyze an existing unsupervised STCN and design new ones. We also discuss the relationships of Kremer's taxonomy with automata and nonlinear dynamic systems theory. We conclude in section 12 by suggesting possible directions for further research.
2 On Unsupervised Neural Networks

Despite the growing importance of unsupervised methods, there is no established definition of unsupervised learning. Instead, there exist several categories of problems and methods that are usually agreed to belong to the field of unsupervised learning. This section provides an overview of some of these techniques without attempting to cover the whole range of this enormous field. We concentrate primarily on the concepts and techniques that provide the basis for discussion in subsequent sections. Broader and more detailed surveys of unsupervised techniques for neural networks are given by Hertz et al. (1991), Grossberg and Carpenter (1991), Haykin (1994), Becker and Plumbey (1996), Kohonen (1997), and Hinton and Sejnowski (1999). The classical work by Duda and Hart (1973) is a good starting point for related statistical methods.

In supervised learning, each input vector comes with an associated desired output that is supplied by an external teacher. Roughly speaking, unsupervised learning is often characterized as supervised learning with an unknown desired output. Thus, if the available data do not differentiate inputs from outputs, the problem of learning becomes one of unsupervised learning. In comparison with its supervised counterpart, unsupervised learning can be viewed as learning from data with either missing inputs or missing outputs. The question of what, in general, can be learned from such data without inputs leads to the constitution of unsupervised learning as a methodology for finding alternative representations of the data that may provide some task-dependent advantages over the original data set, as, for example, a reduced dimensionality. It is also common to produce a representation that is particularly good at achieving the following goals:

• Discovery of clusters in a data set, grouping together similar inputs
• Reduction of information redundancy contained in the data
• Maximization of mutual information between the inputs and the outputs of a network in the presence of noise
• Discovery of a mapping from the observed data to a set of pairwise decorrelated or statistically independent features
• Discovery of highly nongaussian projections of the data
• Discovery of spatiotemporal invariances in a temporal sequence (e.g., of images)

Learning from such transformed data is very important in many practical domains such as data mining, data compression, or pattern recognition, where often an alternative representation is needed in order to facilitate the exploration or storage or transmission of the data or to improve the inferences on such data. In summary, unsupervised neural networks can mainly
be used in exploratory data analysis. They act on unlabeled data in order to extract an efficient internal representation (description) of the structure implicit in the input data distribution. In the next subsections, we summarize previous attempts to structure and classify unsupervised neural network learning architectures according to their main computational characteristics, emphasizing their similarities with standard statistical techniques.

2.1 Projection and Vector Quantization Methods. A number of unsupervised techniques for data modeling using neural networks can broadly be divided into projection methods and vector quantization methods (Harpur, 1997; Zemel, 1993).2 Projection methods perform a linear mapping of the input space into a new feature space, and the codes that they produce are created by projecting the data onto what might loosely be termed interesting directions in the input space. Projection methods generally produce distributed representations, where each input vector is coded as a combination of some or all component directions and the output values represent the coordinates in the new space. The simplest projection method is principal components analysis (PCA), which is popular for dimension reduction and feature extraction. Several unsupervised neural models have been designed explicitly for their ability to find principal components (Oja, 1989; Hertz et al., 1991; Haykin, 1994; Becker & Plumbey, 1996). These networks commonly use the so-called Hebbian learning rules, named this way after Hebb's postulate of learning (Hebb, 1949). This postulate states that changes in the synaptic efficacy are a local function of the correlation between pre- and postsynaptic activities. In neural data analysis, the pre- and postsynaptic activities are usually identified with the input and output patterns, respectively. Hebbian learning rules produce synapses (weights) that are sensitive to second-order statistics of the data.
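As an illustration of how a Hebbian rule with a stabilizing decay term extracts a principal component, the following sketch implements Oja's rule on hypothetical toy data (the function name, learning rate, and data are ours, not from the source publications):

```python
import numpy as np

def oja_first_pc(X, lr=0.005, epochs=50, seed=0):
    """Estimate the first principal component with Oja's rule.

    The Hebbian term y*x strengthens the weight along correlated
    input directions; the decay term -y^2*w keeps the norm of w
    bounded, so w converges (up to sign) to the leading eigenvector
    of the input covariance matrix.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            y = w @ x                  # postsynaptic activity
            w += lr * y * (x - y * w)  # Oja's learning rule
    return w / np.linalg.norm(w)

# toy data: zero-mean cloud stretched along the direction (1, 1)
rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 2)) * np.array([3.0, 0.5])
R = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)
X = Z @ R.T

w = oja_first_pc(X)
evals, evecs = np.linalg.eigh(np.cov(X.T))
pc1 = evecs[:, np.argmax(evals)]  # exact first PC, for comparison
print(abs(w @ pc1))               # close to 1: w aligns with PC 1
```

Note that the rule is purely local (each weight update uses only the pre- and postsynaptic activities), in keeping with Hebb's postulate as stated above.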
Another class of projection methods is known as projection pursuit (Huber, 1985). This method also searches for interesting projections of the input multidimensional data, with the interestingness being defined by a projection index. Such an index measures the distribution of coefficients obtained by projecting the data onto a particular direction. Most indices are based on the distribution deviation from normality, for example by using entropy or the fourth moment (kurtosis). Huber (1985) has argued that the gaussian distribution is the least interesting one and that the most interesting directions are those that show the least gaussian distribution.

2 These authors originally classified unsupervised neural models into projection and kernel methods. We replaced the term kernel by vector quantization to avoid confusion with support vector machine techniques.

In vector quantization methods, local representations are formed by fitting each input vector to a single template, class, or cluster. Thus, vector
quantization is commonly used for clustering (Hertz et al., 1991). Each input vector is represented by a prototype vector, known as a code word, chosen from a finite code book. The code book can be developed using a variety of learning algorithms. A common strategy codes each training vector using the nearest (in Euclidean distance) available code word and then updates each code word to be the centroid of the training vectors that mapped to it. This process is iterated until a minimum average distortion between training vectors and the corresponding code words is reached. Several unsupervised neural network models work along similar lines, being generically known as competitive learning models (Rumelhart & Zipser, 1986; Grossberg, 1987). Unsupervised networks based on adaptive resonance theory (ART) by Grossberg and Carpenter (1988) are common examples of competitive models. The competition occurs among output neurons for the right to represent a given input pattern and is usually formulated in terms of a (dis)similarity measure that computes the closeness between the input pattern and the weight vectors of each output neuron. The competition may yield only a single active unit (the "winner") at the end of the process, resulting in what is known as a winner-take-all (WTA) network. Learning is usually achieved by moving the weight (prototype) vector of the winning unit closer to the input vector. When the distance measure used is Euclidean, this is simply an on-line version of vector quantization methods. Unlike Hebbian learning rules, competitive learning rules usually capture first-order statistics of the input data. A common extension to this idea is to arrange the output units in a grid formation and for the learning rule to be applied not only to the winning unit but also to its neighbors in the grid. This leads to a model known as the self-organizing map (SOM; Kohonen, 1997).
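The winner-take-all update just described can be sketched in a few lines of Python; the toy clusters, initialization scheme, and parameter values below are hypothetical illustrations, not any specific published model:

```python
import numpy as np

def competitive_learning(X, k=3, lr=0.1, epochs=20, seed=0):
    """Winner-take-all competitive learning (online vector quantization).

    For each input, the nearest prototype (the 'winner') is moved a
    fraction lr toward the input, capturing first-order statistics
    of the data as discussed above.
    """
    rng = np.random.default_rng(seed)
    # farthest-point initialization reduces the risk of 'dead' units
    protos = [X[0].astype(float)]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - p, axis=1) for p in protos], axis=0)
        protos.append(X[np.argmax(d)].astype(float))
    protos = np.array(protos)
    for _ in range(epochs):
        for x in X[rng.permutation(len(X))]:
            winner = np.argmin(np.linalg.norm(protos - x, axis=1))
            protos[winner] += lr * (x - protos[winner])
    return protos

# toy data: three well-separated gaussian clusters
rng = np.random.default_rng(2)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(100, 2)) for c in centers])
protos = competitive_learning(X, k=3)
print(np.round(protos, 1))  # one prototype near each true center
```

A SOM would differ only in also updating grid neighbors of the winner, with a neighborhood that shrinks over training.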
SOM- and ART-based models are the prototypical competitive neural algorithms and have been applied to a number of real-world problems. Despite some similarities, the SOM and ART networks differ substantially in the way they categorize input data. Indeed, they were designed to tackle different aspects of learning. The SOM is concerned with the preservation of neighborhood relations: nearby output neurons should store nearby input patterns. The ART network, in turn, is concerned with the so-called stability-plasticity dilemma.3 A large portion of the neural networks cited in section 7 are based on the SOM and ART architectures.

3 How can the network preserve what it has previously learned, while continuing to incorporate new knowledge?

2.2 Single- and Multiple-Cause Models. Vector quantization methods attempt to ascribe each input pattern to a single cause; although they contain templates for many possible features, they represent each pattern using only one of them. In so doing, the models make a firm decision to place each
pattern in a single class or cluster. This unsupervised form of classification can have significant advantages over projection methods, whose representations do not directly allow subsequent processing to see what type of input has occurred. However, there are several important limitations to these single-cause models (Harpur, 1997). For instance, the generalization capabilities of a single-cause model are much weaker than those where there are many active outputs. Single-cause models also usually suffer from the curse of dimensionality (Duda & Hart, 1973). Thus, in many situations, it is desirable for more than one cause or output unit to account for a single input pattern (Dayan & Zemel, 1995). For instance, an input scene composed of several objects might be more efficiently described using a different output for each object rather than just one for the whole scene whenever different objects occur somewhat independently of each other. An additional advantage of such multiple-cause models is that a few causes may be applied combinatorially to generate a large set of possible examples. The goal in a multiple-cause learning model is therefore to discover a vocabulary of independent causes or generators such that each input can be completely explained through the cooperative action of a few of these possible generators (typically represented by the activation of network output units). This is closely related to the sparse distributed form of representation (Field, 1994). In sparse coding, the idea is to represent the data using a set of neurons so that only a small number of neurons is activated at the same time. Equivalently, this means that a given neuron is activated only rarely. If the data have certain statistical properties (they are sparse), this kind of coding leads to approximate redundancy reduction. Another method for redundancy reduction is predictability minimization (PM; Schmidhuber, Eldracher, & Foltin, 1996).
Input patterns are fed into a system consisting of adaptive, initially unstructured feature detectors. There are also adaptive predictors constantly trying to predict current feature detector outputs from other feature detector outputs. Simultaneously, however, the feature detectors try to become as unpredictable as possible, resulting in a co-evolution of predictors and feature detectors. This method is based on the observation that if two random variables are independent, they provide no information that could be used to predict one variable using the other one. PM has various potential advantages over other neural methods for redundancy reduction. When applied to image data, PM automatically comes up with feature detectors reminiscent of those in biological systems (such as orientation-sensitive edge detectors, on-center–off-surround detectors, and bar detectors). The concept of redundancy reduction can also be used to preprocess time-series data. For instance, Schmidhuber (1992) introduced the history compression principle for reducing the descriptions of event sequences without loss of information. Using this principle, the author built an unsupervised hierarchical neural chunking system. This system was able to detect causal
dependencies in the temporal input stream and attend to unexpected inputs instead of focusing on every input. It also learned to reflect both the relatively local and the relatively global temporal regularities contained in the input stream.

2.3 Generative and Recoding Models. Smola, Mika, Schölkopf, and Williamson (2001) argued that in order to handle unsupervised methods, one has to consider assumptions on the data with respect to two issues. The first one involves the properties that can be extracted from the data with an appropriate degree of confidence. This is a type of feature-extracting approach to unsupervised learning. The second question is concerned with the properties that describe the data best, leading to a descriptive model of data. These two issues allow the grouping of unsupervised learning techniques into two major classes: generative and recoding models (Hochreiter & Mozer, 2001; Oja, 2002). The idea underlying generative models is to view unsupervised learning as supervised learning in which the observed data are the output and for which there is no input. Hence, the aim of this approach is to find a probabilistic model that explains how the observed data were generated, assuming implicitly that the model must be stochastic or have an unknown and varying input in order to avoid producing the same output every time. This assumption is often accompanied by the hope that the hidden or latent causes of the data will come to be represented by the activities of the units of the network (the space of neuronal activities being called the hidden state-space). This new set of latent variables provides a more suitable representation of the data.
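This two-stage generative picture (first draw a hidden cause, then generate an observation from it) can be made concrete with a toy mixture-of-gaussians sampler; the priors, means, and noise level below are illustrative assumptions of ours, not values from any of the cited models:

```python
import numpy as np

rng = np.random.default_rng(0)

# hidden causes: three latent classes with prior probabilities pi
pi = np.array([0.5, 0.3, 0.2])
means = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
sigma = 0.8

def generate(n):
    """Sample from the generative model: first draw the hidden
    cause z from the prior, then draw the observation x given z."""
    z = rng.choice(len(pi), size=n, p=pi)            # latent variable
    x = means[z] + sigma * rng.normal(size=(n, 2))   # observation
    return z, x

z, x = generate(5000)
# the free-running model reproduces the prior over hidden causes
freq = np.bincount(z, minlength=3) / len(z)
print(np.round(freq, 2))  # empirical frequencies close to pi
```

Fitting such a model to data (rather than sampling from known parameters, as here) is exactly the clustering-as-generative-modeling view discussed later in this section.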
Examples of data modeling techniques under the generative framework are given by mixtures of gaussians (Duda & Hart, 1973), factor analysis (Lawley & Maxwell, 1971), Boltzmann machines (Hinton & Sejnowski, 1986), principal curves (Hastie & Stuetzle, 1989), the generative topographic map (Bishop, Svensen, & Williams, 1998), and dynamical linear gaussian models (Roweis & Ghahramani, 1999). In particular, the Boltzmann machine is a typical generative model in the sense that after learning is complete, the free-running network generates patterns with the same probability distribution that occurred during the training phase. In the recoding unsupervised framework, the model transforms points from an observation space to an output space, and the output distribution is compared to either a reference distribution or a distribution derived from the output distribution. Examples of these methods are neural projection pursuit (Zhao & Atkeson, 1996), kernel principal component analysis (Schölkopf, Smola, & Müller, 1998), and independent component analysis (ICA; Hyvärinen, 1999; Hyvärinen, Karhunen, & Oja, 2001). The latter is a method that discovers a representation of vector-valued observations in which the statistical dependence among the vector elements in the output space is minimized. With ICA, the model demixes observation vectors, and the output distribution is compared against a factorial distribution derived
from either assumptions about the distribution (e.g., supergaussian)4 or a factorization of the output distribution. In the basic form of ICA, we observe m scalar random variables at the current time step, x(t) = [x1(t) x2(t) · · · xm(t)]^T, which are assumed to be linear combinations of n unknown independent sources, s(t) = [s1(t) s2(t) · · · sn(t)]^T. Mathematically, we have

x(t) = A s(t).    (2.1)
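A numerical sketch of the mixing model of equation 2.1 follows (the sources, mixing matrix, and scalings are hypothetical choices of ours). It also illustrates the inherent ambiguity of blind separation: any valid unmixing matrix W is a permuted and rescaled version of the inverse of A:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 1000)

# two hypothetical independent sources s(t)
s = np.vstack([np.sign(np.sin(2 * np.pi * 7 * t)),    # square wave
               rng.uniform(-1.0, 1.0, size=t.size)])  # uniform noise

A = np.array([[1.0, 0.5],
              [0.3, 1.0]])  # "unknown" mixing matrix
x = A @ s                   # observed mixtures, x(t) = A s(t)

# ICA can recover the sources only up to permutation and scale:
# any W = P D A^{-1} (P a permutation, D invertible diagonal) is
# an equally valid unmixing matrix.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
D = np.diag([2.0, -0.5])
W = P @ D @ np.linalg.inv(A)
s_hat = W @ x

print(np.allclose(s_hat[1], 2.0 * s[0]))   # row 1 recovers 2*s_0
print(np.allclose(s_hat[0], -0.5 * s[1]))  # row 0 recovers -0.5*s_1
```

A real BSS algorithm (e.g., one maximizing nongaussianity of the outputs) must estimate W blindly from x alone; here the known inverse is used only to exhibit the structure of the solution set.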
Thus, ICA postulates that the data are generated by a noiseless and invertible linear transformation of unobservable sources. The constraint that the transformation is noiseless and invertible is eliminated in several kinds of generalized ICA models (Lewicki & Sejnowski, 2000). The classical application of the ICA model is blind source separation (BSS; Jutten & Hérault, 1991; Bell & Sejnowski, 1995; Amari & Cichocki, 1998; Hyvärinen, 1999). The task is to recover the original sources by finding a square matrix, W, which is a permuted and rescaled version of the inverse of the unknown matrix, A. The classical example of BSS is the cocktail party problem. Assume that several people are speaking simultaneously in the same room, as at a cocktail party. The problem is to separate the voices of different speakers, using recordings of several microphones in the room. A less direct application of ICA methods is blind deconvolution/equalization. In blind deconvolution, a scalar signal s(t) is convolved with an unknown tapped delay-line filter a1, ..., aK, resulting in a corrupted signal x(t) = a(t) ∗ s(t), where a(t) is the impulse response of the filter. The problem is then to recover s(t) by finding a kind of inverse filter, w, so that s(t) = w(t) ∗ x(t). Assuming that the values of the original signal s(t) are independent for different t, this problem can be solved using essentially the same formalism as used in ICA (Bell & Sejnowski, 1995; Amari & Cichocki, 1998). Hinton and Sejnowski (1999) argued that the generative framework is general enough to incorporate recoding models (e.g., VQ and ICA) as a kind of improper or degenerate generative models. For example, clustering is just fitting a two-stage generative model in which we first make a discrete choice of a mixture component and then generate from a density determined by this component. The distance measure used for clustering corresponds to the negative log likelihood of the data under a component of the model.
4 A supergaussian is a density of positive kurtosis.

The usual hard assignment of a data vector to the closest cluster corresponds to approximating the posterior distribution over the hidden choices by the single most likely choice. ICA can be achieved by fitting a generative model in which the activities of the hidden units are chosen independently. Some particular tractable special cases arise when the observed data are modeled as a linear combination of the activities of hidden units plus additive gaussian noise. If one assumes that the hidden units have gaussian
probability (prior) densities, then this is factor analysis (Neal & Dayan, 1997). If the hidden units have improper uniform densities, then it is PCA. If the densities for the hidden units are independent but nongaussian, then it is ICA. The inference process for ICA becomes straightforward as the additive gaussian observation noise goes to zero. For nonzero noise, the posterior distribution over hidden states must be approximated, typically by finding its peak. As we discuss in section 6, the incorporation of recoding ("nongenerative") methods, such as PCA and ICA, under the generative viewpoint is not suitable since it can lead to some misunderstandings about its applicability. In this section, we briefly described different forms of understanding and classifying unsupervised neural learning methods. Of particular importance for this article is the generative modeling framework, since we will extend the conventional concept of latent or hidden variables in order to apply Kremer's taxonomy to the unsupervised domain. It is worth noting that no reference to temporal aspects of the input data was made, except for a very short description of ICA and BSS tasks. Since our goal is to devise a general framework for analyzing and designing unsupervised STCNs, we need to discuss some issues related to temporal sequence processing before beginning with the description of frameworks for STCNs.

3 On Temporal Sequence Processing

Temporal data processing is a very challenging task, in part because of a number of properties of spatiotemporal signals, whose knowledge is useful in designing neural models that can deal with time-related tasks. Such data can appear in several forms, such as spatial data processed temporally (e.g., scanned images) or data items arriving at different time steps. In this article, the spatiotemporal data we are interested in are commonly referred to as a temporal sequence: a finite set of discrete items linked or correlated in time.
Sequence items, which can be scalars or vectors, may assume different meanings depending on the application domain. Temporal sequences can be classified into deterministic, stochastic, or chaotic time series. By deterministic sequences, we mean roughly those whose items and their temporal order are specified with complete certainty. Stochastic sequences often result from measuring the behavior of a given dynamic system for which the observed variability has random noise as a fundamental component. Chaotic time series resemble stochastic ones. However, it is claimed that their observed variability is due not to a noise process but is the result of nonlinear interactions among the variables of the underlying deterministic system. The basic question concerning the analysis and design of neural models for temporal sequence processing (i.e., STCNs) is the following: How is the time variable represented by the model? When talking about temporal sequences, a single instant of time has no meaning, since a sequence must
have several components measured at successive instants of time. Hence, the temporal order of the sequence items is a common way of representing the flow of time. Another possibility for representing time involves the representation of the duration between successive instants of time. Temporal relationships can also be captured through feedback loops, as is usually done for the so-called recurrent (supervised) networks (Elman, 1990; Pearlmutter, 1995; Kremer, 1995). Once the mechanisms for representing time are defined, one usually has two approaches for designing unsupervised STCNs. One approach is to add short-term memory (STM) mechanisms, such as delay lines, leaky integrators, and feedback loops, to an unsupervised static neural model to obtain an unsupervised STCN (Barreto & Araújo, 2001a). This approach includes the modification of activation and learning rules to take into account temporal aspects of the input sequence. The other approach is to replace the objective function of a temporal supervised method (e.g., mean squared error between the network outputs and the target values) by an objective function for an unsupervised method, such as entropy or the Kullback-Leibler distance between the probability distribution of the data and a target distribution. The choice of one of these alternatives can be guided by the amount of information each one allows to extract from the data (in addition to spatial information) in order to enrich the inference task during signal processing. This seems quite obvious, but its realization is not, since this necessarily means providing the neural network with dynamic properties that make it responsive to time-varying signals. The field of temporal sequence processing comprises basically four tasks: learning, recognition, recall, or prediction of temporal sequences.
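Before turning to these four tasks, the two simplest STM mechanisms mentioned above, tapped delay lines and leaky integrators, can be sketched as follows (a minimal Python illustration with hypothetical parameters, not any specific published model):

```python
import numpy as np

def tapped_delay_line(seq, n):
    """Represent time explicitly: at each step t, expose the last n
    items x(t), x(t-1), ..., x(t-n+1) as one spatial input vector."""
    buf = np.zeros(n)
    windows = []
    for x in seq:
        buf = np.roll(buf, 1)   # shift every item one tap back
        buf[0] = x              # newest item enters at tap 0
        windows.append(buf.copy())
    return np.array(windows)

def leaky_integrator(seq, decay=0.7):
    """Represent time implicitly: y(t) = decay*y(t-1) + (1-decay)*x(t),
    an exponentially fading trace of the input history."""
    y, trace = 0.0, []
    for x in seq:
        y = decay * y + (1.0 - decay) * x
        trace.append(y)
    return np.array(trace)

seq = [1.0, 0.0, 0.0, 0.0]           # a unit impulse
print(tapped_delay_line(seq, 3))     # the impulse marches down the taps
print(leaky_integrator(seq))         # the impulse decays geometrically
```

The delay line gives a fixed, finite memory with perfect resolution; the leaky integrator gives an unbounded but fading memory. Either output can then be fed to a static unsupervised model such as a SOM.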
Learning means the whole process of extracting and storing temporal information after suitable representation of the sequence components according to the requirements of the problem at hand. Fundamental issues regarding learning are memory capacity and learning rules. Thus, weight adaptation is only one aspect of the learning process. In a recognition task, the network produces a signal in response to a particular input sequence when an input sequence or a transformed version of it matches any stored template. Recognition tasks are usually performed by recoding methods. In a recall task, the network dynamics must reproduce the previously learned sequence in the correct temporal order. Prediction is defined as the problem of extrapolating a temporal sequence into the future given its observed past values. Recall and prediction tasks are usually performed by generative models, since one needs to find a model that explains how the observed data were generated. It is worth mentioning that for some applications, it is possible to deal with spatiotemporal data using standard (nontemporal) methods. This is possible in general only after data preprocessing. A common approach is to transform time-domain data to the frequency domain. This is exactly the idea underlying the generalization of the ICA formulation to frequency-domain blind deconvolution through the use of Fourier transforms (Amari
G. Barreto, A. Araújo, and S. Kremer
& Cichocki, 1998). A similar approach was proposed by Mozayyani, Alanou, Dreyfus, and Vaucher (1995), who encoded time in the phase of a complex number, with past events represented in the open quadrant (−π/2, 0) and the present and future ones represented in the quadrant [0, +π/2). Simple changes in standard neural algorithms can also be sufficient to allow the processing of spatiotemporal data. For example, by generalizing the distance measure in the standard SOM algorithm, Somervuo and Kohonen (1999) enabled this neural model to classify variable-length and warped sequences. They trained each neuron in the network to store prototypical sequences and then used the dynamic time warping algorithm (Sakoe & Chiba, 1978) to obtain time-normalized distance measures between sequences of different lengths.

4 Taxonomies of Unsupervised STCNs

Despite the existence of some early models (Grossberg, 1969; Amari, 1972), the design and application of unsupervised STCNs have received special attention only more recently. Hence, there are only a few attempts at organizing such networks under a common framework or taxonomy. These can be roughly divided into architectural and generative approaches. The first approach is the more conventional view, which tries to group the different kinds of unsupervised STCNs according to the architectural properties of the neural models (e.g., number of layers, number of neurons per layer, and transfer function of the neurons). The second and more modern approach tries to fit unsupervised STCNs into the generative modeling viewpoint (see section 2.3). The underlying idea of the probabilistic generative approach is to discover the hidden causes that originated the available data. In the following sections, we describe the main characteristics of two architectural and two generative taxonomies.
Our goal is to point out the limitations of these frameworks in order to justify the proposal of a generalized version of Kremer's taxonomy, which can be applied to both supervised and unsupervised STCNs. As in Kremer (2001), we focus their description on three desired features for a good taxonomy: it should be simple and possess descriptive adequacy and predictive power.

4.1 The Operator Map Framework. A first attempt at devising a structural framework was made by Kohonen (1997) through the introduction of operator maps. The underlying idea is to regard the neurons in the SOM as mathematical operators, denoted by G, representing some kind of filters for dynamic patterns. In other words, if a finite sequence X(t − 1) = {x(t − 1), x(t − 2), . . . , x(t − n)} of n vector input items is presented at time t to the original SOM, each neuron in the map might be made to correspond to an operator that analyzes X(t). For example, in time-series analysis, one can consider each neuron i as a local estimator G_i that defines the prediction
A Taxonomy Revisited
x̂_i(t) = G_i(X(t − 1)) of the current input sample x(t) by neuron i, based on the knowledge of X(t − 1). The winner is the neuron that is the best estimator of x(t),

‖x(t) − x̂_bmu(t)‖ = min_i {‖x(t) − x̂_i(t)‖},   (4.1)

where bmu stands for the best-matching unit (i.e., the winning neuron). A widely used operator is the expression that defines multivariate autoregressive processes (Lampinen & Oja, 1989): x̂_i(t) = Σ_{j=1}^{n} w_ij x(t − j), where the weights w_ij, called linear predictive coding (LPC) coefficients and initially unknown, are learned upon application of a training sequence. A possible learning rule for the weights is defined by

w_ij(t + 1) = w_ij(t) + h_bmu,i(t) [x(t) − x̂_i(t)]^T x(t − j + 1),   (4.2)
where h_bmu,i(t) = η(t) exp(−‖r_bmu − r_i‖² / σ²(t)) is the neighborhood function, and r_bmu and r_i are the coordinates of the units bmu and i in the map, respectively. The variables η(t) and σ(t) define the learning rate and the width of the neighborhood function, respectively. Both η(t) and σ(t) decrease monotonically in time during network training. This predictor has been used in speech synthesis and recognition because time-domain signals can be directly applied without prior frequency analysis. The operator map framework is quite general in the sense that the definition of a suitable operator can account for a wide range of the available unsupervised STCNs. Furthermore, it can equally be used by other competitive learning algorithms. In our view, however, this generality lacks predictive power, in that the framework does not describe the operator component beyond stating that it is an operator. This means that networks with different operator components (e.g., multilayer versus single layer versus no layer, or first-order connections versus second-order connections) are grouped in the same category. Also, the operator map approach is restricted to the class of SOM-like networks and, hence, does not apply to Hebbian-learning-based unsupervised STCNs.

4.2 The Barreto–Araújo Taxonomy. The taxonomy proposed by Barreto and Araújo (2001a) also tries to group unsupervised STCNs based on network structural properties. A number of temporal variants of the SOM were classified along the following dimensions:

• Topology: Nonhierarchical or hierarchical
• Type of STM: Delay lines or leaky integrators
• Position of the STM: Internal or external
• Type of neural signal: Mean firing rate or spikes
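To make the operator-map idea of section 4.1 concrete, here is a minimal sketch (our own, not Kohonen's code) of an autoregressive operator map for a scalar time series, using the winner rule of equation 4.1 and the update of equation 4.2. The learning-rate and neighborhood schedules and the toy AR(2) data are illustrative assumptions:

```python
import numpy as np

def train_operator_map(x, n_units=10, order=3, n_epochs=20):
    """SOM of AR operators: unit i predicts x(t) from the last `order`
    samples via its LPC coefficients w[i] (equations 4.1 and 4.2)."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(n_units, order))  # LPC coefficients
    pos = np.arange(n_units, dtype=float)             # 1-D map coordinates
    for epoch in range(n_epochs):
        eta = 0.5 * (1.0 - epoch / n_epochs)          # decreasing learning rate
        sigma = max(0.25, (n_units / 2) * (1.0 - epoch / n_epochs))  # width
        for t in range(order, len(x)):
            past = x[t - order:t][::-1]               # x(t-1), ..., x(t-order)
            preds = w @ past                          # each unit's prediction
            bmu = int(np.argmin(np.abs(x[t] - preds)))  # winner (eq. 4.1)
            h = eta * np.exp(-((pos - pos[bmu]) ** 2) / sigma ** 2)
            # eq. 4.2: pull each unit's coefficients toward a better prediction
            w += h[:, None] * (x[t] - preds)[:, None] * past[None, :]
    return w

# toy data: a noisy AR(2) series the map should learn to predict
rng = np.random.default_rng(1)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + 0.1 * rng.normal()
w = train_operator_map(x)
```

After training, the best-matching unit's one-step prediction error is well below the raw signal variance, which is all the winner rule of equation 4.1 requires.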
Despite its simplicity, this taxonomy excludes SOM-based models with feedback loops (see section 7.6) and, as in the operator map framework, does not apply to Hebbian unsupervised STCNs. Structural approaches for analyzing and designing unsupervised STCNs are in general very simple and possess extremely high descriptive adequacy, since they are based on currently available neural architectures. But they are limited when considering new architectural design possibilities; they lack good predictive power. The need for approaches that better balance these requirements is immediate in the unsupervised domain. From the next section onward, we study several attempts to devise general frameworks for unsupervised learning models from the point of view of probabilistic generative modeling.

4.3 The Generative Modeling Framework. As pointed out by Hinton and Sejnowski (1999), the generative framework can also be used to characterize some unsupervised STCNs, such as those involved in extracting perceptual invariances from spatiotemporal sequences of images (Földiák, 1991; Stone & Bray, 1995; Stone, 1996; Wallis, 1996; Kohonen, 1996; Kohonen, Kaski, & Lappalainen, 1997; Bartlett & Sejnowski, 1998). In their own words:

Discovering temporal invariances is just fitting a dynamical system (without driving inputs) in which the state-transition matrix for the hidden state space is the identity matrix. This makes it obvious temporal invariance is naturally subsumed by linear dependence over time. If the observed data is modeled as a linear function of the hidden state space then temporal invariants can be learned by fitting a linear dynamical system. (Hinton & Sejnowski, 1999, p.
xii)

Under the same viewpoint, Roweis and Ghahramani (1999) presented a common unsupervised learning framework based on linear gaussian models that unifies a multitude of well-known models (e.g., PCA, vector quantization, and mixture of gaussian clusters) under the same discrete-time generative dynamical model. The models' parameters are all estimated by the same basic expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). Besides the factor analysis model, (nonlinear) gaussian mixture models and (nongaussian) independent component analysis were also considered. In addition to these static data models, probabilistic generative models for time-series data, comprising hidden Markov and Kalman filter models, were included. Based on the taxonomy by Kremer (2001), we extend this generative modeling viewpoint through the specification of probabilistic as well as deterministic nonlinear dynamical models that can learn spatiotemporal aspects of data, including temporal order and temporal invariances. The
[Figure 1 diagram: an internal representation linked to the environment by a generative model (top-down path) and a recognition model (bottom-up path), with a comparator between them.]

Figure 1: A general framework for unsupervised learning, where changes to the models are driven by differences between the original and generated data distributions.
resulting generalized taxonomy can be equally applied to both supervised and unsupervised STCNs, deterministic or not. We also present a single generic algorithm for applying the taxonomy to both linear and nonlinear STCNs (see section 5), instead of providing a pseudocode for each linear model as did Roweis and Ghahramani. We discuss the relationship between the generalized taxonomy and the Roweis–Ghahramani linear generative framework in more detail in section 6.

4.4 The Bayesian Ying-Yang Framework. In section 2.3, a particular neural architecture is classified into one of two classes: recoding or generative models. A useful concept now becoming common in unsupervised data modeling is to think of the whole neural system in terms of both recoding (bottom-up) and generative (top-down) models (Dayan, Hinton, Neal, & Zemel, 1995; Neal & Dayan, 1997). A general modeling framework that makes use of this idea is shown in Figure 1. An internal representation of the environment is produced by applying a recoding (or recognition) model to the input. A subsequent application of a generative model to this representation attempts to reconstruct the input (or its distribution); differences between the original and reconstructed inputs may be used to improve both the recognition and generative models. The exact methods for measuring differences and adapting the models depend on the relative importance of the various properties of the input to the system.
A general formulation for the combined use of recognition and probabilistic generative models was proposed by Xu (2000, 2001), who named it the Bayesian Ying Yang (BYY) framework.5 The author has shown that the BYY approach provides a unified viewpoint on many standard stochastic techniques for dealing with spatiotemporal data modeling, such as Kalman filtering, hidden Markov modeling (HMM), ICA, and BSS. The author also emphasized the predictive power of the BYY framework, since it allowed the proposal of new techniques, such as a higher-order HMM, which includes longer temporal dependencies, temporal ICA (TICA), and temporal factor analysis (TFA). In particular, a number of theorems were given on the conditions for source separation by linear and nonlinear TICA. For example, it has been shown that not only gaussian but also nongaussian sources can be separated by TICA by exploiting temporal dependencies. TFA was used to build a model to predict macroeconomic indexes by using hidden factors obtained through TFA, with improved performance because it takes into account temporal features of stock price series.

5 On Kremer's Taxonomy

The previous generative frameworks are quite powerful, but they were designed to deal with stochastic or probabilistic models only. Many other dynamical systems, such as chaotic ones, generate data that are better explained by deterministic analysis tools (e.g., bifurcation diagrams) than by statistical methods. One can argue that deterministic models are special cases of stochastic ones for which the noise component has been eliminated. Despite being somewhat intuitive, this view can lead to some misunderstandings about the applicability of several methods we are interested in, such as PCA. We discuss this issue in more detail later in this section. But first, we introduce the main concepts of Kremer's taxonomy.
Kremer (2001) proposed a taxonomy based on the premise that it could be applied to any kind of STCN, not just one class of architectures. This goal was pursued by focusing on the underlying design decisions any temporal sequence processing system must address, rather than on the mechanics of a particular architecture. This article provides not only an extension of the taxonomy, by examining a new class of network architectures, but also a test of its central goal, by assessing its applicability to an unanticipated computational paradigm. The taxonomy identified four fundamental design decisions on which every temporal sequence processing system is based: (1) a state computation function, (2) an output computation function, (3) a function defining the initial values for the trainable parameters of the network, and (4) a function for adapting the trainable parameters. These
5 The BYY framework can be viewed as a generalization of the Helmholtz machine by Dayan et al. (1995).
four functions can be defined mathematically by the following equations, respectively:

f_s(s(t − 1), x(t), W)   (5.1)

f_y(s(t), W)   (5.2)

f_W0(a priori info)   (5.3)

f_ΔW(E(t), W, x(·))   (5.4)
In these equations, s(t) represents the state vector, x(t) is the input vector, y(t) is the output vector, x(·) is the input history, W denotes the adaptive parameters of the STCN, and W_0 symbolizes the initial values of the trainable parameters. W includes one or more weight matrices and may include connectivity scheme parameters, transmission delay parameters, and initial activation values, depending on the type of network. Note that in the definition of this framework, no assumptions are made about the representation or algorithms employed by each of the four functions. The focus is only on the goal of the computation, thus working at Marr's computational (first) level rather than the algorithmic (second) level (Marr, 1982). With this goal in mind, this article takes a different approach with respect to other review papers on unsupervised learning. We do not try to describe particular neural architectures and all their components. Instead, we show how the four functions in equations 5.1 through 5.4 are instantiated (or specialized) in terms of unsupervised STCNs. These instantiations can then be used to define the operation of a specific STCN architecture, relative to the generic STCN algorithm (see Table 1). Originally, the total error, ε, was defined as a measure of the overall performance of supervised STCNs. This is the reason behind the existence of an error vector E(t) = y*(t) − y(t) in equation 5.4, denoting the difference between the desired output vector, y*, and the actual output vector, y. Obviously, this error measure does not have the same meaning for unsupervised STCNs as it has for supervised ones, but the original formulation of the algorithm in Table 1 can be preserved anyway.6 For example, one can note that a number of supervised spatiotemporal techniques can be recast as unsupervised by replacing the supervised error
6 In this article, we consider reinforcement learning as a kind of supervised learning paradigm, so that the original Kremer's taxonomy can be easily applied. In this regard, agents are implemented as neural networks and trained by means of variants of backpropagation. The ability to learn from the reinforcement signal is achieved by finding a quantity (depending on it) to backpropagate, that is, playing the role of the error, E, in supervised learning rules. Most of the proposed techniques are related to the TD(λ) method (Sutton & Barto, 1998), in which differences of prediction in subsequent time steps are exploited.
Table 1: A Generic Algorithm for STCNs from Kremer (2001).

00 algorithm STCN
01   W ← f_W0(a priori info)              /* initialize */
02   do                                   /* begin training */
03     t ← 0                              /* reset time */
04     do                                 /* begin sequence */
05       t ← t + 1                        /* increment time */
06       x(t) ← environment               /* determine input */
07       s(t) ← f_s(s(t − 1), x(t), W)    /* update state */
08       y(t) ← f_y(s(t), W)              /* determine output */
09       if target y*(t) available
10         E ← y*(t) − y(t)               /* compute error */
11         ε ← ½ ‖E‖²                     /* compute scalar */
12       W ← W + f_ΔW(E, W, x(·))         /* update parameters */
13     while not end of sequence
14   while t < t_max and ε > threshold    /* no solution found and not out of time */

Note: Particular network designs differ only in their instantiations of the four fundamental functions.
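The control flow of Table 1 translates directly into code. The sketch below is our own rendering of the generic loop, with the four fundamental functions passed in as parameters; the concrete instantiation at the bottom (a memoryless state with a linear readout trained by a delta rule) is purely illustrative, not part of Kremer's taxonomy:

```python
import numpy as np

def run_stcn(f_s, f_y, f_W0, f_dW, sequences, t_max=4000, threshold=1e-3):
    """Generic STCN loop of Table 1: a particular network design is just an
    instantiation of the four functions f_s, f_y, f_W0, f_dW."""
    W = f_W0()                                    # line 01: initialize
    t, eps = 0, float("inf")
    while t < t_max and eps > threshold:          # lines 02, 14: begin training
        for x_seq, y_seq in sequences:            # line 04: begin sequence
            s, history = None, []                 # line 03: reset state/time
            for x, y_target in zip(x_seq, y_seq):
                t += 1                            # line 05: increment time
                history.append(x)                 # line 06: determine input
                s = f_s(s, x, W)                  # line 07: update state
                y = f_y(s, W)                     # line 08: determine output
                if y_target is not None:          # line 09: target available?
                    E = y_target - y              # line 10: compute error
                    eps = 0.5 * float(E @ E)      # line 11: compute scalar
                    W = W + f_dW(E, W, history)   # line 12: update parameters
    return W

# illustrative instantiation: memoryless state, linear readout, delta rule
rng = np.random.default_rng(0)
A = np.array([[1.0, -0.5], [0.3, 0.8]])           # "true" mapping to recover
x_seq = [rng.normal(size=2) for _ in range(2000)]
y_seq = [A @ x for x in x_seq]
W_hat = run_stcn(
    f_s=lambda s, x, W: x,                        # state = current input
    f_y=lambda s, W: W @ s,
    f_W0=lambda: np.zeros((2, 2)),
    f_dW=lambda E, W, hist: 0.1 * np.outer(E, hist[-1]),
    sequences=[(x_seq, y_seq)],
)
```

Swapping in a different state function (e.g., a leaky integrator) or a different update rule changes the network design without touching the generic loop, which is exactly the point of the taxonomy.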
E(t) by an unsupervised performance measure, such as entropy maximization (Bell & Sejnowski, 1995) or the Kullback-Leibler distance between the probability distribution of the input data and that of a target distribution (Hinton & Sejnowski, 1986). A general approach for evaluating the evolution of training in unsupervised networks consists of monitoring the variation of the weight vectors. In this case, lines 10 and 11 should be replaced, for example, by:

10   E(t) = (1/N) Σ_{i=1}^{N} ‖Δw_i(t)‖²

11   ε = (1/τ) Σ_{k=t−τ+1}^{t} E(k),

where E(t) quantifies the variation in the updates of all the N weight vectors during the current time step t, and ε accounts for the total variation over a certain time span of length τ. Large variations of E(t), and hence of ε, indicate that the weights have not yet converged, while very small ones suggest that training should stop. Thus, we decided to use the same symbols ε and E(t) for evaluating both supervised and unsupervised training performance, to be in accordance with the original proposal and thus provide a framework that transcends the distinction between supervised and unsupervised.
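The weight-variation stopping criterion of lines 10 and 11 is straightforward to implement. In this sketch (ours; the shrinking synthetic updates merely mimic a converging training run, and τ and the threshold are illustrative), training stops once ε, the windowed average of E(t), falls below a threshold:

```python
import numpy as np
from collections import deque

def weight_variation(dW):
    """E(t): mean squared norm of the current updates of the N weight vectors."""
    return float(np.mean([np.sum(d ** 2) for d in dW]))

tau, threshold = 20, 1e-4
window = deque(maxlen=tau)           # holds the last tau values of E(t)
rng = np.random.default_rng(0)
stopped_at = None
for t in range(1, 1001):
    # synthetic updates whose size decays like 1/t, mimicking convergence
    dW = [rng.normal(scale=1.0 / t, size=4) for _ in range(10)]
    window.append(weight_variation(dW))
    eps = sum(window) / tau          # epsilon: total variation over the window
    if len(window) == tau and eps < threshold:
        stopped_at = t               # variation is small: stop training
        break
```

The `deque` with `maxlen=tau` keeps exactly the last τ values of E(t), so ε is always the sum over the most recent time span of length τ divided by τ.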
6 The Generalized Taxonomy

For a neural network model to be able to process spatiotemporal data, it must be given STM mechanisms. Indeed, this is the distinguishing characteristic of STCNs. Static connectionist networks compute the activation values of all units at time t based only on the input at time t. By contrast, in STCNs, the activations of some units at time t are computed based on activations at time t − 1 or earlier. In Kremer's taxonomy, the state vector is identified with the different STM mechanisms used by neural networks to handle spatiotemporal patterns. A state vector, s(t), summarizes information about past signals directly. We also use the state vector s(t − 1) to represent the activations at time t − 1 of the network units, uniquely defining the situation of the network at time t − 1. Such a vector is employed to compute the activations of the network units at time t, as stated in equation 5.1. For the generalization of Kremer's taxonomy to the unsupervised learning domain, we also identify the state vector with generalized latent variables, a concept recently proposed by Meinicke (2000). These latent variables, unlike the ones used in stochastic generative models, can also have a deterministic nature. Meinicke (2000) used this idea to propose the generalized regression (GR) framework for formulating unsupervised learning as a regression problem. In the literature, regression has been tackled by means of different techniques, such as statistics, neural networks, prediction trees, regression trees, and fuzzy logic. Moreover, it has been seen as either a supervised or a reinforcement learning task (Barto, Sutton, & Anderson, 1983; Gullapalli, 1990; Hertz et al., 1991; Haykin, 1994). In the context of (the supervised) Kremer's taxonomy, the regression function is equivalent to the output function in equation 5.2.
However, instead of a regression on a set of given inputs, the unsupervised counterpart has to perform a regression on a set of supplemental latent variables, which have to meet the specific task requirements. Thus, from the viewpoint of GR estimation, the main difference between unsupervised and supervised learning lies in the origin of the regressor variables. Meinicke (2000) pointed out that methods of regression on latent variables can also be referred to as regression without inputs or, simply, unsupervised regression. The distinction between random and deterministic latent variables is not new and may be motivated by the existence of several well-established methods that do not require any assumptions about the distributions of the latent variables (Anderson, 1984). On the other hand, if some knowledge about the underlying data distribution is available, then a suitable randomization of the latent variables may lead to an improvement of the model. Randomness of the latent variables can be handled by Kremer's taxonomy in two basic ways. First, the input vector x(t) can be treated as a random variable itself, for which some assumptions about its distribution are made. This random-
ness will be reflected, through the deterministic functions f_s(·) and f_y(·), in the temporal behavior of the state vector s(t). This happens when one uses a deterministic unsupervised algorithm, such as the SOM, to model and predict stochastic time series (Lendasse, Lee, Wertz, & Verleysen, 2002). Second, in addition to a random input vector, the functions f_s(·) and f_y(·) themselves can have stochastic dynamics. This is the case when one uses a probabilistic unsupervised algorithm, such as the GTM algorithm, to model and predict stochastic time series (Bishop, Hinton, & Strachan, 1997). Learning is affected by randomness in similar ways. For a deterministic learning function f_ΔW(·), randomness is taken into consideration through the input history x(·). For example, in Lendasse et al. (2002), the function f_ΔW(·) is the usual SOM learning rule. Alternatively, randomness can also be handled through the definition of a probabilistic learning function f_ΔW(·). In Bishop et al. (1997), it is the well-known EM algorithm (Dempster et al., 1977). As an example of how insightful the concept of generalized latent variables can be, we refer to the work of Roweis and Ghahramani (1999), who considered only the case of random latent variables. These authors considered the following discrete-time linear dynamical generative model with gaussian noise:

s(t + 1) = A s(t) + w(t),   (6.1)

y(t) = C s(t) + v(t),   (6.2)
where s ∈

ρ_h is always positive. That means that even a pair of neurons with common input that is due only to the random connectivity tends to correlate. In a network where a pattern of connectivity as in Figure 3 is abundant, ρ^out of a pair of neurons is likely to be ρ^in for another pair of neurons. An example of such a network is one with chains hardwired into its connectivity matrix. In such cases, it is relevant to seek a fixed point of pairwise correlations, ρ*, that obeys ρ^out = ρ^in = ρ*. As argued in the appendix, under the constraint of equation A.5, ρ^out = ρ_h; thus, it is straightforward to look for such a fixed point. In Figure 4a, ρ^out(ρ^in) is plotted for several values of r. We note that ρ^out(ρ^in) crosses the diagonal with slope less than 1 exactly once, for all values of r, resulting in a globally stable fixed point, ρ*. We calculate ρ* from equation 4.2 by letting all ρ variables be equal to ρ* and taking the positive solution:

ρ* = (r² − q + √((q − r²)² + 4(q − 1)εr²)) / (2r²).   (4.3)

Two important observations follow from this equation. One is that the result is independent of the firing rate, which we assume is an artifact of the simplified model. The other is that the behavior of this equation is a function of the ratio r, and not of K and w separately. This implies that scaling w by √K is the relevant scaling when considering correlation coefficients in this model. Figure 4b displays ρ* as a function of the scaling variable r. The curve displays a transition from low to high correlations near a critical value of
Synfire Chains in a Balanced Network
r, r_c = √q. Strictly speaking, the transition becomes sharp only in the limit ε → 0. As can be observed in Figure 4b, in this limit, the stable fixed point is zero for r smaller than r_c, whereas above r_c it diverges rapidly.

4.2 An Integrate-and-Fire Model. When considering transmission of a highly correlated signal in an asynchronously active population, as we do in this study, two main variables should be examined: neuronal firing rates and correlations between neurons. Clearly, the two variables depend on each other. Several studies have dealt with them. Tetzlaff, Buschermohle, Geisel, and Diesmann (2001) focused on the evolution of output rates as a function of input rates along a chain with noisy input but disregarded the issue of correlation. An integrative study, however, is currently being conducted by Tetzlaff et al. (2002). Salinas and Sejnowski (2000) studied the firing rates and their variability as a function of input correlation. The correlation in the input was the result of a common input or a common oscillation in the input rates. Analytical results based on a random walk model with drift were qualitatively similar to a conductance-based IAF model. In addition, Feng and Brown (2000) discussed the impact of temporally correlated input on the output statistics of an IAF neuron. Stroeve and Gielen (2001) considered the correlation between a pair of neurons as a function of a correlated common input. They follow a logic similar to ours and compute ρ^out(ρ^in) for a conductance-based IF model; the resulting ρ* was not derived. Here, we study the evolution of the output correlation as a function of the spatially correlated input in an IAF model. The IF model is less amenable to analytic study than the simple model. Hence, we turn to simulations in order to compute ρ^out(r, ρ^in, ν^in). Note that we added the input firing rate ν^in as an additional parameter.
This is due to the fact that input rates affect the correlations of IF neurons (Salinas & Sejnowski, 2000; Stroeve & Gielen, 2001; Tetzlaff et al., 2001). Figure 5a shows ρ^out(ρ^in) for different w and for fixed values of K and ν^in. In contrast to the simple model, ρ^out can cross the diagonal more than once. We look for the fixed-point correlations in the dynamic steady state of the system. Following the procedure used for the simple model, we extracted the intersection points of the curve with the diagonal, ρ^out(ρ*) = ρ*. We also calculated the stability of these fixed points by measuring the slope at the point of intersection. The results are depicted in Figure 5b. If |dρ^out/dρ^in| < 1 at the fixed point, it is stable (thick line); otherwise, it is unstable (thin dotted line). Here, we define r_c to be the point where the saddle-node bifurcation appears. Note that the details of the results of Figure 5 depend on the way correlations were introduced. Following the Poisson mother process, as explained in Kuhn et al. (in press), we introduced correlations well beyond the second order. Had we limited ourselves to currents with second-order correlations only, such as h_1 and h_2 of the appendix, the value of r_c would change considerably.
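Two pieces of the preceding analysis are easy to probe numerically: the fixed point of equation 4.3 for the simple model, and the generation of correlated input by thinning a common Poisson mother train. The thinning construction below is a basic pairwise version of the mother-process idea (it yields a count correlation of about p, without the higher-order correlations used in the text); the parameter values q, ε, the rate, and p are all illustrative:

```python
import math
import numpy as np

# fixed point of equation 4.3 (simple model)
def rho_star(r, q, eps):
    disc = (q - r ** 2) ** 2 + 4.0 * (q - 1.0) * eps * r ** 2
    return (r ** 2 - q + math.sqrt(disc)) / (2.0 * r ** 2)

q, eps = 4.0, 0.01                   # illustrative parameters
rc = math.sqrt(q)                    # critical value of the scaling variable
low = rho_star(0.5 * rc, q, eps)     # well below rc: rho* stays near zero
high = rho_star(1.5 * rc, q, eps)    # above rc: rho* is of order one

# correlated input from a common Poisson mother train: each child keeps each
# mother spike independently with probability p, giving count correlation ~ p
def correlated_pair(rate, p, n_bins, bin_size, rng):
    mother = rng.poisson((rate / p) * bin_size, size=n_bins)
    return rng.binomial(mother, p), rng.binomial(mother, p)

rng = np.random.default_rng(0)
c1, c2 = correlated_pair(rate=10.0, p=0.3, n_bins=200_000, bin_size=0.01, rng=rng)
rho_emp = float(np.corrcoef(c1, c2)[0, 1])
```

The thinned trains are again Poisson with the target rate, so the shared mother spikes alone set the pairwise correlation, independent of the bin size.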
Y. Aviel, C. Mehring, M. Abeles, and D. Horn

Figure 5: Transition in IAF neurons. (a) ρ^out as a function of ρ^in for different values of r (indicated above each curve). The crossings of the sigmoid curves and the diagonal (dashed line) yield the fixed-point correlations. (b) Fixed-point correlations as a function of r. Stable fixed points are indicated by a thick line and the unstable ones by a thin dotted line. Note that ρ* = 0 is stable for any r in the range. Parameters: K = 9000, ν^in = 1.5 ν_thre.
Unlike in the simple model, we note that for small ρ^in, ρ^out ≈ 0. This insensitivity of the output correlation reflects the fact that the neurons' input is not completely balanced. Had the neurons' input been perfectly balanced, ρ^out would have been more sensitive to small input correlation. However, we chose the same neuron parameters as in our simulations of the full network, so as to match the full network simulation better. For r > r_c, the system is bistable. This is different from the simple model in Figure 4b, which had one globally stable fixed point for any value of r. Note, however, that although this two-neuron system is bistable, this does not imply that a network of IF neurons will be bistable. Even if the network settles first in the lower (zero-correlation) branch, transient spontaneous correlations may evolve, pushing the network into the basin of attraction of the upper branch. In fact, our network simulations described in the next section display either a low-correlation AI mode or a high-correlation synchronous mode of activity, depending on the parameters of the network. Moreover, we show in the next section that the critical behavior of the whole network displays scaling in the same variable r as predicted by the simple model.

5 The Full Network

The main lesson from our study of a pair of neurons is the existence of a rapid transition from low correlations to high correlations as a function of the scaling variable r = w/√K. This type of behavior is also reflected in the simulations of IF networks. In our simulations, we use K = N/10. Running networks with different values of K (and N), we search for the transitions of the networks, as a function of w, from their AI mode to the global oscillatory mode. One possible measure is the coefficient of variation
Figure 6: Scale invariance in the full network. The synchrony measure (CV of the population rate) as a function of r for different network sizes. Digits inside the figure denote the size of K in thousands. w ranges from 150 to 500 in steps of 50. Population rate is defined as in Figure 2. Mean excitatory firing rate for r ≈ 2 is 13 Hz (SD across the different population sizes 5). Mean excitatory firing rate for r ≈ 5 is 41 Hz (SD 7).
(CV) of the population rate, defined as the standard deviation of the population rate divided by its mean, shown in Figure 6. Plotting the results as a function of the scaling variable r, we see that the different curves coincide; thus, this is a valid scaling relation for our problem. A sharp transition is observed around r_c = 2.5 in these networks. Once we have gained some knowledge about the critical w, w_c ≡ r_c √K, above which strong correlations are inevitable, we can inquire whether w_c is large enough to enable stable transmission of waves. Diesmann et al. (1999) pointed out that as w decreases, the basin of attraction of the stable pulse packets shrinks. Stated otherwise, there is a minimal w, w_min, below which waves dissolve. Our challenge is to meet the contradicting requirements of small w, to avoid oscillations in the BN, and large w, to enable stable waves.
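The CV of the population rate is simple to compute from binned spike counts. The sketch below (our own, on synthetic surrogate data with illustrative parameters) contrasts an asynchronous population with one sharing a common rate oscillation:

```python
import numpy as np

def population_rate_cv(spike_counts):
    """CV of the population rate: SD of the summed per-bin activity
    divided by its mean."""
    rate = spike_counts.sum(axis=0)   # population rate in each time bin
    return float(rate.std() / rate.mean())

rng = np.random.default_rng(0)
n_neurons, n_bins = 1000, 2000
# asynchronous surrogate: independent Poisson neurons
cv_async = population_rate_cv(rng.poisson(0.05, size=(n_neurons, n_bins)))
# oscillatory surrogate: same mean rate, modulated by a common oscillation
mod = 0.05 * (1.0 + 0.8 * np.sin(2 * np.pi * np.arange(n_bins) / 50))
cv_sync = population_rate_cv(rng.poisson(mod, size=(n_neurons, n_bins)))
```

For independent neurons the population-rate fluctuations average out, keeping the CV small; a common oscillation adds variance to the summed rate that does not average out, which is why the CV serves as a synchrony measure in Figure 6.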
In short, we look for the range

w_min < w < w_c,   (5.1)

where w_min depends on (θ − ν_reset)/J_EE. We conclude that K has to be large in order to find a range of w satisfying equation 5.1. For our IF model parameters, we find w_min > 230 and r_c ≈ 2.7. We choose r = 2.5; therefore, K > (230/2.5)², or simply K = 9000. Since in our architecture K = εN_E, where ε = 0.1, we have to simulate N_E = 90,000 excitatory neurons and N_I = 22,500 inhibitory neurons before we are able to observe a synfire wave in a balanced network. Figure 7 shows the results of the simulations that substantiate the theoretical analysis. A wave is ignited at t = 1600 ms. For Figure 7a, w = 200 < w_min; the wave dissolves after activating four pools. In Figure 7b, w_min < w = 250 < w_c. The wave propagates successfully, activating 50 pools (only the first 30 pools are shown). It should be noted, however, that due to the length of the chain, all wave ignitions lead to an oscillatory activity that eventually prevents the wave from reaching the end of the chain. For Figure 7c, w = 300 > w_c; global oscillations are obtained, drowning the synfire wave.
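The arithmetic behind the bound on K can be checked in a few lines (the numerical values are taken from the text; the variable names are ours):

```python
import math

w_min, r = 230.0, 2.5            # from the text: w_min > 230, chosen r = 2.5 < r_c
K_min = (w_min / r) ** 2         # smallest K for which w = r * sqrt(K) > w_min
K = 9000                         # value adopted in the text (> 8464)
N_E = int(round(K / 0.1))        # K = 0.1 * N_E  =>  N_E = 90,000
N_I = N_E // 4                   # 22,500 inhibitory neurons
assert r * math.sqrt(K) > w_min  # the chosen parameters satisfy w > w_min
```

With r fixed just below r_c, the only way to raise w = r√K above w_min is to raise K, which is why the network must be large.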
6 Summary

Many articles (Abeles, 1991; Diesmann et al., 1999; Hertz, 1999; Tetzlaff et al., 2002) discuss properties of a single chain with stationary noisy inputs. In this work, we take the discussion one step further by embedding the chain in a balanced network of IF neurons. By simulating the entire network, we take into account the mutual influence of the chain on the network, and vice versa. By insisting on working with a full network, we discovered that the formation of synfire waves in a balanced network poses quite a challenge. In order to obey all the constraints of equation 5.1, we need a network of N = 90,000 excitatory neurons or more. In other words, we conclude that large networks are needed in order to create the right conditions. Changing the details of the neuronal model could possibly reduce this high number of neurons. Reducing w_min to 100 (as is the case in Diesmann et al., 1999) cuts the lower bound of K to near 1300, rather than 9000 as in our case. As for the upper bound, it has been shown (Brunel & Hakim, 1999; Brunel, 2000) that the location of the transition from the asynchronous state to the synchronous state depends on the characteristics of synaptic processing (e.g., time constant, heterogeneity of time constants). Moreover, in recent simulations, we have seen that adding inhibitory neurons to synfire pools may change the conditions of our network. Mehring, Hehl, Kubo, Diesmann, and Aertsen (in press) also studied a synfire chain embedded in a balanced network of locally connected IF neurons. They investigated the propagation of synchronous activity for dif-
Synfire Chains in a Balanced Network

[Figure 7 raster plots, panels a–c (w = 200, 250, 300); see caption below.]
Figure 7: Three panels of raster plots. The neurons on the y-axis are ordered according to their participation in the pools. Only the first several pools are shown, where in each pool, every seventh neuron is presented. (a) Asynchronous activity. The wave ignited at t = 1600 ms dissolves (w = 200 < w_min). (b) Asynchronous activity. The wave ignited at t = 1600 ms is stable. The first 30 pools are shown (w_min < w = 250 < w_c). (c) Synchronous activity; no meaningful wave obtained (w_c < w = 300). Parameters: N_E = 90,000. Mean excitatory/inhibitory firing rates of a, b, and c are 8/8, 4/13, and 26/24 Hz, respectively.
ferent spatial arrangements of a chain but did not analyze the instability of the AI state, as only short chains with 10 pools were considered. Tetzlaff et al. (2001, 2002) showed that for a chain with external Poissonian input, the transition from the asynchronous to the synchronous regime is well determined by a pure rate model. This seems to contradict our results. This apparent contradiction may be a result of the small r used by Tetzlaff et al. (2002). A closer look at their setup reveals that each neuron in their chain receives K = 17,600 excitatory inputs. Their chain width w is 200; hence their r is 1.41, which is small, probably smaller than their model's r_c. This explains the apparent contradiction. For small r values, correlation does not play a significant role, but rates apparently do. As mentioned before, firing rates and correlations affect each other in an intricate manner. Therefore, we have put emphasis on verifying that the qualitative scaling behavior of the
Y. Aviel, C. Mehring, M. Abeles, and D. Horn
network, as depicted in Figure 6, holds for a range of external firing rates. There is always a critical r, but its value varies slightly as a function of the external rates, as can be expected from the IF model.

Previous studies within other models (Bienenstock, 1995; Hertz, 1999) have shown a critical dependence on the number of pools, a capacity limit analogous to that of stored memories in a feedback neural network. Here, by simulating a network of N_E = 90,000 and w = 260, we have found the capacity to be on the order of 2000 pools. Embedding a higher number of pools leads to spontaneous global oscillations. We conclude that all our previous results hold only below the capacity limit. If one exceeds the capacity limit, global oscillations will appear, regardless of the value of r.

There is a current controversy between those who claim that neurons code information via their firing rate (Shadlen & Newsome, 1994) and those who believe that the information is conveyed in the exact timing of spikes (Softky & Koch, 1993). This debate engendered the concepts of balanced inputs (Shadlen & Newsome, 1994) and balanced networks (BN; van Vreeswijk & Sompolinsky, 1998). Van Vreeswijk and Sompolinsky (1998) showed that BNs have properties useful for rate coding. Here, we demonstrate that BNs are capable of temporal coding as well. Furthermore, both codes can be applied simultaneously; thus, the network may multiplex rate and temporal codes.

Appendix: The Simple Model

We define a semilinear neuronal model,

r_i(t) = [h_i(t)]_+,    (A.1)

where the function [x]_+ is x if x > 0 and zero otherwise. h_i(t) is the local field, defined as a sum of 2K excitatory and K inhibitory synaptic inputs:

h_i(t) = Σ_{l=1}^{2K} J · s_l(t) − Σ_{m=1}^{K} gJ · s_m(t).    (A.2)

Here, J is the fixed synaptic weight, s_l(t) is the presynaptic input at time t, and g controls the excess (or shortage) of excitation in the local field. The variables s_l(t) ≥ 0, l = 1 ··· 3K, represent the instantaneous firing rates of the presynaptic neurons. For clarity, we will discard the time notation of the s_l's. A semilinear model is a crude caricature of an IF neuron. Nevertheless, it enables the use of analytical tools, allowing for a better understanding of the phenomenon under inspection. Indeed, we find that a complex network of IF neurons exhibits behavior that is predicted by the current semilinear model.

A.1 Correlation Between Two Fields. Consider a pair of neurons in a pool, each having K external as well as K local excitatory synapses and K
inhibitory synapses, as depicted in Figure 3. As a result of random connectivity, some of these synapses share a source that is common to both neurons. The mean number of these inputs is taken to be u − w for the excitatory synapses and u for the inhibitory synapses. Finally, there are w inputs that are common to both neurons, which come from the previous pool. We divide the input field of a neuron into five subfields (see Figure 3):

• 2K − u synapses of external and local independent-excitatory input (E)
• K − u synapses of independent-inhibitory input (I)
• u − w synapses of common-excitatory input (E_c)
• u synapses of common-inhibitory input (I_c)
• w synapses of correlated-common-excitatory input (E_cc)

The I_c and E_c subfields represent common input due to the random connectivity. The E_cc subfield represents input that comes from the previous pool in the chain; thus, we also consider it as correlated input. To simplify the architecture without impairing its properties, we unite both the external and the local independent excitatory inputs into one subfield: E. The E and I subfields are different for the two output neurons; we denote the E subfields of the first and second neurons by E_1 and E_2, respectively. Similarly, I_1 and I_2 are the independent inhibitory subfields for the first and second neurons. We assume that the s_i's are correlated stochastic variables that are characterized by their first two moments:

• ⟨s_i⟩ = ν
• var(s_i) = σ_s²
• ⟨s_i · s_j⟩ = ρ_in σ_s² + ν², where ρ_in is the correlation coefficient.

Here, the angular brackets ⟨ ⟩ stand for an average over time or over different realizations of the s_i process. We assume no correlations among inputs that do not come from the previous pool; that is, ρ_in is 0 if i, j are not members of the same pool. This assumption may not be valid in a full network.

We calculate the membrane potential that results from incoming postsynaptic potentials in each of the five subfields. For example, the E_cc subfield is defined as E_cc = Σ_{l∈E_cc} J · s_l, where J is the (single) synaptic weight. The statistics of E_cc are as follows:

μ_Ecc ≡ ⟨E_cc⟩ = Jwν

⟨E_cc²⟩ = J² ⟨(Σ_{l∈E_cc} s_l)²⟩ = J² w ⟨s²⟩ + J² w(w − 1) ⟨s_i · s_j⟩
        = J² [w(σ_s² + ν²) + w(w − 1)(ρ_in σ_s² + ν²)].
We assume w and K to be large; hence, we can apply the central limit theorem, leading to

E_cc ~ N(μ_Ecc, σ_Ecc), with μ_Ecc = Jwν and σ_Ecc = Jσ_s √(w(1 + ρ_in(w − 1))).

A derivation similar to that for subfield E_cc leads to a normal approximation of the other subfields:

E_1 ~ N(μ_E1, σ_E1), with μ_E1 = J(2K − u)ν and σ_E1 = Jσ_s √(2K − u)
E_c ~ N(μ_Ec, σ_Ec), with μ_Ec = J(u − w)ν and σ_Ec = Jσ_s √(u − w)
I_1 ~ N(μ_I1, σ_I1), with μ_I1 = −gJ(K − u)ν and σ_I1 = gJσ_s √(K − u)
I_c ~ N(μ_Ic, σ_Ic), with μ_Ic = −gJuν and σ_Ic = gJσ_s √u

Note that we have assumed here that the inputs that do not come from the previous pool are uncorrelated, that is, ⟨s_i s_j⟩ = ν². Also note that μ_E1 = μ_E2, μ_I1 = μ_I2 and that σ_E1 = σ_E2, σ_I1 = σ_I2.

Let h_1, the input field of the first neuron, be the sum of the five uncorrelated, normally distributed subfields of the first neuron: h_1 = E_1 + E_c + E_cc + I_1 + I_c. The statistics of h are as follows:

μ_h ≡ ⟨h⟩ = ⟨E_1⟩ + ⟨E_c⟩ + ⟨E_cc⟩ + ⟨I_1⟩ + ⟨I_c⟩ = μ_E1 + μ_Ec + μ_Ecc + μ_I1 + μ_Ic

⟨h²⟩ = ⟨(E_1 + E_c + E_cc + I_1 + I_c)²⟩ = σ²_E1 + σ²_Ec + σ²_Ecc + σ²_I1 + σ²_Ic + μ_h².

Thus, σ_h² = σ²_E1 + σ²_Ec + σ²_Ecc + σ²_I1 + σ²_Ic = J²σ_s²[(2 + g²)K + w(w − 1)ρ_in]. Similarly, we define h_2 = E_2 + E_c + E_cc + I_2 + I_c, where E_2 and I_2 are the respective subfields of the second neuron. Whereas ⟨h_2⟩ = ⟨h_1⟩ and ⟨h_2²⟩ = ⟨h_1²⟩, the covariance of the two fields depends on their common input:

⟨h_1 · h_2⟩ = ⟨(E_1 + E_c + E_cc + I_1 + I_c)(E_2 + E_c + E_cc + I_2 + I_c)⟩ = σ²_Ec + σ²_Ecc + σ²_Ic + μ_h².
The correlation coefficient is

ρ_h ≡ (⟨h_1 · h_2⟩ − μ_h²)/σ_h² = (σ²_Ec + σ²_Ecc + σ²_Ic)/(σ²_E1 + σ²_Ec + σ²_Ecc + σ²_I1 + σ²_Ic).

Finally, we get

ρ_h = [(1 + g²)u + w(w − 1)ρ_in]/[(2 + g²)K + w(w − 1)ρ_in].    (A.3)
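Equation A.3 is easy to wrap as a function (a sketch; the parameter values are illustrative). With u = εK from equation A.4 below and no input correlation, it reduces to (1 + g²)ε/(2 + g²), independent of K.

```python
# Equation A.3: correlation between the input fields of two neurons in a pool.
def rho_h(u, w, K, g, rho_in):
    """Field correlation from common inputs (u), pool width (w), and rho_in."""
    common = (1 + g**2) * u + w * (w - 1) * rho_in
    total = (2 + g**2) * K + w * (w - 1) * rho_in
    return common / total

# Illustrative values: u = eps*K with eps = 0.1, g = 1, no input correlation.
eps, K, g = 0.1, 9000, 1.0
print(rho_h(eps * K, 0, K, g, 0.0))   # reduces to (1+g^2)*eps/(2+g^2)
```

Adding a correlated pre-pool (w > 0, ρ_in > 0) raises ρ_h toward 1 as w grows, which is the effect exploited in the main text.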
The expected number of common inputs to a pair of neurons in a pool that do not come from the previous pool is

(K − w)²/(N − w) = εK (1 − w/K)²/(1 − w/N).

Taking the leading order in K, we set

u = εK.    (A.4)

Substituting equation A.4 into A.3 and using the approximation w(w − 1) ≈ w², we get equation 4.1.

A.2 Correlation Between Two Neurons. Clearly, the correlation between two neurons (ρ_out) is a function of the correlation between their fields (ρ_h). In general, however, this function is highly dependent on the neuronal model in use. It is hard to derive an analytical expression for this dependency for nonlinear models. Therefore, we assume that the mean membrane potential, or the mean of the local field, is rarely below zero:

μ_h > 2σ_h.    (A.5)
Thus P[h_i(t) < 0] < 0.023, which allows us to approximate equation A.1 with

r_i(t) ≈ h_i(t).    (A.6)

This approximation allows us to equate the correlation between the neurons with the correlation between the fields, leading to

ρ_out(ρ_h) = ρ_h.    (A.7)
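The numerical bound quoted above follows from the standard normal CDF: if μ_h > 2σ_h, then P[h < 0] < Φ(−2) ≈ 0.0228. A quick check, a sketch using only the error function:

```python
from math import erf, sqrt

# P[h < 0] for a Gaussian field h ~ N(mu_h, sigma_h^2), via the standard
# normal CDF Phi(z) = (1 + erf(z / sqrt(2))) / 2.
def p_below_zero(mu_h, sigma_h):
    z = -mu_h / sigma_h
    return 0.5 * (1 + erf(z / sqrt(2)))

p = p_below_zero(2.0, 1.0)   # boundary case mu_h = 2*sigma_h
print(p)
```

The boundary case gives Φ(−2) ≈ 0.0228 < 0.023, so the rectification [·]_+ in equation A.1 is almost never active and the linear approximation A.6 is justified.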
Acknowledgments

This work was supported in part by grants from GIF and BSF and by the Boehringer Ingelheim Fonds.

References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Bienenstock, E. (1995). A model of the neocortex. Network: Computation in Neural Systems, 6, 179–224.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8(3), 183–208.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1995). SYNOD: An environment for neural systems simulations. Rehovot, Israel: Weizmann Institute of Science.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402(6761), 529–533.
Feng, J., & Brown, D. (2000). Impact of correlated inputs on the output of the integrate-and-fire model. Neural Computation, 12(3), 671–692.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal, 4, 41–68.
Hayun, G. (2002). Modeling compositionality in biological neural networks by dynamic binding of synfire chains. Unpublished doctoral dissertation, Hebrew University, Jerusalem.
Hertz, J. A. (1999). Modeling synfire networks. In O. Parodi (Ed.), Neuronal information processing—From biological data to modelling and application. London: Imperial College Press.
Kuhn, A., Rotter, S., & Aertsen, A. (in press). Correlated input spike trains and their effects on the response of the leaky integrate-and-fire neuron. Neurocomputing.
Mehring, C., Hehl, U., Kubo, M., Diesmann, M., & Aertsen, A. (in press). Activity dynamics and propagation of synchronous spiking in locally connected random networks. Biol. Cybern.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. J. Neurosci., 20(16), 6193–6209.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4(4), 569–579.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13(1), 334–350.
Stroeve, S., & Gielen, S. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural Computation, 13(9), 2005–2029.
Tetzlaff, T., Buschermohle, M., Geisel, T., & Diesmann, M. (2001). The spread of rate and correlation in stationary cortical networks. Paper presented at the Eleventh Computational Neuroscience Meeting, Monterey, CA.
Tetzlaff, T., Geisel, T., & Diesmann, M. (2002). The ground state of cortical feedforward networks. Neurocomputing, 44–46, 673–678.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10(6), 1321–1371.

Received June 3, 2002; accepted December 5, 2002.
LETTER
Communicated by Ad Aertsen
Ergodicity of Spike Trains: When Does Trial Averaging Make Sense?

Naoki Masuda
[email protected] Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo, Japan
Kazuyuki Aihara
[email protected] Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo, Japan, and CREST, Japan Science and Technology Corporation, Saitama, Japan
Neuronal information processing is often studied on the basis of spiking patterns. The relevant statistics, such as firing rates calculated with the peristimulus time histogram, are obtained by averaging spiking patterns over many experimental runs. However, animals must respond to a single experimental stimulation in real situations, and what is available to the brain is not the trial statistics but the population statistics. Consequently, physiological ergodicity, namely, the consistency between trial averaging and population averaging, is implicitly assumed in the data analyses, although it does not trivially hold true. In this letter, we investigate how characteristics of noisy neural network models, such as single neuron properties, external stimuli, and synaptic inputs, affect the statistics of firing patterns. In particular, we show how high membrane potential sensitivity to input fluctuations, the inability of neurons to remember past inputs, external stimuli with large variability and temporally separated peaks, and relatively small contributions of synaptic inputs result in spike trains that are reproducible over many trials. The reproducibility of spike trains and synchronous firing are contrasted and related to the ergodicity issue. Several numerical calculations with neural network examples are carried out to support the theoretical results.

Neural Computation 15, 1341–1372 (2003) © 2003 Massachusetts Institute of Technology

1 Introduction

Neurons are unreliable computational units subject to substantial background noise (von Neumann, 1956). Firing patterns are highly variable from trial to trial even when we record from a well-specified neuron exposed to the identical stimulus under the same experimental condition (Softky & Koch, 1993; Tsodyks & Sejnowski, 1995; Arieli, Sterkin, Grinvald, &
Aertsen, 1996; Shadlen & Newsome, 1998). To deal with this variability, Bialek, Rieke, de Ruyter van Steveninck, and Warland (1991) showed in the blowfly visual system that the external stimulus can be estimated from a segment of a single spike train. In most data analyses, however, the same stimuli or tasks are repeated typically about 30 times or more so that spike trains collected from a particular set of neurons serve to estimate various statistics based on trial averaging. Characteristic firing patterns in the statistical sense are supposed to be related to the animals' behavior or recognition. For example, the peristimulus time histogram (PSTH), the straightforward averaging of spiking rates from many trials, gives temporal profiles of firing rates. If multiple-neuron recording is available, higher-order statistics such as the cross-correlation coefficient of firing patterns can be calculated. Then we usually identify cooperative activities of neurons such as synchronous firing (Gray, König, Engel, & Singer, 1989; Steinmetz et al., 2000) and correlated firing (Aertsen, Erb, & Palm, 1994; Vaadia et al., 1995; Prut et al., 1998) that are linked to the animals' responses to stimuli. The joint PSTH (Gerstein, Bedenbaugh, & Aertsen, 1989; Aertsen, Gerstein, Habib, & Palm, 1989; Aertsen et al., 1994; Vaadia et al., 1995) also reveals task-dependent temporal modulations of neuronal cooperativity.

Although trial averaging is a powerful tool for relating neuronal activities to possible information processing, the brain cannot make use of trial averaging for experimental tasks or in natural environments because just one exposure to the stimulus should actually elicit apposite responses of an animal (Bialek et al., 1991). If the modulation of firing rates or the correlated firing uncovered by the PSTH or the joint PSTH is relevant to information processing, the brain must calculate these statistics using spike trains from many neurons instead of those from many trials.
In this sense, any speculation based on experimental results with repeated trials implicitly assumes physiological ergodicity of firing patterns (Shaw, Harth, & Scheibel, 1982; Shaw & Silverman, 1988; Gerstein et al., 1989). A neural network serving as a noisy dynamical system is defined to be ergodic when the spike trains recorded over many trials from a reference neuron agree with those from a group of neurons, namely, the neurons that belong to the functional assembly that includes the reference neuron. The degree of consistency depends on the required level of proximity, which specifies how precisely the brain really uses the information in spike trains. The most rigorous but unnatural requirement would be perfect accordance of the two sets of spike trains. One of the mildest notions of ergodicity requires just equivalence of the trial firing rate and the population firing rate. The concept of ergodicity in this context is different from the classical definition in statistical physics or mathematics, which guarantees the agreement of the statistical ensemble average and the time average of functions of state variables. Nevertheless, we continue to use the term ergodicity, respecting the pioneering work that addressed this issue.

Only partial arguments exist to support the ergodicity of spike trains. Shaw and colleagues suggested that approximately 30 nearby neurons may
work cooperatively as a functional assembly and emit similar spike trains (Shaw et al., 1982; Shaw & Silverman, 1988). These conclusions were based on column structures in the visual cortex and bistable neural models. This conjecture is also supported by the recent finding that neighboring neurons receiving shared inputs (Shadlen & Newsome, 1998) tend to belong to the same functional assembly (Shaw et al., 1982; Vaadia et al., 1995). However, these abstract arguments do not prove ergodicity directly. Biological neurons are heterogeneous and unreliable units receiving heterogeneous inputs. Consequently, even neighboring neurons may have different dynamical properties. Moreover, it is not clear how coupling influences the degree of ergodicity.

Actually, we can easily imagine nonergodic cases. For example, let us suppose a population of neurons that fire synchronously while the time of synchronous firing relative to reference timing drifts from trial to trial (also see section 7.6). In this situation, the firing rate based on single-cell recording over many trials (we call it the trial firing rate) is different from the population firing rate. We would miss synchronous firing, which is suggestive of specific information, if we relied on only trial averaging. An example at the other extreme is a coupled network of sufficiently heterogeneous neurons with weak noise. The trial firing rate may have sharp peaks caused by spike trains reproducible over trials (we call this trial synchrony), whereas the population firing rate can be temporally flat because population heterogeneity generally prohibits synchronous firing.

The objective of this article is to clarify the conditions for ergodicity by model studies. As the two examples above imply, we can detect the breakdown of ergodicity at least when population synchrony and trial asynchrony are simultaneously realized or when population asynchrony and trial synchrony are simultaneously realized.
Conversely, simultaneous emergence of population synchrony and trial synchrony, or of population asynchrony and trial asynchrony, is only a necessary condition for ergodicity. As is shown later, however, this necessary condition ensures ergodicity in most practical cases. We therefore emphasize the relation between synchronous or asynchronous firing of neural networks and trial synchrony or asynchrony. Another reason for selecting this strategy is that synchronous firing of many neurons linked to brain functions is widely observed (Gray et al., 1989; Steinmetz et al., 2000). Many mathematical and numerical results have also been reported on the collective behavior of neurons (Mirollo & Strogatz, 1990; van Vreeswijk & Sompolinsky, 1998; Bressloff & Coombes, 2000; Brunel, 2000; Masuda & Aihara, 2002b, 2003a, 2003b; Aihara & Tokuda, 2002; van Rossum, Turrigiano, & Nelson, 2002). Regarding trial averaging, however, the possibility of trial synchrony in experiments is discussed in only a few pioneering articles (Bryant & Segundo, 1976; Mainen & Sejnowski, 1995; Hunter, Milton, Thomas, & Cowan, 1998; Cecchi, Sigman, Alonso, Martinez, & Chialvo, 2000). In addition, theoretical results on trial synchrony (Hunter et al., 1998) are scarce in contrast to the abundant results on population dynamics, and this is what we chiefly examine in later sections.
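The first counterexample above can be made concrete with a toy simulation (our own illustration, not a model from this letter): every neuron fires one synchronous volley per trial, but the volley time drifts across trials, so the single-trial population histogram is sharply peaked while the single-neuron trial histogram (the PSTH) is smeared.

```python
import numpy as np

# Toy nonergodic example (ours): one synchronous volley per trial, with the
# volley time drifting uniformly from trial to trial.
rng = np.random.default_rng(1)
n_neurons, n_trials, T = 100, 200, 100.0

volley = rng.uniform(40.0, 60.0, size=n_trials)        # drift across trials
jitter = rng.normal(0.0, 0.5, size=(n_trials, n_neurons))
spikes = volley[:, None] + jitter                      # spike times [trial, neuron]

bins = np.arange(0.0, T + 1.0, 1.0)
# Population histogram within one trial: population synchrony (sharp peak).
pop_hist, _ = np.histogram(spikes[0], bins)
# Histogram of one neuron over trials (the PSTH): smeared by the drift.
trial_hist, _ = np.histogram(spikes[:, 0], bins)

print(pop_hist.max(), trial_hist.max())
```

The population histogram concentrates its mass in a couple of bins, whereas the trial histogram spreads over the whole drift range, so trial averaging misses the synchrony: exactly the nonergodic situation described above.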
In section 2, we introduce the neuron model and explain the known results for the first-passage-time problem, which is equivalent to the distribution of interspike intervals with respect to noisy trials. Since the exact solutions of the first-passage-time problem cannot be obtained for most cases of interest, we rely on qualitative analysis. In sections 3, 4, and 5, the effects of three factors that contribute to distributions of spike trains and the possibility of trial synchrony (single neuron properties, external stimuli, and feedback coupling) are respectively investigated. We combine these results and discuss the conditions for ergodicity in section 6, with illustrative models in section 7. Section 8 is devoted to discussions of the results and the relevance to physiological experiments. The effects of misfiring (the failure to trigger action potentials even when the membrane potential has reached the threshold for firing) and of nonergodicity induced by learning or adaptation over trials are not considered for the sake of simplicity.

2 First-Passage-Time Problem and Variability of Spike Trains

Even if the properties of individual neurons, the external stimuli, and the network structure are given and the network dynamics is not chaotic, spike trains from a single neuron in the network can still be irregular because of various noise sources. To characterize factors that cause irregular spike trains, we analyze networks of leaky integrate-and-fire (LIF) neurons (Knight, 1972; Tuckwell, 1988; Bressloff & Coombes, 2000) with dynamical noise. The model that we introduce elucidates the ingredients of neural networks related to ergodicity. The results can be extended beyond LIF neurons since these essential ingredients are also identified for other models. The dynamics of a noisy LIF neuron is represented by

dx(t) = (−γ x(t) + I_ext(t) + I_syn(t)) dt + D dB_t,   x(0) = 0,    (2.1)
where x(t) is the membrane potential at time t, γ is the leak rate, I_ext(t) is the external stimulus, I_syn(t) is the synaptic current input from other neurons, D is the strength of the dynamical noise, and B_t is Brownian motion. The noise term D dB_t includes the contribution of ongoing background activities (Softky & Koch, 1993; Shadlen & Newsome, 1998), which are independent for all neurons, and the error caused by the diffusion approximation, in which random incident spike inputs are replaced by the sum of a drift and a Wiener process (Ricciardi, 1977; Tuckwell, 1988). A portion of the ongoing activity with spatiotemporal structure forms common inputs to the neurons (Arieli et al., 1996; Shadlen & Newsome, 1998). When we consider networks of neurons later, the common noise should be incorporated into I_ext rather than into the term D dB_t, which is assumed to be independent for different neurons. The neuron fires when x(t) reaches the threshold θ > 0, and x(t) is then instantaneously reset to the resting potential 0 to restart leaky integration.
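A minimal Euler-Maruyama integration of equation 2.1 looks as follows. This is a sketch: the parameter values are illustrative, and γ, the input I = I_ext + I_syn, D, and θ are held constant.

```python
import numpy as np

# Euler-Maruyama simulation of the noisy LIF neuron of equation 2.1
# with threshold-and-reset (illustrative parameters, constant input).
def lif_spike_times(gamma=0.1, I=1.2, D=0.3, theta=10.0, T=500.0, dt=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x, t, spikes = 0.0, 0.0, []
    while t < T:
        # dx = (-gamma*x + I) dt + D dB_t
        x += (-gamma * x + I) * dt + D * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if x >= theta:          # threshold crossing: emit a spike ...
            spikes.append(t)
            x = 0.0             # ... and reset to the resting potential 0
    return np.array(spikes)

spikes = lif_spike_times()
isi = np.diff(spikes)
print(len(spikes), isi.mean(), isi.std())
```

With these values the steady-state potential I/γ = 12 exceeds θ = 10, so the neuron fires regularly near the deterministic interval (1/γ) ln(I/(I − γθ)) ≈ 17.9, with jitter introduced by the noise term.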
Although we are interested in how jitters in firing time accumulate over the duration of many interspike intervals (Mainen & Sejnowski, 1995), we first discuss how interspike intervals are distributed with respect to noisy trials. This problem is known as the first-passage-time problem. The first-passage times of the Wiener process and the Ornstein-Uhlenbeck process are the interspike intervals of nonleaky integrate-and-fire (IF) neurons and of LIF neurons, respectively. For these cases, complicated but rigorous solutions are known when I_ext + I_syn is constant (Ricciardi, 1977; Tuckwell, 1988). To approximate the probability density function (PDF) of the first-passage time for nontrivial I_ext and I_syn, we use the PDF of x(t), which is relatively easy to derive (Gerstner & Kistler, 2002). If we ignore the threshold for simplicity, the PDF of x(t) can be represented as follows (Ricciardi, 1977; Tuckwell, 1988; Bulsara, Lowen, & Rees, 1994):

f(x, t | x_0, t_0) = (1/√(2πV(t))) exp(−(x − M(t))²/(2V(t))),    (2.2)

where f(x, t | x_0, t_0) is the PDF of x(t) with the initial condition x = x_0 at time t_0,

M(t) = ∫_{t_0}^{t} (I_syn(t′) + I_ext(t′)) e^{γ(t′ − t)} dt′ + x_0 e^{−γ(t − t_0)},    (2.3)

and

V(t) = (D²/2γ)(1 − e^{−2γt}).    (2.4)
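For constant input, equations 2.3 and 2.4 can be checked against an ensemble of simulated, threshold-free trajectories (a sketch with illustrative parameters):

```python
import numpy as np

# Monte Carlo check of equations 2.3 and 2.4 for constant input I and x0 = 0
# (no threshold, so the Gaussian PDF of equation 2.2 applies exactly).
rng = np.random.default_rng(0)
gamma, I, D, dt, t_end = 0.5, 1.0, 0.4, 0.001, 4.0
n_paths, steps = 20_000, int(t_end / dt)

x = np.zeros(n_paths)
for _ in range(steps):
    x += (-gamma * x + I) * dt + D * np.sqrt(dt) * rng.standard_normal(n_paths)

M = (I / gamma) * (1 - np.exp(-gamma * t_end))               # eq. 2.3, x0 = 0
V = (D**2 / (2 * gamma)) * (1 - np.exp(-2 * gamma * t_end))  # eq. 2.4

print(x.mean(), M, x.var(), V)
```

The empirical ensemble mean and variance match M(t) and V(t) to within sampling error, confirming that the subthreshold process is the Gaussian of equation 2.2.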
The Fokker-Planck and Kolmogorov equations accompanying equation 2.1 are respectively written as

∂f(x, t | x_0, t_0)/∂t = −(∂/∂x)[(−γx + I_ext + I_syn) f(x, t | x_0, t_0)] + (D²/2) ∂²f(x, t | x_0, t_0)/∂x²,    (2.5)

and

∂f(x, t | x_0, t_0)/∂t_0 = −(−γx_0 + I_ext + I_syn) ∂f(x, t | x_0, t_0)/∂x_0 − (D²/2) ∂²f(x, t | x_0, t_0)/∂x_0².    (2.6)
To deal with a temporally changing I_ext(t) + I_syn(t), we introduce a shifted membrane potential x̄ by

x̄ = x − M(t) + c,
where c is an arbitrary constant. Then x̄ is an Ornstein-Uhlenbeck process without bias. Instead, the threshold for the modified dynamics, θ̄ = θ − M(t) + c, changes temporally. Unfortunately, even in simple cases with constant bias, the first-passage-time problem is difficult to treat under most conditions. It is analytically tractable only when θ̄ is constant or when θ̄ relaxes to a specific constant value with a decay rate that happens to be equal to γ (Ricciardi, 1977). If we instead suppose that I_ext(t) + I_syn(t) is a stochastic process in an attempt to make the problem solvable, I_ext(t) + I_syn(t) must be stationary so that the moment-generating function of the first-passage time calculated from equation 2.5 is mathematically tractable. This restrictive assumption would trivially average out and eliminate I_ext(t) + I_syn(t) from equations 2.5 and 2.6. For these reasons, we resort to qualitative analysis supported by numerical simulations to address the first-passage-time problem and the more challenging ergodicity problem.

Our final goal is not to obtain the distribution of interspike intervals (Ricciardi, 1977; Tuckwell, 1988; Bulsara et al., 1994; Masuda & Aihara, 2002a), of intersynchronization intervals (Aihara & Tokuda, 2002; Masuda & Aihara, 2002b), or of the precision of synchrony (Gray et al., 1989; Fujii, Ito, Aihara, Ichinose, & Tsukada, 1996; Diesmann, Gewaltig, & Aertsen, 1999). These distributions concern what happens within the duration of an interspike interval. In contrast, the ergodicity problem involves the statistics of absolute firing times, or firing times relative to the stimulus timing, over the duration of many spikes. It is necessary to examine how the jitter of interspike intervals and desynchronizing effects accumulate in the long run. In sections 3, 4, and 5, three important factors that affect distributions of spike trains are identified.
3 Effects of Single Neuron Properties

Equations 2.2, 2.5, and 2.6 indicate that I_ext and I_syn directly influence firing pattern distributions and, hence, trial synchrony. However, the possibility of trial synchrony also depends on neuronal properties with the inputs given (Hunter et al., 1998). Examination of the inputs is covered in sections 4 and 5. Here, we focus on the effect of single neuron properties on trial averaging. Since the variance of the PDF defined by equation 2.4 converges to D²/2γ rapidly enough compared with the length of the typical interspike interval, equations 2.5 and 2.6 indicate that the instantaneous firing probability is determined by the combined effect of the effective noise strength D²/2γ, the mean membrane potential M(t) given in equation 2.3, and the instantaneous flow of the distribution defined by

A_1(t) = −γx + I_syn + I_ext.    (3.1)

A neuron is more likely to fire when M(t) is closer to the threshold and A_1(t) is larger. Larger values of D²/2γ blur the firing time more strongly.
This description applies to other models once the effective noise strength, M(t), and A_1(t) are appropriately defined from the corresponding Fokker-Planck equations. When M(t) is far below the threshold, the instantaneous firing probability is low regardless of A_1(t), because equation 2.3 indicates that A_1(t) can raise M(t) by at most A_1(t)Δt in a short period Δt. With M(t) near the threshold, the firing time distribution is sharply peaked and condensed if A_1(t) is a bumpy function with large variability. In this situation, larger A_1(t) is more likely to induce instantaneous firing, and the peaks of the firing time distribution are locked to the peaks of the voltage slope A_1(t) (Hunter et al., 1998; Cecchi et al., 2000; Gerstner & Kistler, 2002). Experimentally as well, firing time distributions become sharper when the membrane potential changes rapidly in the spike initiation zone (Warzecha & Egelhaaf, 1999). We note that A_1(t) is generally determined by the combined effect of the total input,

I(t) ≡ I_ext(t) + I_syn(t),    (3.2)

and the susceptibility of the neuron to input change. To evaluate the jitter of subsequent firing times, we examine the distribution of the (n + 1)th firing time in terms of that of the nth firing time. Provided that I(t) is given, let us imagine that the neuron forgets past inputs once x(t) has approached the threshold sufficiently. Here, the past inputs consist of the noise, the I(t) applied during the integration process, and the jitter in the last firing time. The jitter of forgetful neurons, whose integration time is short compared to the mean interspike interval (König, Engel, & Singer, 1996), does not accumulate even after a long period. A LIF neuron with a large γ is such an example; the effective noise strength D²/2γ decreases as γ increases. Furthermore, because M(t) is near the threshold for only a short period, A_1(t) tends to have just one local maximum in the spike initiation zone. This effect also suppresses jitter. As a consequence, trial synchrony is likely to appear. On the other hand, if a neuron remembers the past inputs, integration prevails (König et al., 1996), and jitter accumulates to broaden firing time distributions over the duration of many spikes. Moreover, if I(t) has temporally close local maxima, the firing time of such a neuron can be locked to different peaks of I(t) in different trials, since x(t) stays near the threshold during the passage of many maxima of I(t). This situation causes considerable accumulated jitter and induces trial asynchrony. Of course, this classification is not clear-cut but graded. The capability of remembering past inputs is characterized by the time window preceding a firing event in which the input really affects the membrane potential (Kempter, Gerstner, van Hemmen, & Wagner, 1998).

We are now in a position to identify two properties of a single neuron that control the spread of firing time distributions in the long run: (1) the variability of A_1(t) owing to the neuronal property with I(t) fixed and (2) the length of
1348
N. Masuda and K. Aihara
the history that the neuron remembers when the membrane potential has arrived near the threshold. Property 1 is generally evaluated by how A1(t) changes when I(t) is varied. For the LIF neuron, equation 3.1 shows that a change in I(t) directly modulates A1(t). Regarding this property, biological and model neurons can be classified into two classes based on the excitability that characterizes the performance of neurons as amplitude-to-frequency converters (Hodgkin, 1948; Izhikevich, 2000; Masuda & Aihara, 2002a). When Iext is gradually increased, a neuron with class I excitability starts firing at an infinitesimally low frequency. The dynamic range of the firing rate with respect to Iext is generally large (5–150 Hz). Firing of class I neurons is responsive to change in inputs. In contrast, class II neurons begin repetitive firing with a finite frequency as the bias increases. The firing frequency of class II neurons changes relatively little (75–150 Hz) compared with that of class I neurons, especially in the highly oscillatory regime where the bias is far beyond the threshold. As is explained later in this section, the responsiveness of these neurons depends on whether they are in the excitable or the oscillatory regime. However, this characterization of neuronal dynamics disregards how neurons process the signals received during the interspike interval. The subthreshold process, or the dynamics of the membrane potential, is also closely linked to information processing in the brain, especially when interspike intervals are relatively large in comparison with the typical input period. A neuron filters and synthesizes the detailed information on inputs in its own way if the neuron remembers the past. Property 2 is associated with this subthreshold process. Furthermore, the properties of the subthreshold process affect the conditions for synchronous firing (Mirollo & Strogatz, 1990; Bressloff & Coombes, 2000).
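Property 2, the length of remembered history, can be made concrete for the LIF neuron: an input arriving at time t contributes to the membrane potential at the next firing time T with weight proportional to e^{−γ(T−t)}, and the noise accumulated by the leaky integrator has stationary variance D²/2γ. The sketch below checks both numerically; the Euler-Maruyama discretization and all parameter values are our own illustrative choices, not the article's.

```python
import numpy as np

# A larger leak gamma (i) discounts old inputs more steeply via the weight
# exp(-gamma*(T - t)) and (ii) shrinks the accumulated noise toward the
# stationary variance D^2/(2*gamma) -- the "forgetful" (class B) regime.
rng = np.random.default_rng(0)
dt, D, T = 0.025, 0.05, 100.0
arrivals = np.array([10.0, 50.0, 90.0])        # hypothetical input times (ms)
results = {}
for gamma in (0.01, 0.04, 0.4):
    weights = np.exp(-gamma * (T - arrivals))  # relative contribution to x(T)
    # noise-driven leaky integrator, Euler-Maruyama scheme
    noise = rng.standard_normal(200_000)
    x, samples = 0.0, []
    for step in range(200_000):
        x += -gamma * x * dt + D * np.sqrt(dt) * noise[step]
        if step > 40_000:                      # discard the transient
            samples.append(x)
    results[gamma] = (weights, np.var(samples), D**2 / (2 * gamma))
for g, (w, v, th) in results.items():
    print(f"gamma={g}: weights={np.round(w, 3)}, var={v:.4f} (theory {th:.4f})")
```

With the smallest leak, the input from 90 ms in the past still carries appreciable weight; with the largest leak, only the most recent input matters and the residual noise is an order of magnitude smaller.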
To our knowledge, only one classification has been proposed based on the subthreshold dynamics, and it is used to evaluate the possibility of synchronization of weakly coupled neurons (Hansel, Mato, & Meunier, 1995). We introduce another classification directly related to property 2, that is, two classes of integrability:

• A neuron with class A integrability remembers past inputs, meaning that accumulated past inputs contribute more than instantaneous inputs or the instantaneous flow A1(t) to the membrane potential. In other words, the weight for integration is relatively flat over time. Consequently, the membrane potential and the firing probability are sensitive to jitter and to change in past inputs. Noise also accumulates to disperse the membrane potentials when the threshold is approached. Class A neurons prefer trial asynchrony.

• A neuron with class B integrability forgets past inputs as well as past noise. The membrane potential and the firing probability are chiefly determined by A1(t). Uneven weights for integration, which are typically realized by large potential leaks and refractory periods, result in
Ergodicity of Spike Trains
1349
input filtering, in which more recent inputs are endowed with more importance. Class B neurons prefer trial synchrony.

Integrability also specifies the performance of signal encoding with interspike intervals (Masuda & Aihara, 2002a). Interspike intervals encode the intensity of external stimuli only when the integration weight is fairly independent of the instantaneous membrane potential or time. This spatiotemporal spike encoding favors class A integrability. Our classification is different from but related to the neuron types proposed by Hansel et al. (1995). A type I neuron has a phase variable of oscillation (Mirollo & Strogatz, 1990) that always advances upon receiving an excitatory spike, and coupled type I neurons are proven to desynchronize in general. Type I neurons likely have class A integrability, since the contribution of inputs is fairly even with respect to the phase variable or the membrane potential, with positive inputs always raising the membrane potential. On the other hand, the phase variable of a type II neuron increases or decreases depending on the times of excitatory input, and coupled type II neurons ultimately synchronize. Type II neurons have uneven integration weights and tend to have class B integrability. However, the correspondence between classes of integrability and neuron types is not exact. Now let us consider four categories specified by the combination of sensitivity to input change and integrability, examples of which are shown below:

• Class I neurons are generally sensitive to change in inputs. Among them, IF neurons and LIF neurons with small leaks are class A integrable. For example, the firing probability of the IF neuron faithfully reflects the input strength, and it perfectly remembers past inputs, including dynamical noise, since the last firing time. The firing time is widely distributed, with multiple peaks locked to the local maxima of Iext (Bulsara et al., 1994).
• Among class I neurons, LIF neurons with large leaks are sensitive to change in inputs and are class B integrable. Equation 2.3 indicates that if we denote by T the next firing time, the input received at t < T contributes to x(T) with a weight proportional to e^{−γ(T−t)}. Moreover, the amount of noise accumulated during an interspike interval is approximately D²/2γ, which monotonically decreases in γ. In the limit γ → ∞, such neurons work as coincidence detectors that fire only when a certain number of incident spikes arrive through excitatory synapses within a short time window (Abeles, 1982; Softky & Koch, 1993; Fujii et al., 1996). Another example is a class II neuron, such as the Hodgkin-Huxley or FitzHugh-Nagumo neuron, in the excitable regime with inputs applied to a fast variable like the membrane potential. Such a neuron is past-forgetting because the dynamics is sensitive to inputs only when the internal state is slightly below the threshold (Knight, 1972; Masuda & Aihara, 2002a). Moreover, in the spike initiation zone, larger inputs are more likely to trigger firing, which indicates the responsiveness of firing time to input change.
Excitable class II neurons in the regime of stochastic resonance (Inchiosa & Bulsara, 1995; Lindner, Meadows, & Ditto, 1995; Aihara & Tokuda, 2002) also belong to this category. In this situation, the membrane potential hovers slightly below the threshold regardless of past inputs, and A1(t) mainly determines the firing time with the help of dynamical noise. Class I neurons whose membrane potential hovers near threshold until firing is elicited by large inputs (Tsodyks & Sejnowski, 1995; Mueller & Herz, 1999) are also included in this family for the same reason.

• The firing frequency of oscillatory class II neurons is little affected by I(t) because it is determined by the inherent frequency of the neuron (Izhikevich, 2000). These neurons become class A integrable if inputs are applied to a slow variable that determines the dominant timescale of the neuronal dynamics (Hu & Zhou, 2000; Zhou, Kurths, & Hu, 2001; also see sections 7.5 and 7.6). Similarly, bursting cells with a variable calcium current, such as the models for pancreatic β-cells and for Aplysia R-15 neurons (Keener & Sneyd, 1998), belong to this category. Other trivial examples are noise-driven class I neurons such as the random walk model (Ricciardi, 1977; Abeles, 1982; Tuckwell, 1988), in which the effect of I(t) is negligible owing to the neuronal property.

• Oscillatory class II neurons are insensitive to change in inputs and are class B integrable when the inputs are applied to the fast variable (Hansel et al., 1995; Masuda & Aihara, 2002a). These neurons are past-forgetting, as for the oscillatory class II neurons explained above. Furthermore, the firing frequency does not depend much on I(t), as for excitable class II neurons whose fast variables receive inputs. The orbits in the state space, and therefore the firing time, are insensitive to I(t).
Excitable class II neurons in the regime of coherence resonance (Pikovsky & Kurths, 1997; Wang, Chik, & Wang, 2000), where the membrane potential hovers slightly below the threshold regardless of past inputs, also belong to this category. In this case, the effect of A1(t) is very small, and dynamical noise triggers firing.

Whether the neuron operates in the excitable or the oscillatory regime is important for class II neurons but not for class I neurons. The amplitude-frequency curves of class I neurons have relatively constant slopes. Thus, the sensitivity to change in I(t) does not depend on I(t) or the bias strength. However, for class II neurons, the sensitivity is high with I(t) around the threshold, whereas it is low with I(t) far beyond or below the threshold.

To demonstrate the difference in the firing time distributions of class A and class B neurons, we numerically calculate the distributions of the first three firing times of single LIF neurons with absolute refractory periods. We bridge the class A and B integrabilities by modulating the refractory period τref. With the length of interspike intervals kept constant, small (or large) refractory periods indicate class A (or B) integrability. We set γ = 0.04 ms⁻¹ and Isyn = 0 in equation 2.1. The continuous noise is emulated by applying
the gaussian white noise with amplitude D0 = 0.020 at every integration step of dt = 0.025 ms. The sinusoidal input,

Iext(t) = ((80 + τref)/80) (4.9 · 10⁻² + 3.0 · 10⁻³ sin(2πt/Text)),

(3.3)

where Text = 12.0 ms, and the filtered gaussian noise input (Mainen & Sejnowski, 1995),

Iext(t) = ((80 + τref)/80) ∫₀ᵗ e^{(t′−t)/τ} dBt′,

(3.4)
where τ = 2.0 ms, are used to generate Figures 1A and 1B, respectively. The scaling factor (80 + τref)/80 in equations 3.3 and 3.4 is introduced to keep the firing rates approximately constant. In this way, we avoid the influence of firing rates on the possibility of trial synchrony, which is not dealt with in this article. As shown in Figure 1, the standard deviation of the third firing time is reduced to about one-third as the refractory period increases from 0 ms to 20.0 ms. This result agrees with the theoretical prediction that better trial synchrony is realized when the neurons are more class B integrable, with larger refractory periods. The classification of neuron models into four categories is summarized in Figure 2. Class B neurons with high sensitivity to change in inputs mostly prefer trial synchrony, and class A neurons that are ignorant of inputs mostly side with trial asynchrony. No definite conclusion is drawn on the possibility of trial synchrony for neurons in the other two categories. These cases are covered by examples in section 7 that take into account the influence of Isyn and Iext.

4 Effects of External Stimuli

In this section, we investigate the effects of time-dependent external stimuli Iext(t) on the dynamics of a single neuron with given properties. The synaptic input Isyn is assumed to be absent, as in section 3. As we have shown in section 3, A1(t) with larger variability makes the firing time more localized. Trial synchrony as a result of a highly fluctuating A1(t) is promoted by Iext with large variability, especially when the neuron is class B integrable and sensitive to changes in Iext. We numerically examine how the firing time distribution depends on the variability of Iext using the sinusoidal input,

Iext = 0.049 + amp · cos(2πt/Text),
(4.1)
where Text = 14.0 ms, γ = 0.04 ms⁻¹, D0 = 0.001 per dt = 0.025 ms, τref = 1.0 ms, and Isyn = 0. The variability of Iext is parameterized by amp.

Figure 1: Firing time variability of the first three spikes of single LIF neurons with various refractory periods. Each data point indicates the firing time within the one-sigma range, with the top, middle, and bottom lines corresponding to the third, second, and first firing times, respectively. The statistics are based on 100 trials with the same stimuli and the same initial conditions. Applied are the (A) sinusoidal external stimulus (see equation 3.3) and (B) external stimulus generated by the filtered gaussian noise (see equation 3.4).

The first three firing times with their respective one-sigma ranges are shown in Figure 3. A larger amp makes the jitter of firing times smaller by encouraging the neuron to fire near the local maxima of Iext. Our numerical results correspond to the previous finding that a large amp enhances synchronous firing, even in the absence of feedback coupling, in both class B model neurons (Knight, 1972; Hunter et al., 1998; Feng, Brown, & Li, 2000) and experiments (Bryant & Segundo, 1976; Mainen & Sejnowski, 1995; Hunter et al., 1998; Cecchi et al., 2000).
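The amplitude dependence underlying Figure 3 can be reproduced in a few lines. The sketch below simulates many noisy trials of a single LIF neuron and compares the across-trial standard deviation of the first firing time for a flat input versus a strongly modulated sinusoidal input; the threshold (θ = 1, reset to 0) and all parameter values are illustrative assumptions, not the article's settings.

```python
import numpy as np

# Across-trial jitter of the first firing time of a noisy LIF neuron,
# dx = (-gamma*x + I(t)) dt + D dB, with an assumed threshold theta = 1.
rng = np.random.default_rng(1)
gamma, theta, dt, D = 0.04, 1.0, 0.025, 0.02
n_trials, n_steps, T_ext = 200, 16000, 14.0
jitter = {}
for amp in (0.0, 0.3):
    x = np.zeros(n_trials)
    first = np.full(n_trials, np.nan)     # first spike time of each trial
    for k in range(n_steps):
        t = k * dt
        I = 0.036 + amp * np.sin(2 * np.pi * t / T_ext)
        x += (-gamma * x + I) * dt + D * np.sqrt(dt) * rng.standard_normal(n_trials)
        newly = (x >= theta) & np.isnan(first)
        first[newly] = t
        x[x >= theta] = 0.0               # reset after a spike
    jitter[amp] = np.nanstd(first)
print(jitter)  # the modulated input yields a far smaller first-spike jitter
```

With these settings, the flat-input trials fire at noise-determined times spread over tens of milliseconds, while the modulated trials lock to the first suprathreshold upswing of the stimulus.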
Ergodicity of Spike Trains
1353
Figure 2: Classification of neurons according to the sensitivity to input change and integrability.

Figure 3: Variability of firing times of the first three spikes (from bottom to top) of single LIF neurons with various dynamic ranges of Iext. Each data point indicates the firing time within the one-sigma range. The statistics are based on 100 trials with the same stimuli and the same initial conditions. The sinusoidal external stimulus generated by equation 4.1 is used.
Large variability of Iext does not necessarily ensure trial synchrony. When the local maxima of Iext are temporally close to each other, the firing time distribution can be multipeaked, with its peaks locked to the peaks of Iext. For a class A neuron receiving such Iext, the jitter is characteristically equal to the peak interval of Iext (Bulsara et al., 1994), and it accumulates over time to yield trial asynchrony. For trial synchrony, both separation of the effective local maxima of Iext and a highly fluctuating Iext are needed. This condition is met when the peaks of Iext are separated or the neuron is strongly class B integrable. As a realistic example, series of spike packet inputs separated by long quiescent intervals (Abeles, 1991; Fujii et al., 1996) satisfy these requirements. Trial synchrony with the firing times locked to peaks of Iext is also observed in experiments (Mainen & Sejnowski, 1995; Hunter et al., 1998; Cecchi et al., 2000).
5 Effects of Feedback Coupling

The results obtained in sections 3 and 4 do not differentiate between population synchrony and trial synchrony, since the distributions of spike trains have been discussed only in terms of single-neuron dynamics. Here, we consider the effects of feedback coupling on trial synchrony using coupled neurons. Multiple neurons interact by coupling, whereas multiple spike trains of a particular neuron from repeated trials never interact, even when coupling is present. Accordingly, we expect to draw a line between the mechanisms of trial synchrony and those of population synchrony. The effects of coupling on network behavior have been investigated extensively. The ultimate behavior depends on model details such as feedback strength and structure, transmission delay, degree of heterogeneity, and external inputs. Coupled neurons usually show characteristic behavior such as fully synchronous periodic firing, synchronous chaotic firing, clustered firing, or generalized synchronous firing in which the relative firing times of different neurons are fixed (Mirollo & Strogatz, 1990; Gerstner, Ritz, & van Hemmen, 1993; Hansel et al., 1995; Fujii et al., 1996; van Vreeswijk & Sompolinsky, 1998; Bressloff & Coombes, 2000; Brunel, 2000; Wang et al., 2000; Aihara & Tokuda, 2002; Gerstner & Kistler, 2002; Masuda & Aihara, 2002b, 2003a, 2003b; van Rossum et al., 2002). Clustered firing usually consists of clusters of neurons with fixed relative firing times and therefore accompanies generalized synchrony. These characteristic states can play essential roles in information processing in the brain.
For example, synchronous firing is related to feature binding (Gray et al., 1989) and stimulus recognition (Steinmetz et al., 2000), while generalized synchronous firing is reminiscent of oscillatory associative memories (Gerstner et al., 1993; Mueller & Herz, 1999), object separation (Terman & Wang, 1995; Masuda & Aihara, 2003b), and information transmission through synfire chains (Abeles, 1982, 1991; Prut et al., 1998; Diesmann et al., 1999). The spray
state with asynchronous firing also yields efficient population rate coding (Shadlen & Newsome, 1998; Masuda & Aihara, 2002b, 2003; van Rossum et al., 2002). To our knowledge, a theory is missing on how coupling affects the firing pattern distribution of a noisy single neuron in a network. To address this problem, we consider n coupled neurons and choose a reference neuron whose membrane potential xi0 = 0 at time t = 0. We assume that a final dynamical state is realized except for small deviations caused by dynamical noise. The crucial assumption that we make here is that a firing event of each neuron is induced by receiving feedback spikes from some of the other n − 1 neurons that are in generalized synchrony or in spray states in which Isyn remains almost constant over time. In both cases, the membrane potential is driven more strongly by Isyn than by Iext. In other words, the dynamics is driven internally (Aertsen et al., 1994; Araki & Aihara, 2001), and the dynamical states may be engaged in internal information processing, such as memory retrieval, rather than in the processing of external stimuli (Abeles, 1991; Gerstner et al., 1993). Various factors, such as the number of incident synapses, the feedback strength, the transmission delay, and the PSP, influence Isyn (van Vreeswijk & Sompolinsky, 1998; Brunel, 2000). The synchronous state leads to a peaked Isyn because the feedback spikes to the i0th neuron arrive nearly simultaneously, except for small jitter caused by the heterogeneous transmission delays (Abeles, 1991; Diesmann et al., 1999). In contrast, the spray state generates a flat Isyn. The synaptic inputs generated by generalized synchronous states lie between an almost constant Isyn and a peaked Isyn. The results of section 4 imply that a localized firing time distribution is produced by a highly variable Isyn as well as by a highly variable Iext. On the other hand, a flat Isyn derived from a spray state is tied to trial asynchrony.
Population synchrony and trial synchrony agree if we consider only the first firing time. However, Iext and Isyn have different effects on the firing time distribution in the long run, even if the temporal waveforms of Iext and Isyn are similar. To show this, let us suppose that the first firing time t1 of the i0th neuron since t = 0 has the PDF f(θ; t1 | 0, 0), where t = 0 is the previous firing time. After firing, the membrane potential is reset to xi0 = 0, stays there for the refractory period, and again follows the dynamics represented by equation 2.1. By assuming internally driven dynamics with generalized synchrony, it follows that

Isyn(t + t1) = Isyn(t),  (0 ≤ t).
(5.1)
With equations 2.1 and 5.1, we have, for t ≥ t1,

dx(t) = (−γ x(t) + Isyn(t − t1) + Iext(t)) dt + D dBt,  x(t1) = 0.
(5.2)
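Equation 5.2 can be probed with a minimal deterministic sketch: when the drive is a function of t − t1 (internally driven), an initial jitter δ in the firing time simply translates the whole trajectory, whereas a drive locked to absolute time (externally driven) lets the leak contract the jitter away. The threshold (θ = 1, reset to 0) and all parameter values below are illustrative assumptions.

```python
import numpy as np

# Noiseless LIF neuron with a sinusoidal drive. "internal" mode: the drive
# phase restarts at each spike (depends on t - t_last, as in equation 5.2);
# "external" mode: the drive depends on absolute time t.
gamma, theta, dt, Text, delta = 0.04, 1.0, 0.025, 50.0, 1.0

def spike_times(mode, t0, n_spikes=6):
    t, x, t_last, out = t0, 0.0, t0, []
    for _ in range(100_000):
        s = (t - t_last) if mode == "internal" else t
        x += (-gamma * x + 0.05 + 0.03 * np.sin(2 * np.pi * s / Text)) * dt
        t += dt
        if x >= theta:
            x, t_last = 0.0, t
            out.append(t)
            if len(out) == n_spikes:
                break
    return np.array(out)

# shift the start (the "jitter" delta) and compare later spike times
d_int = spike_times("internal", delta) - spike_times("internal", 0.0)
d_ext = spike_times("external", delta) - spike_times("external", 0.0)
print(d_int[-1], d_ext[-1])
```

The internally driven spike times stay shifted by the full δ = 1 ms even after many spikes (Isyn cannot serve as an absolute timer), while the externally driven shift shrinks toward zero.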
Assuming a constant Iext for simplicity, the distribution of the second firing time t2 is represented by

∫₀^{t2} f(θ; t1 | 0, 0) f(θ; t2 | 0, t1) dt1 = (f ∗ f)(θ; t2 | 0, 0),

(5.3)

where ∗ denotes the convolution. Repetition of the same procedure yields the following distribution of the kth firing time tk:
(f ∗ f ∗ · · · ∗ f)(θ; tk | 0, 0)  (k-fold convolution).

Unless f(θ; t1 | 0, 0) is the delta function, firing time jitter accumulates over time. For example, if f(θ; t1 | 0, 0) has variance σ² and is equipped with the reproducing property (e.g., the normal distribution), the distribution of tk has the same functional form but with variance kσ². Also for f(θ; t1 | 0, 0) with other shapes, the firing time PDF broadens as k increases. Consequently, the feedback interactions drive the network out of trial synchrony regardless of the structure of Isyn. The fact that the past jitter gives rise to a temporal shift of Isyn(t) reflects the inability of Isyn(t) to serve as an absolute timer. A nonconstant Iext, however, always preserves information on the absolute time. Also in experiments, firing is not locked to external stimuli when the feedback effect is dominant. The firing patterns drift from trial to trial even when the initial timing of spike trains from many trials is appropriately adjusted (Prut et al., 1998).

6 Competition of Three Factors and Ergodicity

We have identified three major factors that influence the firing pattern distributions in sections 3, 4, and 5. The single-neuron property stipulates the susceptibility of the firing time distribution to I(t) = Iext(t) + Isyn(t) and to the influence of past inputs. Then Iext and Isyn come into play to determine the actual firing pattern distributions. Unfortunately, combining these factors to gain further insight is mathematically difficult, and we therefore can state only that their balance matters to firing time distributions. In the externally driven dynamics with Iext dominant, trial synchrony appears if Iext and the neurons satisfy the conditions in sections 3 and 4. In the internally driven dynamics with Isyn dominant, jitter accumulates over the course of repetitive firing.
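The k-fold-convolution argument of section 5 can be checked directly: if each interspike interval is drawn independently from a density with variance σ², the kth firing time has variance kσ², so the jitter spreads like √k. A quick numerical sketch (the interval mean and standard deviation below are invented for illustration):

```python
import numpy as np

# The k-th firing time is a sum of k i.i.d. intervals, so its standard
# deviation grows as sqrt(k) * sigma.
rng = np.random.default_rng(2)
sigma, k, n_trials = 2.0, 9, 200_000
intervals = rng.normal(20.0, sigma, size=(n_trials, k))  # ISIs per trial
t_k = intervals.sum(axis=1)                              # k-th firing time
print(np.std(t_k), np.sqrt(k) * sigma)                   # both close to 6.0
```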
The duality of the two modes may also underlie the selection of the information coding mode, where the externally driven dynamics reflects external stimuli and the internally driven dynamics reflects the structure of synaptic weights (Araki & Aihara, 2001). Figure 4 shows how competition between Iext and Isyn in a common network influences the ergodicity. We use 100 pulse-coupled LIF neurons with
Figure 4: Normalized Iext (thin solid line), population firing rate (thick solid line), and trial firing rate (dotted line) of 100 pulse-coupled LIF neurons under (A) externally driven dynamics with A = 0.97 and ε = 0.003 and (B) internally driven dynamics with A = 0.84 and ε = 0.018. The presented firing rates are the proportion of neurons that fire within each bin of width 2.5 ms. The statistics are based on 100 trials with the same stimuli and the same initial conditions.
γ = 0.04 ms⁻¹ and D0 = 0.0008 per dt = 0.025 ms. The external stimulus is defined by

Iext = A (4.9 · 10⁻² + 5.6 · 10⁻³ sin(2πt/Text)),
(6.1)
where Text = 14.0 ms, and A specifies the strength of Iext relative to Isyn. When an LIF neuron fires, it sends instantaneous spikes to 30 randomly selected neurons with amplitude ε and a delay of 2.0 ms. The normalized Iext, the population firing rate, and the trial firing rate with A = 0.97 and
ε = 0.003 are shown in Figure 4A. In this situation, Iext governs the dynamics, and trial synchrony is realized in accordance with our prediction. However, Figure 4B, with A = 0.84 and ε = 0.018, shows that Isyn dominates the network behavior, and trial asynchrony emerges even though the neurons fire synchronously.

To compare the statistics of spike trains from many neurons with those from many trials, we begin by elucidating the types of population behavior on which we shed light. As explained in section 5, synchrony, generalized synchrony including clustered states, and full asynchrony are of particular interest from the mathematical and information-processing points of view. Generalized synchrony with correlated firing bridges the two extreme cases: synchrony and full asynchrony. Let us explore the corresponding concepts for trial statistics. Two simple cases, trial synchrony and trial asynchrony, have already been noted. Trial synchrony, or reproducibility, is necessary and sufficient for the ergodicity of synchronously firing neurons. For a population of asynchronously firing neurons, the notion of ergodicity depends on what kinds of quantities are scrutinized. The most rigorous ergodicity requires complete agreement between the set of spike trains from many trials and that from many neurons, excluding noise-induced error (also see section 1). At the other extreme, only firing rates may carry information. However, information is often contained in more detailed structures of spike trains. For example, the correlation of firing times in a single spike train can store stimulus information (Bialek et al., 1991). If generalized synchrony is actually present among different neurons and is engaged in information processing, simultaneous recording of the neurons involved in this correlated firing, instead of single-neuron recordings, is necessary.
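A sketch in the spirit of the Figure 4 setup makes the two regimes tangible. In the code below, only the population size, fan-out, delay, and the A and ε values follow the description above; the threshold (θ = 1), initial conditions, noise handling, and random wiring are our own assumptions.

```python
import numpy as np

# 100 pulse-coupled LIF neurons; each spike is delivered to 30 random
# targets with amplitude eps after a 2 ms delay (a ring buffer holds the
# in-flight spikes). Threshold, noise level, and wiring are assumptions.
rng = np.random.default_rng(3)
n, gamma, theta, dt = 100, 0.04, 1.0, 0.025
delay_steps, n_steps = int(2.0 / dt), int(400.0 / dt)

def run(A, eps):
    targets = [rng.choice(n, 30, replace=False) for _ in range(n)]
    x = rng.uniform(0.0, theta, n)
    buf = np.zeros((delay_steps, n))          # delayed synaptic pulses
    rates = []
    for k in range(n_steps):
        t = k * dt
        Iext = A * (4.9e-2 + 5.6e-3 * np.sin(2 * np.pi * t / 14.0))
        Isyn = buf[k % delay_steps].copy()    # pulses arriving now
        buf[k % delay_steps] = 0.0
        x += (-gamma * x + Iext) * dt + Isyn + 8e-4 * rng.standard_normal(n)
        fired = np.where(x >= theta)[0]
        for i in fired:                       # schedule delayed feedback
            buf[k % delay_steps][targets[i]] += eps
        x[fired] = 0.0
        rates.append(fired.size / n)
    return np.array(rates)

r_ext = run(A=0.97, eps=0.003)   # stimulus-dominated regime (cf. Figure 4A)
r_int = run(A=0.84, eps=0.018)   # coupling-dominated regime (cf. Figure 4B)
print(r_ext.mean(), r_int.mean())
```

With these illustrative settings, firing in the first run tends to lock to the sinusoid and hence reproduces across trials, while in the second run recurrent pulses sustain population events whose timing drifts from trial to trial.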
The asynchronous state is characterized only by a flat firing rate once we adopt the firing rate as the only carrier of information (Gerstner & Kistler, 2002). Accordingly, we disregard relations between precise spike times of neurons that may be present although not evident from the experimental or numerical data at hand. It then becomes possible to examine ergodicity based solely on the agreement of the trial firing rate of one neuron and the population firing rate of one trial. Next, we relate generalized synchrony and its functional meanings to the trial statistics. When information is processed in such a dynamical state, knowledge of the neuron firing order and of which neurons are firing synchronously to form a dynamical cell assembly (Fujii et al., 1996) is important. As an example, let us imagine n neurons arranged on a ring. If the kth neuron (1 ≤ k ≤ n) on the ring fires at t = k/n, 1 + k/n, 2 + k/n, . . . , the population firing rate is constant in time, as in the spray state (Gerstner et al., 1993). A traveling wave appears, however, and nobody would regard this state as functionally meaningless. Generalized synchronous states are believed to be engaged in temporal spike coding that realizes a rich variety of functions such as feature binding (Gray et al., 1989; Terman & Wang, 1995; Fujii et
al., 1996), oscillatory associative memories (Gerstner et al., 1993; Mueller & Herz, 1999), and information transmission via synfire chains (Abeles, 1982, 1991; Prut et al., 1998; Diesmann, Gewaltig, & Aertsen, 1999). The degree of generalized synchrony and its temporal modulation also reflect the responsiveness to stimuli (Aertsen et al., 1989; Vaadia et al., 1995; Steinmetz et al., 2000). These dynamical states strongly reflect nontrivial coupling structure as well as external stimuli (Aertsen et al., 1994; Araki & Aihara, 2001). How is generalized synchrony related to ergodicity? First, owing to dynamical noise and the absence of interactions between spike patterns from different trials, there is nothing like generalized trial synchrony. When many trials of multiple-neuron recording are available, generalized synchrony in all trials is not permitted in systems of unreliable neurons in noisy environments. Instead, generalized synchrony is manifested by trial averaging (Aertsen et al., 1989, 1994; Gray et al., 1989; Vaadia et al., 1995; Prut et al., 1998). The ergodicity of generalized synchronous states should then mean the following. Let us imagine two recorded neurons X and Y whose spike trains have a particular temporal relationship. Then each of X and Y should accompany a local assembly of neurons firing more or less in the same manner as X or Y. The trial averaging of some statistics of spike train pairs from X and Y must agree with the population averaging of the statistics of neuron pairs, where one neuron is a neighbor of X and the other neuron is a neighbor of Y. The ergodicity regarding higher-order statistics involving more than two neurons is defined similarly. In some models with correlated firing, a small number of single spikes plays an essential role in triggering firing (Gerstner et al., 1993; Fujii et al., 1996).
However, a spike with such a large effect can be considered as the summation of spike trains from the neurons in a local assembly, since the contribution of a single spike to the membrane potential change is usually small (Abeles, 1991; Tsodyks & Sejnowski, 1995; Shadlen & Newsome, 1998; Diesmann et al., 1999). These models implicitly suppose synchronous firing of nearby neurons. Therefore, population synchrony within each local assembly around X or Y (or further recorded neurons) is also required for the ergodicity of the generalized synchronous states. Conversely, once a generalized synchronous state is given, trial synchrony, at least after temporally aligning the spike trains from all trials with respect to the time of a reference spike of a reference neuron (Aertsen et al., 1989, 1994; Vaadia et al., 1995; Fujii et al., 1996), is required for ergodicity. If this is not the case, correlated firing patterns manifest themselves only in each trial (Gerstner et al., 1993), and they degrade in time if superimposed over trials. The generalized synchrony then cannot be detected by trial averaging. On the basis of the results in sections 3, 4, and 5, we list the factors that affect population dynamics and trial statistics, and hence ergodicity. The
competition of these factors determines the actual states:

Responsiveness to inputs: Neurons sensitive (or insensitive) to input change tend to have the firing time distribution locked (or unlocked) to Iext and facilitate trial synchrony (or asynchrony).

Integrability of single neurons: Class B neurons prefer trial synchrony. Moreover, coupled class B neurons tend to synchronize because Iext entrains the firing times of neurons so that they are locked to the peaks of Iext, especially when Iext is uneven (Knight, 1972; Feng et al., 2000). The absorption of phase variables, which promotes population synchrony, is also enhanced with class B neurons (Mirollo & Strogatz, 1990). In contrast, coupled class A neurons tend to show trial asynchrony and population asynchrony.

External stimuli: An Iext with large variability and sufficiently separated local maxima encourages trial synchrony. This type of Iext, together with neurons that are highly sensitive to change in Iext, also elicits synchronous firing with firing times locked to the maxima of Iext (Knight, 1972; Feng et al., 2000). When Iext is rather flat or when its peaks congregate, trial asynchrony and population asynchrony (or possibly generalized synchrony) are likely to occur.

Synaptic inputs: Trial asynchrony appears when the contribution of Isyn is dominant. In contrast, feedback coupling generally leads to population synchrony or generalized synchrony (Mirollo & Strogatz, 1990; Hansel et al., 1995; van Vreeswijk & Sompolinsky, 1998; Bressloff & Coombes, 2000; Brunel, 2000; Gerstner & Kistler, 2002).

Dynamical noise: A large D obviously results in trial asynchrony and population asynchrony. When D gradually increases, trial asynchrony appears prior to population asynchrony (see section 7.4). This occurs because the feedback coupling keeps the neurons synchronized against noise, whereas there are no interactions between spike trains from different trials.
Here, we refer only to dynamical noise that is independent for different neurons. Noisy inputs common to different neurons should be regarded as a part of Iext with high variability, which tends to lead to trial and population synchrony. Figure 5 summarizes the effects of these factors on the population dynamics and the possibility of trial synchrony. A neural system is obviously nonergodic when population synchrony (or asynchrony) and trial asynchrony (or synchrony) occur simultaneously (see the shaded regions in Figure 5). We must be careful when interpreting results based on trial averaging in these cases. However, when both population synchrony and trial synchrony are realized, ergodicity holds. With population asynchrony and trial asynchrony, ergodicity holds as far as the firing rates are concerned. If we are interested in higher-order statistics, we must uncover, using multiple-neuron
Ergodicity of Spike Trains

Figure 5: Relation between the population dynamics and the statistics of spike trains resulting from many trials. [Figure: a grid of population dynamics (synchrony / generalized synchrony / asynchrony) versus trial statistics (synchrony / asynchrony), organized by Iext and neuronal properties (Iext with bumpy and separated peaks, sensitive to Iext, class B, versus Iext with flat and close peaks, insensitive to Iext, class A), Isyn (small versus large), and noise (small versus large).]
recording, generalized synchronous states out of seemingly asynchronous spike trains.

7 Examples

In this section, ergodicity of several models is examined, and the results are interpreted in the light of our theoretical results. The results provide us with clear views on what numerical simulations of models are measuring.

7.1 Oscillatory Associative Memories. Associative memories based on Hebbian learning can be constructed with networks of spiking neurons such as spike response models (Gerstner et al., 1993) and LIF neurons (Mueller & Herz, 1999). These models exhibit more biologically plausible behavior than conventional associative memories with coupled binary elements at discrete times. As a result of learning, associative memories can memorize periodic spatiotemporal firing patterns with structured synaptic weights and delays. We discuss the model by Gerstner et al. (1993). They have shown that in the autonomous mode after learning, the neurons are in generalized synchrony and can retrieve memory patterns regardless of noise. The firing patterns are reproducible over trials without dynamical noise. However, dynamical noise realizes trial asynchrony by making the
N. Masuda and K. Aihara
Figure 6: Detectors of periodic spike train inputs constructed by coincidence detector neurons.
spatiotemporal firing patterns drift from trial to trial while maintaining generalized synchrony in each trial. As discussed in section 6, the ergodicity problem of generalized synchronous states involves local assemblies of neurons around each neuron in associative memory. In the model, a local population of neurons is simply represented by one neuron. However, this model still motivates us to discuss the possibility of trial synchrony. In the retrieval mode, the neurons are excitable and fire only when enough synaptic inputs are received. The external stimulus Iext just triggers the retrieval process, and Iext is turned off once the neurons begin to fire. Iext contributes little to the dynamics in the long run, and Isyn, which stores the memory patterns, governs the dynamics. As the results in section 5 predict, noisy networks with a strong Isyn realize trial asynchrony.

7.2 Periodic Firing Pattern Detector. We analyze detectors of periodic spike train inputs constructed by coupled coincidence detector neurons with the function depicted in Figure 6 (Fujii et al., 1996). To illustrate the mechanism briefly, let us consider a periodic spike train input with interspike intervals T1, T2, T3, T1, T2, T3, T1, ... (T1, T2, T3 > 0) common to all the neurons. Each coincidence detector neuron fires when two spikes are received within a short time window. Suppose that the transmission delays from neuron X to neuron Y, from neuron Y to neuron Z, and from neuron Z to neuron X in Figure 6 are T1, T2, and T3, respectively. If one makes X fire by a trigger spike and the first spike of the external input, Y fires after time T1 because of a spike from X and another spike from the external input. The
neuron Z fires later at T1 + T2, and X, Y, and Z repeat firing alternately in a correlated manner. This architecture enables simultaneous activation of multiple dynamical cell assemblies, each of which corresponds to one stimulus or memory pattern. This may be a key to solving the binding and superposition-catastrophe problem (Fujii et al., 1996). Since the spike trains from X, Y, and Z are in generalized synchrony, the ergodicity problem again reduces to that of the local populations around X, Y, and Z. Realistically, a single spike in the model should be replaced by a temporally localized spike packet comprising the spikes from nearby neurons. Regarding the trial statistics, the absolute firing times do not drift because otherwise the network would fail to recognize the periodic inputs. Trial synchrony therefore emerges because Iext has large variability and separated peaks and because coincidence detectors are class B integrable and sensitive to input change. The effect of Iext overrides that of Isyn in the sense that precisely timed Iext is necessary for firing.

7.3 Synfire Chain. In the synfire chain model, information is propagated with a precision of about 1 ms in the form of volleys of synchronous firing from assembly to assembly in feedforward networks with possible partial feedback connections (Abeles, 1982, 1991; Prut et al., 1998; Diesmann et al., 1999). Synfire chains can be regarded as generalized synchronous states, and we therefore discuss only the possibility of trial synchrony, as in sections 7.1 and 7.2. With a fairly homogeneous delay structure, Isyn mainly triggers successful transmission of synchronous firing (Abeles, 1991; Diesmann et al., 1999). The only role of Iext is to trigger the initial synchronous volley.
Therefore, long-term firing patterns can drift from trial to trial even though the synfire chain models are constructed from neurons that prefer trial synchrony, such as coincidence detector neurons (Abeles, 1991) or LIF neurons (Diesmann et al., 1999; van Rossum et al., 2002). Reproducible correlated firing accompanied by interspike intervals longer than 100 ms and by small jitters of around 1 ms is found in quite a few neuron pairs (Prut et al., 1998). However, trial synchrony is not detected once spatiotemporal firing patterns are aligned in time with respect to the stimulus. The absolute firing times of correlated firing are likely to shift from trial to trial, which means that this model is nonergodic.

7.4 Coupled Leaky Integrate-and-Fire Neurons. Pulse-coupled LIF neurons fire either synchronously or asynchronously, depending on the noise intensity (Masuda & Aihara, 2002b; van Rossum et al., 2002) or the values of model parameters (Masuda & Aihara, 2003a). Regarding LIF neurons, we have already discussed the effects of integrability, the variability of Iext, and the balance between Iext and Isyn with Figures 1, 3, and 4, respectively. Those results have guaranteed that class B neurons with highly variable Iext, which overrides Isyn, realize trial synchrony. Here we examine the effects of noise on population synchrony and trial synchrony to inspect ergodicity.
We numerically simulate a network of 100 LIF neurons with γ = 0.04 ms⁻¹ and different noise intensities D0 per dt = 0.025 ms. As in Figure 4, each neuron sends feedback spikes with amplitude ε = 0.008 and a delay of 0.8 ms to 30 neurons. We use equation 6.1 with A = 0.92 to generate the external stimulus. As shown in Figures 7A and 7B, the neurons fire synchronously with small dynamical noise. With sufficiently small noise, as in Figure 7A, trial synchrony is also achieved and makes the dynamics ergodic. However, Figure 7B shows that as the noise intensity increases, trial asynchrony emerges earlier than population asynchrony, resulting in nonergodicity (see also section 6 and Figure 5). Population synchrony is more easily realized owing to the feedback coupling. Figure 7C shows that the firing of the neurons eventually desynchronizes with still larger noise. In this situation, the population firing rate restores a relatively precise temporal waveform of Iext (Masuda & Aihara, 2002b, 2003a; van Rossum et al., 2002), and the population firing rate and the trial firing rate roughly agree, recovering ergodicity at least at a coarse level.

7.5 Stochastic Resonance. The stochastic resonance (SR) observed in biological systems such as crayfish neurons (Douglass, Wilkens, Pantazelou, & Moss, 1993) and the cat visual cortex (Anderson, Lampl, Gillespie, & Ferster, 2000) suggests that noise plays a positive role in neural networks. Excitable model neurons such as IF neurons (Bulsara et al., 1994), bistable elements (Inchiosa & Bulsara, 1995), and nonlinear oscillators (Collins, Chow, & Imhoff, 1995; Lindner et al., 1995; Aihara & Tokuda, 2002) also show the SR. In addition, the SR is enhanced when neurons are coupled globally (Inchiosa & Bulsara, 1995; Aihara & Tokuda, 2002) or locally on one-dimensional arrays (Lindner et al., 1995). Here, we numerically examine the ergodicity of coupled neurons in the SR regime.
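As a rough sketch of the pulse-coupled LIF simulation of section 7.4, a minimal implementation could look as follows. It uses the parameters quoted above (γ, ε, dt, the 0.8 ms delay, and 30 random targets per neuron); the sinusoidal stand-in stimulus, the per-step noise scaling, and all variable names are our assumptions, since equation 6.1 is not reproduced in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters from section 7.4; the stimulus below is a stand-in sinusoid,
# not equation 6.1, and the per-step noise scaling is an assumption.
N, K = 100, 30          # neurons, outgoing connections per neuron
gamma = 0.04            # leak rate (1/ms)
eps = 0.008             # synaptic pulse amplitude
dt = 0.025              # time step (ms)
delay = int(0.8 / dt)   # transmission delay in steps
D0 = 0.001              # noise intensity per step (Figure 7A value)
T = int(200 / dt)       # simulate 200 ms

# Each neuron projects to K randomly chosen targets.
targets = np.array([rng.choice(N, K, replace=False) for _ in range(N)])

v = rng.random(N)                 # membrane potentials, threshold 1, reset 0
buf = np.zeros((delay + 1, N))    # ring buffer of delayed synaptic pulses
spikes = []                       # (step, neuron) pairs

for t in range(T):
    I_ext = 0.06 * (1.0 + np.sin(2 * np.pi * t * dt / 50.0))  # stand-in stimulus
    v += dt * (-gamma * v + I_ext) + buf[t % (delay + 1)] \
         + np.sqrt(D0) * rng.standard_normal(N)
    buf[t % (delay + 1)] = 0.0                 # slot is free for reuse
    fired = np.where(v >= 1.0)[0]
    for i in fired:
        spikes.append((t, i))
        # pulse arrives delay steps later, at slot (t + delay) mod (delay + 1)
        np.add.at(buf[(t + delay) % (delay + 1)], targets[i], eps)
    v[fired] = 0.0                             # reset after firing

# Population firing rate in 2.5 ms bins, as in Figure 7.
bins = np.arange(0, T + 1, int(2.5 / dt))
counts, _ = np.histogram([s[0] for s in spikes], bins=bins)
rate = counts / (N * 2.5e-3)                   # spikes per neuron per second
```

Sweeping D0 over the three values of Figure 7 and comparing the population firing rate across repeated runs would qualitatively reproduce the transition from ergodic synchrony to nonergodic trial asynchrony described above.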
We prepare 120 diffusively coupled FitzHugh-Nagumo neurons arranged in an array (Hu & Zhou, 2000; Zhou et al., 2001) and apply a sinusoidal Iext. The dynamics of the ith neuron (1 ≤ i ≤ 120) are given by

α ẋi = xi − xi³/3 − yi + Isyn,   (7.1)
ẏi = xi + ai + Iext + Dξi(t),   (7.2)
Iext = 0.04 sin(0.25πt),   (7.3)
Isyn = g(xi+1 + xi−1 − 2xi),   (7.4)
where xi is the membrane potential, yi is the slow variable, and Dξi is the dynamical noise, emulated by white gaussian noise with amplitude D0 = 0.3 applied per dt = 0.002. We set α = 0.01 and g = 0.045. Periodic boundary conditions are imposed by identifying x0 and x121 with x120 and x1, respectively. For |ai| > 1, the neuron is excitable and converges to a stable equilibrium point if Iext, Isyn, and the dynamical noise are absent. For |ai| < 1,
Figure 7: Normalized external stimulus Iext (thin solid line), population firing rate (thick solid line), and trial firing rate (dotted line) of 100 pulse-coupled LIF neurons when the noise intensity D0 varies. (A) D0 = 0.001, (B) D0 = 0.0035, and (C) D0 = 0.007. The statistics are based on 100 trials with the same stimuli and the same initial conditions. The bin width for calculating firing rates is 2.5 ms. [Figure: three panels of firing rates over 0-400 ms.]
the neuron converges to a stable limit cycle (Hu & Zhou, 2000; Zhou et al., 2001). Because of equation 7.3 and our choice of ai = 1.05 for all i, the noise is necessary for firing. The FitzHugh-Nagumo neurons in the SR regime are sensitive to change in inputs and are class B integrable, as explained in Figure 2. Consequently, we expect trial synchrony. Without feedback coupling, the population firing rate is locked to Iext (Collins et al., 1995), showing that A1(t) or Iext, rather than the history of inputs or the states (xi, yi), determines the firing time. Moreover, the diffusive coupling introduced in equation 7.4 improves both population synchrony and the SR (Inchiosa & Bulsara, 1995; Lindner et al., 1995; Aihara & Tokuda, 2002). Figure 8A shows that both population synchrony and trial synchrony are actually realized. As a result, this system is ergodic.

7.6 Coherence Resonance. The coherence resonance (CR) is a phenomenon in which interspike intervals are regularized by a moderate amount of noise even when Iext is absent (Pikovsky & Kurths, 1997). The CR, like the SR, is enhanced when neurons, such as Hodgkin-Huxley and FitzHugh-Nagumo neurons, in the excitable regime are coupled globally (Wang et al., 2000) or locally in an array (Hu & Zhou, 2000; Zhou et al., 2001). In contrast to the SR, Isyn is dominant in the CR regime. Figure 2 suggests that the distribution of spike trains with respect to trials spreads even though the neurons fire synchronously owing to coupling. Our numerical simulations are performed with the neural network described by equations 7.1, 7.2, and 7.4 and Iext = 0. The same parameter values as in section 7.5 are used except D0 = 0.5 per dt = 0.002. Figure 8B shows the population firing rate and the trial firing rate in the CR regime. Although the neurons fire synchronously, the absolute times of synchronous firing drift from trial to trial.
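The array of equations 7.1 to 7.4 can be integrated with a simple Euler-Maruyama scheme, as in the following sketch. The integration scheme, the initial conditions, and the √(D0·dt) noise scaling are our assumptions (the text specifies D0 per dt without fixing the discretization).

```python
import numpy as np

rng = np.random.default_rng(1)

# Euler-Maruyama integration of equations 7.1-7.4 for 120 diffusively
# coupled FitzHugh-Nagumo neurons on a ring (parameter values from
# section 7.5; scheme, noise scaling, and initial conditions are ours).
N = 120
alpha, g = 0.01, 0.045
a = 1.05                      # excitable regime, |a| > 1
D0, dt = 0.3, 0.002
steps = int(40.0 / dt)        # simulate 40 time units

x = -1.5 * np.ones(N)         # membrane potentials near rest
y = x - x**3 / 3              # slow variables started on the x-nullcline

for n in range(steps):
    t = n * dt
    I_ext = 0.04 * np.sin(0.25 * np.pi * t)
    # Diffusive coupling with periodic boundaries (x_0 = x_120, x_121 = x_1).
    I_syn = g * (np.roll(x, -1) + np.roll(x, 1) - 2.0 * x)
    dx = (x - x**3 / 3 - y + I_syn) / alpha
    dy = x + a + I_ext
    x = x + dt * dx
    y = y + dt * dy + np.sqrt(D0 * dt) * rng.standard_normal(N)
```

Setting I_ext = 0 and raising D0 to 0.5 turns the same sketch into the CR setup of section 7.6.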
The system is nonergodic, and dynamical noise prohibits us from obtaining useful information on the population activity by single-neuron recording.

8 Discussion

We discussed in section 7 the ergodicity of several model neural networks. Figure 9 summarizes the classification of these models in terms of the population dynamics and the possibility of trial synchrony. If a mathematical model satisfactorily reproduces experimental data and describes the underlying mechanisms, we can speculate on the ergodicity or nonergodicity of biological spike trains based on the model. Unfortunately, proper models are unavailable for many experiments, and our ultimate goal is to evaluate the validity of trial averaging of experimentally observed spike trains without resorting to models, which is beyond the scope of this article. Here, we briefly relate our results to experimental schemes. The factors presented in Figure 5 underlie the classification of biological neural systems as well as the example models classified in Figure 9. We discuss
Figure 8: Population firing rates (solid line) and trial firing rates (dashed line) of 120 diffusively coupled FitzHugh-Nagumo neurons in an array in the regime of (A) SR and (B) CR. The bin width for calculating firing rates is 0.25. The statistics are based on 120 trials with the same stimuli and the same initial conditions. [Figure: two panels of firing rates over times 0-0.12.]
only the influence of Iext and the balance between Iext and Isyn, which are easily accessible in experiments. Visual or auditory stimuli, for example, generate continuous Iext with spatiotemporal structure, and its variability is modulated by the stimulus contrast (Mainen & Sejnowski, 1995; Hunter et al., 1998; Cecchi et al., 2000). Correlated noise to the neurons due to ongoing background activity also contributes to variability (Arieli et al., 1996; Shadlen & Newsome, 1998). In behavioral tasks, sequences of signals that trigger specific responses of an animal can also be regarded as Iext with large variability and separated peaks. If external cues are presented frequently enough, the spike trains tend to be locked to the peaks of Iext,
Figure 9: Classification of neural network models in terms of population dynamics and possibility of trial synchrony. [Figure: a grid of population dynamics (synchrony / generalized synchrony / asynchrony) versus trial statistics (synchrony / asynchrony) placing the SR, CR, LIF (small, moderate, and large noise), periodic pattern detector, synfire chain, and associative memory (with and without noise) models.]
and characteristic firing patterns such as trial synchrony are likely to be found (Aertsen et al., 1994; Vaadia et al., 1995). As indicated in Figure 5, the cues serve as absolute timers to regulate spike timing and encourage trial synchrony, population synchrony, and hence ergodicity. When Iext is regarded as flat or absent, however, the firing pattern distributions tend to broaden. If population synchrony is actually realized because of Isyn in such experiments, the results of population dynamics and trial averaging disagree. Accordingly, we should not regard the flat firing patterns obtained by trial averaging as meaningless. In behavioral tasks with random retention periods, animals cannot predict when the cue appears. In this situation, the animals must rely on the internal dynamics of the brain to count time or generate the responses. No absolute timer is available during this period, and flat firing rates obtained by trial averaging could miss important information on population synchrony or generalized synchrony. This caveat also applies to the case of long retention periods and internal information processing associated with Isyn, such as memory retrieval in progress (Prut et al., 1998; Araki & Aihara, 2001). Properly designed multiple-neuron recording is necessary to treat these cases (Prut et al., 1998). Conversely, in nonergodic cases with trial synchrony and population asynchrony, we would be misled to presume synchronous firing and relevant information processing such as feature binding if we supposed ergodicity. From another perspective, although temporal spike coding in the sense of synchrony does not work in this regime, this nonergodic situation suggests the existence of temporal spike coding with reproducible spike trains from a single neuron (Bialek et al., 1991; Mainen & Sejnowski, 1995). Examining the validity of trial averaging with experimental data is our future problem. Possible difficulties are the identification of coupling structure, the evaluation of background noise and initial conditions (Gerstner et al., 1993; Arieli et al., 1996), and the consideration of misfiring (Brunel, 2000). In analyzing behavioral task data, it is advisable to start with characterizing Iext and its importance relative to Isyn.

Acknowledgments

We thank H. Fujii and M. Watanabe for helpful discussions. This work is supported by the Japan Society for the Promotion of Science.

References

Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Israel Journal of Medical Science, 18, 83–92.
Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Aertsen, A., Erb, M., & Palm, G. (1994). Dynamics of functional coupling in the cerebral cortex: An attempt at a model-based interpretation. Physica D, 75, 103–128.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61(5), 900–917.
Aihara, K., & Tokuda, I. (2002). Possible neural coding with inter-event intervals of synchronous firing. Phys. Rev. E, 66, 026212.
Anderson, J. S., Lampl, I., Gillespie, D. C., & Ferster, D. (2000). The contribution of noise to contrast invariance of orientation tuning in cat visual cortex. Science, 290, 1968–1972.
Araki, O., & Aihara, K. (2001). Dual information representation with stable firing rates and chaotic spatiotemporal spike patterns in a neural network model. Neural Computation, 13, 2799–2822.
Arieli, A., Sterkin, A., Grinvald, A., & Aertsen, A. (1996). Dynamics of ongoing activity: Explanation of the large variability in evoked cortical responses. Science, 273, 1868–1871.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Bressloff, P. C., & Coombes, S. (2000). Dynamics of strongly coupled spiking neurons. Neural Computation, 12, 91–129.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Bryant, H. L., & Segundo, J. P. (1976). Spike initiation by transmembrane current: A white-noise analysis. J. Physiol., 260, 279–314.
Bulsara, A. R., Lowen, S. B., & Rees, C. D. (1994). Cooperative behavior in the periodically modulated Wiener process: Noise-induced complexity in a model neuron. Phys. Rev. E, 49(6), 4989–5000.
Cecchi, G. A., Sigman, M., Alonso, J.-M., Martinez, L., & Chialvo, D. R. (2000). Noise in neurons is message dependent. Proc. Nat. Acad. Sci. USA, 97, 5557–5561.
Collins, J. J., Chow, C. C., & Imhoff, T. T. (1995). Stochastic resonance without tuning. Nature, 376, 236–238.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
Douglass, J. K., Wilkens, L., Pantazelou, E., & Moss, F. (1993). Noise enhancement of information transfer in crayfish mechanoreceptors by stochastic resonance. Nature, 365, 337–340.
Feng, J., Brown, D., & Li, G. (2000). Synchronization due to common pulsed input in Stein's model. Phys. Rev. E, 61(3), 2987–2995.
Fujii, H., Ito, H., Aihara, K., Ichinose, N., & Tsukada, M. (1996). Dynamical cell assembly hypothesis—theoretical possibility of spatio-temporal coding in the cortex. Neural Networks, 9(8), 1303–1350.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. M. H. J. (1989). Neuronal assemblies. IEEE Trans. Biomedical Engineering, 36(1), 4–14.
Gerstner, W., & Kistler, W. M. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Gerstner, W., Ritz, R., & van Hemmen, J. L. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biol. Cybern., 69, 503–515.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 307–337.
Hodgkin, A. L. (1948). The local electric changes associated with repetitive action in a non-medullated axon. J. Physiol., 107, 165–181.
Hu, B., & Zhou, C. (2000). Phase synchronization in coupled nonidentical excitable systems and array-enhanced coherence resonance. Phys. Rev. E, 61(2), R1001–R1004.
Hunter, J. D., Milton, J. G., Thomas, P. J., & Cowan, J. D. (1998). Resonance effect for neural spike time reliability. J. Neurophysiol., 80, 1427–1438.
Inchiosa, M. E., & Bulsara, A. R. (1995). Nonlinear dynamic elements with noisy sinusoidal forcing: Enhancing resonance via nonlinear coupling. Phys. Rev. E, 52(1), 327–339.
Izhikevich, E. M. (2000). Neural excitability, spiking and bursting. Int. J. Bifurcation and Chaos, 10(6), 1171–1266.
Keener, J., & Sneyd, J. (1998). Mathematical physiology. New York: Springer-Verlag.
Kempter, R., Gerstner, W., van Hemmen, J., & Wagner, H. (1998). Extracting oscillations: Neuronal coincidence detection with noisy periodic spike input. Neural Computation, 10, 1987–2017.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. Journal of General Physiology, 59, 734–766.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends in Neurosciences, 19, 130–137.
Lindner, J. F., Meadows, B. K., & Ditto, W. L. (1995). Array enhanced stochastic resonance and spatiotemporal synchronization. Phys. Rev. Lett., 75(1), 3–6.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Masuda, N., & Aihara, K. (2002a). Spatiotemporal spike encoding of a continuous external signal. Neural Computation, 14, 1599–1628.
Masuda, N., & Aihara, K. (2002b). Bridging rate coding and temporal spike coding by effect of noise. Phys. Rev. Lett., 88(24), 248101.
Masuda, N., & Aihara, K. (2003a). Duality of rate coding and temporal spike coding in multilayered feedforward networks. Neural Computation, 15, 103–125.
Masuda, N., & Aihara, K. (2003b). Global and local synchrony of coupled neurons in small-world networks. Manuscript submitted for publication.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50(6), 1645–1662.
Mueller, R., & Herz, A. V. M. (1999). Content-addressable memory with spiking neurons. Phys. Rev. E, 59(3), 3330–3338.
Pikovsky, A. S., & Kurths, J. (1997). Coherence resonance in a noise-driven excitable system. Phys. Rev. Lett., 78(5), 775–778.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Slovin, H., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. J. Neurophysiol., 79, 2857–2874.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18(10), 3870–3896.
Shaw, G. L., Harth, E., & Scheibel, A. B. (1982). Cooperativity in brain function: Assemblies of approximately 30 neurons. Experimental Neurology, 77, 324–358.
Shaw, G. L., & Silverman, D. J. (1988). Simulations of the trion model and the search for the code of higher cortical processing. In R. M. J. Cotterill (Ed.), Computer simulation in brain science (pp. 189–209). Cambridge: Cambridge University Press.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13(1), 334–350.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Terman, D., & Wang, D. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176.
Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
van Rossum, M. C. W., Turrigiano, G. G., & Nelson, S. B. (2002). Fast propagation of firing rates through layered networks of noisy neurons. J. Neurosci., 22(5), 1956–1966.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10, 1321–1371.
von Neumann, J. (1956). Probabilistic logics and the synthesis of reliable organisms from unreliable components. In C. E. Shannon & J. McCarthy (Eds.), Automata studies (pp. 43–98). Princeton, NJ: Princeton University Press.
Wang, Y., Chik, D. T. W., & Wang, Z. D. (2000). Coherence resonance and noise-induced synchronization in globally coupled Hodgkin-Huxley neurons. Phys. Rev. E, 61(1), 740–746.
Warzecha, A.-K., & Egelhaaf, M. (1999). Variability in spike trains during constant and dynamic stimulation. Science, 283, 1927–1930.
Zhou, C., Kurths, J., & Hu, B. (2001). Array-enhanced coherence resonance: Nontrivial effects of heterogeneity and spatial independence of noise. Phys. Rev. Lett., 87(9), 098101.

Received May 31, 2002; accepted November 27, 2002.
LETTER
Communicated by Joshua B. Tenenbaum
Laplacian Eigenmaps for Dimensionality Reduction and Data Representation Mikhail Belkin
[email protected] Department of Mathematics, University of Chicago, Chicago, IL 60637, U.S.A.
Partha Niyogi
[email protected] Department of Computer Science and Statistics, University of Chicago, Chicago, IL 60637, U.S.A.
One of the central problems in machine learning and pattern recognition is to develop appropriate representations for complex data. We consider the problem of constructing a representation for data lying on a low-dimensional manifold embedded in a high-dimensional space. Drawing on the correspondence between the graph Laplacian, the Laplace Beltrami operator on the manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for representing the high-dimensional data. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Some potential applications and illustrative examples are discussed.
1 Introduction

In many areas of artificial intelligence, information retrieval, and data mining, one is often confronted with intrinsically low-dimensional data lying in a very high-dimensional space. Consider, for example, gray-scale images of an object taken under fixed lighting conditions with a moving camera. Each such image would typically be represented by a brightness value at each pixel. If there were n² pixels in all (corresponding to an n × n image), then each image yields a data point in R^(n²). However, the intrinsic dimensionality of the space of all images of the same object is the number of degrees of freedom of the camera. In this case, the space under consideration has the natural structure of a low-dimensional manifold embedded in R^(n²). Recently, there has been some renewed interest (Tenenbaum, de Silva, & Langford, 2000; Roweis & Saul, 2000) in the problem of developing low-dimensional representations when data arise from sampling a probability distribution on a manifold. In this letter, we present a geometrically

Neural Computation 15, 1373–1396 (2003) © 2003 Massachusetts Institute of Technology
motivated algorithm and an accompanying framework of analysis for this problem. The general problem of dimensionality reduction has a long history. Classical approaches include principal components analysis (PCA) and multidimensional scaling. Various methods that generate nonlinear maps have also been considered. Most of them, such as self-organizing maps and other neural network–based approaches (e.g., Haykin, 1999), set up a nonlinear optimization problem whose solution is typically obtained by gradient descent that is guaranteed only to produce a local optimum; global optima are difficult to attain by efficient means. Note, however, that the recent approach of generalizing the PCA through kernel-based techniques (Schölkopf, Smola, & Müller, 1998) does not have this shortcoming. Most of these methods do not explicitly consider the structure of the manifold on which the data may possibly reside. In this letter, we explore an approach that builds a graph incorporating neighborhood information of the data set. Using the notion of the Laplacian of the graph, we then compute a low-dimensional representation of the data set that optimally preserves local neighborhood information in a certain sense. The representation map generated by the algorithm may be viewed as a discrete approximation to a continuous map that naturally arises from the geometry of the manifold. It is worthwhile to highlight several aspects of the algorithm and the framework of analysis presented here:

• The core algorithm is very simple. It has a few local computations and one sparse eigenvalue problem. The solution reflects the intrinsic geometric structure of the manifold. It does, however, require a search for neighboring points in a high-dimensional space. We note that there are several efficient approximate techniques for finding nearest neighbors (e.g., Indyk, 2000).
• The justification for the algorithm comes from the role of the Laplace Beltrami operator in providing an optimal embedding for the manifold. The manifold is approximated by the adjacency graph computed from the data points. The Laplace Beltrami operator is approximated by the weighted Laplacian of the adjacency graph with weights chosen appropriately. The key role of the Laplace Beltrami operator in the heat equation enables us to use the heat kernel to choose the weight decay function in a principled manner. Thus, the embedding maps for the data approximate the eigenmaps of the Laplace Beltrami operator, which are maps intrinsically defined on the entire manifold.

• The framework of analysis presented here makes explicit use of these connections to interpret dimensionality-reduction algorithms in a geometric fashion. In addition to the algorithms presented in this letter, we are also able to reinterpret the recently proposed locally linear embedding (LLE) algorithm of Roweis and Saul (2000) within this framework. The graph Laplacian has been widely used for different clustering and partition problems (Shi & Malik, 1997; Simon, 1991; Ng, Jordan, & Weiss, 2002). Although the connections between the Laplace Beltrami operator and the graph Laplacian are well known to geometers and specialists in spectral graph theory (Chung, 1997; Chung, Grigor'yan, & Yau, 2000), so far we are not aware of any application to dimensionality reduction or data representation. We note, however, recent work on using diffusion kernels on graphs and other discrete structures (Kondor & Lafferty, 2002).

• The locality-preserving character of the Laplacian eigenmap algorithm makes it relatively insensitive to outliers and noise. It is also not prone to short circuiting, as only the local distances are used. We show that by trying to preserve local information in the embedding, the algorithm implicitly emphasizes the natural clusters in the data. Close connections to spectral clustering algorithms developed in learning and computer vision (in particular, the approach of Shi & Malik, 1997) then become very clear. In this sense, dimensionality reduction and clustering are two sides of the same coin, and we explore this connection in some detail. In contrast, global methods like that in Tenenbaum et al. (2000) do not show any tendency to cluster, as an attempt is made to preserve all pairwise geodesic distances between points. However, not all data sets necessarily have meaningful clusters. Other methods such as PCA or Isomap might be more appropriate in that case. We will demonstrate, however, that at least in one example of such a data set (the "swiss roll"), our method produces reasonable results.

• Since much of the discussion of Seung and Lee (2000), Roweis and Saul (2000), and Tenenbaum et al.
(2000) is motivated by the role that nonlinear dimensionality reduction may play in human perception and learning, it is worthwhile to consider the implication of the previous remark in this context. The biological perceptual apparatus is confronted with high-dimensional stimuli from which it must recover low-dimensional structure. If the approach to recovering such lowdimensional structure is inherently local (as in the algorithm proposed here), then a natural clustering will emerge and may serve as the basis for the emergence of categories in biological perception. ² Since our approach is based on the intrinsic geometric structure of the manifold, it exhibits stability with respect to the embedding. As long as the embedding is isometric, the representation will not change. In the example with the moving camera, different resolutions of the camera (i.e., different choices of n in the n £ n image grid) should lead to embeddings of the same underlying manifold into spaces of very dif-
1376
M. Belkin and P. Niyogi
ferent dimension. Our algorithm will produce similar representations independent of the resolution. The generic problem of dimensionality reduction is the following. Given a set x1 ; : : : ; xk of k points in Rl , nd a set of points y1 ; : : : ; yk in Rm (m ¿ l) such that yi “represents” xi . In this letter, we consider the special case where x1 ; : : : ; xk 2 M and M is a manifold embedded in Rl . We now consider an algorithm to construct representative yi ’s for this special case. The sense in which such a representation is optimal will become clear later in this letter. 2 The Algorithm Given k points x1 ; : : : ; xk in Rl , we construct a weighted graph with k nodes, one for each point, and a set of edges connecting neighboring points. The embedding map is now provided by computing the eigenvectors of the graph Laplacian. The algorithmic procedure is formally stated below. 1. Step 1 (constructing the adjacency graph). We put an edge between nodes i and j if xi and xj are “close.” There are two variations: (a) ²-neighborhoods (parameter ² 2 R). Nodes i and j are connected by an edge if kxi ¡ xj k2 < ² where the norm is the usual Euclidean norm in Rl . Advantages: Geometrically motivated, the relationship is naturally symmetric. Disadvantages: Often leads to graphs with several connected components, difcult to choose ². (b) n nearest neighbors (parameter n 2 N). Nodes i and j are connected by an edge if i is among n nearest neighbors of j or j is among n nearest neighbors of i. Note that this relation is symmetric. Advantages: Easier to choose; does not tend to lead to disconnected graphs. Disadvantages: Less geometrically intuitive. 2. Step 2 (choosing the weights). 1 Here as well, we have two variations for weighting the edges: (a) Heat kernel (parameter t 2 R). If nodes i and j are connected, put Wij D e¡
kxi ¡ xj k2 t
I
otherwise, put Wij D 0. The justication for this choice of weights will be provided later. 1 In a computer implementation of the algorithm, steps 1 and 2 are executed simultaneously.
(b) Simple-minded (no parameters; t = ∞). W_ij = 1 if vertices i and j are connected by an edge, and W_ij = 0 if vertices i and j are not connected by an edge. This simplification avoids the need to choose t.

3. Step 3 (eigenmaps). Assume the graph G, constructed above, is connected. Otherwise, proceed with step 3 for each connected component. Compute eigenvalues and eigenvectors for the generalized eigenvector problem,

Lf = λDf,    (2.1)

where D is the diagonal weight matrix whose entries are column (or row, since W is symmetric) sums of W, D_ii = Σ_j W_ji, and L = D − W is the Laplacian matrix. The Laplacian is a symmetric, positive semidefinite matrix that can be thought of as an operator on functions defined on the vertices of G. Let f_0, …, f_{k−1} be the solutions of equation 2.1, ordered according to their eigenvalues:

Lf_0 = λ_0 Df_0
Lf_1 = λ_1 Df_1
⋯
Lf_{k−1} = λ_{k−1} Df_{k−1}
0 = λ_0 ≤ λ_1 ≤ ⋯ ≤ λ_{k−1}.

We leave out the eigenvector f_0 corresponding to eigenvalue 0 and use the next m eigenvectors for embedding in m-dimensional Euclidean space:

x_i → (f_1(i), …, f_m(i)).

3 Justification

3.1 Optimal Embeddings. Let us first show that the embedding provided by the Laplacian eigenmap algorithm preserves local information optimally in a certain sense. The following section is based on standard spectral graph theory. (See Chung, 1997, for a comprehensive reference.) Recall that given a data set, we construct a weighted graph G = (V, E) with edges connecting nearby points to each other. For the purposes of this discussion, assume the graph is connected. Consider the problem of mapping the weighted graph G to a line so that connected points stay as close together as possible. Let y = (y_1, y_2, …, y_n)ᵀ be such a map. A reasonable
criterion for choosing a "good" map is to minimize the following objective function,

Σ_{i,j} (y_i − y_j)² W_ij,

under appropriate constraints. The objective function with our choice of weights W_ij incurs a heavy penalty if neighboring points x_i and x_j are mapped far apart. Therefore, minimizing it is an attempt to ensure that if x_i and x_j are "close," then y_i and y_j are close as well. It turns out that for any y, we have

(1/2) Σ_{i,j} (y_i − y_j)² W_ij = yᵀLy,    (3.1)

where, as before, L = D − W. To see this, notice that W_ij is symmetric and D_ii = Σ_j W_ij. Thus,

Σ_{i,j} (y_i − y_j)² W_ij = Σ_{i,j} (y_i² + y_j² − 2 y_i y_j) W_ij
= Σ_i y_i² D_ii + Σ_j y_j² D_jj − 2 Σ_{i,j} y_i y_j W_ij = 2 yᵀLy.

Note that this calculation also shows that L is positive semidefinite. Therefore, the minimization problem reduces to finding

argmin_{yᵀDy = 1} yᵀLy.

The constraint yᵀDy = 1 removes an arbitrary scaling factor in the embedding. Matrix D provides a natural measure on the vertices of the graph. The bigger the value D_ii (corresponding to the ith vertex) is, the more "important" is that vertex. It follows from equation 3.1 that L is a positive semidefinite matrix, and the vector y that minimizes the objective function is given by the minimum eigenvalue solution to the generalized eigenvalue problem

Ly = λDy.

Let 1 be the constant function taking 1 at each vertex. It is easy to see that 1 is an eigenvector with eigenvalue 0. If the graph is connected, 1 is the only eigenvector for λ = 0. To eliminate this trivial solution, which collapses all vertices of G onto the real number 1, we put an additional constraint of orthogonality and look for

argmin_{yᵀDy = 1, yᵀD1 = 0} yᵀLy.
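The identity in equation 3.1 is easy to check numerically. The sketch below is our own illustration (none of these names come from this letter): it builds an arbitrary symmetric weight matrix, forms L = D − W, and compares the two sides of the identity.

```python
# Numerical check of (1/2) * sum_ij (y_i - y_j)^2 W_ij = y^T L y, with L = D - W.
import numpy as np

rng = np.random.default_rng(0)
k = 6
W = rng.random((k, k))
W = (W + W.T) / 2            # weights must be symmetric
np.fill_diagonal(W, 0.0)     # no self-loops

D = np.diag(W.sum(axis=1))   # D_ii = sum_j W_ij
L = D - W                    # graph Laplacian

y = rng.standard_normal(k)
lhs = 0.5 * sum((y[i] - y[j]) ** 2 * W[i, j] for i in range(k) for j in range(k))
rhs = y @ L @ y
assert np.allclose(lhs, rhs)
```

The same computation also confirms that yᵀLy is nonnegative for any y, which is the positive semidefiniteness noted in the text.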
Thus, the solution is now given by the eigenvector with the smallest nonzero eigenvalue. The condition yᵀD1 = 0 can be interpreted as removing a translation invariance in y. Now consider the more general problem of embedding the graph into m-dimensional Euclidean space. The embedding is given by the k × m matrix Y = [y_1, y_2, …, y_m], where the ith row provides the embedding coordinates of the ith vertex. Similarly, we need to minimize

Σ_{i,j} ||y^(i) − y^(j)||² W_ij = tr(YᵀLY),

where y^(i) = [y_1(i), …, y_m(i)]ᵀ is the m-dimensional representation of the ith vertex. This reduces to finding

argmin_{YᵀDY = I} tr(YᵀLY).
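Steps 1 to 3, together with this trace minimization, can be sketched end to end. The code below is a minimal dense-matrix illustration of ours, not the authors' implementation; `eps`, `t`, and `m` are the parameters of steps 1, 2, and 3, and the generalized problem Lf = λDf is solved through the equivalent symmetric problem on D^(−1/2) L D^(−1/2).

```python
import numpy as np

def laplacian_eigenmaps(X, eps, t, m):
    """Embed the rows of X into R^m via steps 1-3 (dense version for small k)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)        # squared distances
    W = np.where((d2 < eps) & (d2 > 0), np.exp(-d2 / t), 0.0)  # heat-kernel weights
    D = W.sum(axis=1)                                          # D_ii = sum_j W_ij
    L = np.diag(D) - W                                         # graph Laplacian
    # L f = lambda D f  <=>  (D^{-1/2} L D^{-1/2}) g = lambda g, with f = D^{-1/2} g
    s = 1.0 / np.sqrt(D)
    vals, vecs = np.linalg.eigh(s[:, None] * L * s[None, :])   # ascending eigenvalues
    F = s[:, None] * vecs                                      # generalized eigenvectors
    return F[:, 1:m + 1]                                       # drop f_0, keep the next m

rng = np.random.default_rng(1)
Y = laplacian_eigenmaps(rng.random((40, 3)), eps=2.0, t=1.0, m=2)
```

The sketch assumes the ε-neighborhood graph is connected, as step 3 requires; otherwise each connected component would be processed separately.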
For the one-dimensional embedding problem, the constraint prevents collapse onto a point. For the m-dimensional embedding problem, the constraint presented above prevents collapse onto a subspace of dimension less than m − 1 (m if, as in the one-dimensional case, we require orthogonality to the constant vector). Standard methods show that the solution is provided by the matrix of eigenvectors corresponding to the lowest eigenvalues of the generalized eigenvalue problem Ly = λDy.

3.2 The Laplace Beltrami Operator. The Laplacian of a graph is analogous to the Laplace Beltrami operator on manifolds. In this section, we provide a justification for why the eigenfunctions of the Laplace Beltrami operator have properties desirable for embedding. Let M be a smooth, compact, m-dimensional Riemannian manifold. If the manifold is embedded in R^l, the Riemannian structure (metric tensor) on the manifold is induced by the standard Riemannian structure on R^l. As we did with the graph, we are looking here for a map from the manifold to the real line such that points close together on the manifold are mapped close together on the line. Let f be such a map. Assume that f : M → R is twice differentiable. Consider two neighboring points x, z ∈ M. They are mapped to f(x) and f(z), respectively. We first show that

|f(z) − f(x)| ≤ dist_M(x, z) ||∇f(x)|| + o(dist_M(x, z)).    (3.2)

The gradient ∇f(x) is a vector in the tangent space TM_x, such that given another vector v ∈ TM_x, df(v) = ⟨∇f(x), v⟩_M. Let l = dist_M(x, z), and let c(t) be the geodesic curve parameterized by length connecting x = c(0) and z = c(l). Then

f(z) = f(x) + ∫₀^l df(c′(t)) dt = f(x) + ∫₀^l ⟨∇f(c(t)), c′(t)⟩ dt.

Now by the Schwarz inequality,

⟨∇f(c(t)), c′(t)⟩ ≤ ||∇f(c(t))|| ||c′(t)|| = ||∇f(c(t))||.

Since c(t) is parameterized by length, we have ||c′(t)|| = 1. We also have ||∇f(c(t))|| = ||∇f(x)|| + O(t) (by Taylor's approximation). Finally, by integrating, we have

|f(z) − f(x)| ≤ l ||∇f(x)|| + o(l),

where both O and o are used in the infinitesimal sense. If M is isometrically embedded in R^l, then dist_M(x, z) = ||x − z||_{R^l} + o(||x − z||_{R^l}) and

|f(z) − f(x)| ≤ ||∇f(x)|| ||z − x|| + o(||z − x||).

Thus, we see that ||∇f|| provides us with an estimate of how far apart f maps nearby points. We therefore look for a map that best preserves locality on average by trying to find

argmin_{||f||_{L²(M)} = 1} ∫_M ||∇f(x)||²,    (3.3)

where the integral is taken with respect to the standard measure on a Riemannian manifold. Note that minimizing ∫_M ||∇f(x)||² corresponds to minimizing Lf = (1/2) Σ_{i,j} (f_i − f_j)² W_ij on a graph. Here, f is a function on vertices, and f_i is the value of f on the ith node of the graph. It turns out that minimizing the objective function of equation 3.3 reduces to finding eigenfunctions of the Laplace Beltrami operator L. Recall that, by definition,

L f = −div ∇(f),

where div is the divergence of the vector field. It follows from Stokes' theorem that −div and ∇ are formally adjoint operators, that is, if f is a function and X is a vector field, then ∫_M ⟨X, ∇f⟩ = −∫_M div(X) f. Thus,

∫_M ||∇f||² = ∫_M L(f) f.

We see that L is positive semidefinite. The f that minimizes ∫_M ||∇f||² has to be an eigenfunction of L. The spectrum of L on a compact manifold M is known to be discrete (Rosenberg, 1997). Let the eigenvalues (in increasing order) be 0 = λ_0 ≤ λ_1 ≤ λ_2 ≤ …, and let f_i be the eigenfunction corresponding to eigenvalue λ_i. It is easily seen that f_0 is the constant function that maps the entire manifold to a single point. To avoid this eventuality, we require (just as in the graph setting) that the embedding map f be orthogonal to f_0. It immediately follows that f_1 is the optimal embedding map. Following the arguments of the previous section, we see that

x → (f_1(x), …, f_m(x))

provides the optimal m-dimensional embedding.

3.3 Heat Kernels and the Choice of Weight Matrix. The Laplace Beltrami operator on differentiable functions on a manifold M is intimately related to the heat flow. Let f : M → R be the initial heat distribution and u(x, t) be the heat distribution at time t (u(x, 0) = f(x)). The heat equation is the partial differential equation (∂/∂t + L)u = 0. The solution is given by u(x, t) = ∫_M H_t(x, y) f(y), where H_t is the heat kernel, the Green's function for this partial differential equation. Therefore,

L f(x) = L u(x, 0) = −(∂/∂t) [ ∫_M H_t(x, y) f(y) ] |_{t=0}.    (3.4)
It turns out that in an appropriate coordinate system (exponential, which to the first order coincides with the local coordinate system given by a tangent plane in R^l), H_t is approximately the gaussian:

H_t(x, y) = (4πt)^{−m/2} e^{−||x−y||²/4t} (φ(x, y) + O(t)),

where φ(x, y) is a smooth function with φ(x, x) = 1. Therefore, when x and y are close and t is small,

H_t(x, y) ≈ (4πt)^{−m/2} e^{−||x−y||²/4t}.

See Rosenberg (1997) for details. Notice that as t tends to 0, the heat kernel H_t(x, y) becomes increasingly localized and tends to Dirac's δ-function, that is, lim_{t→0} ∫_M H_t(x, y) f(y) = f(x). Therefore, for small t, from the definition of the derivative, we have

L f(x) ≈ (1/t) [ f(x) − (4πt)^{−m/2} ∫_M e^{−||x−y||²/4t} f(y) dy ].

If x_1, …, x_k are data points on M, the last expression can be approximated by

L f(x_i) ≈ (1/t) [ f(x_i) − (1/k) (4πt)^{−m/2} Σ_{x_j} e^{−||x_i−x_j||²/4t} f(x_j) ].

Leave-One-Out Bounds for Kernel Methods
T. Zhang

If the metric entropy of the function class C grows as H(C, ε) = O(ε^{−ρ}) for some ρ > 0, then the minimax risk of the optimal predictor converges to zero at a rate of the order n^{−2/(2+ρ)}. It is also known that if ρ > 2, then ERM over the whole function class C as in equation 2.3 may converge more slowly than the optimal minimax rate (and thus will be a suboptimal learning method; see Birgé & Massart, 1993). However, when 0 < ρ < 2, by using the now-standard proof techniques (such as chaining and ratio uniform convergence inequalities) from the theory of empirical processes, it can be shown that the ERM method, equation 2.3, achieves the optimal minimax rate when some additional regularity conditions are satisfied.² The details are not important for this article, and we refer interested readers to Birgé and Massart (1993), van de Geer (2000), and van der Vaart and Wellner (1996). Since it can be shown that the uniform metric entropy for kernel function classes does not grow faster than ε^{−2}, at least for the least-squares method, it is possible to use the standard empirical process theory to obtain convergence bounds for the ERM kernel formulation, equation 2.3, that approximately match the optimal minimax rate.³ For example, a family of well-studied kernel function classes are one-dimensional smoothing splines on [0, 1]. Here, the function class C is a subset of rth differentiable functions (r > 0.5): C = {p : ∫₀¹ [p^(r)(x)]² dx ≤ c₀}, where p^(r) denotes the rth derivative of the function p. It is known that the uniform entropy number is of the order H(C, ε) = O(ε^{−1/r}), leading to the minimax rate of the order n^{−2r/(2r+1)}, which can be achieved using the ERM method. In the learning theory literature, the ERM method, equation 2.3, for kernel methods has recently been studied by a number of authors (Cucker & Smale, 2002; Evgeniou, Pontil, & Poggio, 2000). Many researchers are not fully aware of the recent developments in the statistical community mentioned above.
Although these studies can lead to useful insights, the obtained bounds are not necessarily tight and often do not match the correct minimax rates when applied to specific kernel formulations.
² For example, this is true when the uniform (or bracketing) entropy number (see van der Vaart & Wellner, 1996, sec. 2.5, for definitions) has the same order of complexity as the L_2 metric entropy. The differences among these definitions are technical and irrelevant to this article. It is worth mentioning, though, that for kernel methods, the uniform entropy number is bounded and (by definition) has the same order of complexity as the worst-case metric entropy.

³ Although convergence rates for kernel methods obtained from empirical process techniques are often relatively tight in terms of their dependency on the sample size n, they almost always contain very bad constants. These bounds can also be loose relative to some other factors, such as function class size and noise size.
Although the ERM method for equation 2.3 has been widely studied and the minimax issue is well understood, the behavior of the regularized learning formulation, equation 2.4, is less studied. From a technical point of view, the main difference is that in equation 2.4, we would like to specify our learning bounds in terms of λ_n rather than through the restricted function class C. However, the transition from equation 2.3 to 2.4 is not straightforward. For the penalized formulation, equation 2.4, it is natural to compare the expected generalization error Q_L(A, n) to the best possible risk inf_{p∈H} Q_L(p) using predictors from H. Since the predictor is taken from a very large function class H (often dense in the set of continuous functions) with possibly infinite metric entropy number, the entropy-dependent lower bound in the minimax analysis indicates that a direct empirical risk minimization over H will not converge. The term λ_n r(p) is needed to stabilize the estimation process, but it introduces an explicit bias. Therefore, bounds for equation 2.4 will contain both bias and learning complexity terms. From the theoretical point of view, it is important to study the trade-off between the bias and the learning complexity. For example, to obtain an estimator that is consistent, it is necessary to allow the bias to converge to zero by choosing λ_n → 0 when n → ∞. However, it is also necessary to make sure that λ_n → 0 at a relatively slow pace so that the learning complexity vanishes in the limit (otherwise, the method overfits the training data). Therefore, an important aspect of the analysis is to understand the behavior of the learning complexity when λ_n → 0. Recently, various stability-based methods have been proposed to study the behavior of the penalized learning formulation, equation 2.4 (Bousquet & Elisseeff, 2002; Kutin & Niyogi, 2002; Zhang, 2002a). The underlying idea is closely related to the analysis of leave-one-out error presented in this article.
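As a concrete, entirely illustrative instance of the leave-one-out idea (our own experiment, not code from the article), the sketch below fits penalized kernel least squares in the dual, computes the leave-one-out error by retraining with each point held out, and compares it with the test error of a model trained on n − 1 points; all function names and constants here are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit(K, y, lam):
    # penalized least squares in the dual: (K + n*lam*I) alpha = y
    n = len(y)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def kernel(A, B, t=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / t)

X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(60)
lam = 1e-2

# leave-one-out: retrain with each point held out, then test on it
loo = []
for k in range(60):
    idx = np.delete(np.arange(60), k)
    a = fit(kernel(X[idx], X[idx]), y[idx], lam)
    pred = kernel(X[k:k + 1], X[idx]) @ a
    loo.append((pred[0] - y[k]) ** 2)

# fresh test error of a model trained on the remaining 59 points
Xt = rng.uniform(-1, 1, size=(500, 1))
yt = np.sin(3 * Xt[:, 0]) + 0.1 * rng.standard_normal(500)
a = fit(kernel(X[1:], X[1:]), y[1:], lam)
test = np.mean((kernel(Xt, X[1:]) @ a - yt) ** 2)
print(np.mean(loo), test)
```

On average over draws of the training set, the two quantities coincide, which is the unbiasedness property the leave-one-out analysis builds on; a single run only approximates this.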
Exponential-type probability bounds can be obtained in their analysis. Such exponential bounds give more detailed information than the expected generalization error bound defined in equation 2.2, which we are interested in here. However, one pays a price for this generality, since the expected generalization error implied from such an analysis will be worse than the results obtained here using the leave-one-out analysis. As an example, we shall still consider the least-squares regression problem. Zhang (2002a) pointed out that the bound given in Bousquet and Elisseeff (2002) was not tight and showed that a learning complexity term of the order O(1/(λ_n² n)^{1/2}) in Bousquet and Elisseeff (2002) can be improved to O(1/(λ_n² n)). Zhang (2002a) further pointed out that even the improved bound derived there (when averaged to obtain an expected generalization error bound) does not match the corresponding expected generalization bound using the leave-one-out analysis, where the learning complexity is of the order O(1/(λ_n n)) (see section 5). If the target function belongs to H, then we can choose λ_n ∼ n^{−1/2}, and results in section 5 imply that the expected
generalization error Q_L(A, n) converges to the risk of the target function inf_{q∈H} Q_L(q) at a rate of the order O(n^{−1/2}). This is clearly the best kernel-independent bound (in terms of n) one can obtain, since it matches the minimax rate of smoothing splines when the smoothness parameter r → 0.5. As a comparison, even with the refined analysis in Zhang (2002a), one can obtain a rate of the order O(n^{−1/3}) only by setting λ_n ∼ n^{−1/3}. A similar conclusion also holds for other loss functions. See sections 4 and 5 for more detailed comparisons. One question is whether this discrepancy is due to the suboptimal proof techniques used in the literature or to something more fundamental. Although we do not have an affirmative answer to this question, we think that the O(1/(λ_n² n)) behavior for the probability bounds might not be improvable. One may notice that the leave-one-out variance bounds given in sections 3 and 4 also have this O(1/(λ_n² n)) behavior. In addition to this worse behavior in terms of the regularization parameter λ_n, probability bounds such as those in Bousquet and Elisseeff (2002), Kutin and Niyogi (2002), and Zhang (2002a) often require the loss function to be bounded. Again, similar conditions are needed in the leave-one-out variance analysis presented in sections 3 and 4. As a comparison, this restriction is not necessary in the leave-one-out expected generalization analysis. Based on the above discussion, it is clear that the leave-one-out expected generalization error analysis is very useful for understanding the behavior of the penalized kernel learning formulation, equation 2.4, which is widely used in practice. In fact, it is possible to obtain tight expected generalization bounds for both the kernel-dependent case (where we need to use the eigendecomposition structure of the underlying kernel) and the kernel-independent case.
However, in order to focus on the main underlying methodology, we shall consider only the kernel-independent case in this article.

3 Kernel Learning Machines and Leave-One-Out Approximation Bound

3.1 Kernel Representation. The goal of many machine learning problems is to find a function that can predict the output variable y based on the input variable x. Typically, one needs to restrict the size of the hypothesis function family so that a stable estimate within the function family can be obtained from a finite number of samples. Let the training samples be (x_1, y_1), …, (x_n, y_n). We assume that the hypothesis function family that predicts y based on x can be specified with the following kernel method,

p(α, x) = Σ_{i=1}^n α_i K(x_i, x),    (3.1)
where α = [α_i]_{i=1,…,n} is a parameter vector that needs to be estimated from the data. K is a symmetric positive kernel. That is, K(a, b) = K(b, a), and the n × n Gram matrix G = [K(x_i, x_j)]_{i,j=1,…,n} is always positive semidefinite.

Definition 1. Let H_0 = {Σ_{i=1}^ℓ α_i K(x_i, x) : ℓ ∈ N, α_i ∈ R}. H_0 is an inner product space with norm defined as

||Σ_i α_i K(x_i, ·)|| = (Σ_{i,j} α_i α_j K(x_i, x_j))^{1/2}.

Let H be the closure of H_0 under the norm ||·||, which forms a Hilbert space, called the reproducing kernel Hilbert space of K.

It is well known and not difficult to check that the norm ||·|| in definition 1 is well defined and defines an inner product. Consider two functions in H_0: p_1(x) = Σ_{i=1}^n α_i K(x_i, x) and p_2(x) = Σ_{j=1}^m β_j K(x_j, x). The inner product can be expressed as

p_1 · p_2 = Σ_{i=1}^n Σ_{j=1}^m α_i K(x_i, x_j) β_j.

Now assume that p_1(·) ⊥ p_2(·), where p_2(x) = K(x_j, x); then

0 = p_1 · p_2 = Σ_{i=1}^n α_i K(x_i, x_j) = p_1(x_j).

Clearly, by taking the limit, we know that this property also holds for all p_1 ∈ H. We thus obtain the following well-known property for reproducing kernel Hilbert spaces:

Proposition 1. Let p(x) ∈ H. Consider the projection p_0(x) of p(x) onto the subspace spanned by the functions p_i(x) = K(x_i, x) (i = 1, …, n). Then p_0(x_i) = p(x_i) for all i = 1, …, n.

In the recent machine learning literature, the kernel representation (see equation 3.1) has frequently been discussed under the assumptions of Mercer's theorem, which gives a feature space representation of H (see Cristianini & Shawe-Taylor, 2000, chap. 3). Here we represent each input vector x as a possibly infinite-dimensional feature vector [φ_j(x)]_{j=1,…}. The kernel becomes K(x_1, x_2) = Σ_{j=1}^∞ φ_j(x_1) φ_j(x_2), and a function p ∈ H can be represented as p(·) = Σ_{j=1}^∞ w_j φ_j(·), where ||p|| = (Σ_{j=1}^∞ w_j²)^{1/2}. Although this representation provides useful insights, it is not technically essential for the purpose of this article. We take a more general approach that relies on
only simple properties of a positive symmetric kernel function K without considering the eigendecomposition structure of any specific kernel. In some practical kernel learning formulations, a bias term may be included in equation 3.1, where the corresponding function space H′ consists of functions of the form p(α, x) + b = Σ_{i=1}^n α_i K(x_i, x) + b. It is well known that H′ can be considered as the RKHS with kernel K′(x_1, x_2) = K(x_1, x_2) + 1. It is easier to see this using the feature space representation, K′(x_1, x_2) = 1 + Σ_{i=1}^∞ φ_i(x_1) φ_i(x_2), and hence any function in the RKHS of K′ has the form p(·) = Σ_{i=1}^∞ w_i φ_i(·) + b, with its norm defined as (Σ_{i=1}^∞ w_i² + b²)^{1/2}. Therefore, from the learning point of view, we do not lose any representation power by considering only a kernel representation without an explicit bias term. We shall mention that by treating H′ as the RKHS of a different kernel, the resulting learning formulations may be slightly different from those in the literature, where the bias b is typically not included in the penalization term. An explicit bias formulation would significantly complicate the derivation in the leave-one-out analysis framework, though it is still possible to handle it. Since an explicit bias formulation does not have any advantage in approximation power, for simplicity and clarity, we shall focus on the representation 3.1 without the bias term.

Functions in H can be used to approximate an arbitrary function p(x). However, only certain functions can be well approximated; others cannot. Therefore, it is useful to define a metric that characterizes how well a certain function can be approximated. In particular, given a sequence of observations X_n = {x_1, …, x_n}, we can approximate p(x) by using functions in H that match the values of p(x) at X_n. There can be many possible such function interpolations (assume one exists). Therefore, a natural choice is to select the minimum ||·||-norm interpolation of p(·). The corresponding norm can be used to measure the cost of interpolating p(x) at X_n using functions in H. This leads to the following definition:

Definition 2. Denote by X_n a sequence of samples x_1, …, x_n. We use G(X_n) to denote the Gram matrix [K(x_i, x_j)]_{i,j=1,…,n} and p(X_n) to denote the vector of values [p(x_1), …, p(x_n)]ᵀ. For any function p(x) and symmetric positive kernel K, we define

||p||_{X_n} = inf({+∞} ∪ {s ≥ 0 : ∀α, p(X_n)ᵀ α ≤ s (αᵀ G(X_n) α)^{1/2}}),

where α denotes an n-dimensional vector. We use the convention that +∞ × 0 = +∞ in the definition. We also define ||p||_{[n]} = sup_{X_n} ||p||_{X_n}, where X_n consists of a sequence of n samples.

The following property is useful. It shows that ||p||_{X_n} can be regarded as the norm of an interpolation p(ᾱ, ·) ∈ H of p at X_n, with ||p||_{X_n} = ||p(ᾱ, ·)||. Proposition 3 implies that for any such interpolation function q ∈ H, ||p||_{X_n} ≤ ||q||.
Therefore, the quantity ||p||_{X_n} can be regarded as the norm corresponding to the minimum ||·||-norm interpolation of p(·) using functions in H. The proofs are left to appendix A.

Proposition 2. Let K be a symmetric positive kernel, and consider samples X_n = {x_1, …, x_n}. For any function p(x), one of the following two situations occurs:

• The vector p(X_n) is not in the range of the Gram matrix G(X_n): then ||p||_{X_n} = +∞.

• p(X_n) can be represented as p(X_n) = G(X_n)ᾱ: then ||p||_{X_n} = (ᾱᵀ G(X_n) ᾱ)^{1/2}.

In particular, if G(X_n) is nonsingular, then ||p||_{X_n} = (p(X_n)ᵀ G(X_n)^{−1} p(X_n))^{1/2}.
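The nonsingular case of proposition 2 translates directly into a few lines of linear algebra. In this sketch (our own; the gaussian kernel and the target p are arbitrary choices of ours), the interpolation norm is computed both from G(X_n)^{−1} and from the minimum-norm interpolant α = G(X_n)^{−1} p(X_n), and the two agree.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(8, 2))
G = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # Gram matrix, t = 1

p_vals = np.cos(X[:, 0]) * X[:, 1]                # p(X_n): values of some target p
norm_Xn = np.sqrt(p_vals @ np.linalg.solve(G, p_vals))

# equivalently, via the minimum-norm interpolant alpha = G^{-1} p(X_n):
alpha = np.linalg.solve(G, p_vals)
assert np.isclose(norm_Xn, np.sqrt(alpha @ G @ alpha))
```

For a gaussian kernel with distinct sample points, G(X_n) is positive definite, so the nonsingular branch always applies here.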
Proposition 3. Let K be a symmetric positive kernel; then for all p(x) ∈ H,

||p||_{[n]} ≤ ||p(·)||.

This implies that for all x, |p(x)| ≤ ||p(·)|| K(x, x)^{1/2}.
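Proposition 3 can also be checked numerically. In the sketch below (again our own illustration), we take p = Σ_i a_i K(z_i, ·) ∈ H, so that ||p||² = aᵀ K(Z, Z) a, and verify both ||p||_{X_n} ≤ ||p|| on a fresh sample X_n (via proposition 2) and the pointwise bound |p(x)| ≤ ||p|| K(x, x)^{1/2}.

```python
import numpy as np

rng = np.random.default_rng(4)
kern = lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

Z = rng.uniform(-1, 1, size=(10, 2))       # centers defining p in H
a = rng.standard_normal(10)
p_norm = np.sqrt(a @ kern(Z, Z) @ a)       # ||p|| for p = sum_i a_i K(z_i, .)

X = rng.uniform(-1, 1, size=(6, 2))        # a fresh sample X_n
G = kern(X, X)
pX = kern(X, Z) @ a                        # p(X_n)
norm_Xn = np.sqrt(pX @ np.linalg.solve(G, pX))
assert norm_Xn <= p_norm + 1e-8            # ||p||_{X_n} <= ||p||

x = rng.uniform(-1, 1, size=(1, 2))        # pointwise bound at an arbitrary x
assert abs((kern(x, Z) @ a)[0]) <= p_norm * np.sqrt(kern(x, x)[0, 0]) + 1e-8
```

The first inequality reflects the fact that the minimum-norm interpolant at X_n is the projection of p onto the span of K(x_1, ·), …, K(x_n, ·).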
We will later show that the quantity ||p||_{[n]} characterizes the learning property of using the kernel representation 3.1 to approximate a target function p(x). In fact, our learning bounds containing this quantity directly yield general approximation bounds, as shown at the end of section 5. This is also a new quantity that has not been considered in the traditional approximation literature. For example, the approximation property of kernel representation is typically studied using standard analytical techniques such as Fourier analysis (see Mhaskar, Narcowich, & Ward, 1999; Wendland, 1998). Such results usually depend on many rather complicated quantities as well as on the data dimensionality. Compared with these previous results, our bounds using ||p||_{[n]} are simpler to express and more generally applicable. However, we do not discuss specific approximation consequences of our analysis, but rather concentrate on the general learning aspect. When the target function p is not in H, we do not give conditions to bound ||p||_{[n]}. This quantity can be bounded using estimates of ||p||_{X_n}, and such estimates may be obtained using techniques in approximation theory for analyzing the stability of G(X_n). The stability of the system can be measured by the condition number of G(X_n) (in 2-norm) or the 2-norm of G(X_n)^{−1} (see Narcowich, Sivakumar, & Ward, 1998, for related analysis). The quantity ||p||_{X_n} is clearly related to the 2-norm of G(X_n)^{−1} (the latter gives an upper bound), which measures the stability. However, the stability concept as used in approximation theory has only numerical consequences
but no consequence in approximation rate or learning accuracy. On the other hand, the quantity ||p||_{[n]} determines the rate of approximation (or learning) when the kernel representation 3.1 is used. Although it is well known that the norm of G(X_n)^{−1} degrades as the number of samples increases, one may introduce smoothness conditions on p so that ||p||_{[n]} behaves nicely for a wide range of function families.

3.2 Duality and Leave-One-Out Approximation Bound for Kernel Machines. We consider the following general formulation of dual kernel learning machines that can be used to estimate p(α, ·) in equation 3.1 from the training data:

α̂ = arg min_α [ Σ_{i=1}^n g(−α_i, x_i, y_i) + (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j K(x_i, x_j) ].    (3.2)

We assume that g(a, b, c) is a convex function of a. For simplicity, we require that a solution of equation 3.2 exists, but not necessarily a unique one. In order to treat classification and regression under the same general framework, we shall consider convex functions from the general convex analysis point of view, as in Rockafellar (1970). In particular, we allow a convex function to take the value +∞, which is equivalent to a constraint. Consider a convex function p(u): R^d → R*, where R is the real line and R* denotes the extended real line R ∪ {+∞}. However, we assume that convex functions do not achieve −∞. We also assume that any convex function p(u) in this article contains at least one point u_0 such that p(u_0) < +∞. Convex functions that satisfy these conditions are called proper convex functions. This definition is very general: virtually all practically interesting convex functions are proper. We consider only closed convex functions, that is, ∀u, p(u) = lim_{ε→0+} inf{p(v) : ||v − u|| ≤ ε}. This condition essentially means that the convex set above the graph of p, {(u, y) : y ≥ p(u)}, is closed. In this article, we use ∇p(u) to denote a subgradient of a convex function p at u, which is a vector that satisfies the following condition:

∀u′,  p(u′) ≥ p(u) + ∇p(u)ᵀ (u′ − u).

The set of all subgradients of p at u is called the subdifferential of p at u and is denoted by ∂p(u). Denote by L_n(α) the objective function in equation 3.2:

L_n(α) = Σ_{i=1}^n g(−α_i, x_i, y_i) + (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j K(x_i, x_j).

In this article, we assume that equation 3.2 has a solution with a finite objective value. However, we do not assume that the solution is unique. The following proposition gives a sufficient condition for the existence of
such a solution. All examples given in this article will satisfy the condition:

Proposition 4. Assume that for all (x_i, y_i), g(α_i, x_i, y_i) is a (proper) closed convex function in α_i. If lim_{α_i→±∞} g(α_i, x_i, y_i) = +∞ for all i, then equation 3.2 has a solution with finite value.

Proof. Given ᾱ such that g(ᾱ_i, x_i, y_i) < +∞, the region Ω_ᾱ = {α ∈ R^n : L_n(α) ≤ L_n(ᾱ)} is bounded in R^n. Since solving equation 3.2 in R^n is equivalent to solving it in Ω_ᾱ, we can find a sequence α^j ∈ Ω_ᾱ such that L_n(α^j) → inf_α L_n(α). Since L_n is the sum of closed convex functions, it is also closed. Let α̂ be a limiting point of the α^j. By definition, we have L_n(α̂) = inf_α L_n(α).

In the following, we introduce a convex duality of equation 3.2 that becomes useful in our later discussions. For the function g(u, b, c) in equation 3.2, we define its dual with respect to the first parameter as

f(v, b, c) = n · sup_u [uv − g(u, b, c)].

Given f, g can be obtained as

g(u, b, c) = sup_v [uv − (1/n) f(v, b, c)].    (3.3)

We consider the following primal learning formulation in the corresponding RKHS of K:

p̂(·) = arg min_{p(·)∈H} [ (1/n) Σ_{i=1}^n f(p(x_i), x_i, y_i) + (1/2) ||p(·)||² ].    (3.4)

Note that this formulation has a penalized empirical risk minimization form as in equation 2.4. We want to prove that equations 3.4 and 3.2 are equivalent. This is given by the following strong duality theorem (a special case of which appeared in Jaakkola & Haussler, 1999). The proof is left to appendix B.

Theorem 1. Any solution of equation 3.4 can be written as p̂(x) = Σ_{i=1}^n α̂_i K(x_i, x) for some α̂. For any solution α̂ of equation 3.2, the function p(α̂, x) = Σ_{i=1}^n α̂_i K(x_i, x) is a solution of the primal optimization problem, equation 3.4. The converse is also true if the Gram matrix G(X_n) is nonsingular.

Note that the proof of theorem 1 also implies that even when G(X_n) is singular, a solution p(x) of equation 3.4 can still be written as p(α̂, x), where α̂ is a solution of equation 3.2. In addition, the sum of the optimal values of equations 3.4 and 3.2 is zero (see, for example, Zhang, 2002b). However, this fact is not important in this article.
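Theorem 1 can be made concrete for the squared loss; the following is a worked special case of ours, not an example taken from the article. With f(v, x, y) = (v − y)², equation 3.3 gives g(u, x, y) = uy + (n/4)u², so the dual problem 3.2 is minimized at α̂ = (G + (n/2)I)^{−1} y. The sketch checks that the gradient of the primal objective 3.4, restricted to functions p = Σ_i β_i K(x_i, ·), vanishes at β = α̂.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20
X = rng.uniform(-1, 1, size=(n, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(n)
G = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # Gram matrix

alpha = np.linalg.solve(G + (n / 2) * np.eye(n), y)   # dual solution of (3.2)

# primal objective (1/n) sum_i (p(x_i) - y_i)^2 + (1/2)||p||^2 with p = G beta;
# its gradient with respect to beta is (2/n) G (G beta - y) + G beta:
grad = (2 / n) * G @ (G @ alpha - y) + G @ alpha
assert np.allclose(grad, 0, atol=1e-8)
```

The vanishing gradient, together with convexity of the primal objective, confirms that the dual solution also solves the primal problem in this special case.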
Leave-One-Out Bounds for Kernel Methods
1409
We may now introduce the following leave-one-out approximation bound for kernel methods, which forms the foundation of our analysis. The proof is given in appendix C.

Lemma 1. Consider the training set D_n = {(x_1, y_1), ..., (x_n, y_n)}. Let α̂ be the solution of equation 3.2, and let α̂^{[k]} be the solution of equation 3.2 with the kth datum (x_k, y_k) removed from D_n. Then

    ||p(α̂, ·) − p(α̂^{[k]}, ·)|| ≤ |α̂_k| K(x_k, x_k)^{1/2}.
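Lemma 1 can be checked numerically in a case where both minimizers are available in closed form. The sketch below is our own construction, not from the text: with the squared loss f(p, x, y) = (c/2)(p − y)², the solution of equation 3.4 is kernel ridge regression, α̂ = (G + (n/c)I)^{-1} y, and we take the deleted-point problem to keep the same 1/n weighting; the kernel, data, and constants are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 12, 5.0
X = rng.normal(size=(n, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
G = np.exp(-0.5 * (X[:, None, 0] - X[None, :, 0]) ** 2)  # RBF Gram matrix

def fit(idx):
    # minimizer of (c/n) * sum_{i in idx} (p(x_i) - y_i)^2 / 2 + ||p||^2 / 2
    Gm = G[np.ix_(idx, idx)]
    return np.linalg.solve(Gm + (n / c) * np.eye(len(idx)), y[idx])

alpha = fit(list(range(n)))
for k in range(n):
    idx = [i for i in range(n) if i != k]
    d = alpha.copy()
    d[idx] -= fit(idx)                # coefficient difference, padded with alpha_k at k
    dist = np.sqrt(d @ G @ d)         # RKHS norm ||p(alpha,.) - p(alpha^[k],.)||
    bound = abs(alpha[k]) * np.sqrt(G[k, k])
    assert dist <= bound + 1e-10      # lemma 1
```

In the orthogonal case (G = I) the bound holds with equality, so the constant in lemma 1 is sharp.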
3.3 Leave-One-Out Error and Variance for Kernel Learning Machines. As shown in section 2, a useful aspect of leave-one-out analysis is that the expected leave-one-out error of a learning algorithm equals its expected generalization error. However, the training error of a learner may not be closely related to its generalization error. It is thus interesting to compare the leave-one-out error to the training error: if the two are close, then the expected generalization error is close to the expected training error, and if the learner is obtained by approximately minimizing its training error, then the expected generalization error is also approximately minimized. The goal of this section is to bound the leave-one-out error of a kernel learning machine in terms of its training error.

We use the notation introduced in section 2. Here we use A_K to denote the dual kernel learning method, equation 3.2 (or, equivalently, its primal learning formulation, equation 3.4). Z_L(A_K, D_n) denotes the leave-one-out error with respect to the training samples D_n of size n.

Definition 3. Let L(p, x, y) be an arbitrary loss function. We define

    ΔL_δ(p, x, y) = sup_{|t| ≤ δ} [L(p + t, x, y) − L(p, x, y)],
    |ΔL|_δ(p, x, y) = sup_{|t| ≤ δ} |L(p + t, x, y) − L(p, x, y)|.
Using lemma 1, we can easily obtain the following general leave-one-out bound. We will study its consequences for regression and classification in subsequent sections.

Theorem 2. Under the assumptions of lemma 1, let L(p, x, y) be an arbitrary loss function and let Z_L(A_K, D_n) be the corresponding leave-one-out cross-validation error with respect to L(p, x, y):

    Z_L(A_K, D_n) = (1/n) Σ_{k=1}^n L(p(α̂^{[k]}, x_k), x_k, y_k).

Then

    n Z_L(A_K, D_n) ≤ Σ_{k=1}^n L(p(α̂, x_k), x_k, y_k) + Σ_{k=1}^n ΔL_{δ_k}(p(α̂, x_k), x_k, y_k),    (3.5)
where δ_k = |α̂_k| K(x_k, x_k).

Proof. Using lemma 1 and proposition 3, we obtain |p(α̂^{[k]}, x_k) − p(α̂, x_k)| ≤ δ_k. This implies that

    L(p(α̂^{[k]}, x_k), x_k, y_k) ≤ L(p(α̂, x_k), x_k, y_k) + ΔL_{δ_k}(p(α̂, x_k), x_k, y_k).

Summing over k, we obtain the theorem.

Although the expected leave-one-out error equals the expected generalization error, the former may not be a stable estimator of the latter. To see this, consider the following example. Assume that the input is a 0-1 valued binary random variable with probability 0.7 for 0 and 0.3 for 1, and the output is always 1. Given n such input-output pairs, the learner predicts 1 if an even number of the inputs are ones and predicts 0 otherwise. Clearly, depending on whether the number of observed ones is even or odd, the leave-one-out classification error rate for this learner will be approximately 0.7 or 0.3, while its expected classification error is approximately 0.5. This implies that the leave-one-out classification error is not a stable estimator even when n → +∞.

It is thus useful to bound the variance of the leave-one-out estimate. To do so for kernel learning machines, we use the following modified version of an Efron-Stein style concentration inequality (Efron & Stein, 1981; Steele, 1986). A self-contained proof is given in appendix D.

Lemma 2. Let Z be an arbitrary function of the training data, Z = u(z_1, ..., z_n), and let Z^{(i)} be a fixed function of the training data with z_i excluded (i = 1, ..., n): Z^{(i)} = u^{(i)}(z_1, ..., z_{i−1}, z_{i+1}, ..., z_n), where z_i = (x_i, y_i). Then the variance of Z can be bounded as

    Var(Z) ≤ E Σ_{i=1}^n (Z − Z^{(i)})²,
where the expectation (and variance) is with respect to the training data. Note that the choice of u^{(i)} that minimizes the right-hand side of lemma 2 is the expectation of u with respect to the ith component: E_{z_i} u(z_1, ..., z_n).

Clearly, if we let Z_L(A_K, D_n) be the leave-one-out error and Z_L^{(i)}(A_K, D_n) be the leave-one-out error with the ith datum removed from the training data, then the variance of Z_L(A_K, D_n) can be bounded by the right-hand side of lemma 2. For kernel machines, the latter can be bounded using the following result.

Theorem 3. Let Z_L(A_K, D_n) be the leave-one-out error on the training set D_n as in theorem 2, and let Z_L^{(i)}(A_K, D_n) be the corresponding leave-one-out error with the ith datum removed from the training data. We have the inequality

    Σ_{i=1}^n [n Z_L(A_K, D_n) − (n − 1) Z_L^{(i)}(A_K, D_n)]²
      ≤ (2n − 1) [ (1/n) Σ_{i=1}^n L(p(α̂^{[i]}, x_i), x_i, y_i)² + Σ_{i,j: i≠j} |ΔL|_{δ_{i,j}}(p(α̂^{[i]}, x_i), x_i, y_i)² ],

where δ_{i,j} = |α̂_j^{[i]}| K(x_i, x_i)^{1/2} K(x_j, x_j)^{1/2} for all i ≠ j.

Proof. Let Z = n Z_L(A_K, D_n) and Z^{(i)} = (n − 1) Z_L^{(i)}(A_K, D_n). For all i ≠ j, denote by α̂^{[i,j]} the solution of equation 3.2 with both the ith and the jth data points removed from the training set. Given any j, using lemma 1 and proposition 3, we have for all i ≠ j:

    |p(α̂^{[i]}, x_i) − p(α̂^{[i,j]}, x_i)| ≤ δ_{i,j}.

Therefore,

    |L(p(α̂^{[i]}, x_i), x_i, y_i) − L(p(α̂^{[i,j]}, x_i), x_i, y_i)| ≤ |ΔL|_{δ_{i,j}}(p(α̂^{[i]}, x_i), x_i, y_i).

Summing over i (i ≠ j), we obtain

    |(Z − L(p(α̂^{[j]}, x_j), x_j, y_j)) − Z^{(j)}| ≤ Σ_{i: i≠j} |ΔL|_{δ_{i,j}}(p(α̂^{[i]}, x_i), x_i, y_i).

We thus obtain

    (Z − Z^{(j)})² ≤ [ n · (1/n) |L(p(α̂^{[j]}, x_j), x_j, y_j)| + Σ_{i: i≠j} |ΔL|_{δ_{i,j}}(p(α̂^{[i]}, x_i), x_i, y_i) ]²
      ≤ (2n − 1) [ (1/n) L(p(α̂^{[j]}, x_j), x_j, y_j)² + Σ_{i: i≠j} |ΔL|_{δ_{i,j}}(p(α̂^{[i]}, x_i), x_i, y_i)² ],

where the second step splits the first term into n equal pieces and applies the Cauchy-Schwarz inequality to the resulting 2n − 1 terms. Now the theorem can be obtained by summing over j.
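The instability example given before lemma 2 (the parity learner on 0-1 inputs with P(x = 1) = 0.3 and constant output 1) can be reproduced in a few lines. This simulation is our own sketch; the exact 0.7/0.3 split it checks follows from the parity argument in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def loo_error(x):
    # the learner predicts 1 iff the training inputs contain an even number of ones
    errs = []
    for k in range(len(x)):
        ones_left = x.sum() - x[k]        # ones remaining after deleting point k
        pred = 1 if ones_left % 2 == 0 else 0
        errs.append(int(pred != 1))       # the true output is always 1
    return float(np.mean(errs))

for _ in range(3):
    x = (rng.random(1000) < 0.3).astype(int)
    e = loo_error(x)
    # LOO error is exactly mean(x) ~ 0.3 if sum(x) is even, 1 - mean(x) ~ 0.7 if odd
    expected = x.mean() if x.sum() % 2 == 0 else 1.0 - x.mean()
    assert abs(e - expected) < 1e-12
```

Across draws, the leave-one-out estimate flips between roughly 0.3 and 0.7, while the expected generalization error of the learner is roughly 0.5, matching the discussion in the text.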
4 Leave-One-Out Analysis for Bounded Subgradient Formulations

It is clear that theorem 2 can be used to bound leave-one-out errors in terms of training errors. If each α̂_k is small (k = 1, ..., n) and the loss is continuous, then the corresponding leave-one-out error is not much larger than the training error. Since the expected leave-one-out error is the expected generalization error, we know that the expected generalization error is not much larger than the expected error on the training set. Similarly, one can obtain a bound on the variance of the leave-one-out error using theorem 3.

In order to apply equation 3.5, we need to estimate α̂. Although this quantity is available if we solve the kernel formulation, equation 3.2, on the training data, it is also very useful to estimate it for all possible training data based on properties of the underlying learning formulation. Such an estimate can be used to derive a bound on the expected generalization error.

This section considers a relatively simple situation. From equation B.3, we know that if sup |∇₁ f(p, x, y)| is bounded, then each component of α̂ is of the order O(1/n). Similarly, if L(p, x, y) is Lipschitz, then L(p(α̂^{[k]}, x), x, y) − L(p(α̂, x), x, y) can be bounded accordingly. For some problems, it is useful to limit the range of p(α̂, ·) and p(α̂^{[k]}, ·). Let p be a measurable function, and let κ_p = (2 sup_{x,y} f(p(x), x, y) + ||p(·)||²_{[n]})^{1/2}. Note that we allow κ_p to be +∞. Now if f is nonnegative, then by comparing the primal empirical risks, we obtain ||p(α̂, ·)|| ≤ κ_p, ||p(α̂^{[i]}, ·)|| ≤ κ_p for all i, and ||p(α̂^{[i,j]}, ·)|| ≤ κ_p for all i ≠ j. Thus, |p(α̂, x)|, |p(α̂^{[i]}, x)|, |p(α̂^{[i,j]}, x)| ≤ κ_p K(x, x)^{1/2}. Based on this observation, we obtain the following result:

Theorem 4. Under the assumptions of theorem 3, let

    κ ≥ inf_{p(·)} (2 sup_{x,y} f(p(x), x, y) + ||p(·)||²_{[n]})^{1/2},
    M ≥ sup_x K(x, x)^{1/2},

where we allow both quantities to be +∞. We further assume that f(p, x, y) is nonnegative when κM < +∞. Let

    M_f = sup_{|p| ≤ κM, x, y} |∇₁ f(p, x, y)|,
    M_L = sup_{|p₁|, |p₂| ≤ κM, x, y} |L(p₁, x, y) − L(p₂, x, y)| / |p₁ − p₂|,
    U_L = sup_{|p| ≤ κM, x, y} |L(p, x, y)|.

Given training data D_n, let A_X = (M_L M_f / n) Σ_{k=1}^n K(x_k, x_k). We have the following inequalities:

    Z_L(A_K, D_n) ≤ (1/n) Σ_{k=1}^n L(p(α̂, x_k), x_k, y_k) + A_X / n.    (4.1)

    Σ_{i=1}^n [ Z_L(A_K, D_n) − ((n − 1)/n) Z_L^{(i)}(A_K, D_n) ]²
      ≤ (2/n) [ U_L² + A_X² − (M_L² M_f² / n²) Σ_{k=1}^n K(x_k, x_k)² ].    (4.2)
Proof. The definition of κ implies that ||p(α, ·)|| ≤ κ for every p(α, ·) that solves equation 3.4 with m points removed from the training set (m = 0, 1, 2). Hence, |p(α, x)| ≤ κM. This implies that we can replace L(p, x, y) by L(max(min(p, κM), −κM), x, y) in the proof. This new L has a global Lipschitz constant of M_L; therefore |ΔL|_δ(p, x, y) ≤ M_L δ. Using equation B.3, we also obtain |α̂_k| ≤ M_f / n for all training data. Let δ_k = |α̂_k| K(x_k, x_k). Then

    ΔL_{δ_k}(p(α̂, x_k), x_k, y_k) ≤ (M_f M_L / n) K(x_k, x_k).

Summing over k and applying theorem 2, we obtain equation 4.1.

Similarly, let δ_{i,j} = |α̂_j^{[i]}| K(x_i, x_i)^{1/2} K(x_j, x_j)^{1/2}. We have

    |ΔL|_{δ_{i,j}}(p(α̂^{[i]}, x_i), x_i, y_i) ≤ (M_f M_L / n) K(x_i, x_i)^{1/2} K(x_j, x_j)^{1/2}.

Summing over i, j (i ≠ j), we obtain

    Σ_{i,j: i≠j} |ΔL|_{δ_{i,j}}(p(α̂^{[i]}, x_i), x_i, y_i)² ≤ A_X² − (M_L² M_f² / n²) Σ_{k=1}^n K(x_k, x_k)².

Using theorem 3, we obtain equation 4.2.

Note that if κM = +∞, then the condition |p| ≤ +∞ in the definitions of M_f, M_L, and U_L can be interpreted as |p| < +∞.
Corollary 1. Under the assumptions of theorem 4, if E K(x, x) < +∞, then

    E Z_L(A_K, D_n) ≤ E (1/n) Σ_{k=1}^n L(p(α̂, x_k), x_k, y_k) + (M_f M_L / n) E K(x, x),
    Var(Z_L(A_K, D_n)) ≤ (2/n) [ U_L² + (M_f M_L)² (E K(x, x))² ],

where the expectation (variance) is with respect to the training data D_n.

Proof. Taking the expectation with respect to the training samples in equation 4.1 leads to the first inequality. Now note that

    E A_X² = (M_L M_f)² [ (E K(x, x))² + (1/n) Var K(x, x) ]
           ≤ (M_L M_f)² [ (E K(x, x))² + (1/n) E K(x, x)² ].

Taking the expectation over the training samples in equation 4.2 and recalling lemma 2, we obtain the second inequality.

We have shown that if both M_f and M_L are bounded, then the expected generalization error is at most O(1/n) more than the expected training error. Furthermore, if L is bounded, then the variance of the leave-one-out estimate is O(1/n).

Let L′(p, x, y) ≥ L(p, x, y) be an upper bound of the loss that is convex in p. In the following, we choose f such that f(p, x, y) = c_n L′(p, x, y), where c_n is a regularization parameter. Note that this is equivalent to the penalized formulation, equation 2.4, with λ_n = 1/c_n. Since p(α̂, ·) solves equation 3.4, we obtain from corollary 1:

    E Z_{L′}(A_K, D_n) ≤ inf_{p(·)} [ E L′(p, x, y) + (1/(2c_n)) ||p(·)||²_{[n]} ] + (c_n M_L² / n) E K(x, x).    (4.3)

Clearly this implies that if we choose c_n = o(n) such that c_n → ∞, then as n → ∞, the expected generalization error with respect to the L′-loss will be no more than the best approachable L′-loss in H: inf_{p∈H} E L′(p, x, y).

Consider a nonnegative loss L′. If we let L = min(L′, U) for some U ≥ 0, then in addition to equation 4.3, we also have

    Var(Z_L(A_K, D_n)) ≤ (2/n) [ U² + c_n² M_L² (E K(x, x))² ].    (4.4)
This shows that if we want the variance of the leave-one-out error to approach zero as n → ∞, c_n has to be chosen such that c_n = o(√n). Clearly, this condition is more restrictive than the requirement c_n = o(n) in equation 4.3. It also implies that if we choose a large c_n, then the variance of the leave-one-out error can be large even when the expected generalization error of the estimated predictor is not much worse than that of the optimal predictor.

The analysis in this section is useful for certain robust regression and classification formulations. For regression, we consider the following two scenarios:

• Absolute deviation: L(p, x, y) = min(L′(p, y), U), where L′(p, y) = |p − y|. Let M_y = sup_y |y|. Then we may set κ = (2 c_n M_y)^{1/2} and have

    M_f ≤ c_n,  M_L ≤ 1,  U_L ≤ min((2 c_n M_y)^{1/2} M + M_y, U).

• Huber's robust loss: L(p, x, y) = min(L′(p, y), U), where

    L′(p, y) = |p − y| − 1 if |p − y| > 2,  and  L′(p, y) = (1/4)(p − y)² if |p − y| ≤ 2.

Let M_y = sup_y |y|. Then we may set κ = (2 c_n M_y)^{1/2} and have

    M_f ≤ c_n,  M_L ≤ 1,  U_L ≤ min((2 c_n M_y)^{1/2} M + M_y, U).

We can also consider binary classification problems with labels y = ±1. The decision rule is to predict y as 1 if p(x) ≥ 0 and as −1 if p(x) < 0. We define the classification error function of this prediction as

    I(p, y) = 0 if py > 0,  and  I(p, y) = 1 if py ≤ 0.

Our goal is thus to produce a predictor p(x) such that p(x) and y have the same sign. We consider the following widely used formulations. Instead of using the classification error itself as the loss, we consider losses that are upper bounds of the classification error (up to a scale factor).

• Logistic regression: L(p, x, y) = min(L′(p, y), U), where L′(p, y) = ln(1 + exp(−py)). We may set κ = (2 c_n ln 2)^{1/2} and have

    M_f ≤ c_n,  M_L ≤ 1,  U_L ≤ min((2 c_n ln 2)^{1/2} M + ln 2, U).

• Support vector machine (SVM) loss: L(p, x, y) = min(L′(p, y), U), where L′(p, y) = max(0, 1 − py). We may set κ = (2 c_n)^{1/2} and have

    M_f ≤ c_n,  M_L ≤ 1,  U_L ≤ min((2 c_n)^{1/2} M + 1, U).

• Modified Huber's loss: L(p, x, y) = min(L′(p, y), U), where

    L′(p, y) = −py if py < −1,  (1/4)(1 − py)² if py ∈ [−1, 1],  and  0 if py > 1.

We may set κ = (c_n / 2)^{1/2} and have

    M_f ≤ c_n,  M_L ≤ 1,  U_L ≤ min((c_n / 2)^{1/2} M + 1/2, U).
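The claim M_L ≤ 1 for these losses, and their convexity in p, can be checked numerically. This is a sketch of our own: finite-difference slopes on a grid at the representative label y = 1 stand in for derivatives.

```python
import numpy as np

# the five losses L'(p, y) from the list above
def abs_dev(p, y):  return np.abs(p - y)
def huber(p, y):
    d = np.abs(p - y)
    return np.where(d > 2, d - 1, 0.25 * d ** 2)
def logistic(p, y): return np.log1p(np.exp(-p * y))
def hinge(p, y):    return np.maximum(0.0, 1.0 - p * y)
def mod_huber(p, y):
    m = p * y
    return np.where(m < -1, -m, np.where(m <= 1, 0.25 * (1 - m) ** 2, 0.0))

p = np.linspace(-5, 5, 2001)
for loss in (abs_dev, huber, logistic, hinge, mod_huber):
    v = loss(p, 1.0)
    slopes = np.diff(v) / np.diff(p)
    assert np.all(np.abs(slopes) <= 1 + 1e-9)   # Lipschitz constant M_L <= 1
    assert np.all(np.diff(slopes) >= -1e-9)     # convexity of L' in p
```

Note that both Huber variants are continuously glued at their breakpoints (for example, (1/4)·2² = 1 = 2 − 1 for Huber's robust loss), which is what keeps the secant slopes bounded by 1.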
The modified Huber's loss has certain advantages over the standard SVM loss and hence is interesting in itself (Zhang, in press). However, the specific theoretical motivation is not important for the purpose of this article.

Clearly, for all of the above cases, M_L ≤ 1, and the analysis in this section applies. From equations 4.3 and 4.4, we obtain the following result:

Corollary 2. Consider absolute deviation or Huber's robust loss for regression, and logistic regression, SVM, or modified Huber's loss for classification. If we choose f as f(p, x, y) = c_n L′(p, y) and solve the corresponding learning problem 3.2 with g(p, x, y) given by equation 3.3, then the following leave-one-out bounds are valid:

    E Z_L(A_K, D_n) ≤ inf_{p(·)} [ E L′(p(x), y) + (1/(2c_n)) ||p(·)||²_{[n]} ] + (c_n / n) E K(x, x),
    Var(Z_L(A_K, D_n)) ≤ (2/n) [ U_L² + c_n² (E K(x, x))² ],

where p(·) is an arbitrary measurable function. If we restrict p(·) to be in H, then ||p(·)||²_{[n]} can be replaced by ||p||². Since we can choose κ = (2 c_n L′(0, 0, 1))^{1/2} for the above classification formulations, we can let L(p, x, y) = L′(p, y) and set U_L = L′(−(2 c_n L′(0, 0, 1) sup_x K(x, x))^{1/2}, 0, 1) in the above variance bound.

Note that c_n in the above corollary is equivalent to 1/λ_n under the penalized formulation 2.4. Therefore, using the notation of section 2, the expected generalization error can be bounded as

    Q_L(A_K, n − 1) ≤ inf_{p∈H} [ Q_{L′}(p) + (λ_n/2) ||p||² ] + O(1/(λ_n n)).    (4.5)

The second term on the right-hand side is the bias term mentioned in section 2, and the third term is the learning complexity term. As a comparison, the leave-one-out variance term is of the order O(U_L²/n + 1/(λ_n² n)), which is usually much larger, since typically we require λ_n → 0 as n → ∞.
As mentioned in section 2, this O(1/(λ_n² n)) learning complexity behavior is also present in algorithmic stability-based approaches such as Bousquet and Elisseeff (2002). For example, typical bounds obtained there (such as those for SVM and least squares) take the following form (adapted to our notation): with probability 1 − η over the training data D_n, we have

    Q_L(A_K, D_n) ≤ inf_{p∈H} [ (1/n) Σ_{i=1}^n L′(p(x_i), x_i, y_i) + (λ_n/2) ||p||² ] + O(1/(λ_n n)) + O((−ln(η) / (λ_n² n))^{1/2}),

where the O(·) notation also depends on U_L. Taking the expectation with respect to the training data, we obtain from their analysis the following expected generalization bound:

    Q_L(A_K, n) ≤ inf_{p∈H} [ Q_{L′}(p) + (λ_n/2) ||p||² ] + O(1/(λ_n n)) + O((1/(λ_n² n))^{1/2}).    (4.6)

Clearly, similarly to our leave-one-out variance bound, the O(1/(λ_n² n)) learning complexity behavior is also present in their analysis.

Consider now, for example, the case that the value inf_{p∈H} Q_L(p) is achieved at some p* ∈ H: Q_L(p*) = inf_{p∈H} Q_L(p). In this case, our analysis implies Q_L(A_K, n) ≤ Q_L(p*) + O(λ_n) + O(1/(λ_n n)). By choosing λ_n = O(1/√n), we obtain Q_L(A_K, n) ≤ Q_L(p*) + O(1/√n). Note that the leave-one-out variance bound is then of the order O(U_L²/n + 1), which does not even converge to zero as n → ∞. The analysis in Bousquet and Elisseeff (2002) has the same issue. The best convergence rate from equation 4.6 is obtained by setting λ_n = n^{−1/4}, and the resulting bound becomes Q_L(A_K, n) ≤ Q_L(p*) + O(n^{−1/4}). If we consider the effect of the U_L dependency in the O(·) notation, the best possible rate of convergence from their analysis can be even slower.

We are not trying to criticize the algorithmic stability-based analysis of Bousquet and Elisseeff (2002). As mentioned earlier, such analysis leads to probability bounds that provide more detailed information than the expected generalization bounds presented in this article. However, it is important to see that their analysis does not lead to the correct expected generalization error bounds, and we do not know an easy fix. In this regard, the leave-one-out analysis presented here is useful, since it does not suffer from this problem. Its disadvantage is that it leads only to expected generalization bounds.

Another interesting classification formulation is the exponential loss, where we let f(p, x, y) = c_n exp(−py). This function is used in boosting but can also be combined with kernel methods.
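Before stating the corollary for the exponential loss, its dual under the convention of equation 3.3 can be checked numerically; the closed form below is our own computation, not stated in the text, and the constants n and c_n and the grid range are arbitrary choices.

```python
import numpy as np

n, c_n = 25, 1.5
a = c_n / n          # the 1/n-scaled loss weight appearing in equation 3.3

def g(u, y):
    # claimed dual of f(p,x,y) = c_n*exp(-p*y): g(u) = m*ln(m/a) - m with m = -u*y > 0
    m = -u * y
    return m * np.log(m / a) - m

v = np.linspace(-30.0, 30.0, 600001)
for y in (1.0, -1.0):
    for m in (0.01, 0.3, 1.0):
        u = -m * y
        num = np.max(u * v - a * np.exp(-v * y))   # grid stand-in for sup_v [u*v - f(v)/n]
        assert abs(num - g(u, y)) < 1e-4
```

The entropy-like form of this dual is what connects the exponential loss to boosting-style updates.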
Corollary 3. Consider the loss function L(p, x, y) = exp(−py) and choose the corresponding f in equation 3.4 as f(p, x, y) = c_n exp(−py). Let M = sup_x K(x, x)^{1/2}. Then the leave-one-out error satisfies

    E Z_L(A_K, D_n) ≤ inf_{p(·)} [ E L(p(x), x, y) + (1/(2c_n)) ||p(·)||²_{[n]} ] + (c_n/n) exp((8c_n)^{1/2} M) E K(x, x),
    Var(Z_L(A_K, D_n)) ≤ (2/n) [ exp((8c_n)^{1/2} M) + c_n² exp((32c_n)^{1/2} M) (E K(x, x))² ],

where p(·) is an arbitrary measurable function.

Proof. Consider p = 0, which leads to the choice κ = (2c_n)^{1/2}. We can now apply corollary 1 with M_f = c_n exp((2c_n)^{1/2} M) and M_L = U_L = exp((2c_n)^{1/2} M).

The analysis given in this section is relatively general in the sense that it can be used to study any kernel learning problem with bounded M_L and M_f. However, for many problems (such as least-squares regression), M_L and M_f may depend on the regularization parameter c_n. In such cases, the analysis given in this section can be suboptimal. It is then necessary to estimate the behavior of α̂ using more elaborate methods. This in general requires a metric that measures the size of α̂ in terms of some observable quantities. Although any quantity can be used, we will specifically consider bounding α̂ through the optimal training risk of the primal objective function, equation 3.4. The advantage of this quantity is that it is directly optimized by the kernel learning algorithm; we can therefore bound its value using an arbitrary function p ∈ H (an idea already used in this section). We will apply this method to specific learning problems in regression and classification in the next two sections. Although variance estimates can also be obtained for those problems, they are rather complicated; for simplicity, we do not include the corresponding variance calculations.

5 Leave-One-Out Analysis for Some Regression Formulations

In this section, we study some regression formulations. We define D_ε(p, y) = max(|p − y| − ε, 0) for all ε ≥ 0. In theorem 2, let

    L(p, x, y) = D_ε(p, y)^s,  (1 ≤ s < +∞),
which is a generalized version of Vapnik's ε-insensitive regression loss (Vapnik, 1998). For regression problems, our goal is to find a function p(x) that minimizes the expected loss Q(p(·)) = E_{(x,y)} D_ε(p(x), y)^s, where E denotes the expectation with respect to an unknown distribution. The training samples (x_1, y_1), ..., (x_n, y_n) are independently drawn from this distribution. Since we assume that s ∈ [1, +∞), L is convex. We may choose f in equation 3.4 such that

    f(p, x, y) = (c_n / s) D_ε(p, y)^s,    (5.1)

where c_n > 0 is a regularization parameter. The purpose of the parameter ε in the above formulation is to obtain a sparse representation of α̂. In fact, this can easily be seen from equation B.3, since the formula shows that a component α̂_k = 0 as long as |p(α̂, x_k) − y_k| < ε. In this case, g in equation 3.2 becomes

    g(−α, x, y) = ((c_n/n)^{−t/s} / t) |α|^t − αy + ε|α|,    (5.2)

where 1/s + 1/t = 1. For the above loss function, the following leave-one-out cross-validation bound can be obtained directly from theorem 2:

Lemma 3. Under the conditions of lemma 1, the following bound on the leave-one-out error is valid for all s ≥ 1:

    ( Σ_{k=1}^n D_ε(p(α̂^{[k]}, x_k), y_k)^s )^{1/s} ≤ ( Σ_{k=1}^n D_ε(p(α̂, x_k), y_k)^s )^{1/s} + ( Σ_{k=1}^n |α̂_k K(x_k, x_k)|^s )^{1/s}.

Proof. Note that for all δ > 0,

    sup_{|Δp| ≤ δ} D_ε(p + Δp, y) ≤ D_ε(p, y) + δ.
Let L(p, x, y) = D_ε(p, y)^s. The right-hand side of equation 3.5 can then be bounded by

    Σ_{k=1}^n ( D_ε(p(α̂, x_k), y_k) + |α̂_k| K(x_k, x_k) )^s.

The lemma then follows from the Minkowski inequality.

The generic leave-one-out regression bound in the above lemma can be computed from the training data. To derive results for the expected generalization error, we shall further investigate the behavior of this bound. Intuitively, the analysis of section 4 suggests that α̂_k is often of the order O(1/n); therefore, when n → ∞, the normalized second term on the right-hand side (after multiplying by the normalization factor n^{−1/s}) converges to zero at the rate O(n^{−1/s}). Roughly speaking, this leads to the correct convergence rate of n^{−1/2} (as mentioned in section 2) for the least-squares formulation, where s = 2.

As in section 4, in order to obtain exact generalization bounds, we start with an estimate of α̂. With g in equation 3.2 given by equation 5.2, we obtain from equation B.3

    |α̂_i| = (c_n/n) max(|p(α̂, x_i) − y_i| − ε, 0)^{s−1}  (i = 1, ..., n).

Therefore, for all k,

    |α̂_k|^t = (c_n^t / n^t) D_ε(p(α̂, x_k), y_k)^s.

Summing over k, we obtain

    Σ_{k=1}^n |α̂_k|^t ≤ (c_n^t / n^t) Σ_{k=1}^n D_ε(p(α̂, x_k), y_k)^s.

Clearly, the left-hand side of this inequality measures the size of α̂ through its t-norm, and the right-hand side gives a bound in terms of the training error. We have thus bounded the size of α̂ by the training error, as suggested at the end of section 4.

To proceed, we consider another exponent s′ such that 1/s′ = 1/t + 1/t′, where s′, t′ ≥ 1. Using Jensen's inequality, some simple algebra, and Hölder's inequality, we obtain

    n^{−max(1/s−1/s′, 0)} ( Σ_{k=1}^n |α̂_k K(x_k, x_k)|^s )^{1/s}
      ≤ ( Σ_{k=1}^n |α̂_k K(x_k, x_k)|^{s′} )^{1/s′}
      ≤ ( Σ_{k=1}^n |α̂_k|^t )^{1/t} ( Σ_{k=1}^n K(x_k, x_k)^{t′} )^{1/t′}
      ≤ (c_n/n) ( Σ_{k=1}^n D_ε(p(α̂, x_k), y_k)^s )^{1/t} ( Σ_{k=1}^n K(x_k, x_k)^{t′} )^{1/t′}.

Substituting into lemma 3, we obtain the following leave-one-out bound:

Lemma 4. Under the conditions of lemma 1 with g given by equation 5.2, consider s, t, s′, t′ ≥ 1 with 1/s + 1/t = 1 and 1/s′ = 1/t + 1/t′. Then

    ( Σ_{k=1}^n D_ε(p(α̂^{[k]}, x_k), y_k)^s )^{1/s}
      ≤ ( Σ_{k=1}^n D_ε(p(α̂, x_k), y_k)^s )^{1/s}
        + c_n n^{max(1/s−1/s′, 0)−1} ( Σ_{k=1}^n D_ε(p(α̂, x_k), y_k)^s )^{1/t} ( Σ_{k=1}^n K(x_k, x_k)^{t′} )^{1/t′}.
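Lemmas 3 and 4 are deterministic, so they can be verified directly in the least-squares case (s = 2, ε = 0), where α̂ has the closed form (G + (n/c_n)I)^{-1} y. The sketch below is our own: it also checks the stationarity identity α̂_i = (c_n/n)(y_i − p(α̂, x_i)) from equation B.3, uses the s ≥ 2 choice s′ = t, t′ = +∞ for the lemma 4 factor, and takes an arbitrary polynomial kernel and data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, c_n = 15, 2.0
X = rng.normal(size=(n, 2))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)
G = (1.0 + X @ X.T) ** 2                      # polynomial kernel Gram matrix

def solve(idx):
    # minimizer of (c_n/(2n)) sum_{i in idx}(p(x_i)-y_i)^2 + ||p||^2/2
    return np.linalg.solve(G[np.ix_(idx, idx)] + (n / c_n) * np.eye(len(idx)), y[idx])

alpha = solve(list(range(n)))
p = G @ alpha
assert np.allclose(alpha, (c_n / n) * (y - p))            # equation B.3 at s = 2

loo = np.array([G[k, [i for i in range(n) if i != k]]
                @ solve([i for i in range(n) if i != k]) for k in range(n)])
lhs = np.sqrt(np.sum((loo - y) ** 2))                     # leave-one-out l2 error
train = np.sqrt(np.sum((p - y) ** 2))
assert lhs <= train + np.sqrt(np.sum((alpha * np.diag(G)) ** 2)) + 1e-9   # lemma 3
assert lhs <= train * (1 + (c_n / n) * np.diag(G).max()) + 1e-9           # lemma 4
```

The lemma 4 bound here is the lemma 3 bound with |α̂_k| replaced by its training-error estimate, so it is necessarily the looser of the two.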
Note that in the above lemma, we have removed the dependency of the leave-one-out bound on α̂. Next, we investigate the behavior of the expected leave-one-out error, which equals the expected generalization error. In the following, for any random variable ξ, we use the compact notation E^{1/s} ξ to denote (E ξ)^{1/s}.

Theorem 5. Under the conditions of lemma 4, assume further that 1/s = 1/u + 1/v, where u, v ≥ 1. Denote by X_n the training set {(x_1, y_1), ..., (x_n, y_n)}, and let

    Q̄(X_n) = (1/n) Σ_{k=1}^n D_ε(p(α̂, x_k), y_k)^s

be the observed average training error. Then the expected leave-one-out error can be bounded as

    E^{1/s} Q(p(α̂^{[k]}, ·)) ≤ E^{1/s} Q̄(X_n) + (c_n / n^{min(1/s, 1/s′)+1/s}) E^{1/u} Q̄(X_n)^{u/t} E^{1/v} ( Σ_{k=1}^n K(x_k, x_k)^{t′} )^{v/t′},

where E denotes the expectation over the n random training samples X_n.
Proof. Let d_n = c_n n^{max(1/s−1/s′, 0)−1}. Using lemma 4 and the Minkowski inequality, we have

    E^{1/s} Σ_{k=1}^n D_ε(p(α̂^{[k]}, x_k), y_k)^s
      ≤ E^{1/s} Σ_{k=1}^n D_ε(p(α̂, x_k), y_k)^s + d_n E^{1/s} [ ( Σ_{k=1}^n D_ε(p(α̂, x_k), y_k)^s )^{s/t} ( Σ_{k=1}^n K(x_k, x_k)^{t′} )^{s/t′} ]
      ≤ E^{1/s} Σ_{k=1}^n D_ε(p(α̂, x_k), y_k)^s + d_n E^{1/u} ( Σ_{k=1}^n D_ε(p(α̂, x_k), y_k)^s )^{u/t} E^{1/v} ( Σ_{k=1}^n K(x_k, x_k)^{t′} )^{v/t′},

where the second inequality follows from Hölder's inequality. Also note that

    E Σ_{k=1}^n D_ε(p(α̂^{[k]}, x_k), y_k)^s = n E D_ε(p(α̂^{[k]}, x_k), y_k)^s = n E Q(p(α̂^{[k]}, ·)).
We thus obtain the theorem. The next step is to bound the training error using the error of an arbitrary function p ∈ H. This gives the following result. Note that the bound can be improved using moment inequalities for the sum of independent variables. However, this introduces another level of complexity. For our purpose, we will use plain Jensen’s inequality for simplicity. Corollary 4.
Under the conditions of theorem 5, let

    Q_{q,n} = inf_{p(·)} E^{1/q} [ ( D_ε(p(x), y)^s + (s/(2c_n)) ||p(·)||²_{[n]} )^q ],

where p denotes an arbitrary measurable function, and let

    Δ_n = 1 + min(1/s, 1/s′) − max(1/t, 1/u) − max(1/t′, 1/v).
The expected generalization error is bounded by

    E^{1/s} Q(p(α̂^{[k]}, ·)) ≤ Q_{1,n}^{1/s} + (c_n / n^{Δ_n}) Q_{u/t,n}^{1/t} E^{1/v} K(x, x)^v,

where E denotes the expectation over the n random training samples.

Proof. We bound the three terms on the right-hand side of theorem 5 separately. For the first term,

    E Q̄(X_n) ≤ inf_p E [ (1/n) Σ_{k=1}^n D_ε(p(x_k), y_k)^s + (s/(2c_n)) ||p(X_n)||² ] = Q_{1,n}.

Now for any measurable function p(·), using Jensen's inequality to bound the second term, we obtain

    E^{1/u} Q̄(X_n)^{u/t} ≤ E^{1/u} [ ( (1/n) Σ_{k=1}^n D_ε(p(x_k), y_k)^s + (s/(2c_n)) ||p(X_n)||² )^{u/t} ]
      ≤ E^{1/u} [ n^{max(−1, −u/t)} ( Σ_{k=1}^n D_ε(p(x_k), y_k)^s + (s/(2c_n)) ||p(X_n)||² )^{u/t} ]
      = n^{max(0, 1/u−1/t)} E^{1/u} [ ( D_ε(p(x), y)^s + (s/(2c_n)) ||p||²_{[n]} )^{u/t} ].

Similarly, using Jensen's inequality, we have

    E^{1/v} ( Σ_{k=1}^n K(x_k, x_k)^{t′} )^{v/t′} ≤ E^{1/v} [ n^{max(v/t′−1, 0)} Σ_{k=1}^n K(x_k, x_k)^v ]
      ≤ n^{max(1/t′, 1/v)} E^{1/v} K(x, x)^v.
Applying the above bounds to theorem 5, we obtain the desired result.

Corollary 4 can be applied with any choice of (s′, t′) and (u, v). We may make the following specific choices, which maximize the rate Δ_n:

• s ∈ [1, 2]: s′ = s, t′ = st/(t − s), u = t, and v = t′. We have

    E^{1/s} Q(p(α̂^{[k]}, ·)) ≤ Q_{1,n}^{1/s} + (c_n/n) Q_{1,n}^{1/t} E^{(t−s)/st} K(x, x)^{st/(t−s)}.

• s ≥ 2: s′ = t, t′ = +∞, u = s, and v = +∞. We have

    E^{1/s} Q(p(α̂^{[k]}, ·)) ≤ Q_{1,n}^{1/s} + (c_n/n^{2/s}) Q_{s/t,n}^{1/t} sup_x K(x, x).
Clearly, if we let s = 1 and t = +∞, we reproduce the expected generalization error bound for absolute deviation in corollary 2. With fixed c_n, it is easy to see that when s > 2, the asymptotic convergence rate is O(n^{−2/s}), which decreases as s increases. When s < 2, we are more tolerant of large inputs K(x, x), since the bound depends on E^{(t−s)/st} K(x, x)^{st/(t−s)}, which becomes less sensitive to large K(x, x) as s decreases. In addition, for smaller s, t is larger, and hence the factor Q_{1,n}^{1/t} (for s ∈ [1, 2]; Q_{s/t,n}^{1/t} for s ≥ 2) has a smaller impact on the bound. This means that even when Q_{1,n} is relatively large, we can still expect fast convergence. All of these observations imply that the formulation is more robust when s is smaller. Note that regression using Huber's loss has the same behavior as the s = 1 case; this gives a theoretical justification of the robustness of Huber's loss function.

For s ≥ 2, we require sup_x K(x, x) to be bounded. It is clear that by choosing v < +∞, we can relax this condition to the boundedness of E K(x, x)^v; however, the asymptotic convergence rate then reduces to O(n^{−2/s+1/v}). It is also interesting to observe that the convergence rate decreases to O(1) as s → ∞. This is not very surprising: in the scenario s = ∞, the risk function Q becomes the essential upper bound of |p(x) − y|,

    Q(p) = inf{a : P(|p(x) − y| > a) = 0},

which cannot be reliably estimated from a finite number of samples. This is another way to see that large s is very sensitive to outliers.

We now consider the least-squares regression formulation, where s = 2. Assume that K(x, x) is bounded, and let λ_n = 2/c_n in equation 2.4. Corollary 4 (with the above-mentioned parameter choices s′ = s, t′ = st/(t − s), u = t, and v = t′) implies that the expected generalization error with n − 1 training data can be bounded as

    E Q(p(α̂^{[k]}, ·)) ≤ ( 1 + (2 sup_x K(x, x)) / (λ_n n) )² inf_{p∈H} [ Q(p) + (λ_n/2) ||p||² ].

Similarly to equation 4.5, this bound implies an O(1/(λ_n n)) learning complexity. As mentioned in section 2, if inf_{p∈H} Q(p) is achieved by some p* ∈ H, then we can let λ_n = O(1/√n) to obtain the correct convergence rate (to the minimum value Q(p*)) of order O(1/√n). Similarly to the discussion in section 4, our result compares favorably to the least-squares bound of the form 4.6 in Bousquet and Elisseeff (2002), which leads to a best achievable rate of order O(n^{−1/4}). One also notes that our results can be further improved when the problem is noise free, Q(p*) = 0: here we may let λ_n = O(1/n) to obtain an expected generalization error of order O(1/n), whereas the best achievable rate in Bousquet and Elisseeff (2002) based on equation 4.6 remains no better than O(n^{−1/4}) in this scenario.

Next, we consider choices of c_n for which the expected generalization error converges to the best approachable error as n → ∞. For simplicity, assume 1 < s ≤ 2, s′ = s, and that E^{(t−s)/st} K(x, x)^{st/(t−s)} is finite. We also restrict p such that p ∈ H in the estimate of Q_{1,n}. This gives the
following results:

• inf_{p∈H} E_{x,y} D_ε(p(x), y)^s = 0: if we pick c_n → ∞ such that c_n = O(n), then

    lim_{n→∞} E E_{x,y} D_ε(p(α̂, x), y)^s = 0.

• inf_{p∈H} E_{x,y} D_ε(p(x), y)^s > 0: if we pick c_n → ∞ such that c_n = o(n), then

    lim_{n→∞} E E_{x,y} D_ε(p(α̂, x), y)^s = inf_{p∈H} E_{x,y} D_ε(p(x), y)^s.
In the above, the first is the noiseless case and the second the noisy case. In both cases, however, we do not need to assume that there is a target function p ∈ H that achieves the minimum of inf_{p∈H} E_{x,y} |p(x) − y|^s. For example, we may have a function p̃(x) such that E_{x,y} |p̃(x) − y|^s = inf_{p∈H} E_{x,y} |p(x) − y|^s but p̃(x) does not belong to H; the function may lie in the closure of H under the L_s-topology. For some kernel functions, this closure of H contains all continuous functions (and L_s). In this case, corollary 4 implies that as n → ∞, the kernel learning formulation, equation 3.2, is able to learn all continuous target functions even under observation noise. We call this property universal learning.

The leave-one-out analysis can also be used to derive approximation bounds for kernel methods. For function approximation, we are interested in the best possible L_s approximation error of an arbitrary function using n terms of the kernel expression:

    A_n(p*) = inf_α E_x^{1/s} |p(α, x) − p*(x)|^s,

where p* is the target function and p(α, x) is given by equation 3.1. We may produce an approximation by solving equation 3.4 with f(p, x, y) = c_n |p(α, x) − p*(x)|^s. It is clear that A_n(p*) ≤ E^{1/s} E_x |p(α̂^{[k]}, x) − p*(x)|^s. Applying corollary 4, we obtain

    A_{n−1}(p*) ≤ inf_{c_n > 0} [ ( s ||p*||²_{[n]} / (2c_n) )^{1/s} + (c_n / n^{Δ_n}) ( s ||p*||²_{[n]} / (2c_n) )^{1/t} E^{1/v} K(x, x)^v ]
               = 2 ( ( s ||p*||²_{[n]} / (2 n^{Δ_n}) ) E^{1/v} K(x, x)^v )^{1/2}.

Again, by maximizing the rate Δ_n, we obtain the following approximation bounds:

• s ∈ [1, 2]: s′ = s, t′ = st/(t − s), u = t, and v = t′. We have

    A_{n−1}(p*) ≤ 2 ( ( s ||p*||²_{[n]} / (2n) ) E^{(t−s)/st} K(x, x)^{st/(t−s)} )^{1/2}.
• s ≥ 2: s′ = t, t′ = +∞, u = s, and v = +∞. We have

    A_{n−1}(p*) ≤ 2 ( ( s ||p*||²_{[n]} / (2 n^{2/s}) ) sup_x K(x, x) )^{1/2}.

This shows that the minimum-norm interpolation quantity ||p*||_{[n]} characterizes the rate of approximation using the kernel expression 3.1.

6 Leave-One-Out Analysis for Some Binary Classification Formulations

In this section, we study some binary classification formulations using ideas similar to those of section 5. The following result is a direct consequence of theorem 2; a similar bound can be found in Jaakkola and Haussler (1999).

Lemma 5. Under the conditions of lemma 1, the following bound on the leave-one-out classification error is valid:

    Σ_{k=1}^n I(p(α̂^{[k]}, x_k), y_k) ≤ Σ_{k=1}^n I(p(α̂, x_k) − |α̂_k| y_k K(x_k, x_k), y_k).
The above bound can be regarded as a margin-style error bound. Although useful computationally, it does not provide useful theoretical insights. In this section, we show how to estimate the right-hand side of the above bound using techniques similar to our analysis of regression problems. One difficulty is the nonconvexity of the classification error function. In practice, people use convex upper bounds of I(p, y) to remedy the problem. There are many possible choices. In this section, we will consider powers of D_+(p, y) = max(0, 1 − py), which are related to support vector machines.

Our goal is to find a function p(x) that minimizes the expected loss Q(p(·)) = E_{x,y} f(p(x), x, y), where E denotes the expectation with respect to an unknown distribution. The training samples (x_1, y_1), …, (x_n, y_n) are independently drawn from the same underlying distribution. The following lemma plays the same role as lemma 3.

Lemma 6. Under the conditions of lemma 1, the following bound on the leave-one-out error is valid for all s ≥ 1:

    \left( \sum_{k=1}^n D_+(p(\hat\alpha^{[k]}, x_k), y_k)^s \right)^{1/s} \le \left( \sum_{k=1}^n D_+(p(\hat\alpha, x_k), y_k)^s \right)^{1/s} + \left( \sum_{k=1}^n |\hat\alpha_k K(x_k, x_k)|^s \right)^{1/s}.
Leave-One-Out Bounds for Kernel Methods
1427
Proof. Note that D_+(p + \Delta p, y) \le D_+(p, y) + |\Delta p|. The rest of the proof is the same as that of lemma 3.

To investigate theoretical properties of SVM-type losses further, we need to bound the right-hand side of lemma 6 in a way similar to the regression analysis. In particular, we consider the following formulation for solving the classification problem, with f in equation 3.4 chosen as

    f(p, x, y) = \frac{c_n}{s} D_+(p(x), y)^s, \qquad (6.1)
where c_n > 0 is a regularization parameter and 1 ≤ s < ∞. In this case, g in equation 3.2 becomes

    g(-\alpha, x, y) = \frac{(c_n/n)^{-t/s}}{t} |\alpha|^t - \alpha y \qquad (\alpha y \ge 0), \qquad (6.2)

where 1/s + 1/t = 1. Note that if s = 1, then t = +∞. Equation 6.2 can be equivalently written as

    g(\alpha, x, y) = \begin{cases} -\alpha y & \text{if } \alpha y \in [0, c_n/n], \\ +\infty & \text{otherwise}. \end{cases} \qquad (6.3)
With g in equation 3.2 given by equation 6.2, we obtain from equation B.3,

    \hat\alpha_k y_k = \frac{c_n}{n} \max(1 - p(\hat\alpha, x_k) y_k,\ 0)^{s-1} \qquad (k = 1, \ldots, n). \qquad (6.4)
Note that if s = 1, then the right-hand side of equation 6.4 is not uniquely defined at p(\hat\alpha, x_i) y_i = 1. The equation becomes \hat\alpha_i y_i \in [0, c_n/n] in this case. Similar to the regression analysis in section 5, we obtain for all k:

    |\hat\alpha_k|^t = \frac{c_n^t}{n^t} D_+(p(\hat\alpha, x_k), y_k)^s.

Summing over k, we obtain

    \sum_{k=1}^n |\hat\alpha_k|^t \le \frac{c_n^t}{n^t} \sum_{k=1}^n D_+(p(\hat\alpha, x_k), y_k)^s.
Clearly we can use the same derivation as that in section 5, and simply replace the symbol D by D+ . This leads to the following counterpart of lemma 4:
Lemma 7. Under the conditions of lemma 1 with g given by equation 6.2, consider s, t, s', t' ≥ 1 such that 1/s + 1/t = 1 and 1/s' = 1/t + 1/t'. Then

    \left( \sum_{k=1}^n D_+(p(\hat\alpha^{[k]}, x_k), y_k)^{s'} \right)^{1/s'} \le \left( \sum_{k=1}^n D_+(p(\hat\alpha, x_k), y_k)^{s'} \right)^{1/s'}
        + c_n\, n^{\max(1/s - 1/s',\ 0) - 1} \left( \sum_{k=1}^n D_+(p(\hat\alpha, x_k), y_k)^s \right)^{1/t} \left( \sum_{k=1}^n K(x_k, x_k)^{t'} \right)^{1/t'}.
Using the same proof as that of theorem 5, we obtain the following bound on the expected generalization error.

Theorem 6. Under the conditions of lemma 7, assume further that 1/s' = 1/u + 1/v, where u, v ≥ 1. Denote by X_n the training set {(x_1, y_1), …, (x_n, y_n)}. Let Q(p(·)) = E_{x,y} D_+(p(x), y)^{s'} be the expected true error of a predictor p(·), and let

    \bar{Q}(X_n) = \frac{1}{n} \sum_{k=1}^n D_+(p(\hat\alpha, x_k), y_k)^{s'}

be the observed training error. Then the expected leave-one-out error can be bounded as

    E^{1/s'} Q(p(\hat\alpha^{[k]}, \cdot)) \le E^{1/s'} \bar{Q}(X_n) + \frac{c_n}{n^{\min(1/s, 1/s') + 1/s'}}\, E^{1/u} \bar{Q}(X_n)^{u/t}\, E^{1/v} \left( \sum_{k=1}^n K(x_k, x_k)^{t'} \right)^{v/t'},

where E denotes the expectation over n random training samples X_n.

The proof of the following result is the same as that of corollary 4.

Corollary 5. Under the conditions of theorem 6, let

    Q_{q,n} = \inf_{p(\cdot)} E^{1/q} \left[ D_+(p(x), y)^s + \frac{s}{2 c_n} \|p(\cdot)\|_{[n]}^2 \right]^q,
where p denotes an arbitrary measurable function. Let

    \Delta_n = 1 + \min(1/s, 1/s') - \max(1/t, 1/u) - \max(1/t', 1/v).

The expected generalization error is bounded by

    E^{1/s'} Q(p(\hat\alpha^{[k]}, \cdot)) \le Q_{1,n}^{1/s'} + \frac{c_n}{n^{\Delta_n}} Q_{u/t,n}^{1/t}\, E^{1/v} K(x, x)^v,
where E denotes the expectation over n random training samples. Similar to the regression analysis, we obtain the following bounds from corollary 5:

• s ∈ [1, 2]: s' = s, t' = st/(t − s), u = t, and v = t'. We have

    E^{1/s} Q(p(\hat\alpha^{[k]}, \cdot)) \le Q_{1,n}^{1/s} + \frac{c_n}{n} Q_{1,n}^{1/t}\, E^{(t-s)/st} K(x, x)^{st/(t-s)}.

• s ≥ 2: s' = t, t' = +∞, u = s, and v = +∞. We have

    E^{1/s'} Q(p(\hat\alpha^{[k]}, \cdot)) \le Q_{1,n}^{1/s'} + \frac{c_n}{n^{2/s}} Q_{s/t,n}^{1/t} \sup_x K(x, x).
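The exponents in the two bounds above follow by direct substitution into the definition of Δ_n. The short script below (my own sketch; only the formulas from corollary 5 are used, and the function names are mine) confirms that Δ_n = 1 for the first parameter choice and Δ_n = 2/s for the second:

```python
def delta_n(s, sp, t, tp, u, v):
    """Delta_n = 1 + min(1/s, 1/s') - max(1/t, 1/u) - max(1/t', 1/v) (corollary 5)."""
    inv = lambda x: 0.0 if x == float("inf") else 1.0 / x
    return 1 + min(inv(s), inv(sp)) - max(inv(t), inv(u)) - max(inv(tp), inv(v))

def rate_small_s(s):
    """Parameter choice for 1 < s <= 2 (s < 2 here): s' = s, t' = st/(t-s), u = t, v = t'."""
    t = s / (s - 1)            # conjugate exponent: 1/s + 1/t = 1
    tp = s * t / (t - s)
    return delta_n(s, s, t, tp, t, tp)

def rate_large_s(s):
    """Parameter choice for s >= 2: s' = t, t' = infinity, u = s, v = infinity."""
    t = s / (s - 1)
    return delta_n(s, t, t, float("inf"), s, float("inf"))
```

Plugging in, for example, s = 1.5 into `rate_small_s` gives 1, and s = 4 into `rate_large_s` gives 0.5 = 2/s, matching the displayed rates.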
Formulations with s ≥ 2 can be very sensitive to outliers. However, they can still be useful when the data are nearly separable. In the general case, if we want to choose c_n so that the expected generalization error converges to the optimal error, then c_n = o(n^{s/2}). This can be naturally compared to the exponential loss formulation in corollary 3, where we require c_n \exp(\sqrt{8 c_n}\, M) = o(n). This shows that we require c_n to grow slowly when we use a loss function that heavily penalizes outliers.

The case of s = 1 corresponds to the SVM formulation. In this case, if there exists some p ∈ H such that E\, D_+(p(x), y)^s ≈ 0 and K(x, x) is bounded, then we can use another method to estimate the size of α, which leads to a better bound. We multiply equation 6.4 by 1 − p(\hat\alpha, x_k) y_k and sum over k to obtain

    \sum_{k=1}^n |\hat\alpha_k| = \sum_{k=1}^n \hat\alpha_k y_k (1 - p(\hat\alpha, x_k) y_k) + \|p(\hat\alpha, \cdot)\|^2
        = \frac{c_n}{n} \sum_{k=1}^n \max(1 - p(\hat\alpha, x_k) y_k,\ 0)^{s-1} (1 - p(\hat\alpha, x_k) y_k) + \|p(\hat\alpha, \cdot)\|^2
        \le \frac{c_n}{n} \sum_{k=1}^n D_+(p(\hat\alpha, x_k), y_k)^s + \|p(\hat\alpha, \cdot)\|^2.
This implies

    \sum_{k=1}^n |\hat\alpha_k K(x_k, x_k)| \le \left[ \frac{c_n}{n} \sum_{k=1}^n D_+(p(\hat\alpha, x_k), y_k)^s + \|p(\hat\alpha, \cdot)\|^2 \right] \sup_k K(x_k, x_k). \qquad (6.5)
This estimate of \hat\alpha can be compared to the corresponding estimate used in corollary 5 (and corollary 2) with s = 1: there we used the estimate \sup_k |\hat\alpha_k| \le c_n/n, which leads to a bound \sum_{k=1}^n |\hat\alpha_k K(x_k, x_k)| \le (c_n/n) \sum_{k=1}^n K(x_k, x_k). If c_n is large, then the estimate given in equation 6.5 is better when (1/n) \sum_{k=1}^n D_+(p(\hat\alpha, x_k), y_k)^s is close to zero and \sup_x K(x, x) is well bounded. Using equation 6.5, we obtain the following result from lemma 6:

Theorem 7. Under the conditions of lemma 1 with g given by equation 6.4, let M = \sup_k K(x_k, x_k)^{1/2}. Then

    \sum_{k=1}^n D_+(p(\hat\alpha^{[k]}, x_k), y_k) \le \left( 1 + \frac{c_n M^2}{n} \right) \sum_{k=1}^n D_+(p(\hat\alpha, x_k), y_k) + \|p(\hat\alpha, \cdot)\|^2 M^2.
Taking expectation over the training data, we obtain:

Corollary 6. Under the conditions of lemma 1 with g given by equation 6.3, let M = \sup_x K(x, x)^{1/2}. Then

    E\, D_+(p(\hat\alpha^{[k]}, x_k), y_k) \le \frac{\max(n, c_n M^2) + c_n M^2}{n} \inf_{p(\cdot)} \left[ E_{x,y} D_+(p(x), y) + \frac{1}{2 c_n} \|p\|_{[n]}^2 \right],

where E denotes the expectation over n random training samples, and p denotes an arbitrary measurable function.

If the problem is separable, theorem 7 implies a leave-one-out bound of

    \sum_{k=1}^n D_+(p(\hat\alpha^{[k]}, x_k), y_k) \le \frac{n + 2 c_n \sup_k K(x_k, x_k)}{2 c_n} \|p(X_n)\|^2

for all functions p such that p(x_k) y_k ≥ 1 (k = 1, …, n). We can let c_n → +∞. Then the formulation becomes the optimal margin hyperplane (separable
SVM) method. In this case, the above analysis implies the following leave-one-out classification error bound:

    \sum_{k=1}^n I(p(\hat\alpha^{[k]}, x_k), y_k) \le \|p(\hat\alpha, \cdot)\|^2 \sup_k K(x_k, x_k).

This bound is identical to Vapnik's bound for the optimal margin hyperplane method (Vapnik, 1998). It shows that the optimal margin method can find a decision function whose classification error approaches zero as the sample size n → +∞ when the problem can be separated by a function in H with a large margin. In addition to this result of Vapnik, a bound similar to theorem 7 has also appeared in Joachims (2000).

We can further consider the case that the problem is separable, but not by any function in H (or not by a large margin). As we pointed out in section 5, there may be a function p ∉ H (but, say, p belongs to H's closure under the L^1 topology) such that p(x)y > 0 almost everywhere. We can assume in this case that

    \inf_{p \in H} E_{x,y} \max(1 - p(x) y,\ 0) = 0.
It is clear from corollary 5 that the optimal margin hyperplane method may not estimate a classifier p(\hat\alpha, ·) that has expected classification error approaching zero when n → +∞. That is, the method is not universal even for separable problems. This is not surprising, since when n → ∞, \|p(\hat\alpha, ·)\| may also go to +∞ at the same rate as n. On the other hand, by corollary 6, with regularization parameter c_n = O(n), \lim_{n \to +\infty} E_{x,y} I(p(\hat\alpha, x), y) = 0. This means that the soft-margin SVM with fixed C = c_n/n is a universal learning method for separable problems. A similar conclusion can be drawn from corollary 5 for 1 < s ≤ 2.

Finally, we point out that although our end goal is the classification loss E_{x,y} I(p(x), y), it is desirable to minimize a convex risk in the kernel formulation, which not only is computationally more efficient but also reduces the variance associated with the estimation. Results in this section show that E_{x,y} D_+(p(\hat\alpha, x), y)^s achieves the best possible approachable value in H as n → +∞. If H is dense in the set of continuous functions on a compact set under the uniform norm topology, then it can be shown (for example, see Zhang, in press) that a function that approximately minimizes E_{x,y} D_+(p(\hat\alpha, x), y)^s will also approximately minimize E_{x,y} I(p(x), y) among all Borel measurable functions. This implies that the results obtained in this section can also be used to establish universal learning results for nonseparable problems, even though we minimize only a convex upper bound of the classification error function instead of the classification loss itself. This connection indicates that it is important to study the expected generalization error with respect to the D_+(p(\hat\alpha, x), y)^s loss even if our purpose is to minimize the classification error.
7 Summary

In this article, we derived a general leave-one-out approximation bound for kernel methods. The approximation bound leads to a very general leave-one-out cross-validation bound for an arbitrary loss function. In addition, we have studied variance bounds for leave-one-out estimates.

Our bounds depend very weakly on properties of the underlying kernel. For example, they do not depend on the eigendecomposition structure of the kernel. On the one hand, this means that the bounds we have obtained are very general. On the other hand, this also indicates that it might be possible to improve our analysis for specific kernels by taking their structures into consideration.

We have applied the derived bound to some regression and classification problems, which demonstrated the power of leave-one-out analysis. In fact, to the best of our knowledge, the expected generalization results obtained from our analysis are the best available bounds for general kernel learning machines. In addition, our bounds reflect both learning and approximation aspects of the underlying problems. Based on these results, we are able to demonstrate universal learning properties of certain kernel methods. Our analysis also suggests that even for noiseless problems in regression or separable problems in classification, it is still helpful to use penalty-type regularization. This is because in this case, we can choose the regularization parameter as c_n = O(n) to obtain universal learning methods.

In this article, we show that the minimal interpolation norm \|p(x)\|_{[n]} of a function p(x) determines the rate of learning p(x) with the corresponding kernel function. However, we do not investigate the behavior of \|p(x)\|_{[n]} for any specific function class or any specific kernel formulation. It will be interesting to study such issues, which may lead to useful insights into various kernel formulations.

Appendix A: Properties of Kernel Representation

We prove proposition 2.

Proof. First we consider the situation that p(X_n) can be represented as p(X_n) = G(X_n)\bar\alpha. Define s = (\bar\alpha^T G(X_n) \bar\alpha)^{1/2}. Then for all α,

    p(X_n)^T \alpha = \bar\alpha^T G(X_n) \alpha \le s\, (\alpha^T G(X_n) \alpha)^{1/2}.

The equality can be achieved at α = \bar\alpha. This shows that s = \|p(X_n)\|.
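This finite-dimensional characterization is easy to check numerically. The sketch below (my own construction with a gaussian kernel and hypothetical data, not from the article) computes the norm from a representation p(X_n) = G(X_n)\bar\alpha and verifies that random directions never exceed the supremum attained at α = \bar\alpha:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
X = rng.standard_normal((n, 2))                  # hypothetical sample points
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G = np.exp(-sq)                                  # gaussian (RBF) Gram matrix, positive definite

alpha_bar = rng.standard_normal(n)
p = G @ alpha_bar                                # p(X_n) in the range of G(X_n)

# Norm from the representation: s = (alpha_bar^T G alpha_bar)^{1/2}.
s = np.sqrt(alpha_bar @ G @ alpha_bar)

# Equivalent computation when G is nonsingular: s = (p^T G^{-1} p)^{1/2}.
s2 = np.sqrt(p @ np.linalg.solve(G, p))

# The supremum of p^T a / (a^T G a)^{1/2} over directions a is attained at a = alpha_bar.
ratios = [(p @ a) / np.sqrt(a @ G @ a)
          for a in rng.standard_normal((1000, n))]
```

Each random ratio stays below s, and the two ways of computing the norm agree, as the proof above predicts.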
Now assume that p(X_n) is not in the range of G(X_n). Using elementary linear algebra, we can write p(X_n) = G(X_n)\bar\alpha + \beta, where β ≠ 0 is in the null space of G(X_n): G(X_n)\beta = 0. This implies that \beta^T p(X_n) = \beta^T G(X_n)\bar\alpha + \beta^T \beta > 0 and (\beta^T G(X_n) \beta)^{1/2} = 0. From definition 2, we must have \|p(X_n)\| = +∞.

We prove proposition 3:

Proof. To prove the first part, we consider the orthogonal projection p_0 of p onto the subspace spanned by the functions p_i(x) = K(x_i, x). Proposition 1 implies that p_0(x_i) = p(x_i) for all i = 1, …, n. Let X_n = {x_1, …, x_n}. Then proposition 2 implies that \|p(X_n)\| = \|p_0\| \le \|p\|.

To prove the second part, denote X_1 = {x_1}. If K(x_1, x_1) = 0, then let p_0(·) be the orthogonal projection of p(·) onto the subspace spanned by the function K(x_1, ·). Proposition 1 implies that p(x_1) = p_0(x_1) = 0. If K(x_1, x_1) > 0, then \|p(X_1)\| = |p(x_1)| / K(x_1, x_1)^{1/2}. Therefore, using the first part, we have |p(x_1)| / K(x_1, x_1)^{1/2} \le \|p(\cdot)\|.

Appendix B: Proof of Theorem 1

To prove the theorem, we first note that a solution of equation 3.4 must be a solution of

    \hat{p}(\cdot) = \arg\min_{p(\cdot) \in H_{X_n}} \left[ \frac{1}{n} \sum_{i=1}^n f(p(x_i), x_i, y_i) + \frac{1}{2} \|p(\cdot)\|^2 \right], \qquad (B.1)
where H_{X_n} is the subspace of H spanned by K(x_i, ·) (i = 1, …, n). This is because for any p ∈ H, by proposition 1, we may let p_{X_n} be the orthogonal projection of p onto the subspace H_{X_n}. Then p(x_i) = p_{X_n}(x_i) for all i and \|p_{X_n}\| \le \|p\|, with the equality holding only when p ∈ H_{X_n}. Therefore, p_{X_n} has a smaller primal objective value than p if p ∉ H_{X_n}. This shows that any solution of equation 3.4 can be written as \hat{p}(x) = \sum_{i=1}^n \hat\alpha_i K(x_i, x) for some \hat\alpha.

Now, let \hat\alpha be a solution of equation 3.2. Since \hat\alpha achieves the minimum of L_n(α), we have (see Rockafellar, 1970) 0 \in \partial_{\alpha_i} L_n(\hat\alpha) for all i (the subdifferential is with respect to each α_i). By theorem 23.8 in Rockafellar (1970), for each i, we can find a subgradient of g(α_i, x_i, y_i) at \hat\alpha_i with respect to α_i such that the following first-order condition holds:

    -\nabla_1 g(-\hat\alpha_i, x_i, y_i) + \sum_{j=1}^n \hat\alpha_j K(x_i, x_j) = 0 \qquad (i = 1, \ldots, n), \qquad (B.2)
where \nabla_1 g denotes a subgradient of g with respect to the first component. By the relationship of duality and subgradient in Rockafellar (1970, sec. 23), we can rewrite equation B.2 as

    \hat\alpha_i + \frac{1}{n} \nabla_1 f(p(\hat\alpha, x_i), x_i, y_i) = 0 \qquad (i = 1, \ldots, n), \qquad (B.3)
where \nabla_1 f(v, b, c) denotes a subgradient of f with respect to v. Now multiply the two sides by K(x_i, x_\ell), and sum over i. We have

    \sum_{i=1}^n \hat\alpha_i K(x_i, x_\ell) + \frac{1}{n} \sum_{i=1}^n \nabla_1 f(p(\hat\alpha, x_i), x_i, y_i)\, K(x_i, x_\ell) = 0 \qquad (\ell = 1, \ldots, n). \qquad (B.4)
For any α, we multiply equation B.4 by \alpha_\ell - \hat\alpha_\ell, and sum over \ell to obtain

    \sum_{i=1}^n \hat\alpha_i (p(\alpha, x_i) - p(\hat\alpha, x_i)) + \frac{1}{n} \sum_{i=1}^n \nabla_1 f(p(\hat\alpha, x_i), x_i, y_i)\, (p(\alpha, x_i) - p(\hat\alpha, x_i)) = 0. \qquad (B.5)
Using the definition of subgradient, equation B.5 implies

    \frac{1}{2} \left( \|p(\alpha, \cdot)\|^2 - \|p(\hat\alpha, \cdot)\|^2 \right) + \frac{1}{n} \sum_{i=1}^n \left( f(p(\alpha, x_i), x_i, y_i) - f(p(\hat\alpha, x_i), x_i, y_i) \right) \ge 0.

That is, p(\hat\alpha, ·) achieves the minimum of equation B.1. Since the above steps can be reversed when the Gram matrix G(X_n) is nonsingular, the converse is also true.

Appendix C: Proof of Lemma 1

For notational simplicity, we assume k = n. Using equation B.2, we have for all i ≤ n − 1:

    -\nabla_1 g(-\hat\alpha_i, x_i, y_i)(\hat\alpha_i^{[k]} - \hat\alpha_i) + \sum_{j=1}^n \hat\alpha_j K(x_i, x_j)(\hat\alpha_i^{[k]} - \hat\alpha_i) = 0.

By the definition of subgradient, we have

    -\nabla_1 g(-\hat\alpha_i, x_i, y_i)(\hat\alpha_i^{[k]} - \hat\alpha_i) \le g(-\hat\alpha_i^{[k]}, x_i, y_i) - g(-\hat\alpha_i, x_i, y_i),
which can now be equivalently written as

    g(-\hat\alpha_i, x_i, y_i) - \sum_{j=1}^n \hat\alpha_j K(x_i, x_j)(\hat\alpha_i^{[k]} - \hat\alpha_i) \le g(-\hat\alpha_i^{[k]}, x_i, y_i).

Summing over i, we have

    \sum_{i=1}^{n-1} g(-\hat\alpha_i, x_i, y_i) - \sum_{i=1}^{n-1} \sum_{j=1}^{n} \hat\alpha_j K(x_i, x_j)(\hat\alpha_i^{[k]} - \hat\alpha_i) + \frac{1}{2} \sum_{i=1}^{n-1} \sum_{j=1}^{n-1} \hat\alpha_i^{[k]} \hat\alpha_j^{[k]} K(x_i, x_j)
        \le \sum_{i=1}^{n-1} g(-\hat\alpha_i^{[k]}, x_i, y_i) + \frac{1}{2} \sum_{i=1}^{n-1} \sum_{j=1}^{n-1} \hat\alpha_i^{[k]} \hat\alpha_j^{[k]} K(x_i, x_j)
        \le \sum_{i=1}^{n-1} g(-\hat\alpha_i, x_i, y_i) + \frac{1}{2} \sum_{i=1}^{n-1} \sum_{j=1}^{n-1} \hat\alpha_i \hat\alpha_j K(x_i, x_j).

The second inequality above follows from the definition of \hat\alpha^{[k]}. Rearranging the above inequality, and denoting \hat\alpha_n^{[k]} = 0, we obtain

    \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(x_i, x_j)(\hat\alpha_i - \hat\alpha_i^{[k]})(\hat\alpha_j - \hat\alpha_j^{[k]}) \le \frac{1}{2} \hat\alpha_n^2 K(x_n, x_n).
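This stability inequality can be checked directly for regularized least squares, where the dual solutions have a closed form: assuming the squared loss f(p, x, y) = c_n (p − y)²/2, equation B.3 reduces to (K + (n/c_n) I) \hat\alpha = y, and the leave-one-out solution solves the same system on the first n − 1 points. The sketch below is my own illustration with hypothetical data, not from the article:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 8
X = rng.standard_normal((n, 2))
y = rng.standard_normal(n)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)                                  # gaussian kernel Gram matrix

c_n = 5.0                                        # regularization parameter (hypothetical)
lam = n / c_n

# Full solution: (K + lam I) alpha = y.
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Leave-one-out solution with k = n: same system on the first n-1 points,
# extended with alpha^{[k]}_n = 0.
alpha_loo = np.zeros(n)
alpha_loo[:-1] = np.linalg.solve(K[:-1, :-1] + lam * np.eye(n - 1), y[:-1])

d = alpha - alpha_loo
lhs = 0.5 * d @ K @ d                  # (1/2) sum_ij K_ij (a_i - a_i^[k])(a_j - a_j^[k])
rhs = 0.5 * alpha[-1] ** 2 * K[-1, -1] # (1/2) alpha_n^2 K(x_n, x_n)
```

The left-hand side stays below the right-hand side, often with near equality, which reflects how tight the lemma is for quadratic losses.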
Appendix D: Variance-Style Concentration Inequality

We prove lemma 2.

Proof. Given {z_1, …, z_n}, we denote by E_i the expectation with respect to the variables z_{i+1}, …, z_n, conditioned on z_1, …, z_i. Now, we have

    Z = E_0 Z + \sum_{i=1}^n (E_i Z - E_{i-1} Z). \qquad (D.1)
Let Z_i = E_i Z − E_{i−1} Z. Then Z_i is a function of {z_1, …, z_i} and E_{i−1} Z_i = 0. Obviously, for j < i, we have

    E[Z_i Z_j] = E[E_{i-1}(Z_i Z_j)] = E[Z_j\, E_{i-1} Z_i] = 0.
This implies that E[\sum_{i=1}^n Z_i]^2 = E \sum_{i=1}^n Z_i^2. Note that by definition E_0 Z = EZ. We thus have from equation D.1 that

    \mathrm{Var}(Z) = E \left[ \sum_{i=1}^n Z_i \right]^2 = E \sum_{i=1}^n (E_i Z - E_{i-1} Z)^2.
Now use E_{z_k} to denote the expectation over z_k, conditioned on all other variables. We obtain from Jensen's inequality:

    E(E_i Z - E_{i-1} Z)^2 \le E\, E_i (Z - E_{z_i} Z)^2 \le E\left[ E_{z_i}(Z - E_{z_i} Z)^2 + (Z^{(i)} - E_{z_i} Z)^2 \right] = E(Z - Z^{(i)})^2.

This shows that

    \mathrm{Var}(Z) = E \sum_{i=1}^n (E_i Z - E_{i-1} Z)^2 \le E \sum_{i=1}^n (Z - Z^{(i)})^2.
References

Birgé, L., & Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields, 97(1–2), 113–150.

Bousquet, O., & Elisseeff, A. (2002). Stability and generalization. Journal of Machine Learning Research, 2, 499–526.

Cucker, F., & Smale, S. (2002). On the mathematical foundations of learning. Bulletin (New Series) of the American Mathematical Society, 39(1), 1–49.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.

Efron, B., & Stein, C. (1981). The jackknife estimate of variance. Ann. Statist., 9(3), 586–596.

Evgeniou, T., Pontil, M., & Poggio, T. (2000). Regularization networks and support vector machines. Advances in Computational Mathematics, 13(1), 1–50.

Forster, J., & Warmuth, M. (2000). Relative expected instantaneous loss bounds. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (pp. 90–99). San Mateo, CA: Morgan Kaufmann.

Jaakkola, T., & Haussler, D. (1999). Probabilistic kernel regression models. In Proceedings of the 1999 Conference on AI and Statistics. San Mateo, CA: Morgan Kaufmann.

Joachims, T. (2000). Estimating the generalization performance of an SVM efficiently. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML 00) (pp. 431–438). San Mateo, CA: Morgan Kaufmann.

Kearns, M., & Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6), 1427–1453.
Kutin, S., & Niyogi, P. (2002). Almost-everywhere algorithmic stability and generalization error (Tech. Rep. TR-2002-03). Chicago: University of Chicago.

Mhaskar, H. N., Narcowich, F. J., & Ward, J. D. (1999). Approximation properties of zonal function networks using scattered data on the sphere. Adv. Comput. Math., 11(2–3), 121–137.

Narcowich, F. J., Sivakumar, N., & Ward, J. D. (1998). Stability results for scattered-data interpolation on Euclidean spheres. Adv. Comput. Math., 8(3), 137–163.

Rockafellar, R. T. (1970). Convex analysis. Princeton, NJ: Princeton University Press.

Steele, J. M. (1986). An Efron–Stein inequality for nonsymmetric statistics. Ann. Statist., 14(2), 753–758.

van de Geer, S. (2000). Empirical processes in M-estimation. Cambridge: Cambridge University Press.

van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes. New York: Springer-Verlag.

Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.

Wendland, H. (1998). Error estimates for interpolation by compactly supported radial basis functions of minimal degree. J. Approx. Theory, 93(2), 258–272.

Yang, Y., & Barron, A. (1999). Information-theoretic determination of minimax rates of convergence. Annals of Statistics, 27, 1564–1599.

Zhang, T. (2001). A leave-one-out cross validation bound for kernel methods with applications in learning. In 14th Annual Conference on Computational Learning Theory (pp. 427–443). Berlin: Springer-Verlag.

Zhang, T. (2002a). Generalization performance of some learning problems in Hilbert functional spaces. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.

Zhang, T. (2002b). On the dual formulation of regularized linear systems. Machine Learning, 46, 91–129.

Zhang, T. (in press). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics.

Received April 4, 2002; accepted December 2, 2002.
ARTICLE
Communicated by Peter Latham
Background Synaptic Activity as a Switch Between Dynamical States in a Network Emilio Salinas
[email protected] Department of Neurobiology and Anatomy, Wake Forest University School of Medicine, Winston-Salem, NC 27157-1010, U.S.A.
A bright red light may trigger a sudden motor action in a driver crossing an intersection: stepping at once on the brakes. The same red light, however, may be entirely inconsequential if it appears, say, inside a movie theater. Clearly, context determines whether a particular stimulus will trigger a motor response, but what is the neural correlate of this? How does the nervous system enable or disable whole networks so that they are responsive or not to a given sensory signal? Using theoretical models and computer simulations, I show that networks of neurons have a built-in capacity to switch between two types of dynamic state: one in which activity is low and approximately equal for all units, and another in which different activity distributions are possible and may even change dynamically. This property allows whole circuits to be turned on or off by weak, unstructured inputs. These results are illustrated using networks of integrate-and-fire neurons with diverse architectures. In agreement with the analytic calculations, a uniform background input may determine whether a random network has one or two stable firing levels; it may give rise to randomly alternating firing episodes in a circuit with reciprocal inhibition; and it may regulate the capacity of a center-surround circuit to produce either self-sustained activity or traveling waves. Thus, the functional properties of a network may be drastically modified by a simple, weak signal. This mechanism works as long as the network is able to exhibit stable firing states, or attractors.

1 Introduction

The mammalian brain performs remarkable computational tasks, such as recognizing complex stimuli of diverse modalities, remembering sensory data in a variety of formats, and generating complex motor sequences (Churchland & Sejnowski, 1992). However, the underlying circuits are not always activated.
When we need to remember a phone number announced on the radio, some neural process in the brain enables our short-term memory; without the context information indicating that it is important to store it, the same number vanishes promptly from the mind. In the motor realm,

Neural Computation 15, 1439–1475 (2003) © 2003 Massachusetts Institute of Technology
1440
E. Salinas
this is a classic problem that neurophysiologists have studied using go versus no-go paradigms (see, e.g., Schultz & Romo, 1992; Lauwereyns et al., 2001; Sommer & Wurtz, 2001). In these tasks, a subject performs one of several possible motor actions in response to a presented stimulus, but the movement is contingent on a separate cue giving either a go or a no-go instruction. Neural activity evoked during these tasks has been extensively documented, yet little is known in mechanistic terms about how the go signal gates the interaction between the sensory information that remains fixed and the motor apparatus. This appears to be a switching problem, or a traffic control problem.

Single neurons recorded from behaving monkeys may reflect this switching by displaying quick, dramatic changes in their properties. For instance, during a task in which a stimulus may trigger either a saccade or an antisaccade (a saccade away from the stimulus), some neurons in area LIP fire as if their visual receptive fields were being remapped—that is, as if their sensory receptive fields moved as functions of task contingencies (Zhang & Barash, 2000; see also Duhamel, Colby, & Goldberg, 1992). Another study demonstrated that LIP neurons show color selectivity only when color is a relevant cue that guides visuomotor behavior; otherwise, this feature is ignored (Toth & Assad, 2002). Clearly, the defining characteristics of a neuron (e.g., receptive field location, selectivity) may change radically and fast.

The problem of determining how these switching processes occur is complicated by the fact that they probably proceed according to a subtle internal schedule. This is suggested by experiments with intracortical microstimulation applied during performance of a motion discrimination task (Seidemann, Zohary, & Newsome, 1998).
The neural activity evoked by artificial stimulation, which would normally have a strong impact on the perceptual decision, has no effect when it is produced slightly early or slightly late relative to the normal sequence of events in the task, as if the communication between sensory and motor networks were established only during a certain time window. Hence, communication channels between neurons also seem to open and close fast.

The types of switching processes just discussed must be extremely common (Duhamel et al., 1992; Weimann & Marder, 1994; Linden, Grunewald, & Andersen, 1999; Zhang & Barash, 2000; Terman, Rubin, Yew, & Wilson, 2002; Toth & Assad, 2002), yet they are mostly unknown. Here I study a mechanism that may underlie some of the dramatic functional changes exhibited by neurons. I show that under very general conditions, neural networks are highly sensitive to the baseline level of excitatory drive, which, in contrast to sensory-triggered activity, may be weak and uniform across a population. Small changes in this background input level may shift a network from a relatively quiet state to some other state with highly complex dynamics. Thus, the background may act as a switch that allows networks to be turned on or off regardless of their function. This phenomenon is illustrated with a variety of network architectures using both a relatively abstract theoretical
Background Activity as a Network Switch
1441
model and more realistic computer simulations based on integrate-and-fire neurons.

2 Theoretical Methods: The Network Equations

For the analysis, I consider the response of a neuron as the mean rate at which it produces action potentials, or spikes. This response is described as a function of the activity of other neurons that provide synaptic input. Specifically, the firing rate r_i of neuron i is determined by

    \tau \frac{dr_i}{dt} = -r_i + \frac{\left( \sum_j w_{ij} r_j + h_i \right)^2}{s + v \sum_j r_j^2}, \qquad (2.1)
where τ is a time constant, s is a saturation constant, and h_i represents an excitatory external input whose value is independent of the network's activity. This is the background component. Here, a neighboring neuron j can either directly drive neuron i to fire through synaptic connection w_{ij} or decrease its gain through a synaptic connection of strength v. For simplicity, all gain interactions have equal strength, but this need not be the case (see appendix D). For simplicity also, there is a single population of neurons rather than separate populations of excitatory and inhibitory units. Notice that all quantities, including the synaptic weights, are always positive or zero.

The above equations are similar to so-called divisive normalization models, in which the response of a neuron is divided by a factor related to the activity of others (Wilson & Humanski, 1993; Carandini & Heeger, 1994; Carandini, Heeger, & Movshon, 1997; Simoncelli & Heeger, 1998; Wilson, 1998; Schwartz & Simoncelli, 2001). This formulation may be interpreted as a simplified model in which inhibition acts in a purely divisive way rather than in the more traditional subtractive way (Hertz, Krogh, & Palmer, 1991; Dayan & Abbott, 2001). This form of gain control provides a good empirical description of a variety of response nonlinearities observed in visual cortical neurons (Wilson & Humanski, 1993; Carandini & Heeger, 1994; Carandini et al., 1997; Simoncelli & Heeger, 1998; Wilson, 1998; Schwartz & Simoncelli, 2001). It is also consistent with numerous experiments showing that the responses of many neurons can be described as a multiplication of two terms that depend on different types of input signals (Salinas & Thier, 2000; Salinas & Abbott, 2001; Salinas & Sejnowski, 2001). This mechanism is also appealing from a theoretical point of view because it gives rise to more efficient, less redundant neural codes (Deneve, Latham, & Pouget, 1999; Brady & Field, 2000; Schwartz & Simoncelli, 2001).
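Equation 2.1 can be integrated directly with a forward Euler scheme. The sketch below is my own illustration (the parameter values are arbitrary and not those used for any figure in the article); with weak recurrent excitation and the divisive term active, the rates settle into a low, stable asynchronous-like fixed point:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
tau, dt = 10.0, 0.1               # time constant and Euler step (ms)
s_sat, v = 10.0, 0.1              # saturation constant and divisive (gain) coupling
w = 0.002 * rng.random((N, N))    # weak recurrent excitation, w_ij >= 0
np.fill_diagonal(w, 0.0)
h = 1.0 + 0.2 * rng.random(N)     # tonic background input, varies across neurons

r = np.zeros(N)
for _ in range(20000):            # 2 s of simulated time
    drive = (w @ r + h) ** 2 / (s_sat + v * np.sum(r ** 2))
    r += (dt / tau) * (-r + drive)

# Distance from a fixed point of equation 2.1.
drive = (w @ r + h) ** 2 / (s_sat + v * np.sum(r ** 2))
residual = np.max(np.abs(-r + drive))
```

Raising h in this sketch shifts the fixed point upward, which is the sensitivity to background drive that the rest of the article exploits.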
For some of the theoretical derivations that follow, it is useful to rewrite equation 2.1 in an essentially equivalent form that involves continuous functions rather than vectors and matrices. The key is to consider a relatively
large number of neurons, so that sums over cells can be replaced by integrals (Dayan & Abbott, 2001). In this case, the vector of firing rates turns into a function r(a), where a acts as the neuron's index or label; similarly, the connection matrix turns into a function w(a, b) of two variables. The continuous version of the network equations is thus

    \tau \frac{dr(a)}{dt} = -r(a) + \frac{\left( \rho \int w(a, b)\, r(b)\, db + h(a) \right)^2}{s + v \rho \int r^2(b)\, db}, \qquad (2.2)

where ρ is the density of neurons. This factor ensures that for any quantity x, the sum over neurons \sum_j x_j is equal to the integral \rho \int x(b)\, db. Equations 2.1 and 2.2 will be used interchangeably, as they are basically identical. As mentioned above, in these expressions, w and v correspond to excitatory and inhibitory connections, respectively, and h is the background input. In the rest of the article, I explore the dynamics of this model for various connection patterns w. I show that these equations typically have several stable solutions that coexist and that small changes in h can switch the network from one to another. The results are compared to the behavior of networks of spiking neurons with similar connectivities.

3 Simulation Methods

3.1 Spiking Network Implementation. All spiking networks consist of interconnected integrate-and-fire neurons (Dayan & Abbott, 2001; Troyer & Miller, 1997; Salinas & Sejnowski, 2000). There are N_E excitatory and N_I inhibitory units, and in all cases N_E/N_I = 4. In these models, the voltage of neuron i evolves according to

    \tau_m \frac{dV_i}{dt} = -V_i - \sum_j^{N_E} g^E_{ij} (V_i - V_E) - \sum_j^{N_I} g^I_{ij} (V_i - V_I) - g^{BE}_i (V_i - V_E) - g^{BI}_i (V_i - V_I), \qquad (3.1)

where τ_m is the membrane time constant of the cell. The resting potential is set at 0 mV, so all voltages are relative to rest. All conductances are measured in units of the leak conductance (leak conductance equals 1). The first term on the right is the leak current, which is simply V_i in these units. The second and third terms are the total currents contributed by excitatory and inhibitory cells in the network, respectively. The last two terms represent the background input, which has excitatory and inhibitory components too. Neuron i produces a spike when V_i exceeds a threshold V_θ. After this, V_i is reset to an initial value V_reset and, after an absolute refractory period t_ref, the integration process continues according to equation 3.1. For excitatory neurons, τ_m = 20 ms and t_ref = 1.8 ms. For inhibitory neurons, τ_m = 12 ms and t_ref = 1.2 ms. For all neurons, V_E = 74, V_I = −10, V_θ = 20, and V_reset = 10 mV.
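A single unit obeying equation 3.1 can be sketched in a few lines. Here the recurrent terms are omitted and the background conductances are held at constant values; the specific values g_be = 0.6 and g_bi = 0.15 are my own illustrative choices (not the article's parameters), picked so that the effective equilibrium potential sits above threshold and the cell fires regularly:

```python
tau_m, dt = 20.0, 0.1            # membrane time constant and Euler step (ms)
V_E, V_I = 74.0, -10.0           # reversal potentials (mV, relative to rest)
V_theta, V_reset, t_ref = 20.0, 10.0, 1.8
g_be, g_bi = 0.6, 0.15           # constant background conductances (leak units; illustrative)

V, refractory, spikes = 0.0, 0.0, []
t = 0.0
while t < 300.0:                 # simulate 300 ms
    if refractory > 0.0:
        refractory -= dt         # voltage frozen during the absolute refractory period
    else:
        dV = (-V - g_be * (V - V_E) - g_bi * (V - V_I)) / tau_m
        V += dt * dV
        if V > V_theta:          # threshold crossing: record spike, reset
            spikes.append(t)
            V, refractory = V_reset, t_ref
    t += dt
```

Lowering g_be so that the equilibrium potential falls below V_θ silences the same unit, which is exactly the kind of sensitivity to background drive studied in the article.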
In the above expression, g^E_{ij} is the conductance at the synapse from excitatory neuron j to neuron i. This conductance increases instantaneously when neuron j produces a spike and decreases exponentially thereafter. The update rule following a spike is g^E_{ij} → g^E_{ij} + W^E_{ij}. Here, the size of the increase is the connection strength, which depends on the connectivity profile—equations 7.1 or 8.1 below, for example. A similar update rule applies to inhibitory synapses. Exponential decay of all recurrent excitatory and inhibitory conductances occurs with synaptic time constants τ_E and τ_I, respectively. Self-connections are eliminated in all simulations.

The conductance g^{BE}_i is determined by the total rate of background input spikes r_BE, the synaptic strength of background synapses W_BE, and their corresponding time constant τ_BE. Poisson spike trains can be used to drive these conductances using the same procedure just described for recurrent synapses. However, since the parameters are constant, a series of random numbers may be combined instead, such that g^{BE}_i as a function of time has the same statistics as the conductance generated by a barrage of input spikes. These statistics are the mean,

    \mu_{BE} = r_{BE}\, \tau_{BE}\, W_{BE}, \qquad (3.2)

the variance,

    \sigma_{BE}^2 = \frac{r_{BE}\, \tau_{BE}\, W_{BE}^2}{2}, \qquad (3.3)

and the correlation time, which is equal to τ_BE. Thus, the recipe to update the background conductances at each time step is (Salinas & Sejnowski, 2002)

    g^{BE}_i \to \epsilon\, g^{BE}_i + \mu_{BE} (1 - \epsilon) + \sigma_{BE} \sqrt{1 - \epsilon^2}\, \gamma, \qquad (3.4)
where ° is a gaussian random number with zero mean and unit variance, ² ´ exp.¡1t=¿BE /, and 1t is the integration time step. This procedure is more efcient than generating spikes, and the approximation is excellent whenever rBE ¿BE > 2. A similar rule is used for gBI i . In general, to avoid bursts of synchronous activity, the synaptic time constants of excitation need to be different from those of inhibition. This can typically be achieved by adjusting either the recurrent parameters ¿E and ¿I or the background parameters ¿BE and ¿BI . All results can be replicated qualitatively with diverse combinations of background input parameters; for instance, with different balance between excitation and inhibition or different levels of noise. The integration time step is 0.1 ms. Matlab (The Mathworks Inc., Natick, MA) code of the simulations is available from the author upon request.
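The recipe of equations 3.2 through 3.4 translates directly into code. The sketch below uses the Figure 2a background parameters from section 3.2 and assumes rates are expressed in spikes per millisecond so that units match the time constants given in ms:

```python
import numpy as np

rng = np.random.default_rng(0)

def update_g_BE(g, dt=0.1, r_BE=6350.0 / 1000.0, tau_BE=3.0, W_BE=0.02):
    """One step of the background-conductance update, equation 3.4."""
    mu = r_BE * tau_BE * W_BE                        # mean, equation 3.2
    sigma = np.sqrt(r_BE * tau_BE * W_BE**2 / 2.0)   # std dev, equation 3.3
    eps = np.exp(-dt / tau_BE)
    return eps * g + mu * (1.0 - eps) + sigma * np.sqrt(1.0 - eps**2) * rng.standard_normal()

# Iterate and collect a trace; its mean and standard deviation should
# match equations 3.2 and 3.3 (mu = 0.381, sigma ~ 0.062 here).
g, trace = 0.0, []
for _ in range(200000):
    g = update_g_BE(g)
    trace.append(g)
trace = np.array(trace[1000:])   # discard the initial transient
```

Note that r_BE τ_BE ≈ 19 here, comfortably above the r_BE τ_BE > 2 accuracy criterion quoted in the text.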
E. Salinas
3.2 Parameters for Figure 2a. N_E = 300. For all neurons, W_BE = 0.02, W_BI = 0.06, and W^I_ij ∈ [0, 0.002], which means that W^I_ij is a random number distributed uniformly between 0 and 0.002; τ_BE = 3, τ_BI = 6, τ_E = 10, τ_I = 3 ms. For excitatory neurons, W^E_ij ∈ [0, 0.0024]; r_BE = 6350 and r_BI = 420 spikes per second. During transient activity, r_BE increases by 660 spikes per second. The quantity G_EE is the total conductance in excitatory neurons generated by recurrent excitatory synapses (see Figure 2). Here it equals N_E times the average value of W^E_ij, which gives 0.36. Similarly, G_IE is the total conductance in excitatory neurons generated by recurrent inhibitory synapses and is equal to N_I times the average value of W^I_ij, which gives 0.075. In Figures 2c and 2d, G_IE = 0.075; in Figures 2e and 2f, G_EE = 0.36. The parameters in Figures 2d and 2f are the same as in Figures 2c and 2e, respectively, except that r_BE = 6670 spikes per second. For inhibitory neurons, W^E_ij ∈ [0, 0.0023]; r_BE = 6200 and r_BI = 440 spikes per second.

3.3 Parameters for Figure 4a. N_E = 150 in each of two subnetworks. For all neurons, W_BE = 0.02, W_BI = 0.06; τ_BE = 3, τ_BI = 3, τ_E = 5, τ_I = 3 ms. For excitatory neurons within a subnetwork, W^E_ij ∈ [0, 0.010], W^I_ij ∈ [0, 0.023]; r_BE = 7870 and r_BI = 1830 spikes per second; during increased background activity, r_BE = 8140 spikes per second. For inhibitory neurons within a subnetwork, W^E_ij ∈ [0, 0.011], W^I_ij ∈ [0, 0.008]; r_BE = 7420 and r_BI = 1890 spikes per second. In addition, excitatory neurons in one subnetwork are connected to inhibitory neurons in the other, also with strengths in the interval [0, 0.011].

3.4 Parameters for Figures 5 and 6. N_E = 800; excitatory and inhibitory neurons are labeled uniformly between −10 and 10, and periodic boundary conditions are used. The label of neuron i is x_i. For all neurons, W_BE = 0.02, W_BI = 0.06; τ_BE = 3, τ_BI = 3, τ_E = 10, τ_I = 3 ms.
For excitatory neurons, W^E_ij = 0.0053 exp(−0.5(x_i − x_j − Δ)²/(0.8)²), where Δ = 0 for Figure 5 and Δ = 0.15 for Figure 6, and W^I_ij = 0.019 exp(−0.5(x_i − x_j)²); r_BE = 7770 and r_BI = 1830 spikes per second; during increased background activity, r_BE = 7880 spikes per second. For inhibitory neurons, W^E_ij = 0.0023[0.3 + 0.7 exp(−0.5(x_i − x_j)²)], W^I_ij = 0.011 exp(−0.5(x_i − x_j)²); r_BE = 7420 and r_BI = 1890 spikes per second. In this configuration, the inhibitory connections are not constant, as implied by equation 2.2. This is done to show that uniform inhibition, which was implemented by Compte, Brunel, Goldman-Rakic, and Wang (2000), is not an absolute requirement; localized inhibition leads to the same qualitative results. The inhibitory structure modifies the width and amplitude of the activity profile, but it is still approximately gaussian (see appendix D).
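The section 3.4 excitatory-to-excitatory profile can be built directly as a weight matrix. A minimal sketch, assuming the labels are spaced evenly over [−10, 10) and that distances wrap around the periodic boundary (the function name is mine):

```python
import numpy as np

def gaussian_E_to_E_weights(NE=800, delta=0.0):
    """W^E_ij = 0.0053 * exp(-0.5 * (x_i - x_j - delta)^2 / 0.8^2) for
    neurons labeled on [-10, 10) with periodic boundaries
    (delta = 0 for Figure 5, delta = 0.15 for Figure 6)."""
    x = np.linspace(-10.0, 10.0, NE, endpoint=False)
    d = x[:, None] - x[None, :] - delta
    d = (d + 10.0) % 20.0 - 10.0          # wrap to the periodic domain
    W = 0.0053 * np.exp(-0.5 * d**2 / 0.8**2)
    np.fill_diagonal(W, 0.0)              # self-connections are eliminated
    return x, W
```

The inhibitory profiles of this section have the same form with different amplitudes and unit width, so the same helper applies with the constants changed.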
4 Neurons Firing at Equal Rates

A key property of equation 2.2 is that it allows a solution in which all firing rates are equal; refer to this uniform rate as R. If the firing rate is the same for all neurons, the network equations reduce to a single, nonlinear equation,

    τ dR/dt = −R + (w_tot R + h)² / (s + vNR²),    (4.1)

where N is the total number of neurons; this appears because ρ∫db = N. Four conditions are necessary for this uniform solution to be possible. First, the background input should be the same for all units. This is why the term h(a) now appears as a constant h. Second, the quantity

    w_tot ≡ ρ ∫ w(a, b) db    (4.2)

must also be independent of a. Because w(a, b) corresponds to the synaptic weight from neuron b to neuron a, this means that the total synaptic input to all neurons must be the same. Notice that this is a normalization condition; it does not restrict the distribution of synaptic weights, just the total amount per cell. Third, equation 4.1 must have a stable steady-state solution, so that R tends to a fixed, finite value. To find the steady-state points, set the derivative dR/dt to zero in equation 4.1 and solve for R. For a resulting steady-state point R_ss to be stable, the uniform rate must tend to return to it after a small perturbation; the condition for this to happen is

    2 w_tot (w_tot R_ss + h) < s + 3vN R_ss².    (4.3)
The fourth condition concerns the stability of the full system, not only of equation 4.1. What happens when all rates are near the steady state but are perturbed by independent amounts? For the system to be stable, each of the deviations must decrease with time. There is no general rule to determine the stability of the uniform rate without specifying the connections w. Individual conditions can be calculated analytically for some particular cases. For instance, when all synaptic weights w_ij are identical, condition 4.3 also guarantees that small differences between the rates r_i and the steady state R_ss will vanish over time (see appendix A). In general, however, the stability of the uniform rate may depend on the distribution of synaptic weights, so condition 4.3 is necessary but may not be sufficient. The dynamics of equation 4.1 can be visualized by plotting the derivative dR/dt against R, as in Figure 1a. The derivative becomes zero when the curve crosses the horizontal line; those crossing points are steady-state solutions
Figure 1: Uniform rate solution for a recurrent neural network. All units may fire at the rate R if they have the same background h and total recurrent synaptic input w_tot. (a) Time derivative of R versus R, for three values of w_tot. Points that intersect the horizontal line are steady-state values. Negative and positive slopes indicate stable and unstable points, respectively. With w_tot = 0.98 (left), there is a single low-rate solution that is stable. With w_tot = 1.12 (middle), there are two stable solutions and an unstable one. With w_tot = 1.29 (right), there is a single high-rate solution that is stable. (b) Steady-state values of R as functions of w_tot. Solid and dashed lines indicate stable and unstable solutions, respectively. Bistability arises at intermediate values of w_tot. Background input is h = 7.9 and vN = 0.0136; the same holds for the panels above. (c) As in b but with h = 14. (d) Steady-state values of R as functions of inhibitory synaptic strength u, where vN = 0.0105(1 + (u + 0.15)²). Here h = 7.9 and w_tot = 1.1. (e) As in d but with h = 14. All curves were obtained from equation 4.1 with τ = 1 and s = 40. Firing rate is in spikes per second.
to the equation. Zero points with positive and negative slopes correspond to unstable and stable R_ss values, respectively. Depending on the parameters, there may be one or two stable points. Hence, there may be either one or two possible firing levels such that all neurons are activated with the same intensity. The case in which the background input h is zero is special: then the solution in which all rates are equal to zero (R_ss = 0) is always possible, and it is always stable (see appendix A). Conversely, for the lower R_ss value to be above zero, the background must be above zero as well. This is in agreement with a theoretical study (Latham, Richmond, Nelson, & Nirenberg, 2000) that focused specifically on the conditions under which a neural network exhibits a nonzero firing rate that is stable, low, and approximately uniform. That study found that a minimum amount of intrinsic activity is needed; that is, some neurons must be active regardless of the recurrent synaptic input. In the present model, this corresponds to h > 0. Figure 1b shows the steady-state values R_ss as functions of the total strength of excitatory connections. Stable and unstable points are indicated by continuous and broken lines, respectively. When the neurons are disconnected, they fire at a low rate close to h²/s, which in Figure 1b is 1.6 spikes per second. The uniform rate R_ss increases slowly as the recurrent connections become stronger, but at a certain point, two additional solutions appear, one stable and one unstable. In this bistable regime, R can settle at either a low or a high value; which one is reached depends on how the rates in the network are initialized. As the strength of the connections is increased further, the low-rate solution disappears, and only a uniform, high rate is possible; its value is approximately w_tot²/(vN). Figure 1c is similar to Figure 1b; the only difference is that the background input h is higher.
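The steady-state structure of equation 4.1 can be checked numerically. Setting dR/dt = 0 turns equation 4.1 into the cubic vN R³ − w_tot² R² + (s − 2 w_tot h) R − h² = 0, and condition 4.3 classifies each root. A minimal sketch, with function names of my choosing and the parameter values quoted in the Figure 1 caption:

```python
import numpy as np

def steady_states(w_tot, h, s=40.0, vN=0.0136):
    """Nonnegative real roots of -R + (w_tot*R + h)^2 / (s + vN*R^2) = 0,
    the fixed points of equation 4.1, via the equivalent cubic."""
    coeffs = [vN, -w_tot**2, s - 2.0 * w_tot * h, -h**2]
    roots = np.roots(coeffs)
    return sorted(r.real for r in roots if abs(r.imag) < 1e-9 and r.real >= 0)

def is_stable(R_ss, w_tot, h, s=40.0, vN=0.0136):
    """Stability condition 4.3."""
    return 2.0 * w_tot * (w_tot * R_ss + h) < s + 3.0 * vN * R_ss**2

print(steady_states(0.98, 7.9))   # one low-rate solution (stable)
print(steady_states(1.12, 7.9))   # three solutions: stable, unstable, stable
```

Sweeping w_tot with these two helpers reproduces the fold structure of Figure 1b.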
Comparison between the two figures shows that an increase in background can eliminate the low-activity state, switching the network between regimes with one or two stable uniform activity levels. Notice that this happens for the full range of connection strengths tested. Figures 1d and 1e show that this background effect is also observed as the strength of the inhibitory connections varies. The generality of these results should be underscored: the existence of a uniform rate applies to any network described by equation 2.1 or 2.2, regardless of its architecture, as long as it satisfies certain constraints. The crucial conditions involve homogeneity; the background input and the total sum of incoming synaptic weights should be the same for all neurons. Shifting the background level affects the existence and stability of these uniform-rate solutions. Now the question is whether these analytic results apply to more realistic networks. This is investigated next.

5 Bistability in Unstructured Spiking Networks

The first specific network architecture to consider is a simple one in which the conditions for a uniform rate are always met. This occurs when the background input and synaptic weights are identical for all units. In this case, equation 2.2 always reduces to the uniform rate equation, equation 4.1. In other words, in this arrangement, the neurons can do nothing but fire all at the same rate. An alternative way to generate the same behavior is to use random connections drawn from a given distribution. Strictly speaking, this is not equivalent to identical weights, as explained in appendix A. But based on simulation experiments, the results are typically very similar with either uniform or gaussian weight distributions. Figure 2 illustrates the behavior of a model network composed of integrate-and-fire neurons (Troyer & Miller, 1997; Salinas & Sejnowski, 2000; Dayan & Abbott, 2001) that produce spikes (see section 2). The connectivity was all-to-all and random, but the total synaptic weights were approximately the same for all neurons. The background inputs fluctuated, but averaged over time they were equal too. Figures 2a and 2b demonstrate the bistable regime in which the mean rate can switch between low and high values (see Amit & Brunel, 1997). Initially, all neurons fire at a low rate around 3 spikes per second. All rates increase after a transient excitatory input lasting 100 ms, and activity stays high thereafter. Note that the firing rates are indeed approximately the same for all units. Figures 2c through 2f show the steady-state rate R_ss at which the spiking neurons are seen to fire, as a function of the strength of excitatory (see Figures 2c and 2d) or inhibitory (see Figures 2e and 2f) coupling. These figures are analogous to Figures 1b through 1e. To compare them, the total synaptic weight w_tot of the rate model was equated to the total conductance due to recurrent excitatory synapses in the spiking model. Similarly, the gain weight Nv was equated to a monotonic function of the total conductance due to recurrent inhibitory synapses (see Figure 1).
Qualitatively, the steady-state firing rates vary similarly in the rate- and spike-based models. The main difference is that in the latter, the discreteness of spikes acts as a source of noise. Fluctuations due to spiking may vary depending on model parameters. In particular, bistability is more robust when bursts of synchronous activity are minimized; otherwise, the network may oscillate or jump randomly between the low and high rates (Brunel, 2000). With this caveat, however, equation 4.1 does capture the underlying dynamics of the more realistic network. As found in the theoretical model, an increase in background activity can eliminate the low-rate steady state. The examples that follow show that a similar switch effect is also observed when a network has additional solutions or behaviors apart from a uniform rate.

6 Rivaling Steady States

In this section, I discuss a network with three possible stable steady states. Here, the background serves as a switch that turns on a different type of dynamics in which two populations fire in random alternation.
Figure 2: Bistability in a spiking network with 300 excitatory and 75 inhibitory model neurons. Connections were unstructured (see section 2). (a) Raster plot showing spike trains from 30 excitatory neurons chosen randomly. Each row corresponds to one neuron and each tick mark to one spike. Time runs along the x-axis, as in the panel below. The horizontal bar indicates a transient increase in background excitatory input. This causes the neurons to switch from a low to a high firing rate, which is sustained. (b) Mean firing rate, averaged over all neurons, as a function of time. Thick and thin curves are for excitatory and inhibitory populations, respectively. (c) Steady-state values of the average firing rate as functions of excitatory strength G_EE, equal to the total excitatory conductance in excitatory cells due to recurrent connections. Filled circles show the low and high steady-state rates measured as in a, before and after a transient input. Each point is based on 10 s of simulated sustained activity. Open symbols represent maximum firing rates reached in separate runs in which weaker transient inputs failed to switch the network to the high state. (d) As in c, but with a slightly larger background excitatory input to all excitatory cells. (e) Steady-state values of the average firing rate as functions of inhibitory strength G_IE, equal to the total inhibitory conductance in excitatory cells due to recurrent synapses. (f) As in e, but with a slightly larger background excitatory input to all excitatory cells.
Consider two identical, unstructured subnetworks that inhibit each other. In the framework of equation 2.1, this means that they decrease each other's gain. As above, unstructured means with constant or random synaptic weights, so each subnetwork is described by a uniform rate governed by equation 4.1. Thus, there are two populations of N neurons that fire at uniform rates R1 and R2 given by

    τ dR1/dt = −R1 + (w_tot R1 + h1)² / (s + vN(R1² + R2²))

    τ dR2/dt = −R2 + (w_tot R2 + h2)² / (s + vN(R1² + R2²)).    (6.1)

When h1 = h2 = h, this system also has a uniform rate solution for which R1 = R2 = R_ss. However, the condition for stability is now

    w_tot R_ss < h    (6.2)
(see appendix A). If one of the rates, say R2, is held constant, the dynamics of the other becomes identical to the case discussed in the previous section. This is shown in Figures 3a and 3b, which plot the derivative of R1 versus R1 for four values of R2. Again, steady-state values correspond to points that intersect the horizontal line. In Figure 3a, the background input h1 is low; regardless of R2, the steady-state rate reached by R1 is close to 5 spikes per second. In contrast, Figure 3b shows that with a higher background input, the curves move up and R2 is able to shift the steady state of R1 between approximately 3 and 30 spikes per second. The dynamics of the coupled equations can be visualized by plotting the steady-state values in a graph with R1 and R2 as axes (Rinzel & Ermentrout, 1998; Wilson, 1998; Latham et al., 2000; Dayan & Abbott, 2001). Such a plot is shown in Figure 3c, where black and gray lines indicate points at which the derivatives of R1 and R2, respectively, become zero. The steady state of the coupled system is the intersection between those two lines. The vectors in the plot go from the dots to the tips of the lines; each component is proportional to the derivative of the corresponding rate. When h1 and h2 are low and equal, all initial conditions lead to the only steady-state value, which is around 5 spikes per second for both rates. Condition 6.2 is satisfied. Figure 3d shows the steady states and vector field when the background inputs h1 and h2 are still equal but stronger. Now there are three intersections: two that correspond to stable steady-state solutions and one that is unstable. The stable solutions are symmetric, so that either R1 is low and R2 is high, or vice versa. The solution with equal rates is unstable and does not satisfy condition 6.2. Notice that the system cannot cross the diagonal R2 = R1: if it starts with R2 > R1, it eventually settles at the high-R2 solution, and correspondingly for the other point.
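A minimal integration of equations 6.1 reproduces this picture; the sketch below uses the parameters from the Figure 3 caption (the function name and Euler settings are my choices):

```python
def simulate(h, R0, w_tot=1.154, vN=0.0295, s=40.0, tau=1.0,
             dt=0.01, steps=20000):
    """Euler integration of the two mutually inhibiting populations,
    equations 6.1, with equal backgrounds h1 = h2 = h."""
    R1, R2 = R0
    for _ in range(steps):
        denom = s + vN * (R1**2 + R2**2)
        dR1 = (-R1 + (w_tot * R1 + h)**2 / denom) / tau
        dR2 = (-R2 + (w_tot * R2 + h)**2 / denom) / tau
        R1, R2 = R1 + dt * dR1, R2 + dt * dR2
    return R1, R2

print(simulate(8.5, (1.0, 2.0)))    # low background: both rates settle together
print(simulate(10.15, (1.0, 2.0)))  # high background: the initially larger rate wins
```

With the stronger background, whichever rate starts higher ends up dominating while the other is suppressed, as described for Figure 3d.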
Figure 3: Rate description of two unstructured networks that inhibit each other. Their mean rates are R1 and R2. (a) Time derivative of R1 versus R1 for four values of R2: 2.5, 11.4, 20, and 30 spikes per second, with thicker lines corresponding to higher rates. Here h1 = 8.5; the steady state stays near 5 spikes per second. (b) As in a, but with stronger background input: h1 = 10.15. (c) Steady states of R1 and R2 and the associated vector field. Short lines are vectors with components proportional to dR1/dt and dR2/dt; they indicate how R1 and R2 change at each point. Dots correspond to the tails of the vectors. Black and gray lines correspond to points at which the derivatives of R1 and R2, respectively, equal zero. On the black curve, the vectors point vertically, and on the gray curve, they point horizontally. Intersections are steady-state solutions of the full system. Here h2 = h1 = 8.5; there is a single intersection for which both rates are about 4 spikes per second. (d) Steady states of R1 and R2 for h2 = h1 = 10.15. There are two stable steady states, each with a low and a high rate around 3 and 30 spikes per second, respectively, and an unstable point with equal rates of about 11.4 spikes per second. All points were obtained from equations 6.1 with τ = 1, s = 40, w_tot = 1.154, and vN = 0.0295.
This is as far as the rate description goes. However, in real networks, firing rates fluctuate because spikes are discrete events; they act as a source of noise superimposed on the underlying vector fields. If independent random fluctuations were added to R1 and R2 in Figure 3c, no qualitative change would be observed; the rates would just fluctuate around the steady-state point. In contrast, adding noise to Figure 3d could have a more dramatic consequence: the system would still fluctuate around a steady-state point, but it could also cross the diagonal and switch to the other steady state if the fluctuations were large enough. This is exactly what is observed in simulations in which noise is added to equations 6.1 (not shown). More important, this is also what is observed in more realistic networks of spiking neurons (see Brunel, 2000). Figure 4 illustrates this. The full spiking network consists of two subnetworks. Synaptic connections within each subnetwork are random, as in Figure 2, but across the two groups, there is an interaction: the excitatory neurons of one group contact the inhibitory neurons of the other. Thus, all neurons in a subnetwork should fire at approximately the same rate, with the interaction between groups being reciprocally inhibitory. Figure 4a shows the spike trains from 40 of the 300 excitatory neurons; 20 belong to one group (black) and 20 to the other (gray). Figure 4b shows the average rate for each subnetwork as a function of time. Initially, background inputs are low, and all neurons fire at about 4 spikes per second. At t = 0.5 s, all background inputs are increased equally. This switches the system to a different dynamic mode: now the network alternates between two states, with one or the other group firing more intensely. Figure 4c shows that for each subnetwork, the distribution of rate values is indeed bimodal. The plot of R2 versus R1 shows two clusters, as expected from Figure 3d.
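The noise-driven switching can be sketched by adding independent gaussian fluctuations to equations 6.1. The noise amplitude below is an arbitrary stand-in for spiking variability, not a parameter from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

def alternation_intervals(h=10.15, w_tot=1.154, vN=0.0295, s=40.0,
                          tau=1.0, noise=12.0, dt=0.01, steps=200000):
    """Equations 6.1 plus independent gaussian noise on each rate.
    Returns the durations of the periods between dominance reversals
    (sign changes of R1 - R2)."""
    R1, R2 = 3.0, 30.0
    dominant, t_switch, intervals = False, 0.0, []
    for i in range(steps):
        denom = s + vN * (R1**2 + R2**2)
        dR1 = -R1 + (w_tot * R1 + h)**2 / denom
        dR2 = -R2 + (w_tot * R2 + h)**2 / denom
        R1 += (dt * dR1 + np.sqrt(dt) * noise * rng.standard_normal()) / tau
        R2 += (dt * dR2 + np.sqrt(dt) * noise * rng.standard_normal()) / tau
        R1, R2 = max(R1, 0.0), max(R2, 0.0)
        if (R1 > R2) != dominant:        # dominance reversal
            dominant = not dominant
            intervals.append(i * dt - t_switch)
            t_switch = i * dt
    return intervals
```

With the noise set to zero, the system sits at the asymmetric attractor and no reversals occur; with sufficiently large noise, reversals appear at irregular intervals, qualitatively as in Figure 4.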
In fact, simulations of equations 6.1 with noise give rise to very similar distributions. Conversely, if parameters are adjusted to reduce variability in the spiking network, the reversals in activity become less frequent or may disappear entirely. This example illustrates how a uniform background can switch a network from one dynamic regime to another. It is also evident that the activity of this network resembles the behavior observed during binocular rivalry; this is worth commenting on. In rivalry, as in other multistable situations, two mutually exclusive sensory percepts dominate at alternating intervals (Leopold & Logothetis, 1996, 1999; Blake & Logothetis, 2002). A fundamental characteristic of these processes, observed perceptually as well as in single-neuron recordings, is their randomness: the distribution of dominance intervals is always unimodal and asymmetric, with a long tail. Often, it is well fit by a gamma distribution (Lehky, 1995; Leopold & Logothetis, 1996, 1999). The distributions shown in Figure 4d have these properties. Furthermore, in the simulations, successive periods of dominance and suppression are independent; no correlation is found between the length of one period and the next, as reported experimentally also (Lehky, 1995; Leopold
Figure 4: Spiking network with random alternations in firing pattern. The full network consisted of two unstructured subnetworks, each with 150 excitatory and 37 inhibitory model neurons connected randomly. In addition, the excitatory neurons from each subnetwork were connected to the inhibitory neurons of the other, producing reciprocal inhibition. (a) Raster plot showing spike trains from 40 of the 300 excitatory neurons, 20 from each group. Initially, all neurons fire at about 4 spikes per second. At t = 0.5 s (vertical line), all background excitatory inputs are increased, and remain so. This elicits alternating periods of dominance. (b) Mean firing rates, averaged over neurons within a subpopulation, as functions of time. (c) Black and gray histograms are firing-rate distributions for individual subnetworks. The middle plot shows the joint distribution. Compare to Figure 3d, noting different x-axes. (d) Distributions of activation periods. A subnetwork was considered activated if its average rate was more than 13 spikes per second. The periods of time during which the subnetworks remained activated were measured, and histograms were constructed from the resulting lists of interval lengths. Black and gray histograms had mean interval durations of 532 and 704 ms, respectively. Continuous curves are exponential distributions with the same means as the histograms. Using somewhat different threshold values or using R2 > R1 as a criterion produced almost identical results. Distributions in c and d are based on 960 s of continuous simulation time.
& Logothetis, 1999). Models of rivalry (Lehky, 1988; Kalarickal & Marshall, 2000; Laing & Chow, 2002) reproduce these and other experimental results, but do so relying on slow processes such as spike rate adaptation or synaptic depression. The underlying mechanism in this network is different because it produces alternations with timescales on the order of seconds without any time constants longer than a few milliseconds. Alternations in perception probably reflect competitive interactions between and within multiple visual areas (Dayan, 1998; Lumer, 1998; Leopold & Logothetis, 1999; Blake & Logothetis, 2002). However, these results show that random alternations in neuronal activity, like those observed experimentally, are not that exceptional and may arise from reciprocal inhibition in very simple networks, even in the absence of intrinsic oscillators or slow processes. And these alternations may be turned on or off by the background.

7 Self-Sustained Activity

Network models have been used extensively to study several types of persistent activity (Wang, 2001), such as neural correlates of working memory in prefrontal cortex (Compte et al., 2000), responses of neurons encoding head direction (Zhang, 1996), and responses of oculomotor cells that stabilize gaze (Seung, Lee, Reis, & Tank, 2000), among others (Wang, 2001). In these recurrent networks, many states or firing rate distributions are possible and are stable over a relatively long timescale. Which state is observed depends on initial conditions or transient inputs that may steer the network from one distribution to another. In this section, I show that the background is also crucial in allowing a network to exhibit such persistent patterns of activity. Models of persistent activity are characterized by so-called line attractors or bump attractors, which refer to continuous collections of steady-state points.
To some extent, such attractors can be studied analytically (Seung, 1996), but this is typically difficult for those that generate unimodal profiles of activity (Ben-Yishai, Bar-Or, & Sompolinsky, 1995; Zhang, 1996; Laing & Chow, 2001). However, using the present theoretical framework, an attractor solution can be calculated explicitly; this is done first. In this case, a center-surround connectivity is used. The excitatory connections w have a gaussian shape with standard deviation σ, such that

    w(a, b) = w(a − b) = w_max exp(−(a − b)² / (2σ²)).    (7.1)
This organization can be interpreted as excitation between nearby neurons or between similarly tuned neurons. When this connectivity is used and all background inputs h(a) are set to zero, another solution to equation 2.2, apart from the uniform rate, is a gaussian profile of activity that can be
centered at an arbitrary point x within the array,

    r(a; t) = r(a − x; t) = r_max(t) exp(−(a − x)² / (2σ²)).    (7.2)
Note that each value of x corresponds to a different steady-state point and that this solution applies in the absence of tuned or structured input. The virtue of the network is that the peak of activity can be localized at any value of x within a range and that this value is set by initial conditions. In a model of working memory, x corresponds to the quantity being stored. This solution can be verified by direct substitution into the right and left sides of equation 2.2. For the left side, first compute the time derivative of the above expression. The key for this result is that the integral in the numerator of equation 2.2 is a convolution of two gaussians, which produces another gaussian. Although the resulting gaussian is wider than either of the original two, the squaring operation decreases the width, so the gaussian profile ends up with the same standard deviation σ as the connectivity pattern. Substitution into equation 2.2 reduces the network equation to a scalar equation for the amplitude of the gaussian profile,

    τ dr_max/dt = −r_max + (½ w_tot² r_max²) / (s + vρ√π σ r_max²),    (7.3)
where, in this case, w_tot = ρ w_max √(2π) σ. Compare this expression to equation 4.1: with zero background input, they give rise to the same dynamics. Therefore, there are three possible scenarios for this solution, depending on the steady-state values that r_max can attain. First, if the connections are too weak, only the zero-rate steady state exists; the amplitude always shrinks to zero; the memory always fades away. Second, when the connections are stronger, r_max is bistable, and the memory can be switched on or off. Third, if the connections are even stronger, a high-amplitude gaussian profile of activity always develops; the memory is never off. Similar calculations can be performed with a constant, nonzero background input (see appendix B). The solution in that case can still be approximated as a gaussian profile of activity centered at an arbitrary point, but the background input has two major consequences: it shifts the activity profile upward and makes it wider; that is, it adds a constant to the right side of equation 7.2 and multiplies σ by a factor larger than 1. These effects are indeed observed in spiking networks (not shown). With this architecture, there are two distinct situations that are interesting in terms of the background component, and both regimes can be seen in simulations in which equation 2.1 is solved numerically. First, the network may be multistable, in the sense that both a uniform, low-rate solution and a gaussian profile of activity are possible for the same level of background
input. Second, the network may have a single stable solution that changes as a function of the background. In this mode, a low background gives rise to only a uniform, low-rate solution (the gaussian profile is unstable), whereas a higher background gives rise to only a gaussian profile (the uniform rate is unstable). In this latter mode, the persistent activity may be switched on or off by the background, which can act as a gating or context signal for the memory circuit. These results are in agreement with more realistic simulations. The first mode was demonstrated in a detailed simulation study (Compte et al., 2000) showing that a spiking network can reproduce many of the features of prefrontal responses recorded during working memory tasks. I have also observed such a bistable regime in similar networks. The second mode is illustrated in Figure 5a, which shows spikes from a network with gaussian connectivity. Initially, the network fires at a uniform low rate. At t = 0.4 s, two things happen: a cue is presented, which lasts 100 ms, and all background excitatory inputs are increased by a small amount. The cue evokes a burst of localized activity that persists. At t = 2.5 s, all background inputs are decreased back to their original values, and the persistent activity disappears. Note that if the cue is presented without increasing the background excitation, the evoked activity does not persist; likewise, if the background is simply increased without presenting a cue, a gaussian profile eventually develops at a random position (not shown). The main point here is that the level of background input may act as a gate that controls whether the transient sensory signal will be stored. In effect, this input transforms the circuit's response from sensory to mnemonic. Evidently, this scenario requires an additional mechanism for switching the level of background input and maintaining it throughout a delay.
However, this may be a simpler problem because the background is not specific; on average, it is the same for all excitatory neurons. This would effectively split the problem of storing a quantity into two parts: a circuit indicating whether something needs to be stored, and another, conditional on the first, specifying the actual quantity.

8 Traveling Waves

The networks considered so far have had symmetric connectivity patterns. Asymmetries typically give rise to solutions with richer, nonstationary dynamics, but these also depend on the level of background input. This is an example. Consider a network with a center-surround connectivity profile that is slightly shifted,

    w(a, b) = w(a − b − Δ) = w_max exp(−(a − b − Δ)² / (2σ²)).    (8.1)
Background Activity as a Network Switch
1457
Figure 5: A spiking network with self-sustained activity. The full network consisted of 800 excitatory and 200 inhibitory model neurons connected in a center-surround organization with periodic boundaries (see section 2). (a) Raster plot showing spike trains from 70 excitatory neurons. Initially, all units fire at about 3.5 spikes per second. A cue is presented between t = 0.4 and t = 0.5 s. This is implemented through a localized and transient increase in background excitatory input (bar). In addition, at t = 0.4 s, all background excitatory inputs are uniformly increased, remaining so until t = 2.5 s, when they go back to their original levels. Self-sustained activity disappears if the background input is below a minimum value. The continuous line traces the center of mass of the population activity, calculated using the vector method (Georgopoulos, Schwartz, & Kettner, 1986; Dayan & Abbott, 2001). (b) Average rates of all excitatory neurons. Two sets of data points are shown: rates measured during sustained activity, between t = 1.5 and t = 2.5 s (peaked responses), and rates measured before cue presentation. Gray lines were obtained by smoothing the respective data sets with a gaussian filter. (c) As in b, but for inhibitory neurons.
1458
E. Salinas
The parameter Δ is the shift, which is small relative to the width σ. This shift introduces an asymmetry; now neuron b excites neuron b + Δ most strongly. This does not alter the shape of the resulting activity profile substantially, but forces it to move continuously from any given position toward a position Δ units away. Now the firing rates are given by

$$r(a;t) = r(a-x(t)) = r_{max}\exp\left(-\frac{(a-x(t))^2}{2\sigma^2}\right), \tag{8.2}$$

where

$$\frac{dx}{dt} = \frac{\Delta}{\tau}. \tag{8.3}$$
This solution is valid when r_max is equal to the steady-state value of equation 7.3. To verify it, again substitute into equation 2.2 assuming h = 0 and using the derivative of equation 8.2 on the left-hand side. In addition, notice that now the result of the convolution integral followed by squaring is a gaussian function of a − x − Δ, which can be expanded in a Taylor series around a − x. The above expressions describe a gaussian profile of activity centered at an arbitrary point x and moving with a constant speed, which is proportional to the connection shift Δ. The amplitude of the profile can still have multiple steady states, as with a stationary profile. Figure 6a illustrates this and the effect of the background. The situation in Figure 6a is identical to that in Figure 5a, except that the connectivity pattern is shifted (see equation 8.1). After the cue, a wave of activity traverses the network at a constant speed. When the background excitation decreases, the wave stops and the network returns to a uniform low rate. Figure 6b shows that propagation speed is indeed proportional to Δ, as predicted by the theory (see also Idiart & Abbott, 1993; Ermentrout, 1998a). Suppose such propagating activity underlies a sequence of motor actions. In this scenario, a strong, localized input such as a sensory-evoked response would be the trigger for the sequence, whereas a weak, uniform signal could act in a context-dependent way to allow or stop the corresponding sequence of neural activations. In terms of a go/no-go paradigm, the background would carry information about the type of trial (go or no-go), while the localized input would reflect the turning on of the sensory cue. Thus, a highly complex chain of events could potentially be controlled by a simple mechanism. These results have another important consequence. The connection shift in this network can be seen as the difference between feedforward and feedback components of the connectivity.
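A key step above, that convolving the shifted kernel with a gaussian bump returns a gaussian recentered Δ units away, can be checked numerically. A minimal sketch in Python with numpy (parameter values are illustrative, not taken from the article's simulations):

```python
# Check: convolving the shifted kernel w(u - delta) with a gaussian
# bump centered at x0 gives a profile peaked near x0 + delta.
# Illustrative parameters only.
import numpy as np

da = 0.01
a = np.arange(-20.0, 20.0 + da, da)      # positions, in "label units"
sigma, delta, x0 = 1.0, 0.15, -3.0

w = np.exp(-(a - delta) ** 2 / (2.0 * sigma ** 2))   # shifted kernel
r = np.exp(-(a - x0) ** 2 / (2.0 * sigma ** 2))      # activity bump

conv = np.convolve(r, w, mode="same") * da           # integral of w(a - b) r(b) db
peak = a[np.argmax(conv)]
assert abs(peak - (x0 + delta)) < 2 * da             # peak moved by the shift
```

Iterating this step is what makes the profile drift at the constant speed Δ/τ of equation 8.3.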
This network, for which the feedback component is large, operates in a nonsynchronous mode (van Rossum, Turrigiano, & Nelson, 2002), whereas in purely feedforward networks, activity can also propagate via synchronous volleys of spikes (Diesmann, Gewaltig,
Figure 6: Traveling wave of activity in a spiking network. The simulation was identical to that in Figure 5, except that synaptic connections between excitatory neurons were shifted by 0.15 label units. For reference, the standard deviation of the gaussian connections was about 1 unit. (a) Spike trains from 70 excitatory neurons. Stable propagation of activity is maintained if the background input is above a minimum value; otherwise, it stops. (b) Propagation speed as a function of connection shift. Five points per shift value are shown. Each point is the result of a different simulation identical to that in a. The gray line is the best linear fit. The connection shift is in label units, and speed is in label units per second.
& Aertsen, 1999; van Rossum et al., 2002). According to the results in this and the previous section, with strong feedback and low synchrony, the stability requirements for the propagation of activity are practically identical to those for stationary activity. In other words, stabilizing a stationary rate distribution and a nonsynchronous traveling wave are very similar problems. Thus, except for the shift itself, the mechanisms needed to establish and maintain the functionality of both types of circuits (Renart, Song, Compte, & Wang, 2001) could be the same.
9 Discussion

I have presented a general mathematical argument indicating that the global properties of a network (in particular, which attractors it may exhibit) may be controlled by one of the simplest possible mechanisms: a uniform, low-amplitude excitatory signal. This idea is consistent with data showing task-dependent changes in background or undriven activity in high-order cortical areas (Sakagami & Niki, 1994; Lauwereyns et al., 2001; Port, Kruse, Lee, & Georgopoulos, 2001). Although it is not surprising that widespread excitation can produce drastic changes in a network, the generality of this effect is not intuitively obvious. The background input can turn a simple, unstructured network into a bistable element; it can induce competition between two networks that are mutually inhibitory; it can turn a selective, transient response into a sustained one when the local network has a center-surround organization; it can gate the propagation of activity in a network with a feedforward component. The main analytic result, how the background controls the stability of the state in which all neurons fire at the same rate, applies to any network regardless of its architecture, as long as it is described by the present mathematical model. A second important finding is that this mechanism could operate through fairly weak changes: in the simulations of Figures 5 and 6, the background excitation varied by about 2% (see section 2). Hence, with neurons receiving a background synaptic drive of 10,000 spikes per second (from, say, 2000 excitatory neurons firing at 5 spikes per second), the switching could be produced by an increase in the total input rate for each neuron of as little as 200 spikes per second.
Although all input changes considered here were implemented through changes in background firing, similar effects could also occur through the action of neuromodulators, which may alter the overall excitability of cells as well as their interactions (Harris-Warrick & Marder, 1991; Gu, 2002). For instance, neurons in the hippocampus are capable of producing various types of rhythmic activity characterized by their dominant frequency bands, and there is evidence that acetylcholine may switch the hippocampal circuitry between different rhythms (Fellous & Sejnowski, 2000; Tiesinga, Fellous, José, & Sejnowski, 2001). Except for the fact that most of these states are oscillatory, the neuromodulator causes uniform changes in excitability that regulate the transition between various dynamical states, consistent with the results presented here. Although fast oscillations have not been studied in detail within the present framework, preliminary simulations of the rate model indicate that the background level can indeed gate the transition between uniform firing and oscillatory activity, as expected from the main result. The existence of multiple attractors in neural networks has been pointed out before using somewhat different mathematical formulations (Wilson & Cowan, 1973; Yuille & Grzywacz, 1989; Ermentrout, 1998b; Latham et al.,
2000). In these studies, inhibition is usually considered subtractive. For all specific examples discussed here, there was good qualitative agreement between theory and simulations. Given the range of behaviors explored, such consistency is somewhat remarkable because the divisive mechanism was originally derived from empirical observations (Wilson & Humanski, 1993; Carandini & Heeger, 1994; Carandini et al., 1997; Simoncelli & Heeger, 1998; Wilson, 1998; Schwartz & Simoncelli, 2001) and its biophysical justification is still uncertain (Holt & Koch, 1997; but see Chance, Abbott, & Reyes, 2002, and José, Tiesinga, Fellous, Salinas, & Sejnowski, 2002). However, several variants of the equations used here give rise to similar results (Wilson, 1998; Salinas, in press; see appendix C), so the general form of equation 2.1 does seem to capture key properties of recurrent neural networks. An important challenge for the future is to derive an expression similar to equation 2.1 starting from a realistic biophysical description, for instance, from the equations determining a single neuron's voltage. The most serious limitation of the model presented here is that the recurrent connections in the simulations cannot be too strong. This is of concern in view of the suggestion (Abeles, 1991) that the cortex might operate in a high-gain regime, in which synapses are strong and a single spike from one neuron can trigger spikes from multiple targets. With the chosen parameters, an excitatory spike in Figure 5 or 6 produces a maximum depolarization of about 0.13 mV in a postsynaptic target at rest (the background input acts as if individual spikes produced 0.24 mV depolarizations at rest, but this can be increased). By making the synapses stronger but sparse, this number can be safely raised by a factor of 2 or 3, and possibly more if the size of the network also increases. This and other ways to generate large postsynaptic potentials were not explored systematically.
Increasing all synaptic conductances (excitatory and inhibitory) eventually leads to oscillations and unstable behavior. This problem is most serious for the bistable circuit in Figure 2 and least so for the network with alternating dominant states, because it relies precisely on such instabilities (see also Brunel & Wang, 2001). This, however, seems to be a general limitation of spiking network models: fine tuning, homogeneity, and stabilizing mechanisms are necessary to obtain stable attractors (Compte et al., 2000; Seung et al., 2000; Renart et al., 2001). For example, NMDA synapses, which saturate and have long timescales, are crucial to stabilize realistic models of persistent activity (Wang, 1999; Compte et al., 2000; Seung et al., 2000). In these models, as in those presented here, variability in interspike intervals is lower than measured from cortical neurons (Holt, Softky, Koch, & Douglas, 1996). It should be pointed out that this is a problem for continuous attractor models where receptive fields overlap and responses are graded; networks with discrete numbers of steady states and nongraded responses may overcome this limitation (Brunel & Wang, 2001). It is possible that the cortex uses other stabilization mechanisms that are not yet understood in order to work at high gain and with high spike train variability, but this merits a separate in-depth investigation.
1462
E. Salinas
With this caveat, these results could have profound implications for any functional interpretation of neural network activity. Cortical areas associated with a given function may not always display the expected properties. For instance, an area regarded as involved in working memory may display transient responses typical of sensory cortices, as is often the case in prefrontal cortex and other regions with persistent activity (di Pellegrino & Wise, 1993; Ungerleider, 1995; O Scalaidhe, Wilson, & Goldman-Rakic, 1997; Thompson & Schall, 1999). And vice versa, an area with mostly sensory characteristics may sometimes display persistent activity, as has been observed in visual (Miyashita & Chang, 1988; Ungerleider, 1995; Kobatake & Tanaka, 1994; Gibson & Maunsell, 1997) and somatosensory cortices (Zhou & Fuster, 1997; Pruett, Sinclair, & Burton, 2000; Salinas, Hernández, Zainos, & Romo, 2000). Similarly, areas associated with motor control and which may trigger sequences of neural activity (Graziano, Taylor, & Moore, 2002) can also exhibit short-lived sensory responses (Fadiga, Fogassi, & Rizzolatti, 1996; Rizzolatti, Fadiga, Gallese, & Fogassi, 1996; Graziano, Hu, & Gross, 1997a; Zhang, Riehle, Requin, & Kornblum, 1997; Buccino et al., 2001; Port et al., 2001; Hernández, Zainos, & Romo, 2002) and persistent activity (Ungerleider, 1995; Kettner, Marcario, & Port, 1996; Graziano, Hu, & Gross, 1997b). Many of these responses are observed even when no motor action is actually produced. The results just presented suggest that such functional changes may be caused by small differences in uniform, modulatory inputs that switch the dynamics of local networks rather than by large changes in strong feedforward input, which would still beg for an explanation.

Appendix A: Stability of the Uniform-Rate Solution

This is to clarify equation 4.3 and the difference between stability conditions for the uniform rate equation versus the full network.
From equation 4.1, the uniform rate at steady state R_ss must satisfy

$$R_{ss} = \frac{(w_{tot}R_{ss} + h)^2}{s + vNR_{ss}^2}. \tag{A.1}$$

To determine if R_ss is a stable point of equation 4.1, set the uniform rate equal to the steady state plus a small perturbation, R → R_ss + δR. Then use the expression above and eliminate quadratic terms in δR to obtain

$$\tau\frac{d\,\delta R}{dt} = \delta R\left(-1 + \frac{2w_{tot}(w_{tot}R_{ss} + h) - 2vNR_{ss}^2}{s + vNR_{ss}^2}\right). \tag{A.2}$$
The requirement for stability is that the coefficient multiplying δR be negative, which leads to condition 4.3.
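Equation A.1 reduces to a cubic in R_ss, so its fixed points and the sign of the coefficient in A.2 can be checked directly. A sketch in Python with numpy, using illustrative parameters (not those of the article); with h = 0 and sufficiently strong excitation the cubic has three nonnegative roots, and A.2 marks the middle one as unstable:

```python
# Fixed points of equation A.1 and their stability according to A.2.
# Illustrative parameters: wtot = 2, h = 0, s = v = N = 1.
import numpy as np

wtot, h, s, v, N = 2.0, 0.0, 1.0, 1.0, 1

# A.1 rearranged: v N R^3 - wtot^2 R^2 + (s - 2 wtot h) R - h^2 = 0
roots = np.roots([v * N, -wtot ** 2, s - 2.0 * wtot * h, -h ** 2])
fixed = sorted(r.real for r in roots if abs(r.imag) < 1e-9 and r.real >= -1e-9)

def stable(R):
    # the coefficient multiplying the perturbation in A.2 must be negative
    coeff = -1.0 + (2.0 * wtot * (wtot * R + h) - 2.0 * v * N * R ** 2) \
                   / (s + v * N * R ** 2)
    return coeff < 0.0

print([stable(R) for R in fixed])   # [True, False, True]: bistability
```

The two stable rates coexisting with an unstable one in between is exactly the bistable configuration discussed in the main text.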
In terms of the full network, however, the above result is equivalent to varying all rates r_i identically; the procedure is more complicated if each neuron is perturbed by an independent amount. This is as follows. Substitute R_ss + δr_j for r_j in equation 2.1. Again use equation A.1 above and linearize to obtain

$$\tau\frac{d\,\delta r_i}{dt} = -\delta r_i + \alpha_1\sum_j w_{ij}\,\delta r_j - \alpha_2\sum_j \delta r_j, \tag{A.3}$$

with

$$\alpha_1 \equiv \frac{2(w_{tot}R_{ss} + h)}{s + vNR_{ss}^2}, \qquad \alpha_2 \equiv \frac{2vR_{ss}^2}{s + vNR_{ss}^2}. \tag{A.4}$$

Rewriting the equation in matrix form,

$$\tau\frac{d\,\delta\vec r}{dt} = L\,\delta\vec r = (-1 + \alpha_1 w - \alpha_2 U)\,\delta\vec r, \tag{A.5}$$
where 1 is the unit matrix and U is a matrix with all entries equal to 1. The stability of the system now depends on the matrix L; for all perturbations to decrease with time, the real parts of all eigenvalues must be negative (Hertz et al., 1991; Wilson, 1998; Dayan & Abbott, 2001). Determining the eigenvalues of L is generally difficult, but for some specific cases, the calculation is straightforward. The key is to point out three results. First, U has one eigenvalue equal to N, which corresponds to the eigenvector e = (1, 1, …, 1)/√N, and all other eigenvalues equal to 0. Second, the largest eigenvalue of w is equal to w_tot, which also corresponds to e. This results because of the normalization condition, Σ_j w_ij = w_tot, and because all entries w_ij are positive or zero. Third, as a consequence of these two results, the largest eigenvalue of L is either

$$-1 + \alpha_1 w_{tot} - \alpha_2 N \tag{A.6}$$

or

$$-1 + \alpha_1\lambda_2, \tag{A.7}$$

where λ_2 is the second largest eigenvalue of w. The first option results because e is an eigenvector of L, and the second occurs because multiplying any other eigenvector of w by U gives zero.
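These three spectral facts can be confirmed numerically for a concrete connectivity with constant row sums, for example w = w_0(U − 1), in which each neuron excites all others but not itself. A sketch in Python with numpy; N, w_0, and the coefficients α_1, α_2 are arbitrary illustrative values:

```python
# Spectral checks for L = -1 + a1 w - a2 U with w = w0 (U - I).
import numpy as np

N, w0, a1, a2 = 5, 0.3, 0.7, 0.1
U = np.ones((N, N))
w = w0 * (U - np.eye(N))            # constant row sums: wtot = (N - 1) w0
wtot = (N - 1) * w0

eU = np.sort(np.linalg.eigvalsh(U))
assert abs(eU[-1] - N) < 1e-9 and np.allclose(eU[:-1], 0.0)   # U: one eigenvalue N

ew = np.sort(np.linalg.eigvalsh(w))
assert abs(ew[-1] - wtot) < 1e-9                              # largest of w is wtot

L = -np.eye(N) + a1 * w - a2 * U
eL = np.sort(np.linalg.eigvalsh(L))
lam2 = ew[-2]                                                 # here lam2 = -w0
assert abs(eL[-1] - max(-1 + a1 * wtot - a2 * N,
                        -1 + a1 * lam2)) < 1e-9
```

For this choice of w the largest eigenvalue of L is the one along e, so the stability condition is set by expression A.6.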
Given these observations, consider four simple cases:

1. Suppose w = w_0 U. This corresponds to all neurons exciting each other by the same amount w_0, which is the example mentioned in the main text. In this case, w_tot = Nw_0, and the largest eigenvalue of L is equal to −1 + α_1 w_tot − α_2 N. Imposing that this be smaller than 0 leads to condition 4.3, as claimed in the main text.

2. The situation is practically the same if each neuron excites all others by the same amount but does not excite itself; that is, w = w_0(U − 1). Now the total synaptic weight is w_tot = (N − 1)w_0, but the maximum eigenvalue is still equal to expression A.6. Therefore, the same stability condition applies. The case of random connections can be viewed as an extension of cases 1 or 2, where w_0 is the mean value of the connection weights (here it is important to recall that negative connections are not allowed). If the variance of the weight distribution is low, the same stability condition should apply. However, high variances may give rise to situations in which the second largest eigenvalue of w is large enough so that expression A.7 should be used. In practice, using a uniform distribution, for instance, gives practically the same results as using constant weights, except that the firing rates are distributed around R_ss.

3. Now consider the opposite extreme, in which each unit excites itself only; that is, w = w_0 1. Now w_tot = w_0, but the major difference is that all eigenvalues of w, not just one of them, are equal to w_tot. The maximum eigenvalue of L is thus −1 + α_1 w_tot. Imposing that this be smaller than 0 leads to a new condition,

$$2w_{tot}(w_{tot}R_{ss} + h) < s + vNR_{ss}^2. \tag{A.8}$$

This is the same as expression 6.2, because of equation A.1. This connectivity matrix describes a network in which all interactions are mainly inhibitory (via the divisive mechanism). The two-subnetwork model with mutual inhibition, equations 6.1, corresponds to this case with N = 2, so the above expression is the corresponding stability condition. Note that the above inequality implies inequality 4.3, but not vice versa.

4. Finally, consider a simplified center-surround organization where each neuron excites M of its neighbors equally and does not excite itself. For example, for N = 8 neurons and M = 4 neighbors, the
connectivity matrix is

$$w = w_0\begin{pmatrix}
0&1&1&0&0&0&1&1\\
1&0&1&1&0&0&0&1\\
1&1&0&1&1&0&0&0\\
0&1&1&0&1&1&0&0\\
0&0&1&1&0&1&1&0\\
0&0&0&1&1&0&1&1\\
1&0&0&0&1&1&0&1\\
1&1&0&0&0&1&1&0
\end{pmatrix}. \tag{A.9}$$
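Because this matrix is circulant, its spectrum is easy to obtain numerically; the sketch below (Python with numpy, sizes illustrative) builds such ring matrices and also compares wide and narrow neighborhoods on a larger ring:

```python
# Ring connectivity: each of N neurons excites its M nearest neighbors
# (M/2 on each side) with weight w0 and has no self-excitation.
import numpy as np

def ring(N, M, w0=1.0):
    w = np.zeros((N, N))
    for k in range(1, M // 2 + 1):
        for i in range(N):
            w[i, (i + k) % N] = w0
            w[i, (i - k) % N] = w0
    return w

ev = np.sort(np.linalg.eigvalsh(ring(8, 4)))[::-1]
assert abs(ev[0] - 4.0) < 1e-9          # largest eigenvalue = wtot = M w0

# N = 100: wide (M = 90) versus narrow (M = 10) neighborhoods
for M in (90, 10):
    ev = np.sort(np.linalg.eigvalsh(ring(100, M)))[::-1]
    print(M, round(ev[1] / ev[0], 2))    # second largest relative to wtot
```

The two printed ratios, about 0.09 for M = 90 and 0.98 for M = 10, are the neighborhood-size effect quantified by the eigenvalue formula that follows.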
This corresponds to a ring of neurons with a symmetric, square connectivity profile. Using this square connectivity in equation 2.1 actually results in behaviors that are almost indistinguishable from those obtained with gaussian connections. In this case, w_tot = Mw_0. For any even N, the eigenvalues λ of such matrices are given by

$$\lambda = 2w_0\sum_{j=1}^{M/2}\cos\left(\frac{2\pi kj}{N}\right), \tag{A.10}$$
with k = 0, 1, …, N/2; the first and largest eigenvalue corresponds to k = 0 and is equal to w_tot, as found in general. This expression is interesting because it shows that the eigenvalue spectrum, and thus the stability condition, depends on the size of the excitatory neighborhood, M. Take two opposite examples; for both, suppose that N = 100. If the neighborhood is large, the second largest eigenvalue of w is much smaller than w_tot; with M = 90, the above sum gives about 0.09 w_tot. In this case, w is similar to U. In contrast, when the neighborhood is small, the second largest eigenvalue of w is very close to w_tot; with M = 10, the above sum gives about 0.98 w_tot. In this case, w is more similar to the identity matrix 1. Thus, the width of the connections has a large impact on the stability of the system.

Appendix B: Solution for Gaussian Connectivity and Nonzero Background

When the recurrent connectivity is gaussian, as in equation 7.1, and the background activity h is nonzero, the steady-state, nonuniform solution for the recurrent network is approximately given by

$$r(a) = r(a-x) = R_0 + R_1\exp\left(-\frac{(a-x)^2}{2\sigma_1^2}\right). \tag{B.1}$$
This is still a gaussian profile of activity centered at an arbitrary point x. However, the baseline rate R_0, the amplitude of the gaussian R_1, and the width σ_1 now depend on h, in addition to the other network parameters. To see that this expression is a solution, substitute it into equation 2.2, and do the two integrals. Notice that the one in the numerator is a convolution of two gaussians. Assuming that the integration limits are much larger than σ, the integral results in another gaussian. This assumption simply indicates that the width of the connectivity profile must be considerably smaller than the full length of the neural array. Then define

$$A \equiv R_1\rho w_{max}\sqrt{2\pi}\,\frac{\sigma\sigma_1}{\sqrt{\sigma^2+\sigma_1^2}} = R_1 w_{tot}\frac{\sigma_1}{\sqrt{\sigma^2+\sigma_1^2}} \tag{B.2}$$

$$B \equiv R_0 w_{tot} + h \tag{B.3}$$

$$C \equiv s + v\left(\rho\sqrt{\pi}\,\sigma_1 R_1^2 + 2\rho\sqrt{2\pi}\,\sigma_1 R_0 R_1 + NR_0^2\right). \tag{B.4}$$

In terms of these quantities, the network equation becomes

$$\tau\frac{dr(a)}{dt} = -R_0 - R_1\exp\left(-\frac{(a-x)^2}{2\sigma_1^2}\right) + \frac{\left[B + A\exp\left(-\frac{(a-x)^2}{2(\sigma^2+\sigma_1^2)}\right)\right]^2}{C}. \tag{B.5}$$
The next step is the key of the derivation. It relies on the observation that the square of a gaussian plus a constant is not exactly a gaussian, but can nevertheless be approximated quite accurately as such. This is

$$\left[B + A\exp\left(-\frac{y^2}{2\alpha^2}\right)\right]^2 \approx B^2 + (A^2 + 2AB)\exp\left(-\frac{y^2}{2\bar\alpha^2}\right), \tag{B.6}$$

where

$$\bar\alpha = \alpha\,\frac{1 + 2\sqrt{2}\,(B/A)}{\sqrt{2}\,(1 + 2(B/A))}. \tag{B.7}$$
For A + B = 1 the maximum error is less than 0.012, so at worst it is on the order of 1%. Using this approximation in equation B.5, it follows that three conditions are required for dr/dt to be zero:

$$R_0 = \frac{B^2}{C} \tag{B.8}$$

$$R_1 = \frac{A^2 + 2AB}{C} \tag{B.9}$$

$$\sigma_1^2 = \sigma^2\,\frac{\left(1 + 2\sqrt{2}\,(B/A)\right)^2}{1 + (8 - 4\sqrt{2})(B/A)}. \tag{B.10}$$
Clearly, all three variables, R_0, R_1, and σ_1, are coupled nonlinearly. However, the main characteristics of this solution can be appreciated from the above expressions. First, the presence of a background input means that the baseline rate R_0 must be larger than zero. This is because for h > 0, equations B.3 and B.8 imply that R_0 > 0. On the other hand, when the background is zero, the original solution, equation 7.2, is recovered: setting B = R_0 = 0 makes σ_1 = σ and R_1 equal to the steady-state value of r_max in equation 7.3. The second and more interesting consequence of the background input in this case is that it widens the resulting profile of activity. This can be seen from equation B.10, which shows that the width of the activity profile is always larger than the width of the connections. The proportionality factor between them is always larger than 1; it is a monotonic function of B/A, which increases monotonically with R_0/R_1. To see this, combine equations B.8 and B.9 to obtain

$$\frac{B}{A} = \frac{R_0}{R_1} + \sqrt{\left(\frac{R_0}{R_1}\right)^2 + \frac{R_0}{R_1}}. \tag{B.11}$$

Thus, the larger the baseline rate is relative to the amplitude of the activity profile, the wider the profile will be relative to the spread of the synaptic connections. This effect can be quite significant. For example, for R_0 = 3 and R_1 = 40, the widening factor is 1.49, almost a 50% increase. Notice that the theoretical model predicts this widening effect without any free parameters involved, just based on the underlying assumption of a center-surround organization. When the background is not zero, solving for R_0, R_1, and σ_1 is quite difficult; it is easier simply to simulate the rate-based network. However, there is an alternative analytical approach that is simple. First, set R_0 and R_1, leaving h and w_max undetermined, and solve for h and w_max according to constraints B.8 through B.10. This is as follows. (1) Set R_0 and R_1 and find B/A from equation B.11. (2) Find σ_1, C, and B from equations B.10, B.4, and B.8, respectively. (3) Find A using B and the ratio B/A. (4) Solve for w_max using equation B.2. (5) Solve for h using equation B.3. This procedure matches the results of computer simulations of equation 2.1, confirming that expression B.6 is a reasonable approximation and that equation B.1 is close to the true unimodal activity profile that results.

Appendix C: Variants on the Recurrent Equations

The exact form of the equations used throughout the article is not critical for the results presented because, qualitatively, the same types of dynamics
are obtained with a variety of similar equations (Salinas, in press). This is demonstrated for arbitrary exponents n and m such that

$$\tau\frac{dr_i}{dt} = -r_i + \frac{\left(\sum_j w_{ij}r_j + h_i\right)^n}{s + v\sum_j r_j^m}. \tag{C.1}$$

The uniform rate solution in this case is given by

$$\tau\frac{dR}{dt} = -R + \frac{(w_{tot}R + h)^n}{s + vNR^m}. \tag{C.2}$$
In order for all steady-state solutions to remain finite, the derivative dR/dt must be negative when R tends to infinity. By analyzing the limit behavior, it is straightforward to see that this happens for all parameter combinations when m > n − 1. An additional constraint, n > 1, is also applied (see below). With these conditions, the qualitative behavior of this equation is very similar to the case n = m = 2 discussed in the main text: there are at least one and up to three steady-state points. This can be seen by plotting dR/dt as a function of R. The stability of the uniform rate can be analyzed exactly as in appendix A. The resulting expression is the same as equation A.5, except with coefficients that depend on the exponents,

$$\alpha_1 \equiv \frac{n(w_{tot}R_{ss} + h)^{n-1}}{s + vNR_{ss}^m}, \qquad \alpha_2 \equiv \frac{mvR_{ss}^m}{s + vNR_{ss}^m}. \tag{C.3}$$
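The finiteness condition m > n − 1 follows from the large-R limit of equation C.2, where the ratio grows like R^(n−m); a small plain-Python sketch with illustrative parameters:

```python
# Sign of dR/dt in equation C.2 at large R: bounded when m > n - 1,
# runaway growth when m < n - 1. Illustrative parameters.
def dRdt(R, n, m, wtot=1.0, h=0.5, s=1.0, v=1.0, N=1):
    return -R + (wtot * R + h) ** n / (s + v * N * R ** m)

R = 1e6
assert dRdt(R, n=2, m=2) < 0.0   # m > n - 1: decay term dominates
assert dRdt(R, n=3, m=1) > 0.0   # m < n - 1: excitation dominates
```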
The same matrix equation is reached because in the calculation, only linear terms in the perturbations are retained, regardless of the exponents. Equation C.1 also gives rise to a sustained pattern of activity when the connections are gaussian. To see this, first rewrite the equation using integrals, as in equation 2.2, and consider the case h = 0. The nonuniform solution is

$$r(a - x;\, t) = r_{max}(t)\exp\left(-\frac{(a-x)^2}{2\sigma_r^2}\right), \tag{C.4}$$

which can be verified by substitution into the integral equation. The key is that the convolution integral in the numerator has not changed; again, it produces a wider gaussian, and exponentiation decreases its width. In order for the convolution-exponentiation combination to produce a gaussian with the same width σ_r as the original gaussian profile of activity, it must happen that

$$\sigma_r^2 = \frac{\sigma^2}{n - 1}. \tag{C.5}$$
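The width condition C.5 can be verified by carrying out the convolution and exponentiation numerically. A sketch in Python with numpy, with n = 3 as an illustrative exponent:

```python
# C.5: convolving gaussian weights (width sigma) with a gaussian profile
# (width sigma_r) and raising the result to the power n must reproduce
# the width sigma_r when sigma_r^2 = sigma^2 / (n - 1).
import numpy as np

n, sigma = 3, 1.0
sigma_r = sigma / np.sqrt(n - 1.0)

dx = 0.01
xg = np.arange(-20.0, 20.0 + dx, dx)
w = np.exp(-xg ** 2 / (2.0 * sigma ** 2))
r = np.exp(-xg ** 2 / (2.0 * sigma_r ** 2))

out = (np.convolve(w, r, mode="same") * dx) ** n
out /= out.max()
width2 = float(np.sum(xg ** 2 * out) / np.sum(out))   # variance of the result

assert abs(width2 - sigma_r ** 2) < 1e-3              # width is sigma_r again
```

The convolution widens the profile to √(σ² + σ_r²) and the power n shrinks it by √n, so self-consistency singles out σ_r² = σ²/(n − 1).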
This shows that n must be larger than 1, as mentioned above, and that the width of the activity profile is the same as the width of the connectivity when n = 2. The amplitude of the activity profile now evolves according to

$$\tau\frac{dr_{max}}{dt} = -r_{max} + \frac{\left(\frac{1}{\sqrt{n}}\,w_{tot}\,r_{max}\right)^n}{s + v\rho\sqrt{\frac{2\pi}{m(n-1)}}\,\sigma\,r_{max}^m}, \tag{C.6}$$
which for n = m = 2 is the same as equation 7.3. Finally, traveling waves of activity can also be produced by shifting the synaptic weights, as in equation 8.1. The derivation is identical to the case discussed in the main text, except that the amplitude is given by the steady state of the expression above. In general, simulations of equation C.1 and of other variants based on divisive inhibition produced similar qualitative results (uniform rate, bistability, localized profiles of activity, traveling waves) as the equations used in the main text (see Salinas, in press).

Appendix D: Nonuniform Inhibition

The simulations in Figures 5 and 6 differed in one important aspect from the theoretical model: instead of all units effectively inhibiting each other uniformly, the inhibitory connections were spatially localized. This was to show that perfectly uniform inhibition, which was used in a detailed modeling study of persistent activity in prefrontal cortex (Compte et al., 2000), is not an essential requirement of these types of attractor models. An argument for why localized inhibition produces qualitatively the same dynamics can be outlined analytically. Again, the excitatory connections are gaussian (see equation 7.1). The basic idea is that, as discussed in appendix B, a combination of unimodal functions centered at the same point can be approximated quite accurately by a single gaussian; in this case, the combination is a ratio. As a shorthand, use G(a; σ) to indicate a gaussian function of a with standard deviation σ and unit amplitude. Suppose that h = 0 and that the inhibitory connections are given by

$$v(a, b) = v(a - b) = v_{max}\,G(a - b;\, \sigma_v). \tag{D.1}$$
Substitute solution 7.2 into the integral equation, equation 2.2, and note that there are now two convolutions, as the denominator now contains a term proportional to

$$\int G(a - b;\, \sigma_v)\, G^2(b - x;\, \sigma_r)\, db. \tag{D.2}$$
After the substitutions,

$$\tau\frac{dr_{max}}{dt}\,G(a - x;\, \sigma_r) = -r_{max}\,G(a - x;\, \sigma_r) + \frac{A\,G\!\left(a - x;\, \sqrt{\sigma_r^2 + \sigma^2}/\sqrt{2}\right)}{s + B\,G\!\left(a - x;\, \sqrt{\sigma_v^2 + \sigma_r^2/2}\right)}, \tag{D.3}$$
where A and B absorb all the amplitude terms. Here, all gaussian functions are centered at the same point. The unknowns are the maximum rate r_max and the width σ_r of the activity profile. The crucial observation is that the ratio of gaussian functions that appears above can be accurately approximated as a single gaussian of amplitude A/(s + B), and its width can be set by imposing that the areas under the curves match up (as was done in equations B.6 and B.7). The approximation fails if σ_v < σ, but as long as the effective spread of inhibition is larger than that of excitation, the error is very small. Thus, there are two constraints for the steady-state solution: that the amplitude of the approximating gaussian be A/(s + B) and that its width be equal to σ_r. Consistent with the case of uniform inhibition, the resulting expressions show that as σ_v increases, the width σ_r of the activity profile approaches σ. Although this procedure ultimately results in a system of coupled, nonlinear equations that is hard to solve, it shows that in principle, the same type of solution is found with localized inhibition. This is in agreement with the results of computer simulations of both the rate and spiking models.

Acknowledgments

This research was supported by start-up funds from Wake Forest University School of Medicine and by NIH grant 1 R01 NS044894-01. I thank Peter Latham for his thorough critique and Larry Abbott for his continuing support and mentoring.

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. New York: Cambridge University Press.
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cereb. Cortex, 7, 237–252.
Ben-Yishai, R., Bar-Or, R. L., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. U.S.A., 92, 3844–3848.
Blake, R., & Logothetis, N. K. (2002). Visual competition. Nat. Rev. Neurosci., 3, 13–23.
Brady, N., & Field, D. J. (2000). Local contrast in natural images: Normalization and coding efficiency. Perception, 29, 1041–1055.
Brunel, N. (2000). Persistent activity and the single cell f-I curve in a cortical network model. Network, 11, 261–280.
Brunel, N., & Wang, X.-J. (2001). Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. J. Comput. Neurosci., 11, 63–85.
Buccino, G., Binkofski, F., Fink, G. R., Fadiga, L., Fogassi, L., Gallese, V., Seitz, R. J., Zilles, K., Rizzolatti, G., & Freund, H. J. (2001). Action observation activates premotor and parietal areas in a somatotopic manner: An fMRI study. Eur. J. Neurosci., 13, 400–404.
Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science, 264, 1333–1336.
Carandini, M., Heeger, D. J., & Movshon, J. A. (1997). Linearity and normalization in simple cells of the macaque primary visual cortex. J. Neurosci., 17, 8621–8644.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Churchland, P. S., & Sejnowski, T. J. (1992). The computational brain. Cambridge, MA: MIT Press.
Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cereb. Cortex, 10, 910–923.
Dayan, P. (1998). A hierarchical model of binocular rivalry. Neural Comput., 10, 1119–1135.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Deneve, S., Latham, P. E., & Pouget, A. (1999). Reading population codes: A neural implementation of ideal observers. Nat. Neurosci., 2, 740–745.
Diesmann, M., Gewaltig, M.-O., & Aertsen, A. (1999). Stable propagation of synchronous spiking in cortical neural networks. Nature, 402, 529–533.
di Pellegrino, G., & Wise, S. P. (1993). Visuospatial versus visuomotor activity in the premotor and prefrontal cortex of a primate. J. Neurosci., 13, 1227–1243.
Duhamel, J.-R., Colby, C. L., & Goldberg, M. E. (1992). The updating of the representation of visual space in parietal cortex by intended eye movements. Science, 255, 90–92.
Ermentrout, G. B. (1998a). The analysis of synaptically generated traveling waves. J. Comput. Neurosci., 5, 191–208.
Ermentrout, G. B. (1998b). Neural networks as spatio-temporal pattern-forming systems. Rep. Prog. Phys., 61, 353–430.
Fadiga, L., Fogassi, L., & Rizzolatti, G. (1996). Action recognition in the premotor cortex. Brain, 119, 593–609.
Fellous, J.-M., & Sejnowski, T. J. (2000). Cholinergic induction of spontaneous oscillations in the hippocampal slice in the slow (.5–2 Hz), theta (5–12 Hz), and gamma (35–70 Hz) bands. Hippocampus, 10, 187–197.
Georgopoulos, A. P., Schwartz, A., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Gibson, J. R., & Maunsell, J. H. (1997). Sensory modality specificity of neural activity related to memory in visual cortex. J. Neurophysiol., 78, 1263–1275.
1472
E. Salinas
Received October 29, 2002; accepted January 2, 2003.
NOTE
Communicated by Terrence J. Sejnowski
Note on "Comparison of Model Selection for Regression" by Vladimir Cherkassky and Yunqian Ma

Trevor Hastie
[email protected]
Rob Tibshirani
[email protected]
Jerome Friedman
[email protected]
Department of Statistics, Stanford University, Stanford, CA 94305, U.S.A.
While Cherkassky and Ma (2003) raise some interesting issues in comparing techniques for model selection, their article appears to be written largely in protest of comparisons made in our book, Elements of Statistical Learning (2001). Cherkassky and Ma feel that we falsely represented the structural risk minimization (SRM) method, which they defend strongly here. In a two-page section of our book (pp. 212–213), we made an honest attempt to compare the SRM method with two related techniques, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). Apparently, we did not apply SRM in the optimal way. We are also accused of using contrived examples, designed to make SRM look bad.

Alas, we did introduce some careless errors in our original simulation: errors that were corrected in the second and subsequent printings. Some of these errors were pointed out to us by Cherkassky and Ma (we supplied them with our source code), and as a result we replaced the assessment "SRM performs poorly overall" with a more moderate "the performance of SRM is mixed" (p. 212). These and other corrections can be seen in the errata section online at http://www-stat.stanford.edu/ElemStatLearn.

1 Which Bound to Use?

Not being experts in the use of SRM, when writing this section we consulted Cherkassky and Mulier (1998). We found on their page 110 a general error bound to be used for regression. Happily, there are also recommendations for the parameters c and ε, which we used in our example. Apparently this was the wrong approach. We should have read on to page 122, where, buried in a case study in section 4.5, we find the "practical bound," which we should have used instead. To the noncognoscenti, the various SRM bounds are confusing, and it is very hard to find the one in vogue at any given moment.

Neural Computation 15, 1477–1480 (2003) © 2003 Massachusetts Institute of Technology
2 Choice of Examples

The examples we chose in our book for these comparisons were nonstandard, and designed to be used for comparisons of both regression and classification procedures. Nature does not generally deliver to us textbook examples to suit our models and procedures. We get what we get and try to use our procedures the best we can. As an aside, we note that throughout the book, we in fact favored K-fold cross-validation over other model-based selection procedures (such as the three considered here), because it is automatic and does not require special tuning and tweaking. So in selecting these simulation examples, we were not carrying a torch for any of the three methods being compared.

We are accused of selecting our particular (noise-free) "linear regression" example with malicious intent (to favor AIC, which tends to overfit). Our generative model has X ∼ U[0, 1]^20 and Y = I(Σ_{j=1}^{10} X_j > 5). Part of the authors' justification for this accusation is their Figure 9, where they use a five-dimensional version of this problem, for which the effects of overfitting are not evident (and hence AIC would not be hurt). The reason they reduced the dimension from 20 down to 5 was to avoid a "combinatorial explosion." The authors apparently did not realize that all-subsets regression can be efficiently computed for up to about p = 35 inputs using the branch-and-bound technique of Furnival and Wilson (1974). Figure 1 shows the test error for best-subset regression for five realizations from the original 20-dimensional model. Here we do see the effects of overfitting. In spite of the lack of an error term in the model, we see the characteristic bias-variance trade-off as we vary the size of the subset, and in each case the correct number, 10, is about the right size of the model.
It is worth emphasizing here that although the output might be a deterministic function of the inputs, one still pays a price in variance when predicting at a new input point; the variance is a result of the random positions of the inputs in the training set.

3 Effective Number of Parameters for K-Nearest Neighbor Regression

We have another quibble, which is the use of N/(1.5k) rather than N/k as the effective degrees of freedom for a nearest-neighbor regression estimator. Our use of N/k arises in the context of the AIC/C_p statistic, both being estimates of in-sample prediction error. The context here is that of an additive-error model, Y = f(X) + ε. It is easy to show that N/k is exactly the correct quantity to use for the DoF. Using the notation in our book (p. 202), one can show that

    Err_in = E_y(err) + 2 · (1/k) · σ_ε².

The term 1/k compares to the usual DoF/N, which proves the result.
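The identity DoF = N/k is easy to check by simulation. The sketch below (a minimal numpy illustration; the grid, k, noise level, and seed are arbitrary choices, and the k-NN fit averages the k nearest training points including the point itself) estimates the optimism Err_in − E_y(err) and compares it with 2σ_ε²/k:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, sigma = 50, 5, 1.0
x = np.linspace(0.0, 1.0, N)
f = np.sin(2 * np.pi * x)                 # true regression function

# Linear smoother matrix S for k-NN: each fitted value is the mean of the
# k nearest x's (including the point itself), so trace(S) = N/k.
S = np.zeros((N, N))
for i in range(N):
    S[i, np.argsort(np.abs(x - x[i]))[:k]] = 1.0 / k

reps = 20000
train_err = np.empty(reps)
in_err = np.empty(reps)
for r in range(reps):
    y = f + sigma * rng.standard_normal(N)       # training responses
    y_new = f + sigma * rng.standard_normal(N)   # fresh responses at the same x's
    yhat = S @ y
    train_err[r] = np.mean((y - yhat) ** 2)
    in_err[r] = np.mean((y_new - yhat) ** 2)

optimism = in_err.mean() - train_err.mean()
print(optimism, 2 * sigma ** 2 / k)   # optimism is close to 2*sigma^2/k = 0.4
```

The average gap between in-sample and training error matches 2σ_ε²/k, i.e., 2(DoF/N)σ_ε² with DoF = N/k.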
[Figure 1 appears here: five test-error curves, labeled 1–5, plotted against subset size; y-axis Test Error (roughly 0.15–0.25), x-axis Subset Size (ticks at 10, 15, 20).]
Figure 1: Test error curves for five realizations of the 20-dimensional regression example. Shown is the test error for the best-subset models of each size, fit by linear regression to training sets of size 50. In each case, the test error starts to increase beyond a size of about 10, and hence shows evidence of increased variance due to overfitting.
4 Conclusion

Although all three of the model selection techniques in question here are somewhat sensitive to the choices of the parameters involved, it seems that SRM is the most difficult to apply. For a statistical method to become widely accepted, it needs to be easy to use and robust. It should work well in a wide variety of situations without requiring expert tuning in each new setting. Rather than blaming the user (us), Cherkassky and Ma should focus on making their method simple and reliable. That way it can be applied successfully by people other than world experts in SRM theory.

References

Cherkassky, V., & Ma, Y. (2003). Comparison of model selection for regression. Neural Computation, 15, 1691–1714.
Cherkassky, V., & Mulier, F. (1998). Learning from data. New York: Wiley.
Furnival, G., & Wilson, R. (1974). Regression by leaps and bounds. Technometrics, 16, 499–511.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference and prediction. New York: Springer-Verlag.

Received January 21, 2003; accepted January 23, 2003.
LETTER
Communicated by David Horn
Spike-Timing-Dependent Plasticity and Relevant Mutual Information Maximization

Gal Chechik
[email protected]
Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, Israel
Synaptic plasticity was recently shown to depend on the relative timing of the pre- and postsynaptic spikes. This article analytically derives a spike-dependent learning rule based on the principle of information maximization for a single neuron with spiking inputs. This rule is then transformed into a biologically feasible rule, which is compared to the experimentally observed plasticity. This comparison reveals that the biological rule increases information to a near-optimal level and provides insights into the structure of biological plasticity. It shows that the time dependency of synaptic potentiation should be determined by the synaptic transfer function and membrane leak. Potentiation consists of weight-dependent and weight-independent components whose weights are of the same order of magnitude. It further suggests that synaptic depression should be triggered by rare and relevant inputs but at the same time serves to unlearn the baseline statistics of the network's inputs. The optimal depression curve is uniformly extended in time, but biological constraints that cause the cell to forget past events may lead to a different shape, which is not specified by our current model. The structure of the optimal rule thus suggests a computational account for several temporal characteristics of the biological spike-timing-dependent rules.

1 Introduction

Temporal aspects of synaptic connections have recently become a key research topic. First, it was demonstrated that synapses exhibit short-term facilitation and depression (Abbott, Varela, Sen, & Nelson, 1997; Markram, Lubke, Frotscher, & Sakmann, 1997), and synaptic connections with various dynamics were shown to be expressed following learning (Monyer, Burnashev, Laurie, Sakmann, & Seeburg, 1994; Tang et al., 1999). Moreover, there is now ample evidence that changes in synaptic efficacies depend on the relative timing of pre- and postsynaptic spikes.
Neural Computation 15, 1481–1510 (2003) © 2003 Massachusetts Institute of Technology

This conflicts with the concept of learning correlated activity ("those who fire together wire together"), which was the common interpretation of Hebbian learning several years ago. Recent findings show that synapses between two excitatory neurons are potentiated when a presynaptic spike closely precedes a postsynaptic
one, whereas spiking activity in reverse order results in synaptic depression. These effects have been demonstrated in a variety of systems and preparations, such as hippocampal studies in vivo (Levy & Steward, 1983), cultures of hippocampal neurons (Debanne, Gahwiler, & Thompson, 1994; Bi & Poo, 1999), retinotectal synapses of Xenopus (Zhang, Tao, Holt, Harris, & Poo, 1998), mammalian cortex (Markram et al., 1997; Feldman, 2000; Froemke & Dan, 2002), and others (Magee & Johnston, 1997; Bell, Han, Sugawara, & Grant, 1997; Debanne, Gahwiler, & Thompson, 1998). A recent comparative review of the different spike-dependent learning rules can be found in Roberts and Bell (2002).

This new type of plasticity, sometimes termed spike-timing-dependent plasticity (STDP), has been studied in various theoretical frameworks, and some of its computational properties have been characterized. Importantly, under certain assumptions about the relative magnitude of synaptic potentiation and depression, STDP embodies an inherent competition between incoming inputs, and thus results in normalization of the total incoming synaptic strength (Kempter, Gerstner, & van Hemmen, 1999, 2001), and maintains the irregularity of neuronal spike trains (Abbott & Song, 1999; Song, Miller, & Abbott, 2000). STDP may also play an important role in sequence learning (Mehta, Quirk, & Wilson, 1999; Rao & Sejnowski, 2000) and lead to the emergence of synchronous subpopulation firing in recurrent networks (Horn, Levy, Meilijson, & Ruppin, 2000). The dynamics of synaptic efficacies under the operation of STDP strongly depends on whether STDP is implemented additively (independent of the baseline synaptic value) or multiplicatively (where the change is proportional to the synaptic efficacy) (Rubin, Lee, & Sompolinsky, 2001; Aharonov, Guttig, & Sompolinsky, 2001).
Additive learning leads to strong competition between synapses and is intrinsically unstable, while supra-additive learning may involve inherent stabilization and allow a wide range of synaptic values.¹

This article takes a different approach to the investigation of temporal aspects of learning. Whereas the common approach is to model STDP and study its properties, we start by analytically deriving spike-dependent learning from first principles in a simple rate model, show how it could be approximated by a biologically feasible rule, and compare the latter with the experimentally observed STDP. This comparison provides computational insights into the temporal structure of the experimentally observed STDP and shows that under certain biological constraints, STDP approximates the performance of the optimal rule.

To this end, we apply the generic principle of information maximization (Linsker, 1988), which states that the goal of a neural network's learning procedure is to maximize the mutual information between its output and
¹ Additive and multiplicative learning may be viewed as two instances of a continuum of the form ΔW = ηW^α, where α = 0 corresponds to additive learning, α = 1 corresponds to multiplicative learning, and any α > 0 is termed supra-additive.
input. This principle, known as Infomax, was applied in Linsker (1992) to a noisy linear network of real-valued ("rate") input neurons with a multivariate gaussian distribution. It yielded a two-phased learning rule: a Hebbian learning phase when a signal is presented to the network and anti-Hebbian learning when only noise is presented. This article extends the Infomax principle and derives a learning rule that maximizes information about some relevant components of the inputs, not about its complete distribution. A naive interpretation of the Infomax principle asserts that information about the inputs of the network should be maximized. However, this criterion may be inappropriate in the context of information processing in the brain, which is not targeted to reproduce the sensory inputs, but rather to extract their behaviorally relevant components. We therefore consider a variant of the Infomax principle in which information should be maximized about the identity of the input, allowing for the extraction of relevant information.

The article is organized as follows. Section 2 describes the optimization task and the model. Section 3 derives a gradient-ascent learning rule that maximizes input-output relevant mutual information. Section 4 shows how this rule can be approximated by a biologically feasible learning rule and discusses its properties. Section 5 studies possible extensions for the case of a limited supervised signal, and section 6 investigates learning procedures for the synaptic transfer function. The importance of these results and insights into experimentally observed plasticity are discussed in section 7.

2 The Model

We study a generic learning task in a network with N input neurons S_1, …, S_N firing spike trains and a single output (target) neuron Y. At any point in time, the target neuron integrates its inputs with some continuous temporal filter F, due to voltage attenuation and a synaptic transfer function:

    Y(t) = Σ_{i=1}^{N} W_i X_i(t);   X_i(t) ≡ ∫_{−∞}^{t} F_τ(t − t′) S_i(t′) dt′;   ∫_{0}^{∞} F_τ(t) dt = 1,   (2.1)
where W_i is the amplitude of the synaptic efficacy between the ith input neuron and the target neuron, and S_i(t) = Σ_{t_spike} δ(t − t_spike) is the ith spike train. The filter F may be used to consider general synaptic transfer functions and voltage decay effects. For example, voltage attenuation due to leak currents in a passive membrane is realized as an exponential filter F_τ(t) = (1/τ) exp(−t/τ), ∀t > 0, with τ being the membrane time constant.

The learning goal in our model is to set the synaptic weight vector W such that M + 1 uncorrelated patterns of input activity ξ^η(t) (η = 0…M) may be discriminated. More formally, the goal is to maximize the mutual
information between the output Y and the identity of the input pattern η. Each pattern ξ^η(t) determines the firing rates of all N input neurons as a function of time. The actual input S_i(t) is a stochastic realization of ξ_i(t). The input patterns are presented for short periods of length T. At each period, a pattern ξ^η is randomly chosen for presentation with probability p_η. Most of the patterns are rare (Σ_{η=1}^{M} p_η ≪ 1), but ξ^0 is abundant and may be thought of as a background noisy pattern. This assumption is based on the common scenario of a neuron (e.g., a face cell in IT cortex) that most of the time is exposed to irrelevant stimuli and is rarely presented with stimuli it is tuned to.

Two aspects of this model should be emphasized. First, unlike Linsker (1992), information is maximized not about the input values but about the identity of the presented pattern. This follows the idea that the goal of a neural system is not to reproduce the representation of the inputs but to extract relevant information. Second, the input patterns determine the modulating rates that underlie the input spike trains. These rates are not observable, and therefore any learning procedure must depend on the observable input spikes that realize the underlying rates. Therefore, the fact that a learning rule depends on spikes does not necessarily mean that information is coded in spikes instead of in the underlying rates.

3 Mutual Information Maximization

The goal of this section is to derive a learning algorithm that maximizes the input-output mutual information of the above model by changing the synaptic weights. We first focus on the modification of the synaptic magnitudes W, while keeping the temporal filter F fixed. Learning the optimal temporal filter is discussed in section 6.3.

3.1 Deriving a Gradient-Ascent Learning Rule. Let us focus on a single presentation period and look at the value of Y = Y(T) at the end of this period.
Assuming that the temporal filter is mostly concentrated in a time period of length T and omitting the notation of t, we obtain from equation 2.1

    Y = Σ_{i=1}^{N} W_i X_i;   X_i = ∫_{−T}^{0} F_τ(0 − t′) S_i(t′) dt′.   (3.1)
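Equations 2.1 and 3.1 amount to convolving each spike train with the normalized exponential filter and forming a weighted sum; a minimal discretized sketch (all rates, sizes, time constants, and the seed are illustrative choices, not values from the article):

```python
import numpy as np

rng = np.random.default_rng(2)
dt, T, tau, N = 0.001, 0.2, 0.02, 100   # 1 ms bins, 200 ms period, 20 ms leak
t = np.arange(0.0, T, dt)

rates = np.where(np.arange(N) < 10, 40.0, 5.0)          # 10 "foreground" cells at 40 Hz, rest at 5 Hz
spikes = rng.random((N, t.size)) < rates[:, None] * dt  # Poisson spike trains, one row per neuron

F = np.exp(-t / tau) / tau   # exponential filter F_tau(s) = (1/tau) exp(-s/tau); integral ~ 1

# X_i = sum over spikes of F(T - t_spike): spikes are unit impulses, so no dt factor
X = (spikes * F[::-1][None, :]).sum(axis=1)

W = rng.uniform(size=N)      # synaptic amplitudes
Y = W @ X                    # output at the end of the presentation period
print(Y)
```

The filtered inputs of the high-rate cells come out larger on average, which is what the learning rule derived below exploits.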
The mutual information (Shannon, 1948; Cover & Thomas, 1991; see also Linsker, 1992) between the output and pattern identity in this network is defined by

    I(Y; η) = h(Y) − h(Y | η);   h(Y) = −∫_{−∞}^{∞} f(y) log(f(y)) dy,   (3.2)
where h(Y) is the differential entropy (Cover & Thomas, 1991) of the distribution of Y, and h(Y | η) is the differential entropy of the Y distribution
given that the network is presented with a known input pattern η; f(Y) is the probability density function (p.d.f.) of Y. This mutual information measures how easy it is to decide which input pattern η is presented to the network by observing the network's output Y.

Consider the case where the number of input neurons is large and neurons fire independently. Under these conditions, according to the central limit theorem, when the target neuron is presented with the pattern ξ^η, its membrane voltage Y is a weighted sum of many uncorrelated inputs and is thus normally distributed, f(Y | η) = N(μ_η, σ_η²), with mean μ_η = ⟨W X^η⟩ and variance σ_η² = ⟨(W X^η)(W X^η)^T⟩ − ⟨W X^η⟩². The brackets denote averaging over the possible realizations of the inputs X^η when the network is presented with the pattern ξ^η.

Consider now the more complex case where the input neurons fire in a correlated manner according to the covariance matrix C_ij = Cov(X_i, X_j). Y is now a sum of correlated inputs, and its distribution converges to the normal distribution at a rate that is determined not by the number of input neurons but by the number of effectively independent variables. In particular, according to the theory of large samples (Lehman, 1998; Ferguson, 1996), if each neuron is correlated with at most m other neurons, their sum converges to the normal distribution, with a rate that is in the worst case slower by a factor of 2m (Ferguson, 1996). Consequently, even with the high levels of synchrony reported in some experiments performed in the mammalian cortex (e.g., Singer & Gray, 1995), the number of effectively independent inputs is still very large. This suggests that the input-dependent weighted sum of synaptic inputs can be well approximated by a normal distribution, with parameters that depend on the activity pattern. Formally, f(Y | η) ≈ N(μ_η, σ_η²), with mean μ_η = ⟨W X^η⟩ and variance σ_η² = ⟨(W X^η)(W X^η)^T⟩ − ⟨W X^η⟩².
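This gaussian approximation is easy to probe with a quick Monte Carlo sketch. Below, Poisson spike counts stand in for the filtered inputs X_i (a simplifying assumption; all numbers are illustrative), and the skewness and kurtosis of Y across trials are compared with the gaussian values 0 and 3:

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials = 1000, 5000
rates = np.where(rng.random(N) < 0.1, 40.0, 5.0)   # one fixed pattern eta
W = rng.uniform(size=N)

# Proxy for X_i: Poisson spike counts in a 100 ms window, one column per neuron.
counts = rng.poisson(rates * 0.1, size=(trials, N))
Y = counts @ W                                     # weighted sum over many inputs

z = (Y - Y.mean()) / Y.std()
skew = np.mean(z ** 3)    # near 0 for a gaussian
kurt = np.mean(z ** 4)    # near 3 for a gaussian
print(skew, kurt)
```

With N = 1000 effectively independent inputs, the standardized third and fourth moments land close to their gaussian values, as the central limit theorem argument predicts.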
Using the expression for the entropy of a gaussian (h(X) = (1/2) log(2πeσ²) when X ∼ N(μ, σ²)), we obtain

    h(Y | η) = Σ_{η=0}^{M} p_η h(Y | ξ^η) = Σ_{η=0}^{M} p_η (1/2) log(2πeσ_η²).   (3.3)
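Equations 3.2 and 3.3 can be evaluated directly for a small example: with gaussian conditionals, h(Y | η) has the closed form of equation 3.3, while h(Y) of the mixture is obtained by numerical integration. The parameters below (one abundant background and one rare pattern) are illustrative:

```python
import numpy as np

# Mixture of an abundant background gaussian (eta = 0) and a rare one (eta = 1).
p = np.array([0.9, 0.1])
mu = np.array([0.0, 3.0])
sigma = np.array([1.0, 1.0])

y = np.linspace(-10.0, 15.0, 20001)
dy = y[1] - y[0]
f_cond = np.exp(-(y[None, :] - mu[:, None]) ** 2 / (2 * sigma[:, None] ** 2)) \
         / np.sqrt(2 * np.pi * sigma[:, None] ** 2)
f = (p[:, None] * f_cond).sum(axis=0)            # mixture p.d.f. of Y

h_Y = -np.sum(f * np.log(np.clip(f, 1e-300, None))) * dy     # numeric h(Y)
h_Y_given = np.sum(p * 0.5 * np.log(2 * np.pi * np.e * sigma ** 2))  # equation 3.3
I_bits = (h_Y - h_Y_given) / np.log(2)
print(I_bits)   # between 0 and the ceiling H(p) = 0.469 bits
```

The result is bounded above by the entropy of the pattern prior, H(p) ≈ 0.469 bits here, and approaches that ceiling as the two gaussians separate.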
To further calculate the mutual information I(Y; η), we turn to calculate the second term of equation 3.2, the entropy h(Y). For this purpose, we note that f(Y) is a mixture of gaussians (each resulting from the presentation of a single input pattern), f(Y) = Σ_η f(Y | η) p_η, and use the assumption that Σ_{η=1}^{M} p_η is small with respect to p_0. The calculation of h(Y) is still difficult, but fortunately its derivative with respect to W_i is tractable. Using a first-order Taylor approximation, we obtain (see appendix A)

    ∂I(Y; η)/∂W_i = + Σ_{η=1}^{M} p_η (Cov(Y, X_i^η) K_1^η + E(X_i^η) K_2^η)
                    − Σ_{η=1}^{M} p_η (Cov(Y, X_i^0) K_0^η + E(X_i^0) K_2^η),

with

    K_0^η ≡ (μ_η − μ_0)²/σ_0⁴ + (σ_η² − σ_0²)/σ_0⁴,
    K_1^η ≡ 1/σ_0² − 1/σ_η²,
    K_2^η ≡ (μ_η − μ_0)/σ_0²,   (3.4)

where E(X_i^η) is the expected value of X_i^η as averaged over presentations of the ξ^η pattern. The derived gradient may be used for a gradient-ascent learning rule, by repeatedly calculating the distribution moments μ_η, σ_η that depend on W and updating the weights according to

    ΔW_i = λ ∂I(Y; η)/∂W_i
         = +λ Σ_{η=1}^{M} p_η (Cov(Y, X_i^η) K_1^η + E(X_i^η) K_2^η)
         − λ Σ_{η=1}^{M} p_η (Cov(Y, X_i^0) K_0^η + E(X_i^0) K_2^η).   (3.5)
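A compact simulation sketch of the update rule of equation 3.5 under the gaussian approximation, with a single rare pattern (M = 1) and independent Poisson-like inputs, so that Cov(Y, X_i^η) = W_i Var(X_i^η); all sizes, rates, the learning rate, and the seed are illustrative choices rather than the article's settings:

```python
import numpy as np

rng = np.random.default_rng(3)
N, lam, steps = 100, 0.05, 600
p0, p1 = 0.9, 0.1                      # abundant background, one rare pattern

# Mean filtered input per neuron under each pattern; Poisson-like inputs,
# so we take Var(X_i) = E(X_i) and assume independence across neurons.
E0 = np.full(N, 10.0)
delta = np.where(np.arange(N) % 2 == 0, 5.0, -5.0)
E1 = E0 + delta                        # pattern 1: half the cells elevated, half depressed
V0, V1 = E0, E1

def moments(W):
    return W @ E0, W @ E1, W ** 2 @ V0, W ** 2 @ V1

def mi_bits(W):
    """I(Y; eta) for the two-gaussian mixture; h(Y) integrated numerically."""
    mu0, mu1, var0, var1 = moments(W)
    s0, s1 = np.sqrt(var0), np.sqrt(var1)
    y = np.linspace(min(mu0 - 8 * s0, mu1 - 8 * s1),
                    max(mu0 + 8 * s0, mu1 + 8 * s1), 40001)
    f0 = np.exp(-(y - mu0) ** 2 / (2 * var0)) / np.sqrt(2 * np.pi * var0)
    f1 = np.exp(-(y - mu1) ** 2 / (2 * var1)) / np.sqrt(2 * np.pi * var1)
    f = p0 * f0 + p1 * f1
    hY = -np.sum(f * np.log(np.clip(f, 1e-300, None))) * (y[1] - y[0])
    hYg = p0 * 0.5 * np.log(2 * np.pi * np.e * var0) \
        + p1 * 0.5 * np.log(2 * np.pi * np.e * var1)
    return (hY - hYg) / np.log(2)

W = rng.uniform(0.0, 1.0, N)
I_start = mi_bits(W)
for _ in range(steps):
    mu0, mu1, var0, var1 = moments(W)
    K0 = (mu1 - mu0) ** 2 / var0 ** 2 + (var1 - var0) / var0 ** 2
    K1 = 1.0 / var0 - 1.0 / var1
    K2 = (mu1 - mu0) / var0
    # equation 3.5 with M = 1: Hebbian term on the rare pattern,
    # anti-Hebbian term on the background pattern
    W = W + lam * p1 * ((W * V1) * K1 + E1 * K2 - (W * V0) * K0 - E0 * K2)
I_end = mi_bits(W)
print(I_start, I_end)
```

In this sketch the separation |μ_1 − μ_0| grows while the variance terms restrain the weights, and the estimated mutual information ends higher than it starts, mirroring the behavior reported in Figure 1.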
Since this learning rule climbs along the gradient, it is guaranteed to converge to a local maximum of the mutual information. To demonstrate the operation of this learning rule, Figure 1 (left) plots the mutual information during the operation of the learning procedure as a function of time, showing that the network indeed reaches a (possibly local) maximum of the mutual information. When this simulation is repeated for different random initializations of the synaptic weights, the rise time of the mutual information varies across runs, but all runs reach the same mutual information end level.

The above analytical derivation requires the background pattern to be highly abundant, Σ_{η=1}^{M} p_η ≪ p_0. When this assumption does not hold, the performance of the learning rule can be studied numerically, to test whether the obtained learning rule is still beneficial. To this end, we repeated our simulations for the worst-case scenario, where two input patterns are presented with p_0 = p_1 = 0.5. In this case too, the simulation yielded a monotonic increase of mutual information until it reached a stable plateau, as shown in Figure 1 (right).

3.2 Dynamics Under the Operation of the Learning Rule. In order to understand the dynamics of the weights W under the operation of the gradient-ascent learning rule, consider now the two sums in equation 3.4. The first sum contains terms that depend on the pattern ξ^η through X_i^η, and
Spike-Timing-Dependent Plasticity and Information Maximization
Figure 1: Mutual information during learning with the gradient-ascent learning rule (see equation 3.4). All patterns were constructed by randomly setting 10% of the input neurons to fire Poisson spike trains at 40 Hz, while the remaining input neurons fire at 5 Hz. Poisson spike trains were simulated by discretizing time into 1 millisecond bins. Simulation parameters: λ = 1, N = 1000. (Left) 100 memory patterns with p_0 = 0.9, p_η = 0.001 for η > 0. (Right) Two memory patterns with p_0 = p_η = 0.5.
the second sum contains terms that depend on ξ^0 through X_i^0. When the K terms are positive, these two sums are positive. Thus, the weights that correspond to strong inputs of the background pattern ξ^0 are weakened, while the weights that correspond to strong foreground inputs are strengthened. This result is similar to the case of real-valued inputs investigated in Linsker (1992), where an information gradient-ascent rule alternates between Hebbian and anti-Hebbian learning. The main effect of this process is an increase of the differences |μ_η − μ_0|, while constraining the variances σ_η, thus providing better discrimination of the rare patterns ξ^η from the abundant ξ^0 pattern.

This combination of Hebbian and anti-Hebbian learning characterizes the mutual information target function we use and is not necessarily obtained with other target functions. For example, a maximum output variance criterion, whose optimal learning rule is the traditional correlation Hebbian rule, involves a global decay term that is not input specific. Appendix D compares these two target functions.

The intuition behind the increase in |μ_η − μ_0| stems from the information maximization criterion: aiming to split the outputs as sparsely as possible, the pattern with the highest prior probability is assigned an opposite sign compared to the other patterns. Figure 2 depicts the distribution of output values during learning as traced in simulation, showing how the output values split into two distinct gaussian-like bumps: one corresponding to the presentation of the background pattern ξ^0 and the other to the presentation of the rest of the patterns. Although in this figure, all rare patterns share a
Figure 2: Distribution of the output values Y after 100, 150, 200, and 300 learning steps. Outputs segregate into two distinct bumps: one that corresponds to the presentation of the ξ^0 pattern and a second that corresponds to the rest of the patterns. Simulation parameters are as in Figure 1 (left).
single bump; in the general case, different foreground patterns may have different bumps.

The mutual information manifold is a complex and nonconvex function of the weight vector that does not have a single maximum. In fact, any extremum point has a mirror extremum point with all weights having the opposite signs (see appendix B). A further analytical characterization of the information manifold is difficult, and therefore, in order to characterize the steady-state structure of the system, we repeated simulations for different random initializations of synaptic weights² and compared the synaptic efficacies at the steady state. Interestingly, these simulations found only two steady-state solutions for a given set of input patterns, and these were always a pair of mirror solutions (see appendix B). This suggests that the landscape of mutual information as a function of weight values contains only two large basins of attraction, into which the optimization process is drawn.
² The simulation was repeated with the same set of memory patterns 100 times under two conditions. First, weights were initialized as uniform random numbers between 0 and 1. Second, weights were initialized in (−1, 1). This was repeated for 10 different sets of memory patterns.
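The input ensemble used in these simulations (Figure 1) can be sketched as follows; `make_pattern` and `poisson_spikes` are illustrative helper names, not the author's code:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_pattern(n_inputs=1000, frac_active=0.10):
    """One memory pattern: 10% of inputs fire at 40 Hz, the rest at 5 Hz."""
    active = rng.random(n_inputs) < frac_active
    return np.where(active, 40.0, 5.0)          # per-input firing rate in Hz

def poisson_spikes(rates_hz, t_ms=20, dt_ms=1.0):
    """Poisson trains discretized into 1 ms bins: each bin spikes with
    probability rate * dt."""
    p_spike = rates_hz * dt_ms / 1000.0
    n_bins = int(round(t_ms / dt_ms))
    return (rng.random((n_bins, rates_hz.size)) < p_spike).astype(int)

rates = make_pattern()
spikes = poisson_spikes(rates)     # shape (time bins, inputs)
```

Each presentation of a pattern draws a fresh spike matrix from the same rates, so the filtered inputs X_i^η vary from trial to trial around their pattern-specific means.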
4 Toward Biological Feasibility

4.1 Deriving an Approximated Rule. With the goal of obtaining a biologically feasible spike-dependent learning rule that maximizes mutual information, we now turn to approximate the analytical learning rule derived above by a rule that can be implemented in biology. We follow four steps to modify the analytically derived rule into a more biologically plausible one.

First, biological synapses are limited to either excitatory or inhibitory regimes. The optimum has two mirror solutions, and for simplicity, we limit the weights W to positive values (but see our discussion on inhibitory synapses in section 7).

Second, we replace the terms {K_1^η, K_0^η, K_2^η}, which are global functions of W and ξ^η, with constants {λ_1, λ_0, λ_2}. These constants are optimized but remain fixed during the simulation and therefore cannot provide performance equal to that of the continuously changing K terms. However, if these constants are close to the K values, the changes in the weight vector have a positive projection on the gradient, and this approximation is sufficient for approaching the optimum. This suggests that high levels of information can be obtained for a large range of values of these constants, allowing cells to operate in a near-optimal regime. To test this idea, we measured the steady-state information for different values of the constants. Figure 3 shows that the steady-state level of information is not critically sensitive to the λ values, since high information levels are obtained for a large range of their values.

Third, we implement an on-line learning mode instead of a batch one by replacing explicit summation over patterns with stochastic averaging over the presented patterns. Because summation is performed over the rare patterns alone (see equation 3.4), stochastic averaging is naturally implemented by restricting learning to the presentation of foreground (rare) patterns.
Appendix C discusses alternative learning rules in which learning is triggered by the presentation of both the background and foreground patterns, showing that these rules are not robust to fluctuations in p_η and result in lower mutual information. We thus restrict learning to the presentation of foreground patterns only,³ which yields the following learning rule:

ΔW_i = (λ_1 Cov(Y, X_i^η) + λ_2 E[X_i^η]) − (λ_0 Cov(Y, X_i^0) + λ_2 E[X_i^0])
when ξ^η is presented, for η = 1..M only.   (4.1)
Fourth, we replace the explicit dependence on average quantities E(X) and Cov(Y, X) by stochastic averaging over spikes. This yields a spike-dependent learning rule. In the case of inhomogeneous Poisson spike trains

³ This requires a weak form of supervision and is discussed in section 5.
Figure 3: Mutual information as a function of the learning parameters λ_1 and λ_0. High information levels are obtained for a wide range of parameter values. Simulation parameters: λ_2 = 0.1, λ = 1.0, M = 20, N = 1000, p_0 = 0.9, p_η = 0.001, T = 20 ms.
where input neurons fire independently,⁴ the expectation terms obey E(X_i) = ∫_{−T}^0 F(t′) E(S_i(t′)) dt′, and the covariance terms obey (Kingman, 1993)

Cov(Y, X_i) = Σ_j W_j Cov(X_j, X_i) = W_i Var(∫_{−T}^0 F(t′) S_i(t′) dt′)
            = W_i ∫_{−T}^0 Var(F(t′) S_i(t′)) dt′
            = W_i ∫_{−T}^0 F²(t′) E(S_i(t′)) dt′.   (4.2)
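The identity in equation 4.2 can be checked by Monte Carlo for independent inputs; the sketch below uses Bernoulli time bins as a discrete Poisson surrogate, so the per-bin variance of S_i(t) is p(1 − p) rather than the continuous-time Poisson value p (all sizes and rates are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, trials = 10, 20, 20000
W = rng.uniform(0.0, 1.0, N)
F = np.exp(-np.arange(T) / 10.0)          # hypothetical synaptic filter over T bins
p = rng.uniform(0.01, 0.05, N)            # per-bin spike probability of each input

S = (rng.random((trials, T, N)) < p).astype(float)   # independent spike bins
X = np.einsum('t,ktn->kn', F, S)          # X_i = sum_t F(t) S_i(t)
Y = X @ W

# Empirical Cov(Y, X_i) versus the closed form W_i * sum_t F(t)^2 * Var(S_i(t)).
cov_emp = ((Y - Y.mean())[:, None] * (X - X.mean(0))).mean(0)
cov_theory = W * (F ** 2).sum() * p * (1 - p)
```

Because the inputs are independent, all cross terms W_j Cov(X_j, X_i), j ≠ i, vanish and only the diagonal term survives, as the derivation states.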
The expectations E(X_i^η) for η > 0 may simply be estimated by weighted averages of the observed spikes S_i^η that precede the learning moment. Estimating E(X_i^0) is more difficult because, as discussed in appendix C, learning should be triggered solely by the rare patterns. Namely, ξ^0 spikes should have an effect only when a rare pattern ξ^η is presented. A reasonable approximation can be obtained using the fact that ξ^0 is highly frequent, by

⁴ Relaxation of these assumptions and their effect on the learning rule are discussed in section 6.
averaging over all spikes in long time periods. In this case, the highest reliability for estimating ξ^0 spikes is obtained if spikes are weighted uniformly everywhere except during the presentation of ξ^η. To see this, consider a stationary Poisson neuron that fires at a rate E[S(t)] = p, which has to be estimated by a weighted average p̂ = ∫_a^b f(t) S(t) dt, where ∫_a^b f(t) dt = 1. The variance of the estimator is ∫_a^b f²(t) dt Var[S(t)], which is minimized for f(t) = const. This optimal function extends infinitely in time and therefore cannot be realized in biological cells, since these have a limited memory span. When the memory span of a neuron is limited by L, that is, its weighting function is constrained to be zero for |t| > L, the optimal weighting function is f(t) = 1/(2L − T) for all t ∈ (−L, −T) ∪ (0, L) and zero otherwise. It should be stressed that different constraints on the memory span of the cell, reflecting limitations of the biological hardware, may lead to weighting functions that are different from a uniform f. The exact form of the weighting function for the background spikes depends on such constraints and therefore cannot be predicted by our analysis here. Formally, the following rule is activated only when one of the foreground patterns (ξ^η, η = 1..M) is presented:

ΔW_i = ∫_{−T}^0 [λ_1 W_i F²(t) + λ_2 F(t)] S_i(t) dt − ∫_{−∞}^{∞} [λ_0 W_i f²(t) + λ_2 f(t)] S_i(t) dt.   (4.3)
This learning rule uses a weak form of supervision signal that activates learning when one of the foreground patterns is presented but does not require an explicit error signal. Its properties are discussed in section 5.

4.2 Comparing Performance. In the previous section, we derived a biologically implementable spike-dependent learning rule that approximates an information maximization learning rule. But how good are these approximations? Does learning with the biologically feasible learning rule increase mutual information, and to what level? The curves in Figure 4 compare the mutual information of the learning rule of equation 3.4 with that of equation 4.3, as traced in a simulation of the learning process. Apparently, the approximated learning rule achieves fairly good performance compared to the optimal rule, and most of the reduction in performance is due to limiting the weights to positive values.

4.3 Interpreting the Learning Rule Structure. In order to obtain intuition into the approximated learning rule, we now examine the components of equation 4.3 and demonstrate them pictorially in Figure 5.

First, synaptic potentiation in equation 4.3 is temporally weighted in a manner that is determined by the same filter F that the neuron applies over
Figure 4: Comparing the optimal (see equation 3.4) and approximated (see equation 4.3) learning rules. The three curves correspond to the gradient-ascent rule (equation 3.4), the rule with the positive-weights limitation, and the spike-dependent rule (equation 4.3). Poisson spike trains were simulated by discretizing time into 1 millisecond bins. For each pattern, 10% of the input neurons were set to fire at 40 Hz, while the rest fire at 5 Hz. A lower bound of zero was imposed on all weights. Simulation parameters: λ = 1.0, M = 20, N = 2000, λ_1 = 0.15, λ_0 = 0.05, λ_2 = 0.1, p_0 = 0.9, p_η = 0.001, T = 20 ms.
its inputs. The learning curve involves an average of F (i.e., ∫ F(t − t′) S(t′) dt′) and of F² (i.e., ∫ (F(t − t′))² S(t′) dt′). To demonstrate the shape of the resulting learning rule, we choose the filter F to be an exponential filter, F(t) = (1/τ) exp(−t/τ), which corresponds to the voltage decay in a passive membrane with time constant τ. In this case, the squared filter F² is also an exponential curve but with time constant τ/2. Figure 5A presents the weight-dependent (filtered by F²) and weight-independent (filtered by F) potentiation components for the exponential filter. The relative weighting of these two components (determined by the K terms) was numerically estimated by simulating the optimal rule of equation 3.4 and was found to be on the same order of magnitude (demonstrated in Figure 6). This suggests that learning should contain weight-dependent and -independent components, in agreement with investigations of additive versus multiplicative STDP implementations (Rubin et al., 2001).

Second, equations 3.4 and 4.3 suggest that synaptic depression is activated when a foreground pattern is presented but serves to learn the under-
Figure 5: The temporal characteristics of the approximated learning rule of equation 4.3, showing the change in synaptic weight ΔW as a function of the temporal difference between the learning time t(learning) and the input spike time t(pre). (A) The potentiation curve (solid line) is the sum of two exponents with time constants τ and τ/2 (dashed lines; the weight-independent and weight-dependent components). (B) A combination of the potentiation curve of subplot A and a uniform depression curve, with L = 45, T = 30.
Figure 6: Traces of the values of K_0^1 (top), K_1^1 (middle), and K_2^1 (bottom), as numerically calculated during the simulation of the gradient-ascent rule, showing that all three values are of the same order of magnitude. Simulation parameters as in Figure 5.
lying structure of the background activity. As discussed in section 4.2, the optimal depression learning curve extends infinitely in time, but this requires long memory that is not biologically feasible. Using a limited memory span of size L allows combining it with the potentiation curve into a single feasible rule. A learning rule that combines both potentiation and depression is presented in Figure 5B.

A major difference between the spike-triggered rule of equation 4.3 and the experimentally observed STDP is the learning trigger. In equation 4.3, learning is triggered by an external learning signal that corresponds to the presentation of rare patterns, while in the experimentally observed rule, it is triggered by the postsynaptic spike. The possible role of the postsynaptic spike is discussed in the following section.

5 Limited Supervised Learning Signal

We have thus far considered a learning scenario in which the system was allowed to use an explicit and continuous external learning signal whenever one of the rare patterns was presented. To take a further step toward biological plausibility, we turn to investigate a more plausible implementation of the supervised learning signal. Such a signal may be noisy, unreliable, and probably available for only a limited period. In biological neural networks, the output spike of an excitatory neuron usually signals the presence of an interesting (or a rare) pattern. The postsynaptic spike is therefore a natural candidate to serve as a learning signal in our model, in which learning should be triggered by the presentation of the rare patterns.⁵

The derivation in the previous sections aimed to maximize the information I(Y; η) in the membrane voltage Y. If, however, a neuron is forced to transmit a binary signal, the strategy that minimizes information loss when translating the continuous-valued Y into a binary value is to spike when one of the rare foreground patterns η ≥ 1 is presented.⁶
We thus turn to investigate whether the postsynaptic spike (signaling the presence of interesting input patterns) may be used as a learning signal. This yields a learning procedure identical to equation 4.3, except that this time, learning is triggered by postsynaptic spikes. The resulting learning rule is somewhat similar to

⁵ The postsynaptic spike is not the only candidate for providing a learning signal; high but subthreshold depolarization may also testify to the presentation of a foreground pattern. Indeed, there is evidence that learning is triggered by high subthreshold depolarization (Debanne et al., 1994).

⁶ Note, however, that as proved in appendix B, two mirror solutions can be achieved by our analytically derived rule. The first one potentiates synapses that correspond to strong inputs and strongly depolarizes the target cell when rare patterns are presented. In this case, output spikes signal the presence of rare patterns. Under the mirror solution, the spikes of the target neuron signal the background pattern, leading to higher mean firing rates. This high-activity solution is considered in the literature as less desirable in terms of energy consumption.
previous models of the experimentally observed STDP (Kempter et al., 1999; Song et al., 2000; Rubin et al., 2001; Kempter et al., 2001), although we keep the form of the learning rule derived earlier. To this end, we have simulated an integrate-and-fire neuron that receives additional strong depolarizing input when foreground patterns are presented. This input usually causes the neuron to fire several milliseconds after the beginning of pattern presentation. As before, the mutual information was traced along time. The main difference between this simulation and our previous ones was that the target neuron was sometimes spontaneously firing even when presented with the background pattern and that the time of the postsynaptic spike was not locked to the end of stimulus presentation. In addition, the integrate-and-fire model embodies a different behavior of the membrane leak. While the model of equation 2.1 accounts for local leaks that influence the voltage from a single input, the membrane leak in the integrate-and-fire model depends on the membrane voltage, which in turn depends on all converging input to the target neuron.

The external signal was supplied for a limited learning period. After this period, learning may still occur, but the postsynaptic spikes that trigger it are no longer supervised but controlled by inputs only. If the network has learned to represent the presented input patterns faithfully, the target neuron will faithfully spike on foreground pattern presentation and serve to keep the network at a stable and informative state. Figure 7 traces the mutual information along and after learning, showing that the system stays in the informative state even after the external supervised learning signal is turned off.

6 Extending the Model

Our analysis derived a biologically feasible learning rule that is near optimal when the statistics of the input spike trains obey several limitations.
This section considers a wider family of input spike train statistics and discusses the changes in the learning rule needed in order to take these statistics into account in an optimal manner. We present three types of extensions: correlated inputs, non-Poisson spike trains, and learnable synaptic transfer functions. Although all three aspects may occur in conjunction, we consider them separately here for simplicity of exposition.

6.1 Correlated Inputs. Section 4 derived a learning rule for uncorrelated input spike trains. We now turn to consider input cross-correlations and show how these correlations change the covariance term of equation 4.1 but preserve the general structure of the spike-dependent learning rule. Consider the case where input neurons do not fire independently but in a correlated manner, such that their filtered spike trains X_1^η, …, X_N^η have the covariance matrix C_ij^η = Cov(X_i^η, X_j^η). If these correlations obey the
Figure 7: Mutual information along learning with the optimal learning rule (see equation 3.4) and with the limited-supervision spike-triggered rule, in a simulation of an integrate-and-fire neuron (threshold = −55 mV, resting potential = −70 mV, membrane time constant = 10 milliseconds). Learning occurred whenever the target neuron fired a spike. During the first 20,000 simulation steps (the limited-supervision phase), a depolarizing input of 5 millivolts was injected into the neuron whenever one of the foreground patterns was presented (which occurred with probability 5%). This input sometimes resulted in a spike that in turn initiated learning. Following this learning phase, the neuron continues to preserve the discrimination between background and foreground patterns even in the absence of the supervision signal.
conditions discussed in section 3.1, then the weighted sum Y = Σ_i W_i X_i^η is normally distributed, f(Y | η) = N(μ_η, σ_η²), with mean μ_η = ⟨W X^η⟩ and variance σ_η² = W C^η W^T. In this case, the covariance term in equation 4.2 changes into

Cov(Y, X_i) = Σ_j W_j Cov(X_j, X_i)
            = Σ_j W_j Cov(∫_{−T}^0 F(t′) S_j(t′) dt′, ∫_{−T}^0 F(t) S_i(t) dt)
            = Σ_j W_j ∫_{−T}^0 F²(t) Cov[S_j(t), S_i(t)] dt.   (6.1)
In order to take these correlations into account, the learning rule of equation 4.1 becomes

ΔW_i = (λ_1 Σ_j W_j ∫_{−T}^0 F²(t) Cov[S_j^η(t), S_i^η(t)] dt + λ_2 E[X_i^η])
     − (λ_0 Σ_j W_j ∫_{−T}^0 F²(t) Cov[S_j^0(t), S_i^0(t)] dt + λ_2 E[X_i^0])
when ξ^η is presented, for η = 1..M only.   (6.2)
When presented with the pattern η, this learning rule increases more strongly the synapses from neurons that are positively correlated with other input neurons. More interesting, while the first term of equation 6.2 always has a positive component W_i ∫_{−T}^0 F²(t) Var(S_i(t)) dt, it may be counterbalanced by negative components Cov(S_j(t), S_i(t)) for j ≠ i and may operate to weaken the synaptic strength of neurons that are active in foreground patterns. These effects are not unique to Infomax learning but also characterize maximum variance learning (principal components analysis) and its implementations in traditional covariance-based learning (see appendix D). As with any other learning rule that takes into account input correlations, implementing the learning rule of equation 6.2 in biology requires the target neuron to measure coactivities of its synaptic inputs. The physiological mechanisms of such correlation-dependent synaptic changes are still largely unknown, although they may in principle be realized to some extent through nonlinear processing in the dendrites (Segev & London, 2000).

6.2 Non-Poisson Spike Trains. The previous section discussed correlations between different inputs. Here, we consider temporal correlations in the spike trains of an individual input neuron. Let S_i(t) be a point process that is no longer Poisson but has covariance Cov[S_i(t), S_i(t′)] for all t ∈ [−T, 0]. Under these conditions, the covariance term in equation 4.2 changes into

Cov(Y, X_i) = W_i Var(X_i) = W_i ∫_{−T}^0 ∫_{−T}^0 F(t) Cov[S_i(t), S_i(t′)] F(t′) dt′ dt.   (6.3)

As in the case of cross-correlations described in section 6.1, the positive term Cov[S_i(t), S_i(t)] = Var(S_i(t)) may now be counterbalanced by negative covariance terms Cov[S_i(t), S_i(t′)] for t ≠ t′ if spikes are anticorrelated.
As an example, consider a stationary input spike train with E(S(t)) = p and autocorrelations Cov[S(t), S(t′)]. Focus on two points in time, t and t′ = t − Δt, and denote p(S(t′) = 1 | S(t) = 1) = p′_Δt. When p′_Δt < p, the autocorrelation is negative, as in the case where the two spikes are within the neuronal refractory period. With this notation, the covariance can be written as Cov[S(t′), S(t)] = p p′_Δt − p² = p p′_Δt g(Δt), with g(Δt) = 1 − p/p′_Δt.

The spike-triggered learning rule thus includes two components: synaptic changes induced by single input spikes, as described in section 3, and synaptic changes induced by pairs of spikes. Spike pairs occur with probability p p′_Δt and affect the synapse with a magnitude g(Δt). The pairwise component of the learning rule can therefore be implemented by changing the synapse by ΔW_i = F(t) g(t − t′) F(t′) whenever a pair of spikes occurs at times t and t′. The function g depends on the input's statistics (its firing rate and autocorrelations) and is used by the learning rule. Optimized learning rules can therefore be achieved when g matches the baseline statistics of the input. In the case of negative autocorrelations, the covariance term and the function g(Δt) are negative; thus, the first input spike may be regarded as inhibiting the synaptic change induced by the second input spike.

What is the relation between these autocorrelation-sensitive rules and the STDP observed in neural tissues? Do neural systems estimate and use autocorrelations in the inputs to optimize their synaptic learning rule? Interestingly, while early work focused on measuring changes in excitatory postsynaptic potential (EPSP) as a function of time differences between single pre- and postsynaptic spikes, the effects of complex and natural spike trains were investigated recently by Froemke and Dan (2002).
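The pairwise component ΔW_i = F(t) g(t − t′) F(t′) described above can be sketched directly; the refractory-like conditional probability below is a made-up example of negative autocorrelation at short time differences (distinct spike times are assumed, since g diverges at Δt = 0):

```python
import numpy as np

def pair_contributions(spike_times, F, g):
    """Pairwise component of the learning rule: every pair of input spikes at
    times t, t' (relative to the learning moment) contributes F(t) g(t-t') F(t')."""
    dw = 0.0
    for a in range(len(spike_times)):
        for b in range(a):
            t, tp = spike_times[a], spike_times[b]
            dw += F(t) * g(t - tp) * F(tp)
    return dw

p = 0.04                              # baseline per-bin firing probability

def p_cond(dt):
    """Hypothetical refractory-like p(S(t')=1 | S(t)=1), suppressed at short |dt|."""
    return p * (1.0 - np.exp(-abs(dt) / 5.0))

def g(dt):
    return 1.0 - p / p_cond(dt)       # negative whenever p_cond < p

def F(t):
    return np.exp(t / 10.0) / 10.0 if -30 <= t < 0 else 0.0
```

With this g, a pair of nearby input spikes yields a negative pairwise contribution: the first spike inhibits the change induced by the second, as the text describes.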
The structure of patterns of presynaptic spikes was found to have pronounced effects on STDP, suggesting that synaptic plasticity is indeed sensitive to autocorrelations and input structures. Interestingly, an input spike that preceded a pair of input-output spikes that caused an EPSP change was found to suppress the magnitude of STDP. This is in agreement with the above analysis for the case of negative autocorrelation at short time differences. Sensitivity to complex patterns was observed not only for presynaptic spikes (as in the above analysis) but also for output spike patterns. An information maximization study of these effects therefore requires more detailed assumptions on the nature of coding with output spikes and exceeds the scope of this article.

6.3 Learning the Synaptic Transfer Function. The above analysis focused on changes in the magnitude of synaptic efficacy. As our model includes the shape of the synaptic transfer function, we can extend the above analysis to derive learning rules that change not only the amplitude of the synaptic value W_i but also the shape of the synaptic transfer function F_i(t). This is achieved by differentiating the mutual information I(Y; η) with respect to F_i(t) in a similar manner to the derivation of ∂I(Y; η)/∂W_i, and yields

Cov(Y, S_i(t)) = Cov(Σ_{j=1}^N W_j ∫ F_j(0 − t′) S_j(t′) dt′, S_i(t))
              = Cov(W_i F_i(0 − t) S_i(t), S_i(t))
              = W_i F_i(0 − t) Var(S_i(t)) ≈ W_i F_i(0 − t) E(S_i(t)).   (6.4)
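A one-step numeric reading of equation 6.4: the gradient with respect to F_i(t) is proportional to F_i(t) itself, so the filter's largest values receive the largest absolute increase (the filter shape, rate, and step size below are hypothetical):

```python
import numpy as np

t = np.arange(0, 50)                  # time since the input spike (bins)
F = np.exp(-t / 10.0) / 10.0          # hypothetical initial transfer function, tau = 10
W_i, rate = 1.0, 0.02                 # synaptic weight and stationary E(S_i(t))

# Equation 6.4: the gradient with respect to F_i(t) is proportional to
# W_i * F_i(t) * E(S_i(t)), so the filter's peak grows most in absolute terms.
grad = W_i * F * rate
F_new = F + 5.0 * grad                # one gradient step (step size is arbitrary)
```

Under any constraint that limits total synaptic resources, the fact that points where F_i(t) is already large gain the most implies a relative sharpening of the transfer function, which is the prediction stated next.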
Thus, a learning rule that changes the synaptic transfer function contains terms that are proportional to F_i(t). This suggests that in addition to an increase in amplitude, the synaptic transfer function should sharpen, because large F_i(t) values would be strengthened more than small F_i(t) values. In biological neurons, various components affect the shape of the EPSP at the soma. Among these are the dynamics of the receptor, the morphology of the dendrite and spine, and the passive electrical properties of the dendritic tree. Physiological evidence has been gathered regarding activity-dependent changes in these components (see Segal & Andersen, 2000), but the experimental results are far from being conclusive.

7 Discussion

The principle of information maximization was originally applied for discriminating gaussian signals from noise (Linsker, 1992). Our analysis extends the Infomax principle to discriminate between spatiotemporal activity patterns. The learning task is thus to maximize the relevant information, which in our model lies in the identity of the presented input, instead of reproducing the inputs. Introducing temporal properties of the target neuron into the analysis (such as its synaptic transfer function) allows us to infer the temporal properties of the learning rule required to maximize this relevant information. Moreover, we show how the analytically derived learning rule is well approximated by a biologically feasible learning rule that operates through spike-dependent plasticity. This rule requires a limited supervision signal, which activates learning whenever an interesting event occurs but does not require an explicit error signal. The supervision signal is required only for a transient learning period, yielding a network that remains at a stable performance level for long periods. Presumably, such a learning signal may be available in the brain through “novelty detector” or reward prediction circuits (Schultz, Dayan, & Montague, 1997).
What can we learn from the structure of this analytically derived learning rule about the spike-timing-dependent plasticity observed experimentally? First, in our analysis, synaptic depression operates to unlearn the statistics of the background activity. It is activated following the presentation of rare patterns, in order to maintain the balance between potentiation and depression, as in traditional Hebbian learning models (Sejnowski, 1977; Dayan &
Willshaw, 1991; Chechik, Meilijson, & Ruppin, 2001). The optimal performance is obtained when the weight of depression is stronger than that of potentiation (see Figures 3 and 6). This constraint is nicely explained by the model of Song et al. (2000), where spike time differences that lead to potentiation occur more often than chance because input spikes contribute to the creation of output spikes. Preventing the divergence of synaptic values requires synaptic potentiation to be balanced by stronger depression or by neuronal-level regulation of synaptic efficacies (Chechik, Horn, & Ruppin, 2002).

Second, the shape of the synaptic potentiation curve is determined by the input-output transfer function of the synapse. This yields directly testable experimental predictions, suggesting that neurons with fast membrane time constants will correspondingly exhibit fast synaptic plasticity curves. Interestingly, while STDP was often found to be NMDA dependent (Markram et al., 1997; Zhang et al., 1998; Bi & Poo, 1999; Feldman, 2000), the STDP potentiation window lasts only a few tens of milliseconds, while the typical timescale of NMDA receptors extends to hundreds of milliseconds. Our findings suggest a computational reasoning for this apparent mismatch, because the major contributors of input current to excitatory cells are AMPA-type channels, whose typical timescales are on the order of a few tens of milliseconds only.

The derived learning rules combine both a weight-dependent and a weight-independent component. This is in agreement with experiments showing that the change in excitatory postsynaptic current (EPSC) depends on the initial EPSC amplitude (Bi & Poo, 1999). The correlation between EPSC change and initial EPSC is negative, in that larger initial EPSCs show smaller increases following STDP. This suggests that in this preparation, the weight-dependent depression component is stronger than the weight-dependent potentiation component.
Third, because synaptic depression serves to unlearn the baseline statistics of the inputs, it achieves this goal most effectively with synaptic depression curves that extend in time. The experimental evidence in this regard is mixed: some preparations reveal depression curves that extend to 100 to 200 milliseconds (Debanne et al., 1994; Feldman, 2000), while in others, the time constants of both depression and potentiation are similar and around 40 milliseconds (Zhang et al., 1998; Bi & Poo, 1999; Yao & Dan, 2001). Moreover, the theoretical analysis suggests that depression would be even more effective if it operates when the presynaptic spike occurs long before the postsynaptic one (as in Figure 5B). This effect was indeed observed in some preparations (Nishiyama, Hong, Katsuhiko, Poo, & Kato, 2000; Feldman, 2000), but the evidence is very limited, possibly because longer time differences should be tested to observe this effect in additional preparations.

We have chosen to focus on excitatory synapses in this discussion, but our analytical derivation of Infomax learning is also relevant for inhibitory synapses (see appendix B). Applying the derivation to inhibitory synapses yields an asymmetric learning rule in which synapses become less inhibitory
Spike-Timing-Dependent Plasticity and Information Maximization
if their activity is followed by a postsynaptic spike and are strengthened when the opposite order holds. Such a learning rule was observed in synapses of rat cortical interneurons (Holmgren & Zilberter, 2001). Finally, the analysis suggests that the shape of the synaptic transfer function should also be plastic. EPSP shape is known to change on a short time scale (an effect termed synaptic facilitation and depression; Markram et al., 1997; Abbott et al., 1997) due to depletion of synaptic vesicles. Our analysis suggests that long-term changes in EPSP shape should take place following learning, where sharpening of the EPSP should accompany synaptic strengthening.

In summary, the main predictions suggested by the model that can be experimentally tested with standard electrophysiological techniques are as follows. First, when comparing neurons with various electrophysiological properties, the STDP potentiation curve of each neuron should have a time constant similar to its EPSP's time constant. Second, in addition to the synaptic depression that is observed when presynaptic spikes follow the postsynaptic spikes, weakening of synapses is expected to be observed for presynaptic spikes that largely precede the postsynaptic spikes (e.g., by three to four membrane time constants). Third, in parallel with potentiation of synaptic magnitude induced by STDP, the shape of the EPSP at the soma should sharpen.

Our analysis fails to explain several other forms of STDP, such as the ones observed in Dan and Poo (1998) and Egger, Feldmeyer, and Sakmann (1999), in which symmetric learning rules are observed. It is possible that a model that incorporates high-order temporal correlations in the input and output spike trains may be required to account for these findings. It was conjectured that STDP supports the idea that information is stored in temporal spike patterns.
It is shown here that even in a learning task in which information is coded in the underlying firing rates, spike-timing-dependent learning is required for effective learning. However, natural spike trains that trigger learning in vivo contain complex statistical structures (Paulsen & Sejnowski, 2000; Froemke & Dan, 2002). It thus remains important to explore learning rules that maximize mutual information about naturally structured spike trains.

Appendix A: Gradient-Ascent Learning Rule

This section derives a gradient-ascent learning rule that performs local search on the mutual information manifold. The input-output mutual information we aim to maximize is the mutual information between the output value and the identity of the presented pattern, as defined by

$$I(Y;\eta) = h(Y) - h(Y \mid \eta); \qquad h(Y) = -\int f(y)\,\log[f(y)]\,dy, \tag{A.1}$$
G. Chechik
where the first term is the differential entropy of the output $Y$ and the second is the differential entropy of $Y$ conditioned on the input pattern presented. The conditional entropy is

$$h(Y \mid \eta) = \sum_{\eta=0}^{M} p_\eta\, h(Y \mid \xi^\eta) = \sum_{\eta=0}^{M} p_\eta\, \tfrac{1}{2}\log\!\big(2\pi e\,\sigma_\eta^2\big), \tag{A.2}$$

and its derivative with respect to $W_i$ is

$$\frac{\partial h(Y \mid \eta)}{\partial W_i} = \sum_{\eta=0}^{M} p_\eta\, \frac{1}{2}\,\frac{1}{\sigma_\eta^2}\,\frac{\partial \sigma_\eta^2}{\partial W_i}. \tag{A.3}$$
To calculate the entropy $h(Y)$, we use the fact that the distribution of $Y$ is a mixture of gaussians, each resulting from the presentation of a different pattern, $f(y) = \sum_{\eta=0}^{M} p_\eta \phi_\eta(y)$, and write

$$
\begin{aligned}
h(Y) &= -\int dy \sum_{\eta=0}^{M} p_\eta\phi_\eta(y)\,\log\!\left(p_0\phi_0(y) + \sum_{\eta'=1}^{M} p_{\eta'}\phi_{\eta'}(y)\right) \\
&\approx -\log\!\left(\frac{p_0}{\sqrt{2\pi}\,\sigma_0}\right)\int dy \sum_{\eta=0}^{M} p_\eta\phi_\eta(y) + \int dy \sum_{\eta=0}^{M} p_\eta\phi_\eta(y)\,\frac{(y-\mu_0)^2}{2\sigma_0^2} \\
&\quad - \int dy \sum_{\eta=0}^{M} p_\eta\phi_\eta(y)\,\frac{\sum_{\eta'=1}^{M} p_{\eta'}\phi_{\eta'}(y)}{p_0\phi_0(y)},
\end{aligned} \tag{A.4}
$$
where the last approximation relies on the fact that signal patterns are only rarely presented ($\sum_{\eta=1}^{M} p_\eta \ll 1$) and on the approximation $\log(x+\epsilon) \approx \log(x) + \epsilon/x$. We now turn to differentiate each of the above three terms with regard to $W_i$. For the first term, we use $\int \sum_{\eta=0}^{M} p_\eta\phi_\eta(y)\,dy = 1$ and obtain

$$\frac{\partial}{\partial W_i}\left[-\log\!\left(\frac{p_0}{\sqrt{2\pi}\,\sigma_0}\right)\int dy \sum_{\eta=0}^{M} p_\eta\phi_\eta(y)\right] = \frac{\partial}{\partial W_i}\,\frac{1}{2}\log(\sigma_0^2) = \frac{1}{2}\,\frac{1}{\sigma_0^2}\,\frac{\partial\sigma_0^2}{\partial W_i}. \tag{A.5}$$

For the second term, we use $\int \phi_\eta(y)\,(y-\mu_\eta)\,dy = 0$ and obtain

$$\frac{\partial}{\partial W_i}\int dy \sum_{\eta=0}^{M} p_\eta\phi_\eta(y)\,\frac{(y-\mu_\eta+\mu_\eta-\mu_0)^2}{2\sigma_0^2}$$
$$
\begin{aligned}
&= \sum_{\eta=0}^{M} p_\eta\,\frac{\partial}{\partial W_i}\frac{\sigma_\eta^2}{2\sigma_0^2} + \sum_{\eta=0}^{M} p_\eta\,\frac{\partial}{\partial W_i}\frac{(\mu_\eta-\mu_0)^2}{2\sigma_0^2} \\
&= \sum_{\eta=0}^{M} \frac{p_\eta}{2\sigma_0^4}\left(\frac{\partial\sigma_\eta^2}{\partial W_i}\,\sigma_0^2 - \sigma_\eta^2\,\frac{\partial\sigma_0^2}{\partial W_i}\right) + \sum_{\eta=0}^{M} \frac{p_\eta}{2\sigma_0^4}\left(\frac{\partial(\mu_\eta-\mu_0)^2}{\partial W_i}\,\sigma_0^2 - (\mu_\eta-\mu_0)^2\,\frac{\partial\sigma_0^2}{\partial W_i}\right) \\
&= \sum_{\eta=1}^{M} p_\eta\,\frac{1}{2}\,\frac{1}{\sigma_0^2}\left(\frac{\partial\sigma_\eta^2}{\partial W_i} - \frac{\sigma_\eta^2}{\sigma_0^2}\frac{\partial\sigma_0^2}{\partial W_i} + \frac{\partial(\mu_\eta-\mu_0)^2}{\partial W_i} - \frac{(\mu_\eta-\mu_0)^2}{\sigma_0^2}\frac{\partial\sigma_0^2}{\partial W_i}\right),
\end{aligned} \tag{A.6}
$$
where the last equality results from a vanishing term for $\eta = 0$. For the third term, we use the fact that $\sum_{\eta=1}^{M} p_\eta \ll 1$ and neglect second-order terms in $\sum_{\eta=1}^{M} p_\eta$:

$$
\begin{aligned}
\frac{\partial}{\partial W_i}\int dy \sum_{\eta=0}^{M} p_\eta\phi_\eta(y)\,\frac{\sum_{\eta'=1}^{M} p_{\eta'}\phi_{\eta'}(y)}{p_0\phi_0(y)}
&= \frac{\partial}{\partial W_i}\int p_0\phi_0(y)\,\frac{\sum_{\eta=1}^{M} p_\eta\phi_\eta(y)}{p_0\phi_0(y)}\,dy + \frac{\partial}{\partial W_i}\int \frac{\big(\sum_{\eta=1}^{M} p_\eta\phi_\eta(y)\big)^2}{p_0\phi_0(y)}\,dy \\
&\approx \frac{\partial}{\partial W_i}(1-p_0) = 0.
\end{aligned} \tag{A.7}
$$
These three terms together yield

$$
\frac{\partial h(Y)}{\partial W_i} = \frac{1}{2}\frac{1}{\sigma_0^2}\frac{\partial\sigma_0^2}{\partial W_i} + \sum_{\eta=1}^{M} p_\eta\,\frac{1}{2}\frac{1}{\sigma_0^2}\left(\frac{\partial\sigma_\eta^2}{\partial W_i} - \frac{\sigma_\eta^2}{\sigma_0^2}\frac{\partial\sigma_0^2}{\partial W_i} + \frac{\partial(\mu_\eta-\mu_0)^2}{\partial W_i} - \frac{(\mu_\eta-\mu_0)^2}{\sigma_0^2}\frac{\partial\sigma_0^2}{\partial W_i}\right). \tag{A.8}
$$
Combining equations A.3 and A.5 through A.7 while omitting the constant factor $\tfrac{1}{2}$, we obtain the gradient on the mutual information manifold,

$$
\begin{aligned}
\frac{\partial I(Y;\eta)}{\partial W_i} &= \frac{\partial h(Y)}{\partial W_i} - \frac{\partial h(Y\mid\eta)}{\partial W_i} \\
&= \frac{1}{\sigma_0^2}\frac{\partial\sigma_0^2}{\partial W_i} - \sum_{\eta=1}^{M} p_\eta\,\frac{1}{\sigma_\eta^2}\frac{\partial\sigma_\eta^2}{\partial W_i} - p_0\,\frac{1}{\sigma_0^2}\frac{\partial\sigma_0^2}{\partial W_i} \\
&\quad + \sum_{\eta=1}^{M} p_\eta\,\frac{1}{\sigma_0^2}\left(\frac{\partial\sigma_\eta^2}{\partial W_i} - \frac{\sigma_\eta^2}{\sigma_0^2}\frac{\partial\sigma_0^2}{\partial W_i} + \frac{\partial(\mu_\eta-\mu_0)^2}{\partial W_i} - \frac{(\mu_\eta-\mu_0)^2}{\sigma_0^2}\frac{\partial\sigma_0^2}{\partial W_i}\right) \\
&= \frac{1}{\sigma_0^2}\frac{\partial\sigma_0^2}{\partial W_i}\left[(1-p_0) - \sum_{\eta=1}^{M} p_\eta\frac{\sigma_\eta^2}{\sigma_0^2} - \sum_{\eta=1}^{M} p_\eta\frac{(\mu_\eta-\mu_0)^2}{\sigma_0^2}\right] \\
&\quad - \sum_{\eta=1}^{M} p_\eta\,\frac{1}{\sigma_\eta^2}\frac{\partial\sigma_\eta^2}{\partial W_i} + \sum_{\eta=1}^{M} p_\eta\,\frac{1}{\sigma_0^2}\frac{\partial\sigma_\eta^2}{\partial W_i} + \sum_{\eta=1}^{M} p_\eta\,\frac{1}{\sigma_0^2}\frac{\partial(\mu_\eta-\mu_0)^2}{\partial W_i} \\
&= -\frac{1}{\sigma_0^2}\frac{\partial\sigma_0^2}{\partial W_i}\sum_{\eta=1}^{M} p_\eta\left[\frac{\sigma_\eta^2}{\sigma_0^2} + \frac{(\mu_\eta-\mu_0)^2}{\sigma_0^2} - 1\right] \\
&\quad + \sum_{\eta=1}^{M} p_\eta\,\frac{\partial\sigma_\eta^2}{\partial W_i}\left(\frac{1}{\sigma_0^2} - \frac{1}{\sigma_\eta^2}\right) + \sum_{\eta=1}^{M} p_\eta\,\frac{1}{\sigma_0^2}\frac{\partial(\mu_\eta-\mu_0)^2}{\partial W_i},
\end{aligned} \tag{A.9}
$$

where we used the fact that $1 - p_0 = \sum_{\eta=1}^{M} p_\eta$. Substituting the derivatives $\frac{1}{2}\frac{\partial\sigma_\eta^2}{\partial W_i} = \mathrm{Cov}(Y_\eta, X_i^\eta)$ and $\frac{1}{2}\frac{\partial(\mu_\eta-\mu_0)^2}{\partial W_i} = (\mu_\eta-\mu_0)\big(\langle X_i^\eta\rangle - \langle X_i^0\rangle\big)$ into equation A.9, we obtain
$$
\frac{\partial I(Y;\eta)}{\partial W_i} = \sum_{\eta=1}^{M} p_\eta\,\mathrm{Cov}(Y_\eta, X_i^\eta)\,K_1^\eta - \sum_{\eta=1}^{M} p_\eta\,\mathrm{Cov}(Y_0, X_i^0)\,K_0^\eta + \sum_{\eta=1}^{M} p_\eta\,\frac{1}{\sigma_0^2}\,(\mu_\eta-\mu_0)\big(\langle X_i^\eta\rangle - \langle X_i^0\rangle\big), \tag{A.10}
$$

where

$$
K_0^\eta = \frac{\sigma_\eta^2 - \sigma_0^2}{\sigma_0^4} + \frac{(\mu_\eta-\mu_0)^2}{\sigma_0^4}, \qquad K_1^\eta = \frac{1}{\sigma_0^2} - \frac{1}{\sigma_\eta^2}. \tag{A.11}
$$
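The grouping that leads from equation A.9 to the compact form A.10 can be checked numerically. The values below are random stand-ins for the pattern statistics and their derivatives (not simulated neural data); the check is only that the two algebraic forms agree, up to the constant factor 2 that the text absorbs:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 5
p = rng.uniform(0.001, 0.01, M)       # rare-pattern priors p_eta, eta = 1..M
s0 = 1.0 + rng.random()               # sigma_0^2
s = 1.0 + rng.random(M)               # sigma_eta^2
mu2 = rng.uniform(0.1, 1.0, M)        # (mu_eta - mu_0)^2
ds0 = rng.normal()                    # d sigma_0^2 / dW_i     (arbitrary)
ds = rng.normal(size=M)               # d sigma_eta^2 / dW_i   (arbitrary)
dmu2 = rng.normal(size=M)             # d (mu_eta - mu_0)^2 / dW_i

# Equation A.9 (the constant factor 1/2 already omitted by the text):
A9 = (-(ds0 / s0) * np.sum(p * (s / s0 + mu2 / s0 - 1))
      + np.sum(p * ds * (1 / s0 - 1 / s))
      + np.sum(p * dmu2 / s0))

# Equation A.10 with the kernels of A.11 and the substitutions
# Cov(Y_eta, X_i^eta) = ds/2, (mu_eta - mu_0)(<X^eta> - <X^0>) = dmu2/2.
# Here s0 and s denote the variances sigma^2, so sigma^4 = s0**2:
K0 = (s - s0) / s0**2 + mu2 / s0**2
K1 = 1 / s0 - 1 / s
A10 = (np.sum(p * (ds / 2) * K1) - np.sum(p * (ds0 / 2) * K0)
       + np.sum(p * (dmu2 / 2) / s0))

assert np.isclose(A9, 2 * A10)        # identical up to the dropped constant
```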
In the case of an inhomogeneous Poisson process, in which spikes in the train are uncorrelated but their underlying rate may vary over time,

$$
\begin{aligned}
\mathrm{Cov}(Y, X_i) &= \mathrm{Cov}\!\left(\sum_{j=1}^{N} W_j X_j,\, X_i\right) = \mathrm{Cov}(W_i X_i, X_i) = W_i\,\mathrm{Var}(X_i) \\
&= W_i\, V\!\left[\int_{t'=-T}^{0} F_\tau(0-t')\,S(t')\,dt'\right] = W_i \int_{t'=-T}^{0} F_\tau^2(0-t')\,V[S(t')]\,dt' \\
&\approx W_i \int_{t'=-T}^{0} F_\tau^2(0-t')\,E[S(t')]\,dt',
\end{aligned} \tag{A.12}
$$
where, in the case of an exponential filter ($F_\tau(x) = \frac{1}{\tau}\exp(-x/\tau)$), we obtain $\mathrm{Cov}(Y, X_i) = W_i E_{\tau/2}(X_i)$, where $E_\tau(X) = \int \exp\!\big(\frac{t'-t}{\tau}\big)\,S(t')\,dt'$.

Appendix B: Mirror Fixed Points of the Learning Rule

Let $W^\ast$ be a fixed-point solution of the optimal learning rule of equation 3.4, that is,

$$
\left.\frac{\partial I(Y;\eta)}{\partial W_i}\right|_{W^\ast} = \sum_{\eta=1}^{M} p_\eta\Big(\mathrm{Cov}(Y, X_i^\eta)\,K_1^\eta(W^\ast) + E(X_i^\eta)\,K_2^\eta(W^\ast)\Big) - \sum_{\eta=1}^{M} p_\eta\Big(\mathrm{Cov}(Y, X_i^0)\,K_0^\eta(W^\ast) + E(X_i^0)\,K_2^\eta(W^\ast)\Big) = 0. \tag{B.1}
$$

Now note that both $K_1^\eta$ and $K_0^\eta$ are even functions of $W$, but $K_2^\eta$ and $\mathrm{Cov}(Y, X_i) = \mathrm{Cov}(WX, X)$ are odd functions of $W$. This yields

$$
\left.\frac{\partial I(Y;\eta)}{\partial W_i}\right|_{-W^\ast} = \sum_{\eta=1}^{M} p_\eta\Big(\big(-\mathrm{Cov}(Y, X_i^\eta)\big)K_1^\eta(W^\ast) + E(X_i^\eta)\big(-K_2^\eta(W^\ast)\big)\Big) - \sum_{\eta=1}^{M} p_\eta\Big(\big(-\mathrm{Cov}(Y, X_i^0)\big)K_0^\eta(W^\ast) + E(X_i^0)\big(-K_2^\eta(W^\ast)\big)\Big) = -\left.\frac{\partial I(Y;\eta)}{\partial W_i}\right|_{W^\ast} = 0. \tag{B.2}
$$

Therefore, any solution $W^\ast$ has a mirror solution $-W^\ast$, which is also a fixed point of the dynamics.
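The parity argument can be illustrated with toy stand-ins for the kernels. The functional forms below are arbitrary, chosen only to respect the even/odd symmetries that appendix B relies on; they are not the model's actual statistics:

```python
import numpy as np

def grad(w, var_x=1.3, e_fg=0.7, e_bg=0.4):
    """Toy gradient with the symmetries of equation B.1:
    K1, K0 even in W; K2 and Cov(Y, X) = Cov(WX, X) = W Var(X) odd in W."""
    K1 = 1.0 / (1.0 + w**2)        # even (illustrative form)
    K0 = np.cos(w)                 # even (illustrative form)
    K2 = w / (1.0 + w**4)          # odd  (illustrative form)
    cov = w * var_x                # odd: Cov(WX, X) = W Var(X)
    return (cov * K1 + e_fg * K2) - (cov * K0 + e_bg * K2)

w = np.linspace(-3.0, 3.0, 101)
# The gradient is an odd function of W, so grad(w*) = 0 implies
# grad(-w*) = 0: fixed points come in mirror pairs.
assert np.allclose(grad(-w), -grad(w))
```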
Appendix C: Pattern-Triggered Learning

The batch learning rule of equation 3.4 can be turned into a stochastic on-line version by replacing the summation over patterns with a learning rule that modifies the synaptic weights according to the input pattern presented at that moment. There are two fundamentally different alternative implementations: changing synaptic weights triggered by the presentation of all patterns,

$$
\Delta W_i =
\begin{cases}
-\big(\lambda_0\,\mathrm{Cov}[Y, X_i^0] + \lambda_2\,E[X_i^0]\big)\,\dfrac{1}{p_0}\displaystyle\sum_{\eta=1}^{M} p_\eta & \text{when } \xi^0 \text{ is presented} \\[8pt]
+\lambda_1\,\mathrm{Cov}[Y, X_i^\eta] + \lambda_2\,E[X_i^\eta] & \text{when } \xi^\eta \text{ is presented},\ \eta > 0,
\end{cases} \tag{C.1}
$$
or limiting learning to follow the presentation of the foreground patterns only,

$$
\Delta W_i = +\big(\lambda_1\,\mathrm{Cov}[Y, X_i^\eta] + \lambda_2\,E[X_i^\eta]\big) - \big(\lambda_0\,\mathrm{Cov}[Y, X_i^0] + \lambda_2\,E[X_i^0]\big) \quad \text{when } \xi^\eta \text{ is presented, for } \eta = 1 \ldots M \text{ only}. \tag{C.2}
$$
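A minimal sketch of the two on-line schemes, using placeholder per-pattern statistics (the covariances, mean activities, and learning constants below are arbitrary illustrations, not fitted values). Under correctly known priors the two rules have the same expected update; only the rule of equation C.1 breaks when the true presentation frequencies drift away from the values it assumes:

```python
import numpy as np

M = 3
lam0, lam1, lam2 = 0.8, 1.0, 0.1          # learning constants (illustrative)
cov = np.array([0.0, 0.5, -0.2, 0.3])     # Cov[Y, X_i^eta], eta = 0..M (placeholder)
mean = np.array([1.0, 0.9, 1.1, 0.8])     # E[X_i^eta] (placeholder)

def dw_all(eta, p_assumed):
    """Equation C.1: learning triggered on every pattern; references the priors."""
    if eta == 0:
        return -(lam0 * cov[0] + lam2 * mean[0]) * (p_assumed[1:].sum() / p_assumed[0])
    return lam1 * cov[eta] + lam2 * mean[eta]

def dw_rare(eta):
    """Equation C.2 (= equation 4.1): learning on foreground patterns only."""
    return (lam1 * cov[eta] + lam2 * mean[eta]) - (lam0 * cov[0] + lam2 * mean[0])

p_true = np.array([0.97, 0.01, 0.01, 0.01])
e_all = sum(p_true[e] * dw_all(e, p_true) for e in range(M + 1))
e_rare = sum(p_true[e] * dw_rare(e) for e in range(1, M + 1))
assert np.isclose(e_all, e_rare)          # same expected (batch) gradient

# Priors drift, but equation C.1 keeps using the stale estimate p_true:
p_new = np.array([0.94, 0.02, 0.02, 0.02])
e_all_stale = sum(p_new[e] * dw_all(e, p_true) for e in range(M + 1))
e_rare_new = sum(p_new[e] * dw_rare(e) for e in range(1, M + 1))
assert np.isclose(e_rare_new, 2 * e_rare)        # scales cleanly with the burst
assert not np.isclose(e_all_stale, 2 * e_all)    # balance is lost: net drift
```

This mirrors the simulation summarized in Figure 8: when presentation statistics become bursty, the "learn on all patterns" rule goes out of balance, while the foreground-triggered rule stays balanced.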
These two alternatives have different consequences for on-line learning: the latter rule ("learn on rare patterns," equation 4.1) does not explicitly depend on the prior probabilities $p_\eta$, while the former rule ("learn on all patterns," equation C.1) does. The former rule thus suffers from a major drawback when the prior probabilities are unknown or vary in time, because the system has to continuously estimate these probabilities. If the priors $p_\eta$ are incorrectly estimated, strengthening and weakening of synaptic weights go out of balance, and synaptic values drift to their limits, causing the neuron to lose all of its discriminative power. This is similar to saturation effects observed in traditional models of Hebbian learning, as discussed in Sejnowski (1977), Dayan and Willshaw (1991), and Chechik et al. (2001). To demonstrate this effect, we conducted the following series of experiments. We set $p_\eta$ to fluctuate over time, while the learning parameters were set to the average $p_\eta$ values. Figure 8 compares the two learning rules and shows that the "learn on all patterns" rule is sensitive to fluctuations in $p_\eta$, while the "learn on rare patterns" rule is robust to them. We conclude that an on-line implementation of the batch rule should take the form of equation 4.1, where learning is triggered by the rare patterns only.

Appendix D: Maximum Output Variance

A clearer intuition into the structure of the Infomax learning rule can be obtained by comparing it with another learning rule, derived according
[Figure 8 appears here: $I(Y;\eta)$ in bits (roughly 0.02 to 0.06) plotted against burstiness (0 to 8), comparing learning triggered on foreground patterns (higher information) with learning triggered on all patterns.]
Figure 8: Learning patterns with fluctuating frequencies $p_\eta$. The two learning rules of equations 4.1 and C.1 are compared, showing that learning triggered on rare patterns is much more robust to such fluctuations. Fluctuating patterns were modeled in the following way. Pattern presentation was divided into five cycles, each containing 2000 presentation steps. Each cycle consisted of a period of length $L_0$, where only the background pattern was presented, and the remaining period (of length $L_\eta = 2000 - L_0$), where the rest of the patterns were randomly chosen for presentation such that the average probabilities $p_\eta$ are preserved. The ratio $B = L_\eta / L_0$ thus provides a burstiness measure, where high $B$ values correspond to input statistics with bursts of foreground input patterns. The information value is the average over the last cycle, containing both quiet periods and bursts. Simulations were repeated for five different runs, and error bars depict variability across these runs.
to the criterion of maximum output variance. Traditional Hebbian learning (learning by correlation) is known to perform gradient ascent on the manifold of this target function. This optimization criterion must be complemented with a constraint over the weights; otherwise, the synaptic weights diverge (Sejnowski, 1977; Dayan & Willshaw, 1991). Consider therefore the following target function,

$$\mathrm{Var}(Y) - \lambda \sum_i W_i^\alpha, \tag{D.1}$$
where $\alpha$ is commonly taken to be 1 (additive normalization) or 2 (multiplicative normalization). Differentiating this target function with regard to $W_i$ yields

$$\frac{\partial}{\partial W_i}\left(\mathrm{Var}(Y) - \lambda \sum_i W_i^\alpha\right) = \mathrm{Cov}(Y, X_i) - \lambda\alpha W_i^{\alpha-1}. \tag{D.2}$$
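Equation D.2 can be checked by finite differences on toy data (here $\alpha = 2$, multiplicative normalization; note that differentiating $\mathrm{Var}(Y)$ actually yields $2\,\mathrm{Cov}(Y, X_i)$, a constant factor of the kind the text absorbs elsewhere):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, lam, alpha = 4, 50_000, 0.05, 2
X = rng.poisson(3.0, size=(T, N)).astype(float)   # toy input activities
W = rng.normal(size=N)

def target(w):
    # Var(Y) - lambda * sum_i W_i^alpha, with Y = sum_i W_i X_i
    return (X @ w).var() - lam * np.sum(w**alpha)

# Finite-difference gradient of the target function:
eps = 1e-5
fd = np.array([(target(W + eps * e) - target(W - eps * e)) / (2 * eps)
               for e in np.eye(N)])

# Analytic gradient: 2*Cov(Y, X_i) - lam*alpha*W_i^(alpha-1)
y = X @ W
cov = np.array([np.cov(y, X[:, i])[0, 1] for i in range(N)])
analytic = 2 * cov - lam * alpha * W
assert np.allclose(fd, analytic, rtol=1e-3)
```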
In this learning rule, all patterns strengthen synapses in a way that corresponds to their input's strength. Synaptic weakening is enforced through a global decay term. The balance between activity-dependent input-output correlation and normalization should yield a nondivergent solution. In contradistinction, the Infomax learning rule derived in this article tends to strengthen synaptic weights that correspond to strong inputs of some patterns while decreasing the synapses of other patterns. Depression is thus achieved not through a global competition mechanism but is input specific.

Acknowledgments

It is a pleasure to thank N. Tishby, E. Ruppin, D. Horn, and R. Aharonov for helpful discussions and E. Schneidman, A. Globerson, and M. Ben-Shachar for a careful reading of the manuscript. This research was supported by a grant from the Ministry of Science, Israel.

References

Abbott, L., & Song, S. (1999). Temporally asymmetric Hebbian learning, spike timing and neural response variability. In S. A. Solla & D. A. Cohen (Eds.), Advances in neural information processing systems, 11 (pp. 69–75). Cambridge, MA: MIT Press.
Abbott, L., Varela, J., Sen, K., & Nelson, S. (1997). Synaptic depression and cortical gain control. Science, 275, 220–223.
Aharonov, R., Guttig, R., & Sompolinsky, H. (2001). Generalized synaptic updating in temporally asymmetric Hebbian learning. In J. Bower (Ed.), Computational neuroscience: Trends in research 2001. Amsterdam: Elsevier.
Bell, C., Han, V., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.
Bi, Q., & Poo, M. (1999). Precise spike timing determines the direction and extent of synaptic modifications in cultured hippocampal neurons. J. Neurosci., 18, 10464–10472.
Chechik, G., Horn, D., & Ruppin, E. (2002). Hebbian learning and neuronal regulation. In M. A. Arbib (Ed.), Handbook of brain theory and neural networks (2nd ed.). Cambridge, MA: MIT Press.
Chechik, G., Meilijson, I., & Ruppin, E. (2001). Effective neuronal learning with ineffective Hebbian learning rules. Neural Computation, 13(4), 817–840.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Dan, Y., & Poo, M. (1998). Hebbian depression of isolated neuromuscular synapses in vitro. Science, 256, 1570–1573.
Dayan, P., & Willshaw, D. (1991). Optimizing synaptic learning rules in linear associative memories. Biol. Cyber., 65, 253.
Debanne, D., Gahwiler, B., & Thompson, S. (1994). Asynchronous pre- and postsynaptic activity induces associative long-term depression in area CA1 of the rat hippocampus in vitro. Proc. Natl. Acad. Sci., 91, 1148–1152.
Debanne, D., Gahwiler, B., & Thompson, S. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol., 507(1), 237–247.
Egger, V., Feldmeyer, D., & Sakmann, B. (1999). Coincidence detection and changes of synaptic efficacy in spiny stellate neurons in rat barrel cortex. Nature Neuroscience, 2, 1098–1105.
Feldman, D. (2000). Timing based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Ferguson, T. (1996). A course in large sample theory. London: Chapman and Hall.
Froemke, R., & Dan, Y. (2002). Spike-timing dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Holmgren, C., & Zilberter, Y. (2001). Coincident spiking activity induces long-term changes in inhibition of neocortical pyramidal cells. J. Neuroscience, 21, 8270–8277.
Horn, D., Levy, N., Meilijson, I., & Ruppin, E. (2000). Distributed synchrony of spiking neurons in a Hebbian cell assembly. In S. Solla, T. Leen, & K. Muller (Eds.), Advances in neural information processing systems, 12 (pp. 129–135). Cambridge, MA: MIT Press.
Kempter, R., Gerstner, W., & van Hemmen, J. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59(4), 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. (2001). Intrinsic stabilization of output rates by spike-time dependent Hebbian learning. Neural Computation, 13, 2709–2742.
Kingman, J. (1993). Poisson processes. New York: Oxford University Press.
Lehman, E. (1998). Elements of large sample theory. New York: Springer-Verlag.
Levy, W., & Steward, D. (1983). Temporal contiguity requirements for long-term associative potentiation/depression in the hippocampus. Neurosci., 8, 791–797.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3), 105–117.
Linsker, R. (1992). Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Computation, 4(5), 691–702.
Magee, J., & Johnston, D. (1997). A synaptic controlled associative signal for Hebbian plasticity in hippocampal neurons. Science, 275(5297), 209–213.
Markram, H., Lubke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297), 213–215.
Mehta, M., Quirk, M., & Wilson, M. (1999). From hippocampus to V1: Effect of LTP on spatio-temporal dynamics of receptive fields. In J. Bower (Ed.), Computational neuroscience: Trends in research 1999. Amsterdam: Elsevier.
Monyer, H., Burnashev, N., Laurie, D., Sakmann, B., & Seeburg, P. (1994). Developmental and regional expression in the rat brain and functional properties of four NMDA receptors. Neuron, 12, 529–540.
Nishiyama, M., Hong, K., Katsuhiko, M., Poo, M., & Kato, K. (2000). Calcium stores regulate the polarity and input specificity of synaptic modification. Nature, 408, 584–588.
Paulsen, O., & Sejnowski, T. (2000). Natural patterns of activity and long-term synaptic plasticity. Current Opinion in Neurobiology, 10, 172–179.
Rao, R., & Sejnowski, T. (2000). Predictive sequence learning in recurrent neocortical circuits. In S. Solla, T. Leen, & K. Muller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press.
Roberts, P. D., & Bell, C. C. (2002). Spike-timing dependent synaptic plasticity in biological systems. Biol. Cybern., 87, 392–403.
Rubin, J., Lee, D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev., 86(2), 364–367.
Schultz, W., Dayan, P., & Montague, P. (1997). A neural substrate of prediction and reward. Science, 275, 1593–1599.
Segal, M., & Andersen, P. (2000). Dendritic spines shaped by synaptic activity. Curr. Opin. Neurobiol., 10, 582–587.
Segev, I., & London, M. (2000). Untangling dendrites with quantitative models. Science, 290, 744–750.
Sejnowski, T. (1977). Statistical constraints on synaptic plasticity. J. Theo. Biol., 69, 385–389.
Shannon, C. (1948). A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423.
Singer, W., & Gray, C. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Song, S., Miller, K., & Abbott, L. (2000). Competitive Hebbian learning through spike-timing dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Tang, Y. P., Shimizu, E., Dube, G. R., Rampon, C., Kerchner, G. A., Zhuo, M., Liu, G., & Tsien, J. Z. (1999). Genetic enhancement of learning and memory in mice. Nature, 401, 63–69.
Yao, H., & Dan, Y. (2001). Stimulus timing-dependent plasticity in cortical processing of orientation. Neuron, 32, 315–323.
Zhang, L., Tao, H. W., Holt, C., Harris, W., & Poo, M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395(3), 37–44.

Received April 29, 2002; accepted January 5, 2003.
LETTER
Communicated by Terrence J. Sejnowski
Relating STDP to BCM
Eugene M. Izhikevich
[email protected], www.nsi.edu/users/izhikevich
Niraj S. Desai
[email protected] The Neurosciences Institute, San Diego, CA, 92121, U.S.A.
We demonstrate that the BCM learning rule follows directly from STDP when pre- and postsynaptic neurons fire uncorrelated or weakly correlated Poisson spike trains, and only nearest-neighbor spike interactions are taken into account.

1 Introduction

Over the past several years, there has been increasing interest in a novel form of Hebbian synaptic plasticity called spike-timing-dependent plasticity (STDP; Markram, Lubke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Debanne, Gahwiler, & Thomson, 1998; Feldman, 2000; Sjostrom, Turrigiano, & Nelson, 2001; Froemke & Dan, 2002), in which the temporal order of presynaptic and postsynaptic spikes determines whether a synapse is potentiated or depressed (see Figure 1A). While experiments to date have given us a fairly clear idea of how STDP affects synaptic weights when only isolated pairs of presynaptic and postsynaptic spikes are present, it is not clear how STDP should be applied to natural spike trains, which involve many spikes and many possible pairings of spikes (Froemke & Dan, 2002). Specifically, what is not clear is how plasticity at a given synapse builds up over time. This problem is interesting in its own right, and also because it promises to shed light on the relationship between STDP and the best-studied forms of Hebbian plasticity, long-term potentiation and depression (LTP and LTD), which explicitly make use of long spike trains (Bear & Malenka, 1994). Presumably all of these forms of plasticity arise from the same underlying biophysical mechanisms, and it should be possible to consider them within a single framework. Here we examine different implementations of STDP, compare them with a standard LTP/LTD implementation called the BCM (Bienenstock-Cooper-Munro) synapse (Bienenstock, Cooper, & Munro, 1982; Bear, Cooper, & Ebner, 1987), and in so doing arrive at certain constraints on how STDP should be implemented when considering natural spike trains.
Neural Computation 15, 1511–1523 (2003). © 2003 Massachusetts Institute of Technology

In the BCM formulation, one considers instantaneous firing rates rather than individual spikes. Synaptic input that drives postsynaptic firing to
[Figure 1 appears here. Recoverable panel content: (A) the STDP curve (synaptic change, %, vs. spike time difference t, ms); (B) the BCM function (synaptic change vs. postsynaptic firing rate x, Hz), with "normal" and "deprived" curves; (C) the all-to-all implementation of STDP and the postsynaptic probability density; (D) synaptic change under correlated, additive, multiplicative, and suppression variants; (E) the nearest-neighbor implementation of STDP; (F) the resulting BCM function.]
Figure 1: (A) The STDP curve. The parameters $A_+ = 103\%$, $\tau_+ = 0.014$ sec, $A_- = -51\%$, $\tau_- = 0.034$ sec are taken from Froemke and Dan (2002). (B) Function controlling synaptic plasticity at the Cooper synapse receiving 20 Hz presynaptic stimulation. Data points (circles) are from visual cortex experiments by Kirkwood et al. (1996). Parameters from Figure 1A in equation 3.1 result in the normal curve. Increasing $\tau_+$ by 10% results in the "deprived" curve. (C) All-to-all implementation of STDP: the net synaptic change is a combination of small changes induced by all possible pre- and postsynaptic pairs. (D) The result of applying the STDP rule to Poisson spike trains: presynaptic, 10 Hz and 100,000 spikes; postsynaptic, x Hz and a matching number of spikes. The analytical curves are derived in the supplemental materials. (E) The nearest-neighbor implementation of STDP. For each presynaptic spike, only one preceding and one succeeding postsynaptic spike are considered. (F) The resulting BCM function. Parameters are as in Figure 1A.
high levels results in an increase in synaptic strength, whereas input that produces only low levels of postsynaptic firing results in a decrease (see Figure 1B). The threshold firing rate, the crossover point between potentiation and depression, is itself a slow function of postsynaptic activity, moving so as to make potentiation more likely when average activity is low and less likely when it is high. This sliding of the threshold serves to stabilize network activity and promote competition between synapses (Bienenstock et al., 1982). Considerable experimental evidence for this kind of plasticity has been obtained in neocortex and hippocampus, at some of the same synapses at which evidence for STDP has also been obtained (Bi & Poo, 1998; Froemke & Dan, 2002; Kirkwood, Dudek, Gold, Aizenman, & Bear, 1993). Even so, it is not obvious how BCM plasticity and STDP are related, or even if they are compatible. Consider, as an illustration, the extreme case in which all spikes of the postsynaptic neuron occur after those of the presynaptic one. This will always result in potentiation by an STDP rule, but it could result in either depression or potentiation by the BCM rule, depending on the exact value of the postsynaptic firing rate. To clarify this issue, we compare the two kinds of plasticity more closely in a more biologically realistic regime—that of uncorrelated or weakly correlated presynaptic and postsynaptic neurons that fire in a nearly Poisson manner, as do cortical neurons in vivo.

2 Classical Implementations of STDP

The form of the STDP curve, with its well-matched potentiation and depression portions, might cause one to guess that STDP applied to this kind of natural spike train should lead to a BCM-like curve similar to the one in Figure 1C. On the one hand, the peak of the potentiation half of the curve is larger than the minimum of the depression half, which suggests that high firing rates (small interspike intervals) should result in a net potentiation.
On the other hand, the tail of the depression half is longer than the tail of the potentiation one, which suggests that low firing rates (large interspike intervals) should result in a net depression. And at some intermediate firing rate, a crossover should occur. But in fact, as we demonstrate in Figure 1D, this is not necessarily true. A straightforward application of the STDP rule does not lead to potentiation in any case when one uses independent Poisson spike trains with a mean postsynaptic firing rate $x$. In the standard additive implementation of STDP (Song, Miller, & Abbott, 2000), for each presynaptic spike, one takes into account all preceding and all succeeding postsynaptic spikes and then sums the contributions of the various pairings. If postsynaptic firing is a random process that is relatively independent of presynaptic firing, then the postsynaptic firing density is relatively flat, as in Figure 1C (no peak). Consequently, all postsynaptic spikes that precede the presynaptic one essentially sample the depression curve, and all postsynaptic spikes that follow the presynaptic one sample the potentiation curve. (Mathematical details of this procedure are given in the appendix.) The net effect is, then, the difference between the area under the positive portion of the STDP curve and the area under the negative portion, multiplied by the postsynaptic firing rate $x$. As a function of $x$, this is merely a straight line, in which potentiation never results, no matter how high the firing rate.

One can relax the assumption of completely uncorrelated pre- and postsynaptic spike trains and consider the case when the postsynaptic firing density has a small peak right after the presynaptic firing (function $c(t)$ in Figure 1C). The magnitude of the peak $a$ may be constant, or it may scale with the postsynaptic firing rate $x$. Both cases, considered in the appendix, result in a linear synaptic change similar to that produced by additive STDP without correlations. The case of constant $a$ shifts the straight line up (see Figure 1D), while the case of $a$ proportional to $x$ changes the slope of the line. In either case, one does not see the transition from LTD to LTP, as is required by the BCM rule. Also inadequate from this standpoint is a modification of the STDP rule recently proposed by Froemke and Dan (2002), in which the efficacy of each spike is suppressed by the preceding spikes of the same neuron, so that activity-induced synaptic modification depends not only on the relative spike timing between neurons, but also on the spiking pattern within each neuron. Specifically, these authors propose that associated with each presynaptic and postsynaptic neuron is an efficacy variable that drops to zero following a spike and relaxes exponentially back to one (its maximal value) with characteristic times $\tau_{\mathrm{pre}}$ and $\tau_{\mathrm{post}}$ on the order of tens of milliseconds. However, this proposal for treating multiple interactions cannot account for the potentiation of synapses at high firing rates.
As we show in the appendix, the modification is basically equivalent to the standard STDP implementation, the only difference being that the presynaptic and postsynaptic spikes are now in a chronic state of suppression. The precise numbers are changed, but the qualitative picture is not. In the appendix, we explicitly consider eight implementations of STDP. In particular, we find that treating interactions between multiple pairs of spikes multiplicatively (van Rossum, Bi, & Turrigiano, 2000) rather than additively—that is, imagining that each pairing changes the conductance by a percentage of its existing value rather than by a fixed amount—makes no real difference. Additive and multiplicative models are in some cases equivalent because the latter often reduces to the former when one considers the logarithms of weights rather than the weights themselves.

3 Nearest-Neighbor Implementation of STDP

What does make a difference—what does make STDP compatible with BCM (and with classical LTP/LTD more generally)—is restricting which pairings contribute to plasticity (Sjostrom et al., 2001; van Rossum et al., 2000). Rather than considering all presynaptic and postsynaptic pairings equally, there are good reasons to consider only nearest-neighbor pairs. One reason is that postsynaptic spikes backpropagate into the dendritic tree and reset the membrane voltage in dendritic spines. Consequently, the most recent postsynaptic spike overrides the effect of all the earlier spikes, so that the membrane voltage is really only a function of the time since the latest postsynaptic spike. Similarly, the first succeeding postsynaptic spike may override the effect of subsequent spikes due to calcium saturation or glutamate receptor desensitization. Making this assumption, one finds that when the postsynaptic spike train is a Poisson process with firing rate $x$, the postsynaptic probability density (the probability of observing a spike with a certain delay $t$) becomes exponential in time, $x e^{-xt}$, as in Figure 1E (here we consider uncorrelated trains again). High (low) firing rates $x$ result in predominantly small (large) intervals and hence in potentiation (depression). The expected magnitude of synaptic modification per presynaptic spike has the form

$$
C(x) = \underbrace{A_+ \int_0^{\infty} e^{-t/\tau_+}\, x e^{-xt}\, dt}_{\text{average potentiation}} + \underbrace{A_- \int_{-\infty}^{0} e^{t/\tau_-}\, x e^{xt}\, dt}_{\text{average depression}} = x\left(\frac{A_+}{\tau_+^{-1} + x} + \frac{A_-}{\tau_-^{-1} + x}\right), \tag{3.1}
$$
depicted in Figure 1F. It coincides with the BCM synapse in the sense that low activity results in depression and high activity results in potentiation. Incidentally, the assumption of strictly nearest-neighbor pairs can be relaxed and the result retained if one instead considers one preceding but all succeeding postsynaptic spikes (a semi-nearest-neighbor implementation). This also leads to the BCM synapse, as we show in the appendix. Having an analytic expression for STDP's effects at a given firing rate $x$ (see equation 3.1) allows us to attempt to relate data obtained using STDP and LTP/LTD experimental protocols more quantitatively. In particular, equation 3.1 indicates that the threshold between potentiation and depression (the zero crossing of $C(x)$),

$$\vartheta = -\frac{A_+/\tau_- + A_-/\tau_+}{A_+ + A_-}, \tag{3.2}$$

has positive values when

$$A_+ > |A_-| \quad \text{(potentiation dominates depression for short intervals)},$$
$$|A_-|\,\tau_- > A_+\tau_+ \quad \text{(depression area is greater than potentiation area)}.$$
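These expressions are easy to evaluate with the parameters quoted in Figure 1 (from Froemke & Dan, 2002): the all-to-all drift of section 2, $x(A_+\tau_+ + A_-\tau_-)$, is negative at every rate, while the nearest-neighbor $C(x)$ of equation 3.1 crosses from depression to potentiation exactly at the threshold of equation 3.2:

```python
import numpy as np

A_plus, tau_plus = 103.0, 0.014     # % change and seconds, Froemke & Dan (2002)
A_minus, tau_minus = -51.0, 0.034

def C_nearest(x):
    """Expected change per presynaptic spike, nearest-neighbor pairing (eq. 3.1)."""
    return x * (A_plus / (1 / tau_plus + x) + A_minus / (1 / tau_minus + x))

def C_all_to_all(x):
    """All-to-all pairing: x times the signed area under the STDP curve."""
    return x * (A_plus * tau_plus + A_minus * tau_minus)

theta = -(A_plus / tau_minus + A_minus / tau_plus) / (A_plus + A_minus)  # eq. 3.2

x = np.linspace(0.1, 40.0, 400)                 # postsynaptic rates (Hz)
assert np.all(C_all_to_all(x) < 0)              # all-to-all never potentiates
assert C_nearest(5.0) < 0 < C_nearest(20.0)     # BCM-like LTD-to-LTP crossover
assert abs(C_nearest(theta)) < 1e-9             # zero crossing at the threshold
print(f"threshold = {theta:.1f} Hz")            # about 12 Hz for these parameters
```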
Using the STDP parameters obtained in layer 2/3 of rat visual cortex (Froemke & Dan, 2002), we calculate a threshold value of around 12 Hz, which is near the value of 9 Hz found using LTP/LTD protocols in the same brain area (Kirkwood, Rioult, & Bear, 1996). One key feature of the BCM learning rule is that the threshold does not have a fixed value but slides between higher and lower values as a slow function of postsynaptic activity (Bienenstock et al., 1982). Equation 3.1 can be used to relate the sliding of the threshold to the biophysical processes underlying plasticity. In particular, the potentiation time constant $\tau_+$ most likely depends on the kinetics of NMDA receptors, which in turn depend on their subunit composition. It has been shown (Philpot, Sekhar, Shouval, & Bear, 2001) that low levels of postsynaptic activity due to light deprivation increase the ratio of NR2B subunits to NR2A subunits in NMDA receptors. This in turn increases the time constant of NMDA receptors by up to 20%. As one can see from equation 3.2 and in Figure 1B, increasing $\tau_+$ by as little as 10% results in sliding of the calculated threshold by a factor of two, which is in agreement with the experimental data (Kirkwood et al., 1996).

Discussion

The question of how interactions between multiple spike pairs should be treated given an STDP rule has been considered previously (Sjostrom et al., 2001; Froemke & Dan, 2002; Song et al., 2000; van Rossum et al., 2000). In theoretical articles, the choice between applying the rule to all spike pairings or only to nearest-neighbor pairings has been made on a somewhat ad hoc basis, without a real empirical or biological justification. In experimental articles in which the question has been considered, the answers given have been limited in scope. Specifically, they have been limited to explaining data generated by applying STDP experimental protocols. What they have neglected is that a proper approach should take into account both STDP and classical LTP/LTD.
Here we have attempted to do so, and our study suggests that the most widely used STDP implementations may not be adequate: the pairings should be restricted to only proximal spike pairs. And we see that a handful of reasonable assumptions generates a simple equation that can link the parameters of STDP to the BCM formulation of LTP/LTD, resulting in a more intuitive picture of how these forms of plasticity are related. We have been considering two phenomenological models of plasticity, STDP and BCM, and important questions remain about how these models relate to the actual biophysical processes underlying synaptic plasticity. Although considerable data have been gathered about STDP applied to a single pair of spikes, many other aspects of this rule have yet to be worked out experimentally; for example, the issue of multiple spike pairs that we address in this theoretical article, as well as whether changes in synaptic efficacy depend on the size of the synapse. While the BCM formulation of LTP/LTD has been influential, how well it describes the empirical
Relating STDP to BCM
data is open to question. Tests of the BCM idea have been largely indirect and have not convincingly validated certain aspects of the theory, in particular the idea that only postsynaptic activity determines the sign of plasticity.

4 Conclusion

This article addresses an important theoretical issue: how STDP, a novel form of Hebbian plasticity that has attracted much attention recently, relates to classical long-term potentiation and depression, in the form of the Cooper (or BCM) synapse. It is commonly (though not universally) believed that these two forms of plasticity arise from the same underlying biophysical process. The relationship between the two has been explored by, among others, van Rossum et al. (2000), Senn, Markram, & Tsodyks (2001), Kempter, Gerstner, & van Hemmen (1999, 2001), Castellani, Quinlan, Cooper, & Shouval (2001), and Abarbanel, Huerta, & Rabinovich (2002), though no attempts have been made to contrast different implementations of STDP with respect to BCM. Here we have determined conditions under which one is equivalent to the other, and by making a handful of biologically plausible assumptions, such as Poissonian spiking, we have derived a simple equation that relates one form of plasticity to the other (see equation 3.1 or Figure 1F). We believe that this is an important step in reconciling STDP and classical LTP/LTD and will be of use to a broad range of researchers, ranging from electrophysiologists studying synaptic plasticity to computational neuroscientists interested in understanding the dynamics of networks.

Appendix: Different Implementations of STDP

If τ = t_j^post − t_i^pre is the interval between the ith spike of the presynaptic neuron and the jth spike of the postsynaptic neuron, then the STDP synaptic modification is given by

\[
\Delta w_{ij} =
\begin{cases}
A_+ \, e^{-\tau/\tau_+} & \text{if } \tau > 0, \\
A_- \, e^{\tau/\tau_-} & \text{if } \tau < 0,
\end{cases}
\]
where A_+ = 103, A_- = −51, τ_+ = 0.014 sec, and τ_- = 0.034 sec are experimentally determined STDP parameters for layer 2/3 visual cortical neurons (Froemke & Dan, 2002). See Figure 1A. Poisson spike trains with firing rates of 10 Hz and x Hz were generated for presynaptic and postsynaptic neurons, respectively. The postsynaptic firing rate x was varied systematically between 0 and 20 Hz, and the average synaptic modification per presynaptic spike was calculated using eight
STDP implementations, which are described below.

• Classical additive STDP adds the effect of all pairs,

\[
\Delta w = \sum_{i,j} \Delta w_{ij},
\]

and results in the straight line depicted in Figure 1D, which has a slope equal to the difference between the depression and potentiation areas. Indeed, for every presynaptic spike, all preceding postsynaptic spikes sample the depression curve, and all succeeding postsynaptic spikes sample the potentiation curve, so that the averaged net effect Δw is the difference between the areas beneath the curves multiplied by the postsynaptic firing rate. More precisely,

\[
C(x) = \underbrace{\int_0^\infty A_+ e^{-t/\tau_+}\, x \, dt}_{\text{average potentiation}}
     + \underbrace{\int_{-\infty}^0 A_- e^{t/\tau_-}\, x \, dt}_{\text{average depression}}
     = x\,(A_+\tau_+ + A_-\tau_-).
\]
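As an illustrative numerical check (a sketch, not part of the article), the closed form above can be recovered by integrating the STDP window directly against a flat Poisson rate; the quadrature step and cutoff are arbitrary choices:

```python
import math

# Sketch: verify the all-pairs additive result C(x) = x * (A+ tau+ + A- tau-)
# by direct numerical integration of the STDP window against a flat
# postsynaptic rate x.  Parameters are the Froemke & Dan (2002) values
# quoted in the text.
A_plus, A_minus = 103.0, -51.0      # potentiation / depression amplitudes
tau_plus, tau_minus = 0.014, 0.034  # time constants (sec)

def stdp(dt):
    """Weight change for one pre/post pair, dt = t_post - t_pre (sec)."""
    if dt > 0:
        return A_plus * math.exp(-dt / tau_plus)
    return A_minus * math.exp(dt / tau_minus)

def additive_C(x, step=1e-5, t_max=0.5):
    """Midpoint-rule integral of stdp(t) * x over both halves of the window."""
    total, t = 0.0, step / 2
    while t < t_max:
        total += (stdp(t) + stdp(-t)) * x * step
        t += step
    return total

x = 10.0  # postsynaptic firing rate (Hz)
analytic = x * (A_plus * tau_plus + A_minus * tau_minus)
```

With these parameters the slope A_+τ_+ + A_-τ_- ≈ −0.29 is negative, so the uncorrelated additive curve is purely depressing, as described in the text.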
• Classical multiplicative STDP,

\[
(1 + 0.01\,\Delta w) = \prod_{i,j} (1 + 0.01\,\Delta w_{ij}),
\]

is equivalent to the additive model if one considers the logarithm

\[
\log(1 + 0.01\,\Delta w) = \sum_{i,j} \log(1 + 0.01\,\Delta w_{ij}).
\]

This also results in a straight line with a slope equal to the difference between the areas bounded by the curves,

\[
\int_0^\infty \log(1 + 0.01\,A_\pm e^{-t/\tau_\pm}) \, dt.
\]
See Figure 1D.

• Suppression additive STDP assigns a suppression weight to each spike that depends on the time elapsed since the previous spike,

\[
\epsilon_i = 1 - e^{-(t_i - t_{i-1})/\tau},
\]

where τ^pre = 28 ms and τ^post = 88 ms, so that the net effect is

\[
\Delta w = \sum_{i,j} \epsilon_i^{\mathrm{pre}} \, \epsilon_j^{\mathrm{post}} \, \Delta w_{ij}.
\]

This is the model of Froemke and Dan (2002), and the parameters are taken from that model. Because of the interactions between the spikes,
all the spikes are in a chronic state of "suppression," that is,

\[
\langle \epsilon \rangle = 1 - \int_0^\infty e^{-t/\tau}\, x e^{-xt} \, dt = \frac{1}{1 + x\tau},
\]

and the averaged synaptic modification is the curve,

\[
C(x) = x\,(A_+\tau_+ + A_-\tau_-)\,\frac{1}{1 + 10\,\tau^{\mathrm{pre}}}\,\frac{1}{1 + x\,\tau^{\mathrm{post}}},
\]

depicted in Figure 1D. It saturates at high firing rates but never crosses zero.

• Classical additive STDP (correlated spike trains). When the pre- and postsynaptic neurons have correlated firings, the postsynaptic firing density can be represented in the form

\[
p(t) = x + a\,c(t),
\]

where c(t) describes the shape of the peak of the cross-correlogram, as in Figure 1C, and a is a scaling factor, which may depend on the postsynaptic firing rate x. The averaged synaptic modification has the form

\[
C(x) = \underbrace{\int_0^\infty A_+ e^{-t/\tau_+} (x + a\,c(t)) \, dt}_{\text{average potentiation}}
     + \underbrace{\int_{-\infty}^0 A_- e^{t/\tau_-} (x + a\,c(t)) \, dt}_{\text{average depression}}
     = x\,(A_+\tau_+ + A_-\tau_-) + a\,c_0, \tag{A.3}
\]

where

\[
c_0 = \int_0^\infty A_+ e^{-t/\tau_+} c(t)\, dt + \int_{-\infty}^0 A_- e^{t/\tau_-} c(t)\, dt
\]

is a parameter that depends on the exact shape of the peak of the cross-correlogram. If the magnitude of the peak is constant, then C(x) has the same straight-line form as in the uncorrelated case except that it is translated by a c_0, as in Figure 1D. If the magnitude of the peak is proportional to the firing rate, for example, a = x, then C(x) is a straight line with the slope A_+τ_+ + A_-τ_- + c_0.
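The value of c_0 depends on the correlogram shape, which is left unspecified above; purely as an illustration, the sketch below assumes a Gaussian peak of (hypothetical) width σ = 5 ms and evaluates c_0 numerically:

```python
import math

# Sketch: evaluating c0 for an assumed cross-correlogram peak.  The Gaussian
# shape c(t) and the width sigma are hypothetical illustrations; the STDP
# parameters are the values quoted in the text.
A_plus, A_minus = 103.0, -51.0
tau_plus, tau_minus = 0.014, 0.034
sigma = 0.005  # assumed correlogram peak width (sec)

def c(t):
    """Assumed (symmetric Gaussian) shape of the cross-correlogram peak."""
    return math.exp(-t * t / (2.0 * sigma * sigma))

def c0(step=1e-5, t_max=0.2):
    """c0 = int_0^inf A+ e^(-t/tau+) c(t) dt + int_{-inf}^0 A- e^(t/tau-) c(t) dt.
    The second integral is folded onto t > 0 using the symmetry of c(t)."""
    total, t = 0.0, step / 2
    while t < t_max:
        total += (A_plus * math.exp(-t / tau_plus)
                  + A_minus * math.exp(-t / tau_minus)) * c(t) * step
        t += step
    return total
```

For a peak much narrower than the STDP time constants, c_0 comes out positive (A_+ > |A_-| near zero lag), which is the regime that produces the nonzero stable equilibrium discussed in section A.1.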
• Semi-nearest-neighbor additive STDP. For each presynaptic spike, we consider only one preceding postsynaptic spike and ignore all earlier spikes. The motivation for this is the idea that the preceding postsynaptic spike overrides the effect of the earlier spikes due to sublinear summation of membrane voltage in dendritic spines. If there is no saturation in Ca²⁺ or glutamate/NMDA receptor dynamics, then all subsequent postsynaptic spikes must be considered. As a result, the postsynaptic probability density is exponential before the presynaptic spike (as in Figure 1E) but flat after the presynaptic spike (as in Figure 1C). The resulting net modification per presynaptic spike is

\[
C(x) = \underbrace{\int_0^\infty A_+ e^{-t/\tau_+}\, x \, dt}_{\text{average potentiation}}
     + \underbrace{\int_{-\infty}^0 A_- e^{t/\tau_-}\, x e^{xt} \, dt}_{\text{average depression}}
     = x \left( A_+ \tau_+ + \frac{A_-}{\tau_-^{-1} + x} \right),
\]
which is similar to the curve depicted in Figure 1F.

• Nearest-neighbor additive STDP. For each presynaptic spike, only two postsynaptic spikes are considered: the one that occurs before and the one that occurs after the presynaptic firing. The resulting averaged curve,

\[
C(x) = \underbrace{\int_0^\infty A_+ e^{-t/\tau_+}\, x e^{-xt} \, dt}_{\text{average potentiation}}
     + \underbrace{\int_{-\infty}^0 A_- e^{t/\tau_-}\, x e^{xt} \, dt}_{\text{average depression}}
     = x \left( \frac{A_+}{\tau_+^{-1} + x} + \frac{A_-}{\tau_-^{-1} + x} \right),
\]

is depicted in Figure 1F.

• Nearest-spike additive STDP. For each presynaptic spike, one finds the nearest postsynaptic spike (model 2 in Sjostrom et al., 2001), regardless of whether it is before or after the presynaptic firing. Since the conditional postsynaptic probability density is x e^{-2x|t|}, the resulting averaged curve has the form

\[
C(x) = x \left( \frac{A_+}{\tau_+^{-1} + 2x} + \frac{A_-}{\tau_-^{-1} + 2x} \right).
\]

• Nearest-spike additive STDP with LTP winning is similar to the one above except that the LTD due to "post then pre" pairs is not counted if the postsynaptic spike participated in an LTP interaction. This corresponds to the model 3 implementation of the STDP rule by Sjostrom et al. (2001). The conditional postsynaptic probability for the LTP window is as above, that is, x e^{-2xt}, t ≥ 0. The probability for the LTD window depends on the pre- and postsynaptic firing rates, and it can easily be determined numerically. If both rates are the same, then it can be approximated by
[Figure 2: Stability diagram of two implementations of the STDP rule. Left panel: standard additive; right panel: nearest-neighbor. Each panel plots the synaptic change C(x) against the postsynaptic firing rate x (0-20 Hz), with separate curves for correlated and uncorrelated spike trains.]
\[
\tfrac{2}{3}\, x e^{2xt}, \quad t < 0.
\]

The resulting averaged curve has the form

\[
C(x) = x \left( \frac{A_+}{\tau_+^{-1} + 2x} + \frac{(2/3)\,A_-}{\tau_-^{-1} + 2x} \right).
\]

A.1 Stability of Various Implementations of STDP. Using the relationship between the STDP and BCM rules, we can study some aspects of the stability of the former using the latter. Consider the BCM equation

\[
\dot{s} = C(x_{\mathrm{post}}(t))\, x_{\mathrm{pre}}(t),
\]

where s is the synaptic weight and x_pre and x_post are the pre- and postsynaptic firing rates. The form of the function C(x) provides invaluable information about the stability of synaptic modification even when there is no modification of the threshold. The additive, multiplicative, and suppression implementations of STDP without correlations result in the negative functions C(x) depicted in Figure 1D, provided that A_+τ_+ < |A_-τ_-|. In this case, the synaptic weight s → 0 for any nonzero pre- or postsynaptic firing rate. This result confirms the findings of Song et al. (2000) that uncorrelated firing of a postsynaptic neuron always results in synaptic depression, provided that the depression area of the STDP curve is greater than the facilitation area. When the postsynaptic firing rate is small, the postsynaptic neuron starts to exhibit correlations with presynaptic neurons (Song et al., 2000). Therefore, to study the dynamics of s for small x_post, we need to consider a correlated probability density, such as the one in Figure 1C. The classical additive implementation of STDP results in C(x) having the form A.3 depicted in Figure 1D and Figure 2, left. It has a nonzero globally stable equilibrium.
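As a side check (a sketch, not from the article), the modification threshold of "around 12 Hz" quoted in section 3 can be recovered from the nearest-neighbor curve by locating the nonzero root of C(x):

```python
# Sketch: the firing rate at which the nearest-neighbor curve
# C(x) = x * (A+/(1/tau+ + x) + A-/(1/tau- + x)) crosses zero, i.e. the
# BCM-like modification threshold.  Parameters as quoted in the text.
A_plus, A_minus = 103.0, -51.0
tau_plus, tau_minus = 0.014, 0.034

def C_nn(x):
    """Averaged modification per presynaptic spike, nearest-neighbor pairing."""
    return x * (A_plus / (1.0 / tau_plus + x) + A_minus / (1.0 / tau_minus + x))

def threshold(lo=1.0, hi=20.0, tol=1e-6):
    """Bisection for the nonzero root of C_nn: LTD below it, LTP above it."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if C_nn(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Solving C_nn(x) = 0 analytically gives x = −(A_-/τ_+ + A_+/τ_-)/(A_+ + A_-) ≈ 11.8 Hz, consistent with the value quoted in section 3.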
Similarly, the nearest-neighbor implementation of STDP with correlations results in

\[
C(x) = x \left( A_+ \tau_+ + \frac{A_-}{\tau_-^{-1} + x} \right) + a\,c_0.
\]

Here, the term a c_0 is due to correlations, as in the case of classical additive STDP considered above. The graph of this function is depicted in Figure 2, right, and it has a stable and an unstable equilibrium. Small postsynaptic firing rates result in s converging to the stable equilibrium (the black circle in the figure), whereas large firing rates result in runaway dynamics of s (s → +∞). The stable equilibrium corresponds to a balance of potentiation, which is due to the correlations (coincidences) of pre- and postsynaptic firing, and depression, which occurs because the depression area of STDP is greater than the facilitation area. If the term a c_0 increases, the stable equilibrium in Figure 2, left, persists. In contrast, the stable and unstable equilibria in Figure 2, right, approach and annihilate each other via a saddle-node bifurcation. The dynamics of this implementation of STDP then becomes unstable: any firing rate of the postsynaptic neuron results in potentiation. Thus, the classical implementation of STDP is unconditionally stable, whereas the nearest-neighbor implementation can be stable or unstable depending on how strong the correlations are. Finally, notice that we consider the issue of stability assuming that the postsynaptic firing rate does not depend on the strength of the synapse s. The qualitative results persist when x_post is a monotone function of s (i.e., the stronger the synapse, the higher the postsynaptic rate). More research is needed to determine the stability of various implementations of STDP when x_post is a nonmonotone function of s (i.e., a very strong synaptic drive results in decreased input resistance and/or inactivation of Na conductances and, hence, a smaller firing rate).
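The saddle-node scenario described above can be reproduced numerically; in the sketch below the values chosen for the correlation term a c_0 are hypothetical, while the STDP parameters are those quoted in the text:

```python
# Sketch: equilibria of ds/dt = C(x) for the nearest-neighbor rule with a
# correlation term ac0.  The values of ac0 are hypothetical illustrations;
# the STDP parameters are those quoted in the text.
A_plus, A_minus = 103.0, -51.0
tau_plus, tau_minus = 0.014, 0.034

def C(x, ac0):
    """Correlated nearest-neighbor curve: x*(A+ tau+ + A-/(1/tau- + x)) + ac0."""
    return x * (A_plus * tau_plus + A_minus / (1.0 / tau_minus + x)) + ac0

def equilibria(ac0, x_max=20.0, dx=0.01):
    """Scan for zero crossings of C: '+ to -' crossings are stable equilibria
    of the synaptic weight, '- to +' crossings are unstable ones."""
    eqs = []
    x = dx
    while x + dx <= x_max:
        c1, c2 = C(x, ac0), C(x + dx, ac0)
        if c1 > 0 >= c2:
            eqs.append((x + dx / 2, "stable"))
        elif c1 < 0 <= c2:
            eqs.append((x + dx / 2, "unstable"))
        x += dx
    return eqs
```

With a moderate correlation term (e.g., ac0 = 0.2) the scan finds a stable equilibrium below 1 Hz and an unstable one near 5 Hz; increasing ac0 (e.g., to 0.5) makes the pair annihilate, leaving C(x) > 0 everywhere, i.e., runaway potentiation, as described above.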
Acknowledgments

We thank Jeff Krichmar, Anil Seth, and Jonathan Rubin for reading the first draft of the manuscript and making a number of useful suggestions. This research was supported by the Neurosciences Research Foundation and in part by a grant from the Ala Family Foundation.

References

Abarbanel, H. D. I., Huerta, R., & Rabinovich, M. I. (2002). Dynamical model of long-term synaptic plasticity. PNAS, 99, 10132–10137.
Bear, M. F., Cooper, L. N., & Ebner, F. F. (1987). A physiological basis for a theory of synapse modification. Science, 237, 42–48.
Bear, M. F., & Malenka, R. C. (1994). Synaptic plasticity: LTP and LTD. Curr. Opn. Neurobiol., 4, 389–399.
Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.
Castellani, G. C., Quinlan, E. M., Cooper, L. N., & Shouval, H. Z. (2001). A biophysical model of bidirectional synaptic plasticity: Dependence on AMPA and NMDA receptors. PNAS, 98, 12772–12777.
Debanne, D., Gahwiler, B. H., & Thomson, S. M. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol. (London), 507, 237–247.
Feldman, D. E. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Comput., 13, 2709–2741.
Kirkwood, A., Dudek, S. M., Gold, J. T., Aizenman, C. D., & Bear, M. F. (1993). Common forms of synaptic plasticity in the hippocampus and neocortex in vitro. Science, 260, 1518–1521.
Kirkwood, A., Rioult, M. G., & Bear, M. F. (1996). Experience-dependent modification of synaptic plasticity in visual cortex. Nature, 381, 526–528.
Markram, H., Lubke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Philpot, B. D., Sekhar, A. K., Shouval, H. Z., & Bear, M. F. (2001). Visual experience and deprivation bidirectionally modify the composition and function of NMDA receptors in visual cortex. Neuron, 29, 157–169.
Senn, W., Markram, H., & Tsodyks, M. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Computation, 13, 35–67.
Sjostrom, P. J., Turrigiano, G. G., & Nelson, S. B. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neurosci., 3, 919–926.
van Rossum, M. C., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20, 8812–8821.

Received June 25, 2002; accepted December 28, 2002.
LETTER
Communicated by Marion Stewart-Bartlett
Learning Innate Face Preferences

James A. Bednar
[email protected]

Risto Miikkulainen
[email protected]

Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, U.S.A.
Newborn humans preferentially orient to facelike patterns at birth, but months of experience with faces are required for full face processing abilities to develop. Several models have been proposed for how the interaction of genetic and environmental influences can explain these data. These models generally assume that the brain areas responsible for newborn orienting responses are not capable of learning and are physically separate from those that later learn from real faces. However, it has been difficult to reconcile these models with recent discoveries of face learning in newborns and young infants. We propose a general mechanism by which genetically specified and environment-driven preferences can coexist in the same visual areas. In particular, newborn face orienting may be the result of prenatal exposure of a learning system to internally generated input patterns, such as those found in PGO waves during REM sleep. Simulating this process with the HLISSOM biological model of the visual system, we demonstrate that the combination of learning and internal patterns is an efficient way to specify and develop circuitry for face perception. This prenatal learning can account for the newborn preferences for schematic and photographic images of faces, providing a computational explanation for how genetic influences interact with experience to construct a complex adaptive system.

1 Introduction

A number of studies have found that newborn humans will preferentially turn their eyes or head toward facelike stimuli within minutes or hours after birth (Goren, Sarty, & Wu, 1975; Johnson, Dziurawiec, Ellis, & Morton, 1991; Johnson & Morton, 1991; Mondloch et al., 1999; Simion, Valenza, Umiltà, & Dalla Barba, 1998; Valenza, Simion, Cassia, & Umiltà, 1996). These results suggest that some rudimentary face-selective visual circuitry develops prenatally.
However, full face processing abilities appear to require months or years of experience with faces (as reviewed in de Haan, 2001; Gauthier & Nelson, 2001; Pascalis & Slater, 2001; Tovée, 1998).

Neural Computation 15, 1525–1557 (2003) © 2003 Massachusetts Institute of Technology
As will be described in the next section, several theoretical and computational models have been proposed to explain how face processing develops through a combination of genetic and environmental influences. Most of these models assume that the brain areas responsible for newborn orienting behavior do not learn and are physically separate from those that later learn from real faces (Johnson & Morton, 1991). However, recent studies show that newborns are capable of learning faces in the first few hours and days after birth (Bushnell, Sai, & Mullin, 1989; Field, Cohen, Garcia, & Greenberg, 1984; Walton, Armstrong, & Bower, 1997; Walton & Bower, 1993). It has been difficult to reconcile the existing interactional models with early learning of faces. In this letter, we propose a novel explanation for how genetic and environmental influences can interact in a way that allows the visual system to learn faces at all ages. The main hypothesis is that the same brain regions and learning mechanisms are active throughout development, both before and after birth. In both cases, the learning mechanisms are extracting and encoding regularities in their input patterns, but in prenatal development, the input patterns are internally generated and genetically controlled, whereas in postnatal development, the input patterns come from the visual environment (Constantine-Paton, Cline, & Debski, 1990; Jouvet, 1999; Marks, Shaffery, Oksenberg, Speciale, & Roffwarg, 1995; Roffwarg, Muzio, & Dement, 1966; Shatz, 1990, 1996). This way, instead of precisely specifying the organization of the brain, the genome can encode a developmental process that is based on presenting genetically determined patterns to a learning mechanism. After birth, the learning mechanism can then seamlessly integrate environmental and genetic information into the same cortical hardware. Separating the genetic specification and visual processing circuitry in this manner also lets each evolve and develop independently.
A simple, genetically fixed pattern-generating region can then serve to direct the development of an arbitrarily complex visual system. This explanation can help resolve a number of the confusing results from studies of infant face perception. We test this idea computationally using a self-organizing model of cortical development (see Erwin, Obermayer, & Schulten, 1995, and Swindale, 1996, for reviews of this class of models). Computational models allow the assumptions and implications of a conceptual model to be evaluated rigorously and are particularly useful for generating testable predictions. In previous experiments using the RF-LISSOM model (receptive-field laterally interconnected synergetically self-organizing map; Miikkulainen, Bednar, Choe, & Sirosh, 1997; Sirosh & Miikkulainen, 1994; Sirosh, Miikkulainen, & Bednar, 1996), we have shown that patterns of orientation preferences and lateral connections in the primary visual cortex can form from spontaneously generated activity (Bednar & Miikkulainen, 1998). We have also shown in preliminary work how a simplified model of face processing can account for much of the behavioral data on newborn face preferences (Bednar & Miikkulainen, 2000).
In this article, we present HLISSOM, a significant extension to RF-LISSOM that allows processing of real photographic images by modeling the retina, lateral geniculate nucleus (LGN), and higher-level visual areas in addition to the primary visual cortex. The same model allows preferences for low-level features such as orientation or spatial frequency and high-level features such as faces to be compared, in order to account for the different categories of preferences found in experiments with newborns. Very few self-organizing models have been tested with real images, and none to our knowledge has previously modeled the self-organization of both the primary visual cortex and higher regions or simulated the large retinal and cortical area needed to process the large stimuli tested with newborns. Using HLISSOM, we demonstrate how patterns found during REM sleep could explain a newborn's orienting responses toward faces, within a general-purpose adaptive system for learning both low-level and high-level features of images. This explanation will be called the pattern-generation model of newborn face preferences. In the next section, we review other models of the early development of face perception as well as experimental evidence for the internally generated brain activity on which the pattern-generation model is based. Next, we introduce the HLISSOM architecture and describe the experimental setup for testing face detection computationally. We then show how this model can account for newborn face preferences using both schematic and photographic images, and suggest future computational and psychophysical experiments to help understand the development of face processing.
2 Related Work

There are three main previous explanations for how a newborn can show an initial face preference and later develop full face-processing abilities: (1) the linear systems model (LSM), (2) sensory models (of which we consider the Acerra, Burnod, & de Schonen, 2002, computational model and the top-heavy conceptual model separately), and (3) multiple systems models. These hypotheses are reviewed below, and we argue that the pattern-generation model provides a simpler, more effective explanation.

2.1 Linear Systems Model. The LSM (Banks & Salapatek, 1981; Kleiner, 1993) is a straightforward and effective way of explaining a wide variety of newborn pattern preferences and could easily be implemented as a computational model. Due to its generality and simplicity, it constitutes a baseline model against which others can be compared. The LSM is based solely on the newborn's measured contrast sensitivity function (CSF). For a given spatial frequency, the value of the CSF will be high if the early visual pathways respond strongly to that size of pattern, and lower otherwise. The newborn CSF is limited by the immature state of the eye and the early visual pathways, which makes lower frequencies more visible than fine detail.
The assumption of the LSM is that newborns simply pay attention to those patterns that, when convolved with the CSF, give the largest response. Low-contrast patterns and patterns with only very fine detail are only faintly visible, if at all, to newborns (Banks & Salapatek, 1981). Conversely, faces might be preferred simply because they have strong spatial-frequency components in the ranges that are most visible to newborns. However, studies have found that the LSM fails to account for the responses to facelike stimuli. For instance, some of the facelike patterns preferred by newborns have a lower amplitude spectrum in the visible range (and thus lower expected LSM response) than patterns that are less preferred (Johnson & Morton, 1991). The LSM also predicts that the newborn will respond equally well to a schematic face regardless of its orientation, because the orientation does not affect the spatial frequency or the contrast. Instead, newborns prefer schematic facelike stimuli oriented right-side-up. Such a preference is found even when the inverted stimulus is a better match to the CSF (Valenza et al., 1996). Thus, the CSF alone does not explain face preferences, and a more complicated model is required.

2.2 Acerra et al. Sensory Model. The LSM is a high-level abstraction of the properties of the early visual system. Sensory models extend the LSM to include additional constraints and circuitry, but without adding face-selective visual regions or systems. Acerra et al. (2002) recently developed such a computational model that can account for some of the face preferences found in the Valenza et al. (1996) study. Their model consists of a fixed Gabor-filter-based model of V1, plus a higher-level sheet of neurons with modifiable connections. They model two conditions separately: newborn face preferences and postnatal development by four months. The newborn model includes only V1, because they assume that the higher-level sheet is not yet functional at birth. Acerra et al.
(2002) showed that the newborn model responds slightly more strongly to the upright schematic face pattern used by Valenza et al. (1996) than to the inverted one, because of differences in the responses of a subset of the V1 neurons. This surprising result replicates the newborn face preferences from Valenza et al. The difference in model responses arises because Valenza et al. actually inverted only the internal facial features, not the entire pattern. In the upright case, the spacing is more regular between the internal features and the face outline (compare Figure 5d with 5g, Retina row). Making the spacing more regular between alternating white and black blobs will increase the response of a Gabor filter whose receptive field (RF) lobes match the blob size. As a result, the total response of all filters will be slightly higher for the facelike (upright) pattern than for the nonfacelike (inverted) pattern. However, the Acerra et al. model was not tested with patterns from other studies of newborn face preferences, such as Johnson et al. (1991). The facelike stimuli published in Johnson et al. (1991) do not have a regular spacing
between the internal features and the outline, and so we do not expect that the Acerra et al. model will replicate preferences for these patterns. Moreover, Johnson et al. used a white paddle against a light-colored ceiling, and so their face outlines would have a much lower contrast than the black-background patterns used in Valenza et al. (1996). Thus, although border effects may contribute to the face preferences found by Valenza et al., they are unlikely to explain those measured by Johnson et al. The Acerra et al. newborn model also does not explain newborn learning of faces, because their V1 model is fixed and the higher-level area is assumed not to be functional at birth. Also important, the Acerra et al. newborn model was not tested with real images of faces, where the spacing of the internal features from the face outline varies widely depending on the way the hair falls. Because of these differences, we do not expect that the Acerra et al. model will show a significantly higher response overall to photographs of real faces than to other similar images. The pattern-generation model makes the opposite prediction and explains how newborns can learn faces.

2.3 Top-Heavy Sensory Model. Simion, Macchi Cassia, Turati, and Valenza (2001) have also presented a sensory model of newborn preferences, but unlike Acerra et al. (2002), it is a conceptual model only. Simion et al. observed that nearly all of the facelike schematic patterns that have been tested with newborns are top-heavy; i.e., they have a boundary that contains denser patterns in the upper half than in the lower half. They also ran behavioral experiments showing that newborns prefer several top-heavy (but not facelike) schematic patterns to similar but inverted patterns. Based on these results, they proposed that newborns prefer top-heavy patterns in general, and thus prefer facelike schematic patterns as a special case.
This hypothesis is compatible with most of the experimental data so far collected in newborns. However, facelike patterns have not yet been compared directly with other top-heavy patterns in newborn studies. Thus, it is not yet known whether newborns would prefer a facelike pattern to a similarly top-heavy but not facelike pattern. Future experimental tests with newborns can resolve this issue. To be tested computationally, the top-heavy hypothesis would need to be made more explicit, with a specific mechanism for locating object boundaries and the relative locations of patterns within them. We expect that such a test would find only a small preference (if any) for photographs of real faces, compared to many other common stimuli. For faces with beards, wide smiles, or wide-open mouths, many control patterns would be preferred over the face image, because such face images will no longer be top-heavy. Because the bulk of the current evidence suggests that instead, the newborn preferences are more selective for faces, most of the HLISSOM simulations will use training patterns that result in strongly face-selective visual responses. Section 6.1 will summarize results from training patterns that are less facelike and thus more comparable to the top-heavy model.
2.4 Multiple-Systems Models. The most widely known conceptual model for newborn face preferences and later learning was proposed by Johnson and Morton (1991). Apart from our tests using the HLISSOM architecture, it has not yet been evaluated computationally. Johnson and Morton suggested that infant face preferences are mediated by two hypothetical visual processing systems, which they dubbed CONSPEC and CONLERN. CONSPEC is a nonplastic system controlling orienting to facelike patterns, assumed to be located in the subcortical superior-colliculus-pulvinar pathway. Johnson and Morton proposed that a CONSPEC responding to three blobs in a triangular configuration, one each for the eyes and one for the nose-mouth region, would account for the newborn face preferences (for examples, see Figure 3a and the Retina row of Figure 4c). CONLERN is a separate plastic cortical system, presumably the face-processing areas that have been found in adults and in infant monkeys (Kanwisher, McDermott, & Chun, 1997; Rodman, 1994). The CONLERN system would assume control after about six weeks of age and would account for the face preferences seen in older infants. Eventually, as it learns from real faces, CONLERN would gradually stop responding to schematic faces, which would explain why face preferences can no longer be measured with static schematic patterns by five months (Johnson & Morton, 1991). The CONSPEC/CONLERN model is plausible, given that the superior colliculus is fairly mature in newborn monkeys and does seem to be controlling attention and other functions (Wallace, McHaffie, & Stein, 1997). Moreover, some neurons in the adult superior colliculus/pulvinar pathway appear to be selective for faces (Morris, Ohman, & Dolan, 1999), although such neurons have not yet been found in young animals.
The model also helps explain why infants after one month of age show a reduced interest in faces in the periphery, which could occur as attentional control shifts to the not-quite-mature cortical system (Johnson et al., 1991; Johnson & Morton, 1991). However, subsequent studies showed that even newborns are capable of learning individual faces (Slater, 1993; Slater & Kirby, 1998). Thus, if there are two visual processing systems, either both are plastic or both are functioning at birth, and thus there is no a priori reason that a single face-selective visual system would be insufficient. On the other hand, de Schonen, Mancini, and Liégeois (1998) argue for three visual processing systems: a subcortical one responsible for facial feature preferences at birth, another one responsible for newborn learning (of objects and head/hair outlines; Slater, 1993), and a cortical system responsible for older infant and adult learning of facial features. And Simion, Valenza, and Umiltà (1998) proposed that face selectivity instead relies on multiple visual processing systems within the cortex, maturing first in the dorsal stream but later supplanted by the ventral stream (which is where most of the face-selective visual regions have been found in adult humans). In contrast to the increasing complexity of these explanations, we propose that a single general-purpose, plastic visual processing system is sufficient,
if that system is first exposed to internally generated facelike patterns of neural activity.

2.5 Internally Generated Activity. Spontaneous neural activity has recently been documented in many cortical and subcortical areas as they develop, including the visual cortex, the retina, the auditory system, and the spinal cord (Feller, Wellis, Stellwagen, Werblin, & Shatz, 1996; Leinekugel et al., 2002; Lippe, 1994; Wong, Meister, & Shatz, 1993; Yuste, Nelson, Rubin, & Katz, 1995; reviewed in O'Donovan, 1999; Wong, 1999). Such activity has been shown to be responsible for the segregation of the LGN into eye-specific layers before birth, indicating that internally generated activity is crucial for normal development (Shatz, 1990, 1996). We will focus on one common type of spontaneous activity: ponto-geniculo-occipital (PGO) waves generated during rapid-eye-movement (REM) sleep. Developing embryos spend a large percentage of their time in what appears to be REM sleep, which suggests that it plays a major role in development (Marks et al., 1995; Roffwarg et al., 1966). During and just before REM sleep, PGO waves originate in the pons of the brain stem and travel to the LGN, visual cortex, and a variety of subcortical areas (see Callaway, Lydic, Baghdoyan, & Hobson, 1987, for a review). PGO waves are strongly correlated with eye movements and with vivid visual imagery in dreams, suggesting that they activate the visual system as if they were visual inputs (Marks et al., 1995). PGO waves have been found to elicit different distributions of activity in different species (Datta, 1997), and interrupting them has been shown to increase the influence of the environment on development (Marks et al., 1995; Shaffery, Roffwarg, Speciale, & Marks, 1999). All of these characteristics suggest that PGO waves may be providing species-specific training patterns for development (Jouvet, 1998).
However, due to limitations in experimental imaging equipment and techniques, it has not yet been possible to measure the two-dimensional shape of the activity resulting from the PGO waves; only the temporal behavior is known (Rector, Poe, Redgrave, & Harper, 1997). How closely neonatal REM sleep and PGO waves are related to those of adults is also controversial (Dugovic & Turek, 2001; Jouvet, 1999). In any case, it is clear that there is substantial brain-stem-generated activity in the visual areas of newborns during sleep. This article predicts that if these activity patterns have a simple spatial configuration of three active areas (or other similar patterns), they can account for the measured face-detection performance of human newborns. The three-dot training patterns are similar to the three-dot preferences proposed by Johnson and Morton (1991). However, in that model, the patterns are implemented as a hard-wired subcortical visual area responsible for orienting to faces. In the pattern-generation model, the areas receiving visual input learn from their input patterns, and only the pattern generator is assumed to be hard-wired.
Both of these possible mechanisms for newborn orienting responses would require about the same amount of genetic specification, but the crucial difference is that in the pattern-generation approach, the visual processing system can be arbitrarily complex. In contrast, a hard-coded visual processing system like CONSPEC is limited in complexity by what can specifically be encoded in the genome. For this reason, a system implemented like CONSPEC is a plausible model for subcortically mediated orienting at birth, but would become less plausible if imaging techniques later show that newborn face preferences are mediated by a hierarchy of cortical areas, as in the adult. Specifying only the training patterns also makes sense from an evolutionary perspective, because that way, the patterns and the visual processing hardware can evolve independently. Separating these capabilities thus allows the visual system to become arbitrarily complex, while maintaining the same genetically specified function. Finally, assuming that only the pattern generator is hard-wired explains how infants of all ages can learn faces.

3 HLISSOM Model

We will investigate the pattern-generation hypothesis using the HLISSOM model of the primate visual system, focusing on how subcortically generated patterns can drive the development of cortical areas. The HLISSOM architecture is shown in Figure 1. The model consists of a hierarchy of two-dimensional sheets of neural units modeling different areas of the nervous system: two sheets of input units (the retinal photoreceptors and the PGO generator), two sheets of LGN units (ON-center and OFF-center), and two sheets of cortical units ("neurons"): the primary visual cortex (V1) and a higher-level area from the ventral processing stream with large RFs (here called the face-selective area, FSA, as described below).
Because the focus is on the two-dimensional organization of each cortical sheet, a neuron in V1 or the FSA corresponds to a vertical column of cells through the six anatomical layers of that area of the cortex. The FSA represents the first region in the ventral processing pathway above V1 that has RFs spanning approximately 45 degrees of visual arc, large enough to span a human face at close range. Areas V4v and LO are likely FSA candidates based on adult patterns of connectivity, but the infant connectivity patterns are not known (Rolls, 1990). We use the generic term face-selective area rather than V4v or LO to emphasize that the model results depend only on having RFs large enough to allow face-selective responses, not on the region's precise location or architecture. Within each cortical sheet (V1 or the FSA), neurons receive afferent connections from RFs in the form of broad overlapping circular patches of units in the previous sheet(s) in the hierarchy. Neurons in the cortical sheets also have reciprocal excitatory and inhibitory lateral connections to neurons
Figure 1: Schematic diagram of the HLISSOM model. Each sheet of units in the model visual pathway is shown with a sample activation pattern and the connections to one example unit. Gray-scale visual inputs are presented on the photoreceptors, and the resulting activity propagates through afferent connections to each of the higher levels. Internally generated PGO input propagates similarly to visual input, but travels through the ponto-geniculate (PG) pathway to occipital cortex (i.e., V1), rather than the retino-geniculate pathway. As in humans and animals, activity may be generated in either the PGO area or the photoreceptors, but not both at once. In the cortical levels (V1 and FSA), activity is focused by lateral connections, which are initially excitatory between nearby neurons (dotted circles) and inhibitory between more distant neurons (dashed circles). The final patterns of lateral and afferent connections in the cortical areas develop through an unsupervised self-organizing process. After self-organization is complete, each stage in the hierarchy represents a different level of abstraction, transforming the raw image input into a more biologically relevant representation. The LGN responds best to edges and lines, suppressing areas with no information. The V1 response is further selective for the orientation of each contour; the patchiness is due to geometrical constraints on representing all orientations on a two-dimensional surface. The FSA represents the highest level of abstraction; the response of an FSA neuron signals the presence of a face at the corresponding location on the retina, which is information that an organism can use directly to control behaviors like orienting.
within the same sheet. Lateral excitatory connections are short range, connecting each neuron with itself and its close neighbors in a circular radius. Lateral inhibitory connections extend in a larger radius, but also include connections to the neuron itself and to its neighbors. This pattern of lateral connections models the lateral interactions measured experimentally for high-contrast input patterns. Because the Hebbian learning used in this model is proportional to the activity levels of the neurons, the higher-contrast inputs have more influence on adaptation. Optical imaging and electrophysiological studies have shown that at these contrasts, long-range interactions in the cortex are inhibitory, even though individual lateral connections are primarily excitatory (Hirsch & Gilbert, 1991; Weliky, Kandler, Fitzpatrick, & Katz, 1995). Computational studies have also shown that long-range interactions between high-contrast inputs must be inhibitory for proper self-organization to occur (Sirosh & Miikkulainen, 1994). For simplicity, the model uses only inhibitory interactions, but additional long-range excitation can be included when studying phenomena such as perceptual grouping (Choe, 2001; Choe & Miikkulainen, 2000). Previous models have explained how the connections in the LGN could develop from internally generated activity in the retina (Eglen, 1997; Haith, 1998). Because the HLISSOM model focuses on learning at the cortical level, all connection strengths to such subcortical neurons are set to fixed values. The strengths were chosen to approximate the RFs that have been measured in adult LGN cells, using a standard difference-of-gaussians model (Rodieck, 1965). In this model, the center of each RF is first mapped to the location in the input sheet corresponding to the location of the LGN unit, and the weight to a receptor is then a fixed function of the distance from that center.
For an ON-center cell, the receptor weights are calculated from the difference of two normalized gaussians with radii $\sigma_c$ (center) and $\sigma_s$ (surround). The weights for an OFF-center cell are the negative of the ON-center weights; they are the surround minus the center instead of the center minus the surround. The parameter $\sigma_c$ determines the peak spatial frequency to which the LGN responds, and $\sigma_s$ is generally set to a fixed multiple of $\sigma_c$. The connections to neurons in the cortical sheets are initially unselective (either random, for afferent weights, or gaussian shaped, for lateral weights), and their strengths are subsequently modified through an unsupervised learning process. The learning process is driven by input patterns. Patterns may be presented on the photoreceptor sheet, representing external visual input, or they may be generated internally, reaching V1 through the PGO pathway. The PGO-generating region of the pons and its pathway to the LGN have not yet been mapped out in detail, but since the activity that results from the PGO waves is similar to that from visual input (Marks et al., 1995), for simplicity we assume that the retino-geniculate and PGO pathways are similar. Therefore, the PGO pattern input is modeled with an area like the photoreceptor sheet, connecting to the LGN in the same way.
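The difference-of-gaussians construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function names, patch size, and the simple clipped activation standing in for the piecewise-linear sigmoid are our assumptions:

```python
import numpy as np

def dog_weights(size, center, sigma_c, sigma_s):
    """ON-center RF patch: normalized center gaussian minus normalized
    surround gaussian. OFF-center weights would be the negative."""
    ys, xs = np.mgrid[0:size, 0:size]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    g_c = np.exp(-d2 / (2 * sigma_c ** 2))
    g_s = np.exp(-d2 / (2 * sigma_s ** 2))
    g_c /= g_c.sum()  # normalize each gaussian separately,
    g_s /= g_s.sum()  # so the weights sum to zero
    return g_c - g_s

def lgn_response(image, weights, gamma_A):
    """Weighted sum of the receptors in the RF, rectified and saturated
    (a simple stand-in for the piecewise-linear sigmoid of eq. 3.1)."""
    return float(np.clip(gamma_A * np.sum(weights * image), 0.0, 1.0))
```

Because the two gaussians are individually normalized, a uniform input produces no response; only contrast between center and surround activates the unit, as the text describes.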
At each training step, the activities of all units are initialized to zero, and a grayscale pattern is drawn on the input sheet (either the PGO generator or the photoreceptors; see Figures 2a, 2i, and 3a for examples). The cells in the ON and OFF channels of the LGN compute their responses as a scalar product of their fixed weight vector and the activity of units in their receptive fields (as in Figures 2b, 2j, and 3b). The response $\eta_{ab}$ of ON- or OFF-center cell $(a, b)$ is calculated from the weighted sum of the retinal activations in its receptive field as

$$\eta_{ab} = \sigma\Bigl(\gamma_A \sum_{xy} \xi_{xy} \mu_{ab,xy}\Bigr), \tag{3.1}$$
where $\sigma$ is a piecewise-linear approximation of the sigmoid activation function, $\xi_{xy}$ is the activation of retina cell $(x, y)$, and $\mu_{ab,xy}$ is the corresponding afferent weight. An LGN neuron will respond when the input pattern is a better match to the central portion of the RF than to the surrounding portion; the response will be larger with higher contrast, subject to the minimum and maximum values enforced by the sigmoid. The constant afferent scaling factor $\gamma_A$ is set so that the LGN outputs approach 1.0 in the highest-contrast regions of typical input patterns. The cortical activation is similar to the LGN, but extended to support self-organization and to include lateral interactions. The total activation is computed from both afferent and lateral responses, which will be discussed separately here. First, the afferent response $\zeta_{ij}$ of V1 neuron $(i, j)$ is calculated like the retinal sum in equation 3.1, with an additional divisive (shunting) normalization:

$$\zeta_{ij} = \frac{\gamma_A \sum_{\rho ab} \xi_{\rho ab} \mu_{ij,\rho ab}}{1 + \gamma_N \sum_{\rho ab} \xi_{\rho ab}}, \tag{3.2}$$

where $\xi_{\rho ab}$ is the activation of unit $(a, b)$ in sheet $\rho$ (either the ON channel or the OFF channel) and $\mu_{ij,\rho ab}$ is the corresponding afferent weight. The normalization strength $\gamma_N$ starts at zero and is gradually increased over training as neurons become more selective, and the scaling factor $\gamma_A$ is set to compensate so that the afferent response $\zeta$ will continue to have values in the range 0 to 1.0 for typical input patterns. The FSA computes its afferent response just as V1 does, except that sheet $\rho$ in equation 3.2 is the settled response (described below) of V1, instead of the response of an ON or OFF channel of the LGN. Note that although only a single ON and OFF channel were used here, afferent inputs from additional channels with different peak spatial frequencies (i.e., different $\sigma_c$) can be included in equation 3.2 in the same way. Simulating a single channel allows V1 to be smaller and makes its output easier to interpret.
This is a valid approximation because the response of
each channel is broad, so even a single channel responds to a wide range of input patterns. The initial response of a cortical neuron, whether in V1 or the FSA, is computed from the afferent response only:

$$\eta_{ij}(0) = \sigma(\zeta_{ij}), \tag{3.3}$$

where $\sigma$ is a piecewise-linear approximation of the sigmoid activation function. After the initial response, the cortical activity evolves over a very short timescale through lateral interaction. At each subsequent time step, the neuron combines the afferent response $\zeta$ with lateral excitation and inhibition:

$$\eta_{ij}(t) = \sigma\Bigl(\zeta_{ij} + \gamma_E \sum_{kl} E_{ij,kl}\,\eta_{kl}(t-1) - \gamma_I \sum_{kl} I_{ij,kl}\,\eta_{kl}(t-1)\Bigr), \tag{3.4}$$
where $E_{ij,kl}$ is the excitatory lateral connection weight on the connection from neuron $(k, l)$ to neuron $(i, j)$, $I_{ij,kl}$ is the inhibitory connection weight, and $\eta_{kl}(t-1)$ is the activity of neuron $(k, l)$ during the previous time step. The scaling factors $\gamma_E$ and $\gamma_I$ determine the relative strengths of excitatory and inhibitory lateral interactions. While the cortical response in V1 and the FSA is settling, the afferent response remains constant. The cortical activity pattern in both areas starts out diffuse, but within a few iterations of equation 3.4, converges into a small number of stable, focused patches of activity, or activity bubbles (as in Figures 2c, 2d, 2k, 3e, and 3f). After the activity has settled, the connection weights of each neuron are modified. Afferent and lateral weights in both areas adapt according to the same mechanism: the Hebb rule, normalized so that the sum of the weights is constant:

$$w_{ij,mn}(t + \delta t) = \frac{w_{ij,mn}(t) + \alpha\,\eta_{ij} X_{mn}}{\sum_{mn} \bigl[w_{ij,mn}(t) + \alpha\,\eta_{ij} X_{mn}\bigr]}, \tag{3.5}$$
where $\eta_{ij}$ stands for the activity of neuron $(i, j)$ in the final activity bubble, $w_{ij,mn}$ is the afferent or lateral connection weight ($\mu$, $E$, or $I$), $\alpha$ is the learning rate for each type of connection ($\alpha_A$ for afferent weights, $\alpha_E$ for excitatory, and $\alpha_I$ for inhibitory), and $X_{mn}$ is the presynaptic activity ($\xi$ for afferent, $\eta$ for lateral). The larger the product of the pre- and postsynaptic activity $\eta_{ij} X_{mn}$, the larger the weight change. At long distances, very few neurons in the cortical regions have correlated activity, and therefore most long-range connections eventually become weak. The weak connections are eliminated periodically, resulting in patchy lateral connectivity similar to that observed in the visual cortex.
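The settling dynamics of equations 3.3 and 3.4 and the normalized Hebbian update of equation 3.5 can be sketched for a small sheet flattened to one dimension. This is a hedged NumPy sketch under stated assumptions: the flattened layout, the threshold values of the piecewise-linear sigmoid, and the function names are ours, not the paper's:

```python
import numpy as np

def pw_sigmoid(x, lower=0.1, upper=0.65):
    # piecewise-linear sigmoid: 0 below `lower`, 1 above `upper`,
    # linear in between (thresholds are illustrative)
    return np.clip((x - lower) / (upper - lower), 0.0, 1.0)

def settle(zeta, E, I, gamma_E, gamma_I, steps=9):
    """Eqs. 3.3-3.4: start from the afferent response, then iterate
    lateral excitation/inhibition while zeta is held fixed.
    zeta: (N,) afferent responses; E, I: (N, N) lateral weights."""
    eta = pw_sigmoid(zeta)                                   # eq. 3.3
    for _ in range(steps):                                   # eq. 3.4
        eta = pw_sigmoid(zeta + gamma_E * (E @ eta) - gamma_I * (I @ eta))
    return eta

def hebb_update(w, eta_post, x_pre, alpha):
    """Eq. 3.5: Hebbian increment, divisively renormalized so each
    neuron's (row's) weights keep a constant sum."""
    w_new = w + alpha * eta_post[:, None] * x_pre[None, :]
    return w_new / w_new.sum(axis=1, keepdims=True)
```

Note the two properties the text emphasizes: with no lateral input the settled response equals the initial afferent response, and after each Hebbian step every neuron's weight vector still sums to one, so weights compete rather than grow without bound.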
4 Experiments

To model the presentation of large stimuli at close range (as in typical experiments with newborns), V1 parameters were generated using the scaling equations from Kelkar, Bednar, and Miikkulainen (2001) to model a large V1 area (approximately 1600 mm² total) at a relatively low sampling density per mm² (approximately 50 neurons/mm²). The total area was chosen to be just large enough for the corresponding photoreceptor area to cover the required range, and the density was chosen to be the minimum that would still provide a V1 organization that matched animal maps. The input pattern scale was then set so that the spatial frequencies for which V1 response was maximal would match the frequency range to which newborns are most sensitive, as cited by Valenza et al. (1996). As mentioned above, the response is broad, and the model also responds to other spatial frequencies. The FSA parameters were generated using the scaling equations from Kelkar et al. (2001) to model a small region of units that have RFs large enough to span the central portion of a face. The final FSA activation threshold (lower bound on the sigmoid function in equations 3.3 and 3.4) was set to a high value so that the presence of a blob of activity in the FSA would be a criterion for the presence of a face at the corresponding location in the retina. This high threshold allows the FSA output to be interpreted unambiguously as a behavioral response, without the need for feedback from an additional higher-level set of units modeling newborn attention and fixation processes. The model consisted of 438 × 438 photoreceptors, 220 × 220 PGO generator units, 204 × 204 ON-center LGN units, 204 × 204 OFF-center LGN units, 288 × 288 V1 units, and 36 × 36 FSA units, for a total of 408,000 distinct units.
Area FSA was mapped to the central 160 × 160 region of V1 such that even the units near the borders of the FSA had a complete set of afferent connections on V1, with no RF extending over the outside edges of V1. V1 was similarly mapped to the central 192 × 192 region of the LGN channels, and the LGN channels to the central 384 × 384 region of the photoreceptors and the central 192 × 192 region of the PGO generators. In the figures that follow, only the area mapped directly to V1 is shown, to ensure that plots for different sheets have the same size scale. The model requires 300 megabytes of physical memory to represent the 80 million connections in the two cortical sheets. The remaining parameters are listed in the appendix.

4.1 Training V1. For simplicity, V1 and the FSA were self-organized in separate training phases. V1 was self-organized for 10,000 iterations, each with approximately 11 randomly located circular discs 25 PGO units wide (see Figure 2a). The background activity level was 0.5, and the brightness of each disc relative to this surround (either +0.3 or −0.3) was chosen randomly. The borders of each disc were smoothed into the background level following a gaussian of half-width σ = 1.5. After 10,000 such iterations (10
hours on a 600 MHz Pentium III workstation), the orientation map shown in Figures 2g and 2h emerged, consisting of neurons that respond strongly to oriented edges and thick lines. The map organization is similar to maps measured in experimental animals (Blasdel, 1992). We selected these V1 training patterns as well-defined approximations of internally generated spatial patterns, such as spontaneous retinal waves. Imaging data show that these patterns have spatially coherent blobs of activity (Feller et al., 1996), unlike the uniform random noise assumed in earlier models of spontaneous activity (Linsker, 1986). Each pattern represents the state of one wave at a given instant, and together the patterns result in a map of static input properties. It is possible to train the model with moving patterns as well, and the resulting map would represent directional selectivity in addition to orientation selectivity (Bednar & Miikkulainen, in press). However, including only orientation processing is appropriate in this case because all patterns tested with newborns have either been static or moving at approximately the same rate as control patterns. Miller (1994) had previously argued that because retinal waves are much larger than the V1 receptive fields and not strongly oriented, they cannot account for how orientation selectivity develops. The results here demonstrate

Figure 2: Facing page. Training the V1 orientation map. The input consisted of randomly located circular patterns (a) approximating spontaneous neural activity waves. The ON and OFF LGN channels compute their responses based on this input. (b) The ON response minus the OFF response is shown, with medium gray indicating regions with no response. The V1 neurons compute their responses based on the output of the ON and OFF channels. Initially, V1 neurons have random weights in their receptive fields, and thus all neurons with RFs overlapping the active areas of the LGN respond (c).
After training, the afferent weights of each neuron show a clear preference for a particular orientation (e). The plot shows the weights to the ON-center LGN sheet minus those to the OFF-center sheet, drawn on the full LGN sheet. The orientation maps (f, g) summarize the development of such preferences. Each neuron is color-coded according to its preferred orientation (using the color key beside d); brighter colors indicate a stronger selectivity. Initially the neurons are not selective for orientation (f), but after training, V1 develops an orientation map similar to those measured in experimental animals (g; Blasdel, 1992). The map covers a large area of V1; previous work corresponded to only the central 36 × 36 portion (h) of the 288 × 288 map in g. For the same input (a), the self-organized network response is patchy (d), because only those neurons whose preferred orientation matches the orientations of local features in the image respond. In d, each neuron is colored with the orientation it prefers, and the brightness of the color indicates how active that neuron is. The trained orientation map works well with natural images; each V1 neuron responds to the orientation of lines and edges in the corresponding region of the image (i–k). Color versions of these figures are available on-line at http://www.cs.utexas.edu/users/nn/.
the opposite conclusion and show that patterns do not need to be oriented as long as they are smooth and large relative to the RF size of the orientation map. To our knowledge, these simulations are the first to show that realistic orientation maps can develop from large, unoriented (but spatially coherent) stimuli like the retinal waves.

4.2 Training the FSA. After V1 training, the V1 weights were fixed to their self-organized values, and the FSA was allowed to activate and learn from the V1 responses. The FSA was self-organized for 10,000 iterations using two triples of circular dots, each arranged in a triangular facelike configuration (see Figure 3a). Each dot was 20 PGO units wide, and was 0.3 units darker than the surround, which itself was 0.5 on a scale of 0 to 1.0. Each triple was placed at a random location each iteration, at least 118 PGO units away from the center of the other one (to avoid overlap). The angle of each triple was drawn randomly from a narrow (σ = π/36 radians) normal distribution around vertical. Because the face preferences found in newborns have all been for faces of approximately the same size (life-sized at a distance of around 20 cm), only a single training pattern size was used. As a result, the model will be able to detect facelike patterns only at one particular size scale. If response to multiple face sizes (i.e., distances) is desired, the spatial scale of the training patterns can be varied during self-organization (Sirosh & Miikkulainen, 1996). However, the FSA in such a simulation would need to be much larger to represent the different sizes, and the resulting patchy FSA responses would require more complex methods of analysis. Using these three-dot patterns, the FSA was self-organized for 10,000 iterations (13 hours on a 600 MHz Pentium III workstation). The result was the face-selective map shown in Figure 3h, consisting of an array of neurons that respond most strongly to patterns similar to the training patterns.
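A three-dot training stimulus of the kind described above can be sketched as follows. Only the dot width, darkness, background level, and angular spread come from the text; the exact triangle geometry, the dots' gaussian edge smoothing, and the helper names are illustrative assumptions:

```python
import numpy as np

def draw_dot(img, cx, cy, radius, depth, sigma=1.5):
    """Darken a circular dot into img around (cx, cy), with a
    gaussian falloff of half-width sigma outside the disc edge."""
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    d = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2)
    profile = np.where(d <= radius, 1.0,
                       np.exp(-(d - radius) ** 2 / (2 * sigma ** 2)))
    img -= depth * profile  # modify the sheet in place

def three_dot_pattern(size=220, dot_width=20, depth=0.3,
                      angle_sd=np.pi / 36, rng=None):
    """One facelike triple: two 'eye' dots above one 'mouth' dot,
    at a random location, tilted by a nearly vertical random angle."""
    rng = rng or np.random.default_rng()
    img = np.full((size, size), 0.5)      # background level 0.5
    theta = rng.normal(0.0, angle_sd)     # small rotation about vertical
    cx, cy = rng.uniform(60, size - 60, size=2)
    offsets = np.array([(-15.0, -15.0), (15.0, -15.0), (0.0, 20.0)])
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    for dx, dy in offsets @ rot.T:
        draw_dot(img, cx + dx, cy + dy, dot_width / 2, depth)
    return img
```

In the paper's setup, two such non-overlapping triples are drawn per iteration on the PGO sheet; the sketch above draws one.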
Despite the overall similarities between neurons, the individual weight patterns are unique because each neuron targets specific orientation blobs in V1. Such complicated patterns would be difficult to specify and hard-wire directly, but they arise naturally from internally generated activity. As FSA training completed, the lower threshold of the sigmoid was increased to ensure that only patterns that are a strong match to an FSA neuron's weights will activate it. This way, the presence of FSA activity can be interpreted unambiguously as an indication that there is a face at the corresponding location in the retina, which will be important for measuring the face detection ability of the network. The cortical maps were then tested on natural images and with the same schematic stimuli on which human newborns have been tested (Goren et al., 1975; Johnson & Morton, 1991; Simion, Valenza, Umiltà, & Dalla Barba, 1998; Valenza et al., 1996). For all of the tests, the same set of model parameter settings described in appendix A was used. Schematic images were scaled to a brightness range (difference between the darkest and lightest pixels in
Figure 3: Training the FSA face map. (a) For training, the input consisted of simple three-dot configurations with random, nearly vertical orientations presented at random locations in the PGO layer. These patterns were chosen based on the experiments of Johnson and Morton (1991). (b–c) The LGN and V1 sheets compute their responses based on this input. Initially, FSA neurons will respond to any activity in their receptive fields (e), but after training, only neurons with closely matching RFs respond (f). Through self-organization, the FSA neurons develop RFs (d) that are selective for a range of V1 activity patterns like those resulting from the three-dot stimuli. The RFs are patchy because the weights target specific orientation blobs in V1. This complex match between the FSA and the local self-organized pattern in V1 would be difficult to ensure without training on internally generated patterns. (g, h) Plots show the afferent weights for every third neuron in the FSA. All neurons develop roughly similar weight profiles, differing primarily by the position of their preferred stimuli on the retina and by the specific orientation blobs targeted in V1. The largest differences between RFs are along the outside border, where the neurons are less selective for three-dot patterns. Overall, the FSA develops into a face detection map, signaling the location of facelike stimuli.
the image) of 1.0. Natural images were scaled to a brightness range of 2.5, so that facial features in images with faces would have a contrast comparable to that of the schematic images. If both types of images were instead scaled to the same brightness range, the model would prefer the schematic faces over real ones. Using different scales is appropriate for these experiments because we are only comparing
responses within groups of similar images, and we assume that the infant has become adapted to images of this type. For simplicity, the model does not specifically include the mechanisms in the eye, LGN, and V1 that are responsible for such contrast adaptation.

4.3 Predicting Behavioral Responses. The HLISSOM model provides detailed neural responses for each neural region. These responses constitute predictions for future electrophysiological and imaging measurements in animals and humans. However, the data currently available for infant face perception are behavioral. They consist of newborn attention preferences measured from visual tracking distance and looking time. Thus, validating the model on these data will require predicting a behavioral response based on the simulated neural responses. As a general principle, we hypothesize that newborns pay attention to the stimulus whose overall neural response most clearly differs from those to typical stimuli. This idea can be quantified as

$$a(t) = \frac{F(t)}{\bar{F}} + \frac{V(t)}{\bar{V}} + \frac{L(t)}{\bar{L}}, \tag{4.1}$$
where $a(t)$ is the attention level at time $t$, $X(t)$ is the total activity in region $X$ at that time, $\bar{X}$ is the average (or median) activity of region $X$ over recent history, and $F$, $V$, and $L$ represent the FSA, V1, and LGN regions, respectively. Because most stimuli activate the LGN and V1 but not the FSA, when a pattern evokes activity in the FSA, newborns would attend to it more strongly. Yet stimuli evoking only V1 activity could still be preferred over facelike patterns if their V1 activity is much higher than typical. Unfortunately, it is difficult to use such a formula to compare to newborn experiments, because the presentation order of stimuli in those experiments is usually not known. As a result, the average or median value of patterns in recent history is not available. Furthermore, the numerical preference values computed in this way will differ depending on the specific set of patterns chosen, and thus will be different for each study. Instead, for the simulations below, we will use a categorical approach inspired by Cohen (1998), which should yield similar results. Specifically, we will assume that when two stimuli both activate the model FSA, the one with the higher total FSA activation will be preferred. Similarly, with two stimuli activating only V1, the higher total V1 activation will be preferred. When one stimulus activates only V1 and another activates both V1 and the FSA, we will assume that the pattern that produces FSA activity will be preferred, unless the V1 activity is much larger than for typical patterns. Using these guidelines, the computed model preferences can be validated against the newborn's looking preferences, to determine if the model shows the same behavior as the newborn.
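The categorical guidelines above can be sketched as a pairwise comparison. The representation of a stimulus as a dict of total FSA and V1 activities, the "anomalous V1" cutoff, and the function name are illustrative assumptions, not the paper's implementation:

```python
def preferred(stim_a, stim_b, v1_typical=1.0, anomaly_factor=3.0):
    """Categorical preference rule: each stimulus is a dict with total
    'fsa' and 'v1' activities. FSA activity wins unless the competing
    pattern's V1 response is anomalously large relative to typical."""
    a_face, b_face = stim_a["fsa"] > 0, stim_b["fsa"] > 0
    if a_face and b_face:            # both facelike: compare FSA totals
        return stim_a if stim_a["fsa"] >= stim_b["fsa"] else stim_b
    if not a_face and not b_face:    # neither facelike: compare V1 totals
        return stim_a if stim_a["v1"] >= stim_b["v1"] else stim_b
    face, other = (stim_a, stim_b) if a_face else (stim_b, stim_a)
    if other["v1"] > anomaly_factor * v1_typical:
        return other                 # e.g., the high-contrast checkerboard
    return face
```

With these rules, a facelike pattern beats an ordinary nonfacelike one, but a pattern with several times the typical V1 activity (such as the checkerboard discussed in section 5) beats even a facelike pattern, matching the infant data.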
5 Results

This section presents the model's response to schematic patterns and real images after it has completed training on the internally generated patterns. The response to each schematic pattern is compared to behavioral results from infants, and the responses to real images constitute predictions for future experiments.

5.1 Schematic Patterns. Figures 4 and 5 show that the model responses match the measured stimulus preference of newborns remarkably well, with the same relative ranking in each case where infants have shown a significant preference between schematic patterns. These figures and the rankings are the main result from this simulation. We will analyze each of the categories of preference rankings to show what parts of the model underlie the pattern preferences. Most nonfacelike patterns activate only V1, and thus the preferences between those patterns are based on only the V1 activity values (see Figures 4a, 4e–4i, and 5f–5i). Patterns with numerous high-contrast edges have greater V1 response, which explains why newborns would prefer them. These preferences are in accord with the simple LSM model, because they are based on only the early visual processing (see section 2.1). Facelike schematic patterns activate the FSA, whether they are realistic or simply patterns of three dots (see Figures 4b–4d and 5a–5d). The different FSA activation levels reflect the level of V1 response to each pattern, not the precise shape of the face pattern. The responses explain why newborns would prefer patterns like the three-square-blob face over the oval-blob face, which has a lower edge length. The preferences between these patterns are also compatible with the LSM, because in each case, the V1 responses in the model match newborn preferences. The comparisons between facelike and nonfacelike patterns show how the predictions of the HLISSOM pattern generation model differ from those of the LSM.
In the HLISSOM simulations, the patterns that activate the FSA are preferred over those activating only V1, except when the V1 response is highly anomalous. Most V1 responses are very similar, and so the patterns with FSA activity (see Figures 4b–4d and 5a–5d) should be preferred over most of the other patterns, as found in infants. However, the checkerboard pattern (see Figure 4a) has nearly three times as much V1 activity as any other pattern with a similar background. Thus, the HLISSOM results explain why the checkerboard would be preferred over other patterns, even ones that activate the FSA. Because the RFs in the model are only a rough match to the schematic patterns, the FSA can have spurious responses to patterns that to adults do not look like faces. For instance, the inverted three-dot pattern in Figure 5e activated the FSA. In this case, part of the square outline filled in for the missing third dot of an upright pattern. With high enough pattern contrast,
1544
J. Bednar and R. Miikkulainen
Figure 4: Human newborn and model response to Goren et al.’s (1975) and Johnson et al.’s (1991) schematic images. The Retina row along the top shows a set of patterns as they are drawn on the photoreceptor sheet. These patterns have been presented to newborn human infants on head-shaped paddles moving at a short distance (about 20 cm) from the newborn’s eyes, against a light-colored ceiling. In the experimental studies, the newborn’s preference was determined by measuring the average distance his or her eyes or head tracked each pattern, compared to other patterns. Below, x > y indicates that image x was preferred over image y under those conditions. Goren et al. (1975) measured infants between 3 and 27 minutes after birth. They found that b > f > i and b > e > i. Similarly, Johnson et al. (1991), in one experiment measuring within one hour of birth, found b > e > i. In another, measuring at an average of 43 minutes, they found b > e, and b > h. Finally, Johnson and Morton (1991), measuring newborns an average of 21 hours old, found that a > (b, c, d), c > d, and b > d. The HLISSOM model has the same preference for each of these patterns, as shown in the images above. The second row shows the model LGN activations (ON minus OFF) resulting from the patterns in the top row. The third row shows the V1 activations, with the numerical sum of the activities shown underneath. If one unit were fully active, the sum would be 1.0; higher values indicate that more units are active. The bottom row shows the settled response of the FSA. Activity at a given location in the FSA corresponds to a facelike stimulus at that location on the retina, and the sum of this activity is shown underneath. The images are sorted left to right according to the preferences of the model. The strongest V1 response by nearly a factor of three is to the checkerboard pattern (a), which explains why the newborn would prefer that pattern over the others.
The facelike patterns (b–d) are preferred over patterns (e–i) because of activation in the FSA. The details of the facelike patterns do not significantly affect the results: all of the facelike patterns (b–d) lead to FSA activation, generally in proportion to their V1 activation levels. The remaining patterns are ranked by their V1 activity alone, because they do not activate the FSA. In all conditions tested, the HLISSOM model shows behavior remarkably similar to that of the newborns and provides a detailed computational explanation for why these behaviors occur.
Figure 5: Response to schematic images from Valenza et al. (1996) and Simion, Valenza, and Umiltà (1998). Valenza et al. measured preference between static, projected versions of pairs of the schematic images in the top row, using newborns ranging from 24 to 155 hours after birth. They found the following preferences: d > f, d > g, f > g, and h > i. Simion et al. similarly found a preference for d > g and b > e. The LGN, V1, and FSA responses of the model to these images are displayed here as in Figure 4, and are again sorted by the model’s preference. In all cases where the newborn showed a preference, the model preference matched it. For instance, the model FSA responds to the facelike pattern (d) but not to the inverted version (g). Patterns that closely match the newborn’s preferred spatial frequencies (f,h) caused a greater V1 response than their corresponding lower-frequency patterns (g,i). Some nonfacelike patterns with high-contrast borders can cause spurious FSA activation (e), because part of the border completes a three-dot pattern. Such spurious responses did not affect the predicted preferences, because they are smaller than the genuine responses.
two dots can also be sufficient to give a response. Because inverted patterns have two dots in the correct location, these results suggest that inverted patterns are not an ideal control to use in experimental studies (Bednar, 2002). Interestingly, the model also showed a clear preference in one case where no significant preference was found in newborns (Simion, Valenza, & Umiltà,
1998) for the upright three-blob pattern with no face outline (see Figure 5a) over the similar but inverted pattern (see Figure 5i). The V1 responses to both patterns are similar, but only the upright pattern has an FSA response, and thus the model predicts that the upright pattern would be preferred. This potentially conflicting result may be due to postnatal learning rather than capabilities at birth. As will be summarized in section 6.2, postnatal learning of face outlines can have a strong effect on FSA responses. The newborns in the Simion et al. study were already one to six days old, and Pascalis, de Schonen, Morton, Deruelle, and Fabre-Grenet (1995) have shown that newborns within this age range have already learned some of the features of their mothers’ face outlines. Thus, the pattern-generation model predicts that if younger newborns are tested, they will show a preference for upright patterns even without a border. Alternatively, the border may satisfy some minimum requirement on size or complexity for patterns to attract a newborn’s interest. For instance, the total V1 activity may need to be above a certain threshold for the newborn to pay any attention to a pattern. If so, the HLISSOM procedure for deriving a behavioral response from the model response (see section 4.3) would need to be modified to include this constraint. Overall, these results provide strong computational support for the speculation of Johnson and Morton (1991) that the newborn could simply be responding to a three-dot facelike configuration rather than performing sophisticated face detection. Internally generated patterns provide an account for how such “innate” machinery can be constructed during prenatal development, within a system that can also learn postnatally from visual experience.

5.2 Real Face Images. Researchers testing newborns with schematic patterns often assume that the responses to schematics are representative of responses to real faces.
However, no experiment has yet tested that assumption by comparing real faces to similar but nonfacelike controls. Similarly, no previous computational model of newborn face preferences has been tested with real images. HLISSOM makes such tests practical, which is an important way to determine the significance of the behavioral data. The examples in Figure 6 show that the prenatally trained HLISSOM model works remarkably well as a face detector for natural images. The face detection performance of the map was tested quantitatively using two image databases: a set of 150 images of 15 adult males without glasses, photographed at the same distance against blank backgrounds (Achermann, 1995), and a set of 58 nonface images of various natural scenes (National Park Service, 1995). The face image set contained two views of each person facing forward, upward, downward, left, and right; Figure 6a shows an example of a frontal view. Each natural scene was presented at six different size scales, for a total of 348 nonface presentations. Overall, the results indicated very high face detection performance: the FSA responded
Figure 6: Model response to natural images. The top row shows a sample set of photographic images. The LGN, V1, and FSA responses of the model to these images are displayed as in Figures 4 and 5. The FSA is indeed activated at the correct location for most top-lit faces of the correct size and orientation (e.g., a–d). (a) An example of one frontal view from the Achermann (1995) database. For this database, 88% of the faces resulted in FSA activity in the correct location. Just as important, the network is not activated for most natural scenes (f–g) and man-made objects (h). For example, the FSA responded to only 4.3% of 348 presentations of landscapes and other natural scenes from the National Park Service (1995). Besides human faces, the FSA responds to patterns causing a V1 response similar to that of a three-dot arrangement of contours (d,i), including related patterns such as dog and monkey faces (not shown). Response is low to images where hair or glasses obscure the borders of the eyes, nose, or mouth, and to front-lit downward-looking faces, which have low V1 responses from nose and mouth contours (e). The model predicts that newborns would show similar responses if tested. Credits: (a) copyright 1995 Bernard Achermann; (b–e) public domain; (f–i) copyright 1999–2001, James A. Bednar.
to 91% (137/150) of the face images, but to only 4.3% (15/348) of the natural scenes. Because the two sets of real images were not closely matched in terms of lighting, backgrounds, and distances, it is important to consider the actual response patterns to be sure that the differences in the overall totals are genuine. The FSA responded with activation in the location corresponding to the center of the face in 88% (132/150) of the face images. At the same time, the FSA had spurious responses in 27% (40/150) of the face images, that is, responses in locations other than the center of the face. Nearly half
of the spurious responses were from less selective neurons along the outer border of the model FSA (see Figure 3h); these can be ignored. Most of the remaining spurious responses resulted from a genuine V1 eye or mouth response plus V1 responses to the hair or jaw outlines. Such responses would actually serve to direct attention to the general region of the face, although they would not pinpoint the precise center. For the natural scenes, most of the responses were from the less selective neurons along the border of the FSA, and those responses can again be ignored. The remainder were in image regions that coincidentally had a triangular arrangement of three contour-rich areas, surrounded by smooth shading. In summary, the FSA responds to most top-lit human faces of about the right size, signaling their location in the visual field. It does not respond to most other stimuli, except when they contain accidental three-dot patterns. The model predicts that human newborns will have a similar pattern of responses in the face-selective regions of the visual cortex.

6 Discussion and Future Work

The HLISSOM simulations show that internally generated patterns and self-organization can together account for newborn face detection.
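The detection rates quoted in section 5.2 reduce to simple tallies over the two test sets; as an arithmetic check (the counts come from the text, the variable names are illustrative):

```python
# Worked check of the detection statistics reported in section 5.2.
face_images = 150          # 2 views x 5 poses x 15 subjects
scene_presentations = 348  # 58 natural scenes x 6 size scales

fsa_responded_faces = 137     # FSA responded somewhere in the image
fsa_correct_location = 132    # response at the center of the face
fsa_spurious_faces = 40       # responses away from the face center
fsa_responded_scenes = 15     # responses to natural scenes

print(round(100 * fsa_responded_faces / face_images))        # 91
print(round(100 * fsa_correct_location / face_images))       # 88
print(round(100 * fsa_spurious_faces / face_images))         # 27
print(round(100 * fsa_responded_scenes / scene_presentations, 1))  # 4.3
```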
The novel contributions of the model are: (1) providing a rigorous, computational test of the Johnson and Morton (1991) three-dot conceptual model for newborn schematic face preferences, (2) demonstrating that preferences for three-dot patterns result in significant preference for faces in natural images, (3) proposing a specific, plausible way that a system for face detection could be constructed in the developing brain, (4) demonstrating how a single system can both exhibit genetically encoded preferences in the newborn and later learn from real faces, (5) demonstrating how a hierarchy of brain areas could self-organize and function, at the level of columnar maps and their connections, (6) showing how the reported newborn pattern preferences can result from specific feature preferences at the different levels of the hierarchy, and (7) showing that the effects of internally generated patterns should be considered when explaining genetically influenced behavior. The key idea is that pattern generation can combine the strengths of both genetic prespecification and input-driven learning into a system more powerful than the sum of its parts. The following sections discuss results with other training patterns and postnatal learning, relate the HLISSOM results to those from other models, and discuss future research directions.

6.1 Effect of the Training Pattern Shape. How crucial is the specific pattern of three dots for the model, given that other patterns might also provide face selectivity? For instance, Viola and Jones (2001) found that a horizontal bar or a pair of eye spots can detect faces well in engineering applications. In separate work, we tested the effectiveness of a variety of
training pattern shapes at detecting faces in real images and for replicating the psychological data (Bednar, 2002). These simulations used nine different training patterns, including dots, bars, top-heavy triangles, and circles. For all training patterns, test faces that matched the overall training pattern size gave higher responses than did other natural images. Thus, simply matching the size of typical faces gives some degree of face selectivity in real images. However, different training patterns led to large differences in responses to the schematic patterns. Of all training patterns tested, the three-dot pattern provided the most face selectivity in overhead lighting and the best match to the experimental data, and thus it is a good choice for pattern-generation simulations. It leads to strong predictions that can be matched with experimental data and contrasted with other models. All of these tests used upright training patterns, because the Simion, Valenza, and Umiltà and Simion, Valenza, Umiltà, and Dalla Barba (1998) studies suggest that the orientation of face patterns is important even in the first few days of life. In the areas that generate training patterns, such a bias might be due to anisotropic lateral connectivity, which would cause spontaneous patterns in one part of the visual field to suppress those below them. On the other hand, tests with the youngest infants (less than one day) have not yet found orientation biases (Johnson et al., 1991). Thus, the experimental data are also consistent with a model that assumes unoriented patterns prenatally, followed by rapid learning from upright faces. Such a model would be more complicated to simulate and describe than the one presented here, but could use a similar architecture and learning rules. Importantly, not all possible training patterns result in face preferences.
For example, if only a single low-level feature like eye size is matched between the training and testing images, the model shows no face preference. Together, the results with different patterns show that the pattern shape is important, but that the shape does not need to be strictly controlled to provide face preferences.

6.2 Postnatal Learning. The work reported in this article focuses on understanding the origin of newborn face preferences, but a crucial motivation for the model is that the same components can learn from real faces and not just prenatal patterns. To understand postnatal learning computationally, we have trained a map with internally generated patterns, then trained it on images of real faces in front of natural image backgrounds (Bednar, 2002; Bednar & Miikkulainen, 2002). Preliminary results from these simulations suggest that (1) prenatal training leads to faster and more robust face learning after birth, compared to a system exposed only to environmental stimuli, (2) the decline in infants’ responses to schematic patterns in the periphery after one month (see section 2.4) may simply result from learning real faces, not from a shift to a separate system, (3) learning of real faces may explain why young infants prefer their mothers’ faces, as well as why the preference disappears when
the face outline is masked, and (4) delayed maturation of the fovea may explain why facial preferences appear in central vision later than in peripheral vision. These results are consistent with the postnatal disappearance of newborn face preferences that in part inspired the Johnson and Morton (1991) model, while also providing a simple account for why infants of all ages appear to be able to learn faces.
6.3 Comparison to Other Models. Predictions from the HLISSOM pattern-generation model differ strongly from those of other recent models. These predictions can be tested by future experiments. One easily tested prediction is that newborn face preferences should not depend on the precise shape of the face outline. The Acerra et al. (2002) model (see section 2.2) makes the opposite prediction, because in that model, preferences arise from precise spacing differences between the external border and the internal facial features. Results from the HLISSOM simulations also suggest that newborns will have a strong preference for real faces (e.g., in photographs), while we argue that the Acerra et al. model predicts only a weak preference for real faces, if any. Experimental evidence to date cannot yet decide between the predictions of these two models. For instance, Simion, Valenza, and Umiltà (1998) did not find a significant schematic face preference in newborns one to six days old without a contour surrounding the internal features, which is consistent with the Acerra et al. (2002) model. However, the same study concluded that the shape of the contour “did not seem to affect the preference” for the patterns, which would not be consistent with the Acerra et al. model. As discussed earlier, newborns may not require any external contour, as the pattern-generation model predicts, until they have had postnatal experience with faces. Future experiments with younger newborns should compare model and newborn preferences between schematic patterns with a variety of border shapes and spacings. These experiments will either show that the border shape is crucial, as predicted by Acerra et al., or that it is unimportant, as predicted by the pattern-generation model. The predictions of the pattern-generation model also differ from those of the Simion et al. (2001) top-heavy model (reviewed in section 2.3).
This model predicts that any face-sized border that encloses objects denser at the top than the bottom will be preferred over similar schematic patterns. The pattern-generation model predicts instead that a pattern with three dots in the typical symmetrical arrangement would be preferred over the same pattern with both eye dots pushed to one side, despite both patterns being equally top heavy. These two models represent very different explanations of the existing data, and thus testing such patterns should offer clear support for one model over the other. On the other hand, many of the predictions of the fully trained pattern-generation model implemented in HLISSOM are similar to those of the
CONSPEC model proposed by Johnson and Morton (1991). In fact, the reduced HLISSOM face preference network in section 6.1 (which does not include V1) can be seen as the first CONSPEC system to be implemented computationally, along with a concrete proposal for how such a system could be constructed during prenatal development (Bednar & Miikkulainen, 2000). The primary functional difference between the trained HLISSOM network and CONSPEC is that in HLISSOM, only neurons located in cortical visual areas respond selectively to faces in the visual field. Whether newborn face detection is mediated cortically or subcortically has been debated extensively, yet no clear consensus has emerged from behavioral studies (Simion, Valenza, & Umiltà, 1998). If future brain imaging studies do discover face-selective visual neurons in subcortical areas in newborns, HLISSOM will need to be modified to include such areas. Yet the key principles would remain the same, because subcortical regions also appear to be organized from internally generated activity (Wong, 1999). Thus, experimental tests of the pattern-generation model versus CONSPEC should focus on how the initial system is constructed, not where it is located.

6.4 Future Directions. The pattern-generation hypothesis can be verified directly by imaging the actual activity patterns produced during REM sleep. Very recent advances in imaging hardware have made limited measurements of this type possible (Rector et al., 1997). It may soon be feasible to test the assumptions and predictions of the pattern-generation model in developing animals. Why would evolution have favored a pattern-generation approach over fixed hardwiring or strictly general learning? One important reason is that the genome can encode the desired outcome, three-blob responses, independent of the actual architecture of the face processing hardware.
As shown by Dailey and Cottrell (1999), even when the underlying cortical architecture is specified only in domain-general terms, specific regions can still become primarily devoted to face processing. The divide-and-conquer strategy of pattern generation would allow such an architecture to evolve and develop independent of the function, which could potentially enable the rapid evolution and development of more complex adaptive systems. In terms of information processing, pattern generation combined with self-organization may represent a general way to solve difficult problems like face detection and recognition. Rather than meticulously specifying the final, desired individual, the specification need only encode a process for constructing an individual through interaction with its environment. The result can seamlessly combine the full complexity of the environment with a priori information about the desired function of the system. In the future, this approach could be used for engineering complex artificial systems for real-world tasks, such as handwriting recognition, speech recognition, and language processing.
7 Conclusion

The pattern-generator model of newborn face preferences can explain how newborns can show a face preference at birth yet remain able to learn faces at all ages. The predictions of the model can be tested experimentally in order to better understand the development of face perception. The simulations and future human experiments will also help explain how environmental and genetic influences can together construct complex, adaptive systems.

Appendix: Parameter Values

Most of the model parameters were calculated using the scaling equations from Kelkar et al. (2001). Of those below, only the γ parameters and the sigmoid thresholds were set empirically for this study. For the LGN and V1, the values for these free parameters were set so that the output from each neural sheet would be in a useful range for the following sheet over the course of training. For the highest level (FSA), the free parameters were set so that the response would be nearly binary, and therefore an unambiguous criterion for the presence of a facelike pattern on the input. Apart from such threshold parameters, small parameter variations produce roughly equivalent results. The specific parameter values for each subcortical and cortical region follow. For the LGN, σ_c was 0.4 and σ_s was 1.6. The afferent scale γ_A was 10.6, and the lower and upper thresholds of the sigmoid were 0.14 and 1.0, respectively. For V1, the initial lateral excitation radius was 3.6 and was gradually decreased to 1.5. The lateral inhibitory radius of each neuron was 8, and inhibitory connections whose strength was below 0.01 were pruned away at 10,000 iterations. The lateral inhibitory connections were initialized to a gaussian profile with σ = 17, and the lateral excitatory connections to a gaussian with σ = 2.8, with no connections outside the nominal circular radius.
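The LGN center and surround widths σ_c and σ_s above parameterize a standard center-surround difference-of-gaussians receptive field (in the spirit of the Rodieck, 1965, reference). A minimal sketch, assuming the quoted widths, unit grid spacing, and an arbitrary kernel radius:

```python
import numpy as np

def dog_kernel(radius, sigma_c=0.4, sigma_s=1.6):
    """ON-center difference-of-gaussians kernel: center minus surround.

    sigma_c and sigma_s are the LGN widths quoted above; the radius and
    unit grid spacing are illustrative assumptions.
    """
    ax = np.arange(-radius, radius + 1, dtype=float)
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx ** 2 + yy ** 2
    center = np.exp(-r2 / (2 * sigma_c ** 2))
    surround = np.exp(-r2 / (2 * sigma_s ** 2))
    # Normalize each gaussian to unit sum, so a uniform input yields zero
    # net response while a centered bright spot yields a positive one.
    return center / center.sum() - surround / surround.sum()

k = dog_kernel(5)
```

The kernel sums to zero, is positive at its center, and weakly negative in the surround, which is what makes the LGN sheet respond to contrast edges rather than to uniform illumination.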
Initially, the divisive normalization strength γ_N was zero, the afferent scale γ_A was 1.0, and the lateral excitation γ_E and inhibition strength γ_I were both 0.9. The γ values were constant during V1 training and were gradually increased over the course of FSA training to γ_N = 4, γ_A = 3.25, γ_E = 1.2, and γ_I = 1.4. The learning rate α_A was gradually decreased from 0.0035 to 0.00075, α_E from 0.059 to 0.0029, and α_I was a constant 0.00088. The lower and upper thresholds of the sigmoid were increased from 0.08 to 0.5 and from 0.63 to 0.86, respectively, over the course of V1 training; the lower threshold was then decreased to 0.22 over the course of FSA training. The number of iterations for which the lateral connections were allowed to settle at each training iteration was initially 9 and was increased to 13. For the FSA, the initial lateral excitation radius was 6.3 and was gradually decreased to 1.5. The lateral inhibitory radius of each neuron was 15.8, and inhibitory connections whose strength was below 0.0027 were pruned away
at 20,000 iterations. The lateral inhibitory connections were initialized to a gaussian profile with σ = 33, and the lateral excitatory connections to a gaussian with σ = 4.9. Initially, the divisive normalization strength γ_N was zero, the afferent scale γ_A was 3.0, and the lateral excitation γ_E and inhibition strength γ_I were both 0.9. These values were gradually changed over the course of FSA training to γ_N = 9, γ_A = 10.6, γ_E = 0.4, and γ_I = 0.6. The learning rate α_A was gradually decreased from 0.0001 to 0.000022, α_E from 0.025 to 0.013, and α_I was a constant 0.003. The lower and upper thresholds of the sigmoid were increased from 0.1 to 0.81 and from 0.65 to 0.88, respectively, over the course of training. The number of iterations for which the lateral connections were allowed to settle at each training iteration was initially 9, and was increased to 13.

Acknowledgments

This research was supported in part by the National Science Foundation under grant IIS-9811478 and by the National Institute of Mental Health under Human Brain Project grant 1R01-MH66991. We thank Cornelius Weber, Lisa Kaczmarczyk, an anonymous reviewer, and especially Mark H. Johnson for critical comments on earlier drafts. Software, demonstrations, color figures, and related publications are available on-line at http://www.cs.utexas.edu/users/nn/.

References

Acerra, F., Burnod, Y., & de Schonen, S. (2002). Modelling aspects of face processing in early infancy. Developmental Science, 5(1), 98–117.
Achermann, B. (1995). Full-faces database. Available on-line: http://iamwww.unibe.ch/fkiwww/Personen/achermann.html. Copyright 1995, University of Bern; all rights reserved.
Banks, M. S., & Salapatek, P. (1981). Infant pattern vision: A new approach based on the contrast sensitivity function. Journal of Experimental Child Psychology, 31(1), 1–45.
Bednar, J. A. (2002).
Learning to see: Genetic and environmental influences on visual development. Unpublished doctoral dissertation, University of Texas at Austin. Available as technical report AI-TR-02-294, Dept. of Computer Science.
Bednar, J. A., & Miikkulainen, R. (1998). Pattern-generator-driven development in self-organizing models. In J. M. Bower (Ed.), Computational neuroscience: Trends in research, 1998 (pp. 317–323). New York: Plenum.
Bednar, J. A., & Miikkulainen, R. (2000). Self-organization of innate face preferences: Could genetics be expressed through learning? In Proceedings of the 17th National Conference on Artificial Intelligence (pp. 117–122). Cambridge, MA: MIT Press.
Bednar, J. A., & Miikkulainen, R. (2002). Neonatal learning of faces: Interactions between genetic and environmental inputs. In W. Gray & C. Schunn (Eds.), Proceedings of the 24th Annual Conference of the Cognitive Science Society (pp. 107–112). Mahwah, NJ: Erlbaum.
Bednar, J. A., & Miikkulainen, R. (in press). Self-organization of spatiotemporal receptive fields and laterally connected direction and orientation maps. Neurocomputing.
Blasdel, G. G. (1992). Orientation selectivity, preference, and continuity in monkey striate cortex. Journal of Neuroscience, 12, 3139–3161.
Bushnell, I. W. R., Sai, F., & Mullin, J. T. (1989). Neonatal recognition of the mother’s face. British Journal of Developmental Psychology, 7, 3–15.
Callaway, C. W., Lydic, R., Baghdoyan, H. A., & Hobson, J. A. (1987). Pontogeniculooccipital waves: Spontaneous visual system activity during rapid eye movement sleep. Cellular and Molecular Neurobiology, 7(2), 105–149.
Choe, Y. (2001). Perceptual grouping in a self-organizing map of spiking neurons. Unpublished doctoral dissertation, University of Texas at Austin.
Choe, Y., & Miikkulainen, R. (2000). Contour integration and segmentation with self-organized lateral connections (Tech. Rep. AI2000-286). Austin: Department of Computer Sciences, University of Texas at Austin.
Cohen, L. B. (1998). An information-processing approach to infant perception and cognition. In F. Simion & G. Butterworth (Eds.), The development of sensory, motor, and cognitive capacities in early infancy (pp. 277–300). East Sussex, UK: Psychology Press.
Constantine-Paton, M., Cline, H. T., & Debski, E. (1990). Patterned activity, synaptic convergence, and the NMDA receptor in developing visual pathways. Annual Review of Neuroscience, 13, 129–154.
Dailey, M. N., & Cottrell, G. W. (1999). Organization of face and object recognition in modular neural network models. Neural Networks, 12(7), 1053–1074.
Datta, S. (1997).
Cellular basis of pontine ponto-geniculo-occipital wave generation and modulation. Cellular and Molecular Neurobiology, 17(3), 341–365.
de Haan, M. (2001). The neuropsychology of face processing during infancy and childhood. In C. A. Nelson & M. Luciana (Eds.), Handbook of developmental cognitive neuroscience. Cambridge, MA: MIT Press.
de Schonen, S., Mancini, J., & Leigeois, F. (1998). About functional cortical specialization: The development of face recognition. In F. Simion & G. Butterworth (Eds.), The development of sensory, motor, and cognitive capacities in early infancy (pp. 103–120). East Sussex, UK: Psychology Press.
Dugovic, C., & Turek, F. W. (2001). Similar genetic mechanisms may underlie sleep-wake states in neonatal and adult rats. Neuroreport, 12(14), 3085–3089.
Eglen, S. (1997). Modeling the development of the retinogeniculate pathway. Unpublished doctoral dissertation, University of Sussex at Brighton, UK.
Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Computation, 7(3), 425–468.
Feller, M. B., Wellis, D. P., Stellwagen, D., Werblin, F. S., & Shatz, C. J. (1996). Requirement for cholinergic synaptic transmission in the propagation of spontaneous retinal waves. Science, 272, 1182–1187.
Field, T. M., Cohen, D., Garcia, R., & Greenberg, R. (1984). Mother-stranger face discrimination by the newborn. Infant Behavior and Development, 7, 19–25.
Gauthier, I., & Nelson, C. A. (2001). The development of face expertise. Current Opinion in Neurobiology, 11(2), 219–224.
Goren, C. C., Sarty, M., & Wu, P. Y. (1975). Visual following and pattern discrimination of face-like stimuli by newborn infants. Pediatrics, 56(4), 544–549.
Haith, G. L. (1998). Modeling activity-dependent development in the retinogeniculate projection. Unpublished doctoral dissertation, Stanford University, Stanford, CA.
Hirsch, J. A., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat’s visual cortex. Journal of Neuroscience, 11, 1800–1809.
Johnson, M. H., Dziurawiec, S., Ellis, H., & Morton, J. (1991). Newborns’ preferential tracking of face-like stimuli and its subsequent decline. Cognition, 40, 1–19.
Johnson, M. H., & Morton, J. (1991). Biology and cognitive development: The case of face recognition. New York: Blackwell.
Jouvet, M. (1998). Paradoxical sleep as a programming system. Journal of Sleep Research, 7(Suppl. 1), 1–5.
Jouvet, M. (1999). The paradox of sleep: The story of dreaming. Cambridge, MA: MIT Press.
Kanwisher, N., McDermott, J., & Chun, M. M. (1997). The fusiform face area: A module in human extrastriate cortex specialized for face perception. Journal of Neuroscience, 17(11), 4302–4311.
Kelkar, A., Bednar, J. A., & Miikkulainen, R. (2001). Modeling large cortical networks with growing self-organizing maps (Tech. Rep. AI-00-285). Austin: Department of Computer Sciences, University of Texas at Austin.
Kleiner, K. (1993). Specific vs. non-specific face-recognition device. In B. de Boysson-Bardies (Ed.), Developmental neurocognition: Speech and face processing in the first year of life (pp. 103–108). Boston: Kluwer.
Leinekugel, X., Khazipov, R., Cannon, R., Hirase, H., Ben-Ari, Y., & Buzsáki, G. (2002).
Correlated bursts of activity in the neonatal hippocampus in vivo. Science, 296(5575), 2049–2052. Linsker, R. (1986). From basic network principles to neural architecture: Emergence of orientation columns. Proceedings of the National Academy of Sciences, USA, 83, 8779–8783. Lippe, W. R. (1994). Rhythmic spontaneous activity in the developing avian auditory system. Journal of Neuroscience, 14(3), 1486–1495. Marks, G. A., Shaffery, J. P., Oksenberg, A., Speciale, S. G., & Roffwarg, H. P. (1995). A functional role for REM sleep in brain maturation. Behavioural Brain Research, 69, 1–11. Miikkulainen, R., Bednar, J. A., Choe, Y., & Sirosh, J. (1997). Self-organization, plasticity, and low-level visual phenomena in a laterally connected map model of the primary visual cortex. In R. L. Goldstone, P. G. Schyns, & D. L. Medin (Eds.), Psychology of Learning and Motivation, 36, (pp. 257–308). San Diego, CA: Academic Press. Miller, K. D. (1994). A model for the development of simple cell receptive elds and the ordered arrangement of orientation columns through activity-
1556
J. Bednar and R. Miikkulainen
dependent competition between ON- and OFF-center inputs. Journal of Neuroscience, 14, 409–441. Mondloch, C. J., Lewis, T. L., Budreau, D. R., Maurer, D., Dannemiller, J. L., Stephens, B. R., & Kleiner-Gathercoal, K. A. (1999). Face perception during early infancy. Psychological Science, 10(5), 419–422. Morris, J. S., Ohman, A., & Dolan, R. J. (1999). A subcortical pathway to the right amygdala mediating “unseen” fear. Proceedings of the National Academy of Sciences, USA, 96(4), 1680–1685. National Park Service. (1995). Image database. Available on-line: http://www. freestockphotos.com/NPS. O’Donovan, M. J. (1999). The origin of spontaneous activity in developing networks of the vertebrate nervous system. Current Opinion in Neurobiology, 9, 94–104. Pascalis, O., de Schonen, S., Morton, J., Deruelle, C., & Fabre-Grenet, M. (1995). Mother ’s face recognition by neonates: A replication and an extension. Infant Behavior and Development, 18, 79–85. Pascalis, O., & Slater, A. (2001). The development of face processing in infancy and early childhood: Current perspectives. Infant and Child Development, 10(1/2), 1–2. Rector, D. M., Poe, G. R., Redgrave, P., & Harper, R. M. (1997). A miniature CCD video camera for high-sensitivity light measurements in freely behaving animals. Journal of Neuroscience Methods, 78(1–2), 85–91. Rodieck, R. W. (1965). Quantitative analysis of cat retinal ganglion cell response to visual stimuli. Vision Research, 5(11), 583–601. Rodman, H. R. (1994). Development of inferior temporal cortex in the monkey. Cerebral Cortex, 4(5), 484–498. Roffwarg, H. P., Muzio, J. N., & Dement, W. C. (1966). Ontogenetic development of the human sleep-dream cycle. Science, 152, 604–619. Rolls, E. T. (1990). The representation of information in the temporal lobe visual cortical areas of macaques. In R. Eckmiller (Ed.), Advanced neural computers (pp. 69–78). New York: Elsevier. Shaffery, J. P., Roffwarg, H. P., Speciale, S. G., & Marks, G. A. (1999). 
Pontogeniculo-occipital-wave suppression amplies lateral geniculate nucleus cell-size changes in monocularly deprived kittens. Developmental Brain Research, 114(1), 109–119. Shatz, C. J. (1990). Impulse activity and the patterning of connections during CNS development. Neuron, 5, 745–756. Shatz, C. J. (1996). Emergence of order in visual system development. Proceedings of the National Academy of Sciences, USA, 93, 602–608. Simion, F., Macchi Cassia, V., Turati, C., & Valenza, E. (2001). The origins of face perception: Specic versus non-specic mechanisms. Infant and Child Development, 10(1/2), 59–66. Simion, F., Valenza, E., & Umilta` , C. (1998). Mechanisms underlying face preference at birth. In F. Simion & G. Butterworth (Eds.), The development of sensory, motor, and cognitive capacities in early infancy (pp. 87–102). East Sussex, UK: Psychology Press.
Learning Innate Face Preferences
1557
Simion, F., Valenza, E., Umilta` , C., & Dalla Barba, B. (1998). Preferential orienting to faces in newborns: A temporal-nasal asymmetry. Journal of Experimental Psychology: Human Perception and Performance, 24(5), 1399–1405. Sirosh, J., & Miikkulainen, R. (1994). Cooperative self-organization of afferent and lateral connections in cortical maps. Biological Cybernetics, 71, 66–78. Sirosh, J., & Miikkulainen, R. (1996). Self-organization and functional role of lateral connections and multisize receptive elds in the primary visual cortex. Neural Processing Letters, 3, 39–48. Sirosh, J., Miikkulainen, R., & Bednar, J. A. (1996). Self-organization of orientation maps, lateral connections, and dynamic receptive elds in the primary visual cortex. In J. Sirosh, R. Miikkulainen, & Y. Choe (Eds.), Lateral interactions in the cortex: Structure and function. Austin, TX: UTCS Neural Networks Research Group. Available on-line: http://www.cs.utexas.edu/ users/nn/web-pubs/htmlbook96. Slater, A. (1993). Visual perceptual abilities at birth: Implications for face perception. In B. de Boysson-Bardies (Ed.), Developmental neurocognition: Speech and face processing in the rst year of life (pp. 125–134). Boston: Kluwer. Slater, A., & Kirby, R. (1998). Innate and learned perceptual abilities in the newborn infant. Experimental Brain Research, 123(1–2), 90–94. Swindale, N. V. (1996). The development of topography in the visual cortex: A review of models. Network—Computation in Neural Systems, 7, 161–247. Tov´ee, M. J. (1998). Face processing: Getting by with a little help from its friends. Current Biology, 8, R317–R320. Valenza, E., Simion, F., Cassia, V. M., & Umilt a` , C. (1996). Face preference at birth. Journal of Experimental Psychology: Human Perception and Performance, 22(4), 892–903. Viola, P., & Jones, M. (2001). Robust real-time object detection. 
Paper presented at the Second International Workshop on Statistical and Computational Theories of Vision—Modeling, Learning, Computing, and Sampling, Vancouver, BC. Wallace, M. T., McHafe, J. G., & Stein, B. E. (1997). Visual response properties and visuotopic representation in the newborn monkey superior colliculus. Journal of Neurophysiology, 78(5), 2732–2741. Walton, G. E., Armstrong, E. S., & Bower, T. G. R. (1997). Faces as forms in the world of the newborn. Infant Behavior and Development, 20(4), 537–543. Walton, G. E., & Bower, T. G. R. (1993). Newborns form “prototypes” in less than 1 minute. Psychological Science, 4, 203–205. Weliky, M., Kandler, K., Fitzpatrick, D., & Katz, L. C. (1995). Patterns of excitation and inhibition evoked by horizontal connections in visual cortex share a common relationship to orientation columns. Neuron, 15, 541–552. Wong, R. O. L. (1999). Retinal waves and visual system development. Annual Review of Neuroscience, 22, 29–47. Wong, R. O. L., Meister, M., & Shatz, C. J. (1993). Transient period of correlated bursting activity during development of the mammalian retina. Neuron, 11(5), 923–938. Yuste, R., Nelson, D. A., Rubin, W. W., & Katz, L. C. (1995). Neuronal domains in developing neocortex: Mechanisms of coactivation. Neuron, 14(1), 7–17. Received December 14, 2001; accepted December 29, 2002.
LETTER
Communicated by Bartlett Mel
Learning Optimized Features for Hierarchical Models of Invariant Object Recognition

Heiko Wersing
[email protected]

Edgar Körner
[email protected]
HONDA Research Institute Europe GmbH, 63073 Offenbach/Main, Germany
There is an ongoing debate over the capabilities of hierarchical neural feedforward architectures for performing real-world invariant object recognition. Although a variety of hierarchical models exists, appropriate supervised and unsupervised learning methods are still an issue of intense research. We propose a feedforward model for recognition that shares components like weight sharing, pooling stages, and competitive nonlinearities with earlier approaches but focuses on new methods for learning optimal feature-detecting cells in intermediate stages of the hierarchical network. We show that principles of sparse coding, which were previously mostly applied to the initial feature detection stages, can also be employed to obtain optimized intermediate complex features. We suggest a new approach to optimize the learning of sparse features under the constraints of a weight-sharing or convolutional architecture that uses pooling operations to achieve gradual invariance in the feature hierarchy. The approach explicitly enforces symmetry constraints like translation invariance on the feature set. This leads to a dimension reduction in the search space of optimal features and allows the basis representatives that achieve a sparse decomposition of the input to be determined more efficiently. We analyze the quality of the learned feature representation by investigating the recognition performance of the resulting hierarchical network on object and face databases. We show that a hierarchy with features learned on a single object data set can also be applied to face recognition without parameter changes and is competitive with other recent machine learning recognition approaches. To investigate the effect of the interplay between sparse coding and processing nonlinearities, we also consider alternative feedforward pooling nonlinearities such as presynaptic maximum selection and sum-of-squares integration.
The comparison shows that a combination of strong competitive nonlinearities with sparse coding offers the best recognition performance in the difficult scenario of segmentation-free recognition in cluttered surround. We demonstrate that for both learning and recognition, a precise segmentation of the objects is not necessary. © 2003 Massachusetts Institute of Technology. Neural Computation 15, 1559–1588 (2003)
1560
H. Wersing and E. Körner
1 Introduction

The concept of convergent hierarchical coding assumes that sensory processing in the brain is organized in hierarchical stages, where each stage performs specialized, parallel operations that depend on input from earlier stages. The convergent hierarchical processing scheme can be employed to form neural representations that capture increasingly complex feature combinations, up to the so-called grandmother cell, which may fire only if a specific object is being recognized, perhaps even under specific viewing conditions. This concept is also known as the neuron doctrine of perception, postulated by Barlow (1972, 1985). The main criticism of this type of hierarchical coding is that it may lead to a combinatorial explosion of the possibilities that must be represented, due to the large number of combinations of features that constitute a particular object under different viewing conditions (von der Malsburg, 1999; Gray, 1999). In recent years, several authors have suggested approaches to avoid such a combinatorial explosion for achieving invariant recognition (Fukushima, 1980; Mel & Fiser, 2000; Ullman & Soloviev, 1999; Riesenhuber & Poggio, 1999b). The main idea is to use intermediate stages in a hierarchical network to achieve higher degrees of invariance over responses that correspond to the same object, thus effectively reducing the combinatorial complexity. Since the work of Fukushima, who proposed the neocognitron as an early model of translation-invariant recognition, two major processing modes in the hierarchy have been emphasized. Feature-selective neurons are sensitive to particular features, which are usually local in nature. Pooling neurons perform a spatial integration over feature-selective neurons, which are successively activated if an invariance transformation is applied to the stimulus.
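The interplay between the two processing modes can be illustrated with a small NumPy toy (purely illustrative, not the model defined later in this article): a feature-selective stage responds at the location of a local feature, and a pooling stage that integrates over a spatial neighborhood keeps its output stable when the stimulus is shifted.

```python
import numpy as np

# Toy illustration: a feature-selective stage followed by spatial pooling
# yields a locally shift-invariant response.

def feature_responses(signal, template):
    """Sliding-window match of a local feature template (feature-selective stage)."""
    n, k = len(signal), len(template)
    return np.array([signal[i:i + k] @ template for i in range(n - k + 1)])

def pool_max(responses, width):
    """Spatial pooling: maximum over non-overlapping windows of the feature map."""
    trimmed = responses[:len(responses) // width * width]
    return trimmed.reshape(-1, width).max(axis=1)

template = np.array([-1.0, 2.0, -1.0])           # a tiny "edge" detector
signal = np.zeros(16); signal[5:8] = [0, 1, 0]   # feature near position 6
shifted = np.roll(signal, 1)                     # same feature, shifted by 1 pixel

r1 = feature_responses(signal, template)
r2 = feature_responses(shifted, template)
# Raw feature maps differ (the response moved), but the pooled maps coincide:
print(np.array_equal(r1, r2))                    # False
print(np.array_equal(pool_max(r1, 4), pool_max(r2, 4)))  # True
```

Widening the pooling window makes the output stable under larger shifts, at the price of the selectivity loss discussed next.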
As was recently emphasized by Mel and Fiser (2000), the combined stages of local feature detection and spatial pooling face what could be called a stability-selectivity dilemma. On the one hand, excessive spatial pooling leads to complex feature detectors with a very stable response under image transformations. On the other hand, the selectivity of the detector is largely reduced, since wide-ranging spatial pooling may accumulate too much weak evidence, increasing the chance of the accidental appearance of the feature. This can also be phrased as an instance of a binding problem, which occurs if the association of the feature identity with the spatial position is lost (von der Malsburg, 1981). One consequence of this loss of spatial binding is the superposition problem due to the crosstalk of intermediate representations that are activated by multiple objects in a scene. As has been argued by Riesenhuber and Poggio (1999b), hierarchical networks with appropriate pooling can circumvent this binding problem by gradually reducing the assignment between features and their position in the visual field. This can be interpreted as keeping a balance between spatial binding and invariance. There is substantial experimental evidence in favor of the notion of hierarchical processing in the brain. Proceeding from the neurons that receive
Learning Optimized Features for Recognition
1561
the retinal input to neurons in higher visual areas, an increase in receptive field size and stimulus complexity can be observed. A number of experiments have also identified neurons responding to object-like stimuli (Tanaka, 1993, 1996; Logothetis & Pauls, 1995) in the IT area. Thorpe, Fize, and Marlot (1996) showed that for a simple recognition task, a significant change in event-related potentials can be observed in human subjects after 150 ms. Since this short time is in the range of the latency of the spike signal transmission from the retina over visual areas V1, V2, V4 to IT, this can be interpreted as a predominance of feedforward processing for certain rapid recognition tasks. To allow this rapid processing, the use of a neural latency code has been proposed. Thorpe and Gautrais (1997) suggested a representation based on the ranking between features due to the temporal order of transmitted spikes. Körner, Gewaltig, Körner, Richter, and Rodemann (1999) proposed a bidirectional model for cortical processing, where an initial hypothesis on the stimulus is facilitated through a latency encoding in relation to an oscillatory reference frame. In subsequent processing cycles, this coarse hypothesis is refined by top-down feedback. Rodemann and Körner (2001) showed how the model can be applied to the invariant recognition of a set of artificial stimuli. Despite its conceptual attractiveness and neurobiological evidence, the plausibility of the concept of hierarchical feedforward recognition stands or falls with its successful application to sufficiently difficult real-world invariant recognition problems. The central problem is the formulation of a feasible learning approach for optimizing the combined feature-detecting and pooling stages. Apart from promising results on artificial data and very successful applications in the realm of handwritten character recognition, applications to 3D recognition problems (Lawrence, Giles, Tsoi, & Back, 1997) are exceptional.
One reason is that the processing of real-world images requires network sizes that usually make the application of standard supervised learning methods like error backpropagation infeasible. The processing stages in the hierarchy may also contain network nonlinearities like winner-take-all, which do not allow similar gradient-descent optimization. Of great importance for the processing inside a hierarchical network is the coding strategy employed. An important principle, as emphasized by Barlow (1985), is redundancy reduction, that is, a transformation of the input that reduces the statistical dependencies among elements of the input stream. Wavelet-like features have been derived that resemble the receptive fields of V1 cells by either imposing sparse overcomplete representations (Olshausen & Field, 1997) or imposing statistical independence as in independent component analysis (Bell & Sejnowski, 1997). These cells perform the initial visual processing and are thus attributed to the initial stages in hierarchical processing. Extensions for complex cells (Hyvärinen & Hoyer, 2000; Hoyer & Hyvärinen, 2002) and color and stereo coding cells (Hoyer & Hyvärinen, 2000) were shown. Recently, principles of temporal stability or slowness have been proposed and applied to the learning of simple and
complex cell properties (Wiskott & Sejnowski, 2002; Einhäuser, Kayser, König, & Körding, 2002). Lee and Seung (1999) suggested using the principle of nonnegative matrix factorization to obtain a sparse distributed representation. Nevertheless, especially the interplay between sparse coding and efficient representation in a hierarchical system is not well understood. Although a number of experimental studies have investigated the receptive field structure and preferred stimuli of cells in intermediate visual processing stages like V2 and V4 (Gallant, Braun, & Van Essen, 1993; Hegde & Van Essen, 2000), our knowledge of the coding strategy facilitating invariant recognition is at best fragmentary. One reason is that the separation into afferent and recurrent influences gets more difficult at higher stages of visual processing. Therefore, functional models are an important means for understanding hierarchical processing in the visual cortex. Apart from understanding biological vision, these functional principles are also of great relevance for the field of technical computer vision. Although independent component analysis (ICA) and sparse coding have been discussed for feature detection in vision by several authors, there are only a few references to their usefulness in invariant object recognition applications. Bartlett and Sejnowski (1997) showed that for face recognition, ICA representations have advantages over representations based on principal component analysis (PCA) with regard to pose invariance and classification performance. In this contribution, we analyze optimal receptive field structures for shape processing of complex cells in intermediate stages of a hierarchical processing architecture by evaluating the performance of the resulting invariant recognition approach. In section 2, we review related work on hierarchical and nonhierarchical feature-based neural recognition models and learning methods. Our hierarchical model architecture is defined in section 3.
In section 4, we suggest a novel approach to the learning of sparse combination features that is particularly adapted to the characteristics of weight-sharing or convolutional architectures. The architecture optimization and detailed learning methods are described in section 5. In section 6, we demonstrate the recognition capabilities with the application to different object-recognition benchmark problems. We discuss our results in relation to other recent models and give our conclusions in section 7.

2 Related Work

Fukushima (1980) introduced, with the neocognitron, a principle of hierarchical processing for invariant recognition that is based on successive stages of local template matching and spatial pooling. The neocognitron can be trained by unsupervised, competitive learning; however, applications like handwritten digit recognition have required a supervised manual training procedure. A certain disadvantage is the critical dependence of the performance on the appropriate manual selection of training patterns (Lovell, Downs, & Tsoi, 1997) for the template matching stages.
Perrett and Oram (1993) extended the idea of pooling from translation invariance to invariance over any image plane transformation by appropriate pooling over corresponding afferent feature-selective cells. Poggio and Edelman (1990) showed earlier that a network with a gaussian radial basis function architecture that pools over cells tuned to a particular view is capable of view-invariant recognition of artificial paper clip objects. Riesenhuber and Poggio (1999a) emphasized that hierarchical networks with appropriate pooling operations may avoid the combinatorial explosion of combination cells. They proposed a hierarchical model with similar matching and pooling stages as in the neocognitron. A main difference is the nonlinearities that influence the transmission of feedforward information through the network. To reduce the superposition problem, a complex cell in their model focuses on the input of the presynaptic cell providing the largest input (MAX nonlinearity). The model has been applied to the recognition of artificial paper clip images and computer-rendered animal and car objects (Riesenhuber & Poggio, 1999c) and is able to reproduce the response characteristics of IT cells found in monkeys trained with similar stimuli (Logothetis, Pauls, Bülthoff, & Poggio, 1994). Multilayered convolutional networks have been widely applied to pattern recognition tasks, with a focus on optical character recognition (see LeCun, Bottou, Bengio, & Haffner, 1998, for a comprehensive review). Learning of optimal features is carried out using the backpropagation algorithm, where constraints of translation invariance are explicitly imposed by weight sharing. Due to the deep hierarchies, however, gradient learning takes considerable training time for large training ensembles and network sizes. Lawrence et al.
(1997) have applied the method, augmented with a prior vector quantization based on self-organizing maps, and reported improved performance for a face classification setup. Wallis and Rolls (1997) and Rolls and Milward (2000) showed how the connection structure in a four-layered hierarchical network for invariant recognition can be learned entirely by layer-wise unsupervised learning based on the trace rule (Földiák, 1991). With appropriate information measures, they showed that in their network, increasing levels in the hierarchy acquire both increasing levels of invariance and object selectivity in their response to the training stimuli. The result is a distributed representation in the highest layers of the hierarchy with, on average, higher mutual information with the particular object shown. Behnke (1999) applied a Hebbian learning approach with competitive activity normalizations to the unsupervised learning of hierarchical visual features for digit recognition. In a case study for word recognition in text strings, Mel and Fiser (2000) demonstrated how a greedy heuristic can be used to optimize the set of local feature detectors for a given pooling range. An important result is that a comparatively small number of low- and medium-complexity letter templates is sufficient to obtain robust recognition of words in unsegmented text strings. Ullman and Soloviev (1999) elaborated a similar idea for a scenario
of simple artificial graph-based line drawings. They also suggested extending this approach to image micropatches for real image data and argued for a more efficient hierarchical feature construction based on empirical evidence in their model. Roth, Yang, and Ahuja (2002) proposed the application of their SNoW classification model to three-dimensional object recognition, which is based on an incremental assembly of a very large low-level feature collection from the input image ensemble and can be viewed as similar to the greedy heuristic employed by Mel and Fiser (2000). Similarly, the model can be considered a "flat" recognition approach, where the complexity of the input ensemble is represented in a very large low-level feature library. Amit (2000) proposed a more hierarchical approach to object detection based on feature pyramids, which are built from starlike arrangements of binary orientation features. Heisele, Poggio, and Pontil (2000) showed that the performance of face detection systems based on support vector machines can be improved by using a hierarchy of classifiers.

3 A Hierarchical Model of Invariant Recognition

In the following, we define our hierarchical model architecture. The model is based on a feedforward architecture with weight sharing and a succession of feature-sensitive matching and pooling stages. The feedforward model is embedded in the larger cortical processing model architecture as proposed in Körner et al. (1999). Thus, our particular choice of transfer functions aims at capturing the particular form of latency-based coding suggested for this model. Apart from this motivation, however, no further reference to temporal processing will be made here. The model comprises three stages arranged in a processing hierarchy (see Figure 1).

3.1 Initial Processing Layer.
The first feature-matching stage consists of an initial linear sign-insensitive receptive field summation, a winners-take-most (WTM) mechanism between features at the same position, and a final threshold function. In the following, we adopt the notation that vector indices run over the set of neurons within a particular plane of a particular layer. To compute the response q_l^1(x, y) of a simple cell in the first layer, named S1, responsive to feature type l at position (x, y), first the image vector I is multiplied with a weight vector w_l^1(x, y) characterizing the receptive field profile:

$$q_l^1(x, y) = \left| \, w_l^1(x, y) * I \, \right|, \qquad (3.1)$$

where the inner product is denoted by *: for a 10 x 10 pixel image, I and w_l^1(x, y) are 100-dimensional vectors. The weights w_l^1 are normalized and characterize a localized receptive field in the visual field input layer. All cells in a feature plane l have the same receptive field structure, given by w_l^1(x, y), but shifted receptive field centers, as in a classical weight-sharing or
Figure 1: Hierarchical model architecture. The S1 layer performs a coarse local orientation estimation, which is pooled to a lower resolution in the C1 layer. Neurons in the S2 layer are sensitive to local combinations of orientation-selective cells in the C1 layer. The nal S3 view-tuned cells are tuned to the activation pattern of all C2 neurons, which pool the S2 combination cells.
convolutional architecture (Fukushima, 1980; LeCun et al., 1998). The receptive fields are given as first-order even Gabor filters at four orientations (see the appendix). In a second step, a competitive mechanism is performed over the features q_l(x, y) at each position.

The fixed-point entries with x_i > 0 are the solution of a linear equation x = (V - I)^{-1} h, where we have eliminated all components with fixed-point zero entries (Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000). Therefore, if we neglect the relative magnitudes of the overlap matrix V, we obtain a suppression of activations with weak inputs h_i to zero, together with a linear weakening of the input for the nonzero fixed-point entries with the largest input h_{i0}. The WTM model can be viewed as a coarse feedforward approximation to this function.
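A minimal NumPy sketch of this first stage, under stated assumptions: the Gabor profile follows the appendix formula in spirit, but the filter size, frequency f, width sigma, and the WTM fraction gamma used here are illustrative choices rather than the paper's tuned values, and the feedforward thresholding below only approximates the recurrent WTM competition. Each S1 unit computes the sign-insensitive match |w_l(x, y) * I| of equation 3.1, and the competition zeroes, at each position, every feature whose response falls below a fraction gamma of the local maximum.

```python
import numpy as np

# Sketch of the S1 stage (eq. 3.1 plus a feedforward WTM approximation).
# Filter size, f, sigma, and gamma are illustrative, not the paper's values.

def gabor(size=5, theta=0.0, f=2.0, sigma=1.5):
    """Oriented Gabor receptive field, normalized to unit norm."""
    y, x = np.mgrid[-(size // 2):size // 2 + 1, -(size // 2):size // 2 + 1]
    u = x * np.cos(theta) + y * np.sin(theta)
    v = -x * np.sin(theta) + y * np.cos(theta)
    w = np.exp(-0.5 * ((u / sigma) ** 2 + (v / sigma) ** 2)) * np.sin(f * u)
    return w / np.linalg.norm(w)

def s1_responses(image, thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    """q_l(x, y) = |w_l(x, y) * I|: sign-insensitive match at every position."""
    filters = [gabor(theta=t) for t in thetas]
    k = filters[0].shape[0]
    h, w = image.shape
    q = np.zeros((len(filters), h - k + 1, w - k + 1))
    for l, flt in enumerate(filters):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                q[l, i, j] = abs(np.sum(flt * image[i:i + k, j:j + k]))
    return q

def winners_take_most(q, gamma=0.7):
    """Zero every feature whose response is below gamma times the local maximum."""
    m = q.max(axis=0, keepdims=True)
    return np.where(q >= gamma * m, q, 0.0)

rng = np.random.default_rng(0)
image = rng.random((10, 10))          # toy 10 x 10 input "image"
activity = winners_take_most(s1_responses(image))
```

The thresholding stands in for the recurrent fixed-point computation described above: weakly driven orientations are suppressed to zero while the strongest responses at each position survive, which is the qualitative effect of the WTM competition.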
A.3 Architecture Details. The initial odd Gabor filters were computed as

$$w(x, y) = e^{-\frac{1}{2} f^2 \left( \left( \frac{x \cos(\theta) + y \sin(\theta)}{\sigma} \right)^2 + \left( \frac{-x \sin(\theta) + y \cos(\theta)}{\sigma} \right)^2 \right)} \cdot \sin\bigl( f (x \cos(\theta) + y \sin(\theta)) \bigr),$$

with f = 12, sigma = 1.5, and theta = 0, pi/4, pi/2, 3pi/4. Then the weights were normalized to unit norm, that is, sum_i w_i^2 = 1. The visual field layer input intensities were scaled to the interval [0, 1]. The pooling kernels were computed as w(x, y) = exp((-x^2 - y^2)/sigma^2).

The optimal nonlinearity parameter settings for the plain COIL data are, for WTM+sparse: gamma_1 = 0.7, sigma_1 = 4.0, theta_1 = 0.3, gamma_2 = 0.9, sigma_2 = 2.0, theta_2 = 1.0; WTM+PCA: gamma_1 = 0.7, sigma_1 = 2.0, theta_1 = 0.1, gamma_2 = 0.9, sigma_2 = 1.0, theta_2 = 1.0; MAX: 2 pixel S1 and S2 pooling radius; and Square: sigma_1 = 6.0, sigma_2 = 1.0. For the COIL data with clutter, the parameters are, for WTM+sparse: gamma_1 = 0.7, sigma_1 = 2.0, theta_1 = 0.1, gamma_2 = 0.9, sigma_2 = 1.0, theta_2 = 1.0; WTM+PCA: gamma_1 = 0.9, sigma_1 = 3.0, theta_1 = 0.3, gamma_2 = 0.7, sigma_2 = 1.0, theta_2 = 1.0; MAX: 1 pixel S1 and S2 pooling radius; and Square: sigma_1 = 2.0, sigma_2 = 1.0.

Acknowledgments

We thank J. Eggert, T. Rodemann, U. Körner, C. Goerick, and T. Poggio for fruitful discussions on earlier versions of this article. We are also grateful to B. Heisele and B. Mel for providing the image data and thank the anonymous referees for their valuable comments.

References

Amit, Y. (2000). A neural network architecture for visual selection. Neural Computation, 12(5), 1141–1164.
Barlow, H. B. (1972). Single units and cognition: A neuron doctrine for perceptual psychology. Perception, 1, 371–394.
Barlow, H. B. (1985). The Twelfth Bartlett Memorial Lecture: The role of single neurons in the psychology of perception. Quart. J. Exp. Psychol., 37, 121–145.
Bartlett, M. S., & Sejnowski, T. J. (1997). Viewpoint invariant face recognition using independent component analysis and attractor networks. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 817). Cambridge, MA: MIT Press.
Behnke, S. (1999). Hebbian learning and competition in the neural abstraction pyramid. In Proc. Int. Joint Conf. on Neural Networks. Washington, DC.
Piscataway, NJ: IEEE Neural Networks Council.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Einhäuser, W., Kayser, C., König, K., & Körding, K. (2002). Learning the invariance properties of complex cells from their responses to natural stimuli. European Journal of Neuroscience, 15(3), 475–486.
Feng, J. (1997). Lyapunov functions for neural nets with nondifferentiable input-output characteristics. Neural Computation, 9(1), 43–49.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2), 194–200.
Freeman, W. T., & Adelson, E. H. (1991). The design and use of steerable filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(9), 891–906.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cyb., 39, 139–202.
Gallant, J. L., Braun, J., & Van Essen, D. C. (1993). Selectivity for polar, hyperbolic, and cartesian gratings in macaque visual cortex. Science, 259, 100–103.
Gray, C. M. (1999). The temporal correlation hypothesis of visual feature integration: Still alive and well. Neuron, 24, 31–47.
Hahnloser, R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., & Seung, H. S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405, 947–951.
Hegde, J., & Van Essen, D. C. (2000). Selectivity for complex shapes in primate visual area V2. Journal of Neuroscience, 20(RC61), 1–6.
Heisele, B., Poggio, T., & Pontil, M. (2000). Face detection in still gray images (Tech. Rep. AI Memo 1687). Cambridge, MA: MIT.
Hoyer, P. O., & Hyvärinen, A. (2000). Independent component analysis applied to feature extraction from colour and stereo images. Network, 11(3), 191–210.
Hoyer, P. O., & Hyvärinen, A. (2002). A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12), 1593–1605.
Hyvärinen, A., & Hoyer, P. O. (2000).
Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neur. Comp., 12(7), 1705–1720.
Körner, E., Gewaltig, M.-O., Körner, U., Richter, A., & Rodemann, T. (1999). A model of computation in neocortical architecture. Neural Networks, 12(7–8), 989–1005.
Lawrence, S., Giles, C. L., Tsoi, A. C., & Back, A. D. (1997). Face recognition: A convolutional neural-network approach. IEEE Trans. Neur. Netw., 8(1), 98–113.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86, 2278–2324.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
Logothetis, N. K., & Pauls, J. (1995). Psychophysical and physiological evidence for viewer-centered object representations in the primate. Cerebral Cortex, 5, 270–288.
Logothetis, N. K., Pauls, J., Bülthoff, H., & Poggio, T. (1994). Shape representation in the inferior temporal cortex of monkeys. Curr. Biol., 4, 401–414.
Learning Optimized Features for Recognition
1587
Lovell, D., Downs, T., & Tsoi, A. (1997). An evaluation of the neocognitron. IEEE Trans. Neur. Netw., 8, 1090–1105. Malik, J., & Perona, P. (1990, May). Preattentive texture discrimination with early vision mechanisms. J. Opt. Soc. Amer., 5(5), 923–932. Mel, B. W. (1997). SEEMORE: Combining color, shape, and texture histogramming in a neurally inspired approach to visual object recognition. Neural Computation, 9(4), 777–804. Mel, B. W., & Fiser, J. (2000). Minimizing binding errors using learned conjunctive features. Neural Computation, 12(4), 731–762. Nayar, S. K., Nene, S. A., & Murase, H. (1996). Real-time 100 object recognition system. In Proc. of ARPA Image Understanding Workshop. San Mateo, CA: Morgan Kaufmann. Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325. Perrett, D., & Oram, M. (1993). Neurophysiology of shape processing. Imaging Vis. Comput., 11, 317–333. Poggio, T., & Edelman, S. (1990). A network that learns to recognize 3D objects. Nature, 343, 263–266. Pontil, M., & Verri, A. (1998). Support vector machines for 3D object recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 20(6), 637–646. Riesenhuber, M., & Poggio, T. (1999a). Are cortical models really bound by the “binding problem”? Neuron, 24, 87–93. Riesenhuber, M., & Poggio, T. (1999b). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019–1025. Riesenhuber, M., & Poggio, T. (1999c). A note on object class representation and categorical perception (Tech. Rep. 1679). Cambridge, MA: MIT AI Lab. Rodemann, T., & Korner, ¨ E. (2001). Two separate processing streams in a corticaltype architecture. Neurocomputing, 38–40, 1541–1547. Rolls, E. T., & Milward, T. (2000). A model of invariant object recognition in the visual system: Learning rules, activation functions, lateral inhibition and information-based performance measures. 
Neural Computation, 12(11), 2547–2572. Rolls, E., Webb, B., & Booth, C. (2001). Responses of inferior temporal cortex neurons to objects in natural scenes. Society for Neuroscience Abstracts, 27, 1331. Roobaert, D., & Hulle, M. V. (1999). View-based 3D object recognition with support vector machines. In Proc. IEEE Int. Workshop on Neural Networks for Signal Processing (pp. 77–84). Madison, WI. New York: IEEE. Roth, D., Yang, M.-H., & Ahuja, N. (2002). Learning to recognize 3D objects. Neural Computation, 14(5), 1071–1104. Tanaka, K. (1993). Neuronal mechanisms of object recognition. Science, 262, 685–688. Tanaka, K. (1996). Inferotemperal cortex and object vision: Stimulus selectivity and columnar organization. Annual Review of Neuroscience, 19, 109–139. Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the visual system. Nature, 381, 520–522.
1588
H. Wersing and E. Korner ¨
Thorpe, S. J., & Gautrais, J. (1997). Rapid visual processing using spike asynchrony. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 901). Cambridge, MA: MIT Press. Ullman, S., & Soloviev, S. (1999). Computation of pattern invariance in brain-like structures. Neural Networks, 12(7–8), 1021–1036. von der Malsburg, C. (1981). The correlation theory of brain function (Tech. Rep. 81-2). MPI Gottingen. ¨ von der Malsburg, C. (1999). The what and why of binding: The modeler ’s perspective. Neuron, 24, 95–104. Wallis, G., & Rolls, E. T. (1997). A model of invariant object recognition in the visual system. Progress in Neurobiology, 51, 167–194. Wersing, H., Steil, J. J., & Ritter, H. (2001). A competitive layer model for feature binding and sensory segmentation. Neural Computation, 13(2), 357–387. Wiskott, L., & Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770. Received June 6, 2001; accepted December 27, 2002.
LETTER

Communicated by Erkki Oja

Soft Learning Vector Quantization

Sambu Seo
[email protected]

Klaus Obermayer
[email protected]

Department of Electrical Engineering and Computer Science, Technical University of Berlin, 10587 Berlin, Germany
Learning vector quantization (LVQ) is a popular class of adaptive nearest prototype classifiers for multiclass classification, but learning algorithms from this family have so far been proposed on heuristic grounds. Here, we take a more principled approach and derive two variants of LVQ using a gaussian mixture ansatz. We propose an objective function based on a likelihood ratio and derive a learning rule using gradient descent. The new approach provides a way to extend the algorithms of the LVQ family to different distance measures and allows for the design of "soft" LVQ algorithms. Benchmark results show that the new methods lead to better classification performance than LVQ 2.1. An additional benefit of the new method is that model assumptions are made explicit, so that the method can be adapted more easily to different kinds of problems.
1 Introduction

Learning vector quantization (LVQ) (Kohonen, 2001; Kohonen, Hynninen, Kangas, Laaksonen, & Torkkola, 1995) is a class of learning algorithms for nearest prototype classification (NPC). LVQ was introduced by Kohonen (1986) almost 20 years ago and has since been widely used (see Centre, 2002, for an extensive bibliography). New versions designed to improve classification performance have also been proposed (Kohonen, 1990; Kohonen et al., 1995). Like the K-nearest neighbor method (Duda & Hart, 2000), NPC is a local classification method in the sense that classification boundaries are approximated locally. Instead of making use of all the data points of a training set, however, NPC relies on a set of appropriately chosen prototype vectors. This makes the method computationally more efficient, because the number of items that must be stored, and to which a new data point must be compared for classification, is considerably smaller. LVQ has therefore been a popular method for real-time applications, like speech recognition (Komori & Katagiri, 1992; McDermott & Katagiri, 1994), where it has been used with good success.

© 2003 Massachusetts Institute of Technology. Neural Computation 15, 1589–1604 (2003)

A nearest prototype classifier consists of a set T = {(θ_j, c_j)}_{j=1}^M of labeled prototype vectors. The parameters θ_j ∈ X ≡ R^D are vectors in data space, and the c_j ∈ I, j = 1, …, M, are their corresponding class labels. Typically, the number of prototypes is larger than the number of classes, such that every classification boundary is determined by more than one prototype. The class y of a new data point x is determined (1) by selecting the prototype θ_q that is closest to x,

    q = arg min_j d(x, θ_j),    (1.1)

where d(·, ·) is the distance between data point and prototype, and (2) by assigning the label c_q of this prototype to the data point x. A popular choice for d is the squared Euclidean distance d(x, θ_j) = ‖x − θ_j‖², but other distance measures may be chosen depending on the problem at hand. From equation 1.1, it is clear that the set of prototype vectors induces a tessellation of data space and that the classifier assigns the same class label to all data points that fall into the same cell. The tessellation is optimal if all data points within one cell indeed belong to the same class.

LVQ, a class of algorithms for the determination of prototype vectors for NPC, makes use of the distribution of the training data as well as their class labels when selecting a useful set of prototypes. Let us consider LVQ 2.1 as an example that has been shown to provide good NPC classifiers (Kohonen, 1990; Centre, 2002). For every data point (x, y) from the training set S = {(x_i, y_i)}_{i=1}^N, LVQ 2.1 first selects the two nearest prototypes θ_l, θ_m according to the Euclidean distance. If the labels c_l and c_m are different and if one of them is equal to the label y of the data point, then the two nearest prototypes are adjusted according to

    θ_l(t+1) = θ_l(t) + α(t)(x − θ_l),    c_l = y,
    θ_m(t+1) = θ_m(t) − α(t)(x − θ_m),    c_m ≠ y.    (1.2)
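The classification step of equation 1.1 and the LVQ 2.1 update of equation 1.2 can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distances throughout; the window gate follows equation 1.3 below, and all function and parameter names are ours, not the authors'.

```python
import numpy as np

def nearest_prototype(x, prototypes, labels):
    # Equation 1.1: assign x the label of the closest prototype.
    d = np.sum((prototypes - x) ** 2, axis=1)
    return labels[int(np.argmin(d))]

def lvq21_step(x, y, prototypes, labels, alpha=0.05, w=0.3):
    # Equation 1.2 gated by the window rule (equation 1.3): update the two
    # nearest prototypes only if their labels differ, one of them carries
    # the label y, and x lies inside the window near the class boundary.
    d = np.sum((prototypes - x) ** 2, axis=1)
    l, m = np.argsort(d)[:2]
    s = (1.0 - w) / (1.0 + w)
    in_window = min(d[l] / d[m], d[m] / d[l]) > s
    if in_window and labels[l] != labels[m] and y in (labels[l], labels[m]):
        if labels[l] != y:
            l, m = m, l  # let l be the prototype with the correct label
        prototypes[l] += alpha * (x - prototypes[l])  # attract
        prototypes[m] -= alpha * (x - prototypes[m])  # repel
    return prototypes
```

Note that without the window gate the repulsive term can push prototypes arbitrarily far from the data, which is the divergence problem discussed below.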
If the labels c_l and c_m are equal, or if both labels differ from the label y of the data point, no parameter update is performed. Moreover, the prototypes are changed only if the data point x is close to the classification boundary, that is, if it falls into a window,

    min( d(x, θ_l)/d(x, θ_m), d(x, θ_m)/d(x, θ_l) ) > s,   where   s = (1 − ω)/(1 + ω),    (1.3)
of relative width 0 < ω ≤ 1 (Kohonen, 2001). This "window rule" had to be introduced because otherwise prototype vectors may diverge.

LVQ provides good classification results (Centre, 2002), but it has the disadvantage of being a heuristic (though a well-chosen one) and may not be optimal. In this contribution, we therefore derive variants of the LVQ method using a principled approach. We first make the ansatz that the probability density of the data is well described by one gaussian mixture model for each class. Then we consider the probability density that a data point x is generated by the gaussian mixture model of the correct class and compare it to the probability density that this data point is generated by the gaussian mixture models of the incorrect classes. The logarithm of the ratio of the "correct" versus the "incorrect" probability densities serves as a cost function and is maximized. This choice of cost function is motivated by the fact that the learning rule of LVQ leads to both attraction and repulsion of prototypes depending on the class labels (cf. equations 1.2). As we will see in section 2, the "attractive" term results from maximizing the log probability that a data point is generated by the "correct" class, while the "repulsive" term is a consequence of minimizing the log probability that the data point is generated by the classes with the "incorrect" label. In order to avoid the divergence of prototypes, we finally propose a slightly modified cost function in section 3. Section 4 contains benchmark results for the UCI letter data set showing that our principled approach indeed leads to improved classification accuracy compared to LVQ 2.1.

2 Learning Vector Quantization Based on Likelihood Ratios

2.1 Cost Function and Learning Rule: General Case.
In the following, we assume that the probability density p(x) of the data points x can be described by a mixture model and that every component j of the mixture is homogeneous in the sense that it generates data points that belong to only one class C_j. The probability density of the data is then given by

    p(x | T) = Σ_{y=1}^{N_y} Σ_{j: c_j = y} p(x | j) p(j),    (2.1)

where N_y is the number of classes and c_j is the class label of the data points generated by component j. p(j) is the probability that data points are generated by a particular component j of the mixture, and p(x | j) is the conditional probability that this component generates a particular data point x. For NPC, the conditional probability p(x | j) should be a function of a prototype θ_j, which is usually interpreted as the representative feature vector for all data points generated by (and later assigned to) component j.
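As a concrete instance of equation 2.1, the following sketch evaluates p(x | T) with isotropic gaussian components p(x | j) = K exp(−‖x − θ_j‖²/2σ²). The gaussian choice and all names are illustrative assumptions; the derivation below only requires the normalized exponential form introduced later.

```python
import numpy as np

def mixture_density(x, prototypes, priors, sigma2):
    # Equation 2.1: p(x|T) = sum over components of p(x|j) p(j). The double
    # sum over classes and over the components of each class visits every
    # component exactly once, so we sum over all components at once.
    D = x.shape[0]
    K = (2.0 * np.pi * sigma2) ** (-D / 2.0)  # gaussian normalization K(j)
    f = -np.sum((prototypes - x) ** 2, axis=1) / (2.0 * sigma2)
    return float(np.sum(priors * K * np.exp(f)))
```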
Let us now consider a data point x and its true class label y, and let us define the following restricted probability densities and likelihoods:

    p(x, y | T) = Σ_{j: c_j = y} p(x | j) p(j),      p(x, ȳ | T) = Σ_{j: c_j ≠ y} p(x | j) p(j),

    L_c(S | T) = Π_{k=1}^{N} p(x_k, y_k | T),      L_u(S | T) = Π_{k=1}^{N} p(x_k, ȳ_k | T).    (2.2)

p(x, y | T) is the probability density that a data point x is generated by the mixture model for the "correct" class, that is, the class denoted by the label y. p(x, ȳ | T) is the probability density that a data point x is generated by the mixture models for the "incorrect" classes, that is, the classes denoted by labels ≠ y. The mixture model, equation 2.1, is a good model of the data if the likelihood functions L_c(S | T) and L_u(S | T) are maximal and minimal, respectively. A natural approach is to maximize the likelihood ratio,

    L_r(S | T) = Π_{k=1}^{N} [ p(x_k, y_k | T) / p(x_k, ȳ_k | T) ]  →  max,    (2.3)
with regard to the parameters of the model densities. This leads to a classifier that maximizes the rate of correct classification and at the same time minimizes the rate of misclassification. For computational simplicity, we maximize log L_r,

    log L_r(S | T) = Σ_{k=1}^{N} log [ p(x_k, y_k | T) / p(x_k, ȳ_k | T) ]  →  max.    (2.4)
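For gaussian components, the cost of equation 2.4 can be evaluated directly from the restricted densities of equation 2.2. This is an illustrative sketch under our assumptions (gaussian components, our names); note that the component normalization constant cancels between the "correct" and "incorrect" densities.

```python
import numpy as np

def log_likelihood_ratio(X, Y, prototypes, labels, priors, sigma2):
    # Equation 2.4: sum over data points of
    # log p(x_k, y_k | T) - log p(x_k, ybar_k | T),
    # with the restricted densities of equation 2.2.
    total = 0.0
    for x, y in zip(X, Y):
        f = -np.sum((prototypes - x) ** 2, axis=1) / (2.0 * sigma2)
        g = priors * np.exp(f)  # proportional to p(x|j) p(j); K(j) cancels
        total += np.log(g[labels == y].sum()) - np.log(g[labels != y].sum())
    return total
```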
Using stochastic gradient ascent (Robbins & Monro, 1951), we obtain the learning rule

    θ_l(t+1) = θ_l(t) + α(t) ∂/∂θ_l log [ p(x, y | T) / p(x, ȳ | T) ],    (2.5)

where t is the iteration number and α(t) is the learning rate. In order to ensure convergence, α(t) must fulfill the conditions Σ_{t=0}^{∞} α(t) = ∞ and Σ_{t=0}^{∞} α²(t) < ∞ (Robbins & Monro, 1951). If the conditional density functions p(x | j) are of the normalized exponential form,

    p(x | j) = K(j) exp f(x, θ_j),    (2.6)

we obtain (see appendix A)

    ∂/∂θ_l log [ p(x, y | T) / p(x, ȳ | T) ] = δ(c_l = y) P_y(l | x) ∂f(x, θ_l)/∂θ_l − δ(c_l ≠ y) P_ȳ(l | x) ∂f(x, θ_l)/∂θ_l,    (2.7)
where δ(x) = 1 if its argument is true and zero otherwise. K(j) is the normalization constant, and P_y(l | x) and P_ȳ(l | x) are given by

    P_y(l | x) = p(l) exp f(x, θ_l) / Σ_{j: c_j = y} p(j) exp f(x, θ_j),
    P_ȳ(l | x) = p(l) exp f(x, θ_l) / Σ_{j: c_j ≠ y} p(j) exp f(x, θ_j).    (2.8)
P_y(l | x) and P_ȳ(l | x) are assignment probabilities. P_y(l | x) describes the (posterior) probability that the data point x is assigned to the component l of the mixture, given that the data point was generated by the correct class. P_ȳ(l | x) describes the (posterior) probability that the data point x is assigned to the component l of the mixture, given that the data point was generated by the incorrect class. Using the gradient given in equation 2.7, we obtain the learning rule

    θ_l(t+1) = θ_l(t) + α(t) × { P_y(l | x) ∂f(x, θ_l)/∂θ_l,     if c_l = y,
                                 −P_ȳ(l | x) ∂f(x, θ_l)/∂θ_l,   if c_l ≠ y.

Prototypes are updated only if the data point falls into a window,

    min( ‖x − θ_l‖²/‖x − θ_m‖², ‖x − θ_m‖²/‖x − θ_l‖² ) > s,   where   s = (1 − ω)/(1 + ω).    (2.13)

θ_l and θ_m are the two prototypes closest to the data point x that have different labels, c_l ≠ c_m. 0 < ω ≤ 1 is the width parameter that determines the window size. According to this window rule, prototypes are modified only if a data point lies near the actual class boundary. The learning rule, equation 2.11, together with the window rule, equation 2.13, implements the method of LVQ based on likelihood ratios (LVQ-LR). σ² is a hyperparameter of the learning rule, equation 2.12. In the spirit of deterministic annealing, which is a useful optimization procedure for clustering problems (Graepel, Burger, & Obermayer, 1997; Miller, Rao, Rose, & Gersho, 1996), σ² is set to a large value initially and is decreased during optimization ("annealing") until an optimal value σ_f is reached.

2.4 Hard Classification: The Winner-Takes-All Case. If the width σ of the gaussian components goes to zero, both assignment probabilities, equation 2.12, become hard assignments, that is,

    P_y(l | x) = δ(l, q),    q = arg min_{k: c_k = y} ‖x − θ_k‖,
    P_ȳ(l | x) = δ(l, q),    q = arg min_{k: c_k ≠ y} ‖x − θ_k‖.

If a given data point (x, y) falls into the window, equation 2.13, then the nearest prototype in the set of prototypes with the "correct" label y and the nearest prototype in the set of prototypes with the "incorrect" label ≠ y are modified according to

    θ_l(t+1) = θ_l(t) + α*(t) × { δ(l, q)(x − θ_l),     if c_l = y,
                                  −δ(l, q)(x − θ_l),   if c_l ≠ y.    (2.14)

2.5 Relationship with LVQ 2.1. The hard version of LVQ-LR is similar to LVQ 2.1, with the only exception that LVQ 2.1 requires one of the two nearest prototypes to have the same class label as the data point x. In contrast to LVQ 2.1, however, LVQ-LR was derived by optimization of a cost function. The principled approach has several benefits:

• It provides insight into why prototypes tend to diverge when LVQ is used to generate an NPC.
• It provides a way to extend the LVQ family to different distance measures (cf. equation 2.9), and it relates these distance measures to assumptions about the components of the mixture model.
• It provides a way to extend the LVQ family to soft assignments (see equation 2.11).
• It provides better classification results, as we will see in section 4.

Note, however, that the window rule is still a heuristic and that it introduces a hyperparameter ω for which a proper value must be selected. In the next section, the cost function is modified such that the window rule is no longer required.

3 Robust Soft Learning Vector Quantization

3.1 Cost Function and Learning Rule: General Case. Let us consider the following objective function to be maximized:

    L_r = Π_{k=1}^{N} [ p(x_k, y_k | T) / ( p(x_k, y_k | T) + p(x_k, ȳ_k | T) ) ]
        = Π_{k=1}^{N} [ p(x_k, y_k | T) / p(x_k | T) ]  →  max.    (3.1)

The ratio p(x, y | T)/p(x | T) is bounded by 0 from below and, in contrast to the likelihood ratio used in equation 2.3, bounded by 1 from above. As in the previous section, we optimize the logarithm of the ratio L_r,

    log L_r = Σ_{k=1}^{N} log [ p(x_k, y_k | T) / p(x_k | T) ]  →  max,    (3.2)
instead of equation 3.1. Stochastic gradient ascent then leads to the learning rule

    θ_l(t+1) = θ_l(t) + α(t) ∂/∂θ_l log [ p(x, y | T) / p(x | T) ],    (3.3)

where α(t) is the learning rate with Σ_{t=1}^{∞} α(t) = ∞ and Σ_{t=1}^{∞} α²(t) < ∞. If the conditional density functions p(x | j) are of the normalized exponential form,

    p(x | j) = K(j) exp f(x, θ_j),    (3.4)

the gradient becomes (see appendix B)

    ∂/∂θ_l log [ p(x, y | T) / p(x | T) ] = δ(c_l = y) ( P_y(l | x) − P(l | x) ) ∂f(x, θ_l)/∂θ_l − δ(c_l ≠ y) P(l | x) ∂f(x, θ_l)/∂θ_l,
where P_y(l | x) and P(l | x) are assignment probabilities,

    P_y(l | x) = p(l) exp f(x, θ_l) / Σ_{j: c_j = y} p(j) exp f(x, θ_j),
    P(l | x) = p(l) exp f(x, θ_l) / Σ_{j=1}^{M} p(j) exp f(x, θ_j).    (3.5)
P_y(l | x) describes the (posterior) probability that the data point x is assigned to the component l of the mixture, given that the data point was generated by the correct class. P(l | x) describes the (posterior) probability that the data point x is assigned to the component l of the complete mixture using all classes. Using the gradient given in equation 3.3, we obtain the learning rule

    θ_l(t+1) = θ_l(t) + α(t) × { ( P_y(l | x) − P(l | x) ) ∂f(x, θ_l)/∂θ_l,   if c_l = y,
                                 −P(l | x) ∂f(x, θ_l)/∂θ_l,                   if c_l ≠ y.

Figure 1: (Top) … of correctly classified data points as a function of their location (x_1, x_2). (Bottom) Strength of the factors Σ_{i=1}^{2} [ P_y(i | x) − P(i | x) ] (for x_1 < 0) and Σ_{i=1}^{2} P(i | x) (for x_1 > 0) of incorrectly classified data points as a function of their location (x_1, x_2).
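For gaussian components, f(x, θ_l) = −‖x − θ_l‖²/(2σ²) and hence ∂f/∂θ_l = (x − θ_l)/σ², so one update of the rule above can be sketched as follows. The gaussian form and all names are our illustrative assumptions.

```python
import numpy as np

def rslvq_step(x, y, prototypes, labels, priors, sigma2, alpha=0.05):
    # Assignment probabilities of equation 3.5, computed in a numerically
    # stable way (shifting f by its maximum cancels in both ratios).
    f = -np.sum((prototypes - x) ** 2, axis=1) / (2.0 * sigma2)
    g = priors * np.exp(f - f.max())
    P = g / g.sum()                       # P(l|x)
    mask = labels == y
    Py = np.where(mask, g, 0.0)
    Py /= Py.sum()                        # P_y(l|x)
    grad_f = (x - prototypes) / sigma2    # df/dtheta_l for gaussian f
    coef = np.where(mask, Py - P, -P)     # the two delta-function branches
    return prototypes + alpha * coef[:, None] * grad_f
```

Because P_y(l | x) − P(l | x) ≥ 0 for a correct-class prototype, correct prototypes are attracted and all others are repelled, with strengths that vanish far from the class boundary.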
recognition/). The data set was generated from a large number of black-and-white rectangular pixel displays of the 26 capital letters in the English alphabet. The letters were taken from 20 different fonts, and each symbol was randomly distorted. Each pixel image was afterward converted into 16 primitive numerical attributes (statistical moments and edge counts), which were then scaled to fit into a range of integer values between 0 and 15. The data set contains 20,000 items. In order to assess the performance of the three methods, we split the data set into two subsets, each with the same number of data points for each label. The first subset, the training set, was used to find the optimal values
of the hyperparameters (window width ω for LVQ 2.1 and LVQ-LR, and parameter σ for LVQ-LR, soft version, and RSLVQ). Tenfold cross-validation was performed on the training set for several values of the hyperparameters, and the values that gave rise to the minimal error were selected. The generalization error for the optimal hyperparameters was then estimated by 10-fold cross-validation on the second (test) set. The training and test sets contained 10,000 data points each.

The prototypes of each class were initialized by adding random vectors to the centroids μ_l of all the data points that belong to this class: θ_l = μ_l + 0.01 × ζ × σ_l × ξ. μ_l and σ_l ∈ R^D are the centroid and the standard deviations along the coordinate axes of the data points in the training set with label l, ζ is the number of prototypes per class, and ξ ∈ [−1, 1] is a number drawn randomly from a uniform distribution. The learning rate was annealed using α(t+1) = 0.1 × ζ × 12000 / (ζ × 12000 + t) in all cases without annealing. Learning was terminated after all training data points had been used 30 times. For the annealing in the variance σ², we used the schedule σ²(t) = σ²(t−1) × β(t), β(t) = β(t−1)^{1.1}, where β(0) = 0.99 and σ²(0) = σ²_opt + 0.2, and annealing was terminated if σ²(t) < σ²_opt − 0.2. For LVQ-LR, the window parameter ω was set to the optimal value of ω that had been obtained in the numerical experiments without σ² annealing.

Figure 2 shows the average rate of misclassification and the standard deviation obtained on the test set as a function of the number of prototypes per class, which determines the complexity of the model. The corresponding optimal hyperparameters are shown in appendix C. The figure shows that the rate of misclassification decreases with increasing complexity of the model (i.e., the number of prototypes per class) for all algorithms considered in this article.
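The two schedules just described can be sketched as follows. The exponent 1.1 in the β update is our reading of the garbled schedule and should be treated as an assumption, as should the function names.

```python
def learning_rate(t, zeta):
    # alpha(t+1) = 0.1 * (zeta * 12000) / (zeta * 12000 + t),
    # with zeta the number of prototypes per class.
    return 0.1 * (zeta * 12000.0) / (zeta * 12000.0 + t)

def sigma2_schedule(sigma2_opt, beta0=0.99, exponent=1.1):
    # sigma^2(t) = sigma^2(t-1) * beta(t), beta(t) = beta(t-1)**exponent,
    # starting at sigma2_opt + 0.2 and stopping below sigma2_opt - 0.2.
    sigma2, beta = sigma2_opt + 0.2, beta0
    values = []
    while sigma2 >= sigma2_opt - 0.2:
        values.append(sigma2)
        beta = beta ** exponent
        sigma2 *= beta
    return values
```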
Figure 2 shows that the soft version with annealing (LVQ-LR AN) and without annealing (LVQ-LR soft), the hard version of LVQ-LR (LVQ-LR hard), and RSLVQ with annealing (RSLVQ AN) and without annealing (RSLVQ soft) perform consistently better than LVQ 2.1. LVQ-LR soft is more complex to optimize than LVQ 2.1 and RSLVQ, because two hyperparameters must be optimized rather than one; hence, LVQ-LR hard and RSLVQ should be preferred. Figure 2 also shows that it is advantageous to optimize LVQ-LR and RSLVQ using annealing in the component width σ.

5 Discussion

In this contribution, we derived two variants of learning vector quantization using a cost function approach. Our approach is based on a mixture model ansatz for the class probability densities and on a likelihood ratio, which is to be maximized. The principled approach has the advantage that it makes it simpler to analyze the learning methods mathematically, that it allows extending LVQ methods to different distance measures, and that, at least for RSLVQ, it circumvents the problem of the divergence of the prototype vectors. Benchmarks on the UCI letter data set have shown that
Figure 2: Rate of misclassification (normalized to [0, 1]) as a function of the number of prototypes per class for the different learning algorithms considered in this article (left panel: LVQ 2.1, LVQ-LR soft, LVQ-LR AN, LVQ-LR hard; right panel: LVQ 2.1, RSLVQ soft, RSLVQ AN). Error bars indicate standard deviation. The misclassification rate of RSLVQ soft with 1 prototype per class is 0.3980. For details, see the text.
the classification results are improved compared to LVQ 2.1. Note that the performance of our new method is comparable to and slightly better than the performance obtained with the support vector machine (SVM) method for multiclass classification (Suykens & Vandewalle, 1999). For the RBF kernel and parameters set to C = 60, σ = 7.5, and 200 support vectors, we obtain a misclassification rate for SVMs of 0.038. Nearest prototype classifiers, however, are faster, and fewer hyperparameters must be optimized for our new approach. Hence, NPC should be preferred over the SVM method, especially for an application like speech processing, where real-time classification is a serious constraint.

Appendix A: The Gradient of log [ p(x, y | T) / p(x, ȳ | T) ] with Regard to θ_l
    ∂/∂θ_l log [ p(x, y | T) / p(x, ȳ | T) ]
    = [ p(x, ȳ | T) / p(x, y | T) ] × [ (1 / p(x, ȳ | T)) ∂p(x, y | T)/∂θ_l − ( p(x, y | T) / p(x, ȳ | T)² ) ∂p(x, ȳ | T)/∂θ_l ]
    =(a) δ(c_l = y) [ p(l) exp f(x, θ_l) / p(x, y | T) ] ∂f(x, θ_l)/∂θ_l − δ(c_l ≠ y) [ p(l) exp f(x, θ_l) / p(x, ȳ | T) ] ∂f(x, θ_l)/∂θ_l
    = δ(c_l = y) P_y(l | x) ∂f(x, θ_l)/∂θ_l − δ(c_l ≠ y) P_ȳ(l | x) ∂f(x, θ_l)/∂θ_l,

where step (a) uses

    ∂p(x, y | T)/∂θ_l = δ(c_l = y) p(l) exp f(x, θ_l) ∂f(x, θ_l)/∂θ_l,
    ∂p(x, ȳ | T)/∂θ_l = δ(c_l ≠ y) p(l) exp f(x, θ_l) ∂f(x, θ_l)/∂θ_l.

Appendix B: The Gradient of log [ p(x, y | T) / p(x | T) ] with Regard to θ_l
    ∂/∂θ_l log [ p(x, y | T) / p(x | T) ]
    = [ p(x | T) / p(x, y | T) ] × [ (1 / p(x | T)) ∂p(x, y | T)/∂θ_l − ( p(x, y | T) / p(x | T)² ) ( ∂p(x, y | T)/∂θ_l + ∂p(x, ȳ | T)/∂θ_l ) ]
    =(a) δ(c_l = y) [ p(l) exp f(x, θ_l) / p(x, y | T) ] ∂f(x, θ_l)/∂θ_l − δ(c_l = y) [ p(l) exp f(x, θ_l) / p(x | T) ] ∂f(x, θ_l)/∂θ_l − δ(c_l ≠ y) [ p(l) exp f(x, θ_l) / p(x | T) ] ∂f(x, θ_l)/∂θ_l
    = δ(c_l = y) ( P_y(l | x) − P(l | x) ) ∂f(x, θ_l)/∂θ_l − δ(c_l ≠ y) P(l | x) ∂f(x, θ_l)/∂θ_l,

where step (a) uses the same derivatives of the restricted densities as in appendix A.
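The gradient derived above can be checked numerically for gaussian components. This sketch (our construction, not the authors') compares the analytic expression with a central finite-difference approximation of log [ p(x, y | T) / p(x | T) ].

```python
import numpy as np

def log_ratio(x, y, prototypes, labels, priors, sigma2):
    # log [ p(x,y|T) / p(x|T) ] with gaussian components; K(j) cancels.
    f = -np.sum((prototypes - x) ** 2, axis=1) / (2.0 * sigma2)
    g = priors * np.exp(f)
    return np.log(g[labels == y].sum()) - np.log(g.sum())

def appendix_b_grad(x, y, prototypes, labels, priors, sigma2, l):
    # delta(c_l=y)(P_y(l|x) - P(l|x)) df/dtheta_l - delta(c_l!=y) P(l|x) df/dtheta_l,
    # with df/dtheta_l = (x - theta_l)/sigma^2 for gaussian f.
    f = -np.sum((prototypes - x) ** 2, axis=1) / (2.0 * sigma2)
    g = priors * np.exp(f)
    P = g[l] / g.sum()
    grad_f = (x - prototypes[l]) / sigma2
    if labels[l] == y:
        return (g[l] / g[labels == y].sum() - P) * grad_f
    return -P * grad_f
```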
Appendix C: Optimal Hyperparameters

The values of the hyperparameters were selected by cross-validation on the training set. ζ is the number of prototypes per class.
    ζ    LVQ 2.1: ω   LVQ-LR h: ω   LVQ-LR: ω   LVQ-LR: σ²   RSLVQ: σ²
    1    0.02         0.03          0.015       0.3          1.5
    2    0.035        0.03          0.025       1.5          2.0
    3    0.03         0.02          0.035       2.25         1.5
    4    0.035        0.045         0.045       2.25         2.75
    5    0.04         0.035         0.03        1.75         2.5
    6    0.055        0.045         0.045       1.75         2.0
    7    0.065        0.05          0.045       2.25         2.25
    8    0.06         0.05          0.055       2.5          1.75
    9    0.065        0.055         0.045       2.0          2.5
    10   0.065        0.055         0.055       1.75         2.25
    11   0.065        0.06          0.06        2.0          2.0
    12   0.075        0.065         0.045       1.75         2.25
    13   0.075        0.05          0.05        2.25         2.25
References

Centre, N. N. R. (2002). Bibliography on the self-organizing map (SOM) and learning vector quantization (LVQ). Otaniemi: Helsinki University of Technology. Available on-line: http://liinwww.ira.uka.de/bibliography/Neural/SOM.LVQ.html.
Duda, R. O., & Hart, P. E. (2000). Pattern classification and scene analysis. New York: Wiley.
Graepel, T., Burger, M., & Obermayer, K. (1997). Phase transitions in stochastic self-organizing maps. Physical Review E, 56, 3876–3890.
Kohonen, T. (1986). Learning vector quantization (Tech. Rep.). Otaniemi: Helsinki University of Technology.
Kohonen, T. (1990). Improved versions of learning vector quantization. In IJCNN International Joint Conference on Neural Networks (Vol. 1, pp. 545–550). San Diego: IEEE Computer Society Press.
Kohonen, T. (2001). Self-organizing maps. New York: Springer-Verlag.
Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J., & Torkkola, K. (1995). LVQ_PAK: The learning vector quantization program package. Otaniemi: Helsinki University of Technology.
Komori, T., & Katagiri, S. (1992). Application of a generalized probabilistic descent method to dynamic time warping-based speech recognition. In Acoustics, Speech, and Signal Processing (pp. 497–500). San Francisco: IEEE Signal Processing Society Press.
McDermott, E., & Katagiri, S. (1994). Prototype-based minimum classification error/generalized probabilistic descent training for various speech units. Computer Speech and Language, 8(4), 351–368.
Miller, D., Rao, A. V., Rose, K., & Gersho, A. (1996). A global optimization technique for statistical classifier design. IEEE Transactions on Signal Processing, 44(12), 3108–3122.
Robbins, H., & Monro, S. (1951). A stochastic approximation method. Ann. Math. Stat., 22, 400–407.
Suykens, J. A. K., & Vandewalle, J. (1999). Multiclass least squares support vector machines. In IJCNN'99 International Joint Conference on Neural Networks. Washington, DC: IEEE Computer Society Press.

Received July 19, 2002; accepted December 23, 2002.
LETTER

Communicated by Eric Mjolsness

An Efficient Approximation Algorithm for Finding a Maximum Clique Using Hopfield Network Learning

Rong Long Wang
[email protected]

Zheng Tang
[email protected]
Faculty of Engineering, Toyama University, Toyama-shi, Japan 930-8555

Qi Ping Cao
[email protected]
Tateyama Systems Institute, Toyama-shi, Japan 930-001
In this article, we present a solution to the maximum clique problem using a gradient-ascent learning algorithm of the Hopfield neural network. This method provides a near-optimum parallel algorithm for finding a maximum clique. To do this, we use the Hopfield neural network to generate a near-maximum clique and then modify weights in a gradient-ascent direction to allow the network to escape from the state of the near-maximum clique to a maximum clique or better. The proposed parallel algorithm is tested on two types of random graphs and on some benchmark graphs from the Center for Discrete Mathematics and Theoretical Computer Science (DIMACS). The simulation results show that the proposed learning algorithm can find good solutions in reasonable computation time.

1 Introduction

A clique of an undirected graph G(V, E) with a vertex set V and an edge set E is a subset of V such that all pairs of its vertices are connected by an edge in E. The maximum clique problem is to determine a clique of maximum size of a graph G. This problem is highly intractable; it is one of the first problems that was proved to be NP-complete (Karp, 1972). Moreover, even its approximation to within a constant factor is NP-hard (Feige, Goldwasser, Safra, Lovász, & Szegedy, 1991). In particular, Hastad (1996) showed that if NP ≠ P, then no polynomial-time algorithm can approximate the maximum clique to within a factor of N^{1/2−ε} for any ε > 0, where N is the number of nodes of the graph. The maximum clique problem has many practical applications in science and engineering (Pardalos & Xue, 1994). These include cluster analysis, information retrieval, mobile networks, computer vision, and alignment of

© 2003 Massachusetts Institute of Technology. Neural Computation 15, 1605–1619 (2003)
DNA with protein sequences. Determining a maximum clique is also very useful in circuit design, where the problem is to create an optimal geometric layout for chip hardware such as programmable logic arrays and complementary metal oxide semiconductor transistors. Because of these potential real-life applications, many researchers have studied this problem using different methods. Johnson (1974) introduced two greedy algorithms for the maximum clique problem. One selects a vertex connected to the most vertices among the candidates and adds it, one by one, to a clique until no addition is possible. The other selects a vertex connected to the fewest vertices and removes it, one by one, from the graph until a clique appears. Lecky, Murphy, and Absher (1989) proposed a modified greedy algorithm that selects each vertex of a graph in turn as the initial vertex of a clique. Tomita and Fujii (1985) proposed three algorithms using depth-first search and a preprocessing procedure. Balas and Yu (1986) proposed an algorithm using the properties of triangulated graphs and the relationship between cliques and vertex colorings. Carraghan and Pardalos (1990) proposed an algorithm based on partial enumeration. Pardalos and Phillips (1990) formulated the maximum clique problem as a linearly constrained indefinite quadratic global optimization problem. Pardalos and Rodgers (1992) proposed an algorithm using the unconstrained quadratic zero-one programming formulation. Soule and Foster (1995) created a genetic algorithm to search for maximum cliques in general graphs. Marchiori (1998) proposed a heuristic-based genetic algorithm for the maximum clique problem, which combines a simple genetic algorithm with a naive heuristic algorithm. For solving such combinatorial optimization problems, energy-descent optimization algorithms also constitute an important avenue. The Hopfield network algorithm (Hopfield, 1982; Hopfield & Tank, 1985) is a typical energy-descent optimization algorithm.
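The first of Johnson's greedy heuristics described above can be sketched as follows. This is our illustrative implementation; the adjacency representation as a list of neighbor sets is an assumption.

```python
def greedy_clique(adj):
    # Repeatedly add the candidate vertex with the most neighbors inside
    # the candidate set. Candidates are exactly the vertices adjacent to
    # every vertex already chosen, so the result is always a clique.
    clique, candidates = [], set(range(len(adj)))
    while candidates:
        v = max(candidates, key=lambda u: len(adj[u] & candidates))
        clique.append(v)
        candidates = (candidates - {v}) & adj[v]
    return clique
```

Like any greedy heuristic, this can return a maximal clique that is not a maximum one, which is exactly the local-minimum problem the learning algorithm of this article is designed to escape.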
An Efficient Algorithm for Maximum Clique Problem

Jagota (1995) proposed five energy-descent optimization algorithms using Hopfield neural networks for approximating the maximum clique problem: steepest descent, ρ-annealing, stochastic steepest descent (SSD), continuous Hopfield dynamics, and mean-field annealing (MFA); these also appeared in the second Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) challenge (Johnson & Trick, 1996). The challenge, which took place in conjunction with the DIMACS Special Year on Combinatorial Optimization, addressed three difficult combinatorial optimization problems: finding cliques in a graph, coloring the vertices of a graph, and solving instances of the satisfiability problem. Jagota's simulation results showed that SSD and MFA perform better than the other algorithms in solution quality. Yamada, Tomita, and Takahashi (1993) proposed an energy-descent optimization algorithm called RaCLIQUE, based on the Boltzmann machine method, for approximating the maximum clique problem. Funabiki and Nishikawa (1996) proposed a binary neural network (BNN) for approximating the maximum clique problem and compared the performance of four promising energy-descent optimization algorithms: BNN, RaCLIQUE, SSD, and MFA. Their simulation results showed that BNN and RaCLIQUE outperform SSD and MFA in solution quality.

In this article, we introduce a learning algorithm of the Hopfield neural network for efficiently solving the maximum clique problem. The learning algorithm has two phases: the Hopfield network updating phase and the gradient-ascent learning phase. The first phase uses Hopfield network updating to find a near-maximum clique. The second phase uses gradient-ascent learning of the Hopfield network to make the network escape from the near-maximum clique (i.e., a local minimum) that the network fell into in the first phase. A large number of randomly generated examples are simulated to verify the proposed algorithm. The performance of the proposed algorithm is compared with that of RaCLIQUE (Yamada et al., 1993) and Funabiki's binary neural network (BNN; Funabiki & Nishikawa, 1996). The simulation results show that the proposed learning algorithm finds a maximum clique, or a larger clique than RaCLIQUE and BNN find. We also tested the proposed learning algorithm on several DIMACS benchmark graphs (Johnson & Trick, 1996); the simulation results show that it can find optimal solutions on these test graphs.
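As a concrete reference point for the clique, maximal clique, and maximum clique definitions used in the next section, here is a small checker (the helper names are ours, not from the paper):

```python
def is_clique(adj, S):
    """True if every pair of distinct vertices in S is joined by an edge."""
    S = list(S)
    return all(adj[u][v] for i, u in enumerate(S) for v in S[i + 1:])

def is_maximal(adj, S):
    """True if S is a clique and no vertex outside S can extend it."""
    n = len(adj)
    return is_clique(adj, S) and not any(
        all(adj[u][v] for u in S) for v in range(n) if v not in S)
```

A maximum clique is then a maximal clique of the largest cardinality; note that a maximal clique found by a greedy procedure need not be a maximum one.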
2 Hopfield Neural Network Updating Phase for Maximum Clique Problem

Let G = (V, E) be an undirected graph, where V = {1, ..., n} is the set of vertices and E ⊆ V × V is the set of edges. A clique of G is a subset of V in which every pair of vertices is connected by an edge. A clique is called maximal if no strict superset of it is also a clique. A maximum clique is a clique of the largest cardinality (note that a maximal clique is not necessarily a maximum one). Hence, the maximum clique problem consists of finding a clique of maximum size in a graph G. Figure 1a illustrates a 10-vertex, 21-edge undirected graph; it contains two maximum cliques, consisting of the black vertices shown in Figures 1b and 1c.

In general, the N-vertex maximum clique problem can be mapped onto a Hopfield neural network with N neurons. An objective function can be formulated for this optimization problem whose minimum value corresponds to the optimal solution. In a reasonable formulation, the objective function has two components: a goal term, which rewards putting as many vertices as possible into the clique, and a constraint term, which enforces that every pair of vertices in the clique is connected by an edge. Suppose the output y_i of neuron #i represents vertex #i, and the constant d_ij represents the edge between vertex #i and vertex #j. This optimization problem can then be stated mathematically as finding the minimum of the following
R. Wang, Z. Tang, and Q. Cao
Figure 1: (a) A 10-vertex 21-edge undirected graph. (b) First maximum clique consisting of the black vertices. (c) Second maximum clique consisting of the black vertices.
Hopfield network energy function:

E = A \sum_{i=1}^{N} \sum_{j \ne i}^{N} (1 - d_{ij})\, y_i y_j - B \sum_{i=1}^{N} y_i,   (2.1)
where the output y_i of neuron #i is 1 if vertex #i is in the clique and 0 otherwise, the constant d_ij is 1 if vertex #i and vertex #j are connected by an edge and 0 otherwise, and A and B are constant coefficients. We rewrite this energy function as a standard Hopfield network energy function,

e = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}\, y_i y_j - \sum_{i=1}^{N} h_i y_i,   (2.2)
where the weights and thresholds of the Hopfield network become

w_{ij} = -2A\,(1 - d_{ij})(1 - \delta_{ij}),   (2.3)

h_i = B.   (2.4)
In equation 2.3, \delta_{ij} is 1 if i = j and 0 otherwise. The equation of motion is

dx_i/dt = \sum_{j=1}^{N} w_{ij}\, y_j + h_i.   (2.5)
The following sigmoid function is used as the input-output function of the neurons:

y_i = 1 / (1 + e^{-x_i/T}),   (2.6)
Figure 2: Conceptual graph of the relation between energy and state transitions in the learning process of a Hopfield network with two stable states.
where T is the temperature parameter. Hopfield (1982) proved that such a network can be used as an approximate method for solving 0-1 optimization problems, because the network converges to a minimum of the energy function provided that the weights are symmetric (w_ij = w_ji) and there are no self-connections (w_ii = 0). The Hopfield network updating procedure can thus be viewed as seeking a minimum in a mountainous energy terrain, and in the first phase we can find a solution to the maximum clique problem simply by observing the stable state that the Hopfield network reaches. Unfortunately, the quality of this solution is poor: it is usually difficult for Hopfield networks to find the global minimum. We now propose a learning method for the maximum clique problem that helps the network escape from a local minimum to the global minimum or a better one.

3 Gradient-Ascent Learning Phase for the Maximum Clique Problem

In this section, we propose a learning method that can help the network escape from a local minimum toward the global minimum. To explain the learning method, we use a two-dimensional picture (see Figure 2) of an energy function with a local minimum and a global minimum. The energy function value is reflected in the height of the graph, and each position on the energy terrain corresponds to a possible state of the network. For example, if the network is initialized on the mountainous terrain at A, the updating procedure of the Hopfield network moves the state of the network toward a minimum position until it reaches a steady state B (see Figure 2a).
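Phase 1, the updating procedure built from equations 2.3 to 2.6, can be sketched as follows. This is a minimal sketch, not the authors' code: the parameter values (A = B = 1.0, T = 1.4) follow section 5, while the step size, iteration count, and clipping of the internal state are our own choices for a stable discretization.

```python
import numpy as np

def clique_energy_descent(adj, A=1.0, B=1.0, T=1.4, dt=0.1, steps=500):
    """Phase 1: relax the Hopfield network (eqs. 2.3-2.6) to a stable state.

    adj is a symmetric 0/1 adjacency matrix with a zero diagonal."""
    n = len(adj)
    d = np.asarray(adj, dtype=float)
    w = -2.0 * A * (1.0 - d) * (1.0 - np.eye(n))    # eq. 2.3: penalize non-edges
    h = B * np.ones(n)                              # eq. 2.4: reward activity
    x = np.zeros(n)                                 # internal states (so y_i = 0.5)
    for _ in range(steps):
        y = 1.0 / (1.0 + np.exp(-x / T))            # eq. 2.6
        x = np.clip(x + dt * (w @ y + h), -50, 50)  # eq. 2.5, clipped for stability
    y = 1.0 / (1.0 + np.exp(-x / T))
    return {i for i in range(n) if y[i] > 0.5}      # round the stable state
```

For a triangle plus an isolated vertex, for instance, these dynamics drive the isolated neuron off and the three clique neurons on, so the relaxation settles on the triangle.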
Because the weights and thresholds of the Hopfield network determine the energy terrain, we can change the weights and thresholds so as to increase the energy at point B, fill up the local-minimum valley, and finally drive point B out of the valley. The learning therefore requires the parameter changes to be in the positive gradient direction of the energy. We obtain the changes of the weights and thresholds from the partial derivatives of the energy function (see equation 2.2) with respect to the weights and thresholds at state B:

\Delta w_{ij} = p \frac{\partial e}{\partial w_{ij}} = -p\, y_i y_j \quad (\text{for } i, j = 1, \ldots, N),   (3.1)

\Delta h_i = q \frac{\partial e}{\partial h_i} = -q\, y_i \quad (\text{for } i = 1, \ldots, N),   (3.2)
where p and q are small positive constants and y_i, y_j correspond to the state of point B.

We now show that after the weights and thresholds are changed according to equations 3.1 and 3.2, point B lies on the slope of a valley. We define a vector Y = (y_1, y_2, ..., y_N), whose elements are the outputs of the neurons, to represent the state of a point on the energy terrain. The state at point B is then Y^B = (y_1^B, y_2^B, ..., y_N^B), and the state of any point P is Y^P = (y_1^P, y_2^P, ..., y_N^P). The change of energy at point P caused by the learning (see equations 3.1 and 3.2) becomes

\Delta e_P = \left[ -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (w_{ij} - p\, y_i^B y_j^B)\, y_i^P y_j^P - \sum_{i=1}^{N} (h_i - q\, y_i^B)\, y_i^P \right] - \left[ -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}\, y_i^P y_j^P - \sum_{i=1}^{N} h_i y_i^P \right]
         = \frac{p}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (y_i^B y_i^P)(y_j^B y_j^P) + q \sum_{i=1}^{N} (y_i^B y_i^P)
         = \frac{p}{2} \left[ \sum_{i=1}^{N} y_i^B y_i^P \right]^2 + q \sum_{i=1}^{N} (y_i^B y_i^P).   (3.3)

Because point B is a minimum of the energy function, the output of each neuron at point B can be considered to be either 0 or 1 (Hopfield, 1982). Thus, the increase in energy at point B itself becomes

\Delta e_B = \frac{p}{2} \left[ \sum_{i=1}^{N} y_i^B \right]^2 + q \sum_{i=1}^{N} y_i^B,   (3.4)

where y_i^B = 1 or 0 for i = 1, ..., N.
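The update of equations 3.1 and 3.2 is a rank-one change to the parameters; a minimal sketch (the diagonal is kept at zero, matching step 5a of the algorithm in section 4 and the requirement w_ii = 0):

```python
import numpy as np

def gradient_ascent_step(w, h, y, p=0.2, q=0.2):
    """Phase 2 (eqs. 3.1-3.2): raise the energy at the converged state y.

    Delta w_ij = -p y_i y_j and Delta h_i = -q y_i move the parameters
    along the positive gradient of the energy e at state y."""
    y = np.asarray(y, dtype=float)
    dw = -p * np.outer(y, y)
    np.fill_diagonal(dw, 0.0)   # keep w_ii = 0 (no self-connections)
    return w + dw, h - q * y
```

One can check numerically that the resulting energy increase at the converged state is at least as large as the increase at any other point of the terrain, which is inequality 3.6.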
Because the sigmoid function is used as the input-output function, all elements of each vector Y lie in [0, 1]. Thus we have

y_i^B \ge y_i^B y_i^P \quad \text{for } i = 1, \ldots, N \text{ and any point } P \text{ of the energy terrain}.   (3.5)

Therefore, we have

\Delta e_B \ge \Delta e_P \quad \text{for any point } P \text{ of the energy terrain},   (3.6)
where equality holds only when the vector difference between Y^P and Y^B is nonzero only in components for which Y^B is zero; the point (1, 1, ..., 1) is one such Y^P. Thus, because there is no point of the energy terrain at which the increase in energy due to the learning is larger than at point B, point B may now be on a slope of the energy terrain of the network. In general, the situation can be described by the two-dimensional picture of Figure 2b, where the previously stable state B becomes a point B′ on the slope of the valley. Thus, after updating the Hopfield network with the new weights and thresholds in the Hopfield network updating phase, point B′ goes down the slope of the valley and reaches a new stable state C (see Figure 2b). In this way, the Hopfield network updating (phase 1) and the gradient-ascent learning (phase 2) may move the network out of a local minimum and lead it to converge to the global minimum or a new local minimum (see Figures 2c and 2d).

4 Learning Algorithm for Maximum Clique

The following procedure describes the proposed algorithm. Note that there are two kinds of termination conditions for the learning. Some problems have a very clear condition—for example, the N-queen problem, in which the energy is zero if the solution is optimal. Others do not—for example, the traveling salesman problem and the maximum clique problem, in which the energy is not zero even when the solution is optimal. For the latter case, we have to set a maximum number of learning iterations (learn_limit) in advance, and learning stops once this number of iterations has been performed. With learn_limit as the termination condition, the procedure is:

1. Set learn_time = 0, and set learn_limit and the other parameters.
2. The initial value of y_i for i = 1, ..., N is randomized in the range 0.0 to 1.0.

3. The updating procedure is performed on the Hopfield network with the original weights and thresholds until the network converges to a stable state (phase 1).
4. Record the stable state, and set the best solution to this stable state.

5. Use equations 3.1 and 3.2 to compute the new weights and thresholds (phase 2). For i = 1, ..., N:

   a. Compute Δw_ij using equation 3.1 for j = 1, ..., N, j ≠ i.
   b. Update w_ij = w_ij + Δw_ij for j = 1, ..., N, j ≠ i.
   c. Compute Δh_i using equation 3.2.
   d. Update h_i = h_i + Δh_i.
6. The updating procedure is performed on the Hopfield network with the new weights and thresholds until the network reaches a stable state.

7. Because the weights and thresholds of the Hopfield network determine the energy terrain, changing them may alter the terrain and may also shift the position of the global minimum. To check whether the local minimum has been filled up, and to avoid a possible false solution caused by such a shift, the updating procedure may be rerun with the original weights and thresholds until the network reaches a new stable state. If this stable state differs from the recorded one, set the weights and thresholds back to the original values, set the network state to this stable state, and replace the recorded stable state with it. Otherwise, keep the weights, thresholds, and network state from the end of step 6.

8. If the solution obtained in step 7 is better than the recorded solution, replace the recorded solution with it.

9. Increment learn_time by 1. If learn_time = learn_limit, terminate the procedure; otherwise go to step 5.

5 Simulation Results

In order to assess the effectiveness of the proposed learning algorithm, extensive simulations were carried out over randomly generated graphs of various sizes on a digital computer (Pentium III, 733 MHz). The Hopfield network updating phase used the weight and threshold matrices defined in equations 2.3 and 2.4, and the gradient-ascent learning phase used the learning rules defined in equations 3.1 and 3.2. Simulations used the parameters A = 1.0 and B = 1.0. In the experiments, the learning parameters p and q were set to 0.2, and learn_limit was set to 30. The initial values of the neurons were randomized in the range 0.0 to 1.0.
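The two-phase loop of steps 1 to 9 above can be sketched as follows. This is a simplified sketch rather than the authors' implementation: the initial state is deterministic instead of randomized, step 7's consistency re-check with the original weights is omitted, and the helper names are ours.

```python
import numpy as np

def _update(w, h, T=1.4, dt=0.1, steps=500):
    # phase-1 dynamics (eqs. 2.5-2.6), started from the neutral state x = 0
    x = np.zeros(len(h))
    for _ in range(steps):
        y = 1.0 / (1.0 + np.exp(-x / T))
        x = np.clip(x + dt * (w @ y + h), -50, 50)
    return 1.0 / (1.0 + np.exp(-x / T))

def _is_clique(adj, S):
    return all(adj[u][v] for u in S for v in S if u != v)

def learn_max_clique(adj, A=1.0, B=1.0, p=0.2, q=0.2, learn_limit=30):
    """Two-phase loop of section 4 (step 7's re-check omitted for brevity)."""
    n = len(adj)
    d = np.asarray(adj, dtype=float)
    w0 = -2.0 * A * (1.0 - d) * (1.0 - np.eye(n))   # eq. 2.3
    h0 = B * np.ones(n)                             # eq. 2.4
    w, h = w0.copy(), h0.copy()
    y = _update(w, h)                               # phase 1 (steps 2-3)
    best = {i for i in range(n) if y[i] > 0.5}      # step 4
    for _ in range(learn_limit):                    # steps 5-9
        dw = -p * np.outer(y, y)
        np.fill_diagonal(dw, 0.0)
        w, h = w + dw, h - q * y                    # phase 2 (eqs. 3.1-3.2)
        y = _update(w, h)                           # step 6
        cand = {i for i in range(n) if y[i] > 0.5}
        if _is_clique(adj, cand) and len(cand) > len(best):
            best = cand                             # step 8
    return best
```

On small graphs the very first relaxation already tends to land on a clique; the learning loop only replaces it when a strictly larger clique is reached.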
The temperature parameter T in equation 2.6 was set to 1.4. RaCLIQUE (Yamada et al., 1993) and Funabiki's binary neural network (BNN; Funabiki & Nishikawa, 1996) were also executed for comparison. The reason for comparing the proposed learning algorithm to RaCLIQUE and BNN is that Funabiki and Nishikawa (1996) indicated that RaCLIQUE and BNN were the best energy-descent algorithms for solving the maximum clique problem.

The first type of graph we tested was p-random graphs (Jagota, 1995), where each vertex pair is connected by an edge with probability p, independently of the other edges. The sizes of this type of graph in our simulation were N = 100, 200, 300, 500, and the probabilities were p = 0.5, 0.8, 0.9. The results of the simulations, based on 10 runs on each of the considered graphs, are summarized in Table 1. The first two columns indicate the graph size (i.e., the number of nodes) and the edge probability, respectively. The other columns indicate the average computation time of a run and the average and best cliques generated by the proposed learning algorithm, BNN, and RaCLIQUE. From Table 1, we can see that the proposed learning algorithm provides the best solution quality among the three algorithms on every tested p-random graph, and that BNN is the fastest of the three, although the solution quality found by BNN is inferior to that found by RaCLIQUE.

Figure 3: Variation of the energy during learning (energy versus learning time).

In order to verify the analysis described in section 3, we recorded a typical learning run on the 100-vertex 0.5-random graph. Figure 3 shows the energy variation during learning, which clearly demonstrates how a local minimum was removed by learning, in agreement with our learning algorithm. Initially, the network converges to a time-independent state in which the size of the clique is 6, obviously not an optimum clique. After the first, third, and sixth learning iterations, the size of the clique increased to 7, 8, and 9, respectively.

The second type of graph we tested was the so-called k-random clique graph (Jagota, 1995). This kind of graph is generated by generating k cliques
Table 1: Simulation Results on p-Random Graphs.

                              Proposed Method           BNN                      RaCLIQUE
Nodes  Prob.  Edges       Time     Avg.   Best     Time     Avg.   Best     Time     Avg.   Best
100    0.5    2,475       0.75     9      9        0.23     9      9        0.87     8.4    9
100    0.8    3,960       5.63     19     19       0.4      19     19       0.68     18.8   19
100    0.9    4,455       4.27     30     30       0.51     29.5   30       0.59     30     30
200    0.5    9,950       58.75    11     11       9.65     9.3    10       6.21     10.4   11
200    0.8    15,920      46.93    21     21       10.01    19     21       10.09    21     21
200    0.9    17,910      30.34    39     39       12.18    33.5   35       9.41     38.4   39
300    0.5    22,425      45.5     11     11       45.63    9.7    10       26.6     10.6   11
300    0.8    35,880      122.68   28     28       39.15    24.6   25       28.34    26.7   27
300    0.9    40,365      81.35    46     46       56.45    44.3   45       26.71    45.3   46
500    0.5    62,375      75.45    12     12       95.31    9.2    10       123.03   10.8   11
500    0.8    99,800      236.21   32     32       115.01   28.5   29       127.64   30.6   31
500    0.9    112,275     214.11   56     56       101.76   49.1   50       126.19   55.3   56

Note: "Time" is the average computation time of a run in seconds; "Avg." and "Best" are the average and best clique sizes over 10 runs.
of varying sizes at random and taking their union. Such graphs have a wide range of clique sizes—much wider than in p-random graphs. This suggests that the maximum clique in such graphs is harder to approximate, so they might separate the poor algorithms from the good ones. In our simulation, we generated 20 such graphs to test the proposed learning algorithm, BNN, and RaCLIQUE. The results of the simulations are based on 10 runs on each of the considered graphs and are summarized in Table 2. From Table 2, we note that the proposed learning algorithm provided the best solution quality and BNN worked the fastest.

We also evaluated experimentally the performance of the proposed learning algorithm on the DIMACS benchmark graphs (Johnson & Trick, 1996). These graphs provide a valuable source for testing the performance of algorithms for the maximum clique problem, because they arise from different areas of application. For example, the Hamming graphs come from a coding theory problem (Sloane, 1989), and the Keller graphs are based on Keller's conjecture on tilings using hypercubes (Lagarias & Shor, 1992). We tested the proposed learning algorithm on Keller4, Hamming8-4, and Hamming10-4 (Johnson & Trick, 1996). The maximum clique sizes generated by the proposed learning algorithm on Keller4, Hamming8-4, and Hamming10-4 were 11, 16, and 40, respectively, which are the same as the best clique sizes found by the 15 heuristic algorithms presented at the second DIMACS Challenge (Johnson & Trick, 1996) and have been proven to be globally optimal. The simulation results on these graphs are given in Table 3.

In order to verify the efficiency of the proposed learning algorithm, we also ran the DIMACS benchmark program for comparison. The reasons for selecting this program are that, first, all 15 DIMACS heuristics were very efficient in the second DIMACS Challenge, although it is difficult to judge which was best, as noted in the second DIMACS Challenge report (Johnson & Trick, 1996).
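The k-random clique graph construction described above—the union of k cliques of random sizes—can be sketched as follows; the size bounds and seeding are our own choices, not taken from Jagota (1995).

```python
import random

def k_random_clique_graph(n, k, min_size=3, max_size=None, seed=0):
    """Union of k randomly placed cliques over the vertex set {0, ..., n-1}."""
    rng = random.Random(seed)
    max_size = max_size or n // 2
    adj = [[0] * n for _ in range(n)]
    for _ in range(k):
        size = rng.randint(min_size, max_size)       # random clique size
        members = rng.sample(range(n), size)         # random clique members
        for i in members:                            # connect every pair
            for j in members:
                if i != j:
                    adj[i][j] = 1
    return adj
```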
Second, the DIMACS benchmark program implements a branch-and-bound algorithm, which is an efficient enumeration scheme; the solution it finds is optimal if there is no time limit. We ran the DIMACS benchmark program on the same instances used to test the proposed learning algorithm. For the p-random graphs, the computation time of the DIMACS benchmark program became very long as the edge density increased, so we ran it only on some graphs of small edge density that could be solved in a reasonable time (24 hours). For the k-random graphs, the DIMACS program worked very well. In Table 3, we list the results of the proposed learning algorithm and the DIMACS benchmark program in terms of both solution quality and computation time. From Table 3, we can see that the solutions found by the proposed method are as good as those found by the DIMACS benchmark program on all test graphs. For small, sparser graphs, the proposed method worked more slowly than the DIMACS benchmark program, but for large or dense graphs, the proposed method found the maximum clique much more quickly than the DIMACS benchmark program
Table 2: Simulation Results on k-Random Clique Graphs (k = 20).

                              Proposed Method           BNN                      RaCLIQUE
Nodes  Prob.  Edges       Time     Avg.    Best    Time     Avg.    Best    Time     Avg.    Best
200    0.82   16,291      4.34     94      94      3.34     93      94      6.26     85.7    94
200    0.87   17,453      14.89    102     102     4.36     99      101     7.02     91      102
200    0.78   15,578      9.03     96      96      14.55    87.5    94      7        93      94
200    0.71   14,114      11.39    86      86      5.07     77.5    83      6.64     80      80
200    0.85   16,927      8.84     95      95      11.81    94.9    95      6.96     90.5    93
200    0.9    17,973      18.31    111     111     6.33     111     111     6.82     108.3   111
200    0.69   13,822      25.65    90      90      10.23    85.3    88      6.87     80      88
200    0.81   16,100      10.88    86      86      13.86    83.7    84      6.75     80.9    84
200    0.73   14,704      23.04    92      92      9.56     83      83      6.81     80.1    83
200    0.79   15,728      18.11    95      95      8.07     94.5    95      7.54     90.2    94
400    0.81   65,030      31.22    179     179     16.11    173     175     70.87    160.7   177
400    0.9    72,140      41.31    192     192     7.45     189.9   192     76.95    187.5   188
400    0.86   68,327      33.41    179     179     14.01    176.1   177     78.12    175.8   176
400    0.88   70,425      43.58    194     194     19.37    191.5   192     73.5     188.9   190
400    0.85   67,527      44.95    191     191     9.87     173.3   175     70.08    171.2   181
400    0.92   73,417      81.91    198     198     19.03    195.7   197     78.03    194.1   196
400    0.82   65,518      53.72    199     199     15.11    197.3   199     77.15    192.4   197
400    0.89   71,376      45.05    189     189     13.02    187.7   188     72.45    183.1   185
400    0.79   62,997      70.63    188     188     14.31    173.6   187     71.33    177.1   187
400    0.78   62,481      37.65    180     180     16.02    180     180     71.05    164.4   180

Note: "Time" is the average computation time of a run in seconds; "Avg." and "Best" are the average and best clique sizes over 10 runs.
Table 3: Comparison Between the Proposed Algorithm and the DIMACS Benchmark Program.

                                     Proposed Method        DIMACS
Graph Type        Nodes   Prob.     Time      Clique      Time      Clique
p-random graph    100     0.5       0.75      9           0.45      9
                  100     0.8       5.63      19          6.4       19
                  100     0.9       4.27      30          34.86     30
                  200     0.5       58.75     11          1.97      11
                  200     0.8       46.93     21          1897.11   21
                  300     0.5       45.5      11          6.29      11
                  500     0.5       75.45     12          96.18     12
k-random graph    200     0.82      4.34      94          31.89     94
                  200     0.87      14.89     102         55.1      102
                  200     0.78      9.03      96          35.22     96
                  200     0.71      11.39     86          20.44     86
                  200     0.85      8.84      95          54.06     95
                  200     0.9       18.31     111         68.01     111
                  200     0.69      25.65     90          15.33     90
                  200     0.81      10.88     86          24.67     86
                  200     0.73      23.04     92          19.75     92
                  200     0.79      18.11     95          28.01     95
                  400     0.81      31.22     179         45.35     179
                  400     0.9       41.31     192         93.4      192
                  400     0.86      33.41     179         40.16     179
                  400     0.88      43.58     194         50.35     194
                  400     0.85      44.95     191         54.23     191
                  400     0.92      81.91     198         115.44    198
                  400     0.82      53.72     199         61.37     199
                  400     0.89      45.05     189         98.05     189
                  400     0.79      70.63     188         81.75     188
                  400     0.78      37.65     180         82.33     180
Keller4           171     0.649     0.31      11          5.44      11
Hamming8-4        256     0.639     5.72      16          53.13     16
Hamming10-4       1024    0.828     1253.18   40          –         –
Note: The unit of computation time is seconds.
did. Thus, the solutions found by the proposed method are satisfactory, and the computation time is reasonable.

6 Conclusions

We have proposed a learning algorithm of the Hopfield neural network for efficiently solving the maximum clique problem and have shown its effectiveness by simulation experiments. The learning algorithm has two phases: the Hopfield network updating phase and the gradient-ascent learning phase. In the first phase, we implemented the Hopfield network to optimize the energy function in the state domain. In the second phase, we intentionally
increased the energy of the Hopfield network by modifying its parameters in the weight domain in a gradient-ascent direction, thus making the network escape from the local minimum. The proposed learning algorithm was tested on two types of random graphs and on DIMACS benchmark graphs. The simulation results showed that the proposed learning algorithm can find good solutions on these test graphs.

Acknowledgments

This work was supported in part by a grant-in-aid for Science Research of the Ministry of Education, Science and Culture of Japan under Grant (C)(2)12680392.

References

Balas, E., & Yu, C. S. (1986). Finding a maximum clique in an arbitrary graph. SIAM J. Comput., 15, 1054–1068.

Carraghan, R., & Pardalos, P. M. (1990). An exact algorithm for the maximum clique problem. Oper. Res. Lett., 9, 375–382.

Feige, U., Goldwasser, S., Safra, S., Lovasz, L., & Szegedy, M. (1991). Approximating clique is almost NP-complete. In Proc. 32nd Annual IEEE Symposium on the Foundations of Computer Science (pp. 2–12). Los Angeles: IEEE Computer Society.

Funabiki, N., & Nishikawa, S. (1996). Comparisons of energy-descent optimization algorithms for maximum clique problems. Trans. IEICE, E79-A, 452–460.

Hastad, J. (1996). Clique is hard to approximate within n^{1−ε}. In Proc. 37th Annual IEEE Symposium on the Foundations of Computer Science (pp. 627–636). Los Angeles: IEEE Computer Society.

Hopfield, J. J. (1982). Neurons with graded response have collective computation properties like those of two-state neurons. Proc. Nat. Acad. Sci. U.S.A., 81, 3088–3092.

Hopfield, J. J., & Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biol. Cybern., 52, 141–152.

Jagota, A. (1995). Approximating maximum clique with a Hopfield network. IEEE Trans. Neural Networks, 6, 724–735.

Johnson, D. S. (1974). Approximation algorithms for combinatorial problems. J. Comput. Syst. Sci., 9, 256–278.

Johnson, D., & Trick, M. (Eds.). (1996). Cliques, coloring, and satisfiability.
New York: American Mathematical Society.

Karp, R. M. (1972). Reducibility among combinatorial problems. In R. E. Miller & J. W. Thatcher (Eds.), Complexity of computer computations (pp. 85–103). New York: Plenum Press.

Lagarias, J. C., & Shor, P. W. (1992). Keller's cube-tiling conjecture is false in high dimensions. Bulletin AMS, 27, 279–283.
Lecky, J. E., Murphy, O. J., & Absher, R. G. (1989). Graph theoretic algorithms for the PLA folding problem. IEEE Trans. Computer-Aided Design, 8, 1014–1021.

Marchiori, E. (1998). A simple heuristic based genetic algorithm for the maximum clique problem. In Proc. ACM Symp. Appl. Comput. (pp. 366–373). New York: ACM.

Pardalos, P. M., & Phillips, A. T. (1990). A global optimization approach for solving the maximum clique problem. Int. J. Comp. Math., 33, 209–216.

Pardalos, P. M., & Rodgers, G. P. (1992). A branch and bound algorithm for the maximum clique problem. Comp. Oper. Res., 19, 363–375.

Pardalos, P. M., & Xue, J. (1994). The maximum clique problem. Journal of Global Optimization, 4, 301–328.

Sloane, N. J. A. (1989). Unsolved problems in graph theory arising from the study of codes. Graph Theory Notes of New York, 18, 11–20.

Soule, T., & Foster, J. A. (1995). Using genetic algorithms to find maximum cliques (Tech. Rep. No. LAL 95-12). Moscow, ID: Department of Computer Science, University of Idaho.

Tomita, E., & Fujii, T. (1985). Efficient algorithms for finding a maximum clique and their experimental evaluation. Trans. IEICE, J68-D, 221–228.

Yamada, Y., Tomita, E., & Takahashi, H. (1993). A randomized algorithm for finding a near-maximum clique and its experimental evaluations. Trans. IEICE, J76-D-I, 46–53.

Received January 29, 2002; accepted January 6, 2003.
Communicated by Peter Dayan
LETTER
The Effect of Noise on a Class of Energy-Based Learning Rules A. Bazzani [email protected]
D. Remondini [email protected] Department of Physics and INFN, University of Bologna, 40126, Bologna, Italy
N. Intrator
[email protected]
Institute for Brain and Neural Systems, Brown University, Providence, RI 02912, U.S.A.
G.C. Castellani [email protected] Department of Physics and DIMORFIPA, University of Bologna, 40126, Bologna, Italy
We study the selectivity properties of neurons based on BCM and kurtosis energy functions in a general case of noisy, high-dimensional input space. The proposed approach, which is used for characterization of the stable states, can be generalized to a whole class of energy functions. We characterize the critical noise levels beyond which the selectivity is destroyed. We also perform a quantitative analysis of such transitions, which shows an interesting dependence on data set size. We observe that the robustness to noise of the BCM neuron (Bienenstock, Cooper, & Munro, 1982; Intrator & Cooper, 1992) increases as a function of dimensionality. We explicitly compute the separability limit of the BCM and kurtosis learning rules in the case of a bimodal input distribution. Numerical simulations show a stronger robustness of the BCM rule for practical data set sizes when compared with kurtosis.

1 Introduction

We describe a method for the study of a whole class of energy functions defined by

E = \frac{\langle c^p \rangle}{p} - k \frac{\langle c^2 \rangle^{q}}{q}, \qquad 2 < p \le 2q,   (1.1)
[Neural Computation 15, 1621–1640 (2003). © 2003 Massachusetts Institute of Technology.]

where k is a constant and the ⟨·⟩ operator denotes the average with respect to the input distribution. Learning rules belonging to this class seek directions
in input space with "interesting" polynomial moments (p > 2), useful for the search of nongaussian projections (Friedman, 1987). The evolution of the synaptic weights is given by

\dot{\vec m} = c \left( c^{p-2} - k \langle c^2 \rangle^{q-1} \right) \vec d,   (1.2)
where c is the neuron output in the linear approximation c = \vec m \cdot \vec d, the \vec d are the input vectors presented to the neuron, and \vec m is the weight vector. If we average over the distribution of input vectors, the learning equations modify the synaptic weights in order to minimize (or maximize) the energy function:

\dot m_i = \frac{\partial E}{\partial m_i}.   (1.3)
We characterize the critical points of equation 1.3, averaged over the input distribution. We prove that the stable states of such equations are characterized only by the properties of the first term ⟨c^p⟩ in the energy—that is, the pth-order statistical moment of the output c. The second term, related to the second-order moment ⟨c²⟩, is a regularization term that prevents the solutions from diverging. A different choice of the exponent q for this term results in a different rescaling of the norm of the stable states, without affecting their direction. Our approach reduces the problem of energy function minimization to the computation of the critical points of a homogeneous polynomial function associated with the pth-order statistical moment, constrained on the unit sphere. We consider explicitly two relevant cases:

• p = 3, q = 2, k = 3/4, corresponding to the BCM learning rule (Bienenstock et al., 1982; Intrator & Cooper, 1992);

• p = 4, q = 2, k = 3, corresponding to a kurtosis-like learning rule, which reduces to the classical kurtosis when ⟨c⟩ = 0 (Blais, Intrator, Shouval, & Cooper, 1998).

The first case is a neuron model with interesting biological features. It supports a bidirectional mechanism of synaptic plasticity and has been extensively compared with experimental observations (Dudek & Bear, 1992; Kirkwood & Bear, 1994; Bear & Malenka, 1994; Kirkwood, Lee, & Bear, 1995) and with other models (Shouval, Intrator, Law, & Cooper, 1996; Shouval, Intrator, & Cooper, 1997; Blais et al., 1998). Moreover, the energy function formalism for this model (Intrator & Cooper, 1992) provides a complex statistical analysis of the data, performing exploratory projection pursuit that seeks multimodality in data distributions (Friedman, 1987). We extend an earlier analysis (Intrator & Cooper, 1992; Castellani, Intrator, Shouval, & Cooper, 1999) to a more general case of input distributions.
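To make the learning rule of equation 1.2 concrete, here is a minimal stochastic simulation sketch. The exponential running average used to track ⟨c²⟩ is a modeling choice on our part (the paper works with averages over the input distribution), as are the rate parameters; with p = 3, q = 2, k = 3/4 it implements the BCM case, and with p = 4, q = 2, k = 3 the kurtosis-like case.

```python
import numpy as np

def train_energy_rule(inputs, p=3, q=2, k=0.75, eta=0.01, alpha=0.1,
                      epochs=100, m0=None):
    """Stochastic simulation of eq. 1.2:
        m_dot = c (c^{p-2} - k <c^2>^{q-1}) d,   c = m . d,
    with <c^2> tracked by an exponential running average (our choice)."""
    inputs = np.asarray(inputs, dtype=float)
    m = np.zeros(inputs.shape[1]) if m0 is None else np.array(m0, dtype=float)
    c2 = 0.0
    for _ in range(epochs):
        for d in inputs:
            c = m @ d
            c2 = (1.0 - alpha) * c2 + alpha * c * c        # running <c^2>
            m = m + eta * c * (c ** (p - 2) - k * c2 ** (q - 1)) * d
    return m
```

In this sketch, for a constant input along a single direction the BCM parameters drive the output c = m·d toward the fixed point solving c^{p−2} = k⟨c²⟩^{q−1}, that is, c = 1/k = 4/3.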
In particular, we describe the conditions on the noise under which a noisy environment can be considered close to the noiseless one, in terms of stable-state properties, and how the dimensionality of the problem affects the neuron's behavior.

Kurtosis is commonly used in statistical analysis to point out nongaussian features of data distributions in the form of large tails. We compare it with the BCM model in order to analyze the effect of higher-order moments. The case of an input distribution defined by n linearly independent orthogonal vectors, perturbed by gaussian noise, is studied explicitly. The dimensionality n of the problem is varied, keeping the signal-to-noise ratio (SNR) constant. In the absence of noise, we find the same directions of the stable states for both energy functions. Gaussian noise perturbation produces different behaviors, leading to a phase transition for the BCM model that depends on a critical value of the SNR and on the dimension n. In the kurtosis case, the effect of gaussian noise cancels out, and selectivity is destroyed only asymptotically; but simulations show that for general noise patterns (e.g., uniform), selectivity is lost even by kurtosis. We explicitly compare the separability property of BCM and kurtosis on an example of a bimodal distribution of two-dimensional vectors. For gaussian noise, the BCM learning rule is more sensitive in separating two close clusters, since the kurtosis-like learning rule is strongly affected by tail fluctuations of the input distribution. This is not surprising, because kurtosis is a fourth-order polynomial while BCM is a third-order one. Theoretical predictions are recovered only when considering very high statistics (N ≥ 10^6 input signals).

2 Outline of the Method
Ep E 2 /q m .C m ¡k ; 2q p
(2.1)
where T is a p-linear form (namely, it is a p-rank tensor Ti1 ¢¢¢ip D hdi1 ; : : : ; dip i), and C is the matrix of the second-order moments of the input distribution, E ¤ of Cij D hdi ¢ dj i. In appendix A, we prove that to each critical point m E ¤ of reduced energy, equation 2.1, it corresponds to a critical point w E0 D T
Ep E2 w Cw ¡k ; 2 p
(2.2)
according to the scaling law ED w
E m ; kmka
(2.3)
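As an illustrative aside (not part of the original text), the correspondence stated by equations 2.1 to 2.3 can be checked numerically for the BCM case p = 3, q = 2, where a = (2q − 2)/(p − 2) = 2. The toy environment below — two orthonormal inputs shown with equal probability — and the use of numpy are my own assumptions:

```python
import numpy as np

# Toy environment: two orthonormal inputs, each shown with probability 1/2
# (illustrative choice; p = 3, q = 2 reproduces the BCM energy of section 3).
inputs = np.eye(2)
probs = np.array([0.5, 0.5])

C = np.einsum('k,ki,kj->ij', probs, inputs, inputs)              # C_ij = <d_i d_j>
T = np.einsum('k,ki,kj,kl->ijl', probs, inputs, inputs, inputs)  # T_ijk = <d_i d_j d_k>

def c_norm2(m):
    return m @ C @ m                      # squared norm associated with the C matrix

# m* = (2, 0) is a selective critical point of the BCM energy:
m = np.array([2.0, 0.0])
residual = np.einsum('ijk,j,k->i', T, m, m) - C @ m * c_norm2(m)

# Scaling law (eq. 2.3) with a = (2q - 2)/(p - 2) = 2:
a = (2 * 2 - 2) / (3 - 2)
w = m / c_norm2(m) ** (a / 2)             # w = m / ||m||^a
reduced = np.einsum('ijk,j,k->i', T, w, w) - C @ w

print(residual, reduced)                  # both ~ (0, 0)
```

Both residuals vanish: the critical point of the full energy maps onto a critical point of the reduced energy of equation 2.2, as claimed.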
1624
A. Bazzani, D. Remondini, N. Intrator, and G. Castellani
where the scaling exponent is a = (2q − 2)/(p − 2), and ‖·‖ is the norm associated with the C matrix. Then we show that the critical points of the reduced energy E_0 are related to the extremal points of a homogeneous polynomial function,

f(\vec{y}) = \frac{T \vec{y}^p}{p},    (2.4)

constrained on the unit sphere (‖y⃗‖² = 1). Moreover, we prove that the stable states of the initial energy (see equation 2.1) correspond to the maxima of f(y⃗). In a practical nondegenerate case, it is always possible to reduce the correlation matrix C to the identity matrix by means of the matrix S = \sqrt{C^{-1}}. The coefficients of T are modified according to the usual tensor transformation law. The local maxima y⃗* of equation 2.4 define the directions of the stable states for the learning rule associated with the energy E, since the scaling transformation affects only the vectors' lengths. In the case of the kurtosis-like energy (p = 2q), the scaling is singular, thus confirming the necessity of a normalizing procedure to obtain the stable states of the original equations.

3 BCM Energy Function

In the linear regime c = m⃗ · d⃗, the BCM neuron is defined by the evolution equations,

\dot{\vec{m}} = c(c - \theta)\vec{d}, \qquad \vec{m}, \vec{d} \in R^n,    (3.1)

where θ is a threshold that depends on the past activity of the neuron, and d⃗ = d⃗(t) is a stochastic process that simulates the interaction with the environment. When d⃗(t) is an uncorrelated process, θ can be defined as ⟨c²⟩ (Intrator & Cooper, 1992). Equation 3.1 is discretized according to

\vec{m}(t + \Delta t) = \vec{m}(t) + c(t)(c(t) - \theta)\vec{d}\,\Delta t,    (3.2)

and in the limit Δt → 0, one can prove that the synaptic weights satisfy the average equations (Khas'minskii, 1966; Intrator & Cooper, 1992),

\dot{m}_i = \langle c^2 d_i \rangle - \theta \langle c d_i \rangle.    (3.3)

Equation 3.3 can be written in a covariant form \dot{\vec{m}} = \partial E / \partial \vec{m} if one introduces the "energy" (Intrator & Cooper, 1992),

E_{BCM} = \frac{T \vec{m}^3}{3} - \frac{(C \vec{m}^2)^2}{4},    (3.4)

which has the form of the family of energies of equation 2.1.
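To make the averaged dynamics concrete, here is a minimal numerical sketch (my own, with illustrative values, assuming Python/numpy; it is not taken from the article's simulations): forward-Euler integration of equation 3.3 with θ = ⟨c²⟩ for two orthonormal inputs shown with equal probability. It converges to a selective fixed point, anticipating the selectivity analysis below and in appendix A.

```python
import numpy as np

# Forward-Euler integration of the averaged BCM equations (3.3), theta = <c^2>.
inputs = np.eye(2)                       # two orthonormal inputs (toy environment)

def mdot(m):
    c = inputs @ m                       # responses to the stored inputs
    theta = np.mean(c**2)                # sliding threshold theta = <c^2>
    return inputs.T @ (c**2 - theta * c) / len(c)   # <c^2 d> - theta <c d>

m = np.array([0.6, 0.4])                 # slightly asymmetric initial weights
dt = 0.01
for _ in range(20000):                   # integrate up to t = 200
    m = m + dt * mdot(m)

print(np.round(m, 3))                    # converges to the selective state (2, 0)
```

The neuron becomes selective to the input with the larger initial overlap; the symmetric state (1, 1) is a saddle and is not reached from generic initial conditions.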
The Effect of Noise on Learning Rules
1625
We apply the method described in the previous section to some explicit cases in order to point out the main properties and limits of the statistical analysis of the input space performed by a BCM neuron. Let us consider a BCM neuron that is stimulated by n linearly independent vectors v⃗_k ∈ R^n, k = 1, …, n. Forming a matrix V with these vectors as columns, the input signal is defined as

\vec{d} = V \vec{\xi},    (3.5)

where ξ⃗ is a random vector that takes values on the canonical basis ê_k of R^n with equal probability. The second-order matrix reads

C = \langle \xi_k \xi_h \vec{v}_k \vec{v}_h \rangle = \frac{1}{n} V V^T.    (3.6)

It is convenient to introduce the variables o_k = y⃗ · v⃗_k, so that the cubic defined by the third-order moments reads

f(o) = \frac{1}{3n} \sum_{k=1}^{n} o_k^3.    (3.7)

According to the results of the previous section, the existence of stable fixed points for the BCM average equation is related to the existence of local maxima of the cubic (see equation 3.7) constrained on the sphere

\sum_{k=1}^{n} o_k^2 = n.    (3.8)

In appendix A, we show that the cubic, equation 3.7, always has n local maxima on the unit sphere that are defined by the dual base of the vectors v⃗_k. In such a case, the BCM neuron has complete selectivity for the space of initial inputs (Castellani et al., 1999). Let us generalize the previous model by adding a noise η⃗ to the input signal according to

\vec{d} = V \vec{\xi} + \vec{\eta},    (3.9)

where η⃗ is a vector of independent gaussian variables with ⟨η⃗⟩ = 0 and ⟨η⃗²⟩ = (σ²/n) I (I is the n × n identity matrix). A straightforward calculation shows that the second-order matrix reads

C = \frac{1}{n}\left( V V^T + \sigma^2 I \right).    (3.10)

As a consequence, the SNR is independent of the dimension n. In addition, we assume that the unperturbed vectors v⃗_k are orthogonal. The matrix 3.10
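A quick Monte Carlo sanity check of equation 3.10 (a sketch under assumed illustrative parameters, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, N = 3, 0.5, 200000           # dimension, noise level, sample size (illustrative)
V = np.linalg.qr(rng.standard_normal((n, n)))[0]   # orthogonal input vectors as columns

# d = V xi + eta, with xi uniform over the canonical basis and <eta eta^T> = (sigma^2/n) I
xi = np.eye(n)[rng.integers(0, n, N)]
d = xi @ V.T + rng.standard_normal((N, n)) * sigma / np.sqrt(n)

C_sample = d.T @ d / N                               # empirical <d d^T>
C_theory = (V @ V.T + sigma**2 * np.eye(n)) / n      # eq. 3.10
err = np.abs(C_sample - C_theory).max()
print(err < 0.01)                                    # True, up to Monte Carlo error
```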
1626
A. Bazzani, D. Remondini, N. Intrator, and G. Castellani
has the same eigenvectors as the matrix 3.6, with eigenvalues (λ_k + σ²)/n. With the addition of noise, the cubic 3.7 reads

f(o) = \frac{1}{3n} \sum_k o_k \left( o_k^2 + \frac{3\sigma^2}{n} \sum_j \frac{o_j^2}{\lambda_j} \right),    (3.11)

constrained on the ellipsoid

\sum_k o_k^2 \left( 1 + \frac{\sigma^2}{\lambda_k} \right) = n.    (3.12)

For n ≫ 1, the stable solution selective to the ith input vector satisfies

o_i \simeq \sqrt{ \frac{n}{2(\sigma^2/\lambda_i + 1)} \left( 1 + \sqrt{ 1 - \frac{4\Sigma}{n^2}\left( \frac{\sigma^2}{\lambda_i} + 1 \right)^2 } \right) },    (3.13)

with Σ = (σ²/λ_i + 1) Σ_{k≠i} 1/(σ²/λ_k + 1). If σ²/λ_i is of order O(1), then Σ ≃ O(n) and o_i ≃ O(√n), whereas the response to the other inputs decreases as o_j = O(1/√n). With these assumptions, we get a different critical value of the noise σ_c for each eigenvalue λ_i, which grows like O(n^{1/4}):

\sigma_c(i) \simeq \sqrt{ \lambda_i \left( \frac{\sqrt{n}}{2} - 1 \right) }.    (3.14)

This means that for a fixed value of σ, we lose selectivity only for those inputs whose related eigenvalues do not satisfy σ < σ_c(i). If the noise grows continuously, we have a gradual loss of selectivity: the larger the eigenvalue λ_i, the bigger the robustness to noise of the related solution. If the eigenvalues λ_i are all equal (e.g., λ_i = 1 ∀i), we have an exact result for the growth of σ_c:

\sigma_c = \sqrt{ \frac{n}{2\sqrt{n - 1}} }.    (3.15)

This is due to the strong law of large numbers, which implies that the effect of the noise is averaged over all dimensions. Therefore, with a constant SNR, a BCM neuron in a high-dimensional space is more robust to noise perturbation. We remark that the selectivity result, equation 3.13, uses explicitly the orthogonality condition on the input vectors v⃗_k. When this condition is relaxed, we can prove only that the BCM neuron is selective to the orthogonal subspaces defined by the input vectors.
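The n^{1/4} growth of the critical noise level can be checked directly from equation 3.15 (a short sketch of my own, assuming Python; the particular values of n are arbitrary):

```python
import math

def sigma_c(n):
    # Critical noise level for identical eigenvalues (eq. 3.15)
    return math.sqrt(n / (2 * math.sqrt(n - 1)))

# sigma_c grows like n**(1/4): multiplying n by 10^4 multiplies sigma_c by ~10.
r = sigma_c(10**8) / sigma_c(10**4)
print(round(r, 3))
```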
4 Kurtosis-Like Energy Function

The kurtosis energy function is defined by

E_K = \frac{\langle c^4 \rangle}{4} - 3 \frac{\langle c^2 \rangle^2}{4}.    (4.1)

When the input d⃗ satisfies ⟨d⃗⟩ = 0, equation 4.1 reduces to the usual definition of kurtosis, since ⟨c⟩ = m⃗ · ⟨d⃗⟩ = 0. The learning rule corresponding to the averaged equations for the weight evolution derived from equation 4.1 is

\dot{m}_i = \langle c^3 d_i \rangle - 3\theta \langle c d_i \rangle.    (4.2)

These equations can lead the weights to increase indefinitely. Therefore, we consider a "renormalized" version of the kurtosis-like energy,

E_\eta = \langle c^4 \rangle - 3 \langle c^2 \rangle^{2+\eta}, \qquad \eta > 0,    (4.3)

and we show that the critical points of equation 4.3 define the same directions as those of equation 4.1, since they are independent of η. By applying the transformation

\vec{m}^* = \left( \frac{2\gamma}{3(2+\eta)} \right)^{1/(2\eta)} \vec{y}^*, \qquad \gamma = 4 f(y^*),    (4.4)

we relate the critical points m⃗* of the energy 4.3 to the critical points y⃗* of the polynomial function

f(y) = \frac{T \vec{y}^4}{4},    (4.5)

constrained on the unit sphere, where the coefficients T are the fourth-order moments of the input distribution. We remark that in the limit η → 0, the scaling (see equation 4.4) is singular, but the directions of m⃗* and y⃗* are the same and do not depend on η. The renormalization procedure is equivalent to a weight normalization, like Oja's (Oja, 1982), in the learning equations. We analyze explicitly the same case considered in section 3, where the inputs are defined as d⃗ = Vξ⃗ + η⃗. Let us introduce o_k = y⃗ · v⃗_k; the function 4.5 then reads

f(o) = \frac{1}{4n}\left[ \sum_k o_k^4 + \frac{6\sigma^2}{n}\left( \sum_i o_i^2 \right)\left( \sum_j \frac{o_j^2}{\lambda_j} \right) + \frac{3\sigma^4}{n}\left( \sum_j \frac{o_j^2}{\lambda_j} \right)^2 \right].    (4.6)
Taking into account the constraint 3.12, the function 4.6 reduces to

f(o) = \frac{1}{4n}\left[ \sum_k o_k^4 - \frac{3}{n}\left( \sum_k o_k^2 \right)^2 + 3n \right],    (4.7)

and the noise variance completely disappears from the equation. The selective solution for the ith input is

o_j = 0, \quad \forall j \neq i; \qquad o_i = \sqrt{ \frac{n}{1 + \sigma^2/\lambda_i} }.    (4.8)
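The cancellation of the noise variance between equations 4.6 and 4.7 can be verified numerically on random points of the constraint ellipsoid (an illustrative sketch of my own, assuming numpy; the eigenvalues λ_k are arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
lam = rng.uniform(0.5, 2.0, n)           # eigenvalues lambda_k (illustrative values)

def f_full(o, sigma):
    # eq. 4.6, with explicit sigma dependence
    B = np.sum(o**2 / lam)
    return (np.sum(o**4) + 6*sigma**2/n * np.sum(o**2) * B
            + 3*sigma**4/n * B**2) / (4*n)

def f_reduced(o):
    # eq. 4.7: no sigma dependence left
    return (np.sum(o**4) - 3/n * np.sum(o**2)**2 + 3*n) / (4*n)

checks = []
for sigma in (0.3, 0.7, 2.0):
    o = rng.standard_normal(n)
    # rescale onto the constraint ellipsoid sum o_k^2 (1 + sigma^2/lam_k) = n (eq. 3.12)
    o *= np.sqrt(n / np.sum(o**2 * (1 + sigma**2 / lam)))
    checks.append(np.isclose(f_full(o, sigma), f_reduced(o)))
print(all(checks))   # True: the noise variance cancels on the ellipsoid
```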
We do not have a critical value for the variance, except for σ → ∞. In this case, the effect of the noise decreases ∝ √n as a function of the dimension. This result is strictly connected to the gaussian properties of the noise. We remark that in the limit σ → ∞ for fixed n, the constraint 3.12 implies o_k → 0 ∀k, and selectivity is lost. But if one normalizes the lengths of the stable solutions of equation 4.2, the singular behavior of the outputs o_k in the limit σ → ∞ disappears, and selectivity is preserved.

5 Separability Limits for Two Gaussian Distributions

We consider the problem of a neuron receiving two gaussian distributions with the same variance σ but centered at different points in the plane, in the case of both the BCM and kurtosis energy functions. We can represent the input distribution as two inputs with zero-mean gaussian noise added,

\vec{d} = p \vec{v}_1 + (1 - p)\vec{v}_2 + \vec{\eta},    (5.1)

where p is a random variable that takes the values (0, 1) with equal probability, and η⃗ is a gaussian noise of variance σ². In order to simplify our calculations, we set v⃗_1 = (l, 0) and v⃗_2 = (l cos α, l sin α), so that they have the same norm. We characterize the stable states, showing their dependence on the angle α, the vector length l, and the noise variance σ. According to our approach, we look for the maxima of the functions 2.4:

f_{BCM}(y) = \frac{\beta \vec{y}^3}{3}, \qquad \beta_{ijk} = \langle d'_i d'_j d'_k \rangle,    (5.2)

f_K(y) = \frac{\alpha \vec{y}^4}{4}, \qquad \alpha_{ijkl} = \langle d'_i d'_j d'_k d'_l \rangle,    (5.3)

where d⃗′ is the input distribution after the application of the matrix T that reduces the correlation matrix to the identity (see appendix B). The directions of
the stable solutions are defined by the angles α/2 ± χ, where (see appendix B)

\chi_{BCM} = \arctan \sqrt{ \frac{l^2 \cos^2(\alpha/2) - \sigma^2}{l^2 \sin^2(\alpha/2) + \sigma^2} }, \qquad
\chi_K = \arctan\left( \sqrt{ \frac{\lambda_-}{\lambda_+} } \cot^2 \frac{\alpha}{2} \right),    (5.4)

λ_± = σ² + l²(1 ± cos α)/2 being the eigenvalues of the second-moment matrix C. It is easily checked that both solutions, in the limit σ → 0, recover the dual directions with respect to the input vectors. The distance Δ between the two peaks of the input distribution, projected along the direction of the equilibrium weight vectors, measures the separability property in the BCM and kurtosis cases. According to the results of appendix B, we have

\Delta = 2 l \sin \chi \sin \frac{\alpha}{2}.    (5.5)

From this equation, we get that separability is maximal when χ ≃ π/2. The main difference is that in the BCM case, there are critical values of the SNR that destroy selectivity: when σ ≥ |l cos(α/2)|, the stable solutions collapse and move into the complex space. In the kurtosis case, selectivity is lost only in the limit σ → ∞, due to the fact that the scaling by √Λ becomes singular and the original weights vanish (see appendix B). However, the angle χ_K remains defined, and the directions of the stable solutions still separate the input distribution. Computing explicitly the behavior of χ in the limit lα/σ → 0 (small angle with nonvanishing noise), we get

\chi_{BCM} = \arctan\left[ \frac{\sqrt{l^2 - \sigma^2}}{\sigma} - \frac{l^2}{2\sigma\sqrt{l^2 - \sigma^2}} \left( \frac{l\alpha}{2\sigma} \right)^2 \right] + O\left( \left( \frac{l\alpha}{\sigma} \right)^3 \right),    (5.6)

\chi_K = \frac{\pi}{2} - \left( \frac{\alpha}{2} \right)^2 \sqrt{ \frac{\sigma^2 + l^2}{\sigma^2} } + O(\alpha^3).    (5.7)

In the kurtosis case, the angle χ is changed by a quantity of order O(α²) from the optimal value π/2, whereas in the BCM case, the presence of a critical value for the ratio σ/l reduces the separability, since the difference from the optimal value is of order O(1). As a consequence, if we average over the noise distribution, the kurtosis learning rule is able to completely filter the effect of a gaussian noise on the input data.
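The following sketch (my own illustration, assuming Python; the parameters mirror the α = 90° case of the numerical experiments) evaluates the angle formulas of equation 5.4 and the distance of equation 5.5: Δ_K is constant in σ for this angle, while Δ_BCM shrinks and the BCM solution disappears past σ_c = |l cos(α/2)|.

```python
import math

l, alpha = 1.0, math.radians(90)       # illustrative parameters

def chi_bcm(sigma):                    # first of eqs. 5.4
    num = l**2 * math.cos(alpha/2)**2 - sigma**2
    den = l**2 * math.sin(alpha/2)**2 + sigma**2
    return math.atan(math.sqrt(num/den)) if num > 0 else None

def chi_k(sigma):                      # second of eqs. 5.4
    lp = sigma**2 + l**2 * (1 + math.cos(alpha)) / 2
    lm = sigma**2 + l**2 * (1 - math.cos(alpha)) / 2
    return math.atan(math.sqrt(lm/lp) / math.tan(alpha/2)**2)

def delta(chi):                        # eq. 5.5
    return 2 * l * math.sin(chi) * math.sin(alpha/2)

for s in (0.1, 0.4, 0.7):
    print(s, round(delta(chi_bcm(s)), 3), round(delta(chi_k(s)), 3))
print(chi_bcm(0.71))                   # None: BCM solutions collapse past sigma_c
```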
6 Numerical Results

In the previous sections, we studied the behavior of equations 3.3 and 4.2 under the assumption that we can average over the whole gaussian distribution of the input signals. However, for applications, it is important to understand the robustness of the theoretical results in the presence of finite statistics. This is also relevant to studying the effect of distribution tails on the neural dynamics. We have considered the separability property in the cases p = 3, q = 2 (BCM neuron) and p = 4, q = 3 ("renormalized" kurtosis-like energy). The distribution of the input vectors is defined according to equation 5.1 with l = 1, but we have used two different finite statistics for the input distribution (respectively, N = 100 and N = 1000 input vectors) and two separation angles α = 10°, 90°. The average equations are integrated with a time step Δt = 0.01 up to a maximal final time t_f = 100, which is sufficiently long to reach an equilibrium state; the third and fourth moments of the input distribution are numerically computed. To study the effect of finite statistics, we have considered 1000 samples for all cases, and we have measured the average distance ε between the projections m⃗ · v⃗_j, j = 1, 2, corresponding to the peaks of the input distribution along the equilibrium direction m⃗, normalized by its standard deviation σ_ε:

\varepsilon = \frac{ |\langle \vec{m} \cdot \vec{v}_1 \rangle - \langle \vec{m} \cdot \vec{v}_2 \rangle| }{ \sigma_\varepsilon },    (6.1)

where

\sigma_\varepsilon = \sqrt{ \langle (\vec{m} \cdot \vec{v}_1 - \langle \vec{m} \cdot \vec{v}_1 \rangle)^2 \rangle + \langle (\vec{m} \cdot \vec{v}_2 - \langle \vec{m} \cdot \vec{v}_2 \rangle)^2 \rangle },    (6.2)

and ⟨·⟩ denotes the average over the sample distribution. We recall that if the value of ε is bigger than 1, the neuron has a good probability of recognizing the presence of the signals v⃗_1, v⃗_2 in the input distribution. In Figure 1, we report the numerical results for the calculation of ε as a function of the noise level (σ ∈ [0, 1.5]). We observe that for α = 10° (left plots), the BCM neuron has better performance than the kurtosis-like learning rule for both N = 100 and N = 1000. In the BCM case, we detect the existence of a critical value for the noise level, σ_c ≃ 1, according to equation 5.4, but the numerical results suggest that the derivative of ε with respect to α when α ≪ 1 is bigger in the BCM case than in the kurtosis-like case. This is probably due to the sensitivity of the fourth-order moments to fluctuations in the input distribution tails. In the case α = 90° (right plots), we observe that the two curves have similar behavior for a low level of noise (σ < 0.5), but the kurtosis-like energy has better performance at bigger values of σ in the case N = 1000. This is due to the existence of a critical value σ_c = 1/√2 for the BCM neuron, at which the stable solutions bifurcate into the complex plane.
Figure 1: Numerical results for the average distance ε between the projections m⃗ · v⃗_j, j = 1, 2, on the stable directions m⃗, normalized by its standard deviation (see equation 6.1), in the case of the BCM energy (filled circles) and the kurtosis-like case (empty circles), as a function of the noise level σ. One thousand samples were considered. (A, C) α = 10°. (B, D) α = 90°. In panels A, B (resp. C, D), each sample used in the numerical simulation contains N = 100 (resp. N = 1000) input vectors.
However, a completely "nonselective" solution, symmetric with respect to the input clusters, is very unlikely, due to the finite statistics available in real situations. In practice, a partial separation of the clusters is always obtained. We remark that the result 5.4 for the separability property in the kurtosis-like case is also related to the gaussian nature of the noise. If we relax this hypothesis, a critical value also exists in the kurtosis case. This is shown in Figure 2, where we compare the behavior of the BCM and kurtosis-like learning rules when we consider a uniform and a gaussian input distribution. We have computed the distance Δ of the projections m⃗ · v⃗_j, j = 1, 2, on the stable solutions m⃗ as a function of the noise level, for a uniform noise distribution (left plots) and for a gaussian noise distribution (right plots). The separation angle is α = 90°, and we have used a sample of N = 10^6 input vectors. In
Figure 2: Numerical results for the distance Δ of the projections m⃗ · v⃗_j, j = 1, 2 (see equation 5.5), in the case of the BCM (continuous curve) and kurtosis-like energy (dashed curve), as a function of the noise level σ. The separation angle is α = 90°. A sample of N = 10^6 input vectors was used for both a uniform noise distribution (A) and a gaussian noise distribution (B). (C, D) The same situation with N = 10^3 samples.
the gaussian case, the curves reproduce the theoretical results 5.4 and 5.5: a constant value of Δ in the kurtosis-like case and a decreasing behavior in the BCM case, up to the critical value σ_c = 1/√2 where the stable solutions bifurcate. In the case of uniform noise, the BCM neuron follows the same curve as for the gaussian noise, whereas the kurtosis-like neuron shows a sudden loss of the separability property at σ ≃ 0.8, which is the result of a bifurcation of the stable solutions into the complex plane.
7 Conclusions

In this article, we introduce a method for the study of the stable states of a class of energy functions defined by high-order polynomial moments. Our method allows us to analyze the response of neurons, whose weight modification is governed by the minimization of these energies, to noisy input distributions. In particular, we study the combined effect of noise and high dimensionality of the input environment. We have considered in detail the cases of the BCM and kurtosis-like energy functions. We prove that a system with n linearly independent gaussian clusters presents analogies with the simpler case of n linearly independent input vectors when the noise level is below a critical value. Increasing the dimensionality n, the critical noise level scales as n^{1/4} in the BCM case, and this implies that the robustness of this learning rule improves for high dimensionality at a fixed SNR. The solutions for the kurtosis-like energy are identical to those of the BCM case in a noiseless input environment. The presence of gaussian noise does not introduce phase transitions as for the BCM energy, and selectivity is lost only asymptotically. This does not hold true for a generic noise distribution. We have also considered the separability property of the BCM and kurtosis-like energies on a bimodal distribution, proving that the average kurtosis-like energy gives an optimal result in the case of gaussian noise. To check the relevance of the theoretical results for applications, we have performed numerical simulations that point out the effect of finite input data statistics. The numerical results show that the BCM learning rule has a better separability property than the kurtosis-like energy for close clusters. This implies a better sensitivity of the BCM neuron to changes in cluster separation. We also checked the effect of nongaussian noise, proving the existence of a critical value for the noise level in the case of the kurtosis-like learning rule. Increasing robustness to noisy perturbations in high-dimensional spaces and independence from finite data sampling distributions suggest a capability of the BCM neuron to perform statistical analysis better than the kurtosis-like learning rule.

Appendix A

We give an explicit proof of the statements discussed in section 2. Without loss of generality, we assume that the second-order matrix C is in diagonal form.(1) To simplify the calculations, we explicitly consider only the BCM case. However, all the results can be easily generalized to the whole class of energy
(1) If the rank of C is less than n, one can always reduce the dimensionality of the system.
functions, equation 1.1. The equilibrium solutions of the averaged equations 3.3 satisfy

T_{ijk} m_j m_k - C_{ij} m_j \|\vec{m}\|^2 = 0, \qquad T_{ijk} = \langle d_i d_j d_k \rangle.    (A.1)

Since we are interested in the nontrivial solutions of the system A.1, it is convenient to use the following lemma:

Lemma 1. A vector m⃗* is a nontrivial solution of the system A.1 if and only if y⃗* = m⃗*/‖m⃗*‖², where y⃗* is a nontrivial solution of the system

T_{ijk} y_j y_k = C_{ij} y_j.    (A.2)

We observe that by definition ‖y⃗*‖ = 1/‖m⃗*‖, and that equation A.2 has a covariant form ∂E_0/∂y⃗ = 0 if we introduce the reduced energy

E_0 = \frac{T_{ijk} y_i y_j y_k}{3} - \frac{\|\vec{y}\|^2}{2}.    (A.3)

Then we prove proposition 1:

Proposition 1. Let us consider the homogeneous cubic function f(y⃗) = T_{ijk} y_i y_j y_k / 3 defined on the n-dimensional sphere ‖y⃗‖² = r². Then each stationary point y⃗* of f(y⃗) corresponds to a nontrivial equilibrium solution of the system A.1: m⃗* = 3(f(y⃗*)/r⁴) y⃗*.

Proof. Using the method of Lagrange multipliers, we introduce the new function

F(\vec{y}) = f(\vec{y}) - \frac{\gamma}{2}\left( \|\vec{y}\|^2 - r^2 \right),    (A.4)

where γ is a real parameter. Then the stationary points of f(y⃗) are given by the equations

\frac{\partial F}{\partial y_i} = T_{ijk} y_j y_k - \gamma C_{ij} y_j = 0, \qquad \frac{\partial F}{\partial \gamma} = -\frac{1}{2}\left( \|\vec{y}\|^2 - r^2 \right) = 0.    (A.5)

If y⃗* is a solution of equations A.5, then γr² = 3f(y⃗*), and m⃗* = (γ/r²) y⃗* turns out to be a solution of equation A.1. In particular, if we choose r = 1, the existence of fixed directions for the BCM equation A.1 is related to the stationary points of a homogeneous cubic function f(y⃗) on the unit sphere. In a nondegenerate situation, the system A.2 has 2^n − 1 nontrivial solutions, which corresponds to the maximum number of nontrivial equilibria of a BCM neuron with n weights. In this case, we observe that the cubic function
f(y⃗) has 2(2^n − 1) critical points on the sphere, due to the ambiguity in the choice of the sign of the stationary points. The stable solutions are the maxima of the energy function 3.4. To study the stability of the fixed points, we linearize the equations A.1 around an equilibrium solution m⃗* by introducing the variables δm⃗ = m⃗ − m⃗*. The leading terms read

\delta\dot{m}_i = 2 T_{ijk} m^*_j \delta m_k - C_{ij} \delta m_j \|\vec{m}^*\|^2 - 2 C_{ij} m^*_j (\vec{m}^* \cdot \delta\vec{m}) + T_{ijk} \delta m_j \delta m_k + O(\|\delta\vec{m}\|^3).    (A.6)

We study the stability of the fixed point m⃗* by considering the time derivative of Σ_k δm_k², which is a Lyapunov function for equation A.1. We distinguish the cases m⃗* = 0 and m⃗* ≠ 0. In the first case, equation A.1 reduces to

\delta\dot{m}_i = T_{ijk} \delta m_j \delta m_k,    (A.7)

so that the trivial solution is always unstable along the directions δm⃗ that satisfy f(δm⃗) > 0. In the second case, we introduce the vector y⃗* = m⃗*/‖m⃗*‖², and the stability depends on the eigenvalues of the symmetric matrix

S_{ij} = 2 T_{ijk} y^*_k - C_{ij} - 2 \|\vec{m}^*\|^2 C_{ik} y^*_k C_{jl} y^*_l.    (A.8)

The following lemma holds:

Lemma 2. The equilibrium solution m⃗* is stable if and only if the matrix S is negative definite in the linear space tangent to the sphere

\|\vec{y}\| = 1/\|\vec{m}^*\|    (A.9)

at the point y⃗*.

Proof. Using equation A.2, we get by a direct calculation S y⃗* = −C y⃗*. We decompose any vector δm⃗ according to δm⃗ = δm⃗* + δv⃗, where δm⃗* is parallel to y⃗* and δv⃗ belongs to the tangent space of the sphere A.9 at the point y⃗*. The stability of the equilibrium solution m⃗* is a consequence of the inequality

\delta\vec{m}\, S\, \delta\vec{m} = -\|\delta\vec{m}^*\|^2 + 2\, \delta\vec{m}^* S\, \delta\vec{v} + \delta\vec{v}\, S\, \delta\vec{v} < 0 \qquad \forall\, \delta\vec{m}.    (A.10)

According to the definition A.8, and using equation A.1, it is not difficult to show that δm⃗* S δv⃗ = 0. Therefore, the inequality δv⃗ S δv⃗ < 0 ∀ δv⃗ is a necessary and sufficient condition for stability. As a consequence, we
have the following proposition:

Proposition 2. If the eigenvalues of the reduced symmetric matrix Ŝ_{ij} = 2T_{ijk} y*_k − C_{ij}, defined on the tangent space of the sphere A.9 at the point y⃗*, are all negative, then the corresponding solution m⃗* of equation A.1 is a stable equilibrium.

By a straightforward calculation, one can see that y⃗* Ŝ y⃗* = ‖y⃗*‖², so that the matrix Ŝ at the stable equilibrium solutions y⃗*, extended to the whole space R^n, has n − 1 negative eigenvalues and 1 positive eigenvalue. Then, for topological reasons (Milnor, 1958), the number of stable solutions among the 2^n − 1 equilibria is at most n, which corresponds to the case of maximal selectivity of the BCM neuron. By using the previous arguments, we have a corollary to proposition 1:

Corollary 1. The equilibrium solution y⃗* of the system A.2 defines a stable direction if and only if it is a local maximum of the cubic f(y⃗) = T y⃗³/3 on the unit sphere ‖y⃗‖ = 1.

We compute the local maxima of the cubic function

f(o) = \frac{1}{3n} \sum_{k=1}^{n} o_k^3    (A.11)

constrained on the sphere Σ_k o_k² = n. Let us suppose that o_n ≠ 0. Then, by differentiating equation A.11, we get the system

o_k^2 + o_n^2 \frac{\partial o_n}{\partial o_k} = 0, \quad k = 1, \ldots, n-1, \qquad \frac{\partial o_n}{\partial o_k} = -\frac{o_k}{o_n}.    (A.12)

Therefore, the critical points with o_n ≠ 0 are computed from the system

o_k (o_k - o_n) = 0, \qquad k = 1, \ldots, n-1.    (A.13)
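As a numerical cross-check (my own sketch, assuming numpy, with an arbitrary small dimension), one can verify that the selective solution of equation A.13 is indeed a local maximum of the constrained cubic:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6                                  # illustrative dimension

def f(o):
    return np.sum(o**3) / (3 * n)      # the cubic (A.11)

def on_sphere(o):                      # rescale onto the sphere sum o_k^2 = n
    return o * np.sqrt(n / np.sum(o**2))

o_star = np.zeros(n)
o_star[-1] = np.sqrt(n)                # selective solution of (A.13)

worse = [f(on_sphere(o_star + 0.05 * rng.standard_normal(n))) < f(o_star)
         for _ in range(200)]
print(all(worse))                      # True: o* is a local maximum on the sphere
```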
It is easy to check that equation A.13 defines a unique local maximum, o_k = 0 for k = 1, …, n − 1 and o_n = √n, which corresponds to the choice of synaptic weights that select the nth input vector v⃗_n. The neuron output o_j for all the other input vectors is zero. The other local maxima can be computed in the same way. We can generalize our analysis to a case of orthogonal vectors v⃗_i of length √λ_i, perturbed by additive gaussian noise. The C matrix is diagonal,

C_{ii} = \frac{\lambda_i + \sigma^2}{n},    (A.14)
and the cubic takes the form

f(o) = \frac{1}{3n} \sum_k o_k \left( o_k^2 + \frac{3\sigma^2}{n} \sum_j \frac{o_j^2}{\lambda_j} \right)    (A.15)

constrained on the ellipsoid

\sum_j o_j^2 \left( 1 + \frac{\sigma^2}{\lambda_j} \right) = n.    (A.16)

We look for the critical point such that o_n ≫ o_j ∀j, selective for the input v⃗_n. By differentiating, we arrive at the system

\left[ o_i^2 + \frac{1}{n} \sum_j \left( o_j^2 + 2 o_i o_j \right) \right] - \left[ o_n^2 + \frac{1}{n} \sum_j \left( o_j^2 + 2 o_n o_j \right) \right] \frac{(\sigma^2/\lambda_i + 1)\, o_i}{(\sigma^2/\lambda_n + 1)\, o_n} = 0, \qquad i \neq n.    (A.17)

We consider the leading terms according to the ansatz o_j/o_n ≪ 1 and the limit n → ∞, so that equation A.17 reduces to

\frac{\sigma^2/\lambda_i + 1}{\sigma^2/\lambda_n + 1}\, o_i o_n - \left( 1 - \frac{o_n^2}{n} \right) = 0,    (A.18)

supposing (1/n) Σ_{j≠n} o_j² ≪ 1 and o_n/n ≪ 1. The solution for the small outputs reads

o_i \simeq \frac{1}{o_n} \left( 1 - \frac{o_n^2}{n} \right) \frac{\sigma^2/\lambda_n + 1}{\sigma^2/\lambda_i + 1}, \qquad i \neq n,    (A.19)

where o_n can be determined from the constraint

o_n^2 + \frac{1}{o_n^2} \left( 1 - \frac{o_n^2}{n} \right)^2 \sum_{i \neq n} \frac{\sigma^2/\lambda_n + 1}{\sigma^2/\lambda_i + 1} = \frac{n}{\sigma^2/\lambda_n + 1}.    (A.20)

In the limit n → ∞, the leading terms of equation A.20 give

o_n^4 - \frac{n}{\sigma^2/\lambda_n + 1}\, o_n^2 + \Sigma = 0,    (A.21)

with \Sigma = (\sigma^2/\lambda_n + 1) \sum_{k \neq n} \frac{1}{\sigma^2/\lambda_k + 1},
so that

o_n \simeq \sqrt{ \frac{n}{2(\sigma^2/\lambda_n + 1)} \left( 1 + \sqrt{ 1 - \frac{4\Sigma}{n^2} \left( \frac{\sigma^2}{\lambda_n} + 1 \right)^2 } \right) }.    (A.22)

Let σ²/λ_n be of order O(1) (this is our constant-SNR hypothesis). Then o_n = O(√n), the response to the other inputs decreases as o_j = O(1/√n), and Σ is O(n), so that our ansatz is verified. In the most general case, we get a different critical value of the noise σ_c for each eigenvalue λ_i, which grows like O(n^{1/4}) also in this case:

\sigma_c(i) \simeq \sqrt{ \lambda_i \left( \frac{\sqrt{n}}{2} - 1 \right) }.    (A.23)

Under the simplifying hypothesis that the correlation matrix is proportional to the identity (λ_i = 1 ∀i), the problem reduces to the study of the critical points of

f(o) = \frac{1}{3n} \sum_{k=1}^{n} o_k \left( o_k^2 + 3\hat\sigma^2 \right),    (A.24)

where

\hat\sigma^2 = \frac{\sigma^2}{1 + \sigma^2},    (A.25)

with the usual constraint on the sphere Σ_k o_k² = n/(1 + σ²). We again assume o_n ≠ 0 and consider the effect of the noise on the selectivity property for the nth input vector. By differentiating equation A.24, we get the system

(o_n o_k - \hat\sigma^2)(o_k - o_n) = 0, \qquad k = 1, \ldots, n-1.    (A.26)

If o_k ≪ o_n, we have a local maximum at o_k = \hat\sigma^2/o_n, where

o_n = \sqrt{ \frac{n}{1 + \sigma^2} } \sqrt{ \frac{1}{2} + \sqrt{ \frac{1}{4} - \frac{(n-1)\sigma^4}{n^2} } }.    (A.27)

In the limit σ → 0, this solution converges toward the unperturbed stable solution o_k = 0 for k = 1, …, n − 1 and o_n = √n. It follows that when the noise σ reaches the critical value

\sigma_c = \sqrt{ \frac{n}{2\sqrt{n - 1}} },    (A.28)

the solution moves into the complex space. Beyond the critical value A.28, the BCM neuron is no longer able to develop selectivity for the nth input vector.
Appendix B

Given an input distribution as in section 5, we want to characterize the directions of the solutions for the BCM and kurtosis energy functions, putting in evidence how noise can alter the stable states. The correlation matrix C has the form

C = \begin{pmatrix} \frac{l^2}{2}(1 + \cos^2\alpha) + \sigma^2 & \frac{l^2}{2}\cos\alpha \sin\alpha \\ \frac{l^2}{2}\cos\alpha \sin\alpha & \frac{l^2}{2}\sin^2\alpha + \sigma^2 \end{pmatrix}.    (B.1)

It can be reduced to the identity by applying a rotation R (with rotation angle equal to α/2) that diagonalizes it, Λ = R C R^T, and a rescaling √Λ, so that we can define the matrix T as follows:

T = \sqrt{\Lambda^{-1}}\, R, \qquad T C T^T = I.    (B.2)

We apply this transformation to the inputs before calculating the third- and fourth-order statistical moments: d⃗′ = T d⃗. Thus, we obtain

f_{BCM}(y) = \frac{l \cos(\alpha/2)}{3\sqrt{\lambda_+}}\, y_1 \left[ \frac{l^2 \cos^2(\alpha/2) + 3\sigma^2}{\lambda_+}\, y_1^2 + 3 y_2^2 \right],    (B.3)

f_K(y) = \frac{1}{4}\left[ \left( 3 - \frac{2 l^4 \cos^4(\alpha/2)}{\lambda_+^2} \right) y_1^4 + \left( 3 - \frac{2 l^4 \sin^4(\alpha/2)}{\lambda_-^2} \right) y_2^4 + 6 y_1^2 y_2^2 \right],    (B.4)

with λ_+ = Λ_{11} = σ² + l² cos²(α/2), λ_− = Λ_{22} = σ² + l² sin²(α/2), and y⃗ = (cos θ, sin θ). The stable states for the energy functions E_BCM and E_K correspond to the maxima among the solutions of ∂f(y)/∂θ = 0. They are

\tan(\theta_{BCM}) = \pm \sqrt{ \frac{l^2 \cos^2(\alpha/2) - \sigma^2}{l^2 \cos^2(\alpha/2) + \sigma^2} },    (B.5)

\tan(\theta_K) = \pm \frac{\lambda_-}{\lambda_+} \cot^2 \frac{\alpha}{2}.    (B.6)

By applying the inverse transformation T^{-1}, we obtain the directions α/2 ± χ of the weight vectors, where

\chi_{BCM} = \arctan \sqrt{ \frac{l^2 \cos^2(\alpha/2) - \sigma^2}{l^2 \sin^2(\alpha/2) + \sigma^2} },    (B.7)

\chi_K = \arctan\left( \sqrt{ \frac{\lambda_-}{\lambda_+} } \cot^2 \frac{\alpha}{2} \right).    (B.8)
We remark that this transformation becomes singular in the limit σ → ∞, due to the scaling matrix √Λ. Projecting the input distribution on the direction of the stable solution, it is easy to compute the distance Δ between the two peaks:

\Delta = l\left( \cos(\alpha/2 - \chi) - \cos(\alpha/2 + \chi) \right) = 2 l \sin\chi \sin\frac{\alpha}{2}.    (B.9)

We use Δ to measure the separability property of the BCM and kurtosis energies.

References

Bear, M. F., & Malenka, R. C. (1994). Synaptic plasticity: LTP and LTD. Curr. Opin. Neurobiol., 4, 389–399.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48.
Blais, B. S., Intrator, N., Shouval, H., & Cooper, L. N. (1998). Receptive field formation in natural scene environments: Comparison of single cell learning rules. Neural Computation, 10(7), 1797–1813.
Castellani, G. C., Intrator, N., Shouval, H., & Cooper, L. N. (1999). Solutions of a BCM learning rule in a network of lateral interacting non-linear neurons. Network: Comput. Neural Syst., 10, 111.
Dudek, S. M., & Bear, M. F. (1992). Homosynaptic long-term depression in area CA1 of hippocampus and the effects of NMDA receptor blockade. Proc. Natl. Acad. Sci., 89, 4363–4367.
Friedman, J. H. (1987). Exploratory projection pursuit. J. Am. Stat. Ass., 82, 249.
Intrator, N., & Cooper, L. N. (1992). Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5, 3–17.
Khas'minskii, R. (1966). On stochastic processes defined by differential equations with a small parameter. Theory Prob. Appl., 11, 211–228.
Kirkwood, A., & Bear, M. F. (1994). Hebb synapses in visual cortex. J. Neurosci., 14, 1634–1645.
Kirkwood, A., Lee, H.-K., & Bear, M. F. (1995). Long-term potentiation and experience-dependent synaptic plasticity in visual cortex are correlated by age and experience. Nature, 375, 328–331.
Milnor, J. (1958). Differential topology. Princeton, NJ: Princeton University Press.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267–273.
Shouval, H., Intrator, N., Law, C. C., & Cooper, L. N. (1996). Effect of binocular cortical misalignment on ocular dominance and orientation selectivity. Neural Computation, 8(5), 1021–1040.
Shouval, H., Intrator, N., & Cooper, L. N. (1997). BCM network develops orientation selectivity and ocular dominance in natural scene environment. Vision Research, 37(23), 3339–3342.

Received August 15, 2000; accepted January 6, 2003.
LETTER
Communicated by Simon Haykin
Approximation by Fully Complex Multilayer Perceptrons

Taehwan Kim
[email protected]
MITRE Corporation, McLean, Virginia 22102, U.S.A.

Tülay Adalı
[email protected]
Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, Baltimore, Maryland 21250, U.S.A.

We investigate the approximation ability of a multilayer perceptron (MLP) network when it is extended to the complex domain. The main challenge for processing complex data with neural networks has been the lack of bounded and analytic complex nonlinear activation functions in the complex domain, as stated by Liouville's theorem. To avoid the conflict between the boundedness and the analyticity of a nonlinear complex function in the complex domain, a number of ad hoc MLPs that include using two real-valued MLPs, one processing the real part and the other processing the imaginary part, have been traditionally employed. However, since nonanalytic functions do not meet the Cauchy-Riemann conditions, they render themselves into degenerative backpropagation algorithms that compromise the efficiency of nonlinear approximation and learning in the complex vector field. A number of elementary transcendental functions (ETFs) derivable from the entire exponential function e^z that are analytic are defined as fully complex activation functions and are shown to provide a parsimonious structure for processing data in the complex domain and to address most of the shortcomings of the traditional approach. The introduction of ETFs, however, raises a new question about the approximation capability of this fully complex MLP. In this letter, three proofs of the approximation capability of the fully complex MLP are provided based on the characteristics of singularity among ETFs. First, the fully complex MLPs with continuous ETFs over a compact set in the complex vector field are shown to be universal approximators of any continuous complex mapping. The complex universal approximation theorem extends to bounded measurable ETFs possessing a removable singularity. Finally, it is shown that the output of complex MLPs using ETFs with isolated and essential singularities uniformly converges to any nonlinear mapping in the deleted annulus of singularity nearest to the origin.
© 2003 Massachusetts Institute of Technology. Neural Computation 15, 1641–1666 (2003).
T. Kim and T. Adalı
1 Introduction

The main challenge of extending the multilayer perceptron (MLP) paradigm to process complex-valued data has been the lack of bounded and analytic complex nonlinear activation functions in the complex plane C that can be incorporated in the hidden (and sometimes output) layers of a neural network. By Liouville's theorem, a bounded entire function must be a constant in C, where an entire function is defined as one analytic (i.e., differentiable) at every point z ∈ C (Silverman, 1975). Therefore, we cannot find an analytic complex nonlinear activation function that is bounded everywhere in C. The fact that many real-valued mathematical objects are best understood when they are considered as embedded in the complex plane (e.g., the Fourier transform) provides a basic motivation to develop a neural network structure that uses well-defined analytic nonlinear activation functions (Clarke, 1990). However, because of Liouville's theorem, the complex MLP has been subject to the common view that it has to live with a compromised employment of nonanalytic but bounded nonlinear activation functions for the stability of the system (Georgiou & Koutsougeras, 1992; You & Hong, 1998; Mandic & Chambers, 2001). Hence, to process complex data, a split complex approach is typically adopted (Leung & Haykin, 1991; Benvenuto, Marchesi, Piazza, & Uncini, 1991). Another approach has been to use two independent real-valued MLPs, dealing with the real and imaginary parts separately, or with the magnitude and the phase. However, because of the cumbersome representation of separated real and imaginary (or magnitude and phase) components, all of these approaches are inefficient in representing complex-valued functions. Also, the nonanalytic but bounded ad hoc complex functions cannot provide the truthful complex gradients required for the error backpropagation process.
Approximation by Fully Complex Multilayer Perceptrons

Therefore, the nonanalytic activation functions render themselves into degenerative forms of error backpropagation that compromise the efficiency of complex number representation and of the learning paradigm, and are ill suited for further optimization. Consequently, the traditional complex MLPs are often unable to achieve robust and resilient tracking and pattern-matching performance in noisy and time-varying signal conditions. In Kim and Adalı (2000), a subset of elementary transcendental functions (ETFs) derivable from the exponential function f(z) = e^z, which is entire since f(z) = f′(z) = e^z in C, is proposed as activation functions for nonlinear processing of complex data. The fully complex MLP is shown to provide better performance than the traditional MLPs in system identification (Kim, 2002), equalization for a nonlinear satellite channel (Kim & Adalı, 2001, 2002), and nonlinear prediction and cancellation of scintillation examples (Kim, 2002; Kim & Hegarty, 2002). In this letter, fully complex activation function is used to refer to one of the 10 ETFs shown in section 2.2 that is entire and meets the Cauchy-Riemann conditions. The fully complex MLP employs one of these ETFs and takes advantage of compact joint real and imaginary input feedforward and error backpropagation through well-defined complex error gradients.

In this letter, we study the approximation properties of these activation functions when they are used in a feedforward structure and present three key results that explain their approximation ability. The ETFs we consider are entire and either bounded in a bounded domain of interest or bounded almost everywhere (a.e.), that is, unbounded only on a set of points having zero measure in C. Since these functions are entire, they are also conformal and provide well-defined derivatives while still allowing the complex MLP to converge with probability 1, which is sufficient in most practical applications of MLPs. Also, among these ETFs, circular and hyperbolic functions such as tan z and tanh z are a special form of bilinear (i.e., Möbius) transforms and lend themselves to the fixed-point analysis discussed in Mandic and Chambers (2001). Note, however, that inverse functions such as arcsin z and arcsinh z are not bilinear transforms, but they perform very well in certain learning tasks (Kim, 2002) because of their elegant symmetric and squashing magnitude responses, as discussed in section 2.2.

The organization of the rest of the article is as follows. The traditional complex-valued MLPs and error backpropagation algorithms that rely on nonanalytic but bounded nonlinear activation functions are summarized in section 2. In section 3, three categories of singularity that ETFs inherently possess are presented, and their relationships to the approximation capabilities of the fully complex MLP are established. Similar to its well-known real-valued MLP counterpart, it is shown that fully complex MLPs using continuous and bounded measurable activation functions are universal approximators over a compact set in the n-dimensional complex vector field C^n, by the Stone-Weierstrass and the Riesz representation theorems.
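The well-defined complex gradient that makes fully complex backpropagation possible can be illustrated numerically. The sketch below is not from the article; the test point and step sizes are illustrative choices. For an analytic function such as tanh z, the difference quotient converges to the same value no matter from which direction the step approaches zero, which is exactly what the Cauchy-Riemann conditions guarantee:

```python
import cmath

def ctanh(z):
    """Fully complex tanh: (e^z - e^-z)/(e^z + e^-z), analytic away from its poles."""
    return cmath.tanh(z)

def numerical_derivative(f, z, h):
    """One-sided difference quotient along a given complex step h."""
    return (f(z + h) - f(z)) / h

z0 = 0.3 + 0.2j
d_real = numerical_derivative(ctanh, z0, 1e-7)   # approach along the real axis
d_imag = numerical_derivative(ctanh, z0, 1e-7j)  # approach along the imaginary axis
d_true = 1 / cmath.cosh(z0) ** 2                 # analytic derivative sech^2 z

# Direction-independent derivative: both difference quotients agree with sech^2 z.
assert abs(d_real - d_true) < 1e-5
assert abs(d_imag - d_true) < 1e-5
```

A split-complex activation fails this test: its real- and imaginary-axis difference quotients generally disagree, which is the root of the pseudo-gradient problem discussed above.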
For measurable but unbounded elementary transcendental functions with isolated singularities, and for nonmeasurable functions with an essential singularity, we show uniform convergence to any complex number with arbitrary accuracy within the deleted annulus of a singularity via Laurent series approximation. We include a short discussion of some of the applications for which complex nonlinear processing is particularly important and present a simple numerical example to highlight the efficiency of the fully complex representation. Section 5 contains a summary and a discussion of further research.

2 Complex-Valued MLPs

2.1 Traditional Complex-Valued MLPs. This section gives a brief summary of previous work on complex-valued MLPs. (A more detailed discussion can be found in Kim, 2002.) It was Clarke (1990) who generalized the real-valued activation function to the complex-valued one. He suggested employing the complex tangent sigmoidal function tanh z = (e^z − e^{−z})/(e^z + e^{−z}), z ∈ C, and noted that this function is no longer bounded but includes singularities. He added that to keep the activation function analytic, one cannot avoid having unbounded functions, because of Liouville's theorem. However, after this important observation was made, the expansion of neural networks into the complex domain followed a series of ad hoc approaches instead of further investigation of the unbounded nature of analytic activation functions. The first and most commonly used ad hoc approach has been the employment of a pair of tanh x, x ∈ R, hence the name split complex activation function (Leung & Haykin, 1991; Benvenuto et al., 1991). This approach has inherent limitations in representing a truthful gradient and, consequently, in developing the fully complex backpropagation algorithm, because the derivatives of the split-complex activation function cannot fully represent the true gradient unless the real and imaginary weight updates in the error backpropagation are completely independent. This, however, is clearly impossible in a complex error backpropagation process. Equation 2.1 gives the split-complex activation function,

f(z) = f_R(Re(z)) + i f_I(Im(z)),    (2.1)
where a pair of real-valued sigmoidal functions f_R(x) = f_I(x) = tanh x, x ∈ R, process the real and imaginary components of z = W · X + b, that is, the inner product of the complex input vector X and the complex weight vector W of the same dimension, plus a complex bias. The split-complex backpropagation algorithm, developed almost simultaneously in Leung and Haykin (1991) and Benvenuto et al. (1991), did employ complex arithmetic but was shown to be a degenerative and special form of the fully complex backpropagation algorithm (Kim & Adalı, 2001, 2002). A compromised approach to process the real and imaginary components of a complex signal jointly followed soon after (Georgiou & Koutsougeras, 1992; Hirose, 1992). These authors proposed joint-nonlinear complex activation functions that process the real and imaginary components as shown in equations 2.2 and 2.3, respectively:

f(z) = z / (c + |z|/r),    (2.2)

f(z) = tanh(s/m) exp[iβ],  z = s · exp[iβ].    (2.3)
Here, c and r are real positive constants, and m is a constant that is inversely related to the gradient of the absolute value |f| along the radial direction around the origin of the complex coordinate for z = s · exp[iβ]. However, these functions are still not analytic, and they also preserve the phase. The inability to provide an accurate nonlinear phase response poses a significant disadvantage for these functions in signal processing applications, as shown in section 4.

Recently, it has been shown that the analytic nonlinear function tanh z can successfully be used in a fully complex MLP structure, where its superior performance over the traditional nonanalytic approach is demonstrated in a nonlinear channel equalization example (Kim & Adalı, 2000). The article also develops the fully complex error backpropagation algorithm that takes advantage of the well-defined true complex gradient satisfying the Cauchy-Riemann equations. The split- and joint-complex backpropagation algorithms are also shown to be special and degenerative cases of the fully complex backpropagation algorithm, and the fully complex backpropagation algorithm is shown to be a complex conjugate form of its real-valued counterpart (Kim & Adalı, 2001, 2002). Nine other fully complex nonlinear activation functions from the elementary transcendental function family are identified as adequate complex squashing functions. The advantage of the truthful gradients of these functions over the pseudo-gradients generated from split- and joint-complex activation functions is demonstrated in various numerical examples (Kim & Adalı, 2001, 2002; Kim & Hegarty, 2002; Kim, 2002).

Figure 1 shows the magnitude and phase characteristics of the split tanh z function given in equation 2.1. Figure 2 shows the magnitude and phase of the joint-complex activation function proposed by Georgiou and Koutsougeras, given in equation 2.2, followed by Figure 3 showing Hirose's joint-complex activation function given in equation 2.3. As observed in these figures, the nonlinear discrimination ability of the phase of these functions is limited, and the magnitude of split tanh z divides the domain into four quadrants.

Figure 1: Split tanh z. (Left) Magnitude. (Right) Phase.
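For concreteness, the three traditional activations of equations 2.1 through 2.3 can be sketched as follows. This is a minimal illustration, not code from the article; the constants c, r, m and the test points are arbitrary choices. The assertions check the two properties discussed in the text: the joint-complex activations leave the phase of z unchanged (hence their limited nonlinear phase discrimination), and the split activation is bounded by √2 in magnitude.

```python
import cmath
import math

def split_tanh(z):
    """Split-complex activation (eq. 2.1): real and imaginary parts squashed separately."""
    return math.tanh(z.real) + 1j * math.tanh(z.imag)

def georgiou_koutsougeras(z, c=1.0, r=1.0):
    """Joint-complex activation (eq. 2.2): squashes the magnitude, preserves the phase."""
    return z / (c + abs(z) / r)

def hirose(z, m=1.0):
    """Joint-complex activation (eq. 2.3): z = s*exp(i*beta) -> tanh(s/m)*exp(i*beta)."""
    s, beta = abs(z), cmath.phase(z)
    return math.tanh(s / m) * cmath.exp(1j * beta)

z = 2.0 + 1.0j
# Both joint activations leave the phase of z unchanged (linear phase response).
assert abs(cmath.phase(georgiou_koutsougeras(z)) - cmath.phase(z)) < 1e-12
assert abs(cmath.phase(hirose(z)) - cmath.phase(z)) < 1e-12
# The split activation is bounded by sqrt(2) in magnitude.
assert abs(split_tanh(100 + 100j)) <= math.sqrt(2) + 1e-12
```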
2.2 Fully Complex Activation Functions. The following classes of elementary transcendental functions have been identified as those that can provide adequate squashing-type nonlinear discrimination with well-defined first-order derivatives, and hence are candidates for developing a fully complex MLP (Kim & Adalı, 2001).
Figure 2: f(z) = z/(c + |z|/r). (Left) Magnitude. (Right) Phase.
Figure 3: tanh(s/m) exp[iβ]. (Left) Magnitude. (Right) Phase.
• Circular functions:

  tan z = (e^{iz} − e^{−iz}) / (i (e^{iz} + e^{−iz})),   (d/dz) tan z = sec^2 z
  sin z = (e^{iz} − e^{−iz}) / (2i),   (d/dz) sin z = cos z

• Inverse circular functions:

  arctan z = ∫_0^z dt / (1 + t^2),   (d/dz) arctan z = 1 / (1 + z^2)
  arcsin z = ∫_0^z dt / (1 − t^2)^{1/2},   (d/dz) arcsin z = (1 − z^2)^{−1/2}
  arccos z = π/2 − ∫_0^z dt / (1 − t^2)^{1/2},   (d/dz) arccos z = −(1 − z^2)^{−1/2}
Figure 4: sin z. (Left) Magnitude. (Right) Phase.
Figure 5: sinh z. (Left) Magnitude. (Right) Phase.
• Hyperbolic functions:

  tanh z = sinh z / cosh z = (e^z − e^{−z}) / (e^z + e^{−z}),   (d/dz) tanh z = sech^2 z
  sinh z = (e^z − e^{−z}) / 2,   (d/dz) sinh z = cosh z

• Inverse hyperbolic functions:

  arctanh z = ∫_0^z dt / (1 − t^2),   (d/dz) arctanh z = (1 − z^2)^{−1}
  arcsinh z = ∫_0^z dt / (1 + t^2)^{1/2},   (d/dz) arcsinh z = (1 + z^2)^{−1/2}
Figures 4 through 12 show the magnitude and phase responses of these elementary transcendental functions.

Figure 6: arcsin z. (Left) Magnitude. (Right) Phase.
Figure 7: arcsinh z. (Left) Magnitude. (Right) Phase.
Figure 8: arccos z. (Left) Magnitude. (Right) Phase.
Figure 9: tan z. (Left) Magnitude. (Right) Phase.
Figure 10: tanh z. (Left) Magnitude. (Right) Phase.
Figure 11: arctan z. (Left) Magnitude. (Right) Phase.
Figure 12: arctanh z. (Left) Magnitude. (Right) Phase.

Figures 4 and 5 (both left) show the magnitude of sin z and sinh z, where the sine curve characteristics in the range [−π/2, π/2] are observed along the real and imaginary axes, respectively. Unlike the other ETFs, shown in Figures 6 through 12, the sin z and sinh z functions exhibit continuous magnitude but are unbounded, with convex growth along the imaginary and real axes, respectively, as their domain expands from the origin. Therefore, these functions can be effective nonlinear approximators when bounded within a radius of π/2. Figures 6 and 7 (both left) show that the arcsin z and arcsinh z functions are not continuous, and therefore not analytic, along a portion of the real or the imaginary axis. These discontinuities, known as branch cuts, are explained in section 3.2. However, these two are the most radially symmetric functions in magnitude. As observed in Figure 8 (left), the arccos z function is asymmetric along the real axis and has a discontinuous zero point at z = 1. The split tanh x function shown in Figure 1, on the other hand, does not possess a similarly smooth radial symmetry property because of the valleys it has along the real and imaginary axes. Note that these functions have decreasing rates of magnitude growth as they move away from the origin. Also note that unlike the arcsinh z and arccos z functions, which are unbounded, the split tanh x function is bounded by √2 in magnitude, as it is defined in terms of two real-valued bounded functions. Instead, when the domain is bounded, arcsin z, arcsinh z, and arccos z are naturally bounded while providing more discriminating nonlinearity than the split tanh x function. It has been observed that the radial symmetry as well as the nonlinear phase response of the arcsinh z and arcsin z functions tend to provide efficient nonlinear approximation capability (Kim, 2002). Unlike the arcsin z and arcsinh z functions, which have branch cut-type singularities, note the pointwise periodicity and singularity of tanh z at every (1/2 + n)πi, n ∈ N, in Figure 10 (left). Similarly, tan z is singular and periodic at every (1/2 + n)π, as shown in Figure 9 (left).
In contrast, Figures 11 and 12 (both left) show the isolated singularities of arctan z and arctanh z at ±i and ±1, respectively. Although our focus in this article is on approximation using these ETFs as activation functions, we note that these types of singular points and discontinuities at nonzero points do not pose a problem in training when the domain of interest is bounded within a circle of radius π/2. If the domain is larger and includes these irregular points and the initial random hidden-layer weight radius is not small enough, then the training process tends to become more sensitive to the size of the learning rate and the radius of the initial random weights (Kim, 2002).

3 Approximation by Fully Complex MLP

In this section, the theory of the approximation capability of the fully complex MLP is developed. In this development, three types of approximation theorems emerge, according to the three types of singularity that each of the elementary transcendental functions possesses (except the continuous sin z and sinh z functions, which do not contain any). The first two approximation results and the supporting domains of convergence for the first two classes of functions are very general and resemble the universal approximation theorem for the real-valued feedforward MLP that was shown almost concurrently by multiple authors in 1989 (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989; Funahashi, 1989). The third approximation theorem for the fully complex MLP is unique and related to the power series approximation that can represent any complex number arbitrarily closely in the deleted neighborhood of a singularity. This approximation is uniform only in the analytic domain of convergence whose radius is defined by the closest singularity.

3.1 Background. The universal approximation theorem for real-valued MLPs shows the existence of an arbitrarily accurate approximation of any continuous or Borel-measurable multivariate real mapping from one finite-dimensional space to another by a finite linear combination of a fixed univariate nonlinear activation function, provided sufficient hidden neuron units are available.
It is well known that Arnold and Kolmogorov refuted Hilbert's thirteenth conjecture in 1957, when Kolmogorov's superposition theorem solved Hilbert's question about the representation of arbitrary multivariate functions using bivariate functions (Cybenko, 1989; Hornik et al., 1989). Kolmogorov's superposition theorem showed that all continuous functions of n variables have an exact representation in terms of finite superposition and composition of a small number of functions of one or two variables. However, the Kolmogorov theorem requires a different unknown nonlinear transformation for each continuous desired multivariate function, while specifying an exact upper limit on the number of intermediate units needed for the representation (Hornik et al., 1989). For the MLP, the interest is in finite linear combinations involving the same univariate activation function for approximation, as opposed to exact representation. Mathematically, the approximation of a general function of an n-dimensional real variable, x ∈ R^n, by finite linear combinations is expressed as

G(x) = Σ_{k=1}^{N} α_k σ(W_k^T x + θ_k),    (3.1)
where W_k ∈ R^n and α_k, θ_k ∈ R are fixed. In the neural network approach, the nonlinear function σ is typically a sigmoidal function. Cybenko noted that equation 3.1 can be seen as a special case of a generalized approximation by finite Fourier series. He demonstrated two mathematical tools to prove the completeness properties of such series (or, equivalently, the uniform convergence of Cauchy sequences): the algebra of functions leading to the Stone-Weierstrass theorem (Rudin, 1991) and the translation-invariant subspaces leading to the Tauberian theorem. Hornik et al. (1989) also employed the Stone-Weierstrass theorem, with a cosine transform approximation, and then extended their universal approximation theorem to multi-output and multilayer feedforward MLPs. Since Cybenko's proof for the single-output, single-hidden-layer MLP demonstrated the fundamental universal approximation in a more compact manner, the approximation theorems for the fully complex MLP that we present here, for an arbitrary complex continuous and measurable mapping, follow steps similar to Cybenko's derivation.

3.2 Classification of Activation Functions. In this section, we present the classification of complex activation functions based on their types of singularity. The first two classes divide the ETFs included in section 2.2 into a set of bounded squashing activation functions over a bounded domain and a set of activation functions with unbounded isolated singularities. We then introduce a third class, activation functions having an essential singularity, and discuss the properties of each class to establish the universal approximation property. A single-valued function is said to have a singularity at a point if the function is not analytic, and therefore not continuous, at that point. In complex analysis, three types of singularities are known: removable, isolated, and essential (Silverman, 1975).
Note that the ETFs possessing singularities are all derivable from the entire complex exponential function f(z) = f′(z) = e^z (Churchill & Brown, 1992; Needham, 2000).

3.2.1 Removable Singularity Class. If f(z) has a single-point singularity at z_0, the singularity is said to be removable if lim_{z→z_0} f(z) exists. The inverse sines and cosines shown in Figures 6, 7, and 8 have removable singularities represented as ridges along the real or imaginary axis outside the unit circle, defined as the branch cuts. Without loss of generality, even though the complex sine and cosine functions shown in section 2.2 have periodicity defined by their branch cuts (Silverman, 1975), the working domain is treated as connected and bounded by the branch cuts to ensure that they are single-valued functions. These branch cut discontinuities of the inverse functions are related to the definition of the corresponding integrals, where the paths of integration must not cross the branch cuts. The inverse sine and cosine functions exhibit unbounded magnitude but a decreasing rate of magnitude growth as they move away from the origin. Therefore, once the domain is bounded and connected, the range of these functions is naturally bounded, with squashing function characteristics. On the other hand, the sin z and sinh z functions are continuous and grow convex upward with an increasing rate of magnitude growth parallel to the real or imaginary axis, respectively, as shown in Figures 4 and 5. Also note that sin z is equivalent to sin x along the real axis, while sinh z is equivalent to sin x along the imaginary axis. Therefore, the bounded squashing property of the sin z and sinh z functions is available within a radius of π/2 from the origin. Consequently, the sin z, sinh z, arcsin z, arcsinh z, and arccos z functions (the sine family) can be classified as complex squashing functions within a bounded domain. Note that by using a scaling coefficient, the size of the squashing domain can be controlled. The formal definition of a complex squashing function is given in the appendix.

3.2.2 Isolated Singularity Class. If lim_{z→z_0} |f(z)| → ∞ while f(z) is analytic in a deleted neighborhood of z = z_0, then the singularity is not removable but isolated. Isolated singularities are found in the tangent function family. The inverse tangent functions have nonperiodic isolated singularities: arctan z at ±i and arctanh z at ±1. By contrast, tanh z has isolated periodic singularities at every (1/2 + n)πi, n ∈ N, where lim_{z→(1/2+n)πi+} tanh z = −∞ and lim_{z→(1/2+n)πi−} tanh z = +∞. Similarly, tan z has periodic isolated singularities at every (1/2 + n)π.
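The pole locations just listed can be checked numerically; the sketch below is my own illustration, with step distances and thresholds chosen arbitrarily. The magnitude blows up as the argument approaches a singular point and stays modest at regular points:

```python
import cmath

eps = 1e-6  # small offset from each singular point

# Blow-up near the isolated singularities named above:
assert abs(cmath.tanh(1j * cmath.pi / 2 + eps)) > 1e5   # tanh z near i*pi/2
assert abs(cmath.tan(cmath.pi / 2 + eps)) > 1e5         # tan z near pi/2
assert abs(cmath.atan(1j + eps)) > 5                    # arctan z near +i (logarithmic, slower)

# Away from the poles the functions are analytic and modest in size.
assert abs(cmath.tanh(0.5 + 0.5j)) < 2
```

This is the behavior that makes bounding the working domain within a radius of π/2 important for stable training, as noted above.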
3.2.3 Essential Singularity Class. If a singularity is neither removable nor isolated, then it is known as an essential singularity. A rare example can be found in f(z) = e^{1/z}. As shown in Figure 13 (left) in three dimensions and in Figure 13 (right) along the real axis, the function has a singularity at 0 with multiple limiting behaviors depending on the direction of approach: |e^{1/z}| = 1 as z → 0 along the imaginary axis, e^{1/z} → 0 as z → 0−, and e^{1/z} → ∞ as z → 0+. For this reason, the essential singularity does not satisfy the Cauchy-Riemann equations at the origin. However, the essential singularity has an intriguing property, summarized in a powerful theorem known as the Big Picard theorem (or Picard's Great theorem; Silverman, 1975). It states that in the neighborhood of an essential singularity, a function assumes each complex value, with one possible exception, infinitely often. Therefore, if a priori knowledge of the working domain is available, a variation of f(z) = e^{1/z}, for example, f(z) = 0.5(e^{1/(z−z_0)} + e^{1/(z+z_0)}), can be used as an activation function by placing multiple shifted and scaled singularities to build a powerful approximation. In the uniform approximation theorem using the essential singularity activation function, it is shown that for an activation function with an essential singularity centered at the origin, the converging domain should exclude the essential singularity itself to form a deleted annulus.

Figure 13: Essential singularity function f(z) = e^{1/z}. (Left) 3D magnitude. (Right) Magnitude along the real axis.

3.3 Approximation by Fully Complex MLP. Since most of the ETFs are not continuous, except sin z and sinh z, but have singularities, it is important to know whether they are complex measurable functions, that is, whether the measure of the set of singularities is zero in the complex vector field. The definitions of a complex measurable space and of (Borel) complex measurable functions are given in definition 4 in the appendix. Note that the value infinity is admissible for a positive measure, but when a complex measure μ is being considered, defined on a σ-algebra

1. l1 > l2 + 1 > 2, where l1 and l2 are the numbers of training examples in class 1 and class 2, respectively.¹

2. For i ≠ j, x_i ≠ x_j. That is, no two examples have identical x vectors.²

The following lemma is useful.

Lemma 1. For any given (C, σ^2), the solution α of equation 1.2 is unique. Also, for every σ^2, {z_i | y_i = 1} and {z_i | y_i = −1} are linearly separable.

Proof. From Micchelli (1986), if the gaussian kernel is used and x_i ≠ x_j ∀ i ≠ j from assumption 2, Q is positive definite. By corollary 1 of Chang and Lin (2001b), we get linear separability in z-space. Uniqueness of α follows from the fact that equation 1.2 is a strictly convex quadratic programming problem.

We now discuss the various asymptotic behaviors. As the results of each case are stated, it is useful to refer to the example shown in Figure 1. Wherever we come across results whose proofs do not shed any insight on the asymptotic behaviors, we only state the results and relegate the proofs to the appendix.

2.1 Case 1. σ^2 Fixed and C → 0.
It can be shown (see the proof of theorem 5 in Chang & Lin, 2001b, for details) that if C is smaller than a certain positive value, the following holds:

α_i = C  ∀ i with y_i = −1.    (2.1)
¹ If l2 > l1 + 1 > 2, then we can always interchange the two classes and apply all the results derived in this article. Cases where |l1 − l2| ≤ 1 or min{l1, l2} < 2 correspond to abnormal situations that are not worth discussing in detail, since in practice the numbers of examples in the two classes rarely satisfy either of these conditions.
² This is a generic assumption that is easily satisfied if small random perturbations are added to all training examples.
Asymptotic Behaviors of Support Vector Machines
Let us take one such C. Using equation 2.1 together with Σ_{i=1}^{l} y_i α_i = 0 and l1 > l2, it is easy to see that there exists at least one i for which α_i < C and y_i = 1. For such an i, we have

w^T z_i + b ≥ 1.    (2.2)

For C → 0, we have α_i → 0, and so w^T z = Σ_{i=1}^{l} α_i y_i K(x_i, x) → 0, where z = φ(x). These imply that if X is any compact subset of R^n, then for any given 0 < a < 1, there exists C̄ > 0 such that for all C ≤ C̄,

b ≥ a  and  |Σ_{i=1}^{l} α_i y_i K(x_i, x)| ≤ a/2  ∀ x ∈ X.    (2.3)
Hence, for all C ≤ C̄, f(x) > 0 ∀ x ∈ X. In particular, if we take X to be the compact subset of data space that is of interest to the given problem, then for sufficiently small C, every point in this subset is classified as class 1. The first part of assumption 1 allows us to use similar arguments for the case of equation 1.2 with one example left out. Then we can also show that as C → 0, the number of LOO errors is l2. Thus, C → 0 corresponds to severe underfitting, as expected. Furthermore, we have the following properties as C → 0:

1. ||w||^2 = α^T Q α → 0.

2. lim_{C→0} (1/C) Σ_{i=1}^{l} α_i = lim_{C→0} (2/C) Σ_{i: y_i = −1} α_i = 2 l2.

3. Using the equality of the primal and dual objective function values at optimality and the inequality α^T Q α ≤ l2 C^2, we get

   lim_{C→0} Σ_{i=1}^{l} ξ_i = lim_{C→0} (1/C) (Σ_{i=1}^{l} α_i − α^T Q α) = 2 l2.

It is useful to interpret the above asymptotic results geometrically; in particular, to study the movement of the top, middle, and bottom planes defined by w^T z + b = 1, w^T z + b = 0, and w^T z + b = −1 as C → 0. By equation 2.2, at least one example of class 1 lies on or above the top plane. By property 1 above, the distance between the top and bottom planes (which equals 2/||w||) goes to infinity. Hence, the middle and bottom planes are forced to move down, farther and farther away from the location of the training points, causing the half-space defined by w^T z + b ≥ 0 to cover X, the compact subset of interest to the problem, entirely once C becomes sufficiently small.
S. Keerthi and C. Lin
Remark 1. The results given above for C → 0 are general and apply to nongaussian kernels as well, assuming, of course, that all hyperparameters associated with the kernel function are kept fixed. The results also apply if Q is a bounded function of C, since theorem 5 of Chang and Lin (2001b) holds for this case.

Remark 2. For kernels whose values are bounded (e.g., the gaussian kernel), there is a C̄ such that equation 2.3 holds for all x ∈ R^n. Thus, for all C ≤ C̄, f(x) > 0 ∀ x ∈ R^n. That is, for all C ≤ C̄, every point is classified as class 1.

2.2 Case 2. σ^2 Fixed and C → ∞. By lemma 1, {z_i | y_i = 1} and {z_i | y_i = −1} are linearly separable. This implies that it is possible to set ξ_i = 0 ∀ i while still remaining feasible for equation 1.1. Thus, as C → ∞, the solution of equation 1.1 approaches the solution of the hard-margin problem:

min_{w,b} (1/2) w^T w
subject to y_i (w^T z_i + b) ≥ 1, i = 1, …, l.    (2.4)
A formal treatment of this is in Lin (2001), which shows that if equation 2.4 is feasible, there exists a C* such that for C ≥ C*, the solution set of equation 1.1 is the same as that of equation 2.4. An easy way to see this result is to solve equation 1.2 with C = ∞, obtain the {α_i}, and set C* = max_i α_i. The limiting SVM classifier classifies all training examples correctly, and so it is an overfitting classifier. In particular, severe overfitting occurs when σ^2 is small, since the flexibility of the classifier is high when σ^2 is small. For the case C → ∞, it is not possible to draw any conclusions about the actual value of the LOO error; that value depends on the data set as well as on the value of σ^2. However, after equation 1.2 is solved using all the examples, it is possible to give bounds on the LOO error (Joachims, 2000; Vapnik & Chapelle, 2000) without solving the quadratic programs obtained by leaving out one example at a time.

2.3 Case 3. C Fixed and σ^2 → 0. Let us define δ_ij = 1 for i = j and δ_ij = 0 for i ≠ j. Since e^{−||x_i − x_j||^2 / (2σ^2)} → δ_ij as σ^2 → 0, we consider the following problem:

min_α (1/2) α^T α − e^T α
subject to 0 ≤ α_i ≤ C, i = 1, …, l,
y^T α = 0.    (2.5)
Using lemma 2 (the proof is in section A.1), as σ^2 → 0, the solution of equation 1.2 converges to that of equation 2.5. Since l1 > l2, the solution of equation 1.2 has 0 < α_i < C for at least one i.³ Thus, b is uniquely determined, and as σ^2 → 0, it approaches the value of b corresponding to the primal form of equation 2.5. Therefore, let us study the solution of equation 2.5. In section A.2, we show that its solution is given by α_i = α⁺ if y_i = 1 and α_i = α⁻ if y_i = −1, where

α⁻ = C_lim if C ≥ C_lim,  α⁻ = C if C < C_lim;
α⁺ = 2 l2 / l if C ≥ C_lim,  α⁺ = l2 C / l1 if C < C_lim,    (2.6)

and C_lim = 2 l1 / l. The threshold parameter b in the primal form corresponding to equation 2.5 can be determined using the fact that 0 < α⁺ < C (and hence all class 1 examples lie on the top plane defined by w^T z + b = 1):

b = (l1 − l2) / l if C ≥ C_lim,  b = 1 − l2 C / l1 if C < C_lim.    (2.7)
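Equations 2.6 and 2.7 can be packaged into a short sanity check. This sketch is my own (the class sizes are illustrative); it verifies that the stated limiting solution satisfies the equality constraint Σ_i y_i α_i = l1 α⁺ − l2 α⁻ = 0 and the box constraints in both regimes:

```python
def limiting_solution(l1, l2, C):
    """Limiting sigma^2 -> 0 solution (eqs. 2.6-2.7): alpha_plus for class 1,
    alpha_minus for class 2, and the threshold b.  Assumes l1 > l2 + 1 > 2."""
    l = l1 + l2
    C_lim = 2 * l1 / l
    if C >= C_lim:
        alpha_minus, alpha_plus, b = C_lim, 2 * l2 / l, (l1 - l2) / l
    else:
        alpha_minus, alpha_plus, b = C, l2 * C / l1, 1 - l2 * C / l1
    return alpha_plus, alpha_minus, b

# The equality constraint l1*alpha_plus - l2*alpha_minus = 0 must hold in both regimes.
for C in (0.5, 5.0):
    ap, am, b = limiting_solution(l1=60, l2=40, C=C)
    assert abs(60 * ap - 40 * am) < 1e-12
    assert 0 <= ap <= C and 0 <= am <= C
```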
Consider the classifier function f(x) = w^T z + b corresponding to equation 2.5. In section A.2, we also show the following:

1. If C ≥ C_lim/2, f classifies all training examples correctly and classifies the rest of the space as class 1. Thus, it overfits the training data.

2. If C < C_lim/2, then f classifies the entire space as class 1, and so it underfits the training data.

3. The number of LOO errors is l2.

Consider the SVM classifier corresponding to the gaussian kernel for small values of σ^2. Even though the number of LOO errors tends to l2 for all C, it is important to note that the SVM classifier is qualitatively very different for large C and small C. For large C, there are small regions around each example of class 2 that are classified as class 2 (overfitting), while for small C, there are no such regions (underfitting). It is also interesting to note that if σ^2 is small and C is greater than a threshold that is around C_lim, then from equation 2.6, the SVM classifier does not depend on C. Thus, contour lines of constant generalization error are parallel to the C axis in the region where σ^2 is small and C is large.

³ As we show below (see equation 2.6), the solution of equation 2.5 is well in the interior of (0, C) for at least one i. Since for small values of σ^2 the solution of equation 1.2 approaches that of equation 2.5, it follows that the solution of equation 1.2 also has 0 < α_i < C for at least one i.
1674
S. Keerthi and C. Lin
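The closed-form solution in equations 2.6 and 2.7 is easy to check numerically. The sketch below is an illustration under stated assumptions (the class sizes $l_1 = 5$, $l_2 = 3$ and the two $C$ values are arbitrary choices, not data from the paper); it solves the limiting problem 2.5, whose kernel matrix is the identity, directly from the KKT conditions of section A.2: for a trial $b$, $\alpha_i = \min(\max(1 - b y_i, 0), C)$, and $b$ is found by bisection so that $y^T \alpha = 0$.

```python
import numpy as np

def solve_limiting_dual(l1, l2, C, tol=1e-12):
    """Solve min (1/2)||a||^2 - e^T a  s.t.  0 <= a_i <= C, y^T a = 0,
    i.e. equation 2.5 (the sigma^2 -> 0 limit, where K becomes the identity).
    KKT (eq. A.6): a_i = clip(1 - b*y_i, 0, C); bisect on b until y^T a = 0."""
    y = np.array([1.0] * l1 + [-1.0] * l2)

    def ysum(b):
        return y @ np.clip(1.0 - b * y, 0.0, C)

    lo, hi = -2.0, 2.0          # ysum is nonincreasing in b on this bracket
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ysum(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    b = 0.5 * (lo + hi)
    return np.clip(1.0 - b * y, 0.0, C), b

l1, l2 = 5, 3                    # l1 > l2, so l = 8 and Clim = 2*l1/l = 1.25
for C in (2.0, 0.8):             # one case on each side of Clim
    a, b = solve_limiting_dual(l1, l2, C)
    print(C, a[0], a[-1], b)     # alpha+, alpha-, b
```

For $C = 2.0 \ge C_{\lim}$ this returns $\alpha^+ = 2l_2/l = 0.75$, $\alpha^- = C_{\lim} = 1.25$, $b = (l_1 - l_2)/l = 0.25$; for $C = 0.8 < C_{\lim}$ it returns $\alpha^+ = l_2 C / l_1 = 0.48$, $\alpha^- = C = 0.8$, $b = 0.52$, in agreement with equations 2.6 and 2.7.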
2.4 Case 4. C Is Fixed and $\sigma^2 \to \infty$. When $\sigma^2 \to \infty$, we can write

$$K(\tilde{x}, \bar{x}) = \exp(-\|\tilde{x} - \bar{x}\|^2 / 2\sigma^2)$$
$$= 1 - \frac{\|\tilde{x} - \bar{x}\|^2}{2\sigma^2} + o(\|\tilde{x} - \bar{x}\|^2 / \sigma^2)$$
$$= 1 - \frac{\|\tilde{x}\|^2}{2\sigma^2} - \frac{\|\bar{x}\|^2}{2\sigma^2} + \frac{\tilde{x}^T \bar{x}}{\sigma^2} + o(\|\tilde{x} - \bar{x}\|^2 / \sigma^2). \tag{2.8}$$

Now consider equation 1.2. Using the simplification given above, we can write the first (quadratic) term of the objective function in equation 1.2 as

$$\sum_i \sum_j \alpha_i \alpha_j y_i y_j K(x_i, x_j) = T_1 + \frac{1}{2\sigma^2}(T_2 + T_3 + T_4) + \sum_i \sum_j \alpha_i \alpha_j y_i y_j \frac{\Delta_{ij}}{\sigma^2},$$

where

$$T_1 = \sum_i \sum_j \alpha_i \alpha_j y_i y_j, \qquad T_2 = -\sum_i \sum_j \alpha_i \alpha_j y_i y_j \|x_j\|^2,$$
$$T_3 = -\sum_i \sum_j \alpha_i \alpha_j y_i y_j \|x_i\|^2, \qquad T_4 = 2 \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j,$$

and

$$\lim_{\sigma^2 \to \infty} \Delta_{ij} = 0. \tag{2.9}$$
By the equality constraint of equation 1.2, $T_1 = (\sum_i \alpha_i y_i)^2 = 0$. We can also rewrite $T_2$ as $T_2 = -(\sum_i \alpha_i y_i \|x_i\|^2)(\sum_j \alpha_j y_j) = 0$. In a similar way, $T_3 = 0$. By defining

$$\tilde{\alpha}_i = \frac{\alpha_i}{\sigma^2} \quad \forall i, \tag{2.10}$$

equation 1.2 can be written as⁴

$$\min_{\tilde{\alpha}} \quad F = \frac{1}{2} \sum_i \sum_j \tilde{\alpha}_i \tilde{\alpha}_j y_i y_j \tilde{K}_{ij} - \sum_i \tilde{\alpha}_i$$
$$\text{subject to} \quad 0 \le \tilde{\alpha}_i \le \tilde{C}, \; i = 1, \ldots, l, \qquad y^T \tilde{\alpha} = 0, \tag{2.11}$$

where $\tilde{K}_{ij} = x_i^T x_j + \Delta_{ij}$ and

$$\tilde{C} = \frac{C}{\sigma^2}. \tag{2.12}$$

⁴ To do this, note that we need to divide the objective function by the term $\sigma^2$.
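The rescaling behind equations 2.10 to 2.12 can be sanity-checked numerically: by equation 2.8, $\tilde{K}_{ij} = \sigma^2 (K_{ij} - 1) + (\|x_i\|^2 + \|x_j\|^2)/2$ should approach the linear kernel $x_i^T x_j$ as $\sigma^2$ grows. A minimal sketch (the test vectors are arbitrary choices):

```python
import numpy as np

def k_tilde(xi, xj, s2):
    """Rescaled kernel of eq. 2.11: sigma^2*(K_ij - 1) + (||x_i||^2 + ||x_j||^2)/2,
    with K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    K = np.exp(-np.sum((xi - xj) ** 2) / (2.0 * s2))
    return s2 * (K - 1.0) + 0.5 * (xi @ xi + xj @ xj)

xi = np.array([1.0, -0.5, 2.0])
xj = np.array([0.3, 1.2, -1.0])
for s2 in (1e1, 1e3, 1e5):
    print(s2, k_tilde(xi, xj, s2), xi @ xj)   # second value approaches x_i^T x_j
```

The discrepancy shrinks like $O(1/\sigma^2)$, which is exactly the statement $\Delta_{ij} \to 0$ in equation 2.9.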
Remark 3. Note that $\tilde{K}_{ij}$ may not correspond to a valid kernel satisfying Mercer's condition. But that is immaterial since we always operate with the constraint $y^T \tilde{\alpha} = 0$. In the presence of this constraint, equations 1.2 and 2.11 are equivalent.

Remark 4. If C is fixed at some value and $\sigma^2$ is made large, $\tilde{C}$ of equation 2.11 goes to zero, and so the situation is similar to case 1, discussed at the beginning of this section. By equation 2.9, $\tilde{K}_{ij}$ is a bounded function for large $\sigma^2$ (or, equivalently, for small $\tilde{C}$). By the last sentence of remark 1, the results of case 1 can be applied here. Thus, for C fixed and $\sigma^2 \to \infty$, equation 2.11 corresponds to a severely underfitting classifier. Since equations 2.11 and 1.2 correspond to the same problem in different forms, they have the same primal decision function (for full details, see equation A.8). Therefore, in this situation, we get a severely underfitting classifier.

For a given $\tilde{C}$, as $\sigma^2 \to \infty$ and C varies with $\sigma^2$ as given by equation 2.12, we can see that equation 2.11 is close to the following linear SVM problem:

$$\min_{\tilde{\alpha}} \quad \frac{1}{2} \sum_i \sum_j \tilde{\alpha}_i \tilde{\alpha}_j y_i y_j x_i^T x_j - \sum_i \tilde{\alpha}_i$$
$$\text{subject to} \quad 0 \le \tilde{\alpha}_i \le \tilde{C}, \; i = 1, \ldots, l, \qquad y^T \tilde{\alpha} = 0. \tag{2.13}$$

We are interested in their corresponding decision functions, which can lead us to analyze the performance of equation 1.1. Now the primal form of equation 2.13 is

$$\min_{\tilde{w}, \tilde{b}, \tilde{\xi}} \quad \frac{1}{2} \tilde{w}^T \tilde{w} + \tilde{C} \sum_{i=1}^{l} \tilde{\xi}_i$$
$$\text{subject to} \quad y_i(\tilde{w}^T x_i + \tilde{b}) \ge 1 - \tilde{\xi}_i, \qquad \tilde{\xi}_i \ge 0, \; i = 1, \ldots, l. \tag{2.14}$$
Let $(w(\sigma^2), b(\sigma^2))$ and $(\tilde{w}, \tilde{b})$ denote primal optimal solutions of equations 1.1 and 2.14, respectively. We then have the following theorem:

Theorem 1. For any x, $\lim_{\sigma^2 \to \infty} w(\sigma^2)^T z = \tilde{w}^T x$. If the optimal $\tilde{b}$ of equation 2.14 is unique, then $\lim_{\sigma^2 \to \infty} b(\sigma^2) = \tilde{b}$, and hence the following also hold:

1. For any x,

$$\lim_{\sigma^2 \to \infty} w(\sigma^2)^T z + b(\sigma^2) = \tilde{w}^T x + \tilde{b}. \tag{2.15}$$
2. If $\tilde{w}^T x + \tilde{b} \ne 0$, then for $\sigma^2$ sufficiently large,

$$\mathrm{sgn}(w(\sigma^2)^T z + b(\sigma^2)) = \mathrm{sgn}(\tilde{w}^T x + \tilde{b}).$$
The proof is in section A.3. Thus, for a given $\tilde{C}$, the limiting SVM gaussian kernel classifier as $\sigma^2 \to \infty$ is the same as the SVM linear kernel classifier for $\tilde{C}$. Hereafter, we will simply refer to the SVM linear kernel classifier as linear SVM. The above analysis can also be extended to show that as $\sigma^2 \to \infty$, the LOO error corresponding to equations 1.1 and 2.13 is the same.

The above results also show that in the part of the hyperparameter space where $\sigma^2$ is large, if $(C_1, \sigma_1^2)$ and $(C_2, \sigma_2^2)$ are related by $C_1/\sigma_1^2 = C_2/\sigma_2^2 = \tilde{C}$, the classifiers corresponding to the two combinations are nearly the same. Hence, both will give nearly the same value for generalization error (or an estimate of it, such as k-fold cross-validation error or LOO error). Thus, in this part of the hyperparameter space, contour lines of such functions will be straight lines with slope 1: $\log \sigma^2 = \log C - \log \tilde{C}$. Then all classifiers defined by points on that straight line for large $\sigma^2$ are nearly the same as the linear SVM classifier corresponding to $\tilde{C}$.

Given that for any x, $\lim_{\sigma^2 \to \infty} w(\sigma^2)^T z = \tilde{w}^T x$ holds without any assumption, the assumption on the uniqueness of $\tilde{b}$ in theorem 1 should be viewed as only a minor technical irritant.⁵ For normal situations, the uniqueness assumption is a reasonable one to make. Unless $\tilde{C}$ is very small, typically there will be at least one $\tilde{\alpha}_i$ strictly in between 0 and $\tilde{C}$; when such an $\tilde{\alpha}_i$ exists, lemma 3 in section A.1 (as applied to equation 2.14) implies the uniqueness of $\tilde{b}$. The case of a very small $\tilde{C}$ corresponds to the upper left part of the plane in which $\log C$ and $\log \sigma^2$ are the horizontal and vertical axes. We can easily see this by considering C fixed and increasing $\sigma^2$ to large values (the upper part) or considering $\sigma^2$ fixed and decreasing C to small values (the left part). As remark 4 and case 1 of this section show, each of these asymptotic behaviors corresponds to a severely underfitting SVM decision function.
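This near-equivalence along lines of unit slope is easy to observe empirically. The sketch below uses scikit-learn rather than the LIBSVM command line used in the paper (an assumption about the available toolchain); in scikit-learn's parameterization $K(x, x') = \exp(-\gamma \|x - x'\|^2)$, so $\gamma = 1/(2\sigma^2)$. Two $(C, \sigma^2)$ pairs with the same ratio $C/\sigma^2$ and large $\sigma^2$ should produce nearly identical classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Two (C, sigma^2) pairs on the same unit-slope line: C / sigma^2 = 1.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
preds = []
for C, s2 in [(50.0, 50.0), (5000.0, 5000.0)]:
    clf = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * s2)).fit(X, y)
    preds.append(clf.predict(X))
agreement = np.mean(preds[0] == preds[1])
print(agreement)    # close to 1: both classifiers approximate the same linear SVM
```

Both points sit on the line $\log \sigma^2 = \log C - \log \tilde{C}$ with $\tilde{C} = 1$, so by theorem 1 both decision functions approximate the linear SVM with penalty parameter 1.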
Finally, theorem 1 also indicates that if complete model selection on $(C, \sigma^2)$ using the gaussian kernel has been conducted, there is no need to consider linear SVM. This helps in the selection of kernels.

3 A Method of Model Selection

It is usual to take $\log C$ and $\log \sigma^2$ as the parameters of the hyperparameter space. Putting together the results derived in the previous section, it is easy to see that in the asymptotic (outer) regions of the $(\log C, \log \sigma^2)$

⁵ The assumption is needed to state results cleanly. If $\tilde{b}$ is nonunique, SVM classifiers also become nonunique, and then it becomes clumsy to talk about convergence of SVM decision functions.
Figure 2: A rough boundary curve separating the underfitting/overfitting region from the "good" region. For each fixed $\tilde{C}$, the equation $\log \sigma^2 = \log C - \log \tilde{C}$ defines a straight line of unit slope. As $\sigma^2 \to \infty$ along this line, the SVM classifier converges to the linear SVM classifier with penalty parameter $\tilde{C}$. The dotted line corresponds to the choice of $\tilde{C}$ that gives the optimal generalization error for the linear SVM.
space, there exists a contour of generalization error (or an estimate such as LOO error or k-fold cross-validation error) that looks like that shown in Figure 2 and helps separate the hyperparameter space into two regions: an overfitting/underfitting region and a good region (which most likely has the hyperparameter set with the best generalization error). (For LOO, recall that in the underfitting/overfitting region, the number of LOO errors is $l_2$.) The straight line with unit slope in the large $\sigma^2$ region ($\log \sigma^2 = \log C - \log \tilde{C}$) corresponds to the choice of $\tilde{C}$, which is small enough to make the linear SVM an underfitting one. The presence of a separating contour as outlined in Figure 2 has been observed on a number of real-world data sets (Lee, 2001).

When searching for a good set of values for $\log C$ and $\log \sigma^2$, it is usual to form a two-dimensional uniform grid (say $r \times r$) of points in this space and find a combination that gives the least value for some estimate of generalization error. This is expensive since it requires trying $r^2$ $(C, \sigma^2)$ pairs. The earlier discussion relating to Figure 2 suggests a simple and efficient heuristic method for finding a hyperparameter set with small generalization error: form a line of unit slope that cuts through the middle part of the good region (see the dashed line in Figure 2) and search on it for a good set of hyperparameters. The $\tilde{C}$ that defines this line can be set to the optimal value
of penalty parameter for the linear SVM. Thus, we propose the following procedure:

1. Search for the best C of linear SVM and call it $\tilde{C}$.
2. Fix $\tilde{C}$ from step 1, and search for the best $(C, \sigma^2)$ satisfying $\log \sigma^2 = \log C - \log \tilde{C}$ using the gaussian kernel.
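A compact sketch of this two-step procedure (scikit-learn is assumed here; the grids, the synthetic data set, and the cross-validation splits are illustrative choices rather than the paper's experimental setup, and $\gamma = 1/(2\sigma^2)$ maps the paper's $\sigma^2$ onto scikit-learn's RBF parameter):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, random_state=1)
cv_acc = lambda clf: cross_val_score(clf, X, y, cv=5).mean()

# Step 1: one-dimensional search for the best C of the linear SVM.
C_tilde = max(2.0 ** np.arange(-8, 3),
              key=lambda C: cv_acc(SVC(C=C, kernel="linear")))

# Step 2: search (C, sigma^2) on the line log sigma^2 = log C - log C_tilde,
# i.e. sigma^2 = C / C_tilde, using the gaussian kernel.
candidates = [(C, C / C_tilde) for C in 2.0 ** np.arange(-4, 9)]
best_C, best_s2 = max(candidates,
                      key=lambda p: cv_acc(SVC(C=p[0], kernel="rbf",
                                               gamma=1.0 / (2.0 * p[1]))))
best_acc = cv_acc(SVC(C=best_C, kernel="rbf", gamma=1.0 / (2.0 * best_s2)))
print(C_tilde, best_C, best_s2, best_acc)
```

Only two one-dimensional searches are needed (here 11 + 13 candidate points), instead of the $r^2$ fits of a full grid.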
The idea is that as $\sigma^2 \to \infty$, SVM with gaussian kernel behaves like linear SVM, and so the best $\tilde{C}$ should happen in the upper part of the "good" region in Figure 2. Then a search on the line defined by $\log \sigma^2 = \log C - \log \tilde{C}$ gives an even better point in the "good" region. In many practical pattern recognition problems, a linear classifier already gives a reasonably good performance, and some added nonlinearities help obtain finer improvements in accuracy. Step 2 of our procedure can be thought of as a simple way of injecting the required nonlinearities via the gaussian kernel. Since the procedure involves only two one-dimensional searches, it requires only 2r pairs of $(C, \sigma^2)$ to be tried.

To test the goodness of the proposed method, we compare it with the usual method of using a two-dimensional grid search. For both, fivefold cross-validation was used to obtain estimates of generalization error. For the usual method, we uniformly discretize the $[-10, 10] \times [-10, 10]$ region to $21^2 = 441$ points. At each point, a fivefold cross-validation is conducted. The point with the best CV accuracy is chosen and used to predict the test data. For the proposed method, we search for $\tilde{C}$ by fivefold cross-validation on linear SVM using uniformly spaced $\log C$ values in $[-8, 2]$. Then we discretize $[-8, 8]$ as values of $\log \sigma^2$ and check all points satisfying $\log \sigma^2 = \log C - \log \tilde{C}$. Because now fewer points have to be tried, we use the smaller grid spacing of 0.5 for both discretizations. The total number of points tried is 54.

To evaluate empirically the usefulness of the proposed method, we consider several binary problems from Rätsch (1999). For each problem, Rätsch (1999) gives 100 realizations of the given data set into (training set, test set) partitions. We consider only the first of those realizations. In addition, the problem adult, from the UCI "adult" data set (Blake & Merz, 1998), and the problem web, both as compiled by Platt (1998), are also included.
For each of these two data sets, there are several realizations. For our study here, we consider only the realization with the smallest training set; the full data set with training data (including duplicated ones) removed is taken as the test set. For all data sets used, Table 1 gives the number of input variables, the number of training examples, and the number of test examples. All data sets are directly used as given in the mentioned references, without any further normalization or scaling. The SVM software LIBSVM (Chang & Lin, 2001a), which implements a decomposition method, is employed for solving equation 1.2. Table 1 presents
Table 1: Comparison of the Model Selection Methods.

Problem    Number of   Number of Training   Number of Test   Test Set Error of     Test Set Error of
           Inputs      Examples             Examples         Usual Grid Method     Proposed Method
banana       2           400                  4900           0.1235  (6, -0)       0.1178  (-2, -2)
diabetes     8           468                   300           0.2433  (4, 7)        0.2433  (4, 6)
image       18          1300                  1010           0.02475 (9, 4)        0.02475 (1, 0.5)
splice      60          1000                  2175           0.09701 (1, 4)        0.1011  (0, 4)
ringnorm    20           400                  7000           0.01429 (-2, 2)       0.018   (-3, 2)
twonorm     20           400                  7000           0.031   (1, 3)        0.02914 (1, 4)
waveform    21           400                  4600           0.1078  (0, 3)        0.1078  (0, 3)
tree        18           700                11,692           0.1132  (8, 4)        0.1246  (2, 2)
adult      123          1605                29,589           0.1614  (5, 6)        0.1614  (5, 6)
web        300          2477                38,994           0.02223 (5, 5)        0.02223 (5, 5)

Note: For each approach, apart from the test error, the optimal $(\log C, \log \sigma^2)$ pair is also given.
the test error of the two methods, as well as the corresponding chosen values of $\log C$ and $\log \sigma^2$. It can be clearly seen that the new method is very competitive with the usual method in terms of test set accuracy. For large data sets, the proposed method has the great advantage that it checks many fewer points on the $(\log C, \log \sigma^2)$ plane, and so the savings in computing time can be large. Note that in the chosen problems, the following quantities have a reasonably wide range: test error (1.5% to 25%), the number of input variables (2 to 300), and the number of training examples (400 to 2477), and so the empirical evaluation demonstrates the applicability of the proposed approach to different types of data sets.

A remaining issue is how to decide the range of $\log C$ for determining $\tilde{C}$ in step 1. From Table 1, we can see that $\log \tilde{C} = \log C - \log \sigma^2$ is usually not a large number. Furthermore, we observe that for all problems, after C is greater than a certain threshold, the cross-validation accuracy of the linear SVM is about the same. Therefore, if we start searching from small C values and go on to large C values, the search can be stopped after the CV accuracy stops varying much. An example of the variation of the fivefold CV accuracy of linear SVM is given in Figure 3.

For linear SVMs, we can formally establish that there exists a finite limiting value $C^*$ such that for $C \ge C^*$, the solution of the linear SVM remains unchanged. If $\{x_i : y_i = 1\}$ and $\{x_i : y_i = -1\}$ are linearly separable, then the above result is easy to appreciate; the same ideas used in case 2 can be applied to show this. However, if $\{x_i : y_i = 1\}$ and $\{x_i : y_i = -1\}$ are not linearly separable (which is typically the case), the result is nontrivial to
Figure 3: Variation of CV accuracy of linear SVM with C for the image problem. (The plot shows the fivefold CV rate, roughly 70% to 84%, against log(C) from -8 to 2.)
establish. Here, we prove the following theorem:

Theorem 2. There exists a finite value $C^*$ and $(w^*, b^*)$ such that $(w, b) = (w^*, b^*)$ solves equation 1.2, $\forall C \ge C^*$. If this decision function is used, the LOO error is the same for all $C \ge C^*$. Moreover, this $w^*$ is unique.

Details of the proof are in section A.4.

Appendix

A.1 Two Useful Lemmas.

Lemma 2. Consider an optimization problem of the form 1.2 in which Q is a function of $\sigma^2$ (denoted as $Q(\sigma^2)$), and let $\alpha(\sigma^2)$ be its solution. For a given number a, if

$$Q^* \equiv \lim_{\sigma^2 \to a} Q(\sigma^2)$$

exists, then there exists a convergent sequence $\{\alpha(\sigma_k^2)\}$ with $\sigma_k^2 \to a$, and the limit of any such sequence is an optimal solution of equation 1.2 with the Hessian matrix $Q^*$. Moreover, if $Q^*$ is positive definite, $\lim_{\sigma^2 \to a} \alpha(\sigma^2)$ exists.
Proof. The feasible region of equation 1.2 is independent of $\sigma^2$ and is compact. Then there exists a convergent sequence $\{\alpha(\sigma_k^2)\}$ with $\lim_{k \to \infty} \sigma_k^2 = a$. For any one such sequence, we have

$$\frac{1}{2}\alpha(\sigma_k^2)^T Q(\sigma_k^2)\alpha(\sigma_k^2) - e^T \alpha(\sigma_k^2) \le \frac{1}{2}(\alpha^*)^T Q(\sigma_k^2)\alpha^* - e^T \alpha^*, \quad \text{and}$$
$$\frac{1}{2}(\alpha^*)^T Q^* \alpha^* - e^T \alpha^* \le \frac{1}{2}\alpha(\sigma_k^2)^T Q^* \alpha(\sigma_k^2) - e^T \alpha(\sigma_k^2), \tag{A.1}$$

where $\alpha^*$ is any optimal solution of equation 1.2 with the Hessian matrix $Q^*$. If $\alpha(\sigma_k^2)$ goes to $\bar{\alpha}$, taking the limit of equation A.1,

$$\frac{1}{2}(\alpha^*)^T Q^* \alpha^* - e^T \alpha^* = \frac{1}{2}\bar{\alpha}^T Q^* \bar{\alpha} - e^T \bar{\alpha}.$$

Thus, $\bar{\alpha}$ is an optimal solution too. If $Q^*$ is positive definite, equation 1.2 is a strictly convex problem with a unique optimal solution. This implies that $\lim_{\sigma^2 \to a} \alpha(\sigma^2)$ exists.

Lemma 3. If equation 1.2 has an optimal solution with at least one free variable (i.e., $0 < \alpha_i < C$ for at least one i), then the optimal b of equation 1.1 is unique.

Proof. The Karush-Kuhn-Tucker (KKT) condition (i.e., the optimality condition) of equation 1.2 is that if $\alpha$ is an optimal solution, there are a number b and two nonnegative vectors $\lambda$ and $\mu$ such that

$$\nabla F(\alpha) + b y = \lambda - \mu,$$
$$\lambda_i \alpha_i = 0, \quad \mu_i (C - \alpha_i) = 0, \quad \lambda_i \ge 0, \quad \mu_i \ge 0, \quad i = 1, \ldots, l,$$

where $\nabla F(\alpha) = Q\alpha - e$ is the gradient of $F(\alpha) = \frac{1}{2}\alpha^T Q \alpha - e^T \alpha$. This can be rewritten as

$$\nabla F(\alpha)_i + b y_i \ge 0 \quad \text{if } \alpha_i = 0,$$
$$\nabla F(\alpha)_i + b y_i \le 0 \quad \text{if } \alpha_i = C, \tag{A.2}$$
$$\nabla F(\alpha)_i + b y_i = 0 \quad \text{if } 0 < \alpha_i < C.$$

Note that $\nabla F(\alpha)_i = y_i w^T z_i - 1$ is independent of different optimal solutions $\alpha$, as the primal optimal solution w is unique. Let $(w, b, \xi)$ denote a primal solution. As already said, w is unique. By convexity of the solution set, the set of all possible b solutions, B, is an interval.
Once b is chosen, $\xi$ is uniquely defined. By assumption, there exists $b \in B$ and a corresponding Lagrange multiplier vector $\alpha(b)$ with a free alpha, say, $0 < \alpha(b)_k < C$. Thus, $\alpha(b)$ is an optimal solution of equation 1.2, and so, by equation A.2, $\nabla F(\alpha)_k + b y_k = 0$. Denote

$$A_0 = \{i \mid \nabla F(\alpha)_i + b y_i > 0\}, \quad A_C = \{i \mid \nabla F(\alpha)_i + b y_i < 0\}, \quad \text{and} \quad A_F = \{i \mid i \notin A_0 \cup A_C\}.$$

Let us define e and f to be the infimum and supremum of the following set:

$$\{b_{\mathrm{new}} \mid \nabla F(\alpha)_i + b_{\mathrm{new}} y_i > 0 \text{ if } i \in A_0; \; \nabla F(\alpha)_i + b_{\mathrm{new}} y_i < 0 \text{ if } i \in A_C\}. \tag{A.3}$$

Clearly $e < b < f$. Suppose B is not a singleton. Now choose $b_{\mathrm{new}} \in B \cap [e, f]$ such that $b_{\mathrm{new}} \ne b$. Let $\alpha(b_{\mathrm{new}})$ be any Lagrange multiplier corresponding to $b_{\mathrm{new}}$. Thus, $\alpha(b_{\mathrm{new}})$ and $b_{\mathrm{new}}$ satisfy equation A.2. Suppose $b_{\mathrm{new}} > b$ and $y_k = 1$. Then $\nabla F(\alpha)_k + b_{\mathrm{new}} y_k > 0$, so

$$\alpha(b_{\mathrm{new}})_k = 0 < \alpha(b)_k. \tag{A.4}$$

If we use equation A.2 as applied to $(b, \alpha(b))$ and $(b_{\mathrm{new}}, \alpha(b_{\mathrm{new}}))$, equation A.3 implies the following: $\alpha(b)_i = \alpha(b_{\mathrm{new}})_i$ $\forall i \in A_0 \cup A_C$ with $y_i = y_k$; also, $\alpha(b)_i \ge \alpha(b_{\mathrm{new}})_i$ $\forall i \in A_F$ with $y_i = y_k$. Note that k is an element of this second group. Thus, with equation A.4,

$$\sum_{i: y_i = y_k} \alpha(b)_i > \sum_{i: y_i = y_k} \alpha(b_{\mathrm{new}})_i. \tag{A.5}$$

This is a violation of the fact that both $\alpha(b)$ and $\alpha(b_{\mathrm{new}})$ are solutions of equation 1.2 since, for a given dual solution $\alpha$, the dual cost is $(\|w\|^2/2) - 2\sum_{i: y_i = 1} \alpha_i$, and the first term is the same for $\alpha(b)$ as well as $\alpha(b_{\mathrm{new}})$. If $y_k = -1$, the proof is the same, but equation A.5 becomes $\sum_{i: y_i = y_k} \alpha(b)_i < \sum_{i: y_i = y_k} \alpha(b_{\mathrm{new}})_i$. A similar contradiction can be reached if $b_{\mathrm{new}} < b$. Thus, B is a singleton and b is unique.

A.2 Optimal Solution of Equation 2.5. The KKT conditions applied to equation 2.5 correspond to the existence of a scalar b and two nonnegative vectors $\lambda$ and $\mu$ such that

$$\alpha_i - 1 + b y_i = \lambda_i - \mu_i, \quad \alpha_i \lambda_i = 0, \quad (C - \alpha_i)\mu_i = 0, \quad i = 1, \ldots, l. \tag{A.6}$$
To show that the solution is given by equations 2.6 and 2.7, all that we need to do is show the existence of $\lambda$ and $\mu$ so that equation A.6 holds. For the solution 2.6, when $C \ge C_{\lim}$, using b defined in equation 2.7, $\alpha_i - 1 + b y_i = 0$ $\forall i$, so we can simply choose $\lambda = \mu = 0$ so that equation A.6 is satisfied. If $C < C_{\lim}$,

$$\alpha_i - 1 + b y_i = \begin{cases} 0 & \text{if } y_i = 1, \\ \dfrac{Cl}{l_1} - 2 \le 0 & \text{if } y_i = -1, \end{cases}$$

so equation A.6 also holds. Therefore, equation 2.6 gives an optimal solution for equation 2.5.

Let us now analyze properties of the classifier function f associated with equation 2.5. Note using equation 2.7 that $b > 0$. For $x \ne x_i$, $\exp(-\|x - x_i\|^2 / 2\sigma^2) \to 0$ as $\sigma^2 \to 0$. Therefore, for such x, the classifier function corresponding to equation 2.5 is given by $f(x) = b$. Since $b > 0$, all points x not in the training set are classified as class 1 regardless of the value of C. This, together with item 2 of assumption 2, implies that the number of LOO errors is equal to $l_2$.

For a training point $x_i$, we have $f(x_i) \to y_i \alpha_i + b$ as $\sigma^2 \to 0$. Thus, once $\sigma^2$ is sufficiently small, all class 1 training points are classified correctly by f. For training points $x_i$ in class 2, we can use equations 2.6 and 2.7 to show that (i) for $C > C_{\lim}/2$, all of those points are classified correctly by f, and (ii) for $C \le C_{\lim}/2$, all of those points are classified incorrectly by f.

A.3 Proof of Theorem 1. To prove theorem 1, first we write down the primal form of equation 2.11:

$$\min_{\tilde{w}, \tilde{b}, \tilde{\xi}} \quad \frac{1}{2}\tilde{w}^T \tilde{w} + \tilde{C} \sum_{i=1}^{l} \tilde{\xi}_i$$
$$\text{subject to} \quad y_i(\tilde{w}^T \tilde{\phi}(x_i) + \tilde{b}) \ge 1 - \tilde{\xi}_i, \qquad \tilde{\xi}_i \ge 0, \; i = 1, \ldots, l, \tag{A.7}$$

where $\tilde{\phi}(x) \equiv \sigma \phi(x)$.⁶ By defining $w \equiv \sigma \tilde{w}$, multiplying the objective function of equation A.7 by $\sigma^2$, and using equation 2.12, equation A.7 has exactly the same form as equation 1.1, so we can say

$$b = \tilde{b}, \quad \xi = \tilde{\xi}, \quad \text{and} \quad \tilde{w}^T \tilde{\phi}(x) + \tilde{b} = w^T \phi(x) + b. \tag{A.8}$$

⁶ It should be pointed out that equation 2.11 is not directly the dual of equation A.7. The dual of equation A.7 reduces to equation 2.11 when the $y^T \tilde{\alpha} = 0$ constraint is used.
A difficulty in proving this theorem is that the solution of equation A.7 is an element of a vector space that is different from that of a solution of equation 2.14. Hence, to build the relation as $\sigma^2 \to \infty$, we will consider their duals using lemma 2.

Assume $\tilde{\alpha}(\sigma^2)$ is the solution of equation 2.11 under a given $\tilde{C}$. It is in a bounded region for all $\sigma^2 > 0$, so there is a convergent sequence $\tilde{\alpha}(\sigma_k^2) \to \tilde{\alpha}$ as $\sigma_k^2 \to \infty$. We can apply lemma 2, as now the ij component of the Hessian of equation 2.11 is a function of $\sigma^2$:

$$y_i y_j \tilde{K}_{ij} = \sigma^2 y_i y_j \left( e^{-\|x_i - x_j\|^2 / (2\sigma^2)} - 1 + \frac{\|x_i\|^2}{2\sigma^2} + \frac{\|x_j\|^2}{2\sigma^2} \right),$$

with the limit $y_i y_j x_i^T x_j$ as $\sigma^2 \to \infty$. Therefore, $\tilde{\alpha}(\sigma_k^2)$ converges to an optimal solution $\tilde{\alpha}$ of equation 2.13.

We denote by $\tilde{w}(\sigma^2)$ and $\tilde{w}$ the unique optimal solutions of equations A.7 and 2.14, respectively. Then, for any such convergent sequence $\{\tilde{\alpha}(\sigma_k^2)\}_{k=1}^{\infty}$, using $y^T \tilde{\alpha}(\sigma_k^2) = 0$, we have that for any x,

$$\lim_{\sigma_k^2 \to \infty} \tilde{w}(\sigma_k^2)^T \tilde{\phi}(x) = \lim_{\sigma_k^2 \to \infty} \sum_{i=1}^{l} y_i \tilde{\alpha}(\sigma_k^2)_i \, \tilde{\phi}(x_i)^T \tilde{\phi}(x)$$
$$= \lim_{\sigma_k^2 \to \infty} \sum_{i=1}^{l} y_i \tilde{\alpha}(\sigma_k^2)_i \left( \sigma_k^2 - \|x_i\|^2/2 + x_i^T x - \|x\|^2/2 \right) \tag{A.9}$$
$$= \lim_{\sigma_k^2 \to \infty} \sum_{i=1}^{l} y_i \tilde{\alpha}(\sigma_k^2)_i \left( -\|x_i\|^2/2 + x_i^T x \right)$$
$$= \sum_{i=1}^{l} y_i \tilde{\alpha}_i x_i^T x + d(\tilde{\alpha}) = \tilde{w}^T x + d(\tilde{\alpha}), \tag{A.10}$$

where equation A.9 follows from equation 2.8 and $y^T \tilde{\alpha}(\sigma_k^2) = 0$, and $d(\tilde{\alpha}) \equiv -\sum_{i=1}^{l} y_i \tilde{\alpha}_i \|x_i\|^2 / 2$. In a similar way, we can prove

$$\lim_{\sigma_k^2 \to \infty} \tilde{w}(\sigma_k^2)^T \tilde{w}(\sigma_k^2) = \lim_{\sigma_k^2 \to \infty} \sum_{i=1}^{l} \sum_{j=1}^{l} \tilde{\alpha}(\sigma_k^2)_i \tilde{\alpha}(\sigma_k^2)_j y_i y_j \tilde{\phi}(x_i)^T \tilde{\phi}(x_j)$$
$$= \sum_{i=1}^{l} \sum_{j=1}^{l} \tilde{\alpha}_i \tilde{\alpha}_j y_i y_j x_i^T x_j = \tilde{w}^T \tilde{w}. \tag{A.11}$$
Note that equation A.11 follows from the discussion between equations 2.8 and 2.11. The last equality is via $\tilde{w} = \sum_{i=1}^{l} y_i \tilde{\alpha}_i x_i$, as $\tilde{\alpha}$ is the optimal solution of equation 2.13.

Next we consider that $(\tilde{w}, \tilde{b})$ is the unique optimal solution of equation 2.14. The constraints of equation A.7 imply that

$$\max_{y_i = 1} \{1 - \tilde{w}(\sigma^2)^T \tilde{\phi}(x_i) - \tilde{\xi}(\sigma^2)_i\} \le \tilde{b}(\sigma^2) \le \min_{y_i = -1} \{-1 - \tilde{w}(\sigma^2)^T \tilde{\phi}(x_i) + \tilde{\xi}(\sigma^2)_i\}.$$

Note that the primal-dual optimality condition implies

$$0 \le \tilde{\xi}(\sigma^2)_i \le \sum_{i=1}^{l} \tilde{\xi}(\sigma^2)_i \le \frac{e^T \tilde{\alpha}(\sigma^2)}{\tilde{C}} \le l.$$

With equation A.9 and the assumption $l_1 \ge 1$ and $l_2 \ge 1$, once $\sigma^2$ is large enough, $\tilde{b}(\sigma^2)$ is in a bounded region. When $(\tilde{w}(\sigma^2), \tilde{b}(\sigma^2))$ is optimal for equation A.7, the optimal $\tilde{\xi}(\sigma^2)$ is

$$\tilde{\xi}(\sigma^2)_i \equiv \max(0, 1 - y_i(\tilde{w}(\sigma^2)^T \tilde{\phi}(x_i) + \tilde{b}(\sigma^2))).$$

For any convergent sequence $\tilde{b}(\sigma_k^2) \to b^*$ with $\sigma_k^2 \to \infty$, we can further extract a subsequence such that $\{\tilde{\alpha}(\sigma_k^2)\}$ converges. Thus, we can consider any such sequence with both properties. Then equation A.10 implies

$$\tilde{\xi}(\sigma_k^2)_i \to \xi_i^* = \max(0, 1 - y_i(\tilde{w}^T x_i + d(\tilde{\alpha}) + b^*)). \tag{A.12}$$

Hence, $(\tilde{w}, b^* + d(\tilde{\alpha}), \xi^*)$ is feasible for equation 2.14. By defining

$$\bar{\xi}(\sigma_k^2)_i \equiv \max(0, 1 - y_i(\tilde{w}(\sigma_k^2)^T \tilde{\phi}(x_i) - d(\tilde{\alpha}) + \tilde{b})),$$

$(\tilde{w}(\sigma_k^2), \tilde{b} - d(\tilde{\alpha}), \bar{\xi}(\sigma_k^2))$ is feasible for equation A.7. In addition, using equation A.10,

$$\tilde{\xi}_i \equiv \lim_{\sigma_k^2 \to \infty} \bar{\xi}(\sigma_k^2)_i = \max(0, 1 - y_i(\tilde{w}^T x_i + \tilde{b})), \tag{A.13}$$

so $(\tilde{w}, \tilde{b}, \tilde{\xi})$ is optimal for equation 2.14. Thus,

$$\frac{1}{2}\tilde{w}^T \tilde{w} + \tilde{C} \sum_{i=1}^{l} \tilde{\xi}_i \le \frac{1}{2}\tilde{w}^T \tilde{w} + \tilde{C} \sum_{i=1}^{l} \xi_i^*, \quad \text{and}$$
$$\frac{1}{2}\tilde{w}(\sigma_k^2)^T \tilde{w}(\sigma_k^2) + \tilde{C} \sum_{i=1}^{l} \tilde{\xi}(\sigma_k^2)_i \le \frac{1}{2}\tilde{w}(\sigma_k^2)^T \tilde{w}(\sigma_k^2) + \tilde{C} \sum_{i=1}^{l} \bar{\xi}(\sigma_k^2)_i. \tag{A.14}$$
With equations A.11, A.12, and A.13, taking the limit, A.14 becomes

$$\frac{1}{2}\tilde{w}^T \tilde{w} + \tilde{C} \sum_{i=1}^{l} \xi_i^* \le \frac{1}{2}\tilde{w}^T \tilde{w} + \tilde{C} \sum_{i=1}^{l} \tilde{\xi}_i.$$

Therefore, we have that $(\tilde{w}, b^* + d(\tilde{\alpha}), \xi^*)$ is optimal for equation 2.14. Since $\tilde{b}$ is unique by assumption,

$$b^* + d(\tilde{\alpha}) = \tilde{b}. \tag{A.15}$$

Now we are ready to prove the main result, equation 2.15. If it is wrong, there are $\epsilon > 0$ and a sequence $\{\tilde{w}(\sigma_k^2)\}$ with $\sigma_k^2 \to \infty$ such that

$$|\tilde{w}(\sigma_k^2)^T \tilde{\phi}(x) + b(\sigma_k^2) - \tilde{w}^T x - \tilde{b}| \ge \epsilon \quad \forall k. \tag{A.16}$$

Since we can find an infinite subset K such that $\lim_{k \in K, \sigma_k^2 \to \infty} \tilde{b}(\sigma_k^2) = b^*$ and equation A.10 holds, with $b(\sigma^2) = \tilde{b}(\sigma^2)$ from equation A.8, the above analysis (equations A.10 and A.15) shows that

$$\lim_{k \in K, \sigma_k^2 \to \infty} w(\sigma_k^2)^T \phi(x) + b(\sigma_k^2) = \tilde{w}^T x + d(\tilde{\alpha}) + \tilde{b} - d(\tilde{\alpha}) = \tilde{w}^T x + \tilde{b}.$$

This contradicts equation A.16, so equation 2.15 is valid. Therefore, if $\tilde{w}^T x + \tilde{b} \ne 0$, once $\sigma^2$ is sufficiently large,

$$\mathrm{sgn}(w(\sigma^2)^T \phi(x) + b(\sigma^2)) = \mathrm{sgn}(\tilde{w}^T x + \tilde{b}).$$

A.4 Proof of Theorem 2. Let $\alpha^1$ be a feasible vector of equation 1.2 for $C = C_1$ and $\alpha^2$ be a feasible vector of equation 1.2 for $C = C_2$. We say that $\alpha^1$ and $\alpha^2$ are on the same face if the following hold: (i) $\{i \mid 0 < \alpha_i^1 < C_1\} = \{i \mid 0 < \alpha_i^2 < C_2\}$; (ii) $\{i \mid \alpha_i^1 = C_1\} = \{i \mid \alpha_i^2 = C_2\}$; and (iii) $\{i \mid \alpha_i^1 = 0\} = \{i \mid \alpha_i^2 = 0\}$.

To prove theorem 2, we need the following result:

Lemma 4. If $C_1 < C_2$ and their corresponding duals have optimal solutions at the same face, then for any $C_1 \le C \le C_2$, there is at least one optimal solution at the same face. Furthermore, there are optimal solutions $\alpha$ and b of equation 1.2 that form linear functions of C in $[C_1, C_2]$.

Proof of Lemma 4. If $\alpha^1$ and $\alpha^2$ are optimal solutions at the same face corresponding to $C_1$ and $C_2$, then they satisfy the following KKT conditions,
respectively:

$$Q\alpha^1 - e + b^1 y = \lambda^1 - \mu^1, \quad \lambda_i^1 \alpha_i^1 = 0, \quad (C_1 - \alpha_i^1)\mu_i^1 = 0,$$
$$Q\alpha^2 - e + b^2 y = \lambda^2 - \mu^2, \quad \lambda_i^2 \alpha_i^2 = 0, \quad (C_2 - \alpha_i^2)\mu_i^2 = 0.$$

Since they are at the same face,

$$\lambda_i^2 \alpha_i^1 = 0, \quad \lambda_i^1 \alpha_i^2 = 0, \quad (C_1 - \alpha_i^1)\mu_i^2 = 0, \quad (C_2 - \alpha_i^2)\mu_i^1 = 0. \tag{A.17}$$

As $C_1 \le C \le C_2$, we can have $0 \le \tau \le 1$ such that

$$C = \tau C_1 + (1 - \tau) C_2. \tag{A.18}$$

Let

$$\alpha \equiv \tau \alpha^1 + (1 - \tau)\alpha^2, \quad \lambda \equiv \tau \lambda^1 + (1 - \tau)\lambda^2, \quad \mu \equiv \tau \mu^1 + (1 - \tau)\mu^2, \quad b \equiv \tau b^1 + (1 - \tau)b^2. \tag{A.19}$$

Then $\alpha$, $\lambda$, $\mu$, b satisfy the KKT condition at C:

$$Q\alpha - e + b y = \lambda - \mu, \quad 0 \le \alpha_i \le C, \quad \lambda_i \ge 0, \quad \mu_i \ge 0, \quad \lambda_i \alpha_i = 0, \quad (C - \alpha_i)\mu_i = 0, \quad y^T \alpha = 0.$$

Using equation A.18,

$$\tau = \frac{C - C_2}{C_1 - C_2}.$$
Putting it into equation A.19, $\alpha$ and b are linear functions of C for $C \in [C_1, C_2]$. This proves the lemma.

Let us now prove theorem 2.

Proof of Theorem 2. As we already mentioned, if the points of the two classes are linearly separable in x space, then the proof of the result is straightforward. So let us give a proof only for the case of linearly nonseparable points. Since the number of faces is finite, by lemma 4 there exists a $C^*$ such that for $C \ge C^*$, there are optimal solutions at the same face. For the rest of the proof, let us consider only optimal solutions on a single face. For any $C_1 > C^*$, lemma 4 implies that there are optimal solutions $\alpha$ and b, which form linear functions of C in the interval $[C^*, C_1]$. Since

$$\sum_{i=1}^{l} \xi_i = \sum_{i: \alpha_i = C} -[(Q\alpha)_i - 1 + b y_i], \tag{A.20}$$

$\sum_{i=1}^{l} \xi_i$ is a linear function of C in this interval and can be represented as

$$\sum_{i=1}^{l} \xi_i = AC + B, \tag{A.21}$$

where A and B are constants. If we consider another $C_2 > C_1$, $\sum_{i=1}^{l} \xi_i$ is also a linear function of C in $[C^*, C_2]$. For each C, the optimal $\frac{1}{2} w^T w$ as well as $\sum_{i=1}^{l} \xi_i$ are unique. Thus, the two linear functions have the same values at more than two points, so they are indeed identical. Therefore, equation A.21 holds for any $C \ge C^*$.

Since $\sum_{i=1}^{l} \xi_i$ is a decreasing function of C (e.g., using techniques similar to Chang and Lin (2001b, lemma 4)), $A \le 0$. However, A cannot be negative, as otherwise $\sum_{i=1}^{l} \xi_i$ goes to $-\infty$ as C increases. Hence, $A = 0$, and so $\sum_{i=1}^{l} \xi_i$ is a constant for $C \ge C^*$.

If $(w^1, b^1, \xi^1)$ and $(w^2, b^2, \xi^2)$ are optimal solutions at $C = C_1$ and $C_2$, respectively, then

$$\frac{1}{2}(w^1)^T w^1 + C_1 \sum_{i=1}^{l} \xi_i^1 \le \frac{1}{2}(w^2)^T w^2 + C_1 \sum_{i=1}^{l} \xi_i^2$$

and

$$\frac{1}{2}(w^2)^T w^2 + C_2 \sum_{i=1}^{l} \xi_i^2 \le \frac{1}{2}(w^1)^T w^1 + C_2 \sum_{i=1}^{l} \xi_i^1$$

imply that $(w^1)^T w^1 = (w^2)^T w^2$. That is, $\alpha^T Q \alpha$ is a constant for $C \ge C^*$. Therefore, $(w^2, b^2, \xi^2)$ is also feasible and optimal when $C = C_1$. Since the solution of w is unique (e.g., Lin, 2001, lemma 1), $w^1 = w^2$. If $F = \{i \mid 0 < \alpha_i < C\} \ne \emptyset$, there is $x_i$ such that $w^T x_i + b = y_i$. Hence $b^1 = b^2$, and so the decision functions as well as the LOO rates are the same for $C \ge C^*$. On the other hand, if $F = \emptyset$ and we denote by $\alpha(C)$ the solution of equation 1.2 at a given C, then $\alpha(C) = (C/C^*)\alpha(C^*)$ for all $C \ge C^*$. As $w^T w = \alpha^T Q \alpha$ becomes a constant, we have $w = 0$ for $C \ge C^*$. However, since $F = \emptyset$, the optimal b might not be unique under the same C. For any one such b, $(w, b)$ is optimal for equation 1.2 for all $C \ge C^*$. Finally, since $w^T w$ becomes a constant, for $C \ge C^*$, the solution of equation 1.2 is also a solution of

$$\min_{w, b, \xi} \quad \sum_{i=1}^{l} \xi_i$$
$$\text{subject to} \quad y_i(w^T x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \; i = 1, \ldots, l.$$

This completes the proof of theorem 2.
Acknowledgments

C.-J. L. was partially supported by the National Science Council of Taiwan, grant NSC 90-2213-E-002-111.

References

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases (Technical Rep.). Irvine, CA: University of California, Department of Information and Computer Science. Available on-line at: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Chang, C.-C., & Lin, C.-J. (2001a). LIBSVM: A library for support vector machines. Software available on-line at: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chang, C.-C., & Lin, C.-J. (2001b). Training ν-support vector classifiers: Theory and algorithms. Neural Computation, 13(9), 2119–2147.
Joachims, T. (2000). Estimating the generalization performance of a SVM efficiently. In Proceedings of the International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Lee, J.-H. (2001). Model selection of the bounded SVM formulation using the RBF kernel. Master's thesis, National Taiwan University.
Lin, C.-J. (2001). Formulations of support vector machines: A note from an optimization point of view. Neural Computation, 13(2), 307–317.
Micchelli, C. A. (1986). Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Constructive Approximation, 2, 11–22.
Platt, J. C. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Rätsch, G. (1999). Benchmark data sets. Available on-line at: http://ida.first.gmd.de/~raetsch/data/benchmarks.htm.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., & Chapelle, O. (2000). Bounds on error expectation for support vector machines. Neural Computation, 12(9), 2013–2036.
Received June 25, 2002; accepted December 5, 2002.
LETTER
Communicated by John Shawe-Taylor
Comparison of Model Selection for Regression Vladimir Cherkassky [email protected]
Yunqian Ma [email protected] Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, U.S.A.
We discuss empirical comparison of analytical methods for model selection. Currently, there is no consensus on the best method for finite-sample estimation problems, even for the simple case of linear estimators. This article presents empirical comparisons between classical statistical methods—Akaike information criterion (AIC) and Bayesian information criterion (BIC)—and the structural risk minimization (SRM) method, based on Vapnik-Chervonenkis (VC) theory, for regression problems. Our study is motivated by empirical comparisons in Hastie, Tibshirani, and Friedman (2001), which claims that the SRM method performs poorly for model selection and suggests that AIC yields superior predictive performance. Hence, we present empirical comparisons for various data sets and different types of estimators (linear, subset selection, and k-nearest-neighbor regression). Our results demonstrate the practical advantages of VC-based model selection; it consistently outperforms AIC for all data sets. In our study, SRM and BIC methods show similar predictive performance. This discrepancy (between empirical results obtained using the same data) is caused by methodological drawbacks in Hastie et al. (2001), especially in their loose interpretation and application of the SRM method. Hence, we discuss methodological issues important for meaningful comparisons and practical application of the SRM method. We also point out the importance of accurate estimation of model complexity (VC-dimension) for empirical comparisons and propose a new practical estimate of model complexity for k-nearest-neighbors regression.

1 Introduction and Background

We consider the standard regression formulation under a general setting for predictive learning (Vapnik, 1995; Cherkassky & Mulier, 1998; Hastie, Tibshirani, & Friedman, 2001). The goal is to estimate an unknown real-valued function in the relationship

$$y = g(x) + \varepsilon, \tag{1.1}$$
© 2003 Massachusetts Institute of Technology. Neural Computation 15, 1691–1714 (2003)
V. Cherkassky and Y. Ma
where " is independent and identically distributed (i.i.d.) zero-mean random error (noise), x is a multidimensional input, and y is a scalar output. The estimation is made based on a nite number of samples (training data): .xi ; yi /; .i D 1; : : : ; n/. The training data are i.i.d. generated according to some (unknown) joint probability density function (pdf), p.x; y/ D p.x/p.y j x/:
(1.2)
The unknown function in equation 1.1 is the mean of the output conditional probability (the regression function), g(x) = ∫ y p(y | x) dy. A learning method (or estimation procedure) selects the "best" model f(x, ω0) from a set of approximating functions (or possible models) f(x, ω) parameterized by a set of parameters ω ∈ Ω. The quality of an approximation is measured by the loss or discrepancy measure L(y, f(x, ω)). An appropriate loss function for regression is the squared error,

L(y, f(x, ω)) = (y − f(x, ω))².  (1.3)
The squared-error loss, equation 1.3, is commonly used for model selection comparisons. The set of functions f(x, ω), ω ∈ Ω, supported by a learning method may or may not contain the regression function g(x). Thus, learning is the problem of finding the function f(x, ω0) (regressor) that minimizes the prediction risk functional,

R(ω) = ∫ (y − f(x, ω))² p(x, y) dx dy,  (1.4)

using only the training data. This risk functional measures the accuracy of the learning method's predictions of the unknown target function g(x). For a given parametric model (with a fixed number of parameters), the model parameters are estimated by minimizing the empirical risk:

R_emp(ω) = (1/n) Σ_{i=1}^{n} (y_i − f(x_i, ω))².  (1.5)
The problem of model selection (complexity control) arises when a set of possible models f(x, ω) consists of (parametric) models of varying complexity. Then the problem of regression estimation requires optimal selection of model complexity (i.e., the number of parameters), in addition to parameter estimation via minimization of the empirical risk (see equation 1.5). Analytic model selection criteria attempt to estimate the (unknown) prediction risk, equation 1.4, as a function of the (known) empirical risk, equation 1.5, penalized by some measure of model complexity. One then selects the model complexity corresponding to the smallest (estimated) risk. It is important to point out that for finite-sample problems, the goal of model selection is distinctly different from the goal of accurate estimation of prediction risk.
Comparison of Model Selection for Regression
Since all analytic model selection criteria are based on certain assumptions (notably, asymptotic analysis and linearity), it is important to perform empirical comparisons in order to understand their practical usefulness in settings where these assumptions may not hold. Recently, Cherkassky and Mulier (1998), Cherkassky, Shao, Mulier, and Vapnik (1999), and Shao, Cherkassky, and Li (2000) presented such empirical comparisons for several analytic model selection methods for linear and penalized linear estimators and concluded that model selection based on structural risk minimization (SRM) is superior to other methods. Later, Cherkassky and Shao (2001) and Cherkassky and Kilts (2001) successfully applied SRM-based model selection to wavelet signal denoising. Similarly, Chapelle, Vapnik, and Bengio (2002) suggest an analytic model selection approach for small-sample problems based on Vapnik-Chervonenkis (VC) theory and claim superiority of the VC approach using empirical comparisons. These findings contradict a widely held opinion that VC generalization bounds are too conservative for practical model selection (Bishop, 1995; Ripley, 1996; Duda, Hart, & Stork, 2001). Recently, Hastie et al. (2001) presented empirical comparisons for model selection and concluded that "SRM performs poorly overall." This article is intended to clarify the current state of affairs regarding the practical usefulness of SRM model selection. In addition, we address important methodological issues related to empirical comparisons in general and meaningful application of SRM model selection in particular. The article is organized as follows. Section 2 describes the classical model selection criteria and the VC-based approach used for empirical comparisons. Section 3 describes empirical comparisons for low-dimensional data sets using linear estimators and the k-nearest neighbors method. Section 4 describes empirical comparisons for high-dimensional data sets taken from Hastie et al.
(2001) using k-nearest neighbor and linear subset selection regression. Conclusions are presented in section 5.

2 Analytical Model Selection Criteria

In general, analytical estimates of the (unknown) prediction risk R_est as a function of the (known) empirical risk R_emp take one of the following forms:

R_est(d) = R_emp(d) · r(d, n)  or  R_est(d) = R_emp(d) + r(d/n, σ²),
where r(d, n) is often called the penalization factor, a monotonically increasing function of the ratio of model complexity (degrees of freedom) d to the number of samples n. In this article, we discuss three model selection methods. The first two are representative statistical methods: the Akaike information criterion (AIC)
and the Bayesian information criterion (BIC),

AIC(d) = R_emp(d) + (2d/n) σ̂²,  (2.1)

BIC(d) = R_emp(d) + (ln n)(d/n) σ̂²,  (2.2)
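In code, the two criteria are direct transcriptions of equations 2.1 and 2.2 (a minimal sketch; the function and argument names are ours):

```python
import math

def aic(r_emp, d, n, sigma2):
    """AIC, eq. 2.1: empirical risk plus the penalty 2*(d/n)*sigma^2."""
    return r_emp + 2.0 * d / n * sigma2

def bic(r_emp, d, n, sigma2):
    """BIC, eq. 2.2: same form, but with the heavier ln(n)-scaled penalty."""
    return r_emp + math.log(n) * (d / n) * sigma2
```

For any n with ln n > 2 (i.e., n ≥ 8), BIC penalizes complexity more strongly than AIC, which is why it tends to select simpler models.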
where d is the number of free parameters (of a linear estimator) and σ̂² denotes an estimate of the noise variance in equation 1.1. Both AIC and BIC are derived using asymptotic analysis (i.e., a large sample size). In addition, AIC assumes that the correct model belongs to the set of possible models. In practice, however, AIC and BIC are often used when these assumptions do not hold. When using AIC or BIC for practical model selection, one is faced with two issues. The first issue is the estimation and meaning of the (unknown) noise variance. When using a linear estimator with d parameters, the noise variance can be estimated from the training data as

σ̂² = [n/(n − d)] · (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)².  (2.3)
Then one can use equation 2.3 in conjunction with AIC or BIC in one of two possible ways. Under the first approach, one estimates the noise via equation 2.3 for each (fixed) model complexity (Cherkassky et al., 1999; Chapelle et al., 2002); thus, a different noise estimate is used in the AIC or BIC expression for each (chosen) model complexity. This leads to the following form of AIC, known as the final prediction error (FPE) (Akaike, 1970):

FPE(d) = [(1 + d/n)/(1 − d/n)] R_emp(d).  (2.4)
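Equations 2.3 and 2.4 can likewise be sketched in code (names are ours; `residuals` holds the fitting residuals y_i − ŷ_i of a linear fit with d parameters):

```python
def noise_variance(residuals, d):
    """Eq. 2.3: noise estimate from a linear fit with d free parameters;
    the n/(n-d) factor corrects the bias of the raw mean squared residual."""
    n = len(residuals)
    return (n / (n - d)) * sum(r * r for r in residuals) / n

def fpe(r_emp, d, n):
    """Final prediction error, eq. 2.4: the multiplicative form of AIC."""
    return (1 + d / n) / (1 - d / n) * r_emp
```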
Under the second approach, one first estimates the noise via equation 2.3 using a high-variance/low-bias estimator, and this noise estimate is then plugged into AIC or BIC expressions 2.1 or 2.2 to select the optimal model complexity. In this article, we use the latter approach, since it was used in the empirical comparisons by Hastie et al. (2001); however, there seems to be no consensus on which approach is best for practical model selection. Further, even though one can obtain an estimate of the noise variance (as outlined above), the very interpretation of noise becomes difficult for practical problems when the set of possible models does not contain the true target function. In this case, it is not clear whether the notion of noise refers to a discrepancy between admissible models and training data or reflects the difference between the true target function and the training data, as in formulation 1.1. In particular, noise estimation becomes impossible when there is significant mismatch
between an unknown target function and an estimator. Unfortunately, all data sets used in Hastie et al. (2001) are of this sort. For example, they use k-nearest neighbor regression to estimate discontinuous target functions, even though it is well known that kernel estimators are intended for smooth target functions (Hardle, 1995). Hence, all empirical comparisons presented in this article assume that for the AIC and BIC methods, the variance of the additive noise in equation 1.1 is known. This avoids ambiguity with respect to the different strategies for noise estimation and gives an additional competitive advantage to AIC/BIC versus SRM. The second issue is estimation of model complexity. AIC and BIC use the number of free parameters (for linear estimators), but it is not clear what is a good measure of complexity for other types of estimators (e.g., penalized linear, subset selection, k-nearest neighbors). There have been several suitable generalizations of model complexity (known as effective degrees of freedom; Bishop, 1995; Cherkassky & Mulier, 1998; Hastie et al., 2001). The third model selection method used in this article is based on SRM, which provides a very general and powerful framework for model complexity control. Under SRM, a set of possible models forms a nested structure, so that each element (of this structure) represents a set of models of fixed complexity. Hence, a structure provides a natural ordering of possible models according to their complexity. Model selection amounts to choosing an optimal element of a structure using VC generalization bounds. For regression problems, we use the following VC bound (Vapnik, 1998; Cherkassky & Mulier, 1998; Cherkassky et al., 1999):

R(h) ≤ R_emp(h) · (1 − √(p − p ln p + (ln n)/(2n)))⁻¹₊,  (2.5)
where p = h/n and h is a measure of model complexity (called the VC-dimension); the subscript "+" indicates that the penalization factor is taken as infinite when the expression in parentheses is nonpositive. The practical form of the VC bound, equation 2.5, is a special case of the general analytical bound (Vapnik, 1995) with appropriately chosen practical values of the theoretical constants. (See Cherkassky et al., 1999, for a detailed derivation of equation 2.5 from the general bounds developed in VC theory.) According to the SRM method, model selection amounts to choosing the model minimizing the upper bound on prediction risk (see equation 2.5). Note that under the SRM approach, we do not need to estimate the noise variance. The bound, equation 2.5, was derived under very general assumptions (e.g., finite-sample setting, nonlinear estimation). However, it requires an accurate estimate of the VC-dimension. Here we face the same problem as in estimating the effective degrees of freedom (DoF) in the AIC or BIC method: the VC-dimension coincides with the number of free parameters for linear estimators but is (usually) hard to obtain for other types of estimators.
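Bound 2.5 transcribes directly into code (a sketch; we return infinity when the penalization factor is nonpositive, matching the "+" convention):

```python
import math

def vc_bound(r_emp, h, n):
    """Practical VC bound, eq. 2.5, on the prediction risk.
    h is the VC-dimension, n the sample size. Returns inf when the
    model is too complex for the sample size (factor nonpositive)."""
    p = h / n
    factor = 1 - math.sqrt(p - p * math.log(p) + math.log(n) / (2 * n))
    return r_emp / factor if factor > 0 else float("inf")
```

Because the penalty is multiplicative, the bound grows quickly as h approaches n, which is what makes SRM conservative about complexity.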
This discussion suggests the following commonsense methodology for empirical comparisons of AIC, BIC, and SRM. First, perform comparisons for linear estimators, for which model complexity can be accurately estimated for all methods. Then perform comparisons for other types of estimators, using either advanced methods for estimating model complexity (Vapnik, Levin, & Cun, 1994; Shao et al., 2000) or crude heuristic estimates of model complexity. The latter approach is pursued in this article for k-nearest neighbor regression (for which accurate estimates of model complexity are not known).

3 Empirical Comparisons for Linear Estimators and k-Nearest Neighbors

We first describe experimental comparisons of AIC, BIC, and SRM for linear estimators, using the comparison methodology and data sets from Cherkassky et al. (1999). It is important to obtain meaningful comparisons for model selection with linear estimators first, since in this case the model complexity (DoF in AIC/BIC and VC-dimension in SRM) can be analytically estimated. Later (in section 4), we show comparisons using data sets from Hastie et al. (2001) for (nonlinear) estimators, with crude estimates of the model complexity used in the model selection criteria. First, we describe the experimental setup and comparison metrics; we then show the empirical comparison results.

3.1 Experimental Setup.

3.1.1 Target Functions Used. The following target functions were used:

Sine-squared function: g1(x) = sin²(2πx), x ∈ [0, 1]
Discontinuous piecewise polynomial function:

g2(x) = 4x²(3 − 4x),  x ∈ [0, 0.5]
g2(x) = (4/3)x(4x² − 10x + 7) − 3/2,  x ∈ (0.5, 0.75]
g2(x) = (16/3)x(x − 1)²,  x ∈ (0.75, 1]

Two-dimensional sinc function:

g3(x) = sin(√(x1² + x2²)) / √(x1² + x2²),  x ∈ [−5, 5]^2
In addition, the training and test samples were uniformly distributed in the input domain.
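For reference, the three targets can be coded as follows (a sketch based on the definitions above; we set g3 to its limit value 1 at the origin):

```python
import math

def g1(x):
    """Sine-squared target, x in [0, 1]."""
    return math.sin(2 * math.pi * x) ** 2

def g2(x):
    """Discontinuous piecewise polynomial target, x in [0, 1];
    the function jumps at x = 0.5."""
    if x <= 0.5:
        return 4 * x * x * (3 - 4 * x)
    if x <= 0.75:
        return (4.0 / 3) * x * (4 * x * x - 10 * x + 7) - 1.5
    return (16.0 / 3) * x * (x - 1) ** 2

def g3(x1, x2):
    """Two-dimensional sinc target, (x1, x2) in [-5, 5]^2."""
    r = math.hypot(x1, x2)
    return math.sin(r) / r if r > 0 else 1.0
```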
3.1.2 Estimators Used. The linear estimators include polynomial and trigonometric estimators, that is,

Algebraic polynomials: f_m(x, ω) = Σ_{i=1}^{m−1} ω_i x^i + ω0  (3.1)

Trigonometric expansion: f_m(x, ω) = Σ_{i=1}^{m−1} ω_i cos(ix) + ω0  (3.2)
For linear estimators, the number of parameters (DoF) is used as the measure of model complexity (VC-dimension) for all model selection methods. In k-nearest neighbors regression, the unknown function is estimated by taking a local average of the k training samples nearest to the estimation point. In this case, an estimate of the effective DoF or VC-dimension is not known, even though sometimes the ratio n/k is used to estimate model complexity (Hastie et al., 2001). However, this estimate appears too crude and can be criticized using both commonsense and theoretical arguments, as discussed next. With the k-nearest neighbors method, the training data can be divided into n/k neighborhoods. If the neighborhoods were nonoverlapping, one could fit one parameter in each neighborhood (leading to the estimate d = n/k). However, the neighborhoods are, in fact, overlapping, so that a sample point from one neighborhood affects the regression estimates in an adjacent neighborhood. This suggests that a better estimate of DoF has the form d = n/(c·k) with c > 1. The value of the (unknown) parameter c can (hopefully) be determined empirically or using additional theoretical arguments. Moreover, using the ratio n/k to estimate model complexity is inconsistent with the main result of VC theory (that the VC-dimension of any estimator should be finite). Indeed, asymptotically (for large n), the ratio n/k grows without bound. On the other hand, asymptotic theory for k-nearest neighbor estimators (Hardle, 1995) provides asymptotically optimal k-values (when n is large), namely, k ~ n^{4/5}. This suggests the following (asymptotic) dependency for DoF: d ~ (n/k) · (1/n^{1/5}). This (asymptotic) formula is clearly consistent with the commonsense expression d = n/(c·k) with c > 1. We found that a good practical estimate of DoF can be obtained by assuming the dependency

d = const · n^{4/5}/k  (3.3)

and then empirically "tuning" the value const = 1 using a number of data sets. This leads to the following empirical estimate of DoF:

d = (n/k) · (1/n^{1/5}).  (3.4)
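Both the k-nearest-neighbor estimate and prescription 3.4 are short to state in code (a sketch for the one-dimensional case; the names are ours):

```python
def knn_predict(x_train, y_train, x0, k):
    """k-NN regression estimate at x0: the mean response of the
    k training points closest to x0."""
    order = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))
    return sum(y_train[i] for i in order[:k]) / k

def knn_dof(n, k):
    """Proposed effective DoF for k-NN regression, eq. 3.4:
    d = (n/k) * n**(-1/5) = n**(4/5) / k, always smaller than n/k."""
    return n ** 0.8 / k
```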
Prescription 3.4 is used as an estimate of DoF and VC-dimension for all model selection comparisons presented later in this article. We point out that
equation 3.4 is a crude practical measure of model complexity. However, it is certainly better than the estimate d = n/k used in Hastie et al. (2001): using expression 3.4 instead of d = n/k actually improves the prediction accuracy of all model selection methods (AIC, BIC, and SRM) for all data sets used in our comparisons (including those used in Hastie et al., 2001). Also, the proposed estimate 3.4 is consistent with the DoF estimates provided by asymptotic theory and with commonsense arguments.

3.1.3 Noise Estimation. All comparisons use the true noise variance for the AIC and BIC methods. Hence, AIC and BIC have an additional competitive advantage over the SRM method.

3.2 Experimental Procedure. A training set of fixed size n is generated using the standard regression formulation 1.1, with a target function whose y-values are corrupted by additive gaussian noise. The x-values of the training data follow a uniform random distribution in the input domain. Then a model (of fixed complexity) is estimated from the training data (via least-squares fitting for linear estimators, or via k-nearest neighbors regression). Model selection is performed by (least-squares) fitting of models of different complexity and selecting the model with the smallest prediction risk as estimated by a given model selection criterion. The prediction risk (accuracy) of the model chosen by a model selection method is then measured experimentally, as the mean squared error (MSE) between the model (estimated from the training data) and the true values of the target function g(x) for independently generated test inputs:

R_pred = (1/n_test) Σ_{i=1}^{n_test} (g(x_i) − ŷ_i)².
The above experimental procedure is repeated 100 times using 100 different random realizations of n training samples (from the same statistical distribution), and the empirical distribution of the prediction risk (observed for each model selection method) is displayed using standard box plot notation, with marks at 95%, 75%, 50%, 25%, and 5% of the empirical distribution.

3.2.1 Experiment 1. The training data are generated using the sine-squared target function corrupted by gaussian noise. A linear estimator (using algebraic polynomials) was used. Experiments were performed using a small training sample (n = 30) and a large sample size (n = 100). Figure 1 shows comparison results for AIC, BIC, and SRM for noise level σ = 0.2.

Figure 1: Comparison results (prediction risk and selected DoF) for the sine-squared target function estimated using polynomial regression, noise level σ = 0.2. (a) Small sample size, n = 30. (b) Large sample size, n = 100.

3.2.2 Experiment 2. The training data are generated using the discontinuous piecewise polynomial target function corrupted by gaussian noise. A linear estimator (using the trigonometric expansion) was used. Experiments were performed using a small training sample (n = 30) and different noise levels. Figure 2 shows comparison results for AIC, BIC, and SRM. The results presented in Figures 1 and 2 indicate that the SRM and BIC methods work better than AIC for small sample sizes (n = 30). For large samples (see Figure 1b), all methods show comparable prediction accuracy; however, SRM is still preferable, since it selects lower model complexity (i.e., lower-order polynomials). These findings are consistent with the model selection comparisons for linear estimators presented in Cherkassky et al. (1999).

3.2.3 Experiment 3. Here we compare model selection methods using k-nearest neighbors regression. The training and test data are generated using the same target functions (sine-squared and piecewise polynomial) as in the previous experiments. The training sample size is 30, and the noise level is σ = 0.2. Comparison results are shown in Figure 3. We can see that all methods provide similar prediction accuracy (with SRM and BIC having a slight edge over AIC). However, SRM is more robust, since it selects models of lower complexity (larger k) than BIC and AIC. Note that all comparisons use the proposed (estimated) model complexity (see equation 3.4) for the k-nearest neighbors method. Using the (arguably incorrect) estimate of model complexity n/k, as in Hastie et al. (2001), can obscure model selection comparisons. In this case, model selection comparisons for estimating the sine-squared target function using k-nearest neighbors are shown in Figure 4. According to Figure 4, AIC provides the best model selection; however, its prediction accuracy is lower than with the "more accurate" DoF estimates (see equation 3.4) shown in Figure 3a.
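The experimental procedure of section 3.2 can be sketched in a single trial of experiment 1 (a simplified sketch, using numpy's polynomial fitting; for brevity it compares only AIC, with the true noise variance, and SRM, taking DoF = degree + 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trial(n=30, sigma=0.2, max_deg=10):
    """One realization of experiment 1: fit polynomials of increasing
    degree, select the degree minimizing each criterion, and measure the
    true prediction risk against the noise-free target on a test grid."""
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0, sigma, n)
    x_test = np.linspace(0, 1, 300)
    g_test = np.sin(2 * np.pi * x_test) ** 2

    estimates = {"AIC": [], "SRM": []}
    for deg in range(max_deg + 1):
        coeffs = np.polyfit(x, y, deg)
        r_emp = np.mean((y - np.polyval(coeffs, x)) ** 2)
        d = deg + 1  # DoF = VC-dimension for polynomial estimators
        estimates["AIC"].append(r_emp + 2 * d / n * sigma ** 2)  # eq. 2.1
        p = d / n  # eq. 2.5
        pen = 1 - np.sqrt(p - p * np.log(p) + np.log(n) / (2 * n))
        estimates["SRM"].append(r_emp / pen if pen > 0 else np.inf)

    risks = {}
    for name, est in estimates.items():
        deg = int(np.argmin(est))
        coeffs = np.polyfit(x, y, deg)
        risks[name] = np.mean((np.polyval(coeffs, x_test) - g_test) ** 2)
    return risks
```

Repeating `run_trial` over many seeds and box-plotting the risks reproduces the shape of the comparisons described above.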
By comparing the results in Figure 3a and Figure 4, we conclude that the proposed model complexity estimate, equation 3.4, actually improves the prediction performance of all three methods (AIC, BIC, and SRM) relative to using DoF = n/k. In addition, the prediction performance of SRM relative to AIC and BIC degrades significantly when the estimate DoF = n/k is used. This can be explained by the multiplicative form of the VC bound (see equation 2.5), which is affected more significantly by an incorrect estimate of the model complexity (VC-dimension) than AIC and BIC, which have an additive form.

3.2.4 Experiment 4. Here we compare model selection using k-nearest neighbors regression to estimate the two-dimensional sinc target function corrupted by gaussian noise (with σ = 0.2 and 0.4). The training size is 50, and the test size is 300. Comparison results in Figure 5 indicate that the VC-based approach yields better performance than AIC, and its performance is similar to BIC.

Figure 2: Comparison results for the piecewise polynomial target function estimated using trigonometric regression; sample size n = 30. (a) σ = 0.2, SNR = 1.5. (b) σ = 0.1, SNR = 3.

Figure 3: Comparison results for univariate regression using k-nearest neighbors. Training data: n = 30; noise level σ = 0.2. (a) Sine-squared target function. (b) Piecewise polynomial target function.

Figure 4: Comparison results for the sine-squared target function using k-nearest neighbors when DoF = n/k for all methods. Training data: n = 30; noise level σ = 0.2.

All experiments with low-dimensional data sets indicate that SRM model selection yields better prediction accuracy than AIC, and performance similar to BIC. This conclusion is obtained in spite of the fact that the comparisons were biased in favor of AIC/BIC, since we used the true noise variance for AIC and BIC. One can also see from our experimental results that SRM is more robust than BIC, since it consistently selects lower model complexity when both methods provide similar prediction performance. Hence, we conclude that SRM is a better choice for practical model selection than AIC or BIC. Our conclusions strongly disagree with those presented in Hastie et al. (2001), who used high-dimensional data sets for model selection comparisons. The reasons for this disagreement are discussed in the next section.
4 Comparisons for High-Dimensional Data Sets

This section presents comparisons for higher-dimensional data sets taken from (or similar to) Hastie et al. (2001). These data sets are used in Hastie et al. (2001) to compare model selection methods for two types of estimators: k-nearest neighbors (described in section 3) and linear subset selection (discussed later in this section). Next, we present comparisons using the same or similar data sets; our results show an overall superiority of the SRM method.
Figure 5: Comparison results for the two-dimensional sinc target function using k-nearest neighbors; sample size n = 50. (a) σ = 0.2. (b) σ = 0.4.
4.1 Comparisons for k-Nearest Neighbors. Let us consider a multivariate target function of 20 input variables, x ∈ R^20, uniformly distributed in [0, 1]^20:

g4(x) = 1 if x1 > 0.5,
g4(x) = 0 if x1 ≤ 0.5.  (4.1)
4.1.1 Experiment 5. We estimate target function 4.1 from n = 50 training samples using k-nearest neighbor regression. The quality of the estimated models is evaluated using a test set of 500 samples. We use the true noise variance for the AIC and BIC methods. All methods use the proposed DoF estimate, equation 3.4. Results in Figure 6a correspond to the example in Hastie et al. (2001), in which the training samples are not corrupted by noise. Results in Figure 6b show comparisons for noisy training samples. In both cases, SRM achieves better prediction performance than the other methods. The results presented in Figure 6 are in complete disagreement with the empirical results and conclusions presented in Hastie et al. (2001). There are several reasons for this disagreement. First, Hastie et al. (2001) use their own (incorrect) interpretation of the theoretical VC bound, which is completely different from the VC bound, equation 2.5. Second, the discontinuous target function (see equation 4.1) used by Hastie et al. (2001) prevents any meaningful application of AIC/BIC model selection with k-nearest neighbors. That is, kernel estimation methods assume sufficiently smooth (or at least continuous) target functions. When this assumption is violated (as in the case of a discontinuous step function), it becomes impossible to estimate the noise variance (for the AIC/BIC method) with any reasonable accuracy. For example, using 5-nearest-neighbor regression to estimate the noise for this data set, as recommended by Hastie et al. (2001), gives a noise variance of 0.15, which is completely wrong (since the data have no additive noise). Third, applying AIC/BIC to a data set with no additive noise (as suggested by Hastie et al., 2001) makes little sense as well, because the additive forms 2.1 and 2.2 of AIC and BIC imply that with zero noise, these methods will select the highest model complexity. That is why we used a small noise variance, 0.001 (rather than zero), for AIC/BIC in our comparisons for this data set.
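The failure of naive noise estimation on the noise-free step function can be reproduced with a short simulation (a sketch, not the exact recipe of Hastie et al., 2001; the leave-one-out residual estimator below is our own illustrative construction):

```python
import random

def g4(x):
    """Target function 4.1: a step in the first coordinate."""
    return 1.0 if x[0] > 0.5 else 0.0

def knn_residual_variance(xs, ys, k=5):
    """Naive noise estimate: mean squared leave-one-out k-NN residual."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sum((a - b) ** 2
                                         for a, b in zip(xs[i], xs[j])))
        yhat = sum(ys[j] for j in order[:k]) / k
        total += (ys[i] - yhat) ** 2
    return total / n

random.seed(0)
xs = [[random.random() for _ in range(20)] for _ in range(50)]
ys = [g4(x) for x in xs]          # no additive noise at all
est = knn_residual_variance(xs, ys)  # nevertheless substantially positive
```

In 20 dimensions with 50 samples, the nearest neighbors are nearly uninformative about the step location, so `est` comes out well above zero even though the data are noise-free, which illustrates the text's point.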
Graphical illustration of the overfitting by AIC/BIC for this data set can be clearly seen in Figure 7, which shows the values of AIC, BIC, and the VC bound as a function of k, along with the true prediction risk. In contrast, the SRM method performs very accurate model selection for this data set.

Figure 6: Comparisons for the high-dimensional target function 4.1 using the k-nearest neighbors method; sample size n = 50. (a) σ = 0. (b) σ = 0.2.

Figure 7: True prediction risk and its estimates provided by AIC/BIC and SRM (VC bound) as a function of k (the number of nearest neighbors), for a single realization of training data for the high-dimensional target function 4.1, with sample size n = 50 and noise σ = 0. The risk estimates given by AIC and BIC are identical due to zero noise (assumed known). Minimal values of the risk and the estimated risk are marked with a star.

4.2 Comparisons for Linear Subset Selection. Here, the training and test data are generated using a five-dimensional target function, x ∈ R^5 and y ∈ R, defined as

g5(x) = 1 if Σ_{j=1}^{3} x_j > 1.5,
g5(x) = 0 if Σ_{j=1}^{3} x_j ≤ 1.5,  (4.2)
with random x-values uniformly distributed in [0, 1]^5. The linear subset selection method amounts to selecting the best subset of m input variables for a given training sample. Here, the "best" subset of m variables yields the linear model with the lowest empirical risk (MSE fitting error) among all linear models with m variables, for a given training sample. Hence, for linear subset selection, model selection corresponds to selecting an optimal value of m (providing minimum prediction risk). Also note that linear subset selection is a nonlinear estimator, even though it produces models linear in their parameters. Hence, there may be a problem of
estimating its model complexity when applying AIC, BIC, or SRM for model selection. In this article, we estimate the model complexity (DoF) as m + 1 (where m is the number of chosen input variables) for all methods, similar to Hastie et al. (2001); we do not know, however, whether this estimate is correct. The implementation of subset selection used in this article performs an exhaustive search over all possible subsets of m variables (out of d input variables in total) to choose the best subset (minimizing the empirical risk). Hence, we used a moderate number of input variables (d = 5) to avoid the combinatorial explosion of subset selection.
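The exhaustive search can be sketched as follows (names are ours; the empirical risk is the MSE of a least-squares fit, with an intercept, on each candidate subset):

```python
from itertools import combinations
import numpy as np

def best_subset(X, y, m):
    """Exhaustive best-subset search: among all m-variable linear models
    (with intercept), return (empirical risk, subset, weights) for the
    subset with the lowest MSE fitting error on the training sample."""
    n = len(y)
    best = (np.inf, None, None)
    for subset in combinations(range(X.shape[1]), m):
        A = np.column_stack([np.ones(n), X[:, subset]])
        w, *_ = np.linalg.lstsq(A, y, rcond=None)
        r_emp = np.mean((A @ w - y) ** 2)
        if r_emp < best[0]:
            best = (r_emp, subset, w)
    return best
```

The loop visits C(d, m) subsets, which is why the article keeps d = 5: the cost grows combinatorially in d.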
This happens due to the choice of approximating functions (i.e., linear subset selection), which results in a large bias (mismatch) for the chosen target function, 4.2. Even using the most complex linear subset selection model, one cannot accurately t the training data generated by this target function. In other words, there is no possibility of overtting for this data set; hence, a more conservative model selection approach (such as SRM) tends to underperform relative to AIC and BIC. In order to perform more meaningful comparisons for the linear subset selection method, consider the data set where the target function belongs to a set of possible models (approximating functions): the training and test data are generated using ve-dimensional target function x 2 R5 and y 2 R, dened as g6 .x/ D x1 C 2x2 C x3 with x-values with uniformly distributed in [0; 1]5 .
(4.3)
Figure 8: Comparison results for the high-dimensional target function 4.2 using linear subset selection for n = 30 samples. (a) σ = 0. (b) σ = 0.2.
Figure 9: Prediction risk as a function of DoF with linear subset selection, for a single realization of training data generated using the high-dimensional target function 4.2; n = 30 samples; σ = 0.
Experimental comparisons of model selection for this data set are shown in Figure 10. Note that experimental setup and the properties of training data (sample size, noise level ¾ D 0:2) are identical to the setting used to produce comparisons in Figure 8b, except that we use the target function, 4.3. Results shown in Figure 10 indicate that SRM and BIC have similar prediction performance (both better than AIC) for this data set. Note that all comparisons assume known noise level for AIC=BIC; hence, they are biased in favor of AIC=BIC. Even with this bias, SRM performs better (or at least not worse) than AIC=BIC for linear subset selection. Our ndings contradict comparisons presented by Hastie et al. (2001). This can be explained by the contrived data set used in their comparisons (similar to equation 4.2), such that no overtting is possible with linear subset selection. 5 Discussion Given the proliferation of ad hoc empirical comparisons and heuristic learning methods, the issue of what constitutes meaningful comparisons becomes of great practical importance. Hence, in this section, we discuss several methodological implications of our comparison study and suggest future research directions. The rst issue to address is what the goal of empirical
Figure 10: Comparison results for high-dimensional target function 4.3 using linear subset selection for n = 30 samples; noise level σ = 0.2.
comparisons is. Often, the goal seems to be demonstrating that one method (learning technique, model selection criterion) is better than the others. This can easily be accomplished by using specially chosen (contrived) data sets that favor particular methods, or by tuning one method (learning algorithm) well while tuning the other (competing) methods poorly, often unintentionally, since a person performing comparisons is usually an expert in one method (the one being proposed). In fact, according to Vapnik (1998), generalization from finite data is possible only when an estimator has limited capacity (i.e., complexity). Therefore, a learning method cannot solve most practical problems with finite samples unless it uses a set of approximating functions (admissible models) appropriate for the problem at hand. Ignoring this commonsense observation leads to formal proofs that no learning method generalizes better than another for all data sets (i.e., results collectively known as no-free-lunch theorems; see Duda et al., 2001). In this sense, the mismatch between estimation methods and data sets used for model selection comparisons in Hastie et al. (2001) leads to particularly confusing and misleading conclusions. In our opinion, the goal of empirical comparisons of methods for model selection should be improved understanding of their relative performance for finite-sample problems. Since the model complexity (VC-dimension) can be accurately estimated for linear methods, it is methodologically appropriate to perform such comparisons for linear regression first, before considering other types of estimators. Recent comparison studies (Cherkassky & Mulier, 1998; Cherkassky et al., 1999; Shao et al., 2000) indicate that VC-based model selection is superior to other analytic model selection approaches for linear and penalized linear regression with finite samples. Empirical comparisons between SRM, AIC, and BIC for linear estimators presented in this article confirm the practical advantages of SRM model selection. In addition, we propose a new practical estimate of model complexity for k-nearest neighbor regression and use it for model selection comparisons. This article presents the first known practical application of VC-bound equation 2.5 to k-nearest neighbors regression. Our empirical results show the advantages of SRM and BIC (relative to AIC) model selection for k-nearest neighbors. Our empirical comparisons clearly show the practical advantages of SRM model selection for linear subset selection regression. Future research may be directed toward meaningful comparisons of model selection methods for other (nonlinear) estimators. Here, the main challenge is estimating model complexity and avoiding the potential methodological pitfalls of ad hoc comparisons. Empirical comparisons performed by Hastie et al. (2001) illustrate possible dangers of ad hoc comparisons, such as these:

• Inaccurate estimation of model complexity. It is impossible to draw any conclusions based on empirical comparisons unless one is confident that the model selection criteria use accurate estimates of model complexity. There exist experimental methods for measuring the VC-dimension of an estimator (Vapnik et al., 1994; Shao et al., 2000); however, they may be difficult for general practitioners to apply. An alternative practical approach is to come up with empirical, commonsense estimates of model complexity to be used in model selection criteria, as in this article, where we successfully used a new complexity measure for k-nearest neighbor regression.
Essentially, under this approach, we combine the known analytical form of a model selection criterion with an appropriately tuned measure of model complexity taken as a function of (some) complexity parameter (i.e., the value of k in the k-nearest neighbor method).

• Using contrived data sets. For both k-nearest neighbors and linear subset selection comparisons, Hastie et al. (2001) used specially chosen data sets so that meaningful model selection comparisons become difficult or impossible. For example, for subset selection comparisons, they use a data set such that regression estimates do not exhibit any overfitting. This (contrived) setting obviously favors model selection methods (such as AIC) that tend to overfit.

• Using a small number of data sets. Comparisons presented in Hastie et al. (2001) use just two data sets of fixed size with no noise (added to the response variable). Such comparisons can easily be misleading, since
model selection methods typically exhibit different relative performance for different noise levels, sample sizes, and other factors. Reasonable comparisons (leading to meaningful conclusions) should apply model selection criteria to a broad cross-section of data sets with different statistical characteristics (Cherkassky et al., 1999).

• Using an inappropriate form of VC-bounds. The VC theory provides an analytical form of VC-bounds; however, it does not give specific (practical) values of theoretical parameters and constants. Such (practical) parameter values lead to the practical form of VC-bounds for regression problems used in this article. It appears that Hastie et al. (2001) used the original theoretical VC-bound with poorly chosen parameter values instead of the practical form of VC-bound 2.5 described in Cherkassky and Mulier (1998), Vapnik (1998), and Cherkassky et al. (1999).

Finally, we briefly comment on empirical comparisons of model selection methods for classification, presented in Hastie et al. (2001) using an experimental setting for high-dimensional data sets described in section 4. Whereas Hastie et al. (2001) acknowledge that they do not know how to set practical values of constants in VC-bounds for classification, they proceed with empirical comparisons anyway. We remain highly skeptical about the scientific value of such comparisons. Selecting appropriate "practical" values of theoretical constants in VC-bounds for classification is an important and interesting research area. It may be worthwhile to point out that Hastie et al. (2001) use identical data sets for regression and classification comparisons; the only difference is in the choice of the loss function. Likewise, classical model selection criteria (AIC and BIC) have an identical form for regression and classification.
This observation underscores an important distinction between classical statistics and a VC-based approach to learning and estimating dependencies from data; namely, the classical approach to learning (estimation) problems is to apply the maximum likelihood (ML) method for estimating the parameters of a density function. Hence, the same approach (density estimation via ML) can be used for both classification and regression, and the AIC/BIC prescriptions have an identical form for both types of learning problems. In contrast, the VC approach clearly differentiates between regression and classification settings, since the goal of learning is not density estimation, but rather estimation of certain properties of unknown densities via empirical risk minimization or structural risk minimization. This results in completely different forms of VC-bounds for classification (additive form) and regression (multiplicative form). Based on the above discussion, it becomes clear that Hastie et al. (2001) use identical data sets for both classification and regression comparisons following the classical statistical approach. In contrast, under the VC-based methodology, it makes little sense to use the same data sets for classification and regression comparisons, since we are faced with two completely different learning problems.
Acknowledgments

This work was supported by NSF grant ECS-0099906. We also acknowledge two anonymous referees for their helpful comments.

References

Akaike, H. (1970). Statistical prediction information. Ann. Inst. Statist. Math., 22, 203–217.
Bishop, C. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Chapelle, O., Vapnik, V., & Bengio, Y. (2002). Model selection for small sample regression. Machine Learning, 48(1), 9–23.
Cherkassky, V., & Mulier, F. (1998). Learning from data: Concepts, theory and methods. New York: Wiley.
Cherkassky, V., Shao, X., Mulier, F., & Vapnik, V. (1999). Model complexity control for regression using VC generalization bounds. IEEE Transactions on Neural Networks, 10(5), 1075–1089.
Cherkassky, V., & Shao, X. (2001). Signal estimation and denoising using VC-theory. Neural Networks, 14, 37–52.
Cherkassky, V., & Kilts, S. (2001). Myopotential denoising of ECG signals using wavelet thresholding methods. Neural Networks, 14, 1129–1137.
Duda, R., Hart, P., & Stork, D. (2001). Pattern classification (2nd ed.). New York: Wiley.
Hardle, W. (1995). Applied nonparametric regression. Cambridge: Cambridge University Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference and prediction. New York: Springer-Verlag.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Shao, X., Cherkassky, V., & Li, W. (2000). Measuring the VC-dimension using optimized experimental design. Neural Computation, 12, 1969–1986.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Vapnik, V., Levin, E., & Cun, Y. (1994). Measuring the VC-dimension of a learning machine. Neural Computation, 6, 851–876.

Received May 13, 2002; accepted December 4, 2002.
ARTICLE
Communicated by Paul Bressloff
Computation in a Single Neuron: Hodgkin and Huxley Revisited Blaise Agüera y Arcas [email protected] Rare Books Library, Princeton University, Princeton, NJ 08544, U.S.A.
Adrienne L. Fairhall [email protected] NEC Research Institute, Princeton, NJ 08540, and Department of Molecular Biology, Princeton, NJ 08544, U.S.A.
William Bialek [email protected] NEC Research Institute, Princeton, NJ 08540, and Department of Physics, Princeton, NJ 08544, U.S.A.
A spiking neuron "computes" by transforming a complex dynamical input into a train of action potentials, or spikes. The computation performed by the neuron can be formulated as dimensional reduction, or feature detection, followed by a nonlinear decision function over the low-dimensional space. Generalizations of the reverse correlation technique with white noise input provide a numerical strategy for extracting the relevant low-dimensional features from experimental data, and information theory can be used to evaluate the quality of the low-dimensional approximation. We apply these methods to analyze the simplest biophysically realistic model neuron, the Hodgkin-Huxley (HH) model, using this system to illustrate the general methodological issues. We focus on the features in the stimulus that trigger a spike, explicitly eliminating the effects of interactions between spikes. One can approximate this triggering "feature space" as a two-dimensional linear subspace in the high-dimensional space of input histories, capturing in this way a substantial fraction of the mutual information between inputs and spike time. We find that an even better approximation, however, is to describe the relevant subspace as two dimensional but curved; in this way, we can capture 90% of the mutual information even at high time resolution. Our analysis provides a new understanding of the computational properties of the HH model. While it is common to approximate neural behavior as "integrate and fire," the HH model is not an integrator, nor is it well described by a single threshold.

Neural Computation 15, 1715–1749 (2003) © 2003 Massachusetts Institute of Technology
1 Introduction

On short timescales, one can conceive of a single neuron as a computational device that maps inputs at its synapses into a sequence of action potentials, or spikes. To a good approximation, the dynamics of this mapping are determined by the kinetic properties of ion channels in the neuron's membrane. In the 50 years since the pioneering work of Hodgkin and Huxley, we have seen the evolution of an ever more detailed description of channel kinetics, making it plausible that the short-time dynamics of almost any neuron we encounter will be understandable in terms of interactions among a mixture of diverse but known channel types (Hille, 1992; Koch, 1999). The existence of so nearly complete a microscopic picture of single-neuron dynamics brings into focus a very different question: What does the neuron compute? Although models in the Hodgkin-Huxley (HH) tradition define a dynamical system that will reproduce the behavior of the neuron, this description in terms of differential equations is far from our intuition about, or the formal description of, computation. The problem of what neurons compute is one instance of a more general problem in modern quantitative biology and biophysics: Given a progressively more complete microscopic description of proteins and their interactions, how do we understand the emergence of function? In the case of neurons, the proteins are the ion channels, and the interactions are very simple: current flows through open channels, charging the cell's capacitance, and all channels experience the resulting voltage. Arguably, there is no other network of interacting proteins for which the relevant equations are known in such detail; indeed, some efforts to understand function and computation in other networks of proteins make use of analogies to neural systems (Bray, 1995).
Despite the relative completeness of our microscopic picture for neurons, there remains a huge gap between the description of molecular kinetics and the understanding of function. Given some complex dynamic input to a neuron, we might be able to simulate the spike train that will result, but we are hard pressed to look at the equations for channel kinetics and say that this transformation from inputs to spikes is equivalent to some simple (or perhaps not so simple) computation such as filtering, thresholding, coincidence detection, or feature extraction. Perhaps the problem of understanding computational function in a model of ion channel dynamics is a symptom of a much deeper mathematical difficulty. Despite the fact that all computers are dynamical systems, the natural mathematical objects in dynamical systems theory are very different from those in the theory of computation, and it is not clear how to connect these different formal schemes. Finding a general mapping from dynamical systems to their equivalent computational functions is a grand challenge, but we will take a more modest approach. We believe that a key intuition for understanding neural computation is the concept of feature selectivity: while the space of inputs to a neuron—
whether we think of inputs as arriving at the synapses or being driven by sensory signals outside the brain—is vast, individual neurons are sensitive only to some restricted set of features in this vast space. The most general way to formalize this intuition is to say that we can compress (in the information-theoretic sense) our description of the inputs without losing any information about the neural output (Tishby, Pereira, & Bialek, 1999). We might hope that this selective compression of the input data has a simple geometric description, so that the relevant bits about the input correspond to coordinates along some restricted set of relevant dimensions in the space of inputs. If this is the case, feature selectivity should be formalized as a reduction of dimensionality (de Ruyter van Steveninck & Bialek, 1988), and this is the approach we follow here. Closely related work on the use of dimensionality reduction to analyze neural feature selectivity has been described in recent work (Bialek & de Ruyter van Steveninck, 2003; Sharpee, Rust, & Bialek, in press). Here, we develop the idea of dimensionality reduction as a tool for analysis of neural computation and apply these tools to the HH model. While our initial goal was to test new analysis methods in the context of a presumably simple and well-understood model, we have found that the HH neuron performs a computation of surprising richness. Preliminary accounts of these results have already appeared (Agüera y Arcas, 1998; Agüera y Arcas, Bialek, & Fairhall, 2001).
2 Dimensionality Reduction

Neurons take input signals at their synapses and give as output sequences of spikes. To characterize a neuron completely is to identify the mapping between neuronal input and the spike train the neuron produces in response. In the absence of any simplifying assumptions, this requires probing the system with every possible input. Most often, these inputs are spikes from other neurons; each neuron typically has of order N ~ 10^3 presynaptic connections. If the system operates at 1 msec resolution and the time window of relevant inputs is 40 msec, then we can think of a single neuron as having an input described by a ~4 × 10^4 bit word (the presence or absence of a spike in each 1 msec bin for each presynaptic cell), which is then mapped to a one (spike) or zero (no spike). More realistically, if average spike rates are ~10 s^-1, the input words can be compressed by a factor of 10. In this picture, a neuron computes a Boolean function over roughly 4000 variables. Clearly one cannot sample every one of the ~2^4000 inputs to identify the neural computation. Progress requires making some simplifying assumption about the function computed by the neuron so that we can vastly reduce the space of possibilities over which to search. We use the idea of dimensionality reduction in this spirit, as a simplifying assumption that allows us to make progress but that also must be tested directly.
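The counting argument in this paragraph can be reproduced directly; the numbers below are the order-of-magnitude figures from the text, not measurements:

```python
# Back-of-envelope size of a neuron's input space, using the text's figures:
# ~10^3 presynaptic cells, 1 msec resolution, 40 msec relevant window.
n_presyn = 1000
n_bins = 40                     # 40 msec at 1 msec resolution
word_bits = n_presyn * n_bins   # spike present/absent, per cell per bin
assert word_bits == 40_000      # the ~4 x 10^4 bit input word

# At ~10 spikes/s only ~1 bin in 10 contains a spike, so the words are
# compressible by roughly a factor of 10:
effective_bits = word_bits // 10
assert effective_bits == 4000   # a Boolean function of ~4000 variables

n_inputs = 2 ** effective_bits  # ~2^4000 distinct inputs: hopeless to sample
print(len(str(n_inputs)))       # number of decimal digits in 2^4000
```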
1718
B. Aguera ¨ y Arcas, A. Fairhall, and W. Bialek
The ideas of feature selectivity and dimensionality reduction have a long history in neurobiology. The idea of receptive fields as formulated by Hartline, Kuffler, and Barlow for the visual system gave a picture of neurons as having a template against which images would be correlated (Hartline, 1940; Kuffler, 1953; Barlow, 1953). If we think of images as vectors in a high-dimensional space, with coordinates determined by the intensities of each pixel, then the simplest receptive field models describe the neuron as sensitive to only one direction or projection in this high-dimensional space. This picture of projection followed by thresholding or some other nonlinearity to determine the probability of spike generation was formalized in the linear perceptron (Rosenblatt, 1958, 1962). In subsequent work, Barlow, Hill, and Levick (1964) characterized neurons in which the receptive field has subregions in space and time such that summation is at least approximately linear in each subregion but these summed signals interact nonlinearly, for example, to generate direction selectivity and motion sensitivity. We can think of Hubel and Wiesel's description of complex and hypercomplex cells (Hubel & Wiesel, 1962) again as a picture of approximately linear summation within subregions followed by nonlinear operations on these multiple summed signals. More formally, the proper combination of linear summation and nonlinear or logical operations may provide a useful bridge from receptive field properties to proper geometric primitives in visual computation (Iverson & Zucker, 1995). In the same way that a single receptive field or perceptron model has one relevant dimension in the space of visual stimuli, these more complex cells have as many relevant dimensions as there are independent subregions of the receptive field. Although this number is larger than one, it still is much smaller than the full dimensionality of the possible spatiotemporal variations in visual inputs.
The idea that neurons in the auditory system might be described by a filter followed by a nonlinear transformation to determine the probability of spike generation was the inspiration for de Boer's development (de Boer & Kuyper, 1968) of triggered or reverse correlation. Modern uses of reverse correlation to characterize the filtering or receptive field properties of a neuron often emphasize that this approach provides a "linear approximation" to the input-output properties of the cell, but the original idea was almost the opposite: neurons clearly are nonlinear devices, but this is separate from the question of whether the probability of generating a spike is determined by a simple projection of the sensory input onto a single filter or template. In fact, as explained by Rieke, Warland, Bialek, and de Ruyter van Steveninck (1997), linearity is seldom a good approximation for the neural input-output relation, but if there is one relevant dimension, then (provided that input signals are chosen with suitable statistics) the reverse correlation method is guaranteed to find this one special direction in the space of inputs to which the neuron is sensitive. While the reverse correlation method is guaranteed to find the one relevant dimension if it exists, the method does not include
any way of testing for other relevant dimensions, or more generally for measuring the dimensionality of the relevant subspace. The idea of characterizing neural responses directly as the reduction of dimensionality emerged from studies (de Ruyter van Steveninck & Bialek, 1988) of a motion-sensitive neuron in the fly visual system. In particular, this work suggested that it is possible to estimate the dimensionality of the relevant subspace rather than just assuming that it is small (or equal to one). More recent work on the fly visual system has exploited the idea of dimensionality reduction to probe both the structure and adaptation of the neural code (Brenner, Bialek, & de Ruyter van Steveninck, 2000; Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001) and the nature of the computation that extracts the motion signal from the spatiotemporal array of photoreceptor inputs (Bialek & de Ruyter van Steveninck, 2003). Here we review the ideas of dimensionality reduction from previous work; extensions of these ideas begin in section 3. In the spirit of neural network models, we will simplify away the spatial structure of neurons and consider time-dependent currents I(t) injected into a point-like neuron. While this misses much of the complexity of real cells, we will find that even this system is highly nontrivial. If the input is an injected current, then the neuron maps the history of this current, I(t < t_0), into the presence or absence of a spike at time t_0. More generally, we might imagine that the cell (or our description) is noisy, so that there is a probability of spiking P[spike at t_0 | I(t < t_0)] that depends on the current history. The dependence on the history of the current means that the input signal still is high dimensional, even without spatial dependence. Working at time resolution Δt and assuming that currents in a window of size T are relevant to the decision to spike, the input space is of dimension D = T/Δt, where D is often of order 100.
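This discretization can be made concrete in a few lines. The values of Δt and T below are invented for illustration (the text only says D is often of order 100); the point is simply that the current history entering the decision to spike is a vector of D = T/Δt samples:

```python
import numpy as np

dt = 0.5                 # time resolution in msec (illustrative value)
T = 50.0                 # window of history assumed relevant, msec (illustrative)
D = int(T / dt)          # dimensionality of the input space, D = T / dt

rng = np.random.default_rng(1)
t = np.arange(0.0, 200.0, dt)       # a longer stimulus record
I = rng.standard_normal(t.size)     # gaussian white-noise current samples

# The input to the "decision to spike" at time t0 is the vector of the
# last D samples of the current, I(t0 - T < t < t0).
i0 = 300                            # index of a candidate spike time t0
history = I[i0 - D:i0]
print(D, history.shape)
```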
The idea of dimensionality reduction is that the probability of spike generation is sensitive only to some limited number of dimensions K within the D-dimensional space of inputs. We begin our analysis by searching for linear subspaces, that is, a set of signals s_1, s_2, ..., s_K that can be constructed by filtering the current,

s_\mu = \int_0^\infty dt \, f_\mu(t) \, I(t_0 - t),  (2.1)

so that the probability of spiking depends on only this small set of signals,

P[spike at t_0 | I(t < t_0)] = P[spike at t_0] g(s_1, s_2, ..., s_K),  (2.2)

where the inclusion of the average probability of spiking, P[spike at t_0], leaves g dimensionless. If we think of the current I(t_0 - T < t < t_0) as a D-dimensional vector, with one dimension for each discrete sample at spacing Δt, then the filtered signals s_i are linear projections of this vector. In
this formulation, characterizing the computation done by a neuron involves three steps:

1. Estimate the number of relevant stimulus dimensions K, with the hope that there will be many fewer than the original dimensionality D.

2. Identify a set of filters that project into this relevant subspace.

3. Characterize the nonlinear function g(\vec{s}).

The classical perceptron-like cell of neural network theory would have only one relevant dimension, given by the vector of weights, and a simple form for g, typically a sigmoid. Rather than trying to look directly at the distribution of spikes given stimuli, we follow de Ruyter van Steveninck and Bialek (1988) and consider the distribution of signals conditional on the response, P[I(t < t_0) | spike at t_0], also called the response conditional ensemble (RCE); these are related by Bayes' rule,

P[spike at t_0 | I(t < t_0)] / P[spike at t_0] = P[I(t < t_0) | spike at t_0] / P[I(t < t_0)].  (2.3)
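In discrete time, the filtered signals of equation 2.1 reduce to dot products between the history vector and the filters, and equation 2.2 passes them through a nonlinearity g. The sketch below is purely illustrative: the two filters f_μ, the sigmoidal g, and all parameter values are invented, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(2)
dt, T = 0.5, 50.0
D = int(T / dt)                     # discretized history length

# Two hypothetical filters f_mu(t) on the history window (equation 2.1).
tau = np.arange(D) * dt
f1 = np.exp(-tau / 5.0)             # an "integrating" filter (invented)
f2 = np.gradient(f1, dt)            # a "differentiating" filter (invented)

# One history vector, ordered forward in time (oldest sample first).
history = rng.standard_normal(D)
s1 = dt * np.dot(f1, history[::-1]) # s_mu ~ sum_t f_mu(t) I(t0 - t) dt
s2 = dt * np.dot(f2, history[::-1])

def g(s1, s2, theta=1.0, beta=4.0):
    """Invented dimensionless decision function over (s1, s2), equation 2.2."""
    return 1.0 / (1.0 + np.exp(-beta * (s1 + 0.5 * s2 - theta)))

p_spike = 0.01 * g(s1, s2)          # P[spike at t0] * g(s), with P[spike] = 0.01
print(float(p_spike))
```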
We can now compute various moments of the RCE. The first moment is the spike-triggered average stimulus (STA),

STA(\tau) = \int [dI] \, P[I(t < t_0) | spike at t_0] \, I(t_0 - \tau),  (2.4)

which is the object that one computes in reverse correlation (de Boer & Kuyper, 1968; Rieke et al., 1997). If we choose the distribution of input stimuli P[I(t < t_0)] to be gaussian white noise, then for a perceptron-like neuron sensitive to only one direction in stimulus space, it can be shown that the STA or first moment of the RCE is proportional to the vector or filter f(\tau) that defines this direction (Rieke et al., 1997). Although it is a theorem that the STA is proportional to the relevant filter f(\tau), in principle it is possible that the proportionality constant is zero, most plausibly if the neuron's response has some symmetry, such as phase invariance in the response of high-frequency auditory neurons. It also is worth noting that what is really important in this analysis is the gaussian distribution of the stimuli, not the "whiteness" of the spectrum. For nonwhite but gaussian inputs, the STA measures the relevant filter blurred by the correlation function of the inputs, and hence the true filter can be recovered (at least in principle) by deconvolution. For nongaussian signals and nonlinear neurons, there is no corresponding guarantee that the selectivity of the neuron can be separated from correlations in the stimulus (Sharpee et al., in press). To obtain more than one relevant direction (or to reveal relevant directions when symmetries cause the STA to vanish), we proceed to second
order and compute the covariance matrix of fluctuations around the spike-triggered average,

C_spike(\tau, \tau') = \int [dI] \, P[I(t < t_0) | spike at t_0] \, I(t_0 - \tau) I(t_0 - \tau') - STA(\tau) STA(\tau').  (2.5)

In the same way that we compare the spike-triggered average to some constant average level of the signal in the whole experiment, we compare the covariance matrix C_spike with the covariance of the signal averaged over the whole experiment,

C_prior(\tau, \tau') = \int [dI] \, P[I(t < t_0)] \, I(t_0 - \tau) I(t_0 - \tau'),  (2.6)

to construct the change in the covariance matrix,

\Delta C = C_spike - C_prior.  (2.7)
With time resolution Δt in a window of duration T as above, all of these covariances are D × D matrices. In the same way that the spike-triggered average has the clearest interpretation when we choose inputs from a gaussian distribution, ΔC also has the clearest interpretation in this case. Specifically, if inputs are drawn from a gaussian distribution, then it can be shown that (Bialek & de Ruyter van Steveninck, 2003):

1. If the neuron is sensitive to a limited set of K input dimensions as in equation 2.2, then ΔC will have only K nonzero eigenvalues.¹ In this way, we can measure directly the dimensionality K of the relevant subspace.

2. If the distribution of inputs is both gaussian and white, then the eigenvectors associated with the nonzero eigenvalues span the same space as that spanned by the filters {f_\mu(\tau)}.

3. For nonwhite (correlated) but still gaussian inputs, the eigenvectors span the space of the filters {f_\mu(\tau)} blurred by convolution with the correlation function of the inputs.
Thus, the analysis of ΔC for neurons responding to gaussian inputs should allow us to identify the subspace of inputs of relevance and test specifically the hypothesis that this subspace is of low dimension.

¹ As with the STA, it is in principle possible that symmetries or accidental features of the function g(\vec{s}) would cause some of the K eigenvalues to vanish, but this is very unlikely.
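This eigenvalue-counting test is easy to demonstrate numerically. In the sketch below, a toy neuron with an invented two-dimensional decision rule (not the HH model) is probed with gaussian white noise; the STA of equation 2.4 and the matrices of equations 2.5 through 2.7 are estimated, and ΔC shows exactly two eigenvalues standing out from the sampling noise. Both are negative here, because the spike-conditional variance along the relevant directions is reduced.

```python
import numpy as np

rng = np.random.default_rng(4)
n, D = 200_000, 20

# Gaussian white-noise stimulus histories drawn directly as D-vectors
# (i.i.d. samples stand in for sliding windows of a long recording).
H = rng.standard_normal((n, D))

# Toy neuron sensitive to K = 2 directions of the D-dimensional input.
f1 = np.r_[np.ones(5), np.zeros(D - 5)] / np.sqrt(5)
f2 = np.r_[np.zeros(5), np.ones(5), np.zeros(D - 10)] / np.sqrt(5)
s1, s2 = H @ f1, H @ f2
spikes = (s1 > 1.5) & (np.abs(s2) < 0.3)   # invented decision rule over (s1, s2)

# Equations 2.4-2.7: STA, covariance about the STA, minus the prior covariance.
Hs = H[spikes]
sta = Hs.mean(axis=0)
C_spike = (Hs - sta).T @ (Hs - sta) / len(Hs)
C_prior = np.cov(H.T, bias=True)           # ~ identity for white noise
dC = C_spike - C_prior

eigs = np.linalg.eigvalsh(dC)
K_est = int(np.sum(np.abs(eigs) > 0.5))    # modes well above the noise floor
print(K_est)
```

With ~3000 spikes and D = 20, the sampling fluctuations of the remaining eigenvalues stay well below the 0.5 cutoff, so the estimate recovers K = 2.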
Several points are worth noting. First, except in special cases, the eigenvectors of ΔC and the filters {f_\mu(\tau)} are not the principal components of the RCE, and hence this analysis of ΔC is not a principal component analysis. Second, the nonzero eigenvalues of ΔC can be either positive or negative, depending on whether the variance of inputs along that particular direction is larger or smaller in the neighborhood of a spike. Third, although the eigenvectors span the relevant subspace, these eigenvectors do not form a preferred coordinate system within this subspace. Finally, we emphasize that dimensionality reduction (identification of the relevant subspace) is only the first step in our analysis of the computation done by a neuron.

3 Measuring the Success of Dimensionality Reduction

The claim that certain stimulus features are most relevant is in effect a model for the neuron, so the next question is how to measure the effectiveness or accuracy of this model. Several different ideas have been suggested in the literature as ways of testing models based on linear receptive fields in the visual system (Stanley, Lei, & Dan, 1999; Keat, Reinagel, Reid, & Meister, 2001) or linear spectrotemporal receptive fields in the auditory system (Theunissen, Sen, & Doupe, 2000). These methods have in common that they introduce a metric to measure performance, for example, mean square error in predicting the firing rate as averaged over some window of time. Ideally, we would like to have a performance measure that avoids any arbitrariness in the choice of metric, and such metric-free measures are provided uniquely by information theory (Shannon, 1948; Cover & Thomas, 1991). Observing the arrival time t_0 of a single spike provides a certain amount of information about the input signals. Since information is mutual, we can also say that knowing the input signal trajectory I(t < t_0) provides information about the arrival time of the spike.
If "details are irrelevant," then we should be able to discard these details from our description of the stimulus and yet preserve the mutual information between the stimulus and spike arrival times (for an abstract discussion of such selective compression, see Tishby et al., 1999). In constructing our low-dimensional model, we represent the complete (D-dimensional) stimulus I(t < t_0) by a smaller number (K < D) of dimensions \vec{s} = (s_1, s_2, ..., s_K). The mutual information I[I(t < t_0); t_0] is a property of the neuron itself, while the mutual information I[\vec{s}; t_0] characterizes how much our reduced description of the stimulus can tell us about when spikes will occur. Necessarily, our reduction of dimensionality causes a loss of information, so that
I[\vec{s}; t_0] \le I[I(t < t_0); t_0],  (3.1)
but if our reduced description really captures the computation done by the neuron, then the two information measures will be very close. In particular,
if the neuron were described exactly by a lower-dimensional model—as for a linear perceptron or for an integrate-and-re neuron (Aguera ¨ y Arcas & Fairhall, 2003)—then the two information measures would be equal. More generally, the ratio I [EsI t0 ]=I [I.t < t0 /I t0 ] quanties the efciency of the low-dimensional model, measuring the fraction of information about spike arrival times that our K dimensions capture from the full signal I.t < t0 /. As shown by Brenner, Strong, Koberle, Bialek, and de Ruyter van Steveninck (2000), the arrival time of a single spike provides an information,
\[
I[I(t < t_0);\,t_0] \equiv I_{\text{one spike}} = \frac{1}{T}\int_0^T dt\;\frac{r(t)}{\bar r}\,\log_2\!\left[\frac{r(t)}{\bar r}\right], \tag{3.2}
\]
where $r(t)$ is the time-dependent spike rate, $\bar r$ is the average spike rate, and $\langle\cdots\rangle$ denotes an average over time. In principle, information should be calculated as an average over the distribution of stimuli, but the ergodicity of the stimulus justifies replacing this ensemble average with a time average. For a deterministic system like the HH equations, the spike rate is a singular function of time: given the inputs $I(t)$, spikes occur at definite times with no randomness or irreproducibility. If we observe these responses with a time resolution $\Delta t$, then for $\Delta t$ sufficiently small, the rate $r(t)$ at any time $t$ either is zero or corresponds to a single spike occurring in one bin of size $\Delta t$, that is, $r = 1/\Delta t$. Thus, the information carried by a single spike is
\[
I_{\text{one spike}} = -\log_2(\bar r\,\Delta t). \tag{3.3}
\]
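As a quick numerical check (ours, not from the paper; the spike train below is synthetic and purely illustrative), one can verify that for a deterministic, binned spike train, where $r(t)$ is either 0 or $1/\Delta t$, the time average in equation 3.2 collapses to the closed form of equation 3.3:

```python
import numpy as np

# Synthetic deterministic spike train: r(t) is 0 or 1/dt in each bin.
rng = np.random.default_rng(0)
T, dt = 1000.0, 0.5                  # window and resolution, msec (illustrative)
n_bins = int(T / dt)
spike_bins = rng.choice(n_bins, size=40, replace=False)

r = np.zeros(n_bins)
r[spike_bins] = 1.0 / dt             # singular, deterministic rate
rbar = r.mean()

nz = r > 0                           # 0*log(0) terms contribute nothing
I_eq32 = np.sum((r[nz] / rbar) * np.log2(r[nz] / rbar)) * dt / T   # eq. 3.2
I_eq33 = -np.log2(rbar * dt)                                       # eq. 3.3
# The two estimates coincide (about 5.6 bits for these numbers).
```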
On the other hand, if the probability of spiking really depends on only the stimulus dimensions $s_1, s_2, \ldots, s_K$, we can substitute
\[
\frac{r(t)}{\bar r} \rightarrow \frac{P(\vec{s}\,|\,\text{spike at } t)}{P(\vec{s})}. \tag{3.4}
\]
Replacing the time averages in equation 3.2 with ensemble averages, we find
\[
I[\vec{s};\,t_0] \equiv I^{\vec{s}}_{\text{one spike}} = \int d^K s\; P(\vec{s}\,|\,\text{spike at } t)\,\log_2\!\left[\frac{P(\vec{s}\,|\,\text{spike at } t)}{P(\vec{s})}\right] \tag{3.5}
\]
(for details of these arguments, see Brenner, Strong, et al., 2000). This allows us to compare the information captured by the K-dimensional reduced model with the true information carried by single spikes in the spike train. For reasons that we will discuss in the following section, and as was pointed out in Agüera y Arcas et al. (2001) and Agüera y Arcas and Fairhall (2003), we will be considering isolated spikes, those separated from previous spikes by a period of silence. This has important consequences for our analysis. Most significantly, as we will be considering spikes that occur
1724
B. Agüera y Arcas, A. Fairhall, and W. Bialek
on a background of silence, the relevant stimulus ensemble, conditioned on the silence, is no longer gaussian. Further, we will need to refine our information estimate. The derivation of equation 3.2 makes clear that a similar formula must determine the information carried by the occurrence time of any event, not just single spikes; we can define an event rate in place of the spike rate and then calculate the information carried by these events (Brenner, Strong, et al., 2000). In the case here, we wish to compute the information obtained by observing an isolated spike, or equivalently by the event silence+spike. This is straightforward: we replace the spike rate by the rate of isolated spikes, and equation 3.2 will give us the information carried by the arrival time of a single isolated spike. The problem is that this information includes both the information carried by the occurrence of the spike and the information conveyed in the condition that there were no spikes in the preceding $t_{\text{silence}}$ msec (for an early discussion of the information carried by silence, see de Ruyter van Steveninck & Bialek, 1988). We would like to separate these contributions, since our idea of dimensionality reduction applies only to the triggering of a spike, not to the temporally extended condition of nonspiking. To separate the information carried by the isolated spike itself, we have to ask how much information we gain by seeing an isolated spike given that the condition for isolation has already been met. As discussed by Brenner, Strong, et al. (2000), we can compute this information by thinking about the distribution of times at which the isolated spike can occur. Given that we know the input stimulus, the distribution of times at which a single isolated spike will be observed is proportional to $r_{\text{iso}}(t)$, the time-dependent rate or peristimulus time histogram for isolated spikes. With proper normalization, we have
\[
P_{\text{iso}}(t\,|\,\text{inputs}) = \frac{1}{T}\cdot\frac{1}{\bar r_{\text{iso}}}\; r_{\text{iso}}(t), \tag{3.6}
\]
where T is the duration of the (long) window in which we can look for the spike and $\bar r_{\text{iso}}$ is the average rate of isolated spikes. This distribution has an entropy,
\[
S_{\text{iso}}(t\,|\,\text{inputs}) = -\int_0^T dt\; P_{\text{iso}}(t\,|\,\text{inputs})\,\log_2 P_{\text{iso}}(t\,|\,\text{inputs}) \tag{3.7}
\]
\[
= -\frac{1}{T}\int_0^T dt\; \frac{r_{\text{iso}}(t)}{\bar r_{\text{iso}}}\,\log_2\!\left[\frac{1}{T}\cdot\frac{r_{\text{iso}}(t)}{\bar r_{\text{iso}}}\right] \tag{3.8}
\]
\[
= \log_2(T\,\bar r_{\text{iso}}\,\Delta t)\ \text{bits}, \tag{3.9}
\]
where again we use the fact that for a deterministic system, the time-dependent rate must be either zero or the maximum allowed by our time
resolution $\Delta t$. To compute the information carried by a single spike, we need to compare this entropy with the total entropy possible when we do not know the inputs. It is tempting to think that without knowledge of the inputs, an isolated spike is equally likely to occur anywhere in the window of size T, which leads us back to equation 3.3 with $\bar r$ replaced by $\bar r_{\text{iso}}$. In this case, however, we are assuming that the condition for isolation has already been met. Thus, even without observing the inputs, we know that isolated spikes can occur only in windows of time whose total length is $T_{\text{silence}} = T \cdot P_{\text{silence}}$, where $P_{\text{silence}}$ is the probability that any moment in time is at least $t_{\text{silence}}$ after the most recent spike. Thus, the total entropy of isolated spike arrival times (given that the condition for silence has been met) is reduced from $\log_2 T$ to
\[
S_{\text{iso}}(t\,|\,\text{silence}) = \log_2(T \cdot P_{\text{silence}}), \tag{3.10}
\]
and the information that the spike carries beyond what we know from the silence itself is
\[
\Delta I_{\text{iso spike}} = S_{\text{iso}}(t\,|\,\text{silence}) - S_{\text{iso}}(t\,|\,\text{inputs}) \tag{3.11}
\]
\[
= \frac{1}{T}\int_0^T dt\; \frac{r_{\text{iso}}(t)}{\bar r_{\text{iso}}}\,\log_2\!\left[\frac{r_{\text{iso}}(t)}{\bar r_{\text{iso}}}\cdot P_{\text{silence}}\right] \tag{3.12}
\]
\[
= -\log_2(\bar r_{\text{iso}}\,\Delta t) + \log_2 P_{\text{silence}}\ \text{bits}. \tag{3.13}
\]
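A quick numerical check (ours; all parameter values below are illustrative, not from the paper) that the closed form of equation 3.13 matches the entropy difference of equations 3.9 through 3.11, with the window length T canceling in the difference:

```python
import numpy as np

# Illustrative numbers only.
T = 1.0e6          # observation window, msec
dt = 0.5           # time resolution, msec
r_iso = 0.002      # average isolated-spike rate, spikes/msec (hypothetical)
P_silence = 0.3    # fraction of time satisfying the isolation condition

S_inputs = np.log2(T * r_iso * dt)        # eq. 3.9: entropy given the inputs
S_silence = np.log2(T * P_silence)        # eq. 3.10: entropy given silence only
dI = S_silence - S_inputs                 # eq. 3.11
dI_closed = -np.log2(r_iso * dt) + np.log2(P_silence)   # eq. 3.13
# The two expressions agree, and T has dropped out of the difference.
```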
This information, which is defined independent of any model for the feature selectivity of the neuron, provides the benchmark against which our reduction of dimensionality will be measured. To make the comparison, however, we need the analog of equation 3.5. Equation 3.12 provides us with an expression for the information conveyed by isolated spikes in terms of the probability that these spikes occur at particular times; this is analogous to equation 3.2 for single (nonisolated) spikes. If we follow a path analogous to that which leads from equation 3.2 to equation 3.5, we find an expression for the information that an isolated spike provides about the K stimulus dimensions $\vec{s}$:
\[
\Delta I^{\vec{s}}_{\text{iso spike}} = \int d\vec{s}\; P(\vec{s}\,|\,\text{iso spike at } t)\,\log_2\!\left[\frac{P(\vec{s}\,|\,\text{iso spike at } t)}{P(\vec{s}\,|\,\text{silence})}\right] + \langle\log_2 P(\text{silence}\,|\,\vec{s})\rangle, \tag{3.14}
\]
where the prior is now also conditioned on silence: $P(\vec{s}\,|\,\text{silence})$ is the distribution of $\vec{s}$ given that $\vec{s}$ is preceded by a silence of at least $t_{\text{silence}}$. Notice that this silence-conditioned distribution is not knowable a priori, and in particular it is not gaussian; $P(\vec{s}\,|\,\text{silence})$ must be sampled from data. The last term in equation 3.14 is the entropy of a binary variable that indicates whether particular moments in time are silent given knowledge
of the stimulus. Again, since the HH model is deterministic, this conditional entropy should be zero if we keep a complete description of the stimulus. In fact, we are not interested in describing those features of the stimulus that lead to silence, and it is not fair (as we will see) to judge the success of dimensionality reduction by looking at the prediction of silence, which necessarily involves multiple dimensions. To make a meaningful comparison, then, we will assume that there is a perfect description of the stimulus conditions leading to silence and focus on the stimulus features that trigger the isolated spike. When we approximate these features by the K-dimensional space $\vec{s}$, we capture an amount of information,
\[
\Delta I^{\vec{s}}_{\text{iso spike}} = \int d\vec{s}\; P(\vec{s}\,|\,\text{iso spike at } t)\,\log_2\!\left[\frac{P(\vec{s}\,|\,\text{iso spike at } t)}{P(\vec{s}\,|\,\text{silence})}\right]. \tag{3.15}
\]
This is the information that we can compare with $\Delta I_{\text{iso spike}}$ in equation 3.13 to determine the efficiency of our dimensionality reduction.

4 Characterizing the Hodgkin-Huxley Neuron

For completeness, we begin with a brief review of the dynamics of the space-clamped HH neuron (Hodgkin & Huxley, 1952). Hodgkin and Huxley modeled the dynamics of the current through a patch of membrane with ion-specific conductances:
\[
C\,\frac{dV}{dt} = I(t) - \bar g_K n^4 (V - V_K) - \bar g_{Na} m^3 h\,(V - V_{Na}) - \bar g_l (V - V_l), \tag{4.1}
\]
where $I(t)$ is injected current, K and Na subscripts denote potassium- and sodium-related variables, respectively, and l (for “leakage”) terms include all other ion conductances with slower dynamics. C is the membrane capacitance. $V_K$ and $V_{Na}$ are ion-specific reversal potentials, and $V_l$ is defined such that the total voltage V is exactly zero when the membrane is at rest. $\bar g_K$, $\bar g_{Na}$, and $\bar g_l$ are empirically determined maximal conductances for the different ion species, and the gating variables n, m, and h (on the interval [0, 1]) have their own voltage-dependent dynamics of the standard first-order form,
\[
\frac{dx}{dt} = \alpha_x(V)\,(1 - x) - \beta_x(V)\,x, \qquad x \in \{n, m, h\}, \tag{4.2}
\]
where, in the modern sign convention, the standard rate functions are
\[
\alpha_n = \frac{0.1 - 0.01V}{e^{\,1 - 0.1V} - 1}, \qquad \beta_n = 0.125\,e^{-V/80},
\]
\[
\alpha_m = \frac{2.5 - 0.1V}{e^{\,2.5 - 0.1V} - 1}, \qquad \beta_m = 4\,e^{-V/18},
\]
\[
\alpha_h = 0.07\,e^{-V/20}, \qquad \beta_h = \frac{1}{e^{\,3 - 0.1V} + 1}.
\]
We have used the original values for these parameters, except for changing the signs of the voltages to correspond to the modern sign convention: $C = 1\ \mu$F/cm$^2$, $\bar g_K = 36$ mS/cm$^2$, $\bar g_{Na} = 120$ mS/cm$^2$, $\bar g_l = 0.3$ mS/cm$^2$, $V_K = -12$ mV, $V_{Na} = +115$ mV, $V_l = +10.613$ mV. We have taken our system to be
a $\pi \times 30^2\ \mu$m$^2$ patch of membrane. We solve these equations numerically using fourth-order Runge-Kutta integration. The system is driven with a gaussian random noise current $I(t)$, generated by smoothing a gaussian random number stream with an exponential filter to generate a correlation time $\tau$. It is convenient to choose $\tau$ to be longer than the time steps of numerical integration, since this guarantees that all functions are smooth on the scale of single time steps. Here, we will always use $\tau = 0.2$ msec, a value that is both less than the timescale over which we discretize the stimulus for analysis and far less than the neuron's capacitative smoothing timescale $RC \sim 3$ msec. $I(t)$ has a standard deviation $\sigma$, but since the correlation time is short, the relevant parameter usually is the spectral density $S = \sigma^2\tau$; we also add a DC offset $I_0$. In the following, we will consider two parameter regimes, $I_0 = 0$ and $I_0$ a finite value, which leads to more periodic firing. The integration step size is fixed at 0.05 msec. The key numerical experiments were repeated at a step size of 0.01 msec with identical results. The time of a spike is defined as the moment of maximum voltage, for voltages exceeding a threshold (see Figure 1), estimated to subsample precision by quadratic interpolation. As spikes are both very stereotyped and very large compared to subspiking fluctuations, the precise value of this threshold is unimportant; we have used +20 mV.

4.1 Qualitative Description of Spiking. The first step in our analysis is to use reverse correlation, equation 2.4, to determine the average stimulus feature preceding a spike, the STA. In Figure 1 (top), we display the STA in a regime where the spectral density of the input current is $6.5 \times 10^{-4}$ nA$^2$ msec. The spike-triggered averages of the gating terms $n^4$ (proportion of open potassium channels) and $m^3h$ (proportion of open sodium channels) and the membrane voltage V are plotted in Figure 1 (middle and bottom).
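The dynamics and stimulus just described can be sketched as follows. This is our own minimal reimplementation, not the authors' code: it assumes the standard modern-sign-convention rate functions, works in current density ($\mu$A/cm$^2$, so the patch area drops out; the nA values in the text refer to the absolute patch current), and uses a mean drive chosen well above threshold simply so that this short demo run contains spikes.

```python
import numpy as np

# Membrane parameters from the text (modern sign convention, rest at V = 0).
C = 1.0                                   # uF/cm^2
gK, gNa, gL = 36.0, 120.0, 0.3            # mS/cm^2
VK, VNa, VL = -12.0, 115.0, 10.613        # mV

def rates(V):
    """Standard HH opening/closing rates for n, m, h (1/msec)."""
    an = (0.1 - 0.01 * V) / (np.exp(1.0 - 0.1 * V) - 1.0)
    bn = 0.125 * np.exp(-V / 80.0)
    am = (2.5 - 0.1 * V) / (np.exp(2.5 - 0.1 * V) - 1.0)
    bm = 4.0 * np.exp(-V / 18.0)
    ah = 0.07 * np.exp(-V / 20.0)
    bh = 1.0 / (np.exp(3.0 - 0.1 * V) + 1.0)
    return an, bn, am, bm, ah, bh

def deriv(state, I_ext):
    V, n, m, h = state
    an, bn, am, bm, ah, bh = rates(V)
    dV = (I_ext - gK * n**4 * (V - VK) - gNa * m**3 * h * (V - VNa)
          - gL * (V - VL)) / C
    return np.array([dV, an * (1 - n) - bn * n,
                     am * (1 - m) - bm * m,
                     ah * (1 - h) - bh * h])

def rk4_step(state, I_ext, dt):
    # Fourth-order Runge-Kutta with the fixed 0.05 msec step of the text.
    k1 = deriv(state, I_ext)
    k2 = deriv(state + 0.5 * dt * k1, I_ext)
    k3 = deriv(state + 0.5 * dt * k2, I_ext)
    k4 = deriv(state + dt * k3, I_ext)
    return state + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Stimulus: gaussian noise with exponential correlation time tau = 0.2 msec
# (an AR(1) recursion), standard deviation sigma, plus a DC offset I0.
# I0 here is deliberately suprathreshold so the short run spikes.
dt, tau, sigma, I0, n_steps = 0.05, 0.2, 3.0, 10.0, 4000
rng = np.random.default_rng(0)
rho = np.exp(-dt / tau)
x = np.zeros(n_steps)
for i in range(1, n_steps):
    x[i] = rho * x[i - 1] + np.sqrt(1.0 - rho**2) * rng.normal()
I = I0 + sigma * x

# Integrate from rest; detect spikes as upward crossings of +20 mV.
an, bn, am, bm, ah, bh = rates(0.0)
state = np.array([0.0, an / (an + bn), am / (am + bm), ah / (ah + bh)])
V_trace = np.empty(n_steps)
for i in range(n_steps):
    state = rk4_step(state, I[i], dt)
    V_trace[i] = state[0]
spikes = np.flatnonzero((V_trace[1:] > 20.0) & (V_trace[:-1] <= 20.0)) * dt
```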
The error bars mark the standard deviation of the trajectories of these variables. As expected, the voltage and gating variables follow highly stereotyped trajectories during the ∼5 msec surrounding a spike. First, the rapid opening of the sodium channels causes a sharp membrane depolarization (or rise in V); the slower potassium channels then open and repolarize the membrane, leaving it at a slightly lower potential than rest. The potassium channels close gradually, but meanwhile the membrane remains hyperpolarized and, due to its increased permeability to potassium ions, at lower resistance. These effects make it difficult to induce a second spike during this ∼15 msec “refractory period.” Away from spikes, the resting levels and fluctuations of the voltage and gating variables are quite small. The larger values evident in Figure 1 (middle and bottom) by ±15 msec are due to the summed contributions of nearby spikes. The spike-triggered average current has a largely transient form, so that spikes are on average preceded by an upward swing in current. On the
Figure 1: Spike-triggered averages with standard deviations for (top) the input current I, (middle) the fraction of open K⁺ and Na⁺ channels, and (bottom) the membrane voltage V, for the parameter regime $I_0 = 0$ and $S = 6.50 \times 10^{-4}$ nA$^2$ sec.
other hand, there is no obvious bottleneck in the current trajectories, so that the current variance is almost constant throughout the spike. This is qualitatively consistent with the idea of dimensionality reduction: if the neuron ignores most of the dimensions along which the current can vary, then the variance, which is shared almost equally among all dimensions for this near white noise, can change by only a small amount.
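The reverse-correlation step itself is simple to sketch. The example below is ours, on synthetic data (in the real analysis the spike times come from the simulated HH neuron): a known transient is planted before each "spike," and averaging the pre-spike stimulus segments recovers it from the noise.

```python
import numpy as np

# Synthetic STA demonstration; all names and numbers here are illustrative.
rng = np.random.default_rng(1)
stim = rng.normal(size=4000)                     # noise current trace
spike_idx = np.arange(300, 3900, 180)            # hypothetical spike bins
feature = np.exp(-np.arange(20)[::-1] / 5.0)     # assumed preferred transient
for i in spike_idx:
    stim[i - 20:i] += feature                    # each spike follows the feature

window = 100                                     # samples of history per spike
segments = np.stack([stim[i - window:i] for i in spike_idx])
sta = segments.mean(axis=0)                      # STA: mean pre-spike stimulus
# sta's last ~20 samples reproduce `feature`; earlier samples average to ~0.
```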
4.2 Interspike Interaction. Although the STA has the form of a differentiating kernel, suggesting that the neuron detects edge-like events in the current versus time, there must be a DC component to the cell's response. We recall that for constant inputs, the HH model undergoes a bifurcation to constant frequency spiking, where the frequency is a function of the value of the input above onset. Correspondingly, the STA does not sum precisely to zero; one might think of it as having a small integrating component that allows the system to spike under DC stimulation, albeit only above a threshold. The system's tendency to periodic spiking under DC current input also is felt under dynamic stimulus conditions and can be thought of as a strong interaction between successive spikes. We illustrate this by considering a different parameter regime with a small DC current and some added noise ($I_0 = 0.11$ nA and $S = 0.8 \times 10^{-4}$ nA$^2$ sec). Note that the DC component puts the neuron in the metastable region of its f-I curve (see Figure 2). In this regime, the neuron tends to fire quasi-regular trains of spikes intermittently, as shown in Figure 3. We will refer to these quasi-regular spike sequences as “bursts” (note that this term is often used to refer to compound spikes in neurons with additional channels; such events do not occur in the HH model). Spikes can be classified into three types: those initiating a spike burst, those within a burst, and those ending a burst. The minimum length of
Figure 2: Firing rate of the HH neuron as a function of injected DC current. The empty circles at moderate currents denote the metastable region, where the neuron may be either spiking or silent.
Figure 3: Segment of a typical spike train in a “bursting” regime.
Figure 4: Spike-triggered averages, derived from spikes leading (“on”), inside (“burst”), and ending (“off”) a burst. The parameters of this bursting regime are $I_0 = 0.11$ nA and $S = 0.8 \times 10^{-4}$ nA$^2$ sec. Note that the burst-ending spike average is, by construction, identical to that of any other within-burst spike for t < 0.
the silence between bursts is taken in this case to be 70 msec. Taking these three categories of spike as different “symbols” (de Ruyter van Steveninck & Bialek, 1988), we can determine the average stimulus for each. These are shown in Figure 4 with the spike at t = 0. In this regime, the initial spike of a burst is preceded by a rapid oscillation in the current. Spikes within a burst are affected much less by the current; the feature immediately preceding such spikes is similar in shape to a single “wavelength” of the leading spike feature, but is of much smaller amplitude and is temporally compressed into the interspike interval. Hence, although it is clear that the timing of a spike within a burst is determined largely by the timing of the previous spike, the current plays some role in affecting the precise placement. This also demonstrates that the shape of the STA is not the same for all spikes; it depends strongly and nontrivially on the time to the previous spike, and this is related to the observation that subtly different patterns of two or three spikes correspond to very different average stimuli (de Ruyter van Steveninck & Bialek, 1988). For a reader of the spike code, a spike within a burst conveys a different message about the input than the spike at the onset of the burst. Finally, the feature ending a burst has a very similar form to the onset feature, but reversed in time. Thus, to a good approximation, the absence of a spike at the end of a burst can be read as the opposite of the onset of the burst. In summary, this regime of the HH neuron is similar to a “flip-flop,” or 1-bit memory. Like its electronic analog, the neuron's memory is preserved
Figure 5: Overall spike-triggered average in the bursty regime, showing the ringing due to the tendency to periodic firing. Plotted in gray is the spike autocorrelation, showing the same oscillations.
by a feedback loop, here implemented by the interspike interaction. Large fluctuations in the input current at a certain frequency “flip” or “flop” the neuron between its silent and spiking states. However, while the neuron is spiking, further details of the input signal are transmitted by precise spike timing within a burst. If we calculate the spike-triggered average of all spikes for this regime, without regard to their position within a burst, then as shown in Figure 5, the relatively well-localized leading spike oscillation of Figure 4 is replaced by a long-lived oscillating function resulting from the spike periodicity. This is shown explicitly by comparing the overall STA with the spike autocorrelation, also shown in Figure 5. This same effect is seen in the STA of the burst spikes, which in fact dominates the overall average. Prediction of spike timing using such an STA would be computationally difficult due to its extension in time, but, more seriously, unsuccessful, as most of the function is an artifact of the spike history rather than the effect of the stimulus. While the effects of spike interaction are interesting and should be included in a complete model for spike generation, we wish here to consider only the current's role in initiating spikes. Therefore, as we have argued elsewhere, we limit ourselves initially to the cases in which interspike interaction plays no role (Agüera y Arcas et al., 2001; Agüera y Arcas & Fairhall, 2003). These “isolated” spikes can be defined as spikes preceded by a silent period $t_{\text{silence}}$ long enough to ensure decoupling from the timing of the previous spike. A reasonable choice for $t_{\text{silence}}$ can be inferred directly from the inter-
Figure 6: Hodgkin-Huxley interspike interval histogram for the parameters $I_0 = 0$ and $S = 6.5 \times 10^{-4}$ nA$^2$ sec, showing a peak at a preferred firing frequency and the long Poisson tail. The total number of spikes is $N = 5.18 \times 10^6$. The plot to the right is a closeup in linear scale.
spike interval distribution $P(\Delta t)$, illustrated in Figure 6. For the HH model, as in simpler models and many real neurons (Brenner, Agam, Bialek, & de Ruyter van Steveninck, 1998), the form of $P(\Delta t)$ has three noteworthy features: a refractory “hole” during which another spike is unlikely to occur, a strong mode at the preferred firing frequency, and an exponentially decaying, or Poisson, tail. The details of all three of these features are functions of the parameters of the stimulus (Tiesinga, José, & Sejnowski, 2000), and certain regimes may be dominated by only one or two features. The emergence of Poisson statistics in the tail of the distribution implies that these events are independent, so we can infer that the system has lost memory of the previous spike. We will therefore take isolated spikes to be those preceded by a silent interval $\Delta t \ge t_{\text{silence}}$, where $t_{\text{silence}}$ is well into the Poisson regime. The burst-onset spikes of Figure 4 are isolated spikes by this definition. Note that the bursty behavior evident in Figure 3 is characteristic of “type II” neurons, which begin firing at a well-defined frequency under DC stimulus; “type I” neurons, by contrast, can fire at arbitrarily low frequency under constant input. Nonetheless, the analysis that follows is equally applicable to either type of neuron: spikes still interact at close range and become independent at sufficient separation. In what follows, we will be probing the HH neuron with current noise that remains below the DC threshold for periodic firing.

5 Isolated Spike Analysis

Focusing now on isolated spikes, we proceed to a second-order analysis of the current fluctuations around the isolated spike triggered average (see
Figure 7: Spike-triggered average stimulus for isolated spikes.
Figure 7). We consider the response of the HH neuron to currents $I(t)$ with mean $I_0 = 0$ and spectral density of $S = 6.5 \times 10^{-4}$ nA$^2$ sec. Isolated spikes in this regime are defined by $t_{\text{silence}} = 60$ msec.

5.1 How Many Dimensions? As explained in section 2, our path to dimensionality reduction begins with the computation of covariance matrices for stimulus fluctuations surrounding a spike. The matrices are accumulated from stimulus segments 200 samples in length, roughly corresponding to sampling at the timescale sufficiently long to capture the relevant features. Thus, we begin in a 200-dimensional space. We emphasize that the theorem that connects eigenvalues of the matrix $\Delta C$ to the number of relevant dimensions is valid only for truly gaussian distributions of inputs and that by focusing on isolated spikes, we are essentially creating a nongaussian stimulus ensemble, namely, those stimuli that generate the silence out of which the isolated spike can appear. Thus, we expect that the covariance matrix approach will give us a heuristic guide to our search for lower-dimensional descriptions, but we should proceed with caution. The “raw” isolated spike-triggered covariance $C_{\text{iso spike}}$ and the corresponding covariance difference $\Delta C$, equation 2.7, are shown in Figure 8. The matrix shows the effect of the silence as an approximately translationally invariant band preceding the spike, the second-order analog of the constant negative bias in the isolated spike STA (see Figure 7). The spike itself is associated with features localized to ±15 msec. In Figure 9, we show the spectrum of eigenvalues of $\Delta C$ computed using a sample of 80,000 spikes. Before calculating the spectrum, we multiply $\Delta C$ by $C_{\text{prior}}^{-1}$. This has the effect of giving us eigenvalues scaled in units of the input standard deviation
Figure 8: The isolated spike-triggered covariance $C_{\text{iso}}$ (left) and covariance difference $\Delta C$ (right) for times $-30 < t < 5$ msec. The plots are in units of nA$^2$.
Figure 9: The leading 64 eigenvalues of the isolated spike-triggered covariance after accumulating 80,000 spikes.
along each dimension. Because the correlation time is short, $C_{\text{prior}}$ is nearly diagonal. While the eigenvalues decay rapidly, there is no obvious set of outstanding eigenvalues. To verify that this is not an effect of finite sampling, Figure 10 shows the spectrum of eigenvalue magnitudes as a function of sample size N. Eigenvalues that are truly zero up to the noise floor determined by
Figure 10: Convergence of the leading 64 eigenvalues of the isolated spike-triggered covariance with increasing sample size. The log slope of the diagonal is $1/\sqrt{n_{\text{spikes}}}$. Positive eigenvalues are indicated by crosses and negative by dots. The spike-associated modes are labeled with an asterisk.
sampling decrease like $1/\sqrt{N}$. We find that a sequence of eigenvalues emerges stably from the noise. These results do not, however, imply that a low-dimensional approximation cannot be identified. The extended structure in the covariance matrix induced by the silence requirement is responsible for the apparent high dimensionality. In fact, as has been shown in Agüera y Arcas and Fairhall (2003), the covariance eigensystem includes modes that are local and spike associated, and others that are extended and silence associated, and thus irrelevant to a causal model of spike timing prediction. Fortunately, because extended silences and spikes are (by definition) statistically independent, there is no mixing between the two types of modes. To identify the spike-associated modes, we follow the diagnostic of Agüera y Arcas and Fairhall (2003), computing the fraction of the energy of each mode concentrated in the period of silence, which we take to be $-60 \le t \le -40$ msec. The energy of a spike-associated mode in the silent period is due entirely to noise and will therefore decrease like $1/n_{\text{spikes}}$ with increasing sample size, while this energy remains of order unity for silence modes. Carrying out the test on the covariance modes, we obtain Figure 11, which shows that the first and fourth modes rapidly emerge as spike associated. Two further spike-associated modes appear over the sample shown, with the suggestion
Figure 11: For the leading 64 modes, fraction of the mode energy over the interval $-40 < t < -30$ msec as a function of increasing sample size. Modes emerging with low energy are spike associated. The symbols indicate the sign of the eigenvalue.
of other, weaker modes yet to emerge. The two leading silence modes are shown in Figure 12. Those shown are typical; most modes resemble Fourier modes, as the silence condition is close to time translationally invariant. Examining the eigenvectors corresponding to the two leading spike-associated eigenvalues, which for convenience we will denote $s_1$ and $s_2$ (although they are not the leading modes of the matrix), we find (see Figure 13) that the first mode closely resembles the isolated spike STA and the second is close to the derivative of the first. Both modes approximate differentiating operators; there is no linear combination of these modes that would produce an integrator. If the neuron filtered its input and generated a spike when the output of the filter crosses threshold, we would find two significant dimensions associated with a spike. The first dimension would correspond simply to the filter, as the variance in this dimension is reduced to zero (for a noiseless system) at the occurrence of a spike. As the threshold is always crossed from below, the stimulus projection onto the filter's derivative must be positive, again resulting in a reduced variance. It is tempting to suggest, then, that filtered threshold crossing is a good approximation to the HH model, but we will see that this is not correct.
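The whitened covariance-difference analysis described above can be sketched on synthetic data (our illustration; the directions, scalings, and sample sizes are invented, not taken from the paper). Variance shrunk along one input direction and grown along another produces eigenvalues of $C_{\text{prior}}^{-1}\Delta C$ of both signs, standing out from a sampling-noise bulk, as in Figures 9 and 10:

```python
import numpy as np

# Synthetic demonstration of the Delta-C eigenvalue analysis.
rng = np.random.default_rng(2)
D = 30
prior = rng.normal(size=(30000, D))       # stand-in prior stimulus ensemble
spike = rng.normal(size=(3000, D))        # stand-in spike-conditional ensemble
spike[:, 0] *= 0.3    # reduced variance  -> negative eigenvalue ~ -0.91
spike[:, 1] *= 1.6    # increased variance -> positive eigenvalue ~ +1.56

C_prior = np.cov(prior, rowvar=False)
C_spike = np.cov(spike, rowvar=False)
dC = C_spike - C_prior

# Multiplying by C_prior^{-1} scales eigenvalues in units of the input SD.
evals = np.linalg.eigvals(np.linalg.inv(C_prior) @ dC).real
order = np.argsort(-np.abs(evals))
leading = evals[order][:4]   # two eigenvalues stand out from the noise bulk
```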
Figure 12: Modes 2 and 3 of the spike-triggered covariance (silence associated).
Figure 13: Modes 1 and 4 of the spike-triggered covariance, which are the leading spike-associated modes.
5.2 Evaluating the Nonlinearity. At each instant of time, we can find the projections of the stimulus along the leading spike-associated dimensions $s_1$ and $s_2$. By construction, the distribution of these signals over the whole experiment, $P(s_1, s_2)$, is gaussian. The appropriate prior for the isolation condition, $P(s_1, s_2\,|\,\text{silence})$, differs only subtly from the gaussian prior. On the other hand, for each spike, we obtain a sample from the dis-
Figure 14: $10^4$ spike-conditional stimuli (or “spike histories”) projected along the first two covariance modes. The axes are in units of standard deviation on the prior gaussian distribution. The circles, from the inside out, enclose all but $10^{-1}, 10^{-2}, \ldots, 10^{-8}$ of the prior.
tribution $P(s_1, s_2\,|\,\text{iso spike at } t_0)$, leading to the picture in Figure 14. The prior and spike-conditional distributions are clearly better separated in two dimensions than in one, which means that the two-dimensional description captures more information than projection onto the spike-triggered average alone. Surprisingly, the spike-conditional distribution is curved, unlike what we would expect for a simple thresholding device. Furthermore, the eigenvalue of $\Delta C$, which we associate with the direction of threshold crossing (plotted on the y-axis in Figure 14), is positive, indicating increased rather than decreased variance in this direction. As we see, projections onto this mode are almost equally likely to be positive or negative, ruling out the threshold crossing interpretation.
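The projection step that produces the coordinates $(s_1, s_2)$ is just an inner product of each stimulus history with the two modes. A synthetic sketch (ours; the modes and data below are invented stand-ins for the covariance eigenvectors and the measured stimulus segments):

```python
import numpy as np

# Project stimulus histories onto two orthonormal modes to get (s1, s2).
rng = np.random.default_rng(5)
window = 100
t = np.arange(window)
mode1 = np.exp(-(t - 80) ** 2 / 50.0)            # hypothetical STA-like mode
mode1 /= np.linalg.norm(mode1)
mode2 = np.gradient(mode1)                        # roughly its derivative
mode2 -= mode1 * (mode1 @ mode2)                  # orthogonalize
mode2 /= np.linalg.norm(mode2)

segments = rng.normal(size=(5000, window))        # stand-in stimulus histories
s1 = segments @ mode1                             # projection coordinates
s2 = segments @ mode2
# For gaussian noise, (s1, s2) are unit-variance and uncorrelated because
# the modes are orthonormal; structure appears only in the spike-conditional
# ensemble.
```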
Combining equations 2.2 and 2.3 for isolated spikes, we have
\[
g(s_1, s_2) = \frac{P(s_1, s_2\,|\,\text{iso spike at } t_0)}{P(s_1, s_2\,|\,\text{silence})}, \tag{5.1}
\]
so that these two distributions determine the input-output relation of the neuron in this 2D space (Brenner, Bialek, et al., 2000). Recall that although the subspace is linear, g can have arbitrary nonlinearity. Figure 14 shows that this input-output relation has clear structure, but also some fuzziness. As the HH model is deterministic, the input-output relation should be a singular function in the continuous space of inputs: spikes occur only when certain exact conditions are met. Of course, finite time resolution introduces some blurring, and so we need to understand whether the blurring of the input-output relation in Figure 14 is an effect of finite time resolution or a real limitation of the 2D description.

5.3 Information Captured in Two Dimensions. We will measure the effectiveness of our description by computing the information in the 2D approximation, according to the methods described in section 3. If the two-dimensional approximation were exact, we would find that $I^{s_1,s_2}_{\text{iso spike}} = I_{\text{iso spike}}$; more generally, one finds $I^{s_1,s_2}_{\text{iso spike}} \le I_{\text{iso spike}}$, and the fraction of the information captured measures the quality of the approximation. This fraction is plotted in Figure 15 as a function of time resolution. For comparison, we also show the information captured in the one-dimensional case, considering only the stimulus projection along the STA. We find that our low-dimensional model captures a substantial fraction of the total information available in spike timing in an HH neuron over a range of time resolutions. The approximation is best near $\Delta t = 3$ msec,
Figure 15: Bits per spike (left) and fraction of the theoretical limit (right) of timing information in a single spike at a given temporal resolution captured by projection onto the STA alone (triangles) and projection onto $\Delta C$ covariance modes 1 and 2 (circles).
reaching 75%. Thus, the complex nonlinear dynamics of the HH model can be approximated by saying that the neuron is sensitive to a 2D linear subspace in the high-dimensional space of input signals, and this approximate description captures up to 75% of the mutual information between input currents and (isolated) spike arrival times. The dependence of information on time resolution (see Figure 15) shows that the absolute information captured saturates for both the 1D and 2D cases, at ≈3.2 and 4.1 bits respectively. Hence, for smaller $\Delta t$, the information fraction captured drops. The model provides, at its best, a time resolution of 3 msec, so that information carried by more precise spike timing is lost in our low-dimensional projection. Might this missing information be important for a real neuron? Stochastic HH simulations with realistic channel densities suggest that the timing of spikes in response to white noise stimuli is reproducible to within 1 to 2 msec (Schneidman, Freedman, & Segev, 1998), a figure that is comparable to what is observed for pyramidal cells in vitro (Mainen & Sejnowski, 1995), as well as in vivo in the fly's visual system (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Lewen, Bialek, & de Ruyter van Steveninck, 2001), the vertebrate retina (Berry, Warland, & Meister, 1997), the cat lateral geniculate nucleus (LGN) (Reinagel & Reid, 2000), and the bat auditory cortex (Dear, Simmons, & Fritz, 1993). This suggests that such timing details may indeed be important. We must therefore ask why our approximation seems to carry an inherent time resolution limitation and why, even at its optimal resolution, the full information in the spike is not recovered. For many purposes, recovering 75% of the information at ∼3 msec resolution might be considered a resounding success. On the other hand, with such a simple underlying model, we would hope for a more compelling conclusion.
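The information comparison underlying these percentages reduces to estimating a divergence between two sampled distributions, as in equations 3.5 and 3.15. A sketch of the histogram estimator (ours, on synthetic one-dimensional data; the paper's computation uses the measured projections and the silence-conditioned prior):

```python
import numpy as np

# Histogram estimate of the information a projection captures: the
# divergence between its spike-conditional and prior distributions.
rng = np.random.default_rng(3)
prior = rng.normal(size=200000)                  # stand-in for P(s | silence)
spike_cond = rng.normal(1.5, 0.6, size=20000)    # stand-in for P(s | spike)

edges = np.linspace(-5.0, 5.0, 61)
p = np.histogram(prior, bins=edges, density=True)[0]
q = np.histogram(spike_cond, bins=edges, density=True)[0]
w = np.diff(edges)
ok = (p > 0) & (q > 0)
I_bits = np.sum(q[ok] * np.log2(q[ok] / p[ok]) * w[ok])   # bits per spike
# For these gaussians the exact value is about 1.9 bits; dividing by the
# model-free information of eq. 3.13 gives the captured fraction.
```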
From a methodological point of view, it behooves us to ask what we are missing in our 2D model; perhaps the methods we use to find the missing information in the present case will prove applicable more generally.

6 What Is Missing?

The obvious first approach to improving the 2D approximation is to add more dimensions. Let us consider the neglected modes. We recall from Figure 11 that in simulations with very large numbers of spikes, we can isolate at least two more modes that have significant eigenvalues and are associated with the isolated spike rather than with the preceding silence; these are shown in Figure 16. These modes look like higher-order derivatives, which makes some sense given that we are missing information at high time resolution. On the other hand, if all we are doing is expanding in a basis of higher-order derivatives, it is not clear that we will do qualitatively better by including one or two more terms, particularly since sampling higher-dimensional distributions becomes very difficult.
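The covariance analysis that produces these modes can be summarized compactly. The sketch below is ours (variable names are not from the paper): it forms the spike-triggered average and the difference ΔC between the spike-triggered and prior stimulus covariances, and returns the eigenmodes of ΔC sorted by the magnitude of their eigenvalues.

```python
import numpy as np

def sta_and_covariance_modes(stimulus, spike_indices, window):
    """Spike-triggered average and covariance-difference modes.
    `stimulus` is a 1D input trace, `spike_indices` are the sample
    indices of (isolated) spikes, `window` is the history length in
    samples. Returns the STA and the eigenvalues/eigenvectors of
    delta_C = C_spike - C_prior, sorted by |eigenvalue|."""
    H = np.array([stimulus[i - window:i] for i in spike_indices
                  if i >= window])                    # spike-triggered histories
    sta = H.mean(axis=0)
    C_spike = np.cov(H, rowvar=False)
    windows = np.lib.stride_tricks.sliding_window_view(stimulus, window)
    C_prior = np.cov(windows, rowvar=False)           # prior stimulus covariance
    delta_C = C_spike - C_prior
    eigvals, eigvecs = np.linalg.eigh(delta_C)        # delta_C is symmetric
    order = np.argsort(-np.abs(eigvals))              # largest |eigenvalue| first
    return sta, eigvals[order], eigvecs[:, order]
```

Modes whose eigenvalues stand significantly out of the noise floor are the candidate relevant dimensions; as the text notes, resolving more than the first few requires very large numbers of spikes.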
Computation in a Single Neuron
Figure 16: The next two spike-associated modes. These resemble higher-order derivatives.
Our original model approximated the set of relevant features as lying within a K-dimensional linear subspace of the original D-dimensional input space. The covariance spectrum indicates that additional dimensions play a small but significant role. We suggest that a reasonable next step is to consider the relevant feature space to be low dimensional but not flat. The first two covariance modes define a plane; we will consider next a 2D geometric construction that curves into additional dimensions. Several methods have been proposed for the general problem of identifying low-dimensional nonlinear manifolds (Cottrell, Munro, & Zipser, 1988; Boser, Guyon, & Vapnik, 1992; Guyon, Boser, & Vapnik, 1993; Oja & Karhunen, 1995; Roweis & Saul, 2000), but these approaches share the disadvantage that the manifold, or, equivalently, the relevant set of features, remains implicit. Our hope is to understand the behavior of the neuron explicitly; we therefore wish to obtain an explicit representation of this curved feature space in terms of the basis vectors (features) that span it. A first approach to determining the curved subspace is to approximate it as a set of locally linear “tiles.” At any place on the surface, we wish to find two orthogonal directions that form the surface's local linear approximation. Curvature of the subspace means that these two directions will vary across the surface. We apply a simple algorithm to construct a locally linear tiling, with the advantage that it requires only first-order statistics. First, we take one of the directions to be globally defined; it is natural to take this direction to be the overall spike-triggered average. Allowing for curvature in the feature subspace means that along the direction of the STA, we allow the second, orthogonal direction to vary. In principle, this variation is continuous, but we will not be able to sample sufficiently to find the complete description of the second direction, so we sort the stimulus histories according to their projection along this direction and bin the sorted histories into a small number of bins. This defines the number of tiles used to cover the surface. We determine the conditional average of the stimulus histories in each bin and compute the (normalized) component orthogonal to the overall STA. This provides a second locally meaningful basis vector for the subspace in that bin. The resulting family of curves orthogonal to the STA is shown in Figure 17. Applying singular value decomposition to this family of curves shows that there are at least four significant independent directions in stimulus space apart from the STA, giving us an estimate of the embedding dimension of the feature subspace. Geometrically, this construction is equivalent to approximating the surface as a twisting ribbon, with the leading direction that of the STA, but where the surface is allowed to rotate about the STA axis. Further, we have discretized the twisting direction into a small number of tiles. The discretization is fixed by the size of the data: there is a trade-off between the precision of estimating the conditional average and the fidelity with which one follows the twist. Here we have restricted ourselves to an experimentally realistic number of isolated spikes (80,000), using an equal number of spikes per bin. Varying the number of bins, we found that eight bins gave the best result. Note that our model is discontinuous; it might be possible to improve it by interpolating smoothly between successive tiles.

Figure 17: The orthonormal components of spike-triggered averages from 80,000 spikes, conditioned on their projection onto the overall spike-triggered average (eight conditional averages shown).

Computing the information as a function of Δt using this locally linear model, we obtain the curve shown in Figure 18, where the results can be compared against the information found from the STA alone and from the covariance modes. The new model captures a maximum of 4.8 bits, recovering ≈90% of the information at a time resolution of approximately 1 msec. One of the main strengths of this simple approach is that we have succeeded in extracting additional geometrical information about the feature subspace using very limited data, as we compute only averages. Note that a similar number of spikes cannot resolve more than two spike-associated modes in the covariance matrix analysis.

Figure 18: Bits per spike (left) and fraction of the theoretical limit (right) of timing information in a single spike at a given temporal resolution captured by the locally linear tiling “twist” model (diamonds), compared to models using the STA alone (triangles) and projection onto ΔC covariance modes 1 and 2 (circles).

7 Discussion

The HH equations describe the dynamics of four degrees of freedom, and almost since these equations were first written down, there have been attempts to find simplifications or reductions. FitzHugh and Nagumo proposed a 2D system of equations that approximates the HH model (Fitzhugh, 1961; Nagumo, Arimoto, & Yoshikawa, 1962); this has the advantage that one can visualize the trajectories directly in the plane and thus achieve an intuitive graphical understanding of the dynamics and its dependence on parameters. The need for reduction in the sense pioneered by FitzHugh and by Nagumo et al. has become only more urgent with the growing use of increasingly complex HH-style model neurons with many different channel types. With this problem in mind, Kepler, Abbott, and Marder have introduced more systematic reduction methods, making use of the difference in timescales among the gating variables (Kepler, Abbott, & Marder, 1992; Abbott & Kepler, 1990). In the presence of constant current inputs, it makes sense to describe the HH equations as a 4D autonomous dynamical system; by well-known methods in dynamical systems theory, one could consider periodic input
currents by adding an extra dimension. The question asked by FitzHugh and Nagumo was whether this 4D or 5D description could be reduced to two or three dimensions. Closer in spirit to our approach is the work of Kistler, Gerstner, and van Hemmen (1997), who focused in particular on the interaction among successive action potentials. They argued that one could approximate the HH model by a nearly linear dynamical system with a threshold, identifying threshold crossing with spiking, provided that each spike generated either a change in threshold or an effective input current that influences the generation of subsequent spikes. The notion of model dimensionality considered here is distinct from the dynamical systems perspective, in which one simply counts the system's degrees of freedom. Here we are attempting to find a description of the dynamics that is essentially functional or computational. We have identified the output of the system as spike times, and our aim is to construct as complete a description as possible of the mapping between input and output. The dimensionality of our model is that of the space of inputs relevant for this mapping. There is no necessary relationship between these two notions of dimensionality. For example, in a neural network with two attractors, a system described by a potentially large number of variables, there might be a simple rule (perhaps even a linear filter) that allows us to look at the inputs to the network and determine the times at which switching events will occur. Conversely, once we leave the simplified world of constant or periodic inputs, even the small number of differential equations describing a neuron's channel dynamics could in principle be equivalent to a very complicated set of rules for mapping inputs into spike times. In our context, simplicity is (roughly) feature selectivity: the mapping is simple if spiking is determined by a small number of features in the complex history of inputs.
Following the ideas that emerged in the analysis of motion-sensitive neurons in the fly (de Ruyter van Steveninck & Bialek, 1988; Bialek & de Ruyter van Steveninck, 2003), we have identified “features” with “dimensions” and searched for low-dimensional descriptions of the input history that preserve the mutual information between inputs and outputs (spike times). We have considered only the generation of isolated spikes, leaving aside the question of how spikes interact with one another, as considered by Kistler et al. (1997). For these isolated spikes, we began by searching for projections onto a low-dimensional linear subspace of the originally ~200-dimensional stimulus space, and we found that a substantial fraction of the mutual information could be preserved in a model with just two dimensions. Searching for the information that is missing from this model, we found that rather than adding more (Euclidean) dimensions, we could capture approximately 90% of the information at high time resolution by keeping a 2D description but allowing these dimensions to vary over the surface, so that the neuron is sensitive to stimulus features that lie in a curved 2D subspace.
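The locally linear “twist” construction summarized here reduces to first-order statistics, and its core can be sketched in a few lines. The sketch below is ours, not the original analysis code; the names and the equal-count binning rule are our reading of the procedure in section 6.

```python
import numpy as np

def twist_tiling(histories, n_tiles=8):
    """Locally linear tiling of a curved 2D feature subspace.
    `histories` is an (n_spikes, D) array of spike-triggered stimulus
    histories. The first, global direction is the unit-normalized STA;
    histories are then sorted by their projection onto the STA and split
    into n_tiles bins with (nearly) equal spike counts. In each bin, the
    conditional average, orthogonalized against the STA and normalized,
    gives the local second basis vector."""
    sta = histories.mean(axis=0)
    u = sta / np.linalg.norm(sta)              # global direction (STA)
    order = np.argsort(histories @ u)          # sort by projection onto STA
    tiles = np.array_split(order, n_tiles)
    second = []
    for tile in tiles:
        cavg = histories[tile].mean(axis=0)    # conditional average in this tile
        v = cavg - (cavg @ u) * u              # component orthogonal to the STA
        second.append(v / np.linalg.norm(v))
    return u, np.array(second)                 # shapes (D,), (n_tiles, D)
```

Stacking the returned local axes and applying a singular value decomposition, as in the text, then estimates the embedding dimension of the curved subspace.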
The geometrical picture of neurons as sensitive to features defined by a low-dimensional stimulus subspace is attractive and, as noted in section 1, corresponds to a widely shared intuition about the nature of neuronal feature selectivity. While curved subspaces often appear as the targets for learning in complex neural computations such as invariant object recognition, the idea that such subspaces appear already in the description of single-neuron computation is, we believe, novel. While we have exploited the fact that long simulations of the HH model are quite tractable to generate large amounts of “data” for our analysis, it is important that, in the end, our construction of a curved relevant stimulus subspace involves a series of computations that are just simple generalizations of the conventional reverse correlation or spike-triggered average. This suggests that our approach can be applied to real neurons without requiring qualitatively larger data sets than would be needed for a careful reverse correlation analysis. In the same spirit, recent work has shown how covariance matrix analysis of the fly's motion-sensitive neurons can reveal nonlinear computations in a 4D subspace using data sets of fewer than 10^4 spikes (Bialek & de Ruyter van Steveninck, 2003). Low-dimensional linear subspaces can be found even in the responses of model neurons to naturalistic inputs if one searches directly for dimensions that capture the largest fraction of the mutual information between inputs and spikes (Sharpee et al., in press), and again the errors involved in identifying the relevant dimensions are comparable to the errors in reverse correlation (Sharpee, Rust, & Bialek, 2003). All of these results point to the practical feasibility of describing real neurons in terms of nonlinear computation on low-dimensional relevant subspaces in a high-dimensional stimulus space.
Our reduced model of the HH neuron both illustrates a novel approach to dimensional reduction and gives new insight into the computation performed by the neuron. The reduced model is essentially that of an edge detector for current trajectories, but one sensitive to a further stimulus parameter, producing a curved manifold. An interpretation of this curvature will be presented in a forthcoming manuscript. This curved representation is able to capture almost all of the information that isolated spikes convey about the stimulus or, conversely, to allow us to predict isolated spike times with high temporal precision from the stimulus. The emergence of a low-dimensional curved manifold in a model as simple as the HH neuron suggests that such a description may also be appropriate for biological neurons. Our approach is limited in that we address only isolated spikes. This restricted class of spikes nonetheless has biological relevance; for example, in vertebrate retinal ganglion cells (Berry & Meister, 1999), in rat somatosensory cortex (Panzeri, Petersen, Schultz, Lebedev, & Diamond, 2001), and in the LGN (Reinagel, Godwin, Sherman, & Koch, 1999), the first spike of a burst has been shown to convey distinct (and the majority of the) information.
However, a clear next step in this program is to extend our formalism to take interspike interactions into account. For neurons or models with explicit long timescales, adaptation induces very long-range history dependence, which complicates the issue of spike interactions considerably. A full understanding of the interaction between stimulus and spike history will therefore in general involve understanding the meanings of spike patterns (de Ruyter van Steveninck & Bialek, 1988; Brenner, Strong, et al., 2000) and the influence of the larger statistical context (Fairhall et al., 2001). Our results point to the need for a more parsimonious description of self-excitation, even for the simple case of dependence on only the last spike time. We close by reminding readers of the more ambitious goal of building bridges between the burgeoning molecular-level description of neurons and the functional or computational level. Armed with a description of spike generation as a nonlinear operation on a low-dimensional, curved manifold in the space of inputs, it is natural to ask how the details of this computational picture are related to molecular mechanisms. Are neurons with more different types of ion channels sensitive to more stimulus dimensions, or do they implement more complex nonlinearities in a low-dimensional space? Are adaptation and modulation mechanisms that change the nonlinearity separable from those that change the dimensions to which the cell is sensitive? Finally, while we have shown how a low-dimensional description can be constructed numerically from observations of the input-output properties of the neuron, one would like to understand analytically why such a description emerges and whether it emerges universally from the combinations of channel dynamics selected by real neurons.

Acknowledgments

We thank N. Brenner for discussions at the start of this work and M. Berry for comments on the manuscript.

References

Abbott, L. F., & Kepler, T. (1990).
Model neurons: From Hodgkin-Huxley to Hopfield. In L. Garrido (Ed.), Statistical mechanics of neural networks (pp. 5–18). Berlin: Springer-Verlag.
Agüera y Arcas, B. (1998). Reducing the neuron: A computational approach. Unpublished master's thesis, Princeton University.
Agüera y Arcas, B., Bialek, W., & Fairhall, A. L. (2001). What can a single neuron compute? In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 75–81). Cambridge, MA: MIT Press.
Agüera y Arcas, B., & Fairhall, A. (2003). What causes a neuron to spike? Neural Computation, 15, 1789–1807.
Barlow, H. B. (1953). Summation and inhibition in the frog's retina. J. Physiol., 119, 69–88.
Barlow, H. B., Hill, R. M., & Levick, W. R. (1964). Retinal ganglion cells responding selectively to direction and speed of image motion in the rabbit. J. Physiol., 173, 377–407.
Berry, M. J. II, & Meister, M. (1999). The neural code of the retina. Neuron, 22, 435–450.
Berry, M. J. II, Warland, D., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. U.S.A., 94, 5411–5416.
Bialek, W., & de Ruyter van Steveninck, R. R. (2003). Features and dimensions: Motion estimation in fly vision. Unpublished manuscript.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), 5th Annual ACM Workshop on COLT (pp. 144–152). Pittsburgh, PA: ACM Press.
Bray, D. (1995). Protein molecules as computational elements in living cells. Nature, 376, 307–312.
Brenner, N., Agam, O., Bialek, W., & de Ruyter van Steveninck, R. R. (1998). Universal statistical behavior of neural spike trains. Phys. Rev. Lett., 81, 4000–4003.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702.
Brenner, N., Strong, S., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Synergy in a neural code. Neural Comp., 12, 1531–1552. Available on-line: http://xxx.lanl.gov/abs/physics/9902067.
Cottrell, G. W., Munro, P., & Zipser, D. (1988). Image compression by back propagation: A demonstration of extensional programming. In N. Sharkey (Ed.), Models of cognition: A review of cognitive science (Vol. 2, pp. 208–240). Norwood, NJ: Ablex.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
de Boer, E., & Kuyper, P. (1968). Triggered correlation. IEEE Trans. Biomed. Eng., 15, 169–179.
de Ruyter van Steveninck, R. R., & Bialek, W. (1988). Real-time performance of a movement-sensitive neuron in the blowfly visual system: Information transfer in short spike sequences. Proc. Roy. Soc. Lond.
B, 234, 379–414.
de Ruyter van Steveninck, R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
Dear, S. P., Simmons, J. A., & Fritz, J. (1993). A possible neuronal basis for representation of acoustic scenes in auditory cortex of the big brown bat. Nature, 364, 620–623.
Fairhall, A., Lewen, G., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Efficiency and ambiguity in an adaptive neural code. Nature, 412, 787–792.
Fitzhugh, R. (1961). Impulses and physiological states in theoretical models of nerve membrane. Biophysical J., 1, 445–466.
Guyon, I. M., Boser, B. E., & Vapnik, V. N. (1993). Automatic capacity tuning of very large VC-dimension classifiers. In S. J. Hanson, J. D. Cowan, & C. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 147–155). San Mateo, CA: Morgan Kaufmann.
Hartline, H. K. (1940). The receptive fields of optic nerve fibers. Amer. J. Physiol., 130, 690–699.
Hille, B. (1992). Ionic channels of excitable membranes. Sunderland, MA: Sinauer.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. (Lond.), 160, 106–154.
Iverson, L., & Zucker, S. W. (1995). Logical/linear operators for image curves. IEEE Trans. Pattern Analysis and Machine Intelligence, 17, 982–996.
Keat, J., Reinagel, P., Reid, R. C., & Meister, M. (2001). Predicting every spike: A model for the responses of visual neurons. Neuron, 30(3), 803–817.
Kepler, T., Abbott, L. F., & Marder, E. (1992). Reduction of conductance-based neuron models. Biological Cybernetics, 66, 381–387.
Kistler, W., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Computation, 9, 1015–1045.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Kuffler, S. W. (1953). Discharge patterns and functional organization of mammalian retina. J. Neurophysiol., 16, 37–68.
Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Neural coding of naturalistic motion stimuli. Network, 12, 317–329.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Nagumo, J., Arimoto, S., & Yoshikawa, Z. (1962). An active pulse transmission line simulating nerve axon. Proc. IRE, 50, 2061–2071.
Oja, E., & Karhunen, J. (1995). Signal separation by nonlinear Hebbian learning. In M. Palaniswami, Y. Attikiouzel, R. J. Marks II, D. Fogel, & T. Fukuda (Eds.), Computational intelligence: A dynamic system perspective (pp. 83–97). New York: IEEE Press.
Panzeri, S., Petersen, R., Schultz, S., Lebedev, M., & Diamond, M. (2001). The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 29, 769–777.
Reinagel, P., Godwin, D., Sherman, S. M., & Koch, C. (1999). Encoding of visual information by LGN bursts. J. Neurophys., 81, 2558–2569.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neuroscience, 20(14), 5392–5400.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65, 386–408.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan Books.
Roweis, S., & Saul, L. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290, 2323–2326.
Schneidman, E., Freedman, B., & Segev, I. (1998). Ion channel stochasticity may be critical in determining the reliability and precision of spike timing. Neural Comp., 10, 1679–1703.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. Journal, 27, 379–423, 623–656.
Sharpee, T., Rust, N. C., & Bialek, W. (in press). Maximally informative dimensions: Analysing neural responses to natural signals. Neural Information Processing Systems 2002. Available on-line: http://xxx.lanl.gov/abs/physics/0208057.
Sharpee, T., Rust, N. C., & Bialek, W. (2003). Maximally informative dimensions: Analysing neural responses to natural signals. Unpublished manuscript.
Stanley, G. B., Lei, F. F., & Dan, Y. (1999). Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus. J. Neurosci., 19(18), 8036–8042.
Theunissen, F., Sen, K., & Doupe, A. (2000). Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. J. Neurosci., 20, 2315–2331.
Tiesinga, P. H. E., José, J., & Sejnowski, T. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Physical Review E, 62, 8413–8419.
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In B. Hajek & R. S. Sreenivas (Eds.), Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (pp. 368–377). Champaign, IL. Available on-line: http://xxx.lanl.gov/abs/physics/0004057.

Received January 9, 2003; accepted January 28, 2003.
NOTE
Communicated by Bruno Olshausen
Learning the Nonlinearity of Neurons from Natural Visual Stimuli

Christoph Kayser
[email protected]

Konrad P. Körding
[email protected]

Peter König
[email protected]
Institute of Neuroinformatics, University / ETH Zürich, Winterthurerstrasse 190, 8057 Zürich, Switzerland
Neural Computation 15, 1751–1759 (2003) © 2003 Massachusetts Institute of Technology

Learning in neural networks is usually applied to parameters related to linear kernels, keeping the nonlinearity of the model fixed. Thus, for successful models, the properties and parameters of the nonlinearity have to be specified using a priori knowledge, which is often missing. Here, we investigate adapting the nonlinearity simultaneously with the linear kernel. We use natural visual stimuli to train a simple model of the visual system. Many of the neurons converge to an energy detector matching existing models of complex cells. The overall distribution of the parameter describing the nonlinearity matches recent physiological results well. Controls with randomly shuffled natural stimuli and pink noise demonstrate that the match between simulation and experimental results depends on the higher-order statistical properties of natural stimuli.

1 Introduction

One important application of artificial neural networks is to model real cortical systems. For example, the receptive fields of sensory cells have been studied using such models, most prominently in the visual system. An important reason for the choice of this system is our good knowledge of natural visual stimuli (Ruderman, 1994; Dong & Atick, 1995; Kayser, Einhäuser, & König, in press). A number of recent studies train networks with natural images, modeling cells in the primary visual cortex (see Simoncelli & Olshausen, 2001, and references therein). In these studies, a nonlinear model for a cell was fixed, and the linear kernel was optimized to maximize a given objective function. The resulting neurons share many properties of either simple or complex cells as found in the visual system. However, these studies incorporate prior knowledge about the neuron model to specify the nonlinearity of the neuron in advance. But to better understand and predict properties of less-well-studied systems, where such prior knowledge is
C. Kayser, K. Körding, and P. König
not available, models are needed that can adapt the nonlinearity of the cell model. Here we take a step in this direction and present a network of visual neurons in which both a parameter controlling the nonlinearity and the linear receptive field kernels are learned simultaneously. The network optimizes a stability objective (Földiák, 1991; Kayser, Einhäuser, Dümmer, König, & Körding, 2001; Wiskott & Sejnowski, 2002; Einhäuser, Kayser, König, & Körding, 2002) on natural movies, and the learned transfer functions are comparable to those found in physiology as well as to those used in other established models of visual neurons.

2 Methods

As a neuron model, we consider a generalization of the two-subunit energy detector (Adelson & Bergen, 1985). The activity A for a given stimulus I is given by

A(t) = \left( |I \cdot W_1|^N + |I \cdot W_2|^N \right)^{1/N}.  (2.1)
Each neuron is characterized by two linear filters W_{1,2}, as well as the exponent of the transfer function, N. For N = 1, the transfer function has unit gain, and except for the absolute value, the neuron performs a linear operation. N = 2 corresponds to a classical energy detector. All of these parameters are subject to the optimization of an objective function. The objective used here is temporal coherence, a measure of the stability of the neuron's output, plus a decorrelation term:

\Psi = \Psi_{\mathrm{time}} + \Psi_{\mathrm{Decorr}} = - \frac{\left\langle (A(t) - A(t - \Delta t))^2 \right\rangle_{\mathrm{Stimuli}}}{\mathrm{var}(A(t))_{\mathrm{Stimuli}}} - \sum_{\mathrm{Neurons}\ i,j} \langle A_i A_j \rangle^2_{\mathrm{Stimuli}}.  (2.2)
A neuron is optimal with respect to the first term of the objective function if its activity changes slowly on a timescale given by Δt. The second term avoids the trivial solution in which all neurons acquire identical receptive fields. Implementing optimization by gradient ascent, the update rule for the parameters is obtained by differentiating equation 2.2 with respect to the parameters describing the neurons. The subunits' receptive fields W are initialized with values varying uniformly between 0 and 1, and the exponents with values between 0.1 and 6. During optimization, in a few cases the exponent of a neuron would diverge to infinity. In this case, the transfer function implements a maximum operation on the subunits, and the precise value of the exponent is no longer relevant. To avoid the associated numerical problems, we constrain the exponent to be smaller than 15.
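Equations 2.1 and 2.2 translate directly into code. The sketch below uses our own names; we read the temporal-coherence term as summed over neurons and the decorrelation sum as running over distinct pairs i ≠ j, since its stated purpose is to keep receptive fields from becoming identical.

```python
import numpy as np

def activity(I, W1, W2, N):
    """Generalized two-subunit energy detector, equation 2.1:
    A(t) = (|I . W1|^N + |I . W2|^N)^(1/N)."""
    return (np.abs(I @ W1) ** N + np.abs(I @ W2) ** N) ** (1.0 / N)

def objective(A):
    """Temporal coherence plus decorrelation, equation 2.2.
    A has shape (n_neurons, n_timesteps): the activity of each
    neuron over the stimulus sequence."""
    dA = np.diff(A, axis=1)                        # A(t) - A(t - dt)
    psi_time = -np.sum(np.mean(dA ** 2, axis=1) / np.var(A, axis=1))
    corr = (A @ A.T) / A.shape[1]                  # <A_i A_j> over stimuli
    i, j = np.triu_indices(A.shape[0], k=1)        # distinct pairs i < j
    psi_decorr = -np.sum(corr[i, j] ** 2)
    return psi_time + psi_decorr
```

In a full implementation, gradient ascent on this objective with respect to W1, W2, and N (with N clipped to at most 15, as above) would replace the hand-derived update rules.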
Learning the Nonlinearity of Neurons
The input stimuli are taken from natural movies recorded by a lightweight camera mounted on the head of a cat exploring the local park (for details, see Einhäuser et al., 2002). As controls, two further sets of stimuli are used. First, temporally “white” stimuli are constructed by randomly shuffling the frames of the natural video. Second, the natural movie is transformed into spatiotemporal “pink” noise by assigning random phases to the space-time Fourier transform. This yields a movie with the same second-order statistics as the original but without higher-order correlations. From two consecutive video frames, one corresponding to t − Δt and the other to t, we cut pairs of patches of size 30 × 30 pixels (≈6 degrees). The square patches are multiplied by a circular gaussian window in order to ensure isotropic input. The training set consists of 40,000 such pairs. The dimensionality of stimulus space is reduced by principal component analysis to ease the computational load. Excluding the first component, which corresponds to the mean intensity, we keep the first 120 components out of a total of 1800, carrying more than 95% of the variance. We quantify the properties of the receptive fields efficiently by convolving the subunits with a large circular grating covering frequencies ranging from 0.5 to 15 cycles per patch and all orientations (see Figure 1C). The convolutions of the subunits with this patch are combined pointwise according to the cell model (see equation 2.1), resulting in a two-dimensional activity diagram.

3 Results

We optimize the receptive fields and exponents of a population of 80 cells. In the following, we first discuss one example in detail before reporting results for the whole population. The development of the subunits of the example neuron is shown in Figure 1A. It is selective to orientation and spatial frequency.
This selectivity develops after about 50 iterations simultaneously in both subunits, and the spatial profile is very similar to a Gabor filter. Indeed, the best-fitted Gabor filter explains 81% and 74% of the variance of the two subunits' receptive fields. Simultaneously with their emergence, the oriented structures in the two subunits become shifted relative to each other by 87 degrees. The evolution of this neuron's exponent during optimization is shown in Figure 1B. It converges to a value of 2.35, which is close to the exponent of 2.0 of the classical energy detector. The response profile characterizing the spatial receptive field (see section 2) is displayed in Figure 1C. It shows that the neuron is selective to orientation (oblique) and spatial frequency (11-pixel wavelength). The response to a grating shows a constant increase toward higher spatial frequencies, with a small frequency-doubled modulation on top. The response to a drifting grating (data not shown) shows a similar effect, which is caused by the absolute value in the formula for the activity. However, it is weak compared to the constant increase of the response.
The neuron's small modulation ratio for drifting gratings (F1/F0 = 0.26) is equivalent to a translation-invariant response. Therefore, this neuron shares the properties of complex cells found in primary visual cortex (Skottun et al., 1991). Looking at the spatial tuning properties of the cells in the entire population, we find that all neurons are selective to orientation (mean width at half height is 35 degrees, and the ratio of the response strength at the preferred orientation to the response at the orthogonal orientation is 11.2). Furthermore, the neurons are selective to spatial frequency, with a median spatial frequency selectivity index of 59 (for a definition, see Schiller, Finlay, & Volman, 1976). In the population, independent of the initial value, most exponents converge quickly toward a value close to two. Then, for most cells, the exponent changes only little over the following iterations. Sometimes, however, the exponent suddenly explodes and grows until it reaches our constraint of N = 15. Over the whole population, most cells acquire an exponent close to 2 (see Figure 1D, right, red), although the distribution is fairly broad. This shows that the classical energy detector is an optimal solution for the given objective. Furthermore, the distribution of exponents is similar to recent physiological results obtained in cat primary visual cortex (see Figure 1D, right, gray, reproduced from Lau, Stanley, & Dan, 2002). An overview of the relation between a neuron's exponent and its response properties for the whole population is given in Figure 1F, where we plot the exponent versus the modulation ratio for drifting gratings. Note that cells with a low modulation ratio would be classified as complex cells. Responses of cells with a high modulation ratio are also specific to the position of a stimulus, as are simple cells in visual cortex. Cells that can be classified as complex cells are found only in the region of the exponent close to two.
Figure 1: Facing page. (A) Subunits' receptive fields of an example neuron as a function of the number of iterations. (B) The exponent N and the similarity index of the subunits for the example neuron during the optimization. The similarity index is the scalar product of the final subunit (after 100 iterations) with the subunit at a given iteration, divided by the lengths of these vectors. The tick marks on the x-axis indicate 35, 55, and 95 iterations in correspondence to panel A. (C) Left: The circular grating patch used for quantifying the neuron's response properties. Right: Response diagram of the example neuron on a color scale from bright red (high activity) to dark blue (no activity). (D) Left: The evolution of the exponents for the population. The tick marks on the x-axis indicate 35, 55, and 95 iterations in correspondence to panel A. Right: The histogram of the exponents after 100 iterations over the population is shown in red. Physiological data taken from Lau et al. (2002) are shown in gray. (E) Histograms of the final values for the exponents in the control simulations. Light gray: temporally white stimulus; black: spatiotemporal pink noise. (F) Exponent vs. modulation ratio for drifting gratings. The response diagrams of seven neurons are shown at the upper border. At the right border, all neurons with an exponent larger than 5 are located.
Cells
C. Kayser, K. Körding, and P. König
with large modulation ratios, on the other hand, are found for all values of the exponent. Furthermore, the preferred orientation and spatial frequency are distributed homogeneously. To probe which properties of natural scenes are important for the stable development of the exponent and the spatial receptive fields, we perform two controls. Since the objective function is based on temporal properties of the neurons' activity, we first use a stimulus without temporal correlations. This temporally white stimulus is constructed by shuffling the frames of the original video (see section 2), but preserves its spatial structure. After a few iterations, the exponents of all neurons diverge to either the upper or lower limit imposed (see Figure 1E, gray). The spatial receptive fields appear noisy and display weak tuning to orientation (median of the ratio of response at optimal orientation to response at orthogonal orientation is 3.7). This shows that the distribution of exponents obtained in the first simulation is not a property inherent in our network. Rather, a smooth temporal structure of the input is necessary for the development of both the exponent and the spatial receptive fields. As a second control, we train the network on spatiotemporal pink noise (see section 2) having the same second-order statistics as the natural movie. While this stimulus lacks the higher-order structure of natural scenes, it has a smooth temporal structure compared to the first control. During optimization, the exponents converge as in the original simulation. However, the histogram of final values (see Figure 1E, black) differs markedly from that in Figure 1D. Most neurons acquire an exponent slightly above one, with many neurons having a linear slope of their transfer function. Furthermore, most neurons resemble low-pass filters, and their orientation tuning is weak (median of the response at preferred orientation to response at the orthogonal orientation is 2.9).
Thus, the second-order correlations of natural movies are sufficient to allow stable learning of the exponent, as well as organized spatial receptive fields. However, to achieve a good match to physiological results in terms of both spatial receptive fields and the exponent of the transfer function, the higher-order structure of the natural scenes is decisive.

4 Discussion

We report here a network simultaneously optimizing the linear kernel of the given neuron model and the exponent of the neurons' transfer functions with respect to the same objective function. The results of the study are interesting for the following reasons. First, by changing parameters controlling the transfer function, neurons can represent a much larger class of nonlinear functions and, given a fixed network size, are able to solve a larger class of problems. Second, nonlinearities are ubiquitous in sensory systems (Ghazanfar, Krupa, & Nicolelis, 2001; Shimegi, Ichikawa, Akasaki, & Sato, 1999; Anzai, Ohzawa, & Freeman, 1999;
Lau et al., 2002; Escabi & Schreiner, 2002; Field & Rieke, 2002) and a wide variety of effects contributes to nonlinear effects within a neuron (Softky & Koch, 1993; Mel, 1994; Reyes, 2002). Yet in most cases, we simply lack the knowledge for a quantitative specification. Therefore, to model these systems, networks are required that can adjust all parameters and do not require detailed a priori knowledge. While most learning schemes still concentrate on linear kernels, a small number of studies address optimizing the neurons' nonlinearity. Bell and Sejnowski (1995) start from an Infomax principle and derive a two-step learning scheme, which allows optimizing the shape of a (sigmoidal) transfer function to match the input probability density. This directly relates to the concept of independent component analysis, where the adaptation of the transfer function adjusts the nonlinearity of the system (Nadal & Parga, 1994; Everson & Roberts, 1998; Obradovic & Deco, 1998; Pearlmutter & Parra, 1997). Recently, this approach was applied to natural auditory stimuli, comparing the resulting receptive field structure to physiological results (Lewicki, 2002). However, some of these theoretical derivations require an invertible transfer function, which is not the case in our network and might not apply in general for neurons that discard information, such as neurons having invariance properties. Extending this previous research, we optimize the weight matrix and the nonlinearity simultaneously at each iteration and compare the resulting distribution of the parameters to recent experimental data. In our network, we find many cells that are selective to orientation and spatial frequency, have an exponent around 2, and have translation-invariant responses. Their properties match the classical energy detector model for complex cells (Movshon, Thompson, & Tolhurst, 1978; Adelson & Bergen, 1985).
Thus, our results show how this classical model of complex neurons can develop in a generalized nonlinear model. Furthermore, in a recent study of the cat visual system, the nonlinearity of the computational properties of cortical neurons has been investigated (Lau et al., 2002). The measured exponent of the nonlinear transfer function shows a large variation. An exponent around 2 best explains the response of many neurons, while in other cells it ranges between 0 and high values. The overall distribution of exponents found in the present study matches these experimental results surprisingly well.

Acknowledgments

We thank Laurenz Wiskott for introducing the evaluation of receptive fields using the circular grating patch to us, and Mathew Diamond and Rasmus Petersen for valuable discussion of yet unpublished results. This work was supported by the Centre of Neuroscience Zürich (C.K.), the SNF (Grant No. 31-65415.01, P.K.), and the EU IST-2000-28127 / BBW 01.0208-1 (K.P.K., P.K.).
References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Optic. Soc. Am. A, 2, 284–299.
Anzai, A., Ohzawa, I., & Freeman, R. (1999). Neural mechanisms for processing binocular information I. Simple cells. J. Neurophysiol., 82, 891–908.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1004–1034.
Dong, D. W., & Atick, J. J. (1995). Statistics of natural time varying images. Network: Computation in Neural Systems, 6, 345–358.
Einhäuser, W., Kayser, C., König, P., & Körding, K. P. (2002). Learning the invariance properties of complex cells from their responses to natural stimuli. Eur. J. Neurosci., 15, 475–486.
Escabi, M. A., & Schreiner, C. E. (2002). Nonlinear spectrotemporal sound analysis by neurons in the auditory midbrain. J. Neurosci., 22, 4114–4131.
Everson, R. M., & Roberts, S. J. (1998). Independent component analysis: A flexible nonlinearity and decorrelating manifold approach. Neural Computation, 11, 1957–1983.
Field, G. D., & Rieke, F. (2002). Nonlinear signal transfer from mouse rods to bipolar cells and implications for visual sensitivity. Neuron, 34, 773–785.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Ghazanfar, A. A., Krupa, D. J., & Nicolelis, M. A. L. (2001). Role of cortical feedback in the receptive field structure and nonlinear response properties of somatosensory thalamic neurons. Exp. Brain Res., 141, 88–100.
Kayser, C., Einhäuser, W., Dümmer, O., König, P., & Körding, K. P. (2001). Extracting slow subspaces from natural videos leads to complex cells. In G. Dorffner, H. Bischof, & K. Hornik (Eds.), Proc. Int. Conf. Artif. Neural Netw. (ICANN) (pp. 1075–1080). Berlin: Springer-Verlag.
Kayser, C., Einhäuser, W., & König, P. (in press). Temporal correlations of orientations in natural scenes. Neurocomputing.
Lau, B., Stanley, G. B., & Dan, Y. (2002). Computational subunits of visual cortical neurons revealed by artificial neural networks. Proc. Natl. Acad. Sci. USA, 99, 8974–8979.
Lewicki, M. S. (2002). Efficient coding of natural sounds. Nature Neurosci., 5(4), 356–363.
Mel, B. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.
Movshon, J. A., Thompson, I. D., & Tolhurst, D. J. (1978). Receptive field organization of complex cells in the cat's striate cortex. J. Physiol. (London), 283, 79–99.
Nadal, J.-P., & Parga, N. (1994). Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer. Network: Computation in Neural Systems, 5, 565–581.
Obradovic, D., & Deco, G. (1998). Information maximization and independent component analysis: Is there a difference? Neural Computation, 10(8), 2085–2101.
Pearlmutter, B., & Parra, L. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 613–619). Cambridge, MA: MIT Press.
Reyes, A. (2002). Influence of dendritic conductances on the input-output properties of neurons. Annu. Rev. Neurosci., 24, 653–675.
Ruderman, D. L. (1994). The statistics of natural images. Network: Computation in Neural Systems, 5, 517–548.
Schiller, P. H., Finlay, B. L., & Volman, S. F. (1976). Quantitative studies of single cell properties in monkey striate cortex. III. Spatial frequency. J. Neurophysiol., 39, 1334–1351.
Shimegi, S., Ichikawa, T., Akasaki, T., & Sato, H. (1999). Temporal characteristics of response integration evoked by multiple whisker stimulations in the barrel cortex of rats. J. Neurosci., 19, 10164–10175.
Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Ann. Rev. Neurosci., 24, 1193–1216.
Skottun, B. C., De Valois, R. L., Grosof, D. H., Movshon, J. A., Albrecht, D. G., & Bonds, A. B. (1991). Classifying simple and complex cells on the basis of response modulation. Vision Res., 31, 1079–1086.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Wiskott, L., & Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14, 715–770.

Received September 4, 2002; accepted January 30, 2003.
LETTER
Communicated by Bard Ermentrout
Analytic Expressions for Rate and CV of a Type I Neuron Driven by White Gaussian Noise

Benjamin Lindner
[email protected]
André Longtin
[email protected]
Department of Physics, University of Ottawa, Ottawa, Canada K1N 6N5
Adi Bulsara
[email protected]
SPAWAR Systems Center, San Diego, CA 92152-6147, U.S.A.
We study the one-dimensional normal form of a saddle-node system under the influence of additive gaussian white noise and a static "bias current" input parameter, a model that can be looked upon as the simplest version of a type I neuron with stochastic input. This is in contrast with the numerous studies devoted to the noise-driven leaky integrate-and-fire neuron. We focus on the firing rate and coefficient of variation (CV) of the interspike interval density, for which scaling relations with respect to the input parameter and noise intensity are derived. Quadrature formulas for rate and CV are numerically evaluated and compared to numerical simulations of the system and to various approximation formulas obtained in different limiting cases of the model. We also show that caution must be used to extend these results to the θ neuron model with multiplicative gaussian white noise. The correspondence between the first passage time statistics for the saddle-node model and the θ neuron model is obtained only in the Stratonovich interpretation of the stochastic θ neuron model, while previous results have focused only on the Ito interpretation. The correct Stratonovich interpretation yields CVs that are still relatively high, although smaller than in the Ito interpretation; it also produces certain qualitative differences, especially at larger noise intensities. Our analysis provides useful relations for assessing the distance to threshold and the level of synaptic noise in real type I neurons from their firing statistics. We also briefly discuss the effect of finite boundaries (finite values of threshold and reset) on the firing statistics.

Neural Computation 15, 1761–1788 (2003) © 2003 Massachusetts Institute of Technology

1 Introduction

The transition from quiescent to periodic firing behavior as a bias current increases leads to significant changes in the dynamics of an excitable
system. This transition is characterized by the behavior of the firing frequency across it. According to this behavior, it is possible to divide neurons into type I and type II dynamics (Hodgkin, 1948). Type I dynamics exhibit a continuous variation in firing frequency as a bias parameter (such as an input current) is increased. Dynamically, such a transition to repetitive firing is associated with a saddle-node bifurcation (Rinzel & Ermentrout, 1989). At this bifurcation, a stable and an unstable fixed point coalesce and disappear, with a stable limit cycle taking their place. In the neural modeling context, the stable fixed point is associated with the resting potential, while the unstable fixed point (and associated unstable direction(s) or "unstable manifold" in phase space) is associated with the threshold of the cell. Type II membranes exhibit a finite nonzero frequency as repetitive firing begins. Such transitions are associated with Hopf bifurcations. Some model and experimental systems can exhibit transitions to repetitive firing by one or the other of these mechanisms, depending on system parameters. The relevance of noise-induced firing in type I membranes is related to the problem of large coefficients of variation (CVs). There has been much effort to explain the observed variability in firing rates in various experimental preparations, and much attention has been devoted to the occurrence of interspike interval histograms (ISIH) with high CVs, defined as the ratio of ISIH standard deviation to mean (Wilbur & Rinzel, 1983; Softky & Koch, 1993; Shadlen & Newsome, 1994; Bell, Mainen, Tsodyks, & Sejnowski, 1995; Troyer & Miller, 1997). Gutkin and Ermentrout (1998) have shown using numerical simulations that large CVs can be obtained from type I dynamics with noise.
They used the so-called one-dimensional θ neuron model (Ermentrout, 1996) and compared results from stochastic simulations of this model with those from simulations of the two-dimensional Morris-Lecar model (Morris & Lecar, 1981; Rinzel & Ermentrout, 1989) that has inspired the elaboration of the θ neuron model. Other researchers (Gang, Ditzinger, Ning, & Haken, 1993; Rappel & Strogatz, 1994) have also explored the effect of noise on saddle-node bifurcations in a generic dynamical system, while Longtin (1997) has studied the effect of noise on this bifurcation in the Hindmarsh-Rose bursting neuron model and shown how the ISIHs and other firing statistics vary with noise strength. The origin of the high variability seen in certain experiments and in noise-driven models of type I membranes can be understood from the seminal theoretical work of Sigeti and Horsthemke (1989). They studied how noise can move the state variable across the unstable fixed point associated with a saddle-node bifurcation. Their analysis was performed on a one-dimensional dynamical system known as the normal form of the saddle-node bifurcation. This system describes the long-lived dynamics of a (possibly higher-dimensional) system in the vicinity of this bifurcation, all other aspects of the dynamics having decayed away to zero. More precisely, their
analysis was confined to the bifurcation point itself, where the saddle and the node have coalesced into a semistable fixed point. In other words, they focused on the normal form \dot{x} = \beta + x^2 with \beta = 0; they also considered the Adler equation \dot{\theta} = 1 - \cos\theta, a variant of this normal form, which agrees with it to second order. In neural terms, that means that the bias current is set right at rheobase. The analysis of Sigeti and Horsthemke (1989) described how noise pushes solutions over the saddle-node point in terms of the first two moments of the first passage time density. While those theoretical and computational studies were carried out at the bifurcation, our study aims to explore the vicinity of this bifurcation, thus making it relevant to a range of experimentally plausible parameters in real neurons. The goal of our article is to provide analytical insight into how noise, for example, of synaptic origin, affects the transition to repetitive firing in type I membranes. We give exact and simplified approximate expressions for the mean interspike interval (ISI), the ISI standard deviation, and the CV, defined as the ratio of standard deviation to mean; these expressions are sought as a function of the input parameters. The article is organized as follows. In section 2, we introduce our basic spike generator model, discuss its relation to the θ neuron, and derive some scaling relations for rate and CV with respect to the input parameters. In section 3, we give exact integral expressions for the first two central moments of the interspike interval and derive simple approximations for various limit cases. The results, rate and CV as functions of constant input and noise intensity, are presented in section 4. Here we also compare results from two different versions of the stochastic θ model and those from our basic model; furthermore, we discuss briefly how the firing statistics change for finite boundaries in the associated passage time problem.
In section 5, our results are summarized and briefly discussed in a more general context. Two appendixes contain supplementary material.
2 The Model and Its Basic Properties

2.1 Stochastic Spike Generator Model for a Type I Neuron. Every system that is close to a saddle-node bifurcation will be dominated by the quasi-one-dimensional passage through the region around its fixed point. This holds true also for many neurons known as type I neurons and in particular for neuron models like, for instance, the Morris-Lecar model (Morris & Lecar, 1981). The influence of noise on such a system is of eminent importance for issues like spike train variability and reliability of signal transmission through neurons. Moreover, the typical spike train input received by many higher-order neurons can also be approximated by a simple noise process (diffusion approximation; see, for instance, Tuckwell, 1988). The simplest choice for a random input that still permits an extensive analytical
treatment is an uncorrelated noise—a white noise.¹ We allow for a finite mean value of the noise that can be looked upon as a separate constant input. After the usual reduction procedure from the multidimensional dynamics (Ermentrout, 1996; Gutkin & Ermentrout, 1998; Hoppensteadt & Izhikevich, 1997), the one-dimensional normal form driven by a white noise input reads

\dot{x} = \beta + x^2 + \sqrt{2D}\,\xi(t).   (2.1)
Here, time is measured in the typical timescale of the model. The parameter D denotes the noise intensity, and the gaussian white noise ξ(t) obeys the correlation function ⟨ξ(t)ξ(t+τ)⟩ = δ(τ). The parameter β is another input parameter; it is constant and can also be thought of as a static or very slowly varying signal. We note that this reduction has assumed a prior approximation regarding the nature of the noise. If the noise is meant as synaptic input to a cell, this input modifies the conductances that multiply the usual "battery" terms (V_rev − V) in the Hodgkin-Huxley formalism. This gives rise to multiplicative noise, since the state variable (the voltage) multiplies the noise (the fluctuating conductance). The reduction is assumed to take place very near the bifurcation, so that the noise can be made additive (the voltage is set to a constant in the synaptic battery terms). It is not known generally what effect this approximation has on the dynamics, except right near the bifurcation. We consider in this work a simple spike generator that produces spikes whenever the variable x reaches a threshold value x_+. After occurrence of a spike, the variable is reset to a negative value x_−. If not stated otherwise, these values are chosen to be at plus and minus infinity, respectively. A version of this spike generator with finite threshold and reset values (although possibly with a different kind of input current) is known as the quadratic integrate-and-fire neuron and has been recently used in studies of neuronal networks (see, e.g., Latham, Richmond, Nelson, & Nirenberg, 2000; Hansel & Mato, 2001); here, we discuss the effect of finiteness of x_± briefly (see section 4.4). The ISIs are obtained from independent realizations of the passage from x_− to x_+. The presence of the square term and the white gaussian noise in equation 2.1 ensures that this passage time is finite (in spite of the possibly infinite threshold and reset values) for all values of the input parameter β.
A simulation of the system can be performed only for finite initial and threshold points x_− and x_+, respectively. Starting at x_0 = x_−, the dynamics equation 2.1 can be numerically integrated with a sufficiently small time

¹ The term white refers to the power spectrum of the noise, which is flat, that is, contains all frequencies (as white light contains all frequencies of the visible electromagnetic spectrum). A flat spectrum in turn implies a δ correlation of the noise (Risken, 1984). The process has no correlations over any finite time at all.
Figure 1: Trajectories of the model obtained by a simulation of the stochastic differential equation 2.2, with threshold and reset parameters x_± = ±100, D = 1, Δt = 10⁻³, and β = −1, 0, 1 (from top to bottom).
step Δt,

x_{j+1} = x_j + (\beta + x_j^2)\,\Delta t + \sqrt{2D\,\Delta t}\,\xi_j,   (2.2)
where x_j := x(jΔt) and the ξ_j are a sequence of independent gaussian random numbers with unit variance (Risken, 1984). The rule for reset and generation of the ith firing time t_i is

x(t) = x_+ \;\rightarrow\; t_i := t \text{ and } x(t^+) = x_-.   (2.3)
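For concreteness, the scheme of equations 2.2 and 2.3 can be sketched in Python as follows (our own illustration; the function name and the parameter defaults are chosen for this example and do not appear in the text):

```python
import math
import random

def simulate_isis(beta=1.0, D=1.0, x_minus=-100.0, x_plus=100.0,
                  dt=1e-3, n_spikes=50, seed=1):
    """Euler integration of equation 2.2 with the reset rule of equation 2.3.

    Returns a list of interspike intervals T_i = t_i - t_{i-1}.
    """
    rng = random.Random(seed)
    noise_amp = math.sqrt(2.0 * D * dt)   # sqrt(2 D dt) prefactor of xi_j
    x, t, last_spike = x_minus, 0.0, 0.0
    isis = []
    while len(isis) < n_spikes:
        # x_{j+1} = x_j + (beta + x_j^2) dt + sqrt(2 D dt) xi_j
        x += (beta + x * x) * dt + noise_amp * rng.gauss(0.0, 1.0)
        t += dt
        if x >= x_plus:          # threshold crossing: record spike time ...
            isis.append(t - last_spike)
            last_spike = t
            x = x_minus          # ... and reset to x_minus
    return isis
```

For β > 0 the deterministic passage from x_− to x_+ already takes a finite time (approximately π/√β for large |x_±|), so the routine terminates even for weak noise.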
The firing (spike) times are defined as the instants at which x crosses x_+; the variable x is reset to x_− right after occurrence of a spike. The points x_− and x_+ should be chosen sufficiently large and the time step sufficiently small such that a further increase or decrease, respectively, does not change the statistics of the measured quantities of interest significantly (as mentioned, the effect of finite x_± is studied in section 4.4). Three example trajectories for different values of β are shown in Figure 1. The sequence of firing times generated by the model (…, t_{i−1}, t_i, t_{i+1}, …) describes a renewal point process (Cox, 1962); subsequent intervals between firing times, T_i = t_i − t_{i−1} and T_{i+1} = t_{i+1} − t_i, are statistically independent (therefore, we omit in the following the index i). Here, we study two basic quantities: the stationary firing rate and the CV. The stationary firing rate is given by the inverse of the mean interspike interval ⟨T⟩ (with the brackets standing for an ensemble average):

r(\beta, D) = \frac{1}{\langle T \rangle}.   (2.4)
Figure 2: Potential associated with equation 2.1 for different values of β.
The CV is the relative standard deviation of the interspike interval; it is a second-order quantity that measures the variability of spiking and is given by

CV(\beta, D) = \frac{\sqrt{\langle \Delta T^2 \rangle}}{\langle T \rangle},   (2.5)

where ⟨ΔT²⟩ = ⟨T²⟩ − ⟨T⟩² stands for the variance. Before we proceed, we give an instructive mechanical analogy for the dynamics equation 2.1. We may associate a potential with equation 2.1, the negative derivative of which yields the deterministic part of the right-hand side (the constant of integration has been chosen to be zero):

V(x) = -x^3/3 - \beta x.   (2.6)
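The ISI statistics of equations 2.4 and 2.5 can be computed from a list of simulated intervals; a minimal helper of our own:

```python
import math

def rate_and_cv(isis):
    """Stationary rate r = 1/<T> (eq. 2.4) and CV = sqrt(<dT^2>)/<T> (eq. 2.5)."""
    n = len(isis)
    mean_t = sum(isis) / n
    var_t = sum(t * t for t in isis) / n - mean_t ** 2   # <T^2> - <T>^2
    return 1.0 / mean_t, math.sqrt(max(var_t, 0.0)) / mean_t
```

Note that the population variance (division by n) is used, matching the ensemble-average definition in the text.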
Depending on the value of β, this potential exhibits for β < 0 a minimum and a maximum, for β = 0 a saddle point, or for β > 0 no extrema at all, resulting in quite different statistics of the passage times and thus of the spike train. In Figure 2, we show these different shapes, which correspond to the following firing regimes of the type I neuron: for β < 0, noise-activated firing, that is, the ISI is dominated by the noise-assisted escape from the potential minimum at x = −√(−β) over the potential barrier at x = √(−β), corresponding to the famous Kramers problem (Kramers, 1940); for β = 0, the system sits right at the saddle-node bifurcation, where no firing occurs without noise but the statistics of the interspike intervals (first passage times of the particle) is very particular (Sigeti & Horsthemke, 1989); and for β > 0, the oscillatory ("running") regime, where the associated potential has no minimum ("downhill motion").
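The three regimes can be checked directly from equation 2.6; a small sketch (variable names are ours):

```python
import math

def potential(x, beta):
    """V(x) = -x**3/3 - beta*x, the potential of equation 2.6."""
    return -x ** 3 / 3.0 - beta * x

# For beta < 0, the deterministic fixed points of equation 2.1 sit at
# x = -sqrt(-beta) (potential minimum) and x = +sqrt(-beta) (barrier);
# their potential difference is the barrier height of the Kramers problem.
beta = -1.0
x_min, x_max = -math.sqrt(-beta), math.sqrt(-beta)
barrier = potential(x_max, beta) - potential(x_min, beta)
```

For β = −1 the barrier height is 4/3; for β > 0 the drift β + x² is positive everywhere, so V has no extrema and the motion is "downhill."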
2.2 Relation to the θ Neuron Model. The dynamics equation 2.1 can be transformed to the so-called θ neuron model by means of the new variable (Ermentrout, 1996)

\theta = 2 \arctan(x),   (2.7)
resulting in a stochastic differential equation with multiplicative white noise (i.e., the prefactor of the noise term depends on the state variable),

\dot{\theta} = (1 - \cos\theta) + (1 + \cos\theta)\left(\beta + \sqrt{2D}\,\xi(t)\right).   (2.8)

For this phase oscillator model, threshold and reset values are at finite values—at π and −π, respectively. The θ model with a white noise input current, however, must be treated with caution. A stochastic differential equation with multiplicative noise is not uniquely determined; it has to be supplemented by an interpretation (the so-called Ito-Stratonovich dilemma; see Gardiner, 1985). This seems to be paradoxical since the original dynamics equation 2.1, which is driven by additive noise, is nonambiguous. The resolution of this paradox is given by the fact that the Stratonovich interpretation is the only interpretation that permits the usual transformation of variables (Gardiner, 1985). Since equation 2.8 results from such a transformation (namely, equation 2.7), we have to interpret equation 2.8 in Stratonovich's sense. This is also plausible for another reason: driving currents in real neurons are never white noise but will have a finite correlation time; white noise that is thought of as the limit of a "colored" noise with negligible correlation time leads to the Stratonovich interpretation of a dynamics with multiplicative noise (see, e.g., Risken, 1984). In the following, we wish to use a simple Euler integration algorithm (see, e.g., Risken, 1984), which assumes the Ito interpretation. Thus, we must first express our Stratonovich stochastic differential equation in its corresponding Ito form. This correspondence preserves the physics of the problem. The Ito equivalent of the stochastic differential equation 2.8 has an additional drift term −D sin θ (1 + cos θ). The integration scheme analogous to equation 2.2 then reads

\theta_{j+1} = \theta_j + \left[(1 - \cos\theta_j) + (\beta - D\sin\theta_j)(1 + \cos\theta_j)\right]\Delta t + \sqrt{2D\,\Delta t}\,(1 + \cos\theta_j)\,\xi_j,   (2.9)
where ξ_j is again a sequence of independent gaussian distributed random numbers with unit variance. Note that in Gutkin and Ermentrout (1998), the drift term is apparently missing, and hence the dynamics has been interpreted in the sense of Ito. This leads to the following integration scheme,

\theta_{j+1} = \theta_j + \left[(1 - \cos\theta_j) + \beta(1 + \cos\theta_j)\right]\Delta t + \sqrt{2D\,\Delta t}\,(1 + \cos\theta_j)\,\xi_j,   (2.10)
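The two update rules differ only by the Stratonovich drift; side by side (our own sketch):

```python
import math

def theta_step_stratonovich(theta, beta, D, dt, xi):
    """One step of equation 2.9 (Ito scheme equivalent to the Stratonovich model)."""
    drift = (1.0 - math.cos(theta)) \
        + (beta - D * math.sin(theta)) * (1.0 + math.cos(theta))
    return theta + drift * dt + math.sqrt(2.0 * D * dt) * (1.0 + math.cos(theta)) * xi

def theta_step_ito(theta, beta, D, dt, xi):
    """One step of equation 2.10 (naive Ito interpretation, no extra drift)."""
    drift = (1.0 - math.cos(theta)) + beta * (1.0 + math.cos(theta))
    return theta + drift * dt + math.sqrt(2.0 * D * dt) * (1.0 + math.cos(theta)) * xi
```

For D → 0 the two schemes coincide; for a fixed noise sample ξ_j their difference per step is exactly −D sin θ_j (1 + cos θ_j) Δt, which is why the discrepancy only shows up at moderate to large noise intensity.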
which is also the one that could be expected from a straightforward (although inexact) inclusion of gaussian white noise in the θ neuron dynamics. Clearly, the difference between equations 2.10 and 2.9 vanishes in the weak noise limit since the additional Stratonovich drift in equation 2.9 is proportional to the noise intensity D. We will show that simulation of the Stratonovich version indeed yields the same statistics of interspike intervals as the original dynamics equation 2.1, whereas the Ito interpretation, equation 2.10, of the θ neuron model, which was used by Gutkin and Ermentrout (1998), leads to different results, in particular at moderate to large noise intensity.

2.3 Scaling Relations. We return now to the original model, equation 2.1. Therein, we have two free parameters, β and D. If we properly rescale x and t, it is possible to obtain the same nonlinearities as in equation 2.1, except for a new combination of the parameters β and D (threshold and reset values do not change in any case since they remain at plus and minus infinity, respectively). This gives us scaling relations for rate and CV with respect to the input parameters D and β. Moreover, we can eliminate one of the parameters, although we still have to distinguish among the three different firing regimes. For β = 0, which has been treated by Sigeti (1988), that is, for

\dot{x} = x^2 + \sqrt{2D}\,\xi(t),   (2.11)
we choose the new variable and time as

y = x/a, \quad \tilde{t} = at,   (2.12)

with a arbitrary but positive. This leads to²

\dot{y} = y^2 + \sqrt{2D/a^3}\,\tilde{\xi}(\tilde{t}).   (2.13)

With a = D^{1/3} we can eliminate the noise intensity (Sigeti, 1988). Of course, the moments of the ISI will still be rescaled according to equation 2.12; however, the CV, which is the relative standard deviation of the ISI, remains the same for all values of D:

CV(\beta = 0, D) = CV(\beta = 0, D/a^3) \;\Rightarrow\; CV(\beta = 0, D_1) = CV(\beta = 0, D_2).   (2.14)
² Note that \xi(t/a) = \sqrt{a}\,\xi(t), which can be seen by considering the correlation function: \langle \xi(t/a)\,\xi(t'/a) \rangle = \delta([t - t']/a) = a\,\delta(t - t') = \langle \sqrt{a}\,\xi(t)\,\sqrt{a}\,\xi(t') \rangle.
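The deterministic backbone of the rescaling in equations 2.12 and 2.13 is easy to check with the exact solution of ẋ = x² (our own sketch): if x(t) solves it, then y(t̃) = x(t̃/a)/a solves the same equation in the rescaled time.

```python
def x_det(t, x0):
    """Exact solution of dx/dt = x**2 (the deterministic part of eq. 2.11)."""
    return x0 / (1.0 - x0 * t)

# Rescaling y = x/a, t~ = a*t (eq. 2.12): y(t~) = x(t~/a)/a must coincide
# with the solution started from y0 = x0/a, i.e. y0/(1 - y0*t~).
a, x0 = 2.0, -2.0
y_rescaled = x_det(1.0 / a, x0) / a            # x(t~/a)/a at t~ = 1
y_direct = (x0 / a) / (1.0 - (x0 / a) * 1.0)   # solution in rescaled variables
```

The noise term picks up the extra factor a^{-3/2} of equation 2.13 through the identity of footnote 2, which the deterministic check above does not probe.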
The elimination of the noise intensity is also possible for β ≠ 0. (This fact was also suggested independently by E. Izhikevich, personal communication.) With respect to the approximations we shall derive, it is, however, more convenient to set a = \sqrt{|\beta|}. This yields the following dynamics:

\dot{y} = \frac{\beta}{|\beta|} + y^2 + \sqrt{2D/|\beta|^{3/2}}\,\tilde{\xi}(\tilde{t}) = \begin{cases} +1 + y^2 + \sqrt{2D/|\beta|^{3/2}}\,\tilde{\xi}(\tilde{t}), & \beta > 0 \\ -1 + y^2 + \sqrt{2D/|\beta|^{3/2}}\,\tilde{\xi}(\tilde{t}), & \beta < 0 \end{cases}   (2.15)

The rescaling of time by a = \sqrt{|\beta|} then implies the relations

r(\beta, D) = \sqrt{\beta}\; r(+1, \beta^{-3/2} D), \quad \beta > 0,   (2.16)

r(\beta, D) = \sqrt{|\beta|}\; r(-1, |\beta|^{-3/2} D), \quad \beta < 0,   (2.17)

CV(\beta, D) = CV(\pm 1, |\beta|^{-3/2} D).   (2.18)

For super-threshold input (β > 0) and vanishing noise, the rate scales like r ∼ √β. In the presence of noise, an increase in β diminishes the effective noise intensity. The third equation is even more important. The range of possible CVs does not depend on β as long as its sign is fixed. Plotting the CV as a function of noise intensity always results in the same curve, apart from a stretching by |β|^{−3/2} in the argument. Furthermore, we see from equation 2.18 that the CV in the large noise limit corresponds to the CV at finite D but vanishing input,

\lim_{D \to \infty} CV(\beta, D) = \lim_{D \to \infty} CV(\pm 1, |\beta|^{-3/2} D) = \lim_{\beta \to 0} CV(\pm 1, |\beta|^{-3/2} D) = CV(0, D).   (2.19)
By similar arguments, asymptotic scaling relations for strong noise (or equivalently weak input ¯) can be derived, given by r.§1; D/ ¼ A1 D1=3 C B1 ¯D¡1=3 ;
CV.§1; D/ ¼ A2 C B2 ¯D
¡2=3
;
D!1
D ! 1:
(2.20) (2.21)
1770
B. Lindner, A. Longtin, and A. Bulsara
In these formulas, A₁, A₂, B₁, and B₂ denote numerical constants, which will be given below. Small noise intensity will require higher-order terms in β (rate and CV in equations 2.20 and 2.21 diverge for D → 0, which is not expected from the biological point of view), and thus the linearity with respect to the input depends strongly on the noise level.

3 Exact and Approximate Expressions for Spike Rate and CV

3.1 Exact Formulas for the First Two Central Moments of the Interspike Interval Distribution. The hierarchy of quadrature expressions for the moments of the passage time problem in an arbitrary potential has been known for a long time (Pontryagin, Andronov, & Witt, 1989). The standard expressions for the mean and variance of the first passage time can be somewhat simplified (see appendix A), yielding

\langle T\rangle = \left(\frac{9}{D}\right)^{1/3} \int_{-\infty}^{\infty} dx\, e^{-\alpha x - x^3} \int_{-\infty}^{x} dy\, e^{\alpha y + y^3},   (3.1)

\langle \Delta T^2\rangle = 2\left(\frac{9}{D}\right)^{2/3} \int_{-\infty}^{\infty} dx\, e^{-\alpha x - x^3} \left[\int_{-\infty}^{x} dz\, e^{\alpha z + z^3}\right]^2 \int_{x}^{\infty} dy\, e^{-\alpha y - y^3},   (3.2)

with α = β (3/D²)^{1/3}. These formulas correspond to infinite threshold and reset values x± = ±∞. The quadratures for the case of finite reset and finite threshold values are given in the appendix by equations A.1 and A.4. The integrals in equations 3.1 and 3.2 may be evaluated numerically by standard procedures.³ There are, however, a number of cases where the integrals can be carried out analytically, which gives us some additional control over the accuracy of our numerical integration. First, the mean first passage time (i.e., the mean ISI) can be expressed by an infinite sum (Colet, San Miguel, Casademunt, & Sancho, 1989) as follows:⁴

\langle T\rangle = \left(\frac{1}{3D}\right)^{1/3} \sqrt{\frac{\pi}{3}}\, \sum_{n=0}^{\infty} \frac{(-1)^n}{n!}\, 2^{(2n+1)/3}\, \Gamma\!\left(\frac{2n+1}{6}\right) \left(\beta\left(\frac{3}{D^2}\right)^{1/3}\right)^{\!n}.   (3.3)

³ The infinite integration boundaries have to be replaced by finite ones that are chosen sufficiently large such that a further increase does not change the results to within the desired accuracy.
⁴ Note that there are misprints in equations 3.6 and 3.7 of Colet et al. (1989).
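The sum, equation 3.3, is easy to evaluate term by term. The following check (Python; the number of partial-sum terms is an arbitrary choice of ours) verifies that at β = 0 only the n = 0 term survives and reproduces the closed-form result of Sigeti (1988) quoted in equation 3.4 below:

```python
from math import gamma, pi, sqrt

def mean_isi_series(beta, D, n_terms=40):
    """Mean ISI from the sum formula, eq. 3.3:
    <T> = (1/3D)^(1/3) sqrt(pi/3) sum_n (-1)^n/n! 2^((2n+1)/3)
          Gamma((2n+1)/6) (beta (3/D^2)^(1/3))^n ."""
    alpha = beta * (3.0 / D**2) ** (1.0 / 3.0)
    s, fact = 0.0, 1.0
    for n in range(n_terms):
        if n > 0:
            fact *= n                     # fact = n!
        s += (-1) ** n / fact * 2.0 ** ((2 * n + 1) / 3.0) \
             * gamma((2 * n + 1) / 6.0) * alpha ** n
    return (1.0 / (3.0 * D)) ** (1.0 / 3.0) * sqrt(pi / 3.0) * s

# beta = 0: only the n = 0 term survives; it must equal Gamma(1/3)^2 (1/3D)^(1/3)
exact = gamma(1.0 / 3.0) ** 2 * (1.0 / 3.0) ** (1.0 / 3.0)   # <T> at beta=0, D=1
print(mean_isi_series(0.0, 1.0), exact)
```

The series is entire in β, so partial sums converge rapidly for moderate arguments; as noted in the text, convergence degrades in the weak noise (large α) limit.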
Here Γ(·) denotes the gamma function (Abramowitz & Stegun, 1970). This formula is especially useful for large noise intensities, whereas in the weak noise limit, many terms are required to achieve convergence. Furthermore, for β = 0 both mean and variance of the ISI can be calculated analytically (see Sigeti, 1988; Sigeti & Horsthemke, 1989). The result is simple and reads

\langle T(\beta=0)\rangle = [\Gamma(1/3)]^2 \left(\frac{1}{3D}\right)^{1/3},   (3.4)

\langle \Delta T^2(\beta=0)\rangle = \frac{1}{3}\,[\Gamma(1/3)]^4 \left(\frac{1}{3D}\right)^{2/3} = \frac{1}{3}\langle T\rangle^2.   (3.5)

The latter equality in equation 3.5 implies that for β = 0, the ratio of variance and mean square of the ISI (and hence also the CV) is a constant, independent of noise intensity, in accordance with equation 2.14. Rate and CV of the type I neuron read in this case

r(\beta=0, D) = \frac{(3D)^{1/3}}{[\Gamma(1/3)]^2} \approx 0.201\, D^{1/3}, \qquad CV(\beta=0, D) = 1/\sqrt{3}.   (3.6)
Since the integral and sum formulas above are not very transparent, we shall derive some simplifications that apply in different limit cases: (1) weak noise and positive input (β > 0, oscillatory regime), that is, limit cycle dynamics weakly perturbed by noise; (2) weak noise and negative input (β < 0, excitable regime), that is, excitations are rare and an escape-rate description applies; and (3) the weak input or strong noise limit.

3.2 Rate and CV in the Oscillatory Regime at Weak Noise. For a strictly monotonically decreasing potential (as in our problem for β > 0), general approximation formulas for the mean and the variance of the passage time to linear order in D were given by Arecchi and Politi (1980) and yield in our case

\langle T\rangle \approx \pi/\sqrt{\beta},   (3.7)

\langle \Delta T^2\rangle \approx \frac{3D\pi}{4\beta^{5/2}}.   (3.8)

There is no linear contribution in D to the mean ISI; the value π/√β is clearly the deterministic passage time along the entire x-axis. Both this time and the variance of the passage time decrease with increasing β. For rate and CV, we find

r \approx \sqrt{\beta}/\pi, \qquad CV = \sqrt{\frac{3D}{4\pi}}\,\beta^{-3/4} \qquad \text{for } \beta \gg D^{2/3}.   (3.9)
This scaling behavior (independence of the mean ISI of D, variance of the ISI proportional to D) resembles that of a perfect integrate-and-fire (PIF) neuron driven by white noise (see, e.g., Bulsara, Lowen, & Rees, 1994). Indeed, for weak noise and positive input, one may even approximate the ISI probability density function (PDF) by an inverse gaussian (i.e., the ISI PDF of a perfect integrate-and-fire neuron), an approach that works rather well and will be presented elsewhere.

3.3 Rate in the Excitable Regime and Weak Noise. For β < 0 and weak noise, the passage from minus to plus infinity is dominated by the escape from the potential minimum at x₋ = −√|β| over the barrier at x₊ = +√|β|. A standard saddle-point approximation of the integral in equation 3.1 yields an exponential firing rate (Colet et al., 1989),

r = \frac{\sqrt{|\beta|}}{\pi} \exp\!\left[-\frac{4|\beta|^{3/2}}{3D}\right] \qquad \text{for } D \ll |\beta|^{3/2},   (3.10)

which is the Kramers escape rate for the potential, equation 2.6, in the overdamped case (Kramers, 1940). A similar approximation of the variance in equation 3.2 gives the square of the mean ISI; such an approach yields a CV of unity, equivalent to the rare-event statistics of a Poisson process.

3.4 Rate and CV in the Case of Weak Input or Strong Noise. If the input β is weak in amplitude or, equivalently, the noise intensity is sufficiently strong (|β| ≪ D^{2/3}), we expect that rate and CV deviate only linearly from the simple expressions in equation 3.6. To find this linear expansion for the firing rate, we keep only the first two terms in the sum formula for the mean ISI, equation 3.3, and expand the rate with respect to β. This yields, after some manipulations,

r(\beta \ll 1, D) \approx \frac{(3D)^{1/3}}{[\Gamma(1/3)]^2} + \frac{9}{8}\,\frac{3^{1/6}}{\pi^3}\,[\Gamma(2/3)]^4\, D^{-1/3}\beta \approx 0.201\, D^{1/3} + 0.147\, D^{-1/3}\beta.   (3.11)

For the CV, the corresponding numerical constant, namely, the derivative of the CV with respect to β, can be obtained only numerically. We find that this derivative is well approximated by (1/4) D^{−2/3},

CV(\beta \ll 1, D) \approx \frac{1}{\sqrt{3}} + \frac{1}{4}\, D^{-2/3}\beta \approx 0.577 + 0.250\, D^{-2/3}\beta.   (3.12)

Clearly, equations 3.11 and 3.12 can not only be used in the case of a weak input but can also be regarded as large-noise asymptotics.
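For reference, the limit-case formulas of this section are collected below (Python; the numerical prefactors 0.201 and 0.147 follow from the gamma-function expressions in equation 3.11):

```python
from math import exp, gamma, pi, sqrt

def rate_oscillatory(beta):                 # eq. 3.9, valid for beta >> D^(2/3)
    return sqrt(beta) / pi

def cv_oscillatory(beta, D):                # eq. 3.9
    return sqrt(3.0 * D / (4.0 * pi)) * beta ** -0.75

def rate_kramers(beta, D):                  # eq. 3.10, beta < 0, D << |beta|^(3/2)
    return sqrt(abs(beta)) / pi * exp(-4.0 * abs(beta) ** 1.5 / (3.0 * D))

def rate_linearized(beta, D):               # eq. 3.11, |beta| << D^(2/3)
    a = (3.0 * D) ** (1.0 / 3.0) / gamma(1.0 / 3.0) ** 2
    b = 9.0 / 8.0 * 3.0 ** (1.0 / 6.0) / pi**3 * gamma(2.0 / 3.0) ** 4 / D ** (1.0 / 3.0)
    return a + b * beta

def cv_linearized(beta, D):                 # eq. 3.12
    return 1.0 / sqrt(3.0) + 0.25 * beta / D ** (2.0 / 3.0)

# the two constants of eq. 3.11 at D = 1 (about 0.201 and 0.147)
print(rate_linearized(0.0, 1.0), rate_linearized(1.0, 1.0) - rate_linearized(0.0, 1.0))
```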
4 Results

In the next two sections, we discuss rate and CV as functions of the noise intensity and the constant input. In all figures, lines for β ≠ 0 were obtained by numerical evaluation of the quadrature formulas, 3.1 and 3.2,⁵ and insertion into equations 2.4 and 2.5; for β = 0, formula 3.6 was used. Symbols indicate results of numerical simulations of the dynamics using reset and absorption points x± = ±500 and a Euler scheme with Δt = 10⁻³ or Δt = 10⁻⁴ (the latter at large noise intensity). Depending on parameters, up to 10⁵ spikes were used for the estimation of the interspike interval statistics. In the upper panels of all figures, we compare the numerically evaluated quadrature results and the numerical simulations. They are in all cases in excellent agreement with each other, as they should be. In the lower panels, the quadrature results are compared to the various approximations obtained in the previous section. In section 4.3, we compare rate and CV with simulation results of the θ neuron using either the Stratonovich or the Ito interpretation and compare the results to those of Gutkin and Ermentrout (1998). The effect of finite threshold and reset values is discussed in section 4.4.

4.1 Rate and CV as Functions of Noise Intensity. The rate depicted in Figure 3 behaves at large noise intensity rather independently of the value of β; it is seen to increase in proportion to D^{1/3} according to equation 3.6. The effect of β becomes apparent at small noise intensity, where the rate (1) saturates if β > 0 at the value given by equation 3.9; (2) increases like D^{1/3} for β = 0 according to equation 3.6; or (3) increases following an exponential dependence on inverse noise intensity (Kramers law, equation 3.10) for β < 0. The approximation, equation 3.10, for β < 0 (see Figure 3b, dashed line) is valid only for rather small noise intensity.

In contrast, the result from the linearization around β = 0 (see equation 3.11) fits rather well for moderate to large noise intensity and for β = 1 or −1. In fact, one may interpolate between the weak and strong noise approximations without making an appreciable error.

The CV tends in the strong noise limit to 1/√3, as predicted by equation 2.19. In the weak noise limit, it either decreases to zero (β > 0), corresponding to perfectly regular firing of the oscillatory system in the absence of noise, or tends to one (β < 0), indicating rare spiking with Poisson statistics. The exceptional case β = 0 leads to CV = 1/√3 for all noise intensities. Note that any finite value of β will lead to one of the other limits if the noise is sufficiently weak. Indeed, for vanishing noise, β = 0 can be looked upon as a threshold value, as we will see.

⁵ The sum formula, equation 3.3, showed excellent agreement with the quadrature result but is not discussed here.
Figure 3: Spike rate of the neuron versus noise intensity for the three distinct cases β = ±1 or 0 as indicated. (a) Quadrature results (thin lines) compared to simulations. (b) Quadrature results (thin lines) compared to the Kramers rate (equation 3.10, dashed line, only for β = −1) and the linearization approximation (equation 3.11, thick lines).
The approximation, equation 3.9, is shown in Figure 4b as a dashed line. The "square-root" law (CV ∼ √D) describes the true CV rather well up to D ≈ 0.2. For a general (positive) value of β, the approximation is valid for noise intensities that yield a CV below 0.25. This can be formulated precisely as a condition between D and β,

\beta > \left(\frac{12D}{\pi}\right)^{2/3}.   (4.1)
Figure 4: Coefficient of variation versus noise intensity for the three distinct cases β = ±1 or 0 as indicated. (a) Quadrature results compared to simulations. (b) Quadrature results compared to the weak noise expansion (equation 3.9, dashed line, only for β = 1) and the linearization approximation (equation 3.12, thick lines).
The linearization result shown by thick lines in Figure 4b seems to have an even larger range of validity. For β = −1, one may use the weak-input approximation, equation 3.12, as a good estimate as long as the CV is below 0.9. For β = 1, the approximation coincides with the exact result within line thickness for CVs between 0.5 and 1/√3. Via the scaling relation, equation 2.18, these estimates can be generalized to arbitrary negative (positive) values of β, respectively, since a change in input rescales only the noise intensity but not the range of the CV.
Figure 5: Spike rate of the neuron versus input β for three different noise intensities as indicated. (a) Quadrature results compared to simulations. (b) Quadrature results compared to the Kramers rate (equation 3.10, dashed lines), the linearization approximation (equation 3.11, dashed-dotted lines), and the deterministic rate (equation 3.9, thick line, only for D = 0.1).
4.2 Rate and CV as Functions of the Constant Input. Since β plays the role of an input, the dependencies of rate and CV on this parameter are of greatest interest. We show these dependencies for different noise intensities, D = 0.1, 1, or 10, in Figure 5 (rate) and Figure 6 (CV). For strong noise, the dependence of the rate on the input (see Figure 5) is rather weak. This can be readily understood: the linear part of the potential governed by β is of minor importance at large noise, and a potential barrier at negative β is "not seen." The passage through this region is biased diffusion (governed mainly by the cubic part of the potential, which is independent of β).
Figure 6: Coefficient of variation versus input β for three different noise intensities as indicated. (a) Quadrature results (thin lines) compared to simulations. (b) Quadrature results (thin lines) compared to the weak noise expansion (equation 3.9, thick lines) and the linearization approximation (equation 3.12, dashed lines). The latter coincides with the quadrature result almost within line thickness for D = 10.
Decreasing noise results in an increasing dependence of the rate on the input, and for negative β, the rate can become arbitrarily low. In the limit of vanishing noise, the rate can likewise attain arbitrarily low values for positive input. This is one of the characteristic features of type I neurons (Ermentrout, 1996). The approximation, equation 3.10, is shown in Figure 5 by dashed lines. Although it describes the data well for D = 0.1, it becomes worse with increasing noise intensity over the range of β shown in Figure 5b. On the contrary, the linearization result agrees best at large noise intensity; indeed,
there is no visible difference between the quadrature result and the approximation for D = 10. At moderate noise intensity (D = 1), the linear behavior is restricted to a much smaller range of β. For small noise intensity and increasing β, the rate switches very quickly from almost zero to a saturation value, a rather nonlinear behavior that is not well described (or only in a very small range around β = 0) by the linearization approximation, equation 3.11.

The shape of the CV versus β curve is almost linear for strong noise (see Figure 6, lines for D = 10), as already mentioned by Gutkin and Ermentrout (1998). Since the quadrature formulas always contain β/D, a linearization of rate and CV with respect to β is valid for a larger range of β if D is large. Indeed, noise linearizes not only the transfer function (i.e., rate versus input parameter) but also all other statistical quantities with respect to variations of β. The linear dependence is, however, changed into a threshold-like dependence for small noise. With a look at the different small noise limits of the CV for β < 0 and β > 0, no behavior other than this threshold behavior is indeed possible. The approximation according to equation 3.9 is shown by thin dashed lines in Figure 6. It can be used only for the small noise level D = 0.1, where the CV is below 0.25, and is far off for larger noise intensities.

4.3 Comparison to Simulations of the θ Neuron Model. In section 2.2, we showed that the Stratonovich interpretation of the white-noise-driven θ neuron model, equation 2.8, is completely equivalent to our basic model, equation 2.1. That means that if we simulate using the scheme of equation 2.9, we should obtain exactly the same rate and CV as for the original model. On the contrary, the Ito interpretation of equation 2.8 and the corresponding integration scheme, equation 2.10, should result in a different rate and CV, in particular at larger noise intensity. Here we ask whether these differences are serious.

We will also compare with results by Gutkin and Ermentrout (1998), who have apparently used the Ito scheme, equation 2.10, which does not exactly correspond to the original dynamics. In Figure 7, the firing rate is shown as a function of the noise intensity for β = ±1 and β = 0. The quadrature result for the original dynamics, equation 2.1, is compared to simulation results using the two integration schemes, equations 2.9 and 2.10. While the Stratonovich scheme exactly matches the analytical curves and thereby confirms our expectation, the rate obtained by simulating the Ito scheme lies below the true rate for moderate to large noise intensities and saturates in the strong noise limit. Most remarkably, for β = 1, the rate does not depend on the noise intensity at all, in marked contrast to the D^{1/3} increase of the true rate for strong noise (in appendix B, we derive that the mean ISI, and hence the rate, does not depend on D for the specific input β = 1). As expected, both schemes yield the same rate in the weak noise limit, where the Stratonovich drift (proportional to noise intensity) becomes negligible.
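The difference between the two schemes is easy to reproduce. The sketch below (Python/NumPy; all parameter and step-size choices are ours, and the function name is made up) integrates the θ dynamics with a plain Euler step, either as written (Ito reading, equation 2.10) or with the additional drift −D sin θ (1 + cos θ) (Stratonovich reading, equation 2.9). At β = 1, the Ito scheme should stay near the D-independent rate 1/π derived in appendix B, while the Stratonovich scheme follows the true, noise-enhanced rate:

```python
import numpy as np

def theta_rate(beta, D, stratonovich, t_max=50.0, dt=5e-4,
               n_walkers=100, seed=2):
    """Firing rate of the theta neuron,
    d(theta)/dt = 1 - cos(theta) + (1 + cos(theta)) beta
                  + sqrt(2D) (1 + cos(theta)) xi(t),
    via plain Euler (Ito reading) or with the Stratonovich drift correction."""
    rng = np.random.default_rng(seed)
    th = np.full(n_walkers, -np.pi / 2)
    spikes = 0
    amp = np.sqrt(2.0 * D * dt)
    for _ in range(int(t_max / dt)):
        g = 1.0 + np.cos(th)
        drift = 1.0 - np.cos(th) + beta * g
        if stratonovich:
            drift -= D * np.sin(th) * g     # -D sin(theta)(1 + cos(theta))
        th = th + drift * dt + amp * g * rng.standard_normal(n_walkers)
        fired = th >= np.pi                  # spike = passage through theta = pi
        spikes += int(fired.sum())
        th = np.where(fired, th - 2.0 * np.pi, th)
    return spikes / (t_max * n_walkers)

r_ito = theta_rate(1.0, 5.0, stratonovich=False)
r_strat = theta_rate(1.0, 5.0, stratonovich=True)
print(r_ito, r_strat)   # Ito near 1/pi ~ 0.318; Stratonovich noticeably larger
```

Note that near θ = π the noise amplitude (1 + cos θ) vanishes while the drift stays positive, so the crossing is ballistic and no double counting of spikes occurs.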
Figure 7: Firing rate as a function of noise intensity. Theory for the saddle-node system is shown by solid lines for different values of β as indicated; simulation results for the θ model in the Stratonovich interpretation according to the integration scheme, equation 2.9 (gray symbols), and in the Ito interpretation according to equation 2.10 (black symbols).
Figure 8: Coefficient of variation as a function of noise intensity. Theory for the saddle-node system is shown by solid lines for different values of β as indicated; simulation results for the θ model in the Stratonovich interpretation according to the integration scheme, equation 2.9 (gray symbols), and in the Ito interpretation according to equation 2.10 (black symbols).
Figure 8 shows the CV as a function of noise intensity for the three standard values of β. Here, the differences between the Ito and Stratonovich interpretations of the θ neuron model are more important. The latter again yields perfect agreement with the theoretical result for the original dynamics. The Ito scheme, in contrast, shows strong deviations for moderate to
large noise intensity. The CV for the oscillatory regime (β = 1) is no longer restricted to (0, 1/√3). For both positive and negative β, the CV always lies above the true CV. Increasing noise even leads to a growth of the CV, such that for β = −1, a minimum in the CV versus noise intensity curve is seen. Such a minimum is a signature of coherence resonance (Pikovsky & Kurths, 1997). Many excitable (as opposed to periodically firing) systems exhibit their most regular spiking (as indicated by a minimal CV) if driven by noise of finite "optimal" intensity. This effect is clearly absent for the original dynamics, equation 2.1. This qualitative difference between the Ito and Stratonovich interpretations, in the form of the existence of such a minimum, might be irrelevant in real type I neurons, since the reduction from the multidimensional system to the one-dimensional normal form becomes somewhat doubtful in the strong noise limit.

Gutkin and Ermentrout (1998) showed by means of numerical simulations that the θ neuron always exhibits (at least for β < 0) a high CV, close to one. They started, as we did, with the normal form and derived the equation for the θ neuron, but interpreted this equation apparently in Ito's sense (i.e., integrated equation 2.10). Although their conclusions are qualitatively correct, we would like to point out a quantitative discrepancy resulting from the use of the Ito scheme. In Figure 6c of Gutkin and Ermentrout (1998), the CV is shown as a function of the mean interval, a representation of the data that is independent of the definition of the noise intensity (they use a parameter σ that is related to the noise intensity D by σ = √(2D)). For a mean ISI ranging from 1 up to 10,000, they find a CV between 0.75 and 1.1, with a minimal CV at a small mean ISI (between 1 and 10). The data are rather noisy, and Gutkin and Ermentrout (1998) do not mention the apparent minimum of the CV.

We may use our data from Figures 7 and 8 and also plot a CV versus mean ISI curve; this is shown in Figure 9. Simulating the Ito version of the θ model, we indeed recover a high CV and a minimum at a low mean interval.⁶ In contrast, for the original model of a saddle-node system (theoretical curve, solid line in Figure 9) as well as for the simulation results from the θ model interpreted in Stratonovich's sense, we obtain no minimum in the CV and observe

⁶ We note two discrepancies with the data of Gutkin and Ermentrout (1998): (1) We cannot plot a CV for an interval below ⟨T⟩ = 3.14, since the rate is limited by the inverse of this value (cf. Figure 7), whereas in Gutkin and Ermentrout (1998), there is one data point at ⟨T⟩ = 1.5. (2) The CV we have found in our simulation for β = −1 always lies between 0.8 and 1, whereas in Gutkin and Ermentrout (1998), this range is slightly larger (CV ∈ (0.75, 1.1)). Either difference can be explained by insufficient statistics in Gutkin and Ermentrout (1998) and possibly (since there seems to be also a systematic deviation) by too large a time step in the integration procedure. Especially at low mean ISIs (high noise intensities), we had to use time steps down to Δt = 10⁻⁴ to achieve independence of the data with regard to the time step.
Figure 9: Coefficient of variation as a function of the mean ISI. Variation of the mean ISI is, as in Gutkin and Ermentrout (1998), achieved by changing the noise intensity while keeping β = −1 fixed. Theory for the saddle-node system is shown by the solid line; simulation results for the θ model in the Stratonovich interpretation according to the integration scheme, equation 2.9 (gray), and in the Ito interpretation according to equation 2.10 (black).
generally a considerably lower CV at small mean intervals. The lower limit is given by the large noise limit of the CV; in the excitable regime, we have CV > 1/√3. In the Stratonovich interpretation of the model, as well as in the original dynamics, we can have arbitrarily low mean interspike intervals, that is, an arbitrarily high firing rate.

4.4 Effect of Finite Threshold and Reset Values. As already mentioned, the spike generator model, equation 2.1, together with the reset rule, equation 2.3, has been used with finite threshold and reset values x₊ and x₋, respectively, in studies of neuronal networks (Latham et al., 2000; Hansel & Mato, 2001), where the spike generator was referred to as the quadratic integrate-and-fire neuron model. This is also the case for our model simulations, as well as for our numerical evaluation of the quadrature formulas. So it is desirable to know how the finiteness of the boundaries⁷ in the first passage problem influences rate and CV. This is an interesting problem in its own right and will be studied in detail elsewhere. In short, we can expect the following changes: (1) the scaling relations, equations 2.16 through 2.18, have to be modified (as they will now include rescaled values of x₊ and x₋); (2) significant changes to rate and CV are to

⁷ Actually, only the threshold point is a boundary (absorbing barrier). The second boundary (reflecting barrier) is at −∞, so a trajectory started at x₋ is not prevented from going toward −∞.
be expected at large noise intensity; and (3) as long as the (now finite) region of passage includes the fixed points of the system, rate and CV will be higher than for infinite boundaries.

To understand the last prediction, it is useful to split the passage for the infinite boundaries into three passage problems: to go (1) from −∞ to a finite value x₋, (2) from x₋ to x₊, and (3) from x₊ to ∞. Since we are dealing with a Markov process, these passages and the moments of their durations are statistically independent. Considering a system with finite boundaries now amounts to neglecting the outside passages 1 and 3. From this picture, it becomes clear that the mean passage time will be shorter than for infinite boundaries. Furthermore, the passages on the outside are governed by strong forces (recall the square term in equation 2.1) and are therefore more regular than the passage through the "inside" region (i.e., the CV of the passage times for the outside region is expected to be lower than that for the passage from x₋ to x₊). It can readily be shown that this implies a higher variability of the passage time (or ISI) for the case where the outside passages are not taken into account.

In Figure 10, we show how rate and CV versus noise intensity change for different values of β if we choose x± = ±2 (a rather drastic reduction of the reset-threshold distance). These curves were obtained using the general formulas, equations A.1 and A.4 from appendix A, together with the potential, equation 2.6. We compare the modified rate and CV (thick lines) to the old results for infinite x± (thin lines). The rate (see Figure 10a) is generally larger for finite x±, as expected. At small noise, the differences from the infinite boundary case are minor. For D → 0, the rate converges to that with infinite boundaries if β ≤ 0.

However, for β = 1, we see in the small noise limit a difference between the deterministic rates √β/π and √β [arctan(x₊/√β) − arctan(x₋/√β)]⁻¹ for the cases of infinite and finite threshold and reset points, respectively. Further, for all β, the deviations are more serious at large noise intensity. The rate grows in the finite case in proportion to D^{2/3} (not like D^{1/3} as in the case of infinite boundaries). This can be understood from the large-D asymptotics of the integrals in equation A.1.

The CV also shows a drastic qualitative change in its shape. Generally, it converges to the CV for infinite boundaries in the low noise limit but grows without bound in the large-D limit, in marked contrast to the model with infinite x±. This growth goes as D^{1/6}, as can again be concluded from the asymptotics of the integrals in equations A.1 and A.4. We would like to point out that this holds for all values of β; even for β = 0, where we remarkably had a constant CV in the case of infinite boundaries, the CV does increase with increasing noise intensity like D^{1/6}. Most notably, a minimum in the CV is observed for β = −1 (coherence resonance; cf. the discussion in the previous section). Thus, the absence of coherence resonance (i.e., of a finite "optimal" noise intensity) was a consequence of using infinite boundaries.
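These predictions can be checked directly with the finite-boundary quadratures. The sketch below (Python/NumPy; grid cutoff and size are ad hoc, and x± = ±8 merely stands in for large boundaries rather than being exactly infinite) evaluates equations A.1 and A.4 for the potential of equation 2.6 at β = 0 and strong noise, D = 10:

```python
import numpy as np

def _cum(f, h):
    # cumulative trapezoidal integral on a uniform grid, starting at 0
    return np.concatenate(([0.0], np.cumsum((f[1:] + f[:-1]) * h / 2.0)))

def _trap(f, h):
    # plain trapezoidal integral on a uniform grid
    return float(np.sum(f[1:] + f[:-1]) * h / 2.0)

def rate_cv_finite(beta, D, x_minus, x_plus, L=8.0, n=8001):
    """Rate and CV from eqs. A.1 and A.4 with V(x) = -x^3/3 - beta*x,
    reset at x_minus, absorbing threshold at x_plus."""
    x = np.linspace(-L, x_plus, n)
    h = x[1] - x[0]
    V = -x**3 / 3.0 - beta * x
    phi, psi = np.exp(V / D), np.exp(-V / D)
    W = _cum(psi, h)                 # W(x) = int_{-inf}^{x} e^{-V/D} dy
    f = phi * (x >= x_minus)         # Theta(x - x_minus) e^{V/D}
    Phi = _cum(f[::-1], h)[::-1]     # int_{z}^{x_plus} Theta(x - x_minus) e^{V/D} dx
    T = _trap(f * W, h) / D
    var = 2.0 / D**2 * _trap(phi * W**2 * Phi, h)
    return 1.0 / T, float(np.sqrt(var)) / T

r_fin, cv_fin = rate_cv_finite(0.0, 10.0, -2.0, 2.0)
r_big, cv_big = rate_cv_finite(0.0, 10.0, -8.0, 8.0)
print(r_fin, r_big)    # tight boundaries: noticeably higher rate
print(cv_fin, cv_big)  # ... and a higher CV at this strong noise
```

The comparison shows exactly the trend described in the text: cutting out the fast "outside" passages raises both the rate and the variability.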
Figure 10: (a) Spike rate and (b) coefficient of variation of the neuron versus noise intensity for three values of β as indicated. Thin lines are the quadrature results shown before in Figures 3 and 4; thick lines are quadrature results for finite threshold and reset values, x₊ = 2 and x₋ = −2, respectively.
In general, rate and CV for the finite and infinite cases agree best for negative β and small noise intensity. Increasing the modulus of the threshold and reset points will improve the agreement, in particular for small to moderate noise. This means that plotting rate or CV for x± = ±10 and a particular value of β yields curves that lie between those for x± = ±2 and x± = ±∞. From a computational point of view, large values of x± increase the simulation time. An estimate of the computation time saved by using x± = ±2 instead of x± = ±500 (i.e., infinite values for practical purposes) can be inferred from the distance between the spike rates for either case in Figure 10;
at very strong noise, this is about one order of magnitude, whereas at small to moderate noise intensity, the difference in computation time appears to be negligible.⁸ Another question is which kind of threshold and reset conditions is suitable with respect to more realistic type I neuron models and, of course, real type I neurons. This important issue is currently under investigation.

5 Discussion and Conclusions

We have considered a simple spike generator model that involves the one-dimensional normal form of a saddle-node bifurcation and may be used to represent a type I neuron. The model seems to be computationally simpler than the equivalent θ neuron often used in neurocomputational research. Using the spike generator with a saddle-node bifurcation helps, moreover, to avoid the pitfalls associated with using multiplicative white noise as in the θ model. A second advantage of the saddle-node spike generator is that we were able to adopt many useful analytic results for the passage time problem from the statistical physics literature on a related problem. We discussed the remarkable scaling behavior of the model, which allows reducing the number of parameters in the problem. We gave the exact quadrature expressions for the mean and variance of the ISI and a simple sum formula for the mean ISI. Furthermore, we derived expressions for simple limit cases of the rate and CV of the model. All results may also be useful for problems involving networks of type I neurons.

For the problem of the high variability of spike trains of cortical neurons, one of our conclusions is especially relevant: given a negative input (β < 0), the CV is restricted to the interval (1/√3, 1). It can never be lower than the high noise limit of the CV of the model. This stands in marked contrast to the leaky integrate-and-fire model, which can in this regime attain arbitrarily low values of the CV if the input parameter is tuned to a certain small negative value, and whose CV can likewise be arbitrarily high for both sub- and suprathreshold input (Lindner, 2002; Lindner, Schimansky-Geier, & Longtin, 2002). Although Gutkin and Ermentrout (1998) overestimated the CV with their simulation results, their main conclusion remains valid: type I neurons stimulated by white gaussian noise show a high variability for arbitrary but negative input.

Future work will consider a comparison of our results with a two- (or higher-) dimensional dynamics of an ionic model of a type I neuron, such as Morris-Lecar. We will also consider the theoretical analysis of the phase resetting curves and the response of the stochastic saddle-node neuron model studied here to periodic and broadband input.

⁸ A larger modulus of x± also requires a smaller time step (or an adaptive time step), which will increase the computation time.
Appendix A: Derivation of the Quadrature Formulas

The classical formulas for the first two moments of the first passage time in an arbitrary potential V(x) from x₋ to x₊ with a reflecting boundary at −∞ were given by Pontryagin et al. (1933) (for a more recent reference, see Anishchenko, Astakhov, Neiman, Vadivasova, & Schimansky-Geier, 2002, chapter 1.2, equation 1.248):

\langle T(x_- \to x_+)\rangle = \frac{1}{D} \int_{x_-}^{x_+} dx\, e^{V(x)/D} \int_{-\infty}^{x} dy\, e^{-V(y)/D},   (A.1)

\langle T^2(x_- \to x_+)\rangle = \frac{2}{D^2} \int_{x_-}^{x_+} dx\, e^{V(x)/D} \int_{-\infty}^{x} dy\, e^{-V(y)/D} \int_{y}^{x_+} du\, e^{V(u)/D} \int_{-\infty}^{u} dv\, e^{-V(v)/D}.   (A.2)

The resulting expression for the variance, ⟨ΔT²⟩ = ⟨T²⟩ − ⟨T⟩², can be simplified by interchanging the integration boundaries and the names of the variables several times (see, e.g., Sigeti, 1988, chap. 4, for the potential, equation 2.6, with β = 0, or Reimann et al., 2002, for the case of a periodic potential), yielding

\langle \Delta T^2\rangle = \frac{2}{D^2} \int_{x_-}^{x_+} dx\, e^{V(x)/D} \int_{-\infty}^{x} dz\, e^{V(z)/D} \left[\int_{-\infty}^{z} dy\, e^{-V(y)/D}\right]^2.   (A.3)

Another interchange of variables leads to (Lindner, 2002)

\langle \Delta T^2\rangle = \frac{2}{D^2} \int_{-\infty}^{x_+} dz\, e^{V(z)/D} \left[\int_{-\infty}^{z} dy\, e^{-V(y)/D}\right]^2 \int_{z}^{x_+} dx\, \Theta(x - x_-)\, e^{V(x)/D}   (A.4)

(here, Θ(x) is Heaviside's step function), a formula that is much easier to evaluate numerically than the original expressions. Inserting the potential, equation 2.6, into equations A.1 and A.4 with x± = ±∞ and using new integration variables like x̃ = x/(3D)^{1/3}, we obtain equations 3.1 and 3.2.
Appendix B: Mean First Passage Time of the θ Neuron in the Ito Interpretation

In section 4.3 we have seen from the simulation results that the rate of the θ neuron model interpreted in the sense of Ito does not depend on D if β = 1. Although irrelevant for the saddle-node system (correctly, we have to use the Stratonovich interpretation of the θ model), this is a remarkable finding, which we prove analytically in the following. The Ito scheme, equation 2.10, corresponds to the following stochastic differential equation interpreted in the Stratonovich sense:

\[ \dot\theta = 1 - \cos\theta + (\beta + D\sin\theta)(1 + \cos\theta) + \sqrt{2D}\,(1 + \cos\theta)\,\xi(t). \tag{B.1} \]

Here, the additional term +D sin θ compensates the Stratonovich drift −D sin θ in the Stratonovich integration scheme, equation 2.9, and hence this scheme now leads to what was before the Ito scheme, equation 2.10. Since we use the Stratonovich interpretation, we can transform equation B.1 back to the old variable x(t) by means of equation 2.7, yielding

\[ \dot x = \beta + x^2 + \frac{2Dx}{1+x^2} + \sqrt{2D}\,\xi(t). \tag{B.2} \]

Hence, interpreting the θ neuron model with white noise in the Ito sense amounts to the introduction of a noise-dependent modification of the potential V(x) given by equation 2.6. The modified potential reads

\[ \tilde V(x) = -\frac{x^3}{3} - \beta x - D\ln(1+x^2). \tag{B.3} \]

Inserting this into the general formula for the mean first passage time yields

\[ \langle T \rangle = \frac{1}{D} \int_{-\infty}^{\infty} dx \, \frac{\exp[-(x^3/3 + \beta x)/D]}{1+x^2} \int_{-\infty}^{x} dy \, (1+y^2)\exp[(y^3/3 + \beta y)/D], \tag{B.4} \]
which for β = 1 can be simplified:

\[ \langle T \rangle = \frac{1}{D} \int_{-\infty}^{\infty} dx \, \frac{\exp[-(x^3/3 + x)/D]}{1+x^2} \int_{-\infty}^{x} dy \, D\,\frac{d}{dy}\exp[(y^3/3 + y)/D] = \int_{-\infty}^{\infty} \frac{dx}{1+x^2} = \pi. \tag{B.5} \]
Rate and CV of a Type I Neuron
Hence the rate is exactly given by

\[ r = 1/\pi, \tag{B.6} \]
as was also found in the simulations, within numerical accuracy. Note that for any value β ≠ 1, the rate is expected to depend on D.

Acknowledgments

This research was supported by NSERC Canada and a Premier's Research Excellence Award (PREA) from the Government of Ontario. We also thank Christophe Langevin for preliminary numerical calculations of first passage times in the θ neuron model and Jan Benda for a careful reading of the manuscript.

References

Abramowitz, M., & Stegun, I. A. (1970). Handbook of mathematical functions. New York: Dover.
Anishchenko, V. S., Astakhov, V. V., Neiman, A. B., Vadivasova, T. E., & Schimansky-Geier, L. (2002). Nonlinear dynamics of chaotic and stochastic systems. Berlin: Springer-Verlag.
Arecchi, F. T., & Politi, A. (1980). Transient fluctuations in the decay of an unstable state. Phys. Rev. Lett., 45, 1219.
Bell, A., Mainen, Z., Tsodyks, M., & Sejnowski, T. (1995). "Balancing" of conductances may explain irregular cortical spiking (Tech. Rep. INC-9502). San Diego: Institute for Neural Computation, University of California, San Diego.
Bulsara, A., Lowen, S. B., & Rees, C. D. (1994). Cooperative behavior in the periodically modulated Wiener process: Noise-induced complexity in a model neuron. Phys. Rev. E, 49, 4989.
Colet, P., San Miguel, M., Casademunt, J., & Sancho, J. M. (1989). Relaxation from a marginal state in optical bistability. Phys. Rev. A, 39, 149.
Cox, D. R. (1962). Renewal theory. London: Methuen.
Ermentrout, G. B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comp., 8, 979.
Gang, H., Ditzinger, T., Ning, C. Z., & Haken, H. (1993). Stochastic resonance without external periodic force. Phys. Rev. Lett., 71, 807.
Gardiner, C. W. (1985). Handbook of stochastic methods. Berlin: Springer-Verlag.
Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Comp., 10, 1047.
Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175.
Hodgkin, A. (1948). The local electric changes associated with repetitive action in a medullated axon. J. Physiol., 107, 165.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Kramers, H. A. (1940). Brownian motion in a field of force and the diffusion model of chemical reactions. Physica, 7, 284.
Latham, P. E., Richmond, B. J., Nelson, P. G., & Nirenberg, S. (2000). Intrinsic dynamics in neuronal networks. I. Theory. J. Neurophysiol., 83, 808.
Lindner, B. (2002). Coherence and stochastic resonance in nonlinear dynamical systems. Berlin: Logos-Verlag.
Lindner, B., Schimansky-Geier, L., & Longtin, A. (2002). Maximizing spike train coherence or incoherence in the leaky integrate-and-fire model. Phys. Rev. E, 66, 031916.
Longtin, A. (1997). Autonomous stochastic resonance in bursting neurons. Phys. Rev. E, 55, 868.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193.
Pikovsky, A., & Kurths, J. (1997). Coherence resonance in a noise-driven excitable system. Phys. Rev. Lett., 78, 775.
Pontryagin, L., Andronov, A., & Witt, A. (1989). On the statistical treatment of dynamical systems. In F. Moss & P. V. E. McClintock (Eds.), Noise in nonlinear dynamical systems (Vol. 1, p. 329). Cambridge: Cambridge University Press.
Rappel, W. J., & Strogatz, S. H. (1994). Stochastic resonance in an autonomous system with a nonuniform limit cycle. Phys. Rev. E, 50, 3249.
Reimann, P., Van den Broeck, C., Hänggi, P., Linke, H., Rubi, J. M., & Pérez-Madrid, A. (2002). Diffusion in tilted periodic potentials: Enhancement, universality, and scaling. Phys. Rev. E, 65, 031104.
Rinzel, J., & Ermentrout, B. (1989). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (p. 251). Cambridge, MA: MIT Press.
Risken, H. (1984). The Fokker-Planck equation. Berlin: Springer-Verlag.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes, and cortical organization. Curr. Op. Neurobiol., 4, 569.
Sigeti, D. (1988). Universal results for the effects of noise on dynamical systems. Unpublished doctoral dissertation, University of Texas at Austin.
Sigeti, D., & Horsthemke, W. (1989). Pseudo-regular oscillations induced by external noise. J. Stat. Phys., 54, 1217.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334.
Troyer, T., & Miller, K. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Comp., 9, 971.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Wilbur, W., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distributions. J. Theor. Biol., 105, 345.

Received October 31, 2002; accepted February 4, 2003.
LETTER
Communicated by Paul Bressloff
What Causes a Neuron to Spike?

Blaise Agüera y Arcas
[email protected]
Rare Books Library, Princeton University, Princeton, NJ 08544, U.S.A.
Adrienne L. Fairhall
[email protected]
Department of Molecular Biology, Princeton University, Princeton, NJ 08544, U.S.A.
The computation performed by a neuron can be formulated as a combination of dimensional reduction in stimulus space and the nonlinearity inherent in a spiking output. White noise stimulus and reverse correlation (the spike-triggered average and spike-triggered covariance) are often used in experimental neuroscience to "ask" neurons which dimensions in stimulus space they are sensitive to and to characterize the nonlinearity of the response. In this article, we apply reverse correlation to the simplest model neuron with temporal dynamics, the leaky integrate-and-fire model, and find that for even this simple case, standard techniques do not recover the known neural computation. To overcome this, we develop novel reverse-correlation techniques by selectively analyzing only "isolated" spikes and taking explicit account of the extended silences that precede these isolated spikes. We discuss the implications of our methods for the characterization of neural adaptation. Although these methods are developed in the context of the leaky integrate-and-fire model, our findings are relevant for the analysis of spike trains from real neurons.

1 Introduction

There are two distinct approaches to characterizing a neuron based on spike trains recorded during stimulus presentation. One can take the view of a downstream neuron and ask how the spike train should be decoded to reconstruct some aspect of the stimulus. Alternatively, one can take the neuron's own view and try to determine how the stimulus is encoded, that is, which aspects of the stimulus trigger spikes. This article is concerned with the latter problem: the identification of relevant stimulus features from neural data.
The earliest attempts at neural characterization, including classic electrophysiological experiments like those of Adrian (1928), presented the neural system with simple, highly stereotyped stimuli with one or at most a few free parameters; this makes it relatively straightforward to map the

Neural Computation 15, 1789–1807 (2003) © 2003 Massachusetts Institute of Technology
input-output relation, or tuning curve, in terms of these parameters. While this approach has been invaluable and is still often used to advantage, it has the shortcoming of requiring a strong assumption about what the neuron is "looking for." As this assumption is relaxed by probing the system using stimuli with a larger number of parameters, the length of the experiment needed to probe neural response grows combinatorially. In some cases, white noise can be used as an alternative stimulus, which naturally contains "every" signal (Marmarelis & Naka, 1972). The stimulus feature best correlated with spiking is then the spike-triggered average (STA), that is, the average stimulus history preceding a spike (de Boer & Kuyper, 1968; Bryant & Segundo, 1978). Such reverse-correlation methods have been further developed and applied to characterize the motion-sensitive identified neuron H1 in the fly visual system (de Ruyter van Steveninck & Bialek, 1988). By extending reverse correlation to second order, one can identify multiple directions in stimulus space relevant to the neuron's spiking "decision" (Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Bialek & de Ruyter van Steveninck, 2002). These techniques allow the experimenter to probe neural response stochastically, using minimal assumptions about the nature of the relevant stimulus.

In this article, we will examine in detail the relation between the stimulus features extracted through white noise analysis and the computation actually performed by the neuron. In the usual interpretation of white noise analysis, the features extracted using reverse correlation are considered to be linear filters, which, when convolved with the stimulus, provide the relevant inputs to a nonlinear spiking decision function.
While the features recovered by reverse correlation are optimal for stimulus reconstruction (Rieke, Warland, Bialek, & de Ruyter van Steveninck, 1997), we will show that they are distinct from the filter or filters relevant to the neuron's spiking decision. In particular, reconstruction filters necessarily depend on the stimulus ensemble, while the filters relevant for spiking (at least, for simple neurons like the leaky integrate-and-fire model) are properties of the model alone.

The main difficulty in interpreting the spike-triggered average is that spikes interact, in the sense that the occurrence of a spike affects the probability of the future occurrence of a spike. Both stimulus reconstruction and the identification of triggering features are affected by this issue. Some of the effects of correlations between spikes have been treated in Kistler, Gerstner, and van Hemmen (1997), Eguia, Rabinovich, and Abarbanel (2000), Tiesinga (2001), and Chacron, Longtin, and Maler (2001). To separate the influence of previous spikes from the influence of the stimulus itself, we propose to analyze only "isolated" spikes, those sufficiently distant in time from previous spikes that they can be considered to be statistically independent. This procedure allows us to recover exactly the relevant stimulus features. However, it introduces additional complications. By requiring an extended silence before a spike, we bias the prior statistical ensemble. This bias will emerge in the white noise analysis, and we will have to identify it explicitly.
We carry out this program for the simplest model neuron with temporal dynamics: the leaky integrate-and-fire neuron. We choose this model because, unlike more complex models, we know exactly which stimulus feature causes spiking and can make a precise comparison with the output of the analysis. The methods presented and validated here using the integrate-and-fire model have been applied to the more biologically interesting Hodgkin-Huxley neuron (Agüera y Arcas, Fairhall, & Bialek, 2003), resulting in new insights into its computational properties. Application of the same analytic techniques to real neural data is currently underway.

2 Characterizing Neural Response

In this section we briefly review material that has appeared elsewhere (Agüera y Arcas, Bialek, & Fairhall, 2001; Agüera y Arcas et al., 2003; Bialek & de Ruyter van Steveninck, 2002).

2.1 The Linear-Nonlinear Model. Consider a neuron responding to a dynamic stimulus I(t). While here we will consider I to be a scalar function, in general it may depend on any number of additional variables, such as frequency, spatial position, or orientation. We can suppress these dependencies without loss of generality. Even without additional dependent variables, I(t) is a high-dimensional stimulus, as the occurrence of a spike at time t0 generally depends on the stimulus history, I(t < t0). The effective dimensionality of this input is determined by the temporal extent of the relevant stimulus and the effective sampling rate, which is imposed by biophysical limits. For the computation of the neuron to be characterizable at all, the dimensionality of the stimulus relevant to spiking must generally be much lower than this total dimensionality. The simplest approach to dimensionality reduction is to approximate the relevant dimensions as a low-dimensional linear subspace of I(t).
If the neuron is sensitive only to a low-dimensional linear subspace, we can define a small set of signals s1, s2, …, sK by filtering the stimulus,

\[ s_\mu = \int_0^{\infty} d\tau \, f_\mu(\tau) \, I(t_0 - \tau), \tag{2.1} \]
so that the probability of spiking depends on only these few signals,

\[ P[\text{spike at } t_0 \mid I(t<t_0)] = P[\text{spike at } t_0] \; g(s_1, s_2, \ldots, s_K). \tag{2.2} \]
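The linear-nonlinear structure of equations 2.1 and 2.2 can be sketched in discretized form as follows; the filter shape, the sigmoidal choice of g, and all parameter values are our illustrative assumptions, not the paper's.

```python
import numpy as np

dt = 0.5                                   # msec, illustrative sampling step
tau = np.arange(0.0, 100.0, dt)            # filter support, 100 msec

# one illustrative filter f_1: an exponential kernel with a 10 msec decay
f1 = np.exp(-tau / 10.0)

def project(stimulus_history):
    """s_1 = int dtau f_1(tau) I(t0 - tau), equation 2.1, discretized.
    stimulus_history[-1] is the most recent stimulus sample."""
    return np.dot(f1, stimulus_history[::-1]) * dt

def p_spike(s1, mean_rate=0.02, threshold=5.0, steepness=1.0):
    """P[spike at t0 | I] = P[spike at t0] g(s_1), equation 2.2,
    here with a sigmoidal nonlinearity g."""
    return mean_rate / (1.0 + np.exp(-steepness * (s1 - threshold)))

rng = np.random.default_rng(0)
I = rng.normal(size=tau.size)              # a white noise stimulus history
p = p_spike(project(I))
```

With K > 1, `project` would simply be applied with each filter f_μ, and g would take the resulting vector of projections.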
Classic characterizations of neurons in the retina (e.g., Berry & Meister, 1999) typically assume that these neurons are sensitive to a single linear dimension of the stimulus; the filter f1 then corresponds to the (spatiotemporal) receptive field, and g is proportional to the one-dimensional tuning curve. A number of other neurons, particularly early in sensory pathways, have been similarly characterized (Theunissen et al., 2001; Theunissen, Sen, & Doupe, 2000; Kim & Rieke, 2001). Here we will concentrate primarily on the problem of finding the filters {fμ}; once they are found, g can be obtained easily by direct sampling of P[spike at t0 | s1, s2, …, sK], assuming that the relevant dimensionality K is small.

2.2 Reverse Correlation Methods. While one would like to determine the neural response in terms of the probability of spiking given the stimulus, P[spike at t0 | I(t < t0)], we follow de Ruyter van Steveninck and Bialek (1988) in considering instead the distribution of signals conditional on the response, P[I(t < t0) | spike at t0]. This quantity can be extracted directly from the known stimulus and the resulting spike train. The two probabilities are related by Bayes' rule,

\[ \frac{P[\text{spike at } t_0 \mid I(t<t_0)]}{P[\text{spike at } t_0]} = \frac{P[I(t<t_0) \mid \text{spike at } t_0]}{P[I(t<t_0)]}. \tag{2.3} \]
P[spike at t0] is the mean spike rate, and P[I(t < t0)] is the prior distribution of stimuli. If I(t) is gaussian white noise, then this distribution is a multidimensional gaussian; furthermore, the distribution of any filtered version of I(t) will also be gaussian. The first moment of P[I(t < t0) | spike at t0] is the spike-triggered average,

\[ \mathrm{STA}(\tau) = \int [dI] \, P[I(t<t_0) \mid \text{spike at } t_0] \, I(t_0 - \tau). \tag{2.4} \]
We can also compute the covariance matrix of fluctuations around this average,

\[ C_{\mathrm{spike}}(\tau, \tau') = \int [dI] \, P[I(t<t_0) \mid \text{spike at } t_0] \, I(t_0-\tau) \, I(t_0-\tau') - \mathrm{STA}(\tau)\,\mathrm{STA}(\tau'). \tag{2.5} \]
In the same way that we compare the spike-triggered average to some constant average level of the signal in the whole experiment, an important advance of Brenner et al. (2000) was to compare the covariance matrix Cspike with the covariance of the signal averaged over the whole experiment,

\[ C_{\mathrm{prior}}(\tau, \tau') = \int [dI] \, P[I(t<t_0)] \, I(t_0-\tau) \, I(t_0-\tau'). \tag{2.6} \]
If the probability of spiking depends on only a few relevant features, the change in the covariance matrix, ΔC = Cspike − Cprior, has a correspondingly small number of outstanding eigenvalues. Precisely, if P[spike | stimulus] depends on K linear projections of the stimulus as in equation 2.2 and if the inputs I(t) are chosen from a gaussian distribution, then the rank of the matrix ΔC is exactly K. Further, the eigenmodes associated with nonzero eigenvalues span the relevant subspace. A positive eigenvalue indicates a direction in stimulus space along which the variance is increased relative to the prior, while a negative eigenvalue corresponds to reduced variance.

3 Leaky Integrate-and-Fire

The leaky integrate-and-fire neuron is perhaps the most commonly used approximation to real neurons. Its dynamics can be written as

\[ C \frac{dV}{dt} = I(t) - \frac{V}{R} - C V_c \sum_i \delta(t - t_i), \tag{3.1} \]
where C is a capacitance, R is a resistance, and Vc is the voltage threshold for spiking. The first two terms on the right-hand side model an RC circuit; the capacitor integrates the input current as a potential V, while the resistor dissipates the stored charge (hence "leaky"). The third term implements spiking: spikes are defined as the set of times {ti} such that V(ti) = Vc. When the potential reaches Vc, a restoring current is instantaneously injected to bring the stored charge back to zero, resetting the system. This formulation emphasizes the similarity between leaky integrate-and-fire and the more biophysically realistic Hodgkin-Huxley equation for current flow through a patch of neural membrane; in the Hodgkin-Huxley system, however, the nonlinear spike-generating term is replaced with ionic currents of a more complex (and less singular) form. Hence, the simple properties of the leaky integrate-and-fire model (a single degree of freedom V, instantaneous spiking, total resetting of the system following a spike, and linearity of the equation of motion away from spikes) are only rough approximations to the properties of realistic neurons.

Integrating equation 3.1 away from spikes, we obtain

\[ C V(t) = \int_{-\infty}^{t} d\tau \, \exp\!\left(\frac{\tau - t}{RC}\right) I(\tau), \tag{3.2} \]
assuming initialization of the system in the distant past at zero potential. The right-hand side can be rewritten as the convolution of I(t) with the causal exponential kernel,

\[ f_1(\tau) = \begin{cases} 0 & \text{if } \tau \le 0 \\ \exp(-\tau/RC) & \text{if } 0 < \tau, \end{cases} \tag{3.3} \]
so the condition for a spike at time t1 is that this convolution reach threshold:

\[ C V_c = \int_{-\infty}^{\infty} d\tau \, f_1(\tau) \, I(t_1 - \tau). \tag{3.4} \]
The interaction between spikes complicates this picture somewhat. If the system spiked last at time t0 < t1, we must replace the lower limit of integration in equation 3.2 with t0, as no current injected before t0 can contribute to the accumulated potential. However, if we wish always to evaluate stimulus projections by integrating over the entire current history I(t) as in equation 3.4, then we must replace f1 with an entire family of kernels dependent on the time to the previous spike, Δt = t1 − t0:

\[ f_{\Delta t}(\tau) = \begin{cases} 0 & \text{if } \tau \le 0 \\ \exp(-\tau/RC) & \text{if } 0 < \tau < \Delta t \\ 0 & \text{if } \Delta t \le \tau. \end{cases} \tag{3.5} \]
This equation completes an exact model of the leaky integrate-and-fire neuron.

4 Response to White Noise

We now consider the statistical behavior of the integrate-and-fire neuron when driven with gaussian white noise current of standard deviation σ,

\[ \langle I(t) \rangle = 0, \qquad \langle I(t) I(t') \rangle = \sigma^2 \delta(t - t'). \tag{4.1} \]
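A minimal Euler simulation of equations 3.1 and 4.1 can be sketched as follows (our sketch, not the authors' code). Units are collapsed to msec and mV with RC = 10 msec, and the noise amplitude is an illustrative stand-in.

```python
import numpy as np

def simulate_lif(n_steps=200_000, dt=0.05, RC=10.0, Vc=10.0,
                 sigma=2.0, seed=1):
    """Euler integration of C dV/dt = I(t) - V/R with reset to V = 0 at
    threshold Vc (equation 3.1). Voltage in mV, time in msec. The input
    is discretized white noise whose per-step standard deviation scales
    as sigma/sqrt(dt) (equation 4.1). Returns spike times in msec."""
    rng = np.random.default_rng(seed)
    eta = rng.normal(0.0, sigma / np.sqrt(dt), size=n_steps)  # I(t)/C, mV/msec
    V, spikes = 0.0, []
    for k in range(n_steps):
        V += dt * (eta[k] - V / RC)
        if V >= Vc:
            spikes.append(k * dt)
            V = 0.0                       # instantaneous reset, eq. 3.1
    return np.array(spikes)

spikes = simulate_lif()                   # ~10 s of simulated time
```

The spike times returned here are the raw material for the spike-triggered averages and interspike interval histograms discussed in the rest of this section.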
For the following analysis, we take R = 10 kΩ, C = 1 μF, Vc = 10 mV, and σ = 200 μA√msec, and use Euler integration with a time step of 0.05 msec. These values were chosen to mimic biophysically plausible parameters for a 1 cm² patch of membrane, but none of our results depend qualitatively on this choice.

4.1 Unqualified Reverse Correlation. At first glance, it would appear that the neuron is perfectly described by a reduced model in one dimension: the current filtered by the exponential kernel of equation 3.3. If so, the spike-triggered average under white noise stimulation should be proportional to this filter.¹ However, as is evident in Figure 1, the STA is not exponential, and it does not asymptote to the appropriate decay rate 1/RC. One reason

¹ In possible linear combination with a derivative-like filter, arising from the additional constraint that the threshold must be crossed from below; this point will be addressed in more detail later.
Figure 1: Spike-triggered average of a leaky integrate-and-fire neuron, shown inset with a logarithmic I-axis and the pure exponential σf1 (gray line).
is that in the presence of previous spikes, the integrate-and-fire neuron is not low dimensional. Recall that the relative timing Δt of the previous spike selects a filter f_Δt (see equation 3.5); these filters are truncated versions of the original exponential f1. The STA will therefore include contributions from each of these filters, weighted by their probability of occurrence, as determined by the interspike interval distribution.² It is readily shown that the orthogonal component of each filter f_Δt is a δ-function at −Δt, so the set of relevant filters spans the entire stimulus space.

Covariance analysis reveals this high dimensionality very clearly. In Figure 2, we show the spectrum of eigenvalue magnitudes of ΔC as a function of the number of spikes summed to accumulate the matrix; with decreasing noise, an arbitrary number of significant eigenmodes emerges. As shown in Figure 3, none of these eigenmodes corresponds to an exponential filter. The first mode most closely resembles an exponential, but as shown in the right half of Figure 3, it is more peaked near t = 0, and its time constant is very different from RC.
² This is not a prescription for recovering f1 from the STA. The STA is further influenced by other effects, including the average stimulus histories leading up to previous spikes.
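The covariance comparison used here (ΔC = Cspike − Cprior, equations 2.5 and 2.6, with eigenmodes as in Figure 2) can be estimated from a discretized stimulus and spike indices as follows; this is a generic estimator of our own, not the authors' code.

```python
import numpy as np

def delta_covariance(I, spike_idx, M):
    """Spike-triggered average (eq. 2.4) and covariance difference
    Delta_C = C_spike - C_prior (eqs. 2.5, 2.6), using the M stimulus
    samples preceding each spike index."""
    idx = [n for n in spike_idx if n >= M]
    X = np.stack([I[n - M:n] for n in idx])   # spike-triggered windows
    sta = X.mean(axis=0)
    c_spike = np.cov(X, rowvar=False)
    windows = np.lib.stride_tricks.sliding_window_view(I, M)
    c_prior = np.cov(windows, rowvar=False)   # prior over all windows
    return sta, c_spike - c_prior

# eigenmodes of Delta_C, sorted by |eigenvalue| (compare Figure 2), via:
# evals, evecs = np.linalg.eigh(delta_c)
```

For white noise, Cprior is close to σ² times the identity, so the outstanding eigenvalues of ΔC directly flag the stimulus directions whose variance is changed by conditioning on a spike.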
Figure 2: Eigenvalue magnitudes of the spike-triggered covariance difference matrix ΔC as a function of the number of spikes used to accumulate the matrix. An arbitrary number of stable modes emerges, given enough data. The stable modes all have negative eigenvalues; spikes are associated with reduced stimulus variance along these dimensions. The gray diagonal line, shown for reference, is proportional to 1/√Nspikes.
Figure 3: (Left) The leading six modes of the unqualified spike-triggered change in covariance ΔC. (Right) Only the first mode is monophasic; here it is plotted on a logarithmic y-axis (circles) with f1 for reference (solid line).
Figure 4: Interspike interval histogram for a leaky integrate-and-fire neuron. 5 × 10⁶ spikes were accumulated in 0.27 msec bins.
4.2 Isolated Spikes. It is clear that making progress requires first considering the causes of a single spike independent of the spiking history. Therefore, we will include in our analysis only spikes that follow previous spikes after a sufficiently long interval. The appropriate interval can be estimated from the interspike interval distribution, Figure 4. As in many real neurons, the interval distribution has three main features: a refractory period during which spikes are strongly suppressed, a peak at a preferred interval, and an exponential tail. Intervals in the exponential tail follow Poisson statistics; for intervals in this range, we can take spikes to be statistically independent. With reference to Figure 4, we set our isolated spike criterion, very conservatively, as 75 msec of previous "silence."

As shown in Figure 5, even with this constraint, we do not recover a spike-triggered average matching f1. This is no longer, strictly speaking, a spike-triggered average but an event-triggered average, where the event includes the preceding silence. Thus, we can expect the silence itself to contribute to any average we construct. To first order, during silence, there is a constant negative bias, since certain positive currents that would have caused a spike are excluded from the average. Similarly, we expect that the covariance of the current will be altered from that of the gaussian prior during extended silences. Because extended silence is not locked to any particular moment in time, the covariance matrix is stationary during silence.
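The isolated-spike selection just described reduces to a simple filter on interspike intervals; a minimal sketch, with the 75 msec default following the criterion above:

```python
import numpy as np

def isolated_spikes(spike_times, silence=75.0):
    """Return the spikes preceded by at least `silence` msec without a
    spike. The first spike has no measured predecessor and is excluded,
    conservatively."""
    t = np.asarray(spike_times, dtype=float)
    return t[1:][np.diff(t) >= silence]

# e.g. isolated_spikes([0, 10, 100, 120, 300]) -> array([100., 300.])
```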
Figure 5: Triggered average stimulus for isolated spikes. Isolated spikes are defined as spikes preceded by at least 75 msec of silence. The inset shows the triggered average with the constant silence bias subtracted on a logarithmic I-axis, again with the pure exponential σf1 drawn as a gray line for reference.
Figure 6: Eigenvalue magnitudes of the isolated spike-triggered change in covariance ΔCiso as a function of the number of isolated spikes used to accumulate the matrix. The stable modes again have negative eigenvalues.
Figure 7: Fraction of the energy in each mode of ΔCiso (modes are defined over t ∈ [−65, 0] msec) in the silent interval t ∈ [−65, −45] msec, as a function of the number of isolated spikes used to accumulate ΔCiso. The eigenvalue for a mode with zero energy in this interval would decrease like 1/Nspikes, as the only contribution would come from noise. Here, two modes emerge with very small energy in this interval, though they stabilize at small, nonzero values. These modes are localized to the time immediately before the spike, but as they have exponential support, their energy in the silence does not fall all the way to zero.
We construct the covariance matrix as before: ΔCiso = Cisolated spike − Cprior. An entire series of modes once again appears (see Figure 6). This time, they arise because silence is, by definition, translationally invariant, resulting in a Fourier-like spectrum. Modes associated with the spike, on the other hand, are locked to a particular time and should therefore not be Fourier-like. In particular, these spiking modes should have temporal support only in the time immediately before the spike. This implies that two types of modes, silence associated and spike associated, should emerge independently in the covariance analysis and should exhibit a clear difference in their domain of support. In Figure 7, we consider the fraction of the energy of each mode over the interval t ∈ [−65, −45] msec. At −45 msec, the exponential filter f1 has decayed to 1% of its peak value, so we are well away from the spike at t = 0. The 75 msec silence criterion also puts the lower boundary of this interval well inside the stationary silence (at least 10 msec after the previous spike).
Figure 8: The two spike-associated modes of ΔCiso. The first (left) is the recovered exponential filter of the leaky integrate-and-fire spiking condition; the measured mode is marked with dots, and the line is f1 (normalized). The second mode (right) is the "first crossing" constraint.
As we see, exactly two modes emerge as local, with their energy fraction decaying with the noise in the matrix until reaching a stable value well below the rest of the modes. The first localized mode, shown in Figure 8, is the long-awaited exponential filter f1. Using our modified second-order reverse correlation analysis on "experimental data," we have thus finally been able to recover the stimulus feature to which the leaky integrate-and-fire neuron is sensitive.

4.3 Constraints. The second mode is a consequence of an implicit constraint in the spiking model. Naively, we know that at the moment when V = Vc, we must have V̇ > 0; that is, the threshold must be crossed from below. Hence

\[ \int_{-\infty}^{\infty} d\tau \, I(t - \tau) \, \frac{d}{d\tau} f_1(\tau) < 0, \tag{4.2} \]

making the neuron also sensitive to the time derivative of f1. The integrate-and-fire neuron is peculiar in that f1 and df1/dτ are almost linearly dependent:

\[ \frac{d}{d\tau} f_1(\tau) = \begin{cases} \delta(\tau) & \text{if } \tau \le 0 \\ -(RC)^{-1} \exp(-\tau/RC) & \text{if } 0 < \tau. \end{cases} \tag{4.3} \]

The orthogonal part of this filter is therefore only δ(τ), and we can rewrite this extra condition more simply as I(0) > Vc/R. However, we see that although the second spiking mode is indeed very sharply peaked at τ = 0, it is not a δ function. While the above argument is correct in constraining the possible values of I(0), we have also required
Figure 9: The time-dependent distributions of V(t) (left) and I(t) (right) leading up to an isolated spike at t = 0. In both cases, the distributions with the highest means correspond to the final time slice (they are thus also the most constrained). For V, the times shown are {−15, −14.5, −14, …, −1.5, −1, −0.5} msec. For I, the times are {−5.05, −4.05, …, −1.05}, {−0.8, −0.6, −0.4}, and {−0.35, −0.3, …, −0.05} msec. The distribution of V is virtually indistinguishable from its silence steady state by −15 msec; convergence to the silence steady state is much faster for I, occurring by −5 msec.
no threshold crossings for an extended time prior to the spike. Hence, certain trajectories I(t) that satisfy the two basic conditions (V(0) = Vc and V̇(0) > 0) are still disallowed, because they would have led to a spike earlier. The second mode is therefore an extended version of the derivative condition. Although it appears to have very extended support, this is due to orthogonalization with respect to the first mode, f1. The true constraint is positive definite and highly peaked, with virtually all of its energy concentrated in the ~5 msec interval prior to the spike.

We can more easily understand this constraint by considering the time-dependent distribution of V(t) leading up to an isolated spike, shown in the left panel of Figure 9. The process V(t) leading up to a spike is a centrally biased random walk with an absorbing barrier at Vc and a capacitatively determined correlation time. If we trace the distribution of V backward in time from the spike at t = 0, we see a constrained diffusive process from a δ function at the spike time to the steady-state silence distribution. The constraint on V enforces a corresponding constraint on I, whose time-dependent distribution is shown in the right panel of Figure 9. Notice that the I distribution is noticeably distorted from its silence steady state only at times very near the spike.

4.4 Silence Modes. Finally, let us return to the spectrum of silence-associated modes in Figure 7. Several of these are shown overlaid in Figure 10. It is immediately clear that they have the expected Fourier-like
Figure 10: The leading six modes of ΔCiso associated with silence (i.e., with nonvanishing support over t ∈ [−65, −45] msec). Although they are Fourier-like in the silence, they chirp leading up to the spike.
structure over the period of silence. Their behavior near the spike is less obvious, however: each mode appears to execute an FM chirp, culminating in a sharp feature at the spike time itself. Clearly, while it is true by construction that spike-associated covariance modes have no support in periods of extended silence, the converse is not true. This structure is due to the essential nonstationarity of silence near spikes: at some point, the silence must come to an abrupt end. It is clear that in order to implement our procedure for separating silence from spikes, we could not have taken the approach of looking for modes with significant energy near the spike. All of the silence modes have interesting structure near the spike and could easily be mistaken for additional spike-generating features.

Like the spike-associated modes, these silence modes have negative eigenvalues, indicating reduced stimulus variance in certain dimensions. Yet they do not cause spiking or inhibit it; they are in no meaningful sense "suppressive" modes. They simply show us the second-order statistics of silences, which emerge from the enforced absence of spikes. This underscores the necessity of considering a sufficiently extended silence before isolated spikes for the translationally invariant part of the silence modes to emerge, distinguishing them from spiking modes.
What Causes a Neuron to Spike?
This structure also highlights the reason that we could not, as one might expect, simply subtract from C_isolated spike a prior obtained from extended silences instead of C_prior to obtain a simple, low-dimensional ΔC_iso. While such a subtraction results in a local matrix around the spike, it cannot take into account the nonstationary part of the silence and leaves the spike-related and nonstationary silence-related aspects of the covariance near the spike mixed. This again results in a high-dimensional matrix with incorrect (mixed) modes.

5 Adaptation

It has previously been observed that the STAs of some neurons depend on the stimuli used to probe their response (Theunissen et al., 2000; Lewen, Bialek, & de Ruyter van Steveninck, 2001). White noise stimulus was originally introduced to circumvent this difficulty by sampling stimulus space in an unbiased way; however, one must still choose a noise variance. While one would hope that this does not affect the STA, it has been shown that at least in some sense, even the integrate-and-fire neuron "adapts" to the stimulus variance (Rudd & Brown, 1997; Paninski, Lau, & Reyes, in press), showing both a change in the overall STA and in its measured nonlinear decision function. The reason for this form of adaptation is simple: the statistics of spike intervals depend on the stimulus variance, as the time to reach threshold depends on the variance. By construction, however, neither the spike-triggering feature f1 nor the decision function g (here simply the threshold) of the integrate-and-fire neuron actually depends on the stimulus statistics. As we have discussed, the unqualified STA is a linear combination of the filters f_Δt conditioned on the time to the last spike; the contribution of each filter to the overall STA thus depends on the probability P(Δt) of the interspike interval Δt. As different stimulus ensembles produce different interspike interval distributions, these coefficients are functions of the stimulus ensemble.
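The mixture argument above can be made concrete with a toy calculation. In the sketch below the underlying filter, the "chopping" rule, and the interval distributions P(Δt) are all invented for illustration; it shows that two stimulus ensembles with different interspike-interval statistics yield different overall STAs even though the underlying filter is fixed:

```python
# Toy illustration (invented numbers): the measured STA as a P(dt)-weighted
# sum of interval-conditioned filters f_dt, where a previous spike dt samples
# back "chops" the fixed underlying filter (integrate-and-fire-style reset).
import numpy as np

T = 30
f = np.exp(-np.arange(T)[::-1] / 6.0)          # fixed underlying filter

def chopped(dt):
    """Filter conditioned on a previous spike dt samples back: support is
    limited to the dt samples since that spike."""
    f_dt = f.copy()
    f_dt[: max(T - dt, 0)] = 0.0
    return f_dt

def sta(p_dt):
    """Overall STA = sum over dt of P(dt) * f_dt."""
    return sum(p * chopped(dt) for dt, p in p_dt.items())

# Low-variance stimulus -> long intervals; high-variance -> short intervals.
sta_low = sta({25: 0.2, 40: 0.8})
sta_high = sta({5: 0.6, 10: 0.4})
print(np.allclose(sta_low, sta_high))          # False: STA is ensemble dependent
```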
Therefore, even in the absence of adaptation in the linear filters f, the overall STA will be stimulus dependent. Because sampling the input-output relation normally involves convolving the stimulus with the STA, the measured nonlinearity will also appear to be stimulus dependent, despite the fact that the neuron's nonlinear decision function g (based on convolution of the stimulus with the fixed filters f) is also fixed. Hence this form of "adaptation" does not reflect any changing property or internal state of the neuron in response to the stimulus statistics. Rather, it arises from the nonlinearity inherent in spiking itself. This emphasizes the significance of our methods: we have introduced a way to extract from data the intrinsic, unique spike-triggering feature, removing the effects of the stimulus-dependent interspike interval statistics. These statistics are produced by the model when driven by a particular stimulation regime; they are not part of the model itself. If a neuron does in fact exhibit "true" adaptation in the sense that the functions f and g are
variance dependent, then the isolated spike analysis presented here will reveal this. Such variance dependence can come about as a result of explicit gain control mechanisms (Shapley, Enroth-Cugell, Bonds, & Kirby, 1972), or because nonlinearities in the subthreshold neural response select nonconstant stimulus features, an effect that can play an important role in more complex models such as the Hodgkin-Huxley neuron (Agüera y Arcas et al., 2003).

6 Discussion

While the interspike interval statistics are not an intrinsic model property, they do depend on an aspect of the complete characterization of the neuron that we have explicitly neglected here: the interspike interaction. There are two senses in which addressing only isolated spikes appears to be a limitation of our method. One is practical: in most situations, this will substantially reduce the number of spikes considered in the analysis. Further, much of the work on white noise analysis to date has focused on the problem of stimulus reconstruction, and our method does not provide a recipe for decoding nonisolated spikes. With respect to the second point, there are good reasons for considering isolated spikes first. Evidence from several systems suggests that isolated spikes contain more information about the stimulus than the spikes that follow them (Berry & Meister, 1999; de Ruyter van Steveninck & Bialek, 1995; Panzeri, Petersen, Schultz, Lebedev, & Diamond, 2001). This is clear from first principles, as nonisolated spikes are triggered only partly by the stimulus and partly by the timing of the previous spike. Understanding the complex interactions between spikes and the stimulus to create spike patterns is obviously an important next step in understanding the neural code, but the logical starting point is to establish the causes of isolated spikes.
If we attempt stimulus reconstruction using reverse correlations averaged over all spikes, then the reconstruction will be incorrect; similarly, predicting spike timing without knowledge of the recent spiking history is generally impossible. A step along the path of treating spike interactions explicitly has recently been made in the modeling of retinal ganglion cell spike trains (Keat, Reinagel, Reid, & Meister, 2001). The problem of interpreting the spike-triggered average without taking into account the interspike interaction is also noted in Pillow and Simoncelli (in press), where a technique is presented for reconstructing the filter of an integrate-and-fire neuron using all spikes. That technique, however, assumes that projection onto only a single stimulus dimension is relevant and that the corresponding filter is "chopped" by a previous spike as in equation 3.5; hence, it is only applicable to reconstructing the filter of an integrate-and-fire-like neuron, in which the interspike interaction takes this simple form and there is one relevant stimulus dimension (for isolated spikes). The technique presented here, however, does not assume any specific form for the interspike interaction and is capable of recovering the entire set of
relevant filters for isolated spikes. This allows one to study explicitly the nature of the interspike interaction in a separate step. The separation of spikes into isolated and nonisolated also allows a more efficient use of the data: ensemble averages will not be complicated by the inclusion of effects from interspike interaction. Thus, although we are using fewer data, it is probably the case that we are using the data more effectively, by summing stronger covariance signals from a simpler, lower-dimensional space. While in a simulation the amount of data can be unbounded and we have used up to millions of spikes to demonstrate our points here, we note that the significant effects all emerge very clearly after ~10^4 spikes. This can be seen from the evolution of the eigenvalues and silence energy fractions with increasing spike count (see Figures 2, 6, and 7). This number is in a range often accessible to the experimenter. It should be noted that the need to consider the effect of silence arises even in the standard analysis of nonisolated spikes. Because all spiking neurons have a refractory period, there is always an implicit silence or isolation constraint, albeit sometimes of only a few milliseconds. The second spike-associated mode identified here (see Figure 8) is localized precisely in the disallowed interval and in fact emerges almost identically in the unqualified covariance analysis. Its contribution to the highly peaked STA is highly significant, since both the spike-conditional variance and mean along this stimulus dimension are substantially altered from the prior. Recall that this mode is related to a derivative condition expressing upward threshold crossing, but it is extended, as there is always a finite period of silence prior to a spike. An analogous mode should arise for all spiking neurons, though its shape will depend strongly on the shape of the spike-generating filter or filters.
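The isolation criterion itself is simple to state in code. The sketch below (window lengths and spike times invented; not the authors' implementation) keeps only spikes preceded by a silent interval and averages the stimulus preceding those spikes:

```python
# Sketch of the isolated-spike selection step (all parameters invented):
# a spike is "isolated" if no other spike occurred within the preceding
# `min_silence` interval; reverse correlation is restricted to these spikes.
import numpy as np

def isolated_spike_indices(spike_times, min_silence):
    """Indices of spikes preceded by at least `min_silence` of silence."""
    spike_times = np.asarray(spike_times, dtype=float)
    gaps = np.diff(spike_times, prepend=-np.inf)  # time since previous spike
    return np.nonzero(gaps >= min_silence)[0]

def isolated_sta(stimulus, spike_bins, window, min_silence_bins):
    """STA over isolated spikes only; `window` samples preceding each spike."""
    idx = isolated_spike_indices(spike_bins, min_silence_bins)
    snippets = [stimulus[t - window:t]
                for t in np.asarray(spike_bins)[idx] if t >= window]
    return np.mean(snippets, axis=0)

spikes = [120, 125, 300, 800, 805, 1400]
print(isolated_spike_indices(spikes, 60))      # spikes 0, 2, 3, 5 are isolated

rng = np.random.default_rng(0)
sta_est = isolated_sta(rng.standard_normal(2000), spikes, 50, 60)
print(sta_est.shape)                           # one window-length STA vector
```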
Reduced stimulus variance in the direction of this derivative-like feature should not be thought of as part of the cause of a spike; it is a necessary concomitant of spike generation conditioned on the first filter alone. This inevitable derivative-like condition is relevant to the inverse problem of stimulus reconstruction from a spike train, but not to the forward problem of causally predicting spike times from the stimulus. Notice that this implies that, in general, convolving the stimulus with the STA (either unqualified or for isolated spikes only) is not an ideal way to predict spike times. In summary, we have shown that standard white noise analysis is unable to recover the known filter properties of the leaky integrate-and-fire neuron. We show that this is due to the influence of interactions between spikes, and we demonstrate that by removing this influence, requiring spikes to be "isolated," we are able to recover the computation performed by the neuron. However, in so doing, we introduce a complication: we must address not only the isolated spikes, but the silences that precede them. Silences produce complex structure in the covariance analysis, which is nonetheless statistically independent and clearly distinguishable from structure tied to the spikes themselves. In a sense, the distinction between spike-related and silence-related features is artificial, for silence is simply the complement to
spiking: silence is structured by the absence of features that would produce a spike. The distinction is useful, however, in allowing us to use the isolated spikes to identify the spike-generating feature or features unambiguously. We have shown that failing to consider spikes in isolation leads to distortion in the filters obtained through standard white noise reverse correlation. Our conclusions are relevant not just for the simple case of the leaky integrate-and-fire neuron, but for any system in which the presence of a spike has an influence on the generation of subsequent spikes. The new isolated spike method presented here should improve the accuracy of white noise analysis of neural spike trains with interspike interactions. In some cases, it may significantly change our picture of what stimulus features a neuron is sensitive to.

Acknowledgments

We thank William Bialek for his advice, discussion, and encouragement to study this problem.

References

Adrian, E. (1928). The basis of sensation: The action of the sense organs. London: Christophers.
Agüera y Arcas, B., Bialek, W., & Fairhall, A. L. (2001). What can a single neuron compute? In T. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 75–81). Cambridge, MA: MIT Press.
Agüera y Arcas, B., Fairhall, A. L., & Bialek, W. (2003). Computation in a single neuron: Hodgkin and Huxley revisited. Neural Computation, 15, 1715–1749.
Berry II, M. J., & Meister, M. (1999). The neural code of the retina. Neuron, 22, 435–450.
Bialek, W., & de Ruyter van Steveninck, R. R. (2002). Features and dimensions: Motion estimation in fly vision. Unpublished manuscript.
Brenner, N., Strong, S., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. R. (2000). Synergy in a neural code. Neural Comp., 12, 1531–1552.
Bryant, H., & Segundo, J. P. (1976). Spike initiation by transmembrane current: A white-noise analysis. J. Physiol., 260, 279–314.
Chacron, M., Longtin, A., & Maler, L. (2001).
Negative interspike interval correlations increase the neuronal capacity for encoding time-dependent stimuli. J. Neurosci., 21, 5328–5343.
de Boer, E., & Kuyper, P. (1968). Triggered correlation. IEEE Trans. Biomed. Eng., 15, 169–179.
de Ruyter van Steveninck, R. R., & Bialek, W. (1988). Real-time performance of a movement-sensitive neuron in the blowfly visual system: Information transfer in short spike sequences. Proc. Roy. Soc. Lond. B, 234, 379–414.
de Ruyter van Steveninck, R. R., & Bialek, W. (1995). Reliability and statistical efficiency of a blowfly movement-sensitive neuron. Phil. Trans. R. Soc. Lond. B, 348, 321–340.
Eguia, M., Rabinovich, M., & Abarbanel, H. (2000). Information transmission and recovery in neural communications channels. Phys. Rev. E, 62, 7111–7122.
Keat, J., Reinagel, P., Reid, R. C., & Meister, M. (2001). Predicting every spike: A model for the responses of visual neurons. Neuron, 30(3), 803–817.
Kim, K., & Rieke, F. (2001). Temporal contrast adaptation in the input and output signals of salamander retinal ganglion cells. J. Neurosci., 21, 287–299.
Kistler, W., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Computation, 9, 1015–1045.
Lewen, G., Bialek, W., & de Ruyter van Steveninck, R. (2001). Neural coding of naturalistic motion stimuli. Network, 12, 317–329.
Marmarelis, P., & Naka, K. (1972). White-noise analysis of a neuron chain: An application of the Wiener theory. Science, 175, 1276–1278.
Paninski, L., Lau, B., & Reyes, A. (In press). Noise-driven adaptation: In vitro and mathematical analysis. Neurocomputing.
Panzeri, S., Petersen, R., Schultz, S., Lebedev, M., & Diamond, M. (2001). The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron, 29, 769–777.
Pillow, J., & Simoncelli, E. P. (In press). Biases in white noise analysis due to non-Poisson spike generation. Neurocomputing.
Rieke, F., Warland, D., Bialek, W., & de Ruyter van Steveninck, R. R. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rudd, M., & Brown, L. (1997). Noise adaptation in integrate-and-fire neurons. Neural Computation, 9, 1047–1069.
Shapley, R., Enroth-Cugell, C., Bonds, A., & Kirby, A. (1972). Gain control in the retina and retinal dynamics. Nature, 236, 352–353.
Theunissen, F., David, S., Singh, N., Hsu, A., Vinje, W., & Gallant, J. (2001). Estimating spatio-temporal receptive fields of auditory and visual neurons from their responses to natural stimuli. Network, 12, 289–316.
Theunissen, F., Sen, K., & Doupe, A. (2000).
Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. J. Neurosci., 20, 2315–2331.
Tiesinga, P. H. (2001). Information transmission and recovery in neural communications channels revisited. Phys. Rev. E, 64, 012901.

Received December 22, 2001; accepted January 27, 2003.
LETTER
Communicated by Sebastian Seung
Rate Models for Conductance-Based Cortical Neuronal Networks

Oren Shriki
[email protected]
Racah Institute of Physics, Hebrew University, Jerusalem 91904, Israel, and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel
David Hansel
[email protected]
Laboratoire de Neurophysique et Physiologie du Système Moteur, Université René Descartes, 75270 Paris Cedex 06, France, and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel
Haim Sompolinsky
[email protected]
Racah Institute of Physics, Hebrew University, Jerusalem 91904, Israel, and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel
Population rate models provide powerful tools for investigating the principles that underlie the cooperative function of large neuronal systems. However, biophysical interpretations of these models have been ambiguous. Hence, their applicability to real neuronal systems and their experimental validation have been severely limited. In this work, we show that conductance-based models of large cortical neuronal networks can be described by simplified rate models, provided that the network state does not possess a high degree of synchrony. We first derive a precise mapping between the parameters of the rate equations and those of the conductance-based network models for time-independent inputs. This mapping is based on the assumption that the effect of increasing the cell's input conductance on its f-I curve is mainly subtractive. This assumption is confirmed by a single compartment Hodgkin-Huxley type model with a transient potassium A-current. This approach is applied to the study of a network model of a hypercolumn in primary visual cortex. We also explore extensions of the rate model to the dynamic domain by studying the firing-rate response of our conductance-based neuron to time-dependent noisy inputs. We show that the dynamics of this response can be approximated by a time-dependent second-order differential equation. This phenomenological single-cell rate model is used to calculate the response of a conductance-based network to time-dependent inputs.
Neural Computation 15, 1809–1841 (2003) © 2003 Massachusetts Institute of Technology
O. Shriki, D. Hansel, and H. Sompolinsky
1 Introduction

Theoretical models of the collective behavior of large neuronal systems can be divided into two categories. One category attempts to incorporate the known microscopic anatomy and physiology of the system. To study these models, numerical simulations are required. They involve a large number of parameters whose precise values are unknown, and the systematic exploration of the model parameter space is impractical. Furthermore, due to their complexity, it is hard to construct a qualitative interpretation of their behavior. The second category consists of simplified models that retain only some gross features of the modeled system, thereby allowing for systematic analytical and numerical investigations. These models have been extremely useful in extracting qualitative principles underlying such functions as memory, visual processing, and motor control (Amit, 1989; Churchland & Sejnowski, 1992; Georgopoulos & Lukashin, 1993; Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995; Seung, 1996; Zhang, 1996; Salinas & Abbott, 1996; Hansel & Sompolinsky, 1998; Rolls & Treves, 1998). Simplified models of large neuronal systems are often cast in the form of rate models, in which the state of the network units is characterized by smooth rate variables (Wilson & Cowan, 1972; Hopfield, 1984). These variables are related to the units' synaptic inputs via a nonlinear input-output transfer function. The input is a linear sum of the presynaptic activities, whose coefficients are termed the synaptic weights of the network. Unfortunately, the use of rate models for concrete neuronal systems has been limited by the lack of a clear biophysical interpretation of the parameters appearing in these models. In particular, the relation between activity variables, input variables, and synaptic weights, on one hand, and physiologically measured quantities, on the other, is obscure.
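A generic rate model of the kind just described can be written down in a few lines. The sketch below (all weights, inputs, and constants are invented for illustration) integrates τ dr/dt = −r + g(Wr + I) with a threshold-linear transfer function g until the network settles in a fixed point:

```python
# Minimal rate-model sketch (invented parameters): units obey
# tau * dr_i/dt = -r_i + g(sum_j W_ij r_j + I_i), with threshold-linear g.
import numpy as np

def g(x):
    """Threshold-linear input-output transfer function."""
    return np.maximum(x, 0.0)

def run_rate_model(W, I, tau=10.0, dt=0.1, steps=20000):
    """Euler-integrate the rate equations from rest to a fixed point."""
    r = np.zeros(len(I))
    for _ in range(steps):
        r += (dt / tau) * (-r + g(W @ r + I))
    return r

W = np.array([[0.2, -0.5],     # unit 1: self-excitation, inhibited by unit 2
              [0.4, -0.1]])    # unit 2: excited by unit 1, self-inhibition
I = np.array([1.0, 0.5])
r_fix = run_rate_model(W, I)
print(r_fix)                   # self-consistent rates satisfying r = g(Wr + I)
```

At the fixed point the rates satisfy r = g(Wr + I), the time-independent state whose biophysical meaning the rest of the article makes precise.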
Furthermore, quite often rate models predict that the network should settle in a fixed point where the network activities, as well as synaptic inputs, are time independent. However, the biological meaning of this fixed-point state is unclear, since neither the postsynaptic currents nor the postsynaptic potentials are constant in time if the cells are active. It is thus important to inquire whether there is a systematic relation between real neuronal systems and simple rate models. Several studies have derived reduced rate models for networks of spiking neurons (Amit & Tsodyks, 1991; Abbott & Kepler, 1990; Ermentrout, 1994). In particular, it has been shown that if the synaptic time constants are long, the network dynamics can be reduced to rate equations that describe the slow dynamics of the synaptic activities (Ermentrout, 1994). However, the assumption of slow synaptic dynamics is inadequate for modeling cortical networks, where fast synapses play a dominant role. Here, we show that asynchronous states of large cortical networks governed by conductance-based dynamics can be described in terms of simple rate equations, even when the synaptic time constants are small. A simple mapping between the synaptic conductances and the synaptic weights is derived. We apply our
method to study the properties of conductance-based networks that model a hypercolumn in visual cortex. The simple reduction of conductance-based networks to rate models is restricted to asynchronous states that exist only if the networks are driven by stationary inputs. We derive a more complex rate model, which is appropriate to describe the synchronous response of large conductance-based networks to weakly nonstationary noisy synaptic inputs. Our results provide a framework for using rate models to quantitatively predict the extracellular and intracellular response properties of large cortical networks.

2 Models and Methods

2.1 Dynamics of a Single-Compartment Cell. Our starting point is the dynamic equation of a single-compartment neuron,

C dV(t)/dt = g_L (E_L − V(t)) − I_active(t) + I_app(t),    (2.1)
where V(t) is the membrane potential of the cell at time t, C is its capacitance, g_L is the leak conductance, and E_L is the reversal potential of the leak current. Besides the leak current, the cell has active ionic currents with Hodgkin-Huxley type kinetics (Hodgkin & Huxley, 1952), the total sum of which is denoted as I_active(t) in equation 2.1. An externally injected current is denoted as I_app. If I_app is constant in time and is sufficiently large, the cell will fire in a repetitive manner with a steady-state firing rate f. In general, the relation between the applied current, I, and the firing rate, f, defines a function f = F(I, g_L), called the f-I curve. The second argument, g_L, expresses the dependence of the input-output function of the neuron on the magnitude of the leak conductance. This dependence is an important factor in our work, as will become clear. The form of the function F depends on the active currents comprising I_active. In many cortical neurons, the f-I curve is approximately linear for I above threshold (Azouz, Gray, Nowak, & McCormick, 1997; Ahmed, Anderson, Douglas, Martin, & Whitteridge, 1998; Stafstrom, Schwindt, & Crill, 1984) and can be captured by the following equation,

f = β [I − I_c]_+,    (2.2)

where [x]_+ ≡ x for x > 0 and is zero otherwise; β is the gain parameter. This behavior can be modeled by a Hodgkin-Huxley type single-compartment neuron with a slow A-current (Hille, 1984). (See appendix A for the details of the model.) The parameters of the sodium and the potassium currents were chosen to yield a saddle-node bifurcation at the onset of firing. Figure 1 shows the response of this model neuron without and with the A-current. As can be seen, the A-current linearizes the f-I relationship. Our model neuron has gain value β = 35.4 cm²/(μA·sec).
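Equation 2.2 is a one-line function. In the sketch below the gain β = 35.4 cm²/(μA·sec) follows the text, while the threshold current I_c is a placeholder value (its dependence on the leak conductance is only introduced in equation 2.3):

```python
# The threshold-linear f-I curve of equation 2.2: f = beta * [I - Ic]_+.
# beta follows the text; Ic here is an illustrative placeholder.
import numpy as np

def f_of_I(I, beta=35.4, I_c=1.0):
    """Firing rate (spikes/sec) for applied current I (uA/cm^2)."""
    return beta * np.maximum(I - I_c, 0.0)   # [x]_+ rectification

rates = f_of_I(np.array([0.5, 1.0, 2.0, 3.0]))
print(rates)   # zero below threshold, then linear: 0, 0, 35.4, 70.8
```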
Figure 1: f-I curves of the single-neuron model with g_A = 0 (dashed line), g_A = 20 mS/cm², τ_A = 0 msec (dash-dotted line), and g_A = 20 mS/cm², τ_A = 20 msec (solid line). Comparison of the three curves shows that linearization of the f-I curve is due to the long time constant of the A-current. (Inset) A voltage trace of the single-neuron model with constant current injection of amplitude I = 1.6 μA/cm² for g_A = 20 mS/cm², τ_A = 20 msec. The neuron's parameter values are as defined in appendix A.
Relatively few experimental data have been published on the dependence of the firing rate of cortical cells on their leak conductance. However, experimental evidence (Connors, Malenka, & Silva, 1988; Brizzi, Hansel, Meunier, van Vreeswijk, & Zytnicki, 2001) and biophysical models (Kernell, 1968; Holt & Koch, 1997) show that increasing g_L affects the f-I curve primarily by increasing its threshold current, whereas its effect on the gain of the curve is weak. We incorporate these properties by assuming that β is independent of g_L and that the threshold current increases linearly with the leak conductance,

I_c = I_c^0 + V_c g_L.    (2.3)

The threshold gain potential V_c measures the rate of increase of the threshold current as the leak conductance, g_L, increases, and I_c^0 is the threshold current when g_L = 0. This behavior is also reproduced in our model neuron, as shown in Figure 2. Approximating the f-I curve with equations 2.2 and 2.3 yields I_c^0 = 0.63 μA/cm² and V_c = 5.5 mV. This provides a good
Figure 2: f-I curves for g_A = 20 mS/cm², τ_A = 20 msec, and different values of g_L. The curves from left to right correspond to g_L = 0.05, 0.1, 0.15, 0.2 mS/cm², respectively. (Inset) The threshold current, I_c, as a function of the leak conductance, g_L.
approximation of the f-I curve for the range f = 5–150 spikes/sec. For higher firing rates, the effect of the saturation of the rates becomes significant and needs to be incorporated into the model. We have found that this effect can be described approximately by an f-I curve of the form

f = β [I − I_c]_+ − γ [I − I_c]_+²,    (2.4)

with I_c given by equation 2.3. Fitting the firing rate of our single-neuron model to equation 2.4 yields a good fit over the range f = 5–300 spikes/sec, with the parameter values β = 39.6 cm²/(μA·sec), γ = 0.86 (cm²/μA)²/sec, α = 6.77 mV, and I_c^0 = 0.59 μA/cm². The above conductance-based model neuron is the one used in all subsequent numerical analyses.

2.2 Network Dynamics. The network dynamics of N coupled cells are given by
C dV_i/dt = g_L (E_L − V_i(t)) − I_i^active + I_i^ext + I_i^net    (i = 1, …, N),    (2.5)
where I_i^net denotes the synaptic current of the postsynaptic cell i generated by the presynaptic sources within the network. It is modeled as

I_i^net(t) = Σ_{j=1}^{N} g_ij(t) (E_j − V_i(t)),    (2.6)
where g_ij(t) is the synaptic conductance triggered by the action potentials of the presynaptic jth cell and E_j is the reversal potential of the synapse. The synaptic conductance is assumed to consist of a linear sum of contributions from each of the presynaptic action potentials. In our simulations, g_ij(t) has the form of an instantaneous jump from 0 to G_ij followed by an exponential decay with a synaptic decay time τ_ij,

dg_ij/dt = −g_ij/τ_ij + G_ij R_j(t),    t > 0,    (2.7)
where R_j(t) = Σ_{t_j} δ(t − t_j) is the instantaneous firing rate of the presynaptic jth neuron and t_j are the times of occurrence of its spikes. In general, we define G_ij as the peak of g_ij(t) and the synaptic time constant as τ_ij = ∫_0^∞ g_ij(t) dt / G_ij. Synaptic currents from presynaptic sources outside the network are denoted by I_i^ext. For simplicity, we assume that these sources are all excitatory with the same reversal potential, E^inp, peak conductance, G^inp, and synaptic time constant, τ^inp. We assume that these sources fire Poisson trains of action potentials asynchronously, which generate synaptic conductances with dynamics similar to equation 2.7. Under these assumptions, their summed effects on the postsynaptic neuron i can be represented by a single effective excitatory synapse with peak conductance G^inp and time constant τ^inp, activated by a single Poisson train of spikes with average rate f_i^inp, which is the summed rate of all the external sources to the ith neuron. Thus, the external current on neuron i can be written as

I_i^ext(t) = g_i^inp(t) (E^inp − V_i(t)).    (2.8)
The quantity g_i^inp(t) satisfies an equation similar to equation 2.7,

dg_i^inp/dt = −g_i^inp/τ^inp + G^inp R_i^inp(t),    t > 0,    (2.9)

where R_i^inp is a Poisson spike train with mean rate f_i^inp. The value of f_i^inp is specified below for each of the concrete models that we study. Due to the Poisson statistics of R_i^inp, the external conductance, g_i^inp(t), is a random variable with time average and variance G^inp f_i^inp τ^inp and (G^inp)² f_i^inp τ^inp / 2,
respectively. Note that f_i^inp τ^inp is the mean number of input spikes arriving within a single synaptic integration time. The fluctuations in R_i^inp constitute the noise in the external input to the network. We define the coefficient of variation of this noise as the ratio of its standard deviation and its mean, that is, Δ_i = 1/√(2 f_i^inp τ^inp). As expected, the coefficient of variation is proportional to the inverse square root of the total number of spikes arriving within a single synaptic integration time. In particular, one can increase the noise level of the input by decreasing f_i^inp and increasing G^inp while keeping their product constant. Time-dependent inputs are modeled by a Poisson process with a rate that is modulated in time. In most of the examples studied in this article, we assume a sinusoidal modulation, so that the instantaneous firing rate of the external input to the neuron is
inp
C f1 cos.!t/;
(2.10)
where !=2¼ is the frequency of the modulation. 2.3 Model of a Hypercolumn in Primary Visual Cortex. We model a hypercolumn in visual cortex by a network consisting of Ne excitatory neurons and Nin inhibitory neurons that are selective to the orientation of the visual stimulus in their common receptive eld. We impose a ring architecture on the network. The cortical neurons are parameterized by an angle µ, which denotes their preferred orientation (PO). The ith excitatory neuron is parameterized by µi D ¡ ¼2 C i N¼e , and similarly for the inhibitory ones. The peak conductances of the cortical recurrent excitatory and inhibitory synapses decay exponentially with the distance between the interacting neurons, measured by the dissimilarity in their preferred orientations, that is, G® .µ ¡ µ 0 / D
GN ® exp.¡jµ ¡ µ 0 j=¸® /; ¸®
(2.11)
where µ ¡ µ 0 is the difference between the POs of the pre- and postsynaptic neurons. The index ® takes the values e and in. The quantity Ge (resp. Gin ) denotes an excitatory (inhibitory) interaction (targeting either excitatory or inhibitory neurons) with a space constant ¸e (resp. ¸in ). Note that the excitatory as well as the inhibitory interactions are the same for excitatory and inhibitory targets. Additional excitatory neurons provide external input to the network, representing the lateral geniculate nucleus (LGN) input to cortex, each with peak conductance Ginp D GN LGN . The total mean ring rate of the afferent inputs to a neuron with PO µ is f inp D fLGN .µ ¡ µ0 /, where fLGN .µ ¡ µ0 / D fNLGN C[.1 ¡ ²/ C ² cos.2.µ ¡ µ0 //]:
(2.12)
1816
O. Shriki, D. Hansel, and H. Sompolinsky
The parameter C is the stimulus contrast, and the angle µ0 denotes the orientation of the stimulus. The parameter ² measures the degree of tuning of the LGN input. If ² D 0, the LGN input is untuned: all the neurons receive the same input from the LGN, regardless of their PO and the orientation of the stimulus. If ² D 0:5, the LGN input vanishes for neurons with a PO that is orthogonal to the stimulus orientation. The maximal LGN rate fNLGN is the total ring rate of the afferents of a stimulus with C D 1 and µ D µ0 . The single-neuron dynamics is given by equation 2.1 and is assumed to be the same for both the excitatory and inhibitory populations. 2.4 Numerical Integration and Analysis of Spike Responses. In the numerical simulations of the conductance-based networks, the nonlinear differential equations of the neuronal dynamics were integrated using a fourth-order Runge-Kutta method (Press, Flannery, Teukolsky, & Vetterling, 1988) with a xed time step 1t. Most of the simulations were performed with 1t D 0:05 msec. In order to check the stability and precision of the results, some simulations were also performed with 1t D 0:025 msec. A spike event is counted each time the voltage of a neuron crosses a xed threshold value Vth D 0. We measure the instantaneous ring rate of a single neuron dened as the number of spikes in time bins of size 1t D 0:05 msec, averaged over different realizations of the external input noise. We then compute the time average of this response and, in the case of periodically modulated input, the amplitude and phase of its principal temporal harmonic. We also measure the population ring rate, dened as the number of spikes of single neurons in each time bin divided by 1t, averaged over a select population of neurons in the network as well as over the input noise. As in the case of a single neuron, the network response is characterized by the time average and the principal harmonic of the population rate. 
3 Rate Equations for General Asynchronous Neuronal Networks

The dynamic states of a large network characterized by the above equations can be classified as synchronous or asynchronous (Ginzburg & Sompolinsky, 1994; Hansel & Sompolinsky, 1996); the two classes differ in the strength of the correlations between the firing times of different neurons. When the external currents, I_i^ext, are constant in time (except for a possible noisy component that is spatially uncorrelated), the network may exhibit an asynchronous state, in which the activities of the neurons are only weakly correlated. Formally, in an asynchronous state, the correlation coefficients between the voltages of most neuronal pairs approach zero in the limit where the network size, N, grows to infinity. Analyzing the asynchronous state of a highly connected network is relatively simple. Because each postsynaptic cell is affected by many uncorrelated synaptic conductances (within a window of its integration time), these
Rate Models for Conductance-Based Cortical Neuronal Networks
conductances can be taken to be time independent. In other words, in the asynchronous state, the spatial summation of the synaptic conductances is equivalent to a time average. Hence, the total synaptic current of each cell can be written as

I_i^net(t) + I_i^ext(t) = Σ_{j=1}^{N} G_ij τ_ij f_j (E_j − V_i(t)) + G^inp τ^inp f_i^inp (E^inp − V_i(t))    (3.1)
(see equations 2.6 and 2.8). This current can be decomposed into two components:

I_i^net + I_i^ext = I_i^app + ΔI_i^L .    (3.2)
I_i^app is the component of the synaptic current that has the form of a constant, voltage-independent applied current,

I_i^app = Σ_{j=1}^{N} G_ij τ_ij f_j (E_j − E_L) + G^inp τ^inp f_i^inp (E^inp − E_L).    (3.3)
The second component of the synaptic current, ΔI_i^L, embodies the voltage dependence of the synaptic current and has the form of a leak current,

ΔI_i^L = g_i^syn (E_L − V_i(t)),    (3.4)

where

g_i^syn = g_i^net + g_i^inp    (3.5)

is the mean total synaptic conductance of the ith cell. The quantities g_i^net and g_i^inp are given by

g_i^net = Σ_{j=1}^{N} G_ij τ_ij f_j ,    (3.6)

g_i^inp = G^inp τ^inp f_i^inp .    (3.7)
Thus, the discharge of the postsynaptic cell in the asynchronous network can be described by the f-I curve of a single cell with an applied current, equation 3.3, and a "leak" conductance equal to g_L + g_i^syn, equation 3.5. Incorporating these contributions in equation 2.2, and taking into account the dependence of the threshold current on the total passive conductance as
given by equation 2.3, yields the following equations for the firing rates of the cells:

f_i = β[ I_i^app − V_c (g_L + g_i^syn) − I_c^0 ]_+
    = β[ Σ_{j=1}^{N} J_ij f_j + J^inp f_i^inp − T ]_+ ,   (i = 1, …, N),    (3.8)

where

J_ij = G_ij τ_ij (E_j − E_L − V_c)    (3.9)

and

J^inp = G^inp τ^inp (E^inp − E_L − V_c).    (3.10)
The parameter T is the threshold current of the isolated cells, T = I_c(g_L). Note that the subtractive term V_c in equations 3.9 and 3.10 results from the increase in the current threshold of the cell due to the synaptic conductance (see equation 2.3). Equation 3.8 has the form of the self-consistent rate equations that describe the input-output relations for the neurons in a recurrent network in a fixed-point state. This theory provides a precise mapping between the biophysical parameters of the neurons and synapses and the parameters appearing in the fixed-point rate equations. The output state variables, given by the right-hand side of equation 3.8, are simply the stationary firing rates of the neurons. The input variables, Σ_{j=1}^{N} J_ij f_j + J^inp f_i^inp, are the mean synaptic currents at a fixed potential given by E_L + V_c, where V_c is the threshold-gain potential, equation 2.3. Equation 3.9 provides a precise interpretation of the synaptic efficacies J_ij in terms of the biophysical parameters of the cells and the synaptic conductances. We note in particular that our theory yields a precise criterion for the sign of the efficacy of a synaptic connection. According to equation 3.9, synapses with positive efficacy obey the inequality

E_j > E_L + V_c .    (3.11)

Conversely, synapses with negative efficacy obey E_j < E_L + V_c. The potential E_L + V_c is close but not identical to the threshold potential of the cell. Hence, this criterion, which takes into account the dynamics of firing rates in the network, does not match exactly the biophysical definition of excitatory and inhibitory synapses. The above results allow the prediction not only of the stationary rates of the neurons but also of their mean synaptic conductances due to the input from
within the network and to the external input. In fact, combining equations 3.6 and 3.7 with equations 3.9 and 3.10 yields

g_i^net = Σ_{j=1}^{N} J_ij f_j / (E_j − E_L − V_c)    (3.12)

and

g_i^inp = J^inp f_i^inp / (E^inp − E_L − V_c).    (3.13)
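As a numerical illustration of the mapping, the sketch below iterates the self-consistent rate equations 3.8 for a small random excitatory network and then recovers the mean synaptic conductances from equations 3.12 and 3.13. All parameter values (gain, potentials, efficacies, afferent rates) are illustrative stand-ins, not the paper's calibrated values.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 50
beta, T = 35.0, 0.5                  # f-I gain and threshold current (illustrative)
E_L, V_c = -65.0, 10.0               # leak reversal and threshold-gain potential (assumed)
E_e = E_inp = 0.0                    # excitatory reversal potentials [mV]

# Weak random efficacies: beta * (row sums of J) stays well below 1,
# so the fixed-point iteration of equation 3.8 is a contraction.
J = rng.uniform(0.0, 0.3 / (beta * N), size=(N, N))
J_inp = 0.001
f_inp = rng.uniform(1000.0, 2000.0, size=N)   # afferent rates [spikes/sec]

f = np.zeros(N)
for _ in range(500):                 # fixed-point iteration of equation 3.8
    f = beta * np.maximum(J @ f + J_inp * f_inp - T, 0.0)

# Equations 3.12 and 3.13 (a single excitatory reversal potential E_e):
g_net = J @ f / (E_e - E_L - V_c)
g_inp = J_inp * f_inp / (E_inp - E_L - V_c)
print(f.mean(), g_net.mean(), g_inp.mean())
```

The same rates f could instead be obtained by solving the linear system on the suprathreshold neurons; the iteration is used here only because it mirrors equation 3.8 directly.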
In the following sections, we apply this theory to concrete network architectures.

4 Response of an Excitatory Population to a Time-Independent Input

We first test the mapping equations, equations 3.8 through 3.10, in the case of a large, homogeneous network of N identical excitatory neurons. The network dynamics are given by equations 2.5 through 2.7 with the single-neuron model of appendix A. Each neuron is connected to all other neurons with a peak synaptic conductance, G, which is the same for all connections in the network. In addition, each neuron receives a single external synaptic input with peak conductance G^inp, which is activated by a Poisson process with a fixed uniform rate, f^inp. The external synaptic inputs to different cells are uncorrelated. The dynamic response of all synaptic conductances is given by equation 2.7 with a single synaptic time constant, τ_e = 5 msec. Applying equations 3.8 through 3.10 to this simple architecture yields the following equation for the mean firing rate, f, of the neurons in the network:

f = β[ J^inp f^inp + J f − I_c ]_+ ,    (4.1)

where

J^inp = G^inp τ_e (E_e − E_L − V_c),
J = N G τ_e (E_e − E_L − V_c).    (4.2)

The solution for the firing rate is

f = (β / (1 − βJ)) [ J^inp f^inp − I_c ]_+ .    (4.3)
The mean firing rate, f, of the neurons in the network, as predicted from this equation, is displayed in Figure 3 (dashed line) against the value of the
Figure 3: Firing rate versus excitatory synaptic strength in a large network of fully connected excitatory neurons. The rate of the external input is f^inp = 1570 spikes/sec, and the synaptic time constant is τ_e = 5 msec. Dashed line: Analytical results from equation 4.3. Solid line: Analytical results when a quadratic fit is used for the f-I curve, equation 2.4. Circles: Results from simulations of the conductance-based model with N = 1000. (Inset) Firing rate versus external input for strong excitatory feedback (analytical results), showing bistability for NG = 0.49 mS/cm². The network can be either quiescent or in a stable sustained active state in a range of external inputs, J^inp f^inp, less than the threshold current, I_c.
peak conductance of the total excitatory feedback, NG. When the synaptic conductance increases such that J reaches the critical value J = 1/β (corresponding to NG = 0.095 mS/cm²), the firing rate, f, diverges. However, it is expected that when the firing rate reaches high values, the weak nonlinearity of the f-I curve, given by the quadratic correction, equation 2.4, will need to be taken into account. Indeed, solving the self-consistent equation for f with the quadratic term (solid line in Figure 3) yields finite values for f, even when J is larger than 1/β. In addition, the quadratic nonlinearity predicts that in the high-J regime, the network should develop bistability. For a range of subthreshold external inputs, the network can be in either a stable quiescent state or a stable active state with high firing rates, as shown in the inset of Figure 3. These predictions are in full quantitative agreement with the numerical simulations of the conductance-based excitatory network, as shown in Figure 3.
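The bistability mechanism can be illustrated numerically. Equation 2.4 is not reproduced in this excerpt, so the saturating f-I curve below, g(I) = βI − γI² (rectified), is a generic stand-in with a hypothetical coefficient γ, and all numbers are illustrative. The point is only the mechanism: with βJ > 1 and a subthreshold external drive, a quiescent and a high-rate fixed point of the self-consistent equation coexist.

```python
# Generic quadratic stand-in for the f-I correction of equation 2.4;
# gamma is hypothetical and all parameter values are illustrative.
beta, gamma = 1.0, 0.05
J = 2.0            # strong recurrent feedback: beta*J > 1
I_e = -0.1         # net external drive J_inp*f_inp - I_c, subthreshold

def g(I):
    """Rectified f-I curve with a weak quadratic (saturating) correction."""
    return max(beta * I - gamma * I * I, 0.0) if I > 0 else 0.0

def settle(f0, n=200):
    """Iterate the self-consistent equation f = g(J*f + I_e) from f0."""
    f = f0
    for _ in range(n):
        f = g(J * f + I_e)
    return f

low, high = settle(0.0), settle(3.0)
print(low, high)   # two stable states coexist for the same subthreshold input
```

With a purely linear f-I curve the high branch would diverge for βJ > 1; the quadratic term caps it at a finite rate, which is the content of the solid line and inset of Figure 3.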
5 A Model of a Hypercolumn in Primary Visual Cortex

In this section we show how the correspondence between conductance-based models and rate models can be applied to investigate a model of a hypercolumn in V1. When applying the general rate equations, equations 3.8 through 3.10, to the hypercolumn model, we first note that in the asynchronous state, the firing-rate profile of the excitatory population is the same as that of the inhibitory population. This is because we assume that the interaction profiles depend solely on the identity (excitatory or inhibitory) of the presynaptic neurons and that the single-neuron properties of the two types of cells are the same. We denote the rate of the (e or in) neurons with PO θ for a stimulus orientation θ_0 by f(θ − θ_0). These rates obey

f(θ − θ_0) = β[ ∫_{−π/2}^{+π/2} (dθ′/π) J(θ − θ′) f(θ′ − θ_0) + J^LGN f^LGN(θ − θ_0) − T ]_+ ,    (5.1)

where we have replaced the sum over the synaptic recurrent inputs by an integral over the variable θ′, a valid approximation for a large network. The recurrent interaction profile, J(θ), combines the effect of the excitatory and the inhibitory cortical inputs and has the form

J(θ − θ′) = Σ_{α=e,in} (J_α/λ_α) exp(−|θ − θ′|/λ_α),    (5.2)
where J_α = N_α Ḡ_α τ_α (E_α − E_L − V_c), α = e, in, and J^LGN = Ḡ_LGN τ_e (E_e − E_L − V_c); τ_α denotes the excitatory and inhibitory synaptic time constants, and N_e and N_in are the numbers of neurons in the excitatory and inhibitory populations, respectively. Equations 5.1 and 5.2 correspond to the rate equations 3.8 through 3.10 with the synaptic conductances and input firing rate given by equations 2.11 and 2.12. In appendix B, we outline the analytical solution of equations 5.1 and 5.2, which allows us to compute the neuronal activity and the synaptic conductances as functions of the model parameters. We used the analytical solution of these rate equations to explore how the spatial pattern of activity of the hypercolumn depends on the parameters of the recurrent interactions (J_e, J_in, λ_e, and λ_in) and on the stimulus properties.

5.1 Emergence of a Ring Attractor. We first consider the case of an untuned LGN input, ε = 0. In this case, equation 5.1 has a trivial solution in which all the neurons respond at the same firing rate. As shown in
appendix B, this solution is unstable when the spatial modulation of the effective interaction, equation 5.2, is sufficiently large. The condition for the onset of this instability is given by equation B.4. When the homogeneous solution is unstable, the system settles into a heterogeneous solution of the form f(θ) = M(θ − ψ). The angle ψ is arbitrary and reflects the spatial invariance of the system. The manifold of stable states that emerges in this system and breaks its spatial symmetry is known as a ring attractor. The function M, which represents the shape of the activity profile in each of these states, can be computed analytically, as described in appendix B. Depending on which mode is unstable, the heterogeneous profile of activity that emerges consists of a single "hill" of activity or of several such hills. In appendix B, we describe how the function M can be computed in the case of a state with a single hill. As an example, we consider the case λ_e = 11.5°, λ_in = 43°, and βJ_in = 0.73. For this choice of parameters, equation B.4 predicts that the state with a homogeneous response is stable for βJ_e < 0.87 and unstable for βJ_e > 0.87. At βJ_e = 0.87, the unstable mode corresponds to the first Fourier mode. Therefore, the instability at this point should give rise to a heterogeneous response with a unimodal profile of activity (a single hill). This is confirmed by the numerical solution of the rate equations 5.1 and 5.2. Using the mapping prescriptions, equation 3.9, these results translate into the prediction that if λ_e = 11.5°, λ_in = 43°, and N_in Ḡ_in = 0.333 mS/cm², the homogeneous state of the conductance-based model is stable for N_e Ḡ_e < 0.138 mS/cm² but unstable for N_e Ḡ_e > 0.138 mS/cm². We tested whether these predictions coincide with the actual behavior of the conductance-based model in numerical simulations.
Figures 4A and 4B show raster plots of the network for N_e Ḡ_e = 0.133 mS/cm² and N_e Ḡ_e = 0.143 mS/cm², respectively. For N_e Ḡ_e = 0.133 mS/cm², neurons in all the columns responded in a similar way. This corresponds to the homogeneous state of the rate model. Moreover, in this simulation, the average population firing rate was f = 18 spikes/sec, in excellent agreement with the value predicted from the rate model for the corresponding parameters (f = 18.05 spikes/sec). In contrast, for N_e Ḡ_e = 0.143 mS/cm², the network does not respond homogeneously to the stimulus. Instead, a unimodal hill of activity appears. This is congruent with the prediction of the rate model. Since the external input is homogeneous, the location of the peak is arbitrary. In the numerical simulations, the activity profile slowly drifts because of the noise in the LGN input. The stability analysis of the homogeneous state of the rate model shows that if βJ_in > 0.965, which corresponds to N_in Ḡ_in = 0.443 mS/cm², the mode n = 2 is the first to become unstable as J_e increases. This suggests that in this case, the profile of activity that emerges through the instability is bimodal. Numerical simulations of the full conductance-based model were found to be in excellent agreement with this expectation (see Figure 5).
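The symmetry breaking described above can be reproduced in a reduced setting. The sketch below replaces the exponential interaction profile of equation 5.2 by its first nontrivial Fourier component, J(θ) = J_0 + J_2 cos 2θ, and relaxes a discretized version of equation 5.1 from a weakly perturbed homogeneous state; all parameter values are illustrative, not the paper's.

```python
import numpy as np

# Reduced ring-rate model: interaction truncated to J0 + J2*cos(2*theta),
# a simplification of equation 5.2. Parameters are illustrative.
N = 180
theta = np.linspace(-np.pi / 2, np.pi / 2, N, endpoint=False)  # POs [rad]
J0, J2 = -0.5, 4.0          # beta*J2/2 > 1: the first Fourier mode is unstable
beta, I_ext, T = 1.0, 1.0, 0.2
tau, dt = 5.0, 0.1          # relaxation time and step (arbitrary units)

rng = np.random.default_rng(0)
f = np.full(N, 0.5) + 1e-3 * rng.standard_normal(N)   # near-homogeneous start
for _ in range(4000):
    # (1/pi) * integral of J(theta - theta') f(theta') over the half-circle,
    # evaluated via the two Fourier overlaps of f with cos/sin(2*theta)
    rec = (J0 * f.mean()
           + J2 * (np.cos(2 * theta) * np.mean(f * np.cos(2 * theta))
                   + np.sin(2 * theta) * np.mean(f * np.sin(2 * theta))))
    f += dt / tau * (-f + beta * np.maximum(rec + I_ext - T, 0.0))

# The homogeneous state has given way to a single hill of activity at an
# arbitrary position (the ring-attractor degeneracy).
print(f.max(), f.min())
```

Because the interaction here keeps only one Fourier mode, only a unimodal hill can emerge; reproducing the bimodal (n = 2) instability of Figure 5 would require retaining the corresponding higher mode of equation 5.2.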
Figure 4: Symmetry breaking leading to a unimodal activity profile. A network with N_e = N_in = 1600 was simulated for two values of the maximal conductance of the excitatory synapses. The external input to the network is homogeneous (ε = 0). The input rate is f̄_LGN = 2700 Hz. The maximal conductance of the input synapses is G^inp = 0.0025 mS/cm². The parameters of the interactions are λ_e = 11.5°, λ_in = 43°, and N_in Ḡ_in = 0.333 mS/cm². The time constants of the synapses are τ_e = τ_in = 3 msec. The analytical solution of the rate model equations predicts that for N_e Ḡ_e < 0.138 mS/cm² the response of the network to the input is homogeneous and that for N_e Ḡ_e > 0.138 mS/cm² it is unimodal. (A) Raster plot of the network for N_e Ḡ_e = 0.133 mS/cm², showing that the response is homogeneous. (B) N_e Ḡ_e = 0.143 mS/cm², showing that the response is a unimodal hill of activity. The noise that is present in the system induces a slow random wandering of the hill of activity.
5.2 Tuning of Firing Rates and Synaptic Conductances. We now consider the case of a tuned LGN input, which corresponds to ε > 0. Equation 5.1 shows that, in general, the response of a neuron with PO θ depends on the stimulus orientation θ_0 through the difference θ − θ_0, namely, f(θ, θ_0) = M(|θ − θ_0|). For fixed θ_0, when θ varies, f(θ, θ_0) is the profile of activity of the network in response to a stimulus of orientation θ_0. Conversely, when θ is fixed and θ_0 varies, f(θ, θ_0) is the tuning curve of the neuron with PO θ. Therefore, the function M determines the tuning curves of the neurons in the model. We now compare the tuning curves of the neurons computed in the framework of the rate model with those in the corresponding simulations
Figure 5: Symmetry breaking leading to a bimodal activity profile. The size of the simulated network is N_e = N_in = 1600. The input rate is f̄_LGN = 2700 Hz. The maximal conductance of the input synapses is G^inp = 0.0025 mS/cm². The parameters of the interactions are λ_e = 11.5°, λ_in = 43°, and N_in Ḡ_in = 1.33 mS/cm². The time constants of the synapses are τ_e = τ_in = 3 msec. The analytical solution of the rate model equations predicts that for N_e Ḡ_e < 0.196 mS/cm² the response of the network to the input is homogeneous and that for N_e Ḡ_e > 0.196 mS/cm² it is bimodal. (A) Raster plot of the network for N_e Ḡ_e = 0.19 mS/cm², showing that the response is homogeneous. The average firing rate in the network in the simulation is f = 3.2 spikes/sec, which is in good agreement with the prediction of the rate model (f = 2.9 spikes/sec). (B) N_e Ḡ_e = 0.138 mS/cm², showing that the response is bimodal. The noise that is present in the system induces a slow random wandering of the pattern of activity.
of the conductance-based network. Specifically, we take ε = 0.175 and f̄_LGN = 3400 Hz. For these values, in the absence of recurrent interactions, the response of the neurons to the input exhibits broad tuning, as shown in Figure 6 (dashed line). The recurrent excitation can sharpen the tuning curves and also amplify the neuronal response, as shown in Figure 6. In this figure, we plot the neuronal tuning curve when the parameters of the interactions are λ_e = 6.8°, λ_in = 43°, N_e Ḡ_e = 0.125 mS/cm², and N_in Ḡ_in = 0.333 mS/cm². The solid line was computed from the solution of the mean-field equations of the corresponding rate model. This solution indicates that the tuning width is θ_C = 30° and that the maximal firing rate is f_max = 75.5 spikes/sec. The circles are from the numerical simulations of the
Figure 6: Tuning curves of the LGN input (dashed line) oriented at θ_0 = 0° and of the spike response of a neuron with preferred orientation θ = 0° (solid line and circles). The LGN input parameters are ε = 0.175, g^inp = 0.0025 mS/cm², and f̄_LGN = 3400 Hz. The interaction parameters are λ_e = 6.3°, λ_in = 43°, N_e Ḡ_e = 0.125 mS/cm², and N_in Ḡ_in = 0.467 mS/cm². The time constants of the synapses are τ_e = τ_in = 3 msec. The circles are from numerical simulations with N_e = N_in = 1600 neurons. The response of the neuron was averaged over 1 sec of simulation. The solid line was obtained by solving the rate model with the corresponding parameters.
conductance-based model. The agreement with the analytical predictions of the rate model is very good. The input conductances of neurons in V1 change upon presentation of a visual stimulus. Experimental results (Borg-Graham, Monier, & Fregnac, 1998; Carandini, Anderson, & Ferster, 2000) indicate that at large stimulus contrasts, these changes have typical values of 60% when the stimulus is presented at the null orientation, whereas they can be as large as 250 to 300% at the optimal orientation. We applied our approach to study the dependence of the total change in input conductance on the space constants of the interactions for a given LGN input. We assume that the cortical circuitry sharpens and amplifies the response such that the tuning curves have a given width, θ_C, and amplitude, f_max. We compute the interaction parameters that achieve tuning curves with a given width, θ_C, and a given maximal discharge rate, f_max. To be specific, we take θ_C = 30.5° and f_max = 70 spikes/sec. We also fixed the space constant of the
Figure 7: Change in the input conductance for stimuli at the iso orientation (solid line) and the cross orientation (dashed line). The lines were computed as explained in the text.
inhibitory interaction, λ_in = 43°, and varied the value of λ_e. For each value of λ_e, we evaluated the values of J_e and J_in that yield the desired values of θ_C and f_max. Subsequently, for each set of parameters, we computed the changes in the input conductance of the neurons, relative to the leak conductance, for a stimulus presented at the iso and cross orientations. These changes, denoted by Δg_iso and Δg_cross, respectively, are increasing functions of λ_e, as shown in Figure 7. In fact, it can be shown analytically from the mean-field equations of the rate model that the conductance changes diverge as λ_e → λ_in. This is because in that limit, the net interaction is purely excitatory, with an amplitude that is above the symmetry-breaking instability. The rate model also allows us to estimate the separate contributions of the LGN input, the recurrent excitation, and the cortical inhibition to the change in total input conductance induced by the LGN input. An example of the tuning curves of these contributions is shown in Figure 8, where they are compared with the results from the simulations of the full model. These results indicate that for the chosen parameters, most of the input conductance contributed by the recurrent interactions comes from the cortical inhibition. This is despite the fact that the spike discharge rate is greatly amplified by the cortical excitation.
Figure 8: Tuning curves of conductances. Solid line: Total change. Dotted line: Contribution to the total change from the LGN input. Dashed line: Contribution to the total change from the recurrent excitatory interactions. Dash-dotted line: Contribution to the total change from the recurrent inhibitory interactions. These lines were obtained from the solution of the mean-field equations of the rate model. The squares, circles, triangles, and diamonds are from numerical simulations. The parameters are as in Figure 6.
6 Rate Response of a Single Neuron to a Time-Dependent Input

The analysis of the previous sections focused on situations in which the firing rates of the neurons were approximately constant in time. We now turn to the question of firing-rate dynamics, namely, how to describe the neuronal firing response in the general situation in which the firing rates are time dependent. We first study the response of a single neuron to a noisy sinusoidal input, equation 2.10. The firing rate of the neuron (averaged over the noise) can be expanded in a Fourier series:

f(t) = f_0 + f_1 cos(ωt + φ) + ··· ,    (6.1)

where f_0 is the mean firing rate and f_1 and φ are the amplitude and phase of the first harmonic (the Fourier component at the stimulation frequency). Here, we consider only cases in which the modulation of the external input is not overly large, so that there is no rectification of the firing rate by the threshold.
Under these conditions, our simulations show that harmonics of order higher than 1 are negligible (results not shown). Therefore, the response of the neuron can be characterized by the average firing rate, f_0, and the modulation, f_1, with f_1 < f_0. Our simulations also show that the mean response, f_0, depends only weakly on the modulation frequency of the stimulus or on the stimulus amplitude and can be well described by the steady-state frequency-current relation. This is shown in Figure 9 (left panels), where the predictions from the rate model (solid horizontal lines) are compared with the mean output rates in the numerical simulations of the conductance-based model (open circles). Figure 9 also shows the amplitude of the first harmonic of the response and its phase shift, φ, as a function of the modulation frequency ν ≡ ω/2π (filled circles). It should be noted that the raw response of the neuron reflects two filtering processes of the external input rate: the low-pass filtering induced by the synaptic dynamics and the filtering induced by the intrinsic dynamics of the neuron and its spiking mechanism. To better elucidate the transfer properties of the neuron, we remove the effect of the synaptic low-pass filtering on the amplitude and phase of the response. (This corresponds to multiplying f_1 by √(1 + ω²(τ^inp)²) and subtracting tan⁻¹(−ωτ^inp) from the phase.) The results for two values of the mean output rate, f_0 ≈ 30 spikes/sec and f_0 ≈ 60 spikes/sec, and for two values of the input noise coefficient of variation, Δ = 0.18 (f_0^inp = 1125 Hz) and Δ = 0.3 (f_0^inp = 3125 Hz), are presented. Clearly, both the amplitude and the phase of the response depend on the modulation frequency. Of particular interest is the fact that the amplitude exhibits a resonant behavior for modulation frequencies close to the mean firing rate of the neuron, f_0. The main effect of increasing the coefficient of variation of the noise is to broaden the resonance peak.
As in our analysis of the steady-state properties, the external synaptic input can be decomposed here into a current term and a conductance term, which are now time dependent. The simplest dynamic model would be to assume that the same f-I relation that was found under steady-state conditions, equations 2.2 and 2.3, holds when the applied current and the passive conductance are time dependent. In our case, this would take the form f(t) = β[I(t) − (I_c^0 + V_c g(t))]_+. This model predicts that the response amplitude does not depend on the modulation frequency and that the phase shift is always zero, in contrast to the behavior of the conductance-based neuron (see Figure 9). To account for the dependence on the modulation frequency, we extend the model by assuming that it has the form

f(t) = β[ I^filt(t) − (I_c^0 + V_c g^filt(t)) ]_+ ,    (6.2)

where I^filt(t) and g^filt(t) are filtered versions of the current and conductance terms, respectively. For simplicity, we use the same filter for both current and conductance. A first-order linear filter can be either a high-pass or a low-pass
Figure 9: Mean, amplitude, and phase of the single-neuron rate response as a function of the modulation frequency. The left panels show the mean output rate of the neuron (hollow circles) and the amplitude of the response (filled circles). The right panels show the phase of the response. The solid curves are the predictions of the dynamic rate model (see the text). The external input was designed to produce a mean output rate of about 30 spikes/sec in A and B and about 60 spikes/sec in C and D. A and C show the responses to inputs with a small noise coefficient of variation, Δ = 1/√(2 f_0^inp τ^inp) = 0.18, while B and D show the responses to inputs with a high noise level, Δ = 0.3.
filter. Since the dependence of the response amplitude on the modulation frequency has a bandpass character, a first-order linear filter is not suitable. We therefore assume a second-order linear filter. The filter for the current is described by

(1/ω_0²) d²I^filt/dt² + (1/(ω_0 Q)) dI^filt/dt + I^filt = I + (a/ω_0) dI/dt ,    (6.3)
Figure 10: Resonance frequency (ω_0/2π) as a function of the mean output rate. For each value of the mean output rate, optimal values of ω_0 were averaged over four noise levels: Δ = 0.15, 0.18, 0.22, 0.3. The range of modulation frequencies for the fit was 1–100 Hz. The optimal linear fit (solid line) has a slope of 1 and an offset of 13.1.
and a similar equation (with the same parameters) defines the conductance filter. This is the equation of a damped harmonic oscillator with a resonance frequency ω_0/2π and quality factor Q (Q is defined as ω_0 divided by the width at half height of the resonance curve). Note that the driving term is a linear combination of the driving current input and its derivative. We investigated the behavior of the optimal filter parameters over a range of mean output rates, 10–70 spikes/sec, and a range of input noise levels, Δ = 0.15–0.3. For lower noise levels, the response profile contains additional resonant peaks (both subharmonics and harmonics) that cannot be accounted for by the linear filter of equation 6.3. For each mean output rate and input noise level, we ran a set of simulations with modulation frequencies ranging from 1 Hz to 100 Hz and then numerically found the set of filter parameters that gave the best fit for both the amplitude and the phase of the response. The variation of the optimal values of Q and a was small over the range of input parameters we considered, with mean values Q = 0.85 and a = 1.9. The resonance frequency ω_0/2π depends only weakly on the noise level but increases linearly with the mean output rate of the neuron, with a slope of 1: ω_0/2π = f_0 + 13.1, as shown in Figure 10. The solid curves in Figure 9 show the predictions of equation 6.3 for the amplitude and phase
Figure 11: Single-neuron response to a broadband signal. The thick curve shows the rate of the conductance-based neuron, and the thin curve shows the prediction of the rate model. The input consisted of a superposition of 10 cosine functions with random frequencies, amplitudes, and phases (see the text for specifics). For clarity, only the fluctuations around the mean firing rate (≈ 30 spikes/sec) are shown.
of the rate response using the mean values of Q and a and the linear fit for ω_0 mentioned above. The results show that this model is a reasonable approximation of the response modulation of the conductance-based neuron. So far, we have dealt with the firing-rate response of a single neuron to an input composed of a single sinusoid. We also tested the response of the neuron to a general broadband stimulus. For this, we used a Poissonian input characterized by a mean rate and a time-varying envelope that consisted of a superposition of 10 cosine functions with random frequencies, amplitudes, and phases. The mean input rate was chosen to obtain a mean output firing rate of ≈ 30 spikes/sec. The modulation frequencies were drawn from a uniform distribution between 0 and 60 Hz and the phases from a uniform distribution between 0 and 360°. The relative amplitudes were drawn from a uniform distribution between 0 and 1 and, to avoid rectification, were normalized so that their sum is 0.8. The results are presented in Figure 11. They show that the response to a broadband stimulus can be predicted from the response to each of its Fourier components. This confirms the validity of our linearity assumption, equation 6.2.
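The fitted filter can also be read in the frequency domain. For equation 6.3, a sinusoidal drive at modulation frequency ν is transferred with the complex gain H(ν) below. The sketch uses the mean fitted values Q = 0.85 and a = 1.9 quoted in the text and, for illustration, f_0 = 30 spikes/sec (so that ω_0/2π = f_0 + 13.1 via the linear fit of Figure 10); it is a frequency-domain reading of the filter, not the paper's fitting code.

```python
import numpy as np

# Frequency response of the second-order filter, equation 6.3:
#   H(w) = (1 + i*a*w/w0) / (1 - (w/w0)**2 + i*w/(w0*Q))
Q, a = 0.85, 1.9                  # mean fitted values quoted in the text
f0 = 30.0                         # illustrative mean output rate [spikes/sec]
w0 = 2 * np.pi * (f0 + 13.1)      # resonance frequency from the Figure 10 fit

def H(nu):
    """Complex gain at modulation frequency nu [Hz]."""
    w = 2 * np.pi * nu
    return (1 + 1j * a * w / w0) / (1 - (w / w0) ** 2 + 1j * w / (w0 * Q))

nu = np.linspace(1.0, 100.0, 1000)
gain = np.abs(H(nu))
phase = np.degrees(np.angle(H(nu)))
nu_peak = nu[np.argmax(gain)]
print(nu_peak)                    # bandpass peak at a few tens of Hz
```

The DC gain is 1 (steady-state inputs pass unchanged, recovering the static f-I relation), while the derivative term a dI/dt boosts intermediate frequencies, producing the bandpass peak seen in Figure 9.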
7 Response of a Neuronal Population to a Time Periodic Input We now apply the approach of the previous section to study the dynamics of a network of interacting neurons in response to a time-dependent input. Here we consider the case of a large, homogeneous network of N fully connected neurons, which receive an external noisy oscillating input. The input to each neuron is a Poisson process with a sinusoidal envelope. In addition, the ring times of the inputs that converge on different cells in the network are uncorrelated, so that the Poissonian uctuations in the inputs to the different cells are uncorrelated. Nevertheless, since the underlying oscillatory rate of the different inputs is coherent, the network response will have a synchronized component. To construct a simple model of the dynamics of the network rate, we rst dene two conductance-rate variables, r.t/ D g.t/=.NG¿e / and rinp .t/ D ginp .t/=.Ginp ¿e /, where G and Ginp are the peak conductances of the recurrent and afferent synapses, respectively. Averaging equation 2.7 over all the neurons in the network yields the following equation for the population conductance rate, ¿e
dr D ¡r C f .t/; dt
(7.1)
where f(t) is the instantaneous firing rate of the network per neuron. A similar equation holds for r^inp(t), which is a low-pass filter of f^inp(t). The equation for f(t) is

f(t) = β[J^inp ρ^inp + Jρ − I_c],  (7.2)
where J^inp and J are given in equation 4.3. The quantities ρ(t) and ρ^inp(t) are obtained from r(t) and r^inp(t), respectively, using the filter in equation 6.3,

(1/ω_0²) d²ρ/dt² + (1/(ω_0 Q)) dρ/dt + ρ = r + (a/ω_0) dr/dt,  (7.3)
and a similar equation holds for ρ^inp. This yields a set of self-consistent equations, which determine the firing rate of the neurons in the network, f(t). For response modulation amplitudes smaller than the mean firing rate, the dynamics are linear and can be solved analytically (see appendix C). Figure 12 shows the mean, amplitude, and phase of the network response f(t) obtained by simulating the dynamics of the conductance-based network, together with the analytical predictions of the rate model. (As in the previous section, the filtering done by the input synapses was removed for purposes of presentation.) The mean rate of the Poisson input and the strength of the synaptic interaction were chosen such that the mean firing rate of the network will be around 50 spikes/sec and the noise level will be δ = 0.22. The results of Figure 12 reveal the effect of the recurrent interactions on the modulation of the network rate response. Qualitatively, the peak of the
Rate Models for Conductance-Based Cortical Neuronal Networks

[Figure 12 here: amplitude (Hz) and phase (deg) plotted against modulation frequency (Hz).]

Figure 12: Response of an excitatory network to a sinusoidal stimulus. (Top) Mean (hollow circles) and amplitude (filled circles). (Bottom) Phase. The network consists of N = 4000 neurons. The neurons are coupled all-to-all. The synaptic conductance is NG = 0.039 mS/cm², and the synaptic time constant is τ_e = 5 msec. The reversal potential of the synapses was V_e = 0 mV.
response amplitude profile is suppressed and shifts to smaller frequencies due to the excitatory connectivity. In contrast, for an inhibitory network, the model predicts that the peak response will be shifted to the right and that the resonant behavior will be more pronounced. This can be proved from equation C.2 by taking negative J. To test this prediction, we ran simulations of a uniform fully connected network with inhibitory synapses (the reversal potential was −80 mV). The input parameters and the strength of the synaptic interaction were chosen to produce a mean firing rate around 50 spikes/sec and a noise level δ = 0.3. Shown in Figure 13 are the results of the numerical simulations, together with the prediction of the rate model. These results provide additional strong support for our phenomenological time-dependent rate response model.

8 Discussion

The rate models derived here are based on specific assumptions about single-cell properties. The most important assumptions are the independence of the gain of the f-I curve from the leak conductance, g_L (see equation 2.2), and the approximately linear dependence of the threshold current on g_L (see
[Figure 13 here: amplitude (Hz) and phase (deg) plotted against modulation frequency (Hz).]

Figure 13: Response of an inhibitory network to a sinusoidal stimulus. (Top) Mean (hollow circles) and amplitude (filled circles). (Bottom) Phase. The network consists of N = 4000 neurons. The neurons are coupled all-to-all. The synaptic conductance is NG = 0.18 mS/cm², and the synaptic time constant is τ_e = 5 msec. The reversal potential of the synapses was V_in = −80 mV.
equation 2.3). This means that shunting conductances have a subtractive effect rather than a divisive one. This is in agreement with previous modeling studies (Holt & Koch, 1997) and with the properties of the conductance-based model neuron used in our study (see Figure 2). Recent experiments using the dynamic clamp technique provide additional support for this assumption (Brizzi et al., 2001; Chance, Abbott, & Reyes, 2002). A further simplifying assumption, supported by experiments, is that in a broad range of input currents and output firing rates, the f-I curves of cortical neurons can be well approximated by a threshold-linear function. We used such a form of the f-I curve to show that the response of a single neuron to a stationary synaptic input can be well described by simple rate models with a threshold nonlinearity. We applied our single-neuron model to the study of a network of neurons receiving an input generated by a set of synapses activated by uncorrelated trains of spikes with stationary Poisson statistics. In this case, we derived a rate model under the additional assumptions that the network is highly connected and that it is in an asynchronous state. The mapping of the conductance-based model onto a rate model described in this work provides a correspondence between the “synaptic efficacy” used in rate models and biophysical parameters of the neurons. In
particular, the sign of the synaptic efficacy is determined by the value of the reversal potential relative to E_L + V_c, where V_c is the threshold gain potential of the postsynaptic neuron (see equation 2.3). Furthermore, our theory enables the use of rate models to calculate synaptic conductances that are generated by the recurrent activity, allowing a quantitative comparison of predictions from rate models with conductance measurements in in vitro and in vivo intracellular experiments. In the case of a fully connected network of excitatory neurons, we showed that the firing rate of the neurons predicted by the rate model was highly congruent with simulation results over a broad range of synaptic conductances. This indicates that our rate model provides reliable results over a broad range of firing rates and conductance changes. Even for very high rates, incorporating a weak quadratic nonlinearity is enough to account for the saturation of the neurons’ firing rates. We also showed that our approach can be applied to networks with more complicated architectures, such as the conductance-based model of a hypercolumn in V1 analyzed in this work. The conditions for the stabilization of the homogeneous state can be correctly predicted from the mean-field analytical solution of the corresponding rate model. Furthermore, the profile of activity of the network and the tuning curve of the synaptic conductances can be calculated. Our results show that in order to obtain changes in input conductances that are similar to those found experimentally, one has to assume that the space constant of the excitatory feedback is much smaller than the space constant of the inhibitory interactions. However, this conclusion may depend on our assumption of interaction profiles, which are identical for excitatory and inhibitory targets. To extend our approach to the case of time-dependent inputs, we studied the response of the single neuron to noisy input that is modulated sinusoidally in time.
We showed that this response can be described by a rate model in which the neuron responds instantaneously to an effective input, which is a filtered version of the actual one. This is similar to the approach of Chance, du Lac, and Abbott (2001), who studied the responses of spiking neurons to oscillating input currents. Our description is valid provided that the input is sufficiently noisy to broaden resonances and that its modulation is small compared to its average, to avoid rate rectification. Interestingly, if these assumptions are satisfied, we found that the neuron essentially behaves like a linear device even if the modulation of the input is substantial. This allowed us to derive a rate model that provides a good description of the dynamics of a homogeneous network of neurons receiving a time-dependent input. Our derivation of rate equations for conductance-based models can be compared to the one suggested by Ermentrout (1994; see also Rinzel & Frankel, 1992). Ermentrout derived a rate model in the limit where the synaptic time constants are much longer than the time constants of the currents involved in the generation of spikes, as well as in the interspike
interval. In this case, the neuronal dynamics on the scale of the synaptic time constants can be reduced to rate equations of the type of equation 7.1, in which the rate variables, r(t), represent the instantaneous activity of the slow synapses. However, in Ermentrout's approach, the firing rate f(t) is given by an equation similar to equation 7.2 with ρ^inp ≡ r^inp and ρ ≡ r. This is in contrast to our approach, where ρ^inp and ρ are filtered versions of r^inp and r. The two approaches become equivalent in the limit of slowly varying input and slow synapses, that is, ω_0 ≫ 1/t_inp, 1/τ_s, where t_inp is the typical timescale over which the external input varies and τ_s is the synaptic time constant of the faster synapses in the network. The parameter ω_0 is the resonance frequency. Knight (1972a, 1972b) studied the response of integrate-and-fire neurons to periodic input. He concluded that a resonant behavior occurred for input modulation frequencies around the average firing rate of the neurons. He also showed that white noise in the input broadens the resonance peak. Similar results were obtained by Gerstner (2000) in the framework of spike response models. This resonance has also been reported by Brunel, Chance, Fourcaud, and Abbott (2001) and Fourcaud and Brunel (2002), who studied analytically the firing response of a single integrate-and-fire neuron to a periodic input with temporally correlated noise. Our results extend these conclusions to conductance-based neurons. An interesting finding of this work is that the resonance frequency of the rate response increases linearly as a function of the mean output rate, with a slope of one. It would be interesting to derive this relationship, as well as the other parameters of our rate dynamics model, from the underlying conductance-based dynamics. In addition, applications of our approach to other single-cell conductance-based models and to more complicated network architectures need to be explored.
For instance, the effect of slow potassium adaptation currents should be taken into account. These currents are likely to contribute to the high-pass filtering of the neuron, as shown by Carandini, Mechler, Leonard, and Movshon (1996) in the case of an oscillating injected current. In this work, we used an A-current as a mechanism for the linearization of the f-I curve. Other slow hyperpolarizing conductances might also achieve the same effect (Ermentrout, 1998). In particular, slow potassium currents, which are responsible for spike adaptation in many cortical cells, might be alternative candidates. However, linear f-I curves are also observed in cortical inhibitory cells, most of which do not show spike adaptation but probably do possess an A-current. Finally, we focused in this work on conductance-based network models of point neurons. An interesting open question is whether an appropriate rate model can also be derived for neurons with extended morphology. In conclusion, our results open up possibilities for applications of rate models to a range of problems in sensory and motor neuronal circuits in cortex, as well as in other structures with similar single-neuron properties. Applying the rate model to a concrete neuronal system requires knowledge of
a relatively small number of experimentally measurable parameters that characterize the f-I curves of the neurons in the system. In addition, the rate model requires knowledge of the gross features of the underlying connectivity pattern, as well as an estimate of the order of magnitude of the associated synaptic conductances.

Appendix A: Model Neuron

This appendix provides the specifics for the equations of the model neuron used here. The dynamics of our single-neuron model consist of the following equations:

C_m dV/dt = −I_L − I_Na − I_K − I_A + I_app.  (A.1)
The leak current is given by I_L = g_L(V − E_L). The sodium and the delayed rectifier currents are described in a standard way: I_Na = ḡ_Na m_∞³ h (V − E_Na) for the sodium current and I_K = ḡ_K n⁴ (V − E_K) for the delayed rectifier current. The gating variables x = h, n satisfy the relaxation equations dx/dt = (x_∞ − x)/τ_x. The functions x_∞ (x = h, n, m) and τ_x are x_∞ = α_x/(α_x + β_x) and τ_x = φ/(α_x + β_x), where

α_m = −0.1(V + 30)/(exp(−0.1(V + 30)) − 1), β_m = 4 exp(−(V + 55)/18),
α_h = 0.07 exp(−(V + 44)/20), β_h = 1/(exp(−0.1(V + 14)) + 1),
α_n = −0.01(V + 34)/(exp(−0.1(V + 34)) − 1), β_n = 0.125 exp(−(V + 44)/80).

We have taken φ = 0.1. The A-current is I_A = ḡ_A a_∞³ b (V − E_K) with a_∞ = 1/(exp(−(V + 50)/20) + 1). The function b(t) satisfies db/dt = (b_∞ − b)/τ_A with b_∞ = 1/(exp((V + 80)/6) + 1). For simplicity, the time constant, τ_A, is voltage independent. The other parameters of the model are C_m = 1 μF/cm², ḡ_Na = 100 mS/cm², and ḡ_K = 40 mS/cm². Unless specified otherwise, g_L = 0.05 mS/cm², ḡ_A = 20 mS/cm², and τ_A = 20 msec. The reversal potentials of the ionic and synaptic currents are E_Na = 55 mV, E_K = −80 mV, E_L = −65 mV, E_e = 0 mV, and E_in = −80 mV. The external current is I_app (in μA/cm²).

Appendix B: Model of a Hypercolumn in V1: The Mean-Field Equations

B.1 The Instability of the Homogeneous State. For a homogeneous input, a trivial solution of the fixed-point equation, equation 5.1, corresponds to a state in which the responses of all the neurons are the same. However, this homogeneous state can be unstable if the spatial modulation of the interactions is sufficiently large. The condition for this instability can be derived by solving equation 5.1 in the limit of a weakly heterogeneous input f_LGN(θ) as follows. The Fourier expansion of f_LGN is

f_LGN(θ) = Σ_{n=−∞}^{∞} f_n exp(2inθ),  (B.1)
where the coefficients f_n are complex numbers. For weakly heterogeneous inputs, all the coefficients but f_0 are small. In the parameter regime where the homogeneous state is stable, the response of the network to this input will be weakly heterogeneous. Therefore, the Fourier expansion of the activity profile is

m(θ) = Σ_{n=−∞}^{∞} m_n exp(2inθ),  (B.2)
where all the coefficients but m_0 are small. Substituting equations B.1 and B.2 in equation 5.1 and computing the coefficients m_n, n > 0, perturbatively, one finds

m_n = J_LGN f_n / (1 − β J_n),  (B.3)
where J_n are the Fourier coefficients of the interactions. The coefficient m_n diverges if β J_n = 1. This divergence indicates an instability of the homogeneous state. This instability induces a heterogeneous profile of activity with n peaks. The coefficients J_n can be computed from equation 5.2. This yields the instability onset condition:

2β [ J_e (1 − (−1)ⁿ exp(−π/(2λ_e)))/(1 + 4n²λ_e²) + J_in (1 − (−1)ⁿ exp(−π/(2λ_in)))/(1 + 4n²λ_in²) ] = 1.  (B.4)

B.2 Mean-Field Equations for the Profile of Activity. We are interested in the case in which the LGN input is broadly tuned, 0 < ε < 1/2, and the effect of the interactions is sufficiently strong to substantially sharpen the response of the neurons. More specifically, we require that f(θ) = 0 for all the neurons with POs such that |θ − θ_0| > π/2. In this case, the solution of equations 5.1 and 5.2 can be found analytically. Taking θ_0 = 0, without loss of generality, this solution has the form

f(θ) = A_0 + A_1 cos(μ_1 θ) + A_2 sin(μ_2 θ) + A_3 cos(2θ)  for |θ| < θ_c,  (B.5)
f(θ) = 0  for θ_c < |θ|.  (B.6)
The angle θ_c is determined by

f(±θ_c) = 0.  (B.7)
Substituting equation B.5 in equation 5.1, one finds that μ_1 and μ_2 are solutions (real or imaginary) of the equation

Σ_{α=E,I} 2J_α/(1 + λ_α² x²) = 1,  (B.8)
and that the coefficients A_0 and A_3 are given by

A_0 = [J̄_LGN f_LGN (1 − ε) − T] / (1 − 2J_e − 2J_in),  (B.9)

A_3 = J̄_LGN f_LGN ε / (1 − 2[J_e/(1 + 4μ_1²) + J_in/(1 + 4μ_2²)]).  (B.10)
Finally, A_1 and A_2 are obtained from the two equations

F(λ_α; μ; A) = 0,  α = E, I,  (B.11)
where μ = (μ_1, μ_2), A = (A_0, A_1, A_2, A_3), and F is the function

F(x; μ; A) = A_0 + Σ_{i=1,3} A_i [cos(μ_i θ_c) − μ_i x sin(μ_i θ_c)] / (1 + μ_i² x²),  (B.12)

with μ_3 = 2, and the angle θ_c is determined by the condition

A_0 + A_1 cos(μ_1 θ_c) + A_2 sin(μ_2 θ_c) + A_3 cos(2θ_c) = 0.  (B.13)
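The instability onset condition for the homogeneous state (equation B.4) can be checked numerically mode by mode. The following is a minimal sketch; all parameter values (β, J_e, J_in, λ_e, λ_in) are illustrative assumptions, not values taken from the article:

```python
import numpy as np

def homogeneous_mode_gain(n, beta, J_e, J_in, lam_e, lam_in):
    """Left-hand side of the instability condition (equation B.4):
    the homogeneous state loses stability to an n-peaked profile
    when this quantity reaches 1."""
    def term(J, lam):
        return J * (1.0 - (-1.0) ** n * np.exp(-np.pi / (2.0 * lam))) \
               / (1.0 + 4.0 * n ** 2 * lam ** 2)
    return 2.0 * beta * (term(J_e, lam_e) + term(J_in, lam_in))

# Assumed (illustrative) parameters: narrow excitation, broad inhibition.
beta, J_e, J_in, lam_e, lam_in = 1.0, 2.0, -1.5, 0.2, 1.0
gains = {n: homogeneous_mode_gain(n, beta, J_e, J_in, lam_e, lam_in)
         for n in range(1, 6)}
unstable = [n for n, g in gains.items() if g >= 1.0]
```

Because the gain is linear in β, the mode with the largest gain is the one that destabilizes the homogeneous state first as the interactions are scaled up; with the assumed values above, that is the single-peaked n = 1 profile.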
Appendix C: Rate Dynamics in a Neuronal Population with Uniform Connectivity

By Fourier transforming equations 7.1, 7.2, and 6.3 (assuming that the firing rates are always positive), we obtain, for ω ≠ 0,

f̂(ω) = β(J^inp ρ̂^inp + J ρ̂),  (C.1)
where ρ̂(ω) = B(ω) r̂(ω), r̂(ω) = L(ω) f̂(ω), ρ̂^inp(ω) = B(ω) r̂^inp(ω), and r̂^inp(ω) = L(ω) f̂^inp(ω). The low-pass filter L(ω) and the bandpass filter B(ω) are given by L(ω) = 1/(1 + iωτ_e) and B(ω) = (1 + iaω/ω_0)/(1 + iω/(ω_0 Q) − ω²/ω_0²). Putting the above relations in equation C.1 and solving for f̂(ω) gives the transfer function of the network,

f̂(ω) = [β J^inp B(ω) L(ω) / (1 − β J B(ω) L(ω))] f̂^inp(ω),  (C.2)
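The transfer function of equation C.2 is easy to evaluate numerically. The sketch below uses illustrative filter parameters (ω_0, Q, a) and coupling strengths, not values fitted from the article, and compares an excitatory (J > 0) with an inhibitory (J < 0) network:

```python
import numpy as np

def transfer(freq_hz, beta, J, J_inp, tau_e=5e-3, omega0=2*np.pi*50, Q=1.0, a=1.0):
    """Network transfer function f_hat / f_hat_inp of equation C.2,
    built from the low-pass filter L and bandpass filter B.
    The filter parameters here are illustrative assumptions."""
    w = 2.0 * np.pi * np.asarray(freq_hz, dtype=float)
    L = 1.0 / (1.0 + 1j * w * tau_e)
    B = (1.0 + 1j * a * w / omega0) / (1.0 + 1j * w / (omega0 * Q) - (w / omega0) ** 2)
    return beta * J_inp * B * L / (1.0 - beta * J * B * L)

freqs = np.linspace(1.0, 120.0, 400)
exc = np.abs(transfer(freqs, beta=40.0, J=0.02, J_inp=0.05))   # J > 0: excitatory
inh = np.abs(transfer(freqs, beta=40.0, J=-0.02, J_inp=0.05))  # J < 0: inhibitory
# Inhibitory coupling shifts the peak of the amplitude profile to higher frequency:
assert freqs[inh.argmax()] > freqs[exc.argmax()]
```

This reproduces the qualitative prediction discussed in section 7: recurrent excitation pushes the peak of the response amplitude toward low frequencies, while recurrent inhibition shifts it to the right and sharpens the resonance.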
This transfer function can be used to predict the time-dependent rate response.

Acknowledgments

We thank L. Abbott for fruitful discussions and Y. Loewenstein and C. van Vreeswijk for careful reading of the manuscript for this article. This research was partially supported by a grant from the U.S.-Israel Binational Science Foundation and a grant from the Israeli Ministry of Science and the French Ministry of Science and Technology. This research was also supported by the Israel Science Foundation (Center of Excellence Grant 8006/00).
References

Abbott, L. F., & Kepler, T. (1990). Model neurons: From Hodgkin-Huxley to Hopfield. In L. Garrido (Ed.), Statistical mechanics of neural networks (pp. 5–18). Berlin: Springer-Verlag.
Ahmed, B., Anderson, J. C., Douglas, R. J., Martin, K. A., & Whitteridge, D. (1998). Estimates of the net excitatory currents evoked by visual stimulation of identified neurons in cat visual cortex. Cereb. Cortex, 8, 462–476.
Amit, D. J. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural networks retrieving at low spike rates I: Substrate—spikes, rates and neuronal gain. Network, 2, 259–273.
Azouz, R., Gray, C. M., Nowak, L. G., & McCormick, D. A. (1997). Physiological properties of inhibitory interneurons in cat striate cortex. Cereb. Cortex, 7, 534–545.
Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Borg-Graham, L. J., Monier, C., & Fregnac, Y. (1998). Visual input evokes transient and strong shunting inhibition in visual cortical neurons. Nature, 393, 369–373.
Brizzi, L., Hansel, D., Meunier, C., van Vreeswijk, C., & Zytnicki, D. (2001). Shunting inhibition: A study using in vivo dynamic clamp on cat spinal motoneurons. Society for Neuroscience Abstract, 934.3.
Brunel, N., Chance, F. S., Fourcaud, N., & Abbott, L. F. (2001). Effects of synaptic noise and filtering on the frequency response of spiking neurons. Phys. Rev. Lett., 86, 2186–2189.
Carandini, M., Mechler, F., Leonard, C. S., & Movshon, J. A. (1996). Spike train encoding by regular-spiking cells of the visual cortex. J. Neurophysiol., 76, 3425–3441.
Carandini, M., Anderson, J., & Ferster, D. (2000). Orientation tuning of input conductance, excitation, and inhibition in cat primary visual cortex. J. Neurophysiol., 84, 909–926.
Chance, F. S., du Lac, S., & Abbott, L. F. (2001). An integrate-or-fire model of spike rate dynamics.
Society for Neuroscience Abstract, 821.44.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Churchland, P. S., & Sejnowski, T. J. (1992). The computational brain. Cambridge, MA: MIT Press.
Connors, B. W., Malenka, R. C., & Silva, L. R. (1988). Two inhibitory postsynaptic potentials, and GABAA and GABAB receptor-mediated responses in neocortex of rat and cat. J. Physiol. (Lond.), 406, 443–468.
Ermentrout, B. (1994). Reduction of conductance-based models with slow synapses to neural nets. Neural Comp., 6, 679–695.
Ermentrout, B. (1998). Linearization of F-I curves by adaptation. Neural Comp., 10, 1721–1729.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Comp., 14, 2057–2111.
Georgopoulos, A. P., & Lukashin, A. V. (1993). A dynamical neural network model for motor cortical activity during movement: Population coding of movement trajectories. Biol. Cybern., 69, 517–524.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Comp., 12, 43–89.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neuronal networks. Phys. Rev. E, 50, 3171–3191.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Comput. Neurosci., 3, 7–34.
Hansel, D., & Sompolinsky, H. (1998). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (2nd ed., pp. 499–567). Cambridge, MA: MIT Press.
Hille, B. (1984). Ionic channels of excitable membranes. Sunderland, MA: Sinauer.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–562.
Holt, G. R., & Koch, C. (1997). Shunting inhibition does not have a divisive effect on firing rates. Neural Comput., 9, 1001–1013.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Kernell, D. (1968). The repetitive impulse discharge of a simple model compared to that of spinal motoneurones. Brain Research, 11, 685–687.
Knight, B. W. (1972a). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Knight, B. W. (1972b). The relationship between the firing rate of a single neuron and the level of activity in a population of neurons: Experimental evidence for resonant enhancement in the population response. J. Gen. Physiol., 59, 767–778.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1988).
Numerical recipes in C: The art of scientific computing. Cambridge: Cambridge University Press.
Rinzel, J., & Frankel, P. (1992). Activity patterns of a slow synapse network predicted by explicitly averaging spike dynamics. Neural Comp., 4, 534–545.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. New York: Oxford University Press.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proc. Natl. Acad. Sci. USA, 93, 11956–11961.
Seung, H. S. (1996). How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93, 13339–13344.
Stafstrom, C. E., Schwindt, P. C., & Crill, W. E. (1984). Repetitive firing in layer V neurons from cat neocortex in vitro. J. Neurophysiol., 52, 264–277.
Wilson, H. R., & Cowan, J. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neurosci., 16, 2112–2126.

Received November 14, 2002; accepted January 21, 2003.
LETTER
Communicated by Richard Zemel
Neural Representation of Probabilistic Information

M. J. Barber*
Institut für Theoretische Physik, Universität zu Köln, D-50937 Köln, Germany
J. W. Clark
Department of Physics, Washington University, Saint Louis, MO 63130, U.S.A.
C. H. Anderson
Department of Anatomy and Neurobiology, Washington University School of Medicine, Saint Louis, MO 63110, U.S.A.
It has been proposed that populations of neurons process information in terms of probability density functions (PDFs) of analog variables. Such analog variables range, for example, from target luminance and depth on the sensory interface to eye position and joint angles on the motor output side. The requirement that analog variables must be processed leads inevitably to a probabilistic description, while the limited precision and lifetime of the neuronal processing units lead naturally to a population representation of information. We show how a time-dependent probability density ρ(x; t) over variable x, residing in a specified function space of dimension D, may be decoded from the neuronal activities in a population as a linear combination of certain decoding functions φ_i(x), with coefficients given by the N firing rates a_i(t) (generally with D ≪ N). We show how the neuronal encoding process may be described by projecting a set of complementary encoding functions φ̂_i(x) on the probability density ρ(x; t) and passing the result through a rectifying nonlinear activation function. We show how both encoders φ̂_i(x) and decoders φ_i(x) may be determined by minimizing cost functions that quantify the inaccuracy of the representation. Expressing a given computation in terms of manipulation and transformation of probabilities, we show how this representation leads to a neural circuit that can carry out the required computation within a consistent Bayesian framework, with the synaptic weights being explicitly generated in terms of encoders, decoders, conditional probabilities, and priors.

1 Introduction

It has been hypothesized (Anderson, 1994, 1996) that circuits of cortical neurons perform statistical inference and, in particular, that they encode and

*Present address: Centro de Ciências Matemáticas, Universidade da Madeira, Campus Universitário da Penteada, 9000-390 Funchal, Portugal.
Neural Computation 15, 1843–1864 (2003) © 2003 Massachusetts Institute of Technology
M. Barber, J. Clark, and C. Anderson
process information about analog variables in the form of probability density functions (PDFs). This PDF hypothesis provides a unified framework for understanding diverse observations from experimental neurobiology, constructing neural network models, and gaining insights into how neurons can implement a rich collection of information-processing functions. The PDF hypothesis derives from two major themes of computational neuroscience. The first theme stems from efforts to determine how information is represented by neural systems through understanding how neural activity correlates with external cues or actions (such as sensory stimuli or motor responses). Our understanding of neural encoding can be tested by inferring sensory input or motor output from a set of neural activities and comparing the estimate thus obtained to the external cue or action. To decode the response from a population of neurons requires procedures to infer information from individual spike trains, as well as procedures to combine these results into an aggregate estimate. An optimal method for decoding information from individual neural spike trains has been developed (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Bialek & Rieke, 1992; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997) and applied to movement-sensitive neurons in the blowfly (Rieke et al., 1997) and other systems (Theunissen, Roddey, Stufflebeam, Clague, & Miller, 1996). This method consists of utilizing a linear filter to extract the maximum possible information from each spike (typically a few bits; see Rieke et al., 1997), as measured by the ability to reconstruct the stimulus from the spike train. In these studies, the linear filter determines a firing rate from the spike trains; this firing rate contains most of the information, with additional information possibly encoded in other aspects of the activity patterns.
In the current work, we assume that the firing rates capture the essential behavior of neural systems and will not explicitly consider spike trains. Methods for decoding information from the firing rates of populations of neurons were pioneered by Georgopoulos and collaborators. They showed that a “population vector” derived from the firing rates of a population of cortical neurons can be used to predict the intended arm movements of monkeys (Georgopoulos, Schwartz, & Kettner, 1986; Schwartz, 1993). This vector estimate of direction, V_est, is obtained from the neural firing rates a_i by

V_est = Σ_{i=1}^{N} a_i C_i,  (1.1)
where the preferred direction vectors, C_i, indicate the direction at which neuron i has its maximal firing response. The population vector approach has been refined and extended by several authors; in particular, Salinas and Abbott (1994) provide an excellent discussion of several such refinements, as well as introducing their own. The emphasis in such studies has been the reconstruction of vector quantities from populations of neural responses by
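The population-vector readout of equation 1.1 can be sketched directly in code. The cosine tuning curves and network size below are illustrative assumptions, not taken from the cited studies:

```python
import numpy as np

def population_vector(rates, preferred_dirs):
    """Population-vector estimate (equation 1.1): the sum of preferred
    direction vectors C_i weighted by the firing rates a_i."""
    return np.asarray(rates) @ np.asarray(preferred_dirs)

# Illustrative example: N neurons with rectified cosine tuning around
# uniformly spaced preferred angles.
N = 64
angles = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
C = np.column_stack([np.cos(angles), np.sin(angles)])   # unit preferred directions
stim = angles[10]                                        # stimulus on the angular grid
rates = np.maximum(0.0, np.cos(angles - stim))           # rectified cosine tuning
v = population_vector(rates, C)
decoded = np.arctan2(v[1], v[0])
assert abs(decoded - stim) < 1e-6
```

Because the preferred directions here are unit vectors on a uniform grid and the stimulus lies on that grid, the decoded angle matches the stimulus angle up to floating-point error.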
a process that in several cases appears to be computation of an expectation value from an implicit probability distribution. The second theme leading to the PDF hypothesis stems from an analysis showing that the original Hopfield neural network implements, in effect, Bayesian inference on analog quantities in terms of PDFs (Anderson & Abrahams, 1987). The role of PDFs in neural information processing is being explored along a number of avenues. As in this work, Zemel, Dayan, and Pouget (1998) have investigated population coding of probability distributions, but with different representations from those we will consider here. Several extensions of this representation scheme have been developed (Zemel, 1999; Zemel & Dayan, 1999; Yang & Zemel, 2000) that feature information propagation between interacting neural populations. Further, a number of related models have been introduced. Of particular note is a dynamic routing model of directed attention (Anderson & Van Essen, 1987; Olshausen, Anderson, & Van Essen, 1993, 1995). Additionally, several “stochastic machines” (Haykin, 1999) have been formulated, including Boltzmann machines (Hinton & Sejnowski, 1986), sigmoid belief networks (Neal, 1992), and Helmholtz machines (Dayan & Hinton, 1996). Stochastic machines are built of stochastic neurons that occupy one of two possible states in a probabilistic manner. Learning rules for stochastic machines enable such systems to model the underlying probability distribution of a given data set. Systems based on probabilistic frameworks offer a number of important advantages, both conceptual and practical.
In particular, two major advantages of the PDF scheme are that (1) it provides a well-motivated way to address a variety of issues in neural information processing, since having the joint distribution allows one to answer all probabilistic questions about the system, while (2) application of the Bayes rule makes it easy to deal with a changing body of evidence (e.g., allowing one to remove evidence that is later found to be incorrect without “starting over”). The two prominent themes of population coding and probabilistic inference are combined in the PDF hypothesis through the assertion that a physical variable x is described by a neural population at time t in terms of a PDF ρ(x; t) rather than as a single-valued estimate x(t). Such a PDF description has the significant advantage that it not only permits a single-valued estimate to be calculated but also provides for measures of the uncertainty of such estimates. For example, a specific value ξ at time t can be represented as the mean of a normal distribution over x with variance σ², so that

ρ(x; t) = N(x; ξ(t), σ²(t)).  (1.2)
Clearly, this PDF allows ξ(t) to be known very precisely (small variance) or with a great deal of uncertainty (large variance). More generally, we consider a PDF described at time t in terms of a set of D underlying parameters {A_μ}. Guided by the experimentally observed linear
[Figure 1 here: firing rate a_i(μ) plotted against μ for the two response types described in the caption.]

Figure 1: Neuronal responses to a stimulus. (a) Broad, piecewise-linear response functions result in many active neurons for any stimulus and are thus well suited to precisely representing a value. (b) Gaussian response functions result in each neuron’s being active in response to a narrow range of stimuli. Neurons with gaussian response are well suited for representing PDFs with multimodal structure.
decoding rules discussed above, we will take the PDFs to be represented by

$$\rho(x; t) \equiv \rho(x; \{A_\mu(t)\}) = \sum_{\mu=1}^{D} A_\mu(t)\,\Phi_\mu(x).$$  (1.3)
The basis functions $\Phi_\mu(x)$ are orthonormal functions that define the PDFs that the neural circuit can represent. We describe x with $\rho(x; \{A_\mu(t)\})$ rather than $\rho(x \mid \{A_\mu(t)\})$ to distinguish between the assumed forms of models (see equation 1.3) and relationships that exist among random variables (viz. conditional probabilities). The amplitudes $A_\mu(t)$ of the representations defined by equation 1.3 cannot be interpreted as neuronal firing rates: they can take on negative values and are more precise than neuronal firing rates. However, we can represent a PDF in terms of decoding functions $\phi_i(x)$ and firing rates $a_i(t)$ associated with N neurons, so that

$$\rho(x; t) = \sum_{i=1}^{N} a_i(t)\,\phi_i(x).$$  (1.4)
Unlike the basis functions $\Phi_\mu(x)$, the decoding functions $\phi_i(x)$ form a highly redundant, overcomplete representation ($N \gg D$) that is specialized for use with neurons of limited precision. In this work, we will consider neurons whose firing rates show either piecewise-linear behavior (and thus constitute essentially one-dimensional versions of the response functions entering Georgopoulos’s population vector; see also Figure 4 in Fuchs, Scudder, & Kaneko, 1988) or gaussian forms in response to a stimulus (see Figure 1).
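The two-level representation of equations 1.3 and 1.4 can be sketched numerically. The fragment below is an illustration, not the paper's simulation: it assumes a D = 2 orthonormal basis (the constant-plus-line pair $\sqrt{1/2}$, $\sqrt{3/2}\,x$ that also appears in section 2.2) and projects a gaussian PDF onto it.

```python
import numpy as np

# Grid over the interval [-1, 1] on which the PDFs are defined.
x = np.linspace(-1.0, 1.0, 4001)
dx = x[1] - x[0]

# A hypothetical D = 2 orthonormal basis: Phi_1(x) = sqrt(1/2),
# Phi_2(x) = sqrt(3/2) x (constant plus straight line).
Phi = np.vstack([np.full_like(x, np.sqrt(0.5)),
                 np.sqrt(1.5) * x])

# A PDF to represent: a gaussian centered at xi = 0.3.
rho = np.exp(-0.5 * ((x - 0.3) / 0.15) ** 2)
rho /= rho.sum() * dx                      # normalize

# Amplitudes by projection (the basis is orthonormal).
A = (Phi * rho).sum(axis=1) * dx

# Equation 1.3: reconstruct the D-dimensional projection of the PDF.
rho_hat = A @ Phi

# With D = 2, rho_hat is a straight line, but its normalization and
# mean match those of the original PDF.
mean_orig = (x * rho).sum() * dx
mean_hat = (x * rho_hat).sum() * dx
```

Only the normalization and the mean of the PDF survive this D = 2 projection; section 2.4 shows how larger D retains more structure.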
From the relations asserted in equations 1.3 and 1.4, we can identify three relevant problem domains. First, we have the physical variable x, described by the PDF $\rho(x; t)$. This domain is that of high-level concepts. Second, we have the neural network with its measurable neural firing rates $a_i(t)$. The neural network constitutes a physical implementation of the desired computations on the physical variable, so the properties of this second domain should be chosen to match the properties of biological systems as closely as possible. In particular, the neural firing rates must be constrained to be positive quantities of low precision. The third domain is that of the underlying parameters $A_\mu$, which subserve an alternative, abstract implementation of the desired computations. The constraint in this case is minimality: we concern ourselves only with mathematical convenience and allow the $A_\mu$ to be of arbitrary precision and to take on negative values. Following Zemel et al. (1998), the domain of physical variables is called the implicit space and the domain of measurable quantities the explicit space. Extending their nomenclature, we shall refer to the third domain as the minimal space. The minimal space will serve as a useful bridge between the two other spaces. It may be conceptually helpful to regard the variables or parameters $A_\mu(t)$ as the activities of a set of D “metaneurons,” fictitious entities that reside and act in the minimal space. However, it must be emphasized that such metaneurons differ from real neurons in their abilities to function with high precision and to produce negative “firing rates” $A_\mu(t)$. Accordingly, they possess valuable properties that will facilitate formal representation and analysis.

2 Obtaining the Neuronal Representation

2.1 Multiple Levels of Representation.
The fundamental assumption of the framework to be developed in this article is that information about a physical variable x given a set of parameters $A = \{A_1, A_2, A_3, \ldots\}$ at time t is represented by an ensemble of neurons as a PDF $\rho(x; A_1(t), A_2(t), A_3(t), \ldots)$. For notational convenience, we will usually abbreviate this quantity as $\rho(x; t)$. This PDF can be determined from a set of neuronal firing rates $\{a_i(t)\}$ using a set of decoding functions (or simply decoders) $\phi_i(x)$, as prescribed in equation 1.4. In turn, a set of encoding functions (encoders) $\hat{\phi}_i(x)$ is used to determine the firing rates from an assumed PDF by means of

$$a_i(t) = f\left(\int \hat{\phi}_i(x)\,\rho(x; t)\,dx\right),$$  (2.1)

where a nonlinear activation function $f(\cdot)$ is introduced to preclude negative firing rates. The encoding functions $\hat{\phi}_i(x)$ must be chosen so as to yield a close match to desired (i.e., experimentally observed) firing rates $a_i(t)$. The decoding rule (see equation 1.4) should in general be viewed as only
returning an approximation to a PDF; in particular, functions that are not strictly positive semidefinite can be decoded from such a rule. We can also represent the PDF using a complete orthonormal basis $\{\Phi_\mu(x)\}$ for the space spanned by the decoders, as shown in equation 1.3. Further, we can represent the decoding functions in terms of this basis, writing

$$\phi_i(x) = \sum_{\nu=1}^{D} \kappa_{\nu i}\,\Phi_\nu(x),$$  (2.2)

where the $\kappa_{\nu i}$ are coupling coefficients to be determined. Since we now have an orthonormal basis, the coefficients $A_\mu$ in equation 1.3 are simply evaluated from

$$A_\mu(t) = \int \Phi_\mu(x)\,\rho(x; t)\,dx.$$  (2.3)

The encoding and decoding rules based on the amplitudes $A_\mu(t)$ in the minimal space are seen to parallel those based on the neuronal firing rates $a_i(t)$, apart from the absence of a nonlinearity in equation 2.3. In this section, we will develop methods to relate operations in the mathematically convenient minimal space and the biologically plausible implementation of PDFs in the explicit space of model neurons.

2.2 Obtaining the Encoding Functions. Although we do not know the encoding functions at this point, we do know that they can be represented in terms of another set of basis functions $\{\hat{\Phi}_\mu(x)\}$ through

$$\hat{\phi}_j(x) = \sum_{\mu} \hat{\kappa}_{j\mu}\,\hat{\Phi}_\mu(x),$$  (2.4)

where the coupling coefficients $\hat{\kappa}_{j\mu}$ are in general distinct from the $\kappa_{\nu i}$. For many networks, it is appropriate to assume the basis for the encoders to be identical to the basis for the decoders. For example, in the case of the neural integrator (see section 2.5), the PDFs are continually mapped into and out of the minimal space provided by the $\Phi_\mu(x)$ and $\hat{\Phi}_\nu(x)$. Thus, $\mathrm{span}\{\Phi_\mu\}$ can be equal to $\mathrm{span}\{\hat{\Phi}_\nu\}$. For definiteness, we take $\Phi_\mu(x) = \hat{\Phi}_\mu(x)$. To find a set of encoding functions relating the neural firing rates $\{a_i(t)\}$ to a desired target PDF $\rho^T(x; t)$, we define the cost function
$$E_1 = \frac{1}{2}\sum_i \int \left[a_i(A) - f\left(\int \hat{\phi}_i(x)\,\rho^T(x; A)\,dx\right)\right]^2 \rho(A)\,dA = \frac{1}{2}\sum_i \int \left[a_i(A) - f\left(\sum_\nu \hat{\kappa}_{i\nu} \int \Phi_\nu(x)\,\rho^T(x; A)\,dx\right)\right]^2 \rho(A)\,dA.$$  (2.5)
We now use gradient descent to determine the $\hat{\kappa}_{i\nu}$ that minimize $E_1$,

$$\frac{d\hat{\kappa}_{j\mu}}{dt} \approx -\eta \frac{\partial E_1}{\partial \hat{\kappa}_{j\mu}} = \eta \int \left(a_j(A) - f(h_j(A))\right) f'(h_j(A))\,U_\mu(A)\,\rho(A)\,dA,$$  (2.6)

where $\eta$ is a rate constant. We have defined

$$U_\nu(A) = \int \Phi_\nu(x)\,\rho^T(x; A)\,dx$$  (2.7)

$$h_j(A) = \sum_\nu \hat{\kappa}_{j\nu}\,U_\nu(A)$$  (2.8)

to simplify the expression. To demonstrate the efficacy of this optimization procedure, we apply it to each of the neuronal responses shown in Figure 1. We will see that the differing neuronal properties result in markedly different encoding functions. First, we find encoders for the set of broadly tuned, piecewise-linear neuronal responses (see Figure 1a) to a precise input signal. We assume a low-dimensional minimal space spanned by two straight-line functions, $\Phi_0(x) = \sqrt{1/2}$ and $\Phi_1(x) = \sqrt{3/2}\,x$, and take the activation function to be rectification,

$$f(x) = \begin{cases} x & x \geq 0 \\ 0 & x < 0. \end{cases}$$

To limit the information content of the firing rates, we determine a set of decoders that utilizes all of the neurons in the representation (see Figure 2c) and is independent of unrealistically precise firing rates. In a similar fashion, we obtain decoders (see Figure 2d) complementary to the encoders for the gaussian activity patterns. Having determined the decoders, we can directly transform between the explicit, implicit, and minimal spaces. The transformation rules are summarized pictorially in Figure 3.
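The optimization of equations 2.6 through 2.8 can be sketched as a stochastic gradient loop. This is an illustrative reconstruction, not the paper's simulation: the ramp-shaped target rates, their thresholds, the narrow gaussian target PDFs, the initialization, and the learning rate are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.linspace(-1, 1, 201)
dx = x[1] - x[0]
# The D = 2 minimal-space basis from the text.
Phi = np.vstack([np.full_like(x, np.sqrt(0.5)), np.sqrt(1.5) * x])

f = lambda u: np.maximum(u, 0.0)            # rectification
fp = lambda u: (u > 0).astype(float)        # its derivative

# Hypothetical desired responses: N rectified ramps with assumed thresholds.
N = 10
thresholds = np.linspace(-1.0, 0.8, N)
def target_rates(xi):
    return np.maximum(xi - thresholds, 0.0)

kappa_hat = np.ones((N, 2))    # encoder coefficients, positive start
eta = 0.05                     # rate constant

for _ in range(10000):
    xi = rng.uniform(-1, 1)
    # Target PDF rho^T: a narrow gaussian at the stimulus value xi.
    rho_T = np.exp(-0.5 * ((x - xi) / 0.05) ** 2)
    rho_T /= rho_T.sum() * dx
    U = (Phi * rho_T).sum(axis=1) * dx      # U_nu(A), eq 2.7
    h = kappa_hat @ U                       # h_j(A), eq 2.8
    err = target_rates(xi) - f(h)
    kappa_hat += eta * np.outer(err * fp(h), U)   # stochastic step from eq 2.6

# Evaluate the learned encoders at a held-out stimulus.
xi = 0.5
rho_T = np.exp(-0.5 * ((x - xi) / 0.05) ** 2)
rho_T /= rho_T.sum() * dx
rates = f(kappa_hat @ ((Phi * rho_T).sum(axis=1) * dx))
```

After training, the encoded rates closely track the desired piecewise-linear responses, since a ramp is exactly representable as a rectified linear function of the two basis amplitudes.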
[Figure 4 plot: the original PDF and the PDFs decoded with D = 2, 4, and 8, over x in [-1, 1].]
Figure 4: Effect of the dimensionality D on the quality of the neural representation. As D is increased, the decoded PDF more closely matches the original PDF.
2.4 Dimensionality of the Minimal Space. The structure of the neural representations created depends critically on the dimensionality D of the associated spaces. We can most easily explore the effect of the dimensionality in the minimal space, where D is simply equal to the number of basis functions $\Phi_\mu(x)$. By way of illustration, let us pattern the basis functions after the Legendre polynomials $P_\mu(x)$. The Legendre polynomials form an orthogonal set but are not normalized, so we define $\tilde{P}_\mu(x) = P_\mu(x) \big/ \sqrt{\int_{-1}^{1} P_\mu^2(x)\,dx}$ over the interval $[-1, 1]$. For dimension D, we then set the minimal-space basis function $\Phi_\mu(x)$ equal to the normalized Legendre polynomial $\tilde{P}_{\mu-1}(x)$ for $\mu = 1, 2, \ldots, D$. To demonstrate the effect of the dimension D on the quality of the neural representation, we compare an assumed target PDF with the PDF as represented in neural populations. We vary D and generate, as described in sections 2.2 and 2.3, encoding and decoding functions optimized to work with neurons with firing-rate profiles as shown in Figure 1. Using equation 2.1, the target PDF is encoded into neural firing rates, which are then decoded using equation 1.4. With a bimodal target PDF, increasing D improves the quality of the decoded PDF (see Figure 4). For D = 2, only a straight line is decoded (although this may still be useful; see section 2.5), while for D = 8, the decoded PDF matches the target PDF quite well.
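The normalized Legendre basis is easy to construct with NumPy; since $\int_{-1}^{1} P_\mu^2(x)\,dx = 2/(2\mu + 1)$, the normalization factor is known in closed form. The bimodal test PDF below is an assumption loosely patterned on Figure 4, and the comparison looks only at the minimal-space projection (not at a full encode/decode through model neurons).

```python
import numpy as np
from numpy.polynomial import legendre

def basis(D, x):
    """Normalized Legendre basis Phi_1 .. Phi_D sampled on x."""
    rows = []
    for mu in range(D):
        c = np.zeros(mu + 1)
        c[mu] = 1.0                       # coefficient vector selecting P_mu
        # integral of P_mu^2 over [-1, 1] is 2 / (2 mu + 1)
        rows.append(np.sqrt((2 * mu + 1) / 2.0) * legendre.legval(x, c))
    return np.vstack(rows)

x = np.linspace(-1, 1, 4001)
dx = x[1] - x[0]

# A bimodal target PDF (an assumed example).
rho = (0.6 * np.exp(-0.5 * ((x + 0.4) / 0.15) ** 2)
       + 0.4 * np.exp(-0.5 * ((x - 0.4) / 0.15) ** 2))
rho /= rho.sum() * dx

# Worst-case error of the minimal-space representation shrinks with D.
errors = {}
for D in (2, 4, 8):
    Phi = basis(D, x)
    A = (Phi * rho).sum(axis=1) * dx      # eq 2.3
    errors[D] = np.abs(A @ Phi - rho).max()
```

The errors decrease monotonically from D = 2 to D = 8, mirroring the improvement seen in Figure 4.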
2.5 A Neural Integrator Model. An important example of a neural integrator is the group of neurons that maintain the eyes in a fixed position in the absence of visual input. These recurrently connected neurons are able to hold the eye in position for times much longer than the interspike interval of the neurons. Collectively, they form an attractor network that acts as a memory of eye position, which lasts for several seconds (Seung, 1996). By introducing temporal dynamics into the underlying probabilistic models, we can create a model of a neural integrator. The dynamics are straightforward: for a short time $\tau$, the PDF should be unchanged, so

$$\rho(x; t + \tau) = \rho(x; t),$$  (2.17)
where x is the value (i.e., eye position) stored in the memory. As discussed above, we generate decoding functions using piecewise-linear activities, linear encoders, and a rectifying activation function. Making use of this representation, the encoding and decoding rules (see equations 1.4 and 2.1), and the probabilistic dynamics (see equation 2.17), we can show that

$$a_i(t + \tau) = g\left(\int \hat{\phi}_i(x)\,\rho(x; t + \tau)\,dx\right)$$  (2.18)

$$\phantom{a_i(t + \tau)} = g\left(\sum_j a_j(t) \int \hat{\phi}_i(x)\,\phi_j(x)\,dx\right).$$  (2.19)

Defining weights

$$\omega_{ij} = \int \hat{\phi}_i(x)\,\phi_j(x)\,dx,$$  (2.20)

we may rewrite this as

$$a_i(t + \tau) = g\left(\sum_j \omega_{ij}\,a_j(t)\right).$$  (2.21)
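A minimal sketch of equations 2.20 and 2.21 follows, under stated assumptions: random encoder coefficients paired with pseudoinverse decoder coefficients (an illustrative construction, not the paper's optimized encoders and decoders), and a linear g. Rectification and limited firing-rate precision, which produce the point attractors discussed below, are omitted, so here one integration step preserves the rates exactly.

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(-1, 1, 4001)
dx = x[1] - x[0]
# D = 2 orthonormal basis (constant plus line, as in section 2.2).
Phi = np.vstack([np.full_like(x, np.sqrt(0.5)), np.sqrt(1.5) * x])

# Illustrative encoders; decoders via pseudoinverse, so decoding undoes
# encoding on the minimal space (kappa @ kappa_hat = identity).
N, D = 20, 2
kappa_hat = rng.normal(0.0, 1.0, (N, D))
kappa = np.linalg.pinv(kappa_hat)

phi_hat = kappa_hat @ Phi          # encoders  phi_hat_i(x)
phi = kappa.T @ Phi                # decoders  phi_j(x)

# Equation 2.20: omega_ij = integral phi_hat_i(x) phi_j(x) dx.
omega = phi_hat @ phi.T * dx

# One integration step (eq 2.21 with linear g): the rates are preserved,
# so the decoded PDF, and hence the stored eye position, persists.
rho = np.exp(-0.5 * ((x - 0.25) / 0.2) ** 2)
rho /= rho.sum() * dx
a = kappa_hat @ ((Phi * rho).sum(axis=1) * dx)   # encode (linear f)
a_next = omega @ a
```

In this linear setting omega is a rank-D projection, so every encoded rate vector is a fixed point; it is the rectification and the coarse rate precision that collapse this continuum to the few stable points shown in Figure 5.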
The recurrent neural network that results is fully connected, with each neuron having a synaptic connection to every other neuron. The stored value of the eye position is extracted by calculating the expectation value of the random variable x, weighted by the decoded PDF. Ideally, we would like any value in the supported range to be held constant, so that the network functions as a line attractor (Seung, 1996), a kind of continuous attractor. However, the system actually operates as a collection of point attractors with only a limited number of stable fixed points, as can be seen from the network’s transfer function (see Figure 5). The structure of
[Figure 5 plot: the transfer function ⟨x⟩(t + τ) versus ⟨x⟩(t), compared with the line attractor, for D = 2 and D = 6.]
Figure 5: The neural integrator model maintains only a limited number of values rather than an arbitrary input value. The number of stable fixed points of the neural integrator model can be seen in the network’s transfer function. Here there are two stable fixed points for a neural integrator consisting of 20 neurons, with encoders and decoders found using a minimal space with dimension D = 2. By increasing D to 4, the number of stable fixed points increases to 3 (not shown), while increasing D to 6 yields 4 stable fixed points. With only the 20 neurons of limited precision utilized here, further increases in D do not give rise to further increases in the number of stable fixed points.
the transfer function, and the number of stable fixed points, depends on the dimensionality D of the minimal space. As the dimensionality of the minimal space is increased, the neural integrator can support additional stable fixed points, eventually approximating a line attractor. This neural integrator model is essentially a variation of the model constructed by Eliasmith and Anderson (1999).

3 Probabilistic Inference Performed by Neural Networks

3.1 Inference. Inference between two related variables x and y in the implicit space is performed by taking a weighted average of the conditional probability $\rho(y \mid x)$:

$$\rho(y; t + \tau) = \int \rho(y \mid x)\,\rho(x; t)\,dx.$$  (3.1)
We have assumed in equation 3.1 that the relationship between x and y is independent of the values of the minimal-space parameters, so

$$\rho(y \mid x; \{A_\mu(t)\}) = \rho(y \mid x).$$  (3.2)
This assumption fixes the structure of the probabilistic model, explicitly excluding learning from any neural networks derived from it. (Relaxing this assumption would be a useful first step toward allowing the neural network weights to change in response to experience.) The conditional probability $\rho(y \mid x)$ is like a fixed look-up table; the Marr-Albus theory of cerebellar function can be directly mapped into equation 3.1 (Hakimian, Anderson, & Thach, 1999). Mapping the implicit-space inference relation 3.1 into the explicit space of neurons yields a neural network (Anderson, 1994, 1996; Zemel & Dayan, 1997). Specifically, one imposes representations as given in equations 1.3 and 1.4 for x and

$$\rho(y; t) = \sum_j b_j(t)\,\psi_j(y)$$  (3.3)

$$b_j(t) = g\left(\int \hat{\psi}_j(y)\,\rho(y; t)\,dy\right)$$  (3.4)

for y. Then one combines these representations with equation 3.1, leading to

$$b_j(t + \tau) = g\left(\sum_i w_{ji}\,a_i(t)\right)$$  (3.5)

with the coupling coefficients

$$w_{ji} = \iint \hat{\psi}_j(y)\,\rho(y \mid x)\,\phi_i(x)\,dx\,dy.$$  (3.6)
For well-chosen encoding and decoding functions, equations 3.5 and 3.6 allow us to construct a neural network that embodies the desired relationship between the implicit variables, as expressed by the conditional probability, without applying a training procedure to find a relation from a data set. As a concrete example of probabilistic inference within the PDF scheme, we construct a neural network using equations 3.5 and 3.6. We use gaussian activity patterns (see Figure 1) and the corresponding encoders and decoders previously derived (see Figures 2b and 2d). Defining $\rho(y \mid x) = \delta(y - x)$, we obtain a neural network that implements a form of communication channel, transmitting a PDF from x to y (see Figure 6a).
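The communication-channel construction can be sketched directly from equations 3.5 and 3.6. The fragment below is illustrative, not the paper's network: it assumes two populations with random encoder coefficients and pseudoinverse decoder coefficients, a shared D = 2 basis, and a linear g.

```python
import numpy as np

rng = np.random.default_rng(2)

x = np.linspace(-1, 1, 2001)
dx = x[1] - x[0]
# Shared D = 2 orthonormal basis for both the x and y spaces.
Phi = np.vstack([np.full_like(x, np.sqrt(0.5)), np.sqrt(1.5) * x])

# Two populations: random encoders, pseudoinverse decoders
# (kappa @ kappa_hat = identity on the minimal space).
Nx, Ny = 15, 12
kx_hat = rng.normal(0, 1, (Nx, 2)); kx = np.linalg.pinv(kx_hat)
ky_hat = rng.normal(0, 1, (Ny, 2)); ky = np.linalg.pinv(ky_hat)

phi = kx.T @ Phi          # decoders phi_i(x) for the x population
psi_hat = ky_hat @ Phi    # encoders psi_hat_j(y) for the y population

# Equation 3.6 with rho(y | x) = delta(y - x):
# w_ji = integral psi_hat_j(x) phi_i(x) dx.
W = psi_hat @ phi.T * dx

# Encode a PDF over x, push it through the network (eq 3.5 with g
# linear for clarity), and decode a PDF over y.
rho_x = np.exp(-0.5 * ((x - 0.2) / 0.2) ** 2)
rho_x /= rho_x.sum() * dx
A = (Phi * rho_x).sum(axis=1) * dx     # minimal-space amplitudes of rho_x
a = kx_hat @ A                         # x firing rates
b = W @ a                              # y firing rates
rho_y = (b @ ky.T) @ Phi               # decoded PDF over y
```

The decoded PDF over y reproduces the D-dimensional projection of the input PDF over x, which is the sense in which the network acts as a channel.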
[Figure 6 plots: (a) input ρ(x; t) and output ρ(y; t); (b) input ρ(x; t), output ρ(y; t), and top-down signal ρ(z; t), over the interval [-1, 1].]
Figure 6: (a) The PDF ρ(y; t) decoded from the neural network is a close match to ρ(x; t). The neural network accurately represents the assumed inference relationship. (b) Multiple sources of evidence can help to resolve ambiguous information. Here, the inferior mode of a bimodal, bottom-up input ρ(x; t) is damped by a more specific top-down signal ρ(z; t) in the minimal space.
The above approach to inference is naturally extended to greater numbers of implicit variables. For example, suppose we add a second input z to the above network and write

$$\rho(y; t + \tau) = \iint \rho(y \mid x, z)\,\rho(x; t)\,\rho(z; t)\,dx\,dz.$$  (3.7)

Representing z using

$$\rho(z; t) = \sum_k c_k(t)\,\theta_k(z)$$  (3.8)

$$c_k(t) = f\left(\int \hat{\theta}_k(z)\,\rho(z; t)\,dz\right)$$  (3.9)

leads to

$$b_j(t + \tau) = g\left(\sum_{i,k} w_{jik}\,a_i(t)\,c_k(t)\right)$$  (3.10)

with

$$w_{jik} = \iiint \hat{\psi}_j(y)\,\rho(y \mid x, z)\,\phi_i(x)\,\theta_k(z)\,dx\,dy\,dz.$$  (3.11)
An interesting feature of this neural network is that it employs multiplicative interactions. This multiplication might be realized by coincidence detection in the dendrites; the implication is that the dendrites are active processing elements (Mel, 1994; Cash & Yuste, 1998).
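The multiplicative update of equations 3.10 and 3.11 can be sketched with a third-order weight tensor. Everything below is an assumption made for illustration: the toy conditional (y concentrated near the average of x and z), the pseudoinverse populations, a shared coarse grid for all three variables, and linear activations in place of f and g.

```python
import numpy as np

rng = np.random.default_rng(3)

M = 101
x = np.linspace(-1, 1, M)
dx = x[1] - x[0]                    # same grid for x, y, and z
Phi = np.vstack([np.full(M, np.sqrt(0.5)), np.sqrt(1.5) * x])  # D = 2 basis

def population(N):
    """Random encoder coefficients plus pseudoinverse decoder coefficients."""
    enc = rng.normal(0, 1, (N, 2))
    return enc, np.linalg.pinv(enc)

x_enc, x_dec = population(12)
y_enc, y_dec = population(12)
z_enc, z_dec = population(12)

# Toy conditional: y is concentrated near the average of x and z.
Y, X, Z = np.meshgrid(x, x, x, indexing="ij")
rho_cond = np.exp(-0.5 * ((Y - 0.5 * (X + Z)) / 0.05) ** 2)
rho_cond /= rho_cond.sum(axis=0, keepdims=True) * dx  # normalize over y

psi_hat = y_enc @ Phi     # encoders for y
phi = x_dec.T @ Phi       # decoders for x
theta = z_dec.T @ Phi     # decoders for z

# Equation 3.11: w_jik = triple integral psi_hat_j rho(y|x,z) phi_i theta_k.
w = np.einsum("jy,yxz,ix,kz->jik", psi_hat, rho_cond, phi, theta,
              optimize=True) * dx ** 3

def encode(enc, center):
    rho = np.exp(-0.5 * ((x - center) / 0.15) ** 2)
    rho /= rho.sum() * dx
    return enc @ ((Phi * rho).sum(axis=1) * dx)

a = encode(x_enc, 0.4)    # evidence about x
c = encode(z_enc, -0.2)   # evidence about z
b = np.einsum("jik,i,k->j", w, a, c)   # eq 3.10 with linear g

rho_y = (b @ y_dec.T) @ Phi            # decoded PDF over y
Ey = (x * rho_y).sum() * dx            # decoded mean of y
```

With these inputs, the decoded mean of y comes out near (0.4 + (-0.2)) / 2 = 0.1, showing how the product of the two input rate vectors, weighted by the tensor, fuses the two sources of evidence.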
3.2 Working in the Minimal Space. So far, we have used the concept of the minimal space as a tool for developing the encoders and decoders. We also can make direct use of the minimal space to set up abstract networks, then convert those into networks of neurons. To accomplish this, we derive relations between the firing rates in the two spaces ($\{A_\mu(t)\}$ and $\{a_i(t)\}$). The neural network in the explicit space then constitutes a physical implementation of the abstract network in the minimal space. The issues of the role of neuronal firing-rate variability in the population code (see, e.g., Abbott & Dayan, 1999) may thus be separated from the issues of the propagation of probabilistic information. First, consider the decoding rules given by equations 1.3 and 1.4. Making use of equation 2.2, we obtain

$$\sum_\mu A_\mu(t)\,\Phi_\mu(x) = \sum_i a_i(t) \sum_\nu \kappa_{\nu i}\,\Phi_\nu(x).$$  (3.12)

Since the $\Phi_\mu(x)$ are orthonormal functions, we have

$$A_\mu(t) = \sum_i \kappa_{\mu i}\,a_i(t)$$  (3.13)

for transforming from the explicit space to the minimal space. Next, consider the encoding rule given by equation 2.1. Recalling that $\hat{\phi}_i(x) = \sum_\nu \hat{\kappa}_{i\nu}\,\hat{\Phi}_\nu(x)$ and $A_\nu(t) = \int \Phi_\nu(x)\,\rho(x; t)\,dx$, we have

$$a_i(t) = f\left(\sum_\nu \hat{\kappa}_{i\nu}\,A_\nu(t)\right)$$  (3.14)
for transforming from the minimal space to the explicit space. Using equations 3.13 and 3.14, we can translate between the minimal and explicit spaces. This allows us to set up neural networks by first working in the mathematically convenient minimal space. To illustrate this procedure, we return to the $X \rightarrow Y$ inference network. We take the minimal spaces for both the input x and the output y to be defined by linear functions over the interval $[-1, 1]$, with basis functions of the form shown in Figure 2a. The associated PDFs are represented using equation 1.3 and

$$\rho(y; t) = \sum_\nu B_\nu(t)\,\Psi_\nu(y).$$  (3.15)

With these representations, the probabilistic relation given in equation 3.1 becomes

$$B_\nu(t + \tau) = \sum_\mu \Omega_{\nu\mu}\,A_\mu(t),$$  (3.16)
where

$$\Omega_{\nu\mu} = \iint \Psi_\nu(y)\,\rho(y \mid x)\,\Phi_\mu(x)\,dx\,dy.$$  (3.17)

We next convert this into a neural network in the explicit space using equations 3.13 and 3.14 so that

$$b_j(t + \tau) = g\left(\sum_\nu \hat{\kappa}^{(y)}_{j\nu}\,B_\nu(t)\right) = g\left(\sum_i \sum_{\mu,\nu} \hat{\kappa}^{(y)}_{j\nu}\,\Omega_{\nu\mu}\,\kappa^{(x)}_{\mu i}\,a_i(t)\right).$$  (3.18)

By identifying

$$\omega_{ji} = \sum_{\mu,\nu} \hat{\kappa}^{(y)}_{j\nu}\,\Omega_{\nu\mu}\,\kappa^{(x)}_{\mu i},$$  (3.19)

we may rewrite equation 3.18 as

$$b_j(t + \tau) = g\left(\sum_i \omega_{ji}\,a_i(t)\right),$$  (3.20)
arriving at a neural network with the same feedforward dynamics (see equation 3.5) and the same synaptic weights (see equation 3.6) found previously. This example reproduces results previously found by working in the explicit space, but also highlights several advantages of working in the minimal space. Perhaps most important, the fundamental structure of the neural networks is made more transparent by eliminating the redundancies that arise in the networks due to the limited representational ability of neurons. Significantly, we see that computational properties of the nonlinear update rule for the output neurons (see equation 3.18) can be understood by studying the linear update rule in the minimal space (see equation 3.16), consistent with the population vector representations investigated by Georgopoulos et al. (1986).

3.3 Top-Down Feedback from a High-Level Model. As an explicit example of the advantages of working in the minimal space, as well as an illustration of how the PDF approach can be applied to more complicated systems, we investigate a simple system with multiple sources of evidence. We have already considered inference of the PDF describing a random variable Y from evidence about another random variable X (see section 3.2). To that system, we add another random variable Z, which is conditionally
independent of X given Y; the probabilistic model can thus be represented as a Bayesian belief network (Pearl, 1988) with the form $X \rightarrow Y \rightarrow Z$. For the chosen demonstration, we employ neural firing-rate profiles and corresponding basis functions of the gaussian forms introduced in section 2.2. The PDF describing Z is represented in the minimal space as $\rho(z; t) = \sum_\gamma C_\gamma(t)\,\Theta_\gamma(z)$. To keep the focus on the interaction of multiple sources of evidence, we take the conditional probabilities to be the Dirac delta functions $\rho(y \mid x) = \delta(y - x)$ and $\rho(z \mid y) = \delta(z - y)$. To obtain the network dynamics, we first introduce a cost function,

$$H = \int \left(\rho(y; t) - \int \rho(y \mid x)\,\rho(x; t)\,dx\right)^2 dy + \int \left(\rho(z; t) - \int \rho(z \mid y)\,\rho(y; t)\,dy\right)^2 dz,$$  (3.21)
that takes a minimum when the desired inference relations are satisfied. Minimizing H with respect to the $B_\nu$ using gradient descent yields an update rule for a network in the minimal space. The update rule defines the dynamics of a neural network when mapped into the explicit space. The resulting network combines properties of both neural networks and Bayesian belief networks, so we term them neural belief networks (Barber, 1999). To utilize the third random variable Z as an additional source of evidence, we directly specify $\rho(z; t)$ and encode it into the minimal network as the set of parameters $\{C_\gamma\}$. The parameters describing Y are driven in a feedforward manner by X and in a feedback manner by Z, and then decoded to find $\rho(y; t)$. One possible use of this second source of evidence is to resolve an ambiguous input; the inferior mode of a bimodal feedforward input in X can be deemphasized using more specific high-level evidence in Z (see Figure 6b). Neural networks derived in this manner could be used as the starting point for coherent theoretical accounts of attentive effects in the primate visual system, the electrosensory system of weakly electric fish, and other neural systems where an internal model is built up to impose global constraints on neural representations of information. Neural belief networks are described in more detail in Barber (1999).

4 Conclusion

We have examined some of the ramifications of the hypothesis that neural networks represent information as probability density functions. These PDFs are assumed to be expressible as a linear combination of some implicit decoding functions, with the decoder for each neuron being weighted by its firing rate. The firing rates in turn may be obtained from a PDF using a complementary set of encoding functions.
In general, the encoding and decoding functions that we have introduced are numerous enough to define spaces of very high dimension, far beyond the range of accurate representation by biological neurons having a precision of only a few bits. To mediate this conflict between computational requirements and biological reality, we have introduced an auxiliary representation of a lower-dimensional minimal space appropriate to the nature and scale of the computations that neurobiological systems actually perform on the relevant input and output analog variables. The basis functions in this minimal space are used to represent both the encoding and decoding functions, limiting the dimensionality of the spaces they define. As an added benefit, the minimal space (and the associated metaneuron variables) can be chosen to have properties that facilitate theoretical characterization of the neural networks resulting from the PDF hypothesis. These neural networks are based on the available probabilistic information and the encoding and decoding functions. The synaptic weights of the networks are fully specified without a training procedure. A natural extension of the work we have presented is the addition of learning rules for determining the weights. Learning rules would provide several advantages; in particular, they would facilitate the generation of neural networks when data are available but the underlying computations are not entirely clear. As previously noted, relaxing the assumption expressed in equation 3.2 would be a natural starting point for incorporating learning into the neural networks derived in the PDF scheme. The optimization procedure we utilized to find the encoding functions may be a useful model for a more complete learning rule.
Researchers in the fields of molecular biology, immunology, genetics, development, and evolution, all of which involve highly complex systems having many degrees of freedom, are beginning to explore the use of metavariables as a formal means to reduce the dimensionality of the space of parameters that must be dealt with in achieving viable and tractable quantitative descriptions. The formal results we have derived for metavariable (“metaneuron”) representation of function spaces and the experience we have gained through associated model simulations may prove valuable for parallel investigations in these and other fields. Returning to the neurobiological context, we may comment on the role that is envisioned for the PDF formalism in the modeling of brain function. Recent work based on population-temporal coding (e.g., Eliasmith & Anderson, 1999, 2002) indicates that the modeling of low-level sensory processing and output motor control does not require such a sophisticated representation; manipulation of mean values is generally sufficient, and the representations can be simplified to deal with vector spaces instead of function spaces. However, explicit representation of probabilistic descriptors of the state of knowledge of pertinent analog variables may prove indispensable to an understanding of higher-level processes. For example, estimates of depth at each spatial location from the disparity between the images
impinging on both eyes can never be made with precision using a purely bottom-up strategy. The modern approach to all higher-level image-processing tasks is driven by the theory of Bayesian inference, in which models are developed and parameters estimated based on a set of well-defined rules within a probabilistic framework. Elsewhere, we shall carry the PDF program a step further by formulating procedures for embedding joint probabilities into neural networks. These procedures will allow us to design neural circuit models that pool multiple sources of evidence. In our view, this offers the most rational approach to building and understanding cortical circuits that carry out well-posed information processing tasks.

Acknowledgments

This research was supported by the National Science Foundation under Grant Nos. IBN-9634314, PHY-9602127, and PHY-9900713. While writing this work, M.J.B. was a member of the Graduiertenkolleg Azentrische Kristalle, GK549 of the DFG. Much of the writing was completed while M.J.B. and J.W.C. were participants in the Research Year on “The Sciences of Complexity: From Mathematics to Technology to a Sustainable World” at the Zentrum für interdisziplinäre Forschung, University of Bielefeld.

References

Abbott, L., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Comput., 11(1).
Anderson, C. (1994). Basic elements of biological computational systems. International Journal of Modern Physics C, 5(2), 135–137.
Anderson, C. (1996). Unifying perspectives in neuronal codes and processing. In E. Ludeña, P. Vashishta, & R. Bishop (Eds.), Condensed matter theories (Vol. 11, pp. 365–374). Commack, NY: Nova Science Publishers.
Anderson, C., & Abrahams, E. (1987). The Bayes connection. In Proceedings of the IEEE First International Conference on Neural Networks (Vol. 3, pp. 105–112). San Diego, CA: IEEE, SOS Print.
Anderson, C., & Van Essen, D. (1987).
Shifter circuits: A computational strategy for directed visual attention and other dynamic neural processes. Proc. Natl. Acad. Sci. USA, 84(17), 6297–6301.
Barber, M. (1999). Studies in neural networks: Neural belief networks and synapse elimination. Unpublished doctoral dissertation, Washington University, Saint Louis, MO.
Bialek, W., & Rieke, F. (1992). Reliability and information transmission in spiking neurons. Trends Neurosci., 15(11), 428–434.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R., & Warland, D. (1991). Reading a neural code. Science, 252(5014), 1854–1857.
Cash, S., & Yuste, R. (1998). Input summation by cultured pyramidal neurons is linear and position-independent. J. Neurosci., 18(1), 10–15.
Dayan, P., & Hinton, G. (1996). Varieties of Helmholtz machines. Neural Networks, 26, 1385–1403.
Eliasmith, C., & Anderson, C. (1999). Developing and applying a toolkit from a general neurocomputational framework. Neurocomputing, 26, 1013–1018.
Eliasmith, C., & Anderson, C. (2002). Neural engineering: The principles of neurobiological simulation. Cambridge, MA: MIT Press.
Fuchs, A., Scudder, C., & Kaneko, C. (1988). Discharge patterns and recruitment order of identified motoneurons and internuclear neurons in the monkey abducens nucleus. J. Neurophysiol., 60(6), 1874–1895.
Georgopoulos, A., Schwartz, A., & Kettner, R. (1986). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419.
Hakimian, S., Anderson, C., & Thach, W. (1999). A PDF model of populations of Purkinje cells. Neurocomputing, 26, 169–175.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Hinton, G., & Sejnowski, T. (1986). Learning and relearning in Boltzmann machines. In D. Rumelhart, J. McClelland, and the PDP Research Group (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (pp. 282–317). Cambridge, MA: MIT Press.
Mel, B. (1994). Information processing in dendritic trees. Neural Comput., 6, 1031–1085.
Neal, R. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71–113.
Olshausen, B., Anderson, C., & Van Essen, D. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci., 13(11), 4700–4719.
Olshausen, B., Anderson, C., & Van Essen, D. (1995). A multiscale dynamic routing circuit for forming size- and position-invariant object representations. J. Comput. Neurosci., 2(1), 45–62.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W.
(1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Salinas, E., & Abbott, L. (1994). Vector reconstruction from firing rates. J. Comput. Neurosci., 1(1–2), 89–107.
Schwartz, A. (1993). Motor cortical activity during drawing movements: Population representation during sinusoid tracing. J. Neurophysiol., 70(1), 28–36.
Seung, H. (1996). How the brain keeps the eyes still. Proc. Natl. Acad. Sci. USA, 93(23), 13339–13344.
Theunissen, F., Roddey, J., Stufflebeam, S., Clague, H., & Miller, J. (1996). Information theoretic analysis of dynamical encoding by four identified primary sensory interneurons in the cricket cercal system. J. Neurophysiol., 75(4), 1345–1364.
Yang, Z., & Zemel, R. (2000). Managing uncertainty in cue combination. In S. Solla, T. Leen, & K. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Zemel, R. (1999). Cortical belief networks. In R. Hecht-Nielsen (Ed.), Theories of cortical processing. Berlin: Springer-Verlag.
Zemel, R., & Dayan, P. (1997). Combining probabilistic population codes. In International Joint Conference on Artificial Intelligence. San Mateo, CA: Morgan Kaufmann.
Zemel, R., & Dayan, P. (1999). Distributional population codes and multiple motion models. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Zemel, R., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Comput., 10(2), 403–430.

Received December 20, 2000; accepted December 6, 2002.
LETTER
Communicated by Aapo Hyvärinen

Learning the Gestalt Rule of Collinearity from Object Motion

Carsten Prodöhl
[email protected]

Rolf P. Würtz
[email protected]

Christoph von der Malsburg
[email protected]
Institut für Neuroinformatik, Ruhr-Universität Bochum, D-44780 Bochum, Germany
The Gestalt principle of collinearity (and curvilinearity) is widely regarded as being mediated by the long-range connection structure in primary visual cortex. We review the neurophysiological and psychophysical literature to argue that these connections are developed from visual experience after birth, relying on coherent object motion. We then present a neural network model that learns these connections in an unsupervised Hebbian fashion with input from real camera sequences. The model uses spatiotemporal retinal filtering, which is very sensitive to changes in the visual input. We show that it is crucial for successful learning to use the correlation of the transient responses instead of the sustained ones. As a consequence, learning works best with video sequences of moving objects. The model addresses a special case of the fundamental question of what represents the necessary a priori knowledge the brain is equipped with at birth so that the self-organized process of structuring by experience can be successful.

1 Introduction

It has long been realized that perception is not a passive intake of unstructured sensory signals, but a highly selective and structured active process employing complicated and widely unknown rules to guide behavior. If these rules are not hardwired at birth, and we will argue in detail that some of them are not, this creates a vicious circle of knowledge depending on perception to arise and perception in turn depending on acquired knowledge to be possible. An attractive way to break this circle is the postulate that only some basic perceptual mechanisms are present at birth, which can be used to learn useful processing rules from the properties of the environment. These can lead to refined perception and, consequently, the acquisition of refined knowledge about the environment.
Neural Computation 15, 1865–1896 (2003); © 2003 Massachusetts Institute of Technology

The evolutionary advantage gained from such a hierarchy would be a lower burden on the coding capacity of the genome and more flexibility to cope with environmental changes that happen too fast for evolutionary change. The most prominent rules for visual perception are the Gestalt principles formulated by Wertheimer (1923) and Koffka (1935). Their basic postulate is that a percept is more than the sum of the constituent parts in that these parts are linked to form a coherent whole. The Gestalt principles are rules that govern what should be linked together to form a percept and what should not. The most important ones are proximity, similarity, closure, good continuation, common fate, good form, and global precedence. Recent research in computer vision has begun to recognize Gestalt principles as important for object selection and segmentation, and they have been applied in numerous models of visual perception, where they are usually taken for granted. To our knowledge, there are two models for their development during maturation of the visual system (Grossberg & Williamson, 2001; Choe, 2001), but we know of none that relies on short sequences of a single natural scene as the basis for learning. We will present a detailed model of how the principles of collinearity and curvilinearity can be learned from real visual stimuli, assuming that the principle of common fate actively organizes perception already at birth. We will employ three assumptions, which are, to varying degrees, covered by experimental data:

• Gestalt principles are implemented by the connectivity of the visual cortex.

• There is a hierarchy of Gestalt principles in the sense that some principles are already active at birth, and others are developed later and influenced by visual experience. This hierarchy is reflected in the temporal order in which the various principles become observable during development.

• The higher principles are learned from experience, while the lower ones assist in structuring the data to be learned.
Concretely, we will review the literature to present evidence that the principles of collinearity and curvilinearity correspond to structured horizontal connections between simple cells in V1 and that the principle of common fate is more fundamental than collinearity and curvilinearity. Then we describe a neuronal model that shows that collinearity and curvilinearity can be learned from observing moving objects by structuring horizontal connections with a Hebbian learning rule. The letter is organized as follows. In section 1.1, we review recent psychophysical work on the course of development of various Gestalt principles in early infancy. In section 1.2, we focus on the role of common fate. In section 1.3, we look in depth at recent neurophysiological data about which biological pathways and computations are probably present at birth or shortly afterward and which are structured by experience. The more
technically minded reader may want to skip this introduction and move to section 2, where we describe a simplified model of visual processing up to V1 and the proposed learning dynamics of the horizontal connections. The technical details can be found in the appendices. In section 3, the outcome of computational experiments using this model is presented. Finally, in the discussion, we extract the basic components that we consider crucial for the success of the model.

1.1 The Developmental Succession of Gestalt Principles. In this section, we collect psychophysical evidence that Gestalt principles are not present at birth and that they develop one after the other. Although for adults the Gestalt principles of good form or good continuation are dominant for perceiving an object as a unity, children younger than one year can hardly make use of them (Spelke, Breinlinger, Jacobson, & Phillips, 1993). The same holds for texture similarity and color similarity. Twelve-week-old infants do not use these principles at all with regard to object unity and only slowly start to employ them during the first postnatal year. Detecting form at a very early stage seems to depend on continuous optical transformations caused by object or observer motion. At 24 weeks, infants are unable to grasp 3D object form from multiple stationary binocular views. However, if 16-week-old observers are presented with a continuous geometrical transformation around a stationary 3D object, they are able to build up a 3D form representation (Kellman & Short, 1987). The disparity information from the two eyes cannot be used immediately after birth. It is only at the age of 4 to 8 weeks that the convergence accuracy of the eyes for distances between 25 cm and 200 cm reaches the accuracy of the adult (Hainline, Riddell, Grose-Fifer, & Abramov, 1992). Accommodation of the eyes becomes accurate at 12 weeks postnatally (Braddick & Atkinson, 1979).
Stereo vision itself develops even later, at the age of 16 weeks (Yonas, Arterberry, & Granrud, 1987). In the period between 12 and 20 weeks postnatally, strong binocular interaction arises, which at 24 weeks culminates in a superior binocular acuity and the beginning of stereopsis (Birch & Salomão, 1998; Birch, 1985). The contrast sensitivity for spatial frequency gratings increases between 4 and 9 weeks for all spatial frequencies, in line with the convergence accuracy of the eyes (Norcia, Tyler, & Hamer, 1990). At 9 weeks, the acuity for low spatial frequency gratings reaches adult levels, but for higher frequencies, it is still three octaves worse than in the adult (Courage & Adams, 1996). From this time on, the contrast sensitivity for high spatial frequency gratings increases systematically in line with the emergence of stereo vision. The color system develops even later: luminance contrast above 20% is reliably detectable at the age of 5 weeks, but there is still no response to isoluminant chromatic stimuli of any size or contrast. In the following weeks, chromatic gratings are detectable only at low spatial frequencies with an acuity 20 times lower than that for luminance stimuli. The sensitivity to chromatic
gratings increases more rapidly in the following weeks than the one for luminance stimuli (Morrone, Burr, & Fiorentini, 1990). Not only physical factors such as changing photoreceptor density may be responsible for the changes in contrast sensitivity; the neural noise is also nine times higher in neonates than in adults and decreases to adult levels during the first eight months of development (Skoczenski & Norcia, 1998).

1.2 The Relevance of Rough Motion Information. In this section we discuss the biological relevance of motion information and collect evidence that it is available at an early stage of development. We have argued that neither form nor color nor disparity information can be processed at early postnatal stages, but luminance information at low spatial frequencies can. The latter is thought to be mediated mainly by the M-path, which is also essential for motion processing. The Gestalt principle of common fate, realized by common motion of object parts, is the dominating principle for perceiving the unity of an object in 12-week-old infants (Kellman, Spelke, & Short, 1986). This is independent of the direction of motion in 3D space. Common motion dominates figural quality, substance, weight, texture, and shape in 16-week-old infants (Streri & Spelke, 1989). At this age, the infant is able to make a distinction between object and observer motion and uses only object motion for the generation of an object percept (Kellman, Gleitmann, & Spelke, 1987). The question arises as to what kind of computation is carried out regarding the visual information originating from a moving object detected by the retina. Twenty-four-week-old infants can predict linear object motion in grasping tasks but have difficulties doing the same for visual tasks involving tracking of objects that are out of reach (Hofsten, Vishton, Spelke, Feng, & Rosander, 1998).
The apparent inability to predict linear object motion, together with the observation that luminance information of sufficient contrast at low spatial frequencies can be processed directly after birth, makes it very unlikely that the visual system is able to estimate accurate motion vectors at early stages of development. Nevertheless, some motion processing is important in early development: Piaget (1936) reported that even newborns react to high-contrast stimuli that are moving in front of their faces. This is consistent with the findings on luminance gratings, and we conclude that at birth, a system must be present that can detect changes in the visual world signaled by achromatic luminance stimuli of low spatial frequency. We interpret these findings such that learning Gestalt principles relies on changes in low-frequency luminance information at times when no egomotion occurs. Infants can actively create such a situation by gazing in a constant direction, where a moving object of sufficient size and luminance contrast is present, for a considerable amount of time. This specific behavior is indeed observed regularly, as has been reported by, for example, Piaget (1936) and Barten, Birns, & Ronch (1971). Our model refers to this particular
situation, which from “within” the visual system is distinguished by strong transient responses in retina and cortex, together with the knowledge that the observer is not moving. In order to complete the model, two things remain to be specified:

1. What is the supposed neuronal substrate for the functional organization according to Gestalt principles?

2. How can this substrate be modified on the basis of the transient responses caused by object movement?

1.3 Development of the Visual Pathway. In order to motivate our answer to these questions, we now review some facts about the early development of connectivity in the visual pathway. Although for the retina the process of maturing is not complete at birth, Tootle (1993) has shown that ganglion cells of cats show burst-like spontaneous activity and that those cells fatigue very quickly after repeated stimulation with the same stimulus in the first postnatal week. We interpret this functionally as a kind of transient response property that detects changes in the visual input. The same author further showed that ON- and OFF-ganglion cells are already present at birth and that the proportion of light-driven ganglion cells approaches 100% in the second postnatal week. Furthermore, the retinogeniculate (Snider, Dehay, Berland, Kennedy, & Chalupa, 1999) and the thalamocortical (Isaac, Crair, Nicoll, & Malenka, 1997) pathways develop mainly prenatally and are therefore present at birth. In visually inexperienced kittens, 90% of all cells in area 17 are of simple type, and 70% of all visually active neurons in this area show a rudimentary orientation bias (Albus & Wolf, 1984), although only 11% are specifically tuned to one orientation. Most of these cells respond preferentially to contrast changes caused by decreasing light intensity, as 76% of all responding neurons are activated by OFF-zones exclusively. This means that area 17 shows a sensitivity bias in favor of dark stimuli immediately after birth.
The cells responding to visual stimuli are located in layers 4 and 6 of the striate cortex, and there is almost no activity in layers 2/3 and 5 before 3 weeks postnatally. In the fourth postnatal week, ON- and OFF-zones are equal in number, and almost all cells in layers 4 and 6 show orientation tuning. Psychophysical experiments show that in the human infant, contrast differences overrule orientation-based texture differences in segmentation tasks (Atkinson & Braddick, 1992) up to the twelfth week. We now turn to the question of what the neural substrate for learning the Gestalt principles of collinearity or curvilinearity is. There is extensive evidence in the psychophysical (Field, Hayes, & Hess, 1993; Hess & Field, 1999; Kovacs, 2000), neurophysiological (Malach, Amir, Harel, & Grinvald, 1993; Bosking, Zhang, Schofield, & Fitzpatrick, 1997; Schmidt, Goebel, Löwel, & Singer, 1997; Fitzpatrick, 1997), and modeling (Grossberg & Mingolla, 1985; Li, 1998; Ross, Grossberg, & Mingolla, 2000; Yen & Finkel, 1998; Grossberg & Williamson, 2001) literature that these principles are at least partly implemented by horizontal connections in V1. The models described by Grossberg and Mingolla (1985), Grossberg and Williamson (2001), and Ross et al. (2000) agree with perceptual data even to the point of reproducing illusionary contours. Most, but not all, vertical interlayer local circuits in V1 of macaque monkeys develop prenatally in precise order without visual experience (Callaway, 1998). This means that axon terminals at least find the right layer and already form a crude retinotopic projection. However, intralayer horizontal connections are present but only rudimentarily developed at birth, as most axon terminals have not yet hit their target cells. Studies of postmortem human brains show that the first horizontal connections develop 1 to 3 weeks before birth (37 weeks after gestation) in layers 4b and 5. Their number increases rapidly after birth and culminates in a uniform plexus at around 7 weeks after birth. The patchiness of these projections as it is found in the adult emerges after at least 8 weeks postnatally (Burkhalter, Bernardo, & Charles, 1993; Katz & Callaway, 1992). The long-range connections can extend up to a maximum of four hypercolumns in each direction (Katz & Callaway, 1992). Burkhalter et al. (1993) further showed that, consistent with the results of neuronal activity in kittens, layers 2/3 and 6 develop horizontal connections later than layers 4b and 5. In layer 2/3, they are not present until the sixteenth postnatal week and reach maturity in the sixtieth week. It is interesting to note that the connections in layer 2/3 are patchy from the start. This suggests that they can already benefit from the patchiness of the connections in layer 4b, probably mediated by a direct vertical connection from layer 4b to layer 2/3 that develops after birth (Katz & Callaway, 1992).
Furthermore, as layer 4b belongs to the M-path and provides direct input to area MT (which is strongly involved in motion processing), we conclude that the processing of visual information related to motion precedes and probably supports the processing of form, color, precise stereoscopic depth, and their integration. This assumption is consistent with the psychophysical results mentioned earlier. The development of horizontal connections has been shown to depend on the visual input presented (Löwel & Singer, 1992).

1.4 Relation to Natural Image Statistics. An article about learning from natural stimuli is incomplete without discussing what is known in the literature about the statistics of such stimuli. The idea that the visual system is wired in a way that it provides an efficient and nonredundant representation of the incoming signals goes back to Attneave (1954) and Barlow (1961). Based on this principle, there have been successful predictions of properties of retinal, lateral geniculate nucleus (LGN), and simple cells in V1. Examples, without attempt at completeness, include Olshausen and Field (1996), Bell and Sejnowski (1997), and van Hateren and Ruderman (1998). A complete review of this line of work is beyond the scope of this letter but is done beautifully in Simoncelli and Olshausen (2001). Additional assumptions have to be employed—typically either the sparseness (Olshausen & Field, 1996) of a cortical representation or the statistical independence of the activities of the cells involved. The latter leads to properties of visual cells resulting from independent component analysis (van Hateren & Ruderman, 1998; Bell & Sejnowski, 1997). Also, translation invariance is usually assumed, because otherwise the required statistical basis would become intractably large. During review of this article, we learned that independent component analysis has recently been applied successfully to networks of V1 cells that support contour enhancement (Hoyer & Hyvärinen, 2002). They learn a feedforward layer of contour-coding cells that take input from complex cells in V1. The underlying assumption is sparseness of coding. With our model, we take a slightly different approach. We do not employ any assumption about sparseness or independence of cortical signals. Rather, we model the spatiotemporal properties of the visual pathway up to V1 and apply Hebbian learning to the horizontal connections between simple cells. The model is more sophisticated in biological detail than others in this area. For example, positivity of neuronal responses is always maintained. As a consequence, the stimuli are preprocessed in a nonlinear way before providing data for learning. The importance of such nonlinearities has been pointed out by Zetzsche and Krieger (2001). A feature that our model shares with the others is the explicit assumption of translation invariance, which leads to weight sharing during learning. This assumption is rather unbiological but hard to avoid for keeping computation times acceptable.

2 Methods and Models

Starting from the data just reviewed, we assume that the specific connectivity pattern of long-range horizontal connections provides the neural basis for the Gestalt principles of collinearity and curvilinearity.
This notion is supported by the apparent co-occurrence of the use of these principles and the maturation of the respective connections during development. This answers the first question raised at the end of section 1.2 about the neural substrate of Gestalt principles. In order to answer the second one, about how these connections develop depending on object motion, we propose a quantitative model of how the transient retinal stimuli are propagated toward the cells interconnected by the axons in question (see Figure 1) and how their connection strengths are modified. We model the relevant parts of the visual pathway running from the retina via the retinogeniculate and thalamocortical connections to the simple cells of layer 4b in primary visual cortex and apply a Hebbian learning rule to shape the connectivity. All functions we will use to describe our model are functions of two-dimensional space, but we omit this dependency for convenience of notation. The full details of the retina model are given in appendix A.1; we
Figure 1: Illustration of the complete model.
summarize only the important features here. The retinal photoreceptors show a strong transient response to changing stimuli. Their output is passed to bipolar cells, and finally ganglion cells perform a spatial filtering using the well-known center-surround antagonism that enhances local contrast differences. As the temporal information detected by the photoreceptors is preserved, the ganglion cells show a transient output as well. Our retina model is based on the model for Y-ON-ganglion cells developed by Gaudiano (1994), which is extended in two respects. First, both ON- and OFF-ganglion cells are modeled and, second, the time course of the transient response pattern of each cell is represented in a simplified way by just two values (see Figure 2): a maximal transient response v_tr when
Figure 2: The computation of the discrete approximations v_tr and v_st to the continuous transient activity of an ON ganglion cell.
new input comes in, and the steady-state activity rate v_st while this input is constantly present. Here, it is assumed that the timescale of the ganglion cells is faster than the change in the input, which is recorded by a camera at 3 frames per second. This requirement results in an upper bound for object speed relative to the timescales used for the neuronal processing. Neither the LGN nor the cortical layer 4cα is explicitly modeled, as it is assumed that in early ontogenesis, no processing relevant for the development of Gestalt principles is performed there. Therefore, the activity of ON- and OFF-ganglion cells provides direct input to the simple cells in layer 4b. It will become clear later that for the purpose of our model, it is not necessary to model all aspects of inhibitory neurons in striate cortex in detail. A detailed model of the cortical dynamics mediated by short-, middle-, and long-range corticocortical connections is also not required. Let us explain why. There are many inhibitory interneurons in the cortex that receive afferent input themselves and affect the excitatory pyramidal cells by at least short-range lateral connections. We assume that the main effect of those inhibitory connections with regard to our model is to avoid an excitatory explosion when input arrives at the cortex, as only afferences are coming in. This assumption is realistic because the high neural noise in neonates (see section 1.3) requires strong global inhibition—stronger than the overall excitation—in the cortex to keep the whole system stable. Furthermore, activity mediated along horizontal connections alone should not be able to trigger a response in a target cell. If inhibitory neurons have no other function than the ones mentioned, we can avoid modeling them explicitly by letting only those excitatory cortical cells participate in the structuring of long-range cortical connections that receive strong primary afferent input themselves. Consequently, we do not need to model the cortical dynamics further at this early stage of development, as purely intracortical influences at a postsynaptic neuron without strong primary afferent input should not have significant impact on its activity. In the adult, however, there are substantial influences from intracortical long-range connections, and the neural noise is also reduced by a factor of nine. The modeled simple cells are arranged in hypercolumns, and during the simulations, a long-range connection structure between these cells emerges. One cell of each hypercolumn can be connected to any cell located in a 9 × 9 square of hypercolumns surrounding its own. Each simple cell in the model in fact represents a local pool of cells that all have similar properties. Therefore, it makes sense to model connections of a model cell to itself. One of the crucial features of our model for the development of a specific long-range connection structure is that the cortical simple cells in layer 4b show a transient response pattern. In the results section, we will see that this transient cortical response pattern enables the development of an iso-orientation long-range connection structure as it is found in animals (Schmidt et al., 1997). In the following, we describe how we model the transient cortical activity, starting with the transients of the ganglion cells. An interesting question, which is beyond the range of this article, is how these responses are produced biologically. For simplicity (and computational tractability), we omit all biological details that could be part of the theoretically possible mechanisms (e.g., single-cell properties, sustained local inhibition) that enable transient responses.
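The two-value discretization just described can be illustrated with a toy adaptation model (the dynamics and the `gain` parameter below are our own illustrative assumptions, not the Gaudiano (1994) ganglion model actually used in the letter): on an input step, the response overshoots in proportion to the change and then relaxes, and only the peak v_tr and the asymptote v_st are retained, as in Figure 2.

```python
def ganglion_two_values(old_input, new_input, gain=2.0):
    """Toy transient/steady discretization of an ON-ganglion cell.

    Hypothetical dynamics (an assumption of this sketch): when the input
    steps from `old_input` to `new_input`, the response overshoots the new
    steady state by an amount proportional to the change, then relaxes.
    Only the peak (v_tr) and the asymptote (v_st) are kept, as in Figure 2.
    """
    v_st = max(new_input, 0.0)                              # steady-state rate
    v_tr = max(v_st + gain * (new_input - old_input), 0.0)  # transient peak
    return v_tr, v_st

# An upward input step yields an overshoot (v_tr > v_st);
# a constant input yields none (v_tr == v_st).
```

An OFF-cell would respond to the opposite sign of the input change; the letter's actual retina model additionally includes the center-surround spatial filtering of the ganglion cells.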
These must be investigated in the animal and by models on a finer scope than ours. We have found that the success of the model depends much more on the fact that the response is transient than on its precise time course. Therefore, for each time step n between the acquisition of successive video frames, the output of an ON- or OFF-ganglion cell is discretized to two values, v_tr(n) and v_st(n). Consequently, it is also natural to discretize the primary afferent response a that a simple cell would show if no other (intracortical) influences were present to a transient and a stationary value, a_tr and a_st, respectively. a is computed by a 2D spatial convolution (denoted by $\ast$) with the kernels g^ON and g^OFF representing the synaptic couplings made by ON-afferences and OFF-afferences to the cortical cells:

$$a_{\mathrm{tr}}(n) = v^{\mathrm{ON}}_{\mathrm{tr}}(n) \ast g^{\mathrm{ON}} + v^{\mathrm{OFF}}_{\mathrm{tr}}(n) \ast g^{\mathrm{OFF}},$$
$$a_{\mathrm{st}}(n) = v^{\mathrm{ON}}_{\mathrm{st}}(n) \ast g^{\mathrm{ON}} + v^{\mathrm{OFF}}_{\mathrm{st}}(n) \ast g^{\mathrm{OFF}}. \tag{2.1}$$
The full details of the cortex model are given in equations B.1 through B.3. Here it suffices to mention that the functions g^ON and g^OFF are responsible
for the orientation φ and the polarity (+, −) of the simple cell receptive fields. The resulting receptive fields for the φ+ cells are shown schematically in the top row and left-most column of Figures 3 and 4. Regardless of the detailed mechanism that causes the transient nature of the cortical cell response, if a transient response is triggered at frame n, then there must be a difference between the primary afferent input of this frame, a_st(n), and the primary afferent input the cell received during the previous frame, a_st(n − 1). It then follows from our retina model that there has to be a difference in the values of a_tr(n) and a_st(n) as well, as after the transient over- or undershoot, a new steady state a_st(n) will be reached, which cannot be equal to the old one a_st(n − 1), because otherwise there would have been no transient response at all. The real output o of the cortical cell—the one that can be measured experimentally by counting action potentials and linearly transforming the base rate to zero—can then be approximated as

$$o(n) = \max(a_{\mathrm{tr}}(n) - a_{\mathrm{st}}(n),\, 0). \tag{2.2}$$
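Equations 2.1 and 2.2 together can be sketched as follows (a minimal rendering: the small vertical stripe standing in for g^ON, and its negated copy for g^OFF, are illustrative assumptions, not the receptive-field kernels of appendix B):

```python
import numpy as np

def conv2d_same(image, kernel):
    """Plain 2D convolution with zero padding; output has the image's size."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]  # flip so this is convolution, not correlation
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

def primary_afferent(v_on, v_off, g_on, g_off):
    """Equation 2.1: a = v^ON * g^ON + v^OFF * g^OFF (spatial convolutions)."""
    return conv2d_same(v_on, g_on) + conv2d_same(v_off, g_off)

def simple_cell_output(a_tr, a_st):
    """Equation 2.2: o(n) = max(a_tr(n) - a_st(n), 0)."""
    return np.maximum(a_tr - a_st, 0.0)

# Illustrative stand-in kernels for one orientation/polarity (assumptions):
# a vertical ON stripe, negated for the OFF afferences.
g_on = np.array([[0.0, 1.0, 0.0]] * 3) / 3.0
g_off = -g_on

rng = np.random.default_rng(0)
v_tr_on, v_st_on = rng.random((8, 8)), rng.random((8, 8))    # ganglion maps
v_tr_off, v_st_off = rng.random((8, 8)), rng.random((8, 8))

a_tr = primary_afferent(v_tr_on, v_tr_off, g_on, g_off)
a_st = primary_afferent(v_st_on, v_st_off, g_on, g_off)
o = simple_cell_output(a_tr, a_st)   # rectified transient response map
```

The rectification keeps the response positive, in line with the positivity constraint the letter emphasizes in section 1.4.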
The response in equation 2.2 is quantitatively a bit too strong, as the relaxation of the afferent input a_st(n) may not be complete in the time between two consecutive frames. However, the important feature for our model is that transient responses are successfully detected by this output function. To point out how crucial the transient nature of cortical responses in layer 4b is for the development of long-range connections, we have done additional simulations with simple cells having sustained (o*) instead of transient responses,

$$o^{*}(n) = \max(a_{\mathrm{st}}(n) - \Theta,\, 0), \tag{2.3}$$

where Θ is a constant near the baseline primary afferent activity. A simple Hebbian learning mechanism is used for the adaptation of the long-range synaptic strengths:

$$\Delta w_{ij} = \varepsilon\, o_i o_j. \tag{2.4}$$
For computational efficiency, this learning rule is not applied to single synaptic weights but to ensembles of equivalent connections. Two connections are equivalent if their pre- and postsynaptic cells have the same orientation and polarity, and they span the same cortical distance. This effectively leads to a system of connections that is translation invariant by design. Mathematical details and a biological motivation can be found in section B.2. If the transient cortical responses o are inserted into equation 2.4 to evaluate the correlation between pre- and postsynaptic cells, the connection structure shown in Figure 3 emerges. We will argue in section 3 that this is suitable to form the anatomical basis for the Gestalt principle of collinearity.
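The shared-weight Hebbian update can be sketched like this (the (pre-orientation, post-orientation, offset) indexing is our own minimal rendering of the equivalence classes; N_ORI, the grid sizes, and ε are arbitrary assumptions, and polarity is left out):

```python
import numpy as np

# One shared weight per equivalence class:
# (pre-orientation, post-orientation, offset within a 9 x 9 hypercolumn square).
N_ORI, NEIGH = 4, 9

def hebbian_shared_update(w, o, eps=0.01):
    """One presentation step of Δw_ij = ε o_i o_j (equation 2.4), accumulated
    over all cell pairs belonging to the same equivalence class.

    w: shared weights, shape (N_ORI, N_ORI, NEIGH, NEIGH)
    o: transient responses of one polarity's simple cells,
       shape (N_ORI, H, W) on the hypercolumn grid
    """
    n_ori, h, wd = o.shape
    half = NEIGH // 2
    for pre in range(n_ori):
        for post in range(n_ori):
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    # pre-cell at (y, x) paired with post-cell at (y+dy, x+dx)
                    pre_act = o[pre, max(0, -dy):h - max(0, dy),
                                     max(0, -dx):wd - max(0, dx)]
                    post_act = o[post, max(0, dy):h + min(0, dy),
                                       max(0, dx):wd + min(0, dx)]
                    w[pre, post, dy + half, dx + half] += eps * np.sum(pre_act * post_act)
    return w

w = np.zeros((N_ORI, N_ORI, NEIGH, NEIGH))
frame_responses = np.random.default_rng(1).random((N_ORI, 16, 16))
w = hebbian_shared_update(w, frame_responses)  # repeat once per video frame
```

Because every pair contributing to a class shares one weight, the learned structure is translation invariant by construction, mirroring the design stated above.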
Figure 3: Learned synaptic strengths. The left-most column shows the classical receptive field shapes φ+^pre of presynaptic simple cells, and the top row those of the postsynaptic cells (φ+^post). The other squares show the spatial distribution of synaptic strengths of one presynaptic cell to a 9 × 9 patch of postsynaptic cells of constant orientation selectivity. Both types of squares use the same scale in retinal coordinates. The weights are coded in gray scale, with white corresponding to the highest weight. The central position of each 9 × 9 array corresponds to a connection between pre- and postsynaptic cells in the same hypercolumn. The weight distribution shown has been learned on the basis of the transient cortical responses defined in equation 2.2 with a Hebbian learning rule (see equation 2.4). The influence of cortical connections far exceeds the size of the classical receptive fields. Furthermore, the connections that are established prominently connect simple cells with nearly the same orientation in hypercolumns that in retinal coordinates refer to points lying in the direction of their preferred orientation. Thus, the structure supports collinearity and curvilinearity.
Figure 4: This figure shows long-range cortical synaptic strengths that develop if sustained cortical responses (see equation 2.3) are used in the Hebbian learning rule (see equation B.4). For an explanation of the display arrangement, refer to Figure 3. The influence of the strong gray-tone diagonal in the background of the images used can easily be detected in most of the synaptic connection fields. This example shows how the arbitrary background dominates the learning process of long-range connections when sustained cortical responses are used.
If the simulations are done with the same visual processing but with the Hebbian learning based on the sustained cortical responses (o*), the resulting connection structure is qualitatively different (see Figure 4). In this case, no general principles are learned, but the final connection structure reflects properties of the particular visual data used for learning.
Figure 5: One frame of a typical camera sequence used in the simulations. There is a single moving object in front of the observer and a structured background. The strong diagonal gray tone border in the background is used to illustrate the difference between sustained and transient responses in the learning rule. Such a biased background is a problem for static responses, which tend to learn just that bias. If transient responses are used, the distribution of orientations caused by the moving person is much broader.
3 Results

The simulations have been performed with sequences of camera images like the example shown in Figure 5. A typical sequence consisted of 100 to 200 frames and showed a person moving in front of the camera and performing some arm movements. In the whole sequence (one frame is shown in Figure 5), the person was the only moving object, and there was a strong diagonal gray tone border in the background.
The learned long-range connection structures for the transient (see equation 2.2) and the sustained (see equation 2.3) cortical responses are illustrated in Figures 3 and 4, respectively. The cells shown all have the same polarity and vary only in orientation. The results are very similar for the cells of the other polarity φ−, while cross connections between different polarities have not been examined in detail. We now interpret the results shown in Figure 3 that have been obtained when the transient cortical responses (see equation 2.2) have been used. In detail, the results are:

• The size of the long-range connection structure far exceeds the size of the classical receptive field.

• The strongest connections are established from the reference hypercolumn (central array positions) to itself. They connect a pool of identical presynaptic cell types with themselves (iso-orientation).

• There are, in addition, strong connections in the 3 × 3 neighborhood of the reference hypercolumn to iso-oriented cells.

• The connections from the (central) reference hypercolumn to other hypercolumns are made to cells with a similar orientation of pre- and postsynaptic receptive field, respectively. Looking at these connections, one can see that the direction of the strongest connections corresponds to the orientation of the receptive fields (collinearity). The receptive field is therefore somewhat extended by these long-range connections.

• The connectivity pattern diminishes in strength with the difference in orientation of pre- and postsynaptic receptive fields (curvilinearity).

This shows that the use of transient responses for Hebbian learning can lead in an efficient and robust way to a connection structure suited to form the anatomical basis for the Gestalt principle of collinearity and even curvilinearity (Field et al., 1993; Hess & Field, 1999; Guy & Medioni, 1996).
The importance of transient responses is elucidated by comparison with the results from the same learning rule applied to the sustained responses shown in Figure 4, caused by the same stimuli. The resulting horizontal connection structure is qualitatively different from the one in Figure 3. The most significant long-range connections are established to the postsynaptic cells in column 6 of Figure 4. These show a strong response to the tilted edge in the background of the sequence. Furthermore, almost all long-range connections are in the direction of that edge. To a lesser extent, the same holds for the connection strengths in the four columns neighboring column 6 (columns 4–8). Only the presynaptic cells with an orientation difference of 90 degrees to the gray tone border cells (shown in row 2 and the last row of Figure 4) show almost no developed connection structure. One can say that the learned connection structure is
dominated by the strong gray tone border that is an accidental property of the background. As the gray tone edge is constantly present in the image, cells with an orientation similar to that of the gray tone border are, according to their tuning curve, active as well. Therefore, although their presynaptic cell activation may be relatively weak, their connection strength to the gray tone border cells increases with each learning step. One could argue that this is a problem of the threshold θ used in equation 2.3 and that an increased threshold should filter out the weak responses to borders, so that just the presynaptic border cells connect to the postsynaptic border cells, leaving the rest mainly unchanged. This procedure could work for any particular image, as one can probably find a threshold that filters out the important borders and suppresses the weak responses for that particular image. However, given the variations in natural images, this threshold would have to be changed from image to image, because a weak response to a border in one image may be of the same magnitude as the response to the strongest border in another image. One could think of an adaptive threshold that decides about the presence of a border by taking its strength in relation to the maximum border strength found in the image. Related to this is the idea of normalizing the maximal responses in the image. These approaches have the additional disadvantage that even a "noise" image would participate in the learning process to the same degree as an image with very strong borders. The results show that the inclusion of transient responses in the model overcomes these conceptual problems by using information from two different cues: the gray tone and the change occurring between subsequent images. Therefore, only moving edges in the image take part in the learning process, while static ones have no influence.
Because it was not possible to learn collinearity with sustained cortical responses from the same amount of biased image data that was sufficient for transient ones, one can say that using transient responses is an efficient and robust way to do so. One could argue now that with different kinds of backgrounds presented, or with a large variety of directions present in one background, the disturbing influences would eventually average out, and in the end the same connection structure as with transient responses could emerge. With a well-balanced input, or at least a very large input of different visual scenes, this may be possible. It would, however, be a considerable risk for infants if early development depended critically on this condition. It may also be argued that the advantage of transient over sustained responses is only a consequence of the orientation bias in the background. In order to clarify the two previous points, we have applied our model to standard image collections, which we concatenated into sequences. In this case, transient responses are pointless, because there is no real movement. Learning from the sustained responses also yields a biased connection structure, which supports collinearity in the horizontal and vertical orientations but hardly in the oblique ones. See Figure 6 for the results on the database
Figure 6: Long-range cortical synaptic strengths learned from a sequence of static natural images. Transient responses make no sense in this case, so the sustained cortical responses (see equation 2.3) are used in the Hebbian learning rule (see equation B.4). For an explanation of the display arrangement, refer to Figure 3.
available from British Telecom. More stimulus sets together with the learning results are available from our Web site and described in appendix C.

4 Discussion

We have started from the assumption that there is a hierarchy of Gestalt principles and that the principle of common fate is more fundamental than the one of good continuation. Technically speaking, this can be reformulated as saying that motion is a primary cue for the task of segmenting objects
from the background. Closed boundaries (as supported by collinearity and curvilinearity) provide a secondary cue that can be learned using the primary one. In this situation, an important strategy is the distinction between observer motion and object motion. This information is probably detectable by a newborn, but it is very unlikely that it can already be used accurately on a cellular level, although there is evidence that this is possible even for 4-month-old infants (Kellman et al., 1987) and in the adult (Leopold & Logothetis, 1998). How can object motion be detected by newborns? A very easy way for the organism to achieve this distinction is to gaze in one direction for a considerable amount of time (Piaget, 1936). While the direction of gaze is fixed, changes of illumination on the retina cannot be caused by eye movements or observer motion. Consequently, the structuring process in the cortical model is strongly dependent on the feature constellations resulting from moving objects. Using the latter, we have shown (see Figure 3) that it is possible to learn, for example, the Gestalt principle of collinearity, or more precisely, the anatomical structure of long-range connections in primary visual cortex, which probably implements this Gestalt principle and was found experimentally by Schmidt et al. (1997). They showed that cells with an orientation preference in area 17 of the cat are linked preferentially to iso-oriented cells, a result reproduced by our model. Furthermore, the coupling strength diminishes with the difference in preferred orientation of pre- and postsynaptic cell. From a technical point of view, our system has learned by experience something similar to an association field (Field et al., 1993; Hess & Field, 1999), projection field (Grossberg & Williamson, 2001), or extension field (Guy & Medioni, 1996). These are well known to aid the concepts of collinearity and curvilinearity in technical computer vision algorithms and biological models.
Actually, the learned arrangement of synaptic strengths may provide a means of improving the extension field algorithm of Guy and Medioni (1996), as strong connections between cells with similar orientation are established in all directly neighboring hypercolumns, even if the hypercolumn lies in a spatial direction different from the pre- and postsynaptic orientation. Another lesson learned is that one presynaptic cell type should connect not only to one specific postsynaptic cell type in the target hypercolumn, but with a smaller synaptic weight to a range of types with similar orientation in the same target column. Recently, two models have been presented that address the ontogenesis of Gestalt principles. In Grossberg and Williamson (2001), a very detailed perceptual model emerges from learning. It is remarkable how well the performance of this model matches neurophysiological measurements. In Choe (2001), PGLISSOM, a variant of the self-organizing map, leads to lateral connectivities similar to the ones shown in Figure 3 after learning from patches of elongated gaussians. Our model differs from those in the respect that we consider only horizontal connections and show that these can be learned from real camera images. Using real sensor data complicates things
considerably, and consequently other details, for example, stability against contrast changes, have not been addressed. However, we could show that the complexity of data required for learning can be taken from actual motion. It is clear that these connections are only one aspect of a complete system, but we believe it is worthwhile to study the isolated subsystem. There are several possible reasons why the desired learning effect could be achieved by using the transient rather than the sustained responses. The first is the biased background. The moving object is much more likely to provide the rich variety of oriented edges required for learning. The transient responses do not react to the static background, and therefore that bias cannot influence the connection structure. This also suggests that object motion is preferable to observer motion. Pure observer motion against a static background would learn exactly the background biases, and the same holds for saccadic eye movements. Furthermore, the common motion of object edges gives strong hints that these edges belong together. This could not be achieved if the whole background moved consistently. In order to learn from sustained responses, the background would have to vary a lot in order for the biases of individual backgrounds to average out. Using only one sequence may be regarded as a weakness of our system, and, of course, it would provide far too little data to cover the environmental properties. This problem is greatly alleviated by assuming translation invariance and therefore employing massive weight sharing. Keeping this in mind, our results indicate that concentrating on the moving parts of a scene provides excellent preprocessing for the learning of collinearity. Actually, the fact that one sequence is sufficient clearly demonstrates the power of our preprocessing to select the data relevant for learning.
Also, the number of images in the collections is clearly rather small, but we tried to keep about as many as we had movie frames for a fair comparison. It may be argued that a biased background is an unrealistic assumption in the visual world of an ambulatory system. However, there is considerable orientation bias in collections of natural images, at least in the environment given by the campus of Duke University (Coppola, Purves, McCoy, & Purves, 1998). Our results indicate that moving objects yield a more even distribution of orientations, although we have not studied that systematically. The assumption that persons moving about can provide a major source of data for visual learning of young infants seems safe. It may also be argued that 100 static images are too few to learn from. However, even large image sets would show the bias described by Coppola et al. (1998). Learning from static images imprints this bias into the connection structure, as can be seen in Figure 6. At least our experiments show that learning is faster when based on the transient responses to image sequences showing real motion. Regarding the biological relevance of our model, many details have been omitted and many simplifications have been made in order to achieve a computationally tractable size. However, good results have been reached with a combination of basic mechanisms: spatiotemporal retinal filtering,
topology preservation in the cortical map, the transient nature of cortical cell responses, and Hebbian learning. It is a well-established fact that all those single building blocks of our model exist in the brain, so the complete model should be regarded as a valid approximation to one of the processes that organize perception during early development. The problem of learning relevant feature constellations from natural images has been known at least since the models of Marr (1982) were introduced. We think that the main reason for the difficulties faced by the approach is the concentration on feature constellations that are present in the whole image. Due to the nature of Hebbian learning (or other second-order correlation rules), the useful second-order feature constellations are much harder to detect in the whole image than in the part of the image representing a single object. Of course, usefulness is not a physical entity but arises from the natural desire of living creatures to be able to distinguish objects for various purposes, like grabbing, escape, or ingestion, which in turn yields considerable evolutionary advantage. The problem probably gets harder the more complex the features become, because of their diminishing statistical significance in whole images of natural scenes. If one considers features like collinearity, vertices, or closed boundaries, which define a geometric object, or combinations of closed boundaries, which are listed here in order of their assumed complexity, then the statistical significance of finding those features in whole images probably not only diminishes with rising complexity, but the more complex features are probably not encountered often enough to make a difference statistically. And even if there is a small significance for those complex features, it would take a long time and a lot of different scenes to learn them using a statistical algorithm.
For the case of collinearity, Krüger (1998) was able to show that collinearity and short-range parallelism are statistically significant features of natural images if the set of images examined is large and varied enough. Geisler, Perry, Super, and Gallogly (2001) link this to the psychophysical performance of contour grouping and conclude that this must be due to an underlying neuronal structure. Our system shows that the necessary variation can be derived from a single scene with one object moving across it for long enough that the movement covers all image locations. Both studies are opposite extremes; the reality for a newborn consists of neither snapshots of many possible scenes nor a single moving object. Both together show that the neural circuits underlying the collinearity Gestalt principle can be learned from natural input. An important aspect pointed out by Geisler et al. (2001) and Simoncelli and Olshausen (2001) is that simple linear correlations are not sufficient to extract interesting statistics from natural data. Hebbian learning, however, relies on linear correlations. In our case, it has been applied to nonlinearly preprocessed data, so there is no contradiction here. Geisler et al. (2001) also find that curvilinearity cannot be learned from simple co-occurrence but requires Bayesian co-occurrence statistics. Again, this is not a contradiction
due to the nonlinearities in our model, which are motivated biologically rather than statistically. Comparison of our results to the ones from Hoyer and Hyvärinen (2002) is made difficult by the fact that the assumptions about the underlying network are different. Their model relies on a feedforward structure for contour coding, while our emphasis is on horizontal connections. It seems that horizontal connections would hurt the statistical independence of the cells they connect, so our model does not map naturally onto the ICA concept. The biological evidence is certainly too sparse to make a decision in favor of one of these assumptions. This is clearly a point that requires further analysis on the modeling as well as on the biological side. Further technical studies (Pötzsch, 1999) indicate that feature constellations of higher complexity, such as vertices, which seem to play an important part in object recognition, cannot be learned by simple correlation learning rules that operate on the features of the whole image. As indicated above, these useful features are probably not statistically significant features of natural images. If this is the case, complex features can be learned from natural images only if there is purposeful behavior that carefully selects the data worthwhile to be learned. Concentrating on moving objects seems to be a good strategy even in the absence of a tracking mechanism. It can be conjectured that more sophisticated mechanisms like head and eye saccades can boost learning further. Reinagel and Zador (1999) show such an effect on learned image statistics; therefore, a positive effect on learning Gestalt rules may be expected as well.

Appendix A: Model of Subcortical Processing

A.1 Retina Model. We do not model the development of the retina itself during the postnatal weeks. Instead, we extend and modify an existing retina model (Gaudiano, 1994) to compute ON- and OFF-Y-ganglion cell responses.
Once this is done, we discretize the continuous differential equations in time by two values for each cell type (ON and OFF). The activity rate v_st corresponds to the tonic or steady-state part of the continuous model, and the transient rate v_tr to an upper bound for the maximum transient or phasic rate of the ganglion cell response (see Figure 2). These discrete approximations of the retina model are essential because their output is used in the model for the cortical layer. To understand how these equations are derived, we now go into the details of the continuous retina model. Note that the spatial dependency of the functions used is omitted to improve readability.

A.2 The Photoreceptors. Given the light intensity value p_n for each pixel of the camera image recorded at time t_n, we define a function p_In(t) that is equal to p_n for all times t in the interval [t_n, t_{n+1}). The values of this function p_In(t) lie in the interval [p_In^min, p_In^max], which is given by the camera as
[0, 255]. We compute the nonlinear photoreceptor response r(t) according to

r(t) = z(t) p(t)                                        (A.1)

with

p(t) = (p_max − p_min) p_In(t) / p_In^max + p_min.      (A.2)
Here, the biological fact is taken into account that at extreme light intensities, a photoreceptor can perform temporal high-pass filtering. To achieve this, the photoreceptor adapts its response rate under extreme constant light intensity toward its base rate activity by multiplying the visual input p(t) (see equation A.2) with an internal state z(t). The way this internal state is computed makes clear that it acts as a short-term memory for light intensity. By this light adaptation mechanism, a sudden change to a given extreme light intensity level produces a short-term activity rate substantially different from the one produced under a constant light intensity of the same extreme magnitude. The photoreceptor response r(t) is computed by using the transformed light intensity p(t). The simple linear rescaling in equation A.2 to the interval [p_min, p_max] is necessary because the increased minimal value allows for dynamic photoreceptor behavior. We have chosen p_min = 30, p_max = 255. The chosen value of p_min is not particularly critical but should be well above zero, because otherwise low light intensities in the visual world cannot be modulated by the internal state of the photoreceptor and therefore cannot trigger a dynamic photoreceptor response distinguishable from the response to constant low light intensity. The rate of change dz/dt of the internal parameter depends on the relative intensity p_rel(t) (see equation A.4) of the visual stimulus that the photoreceptor receives:

dz/dt = F [M − z(t)] − H p_rel(t) z(t)                  (A.3)

with

p_rel(t) = (p(t) − p_min) / (p_max − p_min).            (A.4)
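A minimal discrete-time sketch of this adapting photoreceptor (equations A.1–A.4), using simple Euler integration. The gains F, M, and H are placeholders, since the paper adopts Gaudiano's values, which are not restated here:

```python
import numpy as np

def photoreceptor_response(p_raw, dt=0.01, F=1.0, M=1.0, H=2.0,
                           p_min=30.0, p_max=255.0):
    """Simulate the adapting photoreceptor of equations A.1-A.4.

    p_raw: 1-D sequence of camera intensities in [0, 255], one per time step.
    F, M, H are placeholder gains (the paper uses Gaudiano's values).
    Returns the response r(t) for every time step.
    """
    # Rescale camera intensities into [p_min, p_max] (equation A.2).
    p = (p_max - p_min) * np.asarray(p_raw, float) / 255.0 + p_min
    z = M  # internal adaptation state, starting fully dark-adapted
    r = np.empty_like(p)
    for n, pn in enumerate(p):
        r[n] = z * pn                              # r(t) = z(t) p(t) (A.1)
        p_rel = (pn - p_min) / (p_max - p_min)     # relative intensity (A.4)
        # Euler step of dz/dt = F [M - z] - H p_rel z (equation A.3).
        z += dt * (F * (M - z) - H * p_rel * z)
    return r
```

With these placeholder gains, a step from darkness to full intensity produces a large transient response that decays to a lower adapted rate, which is exactly the high-pass behavior the text describes.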
Here F, M, and H have the values chosen by Gaudiano (1994). M is the maximal value of z(t), and F is a gain parameter controlling how fast z(t) approaches M in the absence of light (p_rel(t) ≡ 0). On the other hand, H p_rel(t) controls the decay of z(t).

A.3 Bipolar and Ganglion Cells. The next steps in retinal processing are the integration of photoreceptor activity by horizontal cells and the feedforward processing by bipolar cells. We summarize the lateral influences of horizontal and amacrine cells in a later step of the model (see equations A.7 and A.8) and concentrate on the straightforward modeling of bipolar cell responses b^+ and b^−. It is assumed that one bipolar cell receives input from just one photoreceptor. The value r_max = p_max M is given by equation A.1, as M is the maximal value of the internal state z(t) of a photoreceptor and p_max
the maximal light intensity:

b^+(t) = r(t)   and   b^−(t) = r_max − r(t).            (A.5)
The ganglion cell activity v(t) itself is modeled using the shunting equation A.6, first used by Grossberg (1970) and Sperling (1970). In our particular equation, the ganglion activity rate v(t) is limited to the interval [0, 1] without passive decay:

dv(t)/dt = [1 − v(t)] u(t) − v(t) w(t).                 (A.6)
The excitatory input u(t) and the inhibitory input w(t) contribute to the ganglion cell response in a PUSH-PULL way, which means that each bipolar cell contributes to both the center and the surround mechanism of the ganglion cell in different quantities (McGuire, Stevens, & Sterling, 1986). For ON-ganglion cells, these influences have been modeled by the following equations, where we use the gaussians c (center) and s (surround) with the parameters of the Gaudiano model for Y-ganglion cells extended to two dimensions:

u(t) = c * b^+(t) + s * b^−(t),                         (A.7)
w(t) = s * b^+(t) + c * b^−(t).                         (A.8)
Biologically, these influences come from lateral integration mediated by horizontal and amacrine cells. For OFF-ganglion cells, equations A.7 and A.8 have been used with b^+(t) and b^−(t) interchanged. In our simulations, the center gaussian has a width σ of 3.18 pixels and the surround gaussian of 3.89 pixels. The analytical solutions of equation A.6 for ON- and OFF-cells then are

v_ON(t) = E/p_max [c * r(t) − s * r(t)] + F_ON,         (A.9)
v_OFF(t) = −E/p_max [c * r(t) − s * r(t)] + F_OFF,      (A.10)

where the following abbreviations have been used for convenience of notation:

E := 1/(M V_s + M V_c),                                 (A.11)
F_ON := M V_s/(M V_s + M V_c),                          (A.12)
F_OFF := M V_c/(M V_s + M V_c).                         (A.13)

The quantities used in these definitions are as follows: V_s and V_c are the integrals over the two gaussians used for modeling the center and surround influences in the PUSH-PULL model.
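The steady-state ON/OFF responses of equations A.9–A.13 reduce to a difference-of-gaussians filter plus a resting offset. The following sketch assumes unit-normalized Gaussian kernels (so V_s = V_c = 1, making F_ON = F_OFF = 1/2); the helper names `blur` and `ganglion_responses` are illustrative, not the authors' code:

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel normalized to unit sum."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(img, sigma):
    """Separable 2-D Gaussian convolution with reflective borders."""
    k = gaussian_kernel(sigma)
    pad = len(k) // 2
    out = np.pad(img, pad, mode='reflect')
    out = np.apply_along_axis(lambda m: np.convolve(m, k, 'valid'), 0, out)
    out = np.apply_along_axis(lambda m: np.convolve(m, k, 'valid'), 1, out)
    return out

def ganglion_responses(r, sigma_c=3.18, sigma_s=3.89, M=1.0, p_max=255.0):
    """Steady-state ON/OFF ganglion activities (equations A.9 and A.10).

    r: 2-D photoreceptor response image.  With unit-normalized kernels
    the integrals V_s and V_c both equal 1 (a simplifying assumption).
    """
    Vc = Vs = 1.0
    E = 1.0 / (M * Vs + M * Vc)            # equation A.11
    F_on = M * Vs / (M * Vs + M * Vc)      # equation A.12
    F_off = M * Vc / (M * Vs + M * Vc)     # equation A.13
    dog = blur(r, sigma_c) - blur(r, sigma_s)   # c * r - s * r
    v_on = E / p_max * dog + F_on          # equation A.9
    v_off = -E / p_max * dog + F_off       # equation A.10
    return v_on, v_off
```

On a uniform image the difference-of-gaussians term vanishes and both cell types sit at their resting rate 1/2; a bright spot raises the ON response and lowers the OFF response at its center.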
So far, we have presented a continuous model for ON- and OFF-ganglion cells, and one can discuss various aspects of the model, like the validity of the simplifications used; for example, where exactly in the retina the spatial integration takes place, how the temporal filtering is probably done, and more. We refer the reader to Gaudiano (1994) for these arguments. As shown in Figure 2, the continuous time course is now approximated by two values, computed as follows:

r_tr = p_n z_{n−1},                                     (A.14)
z_n = M F/(F + H p_n),                                  (A.15)
r_st = p_n z_n.                                         (A.16)
With these approximations and using the continuous equations for the ganglion cell responses (equations A.9 and A.10), we can approximate the ON- and OFF-ganglion cell activity discretely in time and represent each of them by two values that describe the spatiotemporal properties of the ganglion cell responses:

v^ON_tr = E/p_max [c * r_tr − s * r_tr] + F_ON,         (A.17)
v^ON_st = E/p_max [c * r_st − s * r_st] + F_ON,         (A.18)
v^OFF_tr = −E/p_max [c * r_tr − s * r_tr] + F_OFF,      (A.19)
v^OFF_st = −E/p_max [c * r_st − s * r_st] + F_OFF.      (A.20)
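The two-value approximation of equations A.14–A.16 reduces, per frame, to a pair of closed-form expressions: the transient rate combines the new input with the old adaptation state, the sustained rate uses the fully adapted state. A sketch, with F, M, H again as placeholder parameters:

```python
import numpy as np

def discrete_rates(p_prev, p_curr, F=1.0, M=1.0, H=2.0):
    """Two-value approximation of the photoreceptor output (A.14-A.16).

    p_prev, p_curr: successive frame intensities (scalars or arrays).
    Returns the transient bound r_tr and the steady-state rate r_st.
    """
    z_prev = M * F / (F + H * p_prev)   # adapted state for previous frame (A.15)
    z_curr = M * F / (F + H * p_curr)
    r_tr = p_curr * z_prev              # transient: new input, old state (A.14)
    r_st = p_curr * z_curr              # sustained: fully adapted (A.16)
    return r_tr, r_st
```

A luminance step between frames therefore yields r_tr > r_st (a transient overshoot), while a constant input gives r_tr = r_st, which is the property the Hebbian rule in equation 2.4 exploits to single out moving edges.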
Appendix B: Cortex Model

How should the cortical layer be modeled? Recall from section 1.3 that after birth, 90% of all active cells are of simple type. For that reason, we will not consider complex cells here, as they probably play only a minor role directly after birth and most likely develop afterward. Additionally, the development of the long-range connection structure of different kinds of simple cells after birth probably occurs at different speeds. The development of the long-range connection structure between edge detector (odd symmetry) cells of all scales is expected to be more robust than that of bar detector (even symmetry) cells immediately after birth. One reason for this is the increased sensitivity to low spatial frequency gratings of the cortical cells (see section 1.3) in the newborn. This phenomenon assigns a special role to the edge detector cells. An edge between two large areas of high and low luminance, respectively, remains an edge for all edge detector cells independent of their scale or spatial frequency. This is not true, for example, for a bar detector cell, as the strength of the response is determined by the preferred spatial frequency and the size of the bar presented. Therefore, we restrict our model to simple cells with odd symmetry that are functionally edge detectors.
How can edge detector cells be modeled? Equation 2.1 describes the primary afferent response of a simple cell, and the terms g_ON and g_OFF are now defined in detail. The most important feature of an edge detector cell is its preferred orientation φ, and for each orientation there are two edge detector cells: one with positive (φ+) and one with negative (φ−) polarity. They can be modeled by using the sine part of a Gabor function, equation B.1, with different signs. In our simulations, we use cells with eight different orientations φ and do not model other properties of simple cells like selectivity for motion direction. The formulas are as follows:

g(x, y, φ) = exp(−k²(x² + y²)/(2σ²)) · sin(kx cos φ + ky sin φ),   (B.1)

φ+: g_ON = max(+g, 0),   g_OFF = max(−g, 0),            (B.2)
φ−: g_ON = max(−g, 0),   g_OFF = max(+g, 0).            (B.3)
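A sketch of these odd-symmetric receptive fields and their rectified ON/OFF parts (equations B.1–B.3). The patch size and σ = 1.8 are illustrative choices of ours; the text only fixes k = 0.5 and requires σ < 2:

```python
import numpy as np

def gabor_odd(size, phi, k=0.5, sigma=1.8):
    """Odd-symmetric (sine) Gabor receptive field of equation B.1.

    size: side length of the square filter patch in pixels (odd result).
    sigma = 1.8 is an illustrative value satisfying the sigma < 2 bound.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = np.exp(-k**2 * (x**2 + y**2) / (2 * sigma**2))
    return envelope * np.sin(k * x * np.cos(phi) + k * y * np.sin(phi))

def edge_cell_kernels(phi, polarity, size=15, **kw):
    """ON/OFF afferent kernels g_ON, g_OFF (equations B.2 and B.3)."""
    g = gabor_odd(size, phi, **kw)
    if polarity == '+':
        return np.maximum(g, 0), np.maximum(-g, 0)   # phi+ cell (B.2)
    return np.maximum(-g, 0), np.maximum(g, 0)       # phi- cell (B.3)
```

Because the sine factor is odd and the envelope even, the integral of g vanishes on a symmetric grid, which is exactly the property referred to below when discussing the nonzero resting activity.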
The parameters are σ < 2 and k = 0.5. A value of σ above two leads to receptive fields with more than two significant ON- or OFF-subfields, in contrast to the data about simple cells. Note that the resting activity of a cortical neuron in equation 2.1 is nonzero despite the vanishing integral of g in equation B.1. The reason is that the resting activity of the retinal ganglion cells is nonzero, and there are only afferent connections to the cortex. Biologically, it is not likely that the synaptic strengths of the afferent thalamocortical connections are so finely tuned in the first postnatal weeks as the Gabor-like receptive fields imply. With the highly specific connection structure given in equations B.2 and B.3, the effects of the short-range intracortical inhibitory and excitatory connections are implicitly modeled, which are the probable basis for sharp orientation tuning (Somers, Nelson, & Sur, 1995). However, using parts of a Gabor function is an easy alternative way to implement some form of orientation tuning without the computational cost of modeling the short-range corticocortical excitatory and inhibitory feedback connections.

B.1 Cortical Organization. We assume that the retinogeniculate and thalamocortical pathways already exist at birth and that their mapping is already retinotopic (see section 1.3). Simple cells sensitive to low spatial frequencies also exist, and their theoretical primary afferent responses are modeled using equation 2.1. The receptive field center positions p ∈ ℕ² for the simple cells are placed in the image at grid points with a distance of four pixels in the horizontal and vertical directions. For each of those grid positions, a hypercolumn consisting of simple cells of eight different receptive field orientations is modeled. Within one hypercolumn, each specific orientation is represented by two cells φ+ and φ− with receptive field properties defined by equations B.2 or B.3. All cells in one hypercolumn have the same receptive field center p_0 in retinal coordinates, and neighboring hypercolumns have different receptive field centers p_k. Each cell of a
hypercolumn establishes connections with all neurons of its own hypercolumn and with all neurons in the neighboring four hypercolumns in each direction of the cortical plane (a 9 × 9 neighborhood). Each model cell represents biologically a pool of cells with nearly the same properties, and therefore a model cell was allowed to make connections to itself. To avoid border artifacts, learning of long-range connections was disabled in the first four hypercolumns at the border of the cortical plane.

B.2 Learning Horizontal Connections. We now shift our focus to the organization of the long-range cortical connections illustrated at the bottom of Figure 1 and in more detail in Figure 7, in which the repeated regular structure represents a hypercolumn with eight orientations (only four are shown) and two polarities. A small segment of the cortical plane is shown, and the dashed arrows in it illustrate the range of the cortical horizontal connections. To adapt the synaptic weights, the difference in primary afferent input in equation 2.2 is used in equation 2.4 to select only those cells that have a transient cortical response. Equation 2.4 is a Hebbian learning rule. Δw_ij is the change of synaptic strength between neurons j and i, and ε is a general learning factor that controls the impact of each learning step. The choice of ε is not critical. To illustrate the effect of transient responses, we have also examined a Hebbian learning rule that uses only the sustained (see equation 2.3) cortical responses:

Δw_ij = ε o_i* o_j*.   (B.4)
This corresponds to the classical interpretation of the Hebbian postulate in neural modeling because it is a correlation learning rule that operates directly on the input. All cells that have a primary afferent response above the baseline activity θ of equation 2.3 may participate in the structuring process. After the increment of equation 2.4 or B.4 is used to update the synaptic strengths, equation B.5, two additional procedures are incorporated. To avoid artificially high synaptic values, a connection strength above one is reset to one by equation B.6. To introduce competition between synapses, the total synaptic strength for a given ensemble C is held constant at K by normalizing the weights after each learning step and multiplying them by K, equation B.7:

w_ij ← w_ij + Δw_ij   (B.5)
w_ij ← min(w_ij, 1)   (B.6)
w_ij ← w_ij K / Σ_C w_ij.   (B.7)
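The update-clip-normalize cycle of equations B.5 to B.7 can be sketched as follows. This is a minimal illustration and not the authors' code: the array shapes, the default values of the learning factor ε and of K, and the choice of treating each row (all inputs to one cell) as the ensemble C are assumptions made for the example.

```python
import numpy as np

def hebbian_step(w, o_pre, o_post, eps=0.01, K=1.0):
    """One learning step on a weight matrix w (post x pre).

    Sketch of equations B.4-B.7; assumes nonnegative responses so the
    normalization denominator stays positive."""
    # B.4 / 2.4: Hebbian increment from pre- and postsynaptic responses.
    dw = eps * np.outer(o_post, o_pre)
    # B.5: update the synaptic strengths.
    w = w + dw
    # B.6: reset connection strengths above one to one.
    w = np.minimum(w, 1.0)
    # B.7: hold the total strength of each ensemble C constant at K
    # (here C is taken to be a row of w, an assumption of this sketch).
    w = w * K / w.sum(axis=1, keepdims=True)
    return w
```

After the normalization step, every row sums to K regardless of the size of the Hebbian increment, which is what introduces the competition between synapses described in the text.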
One of the steps to make the model applicable to natural image data is the introduction of ensembles C of equivalent connections. First, we will
Learning the Gestalt Rule of Collinearity from Object Motion
Figure 7: The cortical horizontal connection scheme. A small segment of the cortical plane is shown to illustrate the range of the long-range connections (indicated by the dashed arrows) and some connections of an ensemble of equivalent connections (indicated by the solid arrows and explained in the text). The repeated regular structure represents a hypercolumn with 8 orientations (just four are shown) and two polarities. Refer to section B.1 for a detailed explanation.
give a technical description of such an ensemble and then explain the biological motivation for building those ensembles. All established connections can be characterized by four parameters: the receptive field center position p⃗ ∈ ℕ² of the presynaptic cell, the spanned distance measured in hypercolumns r⃗ ∈ ℤ², the orientation and polarity of the presynaptic simple cell φ₊,₋^pre, and the orientation and polarity of the postsynaptic simple cell φ₊,₋^post. Mathematically, one can build the equivalence classes from the relation that two connections—and with them their synaptic weights—are equivalent if they have the same r⃗, φ₊,₋^pre, and φ₊,₋^post but possibly different centers.
By doing this, we have introduced translational invariance of the synapses that connect the same pre- and postsynaptic cell types over the same distance of hypercolumns, which is illustrated for a few connections by the solid arrows in Figure 7. As we have two polarities and eight different types of cells as pre- and postsynaptic cells and 9 × 9 different r⃗, we have 1296 ensembles of connections for each presynaptic cell type of the reference hypercolumn and a total of 20,736 ensembles for this hypercolumn as a whole. As an example, the four solid arrows in Figure 7 are equivalent and are forced to have identical weights. What is the biological foundation for building these ensembles? Consider a newborn with a moving object in front of it. The newborn will gaze in one direction, and the image of the object moves over the retina. Then, on a larger timescale, the newborn will shift its head, trying to follow the stimulus—not very successfully, because of the low tracking accuracy after birth (Hofsten et al., 1998; Piaget, 1936)—and gaze again in the new direction. This happens several times until the newborn has lost the object. If we now think of the visual input the newborn receives in terms of object features projected on the retina, we see that the same features are shifted on the retina because of the object motion and because of the more or less randomly distributed eye or head saccades of the newborn. One could argue that this should mainly affect horizontally neighboring cells and not vertically neighboring cells because most movements are on the horizontal surface. But when the newborn is gazing in one direction and then moves its head or eyes to gaze in another direction, it is very unlikely that this can be done without a vertical shift, given the low accuracy of tracking movements in the newborn.
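The bookkeeping behind these equivalence classes can be sketched as follows. This is an illustrative reconstruction: the key layout and the symmetric 9 × 9 offset range are assumptions chosen to reproduce the counts given in the text (16 cell types, 1296 ensembles per presynaptic type, 20,736 in total).

```python
from itertools import product

# A connection is characterized by the hypercolumn offset r and the
# pre- and postsynaptic cell types (8 orientations x 2 polarities = 16).
ORIENTATIONS = range(8)
POLARITIES = ('+', '-')
OFFSETS = [(dx, dy) for dx in range(-4, 5) for dy in range(-4, 5)]  # 9 x 9

def ensemble_key(offset, pre_orient, pre_pol, post_orient, post_pol):
    """Two connections with the same key are equivalent and are forced
    to share one weight (translational invariance over the centers)."""
    return (offset, pre_orient, pre_pol, post_orient, post_pol)

cell_types = list(product(ORIENTATIONS, POLARITIES))   # 16 cell types
per_pre_type = len(OFFSETS) * len(cell_types)          # 81 * 16 = 1296
total = per_pre_type * len(cell_types)                 # 1296 * 16 = 20,736
```

Storing one weight per key instead of one weight per physical connection is exactly the reduction described in the next paragraph: one synapse is modeled per ensemble.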
The structuring of the long-range connection strengths in the cortical region that corresponds to the retinal area covered by the object will, of course, average in time over all presented stimuli, and this average should be the same for equivalent connections because the object features seen have been the same on average. The result should be connection strengths of roughly the same magnitude for equivalent connections. Therefore, we can model just one synapse for each ensemble of connections and reduce the number of synapses drastically.

Appendix C: Input Data

We have applied the model to two movies of a person moving and waving his arms in front of a background from a seminar room. The backgrounds showed different biases. The sequence "moving.mpg" consists of 199 frames, and the background contains a diagonal rectangle. In the course of revising this article, we also applied the model to the sequence "fw carsten.mpg," which is in color and has a larger resolution because it was collected for a different experiment. This movie consists of 100 frames and shows no strong diagonals in the background. Sampling has been adjusted and color ignored. The results for the transient responses are
very similar to the ones for the other sequence. The ones for the sustained responses reflect the different background bias. To clarify the relationship to static natural images, we have applied the model to movies made out of 60 frames from a texture database from MIT (http://www-white.media.mit.edu/vismod/imagery/VisionTexture/) and one with 98 frames from a scene database by British Telecom (ftp://ftp.vislist.com/IMAGERY/BT scenes/). All four sequences as well as the respective learning processes can be retrieved on-line from ftp://ftp.neuroinformatik.ruhr-uni-bochum.de/pub/pictures/gestalt/.

Acknowledgments

This work has been funded by the DFG in SFB509 and the KOGNET Graduiertenkolleg and by the BMBF in the LOKI project (01IB 001D). Partial funding from the MUHCI project by the European Community is also gratefully acknowledged. We thank the reviewers for thoughtful comments and important hints for improvement of the original manuscript.

References

Albus, K., & Wolf, W. (1984). Early post-natal development of neuronal function in the kitten's visual cortex: A laminar analysis. Journal of Physiology, 348, 153–185.
Atkinson, J., & Braddick, O. (1992). Visual segmentation of oriented textures by infants. Behav. Brain Res., 49(1), 123–131.
Attneave, F. (1954). Some informational aspects of visual perception. Psychological Review, 61, 183–193.
Barlow, H. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barten, S., Birns, B., & Ronch, J. (1971). Individual differences in visual pursuit behavior of neonates. Child Development, 42, 313–319.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Birch, E. (1985). Infant interocular acuity differences and binocular vision. Vision Research, 25(4), 571–576.
Birch, E., & Salomão, S. (1998).
Infant random dot stereoacuity cards. Journal of Pediatric Ophthalmology and Strabismus, 35(2), 86–90.
Bosking, W., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. Journal of Neuroscience, 17(6), 2112–2127.
Braddick, O., & Atkinson, J. (1979). A photorefractive study of infant accommodation. Vision Research, 19, 1319–1330.
Burkhalter, A., Bernardo, K. L., & Charles, V. (1993). Development of local circuits in human visual cortex. Journal of Neuroscience, 13(5), 1916–1931.
Callaway, E. M. (1998). Prenatal development of layer-specific local circuits in primary visual cortex of the macaque monkey. Journal of Neuroscience, 18(4), 1505–1527.
Choe, Y. (2001). Perceptual grouping in a self-organizing map of spiking neurons. Unpublished doctoral dissertation, University of Texas at Austin. Available on-line: http://www.cs.tamu.edu/faculty/choe/ftp/choe.diss.pdf.
Coppola, D. M., Purves, H. R., McCoy, A. N., & Purves, D. (1998). The distribution of oriented contours in the real world. PNAS, 95, 4002–4006.
Courage, M., & Adams, R. (1996). Infant peripheral vision: The development of monocular visual acuity in the first 3 months of postnatal life. Vision Research, 36(8), 1207–1215.
Field, D. J., Hayes, A., & Hess, R. F. (1993). Contour integration by the human visual system: Evidence for a local "association field." Vision Res., 33(2), 173–193.
Fitzpatrick, D. (1997). The functional organization of local circuits in visual cortex: Insights from the study of tree shrew striate cortex. Cerebral Cortex, 385(6), 535–538.
Gaudiano, P. (1994). Simulations of X and Y retinal ganglion cell behavior with a nonlinear push-pull model of spatiotemporal retinal processing. Vision Research, 34, 1767–1784.
Geisler, W., Perry, J., Super, B., & Gallogly, D. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vision Research, 41(6), 711–724.
Grossberg, S. (1970). Neural pattern discrimination. Journal of Theoretical Biology, 27, 291–337.
Grossberg, S., & Mingolla, E. (1985). Neural dynamics of perceptual grouping: Textures, boundaries, and emergent segmentation. Perception and Psychophysics, 38, 141–171.
Grossberg, S., & Williamson, J. (2001). A neural model of how horizontal and interlaminar connections of visual cortex develop into adult circuits that carry out perceptual grouping and learning. Cerebral Cortex, 11(1), 37–58.
Guy, M., & Medioni, G. (1996). Inferring global perceptual contours from local features.
International Journal of Computer Vision, 20, 113–133.
Hainline, L., Riddell, J., Grose-Fifer, J. J., & Abramow, I. (1992). Development of accommodation and convergence in infancy. Behavioural Brain Research, 49, 33–50. Special issue.
Hess, R., & Field, D. (1999). Integration of contours: New insights. Trends in Cognitive Sciences, 3(12), 480–486.
Hofsten, C., Vishton, P., Spelke, E., Feng, Q., & Rosander, K. (1998). Predictive action in infancy: Tracking and reaching for moving objects. Cognition, 67(3), 255–285.
Hoyer, P. O., & Hyvärinen, A. (2002). A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12), 1593–1605.
Isaac, J., Crair, M., Nicoll, R., & Malenka, R. (1997). Silent synapses during development of thalamocortical inputs. Neuron, 18(2), 269–280.
Katz, L. C., & Callaway, E. M. (1992). Development of local circuits in mammalian visual cortex. Annual Review of Neuroscience, 15, 31–56.
Kellman, P. J., Gleitman, H., & Spelke, E. (1987). Object and observer motion in the perception of objects by infants. J. Exp. Psychol. Hum. Percept. Perform., 13(4), 586–593.
Kellman, P. J., & Short, K. (1987). Development of three-dimensional form perception. J. Exp. Psychol. Hum. Percept. Perform., 13(4), 545–557.
Kellman, P. J., Spelke, E., & Short, K. (1986). Infant perception of object unity from translatory motion in depth and vertical translation. Child Development, 57(1), 72–86.
Koffka, K. (1935). Principles of gestalt psychology. London: Lund Humphries.
Kovacs, I. (2000). Human development of perceptual organization. Vision Research, 40(12), 1301–1310.
Krüger, N. (1998). Collinearity and parallelism are statistically significant second-order relations of complex cell responses. Neural Processing Letters, 8, 117–129.
Leopold, D., & Logothetis, N. (1998). Microsaccades differentially modulate neural activity in the striate and extrastriate visual cortex. Experimental Brain Research, 123(3), 341–345.
Li, Z. (1998). A neural model of contour integration in the primary visual cortex. Neural Computation, 10, 903–940.
Löwel, S., & Singer, W. (1992). Selection of intrinsic horizontal connections in the visual cortex by correlated neuronal activity. Science, 255, 209–212.
Malach, R., Amir, Y., Harel, M., & Grinvald, A. (1993). Relationship between intrinsic connections and functional architecture revealed by optical imaging and in-vivo targeted biocytin injections in primate striate cortex. PNAS, 90(22), 10469–10473.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.
McGuire, B. A., Stevens, J., & Sterling, P. (1986). Microcircuitry of beta ganglion cells in cat retina. Journal of Neuroscience, 6(4), 907–918.
Morrone, M., Burr, D., & Fiorentini, A. (1990). Development of contrast sensitivity and acuity of the infant colour system. Proc. R. Soc. Lond. B. Biol.
Sci., 242(1304), 134–139.
Norcia, A., Tyler, C., & Hamer, R. (1990). Development of contrast sensitivity in the human infant. Vision Research, 30(10), 1475–1486.
Olshausen, B. A., & Field, D. J. (1996). Wavelet-like receptive fields emerge from a network that learns sparse codes for natural images. Nature, 381, 607–609.
Piaget, J. (1936). La naissance de l'intelligence chez l'enfant. New York: International Universities Press.
Pötzsch, M. (1999). Object-contour statistics extracted from natural image sequences. Unpublished doctoral dissertation, Ruhr-Universität Bochum. Available on-line: ftp://ftp.neuroinformatik.ruhr-uni-bochum.de/pub/manuscripts/theses/poetzsch.pdf.
Reinagel, P., & Zador, A. M. (1999). Natural scene statistics at the center of gaze. Network: Computation in Neural Systems, 10, 341–350.
Ross, W., Grossberg, S., & Mingolla, E. (2000). Visual cortical mechanisms of perceptual grouping: Interacting layers, networks, columns, and maps. Neural Networks, 13(6), 571–588.
Schmidt, K., Goebel, R., Löwel, S., & Singer, W. (1997). The perceptual grouping criterion of colinearity is reflected by anisotropies of connections in the primary visual cortex. European Journal of Neuroscience, 5(9), 1083–1084.
Simoncelli, E., & Olshausen, B. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1216.
Skoczenski, A., & Norcia, A. (1998). Neural noise limitations on infant visual sensitivity. Nature, 391(6668), 697–700.
Snider, C., Dehay, C., Berland, M., Kennedy, H., & Chalupa, L. (1999). Prenatal development of retinogeniculate axons in the macaque monkey during segregation of binocular inputs. Journal of Neuroscience, 19(1), 220–228.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. Journal of Neuroscience, 15(8), 5448–5465.
Spelke, E., Breinlinger, K., Jacobson, K., & Phillips, A. (1993). Gestalt relations and object perception: A developmental study. Perception, 22, 1483–1501.
Sperling, G. (1970). Models of visual perception and contrast detection. Perception and Psychophysics, 8, 143–157.
Streri, A., & Spelke, E. (1989). Effects of motion and figural goodness on haptic object perception in infancy. Child Development, 60(5), 1111–1125.
Tootle, J. (1993). Early postnatal development of visual function in ganglion cells of the cat retina. Journal of Neurophysiology, 69(5), 1645–1660.
van Hateren, J., & Ruderman, D. (1998). Independent component analysis of natural image sequences yields spatiotemporal filters similar to simple cells in primary visual cortex. Proceedings of the Royal Society London B, 265, 2315–2320.
Wertheimer, M. (1923). Untersuchungen zur Lehre von der Gestalt II. Psychologische Forschung, 4, 301–350.
Yen, S., & Finkel, L. (1998). Extraction of perceptually salient contours by striate cortical networks. Vision Research, 38(5), 719–741.
Yonas, A., Arterberry, M. E., & Granrud, C. E. (1987).
Four-month-old infants' sensitivity to binocular and kinetic information for three-dimensional object shape. Child Development, 84(4), 910–917.
Zetzsche, C., & Krieger, G. (2001). Nonlinear mechanisms and higher-order statistics in biological vision and electronic image processing: Review and perspectives. Journal of Electronic Imaging, 10(1), 56–99.

Received September 13, 2001; accepted January 28, 2003.
LETTER
Communicated by Yoshua Bengio
Recurrent Neural Networks with Small Weights Implement Definite Memory Machines

Barbara Hammer
[email protected]
Department of Mathematics/Computer Science, University of Osnabrück, D-49069 Osnabrück, Germany
Peter Tiňo
[email protected]
School of Computer Science, University of Birmingham, Edgbaston, Birmingham B15 2TT, U.K.
Recent experimental studies indicate that recurrent neural networks initialized with "small" weights are inherently biased toward definite memory machines (Tiňo, Čerňanský, & Beňušková, 2002a, 2002b). This article establishes a theoretical counterpart: the transition function of a recurrent network with small weights and a squashing activation function is a contraction. We prove that recurrent networks with a contractive transition function can be approximated arbitrarily well on input sequences of unbounded length by a definite memory machine. Conversely, every definite memory machine can be simulated by a recurrent network with a contractive transition function. Hence, initialization with small weights induces an architectural bias into learning with recurrent neural networks. This bias might have benefits from the point of view of statistical learning theory: it emphasizes one possible region of the weight space where generalization ability can be formally proved. It is well known that standard recurrent neural networks are not distribution-independent learnable in the probably approximately correct (PAC) sense if arbitrary precision and inputs are considered. We prove that recurrent networks with a contractive transition function with a fixed contraction parameter fulfill the so-called distribution-independent uniform convergence of empirical distances property and hence, unlike general recurrent networks, are distribution-independent PAC learnable.

© 2003 Massachusetts Institute of Technology. Neural Computation 15, 1897–1929 (2003)

1 Introduction

Data of interest have a sequential structure in a wide variety of application areas, such as language processing, time-series prediction, financial forecasting, and DNA sequences (Laird & Saul, 1994; Sun, 2001). Recurrent neural networks and hidden Markov models constitute very
B. Hammer and P. Tiňo
powerful methods, which have been successfully applied to these problems (see, for example, Baldi, Brunak, Frasconi, Pollastri, & Soda, 2001; Giles, Lawrence, & Tsoi, 1997; Krogh, 1997; Nadas, 1984; Robinson, Hochberg, & Renals, 1996). Successful applications are accompanied by theoretical investigations that demonstrate the capacities of recurrent networks and probabilistic counterparts such as hidden Markov models.1 The universal approximation ability of recurrent networks has been proved in Funahashi and Nakamura (1993), for example; moreover, they can be related to classical computing mechanisms like Turing machines or even more powerful nonuniform Boolean circuits (Siegelmann & Sontag, 1994, 1995). Standard training of recurrent networks by gradient descent methods faces severe problems (Bengio, Simard, & Frasconi, 1994), and the design of efficient training algorithms for recurrent networks is still a challenging problem of ongoing research; Hochreiter and Schmidhuber (1997) provide a particularly successful approach and a further discussion of the problem of long-term dependencies. Besides, the generalization ability of recurrent neural networks constitutes a further, not yet satisfactorily solved question: unlike standard feedforward networks, common recurrent neural architectures possess a VC dimension that depends on the maximum length of the input sequences and is hence in theory infinite for arbitrary inputs (Koiran & Sontag, 1997; Sontag, 1998). The VC dimension can be thought of as expressing the flexibility of a function class to perform classification tasks. We will introduce a variant of the VC dimension, the so-called fat-shattering dimension. Finiteness of the VC dimension is equivalent to so-called distribution-independent probably approximately correct (PAC) learnability, that is, the ability of valid generalization from a finite training set, the size of which depends only on the given function class (Anthony & Bartlett, 1999; Vidyasagar, 1997).
Hence, prior distribution-independent bounds on the generalization ability of general recurrent networks are not possible. A first step toward posterior or distribution-dependent bounds for general recurrent networks without further restrictions can be found in Hammer (1999, 2001); however, these bounds are weaker than the bounds obtained via a finite VC dimension. Of course, bounds on the VC dimension of various restricted recurrent architectures can be derived, for example, for architectures implementing a finite automaton with a limited number of states (Frasconi, Gori, Maggini, & Soda, 1995), or for architectures with an activation function with finite codomain and a finite input alphabet (Koiran & Sontag, 1997). Moreover, the argumentation in Maass and Orponen (1998) and Maass and Sontag (1999) shows that the presence of noise in the computation severely limits the capacity of recurrent networks. Depending on the
1 Although hidden Markov models are usually defined on a finite state space, unlike recurrent neural networks, which possess continuous states.
RNNs with Small Weights Implement DMMs
support of the noise, the capacity of recurrent networks reduces to that of finite automata or even less. This fact provides a further argument for the limitation of the effective VC dimension of recurrent networks in practical implementations. However, these arguments rely on deficiencies of neural network training: the bounds on the generalization error that can be obtained in this way become worse the more computation accuracy and reliability can be achieved. The argumentation can only partially account for the fact that recurrent networks often generalize in practical applications after appropriate training and that they may show particularly good generalization behavior if advanced training methods are used (Hochreiter & Schmidhuber, 1997). In this article, we focus on the initial phases of recurrent neural network training by formally characterizing the function class of recurrent neural networks initialized with small weights. This allows us to compare the behavior of recurrent networks at the early stages of training with alternative tools for sequence processing. Furthermore, we will show that small weights constitute a sufficient condition for good generalization ability of recurrent neural networks even if arbitrary precision of the computation and arbitrary real-valued inputs are assumed. This argumentation formalizes one aspect of why recurrent neural network training is often successful: initialization with small weights biases neural network training toward regions of the search space where the generalization ability can be rigorously proved.
Naturally, further aspects may account for the generalization ability of recurrent networks if we allow for arbitrary weights, for example, the above-mentioned corruption of the network dynamics by noise, implicit regularization of network training due to the choice of the error function, or the fact that regions in the weight space that give a large VC dimension cannot be found by standard training because of the problem of long-term dependencies. Alternatives to recurrent networks or hidden Markov models have been investigated for which efficient training algorithms can be found and prior bounds on the generalization ability can be established. One possibility is networks with a time window for sequential data or fixed-order Markov models. Both alternatives use only a finite memory length; that is, they perform predictions based on a fixed number of sequence entries (Ron, Singer, & Tishby, 1996; Sejnowski & Rosenberg, 1987). Particularly efficient modifications are variable memory length Markov models, which adapt the necessary memory depth to contexts in the given input sequence (Bühlmann & Wyner, 1999). Various applications can be found in Guyon and Pereira (1995), Ron et al. (1996), and Tiňo and Dorffner (2001), for example. Note that some of these approaches propose alternative notations for variable-length Markov models that are appropriate for specific training algorithms, such as prediction suffix trees or iterated function systems. Markov models are much simpler than general hidden Markov models since they operate on
only a finite number of observable contexts.2 Nevertheless, they are appropriate for a wide variety of applications, as shown in the experiments (Guyon & Pereira, 1995; Ron et al., 1996; Tiňo & Dorffner, 2001), and the dynamics of large definite memory machines can be learned with neural networks, as presented in Clouse, Giles, Horne, and Cottrell (1997) and Giles, Lawrence, and Lin (1995). However, hidden Markov models and recurrent networks can obviously simulate fixed-order Markov models and definite memory machines. We will show theoretically in this article that recurrent networks are biased toward definite memory machines through initialization of the weights with small values. Hence, standard neural network training first explores regions of the weight space that correspond to the simpler (but potentially useful) dynamics of definite memory machines before testing more involved dynamics such as finite state machines and other mechanisms that can be implemented by recurrent networks (Tiňo & Šajda, 1995). This bias has the effect that structural differentiation due to the inherent dynamics can be observed even prior to training. This observation has been verified experimentally (Christiansen & Chater, 1999; Kolen, 1994a, 1994b; Tiňo, Čerňanský, & Beňušková, 2002a, 2002b). Moreover, the structural bias corresponds to the way in which humans recognize language, as pointed out in Christiansen and Chater (1999), for example. This article establishes a thorough mathematical formalization of the notion of architectural bias in recurrent networks. Furthermore, initial exploration of simple definite memory mechanisms in standard neural network training focuses on a region of the parameter search space where prior bounds on the generalization error can be obtained. We formalize this hypothesis within the mathematical framework provided by statistical learning theory.
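The architectural bias described above can be illustrated with a toy simulation. This is a numerical sketch under assumed parameters (state dimension, weight scale, and sequence lengths are all illustrative), not the article's construction: a tanh transition with small random weights is a contraction in the state argument, so two input histories that agree on their recent entries drive the network state to nearly the same point, which is exactly the behavior of a definite memory machine.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                    # state dimension (illustrative)
W = 0.05 * rng.standard_normal((n, n))    # "small" recurrent weights
V = 0.05 * rng.standard_normal(n)         # input weights

def step(state, x):
    # tanh is a squashing activation; with small W, the map
    # state -> tanh(W @ state + V * x) is a contraction in state.
    return np.tanh(W @ state + V * x)

def run(seq):
    state = np.zeros(n)
    for x in seq:                         # processed oldest entry first
        state = step(state, x)
    return state

shared = list(rng.standard_normal(30))    # 30 recent entries in common
# Histories differing only in the distant past end in nearly equal states.
far_apart = np.linalg.norm(run([5.0] + shared) - run([-5.0] + shared))
```

By contrast, changing the most recent input moves the state by a clearly measurable amount, so the network behaves like a machine with a finite, suffix-based memory.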
We prove in the second part of this article that recurrent networks with small weights are distribution-independent PAC learnable and hence yield valid generalization if enough training data are provided. This contrasts with unrestricted recurrent networks with infinite precision, which may in theory yield considerably worse generalization accuracy. We start by defining the notions of definite memory machines, fixed-order Markov models, and variations thereof that are particularly suitable for learning. Then we show that standard discrete-time recurrent networks initialized with small weights (or, more generally, nonautonomous discrete-time dynamical systems with contractive transition functions) driven with arbitrary input sequences can be simulated by definite memory machines operating on a finite input alphabet. Conversely, we show that every definite memory machine can be simulated by a recurrent network with small weights. Finally, we link the results to statistical learning theory and show
2 It is not necessary to do inference about the states for Markov models.
that small weights constitute one sufficient condition for the distribution-independent uniform convergence of empirical distances (UCED) property.

2 Finite Memory Models for Sequence Prediction

Assume X is a set. We denote the set of all finite-length sequences over X by X*. The sequences of length at most L are denoted by X^{≤L}. [ ] denotes the empty sequence, and [x_1, …, x_l] denotes the sequence of length l with elements x_i ∈ X. For every L ∈ ℕ, the L-truncation T_L(s) of a sequence s = [x_1, …, x_l] is defined as the first part of length L of the sequence,

T_L(s) = s if l ≤ L, and T_L(s) = [x_1, …, x_L] otherwise.

We are interested in predictions on sequences, that is, functions of the form f : X* → X, f(s) = x, or probability distributions p(x | s) for x ∈ X given a sequence s, which allow us, for example, to predict the next symbol x or its probability, respectively, when the sequence s has been observed. We assume that the sequences are ordered right to left; that is, x_1 is the most recent entry in the sequence [x_1, …, x_l]. In the next-symbol prediction setting, f(s) = x indicates that the sequence s = [x_1, …, x_l] is completed to [x, x_1, …, x_l] in the next time step. Obviously, a function f : X* → X induces the probability p(x | s) ∈ {0, 1} with p(x | s) = 1 ⟺ f(s) = x and can therefore be seen as a special case of the probabilistic formalism. Assume X := Σ is a finite alphabet. A classical and very simple mechanism for next-symbol prediction on sequences over Σ is given by definite memory machines or their probabilistic counterparts, fixed-order Markov models (Ron et al., 1996).

Definition 1. Assume V is a set. A definite memory machine (DMM) computes a function f : Σ* → V such that some L ∈ ℕ exists with

f(s) = f(T_L(s))   ∀s ∈ Σ*.
A fixed-order Markov model (FOMM) defines for each sequence s a probability p(· | s) on V with the following property: some L ∈ ℕ can be found with

p(x | s) = p(x | T_L(s))   ∀x ∈ V, s ∈ Σ*.
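Definition 1 can be rendered directly in code. This is an illustrative sketch, not code from the article: the lookup-table representation, the default output for unseen contexts, and the majority-vote example are all assumptions made for the demonstration.

```python
def truncate(s, L):
    """L-truncation T_L: keep the L most recent entries.  Sequences are
    ordered right to left, so s[0] is the most recent entry x_1."""
    return s if len(s) <= L else s[:L]

class DefiniteMemoryMachine:
    """Computes f(s) = f(T_L(s)): the output depends only on the last
    L inputs.  `table` maps truncated tuples to outputs; `default` is
    returned for contexts not listed in the table."""
    def __init__(self, L, table, default):
        self.L, self.table, self.default = L, table, default

    def __call__(self, s):
        return self.table.get(tuple(truncate(s, self.L)), self.default)

# Example: predict the next bit as the majority of the 3 most recent bits.
table = {}
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            table[(a, b, c)] = int(a + b + c >= 2)
dmm = DefiniteMemoryMachine(3, table, 0)
```

Because the machine only ever inspects the truncation, arbitrarily long inputs sharing the same three most recent entries necessarily receive the same output.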
Note that V = Σ if the above formalisms are used for predictions on sequences. Only a finite memory of length L is necessary for inferring the next symbol. FOMMs define rich families of sequence distributions and can naturally be used for sequence generation or probability estimation. However, if L increases, estimation of FOMMs on a finite set of examples becomes very hard. Therefore, variable memory length Markov models (VLMM) have
been proposed, where the memory length may depend on the sequence; that is, they implement probability distributions with

p(x | s) = p(x | T_{L(s)}(s))   ∀x ∈ V, s ∈ Σ*,
where the length L(s) ≤ L_max may depend on the context (Bühlmann & Wyner, 1999; Guyon & Pereira, 1995). The length of the memory is adapted to the context. Since L(s) is universally limited by some value L_max, VLMMs constitute a specific, efficient implementation of FOMMs. Their in-principle capacity is the same. VLMMs are often represented as prediction suffix trees, for which efficient learning algorithms can be designed (Ron et al., 1996). Alternative models for sequence processing that are more powerful than DMMs and FOMMs are finite state machines and finite memory machines, respectively. The behavior of a finite state machine depends on only the input and the actual state, where the state is an element of a finite set of states. Finite memory machines implement functions whose behavior can be determined by the last m input symbols and the last n output symbols, for some fixed numbers m and n. Definite memory machines can alternatively be defined as finite memory machines that depend on only the last m input symbols, so that no outputs need to be known, that is, n = 0. Formal definitions can be found, for example, in Kohavi (1978). Note that definite and finite memory machines cannot produce several simple languages; for example, they cannot produce the binary number representing the sum of two bitwise presented binary numbers. A finite state machine with only one bit of memory could solve the task. There exists a rich literature that relates recurrent networks (with arbitrary weights) to finite state machines and finite memory machines and demonstrates the possibility of learning or simulating these models in practice (Carrasco & Forcada, 2001; Frasconi et al., 1995; Giles et al., 1997; Omlin & Giles, 1996a, 1996b; Tiňo & Šajda, 1995). Note that definite memory machines constitute particularly simple (though useful) models where only a fixed number of input signals uniquely determines the current output. DMMs are alternatively called DeBruijn automata (Kohavi, 1978).
Large DMMs have been successfully learned from examples with recurrent networks, as reported, for example, in Clouse et al. (1997) and Giles et al. (1995). A very natural way of processing sequences is in a recursive manner. For this purpose, we introduce a general notation of recursive functions induced by standard functions via iteration:

Definition 2. Assume X and U are sets. Every function f: X × U → U and element y ∈ U induces a recursive function f̃_y: X* → U,

    f̃_y(s) = y                               if s = [ ],
    f̃_y(s) = f(x1, f̃_y([x2, ..., xl]))       if s = [x1, ..., xl].
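The recursion of definition 2 and its finite-memory truncation can be sketched in Python; the function names here are our own, not from the paper:

```python
def induced(f, y):
    """The recursive function f~_y induced by f and initial context y:
    f~_y([]) = y and f~_y([x1, ..., xl]) = f(x1, f~_y([x2, ..., xl]))."""
    def f_tilde(s):
        if not s:
            return y
        return f(s[0], f_tilde(s[1:]))
    return f_tilde

def induced_finite(f, y, L):
    """f~_y^L(s) = f~_y(T_L(s)): only the L most recent entries matter
    (the leftmost ones, since processing starts from the last entry)."""
    f_tilde = induced(f, y)
    return lambda s: f_tilde(s[:L])
```

For a contractive f on the reals, say f(x, u) = (x + u)/2 with y = 0, induced(f, 0.0)([1, 2, 3]) evaluates f(3, 0) first, then f(2, ·), then f(1, ·).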
RNNs with Small Weights Implement DMMs
1903
y is called the initial context. The induced function with finite memory length L is defined by f̃_y^L: X* → U, f̃_y^L(s) = f̃_y(T_L(s)). Starting from the initial context y, the sequence s = [x1, x2, ..., xl] is processed iteratively, starting from the last entry xl and applying the transition function f in each step. f̃_y may use infinite memory in the sense that all entries of a sequence, not just the most recent ones, may contribute to the output. On the other hand, f̃_y^L takes into account only the L most recent entries of the sequence. Functions of the form f̃_y^L share the idea of DMMs that only a finite memory is available for processing; general recursive functions of the form f̃_y have more powerful properties. Recurrent neural networks, which we will introduce later, constitute one popular mechanism for recursive computation that is more powerful than FOMMs. However, we first briefly mention an alternative to FOMMs that explicitly uses recursive processing. Fractal prediction machines (FPMs) constitute an alternative approach for sequence prediction through FOMMs, as proposed in Tiňo and Dorffner (2001). Here the L most recent entries of a sequence s are first mapped to a real vector space in a fractal way. Then the fractal codes of L-blocks are quantized into a fixed number of prototypes or code book vectors. The probability of the next symbol x is defined by the probability vector attached to the corresponding nearest code book vector. Formally, an FPM is given by the following ingredients. The elements x ∈ Σ are identified with binary vectors b_x in {0, 1}^D, D := ⌈log |Σ|⌉. Denote by R: Σ × [0, 1]^D → [0, 1]^D the mapping (x, p) ↦ α · b_x + (1 − α) · p, where α ∈ (0, 1/2] is a fixed scalar. Some memory depth L ∈ N is fixed. A sequence s is first mapped to R̃_y^L(s) ∈ [0, 1]^D, where y = (1/2, ..., 1/2). Sequences are encoded in a fractal way such that all sequences of length at most L are encoded uniquely.
In general, if two sequences s1, s2 share the most recent entries, then their images R̃_y^L(s1), R̃_y^L(s2) lie close to each other. A finite set of prototypes p^i ∈ [0, 1]^D is given, together with a vector P^i ∈ [0, 1]^|Σ| for each p^i, with Σ_{j=1}^{|Σ|} P^i_j = 1 (P^i_j denotes the components of P^i), which represents the probabilities for the next element in the sequence. Here, ‖·‖ denotes the Euclidean metric. Assume Σ = {x1, ..., xn}. The probability of xi given s equals the ith entry of the probability vector attached to the code book vector that is nearest to the fractal encoding of s:

    p(xi | s) = P^j_i   such that ‖p^j − R̃_y^L(s)‖ is minimal.
This notation has the advantage that an efficient training procedure can immediately be found. If a training set of sequences is given, first all L-blocks are encoded in [0, 1]^D. Afterward, a standard vector quantization learning
1904
B. Hammer and P. Ti Ïno
algorithm is applied, such as a self-organizing map (Kohonen, 1997). Finally, the probability vectors attached to the prototypes are determined such that they correspond to the relative frequencies of next symbols for all L-blocks in the training set whose codes are located in the receptive field of the corresponding code book vector. Note that a variable length of the respective memory is automatically introduced through the vector quantization: regions with a high density of codes attract more prototypes than regions with a low density of codes. Hence, the memory length is closer to the maximum length L in the former regions than in the latter ones. It is obvious that at most FOMMs can be implemented by FPMs. Conversely, it can easily be seen that each FOMM with corresponding probability p can be approximated up to any desired accuracy with an FPM. We can choose the parameter L in the FPM equal to the order of the FOMM. Then the encoding in the FPM yields R̃_y^L(s1) = R̃_y^L(s2) only if the next-symbol prediction probabilities given by s1 and s2 coincide. If enough data points are available, all possible codes in [0, 1]^D of nonzero-probability prediction contexts of length L can be observed in the first step of FPM construction. Clustering with a sufficient number of prototypes can simply choose all codes as prototypes, so that the nearest prototypes for two codes are identical iff the codes themselves are identical. Hence, the probabilities attached to a prototype, which correspond to the observed frequencies, converge to the correct probabilities p(xi | s) for every s that is mapped to the corresponding prototype. FPMs constitute one example of efficient sequence prediction tools. As we will see, recurrent networks initialized with small weights are inherently biased toward these simpler and more efficiently trainable mechanisms.
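A minimal FPM sketch in Python, assuming |Σ| = 2 (so D = 1) and prototypes plus probability vectors supplied externally, e.g., by a vector quantizer; all names are illustrative, not from Tiňo and Dorffner (2001):

```python
import numpy as np

def fractal_encode(s, codes, alpha=0.5, L=3, y=None):
    """Map the L most recent entries of s (leftmost = most recent) to [0,1]^D
    by iterating R(x, p) = alpha * b_x + (1 - alpha) * p, starting from
    the center y = (1/2, ..., 1/2)."""
    D = len(next(iter(codes.values())))
    p = np.full(D, 0.5) if y is None else np.asarray(y, dtype=float)
    for x in reversed(s[:L]):          # oldest entry of the L-block first
        p = alpha * np.asarray(codes[x]) + (1 - alpha) * p
    return p

def fpm_predict(s, codes, prototypes, prob_vectors, alpha=0.5, L=3):
    """Next-symbol distribution: the probability vector attached to the
    prototype nearest (in Euclidean distance) to the encoding of s."""
    p = fractal_encode(s, codes, alpha, L)
    dists = [np.linalg.norm(np.asarray(q) - p) for q in prototypes]
    return prob_vectors[int(np.argmin(dists))]
```

With α ≤ 1/2, two sequences sharing their most recent entries land close together in [0, 1]^D, which is exactly the property the quantization step exploits.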
Naturally, situations where more complicated dynamics is required, and hence recurrent networks with large weights are needed, can easily be found.

3 Contractive Recurrent Networks Implement DMMs

We are interested in recursive processing of sequences with recurrent neural networks. The basic dynamics of a recurrent neural network (RNN) used for sequence prediction is given by the above notion of induced recursive functions. An RNN computes a function h ∘ f̃_y: (R^n)* → R^m, where f̃_y is the function induced by some function f: R^n × R^l → R^l, which together with h: R^l → R^m are functions of a specific form defined later. Recurrent networks are more powerful than finite memory models and finite state models for two reasons: they can use an infinite memory, and using this memory, they can simulate Turing machines, as shown, for example, in Siegelmann and Sontag (1995). Moreover, they usually deal with real vectors instead of a finite input set, so that a priori unlimited information in the inputs might be available for further processing (Siegelmann & Sontag, 1994). Here we are interested in RNNs where the recursive transition
function f has a specific property: it forms a contraction. We will see later that this property is automatically fulfilled if an RNN with sigmoid activation function is initialized with “small” weights, which is a reasonable way to initialize weights unless one has strong prior knowledge about the underlying dynamics of the generating source (Elman et al., 1996). We will show that under these circumstances, RNNs can be seen as definite memory machines: they use only a finite memory, and only a finite number of functionally different input symbols exists. This result holds even if arbitrary real-valued inputs are considered and computation is done with perfect accuracy. Hence, RNNs initialized in this standard way are biased toward definite memory machines. First, we formally define contractions and focus on the general case of recursive functions induced by contractions. Assume X and U are sets and f: X × U → U is a function. Assume the set U is equipped with a metric structure. We denote the distance of two elements u1 and u2 in U by |u1 − u2|.

Definition 3. A function f: X × U → U is a contraction with respect to U if a real value C ∈ [0, 1) exists such that the inequality |f(x, u1) − f(x, u2)| ≤ C · |u1 − u2| holds for all x ∈ X and u1, u2 ∈ U.

If the transition function is a contraction and U is bounded with respect to the metric, then we can approximate the recursive function induced by f by the respective induced function with only a finite memory length:

Lemma 1. Assume f: X × U → U is a contraction with parameter C ∈ [0, 1) with respect to U. Assume |u1 − u2| ≤ C0 for all u1, u2 ∈ U, and fix ε > 0. Then, for memory length L = log_C(ε/C0), we have |f̃_y(s) − f̃_y^L(s)| ≤ ε for every initial context y ∈ U and every sequence s ∈ X*.

Proof. Choose s = [x1, ..., xl] ∈ X*. If l ≤ L, the inequality follows immediately. Assume l > L.
Then

    |f̃_y(s) − f̃_y^L(s)| = |f̃_y([x1, ..., xl]) − f̃_y([x1, ..., xL])|
        = |f(x1, f̃_y([x2, ..., xl])) − f(x1, f̃_y([x2, ..., xL]))|
        ≤ C · |f̃_y([x2, ..., xl]) − f̃_y([x2, ..., xL])|
        ≤ ···
        ≤ C^L · |f̃_y([x_{L+1}, ..., xl]) − f̃_y([ ])|
        ≤ C^L · C0 = ε/C0 · C0 = ε,

where L = log_C(ε/C0).

Hence we can approximate the dynamics by a dynamics with a finite memory length if the transition function is a contraction; the memory length depends on the parameter C of the contraction. Usually, the space of internal states U is a compact subset of a real vector space, for example, the set (0, 1)^d, d denoting the respective dimensionality. We have already seen that we need only a finite length if we approximate recursive functions with a contractive transition function. We would like to go a step further and show that we do not need infinite accuracy for storing the intermediate real vectors in U; rather, a finite set U will do. For this purpose, we first need an intermediate result.

Definition 4. Assume F is a function class with domain X and codomain Y, such that Y is equipped with a metric |·|. For f, g ∈ F we denote the maximum distance by |f − g|_∞ = sup_{x∈X} |f(x) − g(x)|. Assume ε > 0. An external covering of F with accuracy ε consists of a set of functions C(F), where g: X → Y may be arbitrary for g ∈ C(F), such that for all f ∈ F, a function g ∈ C(F) can be found with |f − g|_∞ ≤ ε.

Note that for every function class, an external covering, the class itself, can be found. A finite ε-covering of a set U consists of a finite number of points {u1, ..., un} such that for every u ∈ U, some ui with |u − ui| ≤ ε exists. Note that we can find a finite covering for every compact and bounded set in a metric space, that is, |u1 − u2| ≤ C0 for all u1, u2 ∈ U and some C0 < ∞. Denote by F̃ the set of all functions of the form f̃_y for f ∈ F and y ∈ U. F̃^L denotes the set of all functions of the form f̃_y^L for f ∈ F and y ∈ U. External coverings of F extend to external coverings of F̃ and F̃^L, respectively:

Lemma 2. Assume F is a set of functions mapping X × U to U, such that every f ∈ F forms a contraction with respect to U with parameter C. Assume ε > 0.
Assume C(F) is an external covering of F with parameter ε. Assume U is bounded, and the points {u1, ..., un} cover U with parameter ε. Then {g̃_{uj} | g ∈ C(F), j = 1, ..., n} is an external covering of F̃ with parameter ε/(1 − C), and {g̃_{uj}^L | g ∈ C(F), j = 1, ..., n} is an external covering of F̃^L with parameter ε/(1 − C).

Proof. Assume f ∈ F and y ∈ U. Choose a function g from the covering C(F) such that |f − g|_∞ ≤ ε. Choose a value uj from {u1, ..., un} such that |y − uj| ≤ ε. It follows by induction over the length l of a sequence s ∈ X*
that |f̃_y(s) − g̃_{uj}(s)| ≤ ε · (1 − C^{l+1})/(1 − C), as follows. For s = [ ] we find

    |f̃_y(s) − g̃_{uj}(s)| = |y − uj| ≤ ε ≤ ε · (1 − C^1)/(1 − C).

For s = [x1, ..., xl] we find

    |f̃_y([x1, ..., xl]) − g̃_{uj}([x1, ..., xl])|
        = |f(x1, f̃_y([x2, ..., xl])) − g(x1, g̃_{uj}([x2, ..., xl]))|
        ≤ |f(x1, f̃_y([x2, ..., xl])) − f(x1, g̃_{uj}([x2, ..., xl]))|
          + |f(x1, g̃_{uj}([x2, ..., xl])) − g(x1, g̃_{uj}([x2, ..., xl]))|
        ≤ C · |f̃_y([x2, ..., xl]) − g̃_{uj}([x2, ..., xl])| + ε
        ≤ C · ε · (1 − C^l)/(1 − C) + ε = ε · (1 − C^{l+1})/(1 − C)

by induction.

Assume U is bounded and {u1, ..., un} is an ε-covering of U. Then we can obviously approximate every function f: X × U → U by a function whose codomain is contained in {u1, ..., un} only. Hence we can cover every set of functions mapping to U by functions with images in the discrete set {u1, ..., un}. Since the initial contexts in the above lemma can be chosen as elements of the set {u1, ..., un} and the approximations in the cover yield values only in that set, we obtain as an immediate corollary that a finite set U is sufficient for internal processing:

Corollary 1. Assume F and U are as above. Assume ε > 0, and assume {u1, ..., un} is an ε-covering of U. Denote by Q: U → {u1, ..., un} the quantization mapping, which maps a value u ∈ U to the nearest value ui (some fixed nearest ui if this is not unique). Denote by f^Q the composition f^Q(x, u) = Q(f(x, u)). Then {(f^Q)~_{uj} | f ∈ F, j = 1, ..., n} forms an ε/(1 − C)-covering of F̃, and {(f^Q)~^L_{uj} | f ∈ F, j = 1, ..., n} forms an ε/(1 − C)-covering of F̃^L. Note that these functions use values of {u1, ..., un} only.

Proof. Note that {f^Q | f ∈ F} constitutes an external ε-covering of F, because the outputs are changed by at most ε. Moreover, {u1, ..., un} constitutes an ε-cover of U by assumption. Hence we can apply lemma 2. As a consequence, the recursive classes {(f^Q)~_{uj} | f ∈ F, j = 1, ..., n} and {(f^Q)~^L_{uj} | f ∈ F, j = 1, ..., n} form ε/(1 − C)-covers of F̃ and F̃^L, respectively.

Hence, we can substitute every recursive function where the transition constitutes a contraction by a function that uses only a finite number of
different values for U and a finite memory length. Depending on the form of X, the internal values for U can be substituted by values consisting of sequences in X. More precisely, we get the following result:

Corollary 2. For every ε > 0, function f: X × U → U with compact domain U such that f is a contraction with parameter C, and initial context y ∈ U, we can find a memory length L, a finite set {x1, ..., xm} in X, and a quantization Q: X → {x1, ..., xm} such that the following holds: there exists a function g: {x1, ..., xm}^{≤L} → U such that

    |f̃_y(s) − g(T_L(Q̃(s)))| ≤ ε,

where Q̃(s) denotes the element-wise application of Q to the sequence s. If X is finite, Q can be chosen as the identity.

Proof. As a consequence of lemma 1 and corollary 1, we can approximate f̃_y by a function h̃^L_{u_{i0}}, which uses only a finite number of values {u1, ..., un} in U and a finite memory length L. Define equivalence classes on X via the definition x ∼ x' for x, x' ∈ X iff h(x, uj) = h(x', uj) for all uj ∈ {u1, ..., un}. This yields only a finite number of equivalence classes. Choose a fixed value xj from each equivalence class. Define Q: x ↦ xj, such that x lies in the equivalence class of xj. Then the choice g := h̃^L_{u_{i0}} yields the desired approximation. The same choice is possible if X itself is finite and Q is the identity.

This result tells us that we can substitute recursive maps with compact codomain and contractive transition functions by definite memory machines if the input alphabet is finite. Otherwise, the input alphabet can be quantized such that an equivalent definite memory machine with a finite number of different input symbols and the same behavior can be found. In the case of RNNs, further processing is added to the recursive computation; we are interested in functions of the form h ∘ f̃_y, where h is some function that maps the processed sequence to the desired output but itself does not contribute to the recursive computation.
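The finite-memory approximation guaranteed by lemma 1 can be checked numerically. The sketch below uses a toy contraction of our own with C = 0.5 on U = [0, 1] (so C0 = 1) and verifies |f̃_y(s) − f̃_y^L(s)| ≤ C^L · C0 for every truncation length:

```python
def f_tilde(f, y, s):
    """Evaluate the induced recursive function: process s from its last entry."""
    u = y
    for x in reversed(s):
        u = f(x, u)
    return u

# a contraction on U = [0, 1] with parameter C = 0.5, hence C0 = 1
f = lambda x, u: 0.5 * u + 0.25 * (x % 2)
s = [3, 1, 4, 1, 5, 9, 2, 6]
for L in range(1, len(s) + 1):
    err = abs(f_tilde(f, 0.0, s) - f_tilde(f, 0.0, s[:L]))
    assert err <= 0.5 ** L   # lemma 1: |f~_y(s) - f~_y^L(s)| <= C^L * C0
```

The error shrinks geometrically in L, which is exactly why a definite memory machine with modest L suffices.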
If h is continuous, similar approximation results can obviously be obtained, since we can simply combine the above approximation g with h. Note that h is then uniformly continuous on the compact domain U. Therefore, approximation of f̃_y by the function g ∘ T_L ∘ Q̃ up to ε yields approximation of h ∘ f̃_y by h ∘ g ∘ T_L ∘ Q̃ up to a value that depends on the modulus of continuity of h. We are here interested in RNNs and their connection to definite memory machines. We assume that X = R^{n1} and U = R^{n2} are real vector spaces equipped with the maximum norm, which we denote by |·|_∞.
Definition 5. A recurrent network (RNN) computes a function of the form h ∘ f̃_y: (R^{n1})* → R^{n3}, where h: R^{n2} → R^{n3} and f: R^{n1} × R^{n2} → R^{n2} are of the form

    h(u) = M1 · σ(M2 · u + d),
    f(x, u) = σ(W1 · x + W2 · u + b),

where M1 ∈ R^{n3×n4}, M2 ∈ R^{n4×n2}, W1 ∈ R^{n2×n1}, and W2 ∈ R^{n2×n2} are matrices, d ∈ R^{n4}, b ∈ R^{n2}, and σ denotes the component-wise application of a transition function σ: R → R.

In the above definition, h constitutes a so-called feedforward network with one hidden layer that maps the recursively processed sequences to the desired outputs, and f defines the recurrent part of the network. Popular choices for σ are the hyperbolic tangent tanh or the logistic function sgd(x) = (1 + exp(−x))^{−1}. We can apply the above results if the transition function f constitutes a contraction and the internal values are contained in a bounded set. Under these circumstances, RNNs simply implement a definite memory machine and can be substituted, for example, by a fractal prediction machine. We first consider the case where σ is the identity.

Definition 6. A function f: X → Y is Lipschitz continuous with parameter K with respect to metrics |·| on X and Y if |f(x) − f(x')| ≤ K · |x − x'| for all x, x' ∈ X.

Lemma 3. The function f: R^{n1} × R^{n2} → R^{n2}, (x, u) ↦ W1 · x + W2 · u + b as above is Lipschitz continuous with respect to the second input parameter u and the maximum norm |·|_∞ with parameter K = n2 · max |w_{jk}|, where w_{jk} are the components of the matrix W2. The mapping is a contraction for max |w_{jk}| < 1/n2.

Proof. We find

    |(W1 · x + W2 · u + b) − (W1 · x + W2 · u' + b)|_∞
        = |W2 · (u − u')|_∞ = max_j |Σ_k w_{jk}(u_k − u'_k)|
        ≤ n2 · max_{j,k} (|w_{jk}| · |u_k − u'_k|)
        ≤ n2 · max_{j,k} |w_{jk}| · |u − u'|_∞.
Obviously, a contraction is obtained for max |w_{jk}| < 1/n2. Hence, if we can in addition make sure that the image of the transition function is bounded, for example, because b = 0 and the elements of input sequences are contained in a compact set, we can approximate the above recursive computation by a definite memory machine. The necessary length L of the sequences depends on the degree of the
contraction, that is, on the magnitude of the weights and the desired accuracy of the approximation. The following simple observation allows us to obtain results for nonlinear activation functions σ: if f1 and f2 are Lipschitz continuous with constants K1 and K2, respectively, the composition f1 ∘ f2 is Lipschitz continuous with constant K1 · K2. Hence, arbitrary activation functions σ that are Lipschitz continuous with parameter K lead to contractive transition functions f if the weights W2 fulfill max |w_{jk}| < 1/(n2 · K). In particular, differentiable activation functions σ such that σ' can be uniformly bounded by a constant K are Lipschitz continuous with parameter K; hence, they yield contractions. Since many standard activation functions, like the hyperbolic tangent or the logistic activation function, fulfill this property and, moreover, map only to a bounded domain such as (0, 1) or (−1, 1), we have finally obtained the result that recurrent networks with small weights can be approximated arbitrarily well by definite memory machines. Before training, the weights are usually initialized with small random values. If they are initialized in a small enough domain (e.g., their absolute value is not larger than, say, 4/n2 if the logistic function σ is used), they have contractive transition functions and so act like definite memory machines. This argumentation implies that through the initialization, recurrent networks have an architectural bias toward definite memory machines. Feedforward neural networks with time-window input constitute a popular alternative method for sequence processing (Sejnowski & Rosenberg, 1987; Waibel, Hanazawa, Hinton, Shikano, & Lang, 1989). Since a finite time window corresponds to the finite memory of definite memory machines, recurrent networks are biased toward these successful alternative training methods, although the size of the time window is not fixed a priori.
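This forgetting effect can be observed directly. The sketch below (a toy logistic network of our own; the logistic function is Lipschitz with K = 1/4, so max |w_{jk}| well below 4/n2 gives a contraction) runs two long sequences that share only their 10 most recent entries and finds nearly identical internal states:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 2, 5
W1 = rng.uniform(-1.0, 1.0, (n_hid, n_in))
# stay safely inside the contraction bound max |w_jk| < 4 / n_hid
W2 = rng.uniform(-0.5, 0.5, (n_hid, n_hid)) * (4.0 / n_hid) * 0.9
b = rng.uniform(-0.1, 0.1, n_hid)
sgd = lambda z: 1.0 / (1.0 + np.exp(-z))

def state(seq):
    """Run the recursive part u <- sgd(W1 x + W2 u + b), last entry first."""
    u = np.zeros(n_hid)
    for x in reversed(seq):
        u = sgd(W1 @ x + W2 @ u + b)
    return u

shared = [rng.uniform(-1, 1, n_in) for _ in range(10)]   # common recent entries
tail1 = [rng.uniform(-1, 1, n_in) for _ in range(20)]
tail2 = [rng.uniform(-1, 1, n_in) for _ in range(20)]
gap = np.max(np.abs(state(shared + tail1) - state(shared + tail2)))
# the differing tails are forgotten: gap is of order C^10
```

With these weights the contraction parameter is roughly C ≈ 0.45, so the gap is of order 0.45^10, well below 10^-3: the network behaves as a definite memory machine with memory length about 10.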
We add a remark on recurrent neural networks used for the approximation of probability distributions, as proposed, for example, in Bengio and Frasconi (1996).

Definition 7. A probabilistic recurrent network computes a function of the form h ∘ f̃_y: (R^{n1})* → R^{n3}, where h: R^{n2} → {x ∈ R^{n3} | Σ_{i=1}^{n3} x_i = 1, x_i ≥ 0} and f: R^{n1} × R^{n2} → R^{n2} is of the form f(x, u) = σ(W1 · x + W2 · u + b), where W1 ∈ R^{n2×n1} and W2 ∈ R^{n2×n2} are matrices, b ∈ R^{n2}, and σ denotes the component-wise application of a transition function σ: R → R. h ∘ f̃_y defines a conditional probability distribution on a set {x1, ..., x_{n3}} of cardinality n3 given a sequence s ∈ (R^{n1})* via the choice p(xi | s) = h_i(f̃_y(s)), where h_i denotes the ith output component of h.

Note that elements in {x ∈ R^{n3} | Σ_{i=1}^{n3} x_i = 1, x_i ≥ 0} correspond to probability distributions over n3 discrete elements. Hence, a probabilistic recurrent
network induces a distribution for the next symbol given a sequence s if the output components of the network are interpreted as a probability distribution over the alphabet. Usually h consists of a linear function, possibly combined with a component-wise nonlinear transformation, followed by normalization. In Bengio and Frasconi (1996), the outputs of f are normalized too, so that the intermediate values can be interpreted as a probability distribution on a finite set of hidden states and training can be performed, for example, with a generalized EM algorithm (Neal & Hinton, 1998). The above approximation results transfer immediately to a probabilistic network if the transition function is a contraction and the set of intermediate values is bounded. Here we obtain the result that the function mapping a sequence to the next-symbol probabilities can be approximated by a function implemented by a definite memory machine; such probabilistic recurrent networks can be approximated arbitrarily well by FOMMs. Approximation of probability distributions p and q on the finite set of possible events {x1, ..., xn} up to degree ε here means that |p(xi) − q(xi)| ≤ ε for all i. Based on this estimation, and assuming q(xi) ≥ δ > 0, we can obtain a bound on the Kullback-Leibler divergence Σ_i p(xi) log(p(xi)/q(xi)), which is smaller than log((δ + ε)/δ). This term becomes arbitrarily small as ε approaches 0. One can obtain explicit bounds on the weights W2 such that the contraction condition is fulfilled, as above, if f consists of a linear function and a component-wise nonlinearity like the logistic function. Assuming a normalization of the outputs is added in the recursive steps of f, too, as proposed in Bengio and Frasconi (1996), then alternative bounds on the magnitudes of the weights can be derived using the fact that the mapping x ↦ x/‖x‖ is Lipschitz continuous with parameter 2/ε for ‖x‖ > ε, where ‖·‖ denotes the Euclidean metric.
4 Every DMM Can Be Implemented by a Contractive Recurrent Network

We have seen that, loosely speaking, recurrent networks with contractive transition functions implement at most DMMs (or FOMMs). Here, we establish the converse direction: every DMM or FOMM, respectively, can be approximated arbitrarily well by a recurrent network with a contractive transition function. Several possibilities for injecting finite automata or finite state machines (and thus also definite memory machines) into recurrent networks have been proposed in the literature (Carrasco & Forcada, 2001; Frasconi et al., 1995; Omlin & Giles, 1996a, 1996b). Since these methods deal with general finite automata, the transition function of the constructed RNNs is not a contraction and does not fulfill the condition of small weights. We assume that Σ = {x1, ..., xn} is a finite alphabet. We are interested in the processing of sequences over Σ. We assume that input sequences in Σ* are presented to a recurrent network in a unary way, that is, xi corresponds
to the unit vector ei ∈ R^n with entry 1 at position i and 0 at all other positions. Denote by Q: Σ → R^n the coding xi ↦ ei. Denote by Q̃(s) the element-wise application of Q to the entries of a sequence s. We assume that the nonlinearity σ used in the network is of sigmoid type; it has a specific form that is fulfilled for popular activation functions like the hyperbolic tangent. More precisely, we assume that σ is a monotonically increasing and continuous function with finite limits lim_{x→∞} σ(x) ≠ lim_{x→−∞} σ(x).

Lemma 4. Assume σ is a strictly monotonically increasing, continuous function with finite limits lim_{x→−∞} σ(x) =: l < r := lim_{x→∞} σ(x). Assume g: Σ* → Σ is computed by a DMM, that is, there exists some L ∈ N such that g(s) = g(T_L(s)) for all s ∈ Σ*. Assume C ∈ (0, 1). Then there are m ∈ N and y ∈ R^m such that we can find functions f: R^n × R^m → R^m and h: R^m → R^n of a recurrent network h ∘ f̃_y, such that h ∘ f̃_y(Q̃(s)) = Q(g(s)) for all s ∈ Σ*, and f is a contraction with parameter C with respect to the second argument.

Proof. Assume n = |Σ|. We choose m = L · n, and let y be the origin. First, we define the transition function f of the recursive part, of the form f(x, u) = σ(W1 · x + W2 · u + b). We start constructing the recursive part for the case σ(0) = 0. Because of the continuity of σ, we can find some positive δ such that f constitutes a contraction with respect to the second argument and inputs in [l, r]^m with parameter C if the absolute value of all coefficients in W2 is at most δ. We can think of the outputs of f̃_y as L blocks of n coefficients. We will define f such that, given the input sequence s, coefficient j of block i is larger than 0 iff element i of the input sequence s is xj, and it is 0 otherwise. For this purpose, denote by index: {1, ..., L} × {1, ..., n} → {1, ..., m = L·n} a fixed bijective mapping.
We enumerate the coefficients of W2 by tuples (index(i, j), index(k, l)), where i, k are in {1, ..., L} and j, l are in {1, ..., n}. We enumerate the entries of W1 by tuples (index(i, j), k), where i ∈ {1, ..., L} and j, k are in {1, ..., n}. We choose b = 0 and all entries of W1 and W2 as 0, except for (W1)_{(index(1,j), j)} = 1 for j ∈ {1, ..., n} and (W2)_{(index(i,j), index(i+1,j))} = δ for i < L, j ∈ {1, ..., n}. This choice has the effect that the current input is stored in the first block, and the inputs of the last L − 1 steps, which can be found in the first to (L − 1)st blocks in the previous step, are transferred to the second to Lth blocks. Hence the last L values of an input sequence are stored in the activations of the network, and all different prefixes of length L of sequences yield unique outputs of f̃_y. Assume that σ(0) ≠ 0. Then we can construct a recursive part of a network that uniquely encodes prefixes of length L as follows. The function σt with σt(x) = σ(x) − σ(0) is a monotonically increasing and continuous function with finite limits and the property σt(0) = 0. Hence, we can use σt to construct a recursive part of a network f̃_y with the above properties, where the transition function is of the form σt(W1 · x + W2 · u + b). We
find for all sequences s the equality f̃_y(s) = g̃_z(s) − σ(0), where g(x, u) = σ(W1 · x + W2 · u + b − W2 · σ(0)), z = W2 · σ(0), and σ(0) is the vector with components σ(0). Obviously, f̃_y(s) encodes the prefixes of length L uniquely iff f̃_y(s) + σ(0) encodes the prefixes uniquely; hence, g̃_z constitutes a recursive part of a network with the desired properties and activation function σ. Hence, we obtain a unique encoding of the last L entries of the sequence through the recursive transformation in both cases. It follows immediately from well-known approximation or interpolation results for feedforward networks that some h can be found that maps the outputs of f̃_y to the desired values (Hornik, 1993; Hornik, Stinchcombe, & White, 1989; Sontag, 1992). h can be chosen as a feedforward network with one hidden layer.

We can obtain the further extension of the above result that every DMM can be approximated by an RNN of the above form with arbitrarily small weights in both the recursive and the feedforward part. We have already seen that the weights in W2 can be chosen arbitrarily small. Choosing the entries of W1 as δ instead of 1 does not change the argumentation. Moreover, the universal approximation capability of feedforward networks also holds for analytic σ (e.g., the hyperbolic tangent) if the bias and the weights are chosen from an arbitrarily small open interval (Hornik, 1993).3 Hence, we can limit the weights in the feedforward part too. The above result can be immediately transferred to approximation results for the probabilistic counterparts of DMMs. Even if the output of the recursive part is in addition normalized, as in Bengio and Frasconi (1996), the fact that all sequences of length at most L are mapped to unique values through the recursive computation is not altered. Hence, we can find an appropriate h that outputs the probabilities of the next symbol in a sequence; h can be computed by a feedforward network followed by normalization.
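The shift-register construction in the proof of lemma 4 can be sketched directly. We use tanh (so σ(0) = 0); the values of n, L, and δ are illustrative:

```python
import numpy as np

n, L = 3, 4          # alphabet size and memory length
m = L * n            # L blocks of n units each
delta = 0.1          # small recurrent weights keep the transition contractive

W1 = np.zeros((m, n))
W1[:n, :] = np.eye(n)              # write the current unit vector into block 1
W2 = np.zeros((m, m))
for i in range(L - 1):             # copy block i into block i+1, scaled by delta
    W2[(i + 1) * n:(i + 2) * n, i * n:(i + 1) * n] = delta * np.eye(n)

def encode(seq):
    """u <- tanh(W1 x + W2 u), processing seq from its last entry; block i
    of the result depends only on the i-th most recent symbol of seq."""
    u = np.zeros(m)
    for sym in reversed(seq):
        u = np.tanh(W1 @ np.eye(n)[sym] + W2 @ u)
    return u
```

Sequences agreeing on their L most recent symbols receive identical encodings, while sequences differing within the last L receive distinct ones, so a feedforward readout h can realize any DMM output from these activations.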
Therefore, FOMMs can obviously be approximated (even precisely interpolated) by probabilistic recurrent networks up to any desired degree, too.

5 Learnability

We have shown that RNNs with small weights and DMMs implement the same function classes if restricted to a finite input set. The respective memory length sufficient for approximating the RNN depends on the size of the weights. Since initialization of RNNs often puts a bias toward DMMs or their probabilistic counterpart, and FOMMs possess efficient training algorithms

3 The number of hidden neurons in h might increase if the weights are restricted. For unlimited weights, we can bound the number of hidden neurons in h by the finite number of possible different outputs of f̃_y, which depends (exponentially) on m and L only.
like fractal prediction machines, the latter constitute a valuable alternative to standard RNNs, for which training is often very slow (Ron et al., 1996; Tiňo & Dorffner, 2001). Another point that makes DMMs and recurrent networks with small weights attractive concerns their generalization ability. Here, we introduce several definitions. Statistical learning theory provides one possible way to formalize the learnability or generalization ability of a function class. Assume F is a function class with domain X and codomain Y. We assume in the following that every function or set that occurs is measurable. Assume |·| defines a metric on Y. A learning algorithm for F outputs a function g ∈ F given a finite set of examples (x1, f(x1)), ..., (xn, f(xn)) for an unknown function f ∈ F. Generalization ability of the algorithm refers to the fact that the functions f and g approximately coincide on all possible inputs if they coincide on the given finite set of examples. Denote by 𝒫 the set of probability measures on X and by P its elements. P^m is the product measure induced by P on X^m. The distance between functions f and g with respect to P is

    d_P(f, g) = ∫_X |f(x) − g(x)| dP(x).
The empirical distance between f and g given x⃗ = (x1, ..., xm) ∈ X^m refers to the quantity

    d̂_m(f, g, x⃗) = Σ_i |f(xi) − g(xi)| / m,
which is obtained when the distance between f and g is evaluated at m given data points. The aim in the general training scenario is to minimize the distance between the function to be learned, say f, and the function obtained by training, say g. Usually, this quantity is not available because the function to be learned is unknown. Hence, standard training often minimizes the empirical error between f and g on a given set x̄ of training examples. A justification of this principle can be established if the empirical distance is representative of the real distance. Since the function obtained by training usually depends on the whole training set (and hence the error on one training example does not constitute an independent observation), a uniform convergence in (high) probability of the empirical distance d̂_m(f, g, x̄) for arbitrary functions f and g and samples x̄ is established. Generalization then means that d̂_m(f, g, x̄) and d_P(f, g) nearly coincide for large enough m, uniformly for f and g.

Definition 8. F fulfills the distribution-independent uniform convergence of empirical distances property (UCED property) if for all ε > 0,

lim_{m→∞} sup_{P∈𝒫} P^m({x̄ | ∃ f, g ∈ F : |d_P(f, g) − d̂_m(f, g, x̄)| > ε}) = 0.
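To make the relation between the true distance d_P and the empirical distance concrete, here is a minimal Python sketch (our own toy illustration, not from the article): for threshold indicators f_t(x) = 1 if x ≥ t on X = [0, 1] with P the uniform distribution, d_P(f_s, f_t) = |s − t| exactly, so convergence of the empirical distance can be checked directly.

```python
import random

# Toy class (our own example): threshold indicators f_t(x) = 1[x >= t] on [0, 1].
def threshold(t):
    return lambda x: 1.0 if x >= t else 0.0

def empirical_distance(f, g, sample):
    # \hat d_m(f, g, x) = (1/m) * sum_i |f(x_i) - g(x_i)|
    return sum(abs(f(x) - g(x)) for x in sample) / len(sample)

f, g = threshold(0.3), threshold(0.7)
# Under P = Uniform[0, 1], the true distance d_P(f, g) is exactly 0.7 - 0.3 = 0.4;
# the empirical distance approaches it as the sample size m grows.
rng = random.Random(1)
for m in (10, 100, 10_000):
    sample = [rng.random() for _ in range(m)]
    print(m, round(empirical_distance(f, g, sample), 3))
```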
RNNs with Small Weights Implement DMMs
Since one can think of f as the function to be learned and of g as the output of the learning algorithm, this property characterizes the fact that we can find prior bounds (independent of the underlying probability) on the necessary size of the training set such that every algorithm with small training error yields good generalization with high probability. In short, the UCED property is one possible way of formalizing generalization ability. The framework tackled by statistical learning theory usually deals with a more general scenario, the so-called agnostic setting (Haussler, 1992). There, the function class F used for learning need not contain the unknown function to be learned, and the error is measured by a general loss function. Valid generalization then refers to the property of uniform convergence of empirical means (UCEM) of a class associated to F via the loss function. However, under several conditions on F and the loss function, learnability of this associated class can be related to learnability of F (Anthony & Bartlett, 1999; Vidyasagar, 1997). For simplicity, we will investigate only the UCED property of recurrent networks with small weights. The following is a well-known fact:

Lemma 5. Finite function classes fulfill the UCED property.
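For instance (our own toy count, ignoring inputs shorter than the memory length), a DMM with memory length L over an alphabet Σ realizes a lookup table from Σ^L contexts to output symbols, so the class it generates contains at most |Σ|^(|Σ|^L) distinct functions and is finite:

```python
# Our own illustration of why DMM classes with fixed memory length L are finite:
# each machine is a lookup table from the |Sigma|**L possible contexts to an
# output symbol, so at most |Sigma| ** (|Sigma| ** L) functions exist.
def dmm_class_size(alphabet_size, L):
    return alphabet_size ** (alphabet_size ** L)

for L in (1, 2, 3):
    print(L, dmm_class_size(2, L))
```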
Assume Σ is a finite alphabet and F is the class of functions from Σ* to Σ that can be computed by a DMM with fixed finite memory length L. Then F obviously fulfills the UCED property because the function class is finite. Hence, DMMs with fixed length L can generalize when provided with enough training data. Assume F is the function class given by the functions computed by all recurrent neural networks as defined in definition 5, where the dimensionalities n_i and σ are fixed, but the entries of the matrices can be chosen arbitrarily and arbitrary computation accuracy is assumed. Then F does not possess the UCED property, as shown in Bartlett, Long, and Williamson (1994), Hammer (1997), and Koiran and Sontag (1997), for example. Hence, general recurrent networks with no further restrictions do not yield valid generalization in the above sense, unlike fixed-length DMMs. One can prove weaker results for recurrent networks, which yield bounds on the size of a training set such that valid generalization holds with high probability, as derived in Hammer (1999, 2001), for example. However, these bounds are no longer independent of the underlying (unknown) distribution of the inputs. In theory, training of general RNNs may need an exhaustive number of patterns for valid generalization under certain input distributions. One particularly "bad" situation is explicitly constructed in Hammer (1999), where the number of examples necessary for valid generalization increases more than polynomially in the required accuracy. Naturally, restriction of the search space, for example, to finite automata with a fixed number of states, offers a method to establish prior bounds on the
B. Hammer and P. Ti Ïno
generalization error of RNNs. Moreover, in practical applications, because of computation noise and finite accuracy, the effective VC dimension of RNNs is finite. Nevertheless, more work has to be done to formally explain why neural network training often shows good generalization ability in common training scenarios. Here, we offer a theory for the initial phases of RNN training by linking RNNs with small weights to definite memory machines. Note that RNNs with small weights and a finite input set approximately coincide with DMMs of fixed length, where the length depends on the size of the weights. We can conclude that RNNs with a priori limited small weights and a finite input alphabet possess the UCED property, in contrast to general RNNs with arbitrary weights and a finite input alphabet. That means the architectural bias through the initialization emphasizes a region of the parameter search space where the UCED property can be formally established. We will show in the remaining part of this section that an analogous result can be derived for recurrent networks with small weights and arbitrary real-valued inputs. This shows that function classes given by RNNs with a priori limited small weights possess the UCED property, in contrast to general RNNs with arbitrary weights and infinite precision. We consider function classes F with domain X and codomain equal to [0, 1] equipped with the maximum norm. Moreover, we assume that the constant function 0 is contained in F, too. Then an alternative characterization of the UCED property can be found in the literature that relates the generalization ability to the capacity of the function class. Appropriate formalizations of the term capacity are as follows:

Definition 9. Assume F is a function class. Let ε > 0. The external covering number L(ε, F, |·|_∞) denotes the size of the smallest external ε-covering of F with respect to the metric |·|_∞; L(ε, F, |·|_∞) is infinite if no finite external covering of F exists.
The ε-fat-shattering dimension VC_ε(F) of F is the largest size (possibly infinite) of a set of points {x_1, ..., x_n} in X that can be shattered with parameter ε. Shattering with parameter ε means that real values r_1, ..., r_n exist such that for each function d: {x_1, ..., x_n} → {−1, 1}, some function f ∈ F exists with |f(x_i) − r_i| ≥ ε and (f(x_i) − r_i) · d(x_i) > 0.

Both the covering number and the fat-shattering dimension measure the richness of F: the number of essentially different functions up to ε, and the number of points where a rich behavior can be observed within the function class, respectively. Assume x̄ = (x_1, ..., x_m) ∈ X^m is a vector. Denote the restriction of F to x̄ by F|_x̄ = {g: {x_1, ..., x_m} → Y | ∃ f ∈ F: g(x_i) = f(x_i)}. Proofs of the following alternative characterizations of the UCED property can be found in Anthony and Bartlett (1999), Bartlett et al. (1994),
and Vidyasagar (1997):

Lemma 6. The following characterizations are equivalent for a function class F with codomain [0, 1] that contains the constant function 0:

- F fulfills the UCED property.
- lim_{m→∞} sup_{P∈𝒫} E_{P^m}(log₂ L(ε, F|_x̄, |·|_∞))/m = 0 for all ε > 0.
- VC_ε(F) is finite for every ε > 0.

E_{P^m} denotes expectation with respect to x̄ = (x_1, ..., x_m). Furthermore, the estimation

L(ε, F|_x̄, |·|_∞) ≤ 2 (4m/ε²)^(d log₂(12m/(dε)))

holds for every x̄ = (x_1, ..., x_m), where d = VC_{ε/4}(F).
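As a quick numeric illustration of this estimation (our own sketch; the values d = 10 and ε = 0.1 are arbitrary, hypothetical choices), note that the logarithm of the bound grows only like (log m)², so the quotient log₂ L/m in the second characterization indeed vanishes:

```python
import math

# Logarithm of the covering bound of Lemma 6:
# log2 L <= 1 + d * log2(12m/(d*eps)) * log2(4m/eps**2),
# which is polylogarithmic in m, so (log2 L)/m -> 0 as m -> infinity.
def log2_cover_bound(m, d, eps):
    return 1 + d * math.log2(12 * m / (d * eps)) * math.log2(4 * m / eps ** 2)

d, eps = 10, 0.1  # hypothetical fat-shattering dimension and accuracy
for m in (10 ** 2, 10 ** 4, 10 ** 6):
    print(m, round(log2_cover_bound(m, d, eps) / m, 6))
```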
Using this alternative characterization, we can prove that recurrent networks with small weights and arbitrary inputs fulfill the UCED property, too. Denote by G ∘ F the class of compositions {g ∘ f | g ∈ G, f ∈ F} for function classes F and G such that the domain of G equals the codomain of F.

Lemma 7. Assume C_1 ∈ (0, 1) and C_2 > 0 are fixed. Assume U is a compact set. Assume F is a function class with domain X × U and codomain U such that every function in F is a contraction with parameter C_1 with respect to the second argument. Assume G is a function class with domain U and codomain [0, 1] such that every function in G is Lipschitz continuous with parameter C_2. Then the function class G ∘ F̃ fulfills the UCED property if the function class G ∘ F̃_L fulfills the UCED property for every L ∈ N.

Proof. Assume ε > 0. Assume s̄ = (s_1, ..., s_m) is a vector of m sequences over X. Because of lemma 1, and because every g ∈ G is Lipschitz continuous with parameter C_2, we can find some L such that every g ∘ f̃_y in G ∘ F̃ deviates from g ∘ f̃_y^L in G ∘ F̃_L by at most ε/2 for all input sequences s. Hence,

L(ε, G ∘ F̃|_s̄, |·|_∞) ≤ L(ε/2, G ∘ F̃_L|_s̄, |·|_∞) = L(ε/2, G ∘ F̃_L|_{T_L(s̄)}, |·|_∞),

where T_L(s̄) denotes the application of the truncation T_L to every s_i in s̄. We can therefore bound the term L(ε, G ∘ F̃|_s̄, |·|_∞) for every s̄ by 2 (16m/ε²)^(d log₂(24m/(dε))), where d = VC_{ε/8}(G ∘ F̃_L) is finite because G ∘ F̃_L fulfills the UCED property.
Hence, the quotient E_{P^m}(log₂ L(ε, G ∘ F̃|_s̄, |·|_∞))/m becomes arbitrarily small for large m, every ε > 0, and every P ∈ 𝒫.

As a consequence, standard recurrent networks with small weights in the recursive part, such that the transition function constitutes a contraction, and with limited weights in the feedforward part, such that Lipschitz continuity is guaranteed, fulfill the UCED property: the function classes G ∘ F̃_L from the above proof correspond in this case to simple feedforward networks with more than one hidden layer, which have a finite fat-shattering dimension and therefore fulfill the UCED property for standard activation functions like the hyperbolic tangent (Baum & Haussler, 1989; Karpinski & Macintyre, 1995). An alternative proof of the UCED property for real-valued inputs can be obtained by relating G ∘ F̃ to the nonrecursive class G ∘ F, as follows:

Lemma 8. Assume C_1 ∈ (0, 1) and C_2 > 0 are fixed. Assume U and X are compact sets. Assume F is a function class with domain X × U and codomain U such that every function in F is a contraction with parameter C_1 with respect to the second argument. Assume that, in addition, every function in F is Lipschitz continuous with parameter C′_1. Assume G is a function class with domain U and codomain [0, 1] such that every function in G is Lipschitz continuous with parameter C_2. Then G ∘ F̃ fulfills the UCED property if G ∘ F does.

Proof. Note that L(ε, G ∘ F̃|_s̄, |·|_∞) ≤ L(ε, G ∘ F̃, |·|_∞) for all s̄. Because of lemma 2 and the Lipschitz continuity of all functions in G with parameter C_2, we find L(ε, G ∘ F̃, |·|_∞) ≤ L(δ, G ∘ F, |·|_∞) for some δ > 0 that depends on ε, C_1, and C_2. Because X and U are compact, we can find a finite covering xu := {(x_1, u_1), ..., (x_c, u_c)} of the set X × U with parameter δ/(3C′_1 C_2).
Denote by N(δ, H, |·|_∞) the smallest size of a δ-covering of a function class H with respect to the metric |·|_∞ such that all functions in the cover are contained in H itself. Because of the triangle inequality, the estimation N(2δ, H, |·|_∞) ≤ L(δ, H, |·|_∞) ≤ N(δ, H, |·|_∞) follows immediately for every function class H. Now we find

L(δ, G ∘ F, |·|_∞) ≤ N(δ/3, G ∘ F|_{xu}, |·|_∞)

because of the following: choose for (x, u) ∈ X × U a closest (x_i, u_i) in xu, and for g ∘ f in G ∘ F a function g_j ∘ f_j corresponding to a function in
N(δ/3, G ∘ F|_{xu}, |·|_∞) such that the distance to g ∘ f is minimal on xu. Then

|g(f((x, u))) − g_j(f_j((x, u)))| ≤ |g(f((x, u))) − g(f((x_i, u_i)))| + |g(f((x_i, u_i))) − g_j(f_j((x_i, u_i)))| + |g_j(f_j((x_i, u_i))) − g_j(f_j((x, u)))|
≤ C′_1 C_2 · δ/(3C′_1 C_2) + δ/3 + C′_1 C_2 · δ/(3C′_1 C_2) = δ.

Since the UCED property holds for G ∘ F, we can bound the quantity

N(δ/3, G ∘ F|_{xu}, |·|_∞) ≤ L(δ/6, G ∘ F|_{xu}, |·|_∞) ≤ 2 (4 · 36 · c/δ²)^(d log₂(12 · 6 · c/(dδ))),

where c depends only on δ, C′_1, and C_2, and d = VC_{δ/24}(G ∘ F) is finite because of the UCED property of G ∘ F. Hence, the quantity L(ε, G ∘ F̃|_s̄, |·|_∞) can be bounded by a finite number for fixed ε. Therefore, the UCED property of G ∘ F̃ follows.

The additional property that the set X is compact allows us to connect the learnability of recurrent architectures with a contractive transition function to the learnability of the corresponding nonrecursive transition function.

We conclude this section with two experiments that give some hints on the effect of small recurrent weights on the generalization ability. We use RNNs for next-step prediction on two sequences. The first is the Mackey-Glass time series with dynamics

dx/dτ = b·x(τ) + a·x(τ − d)/(1 + x(τ − d)^10)

with a = 0.2, b = −0.1, and d = 17 (Mackey & Glass, 1977). The task for the RNN is to predict the related discrete-time series S1: x(3·t) − 0.4 for t = 0, 1, 2, ..., which has values in (0, 1) and quasi-periodic behavior. In addition, we consider the Boolean time series x(t) = x(t−1) ∨ ¬x(t−2), t = 2, 3, ..., with x(0) = x(1) = 0. We introduce observation noise by flipping each entry with probability 0.1. The second task for the RNN is to predict the related sequence S2: 0.8·x(t) + 0.1 ∈ (0, 1). For both tasks, we generated 2000 training instances and 1334 test instances. We are interested in the generalization ability of networks that fit these sequences with different sizes of the weights on recurrent connections. A small network with two hidden neurons and the logistic activation function is used for prediction. To separate the effects of RNN training from the effect of small weights, we use no training algorithm but consider only randomly generated RNNs.
For different sizes of the recurrent weights, we compare the test set error of the fraction of 100,000 randomly generated networks that have a mean absolute training error smaller than 0.15. Hence, "training" consists in our case only of
accepting or rejecting networks based on their training set performance. To separate the positive effect of weight restriction for the recurrent dynamics from the benefit of small weights for feedforward networks (Bartlett, 1997), we initialize the output weights and the weights connected to the input randomly in the interval (−4, 4) in all cases. The recurrent connections are randomly initialized in the interval (−B, B), and B is varied from 0.1 to 10.1. The recurrent mapping need no longer be a contraction for B ≥ 2. The relationship between the fraction of randomly generated networks with training error smaller than 0.15 and the size B of the recurrent connections is presented in Figure 1. Figure 2 shows the mean absolute training and test set errors for the two tasks. For comparison, constant prediction of the expected value for S1 has an error of 0.19, and default classification according to the majority class in S2 gives an error of 0.24. In our experiments, the mean error on the training set remains almost constant, whereas the mean error on the test set increases with increasing size of the recurrent weights. This increase is smooth; hence, no dramatic decrease of the generalization ability can be observed when noncontractive recursive mappings might occur, that is, when the weights come from an interval with B > 2. For S1, the test error becomes as large as 0.24, which almost corresponds to random guessing. For S2, the test error approaches 0.2 for large weights, which is still better than a majority vote; hence, generalization can be observed here even for large recurrent weights. The generalization error, that is, the absolute distance between the training and test set errors, is depicted in Figure 3. The mean generalization error reaches values of 0.1 and 0.06, respectively, for large weights and is much smaller for small weights.
As shown in Figure 4, the percentage of networks with low training error and with test error comparable to the training error decreases with increasing radius B of the recurrent connections. For small recurrent weights, nearly 100% or 65%, respectively, of the networks with small training error have a test error of at most 0.17, whereas the percentage decreases to 40% or 30%, respectively, with increasing size of the recurrent weights. These experiments indicate that, in this setting, the generalization ability of RNNs without further restrictions is better for smaller recurrent weights. However, the particularly "bad" situations that could occur in theory for noncontractive transition functions are not observed for randomly generated networks: the increase of the test error is smooth with respect to the size of the weights. Note that no training has been taken into account in this setting. It is very likely that training provides additional regularization to the RNNs. Hence, randomly generated networks might not be representative of typical training outputs, and the generalization error of trained networks with possibly large recurrent weights might be much better than the reported results. Further investigation is necessary to answer the question of whether initialization with small weights has a positive effect on the generalization ability in realistic training settings, but such experiments are beyond the scope of this article.
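The selection procedure described above can be sketched in a self-contained way (our own schematic reimplementation with hypothetical simplifications: shorter series, 1,000 candidate networks per weight radius instead of 100,000, and only the noisy Boolean task; the article's exact architecture details are not fully specified):

```python
import math, random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_series(n, seed):
    # Noisy Boolean task: x(t) = x(t-1) or not x(t-2), entries flipped with
    # probability 0.1, then rescaled to 0.8*x + 0.1 as in S2.
    rng = random.Random(seed)
    x = [0, 0]
    for t in range(2, n):
        x.append(int(x[t - 1] or not x[t - 2]))
    return [0.9 if (v ^ (rng.random() < 0.1)) else 0.1 for v in x[:n]]

def random_net(rng, B):
    # Input/output weights in (-4, 4); the 2x2 recurrent matrix in (-B, B).
    ff = [rng.uniform(-4, 4) for _ in range(7)]
    rec = [rng.uniform(-B, B) for _ in range(4)]
    return ff, rec

def mean_abs_error(net, series):
    # Drive the 2-hidden-unit logistic RNN over the series and measure the
    # mean absolute next-step prediction error.
    (w1, w2, b1, b2, v1, v2, bo), (r11, r12, r21, r22) = net
    h1 = h2 = 0.0
    total = 0.0
    for t in range(len(series) - 1):
        x = series[t]
        h1, h2 = (logistic(w1 * x + r11 * h1 + r12 * h2 + b1),
                  logistic(w2 * x + r21 * h1 + r22 * h2 + b2))
        total += abs(logistic(v1 * h1 + v2 * h2 + bo) - series[t + 1])
    return total / (len(series) - 1)

train = make_series(300, seed=1)
test = make_series(200, seed=2)
rng = random.Random(0)
for B in (0.5, 4.0):
    accepted = []
    for _ in range(1000):
        net = random_net(rng, B)
        if mean_abs_error(net, train) < 0.15:  # "training" = accept/reject
            accepted.append(mean_abs_error(net, test))
    print(B, len(accepted))
```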
Figure 1: Fraction (max 1) of randomly generated networks with training error smaller than 0.15 for S1 (top) and S2 (bottom), respectively, depending on the size B of the recurrent connections. Among 100,000 randomly generated networks, we obtain about 200 to 1400 hits for S1 and 4200 to 4500 hits for S2.
6 Discussion

We have rigorously shown that initialization of recurrent networks with small weights biases the networks toward definite memory models. This theoretical investigation supports our previous experimental findings (Tiňo et al., 2002a, 2002b). In particular, by establishing simulation of definite memory machines by contractive recurrent networks and vice versa, we
Figure 2: Mean training and test errors of RNNs with randomly initialized weights on the two time series S1 (top) and S2 (bottom). The x-axes show the radius B of the interval from which recurrent weights were chosen. The default horizontal line shows the error of constant prediction of the expected value for S1 (top) and of constant classification to the majority class for S2 (bottom). The default models represent naive memoryless predictors.
proved an equivalence between problems that can be tackled with recurrent neural networks with small weights and definite memory machines. Analogous results for probabilistic counterparts of these models follow from the same line of reasoning and show the equivalence of fixed-order Markov models and probabilistic recurrent networks with small weights.
Figure 3: Mean generalization error of RNNs for S1 and S2 , respectively, depending on the size of the recurrent connections.
We conjecture that this architectural bias is beneficial for training: it biases the architectures toward a region in the parameter space where simple and intuitive behavior can be found, thus guaranteeing initial exploration of simple models for which prior theoretical bounds on the generalization error can be derived. A first step in this direction has also been investigated in this article, within the framework of statistical learning theory. It can be shown that, unlike general recurrent networks with arbitrary precision, recurrent networks with small weights allow bounds on the generalization ability that depend only on the number of parameters of the network and the training set size, but not on the specific examples of the training set or the input distribution. These bounds hold even if infinite accuracy is available and inputs may be real-valued. The argument is valid for every fixed weight restriction of recurrent architectures that guarantees that the transition function is a contraction with a given fixed contraction parameter. These learning results can easily be extended to arbitrary contractive transition functions with no a priori known constant through the luckiness framework of machine learning (Shawe-Taylor, Bartlett, Williamson, & Anthony, 1998). The size of the weights, or the parameter of the contractive transition function, respectively, offers a hierarchy of nested function classes with increasing complexity. The
Figure 4: Percentage of networks with test error smaller than 0.16 and 0.17, respectively, among all randomly generated networks with training error at most 0.15, for various sizes of the recurrent connections, for S1 (top) and S2 (bottom).
contraction parameter controls the structural risk when learning contractive recurrent architectures. Although the VC dimension of RNNs might become arbitrarily large in theory if arbitrary inputs and weights are dealt with, this is not likely to occur in practice: it is well known that lower bounds on the VC dimension require high precision of the computation, and the bounds are effectively limited if the computation is disrupted by noise. Maass and Orponen (1998) and Maass and Sontag (1999) provide bounds on the VC dimension depending on
the given noise. Moreover, the problem of long-term dependencies likely restricts the search space for RNN training to comparably simple regions and yields a restriction of the effective VC dimension that can be observed when training RNNs. In addition, the choice of the error function (e.g., quadratic error) puts an additional bias on training and might constitute a further limitation of the VC dimension achieved in practice. Hence, the restriction to small weights in the initial phases of training that has been investigated in this article constitutes one aspect among others that might account for the good generalization ability of RNNs in practice. We have derived explicit prior bounds on the generalization ability for this case and have established an equivalence of the dynamics to the well-understood dynamics of DMMs. As a consequence, small weights constitute one sufficient condition for valid generalization of RNNs, among other well-known guarantees. The concrete effect of the small-weight restriction and of the other aspects mentioned above has to be investigated further in experiments. Two preliminary experiments for time-series prediction have shown that small recurrent weights have a beneficial effect on the generalization ability of RNNs. Thereby, we tested randomly generated RNNs in order to rule out numerical effects of the training algorithm. We varied only the size of the recurrent connections to rule out the beneficial effect of small weights in standard feedforward networks (Bartlett, 1997). For randomly chosen small networks, the percentage of networks with small weights that generalize well to unseen examples is larger than the percentage among RNNs initialized with larger weights. Thereby, the increase of the generalization error is smooth with respect to the size of the weights; that is, networks with particularly bad generalization ability for larger weights can hardly be found by random choice.
Since efficient training of RNNs is still an open problem, we did not incorporate the effects of training in our experiments; training might introduce additional regularization into learning such that the effect of small weights might vanish. Nevertheless, restriction to the smallest possible weights for a given task seems one possible strategy to achieve valid generalization, and we have derived explicit mathematical bounds for this setting. In Tiňo et al. (2002a, 2002b), we extracted from recurrent networks predictive models that operated on the network dynamics. The networks were first randomly initialized with small weights and then input driven with training sequences. The resulting clusters of recurrent activations were labeled with (cluster-conditional) empirical next-symbol distributions calculated on the training stream. Hence, training takes place in one epoch on the output level only. No optimization of the representation of the sequences in the hidden neurons was done; instead, the sequence representation provided by the randomly initialized recurrent network dynamics was used. By performing experiments on symbolic sequences of various memory and subsequence structures, we showed that predictive models extracted from these networks, where the internal representation of the sequences is given by networks randomly initialized with small weights, achieved performance very similar
to that of variable memory length Markov models (VLMMs). Obviously, recurrent networks have the potential to outperform finite memory models, and they indeed did so after a careful (and often rather lengthy) training process. But since the predictive models extracted from networks with untrained recurrent connections initialized with small weights⁴ correspond to VLMMs, depending on the nature of the data, the performance gain resulting from training the appropriate recursive representation in the hidden neurons of recurrent neural networks can be quite small. We (Tiňo et al., 2002b) argue that to appreciate how much information has really been induced during training, the network performance should always be compared with that of VLMMs and of predictive models extracted before training, as "null" base models. Interestingly, the contractive nature of recurrent networks initialized with small weights enables us to perform a rigorous fractal analysis of the state-space representations induced by such networks. The first results in that direction can be found in Tiňo and Hammer (2002).

Acknowledgments

We thank two anonymous reviewers for profound and valuable comments on an earlier version of this article.

References

Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge: Cambridge University Press.
Baldi, P., Brunak, S., Frasconi, P., Pollastri, G., & Soda, G. (2001). Bidirectional dynamics for protein secondary structure prediction. In R. Sun & C. L. Giles (Eds.), Sequence learning: Paradigms, algorithms, and applications (pp. 80–104). Berlin: Springer-Verlag.
Bartlett, P. L. (1997). For valid generalization, the size of the weights is more important than the size of the network. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 134–141). Cambridge, MA: MIT Press.
Bartlett, P. L., Long, P., & Williamson, R. (1994).
Fat-shattering and the learnability of real-valued functions. In Proceedings of the 7th ACM Conference on Computational Learning Theory (pp. 299–310).
Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1(1), 151–165.
Bengio, Y., & Frasconi, P. (1996). Input/output HMMs for sequence processing. IEEE Transactions on Neural Networks, 7(5), 1231–1249.
⁴ Training is performed in one epoch to adjust the hidden-layer-to-output mapping.
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Bühlmann, P., & Wyner, A. J. (1999). Variable length Markov chains. Annals of Statistics, 27, 480–513.
Carrasco, R. C., & Forcada, M. L. (2001). Simple strategies to encode tree automata in sigmoid recursive neural networks. IEEE Transactions on Knowledge and Data Engineering, 13(2), 148–156.
Christiansen, M. H., & Chater, N. (1999). Towards a connectionist model of recursion in human linguistic performance. Cognitive Science, 23, 157–205.
Clouse, D. S., Giles, C. L., Horne, B. G., & Cottrell, G. W. (1997). Time-delay neural networks: Representation and induction of finite state machines. IEEE Transactions on Neural Networks, 8(5), 1065.
Elman, J., Bates, E., Johnson, M., Karmiloff-Smith, A., Parisi, D., & Plunkett, K. (1996). Rethinking innateness: A connectionist perspective on development. Cambridge, MA: MIT Press.
Frasconi, P., Gori, M., Maggini, M., & Soda, G. (1995). Unified integration of explicit rules and learning by example in recurrent networks. IEEE Transactions on Knowledge and Data Engineering, 8(6), 313–332.
Funahashi, K., & Nakamura, Y. (1993). Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, 12, 831–864.
Giles, C. L., Lawrence, S., & Lin, T. (1995). Learning a class of large finite state machines with a recurrent neural network. Neural Networks, 8, 1359–1365.
Giles, C. L., Lawrence, S., & Tsoi, A. C. (1997). Rule inference for financial prediction using recurrent neural networks. In Proceedings of the Conference on Computational Intelligence for Financial Engineering (pp. 253–259). New York.
Guyon, I., & Pereira, F. (1995). Design of a linguistic postprocessor using variable memory length Markov models. In Proceedings of the International Conference on Document Analysis and Recognition (pp. 454–457). Montreal, Canada: IEEE Computer Society Press.
Hammer, B. (1997). On the generalization of Elman networks. In W. Gerstner, A. Germond, M. Hasler, & J.-D. Nicaud (Eds.), Artificial Neural Networks—ICANN'97 (pp. 409–414). Berlin: Springer-Verlag.
Hammer, B. (1999). On the learnability of recursive data. Mathematics of Control, Signals, and Systems, 12, 62–79.
Hammer, B. (2001). Generalization ability of folding networks. IEEE Transactions on Knowledge and Data Engineering, 13(2), 196–206.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 78–150.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hornik, K. (1993). Some new results on neural network approximation. Neural Networks, 6, 1069–1072.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Karpinski, M., & Macintyre, A. (1995). Polynomial bounds for the VC dimension of sigmoidal neural networks. In Proceedings of the 27th Annual ACM Symposium on the Theory of Computing (pp. 200–208).
Kohavi, Z. (1978). Switching and finite automata. New York: McGraw-Hill.
Kohonen, T. (1997). Self-organizing maps. Berlin: Springer-Verlag.
Koiran, P., & Sontag, E. D. (1997). Vapnik-Chervonenkis dimension of recurrent neural networks. In Proceedings of the 3rd European Conference on Computational Learning Theory (pp. 223–237).
Kolen, J. F. (1994a). Recurrent networks: State machines or iterated function systems? In Proceedings of the 1993 Connectionist Models Summer School (pp. 203–210). Mahwah, NJ: Erlbaum.
Kolen, J. F. (1994b). The origin of clusters in recurrent neural state space. In Proceedings of the 1993 Connectionist Models Summer School (pp. 508–513). Mahwah, NJ: Erlbaum.
Krogh, A. (1997). Two methods for improving performance of a HMM and their application for gene finding. In Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology (pp. 179–186). Menlo Park, CA: AAAI Press.
Laird, P., & Saul, R. (1994). Discrete sequence prediction and its applications. Machine Learning, 15, 43–68.
Maass, W., & Orponen, P. (1998). On the effect of analog noise in discrete-time analog computation. Neural Computation, 10(5), 1071–1095.
Maass, W., & Sontag, E. D. (1999). Analog neural nets with gaussian or other common noise distributions cannot recognize arbitrary regular languages. Neural Computation, 11, 771–782.
Mackey, M. C., & Glass, L. (1977). Oscillations and chaos in physiological control systems. Science, 197, 287–289.
Nadas, J. (1984). Estimation of probabilities in the language model of the IBM speech recognition system. IEEE Transactions on ASSP, 4, 859–861.
Neal, R., & Hinton, G. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan (Ed.), Learning in graphical models (pp. 355–368).
Norwell, MA: Kluwer.
Omlin, C. W., & Giles, C. L. (1996a). Constructing deterministic finite-state automata in recurrent neural networks. Journal of the ACM, 43(6), 937–972.
Omlin, C. W., & Giles, C. L. (1996b). Stable encoding of large finite-state automata in recurrent networks with sigmoid discriminants. Neural Computation, 8, 675–696.
Robinson, T., Hochberg, M., & Renals, S. (1996). The use of recurrent networks in continuous speech recognition. In C.-H. Lee & F. K. Song (Eds.), Advanced topics in automatic speech and speaker recognition. Norwell, MA: Kluwer.
Ron, D., Singer, Y., & Tishby, N. (1996). The power of amnesia. Machine Learning, 25, 117–150.
Sejnowski, T., & Rosenberg, C. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145–168.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R., & Anthony, M. (1998). Structural risk minimization over data dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
Siegelmann, H. T., & Sontag, E. D. (1994). Analog computation, neural networks, and circuits. Theoretical Computer Science, 131, 331–360.
Siegelmann, H. T., & Sontag, E. D. (1995). On the computational power of neural networks. Journal of Computer and System Sciences, 50, 132–150.
Sontag, E. D. (1992). Feedforward nets for interpolation and classification. Journal of Computer and System Sciences, 45, 20–48.
Sontag, E. D. (1998). VC dimension of neural networks. In C. Bishop (Ed.), Neural networks and machine learning (pp. 69–95). Berlin: Springer-Verlag.
Sun, R. (2001). Introduction to sequence learning. In R. Sun & C. L. Giles (Eds.), Sequence learning: Paradigms, algorithms, and applications (pp. 1–10). Berlin: Springer-Verlag.
Tiňo, P., Čerňanský, M., & Beňušková, L. (2002a). Markovian architectural bias of recurrent neural networks. In P. Sinčák, J. Vaščák, V. Kvasnička, & J. Pospichal (Eds.), Intelligent technologies: Theory and applications (Frontiers in AI and Applications, 2nd Euro-International Symposium on Computational Intelligence) (pp. 17–23). Amsterdam: IOS Press.
Tiňo, P., Čerňanský, M., & Beňušková, L. (2002b). Markovian architectural bias of recurrent neural networks (Tech. Rep. No. NCRG/2002/008). NCRG, Aston University, UK.
Tiňo, P., & Dorffner, G. (2001). Predicting the future of discrete sequences from fractal representations of the past. Machine Learning, 45(2), 187–218.
Tiňo, P., & Hammer, B. (2002). Architectural bias of recurrent neural networks: Fractal analysis. In J. R. Dorronsoro (Ed.), International Conference on Artificial Neural Networks (ICANN 2002) (pp. 1359–1364). Berlin: Springer-Verlag.
Tiňo, P., & Šajda, J. (1995). Learning and extracting initial Mealy machines with a modular neural network model. Neural Computation, 4, 822–844.
Vidyasagar, M. (1997). A theory of learning and generalization. Berlin: Springer-Verlag.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328–339.

Received June 26, 2002; accepted January 24, 2003.
LETTER
Communicated by Christian Omlin
Architectural Bias in Recurrent Neural Networks: Fractal Analysis

Peter Tiňo
[email protected]
Aston University, Birmingham B4 7ET, U.K.

Barbara Hammer
[email protected]
University of Osnabrück, D-49069 Osnabrück, Germany
We have recently shown that when initialized with "small" weights, recurrent neural networks (RNNs) with standard sigmoid-type activation functions are inherently biased toward Markov models; even prior to any training, RNN dynamics can be readily used to extract finite memory machines (Hammer & Tiňo, 2002; Tiňo, Čerňanský, & Beňušková, 2002a, 2002b). Following Christiansen and Chater (1999), we refer to this phenomenon as the architectural bias of RNNs. In this article, we extend our work on the architectural bias in RNNs by performing a rigorous fractal analysis of recurrent activation patterns. We assume the network is driven by sequences obtained by traversing an underlying finite-state transition diagram, a scenario that has been frequently considered in the past, for example, when studying RNN-based learning and implementation of regular grammars and finite-state transducers. We obtain lower and upper bounds on various types of fractal dimensions, such as the box-counting and Hausdorff dimensions. It turns out that not only can the recurrent activations inside RNNs with small initial weights be exploited to build Markovian predictive models, but the activations also form fractal clusters whose dimension can be bounded by the scaled entropy of the underlying driving source. The scaling factors are fixed and are given by the RNN parameters.

Neural Computation 15, 1931–1957 (2003). © 2003 Massachusetts Institute of Technology.

1 Introduction

There is a considerable amount of literature devoted to connectionist processing of sequential symbolic structures (e.g., Kremer, 2001). For example, researchers have been interested in formulating models of human performance in processing linguistic patterns of various complexity (Christiansen & Chater, 1999). Recurrent neural networks (RNNs) constitute a well-established approach for dealing with linguistic data, and the capability of RNNs of processing finite automata and some context-free and
context-sensitive languages is well known (Bodén & Wiles, 2002; Forcada & Carrasco, 1995; Tiňo, Horne, Giles, & Collingwood, 1998). The underlying dynamics that emerges through training has been investigated in the work of Blair, Bodén, Pollack, and others (Blair & Pollack, 1997; Bodén & Blair, in press; Rodriguez, Wiles, & Elman, 1999). It turns out that although theoretically RNNs are able to explore a wide range of possible dynamical regimes, trained RNNs often achieve complicated behavior by combining appropriate fixed-point dynamics, including attractive fixed points as well as saddle points.

It should be mentioned that RNNs can be canonically generalized to so-called recursive networks for processing tree structures and directed acyclic graphs, hence providing a very attractive connectionist framework for symbolic or structural data (Frasconi, Gori, & Sperduti, 1998; Hammer, 2002). For both recurrent and recursive networks, the interior connectionist representation of input data given by the hidden neurons' activation profile is of particular interest. The hidden neurons' activations might allow the detection of important information about relevant substructures of the inputs for the respective task, as has been demonstrated, for example, for quantitative structure-activity relationship prediction tasks with recursive networks (Micheli, Sperduti, Starita, & Bianucci, 2002). The hidden neurons' activations often show considerable structure, such as clusters and structural differentiation. For RNNs, the hidden nodes' activation profile allows inferring parts of the underlying dynamical components that account for the network behavior (Blair & Pollack, 1997). However, it has been known for some time that when training RNNs to process symbolic sequences, activations of recurrent units display a considerable amount of structural differentiation even prior to learning (Christiansen & Chater, 1999; Kolen, 1994a, 1994b; Manolios & Fanelli, 1994).

Following Christiansen and Chater (1999), we refer to this phenomenon as the architectural bias of RNNs. Now the question arises: Which parts of the structural differentiation are due to the learning process, and hence likely to encode information related to the specific learning task, and which parts emerge automatically, even without training, due to the architectural bias of RNNs?

We have recently shown, both empirically and theoretically, the meaning of the architectural bias for RNNs: when initialized with "small" weights, RNNs with standard sigmoid-type activation functions are inherently biased toward Markov models; that is, even prior to any training, RNN dynamics can be readily used to extract finite memory machines (Hammer & Tiňo, 2002; Tiňo, Čerňanský, & Beňušková, 2002a, 2002b). In this study, we extend our work by rigorously analyzing the "size" of recurrent activation patterns in such RNNs. Since the activation patterns are of a fractal nature, their size is expressed through fractal dimensions (Falconer, 1990). The dimensionality of the hidden neurons' activation patterns provides one characteristic for comparing different networks and the effect of learning
algorithms on the network's interior connectionist representation of data.

We concentrate on the case where the RNN is driven by sequences generated from an underlying finite-state automaton, a scenario frequently studied in the past in the context of RNN-based learning and implementation of regular grammars and finite-state transducers (Casey, 1996; Cleeremans, Servan-Schreiber, & McClelland, 1989; Elman, 1990; Forcada & Carrasco, 1995; Frasconi, Gori, Maggini, & Soda, 1996; Giles et al., 1992; Manolios & Fanelli, 1994; Tiňo & Šajda, 1995; Watrous & Kuhn, 1992). We will derive bounds on the dimensionality of the recurrent activation patterns that reflect the complexity of the input source and the bias caused by the recurrent architecture initialized with small weights.

The letter is organized as follows. After a brief introduction to fractal dimensions and automata-directed iterated function systems (IFSs) in sections 2 and 3, respectively, we prove in section 4 dimension estimates for invariant sets of general automata-directed IFSs composed of nonsimilarities. In section 5, we reformulate RNNs driven by sequences over finite alphabets as IFSs. Section 6 is devoted to dimension estimates of recurrent activations inside contractive RNNs. The discussion in section 7 puts our findings into the context of previous work. We confront the theoretically calculated dimension bounds with empirical fractal dimension estimates of recurrent activations in section 8. Section 9 summarizes the key messages of this study.

2 Fractal Dimensions

In this section we briefly introduce various notions of "size" for geometrically complicated objects called fractals. (For more detailed information, see Falconer, 1990.)

Consider a metric space X. Let K be a totally bounded subset of X. For each $\delta > 0$, define $N_\delta(K)$ to be the smallest number of sets of diameter at most δ that can cover K (a δ-fine cover of K). Since K is totally bounded, $N_\delta(K)$ is finite for each $\delta > 0$. The rate of increase of $N_\delta(K)$ as $\delta \to 0$ tells us something about the "size" of K.

Definition 1. The upper and lower box-counting dimensions of K are defined as

$$\dim_B^+ K = \limsup_{\delta \to 0} \frac{\log N_\delta(K)}{-\log \delta} \qquad \text{and} \qquad \dim_B^- K = \liminf_{\delta \to 0} \frac{\log N_\delta(K)}{-\log \delta},$$  (2.1)

respectively.
Definition 2. Let $s > 0$. For $\delta > 0$, define

$$H_\delta^s(K) = \inf_{\Gamma_\delta(K)} \sum_{B \in \Gamma_\delta(K)} (\operatorname{diam} B)^s,$$  (2.2)

where the infimum is taken over the set $\Gamma_\delta(K)$ of all countable δ-fine covers of K. Define

$$H^s(K) = \lim_{\delta \to 0} H_\delta^s(K).$$

The Hausdorff dimension of the set K is

$$\dim_H K = \sup\{s \mid H^s(K) = \infty\} = \inf\{s \mid H^s(K) = 0\}.$$  (2.3)

The Hausdorff dimension is more subtle than the box-counting dimensions in the sense that the former can capture details not detectable by the latter. In a way, box-counting dimensions can be thought of as indicating the efficiency with which a set may be covered by small sets of equal size. In contrast, the Hausdorff dimension involves coverings by sets of small but perhaps widely varying size (Falconer, 1990). It is well known that

$$\dim_H K \le \dim_B^- K \le \dim_B^+ K.$$  (2.4)
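To make definition 1 concrete, the following sketch numerically estimates the box-counting dimension of the middle-third Cantor set and compares the fitted slope with the known value $\log 2 / \log 3 \approx 0.631$. The example set and all code are our illustration, not part of the original analysis.

```python
import itertools
import math
import numpy as np

def cantor_points(level):
    # Left endpoints of the 2^level intervals of the level-th
    # approximation of the middle-third Cantor set (ternary digits 0/2).
    pts = []
    for digits in itertools.product((0, 2), repeat=level):
        pts.append(sum(d * 3.0 ** -(i + 1) for i, d in enumerate(digits)))
    return np.array(pts)

def box_count(points, delta):
    # N_delta(K): number of boxes of side delta meeting the point set.
    # The tiny shift guards against floating-point error at box edges.
    return len(np.unique(np.floor(points / delta + 1e-9).astype(np.int64)))

pts = cantor_points(12)
deltas = [3.0 ** -k for k in range(2, 9)]
counts = [box_count(pts, d) for d in deltas]

# Fit log N_delta against log(1/delta); the slope estimates dim_B.
slope = np.polyfit(np.log(1.0 / np.array(deltas)), np.log(counts), 1)[0]
print(slope, math.log(2) / math.log(3))
```

At scale $3^{-k}$ the level-12 approximation occupies exactly $2^k$ boxes, so the fit recovers the dimension essentially exactly here; for empirical point clouds (as in section 8 of the letter), the fitted slope is only an estimate.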
3 Finite Automata and Automata-Directed IFS

We now introduce some terminology related to IFSs, with an emphasis on IFSs driven by sequences obtained by traversing finite automata. First, we briefly recall basic notions related to sequences over finite alphabets.

3.1 Symbolic Sequences over a Finite Alphabet. Consider a finite alphabet $\mathcal{A} = \{1, 2, \ldots, A\}$. The sets of all finite (nonempty) and infinite sequences over $\mathcal{A}$ are denoted by $\mathcal{A}^+$ and $\mathcal{A}^\omega$, respectively. Sequences from $\mathcal{A}^\omega$ are also referred to as ω-words. Denoting by λ the empty word, we write $\mathcal{A}^* = \mathcal{A}^+ \cup \{\lambda\}$. The set of all sequences consisting of a finite or an infinite number of symbols from $\mathcal{A}$ is then $\mathcal{A}^\infty = \mathcal{A}^+ \cup \mathcal{A}^\omega$. The set of all sequences over $\mathcal{A}$ with exactly n symbols (n-blocks) is denoted by $\mathcal{A}^n$. The number of symbols in a sequence w is denoted by $|w|$.

Let $w = s_1 s_2 \cdots \in \mathcal{A}^\infty$ and $i \le j$. By $w_i^j$ we denote the string $s_i s_{i+1} \cdots s_j$, with $w_i^i = s_i$. For $i > j$, $w_i^j = \lambda$. If $|w| = n \ge 1$, then $w^- = w_1^{n-1}$ is obtained by omitting the last symbol of w. (If $|w| = 1$, then $w^- = \lambda$.) A partial order ≤ may be defined on $\mathcal{A}^*$ as follows: write $w_1 \le w_2$ if and only if $w_1$ is a prefix of $w_2$, that is, there is some $w_3 \in \mathcal{A}^*$ such that
$w_2 = w_1 w_3$. Two strings $w_1$, $w_2$ are incomparable if neither $w_1 \le w_2$ nor $w_2 \le w_1$. For each finite string $c \in \mathcal{A}^*$, the cylinder $[c]$ is the set of all infinite strings $w \in \mathcal{A}^\omega$ that begin with c, that is, $[c] = \{w \in \mathcal{A}^\omega \mid c \le w\}$.

3.2 Iterated Function Systems. Let X be a complete metric space. A (contractive) IFS (Barnsley, 1988) consists of A contractions $f_a: X \to X$, $a \in \{1, 2, \ldots, A\} = \mathcal{A}$, operating on X. In the basic setting, at each time step all the maps $f_a$ are used to transform X in an iterative manner:

$$X_0 = X,$$  (3.1)
$$X_{t+1} = \bigcup_{a \in \mathcal{A}} f_a(X_t), \quad t \ge 0.$$  (3.2)

There is a unique compact invariant set $K \subset X$, $K = \bigcup_{a \in \mathcal{A}} f_a(K)$, called the attractor of the IFS. Actually, it can be shown that the sequence $\{X_t\}_{t=0}^{\infty}$ converges to K under a Hausdorff distance on the power set of X. For any ω-word $w = s_1 s_2 \cdots \in \mathcal{A}^\omega$, the images of X under compositions of IFS maps,

$$(f_{s_1}(f_{s_2}(\cdots(f_{s_n}(X))\cdots))) = f_{w_1^n}(X),$$

converge to an element of K as $n \to \infty$.

More complicated IFS systems can be devised, for example, by constraining the sequences of maps that can be applied to X in the process of defining the attractive invariant set K. One possibility is to demand that the words w corresponding to allowed compositions of IFS maps belong to a language over $\mathcal{A}$ (Prusinkiewicz & Hammel, 1992). A specific example of this are recurrent IFSs (Barnsley, Elton, & Hardin, 1989), where the allowed words w are specified by a topological Markov chain. In this article, we concentrate on IFSs with map compositions restricted to sequences that can be read out by traversing labeled state-transition graphs (see, e.g., Culik & Dube, 1993).

3.3 IFSs Associated with State-Transition Graphs. Consider a labeled directed multigraph $\mathcal{G} = (V, E, \kappa)$, where the elements $v \in V$ and $e \in E$ are the vertices (or nodes) and edges of the multigraph $\mathcal{G}$, respectively. Each edge $e \in E$ is labeled by a symbol $\kappa(e) \in \mathcal{A}$ from the input alphabet $\mathcal{A} = \{1, 2, \ldots, A\}$, as prescribed by the map $\kappa: E \to \mathcal{A}$. The map κ can be naturally extended to operate on subsets $E'$ of E and sequences γ over E:

$$\kappa(E') = \bigcup_{e \in E'} \kappa(e) \quad \text{for } E' \subseteq E,$$  (3.3)
$$\kappa(\gamma) = \kappa(e_1)\kappa(e_2)\cdots \in \mathcal{A}^\infty \quad \text{for } \gamma = e_1 e_2 \ldots$$  (3.4)

The set $E_{u \to v} \subseteq E$ contains all edges from $u \in V$ to $v \in V$; $E_{u\to} = \bigcup_{v \in V} E_{u \to v}$ is the set of all edges starting at the vertex u.
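A minimal sketch of the basic "unrestricted" iteration $X_{t+1} = \bigcup_a f_a(X_t)$ of equations 3.1 and 3.2, using two affine contractions whose joint attractor is the middle-third Cantor set. The choice of maps and the grid initialization are our illustration, not from the text.

```python
import numpy as np

# Two contractions on X = [0, 1] whose attractor is the middle-third
# Cantor set: f1(x) = x/3 and f2(x) = x/3 + 2/3.
maps = [lambda x: x / 3.0, lambda x: x / 3.0 + 2.0 / 3.0]

# Iterate X_{t+1} = f1(X_t) U f2(X_t), starting from a grid on X_0 = [0, 1).
X = np.linspace(0.0, 1.0, 64, endpoint=False)
for _ in range(10):
    X = np.concatenate([f(X) for f in maps])

# After t iterations every point lies within 3^{-t} of the attractor:
# no point remains in the removed middle third, and at scale 1/27
# exactly 2^3 = 8 boxes are occupied.
assert np.all((X < 1.0 / 3.0) | (X > 2.0 / 3.0 - 1e-12))
boxes = np.unique(np.floor(X * 27 + 1e-9).astype(int))
print(len(boxes))
```

The same loop with the union replaced by a single label-driven composition per step is exactly the restricted, graph-directed setting introduced next.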
A path in the graph is a sequence $\gamma = e_1 e_2 \ldots$ of edges such that the terminal vertex of each edge $e_i$ is the initial vertex of the next edge $e_{i+1}$. The initial vertex of $e_1$ is the initial vertex $\operatorname{ini}(\gamma)$ of γ. For a finite path $\gamma = e_1 e_2 \ldots e_n$ of length n, the terminal vertex of $e_n$ is the terminal vertex $\operatorname{term}(\gamma)$ of γ. The sequence of labels read out along the path γ is $\kappa(\gamma)$. As in the case of sequences over $\mathcal{A}$, for $i \le j \le n$, $\gamma_i^j$ denotes the path $e_i e_{i+1} \ldots e_j$, and $\gamma^-$ denotes the path γ without the last edge $e_n$, $\gamma^- = \gamma_1^{n-1}$.

We write $E_{u \to v}^{(n)}$ for the set of all paths of length n with initial vertex u and terminal vertex v. The set of all paths of length n with initial vertex u is denoted by $E_{u\to}^{(n)}$. Analogously, $E_{u \to v}^{(*)}$ denotes the set of all finite paths from u to v; $E_{u\to}^{(*)}$ and $E_{u\to}^{(\omega)}$ denote the sets of all finite and infinite paths, respectively, starting at vertex u.

Definition 3. A strictly contracting state-transition graph (SCSTG) is a labeled directed multigraph $\mathcal{G} = (V, E, \kappa)$ together with a number $0 < r_a < 1$ for each symbol a from the label alphabet $\mathcal{A}$.

We will assume that the state-transition graph $\mathcal{G}$ is both deterministic (for each vertex u, there are no two distinct paths initiated at u with the same label read-out) and strongly connected (there is a path from any vertex to any other).

Definition 4. An IFS associated with an SCSTG $(V, E, \kappa, \{r_a\}_{a \in \mathcal{A}})$ consists of a complete metric space $(X, \|\cdot\|)$ (the metric is expressed through a norm) and a set of A contractions $f_a: X \to X$, one for each input symbol $a \in \mathcal{A}$, such that for all $x, y \in X$,

$$\|f_a(x) - f_a(y)\| \le r_a \|x - y\|, \quad a \in \mathcal{A}.$$  (3.5)

The allowed compositions of maps $f_a$ are controlled by the sequences of labels associated with the paths in $\mathcal{G} = (V, E, \kappa)$. The IFS image of a point $x \in X$ under a (finite) path $\gamma = e_1 e_2 \ldots e_n$ in $\mathcal{G}$ is

$$\gamma(x) = (f_{\kappa(e_1)}(f_{\kappa(e_2)}(\cdots(f_{\kappa(e_n)}(x))\cdots))) = (f_{\kappa(e_1)} \circ f_{\kappa(e_2)} \circ \cdots \circ f_{\kappa(e_n)})(x).$$  (3.6)

The contraction factor $r(\gamma)$ of the path γ is calculated as

$$r(\gamma) = \prod_{i=1}^{n} r_{\kappa(e_i)}.$$  (3.7)
For the empty path λ, $\lambda(x) = x$ and $r(\lambda) = 1$. For an infinite path $\gamma = e_1 e_2 \ldots$, the corresponding image of a point $x \in X$ is

$$\gamma(x) = \lim_{n \to \infty} (f_{\kappa(e_1)} \circ f_{\kappa(e_2)} \circ \cdots \circ f_{\kappa(e_n)})(x).$$  (3.8)

The maps $\gamma(\cdot)$ are extended to subsets $X'$ of X as follows: $\gamma(X') = \{\gamma(x) \mid x \in X'\}$.

As the length of paths γ in $\mathcal{G}$ increases, the points $\gamma(x)$ tend to closely approximate the attractive set of the IFS. The attractor now has a more intricate structure than in the case of the basic "unrestricted" IFS. There is a list of invariant compact sets $K_u \subseteq X$, one for each vertex $u \in V$ (Edgar & Golds, 1999), such that

$$K_u = \bigcup_{v \in V} \bigcup_{e \in E_{u \to v}} f_{\kappa(e)}(K_v).$$  (3.9)

Each set $K_u$ contains (possibly nonlinearly) compressed copies of the sets $K_v$. We denote the union of the sets $K_u$ by K,

$$K = \bigcup_{u \in V} K_u.$$  (3.10)

4 Dimension Estimates for $K_u$

In this section, we develop a theory of dimension estimates for the invariant sets $K_u$ of IFSs driven by sequences obtained by traversing deterministic finite automata. In particular, we ask whether upper and lower bounds on Lipschitz coefficients of contractive IFS maps $f_a$ translate into bounds on dimensions of $K_u$. In answering this question, we extend the results of Falconer (1990), who considered the original IFS setting (IFSs driven by $\mathcal{A}^\omega$), and Edgar and Golds (1999), who studied graph-directed IFSs (GDIFSs). In GDIFSs, one associates each edge in the underlying graph with a distinct IFS map, so the calculations are somewhat more straightforward. However, many of the ideas applied in proofs in Edgar and Golds (1999) were transferable to proofs in this section.

4.1 Upper Bounds.

Definition 5. Let $(V, E, \kappa, \{r_a\}_{a \in \mathcal{A}})$ be an SCSTG. For each $s > 0$, denote by $M(s)$ a square matrix whose rows and columns are indexed by the vertices V (slightly abusing notation, we identify the vertices of $\mathcal{G}$ with the integers $1, 2, \ldots, |V|$), with entries

$$M_{uv}(s) = \sum_{e \in E_{u \to v}} (r_{\kappa(e)})^s.$$  (4.1)
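Since the number $s_1$ defined next is given only implicitly by a spectral-radius condition, it is natural to find it numerically: for a strictly contracting graph, $\rho(M(s))$ is strictly decreasing in s, so bisection on $\rho(M(s)) = 1$ works. The two-vertex graph and contraction rates below are a hypothetical example of ours.

```python
import numpy as np

# Hypothetical SCSTG: vertices {0, 1}, edges (source, target, symbol),
# and a contraction rate r_a per symbol (both chosen for illustration).
edges = [(0, 0, 'a'), (0, 1, 'b'), (1, 0, 'a'), (1, 1, 'b')]
rates = {'a': 0.5, 'b': 0.3}
n_vertices = 2

def M(s):
    # Matrix of definition 5: M_uv(s) = sum over edges u->v of r_kappa(e)^s.
    m = np.zeros((n_vertices, n_vertices))
    for u, v, sym in edges:
        m[u, v] += rates[sym] ** s
    return m

def spectral_radius(m):
    return np.abs(np.linalg.eigvals(m)).max()

def scstg_dimension(lo=1e-9, hi=10.0, tol=1e-12):
    # rho(M(s)) decreases strictly in s; bisect for rho(M(s)) = 1.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if spectral_radius(M(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s1 = scstg_dimension()
print(s1)
```

For this particular graph both rows of $M(s)$ coincide, so $\rho(M(s)) = 0.5^s + 0.3^s$ and the bisection solves $0.5^{s_1} + 0.3^{s_1} = 1$, giving $s_1 \approx 0.75$.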
The unique number $s_1$ such that the spectral radius of $M(s_1)$ is equal to 1 is called the dimension of the SCSTG $(V, E, \kappa, \{r_a\}_{a \in \mathcal{A}})$. (Uniqueness follows from the Perron-Frobenius theory; see Minc, 1988.)

Theorem 1. Let $\mathcal{G} = (V, E, \kappa)$ be a labeled directed multigraph and $(X, \{f_a\}_{a \in \mathcal{A}})$ an IFS associated with the SCSTG $(\mathcal{G}, \{r_a\}_{a \in \mathcal{A}})$. Let $s_1$ be the dimension of the SCSTG $(\mathcal{G}, \{r_a\}_{a \in \mathcal{A}})$. Then for all $u \in V$, $\dim_B^+ K_u \le s_1$, and also $\dim_B^+ K \le s_1$.

Proof. By the Perron-Frobenius theory of nonnegative irreducible matrices (Minc, 1988) — the matrix $M(s_1)$ is irreducible because the underlying graph $(V, E, \kappa)$ is strongly connected — the spectral radius (equal to 1) of the matrix $M(s_1)$ is actually an eigenvalue of $M(s_1)$ with an associated positive eigenvector $\{\lambda_u\}_{u \in V}$ satisfying

$$\lambda_u = \sum_{v \in V} \lambda_v \sum_{e \in E_{u \to v}} (r_{\kappa(e)})^{s_1}, \qquad \sum_{u \in V} \lambda_u = 1, \qquad \lambda_u > 0.$$  (4.2)

Given a vertex $u \in V$, it is possible to define a measure μ on the set of cylinders

$$\mathcal{C}_u = \{[w] \mid w \in \kappa(E_{u\to}^{(*)})\}$$  (4.3)

by postulating

$$\mu([w]) = \lambda_v \cdot r(\gamma)^{s_1} \quad \text{for } w = \kappa(\gamma),\ \gamma \in E_{u \to v}^{(*)}.$$  (4.4)

Indeed, for $\gamma \in E_{u \to v}^{(*)}$, $w = \kappa(\gamma)$, we have

$$\mu([w]) = \sum_{e \in E_{v\to}} \mu([w\kappa(e)]).$$

The measure μ extends to Borel measures on $E_{u\to}^{(\omega)}$ (Edgar & Golds, 1999).

Consider a vertex $u \in V$. Fix a positive number δ. We define a (cross-cut) set

$$T_\delta = \{\gamma \mid \gamma \in E_{u\to}^{(*)},\ r(\gamma) < \delta \le r(\gamma^-)\}.$$  (4.5)

Note that $T_\delta$ is a finite set such that for every infinite path $\gamma' \in E_{u\to}^{(\omega)}$, there is exactly one prefix length n such that $(\gamma')_1^n \in T_\delta$. The sets $T_\delta$ and $\kappa(T_\delta)$
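The cross-cut set $T_\delta$ of equation 4.5 can be enumerated by breadth-first search over paths, cutting a branch the moment its contraction factor drops below δ. The toy deterministic graph and rates below are our own; the final check verifies the "exactly one prefix" property by confirming that the collected label words form a complete prefix-free family.

```python
from collections import deque

# Hypothetical deterministic SCSTG: from every vertex one 'a'-edge and
# one 'b'-edge; per-symbol contraction rates (illustrative values).
edges = {0: [(0, 'a'), (1, 'b')], 1: [(0, 'a'), (1, 'b')]}
rates = {'a': 0.5, 'b': 0.3}

def cross_cut(u, delta):
    # T_delta: finite paths gamma from u with r(gamma) < delta <= r(gamma^-).
    result = []
    queue = deque([(u, (), 1.0)])  # (current vertex, label word, r(gamma) so far)
    while queue:
        v, word, r = queue.popleft()
        for target, sym in edges[v]:
            r_new = r * rates[sym]
            if r_new < delta:
                # Threshold crossed for the first time: path belongs to T_delta.
                result.append((word + (sym,), r_new))
            else:
                queue.append((target, word + (sym,), r_new))
    return result

T = cross_cut(0, delta=0.1)
# Completeness check: weighting each word by (1/2)^length sums to 1
# exactly when every infinite a/b-sequence has one prefix in T_delta.
print(len(T), sum(0.5 ** len(w) for w, _ in T))
```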
partition the sets $E_{u\to}^{(\omega)}$ and $\kappa(E_{u\to}^{(\omega)})$, respectively. The collection

$$\mathcal{C}_u = \{\gamma(K_{\operatorname{term}(\gamma)}) \mid \gamma \in T_\delta\}$$  (4.6)

covers the set $K_u$ with sets of diameter less than $\eta = \delta \cdot \operatorname{diam}(X)$. In order to estimate $N_\eta(K_u)$, which is upper bounded by the cardinality $\operatorname{card}(T_\delta)$ of $T_\delta$, we write

$$\lambda_u = \mu(\kappa(E_{u\to}^{(\omega)})) = \sum_{\gamma \in T_\delta} \lambda_{\operatorname{term}(\gamma)} \cdot (r(\gamma))^{s_1}.$$  (4.7)

By equation 4.5, we have for each $\gamma = e_1 e_2 \ldots e_n \in T_\delta$,

$$r(\gamma) = r(\gamma^-) \cdot r_{\kappa(e_n)} \ge \delta \cdot r_{\min},$$

and so from equation 4.7, we obtain

$$\lambda_{\max} \ge \lambda_u \ge \operatorname{card}(T_\delta) \cdot \lambda_{\min} \cdot (\delta r_{\min})^{s_1},$$  (4.8)

where $\operatorname{card}(T_\delta)$ is the cardinality of $T_\delta$ and

$$\lambda_{\max} = \max_{v \in V} \lambda_v, \qquad \lambda_{\min} = \min_{v \in V} \lambda_v, \qquad r_{\min} = \min_{a \in \mathcal{A}} r_a.$$  (4.9)

Hence,

$$N_\eta(K_u) \le \operatorname{card}(T_\delta) \le \frac{\lambda_{\max}}{\lambda_{\min}} (\delta r_{\min})^{-s_1} = \frac{\lambda_{\max}}{\lambda_{\min}} \left(\frac{r_{\min}}{\operatorname{diam}(X)}\right)^{-s_1} \eta^{-s_1}.$$

It follows that

$$\log N_\eta(K_u) \le -s_1 \log \eta + \mathrm{Const},$$  (4.10)

where Const is a constant term with respect to δ. Dividing equation 4.10 by $-\log \eta > 0$, and letting the diameter η of the covering sets diminish, we obtain

$$\dim_B^+ K_u = \limsup_{\eta \to 0} \frac{\log N_\eta(K_u)}{-\log \eta} \le s_1.$$  (4.11)

If we denote the number of vertices in V by $\operatorname{card}(V)$, then the number of sets with diameter less than η needed to cover the set K can be upper bounded by

$$N_\eta(K) \le \operatorname{card}(V) \cdot \max_{u \in V} N_\eta(K_u) \le \operatorname{card}(V) \frac{\lambda_{\max}}{\lambda_{\min}} \left(\frac{r_{\min}}{\operatorname{diam}(X)}\right)^{-s_1} \eta^{-s_1},$$
and we again obtain

$$\log N_\eta(K) \le -s_1 \log \eta + \mathrm{Const}',$$  (4.12)

where Const′ is a constant term with respect to δ. Repeating the above argument leads to $\dim_B^+ K \le s_1$.

We can derive a less tight bound that is closely related to the theories of symbolic dynamics and formal languages over the alphabet $\mathcal{A}$. The adjacency matrix G of a labeled directed multigraph $\mathcal{G} = (V, E, \kappa)$ is defined element-wise as follows (although we do not allow $s = 0$ in the definition of $M(s)$, G can be formally thought of as the matrix $M(s)$ computed with s set to 0):

$$G_{uv} = \operatorname{card}(E_{u \to v}).$$  (4.13)

Theorem 2. Let $\mathcal{G} = (V, E, \kappa)$ be a labeled directed multigraph and $(X, \{f_a\}_{a \in \mathcal{A}})$ an IFS associated with the SCSTG $(\mathcal{G}, \{r_a\}_{a \in \mathcal{A}})$. Let $r_{\max} = \max_{a \in \mathcal{A}} r_a$. Then for all $u \in V$,

$$\dim_B^+ K_u \le s_1 \le \frac{\log \rho(G)}{-\log r_{\max}},$$

where $\rho(G)$ is the maximum eigenvalue of G.

Proof. Consider an SCSTG $(\mathcal{G}, r_a = r_{\max}, a \in \mathcal{A})$ with a fixed contraction rate $r_{\max}$ for all symbols in $\mathcal{A}$. The matrix, equation 4.1, for this SCSTG has the form $N(s) = r_{\max}^s G$. The spectral radius of $N(s)$, which coincides with the maximum eigenvalue of $N(s)$, is $\rho(N(s)) = r_{\max}^s \cdot \rho(G)$, where $\rho(G)$ is the spectral radius (maximum eigenvalue) of G. The dimension $s_{\max}$ associated with such an SCSTG is the solution of $\rho(N(s)) = 1$. In particular, $r_{\max}^{s_{\max}} \cdot \rho(G) = 1$, and so

$$s_{\max} = \frac{\log \rho(G)}{-\log r_{\max}}.$$  (4.14)

Next we show that $s_{\max} \ge s_1$. Let $M(s_1)$ be the matrix, equation 4.1, of spectral radius 1 for the original SCSTG $(\mathcal{G}, \{r_a\}_{a \in \mathcal{A}})$. Define $\Delta = s_{\max} - s_1$. Note that $N(s_{\max}) = r_{\max}^{\Delta} \cdot N(s_1)$. Since for all $a \in \mathcal{A}$, $r_{\max} \ge r_a$, we have $0 < M(s_1) \le N(s_1)$, where $M \le N$ for two positive matrices M, N of the same size means that ≤ applies element-wise, that is, $0 \le M_{uv} \le N_{uv}$. Hence, $\rho(M(s_1)) \le \rho(N(s_1))$.
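The closed form of equation 4.14 is easy to check numerically. The sketch below uses a hypothetical "golden-mean" adjacency matrix (one symbol forbidden to follow itself) with a common contraction rate $r_{\max} = 0.5$; both choices are ours, for illustration only.

```python
import math
import numpy as np

# Adjacency matrix of the golden-mean graph: from vertex 0 both symbols
# are allowed, from vertex 1 only the symbol leading back to vertex 0.
G = np.array([[1.0, 1.0],
              [1.0, 0.0]])
r_max = 0.5  # fixed contraction rate r_a = r_max for every symbol

rho = np.abs(np.linalg.eigvals(G)).max()        # spectral radius of G
s_max = math.log(rho) / (-math.log(r_max))      # equation 4.14
print(rho, s_max)
```

Here $\rho(G)$ is the golden ratio $(1 + \sqrt{5})/2 \approx 1.618$, so $s_{\max} = \log \rho(G) / \log 2 \approx 0.694$: the growth rate of allowed label strings, rescaled by the contraction rate.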
Observe that

$$\rho(N(s_{\max})) = 1 = \rho(r_{\max}^{\Delta} \cdot N(s_1)) = r_{\max}^{\Delta} \cdot \rho(N(s_1)).$$

But $\rho(N(s_1)) \ge \rho(M(s_1)) = 1$, so $r_{\max}^{\Delta} \le 1$ must hold. Since $r_{\max} < 1$, we have that $\Delta \ge 0$, which means $s_{\max} \ge s_1$.

4.2 Lower Bounds. Imagine that besides the Lipschitz bounds (see equation 3.5) for the IFS maps $\{f_a\}_{a \in \mathcal{A}}$, we also have lower bounds: for all $x, y \in X$,

$$\|f_a(x) - f_a(y)\| \ge r'_a \|x - y\|, \quad r'_a > 0, \quad a \in \mathcal{A}.$$  (4.15)

We will show that under certain conditions, the bounds 4.15 induce lower bounds on the Hausdorff dimension of the sets $K_u$. Roughly speaking, to estimate lower bounds on the dimension of the $K_u$'s, we need to make sure that our lower bounds on covering set size are not invalidated by trivial overlaps between the IFS basis sets.

Definition 6. Let $\mathcal{G} = (V, E, \kappa)$ be a labeled directed multigraph. We say that an IFS $(X, \{f_a\}_{a \in \mathcal{A}})$ associated with the SCSTG $(\mathcal{G}, \{r_a\}_{a \in \mathcal{A}})$ falls into the disjoint case if, for any $u, v, v' \in V$, $e \in E_{u \to v}$, $e' \in E_{u \to v'}$, $e \ne e'$, we have $f_{\kappa(e)}(K_v) \cap f_{\kappa(e')}(K_{v'}) = \emptyset$. (By determinicity of $\mathcal{G}$, $\kappa(e) \ne \kappa(e')$.)

Theorem 3. Consider a labeled directed multigraph $\mathcal{G} = (V, E, \kappa)$ and an IFS $(X, \{f_a\}_{a \in \mathcal{A}})$ associated with the SCSTG $(\mathcal{G}, \{r_a\}_{a \in \mathcal{A}})$. Suppose the IFS falls into the disjoint case. Let $s_2$ be the dimension of the SCSTG $(\mathcal{G}, \{r'_a\}_{a \in \mathcal{A}})$. Then for each vertex $u \in V$, $\dim_H K_u \ge s_2$.

Proof. First, define a metric on the power set of X as

$$d(B_1, B_2) = \inf_{x \in B_1} \inf_{y \in B_2} \|x - y\|, \quad B_1, B_2 \subseteq X.$$

Because the IFS falls into the disjoint case and the sets $K_u$, $u \in V$, are compact, there is some $\zeta > 0$ such that for all $u, v, v' \in V$, $e \in E_{u \to v}$, $e' \in E_{u \to v'}$, $e \ne e'$, we have

$$d(f_{\kappa(e)}(K_v), f_{\kappa(e')}(K_{v'})) > \zeta.$$  (4.16)

Fix a vertex $u \in V$, and consider two paths $\gamma_1, \gamma_2$ from $E_{u\to}^{(*)}$ such that the corresponding label sequences $\kappa(\gamma_1)$ and $\kappa(\gamma_2)$ are incomparable. Let w be the longest common prefix of $\kappa(\gamma_1), \kappa(\gamma_2)$, and let γ be the initial
subpath of $\gamma_1$ that supports the label sequence w, that is, $\gamma = (\gamma_1)_1^{|w|}$. We have $|w| < |\kappa(\gamma_1)|$, and so $w \le \kappa(\gamma_1)^-$, which means $r'(\gamma) \ge r'(\gamma_1^-)$. The paths $\gamma_1$ and $\gamma_2$ can be written as $\gamma_1 = \gamma e \gamma_1'$ and $\gamma_2 = \gamma e' \gamma_2'$, where the edges e and e′ are distinct and $\kappa(e) \ne \kappa(e')$ ($\mathcal{G}$ is deterministic). The paths $\gamma_1', \gamma_2'$ may happen to be empty. In any case,

$$\gamma_1'(K_{\operatorname{term}(\gamma_1)}) \subseteq K_{\operatorname{term}(e)}, \qquad \gamma_2'(K_{\operatorname{term}(\gamma_2)}) \subseteq K_{\operatorname{term}(e')},$$

and we have

$$d(e\gamma_1'(K_{\operatorname{term}(\gamma_1)}), e'\gamma_2'(K_{\operatorname{term}(\gamma_2)})) \ge d(e(K_{\operatorname{term}(e)}), e'(K_{\operatorname{term}(e')})) > \zeta.$$

This leads to

$$d(\gamma_1(K_{\operatorname{term}(\gamma_1)}), \gamma_2(K_{\operatorname{term}(\gamma_2)})) = d(we\gamma_1'(K_{\operatorname{term}(\gamma_1)}), we'\gamma_2'(K_{\operatorname{term}(\gamma_2)}))$$  (4.17)
$$\ge r'(\gamma) \cdot d(e\gamma_1'(K_{\operatorname{term}(\gamma_1)}), e'\gamma_2'(K_{\operatorname{term}(\gamma_2)}))$$  (4.18)
$$> r'(\gamma) \cdot \zeta$$  (4.19)
$$\ge r'(\gamma_1^-) \cdot \zeta.$$  (4.20)
Now, as in the upper-bound case (see section 4.1), we construct a matrix $M'(s)$,

$$M'_{uv}(s) = \sum_{e \in E_{u \to v}} (r'_{\kappa(e)})^s.$$  (4.21)

Let $s_2$ be the unique number such that the spectral radius of $M'(s_2)$ is equal to 1. Again, 1 is an eigenvalue of $M'(s_2)$ with an associated eigenvector $\{\lambda'_u\}_{u \in V}$ satisfying

$$\lambda'_u = \sum_{v \in V} \lambda'_v \sum_{e \in E_{u \to v}} (r'_{\kappa(e)})^{s_2}, \qquad \sum_{u \in V} \lambda'_u = 1, \qquad \lambda'_u > 0.$$  (4.22)

As before, we define a measure μ′ on the set of cylinders $\mathcal{C}_u$ (see equation 4.3) as follows:

$$\mu'([w]) = \lambda'_v \cdot r'(\gamma)^{s_2} \quad \text{for } w = \kappa(\gamma),\ \gamma \in E_{u \to v}^{(*)}.$$  (4.23)

The measure μ′ extends to Borel measures on $E_{u\to}^{(\omega)}$. Fix a (Borel) set $B \subset K_u$ and define

$$\delta = \frac{\operatorname{diam}(B)}{\zeta}.$$  (4.24)
The corresponding cross-cut set is

$$T'_\delta = \{\gamma \mid \gamma \in E_{u\to}^{(*)},\ r'(\gamma) < \delta \le r'(\gamma^-)\}.$$  (4.25)

From equation 4.24 and the definition of $T'_\delta$, we have for each $\gamma \in T'_\delta$,

$$\operatorname{diam}(B) = \zeta \cdot \delta \le \zeta \cdot r'(\gamma^-).$$  (4.26)

The label strings $\kappa(\gamma)$ associated with the paths γ in $T'_\delta$ are incomparable, implying that there is at most one path $\gamma \in T'_\delta$ such that $\gamma(K_{\operatorname{term}(\gamma)}) \cap B \ne \emptyset$. To see this, note that by equations 4.17 through 4.20 and 4.26, for any two paths $\gamma_1, \gamma_2 \in T'_\delta$, we have

$$d(\gamma_1(K_{\operatorname{term}(\gamma_1)}), \gamma_2(K_{\operatorname{term}(\gamma_2)})) > r'(\gamma_1^-) \cdot \zeta \ge \operatorname{diam}(B).$$

Denote by $\tilde{B}$ the set of all ω-sequences associated with infinite paths starting in the vertex u that map X into a point in B, that is,

$$\tilde{B} = \{\kappa(\gamma) \mid \gamma \in E_{u\to}^{(\omega)},\ \gamma(X) \in B\}.$$  (4.27)

Since the cross-cut set $T'_\delta$ partitions the set $E_{u\to}^{(\omega)}$, there exists some $\gamma_B \in T'_\delta$ such that $\gamma_B(K_{\operatorname{term}(\gamma_B)})$ meets $B \subseteq K_u$. In fact, as argued above, it is the only such path from $T'_\delta$, and so $\tilde{B} \subseteq [\kappa(\gamma_B)]$. In terms of measure (see equation 4.23),

$$\mu'(\tilde{B}) \le \mu'([\kappa(\gamma_B)]) \le \lambda'_{\max} \cdot r'(\gamma_B)^{s_2},$$  (4.28)

where $\lambda'_{\max} = \max_{v \in V} \lambda'_v$. From the definition of $T'_\delta$, we have

$$\operatorname{diam}(B) = \zeta \cdot \delta > \zeta \cdot r'(\gamma_B).$$

Using this inequality in equation 4.28, we obtain

$$\mu'(\tilde{B}) \le \lambda'_{\max} \left(\frac{\operatorname{diam}(B)}{\zeta}\right)^{s_2}.$$  (4.29)

Now, for any countable cover $\mathcal{C}_u$ of $K_u$ by Borel sets, we have

$$\sum_{B \in \mathcal{C}_u} (\operatorname{diam}(B))^{s_2} \ge \frac{\zeta^{s_2}}{\lambda'_{\max}} \sum_{B \in \mathcal{C}_u} \mu'(\tilde{B}) = \frac{\zeta^{s_2}}{\lambda'_{\max}} \sum_{B \in \mathcal{C}_u} \nu(B) \ge \frac{\zeta^{s_2}}{\lambda'_{\max}} \nu(K_u),$$  (4.30)
where ν is the pushed-forward measure on $K_u$ induced by the measure μ′. Note that this is a correct construction, since the IFS falls into the disjoint case. Consulting definition 2 and noting that it is always the case that

$$H^{s_2}(K_u) \ge \frac{\zeta^{s_2}}{\lambda'_{\max}} \nu(K_u) > 0,$$

we conclude that $\dim_H K_u \ge s_2$.

As in the case of upper bounds, we transform the results of the previous theorem into a weaker but more accessible lower bound.

Theorem 4. Under the assumptions of theorem 3, for each vertex $u \in V$,

$$\dim_H K_u \ge s_2 \ge \frac{\log \rho(G)}{-\log r'_{\min}},$$

where $r'_{\min} = \min_{a \in \mathcal{A}} r'_a$.

Proof. As in the proof of theorem 2, consider an SCSTG $(\mathcal{G}, r_a = r'_{\min}, a \in \mathcal{A})$. The matrix, equation 4.1, for this SCSTG is $N(s) = (r'_{\min})^s G$ with $\rho(N(s)) = (r'_{\min})^s \cdot \rho(G)$. The dimension $s_{\min}$ associated with such an SCSTG is

$$s_{\min} = \frac{\log \rho(G)}{-\log r'_{\min}}.$$  (4.31)

Define $\Delta = s_{\min} - s_2$. We have $N(s_{\min}) = (r'_{\min})^{\Delta} \cdot N(s_2)$. Since $r'_{\min} \le r'_a$ for all $a \in \mathcal{A}$, it holds that $0 < N(s_2) \le M'(s_2)$, and so $\rho(N(s_2)) \le \rho(M'(s_2))$. Note that

$$\rho(N(s_{\min})) = 1 = (r'_{\min})^{\Delta} \cdot \rho(N(s_2)),$$

and $\rho(N(s_2)) \le \rho(M'(s_2)) = 1$. It follows that $(r'_{\min})^{\Delta} \ge 1$ must hold. Since $r'_{\min} < 1$, we have $\Delta \le 0$, which means that $s_{\min} \le s_2$.

4.3 Discussion of the Quantities Involved in the Bounds. The numbers $s_1$ and $s_2$ are well-known quantities in the literature on fractal sets generated by constructions employing maps with different contraction rates (Fernau & Staiger, 1994, 2001; Edgar & Golds, 1999). Since there is no closed-form analytical solution for $s_1$ as introduced in definition 5, the dimensions $s_1, s_2$ are defined only implicitly. Loosely speaking, the spectral radius of $M(s)$ is related to the variation of the "cover function" $H^s(K)$ (see definition 2), and
the only way of getting a finite change-over point s in equation 2.3 is to set the spectral radius to 1.

The situation changes when we confine ourselves to a constant contraction rate across all IFS maps. As seen in the proof of theorem 2, in this case there is a closed-form solution for s. The solution is equal to the scaled spectral radius of the adjacency matrix, which determines the growth rate of the number of allowed label strings as the sequence length increases. The major difficulty with having different contraction rates in the IFS maps is that the areas on the fractal support that need to be covered by sets of decreasing diameters cannot be simply estimated by counting the number of different label sequences of a certain length that can be obtained by traversing the underlying multigraph.

5 Recurrent Networks as IFS

In this article, we concentrate on first-order (discrete-time) RNNs of N recurrent neurons driven with dynamics

$$x_n(t) = g\left(\sum_{j=1}^{D} v_{nj} i_j(t) + \sum_{j=1}^{N} w_{nj} x_j(t-1) + d_n\right),$$  (5.1)

where $x_j(t)$ and $i_j(t)$ are elements at time t of the state and input vectors, $x(t) = (x_1(t), \ldots, x_N(t))^T$ and $i(t) = (i_1(t), \ldots, i_D(t))^T$, respectively; $w_{nj}$ and $v_{nj}$ are recurrent and input connection weights, respectively; the $d_n$ constitute the bias vector $d = (d_1, \ldots, d_N)^T$; and $g(\cdot)$ is an injective, nonconstant, differentiable activation function of bounded derivative from $\mathbb{R}$ to a bounded interval $\Omega \subset \mathbb{R}$ of length $|\Omega|$. Denoting by V and W the $N \times D$ and $N \times N$ weight matrices $(v_{nj})$ and $(w_{nj})$, respectively, we rewrite equation 5.1 in matrix form,

$$x(t) = G(Vi(t) + Wx(t-1) + d),$$  (5.2)

where $G: \mathbb{R}^N \to \Omega^N$ is the element-wise application of g.

Assume the RNN is processing strings over the finite alphabet $\mathcal{A} = \{1, 2, \ldots, A\}$. Symbols $a \in \mathcal{A}$ are presented at the network input as unique D-dimensional codes $c_a \in \mathbb{R}^D$, one for each symbol $a \in \mathcal{A}$. The RNN can be viewed as a nonlinear IFS consisting of a collection of A maps $\{f_a\}_{a=1}^{A}$ acting on $\Omega^N$,

$$f_a(x) = (G \circ T_a)(x),$$  (5.3)

where

$$T_a(x) = Wx + d_a$$  (5.4)

and

$$d_a = Vc_a + d.$$  (5.5)
For a set B µ RN , we denote by [B]i the slice of B dened as [B]i D fxi j x D .x1 ; : : : ; xN /T 2 Bg; i D 1; 2; : : : ; N:
(5.6)
When symbol a 2 A is at the network input, the range of possible net-in activations on recurrent units is the set [ [Ta .ÄN /]i : (5.7) Ra D 1·i·N
Recall that the singular values $\alpha_1, \ldots, \alpha_N$ of the matrix $W$ are the positive square roots of the eigenvalues of $W W^T$. The singular values are the lengths of the (mutually perpendicular) principal semiaxes of the image of the unit ball under the linear map defined by the matrix $W$. We assume that $W$ is nonsingular and adopt the convention that $\alpha_1 \ge \alpha_2 \ge \cdots \ge \alpha_N > 0$. Denote by $\alpha_{\max}(W)$ and $\alpha_{\min}(W)$ the largest and the smallest singular values of $W$, respectively, that is, $\alpha_{\max}(W) = \alpha_1$ and $\alpha_{\min}(W) = \alpha_N$. The next lemma gives us conditions under which the maps $f_a$ are contractive.

Lemma 1. If

$$r_a^{\max} = \alpha_{\max}(W) \cdot \sup_{z \in R_a} |g'(z)| < 1, \tag{5.8}$$

then the map $f_a$, $a \in \mathcal{A}$ (see equation 5.3), is contractive.

Proof. The result follows from two facts:

1. A map $f$ from a metric space $(U, \|\cdot\|_U)$ to a metric space $(V, \|\cdot\|_V)$ is contractive if it is Lipschitz continuous, that is, for all $x, y \in U$, $\|f(x) - f(y)\|_V \le r_f \|x - y\|_U$, and the Lipschitz constant $r_f$ is smaller than 1.

2. The Lipschitz constant $r$ of a composition $(f_1 \circ f_2)$ of two Lipschitz continuous maps $f_1$ and $f_2$ with Lipschitz constants $r_1$ and $r_2$ is equal to $r = r_1 \cdot r_2$.

Note that $\alpha_{\max}(W)$ is a Lipschitz constant of the affine maps $T_a$ and that $\sup_{z \in R_a} |g'(z)|$ is a Lipschitz constant of the map $G$ with domain $T_a(\Omega^N)$.

By arguments similar to those in the proof of lemma 1, we get:
Lemma 2. For

$$r_a^{\min} = \alpha_{\min}(W) \cdot \inf_{z \in R_a} |g'(z)|, \tag{5.9}$$

it holds that $\|f_a(x) - f_a(y)\| \ge r_a^{\min} \|x - y\|$ for all $x, y \in \Omega^N$.
Architectural Bias in Recurrent Neural Networks
1947
6 RNNs Driven by Finite Automata

Major research effort has been devoted to the study of learning and implementation of regular grammars and finite-state transducers using RNNs (Casey, 1996; Cleeremans et al., 1989; Elman, 1990; Forcada & Carrasco, 1995; Frasconi et al., 1996; Giles et al., 1992; Manolios & Fanelli, 1994; Tiňo & Šajda, 1995; Watrous & Kuhn, 1992). In most cases, the sequences presented at the input of the RNNs were obtained by traversing the underlying finite-state automaton/machine. Of particular interest to us are approaches that did not reset the RNN state after the presentation of each input string, but learned the appropriate resetting behavior during the training process (Forcada & Carrasco, 1995; Tiňo & Šajda, 1995). In such cases, RNNs are fed by a concatenation of input strings over some input alphabet $\mathcal{A}'$, separated by a special end-of-string symbol $\# \notin \mathcal{A}'$. We can think of such sequences as finite substrings of infinite sequences generated by traversing the underlying state-transition graph extended by adding edges labeled by $\#$ that terminate in the initial state. The added edges start at all the states—in the case of transducers—or at the final states—in the case of acceptors. Let $\mathcal{G} = (V, E, \kappa)$ represent such an extended state-transition graph. Let $\mathcal{A} = \mathcal{A}' \cup \{\#\}$. If for all $a \in \mathcal{A}$, $r_a^{\max} < 1$ (see lemma 1), the recurrent network can then be viewed as an IFS associated with an SCSTG $(\mathcal{G}, \{r_a^{\max}\}_{a \in \mathcal{A}})$ (see definition 4) operating on the RNN state space $\Omega^N$. When driving RNNs with symbolic sequences generated by traversing a strongly connected state-transition graph $\mathcal{G}$, by the action of the IFS $\{f_a\}_{a \in \mathcal{A}}$, we get a set of recurrent activations $x(t)$ that tend to group in well-separated clusters (Kolen, 1994a, 1994b; Manolios & Fanelli, 1994).
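The extension of the state-transition graph by end-of-string edges can be sketched on a toy example. The two-state acceptor below is a hypothetical illustration; the point is only the mechanics of adding the $\#$ edges and reading off the growth rate of allowed label strings.

```python
import numpy as np

# Hypothetical two-state acceptor over A' = {a, b}: edge counts between
# states form the adjacency matrix. Adding end-of-string '#' edges from
# the final states back to the initial state yields the extended graph.
#             to state0  to state1
G0 = np.array([[1,        1],        # state0 --a--> state0, --b--> state1
               [0,        1]])       # state1 --b--> state1
final_states, initial_state = [1], 0

G = G0.copy()
for f_state in final_states:         # '#' edges: final -> initial
    G[f_state, initial_state] += 1

# The growth rate of allowed label strings is the spectral radius of G.
rho = max(abs(np.linalg.eigvals(G)))
print(round(float(rho), 4))
```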
In the case of contractive RNNs (which are capable of emulating definite-memory machines; see Hammer & Tiňo, 2002; Tiňo et al., 2002a, 2002b), the activations $x(t)$ approximate the invariant attractor sets $K_u \subseteq \Omega^N$, $u \in V$. In this situation, we can get size estimates for the activation clusters. Define (see equation 5.7)

$$\ell = \sup_{z \in \bigcup_a R_a} |g'(z)| = \max_{a \in \mathcal{A}} \sup_{z \in R_a} |g'(z)| \tag{6.1}$$

$$q = \min_{a, a' \in \mathcal{A},\, a \ne a'} \|V(c_a - c_{a'})\|. \tag{6.2}$$
Theorem 5. Let an RNN (see equation 5.2) be driven with sequences generated by traversing a strongly connected state-transition graph $\mathcal{G} = (V, E, \kappa)$ with adjacency matrix $G$. Suppose $\alpha_{\max}(W) < \ell^{-1}$. Let $s_1$ be the dimension of the SCSTG $(\mathcal{G}, \{r_a^{\max}\}_{a \in \mathcal{A}})$. Then the (upper box-counting) fractal dimensions of the attractors $K_u$, $u \in V$, approximated by the RNN activation vectors $x(t)$, are upper bounded by

$$\forall u \in V\colon\ \dim_B^+ K_u \le s_1 \le \frac{\log \rho(G)}{-\log r_{\max}} \tag{6.3}$$
and also

$$\dim_B^+ K \le s_1 \le \frac{\log \rho(G)}{-\log r_{\max}},$$

where $r_{\max} = \max_{a \in \mathcal{A}} r_a^{\max}$.

Proof. Since $\alpha_{\max}(W) < \ell^{-1}$, by equation 6.1 and lemma 1, each IFS map $f_a$, $a \in \mathcal{A}$, is a contraction with contraction coefficient $r_a^{\max}$. One is now tempted to invoke theorems 1 and 2. Note, however, that in the theory developed in sections 3 and 4, for allowed paths $\gamma$ in $\mathcal{G}$, the compositions of IFS maps $\gamma(\cdot)$ are applied in the reversed manner⁹ (see equation 3.6), that is, if $\kappa(\gamma) = w = s_1 s_2 \ldots s_n \in \mathcal{A}^+$, then $\gamma(x) = (f_{s_1} \circ f_{s_2} \circ \cdots \circ f_{s_n})(x)$. Actually, this does not pose any difficulty, since we can work with a new state-transition graph $\mathcal{G}^R = (V, E^R, \kappa^R)$ associated with $\mathcal{G} = (V, E, \kappa)$. $\mathcal{G}^R$ is completely the same as $\mathcal{G}$ up to the orientation of the edges. For every edge $e \in E_{u \to v}$ with label $\kappa(e)$, there is an associated edge $e^R \in E^R_{v \to u}$ labeled by $\kappa^R(e^R) = \kappa(e)$. The matrices $M(s)$ (see equation 4.1) corresponding to IFSs based on $\mathcal{G}$ are simply transposed versions of the matrices $M^R(s)$ corresponding to IFSs based on $\mathcal{G}^R$. However, the spectral radius is invariant with respect to the transpose operation, so the results of section 4 can be directly employed.

We now turn to lower bounds.

Theorem 6. Consider an RNN (see equation 5.2) driven with sequences generated by traversing a strongly connected state-transition graph $\mathcal{G} = (V, E, \kappa)$ with adjacency matrix $G$. Suppose

$$\alpha_{\max}(W) < \min\left\{\frac{1}{\ell},\ \frac{q}{\sqrt{N}\,|\Omega|}\right\}. \tag{6.4}$$

Let $s_2$ be the dimension of the SCSTG $(\mathcal{G}, \{r_a^{\min}\}_{a \in \mathcal{A}})$. Then the (Hausdorff) dimensions of the attractors $K_u$, $u \in V$, approximated by the RNN activation vectors $x(t)$, are lower bounded by

$$\forall u \in V\colon\ \dim_H K_u \ge s_2 \ge \frac{\log \rho(G)}{-\log r_{\min}}, \tag{6.5}$$

where $r_{\min} = \min_{a \in \mathcal{A}} r_a^{\min}$.

9 This approach, usual in the IFS literature, is convenient because of the convergence properties of $\gamma(X)$, $\gamma \in \mathcal{A}^\omega$.
Proof. Note that $\mathrm{diam}(\Omega^N) = \sqrt{N} \cdot |\Omega|$. Furthermore, by equation 6.4, we have

$$q > \alpha_{\max}(W) \cdot \sqrt{N} \cdot |\Omega| = \alpha_{\max}(W) \cdot \mathrm{diam}(\Omega^N).$$
It follows that

$$q = \min_{a \ne a'} \|V(c_a - c_{a'})\| = \min_{a \ne a'} \|d_a - d_{a'}\| > \alpha_{\max}(W)\, \mathrm{diam}(\Omega^N),$$

and so for all $a, a' \in \mathcal{A}$, $a \ne a'$: $T_a(\Omega^N) \cap T_{a'}(\Omega^N) = \emptyset$. Since $G$ is injective, we also have

$$\forall a, a' \in \mathcal{A},\ a \ne a'\colon\ (G \circ T_a)(\Omega^N) \cap (G \circ T_{a'})(\Omega^N) = \emptyset,$$

which, by equation 5.3, implies $f_a(\Omega^N) \cap f_{a'}(\Omega^N) = \emptyset$ for all $a, a' \in \mathcal{A}$, $a \ne a'$. Hence, the IFS $\{f_a\}_{a \in \mathcal{A}}$ falls into the disjoint case. We can now invoke theorems 3 and 4.

Using equation 2.4, we summarize the bounds in the following corollary:

Corollary 1. Under the assumptions of theorems 5 and 6, for all $u \in V$,

$$\frac{\log \rho(G)}{-\log r_{\min}} \le s_2 \le \dim_H K_u \le \dim_B^- K_u \le \dim_B^+ K_u \le s_1 \le \frac{\log \rho(G)}{-\log r_{\max}}. \tag{6.6}$$
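The bounds of equation 6.6 turn into a short computation once the extreme contraction rates are known. In the sketch below, the rates $r_{\min}$ and $r_{\max}$ are illustrative placeholders; in practice they come from lemmas 1 and 2.

```python
import numpy as np

# Sketch of corollary 1 (equation 6.6): bracket the fractal dimensions of
# the attractors by log rho(G) / (-log r) at the extreme contraction rates.
def dimension_bounds(G, r_min, r_max):
    rho = max(abs(np.linalg.eigvals(G)))     # spectral radius of G
    lower = np.log(rho) / -np.log(r_min)
    upper = np.log(rho) / -np.log(r_max)
    return float(lower), float(upper)

# Single-state automaton with A = 4 self-loops (the fair Bernoulli source
# used in the experiments of section 8): rho(G) = 4.
G = np.array([[4.0]])
lo, hi = dimension_bounds(G, r_min=0.02, r_max=0.12)  # illustrative rates
print(round(lo, 3), round(hi, 3))
```

The smaller contraction rate yields the smaller bound, so the pair brackets the attained dimension from below and above.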
7 Discussion

Recently, we have extended the work of Kolen and others (Christiansen & Chater, 1999; Kolen, 1994a, 1994b; Manolios & Fanelli, 1994) by pointing out that in dynamical tasks of a symbolic nature, when initialized with "small" weights, RNNs with, for example, sigmoid activation functions form contractive IFSs. In particular, the dynamical systems (see equation 5.3) are driven by a single attractive fixed point, one for each $a \in \mathcal{A}$ (Tiňo, Horne, & Giles, 2001; Tiňo et al., 1998). Actually, this type of simple dynamics is a reason for starting to train recurrent networks from small weights. Unless one has strong prior knowledge about the network dynamics (Giles & Omlin, 1993), the sequence of bifurcations leading to a desired network behavior may be hard to achieve when starting from an arbitrarily complicated network dynamics (Doya, 1992).
The tendency of RNNs to form clusters in the state space before training was first reported by Servan-Schreiber, Cleeremans, and McClelland (1989) and later by Kolen (1994a, 1994b) and Christiansen and Chater (1999). We have shown that RNNs randomly initialized with small weights are inherently biased toward Markov models; even prior to any training, RNN dynamics can be readily used to extract definite memory machines (Hammer & Tiňo, 2002; Tiňo et al., 2002a, 2002b). In other words, even prior to training, the recurrent activation clusters are perfectly reasonable and are biased toward finite-memory computations.

This article further extends our work by showing that in such cases, a rigorous analysis of fractal encodings in the RNN state space can be performed. We have derived (lower and upper) bounds on several types of fractal dimensions, such as the Hausdorff and (upper and lower) box-counting dimensions of activation patterns occurring in the RNN state space when the network is driven by an underlying finite-state automaton, as has been the case in many studies reported in the literature (Casey, 1996; Cleeremans et al., 1989; Elman, 1990; Forcada & Carrasco, 1995; Frasconi et al., 1996; Giles et al., 1992; Manolios & Fanelli, 1994; Tiňo & Šajda, 1995; Watrous & Kuhn, 1992).

Our results have a nice information-theoretic interpretation. The entropy of a language $L \subseteq \mathcal{A}^*$ is defined as the rate of exponential increase in the number of distinct words belonging to $L$ as the word length increases. With respect to the scenario we have in mind, an RNN driven by traversing a finite-state automaton, a more appropriate context is that of regular $\omega$-languages. Roughly speaking, a regular $\omega$-language is a set $L \subseteq \mathcal{A}^\omega$ of infinite strings over $\mathcal{A}$ that, when parsed by a (traditional) finite-state automaton, end up visiting the set of final states infinitely often (Thomas, 1990).¹⁰
The entropy of a regular $\omega$-language $L$ is equal to the entropy of the set of all finite prefixes of the words belonging to $L$, and this is equal to $\log \rho(G)$, where $G$ is the adjacency matrix of the underlying finite-state automaton (Fernau & Staiger, 2001). Hence, the architectural bias phenomenon has an additional flavor: not only can the recurrent activations inside RNNs with small weights be explored to build Markovian predictive models, but the activations also form fractal clusters, the dimension of which can be upper- (and under some conditions lower-) bounded by the scaled entropy of the underlying driving source. The scaling factors are fixed and are given by the RNN parameters.

Our work is related to valuations of languages (Fernau & Staiger, 1994). Briefly, a valuation of a symbol $a \in \mathcal{A}$ can be thought of as a "contraction ratio" $r_a$ of the associated IFS map $f_a$. Valuations of finite words over $\mathcal{A}$ are
10 This is by no means the only acceptance criterion that has been investigated. For a survey, see Thomas (1990).
then obtained by postulating that, in general, the valuation $\beta$ is a monoid morphism mapping $(\mathcal{A}^*, +, \lambda)$ to $((0, 1), \cdot, 1)$, where $+$ and $\cdot$ denote the operations of word concatenation and real-number multiplication, respectively (see equation 3.7). It should be stressed that, strictly speaking, not all the maps $f_a$ need to be contractions. There can be words with valuation $\ge 1$. It is important, however, to have control over recursive structures. For example, in our setting of sequences drawn by traversing finite-state diagrams, it is desirable that valuations of cycles be $< 1$. Actually, in this article, we could have softened in this manner our demand on the contractiveness of each IFS map. However, this would unnecessarily complicate the presentation. In addition, we are dealing with the architectural bias in RNNs, where all the maps $f_a$ are contractive anyway.

Our Hausdorff dimension estimates are related to the $\beta$-entropy of languages. However, valuations in principle assume that the maps $f_a$ are similarities. In this article, we deal with the more general case of nonsimilarities, since the RNN maps $f_a$ are nonlinear. In addition, the theory of valuations operates with contraction ratios only, but in order to obtain lower bounds on the Hausdorff dimension of the IFS attractor, one has to consider other details of the IFS maps.

Results in this article extend the work of Blair, Bodén, Pollack, and others (Blair & Pollack, 1997; Bodén & Blair, 2002; Rodriguez et al., 1999) analyzing fixed-point representations of context-free languages in trained RNNs. They observed that the networks often developed an intricate combination of attractive and saddle-type fixed points. Compared with the Cantor-set-like representations emerging from attractive-point-only dynamics (the case studied here), the inclusion of saddle points can lead to better generalization on longer strings.
Our theory describes the complexity of recurrent patterns in the early stages of RNN training, before the network bifurcates into dynamics enriched with saddle points.

Finally, our fractal dimension estimates can prove helpful in neural approaches to the (highly nontrivial) inverse problem of fractal image compression through IFSs (Melnik & Pollack, 1998; Stucki & Pollack, 1992): given an image $\mathcal{I}$, identify an IFS that generates $\mathcal{I}$. The inverse fractal problem is an area of active research, with many alternative approaches proposed in the literature. However, a general algorithm for solving the inverse fractal problem is still missing (Melnik & Pollack, 1998). As the neural implementations of IFS maps are contractions and the feeding input label sequences are generated by Bernoulli sources, the fractal dimension estimates derived in this article can be used, for example, to assess in an evolutionary setting the fitness of neural candidates for generating the target image with a known precomputed fractal dimension. Such assessment will be most appropriate in the early stages of population evolution and is potentially much cheaper (see section 8) than having to produce, at each step and for each candidate, iteratively generated images and then compute their Hausdorff distances to the target image.
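The valuation morphism discussed above can be stated in a few lines of code; the per-symbol ratios below are illustrative only, not taken from the text.

```python
# Sketch of a valuation: a monoid morphism from (A*, concatenation, empty
# word) to ((0, 1), multiplication, 1). The per-symbol "contraction
# ratios" are illustrative placeholders.
ratio = {"a": 0.5, "b": 0.8, "#": 0.9}

def valuation(word):
    v = 1.0                    # the valuation of the empty word is 1
    for sym in word:
        v *= ratio[sym]        # morphism property: val(uv) = val(u) * val(v)
    return v

# The morphism property in action:
assert valuation("ab") == valuation("a") * valuation("b")
print(valuation("ab#"))
```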
8 Experiments

Although our main result is of a theoretical nature—the size of recurrent activations in the early stages of RNN training can be upper- (and sometimes lower-) bounded by the scaled information-theoretic complexity of the underlying input driving source, where the scaling factors are determined purely by the network parameters—one can ask a more practical question about the tightness of the derived bounds.¹¹ In general, wider differences between the contraction ratios of the IFS maps will result in weaker bounds.

We ran an experiment with randomly initialized RNNs having $N = 5$ recurrent neurons and standard logistic sigmoid activation functions $g(u) = 1/(1 + \exp(-u))$. The weights $W = (w_{nj})$ were sampled from the uniform distribution over a symmetric interval $[-w_{\max}, w_{\max}]$, with $w_{\max} \in [0.05, 0.35]$. We assumed that the networks were driven by input sequences from a fair Bernoulli source over $A = 4$ symbols. In other words, the underlying automaton has a single state with $A$ symbol loops. The topological entropy of such a source is $\log A$.

For each weight range $w_{\max} \in \{0.05, 0.06, 0.07, \ldots, 0.35\}$, we generated 1000 networks and approximated the upper and lower bounds on the fractal dimensions presented in corollary 1. In particular, two types of approximations were made: (1) we assumed that the principal semiaxes of $W B_0$ ($B_0$ is an $N$-dimensional ball centered at the origin) are aligned with the neuron-wise coordinate system in $\mathbb{R}^N$, and (2) we did not check whether the IFSs fall into the disjoint case. Both simplifications, while potentially overestimating the bounds, considerably speed up the computations. The results are shown in Figure 1. The means of the bounds across the 1000 networks are plotted as plain solid lines; the dashed lines indicate the spread of one standard deviation.

To verify the theoretical bounds, we generated from the Bernoulli source a long sequence $\omega$ of 10,000 symbols.
Next, we constructed, for each weight range $w_{\max} \in \{0.1, 0.2, 0.3\}$, 10 randomly initialized RNNs and drove each network with $\omega$. We collected the 10,000 recurrent activations from every RNN and estimated their box-counting fractal dimension using the FD3 system (version 4) (Sarraille & Myers, 1994). The mean dimension estimates are shown in Figure 1 as squares connected by a solid line; the error bars correspond to the spread of $\pm$ one standard deviation. For comparison, we show in Figure 2 the theoretical bounds (circles) and empirical dimension estimates (squares) for a single RNN randomly initialized with weight ranges $w_{\max} \in \{0.1, 0.15, 0.2, 0.25, 0.3\}$.

The empirical box-counting estimates based on recurrent activation patterns inside RNNs fall between the theoretical bounds computed from the networks' parameters. The tightness of the computed bounds depends on the range of contraction ratios of the individual IFS maps within an RNN and on
11 Thanks to an anonymous reviewer for raising this question.
[Figure 1 here: box-counting dimension estimates $d_B$ versus weight range.]

Figure 1: Mean lower and upper bounds on fractal dimensions of RNNs (solid lines) estimated for weight ranges in [0.05, 0.35] across 1000 RNN realizations. The dashed lines indicate the spread of one standard deviation. The solid line with squares shows the mean empirical dimension estimates from 10,000 recurrent activations obtained by driving each network with an external input sequence.
the degree to which the simplifying assumption of axis-aligned principal semiaxes of $W B_0$ holds. It turns out that the danger of overestimating the theoretical lower bounds by not checking the disjoint case condition never materialized. The lower bounds are based on the minimal contraction rate across the IFS maps in an RNN, and so the potential overestimation is balanced by higher contraction rates in the rest of the IFS maps. Also, even though the disjoint case condition may not be fully satisfied, deterioration of the bounds follows a "graceful degradation" pattern as the level of overlap between the IFS images $f_a(\Omega^N)$ increases.

The framework presented in this article is general and can be readily applied to any RNN with dynamics 5.1 and injective nonconstant differentiable activation functions of bounded derivative, mapping $\mathbb{R}$ into a bounded interval. Many extensions and refinements are possible. For example, the theory can be easily extended to second-order RNNs and can be made more specific for the case of unary (one-of-$A$) encodings $c_a$ of input symbols $a \in \mathcal{A}$.
[Figure 2 here: box-counting dimension estimates $d_B$ versus weight range.]

Figure 2: Theoretical bounds (circles) and empirical dimension estimates (squares) for a single RNN randomly initialized with weight ranges $w_{\max} \in \{0.1, 0.15, 0.2, 0.25, 0.3\}$.
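The empirical estimates above were obtained with the FD3 system. As a rough stand-in, a naive grid-based box-counting estimator (this is a simple sketch, not FD3's algorithm) can serve as a sanity check on a set of known dimension:

```python
import numpy as np

# A naive grid-based box-counting dimension estimator, used here only as
# a sanity check; this is not the FD3 algorithm referenced in the text.
def box_counting_dim(points, epsilons=(0.1, 0.05, 0.025, 0.0125)):
    points = np.asarray(points)
    counts = []
    for eps in epsilons:
        # Count occupied grid cells of side eps.
        boxes = {tuple(c) for c in np.floor(points / eps).astype(int)}
        counts.append(len(boxes))
    # dim_B is approximated by the slope of log N(eps) vs. log(1/eps).
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(epsilons)), np.log(counts), 1)
    return float(slope)

# Points on a line segment in the plane should give a dimension close to 1.
pts = np.column_stack([np.linspace(0, 1, 4000), np.linspace(0, 1, 4000)])
print(round(box_counting_dim(pts), 2))
```

Applying the same estimator to the 10,000 collected activation vectors of a driven RNN reproduces the kind of empirical estimate plotted in the figures.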
9 Conclusion

It has been reported that RNNs form well-separated activation clusters even prior to any training (Christiansen & Chater, 1999; Kolen, 1994a, 1994b; Servan-Schreiber et al., 1989). We have recently shown that RNNs randomly initialized with small weights are inherently biased toward Markov models; even without any training, RNN dynamics can be readily used to extract definite memory machines (Hammer & Tiňo, 2002; Tiňo et al., 2002a, 2002b). In this article, we have shown that in such cases, a rigorous analysis of fractal encodings in the RNN state space can be performed. We have derived (lower and upper) bounds on the Hausdorff and box-counting dimensions of activation patterns occurring in the RNN state space when the network is driven by an underlying finite-state automaton, as has been the case in many studies reported in the literature (Casey, 1996; Cleeremans et al., 1989; Elman, 1990; Forcada & Carrasco, 1995; Frasconi et al., 1996; Giles et al., 1992; Manolios & Fanelli, 1994; Tiňo & Šajda, 1995; Watrous & Kuhn, 1992). It turns out that the recurrent activations form fractal clusters, the dimension of which can be bounded by the scaled entropy of the underlying driving source. The scaling factors are fixed and are given by the RNN parameters.
Acknowledgments

This work was supported by the VEGA MS SR and SAV grant 1/9046/02.

References

Barnsley, M. F. (1988). Fractals everywhere. New York: Academic Press.
Barnsley, M. F., Elton, J. H., & Hardin, D. P. (1989). Recurrent iterated function systems. Constructive Approximation, 5, 3–31.
Blair, A. D., & Pollack, J. B. (1997). Analysis of dynamical recognizers. Neural Computation, 9(5), 1127–1142.
Bodén, M., & Blair, A. D. (in press). Learning the dynamics of embedded clauses. Applied Intelligence.
Bodén, M., & Wiles, J. (2002). On learning context free and context sensitive languages. IEEE Transactions on Neural Networks, 13(2), 491–493.
Casey, M. P. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6), 1135–1178.
Christiansen, M. H., & Chater, N. (1999). Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23, 417–437.
Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3), 372–381.
Culik, K. II, & Dube, S. (1993). Affine automata and related techniques for generation of complex images. Theoretical Computer Science, 116(2), 373–398.
Doya, K. (1992). Bifurcations in the learning of recurrent neural networks. In Proc. of 1992 IEEE Int. Symposium on Circuits and Systems (pp. 2777–2780).
Edgar, G. A., & Golds, J. (1999). A fractal dimension estimate for a graph-directed iterated function system of non-similarities. Indiana University Mathematics Journal, 48, 429–447.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.
Falconer, K. J. (1990). Fractal geometry: Mathematical foundations and applications. New York: Wiley.
Fernau, H., & Staiger, L. (1994). Valuations and unambiguity of languages, with applications to fractal geometry. In S. Abiteboul & E. Shamir (Eds.), Automata, languages and programming, 21st International Colloquium, ICALP 94 (pp. 11–22). Berlin: Springer-Verlag.
Fernau, H., & Staiger, L. (2001). Iterated function systems and control languages. Information and Computation, 168(2), 125–143.
Forcada, M. L., & Carrasco, R. C. (1995). Learning the initial state of a second-order recurrent neural network during regular-language inference. Neural Computation, 7(5), 923–930.
Frasconi, P., Gori, M., & Sperduti, A. (1998). A general framework for adaptive processing of data structures. IEEE Transactions on Neural Networks, 9(5), 768–786.
Frasconi, P., Gori, M., Maggini, M., & Soda, G. (1996). Insertion of finite state automata in recurrent radial basis function networks. Machine Learning, 23, 5–32.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 393–405.
Giles, C. L., & Omlin, C. W. (1993). Insertion and refinement of production rules in recurrent neural networks. Connection Science, 5(3), 307–328.
Hammer, B. (2002). Recurrent networks for structured data—a unifying approach and its properties. Cognitive Systems Research, 3(2), 145–165.
Hammer, B., & Tiňo, P. (2002). Neural networks with small weights implement finite memory machines (Tech. Rep. P-241). Osnabrück, Germany: Department of Mathematics/Computer Science, University of Osnabrück.
Kolen, J. F. (1994a). The origin of clusters in recurrent neural network state space. In Proceedings from the Sixteenth Annual Conference of the Cognitive Science Society (pp. 508–513). Hillsdale, NJ: Erlbaum.
Kolen, J. F. (1994b). Recurrent networks: State machines or iterated function systems? In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, & A. S. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School (pp. 203–210). Hillsdale, NJ: Erlbaum.
Kremer, S. C. (2001). Spatio-temporal connectionist networks: A taxonomy and review. Neural Computation, 13(2), 249–306.
Manolios, P., & Fanelli, R. (1994). First order recurrent neural networks and deterministic finite state automata. Neural Computation, 6(6), 1155–1173.
Melnik, O., & Pollack, J. B. (1998). A gradient descent method for a neural fractal memory. In Proceedings of WCCI 98, International Joint Conference on Neural Networks. Piscataway, NJ: IEEE Press.
Micheli, A., Sperduti, A., Starita, A., & Bianucci, A. M. (2002). QSPR/QSAR based on neural networks for structures. In L. M. Sztandera & H. Cartwright (Eds.), Soft computing approaches in chemistry. Heidelberg: Physica Verlag.
Minc, H. (1988). Nonnegative matrices. New York: Wiley.
Prusinkiewicz, P., & Hammel, M. (1992). Escape-time visualization method for language-restricted iterated function systems. In Proceedings of Graphics Interface '92 (pp. 213–223). San Mateo, CA: Morgan Kaufmann.
Rodriguez, P., Wiles, J., & Elman, J. L. (1999). A recurrent neural network that learns to count. Connection Science, 11, 5–40.
Sarraille, J. J., & Myers, L. S. (1994). FD3: A program for measuring fractal dimensions. Educational and Psychological Measurement, 54(1), 94–97.
Servan-Schreiber, D., Cleeremans, A., & McClelland, J. (1989). Encoding sequential structure in simple recurrent networks. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1. San Mateo, CA: Morgan Kaufmann.
Stucki, D. J., & Pollack, J. B. (1992). Fractal (reconstructive analogue) memory. In Proceedings of the 14th Annual Cognitive Science Conference (pp. 118–123). Mahwah, NJ: Erlbaum.
Thomas, W. (1990). Automata on infinite objects. In J. van Leeuwen (Ed.), Handbook of theoretical computer science (Vol. B, pp. 133–191). Amsterdam: Elsevier.
Tiňo, P., Čerňanský, M., & Beňušková, L. (2002a). Markovian architectural bias of recurrent neural networks. In P. Sinčák, J. Vaščák, V. Kvasnička, & J. Pospichal (Eds.), Intelligent technologies—Theory and applications (pp. 17–23). Amsterdam: IOS Press.
Tiňo, P., Čerňanský, M., & Beňušková, L. (2002b). Markovian architectural bias of recurrent neural networks (Tech. Rep. NCRG/2002/008). Birmingham, UK: NCRG, Aston University.
Tiňo, P., Horne, B. G., & Giles, C. L. (2001). Attractive periodic sets in discrete time recurrent networks (with emphasis on fixed point stability and bifurcations in two-neuron networks). Neural Computation, 13(6), 1379–1414.
Tiňo, P., Horne, B. G., Giles, C. L., & Collingwood, P. C. (1998). Finite state machines and recurrent neural networks—automata and dynamical systems approaches. In J. E. Dayhoff & O. Omidvar (Eds.), Neural networks and pattern recognition (pp. 171–220). Orlando, FL: Academic Press.
Tiňo, P., & Šajda, J. (1995). Learning and extracting initial Mealy machines with a modular neural network model. Neural Computation, 7(4), 822–844.
Watrous, R. L., & Kuhn, G. M. (1992). Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4(3), 406–414.

Received August 27, 2002; accepted February 5, 2003.
LETTER
Communicated by Ralf Herbrich
An Effective Bayesian Neural Network Classifier with a Comparison Study to Support Vector Machine

Faming Liang
[email protected]
Department of Statistics, Texas A&M University, College Station, TX 77843, U.S.A.
We propose a new Bayesian neural network classifier, different from that commonly used in several respects, including the likelihood function, prior specification, and network structure. Under regularity conditions, we show that the decision boundary determined by the new classifier will converge to the true one. We also propose a systematic implementation for the new classifier. In our implementation, the tuning of connection weights, the selection of hidden units, and the selection of input variables are unified by sampling from the joint posterior distribution of the network structure and connection weights. The numerical results show that the new classifier consistently outperforms the commonly used Bayesian neural network classifier and the support vector machine in terms of generalization performance. The reason for the inferiority of the commonly used Bayesian neural network classifier and the support vector machine is discussed at length.

1 Introduction

Support vector machines (SVMs) have recently been proposed as a technique for solving pattern classification problems (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995) and have rapidly gained in popularity due to their theoretical merits, computational simplicity, and excellent performance in real-world applications. The excellence is certainly based on a comparison with other classification tools, such as nearest neighbor, discriminant analysis, and neural networks. For example, Ding and Dubchak (2001) and Markowetz, Edler, and Vingron (2002) showed numerically that SVMs outperform neural networks and other classification tools in protein fold class prediction. Breiman (2002) commented that the excellent performance of SVMs is beginning to supplant the use of neural networks. However, SVMs have two drawbacks despite their successes. One is that their decision boundary is determined only by the support vectors. The boundary will be severely biased from the true one if some of the support vectors are statistical outliers.
The support vectors often contain some extreme values of the populations. The low reliability of the extreme values often makes the decision boundary of SVMs unreliable. The other drawback is

Neural Computation 15, 1959–1989 (2003) © 2003 Massachusetts Institute of Technology
1960
F. Liang
that it does not take account of the uncertainty of the data. As many researchers know, methods that take account of data uncertainty—for example, the ensemble method (Perrone & Cooper, 1993; Rosen, 1996; Krogh & Vedelsby, 1995) and Bayesian model averaging (Raftery, Madigan, & Hoeting, 1997)—often have better generalization performance than methods that ignore data uncertainty.

On the other hand, we note that the inferiority of neural networks in classification is due to implementations rather than their intrinsic capability. In particular, the implementations often suffer from two difficulties. First is the necessity of an a priori specification of the network structure, namely, the number of hidden units and input variables, assuming that the output variables have been determined by the problem. An inappropriate network structure often results in overfitting or insufficient learning of the training data, and thus poor generalization performance. The second difficulty is the lack of efficient training algorithms. Deterministic training algorithms, such as backpropagation (Rumelhart, Hinton, & Williams, 1986), conjugate gradient, and their variants, usually converge to local minima of the objective function. This problem becomes more serious as the network size increases.

Bayesian neural networks (BNNs) (Mackay, 1992; Neal, 1996) provide a natural framework for alleviating the above difficulties. First, the priors for the network structure and connection weights work naturally as a regularization term for network training, controlling the selection of input variables, the number of hidden units, and the range of the weight values. Second, Markov chain Monte Carlo (MCMC) methods are naturally used to train the networks, sampling from the joint posterior of the network structure and connection weights. This avoids the local minima problem encountered by the deterministic training algorithms.
Third, the data uncertainty can be taken into account, as Bayesian model averaging is naturally used for fitting and prediction with the MCMC outputs. For an overview of BNN classifiers, see Thodberg (1996) and Husmeier, Penny, and Roberts (1999).

In this letter, we propose a new BNN classifier, different from that proposed in Neal (1996) in several respects, including the likelihood function, network structure, and prior specification. Under regularity conditions, we show that the decision boundary determined by the new classifier will converge to the true one. We also propose a systematic implementation for the new classifier. The tuning of connection weights, the selection of hidden units, and the selection of input variables are unified by sampling from the joint posterior of the network structure and connection weights. The numerical results show that the new BNN classifier outperforms SVMs and other BNN classifiers in all examples in this letter. The reason for the inferiority of the other BNN classifiers and SVMs is discussed at length.

In section 2, we give a brief description of SVMs. In section 3, we describe the new BNN classifier and show that it will asymptotically converge to the true classification function under regularity conditions. In section 4,
A Bayesian Neural Network Classifier
1961
we describe the numerical implementation for the new BNN classifier. In section 5, we present our numerical results with extensive comparisons to SVMs and other BNN classifiers. In section 6, we conclude with a brief discussion.

2 SVM Classifiers

In this section, we give a brief description of SVMs. Given the training data {(x_i, y_i)}_{i=1}^N with explanatory variables x_i ∈ R^p and the corresponding binary class labels y_i ∈ {−1, +1}, the SVM classifier is constructed in the form

$$y(x) = \mathrm{sign}\Bigg[\sum_{j=1}^{m} w_j \psi_j(x) + b\Bigg] = \mathrm{sign}[w'\psi(x) + b], \tag{2.1}$$
where ψ(x) = [ψ_1(x), ..., ψ_m(x)]′ denotes a set of nonlinear transformations from the input space to the feature space, and w = [w_1, ..., w_m]′ is the set of linear weights connecting the feature space to the output space. The underlying assumption of equation 2.1 is that the data from the two classes can always be separated by a hyperplane, given an appropriate nonlinear mapping ψ(·) to a sufficiently high-dimensional feature space. The nonlinear mapping ψ(·) is related to the kernel function through

$$K(x_1, x_2) = \psi(x_1)'\psi(x_2),$$

where the kernel K(·, ·) is symmetric and positive definite by Mercer's theorem (Mercer, 1909). Typical choices for the kernel function are K(x, x_i) = x′x_i (linear kernel); K(x, x_i) = (x′x_i + 1)^d (polynomial kernel of degree d ∈ N); and K(x, x_i) = exp{−‖x − x_i‖²_2 / 2σ²} (radial basis function (RBF) kernel). Note that the Mercer condition holds for all σ and d values in the RBF and polynomial kernels. The linear weights in equation 2.1 are computed by minimizing the objective function

$$\min_{w,\xi} J_{\mathrm{SVM}}(w, \xi) = \frac{1}{2}w'w + C\sum_{i=1}^{N}\xi_i, \qquad \text{subject to } y_i w'\psi(x_i) \ge 1 - \xi_i,\ \ \xi_i \ge 0, \quad i = 1, \dots, N, \tag{2.2}$$
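As a concrete sketch (this is not the authors' code, and the support vectors, multipliers, and bias below are made-up values for illustration), the three kernels above and the dual-form decision function can be written directly:

```python
import math

def linear_kernel(x, xi):
    # K(x, x_i) = x' x_i
    return sum(a * b for a, b in zip(x, xi))

def poly_kernel(x, xi, d=2):
    # K(x, x_i) = (x' x_i + 1)^d
    return (linear_kernel(x, xi) + 1) ** d

def rbf_kernel(x, xi, sigma=1.0):
    # K(x, x_i) = exp{-||x - x_i||^2 / (2 sigma^2)}
    sq = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-sq / (2 * sigma ** 2))

def svm_decision(x, support, alphas, labels, b, kernel):
    # f(x) = sum_i y_i * alpha_i * K(x_i, x) + b*
    return sum(y * a * kernel(xi, x)
               for xi, a, y in zip(support, alphas, labels)) + b

# Hypothetical support vectors and dual solution, for illustration only.
support = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, 0.5]
labels = [+1, -1]
f = svm_decision((2.0, 2.0), support, alphas, labels, b=0.0, kernel=linear_kernel)
label = 1 if f >= 0 else -1   # classification rule: sign[f(x)]
```

Only the support vectors enter the sum, which is exactly the locality discussed in the text below.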
In equation 2.2, the ξ_i's are slack variables introduced to allow for some misclassification due to overlapping distributions. The C is a user-specified positive real number; it controls the trade-off between the complexity of the machine and the number of nonseparable patterns. With the so-called Kuhn-Tucker construction (Fletcher, 1987; Bertsekas, 1995), the optimization problem, equation 2.2, can be solved by solving its dual problem, and the resulting optimal classification hyperplane can then be expressed as

$$f(x) = \sum_{i=1}^{N} y_i \hat{\alpha}_i K(x_i, x) + b^*,$$

1962

F. Liang

Figure 1: Illustration of the optimal hyperplane of SVMs for two linearly separable classes. The hyperplane is determined only by the support vectors, which are shown as the filled squares and circles.
where (α̂_1, ..., α̂_N) denotes the solution to the dual problem, and b* is chosen so that y_i f(x_i) = 1 for any i with C > α̂_i > 0. The decision rule for the classification is given by sign[f(x)]. The property of the optimal hyperplane is illustrated by Figure 1. In this case, the two classes are linearly separable (a linear kernel is used), and the support vectors are those x_i's satisfying

$$\hat{w}'x_i + b^* = +1 \ \text{ for } y_i = +1; \qquad \hat{w}'x_i + b^* = -1 \ \text{ for } y_i = -1,$$

where ŵ = Σ_{j=1}^N y_j α̂_j x_j. The Kuhn-Tucker conditions imply that the following equalities hold:

$$\hat{\alpha}_i\left[y_i(\hat{w}'x_i + b^*) - 1\right] = 0, \qquad i = 1, \dots, N. \tag{2.3}$$

Therefore, only those α̂_i's corresponding to the support vectors can be nonzero and thus contribute to the construction of the optimal hyperplane. In this sense, we say that the decision boundary of the SVMs is determined
locally by the support vectors, and the information contained in the other vectors is ignored in the boundary construction. If the support vectors unfortunately contain some statistical outliers, the resulting boundary may be severely biased from the true one. For linearly nonseparable cases, the optimal hyperplane is determined similarly, but with the slack variables being taken into account.

3 Bayesian Neural Network Classifiers

Suppose we have a group of data D = {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^p are explanatory variables and y_i indicates the class membership. For problems of two classes, y_i is represented by a single digit, 0 or 1. For problems of multiple classes, y_i is represented by a binary vector, and class membership is indicated by the position of its only component of 1. A one-hidden-layer feedforward neural network model (illustrated by Figure 2) with weight elimination approximates y_i by a group of q functions (corresponding to q output units) of the form

$$g_j(x_i) = \phi_o\!\left[\alpha_{j0} I_{\alpha_{j0}} + \sum_{k=1}^{p} x_{ik}\alpha_{jk} I_{\alpha_{jk}} + \sum_{l=1}^{M} \beta_{jl} I_{\beta_{jl}}\,\phi_h\!\left(\gamma_{l0} I_{\gamma_{l0}} + \sum_{k=1}^{p} x_{ik}\gamma_{lk} I_{\gamma_{lk}}\right)\right], \qquad j = 1, \dots, q, \tag{3.1}$$
where α_{j0} denotes the bias of the jth output unit; α_{jk} denotes the weight on the shortcut connection from the kth input unit to the jth output unit; β_{jl} denotes the weight on the connection from the lth hidden unit to the jth output unit; γ_{l0} denotes the bias of the lth hidden unit; γ_{lk} denotes the weight on the connection from the kth input unit to the lth hidden unit; I_ζ is an indicator function that indicates the effectiveness of the connection ζ; M denotes the maximum number of hidden units, specified by the user; and φ_o(·) and φ_h(·) denote the activation functions of the output units and hidden units, respectively. Sigmoid and hyperbolic tangent functions are two popular choices for the activation functions. In this letter, we set φ_h(·) = tanh(·) for all examples to ensure that the output of a hidden unit is 0 if all connections to the hidden unit from the input units have been eliminated, so that the hidden unit can be eliminated from the network without any effect on the outputs. This is not the case for the sigmoid function, which returns a constant of 0.5 if the input is zero; extra work would be needed to absorb that constant into the bias term if we wanted to delete the hidden unit from the network. The φ_o(·) is set according to the problem under consideration. For problems of two classes, we set φ_o to the sigmoid function, φ_o(z) = 1/(1 + exp(−z)); for problems of multiple classes, we set φ_o(·) to the softmax function (Ripley, 1994), exp(z_i)/Σ_j exp(z_j), to ensure that the outputs sum to one. These settings also ensure the probability interpretation of the BNN outputs.
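A minimal sketch of the masked network of equation 3.1, for a single output unit (the weights and indicators below are hypothetical; tanh hidden units and a sigmoid output, as in the text):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def g(x, alpha, I_alpha, beta, I_beta, gamma, I_gamma):
    """One output unit of equation 3.1; index 0 of each weight vector is the bias.

    alpha: shortcut weights (bias + p inputs); beta: hidden-to-output weights (M);
    gamma[l]: input-to-hidden weights of hidden unit l (bias + p inputs).
    The I_* indicator arrays switch individual connections on (1) or off (0).
    """
    xb = [1.0] + list(x)  # prepend the constant bias input
    net = sum(a * ia * xi for a, ia, xi in zip(alpha, I_alpha, xb))
    for l, (b, ib) in enumerate(zip(beta, I_beta)):
        h = math.tanh(sum(w * iw * xi
                          for w, iw, xi in zip(gamma[l], I_gamma[l], xb)))
        net += b * ib * h
    return sigmoid(net)

# Hypothetical two-input, one-hidden-unit configuration.
alpha, I_alpha = [0.1, 0.5, -0.3], [1, 1, 1]
beta, I_beta = [0.8], [1]
gamma, I_gamma = [[0.0, 1.0, 1.0]], [[1, 1, 1]]
y_hat = g((0.5, -0.2), alpha, I_alpha, beta, I_beta, gamma, I_gamma)
```

With tanh hidden units, zeroing all of I_gamma[l] drives hidden unit l's output to tanh(0) = 0, so the unit drops out of the sum, which is exactly the elimination property the text describes.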
Figure 2: A fully connected one-hidden-layer feedforward neural network with four input units, three hidden units, and two output units. The arrows show the direction of data feeding, where each hidden unit independently processes the values fed to it by units in the preceding layer and then presents its output to units in the next layer for further processing.
For simplicity, in the following, we will suppress the bias terms α_{j0}, γ_{10}, ..., γ_{M0} by treating them as the weights on the connections exiting from an additional input unit with a constant input, say, x_{i0} ≡ 1. Let Λ be the vector consisting of all the indicators of equation 3.1. Note that Λ specifies the structure of the network. Let α = (α_{jk})_{q×(p+1)}, β = (β_{jk})_{q×M}, γ = (γ_{jk})_{M×(p+1)}, and θ_Λ = (α_Λ, β_Λ, γ_Λ), where α_Λ, β_Λ, and γ_Λ denote the nonzero subsets of α, β, and γ, respectively. Thus, model 3.1 is completely specified by the tuple (Λ, θ_Λ). For simplicity, in the following, we will denote the model by θ_Λ only. To show the dependence of g_j(x_i) on the model, we will also write it as g_j(x_i, θ_Λ). To conduct a Bayesian analysis for model 3.1, we first define the likelihood function as

$$P(y \mid x, \theta_\Lambda) \propto \exp\{-\tau H(\theta_\Lambda)\},$$

where τ is a tunable parameter and

$$H(\theta_\Lambda) = \sum_{i=1}^{N}\sum_{j=1}^{q}\left[g_j(x_i, \theta_\Lambda) - y_{ij}\right]^2. \tag{3.2}$$
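To make the unnormalized likelihood concrete, the following toy sketch (made-up numbers, binary case q = 1) evaluates H(θ_Λ) and a per-observation normalizing constant obtained by brute force, summing exp{−τ[g(x_i) − y]²} over the two possible labels; this also shows why that constant depends on the model through g, the issue the text addresses next:

```python
import math

tau = 0.5  # hypothetical tunable parameter

def H(preds, targets):
    # H(theta) = sum_i sum_j [g_j(x_i) - y_ij]^2  (here q = 1)
    return sum((g - y) ** 2 for g, y in zip(preds, targets))

def z0(g_i, tau):
    # normalizing constant over y in {0, 1} for a single observation
    return sum(math.exp(-tau * (g_i - y) ** 2) for y in (0.0, 1.0))

preds = [0.9, 0.2, 0.6]
targets = [1.0, 0.0, 1.0]
log_lik_unnorm = -tau * H(preds, targets)
```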
Note that this likelihood is not normalized. It cannot be used directly for Bayesian inference, as its normalizing constant may contain the unknown parameters. With the same technique as that adopted by Sollich (2002), we avoid the problem by assuming an additional distribution for x such that the normalizing constant cancels out. If equation 3.2 is normalized as

$$P(y \mid x, \theta_\Lambda) = \exp\{-\tau H(\theta_\Lambda)\}/z_1(\theta_\Lambda, x),$$

where z_1(θ_Λ, x) is chosen to normalize P(y | x, θ_Λ) across all possible values of y, then we choose

$$P(x \mid \theta_\Lambda) = Q(x)\,z_1(\theta_\Lambda, x)/z_2(\theta_\Lambda),$$

where Q(·) is any positive function and z_2(θ_Λ) is the normalizing constant. Thus, we have the joint distribution

$$P(D \mid \theta_\Lambda) = P(y \mid x, \theta_\Lambda)\,P(x \mid \theta_\Lambda) = \exp\{-\tau H(\theta_\Lambda)\}\,Q(x)/z_2(\theta_\Lambda), \tag{3.3}$$

so z_1(θ_Λ, x) cancels out of the expression. More precisely, we can write z_1(θ_Λ, x) = ∏_{i=1}^N z_0(θ_Λ, x_i), where z_0(θ_Λ, x_i) is the normalizing constant of the distribution

$$P(y_i \mid x_i, \theta_\Lambda) = \exp\Bigg\{-\tau \sum_{j=1}^{q}\left[g_j(x_i, \theta_\Lambda) - y_{ij}\right]^2\Bigg\}\Big/z_0(\theta_\Lambda, x_i),$$

corresponding to the ith observation; write Q(x) = ∏_{i=1}^N Q(x_i); and write z_2(θ_Λ) = [∫ Q(x_i) z_0(θ_Λ, x_i) dx_i]^N, which depends on the sample size of the given data set. We specify the following log-prior distribution for θ_Λ:

$$\begin{aligned} \log P(\theta_\Lambda) = \text{Constant} &+ m_{\mathrm{eff}}\log\lambda - \log(m_{\mathrm{eff}}!) - \frac{m_{\mathrm{eff}}}{2}\log(2\pi) \\ &- \frac{1}{2}\sum_{j=1}^{q}\sum_{k=0}^{p} I_{\alpha_{jk}}\left(\log\sigma_\alpha^2 + \frac{\alpha_{jk}^2}{\sigma_\alpha^2}\right) \\ &- \frac{1}{2}\sum_{j=1}^{q}\sum_{k=1}^{M} I_{\beta_{jk}}\,\delta\!\left(\sum_{l=0}^{p} I_{\gamma_{kl}}\right)\left(\log\sigma_\beta^2 + \frac{\beta_{jk}^2}{\sigma_\beta^2}\right) \\ &- \frac{1}{2}\sum_{j=1}^{M}\sum_{k=0}^{p} \delta\!\left(\sum_{l=1}^{q} I_{\beta_{lj}}\right) I_{\gamma_{jk}}\left(\log\sigma_\gamma^2 + \frac{\gamma_{jk}^2}{\sigma_\gamma^2}\right) \\ &+ \log z_2(\theta_\Lambda), \tag{3.4} \end{aligned}$$
where δ(s) is 1 if s > 0 and 0 otherwise, and m_eff is the total number of effective connections of θ_Λ,

$$m_{\mathrm{eff}} = \sum_{j=1}^{q}\sum_{k=0}^{p} I_{\alpha_{jk}} + \sum_{j=1}^{q}\sum_{k=1}^{M} I_{\beta_{jk}}\,\delta\!\left(\sum_{l=0}^{p} I_{\gamma_{kl}}\right) + \sum_{j=1}^{M}\sum_{k=0}^{p} \delta\!\left(\sum_{l=1}^{q} I_{\beta_{lj}}\right) I_{\gamma_{jk}}.$$

In the prior P(θ_Λ), z_2(θ_Λ) can be regarded as a weighting factor. Since the weighting factor depends on the sample size of the given data set, we may have some difficulty in probability interpretation. That is, if one constructs a distribution for a data set of size N + 1 and then marginalizes over x_{N+1} and y_{N+1}, one does not obtain the same distribution as if one had assumed from the start that there were only N points. As pointed out by Sollich (2002), "While this may seem objectionable on theoretical grounds, in most applications one has to deal with a single, given, data set—and thus a single value of N—and then this objection is irrelevant." If the weighting factor were ignored, it would be equivalent to assuming the following independent priors for the network weights and structure: α_Λ ~ N(0, σ_α² I_α), β_Λ ~ N(0, σ_β² I_β), and γ_Λ ~ N(0, σ_γ² I_γ), where each I. is an identity matrix of appropriate order, and Λ is subject to a prior probability proportional to the mass put on m_eff by a truncated Poisson with rate λ,

$$\tilde{P}(\Lambda) = \begin{cases} \dfrac{1}{z_3}\,\dfrac{\lambda^{m_{\mathrm{eff}}}}{m_{\mathrm{eff}}!}, & m_{\mathrm{eff}} = 3, 4, \dots, U, \\[4pt] 0, & \text{otherwise}, \end{cases} \tag{3.5}$$

where U = (p + 1)(M + q) + Mq is the total number of connections of a model with all I_ζ = 1, and z_3 = Σ_{Λ∈Ω} λ^{m_eff}/m_eff!, where Ω denotes the set of all possible BNN structures with 3 ≤ m_eff ≤ U. We set the minimum value of m_eff to three based on our views: neural networks are usually used for complex problems, and three is a small enough number as a limiting network size. The σ_α², σ_β², σ_γ², and λ are all hyperparameters to be specified by the user (discussed below). In equation 3.5, we use P̃(Λ) to indicate that it is a probability distribution but not the real prior probability we put on Λ. Similarly, we let π̃(θ_Λ | Λ) denote the prior density of θ_Λ given structure Λ, ignoring the weighting factor. Thus, we have the following log posterior:

$$\begin{aligned} \log P(\theta_\Lambda \mid D) &= \log P(D \mid \theta_\Lambda) + \log P(\theta_\Lambda) - \log Z(D) \\ &= \log Q(x) - \tau H(\theta_\Lambda) + \log\tilde{P}(\Lambda) + \log\tilde{\pi}(\theta_\Lambda \mid \Lambda) \\ &= \text{Constant} - \tau H(\theta_\Lambda) + m_{\mathrm{eff}}\log\lambda - \log(m_{\mathrm{eff}}!) - \frac{m_{\mathrm{eff}}}{2}\log(2\pi) \\ &\quad - \frac{1}{2}\sum_{j=1}^{q}\sum_{k=0}^{p} I_{\alpha_{jk}}\left(\log\sigma_\alpha^2 + \frac{\alpha_{jk}^2}{\sigma_\alpha^2}\right) - \frac{1}{2}\sum_{j=1}^{q}\sum_{k=1}^{M} I_{\beta_{jk}}\,\delta\!\left(\sum_{l=0}^{p} I_{\gamma_{kl}}\right)\left(\log\sigma_\beta^2 + \frac{\beta_{jk}^2}{\sigma_\beta^2}\right) \\ &\quad - \frac{1}{2}\sum_{j=1}^{M}\sum_{k=0}^{p} \delta\!\left(\sum_{l=1}^{q} I_{\beta_{lj}}\right) I_{\gamma_{jk}}\left(\log\sigma_\gamma^2 + \frac{\gamma_{jk}^2}{\sigma_\gamma^2}\right), \tag{3.6} \end{aligned}$$
where Z(D) is the normalizing constant of the posterior. The rationale of the likelihood function can be argued as follows. First, we rewrite H(θ_Λ) as

$$H(\theta_\Lambda) = \sum_{j=1}^{q}\sum_{i=1}^{N}\left[g_j(x_i, \theta_\Lambda) - y_{ij}\right]^2 = \sum_{j=1}^{q} H_j(\theta_\Lambda).$$

For each H_j(·), we have

$$H_j(\theta_\Lambda) = \sum_{i=1}^{N}\left[g_j(x_i, \theta_\Lambda) - y_{ij}\right]^2$$
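Up to its additive constant (and dropping the log z_2 weighting factor, as the text itself suggests), the log posterior of equation 3.6 is just the fit term −τH(θ_Λ) plus a structure penalty and a Gaussian penalty on the effective weights. A simplified sketch with one shared prior variance σ² in place of the three per-layer variances, and toy numbers throughout:

```python
import math

def log_posterior_unnorm(H_val, weights, tau, lam, sigma2):
    """-tau*H + m_eff*log(lam) - log(m_eff!) - (m_eff/2)*log(2*pi)
       - (1/2) * sum_w (log(sigma2) + w**2 / sigma2), up to a constant."""
    m_eff = len(weights)
    lp = -tau * H_val
    lp += m_eff * math.log(lam) - math.lgamma(m_eff + 1)  # lgamma(m+1) = log(m!)
    lp -= 0.5 * m_eff * math.log(2 * math.pi)
    lp -= 0.5 * sum(math.log(sigma2) + w * w / sigma2 for w in weights)
    return lp

# Two hypothetical models with the same fit but different sizes.
lp_small = log_posterior_unnorm(0.21, [0.3, -0.5, 1.2],
                                tau=0.5, lam=10.0, sigma2=1.0)
lp_big = log_posterior_unnorm(0.21, [0.3, -0.5, 1.2, 0.7, -0.1, 2.0],
                              tau=0.5, lam=10.0, sigma2=1.0)
```

At equal fit, the structure and weight penalties alone decide between the two models, which is how the prior controls network size.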
⋯ > t_n ≡ 1. Issues related to the choice of the temperature ladder can be found in Geyer and Thompson (1995) and Liang and Wong (2000). Let ξ^i denote a sample from f_i(·); we follow Liang and Wong (2000) in calling it an individual. The n individuals ξ^1, ..., ξ^n form a population, denoted by z = {ξ^1, ..., ξ^n}. We further assume that all individuals in the same population z are mutually independent, so they have the following joint density:

$$f(z) \propto \exp\left\{-\sum_{i=1}^{n} E(\xi^i)/t_i\right\}. \tag{4.1}$$

TRJ-MCMC works by sampling from f(z) directly. The samples of interest are collected at the target temperature level. TRJ-MCMC consists of two operators, local updating and exchange, described in the following sections.

4.1 Local Updating. In the local updating operator, an individual, say ξ^k, is selected at random from the current population z. Then ξ^k is modified to a new individual ξ^{k′} by some type of move (described below). The new population z′ = {ξ^1, ..., ξ^{k−1}, ξ^{k′}, ξ^{k+1}, ..., ξ^n} is accepted with probability min(1, r_l) according to the Metropolis-Hastings rule (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Hastings, 1970), and z is kept
unchanged with the remaining probability, where

$$r_l = \frac{f(z')}{f(z)}\,\frac{T(z \mid z')}{T(z' \mid z)} = \exp\left\{-\left(E(\xi^{k'}) - E(\xi^{k})\right)/t_k\right\}\frac{T(z \mid z')}{T(z' \mid z)}, \tag{4.2}$$
and T(· | ·) denotes the transition probability between two populations. The move types we used for this article are "birth," "death," and "Metropolis," as in Green (1995). Let S denote the set of effective connections of the current neural network ξ^k, and S^c the complementary set of S. Let m_eff = ‖S‖ be the number of elements in S (recall the definition of m_eff in section 3); thus, ‖S^c‖ = U − m_eff. Let P(m_eff, birth), P(m_eff, death), and P(m_eff, Metropolis) denote the proposal probabilities of the three types of moves for a network with m_eff connections. In our examples, we set P(m_eff, birth) = P(m_eff, death) = 1/3 for 3 < m_eff < U, P(U, death) = P(3, birth) = 2/3, P(U, birth) = P(3, death) = 0, and P(m_eff, Metropolis) = 1/3 for 3 ≤ m_eff ≤ U. For convenience, in the remaining part of this section, we will let i be an index of the output units, j be an index of the hidden units, and l be an index of the bias unit and input units. Their respective ranges are 1 ≤ i ≤ q, 1 ≤ j ≤ M, and 0 ≤ l ≤ p.

In the "birth" move, a connection, say ζ, is first uniformly chosen from S^c. If ζ ∈ {α_il : I_{α_il} = 0} ∪ {β_ij : I_{β_ij} = 0, Σ_{u=1}^q I_{β_uj} ≥ 1} ∪ {γ_jl : I_{γ_jl} = 0, Σ_{u=0}^p I_{γ_ju} ≥ 1}, the move is called the type I "birth" move (illustrated by Figure 4a). In this type of move, only the connection ζ is proposed to add to ξ^k, with weight ω_ζ randomly drawn from N(0, σ²_{ξ^k}), where σ²_{ξ^k} = Σ_{i=1}^{m_eff} (w_i − w̄)²/(m_eff − 1), w_1, ..., w_{m_eff} denote the effective weights of ξ^k, and w̄ = Σ_{i=1}^{m_eff} w_i/m_eff. The resulting transition probability ratio is

$$\frac{T(z \mid z')}{T(z' \mid z)} = \frac{P(m_{\mathrm{eff}}+1, \text{death})}{P(m_{\mathrm{eff}}, \text{birth})}\,\frac{U - m_{\mathrm{eff}}}{m'_{\mathrm{del}}}\,\frac{1}{\phi(\omega_\zeta/\sigma_{\xi^k})},$$

where φ(·) denotes the standard normal density, and m′_del denotes the number of deletable connections of the network ξ^{k′}. A connection, say ν, is deletable from a network if ν ∈ {α_il : I_{α_il} = 1} ∪ {β_ij : I_{β_ij} = 1, Σ_{u=1}^q I_{β_uj} > 1} ∪ {β_ij : I_{β_ij} = 1, Σ_{u=1}^q I_{β_uj} = 1, Σ_{u=0}^p I_{γ_ju} = 1} ∪ {γ_jl : I_{γ_jl} = 1, Σ_{u=0}^p I_{γ_ju} > 1} ∪ {γ_jl : I_{γ_jl} = 1, Σ_{u=1}^q I_{β_uj} = 1, Σ_{u=0}^p I_{γ_ju} = 1}.

If ζ ∈ {β_ij : I_{β_ij} = 0, Σ_{u=1}^q I_{β_uj} = 0, Σ_{u=0}^p I_{γ_ju} = 0} ∪ {γ_jl : Σ_{u=1}^q I_{β_uj} = 0, Σ_{u=0}^p I_{γ_ju} = 0}, the move is called the type II "birth" move (illustrated by Figure 4b), which is to add a new hidden unit. In this type of move, if ζ ∈ {β_ij : I_{β_ij} = 0, Σ_{u=1}^q I_{β_uj} = 0, Σ_{u=0}^p I_{γ_ju} = 0}, a connection, say ζ′, is uniformly chosen from the set of connections from input units to the corresponding hidden unit; otherwise, ζ′ is uniformly chosen from the set of connections from the corresponding hidden unit to the output units, and then ζ and ζ′ are proposed to add to ξ^k together. The weights ω_ζ and ω_{ζ′} are
Figure 4: Illustration of the “birth” and “death” moves. The dotted lines denote the connections to add to or delete from the current network. (a) Type I “birth” (“death”) move: There is only one connection to add to (from left to right) or delete from (from right to left) the current network. (b) Type II “birth” (“death”) move: There are two connections to add to (from left to right) or delete from (from right to left) the current network simultaneously. The two connections must be the only two connections associated with that hidden unit; one must be from input units to that hidden unit, and the other one must be from that hidden unit to output units.
drawn independently from N(0, σ²_{ξ^k}). The resulting transition probability ratio is

$$\frac{T(z \mid z')}{T(z' \mid z)} = \frac{P(m_{\mathrm{eff}}+2, \text{death})}{P(m_{\mathrm{eff}}, \text{birth})}\,\frac{2(U - m_{\mathrm{eff}})}{(1/q + 1/(p+1))\,m'_{\mathrm{del}}\,\phi(\omega_\zeta/\sigma_{\xi^k})\,\phi(\omega_{\zeta'}/\sigma_{\xi^k})}.$$

Once the "birth" proposal is accepted, the corresponding indicators are switched to 1.

In the "death" move, a connection, say ζ, is first uniformly chosen from the set of deletable connections of ξ^k. If ζ ∈ {α_il : I_{α_il} = 1} ∪ {β_ij : I_{β_ij} = 1, Σ_{u=1}^q I_{β_uj} > 1} ∪ {γ_jl : I_{γ_jl} = 1, Σ_{u=0}^p I_{γ_ju} > 1}, the move is called the type I "death" move (illustrated by Figure 4a). In this type of move, only ζ is proposed for deletion, and the resulting transition probability ratio is

$$\frac{T(z \mid z')}{T(z' \mid z)} = \frac{P(m_{\mathrm{eff}}-1, \text{birth})}{P(m_{\mathrm{eff}}, \text{death})}\,\frac{m_{\mathrm{del}}\,\phi(\omega_\zeta/\sigma'_{\xi^k})}{U - m_{\mathrm{eff}} + 1},$$

where m_del denotes the number of deletable connections of ξ^k, and σ′_{ξ^k} denotes the sample standard deviation of the connection weights of ξ^k except
ζ. If ζ ∈ {β_ij : I_{β_ij} = 1, Σ_{u=1}^q I_{β_uj} = 1, Σ_{u=0}^p I_{γ_ju} = 1} ∪ {γ_jl : I_{γ_jl} = 1, Σ_{u=1}^q I_{β_uj} = 1, Σ_{u=0}^p I_{γ_ju} = 1}, the move is called the type II "death" move (illustrated by Figure 4b), which is to eliminate one hidden unit from the model. In this type of move, if ζ ∈ {β_ij : I_{β_ij} = 1, Σ_{u=1}^q I_{β_uj} = 1, Σ_{u=0}^p I_{γ_ju} = 1}, the only connection, denoted by ζ′, from some input unit to the corresponding hidden unit is also proposed for deletion together with ζ; otherwise, the connection denoted by ζ′ from the corresponding hidden unit to some output unit is proposed for deletion together with ζ. The resulting transition probability ratio is

$$\frac{T(z \mid z')}{T(z' \mid z)} = \frac{P(m_{\mathrm{eff}}-2, \text{birth})}{P(m_{\mathrm{eff}}, \text{death})}\,\frac{(1/q + 1/(p+1))\,m_{\mathrm{del}}\,\phi(\omega_\zeta/\sigma'_{\xi^k})\,\phi(\omega_{\zeta'}/\sigma'_{\xi^k})}{2(U - m_{\mathrm{eff}} + 2)}.$$

Once the "death" proposal is accepted, the corresponding indicators are switched to 0.

In the "Metropolis" move, the weights are updated while keeping the indicator vector unchanged. In this article, we implemented the move in a Metropolis-within-Gibbs sampler (Müller, 1993; Robert & Casella, 1999): in any step of the Gibbs sampler where sampling from the conditional distribution is difficult, a one-step Metropolis-Hastings move is substituted. Let the network weights first be arranged into a vector, ξ^k = (ξ^k_1, ..., ξ^k_{m_eff}), in some fixed order. Let ξ^k_{[a,b]} = (ξ^k_a, ξ^k_{a+1}, ..., ξ^k_b), and let ξ^k_{−[a,b]} denote all elements of ξ^k except the set ξ^k_{[a,b]}. Let κ be the maximum block size allowed by the move. The weights are updated iteratively as follows:

• Update ξ^k_{[1,κ]} by one step of Metropolis-Hastings (MH) simulation from the conditional density f(ξ^k_{[1,κ]} | ξ^k_{−[1,κ]}).
• Update ξ^k_{[κ+1,2κ]} by one step of MH simulation from the conditional density f(ξ^k_{[κ+1,2κ]} | ξ^k_{−[κ+1,2κ]}).
• ⋮
• Update ξ^k_{[(r−1)κ+1, m_eff]} by one step of MH simulation from the conditional density f(ξ^k_{[(r−1)κ+1, m_eff]} | ξ^k_{−[(r−1)κ+1, m_eff]}).

The first block is updated as follows. First, a random direction d_1 is uniformly drawn from the κ-dimensional space. Then set

$$\xi^{k'}_{[1,\kappa]} = \xi^{k}_{[1,\kappa]} + \rho_k d_1,$$

where ρ_k is a random variable drawn from a gaussian distribution with mean 0 and variance ηt_k. The η is a user-specified parameter, the so-called Metropolis step size; it controls the acceptance rate of the move. The ξ^{k′}_{[1,κ]} is accepted according to the Metropolis-Hastings rule. Once it is accepted, the current ξ^k is updated immediately by replacing ξ^k_{[1,κ]} with ξ^{k′}_{[1,κ]}. For this type of move, we have T(z | z′)/T(z′ | z) = 1, since the transition is symmetric in the sense T(z | z′) = T(z′ | z).
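The pieces of the local updating operator can be sketched numerically. This is not the authors' code: all counts, weights, and parameters below are made up for illustration; φ is the standard normal density, and the block move is the symmetric proposal above (transition ratio 1), with the uniform direction obtained by normalizing a standard gaussian vector:

```python
import math
import random

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def local_ratio(E_old, E_new, t_k, transition_ratio=1.0):
    """r_l of equation 4.2: exp{-(E(xi') - E(xi))/t_k} * T(z|z')/T(z'|z)."""
    return math.exp(-(E_new - E_old) / t_k) * transition_ratio

def birth1_ratio(m_eff, U, m_del_new, w_new, sigma, p_birth, p_death_next):
    """T(z|z')/T(z'|z) for a type I 'birth': the forward move picks one of the
    U - m_eff absent connections and draws its weight from N(0, sigma^2); the
    reverse move deletes one of the m_del_new deletable connections."""
    forward = p_birth * (1.0 / (U - m_eff)) * phi(w_new / sigma)
    reverse = p_death_next * (1.0 / m_del_new)
    return reverse / forward

def propose_block(block, t_k, eta, rng):
    """Symmetric block proposal: xi'_[1,kappa] = xi_[1,kappa] + rho_k * d1,
    with d1 uniform on the unit sphere and rho_k ~ N(0, eta * t_k)."""
    g = [rng.gauss(0.0, 1.0) for _ in block]
    norm = math.sqrt(sum(v * v for v in g))
    d1 = [v / norm for v in g]              # normalized gaussian: uniform direction
    rho = rng.gauss(0.0, math.sqrt(eta * t_k))
    return [x + rho * d for x, d in zip(block, d1)]

rng = random.Random(3)
new_block = propose_block([0.2, -1.0, 0.5], t_k=2.0, eta=0.1, rng=rng)
r_l = local_ratio(E_old=4.0, E_new=3.2, t_k=2.0)   # downhill move: r_l > 1
r_birth = birth1_ratio(m_eff=6, U=20, m_del_new=5, w_new=0.4,
                       sigma=1.0, p_birth=1/3, p_death_next=1/3)
```

A move is then accepted when a uniform draw falls below min(1, r); downhill moves (r_l > 1) are always accepted.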
4.2 Exchange. Given the current population z and the attached temperature ladder t, we try to exchange ξ^i and ξ^j without changing the t's; initially we have (z, t) = (ξ^1, t_1, ..., ξ^i, t_i, ..., ξ^j, t_j, ..., ξ^n, t_n), and we propose to change it to (z′, t) = (ξ^1, t_1, ..., ξ^j, t_i, ..., ξ^i, t_j, ..., ξ^n, t_n). The new population is accepted with probability min(1, r_e) according to the Metropolis-Hastings rule, where

$$r_e = \frac{f(z')}{f(z)}\,\frac{T(z \mid z')}{T(z' \mid z)} = \exp\left\{\left(H(\xi^{i}) - H(\xi^{j})\right)\left(\frac{1}{t_i} - \frac{1}{t_j}\right)\right\}\frac{T(z \mid z')}{T(z' \mid z)}. \tag{4.3}$$

Typically, the exchange is performed only on two states with neighboring temperatures, |i − j| = 1. If p(ξ^i) is the probability that ξ^i is chosen to exchange with the other state, and w_{i,j} is the probability that ξ^j is chosen to exchange with ξ^i, we have T(z′ | z) = p(ξ^i)w_{i,j} + p(ξ^j)w_{j,i}. Thus, T(z′ | z)/T(z | z′) = 1 in equation 4.3.

4.3 Algorithm. With the operators described above, one iteration of TRJ-MCMC consists of the following two steps:

1. Try to update each of the individuals of the current population by the local updating operator.
2. Try to exchange ξ^i with ξ^j for n − 1 pairs (i, j), with i sampled uniformly on {1, ..., n} and j = i ± 1 with probability w_{i,j}, where w_{i,i+1} = w_{i,i−1} = 0.5 for 1 < i < n and w_{1,2} = w_{n,n−1} = 1.

In this algorithm, the user-specified parameters include the population size n, the temperature ladder t, the block size κ, and the Metropolis step size η. For parameter setting, we have the following suggestions. First, set the temperature ladder carefully: set the highest temperature such that almost all local updating moves can be accepted, and set the intermediate temperatures such that the exchange operations have an almost equal acceptance rate for all neighboring levels. For all examples in this article, we set the temperature ladder as a series with equal harmonic differences, that is, 1/t_{i+1} − 1/t_i = c for i = 1, 2, ..., n − 1, where c is a constant. Then set κ to a moderate value; it may range from tens to 100 in our experience. In this article, we set κ = 50 for all examples. Next, set η such that the Metropolis moves have a moderate acceptance rate, for example, 0.2 to 0.4, as suggested by Gelman, Roberts, and Gilks (1996).
5 Numerical Results

5.1 Two Simulated Examples. BNNs were first tested with two simulated examples with known optimal classification boundaries. Example 1 contains 50 pairs of training and test data sets, which are all mutually independent. Each training set consists of 50 independent and identically distributed (i.i.d.) samples drawn from N_2(μ, I_2) (class 1) and 50 i.i.d. samples drawn from N_2(−μ, I_2) (class 2); each test set consists of 500 i.i.d. samples drawn from N_2(μ, I_2) and 500 i.i.d. samples drawn from N_2(−μ, I_2), where μ = (1.1, 1.1)′ and I_2 is a 2 × 2 identity matrix. It is clear that the optimal classification boundary of each data set is the straight line y = −x in the two-dimensional Cartesian coordinates. The new BNN classifier was first applied to the example with M = 2, τ = 0.1, and a variety of λ's with values ranging from 10 to 25. Here we set M = 2, as the corresponding full model has 11 connections, which should be enough for this example to approximate the true classification boundary. The τ and λ were chosen by a 10-fold cross-validation procedure for 10 training sets; that is, each training set was randomly divided into 10 disjoint sets of equal size, the classifier was trained 10 times, each time with a different set held out as a validation set, and the quality of the parameter settings was assessed by averaging over the 100 errors (10 sets × 10-fold runs). In training, a temperature ladder of 10 levels was used, with the highest temperature t_1 = 5 and the Metropolis step size η = 5. Each run consists of 6000 iterations; the first 1000 iterations were discarded as burn-in, and the following 5000 iterations were used for inference. The results are summarized in Table 1. Here the values of τ are very small, as we expect the learned classification boundary to be approximately a straight line and not to be strongly affected by the samples near the boundary. Given τ, the influence of λ on the classifier is not significant.
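The 10-fold split described above can be sketched generically; this covers only the fold bookkeeping (the train and score steps are whatever the classifier provides):

```python
import random

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint validation folds of (near) equal
    size; return (train_indices, validation_indices) pairs, one per fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

# 100 training points, 10 folds, as in the cross-validation above.
splits = kfold_indices(100, 10)
```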
For the other examples, the τ and λ were determined similarly; we will skip the details of the cross-validation experiments and provide only the final settings.

Table 1: Cross-Validation Experiment for the First Simulated Example.

            τ = 0.05         τ = 0.1          τ = 0.15         τ = 0.2
Setting   Training  Test   Training  Test   Training  Test   Training  Test
λ = 10      5.23    0.63     5.32    0.60     5.20    0.60     5.16    0.62
λ = 15      5.29    0.63     5.22    0.60     5.21    0.63     5.13    0.62
λ = 20      5.28    0.63     5.23    0.60     5.20    0.61     5.14    0.60
λ = 25      5.29    0.59     5.22    0.61     5.32    0.61     5.14    0.63

Note: The columns show the averaged 10-fold training and test errors for 10 data sets.

In the formal training, we used the same MCMC setting as in the cross-validation training. For each data set, the classifier was run 10 times independently. Each run consists of 6000 iterations, and the last 5000 iterations were used for inference. The overall acceptance rates of the local updating moves and exchange operations are about 0.55 and 0.75, respectively. The training error (reported in the "Training" column of Table 2) was computed as the average of the 500 training errors (50 data sets × 10 runs), and the test error (reported in the "Test" column of Table 2) was computed as the average of the 500 test errors. Their standard deviations (the figures in parentheses in Table 2) were computed by the formula

$$\mathrm{SD}_e = \sqrt{\sum_{i=1}^{50} \widehat{\mathrm{var}}(\bar{e}_i)/50},$$

where e indicates the error type (training error or test error), ē_i denotes the averaged error (over 10 runs) of the ith data set, and $\widehat{\mathrm{var}}(\bar{e}_i) = \frac{1}{(10)(9)}\sum_{j=1}^{10}(e_{ij} - \bar{e}_i)^2$. The same computational procedure for the training and test errors and their standard deviations was also applied to the old BNN classifier and the other examples in this article.

For comparison, we also applied the old BNN classifier and SVMs to the same data sets. The software for the old BNN classifier was developed by R. M. Neal and was downloaded from http://www.cs.toronto.edu/~radford (as suggested by a referee); refer to Neal (1996) for more details of the software. For each data set, it was also run 10 times independently. Each run consists of 30 "iterations" and costs about the same CPU time on the same computer as that of the new BNN classifier. Note that "Neal's iteration" is different from ours; it contains many substeps: Gibbs sampling, heat bath sampling, and hybrid Monte Carlo. Fortunately, with 30 iterations, the results are reliable enough for comparison. The numbers of hidden units we tried for the old classifier were 2, 3, 4, and 5. The results are summarized in Table 2. The SVM software we used is LIBSVM, which was developed by Chang and Lin (2001) and was downloaded from http://www.csie.ntu.edu.tw/~cjlin.
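The standard deviations reported in Table 2 follow the SD_e formula given above; a sketch with a made-up 50 × 10 error matrix e[i][j] (data set i, run j):

```python
import math
import random

def sd_e(e):
    """SD_e = sqrt( sum_i var_hat(e_bar_i) / n_sets ), where
    var_hat(e_bar_i) = (1/(m*(m-1))) * sum_j (e_ij - e_bar_i)^2
    for m runs per data set."""
    n_sets = len(e)
    total = 0.0
    for row in e:
        m = len(row)
        bar = sum(row) / m
        total += sum((x - bar) ** 2 for x in row) / (m * (m - 1))
    return math.sqrt(total / n_sets)

rng = random.Random(42)
e = [[60.0 + rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(50)]
sd = sd_e(e)
```

Note that this quantity measures only the run-to-run Monte Carlo variability of the averaged errors, not the spread across data sets.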
A linear kernel was first tried, with different values of C ranging from 0.001 to 10. The computational results are also summarized in Table 2. Since SVMs do a deterministic computation and are very fast, they were run only once for each data set, and the CPU time they cost was not recorded; the standard deviations of the averaged training and test errors can be regarded as 0. The other types of kernels, such as the RBF and polynomial kernels, were also tried for this example. Their generalization performances were all worse than that of the linear kernel. This is reasonable, as the true boundary of this example is a linear function. Table 2 shows that the new BNN classifier is superior to the old BNN classifier in test errors but inferior in training errors. This is consistent with our conjecture: the old BNN classifier tends to overfit the data and perform less well in generalization, since it puts a more severe penalty on the misclassified patterns. Table 2 also shows that with the correct kernel and a suitable value of C, SVM performs competitively with the new BNN classifier in generalization for this example. But for more complicated problems, our experiments show that SVM is usually inferior to the new
Table 2: Comparison of the New BNN Classifier, the Old BNN Classifier, and SVM for the First Simulated Example.

              Parameter Setting   CPU(s)   Training        Test
New BNN       λ = 10              47.1     5.426(0.026)    62.032(0.097)
              λ = 15              47.3     5.416(0.026)    61.708(0.100)
              λ = 20              47.5     5.388(0.023)    61.726(0.096)
              λ = 25              47.6     5.364(0.024)    61.706(0.099)
Old BNN       M = 2               39.9     5.206(0.026)    63.52(0.11)
              M = 3               47.6     5.192(0.029)    63.43(0.12)
              M = 4               55.3     5.130(0.030)    63.62(0.11)
              M = 5               62.6     5.166(0.031)    63.63(0.12)
SVM (linear)  C = 0.001           —        7.24            72.32
              C = 0.01            —        5.36            61.94
              C = 0.10            —        5.60            61.64
              C = 1.0             —        5.40            63.40
              C = 10.0            —        5.18            64.16

Notes: The "CPU(s)" column shows the total CPU time (in seconds) cost by 10 independent runs for each data set. The "Training" and "Test" columns show the averaged training and test errors and their respective standard deviations (in parentheses).
BNN classifier, as shown below. Figures 5a, 5b, and 5c show the decision regions of the three classifiers for one of the 50 data sets. They were generated by the three classifiers with λ = 15, M = 3, and C = 0.1, respectively. The plots show again that for this example, the old BNN classifier has a tendency to overfit compared to the other two classifiers, and the new BNN classifier has a performance comparable to SVM, although the situation is not favorable to it: the true classification boundary of this example is linear, and the new BNN classifier must approximate the linear boundary. As a by-product, we show that the new BNN classifier has the ability to select relevant variables from the pool of explanatory variables. To show this, we retrained on a training set randomly selected from the above 50, but with one additional input variable x3, which consists of 100 i.i.d. samples drawn from N(0, 1) independent of x1 and x2. In training, we used the same MCMC parameter setting and the same network parameter setting as described above, except for τ = 5, λ = 0.5, and a total of 51,000 iterations. Here we set τ = 5 and λ = 0.5 to drive the networks to a very good approximation to the targets but with a very small network size, that is, a very small number of effective connections. This setting is based on our belief that a neural network with a very good approximation to the targets but with a minimum structure should include the most relevant variables, whether linearly or nonlinearly related, as the input variables. Note that the resulting networks may have a poor generalization performance due to the extremes (minimum structure and maximum approximation) we pursued in the
Figure 5: The decision regions learned by the three classifiers for two simulated data sets. The training data are shown as squares: the filled squares for class 1 and the open squares for class 2. (a-c) The decision regions of the three classifiers (new BNN, old BNN, SVM) for one data set of the first simulated example; the dotted lines show the true classification boundaries of the data set. (d-f) The decision regions of the three classifiers for one data set of the second simulated example; the sine curves show the true classification boundaries of the data set.
tra n ng process Under the above sett ng we have meff D 3 36 on average and a c ass cat on accuracy of 96% for the tra n ng set F gure 6a shows the poster or probab t es of each of the connect ons of the fu mode where the names of the bars show the correspond ng nput or h dden un ts For examp e start ng from the eft the rst second and th rd x1 s correspond to the connect ons from the nput var ab e x1 to the rst h dden un t the second h dden un t and the output un t respect ve y The x0 s and x2 s can be exp a ned n the same way The h1 and h2 nd cate the connect ons from the two h dden un ts to the output un t The h gh poster or probab t es of the connect ons from x1 and x2 to the h dden un ts nd cate the non near re evance of x1 and x2 to the target var ab e F gure 6b shows the averaged frequenc es of the connect ons ex t ng from each of the nput var ab es The frequenc es were computed based on the networks samp ed from the pos-
A Bayesian Neural Network Classifier
1981
Figure 6: (a) The posterior probabilities of each of the connections of the full BNN model, which has three input units, two hidden units, and one output unit. (b) The averaged frequencies of the connections exiting from each of the input variables.
terior distribution. For example, the averaged frequency of x1 is equal to the sum of the posterior probabilities of all connections out of it. The frequencies reflect the relative importance of each of the input variables for explaining the target. As we expected, Figure 6b shows that the role of x3 in the BNN model is comparable only to that of the constant term, and thus it is irrelevant to the target variable. Example 2 also contains 50 pairs of training and test sets, which are all mutually independent. Each training set consists of 100 i.i.d. samples, and each test set consists of 1000 i.i.d. samples. They were generated as follows: (1) Draw a point (x1, x2) ~ unif[-1.5, 1.5]^2. (2) Compute d = x2 - sin(π x1). (3) If d > 0, that is, the point (x1, x2) is above the sine curve, with probability exp(-5|d|) the point is assigned to class 1, and with the remaining probability it is assigned to class 2. Otherwise, with probability exp(-5|d|) the point is assigned to class 2, and with the remaining probability it is assigned to class 1. The data generation process implies that the optimal classification
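The three-step generation process just described can be sketched as follows; this is a minimal illustration, not the author's original code, and the seed is an arbitrary choice:

```python
import math
import random

def generate_example2(n, rng):
    """Generate n labeled points by the sine-boundary process of example 2."""
    data = []
    for _ in range(n):
        # Step 1: draw (x1, x2) uniformly from [-1.5, 1.5]^2.
        x1 = rng.uniform(-1.5, 1.5)
        x2 = rng.uniform(-1.5, 1.5)
        # Step 2: signed vertical distance from the sine curve.
        d = x2 - math.sin(math.pi * x1)
        # Step 3: probabilistic labeling; exp(-5|d|) controls the assignment.
        p = math.exp(-5.0 * abs(d))
        if d > 0:
            label = 1 if rng.random() < p else 2
        else:
            label = 2 if rng.random() < p else 1
        data.append((x1, x2, label))
    return data

sample = generate_example2(100, random.Random(0))
```

Note that points far from the curve are assigned the opposing class with high probability, which is what makes the Bayes-optimal boundary exactly the sine curve.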
Table 3: Comparison of the New BNN Classifier, the Old BNN Classifier, and SVM for the Second Simulated Example.

            Parameter Setting   CPU(s)   Training        Test
New BNN     λ = 1.0e-5          45.0     10.496(0.019)   124.342(0.075)
            λ = 1.0e-4          46.1     10.392(0.021)   124.846(0.096)
            λ = 1.0e-3          47.3     10.216(0.022)   124.902(0.088)
            λ = 1.0e-2          49.1      9.952(0.022)   125.068(0.152)
Old BNN     M = 2               40.1      9.822(0.068)   127.462(0.651)
            M = 3               47.4      9.190(0.063)   127.198(0.454)
            M = 4               54.9      9.066(0.054)   126.998(0.459)
            M = 5               62.5      9.022(0.078)   128.018(0.486)
SVM (RBF)   C = 0.5             --        9.72           153.72
            C = 1.0             --        8.78           147.02
            C = 1.5             --        8.32           144.82
            C = 2.0             --        7.62           144.00
            C = 3.0             --        7.16           144.96

Notes: The "CPU(s)" column shows the total CPU time (in seconds) cost by 10 independent runs for each data set. The "Training" and "Test" columns show the averaged training and test errors and their respective standard deviations (in parentheses).
boundary is the sine curve y = sin(πz), z ∈ [-1.5, 1.5]. Figure 5d shows a training set generated by the process together with the sine curve. The new BNN classifier was applied to this example with M = 3, τ = 15, and a variety of λ's ranging from 1.0e-5 to 1.0e-2. Here we set a relatively large value for τ, since we expect the learned classification boundary to be very curved. In training, a temperature ladder of 10 levels was used with t1 = 5 and the Metropolis step size η = 0.25. As for example 1, each run consists of 6000 iterations. The first 1000 iterations were discarded for the burn-in process, and the following 5000 iterations were used for inference. The overall acceptance rates of the local updating moves and the exchange moves were about 0.25 and 0.6, respectively. The averaged training and test errors are reported in Table 3. For comparison, we also applied the old BNN classifier and SVM to this example. In SVM, an RBF kernel was used with σ set to the median of the Euclidean distances from each sample of class 1 to the nearest sample of class 2, as suggested by Brown et al. (2000). A variety of C values were tried, ranging from 0.5 to 3. Table 3 shows that the new BNN classifier significantly outperforms the old BNN classifier and SVM in general for this example. Figure 5 compares the decision regions of the three classifiers for one training set of this example. They were generated by the three classifiers in one run with λ = 1.0e-5, M = 4, and C = 2.0, respectively. It shows that SVM cannot approximate well the true classification boundary, the old classifier overfits to the data, and the new classifier can approximate the
true classification boundary rather well, even though the true boundary is very curved.

5.2 Breast Cancer Classification. The Wisconsin Breast Cancer database (WBCD), collected by Dr. W. H. Wolberg at University of Wisconsin Hospitals, has 683 samples, which consist of visually assessed nuclear features of fine needle aspirates (FNAs) taken from patients' breasts. Each sample was assigned a 9-dimensional vector of diagnostic characteristics: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. Each component is in the interval 1 to 10, with 1 corresponding to the normal state and 10 to the most abnormal state. The samples were classified into two classes, benign and malignant. The classification was confirmed by either biopsy or further examinations. (For a detailed description of the data set, see Mangasarian & Wolberg, 1990, or ftp://ftp.cs.wisc.edu/mathprog/cpo-dataset/machine-learn/cancer1.) The data set has been used as a benchmark example for machine learning. Here we apply the new BNN classifier to this data set and compare its performance with that of the old BNN classifier and SVM. Fifty pairs of training and test sets were generated from the WBCD data set. Each training set consists of 342 samples uniformly drawn from the data set, and the corresponding test set consists of the remaining samples. For the new BNN classifier, we set M = 3, τ = 0.5, λ = 2.5, 5, 10, 20, 40, η = 5, and employed a temperature ladder of 10 levels with t1 = 2.5. The classifier was run 10 times independently for each pair of training and test sets. Each run consists of 6000 iterations, where the first 1000 iterations were discarded for the burn-in process, and the following iterations were used for inference. For the old BNN classifier, we tried four network structures with 2, 3, 4, and 5 hidden units, respectively.
It was also run 10 times independently for each pair of training and test sets. For SVM, we tried both the linear and RBF kernels and a wide range of C. Again, the parameter σ of the RBF kernel was set to the median of the Euclidean distances from each sample of class 1 to the nearest sample of class 2. The computational results are summarized in Table 4, which shows that the new BNN classifier significantly outperforms the other three classifiers in general for this example. Compared to the new BNN classifier, the old BNN classifier has a tendency to overfit to the data. Figure 7a shows the posterior probability of each of the connections of a new BNN classifier with M = 5, τ = 5, and λ = 0.1 and a training set consisting of the 683 samples. The averaged network size is meff = 5.1, and the classification accuracy is 97.7%. Figure 7a shows that the sampled network structures are dominated by the shortcut connections. This is consistent with the finding of Bennett and Mangasarian (1992) that the data can be classified by a hyperplane with an accuracy rate of 97.7%. Figure 7b shows the averaged frequencies of each of the nine input variables. The constant term is also
Table 4: Comparison of the Four Classifiers: New BNN Classifier, Old BNN Classifier, SVMs with Linear Kernels, and SVMs with RBF Kernels, Breast Cancer Example.

                      Parameter Setting   CPU(s)   Training        Test
New BNN               λ = 2.5             196.1    7.994(0.023)     9.134(0.032)
                      λ = 5.0             201.5    7.846(0.026)     8.914(0.032)
                      λ = 10              203.0    7.746(0.025)     8.874(0.033)
                      λ = 20              203.2    7.776(0.027)     8.966(0.032)
                      λ = 40              203.5    7.728(0.026)     9.054(0.030)
Old BNN               M = 2               164.5    7.670(0.064)    10.980(0.059)
                      M = 3               195.4    7.256(0.087)    11.012(0.063)
                      M = 4               229.2    6.994(0.101)    10.944(0.058)
                      M = 5               262.8    6.972(0.096)    10.900(0.056)
SVM (linear kernel)   C = 0.01            --       9.96            10.26
                      C = 1/64            --       9.92            10.10
                      C = 1/16            --       9.76            10.30
                      C = 1/4             --       8.96            10.68
                      C = 1.0             --       8.50            11.04
SVM (RBF kernel)      C = 0.1             --      10.46            10.62
                      C = 0.5             --       9.40            10.30
                      C = 1.0             --       8.84            10.28
                      C = 1.5             --       8.52            10.42
                      C = 2.5             --       7.94            10.84

Notes: The "CPU(s)" column shows the total CPU time (in seconds) cost by 10 independent runs for each data set. The "Training" and "Test" columns show the averaged training and test errors and their respective standard deviations (in parentheses).
included in the plot for reference. It indicates that the four most relevant diagnostic characteristics are clump thickness, uniformity of cell size, bare nuclei, and normal nucleoli. With different settings for τ and λ, the results are similar.

5.3 Protein Structure Class Prediction. Finally, we consider one example with multiple classes. The majority of proteins of known structure can be classified into one of the following four structural classes: α, β, α/β, and α+β, based on the percentages of their secondary structure components. The prediction of protein structure class has been one of the important components of the overall protein structure prediction problem (Levitt & Chothia, 1976). Early studies on the subject reveal that the structure class of a protein might basically depend on its amino acid composition. The underlying physical mechanism of the correlation was explored by Bahar, Atilgan, Jernigan, and Erman (1997) and Chou (1999). A comprehensive review of the studies can be found in Chou (2001). In this article, we restudied the problem with BNN classifiers. The data set we used was downloaded from http://www.nersc.gov/~cding/protein.
Figure 7: (a) The posterior probability of each of the connections of a full BNN model with M = 5, τ = 5, and λ = 0.1. The training set consists of all 683 samples. (b) The averaged frequencies of the connections out of each of the input variables. The nine input variables x1 to x9 are clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses.
The training set consists of 313 proteins, where any two proteins share no more than 35% sequence identity for aligned subsequences longer than 80 residues. The test set consists of 385 proteins, where all protein sequences have less than 40% identity with each other and less than 35% identity with the proteins of the training set. (For more details of the data set, see Ding & Dubchak, 2001.) In this example, the BNN classifiers have four output units and 21 input units, where the first 20 input units correspond to the compositions of the 20 amino acids and the last one corresponds to the length of the protein sequence. For the new BNN classifier, we set M = 10, τ = 2, λ = 30, 35, 40, 45, η = 3.0, and employed a temperature ladder of 20 levels with t1 = 5. For each λ, the classifier was run 10 times independently. Each run consists of 200,000 iterations, and the first 2000 iterations were discarded for the burn-in process. The overall acceptance rates of the local updating and exchange operations were around
Table 5: Comparison of New BNN Classifier, Old BNN Classifier, and SVM, Protein Structure Class Prediction Example.

            Parameter Setting   CPU(m)   Training     Test         Frequency   Best
New BNN     λ = 30              164      19.3(0.63)   80.2(0.65)   9           76
            λ = 35              165      19.2(0.59)   80.8(0.65)   8           77
            λ = 40              165      18.9(0.53)   79.2(0.71)   9           75
            λ = 45              166      16.5(0.81)   80.6(0.67)   8           77
Old BNN     M = 8               137      18.8(4.34)   86.9(1.11)   1           81
            M = 10              162       3.9(1.92)   83.5(0.73)   3           80
            M = 12              181       3.6(2.10)   83.4(0.54)   3           81
            M = 15              209       9.5(3.60)   83.2(0.87)   5           79
SVM (RBF)   C = 1.0             --       31           83           --          --
            C = 1.5             --       20           85           --          --
            C = 2.0             --       15           86           --          --
            C = 2.5             --        7           88           --          --

Notes: The "CPU" column shows the CPU time cost by each run. The "Training" and "Test" columns show the averaged training and test errors over the 10 runs and their respective standard deviations (in parentheses). The "Frequency" column shows the number of runs (out of 10) with test error less than 83, the minimum test error of SVM. The "Best" column shows the minimum test error of the 10 runs.
0.3 and 0.25, respectively. For the old BNN classifier, we tried four different network structures with M = 8, 10, 12, and 15. It was also run 10 times independently for each structure. Each run consists of 5000 iterations, and the first 1000 iterations were discarded for the burn-in process. The computational results are summarized in Table 5. LIBSVM implements multiclass classification with the "one-against-one" approach (Knerr, Personnaz, & Dreyfus, 1990), in which k(k-1)/2 classifiers are constructed, each trained on data from two different classes. The classification makes use of a voting strategy: each binary classification is considered a vote, and in the end the point is assigned to the class with the maximum number of votes. This approach has been shown to perform better than the "one-against-all" approach (Weston & Watkins, 1998; Platt, Cristianini, & Shawe-Taylor, 2000). (For more details on the implementation, see http://www.csie.ntu.edu.tw/~cjlin.) For this example, we tried the RBF kernel and a variety of C values ranging from 0.1 to 100. Table 5 shows the best four test errors and the associated C values and training errors. Other types of kernels were also tried, but their performances were not as good as that of the RBF kernel. This example shows that both BNN classifiers outperform SVMs in generalization performance, and the new BNN classifier outperforms the old one in both the minimum test error and the frequency of beating SVMs.
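The one-against-one voting rule can be sketched as follows. This is a schematic with dummy binary classifiers, not LIBSVM's actual implementation; the toy "distance-to-class" classifiers are hypothetical stand-ins:

```python
from collections import Counter
from itertools import combinations

def one_vs_one_predict(x, classes, binary_classifiers):
    """Predict a class by majority vote over all k(k-1)/2 pairwise classifiers.

    binary_classifiers maps each unordered pair (a, b), a < b, to a function
    f(x) that returns the winning class, either a or b.
    """
    votes = Counter()
    for a, b in combinations(sorted(classes), 2):
        winner = binary_classifiers[(a, b)](x)
        votes[winner] += 1
    # The point is assigned to the class with the maximum number of votes.
    return votes.most_common(1)[0][0]

# Toy example with 3 classes: the (a, b) classifier picks whichever of a, b
# is closer to x on the real line (class c "lives" at position c).
classes = [0, 1, 2]
clfs = {(a, b): (lambda x, a=a, b=b: a if abs(x - a) < abs(x - b) else b)
        for a, b in combinations(classes, 2)}
print(one_vs_one_predict(0.9, classes, clfs))  # prints 1
```

With k = 3 classes there are 3 pairwise votes, and ties are rare; LIBSVM resolves ties by its own internal rule, which is not modeled here.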
6 Conclusion

In this article, we introduced a new BNN classifier that differs from the old one in several respects, including the likelihood function, prior specification, and network structure. Under mild conditions, we show that the new BNN classifier will converge to the true classification function. Numerical results show that the new BNN classifier provides a consistent improvement over the old BNN classifier and SVMs in general performance for all examples of this article. The success of the new BNN classifier comes from two factors. First is an appropriate likelihood function, or equivalently, an appropriate penalty function for the misclassified patterns. In this article, we chose the likelihood function from the family {exp(-τ Σ_{i=1}^{N} Σ_{j=1}^{q} [g_j(x_i; θ) - y_{ij}]²) : τ > 0}, with τ being determined by a cross-validation procedure in practice. Our results suggest that we may choose the likelihood function from a broader likelihood family for a BNN model with further improved performance. Second is the use of global information of the data in determining the decision boundary, in contrast to the local determination of the decision boundary of SVMs by the support vectors. In this article, each pattern of the training set is treated equally in the training process to make sure that global information of the data is extracted by the BNN model. Further research on how to extract global information from the data efficiently will be of interest. As a by-product, we also showed numerically that the new BNN classifier can identify the relevant linear or nonlinear variables from the pool of explanatory variables. The program is available from the author by request.
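The likelihood family above amounts to a squared-error penalty on the network outputs, scaled by τ. A minimal sketch of the log likelihood (the outputs and targets below are hypothetical placeholders, not values from the paper):

```python
import math

def log_likelihood(tau, outputs, targets):
    """log of exp(-tau * sum_i sum_j [g_j(x_i) - y_ij]^2), up to normalization.

    outputs[i][j] plays the role of g_j(x_i; theta); targets[i][j] is y_ij.
    """
    sq_err = sum((g - y) ** 2
                 for out_i, tgt_i in zip(outputs, targets)
                 for g, y in zip(out_i, tgt_i))
    return -tau * sq_err

# Larger tau penalizes misclassified patterns more heavily.
outputs = [[0.9, 0.1], [0.2, 0.8]]
targets = [[1.0, 0.0], [0.0, 1.0]]
print(log_likelihood(5.0, outputs, targets))
```

Since τ only rescales the penalty, cross-validating τ trades off fit against the effect of the prior, which is why the paper selects it on held-out data.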
References

Bahar, I., Atilgan, A. R., Jernigan, R. L., & Erman, B. (1997). Understanding the recognition of protein structural classes by amino acid composition. Proteins: Struc., Func., and Genetics, 29, 172-185.
Bennett, K. P., & Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23-34.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Fifth Annual Conference on Computational Learning Theory (pp. 142-152). San Mateo, CA: Morgan Kaufmann.
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16, 199-215.
Brown, M. P. S., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C. W., Furey, T. S., Ares, M., Jr., & Haussler, D. (2000). Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97, 262-267.
Chang, C. C., & Lin, C. J. (2001). Training ν-support vector classifiers: Theory and algorithms. Neural Computation, 13, 2119-2147.
Chou, K. C. (1999). A key driving force in determination of protein structure classes. Biochem. Biophys. Res. Commun., 264, 216-224.
Chou, K. C. (2001). Review: Prediction of protein structural classes and subcellular location. Current Protein and Peptide Science, 1, 171-208.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273-297.
Ding, C. H. Q., & Dubchak, I. (2001). Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349-358.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification (2nd ed.). New York: Wiley.
Fletcher, R. (1987). Practical methods of optimization (2nd ed.). New York: Wiley.
Gelman, A., Roberts, G. O., & Gilks, W. R. (1996). Efficient Metropolis jumping rules. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics, 5. New York: Oxford University Press.
Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. In E. M. Keramigas (Ed.), Computing science and statistics: Proceedings of the 23rd Symposium on the Interface (pp. 153-163). Seattle: Interface Foundation.
Geyer, C. J., & Thompson, E. A. (1995). Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Amer. Statist. Assoc., 90, 909-920.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711-732.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.
Hukushima, K., & Nemoto, K. (1996). Exchange Monte Carlo method and application to spin glass simulations. J. Phys. Soc. Jpn., 65, 1604-1608.
Husmeier, D., Penny, W. D., & Roberts, S. J. (1999). An empirical evaluation of Bayesian sampling with hybrid Monte Carlo for training neural network classifiers. Neural Networks, 12, 677-705.
Kass, R. E., & Vaidyanathan, S. (1992). Approximate Bayes factors and orthogonal parameters, with applications to testing equality of two binomial proportions. J. Royal Statist. Soc. B, 129-144.
Knerr, S., Personnaz, L., & Dreyfus, G. (1990). Single-layer learning revisited: A stepwise procedure for building and training a neural network. In J. Fogelman (Ed.), Neurocomputing: Algorithms, architectures and applications. Berlin: Springer-Verlag.
Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. In J. D. Cowan, D. Touretzky, & J. Alspector (Eds.), Advances in neural information processing systems, 7. Cambridge, MA: MIT Press.
Levitt, M., & Chothia, C. (1976). Structural patterns in globular proteins. Nature, 261, 552-557.
Liang, F., & Wong, W. H. (2000). Evolutionary Monte Carlo: Applications to Cp model sampling and change point problem. Statistica Sinica, 10, 317-342.
MacKay, D. J. C. (1992). A practical Bayesian framework for backprop networks. Neural Computation, 4, 448-472.
Mangasarian, O. L., & Wolberg, W. H. (1990). Cancer diagnosis via linear programming. SIAM News, 23, 1-18.
Markowetz, F., Edler, L., & Vingron, M. (2002). Support vector machines for protein fold class prediction (Tech. Rep.). Berlin: Max-Planck-Institute for Molecular Genetics, Computational Molecular Biology.
Mercer, J. (1909). Functions of positive and negative type, and their connection with the theory of integral equations. Transactions of the London Philosophical Society A, 209, 819-835.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087-1091.
Müller, P. (1993). Alternatives to the Gibbs sampling scheme (Tech. Rep.). Durham, NC: Institute of Statistics and Decision Sciences, Duke University.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.
Perrone, M. P., & Cooper, L. N. (1993). When networks disagree: Ensemble methods for hybrid neural networks. In R. J. Mammone (Ed.), Neural networks for speech and image processing. London: Chapman-Hall.
Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 547-553). Cambridge, MA: MIT Press.
Raftery, A. E., Madigan, D., & Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. J. Amer. Statist. Assoc., 92, 179-191.
Ripley, B. D. (1994). Flexible nonlinear approaches to classification. In J. H. Friedman & H. Wechsler (Eds.), From statistics to neural networks: Theory and pattern recognition applications. Berlin: Springer-Verlag.
Robert, C. P., & Casella, G. (1999). Monte Carlo statistical methods. New York: Springer-Verlag.
Rosen, B. E. (1996). Ensemble learning using decorrelated neural networks. Connection Science, 3, 373-384.
Rumelhart, D., Hinton, G., & Williams, R. (1986). Learning internal representations by error propagation. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing (pp. 318-362). Cambridge, MA: MIT Press.
Smith, A. F. M., & Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). J. Royal Statist. Soc. B, 55, 3-23.
Sollich, P. (2002). Bayesian methods for support vector machines: Evidence and predictive class probabilities. Machine Learning, 46, 21-52.
Thodberg, H. H. (1996). A review of Bayesian neural networks with an application to near infrared spectroscopy. IEEE Transactions on Neural Networks, 7, 56-72.
Weigend, A. S., Huberman, B. A., & Rumelhart, D. E. (1990). Predicting the future: A connectionist approach. Int. J. Neural Syst., 1, 193-209.
Weston, J., & Watkins, C. (1998). Multi-class support vector machines (Tech. Rep. CSD-TR-98-04). Egham, Surrey: Royal Holloway, University of London.
White, H. (1992). Artificial neural networks: Approximation and learning theory. Cambridge, MA: Blackwell.

Received July 22, 2002; accepted January 28, 2003.
LETTER
Communicated by Erkki Oja
Variational Bayesian Learning of ICA with Missing Data Kwokleung Chan [email protected] Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037, U.S.A.
Te-Won Lee [email protected] Institute for Neural Computation, University of California at San Diego, La Jolla, CA 92093, U.S.A.
Terrence J. Sejnowski [email protected] Computational Neurobiology Laboratory, Salk Institute, La Jolla, CA 92037, U.S.A., and Department of Biology, University of California at San Diego, La Jolla, CA 92093, U.S.A.
Missing data are common in real-world data sets and are a problem for many estimation techniques. We have developed a variational Bayesian method to perform independent component analysis (ICA) on high-dimensional data containing missing entries. Missing data are handled naturally in the Bayesian framework by integrating over the generative density model. Modeling the distributions of the independent sources with mixtures of gaussians allows sources to be estimated with different kurtosis and skewness. Unlike the maximum likelihood approach, the variational Bayesian method automatically determines the dimensionality of the data and yields an accurate density model for the observed data without overfitting problems. The technique is also extended to the clusters of ICA and supervised classification framework.

1 Introduction

Data density estimation is an important step in many machine learning problems. Often we are faced with data containing incomplete entries. The data may be missing due to measurement or recording failure. Another frequent cause is difficulty in collecting complete data. For example, it could be expensive and time-consuming to perform some biomedical tests. Data scarcity is not uncommon, and it would be very undesirable to discard those data points with missing entries when we already have a small data set. Traditionally, missing data are filled in by mean imputation or regression imputation during preprocessing. This could introduce biases into the data

Neural Computation 15, 1991-2011 (2003) © 2003 Massachusetts Institute of Technology
cloud density and adversely affect subsequent analysis. A more principled way would be to use probability density estimates of the missing entries instead of point estimates. A well-known example of this approach is the use of the expectation-maximization (EM) algorithm in fitting incomplete data with a single gaussian density (Little & Rubin, 1987). Independent component analysis (ICA; Hyvarinen, Karhunen, & Oja, 2001) assumes the observed data x are generated from a linear combination of independent sources s:

  x = A s + \nu, \qquad (1.1)

where A is the mixing matrix, which can be nonsquare. The sources s have a nongaussian density, such as p(s_l) \propto \exp(-|s_l|^q). The noise term \nu can have nonzero mean. ICA tries to locate independent axes within the data cloud and was developed for blind source separation. It has been applied to speech separation and to analyzing fMRI and EEG data (Jung et al., 2001). ICA is also used to model data density, describing data as linear mixtures of independent features and finding projections that may uncover interesting structure in the data. Maximum likelihood learning of ICA with incomplete data has been studied by Welling and Weber (1999) in the limited case of a square mixing matrix and predefined source densities. Many real-world data sets have intrinsic dimensionality smaller than that of the observed data. With missing data, principal component analysis cannot be used to perform dimension reduction as preprocessing for ICA. Instead, the variational Bayesian method applied to ICA can handle small data sets with high observed dimension (Chan, Lee, & Sejnowski, 2002; Choudrey & Roberts, 2001; Miskin, 2000). The Bayesian method prevents overfitting and performs automatic dimension reduction. In this article, we extend the variational Bayesian ICA method to problems with missing data. More important, the probability density estimate of the missing entries can be used to fill in the missing values. This allows the density model to be refined and made more accurate.

2 Model and Theory

2.1 ICA Generative Model with Missing Data. Consider a data set of T data points in an N-dimensional space: X = \{x_t \in R^N\}, t \in \{1, \ldots, T\}. Assume a noisy ICA generative model for the data,

  P(x_t \mid \theta) = \int \mathcal{N}(x_t \mid A s_t + \nu, \Psi)\, P(s_t \mid \theta_s)\, ds_t, \qquad (2.1)
where A is the mixing matrix, and \nu and [\Psi]^{-1} are the observation mean and diagonal noise variance, respectively. The hidden source s_t is assumed
to have L dimensions. Similar to the independent factor analysis of Attias (1999), each component of s_t will be modeled by a mixture of K gaussians to allow for source densities of various kurtosis and skewness,

  P(s_t \mid \theta_s) = \prod_{l=1}^{L} \left( \sum_{k_l=1}^{K} \pi_{l k_l}\, \mathcal{N}(s_t(l) \mid \phi_{l k_l}, \beta_{l k_l}) \right). \qquad (2.2)
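Sampling from the generative model of equations 2.1 and 2.2 can be sketched as follows; the dimensions, mixture weights, and parameter values below are arbitrary illustrative choices, not values from the paper:

```python
import random

def sample_source(pi, phi, beta, rng):
    """Draw one source component from a K-gaussian mixture.
    pi: mixture weights; phi: means; beta: inverse variances (precisions)."""
    k = rng.choices(range(len(pi)), weights=pi)[0]
    return rng.gauss(phi[k], (1.0 / beta[k]) ** 0.5)

def sample_x(A, nu, psi, pi, phi, beta, rng):
    """Draw x = A s + nu + noise, with diagonal precision psi (equation 2.1)."""
    L = len(A[0])
    s = [sample_source(pi[l], phi[l], beta[l], rng) for l in range(L)]
    return [sum(A[n][l] * s[l] for l in range(L)) + nu[n]
            + rng.gauss(0.0, (1.0 / psi[n]) ** 0.5)
            for n in range(len(A))]

rng = random.Random(0)
A = [[1.0, 0.5], [0.3, 1.0], [0.2, 0.2]]   # N = 3 observed dims, L = 2 sources
nu = [0.0, 0.0, 0.0]
psi = [100.0, 100.0, 100.0]                # high precision = low observation noise
pi = [[0.5, 0.5]] * 2                      # K = 2 gaussians per source
phi = [[-1.0, 1.0]] * 2
beta = [[4.0, 4.0]] * 2
x = sample_x(A, nu, psi, pi, phi, beta, rng)
```

Note that the precisions (inverse variances) are converted to standard deviations before sampling, matching the paper's convention that the second argument of the normal is an inverse covariance.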
Split each data point into a missing part and an observed part: x_t^\top = (x_t^{o\top}, x_t^{m\top}). In this article, we consider only the random missing case (Ghahramani & Jordan, 1994); that is, the probability that the entries x_t^m are missing is independent of the value of x_t^m but could depend on the value of x_t^o. The likelihood of the data set is then defined to be

  \mathcal{L}(\theta; X) = \prod_t P(x_t^o \mid \theta), \qquad (2.3)
where

  P(x_t^o \mid \theta) = \int P(x_t \mid \theta)\, dx_t^m
  = \int \left[ \int \mathcal{N}(x_t \mid A s_t + \nu, \Psi)\, dx_t^m \right] P(s_t \mid \theta_s)\, ds_t
  = \int \mathcal{N}(x_t^o \mid [A s_t + \nu]_t^o, [\Psi]_t^o)\, P(s_t \mid \theta_s)\, ds_t. \qquad (2.4)
Here we have introduced the notation [\cdot]_t^o, which means taking only the observed dimensions (corresponding to the tth data point) of whatever is inside the square brackets. Since equation 2.4 is similar to equation 2.1, the variational Bayesian ICA (Chan et al., 2002; Choudrey & Roberts, 2001; Miskin, 2000) can be extended naturally to handle missing data, provided care is taken to discount missing entries in the learning rules.

2.2 Variational Bayesian Method. In a full Bayesian treatment, the posterior distribution of the parameters \theta is obtained by

  P(\theta \mid X) = \frac{P(X \mid \theta)\, P(\theta)}{P(X)} = \frac{\prod_t P(x_t^o \mid \theta)\, P(\theta)}{P(X)}, \qquad (2.5)

where P(X) is the marginal likelihood, given as

  P(X) = \int \prod_t P(x_t^o \mid \theta)\, P(\theta)\, d\theta. \qquad (2.6)
The ICA model for P(X) is defined with the following priors on the parameters P(\theta):

  P(A_{nl}) = \mathcal{N}(A_{nl} \mid 0, \alpha_l), \quad P(\alpha_l) = \mathcal{G}(\alpha_l \mid a_o(\alpha_l), b_o(\alpha_l)),
  P(\pi_l) = \mathcal{D}(\pi_l \mid d_o(\pi_l)), \quad P(\phi_{l k_l}) = \mathcal{N}(\phi_{l k_l} \mid \mu_o(\phi_{l k_l}), \Lambda_o(\phi_{l k_l})), \qquad (2.7)
  P(\beta_{l k_l}) = \mathcal{G}(\beta_{l k_l} \mid a_o(\beta_{l k_l}), b_o(\beta_{l k_l})),
  P(\nu_n) = \mathcal{N}(\nu_n \mid \mu_o(\nu_n), \Lambda_o(\nu_n)), \quad P(\Psi_n) = \mathcal{G}(\Psi_n \mid a_o(\Psi_n), b_o(\Psi_n)), \qquad (2.8)

where \mathcal{N}(\cdot), \mathcal{G}(\cdot), and \mathcal{D}(\cdot) are the normal, gamma, and Dirichlet distributions, respectively:

  \mathcal{N}(x \mid \mu, \Lambda) = \sqrt{\frac{|\Lambda|}{(2\pi)^N}}\; e^{-\frac{1}{2}(x - \mu)^\top \Lambda (x - \mu)}; \qquad (2.9)
  \mathcal{G}(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}; \qquad (2.10)
  \mathcal{D}(\pi \mid d) = \frac{\Gamma(\sum_k d_k)}{\prod_k \Gamma(d_k)}\, \pi_1^{d_1 - 1} \times \cdots \times \pi_K^{d_K - 1}. \qquad (2.11)
Here a_o(\cdot), b_o(\cdot), d_o(\cdot), \mu_o(\cdot), and \Lambda_o(\cdot) are prechosen hyperparameters for the priors. Notice that \Lambda in the normal distribution is an inverse covariance parameter. Under the variational Bayesian treatment, instead of performing the integration in equation 2.6 to solve for P(\theta \mid X) directly, we approximate it by Q(\theta) and opt to minimize the Kullback-Leibler distance between them (Mackay, 1995; Jordan, Ghahramani, Jaakkola, & Saul, 1999):

  -\mathrm{KL}(Q(\theta) \,\|\, P(\theta \mid X)) = \int Q(\theta) \log \frac{P(\theta \mid X)}{Q(\theta)}\, d\theta
  = \int Q(\theta) \left[ \sum_t \log P(x_t^o \mid \theta) + \log \frac{P(\theta)}{Q(\theta)} \right] d\theta - \log P(X). \qquad (2.12)
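The argument following equation 2.12 rests on the nonnegativity of the KL divergence. For two univariate gaussians this can be checked numerically with the standard closed-form divergence; this is a toy verification, not part of the learning rules:

```python
import math
import random

def kl_gauss(mu1, var1, mu2, var2):
    """Closed-form KL(N(mu1, var1) || N(mu2, var2)) for univariate gaussians."""
    return (math.log(math.sqrt(var2 / var1))
            + (var1 + (mu1 - mu2) ** 2) / (2.0 * var2) - 0.5)

rng = random.Random(0)
for _ in range(1000):
    mu1, mu2 = rng.uniform(-5, 5), rng.uniform(-5, 5)
    var1, var2 = rng.uniform(0.1, 5), rng.uniform(0.1, 5)
    # KL is always >= 0, and zero only when the two distributions coincide.
    assert kl_gauss(mu1, var1, mu2, var2) >= -1e-12
```

Because KL(Q || P) >= 0 for any Q, dropping the KL term from equation 2.12 can only decrease the right-hand side, which is exactly what yields the lower bound on log P(X).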
Since -\mathrm{KL}(Q(\theta) \,\|\, P(\theta \mid X)) \le 0, we get a lower bound for the log marginal likelihood,

  \log P(X) \ge \int Q(\theta) \sum_t \log P(x_t^o \mid \theta)\, d\theta + \int Q(\theta) \log \frac{P(\theta)}{Q(\theta)}\, d\theta, \qquad (2.13)

which can also be obtained by applying Jensen's inequality to equation 2.6. Q(\theta) is then solved by functional maximization of the lower bound. A sep-
arable approximate posterior Q(\theta) will be assumed:

  Q(\theta) = Q(\nu)\, Q(\Psi) \times Q(A)\, Q(\alpha) \times \prod_l \left[ Q(\pi_l) \prod_{k_l} Q(\phi_{l k_l})\, Q(\beta_{l k_l}) \right]. \qquad (2.14)

The second term in equation 2.13, which is the negative Kullback-Leibler divergence between the approximate posterior Q(\theta) and the prior P(\theta), is then expanded as

  \int Q(\theta) \log \frac{P(\theta)}{Q(\theta)}\, d\theta
  = \sum_l \int Q(\pi_l) \log \frac{P(\pi_l)}{Q(\pi_l)}\, d\pi_l
  + \sum_{l, k_l} \int Q(\phi_{l k_l}) \log \frac{P(\phi_{l k_l})}{Q(\phi_{l k_l})}\, d\phi_{l k_l}
  + \sum_{l, k_l} \int Q(\beta_{l k_l}) \log \frac{P(\beta_{l k_l})}{Q(\beta_{l k_l})}\, d\beta_{l k_l}
  + \iint Q(A)\, Q(\alpha) \log \frac{P(A \mid \alpha)}{Q(A)}\, dA\, d\alpha
  + \int Q(\alpha) \log \frac{P(\alpha)}{Q(\alpha)}\, d\alpha
  + \int Q(\nu) \log \frac{P(\nu)}{Q(\nu)}\, d\nu
  + \int Q(\Psi) \log \frac{P(\Psi)}{Q(\Psi)}\, d\Psi. \qquad (2.15)
2.3 Special Treatment for Missing Data. Thus far, the analysis follows almost exactly that of the variational Bayesian ICA on complete data, except that P(x_t \mid \theta) is replaced by P(x_t^o \mid \theta) in equation 2.6, and consequently the missing entries are discounted in the learning rules. However, it would be useful to obtain Q(x_t^m \mid x_t^o), that is, the approximate distribution on the missing entries, which is given by

  Q(x_t^m \mid x_t^o) = \iint Q(\theta)\, \mathcal{N}(x_t^m \mid [A s_t + \nu]_t^m, [\Psi]_t^m)\, Q(s_t)\, ds_t\, d\theta. \qquad (2.16)
As noted by Welling and Weber (1999), the elements of s_t given x_t^o are dependent. More important, under the ICA model, Q(s_t) is unlikely to be a single gaussian. This is evident from Figure 1, which shows the probability density functions of the data x and the hidden variable s. The inserts show the sample data in the two spaces. Here the hidden sources have density P(s_l) \propto \exp(-|s_l|^{0.7}). They are mixed noiselessly to give P(x) in the upper graph. The cut in the upper graph represents P(x_1 \mid x_2 = -0.5), which transforms into a highly correlated and nongaussian P(s \mid x_2 = -0.5). Unless we are interested in only the first- and second-order statistics of Q(x_t^m \mid x_t^o), we should try to capture as much structure as possible of
K. Chan, T. Lee, and T. Sejnowski
Figure 1: Probability density functions for the data $x$ (top) and hidden sources $s$ (bottom). Inserts show the sample data in the two spaces. The "cuts" show $P(x_1 \mid x_2 = -0.5)$ and $P(s \mid x_2 = -0.5)$.
$P(s_t \mid x_t^o)$ in $Q(s_t)$. In this article, we take a slightly different route from Chan et al. (2002) or Choudrey and Roberts (2001) when performing variational Bayesian learning. First, we break down $P(s_t)$ into a mixture of $K^L$ gaussians in the $L$-dimensional $s$ space:
$$
P(s_t) = \prod_{l=1}^{L} \left( \sum_{k_l} \pi_{lk_l}\, \mathcal{N}(s_t(l) \mid \phi_{lk_l}, \beta_{lk_l}) \right)
= \sum_{k_1} \cdots \sum_{k_L} \left[ \pi_{1k_1} \times \cdots \times \pi_{Lk_L} \times \mathcal{N}(s_t(1) \mid \phi_{1k_1}, \beta_{1k_1}) \times \cdots \times \mathcal{N}(s_t(L) \mid \phi_{Lk_L}, \beta_{Lk_L}) \right]
= \sum_k \pi_k\, \mathcal{N}(s_t \mid \phi_k, \beta_k). \tag{2.17}
$$

Here we have defined $k$ to be a vector index. The "$k$th" gaussian is centered at $\phi_k$, with inverse covariance $\beta_k$, in the source $s$ space:

$$
k = (k_1, \ldots, k_l, \ldots, k_L)^\top, \qquad k_l = 1, \ldots, K
$$
$$
\phi_k = (\phi_{1k_1}, \ldots, \phi_{lk_l}, \ldots, \phi_{Lk_L})^\top
$$
$$
\beta_k = \mathrm{diag}(\beta_{1k_1}, \ldots, \beta_{Lk_L})
$$
$$
\pi_k = \pi_{1k_1} \times \cdots \times \pi_{Lk_L}. \tag{2.18}
$$
The log likelihood for $x_t^o$ is then expanded using Jensen's inequality:

$$
\log P(x_t^o \mid \theta)
= \log \int P(x_t^o \mid s_t, \theta) \sum_k \pi_k\, \mathcal{N}(s_t \mid \phi_k, \beta_k)\, ds_t
= \log \sum_k \pi_k \int P(x_t^o \mid s_t, \theta)\, \mathcal{N}(s_t \mid \phi_k, \beta_k)\, ds_t
\geq \sum_k Q(k_t) \log \int P(x_t^o \mid s_t, \theta)\, \mathcal{N}(s_t \mid \phi_k, \beta_k)\, ds_t
+ \sum_k Q(k_t) \log \frac{\pi_k}{Q(k_t)}. \tag{2.19}
$$
Here, $Q(k_t)$ is shorthand for $Q(k_t = k)$. $k_t$ is a discrete hidden variable, and $Q(k_t = k)$ is the probability that the $t$th data point belongs to the $k$th gaussian. Recognizing that $s_t$ is just a dummy variable, we introduce $Q(s_{kt})$,
Figure 2: A simplified directed graph for the generative model of variational ICA. $x_t$ is the observed variable, $k_t$ and $s_t$ are hidden variables, and the rest are model parameters. The $k_t$ indicates which of the $K^L$ expanded gaussians generated $s_t$.
apply Jensen's inequality again, and get

$$
\log P(x_t^o \mid \theta) \geq \sum_k Q(k_t) \left[ \int Q(s_{kt}) \log P(x_t^o \mid s_{kt}, \theta)\, ds_{kt}
+ \int Q(s_{kt}) \log \frac{\mathcal{N}(s_{kt} \mid \phi_k, \beta_k)}{Q(s_{kt})}\, ds_{kt} \right]
+ \sum_k Q(k_t) \log \frac{\pi_k}{Q(k_t)}. \tag{2.20}
$$
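The two Jensen bounds above have the usual variational property: they hold for any responsibilities $Q(k_t)$ and are tight when $Q(k_t)$ equals the exact component posterior. A toy numerical illustration (numbers are ours; `I` stands in for the per-component evidence integrals of equation 2.19):

```python
# Verify the mixture lower bound of equation 2.19 on toy numbers:
# log sum_k pi_k I_k >= sum_k Q_k log(pi_k I_k / Q_k), with equality at
# the posterior responsibilities. Not the paper's code.
import numpy as np

pi = np.array([0.2, 0.5, 0.3])             # mixture weights pi_k
I = np.array([0.8, 0.1, 0.4])              # per-component evidence integrals
exact = np.log(np.sum(pi * I))             # exact log evidence

def bound(Q):
    # Jensen lower bound for responsibilities Q(k_t)
    return np.sum(Q * (np.log(pi * I) - np.log(Q)))

Q_uniform = np.full(3, 1 / 3)
Q_post = pi * I / np.sum(pi * I)           # optimal responsibilities
```

Maximizing the bound over $Q(k_t)$ therefore recovers the component posterior, which is exactly what the learning rule for $Q(k_t)$ in section 3 does.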
Substituting $\log P(x_t^o \mid \theta)$ back into equation 2.13, the variational Bayesian method can be continued as usual. Figure 2 shows a simplified graphical representation of the generative model of variational ICA: $x_t$ is the observed variable, $k_t$ and $s_t$ are hidden variables, and the rest are model parameters, where $k_t$ indicates which of the $K^L$ expanded gaussians generated $s_t$.

3 Learning Rules

Combining equations 2.13, 2.15, and 2.20, we perform functional maximization on the lower bound of the log marginal likelihood $\log P(X)$ with respect to $Q(\theta)$ (see equation 2.14), $Q(k_t)$, and $Q(s_{kt})$ (see equation 2.20). For example,

$$\log Q(\nu) = \log P(\nu) + \sum_t \int Q(\theta_{\setminus \nu}) \log P(x_t^o \mid \theta)\, d\theta_{\setminus \nu} + \text{const.}, \tag{3.1}$$
where $\theta_{\setminus \nu}$ is the set of parameters excluding $\nu$. This gives

$$
Q(\nu) = \prod_n \mathcal{N}(\nu_n \mid \mu(\nu_n), \Lambda(\nu_n))
$$
$$
\Lambda(\nu_n) = \Lambda_o(\nu_n) + \langle \Psi_n \rangle \sum_t o_{nt}
$$
$$
\mu(\nu_n) = \frac{\Lambda_o(\nu_n)\,\mu_o(\nu_n) + \langle \Psi_n \rangle \sum_t o_{nt} \sum_k Q(k_t) \langle x_{nt} - A_{n\cdot}\, s_{kt} \rangle}{\Lambda(\nu_n)}. \tag{3.2}
$$
Similarly,

$$
Q(\Psi) = \prod_n \mathcal{G}(\Psi_n \mid a(\Psi_n), b(\Psi_n))
$$
$$
a(\Psi_n) = a_o(\Psi_n) + \frac{1}{2} \sum_t o_{nt}
$$
$$
b(\Psi_n) = b_o(\Psi_n) + \frac{1}{2} \sum_t o_{nt} \sum_k Q(k_t) \langle (x_{nt} - A_{n\cdot}\, s_{kt} - \nu_n)^2 \rangle. \tag{3.3}
$$
$$
Q(A) = \prod_n \mathcal{N}(A_{n\cdot} \mid \mu(A_{n\cdot}), \Lambda(A_{n\cdot}))
$$
$$
\Lambda(A_{n\cdot}) = \mathrm{diag}(\langle \alpha_1 \rangle, \ldots, \langle \alpha_L \rangle) + \langle \Psi_n \rangle \sum_t o_{nt} \sum_k Q(k_t) \langle s_{kt} s_{kt}^\top \rangle
$$
$$
\mu(A_{n\cdot}) = \langle \Psi_n \rangle \left( \sum_t o_{nt}\, (x_{nt} - \langle \nu_n \rangle) \sum_k Q(k_t) \langle s_{kt} \rangle^\top \right) \Lambda(A_{n\cdot})^{-1}. \tag{3.4}
$$

$$
Q(\alpha) = \prod_l \mathcal{G}(\alpha_l \mid a(\alpha_l), b(\alpha_l))
$$
$$
a(\alpha_l) = a_o(\alpha_l) + \frac{N}{2}, \qquad
b(\alpha_l) = b_o(\alpha_l) + \frac{1}{2} \sum_n \langle A_{nl}^2 \rangle. \tag{3.5}
$$

$$
Q(\pi_l) = \mathcal{D}(\pi_l \mid d(\pi_l)), \qquad
d(\pi_{lk}) = d_o(\pi_{lk}) + \sum_t \sum_{k_l = k} Q(k_t). \tag{3.6}
$$
$$
Q(\phi_{lk_l}) = \mathcal{N}(\phi_{lk_l} \mid \mu(\phi_{lk_l}), \Lambda(\phi_{lk_l}))
$$
$$
\Lambda(\phi_{lk_l}) = \Lambda_o(\phi_{lk_l}) + \langle \beta_{lk_l} \rangle \sum_t \sum_{k_l = k} Q(k_t)
$$
$$
\mu(\phi_{lk_l}) = \frac{\Lambda_o(\phi_{lk_l})\,\mu_o(\phi_{lk_l}) + \langle \beta_{lk_l} \rangle \sum_t \sum_{k_l = k} Q(k_t) \langle s_{kt}(l) \rangle}{\Lambda(\phi_{lk_l})}. \tag{3.7}
$$

$$
Q(\beta_{lk_l}) = \mathcal{G}(\beta_{lk_l} \mid a(\beta_{lk_l}), b(\beta_{lk_l}))
$$
$$
a(\beta_{lk_l}) = a_o(\beta_{lk_l}) + \frac{1}{2} \sum_t \sum_{k_l = k} Q(k_t)
$$
$$
b(\beta_{lk_l}) = b_o(\beta_{lk_l}) + \frac{1}{2} \sum_t \sum_{k_l = k} Q(k_t) \langle (s_{kt}(l) - \phi_{lk_l})^2 \rangle. \tag{3.8}
$$

$$
Q(s_{kt}) = \mathcal{N}(s_{kt} \mid \mu(s_{kt}), \Lambda(s_{kt}))
$$
$$
\Lambda(s_{kt}) = \mathrm{diag}(\langle \beta_{1k_1} \rangle, \ldots, \langle \beta_{Lk_L} \rangle)
+ \left\langle A^\top \mathrm{diag}(o_{1t}\Psi_1, \ldots, o_{Nt}\Psi_N)\, A \right\rangle
$$
$$
\Lambda(s_{kt})\,\mu(s_{kt}) = \left(\langle \beta_{1k_1}\phi_{1k_1} \rangle, \ldots, \langle \beta_{Lk_L}\phi_{Lk_L} \rangle\right)^\top
+ \left\langle A^\top \mathrm{diag}(o_{1t}\Psi_1, \ldots, o_{Nt}\Psi_N)\,(x_t - \nu) \right\rangle. \tag{3.9}
$$
In the above equations, $\langle \cdot \rangle$ denotes the expectation over the posterior distributions $Q(\cdot)$, $A_{n\cdot}$ is the $n$th row of the mixing matrix $A$, $\sum_{k_l = k}$ means picking out those gaussians such that the $l$th element of their indices $k$ has the value $k$, and $o_t$ is an indicator variable for observed entries in $x_t$:

$$
o_{nt} = \begin{cases} 1, & \text{if } x_{nt} \text{ is observed} \\ 0, & \text{if } x_{nt} \text{ is missing.} \end{cases} \tag{3.10}
$$

For a model of equal noise variance among all the observation dimensions, the summation in the learning rules for $Q(\Psi)$ would be over both $t$ and
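In an implementation, the indicator $o_{nt}$ of equation 3.10 is just a boolean mask, and every sum over $t$ in the learning rules becomes a masked sum. A minimal sketch (array names are ours):

```python
# The indicator o_nt of equation 3.10 as a boolean mask: missing entries
# (NaN here) simply contribute zero to the sums in the learning rules.
import numpy as np

X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan]])      # rows n, columns t; NaN = missing
obs = ~np.isnan(X)                      # o_nt: True if x_nt observed
Xz = np.where(obs, X, 0.0)              # missing entries contribute zero

counts = obs.sum(axis=1)                # sum_t o_nt, one count per dimension n
sums = Xz.sum(axis=1)                   # sum_t o_nt * x_nt
```

The same masked pattern gives the counts in equations 3.2 and 3.3 directly, with no special-casing of incomplete data points.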
$n$. Note that there exists scale and translational degeneracy in the model, as given by equations 2.1 and 2.2. After each update of $Q(\pi_l)$, $Q(\phi_{lk_l})$, and $Q(\beta_{lk_l})$, it is better to rescale $P(s_t(l))$ to have zero mean and unit variance; $Q(s_{kt})$, $Q(A)$, $Q(\alpha)$, $Q(\nu)$, and $Q(\Psi)$ have to be adjusted correspondingly. Finally, $Q(k_t)$ is given by

$$
\log Q(k_t) = \langle \log P(x_t^o \mid s_{kt}, \theta) \rangle + \langle \log \mathcal{N}(s_{kt} \mid \phi_k, \beta_k) \rangle
- \langle \log Q(s_{kt}) \rangle + \langle \log \pi_k \rangle - \log z_t, \tag{3.11}
$$
where $z_t$ is a normalization constant. The lower bound $\mathcal{E}(X; Q(\theta))$ on the log marginal likelihood, computed using equations 2.13, 2.15, and 2.20, can be monitored during learning and used to compare different solutions or models. After some manipulation, $\mathcal{E}(X; Q(\theta))$ can be expressed as

$$
\mathcal{E}(X; Q(\theta)) = \sum_t \log z_t + \int Q(\theta) \log \frac{P(\theta)}{Q(\theta)}\, d\theta. \tag{3.12}
$$
4 Missing Data

4.1 Filling in Missing Entries. Recovering missing values while performing demixing is possible if we have $N > L$. More specifically, if the number of observed dimensions in $x_t$ is greater than $L$, the equation

$$x_t^o = [A]_t^o\, s_t \tag{4.1}$$

is overdetermined in $s_t$ unless $[A]_t^o$ has rank smaller than $L$. In this case, $Q(s_t)$ is likely to be unimodal and peaked, point estimates of $s_t$ are sufficient and reliable, and the learning rules of Chan et al. (2002), with a small modification to account for missing entries, give a reasonable approximation. When $Q(s_t)$ is a single gaussian, the exponential growth in complexity is avoided. However, if the number of observed dimensions in $x_t$ is less than $L$, equation 4.1 is underdetermined in $s_t$, and $Q(s_t)$ has a broad, multimodal structure. This corresponds to overcomplete ICA, where a single-gaussian approximation of $Q(s_t)$ is undesirable and the formalism discussed in this article is needed to capture the higher-order statistics of $Q(s_t)$ and produce a more faithful $Q(x_t^m \mid x_t^o)$. The approximate distribution $Q(x_t^m \mid x_t^o)$ can be obtained by

$$
Q(x_t^m \mid x_t^o) = \sum_k Q(k_t) \int \delta(x_t^m - x_{kt}^m)\, Q(x_{kt}^m \mid x_t^o, k)\, dx_{kt}^m, \tag{4.2}
$$
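The dichotomy between the overdetermined and underdetermined cases can be tested directly from the observed rows of the mixing matrix. A sketch (our function name; in practice a point estimate of $A$ would be used):

```python
# Decide whether Q(s_t) can be treated as unimodal (section 4.1): compare
# the rank of the observed rows of A against the number of sources L.
import numpy as np

def st_posterior_is_unimodal(A, obs_t):
    """A: (N, L) mixing matrix; obs_t: boolean mask of observed dims of x_t.
    Returns True when equation 4.1 is (over)determined in s_t."""
    A_obs = A[obs_t]                       # [A]_t^o: observed rows only
    L = A.shape[1]
    return np.linalg.matrix_rank(A_obs) >= L
```

When this returns False for a data point, that point is effectively an overcomplete ICA problem and needs the mixture posterior $Q(s_t)$ of equation 4.7.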
where $\delta(\cdot)$ is the delta function, and

$$
Q(x_{kt}^m \mid x_t^o, k) = \iint Q(\theta)\, \mathcal{N}(x_{kt}^m \mid [A s_{kt} + \nu]_t^m, [\Psi]_t^m)\, Q(s_{kt})\, ds_{kt}\, d\theta \tag{4.3}
$$
$$
= \iint Q(A)Q(\Psi)\, \mathcal{N}(x_{kt}^m \mid \mu(x_{kt}^m), \Lambda(x_{kt}^m))\, dA\, d\Psi \tag{4.4}
$$
$$
\mu(x_{kt}^m) = [A\, \mu(s_{kt}) + \mu(\nu)]_t^m
$$
$$
\Lambda(x_{kt}^m)^{-1} = [A\, \Lambda(s_{kt})^{-1} A^\top + \Lambda(\nu)^{-1} + \mathrm{diag}(\Psi)^{-1}]_t^m. \tag{4.5}
$$
Unfortunately, the integration over $Q(A)$ and $Q(\Psi)$ cannot be carried out analytically, but we can substitute $\langle A \rangle$ and $\langle \Psi \rangle$ as an approximation. Estimation of $Q(x_t^m \mid x_t^o)$ using the above equations is demonstrated in Figure 3. The shaded area is the exact posterior $P(x_t^m \mid x_t^o)$ for the noiseless mixing in Figure 1 with observed $x_2 = -2$, and the solid line is the approximation given by equations 4.2 through 4.5.

Figure 3: The approximation of $Q(x_t^m \mid x_t^o)$ from the full missing ICA (solid line) and the polynomial missing ICA (dashed line). The shaded area is the exact posterior $P(x_t^m \mid x_t^o)$ corresponding to the noiseless mixture in Figure 1 with observed $x_2 = -2$. Dotted lines are the contributions from the individual $Q(x_{kt}^m \mid x_t^o, k)$.

We have modified the variational ICA of Chan et al. (2002) by discounting missing entries. This is done by replacing $\sum_t$ with $\sum_t o_{nt}$ and $\Psi_n$ with $o_{nt}\Psi_n$ in their learning rules. The dashed line is the approximation $Q(x_t^m \mid x_t^o)$ from this modified method, which we refer to as polynomial missing ICA. The treatment of fully expanding the $K^L$ hidden source gaussians discussed in section 2.3 is named full missing ICA. The full missing ICA gives a more accurate fit for $P(x_t^m \mid x_t^o)$ and a better estimate of $\langle x_t^m \mid x_t^o \rangle$. From equation 2.16,

$$
Q(x_t^m \mid x_t^o) = \iint Q(\theta)\, \mathcal{N}(x_t^m \mid [A s_t + \nu]_t^m, [\Psi]_t^m)\, Q(s_t)\, ds_t\, d\theta, \tag{4.6}
$$

and under the above formalism, $Q(s_t)$ becomes

$$
Q(s_t) = \sum_k Q(k_t) \int \delta(s_t - s_{kt})\, Q(s_{kt})\, ds_{kt}, \tag{4.7}
$$

which is a mixture of $K^L$ gaussians. The missing values can then be filled in by

$$
\langle s_t \mid x_t^o \rangle = \int s_t\, Q(s_t)\, ds_t = \sum_k Q(k_t)\, \mu(s_{kt}) \tag{4.8}
$$
$$
\langle x_t^m \mid x_t^o \rangle = \int x_t^m\, Q(x_t^m \mid x_t^o)\, dx_t^m = \sum_k Q(k_t)\, \mu(x_{kt}^m) = [A]_t^m\, \langle s_t \mid x_t^o \rangle + [\mu(\nu)]_t^m, \tag{4.9}
$$
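Equations 4.8 and 4.9 amount to averaging the component means under $Q(k_t)$ and pushing the result through the mixing matrix. A minimal sketch (our names; `Qk` and `mu_s` would come from equations 3.11 and 3.9):

```python
# Fill in missing entries of x_t from the mixture posterior over sources,
# following equations 4.8-4.9. Variable names are ours, not the paper's.
import numpy as np

def fill_missing(A, nu, Qk, mu_s, obs_t):
    """A: (N, L) mixing matrix; nu: (N,) bias; Qk: (Kexp,) component
    posteriors Q(k_t); mu_s: (Kexp, L) component means mu(s_kt);
    obs_t: (N,) boolean mask. Returns <s_t | x_t^o> and the filled x_t^m."""
    s_mean = Qk @ mu_s                     # <s_t|x_t^o> = sum_k Q(k_t) mu(s_kt)
    x_mean = A @ s_mean + nu               # posterior mean of all entries
    return s_mean, x_mean[~obs_t]          # report only the missing entries
```

This returns the posterior mean; as the text notes, a MAP estimate on the full mixture would instead require numerical optimization.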
where $\mu(s_{kt})$ and $\mu(x_{kt}^m)$ are given in equations 3.9 and 4.4. Alternatively, a maximum a posteriori (MAP) estimate on $Q(s_t)$ and $Q(x_t^m \mid x_t^o)$ may be obtained, but then numerical methods are needed.

4.2 The "Full" and "Polynomial" Missing ICA. The complexity of the full variational Bayesian ICA method is proportional to $T \times K^L$, where $T$ is the number of data points, $L$ is the number of hidden sources assumed, and $K$ is the number of gaussians used to model the density of each source. If we set $K = 2$, the five parameters in the source density model $P(s_t(l))$ are already enough to model the mean, variance, skewness, and kurtosis of the source distribution. The full missing ICA should always be preferred if memory and computational time permit. The polynomial missing ICA converges more slowly per epoch of the learning rules, suffers from many more local maxima, and attains an inferior marginal-likelihood lower bound. These problems are more serious at high missing-data rates, where a local maximum solution is usually found instead. In the full missing ICA, $Q(s_t)$ is a mixture of gaussians. In the extreme case, when all entries of a data point are missing (that is, $x_t^o$ is empty), $Q(s_t)$ is the same as $P(s_t \mid \theta)$ and does not interfere with the learning of $P(s_t \mid \theta)$ from the other data points. On the other hand, the single-gaussian $Q(s_t)$ in the polynomial missing ICA would drive $P(s_t \mid \theta)$ to become gaussian as well, which is very undesirable when learning ICA structure.
5 Clusters of ICA

The variational Bayesian ICA for missing data described above can easily be extended to model the data density with $C$ clusters of ICA. First, all parameters $\theta$ and hidden variables $k_t$, $s_{kt}$ for each cluster are given a superscript index $c$. A parameter $\rho = \{\rho^1, \ldots, \rho^C\}$ is introduced to represent the weights of the clusters; $\rho$ has a Dirichlet prior (see equation 2.11). $\Theta = \{\rho, \theta^1, \ldots, \theta^C\}$ is now the collection of all parameters. Our density model in equation 2.1 becomes

$$
P(x_t \mid \Theta) = \sum_c P(c_t = c \mid \rho)\, P(x_t \mid \theta^c)
= \sum_c P(c_t = c \mid \rho) \int \mathcal{N}(x_t \mid A^c s_t^c + \nu^c, \Psi^c)\, P(s_t^c \mid \theta_s^c)\, ds_t^c. \tag{5.1}
$$
The objective function in equation 2.13 remains the same, but with $\theta$ replaced by $\Theta$. The separable posterior $Q(\Theta)$ is given by

$$
Q(\Theta) = Q(\rho) \prod_c Q(\theta^c), \tag{5.2}
$$

and, similar to equation 2.15,

$$
\int Q(\Theta) \log \frac{P(\Theta)}{Q(\Theta)}\, d\Theta
= \int Q(\rho) \log \frac{P(\rho)}{Q(\rho)}\, d\rho
+ \sum_c \int Q(\theta^c) \log \frac{P(\theta^c)}{Q(\theta^c)}\, d\theta^c. \tag{5.3}
$$
Equation 2.20 now becomes

$$
\log P(x_t^o \mid \Theta) \geq \sum_c Q(c_t) \log \frac{P(c_t)}{Q(c_t)}
+ \sum_{c,k} Q(c_t)\, Q(k_t^c) \left[ \int Q(s_{kt}^c) \log P(x_t^o \mid s_{kt}^c, \theta^c)\, ds_{kt}^c
+ \int Q(s_{kt}^c) \log \frac{\mathcal{N}(s_{kt}^c \mid \phi_k^c, \beta_k^c)}{Q(s_{kt}^c)}\, ds_{kt}^c \right]
+ \sum_{c,k} Q(c_t)\, Q(k_t^c) \log \frac{\pi_k^c}{Q(k_t^c)}. \tag{5.4}
$$
We have introduced one more hidden variable, $c_t$, and $Q(c_t)$ is to be interpreted in the same fashion as $Q(k_t^c)$. All learning rules in section 3 remain the same, only with $\sum_t$ replaced by $\sum_t Q(c_t)$. Finally, we need two more learning rules:

$$
d(\rho^c) = d_o(\rho^c) + \sum_t Q(c_t) \tag{5.5}
$$
$$
\log Q(c_t) = \langle \log \rho^c \rangle + \log z_t^c - \log Z_t, \tag{5.6}
$$
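Equation 5.6 says that $Q(c_t)$ is a normalized exponential of $\langle \log \rho^c \rangle + \log z_t^c$. A small sketch (our names), using the usual max-subtraction for numerical stability:

```python
# Cluster posterior Q(c_t) from equation 5.6: softmax of the expected
# log cluster weight plus each cluster's mixture evidence log z_t^c.
import numpy as np

def cluster_posterior(log_rho_mean, log_z_t):
    """log_rho_mean: (C,) values of <log rho^c>; log_z_t: (C,) values of
    log z_t^c from equation 3.11. Returns Q(c_t), normalized by Z_t."""
    logits = log_rho_mean + log_z_t
    logits = logits - logits.max()         # stabilize the exponentials
    w = np.exp(logits)
    return w / w.sum()
```

The normalizer of this softmax is the $Z_t$ of equation 5.6, which reappears as the predictive quantity $Z_{T+1}$ in section 6.2.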
where $z_t^c$ is the normalization constant for $Q(k_t^c)$ (see equation 3.11) and $Z_t$ normalizes $Q(c_t)$.

6 Supervised Classification

It is generally difficult for discriminative classifiers such as the multilayer perceptron (Bishop, 1995) or the support vector machine (Vapnik, 1998) to handle missing data. In this section, we extend the variational Bayesian technique to supervised classification. Consider a data set $(X_T, Y_T) = \{(x_t, y_t),\ t \in (1, \ldots, T)\}$. Here, $x_t$ contains the input attributes and may have missing entries; $y_t \in \{1, \ldots, y, \ldots, Y\}$ indicates which of the $Y$ classes $x_t$ is associated with. When given a new data point $x_{T+1}$, we would like to compute $P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M)$:

$$
P(y_{T+1} \mid x_{T+1}, X_T, Y_T, M)
= \frac{P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)\, P(y_{T+1} \mid X_T, Y_T, M)}{P(x_{T+1} \mid X_T, Y_T, M)}. \tag{6.1}
$$

Here $M$ denotes our generative model for the observations $\{x_t, y_t\}$:

$$
P(x_t, y_t \mid M) = P(x_t \mid y_t, M)\, P(y_t \mid M). \tag{6.2}
$$
$P(x_t \mid y_t, M)$ could be a mixture model as given by equation 5.1.

6.1 Learning of Model Parameters. Let $P(x_t \mid y_t, M)$ be parameterized by $\Theta^y$ and $P(y_t \mid M)$ be parameterized by $\omega = (\omega_1, \ldots, \omega_Y)$:

$$
P(x_t \mid y_t = y, M) = P(x_t \mid \Theta^y) \tag{6.3}
$$
$$
P(y_t \mid M) = P(y_t = y \mid \omega) = \omega_y. \tag{6.4}
$$

If $\omega$ is given a Dirichlet prior, $P(\omega \mid M) = \mathcal{D}(\omega \mid d_o(\omega))$, its posterior is also a Dirichlet distribution:

$$
P(\omega \mid Y_T, M) = \mathcal{D}(\omega \mid d(\omega)) \tag{6.5}
$$
$$
d(\omega_y) = d_o(\omega_y) + \sum_t I(y_t = y). \tag{6.6}
$$
$I(\cdot)$ is an indicator function that equals 1 if its argument is true and 0 otherwise. Under the generative model of equation 6.2, it can be shown that

$$
P(\Theta^y \mid X_T, Y_T, M) = P(\Theta^y \mid X^y), \tag{6.7}
$$

where $X^y$ is the subset of $X_T$ containing only those $x_t$ whose training labels $y_t$ have the value $y$. Hence, $P(\Theta^y \mid X_T, Y_T, M)$ can be approximated with $Q(\Theta^y)$ by applying the learning rules of sections 3 and 5 to the subset $X^y$.

6.2 Classification. First, $P(y_{T+1} \mid X_T, Y_T, M)$ in equation 6.1 can be computed by

$$
P(y_{T+1} = y \mid X_T, Y_T, M) = \int P(y_{T+1} = y \mid \omega_y)\, P(\omega_y \mid X_T, Y_T)\, d\omega_y
= \frac{d(\omega_y)}{\sum_{y'} d(\omega_{y'})}. \tag{6.8}
$$
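Equations 6.6 and 6.8 together reduce the class prior to normalized Dirichlet counts. A minimal sketch (our names; `d0` plays the role of $d_o(\omega_y)$):

```python
# Predictive class prior from equations 6.6 and 6.8: prior Dirichlet
# counts plus label counts, normalized. Names are ours, not the paper's.
import numpy as np

def class_prior_predictive(labels, Y, d0=1.0):
    """labels: iterable of training labels in {0, ..., Y-1}; d0: prior
    Dirichlet count. Returns P(y_{T+1} = y | Y_T, M) for every class y."""
    d = np.full(Y, d0, dtype=float)
    for y in labels:
        d[y] += 1.0                        # d(omega_y) = d0 + sum_t I(y_t = y)
    return d / d.sum()
```

With `d0 = 1` this is add-one smoothing, so classes unseen in training still receive nonzero prior mass.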
The other term, $P(x_{T+1} \mid y_{T+1}, X_T, Y_T, M)$, can be computed as

$$
\log P(x_{T+1} \mid y_{T+1} = y, X_T, Y_T, M) = \log P(x_{T+1} \mid X^y, M)
$$
$$
= \log P(x_{T+1}, X^y \mid M) - \log P(X^y \mid M) \tag{6.9}
$$
$$
\approx \mathcal{E}(\{x_{T+1}, X^y\}; Q'(\Theta^y)) - \mathcal{E}(X^y; Q(\Theta^y)). \tag{6.10}
$$

The above requires adding $x_{T+1}$ to $X^y$ and iterating the learning rules to obtain $Q'(\Theta^y)$ and $\mathcal{E}(\{x_{T+1}, X^y\}; Q'(\Theta^y))$. The error in the approximation is the difference $\mathrm{KL}(Q'(\Theta^y), P(\Theta^y \mid \{x_{T+1}, X^y\})) - \mathrm{KL}(Q(\Theta^y), P(\Theta^y \mid X^y))$. If we assume further that $Q'(\Theta^y) \approx Q(\Theta^y)$,

$$
\log P(x_{T+1} \mid X^y, M) \approx \int Q(\Theta^y) \log P(x_{T+1} \mid \Theta^y)\, d\Theta^y = \log Z_{T+1}, \tag{6.11}
$$
where $Z_{T+1}$ is the normalization constant in equation 5.6.

7 Experiment

7.1 Synthetic Data. In the first experiment, 200 data points were generated by mixing four sources randomly in a seven-dimensional space. The generalized gaussian, gamma, and beta distributions were used to represent source densities of various skewness and kurtosis (see Figure 5). Noise
Figure 4: In the first experiment, 30% of the entries in the seven-dimensional data set are missing, as indicated by the black entries. (The first 100 data points are shown.)
Figure 5: Source density modeling by variational missing ICA of the synthetic data. Histograms: recovered source distributions; dashed lines: original probability densities; solid lines: mixture-of-gaussians modeled probability densities; dotted lines: individual gaussian contributions.
at the $-26$ dB level was added to the data, and missing entries were created with a probability of 0.3. The data matrix for the first 100 data points is plotted in Figure 4; dark pixels represent missing entries. Notice that some data points have fewer than four observed dimensions. In Figure 5, we plot the histograms of the recovered sources together with the probability density functions (pdf) of the four sources. The dashed line is the exact pdf used to generate the data, and the solid line is the pdf modeled by a mixture of two one-dimensional gaussians (see equation 2.2). The two gaussians gave an adequate fit to the source histograms and densities.
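For readers who want to reproduce a data set of this shape, the following sketch generates 200 seven-dimensional points from four mixed sources with noise at the $-26$ dB level and entries deleted independently with probability 0.3. The Laplacian sources and the seed are our stand-ins; the paper used generalized gaussian, gamma, and beta sources:

```python
# Synthetic setup in the spirit of section 7.1: 4 sources mixed into 7
# dimensions, additive noise, and entries dropped with probability 0.3.
# Source densities and seed are our assumptions, not the paper's.
import numpy as np

rng = np.random.default_rng(0)
T, L, N = 200, 4, 7
S = rng.laplace(size=(L, T))               # stand-in heavy-tailed sources
A = rng.normal(size=(N, L))                # random mixing matrix
noise_std = 10 ** (-26 / 20)               # -26 dB relative noise level
X = A @ S + noise_std * rng.normal(size=(N, T))
mask = rng.random(size=(N, T)) < 0.3       # True marks a missing entry
X_obs = np.where(mask, np.nan, X)          # NaN encodes missing entries
```

With a 0.3 deletion rate on only 7 dimensions, some columns of `X_obs` end up with fewer than $L = 4$ observed entries, which is precisely the underdetermined regime of section 4.1 that the full missing ICA is designed for.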