LETTER
Communicated by Misha Tsodyks
Mean-Driven and Fluctuation-Driven Persistent Activity in Recurrent Networks

Alfonso Renart∗
[email protected]
Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain, and Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254, U.S.A.

Rubén Moreno-Bote
[email protected]
Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain

Xiao-Jing Wang
[email protected]
Volen Center for Complex Systems, Brandeis University, Waltham, MA 02254, U.S.A.

Néstor Parga
[email protected]
Departamento de Física Teórica, Universidad Autónoma de Madrid, Cantoblanco 28049, Madrid, Spain
Spike trains from cortical neurons show a high degree of irregularity, with coefficients of variation (CV) of their interspike interval (ISI) distribution close to or higher than one. It has been suggested that this irregularity might be a reflection of a particular dynamical state of the local cortical circuit in which excitation and inhibition balance each other. In this “balanced” state, the mean current to the neurons is below threshold, and firing is driven by current fluctuations, resulting in irregular Poisson-like spike trains. Recent data show that the degree of irregularity in neuronal spike trains recorded during the delay period of working memory experiments is the same for both low-activity states of a few Hz and for elevated, persistent activity states of a few tens of Hz. Since the difference between these persistent activity states cannot be due to external factors coming from sensory inputs, this suggests that the underlying
∗ Current address: Center for Molecular and Behavioral Neuroscience, Rutgers University, Newark, NJ 07102 USA.
Neural Computation 19, 1–46 (2007)
© 2006 Massachusetts Institute of Technology
network dynamics might support coexisting balanced states at different firing rates. We use mean-field techniques to study the possible existence of multiple balanced steady states in recurrent networks of current-based leaky integrate-and-fire (LIF) neurons. To assess the degree of balance of a steady state, we extend existing mean-field theories so that not only the firing rate, but also the coefficient of variation of the interspike interval distribution of the neurons, is determined self-consistently. Depending on the connectivity parameters of the network, we find bistable solutions of different types. If the local recurrent connectivity is mainly excitatory, the two stable steady states differ mainly in the mean current to the neurons. In this case, the mean drive in the elevated persistent activity state is suprathreshold and typically characterized by low spiking irregularity. If the local recurrent excitatory and inhibitory drives are both large and nearly balanced, or even dominated by inhibition, two stable states coexist, both with subthreshold current drive. In this case, the spiking variability in both the resting state and the mnemonic persistent state is large, but the balance condition implies parameter fine-tuning. Since the degree of required fine-tuning increases with network size and, on the other hand, the size of the fluctuations in the afferent current to the cells increases for small networks, overall we find that fluctuation-driven persistent activity in the very simplified type of models we analyze is not a robust phenomenon. Possible implications of considering more realistic models are discussed.
1 Introduction

The spike trains of cortical neurons recorded in vivo are irregular and consistent, to a first approximation, with a Poisson process, possessing a roughly exponential interspike interval (ISI) distribution (except at very short intervals) and a coefficient of variation (CV) of the ISI close to one (Softky & Koch, 1993). The possible implications of this fact for the basic principles of cortical organization have motivated a large number of studies during the past 10 years (Softky & Koch, 1993; Shadlen & Newsome, 1994, 1998; Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996, 1998; Zador & Stevens, 1998; Harsch & Robinson, 2000). An important idea analyzed in some of these studies was that the apparent inconsistency between the cortical neuron working as an integrator, over a relatively long time constant of the order of 10 to 20 ms, of a very large number of inputs, and its irregular spiking, could be resolved if the cell received similar amounts of excitatory and inhibitory drive. In this way, the mean drive to the cell was subthreshold, and spikes were the result of fluctuations, which occur irregularly, thus leading to a high CV (Gerstein & Mandelbrot, 1964). Although the implications of this result were first studied in a feedforward architecture (Shadlen & Newsome, 1994), it was soon discovered that a state
in which excitation and inhibition balance each other, resulting in irregular spiking, was a robust dynamical attractor in recurrent networks (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996, 1998); that is, under very general conditions, a recurrent network settles down into a state of this sort. Although the original studies characterizing quantitatively the degree of spiking irregularity in the cortex were done using data from sensory cortices, it has since been shown that neurons in higher-order associative areas like the prefrontal cortex (PFC) also spike irregularly (Shinomoto, Sakai, & Funahashi, 1999; Compte et al., 2003) (see Figure 1). This is interesting because it is well known that cells in the PFC (Fuster & Alexander, 1971; Funahashi, Bruce, & Goldman-Rakic, 1989; Miller, Erickson, & Desimone, 1996; Romo, Brody, Hernández, & Lemus, 1999), as well as those in other associative cortices like the inferotemporal (Miyashita & Chang, 1988) or posterior parietal cortex (Gnadt & Andersen, 1988; Chafee & Goldman-Rakic, 1998), show activity patterns that are selective to stimuli no longer present to the animal and are thus being held in working memory. The activity of these neurons seems to be able to switch, on presentation of an appropriate brief, transient input, from a basal spontaneous activity level to a higher activity state. When the dimensionality of the stimulus to be remembered is low (e.g., the position of an LED on a computer screen or the frequency of a vibrotactile stimulus), the mnemonic activity during the delay period when the stimulus is absent seems to be graded (Funahashi et al., 1989; Romo et al., 1999), whereas when the dimensionality of the stimulus is high (e.g., a complex image), single neurons seem to choose from a small number of discrete activity states (Miyashita & Chang, 1988; Miller et al., 1996). This last coding scheme is referred to as object working memory.
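Both irregularity measures discussed in these studies (and shown in Figure 1) can be computed directly from an ISI sequence. The sketch below is ours, not from the paper; for CV2 we assume the standard consecutive-interval definition of Holt et al. (1996), CV2_k = 2|ISI_{k+1} − ISI_k| / (ISI_{k+1} + ISI_k) averaged over k, and the exponential ISIs are purely illustrative:

```python
import random
import statistics

# Sketch: CV and CV2 of an ISI sequence. CV2 uses only consecutive ISIs,
# which makes it less sensitive to slow drifts of the instantaneous rate.
def cv(isis):
    return statistics.pstdev(isis) / statistics.fmean(isis)

def cv2(isis):
    # mean over consecutive pairs of 2|I_{k+1} - I_k| / (I_{k+1} + I_k)
    return statistics.fmean(
        2.0 * abs(b - a) / (a + b) for a, b in zip(isis, isis[1:])
    )

rng = random.Random(0)
# illustrative Poisson-like train at 20 Hz (exponential ISIs)
isis = [rng.expovariate(20.0) for _ in range(100000)]
print(cv(isis), cv2(isis))   # both close to 1 for exponential ISIs
```

For a stationary Poisson train both measures are close to one, which is the benchmark against which the delay-period data in Figure 1 are judged.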
Since there is no explicit sensory input present during the delay period in a working memory task, the neuronal activity must be a result of the dynamics of the relevant neural circuit. There is a long tradition of modeling studies that have described delay-period activity as a reflection of dynamical attractors in multistable (usually bistable) networks presumed to represent the local cortical environment of the neurons recorded in the neurophysiological experiments (Hopfield, 1982; Amit & Tsodyks, 1991; Ben-Yishai, Lev Bar-Or, & Sompolinsky, 1995; Amit & Brunel, 1997b; Brunel, 2000a; Compte, Brunel, Goldman-Rakic, & Wang, 2000; Brunel & Wang, 2001; Hansel & Mato, 2001; Cai, Tao, Shelly, & McLaughlin, 2004). Originally inspired by models and techniques from the statistical mechanics of disordered systems, network models of persistent activity have progressively become more faithful to the biological circuits that they seek to describe. The landmark study of Amit and Brunel (1997b) provided an extended mean-field description of the activity of a recurrent network of spiking current-based leaky integrate-and-fire (LIF) neurons. One of its main achievements was to use the theory of diffusion processes to provide an intuitive, compact
[Figure 1 appears here: histograms of the number of cells versus CV (top row) and versus CV2 (bottom row) for the FIXATION, PREFERRED, and NONPREFERRED conditions, with reported p values of 8.5e−15 (CV) and 1.6e−23 (CV2).]
Figure 1: CV of the ISI of neurons in monkey prefrontal cortex during a spatial working memory task. The monkey made saccades to remembered locations on a computer screen after a delay period of a few seconds. On each trial, a dot of light (cue stimulus) was briefly shown in one of eight to-be-remembered locations, equidistant from the fixation point but at different angles. After the delay period, starting with the disappearance of the cue stimulus and terminating with the disappearance of the fixation point, the monkey made a saccade to the remembered location. Top and bottom rows correspond, respectively, to the CV and CV2 (CV calculated using only consecutive ISIs, to try to compensate for possible slow nonstationarities in the neurons' instantaneous frequency) computed from spike trains of prefrontal cortical neurons recorded from monkeys performing an oculomotor spatial working memory task. Results shown correspond to analysis of the activity during the delay period of the task. The spike trains are irregular (CV ∼ 1), and to a similar extent, both when the data correspond to trials in which the preferred (PR; middle column) positional cue for the cell was held in working memory (higher firing rate during the delay period) and when they correspond to trials with the nonpreferred (NP; right column) positional cue for the particular neuron (lower firing rate during the delay period). See Compte et al. (2003) for details. Adapted with permission from Compte et al. (2003).
description of the spontaneous, low-rate, basal activity state of cortical cells in terms of self-consistency equations that included information about both the mean and the fluctuations of the afferent current to the cell. The theory proposed was both simple and accurate, and matched well the properties of simulated LIF networks. The spontaneous activity state in Amit and Brunel (1997b) is effectively the balanced state described above, in which the recurrent connectivity is
dominated by inhibition and firing is due to the occurrence of positive fluctuations in the drive to the neurons. However, in Amit and Brunel (1997b), this same model was used to describe the coexistence of the spontaneous activity state with a persistent activity state with a physiologically plausible firing rate that would correspond to the spiking observed during the delay period in object working memory tasks, such as seen in, for example, Miyashita and Chang (1988). Although the model, with its large number of subsequent improvements, has been successful in providing a fairly accurate description of simulated spiking networks, no effort has yet been made to study systematically the relationship between multistability and the irregularity of the spike trains, especially in the elevated activity state. As we will show below, the qualitative organization of the connectivity in the recurrent network not only determines the existence of a fluctuation-driven balanced spontaneous activity state in the network, but also the existence of bistability in the network, and whether the elevated activity states are fluctuation driven. In order to perform a systematic analysis of the types of persistent activity that can be obtained in a network of current-based LIF neurons, two steps are important. First, we believe that the scaling of the synaptic connections with the number of afferent synapses per neuron should be made explicit. This approach was taken in the studies of the balanced state (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996), but is not present in the Amit and Brunel (1997b) framework. As we shall see, when the scaling is made explicit and the network is studied in the limit of a large number of connections per cell, the difference between the behavior of alternative circuit organizations (or architectures) becomes qualitative. Second, it would be desirable to be able to check for the spike train irregularity within the theory. 
In Amit and Brunel (1997b), spiking was assumed to be Poisson and, hence, to have a CV equal to 1. Poisson spike trains are completely characterized by a single number, the instantaneous firing probability, so there is nothing more to say about the spike train once its firing rate has been given. A general self-consistent description of the higher-order moments of spiking in a recurrent network of LIF neurons is extremely difficult, as the calculation of the moments of the ISI distribution becomes prohibitively complicated when the input current to a particular cell contains temporal correlations (although see Moreno-Bote & Parga, 2006). However, based on our study of the input-output properties of the LIF neuron under the influence of correlated inputs (Moreno, de la Rocha, Renart, & Parga, 2002), we have constructed a self-consistent description for the first two moments of the current to the neurons in the network, which relaxes the Poisson assumption and which we expect to be valid if the temporal correlations in the spike trains in the network are sufficiently short. Some of the results presented here have already been published in abstract form.
2 Methods

We consider a network of current-based leaky integrate-and-fire neurons. The voltage difference across each neuron's membrane evolves in time according to the following equation,

dV(t)/dt = −V(t)/τm + I(t),

with voltages being measured relative to the leak potential of the neuron. When the depolarization reaches a threshold voltage that we set at Vth = 20 mV, a spike is emitted, and the voltage is clamped at a reset potential Vr = 10 mV during a refractory period τref = 2 ms, after which the voltage continues to integrate the input current. The membrane time constant is τm = 10 ms. When the neuron is inserted in a network, I(t) represents the total synaptic current, which is assumed to be a linear sum of the contributions from each individual presynaptic cell. We consider the simplest description of the synaptic interaction between the pre- and postsynaptic neurons, according to which each presynaptic action potential provokes an instantaneous “kick” in the depolarization of the postsynaptic cell. The network is composed of NE excitatory and NI inhibitory cells randomly connected so that each cell receives CE excitatory and CI inhibitory contacts, each with an efficacy (“kick” size) JEj and JIk, respectively (j = 1, . . . , CE; k = 1, . . . , CI). The total afferent current into the cell can be represented as

I(t) = Σ_{j=1}^{CE} JEj s_j(t) − Σ_{k=1}^{CI} JIk s_k(t),

where s_{j(k)}(t) represents the spike train from the jth excitatory (kth inhibitory) neuron. Since according to this description the effect of a presynaptic spike on the voltage of the postsynaptic neuron is instantaneous, s(t) is a collection of Dirac delta functions, that is, s(t) ≡ Σ_j δ(t − t_j), where t_j are the spike times.

2.1 Mean-Field Description. Spike trains in the model are stochastic, with an instantaneous firing rate (i.e., a probability per unit time of measuring a spike in (t, t + dt)) denoted by ν(t), equal to a constant ν in the steady state. The second-order statistics of the process are characterized by its connected two-point correlation function C(t, t′), giving the joint probability density (above chance) that two spikes happen at (t, t + dt) and at (t′, t′ + dt), that is, C(t, t′) ≡ ⟨s(t)s(t′)⟩ − ⟨s(t)⟩⟨s(t′)⟩. Stochastic spiking in network models is usually assumed to follow Poisson statistics, which is both a fairly good approximation to what is commonly observed experimentally (see, e.g.,
Softky & Koch, 1993; Compte et al., 2003) and convenient technically, since Poisson spike trains lack any temporal correlations. For Poisson spike trains, C(t, t′) = νδ(t′ − t), where ν is the instantaneous firing probability. We have previously analyzed the effect of temporal correlations in the afferents to a LIF neuron on its firing rate (Moreno et al., 2002). Temporal correlations measured in vivo are often well fitted by an exponential (Bair, Zohary, & Newsome, 2001). We considered exponential correlations of the form

C(t, t′) = ν [ δ(t′ − t) + (F∞ − 1) e^{−|t′−t|/τc} / (2τc) ],    (2.1)
where F∞ is the Fano factor of the spike train for infinitely long time windows. The Fano factor in a window of length T is defined as the ratio between the variance and the mean of the spike count in the window. It is illustrative to calculate it for our process,

F_T ≡ σ²_{N(T)} / ⟨N(T)⟩,

where N(T) is the (stochastic) spike count in a window of length T,

N(T) = ∫₀ᵀ dt s(t),

so that ⟨N(T)⟩ = νT, and the spike count variance is given by

σ²_{N(T)} ≡ ∫₀ᵀ ∫₀ᵀ dt dt′ C(t, t′) = νT + ν(F∞ − 1) [ T − τc (1 − e^{−T/τc}) ].

When the time window is long compared to the correlation time constant, that is, T ≫ τc, then σ²_{N(T)} ≃ F∞ νT; hence our use of the factor F∞ in the definition of the correlation function. An interesting point to note is that for time windows that are long compared to the correlation time constant, the variance of the spike count is linear in time, which is a signature of independence across time, that is, independent variances add up (for the Poisson process, (σ²_{N(T)})_Poisson = νT, so that (F_T)_Poisson = 1). If the characteristic time of the postsynaptic cell integrating this stochastic current (its membrane time constant) is very long compared with τc, we expect that the main effect of the deviation from Poisson statistics of the input spike trains will be on the amplitude of the current variance, with the parameter τc playing only a marginal role, as it does not appear in σ²_{N(T)} when T ≫ τc. As we show
below, a rigorous analysis of the effect of correlations on the mean firing rate of a LIF neuron confirms this intuitive picture.

The postsynaptic cell receives many inputs. Recall that the total current is given (we consider for simplicity for this discussion that the cell receives C inputs from a single, large, homogeneous population composed of N neurons) by I(t) = J Σ_{j=1}^{C} s_j(t). Thus, the mean and correlation function of the total afferent current to a given cell are

⟨I(t)⟩ = C J ν

C_I(t, t′) = ⟨I(t)I(t′)⟩ − ⟨I(t)⟩⟨I(t′)⟩ = J² Σ_{ij} [ ⟨s_i(t)s_j(t′)⟩ − ⟨s_i(t)⟩⟨s_j(t′)⟩ ]
           = C J² C(t, t′) + C(C − 1) J² C_cc(t, t′),

where C(t, t′) is the (auto)correlation function in equation 2.1 and C_cc(t, t′) is the cross-correlogram between any two given cells of the pool of presynaptic inputs (which we have assumed to be the same for all pairs). We restrict our analysis to very sparse random networks (networks with C ≪ N), so that the fraction of synaptic inputs shared by any two given neurons can be assumed to be negligible. In this case, the cross-correlation between the spike trains of the two cells, C_cc(t, t′), will be zero. This approximation simplifies the analysis of the network behavior significantly and allows for a self-consistent solution for the network's steady states. Thus, the temporal structure of the total current to the cell is described by

C_I(t, t′) = σ₀² [ δ(t − t′) + (α / 2τc) e^{−|t−t′|/τc} ]    (2.2)

with

σ₀² = C J² ν    and    α = F∞ − 1.
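The Poisson baseline (F∞ = 1, α = 0) is easy to check numerically. The following sketch (ours, not the authors' code; all parameter values are illustrative) pools the spike counts of C independent Poisson inputs and verifies that the count mean is CνT with a Fano factor near one, which is what reduces the white-noise current variance to σ₀² = CJ²ν:

```python
import numpy as np

# Sketch: pooled spike count of C independent Poisson inputs.
# With F_inf = 1 (alpha = 0), the current variance is sigma_0^2 = C J^2 nu.
rng = np.random.default_rng(0)

C, nu, T, M = 200, 0.02, 100.0, 4000   # inputs, rate (1/ms), window (ms), windows
counts = rng.poisson(nu * T, size=(M, C)).sum(axis=1)

mean_count = counts.mean()             # expected C * nu * T = 400
fano = counts.var() / mean_count       # expected ~ 1 for pooled Poisson inputs
print(mean_count, fano)
```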
We have previously calculated the output firing rate of an LIF neuron subject to an exponentially correlated input (Moreno et al., 2002). The calculation is done using the diffusion approximation (Ricciardi, 1977) in which the discontinuous voltage trajectories are approximated by those obtained from an equivalent diffusion process. The diffusion approximation is expected to give accurate results when the overall rate of the input process is high, with the amplitude of each individual event being very small (Ricciardi, 1977). For small but finite τc , the analytic calculation of the firing rate of the cell can be done only when the deviation of the input current from a white noise
process is small; that is, it has to be done assuming that k ≡ √(τc/τm) ≪ 1 and that α ≪ 1. More specifically, we found that if k = 0, then the firing rate can be calculated for arbitrary values of α, but if k is small but finite, then an expression can be found for the case when both k and α are small (see Moreno et al., 2002, for details). If k = 0, the result one obtains is that the firing rate of the neuron is given by the same expression that one finds for the case of a white noise input, but with an effective variance that takes into account the change in amplitude of the fluctuations due to the non-Poisson nature of the inputs. The effective variance is equal to

σ²_eff = σ₀² (1 + α) = C J² ν F∞,

which is exactly the slope of the linear increase with the size of the time window T of the variance in the spike count N_I(T) of the total input current. This result can be understood in terms of the Fano factor calculation outlined above. Assuming k = 0 is equivalent to assuming an infinitely long time window for the calculation of the Fano factor, and in those conditions we also saw that the only effect of the temporal correlations is to renormalize the variance of the spike count with respect to the Poisson case.

In order to set up a self-consistent scenario, we have to close the loop by calculating a measure of the variability of the postsynaptic cell and relating it to the same property in the spike trains of its inputs. To do this, we note that if the spike trains in the model can be described as renewal processes, these processes have a property that relates their spike count variability and their ISI variability,

F∞ = CV²,

if the point process is renewal (Cox, 1962). Renewal point processes are characterized by having independent ISIs, which are not necessarily exponentially distributed. Since we are assuming that the temporal correlations in the spike trains are short anyway, and the firing rates of the cells in the persistent activity states that we are interested in are not very high, we expect the renewal assumption to be appropriate. The final step is to make sure that the result for the firing rate (the inverse of the mean ISI) in terms of the effective variance also holds for higher moments of the postsynaptic ISI, not only for the first, and this is indeed the case (Renart, 2000); that is, the CV of the ISI when k = 0 is given by the same expression as when the input is a white noise process, but with a renormalized variance equal to σ²_eff. Thus, under the assumptions described above, there is a way of computing the output rate and CV of an LIF neuron solely in terms of the rate and CV of its presynaptic inputs.
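The renewal relation F∞ = CV² can also be verified numerically. The sketch below is ours; it assumes gamma-distributed ISIs as a convenient renewal process (not a modeling choice made in the paper) and compares the ISI CV² with the Fano factor measured in windows much longer than the mean ISI:

```python
import numpy as np

# Sketch: for a renewal spike train, the long-window Fano factor
# approaches CV^2 of the ISI distribution.
rng = np.random.default_rng(1)

shape = 4.0                    # gamma shape; true CV^2 = 1/shape = 0.25
mean_isi = 50.0                # ms, i.e., a 20 Hz train
isis = rng.gamma(shape, mean_isi / shape, size=400000)
cv2 = isis.var() / isis.mean() ** 2

# Spike times and counts in windows with T >> mean ISI.
t = np.cumsum(isis)
T = 5000.0                     # window length, ms (100 mean ISIs)
edges = np.arange(0.0, t[-1], T)
counts, _ = np.histogram(t, bins=edges)
counts = counts[1:-1]          # discard edge windows
fano = counts.var() / counts.mean()

print(cv2, fano)               # both should be close to 0.25
```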
In the steady states, both input and output firing rate and CV will be the same, and this provides a pair of equations that determine these quantities self-consistently. In the remainder of the letter, we thus use the common expressions for the mean and CV of the first passage time of the Ornstein-Uhlenbeck (OU) process,

ν⁻¹ = τref + τm √π ∫_{(Vres−µV)/(σV√2)}^{(Vth−µV)/(σV√2)} dx e^{x²} [1 + erf(x)]    (2.3)

CV² = 2π ν² τm² ∫_{(Vres−µV)/(σV√2)}^{(Vth−µV)/(σV√2)} dx e^{x²} ∫_{−∞}^{x} dy e^{y²} [1 + erf(y)]²,    (2.4)

where µV and σV² are the mean and variance of the depolarization of the postsynaptic cell (in the absence of threshold; Ricciardi, 1977). In a stationary situation, they are related to the mean µ and variance σ² of the afferent current to the cell by

µV = τm µ;    σV² = (1/2) τm σ².

Following the arguments above, the mean and (effective) variance of the current to the cells are given by

µ = C J ν
σ² ≡ σ²_eff = C J² ν CV²

for the mean and variance of the white noise input current. Finally, it is easy to show that if the presynaptic afferents to the cell come from a set of different statistically homogeneous subpopulations, the previous expressions generalize readily to

µ_i = Σ_j C_ij J_ij ν_j    (2.5)

σ_i² ≡ σ²_{i,eff} = Σ_j C_ij J²_ij ν_j CV²_j    (2.6)
as long as the timescales of the correlations in the spike trains of the neurons in the different subpopulations are all of the same order. Inhibitory subpopulations are characterized by negative connection strengths. 2.2 Dynamics. A detailed characterization of the dynamics of the activity of the network is beyond the scope of this work. Since our main interest is the steady states of the network, we use a simple, effective dynamical
scheme that is consistent with the self-consistent equations that determine the steady states. In particular, we use the subthreshold dynamics of the first two moments of the depolarization in terms of the first two moments of the current (Ricciardi, 1977; Gillespie, 1992):

dµV/dt = −µV/τm + µ;    dσV²/dt = −σV²/(τm/2) + σ².    (2.7)
In using these equations, our assumption is that the activity of the population follows instantaneously the distribution of the depolarization. Thus, at every point in time, we use expressions 2.5 and 2.6 for µ and σ appearing in the right-hand side of equations 2.7, which depend on the rate ν(µV , σV ) and CV(µV , σV ) as given in equations 2.3 and 2.4. The only dynamical variables are therefore µV and σV2 (Amit & Brunel, 1997b; Mascaro & Amit, 1999). 2.3 Numerical Analysis of the Analytic Results. The phase plane analysis of the reduced network was done using both custom-made C++ code and the program XPPaut. The integrals in equations 2.3 and 2.4 were calculated analytically for very large and very small values of the limits of integration (using asymptotic expressions for the error function; Abramowitz & Stegun, 1970) and numerically for values of the integration limits of order one. The corresponding C++ code was incorporated into XPPaut through the use of dynamically linked libraries for phase plane analysis. Some of the cusp diagrams were calculated without the use of XPPaut by the direct approach of looking for values of the parameters at which the number of fixed points changed abruptly.
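Equations 2.3 and 2.4 are straightforward to evaluate by direct quadrature. The following sketch is our own implementation, not the code used for the paper (and the specific µV and σV values are illustrative); it reproduces the qualitative distinction exploited throughout the letter: suprathreshold mean drive gives a high rate and low CV, while subthreshold drive gives a lower rate and a CV closer to one.

```python
import math

# Sketch: mean rate (eq. 2.3) and ISI CV (eq. 2.4) of the LIF neuron from the
# moments (mu_V, sigma_V) of its free depolarization, by midpoint quadrature.
TAU_M, TAU_REF = 10.0, 2.0        # ms
V_TH, V_RES = 20.0, 10.0          # mV

def _f(x):
    # e^{x^2} (1 + erf x); numerically safe in double precision for |x| <~ 5
    return math.exp(x * x) * (1.0 + math.erf(x))

def rate_and_cv(mu_v, sigma_v, n=4000, m=20000):
    a = (V_RES - mu_v) / (sigma_v * math.sqrt(2.0))
    b = (V_TH - mu_v) / (sigma_v * math.sqrt(2.0))
    dx = (b - a) / n
    xs = [a + (i + 0.5) * dx for i in range(n)]
    # equation 2.3: 1/nu = tau_ref + tau_m sqrt(pi) int_a^b e^{x^2}(1+erf x) dx
    nu = 1.0 / (TAU_REF + TAU_M * math.sqrt(math.pi) * sum(map(_f, xs)) * dx)
    # equation 2.4: CV^2 = 2 pi (nu tau_m)^2 int_a^b dx e^{x^2}
    #                      int_{-inf}^x dy e^{y^2}(1+erf y)^2;
    # the inner integral is accumulated on a grid from a practical lower
    # cutoff (its integrand is negligible below y = -6).
    lo = -6.0
    dy = (b - lo) / m
    cum, inner = [], 0.0
    for i in range(m):
        y = lo + (i + 0.5) * dy
        inner += _f(y) * (1.0 + math.erf(y)) * dy   # e^{y^2}(1+erf y)^2 dy
        cum.append(inner)
    dbl = sum(math.exp(x * x) * cum[max(0, min(m - 1, int((x - lo) / dy)))] * dx
              for x in xs)
    cv = math.sqrt(2.0 * math.pi * (nu * TAU_M) ** 2 * dbl)
    return nu, cv

nu_supra, cv_supra = rate_and_cv(mu_v=30.0, sigma_v=3.5)   # mean-driven
nu_sub, cv_sub = rate_and_cv(mu_v=15.0, sigma_v=5.0)       # fluctuation-driven
print(nu_supra, cv_supra)   # higher rate (spikes/ms), low CV
print(nu_sub, cv_sub)       # lower rate, CV closer to 1
```

For strongly suprathreshold drive, the rate approaches the deterministic value 1/(τref + τm ln[(µV − Vres)/(µV − Vth)]), which is a useful sanity check on the quadrature.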
2.4 Numerical Simulations. We simulated a network identical to the one used in the mean-field description (see the captions of Figures 12 and 13 for parameters). In the simulation, on every time step dt = 50 µs, it is determined which neurons in the network receive any spikes. The membrane potential of cells that do not receive spikes is integrated analytically. The membrane potential of cells that do receive spikes is also integrated analytically within that dt, taking into account the postsynaptic potentials (PSPs) but assuming that there is no threshold. Only at the end of the time step is it checked whether the membrane potential is above threshold. If this is the case, the neuron is said to have produced a spike. This procedure effectively introduces a (very short but nonzero) synaptic integration time constant. Emitted spikes are fed back into the network using a system of queues to account for the synaptic delays (Mattia & Del Giudice, 2000).
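The integration scheme of this section can be illustrated in miniature. The sketch below is ours; it applies the scheme to a single neuron driven by external Poisson kicks rather than to the full recurrent network with delay queues, and the input rate and kick size are illustrative:

```python
import math
import random

# Sketch of the section 2.4 scheme: the membrane equation is integrated
# analytically within each dt (ignoring the threshold), and the threshold
# is checked only at the end of the step.
TAU_M, TAU_REF = 10.0, 2.0      # ms
V_TH, V_RES = 20.0, 10.0        # mV
DT = 0.05                       # ms, as in the simulations

def poisson_sample(rng, lam):
    # Knuth's method; adequate for the small per-step means used here
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate(rate_in, j, t_max, seed=0):
    rng = random.Random(seed)
    v, refr_until, spikes = 0.0, -1.0, []
    decay = math.exp(-DT / TAU_M)
    for step in range(int(t_max / DT)):
        t = step * DT
        if t < refr_until:          # clamped at reset during refractoriness
            continue
        # exact subthreshold decay over dt, plus this step's PSP kicks
        v = v * decay + j * poisson_sample(rng, rate_in * DT)
        if v >= V_TH:               # threshold checked only at step end
            spikes.append(t + DT)
            v = V_RES
            refr_until = t + DT + TAU_REF
    return spikes

# mean drive mu_V = tau_m * rate * J = 10 * 15 * 0.2 = 30 mV (suprathreshold)
spikes = simulate(rate_in=15.0, j=0.2, t_max=1000.0)
print(len(spikes))   # roughly a hundred spikes over one second of drive
```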
3 Results

3.1 Network Architecture and Scaling Considerations. We first study the issue of how the architecture of the network determines the qualitative nature of the steady-state solutions for the firing rate of the cells in the network. In particular, we are interested in understanding under which conditions there are multiple steady-state solutions (bistability or, in general, multistability) in networks in which cells fire with a high degree of irregularity. We consider a network for object working memory composed of a number of nonoverlapping subpopulations, or columns, defined by selectivity with respect to a given external stimulus. Each subpopulation contains both excitatory and inhibitory cells. The synaptic properties of neurons within the same column are assumed to be statistically identical. Thus, the column is, in our network, the minimal signaling unit at the average level; that is, all the neurons within a column behave identically on average. As will be shown below, the type of bistability realizable for the column depends critically on its size (more specifically, on the number of connections that a given cell receives from within its own column). Different architectures will thus be considered in which the total number of afferent connections per cell C is constant (and large) but the number of columns in the network varies, effectively varying the number of afferent connections from a given column to a cell. A multicolumnar architecture of this sort is inspired by the anatomical organization of the PFC, in which columnarly organized putative excitatory cells and interneurons show similar response profiles during working memory tasks (Rao, Williams, & Goldman-Rakic, 1999). As noted in section 1, many of the properties of the network can be inferred from a scaling analysis.
In the limit in which the connectivity is very sparse, so that correlations between the spike trains of different cells can be neglected (see section 2), the relevant parameter is the number of afferent connections per neuron C. We will investigate the behavior of the network in the limit C → ∞ (the “extensive” limit) since, in this case, the different scenarios become qualitatively different. Of course, physical neurons receive a finite number of connections, but the rationale is that the physical solution can be considered a small perturbation to the solution found in the C = ∞ case, which is much easier to characterize. One should keep in mind that even if C becomes very large, we still need to impose the sparseness condition for our analysis to be valid, which implies that it should always hold that N ≫ C. When considering current-based scenarios in the extensive limit, one is forced to normalize the connection strengths (the size of the unitary PSPs, which we denote generically by J) by (some power of) the number of connections per cell C, in order to keep the total afferent current within the dynamic range of the neuron (whose order of magnitude is given by the distance between reset and threshold). As we show below,
different scaling schemes of J with C lead to different relative magnitudes of the mean and fluctuations of the afferent current into the cells in the extensive limit, and this in turn determines the type of steady-state solutions for the network. We thus proceed to analyze the expressions for the mean and variance of the afferent current (see equations 2.5 and 2.6) under different scaling assumptions. We consider multicolumnar networks in which the C afferent connections to a given cell come from Nc different “columns” (each contributing Cc connections, so that C = Nc Cc ). Each column is composed of an excitatory and an inhibitory subpopulation. The multicolumnar structure of the network is specified by the following scaling relationships, Nc ≡ nc C α 1−α Cc ≡ n−1 , c C
with 0 ≤ α ≤ 1 and nc order one, that is, independent of C. The case α = 0 corresponds to a finite number nc of columns, each contributing a number of connections of order C. The case α = 1 corresponds to an extensive number of columns, each contributing a number of connections of order one—that is, a fixed number as the total number of connections C grows. Although connection strengths between the different subpopulations can all be different, we assume that they can be classified into two types according to their scaling with C: those between cells within the same column, of strength J w , and those between cells belonging to different columns, of strength J b (the scaling is assumed to be the same for excitatory and for inhibitory connections). We define J w ≡ jw C −αw J b ≡ jb C −αb where αw , αb > 0 and the j’s are all order one. In these conditions, the afferent current to the excitatory or inhibitory cells (it does not matter which, for this analysis) from their own subpopulation is characterized by µin = Cc [J Ew ν Ein − J Iw ν Iin ] 1−α−αw = C 1−α−αw [ j Ew ν Ein − j Iw ν Iin ]n−1 f µin c ≡C
$$\sigma_{in}^2 = C_c\,[(J_E^w)^2 \nu_E^{in} (CV_E^{in})^2 + (J_I^w)^2 \nu_I^{in} (CV_I^{in})^2] = C^{1-\alpha-2\alpha_w}\,[(j_E^w)^2 \nu_E^{in} (CV_E^{in})^2 + (j_I^w)^2 \nu_I^{in} (CV_I^{in})^2]\,n_c^{-1} \equiv C^{1-\alpha-2\alpha_w} f_\sigma^{in},$$
where the f ’s are linear combinations of rates and CVs weighted by factors
of order one. We proceed by assuming that all other columns are in the same state ν_{E,I}^{out}, CV_{E,I}^{out}, so that the current to the cells in the column under focus from the rest of the network is characterized by
$$\mu_{out} = (N_c-1)\,C_c\,[J_E^b \nu_E^{out} - J_I^b \nu_I^{out}] = C^{1-\alpha_b}\left(1 - n_c^{-1}C^{-\alpha}\right)[j_E^b \nu_E^{out} - j_I^b \nu_I^{out}] \equiv C^{1-\alpha_b}\left(1 - n_c^{-1}C^{-\alpha}\right) f_\mu^{out}$$
$$\sigma_{out}^2 = (N_c-1)\,C_c\,[(J_E^b)^2 \nu_E^{out}(CV_E^{out})^2 + (J_I^b)^2 \nu_I^{out}(CV_I^{out})^2] = C^{1-2\alpha_b}\left(1 - n_c^{-1}C^{-\alpha}\right)[(j_E^b)^2 \nu_E^{out}(CV_E^{out})^2 + (j_I^b)^2 \nu_I^{out}(CV_I^{out})^2] \equiv C^{1-2\alpha_b}\left(1 - n_c^{-1}C^{-\alpha}\right) f_\sigma^{out}.$$
In the extensive limit, the terms $(1 - n_c^{-1}C^{-\alpha})$ become equal to one if α > 0 and are of order one if α = 0, in which case they can be included as an extra multiplicative factor in the f terms. We thus omit them from now on. In addition to their recurrent inputs, cells receive a similar number of external excitatory inputs as well, but since we are interested in the generation of irregularity by the network, we will assume this external drive to be deterministic, that is, characterized by
$$\mu_{ext} = C\, J_{ext}\, \nu_{ext} = C^{1-\alpha_{ext}}\, j_{ext}\, \nu_{ext} \equiv C^{1-\alpha_{ext}} f_{ext},$$
with $J_{ext} = j_{ext} C^{-\alpha_{ext}}$ and α_ext > 0. The scaling with C of the different components of the total afferent current is thus given by
$$\mu_{in} = C^{1-\alpha-\alpha_w} f_\mu^{in}, \qquad \mu_{out} = C^{1-\alpha_b} f_\mu^{out},$$
$$\sigma_{in}^2 = C^{1-\alpha-2\alpha_w} f_\sigma^{in}, \qquad \sigma_{out}^2 = C^{1-2\alpha_b} f_\sigma^{out},$$
$$\mu_{ext} = C^{1-\alpha_{ext}} f_{ext}.$$
If α, α_b, α_w are such that the variances vanish as C → ∞, the corresponding networks will consist of regularly spiking neurons. Since we are interested in irregular spiking, we therefore look for solutions in which σ_in², σ_out², or both remain order one in the C → ∞ limit. There are several ways to achieve this.

3.1.1 Scenario 1: Homogeneous Balanced Network. This case is associated with the choice α = 0. In this case, the size of the columns is of the same order as the size of the whole network (i.e., the number of columns, n_c, is order one), and the in and out quantities become equivalent. A finite variance is achieved by setting α_w = α_b = 1/2, that is, J ∝ 1/√C. This scenario is equivalent to the network studied originally in Tsodyks and Sejnowski (1995) and van Vreeswijk and Sompolinsky (1996, 1998). In such
a network, the mean input from the recurrent network grows as the square root of the number of inputs,
$$\mu_{in} + \mu_{out} = \sqrt{C}\,\left(f_\mu^{in} + f_\mu^{out}\right).$$
This quantity can be positive or negative depending on the excitation-inhibition balance in the network. The overall mean input into the neurons is obtained by adding the external input: μ = μ_in + μ_out + μ_ext. In order not to saturate the dynamic range of the cell in the extensive limit, the overall mean current into the neurons should remain of order one as C → ∞. Hence, we require
$$\mu = \sqrt{C}\left[\,f_\mu^{in} + f_\mu^{out} + f_{ext}\, C^{\frac{1}{2}-\alpha_{ext}}\right] \sim O(1), \tag{3.1}$$
which is possible only if the term in square brackets vanishes as 1/√C. If the synapses from the external inputs vanish like 1/C (α_ext = 1), then the external input has a negligible contribution to the overall mean input to the cells. In this case, since both f_μ^in and f_μ^out are linear combinations of the firing rates of the neurons inside and outside the column under focus, equation 3.1 effectively becomes, in the extensive limit, a set of linear homogeneous equations for the activity of the different columns (note that although we have, for brevity, written only one, there are four equations like equation 3.1, for the excitatory and inhibitory subpopulations inside and outside the column under focus). Thus, unless the matrix of coefficients of the firing rates in f_μ^in and f_μ^out for the excitatory and inhibitory subpopulations is degenerate (i.e., not full rank), the only solution of equation 3.1 in the extensive limit is a silent, zero-rate network (van Vreeswijk & Sompolinsky, 1998). On the other hand, if α_ext = 1/2, the linear system defined by equations 3.1 in the extensive limit is no longer homogeneous. Hence, in the general case (except for degeneracies), if α_ext = 1/2, there is a single self-consistent solution for the firing rates in the network, in which the activity in each subpopulation is proportional to the external drive ν_ext (van Vreeswijk & Sompolinsky, 1998). This highlights the importance of a powerful external excitatory drive. When α_ext = 1/2, the external drive by itself would drive the cells to saturation if the recurrent connections were inactivated. In the presence of the recurrent connections, the activities in the excitatory and inhibitory subpopulations adjust themselves to compensate this massive external drive. The firing rates in the self-consistent solution correspond to the only way in which this compensation can occur for all the subpopulations at the same time.
It follows that, inasmuch as the different inputs to the neuron combine linearly, bistability in a large, homogeneous balanced network is not a robust property: it requires a degenerate connectivity matrix, and hence some kind of fine-tuning mechanism.
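In the α_ext = 1/2 case, the content of equation 3.1 can be illustrated with a toy two-population version of the balance condition. The coupling values below are hypothetical, chosen only so that the solution is positive; the point is that the rates come out as a unique linear function of ν_ext (and that a degenerate coupling matrix would leave only the silent solution).

```python
# Sketch of the balance condition (eq. 3.1) for one excitatory and one
# inhibitory population; coupling values are hypothetical illustrations.
# Balance requires the O(sqrt(C)) part of the mean input to vanish:
#   j_EE*nu_E - j_EI*nu_I + j_Eext*nu_ext = 0
#   j_IE*nu_E - j_II*nu_I + j_Iext*nu_ext = 0

def balanced_rates(nu_ext, j_EE=1.0, j_EI=2.0, j_IE=1.0, j_II=1.8,
                   j_Eext=1.0, j_Iext=0.8):
    """Solve the 2x2 balance equations by Cramer's rule.

    If det == 0 (degenerate coupling matrix), only the silent
    solution exists, as discussed in the text.
    """
    det = j_EE * (-j_II) - (-j_EI) * j_IE
    b_E, b_I = -j_Eext * nu_ext, -j_Iext * nu_ext
    nu_E = (b_E * (-j_II) - (-j_EI) * b_I) / det
    nu_I = (j_EE * b_I - b_E * j_IE) / det
    return nu_E, nu_I
```

Doubling ν_ext doubles both rates: the balanced steady state is linear in the external drive, as in van Vreeswijk and Sompolinsky (1998).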
3.1.2 Scenario 2: Homogeneous Multicolumnar Balanced Network. Since the linearity condition that results in a unique solution follows from the mean current to the cells from within the column growing with C, we impose that μ_in ∼ O(1), that is, 1 − α − α_w = 0. If this is the case, the variance coming from within the column goes as C^{α−1}. We consider first the case α < 1. In these conditions, the variance coming from the activity inside the column vanishes for large C. To keep the total variance finite, we set α_b = 1/2. If we also choose α = 1/2, then α_w = α_b = 1/2, so the network is homogeneous in that all the connection strengths scale similarly with C regardless of whether they connect cells belonging to the same or different columns. Since α = 1/2, there are many columns in the network, and the number of connections coming from inside a particular column is a very small fraction (which decreases like 1/√C) of the total number of afferent connections to the cell. The fact that α_b = 1/2 implies that the mean current coming from outside the column will still grow like √C. Thus, in order for the cells not to saturate, the excitation and the inhibition from outside the column have to balance precisely; the steady state of the rest of the network becomes, again, a unique, linear function of the external input to the cells (where again we choose α_ext = 1/2 to avoid the quiescent state). However, now the mean current coming from inside the column is independent of C, so the steady-state activity inside the column is not determined by a set of linear equations. Instead, it should be determined self-consistently using the nonlinear transfer function in equation 2.3, which, in principle, permits bistability. This scenario is, in fact, equivalent to the one studied in Brunel (2000b), where a systematic analysis of the conditions in which bistability in a network like this can exist has been performed.
(Although no explicit scaling of the synaptic connection strengths with C was assumed in Brunel, 2000b, the essential fact that the total variance of the input to the cells in the subpopulation that supports bistability is constant is considered in that article.) As will be shown in detail below, the fact that the potential multiple steady-state solutions in this scenario differ only in the mean current to the cells, not in their variance (which is fixed by the balance condition on the rates outside the column), leads necessarily to a lower (in general, significantly lower) CV of the activity of the cells in the elevated persistent activity state. Therefore, in a network with J ∝ 1/√C scaling, bistability is possible in small subsets of neurons comprising a fraction ∝ 1/√C of the total number of connections per cell, but the elevated persistent activity states are characterized by a change in the mean drive to the cells at constant variance, and, as we show below, this leads to a significant decrease in the spiking irregularity in the elevated persistent activity states.

3.1.3 Scenario 3: Heterogeneous Multicolumnar Network. In order for the CV in the elevated persistent activity state to remain close to one, the variance of the afferent current to the cells inside the column should depend
on their own activity. Thus, in addition to the condition 1 − α − α_w = 0 necessary for bistability, we have to impose that σ_in² ∝ C^{α−1} be independent of C, that is, α = 1, which implies α_w = 0. In these conditions, the extensive number of connections per cell come from an extensive number of columns, with the number of connections from each column remaining finite. The α_w = 0 condition reflects the fact that since cells receive only a finite number of intracolumnar connections, the strength of these connections does not need to scale in any specific way with C. As for the activity outside the column, one could now, in principle, choose either J^b ∝ 1/C or J^b ∝ 1/√C (corresponding to α_b = 1, 1/2, respectively), since there is already a finite amount of variance coming from within the column. In the first case, the rest of the network contributes only a noiseless deterministic current whose exact amount has to be determined self-consistently, and in the second it contributes to both the total mean and variance of the afferent current to the neurons in the column. In this last case, as in the previous two scenarios, the J^b ∝ 1/√C scaling results in the need for balance between the total excitation and inhibition outside the column, which (again, if α_ext = 1/2) leads to a unique solution for the activity of the rest of the population, linear in the external drive to these neurons. In this scenario, the network is heterogeneous, since the strength of the connections from neurons within the same column is larger than that of connections from neurons in other columns. Since the rate and CV of the cells inside the column have to be determined self-consistently in this case, we proceed to a systematic quantitative analysis of this scenario in the next section. From the scaling considerations described in this section, it is already clear, though, that a potential bistable solution with high CV is possible only in a small network.
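The exponent bookkeeping behind the three scenarios above can be summarized in a few lines of code. The formulas are exactly those derived earlier in this section; an exponent of zero means the corresponding input statistic stays order one as C → ∞, a negative exponent means it vanishes, and a positive one means it grows (and must be canceled by balance).

```python
# Exponents of C for the input statistics derived above:
#   mu_in  ~ C^(1 - alpha - alpha_w)    var_in  ~ C^(1 - alpha - 2*alpha_w)
#   mu_out ~ C^(1 - alpha_b)            var_out ~ C^(1 - 2*alpha_b)

def input_exponents(alpha, alpha_w, alpha_b):
    return {
        "mu_in": 1 - alpha - alpha_w,
        "var_in": 1 - alpha - 2 * alpha_w,
        "mu_out": 1 - alpha_b,
        "var_out": 1 - 2 * alpha_b,
    }

scenario1 = input_exponents(alpha=0.0, alpha_w=0.5, alpha_b=0.5)  # homogeneous balanced
scenario2 = input_exponents(alpha=0.5, alpha_w=0.5, alpha_b=0.5)  # multicolumnar balanced
scenario3 = input_exponents(alpha=1.0, alpha_w=0.0, alpha_b=0.5)  # heterogeneous
```

In scenario 1 the mean grows like √C (exponent 1/2) and must be balanced away; in scenario 2 the intracolumnar variance vanishes (exponent −1/2), fixing the variance from outside; only in scenario 3 do both the intracolumnar mean and variance stay order one.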
3.2 Types of Bistable Solutions in a Reduced Network. In this section, we consider the network described in scenario 3 in the previous section, with the choice α_b = 1/2, and analyze the types of steady-state solutions for the activity in a particular column of finite size. The rest of the network is in a balanced state, and its activity is completely decoupled from the activity of the column, which is too small to make a difference in the overall input to the rest of the cells. For our present purposes, all that matters about the afferents from outside the column (from both the rest of the network and the external inputs) is that they provide a finite net input to the cells in the column. We denote the mean and variance of that fixed external current by μ_{E,I}^{ext}/τ_m and 2(σ_{E,I}^{ext})²/τ_m, where the factors with the membrane time constant τ_m have been included so that μ_{E,I}^{ext} and (σ_{E,I}^{ext})² represent the contribution to the mean and variance of the postsynaptic depolarization (in the absence of threshold) in the steady states arising from outside the column.
The net input to the excitatory and inhibitory populations in the column under focus is thus characterized by
$$\mu_E = c_{EE}\, j_{EE}\, \nu_E - c_{EI}\, j_{EI}\, \nu_I + \frac{\mu_E^{ext}}{\tau_m}$$
$$\mu_I = c_{IE}\, j_{IE}\, \nu_E - c_{II}\, j_{II}\, \nu_I + \frac{\mu_I^{ext}}{\tau_m}$$
$$\sigma_E^2 = c_{EE}\, j_{EE}^2\, \nu_E\, CV_E^2 + c_{EI}\, j_{EI}^2\, \nu_I\, CV_I^2 + \frac{(\sigma_E^{ext})^2}{\tau_m/2}$$
$$\sigma_I^2 = c_{IE}\, j_{IE}^2\, \nu_E\, CV_E^2 + c_{II}\, j_{II}^2\, \nu_I\, CV_I^2 + \frac{(\sigma_I^{ext})^2}{\tau_m/2},$$
where the number-of-connection and connection-strength parameters c and j are all order one. We proceed by simplifying this scheme further in order to reduce the dimensionality of the system from four to two, which will allow a systematic exploration of the effect of all the parameters on the type of steady-state solutions of the network. In particular, we make the inputs to the excitatory and inhibitory populations identical,
$$c_{IE} = c_{EE} \equiv c_E, \qquad c_{II} = c_{EI} \equiv c_I, \qquad j_{IE} = j_{EE} \equiv j_E, \qquad j_{II} = j_{EI} \equiv j_I,$$
$$\mu_E^{ext} = \mu_I^{ext} \equiv \mu^{ext}, \qquad (\sigma_E^{ext})^2 = (\sigma_I^{ext})^2 \equiv (\sigma^{ext})^2,$$
so that the whole column becomes statistically identical: ν_E = ν_I ≡ ν and CV_E = CV_I ≡ CV. For simplicity, we also assume that the number of excitatory and inhibitory inputs is the same: c_E = c_I = c. Thus, we are left with a system with four parameters,
$$c_\mu = c\,(j_E - j_I); \qquad c_\sigma = \sqrt{c\,(j_E^2 + j_I^2)}; \qquad \mu^{ext}; \qquad \sigma^{ext}, \tag{3.2}$$
all with units of mV, and two dynamical variables (from equation 2.7)
$$\frac{d\mu_V}{dt} = -\frac{\mu_V}{\tau_m} + \mu; \qquad \frac{d\sigma_V^2}{dt} = -\frac{\sigma_V^2}{\tau_m/2} + \sigma^2,$$
where
$$\mu = c_\mu\, \nu + \frac{\mu^{ext}}{\tau_m}; \qquad \sigma^2 = c_\sigma^2\, \nu\, CV^2 + \frac{(\sigma^{ext})^2}{\tau_m/2} \tag{3.3}$$
Figure 2: Mean firing rate ν (left), CV (middle), and product νCV2 (right) of the LIF neuron as a function of the mean and standard deviation of the depolarization. Parameters: Vth = 20 mV, Vres = 10 mV, τm = 10 ms, and τref = 2 ms.
and
$$\nu(\mu_V, \sigma_V)^{-1} = \tau_{ref} + \tau_m \sqrt{\pi} \int_{\frac{V_{res}-\mu_V}{\sigma_V\sqrt{2}}}^{\frac{V_{th}-\mu_V}{\sigma_V\sqrt{2}}} dx\; e^{x^2}\,[1 + \mathrm{erf}(x)] \tag{3.4}$$
$$CV^2(\mu_V, \sigma_V) = 2\pi \nu^2 \tau_m^2 \int_{\frac{V_{res}-\mu_V}{\sigma_V\sqrt{2}}}^{\frac{V_{th}-\mu_V}{\sigma_V\sqrt{2}}} dx\; e^{x^2} \int_{-\infty}^{x} dy\; e^{y^2}\,[1 + \mathrm{erf}(y)]^2 \tag{3.5}$$
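For concreteness, equations 3.4 and 3.5 can be evaluated by direct numerical quadrature. The sketch below uses plain trapezoidal rules with the Figure 2 parameters; the helper names and quadrature resolutions are ours. Two numerical details are worth noting: at large negative arguments the direct product e^{x²}[1 + erf(x)] overflows in floating point, so asymptotic forms are substituted there, and the τ_m² factor in equation 3.5 is the one required for dimensional consistency.

```python
import math

V_TH, V_RES = 20.0, 10.0       # mV  (Figure 2 parameters)
TAU_M, TAU_REF = 0.010, 0.002  # s
SQRT_PI = math.sqrt(math.pi)

def _trapz(f, a, b, n=400):
    """Plain trapezoidal quadrature of f over [a, b]."""
    h = (b - a) / n
    return h * (0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n)))

def _g(x):
    """exp(x^2)*(1+erf(x)); for x << 0, exp(x^2)*erfc(-x) ~ 1/(|x|*sqrt(pi))."""
    if x < -6.0:
        return 1.0 / (-x * SQRT_PI)
    return math.exp(x * x) * (1.0 + math.erf(x))

def rate(mu_V, sigma_V):
    """Mean firing rate of the LIF neuron, eq. 3.4 (in Hz)."""
    a = (V_RES - mu_V) / (sigma_V * math.sqrt(2.0))
    b = (V_TH - mu_V) / (sigma_V * math.sqrt(2.0))
    return 1.0 / (TAU_REF + TAU_M * SQRT_PI * _trapz(_g, a, b))

def cv2(mu_V, sigma_V, n=200):
    """Squared CV of the ISI distribution, eq. 3.5."""
    a = (V_RES - mu_V) / (sigma_V * math.sqrt(2.0))
    b = (V_TH - mu_V) / (sigma_V * math.sqrt(2.0))
    def inner(y):  # exp(y^2)*(1+erf(y))^2; negligible below y = -6
        return math.exp(y * y) * (1.0 + math.erf(y)) ** 2
    def outer(x):
        if x <= -6.0:
            # leading asymptotics of exp(x^2) times the inner integral
            return 1.0 / (2.0 * math.pi * abs(x) ** 3)
        return math.exp(x * x) * _trapz(inner, -6.0, x, n)
    nu = rate(mu_V, sigma_V)
    return 2.0 * math.pi * nu ** 2 * TAU_M ** 2 * _trapz(outer, a, b, n)
```

With these parameters, a strongly suprathreshold input (μ_V = 30 mV, σ_V = 1 mV) yields a rate near 110 Hz with CV ≈ 0.1 (mean driven), while a subthreshold, noisy input (μ_V = 10 mV, σ_V = 5 mV) yields a few Hz with CV ≈ 1 (fluctuation driven), reproducing the two regimes of Figure 2.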
The parameters c_μ and c_σ² measure the strength of the feedback that the activity in the column produces on the mean and variance of the current to the cells. c_μ can be less than, equal to, or greater than zero. A value larger (smaller) than zero implies that the activity in the column has a net excitatory (inhibitory) effect on the neurons. In general, we assume the positive parameter c_σ² to be independent of c_μ (implying that the recurrent feedback on the mean and on the variance can be manipulated independently). Note, however, that since j_I/j_E > 0, c_σ² cannot be arbitrarily small; specifically, c_σ² > c_μ²/c. Equations 3.4 and 3.5 are plotted as a function of μ_V and σ_V in Figure 2.

3.2.1 Mean- and Fluctuation-Driven Bistability in the Reduced System. The nullclines for the two equations 3.3 are given by
$$\nu(\mu_V, \sigma_V) = \frac{\mu_V - \mu^{ext}}{\tau_m\, c_\mu}; \qquad \nu(\mu_V, \sigma_V)\, CV^2(\mu_V, \sigma_V) = \frac{\sigma_V^2 - (\sigma^{ext})^2}{(\tau_m/2)\, c_\sigma^2}.$$
The values of (μ_V, σ_V) that satisfy these equations are shown in Figure 3 for several values of the parameters c_μ, c_σ. The nullclines for the mean (see Figure 3, left) are the projection onto the (μ_V, σ_V) plane of the intersection of the surface in Figure 2, left, with a plane parallel to the σ_V axis, shifted by μ^ext and tilted with slope 1/(τ_m c_μ). Since the mean firing
Figure 3: Nullclines of the equation for the mean μ_V (left) and standard deviation σ_V (right). The nullcline for the mean depends only on c_μ, and the one for the standard deviation depends only on c_σ.
rate as a function of the average depolarization changes curvature (it has an inflection point near threshold, where firing changes from being driven by fluctuations to being driven by the mean), the nullcline for the mean has a "knee" when the net feedback c_μ is large enough and excitatory. Similarly, the nullclines for the standard deviation of the depolarization (see Figure 3, right) are the projection onto the (μ_V, σ_V) plane of the intersection of the surface in Figure 2, right, with a parabolic surface parallel to the μ_V axis, shifted by (σ^ext)² and with curvature 2/(τ_m c_σ²). Again, this curve can display a "knee" for high enough values of the net strength of the feedback onto the variance, c_σ². The fixed points of the system are given by the points of intersection of the two nullclines. We now show, through two particular examples, the main result of this letter: depending on the degree of balance between excitation and inhibition, two types of bistability can exist: mean driven and fluctuation driven.

Mean-Driven Bistability. Figure 4 shows a particular example of the type of bistability obtained for low to moderate values of c_σ and moderate to high values of c_μ. Figure 4a shows the time evolution of the firing rate (bottom) and CV (top) in the network when the external drive to the cells is transiently elevated. In response to the transient input, the network switches from a low-rate, high-CV basal state into an elevated activity state. For this particular type of bistability, the CV in this state is low. The nullclines for this example are shown in Figure 4b. The σ_V nullcline is essentially parallel to the μ_V axis, and it intercepts the μ_V nullcline (which has a pronounced "knee") at three points: one below (stable), one around (unstable), and one above
Figure 4: Example of mean-driven bistability. (a) CV (top) and firing rate (bottom) in the network as a function of time. Between t = 0 s and t = 0.5 s (dashed lines), the mean of the external drive to the neurons was elevated from 18 mV to 19 mV, causing the network to switch to its elevated activity fixed point. In this fixed point, the CV is low. (b) Nullclines for this example. The two stable fixed points differ primarily in the mean current that the cells are receiving, with an essentially constant variance. Hence, the CV in the elevated persistent-activity fixed point is low. Parameters: µext = 18 mV, σ ext = 0.65 mV, c µ = 7.2 mV, c σ = 1 mV. Dotted line: neuronal threshold.
(stable) threshold. The stable fixed point below threshold corresponds to the state of the system before the external input is transiently elevated in Figure 4a. It is typically characterized by a low rate and a relatively high CV, as subthreshold spiking is fluctuation driven and thus irregular. However, since the CV drops fast for µV > Vth at the values of σV at which the µV
nullcline bends (see Figure 2, middle), the CV in the suprathreshold fixed point is typically low (but see section 4). Fluctuations in the current to the cells play little role in determining the spike times in this elevated persistent activity state. Qualitatively, this is the type of excitation-driven bistability that has been thoroughly analyzed over the past few years. It is expected to be present in networks in which small populations of excitatory cells can be bistable in the presence of global inhibition by virtue of selective synaptic potentiation. It is also expected to be present in larger populations if the fluctuations arising from the recurrent connections are weak compared to those coming from external afferents, for instance, due to synaptic filtering (Wang, 1999).

Fluctuation-Driven Bistability. If the connectivity is such that the mean drive to the neurons is only weakly dependent on the activity, that is, c_μ is small, but at the same time the activity has a strong effect on the variance, that is, c_σ is large, the system can also have two stable fixed points, as shown in the example in Figure 5 (same format as in the previous figure). In this situation, however, the two fixed points are subthreshold, and they differ primarily in the variance of the current to the cells. Hence, spiking in both fixed points is fluctuation driven, and the CV is high in both of them; in particular, it is slightly higher in the elevated activity fixed point (see Figure 5a). This type of bistability can be realized only if there is a precise balance between the net excitatory and inhibitory drive to the cells. Since c_σ must be large in order for the σ_V nullcline to display a "knee," both the net excitation and inhibition should be large, and in these conditions a small c_μ can be achieved only if the balance between the two is precise.
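The mean-driven switch of Figure 4 can be reproduced with a few lines of explicit Euler integration. Because the σ_V nullcline in that example is nearly flat, the sketch below freezes σ_V at 0.65 mV and integrates only the μ_V equation; the parameter values are those quoted in Figure 4, while the time step and quadrature resolution are our own choices.

```python
import math

# Reduced-system parameters from Figure 4 (mean-driven bistability);
# sigma_V is frozen at its nearly constant nullcline value of 0.65 mV.
V_TH, V_RES = 20.0, 10.0       # mV
TAU_M, TAU_REF = 0.010, 0.002  # s

def lif_rate(mu_V, sigma_V, n=150):
    """LIF transfer function, eq. 3.4, by trapezoidal quadrature."""
    def g(x):
        if x < -6.0:
            # asymptotic form of exp(x^2)*(1+erf(x)); avoids overflow
            return 1.0 / (-x * math.sqrt(math.pi))
        return math.exp(x * x) * (1.0 + math.erf(x))
    a = (V_RES - mu_V) / (sigma_V * math.sqrt(2.0))
    b = (V_TH - mu_V) / (sigma_V * math.sqrt(2.0))
    h = (b - a) / n
    integral = h * (0.5 * (g(a) + g(b)) + sum(g(a + i * h) for i in range(1, n)))
    return 1.0 / (TAU_REF + TAU_M * math.sqrt(math.pi) * integral)

def simulate(mu_ext_of_t, T, dt=1e-3, c_mu=7.2, sigma_V=0.65, mu_V=18.0):
    """Euler integration of d(mu_V)/dt = (-mu_V + tau_m*c_mu*nu + mu_ext)/tau_m."""
    for k in range(int(T / dt)):
        nu = lif_rate(mu_V, sigma_V)
        mu_V += dt * (-mu_V + TAU_M * c_mu * nu + mu_ext_of_t(k * dt)) / TAU_M
    return lif_rate(mu_V, sigma_V)

# Relaxation at mu_ext = 18 mV settles into the low-rate state...
nu_low = simulate(lambda t: 18.0, T=1.0)
# ...while a 0.5 s elevation to 19 mV switches the system to the high-rate
# state, where it remains after the transient ends (cf. Figure 4a).
nu_high = simulate(lambda t: 19.0 if t < 0.5 else 18.0, T=2.0)
```

The two final rates differ by more than an order of magnitude even though the external drive is identical at the end of both runs, which is the operational signature of the bistability.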
This need for precise balance suggests that the fluctuation-driven regime will be sensitive to changes in the parameters determining the connectivity; that is, it will require fine-tuning, a conclusion supported by the analysis below. Mean- and fluctuation-driven bistability are not discrete phenomena. Depending on the values of the parameters, the elevated activity fixed point can rotate in the (μ_V, σ_V) plane, spanning values intermediate between those shown in the examples in Figures 4 and 5. We thus now proceed to a systematic characterization of all possible behaviors of the reduced system as a function of its four parameters.

3.2.2 Effect of the External Current. Since c_σ² > 0, the σ_V nullcline always bends upward (see Figure 3); that is, the values of σ_V on the nullcline are always larger than σ^ext. Assuming for simplicity that c_σ can be arbitrarily low, this implies that no bistability can exist unless the external variance is low enough. In particular, for every value of the external mean μ^ext, there is a critical value σ_c1^ext, defined as the value of σ_V at which the first two derivatives of the μ_V nullcline vanish (see Figure 6, middle). For σ^ext > σ_c1^ext, the two nullclines can cross only once, and therefore no bistability of any
Figure 5: Example of fluctuation-driven bistability. (a) CV (top) and firing rate (bottom) in the network as a function of time. Between t = 0 s and t = 0.5 s (dashed lines), the standard deviation of the external drive to the neurons was elevated from 5 mV to 7 mV, causing the network to switch to its elevated activity fixed point. In this fixed point, the CV is slightly higher than in the basal state. (b) Nullclines for this example. The two stable fixed points differ primarily in the variance of the current that the cells are receiving, with little change in the mean. Hence, the CV in the elevated persistent-activity fixed point is slightly higher than in the low-activity fixed point; that is, the CV increases with the rate. Parameters: µext = 5 mV, σ ext = 5 mV, c µ = 5 mV, c σ = 20.2 mV. Dotted line: Neuronal threshold.
kind is possible in the reduced system (see Figure 6, left). For values of σ ext only slightly lower than this critical value, the jump between the low- and the high-activity stable fixed points in the (µV , σV ) plane is approximately horizontal, so the type of bistability obtained is mean driven. For lower
Figure 6: The external variance determines the existence and types of bistability possible in the network. For σ^ext > σ_c1^ext (left), no bistability is possible. σ^ext = σ_c1^ext marks the onset of bistability (middle). At σ^ext = σ_c2^ext, bistability becomes possible in a perfectly balanced network (a network with c_μ = 0) (right).
values of the external variance, a point is eventually reached at which bistability becomes possible in a perfectly balanced network. Again, for each μ^ext, one can define a second critical value σ_c2^ext in the following way: σ_c2^ext is the value of σ^ext at which the point where the derivative of the σ_V nullcline becomes infinite occurs at a value of μ_V equal to μ^ext (see Figure 6, right). For values of σ^ext < σ_c2^ext, bistability is possible in networks in which the net recurrent feedback is inhibitory. Since both critical values of σ^ext are functions of the external mean, they define curves in the (μ^ext, σ^ext) plane. These curves are plotted in Figure 7. Both are decreasing functions of μ^ext and meet at threshold, implying that bistability in the reduced network is possible only for subthreshold mean external inputs (see section 4).

3.2.3 Phase Diagrams of the Reduced Network. For each point in the (μ^ext, σ^ext) plane, the external current is completely characterized, and the only two parameters left to be specified are c_μ and c_σ. In particular, in the regions where bistability is possible, it will exist only for appropriate values of c_μ and c_σ. The two insets in Figure 7 show phase diagrams in the (c_μ, c_σ) plane with the regions of bistability at two representative points: one in which σ_c2^ext < σ^ext < σ_c1^ext, where bistability is possible only in excitation-dominated networks (top-right inset), and one in which σ^ext < σ_c2^ext, where bistability is possible in both excitation- and inhibition-dominated networks (bottom-left inset). In this latter case, the region enclosed by the curve in which bistability can exist stretches to the left, including the region with c_μ ≤ 0.
We have further characterized the nature of the fixed-point solutions in these two cases by plotting the rate and CV at each point in the (c_μ, c_σ) plane at which bistability is possible, as well as the ratio between the rate and CV in the high- versus low-activity states. Instead of showing this in the
Figure 7: Phase diagram with the effect of the mean and variance of the external current on the existence and types of bistability in the network. The two insets represent the regions of bistability in the (c_μ, c_σ) plane at the corresponding points in the (μ^ext, σ^ext) plane. Fluctuation-driven bistability is possible only near and below the lower critical line σ^ext = σ_c2^ext(μ^ext). Top-right and bottom-left insets correspond to μ^ext = 10 mV, σ^ext = 4 mV and μ^ext = 10 mV, σ^ext = 3 mV, respectively.
(c_μ, c_σ) plane, we have inverted equations 3.2 to show (assuming a constant c = 100) the results as a function of the unitary EPSP j_E and of the ratio of the unitary inhibitory to excitatory PSPs j_I/j_E, which measures the degree of balance in the network; that is, j_I/j_E = 1 implies a perfect balance:
$$j_E = \frac{c_\mu + \sqrt{2c\, c_\sigma^2 - c_\mu^2}}{2c}; \qquad j_I/j_E = \frac{\sqrt{2c\, c_\sigma^2 - c_\mu^2} - c_\mu}{\sqrt{2c\, c_\sigma^2 - c_\mu^2} + c_\mu}. \tag{3.6}$$
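Equations 3.6 simply invert the definitions 3.2, as a round-trip check makes explicit (c = 100 as in Figure 8; the sample PSP values below are arbitrary illustrations):

```python
import math

def couplings(j_E, j_I, c=100):
    """Forward map of eq. 3.2: (j_E, j_I) -> (c_mu, c_sigma)."""
    return c * (j_E - j_I), math.sqrt(c * (j_E ** 2 + j_I ** 2))

def psp_sizes(c_mu, c_sigma, c=100):
    """Inverse map of eq. 3.6: (c_mu, c_sigma) -> (j_E, j_I/j_E)."""
    root = math.sqrt(2 * c * c_sigma ** 2 - c_mu ** 2)
    return (c_mu + root) / (2 * c), (root - c_mu) / (root + c_mu)

# Round trip: j_E = 0.5 mV, j_I = 0.4 mV -> (c_mu, c_sigma) -> back again.
c_mu, c_sigma = couplings(0.5, 0.4)
j_E, ratio = psp_sizes(c_mu, c_sigma)
```

The recovered j_E and j_I/j_E match the starting values, confirming that the two parameterizations describe the same family of networks.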
In Figure 8 we show the results for the case where the external variance is higher than σ_c2^ext, so that bistability is possible only if the net recurrent connectivity is excitatory. Overall, the bistable region in this space is elongated along a diagonal to the upper right: closer to the balanced region, the net excitatory drive (proportional to j_E) has to be larger for bistable solutions to exist. The low-activity fixed point (left column) is subthreshold, and thus spiking is fluctuation driven, characterized by a high CV. In this case, the high-activity fixed point is suprathreshold, so the CV in this fixed point is in general small (see the bottom right). Of course,
Figure 8: Bistability in an excitation-dominated network in the (j_E, j_I/j_E) plane. (Top and bottom left) Firing rate and CV in the low-activity fixed point. (Top and bottom right) Ratio of the firing rate and CV between the high- and low-activity fixed points. (Inset) Same as the bottom-right panel in the (c_μ, c_σ) plane. Parameters: c = 100, μ^ext = 10 mV, and σ^ext = 4 mV.
very close to the cusp, at the onset of bistability, the CV (and rate) in both fixed points is similar. In Figure 9 the same information is shown for the case where the external ext variance is lower than σc2 so that bistability is possible when the recurrent connectivity is dominated by excitation or inhibition. In this case, the region where the CV in the high- and low-activity fixed points is similar is larger, corresponding to situations in which j I /j E ∼ 1, that is, excitation and inhibition in the network are roughly balanced. Only in this region is the < 100 Hz. In this regime, firing rate in the high-activity state not too high, ∼ when excitation dominates, the rate in the high-activity state becomes very high. Note also that the transition between the relatively low-rate, high-CV regime and the very high-rate, low-CV regime at j I /j E ∼ 0.9 is relatively abrupt. Finally, we can use the relations 3.6 to study quantitatively the effect of the number of connections c on the regions of bistability, something we
Figure 9: Mean- and fluctuation-driven bistability in the (j_E, j_I/j_E) plane. Panels as in Figure 8. (Inset) Portion of the bistability region with fluctuation-driven fixed points in the (c_μ, c_σ) plane. When the network is approximately balanced, that is, j_I/j_E ∼ 1, the CV in the high-activity state is high. Parameters: c = 100, μ^ext = 10 mV, and σ^ext = 3 mV.
did in section 3.1 at a qualitative level based on scaling considerations. Figure 10 shows the effect of increasing the number of afferent connections per cell on the shape of the region of bistability for σ^ext = 4 mV (left) and for σ^ext = 3 mV (right). The results are clearer when shown in the plane (j_E c, j_I/j_E), where j_E c represents the net mean excitatory drive to the neurons. The range of values of the net excitatory drive in which bistability is allowed in the excitation-dominated regime, where j_I/j_E < 1, does not depend very strongly on c. However, for both σ^ext = 4 mV and σ^ext = 3 mV, when inhibition and excitation become more balanced, a higher net excitatory drive is needed. In particular, when σ^ext = 3 mV, the bistable region always includes the balanced network, j_I/j_E = 1, but the range of values of j_I/j_E ∼ 1 where bistability is possible (being in this case fluctuation driven) shrinks considerably. Thus, as noted in section 3.1, bistability in a large, balanced network requires a precise balance of excitation and inhibition.
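The fine-tuning requirement can be read directly off equations 3.6: holding the net feedback (c_μ, c_σ) fixed while increasing c forces j_I/j_E toward 1 like 1/√c. A quick check, using values in the rough range of the fluctuation-driven regime (c_μ = 5 mV, c_σ = 20 mV; these particular numbers are ours):

```python
import math

def required_ratio(c, c_mu=5.0, c_sigma=20.0):
    """j_I/j_E from eq. 3.6 at fixed net couplings (c_mu, c_sigma)."""
    root = math.sqrt(2 * c * c_sigma ** 2 - c_mu ** 2)
    return (root - c_mu) / (root + c_mu)

# Departure from perfect balance, 1 - j_I/j_E, as c grows.
imbalances = {c: 1.0 - required_ratio(c) for c in (100, 1000, 10000)}
```

The allowed departure from perfect balance shrinks roughly tenfold as c grows from 100 to 10,000 (a factor √100), in line with the narrowing of the j_I/j_E ∼ 1 band in Figure 10.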
Figure 10: Effect of the number of connections c on the phase diagram in the (j_E c, j_I/j_E) plane for the case where μ^ext = 10 mV and σ^ext = 4 mV (left) and σ^ext = 3 mV (right). Note that in the right panel, j_I/j_E = 1 is always in the region of bistability, but the range of values of j_I/j_E ∼ 1 in the bistable region decreases significantly with c.
The precise balance between excitation and inhibition required to obtain solutions with fluctuation-driven bistability is also evident when one analyzes the effect of changing the net input to the excitatory and inhibitory subpopulations within a column. In this case, the excitatory and inhibitory mean firing rate and CV become different. We have chosen to study the effect of different ratios of excitation and inhibition to the excitatory and inhibitory populations. In particular, defining
$$\gamma_E \equiv \frac{j_{EI}}{j_{EE}} \qquad \text{and} \qquad \gamma_I \equiv \frac{j_{II}}{j_{IE}},$$
we have considered the effect of having γ E = γ I while still considering that the excitatory connections to the excitatory and inhibitory populations are equal, that is, jEE = jIE ≡ j E . To proceed with the analysis, we started by specifying a point in the parameter space of the symmetric network in which excitation and inhibition were identical by choosing a value for (µext , σ ext , c µ , c σ ). Then, fixing c = 200, we used the relationships 3.6 to solve for j E and γ and, defining γ E ≡ γ , we found which values of γ I resulted in bistable solutions. Correspondingly, when γ I = γ E , the two subpopulations within the column become identical again. We performed this analysis for two initial sets of parameters of the symmetric network: one corresponding to mean-driven and the other to fluctuation-driven bi-stability. The results of this analysis are shown in Figure 11. The type of bistability does not change qualitatively depending on the value of γ I /γ E in the mean-driven case (left column). For the right
Figure 11: Effect of different levels of input to the excitatory and inhibitory subpopulations. The ratio between the inhibitory and excitatory connection strengths γ was allowed to be different for each subpopulation. Left and right columns correspond to mean- and fluctuation-driven bistability in the corresponding situation for the symmetric network. The network is bistable for values of γ I /γ E within the dashed lines. Parameters on the left column: µext = 18 mV, σ ext = 0.65 mV, c µ = 7.2 mV, c σ = 1 mV. Parameters on the right column: µext = 10 mV, σ ext = 3 mV, c µ = 0.5 mV, c σ = 19 mV.
column, however, the original fluctuation-driven regime is quickly abolished as γ I /γ E increases, leading to very high activity and low CV in the high-activity fixed point. Note that the size of the bistable region is also much smaller in this case.

3.3 Numerical Simulations. We conducted numerical simulations of our network to investigate whether the two types of bistable states that the mean-field analysis predicts, the mean-driven and the fluctuation-driven regimes, can be realized. In addition to the approximations that we are forced to make in order to be able to construct the mean-field theory itself, the more qualitative and robust result that fluctuation-driven bistable points require large fluctuations in relatively small networks with relatively large
synapses leads to the question of whether potential fixed points in this very noisy network will indeed be stable. We found that under certain conditions, both types of bistable points can be realized in numerical simulations. In Figure 12 we show an example of a network supporting bistability in the mean-driven regime. In this network, the recurrent connections are dominated by excitation, and the mean of the external drive leaves the neurons very close to threshold, with the fluctuations of this external drive being small. As expected, the irregularity in the elevated-activity fixed point is low, with a CV ∼ 0.2. The mean µV in this fixed point is above threshold. The mean-field theory predicts for the same network a rate of 46.7 Hz and a CV of 0.21. An example of another network showing bistability, this time in the fluctuation-driven regime, is shown in Figure 13. In this network, the recurrent connections are dominated by inhibition, and the external drive leaves the membrane potential relatively far from threshold on average but has large fluctuations. The spike trains in the elevated activity state are quite irregular, partly, but not only, due to large temporal fluctuations in the instantaneous network activity. Taking into account only consecutive ISIs, the temporal irregularity is still large: CV2 ∼ 0.8. The mean-field prediction for these parameters gives a rate of 91.5 Hz and a CV of 1.6. In order for the elevated activity states to be stable, in both the mean-driven and, especially, the fluctuation-driven regimes, we needed to use a wide distribution of synaptic delays in the recurrent connections: between 1 and 10 ms for Figure 12 and between 1 and 50 ms for Figure 13. Narrow distributions of delays lead to oscillations, which destabilize the stationary activity states. The emergence and properties of these oscillations in a network similar to the one we study here have been described in Brunel (2000a).
Although such long synaptic delays are not expected to be found in connections between neighboring local cortical neurons, our network is extremely simple and lacks many elements of biological realism that would work in the same direction as the wide distributions of delays. Among these are slow and saturating synaptic interactions (NMDA-mediated excitation; Wang, 1999) and heterogeneity in cellular and synaptic properties. The large and slow temporal fluctuations in the instantaneous rate in Figure 13 are due to the large fluctuations in the nearly balanced external and recurrent drive to the cells and the wide distribution of synaptic delays. These fluctuations lead to high trial-to-trial variability in the activity of the network, as shown in Figure 14. In this figure, we show nine trials with identical parameters to those of Figure 13, differing only in the seed of the random number generator. On each panel, the mean instantaneous activity across all nine trials (the thick line) is shown along with the activity in the individual trial. Sometimes the large fluctuations lead to the activity returning to the basal spontaneous state. Other times they provoke long-lasting periods of elevated firing (above average). Nevertheless, on a large fraction of the trials, a memory of the stimulus persists for several seconds.
Figure 12: Numerical simulations of a bistable network in the mean-driven regime. The rate of the external afferents was raised between 200 and 300 ms (vertical bars). (Top) Raster display of the activity of 200 neurons in the network. (Middle) Instantaneous network activity (temporal bin of 10 ms). The dashed line represents the average network activity during the delay period, 53.4 Hz. (Lower panels) Distribution across cells of the rate (left), CV (middle), and CV2 (right) during the delay period. The fact that the CV and CV2 are very similar reflects the stationarity of the instantaneous activity. Single-cell parameters as in the caption to Figure 2. The network consists of two populations of excitatory and inhibitory cells (1000 neurons each) connected at random with 0.1 probability. Delays are uniformly distributed between 1 and 10 ms. External spikes are all excitatory, with PSP size 0.09 mV. The external rate is 19.25 kHz. This leads to µext = 17.325 mV and σ ext = 0.883 mV. Recurrent EPSPs and IPSPs are 0.138 mV and −0.05 mV, respectively, leading to c µ = 8.8 mV and c σ = 1.468 mV.
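The two irregularity measures shown in these histograms can be computed directly from a spike train. The sketch below assumes the standard definitions: the global CV is the ratio of the ISI standard deviation to the ISI mean, and the local CV2 is the average of 2|ISIn+1 − ISIn |/(ISIn+1 + ISIn ) over consecutive ISI pairs (the measure of Holt et al., 1996); that this is exactly the CV2 estimator used in the figures is our assumption.

```python
import numpy as np

def cv_and_cv2(spike_times):
    """Global CV of the ISI distribution and the local CV2 measure,
    which compares only consecutive ISI pairs (Holt et al., 1996)."""
    isi = np.diff(np.asarray(spike_times, dtype=float))
    cv = isi.std() / isi.mean()
    cv2 = np.mean(2.0 * np.abs(isi[1:] - isi[:-1]) / (isi[1:] + isi[:-1]))
    return cv, cv2

# For a stationary Poisson train, both measures are close to 1.
rng = np.random.default_rng(0)
poisson_train = np.cumsum(rng.exponential(1.0 / 50.0, size=100_000))
```

For a stationary train the two measures agree, whereas slow rate modulations inflate the global CV but leave CV2 largely unaffected; this is why the similarity of the two distributions in the figure indicates stationarity of the instantaneous activity.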
Figure 13: Numerical simulations of a bistable network in the fluctuation-driven regime. Panels as in Figure 12. Parameters as in Figure 12 except the distribution of delays, which is uniform between 1 and 50 ms. External spikes are excitatory, with PSP size 1.85 mV and rate 0.78 kHz, and inhibitory, with PSP size −1.85 mV and rate 0.5 kHz. This leads to µext = 5.18 mV and σ ext = 4.68 mV. Recurrent EPSPs and IPSPs are 1.85 mV and −1.98 mV, respectively, leading to c µ = −13 mV and c σ = 27.1 mV.
Figure 14: Trial-to-trial variability in the fluctuation-driven regime. Each panel is a different repetition of the same trial, in a network identical to the one described in Figure 13. The thick line represents the average across all nine trials, and the thin line is the instantaneous network activity in the given trial. Vertical bars mark the time during which the rate of the external inputs is elevated.
In the mean-driven regime, the trial-to-trial variability is very low (not shown). We conclude that despite quantitative differences in the rate and CV between the mean-field theory and the simulations, it is possible, albeit difficult, to find both mean-driven and fluctuation-driven bistability in small networks of LIF neurons.

4 Discussion

In this letter, we have aimed at an understanding of the different ways in which a simple network of current-based LIF neurons can be organized in order to support bistability, the coexistence of two steady-state solutions for the activity of the network that can be selected by transient external stimulation. We have shown that in addition to the well-known case in which strong excitatory feedback can lead to bistability, bistability can also be obtained when the recurrent connectivity is nearly balanced, or even when its net effect is inhibitory, provided that an increase in the activity in
the network provides a large enough increase in the size of the fluctuations of the current afferent to the cells. When bistability is obtained in this fashion, the CV in both steady states is close to one, as found experimentally (see Figure 1; Compte et al., 2003). We have done a systematic analysis at the mean-field level (and a partial one through numerical simulations) of a reduced network where the activity in the excitatory and the inhibitory subpopulations was equal by construction (implying balance at the level of the output activity) and studied which types of bistable solutions are obtained depending on the level of balance in the currents (the parameter c µ ), that is, balance at the level of the inputs. This simple model allows for a complete understanding of the role of the different model parameters. The first phenomenon, which we have termed mean-driven bistability, can essentially be traced back to the shape of the curve relating the mean firing rate of the cell to the average current it is receiving (at a constant noise level; Brunel, 2000b), that is, the f − I curve. In order for bistability to exist, this curve should be a sigmoid, for which it is enough that the neurons possess a threshold and a refractory period. If, in addition, the low-activity fixed point is to have a nonzero activity (consistent with the fact that cells in the cortex fire spontaneously), then the neuron should display nonzero activity for subthreshold mean currents. This can be achieved if the current is noisy, where the noise is due to the spiking irregularity of the inputs to the cell. When this type of bistability is considered in a network of LIF neurons, the mean current to the cells in the high-activity fixed point is above threshold. Under general assumptions, this leads invariably to fairly regular spiking in this high-activity fixed point. 
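The behavior described here, a sigmoidal f − I curve with nonzero firing for subthreshold mean current when the input is noisy, can be illustrated with the standard first-passage-time rate formula for a white-noise-driven LIF neuron (Ricciardi's expression, of the kind underlying the mean-field theory used in this letter). The parameter values below are illustrative, not those of our network:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import erf

def lif_rate(mu, sigma, tau_m=10.0, theta=20.0, v_reset=10.0, tau_ref=2.0):
    """Stationary firing rate (Hz) of a current-based LIF neuron receiving
    gaussian white-noise input of mean mu and amplitude sigma (both in mV).
    Times are in ms; this is the standard first-passage-time formula."""
    integrand = lambda x: np.exp(x ** 2) * (1.0 + erf(x))
    integral, _ = quad(integrand, (v_reset - mu) / sigma, (theta - mu) / sigma)
    mean_isi = tau_ref + tau_m * np.sqrt(np.pi) * integral  # mean ISI in ms
    return 1000.0 / mean_isi

# With noise (sigma > 0), the rate is nonzero even for subthreshold mean
# input (mu < theta = 20 mV), as required for a low-activity fixed point
# with nonzero spontaneous rate, and the curve saturates toward 1/tau_ref.
```

Plotting lif_rate against mu at fixed sigma traces out the sigmoid discussed in the text; the threshold and refractory period provide, respectively, the foot and the saturation of the curve.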
Of course, tuning the parameters of the current in such a way that the mean current in the high-activity fixed point is only very marginally suprathreshold will result in only a small decrease of the CV with respect to the low-activity fixed point (e.g., Figure 2 in Brunel & Wang, 2001). On the other hand, in this scenario, it is relatively easy (it does not take much tuning) for the firing rate in the elevated persistent activity state not to be very much higher than that in the low-activity state, for example, below 100 Hz (see Figure 8). When the recurrent connectivity is balanced, bistable solutions can exist in which both fixed points are subthreshold, so that spiking in both fixed points is fluctuation driven and thus fairly irregular. This can be the case if the fluctuations in the depolarization due to current from outside the column are low enough (see Figure 6). However, in order for these solutions to exist, first, the overall inputs to the excitatory and the inhibitory subpopulations should be close enough (ensuring balance at the level of the firing activity in the network); second, both of these inputs, the one to the excitatory and the one to the inhibitory subpopulation, should themselves also be balanced (be composed of similar amounts of excitation and inhibition); and third, both the net excitatory drive and the inhibitory drive to the cells should be large. This third condition, if the first two are satisfied, results in a high, effective fluctuation feedback gain: an increase in the activity of
the cells results in a large increase in the size of the fluctuations in the afferent current to the neurons (a large value of the parameter c σ ). However, it also implies that the excitation-inhibition balance condition will be quite stringent; it will require tuning, especially when the network is large. In addition, if this balance is slightly broken, since both excitation and inhibition are large, the corresponding firing rate in the elevated persistent activity state becomes very large, for example, significantly higher than 100 Hz (see Figure 9). In fact, based just on scaling considerations (see section 3.1), one can conclude that this type of bistability can be present only (unless one allows for perfect tuning) in small networks. If the network is large, the excitation-inhibition balance condition has, in general, a single (albeit very robust) solution (van Vreeswijk & Sompolinsky, 1996, 1998). It is intriguing, however, that several lines of evidence in fact suggest a fairly precise balance of local excitation and inhibition in cortical circuits, at both the output level (Rao et al., 1999) and the input level (Anderson, Carandini, & Ferster, 2000; Shu, Hasenstaub, & McCormick, 2003; Marino et al., 2005).
4.1 Limitations of the Present Approach. Most of the results we have presented are based on an exhaustive analysis of the stationary states of a mean-field description of a simple network of LIF neurons. Several limitations of our approach should be noted. First, in order to be able to go beyond the Poisson assumption, we have had to make a number of approximations (discussed in section 2) that are expected to be valid only on limited regions of the large parameter space. Second, we have focused only on the stationary fixed points of the system, neglecting an examination of any oscillatory solutions. Oscillations in networks of LIF neurons in the high-noise regime have been extensively studied by Brunel and collaborators (see, e.g., Brunel & Hakim, 1999; Brunel, 2000a; Brunel & Wang, 2003). Third, in order to be able to provide an analytical description, we have considered a very simplified network lacking many aspects of biological realism known to affect network dynamics, most important, a more realistic description of synaptic dynamics (Wang, 1999; Brunel & Wang, 2001). Finally, the use of a mean-field description based on the diffusion approximation to study small networks with big synaptic PSPs might lead to problems, since the diffusion approximation assumes these PSPs are (infinitely) small. Large PSPs might lead to fluctuations that are too strong, which would destabilize the analytically predicted fixed points. In order to check that the main qualitative conclusion of this study was not an artifact due to the mean-field approach, we simulated the network of LIF neurons used in the mean-field description, adding synaptic delays. Provided the distribution of delays was wide, we observed both types of bistable solutions. However, as expected, the fluctuation-driven persistent activity states show large, temporal fluctuations that sometimes are enough to destabilize them.
The evidence we have provided is suggestive, but given the limitations listed above, it is not conclusive. Addressing the limitations of this work will involve using recently developed analytical techniques (Moreno-Bote & Parga, 2006), along with a systematic exploration of the behavior of more realistic recurrent networks through numerical simulations.

4.2 Self-Consistent Second-Order Statistics. The extended mean-field theory we have used builds on the landmark study by Amit and Brunel (1997b), which provided a general-purpose theory for the study of the different types of steady states in recurrent networks of spiking neurons in the presence of noise, while leaving room for different degrees of biophysical realism. Our contribution has been to try to go beyond the Poisson assumption in order to allow a self-consistent solution for the second-order statistics of the spike trains in the network. If the spike trains are assumed to be Poisson, there is only one parameter to be determined self-consistently: the firing rate. Under the approximations made in this letter, the statistics are characterized by two parameters, the firing rate and the CV, which provides information about the degree of irregularity of the spiking activity. In order to go beyond the Poisson assumption, we have assumed that the spike trains in the network can be described as renewal processes with a very short correlation time. In these conditions, for time windows large compared to this correlation time, the Fano factor of the process is constant, but instead of being one, as for a Poisson process, it is equal to CV2 . This motivates our strategy of neglecting the temporal aspect of the deviation from Poisson, which is extremely complicated to deal with analytically, and keeping only its effect on the amplitude of the correlations.
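The claim that, for counting windows long compared to the correlation time, a renewal train has a constant Fano factor equal to CV2 is easy to verify numerically. A minimal sketch (the gamma renewal process is our choice here; any renewal process with short correlation time would do):

```python
import numpy as np

rng = np.random.default_rng(1)

# Gamma renewal process: ISI shape parameter k gives CV^2 = 1/k (here 0.25).
k, rate = 4.0, 20.0                                  # rate in Hz
isis = rng.gamma(k, 1.0 / (k * rate), size=200_000)  # ISIs in seconds
cv_sq = isis.var() / isis.mean() ** 2                # ~ 1/k

# Fano factor of spike counts in windows much longer than one ISI.
spikes = np.cumsum(isis)
T = 2.0                                              # window (s) >> mean ISI
counts = np.histogram(spikes, bins=np.arange(0.0, spikes[-1], T))[0]
fano = counts.var() / counts.mean()                  # ~ CV^2, not 1
```

For a Poisson train (k = 1) the same computation gives a Fano factor of one; for k > 1 the counts are underdispersed by precisely the factor CV2, which is the amplitude effect our approximation retains.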
We have done this by using the expressions for the rate and CV of the first passage time of the OU process with a renormalized variance that takes into account the CV of the inputs. If the time constant of the autocorrelation of the process is exactly zero, this approximation becomes exact (Moreno et al., 2002), so we have assumed it will still be qualitatively valid if the correlation time constant is small. In this way, we have been able to solve for the CV of the neurons in the steady states self-consistently. It has to be stressed that the fact that the individual inputs to a neuron are considered independent does not imply that the overall input process, made of the sum of each individual component, is Poisson. Informally, in order for the superposition to converge to a (homogeneous) Poisson process of rate λ, two conditions have to be met: given any set S on the time axis (say, any time interval), calling N_i^1 the probability of observing one spike in S from process i, and N_i^{≥2} the probability of observing two or more spikes in S from process i, the superposition of the i = 1, . . . , N processes will converge to a Poisson process if lim_{N→∞} Σ_{i=1}^{N} N_i^1 = λ|S| (with max_i N_i^1 → 0 as N → ∞) and if lim_{N→∞} Σ_i N_i^{≥2} = 0 (see, e.g., Daley & Vere-Jones, 1988). The autocorrelated renewal processes that we consider in this letter do
not meet the second condition, which can also be seen in the fact that the superposition process has an autocorrelation given by equation 2.2, not by a Dirac delta function, as would be the case for a Poisson process. Despite this, it might be the case that if instead of any set S, one considers only a given time window T, both conditions could approximately be met in T, and we could say that the superposition process is locally Poisson in T. Whether this locally Poisson train will have the same effect on the postsynaptic cell as a real Poisson train of the same rate depends on a number of factors and has been studied in detail in Moreno et al. (2002) for the case of exponential autocorrelations. Other types of autocorrelation structures, for instance, regular spike trains, could lead to different results. This is an open problem.

4.3 Current-Based versus Conductance-Based Descriptions. We have analyzed a network of current-based LIF neurons. The motivations for this choice are that current-based LIF neurons are simpler to analyze, especially in the presence of noise, than conductance-based LIF neurons and also that there were a number of unresolved issues raised in the current-based framework that we have made an attempt to clarify. In particular, we were interested in understanding whether the framework of Amit and Brunel (1997b) could be used to produce bistable solutions in balanced networks like those studied in Tsodyks and Sejnowski (1995) and van Vreeswijk and Sompolinsky (1996, 1998) outside the context of multistability in recurrent networks. An important issue has been the relationship between the scaling of the connection strengths with the number of afferents and the possible types of bistability attainable in large networks, when the number of afferents per cell tends to infinity.
This analysis shows that large, homogeneous networks using the J ∼ 1/√C scaling needed to retain a significant amount of fluctuation at large C do not support bistability in a robust manner, a result already implicitly present in van Vreeswijk and Sompolinsky (1996, 1998). Reasonably robust bistability in homogeneous balanced networks requires that they be small. Does one expect these conclusions to hold qualitatively if one considers the more realistic case of conductance-based synaptic inputs to the cells? The answer to this question is uncertain. In particular, scaling relationships between J and C, absolutely unavoidable in current-based scenarios to keep a finite input to the cells in the extensive C limit, are not necessary when synaptic inputs are assumed to induce a transient change in conductance. In the presence of conductances, the steady-state voltage is automatically independent of C for large C, regardless of the value of the unitary synaptic conductances. In fact, assuming a simple model for a cell having only leak, excitatory, and inhibitory synaptic conductances, the steady-state voltage in the absence of threshold is given by Vss =
(gL /gTot ) VL + (gE /gTot ) VE + (gI /gTot ) VI ,
where VL , VE , VI are the reversal potentials of the respective currents; g L , g E , g I are the total leak, excitatory, and inhibitory conductance in the cell and gTot = g L + g E + g I . (This expression ignores the effect of temporal fluctuations in the conductances, but it is a good approximation, since the mean conductance, being by definition positive, is expected to be much larger than its fluctuations.) The steady-state voltage is just the average of the different reversal potentials, each weighted by the relative amount of conductance that the respective current is carrying. Of course, each of the total synaptic conductances is proportional to the number of inputs, but since the steady-state voltage is just a weighted sum, it does not explode even if C tends to infinity. It might seem that in fact, the infinite-C limit leads to an ill-defined model, as the membrane time constant vanishes in this case as 1/C. If Cm is the membrane capacitance, then τm =
Cm /gTot ∼ 1/C,
assuming that g E,I ∼ C. We believe, however, that this is an artifact due to an incorrect way of defining the model in the large C limit. It implicitly assumes that the number of synaptic inputs grows at a constant membrane area, thus increasing indefinitely the local channel density. A more appropriate way of taking the large C limit is to fix the relative densities of the different channels per unit area and then assume the area becomes large. In this case, both Cm and the total leak conductance of the cell (proportional to the number of leak channels) will grow with the total area. This way of taking the limit respects the well-known decrease in membrane time constant as the synaptic input to the cell grows, but retaining a well-defined, nonzero membrane time constant in the extensive C limit (in this case, the range of values that τm can take is determined by the local differences in channel density, which is independent of the total channel number). A crucial difference with the current-based cell is the behavior of the variance of the depolarization in the large C limit. A simple estimate of this variance can be obtained by ignoring threshold and considering only the low-pass filtering effect of the membrane (with time constant τm ) on a gaussian noise current of variance σ I2 and time constant τs . It is straightforward to calculate the variance of the depolarization in these conditions, resulting in σV2 =
(σI^2 /gTot^2 ) τs /(τs + τm ).
If the inputs are independent, both the variance of the current and the total conductance of the cell are proportional to C, which implies that σV2 ∼ 1/C for large C. Therefore, the statistics of the depolarization in conductance-based and current-based neurons show a very different dependence on the number of inputs to the cell. In particular, it is unclear whether the main organizational principle behind the balanced state in the current-based framework, that is, the J ∼ 1/√C scaling that is needed to retain a finite variance in the C → ∞ limit and that leads to the set of linear equations that specify the single solution for the activity in the balanced network, is relevant in a conductance-based framework. A rigorous study of this problem is beyond the scope of this work, but is one of the outstanding challenges for understanding the basic principles of cortical organization.
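The different scaling of the voltage variance in the two frameworks can be checked directly from the expressions above. A sketch with illustrative (not fitted) parameters; here g_unit and sigma_unit_sq stand for an assumed unitary synaptic conductance and per-input current variance:

```python
import numpy as np

def sigma_v_sq(C, g_leak=10.0, g_unit=0.05, sigma_unit_sq=1.0,
               tau_s=2.0, c_m=250.0):
    """Voltage variance of a conductance-based cell with C independent
    inputs: sigma_V^2 = (sigma_I^2 / g_tot^2) * tau_s / (tau_s + tau_m),
    with sigma_I^2 proportional to C and g_tot growing linearly with C.
    All parameter values are illustrative placeholders."""
    sigma_i_sq = C * sigma_unit_sq      # independent inputs: variances add
    g_tot = g_leak + C * g_unit         # total conductance grows with C
    tau_m = c_m / g_tot                 # effective membrane time constant
    return (sigma_i_sq / g_tot ** 2) * tau_s / (tau_s + tau_m)

# For large C, g_tot ~ C and tau_m ~ 1/C, so sigma_V^2 ~ 1/C: increasing C
# tenfold reduces the voltage variance roughly tenfold.
ratio = sigma_v_sq(1e5) / sigma_v_sq(1e6)
```

In the current-based framework, by contrast, keeping this variance finite as C grows forces a scaling of J with C, which is precisely what makes the balance condition stringent.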
4.4 Correlations and Synaptic Time Constants. Our mean-field description assumes that the network is in the extreme sparse limit, N C, in which the fraction of common input shared by two neurons is negligible, leading to vanishing correlations between the afferent current to different cells in the large C, large N limit. This is a crucial assumption, since it causes the variance of the depolarization in the network to be the sum of the variances of the individual spike trains, that is, proportional to C. If the correlation coefficient is finite as C → ∞, the variance is proportional to C 2 (see, e.g., Moreno et al., 2002). In a current-based network, J ∼ 1/C scaling would lead to a nonvanishing variance in the large C limit without a stringent balance condition, and in a conductance-based network, it would lead to a C-independent variance for large C. This suggests that correlations between the cells in the recurrent network should have a large effect on both their input-output properties (Zohary, Shadlen, & Newsome, 1994; Salinas & Sejnowski, 2000; Moreno et al., 2002) and the network dynamics. The issue is, however, not straightforward, as simulations of irregular spiking networks with realistic connectivity parameters, which do show weak but significant cross-correlations between neurons (Amit & Brunel, 1997a), seem to be well described by the mean-field theory in which correlations are neglected (Amit & Brunel, 1997a; Brunel & Wang, 2001). Noise correlations measured experimentally are small but significant, with normalized correlation coefficients on the range of a few percent to a few tenths for a review (see, e.g., Salinas & Sejnowski, 2001). It would thus be desirable to be able to extend the current mean-field theory to incorporate the effect of cross-correlations and to understand under which conditions their effect is important. The first steps in this direction have already been taken (Moreno et al., 2002; Moreno-Bote & Parga, 2006). 
The arguments of the previous section suggest that a correct treatment of correlations might be especially important in large networks of conductance-based neurons.
An issue of similar importance for the understanding of the high irregularity of cortical spike trains is the relationship between the time constant (or time constants) of the inputs to the neuron and its (effective) membrane time constant. Indeed, the need for a balance between excitation and inhibition in order to have a high output spiking variability when receiving many irregular inputs exists only if the membrane time constant is relatively long—in particular, long enough that if the afferents are all excitatory, the input fluctuations are averaged out. If the membrane time constant is very short, large, short-lived fluctuations are needed to make the postsynaptic cell fire, and these occur only irregularly, even if all the afferents are excitatory (see, e.g., Figure 1 in Shadlen & Newsome, 1994). These considerations seem relevant since cortical cells receive large numbers of inputs that have spontaneous activity, thus putting the cell into a high-conductance state (see, e.g., Destexhe, Rudolph, & Pare, 2003) in which its effective membrane time constant is short—on the order of only a few milliseconds (Bernander, Douglas, Martin, & Koch, 1991; Softky, 1994). It has also been recognized that when the synaptic time constant is large compared to the membrane time constant, spiking in the subthreshold regime becomes very irregular, and in particular, the distribution of firing rates becomes bimodal. Qualitatively, in these conditions, the depolarization follows the current instead of integrating it. Relative to the timescale of the membrane, fluctuations are long-lived, and this separates two different timescales for spiking (which result in bimodality of the firing-rate distribution) depending on whether the size of a fluctuation is such that the total current is subthreshold (i.e., no spiking, leading to a large peak of the firing-rate histogram at zero) or suprathreshold (leading to a nonzero peak in the firing-rate distribution) (Moreno-Bote & Parga, 2005).
In these conditions, neurons seem “bursty,” and the CV of the ISI is high. Interestingly, recent evidence confirms this bimodality of the firing-rate distribution in spiking activity recorded in vivo in the visual cortex (Carandini, 2004). Increases in the synaptic-to-membrane time constant ratio leading to more irregular spiking can be due to a number of factors: a very short membrane time constant if the neuron is in a high-conductance state, a relatively long excitatory synaptic drive if there is a substantial NMDA component in the excitatory EPSPs, or even long-lived dendrosomatic current sources, for instance, due to the existence of “calcium spikes” generated in the dendrites. There is evidence that irregular current applied to the dendrites of pyramidal cells results in higher CVs than the same current applied to the soma (Larkum, Senn, & Luscher, 2004).

4.5 Parameter Fine-Tuning. In order for both stable firing rate states of the networks we have studied to display significant spiking irregularity, the afferent current to the cells in both states needs to be subthreshold. We have shown that this requires a significant amount of parameter fine-tuning, especially when the number of connections per neuron is large. Parameter
fine-tuning is a problem, since biological networks are heterogeneous and cellular and synaptic properties change in time. Regarding this issue, though, some considerations are in order. First, the model we have considered is extremely simple, especially at the single-neuron level. We have already pointed out possible consequences of considering properties such as longer synaptic time constants or some degree of correlation between the spiking activity of different neurons. Another biophysical property that we expect to have a large impact is short-term synaptic plasticity. In the presence of depressing synapses, the postsynaptic current is no longer linear in the presynaptic firing rate, thus acting as an activity-dependent gain control mechanism (Tsodyks & Markram, 1997; Abbott, Varela, Sen, & Nelson, 1997). It remains to be explored to what extent balanced bistability in networks of neurons exhibiting these properties becomes a more robust phenomenon. Second, synaptic weights (as well as intrinsic properties; Desai, Rutherford, & Turrigiano, 1999) can adapt in an activity-dependent manner to keep the overall activity in a recurrent network within an appropriate operational range (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998). Delicate computational tasks, which seem to require fine-tuning, can be rendered robust through the use of these types of activity-dependent homeostatic rules (Renart, Song, & Wang, 2003). It will be interesting to study whether homeostatic plasticity (Turrigiano, 1999) can be used to relax some of the fine-tuning constraints described in this letter.

4.6 Multicolumnar Networks and Hierarchical Organization. The fact that bistability is not a robust property of large, homogeneous balanced networks suggests that the functional units of working memory could correspond to small subpopulations (Rao et al., 1999).
In addition, we have shown that bistability in a small, reduced network is possible only for subthreshold external inputs (see section 3.2.2). At the same time, it is known that a nonzero-activity balanced state requires a very large (suprathreshold) excitatory drive (see section 3.1 and van Vreeswijk & Sompolinsky, 1998). This seems to point to a hierarchical organization: large networks receive massive excitation from long-distance projections, and this external excitation sets up a balanced state in the network. Globally, the activity in the large, balanced network follows the external input linearly. This large, balanced network then provides an already balanced (subthreshold) input to smaller subcomponents, which, in these conditions (in particular, if the variance of this subthreshold input is small enough; see figure 7), can display more complex nonlinear behavior such as bistability. From the point of view of the smaller subnetworks, the balanced subthreshold input can be considered external, since the size of this network is too small to make a difference in the global activity of the larger network (despite being recurrently connected, the activities in the large and small networks effectively decouple). In the cortex, the larger balanced network could correspond to a whole cortical column. Indeed, long-range projections between columns are mostly excitatory (see, e.g., Douglas & Martin, 2004). Within a column, the smaller networks that interact through both excitation and inhibition could anatomically correspond to microcolumns (Rao et al., 1999) or, more generally, define functional assemblies (Hebb, 1949).

42
A. Renart, R. Moreno-Bote, X.-J. Wang, and N. Parga

5 Summary

General principles of cortical organization (large numbers of active synaptic inputs per neuron) and function (irregular spiking statistics) put strong constraints on working memory models of spiking neurons. We have provided evidence that a network of current-based LIF neurons can exhibit bistability with the high persistent activity driven by either the mean or the fluctuations in the input to the cells. The fluctuation-driven bistability regime requires a strict excitation-inhibition balance that needs parameter tuning. It remains a challenge in future research to analyze systematically what the conditions are under which nonlinear phenomena such as bistability can exist robustly in large networks of more biophysically plausible conductance-based and correlated spiking neurons. It is also conceivable that additional biological mechanisms, such as homeostatic regulation, are important for solving the fine-tuning problem and ensuring a desired excitation-inhibition balance in cortical circuits. Progress in this direction will provide insight into the microcircuit mechanisms of working memory, such as found in the prefrontal cortex.

Acknowledgments

We are indebted to Jaime de la Rocha for providing the code for the numerical simulations and to Albert Compte for providing the data for Figure 1. A.R. thanks N. Brunel for pointing out previous related work, and A. Amarasingham for discussions on point processes. Support was provided by the National Institute of Mental Health (MH62349, DA016455), the A. P.
Sloan Foundation and the Swartz Foundation, and the Spanish Ministry of Education and Science (BFM 2003-06242).

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Abramowitz, M., & Stegun, I. A. (1970). Tables of mathematical functions. New York: Dover.
Amit, D. J., & Brunel, N. (1997a). Dynamics of a recurrent network of spiking neurons before and following learning. Network, 8, 373–404.
Amit, D. J., & Brunel, N. (1997b). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237–252.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural network retrieving at low spike rates II: Low-rate retrieval in symmetric networks. Network, 2, 275–294.
Anderson, J. S., Carandini, M., & Ferster, D. (2000). Orientation tuning of input conductance, excitation, and inhibition in cat primary visual cortex. J. Neurophysiol., 84, 909–926.
Bair, W., Zohary, E., & Newsome, W. T. (2001). Correlated firing in macaque visual area MT: Time scales and relationship to behavior. J. Neurosci., 21, 1676–1697.
Ben-Yishai, R., Lev Bar-Or, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. USA, 92, 3844–3848.
Bernander, O., Douglas, R. J., Martin, K. A., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573.
Brunel, N. (2000a). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8, 183–208.
Brunel, N. (2000b). Persistent activity and the single cell f-I curve in a cortical network model. Network, 11, 261–280.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Computation, 11, 1621–1671.
Brunel, N., & Wang, X.-J. (2001). Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. J. Comput. Neurosci., 11, 63–85.
Brunel, N., & Wang, X.-J. (2003). What determines the frequency of fast network oscillations with irregular neural discharges? I. Synaptic dynamics and excitation-inhibition balance. J. Neurophysiol., 90, 415–430.
Cai, D., Tao, L., Shelley, M., & McLaughlin, D. W. (2004). An effective kinetic representation of fluctuation-driven networks with application to simple and complex cells in visual cortex. Proc. Natl. Acad. Sci., 101, 7757–7762.
Carandini, M. (2004). Amplification of trial-to-trial response variability by neurons in visual cortex. PLOS Biol., 2, E264.
Chafee, M. V., & Goldman-Rakic, P. S. (1998). Matching patterns of activity in primate prefrontal area 8a and parietal area 7ip neurons during a spatial working memory task. J. Neurophysiol., 79, 2919–2940.
Compte, A., Brunel, N., Goldman-Rakic, P. S., & Wang, X.-J. (2000). Synaptic mechanisms and network dynamics underlying spatial working memory in a cortical network model. Cerebral Cortex, 10, 910–923.
Compte, A., Constantinidis, C., Tegnér, J., Raghavachari, S., Chafee, M., Goldman-Rakic, P. S., & Wang, X.-J. (2003). Temporally irregular mnemonic persistent activity in prefrontal neurons of monkeys during a delayed response task. J. Neurophysiol., 90, 3441–3454.
Cox, D. R. (1962). Renewal theory. New York: Wiley.
Daley, D. J., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer.
Desai, N. S., Rutherford, L. C., & Turrigiano, G. G. (1999). Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nat. Neurosci., 2, 515–520.
Destexhe, A., Rudolph, M., & Pare, D. (2003). The high-conductance state of neocortical neurons in vivo. Nat. Rev. Neurosci., 4, 739–751.
Douglas, R. J., & Martin, K. A. (2004). Neuronal circuits of the neocortex. Ann. Rev. Neurosci., 27, 419–451.
Funahashi, S., Bruce, C. J., & Goldman-Rakic, P. S. (1989). Mnemonic coding of visual space in the monkey's dorsolateral prefrontal cortex. J. Neurophysiol., 61, 331–349.
Fuster, J. M., & Alexander, G. (1971). Neuron activity related to short-term memory. Science, 173, 652–654.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Gillespie, D. T. (1992). Markov processes: An introduction for physical scientists. Orlando, FL: Academic Press.
Gnadt, J. W., & Andersen, R. A. (1988). Memory related motor planning activity in posterior parietal cortex of macaque. Exp. Brain Res., 70, 216–220.
Hansel, D., & Mato, G. (2001). Existence and stability of persistent states in large neuronal networks. Phys. Rev. Lett., 86, 4175–4178.
Harsch, A., & Robinson, H. P. (2000). Postsynaptic variability of firing in rat cortical neurons: The roles of input synchronization and synaptic NMDA receptor conductance. J. Neurosci., 20, 6181–6192.
Hebb, D. O. (1949). Organization of behavior. New York: Wiley.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
Larkum, M. E., Senn, W., & Luscher, H. M. (2004). Top-down dendritic input increases the gain of layer 5 pyramidal neurons. Cereb. Cortex, 14, 1059–1070.
Marino, J., Schummers, J., Lyon, D. C., Schwabe, L., Beck, O., Wiesing, P., & Obermayer, K. (2005). Invariant computations in local cortical networks with balanced excitation and inhibition. Nat. Neurosci., 8, 194–201.
Mascaro, M., & Amit, D. J. (1999). Effective neural response function for collective population states. Network, 10, 351–373.
Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Computation, 12, 2305–2329.
Miller, E. K., Erickson, C. A., & Desimone, R. (1996). Neural mechanisms of visual working memory in prefrontal cortex of the macaque. J. Neurosci., 16, 5154–5167.
Miyashita, Y., & Chang, H. S. (1988). Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature, 331, 68–70.
Moreno, R., de la Rocha, J., Renart, A., & Parga, N. (2002). Response of spiking neurons to correlated inputs. Phys. Rev. Lett., 89, 288101.
Moreno-Bote, R., & Parga, N. (2005). Membrane potential and response properties of populations of cortical neurons in the high conductance state. Phys. Rev. Lett., 94, 088103.
Moreno-Bote, R., & Parga, N. (2006). Auto- and cross-correlograms for the spike response of LIF neurons with slow synapses. Phys. Rev. Lett., 96, 028101.
Rao, S. G., Williams, G. V., & Goldman-Rakic, P. S. (1999). Isodirectional tuning of adjacent interneurons and pyramidal cells during working memory: Evidence for microcolumnar organization in PFC. J. Neurophysiol., 81, 1903–1916.
Renart, A. (2000). Multi-modular memory systems. Unpublished doctoral dissertation, Universidad Autónoma de Madrid.
Renart, A., Song, P., & Wang, X.-J. (2003). Robust spatial working memory through homeostatic synaptic scaling in heterogeneous cortical networks. Neuron, 38, 473–485.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Romo, R., Brody, C. D., Hernández, A., & Lemus, L. (1999). Neuronal correlates of parametric working memory in the prefrontal cortex. Nature, 399, 470–474.
Salinas, E., & Sejnowski, T. J. (2000). Impact of correlated synaptic input on output firing rate and variability in simple neuronal models. Journal of Neuroscience, 20, 6193–6209.
Salinas, E., & Sejnowski, T. J. (2001). Correlated neuronal activity and the flow of neural information. Nat. Rev. Neurosci., 2, 539–550.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Shinomoto, S., Sakai, Y., & Funahashi, S. (1999). The Ornstein-Uhlenbeck process does not reproduce spiking statistics of neurons in prefrontal cortex. Neural Comput., 11, 935–951.
Shu, Y., Hasenstaub, A., & McCormick, D. A. (2003). Turning on and off recurrent balanced cortical activity. Nature, 423, 288–293.
Softky, W. R. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58, 13–41.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124.
Turrigiano, G. G. (1999). Homeostatic plasticity in neuronal networks: The more things change, the more they stay the same. Trends in Neurosci., 22, 221–227.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Computation, 10, 1321–1371.
Wang, X.-J. (1999). Synaptic basis of cortical persistent activity: The importance of NMDA receptors to working memory. J. Neurosci., 19, 9587–9603.
Zador, A. M., & Stevens, C. F. (1998). Input synchrony and the irregular firing of cortical neurons. Nat. Neurosci., 1, 210–217.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.
Received August 22, 2005; accepted May 17, 2006.
LETTER
Communicated by Alain Destexhe
Exact Subthreshold Integration with Continuous Spike Times in Discrete-Time Neural Network Simulations
Abigail Morrison
[email protected] Computational Neurophysics, Institute of Biology III, and Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany
Sirko Straube
[email protected] Computational Neurophysics, Institute of Biology III, Albert-Ludwigs-University, 79104 Freiburg, Germany
Hans Ekkehard Plesser
[email protected] Department of Mathematical Sciences and Technology, Norwegian University of Life Sciences, N-1432 Ås, Norway
Markus Diesmann
[email protected] Computational Neurophysics, Institute of Biology III, and Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, 79104 Freiburg, Germany
Very large networks of spiking neurons can be simulated efficiently in parallel under the constraint that spike times are bound to an equidistant time grid. Within this scheme, the subthreshold dynamics of a wide class of integrate-and-fire-type neuron models can be integrated exactly from one grid point to the next. However, the loss in accuracy caused by restricting spike times to the grid can have undesirable consequences, which has led to interest in interpolating spike times between the grid points to retrieve an adequate representation of network dynamics. We demonstrate that the exact integration scheme can be combined naturally with off-grid spike events found by interpolation. We show that by exploiting the existence of a minimal synaptic propagation delay, the need for a central event queue is removed, so that the precision of event-driven simulation on the level of single neurons is combined with the efficiency of time-driven global scheduling. Further, for neuron models with linear subthreshold dynamics, even local event queuing can be avoided, resulting in much greater efficiency on the single-neuron level. These ideas are exemplified by two implementations of a widely used neuron model. We present a measure for the efficiency of network simulations in terms of their integration error and show that for a wide range of input spike rates, the novel techniques we present are both more accurate and faster than standard techniques.

Neural Computation 19, 47–79 (2007)
© 2006 Massachusetts Institute of Technology

48
A. Morrison, S. Straube, H. Plesser, and M. Diesmann

1 Introduction

A major problem in the simulation of cortical neural networks has been their high connectivity. With each neuron receiving input from the order of 10^4 other neurons, simulations are demanding in terms of memory as well as simulation time requirements. Techniques to study such highly connected systems of 10^5 and more neurons, using distributed computing, are now available (see, e.g., Hammarlund & Ekeberg, 1998; Harris et al., 2003; Morrison, Mehring, Geisel, Aertsen, & Diesmann, 2005). The question remains as to whether a time-driven or event-driven simulation algorithm (Fujimoto, 2000; Zeigler, Praehofer, & Kim, 2000; Sloot, Kaandorp, Hoekstra, & Overeinder, 1999; Ferscha, 1996) should be used. At first glance, the choice of event-driven algorithms seems natural, because a neuron can be described as emitting and absorbing point events (spikes) in continuous time. In fact, for neuron models with linear subthreshold dynamics and postsynaptic potentials without rise time, highly efficient algorithms exist (see, e.g., Mattia & Del Giudice, 2000). These exploit the fact that threshold crossings can occur only at the impact times of excitatory events. If more general types of neuron models are considered, the global algorithmic framework becomes much more complicated. For example, each neuron may be required to "look ahead" to determine when it will fire in the absence of new events. The global algorithm then either updates the neuron with the shortest latency or delivers the event with the most imminent arrival time (whichever is shorter) and revises the latency calculations for the neurons receiving the event. (See Marian, Reilly, & Mackey, 2002; Makino, 2003; Rochel & Martinez, 2003; Lytton & Hines, 2005; and Brette, 2006, for refined versions of such an algorithm.)
This decision process clearly comes at a cost and becomes unwieldy for networks of high connectivity: if each neuron is receiving input spikes at a conservative average rate of 1 Hz from each of 10^4 synapses, it needs to process a spike every 0.1 ms, and this limits the characteristic integration step size. Therefore, time-driven algorithms have been found useful for the simulation of large, highly connected networks. Here, each neuron is updated on an equidistant time grid, and the emission and absorption times of spikes are restricted to the grid (see section 3). The temporal spacing of the grid is called the computation time step h. Consider a network of 10^5 neurons as described above. At a computation time step of 0.1 ms, a time-driven algorithm carries out 10^9 neuron updates per second, the same number as required for the event-driven algorithm. In this situation, the time-driven scheme is necessarily faster than the event-driven scheme because the costs of the actual updates are the same and there is no overhead caused by the scheduling of events.
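The arithmetic behind this comparison can be checked directly (a sketch using the figures quoted above; all variable names are ours):

```python
# Figures quoted above: 10^5 neurons, 10^4 synapses per neuron,
# 1 Hz mean rate per input synapse, computation time step h = 0.1 ms.
n_neurons, n_synapses, rate_hz, h_s = 10**5, 10**4, 1.0, 1e-4

spikes_per_neuron_per_s = n_synapses * rate_hz        # input events per neuron
mean_interval_ms = 1000.0 / spikes_per_neuron_per_s   # gap between input spikes

event_driven_updates = n_neurons * spikes_per_neuron_per_s  # one update per event
time_driven_updates = n_neurons * (1.0 / h_s)               # one update per grid step

print(mean_interval_ms, event_driven_updates, time_driven_updates)
# one input spike every 0.1 ms; ~10^9 updates per simulated second in both schemes
```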
Continuous Spike Times in Exact Discrete-Time Simulations
49
However, in this letter, we criticize this view and argue that in order to arrive at a relevant measure of efficiency, simulation time should be analyzed as a function of the integration error rather than the update interval. Whether a time-driven or event-driven scheme yields a better performance from this perspective depends on the required accuracy of the simulation and the network spike rate, and is not immediately apparent from considerations of complexity. In the time-driven framework, Rotter and Diesmann (1999) showed that for a wide class of neuron models, the linearity of the subthreshold dynamics can be exploited to integrate the neuron state exactly from one grid point to the next by performing a single matrix-vector multiplication. Here, the computation time step simultaneously determines the accuracy with which incoming spikes influence the subthreshold dynamics and the timescale at which threshold crossings can be detected. However, Hansel, Mato, Meunier, and Neltner (1998) showed that forcing spikes onto the grid can significantly distort the synchronization dynamics of certain networks. Reducing the computation step ameliorates the problem only slowly, as the integration error declines linearly with h (see section 8.4.1). The problem was solved (Hansel et al., 1998; Shelley & Tao, 2001) by interpolating the membrane potential between grid points to give a better approximation of the time of threshold crossing and evaluating the effect of incoming spikes on the neuronal state in continuous time. In this work, we demonstrate that the techniques developed for the exact integration of the subthreshold dynamics (Rotter & Diesmann, 1999) and for the interpolation of spike times (Hansel et al., 1998; Shelley & Tao, 2001) can be successfully combined.
By requiring that the minimal synaptic propagation delay be at least as large as the computation time step, all events can be queued at their target neurons rather than relying on a central event queue to maintain causality in the network. This reduces the complexity of the global scheduling algorithm—that is, deciding which neuron should be updated, and how far—to the simple time-driven case, whereby each neuron in turn is advanced in time by a fixed amount. Therefore, the global overhead costs are no more than in a traditional discrete-time simulation, and yet on the level of the individual neuron, spikes can be processed and emitted in continuous time with the accuracy of an event-driven algorithm. This approach represents a hybridization of traditional time-driven and event-driven algorithms: the scheme is time driven on the global level to advance the system time but event driven on the level of the individual neurons. The exact integration method is predicated on the linearity of the subthreshold dynamics. We show that this property can be further exploited, as the order of incoming events is not relevant for calculating the neuron state. This completely removes the need for storing and sorting individual events, and therefore also for dynamic data structures, while maintaining the high precision of the event-driven approach.
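The order-independence noted above follows from linearity: an event with offset δ contributes P(h − δ)x to the end-of-step state regardless of the other events, so y_{t+h} = P(h)y_t + Σ_i P(h − δ_i)x_i can be accumulated in any order. A quick numerical check of this identity (our sketch; SciPy assumed, with an arbitrary matrix standing in for the subthreshold dynamics):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))   # stand-in for any linear subthreshold dynamics
h = 0.1                            # length of one computation time step
y0 = rng.standard_normal(3)        # state at the beginning of the step

# three in-step events (offset, initial-condition vector), deliberately unsorted
events = [(0.02, rng.standard_normal(3)),
          (0.07, rng.standard_normal(3)),
          (0.05, rng.standard_normal(3))]

def end_state_sequential(y, events):
    """Propagate along the sorted offset list, as in section 4.1."""
    last = 0.0
    for d, x in sorted(events, key=lambda e: e[0]):
        y = expm(A * (d - last)) @ y + x
        last = d
    return expm(A * (h - last)) @ y

def end_state_lumped(y, events):
    """Propagate each event to the end of the step independently, in any order."""
    out = expm(A * h) @ y
    for d, x in events:
        out += expm(A * (h - d)) @ x
    return out

print(np.allclose(end_state_sequential(y0, events),
                  end_state_lumped(y0, events)))   # True: event order is irrelevant
```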
We illustrate these ideas by comparing three implementations of the same widely used integrate-and-fire neuron model. As in previous work, the scaling behavior of the integration error is considered as a function of the computational resolution. However, in contrast to these works, we also analyze the run time and memory consumption of a large neuronal network model as a function of integration error, thus defining a measure of efficiency that can be applied to any competing model implementations. This analysis reveals that the novel scheme of embedding continuous-time implementations in a discrete-time framework can in many cases result in simulations that are both more accurate and faster than a given purely discrete-time simulation. This is possible because the new scheme achieves the same accuracy at larger computation time steps. Depending on the rate of events to be processed by the neuron, the gain in simulation speed due to an increased step size can more than compensate for the increased complexity of processing continuous-time input events. The scheme presented is well suited for distributed computing. In section 2, we describe the neuron model used as an example in the remainder of the article, and then we review the techniques for integrating the dynamics of a neural network in discrete time steps in section 3. Subsequently, in section 4, we present two implementations solving the single-neuron dynamics between grid points but handling the incoming and emitted spikes in continuous time. The performance of these implementations with respect to integration error, run time, and memory requirements is analyzed in section 5. We show that the choice of which implementation should be used for a given problem depends on a trade-off between these factors. The concepts of time-driven and event-driven simulation of large neural networks are discussed in section 6 in the light of our findings.
The numerical techniques underlying the reported results are given in the appendix. The conceptual and algorithmic work described here is a module in our long-term collaborative project to provide the technology for neural systems simulations (Diesmann & Gewaltig, 2002). Preliminary results have been presented in abstract form (Morrison, Hake, Straube, Plesser, & Diesmann, 2005).

2 Example Neuron Model

Although the methods in this letter can be applied to any neuron model reducible to a system of linear differential equations, for clarity, we compare various implementations of one particular physical model: a current-based integrate-and-fire neuron with postsynaptic currents represented as α-functions. The dynamics of the membrane potential V is

\dot{V} = -\frac{V}{\tau_m} + \frac{1}{C} I,
where τm is the membrane time constant, C is the capacitance of the membrane, and I is the input current to the neuron. The current arises as a superposition of the synaptic currents and any external current. The time course of the synaptic current ι due to one incoming spike is

\iota(t) = \hat{\iota}\,\frac{e\,t}{\tau_\alpha}\,e^{-t/\tau_\alpha},
where ι̂ is the peak value of the current and τα is the rise time. When the membrane potential reaches a given threshold value Θ, the membrane potential is clamped to zero for an absolute refractory period τr. The values for these parameters used in this article are τm = 10 ms; C = 250 pF; Θ = 20 mV; τr = 2 ms; ι̂ = 103.4 pA; and τα = 0.1 ms.

3 Exact Integration of Subthreshold Dynamics in a Discrete-Time Simulation

The dynamics of the neuron model described in section 2 is linear and can therefore be reformulated to give a particularly efficient implementation for a discrete-time simulation (Rotter & Diesmann, 1999). We refer to this traditional, discrete-time approach as the grid-constrained implementation. Making the substitutions

y_1 = \frac{d\iota}{dt} + \frac{1}{\tau_\alpha}\,\iota
y_2 = \iota
y_3 = V,

where y_i is the ith component of the state vector y, we arrive at the following system of linear equations:

\dot{y} = A y = \begin{pmatrix} -\frac{1}{\tau_\alpha} & 0 & 0 \\ 1 & -\frac{1}{\tau_\alpha} & 0 \\ 0 & \frac{1}{C} & -\frac{1}{\tau_m} \end{pmatrix} y,
\qquad
y(0) = \begin{pmatrix} \hat{\iota}\,\frac{e}{\tau_\alpha} \\ 0 \\ 0 \end{pmatrix},
where y(0) is the initial condition for a postsynaptic potential originating at time t = 0. The exact solution of this system is given by y(t) = P(t)y(0), where P(t) = e^{At} denotes the matrix exponential of At, which is an exact mathematical expression (see, e.g., Golub & van Loan, 1996). For a fixed time step h, the state of the system can be propagated from one grid position to the next by y_{t+h} = P(h)y_t. This is an efficient method because P(h) is constant and has to be calculated only once at the beginning of a simulation. Moreover, P(h) can be obtained in closed form (Diesmann, Gewaltig, Rotter,
& Aertsen, 2001), for example, using symbolic algebra software such as Maple (Heck, 2003) or Mathematica (Wolfram, 2003), and can therefore be evaluated by simple expressions in the implementation language (see also section A.3). The complete update for the neuron state may be written as

y_{t+h} = P(h)\,y_t + x_{t+h}, \qquad (3.1)

assuming incoming spikes are constrained to the grid, as the linearity of the system permits the initial conditions for all spikes arriving at a given grid point to be lumped together into one term,

x_{t+h} = \begin{pmatrix} \frac{e}{\tau_\alpha} \\ 0 \\ 0 \end{pmatrix} \sum_{k \in S_{t+h}} \hat{\iota}_k. \qquad (3.2)

Here S_{t+h} is the set of indices k ∈ 1, …, K of synapses that deliver a spike to the neuron at time t + h, and ι̂_k represents the "weight" of synapse k. Note that the ι̂_k may be arbitrarily signed and may also vary over the course of the simulation. The new neuron state y_{t+h} is the exact solution to the subthreshold neural dynamics at time t + h, including all events that arrive at time t + h. This assumes that the neuron does not itself produce a spike in the interval (t, t + h]. If the membrane potential y^3_{t+h} exceeds the threshold value Θ, the neuron communicates a spike event to the network with a time stamp of t + h; the membrane potential is subsequently clamped to 0 in the interval [t + h, t + h + τr] (see Figure 1A). The earliest grid point at which a neuron could produce its next spike is therefore t + 2h + τr. Note that for a grid-constrained implementation, τr must be an integer multiple of h, because the membrane potential is evaluated only at grid points, and we define it to be nonzero.

3.1 Computational Requirements. In order to preserve causality, it is necessary that there is a minimal synaptic delay of h. Otherwise, if a neuron spiked at time t + h and its synapses had a propagation delay of 0, then this event would seem to arrive at some of its targets at t + h and at some of them at t + 2h, depending on the order in which the neurons are updated. In practice, simulations are generally performed with synaptic delays that are greater than the time step h, and so some technique must be used to store events that have already been produced by a neuron but are not due to arrive at their targets for several time steps. In a grid-constrained simulation, only delays that are an integer multiple of h can be considered because incoming spikes can be handled only at grid points. Consequently, pending events can be stored in a data structure analogous to a looped tape device (see Morrison, Mehring, et al., 2005).
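As an illustration only (our naming and structure; SciPy assumed for the matrix exponential), the update of equations 3.1 and 3.2 combined with such a looped tape device might be sketched as:

```python
import numpy as np
from scipy.linalg import expm

tau_m, C, tau_a = 10.0, 250.0, 0.1     # ms, pF, ms (parameters of section 2)
iota_hat, theta, h = 103.4, 20.0, 0.1  # pA, mV, ms

A = np.array([[-1/tau_a,      0.0,      0.0],
              [     1.0, -1/tau_a,      0.0],
              [     0.0,      1/C, -1/tau_m]])
P = expm(A * h)                        # propagator, computed once

class TapeDevice:
    """Ring buffer: one accumulated weight per step of the maximal delay."""
    def __init__(self, max_delay_steps):
        self.buf = [0.0] * (max_delay_steps + 1)
        self.pos = 0                   # current reading position
    def deliver(self, weight, delay_steps):
        self.buf[(self.pos + delay_steps) % len(self.buf)] += weight
    def advance(self):
        w, self.buf[self.pos] = self.buf[self.pos], 0.0  # segment is reused
        self.pos = (self.pos + 1) % len(self.buf)
        return w

def step(y, tape):
    """One grid step: y_{t+h} = P(h) y_t + x_{t+h} (eqs. 3.1 and 3.2)."""
    x = np.array([np.e / tau_a * tape.advance(), 0.0, 0.0])
    y = P @ y + x
    return y, y[2] >= theta            # y3 is the membrane potential

# One spike of weight iota_hat, delivered with a one-step delay, reproduces the
# closed-form alpha current one step after its arrival:
tape = TapeDevice(10)
tape.deliver(iota_hat, 1)
y = np.zeros(3)
y, _ = step(y, tape)                   # spike not yet visible
y, _ = step(y, tape)                   # spike arrives: initial condition added
y, _ = step(y, tape)                   # one step later: y2 = iota(h)
print(np.isclose(y[1], iota_hat * (np.e * h / tau_a) * np.exp(-h / tau_a)))  # True
```

With h = τα = 0.1 ms, the synaptic current one step after the spike's arrival equals its peak value ι̂, which the final check exploits.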
If a neuron emits a spike at time t that has a
Figure 1: Spike generation and refractory periods for a grid-constrained (A) and a continuous-time implementation (B). The spike threshold Θ is indicated by the dashed horizontal line. The solid black curve shows the membrane potential time course for a neuron subject to a suprathreshold constant input current leading to a threshold crossing in the interval (t, t + h]. The gray vertical lines indicate the discrete time grid with spacing h. The refractory period τr in this example is set to its minimal value h. Filled circles denote observable values of the membrane potential; unfilled circles denote supporting points that are not observable. (A) In the grid-constrained implementation, the spike is emitted at t + h. During the refractory period τr, the membrane potential is clamped to zero. (B) In the continuous-time implementation, the threshold crossing is found by interpolation (here linear: black dashed line) at time t_Θ. The spike is emitted with the time stamp t + h and with an offset with respect to t of δ = t_Θ − t. The neuron is refractory from t_Θ until t_Θ + τr, during which period the membrane potential is clamped to zero. At grid point t + 2h, a finite membrane potential can be observed.
delay of d, the simulation algorithm waits until all neurons have completed their updates for the integration step (t − h, t] and then delivers the event to its target(s). The event is placed in the tape device of the target neuron d/h segments on from the current reading position. It will then become visible to the target neuron at the grid point t + d, when the neuron is performing the update for the integration step (t + d − h, t + d]. Recalling that the initial conditions for all events arriving at a given time may be lumped together (see equation 3.2) and that two of the three components of the initial conditions vector are zero, the segments of the looped tape device can be very simple. Each segment contains just one value, which is incremented by the weight of every spike event delivered there. In other words, when the neuron performs the integration step (t, t + h], the segment visible to the reading head contains the first component of x_{t+h} up to the scaling factor of e/τα. In terms of memory, the looped tape device needs as many segments as computation steps are required to cover the maximum synaptic delay, plus an additional segment to represent the events arriving at t + h. Therefore, performing a given simulation with a smaller time step will require more memory. The model described in section 2 has only a single synaptic dynamics and so requires only one tape device; models using multiple types of synaptic dynamics can be implemented in this framework by providing them with the corresponding number of tape devices.

4 Continuous-Time Implementation of Exact Integration

4.1 Canonical Implementation. The most obvious way to reconcile exact integration and precise spike timing within a discrete-time simulation is to store the precise times of incoming events. In order to represent this information, an offset must be assigned to each spike event in addition to the time stamp.
This offset is measured from the beginning of the interval in which the spike was produced: a spike generated at time t + δ receives a time stamp of t + h and an offset of δ (see Figure 1B). Given a sorted list of event offsets {δ1, δ2, . . . , δn} with δi ≤ h, which become visible to a neuron in the step (t, t + h], exact integration of the subthreshold dynamics can be performed from the beginning of the time step to the beginning of the list:

yt+δ1 = P(δ1)yt + xδ1;

then along the list:

yt+δ2 = P(δ2 − δ1)yt+δ1 + xδ2
. . .
yt+δn = P(δn − δn−1)yt+δn−1 + xδn;
Continuous Spike Times in Exact Discrete-Time Simulations
55
and finally from the end of the list to the end of the time step:
yt+h = P(h − δn )yt+δn .
The final term yt+h is the exact solution for the neuron dynamics at time t + h. This sequence is illustrated in Figure 2B. This assumes that the neuron does not produce a spike or emerge from its refractory period during this interval. These special cases are described in more detail below.

4.1.1 Spike Generation. In the grid-constrained implementation, the neuron state is inspected at the end of each time step to see if it meets its spiking criteria. In the case of the neuron model described in section 3, the criterion is y3 ≥ Θ, where Θ is the threshold. In this implementation, the neuron state can be inspected after every step of the process described in section 4.1. If y3(t + δi) < Θ and y3(t + δi+1) ≥ Θ, then the membrane potential of the neuron reached threshold between t + δi and t + δi+1. As the dynamics of this model is not invertible, the time tΘ of this threshold passing can be determined only by interpolating the membrane potential in the interval (t + δi, t + δi+1]. For this article, linear, quadratic, and cubic interpolation schemes were investigated. After the threshold crossing, the neuron is refractory for the duration of its refractory period τr. The membrane potential y3 is set to zero and need not be calculated during this period, although the other components of the neuron state continue to be updated as in section 4.1. At the end of the time interval, an event is dispatched with a discrete-time stamp of t + h and an offset of δ = tΘ − t (see Figure 1B).

4.1.2 Emerging from the Refractory Period. The neuron emerges from its refractory period in the time step defined by t < tΘ + τr ≤ t + h (see Figure 1B). In contrast to the grid-constrained implementation, τr does not have to be an integer multiple of h. For a continuous time τr, a grid position t can always be found such that tΘ + τr comes within (t, t + h]. However, the implementation is simpler when τr is a nonzero integer multiple of h.
To calculate the neuron state at time t + h exactly, the interval is divided into two subintervals: (t, tΘ + τr] and (tΘ + τr, t + h]. In the first period, the neuron is still refractory, so when performing the exact integration along the incoming events as in section 4.1, y3 is not calculated and remains at zero. At the end of this period, the neuron state at tΘ + τr is the exact solution for the dynamics at the end of the refractory period. In the second period, the neuron is no longer refractory, and so the exact integration can be performed as usual (including the calculation of y3). The neuron state yt+h is therefore an exact solution to the neuron dynamics at time t + h.
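The canonical subthreshold update of section 4.1 can be condensed into a short sketch. For readability it is deliberately simplified, and these simplifications are our assumptions, not the paper's model: the state is a single membrane potential with exact scalar propagator P(d) = exp(−d/τm), and each event contributes an instantaneous jump. In the paper's three-component model, P(d) is a matrix and xδ holds the event's initial conditions, but the control flow is the same: propagate exactly to each sorted offset, apply the event, then propagate to the end of the step.

```python
import math

def canonical_subthreshold_step(y, events, h, tau_m):
    """Exact subthreshold update over one grid step (t, t + h].

    y:      state at the beginning of the step (scalar potential here;
            a state vector with a matrix propagator in the full model)
    events: iterable of (delta, x) pairs, offsets 0 < delta <= h
    """
    t_prev = 0.0
    for delta, x in sorted(events):          # integrate along the sorted list
        y = y * math.exp(-(delta - t_prev) / tau_m) + x
        t_prev = delta
    # ... and from the last event to the end of the step.
    return y * math.exp(-(h - t_prev) / tau_m)
```

Because every substep is exact, the result agrees with the closed-form solution for any step size h, which is the sense in which the integration is exact.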
Figure 2: Handling of incoming spikes in the grid-constrained (A), the canonical (B), and the prescient (C) implementations. In each panel, the solid curve represents the excursion of the membrane potential in response to two incoming spikes (gray vertical lines). Filled circles denote observable values of the membrane potential; unfilled circles denote supporting points that are not observable. The gray horizontal arrows beneath each panel indicate the propagation steps performed during the time step (t, t + h]. Dashed arrows in A indicate that in the grid-constrained implementation, input spikes are effectively shifted to the next point on the grid. The observable membrane potentials following spike impact are identical and exact in B and C but differ from those in A.
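The difference between panels B and C can be made concrete with a sketch of the prescient delivery rule described in section 4.2. As a simplifying assumption of ours (not the paper's model), the state is a single membrane potential with scalar propagator P(d) = exp(−d/τm); in the paper, P(h − δ) is a matrix and three components are accumulated per buffer slot. The point illustrated is that, by linearity, the effect of each event at the end of its arrival step can be computed at delivery time and simply summed, so no offsets need to be stored or sorted.

```python
import math

def prescient_deliver(buf, slot, delta, x, h, tau_m):
    """Accumulate an event's effect at the end of its arrival step.

    Applies P(h - delta) to the event's initial condition x and sums the
    result into the buffer slot for that step (Figure 2C); the order in
    which events are delivered is irrelevant.
    """
    buf[slot] += x * math.exp(-(h - delta) / tau_m)

def prescient_step(y, y_tilde, h, tau_m):
    """Subthreshold update (t, t + h]: y_{t+h} = P(h) y_t + y~_{t+h}."""
    return y * math.exp(-h / tau_m) + y_tilde
```

For purely subthreshold dynamics this reproduces the canonical result exactly; spike generation and refractoriness require the corrections described in sections 4.2.3 and 4.2.4.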
4.1.3 Computational Requirements. As in the grid-constrained implementation, a minimum spike propagation delay of h is necessary in order to ensure that all events due at the neuron between t and t + h have arrived by the time the neuron performs its update for that interval. The simple looped event buffer described in section 3.1 must be extended to store the weights and offsets for incoming events separately. As the number of events arriving in an interval cannot be known ahead of time, this structure must be capable of dynamic memory allocation, which reduces its cache effectiveness. However, information about the minimum propagation delay in the network can be utilized to streamline the data structure so that its size does not depend on h, which compensates to some extent for the use of dynamic memory. Finally, as the events may arrive in any order, the buffer must also be capable of sorting the events with respect to increasing offset, which for a general-purpose sorting algorithm has complexity O(n log n). An alternative representation of the timing data is a priority queue; in practice, this was not quite as efficient as the looped tape device. In contrast to the grid-constrained implementation, delays can now be represented in continuous time. A spike arrival time t + δ + d, where t + δ is a continuous spike generation time and d is a continuous time delay, can always be decomposed into a discrete grid point t + d and a continuous offset δ. However, for notational and implementational convenience (see section 6), we assume d to be a nonzero integer multiple of h.

4.2 Prescient Implementation. In the implementation described in section 4.1, the neuron state at the end of a time step is calculated by integrating along a sorted list of events. However, as the subthreshold dynamics is linear, it is not dependent on the order of events.
In this section, an implementation is presented that exploits this fact to reduce the computational complexity and dynamic memory requirements of the canonical implementation.

4.2.1 Receiving an Event. Consider a spike event generated during the step (t − h, t] with offset δ and transmitted with delay d. This spike will be processed during the update step (t + d − h, t + d], its effect being observable for the first time at t + d. Since the correct spike arrival time is t + d − h + δ, when the algorithm delivers the spike to the neuron we evolve the effect of the spike input from the arrival time to the end of the interval at t + d using

ỹt+d = P(h − δ)x.

Therefore, instead of storing the entire event as for the canonical implementation, the three components of its effect on the neuron state can be stored in an event buffer at the position d instead. Due to the linearity of the system,
these components can be summed for all events due to arrive at the neuron in a given time step, regardless of the order in which the algorithm delivers them to the neuron or the order in which they are to become visible to the neuron. As we exploit the fact that the effect of an event on the neuron can be calculated before the event becomes visible to the neuron, we call this the prescient implementation.

4.2.2 Calculating the Subthreshold Dynamics. At the beginning of each time interval (t, t + h], the total effect of all events arriving within that step on the three components of the neuron state at time t + h is already stored in the event buffer. Calculating the neuron state at the end of the time step is therefore simple:

yt+h = P(h)yt + ỹt+h.

The new neuron state yt+h is the exact solution to the neuron dynamics at time t + h, including all events that arrived within the step (t, t + h]. This is depicted in Figure 2C. As with the canonical implementation, there are two special cases that need to be treated with more care: a time step in which a spike is generated and one in which the neuron emerges from its refractory period.

4.2.3 Spike Generation. The process that generates a spike is very similar to that for the canonical implementation described in section 4.1.1. In this case, as the timing of the incoming events is no longer known, the neuron state can be inspected only at the end of the time step rather than at each incoming event, and so the length of the interpolation interval is h rather than the interspike interval of incoming events.

4.2.4 Emerging from the Refractory Period. As for the canonical implementation, the time step in which the neuron emerges from its refractory period is divided into two subintervals: (t, tΘ + τr] and (tΘ + τr, t + h]. Setting tem = tΘ + τr − t, the neuron state at the end of the refractory period can be calculated as follows:

yt+tem = P(tem)yt
y3(t + tem) ← 0.
Having emerged from the refractory period, the membrane potential is no longer clamped to zero and can develop normally during the remainder of the time step (see Figure 1B):

yt+h = P(h − tem)yt+tem + ỹt+h.
However, this overestimates the effect of the events arriving in the time interval (t, t + h] on the membrane potential. The summation of the components of these events was predicated on the assumption of linear dynamics, but as the membrane potential is clamped to zero until t + tem, this assumption does not hold. Any events arriving at the neuron before its emergence from its refractory period should have no effect on its membrane potential before this point, yet adding the full value of the third component of ỹt+h assumes that they do. As a corrective measure, the effect of the new events on the membrane potential can be considered to be linear within the small interval (t, t + h], and the membrane potential can be adjusted accordingly:

y3(t + h) ← y3(t + h) − γ ỹ3(t + h),
with γ = tem/h.

4.2.5 Computational Requirements. As in the grid-constrained and canonical implementations, a minimum spike propagation delay of h is required to preserve causality. The looped-tape device described in section 3.1 needs to be able to store the three components of the neuron state rather than just the weight of the incoming events. Alternatively, three event buffers can be used, capable of storing one component each. Unlike the buffer devices for the canonical implementation, they need to store only one value per time step rather than one for each incoming spike, so there is no time overhead for sorting the values. Moreover, they do not require dynamic memory allocation and so are more cache effective.

5 Performance

In order to compare error scaling for the different implementations and interpolation orders, a simple single-neuron simulation was chosen. As the system is deterministic and nonchaotic, reducing the computation time step h causes the simulation results to converge to the exact solution, so error measures can be well defined. To investigate the costs incurred by simulating at finer resolutions or using computationally more expensive off-grid neuron implementations, a network simulation was chosen. This is fairer than a single-neuron simulation, as the run-time penalties of applications requiring more memory will come into play only if the application is large enough not to fit easily into the processor's cache memory. Furthermore, it is only when performing network simulations that the bite of long simulation times per neuron is really felt.

5.1 Single-Neuron Simulations. Each experiment consisted of 40 trials of 500 ms each, during which a neuron of the type described in section 2 was stimulated with a constant excitatory current of 412 pA and unique
Figure 3: Scaling of error in membrane potential as a function of the computational resolution in double logarithmic representation. (A) Canonical implementation. (B) Prescient implementation. No interpolation, circles; linear interpolation, plus signs; quadratic interpolation, diamonds; cubic interpolation, multiplication signs. In both cases, the triangles show the behavior of the grid-constrained implementation, and the gray lines indicate the slopes expected for scaling of orders first to fourth with an arbitrary intercept of the vertical axis.
realizations of an excitatory Poissonian spike train of 1.3 × 10^4 Hz and an inhibitory Poissonian spike train of 3 × 10^3 Hz. The spike times of the Poissonian input trains were represented in continuous time. Parameters are as in section 2, but the peak value of the current resulting from an inhibitory spike was a factor of 6.25 greater than that of an excitatory spike to ensure a balance between excitation and inhibition. The output firing rate was 12.7 Hz. The experiment was repeated for each implementation with each interpolation order over a wide range of computational resolutions. As the membrane potential and spike times cannot be calculated analytically for this protocol, the canonical implementation with cubic interpolation at the finest resolution (2^−13 ms ≈ 0.12 µs) was defined to be the reference simulation for each realization of the input spike train. As a measure of the error in calculating the membrane potential, the deviation of the actual membrane potential from the reference membrane potential was sampled every millisecond for all the trials. In Figure 3, the median of these deviations is plotted as a function of the computational resolution in double logarithmic representation. In both the canonical implementation (see Figure 3A) and the prescient implementation (see Figure 3B), the same scaling behavior can be seen: for an interpolation order of n, the error in membrane potential scales with order n + 1 (see section A.4). The error has a lower bound at 10^−14, which can be seen for very fine resolutions using cubic interpolation. This represents the greatest
Figure 4: Scaling of error in spike times as a function of the computational resolution. (A) Canonical implementation. (B) Prescient implementation. Symbols and lines as in Figure 3.
numerical precision possible for this physical quantity using the standard representation of floating-point numbers (see section A.1). Interestingly, the error for the canonical implementation also saturates at coarse resolutions. This is because interpolation is performed between incoming events rather than across the whole time step, as in the case of the prescient implementation. Consequently, the effective computational resolution cannot be any coarser than the average interspike interval of the incoming spike train (in this case, 1/16 ms), and this determines the maximum error of the canonical implementation. Note that the error of the grid-constrained implementation scales in the same way as that of the canonical and prescient implementations with no interpolation. However, due to the fact that incoming spikes are forced to the grid, the absolute error is greater for this implementation. The accuracy of a simulation is not determined by the membrane potential alone; the precision of spike times is of at least as much relevance. The median of the differences between the actual and the reference spike times is shown in Figure 4. As with the error in membrane potential, using an interpolation order of n results in a scaling of order n + 1, and the error has a lower bound that is exhibited at very fine resolution when using cubic interpolation. Furthermore, a similar upper bound on the error is observed for the canonical implementation at coarse resolutions. However, in this case, the grid-constrained implementation exhibits not only the same scaling, but a similar absolute error to the continuous-time implementations with no interpolation. 
Recalling that all the simulations receive identical continuous-time Poisson spike trains, the only difference remaining between the grid-constrained implementation and the continuous-time implementations without interpolation is that the former ignores the offset of the incoming spikes and treats them as if they were produced on the
grid, whereas the latter process the incoming spikes precisely. This reveals that handling incoming spikes precisely confers no real advantage if outgoing spikes are generated without interpolation and thereby forced to the grid. One might therefore conclude that the precise handling of incoming spikes is unnecessary and that the single-neuron integration error could be significantly improved just by performing an appropriate interpolation to generate spikes, while treating incoming spikes as if they were produced on the grid. In fact, this is not the case. If incoming spikes are treated as if on the grid, the error in the membrane potential decreases only linearly with h, thus limiting the accuracy of higher-order methods in determining threshold crossings. This is corroborated by simulations (data not shown). Substantial improvement in the accuracy of single-neuron simulations requires both techniques: precise handling of incoming spikes and interpolation of outgoing spikes.

5.2 Network Simulations. In order to determine the efficiency of the various implementations, a balanced recurrent network was adapted from Brunel (2000). The network contained 10,240 excitatory and 2560 inhibitory neurons and had a connection probability of 0.1, resulting in a total of 15.6 × 10^6 synapses. The inhibitory synapses were a factor of 6.25 stronger than the excitatory synapses, and each neuron received a constant excitatory current of 412 pA as its sole external input. Membrane potentials were initialized to values chosen from a uniform random distribution over [−Θ/2, 0.99 Θ]. In this configuration, the network fires at approximately 12.7 Hz in the asynchronous irregular regime, which recreates the input statistics used in the single-neuron simulations. The synaptic delay was 1 ms, and the network was simulated for 1 biological second. The simulation time and memory requirements for the network simulation described above are shown in Figure 5.
For the simulation time (see Figure 5A), it can be seen that at coarse resolutions, the grid-constrained implementation is significantly faster than the prescient implementation, which in turn is faster than the canonical implementation. This is due to the fact that the cost of processing spikes is essentially independent of the computational resolution and manifests as an implementation-dependent constant contribution to the simulation time, which is particularly dominant at coarse resolutions. The difference in speed between the canonical and prescient implementations results from the use of dynamic as opposed to static data structures, and to a lesser extent from the cost of sorting incoming spikes in the canonical implementation. As the computation time step decreases, the simulation times converge, because the cost of updating the neuron dynamics in the absence of events, which is the same for all implementations, is inversely proportional to the resolution and so manifests as a scaling with exponent −1 at small computation time steps (see Figure 5A). It is clear that in general, at the same computation time step, the continuous-time implementations must be slower than the grid-constrained
Figure 5: Simulation time (A) and memory requirements (B) for a network simulation as functions of computational resolution in double logarithmic representation. Triangles, grid-constrained neuron; plus signs, canonical implementation with cubic interpolation; circles, prescient implementation with cubic interpolation. Other interpolation orders for the canonical and the prescient implementations result in practically identical curves and are therefore not shown. For details of the simulation, see the text.
implementation, as the former perform a propagation for every incoming spike and the latter does not. The increased costs concomitant with higher interpolation orders proved to be negligible in a network simulation. An increase in memory requirements can be observed for all implementations (see Figure 5B) as the resolutions become finer. Although all implementations require much the same amount of memory at coarser resolutions, for finer resolutions, the canonical implementation requires the least memory, followed by the grid-constrained implementation, and the prescient implementation requires the most. It is clear that for a wide range of resolutions, the memory required by the rest of the network, specifically the synapses, dominates the total memory requirements (Morrison, Mehring, et al., 2005). As the resolution becomes finer, the memory required for the input buffers for the neurons plays a greater role. The spike buffer for the canonical implementation is independent of the resolution (see section 4.1.3), and so it might seem that it should not require more memory at finer resolution. However, all implementations tested here also have a buffer for a piece-wise constant input current. In addition to this, the grid-constrained implementation has one buffer for the weights of incoming spikes, and the prescient implementation has three buffers—one for each component of the state vector. All of these buffers require memory in inverse proportion to the resolution, thus explaining the ordering of the curves. Generally a smaller application is more cache effective than a larger one, and this may explain why the canonical implementation exhibits slightly lower
simulation times than the other implementations at very fine resolutions (see Figure 5A).

5.3 Conjunction of Integration Error and Run-Time Costs. The considerations in section 5.2 of how the simulation time and memory requirements increase with finer resolutions are of limited practical relevance to a scientist with a particular problem to investigate. More interesting in this case is how much precision bang you get for your simulation time buck. Unlike the single neuron, however, the network described above is a chaotic system (Brunel, 2000). Any deviation at all between simulations will lead, in a short time, to totally different results on the microscopic level, such as the evoked spike patterns. Such a deviation can even be caused by differences in the tiny round-off errors that occur if floating-point numbers are summed in a different order. Because of this, these simulations do not converge on the microscopic level as the single-neuron simulations do, and for that reason the so-called accuracy of a simulation cannot be taken at face value. We therefore relate the cost of network simulations to the accuracy of single-neuron simulations with comparable input statistics. In Figure 6, the simulation time and memory requirements data from Figure 5 are combined with the accuracy of the corresponding single-neuron simulations shown in Figure 4, thus eliminating the computational resolution as a parameter. Figure 6A shows the simulation time as a function of spike time error for the three implementations. This graph can be read in two directions: horizontally and vertically. By reading the graph horizontally, we can determine which implementation will give the best accuracy for a given affordable simulation time. Reading the graph vertically allows us to determine which implementation will result in the shortest simulation time for a given acceptable accuracy.
Concentrating on the latter interpretation, it can be seen from the intersection of the lines corresponding to the prescient and grid-constrained implementations (vertical dashed line in Figure 6A) that if an error greater than 2.3 × 10^−2 ms is acceptable, the grid-constrained implementation is faster. For better accuracy than this, the prescient implementation is more effective. If an appropriate time step is chosen, the prescient implementation can simulate more accurately and more quickly than a given grid-constrained simulation in this regime. Only for very high accuracy can a lower simulation time be achieved using the canonical implementation. Similarly, Figure 6B, which shows the memory requirements as a function of spike-time error in double logarithmic representation, can be read in both directions. This shows qualitatively the same relationship as in Figure 6A, but the point at which one would switch from the prescient implementation to the canonical implementation in order to conserve memory occurs for larger errors. The flatness of the curves for the continuous-time implementations shows that it is possible to increase the accuracy of a simulation considerably without having to worry about the memory requirements.
Figure 6: Analysis of simulation time and memory requirements for a network simulation as functions of spike-time error for the single-neuron simulation and input spike rate. (A) Simulation time as a function of spike-time error in double logarithmic representation. Triangles, grid-constrained neuron; plus signs, canonical implementation with cubic interpolation; circles, prescient implementation with cubic interpolation. Data combined from Figure 4 and Figure 5. For errors smaller than 2.3 × 10^−2 ms (vertical dashed line), a continuous-time implementation with an appropriately chosen computation step size gives better performance. (B) Memory consumption as a function of spike-time error in double logarithmic representation. Symbols as in A. (C) Error in spike time for which the prescient and grid-constrained implementations require the same simulation time (for different appropriate choices of h) as a function of input spike rate. The unfilled circle indicates the equivalence error for a network rate of 12.7 Hz (input rate 16 kHz), that is, the intersection marked by the vertical dashed line in A. The gray line is a linear fit to the data (slope −0.95). (D) Simulation time as a function of input spike rate for h = 0.125 ms, symbols as in A. The gray lines are linear fits to the data (slopes 1.3, 0.8 and 0.3 s per kHz from top to bottom).
In general, the point at which the prescient implementation at an appropriate computational resolution can produce a faster and more accurate simulation than a grid-constrained simulation will depend on the rate of events a neuron has to handle. To investigate this relationship, the network described in section 5.2 was simulated with different input currents and inhibitory synapse strengths to generate a range of different firing rates in the asynchronous irregular regime. The single-neuron integration error at which the prescient and the grid-constrained implementations require the same simulation time is shown in Figure 6C as a function of the input spike rate. This equivalence error depends linearly on the input spike rate, demonstrating that in the parameter space of useful accuracies and realistic input rates, there is a wide regime where the prescient is faster than the grid-constrained implementation. Underlying the benign nature of the comparative effectiveness analyzed in Figure 6C is the dependence of simulation time on the rate of events. For all implementations, the simulation time increases practically linearly with the input spike rate (see Figure 6D), albeit with different slopes.

5.4 Artificial Synchrony. The previous section has shown that in many situations, continuous-time implementations achieve a desired single-neuron integration error more effectively than the grid-constrained implementation. However, continuous-time implementations have an advantage compared to the grid-constrained implementation beyond the single-neuron integration error. In a network simulation carried out with the grid-constrained implementation, the spikes of all neurons are aligned to the temporal grid defined by the computation time step. This causes artificial synchronization between neurons that may distort measures of synchronization and correlation on the network level. To demonstrate this effect, Hansel et al.
(1998) investigated a small network of N integrate-and-fire neurons with excitatory all-to-all coupling. Here, we extend their analysis to the three implementations under study and provide a comparison of single-neuron and network-level integration error. In contrast to their study, our network is constructed from the model neuron introduced in section 2. The time constant of the synaptic current τα is adjusted to the rise time of the synaptic current in the original model, which was described by a β-function. The synaptic delay d and the absolute refractory period τr are set to the maximum computation time step h investigated in this section. Note that this choice of d and τr means that these parameters can be represented at all computational resolutions, thus ensuring that all simulations using the grid-constrained implementation are solving the same dynamical system. Figure 7A illustrates the synchrony in the network as a function of synaptic strength. Following Hansel et al. (1998), synchrony is defined as the variance of the population-averaged membrane potential normalized by the population-averaged variance of the membrane
Figure 7: Synchronization error in a network simulation. (A) Network synchrony (see equation 5.1) as a function of synaptic strength in a fully connected network of N = 128 neurons (τα = (3/2) ln 3 ms, synaptic delay d = τr = 0.25 ms) with excitatory coupling (cf. Hansel et al., 1998). Neurons are driven by a suprathreshold DC I0 = 575 pA, no further external input. The initial membrane potential is Vi(0) = (τm/C) I0 [1 − exp(−γ ((i − 1)/N)(T/τm))], where i ∈ {1, . . . , N} is the neuron index, T the period in the absence of coupling, and γ = 0.5 controls the initial coherence. The simulation time is 10 s, and V is recorded in intervals of 1 ms between 5 s and 10 s. Synaptic strength is expressed as the amplitude of a postsynaptic current relative to the rheobase current I∞ = (C/τm)Θ = 500 pA and multiplied by the number of neurons N. Other parameters as in section 2. Canonical implementation as reference (h = 2^−10 ms, gray curve) and prescient implementation (h = 2^−2 ms, circles), both with cubic interpolation; grid-constrained implementation (h = 2^−5 ms, triangles). (B) Synchronization error as a function of the computation time step in double logarithmic representation: grid-constrained implementation, triangles; prescient implementation, circles; canonical implementation, plus signs. (C) Synchronization error as a function of the single-neuron integration error for the grid-constrained and the prescient implementation, same representation as in B. The gray lines are linear fits to the data.
A. Morrison, S. Straube, H. Plesser, and M. Diesmann
potential time course,

S = [⟨⟨Vi(t)⟩_i²⟩_t − ⟨⟨Vi(t)⟩_i⟩_t²] / ⟨⟨Vi(t)²⟩_t − ⟨Vi(t)⟩_t²⟩_i,    (5.1)
where ⟨·⟩_i indicates averaging over the N neurons and ⟨·⟩_t indicates averaging over time. This is a measure of coherence in the limit of large N, with S = 1 for full synchronization and S = 0 for the asynchronous state. The grid-constrained implementation exhibits a considerable error in synchrony, which vanishes as the asynchronous regime is approached. The prescient implementation accurately preserves network synchrony even with a significantly larger computation time step. The error in synchrony is quantified in Figure 7B as the root mean square of the relative deviation of S with respect to the reference solution, estimated over the range of synaptic strengths investigated. Note that this includes the asynchronous regime, where errors are small in general. The prescient implementation is easily an order of magnitude more accurate than the grid-constrained implementation and is itself outperformed by the canonical implementation. In addition, for the continuous-time implementations, the error in synchrony drops more rapidly with decreasing computation time step h than for the grid-constrained implementation. However, at the same h, different integration methods exhibit a different integration error for the single-neuron dynamics (see Figure 4). Therefore, to accentuate network effects, continuous-time and grid-constrained implementations should be compared at the same single-neuron integration error. To this end, we proceed as follows: the network spike rate is approximately 80 Hz, corresponding to an input spike rate of some 10 kHz. In Figure 7C, the error in network synchrony at a given computation time step h is plotted as a function of the spike time error of a single neuron driven with an input spike rate of approximately 10 kHz, simulated at the same h. For single-neuron errors of 10^−2 ms and above, the grid-constrained implementation results in considerable errors in network synchrony.
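For concreteness, the synchrony measure of equation 5.1 can be evaluated directly from a matrix of recorded membrane potentials. The sketch below is our own illustration of the definition (NumPy, with an assumed array shape), not the implementation used in the simulations:

```python
import numpy as np

def synchrony(V):
    """Synchrony S of equation 5.1 for an array V of shape (N, T):
    the variance over time of the population-averaged potential,
    normalized by the population average of the single-neuron
    variances over time."""
    pop_avg = V.mean(axis=0)            # <V_i(t)>_i, one value per time bin
    numerator = pop_avg.var()           # variance over time of the population average
    denominator = V.var(axis=1).mean()  # <variance over t of V_i(t)>_i
    return numerator / denominator
```

Identical traces give S = 1, while N independent traces give S of order 1/N, consistent with the limits stated in the text.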
Spike timing errors of 10^−2 ms and below are required for the grid-constrained and the prescient implementations to achieve a synchronization error in the 1% range or better. Interestingly, the grid-constrained implementation exhibits a larger synchronization error than the prescient implementation for identical single-neuron integration errors.

6 Discussion

We have shown that exact integration techniques are compatible with continuous-time handling of spike events within a discrete-time simulation. This combination of techniques achieves arbitrarily high accuracy (up to machine precision) without incurring any extra management costs in the global algorithm, such as a central event queue or looking ahead to see
which neuron will fire next. This is particularly important for the study of large networks with frequent events, as the cost of managing events can become prohibitive (Mattia & Del Giudice, 2000; Reutimann, Giugliano, & Fusi, 2003). We introduced a canonical implementation that illustrates the principles of combining these techniques and a prescient implementation that further exploits the linearity of the subthreshold dynamics. The latter implementation simplifies the neuron update algorithm and requires only static data structures and no queuing, leading to a better time and accuracy performance than the canonical implementation. We compared interpolating polynomials of orders 1 to 3 and discovered that the increased numerical complexity of the higher-order interpolations was not reflected in the run time, which is dominated by other factors. Furthermore, it was shown that the highest-order interpolation performed stably. This suggests that the highest-order interpolation should be used, as the greater accuracy is obtained at negligible cost. We have investigated the nature of the trade-off between accuracy and simulation time/memory and demonstrated that for a large range of input spike rates, it is possible to find a combination of continuous-time implementation and computation time step that fulfills a given maximum error requirement both more accurately and faster than a grid-constrained simulation. This measure of efficiency is based on truly large-scale networks (12,800 neurons, 15.6 million synapses). The techniques described here have several possible extensions. First, the canonical implementation places no constraints on the neuron model used beyond the physically plausible requirement that the membrane potential is thrice continuously differentiable. It may therefore be used to implement essentially any kind of neuronal dynamics, including neurons with conductance-based synapses. 
The prescient implementation further requires that the neuron’s dynamics is linear; it may thus be used for a wide range of model neurons with current-based synapses. The neuron model we implemented does not have invertible dynamics, and so the determination of its spike time necessarily involves approximation. For some neuron models, it is possible to determine the precise spike time without recourse to approximation, such as the Lapicque model (Tuckwell, 1988) and its descendants (Mirollo & Strogatz, 1990). Such models can, of course, also be implemented in this framework, but they would have only the same precision as a classical event-based simulation if they were canonically implemented (obviously without interpolation); a prescient implementation would be able to represent the subthreshold dynamics exactly on the grid but would entail the use of approximate methods to determine the spike times. Although we investigated polynomial interpolation, other methods of spike time determination such as Newton’s method can be implemented with no change to the conceptual framework. Second, most constraints imposed in terms of the computational time step h may be relaxed. As indicated in section 4.1.3, the restriction of
delays to nonzero integer multiples of h can be relaxed to any floating-point number ≥ h. When a neuron spikes, the offset of the spike could then be combined on the fly with the delay to create an integer component and a revised offset, thus allowing the spike to be delivered correctly by the global discrete-time algorithm and processed correctly by the continuous-time neuron model. This relaxation would come at the memory cost of having to store delays as floating-point numbers rather than integers and the computational cost of having to perform the on-the-fly delay decomposition. Furthermore, h is currently a parameter of the entire network, so it is the same for all neurons in a given simulation. Given that the minimum propagation delay already defines natural synchronization points to preserve causality in the simulation, it would be possible to allow h to be chosen individually for each neuron in the network, or even to use variable time steps, while still maintaining consistency with the global discrete-time algorithm. Finally, the techniques are compatible with the distributed computing techniques described in Morrison, Mehring, et al. (2005), requiring only that the spike-time offsets are communicated in addition to the indices of the spiking neurons. This increases the bulk but not the frequency of communication, as it is still sufficient to communicate in intervals determined by the minimum propagation delay. A similar minimum delay principle is used by Lytton and Hines (2005), again suggesting a convergence of time-driven and event-driven approaches.
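The on-the-fly delay decomposition described above can be sketched in a few lines. This is an illustration only, not the actual simulator code; the function name and argument conventions are our own:

```python
import math

def decompose_delay(delay, offset, h):
    """Split a floating-point delay (>= h) plus a spike-time offset into
    an integer number of computation steps and a revised offset, so that
    delay + offset == steps * h + new_offset with 0 <= new_offset < h.
    The integer part can then be handled by the global discrete-time
    machinery, the revised offset by the continuous-time neuron model."""
    total = delay + offset
    steps = int(math.floor(total / h))
    new_offset = total - steps * h
    return steps, new_offset
```

For example, with h = 0.25 ms, a delay of 1.3 ms and a spike offset of 0.05 ms yield 5 full steps and a revised offset of 0.1 ms, so the spike can still be routed by the integer step machinery.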
However, a good rule of thumb is that it should be possible to discriminate spike times an order of magnitude more accurately than the characteristic timescales of the macroscopic phenomena to be observed, such as the temporal structure of cross-correlations. Note that even if the characteristics of the system to be investigated suggest that a grid-constrained implementation is optimal, the availability of equivalent continuous-time implementations is still advantageous: should the suspicion arise that an observed phenomenon is an artifact of the grid constraints, the continuous-time implementations can be used to test this without altering any other part of the simulation. For much the same reason, it is very useful to be able to modify the time step without having to adjust the rest of the simulation. The network studied in section 5.4 illustrates two more important points. First, it demonstrates that exceedingly small single-neuron integration errors may be required to accurately capture network synchronization. Second, it is clear from Figure 7C that continuous-time implementations are better at rendering macroscopic measures such as synchrony correctly: in conditions where the grid-constrained and prescient implementations
achieve the same single-neuron spike-timing error, the prescient implementation yields a significantly smaller error in network synchrony. Accuracy cannot be improved on indefinitely: Figures 3 and 4 show that the errors in membrane potential and spike timing saturate at about 10^−14 mV and 10^−14 ms, respectively, for a time step of around 10^−3 ms. The saturation accuracy is close to the maximal precision of the standard floating-point representation of the computer, and so 10^−3 ms represents a lower bound on the range of useful h. An upper bound is determined by the physical properties of the system. First, h may not be larger than the minimum propagation delay in the network or the refractory period of the neurons. Second, using a large h increases the danger that a spike is missed. This can occur if the true trajectory of the membrane potential passes through the threshold within a step but is subthreshold again by the end of the step. This is less of an issue for the canonical implementation, as the check for a suprathreshold membrane potential is performed not just at the end of every step but also at the arrival of every incoming event (see section 4.1.1). There is a common perception that event-driven algorithms are exact and time-driven algorithms are approximate. However, both parts of this perception are generally false. With respect to the first part, event-driven algorithms are not by the nature of the algorithm more exact than time-driven algorithms. It depends on the dynamics of the neuron model whether an event-driven algorithm can find an exact solution, just as it does for time-driven algorithms. For a restricted class of models, the spike times can be calculated exactly through inversion of the dynamics. For other models, approximate methods to determine the spike times need to be employed. With respect to the second part, time-driven algorithms are not necessarily approximate.
A discrete-time algorithm does not imply that spike times have to be constrained onto the grid, as shown by Hansel et al. (1998) and Shelley and Tao (2001). Moreover, the subthreshold dynamics for a large class of neuron models can be integrated exactly (Rotter & Diesmann, 1999). Here we combine these insights to show that the degree of approximation in a simulation is not determined by whether an event-driven or a time-driven algorithm is used but by the dynamics of the neuron model. A further question is whether the terms time-driven and event-driven should even be used in this mutually exclusive way. In our algorithm, neuron implementations treating incoming and outgoing spikes in continuous time are seamlessly integrated into a global discrete-time algorithm. Should this therefore be considered a time-driven or an event-driven algorithm? We believe that this combination of techniques represents a hybrid algorithm that is globally time driven but locally event driven. Similarly, when designing a distributed simulation algorithm (Morrison, Mehring, et al., 2005), it was shown that a time-driven neuron updating algorithm can be successfully combined with event-driven synapse updating, again suggesting that no dogmatic distinction between the two approaches need
be made. However, although we were able to demonstrate the potential advantages of a hybrid algorithm, these findings do not in principle rule out the existence of pure event-driven or time-driven algorithms with identical universality and better performance for a given set of parameters than the schemes presented here. In closing, we express our hope that this work can help defuse the at times heated debate between advocates of event-driven and time-driven algorithms for the simulation of neuronal networks.

Appendix: Numerical Techniques

In this appendix, we present the numerical techniques employed to achieve the reported accuracy.

A.1 Accuracy of Floating-Point Representation of Physical Quantities. The double representation of floating-point numbers (Press, Teukolsky, Vetterling, & Flannery, 1992) used by standard computer hardware limits the accuracy with which physical quantities can be stored. The machine precision ε is the smallest number for which the double representation of 1 + ε is different from the representation of 1. Consequently, the absolute error σx of a quantity x is limited by the magnitude of x, σx ≈ 2^⌊log2 x⌋ · ε. In double representation, we have ε = 2^−52 ≈ 2.22 · 10^−16. Membrane potential values y are on the order of 20 mV; the lower limit of the integration error σy in the membrane potential is therefore on the order of 5 · 10^−15 mV. According to the rules of error propagation, the error in the time of threshold crossing σδ depends on the error in membrane potential as σδ = σy · |1/ẏ|. Typical values of the derivative |ẏ| of the membrane potential are on the order of 1 mV per ms (single-neuron simulations, data not shown), from which we obtain 5 · 10^−15 ms as a lower bound for the error in spike timing. Therefore, the observed integration errors at which the simulations saturate are close to the limits imposed by the double representation for both physical quantities.

A.2 Representation of Spike Times.
We have seen in section A.1 that the absolute error σx depends on the magnitude of the quantity x. As a consequence, the error of spike times recorded in double representation increases with simulation time. An additional error is introduced if the computation time step h cannot be represented as a double (e.g., 0.1 ms).
Therefore, we record spike times as a pair of values {t/h, δ}. The first is an integral number in units of h, represented as a long int, specifying the computation step in which the spike was emitted. The second is the offset of the spike time in units of ms, represented as a double. If h is a power of 2 in units of ms, both values can be represented as doubles without loss of accuracy.

A.3 Evaluating the Update Equation. The implementation of the update equation 3.1 in the target programming language requires attention to numerical detail if utmost precision is desired. We were able to reduce membrane potential errors in a nonspiking simulation from some 10^−12 mV to about 10^−15 mV by careful rearrangement of terms. Although details may depend significantly on processor architecture and compiler optimization strategies, we will briefly recount the implementation we found optimal. The matrix-vector multiplication in equation 3.1 describes updates of the form

y ← (1 − e^(−h/τ)) a + e^(−h/τ) y,

where a and y are of order unity, while h ≪ τ so that e^(−h/τ) ≈ 1. For a time step of h = 2^−12 ms and a time constant of τ = 10 ms, one has γ = 1 − e^(−h/τ) ≈ 10^−5. The quantity γ can be computed accurately for small values of h/τ using the function expm1(x) provided by current numeric libraries (C standard library; see also Galassi et al., 2001). Using double resolution, γ will have some 15 significant digits, spanning roughly from 10^−5 to 10^−20, and all of these digits are nontrivial. The exponential e^(−h/τ) may be computed to 15 significant digits using exp(x); the first five of these will be trivial nines, though, leaving just 10 nontrivial digits, which furthermore span down to only 10^−15. We thus rewrite the equation above entirely in terms of γ,

y ← γ a + (1 − γ) y,

and finally as

y ← γ (a − y) + y.

In our experience, this final form yields the most accurate results, as the full precision of γ is retained as long as possible. Note that computing the 1 − γ term above discards the five least significant digits of γ. When several terms need to be added, they should be organized according to their expected magnitude, starting with the smallest components.

A.4 Polynomial Interpolation of the Membrane Potential. In order to approximate the time of threshold crossing, the membrane potential yt
known at grid points t with spacing h can be interpolated with polynomials of different order. For the purpose of this section, we drop the index specifying the membrane potential as the third component of the state vector. Without loss of generality, we assume that the threshold crossing occurs at time δ* in the interval (0, h]. The corresponding values of the membrane potential are denoted y0 < θ and yh ≥ θ, respectively. The threshold crossings are found using the explicit formulas for the roots of polynomials of order n = 1, 2, and 3 (Weisstein, 1999). In order to constrain the polynomials, we exploit the fact that the derivative of the membrane potential can be easily obtained from the state vector at both sides of the interval. For the grid-constrained (n = 0) simulation and linear (n = 1) interpolation, we demonstrate why the error in spike timing decreases with h^(n+1).

A.4.1 Grid-Constrained Simulation. In the variables defined above, the approximate time of threshold crossing δ equals the computation time step h; the spike is reported to occur at the right border of the interval (0, h]. Assuming the membrane potential y(t) to be exact, the error in membrane potential with respect to the value at the exact point of threshold crossing is ε = yh − y(δ*). Let us require that y(t) is sufficiently often differentiable and that the derivatives assume finite values. We can then express the membrane potential as a Taylor expansion originating at the left border of the interval, y(t) = y0 + ẏ0 t + O(t²). Considering terms up to first order, we obtain

ε = {y0 + ẏ0 h} − {y0 + ẏ0 δ*} = ẏ0 (h − δ*).

Hence, ε reaches its maximum amplitude at δ* = 0, and we can write |ε| ≤ |ẏ0| h. The error in spike timing is

σ = h − δ* = (1/ẏ0) ε,

and |σ| ≤ h.
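Two of the numerical points above, the ulp-limited resolution of section A.1 and the expm1-based computation of γ in section A.3, can be illustrated in a few lines. This is our own sketch with illustrative values (h/τ = 10^−13 is chosen only to expose the cancellation; it is not a parameter from this work):

```python
import math

# Machine precision of the double representation (section A.1):
eps = 2.0 ** -52
# Absolute resolution near a membrane potential of 20 mV is one ulp:
sigma_y = math.ulp(20.0)            # roughly 3.6e-15 mV

# Propagator factor gamma = 1 - exp(-h/tau) (section A.3).
# For small h/tau, expm1 avoids the cancellation in 1 - exp(...):
x = 1e-13                           # illustrative h/tau
gamma_good = -math.expm1(-x)        # accurate to machine precision
gamma_naive = 1.0 - math.exp(-x)    # loses several significant digits

def update(y, a, gamma):
    """One subthreshold update in the rearranged form y <- gamma*(a - y) + y."""
    return gamma * (a - y) + y
```

The rearranged update is algebraically identical to γ a + (1 − γ) y but keeps the full precision of γ, as argued in section A.3.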
A.4.2 Linear Interpolation. A polynomial of order 1 is naturally constrained by the values of the membrane potential (y0 and yh) at both ends of the interval. With yt = a t + b, the set of equations specifying the coefficients of the polynomial (a and b) is

(a, b)ᵀ = [[0, 1], [h, 1]]⁻¹ (y0, yh)ᵀ = [[−h⁻¹, h⁻¹], [1, 0]] (y0, yh)ᵀ = ((yh − y0) h⁻¹, y0)ᵀ.

Thus, in normalized form, we need to solve

0 = ((yh − y0)/h) δ + (y0 − θ)    (A.1)

for δ to find the approximate time of threshold crossing. At the exact point of threshold crossing δ*, the error in membrane potential is

ε = ((yh − y0)/h) δ* + y0 − y(δ*).    (A.2)

Let us require that y(t) is sufficiently often differentiable and that the derivatives assume finite values. We can then express the membrane potential as a Taylor expansion originating at the left border of the interval, y(t) = y0 + ẏ0 t + (1/2) ÿ0 t² + O(t³). Considering terms up to second order, we obtain

ε = {ẏ0 δ* + (1/2) ÿ0 h δ* + y0} − {y0 + ẏ0 δ* + (1/2) ÿ0 δ*²} = (1/2) ÿ0 (δ* h − δ*²).

The time of threshold crossing is bounded by the interval (0, h]. Hence, ε reaches its maximum amplitude at δ* = (1/2) h, and we can write

|ε| ≤ (1/8) |ÿ0| h².
Noting that y(δ*) = θ in equation A.2, we have

ε = ((yh − y0)/h) δ* + (y0 − θ),    (A.3)

and subtracting equation A.1 from A.3, we obtain

ε = ((yh − y0)/h) (δ* − δ).

Thus, the error in spike timing is

σ = δ* − δ = (h/(yh − y0)) ε.

With the help of the expansion

h/(yh − y0) = 1/ẏ0 − (ÿ0/(2 ẏ0²)) h + . . . ,

we arrive at

|σ| ≤ (1/8) |ÿ0/ẏ0| h².
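The error bounds for the grid-constrained (n = 0) and linear (n = 1) estimates can be checked on a concrete trajectory. The sketch below uses y(t) = 1 − e^(−t) with threshold θ = 0.5, an illustrative choice of our own (for this trajectory |ÿ0/ẏ0| = 1, so the linear bound reduces to h²/8):

```python
import math

def crossing_estimates(y, h, theta):
    """Return (grid, linear) estimates of the threshold-crossing time:
    the grid estimate reports the right border of the step containing
    the crossing, the linear estimate solves equation A.1 in that step."""
    k = 0
    while y(k * h + h) < theta:      # find the step (kh, kh+h] with the crossing
        k += 1
    y0, yh = y(k * h), y(k * h + h)
    grid = k * h + h                                  # n = 0 estimate
    linear = k * h + (theta - y0) * h / (yh - y0)     # n = 1 estimate
    return grid, linear

y = lambda t: 1.0 - math.exp(-t)     # crosses theta = 0.5 at t* = ln 2
h, theta = 0.1, 0.5
t_star = math.log(2.0)
grid, linear = crossing_estimates(y, h, theta)
```

The grid error is only bounded by h itself, while the linear error obeys the h² bound derived above, in line with the h^(n+1) scaling.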
A.4.3 Quadratic Interpolation. Using a polynomial of order 2, we can add an additional constraint to the interpolating function. We choose the derivative of the membrane potential at the left border of the interval, ẏ0. With yt = a t² + b t + c, we have

(a, b, c)ᵀ = [[0, 0, 1], [h², h, 1], [0, 1, 0]]⁻¹ (y0, yh, ẏ0)ᵀ = [[−h⁻², h⁻², −h⁻¹], [0, 0, 1], [1, 0, 0]] (y0, yh, ẏ0)ᵀ = ((yh − y0) h⁻² − ẏ0 h⁻¹, ẏ0, y0)ᵀ.

Thus, in normalized form, we need to solve

0 = δ² + (ẏ0 h² / (yh − y0 − ẏ0 h)) δ + ((y0 − θ) h² / (yh − y0 − ẏ0 h)).
The solution can be obtained by the quadratic formula. The GSL (Galassi et al., 2001) implements an appropriate solver. Generally there are two real solutions: the desired one inside the interval (0, h] and one outside.

A.4.4 Cubic Interpolation. A polynomial of order 3 enables us to constrain the interpolation further by the derivative of the membrane potential at the right border of the interval, ẏh. With yt = a t³ + b t² + c t + d, we have

(a, b, c, d)ᵀ = [[0, 0, 0, 1], [h³, h², h, 1], [0, 0, 1, 0], [3h², 2h, 1, 0]]⁻¹ (y0, yh, ẏ0, ẏh)ᵀ
= [[2h⁻³, −2h⁻³, h⁻², h⁻²], [−3h⁻², 3h⁻², −2h⁻¹, −h⁻¹], [0, 0, 1, 0], [1, 0, 0, 0]] (y0, yh, ẏ0, ẏh)ᵀ
= (2(y0 − yh) h⁻³ + (ẏ0 + ẏh) h⁻², 3(yh − y0) h⁻² − (2 ẏ0 + ẏh) h⁻¹, ẏ0, y0)ᵀ.

Thus, in normalized form, we need to solve

0 = δ³ + ((3(yh − y0) h − (2 ẏ0 + ẏh) h²) / (2(y0 − yh) + (ẏ0 + ẏh) h)) δ²
+ (ẏ0 h³ / (2(y0 − yh) + (ẏ0 + ẏh) h)) δ + ((y0 − θ) h³ / (2(y0 − yh) + (ẏ0 + ẏh) h)).
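The cubic construction can be exercised numerically. The following sketch is our own illustration (using numpy.roots rather than the GSL solver mentioned below, with hypothetical test values): it builds the interpolating cubic from y0, yh, ẏ0, ẏh and returns the leftmost crossing in (0, h]:

```python
import numpy as np

def cubic_crossing(y0, yh, dy0, dyh, h, theta):
    """Leftmost root in (0, h] of the interpolating cubic
    p(t) = a t^3 + b t^2 + c t + d minus the threshold theta,
    with coefficients as derived in section A.4.4."""
    a = 2.0 * (y0 - yh) / h**3 + (dy0 + dyh) / h**2
    b = 3.0 * (yh - y0) / h**2 - (2.0 * dy0 + dyh) / h
    c, d = dy0, y0
    roots = np.roots([a, b, c, d - theta])
    real = roots[np.abs(roots.imag) < 1e-9].real
    inside = real[(real > 0.0) & (real <= h)]
    return inside.min() if inside.size else None
```

For instance, for y(t) = sin t on (0, 0.5] with θ = 0.45, the estimate agrees with the exact crossing arcsin 0.45 to better than 10^−3.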
The solution can be found by the cubic formula. There is at least one real solution in the interval (0, h]. It is convenient to choose a substitution that avoids the intermediate occurrence of complex quantities (e.g., Weisstein, 1999). The GSL (Galassi et al., 2001) implements an appropriate solver. If the interval contains more than one solution, the time of threshold crossing is defined by the leftmost root.

Acknowledgments

The new address of A. M. and M. D. is Computational Neuroscience Group, RIKEN Brain Science Institute, Wako, Japan. The new address of S. S. is Human-Neurobiologie, University of Bremen, Germany. We acknowledge Johan Hake for the interesting discussions that started the project.
As always, Stefan Rotter and the members of the NEST collaboration (in particular Marc-Oliver Gewaltig) were very helpful. We thank Jochen Eppler for assisting with the C++ coding. We acknowledge the two anonymous referees whose stimulating questions helped us to improve the letter. This work was partially funded by DAAD 313-PPP-N4-lk, DIP F1.2, BMBF Grant 01GQ0420 to the Bernstein Center for Computational Neuroscience Freiburg, and EU Grant 15879 (FACETS). As of the date this letter is published, the NEST initiative (www.nest-initiative.org) makes available the different implementations of the neuron model used as an example in this article, in source code form under its public license. All simulations were carried out using the parallel computing facilities of the Norwegian University of Life Sciences at Ås.

References

Brette, R. (2006). Exact simulation of integrate-and-fire models with synaptic conductances. Neural Comput., 18, 2004–2027.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8(3), 183–208.
Diesmann, M., & Gewaltig, M.-O. (2002). NEST: An environment for neural systems simulations. In T. Plesser & V. Macho (Eds.), Forschung und wissenschaftliches Rechnen, Beiträge zum Heinz-Billing-Preis 2001 (pp. 43–70). Göttingen: Gesellschaft für wissenschaftliche Datenverarbeitung.
Diesmann, M., Gewaltig, M.-O., Rotter, S., & Aertsen, A. (2001). State space analysis of synchronous spiking in cortical neural networks. Neurocomputing, 38–40, 565–571.
Ferscha, A. (1996). Parallel and distributed simulation of discrete event systems. In A. Y. Zomaya (Ed.), Parallel and distributed computing handbook (pp. 1003–1041). New York: McGraw-Hill.
Fujimoto, R. M. (2000). Parallel and distributed simulation systems. New York: Wiley.
Galassi, M., Davies, J., Theiler, J., Gough, B., Jungman, G., Booth, M., & Rossi, F. (2001). GNU scientific library: Reference manual.
Bristol: Network Theory Limited.
Golub, G. H., & van Loan, C. F. (1996). Matrix computations (3rd ed.). Baltimore, MD: Johns Hopkins University Press.
Hammarlund, P., & Ekeberg, O. (1998). Large neural network simulations on multiple hardware platforms. J. Comput. Neurosci., 5(4), 443–459.
Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comput., 10(2), 467–483.
Harris, J., Baurick, J., Frye, J., King, J., Ballew, M., Goodman, P., & Drewes, R. (2003). A novel parallel hardware and software solution for a large-scale biologically realistic cortical simulation (Tech. Rep.). Las Vegas: University of Nevada.
Heck, A. (2003). Introduction to Maple (3rd ed.). Berlin: Springer-Verlag.
Lytton, W. W., & Hines, M. L. (2005). Independent variable time-step integration of individual neurons for network simulations. Neural Comput., 17, 903–921.
Makino, T. (2003). A discrete-event neural network simulator for general neuron models. Neural Comput. and Applic., 11, 210–223.
Marian, I., Reilly, R. G., & Mackey, D. (2002). Efficient event-driven simulation of spiking neural networks. In Proceedings of the 3rd WSES International Conference on Neural Networks and Applications. Interlaken, Switzerland.
Mattia, M., & Del Giudice, P. (2000). Efficient event-driven simulation of large networks of spiking neurons and dynamical synapses. Neural Comput., 12(10), 2305–2329.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50(6), 1645–1662.
Morrison, A., Hake, J., Straube, S., Plesser, H. E., & Diesmann, M. (2005). Precise spike timing with exact subthreshold integration in discrete time network simulations. Proceedings of the 30th Göttingen Neurobiology Conference. Neuroforum, 1(Suppl.), 205B.
Morrison, A., Mehring, C., Geisel, T., Aertsen, A., & Diesmann, M. (2005). Advancing the boundaries of high connectivity network simulation with distributed computing. Neural Comput., 17(8), 1776–1801.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C (2nd ed.). Cambridge: Cambridge University Press.
Reutimann, J., Giugliano, M., & Fusi, S. (2003). Event-driven simulation of spiking neurons with stochastic dynamics. Neural Comput., 15, 811–830.
Rochel, O., & Martinez, D. (2003). An event-driven framework for the simulation of networks of spiking neurons. In ESANN’2003 Proceedings—European Symposium on Artificial Neural Networks (pp. 295–300). Bruges, Belgium: d-side Publications.
Rotter, S., & Diesmann, M. (1999). Exact digital simulation of time-invariant linear systems with applications to neuronal modeling. Biol. Cybern., 81(5/6), 381–402.
Shelley, M. J., & Tao, L. (2001). Efficient and accurate time-stepping schemes for integrate-and-fire neuronal networks. J. Comput. Neurosci., 11(2), 111–119.
Sloot, A., Kaandorp, J. A., Hoekstra, G., & Overeinder, B. J. (1999). Distributed simulation with cellular automata: Architecture and applications. In J. Pavelka, G. Tel, & M. Bartosek (Eds.), SOFSEM’99, LNCS (pp. 203–248). Berlin: Springer-Verlag.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. Cambridge: Cambridge University Press.
Weisstein, E. W. (1999). CRC concise encyclopedia of mathematics. Boca Raton, FL: CRC Press.
Wolfram, S. (2003). The Mathematica book (5th ed.). Champaign, IL: Wolfram Media Incorporated.
Zeigler, B. P., Praehofer, H., & Kim, T. G. (2000). Theory of modeling and simulation: Integrating discrete event and continuous complex dynamic systems (2nd ed.). Amsterdam: Academic Press.
Received August 12, 2005; accepted May 25, 2006.
LETTER
Communicated by Daniel Amit
The Road to Chaos by Time-Asymmetric Hebbian Learning in Recurrent Neural Networks

Colin Molter
[email protected] Laboratory for Dynamics of Emergent Intelligence, RIKEN Brain Science Institute, Wako, Saitama, 351-0198, Japan, and Laboratory of Artificial Intelligence, IRIDIA, Université Libre de Bruxelles, 1050 Brussels, Belgium
Utku Salihoglu
[email protected] Laboratory of Artificial Intelligence, IRIDIA, Université Libre de Bruxelles, 1050 Brussels, Belgium
Hugues Bersini
[email protected] Laboratory of Artificial Intelligence, IRIDIA, Université Libre de Bruxelles, 1050 Brussels, Belgium
This letter aims at studying the impact of iterative Hebbian learning algorithms on the recurrent neural network’s underlying dynamics. First, an iterative supervised learning algorithm is discussed. An essential improvement of this algorithm consists of indexing the attractor information items by means of external stimuli rather than by using only initial conditions, as Hopfield originally proposed. Modifying the stimuli mainly results in a change of the entire internal dynamics, leading to an enlargement of the set of attractors and potential memory bags. The impact of the learning on the network’s dynamics is the following: the more information to be stored as limit cycle attractors of the neural network, the more chaos prevails as the background dynamical regime of the network. In fact, the background chaos spreads widely and adopts a very unstructured shape similar to white noise. Next, we introduce a new form of supervised learning that is more plausible from a biological point of view: the network has to learn to react to an external stimulus by cycling through a sequence that is no longer specified a priori. Based on its spontaneous dynamics, the network decides “on its own” the dynamical patterns to be associated with the stimuli. Compared with classical supervised learning, huge enhancements in storing capacity and computational cost have been observed. Moreover, this new form of supervised learning, by being more “respectful” of the network’s intrinsic dynamics, maintains much more structure in

Neural Computation 19, 80–110 (2007)
© 2006 Massachusetts Institute of Technology
the obtained chaos. It is still possible to observe the traces of the learned attractors in the chaotic regime. This complex but still very informative regime is referred to as “frustrated chaos.”

1 Introduction

Synaptic plasticity is now widely accepted as a basic mechanism underlying learning and memory. There is experimental evidence that neuronal activity can affect synaptic strength through both long-term potentiation and long-term depression (Bliss & Lomo, 1973). Inspired by or forecasting this biological fact, a large number of learning “rules,” specifying how activity and training experience change synaptic efficacies, have been proposed (Hebb, 1949; Sejnowski, 1977). Such learning rules have been essential for the construction of most models of associative memory (among others, Amari, 1977; Hopfield, 1982; Amari & Maginu, 1988; Amit, 1995; Brunel, Carusi, & Fusi, 1997; Fusi, 2002; Amit & Mongillo, 2003). In such models, the neural network maps the structure of information contained in the external or internal environment into embedded attractors. Since the precursor works of Amari, Grossberg, and Hopfield (Amari, 1972; Hopfield, 1982; Grossberg, 1992), the privileged regime to code information has been fixed-point attractors. Many theoretical and experimental works have shown and discussed the limited storing capacity of these attractor networks (Amit, Gutfreund, & Sompolinsky, 1987; Gardner, 1987; Gardner & Derrida, 1989; Amit & Fusi, 1994; and Domany, van Hemmen, & Schulten, 1995, for a review). However, many neurophysiological reports (Nicolis & Tsuda, 1985; Skarda & Freeman, 1987; Babloyantz & Lourenço, 1994; Rodriguez et al., 1999; and Kenet, Bibitchkov, Tsodyks, Grinvald, & Arieli, 2003) tend to indicate that brain dynamics is much more complex than fixed points and is more faithfully characterized by cyclic and weak chaotic regimes.
In line with these results, in this article, we propose to map stimuli to spatiotemporal limit cycle attractors of the network’s dynamics. A learned stimulus is no longer expected to stabilize the network into a steady state (which could in some cases correspond to a minimum of a Lyapunov function). Instead, the stimulus is expected to drive the network into a specific spatiotemporal cyclic trajectory. This cyclic trajectory is still considered an attractor, since content addressability is expected: before presentation of the stimulus, the network may follow another trajectory, and the stimulus may be corrupted with noise. By relying on spatiotemporal cyclic attractors, the famous theoretical results on the limited capacity of Hopfield networks no longer apply. In fact, the extension of encoding attractors to cycles potentially boosts this storing capacity. Suppose a network is composed of two neurons that can take only two values: −1 and +1. Without paying attention to noise and generalization, only four fixed-point attractors can be exploited, whereas by adding cycles, this number increases. For instance,
C. Molter, U. Salihoglu, and H. Bersini
the cycles of length two are: (+1, +1)(+1, −1); (+1, +1)(−1, +1); (+1, +1)(−1, −1); (+1, −1)(−1, +1); (+1, −1)(−1, −1); (−1, +1)(−1, −1). In a given network, with a fixed topology and parameterization, the number of cyclic attractors is obviously smaller than the number of fixed points (a cycle iterates through a succession of unstable equilibria). An indexing of the memorized attractors restricted to the initial conditions, as classically done in Hopfield networks, would not allow a full exploitation of all of these potential cyclic attractors. This is the reason that, in the experimental framework presented here, the indexing is done instead by means of an added external input layer that continuously feeds the network with different external stimuli. Each external stimulus modifies the parameterization of the network and thus the possible dynamical regimes and the set of potential attractors. The goal of this letter is not to calculate the “maximum storage capacity” of these “cyclic attractor networks,”1 either theoretically2 or experimentally. Rather, we intend to discuss the potential dynamical regimes (oscillations and chaos) that allow this new form of information storage. The experimental results we present show how the theoretical limitation of fixed-point attractor networks can easily be overcome by adding the input layer and exploiting the cyclic attractors. By studying small fully connected networks, we have shown previously (Molter & Bersini, 2003) how a randomly generated synaptic matrix allows the exploitation of a huge number of cyclic attractors for storing information. In this letter, following our previous results (Molter, Salihoglu, & Bersini, 2005a, 2005b), a time-asymmetric Hebbian rule is proposed to encode the information. This rule is related to experimental observations showing an asymmetrical time window of synaptic plasticity in pyramidal cells during tetanic stimulus conditions (Levy & Steward, 1983; Bi & Poo, 1999).
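The enumeration above can be reproduced with a short script (a toy illustration only, not part of the authors’ framework): for N binary neurons there are 2^N patterns, hence C(2^N, 2) unordered pairs of distinct patterns, that is, candidate cycles of length two.

```python
from itertools import combinations, product

# Toy count for the two-neuron example in the text: 2**N patterns over
# {-1, +1}, and every unordered pair of distinct patterns is a candidate
# cycle of length two.
N = 2
patterns = list(product([-1, 1], repeat=N))
cycles_len2 = list(combinations(patterns, 2))
print(len(patterns))     # 4 candidate fixed points
print(len(cycles_len2))  # 6 candidate cycles of length two
```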
Information stored into the network consists of a set of pair data. Each datum is composed of an external stimulus and a series of patterns through which the network is expected to iterate when the stimulus feeds the network (this series corresponds to the limit cycle attractor). Traditionally, the information to be stored is either installed by means of a supervised mechanism (e.g., Hopfield, 1982) or discovered on the spot by an unsupervised version, revealing some statistical regularities in the data presented to the 1 To paraphrase Gardner’s article “Maximum Storage Capacity in Neural Networks” (1987). 2 The fact that we are not working at the thermodynamic limit, as in population models, would render such analysis very difficult.
net (e.g., Amit & Brunel, 1994). In this letter, two different forms of learning are studied and compared. In the first, the information to be learned (the external stimulus and the limit cycle attractor) is prespecified and installed as a result of a classical supervised algorithm. However, such supervised learning has always raised serious problems, at a biological level, due to its top-down nature, and at a cognitive level. Who would take the responsibility of looking inside the brain of the learner, that is, of deciding the information to associate with the external stimulus and exerting supervision during the learning task? To answer this last question, in the second form of learning proposed here, the semantics of the attractors to be associated with the feeding stimulus is left unprescribed: the network is taught to associate external stimuli with original attractors, not specified a priori. This perspective remains in line with the very old philosophical conviction of constructivism, which has been modernized in neural network terms by several authors (among others, Varela, Thompson, & Rosch, 1991; Erdi, 1996; Tsuda, 2001). One operational form has achieved great popularity as a neural net implementation of statistical clustering algorithms (Kohonen, 1982; Grossberg, 1992). To differentiate between the two learning procedures, the first one is called out-supervised, since the information is fully specified from outside. In contrast, the second one is called in-supervised, since the learning maps stimuli to cyclic attractors “derived” from the ones spontaneously proposed by the network.3 Quite naturally, we show that this in-supervised learning leads to increased storing capacity. The aim of this letter is not only to show how stimuli can be mapped to limit cycle attractors of the network’s dynamics. It also focuses on the network’s background dynamics: the dynamics observed when unlearned stimuli feed the network.
More precisely, in line with theoretical investigations of the onset and the nature of chaotic dynamics in deterministic dynamical systems (Eckmann & Ruelle, 1985; Sompolinsky, Crisanti, & Sommers, 1988; van Vreeswijk & Sompolinsky, 1996; Hansel & Sompolinsky, 1996), this letter computes and analyzes the background presence of chaotic dynamics. Whether chaos is present in recurrent neural networks (RNNs), and what benefit is gained from its presence, are still open questions. However, since the seminal paper by Skarda and Freeman (1987) dedicated to chaos in the rabbit brain, many authors have shared the idea that chaos is the ideal regime for storing and efficiently retrieving information in neural networks (Freeman, 2002; Guillot & Dauce, 2002; Pasemann, 2002; Kaneko & Tsuda, 2003).

3 This algorithm is still supervised in the sense that the mapped cyclic attractors are only “derived” from the spontaneous network dynamics. External supervision is still partly needed, which raises questions about biological plausibility. However, even if the proposed in-supervised algorithm is improved or ends up as part of a more biologically plausible learning procedure, we think that the dynamical results obtained here will remain qualitatively valid.

Chaos, although very simply produced,
inherently possesses an infinite number of cyclic regimes that can be exploited for coding information. Moreover, it spontaneously and randomly wanders around these unstable regimes, thus rapidly proposing alternative responses to external stimuli and easily switching from one of these potential attractors to another in response to any incoming stimulus. This article maintains this line of thinking by forcing the coding of information into robust cyclic attractors and experimentally showing that the more information is to be stored, the more chaos appears as the background regime, erratically itinerating, that is, moving from place to place, among brief appearances of these attractors. Chaos appears to be the consequence of the learning, not its cause. However, it is a helpful consequence that widens the net’s encoding capacity and reduces the number of spurious attractors. By comparing the nature of the chaotic dynamics obtained from the two learning procedures, we show that more structure in the background chaotic regime is obtained when the network “chooses” by itself the limit cycle attractors that encode the stimuli (the in-supervised learning). The nature of these chaotic regimes can be related to the classic intermittent type of chaos (Pomeau & Manneville, 1980) and to its extension to biological networks, originally called frustrated chaos (Bersini, 1998), in reminiscence of the frustration phenomenon occurring in both recurrent neural nets and spin glasses. This chaos is related to chaotic itinerancy (Kaneko, 1992; Tsuda, 1992), which has been suggested to be of biological significance (Tsuda, 2001). The plan of the letter is as follows. Section 2 describes the model as well as the learning task. Section 3 describes the out-supervised learning procedure, where stimuli are encoded in predefined limit cycle attractors.
Section 4 describes the in-supervised learning procedure, where stimuli are encoded in limit cycle attractors derived from those spontaneously proposed by the network. Section 5 computes and compares the networks’ encoding capacity, as well as the content addressability of the encoded information. Section 6 compares and discusses the proportion and the nature of the observed chaotic dynamics when the network is presented with unlearned stimuli.

2 Model and Learning Task Descriptions

This section describes the model used in our simulations as well as the learning task.

2.1 The Model. The network is fully connected. Each neuron’s activation is a function of the other neurons’ impact and of an external stimulus. The neurons’ activation f is continuous and updated synchronously in discrete time steps. The mathematical description of such a network is a classic
one. The activation value of a neuron x_i at discrete time step n + 1 is

x_i(n+1) = f\big(g\,\mathrm{net}_i(n)\big), \qquad \mathrm{net}_i(n) = \sum_{j=1}^{N} w_{ij}\, x_j(n) + \sum_{s=1}^{M} w_{is}\, \iota_s,   (2.1)
where N is the number of neurons, M is the number of units composing the stimulus, g is the slope parameter, w_ij is the weight between neurons j and i, w_is is the weight between unit s of the external stimulus and neuron i, and ι_s is unit s of the external stimulus. The saturating activation function f is taken to be continuous (here, tanh) to ease the study of the networks’ dynamical properties. The network’s size N and the stimulus’s size M have both been set to 25 in this article for legibility. Of course, this size has an impact on both the encoding capacity and the background dynamics; this impact will not be discussed. Neither will the impact of the value of the slope parameter g, set to 3 in the following.4 The main purpose of this article is to analyze how the learning of stimuli in spatiotemporal attractors affects the network’s background dynamics. When storing information in fixed-point attractors, the temporal update rule can indifferently be asynchronous or synchronous. This is no longer the case when storing information in cycle attractors, for which the updating must necessarily be synchronous: it is the global activity due to one pattern that generates the next one. To compare the network’s continuous internal states with bit patterns, a filter layer quantizing the internal states, based on the sign function, is added (Omlin, 2001). It defines the output vector o:
o_i = -1 \iff x_i < 0, \qquad o_i = +1 \iff x_i \ge 0,   (2.2)
where x_i is the internal state of neuron i and o_i is its associated output (i.e., its visible value). This filter layer makes it possible to perform symbolic investigations of the dynamical attractors. Figure 1 represents a period-2 sequence unfolding in a network of four neurons. The persistent external stimulus feeding the network appears in Figure 1A. Given that the internal state of the neurons is continuous, the internal states (see Figure 1B) are filtered (see Figure 1C) to enable comparison with the stored data.

4 The impact of this parameter has been discussed in Dauce, Quoy, Cessac, Doyon, & Samuelides (1998), who demonstrated how the slope parameter can be used as a route to chaos.
Figure 1: (A) A fully recurrent neural network (N = 4) fed by a persistent external stimulus. (B) Three snapshots of the network’s states, each representing the internal state of the network at a successive time step. (C) After filtering, a cycle of period 2 can be seen.
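As an illustration, the synchronous update of equation 2.1 and the filter layer of equation 2.2 can be sketched in a few lines (a minimal sketch with the parameter values quoted in the text, N = M = 25 and g = 3; the random weights and stimulus are placeholders, not trained values):

```python
import numpy as np

N, M, g = 25, 25, 3.0
rng = np.random.default_rng(0)
W = rng.normal(scale=1.0 / np.sqrt(N), size=(N, N))   # recurrent weights w_ij
Ws = rng.normal(scale=1.0 / np.sqrt(M), size=(N, M))  # stimulus weights w_is
stim = rng.choice([-1.0, 1.0], size=M)                # external stimulus iota

def step(x):
    """One synchronous update: x(n+1) = tanh(g * net(n)) (equation 2.1)."""
    net = W @ x + Ws @ stim
    return np.tanh(g * net)

def output(x):
    """Filter layer of equation 2.2: o_i is the sign of the internal state."""
    return np.where(x >= 0, 1, -1)

x = rng.uniform(-1, 1, size=N)
for n in range(50):          # skip the transient
    x = step(x)
print(output(x))             # the symbolic (visible) state
```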
2.2 The Learning Task. Two different forms of supervised learning are proposed in this article. However, both consist in storing a set of q external stimuli in spatiotemporal cycles of the network’s internal dynamics. The data set is written as

\mathcal{D} = \{D^1, \ldots, D^q\},   (2.3)
where each datum D^µ is defined by a pair composed of a pattern χ^µ, corresponding to the external stimulus feeding the network, and a sequence of patterns ς^{µ,i}, i = 1, …, l_µ, to store in a dynamical attractor:

D^\mu = \big(\chi^\mu, (\varsigma^{\mu,1}, \ldots, \varsigma^{\mu,l_\mu})\big), \qquad \mu = 1, \ldots, q,   (2.4)
where l_µ is the period of sequence µ and may vary from one datum to another. Each pattern is defined by assigning digital values to all neurons:

\chi^\mu = \{\chi_i^\mu,\; i = 1, \ldots, M\} \quad \text{with} \quad \chi_i^\mu \in \{-1, 1\},
\varsigma^{\mu,k} = \{\varsigma_i^{\mu,k},\; i = 1, \ldots, N\} \quad \text{with} \quad \varsigma_i^{\mu,k} \in \{-1, 1\}.   (2.5)
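Concretely, one datum of equations 2.4 and 2.5 can be represented by a small container such as the following (a hypothetical illustration; the class and field names are not from the article):

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container for one datum D^mu of equation 2.4: an external
# stimulus chi^mu and the cyclic sequence (sigma^{mu,1}, ..., sigma^{mu,l_mu})
# it should evoke.  Patterns are lists of +/-1 values as in equation 2.5.
@dataclass
class Datum:
    stimulus: List[int]          # chi^mu, length M
    cycle: List[List[int]]       # the sigma^{mu,k}, each of length N

    @property
    def period(self) -> int:     # l_mu may differ between data
        return len(self.cycle)

# a period-2 datum for a toy network with M = N = 2
d = Datum(stimulus=[1, -1], cycle=[[1, 1], [-1, 1]])
print(d.period)  # 2
```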
3 The Out-Supervised Learning Algorithm

This first learning task is straightforward and consists of storing a well-defined data set. It means that each datum stored in the network is fully specified a priori: each external stimulus must be associated with a prespecified limit cycle attractor of the network’s dynamics. By suppressing the external stimulus and setting all the sequence periods l_µ to 1, this task reduces to the classical learning task originally proposed by Hopfield: storing patterns in fixed-point attractors of the underlying RNN’s dynamics. The learning task described above thus generalizes the one proposed by Hopfield. For ease of reading, when patterns are stored in fixed-point attractors, they are denoted by ξ^µ.

3.1 Introduction: Hopfield’s Autoassociative Model. In the basic Hopfield model (Hopfield, 1982), all connections need to be symmetric, no autoconnection can exist, and the update rule must be asynchronous. Hopfield has proven that these constraints are sufficient to define a Lyapunov function H for the system:5

H = -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}\, x_i x_j.   (3.1)
Each state variation produced by the system’s equation entails a nonpositive variation of H: ∆H ≤ 0. The existence of such a decreasing function ensures convergence to fixed-point attractors. Each local minimum of the Lyapunov function represents one fixed point of the dynamics. These local minima can be used to store patterns. This kind of network is akin to a content-addressable memory, since any stored item will be retrieved when the network dynamics is initiated with a vector of activation values sufficiently overlapping the stored pattern.6 In such a case, the network dynamics is initiated in the desired item’s basin of attraction, which spontaneously drives the network to converge to this specific item. The set of patterns can be stored in the network by using the following Hebbian learning rule, which obviously respects the constraints of the Hopfield model (symmetric connections and no autoconnection):

w_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi_i^\mu \xi_j^\mu, \qquad w_{ii} = 0.   (3.2)
5 A Lyapunov function is a lower-bounded function whose value is nonincreasing in time.
6 This remains valid only when learning a few uncorrelated patterns. In other cases, the network converges to an arbitrary fixed-point attractor.
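The prescription of equation 3.2 is straightforward to implement; the following sketch (with illustrative sizes only) builds the weight matrix and checks its defining properties:

```python
import numpy as np

# A minimal sketch of the Hebbian prescription of equation 3.2:
# w_ij = (1/N) sum_mu xi_i^mu xi_j^mu, with w_ii = 0 (no autoconnection).
def hebbian_weights(patterns):
    """patterns: array of shape (p, N) with entries in {-1, +1}."""
    p, N = patterns.shape
    W = patterns.T @ patterns / N
    np.fill_diagonal(W, 0.0)   # no autoconnection
    return W

rng = np.random.default_rng(1)
xi = rng.choice([-1, 1], size=(3, 50))    # p = 3 random patterns, N = 50
W = hebbian_weights(xi)

# fraction of bits of the first pattern recovered by one sign update;
# with p well below 0.14 N this should be close to 1
recalled = np.sign(W @ xi[0])
print((recalled == xi[0]).mean())
```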
However, this kind of rule leads to drastic storage limitations. An in-depth analysis of the Hopfield model’s storing capacity was done by Amit et al. (1987), relying on a mean-field approach and on replica methods originally developed for spin-glass models. Their theoretical results show that these types of networks, when coupled with this learning rule, are unlikely to store more than 0.14N uncorrelated random patterns.

3.2 Iterative Version of the Hebbian Learning Rule

3.2.1 Learning Fixed Points. A better way of storing patterns is given by an iterative version of the Hebbian rule (Gardner, 1987; for a detailed description of this algorithm, see van Hemmen & Kuhn, 1995, and Forrest & Wallace, 1995). The principle of this algorithm is as follows: at each learning iteration, the stability of every nominal pattern ξ^µ is tested. Whenever a pattern has not yet reached stability, the responsible neuron i sees its connectivity reinforced by adding a Hebbian term to all the synaptic connections impinging on it:

w_{ij} \rightarrow w_{ij} + \varepsilon_s\, \xi_i^\mu \xi_j^\mu,   (3.3)
where ε_s defines the learning rate. All patterns to be learned are repeatedly tested for stability, and once all are stable, the learning is complete. This learning algorithm is incremental, since new information can be learned while preserving all information that has already been learned. It has been proved (Gardner, 1987) that by using this procedure, the capacity can be increased up to 2N uncorrelated random patterns. In our model, stored cycles are indexed by the use of external stimuli. These external stimuli are responsible for a modification of the underlying network’s internal dynamics and, consequently, increase the number of potential attractors as well as the size of their basins of attraction. The connection weights between the external stimuli and the neurons are learned by adopting the same approach as in equation 3.3. When a pattern is not yet stable, the responsible neuron i sees its connectivity reinforced by adding a Hebbian term to all of the synaptic connections impinging on it (see equation 3.3), including the connections coming from the external stimulus:

w_{ik} \rightarrow w_{ik} + \varepsilon_b\, \chi_k^\mu \xi_i^\mu,   (3.4)
where ε_b defines the learning rate applied to the external stimulus’s connections, which may differ from ε_s. In order not only to store the patterns but also to ensure sufficient content addressability, we must try to “excavate” the basins of attraction. Two approaches are commonly proposed in the literature. The first aims at getting
the alignment of the spin of the neuron (+1 or −1) with its local field to be not just positive (the requirement to ensure stability) but greater than a given minimum bound. The second approach attempts explicitly to enlarge the domains of attraction around each nominal pattern. To do so, the network is trained to associate noisy versions of each nominal pattern with the desired pattern, after a given number of iterations expected to be sufficient for convergence. This second approach is the one adopted in this article. Two noise parameters are introduced to tune noise during the learning phase: the noise imposed on the internal states, l_ns, and the noise imposed on the external stimulus, l_nb.7

3.2.2 Learning Cycles. The learning rule defined in equation 3.3 naturally leads to asymmetrical weight values. It is no longer possible to define a Lyapunov function for this system, the main consequence being the inclusion of cycles in the set of “memory bags.” As for fixed points, the network can be trained to converge to such limit cycle attractors by modifying equations 3.3 and 3.4. This time, the weights w_ij and w_is are modified according to the expected value of neuron i at time t + 1 and the expected values of neuron j and of the external stimulus at time t:

w_{ij} \rightarrow w_{ij} + \varepsilon_s\, \varsigma_i^{\mu,\nu+1} \varsigma_j^{\mu,\nu}, \qquad w_{is} \rightarrow w_{is} + \varepsilon_b\, \varsigma_i^{\mu,\nu+1} \chi_s^\mu,   (3.5)
where ε_s and ε_b, respectively, define the learning rate and the stimulus learning rate. This time-asymmetric Hebbian rule can be related to the asymmetric time window of synaptic plasticity observed in pyramidal cells during tetanic stimulus conditions (Levy & Steward, 1983; Bi & Poo, 1999).

7 A noisy pattern ξ^µ_{l_ns} is obtained from a pattern to learn, ξ^µ, by choosing a set of l_ns items, randomly selected among all the initial pattern’s items, and switching their sign. Thus d_H(l_ns) defines the Hamming distance between the two patterns:

d_H(l_{ns}) = \sum_{i=1}^{N} d_i, \quad \text{where} \quad d_i = \begin{cases} 0 & \text{if } \xi_i^{\mu}\, \xi_{i,l_{ns}}^{\mu} = 1 \text{ (items equal)}, \\ 1 & \text{if } \xi_i^{\mu}\, \xi_{i,l_{ns}}^{\mu} = -1 \text{ (items different)}. \end{cases}

In this article, the Hamming distance is normalized to range in [0, 100].

3.2.3 Adaptation of the Algorithm to Continuous Activation Functions. When working with continuous-state neurons, we are no longer dealing with cyclic attractors but with limit cycle attractors. As a consequence, the
algorithm needs to be adapted in order to prevent the learned data from vanishing after a few iterations. A one-step iteration does not guarantee long-term stability of the internal states, since observations are performed using a filter layer. The adaptation consists of waiting a certain number of cycles before testing the correctness of the obtained attractor. The halting test for discrete neurons is given by

\forall \mu, \nu: \quad \text{if } \big(x(0) = \varsigma^{\mu,\nu}\big) \rightarrow x(1) = \varsigma^{\mu,\nu+1} \;\Rightarrow\; \text{stop};   (3.6)

for continuous neurons, it becomes

\forall \mu, \nu: \quad \text{if } \big(x(0) = \varsigma^{\mu,\nu}\big) \rightarrow \big(o(1) = o(l_\mu + 1) = \cdots = o(T\, l_\mu + 1) = \varsigma^{\mu,\nu+1}\big) \;\Rightarrow\; \text{stop},   (3.7)
where T is a further parameter of our algorithm (set to 10 in all the experiments) and o is the filtered output vector defined in equation 2.2.

4 In-Supervised Learning Algorithm

As shown in section 5, the encoding capacities of networks learning in the out-supervised way described above are fairly good. However, these results are disappointing compared to the potential capacity observed in random networks (Molter & Bersini, 2003). Moreover, section 6 shows how learning too many cycle attractors in an out-supervised way leads to a blackout catastrophe similar to the one observed in fixed-point attractor networks (Amit, 1989): the network’s background regime becomes fully chaotic and similar to white noise. Learning prespecified data appears to be too constraining for the network. This section introduces an in-supervised learning algorithm, more plausible from a biological point of view: the network has to learn to react to an external stimulus by cycling through a sequence that is not specified a priori but is obtained through an internal mechanism. In other words, the information is generated through the learning procedure, which assigns a meaning to each external stimulus. There is an important tradition of less supervised learning in neural nets since the seminal work of Kohonen and Grossberg. This tradition resonates with writings in cognitive psychology and constructivist philosophy (among others, Piaget, 1963; Varela et al., 1991; Erdi, 1996; Tsuda, 2001). The algorithm presented now can be seen as a dynamical extension, in the spirit of this preliminary work, where the coding scheme relies on cycles instead of single neurons.

4.1 Description of the Learning Task. The main characteristic of this new algorithm lies in the nature of the learned information: only the external stimuli are known before learning. The limit cycle attractor associated
with an external stimulus is identified through the learning procedure: the procedure enforces a mapping between each stimulus of the data set and a limit cycle attractor of the network’s inner dynamics, whatever it is. Hence, the aim of the learning procedure is twofold: first, it proposes a dynamical way to code the information (i.e., to associate a meaning with the external stimuli), and then it learns it (through a classical supervised procedure). Before mapping, the data set is defined by D_bm (bm standing for “before mapping”):

\mathcal{D}_{bm} = \{D_{bm}^1, \ldots, D_{bm}^q\}, \qquad D_{bm}^\mu = \chi^\mu, \quad \mu = 1, \ldots, q.   (4.1)

After mapping, the data set becomes

\mathcal{D}_{am} = \{D_{am}^1, \ldots, D_{am}^q\}, \qquad D_{am}^\mu = \big(\chi^\mu, (\varsigma^{\mu,1}, \ldots, \varsigma^{\mu,l_\mu})\big), \quad \mu = 1, \ldots, q,   (4.2)
where l_µ is the period of the learned cycle.

4.2 Description of the Algorithm. The inputs of this algorithm are a data set D_bm to learn (see equation 4.1) and a range [min_cs, max_cs] that defines the bounds of the accepted periods of the limit cycle attractors coding the information. The algorithm can be broken down into three phases that are iterated until convergence:

1. Remapping stimuli into spatiotemporal cyclic attractors. During this phase, the network is presented with an external stimulus that drives it into a temporal attractor output^µ (which can be chaotic). Since the idea is to constrain the network as little as possible, a meaning is assigned to the stimulus by associating it with a close cyclic version of the attractor output^µ, called cycle^µ, an original8 attractor respecting the period bounds [min_cs, max_cs]. This step is iterated for all the stimuli of the data set.

2. Learning the information. Once a new attractor cycle^µ has been proposed for each stimulus, it is tentatively learned by means of a supervised procedure. However, to avoid constraining the network too much, only a limited number of iterations are performed, even if no convergence has been reached.

3. End test. If all stimuli are successfully associated with different cyclic attractors, the in-supervised learning stops; otherwise, the whole process is repeated.

8 Original means that each pattern composing the limit cycle attractor must be different from all other patterns of all cycle^µ.
Table 1: Pseudocode of the In-Supervised Algorithm.

A/ Remapping stimuli to spatiotemporal cyclic attractors
1. ∀ data D_bm^µ to learn, µ = 1, …, q:
   a. Stimulation of the network
      i. The stimulus is initialized with χ^µ.
      ii. The states are initialized with ς^{µ,1}, obtained from the previous iteration (or random at first).
      iii. To skip the transient, the network is simulated for some steps.
      iv. The states ς^{µ,i} crossed by the network’s dynamics are stored in output^µ.
   b. Proposal of an attractor code
      i. If output^µ is not a cycle of period less than or equal to max_cs ⇒ compression process (see Table 2): output^µ → cycle^µ.
      ii. If output^µ is not a cycle of period greater than or equal to min_cs ⇒ extension process (see Table 2): output^µ → cycle^µ.
      iii. If a pattern contained in cycle^µ is too correlated with any other pattern, this pattern is slightly modified to make it “original.”
2. The data set D_temp is created, where D_temp^µ = (χ^µ, cycle^µ).

B/ Learning the information
3. Using an iterative supervised learning algorithm, the data set D_temp is tentatively learned for a limited number of time steps.

C/ End test
4. If ∀ data D_bm^µ the network iterates through valid limit cycle attractors ⇒ finished; else go to 1.
The pseudocode of this in-supervised learning algorithm is described in Table 1. The algorithm presented here learns to map external stimuli into the network’s cyclic attractors in a very unconstrained way. Provided these attractors are derived from the ones spontaneously proposed by the network, a form of supervision is still needed (how to create an original cycle and how to know if the proposed cycle is original). However, recent neurophysiological observations have shown that in some cases, synaptic plasticity is effectively guided by a supervised procedure (Gutfreund, Zheng, & Knudsen, 2002; Franosch, Lingenheil, & van Hemmen, 2005). The supervised procedure proposed here to create and test original cycles (see Table 2) has no real biological grounding. This part of the algorithm might be improved in the future to reinforce its biological likelihood. Nevertheless, we think that whatever supervised procedure is chosen, the part of the study concerned with the capacity of the network and the background chaos remains valid.
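The three phases of Table 1 can be outlined schematically as follows. This is a sketch only: `simulate`, `make_cycle`, and `supervised_step` are hypothetical stand-ins for the stimulation, attractor-proposal, and bounded supervised-learning routines described in the text, not the article’s implementation.

```python
# Schematic outline of the in-supervised loop of Table 1.
def in_supervised(stimuli, simulate, make_cycle, supervised_step,
                  max_epochs=100):
    mapping = {}
    for _ in range(max_epochs):
        # A/ remap each stimulus to a cyclic version of the attractor
        #    the network spontaneously visits under that stimulus
        for mu, chi in enumerate(stimuli):
            trajectory = simulate(chi)            # skip transient, record states
            mapping[mu] = make_cycle(trajectory)  # compress/extend to a cycle
        # B/ a limited number of supervised learning iterations
        converged = supervised_step(stimuli, mapping)
        # C/ stop once every stimulus evokes its own valid cycle
        if converged:
            break
    return mapping

# dummy stand-ins: every stimulus is "learned" immediately
m = in_supervised([[1, -1], [-1, 1]],
                  simulate=lambda chi: [chi, [-c for c in chi]],
                  make_cycle=lambda traj: traj,
                  supervised_step=lambda s, mp: True)
print(len(m))  # 2
```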
Table 2: Routines Used When Creating an Attractor Code.

Compression process. Since output^µ = ς^{µ,1}, …, ς^{µ,max_cs}, … is a cycle of period greater than max_cs (it could even be chaotic), cycle^µ is generated by truncating output^µ: cycle^µ = ς^{µ,1}, …, ς^{µ,p_s}, where the compression period p_s ∈ [min_cs, max_cs] is another parameter.

Extension process. Since output^µ = ς^{µ,1}, …, ς^{µ,q} with q < min_cs, cycle^µ is generated by duplicating output^µ: cycle^µ = ς^{µ,1}, …, ς^{µ,q}, ς^{µ,1}, … such that size(cycle^µ) = p_e, where the extension period p_e ∈ [min_cs, max_cs] is another parameter, randomly chosen or fixed.
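These two routines can be sketched as follows (a minimal sketch in which patterns are abbreviated to plain integers; `min_cs`, `max_cs`, `p_s`, and `p_e` stand for the period bounds and the compression and extension periods of the table):

```python
# A minimal sketch of the compression and extension routines of Table 2,
# applied to a recorded list of patterns `output`.
def propose_cycle(output, min_cs, max_cs, p_s, p_e):
    if len(output) > max_cs:                   # compression: truncate
        return output[:p_s]
    if len(output) < min_cs:                   # extension: duplicate
        reps = -(-p_e // len(output))          # ceiling division
        return (output * reps)[:p_e]
    return output                              # already within bounds

print(propose_cycle([1, 2, 3, 4, 5], 2, 4, p_s=3, p_e=3))  # [1, 2, 3]
print(propose_cycle([1], 2, 4, p_s=3, p_e=3))              # [1, 1, 1]
```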
5 Performance Regarding the Encoding Capacity

Encoding capacities of networks having learned by using the out-supervised algorithm and networks having learned by using the in-supervised algorithm are compared in Figure 2. Here, to ease the comparison, we have forced the in-supervised algorithm to learn mappings to cycles of fixed period. However, because this algorithm enables mapping stimuli to limit cycle attractors of variable period (between min_cs and max_cs), better performance could be achieved. The number of iterations required to learn the specified data set is plotted as a function of the size of this data set.9 We have seen in section 3.2 that it is possible to enforce greater content addressability by training the network to associate noisy versions of each nominal pattern with the desired limit cycle attractors. Here, the two parameters enforcing content addressability (l_ns and l_nb) have been set to 0 in order to display the maximum encoding capacities. As expected, the in-supervised learning procedure, by letting the network decide how to map the stimuli, outperforms its out-supervised counterpart, where all the mappings are specified before the learning.10

9 Each iteration represents 100 weight modifications as defined in equation 3.5.
10 Mappings of stimuli to fixed-point attractors are not compared, since to obtain relevant results, we would have to enforce content addressability by learning with noise. See also section 6.1.1.

Robustness to noise is compared in Figure 3. Content addressability has been enforced as much as possible by means of the learning-with-noise procedure. By computing the normalized Hamming distance between the obtained attractor and the initially learned attractor, robustness to noise
Figure 2: Comparison of encoding capacities between out-supervised learning (filled circles) and in-supervised learning (squares). The number of iterations required to learn the specified data set is plotted according to data set size. Results obtained from three different types of data sets are represented: data sets composed of stimuli associated with cycles of period 3, 5, and 10. Each value has been obtained from statistics over 100 different data sets.
indicates how well the dynamics is able to retrieve the original association when both the external stimulus and the network’s internal state are perturbed by noise. The two plots in Figure 3 show the normalized Hamming distance between the stored sequence and the recovered sequence according to the noise injected in the external stimulus and in the initial states. This noise is quantified by computing the initial overlaps m_b^0 and m_s^0.11 The noise injected in the network’s internal state is measured by computing the smallest overlap between the internal state and every pattern composing the limit cycle attractor associated with the external stimulus. In-supervised learning considerably improves robustness compared to out-supervised learning. For instance, when learning six period-4 cycles, in the in-supervised case (see Figure 3B), stored sequences are very robust to noise, while in the out-supervised case (see Figure 3A), content addressability is no longer observed: by adding a tiny amount of noise, the correlation between the stored sequence and the recovered one goes to zero (the normalized Hamming distance is equal to 50). These figures also show that in the in-supervised case, the external stimulus plays a stronger role in indexing the stored data.
11 The overlap m^µ between a pattern and its noisy version is given by

m^µ = (1/N) Σ_{i=1}^{N} ς_i^{µ,idcyc} ς_{i,noisy}^{µ,idcyc}.

Two patterns that match perfectly have an overlap equal to 1; it equals −1 in the opposite case. The overlap m and the Hamming distance d_h are related by d_h = N(1 − m)/2.
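As a rough illustration of the footnote's overlap measure, the sketch below (plain NumPy; the pattern size and noise level are made up for the example) computes the overlap between a ±1 pattern and a noisy copy and checks the relation d_h = N(1 − m)/2:

```python
import numpy as np

def overlap(a, b):
    """Overlap m between two +/-1 patterns: 1 if identical, -1 if opposite."""
    return float(np.mean(np.asarray(a) * np.asarray(b)))

def hamming(a, b):
    """Hamming distance: number of mismatching components."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

rng = np.random.default_rng(0)
N = 100
pattern = rng.choice([-1, 1], size=N)

noisy = pattern.copy()
flip = rng.choice(N, size=25, replace=False)  # flip 25 of the 100 components
noisy[flip] *= -1

m = overlap(pattern, noisy)    # (75 - 25) / 100 = 0.5
dh = hamming(pattern, noisy)   # 25
assert dh == N * (1 - m) / 2   # d_h = N(1 - m)/2
```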
Road to Chaos in Recurrent Neural Networks
Figure 3: Content addressability obtained for the learned networks. The normalized Hamming distance between the expected sequence and the obtained sequence is plotted according to the noise injected in both the initial states (m0s ) and the external stimulus (m0b ). Results for the out-supervised Hebbian algorithm (A) are compared with the ones obtained for the in-supervised Hebbian algorithm (B).
One can say that the in-supervised learning mechanism implicitly supplies the network with an important robustness to noise. This could be explained by the following considerations. First, the coding attractors are derived from the ones spontaneously proposed by the network. Second, they need to have large and stable basins of attraction in order to resist the process of trial, error, and adaptation that characterizes this iterative remapping procedure.

6 Dynamical Analysis

Tests performed here aim at analyzing the so-called background or spontaneous dynamics obtained when the network is presented with external stimuli other than the learned ones. In the first two sections, analyses are performed to quantify the presence and proportion of chaotic dynamics. Section 6.1 uses two measures: the network's mean Lyapunov exponent and the probability of having chaotic dynamics. Section 6.2 uses symbolic analyses12 to quantify the proportion of the different types of symbolic attractors found. It shows how chaotic dynamics help to prevent the proliferation of spurious data. Qualitative analyses of the nature of the chaotic dynamics obtained are performed in the last section (section 6.3) by means of classical tools such as return maps,
12 Symbolic analyses are performed by analyzing the output of the filter layer instead of directly analyzing the network’s internal state.
power spectra, and Lyapunov spectra. Furthermore, an innovative measure is developed to assess the presence of frustrated chaotic dynamics.

6.1 Quantitative Dynamical Analysis: The Background Regime. Quantitative analyses are performed here using two kinds of measures: the mean Lyapunov exponent and the probability of having chaotic dynamics. Both measures come from statistics on 100 learned networks. For each learned network, dynamics obtained from randomly chosen external stimuli and initial states have been tested (1000 different configurations). These tests aim at analyzing the so-called background or spontaneous dynamics obtained by stimulating the network with external stimuli and initial states different from the learned ones. The first Lyapunov exponent is computed empirically by following the evolution of a tiny perturbation (renormalized at each time step) applied to an attractor state (for more details, see Wolf, Swift, Swinney, & Vastano, 1984; Albers, Sprott, & Dechert, 1998). This exponent indicates how fast the system's history is lost. While stable dynamics have negative Lyapunov exponents, Lyapunov exponents greater than zero are the signature of chaotic dynamics, and when the biggest Lyapunov exponent is very high, the system's dynamics may be seen as equivalent to a turbulent state. Here, to distinguish chaotic dynamics from quasi-periodic regimes, the dynamics is said to be chaotic if the Lyapunov exponent is greater than a given value slightly above zero (in practice, greater than 0.01). The results are made more meaningful by comparing the global dynamics of learned networks (indexed with L) with the global dynamics of networks obtained randomly (without learning).
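The renormalized-perturbation procedure for the largest Lyapunov exponent can be sketched as follows (a generic one-dimensional illustration in Python; the logistic map stands in for the network dynamics, and the step count and perturbation size are arbitrary choices, not values from the letter):

```python
import numpy as np

def largest_lyapunov(f, x0, n_steps=10000, eps=1e-8, transient=1000):
    """Largest Lyapunov exponent of the map f, estimated by evolving a tiny
    perturbation alongside the trajectory and renormalizing it at each step."""
    x = x0
    for _ in range(transient):       # discard the transient, land on the attractor
        x = f(x)
    y = x + eps                      # perturbed copy of the state
    total = 0.0
    for _ in range(n_steps):
        x, y = f(x), f(y)
        d = abs(y - x)
        if d == 0.0:                 # trajectories coincided; re-seed perturbation
            y, d = x + eps, eps
        total += np.log(d / eps)     # accumulate the local expansion rate
        y = x + eps * (y - x) / d    # renormalize the perturbation to size eps
    return total / n_steps

# The fully chaotic logistic map has a known exponent of ln 2 ~ 0.69;
# by the article's convention (exponent > 0.01), it is classified as chaotic.
lam = largest_lyapunov(lambda x: 4.0 * x * (1.0 - x), 0.3)
```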
However, to constrain random networks to behave as a kind of "surrogate network" (indexed with S), they must have the same mean µ and the same standard deviation σ for their weight distributions (neuron-neuron and stimulus-neuron):

µ(w_ij^L) = µ(w_ij^S),   µ(w_bi^L) = µ(w_bi^S),   µ(w_ii^L) = µ(w_ii^S),
σ(w_ij^L) = σ(w_ij^S),   σ(w_bi^L) = σ(w_bi^S),   σ(w_ii^L) = σ(w_ii^S).   (6.1)
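The matching conditions of equation 6.1 can be illustrated with a short sketch (NumPy; the function name and the stand-in "learned" matrix are invented for the example, and the stimulus-neuron weights w_bi, which are matched the same way, are omitted for brevity):

```python
import numpy as np

def surrogate_weights(W_learned, rng):
    """Draw a gaussian surrogate matrix whose off-diagonal (neuron-neuron)
    and diagonal (self-connection) weights have the same mean and standard
    deviation as those of the learned matrix, but are otherwise random."""
    W = np.asarray(W_learned, dtype=float)
    n = W.shape[0]
    off = ~np.eye(n, dtype=bool)          # mask of off-diagonal entries
    S = np.empty_like(W)
    S[off] = rng.normal(W[off].mean(), W[off].std(), size=off.sum())
    d = np.diag(W)
    S[np.eye(n, dtype=bool)] = rng.normal(d.mean(), d.std(), size=n)
    return S

rng = np.random.default_rng(1)
W_L = rng.normal(0.2, 0.5, size=(25, 25))  # stand-in for a learned network
W_S = surrogate_weights(W_L, rng)          # same first two moments, no structure
```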
These surrogate random networks enable the measurement of the weight distribution's impact. The random distribution has been chosen gaussian.

6.1.1 Network Stabilization Through Hebbian Learning of Static Patterns. Figure 4A shows the mean Lyapunov exponents and the probabilities of having chaos for networks whose learned data are encoded in fixed-point attractors only. To prevent the weight distribution from converging to the identity matrix (if ∀i, j: wii > 0 and wij ≈ 0, all patterns are learned but without robustness), noise has been added during the training period. We can observe that:
Figure 4: Mean Lyapunov exponents and probability of chaos in RNNs, as a function of the learning set's size. Out-supervised learned networks (filled circles) are compared with in-supervised learned networks (squares). In both cases, comparisons are performed with their surrogate random networks (in gray). Since these results come from statistics, mean and standard deviation are plotted. (A) Network stabilization can be observed after Hebbian learning of static patterns. (B) Chaotic dynamics appear after mapping stimuli to period 3 cycles.
- The encoding capacity (with content addressability) of in-supervised learned networks is greater than the maximum encoding capacity (without content addressability) obtained from theoretical results (Gardner, 1987). The explanation lies in the presence of the input layer, which modifies the network's internal dynamics and enables other attractors to appear.
- Mean Lyapunov exponents of learned networks are always negative. When the learning task is made more complex, the networks are more and more constrained. The mean Lyapunov exponent increases but still remains below a negative upper bound.
- It is nearly impossible to find chaotic dynamics in the spontaneous regimes of learned networks, even after intensive learning of static patterns.
- Surrogate random networks show very different results, clearly indicating that the learned weight distribution is anything but random.
- The same trends are observed when relying on both the out-supervised algorithm and the in-supervised algorithm.
The iterative Hebbian learning algorithm described here is likely to keep all connections approximately symmetric, preserving the stability of learned networks, whereas this is no longer the case for random networks.

6.1.2 Hebbian Learning of Cycles: A New Road to Chaos. If learning data in fixed-point attractors stabilizes the network, learning sequences in limit cycle attractors leads to diametrically opposed results. Figure 4B compares the presence of chaotic dynamics in 25-neuron networks trained on different data sets of period 3 cycles. We can observe that:
- Networks learned through the out-supervised algorithm become more and more chaotic as the learning task is intensified. At the end, the probability of falling into a chaotic dynamics is equal to one, with high mean Lyapunov exponents. This kind of dynamics is ergodic and turns out to be reminiscent of the "blackout catastrophe" observed in fixed-point attractor networks (Amit, 1989; Fusi, 2002). The result is the same (the network becomes unable to retrieve any of the memorized patterns), but here the blackout arises through a progressive increase of the chaotic dynamics.
- The same trend shows up in the in-supervised case; however, even after intensive periods of learning, the networks do not become fully chaotic. The dynamics never turns into an ergodic one: chaotic attractors and limit-cycle attractors coexist.
- Compared to surrogate random networks, in both cases learning contributes to structuring the dynamics (at least, learned networks have to remember the learned data!). However, when the learning task is intensified, the differences between the random and the learned networks tend to vanish in the out-supervised case.
The absence of full chaos in in-supervised learned networks is explained by the fact that this algorithm is based on a process of trial, error, and adaptation, which provides robustness and prevents full chaos. As the data set to learn grows, learning takes more and more time, but at the same time the number of remappings increases and forces large basins of stable dynamics. The network becomes more and more constrained, and complex but not fully chaotic.
Figure 5: Road to chaos observed when learning four cycles of period 7 in a recurrent network of 25 neurons through an iterative supervised Hebbian algorithm (the number of learning iterations is indicated on the x-axis). The network is initialized randomly and, after a transient, the average state of the network is plotted on the y-axis. The mean Lyapunov exponent demonstrates the growing presence of chaotic dynamics.
Obtaining chaos by learning an increasing number of cycles with a time-asymmetric Hebbian mechanism can be seen as a new road to chaos. This road to chaos is illustrated in Figure 5. In contrast with the classical roads shaped by the gradual variation of a control parameter, this new road relies on an external mechanism simultaneously modifying a set of parameters to fulfill an encoding task. When learning cycles, the network is prevented from stabilizing in fixed-point attractors. The more cycles there are to learn, the more the network is externally constrained and the more its regime turns out to be spontaneously chaotic.

6.2 Quantitative Dynamical Analysis: Symbolic Analyses. After learning, when the network is presented with a learned stimulus while its initial state is correctly chosen, the expected spatiotemporal attractor is observed. Section 5 showed that noise tolerance is expected: the external stimulus or the network's internal state can be slightly modified without affecting the result. However, in some cases, the network fails to "understand" the stimulus, and an unexpected symbolic attractor appears in the output. The question is whether this attractor should be considered spurious data. The intuitive idea is that if the attractor is chaotic or if its period differs from that of the learned data, it is easy to recognize at a glance, and thus to discard. This becomes more difficult if the observed attractor's period is the same as that of the learned attractors. In this case, it is in fact impossible to know whether this information is relevant without comparing it with all the learned data. As a consequence, we define as spurious data an attractor having the same period as the learned data but still different from all of them.
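The period-based classification and the spurious-data criterion above can be sketched as follows (plain Python; the helper names, the cutoff period, and the toy two-unit cycles are invented for the illustration):

```python
def attractor_period(states, max_period=20):
    """Smallest period p at which the orbit's tail repeats every p steps;
    None (read here as 'chaotic') if no period up to max_period is found."""
    tail = states[-3 * max_period:]
    for p in range(1, max_period + 1):
        if all(tail[i] == tail[i + p] for i in range(len(tail) - p)):
            return p
    return None

def classify(states, learned_cycles, learned_period):
    """Classify a symbolic attractor as chaotic, out-of-range, learned, or
    spurious (same period as the learned data but matching none of them)."""
    p = attractor_period(states)
    if p is None:
        return "chaotic"
    if p != learned_period:
        return "out-of-range"
    cycle = set(states[-p:])
    return "learned" if any(cycle == set(c) for c in learned_cycles) else "spurious"

# Toy example with two-unit patterns and one learned period 4 cycle.
learned = [[(1, 1), (1, -1), (-1, -1), (-1, 1)]]
ok = classify(learned[0] * 10, learned, 4)                            # "learned"
spur = classify([(1, 1), (1, -1), (1, 1), (-1, 1)] * 10, learned, 4)  # "spurious"
```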
Figure 6: Proportion of the different symbolic attractors obtained during the spontaneous activity of artificial neural networks learned using, respectively, the out-supervised algorithm (A) and the in-supervised algorithm (B). Two types of mappings are analyzed: stimuli mapped to fixed-point attractors and stimuli mapped to spatiotemporal attractors of period 4.
As a result, two classification schemes are used to differentiate the symbolic attractors obtained. The first scheme is based on periods. This criterion enables distinguishing among chaotic attractors, periodic attractors whose periods differ from the learned ones (named out-of-range attractors), and periodic attractors having the same period as the learned ones. The aim of the second classification scheme is to differentiate these attractors, based on the normalized Hamming distance between them and the closest learned data (i.e., the attractors at a distance less than 10%, less than 20%, and so on). Figure 6 shows the proportion of the different types of attractors found as the size of the learned data set is increased. Results obtained with the in-supervised and the out-supervised algorithm while mapping stimuli into spatiotemporal attractors of various periods are compared. For each data set size, statistics are obtained from 100 different learned networks, and each time, 1000 symbolic attractors obtained from random stimuli and internal states have been classified. When stimuli are mapped into fixed-point attractors, the proportion of chaotic attractors and of out-of-range attractors falls rapidly to zero. In contrast, the number of spurious data increases drastically. In fact both
Figure 7: Proportions of the different types of attractors observed in in-supervised learned networks while the noise injected in a previously learned stimulus is varied. The networks' initial states were randomly initialized. (A) Thirty stimuli are mapped to fixed-point attractors. (B) Thirty stimuli are mapped to period 4 cycles. One hundred learned networks have been tested, each with 1000 configurations.
learning procedures tend to stabilize the network by enforcing symmetric weights and positive auto-connections.13 When stimuli are mapped to cyclic attractors, both learning procedures lead to an increasing number of chaotic attractors. The more you learn, the more the spontaneous regime of the net tends to be chaotic. Still, we have to differentiate the two learning procedures. Out-supervised learning, due to its very constraining nature, drives the network into a fully chaotic state that prevents the network from learning more than 13 period 4 cycles. The less constraining and more natural in-supervised learning task leads to different behavior. This time, the network does not become fully chaotic, and the storage capacity is enhanced by a factor as large as 4. Unfortunately, the number of spurious data is also increasing, and noticeable proportions of spurious data are visible when the network is fed with random stimuli. The aim of Figure 7 is to analyze the proportion of the different types of attractors observed when the external stimulus is progressively shifted from a learned stimulus to a random one. Again, the network's initial state is set completely at random. Two types of in-supervised mappings are compared: stimuli mapped to fixed-point attractors and stimuli mapped to attractors of period 4. When noiseless learned stimuli are presented to the network, striking results appear. For fixed-point learned networks, in more than 80% of the observations, we are facing spurious data. Indeed, the distance between the observed attractors and the expected one is larger than 10%, and they have the same period (period 1). By contrast, for period 4 learned networks, the

13 If no noise is injected during the learning procedure, the network converges to the identity matrix.
probability of recovering the perfect cycle increases to 56%. Moreover, 62% of the obtained cycles are at a distance of less than 10% from the expected ones, and the amounts of chaotic and out-of-range attractors are, respectively, equal to 23% and 8%. The space left for spurious data becomes less than 5%. Thus, if we imagine a procedure where the network's states are slightly modified whenever a chaotic trajectory is encountered, until a cyclic attractor is found, we are nearly certain of obtaining the correct mapping. Because of the pervasive number of spurious data in networks trained on fixed-point attractors, it becomes difficult to imagine these networks as working memories. By contrast, the presence of chaotic attractors helps to prevent the proliferation of spurious data in networks trained on cyclic attractors, while good storage capacity is possible by relying on in-supervised mappings.

6.3 Qualitative Dynamical Analysis: Nature of Chaos. The preceding section has given quantitative analyses of the presence of chaotic dynamics in learned networks. This section aims at introducing a more qualitative description of the different types of chaotic regimes encountered in these networks. Numerous techniques exist to characterize chaotic dynamics (Eckmann & Ruelle, 1985). In the first part of this section, three well-known tools are used: chaotic dynamics are characterized by means of return maps, power spectra, and analysis of their Lyapunov spectrum. The computation of the Lyapunov spectrum is performed through Gram-Schmidt reorthogonalization of the evolved system's Jacobian matrix (which is estimated at each time step from the system's equations). This method is detailed in Wolf et al. (1984). In the second part of this section, to give a better understanding of the notion of frustrated chaos, a new type of measure is proposed, indicating how chaotic dynamics is built on nearby limit cycle attractors.

6.3.1 Preliminary Analyses.
From return maps and power spectra analysis, three types of chaotic regimes have been identified in these networks. Figure 8 shows the power spectra and the return maps of these three generic types. They are labeled here white noise, deep chaos, and informative chaos. For each of them, a return map has been plotted for one particular neuron and the network’s mean signal. We observe that:
- White noise. The power spectrum of this type of chaos (see Figure 8A) shows a total lack of structure; all the frequencies are represented nearly equally. The associated return map is completely filled, with a higher density of points at the edges, indicating the presence of saturation. No useful information can be obtained from such chaos.
Figure 8: Return maps of the network’s mean signal (upper figures), return maps of a particular neuron (center figures), and power spectra of the network’s mean signal (lower figures) for chaotic regimes encountered in random (A) and learned (B, C) networks. The Lyapunov exponent of the corresponding dynamics is given.
- Deep chaos. The power spectrum of this type of chaos shows more structure but is still very similar to white noise (see Figure 8B). The return map is very similar to the previous one and does not seem to provide any useful information.
- Informative chaos. The power spectrum looks very informative (see Figure 8C). Different peaks show up, indicating the presence of nearby limit cycles. The associated return map shows more structure, but still with the presence of saturation.
The most relevant result is the possibility of predicting the type of chaos preferentially encountered by knowing the learning procedure used, as well as the size of the data set to be learned. Chaotic dynamics encountered in surrogate random networks are nearly always similar to white noise. This explains the large mean Lyapunov exponent obtained for these networks.
Figure 9: The first 10 Lyapunov exponents. (A) After out-supervised learning of 10 period 4 cycles. (B) After in-supervised learning of 20 period 4 cycles. Each time, 10 different learning sets have been tested. For each obtained network, 100 chaotic dynamics have been studied.
When learning by means of the out-supervised procedure, different types of chaotic regimes appear depending on the data set's size. For small data sets, informative chaos and deep chaos coexist. As the data set's size increases, the more we want to learn, the more the chaotic regimes go from informative chaos to almost white noise. Learning too much information in an out-supervised way leads to networks showing very uninformative deep chaos similar to white noise. In other words, having too many competing limit cycle attractors forces the network's dynamics to become almost random, losing its structure and behaving as white noise. All information about hypothetical nearby limit cycle attractors is lost. No more content addressability can be obtained. When learning by means of the in-supervised procedure, chaotic dynamics are nearly always of the third type: very informative chaos shows up. By not predefining the internal representations of the network, this type of learning preserves more structure in the chaotic dynamics. The dynamical structure of the informative chaos (Figure 8C) reveals the existence of nearby competing attractors. It shows the phenomenon of frustration obtained when the network hesitates between two (or more) nearby cycles, passing from one to another. To provide a better understanding of this type of chaos, Figure 9 compares Lyapunov spectra obtained from deep chaos in out-supervised learned networks and from informative chaos in in-supervised learned networks. From this figure, deep chaos can be related to "hyperchaos" (Rössler, 1983). In hyperchaos, the presence of more than one positive Lyapunov exponent is expected, and these exponents are expected to be high. By contrast, Lyapunov spectra obtained in informative chaos after in-supervised learning are characteristic of chaotic itinerancies (Kaneko & Tsuda, 2003).
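The Gram-Schmidt (QR) reorthogonalization scheme behind such spectra can be sketched on a small system with a known spectrum (a sketch in Python; the Hénon map stands in here for the network, whose Jacobian the authors estimate from the system's equations):

```python
import numpy as np

def lyapunov_spectrum(step, jacobian, x0, n_steps=5000, transient=500):
    """Full Lyapunov spectrum: propagate an orthonormal frame with the
    Jacobian and reorthogonalize it at each step via QR (Gram-Schmidt);
    the log of R's diagonal accumulates the stretch along each direction."""
    x = x0
    for _ in range(transient):       # discard the transient
        x = step(x)
    d = x.size
    Q = np.eye(d)
    sums = np.zeros(d)
    for _ in range(n_steps):
        Q, R = np.linalg.qr(jacobian(x) @ Q)
        sums += np.log(np.abs(np.diag(R)))
        x = step(x)
    return sums / n_steps

# Henon map: one positive and one negative exponent (~ +0.42 and ~ -1.62),
# whose sum is log|det J| = log 0.3; hyperchaos would instead show several
# large positive exponents.
a, b = 1.4, 0.3
step = lambda x: np.array([1.0 - a * x[0] ** 2 + x[1], b * x[0]])
jac = lambda x: np.array([[-2.0 * a * x[0], 1.0], [b, 0.0]])
spec = lyapunov_spectrum(step, jac, np.array([0.1, 0.1]))
```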
In this type of chaos, the dynamics is attracted to learned memories, which is indicated by negative Lyapunov exponents, while at the same time it escapes from them, which is indicated
Figure 10: Probability of the presence of the nearby limit cycle attractors in chaotic dynamics (x-axis). By slowly shifting the external stimulus from a stimulus previously learned (region a) to another stimulus learned (region b), the network's dynamics goes from the limit cycle attractor associated with the former stimulus to the limit cycle attractor associated with the latter stimulus. The probabilities of the presence of cycle a and cycle b are plotted (black). The Lyapunov exponent of the obtained dynamics is also plotted (gray). (A) After out-supervised learning, hyperchaos is observed in between the learned attractors. (B) The same conditions lead to frustrated chaos after in-supervised learning.
by the presence of at least one positive Lyapunov exponent. This positive Lyapunov exponent must be only slightly positive in order not to completely erase the system's history and thus to keep traces of the learned memories. This regime of frustration is enhanced by some modes of neutral stability, indicated by the presence of many exponents whose values are close to zero.

6.3.2 The Frustrated Chaos. The main difference between frustrated chaos (Bersini & Sener, 2002) and similar forms of chaos well known in the literature, like intermittency chaos (Pomeau & Manneville, 1980) and chaotic itinerancy (Ikeda, Matsumoto, & Otsuka, 1989; Kaneko, 1992; Tsuda, 1992), lies in the way it is obtained and in the possible characterization of the dynamics in terms of the encoded attractors. While all of these very structured chaotic regimes are characterized by strong cyclic components among which the dynamics randomly itinerates, the key difference lies in the transparency and the exploitation of those cycles. In frustrated chaos, those cycles are the basic stages of the road to chaos. It is by forcing those cycles into the network (by tuning the connection parameters) that the chaos finally shows up. One way of forcing these cycles is the time-asymmetric Hebbian learning adopted in this letter. Frustrated chaos is a dynamical regime that appears in a network when the global structure is such that local connectivity patterns responsible for stable and meaningful oscillatory behaviors are intertwined, leading to mutually competing attractors and unpredictable itinerancy among brief appearances of these attractors.
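A minimal version of a presence measure of this kind could look as follows (a sketch; the function name, tolerance, and toy data are invented for the illustration, not taken from the letter):

```python
import numpy as np

def presence_probability(trajectory, cycle, tol=0.2):
    """Fraction of time steps at which the network state lies within a
    normalized distance tol of some pattern of a learned limit cycle."""
    traj = np.asarray(trajectory, dtype=float)
    cyc = np.asarray(cycle, dtype=float)
    # distance from every state to the closest pattern of the cycle,
    # normalized by sqrt(number of units)
    dists = np.min(
        np.linalg.norm(traj[:, None, :] - cyc[None, :, :], axis=2), axis=1
    ) / np.sqrt(traj.shape[1])
    return float(np.mean(dists < tol))

# Toy trajectory over 2 units: half the time on a period 2 cycle, half far away.
cycle_a = [(1.0, 1.0), (-1.0, -1.0)]
traj = cycle_a * 5 + [(5.0, 5.0)] * 10
p = presence_probability(traj, cycle_a)  # -> 0.5
```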
To give a better understanding of this type of chaos, Figure 10 compares a hyperchaos and a frustrated chaos through the probability of presence of the nearby limit cycle attractors in the chaotic dynamics. In the figure on the left, the network has learned to associate two data items with limit cycle attractors of period 10 by using the out-supervised Hebbian algorithm. This algorithm strongly constrains the network, and as a consequence, the chaotic dynamics appear very uninformative: by shifting the external stimulus from one attractor to the other, strong chaos shows up in between (indicated by the Lyapunov exponent), in which all information concerning these two limit cycle attractors is lost. In contrast, when mapping four stimuli to period 4 cycles using the in-supervised algorithm (Figure 10, right), when driving the dynamics by shifting the external stimulus from one attractor to another, the chaos encountered on the road appears much more structured: small Lyapunov exponents and a strong presence of the nearby limit cycles are easy to observe, shifting progressively from one attractor to the other.

7 Conclusion

This letter studies the possibility of encoding information in spatiotemporal cyclic attractors of a network's internal state. Two versions of a time-asymmetric Hebbian learning algorithm are proposed. First is a classical out-supervised version, where the cyclic attractors are predefined and need to be replicated by the internal dynamics of the network. Second is an in-supervised version where the cyclic attractors to be associated with the external stimuli are left unprescribed and derived from the ones spontaneously proposed by the network. First, experimental results show that the encoding performances obtained in the in-supervised case are greatly enhanced compared to the ones obtained in the out-supervised case. This is intuitively understandable since the in-supervised learning is much less constraining for the network.
Second, experimental results aim at analyzing the background dynamical regime of the network: the dynamics observed when unlearned stimuli are presented to the network. It is empirically shown that the more information the network has to store in its attractors, the more its spontaneous dynamical regime tends to be chaotic. Chaos is in fact the biggest pool of potential cyclic attractors. Asymmetric Hebbian learning can be seen as an alternative road to chaos. Adopting the out-supervised learning and increasing the amount of information to store, the background chaos spreads widely and adopts a very unstructured shape similar to white noise. The reason lies in the constraints imposed by this learning process, which becomes harder and harder to satisfy by the network. In contrast, in-supervised learning, by being more “respectful” of the network intrinsic dynamics, maintains much more structure in the obtained chaos. It is still possible to observe in the chaotic regime the traces of the learned attractors. Following
our previous considerations, we call this complex but still very structured and informative regime frustrated chaos. This behavior can be related to experimental findings where ongoing cortical activity has been shown to encompass a set of dynamically switching cortical states that correspond to stimulus-evoked activity (Kenet et al., 2003). Symbolic investigations have been performed on the spatiotemporal attractors obtained when the network is in a random state and presented with random or noisy stimuli. Different types of attractors are observed in the output. Spurious data have been defined as attractors having the same period as the learned data but still different from all of them. This follows the intuitive idea that the other kinds of attractors (like chaotic attractors) are easily recognizable at a glance and unlikely to be confused with learned data. In the case of spurious data, it is impossible to know whether the observed attractors bear useful information without comparing them with all the learned data. Networks where the information is coded in fixed-point attractors are easily corrupted with spurious data, which makes their exploitation very delicate. By contrast, when the information is stored in cyclic attractors using in-supervised learning, chaotic attractors appear as the background regime of the net. They are likely to play a beneficial role by preventing the proliferation of spurious data and helping the recovery of the correct mapping: if a previously learned stimulus is presented to the network, either the network will find the correct mapping, or it will iterate through a chaotic trajectory. While not as easy to interpret and "engineerize" as its classical supervised counterpart, in-supervised learning makes more sense when adopting both a biological and a cognitive perspective. Biological systems propose their own way to treat external impact by slightly perturbing their inner working.
Here the information is still meaningful to the extent that the external impact is associated in an unequivocal way with an attractor. As the famous Colombian neurophysiologist Rodolfo Llinás (2001) likes to say: "A person's waking life is a dream modulated by the senses."
References Albers, D., Sprott, J., & Dechert, W. (1998). Routes to chaos in neural networks with random weights. International Journal of Bifurcation and Chaos, 8, 1463–1478. Amari, S. (1972). Learning pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on Computers, 21, 1197–1206. Amari, S. (1977). Neural theory of association and concept-formation. Biological Cybernetics, 26, 175–185. Amari, S., & Maginu, K. (1988). Statistical neurodynamics of associative memory. Neural Networks, 1, 63–73. Amit, D. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Amit, D. (1995). The Hebbian paradigm reintegrated: Local reverberations as internal representations. Behavioral and Brain Sciences, 18, 617–657. Amit, D., & Brunel, N. (1994). Learning internal representations in an attractor neural network with analogue neurons. Network: Computation in Neural Systems, 6, 359–388. Amit, D., & Fusi, S. (1994). Learning in neural networks with material synapses. Neural Computation, 6, 957–982. Amit, D., Gutfreund, H., & Sompolinsky, H. (1987). Statistical mechanics of neural networks near saturation. Ann. Phys., 173, 30–67. Amit, D., & Mongillo, G. (2003). Spike-driven synaptic dynamics generating working memory states. Neural Computation, 15, 565–596. Babloyantz, A., & Lourenço, C. (1994). Computation with chaos: A paradigm for cortical activity. Proceedings of the National Academy of Sciences, 91, 9027–9031. Bersini, H. (1998). The frustrated and compositional nature of chaos in small Hopfield networks. Neural Networks, 11, 1017–1025. Bersini, H., & Sener, P. (2002). The connections between the frustrated chaos and the intermittency chaos in small Hopfield networks. Neural Networks, 15, 1197–1204. Bi, G., & Poo, M. (1999). Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401, 792–796. Bliss, T., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. Journal of Physiology, 232, 331–356. Brunel, N., Carusi, F., & Fusi, S. (1997). Slow stochastic Hebbian learning of classes of stimuli in a recurrent neural network. Network: Computation in Neural Systems, 9, 123–152. Dauce, E., Quoy, M., Cessac, B., Doyon, B., & Samuelides, M. (1998). Self-organization and dynamics reduction in recurrent networks: Stimulus presentation and learning. Neural Networks, 11, 521–533. Domany, E., van Hemmen, J., & Schulten, K. (Eds.). (1995). Models of neural networks (2nd ed.). Berlin: Springer.
Eckmann, J., & Ruelle, D. (1985). Ergodic theory of chaos and strange attractors. Reviews of Modern Physics, 57(3), 617–656. Erdi, P. (1996). The brain as a hermeneutic device. Biosystems, 38, 179–189. Forrest, B., & Wallace, D. (1995). Models of neural networks (2nd ed.). Berlin: Springer. Franosch, J.-M., Lingenheil, M., & van Hemmen, J. (2005). How a frog can learn what is where in the dark. Physical Review Letters, 95, 1–4. Freeman, W. (2002). Biocomputing. Norwell, MA: Kluwer. Fusi, S. (2002). Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biological Cybernetics, 87, 459–470. Gardner, E. (1987). Maximum storage capacity in neural networks. Europhysics Letters, 4, 481–485. Gardner, E., & Derrida, B. (1989). Three unfinished works on the optimal storage capacity of networks. J. Physics A: Math. Gen., 22, 1983–1994. Grossberg, S. (1992). Neural networks and natural intelligence. Cambridge, MA: MIT Press. Guillot, A., & Dauce, E. (Eds.). (2002). Approche dynamique de la cognition artificielle. Paris: Hermès Science.
Road to Chaos in Recurrent Neural Networks
109
Gutfreund, Y., Zheng, W., & Knudsen, E. I. (2002). Gated visual input to the central auditory system. Science, 297, 1556–1559. Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Computation Neuroscience, 3, 7–34. Hebb, D. (1949). The organization of behavior. New York: Wiley-Interscience. Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554–2558. Ikeda, K., Matsumoto, K., & Otsuka, K. (1989). Maxwell-Bloch turbulence. Progress of Theoretical Physics, 99(Suppl.), 295–324. Kaneko, K. (1992). Pattern dynamics in spatiotemporal chaos. Physica D, 34, 1– 41. Kaneko, K., & Tsuda, I. (2003). Chaotic itinerancy. Chaos: Focus Issue on Chaotic Itinerancy, 13(3), 926–936. Kenet, T., Bibitchkov, D., Tsodyks, M., Grinvald, A., & Arieli, A. (2003). Spontaneously emerging cortical representations of visual attributes. Nature, 425, 954– 956. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69. Levy, W., & Steward, O. (1983). Temporal contiguity requirements for long term associative potentiation/depression in the hippocampus. Neuroscience, 8, 791– 797. Llineas, R. (2001). I of the vortex: From neurons to self. Cambridge, MA: MIT Press. Molter, C., & Bersini, H. (2003). How chaos in small Hopfield networks makes sense of the world. In Proceedings of the International Joint Conference on Neural Networks conference. Piscataway, NJ: IEEE Press. Molter, C., Salihoglu, U., & Bersini, H. (2005a). Introduction of an Hebbian unsupervised learning algorithm to boost the encoding capacity of Hopfield networks. In Proceedings of the International Joint Conference on Neural Networks—IJCNN Conference, Montreal. Los Alamitos, CA: IEEE Computer Society Press. Molter, C., Salihoglu, U., & Bersini, H. (2005b). 
Learning cycles brings chaos in continuous Hopfield networks. In Proceedings of the International Joint Conference on Neural Networks—IJCNN Conference. Los Alamitos, CA: IEEE Computer Society Press. Nicolis, J., & Tsuda, I. (1985). Chaotic dynamics of information processing: The “magic number seven plusminus two” revisited. Bulletin of Mathematical Biology, 47, 343–65. Omlin, C. (2001). Understanding and explaining DRN behavior. In K. Kremer (Ed.), A field guide to dynamical recurrent networks. Piscataway, NJ: IEEE Press. Pasemann, F. (2002). Complex dynamics and the structure of small neural networks. Network: Computation in Neural Systems, 13(2), 195–216. Piaget, J. (1963). The psychology of intelligence. New York: Routledge. Pomeau, Y., & Manneville, P. (1980). Intermittent transitions to turbulence in dissipative dynamical systems. Comm. Math. Phys., 74, 189–197. Rodriguez, E., George, N., Lachaux, J., Renault, B., Martinerie, J., Reunault, B., & Varela, F. (1999). Perception’s shadow: Long-distance synchronization of human brain activity. Nature, 397, 430–433.
110
C. Molter, U. Salihoglu, and H. Bersini
¨ ¨ Naturforschung, 38a, 788– Rossler, O. E. (1983). The chaotic hierarchy. Zeitschrift fur 801. Sejnowski, T. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321. Skarda, C., & Freeman, W. (1987). How brains make chaos in order to make sense of the world. Behavioral and Brain Sciences, 10, 161–195. Sompolinsky, H., Crisanti, A., & Sommers, H. (1988). Chaos in random neural networks. Physical Review Letters, 61, 258–262. Tsuda, I. (1992). Dynamic link of memory-chaotic memory map in nonequilibrium neural networks. Neural Networks, 5, 313–326. Tsuda, I. (2001). Toward an interpretation of dynamic neural activity in terms of chaotic dynamical systems. Behavioral and Brain Sciences, 24, 793–847. van Hemmen, J., & Kuhn, R. (1995). Models of neural networks (2nd ed.). Berlin: Springer. van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293), 1724–1726. Varela, F., Thompson, E., & Rosch, E. (1991). The embodied mind: Cognitive science and human experience. Cambridge, MA: MIT Press. Wolf, A., Swift, J., Swinney, H., & Vastano, J. (1984). Determining Lyapunov exponents from a time series. Physica, D16, 285–317.
Received April 6, 2005; accepted May 30, 2006.
LETTER
Communicated by Herbert Jaeger
Analysis and Design of Echo State Networks

Mustafa C. Ozturk
[email protected]
Dongming Xu
[email protected]
José C. Príncipe
[email protected]
Computational NeuroEngineering Laboratory, Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL 32611, U.S.A.
Neural Computation 19, 111–138 (2007)
© 2006 Massachusetts Institute of Technology

The design of echo state network (ESN) parameters relies on the selection of the maximum eigenvalue of the linearized system around zero (spectral radius). However, this procedure does not quantify in a systematic manner the performance of the ESN in terms of approximation error. This article presents a functional space approximation framework to better understand the operation of ESNs and proposes an information-theoretic metric, the average entropy of echo states, to assess the richness of the ESN dynamics. Furthermore, it provides an interpretation of the ESN dynamics rooted in system theory as families of coupled linearized systems whose poles move according to the input signal dynamics. With this interpretation, a design methodology for functional approximation is put forward where ESNs are designed with uniform pole distributions covering the frequency spectrum to abide by the richness metric, irrespective of the spectral radius. A single bias parameter at the ESN input, adapted with the modeling error, configures the ESN spectral radius to the input-output joint space. Function approximation examples compare the proposed design methodology versus the conventional design.

1 Introduction

Dynamic computational models require the ability to store and access the time history of their inputs and outputs. The most common dynamic neural architecture is the time-delay neural network (TDNN) that couples delay lines with a nonlinear static architecture where all the parameters (weights) are adapted with the backpropagation algorithm. The conventional delay line utilizes ideal delay operators, but delay lines with local first-order recursive filters have been proposed by Werbos (1992) and extensively studied in the gamma model (de Vries, 1991; Principe, de Vries, & de Oliviera, 1993). Chains of first-order integrators are interesting because they effectively decrease the number of delays necessary to create time embeddings
(Principe, 2001). Recurrent neural networks (RNNs) implement a different type of embedding that is largely unexplored. RNNs are perhaps the most biologically plausible of the artificial neural network (ANN) models (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1984; Elman, 1990), but are not well understood theoretically (Siegelmann & Sontag, 1991; Siegelmann, 1993; Kremer, 1995). One of the main practical problems with RNNs is the difficulty of adapting the system weights. Various algorithms, such as backpropagation through time (Werbos, 1990) and real-time recurrent learning (Williams & Zipser, 1989), have been proposed to train RNNs; however, these algorithms suffer from computational complexity, resulting in slow training, complex performance surfaces, the possibility of instability, and the decay of gradients through the topology and time (Haykin, 1998). The problem of decaying gradients has been addressed with special processing elements (PEs) (Hochreiter & Schmidhuber, 1997). Alternative second-order training methods based on extended Kalman filtering (Singhal & Wu, 1989; Puskorius & Feldkamp, 1994; Feldkamp, Prokhorov, Eagen, & Yuan, 1998) and the multistreaming training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled practical applications in identification and control of dynamical systems (Kechriotis, Zervas, & Manolakos, 1994; Puskorius & Feldkamp, 1994; Delgado, Kambhampati, & Warwick, 1995). Recently, two new recurrent network topologies have been proposed: the echo state network (ESN) by Jaeger (2001, 2002a; Jaeger & Haas, 2004) and the liquid state machine (LSM) by Maass (Maass, Natschläger, & Markram, 2002). ESNs possess a highly interconnected and recurrent topology of nonlinear PEs that constitutes a “reservoir of rich dynamics” (Jaeger, 2001) and contain information about the history of input and output patterns.
The outputs of these internal PEs (echo states) are fed to a memoryless but adaptive readout network (generally linear) that produces the network output. The interesting property of ESNs is that only the memoryless readout is trained, whereas the recurrent topology has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but obviously places important constraints on the overall architecture that have not yet been fully studied. Similar ideas have been explored independently by Maass and formalized in the LSM architecture. LSMs, although formulated quite generally, are mostly implemented as neural microcircuits of spiking neurons (Maass et al., 2002), whereas ESNs are dynamical ANN models. Both attempt to model biological information processing using similar principles. We focus on the ESN formulation in this letter. The echo state condition is defined in terms of the spectral radius (the largest among the absolute values of the eigenvalues of a matrix, denoted by ‖·‖) of the reservoir’s weight matrix (‖W‖ < 1). This condition states that the dynamics of the ESN is uniquely controlled by the input, and the effect of the initial states vanishes. The current design of ESN parameters
relies on the selection of spectral radius. However, there are many possible weight matrices with the same spectral radius, and unfortunately they do not all perform at the same level of mean square error (MSE) for functional approximation. A similar problem exists in the design of the LSM. LSMs have been shown to possess universal approximation given the separation property (SP) for the liquid (reservoir in ESNs) and the approximation property (AP) for the readout (Maass et al., 2002). SP is quantified by a kernel-quality measure proposed in Maass, Legenstein, and Bertschinger (2005) that is based on the rank of a matrix formed by the system states corresponding to different input signals. The kernel quality is a measure for the complexity and diversity of nonlinear operations carried out by the liquid on its input stream in order to boost the classification power of a subsequent linear decision hyperplane (Maass et al., 2005). A variation of SP has been proposed in Bertschinger and Natschläger (2004), and it has been argued that complex calculations can be best carried out by networks on the boundary between ordered and chaotic dynamics. In this letter, we are interested in studying the ESN for functional approximation (filters that map input functions u(·) of time on output functions y(·) of time). We see two major shortcomings with the current ESN approach that uses the echo state condition as a design principle. First, fixing the reservoir parameters means that information about the desired response is conveyed only to the output projection. This is not optimal, and strategies to select different reservoirs for different applications have not been devised. Second, imposing a constraint only on the spectral radius is a weak condition to properly set the parameters of the reservoir, as experiments show (different randomizations with the same spectral radius perform differently for the same problem; see Figure 2).
This letter aims to address these two problems by proposing a framework, a metric, and a design principle for ESNs. The framework is a signal processing interpretation of basis and projections in functional spaces to describe and understand the ESN architecture. According to this interpretation, the ESN states implement a set of basis functionals (representation space) constructed dynamically by the input, while the readout simply projects the desired response onto this representation space. The metric to describe the richness of the ESN dynamics is an information-theoretic quantity, the average state entropy (ASE). Entropy measures the amount of information contained in a given random variable (Shannon, 1948). Here, the random variable is the instantaneous echo state from which the entropy for the overall state (vector) is estimated. The probability density function (pdf) in a differential geometric framework should be thought of as a volume form; that is, in our case, the pdf of the state vector describes the metric of the state space manifold (Amari, 1990). Moreover, Cox (1946) established information as a coordinate free metric in the state manifold. Therefore, entropy becomes a global descriptor of information that quantifies the volume of the manifold defined by the random variable. Due to the
time dependency of the states, the state entropy averaged over time (ASE) is an appropriate estimate of the volume of the state manifold. The design principle specifies that one should consider independently the correlation among the basis and the spectral radius. In the absence of any information about the desired response, the ESN states should be designed with the highest ASE, independent of the spectral radius. We interpret the ESN dynamics as a combination of time-varying linear systems obtained from the linearization of the ESN nonlinear PE in a small, local neighborhood of the current state. The design principle means that the poles of the linearized ESN reservoir should have uniform pole distributions to generate echo states with the most diverse pole locations (which correspond to the uniformity of time constants). Effectively, this will create the least correlated bases for a given spectral radius, which corresponds to the largest volume spanned by the basis set. When the designer has no other information about the desired response to set the basis, this principle distributes the system’s degrees of freedom uniformly in space. It approximates for ESNs the well-known property of orthogonal basis. The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again on the desired mapping. The concept of memory depth as explained in Principe et al. (1993) and Jaeger (2002a) is helpful in understanding the issues associated with the spectral radius. The correlation time of the desired response (as estimated by the first zero of the autocorrelation function) gives an indication of the type of spectral radius required (long correlation time requires high spectral radius). Alternatively, a simple adaptive bias is added at the ESN input to control the spectral radius integrating the information from the input-output joint space in the ESN bases. 
For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. This letter shows that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design.
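The bias mechanism sketched above can be checked with a small numerical experiment. This is our own sketch, assuming NumPy; the particular reservoir draw, the 200-step settling loop, and the bias values are our choices, not the authors'. A tanh reservoir is driven with a constant bias, and the spectral radius of the Jacobian at the resulting operating point (the diagonal matrix of tanh slopes times W, derived in section 2.2) is computed for increasing bias values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
w = rng.choice([0.4, -0.4, 0.0], size=(n, n), p=[0.05, 0.05, 0.9])
w *= 0.95 / max(abs(np.linalg.eigvals(w)))      # fix design spectral radius at 0.95
win = rng.choice([1.0, -1.0], size=n)           # input weights +1/-1

def linearized_radius(bias, steps=200):
    # Drive the reservoir with a constant bias input until it settles,
    # then compute the spectral radius of the Jacobian J = F W at the
    # operating point, where F = diag(tanh'(net_i)) = diag(1 - x_i^2).
    x = np.zeros(n)
    for _ in range(steps):
        x = np.tanh(win * bias + w @ x)
    jac = (1.0 - x ** 2)[:, None] * w
    return max(abs(np.linalg.eigvals(jac)))

radii = [linearized_radius(b) for b in (0.0, 0.5, 1.0, 2.0)]
# Larger bias pushes the PEs toward saturation and shrinks the poles.
print(radii)
```

At zero bias the operating point is the origin, so the linearized radius equals the design value 0.95; as the bias grows, the slopes 1 − x_i² shrink and the effective spectral radius falls, which is the behavior the adaptive bias exploits.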
2 Analysis of Echo State Networks

2.1 Echo States as Bases and Projections. Let us consider the architecture and recursive update equation of a typical ESN more closely. Consider the recurrent discrete-time neural network given in Figure 1 with M input units, N internal PEs, and L output units. The value of the input units at time n is u(n) = [u_1(n), u_2(n), . . . , u_M(n)]^T, that of the internal units is x(n) = [x_1(n), x_2(n), . . . , x_N(n)]^T, and that of the output units is y(n) = [y_1(n), y_2(n), . . . , y_L(n)]^T. The connection weights are given in an N × M weight matrix Win = (w_ij^in) for connections between the input and the internal PEs, in an N × N matrix W = (w_ij) for connections between the internal PEs, in an L × N matrix Wout = (w_ij^out) for connections from the PEs to the
Figure 1: An echo state network (ESN). The ESN is composed of two parts: a fixed-weight (‖W‖ < 1) recurrent network and a linear readout. The recurrent network is a reservoir of highly interconnected dynamical components, the states of which are called echo states. The memoryless linear readout is trained to produce the output.
output units, and in an N × L matrix Wback = (w_ij^back) for the connections that project back from the output to the internal PEs (Jaeger, 2001). The activation of the internal PEs (echo state) is updated according to

x(n + 1) = f(Win u(n + 1) + Wx(n) + Wback y(n)),   (2.1)

where f = (f_1, f_2, . . . , f_N) are the internal PEs’ activation functions. Here, all f_i’s are hyperbolic tangent functions, (e^x − e^−x)/(e^x + e^−x). The output from the readout network is computed according to

y(n + 1) = fout(Wout x(n + 1)),   (2.2)

where fout = (f_1^out, f_2^out, . . . , f_L^out) are the output units’ nonlinear functions (Jaeger, 2001, 2002a). Generally, the readout is linear, so fout is the identity. ESNs resemble the RNN architecture proposed in Puskorius and Feldkamp (1996) and also used by Sanchez (2004) in brain-machine
interfaces. The critical difference is the dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights. We submit that the ideas of approximation theory in functional spaces (bases and projections), so useful in adaptive signal processing (Principe, 2001), should be utilized to understand the ESN architecture. Let h(u(t)) be a real-valued function of a real-valued vector u(t) = [u_1(t), u_2(t), . . . , u_M(t)]^T. In functional approximation, the goal is to estimate the behavior of h(u(t)) as a combination of simpler functions φ_i(t), called the basis functionals, such that its approximant, ĥ(u(t)), is given by

ĥ(u(t)) = ∑_{i=1}^{N} a_i φ_i(t).
Here, the a_i’s are the projections of h(u(t)) onto each basis function. One of the central questions in practical functional approximation is how to choose the set of bases to approximate a given desired signal. In signal processing, the choice normally goes to a complete set of orthogonal bases, independent of the input. When the basis set is complete and can be made as large as required, fixed bases work wonders (e.g., Fourier decompositions). In neural computing, the basic idea is to derive the set of bases from the input signal through a multilayered architecture. For instance, consider a single-hidden-layer TDNN with N PEs and a linear output. The hidden-layer PE outputs can be considered a set of nonorthogonal basis functionals dependent on the input,

φ_i(u(t)) = g( ∑_j b_ij u_j(t) ).
The b_ij’s are the input layer weights, and g is the PE nonlinearity. The approximation produced by the TDNN is then

ĥ(u(t)) = ∑_{i=1}^{N} a_i φ_i(u(t)),   (2.3)

where the a_i’s are the weights of the output layer. Notice that the b_ij’s adapt the bases and the a_i’s adapt the projection in the projection space. Here the goal is to restrict the number of bases (number of hidden layer PEs) because their number is coupled with the number of parameters to adapt, which has an impact on generalization and training set size, for example. Usually,
since all of the parameters of the network are adapted, the best basis in the joint (input and desired signals) space as well as the best projection can be achieved and represents the optimal solution. The output of the TDNN is a linear combination of its internal representations, but to achieve a basis set (even if nonorthogonal), linear independence among the φ_i(u(t))’s must be enforced. Ito, Shah and Poon, and others have shown that this is indeed the case (Ito, 1996; Shah & Poon, 1999), but a thorough discussion is outside the scope of this article. The ESN (and the RNN) architecture can also be studied in this framework. The states of equation 2.1 correspond to the basis set, which are recursively computed from the input, output, and previous states through Win, W, and Wback. Notice, however, that none of these weight matrices is adapted; that is, the functional bases in the ESN are uniquely defined by the input and the initial selection of weights. In a sense, ESNs trade the adaptive connections in the RNN hidden layer for a brute-force approach of creating fixed, diversified dynamics in the hidden layer. For an ESN with a linear readout network, the output equation (y(n + 1) = Wout x(n + 1)) has the same form as equation 2.3, where the φ_i’s and a_i’s are replaced by the echo states and the readout weights, respectively. The readout weights are adapted on the training data, which means that the ESN is able to find the optimal projection in the projection space, just like the RNN or the TDNN. A similar perspective of bases and projections for information processing in biological networks has been proposed by Pouget and Sejnowski (1997). They explored the possibility that the response of neurons in parietal cortex serves as basis functions for the transformations from the sensory input to the motor responses.
They proposed that “the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands.” The central issue in ESN design is exactly the nonadaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that the selection of the ESN parameters by constraining the spectral radius is not the most suitable for function approximation. Consider a 100-unit ESN where the input signal is sin(2πn/10π). Mimicking Jaeger (2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed where the entries of W are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or −1 with equal probabilities, and Wback is set to zero. Input is applied for 300 time steps, and the echo states are calculated using equation 2.1. The next step is to train the linear readout. One method
Figure 2: Performances of ESNs for different realizations of W with the same weight distribution. The weight values are set to 0.4, −0.4, and 0 with probabilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius of 0.88. In the 50 realizations, MSEs vary from 5.9 × 10^−9 to 8.9 × 10^−5. Results show that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.
to determine the optimal output weight matrix, Wout, in the mean square error (MSE) sense (where the MSE is defined by O = (1/2)(d − y)^T (d − y)) is to use the Wiener solution given by Haykin (2001):

Wout = E[xx^T]^−1 E[xd] ≈ [ (1/N) ∑_n x(n)x(n)^T ]^−1 [ (1/N) ∑_n x(n)d(n) ].   (2.4)
Here, E[·] denotes the expected value operator, and d denotes the desired signal. Figure 2 depicts the MSE values for 50 different realizations of the ESNs. As observed, even though each ESN has the same sparseness and spectral radius, the MSE values obtained vary greatly among different realizations. The minimum MSE value obtained among the 50 realizations is 5.9 × 10^−9, whereas the maximum MSE is 8.9 × 10^−5. This experiment
demonstrates that a design strategy that is based solely on the spectral radius is not sufficient to specify the system architecture for function approximation. This shows that for each set of random weights that provide the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

2.2 ESN Dynamics as a Combination of Linear Systems. It is well known that the dynamics of a nonlinear system can be approximated by that of a linear system in a small neighborhood of an equilibrium point (Kuznetsov, Kuznetsov, & Marsden, 1998). Here, we perform the analysis with hyperbolic tangent nonlinearities and approximate the ESN dynamics by the dynamics of the linearized system in the neighborhood of the current system state. Hence, when the system operating point varies over time, the linear system approximating the ESN dynamics changes. We are particularly interested in the movement of the poles of the linearized ESN. Consider the update equation for the ESN without output feedback given by x(n + 1) = f(Win u(n + 1) + Wx(n)). Linearizing the system around the current state x(n), one obtains the Jacobian matrix, J(n + 1), defined by
J(n + 1) =
[ ḟ(net_1(n))w_11   ḟ(net_1(n))w_12   · · ·   ḟ(net_1(n))w_1N
  ḟ(net_2(n))w_21   ḟ(net_2(n))w_22   · · ·   ḟ(net_2(n))w_2N
        · · ·             · · ·       · · ·         · · ·
  ḟ(net_N(n))w_N1   ḟ(net_N(n))w_N2   · · ·   ḟ(net_N(n))w_NN ]
= diag( ḟ(net_1(n)), ḟ(net_2(n)), . . . , ḟ(net_N(n)) ) · W = F(n) · W.   (2.5)
Here, net_i(n) is the ith entry of the vector (Win u(n + 1) + Wx(n)), and w_ij denotes the (i, j)th entry of W. The poles of the linearized system at time n + 1 are given by the eigenvalues of the Jacobian matrix J(n + 1).¹ As the amplitude of each PE changes, the local slope changes, and so the poles of

¹ The transfer function of a linear system x(n + 1) = Ax(n) + Bu(n) is X(z)/U(z) = (zI − A)^−1 B = (Adjoint(zI − A)/det(zI − A)) B. The poles of the transfer function can be obtained by solving det(zI − A) = 0. The solutions correspond to the eigenvalues of A.
the linearized system are time varying, although the parameters of ESN are fixed. In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4 and −0.4 with probabilities 0.9, 0.05, and 0.05. W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or −1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system, and the echo states are computed according to equation 2.1. Then the Jacobian matrix and the eigenvalues are calculated using equation 2.5. Figure 3 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants that modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreases the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. When compared to their linear counterpart, an ESN with the same number of states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Similar results can be obtained using signals of different shapes at the ESN input. A key corollary of the above analysis is that the spectral radius of an ESN can be adjusted using a constant bias signal at the ESN input without changing the recurrent connection matrix, W. 
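The pole movement can be reproduced with a short script. This is a sketch under our own assumptions (NumPy, one particular sparse reservoir draw, zero initial state), not the authors' code; it tracks the largest eigenvalue magnitude of J(n + 1) = F(n)W over one input cycle.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
w = rng.choice([0.4, -0.4, 0.0], size=(n, n), p=[0.05, 0.05, 0.9])
w *= 0.95 / max(abs(np.linalg.eigvals(w)))     # spectral radius 0.95, as in the text
win = rng.choice([1.0, -1.0], size=n)          # input weights +1/-1

u = np.sin(2 * np.pi * np.arange(100) / 100)   # one cycle of a period-100 sinusoid
x = np.zeros(n)
pole_radii = []                                # max |eigenvalue| of J(n+1) per step
for un in u:
    net = win * un + w @ x
    x = np.tanh(net)
    # J(n+1) = F(n) W with F(n) = diag(tanh'(net_i)) = diag(1 - x_i^2)
    jac = (1.0 - x ** 2)[:, None] * w
    pole_radii.append(max(abs(np.linalg.eigvals(jac))))

# Near the input peaks the PEs saturate and the poles shrink toward the
# origin; near the zero crossings they approach the design radius 0.95.
print(min(pole_radii), max(pole_radii))
```

Plotting the full eigenvalue clouds per step (rather than just their maximum magnitude) reproduces the pole tracks of Figure 3.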
The application of a nonzero constant bias will move the operating point to regions of the sigmoid function closer to saturation and always decrease the spectral radius due to the shape of the nonlinearity.² The relevance of bias in terms of overall system performance has also been discussed in Jaeger (2002b) and Bertschinger and Natschläger (2004), but here we approach it from a system theory perspective and explain its effect on reservoir dynamics.

² Assume W has nondegenerate eigenvalues and corresponding linearly independent eigenvectors. Then consider the eigendecomposition of W, W = PDP^−1, where P is the eigenvector matrix and D is the diagonal matrix of eigenvalues (D_ii) of W. Since F(n) and D are diagonal, J(n + 1) = F(n)W = F(n)(PDP^−1) = P(F(n)D)P^−1 is the eigendecomposition of J(n + 1). Here, each entry of F(n)D, ḟ(net_i(n))D_ii, is an eigenvalue of J. Therefore, |ḟ(net_i(n))D_ii| ≤ |D_ii| since ḟ(net_i) ≤ ḟ(0).

3 Average State Entropy as a Measure of the Richness of ESN Reservoir

Previous research was aware of the influence of the diversity of the recurrent layer outputs on the overall performance of ESNs and LSMs. Several metrics to quantify the diversity have been proposed (Jaeger, 2001; Maass et al.,
Figure 3: The pole tracks of the linearized ESN with 100 PEs when the input goes through a cycle. An ESN with fixed parameters implements a combination of linear systems with varying pole locations. (A) One cycle of a sinusoidal signal with a period of 100. (B–E) The positions of the poles of the linearized systems when the input values are at B, C, D, and E in panel A. (F) The cumulative pole locations show the movement of the poles as the input changes. Due to the varying pole locations, different time constants modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude signals tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. An ESN with more states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems when compared to their linear counterpart.
2005). Here, our approach of bases and projections leads to a new metric. We propose the instantaneous state entropy to quantify the distribution of instantaneous amplitudes across the ESN states. Entropy of the instantaneous ESN states is appropriate to quantify performance in function approximation because the ESN output is a mere weighted combination of the instantaneous value of the ESN states. If the echo states’ instantaneous amplitudes are concentrated on only a few values across the ESN state dynamic range, the ability to approximate an arbitrary desired response by weighting the states is limited (and wasteful due to redundancy between the different states), and performance will suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states appears as a good measure to quantify the richness of dynamics with instantaneous mappers. Due to the time structure of signals, the average state entropy (ASE), defined as the state entropy averaged over time, will be the parameter used to quantify the diversity in the dynamical reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE measures the volume of the echo state manifold spanned by the trajectories.

Renyi’s quadratic entropy is employed here because it is a global measure of information. In addition, an efficient nonparametric estimator of Renyi’s entropy, which avoids explicit pdf estimation, has been developed (Principe, Xu, & Fisher, 2000). Renyi’s entropy with parameter γ for a random variable X with pdf f_X(x) is given by (Renyi, 1970)

H_γ(X) = 1/(1 − γ) log E[f_X^{γ−1}(X)].
Renyi’s quadratic entropy is obtained for γ = 2 (for γ → 1, Shannon’s entropy is recovered). Given N samples {x_1, x_2, . . . , x_N} drawn from the unknown pdf to be estimated, Parzen windowing approximates the underlying pdf by

f_X(x) = (1/N) Σ_{i=1}^{N} K_σ(x − x_i),

where K_σ is the kernel function with kernel size σ. Then Renyi’s quadratic entropy can be estimated by (Principe et al., 2000)

H_2(X) = −log[(1/N²) Σ_j Σ_i K_σ(x_j − x_i)].   (3.1)
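As a concrete illustration, equation 3.1 with a gaussian kernel can be implemented in a few lines of numpy. The function name and the two test signals below are ours; the kernel size is set to 0.3 of the sample standard deviation, as used later in the text:

```python
import numpy as np

def state_entropy(x, kernel_frac=0.3):
    """Renyi quadratic entropy of the entries of x, estimated by Parzen
    windowing with a gaussian kernel (equation 3.1). Kernel size is
    kernel_frac times the standard deviation of the entries."""
    x = np.asarray(x, dtype=float)
    sigma = kernel_frac * x.std()
    d = x[:, None] - x[None, :]          # pairwise differences x_j - x_i
    K = np.exp(-d**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(K.mean())             # H2 = -log((1/N^2) sum_ij K)

# Spread-out amplitudes carry more entropy than amplitudes bunched on two values.
spread = np.linspace(-1, 1, 100)
bunched = np.concatenate([np.full(50, -0.9), np.full(50, 0.9)])
print(state_entropy(spread) > state_entropy(bunched))  # True
```

This matches the intuition above: states concentrated on a few amplitudes yield low entropy, while a diverse spread of amplitudes yields high entropy.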
Analysis and Design of Echo State Networks
123
The instantaneous state entropy is estimated using equation 3.1, where the samples are the entries of the state vector x(n) = [x_1(n), x_2(n), . . . , x_N(n)]^T of an ESN with N internal PEs. Results will be shown with a gaussian kernel, with the kernel size chosen to be 0.3 of the standard deviation of the entries of the state vector. We will show that ASE is a more sensitive parameter than the spectral radius for quantifying the approximation properties of ESNs, by experimentally demonstrating that ESNs with different spectral radii, and even ESNs with the same spectral radius, display different ASEs.

Let us consider the same 100-unit ESN that we used in the previous section, built with three different spectral radii, 0.2, 0.5, and 0.8, with an input signal of sin(2πn/20). Figure 4A depicts the echo states over 200 time ticks. The instantaneous state entropy is also calculated at each time step using equation 3.1 and plotted in Figure 4B. First, note that the instantaneous state entropy changes over time with the distribution of the echo states, as we would expect, since the state entropy depends on the input signal, which also changes in this case. Second, as the spectral radius increases in the simulation, the diversity in the echo states increases. For the spectral radius of 0.2, the echo states’ instantaneous amplitudes are concentrated on only a few values, which is wasteful due to redundancy between different states. In practice, to quantify the overall representation ability over time, we will use ASE, which takes the values −0.735, −0.007, and 0.335 for the spectral radii of 0.2, 0.5, and 0.8, respectively. Moreover, even for the same spectral radius, several ASEs are possible. Figure 4C shows ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5, which shows that ASE is a finer descriptor of the dynamics of the reservoir. Although we have presented an experiment with a sinusoidal signal, similar results are obtained for other inputs as long as the input dynamic range is properly selected.
Maximizing ASE means that the diversity of the states over time is the largest and should provide a basis set that is as uncorrelated as possible. This condition is unfortunately not a guarantee that the ESN so designed will perform the best, because the basis set in ESNs is created independent of the desired response and the application may require a small spectral radius. However, we maintain that when the desired response is not accessible for the design of the ESN bases or when the same reservoir is to be used for a number of problems, the default strategy should be to maximize the ASE of the state vector. The following section addresses the design of ESNs with high ASE values and a simple mechanism to adjust the reservoir dynamics without changing the recurrent connection weights.

4 Designing Echo State Networks

4.1 Design of the Echo State Recurrent Connections. According to the interpretation of ESNs as coupled linear systems, the design of the internal
connection matrix, W, will be based on the distribution of the poles of the linearized system around zero state. Our proposal is to design the ESN such that the linearized system has a uniform pole distribution inside the unit circle of the z-plane. With this design, the system dynamics will include a uniform coverage of time constants, arising from the uniform distribution of the poles, which also decorrelates the basis functionals as much as possible. This principle was chosen by analogy to the identification of linear systems using Kautz filters (Kautz, 1954), which shows that the best approximation of a given transfer function by a linear system of finite order is achieved when the poles are placed in the neighborhood of the spectral resonances. When no information is available about the desired response, we should spread the poles uniformly to anticipate good approximation of arbitrary mappings.

We again use a maximum entropy principle to distribute the poles uniformly inside the unit circle. The constraints of a circle as a boundary condition for discrete linear systems and of complex-conjugate pole locations are easy to include in the pole distribution (Thogula, 2003). The poles are first initialized at random locations; the quadratic Renyi’s entropy is calculated by equation 3.1, and the poles are moved such that the entropy of the new distribution increases over iterations (Erdogmus & Principe, 2002). This method efficiently finds a uniform coverage of the unit circle with an arbitrary number of poles. The system with uniform pole locations can be interpreted using linear system theory. The poles that are close to the unit circle correspond to many sharp bandpass filters specializing in different frequency regions, whereas the inner poles realize filters of larger frequency support. Moreover, different orientations (angles) of the poles create filters of different center frequencies.
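The entropy-ascent step described above can be sketched as follows. This is a simplified stand-in for the Erdogmus and Principe (2002) procedure: a fixed-width gaussian kernel, plain gradient steps, and projection back onto the disk; the function name, kernel width, and step sizes are our assumptions. Conjugate symmetry is kept by optimizing only poles in the upper half-disk and mirroring at the end:

```python
import numpy as np

def spread_poles(n, radius=0.95, sigma=0.2, steps=300, lr=0.01, seed=0):
    """Move randomly initialized poles so as to increase Renyi quadratic
    entropy (i.e., decrease the pairwise gaussian information potential),
    while keeping them inside a disk of the given radius."""
    rng = np.random.default_rng(seed)
    m = n // 2
    p = rng.uniform(0, radius, m) * np.exp(1j * rng.uniform(0, np.pi, m))
    for _ in range(steps):
        d = p[:, None] - p[None, :]                     # pairwise differences
        K = np.exp(-np.abs(d) ** 2 / (2 * sigma ** 2))  # gaussian kernel values
        np.fill_diagonal(K, 0.0)
        g = (K * d).sum(axis=1) / sigma ** 2            # repulsive entropy gradient
        p = p + lr * g
        p = np.where(np.abs(p) > radius, p * radius / np.abs(p), p)  # stay in disk
    return np.concatenate([p, np.conj(p)])              # closed under conjugation

poles = spread_poles(30)  # e.g., 30 poles for a 30-state reservoir
```

Nearby poles repel each other until the kernel-based entropy stops increasing, which drives the distribution toward uniform coverage of the disk.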
Figure 4: Examples of echo states and instantaneous state entropy. (A) Outputs of echo states (100 PEs) produced by ESNs with spectral radii of 0.2, 0.5, and 0.8, from top to bottom, respectively. The diversity of echo states increases when the spectral radius increases. Within the dynamic range of the echo states, systems with smaller spectral radius can generate only uneven representations, while for a spectral radius of 0.8, the outputs of the echo states distribute almost uniformly within their dynamic range. (B) Instantaneous state entropy calculated using equation 3.1. Information contained in the echo states changes over time according to the input amplitude. Therefore, the richness of representation is controlled by the input amplitude. Moreover, the value of ASE increases with spectral radius. (C) ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5. The plot shows that ASE is a finer descriptor of the dynamics of the reservoir than the spectral radius.

Now the problem is to construct an internal weight matrix from the pole locations (eigenvalues of W). In principle, we would like to create a sparse
matrix, so we started with the sparsest matrix (with an inverse), which is the direct canonical structure given by (Kailath, 1980)

      ⎡ −a_1  −a_2  · · ·  −a_{N−1}  −a_N ⎤
      ⎢   1     0   · · ·     0       0   ⎥
W =   ⎢   0     1   · · ·     0       0   ⎥ .   (4.1)
      ⎢  · · ·  · · ·  · · ·  · · ·  · · · ⎥
      ⎣   0     0   · · ·     1       0   ⎦
The characteristic polynomial of W is

l(s) = det(sI − W) = s^N + a_1 s^{N−1} + a_2 s^{N−2} + · · · + a_N
     = (s − p_1)(s − p_2) · · · (s − p_N),   (4.2)
where the p_i’s are the eigenvalues and the a_i’s are the coefficients of the characteristic polynomial of W. Here, we know the pole locations of the linear system obtained from the linearization of the ESN, so using equation 4.2, we can obtain the characteristic polynomial and construct the W matrix in the canonical form of equation 4.1. We will call the ESN constructed based on the uniform pole principle the ASE-ESN. All other possible solutions with the same eigenvalues can be obtained by Q⁻¹WQ, where Q is any nonsingular matrix.

To corroborate our hypothesis, we would like to show that the linearized ESN designed with a recurrent weight matrix whose eigenvalues are uniformly distributed inside the unit circle creates higher ASE values for a given spectral radius than other ESNs with random internal connection weight matrices. We consider an ESN with 30 states and use our procedure to create the W matrix of the ASE-ESN for different spectral radii in [0.1, 0.95]. Similarly, we constructed ESNs with sparse random W matrices under different sparseness constraints. This corresponds to a weight distribution taking the values 0, c, and −c with probabilities p1, (1 − p1)/2, and (1 − p1)/2, where p1 defines the sparseness of W and c is a constant whose value depends on the spectral radius. We also created W matrices with values uniformly distributed between −1 and 1 (U-ESN) and scaled to obtain a given spectral radius (Jaeger & Hass, 2004). Then, for different Win matrices, we run each ESN with the sinusoidal input given in section 3 and calculate ASE. Figure 5 compares the ASE values averaged over 1000 realizations. As observed from the figure, the ASE-ESN with uniform pole distribution generates a higher ASE on average for all spectral radii than the ESNs with sparse and uniform random connections.
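The construction of W in the direct canonical form from a set of poles can be sketched as follows. Uniform random sampling over the disk stands in here for the entropy-based placement, and the function names are ours:

```python
import numpy as np

def uniform_disk_poles(n_pairs, radius, rng):
    """Sample complex-conjugate pole pairs with uniform density over a disk."""
    r = radius * np.sqrt(rng.uniform(size=n_pairs))  # sqrt gives uniform area density
    theta = rng.uniform(0, np.pi, size=n_pairs)
    p = r * np.exp(1j * theta)
    return np.concatenate([p, np.conj(p)])

def companion_from_poles(poles):
    """Direct canonical (companion) matrix with the given eigenvalues, equation 4.1."""
    a = np.real(np.poly(poles))   # characteristic polynomial coefficients, a[0] = 1
    N = len(poles)
    W = np.zeros((N, N))
    W[0, :] = -a[1:]              # first row: -a_1, ..., -a_N
    W[1:, :-1] = np.eye(N - 1)    # ones on the subdiagonal
    return W

rng = np.random.default_rng(0)
poles = uniform_disk_poles(15, 0.9, rng)   # 30 states, spectral radius at most 0.9
W = companion_from_poles(poles)
```

The eigenvalues of the resulting W recover the sampled poles, so the spectral radius never exceeds the chosen disk radius; any similarity transform Q⁻¹WQ preserves them.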
This approach is indeed conceptually similar to Jeffreys’ maximum entropy prior (Jeffreys, 1946): it will provide a consistently good response for the largest class of problems. Concentrating the poles of the linearized
Figure 5: Comparison of ASE values obtained for ASE-ESN having W with uniform eigenvalue distribution, ESNs with random W matrix, and U-ESN with uniformly distributed weights between −1 and 1. Randomly generated weights have sparseness of 0.07, 0.1, and 0.2. ASE values are calculated for the networks with spectral radius from 0.1 to 0.95. The ASE-ESN with uniform pole distribution generates a higher ASE on average for all spectral radii compared to ESNs with random connections.
system in certain regions of the space provides good performance only if the desired response has energy in that part of the space, as is well known from the theory of Kautz filters (Kautz, 1954).

4.2 Design of the Adaptive Bias. In conventional ESNs, only the output weights are trained, optimizing the projections of the desired response onto the basis functions (echo states). Since the dynamical reservoir is fixed, the basis functions are only input dependent. However, since function approximation is a problem in the joint space of the input and desired signals, a penalty in performance will be incurred. Motivated by the linearization analysis, which shows the crucial importance of the operating point of the PE nonlinearity in defining the echo state dynamics, we propose to use a single external adaptive bias to adjust the effective spectral radius of an ESN. Notice that, according to the linearization analysis, the bias can only reduce the spectral radius. The information used to adapt the bias is the MSE in training, which thus modulates the spectral radius of the system with information derived from the approximation error. With this simple mechanism, some information from the input-output joint space is incorporated in the definition of the projection space of the ESN. The beauty of this method is that the spectral
radius can be adjusted by a single parameter that is external to the system, without changing reservoir weights. The training of the bias can be easily accomplished. Indeed, since the parameter space is only one-dimensional, a simple line search method can be efficiently employed to optimize the bias. Among the different line search algorithms, we will use a search that uses Fibonacci numbers in the selection of points to be evaluated (Wilde, 1964). The Fibonacci search method minimizes the maximum number of evaluations needed to reduce the interval of uncertainty to within the prescribed length. In our problem, a bias value is picked according to the Fibonacci search. For each value of bias, the training data are applied to the ESN, and the echo states are calculated. Then the corresponding optimal output weights and the objective function (MSE) are evaluated to pick the next bias value. Alternatively, gradient-based methods can be utilized to optimize the bias, due to their simplicity and low computational cost. The system update equation with an external bias signal, b, is given by

x(n + 1) = f(W^in u(n + 1) + W^in b + W x(n)).

The update equation for b is given by

∂O(n + 1)/∂b = −e · W^out × ∂x(n + 1)/∂b   (4.3)
             = −e · W^out × ḟ(net_{n+1}) · (W^in + W × ∂x(n)/∂b).   (4.4)
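The claim that the bias can only shrink the effective spectral radius can be checked numerically. The sketch below is our own construction (tanh PEs, a 50-unit reservoir, ±1 input weights, and the update of equation 4.3): a large bias drives the PEs into saturation, so the linearization diag(ḟ(net))·W has a smaller spectral radius than without the bias:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.9
win = np.where(rng.random(N) < 0.5, 1.0, -1.0)    # input weights +/- 1

def run(b, steps=200):
    """Iterate x(n+1) = tanh(Win u(n+1) + Win b + W x(n)), equation 4.3."""
    x = np.zeros(N)
    for n in range(steps):
        x = np.tanh(win * np.sin(2 * np.pi * n / 20) + win * b + W @ x)
    return x

def effective_radius(x):
    """Spectral radius of the linearization diag(f'(net)) W = diag(1 - x^2) W."""
    return np.max(np.abs(np.linalg.eigvals(np.diag(1 - x ** 2) @ W)))

r0 = effective_radius(run(0.0))
r3 = effective_radius(run(3.0))   # large bias -> saturated PEs -> smaller radius
```

With b = 3 the derivatives 1 − x² are small, and the linearized poles contract toward the origin (r3 < r0), which is exactly the mechanism the bias adaptation exploits.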
Here, O is the MSE defined previously. This algorithm may suffer from problems similar to those observed in gradient-based training of recurrent networks. However, we observed that the performance surface is rather simple. Moreover, since the search parameter is one-dimensional, the gradient vector can assume only one of two directions. Hence, imprecision in the gradient estimation should affect the speed of convergence but normally not change the correct gradient direction.

5 Experiments

This section presents a variety of experiments in order to test the validity of the ESN design scheme proposed in the previous section.

5.1 Short-Term Memory Capacity. This experiment compares the short-term memory (STM) capacity of ESNs with the same spectral radius, using the framework presented in Jaeger (2002a). Consider an ESN with a single input signal, u(n), optimally trained with the desired signal u(n − k), for a given delay k. Denoting the optimal output signal y_k(n), the k-delay
STM capacity of a network, MC_k, is defined as the squared correlation coefficient between u(n − k) and y_k(n) (Jaeger, 2002a). The STM capacity, MC, of the network is defined as MC = Σ_{k=1}^{∞} MC_k. STM capacity measures how accurately the delayed versions of the input signal are recovered with optimally trained output units. Jaeger (2002a) has shown that the memory capacity for recalling an independent and identically distributed (i.i.d.) input by an N-unit RNN with linear output units is bounded by N.

We use ESNs with 20 PEs and a single input unit. The ESNs are driven by an i.i.d. random input signal, u(n), uniformly distributed over [−0.5, 0.5]. The goal is to train the ESN to generate the delayed versions of the input, u(n − 1), . . . , u(n − 40). We used four different ESNs: R-ESN, U-ESN, ASE-ESN, and BASE-ESN. R-ESN is the randomly connected ESN used in Jaeger (2002a), where the entries of the W matrix are set to 0, 0.47, and −0.47 with probabilities 0.8, 0.1, and 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. The entries of W of U-ESN are uniformly distributed over [−1, 1] and scaled to obtain a spectral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed with uniform poles. BASE-ESN has the same recurrent weight matrix as ASE-ESN and an adaptive bias at its input. In each ESN, the input weights are set to 0.1 or −0.1 with equal probability, and direct connections from the input to the output are allowed, whereas W^back is set to 0 (Jaeger, 2002a). The echo states are calculated using equation 2.1 for 200 samples of the input signal, and the first 100 samples, corresponding to the initial transient, are eliminated. Then the output weight matrix is calculated using equation 2.4. For the BASE-ESN, the bias is trained for each task. All networks are run with a test input signal, and the corresponding output and MC_k are calculated.
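Under these definitions, MC_k can be computed directly. The sketch below uses a plain least-squares fit as a stand-in for the readout of equation 2.4 and, to keep it self-contained, a perfect 20-tap delay line as the state matrix instead of an actual ESN; in this ideal case the total capacity should saturate Jaeger's bound of N = 20:

```python
import numpy as np

def memory_capacity(states, u, max_delay, washout=100):
    """MC_k: squared correlation between u(n - k) and the optimally trained
    linear readout y_k(n). Plain least squares stands in for equation 2.4."""
    X = states[washout:]                         # states after the transient
    mc = []
    for k in range((1), max_delay + 1):
        d = u[washout - k : len(u) - k]          # target u(n - k)
        w, *_ = np.linalg.lstsq(X, d, rcond=None)
        mc.append(np.corrcoef(X @ w, d)[0, 1] ** 2)
    return np.array(mc)

rng = np.random.default_rng(1)
u = rng.uniform(-0.5, 0.5, 1000)                 # i.i.d. input, as in the text
# Ideal delay-line "reservoir": column k-1 holds u(n - k) for k = 1..20.
states = np.stack([np.roll(u, k) for k in range(1, 21)], axis=1)
mc = memory_capacity(states, u, 40)
```

Here mc[:20] is essentially 1 (the delayed inputs are recovered exactly) and mc[20:] is near 0, so the total is close to N = 20; a real ESN only approximates this delay-line behavior.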
Figure 6 shows the k-delay STM capacity (averaged over 100 trials) of each ESN for delays 1, . . . , 40 for the test signal. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively. First, the ESNs with uniform pole distribution (ASE-ESN and BASE-ESN) have much larger MCs than the randomly generated ESN of Jaeger (2002a), in spite of all having the same spectral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical maximum value of N = 20. A closer look at the figure shows that R-ESN performs slightly better than ASE-ESN for delays less than 9. In fact, for small k, large ASE degrades performance because the tasks do not need a long memory depth. However, the drawback of high ASE for small k is recovered in BASE-ESN, which reduces the ASE to the level required by the task. Overall, the addition of the bias to the ASE-ESN increases the STM capacity from 16.70 to 16.90. On the other hand, U-ESN, whose weights take a continuum of distinct values, has only slightly better STM than the R-ESN, which uses just three different weight values. It is also significant to note that the MC will be very poor for an ESN with a smaller spectral radius even with an adaptive bias, since the problem requires large ASE, and the bias can only reduce ASE. This experiment demonstrates the
Figure 6: The k-delay STM capacity of each ESN for delays 1, . . . , 40 computed using the test signal. The results are averaged over 100 different realizations of each ESN type with the specifications given in the text for different W and Win matrices. The STM capacities of R-ESN, U-ESN, ASE-ESN, and BASE-ESN are 13.09, 13.55, 16.70, and 16.90, respectively.
suitability of maximizing ASE in tasks that require a substantial memory length.

5.2 Binary Parity Check. The effect of the adaptive bias was marginal in the previous experiment, since the nature of the problem required large ASE values. However, there are tasks in which the optimal solutions require smaller ASE values and a smaller spectral radius. These are the tasks for which the adaptive bias becomes a crucial parameter in our design methodology.

Consider an ESN with 100 internal units and a single input unit. The ESN is driven by a binary input signal, u(n), that assumes the values 0 or 1. The goal is to train the ESN to generate the m-bit parity corresponding to the last m bits received, where m is 3, . . . , 8. As in the previous experiments, we used the R-ESN, ASE-ESN, and BASE-ESN topologies. R-ESN is a randomly connected ESN where the entries of the W matrix are set to 0, 0.06, and −0.06 with probabilities 0.8, 0.1, and 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN are designed with a spectral radius of 0.9. The input weights are set to 1 or −1 with equal probability, and direct connections from the input to the output are allowed, whereas W^back is set to 0. The echo states are calculated using equation 2.1 for 1000 samples of the input signal, and the first 100 samples, corresponding to the initial transient, are eliminated. Then the output weight
Figure 7: The number of wrong decisions made by each ESN for m = 3, . . . , 8 in the binary parity check problem. The results are averaged over 100 different realizations of R-ESN, ASE-ESN, and BASE-ESN for different W and Win matrices with the specifications given in the text. The total numbers of wrong decisions for m = 3, . . . , 8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and 699, respectively.
matrix is calculated using equation 2.4. For the ESN with the adaptive bias, the bias is trained for each task. The binary decision is made by a threshold detector that compares the output of the ESN to 0.5. Figure 7 shows the number of wrong decisions (averaged over 100 different realizations) made by each ESN for m = 3, . . . , 8. The total numbers of wrong decisions for m = 3, . . . , 8 of R-ESN, ASE-ESN, and BASE-ESN are 722, 862, and 699, respectively. ASE-ESN performs poorly since the nature of the problem requires a short time constant for a fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the R-ESN has no wrong decisions, whereas ASE-ESN has 53 wrong decisions. BASE-ESN performs substantially better than ASE-ESN and slightly better than the R-ESN, since the adaptive bias reduces the spectral radius effectively. Note that for m = 7 and 8, ASE-ESN performs similarly to the R-ESN, since the task requires access to a longer input history, which offsets the need for a fast response. Indeed, the bias in the BASE-ESN takes effect when there are errors (m > 4) and when the task benefits from a smaller spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6, and 2.7 for m = 3, 4, 5, and 6, respectively. For m = 7 or 8, there is a wide range of bias values that result in similar MSE values (between 0 and 3). In
summary, this experiment clearly demonstrates the power of the bias signal to configure the ESN reservoir according to the mapping task.

5.3 System Identification. This section presents a function approximation task where the aim is to identify a nonlinear dynamical system. The unknown system is defined by the difference equation

y(n + 1) = 0.3 y(n) + 0.6 y(n − 1) + f(u(n)),

where f(u) = 0.6 sin(πu) + 0.3 sin(3πu) + 0.1 sin(5πu). The input to the system is chosen to be sin(2πn/25).

We used three different ESNs (R-ESN, ASE-ESN, and BASE-ESN) with 30 internal units and a single input unit. The W matrix of each ESN is scaled to a spectral radius of 0.95. R-ESN is a randomly connected ESN where the entries of the W matrix are set to 0, 0.35, and −0.35 with probabilities 0.8, 0.1, and 0.1, respectively. In each ESN, the input weights are set to 1 or −1 with equal probability, and direct connections from the input to the output are allowed, whereas W^back is set to 0. The optimal output weights are calculated using equation 2.4. The MSE values (averaged over 100 realizations) for R-ESN and ASE-ESN are 1.23 × 10⁻⁵ and 1.83 × 10⁻⁶, respectively. The addition of the adaptive bias to the ASE-ESN reduces the MSE from 1.83 × 10⁻⁶ to 3.27 × 10⁻⁹.

6 Discussion

The great appeal of echo state networks (ESNs) and liquid state machines (LSMs) is their ability to construct arbitrary mappings of signals with rich and time-varying temporal structure without requiring adaptation of the free parameters of the recurrent layer. The echo state condition allows the recurrent connections to be fixed, with training limited to the linear output layer. However, the literature has not elucidated how to properly choose the recurrent parameters for system identification applications. Here, we provide an alternative framework that interprets the echo states as a set of functional bases formed by fixed nonlinear combinations of the input.
The linear readout at the output stage simply computes the projection of the desired output space onto this representation space. We further introduce an information-theoretic criterion, ASE, to better understand and evaluate the capability of a given ESN to construct such a representation layer. The average entropy of the distribution of the echo states quantifies the volume spanned by the bases. As such, this volume should be the largest to achieve the smallest correlation among the bases and be able to cope with
arbitrary mappings. However, not all function approximation problems require the same memory depth, which is coupled to the spectral radius. The effective spectral radius of an ESN can be optimized for the given problem with the help of an external bias signal that is adapted using information from the joint input-output space. The interesting property of this method, when applied to ESNs built from sigmoidal nonlinearities, is that it allows fine-tuning of the system dynamics for a given problem through a single external adaptive bias input, without changing internal system parameters. In our opinion, the combination of the largest possible ASE and the adaptation of the spectral radius by the bias produces the most parsimonious pole locations of the linearized ESN when no knowledge about the mapping is available to optimally place the basis functionals. Moreover, the bias can easily be trained with either a line search method or a gradient-based method, since it is one-dimensional. We have illustrated experimentally that designing the ESN by maximizing ASE, with adaptation of the spectral radius by the bias, provides consistently better performance across tasks that require different memory depths. This means that this two-parameter design methodology is preferable to the spectral radius criterion proposed by Jaeger, and it is still easily incorporated in the ESN design. Experiments demonstrate that the ASE for an ESN with uniform linearized poles is maximized when the spectral radius of the recurrent weight matrix approaches one (instability). It is interesting to relate this observation to the computational properties found in dynamical systems “at the edge of chaos” (Packard, 1988; Langton, 1990; Mitchell, Hraber, & Crutchfield, 1993; Bertschinger & Natschläger, 2004).
Langton stated that when cellular automata rules are evolved to perform a complex computation, evolution will tend to select rules with “critical” parameter values, which correlate with a phase transition between ordered and chaotic regimes. Recently, similar conclusions were suggested for LSMs (Bertschinger & Natschläger, 2004). Langton’s interpretation of edge of chaos was questioned by Mitchell et al. (1993). Here, we provide a system-theoretic view and explain the computational behavior with the diversity of dynamics achieved with linearizations that have poles close to the unit circle. According to our results, the spectral radius of the optimal ESN in function approximation is problem dependent, and in general it is impossible to forecast the computational performance as the system approaches instability (the spectral radius of the recurrent weight matrix approaches one). However, allowing the system to modulate the spectral radius by either the output or internal biasing may allow a system close to instability to solve various problems requiring different spectral radii.

Our emphasis here is mostly on ESNs without output feedback connections. However, the proposed design methodology can also be applied to ESNs with output feedback. Both feedforward and feedback connections contribute to specify the bases to create the projection space. At the same
time, there are applications where the output feedback contributes to the system dynamics in a different fashion. For example, it has been shown that a fixed-weight (fully trained) RNN with output feedback can implement a family of functions (meta-learners) (Prokhorov, Feldkamp, & Tyukin, 1992). In meta-learning, the role of the output feedback in the network is to bias the system to different regions of dynamics, providing the multiple input-output mappings required (Santiago & Lendaris, 2004). However, these results could not be replicated with ESNs (Prokhorov, 2005). We believe that more work has to be done on output feedback in the context of ESNs, but we also suspect that the echo state condition may be a restriction on the system dynamics for this type of problem.

There are many interesting issues to be researched in this exciting new area. Besides serving as an evaluation tool, ASE may also be utilized to train the ESN's representation layer in an unsupervised fashion. In fact, we can easily adapt, with the SIG (stochastic information gradient) described in Erdogmus, Hild, and Principe (2003), extra weights linking the outputs of the recurrent states to maximize output entropy. Output entropy maximization is a well-known metric for creating independent components (Bell & Sejnowski, 1995), and here it means that the echo states would become as independent as possible. This would circumvent the linearization of the dynamical system to set the recurrent weights and would continuously fine-tune, in an unsupervised manner, the parameters of the ESN across different inputs. However, it goes against the idea of a fixed ESN reservoir.

The reservoir of recurrent PEs can be thought of as a new form of time-to-space mapping. Unlike the delay line that forms an embedding (Takens, 1981), this mapping may have the advantage of filtering noise and producing representations with better SNRs at the peaks of the input, which is very appealing for signal processing and seems to be used in biology.
However, further theoretical work is necessary in order to understand the embedding capabilities of ESNs. One of the disadvantages of the correlated ESN bases is in the design of the readout. Gradient-based algorithms will be very slow to converge (due to the large eigenvalue spread of the modes), and even if recursive methods are used, their stability may be compromised by the condition number of the matrix. However, our recent results incorporating an L1-norm penalty in the LMS (Rao et al., 2005) show great promise for solving this problem.

Finally, we would like to briefly comment on the implications of these models for neurobiology and computational neuroscience. The work of Pouget and Sejnowski (1997) has shown that the available physiological data are consistent with the hypothesis that the response of a single neuron in the parietal cortex serves as a basis function generated by the sensory input in a nonlinear fashion. In other words, the neurons transform the sensory input into a format (representation space) such that the subsequent computation is simplified. Then, whenever a motor command (the output of the biological system) needs to be generated, this simple computation to
read out the neuronal activity is done. There is an intriguing similarity between the interpretation of the neuronal activity by Pouget and Sejnowski and our interpretation of echo states in ESN. We believe that similar ideas can be applied to improve the design of microcircuit implementations of LSMs. First, the framework of functional space interpretation (bases and projections) is also applicable to microcircuits. Second, the ASE measure may be directly utilized for LSM states because the states are normally lowpass-filtered before the readout. However, the control of ASE by changing the liquid dynamics is unclear. Perhaps global control of thresholds or bias current will be able to accomplish bias control as in ESN with sigmoid PEs.
Acknowledgments

This work was partially supported by NSF ECS-0422718, NSF CNS-0540304, and ONR N00014-1-1-0405.
References

Amari, S.-I. (1990). Differential-geometrical methods in statistics. New York: Springer.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16(7), 1413–1436.
Cox, R. T. (1946). Probability, frequency, and reasonable expectation. American Journal of Physics, 14(1), 1–13.
de Vries, B. (1991). Temporal processing with neural networks—the development of the gamma model. Unpublished doctoral dissertation, University of Florida.
Delgado, A., Kambhampati, C., & Warwick, K. (1995). Dynamic recurrent neural network for system identification and control. IEEE Proceedings of Control Theory and Applications, 142(4), 307–314.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Erdogmus, D., Hild, K. E., & Principe, J. (2003). Online entropy manipulation: Stochastic information gradient. Signal Processing Letters, 10(8), 242–245.
Erdogmus, D., & Principe, J. (2002). Generalized information potential criterion for adaptive system training. IEEE Transactions on Neural Networks, 13(5), 1035–1044.
Feldkamp, L. A., Prokhorov, D. V., Eagen, C., & Yuan, F. (1998). Enhanced multistream Kalman filter training for recurrent networks. In J. Suykens & J. Vandewalle (Eds.), Nonlinear modeling: Advanced black-box techniques (pp. 29–53). Dordrecht, Netherlands: Kluwer.
M. Ozturk, D. Xu, and J. Príncipe
Haykin, S. (1998). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Haykin, S. (2001). Adaptive filter theory (4th ed.). Upper Saddle River, NJ: Prentice Hall.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Ito, Y. (1996). Nonlinearity creates linear independence. Advances in Computer Mathematics, 5(1), 189–203.
Jaeger, H. (2001). The echo state approach to analyzing and training recurrent neural networks (Tech. Rep. No. 148). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002a). Short term memory in echo state networks (Tech. Rep. No. 152). Bremen: German National Research Center for Information Technology.
Jaeger, H. (2002b). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach (Tech. Rep. No. 159). Bremen: German National Research Center for Information Technology.
Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, A 196, 453–461.
Kailath, T. (1980). Linear systems. Upper Saddle River, NJ: Prentice Hall.
Kautz, W. (1954). Transient synthesis in time domain. IRE Transactions on Circuit Theory, 1(3), 29–39.
Kechriotis, G., Zervas, E., & Manolakos, E. S. (1994). Using recurrent neural networks for adaptive communication channel equalization. IEEE Transactions on Neural Networks, 5(2), 267–278.
Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(5), 1000–1004.
Kuznetsov, Y., Kuznetsov, L., & Marsden, J. (1998). Elements of applied bifurcation theory (2nd ed.). New York: Springer-Verlag.
Langton, C. G. (1990). Computation at the edge of chaos. Physica D, 42, 12–37.
Maass, W., Legenstein, R. A., & Bertschinger, N. (2005). Methods for estimating the computational power and generalization capability of neural microcircuits. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 865–872). Cambridge, MA: MIT Press.
Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.
Mitchell, M., Hraber, P., & Crutchfield, J. (1993). Revisiting the edge of chaos: Evolving cellular automata to perform computations. Complex Systems, 7, 89–130.
Packard, N. (1988). Adaptation towards the edge of chaos. In J. A. S. Kelso, A. J. Mandell, & M. F. Shlesinger (Eds.), Dynamic patterns in complex systems (pp. 293–301). Singapore: World Scientific.
Analysis and Design of Echo State Networks
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. Journal of Cognitive Neuroscience, 9(2), 222–237.
Principe, J. (2001). Dynamic neural networks and optimal signal processing. In Y. Hu & J. Hwang (Eds.), Neural networks for signal processing (Vol. 6-1, pp. 6–28). Boca Raton, FL: CRC Press.
Principe, J. C., de Vries, B., & de Oliviera, P. G. (1993). The gamma filter—a new class of adaptive IIR filters with restricted feedback. IEEE Transactions on Signal Processing, 41(2), 649–656.
Principe, J., Xu, D., & Fisher, J. (2000). Information theoretic learning. In S. Haykin (Ed.), Unsupervised adaptive filtering (pp. 265–319). Hoboken, NJ: Wiley.
Prokhorov, D. (2005). Echo state networks: Appeal and challenges. In Proc. of International Joint Conference on Neural Networks (pp. 1463–1466). Montreal, Canada.
Prokhorov, D., Feldkamp, L., & Tyukin, I. (1992). Adaptive behavior with fixed weights in recurrent neural networks: An overview. In Proc. of International Joint Conference on Neural Networks (pp. 2018–2022). Honolulu, Hawaii.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297.
Puskorius, G. V., & Feldkamp, L. A. (1996). Dynamic neural network methods applied to on-vehicle idle speed control. Proceedings of IEEE, 84(10), 1407–1420.
Rao, Y., Kim, S., Sanchez, J., Erdogmus, D., Principe, J. C., Carmena, J., Lebedev, M., & Nicolelis, M. (2005). Learning mappings in brain machine interfaces with echo state networks. In 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Philadelphia.
Rényi, A. (1970). Probability theory. New York: Elsevier.
Sanchez, J. C. (2004). From cortical neural spike trains to behavior: Modeling and analysis. Unpublished doctoral dissertation, University of Florida.
Santiago, R. A., & Lendaris, G. G. (2004). Context discerning multifunction networks: Reformulating fixed weight neural networks. In Proc. of International Joint Conference on Neural Networks (pp. 189–194). Budapest, Hungary.
Shah, J. V., & Poon, C.-S. (1999). Linear independence of internal representations in multilayer perceptrons. IEEE Transactions on Neural Networks, 10(1), 10–18.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 623–656.
Siegelmann, H. T. (1993). Foundations of recurrent neural networks. Unpublished doctoral dissertation, Rutgers University.
Siegelmann, H. T., & Sontag, E. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80.
Singhal, S., & Wu, L. (1989). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand & L.-S. Young (Eds.), Dynamical systems and turbulence (pp. 366–381). Berlin: Springer.
Thogula, R. (2003). Information theoretic self-organization of multiple agents. Unpublished master's thesis, University of Florida.
Werbos, P. (1990). Backpropagation through time: What it does and how to do it. Proceedings of IEEE, 78(10), 1550–1560.
Werbos, P. (1992). Neurocontrol and supervised learning: An overview and evaluation. In D. White & D. Sofge (Eds.), Handbook of intelligent control (pp. 65–89). New York: Van Nostrand Reinhold.
Wilde, D. J. (1964). Optimum seeking methods. Upper Saddle River, NJ: Prentice Hall.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.
Received December 28, 2004; accepted June 1, 2006.
LETTER
Communicated by Ralph M. Siegel
Invariant Global Motion Recognition in the Dorsal Visual System: A Unifying Theory

Edmund T. Rolls
[email protected]

Simon M. Stringer
[email protected]

Oxford University, Centre for Computational Neuroscience, Department of Experimental Psychology, Oxford OX1 3UD, England
The motion of an object (such as a wheel rotating) is seen as consistent independent of its position and size on the retina. Neurons in higher cortical visual areas respond to these global motion stimuli invariantly, but neurons in early cortical areas with small receptive fields cannot represent this motion, not only because of the aperture problem but also because they do not have invariant representations. In a unifying hypothesis with the design of the ventral cortical visual system, we propose that the dorsal visual system uses a hierarchical feedforward network architecture (V1, V2, MT, MSTd, parietal cortex) with training of the connections with a short-term memory trace associative synaptic modification rule to capture what is invariant at each stage. Simulations show that the proposal is computationally feasible, in that invariant representations of the motion flow fields produced by objects self-organize in the later layers of the architecture. The model produces invariant representations of the motion flow fields produced by global in-plane motion of an object, in-plane rotational motion, looming versus receding of the object, and object-based rotation about a principal axis. Thus, the dorsal and ventral visual systems may share some similar computational principles.

1 Introduction

A key issue in understanding the cortical mechanisms that underlie motion perception is how we perceive the motion of objects such as a rotating wheel invariantly with respect to position on the retina, and size. For example, we perceive the wheel shown in Figures 1 and 4a rotating clockwise independent of its position on the retina. This occurs even though the local motion for the wheels in the different positions may be opposite (as indicated in the dashed box in Figure 1). How could this invariance of the visual motion perception of objects arise in the visual system? Invariant motion representations are known to be developed in the cortical dorsal visual system.
Neural Computation 19, 139–169 (2007)
© 2006 Massachusetts Institute of Technology

Figure 1: A wheel rotating clockwise at different locations on the retina. How can a network learn to represent the clockwise rotation independent of the location of the moving object? The dashed box shows that local motion cues available at the beginning of the visual system are ambiguous about the direction of rotation when the stimulus is seen in different locations. One rotating wheel is presented at any one time, but the need is to develop a representation of the fact that in the case shown, the rotating flow field is always clockwise, independent of the location of the flow field and even though the local motion cues may be ambiguous, as shown in the dashed box.

Motion-sensitive neurons in V1 have small receptive fields (in the range 1–2 degrees at the fovea), and therefore cannot detect global motion, and this is part of the aperture problem (Wurtz & Kandel, 2000). Neurons in MT, which receives inputs from V1 and V2, have larger receptive fields (e.g., 5 degrees at the fovea) and are able to respond to planar global motion, such as a field of small dots in which the majority (in practice, as little as 55%) move in one direction, or to the overall direction of a
moving plaid, the orthogonal grating components of which have motion at 45 degrees to the overall motion (Movshon, Adelson, Gizzi, & Newsome, 1985; Newsome, Britten, & Movshon, 1989). Further on in the dorsal visual system, some neurons in macaque visual area MST (but not MT) respond to rotating flow fields or looming with considerable translation invariance (Graziano, Andersen, & Snowden, 1994; Geesaman & Andersen, 1996). It is known that single neurons in the ventral visual system have translation, size, and even view-invariant representations of stationary objects (Rolls & Deco, 2002; Desimone, 1991; Tanaka, 1996; Logothetis & Sheinberg, 1996; Rolls, 1992, 2000, 2006). A theory that can account for this uses a feature hierarchy network (Fukushima, 1980; Rolls, 1992; Wallis & Rolls, 1997; Riesenhuber & Poggio, 1999) combined with an associative Hebb-like learning rule (in which the synaptic weights increase in proportion to the pre- and postsynaptic firing rates) with a short-term memory of, for example, 1 sec, to enable different instances of the stimulus to be associated together as the visual objects transform continuously from second to second in the world (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Bartlett & Sejnowski, 1998; Rolls & Milward, 2000; Stringer & Rolls, 2000, 2002; Rolls & Deco, 2002). In a unifying hypothesis, we propose here that the analysis of invariant motion in the dorsal visual system uses a similar architecture and learning rule, but in contrast utilizes as its inputs neurons that respond to local motion of the type found in the primary visual cortex, V1 (Wurtz & Kandel, 2000; Duffy, 2004; Bair & Movshon, 2004). A feature of the theory is that motion in the visual field is computed only once in V1 (by processes that take into account luminance changes across short times) and that the representations of motion that develop in the dorsal visual system require no further computation of time-delay-related firing to compute motion.
The theory is of interest, for it proposes that some aspects of the computations in parts of the cerebral cortex that appear to be involved in different types of visual function, the dorsal and ventral visual systems, may in fact be performed by some similar organizational and computational principles.

2 The Theory and Its Implementation in a Model

2.1 The Theory. We propose that the general architecture of the dorsal visual system areas we consider is a feedforward feature hierarchy network, the inputs to which are local motion-sensitive neurons of V1 with receptive fields of approximately 1 degree in diameter (see Figure 2). There is convergence from stage to stage, so that a neuron at any one stage need receive only a limited number of inputs from the preceding stage, yet by the end of the network, an effectively global computation that can take into account information derived from different parts of the retina can have been performed. Within each cortical layer of the architecture (or layer of the network), local lateral inhibition implemented by inhibitory feedback neurons implements competition between the neurons, in such a way that fast-firing neurons
Figure 2: Stylized image of hierarchical organization (layers 1–4) in the dorsal as well as ventral visual system. The architecture is captured by the VisNet model, in which convergence through the network is designed to provide fourth-layer neurons with information from across the entire input retina.
inhibit other neurons in the vicinity, so that the overall activity within an area is kept within bounds. The competition may be nonlinear, due in part to the threshold nonlinearity of neurons, and this competition, helped by the diluted connectivity (i.e., the fact that only a low proportion of the neurons are connected), enables some neurons to respond to particular combinations of the inputs being received from the preceding area (Rolls & Deco, 2002; Deco & Rolls, 2005). These aspects of the architecture potentially enable single neurons at higher stages of the network to respond to combinations of the local motion inputs from V1 to the first layer of the network. These combinations, helped by the increasingly larger receptive fields, could include global motion to partly randomly moving dots (and to plaids) over
areas as large as 5 degrees in MT (Wurtz & Kandel, 2000; Duffy & Wurtz, 1996). In the architecture shown in Figure 2, layer 1 might correspond to MT; layer 2 to MST, which has receptive fields of 15–65 degrees in diameter; and layers 3 and 4 to areas in the parietal cortex such as 7a and to areas in the cortex in the superior temporal sulcus, which receives from parietal visual areas where view-invariant object-based motion is represented (Hasselmo, Rolls, Baylis, & Nalwa, 1989; Sakata, Shibutani, Ito, & Tsurugai, 1986). The synaptic plasticity between the layers of neurons has a Hebbian associative component in order to enable the system to build reliable representations in which the same neurons are activated by particular stimuli on different occasions in what is effectively a hierarchical multilayer competitive network (Rolls, 1992; Wallis & Rolls, 1997; Rolls & Deco, 2002). Such processes might enable neurons in layer 2 of Figure 4a to respond to, for example, a wheel rotating clockwise in one position on the retina (e.g., neuron A in layer 2). A key issue not addressed by the architecture described so far is how rotation (e.g., of a small wheel rotating clockwise) in one part of the retina activates the same neurons at the end of the network as when it is presented on a different part of the retina (see Figure 1). We propose that an associative synaptic learning rule with a short-term memory trace of neuronal activity is used between the layers to solve this problem. The idea is that if at a high level of the architecture (labeled layer 2/3 in Figure 4a) a wheel rotating clockwise is activating a neuron in one position on the retina, then the activated neurons remain active in a short delay period (of, e.g., 1 s) while the object moves to another location on the retina (e.g., the right position in Figure 4a). 
Then, with the postsynaptic neurons still active from the motion at the left position, the newly active synapses onto the layer 2/3 neuron (C) show associative modification, resulting in neuron C learning in an unsupervised way to respond to the wheel rotating clockwise in either the left or the right position on the retina. The idea is, just as for the ventral visual system (Rolls & Deco, 2002), that whatever the convergence allows to be learned at each stage of the hierarchy will be learned by this invariance algorithm, resulting in neurons higher in the hierarchy having higher- and higher-level invariance properties, including view-invariant object-based motion. More formally, the rule we propose is identical to the one proposed for the ventral visual system (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Rolls & Deco, 2002), as follows:

\[
\delta w_j = \alpha\, \bar{y}^{\tau-1} x_j^{\tau},
\tag{2.1}
\]

where the trace \(\bar{y}^{\tau}\) is updated according to

\[
\bar{y}^{\tau} = (1-\eta)\, y^{\tau} + \eta\, \bar{y}^{\tau-1},
\tag{2.2}
\]
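As a concrete illustration, one update step of this trace rule (equations 2.1 and 2.2) can be sketched in NumPy as follows; the function name and array sizes are ours for illustration, and the symbols follow the definitions given below:

```python
import numpy as np

def trace_rule_update(w, x, y, y_trace_prev, alpha=0.05, eta=0.8):
    """One time step of the trace learning rule (equations 2.1 and 2.2).

    w            : synaptic weight vector of one neuron
    x            : presynaptic firing-rate vector at time step tau
    y            : postsynaptic firing rate at time step tau
    y_trace_prev : trace value at time step tau - 1
    alpha        : learning rate
    eta          : trace parameter in [0, 1]
    """
    # Equation 2.1: weights grow with the *previous* trace value, so the
    # current input is associated with recently seen transforms of the object.
    w = w + alpha * y_trace_prev * x
    # Equation 2.2: exponentially decaying trace of the postsynaptic firing.
    y_trace = (1.0 - eta) * y + eta * y_trace_prev
    return w, y_trace
```

With eta = 0 the trace reduces to the current firing and the rule becomes purely Hebbian; eta = 0.8 keeps a memory over roughly the presentation sequence used here.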
and we have the following definitions:

\(x_j\) : jth input to the neuron
\(\bar{y}^{\tau}\) : trace value of the output of the neuron at time step \(\tau\)
\(w_j\) : synaptic weight between the jth input and the neuron
\(y\) : output from the neuron
\(\alpha\) : learning rate; annealed between unity and zero
\(\eta\) : trace parameter; the optimal value varies with presentation sequence length

The parameter \(\eta\) may be set anywhere in the interval [0, 1], and for the simulations described here, \(\eta\) was set to 0.8, which works well with nine transforms for each object in the stimulus set (Wallis & Rolls, 1997). (A discussion of the good performance of this rule, and its relation to other versions of trace learning rules, including the point that the trace can be implemented in the presynaptic firing, is provided by Rolls & Milward, 2000, and Rolls & Stringer, 2001. We note that in the version of the rule used here (equation 2.1), the trace is calculated from the postsynaptic firing in the preceding time step (\(\bar{y}^{\tau-1}\)) but not the current time step, but that analogous performance is obtained if the firing in the current time step is also included (Rolls & Milward, 2000; Rolls & Stringer, 2001).) The temporal trace in the brain could be implemented by a number of processes, as simple as continuing firing of neurons for several hundred ms after a stimulus has disappeared or moved (as shown to be present for at least inferior temporal neurons in masking experiments—Rolls & Tovee, 1994; Rolls, Tovee, Purcell, Stewart, & Azzopardi, 1994), or by the long time constant of NMDA receptors and the resulting entry of calcium to neurons. An important idea here is that the temporal properties of the biologically implemented learning mechanism are such that it is well suited to detecting the relevant continuities in the world of real motion of objects. The system uses the underlying continuity in the world to help itself learn the invariances of, for example, the motions that are typical of objects.

2.2 The Network Architecture.
The model we used for the simulations was VisNet, which was developed as a model of hierarchical processing in the ventral visual system that uses trace learning to develop invariant representations of stationary objects (Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Deco, 2002). The simulations performed here utilized the latest version of the VisNet model (VisNet2), with the same model parameters as used by Rolls and Milward (2000) for their investigations of the formation of invariant representations in the ventral visual system. These parameters were kept identical for all the simulations described here. The difference is that instead of using simple cell-like inputs to the model that respond to stationary oriented bars and edges (with four spatial frequencies and four orientations), in the modeling described here we used motion-related
Invariant Global Motion Recognition
145
Table 1: VisNet Dimensions.

Layer          Dimensions      Number of Connections   Radius
Layer 4        32 × 32         100                     12
Layer 3        32 × 32         100                     9
Layer 2        32 × 32         100                     6
Layer 1        32 × 32         201                     6
Input layer    128 × 128 × 8   -                       -
Table 2: Lateral Inhibition Parameters.

Layer          1       2       3       4
Radius, σ      1.38    2.7     4.0     6.0
Contrast, δ    1.5     1.5     1.6     1.4
inputs that capture some of the relevant properties of neurons present in V1 as part of the primate magnocellular (M) system (Wurtz & Kandel, 2000; Duffy, 2004; Rolls & Deco, 2002). VisNet is a four-layer feedforward network with unsupervised competitive learning at each layer. For each layer, the forward connections to individual cells are derived from a topologically corresponding region of the preceding layer, with connection probabilities based on a gaussian distribution (see Figure 2). These distributions are defined by a radius that will contain approximately 67% of the connections from the preceding layer. Typical values are given in Table 1. Within each layer there is competition between neurons, which is graded rather than winner-take-all, and is implemented in two stages. First, to implement lateral inhibition, the firing rates of the neurons (calculated as the dot product of the vector of presynaptic firing rates and the synaptic weight vector on a neuron, followed by a linear activation function to produce a firing rate) within a layer are convolved with a spatial filter, I , where δ controls the contrast and σ controls the width, and a and b index the distance away from the center of the filter:
\[
I_{a,b} =
\begin{cases}
-\delta\, e^{-\frac{a^2 + b^2}{\sigma^2}} & \text{if } a \neq 0 \text{ or } b \neq 0, \\
1 - \sum_{a \neq 0,\, b \neq 0} I_{a,b} & \text{if } a = 0 \text{ and } b = 0.
\end{cases}
\tag{2.3}
\]
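The filter of equation 2.3 can be built explicitly, as in this sketch (the grid radius is our own truncation choice; the published model specifies only σ and δ per layer):

```python
import numpy as np

def lateral_inhibition_filter(sigma, delta, radius):
    """Spatial filter I of equation 2.3 on a (2*radius+1) x (2*radius+1) grid.

    Off-center coefficients are negative (inhibitory) gaussian weights;
    the center coefficient is set so that the whole filter sums to one.
    """
    size = 2 * radius + 1
    I = np.zeros((size, size))
    for a in range(-radius, radius + 1):
        for b in range(-radius, radius + 1):
            if a != 0 or b != 0:
                I[a + radius, b + radius] = -delta * np.exp(-(a**2 + b**2) / sigma**2)
    # Center coefficient: 1 minus the sum of the surrounding coefficients.
    I[radius, radius] = 1.0 - I.sum()
    return I
```

Convolving a layer's firing-rate map with this kernel implements the local competition described above: each neuron's rate is reduced in proportion to the activity of its neighbors.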
Typical lateral inhibition parameters are given in Table 2. Next, contrast enhancement is applied by means of a sigmoid function

\[
y = f_{\mathrm{sigmoid}}(r) = \frac{1}{1 + e^{-2\beta(r - \alpha)}},
\tag{2.4}
\]
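A minimal sketch of this contrast-enhancement step, with the threshold α set from a percentile of the firing-rate distribution as described in the text (the function name is ours):

```python
import numpy as np

def contrast_enhance(r, beta, percentile):
    """Sigmoid contrast enhancement of equation 2.4.

    r          : array of firing rates after lateral inhibition
    beta       : sigmoid slope
    percentile : sets the threshold alpha from the rate distribution,
                 so that roughly (100 - percentile)% of neurons fire strongly
    """
    alpha = np.percentile(r, percentile)  # threshold from the rate distribution
    return 1.0 / (1.0 + np.exp(-2.0 * beta * (r - alpha)))
```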
Table 3: Sigmoid Parameters.

Layer          1       2      3      4
Percentile     99.2    98     88     91
Slope β        190     40     75     26
where r is the firing rate after lateral inhibition, y is the firing rate after contrast enhancement, and α and β are the sigmoid threshold and slope, respectively. The parameters α and β are constant within each layer, although α is adjusted to control the sparseness of the firing rates. For example, to set the sparseness to, say, 5%, the threshold is set to the value of the 95th percentile point of the firing rates r within the layer. Typical parameters for the sigmoid function are shown in Table 3. The trace learning rule (Földiák, 1991; Rolls, 1992; Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Stringer, 2001; Rolls & Deco, 2002) is that shown in equation 2.1 and encourages neurons to develop invariant responses to input patterns that tend to occur close together in time, because these are likely to be from the same moving object.

2.3 The Motion Inputs to the Network. The images presented to the network represent local motion signals with small receptive fields. These local visual motion (or local optic flow) input signals are similar to those of neurons in V1 in that they have small receptive fields and cannot detect global motion because of the aperture problem (Wurtz & Kandel, 2000). At each pixel coordinate in the 128 × 128 image, a direction of local motion/optic flow is defined. The global optic flow patterns used in the different experiments occupied part of this 128 × 128 image, as described for each experiment below. At each coordinate, there are eight cells, whose optimal responses are defined by flows 45 degrees apart. That is, the cells are tuned to local optic flow directions of 0, 45, 90, . . . , 315 degrees. The firing rate of each cell is set equal to a gaussian function of the difference between the cell's preferred direction and the actual direction of local optic flow. The standard deviation of this gaussian was 20 degrees.
The number of inputs from the arrays of motion-sensitive cells to each cell in the first layer of the network is 201, selected probabilistically as a gaussian function of distance as described above and in more detail elsewhere (Rolls & Milward, 2000). The local motion signals are given to the network, and not computed in the simulations, because the aim of the simulations is to test the theory that (given local motion inputs of the kind known to be present in early cortical processing; Wurtz & Kandel, 2000) the trace learning mechanism described can, in a hierarchical network, account for a range of the types of global motion neuron that are found in the dorsal stream visual cortical areas.
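The direction-tuned input encoding just described can be sketched as follows (illustrative code, not the original implementation; the angle-wrapping step is our assumption about how opposite directions are handled):

```python
import numpy as np

PREFERRED_DIRECTIONS = np.arange(0.0, 360.0, 45.0)  # eight cells per pixel

def motion_input_rates(flow_direction_deg, tuning_sd=20.0):
    """Firing rates of the eight direction-tuned cells at one pixel.

    Each cell's rate is a gaussian function of the (wrapped) difference
    between its preferred direction and the local optic flow direction,
    with a standard deviation of 20 degrees.
    """
    diff = flow_direction_deg - PREFERRED_DIRECTIONS
    diff = (diff + 180.0) % 360.0 - 180.0  # wrap differences into [-180, 180)
    return np.exp(-0.5 * (diff / tuning_sd) ** 2)
```

Applied at every pixel of a 128 × 128 flow field, this yields a 128 × 128 × 8 input array, matching the input layer dimensions in Table 1.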
2.4 Training and Test Procedure. To train the network, each stimulus is presented to VisNet in a randomized sequence of locations or orientations with respect to VisNet's input retina. The different locations were spaced 32 pixels apart on the 128 × 128 retina. At each stimulus presentation, the activation of individual neurons is calculated, then the neuronal firing rates are calculated, and then the synaptic weights are updated. Each time a stimulus has been presented in all the training locations or orientations, a new stimulus is chosen at random and the process repeated. The presentation of all the stimuli through all locations or orientations constitutes one epoch of training. In this manner, the network is trained one layer at a time, starting with layer 1 and finishing with layer 4. In the investigations described here, the numbers of training epochs for layers 1 to 4 were 50, 100, 100, and 75, respectively, as these have been shown in previous work to provide good performance (Wallis & Rolls, 1997; Rolls & Milward, 2000). The learning rates α in equation 2.1 for layers 1 to 4 were 0.09, 0.067, 0.05, and 0.04. Two measures of performance were used to assess the ability of the output layer of the network to develop neurons that are able to respond with view invariance to individual stimuli or objects (see Rolls & Milward, 2000). A single cell information measure was applied to individual cells in layer 4 and measures how much information is available from the response of a single cell about which stimulus was shown independent of view. The measure was the stimulus-specific information or surprise, I(s, R), which is the amount of information the set of responses, R, has about a specific stimulus, s. (The mutual information between the whole set of stimuli S and of responses R is the average across stimuli of this stimulus-specific information.) (Note that r is an individual response from the set of responses R.)

\[
I(s, R) = \sum_{r \in R} P(r\,|\,s) \log_2 \frac{P(r\,|\,s)}{P(r)}
\tag{2.5}
\]
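Equation 2.5 can be computed from binned responses as in this sketch (equispaced binning as described in the text; variable and function names are ours):

```python
import numpy as np

def stimulus_specific_information(responses, stimulus_ids, n_bins):
    """Stimulus-specific information I(s, R) of equation 2.5, in bits.

    responses    : 1D array, one response of the cell per presentation
    stimulus_ids : 1D integer array, which stimulus was shown on each trial
    n_bins       : number of equispaced response bins
    Returns an array with I(s, R) for each distinct stimulus.
    """
    # Discretize responses into equispaced bins over their range.
    edges = np.linspace(responses.min(), responses.max(), n_bins + 1)
    binned = np.clip(np.digitize(responses, edges[1:-1]), 0, n_bins - 1)
    p_r = np.bincount(binned, minlength=n_bins) / len(binned)  # P(r)
    info = []
    for s in np.unique(stimulus_ids):
        r_s = binned[stimulus_ids == s]
        p_r_given_s = np.bincount(r_s, minlength=n_bins) / len(r_s)  # P(r|s)
        nz = p_r_given_s > 0
        info.append(np.sum(p_r_given_s[nz] * np.log2(p_r_given_s[nz] / p_r[nz])))
    return np.array(info)
```

A cell that responds at one level to one stimulus and at a different level to the other carries 1 bit about each stimulus, the perfect-performance value quoted for experiment 1.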
The calculation procedure was identical to that described by Rolls, Treves, Tovee, and Panzeri (1997) with the following exceptions. First, no correction was made for the limited number of trials because, in VisNet, each measurement of a response is exact, with no variation due to sampling on different trials. Second, the binning procedure was to use equispaced rather than equipopulated bins. This small modification was useful because the data provided by VisNet can produce perfectly discriminating responses with little trial-to-trial variability. Because the cells in VisNet can have bimodally distributed responses, equipopulated bins could fail to separate the two modes perfectly. (This is because one of the equipopulated bins might contain responses from both of the modes.) The number of bins used was equal to or less than the number of trials per stimulus, that is, for VisNet the number of positions on the retina (Rolls et al., 1997). Because
VisNet operates as a form of competitive net to perform categorization of the inputs received, good performance of a neuron will be characterized by large responses to one or a few stimuli regardless of their position on the retina (or other transform) and small responses to the other stimuli. We are thus interested in the maximum amount of information that a neuron provides about any of the stimuli rather than the average amount of information it conveys about the whole set S of stimuli (known as the mutual information). Thus, for each cell, the performance measure was the maximum amount of information a cell conveyed about any one stimulus (with a check, in practice always satisfied, that the cell had a large response to that stimulus, as a large response is what a correctly operating competitive net should produce to an identified category). In many of the graphs in this article, the amount of information each of the 50 most informative cells had about any stimulus is shown. A multiple cell information measure, the average amount of information that is obtained about which stimulus was shown from a single presentation of a stimulus from the responses of all the cells, enabled measurement of whether, across a population of cells, information about every object in the set was provided. Procedures for calculating the multiple cell information measure are given by Rolls, Treves, and Tovee (1997) and Rolls and Milward (2000). The multiple cell information measure is the mutual information I(S, R), that is, the average amount of information that is obtained from a single presentation of a stimulus about the set of stimuli S from the responses of all the cells. For multiple cell analysis, the set of responses, R, consists of response vectors comprising the responses from each cell. Ideally, we would like to calculate

\[
I(S, R) = \sum_{s \in S} P(s)\, I(s, R).
\tag{2.6}
\]
However, the information cannot be measured directly from the probability table P(r, s) embodying the relationship between a stimulus s and the response rate vector r provided by the firing of the set of neurons to a presentation of that stimulus. (Note that “stimulus” refers to an individual object that can occur with different transforms, e.g., translation or size; see Wallis & Rolls, 1997.) This is because the dimensionality of the response vectors is too large to be adequately sampled by trials. Therefore, a decoding procedure is used, in which the stimulus s that gave rise to the particular firing-rate response vector on each trial is estimated. This involves, for example, maximum likelihood estimation or dot product decoding. For example, given a response vector r to a single presentation of a stimulus, its similarity to the average response vector of each neuron to each stimulus is used to estimate using a dot product comparison which stimulus was shown. The probabilities of it being each of the stimuli can be estimated in
this way. Details are provided by Rolls et al. (1997). A probability table is then constructed of the real stimuli s and the decoded stimuli s′. From this probability table, the mutual information is calculated as

\[
I(S, S') = \sum_{s, s'} P(s, s') \log_2 \frac{P(s, s')}{P(s)\, P(s')}.
\tag{2.7}
\]
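The decoding step and equation 2.7 can be sketched as follows (dot-product decoding only; the actual procedure of Rolls et al., 1997, also supports maximum likelihood decoding, and all names here are ours):

```python
import numpy as np

def decoded_mutual_information(train_means, test_vectors, test_stimuli):
    """Mutual information I(S, S') of equation 2.7, in bits.

    train_means  : (n_stimuli, n_cells) mean response vector per stimulus
    test_vectors : (n_trials, n_cells) single-presentation response vectors
    test_stimuli : (n_trials,) true stimulus index for each test vector
    Each test vector is decoded as the stimulus whose mean response vector
    gives the largest dot product; I(S, S') is then computed from the
    table of true versus decoded stimuli.
    """
    decoded = np.argmax(test_vectors @ train_means.T, axis=1)
    n = len(train_means)
    p = np.zeros((n, n))  # joint table P(s, s')
    for s_true, s_dec in zip(test_stimuli, decoded):
        p[s_true, s_dec] += 1.0
    p /= p.sum()
    p_s, p_sp = p.sum(axis=1), p.sum(axis=0)  # marginals P(s), P(s')
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / np.outer(p_s, p_sp)[nz]))
```

With two equiprobable stimuli decoded perfectly, this returns 1 bit, the ceiling value in the experiment below.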
The multiple cell information was calculated using the five cells for each stimulus with high information values for that stimulus. Thus, in this letter, 10 cells were used in the multiple cell information analysis.

3 Simulation Results

We now describe simulations with the neural network model described in section 2 that enabled us to test this theory.

3.1 Experiment 1: Global Planar Motion. Motion-sensitive neurons in V1 have small receptive fields (in the range 1–2 deg at the fovea) and therefore cannot detect global motion, and this is part of the aperture problem (Wurtz & Kandel, 2000). As described in section 1, neurons in MT have larger receptive fields and are able to respond to planar global motion (Movshon et al., 1985; Newsome et al., 1989). Here we show that the feature hierarchy network we propose can solve this global planar motion problem and, moreover, that the performance is improved by using a trace rather than a purely associative synaptic modification rule. Invariance is addressed in later simulations. The network was trained on two 100 × 100 stimuli representing noisy left and right global planar motion (see Figure 3a). During the training, cells developed that responded to either left or right global motion but not to both (see Figure 3), with 1 bit of information representing perfect discrimination of left from right. The untrained network with initial random synaptic weights tested as a control showed much poorer performance, as shown in Figure 3. It might be expected that some global planar motion sensitivity would be developed by a purely Hebbian learning rule, and indeed this has been demonstrated (under somewhat different training conditions) by Sereno (1989) and Sereno and Sereno (1991).
This occurs because on any single trial with one average direction of global motion, neurons in intermediate layers tend to receive inputs that on average reflect the current global planar motion, and thus learn to respond optimally to the inputs that represent that motion direction. We showed that the trace learning rule used here performed better than a Hebb rule (which produced only neurons with 0.0 bits, given that the motion stimulus patches presented in our simulations were in nonoverlapping locations, as
150
E. Rolls and S. Stringer
[Figure 3 appears here: (a) stimulus 1, global planar motion left, and stimulus 2, global planar motion right; (b) VisNet single cell analysis and (c) VisNet multiple cell analysis, each plotting information (bits) against cell rank or number of cells for the trace-trained and random (untrained) networks.]
Figure 3: Experiment 1. (a) The two motion stimuli used in experiment 1 were noisy global planar motion left (left) and noisy global planar motion right (right), present throughout the 128 × 128 retina. Each arrow in this and subsequent figures represents the local direction of optic flow. The size of the optic flow pattern was 100 × 100 pixels, not the 2 × 4 shown in the diagram. The noise was introduced into each image stimulus by inverting the direction of optic flow at a random set of 45% of the image nodes. This meant that it would not be possible to determine the directional bias of the flow field by examining the optic flow over local regions of the retina. Instead, the overall directional bias could be determined only by analyzing the whole image. (b) When trained with the trace rule, equation 2.1, some single cells in layer 4 conveyed 1 bit of information about whether the global motion was left or right, and this is perfect performance. (The single cell information is shown for the 50 most selective cells.) (c) The multiple cell information measures, used to show that different neurons are tuned to different stimuli (see section 2.4), indicate that over a set of neurons, information about the whole stimulus set was present. (The information values for one cell are the average of 10 cells selected from the 50 most selective cells, and hence the value is not exactly 1 bit.)
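The noisy training stimuli described in the caption can be sketched as follows. This is an illustrative reconstruction, not the original code: the ±1 coding of local flow direction and the function name are our assumptions; the 45% inversion fraction is from the caption.

```python
import numpy as np

def noisy_planar_stimulus(direction=+1, size=100, noise=0.45, rng=None):
    """Flow field for noisy global planar motion: every node carries the
    global direction (+1 = right, -1 = left), then a random 45% of nodes
    have their local direction inverted, so the directional bias cannot
    be read from any local region, only from the whole image."""
    rng = np.random.default_rng() if rng is None else rng
    field = np.full((size, size), direction)
    flip = rng.random((size, size)) < noise   # choose ~45% of nodes
    field[flip] *= -1                          # invert their direction
    return field
```

With 45% of nodes inverted, the residual mean of the field is only about 0.1 of the coherent value, which is why local motion detectors alone cannot recover the global direction.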
Invariant Global Motion Recognition
151
illustrated in Figure 1). A further reason for the better performance of the trace rule is that on successive trials, the average global motion identifiable by a single intermediate-layer neuron from the probabilistic inputs will be a better estimate (a temporal average) of the true global motion, and this will be utilized in the learning. These results show that the network architecture is able to develop global motion representations of the noisy local motion patterns. Indeed, it is emphasized that neurons in the input to VisNet had only local, not global, motion information, as shown by the fact that the average amount of information the 50 most selective input cells had about the global motion was 0.0 bits.

3.2 Experiment 2: Rotating Wheel. Neurons in MST, but not MT, are responsive to rotation with considerable translation invariance (Graziano et al., 1994). The aim of this simulation was to determine whether layer 4 cells in our network develop position-invariant representations of wheels rotating clockwise (as shown in Figure 4a) versus anticlockwise. The stimuli consist only of optic flow fields around the rim of a geometric circle, with radius 16 unless otherwise stated. The local motion inputs from the wheel in the two positions shown are ambiguous where the wheels are close to each other in Figure 4a. The network was expected to solve the problem as illustrated in Figure 4a.

The results in Figures 4b to 4d show perfect performance on position invariance when trained with the trace rule but not when untrained. The perfect performance is shown by the neurons that responded to, for example, clockwise but not anticlockwise rotation, and did this for each of the nine training positions. Figure 4e shows perfect size invariance for some layer 4 cells when the network was trained with the trace rule with three different radii of the wheels: 10, 16, and 22.
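The trace learning that links the different transforms can be sketched as a weight update of the general form used in VisNet (cf. Rolls & Milward, 2000). This is a sketch, not the paper's equation 2.1 verbatim: the learning rate alpha, trace parameter eta, and the normalization step are illustrative.

```python
import numpy as np

def trace_update(w, x, y, y_trace, alpha=0.1, eta=0.8):
    """One trace-rule step.  The postsynaptic trace y_trace mixes the
    current firing y with the trace from the previous stimulus
    presentation, so transforms seen close together in time strengthen
    onto the same output neuron.  Returns the updated (w, y_trace).
    Sketch after Rolls & Milward (2000); parameter values illustrative."""
    y_trace = (1.0 - eta) * y + eta * y_trace   # exponential temporal trace
    w = w + alpha * y_trace * x                 # Hebb-like update on the trace
    return w / np.linalg.norm(w), y_trace       # keep the weight vector bounded
```

With eta = 0, the update reduces to a purely associative Hebb rule; eta > 0 provides the temporal averaging that, as argued above, improves learning from the probabilistic motion inputs.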
These results show that the network architecture is able to develop location- and size-invariant representations of the global rotating-wheel motion patterns even though the neurons in the input layer receive information from only a small local region of the retina. We note that the position-invariant global motion results shown in Figure 4 were not due to chance mappings of the two stimuli through the network and were a result of the training, in that the position-invariant information about whether the global motion was clockwise or anticlockwise was 0.0 bits for both the single and the multiple cell information in the untrained ("random") network. Corresponding differences between the trained and the untrained networks were found in all the other experiments described in this article.

3.3 Experiment 3: Looming. Neurons in macaque dorsal stream visual area MSTd respond to looming stimuli with considerable translation
invariance (Graziano et al., 1994; Geesaman & Andersen, 1996). We tested whether the network could learn to respond to small patches of looming versus contracting motion of the kind typically generated by objects as they are seen successively at different locations on the retina. The network was trained on two circular flow patterns representing looming toward and looming away, as shown in Figure 5a. The stimuli are circular optic flow fields, with the direction of flow either away from (left) or toward (right) the center of the circle, and with radius 16 unless otherwise stated. The results shown in Figures 5b to 5d indicate perfect performance on position invariance when trained with the trace rule but not when untrained. The perfect performance is shown by the neurons that responded to, for example, looming toward but not moving away, and did this for each of the nine training positions. Simulations were run for various optic flow field diameters to test the robustness of the results, and in all cases tested (which included radii of
Figure 4: Experiment 2. (a) Two rotating wheels at different locations rotating in opposite directions. The local flow field is ambiguous. Clockwise or counterclockwise rotation can be diagnosed only by a global flow computation, and it is shown how the network is expected to solve the problem to produce positioninvariant global-motion-sensitive neurons. One rotating wheel is presented at any one time, but the need is to develop a representation of the fact that in the case shown, the rotating flow field is always clockwise, independent of the location of the flow field. (b) Single cell information measures showing that some layer 4 neurons have perfect performance of 1 bit (clockwise versus anticlockwise) after training with the trace rule, but not with random initial synaptic weights in the untrained control condition. (c) The multiple cell information measures show that small groups of neurons have perfect performance. (d) Position invariance illustrated for a single cell from layer 4, which responded only to the clockwise rotation, and for every one of the nine positions. (e) Size invariance illustrated for a single cell from layer 4, which after training with three different radii of rotating wheel, responded only to anticlockwise rotation, independent of the size of the rotating wheels. (For the position-invariant simulations, the wheel rims overlapped, but are shown slightly separated in Figure 1 for clarity.) The training grid spacing was 32 pixels, and the radii of the wheels were 16 pixels. This ensured the rims of the wheels in adjacent training grid locations overlapped. One wheel was shown on any one trial. On successive trials, the wheel rotating clockwise was shown in each of the nine locations, allowing the trace learning rule to build location-invariant representations of the wheel rotating in one direction. In the next set of training trials, the wheel was shown rotating in the opposite direction in each of the nine locations. 
For the size-invariant simulations, the network was trained and tested with the set of clockwise versus anticlockwise rotating wheels presented in three different sizes.
[Figure 4 appears here. The panel (a) diagram labels the network hierarchy: layer 3 = MSTd or higher (rotational motion with invariance, larger receptive field size); layer 2 = MT (rotational motion, large receptive field size); layer 1 = MT (global planar motion, intermediate receptive field size); input layer = V1/V2 (local motion). Panels (b) and (c) show the single and multiple cell information analyses for the trace-trained and random networks; panels (d) and (e) show the firing rates of single layer 4 cells ('clock' versus 'anticlock') against location index and against wheel radius, respectively.]
10 and 20 as well as intermediate values), cells developed a transform (location) invariant representation in the output layer. These results show that the network architecture is able to develop invariant representations of the global looming motion patterns, even though the neurons in the input layer receive information from only a small local region of the retina.

3.4 Experiment 4: Rotating Cylinder. Some neurons in the macaque cortex in the anterior part of the superior temporal sulcus (which receives inputs from both the dorsal and ventral visual streams; Ungerleider & Mishkin, 1982; Seltzer & Pandya, 1978; Rolls & Deco, 2002) respond to a head when it is rotating clockwise about its own axis but not counterclockwise, regardless of whether it is upright or inverted (Hasselmo et al., 1989). The result of the inversion experiment shows that these neurons are not just responding to global flow across the visual field, but are taking into account information about the shape and features of the object. Some neurons in the parietal cortex may also respond to motion of an object about one of its axes in an object-based way (Sakata et al., 1986).

In experiment 4, we tested whether the network could self-organize to form neurons that represent global motion in an object-based coordinate frame. The network was trained on two stimuli, with four transforms of each. Figure 6a shows stimulus 1, which is a cylinder with shading at the top rotating clockwise about its own (top-defined) axis. Stimulus 1 is shown in its upright and inverted transforms. Stimulus 2 is the same cylinder with shading at the top, but rotating anticlockwise about its own vertical axis. The stimuli were presented in a single location, but to solve the problem,
Figure 5: Experiment 3. (a) The two motion stimuli were flow fields looming toward (left) and looming away (right). The stimuli are circular optic flow fields, with the direction of flow either away from (left) or toward (right) the center of the circle. Local motion cells near, for example, the intersection of the two stimuli cannot distinguish between the two global motion patterns. Location-invariant representations (for nine different locations) of stimuli looming toward or moving away from the observer were formed if the network was trained with the trace rule but not if it was untrained, as shown by the single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3). (d) Position invariance illustrated for a single cell from layer 4, which responded only to moving away, and for every one of the nine positions. (The network was trained and tested with the stimuli presented in a 3 × 3 grid of nine retinal locations, as in experiment 1. The training grid spacing was 32 pixels, and the radii of the circular looming stimuli were 16 pixels. This ensured that the edges of the looming stimuli in adjacent training grid locations overlapped, as shown in the dashed box of Figure 5a.)
[Figure 5 appears here: (a) stimulus 1, looming toward, and stimulus 2, moving away; (b, c) VisNet single and multiple cell information analyses for the trace-trained and random networks; (d) firing rate of a layer 4 cell to 'towards' and 'away' at each location index.]
the network must form some neurons that respond to the clockwise rotation of the shaded cylinder independent of which of the four transforms was shown: upright (0 degrees), 90 degrees, inverted (180 degrees), and 270 degrees. Other neurons should self-organize to respond, view invariantly, to counterclockwise rotation. For this experiment, additional information about surface luminance must be fed into the first layer of the network in order for the network to be able to distinguish between the clockwise and anticlockwise rotating cylinders. Additional retinal inputs to the first layer of the network therefore came from a 128 × 128 array of luminance-sensitive cells. The cells within the luminance array are maximally activated by the shaded region of the cylinder image; elsewhere the luminance inputs are zero. The number of inputs from the array of luminance-sensitive cells to each cell in the first layer of the network was 50.

The results shown in Figures 6b and 6c indicate perfect performance for many single cells, and across multiple cells, in representing the direction of rotation of the shaded cylinder about its own axis regardless of which of the four transforms was shown, when trained with the trace rule but not when untrained. Simulations were run for various sizes of the cylinders, including height = 40 and diameter = 20. For all simulations, cells developed a transform (e.g., upright, inverted) invariant representation in the output layer. That is, some cells responded to one of the stimuli in all four of its transforms (i.e., orientations) but not to the other stimulus. These results show that the network architecture is able to develop object-centered, view-invariant representations of the global motion patterns representing the two rotating cylinders, even though the neurons in the input layer receive information from only a small, local region of the retina.
Figure 6: Experiment 4. (a) Stimulus 1 is a cylinder with shading at the top rotating clockwise about its own (top-defined) axis; it is shown in its upright and inverted transforms. Stimulus 2 is the same cylinder with shading at the top, but rotating anticlockwise about its own axis. Invariant representations were formed, with some cells coding for the object rotating clockwise about its own axis and other cells coding for the object rotating anticlockwise, invariantly with respect to which of the four transforms (0 degrees = upright, 90 degrees, 180 degrees = inverted, and 270 degrees) was viewed, as shown by the single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3). Because only eight images in one location form the training set, some single cells with the random untrained connectivity had, by chance, some information about which stimulus was shown, but cells performed the correct mapping only if the network was trained with the trace rule.
[Figure 6 appears here: (a) upright and inverted transforms of stimulus 1 (cylinder rotating clockwise when viewed from the shaded end) and stimulus 2 (cylinder rotating anticlockwise when viewed from the shaded end); (b, c) VisNet single and multiple cell information analyses for the trace-trained and random networks.]
3.5 Experiment 5: Optic Flow Analysis of Real Images: Translation Invariance. In experiments 5 and 6, we extend this research by testing the operation of the model when the optic flow inputs to the network are extracted by a motion analysis algorithm operating on the successive images generated by moving objects. The optic flow fields generated by a moving object were calculated as described next and were used to set the firing of the motion-selective cells, the properties of which are described in section 2.3. These optic flow algorithms use an image gradient-based method, which exploits the relationship between the spatial and temporal gradients of intensity, to compute the local optic flow throughout the image. The image flow constraint equation I_x U + I_y V + I_t = 0 is approximated at each pixel location by algebraic finite difference approximations in space and time (Horn & Schunck, 1981). Systems of these finite difference equations are then solved for the local image velocity (U, V) within each 4 × 4 pixel block within the image. The images of the rotating objects were generated using OpenGL.

In experiment 5, we investigated the learning of translation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the tetrahedron stimulus illustrated in Figure 7a. The network was trained with the two optic flow patterns generated in nine different locations, as in experiments 2 and 3. The flow fields used to train the network were generated by the object rotating through one degree of angle. The single cell information measures (see Figure 7b) and multiple cell information measures (see Figure 7c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was reached by single cells and with the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.
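The block-based gradient method described above can be sketched as follows. This is an illustrative reconstruction: the particular finite-difference scheme (np.gradient, frame difference) and the least-squares solver are our choices, not necessarily those of the original implementation.

```python
import numpy as np

def block_flow(frame0, frame1, block=4):
    """Estimate one (U, V) velocity per block x block patch by solving
    the image flow constraint Ix*U + Iy*V + It = 0 in least squares
    over the pixels of the patch, with the gradients approximated by
    finite differences in space (within frame0) and time (between frames)."""
    Ix = np.gradient(frame0, axis=1)          # spatial intensity gradients
    Iy = np.gradient(frame0, axis=0)
    It = frame1 - frame0                      # temporal gradient
    H, W = frame0.shape
    flow = np.zeros((H // block, W // block, 2))
    for i in range(0, H - block + 1, block):
        for j in range(0, W - block + 1, block):
            # one constraint equation per pixel of the patch
            A = np.stack([Ix[i:i + block, j:j + block].ravel(),
                          Iy[i:i + block, j:j + block].ravel()], axis=1)
            b = -It[i:i + block, j:j + block].ravel()
            uv, *_ = np.linalg.lstsq(A, b, rcond=None)
            flow[i // block, j // block] = uv
    return flow
```

Solving over a patch rather than a single pixel makes the two-unknown system overdetermined, which is what allows a unique local velocity to be recovered despite the aperture problem at individual pixels.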
This experiment shows that the model can operate well and learn translation-invariant representations with motion flow fields actually extracted from the successive images produced by a rotating object.

3.6 Experiment 6: Optic Flow Analysis of Real Images: Rotation Invariance. In experiment 6, we investigated the learning of rotation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the spoked wheel stimulus illustrated in Figure 8a. (The algorithm for generating the optic flow field is described in section 3.5.) The radius of the spoked wheel was 50 pixels on the 128 × 128 background. The rotation was in-plane, and the optic flow fields used as input to the network were extracted from the changing images of the object, each separated by one degree, as it rotated through 360 degrees. The single cell information measures (see Figure 8b) and multiple cell information measures (see Figure 8c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was almost reached by single cells and by the multiple cell information measure. The dashed
[Figure 7 appears here: (a) optic flow fields produced by a tetrahedron rotating clockwise or anticlockwise; (b, c) VisNet single and multiple cell information analyses for the trace-trained and random networks.]
Figure 7: Experiment 5. Translation-invariant representations of the optic flow vector fields generated by clockwise versus anticlockwise rotation of the tetrahedron stimulus illustrated. The optic flow field used as input to the network was extracted from the changing images of the object as it rotated. The single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was reached by both single cells and the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.
line shows the control condition of a network with random untrained connectivity. This experiment shows that the model can operate well and learn rotation-invariant representations with motion flow fields actually extracted from a very large number of successive images produced by a rotating object. Because of the large number of closely spaced training images used in this simulation, it is likely that the crucial type of learning was continuous transformation learning (Stringer, Perry, Rolls, & Proske, 2006). Consistent with this, the learning rate was set to the lower value of 7.2 × 10^−5 for all layers for experiment 6 (cf. Stringer et al., 2006).
[Figure 8 appears here: (a) optic flow fields produced by a spoked wheel rotating clockwise or anticlockwise; (b, c) VisNet single and multiple cell information analyses for the trace-trained and random networks.]
Figure 8: Experiment 6. In-plane rotation-invariant representations of the optic flow vector fields generated by a spoked wheel rotating clockwise or anticlockwise. The optic flow field used as input to the network was extracted from the changing images of the object, each separated by 1 degree, as it rotated through 360 degrees. The single cell information measures (b) and multiple cell information measures (c) (using the same conventions as in Figure 3) show that the maximal information, 1 bit, was reached by both single cells and the multiple cell information measure. The dashed line shows the control condition of a network with random untrained connectivity.
3.7 Experiment 7: Generalization to Untrained Images. To investigate whether the representations of object-based motion such as circular rotation learned with the approach introduced in this article would generalize usefully to the flow fields generated by other objects moving in the same way, we trained the network on the optic flow vector fields generated by clockwise versus anticlockwise rotation of the spoked wheel stimulus illustrated in Figure 8. The training images rotated through 90 degrees in 1 degree steps. We then tested generalization to the new, untrained image shown in Figure 9a. The single and multiple cell information plots in Figure 9b show that information was available about the direction of
[Figure 9 appears here: (a) generalization test: training with the rotating spoked wheel, followed by testing with a regular grid rotating clockwise or anticlockwise; (b) VisNet single and multiple cell information analyses for the trace-trained and random networks; (c, d) responses of a typical layer 4 cell, cell (17,17), to the spoked wheel and to the grid after training with the spoked wheel alone, plotting firing rate against orientation (deg) for clockwise and anticlockwise rotation.]
Figure 9: Experiment 7. Generalization to untrained images. The network was trained on the optic flow vector fields generated by the spoked wheel stimulus illustrated in Figure 8 rotating clockwise or anticlockwise. (a) Generalization to the new untrained image shown at the top right of the figure was then tested. (b) The single and multiple cell information plots show that information was available about the direction of rotation (clockwise versus anticlockwise) of the untrained test images. (c) The firing rate of a fourth-layer cell to the clockwise and anticlockwise rotations of the trained image illustrated in Figure 8. (d) The firing rate of the same fourth-layer cell to the clockwise and anticlockwise rotations of the untrained image illustrated in Figure 9a.
rotation (clockwise versus anticlockwise) of the untrained test images. Although the information was not as high as 1 bit, which would have indicated perfect generalization, individual cells did generalize usefully to the new images, as shown in Figures 9c and 9d. For example, Figure 9c shows the firing rate of a fourth-layer cell to the clockwise and anticlockwise rotations of the trained image illustrated in Figure 8, and Figure 9d shows the firing rate of the same fourth-layer cell to the clockwise and anticlockwise rotations of the untrained image illustrated in Figure 9a. The neuron responded correctly to almost all of the anticlockwise rotation shifts, and correctly to many of the clockwise rotation shifts, though some noise was evident in its responses to the untrained images. Overall, the results demonstrate useful generalization of the ability to represent rotation from the trained object to an untrained, different object.
4 Discussion

We have presented a hierarchical feature analysis theory of the operation of parts of the dorsal visual system, which provides a computational account of how transform-invariant representations of the flow fields generated by moving objects could be formed in the cerebral cortex. The theory uses a modified Hebb rule with a short-term temporal trace of preceding activity to enable whatever is invariant across short time intervals at any stage of the dorsal motion system to be associated together. The theory can account for many types of invariance and has been tested by simulation for position and size invariance. The simulations show that the network can develop global planar representations from noisy local motion inputs (experiment 1), invariant representations of rotating optic flow fields (experiment 2), invariant representations of looming optic flow fields (experiment 3), and invariant representations of asymmetrical objects rotating about one of their axes (experiment 4). These are fundamental problems in motion analysis, and they have all been studied neurophysiologically, including local versus planar motion (Movshon et al., 1985; Newsome et al., 1989); position-invariant representation of rotating flow fields and looming (Lagae, Maes, Raiguel, Xiao, & Orban, 1994); and object-based rotation (Hasselmo et al., 1989; Sakata et al., 1986). The model thus shows principles by which the different types of motion-related invariant neuronal responses in the dorsal cortical visual system could be produced. The theory is unifying in the sense that the same theory, but with different inputs, can account for invariant representations of objects in the ventral visual system (Rolls, 1992; Wallis & Rolls, 1997; Elliffe, Rolls, & Stringer, 2002; Rolls & Deco, 2002).
It is a strength of the unifying concept introduced in this article that the same hierarchical network that can perform computations of the type important in the ventral visual system can also perform computations of a type important in the dorsal visual system.
Our simulations support the hypothesis that the response properties that distinguish MT and MST neurons from V1 neurons are determined in part by the sizes of their receptive fields, with a larger receptive field needed to analyze some global motion patterns. Similar conclusions were drawn from simulation experiments performed by Sereno (1989) and Sereno and Sereno (1991). This type of self-organization can occur with a Hebbian associative learning rule operating on the feedforward connections to a competitive network. However, experiment 1 showed that even for the computation of planar global motion in intermediate layers such as MT, a trace-based associative learning rule is better than a purely associative Hebbian rule with noisy (probabilistic) local motion inputs, because the trace rule allows temporal averaging to contribute to the learning. In experiments 2 and 3, the trace rule is crucial to the success of the learning, in that the stimuli presented in different training locations did not overlap, so that the only process by which the different transforms can be linked is the temporal trace learning rule implemented in the model (Rolls & Milward, 2000; Rolls & Stringer, 2001). (We note that in a new development, it has been shown that if different transforms of the training stimuli do overlap continuously in space, then this overlap can provide a useful learning principle by which invariant representations can be formed, and requires only associative synaptic modification; Stringer et al., 2006. It would be of interest to extend this concept, which has been applied to the ventral visual system, to the dorsal visual system.) One type of perceptual analysis that can be understood with the theory and simulations described here is how neurons can self-organize to respond to the motion inputs produced by small objects when they are seen on different parts of the retina.
This is achieved by using memory-trace-based synaptic modification in the type of architecture illustrated in Figure 4a. The crucial stage for this learning is the top layer in Figure 4a, labeled layer 3. The forward connections to the neurons in this layer can form the required representation if they use a trace or similar learning rule and the object motion occurs with some temporospatial continuity. (Temporospatial continuity has been shown to be important in human face invariance learning [Wallis & Bulthoff, 2001], and spatial continuity over continuous transforms may be a useful learning principle [Stringer et al., 2006].) This aspect of the architecture is what is formally similar to the architecture of the ventral visual system, which can learn invariant representations of stationary objects. The only difference required of the networks is that the ventral visual stream network should receive inputs from neurons that respond to stationary features such as lines or edges, whereas the dorsal visual stream network should receive inputs from neurons that respond to local motion cues. It is this concept that allows us to propose that there is a unifying hypothesis that applies to some of the computations performed by both the ventral and the dorsal visual streams.
The way in which position-invariant representations in the model develop is illustrated in Figure 4a, where, in the top layer labeled layer 3, individual neurons receive information from different parts of layer 2, where different neurons can represent the same object motion but in different parts of visual space. In the model, layer 2 can thus be thought of as corresponding to some neurons in area MT, in which direction selectivity for elementary optic flow components such as rotation, deformation, and expansion and contraction is not position invariant (Lagae et al., 1994). Layer 3 in the model can in the same way be thought of as corresponding to area MST, in which direction selectivity for elementary optic flow components such as rotation, deformation, and expansion and contraction is position invariant for 40% of neurons (Lagae et al., 1994). A further correspondence between the model and the brain is that neurons that respond to global planar motion are found in the brain in area MT (Movshon et al., 1985; Newsome et al., 1989) and in the model in layer 1, whereas neurons in V1 and V2 do not respond to global motion (Movshon et al., 1985; Newsome et al., 1989), and correspond in the model to the input layer of Figure 4a. Another type of perceptual analysis that can be understood with the theory and simulations described here is the object-based view-independent representation of objects, exemplified by the ability to see that an “ended” object is rotating clockwise about one of its axes. It was shown in experiment 4 that these representations can be formed by combining information from both the dorsal visual stream (about global motion) and the ventral visual stream (about object shape and/or luminance features). For these representations to be learned, a trace associative or similar learning rule must be used while the object transforms from one view to another (e.g., from upright to inverted). 
A hierarchical network with the general architecture shown in Figure 2 with separate analyses of form and motion that are combined at a final stage (as in experiment 4) is also useful for biological motion, such as representing a person walking (Giese & Poggio, 2003). However, the network described by Giese and Poggio is not very biologically plausible, in that it performs MAX functions to help with the computational issue of transform invariance and does not self-organize on the basis of the inputs so that it must be largely hand-wired. The issue here is that Giese and Poggio suppose that a MAX function is performed to select the maximally active afferent to a neuron, but there is no account of how afferents of just one type (e.g., a bar with a particular orientation and contrast) are being received by a given neuron. Not only is no principle suggested by which this could be achieved, but also no learning algorithm is given to achieve this. We suggest therefore that it would be of interest to investigate whether the more biologically plausible self-organizing type of network described in this article can learn on the basis of the inputs being received to respond to biological motion. To do this, some form of sequence sensitivity would be useful.
Invariant Global Motion Recognition
The theory described here is appropriate for the global motion analysis required to analyze the flow fields of objects as they translate, rotate, expand (loom), or contract, as shown in experiments 1 to 3. The theory thus provides a model of some of the computations that appear to occur along the pathway V1–V2–MT–MST, as neurons of these types are generated along this pathway (see section 1). The theory described here can also account for global motion in an object-based coordinate frame as shown in experiment 4. Neurons with these properties have been found in the cortex in the anterior part of the macaque superior temporal sulcus, in which neurons respond to a head when it is rotating clockwise about its own axis but not counterclockwise, regardless of whether it is upright or inverted (Hasselmo et al., 1989). The result of the inversion experiment shows that these neurons are not just responding to global flow across the visual field, but are taking into account information about the shape and features of the object. Area STPa (the cortex in the anterior part of the macaque superior temporal sulcus) contains neurons that respond to a rotating sphere (Anderson & Siegel, 2005), and as shown in experiment 4, the present theory could account for such neurons. Whether the present model could account for the structure from motion also observed for these neurons is not yet known. The theory could also account for neurons in area 7a of the parietal cortex that may also respond to motion of an object about one of its axes in an object-based way (Sakata et al., 1986). Neurons have also been found in the primary motor cortex (M1) that respond similarly to neurons in area 7a when a monkey is solving a visually presented maze (Crowe, Chafee, Averbeck, & Georgopoulos, 2004), but their visual properties are not sufficiently understood to know whether the present model might apply. 
Area LIP contains neurons that perform processing related to saccadic eye movements to visual targets (Andersen, 1997), and the present theory may not apply to this type of processing. The model of processing utilized here in a series of hierarchically organized competitive networks with convergence at each stage (as illustrated in Figure 2) is intended to capture some of the main anatomical and physiological characteristics of the ventral visual stream of visual cortical areas, and is intended to provide a model for how processing in these areas could operate, as described in detail elsewhere (Rolls & Deco, 2002; Rolls & Treves, 1998). To enable learning along this pathway to result by self-organization in the correct representations being formed, associative learning using a short-term memory trace has been proposed (Rolls, 1992; Wallis & Rolls, 1997; Rolls & Milward, 2000; Rolls & Stringer, 2001; Rolls & Deco, 2002). Another approach used in continuous transformation learning utilizes associative learning without a temporal trace and relies on close exemplars of stimuli being provided during the training (Stringer et al., 2006). What we propose here is that similar connectivity and learning processes in the series of cortical pathways in the dorsal visual stream that includes V1–V2–MT–MST and onward connections to the cortex in the superior temporal
sulcus and area 7a could account for the invariant representations of the flow fields produced by moving objects.

In relation to the number of stimuli that could be learned by the system, we note that the network simulated is relatively small and was designed to illustrate the new computational hypotheses introduced here rather than to analyze the capacity of such feature hierarchical systems. We note in particular that the network simulated has 1024 neurons in each layer and 100 inputs to each neuron in layers 2 to 4. In contrast, it has been estimated that perhaps half of the macaque brain is involved in visual processing, and typically each neuron has on the order of 10^4 inputs. It will be of interest using much larger simulations in the future to address capacity issues of this class of network. However, we note that because the network can generalize to rotational flow fields generated by untrained stimuli, as shown in experiment 7, separate representations for the flow fields generated by every object may not be required, and this helps to reduce the number of separate representations that the network may be required to learn.

In contrast to some other theories, the theory developed here utilizes a single unified approach to self-organization in the dorsal and ventral visual systems.

Predictions of the theory described here include the following. First, use of a trace rule in the dorsal as well as ventral visual system is predicted. (Thus, differences in, for example, the time constants of NMDA receptors, or persistent poststimulus firing, either of which could implement a temporal trace, would not be expected.) Second, a feature hierarchy is a useful way for understanding details of the operation of the ventral visual system, but can now be used as a clarifying concept for how the details of representations in the dorsal visual system may be built.
Third, the theory predicts that neurons specialized for motion detection by using differences in the arrival times of sensory inputs from different retinal locations need occur at only one stage of the system (e.g., in V1) and need not occur elsewhere in the dorsal visual system. These are labeled as local motion neurons in Figure 4a.

Acknowledgments

This research was supported by the Wellcome Trust and by the Medical Research Council.

References

Andersen, R. A. (1997). Multimodal integration for the representation of space in the posterior parietal cortex. Philosophical Transactions of the Royal Society of London B, 352, 1421–1428.

Anderson, K. C., & Siegel, R. M. (2005). Three-dimensional structure-from-motion selectivity in the anterior superior temporal polysensory area, STPa, of the behaving monkey. Cerebral Cortex, 15, 1299–1307.
Bair, W., & Movshon, J. A. (2004). Adaptive temporal integration of motion in direction-selective neurons in macaque visual cortex. Journal of Neuroscience, 24, 7305–7323.

Bartlett, M. S., & Sejnowski, T. J. (1998). Learning viewpoint-invariant face representations from visual experience in an attractor network. Network: Computation in Neural Systems, 9, 399–417.

Crowe, D. A., Chafee, M. V., Averbeck, B. B., & Georgopoulos, A. P. (2004). Participation of primary motor cortical neurons in a distributed network during maze solution: Representation of spatial parameters and time-course comparison with parietal area 7a. Experimental Brain Research, 158, 28–34.

Deco, G., & Rolls, E. T. (2005). Neurodynamics of biased competition and cooperation for attention: A model with spiking neurons. Journal of Neurophysiology, 94, 295–313.

Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive Neuroscience, 3, 1–8.

Duffy, C. J. (2004). The cortical analysis of optic flow. In L. M. Chalupa & J. S. Werner (Eds.), The visual neurosciences (Vol. 2, pp. 1260–1283). Cambridge, MA: MIT Press.

Duffy, C. J., & Wurtz, R. H. (1996). Optic flow, posture, and the dorsal visual pathway. In T. Ono, B. L. McNaughton, S. Molotchnikoff, E. T. Rolls, & H. Nishijo (Eds.), Perception, memory and emotion: Frontiers in neuroscience (pp. 63–77). Cambridge: Cambridge University Press.

Elliffe, M. C. M., Rolls, E. T., & Stringer, S. M. (2002). Invariant recognition of feature combinations in the visual system. Biological Cybernetics, 86, 59–71.

Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36, 193–202.

Geesaman, B. J., & Andersen, R. A. (1996). The analysis of complex motion patterns by form/cue invariant MSTd neurons. Journal of Neuroscience, 16, 4716–4732.

Giese, M. A., & Poggio, T. (2003). Neural mechanisms for the recognition of biological movements. Nature Reviews Neuroscience, 4, 179–192.

Graziano, M. S. A., Andersen, R. A., & Snowden, R. J. (1994). Tuning of MST neurons to spiral motions. Journal of Neuroscience, 14, 54–67.

Hasselmo, M. E., Rolls, E. T., Baylis, G. C., & Nalwa, V. (1989). Object-centered encoding by face-selective neurons in the cortex in the superior temporal sulcus of the monkey. Experimental Brain Research, 75, 417–429.

Horn, B. K. P., & Schunck, B. G. (1981). Determining optical flow. Artificial Intelligence, 17, 185–203.

Lagae, L., Maes, H., Raiguel, S., Xiao, D.-K., & Orban, G. A. (1994). Responses of macaque STS neurons to optic flow components: A comparison of areas MT and MST. Journal of Neurophysiology, 71, 1597–1626.

Logothetis, N. K., & Sheinberg, D. L. (1996). Visual object recognition. Annual Review of Neuroscience, 19, 577–621.

Movshon, J. A., Adelson, E. H., Gizzi, M. S., & Newsome, W. T. (1985). The analysis of moving visual patterns. In C. Chagas, R. Gattass, & C. Gross (Eds.), Pattern recognition mechanisms (pp. 117–151). New York: Springer-Verlag.
Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341, 52–54.

Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2, 1019–1025.

Rolls, E. T. (1992). Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical visual areas. Philosophical Transactions of the Royal Society, 335, 11–21.

Rolls, E. T. (2000). Functions of the primate temporal lobe cortical visual areas in invariant visual object and face recognition. Neuron, 27, 205–218.

Rolls, E. T. (2006). The representation of information about faces in the temporal and frontal lobes of primates including humans. Neuropsychologia, PMID = 16797609.

Rolls, E. T., & Deco, G. (2002). Computational neuroscience of vision. New York: Oxford University Press.

Rolls, E. T., & Milward, T. (2000). A model of invariant object recognition in the visual system: Learning rules, activation functions, lateral inhibition, and information-based performance measures. Neural Computation, 12, 2547–2572.

Rolls, E. T., & Stringer, S. M. (2001). Invariant object recognition in the visual system with error correction and temporal difference learning. Network: Computation in Neural Systems, 12, 111–129.

Rolls, E. T., & Tovee, M. J. (1994). Processing speed in the cerebral cortex and the neurophysiology of visual masking. Proceedings of the Royal Society, B, 257, 9–15.

Rolls, E. T., Tovee, M. J., Purcell, D. G., Stewart, A. L., & Azzopardi, P. (1994). The responses of neurons in the temporal cortex of primates, and face identification and detection. Experimental Brain Research, 101, 474–484.

Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. New York: Oxford University Press.

Rolls, E. T., Treves, A., & Tovee, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex. Experimental Brain Research, 114, 149–162.

Rolls, E. T., Treves, A., Tovee, M., & Panzeri, S. (1997). Information in the neuronal representation of individual stimuli in the primate temporal visual cortex. Journal of Computational Neuroscience, 4, 309–333.

Sakata, H., Shibutani, H., Ito, Y., & Tsurugai, K. (1986). Parietal cortical neurons responding to rotary movement of visual space stimulus in space. Experimental Brain Research, 61, 658–663.

Seltzer, B., & Pandya, D. N. (1978). Afferent cortical connections and architectonics of the superior temporal sulcus and surrounding cortex in the rhesus monkey. Brain Research, 149, 1–24.

Sereno, M. I. (1989). Learning the solution to the aperture problem for pattern motion with a Hebb rule. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 468–476). San Mateo, CA: Morgan Kaufmann.

Sereno, M. I., & Sereno, M. E. (1991). Learning to see rotation and dilation with a Hebb rule. In D. Touretzky & R. Lippmann (Eds.), Advances in neural information processing systems, 3 (pp. 320–326). San Mateo, CA: Morgan Kaufmann.

Stringer, S. M., Perry, G., Rolls, E. T., & Proske, J. H. (2006). Learning invariant object recognition in the visual system with continuous transformations. Biological Cybernetics, 94, 128–142.
Stringer, S. M., & Rolls, E. T. (2000). Position invariant recognition in the visual system with cluttered environments. Neural Networks, 13, 305–315.

Stringer, S. M., & Rolls, E. T. (2002). Invariant object recognition in the visual system with novel views of 3D objects. Neural Computation, 14, 2585–2596.

Tanaka, K. (1996). Inferotemporal cortex and object vision. Annual Review of Neuroscience, 19, 109–139.

Ungerleider, L. G., & Mishkin, M. (1982). Two cortical visual systems. In D. J. Ingle, M. A. Goodale, & R. J. W. Mansfield (Eds.), Analysis of visual behavior (pp. 549–586). Cambridge, MA: MIT Press.

Wallis, G., & Bülthoff, H. H. (2001). Effects of temporal association on recognition memory. Proceedings of the National Academy of Sciences, 98, 4800–4804.

Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Progress in Neurobiology, 51, 167–194.

Wurtz, R. H., & Kandel, E. R. (2000). Perception of motion depth and form. In E. R. Kandel, J. H. Schwartz, & T. M. Jessell (Eds.), Principles of neural science (4th ed., pp. 548–571). New York: McGraw-Hill.
Received January 11, 2006; accepted May 15, 2006.
LETTER
Communicated by Mike Arnold
Recurrent Cerebellar Loops Simplify Adaptive Control of Redundant and Nonlinear Motor Systems

John Porrill
[email protected]

Paul Dean
[email protected]

Centre for Signal Processing in Neuroimaging and Systems Neuroscience, Department of Psychology, University of Sheffield, Sheffield S10 2TP, U.K.
We have described elsewhere an adaptive filter model of cerebellar learning in which the cerebellar microcircuit acts to decorrelate motor commands from their sensory consequences (Dean, Porrill, & Stone, 2002). Learning stability required the cerebellar microcircuit to be embedded in a recurrent loop, and this has been shown to lead to a simple and modular adaptive control architecture when applied to the linearized 3D vestibular ocular reflex (Porrill, Dean, & Stone, 2004). Here we investigate the properties of recurrent loop connectivity in the case of redundant and nonlinear motor systems and illustrate them using the example of kinematic control of a simulated two-joint robot arm. We demonstrate that (1) the learning rule does not require unavailable motor error signals or complex neural reference structures to estimate such signals (i.e., it solves the motor error problem) and (2) control of redundant systems is not subject to the nonconvexity problem in which incorrect average motor commands are learned for end-effector positions that can be accessed in more than one arm configuration. These properties suggest a central functional role for the closed cerebellar loops, which have been shown to be ubiquitous in motor systems (e.g., Kelly & Strick, 2003).

1 Introduction

The grace and economy of animal movement suggest that neural control methods are likely to be of interest to robotics. The neural structure particularly associated with coordinated movement is the cerebellum, whose function seems to be the fine-tuning of motor skills by elaborating incomplete or approximate commands issued by higher levels of the motor system (Brindley, 1964). But although the microcircuitry of cerebellar cortex has inspired models for over 30 years (Albus, 1971; Eccles, Ito, & Szentágothai, 1967; Marr, 1969), the results appear to have produced relatively meager benefits for robotics. As Marr himself commented, "In my own case, the cerebellar study . . .
disappointed me, because even if the theory was
correct, it did not enlighten one about the motor system—it did not, for example, tell one how to go about programming a mechanical arm” (Marr, 1982, p. 15). Part of the problem is that the control function of any individual region of the cerebellum depends not only on its internal microcircuitry but also on the way it is connected with other parts of the motor system (Lisberger, 1998), and in many cases the details of this connectivity are still not well understood. However, recent anatomical investigations have suggested that there may be a common feature of cerebellar connectivity: a recurrent architecture. It appears as if individual regions of the cerebellum (1) receive a copy of the low-level motor commands that constitute system output and (2) apply their corrections to the high-level command. These “multiple closed-loop circuits represent a fundamental architectural feature of cerebro-cerebellar interactions,” and it is a challenge to future studies to “determine the computations that are supported by this architecture” (Kelly & Strick, 2003). Determining these computations might also throw light on how biological control methods using the cerebellum might be of use to robotics. Our initial attempts to address this issue considered adaptation of the angular vestibular-ocular reflex (aVOR), a classic preparation for studying basic cerebellar function (Boyden, Katoh, & Raymond, 2004; Carpenter, 1988; Ito, 1970). In this reflex, a head-rotation signal from vestibular sensors is used to counter-rotate the eye to maintain stability of the retinal image. Visual processing delays of ∼100 ms mean that movement of the retinal image (a retinal slip) is used as an error signal for calibration of the aVOR rather than for online control, and this calibration requires the floccular region of the cerebellum. 
In our simulations of the aVOR, the characteristics of the oculomotor plant (eye muscles plus orbital tissue) were altered, and the flocculus (modeled as an adaptive filter; see section 2) was required to learn the appropriate plant compensation. Stable and robust learning was achieved by connecting the filter so that it decorrelated the retinal slip signal from a copy of the motor command sent to the eye muscles (Dean, Porrill, & Stone, 2002; Porrill, Dean, & Stone, 2004) and sent its output to join the vestibular input. This arrangement constitutes an example of the recurrent architecture described above, and ensuring the stability of the decorrelation control algorithm is therefore a candidate for its computational role. Here we extend these findings to derive some theoretical properties of the recurrent cerebellar architecture, which indicate that it may play a central role in simplifying the adaptive control of nonlinear and redundant biological motor systems. These properties will be illustrated by comparison of forward and recurrent architectures in simulated calibration of the inverse kinematic control of a two-degree-of-freedom (2dof) robot arm. Part of this work has been reported in abstract form (Porrill & Dean, 2004).
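The recurrent decorrelation arrangement can be caricatured in a few lines with a static, linear toy model. Everything below is a deliberate simplification for illustration rather than the model of the cited papers: the brainstem pathway B and the oculomotor plant are reduced to single gains, the recurrent loop is solved in closed form, and the gain values, learning rate, and the sign convention for retinal slip are our assumptions:

```python
import random

# A static, linear caricature of the recurrent decorrelation loop.
# All gains and the learning rate are illustrative, not from the paper.
B_GAIN = 0.8   # fixed brainstem pathway gain (the element B)
PLANT = 0.7    # miscalibrated oculomotor plant gain to be compensated

def closed_loop_command(h, w):
    """Motor command of the recurrent loop m = B_GAIN * (h + w * m),
    solved in closed form for this static case (requires |B_GAIN*w| < 1)."""
    return B_GAIN * h / (1.0 - B_GAIN * w)

def train(trials=3000, beta=0.01, seed=0):
    """Decorrelation learning: the filter weight w acts on a copy of the
    motor command and is updated with delta_w = -beta * slip * command,
    driving the correlation between retinal slip and the command copy
    to zero."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(trials):
        h = rng.uniform(0.5, 1.5)        # head velocity on this trial
        m = closed_loop_command(h, w)    # static closed-loop command
        slip = PLANT * m - h             # retinal slip (sign chosen for stability)
        w -= beta * slip * m             # decorrelation update
    return w

w = train()
# At the fixed point, slip is zero for every head velocity: the loop now
# compensates the plant (PLANT * m = h) without ever having been given a
# motor error signal.
```

Starting from w = 0, the weight settles at the value that nulls retinal slip, compensating the plant miscalibration; only the sensory slip signal and a copy of the motor command are used.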
Figure 1: (a) Schematic representation of the cerebellar microcircuit. MF inputs u_k are analyzed in the granule cell layer (only one MF input is shown; in reality, GCs receive direct input from multiple MFs and indirect input from many more via recurrent connections, not shown here, to PFs), and the GC output signals p_j are distributed along the PFs. The ith PC makes contact with many PFs that drive its simple spike output v_i. The cell also has a CF input e_i, which is assumed in Marr-Albus models to act as a teaching signal for the synaptic weights w_ij. (b) Interpretation of the cerebellar microcircuit as an adaptive filter. Granule cell processing is modeled as a filter p_i = G_i(u_1, u_2, . . .), and each PC outputs a weighted sum of its PF inputs. (Figure modified from Dean, Porrill, & Stone, 2004.)
2 The Adaptive Filter Model

We will use what is perhaps the simplest implementation of the Marr-Albus architecture: the adaptive filter (Fujita, 1982). The microcircuit based around a Purkinje cell (PC) and its computational interpretation as an adaptive filter model are shown schematically in Figure 1.
The mossy fibers (MFs) carry input signals (u_1, u_2, . . .) that are processed in the granule cell layer to produce the signals (p_1, p_2, . . .) carried on the parallel fibers (PFs). This process is interpreted as a massive expansion recoding of the mossy fiber inputs of the form

$p_j = G_j(u_1, u_2, \ldots)$.   (2.1)
In the nondynamic problems to be considered here, the G_j are functions of the current values of (u_1, u_2, . . .). Dynamic problems can be tackled by allowing the p_j to encode aspects of the past history of the u_i; in that case, the G_j are required to be more general causal functionals such as tapped delay lines. We are interested in the output of a subset of Purkinje cells that take their inputs from common parallel fibers. The output v_i of the ith such PC is modeled as a weighted linear combination of its PF inputs,

$v_i = \sum_j w_{ij}\, p_j$,   (2.2)
where the coefficient w_ij is the synaptic weight of the jth PF on the ith PC. The combined transformation from the vector of inputs u = (u_1, u_2, . . .) to the vector of outputs v = (v_1, v_2, . . .) can then be written conveniently as a vector equation,

$v = C(u) = \sum_{ij} w_{ij}\, G_{ij}(u)$,   (2.3)
where G_ij(u) = (0, . . . , G_j(u), . . . , 0) is the vector with the jth parallel fiber signal as its ith entry and zeros elsewhere. In Marr-Albus models, the climbing fiber input e_i to the Purkinje cell is assumed to act as a teaching signal. The qualitative properties of long-term depression (LTD) and long-term potentiation (LTP) at PF/PC synapses are consistent with the anti-Hebbian heterosynaptic covariance learning rule (Sejnowski, 1977),

$\delta w_{ij} = -\beta\, e_i\, p_j$,   (2.4)
which is identical in form to the least mean square (LMS) learning rule of adaptive control theory (Widrow & Stearns, 1985). Note that in all the above formulas, neural signals are assumed to be coded as firing rates relative to a tonic firing rate, and hence can take both positive and negative values.

3 Example Problem: Learning Inverse Kinematics

We will show that recurrent loops can simplify the adaptive control of nonlinear and redundant motor systems. We derive the results for a problem
Figure 2: (a) Geometry of the planar, two-degree-of-freedom robot arm. Motor commands (m_1, m_2) specify joint angles as shown. The position (x_1, x_2) of the end effector is specified in Cartesian coordinates in arbitrary units. (b) Example plant compensation problem. The arm controller is initially calibrated for the arm lengths l_1 = 1.1, l_2 = 1.8 of the gray arm. This arm is shown reaching accurately to a point on the gray polar grid covering the work space. When used with the black arm (lengths l_1 = 1, l_2 = 2), this approximate controller reaches to points on the distorted (black) grid. The task of the cerebellum is to compensate for this miscalibration; reaching errors (δx_1, δx_2) are provided in Cartesian coordinates.
presenting both of these difficulties: learning robot inverse kinematics. This problem is of general interest, since it is equivalent to the generic problem of learning right inverse mappings. The theoretical results obtained in the general case will be illustrated throughout by application to the inverse kinematic problem for the 2dof robot arm shown in Figure 2, where the geometry is particularly easy to intuit. Although this is a very simple system, it exhibits strong nonlinearity and a discrete redundancy.

We begin by recalling some terminology. The forward kinematics or plant model of a robot arm is the mapping x = P(m) from motor commands m = (m_1, . . . , m_M) to the end-effector positions x = (x_1, . . . , x_S). An inverse kinematics is a mapping m = Q(x) that calculates the motor commands corresponding to a given end-effector position. Since this implies that x = P(Q(x)), an inverse kinematics is a right inverse P^{-1} for P. For redundant systems, a given position can be reached using more than one choice of motor command; in this case, we will use the notation Q = P^{-1} to denote a particular choice of inverse kinematics. For the 2dof arm, there are only two choices of motor command for a given end-effector position. This is an example of a discrete redundancy. More complex systems can have continuous redundancies in which there is a
Figure 3: Forward architecture. The desired position input x_d produces a motor command m = B(x_d) + C(x_d), which is the sum of contributions from a fixed element B and an adaptive cerebellar element C. This input to the plant P produces the output position x. Training the weights of C requires proximal or motor error δm rather than distal or sensory error δx = x − x_d. Hence, in this forward architecture, motor error must be estimated by backpropagation via a reference matrix R ≈ ∂P^{-1}/∂x. This requires detailed prior knowledge of the motor plant.
continuum of motor commands available for each end-effector position. These are sometimes called redundant degrees of freedom. The cerebellum has been described as a "repair shop," compensating for miscalibration (due to damage, fatigue, or development, for example) of the motor plant (Robinson, 1974). It is this adaptive plant compensation problem that we will model here; that is, we assume that an approximate inverse kinematics controller B ≈ P^{-1} is available to the system and that the function of the cerebellum is to supplement this controller to produce more accurate movements. Learning is supervised by making the reaching error,

$\delta x = x - x_d = P(B(x_d)) - x_d$,   (3.1)
available to the learning system, where x_d is the target (desired) position. We call this quantity sensory error since it can be measured by an appropriate sensor and to distinguish it from motor error, which will be defined later. In the 2dof robot arm example shown in Figure 2b, the approximate controller B is the inverse kinematics for a robot with arm lengths that are ±10% in error. Thus, when required to reach to positions on the polar grid shown in Figure 2, the arm actually moves to positions on the overlaid distorted grid.

4 The Forward Learning Architecture

To highlight the properties of recurrent connectivity, we begin by considering the problems encountered in implementing a learning rule for the alternative forward connectivity shown schematically in Figure 3. The motor command to the plant is produced by an open-loop filter that is the sum
B + C of the fixed element B and the adaptive cerebellar component C; this combination will be an inverse kinematics B + C = P^{-1} if C takes the value

$C^* = \sum_{ij} w^*_{ij}\, G_{ij} = P^{-1} - B$.   (4.1)
We assume that the granule cell basis functions G_ij satisfy the matching condition, that is, that synaptic weights w^*_ij can be found such that the above equation holds for the range of P^{-1} and B under consideration. To obtain a learning rule similar to the covariance rule (see equation 2.4), we introduce the concept of motor error (Gomi & Kawato, 1992). Motor error δm is the error in motor command responsible for the sensory error δx. Minimizing expected square motor error,

$E_m = \frac{1}{2} \delta m^2 = \frac{1}{2}\, \delta m^t\, \delta m$,   (4.2)
(where a superscript t denotes the matrix transpose) leads to a simple learning rule because motor error is linearly related to synaptic weight error,

$\delta m = C(x_d) - C^*(x_d) = \sum_{ij} (w_{ij} - w^*_{ij})\, G_{ij}(x_d)$.   (4.3)
Using this expression, the gradient of expected square motor error is

$\frac{\partial E_m}{\partial w_{ij}} = \delta m^t \frac{\partial\, \delta m}{\partial w_{ij}} = \delta m^t\, G_{ij}(x_d) = \delta m_i\, p_j$,   (4.4)
giving the gradient descent learning rule,

$\delta w_{ij} = -\beta\, \delta m_i\, p_j$,   (4.5)
where β is a small, positive constant. If we label Purkinje cells such that the ith PC output contributes to the ith component of motor error, then comparison with the covariance learning rule (see equation 2.4) shows that the teaching signal e_i provided on the climbing fiber input to the ith Purkinje cell must be the ith component of motor error,

$e_i = \delta m_i$.   (4.6)
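Equations 4.3 to 4.6, together with the adaptive filter of section 2, can be combined into a short runnable sketch. The Gaussian radial-basis expansion stands in for the granule-cell recoding (the appendix describes an RBF implementation, but the centers, radius, learning rate, and toy target below are our illustrative choices), and exact motor error is assumed to be available:

```python
import math, random

def rbf_basis(x, centers, radius=0.3):
    """Granule-cell expansion recoding p_j = G_j(x) (equation 2.1),
    realized as Gaussian radial basis functions. Centers, radius, and
    learning rate in this sketch are illustrative assumptions."""
    return [math.exp(-((x[0] - cx) ** 2 + (x[1] - cy) ** 2) / (2 * radius ** 2))
            for cx, cy in centers]

def purkinje_output(W, p):
    """v_i = sum_j w_ij p_j (equation 2.2)."""
    return [sum(wij * pj for wij, pj in zip(row, p)) for row in W]

def lms_step(W, p, delta_m, beta=0.2):
    """Gradient descent delta_w_ij = -beta * delta_m_i * p_j (equation 4.5),
    with the climbing-fibre teaching signal e_i = delta_m_i (equation 4.6)."""
    for i, dm in enumerate(delta_m):
        W[i] = [wij - beta * dm * pj for wij, pj in zip(W[i], p)]

# Demonstration: when exact motor error delta_m = C(x_d) - C*(x_d)
# (equation 4.3) is available, C converges toward the target C*.
rng = random.Random(0)
centers = [(0.5 * i, 0.5 * j) for i in range(3) for j in range(3)]
W_star = [[rng.uniform(-1.0, 1.0) for _ in centers] for _ in range(2)]
W = [[0.0] * len(centers) for _ in range(2)]

def mean_error(W):
    pts = [(0.1 * i, 0.1 * j) for i in range(11) for j in range(11)]
    errs = []
    for x in pts:
        p = rbf_basis(x, centers)
        errs.extend(abs(v - vs) for v, vs in
                    zip(purkinje_output(W, p), purkinje_output(W_star, p)))
    return sum(errs) / len(errs)

before = mean_error(W)
for _ in range(3000):
    x_d = (rng.random(), rng.random())
    p = rbf_basis(x_d, centers)
    delta_m = [v - vs for v, vs in
               zip(purkinje_output(W, p), purkinje_output(W_star, p))]
    lms_step(W, p, delta_m)
after = mean_error(W)
```

With exact motor error the LMS rule converges smoothly; the difficulty the text turns to next is that this teaching signal is not observable.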
This apparently simple prescription is complicated by the fact that motor error is not in itself an observable quantity. It is a derived quantity given by the equation

$\delta m = P^{-1}(x) - P^{-1}(x_d)$.   (4.7)
This leads to an obvious circularity in that the rule for learning inverse kinematics requires prior knowledge of that same inverse kinematics. This circularity can be circumvented to some extent by supposing that all errors are small so that

$\delta m \approx \frac{\partial P^{-1}}{\partial x}\, \delta x$,   (4.8)
and then replacing the unknown Jacobian ∂P^{-1}/∂x in this error backpropagation rule by a fixed approximation,

$R \approx \frac{\partial P^{-1}}{\partial x}$.   (4.9)
If R were exact, then, if J is the forward Jacobian J = ∂P/∂m, the product JR would be the identity matrix. To ensure stable learning, the approximate R must estimate motor error correctly up to a strict positive realness (SPR) condition, which in this static case requires that the symmetric part of the matrix JR be positive definite. The hypothetical neural structures required to implement this transformation R and recover motor error from observable sensory error,

$\delta m \approx R\, \delta x \quad \text{or} \quad \delta m_i \approx \sum_k R_{ik}\, \delta x_k$,   (4.10)
have been called reference structures (Gomi & Kawato, 1992), so we will call R the reference matrix.

5 The Motor Error Problem

We refer to the requirement that the climbing fibers carry the unobservable motor error signal rather than the observed sensory error signal as the motor error problem. Although the forward architecture has been applied successfully to a number of real and simulated control tasks (notably in the form of the feedback error learning architecture; Gomi & Kawato, 1992, 1993), we will argue here that for generic biological control systems, the complexity of the neural reference structures it requires makes the forward architecture implausible. It is clear that the complexity of the reference structure is multiplicative in the dimension of the control and sensor space. For a task in which M muscles control the output of N sensors, there are MN entries in the reference matrix R. For example, in our 2dof robot arm problem, four real numbers must be hard-wired to guarantee learning. For more realistic motor tasks in biological systems (such as reaching while preserving balance), values of 100 or more for MN would not be unreasonable.
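For the 2 x 2 case of the arm example, the reference-matrix estimate of motor error and the static SPR condition can be written out explicitly (a sketch; the helper names are ours, and positive definiteness is checked via the leading principal minors of the symmetric part):

```python
def estimate_motor_error(R, dx):
    """Hard-wired backpropagation of sensory error through the reference
    matrix: delta_m ~= R * delta_x (equation 4.10)."""
    return [sum(R_ik * dx_k for R_ik, dx_k in zip(row, dx)) for row in R]

def spr_condition_holds(J, R):
    """Static SPR condition: the symmetric part of J R must be positive
    definite (2x2 case, tested via its leading principal minors)."""
    JR = [[sum(J[i][k] * R[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    s01 = 0.5 * (JR[0][1] + JR[1][0])
    return JR[0][0] > 0 and JR[0][0] * JR[1][1] - s01 * s01 > 0
```

If R were the exact inverse Jacobian, JR would be the identity and the condition holds trivially; flipping the sign of one row of R (i.e., mis-estimating the sign of one motor-error component) violates it.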
178
J. Porrill and P. Dean
Figure 4: (a) The dots show the arrangement of RBF centers, and the circle shows the receptive field radius for the forward architecture. This configuration leads to an accurate fit if exact motor error is provided to the learning algorithm (not shown). (b) A snapshot of performance during learning that illustrates the need for multiple reference structures in nonlinear problems. The reference structure R is chosen as the exact inverse Jacobian at the grid point marked with a small circle. Although the learned (black) grid overlays the exact (gray) grid more accurately in the neighborhood of O (compare with Figure 3, bottom), performance has clearly deteriorated over the remainder of the work space. Learning rate β = 0.0005. The effect of reducing the learning rate is to delay but not abolish the divergence. (c) An illustration of the redundancy or nonconvexity problem. The arm controller is set up to consistently choose the configurations shown by the solid and dotted arms when reaching into the top or bottom half of the work space. However, when reaching to points in the shaded horizontal sector, the arm retains the configuration used for the previous target. Hence, arm configuration is chosen effectively randomly in this sector, and the system fails to learn. Learning rate is as above. (Exact motor errors were used in this part of the simulation.)
In fact, this analysis understates the problem, since biological motor systems are often nonlinear, and hence the reference structures are valid only locally. This behavior will be illustrated for the 2dof robot arm calibration problem described above (see Figure 2). Details of the radial basis function (RBF) implementation are given in the appendix. Figure 4b shows a snapshot of performance during learning. In this example, the reference matrix R was chosen to be the exact inverse Jacobian at the point O. Clearly this choice satisfies the SPR condition in a neighborhood of O, and hence in this region, where R provides good estimates of motor error, reaching accuracy initially improves. However, outside this region, the sign of a component of motor error is wrongly estimated, and errors in this component diverge catastrophically. This instability will eventually spread to the point O itself because of overlap between adjacent RBFs. To ensure stable learning in this example would require at least three reference structures valid on three different sectors of the work space.
Recurrent Cerebellar Loops Simplify Adaptive Control
179
This requires 3 × 4 = 12 prespecified parameters (not including the extra parameters needed to specify the region of validity of each reference structure). For a general inverse kinematics problem, we must specify MNK parameters, where K is the number of regions required to guarantee positive definiteness of JR in each region. Finally, we note that in the dynamic case, learning must be dynamically stable. For example, in the linear case, the required reference structure is a matrix of transfer functions R(iω), and an SPR condition must be satisfied by the matrix transfer function J(iω)R(iω) at each frequency, further increasing the complexity of the motor error problem.

6 The Redundancy Problem

Most artificial and biological motor systems are redundant; that is, different motor commands can produce the same output. Such redundancy leads to a classic supervised learning problem called the nonconvexity problem: if the training data for the learning system associate multiple motor commands with a given position, and if this set of motor commands is nonconvex, then the system will learn an inappropriate weighted-average motor command at that position. The forward architecture shown in Figure 3 is subject to the redundancy problem whenever the controller B is allowed to produce different output commands for the same input. This type of behavior is common in motor tasks; for example, a robot arm configuration might be determined by a combination of convenience and movement history. It is illustrated for the 2dof arm in Figure 4c. In this experiment, one arm configuration is used in the top half of the work space and the opposite configuration in the bottom half. However, when reaching into the central sector, the controller reuses the configuration from the previous position (this kind of behavior is common when avoiding work space obstacles), and hence the configuration chosen in the central sector is essentially random.
While learning succeeds in the top and bottom sectors of the work space, the failure to learn in the central sector is evident from the distorted (black) grid of learned positions. This nonconvexity problem can be avoided by providing auxiliary variables ξ to the learning component such that the combination (x, ξ) unambiguously specifies the motor state of the system (although identifying such variables in practice can be nontrivial). For example, the discrete redundancy found in the 2dof arm requires a discrete variable ξ = ±1 to identify the particular arm configuration to be used. This solution is not particularly satisfactory, since it breaks modularity, forcing a controller whose task is simply to reach to a given position to take responsibility for choosing the required arm configuration. More interesting from our point of view, this solution also increases the complexity of the reference structure, since the number K of reference
Figure 5: Recurrent architecture. Here the motor command generated by the fixed element B is the input to the adaptive element C. The output of C is then used as a correction to the input to B. This loop implements Brindley’s (1964) influential suggestion that the cerebellum elaborates commands coming from higher-level structures in the context of information about the current state of the organism. In this recurrent architecture, the sensor error δx becomes effectively proximal to C and, as demonstrated in the text, can be used directly as a teaching signal.
matrices required must be further increased to reflect the dependence on the extra parameters ξ. For example, to learn correctly in the 2dof arm example of Figure 4c, the two different arm configurations clearly require different reference matrices; hence, allowing both configurations over the whole work space increases the number of hard-wired parameters by a factor of 2, to 2 × 12 = 24. Even in this simple example, the complexity of the reference structure required to support learning is beginning to approach that of the structure to be learned. Note that the situation is not helped by adding a conventional error feedback loop to the adaptive forward architecture (as in the feedback error learning model). This loop also requires a motor error signal, and since error is available only in sensory coordinates, different reference structures are required for different arm configurations.

7 The Recurrent Architecture

As we noted in section 1, the forward architecture just described ignores a major feature of the circuitry in which the cerebellar microcircuit is embedded: its recurrent connectivity. An alternative recurrent architecture reflecting this connectivity is shown schematically in Figure 5. Although the analysis of recurrent networks and their learning rules can be very complex (e.g., Pearlmutter, 1995), this architecture is an important exception; in particular, we will show that no backpropagation step is required in the learning rule. The analysis proceeds in two stages. First, a plausible cerebellar learning rule is derived by the familiar method of gradient descent. This derivation does not provide a rigorous proof of convergence because it requires a small-weight-error approximation. Second, a Lyapunov function
for the learning rule is derived and used to demonstrate convergence of the learning rule without the need for the small-weight-error approximation. We are able to simplify the treatment because we deal only with kinematics. Hence, we can idealize the recurrent loop as an algebraic loop in which the output m of the closed loop shown in Figure 5 satisfies the implicit equation

m = B(xd + C(m)). (7.1)
Clearly control is possible only if the fixed element B has some left inverse B⁻¹. Applying this inverse to equation 7.1, we find that the desired position input xd is related to the actual motor command m by the equation

xd = B⁻¹(m) − C(m). (7.2)
Again, we assume the matching condition, so that weights exist for which C takes the exact value C*. For this choice of C, the desired position will equal the actual output position, that is,

x = P(m) = B⁻¹(m) − C*(m), (7.3)

from which we derive the following expression for the desired cerebellar filter:

C* = B⁻¹ − P. (7.4)
Subtracting equation 7.2 from equation 7.3, we find that δx = x − xd = C(m) − C*(m), and substituting C = Σ_ij w_ij G_ij and C* = Σ_ij w*_ij G_ij gives the following simple relationship between sensory error and synaptic weight error:

δx = C(m) − C*(m) = Σ_ij (w_ij − w*_ij) G_ij(m). (7.5)
(This is analogous to equation 4.3, relating motor error and synaptic weight error in the forward architecture.) Although this equation is at first sight linear in the weights w_ij, this appearance is misleading, since the argument m also depends implicitly on the w_ij. However, the appearance of linearity is close enough to the truth to allow us to derive a simple learning rule. If weight errors are small, the second term in the derivative,

∂δx/∂w_ij = G_ij(m) + Σ_kl (w_kl − w*_kl) ∂G_kl(m)/∂w_ij, (7.6)

can be neglected to give the approximation

∂δx/∂w_ij ≈ G_ij(m). (7.7)
Using this result, we can derive an approximate gradient-descent learning rule by minimizing expected square sensory error (rather than motor error, as in the previous section). Defining

E_s = (1/2)|δx|² = (1/2) δx^t δx, (7.8)

its gradient is

∂E_s/∂w_ij = δx^t ∂δx/∂w_ij ≈ δx^t G_ij(m) = δx_i p_j, (7.9)

leading to the approximate gradient-descent learning rule

δw_ij = −β δx_i p_j. (7.10)
In this learning rule, no Jacobian appears, and hence no reference structures embodying prior knowledge of plant parameters are required. Comparison with the covariance learning rule (see equation 2.4) shows that the teaching signal required on the climbing fibers in the recurrent architecture is now the sensory error:

e_i = δx_i. (7.11)
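Rule 7.10, with the directly observable sensory error of equation 7.11 as teaching signal, can be sketched in a few lines (a minimal illustration; the function name and list-of-lists weight layout are ours, not the paper's):

```python
# Minimal sketch of the learning rule of equation 7.10: each weight is
# moved by the product of the observable sensory error component
# delta_x[i] and the parallel fiber signal p[j], scaled by a learning
# rate beta. Note that no Jacobian or reference structure appears.

def update_weights(w, delta_x, p, beta):
    """One application of delta_w[i][j] = -beta * delta_x[i] * p[j]."""
    return [[w[i][j] - beta * delta_x[i] * p[j]
             for j in range(len(p))]
            for i in range(len(delta_x))]
```

For example, `update_weights([[0.0, 0.0]], [1.0], [0.5, 2.0], 0.1)` moves both weights against the error, in proportion to the corresponding parallel fiber signal.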
Although this local learning rule has been derived as an approximate gradient-descent rule, its properties are more easily determined by a Lyapunov analysis; this analysis is simplified if we work with the continuous update form of the learning rule:

ẇ_ij = −β δx_i p_j. (7.12)
As a Lyapunov function, we use the sum square synaptic weight error,

V = (1/2) Σ_ij (w_ij − w*_ij)², (7.13)

which has time derivative

V̇ = Σ_ij ẇ_ij (w_ij − w*_ij) = −β Σ_i δx_i Σ_j (w_ij − w*_ij) p_j. (7.14)
Substituting the expression for sensory error (equation 7.5) into this formula, we find that

V̇ = −β |δx|². (7.15)
Since its derivative is nonpositive, V is a Lyapunov function: a positive function that decreases over time as learning proceeds. It is unnecessary to appeal to the Lyapunov theorems to determine the behavior of this system sufficiently for practical purposes. The equation above shows that over a fixed period of time, the sum square synaptic weight error decreases by an amount proportional to the mean square sensory error; hence, the system can make RMS sensory errors above a certain magnitude only for a limited time, since V would otherwise become negative, which is impossible.

8 The Motor Error Problem Is Solved

It is clear that this architecture solves the motor error problem in that there is no longer a need for unavailable motor error signals on the climbing fibers or for complex reference structures to estimate them. Figure 6a shows the performance of the recurrent architecture on the 2dof arm problem described in section 3 (see the appendix for implementation details). It can be seen that the arm now recalibrates successfully over the whole work space. Figure 6c (top) shows the decrease in position error during training; this decrease is stochastic in nature, as would be expected from the approximate stochastic gradient-descent rule. In contrast, Figure 6c (bottom) shows the monotonic decrease of sum square weight error V during training predicted by the Lyapunov analysis, with the greatest decreases taking place where large errors are made.

9 The Nonconvexity Problem Is Solved

In the recurrent architecture, the nonconvexity problem is easily solved because it does not arise. The adaptive element C takes motor commands as input, and its task is to learn to associate them with a corrective output. Since the motor command completely determines the current configuration, there is no ambiguity to be resolved.
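The behavior described in sections 7 to 9 can be illustrated with a toy scalar version of the recurrent loop. This is our own construction, not the paper's 2dof simulation: the plant gain k, the nominal brainstem gain k0, and the learning rate are arbitrary values chosen for illustration.

```python
import random

# Toy scalar illustration of the recurrent architecture of Figure 5.
# The plant is P(m) = k*m; the fixed brainstem controller B(x) = x/k0
# inverts a nominal plant with the wrong gain k0; the adaptive element
# is C(m) = w*m with a single parallel fiber signal p = m. The loop
# m = B(xd + C(m)) then solves to m = xd/(k0 - w), and the sensory
# error delta_x = P(m) - xd trains w directly via rule 7.10. The
# optimal weight is w* = k0 - k.

def train(k=1.2, k0=1.0, beta=0.05, steps=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    errors = []
    for _ in range(steps):
        xd = rng.uniform(0.5, 1.5)       # random reaching target
        m = xd / (k0 - w)                # closed recurrent loop
        delta_x = k * m - xd             # observable sensory error
        w -= beta * delta_x * m          # rule 7.10 with p = m
        errors.append(abs(delta_x))
    return w, errors

w_final, errs = train()
# w converges toward w* = k0 - k = -0.2, and sensory error shrinks,
# consistent with the Lyapunov argument: (w - w*)^2 decreases fastest
# when the error is largest.
```

No Jacobian of the plant appears anywhere in the update, which is the point of the architecture.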
Although we illustrate this property below for the discrete redundancy of the 2dof arm, the reasoning above clearly applies to both discrete and continuous redundancies. This property confers remarkable modularity on the recurrent architecture. It means that the connectivity of a cerebellar controller is determined solely by the task and is independent of low-level details such as the particular algorithm chosen for resolving joint-angle redundancy or how, for example, reciprocal innervation allocates the tension in antagonistic muscle pairs.
Figure 6: (a) This panel illustrates successful recalibration by the recurrent architecture. After training, the learned (black) grid overlays the exact (gray) grid over the whole work space (compare with the initial performance in Figure 2). Learning rate β = 0.05. (b) The grid shows the region of motor space corresponding to the robot work space. The dots and the circle indicate RBF centers and the receptive field radius. (c) The two graphs illustrate the stochastic decrease in squared position error (top; same units of length as Figure 2) and the associated monotonic decrease in sum square synaptic weight error (bottom), as predicted by theory. The behavior at the positions marked by arrows illustrates the fact that a faster decrease in sum square weight error is associated with larger position error, as predicted by the Lyapunov equation 7.15.
This property is illustrated in Figure 7a, where the recurrent architecture is applied to the redundant reaching problem described in section 6. It can be seen that learning is now satisfactory over the whole work space. The only modification to the net for the task of Figure 7 is the need for RBF centers covering points in motor space associated with the alternative arm configuration. This grid of RBF centers is shown in Figure 7b. The situation would be only slightly different for a continuous redundancy; in this case, new RBF centers would be needed to cover all points in motor command space accessible by the redundant degrees of freedom.

10 Discussion

As argued in section 1, one reason that cerebellar-inspired models have been of modest use to robotics is that cerebellar connectivity is often poorly understood (Lisberger, 1998). The general idea that identical cerebellar microcircuits can be wired up to perform a wide range of motor and
Figure 7: (a) This panel shows that recurrent architecture solves the redundant reaching problem example described in section 6 for which forward architecture fails (compare with Figure 4). The learned (black) grid overlays the exact (gray) grid accurately over the whole work space, including the horizontal 60 degree sector in which both arm configurations are used. Learning rate β = 0.05. (b) This panel shows the separate grids in motor space associated with the two arm configurations. The grid of dots and the circle indicate the RBF centers and receptive field radius. The dark gray regions highlight motor commands used to generate the arm configurations used consistently in the top and bottom sectors of the work space, and the light gray regions highlight motor commands that generate the two alternative configurations used in the central sector of the work space (the unshaded regions are not used). Since the arm configurations that are ambiguous in task space are represented in separate regions of motor space, they are learned independently in the recurrent architecture.
cognitive functions (Ito, 1984, 1997) is well appreciated, but identifying specific instances has proved difficult. Here we have explored the computational capacities of the cerebellar microcircuit embedded in a recurrent architecture for adaptive feedforward control of nonlinear redundant systems, as exemplified by a simulated 2dof robot arm. We have shown that the architecture solves the distal error problem, copes naturally with the redundancy/convexity problem, and gives enhanced modularity. We now compare it with alternative architectures from both computational and biological perspectives.

10.1 Computational Control. The distal error problem arises whenever output errors are used to train internal parameters (Jordan, 1996; Jordan & Wolpert, 2000). It is a fundamental obstacle in neural learning systems, and the consequent lack of biological plausibility of learning rules in neural net supervised learning algorithms has become a cliché. There have been two main previous approaches to solving the distal error problem: (1) continue
to use standard architectures and hypothesize the existence of structures implementing the required error backpropagation schemes, or (2) look for those special architectures in which output errors are themselves sufficient for training. The forward learning architecture (see Figure 3) appears poorly suited for solving the distal error problem. Considerable ingenuity has been expended on the feedback error learning scheme developed by Kawato and coworkers (Gomi & Kawato, 1992, 1993) (for a recent rigorous treatment, see Nakanishi & Schaal, 2004) in order to rescue this architecture, but even so, substantial difficulties remain. In feedback error learning, the adaptive component is embedded in a feedback controller so that the estimated motor error δm^est is used as both a training signal and a feedback error term. As we have noted, feedback error learning imposes SPR conditions on the accuracy of the motor error estimate and hence requires complex reference structures for generic redundant and nonlinear systems. There have been theoretical attempts to remove the SPR condition. For example, Miyamura and Kimura (2002) avoid the necessity for SPR at the cost of requiring large gains in the conventional error feedback loop; this is unacceptable in autonomous and biological systems, since one of the primary reasons for using adaptive control is to avoid the destabilizing effect of large feedback gains given inevitable feedback delays. Despite the difficulties we have noted here, feedback error learning has been usefully applied to online learning by autonomous robots in numerous contexts (e.g., Dean, Mayhew, Thacker, & Langdon, 1991; Mayhew, Zheng, & Cornell, 1992). It is clear that feedback error learning remains a useful approach for problems in which simplifying features of the motor plant mean that the reference structures are easily estimated. We have presented a general architecture for tracking control in which output error can be used directly for training.
Other architectures of this type in the literature have been designed for specific control problems. For example, the adaptive scheme for control of robot manipulators proposed by Slotine and Li (1989) relies on special features of the problem of controlling joint angles using joint torques. We note also the adaptive schemes for particular single-input–single-output nonlinear systems considered by Patino and Liu (2000) and Nakanishi and Schaal (2004). Although none of these architectures tackles the generic problems of nonlinearity and redundancy considered here, it is interesting to note that they also use recurrent architectures in an essential way, supporting the idea that recurrent connectivity may play a fundamental role in simplifying biological motor control.
10.2 Biological Plausibility. From a biological perspective, the recurrent architecture appears more plausible in the context of plant compensation than the forward architecture, with its requirements for feedback error learning, for three reasons.
Recurrent Cerebellar Loops Simplify Adaptive Control
187
First, as pointed out in section 1, anatomical evidence indicates that the recurrent architecture is a feature of many cerebellar microzones. In addition, where it is available, electrophysiological evidence has specifically identified efferent-copy information as part of the mossy fiber input to particular regions of the cerebellum. Important examples are (1) the primate flocculus and ventral paraflocculus, responsible for adaptation of the vestibulo-ocular reflex (VOR), where extensive recordings have shown that about three-quarters of their mossy fiber inputs carry eye-movement-related signals (Miles, Fuller, Braitman, & Dow, 1980); (2) the oculomotor vermis, responsible for saccadic adaptation, where about 25% of mossy fibers show short-latency burst firing in association with saccades that closely resembles the activity of excitatory burst neurons in the paramedian pontine reticular formation (Ohtsuka & Noda, 1992); and (3) regions of cerebellar cortex associated with control of limb movement by the red nucleus, which receive an efferent copy of rubrospinal output to the cerebellum, related to limb position and velocity (Keifer & Houk, 1994). Thus, the defining feature of the recurrent architecture used here appears to be present for all the cerebellar microzones that have been adequately investigated. Second, the recurrent architecture allows use of sensory error signals, that is, the sensory consequences of inaccurate motor commands, which are physically available signals. Previous inability to use such distal error signals has been a fundamental obstacle in neural learning systems, so that architectures such as the one described here, in which distal error can be used directly as a teaching signal, have fundamental importance as basic components of biological learning systems. Hence, if the recurrent architecture were used biologically, we would expect cerebellar climbing fibers to carry sensory information.
In contrast, a central consequence of feedback error learning is the identification of climbing fiber signals with motor error. In the words of Gomi and Kawato (1992), “Our view that the climbing fibers carry control error information, the difference between the instructions and the motor act, is common to most cerebellar motor-learning models; however ours is unique in that this error information is represented in motor-command coordinates.” This view runs into both theoretical and empirical problems. From a theoretical point of view, the use of the motor error signal requires not only new and as yet unidentified neural reference structures to recover motor error from observable sensory errors, but also new and as yet unidentified mechanisms to calibrate these structures. Again in the words of Gomi and Kawato (1992), “The most interesting and challenging theoretical problem [raised by FEL] is setting an appropriate inverse reference model in the feedback controller at the spinal and brainstem levels.” From the point of view of experimental evidence, it appears that many climbing fibers are in fact strongly activated by sensory inputs, such as touch, pain, muscle sense, or, in the case of the VOR, retinal slip (e.g., Apps & Garwicz, 2005; De Zeeuw et al., 1998; Simpson, Wylie, & De Zeeuw, 1996).
In certain cases where nonsensory signals have been identified in climbing-fiber discharge (e.g., Andersson, Garwicz, & Hesslow, 1988; Gibson, Horn, & Pong, 2002), their effect seems to be to emphasize the unpredicted sensory consequences of a movement by gating the expected consequences. Such gated signals are still physically available sensory signals. In other instances, for example, retinal slip in the horizontal VOR, it appears that the effect of nonsensory modulation is to produce a two-valued slip signal that conveys information about the direction of image movement but not its speed (Highstein, Porrill, & Dean, 2005; Simpson, Belton, Suh, & Winkelman, 2002). One line of evidence often used to support feedback error learning concerns ocular following, where externally imposed sliplike movements of the retinal image drive compensatory eye movements. It has been shown that the associated complex spike discharge in the flocculus can be predicted from either slip or eye movement signals (Kobayashi et al., 1998). However, given that a later article states that "only the velocity of the retinal error (retinal slip) was statistically sufficient" (Yamamoto, Kobayashi, Takemura, Kawano, & Kawato, 2002, p. 1558) to reproduce the observed variability in climbing fiber discharge, it is not clear that this evidence decisively supports the existence of a motor error signal, even in these special and ambiguous circumstances. Although we have not considered dynamics here, a further advantage of the recurrent architecture is its capacity to generate long-time-constant signals when plant compensation requires integrator-like processes (Dean et al., 2002; Porrill et al., 2004). This desirable feature of the recurrent architecture was pointed out for eye-position control by Galiana and Outerbridge (1984) and has been incorporated in other models of gaze holding (e.g., Glasauer, 2003).
It is unclear how forward architectures could be used to achieve the observed performance of the neural integrator (Robinson, 1974).

10.3 Further Developments. Our previous work on plant compensation in the VOR (Dean et al., 2002; Porrill et al., 2004) established the capabilities of the recurrent architecture for dynamic control of linear systems. Here we have extended these results to the kinematic control of redundant and nonlinear systems. It is clearly a priority to extend these results to dynamic nonlinear control problems. It is also important to implement the decorrelation-control scheme in a robot, and this work is currently in progress. We have emphasized the limitations of the feedback-error-learning architecture, but its combination of feedback and feedforward controllers does offer considerable advantages for robust online control. It appears that a possibly similar strategy is used biologically to control gaze stability, combining the optokinetic (feedback) and vestibulo-ocular (feedforward) reflexes (Carpenter, 1988). As we will show elsewhere, the recurrent architecture can also be embedded stably and naturally in a conventional feedback loop.
Finally, although the recurrent architecture does not need forward connections for plant compensation (and so they are not shown in Figure 5), such connections are also ubiquitous for cerebellar microzones. We conjecture that once plant compensation has been achieved, a microzone could in principle use these inputs for a wide range of purposes, including sensory calibration and predictive control.

Appendix: Technical Details

Explicit details of the forward and recurrent algorithms for the robotic application are supplied below. This is a vanilla RBF implementation, since our intention is to concentrate on the nature of the error signal and the learning rule. Both accuracy and learning speed could be greatly improved by optimizing the choice of centers and transforming to an optimal basis of receptive fields. The forward kinematics for the 2dof robot arm with arm lengths (l1, l2) is given by

x1 = P1(m1, m2) = l1 cos m1 + l2 cos(m1 + m2)
x2 = P2(m1, m2) = l1 sin m1 + l2 sin(m1 + m2). (A.1)
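Equation A.1 can be sketched directly (the default arm lengths here are illustrative; the paper does not state the values used in its simulations):

```python
import math

# Forward kinematics P of the 2dof arm (equation A.1): shoulder angle
# m1, elbow angle m2, link lengths l1 and l2 (illustrative defaults).

def forward_kinematics(m1, m2, l1=1.0, l2=1.0):
    """Return the end-effector position (x1, x2) for joint angles (m1, m2)."""
    x1 = l1 * math.cos(m1) + l2 * math.cos(m1 + m2)
    x2 = l1 * math.sin(m1) + l2 * math.sin(m1 + m2)
    return x1, x2
```

With both angles zero, the arm lies along the x1-axis at full extension; bending the elbow by π/2 swings the second link upward.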
The brainstem component of the controller is defined as the exact inverse kinematics for a robot with slightly different arm lengths (l1′, l2′):

m1 = B1(x1, x2) = tan⁻¹(x2/x1) − tan⁻¹( l2′ sin ξθ / (l1′ + l2′ cos ξθ) )
m2 = B2(x1, x2) = ξθ, (A.2)

where

θ = cos⁻¹( (x1² + x2² − l1′² − l2′²) / (2 l1′ l2′) ), (A.3)
and the choice of ξ = ±1 determines the arm configuration. In the forward architecture, the parallel fiber signals are given by gaussian RBFs,

p_j = G_j(x1, x2) = exp( −((x1 − c_1j)² + (x2 − c_2j)²) / 2σ² ), (A.4)

with the centers (c_1j, c_2j) chosen on a rectangular grid covering the work space and with σ equal to √2/2 times the maximum grid spacing (see Figure 4). The forward architecture implies the following expression for
the motor commands:

m1 = B1(x1, x2) + Σ_j w_1j p_j
m2 = B2(x1, x2) + Σ_j w_2j p_j. (A.5)
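The gaussian RBF signals of equation A.4 can be sketched as follows. The grid extent and spacing here are illustrative choices (the paper's actual arrangement is shown in Figure 4); σ follows the appendix's prescription of √2/2 times the maximum grid spacing:

```python
import math

# Gaussian RBF parallel fiber signals of equation A.4 on a rectangular
# grid of centers. Grid extent [-1, 1] and 5x5 size are illustrative;
# sigma is sqrt(2)/2 times the grid spacing, as in the appendix.

def make_centers(lo, hi, n):
    """n-by-n rectangular grid of RBF centers covering [lo, hi]^2."""
    step = (hi - lo) / (n - 1)
    return [(lo + i * step, lo + j * step)
            for i in range(n) for j in range(n)], step

def rbf_signals(x1, x2, centers, sigma):
    """Parallel fiber signals p_j = G_j(x1, x2)."""
    return [math.exp(-((x1 - c1) ** 2 + (x2 - c2) ** 2) / (2 * sigma ** 2))
            for (c1, c2) in centers]

centers, step = make_centers(-1.0, 1.0, 5)
sigma = math.sqrt(2) / 2 * step
```

Each signal peaks at 1 when the input sits exactly on a center and decays with squared distance, giving the overlapping receptive fields whose overlap was noted in section 5.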
The unknown weights w_ij were initially set to 0. The two components of sensory error are given by the formula

δx1 = P1(m1, m2) − x1
δx2 = P2(m1, m2) − x2, (A.6)

from which the two components of motor error are estimated using a 2 × 2 reference matrix R:

δm1^est = R11 δx1 + R12 δx2
δm2^est = R21 δx1 + R22 δx2. (A.7)

RBF weights are updated using the learning rule

w_ij^new = w_ij^old − β δm_i^est p_j. (A.8)
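Equations A.6 to A.8 combine into a single forward-architecture update step, sketched below (function name is ours; the identity reference matrix in the usage example is purely illustrative, since in the simulations R is a hand-wired inverse Jacobian):

```python
# Sketch of the forward-architecture update (equations A.6-A.8): the
# observable sensory error delta_x is mapped through a hand-wired 2x2
# reference matrix R to an estimated motor error, which drives the
# RBF weight update.

def forward_update(w, delta_x, p, R, beta):
    """Update 2-row weight matrix w using estimated motor error R @ delta_x."""
    dm_est = [R[0][0] * delta_x[0] + R[0][1] * delta_x[1],
              R[1][0] * delta_x[0] + R[1][1] * delta_x[1]]
    return [[w[i][j] - beta * dm_est[i] * p[j]
             for j in range(len(p))]
            for i in range(2)]
```

The contrast with the recurrent rule of equation A.11 below is exactly the presence of R: the update cannot even be written down without prior knowledge of the plant.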
In the alternative recurrent architecture, the PF signals are given by RBFs,

p_j = G_j(m1, m2) = exp( −((m1 − c_1j)² + (m2 − c_2j)²) / 2σ² ), (A.9)

with centers chosen on a rectangular grid in motor space. The grid was chosen to cover the image of the work space in motor space (see Figure 6b). The recurrent architecture of Figure 5 implies the following equation for the motor commands:
m1 = B1( x1 + Σ_j w_1j p_j, x2 + Σ_j w_2j p_j )
m2 = B2( x1 + Σ_j w_1j p_j, x2 + Σ_j w_2j p_j ). (A.10)

This is an implicit equation for the motor commands, since the p_j depend on the m_i via equation A.9. Its solution was obtained at each trial by iterating m_{n+1} = B(x + C(m_n)) to convergence (a relative accuracy of 10⁻⁴ required at most 10 iterations in the simulations reported here). This off-line iteration is necessary to allow the arm to make discontinuous movements between randomly chosen grid points. If waypoints x_n are closely sampled from a continuous curve, the more natural alternative online procedure m_{n+1} = B(x_n + C(m_n)) can be used.
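The off-line fixed-point iteration can be sketched as follows. For brevity, the adaptive correction C here is a small fixed linear map rather than an RBF net, B is written with atan2 for quadrant safety, and the unit arm lengths are illustrative; only the iteration scheme itself is taken from the appendix.

```python
import math

# Sketch of the fixed-point iteration m_{n+1} = B(x + C(m_n)) used to
# solve the recurrent loop of equation A.10. B is the two-link inverse
# kinematics of equations A.2-A.3 (atan2 form); C is an illustrative
# stand-in for the adaptive element.

def B(x1, x2, xi=1, l1=1.0, l2=1.0):
    """Two-link inverse kinematics; xi = +/-1 picks the configuration."""
    c = (x1 ** 2 + x2 ** 2 - l1 ** 2 - l2 ** 2) / (2 * l1 * l2)
    theta = math.acos(max(-1.0, min(1.0, c)))   # clamp for safety
    m2 = xi * theta
    m1 = math.atan2(x2, x1) - math.atan2(l2 * math.sin(m2),
                                         l1 + l2 * math.cos(m2))
    return m1, m2

def C(m1, m2, eps=0.05):
    """Small illustrative corrective output of the adaptive element."""
    return eps * m1, eps * m2

def solve_loop(x1, x2, tol=1e-4, max_iter=10):
    """Iterate m_{n+1} = B(x + C(m_n)) to relative accuracy tol."""
    m1, m2 = B(x1, x2)                          # initial guess with C = 0
    for _ in range(max_iter):
        c1, c2 = C(m1, m2)
        n1, n2 = B(x1 + c1, x2 + c2)
        if max(abs(n1 - m1), abs(n2 - m2)) < tol * max(1.0, abs(n1), abs(n2)):
            return n1, n2
        m1, m2 = n1, n2
    return m1, m2
```

Because B is composed with the small correction C, the iteration map is a contraction near the solution, and convergence is rapid, consistent with the at-most-10-iterations figure quoted above.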
Again, the unknown weights w_ij were initially set to 0. Sensory error obtained from equation A.6 was used directly in the learning rule

w_ij^new = w_ij^old − β δx_i p_j. (A.11)
Values for the optimal weights w*_ij are required to calculate the sum square weight errors plotted in Figure 6. These were obtained by direct batch minimization of the sum square reaching error calculated over a subsampled grid of points in the work space. We have primarily investigated convergence in the circumstances in which the scheme is believed to operate biologically, that is, in repair-shop mode, with changes in plant characteristics of 10% to 15%. However, simulations have indicated that it also converges from initial weights corresponding to grossly degraded performance, although we have not investigated this systematically.

Acknowledgments

This work was supported by EPSRC grant GR/T10602/01 under their Novel Computation Initiative.

References

Albus, J. S. (1971). A theory of cerebellar function. Mathematical Biosciences, 10, 25–61.
Andersson, G., Garwicz, M., & Hesslow, G. (1988). Evidence for a GABA-mediated cerebellar inhibition of the inferior olive in the cat. Experimental Brain Research, 72, 450–456.
Apps, R., & Garwicz, M. (2005). Anatomical and physiological foundations of cerebellar information processing. Nature Reviews Neuroscience, 6(4), 297–311.
Boyden, E. S., Katoh, A., & Raymond, J. L. (2004). Cerebellum-dependent learning: The role of multiple plasticity mechanisms. Annual Review of Neuroscience, 27, 581–609.
Brindley, G. S. (1964). The use made by the cerebellum of the information that it receives from sense organs. International Brain Research Organization Bulletin, 3, 80.
Carpenter, R. H. S. (1988). Movements of the eyes (2nd ed.). London: Pion.
De Zeeuw, C. I., Simpson, J. I., Hoogenraad, C. C., Galjart, N., Koekkoek, S. K. E., & Ruigrok, T. J. H. (1998). Microcircuitry and function of the inferior olive. Trends in Neurosciences, 21(9), 391–400.
Dean, P., Mayhew, J. E. W., Thacker, N., & Langdon, P. M. (1991). Saccade control in a simulated robot camera-head system: Neural net architectures for efficient learning of inverse kinematics. Biological Cybernetics, 66, 27–36.
Dean, P., Porrill, J., & Stone, J. V. (2002). Decorrelation control by the cerebellum achieves oculomotor plant compensation in simulated vestibulo-ocular reflex. Proceedings of the Royal Society of London, Series B, 269(1503), 1895–1904.
Dean, P., Porrill, J., & Stone, J. V. (2004). Visual awareness and the cerebellum: Possible role of decorrelation control. Progress in Brain Research, 144, 61–75.
Eccles, J. C., Ito, M., & Szentágothai, J. (1967). The cerebellum as a neuronal machine. Berlin: Springer-Verlag.
Fujita, M. (1982). Adaptive filter model of the cerebellum. Biological Cybernetics, 45, 195–206.
Galiana, H. L., & Outerbridge, J. S. (1984). A bilateral model for central neural pathways in vestibuloocular reflex. Journal of Neurophysiology, 51(2), 210–241.
Gibson, A. R., Horn, K. M., & Pong, M. (2002). Inhibitory control of olivary discharge. Annals of the New York Academy of Sciences, 978, 219–231.
Glasauer, S. (2003). Cerebellar contribution to saccades and gaze holding—a modeling approach. Annals of the New York Academy of Sciences, 1004, 206–219.
Gomi, H., & Kawato, M. (1992). Adaptive feedback control models of the vestibulocerebellum and spinocerebellum. Biological Cybernetics, 68(2), 105–114.
Gomi, H., & Kawato, M. (1993). Neural network control for a closed-loop system using feedback-error-learning. Neural Networks, 6, 933–946.
Highstein, S. M., Porrill, J., & Dean, P. (2005). Report on a workshop concerning the cerebellum and motor learning, held in St. Louis, October 2004. Cerebellum, 4, 1–11.
Ito, M. (1970). Neurophysiological aspects of the cerebellar motor control system. International Journal of Neurology (Montevideo), 7, 162–176.
Ito, M. (1984). The cerebellum and neural control. New York: Raven Press.
Ito, M. (1997). Cerebellar microcomplexes. International Review of Neurobiology, 41, 475–487.
Jordan, M. I. (1996). Computational aspects of motor control and motor learning. In H. Heuer & S. Keele (Eds.), Handbook of perception and action, Vol. 2: Motor skills (pp. 71–120). London: Academic Press.
Jordan, M. I., & Wolpert, D. M. (2000). Computational motor control. In M. S. Gazzaniga (Ed.), The new cognitive neurosciences (2nd ed., pp. 601–618). Cambridge, MA: MIT Press.
Keifer, J., & Houk, J. C. (1994). Motor function of the cerebellorubrospinal system. Physiological Reviews, 74(3), 509–542.
Kelly, R. M., & Strick, P. L. (2003). Cerebellar loops with motor cortex and prefrontal cortex of a nonhuman primate. Journal of Neuroscience, 23(23), 8432–8444.
Kobayashi, Y., Kawano, K., Takemura, A., Inoue, Y., Kitama, T., Gomi, H., & Kawato, M. (1998). Temporal firing patterns of Purkinje cells in the cerebellar ventral paraflocculus during ocular following responses in monkeys II. Complex spikes. Journal of Neurophysiology, 80(2), 832–848.
Lisberger, S. G. (1998). Cerebellar LTD: A molecular mechanism of behavioral learning? Cell, 92(6), 701–704.
Marr, D. (1969). A theory of cerebellar cortex. Journal of Physiology, 202, 437–470.
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information. San Francisco: Freeman.
Mayhew, J. E. W., Zheng, Y., & Cornell, S. (1992). The adaptive control of a four-degrees-of-freedom stereo camera head. Philosophical Transactions of the Royal Society of London, Series B, 337, 315–326.
Miles, F. A., Fuller, J. H., Braitman, D. J., & Dow, B. M. (1980). Long-term adaptive changes in primate vestibuloocular reflex. III. Electrophysiological observations in flocculus of normal monkeys. Journal of Neurophysiology, 43, 1437–1476.
Miyamura, A., & Kimura, H. (2002). Stability of feedback error learning scheme. Systems and Control Letters, 45, 303–316.
Nakanishi, J., & Schaal, S. (2004). Feedback error learning and nonlinear adaptive control. Neural Networks, 17(10), 1453–1465.
Ohtsuka, K., & Noda, H. (1992). Burst discharges of mossy fibers in the oculomotor vermis of macaque monkeys during saccadic eye movements. Neuroscience Research, 15(1–2), 102–114.
Patino, H. D., & Liu, D. (2000). Neural network–based model reference adaptive control system. IEEE Transactions on Systems, Man and Cybernetics—Part B: Cybernetics, 30, 198–203.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks. IEEE Transactions on Neural Networks, 6(5), 1212–1228.
Porrill, J., & Dean, P. (2004). Recurrent cerebellar loops simplify adaptive control of redundant and nonlinear motor systems. In 2004 Abstract Viewer/Itinerary Planner (Prog. No. 989.984). Washington, DC: Society for Neuroscience.
Porrill, J., Dean, P., & Stone, J. V. (2004). Recurrent cerebellar architecture solves the motor error problem. Proceedings of the Royal Society of London, Series B, 271, 789–796.
Robinson, D. A. (1974). The effect of cerebellectomy on the cat's vestibulo-ocular integrator. Brain Research, 71, 195–207.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4, 303–321.
Simpson, J. I., Belton, T., Suh, M., & Winkelman, B. (2002). Complex spike activity in the flocculus signals more than the eye can see. Annals of the New York Academy of Sciences, 978, 232–236.
Simpson, J. I., Wylie, D. R., & De Zeeuw, C. I. (1996). On climbing fiber signals and their consequence(s). Behavioral and Brain Sciences, 19(3), 384–398.
Slotine, J. J. E., & Li, W. P. (1989). Composite adaptive-control of robot manipulators. Automatica, 25(4), 509–519.
Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing. Upper Saddle River, NJ: Prentice Hall.
Yamamoto, K., Kobayashi, Y., Takemura, A., Kawano, K., & Kawato, M. (2002). Computational studies on acquisition and adaptation of ocular following responses based on cerebellar synaptic plasticity. Journal of Neurophysiology, 87(3), 1554–1571.
Received August 8, 2005; accepted May 30, 2006.
LETTER
Communicated by Stephen José Hanson
Free-Lunch Learning: Modeling Spontaneous Recovery of Memory

J. V. Stone
[email protected] Psychology Department, Sheffield University, Sheffield S10 2TP, England
P. E. Jupp
[email protected] School of Mathematics and Statistics, St. Andrews University, St. Andrews KY16 9SS, Scotland
After a language has been learned and then forgotten, relearning some words appears to facilitate spontaneous recovery of other words. More generally, relearning partially forgotten associations induces recovery of other associations in humans, an effect we call free-lunch learning (FLL). Using neural network models, we prove that FLL is a necessary consequence of storing associations as distributed representations. Specifically, we prove that (1) FLL becomes increasingly likely as the number of synapses (connection weights) increases, suggesting that FLL contributes to memory in neurophysiological systems, and (2) the magnitude of FLL is greatest if inactive synapses are removed, suggesting a computational role for synaptic pruning in physiological systems. We also demonstrate that FLL is different from generalization effects conventionally associated with neural network models. As FLL is a generic property of distributed representations, it may constitute an important factor in human memory.

Neural Computation 19, 194–217 (2007)  © 2006 Massachusetts Institute of Technology

1 Introduction

A popular aphorism states that "there's no such thing as a free lunch." However, in the context of learning theory, we propose that there is. In previous work, free-lunch learning (FLL) has been demonstrated using a task in which participants learned the positions of letters on a nonstandard computer keyboard (Stone, Hunkin, & Hornby, 2001). After a period of forgetting, participants relearned a proportion of these letter positions. Crucially, it was found that this relearning induced recovery of the nonrelearned letter positions. Preliminary results suggest that FLL also occurs using face stimuli.

Figure 1: Free-lunch learning protocol. Two subsets of associations A1 and A2 are learned. After partial forgetting (see text), performance error E_pre on subset A1 is measured. Subset A2 is then relearned to preforgetting levels of performance, and performance error E_post on subset A1 is remeasured. If E_post < E_pre, then FLL has occurred, and the amount of FLL is δ = E_pre − E_post.

If the brain stores information as distributed representations, then each neuron contributes to the storage of many associations, so that relearning
some old and partially forgotten associations affects the integrity of other old associations. Using neural network models, we show that relearning some associations does not disrupt other stored associations but actually restores them. In essence, recovery occurs in neural network models because each association is distributed among all connection weights (synapses) between units (model neurons). After partial forgetting, relearning some of the associations forces all of the weights closer to preforgetting values, resulting in improved performance even on nonrelearned associations.

1.1 The Geometry of Free-Lunch Learning. The protocol used to examine FLL here is as follows (see Figure 1). First, learn a set of n1 + n2 associations A = A1 ∪ A2 consisting of two intermixed subsets A1 and A2 of n1 and n2 associations, respectively. After all learned associations A have been partially forgotten, measure performance on subset A1. Finally, relearn only subset A2, and then remeasure performance on subset A1. FLL occurs if relearning subset A2 improves performance on A1. Unless stated otherwise, we assume that for a network with n connection weights, n ≥ n1 + n2.

For the present, we assume that the network has one output unit and two input units, which implies n = 2 connection weights, and that A1 and A2 each consist of n1 = n2 = 1 association, as in Figure 2.

Figure 2: Geometry of free-lunch learning. (a) A network with two input units and one output unit, with connection weights wa and wb, defines a weight vector w = (wa, wb). The network learns two associations A1 and A2, where (for example) A1 is the mapping from input vector x1 = (x11, x12) to desired output value d1; learning consists of adjusting w until the network output y1 = w · x1 equals d1. (b) Each association A1 and A2 defines a constraint line L1 and L2, respectively. The intersection of L1 and L2 defines a point w0 that satisfies both constraints, so that zero error on A1 and A2 is obtained if w = w0. After partial forgetting, w is a randomly chosen point w1 on the circle C with radius r, and performance error E_pre on A1 is the squared distance p². After relearning A2, the weight vector w2 is in L2, and performance error E_post on A1 is q². FLL occurs if δ = E_pre − E_post > 0, or equivalently if Q = p² − q² > 0. Relearning A2 has one of three possible effects, depending on the position of w1 on C: (1) if w1 is under the larger (dashed) arc C_FLL as shown here, then p² > q² (δ > 0), and therefore FLL is observed; (2) if w1 is under the smaller (dotted) arc, then p² < q² (δ < 0), and therefore negative FLL is observed; and (3) if w1 is at the critical point w_crit, then p² = q² (δ = 0). Given that w1 is a randomly chosen point on C and that the length of C_FLL is S_FLL, the probability of FLL is P(δ > 0) = S_FLL/πr (i.e., the proportion of C_FLL under the upper semicircle of C).

Input units are connected to the output unit via weights wa and wb, which define a weight vector w = (wa, wb). Associations A1 and A2 consist of different mappings from the input vectors x1 = (x11, x12) and x2 = (x21, x22) to desired output values d1 and d2, respectively. If a network is presented with input vectors x1 and x2, then its output values are y1 = w · x1 = wa x11 + wb x12 and y2 = w · x2 = wa x21 + wb x22, respectively. Network performance error for k = 2 associations is defined as E(w, A) = Σ_{i=1}^{k} (di − yi)².
The weight vector w defines a point in the (wa, wb)-plane. For an input vector x1, there are many different combinations of weight values wa and wb that give the desired output d1. These combinations lie on a straight line L1, because the network output is a linear weighted sum of input values. A corresponding constraint line L2 exists for A2. The intersection of L1 and L2 therefore defines the only point w0 that satisfies both constraints, so that zero error on A1 and A2 is obtained if and only if w = w0. Without loss of generality, we define the origin w0 to be the intersection of L1 and L2.

We now consider the geometric effect of partial forgetting of both associations, followed by relearning A2. This geometric account applies to a network with two weights (see Figure 2) and depends on the following observation: if the length of the input vector ‖x1‖ = 1, then the performance error E(w, A1) = (d1 − y1)² of a network with weight vector w when tested on association A1 is equal to the squared distance between w and the constraint line L1 (see appendix C). For example, if w is in L1, then E(w, A1) = 0, but as the distance between w and L1 increases, so E(w, A1) must increase. For the purposes of this geometric account, we assume that ‖x1‖ = ‖x2‖ = 1.

Partial forgetting is induced by adding isotropic noise v to the weight vector w = w0. This effectively moves w to a randomly chosen point w1 = w0 + v on the circle C of radius r = ‖v‖, where r represents the amount of forgetting. For a network with w = w1, learning A2 moves w to the nearest point w2 on L2 (see appendix B), so that w2 is the orthogonal projection of w1 on L2. Before relearning A2, the performance error E_pre on A1 is the squared distance p² between w1 and its orthogonal projection on L1 (see appendix C). After relearning A2, the performance error E_post is the squared distance q² between w2 and its orthogonal projection on L1.
The amount of FLL is δ = E_pre − E_post and, for a network with two weights, is equal to Q = p² − q². The probability P(δ > 0) of FLL given L1 and L2 is equal to the proportion of points on C for which δ > 0 (or, equivalently, for which Q > 0). For example, averaging over all subsets A1 and A2, there is a probability P(δ > 0) = 0.68 that relearning A2 induces FLL of A1 (see Figure 5), a probability that increases with the number of weights (see theorem 3).

If we drop the assumption that a network has only two input units, then we can consider subsets A1 and A2 with n1 > 1 and n2 > 1 associations. If the number of connection weights n ≥ max(n1, n2), then A1 and A2 define an (n − n1)-dimensional subspace L1 and an (n − n2)-dimensional subspace L2, respectively. The intersection of L1 and L2 corresponds to weight vectors that generate zero error on A = A1 ∪ A2. Finally, we can drop the assumption that a network has only one output unit, because the connections to each output unit can be considered as a distinct network, in which case our results can be applied to the network associated with each output unit.
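The two-weight geometric account lends itself to a quick numerical illustration. The sketch below is our own (the unit input vectors, targets, and seed are arbitrary choices, not taken from the letter): it forgets by moving to a random point on the circle of radius r = 1, relearns A2 by orthogonal projection onto L2, and estimates P(δ > 0) for this one fixed pair of constraint lines.

```python
# Sketch of the Figure 2 geometry for n = 2 weights (illustrative values).
import numpy as np

rng = np.random.default_rng(0)

# Two associations with unit input vectors, so squared error equals
# squared distance to the corresponding constraint line.
x1 = np.array([1.0, 0.0])
x2 = np.array([np.cos(1.0), np.sin(1.0)])
d1, d2 = 0.7, -0.3

# w0 = intersection of L1 and L2 (zero error on both associations).
w0 = np.linalg.solve(np.vstack([x1, x2]), np.array([d1, d2]))

deltas = []
for _ in range(20_000):
    theta = rng.uniform(0.0, 2.0 * np.pi)
    w1 = w0 + np.array([np.cos(theta), np.sin(theta)])  # forgetting, r = 1
    E_pre = (w1 @ x1 - d1) ** 2                         # p^2
    w2 = w1 - (w1 @ x2 - d2) * x2                       # project onto L2
    E_post = (w2 @ x1 - d1) ** 2                        # q^2
    deltas.append(E_pre - E_post)                       # delta = p^2 - q^2

p_fll = np.mean(np.array(deltas) > 0.0)
print(f"P(delta > 0) for this line pair: {p_fll:.2f}")
```

The probability depends on the angle between L1 and L2; the 0.68 quoted in the text is the average over randomly sampled associations, not the value for any single pair of lines.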
2 Methods

Given a network with n input units and one output unit, the set A of associations consisted of k input vectors (x1, …, xk) and k corresponding desired scalar output values (d1, …, dk). Each input vector comprises n elements x = (x1, …, xn). The values of xi and di were chosen from a gaussian distribution with unit variance (i.e., σx² = σd² = 1). A network's output yi is a weighted sum of input values, yi = w · xi = Σ_{j=1}^{n} wj xij, where xij is the jth value of the ith input vector xi, and each weight wj is one input-output connection.

Given that the network error for a given set of k associations is E(w, A) = Σ_{i=1}^{k} (di − yi)², the gradient ∇E(w) = −2 Σ_{i=1}^{k} (di − yi) xi of E with respect to w yields the delta learning rule

w_new = w_old − η ∇E(w_old),

where η is the learning rate, which is adjusted according to the number of weights. A learning trial consists of presenting the k input vectors to the network and then updating the weights using the delta rule. Learning was stopped when ‖∇E(w)‖ < 0.001, where ‖∇E(w)‖ is the magnitude of the gradient.

Initial learning of the k = n associations in A = A1 ∪ A2 was performed by solving a set of n simultaneous equations using a standard method, after which perfect performance on all n associations was obtained. Partial forgetting was induced by adding an isotropic noise vector v with r = ‖v‖ = 1. Relearning the n2 = n/2 associations in A2 was implemented with k = n2 using the delta rule.

3 Results

Our four main theorems are summarized here, and proofs are provided in the appendixes. These theorems apply to a network with n weights that learns n1 + n2 associations A = A1 ∪ A2 and, after partial forgetting, relearns the n2 associations in A2.

Theorem 1. The probability P(δ > 0) of FLL is greater than 0.5.
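For concreteness, the delta-rule training loop described in section 2 can be sketched as follows (our own implementation; the network size is an arbitrary choice, while the learning rate and stopping threshold follow values given in the text).

```python
# Delta-rule learning of k associations in a linear one-output network.
import numpy as np

rng = np.random.default_rng(1)
n, k = 10, 5                            # n weights, k associations (k <= n)
X = rng.standard_normal((k, n))         # rows are input vectors x_i
d = rng.standard_normal(k)              # desired outputs d_i

w = np.zeros(n)
eta = 0.005                             # learning rate
while True:
    grad = -2.0 * X.T @ (d - X @ w)     # gradient of sum((d_i - y_i)^2)
    if np.linalg.norm(grad) < 0.001:    # stopping criterion from the text
        break
    w -= eta * grad                     # w_new = w_old - eta * grad

print("final error:", np.sum((X @ w - d) ** 2))
```

Because k < n the system is underconstrained; gradient descent converges to one of infinitely many zero-error weight vectors, namely the one closest to the initial weights, which is what makes relearning restorative.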
Theorem 2. The expected amount of FLL per association in A1 is

E[δ/n1] = (n2 / n²) E[‖x‖²] E[‖v‖²].  (3.1)
For given values of E[‖x‖²] and E[‖v‖²], the value of n2 that maximizes E[δ/n1] (subject to n1 + n2 ≤ n) is n2 = n − n1.

If each input vector x = (x1, …, xn) is chosen from an isotropic (e.g., isotropic gaussian) distribution and the variance of xi is σx², then E[‖x‖²] = nσx². If σx² is the same for all n, then the state of a neuron (with a typical
sigmoidal transfer function) would be in a constantly saturated state as the number of synapses increases. One way to prevent this saturation is to assume that the efficacy of synapses on a given neuron decreases as the number of synapses increases. If forgetting is caused primarily by learning spurious inputs, then the delta learning rule used here implies that the "amount of forgetting" ‖v‖ is approximately independent of n. We therefore assume that ‖v‖ and σx² are constant, and for convenience, we set ‖v‖ = 1 and σx² = 1. Substituting these values into equation 3.1 yields

E[δ/n1] = n2 / n.  (3.2)
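Equation 3.2 can be checked against the protocol of section 2 with a short Monte Carlo simulation. The sketch below is our own; to keep it fast, we replace iterative delta-rule relearning with its fixed point, the orthogonal projection of the weight vector onto the solution space of A2, and we pick n = 20 arbitrarily.

```python
# Monte Carlo check of E[delta/n1] = n2/n for the section 2 protocol.
import numpy as np

rng = np.random.default_rng(2)
n, n1, n2 = 20, 10, 10                       # n1 + n2 = n (no pruning needed)

vals = []
for _ in range(2_000):
    X = rng.standard_normal((n, n))          # n associations, rows = inputs
    d = rng.standard_normal(n)
    w0 = np.linalg.solve(X, d)               # exact initial learning
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)                   # isotropic noise, ||v|| = 1
    w1 = w0 + v                              # partial forgetting
    X2, d2 = X[n1:], d[n1:]
    # Relearning A2 = orthogonal projection of w1 onto {w : X2 w = d2};
    # lstsq returns the minimum-norm correction.
    w2 = w1 - np.linalg.lstsq(X2, X2 @ w1 - d2, rcond=None)[0]
    X1, d1 = X[:n1], d[:n1]
    E_pre = np.sum((X1 @ w1 - d1) ** 2)
    E_post = np.sum((X1 @ w2 - d1) ** 2)
    vals.append((E_pre - E_post) / n1)

print(f"mean delta/n1 = {np.mean(vals):.3f}, predicted = {n2 / n}")
```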
Using these assumptions, simulations of networks with n = 2 and n = 100 weights agree with equation 3.2, as shown in Figure 3.

The role of pruning can be demonstrated as follows. Consider a network with 100 input units and one output unit, with n = 100 weights. If n2 = 90 associations are relearned out of an original set of n1 + n2 = 100 associations, then E[δ/n1] = n2/n = 0.90. However, if n = 1000, then E[δ/n1] = 0.09. In general, as the number n − (n1 + n2) of unpruned redundant weights increases, so E[δ/n1] decreases. Therefore, E[δ/n1] is maximized if n1 + n2 = n. If n1 + n2 < n, then the expected amount of FLL is not maximal and can therefore be increased by pruning redundant weights until n = n1 + n2 (see Figure 4). Note that for a particular network, performance error E_post on A1 after learning A2 can be zero. For example, if w = w∗ in Figure 2, then p = q = 0, which implies that δ/n1 = E_post = q² = 0.

Theorem 3. The probability P(δ > 0) of FLL of A1 satisfies

P(δ > 0) > 1 − [a0(n, n1, n2) + a1(n, n2) var(‖x‖²)/E[‖x‖²]²] / [n1 n2 (n + 2)²],  (3.3)

where

a0(n, n1, n2) = 2[n1(n + 2)(n − n2) + n(n − n2) + n(n + 2)(n − 1)],  (3.4)
a1(n, n2) = n2(2n + n2 + 6).  (3.5)
Theorem 3 implies that if the numbers n1 and n2 of associations in A1 and A2 are fixed nonzero proportions (n1/n and n2/n, respectively) of the number n of connection weights, and var(‖x‖²)/(n E[‖x‖²]²) → 0 as n → ∞, then P(δ > 0) → 1 as n → ∞. Moreover, the probability that each of the n1 associations in A1 exhibits FLL is P(δ/n1 > 0) = P(δ > 0), because δ > 0 iff δ/n1 > 0.

For example, if we assume that each input vector x = (x1, …, xn) is chosen from an isotropic (e.g., isotropic gaussian) distribution and the
variance of xi is σx², then var(‖x‖²)/E[‖x‖²]² = 2/n. This ensures that var(‖x‖²)/(n E[‖x‖²]²) → 0 as n → ∞, and therefore that P(δ > 0) → 1 as n → ∞. Using this assumption, an approximation of the right-hand side of equation 3.3 yields

P(δ > 0) > 1 − 2(1 + α1 − α1α2)/(nα1α2) − 2(2 + α2 + 6/n)/(α1α2(n + 2)²),  (3.6)
where α1 = n1/n and α2 = n2/n. In this form, it is easy to see that P(δ > 0) → 1 as n → ∞.

We briefly consider the case n1 ≥ n and n2 ≥ n, so that each of L1 and L2 is a single point. If the distance D between these points is much less than ‖v‖, then simple geometry shows that performance error E_pre on A1 is large and that relearning A2 reduces this error for any v (i.e., with probability 1), with E_post ∝ D², even in the absence of initial learning of A1 and A2 (see equation A.18 in appendix A). A similar conclusion is implicit in Atkins and Murre (1998).

Theorem 4. If, instead of relearning A2, the network learns a new subset A3 (drawn from the same distribution as A2), then the expected amount of FLL is less than the expected amount of FLL after relearning subset A2.

Learning A3 is analogous to the control condition used with human participants (Stone et al., 2001), and the finding that the amount of recovery after learning A3 is less than the amount of recovery after relearning A2 is predicted by theorem 4.
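The lower bound of equation 3.6 can be compared with an empirical estimate of P(δ > 0). The sketch below is our own (network sizes chosen for speed); with exact initial learning the errors depend only on the noise vector v, so the desired outputs drop out of the computation.

```python
# Empirical P(delta > 0) versus the equation 3.6 lower bound.
import numpy as np

rng = np.random.default_rng(3)

def prob_fll(n, trials=4_000):
    """Estimate P(delta > 0) with n1 = n2 = n/2 and exact initial learning."""
    n1 = n // 2
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, n))
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)                    # ||v|| = 1
        X1, X2 = X[:n1], X[n1:]
        E_pre = np.sum((X1 @ v) ** 2)             # error due to forgetting
        shift = v - np.linalg.lstsq(X2, X2 @ v, rcond=None)[0]
        E_post = np.sum((X1 @ shift) ** 2)        # after relearning A2
        hits += E_pre > E_post
    return hits / trials

def bound(n, a1=0.5, a2=0.5):
    """Right-hand side of equation 3.6 with alpha_1 = alpha_2 = 1/2."""
    return (1 - 2 * (1 + a1 - a1 * a2) / (n * a1 * a2)
              - 2 * (2 + a2 + 6 / n) / (a1 * a2 * (n + 2) ** 2))

results = {n: (prob_fll(n), bound(n)) for n in (8, 32)}
for n, (p, b) in results.items():
    print(f"n = {n:3d}: empirical P = {p:.3f}, bound = {b:.3f}")
```

For small n the bound is vacuous (negative), but the empirical probability already exceeds 0.5; as n grows, both rise toward 1.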
Figure 3: Distribution of free-lunch learning. (a) Histogram of the amount of FLL δ/n1 per association, based on 1000 runs, for a network with n = 2 weights (see section 2). After learning two association subsets (η = 0.1), A1 and A2, containing n1 = 1 and n2 = 1 associations (respectively), the network has a weight vector w0. Forgetting is then induced by adding a noise vector v with ‖v‖² = 1 to w0. One association A2 is then relearned, and the change in performance on A1 is measured as δ/n1 (see Figure 2). Negative values indicate that performance on A1 decreases after relearning A2. (b) Histogram of the amount of FLL δ/n1 per association for a network with n = 100 weights and η = 0.005, with A1 and A2 each consisting of n1 = n2 = 50 associations, using the same protocol as in (a). In both (a) and (b), the mean value of δ/n1 is about 0.5, as predicted by equation 3.2. As the number of associations learned increases, the amount of FLL becomes more tightly clustered around δ/n1 = 0.5, as demonstrated in these two histograms, and the probability of FLL increases (also see Figure 5).
Figure 4: Effect of pruning on free-lunch learning. Graph of the expected amount of FLL per association E[δ/n1 ] as a function of the total number n1 + n2 of learned associations in A = A1 ∪ A2 , as given in equation 3.2. In this example, the number of connection weights is fixed at n = 100, and the number of associations in A = A1 ∪ A2 increases from n1 + n2 = 2 to n1 + n2 = 100. The number n2 of relearned associations in A2 is a constant proportion (0.5) of the associations in A. If n1 + n2 ≤ n, then the network contains n − (n1 + n2 ) unpruned redundant connections. Thus, pruning effectively increases as n1 + n2 increases because, as the number n1 + n2 of associations grows, so the number of unpruned redundant connections decreases. The expected amount of FLL per association E[δ/n1 ] increases as the amount of pruning increases.
4 Discussion

Theorems 1 to 4 provide the first proof that relearning induces nontransient recovery, where postrecovery error is potentially zero. This contrasts with the usually small and transient recovery that occurs during the initial phase of relearning forgotten associations (Hinton & Plaut, 1987; Atkins & Murre, 1998) and during learning of new associations (Harvey & Stone, 1996). In particular, theorem 2 is predictive inasmuch as it suggests that the amount of FLL in humans should be (1) proportional to the amount of forgetting of A = A1 ∪ A2 and (2) proportional to the proportion n2/(n1 + n2) of associations relearned after partial forgetting of A.

Figure 5: Probability of free-lunch learning. The probability P(δ > 0) of FLL of associations A1 as a function of the total number n1 + n2 of learned associations A = A1 ∪ A2 for networks with n = n1 + n2 weights. Each of the two subsets of associations A1 and A2 consists of n1 = n2 = n/2 associations. After learning and then partially forgetting A, performance on A1 was measured. P(δ > 0) is the probability that performance on subset A1 is better after subset A2 has been relearned than it is before A2 has been relearned. Solid line: empirical estimate of P(δ > 0). Each data point is based on 10,000 runs, where each run uses input vectors chosen from an isotropic gaussian distribution (see section 2). Dashed line: theoretical lower bound on the probability of FLL, as given by theorems 1 and 3, assuming that input vectors are chosen from an isotropic (e.g., isotropic gaussian) distribution.

We have assumed that the number n1 + n2 of associations A = A1 ∪ A2 encoded by a given neuron is not greater than the number n of input connections (synapses) to that neuron. Given that each neuron typically has many thousands of synapses (e.g., cerebellar Purkinje cells), it seems likely that this assumption is valid. However, the total amount of FLL is maximal if n1 = n2 = n/2, so that the full potential of FLL can be realized only if n1 + n2 = n. This optimum number of synapses can be achieved if inactive (i.e., redundant) synapses are pruned. Pruning may therefore contribute to FLL in physiological systems (Purves & Lichtman, 1980; Goldin, Segal, & Avignone, 2001).

We have also assumed that a delta rule is used to learn associations between inputs and desired outputs. This general type of supervised learning is thought to be implemented by the cerebellum and basal ganglia (Doya, 1999). Models of the cerebellum (Dean, Porrill, & Stone, 2002) use a delta rule to implement learning. Similarly, models of the basal ganglia (Nakahara, Itoh, Kawagoe, Takikawa, & Hikosaka, 2004) use a temporally discounted
form of delta rule, the temporal difference rule. This temporal difference rule has also been used to model learning in humans (Seymour et al., 2004) and (under mild conditions) is equivalent to the standard delta rule (Sutton, 1988). Indeed, from a purely computational perspective, it is difficult to conceive how these forms of associative learning could be implemented without some form of delta rule.

Our analysis is based on the assumption that the network model is linear. Of course, many nonlinear networks can be approximated by linear networks, but it is possible that the results derived here have limited applicability to certain classes of nonlinear networks.

4.1 Relation to Task Generalization. It is only natural to ask how FLL relates to tasks that a human might learn. One obvious but vital condition for FLL is that different associations must be encoded by a common set of neuronal connections. Aside from this condition, it might be thought that relearning A2 improves performance on A1 because A1 and A2 are somehow related (as in Hanson & Negishi, 2002; Dienes, Altmann, & Gao, 1999), so that learning A2 generalizes to A1. This form of task generalization can occur if A1 and A2 are related as follows. If the input-output pairs in A1 and A2 are sampled from a sufficiently smooth function f, and n1 ≫ n and n2 ≫ n, then A1 and A2 are statistically related, and therefore the weights induced by learning A1 are similar to those induced by learning A2. Consequently, the resultant network input-output functions g1 and g2 (respectively) both approximate the function f (i.e., g1 ≈ g2 ≈ f). In this case, learning A2 yields good performance on A1. In the context of FLL, if A1 ∪ A2 is learned, forgotten, and then A2 is relearned, performance on A1 will also improve. However, the reason for this improvement is obvious and trivial: it is simply that A1 and A2 are statistically related and large enough (i.e., with n1 ≫ n and n2 ≫ n) to induce similar network functions.
In contrast, the effect described in this letter does not depend on statistical similarity between A1 and A2. Crucial assumptions are that n1 + n2 ≤ n, n1 < n, and n2 < n, so that learning the n2 associations in A2 in a network with n weights is underconstrained. This implies that the network function induced by learning A1 has no particular relation to the network function induced by learning A2, even if A1 and A2 are sampled from the same function f (provided A1 and A2 are disjoint sets). For example, if A1 and A2 each consist of one association sampled from a linear function f (i.e., a line), then learning A2 in a linear network (as in Figure 2a) induces a linear network function g2 (i.e., a line) that intersects with f but is otherwise unconstrained. Thus, learning A2 does not necessarily yield good performance on A1.

The FLL effect reported here depends on relearning after forgetting. To cite an extreme example, if unicycling and learning French were encoded by a common set of neurons, then, after forgetting both, relearning unicycling could improve your French (although the mechanism involved here is unrelated to that described in Harvey & Stone, 1996). Thus, FLL contrasts
with the task generalization outlined above, where it is obvious that both A1 and A2 induce similar network functions.

Motivated by the demonstration that recovery occurs in humans (Stone et al., 2001; Coltheart & Byng, 1989; Weekes & Coltheart, 1996) (but not in all studies—Atkins, 2001), we have proven that FLL occurs in network models. The analysis presented here suggests that FLL is a necessary and generic consequence of storing information in distributed systems rather than a side effect peculiar to a particular class of artificial neural nets. Moreover, the generic nature of FLL suggests that it is largely independent of the type (i.e., artificial or physiological) of network used to learn associations. FLL appears to be a fundamental property of distributed representations. Given the reliance of neuronal systems on distributed representations, FLL may be a ubiquitous feature of learning and memory. It is likely that any organism that did not take advantage of such a fundamental and ubiquitous effect would be at a severe selective disadvantage.

Appendix A: Analysis of Free-Lunch Learning

We proceed by deriving expressions for E_pre, E_post, and δ = E_pre − E_post. We prove that if n1 + n2 ≤ n, then the expected value of δ is positive. We then prove that if n1 + n2 ≤ n, the probability P(δ > 0) of FLL is greater than 0.5, that its lower bound increases with n (if n1/n and n2/n are fixed), and that this bound approaches unity as n increases.

A.1 Definition of Performance Error. For an artificial neural network (ANN) with weight vector w, we define the performance error for input vectors x1, …, xc and desired outputs d1, …, dc to be

E(x1, …, xc; w, d1, …, dc) = Σ_{i=1}^{c} (w · xi − di)².  (A.1)
By putting X = (x1, …, xc)ᵀ, d = (d1, …, dc)ᵀ, and E(X; w, d) = E(x1, …, xc; w, d1, …, dc), we can write equation A.1 succinctly as

E(X; w, d) = ‖Xw − d‖².  (A.2)

Given a c × n matrix X and a c-dimensional vector d, let L_{X,d} be the affine subspace

L_{X,d} = {w : XᵀXw = Xᵀd}
of Rⁿ. Since

i. rk(XᵀX) ≤ rk(X),
ii. XᵀXa = 0 ⇒ aᵀXᵀXa = 0 ⇒ Xa = 0,

it follows that rk(XᵀX) = rk(X) (where rk denotes the rank of a matrix), and so

L_{X,d} is nonempty.  (A.3)

If X and d are consistent (i.e., there is a w such that Xw = d), then L_{X,d} = {w : Xw = d}.

A.2 Comparison of Performance Errors. Given weight vectors w1 and w2, a matrix X of input vectors, and a vector d of desired outputs, define

δ(w1, w2; X, d) = E_pre − E_post,

where E_pre = E(X; w1, d) and E_post = E(X; w2, d). Let w̃ be any element of L_{X,d}. Then

δ(w1, w2; X, d) = ‖Xw1 − d‖² − ‖Xw2 − d‖²
= ‖Xw1‖² − ‖Xw2‖² − 2(w1 − w2)ᵀXᵀd
= (w1 − w2)ᵀXᵀX(w1 + w2) − 2(w1 − w2)ᵀXᵀXw̃
= (w1 − w2)ᵀXᵀX(w1 + w2 − 2w̃).  (A.4)
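Equation A.4 is easy to verify numerically. The check below is our own (the sizes are arbitrary, and we use the least-squares solution as a convenient element w̃ of L_{X,d}, since it satisfies the normal equations XᵀXw = Xᵀd).

```python
# Numeric check of equation A.4 on a random instance.
import numpy as np

rng = np.random.default_rng(4)
c, n = 3, 5
X = rng.standard_normal((c, n))
d = rng.standard_normal(c)

def E(w):
    """Performance error ||Xw - d||^2 of equation A.2."""
    return np.sum((X @ w - d) ** 2)

# w_tilde is an element of L_{X,d}: it satisfies X^T X w = X^T d.
w_tilde = np.linalg.lstsq(X, d, rcond=None)[0]

w1 = rng.standard_normal(n)
w2 = rng.standard_normal(n)
lhs = E(w1) - E(w2)
rhs = (w1 - w2) @ (X.T @ X) @ (w1 + w2 - 2.0 * w_tilde)
print("|lhs - rhs| =", abs(lhs - rhs))
```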
Suppose given ni × n matrices Xi and ni-dimensional vectors di (for i = 1, 2). Put

Li = L_{Xi,di}  for i = 1, 2.

If Xi has rank ni, then Xi = Ti Zi for unique ni × ni and ni × n matrices Ti and Zi, with Ti upper triangular and Zi Ziᵀ = I_{ni}. Note that the matrix Ziᵀ Zi represents the operator that projects onto the image of Xiᵀ Xi, and so

Ziᵀ Zi Xiᵀ Xi = Xiᵀ Xi.  (A.5)
Free-Lunch Learning
207
Let w0 be an element of L_{X,d}, where X and d are the stacked matrix and vector

   X = [X1; X2],  d = [d1; d2],

that is,

   (X1^T X1 + X2^T X2) w0 = X1^T d1 + X2^T d2.   (A.6)
(By equation A.3, such a w0 always exists.) Given v in R^n, put w1 = w0 + v. Let w02 and w2 be the orthogonal projections of w0 and w1, respectively, onto L2. Then

   X2^T X2 w02 = X2^T d2,   (A.7)
   w2 = w02 + (I_n − Z2^T Z2)(w1 − w02).

Manipulation gives

   w1 − w2 = Z2^T Z2 (v + w0 − w02),   (A.8)

and so

   w1 + w2 − 2w0 = (2I_n − Z2^T Z2) v − Z2^T Z2 (w0 − w02).   (A.9)
Let w̃ be any element of L_{X1,d1}. Then equations A.4, A.6, A.7 to A.9, and A.5 yield

   δ(w1, w2; X1, d1)
    = (w1 − w2)^T X1^T X1 (w1 + w2 − 2w̃)
    = (w1 − w2)^T X1^T X1 (w1 + w2) − 2(w1 − w2)^T X1^T d1
    = (w1 − w2)^T X1^T X1 (w1 + w2 − 2w0) − 2(w1 − w2)^T X2^T X2 (w0 − w02)
    = (v + w0 − w02)^T Z2^T Z2 X1^T X1 (w1 + w2 − 2w0) − 2(v + w0 − w02)^T Z2^T Z2 X2^T X2 (w0 − w02)
    = (v + w0 − w02)^T Z2^T Z2 X1^T X1 (2I_n − Z2^T Z2) v − (v + w0 − w02)^T Z2^T Z2 X1^T X1 Z2^T Z2 (w0 − w02) − 2(v + w0 − w02)^T Z2^T Z2 X2^T X2 (w0 − w02)
    = v^T Z2^T Z2 X1^T X1 (2I_n − Z2^T Z2) v − 2(w0 − w02)^T Z2^T Z2 [X1^T X1 (I_n − Z2^T Z2) − X2^T X2] v − (w0 − w02)^T Z2^T Z2 [2X2^T X2 + X1^T X1 Z2^T Z2] (w0 − w02)
    = v^T Z2^T Z2 X1^T X1 (2I_n − Z2^T Z2) v − 2(w0 − w02)^T Z2^T Z2 [X1^T X1 (I_n − Z2^T Z2) − X2^T X2] v − (w0 − w02)^T (2X2^T X2 + Z2^T Z2 X1^T X1 Z2^T Z2)(w0 − w02).   (A.10)
A.3 Moments of Isotropic Distributions. In order to obtain results on the distribution of performance error, it is useful to have some moments of isotropic distributions. Let u be uniformly distributed on S^{n−1}, and let A and B be n × n matrices. The formulas for the second and fourth moments of u given in equations 9.6.1 and 9.6.2 of Mardia and Jupp (2000), together with some algebraic manipulation, yield

   E[u^T A u] = tr(A)/n,   (A.11)
   E[u^T A u · u^T B u] = (tr(AB) + tr(AB^T) + tr(A) tr(B)) / (n(n + 2)),   (A.12)
   var(u^T A u) = (n tr(A²) + n tr(AA^T) − 2 tr(A)²) / (n²(n + 2)).   (A.13)

Now let x be isotropically distributed on R^n, that is, Ux has the same distribution as x for all orthogonal n × n matrices U. Then writing x = ‖x‖u with ‖u‖ = 1 and using equations A.11 to A.13 gives

   E[x^T A x] = E[‖x‖²] tr(A)/n,   (A.14)
   E[x^T A x · x^T B x] = E[‖x‖⁴] (tr(AB) + tr(AB^T) + tr(A) tr(B)) / (n(n + 2)),
   var(x^T A x) = E[‖x‖⁴] (n tr(A²) + n tr(AA^T) − 2 tr(A)²) / (n²(n + 2)) + var(‖x‖²) tr(A)² / n².   (A.15)
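These moment formulas are easy to confirm by simulation. The following sketch (added for illustration; the sample size is arbitrary) checks equation A.11:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))

# Uniform points on the sphere S^{n-1}: normalize gaussian vectors.
u = rng.standard_normal((200000, n))
u /= np.linalg.norm(u, axis=1, keepdims=True)

# Monte Carlo estimate of E[u^T A u] versus tr(A)/n (equation A.11).
estimate = np.mean(np.einsum('ki,ij,kj->k', u, A, u))
assert abs(estimate - np.trace(A) / n) < 0.01
```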
A.4 Distribution of Performance Error. Now suppose that X1, d1, X2, d2, and v are random and satisfy

   X1 and v are independent, the distribution of X1 is isotropic, and v has an isotropic distribution,   (A.16)

where conditions A.16 mean that UX1V has the same distribution as X1 for all orthogonal n1 × n1 matrices U and all orthogonal n × n matrices V. Then equation A.10 yields

   E[δ(w1, w2; X1, d1) | X1, X2] = (E[‖v‖²]/n) tr(X1^T X1 Z2^T Z2) − (w0 − w02)^T (2X2^T X2 + Z2^T Z2 X1^T X1 Z2^T Z2)(w0 − w02).   (A.17)

Taking expectations over X1 and X2 in equation A.17 gives the following general result on FLL: E[δ(w1, w2; X1, d1)] > 0 if and only if

   E[‖v‖²] > n² E[(w0 − w02)^T (2X2^T X2 + Z2^T Z2 X1^T X1 Z2^T Z2)(w0 − w02)] / (n1 n2 E[‖x‖²]),   (A.18)

where x denotes the first column of X1^T. The intuitive interpretation of this result is that if E[‖v‖²] is large enough, then there is FLL, whereas if P(w0 ≠ w02) > 0, then "negative FLL" can occur. In particular, if n1 + n2 ≤ n and P(v ≠ 0) > 0, then there is FLL.

A.5 The Case n1 + n2 ≤ n. In this section we assume that X1, d1, X2, and d2 are random and that

   (X1, d1), (X2, d2), and v are independent,   (A.19)
   the distribution of v is isotropic.   (A.20)

We suppose also that n1 + n2 ≤ n and that the distributions of X1, d1, X2, and d2 are continuous. Then, with probability 1, X1 w0 = d1 and X2 w0 = d2, so that w02 = w0 and equation A.10 reduces to

   δ(w1, w2; X1, d1) = v^T Z2^T Z2 X1^T X1 (2I_n − Z2^T Z2) v.   (A.21)
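The case n1 + n2 ≤ n lends itself to direct simulation. In the sketch below (our illustration, with gaussian data and arbitrary dimensions n = 20, n1 = n2 = 5), relearning A2 is implemented as an orthogonal projection onto L2 via the pseudoinverse, and δ > 0 occurs in well over half of the trials, consistent with Theorem 1 below:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n1, n2, trials = 20, 5, 5, 4000
wins = 0
for _ in range(trials):
    X1 = rng.standard_normal((n1, n))
    X2 = rng.standard_normal((n2, n))
    d1 = rng.standard_normal(n1)
    d2 = rng.standard_normal(n2)
    # Since n1 + n2 <= n, the stacked system is consistent: w0 learns both A1 and A2.
    w0 = np.linalg.lstsq(np.vstack([X1, X2]), np.concatenate([d1, d2]), rcond=None)[0]
    # Partial forgetting (isotropic perturbation v), then relearning of A2 only,
    # by orthogonal projection of w1 onto L2 = {w : X2 w = d2}.
    w1 = w0 + rng.standard_normal(n)
    w2 = w1 - np.linalg.pinv(X2) @ (X2 @ w1 - d2)
    # delta > 0 means relearning A2 also restored performance on A1 (free-lunch learning).
    delta = np.sum((X1 @ w1 - d1) ** 2) - np.sum((X1 @ w2 - d1) ** 2)
    wins += delta > 0
assert wins / trials > 0.6
```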
A.5.1 FLL Is More Probable Than Not. Let w1* be the reflection of w1 in L2, that is,

   w1* = w2 − (w1 − w2).

Consideration of the parallelogram with vertices at w0, w1, w1*, and w1 + w1* − w0 gives

   2‖X1(w1 − w0)‖² + 2‖X1(w1* − w0)‖²
    = ‖X1[(w1 − w0) + (w1* − w0)]‖² + ‖X1[(w1 − w0) − (w1* − w0)]‖²
    = 4‖X1(w2 − w0)‖² + 4‖X1(w1 − w2)‖²,

so that (since d1 = X1 w0)

   δ(w1, w2; X1, d1) + δ(w1*, w2; X1, d1)
    = ‖X1(w1 − w0)‖² + ‖X1(w1* − w0)‖² − 2‖X1(w2 − w0)‖²
    = 2‖X1(w1 − w2)‖² ≥ 0.

Thus if δ(w1, w2; X1, d1) < 0, then δ(w1*, w2; X1, d1) > 0. If v is distributed isotropically, then w1* − w0 is distributed isotropically, so that δ(w1*, w2; X1, d1) has the same distribution (conditionally on X1, d1, and X2) as δ(w1, w2; X1, d1), and so

   P(δ(w1, w2; X1, d1) < 0 | X1, d1, X2) ≤ P(δ(w1*, w2; X1, d1) > 0 | X1, d1, X2) = P(δ(w1, w2; X1, d1) > 0 | X1, d1, X2).   (A.22)

Further, if v ∈ L2 \ L1, then w2 = w1 = w1*, so that δ(w1, w2; X1, d1) = δ(w1*, w2; X1, d1) > 0. By continuity of δ, there is a neighborhood of v on which δ(w1, w2; X1, d1) > 0 and δ(w1*, w2; X1, d1) > 0. Thus, if L2 \ L1 ≠ ∅, then equation A.22 can be refined to

   P(δ(w1, w2; X1, d1) < 0 | X1, d1, X2) < P(δ(w1*, w2; X1, d1) > 0 | X1, d1, X2) = P(δ(w1, w2; X1, d1) > 0 | X1, d1, X2).   (A.23)

Since P(L2 ⊂ L1) = 0 and P(δ(w1, w2; X1, d1) < 0 | X1, d1, X2) is a continuous function of X1, d1, and X2, it follows from equation A.23 that P(δ(w1, w2; X1, d1) < 0) < P(δ(w1, w2; X1, d1) > 0), which implies the following result.
Theorem 1.  P(δ(w1, w2; X1, d1) > 0) > 0.5. This implies that the median of δ(w1, w2; X1, d1) is positive.

A.5.2 A Lower Bound for P(δ > 0). Our proof depends on Chebyshev's inequality, which states that for any positive value of t,

   P(|δ − E[δ]| ≥ t) ≤ var(δ)/t²,

where var(δ) denotes the variance of δ. If we set t = E[δ], then (since, by equation A.28, E[δ] > 0)

   P(δ ≤ 0) ≤ var(δ)/E[δ]².   (A.24)

This provides a lower bound for the probability of FLL. We prove that this bound approaches unity as n approaches infinity. Now we assume (in addition to conditions A.19 and A.20) that

   the distributions of X1 and X2 are isotropic.   (A.25)
It follows from equations A.21, A.14, and A.15 that

   E[δ(w1, w2; X1, d1) | Z2, v] = v^T Z2^T Z2 E[‖x‖²] (n1/n) I_n (2I_n − Z2^T Z2) v = (n1/n) E[‖x‖²] v^T Z2^T Z2 v,   (A.26)

where x is the first column of X1^T, and

   var(δ(w1, w2; X1, d1) | Z2, v) = n1 [E[‖x‖⁴] ((n − 2)‖Z2 v‖⁴ + n‖Z2 v‖² ‖(2I_n − Z2^T Z2)v‖²) / (n²(n + 2)) + var(‖x‖²) ‖Z2 v‖⁴ / n²].   (A.27)

Since v has an isotropic distribution, equations A.26, A.11, and A.13 imply that

   E[δ(w1, w2; X1, d1) | Z2, ‖v‖] = (n1 n2 / n²) E[‖x‖²] ‖v‖².   (A.28)
Given that there are n1 associations in the subset A1 that is not relearned, equation A.28 implies the following theorem about the expected amount of recovery per association in A1.

Theorem 2.  E[δ(w1, w2; X1, d1)/n1 | Z2, ‖v‖] = (n2/n²) E[‖x‖²] ‖v‖².   (A.29)
Equations A.26 and A.13 also imply that

   var(E[δ(w1, w2; X1, d1) | Z2, v] | ‖v‖)
    = (n1/n)² E[‖x‖²]² ‖v‖⁴ (2n n2 − 2n2²) / (n²(n + 2))
    = 2n1² n2 (n − n2) E[‖x‖²]² ‖v‖⁴ / (n⁴(n + 2)),   (A.30)

and it follows from equations A.27 and A.12 that

   E[var(δ(w1, w2; X1, d1) | Z2, v) | ‖v‖]
    = (n1 ‖v‖⁴ / (n(n + 2))) [E[‖x‖⁴] ((n − 2)n2(n2 + 2) + n n2(2n − n2 + 2)) / (n²(n + 2)) + var(‖x‖²) n2(n2 + 2)/n²]
    = (n1 n2 ‖v‖⁴ / (n³(n + 2)²)) [E[‖x‖⁴] · 2(n² + 2n − n2 − 2) + var(‖x‖²)(n + 2)(n2 + 2)].   (A.31)
Then equations A.30 and A.31 give

   var(δ(w1, w2; X1, d1) | ‖v‖)
    = 2n1² n2 (n − n2) E[‖x‖²]² ‖v‖⁴ / (n⁴(n + 2)) + (n1 n2 ‖v‖⁴ / (n³(n + 2)²)) [E[‖x‖⁴] · 2(n² + 2n − n2 − 2) + var(‖x‖²)(n + 2)(n2 + 2)]
    = (n1 n2 ‖v‖⁴ / (n⁴(n + 2)²)) {2[n1(n + 2)(n − n2) + n(n − n2) + n(n + 2)(n − 1)] E[‖x‖²]² + n²(2n + n2 + 6) var(‖x‖²)},

and so

   var(δ(w1, w2; X1, d1) | ‖v‖) / E[δ(w1, w2; X1, d1) | ‖v‖]² = (a0(n, n1, n2) + a1(n, n2) γ(n)) / (n1 n2 (n + 2)²),

where

   a0(n, n1, n2) = 2{n1(n + 2)(n − n2) + n(n − n2) + n(n + 2)(n − 1)},
   a1(n, n2) = n²(2n + n2 + 6),
   γ(n) = var(‖x‖²) / E[‖x‖²]².

Chebyshev's inequality implies the following theorem.

Theorem 3.  P(δ(w1, w2; X1, d1) ≤ 0 | ‖v‖) ≤ (a0(n, n1, n2) + a1(n, n2) γ(n)) / (n1 n2 (n + 2)²).

Since the right-hand side does not depend on ‖v‖, this gives the following result: if γ(n)/n → 0 and n1/n and n2/n are bounded away from zero as n → ∞, then

   P(δ(w1, w2; X1, d1) > 0) → 1  as  n → ∞.

Example. If x ~ N(0, σx² I_n), then

   E[‖x‖²] = n σx²,  var(‖x‖²) = 2n σx⁴,  γ(n) = 2/n,

and so P(δ(w1, w2; X1, d1) > 0) → 1 as n → ∞, provided that n1/n and n2/n are bounded away from zero.
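The gaussian example can be verified directly (a numerical sketch we add; the sample size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
x = rng.standard_normal((200000, n))     # gaussian x with sigma_x = 1
sq = np.sum(x ** 2, axis=1)              # ||x||^2 is chi-squared with n d.o.f.
gamma = np.var(sq) / np.mean(sq) ** 2    # gamma(n) = var(||x||^2) / E[||x||^2]^2
# For gaussian x: E[||x||^2] = n, var(||x||^2) = 2n, hence gamma(n) = 2/n.
assert abs(gamma - 2 / n) < 0.01
```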
A.5.3 Learning A3 Instead of A2. Now suppose that relearning of A2 is replaced by learning another subset A3 of n2 associations. Let the matrix X3 and vector d3 be such that the subspace L3 corresponding to A3 has the form L3 = L_{X3,d3}. Let w3 and w13 denote the orthogonal projections of w1 onto L3 and L1 ∩ L3, respectively. Then

   w3 = w13 + (I_n − Z3^T Z3)(w1 − w13),   (A.32)

and so

   w1 = w3 + Z3^T Z3 (w1 − w13).   (A.33)

From equation A.4 with w̃ = w13, and equations A.33 and A.32, we have

   δ(w1, w3; X1, d1) = (w1 − w3)^T X1^T X1 (w1 + w3 − 2w13)
    = (w1 − w13)^T Z3^T Z3 X1^T X1 (w1 + w3 − 2w13)
    = (v − ω̃)^T Z3^T Z3 X1^T X1 (2I_n − Z3^T Z3)(v − ω̃),   (A.34)

where ω̃ = w13 − w0. Since X1 w0 = X1 w13, equation A.34 can be expanded as

   δ(w1, w3; X1, d1) = v^T Z3^T Z3 X1^T X1 (2I_n − Z3^T Z3) v − v^T Z3^T Z3 X1^T X1 (2I_n − Z3^T Z3) ω̃ − ω̃^T Z3^T Z3 X1^T X1 (2I_n − Z3^T Z3) v − ω̃^T Z3^T Z3 X1^T X1 Z3^T Z3 ω̃,

and so

   E[δ(w1, w3; X1, d1) | X1, d1, X2, d2, X3, d3]
    = (E[‖v‖²]/n) tr(Z3^T Z3 X1^T X1 (2I_n − Z3^T Z3)) − ω̃^T Z3^T Z3 X1^T X1 Z3^T Z3 ω̃
    = (E[‖v‖²]/n) tr(X1^T X1 Z3^T Z3) − ‖X1 Z3^T Z3 ω̃‖².

Now assume that (X1, d1), (X2, d2), (X3, d3), and v are independent, and that the distributions of X1, X2, X3, and v are isotropic.
Since

   E[(E[‖v‖²]/n) tr(X1^T X1 Z2^T Z2)] = E[(E[‖v‖²]/n) tr(X1^T X1 Z3^T Z3)] = E[δ(w1, w2; X1, d1)],

we have the following theorem.

Theorem 4.  E[δ(w1, w3; X1, d1)] ≤ E[δ(w1, w2; X1, d1)].

Appendix B: Behavior of the Gradient Algorithm

If E is regarded as a function of w, then differentiation of equation A.2 shows that the gradient of E at w is

   ∇E(w) = 2X^T (Xw − d).

Then for any algorithm that takes an initial w^(0) to w^(1), w^(2), … using steps w^(t+1) − w^(t) in the direction of ∇E(w^(t)), the difference w^(t) − w^(0) is in the image of X^T X, and so is orthogonal to L_{X,d}. It follows that if ‖Xw^(t) − d‖² → min_w ‖Xw − d‖² as t → ∞, then w^(t) converges to the orthogonal projection of w^(0) onto L_{X,d}.

Appendix C: The Geometry of Performance Error When n1 = 1

Given associations A1 and A2, we prove that if n1 = 1 and input vectors have unit length (so that ‖x1‖ = 1), then the difference δ in performance errors on association A1 of w1 (i.e., after partial forgetting) and w2 (i.e., after relearning A2) is equal to the difference Q = p² − q². This proof supports the geometric account given in the article and in Figure 2 and does not (in general) apply if n1 > 1.

We begin by proving that (if n1 = 1 and ‖x1‖ = 1) the performance error of an association A1 for an arbitrary weight vector w1 is equal to the squared distance p² between w1 and its orthogonal projection w1′ onto the affine subspace L1 corresponding to A1. If n1 = 1, then L1 has the form L1 = {w : w · x1 = d1} for some x1 and d1. Given an arbitrary weight vector w1, we define the performance error on association A1 as

   E(w1, A1) = (w1 · x1 − d1)².   (C.1)
The orthogonal projection w1′ of w1 onto L1 is

   w1′ = w1 + ((d1 − w1 · x1)/‖x1‖²) x1,   (C.2)

so that

   d1 = w1′ · x1.   (C.3)

Substituting equation C.3 into C.1 and using C.2 yields

   E(w1, A1) = ‖w1 − w1′‖² ‖x1‖² = p² ‖x1‖².   (C.4)

Now suppose that ‖x1‖ = 1. Then E(w1, A1) = p², that is, the performance error is equal to the squared distance between the weight vectors w1 and w1′. The same line of reasoning can be applied to prove that E(w2, A1) = q². Thus, the difference δ in performance error on A1 for weight vectors w1 and w2 is

   δ = E(w1, A1) − E(w2, A1) = p² − q² = Q.

Acknowledgments

Thanks to S. Isard for substantial help with the analysis presented here; to R. Lister, S. Eglen, P. Parpia, A. Farthing, P. Warren, K. Gurney, N. Hunkin, and two anonymous referees for comments; and to J. Porrill for useful discussions.

References

Atkins, P. (2001). What happens when we relearn part of what we previously knew? Predictions and constraints for models of long-term memory. Psychological Research, 65(3), 202–215.
Atkins, P., & Murre, J. (1998). Recovery of unrehearsed items in connectionist models. Connection Science, 10(2), 99–119.
Coltheart, M., & Byng, S. (1989). A treatment for surface dyslexia. In X. Seron (Ed.), Cognitive approaches in neuropsychological rehabilitation. Mahwah, NJ: Erlbaum.
Dean, P., Porrill, J., & Stone, J. V. (2002). Decorrelation control by the cerebellum achieves oculomotor plant compensation in simulated vestibulo-ocular reflex. Proceedings of the Royal Society B, 269(1503), 1895–1904.
Dienes, Z., Altmann, G., & Gao, S.-J. (1999). Mapping across domains without feedback: A neural network model of transfer of implicit knowledge. Cognitive Science, 23, 53–82.
Doya, K. (1999). What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Networks, 12(7–8), 961–974.
Goldin, M., Segal, M., & Avignone, E. (2001). Functional plasticity triggers formation and pruning of dendritic spines in cultured hippocampal networks. Journal of Neuroscience, 21(1), 186–193.
Hanson, S. J., & Negishi, M. (2002). On the emergence of rules in neural networks. Neural Computation, 14, 2245–2268.
Harvey, I., & Stone, J. V. (1996). Unicycling helps your French: Spontaneous recovery of associations by learning unrelated tasks. Neural Computation, 8, 697–704.
Hinton, G., & Plaut, D. (1987). Using fast weights to deblur old memories. In Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, 177–186.
Mardia, K. V., & Jupp, P. E. (2000). Directional statistics. New York: Wiley.
Nakahara, H., Itoh, H., Kawagoe, R., Takikawa, Y., & Hikosaka, O. (2004). Dopamine neurons can represent context-dependent prediction error. Neuron, 41(2), 269–280.
Purves, D., & Lichtman, J. (1980). Elimination of synapses in the developing nervous system. Science, 210, 153–157.
Seymour, B., O'Doherty, J. P., Dayan, P., Koltzenburg, M., Jones, A. K., Dolan, R. J., Friston, K. J., & Frackowiak, R. (2004). Temporal difference models describe higher order learning in humans. Nature, 429, 664–667.
Stone, J. V., Hunkin, N. M., & Hornby, A. (2001). Predicting spontaneous recovery of memory. Nature, 414, 167–168.
Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Weekes, B., & Coltheart, M. (1996). Surface dyslexia and surface dysgraphia: Treatment studies and their theoretical implications. Cognitive Neuropsychology, 13, 277–315.
Received August 1, 2005; accepted May 19, 2006.
LETTER
Communicated by Mark Girolami
Linear Multilayer ICA Generating Hierarchical Edge Detectors

Yoshitatsu Matsuda
[email protected]

Kazunori Yamaguchi
[email protected]

Kazunori Yamaguchi Laboratory, Department of General Systems Studies, Graduate School of Arts and Sciences, University of Tokyo, Tokyo, Japan 153-8902
In this letter, a new ICA algorithm, linear multilayer ICA (LMICA), is proposed. There are two phases in each layer of LMICA. One is the mapping phase, where a two-dimensional mapping is formed by moving more highly correlated (nonindependent) signals closer with the stochastic multidimensional scaling network. The other is the local-ICA phase, where each neighboring (namely, highly correlated) pair of signals in the mapping is separated by the MaxKurt algorithm. Because in LMICA only a small number of highly correlated pairs have to be separated, it can extract edge detectors efficiently from natural scenes. We conducted numerical experiments and verified that LMICA generates hierarchical edge detectors from large-size natural scenes.

1 Introduction

Independent component analysis (ICA) is a recently developed method in the fields of signal processing and artificial neural networks, and it has been shown to be quite useful for the blind separation problem (Jutten & Hérault, 1991; Comon, 1994; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996). Linear ICA is formalized as follows. Let s and A be N-dimensional source signals and an N × N mixing matrix. Then the observed signals x are defined as

   x = As.   (1.1)

The problem to solve is to find A (or its inverse, W) when only the observed (mixed) signals are given. In other words, ICA blindly extracts the source signals from M samples of the observed signals as follows:

   Ŝ = WX,   (1.2)
where X is an N × M matrix of the observed signals and Ŝ is the estimate of the source signals. Although this is a typical ill-conditioned problem, ICA can solve it if the source signals are generated according to independent and nongaussian probability distributions.

Neural Computation 19, 218–230 (2007)
© 2006 Massachusetts Institute of Technology

Concretely speaking, ICA algorithms find W by minimizing a criterion (called the contrast function) that is defined by higher-order statistics (e.g., kurtosis) of the components of Ŝ, and many methods have been proposed for the minimization, for example, fast ICA (Hyvärinen, 1999) and the relative gradient algorithm (Cardoso & Laheld, 1996). Because all of these previous algorithms try to estimate all the N² components of W rigorously, their time complexity is O(N²), which is intractable for large N.

For actual data such as natural scenes, W is given not randomly but according to some underlying structure in the data. LMICA (linear multilayer ICA), which we proposed previously (Matsuda & Yamaguchi, 2003, 2004, 2005b), utilizes such structure for estimating W. By gradually improving an estimate of W, it could finally find a fairly good estimate of W quite efficiently. This letter extends that work with an improvement of the algorithm, using the two-dimensional stochastic multidimensional scaling (MDS) network, and with exhaustive numerical experiments (generating hierarchical edge detectors).

This letter is organized as follows. In section 2, the algorithm is described. In section 3, numerical experiments verify that LMICA can extract edge detectors efficiently from natural scenes and that it can generate hierarchical edge detectors from large-size scenes. The letter is concluded in section 4.

2 Algorithm

2.1 Basic Idea. It is assumed that X is whitened initially. LMICA tries to extract the independent components approximately by repetition of the following two phases: the mapping phase, which brings more highly correlated signals nearer, and the local-ICA phase, where each neighboring pair of signals in the mapping is separated by the MaxKurt algorithm (Cardoso, 1999).
Intuitively, the mapping phase finds only a few pairs that are more "important" for estimating W, and the local-ICA phase optimizes only those "important" pairs. The mechanism of LMICA is illustrated in Figure 1. Note that this illustration shows the ideal case, where N signals are separated in only O(log N) layers. Although this does not hold for an arbitrary W, it will be shown in section 3 that natural scenes can be separated quite effectively by this method with a two-dimensional mapping.

2.2 Mapping Phase. In the mapping phase, the current signals X are arranged in a two-dimensional array so that pairs of signals taking higher Σ_k x_ik² x_jk² are placed nearer. In order to efficiently calculate such an array, the stochastic MDS network in Matsuda and Yamaguchi (2005a) is used. Its main procedure is the repetition of the calculation of the center of gravity and the slide of each signal toward the center in proportion to its value.
Figure 1: LMICA (the ideal case). Each number from 1 to 8 means a source signal. In the first local-ICA phase, each neighbor pair of the completely mixed signals (denoted 1-8) is partially separated into 1-4 and 5-8. Next, the mapping phase rearranges the partially separated signals so that more highly correlated signals are nearer. In consequence, the four 1-4 signals (similarly, 5-8 ones) are brought nearer. Then the local-ICA phase partially separates the pairs of neighbor signals into 1-2, 3-4, 5-6, and 7-8. By repetition of these two phases, LMICA can extract all the sources quite efficiently.
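The separation problem that Figure 1 depicts can be stated concretely in a few lines. In this toy sketch (our illustration; in the blind setting the mixing matrix A is of course unknown), applying the exact unmixing matrix W = A^{-1} of equation 1.2 recovers the nongaussian sources of equation 1.1:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 3, 1000
S = rng.laplace(size=(N, M))        # independent, nongaussian source signals
A = np.array([[2.0, 1.0, 0.0],      # a fixed, well-conditioned mixing matrix
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
X = A @ S                           # observed signals (equation 1.1)

W = np.linalg.inv(A)                # ideal unmixing matrix
S_hat = W @ X                       # equation 1.2
assert np.allclose(S_hat, S)
```

ICA algorithms such as LMICA must instead estimate W from X alone, exploiting the nongaussianity of the sources.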
The original stochastic MDS network is given as follows (see Matsuda & Yamaguchi, 2005a, for details):

1. Place given signals Z = (z_ik) on a two-dimensional array randomly, where each signal i (the ith row of Z) corresponds to a component of the array through a one-to-one correspondence. The x- and y-coordinates of a signal i on the array are denoted as m_i1 and m_i2. Note that m_i1 and m_i2 take discrete values.

2. Pick up a column of Z randomly, whose index is denoted as p. The column can be regarded as a randomly selected sample p of signals.

3. Calculate the center of gravity for the sample p by

   m_l^g(Z, p) = (Σ_i z_ip m_il) / (Σ_i z_ip),   (2.1)

where l = 1 or 2.
Figure 2: Discretized update on the stochastic MDS network (excerpted from Matsuda & Yamaguchi, 2005a, pp. 289). It illustrates each step of the discretized update rule. Each small square represents a signal on the discrete two-dimensional array (denoted a unit in this figure). In the first step, the destination is calculated by adding the current offset vector to the new coordinates. Second, the signals on the route are detected. Third, the signal to be updated is moved by exchanging the signals on the route one by one. Finally, the fraction under the discretization is preserved as the offset vector.
4. Calculate the new coordinate m_il for each signal i by

   m_il := m_il − λ z_ip (m_il − m_l^g),   (2.2)

where λ is the step size.

5. Update the coordinates of each signal on the array approximately according to equation 2.2 under the constraints that every coordinate m_il has to be on the discrete two-dimensional array. Such discretized updates are conducted by giving an offset vector (which preserves the rounded-off fraction under the discretization) to each signal and exchanging the signals one by one on the array. The mechanism is roughly illustrated in Figure 2. The details are omitted in this letter.

6. Terminate the process if a termination condition is satisfied. Otherwise, return to step 2.

It can be shown that this process approximately minimizes an error function,

   C = Σ_{i,j} (Σ_k z_ik z_jk)(Σ_l (m_il − m_jl)²),   (2.3)
under the conditions that the location of each signal (m_i1, m_i2) is bound to a unit in the discrete two-dimensional array through a one-to-one correspondence. Because the minimization of C makes more highly correlated pairs of signals (taking larger Σ_k z_ik z_jk) be closer (smaller Σ_l (m_il − m_jl)²), it is shown that this process generates a topographic mapping where highly correlated signals are placed near each other. Some modifications are needed in order to make the original stochastic MDS network suitable for LMICA:
- z_ip is given as x_ip² − 1, where p is selected randomly from 1 to M at each update.
- Because the convergence of the original network is too slow and tends to drop into a local minimum if the signals Z take continuous values, the following elaboration is utilized. At each update, the signals are classified into two groups: the positive group σ+ consisting of signals satisfying z_ip > 0 and the negative one σ− of z_ip < 0. Then equations 2.1 and 2.2 are calculated for each group. Note that each signal of the negative group is moved away from the center of gravity. Numerical experiments (omitted in this article) showed that this improvement gave more accurate results more efficiently than the original network.
- Two learning stages are often used in order to make local neighbor relations more accurate if there are many signals. The first is the usual stage for the whole array. The second is the locally tuning stage, where the array is divided into some small areas and the algorithm is applied to each area.
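Ignoring the discretization and the offset-vector bookkeeping, one grouped update of the modified network can be sketched as follows (a continuous-coordinate simplification we add for illustration, following equations 2.4 and 2.5 of the mapping-phase procedure):

```python
import numpy as np

def mapping_update(X, m, lam, rng):
    """One stochastic update of the signal coordinates m (num_signals x 2):
    pick a random sample p, form z_ip = x_ip^2 - 1, and move each group
    relative to its own center of gravity (positive group: toward it;
    negative group: away from it)."""
    p = rng.integers(X.shape[1])
    z = X[:, p] ** 2 - 1.0
    for group in (z > 0, z < 0):
        if group.any():
            g = (z[group, None] * m[group]).sum(axis=0) / z[group].sum()
            m[group] -= lam * z[group, None] * (m[group] - g)
    return m
```

Because z_ip is negative for the signals in σ−, the update −λ z_ip (m_il − m_l^{g−}) moves them away from their center of gravity, as described above.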
The total procedure of the mapping phase for given X, W, A, and a given two-dimensional array is described as follows:

Mapping Phase
1. Allocate each signal i to a component of the two-dimensional array by a randomly selected one-to-one correspondence.
2. Repeat the following steps over T times with a decreasing step size λ:
   a. Select p randomly from {1, …, M}, and let z_ip = x_ip² − 1 for each i.
   b. Calculate the two centers of gravity:
      m_l^{g+}(Z, p) = (Σ_{i∈σ+} z_ip m_il) / (Σ_{i∈σ+} z_ip)  and  m_l^{g−}(Z, p) = (Σ_{i∈σ−} z_ip m_il) / (Σ_{i∈σ−} z_ip).   (2.4)
   c. Update each m_il by
      m_il := m_il − λ z_ip (m_il − m_l^{g+})  if z_ip > 0,
      m_il := m_il − λ z_ip (m_il − m_l^{g−})  if z_ip < 0,   (2.5)
      under the constraints that every m_il is on the discrete array.
3. Divide the array into small areas, and apply the above process to each area if there are many signals.
4. Rearrange the rows of X and W and the columns of A according to the generated array.

2.3 Local-ICA Phase. In the local-ICA phase, the following contrast function φ(X) (the minus sum of kurtoses) is used (it is the same one as in the MaxKurt algorithm; Cardoso, 1999):

   φ(X) = −Σ_i ((Σ_k x_ik⁴)/M − 3).   (2.6)

φ(X) is minimized by "rotating" nearest-neighbor pairs of signals on the array. For each nearest-neighbor pair (i, j), a rotation matrix R(θ) is given as

   R(θ) = [ cos θ   sin θ
           −sin θ   cos θ ].   (2.7)

Then the optimal angle θ̂ is given as

   θ̂ = argmin_θ (−Σ_k (x_ik′⁴ + x_jk′⁴)),   (2.8)

where x_ik′ = cos θ · x_ik + sin θ · x_jk and x_jk′ = −sin θ · x_ik + cos θ · x_jk. After some tedious transformation of this equation (see Cardoso, 1999), it is shown that θ̂ is determined analytically by the following equations:

   sin 4θ̂ = α_ij / √(α_ij² + β_ij²)  and  cos 4θ̂ = β_ij / √(α_ij² + β_ij²),   (2.9)

where

   α_ij = Σ_k (x_ik³ x_jk − x_ik x_jk³)  and  β_ij = Σ_k (x_ik⁴ + x_jk⁴ − 6 x_ik² x_jk²) / 4.   (2.10)

Now, the procedure of the local-ICA phase for given X, W, A, and an array is described as follows:

Local-ICA Phase
1. For every nearest-neighbor pair of signals on the array, (i, j):
   a. Find the optimal angle θ̂ by equation 2.9.
   b. Rotate the corresponding parts of X, W, and A.
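Equations 2.9 and 2.10 can be realized and checked numerically; one convenient form of equation 2.9 is θ̂ = (1/4) atan2(α_ij, β_ij). The sketch below (our illustration, with synthetic Laplace data) confirms that this angle maximizes the sum of fourth moments over all rotations:

```python
import numpy as np

def maxkurt_angle(xi, xj):
    # Equations 2.9 and 2.10: closed-form optimal rotation angle for a pair.
    alpha = np.sum(xi**3 * xj - xi * xj**3)
    beta = np.sum(xi**4 + xj**4 - 6 * xi**2 * xj**2) / 4.0
    return 0.25 * np.arctan2(alpha, beta)

def fourth_moment_sum(xi, xj, theta):
    # The quantity maximized in equation 2.8: sum_k x'_ik^4 + x'_jk^4.
    c, s = np.cos(theta), np.sin(theta)
    return np.sum((c * xi + s * xj) ** 4 + (-s * xi + c * xj) ** 4)

rng = np.random.default_rng(6)
xi, xj = rng.laplace(size=(2, 2000))
theta_hat = maxkurt_angle(xi, xj)

# The analytic angle should beat every angle on a fine grid.
best = max(fourth_moment_sum(xi, xj, t) for t in np.linspace(-np.pi, np.pi, 3601))
assert fourth_moment_sum(xi, xj, theta_hat) >= best - 1e-6
```

This works because the sum of fourth moments has the form const + β_ij cos 4θ + α_ij sin 4θ, so the optimum is attained at 4θ̂ = atan2(α_ij, β_ij).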
2.4 Complete Algorithm. The complete algorithm of LMICA for any given observed signals X and an array fixing the shape of the mapping is given by repeating the mapping phase and the local-ICA phase alternately:

Linear Multilayer ICA Algorithm
1. Initial settings: Let X be a whitened observed signal and W and A be the N × N identity matrix.
2. Repetition: Do the following two phases alternately over L times:
   a. Do the mapping phase in section 2.2.
   b. Do the local-ICA phase in section 2.3.

2.5 Some Remarks

2.5.1 Relation to MaxKurt Algorithm. Equation 2.9 is the same as in the MaxKurt algorithm (Cardoso, 1999). The crucial difference between LMICA and MaxKurt is that LMICA optimizes just the nearest-neighbor pairs instead of all the N(N − 1)/2 ones in MaxKurt. In LMICA, the pairs with higher costs (higher Σ_k x_ik² x_jk²) are brought nearer in the mapping phase. Approximations of independent components can be extracted effectively by optimizing just the neighbor pairs.

2.5.2 Prewhitening. Although LMICA is applicable to any prewhitened signals, the selection of the whitening method is actually crucial for its performance. It is shown in section 3.1 that ZCA (zero-phase component analysis) is more suitable than principal component analysis (PCA) if natural scenes are given as the observed signals. The ZCA filter is given as

   X := Σ^{−1/2} X,   (2.11)

where Σ is the covariance matrix of X. It has been known that the ZCA filter whitens the given signals while preserving the spatial relationship if natural scenes are given as X (Li & Atick, 1994; Bell & Sejnowski, 1997).

3 Results

It is well known that various local edge detectors can be extracted from natural scenes by the standard ICA algorithm (Bell & Sejnowski, 1997; van Hateren & van der Schaaf, 1998). Here, LMICA was applied to the same problem.

3.1 Small Natural Scenes. Thirty thousand samples of natural scenes of 12 × 12 pixels were given as the observed signals X. That is, N and M were 144 and 30,000. Original natural scenes were downloaded from http://www.cis.hut.fi/projects/ica/data/images/. X is then whitened by
Figure 3: Decreasing curves of the contrast function φ along the times of optimizations for pairs of signals. (a) The standard LMICA. (b) LMICA using the identity mapping in the mapping phase, which preserves the original topography for the ZCA filter. (c) MaxKurt. (d) LMICA for the PCA-whitened signals.
ZCA. In the mapping phase, a 12 × 12 array was used, where the learning length T was 100,000 with the step size λ = 100/(100 + t) (t is the current time, which is the number of updates to that point). The calculation time for 400 layers of LMICA with an Intel 2.8 GHz CPU was about 40 minutes, about 60% of which was for the mapping phase. This shows that the mapping phase is so efficient that its computational costs are not a bottleneck in LMICA.

Figure 3 shows the decreasing curves of the contrast function φ along the times of optimizations (rotations) for pairs of signals, where the values of φ are averaged over 10 trials for independently sampled Xs. For comparison, the following four experiments were done: (1) LMICA; (2) LMICA using not the optimized rearrangement but the simple identity mapping in the mapping phase, which preserves the original 12 × 12 topography of scenes for the ZCA filter; (3) MaxKurt, where all the pairs are optimized one by one in random order; and (4) LMICA for the PCA-whitened observed signals. Note that one layer of LMICA without the mapping phase and one iteration of MaxKurt are equivalent to 12 × 11 × 2 = 264 and 144 × 143 / 2 = 10,296 pair optimizations, respectively. In Figure 3, the standard LMICA gives the best solution everywhere except the first few optimizations and the late ones. Though LMICA using the original topography is the best within about the first 1000 optimizations, it rapidly converged to local minima. On the other hand, MaxKurt and LMICA for PCA became superior to the others only after more than 10,000 optimizations.
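The ZCA filter of equation 2.11 can be implemented with a symmetric eigendecomposition, as in the sketch below (our illustration; the row centering and the small ridge term eps are additions to guard against tiny eigenvalues):

```python
import numpy as np

def zca_whiten(X, eps=1e-12):
    """ZCA filter: X := Sigma^{-1/2} X, with Sigma the covariance of the
    row-centered signals. The symmetric square root makes the filter
    zero-phase, preserving the spatial layout of the pixels."""
    Xc = X - X.mean(axis=1, keepdims=True)
    sigma = np.cov(Xc)
    vals, vecs = np.linalg.eigh(sigma)
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return inv_sqrt @ Xc

rng = np.random.default_rng(7)
X = rng.standard_normal((4, 4)) @ rng.standard_normal((4, 5000))
Xw = zca_whiten(X)
assert np.allclose(np.cov(Xw), np.eye(4), atol=1e-2)
```

A PCA whitening would instead use the rotated filter diag(vals)^{-1/2} vecs^T, which does not preserve the pixel layout.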
Figure 4: Edge detectors from natural scenes of 12 × 12 pixels after 5280 optimizations. Each shows 144 edge detectors of 12 × 12 pixels from A. (a) LMICA. (b) LMICA using the original topography (namely, the identity mapping). (c) MaxKurt. (d) Fast ICA (g(u) = tanh(u)).
Figures 4a to 4c show the edge detectors after 5280 optimizations (equivalent to the twentieth layer of LMICA) by LMICA, LMICA using the original topography, and MaxKurt, respectively. For comparison, Figure 4d is the result by fast ICA with the widely used nonlinear function g(u) = tanh(u). Figures 4a and 4d show that LMICA could quite rapidly generate edge detectors similar to those in fast ICA. It is especially surprising that the number of optimizations in LMICA (5280) is about half of the degrees of freedom of the mixing matrix A (10,296). It suggests that LMICA gives an
Linear Multilayer ICA Generating Hierarchical Edge Detectors
Figure 5: Representative edge detectors from large natural scenes at each layer: (a) the 0th layer (ZCA), (b) the 10th layer, (c) the 50th layer, (d) the 300th layer. Each panel shows 20 representative edge detectors of A from scenes of 64 × 64 pixels.
effective model for the ICA processing of natural scenes. There are no clear edges in Figures 4b and 4c. Each detector in these figures is extremely localized and has no (or only a very weak) orientation preference. This shows that the mapping phase of LMICA plays a crucial role in the rapid formation of edge detectors.

3.2 Large Natural Scenes. Here, 100,000 samples of natural scenes of 64 × 64 pixels were given as X. Fast ICA, MaxKurt, and other well-known ICA algorithms are not applicable to such a large-scale problem because they require huge computations. For example, fast ICA based on kurtosis spent about 45 minutes processing the small images (30,000 samples of 12 × 12 pixels). Under the rough assumption that the calculation time is proportional to the number of samples and the number of parameters to be estimated, it would require about 2000 hours to process the large images. This estimate is rather optimistic because it assumes that the number of updates for
[Figure 6 panels (a)-(d): histograms of edge lengths (x-axis: length of edges, 0-14; y-axis: frequency, 0-4000) at the 0th, 10th, 50th, and 300th layers.]
Figure 6: Histograms of edge lengths at each layer of LMICA for large, natural scenes. The lengths were calculated as the full width at half maximum (FWHM) of gaussian approximations of Hilbert transformation of W by a method similar to that of van Hateren and van der Schaaf (1998).
convergence is constant regardless of the number of parameters. In the mapping phase, a 64 × 64 array was used for T = 1,500,000; the array was then divided into 16 arrays of 16 × 16 components, and each array was optimized for T = 500,000. LMICA was carried out over L = 300 layers and consumed about 170 hours on an Intel 2.8 GHz CPU. Figure 5 shows some representative edge detectors at the 0th (ZCA filter), 10th, 50th, and 300th layers. The histograms of the lengths of edges at each layer are shown in Figure 6. At the 0th layer (ZCA), there are only pixel detectors of length zero (see Figures 5a and 6a). Many short edge detectors were then generated after just 10 layers (see Figures 5b and 6b). At the 50th layer, the edges were longer (see Figures 5c and 6c). At the final, 300th layer (see Figures 5d and 6d), there are some long edges. In addition, it is interesting that some "compound" detectors were observed, where multiple edges seem to be included in a single detector.
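The roughly 2000-hour extrapolation quoted in section 3.2 can be reproduced from the stated scaling assumption: runtime proportional to the number of samples times the number of parameters, where an n-pixel problem has n × n mixing-matrix entries.

```python
# Back-of-the-envelope check of the runtime extrapolation for fast ICA
# on the large images, under the proportionality assumption above.
small_minutes = 45.0                              # measured on 12 x 12 images
sample_ratio = 100_000 / 30_000                   # large vs. small sample counts
param_ratio = (64 * 64) ** 2 / (12 * 12) ** 2     # mixing-matrix entries
large_hours = small_minutes * sample_ratio * param_ratio / 60.0
# large_hours comes out close to 2000, matching the text's estimate.
```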
4 Conclusion

In this letter, we proposed linear multilayer ICA (LMICA). We carried out numerical experiments on natural scenes, which verified that LMICA can extract edge detectors quite efficiently. We also showed that LMICA can generate hierarchical edge detectors from large natural scenes, where short edges exist in lower layers and longer edges in higher ones.

Although some multilayer models have been employed in ICA (e.g., Hyvärinen & Hoyer, 2000, and Hoyer & Hyvärinen, 2002), the purpose of and rationale for them are quite different from those of LMICA. Their multilayer networks were proposed to construct more powerful models of ICA, which include nonlinear connections and allow some dependencies between sources. LMICA, on the other hand, is constructed only for the efficient calculation of the usual linear ICA. Nevertheless, it is interesting that the structure of their multilayer models (locally connected multilayer networks) is quite similar to that of LMICA. It may be promising to extend LMICA with some nonlinearity.

We are planning to apply LMICA to applications in image processing, such as image compression and digital watermarking, and to utilize it for other large-scale problems such as text mining. In addition, we are trying to develop a faster method for the mapping phase; some batch-learning techniques may be promising. We also note that it is known that edge detectors are not formed at the maxima of kurtosis, and that some different nonlinearity (e.g., tanh) is needed; thus, the choice of nonlinearity in the contrast function is quite important and sensitive in the usual ICA models, such as the InfoMax model of Bell and Sejnowski (1997). LMICA, on the other hand, can generate hierarchical edge detectors by gradually increasing simple kurtoses.
This suggests that LMICA may be able to extract edge detectors by using more general contrast functions, but further experiments are needed to test this hypothesis. Finally, LMICA with kurtoses is expected to be as sensitive to outliers as other cumulant-based ICA algorithms are; different contrast functions would be needed for noisy signals. In addition, LMICA is applicable only if the number of sources is the same as that of observed signals, because it utilizes the ZCA filter. In order to apply LMICA to undercomplete cases, a new whitening method would have to be developed, one that can decrease the number of signals while preserving the topography of images. It is more difficult for LMICA to deal with overcomplete cases, because it is based on the simplest contrast function without any generative model of sources. Nevertheless, it is interesting that LMICA could generate many edge detectors with different lengths at every layer. These detectors appear to be overcomplete bases, though without any theoretical foundation so far. Further analysis of these bases may be promising.
References

Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327-3338.
Cardoso, J.-F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157-192.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Transactions on Signal Processing, 44(12), 3017-3030.
Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287-314.
Hoyer, P. O., & Hyvärinen, A. (2002). A multi-layer sparse coding network learns contour coding from natural images. Vision Research, 42(12), 1593-1605.
Hyvärinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626-634.
Hyvärinen, A., & Hoyer, P. O. (2000). Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Computation, 12(7), 1705-1720.
Jutten, C., & Hérault, J. (1991). Blind separation of sources (part I): An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1-10.
Li, Z., & Atick, J. J. (1994). Towards a theory of the striate cortex. Neural Computation, 6, 127-146.
Matsuda, Y., & Yamaguchi, K. (2003). Linear multilayer ICA algorithm integrating small local modules. In Proceedings of ICA2003 (pp. 403-408). Nara, Japan.
Matsuda, Y., & Yamaguchi, K. (2004). Linear multilayer independent component analysis using stochastic gradient algorithm. In C. G. Puntonet & A. Prieto (Eds.), Independent component analysis and blind source separation—ICA2004 (pp. 303-310). Berlin: Springer-Verlag.
Matsuda, Y., & Yamaguchi, K. (2005a). An efficient MDS-based topographic mapping algorithm. Neurocomputing, 64, 285-299.
Matsuda, Y., & Yamaguchi, K. (2005b). Linear multilayer independent component analysis for large natural scenes. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 897-904). Cambridge, MA: MIT Press.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proceedings of the Royal Society of London: B, 265, 359-366.
Received December 16, 2005; accepted April 5, 2006.
LETTER
Communicated by Andries P. Engelbrecht
Functional Network Topology Learning and Sensitivity Analysis Based on ANOVA Decomposition

Enrique Castillo
[email protected] Department of Applied Mathematics and Computational Sciences, University of Cantabria and University of Castilla–La Mancha, Spain
Noelia Sánchez-Maroño
[email protected]
Amparo Alonso-Betanzos
[email protected]
Computer Science Department, University of A Coruña, Spain
Carmen Castillo
[email protected] Department of Civil Engineering, University of Castilla–La Mancha, Spain
A new methodology for learning the topology of a functional network from data, based on the ANOVA decomposition technique, is presented. The method determines sensitivity (importance) indices that allow a decision to be made as to which set of interactions among variables is relevant and which is irrelevant to the problem under study. This immediately suggests the network topology to be used in a given problem. Moreover, local sensitivities to small changes in the data can be easily calculated. In this way, the dual optimization problem gives the local sensitivities. The methods are illustrated by their application to artificial and real examples.
1 Introduction

Functional networks (FN) have been proposed by E. Castillo (1998; Castillo, Cobo, Gutiérrez, & Pruneda, 1998) and have been shown to be successful in solving many physical or engineering problems (see Castillo & Gutiérrez, 1998; Castillo, Cobo, Gutiérrez, & Pruneda, 2000). FN are a generalization of neural networks (NN) that combine knowledge about the structure of the problem and data (Castillo et al., 1998). There are important differences between FN and NN. Standard NN usually have a rigid topology (only the number of layers and the number of neurons can be chosen) and fixed neural functions, so that only the weights are learned. In FN, the neural

Neural Computation 19, 231-257 (2007)
© 2006 Massachusetts Institute of Technology
topology is initially free, and the neural functions can be selected from a wide range of families, so that both the topology and the parameters of the neural functions are learned. In addition, FN incorporate knowledge about the network topology or the neural functions, or both. This knowledge can come from two main sources: (1) the knowledge the user has about the problem being solved, which can be written in terms of certain properties or characteristics of the model, or (2) the available data, which can be scrutinized to obtain the structure. The authors cited have described how the first type of knowledge can be used to derive the network topology. This is done mainly with the help of functional equations, which both suggest an initial topology and allow this initial topology to be simplified, leading to an equivalent simpler one. Thus, one of the main tools for building the network topology from this type of knowledge is functional equations. Such equations have been extensively described in the literature (see, e.g., Aczél, 1966; Castillo & Ruíz-Cobo, 1992; Castillo, Cobo, Gutiérrez, & Pruneda, 1999; Castillo, Iglesias, & Ruíz-Cobo, 2004). In this way, although FN can also be used as black boxes, the network is a white (rather than a black) box, whose structure has a knowledge-based foundation. However, no efficient and clean method for deriving the network topology from data has yet been devised. This is one of the aims of this article.

Sensitivity analysis is another area of great interest (Saltelli, Chan, & Scott, 2000; Castillo, Gutiérrez, & Hadi, 1997; Chatterjee & Hadi, 1988; Hadi & Nyquist, 2002); the concern is not only with learning a model but also with the sensitivity of the model parameters to the data. Sensitivity can be understood as local or global.
Local sensitivity aims at discovering how the model changes as small modifications are made to the data and at determining which data values have the greatest influence on the model when changed in small increments. Global sensitivity aims at discovering how sensitive the model is to the inputs and whether some inputs can be removed without a significant loss in output quality. Different studies have been carried out regarding model approximation (Sacks, Mitchell, & Wynn, 1989; Koehler & Owen, 1996; Currin, Mitchell, Morris, & Ylvisaker, 1991). Some of them use regression strategies similar to, but different from, the one described in Jiang and Owen (2001).

In this letter, we present a methodology based on the ANOVA decomposition that permits the topology of a functional network to be learned from data. The ANOVA decomposition also permits global sensitivity indices to be obtained for each variable and for each interaction between variables. The topology of a functional network is derived from this information, and low- and high-order interactions among variables are easily determined using ANOVA. Finally, a local sensitivity analysis can be carried out too. In this way, the final model can be fully defined.

The letter is structured as follows. Section 2 gives a quick introduction to functional networks and describes the ANOVA decomposition, including global sensitivity indices. Section 3 describes the proposed method for
learning the network topology and the local and global sensitivity indices from data. Section 4 presents two illustrative examples, and section 5 contains the conclusions.

2 Background Knowledge

2.1 A Brief Introduction to Functional Networks. Functional networks (FN) are a generalization of neural networks that brings together domain knowledge, to determine the structure of the problem, and data, to estimate the unknown functional neurons (Castillo et al., 1998). In FN, there are two types of learning to deal with this domain and data knowledge:
- Structural learning, which includes the initial topology of the network and its posterior simplification using functional equations, leading to a simpler equivalent structure.
- Parametric learning, concerned with the estimation of the neuron functions. This can be done by considering linear combinations of given functional families and estimating the associated parameters from the available data. Note that this type of learning generalizes the idea of estimating the weights of the connections in a neural network.
In FN, not only are arbitrary neural functions allowed, but they are initially assumed to be multiargument and vector-valued functions; that is, they depend on several arguments and are multivariate. This fact is shown in Figure 1a. In this figure, we can also see some relevant differences between neural and functional networks. Note that the FN has no weights, and the parameters to be learned are incorporated into the neural functions f_i; i = 1, 2, 3. These neural functions are unknown functions from a given family (e.g., polynomial functions) to be estimated during the learning process. For example, the neural functions f_i could be approximated by

f_i(x_i, x_j) = \sum_{k_i=0}^{m_i} a_{i k_i} x_i^{k_i} + \sum_{k_j=0}^{m_j} a_{i k_j} x_j^{k_j},

and the parameters to be learned will be the coefficients a_{i k_i} and a_{i k_j}. As each function f_i is learned, a different function is obtained for each neuron. It can be argued that functional networks require domain knowledge for deriving the functional equations and make assumptions about the form the unknown functions should take. However, like neural networks, functional networks can also be used as a black box.

2.2 The ANOVA Decomposition. The analysis of variance (ANOVA) was developed by Fisher (1918) and later further developed by other authors (Efron & Stein, 1981; Sobol, 1969; Hoeffding, 1948).
Figure 1: (a) A neural network. (b) A functional network.
According to Sobol (2001), any square integrable function f(x_1, \ldots, x_n) defined on the unit hypercube [0, 1]^n can be written as

y = f(x_1, \ldots, x_n) = f_0 + \sum_{i_1=1}^{n} f_{i_1}(x_{i_1}) + \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} f_{i_1 i_2}(x_{i_1}, x_{i_2}) + \sum_{i_1=1}^{n-2} \sum_{i_2=i_1+1}^{n-1} \sum_{i_3=i_2+1}^{n} f_{i_1 i_2 i_3}(x_{i_1}, x_{i_2}, x_{i_3}) + \cdots + f_{1 2 \ldots n}(x_1, x_2, \ldots, x_n),   (2.1)

where the term f_0 is a constant and corresponds to the function with no arguments.
The decomposition, equation 2.1, is called ANOVA iff

\int_0^1 f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) \, dx_{i_r} \equiv 0; \quad \forall r = 1, 2, \ldots, k; \; \forall i_1, i_2, \ldots, i_k; \; \forall k,   (2.2)

where k indexes the possible subsets of the variables x_1, x_2, \ldots, x_n. Hence, the functions corresponding to the different summands are orthogonal, that is,

\int_0^1 \int_0^1 \cdots \int_0^1 f_{i_1 i_2 \ldots i_k}(x_{i_1}, \ldots, x_{i_k}) \, f_{j_1 j_2 \ldots j_l}(x_{j_1}, \ldots, x_{j_l}) \, d\mathbf{x} = 0 \quad \forall (i_1, i_2, \ldots, i_k) \neq (j_1, j_2, \ldots, j_l),

where \mathbf{x} = (x_1, x_2, \ldots, x_n). It is important to notice that the conditions in equation 2.2 are sufficient for the different component functions to be orthogonal. In addition, it can be shown that they are unique, in the sense of L^2 equivalence classes. Note that since the above decomposition includes terms with all possible kinds of interactions among the variables x_1, x_2, \ldots, x_n, it allows these interactions to be determined. The main advantage of this decomposition is that there are explicit formulas for obtaining the different summands or components of f(x_1, \ldots, x_n). Sobol (2001) provides the following expressions:
f_0 = \int_0^1 \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \, dx_1 \, dx_2 \cdots dx_n,   (2.3)

f_i(x_i) = \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \prod_{k=1; k \neq i}^{n} dx_k - f_0,   (2.4)

f_{ij}(x_i, x_j) = \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n) \prod_{k=1; k \neq i, j}^{n} dx_k - f_0 - \sum_{k = i, j} f_k(x_k),   (2.5)
and so on. In other words, the function f(x_1, \ldots, x_n) can always be written as the sum of 2^n orthogonal summands (see equation 2.1). If f(x_1, \ldots, x_n) is square integrable, then all f_{i_1 i_2 \ldots i_k}(x_{i_1}, \ldots, x_{i_k}), for all combinations i_1 i_2 \ldots i_k and k = 1, 2, \ldots, n, are also square integrable.
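Equations 2.3 and 2.4 can be evaluated numerically by replacing the integrals with midpoint-rule averages on a grid. The sketch below (not from the letter) uses the three-variable polynomial of section 4.1 as a sample function, for which f_0 = 7/3 and f_1(x_1) = -4/3 + 2x_1 + x_1^2:

```python
import numpy as np

# Evaluate equations 2.3 and 2.4 with midpoint-rule averages over a
# uniform grid on the unit cube. Sample function from section 4.1.
n = 100
t = (np.arange(n) + 0.5) / n                   # midpoint nodes on (0, 1)
X1, X2, X3 = np.meshgrid(t, t, t, indexing="ij")
F = 1 + X1 + X1**2 + X1 * X2 + 2 * X1 * X2 * X3

f0 = F.mean()                                  # equation 2.3: f0 ~ 7/3
f1 = F.mean(axis=(1, 2)) - f0                  # equation 2.4 on the grid
# Compare with the closed form f_1(x_1) = -4/3 + 2 x_1 + x_1^2.
err = np.max(np.abs(f1 - (-4/3 + 2 * t + t**2)))
```

The midpoint rule is exact up to O(1/n^2) for smooth integrands, so both checks agree to several decimal places.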
Squaring f(x_1, \ldots, x_n) and integrating over (0, 1)^n gives

\int_0^1 \cdots \int_0^1 f^2(x_1, \ldots, x_n) \, dx_1 \cdots dx_n = f_0^2 + \sum_{i_1=1}^{n} \int f_{i_1}^2(x_{i_1}) + \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} \int f_{i_1 i_2}^2(x_{i_1}, x_{i_2}) + \cdots + \int f_{1 2 \ldots n}^2(x_1, \ldots, x_n),   (2.6)

where each remaining integral is taken over the arguments of the corresponding component, and calling

D = \int_0^1 \cdots \int_0^1 f^2(x_1, \ldots, x_n) \, dx_1 \cdots dx_n - f_0^2,   (2.7)

D_{i_1 i_2 \ldots i_k} = \int_0^1 \cdots \int_0^1 f_{i_1 i_2 \ldots i_k}^2(x_{i_1}, \ldots, x_{i_k}) \, d\mathbf{x},   (2.8)

the result is

D = \sum_{k=1}^{n} \sum_{i_1 i_2 \ldots i_k} D_{i_1 i_2 \ldots i_k}.   (2.9)

The constant D is called the variance, because if (x_1, x_2, \ldots, x_n) is a uniform random variable in the unit hypercube, then D is the variance of f(x_1, x_2, \ldots, x_n). Thus, the following set of global sensitivity indices, adding up to one, can be defined:

S_{i_1 i_2 \ldots i_k} = \frac{D_{i_1 i_2 \ldots i_k}}{D} \quad \Leftrightarrow \quad \sum_{i_1 i_2 \ldots i_k} S_{i_1 i_2 \ldots i_k} = 1.   (2.10)
The practical importance of the ANOVA decomposition arises from the following facts:

1. Every square integrable function can be decomposed as the sum of orthogonal functions, including all interaction levels.
2. This decomposition is unique.
3. There are explicit formulas for determining this decomposition in terms of f(x_1, \ldots, x_n).
4. The variance of the initial function can be obtained by summing up the variances of the components, and this permits global sensitivity indices that sum to one to be assigned to the different functional components.

3 Learning Algorithm

In this section, we present a method (denominated AFN, for ANOVA decomposition and functional networks) for learning the functional components of any general function f(x_1, \ldots, x_n) from data and for calculating local and global sensitivity indices. Consider a data set \{(x_{1m}, x_{2m}, \ldots, x_{nm}; y_m) \mid m = 1, 2, \ldots, M\}, a sample of size M of n input variables (X_1, X_2, \ldots, X_n) and one output variable Y. The algorithm works as follows.

3.1 Step 1: Select a Set of Approximating Orthonormal Functions. Each functional component f_{i_1 i_2 \ldots i_k}(x_{i_1}, x_{i_2}, \ldots, x_{i_k}) is approximated as

f_{i_1 i_2 \ldots i_k}(x_{i_1}, \ldots, x_{i_k}) \approx \sum_{j=1}^{k^*_{i_1 i_2 \ldots i_k}} c^*_{i_1 i_2 \ldots i_k; j} \, h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, \ldots, x_{i_k}),   (3.1)

where the c^*_{i_1 i_2 \ldots i_k; j} are real constants, and

\{h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, \ldots, x_{i_k}) \mid j = 1, 2, \ldots, k^*_{i_1 i_2 \ldots i_k}\}   (3.2)

is a set of basis functions (e.g., polynomials, sinusoids, exponentials, splines, wavelets), which must be orthonormalized. One possibility consists of taking one of the families of univariate orthogonal functions, for example, Legendre polynomials, forming tensor products with them, and selecting a subset of the products. Although these polynomials are defined with respect to a uniform weighting of [-1, 1], they can easily be mapped to the interval [0, 1]. Hermite and Chebyshev polynomials, Fourier functions, and Haar wavelets also provide univariate basis functions. Another alternative consists of taking a family of functions and using the Gram-Schmidt technique, which is implemented in some numerical libraries, to orthonormalize these basis functions.
For the sake of completeness, we give a third alternative. Since they must satisfy the ANOVA constraints, equation 2.2, we must have

\int_0^1 \sum_{j=1}^{k^*_{i_1 i_2 \ldots i_k}} c^*_{i_1 i_2 \ldots i_k; j} \, h^*_{i_1 i_2 \ldots i_k; j}(x_{i_1}, \ldots, x_{i_k}) \, dx_{i_r} \equiv 0; \quad \forall r = 1, 2, \ldots, k; \; \forall i_1, i_2, \ldots, i_k; \; \forall k \in \{1, 2, \ldots, n\}.   (3.3)
Note that in spite of the fact that these conditions represent an uncountably infinite number of constraints, only a finite number of them are independent. Thus, only some subset of the constants \{c^*_{i_1 i_2 \ldots i_k; j} \mid j = 1, 2, \ldots, k^*_{i_1 i_2 \ldots i_k}\} remains free. After renaming the free constants \{c_{i_1 i_2 \ldots i_k; j} \mid j = 1, 2, \ldots, k_{i_1 i_2 \ldots i_k}\} and orthonormalizing the basis functions, equation 3.1 becomes

f_{i_1 i_2 \ldots i_k}(x_{i_1}, \ldots, x_{i_k}) \approx \sum_{j=1}^{k_{i_1 i_2 \ldots i_k}} c_{i_1 i_2 \ldots i_k; j} \, p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, \ldots, x_{i_k}).   (3.4)

Note that the initial set of basis functions, equation 3.2, changes to

\{p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, \ldots, x_{i_k}) \mid j = 1, 2, \ldots, k_{i_1 i_2 \ldots i_k}\}.   (3.5)

If one uses the tensor product technique, the multivariate normalized basis functions p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, \ldots, x_{i_k}) can be selected from the set of tensor products

\left\{ \prod_{j=1}^{k} p_{i_j, r_j}(x_{i_j}) \;\middle|\; 1 \le r_j \le k_{i_j} \right\}.   (3.6)
Note that the size of this set grows exponentially with k and that we may be interested in selecting a small subfamily. For example, if we select as univariate basis functions polynomials of degree r, then for the m-dimensional basis functions, the tensor product technique leads to polynomials of degree r × m, which is too high. Thus, we can limit the degree of the corresponding m-multivariate basis to contain only polynomials of degree d_m or less. This is what we have done with the examples presented in this letter.

Example 1 (Univariate Functions). Consider the basis of univariate polynomial functions,

\{h^*_{1;1}(x_1), h^*_{1;2}(x_1), h^*_{1;3}(x_1), h^*_{1;4}(x_1)\} = \{1, x_1, x_1^2, x_1^3\}.
After imposing the constraints, equations 2.3 to 2.5, we get the reduced basis

\{h_{1;1}(x_1), h_{1;2}(x_1), h_{1;3}(x_1)\} = \{2x_1 - 1, \; 3x_1^2 - 1, \; 4x_1^3 - 1\},

which after normalization leads to

\{p_{1;1}(x_1), p_{1;2}(x_1), p_{1;3}(x_1)\} = \{\sqrt{3}\,(2x_1 - 1), \; \sqrt{5}\,(6x_1^2 - 6x_1 + 1), \; \sqrt{7}\,(20x_1^3 - 30x_1^2 + 12x_1 - 1)\}.

Example 2 (Multivariate Functions). If the function is multivariate, the normalized basis can be obtained as the tensor product of univariate basis functions. For example, consider the basis of 16 bivariate functions,

\{1, x_2, x_2^2, x_2^3, x_1, x_1 x_2, x_1 x_2^2, x_1 x_2^3, x_1^2, x_1^2 x_2, x_1^2 x_2^2, x_1^2 x_2^3, x_1^3, x_1^3 x_2, x_1^3 x_2^2, x_1^3 x_2^3\},

which, after imposing the constraints, equations 2.3 to 2.5, and normalizing, leads to the normalized basis

3\,(2x_1 - 1)(2x_2 - 1), \quad \sqrt{15}\,(2x_1 - 1)(6x_2^2 - 6x_2 + 1),
\sqrt{21}\,(2x_1 - 1)(20x_2^3 - 30x_2^2 + 12x_2 - 1),
\sqrt{15}\,(6x_1^2 - 6x_1 + 1)(2x_2 - 1), \quad 5\,(6x_1^2 - 6x_1 + 1)(6x_2^2 - 6x_2 + 1),
\sqrt{35}\,(6x_1^2 - 6x_1 + 1)(20x_2^3 - 30x_2^2 + 12x_2 - 1),
\sqrt{21}\,(20x_1^3 - 30x_1^2 + 12x_1 - 1)(2x_2 - 1),
\sqrt{35}\,(20x_1^3 - 30x_1^2 + 12x_1 - 1)(6x_2^2 - 6x_2 + 1),
7\,(20x_1^3 - 30x_1^2 + 12x_1 - 1)(20x_2^3 - 30x_2^2 + 12x_2 - 1).

This family of 9 functions is the family in equation 3.6; that is, it comes from the tensor products of the normalized functions in example 1. Note that these bases can be obtained independently of the data set, which means that they are valid for learning any data set. So, once calculated, they can be stored to be used when needed. In addition, since multivariate functions have associated tensor product bases of univariate functions, one needs to calculate and store only the bases associated with univariate functions.
240
˜ A. Alonso-Betanzos, and C. Castillo E. Castillo, N. S´anchez-Marono,
3.2 Step 2: Learn the Coefficients by Least Squares and Obtain Local Sensitivity Indices. According to our approximation, we have

y_m = f(x_{1m}, \ldots, x_{nm}) = f_0 + \sum_{i_1=1}^{n} \sum_{j=1}^{k_{i_1}} c_{i_1; j} \, p_{i_1; j}(x_{i_1 m}) + \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} \sum_{j=1}^{k_{i_1 i_2}} c_{i_1 i_2; j} \, p_{i_1 i_2; j}(x_{i_1 m}, x_{i_2 m}) + \sum_{i_1=1}^{n-2} \sum_{i_2=i_1+1}^{n-1} \sum_{i_3=i_2+1}^{n} \sum_{j=1}^{k_{i_1 i_2 i_3}} c_{i_1 i_2 i_3; j} \, p_{i_1 i_2 i_3; j}(x_{i_1 m}, x_{i_2 m}, x_{i_3 m}) + \cdots + \sum_{j=1}^{k_{1 2 \ldots n}} c_{1 2 \ldots n; j} \, p_{1 2 \ldots n; j}(x_{1m}, x_{2m}, \ldots, x_{nm}).   (3.7)
The error for the mth data value is

\epsilon_m(\mathbf{y}, \mathbf{x}, \mathbf{c}) = y_m - f_0 - \sum_{i_1=1}^{n} \sum_{j=1}^{k_{i_1}} c_{i_1; j} \, p_{i_1; j}(x_{i_1 m}) - \sum_{i_1=1}^{n-1} \sum_{i_2=i_1+1}^{n} \sum_{j=1}^{k_{i_1 i_2}} c_{i_1 i_2; j} \, p_{i_1 i_2; j}(x_{i_1 m}, x_{i_2 m}) - \sum_{i_1=1}^{n-2} \sum_{i_2=i_1+1}^{n-1} \sum_{i_3=i_2+1}^{n} \sum_{j=1}^{k_{i_1 i_2 i_3}} c_{i_1 i_2 i_3; j} \, p_{i_1 i_2 i_3; j}(x_{i_1 m}, x_{i_2 m}, x_{i_3 m}) - \cdots - \sum_{j=1}^{k_{1 2 \ldots n}} c_{1 2 \ldots n; j} \, p_{1 2 \ldots n; j}(x_{1m}, \ldots, x_{nm}),   (3.8)

where \mathbf{y}, \mathbf{x} are the data vectors and \mathbf{c} is the vector including all the unknown coefficients. Hence, to estimate the constants \mathbf{c}, that is, all c_{i_1 i_2 \ldots i_k; j}; \forall k, j, the following minimization problem has to be solved:

\underset{f_0, \mathbf{c}}{\text{Minimize}} \quad Q = \sum_{m=1}^{M} \epsilon_m^2(\mathbf{y}, \mathbf{x}, \mathbf{c}).   (3.9)
The minimization problem, equation 3.9, leads to a linear system of equations with a unique solution. However, due to its tensor character, its organization in a standard form as Ac = b requires renumbering of
the unknowns \mathbf{c}, which is not a trivial task. Alternatively, one can use standard optimization packages, such as GAMS (Brooke, Kendrik, Meeraus, & Raman, 1998), to solve equation 3.9. This is the option used in this letter. However, to obtain the local sensitivities of Q to small changes in the data, the following modified problem, which is equivalent to equation 3.9, can be solved instead:

\underset{f_0, \mathbf{c}; \, \mathbf{y}', \mathbf{x}'}{\text{Minimize}} \quad Q = \sum_{m=1}^{M} \epsilon_m^2(\mathbf{y}', \mathbf{x}', \mathbf{c}),   (3.10)

subject to

\mathbf{y}' = \mathbf{y} : \boldsymbol{\lambda}   (3.11)
\mathbf{x}' = \mathbf{x} : \boldsymbol{\delta},   (3.12)
where \boldsymbol{\lambda} and \boldsymbol{\delta} are the vectors of dual variables, which give the local sensitivities of Q with respect to the data values \mathbf{y} and \mathbf{x}, respectively. This is so because the dual variables associated with any primal constrained optimization problem are known to be the partial derivatives of the objective function's optimal value with respect to changes in the right-hand-side parameters of the corresponding equality constraints. Since in this artificial optimization problem, equations 3.10 to 3.12, the right-hand sides of the equality constraints 3.11 and 3.12 are the data \mathbf{x} and \mathbf{y}, respectively, the partial derivatives of Q^* (the optimal value of Q) with respect to the data, that is, the sensitivities sought after, are the values of the corresponding dual variables.

3.3 Step 3: Obtain the Global Sensitivity Indices. Since the resulting basis functions have already been orthonormalized, the global sensitivity indices (importance factors) are the sums of the squares of the coefficients, that is,

S_{i_1 i_2 \ldots i_k} = \sum_{j=1}^{k_{i_1 i_2 \ldots i_k}} c_{i_1 i_2 \ldots i_k; j}^2; \quad \forall (i_1 i_2 \ldots i_k).   (3.13)
3.4 Step 4: Return Solution. The algorithm returns the following information:

- An estimation of the coefficients f_0 and \{c_{i_1 i_2 \ldots i_k; j}; \forall (i_1 i_2 \ldots i_k; j)\}, which, considering the basis functions p_{i_1 i_2 \ldots i_k; j}(x_{i_1}, \ldots, x_{i_k}) and equation 3.7, allow obtaining an approximation for the solution of the problem being studied.
- Local and global sensitivity indices.
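Steps 1 to 3 can be sketched end to end on a toy bilinear target. This is an illustration under assumed choices (target function, basis truncation), not the authors' implementation; the indices are normalized by the total variance, as in equation 2.10.

```python
import numpy as np

# End-to-end sketch of steps 1-3 on the toy target y = x1 + x1*x2.
# Basis: the first shifted Legendre polynomial and its tensor product
# (sufficient here because the target is bilinear).
rng = np.random.default_rng(0)
M = 500
x1, x2 = rng.uniform(size=(2, M))
y = x1 + x1 * x2

p1 = lambda x: np.sqrt(3.0) * (2 * x - 1)
# Step 1: orthonormal design matrix [1, p1(x1), p1(x2), p1(x1)p1(x2)].
A = np.column_stack([np.ones(M), p1(x1), p1(x2), p1(x1) * p1(x2)])
# Step 2: least-squares fit of the coefficients (equation 3.9).
c, *_ = np.linalg.lstsq(A, y, rcond=None)

# Step 3: component variances from squared coefficients (eq. 3.13),
# then normalized global indices (eq. 2.10).
D = {"1": c[1]**2, "2": c[2]**2, "12": c[3]**2}
total = sum(D.values())
S = {k: v / total for k, v in D.items()}
```

For this target the fit is exact, and the recovered component variances are D_1 = 3/16, D_2 = 1/48, and D_12 = 1/144.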
For simplicity, only one output variable was considered; however, an extension to the case of several outputs can be immediately obtained by solving the corresponding minimization problems. Thus, the algorithm above allows us to obtain an approximation of a given function f in terms of its functional components. Moreover, global sensitivity indices are derived for each functional component; therefore, the topology of a functional network can be considerably simplified by taking into account only the components with higher sensitivities. Suppose that \hat{f}(x_1, \ldots, x_n) is the approximation of f after removing unimportant interactions, that is, when only the subset of important interactions is included in the final model. Then the mean squared error (assuming a uniform distribution in the unit cube) is

MSE = \int_0^1 \int_0^1 \cdots \int_0^1 \left( \hat{f}(x_1, \ldots, x_n) - f(x_1, \ldots, x_n) \right)^2 d\mathbf{x}.
Sometimes it is more convenient to use the normalized mean squared error (NMSE), which is dimensionless and defined as

NMSE = \frac{MSE}{\text{Var}[f(x_1, \ldots, x_n)]}.

4 Application Examples

In this section, two application examples are described to illustrate the methodology proposed above.

4.1 Learning a Three-Input Nonlinear Function. The aim of this example is to demonstrate that the global sensitivity indices recovered by the proposed algorithm from data are exactly the same as those that can be obtained directly from the original function. We will also show that the resulting global sensitivity indices are almost unaffected by noise. For illustration, we selected the following function:

y = f(x_1, x_2, x_3) = 1 + x_1 + x_1^2 + x_1 x_2 + 2 x_1 x_2 x_3.
(4.1)
First, we calculate the exact ANOVA components using equations 2.3 to 2.5, leading to: 7 4 1 f 0 = ; f 1 (x1 ) = − + 2 x1 + x12 ; f 2 (x2 ) = − + x2 ; 3 3 2 x3 1 1 f 3 (x3 ) = − + ; f 12 (x1 , x2 ) = − x1 − x2 + 2 x1 x2 ; 4 2 2
Functional Network Topology Learning and Sensitivity Analysis
243
Table 1: Sensitivity Results with Different Levels of Noise.

σ      S1      S2      S3      S12     S13     S23     S123
0.00   0.8361  0.0922  0.0231  0.0307  0.0077  0.0077  0.0026
0.02   0.8359  0.0922  0.0230  0.0308  0.0077  0.0078  0.0026
0.04   0.8356  0.0921  0.0233  0.0309  0.0076  0.0080  0.0026
0.06   0.8360  0.0917  0.0231  0.0304  0.0081  0.0080  0.0026
0.08   0.8336  0.0924  0.0237  0.0317  0.0081  0.0077  0.0028
0.10   0.8335  0.0918  0.0238  0.0311  0.0086  0.0084  0.0028
0.12   0.8312  0.0931  0.0242  0.0314  0.0088  0.0081  0.0032
0.14   0.8305  0.0940  0.0241  0.0305  0.0089  0.0091  0.0029
0.16   0.8321  0.0901  0.0237  0.0319  0.0095  0.0096  0.0031
0.18   0.8268  0.0925  0.0244  0.0327  0.0110  0.0099  0.0028
0.20   0.8267  0.0930  0.0238  0.0331  0.0103  0.0096  0.0035

Note: Mean values of 100 replications.
f13(x1, x3) = 1/4 − x1/2 − x3/2 + x1x3;   f23(x2, x3) = 1/4 − x2/2 − x3/2 + x2x3;
f123(x1, x2, x3) = −1/4 + x1/2 + x2/2 + x3/2 − x1x2 − x1x3 − x2x3 + 2x1x2x3,   (4.2)
whose sum gives exactly the function in equation 4.1. Next, the exact variance D = 0.9037 of the function f and the sensitivity indices for the functional components were calculated using equations 2.7 to 2.10. These are shown in the first row of Table 1, corresponding to σ = 0. Subsequently, we show that we can learn the function f, its ANOVA decomposition, and the global indices from data. To this end, a sample of size m = 100 was generated for each input variable {xik | i = 1, 2, 3; k = 1, 2, . . . , m} from a uniform U(0, 1) random variable, and {yk | k = 1, 2, . . . , m} was calculated using equation 4.1. Next, the AFN method was applied in order to obtain an approximation of the function f in equation 4.1. The algorithm starts by considering a set of basis functions for the univariate functions. As the function to be estimated was known (because this is an illustrative example), a suitable set of basis functions was formed by third-degree polynomials. These basis functions were orthonormalized as described in example 1. The multivariate functions, with two and three arguments, were approximated with the tensor product functions obtained from the corresponding third-degree univariate polynomials (see example 2). However, only polynomials of third degree or lower were used in the approximation. After solving the minimization problem, the exact function
E. Castillo, N. Sánchez-Maroño, A. Alonso-Betanzos, and C. Castillo
was recovered by the algorithm: f(x1, . . . , xn) = 1 + x1 + x1² + x1x2 + 2x1x2x3. Next, the global sensitivity indices were calculated from the coefficients using equation 3.13 in step 3 of the algorithm. These coefficients were the exact ones (i.e., with no error), as can be confirmed in Table 1 in the row labeled σ = 0.00. Finally, to test the sensitivity of the results to noise, the learning process was repeated for different noise levels, adding to y in equation 4.1 a normal N(0, σ²) noise with σ = 0.00, 0.02, . . . , 0.20. The results obtained are shown in Table 1. It can be observed that the sensitivity indices are barely affected by the noise level. Once the global sensitivity indices have been calculated, one can decide which interactions must be included in the model. This immediately defines the network topology. Since 98.18% of the variance in the illustrative example can be captured by the approximation

f(x1, . . . , xn) ≈ f0 + f1(x1) + f2(x2) + f3(x3) + f12(x1, x2),
(4.3)
one can decide to remove all other interactions. The final topology of the network obtained using our method is shown in Figure 2b, whereas Figure 2a shows the network topology used when the methodology proposed was not applied. Comparing both network topologies, it can be observed that the application of the algorithm achieved considerable simplification while practically maintaining the quality of the outputs. To illustrate the performance of the proposed method, Table 2 shows the means and standard deviations (in parentheses) of MSE (mean squared error) and NMSE (normalized mean squared error) for the training (80% of the sample) and testing (20% of the sample) samples, obtained with 100 replications. Since the values are small, they reveal good performance. In addition, Table 2 includes the errors obtained with increasing noise values (σ ). 4.2 Case Example: A Vertical Breakwater Problem. Breakwaters are constructed to provide sheltered bays for ships and protect harbor facilities. Moreover, in ports open to rough seas, they play a key role in port operations. Since sea waves are enormously powerful, it is not an easy matter to construct structures to mitigate sea power. When designing a breakwater (see Figure 3), one looks for the optimal cross-section that minimizes construction and maintenance costs during the breakwater’s useful life and also satisfies reliability constraints guaranteeing that the work is reasonably safe for each failure mode. Optimization of this engineering
[Figure 2: Network topologies corresponding to the three-input function in the example. (a) Using the available data directly, without applying the proposed methodology. (b) Using the knowledge obtained by the proposed methodology to eliminate some interaction terms.]
design is extremely important because of the corresponding reduction in the associated cost. Analysis of the failure probabilities of the breakwater requires calculating failure probabilities and the annual failure rate for each failure mode. This implies determining the pressure produced on the breakwater crownwall by the sea waves. This problem involves nine input variables (listed in Table 3) and four output variables: p1 , p2 , p3 , and pu (the first three pi are, respectively, water pressure at the mean, freeboard and base levels of the concrete block, and pu is the maximum subpressure value).
Table 2: Simulation Results for a Sample Size n = 100 with Different Levels of Noise.

σ     MSE tr             NMSE tr            MSE test           NMSE test
0.00  0.00000 (0.00000)  0.00000 (0.00000)  0.00000 (0.00000)  0.00000 (0.00000)
0.02  0.00035 (0.00006)  0.00040 (0.00009)  0.00054 (0.00018)  0.00061 (0.00022)
0.04  0.00140 (0.00023)  0.00159 (0.00039)  0.00228 (0.00090)  0.00258 (0.00111)
0.06  0.00312 (0.00052)  0.00352 (0.00082)  0.00494 (0.00149)  0.00556 (0.00183)
0.08  0.00570 (0.00095)  0.00643 (0.00153)  0.00944 (0.00326)  0.01067 (0.00420)
0.10  0.00891 (0.00162)  0.00997 (0.00230)  0.01371 (0.00465)  0.01546 (0.00589)
0.12  0.01257 (0.00248)  0.01398 (0.00363)  0.02062 (0.00761)  0.02304 (0.00975)
0.14  0.01681 (0.00300)  0.01853 (0.00409)  0.02661 (0.00973)  0.02926 (0.01096)
0.16  0.02203 (0.00389)  0.02433 (0.00554)  0.03563 (0.01453)  0.03956 (0.01772)
0.18  0.02789 (0.00489)  0.03052 (0.00695)  0.04186 (0.01439)  0.04571 (0.01657)
0.20  0.03490 (0.00623)  0.03753 (0.00873)  0.05572 (0.02170)  0.06034 (0.02652)

Notes: Mean values of 100 replications. Standard deviations appear in parentheses.
In the case of the vertical breakwater, approximating formulas for calculating p1, p2, p3, and pu were given by Goda (1972) on the basis of his own theoretical and laboratory studies. These were later extended by other authors (Takahashi, 1996), obtaining:

p1 = 0.5 (1 + cos θ)(α1 + α4 cos²θ) w0 HD   (4.4)
p2 = α2 p1   (4.5)
p3 = α3 p1   (4.6)
pu = 0.5 (1 + cos θ) α1 α2 w0 HD,   (4.7)

where the nondimensional coefficients are given by

α1 = 0.6 + 0.5 [ (4πh/L) / sinh(4πh/L) ]²   (4.8)
[Figure 3: Typical cross section of a vertical breakwater.]
α2 = 1 − (h′/h) [ 1 − 1/cosh(2πh/L) ]   (4.9)
α3 = 1 − min( 1, hc / (0.75 (1 + cos θ) HD) )   (4.10)
α4 = max(α5, αI)   (4.11)
α5 = min( (1 − d/h)(HD/d)² / 3, 2d/HD )   (4.12)
αI = αI0 αI1   (4.13)
αI0 = { HD/d   if HD ≤ 2d;   2   otherwise }   (4.14)
αI1 = { cos δ2 / cosh δ1   if δ2 ≤ 0;   1 / (cosh δ1 (cosh δ2)^{1/2})   otherwise }   (4.15)
δ1 = { 20 δ11   if δ11 ≤ 0;   15 δ11   otherwise }   (4.16)
δ11 = 0.93 (Bm/L − 0.12) + 0.36 [ (h − d)/h − 0.6 ]   (4.17)
δ2 = { 4.9 δ22   if δ22 ≤ 0;   3 δ22   otherwise }   (4.18)
δ22 = −0.36 (Bm/L − 0.12) + 0.93 [ (h − d)/h − 0.6 ].   (4.19)
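The formulas above translate directly into code. A sketch (angles in radians; variable names are ours, with `hp` standing for h′; the (cosh δ2)^{1/2} factor in αI1 is taken from Takahashi, 1996):

```python
import math

def goda_takahashi(theta, w0, HD, h, L, d, hp, hc, Bm):
    """Pressures p1, p2, p3, pu from equations 4.4-4.19 (Goda, 1972;
    Takahashi, 1996).  theta in radians; `hp` stands for h'."""
    a1 = 0.6 + 0.5 * ((4 * math.pi * h / L) / math.sinh(4 * math.pi * h / L)) ** 2  # 4.8
    a2 = 1 - (hp / h) * (1 - 1 / math.cosh(2 * math.pi * h / L))                    # 4.9
    a3 = 1 - min(1.0, hc / (0.75 * (1 + math.cos(theta)) * HD))                     # 4.10
    a5 = min((1 - d / h) * (HD / d) ** 2 / 3, 2 * d / HD)                           # 4.12
    d11 = 0.93 * (Bm / L - 0.12) + 0.36 * ((h - d) / h - 0.6)                       # 4.17
    d22 = -0.36 * (Bm / L - 0.12) + 0.93 * ((h - d) / h - 0.6)                      # 4.19
    d1 = 20 * d11 if d11 <= 0 else 15 * d11                                         # 4.16
    d2 = 4.9 * d22 if d22 <= 0 else 3 * d22                                         # 4.18
    aI0 = HD / d if HD <= 2 * d else 2.0                                            # 4.14
    aI1 = (math.cos(d2) / math.cosh(d1) if d2 <= 0
           else 1 / (math.cosh(d1) * math.cosh(d2) ** 0.5))                         # 4.15
    a4 = max(a5, aI0 * aI1)                                                         # 4.11, 4.13
    p1 = 0.5 * (1 + math.cos(theta)) * (a1 + a4 * math.cos(theta) ** 2) * w0 * HD   # 4.4
    pu = 0.5 * (1 + math.cos(theta)) * a1 * a2 * w0 * HD                            # 4.7
    return p1, a2 * p1, a3 * p1, pu                                                 # 4.5, 4.6
```

Note how the branch conditions in equations 4.14 to 4.18 produce the first-derivative discontinuities discussed below.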
Table 3: Input Variables for the Vertical Breakwater Problem.

θ    incidence angle of waves
w0   water unit weight
HD   design wave height
h    water depth
L    wave length
d    submerged depth of the vertical breakwater
h′   submerged depth of the vertical breakwater, including width of the armor layer
hc   free depth of the concrete block on the lee side
Bm   berm sea side length
As can be observed, these formulas are very complicated and have continuity problems in the first derivatives because they use different expressions in different intervals that do not have the desired regularity properties at the change points. As a consequence, some optimization program solvers, such as GAMS, fail to obtain the optimal solution in some cases. Therefore, more complex solvers, which are more costly in terms of computational power and time, are needed. To solve this problem, the idea is to generate a random set of input variables, calculate the corresponding outputs using the above formulas, and then use these data to train a network that may solve the problem satisfactorily. This application is very important from a practical point of view because it means that an approximation can be obtained for the formulas in equations 4.4 to 4.19 without continuity or regularity problems. Therefore, the optimization problem can be stated, and the optimal solution can be obtained without any special computational requirement. Almost all the variables involved in the breakwater problem can be represented as powers of fundamental magnitudes such as length and mass. Therefore, the breakwater problem can be simplified by the application of dimensional analysis, specifically, its main theorem: the Π-theorem (Bridgman, 1922; Buckingham, 1914, 1915). This theorem states that any physical relation in terms of n variables can be rewritten as a new relation involving r fewer variables. When this theorem is applied, the input and output variables are transformed into dimensionless monomials. Different transformations are possible, but using engineering knowledge on breakwaters allows determining, without knowledge of formulas 4.4 to 4.19, that the dimensionless output variables

p1/(w0 HD),  p2/(w0 HD),  p3/(w0 HD),  pu/(w0 HD)   (4.20)
Table 4: Monomials Required for Estimating the Variables p1, p2, p3, and pu.

Output / Input   h/L   h′/h   θ    d/h   HD/d   Bm/L   hc/HD
p1/(w0 HD)        √           √     √     √      √
p2/(w0 HD)        √     √     √     √     √      √
p3/(w0 HD)        √           √     √     √      √      √
pu/(w0 HD)        √     √     √
can be written in terms of the set of dimensionless input variables

θ,  h/L,  d/h,  HD/d,  h′/h,  hc/HD,  Bm/L.   (4.21)
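For concreteness, the mapping from the nine raw inputs of Table 3 to the seven monomials of equation 4.21 might look as follows (a sketch; the function name is ours and `hp` stands for h′):

```python
def dimensionless_inputs(theta, w0, HD, h, L, d, hp, hc, Bm):
    # Map the nine raw inputs of Table 3 to the seven monomials of
    # equation 4.21.  The unit weight w0 drops out of the inputs: it
    # only rescales the outputs (equation 4.20).
    return {"theta": theta, "h/L": h / L, "d/h": d / h, "HD/d": HD / d,
            "h'/h": hp / h, "hc/HD": hc / HD, "Bm/L": Bm / L}
```

The angle θ is already dimensionless, so it passes through unchanged.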
Not all the dimensionless monomials in equation 4.21 are needed for estimating the dimensionless outputs (see equations 4.4 to 4.19). Table 4 summarizes the required ones. In order to choose the input data points used to learn these functions, several different criteria can be used; knowledge about their ranges and statistical distribution is required. (For some interesting work on this and related problems, readers are referred to Sacks et al., 1989; Koehler & Owen, 1996; and Currin et al., 1991.) Knowledge about ranges is easy to obtain from experienced engineers. However, given the lack of knowledge about the statistical distribution of the input data values in existing breakwaters, a set S of 2000 input data points (1600 for training and 400 for testing) was generated randomly, assuming that each dimensionless variable in equation 4.21 is U(0, 1), that is, normalized in the interval (0, 1), in order to apply the proposed methodology. This is a reasonable assumption covering the actual ranges of the variables involved. Note that the ranges correspond to the nondimensional variables. To show the performance of the proposed methodology (AFN), all of the dimensionless inputs will be considered for the estimation of the desired dimensionless outputs. It will be demonstrated that only the relevant input interactions appear in the learned models. In this way, we illustrate how one can use data from a given problem to obtain appropriate knowledge that can assist in deriving a topology for functional networks. To our knowledge, there is no other efficient method for deriving functional network topologies from data. Twenty replications with different starting values for the parameters were performed for each case studied. This example focuses on the estimation of pu/(w0 HD), but p1/(w0 HD), p2/(w0 HD), and p3/(w0 HD) were also learned in a similar way, and the most significant results are presented. From this point on, we will refer to the dimensionless outputs as p1, p2, p3, and pu.
Table 5: Mean, Standard Deviation (STD), and Minimum and Maximum Values for the Normalized Mean Square Error (NMSE) for the Test Data Set.

Degrees          NMSE test       (STD)            Min             Max
2, 1, 1, 1, 1    2.3900 × 10⁻¹   (3.8184 × 10⁻²)  1.9346 × 10⁻¹   3.2621 × 10⁻¹
2, 2, 1, 1, 1    7.4135 × 10⁻²   (4.8200 × 10⁻³)  6.5495 × 10⁻²   8.4480 × 10⁻²
2, 3, 1, 1, 1    6.6946 × 10⁻²   (4.0007 × 10⁻³)  5.9426 × 10⁻²   7.6709 × 10⁻²
2, 4, 1, 1, 1    6.7852 × 10⁻²   (4.0342 × 10⁻³)  5.9443 × 10⁻²   7.7210 × 10⁻²
3, 1, 1, 1, 1    9.4133 × 10⁻¹   (5.2477 × 10⁻²)  8.3121 × 10⁻¹   1.0367
3, 3, 3, 1, 1    9.8419 × 10⁻¹   (5.7424 × 10⁻²)  8.6991 × 10⁻¹   1.0875
3, 4, 1, 1, 1    9.5612 × 10⁻¹   (6.0454 × 10⁻²)  8.3115 × 10⁻¹   1.0819
3, 6, 9, 1, 1    9.6607 × 10⁻¹   (6.6378 × 10⁻²)  8.6482 × 10⁻¹   1.1104
2, 4, 3, 4, 5    7.4111 × 10⁻²   (6.4231 × 10⁻³)  6.1047 × 10⁻²   8.5550 × 10⁻²
3, 6, 3, 4, 5    1.0440          (6.9930 × 10⁻²)  9.2294 × 10⁻¹   1.1540
2, 3, 1          6.8237 × 10⁻²   (4.2030 × 10⁻³)  5.8822 × 10⁻²   7.4263 × 10⁻²

Notes: Results for the estimation of pu using the proposed methodology. The last row shows the results obtained when the AFN method was applied and a simplified topology was obtained.
The complexity of the proposed method grows exponentially with the number of inputs. To estimate pu, seven input variables are given; therefore, a complex model results if all levels of interaction are considered. Hence, a simpler model was chosen in the first instance, and its complexity was progressively increased. This model is obtained by considering a reduced number of univariate functions and limiting the multivariate functions obtained from the corresponding tensor products. As in the previous example, the polynomial family was selected for estimating the desired output. The first column in Table 5 shows the different degrees taken into account; the first number is the degree for the univariate functions, the second one is the degree for the bivariate functions, and so on. Note that six- and seven-argument functions were not included because there is no improvement in the performance results when more relations are considered. The global sensitivity indices related to the best performance results and the corresponding total sensitivity indices are shown in Tables 6 and 7, respectively. To be concise, all the monomials or relations between them with index values under 0.0005 were removed from Table 6. It can be seen that the monomials with the highest sensitivity values are the three required ones (the first three rows in Table 6; see Table 4). In addition, the relation between the necessary variables h/L and h′/h (fourth row in Table 6) is the most important one, although its value is low compared to the individual monomials. All other variables and relations have lower sensitivity values and thus were not considered. After removing the unimportant factors, the topology of the functional network is much simpler. It has only three univariate functions
Table 6: Global Sensitivity Indices for the Univariate and Bivariate Polynomials When Estimating pu.

                 Replication
Monomials        1       2       3       4       5       Mean
h/L              0.6644  0.6598  0.6544  0.6671  0.6594  0.6622
h′/h             0.2680  0.2763  0.2785  0.2626  0.2729  0.2728
θ                0.0169  0.0197  0.0161  0.0179  0.0197  0.0172
h/L, h′/h        0.0429  0.0362  0.0443  0.0440  0.0396  0.0392
h/L, θ           0.0033  0.0040  0.0029  0.0042  0.0046  0.0040
h/L, d/h         0.0002  0.0002  0.0000  0.0005  0.0002  0.0002
h/L, HD/d        0.0002  0.0002  0.0005  0.0001  0.0005  0.0002
h′/h, θ          0.0022  0.0010  0.0013  0.0011  0.0007  0.0015
h′/h, d/h        0.0002  0.0006  0.0001  0.0000  0.0001  0.0002

Notes: Results obtained by the first five replications and mean of the 20 replications. Rows with values under 0.0005 are not included. The lower rows give the global sensitivity indices for the relations between monomials.
Table 7: Total Sensitivity Indices When Estimating pu.

            Replication
Variables   1       2       3       4       5       Mean
h/L         0.7114  0.7007  0.7024  0.7163  0.7044  0.7061
h′/h        0.3135  0.3143  0.3243  0.3080  0.3138  0.3141
θ           0.0228  0.0254  0.0207  0.0238  0.0255  0.0234
d/h         0.0007  0.0011  0.0007  0.0011  0.0008  0.0010
HD/d        0.0007  0.0012  0.0008  0.0005  0.0012  0.0010
Bm/L        0.0007  0.0007  0.0009  0.0012  0.0010  0.0010
hc/HD       0.0004  0.0004  0.0008  0.0011  0.0009  0.0008
Note: Results obtained by the first 5 replications and mean of the 20 replications.
(second-degree polynomials) and a bivariate function relating the variables h/L and h′/h. However, the testing results after training this functional network are very similar to those obtained previously, as expected, because only the irrelevant factors were removed. These results are presented in the last row of Table 5 to facilitate comparison. Note that the number of parameters is drastically reduced. In fact, there are 9 parameters when the proper dimensionless monomials are employed (3 univariate functions with 2 parameters each and 1 bivariate function with 3 parameters), whereas 77 parameters are used when all the dimensionless monomials are taken into account, considering the same degree for the polynomials (7 univariate functions with 2 parameters each and 21 bivariate functions with 3 parameters each).
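The way the total sensitivity indices of Table 7 single out these three monomials can be sketched as a simple threshold rule (the 1% threshold is the one used later in the text for feature selection; the helper name is ours):

```python
def select_features(tsi, threshold=0.01):
    # Rank variables by total sensitivity index (TSI) and keep those
    # whose index exceeds the threshold.
    ranked = sorted(tsi.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, s in ranked if s > threshold]

# Mean total sensitivity indices for pu (Table 7).
tsi_pu = {"h/L": 0.7061, "h'/h": 0.3141, "theta": 0.0234, "d/h": 0.0010,
          "HD/d": 0.0010, "Bm/L": 0.0010, "hc/HD": 0.0008}
print(select_features(tsi_pu))  # -> ['h/L', "h'/h", 'theta']
```

Only h/L, h′/h, and θ survive, in agreement with Table 4.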
Table 8: Mean, Standard Deviation (STD), and Minimum and Maximum Values for the Normalized Mean Square Error (NMSE) for the Test Set When Estimating p1, p2, and p3.

      Method   Mean NMSE       (STD)            Min             Max
p1    AFN      1.8693 × 10⁻¹   (2.6462 × 10⁻²)  1.4714 × 10⁻¹   2.4397 × 10⁻¹
      FN       1.9795 × 10⁻¹   (2.6624 × 10⁻²)  1.5455 × 10⁻¹   2.4882 × 10⁻¹
p2    AFN      7.7311 × 10⁻²   (7.0881 × 10⁻³)  6.5819 × 10⁻²   9.0760 × 10⁻²
      FN       9.1324 × 10⁻²   (1.0700 × 10⁻²)  7.3875 × 10⁻²   1.1522 × 10⁻¹
p3    AFN      7.7058 × 10⁻²   (1.1104 × 10⁻²)  6.2961 × 10⁻²   1.0444 × 10⁻¹
      FN       1.0552 × 10⁻¹   (1.7653 × 10⁻²)  7.7863 × 10⁻²   1.3590 × 10⁻¹
Table 9: Total Sensitivity Indices for the Best Approximation of p1, p2, and p3.

Input / Output   p1         p2         p3
h/L              0.550433   0.652522   0.183044
h′/h             0.003544   0.323328   0.001595
θ                0.220720   0.040430   0.129206
d/h              0.107570   0.016706   0.035828
HD/d             0.124541   0.019912   0.041625
Bm/L             0.083042   0.013103   0.027788
hc/HD            0.003075   0.001099   0.636814
Similar experiments were carried out for the other output variables (p1, p2, and p3). However, for simplicity and clarity, only the most significant results are included. The best performance results obtained by applying the proposed methodology (AFN) are shown in Table 8, and Table 9 gives the total sensitivity index for each input variable. The variables with the lowest values are those not checked in Table 4, and thus the proposed method is also able to discard the irrelevant variables in these cases. Again, considering the topology derived from the application of the proposed method, a functional network is trained for each output, and its performance results are also included in Table 8 (rows labeled FN). Again, it is important to remark that the simplification of the topology means a significant reduction in the number of parameters while maintaining the performance of the approach. By applying the proposed method, interactions between variables are removed and irrelevant variables are found. This suggests that the AFN method can be applied as a feature selection method, that is, a method that reduces the number of original features or variables by selecting a subset of them. The advantages of reducing the number of inputs have been extensively discussed in the machine learning literature (Kohavi & John, 1997; Guyon & Elisseeff, 2003). Two of the most important ones are that
Table 10: Mean, Standard Deviation, and Minimum and Maximum Values for the Normalized MSE for the Test Set When Estimating pu, Considering All the Inputs and Only the Three Relevant Ones.

Considering the seven variables
Neurons   Mean NMSE       (STD)            Min             Max
5         1.9880 × 10⁻³   (1.2739 × 10⁻³)  1.7747 × 10⁻⁴   4.8721 × 10⁻³
10        4.8390 × 10⁻⁴   (1.0625 × 10⁻³)  1.7887 × 10⁻⁶   3.7790 × 10⁻³
15        2.5818 × 10⁻⁴   (7.5651 × 10⁻⁴)  1.0661 × 10⁻⁶   2.6543 × 10⁻³
20        1.5951 × 10⁻⁶   (2.2556 × 10⁻⁶)  2.5091 × 10⁻⁷   1.0151 × 10⁻⁵
25        1.4011 × 10⁻⁶   (2.6363 × 10⁻⁶)  8.3305 × 10⁻⁸   1.1963 × 10⁻⁵
27        8.0122 × 10⁻⁷   (8.1440 × 10⁻⁷)  1.4289 × 10⁻⁷   3.7679 × 10⁻⁶
30        2.5952 × 10⁻⁶   (6.3535 × 10⁻⁶)  7.0063 × 10⁻⁸   2.4735 × 10⁻⁵

Using the three relevant variables
Neurons   Mean NMSE       (STD)            Min             Max
5         1.4199 × 10⁻³   (9.9263 × 10⁻⁴)  1.5737 × 10⁻⁴   2.8286 × 10⁻³
10        2.4766 × 10⁻⁵   (6.7618 × 10⁻⁵)  1.8441 × 10⁻⁶   3.0382 × 10⁻⁴
15        1.1263 × 10⁻⁶   (9.7697 × 10⁻⁷)  1.5265 × 10⁻⁷   3.8548 × 10⁻⁶
20        2.1324 × 10⁻⁷   (2.7517 × 10⁻⁷)  3.4903 × 10⁻⁸   1.2734 × 10⁻⁶
25        1.1236 × 10⁻⁷   (1.0665 × 10⁻⁷)  1.8267 × 10⁻⁸   4.3665 × 10⁻⁷
27        7.2760 × 10⁻⁸   (4.1195 × 10⁻⁸)  1.0599 × 10⁻⁸   1.9125 × 10⁻⁷
30        8.8819 × 10⁻⁸   (7.7426 × 10⁻⁸)  1.8998 × 10⁻⁸   2.6956 × 10⁻⁷

Note: Neurons refers to the number of neurons in the hidden layer.
it allows computational complexity to be reduced, and it improves performance results. As a feature selection method, the AFN method should indicate how to choose the relevant variables. This information is provided by the total sensitivity indices (TSI), which give a ranking of the variables in terms of their variance contribution. In addition, a threshold is required to determine the variables to be selected, in such a way that variables with TSIs under the threshold are discarded. The establishment of this threshold is not a trivial issue and will be discussed in a future study. As a preliminary approach, the AFN method was used as a feature selection method for the breakwater problem. The threshold was established at 1%; that is, variables with a TSI over 1% were chosen (see Tables 7 and 9). Then multilayer perceptrons (MLP) were used to solve the breakwater problem in both cases (using all the given inputs and using only the inputs selected by the AFN method). The hyperbolic tangent was employed as the transfer function and the MSE as the performance function. The Levenberg-Marquardt algorithm was used to train the network (Levenberg, 1944; Marquardt, 1963). Two thousand epochs were employed because they were enough for the MLP to converge. The results are very representative in the case of pu, because the number of inputs is drastically reduced from 7 to 3. Table 10 shows these results. As can be seen, the reduction in the number of inputs leads to better performance when the same number of neurons in the hidden layer is used,
Table 11: Mean, Standard Deviation (STD), and Minimum and Maximum Values for the Normalized Mean Square Error (NMSE) for the Test Set When Estimating p1, p2, and p3, Considering All the Inputs (All) and Only the Relevant (Rel.) Ones.

      N    Var.   Mean NMSE       (STD)            Min             Max
p1    35   All    1.4148 × 10⁻²   (2.1045 × 10⁻²)  2.8752 × 10⁻³   9.8051 × 10⁻²
      37   Rel.   3.6129 × 10⁻³   (9.4444 × 10⁻⁴)  2.2487 × 10⁻³   5.0922 × 10⁻³
p2    21   All    1.0843 × 10⁻²   (8.4366 × 10⁻³)  4.1416 × 10⁻³   3.6060 × 10⁻²
      22   Rel.   7.3305 × 10⁻³   (4.7671 × 10⁻³)  2.6945 × 10⁻³   2.2399 × 10⁻²
p3    37   All    6.3983 × 10⁻³   (4.9059 × 10⁻³)  2.4433 × 10⁻³   2.2456 × 10⁻²
      39   Rel.   3.9794 × 10⁻³   (2.0662 × 10⁻³)  2.0258 × 10⁻³   9.1660 × 10⁻³

Notes: N refers to the number of neurons in the hidden layer. Var. refers to the number of variables.
although it involves a smaller set of weights; that is, the network topology is simplified. The best results achieved when estimating p1, p2, and p3 are shown in Table 11; although the variable reduction is not as pronounced there, better performance results are again achieved.

5 Conclusions and Future Work

In this article, a new methodology, the AFN method, based on ANOVA decomposition and functional networks, was presented and described. This methodology permits a simplified topology for a functional network to be learned from the available data. As stated in section 1, to our knowledge, there is no other method that permits this to be done. In addition to the advantages inherited from the ANOVA decomposition (uniqueness and orthogonality), the proposed methodology has the following advantages:
- Global sensitivity indices can be obtained from the application of the AFN. These can be used to establish the relevance of each functional component; consequently, they determine the input variables and relations to be selected. If a particular variable has no influence (in isolation or in relation to others), it can be discarded as an input for the functional or neural network.
- It allows the topology of a functional or neural network to be learned and simplified. If the variables required for estimating a specific function are important, then the topology of the functional network should include these relationships.
- All existing multivariate interactions among variables are identified by the global sensitivity indices.
- Local sensitivity indices. Although the letter focused on the global sensitivity indices for deriving the functional network topology, it is important to note that the proposed methodology also provides local sensitivity indices. These indices could be used to detect outliers in the samples or to perform selective sampling (this will be a future line of research for us).
- Several alternatives for selecting the orthonormal basis functions for the proposed approximation are available. Moreover, a new one has been presented here. As the orthonormalization is easily accomplished by one of these alternatives, the application of the AFN requires only a minimization problem to be solved.
The suitability of the proposed methodology was illustrated by its application to a real engineering problem: the design of a vertical breakwater. The performance results obtained were compared to those obtained using functional and neural networks to solve the same problem. It was demonstrated that although the AFN obtains similar performance results, it also returns a set of sensitivity indices that permits the initial topologies to be simplified (functional networks) or some of the input variables to be eliminated (neural networks). In view of the results obtained, future work will involve adapting the proposed methodology for use as a feature subset selection method. A detailed study will be carried out comparing the proposed methodology with the existing feature selection methods.

Acknowledgments

We are indebted to the Spanish Ministry of Science and Technology with FEDER funds (Projects DPI2002-04172-C04-02 and TIC-2003-00600), to the Xunta de Galicia (Project PGIDIT04-PXIC10502), and to Iberdrola for partial support of this work.

References

Aczél, J. (1966). Lectures on functional equations and their applications. New York: Academic Press.
Bridgman, P. (1922). Dimensional analysis. New Haven, CT: Yale University Press.
Brooke, A., Kendrik, D., Meeraus, A., & Raman, R. (1998). GAMS: A user's guide. Washington, DC: GAMS Development Corporation.
Buckingham, E. (1914). On physically similar systems: Illustrations of dimensional equations. Phys. Rev., 4, 345–376.
Buckingham, E. (1915). Model experiments and the form of empirical equations. Trans. ASME, 37, 263.
Castillo, E. (1998). Functional networks. Neural Processing Letters, 7, 151–159.
Castillo, E., Cobo, A., Gutiérrez, J. M., & Pruneda, E. (1998). Functional networks with applications. Boston: Kluwer.
Castillo, E., Cobo, A., Gutiérrez, J. M., & Pruneda, R. E. (1999). Working with differential, functional and difference equations using functional networks. Applied Mathematical Modeling, 23, 89–107.
Castillo, E., Cobo, A., Gutiérrez, J. M., & Pruneda, R. E. (2000). Functional networks: A new neural network based methodology. Computer-Aided Civil and Infrastructure Engineering, 15, 90–106.
Castillo, E., & Gutiérrez, J. M. (1998). Nonlinear time series modeling and prediction using functional networks: Extracting information masked by chaos. Physics Letters A, 244, 71–84.
Castillo, E., Gutiérrez, J., & Hadi, A. (1997). Sensitivity analysis in discrete Bayesian networks. IEEE Transactions on Systems, Man and Cybernetics, 26, 412–423.
Castillo, E., Iglesias, A., & Ruíz-Cobo, R. (2004). Functional equations in applied sciences. New York: Elsevier.
Castillo, E., & Ruíz-Cobo, R. (1992). Functional equations in science and engineering. New York: Marcel Dekker.
Chatterjee, S., & Hadi, A. S. (1988). Sensitivity analysis in linear regression. New York: Wiley.
Currin, C., Mitchell, T., Morris, M., & Ylvisaker, D. (1991). Bayesian prediction of deterministic functions, with applications to the design and analysis of computer experiments. Journal of the American Statistical Association, 86, 953–963.
Efron, B., & Stein, C. (1981). The jackknife estimate of variance. Annals of Statistics, 9, 586–596.
Fisher, R. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52, 399–433.
Goda, Y. (1972). Laboratory investigation of wave pressure exerted upon vertical and composite walls. Coastal Engineering, 15, 81–90.
Guyon, I., & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157–1182.
Hadi, A., & Nyquist, H. (2002). Sensitivity analysis in statistics. Journal of Statistical Studies: A Special Volume in Honor of Professor Mir Masoom Ali's 65th Birthday, 125–138.
Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Annals of Mathematical Statistics, 19, 293–325.
Jiang, T., & Owen, A. (2001). Quasi-regression with shrinkage. Mathematics and Computers in Simulation, 62, 231–241.
Koehler, J., & Owen, A. (1996). Computer experiments: Design and analysis of experiments. In S. Ghosh & C. R. Rao (Eds.), Handbook of statistics, 13 (pp. 261–308). New York: Elsevier Science.
Kohavi, R., & John, G. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1–2), 273–324.
Levenberg, K. (1944). A method for the solution of certain non-linear problems in least squares. Quarterly Journal of Applied Mathematics, 2(2), 164–168.
Marquardt, D. W. (1963). An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society of Industrial and Applied Mathematics, 11(2), 431–441.
Sacks, J., Welch, W. J., Mitchell, T. J., & Wynn, H. P. (1989). Design and analysis of computer experiments. Statistical Science, 4(4), 409–435.
Saltelli, A., Chan, K., & Scott, M. (2000). Sensitivity analysis. New York: Wiley.
Sobol, I. M. (1969). Multidimensional quadrature formulas and Haar functions. Moscow: Nauka. (In Russian)
Sobol, I. M. (2001). Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Mathematics and Computers in Simulation, 55, 271–280.
Takahashi, S. (1996). Design of vertical breakwaters (Tech. Rep. No. 34). Yokosuka, Japan: Port and Harbour Research Institute, Ministry of Transport.
Received August 12, 2004; accepted June 17, 2006.
LETTER
Communicated by Erin Bredensteiner
Second-Order Cone Programming Formulations for Robust Multiclass Classification

Ping Zhong
[email protected] College of Science, China Agricultural University, Beijing, 100083, China
Masao Fukushima
[email protected] Department of Applied Mathematics and Physics, Graduate School of Informatics, Kyoto University, Kyoto, 606-8501, Japan
Multiclass classification is an important and ongoing research subject in machine learning. Current support vector methods for multiclass classification implicitly assume that the parameters in the optimization problems are known exactly. However, in practice, the parameters have perturbations, since they are estimated from the training data, which are usually subject to measurement noise. In this article, we propose linear and nonlinear robust formulations for multiclass classification based on the M-SVM method. Preliminary numerical experiments confirm the robustness of the proposed method.

Neural Computation 19, 258–282 (2007)
© 2006 Massachusetts Institute of Technology

1 Introduction

Given L labeled examples known to come from K (> 2) classes, T = {(x_p, θ_p)}_{p=1}^{L} ⊂ X × Y, where X ⊂ R^N and Y = {1, . . . , K}, multiclass classification refers to the construction of a discriminant function from the input space X onto the unordered set of classes Y. Support vector machines (SVMs) serve as a useful and popular tool for classification. Recent developments in the study of SVMs show that there are roughly two types of approaches to the multiclass classification problem. One is to construct and fuse several binary classifiers, as in "one-against-all" (Bottou et al., 1994; Vapnik, 1998), "one-against-one" (Hastie & Tibshirani, 1998; Kressel, 1999), the directed acyclic graph SVM (DAGSVM; Platt, Cristianini, & Shawe-Taylor, 2000), error-correcting output codes (ECOC; Allwein, Schapire, & Singer, 2001; Dietterich & Bakiri, 1995), the K-SVCR method (Angulo, Parra, & Català, 2003), and the ν-K-SVCR method (Zhong & Fukushima, 2006), among others. The other, called "all-together," is to consider all data in one optimization formulation (Bennett & Mangasarian, 1994; Bredensteiner & Bennett, 1999; Guermeur, 2002; Vapnik, 1998; Weston & Watkins, 1998; Yajima, 2005). In this letter, we focus on the second approach.

There are several all-together methods. The method independently proposed by Vapnik (1998) and Weston and Watkins (1998) is similar to one-against-all. It constructs K two-class discriminants, where each discriminant separates a single class from all the others. Hence, there are K decision functions, but all are obtained by solving one optimization problem. Bennett and Mangasarian (1994) constructed a piecewise-linear discriminant for K-class classification by a single linear program. The method called M-SVM (Bredensteiner & Bennett, 1999) extends their method to generate a kernel-based nonlinear K-class discriminant by solving a convex quadratic program. Although the original forms proposed by Vapnik (1998), Weston and Watkins (1998), and Bredensteiner and Bennett (1999) are different, they are not only equivalent to each other but also equivalent to the form proposed by Guermeur (2002). Based on M-SVM, linear programming formulations in a low-dimensional feature subspace have also been proposed (Yajima, 2005).

In the methods noted, the parameters in the optimization problems are implicitly assumed to be known exactly. However, in practice, these parameters have perturbations, since they are estimated from the training data, which are usually corrupted by measurement noise. As pointed out by Goldfarb and Iyengar (2003), the solutions to the optimization problems are sensitive to parameter perturbations. Errors in the input space tend to get amplified in the decision function, which often results in misclassification. So it will be useful to explore formulations that can yield discriminants robust to such estimation errors.
In this article, we propose a robust formulation of M-SVM, which is represented as a second-order cone program (SOCP). The second-order cone (SOC) in R^n (n ≥ 1), also called the Lorentz cone, is the convex cone defined by

K^n = { (z_0, z̄^T)^T : z_0 ∈ R, z̄ ∈ R^{n−1}, ‖z̄‖ ≤ z_0 },
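As a concrete illustration (not part of the original letter), membership in K^n can be checked numerically; the following numpy sketch assumes the (z_0, z̄) partition above:

```python
import numpy as np

def in_soc(z, tol=1e-12):
    """Check whether z = (z0, zbar) lies in the second-order cone K^n,
    i.e. whether ||zbar|| <= z0 in the Euclidean norm."""
    z0, zbar = z[0], z[1:]
    return np.linalg.norm(zbar) <= z0 + tol

# (2, 1, 1) is in K^3 since ||(1, 1)|| = sqrt(2) <= 2;
# (1, 1, 1) is not, since ||(1, 1)|| = sqrt(2) > 1.
print(in_soc(np.array([2.0, 1.0, 1.0])))  # True
print(in_soc(np.array([1.0, 1.0, 1.0])))  # False
```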
where ‖·‖ denotes the Euclidean norm. An SOCP is a special class of convex optimization problem involving SOC constraints, which can be solved efficiently by interior point methods. Work related to SOCPs can be found, for example, in Alizadeh and Goldfarb (2003), Fukushima, Luo, and Tseng (2002), Hayashi, Yamashita, and Fukushima (2005), and Lobo, Vandenberghe, Boyd, and Lébret (1998). The letter is organized as follows. We first propose a robust formulation for the piecewise-linear M-SVM in section 2 and then construct a robust
classifier based on the dual SOCP formulation in section 3. In section 4, we extend the robust classifier to the piecewise-nonlinear M-SVM case. Section 5 gives numerical results. Section 6 concludes the letter.

2 Robust Piecewise-Linear M-SVM Formulation

For each i, let 𝒜^i be a set of examples in the N-dimensional real space R^N with cardinality l_i, and let A^i be the l_i × N matrix whose rows are the examples in 𝒜^i. The pth example in 𝒜^i and the pth row of A^i are both denoted A^i_p. Let e^i denote the vector of ones of dimension l_i. For each i, let w^i be a vector in R^N and b^i be a real number. The sets 𝒜^i, i = 1, . . . , K, are called piecewise-linearly separable (Bredensteiner & Bennett, 1999) if there exist w^i and b^i, i = 1, . . . , K, such that

A^i w^i − b^i e^i > A^i w^j − b^j e^i,  i, j = 1, . . . , K,  i ≠ j.

Piecewise-linear M-SVM can be formulated as follows (Bredensteiner & Bennett, 1999):

min_{w, b, y}  ν ( Σ_{i=1}^{K} Σ_{j=1}^{i−1} (1/2) ‖w^i − w^j‖^2 + (1/2) Σ_{i=1}^{K} ‖w^i‖^2 ) + (1 − ν) Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} (e^i)^T y^{ij}
s.t.  A^i (w^i − w^j) − (b^i − b^j) e^i + y^{ij} ≥ e^i,
      y^{ij} ≥ 0,  i, j = 1, . . . , K,  i ≠ j,    (2.1)

where ν ∈ (0, 1],

w = ((w^1)^T, (w^2)^T, . . . , (w^K)^T)^T ∈ R^{KN},    (2.2)

b = (b^1, b^2, . . . , b^K)^T ∈ R^K,    (2.3)

y = ((y^{12})^T, . . . , (y^{1K})^T, . . . , (y^{K1})^T, . . . , (y^{K(K−1)})^T)^T ∈ R^{L(K−1)},    (2.4)
and L = Σ_{i=1}^{K} l_i. When ν = 1, equation 2.1 is the formulation for the piecewise-linearly separable case. Otherwise, it is the formulation for the piecewise-linearly inseparable case. Figure 1 shows an example of a piecewise-linearly separable M-SVM for three classes in two dimensions.

Figure 1: Three classes separated by piecewise-linear M-SVM discriminants.

The training data A^i, i = 1, . . . , K, used in problem 2.1 are implicitly assumed to be known exactly. However, in practice, training data are often corrupted by measurement noise. Errors in the input space tend to get amplified in the decision function, which often results in misclassification. For example, suppose each example in Figure 1 is allowed to move in a sphere (see Figure 2). The original discriminants cannot separate the training data sets in the worst case. It will be useful to explore formulations that can yield discriminants robust to such estimation errors. In the following, we discuss such a formulation.

Figure 2: An example of the effect of measurement noises.

We assume

Â^i_p = A^i_p + ρ^i_p (a^i_p)^T,    (2.5)

where Â^i_p is the actual value of the training data and ρ^i_p (a^i_p)^T is the measurement noise, with a^i_p ∈ R^N, ‖a^i_p‖ = 1, and ρ^i_p ≥ 0 a given constant. Denote the unit sphere in R^N by U = {a ∈ R^N : ‖a‖ = 1}. The robust
version of formulation 2.1 can be stated as follows:

min_{w, b, y}  ν ( Σ_{i=1}^{K} Σ_{j=1}^{i−1} (1/2) ‖w^i − w^j‖^2 + (1/2) Σ_{i=1}^{K} ‖w^i‖^2 ) + (1 − ν) Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} (e^i)^T y^{ij}
s.t.  A^i_p (w^i − w^j) + ρ^i_p (a^i_p)^T (w^i − w^j) − (b^i − b^j) + y^{ij}_p ≥ 1,  ∀ a^i_p ∈ U,
      y^{ij}_p ≥ 0,  p = 1, . . . , l_i,  i, j = 1, . . . , K,  i ≠ j.    (2.6)
Since min{ρ^i_p (a^i_p)^T (w^i − w^j) : a^i_p ∈ U} = −ρ^i_p ‖w^i − w^j‖, problem 2.6 is equivalent to the following SOCP:

min_{w, b, y}  ν ( Σ_{i=1}^{K} Σ_{j=1}^{i−1} (1/2) ‖w^i − w^j‖^2 + (1/2) Σ_{i=1}^{K} ‖w^i‖^2 ) + (1 − ν) Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} (e^i)^T y^{ij}
s.t.  A^i_p (w^i − w^j) − ρ^i_p ‖w^i − w^j‖ − (b^i − b^j) + y^{ij}_p ≥ 1,
      y^{ij}_p ≥ 0,  p = 1, . . . , l_i,  i, j = 1, . . . , K,  i ≠ j.    (2.7)
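The equivalence rests on the Cauchy–Schwarz identity min{ρ a^T v : ‖a‖ = 1} = −ρ‖v‖, attained at a = −v/‖v‖. The following numpy check is illustrative only; the vector v stands for w^i − w^j:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.standard_normal(5)      # stands for w^i - w^j
rho = 0.3

# Closed form: min over unit vectors a of rho * a^T v = -rho * ||v||,
# attained at a = -v / ||v||.
closed_form = -rho * np.linalg.norm(v)

# Sample many random unit vectors; none can beat the closed-form bound.
a = rng.standard_normal((100000, 5))
a /= np.linalg.norm(a, axis=1, keepdims=True)
empirical = (rho * a @ v).min()

print(closed_form <= empirical + 1e-12)   # True: the bound is never beaten
# The minimizer attains the bound exactly:
attained = rho * (-v / np.linalg.norm(v)) @ v
print(abs(attained - closed_form) < 1e-9)  # True
```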
Denote Q = (K + 1) I_{KN} − Λ, with I_{KN} being the identity matrix of order KN and Λ ∈ R^{KN×KN} the K × K block matrix each of whose N × N blocks is I_N. Denote e = ((e^1)^T, . . . , (e^1)^T, . . . , (e^K)^T, . . . , (e^K)^T)^T ∈ R^{L(K−1)}, in which each e^i is repeated K − 1 times. The objective function of problem 2.7 can then be expressed compactly as

(ν/2) w^T Q w + (1 − ν) e^T y.    (2.8)
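The compact form 2.8 relies on the identity w^T Q w = Σ_{i<j} ‖w^i − w^j‖^2 + Σ_i ‖w^i‖^2, which can be verified numerically. The sketch below (illustrative, not from the original letter) builds the block matrix with numpy:

```python
import numpy as np

K, N = 4, 3
rng = np.random.default_rng(1)
W = rng.standard_normal((K, N))          # rows are w^1, ..., w^K
w = W.reshape(-1)                        # stacked vector of eq. 2.2

Lam = np.tile(np.eye(N), (K, K))         # K x K grid of I_N blocks
Q = (K + 1) * np.eye(K * N) - Lam

# w^T Q w equals sum_{i<j} ||w^i - w^j||^2 + sum_i ||w^i||^2
lhs = w @ Q @ w
rhs = sum(np.sum((W[i] - W[j]) ** 2) for i in range(K) for j in range(i)) \
      + np.sum(W ** 2)
print(np.isclose(lhs, rhs))  # True
```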
Additionally, Q is a symmetric positive definite matrix, which can be inferred from the following proposition. The proof of the proposition is omitted since it is similar to that given by Yajima (2005).

Proposition 1. Denote C = √(K+1) I_{KN} − ((√(K+1) − 1)/K) Λ. Then:

1. Q = C^2.
2. C is nonsingular, and C^{−1} = (1/√(K+1)) I_{KN} + ((√(K+1) − 1)/(K √(K+1))) Λ.
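Both parts of proposition 1 are easy to confirm numerically for small K and N; the following check is illustrative only and not a substitute for the proof:

```python
import numpy as np

K, N = 4, 2
I = np.eye(K * N)
Lam = np.tile(np.eye(N), (K, K))         # block matrix of I_N blocks
Q = (K + 1) * I - Lam

r = np.sqrt(K + 1.0)
C = r * I - (r - 1.0) / K * Lam
C_inv = I / r + (r - 1.0) / (K * r) * Lam

print(np.allclose(C @ C, Q))        # part 1: Q = C^2
print(np.allclose(C @ C_inv, I))    # part 2: the stated inverse is correct
```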
Let H^{ij} be the KN × N matrix with all blocks being N × N zero matrices except the ith block, which is I_N, and the jth block, which is −I_N:

H^{ij} = [O, . . . , O, I_N, O, . . . , O, −I_N, O, . . . , O]^T.    (2.9)

Then, by equation 2.2, we get

w^i − w^j = (H^{ij})^T w.    (2.10)

Let r^{ij} be the K-dimensional vector with all components being zero except the ith component, which is 1, and the jth component, which is −1:

r^{ij} = (0, . . . , 0, 1, 0, . . . , 0, −1, 0, . . . , 0)^T.    (2.11)

Then by equation 2.3, we get

b^i − b^j = (r^{ij})^T b.    (2.12)

Let h^{ij}_p be the L(K − 1)-dimensional vector with all components being zero except the ((K − 1) Σ_{k=1}^{i−1} l_k + (j − 1) l_i + p)th component, which is 1:

h^{ij}_p = (0, . . . , 0, . . . , 0, 1, 0, . . . , 0, . . . , 0)^T.    (2.13)

Then by equation 2.4, we get

y^{ij}_p = (h^{ij}_p)^T y.    (2.14)

By equations 2.10, 2.12, and 2.14, the first constraint in problem 2.7 can be rewritten as follows:

ρ^i_p ‖(H^{ij})^T w‖ ≤ A^i_p (H^{ij})^T w − (r^{ij})^T b + (h^{ij}_p)^T y − 1.    (2.15)
Therefore, by equations 2.8 and 2.15 and proposition 1, formulation 2.7 can be written as follows:

min_{w, b, y, t}  ν t + (1 − ν) e^T y
s.t.  (1/2) ‖C w‖^2 ≤ t,
      ρ^i_p ‖(H^{ij})^T w‖ ≤ A^i_p (H^{ij})^T w − (r^{ij})^T b + (h^{ij}_p)^T y − 1,
      p = 1, . . . , l_i,  i, j = 1, . . . , K,  i ≠ j,
      y ≥ 0.    (2.16)

Furthermore, formulation 2.16 can be cast as the following SOCP:

min_{w, b, y, t}  ν t + (1 − ν) e^T y
s.t.  ‖(√2 C w, 1 − t)‖ ≤ 1 + t,
      ρ^i_p ‖(H^{ij})^T w‖ ≤ A^i_p (H^{ij})^T w − (r^{ij})^T b + (h^{ij}_p)^T y − 1,
      p = 1, . . . , l_i,  i, j = 1, . . . , K,  i ≠ j,
      y ≥ 0.    (2.17)
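The step from 2.16 to 2.17 uses the standard reformulation of the quadratic constraint (1/2)‖Cw‖^2 ≤ t as the SOC constraint ‖(√2 C w, 1 − t)‖ ≤ 1 + t, valid for t ≥ 0 (square both sides and simplify). A quick randomized numpy check of the equivalence (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
C = rng.standard_normal((n, n))

# For t >= 0, (1/2)||Cw||^2 <= t  iff  ||(sqrt(2) C w, 1 - t)|| <= 1 + t:
# squaring gives 2||Cw||^2 + (1-t)^2 <= (1+t)^2, i.e. 2||Cw||^2 <= 4t.
for _ in range(1000):
    w = rng.standard_normal(n)
    t = rng.uniform(0.0, 5.0)
    quad = 0.5 * np.linalg.norm(C @ w) ** 2 <= t
    soc = np.linalg.norm(
        np.concatenate([np.sqrt(2.0) * C @ w, [1.0 - t]])) <= 1.0 + t
    assert quad == soc
print("quadratic constraint and SOC constraint agree")
```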
3 Robust Piecewise-Linear M-SVM Classifier

In this section, we construct a robust piecewise-linear M-SVM classifier based on the dual formulation of problem 2.17.

3.1 Dual of the Robust Piecewise-Linear M-SVM Formulation. Denote

Ā = (B_1^T, B_2^T, . . . , B_K^T)^T ∈ R^{L(K−1)×KN}    (3.1)
with B_i ∈ R^{l_i(K−1)×KN} the block matrix whose K − 1 block rows correspond to the indices j = 1, . . . , K, j ≠ i, each block row having A^i in the ith block column, −A^i in the jth block column, and O elsewhere. Denote

H̄ = (M̄_1^T, M̄_2^T, . . . , M̄_K^T)^T ∈ R^{LN(K−1)×KN},    (3.2)

where M̄_i ∈ R^{l_i N(K−1)×KN} has the same block pattern as B_i, with A^i replaced by

M_i := M_i(ρ) = (ρ^i_1 I_N, . . . , ρ^i_{l_i} I_N)^T ∈ R^{l_i N×N},  i = 1, . . . , K.

Denote

Ē = (E_1^T, E_2^T, . . . , E_K^T)^T ∈ R^{L(K−1)×K}    (3.3)
with E_i ∈ R^{l_i(K−1)×K} following the same block pattern as B_i, with e^i in the ith column, −e^i in the jth column, and 0 elsewhere in each block row.
We can derive the following dual of problem 2.17 (see appendix A):

max_{α, s, σ, τ}  e^T α − (σ + τ)
s.t.  Ē^T α = 0,
      α ≤ (1 − ν) e,
      σ − τ = ν,
      ‖( −(1/√(2(K+1))) (Ā^T α + H̄^T s), τ )‖ ≤ σ,
      ‖s^{ij}_p‖ ≤ α^{ij}_p,  p = 1, . . . , l_i,  i, j = 1, . . . , K,  j ≠ i,    (3.4)

where

α = ((α^{12})^T, . . . , (α^{1K})^T, . . . , (α^{K1})^T, . . . , (α^{K(K−1)})^T)^T ∈ R^{L(K−1)},    (3.5)

s = ((s^{12}_1)^T, . . . , (s^{12}_{l_1})^T, . . . , (s^{1K}_1)^T, . . . , (s^{1K}_{l_1})^T, . . . , (s^{K(K−1)}_1)^T, . . . , (s^{K(K−1)}_{l_K})^T)^T ∈ R^{LN(K−1)}.    (3.6)
In addition, we get the following complementarity equations at optimality:

( α^{ij}_p, (s^{ij}_p)^T ) ( A^i_p (H^{ij})^T w − (r^{ij})^T b + (h^{ij}_p)^T y − 1, (ρ^i_p (H^{ij})^T w)^T )^T = 0,
p = 1, . . . , l_i,  i, j = 1, . . . , K,  j ≠ i,    (3.7)
( σ, (−(1/√(2(K+1))) (Ā^T α + H̄^T s))^T, τ ) ( 1 + t, (√2 C w)^T, 1 − t )^T = 0,    (3.8)
((1 − ν) e − α)^T y = 0.    (3.9)
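The SOC complementarity in equation 3.8 pairs a primal cone element with a dual cone element; two cone elements are orthogonal exactly under the conditions listed in appendix B. The following numpy sketch (illustrative, with arbitrary numbers) exhibits a boundary point of K^3 and an orthogonal partner of the form µ(z_0, −z̄):

```python
import numpy as np

# A boundary point of K^3: z = (z0, zbar) with ||zbar|| = z0.
z = np.array([np.sqrt(2.0), 1.0, 1.0])

# Any cone element orthogonal to z has the form mu * (z0, -zbar), mu >= 0.
mu = 0.7
z_prime = mu * np.array([z[0], -z[1], -z[2]])

print(np.isclose(np.linalg.norm(z[1:]), z[0]))              # z on bd K^3
print(np.isclose(np.linalg.norm(z_prime[1:]), z_prime[0]))  # z' on bd K^3
print(np.isclose(z @ z_prime, 0.0))                         # z^T z' = 0
```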
3.2 Robust Classifier. From formulation 3.4, we get σ > 0. In fact, if σ = 0, then τ = 0, and the third constraint of formulation 3.4 becomes ν = 0, which contradicts ν > 0. By the complementarity equation 3.8, we have the following implications (see appendix B for the complementarity conditions in SOCP, equations B.1 to B.3). If

‖( −(1/√(2(K+1))) (Ā^T α + H̄^T s), τ )‖ < σ,

then ‖(√2 C w, 1 − t)‖ = 1 + t = 0. But this contradicts t ≥ 0. So we must have

‖( −(1/√(2(K+1))) (Ā^T α + H̄^T s), τ )‖ = σ.

Since σ > 0, we have ‖(√2 C w, 1 − t)‖ = 1 + t. Hence, there exists µ > 0 such that

√2 C w = (µ/√(2(K+1))) (Ā^T α + H̄^T s)  and  1 − t = −µτ.    (3.10)
In addition, it is easy to get the following equalities by proposition 1:

C^{−1} Ā^T = (1/√(K+1)) Ā^T  and  C^{−1} H̄^T = (1/√(K+1)) H̄^T.    (3.11)
Hence, by equations 3.10 and 3.11, we get

w = ((t − 1)/(2τ(K + 1))) (Ā^T α + H̄^T s).

Furthermore, by equations 2.2, 3.1, and 3.2, we get

w^i = ((t − 1)/(2τ(K + 1))) Σ_{j=1, j≠i}^{K} [ Σ_{p=1}^{l_i} α^{ij}_p (A^i_p)^T − Σ_{p=1}^{l_j} α^{ji}_p (A^j_p)^T + Σ_{p=1}^{l_i} ρ^i_p s^{ij}_p − Σ_{p=1}^{l_j} ρ^j_p s^{ji}_p ].

Therefore, the decision functions are given by

f_i(x) = x^T w^i − b^i
       = ((t − 1)/(2τ(K + 1))) Σ_{j=1, j≠i}^{K} [ Σ_{p=1}^{l_i} α^{ij}_p x^T (A^i_p)^T − Σ_{p=1}^{l_j} α^{ji}_p x^T (A^j_p)^T + Σ_{p=1}^{l_i} ρ^i_p x^T s^{ij}_p − Σ_{p=1}^{l_j} ρ^j_p x^T s^{ji}_p ] − b^i,  i = 1, . . . , K.    (3.12)

In particular, if we set ρ^i_p = 0, i = 1, . . . , K, p = 1, . . . , l_i, then equation 3.12 becomes

f_i(x) = ((t − 1)/(2τ(K + 1))) Σ_{j=1, j≠i}^{K} [ Σ_{p=1}^{l_i} α^{ij}_p x^T (A^i_p)^T − Σ_{p=1}^{l_j} α^{ji}_p x^T (A^j_p)^T ] − b^i,  i = 1, . . . , K.    (3.13)
Since ρ^i_p = 0, p = 1, . . . , l_i, i = 1, . . . , K, means that the parameter perturbations are not considered (cf. equation 2.5), equation 3.13 corresponds to the discriminants for the case of no measurement noise. With these decision functions, an example x is classified into a class i such that f_i(x) = max{f_1(x), . . . , f_K(x)}.

4 Robust Piecewise-Nonlinear M-SVM Classifier

The above discussion is concerned with the piecewise-linear case. In this section, the analysis is extended to the nonlinear case.
To construct separating functions in a higher-dimensional feature space, a nonlinear mapping ψ : X → F is used to transform the original examples into the feature space, which is equipped with the inner product defined by k(x, x′) = ⟨ψ(x), ψ(x′)⟩, where k(·, ·) : R^N × R^N → R is a function called a kernel. Typical choices of kernels include polynomial kernels k(x, x′) = (x^T x′ + 1)^d with an integer parameter d and radial basis function (RBF) kernels k(x, x′) = exp(−‖x − x′‖^2 / κ) with a real parameter κ.

4.1 Robust Piecewise-Nonlinear M-SVM Formulation. We assume

ψ((Â^i_p)^T) = ψ((A^i_p)^T) + ρ̃^i_p ã^i_p,  ã^i_p ∈ Ũ,    (4.1)

where Ũ is the unit sphere in the feature space. For the nonlinear case, ρ̃^i_p in the feature space associated with a kernel k(·, ·) can be computed as

ρ̃^i_p = ‖ψ((Â^i_p)^T) − ψ((A^i_p)^T)‖
      = ( ⟨ψ((Â^i_p)^T), ψ((Â^i_p)^T)⟩ − 2 ⟨ψ((Â^i_p)^T), ψ((A^i_p)^T)⟩ + ⟨ψ((A^i_p)^T), ψ((A^i_p)^T)⟩ )^{1/2}
      = ( k((Â^i_p)^T, (Â^i_p)^T) − 2 k((Â^i_p)^T, (A^i_p)^T) + k((A^i_p)^T, (A^i_p)^T) )^{1/2}.

For example, for RBF kernels, since

k((Â^i_p)^T, (Â^i_p)^T) = 1,
k((Â^i_p)^T, (A^i_p)^T) = exp(−(ρ^i_p)^2 / κ),
k((A^i_p)^T, (A^i_p)^T) = 1,

we have

ρ̃^i_p = ( 2 − 2 exp(−(ρ^i_p)^2 / κ) )^{1/2}.    (4.2)
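Equation 4.2 can be checked directly against the definition of ρ̃^i_p; the sketch below is illustrative only, with arbitrary values for ρ, κ, and the input point:

```python
import numpy as np

def rho_tilde_rbf(rho, kappa):
    """Feature-space radius of an input-space perturbation of size rho
    under the RBF kernel k(x, x') = exp(-||x - x'||^2 / kappa)  (eq. 4.2)."""
    return np.sqrt(2.0 - 2.0 * np.exp(-rho**2 / kappa))

# Direct check against ||psi(x + rho*a) - psi(x)||: for a unit direction a,
# k(x, x) = 1 and k(x, x + rho*a) = exp(-rho^2 / kappa).
rho, kappa = 0.3, 2.0
x = np.array([0.5, -1.0, 2.0])
a = np.array([1.0, 0.0, 0.0])                # unit perturbation direction
k = lambda u, v: np.exp(-np.sum((u - v) ** 2) / kappa)
xp = x + rho * a
direct = np.sqrt(k(xp, xp) - 2.0 * k(xp, x) + k(x, x))
print(np.isclose(direct, rho_tilde_rbf(rho, kappa)))  # True
```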
The robust version of the piecewise-nonlinear M-SVM can be expressed as follows:

min_{w, b, y}  ν ( Σ_{i=1}^{K} Σ_{j=1}^{i−1} (1/2) ‖w^i − w^j‖^2 + (1/2) Σ_{i=1}^{K} ‖w^i‖^2 ) + (1 − ν) Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} (e^i)^T y^{ij}
s.t.  (ψ((A^i_p)^T))^T (w^i − w^j) + ρ̃^i_p (ã^i_p)^T (w^i − w^j) − (b^i − b^j) + y^{ij}_p ≥ 1,  ∀ ã^i_p ∈ Ũ,
      y^{ij}_p ≥ 0,  p = 1, . . . , l_i,  i, j = 1, . . . , K,  i ≠ j,

which can be rewritten as the following SOCP:

min_{w, b, y}  ν ( Σ_{i=1}^{K} Σ_{j=1}^{i−1} (1/2) ‖w^i − w^j‖^2 + (1/2) Σ_{i=1}^{K} ‖w^i‖^2 ) + (1 − ν) Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} (e^i)^T y^{ij}
s.t.  (ψ((A^i_p)^T))^T (w^i − w^j) − ρ̃^i_p ‖w^i − w^j‖ − (b^i − b^j) + y^{ij}_p ≥ 1,
      y^{ij}_p ≥ 0,  p = 1, . . . , l_i,  i, j = 1, . . . , K,  i ≠ j.    (4.3)

Denote

Ã = (B̃_1^T, B̃_2^T, . . . , B̃_K^T)^T,

where B̃_i has the same block pattern as B_i in section 3.1, with the data matrix A^i replaced by its feature-space image

(ψ((A^i_1)^T), . . . , ψ((A^i_{l_i})^T))^T.

Denote

H̃ = (M̃_1^T, M̃_2^T, . . . , M̃_K^T)^T,
where M̃_i has the same block pattern as M̄_i, with the blocks

M_i := M_i(ρ̃) = (ρ̃^i_1 I_N, . . . , ρ̃^i_{l_i} I_N)^T,  i = 1, . . . , K.    (4.4)
In a similar manner to that of obtaining formulation 3.4, we get the dual of problem 4.3 as follows:

max_{α, s, σ, τ}  e^T α − (σ + τ)
s.t.  Ē^T α = 0,
      α ≤ (1 − ν) e,
      σ − τ = ν,
      ‖( −(1/√(2(K+1))) (Ã^T α + H̃^T s), τ )‖ ≤ σ,
      ‖s^{ij}_p‖ ≤ α^{ij}_p,  p = 1, . . . , l_i,  i, j = 1, . . . , K,  j ≠ i.    (4.5)
4.2 Robust Classifier in a Feature Subspace. In the previous section, we obtained the robust formulation 4.5 in the feature space. However, the feature space F may have an arbitrarily large dimension, possibly infinite. Usually, kernel principal component analysis (KPCA) (Schölkopf, Smola, & Müller, 1998; Yajima, 2005) is used for feature extraction. In this section, we first reduce the feature space F to an S-dimensional subspace with S < L by KPCA and then construct the corresponding robust classifier of the piecewise-nonlinear M-SVM in the subspace.

Consider the kernel matrix G = (k((A^i_p)^T, (A^j_q)^T)) ∈ R^{L×L} associated with a kernel k(·, ·). Since G is a symmetric positive semidefinite matrix, there is an orthogonal matrix V such that G = V D V^T, where D is a diagonal matrix whose diagonal elements are the eigenvalues λ_i ≥ 0, i = 1, . . . , L, of G, and v_i, i = 1, . . . , L, the columns of V, are the corresponding eigenvectors. Suppose λ_1 ≥ λ_2 ≥ . . . ≥ λ_L. Select the S (< L)
largest positive eigenvalues and the corresponding eigenvectors. Denote D_S = (√λ_1 v_1, √λ_2 v_2, . . . , √λ_S v_S), where the components of v_i are written as follows:

v_i = (v^1_{i,1}, . . . , v^1_{i,l_1}, v^2_{i,1}, . . . , v^2_{i,l_2}, . . . , v^K_{i,1}, . . . , v^K_{i,l_K})^T.

Define the vectors

u_i := Σ_{j=1}^{K} Σ_{p=1}^{l_j} v^j_{i,p} ψ((A^j_p)^T) / √λ_i,  i = 1, . . . , S.

Then we have

u_i^T u_i = (1/λ_i) v_i^T G v_i = 1

and

u_i^T u_j = (1/√(λ_i λ_j)) v_i^T G v_j = 0,  i ≠ j.
Therefore, {u_1, u_2, . . . , u_S} forms an orthonormal basis of an S-dimensional subspace of F. Let ψ_S(x) be the S-dimensional subcoordinate of ψ(x), which is given by

ψ_S(x) = ( (1/√λ_1) Σ_{j=1}^{K} Σ_{p=1}^{l_j} v^j_{1,p} k(x, (A^j_p)^T), . . . , (1/√λ_S) Σ_{j=1}^{K} Σ_{p=1}^{l_j} v^j_{S,p} k(x, (A^j_p)^T) )^T.    (4.6)

Then, similar to equation 3.12, we can get the decision functions associated with the robust formulation of the piecewise-nonlinear M-SVM in the feature subspace as follows:

f_i(x) = ((t − 1)/(2τ(K + 1))) Σ_{j=1, j≠i}^{K} [ Σ_{p=1}^{l_i} α^{ij}_p ψ_S(x)^T ψ_S((A^i_p)^T) − Σ_{p=1}^{l_j} α^{ji}_p ψ_S(x)^T ψ_S((A^j_p)^T) + Σ_{p=1}^{l_i} ρ̃^i_p ψ_S(x)^T s^{ij}_p − Σ_{p=1}^{l_j} ρ̃^j_p ψ_S(x)^T s^{ji}_p ] − b^i,  i = 1, . . . , K.    (4.7)
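The subspace map of equation 4.6 can be sketched in numpy as follows (illustrative; the variable names and the toy data are not from the letter). Stacking ψ_S over the training points gives columns with squared norms λ_1, . . . , λ_S, reflecting the orthonormality of the u_i:

```python
import numpy as np

def rbf(u, v, kappa=2.0):
    """RBF kernel, broadcasting over leading axes."""
    return np.exp(-np.sum((u - v) ** 2, axis=-1) / kappa)

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 4))            # L = 20 toy training points
L = X.shape[0]

# Kernel matrix and its eigendecomposition G = V diag(lam) V^T.
G = rbf(X[:, None, :], X[None, :, :])
lam, V = np.linalg.eigh(G)
order = np.argsort(lam)[::-1]               # eigh is ascending; sort descending
lam, V = lam[order], V[:, order]

S = 5                                        # dimension of the subspace

def psi_S(x):
    """S-dimensional subspace coordinates of psi(x), as in eq. 4.6."""
    kx = rbf(x, X)                           # (k(x, x_1), ..., k(x, x_L))
    return (V[:, :S].T @ kx) / np.sqrt(lam[:S])

# Over the training points, the coordinate columns are orthogonal with
# squared norms equal to the top eigenvalues.
P = np.vstack([psi_S(x) for x in X])         # shape (L, S)
print(np.allclose(P.T @ P, np.diag(lam[:S])))  # True
```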
Table 1: Description of Iris, Wine, and Glass Data Sets.

Name    Dimension (N)    Number of Classes (K)    Number of Examples (L)
Iris          4                   3                        150
Wine         13                   3                        178
Glass         9                   6                        214
5 Preliminary Numerical Results

In this section, through numerical experiments, we examine the performance of the robust piecewise-nonlinear M-SVM formulation and the original model for multiclass classification problems. We use the RBF kernel in the experiments. As described in section 4.2, we first construct an L × L kernel matrix G associated with the RBF kernel for the training data set. Then we decompose G and select an appropriate number S. Using equation 4.6, we obtain the S-dimensional subcoordinate of each point. The problems used in the experiments are the robust model 4.5 and the original model obtained by setting ρ̃ = 0 in equation 4.4. In the latter model, we have H̃ = O, and thus we may write the problem as follows:

max_{α, σ, τ}  e^T α − (σ + τ)
s.t.  Ē^T α = 0,
      α ≤ (1 − ν) e,
      σ − τ = ν,
      ‖( −(1/√(2(K+1))) Ã^T α, τ )‖ ≤ σ.    (5.1)

The experiments were implemented on a PC (1 GB RAM, CPU 3.00 GHz) using SeDuMi 1.05 (Sturm, 2001) as a solver. This solver was developed by J. Sturm for optimization problems over symmetric cones, including SOCPs. Some experimental results on real-world data sets taken from the UCI machine learning repository (Blake & Merz, 1998) are reported below. Table 1 gives a description of the data sets. In the experiments, the data sets were normalized to lie between −1 and 1. For simplicity, we set all ρ^i_p in equation 2.5 to a constant ρ. The measurement noise a^i_p was generated randomly from the normal distribution and scaled onto the unit sphere.

Two experiments were performed. In the first, an appropriate value of S for obtaining reasonable discriminants was sought. The second experiment was conducted on the three data sets with the measurement noise. Tenfold cross-validation was used in the experiments.

In order to seek an appropriate value of S, a ratio Ra is set. It is chosen from the set {0.5, 0.6, 0.7, 0.8, 0.99}. For each value of Ra, we find the smallest integer S such that Σ_{i=1}^{S} λ_i / Σ_{i=1}^{L} λ_i ≥ Ra, and let Rt := Σ_{i=1}^{S} λ_i / Σ_{i=1}^{L} λ_i. At the same time, we test the accuracy on the validation set by computing the percentage of tenfold testing correctness (PT). Table 2 contains these three kinds of results for the robust model (I) and the original model (II) on the Iris, Wine, and Glass data sets with the measurement noise scaled by ρ = 0.3. When Ra is large, we were unable to solve the problems for the Wine and Glass data sets because of memory limitations. Nevertheless, it can be seen from Table 2 that values of Rt from around 50% up to 70% yield reasonable discriminants. Moreover, in all cases, S is much smaller than the data size L.

Table 2: Results for Iris, Wine, and Glass Data Sets with Noise (ρ = 0.3, κ = 2, ν = 0.05).

                       Iris                    Wine                    Glass
Ra     Model    S     Rt      PT        S     Rt      PT       S     Rt      PT
0.5    I        1   0.5364   62.67      5   0.5179   90.0      1   0.5737   35.24
       II       1   0.5364   60.67      5   0.5179   88.89     1   0.5737   31.43
0.6    I        2   0.7950   89.33      9   0.6102   88.89     2   0.6826   66.67
       II       2   0.7950   87.33      9   0.6102   80.0      2   0.6826   32.86
0.7    I        2   0.7950   89.33     16   0.7103   87.78     3   0.7523   66.67
       II       2   0.7950   87.33     16   0.7103   82.22     3   0.7523   38.57
0.8    I        3   0.8836   85.33      —     —        —       4   0.8002   66.67
       II       3   0.8836   84.0       —     —        —       4   0.8002   45.24
0.99   I       12   0.9911   88.0       —     —        —       —     —        —
       II      12   0.9911   86.67      —     —        —       —     —        —

Note: I: robust model; II: original model. PT: percentage of tenfold testing correctness on the validation set.

Table 3 shows the percentage of tenfold testing correctness for the robust model and the original model on the three data sets with various noise levels ρ. In particular, ρ = 0 means that there is no noise on the data sets; in this case, the robust model reduces to the original model. It can be observed that the performance of the robust model is consistently better than that of the original model, especially for the Glass data set. In addition, the correctness for the original model on the three data sets with noise is much lower than the results when ρ = 0.

Table 3: Percentage of Tenfold Test Correctness for the Data Sets with Noise (κ = 2, ν = 0.05).

                                     ρ
Data Set (S)   Model     0      0.1     0.2     0.3     0.4     0.5
Iris (2)       I       94.0   88.67   88.0    89.33   91.33   90.0
               II      94.0   87.33   87.33   87.33   86.67   85.33
Wine (5)       I      98.33   91.11   90.0    90.0    87.78   84.44
               II     98.33   88.89   88.83   88.89   85.56   82.22
Glass (4)      I      59.52   66.19   65.71   66.67   66.67   66.67
               II     59.52   46.19   45.71   45.24   49.05   49.52

For the linear case, we also find that the correctness for the original model on the data sets with noise is lower than the correctness on the data sets without noise. For example, the correctness on the Iris, Wine, and Glass data sets without noise is 90.67%, 97.78%, and 57.14%, respectively, whereas when ρ = 0.5, the corresponding correctness is 87.33%, 78.89%, and 48.57%. For the robust model, when ρ = 0.5, the corresponding correctness is 90.67%, 81.11%, and 56.19%, respectively.
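The selection of S from the ratio Ra can be sketched as follows (illustrative; the toy spectrum is not from the experiments):

```python
import numpy as np

def choose_S(eigvals, Ra):
    """Smallest S with (sum of top-S eigenvalues) / (total sum) >= Ra.
    Returns (S, Rt), where Rt is the attained ratio."""
    lam = np.sort(eigvals)[::-1]             # sort descending
    ratios = np.cumsum(lam) / lam.sum()      # increasing sequence
    S = int(np.searchsorted(ratios, Ra) + 1)
    return S, float(ratios[S - 1])

# Toy spectrum: the first few eigenvalues dominate.
lam = np.array([5.0, 2.0, 1.5, 0.8, 0.4, 0.2, 0.1])
S, Rt = choose_S(lam, 0.8)
print(S, round(Rt, 4))  # 3 0.85
```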
6 Conclusion

In this letter, we have established robust linear and nonlinear formulations for multiclass classification based on the M-SVM method. KPCA has been used to reduce the feature space to an S-dimensional subspace. The preliminary numerical experiments show that the performance of the robust model is better than that of the original model. Unfortunately, the conic convex optimization solver SeDuMi 1.05 (Sturm, 2001) used in our numerical experiments could solve problems only for small data sets. Sequential minimal optimization (SMO) techniques (Platt, 1999) are essential in large-scale implementations of SVMs. Future subjects for investigation include developing SMO-based robust algorithms for multiclass classification.
Appendix A: Dual of Formulation 2.17

In order to get the dual of problem 2.17, we first state a more general primal and dual form of the SOCP. The notation used in section A.1 is independent of that in the rest of the letter.

A.1 A General Primal and Dual Pair. For the SOCP

min_{x, y, z}  c^T x + d^T y + e^T z
s.t.  A^T x + B^T y + C^T z = f,
      y ≥ 0,  z ∈ K^{n_1} × · · · × K^{n_l},    (A.1)

its dual is written as follows:

max_{w, u, v}  f^T w
s.t.  A w = c,  B w + u = d,  C w + v = e,
      u ≥ 0,  v ∈ K^{n_1} × · · · × K^{n_l}.    (A.2)

Now consider the problem

min_{x, y, z}  c^T x + d^T y
s.t.  ‖Ḡ_i^T x + q_i‖ ≤ g_i^T x + h_i^T y + r_i^T z + a_i,  i = 1, . . . , m,
      y ≥ 0.

This problem can be formulated as follows:

min_{x, y, z, ζ}  c^T x + d^T y
s.t.  ζ_i − ( g_i^T x + h_i^T y + r_i^T z + a_i, (Ḡ_i^T x + q_i)^T )^T = 0,  i = 1, . . . , m,
      ζ_i ∈ K^{n_i},  i = 1, . . . , m,
      y ≥ 0,    (A.3)
which can be further rewritten as

min_{x, y, z, ζ}  c^T x + d^T y
s.t.  ζ_i − G_i^T x − H_i^T y − R_i^T z = β_i,  i = 1, . . . , m,
      ζ_i ∈ K^{n_i},  i = 1, . . . , m,
      y ≥ 0,

where G_i = (g_i, Ḡ_i), H_i = (h_i, O), R_i = (r_i, O), and β_i = (a_i, q_i^T)^T, i = 1, . . . , m. In view of the primal-dual pair A.1 and A.2, we obtain the dual of problem A.3 as follows:

max_{η, λ}  − Σ_{i=1}^{m} β_i^T η_i
s.t.  Σ_{i=1}^{m} G_i η_i = c,
      Σ_{i=1}^{m} H_i η_i + λ = d,
      Σ_{i=1}^{m} R_i η_i = 0,
      λ ≥ 0,  η_i ∈ K^{n_i},  i = 1, . . . , m.    (A.4)
A.2 Dual of Problem 2.17. In the following, we derive the dual of formulation 2.17. The primal problem 2.17 can be put in the following equivalent form:

min_{x, b, y}  (0^T, ν) x + (1 − ν) e^T y
s.t.  ‖ [ √2 C  0 ; 0^T  −1 ] x + (0, 1)^T ‖ ≤ (0^T, 1) x + 1,
      ‖ (ρ^i_p (H^{ij})^T, 0) x ‖ ≤ (A^i_p (H^{ij})^T, 0) x + (h^{ij}_p)^T y − (r^{ij})^T b − 1,
      p = 1, . . . , l_i,  i, j = 1, . . . , K,  i ≠ j,
      y ≥ 0,

where x = (w^T, t)^T. Then, by equation A.4, we get the dual of problem 2.17 as follows:
max_{α, s, σ, τ}  Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} Σ_{p=1}^{l_i} α^{ij}_p − (σ + τ)    (A.5)
s.t.  √2 C^T ξ + Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} Σ_{p=1}^{l_i} ( α^{ij}_p H^{ij} (A^i_p)^T + ρ^i_p H^{ij} s^{ij}_p ) = 0,    (A.6)
      σ − τ = ν,    (A.7)
      Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} Σ_{p=1}^{l_i} α^{ij}_p h^{ij}_p + λ = (1 − ν) e,    (A.8)
      − Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} Σ_{p=1}^{l_i} α^{ij}_p r^{ij} = 0,    (A.9)
      ‖(ξ^T, τ)‖ ≤ σ,    (A.10)
      ‖s^{ij}_p‖ ≤ α^{ij}_p,  p = 1, . . . , l_i,  i, j = 1, . . . , K,  i ≠ j,    (A.11)
      λ ≥ 0.    (A.12)
By equations 2.9, 3.1, and 3.5, we get

Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} Σ_{p=1}^{l_i} α^{ij}_p H^{ij} (A^i_p)^T = Ā^T α.    (A.13)

By equations 3.2 and 3.6, we get

Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} Σ_{p=1}^{l_i} ρ^i_p H^{ij} s^{ij}_p = H̄^T s.    (A.14)
Hence, by equations A.13 and A.14, we can express equation A.6 compactly as follows:

√2 C^T ξ + Ā^T α + H̄^T s = 0.    (A.15)

By equations A.15 and 3.11, we get the following equation:

ξ = −(1/√(2(K+1))) (Ā^T α + H̄^T s).    (A.16)

By equations 2.13 and 3.5, we have

Σ_{i=1}^{K} Σ_{j=1, j≠i}^{K} Σ_{p=1}^{l_i} α^{ij}_p h^{ij}_p = α.

Hence, equation A.8 can be expressed as follows:

(1 − ν) e − λ − α = 0.    (A.17)

By equations 2.11, 3.3, and 3.5, we can rewrite equation A.9 as follows:

− Ē^T α = 0.    (A.18)
Combining equations A.16 to A.18, the problem given in A.5 to A.12 can be written as problem 3.4.

Appendix B: Complementarity Conditions of SOCP

Let bd K^n denote the boundary of K^n:

bd K^n = { (z_0, z̄^T)^T ∈ K^n : ‖z̄‖ = z_0 }.

Let int K^n denote the interior of K^n:

int K^n = { (z_0, z̄^T)^T ∈ K^n : ‖z̄‖ < z_0 }.

For two elements z = (z_0, z̄^T)^T ∈ K^n and z′ = (z′_0, (z̄′)^T)^T ∈ K^n, we have z^T z′ = 0 if and only if the following conditions are satisfied (Lobo et al., 1998):

z ∈ int K^n ⇒ z̄′ = 0 and z′_0 = 0,    (B.1)
z′ ∈ int K^n ⇒ z̄ = 0 and z_0 = 0,    (B.2)
z ∈ bd K^n \ {0} ⇒ z′ = µ (z_0, −z̄^T)^T,    (B.3)
where µ > 0 is a constant. These three conditions are regarded as a generalization of the complementary slackness conditions in linear programming.

Acknowledgments

P.Z. is supported in part by a grant-in-aid from the Ministry of Education, Culture, Sports, Science and Technology of Japan and by National Science Foundation of China grant No. 70601033. M.F. is supported in part by a Scientific Research Grant-in-Aid from the Japan Society for the Promotion of Science.

References

Alizadeh, F., & Goldfarb, D. (2003). Second-order cone programming. Mathematical Programming, Series B, 95, 3–51.
Allwein, E. L., Schapire, R. E., & Singer, Y. (2001). Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1, 113–141.
Angulo, C., Parra, X., & Català, A. (2003). K-SVCR: A support vector machine for multi-class classification. Neurocomputing, 55, 57–77.
Bennett, K. P., & Mangasarian, O. L. (1994). Multicategory discrimination via linear programming. Optimization Methods and Software, 3, 27–39.
Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. University of California. Available online at http://www.ics.uci.edu/~mlearn/MLRepository.html.
Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Jackel, L. D., LeCun, Y., Müller, U. A., Säckinger, E., Simard, P., & Vapnik, V. (1994). Comparison of classifier methods: A case study in handwriting digit recognition. In IAPR (Ed.), Proceedings of the International Conference on Pattern Recognition (pp. 77–82). Piscataway, NJ: IEEE Computer Society Press.
Bredensteiner, E. J., & Bennett, K. P. (1999). Multicategory classification by support vector machines. Computational Optimization and Applications, 12, 53–79.
Dietterich, T. G., & Bakiri, G. (1995). Solving multi-class learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263–286.
Fukushima, M., Luo, Z. Q., & Tseng, P. (2002). Smoothing functions for second-order-cone complementarity problems. SIAM Journal on Optimization, 12, 436–460.
Goldfarb, D., & Iyengar, G. (2003). Robust convex quadratically constrained programs. Mathematical Programming, 97, 495–515.
Guermeur, Y. (2002). Combining discriminant models with new multi-class SVMs. Pattern Analysis and Applications, 5, 168–179.
Hastie, T. J., & Tibshirani, R. J. (1998). Classification by pairwise coupling. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 507–513). Cambridge, MA: MIT Press.
Hayashi, S., Yamashita, N., & Fukushima, M. (2005). A combined smoothing and regularization method for monotone second-order cone complementarity problems. SIAM Journal on Optimization, 15, 593–615.
Kressel, U. (1999). Pairwise classification and support vector machines. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 255–268). Cambridge, MA: MIT Press.
Lobo, M. S., Vandenberghe, L., Boyd, S., & Lébret, H. (1998). Applications of second-order cone programming. Linear Algebra and Its Applications, 284, 193–228.
Platt, J. (1999). Sequential minimal optimization: A fast algorithm for training support vector machines. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Platt, J., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 547–553). Cambridge, MA: MIT Press.
Schölkopf, B., Smola, A., & Müller, K. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Sturm, J. (2001). Using SeDuMi, a Matlab toolbox for optimization over symmetric cones. Department of Econometrics, Tilburg University, Netherlands.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Weston, J., & Watkins, C. (1998). Multi-class support vector machines (CSD-TR-9804). Egham, UK: Royal Holloway, University of London.
Yajima, Y. (2005). Linear programming approaches for multicategory support vector machines. European Journal of Operational Research, 162, 514–531.
Zhong, P., & Fukushima, M. (2006). A new multi-class support vector algorithm. Optimization Methods and Software, 21, 359–372.
Received September 19, 2005; accepted May 24, 2006.
LETTER
Communicated by Grace Wahba
Fast Generalized Cross-Validation Algorithm for Sparse Model Learning

S. Sundararajan
[email protected] Philips Electronics India Ltd., Ulsoor, Bangalore, India
Shirish Shevade
[email protected] Computer Science and Automation, Indian Institute of Science, Bangalore, India
S. Sathiya Keerthi
[email protected] Yahoo! Research, 2821 Mission College Blvd., Santa Clara, CA 95054, USA
We propose a fast, incremental algorithm for designing linear regression models. The proposed algorithm generates a sparse model by optimizing multiple smoothing parameters using the generalized cross-validation approach. Its performance on synthetic and real-world data sets is compared with that of other incremental algorithms, such as Tipping and Faul's fast relevance vector machine, Chen et al.'s orthogonal least squares, and Orr's regularized forward selection. The results demonstrate that the proposed algorithm is competitive.

1 Introduction

In recent years, there has been a lot of focus on designing sparse models in machine learning. For example, the support vector machine (SVM) (Cortes & Vapnik, 1995) and the relevance vector machine (RVM) (Tipping, 2001; Tipping & Faul, 2003) have been proven to provide sparse solutions to both regression and classification problems. Some of the earlier successful approaches for regression problems include Chen, Cowan, and Grant's (1991) orthogonal least squares and Orr's (1995b) regularized forward selection algorithms. In this letter, we consider only the regression problem. Often the target function model takes the form
y(x) = Σ_{m=1}^{M} w_m φ_m(x).    (1.1)
Neural Computation 19, 283–301 (2007)
C 2006 Massachusetts Institute of Technology
For notational convenience, we do not indicate the dependency of y on w, the weight vector. The available choices in selecting the set of basis vectors {φ_m(x), m = 1, . . . , M} make the model flexible. Given a data set {x_n, t_n}_{n=1}^N, t_n ∈ R ∀ n, we can write the target vector t = (t_1, . . . , t_N)^T as the sum of the approximation vector y = (y(x_1), . . . , y(x_N))^T and an error vector η,

t = Φw + η,    (1.2)

where Φ = (φ_1, φ_2, . . . , φ_M) is the design matrix and φ_i is the ith column vector of Φ, of size N × 1, representing the response of the ith basis vector for all the input samples x_j, j = 1, . . . , N. One possible way to obtain this design matrix is to use a gaussian kernel with Φ_ij = exp(−‖x_i − x_j‖²/(2z²)). The error vector η can be modeled as an independent zero-mean gaussian vector with variance σ². In this context, controlling the model complexity to avoid overfitting is an important task. This problem has been addressed in the past by regularization approaches (Bishop, 1995). In the classical approach, the sum squared error with weight decay regularization having a single regularization parameter α controls the trade-off between fitting the training data and smoothing the output function. In the local smoothing approach, each weight in w is associated with its own regularization or ridge parameter (Hoerl & Kennard, 1970; Orr, 1995a; Denison & George, 2000). The interested reader can refer to Denison and George (2000) for a discussion of generalized ridge regression and the Bayesian approach. For our discussion, we consider the weight decay regularizer with multiple regularization parameters. In this case, the optimal weight vector is obtained by minimizing the following cost function (for a given set of hyperparameter values, α, σ²):

C(w, α, σ²) = (1/(2σ²)) ‖t − Φw‖² + (1/2) w^T A w,    (1.3)

where A is a diagonal matrix with elements α = (α_1, . . . , α_M)^T, and the optimal weight vector is given by

w = (1/σ²) S^{-1} Φ^T t,    (1.4)

where S = R + A and R = Φ^T Φ/σ². Note that this solution depends only on the product ασ². This solution is the same as the maximum a posteriori (MAP) solution obtained by defining a gaussian likelihood and a gaussian prior for the weights in a Bayesian framework (Tipping, 2001).
The hyperparameters are typically selected using iterative approaches like marginal likelihood maximization (Tipping, 2001). In practice, many of the α_i approach ∞. This results in the removal of the associated basis vectors, thereby making the model sparse. The final model consists of a small number of basis vectors L (L ≪ M), called relevance vectors, and hence is known by the name relevance vector machine (RVM). This procedure is computationally intensive and needs O(M³) effort, at least for the initial iterations, when starting with the full model (M = N, or M = N + 1 if the bias term is included). Hence, it is not suitable for large data sets. This limitation was addressed in Tipping and Faul (2003), where a computationally efficient algorithm was proposed. In this algorithm, basis vectors are added sequentially starting from an empty model. It also allows deleting basis vectors that may subsequently become redundant. There are various other basis vector selection heuristics that can be used to design sparse models (Chen et al., 1991; Orr, 1995b). Chen et al. (1991) proposed an orthogonal least-squares algorithm as a forward regression procedure to select the basis vectors. At each step of the regression, the increment to the explained variance of the desired output is maximized. Orr (1995b) proposed an algorithm that combines regularization and cross-validated selection of basis vectors. Some other promising incremental approaches are the algorithms of Csato and Opper (2002), Lawrence, Seeger, and Herbrich (2003), Seeger, Williams, and Lawrence (2003), and Smola and Bartlett (2000). But they apply to gaussian processes and are not directly related to the problem formulation addressed in this article. Generalized cross-validation (GCV) is another important approach for the selection of hyperparameters and has been shown to exhibit good generalization performance (Sundararajan & Keerthi, 2001; Orr, 1995a).
Orr (1995a) used the GCV approach to estimate the multiple smoothing parameters of the full model. This approach, however, is not suitable for large data sets due to its computational complexity. Therefore, there is a need to devise a computationally efficient algorithm based on the GCV approach for handling large data sets. In this letter, we propose a new fast, incremental GCV algorithm that can be used to design sparse models exhibiting good generalization performance. In the algorithm, we start with an empty model and sequentially add the basis functions to reduce the GCV error. The GCV error can also be reduced by deleting those basis functions that subsequently become redundant. This important feature offsets the inherent greediness exhibited by other sequential algorithms. This algorithm has the same computational complexity as that of the algorithm given in Tipping and Faul (2003) and is suitable for large data sets as well. Preliminary results on synthetic and real-world benchmark data sets indicate that the new approach gains on generalization but at the expense of a moderate increase in the number of basis vectors.
The letter is organized as follows. In section 2, we describe the GCV approach and compare the GCV error function with the marginal likelihood function. Section 3 describes the fast, incremental algorithm, its computational complexity, and the numerical issues involved; the update expressions mentioned in this section are detailed in appendixes A to F. In section 4, we present the simulation results. Section 5 concludes the letter.

2 Generalized Cross Validation

The standard techniques that estimate the prediction error of a given model are the leave-one-out (LOO) cross-validation (Stone, 1974) and the closely related GCV described in Golub, Heath, and Wahba (1979) and Orr (1995b). The generalization performance of the GCV approach is quite good, like that of the LOO cross-validation approach. The advantage in using the GCV error is that it takes a much simpler form compared to the LOO error and is given by

V(α, σ²) = N Σ_{i=1}^{N} (t_i − y(x_i))² / (tr(P))²,    (2.1)

where

P = I − Φ S^{-1} Φ^T / σ².    (2.2)
When this GCV error is minimized with respect to the hyperparameters, many of the α_i approach ∞, making the model sparse. Equation 2.1 can be written as

V(α, σ²) = N t^T P² t / (tr(P))².    (2.3)
Note that P is dependent only on ζ = ασ². This means that we can get rid of σ² from equations 1.4 and 2.1; it is then sufficient to work directly with ζ. However, using the optimal ζ obtained from minimizing the GCV error, the noise level can be computed from

σ̂² = t^T P² t / tr(P).    (2.4)
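In code, the GCV score of equations 2.1 to 2.3 and its invariance to the substitution ζ = ασ² can be checked directly (a sketch in our own notation, not the authors' implementation):

```python
import numpy as np

# P = I - Phi S^{-1} Phi^T / sigma^2 (eq. 2.2); V = N t^T P^2 t / tr(P)^2 (eq. 2.3).

def projection_matrix(Phi, alpha, sigma2):
    S = Phi.T @ Phi / sigma2 + np.diag(alpha)
    return np.eye(Phi.shape[0]) - Phi @ np.linalg.solve(S, Phi.T) / sigma2

def gcv_score(Phi, t, alpha, sigma2):
    P = projection_matrix(Phi, alpha, sigma2)
    return len(t) * (t @ P @ P @ t) / np.trace(P) ** 2

rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 5))
t = rng.normal(size=30)
alpha = np.ones(5)
V = gcv_score(Phi, t, alpha, sigma2=0.5)
```

Because P depends on α and σ² only through ζ = ασ², rescaling α by c and σ² by 1/c leaves the score unchanged, and Pt reproduces the residual t − Φw of equation 1.4.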
We now discuss the algorithm proposed by Orr to determine the optimal set of hyperparameters.
2.1 Orr’s Algorithm. Starting with the full model, Orr (1995a) proposed an iterative scheme to minimize the GCV error, equation 2.3, with respect to the hyperparameters. Although this algorithm was originally described in terms of the variable ζ_j, we describe it here using the variables α_j and σ² for convenience. Each α_j is optimized in turn, while the others are held fixed. The minimization thus proceeds by a series of one-dimensional minimizations. This can be achieved by rewriting equation 2.3 using

t^T P² t = (a_j Δ_j² − 2 b_j Δ_j + c_j) / Δ_j²,
tr(P) = δ_j − ε_j / Δ_j,

where

a_j = t^T P_j² t    (2.5)
b_j = (t^T P_j² φ_j)(t^T P_j φ_j)    (2.6)
c_j = (φ_j^T P_j² φ_j)(t^T P_j φ_j)²    (2.7)
δ_j = tr(P_j)    (2.8)
ε_j = φ_j^T P_j² φ_j.    (2.9)
The above relationships follow from the rank-one update relationship between P and P_j,

P = P_j − (1/Δ_j) P_j φ_j φ_j^T P_j,    (2.10)

where P_j = I − Φ_{−j} S_j^{-1} Φ_{−j}^T / σ² and

Δ_j = φ_j^T P_j φ_j + α_j σ².    (2.11)

Here, Φ_{−j} denotes the matrix Φ with the column φ_j removed. Therefore, S_j = R_{−j} + A_{−j}, with the subscript −j having the same interpretation as in Φ_{−j}. Note that S_j does not contain α_j, and, hence, P_j also does not contain α_j. Thus, equation 2.3 can be seen as a rational polynomial in α_j alone, with a single minimum in the range [0, ∞]. The details of minimizing this polynomial with respect to α_j are given in appendix A. After one complete cycle in which each parameter is optimized once, the GCV score is calculated and compared with the score at the end of the previous cycle. If a significant decrease has occurred, a new cycle is begun; otherwise, the algorithm
terminates. As detailed in Orr (1995a), the computational complexity of one cycle of the above algorithm is O(N³), at least during the first few iterations. Although this will eventually reduce to O(L N²) as the basis vectors are pruned, this algorithm is not suitable for large data sets.

2.2 Comparison of GCV Error with Marginal Likelihood. We now compare the GCV error and the marginal likelihood by studying their dependence on α_j. In the marginal likelihood method, the optimal α_j's are determined by maximizing the marginal likelihood with respect to α_j. The GCV error is minimized to get the optimal α_j's. First, we study the behavior of the GCV error with reference to α_j. The denominator of the GCV error has tr(P) = δ_j − ε_j/Δ_j, where δ_j and ε_j are independent of α_j and P is a positive semidefinite matrix. Further, tr(P) increases monotonically as a function of α_j. Therefore, maximizing tr(P) (in order to minimize the GCV error) will prefer the optimal value of α_j to be ∞. Thus, the denominator term in equation 2.3 prefers a simple model. The term t^T P² t in the numerator of equation 2.3 is the squared error at the optimal weight vector in equation 1.4. Let g(α) = t^T P² t. Differentiating g(α) with respect to α_j, we get

∂g(α)/∂α_j = (2σ²/Δ_j²) (b_j − c_j/Δ_j),

where b_j and c_j are independent of α_j and c_j, Δ_j ≥ 0. If b_j is nonpositive, then g(α_j) is a monotonically decreasing function of α_j. Minimization of g(α_j) with respect to α_j would thus prefer α_j to be ∞. On the other hand, if b_j is positive, the minimum of g(α_j) would depend on the sign of σ² b_j s̄_j − c_j, where s̄_j = φ_j^T C_{−j}^{-1} φ_j, C_{−j} is C with the contribution of basis vector j removed, and C = σ² I + Φ A^{-1} Φ^T. Note that P = σ² C^{-1}. Therefore, the optimal choice of α_j using the GCV error method depends on the trade-off between the data-dependent term in the numerator and the term in the denominator that prefers α_j to be ∞. The logarithm of the marginal likelihood function is defined as

L(α) = −(1/2) [ N log(2π) + log|C| + t^T C^{-1} t ].
Considering the dependence of L(α) on a single hyperparameter α_j, the above equation can be rewritten (Tipping & Faul, 2003) as L(α) = L(α_{−j}) + l(α_j), where

l(α_j) = (1/2) [ log( α_j / (α_j + s̄_j) ) + q̄_j² / (α_j + s̄_j) ].

L(α_{−j}) is the marginal likelihood with φ_j excluded and is thus independent of α_j. Here, we have defined q̄_j = φ_j^T C_{−j}^{-1} t. Note that q̄_j and s̄_j are independent of α_j. The second term in l(α_j) comes from the data-dependent term, t^T C^{-1} t, in the logarithm of the marginal likelihood function, and maximization of this term prefers α_j to be zero. On the other hand, the first term in l(α_j) comes from the Ockham factor (Tipping, 2001), and maximization of this term chooses α_j to be ∞. So the optimal value of α_j is a trade-off between the data-dependent term and the Ockham factor. Note that t^T C^{-1} t = (1/σ²) t^T P² t + w^T A w. The term on the left-hand side of this equation appears in the negative logarithm of the marginal likelihood function, while the first term on the right-hand side appears in the numerator of the GCV error. The key difference in the choice of the basis function is the additional term that is present in the data-dependent term of the marginal likelihood. For the marginal likelihood method, it has been shown that for a given basis function j, if q̄_j² ≤ s̄_j, then the optimal value of α_j is ∞; otherwise, the optimal value is s̄_j²/(q̄_j² − s̄_j) (Tipping & Faul, 2003). For the GCV
method, the optimal value of α_j depends on q̄_j, s̄_j, and some other quantities detailed in appendix A. Further, the sufficient condition for a basis function to be not relevant is q̄_j² ≤ s̄_j for the marginal likelihood method and b_j ≤ 0 for the GCV error method. Note that b_j is independent of s̄_j but is dependent on q̄_j. In general, a relevance vector obtained using the marginal likelihood (or GCV error) method need not be relevant in the GCV error (or marginal likelihood) method. This fact was also observed in our experiments. We now discuss the effect of scaling on the GCV error. First, note that the GCV error is a function of P. With the scaling of the output t, there will be an associated scaling of the basis functions. However, P is invariant to such scaling (see equation 2.2). This makes the GCV error invariant to scaling. A similar result holds for the log marginal likelihood function.

3 Fast GCV Algorithm

In this section we describe the fast GCV (FGCV) algorithm, which constructs the model sequentially starting from an empty model. The basis vectors are added sequentially, and their weightings are modified to get the maximum reduction in the GCV error. The GCV error can also be decreased by deleting those basis vectors that subsequently become redundant. By maintaining the following set of variables for every basis vector m, we can efficiently find the optimal value of α_m for every basis vector and the corresponding GCV error:

r_m = t^T P φ_m    (3.1)
γ_m = φ_m^T P φ_m    (3.2)
ξ_m = t^T P² φ_m    (3.3)
u_m = φ_m^T P² φ_m.    (3.4)

In addition, we need

v = tr(P)    (3.5)
q = t^T P² t.    (3.6)
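The six statistics above, and the rank-one relation of equation 2.10 that makes their cheap updates possible, can be checked numerically with a short sketch (ours, in Python; the letter gives no code, and Δ_j is our symbol for the quantity in equation 2.11):

```python
import numpy as np

# Statistics of equations 3.1-3.6 computed directly from P, plus a check of
# P = P_j - P_j phi_j phi_j^T P_j / Delta_j (eq. 2.10).

def proj(Phi, alpha, sigma2):
    S = Phi.T @ Phi / sigma2 + np.diag(alpha)
    return np.eye(Phi.shape[0]) - Phi @ np.linalg.solve(S, Phi.T) / sigma2

def statistics(P, Phi, t):
    r = Phi.T @ (P @ t)                  # r_m = t^T P phi_m
    gam = np.diag(Phi.T @ P @ Phi)       # gamma_m = phi_m^T P phi_m
    xi = Phi.T @ (P @ (P @ t))           # xi_m = t^T P^2 phi_m
    u = np.diag(Phi.T @ P @ P @ Phi)     # u_m = phi_m^T P^2 phi_m
    return r, gam, xi, u, np.trace(P), t @ P @ P @ t

rng = np.random.default_rng(2)
Phi = rng.normal(size=(25, 4))
t = rng.normal(size=25)
alpha = np.array([0.5, 1.0, 2.0, 4.0])
sigma2 = 0.3

P = proj(Phi, alpha, sigma2)
j = 0
Pj = proj(np.delete(Phi, j, axis=1), np.delete(alpha, j), sigma2)
phi = Phi[:, j]
Delta = phi @ Pj @ phi + alpha[j] * sigma2
P_rank_one = Pj - np.outer(Pj @ phi, Pj @ phi) / Delta
```

Since P is positive semidefinite, γ_m and u_m should stay nonnegative; the implementation-detail list below uses this as a numerical-accuracy check.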
Further, after every minimization process, these variables can be updated efficiently using the rank-one update given in equation 2.10. We now give the algorithm and discuss the relevant implementation details and the storage and computational requirements.

3.1 Algorithm

1. Initialize σ² to some reasonable value (e.g., var(t) × 0.1).
2. Select the first basis vector φ_k (which forms the initial relevance vector set), and set the corresponding α_k to its optimal value. The remaining α's are notionally set to ∞.
3. Initialize S^{-1}, w (which are scalars initially), and the variables given in equations 3.1 to 3.6.
4. α_k^old := α_k.
5. For all j, find the optimal solution α_j, keeping the remaining α_i, i ≠ j, fixed, and the corresponding GCV error. Select the basis vector k for which the reduction in the GCV error is maximum.
6. If α_k^old < ∞ and α_k < ∞, the relevance vector set remains unchanged.
7. If α_k^old = ∞ and α_k < ∞, add φ_k to the relevance vector set.
8. If α_k^old < ∞ and α_k = ∞, delete φ_k from the relevance vector set.
9. Update S^{-1}, w, and the variables given in equations 3.1 to 3.6.
10. Estimate the noise level using equation 2.4. This step may be repeated once in, for example, five iterations.
11. If there is no significant change in the values of α and σ², stop. Otherwise, go to step 4.

3.2 Implementation Details

1. Since we start from the empty model, the basis vector that gives the minimum GCV error is selected as the first basis vector (step 2). The details of this procedure are given in appendix B. The first basis vector
can also be selected based on the largest normalized projection onto the target vector, as suggested in Tipping and Faul (2003).
2. The initialization of the relevant variables in step 3 is described in appendix C.
3. Appendixes D and A describe the method to estimate the optimal α_i and the corresponding GCV error (step 5).
4. The variables in step 9 are updated using the details given in appendix E for the cases in step 6 (reestimation) and step 8 (deletion) of the algorithm. The details corresponding to step 7 (addition) are given in appendix F.
5. In practice, numerical accuracy may degrade as the iterations progress. More specifically, the quantities γ_m and u_m, which are expected to remain nonnegative, may become negative while updating the variables. When either of these quantities becomes negative, it is a good idea to recompute the quantities in equations 3.1 to 3.6 afresh using direct computations. If the problem still persists (this typically happens when the matrix P becomes ill conditioned, for example, when the width parameter z used in a gaussian kernel is large), we terminate the algorithm.
6. When the noise level is also to be estimated (step 10), all the relevant variables are calculated afresh. This computation is simplified by expanding the matrix P using equation 2.2.
7. In an experiment, some of the α_j may reach 0, as can be seen in appendix A. This may affect the solution. Therefore, it is useful to set such α_j to a relatively small value, α_min. Setting this value to 1/N was found to work well in practice.

3.3 Storage and Computational Complexity

The storage requirements of the FGCV algorithm are greater than those of the fast RVM (FRVM) algorithm in Tipping and Faul (2003) and arise from the additional variables ξ_m and u_m to be maintained. However, they are still linear in N. The computational requirements of the proposed algorithm are similar to those of the FRVM algorithm. Step 5 of the algorithm has a computational complexity of O(N).
This is possible using the expressions given in appendix D. The computational complexity of the reestimation or deletion of a basis vector is O(L N), while that of the addition of a basis vector is O(N²). Step 10 of the algorithm, however, has a computational complexity of O(L N²), as it requires reestimating the relevant variables. We observed that the FGCV algorithm was 1.25 to 1.5 times slower than the FRVM algorithm in our simulations for these main reasons:
- The error function is shallow near the solution, which results in more iterations.
Figure 1: Results on the Friedman2 data set (box plots of NMSE and number of basis vectors for algorithms 1–4).

Table 1: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Friedman2 Data Set.
        FGCV      FRVM      RFS
OLS     .47       9.4e-5    .119
RFS     .013      .097
FRVM    1.3e-12

- There is a higher number of relevance vectors at the solution.
- The additional variables, ξ_m and u_m, are updated.
4 Simulations The proposed FGCV algorithm is evaluated on four popular benchmark data sets and compared with the algorithms described in Tipping and Faul (2003), Chen et al. (1991), and Orr (1995b) and referred to as fast RVM (FRVM), orthogonal least squares (OLS), and regularized forward selection (RFS), respectively. Two of these data sets were generated, as described by Friedman (1991) and are referred to as Friedman2 and Friedman3. The input
Figure 2: Results on the Friedman3 data set (box plots of NMSE and number of basis vectors for algorithms 1–4).

Table 2: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Friedman3 Data Set.
        FGCV      FRVM      RFS
OLS     4.6e-9    .002      .054
RFS     2.1e-6    .139
FRVM    3.0e-6
dimension for each of these data sets was four. For these data sets, the training set consisted of 200 randomly generated examples, while the test set had 1000 noise-free examples and the experiment was repeated 100 times for different training set examples. We report the normalized mean squared error (NMSE) (normalized with respect to the output variance) on the test set. The third data set used was the Boston housing data set obtained from the StatLib archive (http://lib.stat.cmu.edu/datasets/boston). This data set comprises 506 examples with 13 variables. The data were split into 481/25 training/testing splits randomly, and the partitioning was repeated 100 times independently. The fourth data set used was the Abalone data set (ftp://ftp.ics.uci.edu/pub/machine-learning-databases/abalone/). After mapping the gender encoding (male/female/infant) into {(1,0,0), (0,1,0),
Figure 3: Results on the Boston Housing data set (box plots of MSE and number of basis vectors for algorithms 1–4).

Table 3: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Boston Housing Data Set.
        FGCV      FRVM      RFS
OLS     2.6e-5    .096      .156
RFS     .003      .647
FRVM    2.0e-5
(0,0,1)}, the 10-dimensional data were split into 3000/1177 training/testing splits randomly. The partitioning was repeated 10 times independently. (The exact partitions for the last two data sets were obtained from http://www.gatsby.ucl.ac.uk/˜chuwei/regression.html.) For all the data sets, a gaussian kernel was used, and the width parameter was chosen by fivefold cross-validation. For the OLS and RFS algorithms, the readily available Matlab functions (http://www.anc.ed.ac.uk/˜mjo/software/rbf2.zip) were used with the default settings. We adhered to the guidelines provided in Tipping and Faul (2003) for FRVM. The results obtained using the four algorithms (FGCV, algorithm 1; FRVM, algorithm 2; RFS, algorithm 3; OLS, algorithm 4) on these data sets are presented in Figures 1 to 4. From these box plots, it is clear that
Figure 4: Results on the Abalone data set (box plots of MSE and number of basis vectors for algorithms 1–4).

Table 4: Statistical Significance (Wilcoxon Signed Rank) Test Results on the Abalone Data Set.
        FGCV      FRVM      RFS
OLS     9.8e-4    9.8e-4    .403
RFS     .023      .005
FRVM    .462
the FGCV algorithm generalizes well compared to the other algorithms. However, this happens at the expense of a moderate increase in the number of basis vectors as compared to the FRVM algorithm. It is worth noting that the sparseness of the solution obtained from the FGCV algorithm is still very good. On the Abalone data set, the FRVM algorithm is slightly better than the FGCV algorithm on the average. Box plots in Figures 1 to 4 show that the distribution of MSE is nonsymmetrical. Therefore, we used Wilcoxon matched-pair signed rank tests to compare the four algorithms; the results are given in Tables 1 to 4. Each box in the table compares an algorithm in the column to an algorithm in the row. The null hypothesis is that the two medians of the test error are the
same, while the alternate hypothesis is that they are different. The p-value of this hypothesis test is given in the box. The following comparisons are made with respect to the significance level of 0.05. If a p-value is smaller than 0.05, then the algorithm in the column (row) is statistically superior to the algorithm in the row (column). Table 1 suggests that for the Friedman2 data set, the FGCV algorithm is statistically superior to the FRVM and RFS algorithms. But it is not significantly different from the OLS algorithm. The FGCV algorithm is statistically superior to all the other algorithms on the Friedman3 and the Boston Housing data sets, as is evident from Tables 2 and 3. For the Abalone data set, the results in Table 4 show that the performances of the FGCV and the FRVM algorithms are not significantly different. However, these algorithms are statistically superior to the OLS and RFS algorithms. We also compared the FGCV and FRVM algorithms with the gaussian process regression (GPR) algorithm on the Boston Housing and Abalone data sets (results reported in http://www.gatsby.ucl.ac.uk/˜chuwei/regression.html). These comparisons were also done using the Wilcoxon matched-pair signed-rank test with a significance level of .05. For the Boston Housing data set, the performance comparison gave p-values of 3.4e-10 (FGCV) and 1.22e-12 (FRVM). This shows that the GPR algorithm (with all basis vectors) is statistically superior to both the FGCV and the FRVM algorithms on this data set. A similar comparison on the Abalone data set resulted in p-values of .012 (FGCV) and .095 (FRVM). On this data set, the GPR algorithm is statistically superior to the FGCV algorithm, while it is not statistically superior to the FRVM algorithm.
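The paired test used throughout Tables 1 to 4 can be reproduced with a short sketch (ours, not the authors' code): rank the absolute paired differences, sum the ranks of the positive differences, and compute an exact two-sided p-value by enumerating all 2^n sign assignments. This is practical only for small n, and ties in the magnitudes would need midranks, which this sketch omits.

```python
import numpy as np
from itertools import product

# Exact paired Wilcoxon signed-rank test (small n; distinct |differences| assumed).

def wilcoxon_exact(x, y):
    d = np.asarray(x, float) - np.asarray(y, float)
    d = d[d != 0.0]                                   # drop zero differences
    ranks = np.argsort(np.argsort(np.abs(d))) + 1.0   # ranks of |d|
    w = ranks[d > 0].sum()                            # W+: ranks of positive d
    mu = ranks.sum() / 2.0                            # null mean of W+
    all_w = np.array([sum(r for r, s in zip(ranks, signs) if s)
                      for signs in product([0, 1], repeat=len(d))])
    p = np.mean(np.abs(all_w - mu) >= abs(w - mu))    # two-sided exact p-value
    return w, p
```

For example, a method that is uniformly better than another across eight paired trials yields a p-value below 0.05, while identical paired results give p = 1.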
5 Conclusion

We have proposed a fast, incremental GCV algorithm for designing sparse regression models. The algorithm is very efficient and constructs the model sequentially starting from an empty model. In each iteration, it adds, deletes, or reestimates basis vectors depending on the maximum reduction in the GCV error. The experimental results suggest that, considering the requirements of sparseness, good generalization performance, and computational complexity, the FGCV algorithm is competitive. Clearly, this algorithm is an excellent alternative to the FRVM algorithm of Tipping and Faul (2003). We mainly compared our approach against OLS, RFS, and FRVM since they are most directly related to our problem formulation. We also compared against GPR. We did not compare against the other sparse incremental GP algorithms mentioned in Csato and Opper (2002), Lawrence et al. (2003), Seeger et al. (2003), and Smola and Bartlett (2000) since we felt that those methods would be slightly inferior to GPR. But those comparisons could be interesting, especially if made at the same levels of sparsity. This will
be taken up in future work. It will also be interesting to extend the proposed algorithm to classification problems.

Appendix A: α Estimation Using the GCV Approach

Using equations 2.5 to 2.9, the numerator of the derivative of the GCV error with respect to α_j can be shown to take the form g_j + h_j α_j, where

g_j = (δ_j b_j − a_j ε_j) ψ_j − (δ_j c_j − b_j ε_j),
h_j = (δ_j b_j − a_j ε_j) σ², and ψ_j = φ_j^T P_j φ_j.    (A.1)

It is easy to verify that the denominator of the derivative of the GCV error with respect to α_j is nonnegative, and noting that α_j ≥ 0, the solution can be obtained directly using g_j and h_j or using the sign information of the gradient. More specifically, the optimal solution α_j (lying in the interval [0, ∞)) is obtained from one of the following possibilities:

- If g_j < 0 and h_j < 0, then α_j = ∞.
- If g_j > 0 and h_j > 0, then α_j = 0.
- If g_j < 0 and h_j > 0, then a unique solution exists and is given by α_j = −g_j/h_j.
- If g_j > 0 and h_j < 0, then α_j = −g_j/h_j is a stationary point, but the derivative changes from a positive to a negative value while crossing zero. Therefore, it is possible to have the solution at either 0 or ∞. In this case, we can evaluate the function value at 0 and ∞ and choose the right one.
- When h_j = 0, the solution depends on the sign of g_j.
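The case analysis above translates directly into code; a sketch (ours, with a hypothetical `value_at` callback for evaluating the GCV error at a candidate α_j):

```python
import numpy as np

# Appendix A case analysis: the derivative's numerator is g_j + h_j * alpha_j,
# so the minimizer over [0, inf) is determined by the signs of g_j and h_j.

def optimal_alpha(g, h, value_at):
    """value_at(a) evaluates the GCV error at alpha_j = a (a may be np.inf)."""
    if h == 0.0:
        return np.inf if g < 0 else 0.0   # sign of g alone decides
    if g < 0 and h < 0:
        return np.inf
    if g > 0 and h > 0:
        return 0.0
    if g < 0 and h > 0:
        return -g / h                      # derivative goes - to +: interior minimum
    # g > 0, h < 0: derivative goes + to -, so the minimum lies at an endpoint.
    return 0.0 if value_at(0.0) <= value_at(np.inf) else np.inf
```

Only the last case needs function evaluations; the others are resolved by the signs alone, which is what keeps step 5 of the algorithm at O(N) per basis vector.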
Appendix B: Selection of the First Basis Vector

Two quantities that are of interest in finding the GCV error for all the basis vectors are given by

t^T P_m² t = (a Δ_m² − 2 b_m Δ_m + c_m) / Δ_m²    (B.1)
tr(P_m) = N − φ_m^T φ_m / (φ_m^T φ_m + α_m σ²),    (B.2)

where P_m = I − φ_m S_m^{-1} φ_m^T / σ², S_m = (φ_m^T φ_m + α_m σ²)/σ², and Δ_m = σ² S_m. Then the optimal solution α_m can be obtained, as described in appendix A, from the coefficients 2.5 to 2.9 using P_j = I. After substituting the optimal solution α_m into equations B.1 and B.2, we can evaluate the GCV error for a given m.
Finally, the basis vector j is selected as the index for which the GCV error is minimum.

Appendix C: Initialization

Once the first basis vector, φ_j, is selected based on the GCV error with the optimal α_j, initialization of all the relevant quantities is done as follows:

r_m = t^T φ_m − (t^T φ_j)(φ_j^T φ_m) / Δ_j    (C.1)
γ_m = φ_m^T φ_m − (φ_j^T φ_m)² / Δ_j    (C.2)
ξ_m = t^T φ_m − 2 (t^T φ_j)(φ_j^T φ_m) / Δ_j + (t^T φ_j)(φ_j^T φ_m)(φ_j^T φ_j) / Δ_j²    (C.3)
u_m = φ_m^T φ_m − 2 (φ_j^T φ_m)² / Δ_j + (φ_j^T φ_j)(φ_j^T φ_m)² / Δ_j²    (C.4)
v = N − φ_j^T φ_j / Δ_j    (C.5)
q = t^T t − 2 (t^T φ_j)² / Δ_j + (t^T φ_j)² (φ_j^T φ_j) / Δ_j²,    (C.6)

where Δ_j = φ_j^T φ_j + α_j σ². Further, w = t^T φ_j / Δ_j, S^{-1} = [σ²/Δ_j], and Φ = [φ_j]. Note that S^{-1} is a single-element matrix when there is a single basis vector.

Appendix D: Computing α_j
Here, we find the set of coefficients required to compute the optimal α_j from the set of quantities defined in equations 3.1 to 3.6. All the results given below are obtained using the rank-one update relationship between P and P_j. With o_j = α_j σ² r_j / (α_j σ² − γ_j) and ψ_j = α_j σ² γ_j / (α_j σ² − γ_j), we have

a_j = q + ε_j o_j² / Δ_j² + 2 ξ_j o_j / (α_j σ²)    (D.1)
b_j = o_j ( ε_j o_j / Δ_j + ξ_j Δ_j / (α_j σ²) )    (D.2)
c_j = ε_j o_j²,    (D.3)

where Δ_j = ψ_j + α_j σ² and ε_j = u_j Δ_j² / (α_j σ²)². Further,

g_j = (δ_j b_j − a_j ε_j) ψ_j − (δ_j c_j − b_j ε_j)    (D.4)
h_j = (δ_j b_j − a_j ε_j) σ²,    (D.5)
where δ_j = v + ε_j/Δ_j. For j not in the relevance vector set, we do not have to find the above set of quantities with P_j, as α_j = ∞ and P = P_j. Therefore, the quantities in equations 2.5 to 2.9 required to compute the optimal α_j can be found in a much simpler way. Next, the optimal solution α_j is obtained using g_j and h_j, as mentioned earlier. (See the discussion in the paragraph below equation A.1.)

Appendix E: Reestimation and Deletion of a Basis Vector

Recall that the matrix P = I − Φ S^{-1} Φ^T / σ². Then, with σ² fixed (at least for the iteration under consideration or for a few iterations), a change in α_j (which is essentially the reestimation process itself) results in a change in S^{-1}. Let s_j denote the jth column and s_jj denote the jth diagonal element of S^{-1}. Note that the computations for deletion are similar to those used for reestimation except for the coefficient K_j. In the case of deletion, K_j = 1/s_jj, and in the case of reestimation, K_j = (s_jj + (α̂_j − α_j)^{-1})^{-1}. Here, α̂_j denotes the new optimal solution. The final set of computations required for the reestimation or deletion of basis vectors is given below. Defining ρ_jm = s_j^T Φ^T φ_m, we have

r_m = r_m + K_j w_j ρ_jm    (E.1)
γ_m = γ_m + (K_j/σ²) ρ_jm².    (E.2)

Defining χ_jm = s_j^T A S^{-1} Φ^T φ_m, we have

u_m = u_m + (K_j/σ²) ρ_jm (2 χ_jm + τ_j ρ_jm),    (E.3)

where τ_j = (K_j/σ²) s_j^T Φ^T Φ s_j. Defining κ_j = w^T A s_j, we have

ξ_m = ξ_m + K_j ρ_jm κ_j + K_j w_j (χ_jm + ρ_jm τ_j)    (E.4)
v = v + τ_j    (E.5)
q = q + K_j σ² w_j (2 κ_j + τ_j w_j).    (E.6)

Finally, w = w − K_j w_j s_j and S^{-1} = S^{-1} − K_j s_j s_j^T. Although the set of equations given above is common to the reestimation and deletion procedures, the
jth row and/or column is to be removed from S^{-1}, w, and Φ after making the necessary updates in the deletion procedure.

Appendix F: Adding a New Basis Vector

On adding the new basis vector j, the dimension of Φ changes, and a new finite α_j gets defined. Defining l_j = (1/σ²) S^{-1} Φ^T φ_j, e_j = φ_j − Φ l_j, and µ_jm = e_j^T φ_m, we get

r_m = r_m − w_j µ_jm    (F.1)
γ_m = γ_m − (s_jj/σ²) µ_jm²,    (F.2)

where s_jj = 1/(γ_j/σ² + α_j) and w_j = (s_jj/σ²) r_j. Next, defining ν_jm = µ_jm − l_j^T A S^{-1} Φ^T φ_m, we have

ξ_m = ξ_m − (s_jj/σ²) µ_jm (ξ_j − w_j u_j) − w_j ν_jm    (F.3)
u_m = u_m + (s_jj/σ²) µ_jm ( (s_jj/σ²) u_j µ_jm − 2 ν_jm ).    (F.4)

Next,

v = v − (s_jj/σ²) u_j    (F.5)
q = q + w_j (w_j u_j − 2 ξ_j).    (F.6)

Also,

S^{-1} = [ S^{-1} + s_jj l_j l_j^T    −s_jj l_j
           −s_jj l_j^T               s_jj ].    (F.7)

Finally, w = [ w − w_j l_j ; w_j ] and Φ = [Φ φ_j].
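The partitioned update of equation F.7 avoids re-inverting S when a basis vector is added; it can be sanity-checked numerically with a short sketch (ours, with our symbol names, comparing the block update against a direct inverse):

```python
import numpy as np

# Check of eq. F.7: growing S by one column updates S^{-1} in block form,
# with s_jj the reciprocal Schur complement of the new diagonal entry.

rng = np.random.default_rng(3)
sigma2 = 0.5
Phi = rng.normal(size=(15, 3))
alpha = np.array([1.0, 2.0, 0.5])
S = Phi.T @ Phi / sigma2 + np.diag(alpha)
Sinv = np.linalg.inv(S)

phi_new, alpha_new = rng.normal(size=15), 1.5
b = Phi.T @ phi_new / sigma2
l = Sinv @ b                               # l_j = (1/sigma^2) S^{-1} Phi^T phi_j
s_jj = 1.0 / (phi_new @ phi_new / sigma2 + alpha_new - b @ l)

Sinv_new = np.block([[Sinv + s_jj * np.outer(l, l), -s_jj * l[:, None]],
                     [-s_jj * l[None, :],           np.array([[s_jj]])]])

Phi_new = np.column_stack([Phi, phi_new])
S_new = Phi_new.T @ Phi_new / sigma2 + np.diag(np.append(alpha, alpha_new))
```

`Sinv_new` agrees with `np.linalg.inv(S_new)` to numerical precision, which is why the addition step costs only O(N²) rather than a fresh O(L³) inversion.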
References

Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Chen, S., Cowan, C. F. N., & Grant, P. M. (1991). Orthogonal least squares learning for radial basis function networks. IEEE Transactions on Neural Networks, 2, 302–309.
Cortes, C., & Vapnik, V. N. (1995). Support vector networks. Machine Learning, 20, 273–297.
Csato, L., & Opper, M. (2002). Sparse on-line gaussian processes. Neural Computation, 14(3), 641–668.
Denison, D., & George, E. (2000). Bayesian prediction using adaptive ridge estimators (Tech. Rep.). London: Department of Mathematics, Imperial College. Available online at http://stats.ma.ic.ac.uk/dgtd/public html/Papers/grr.ps.
Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–141.
Golub, G. H., Heath, M., & Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21, 215–223.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Lawrence, N., Seeger, M., & Herbrich, R. (2003). Fast sparse gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 609–616). Cambridge, MA: MIT Press.
Orr, M. J. L. (1995a). Local smoothing of radial basis function networks. In Proceedings of the International Symposium on Neural Networks. Hsinchu, Taiwan.
Orr, M. J. L. (1995b). Regularization in the selection of radial basis function centers. Neural Computation, 7(3), 606–623.
Seeger, M., Williams, C., & Lawrence, N. D. (2003). Fast forward selection to speed up sparse gaussian process regression. In C. M. Bishop & B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. San Francisco: Morgan Kaufmann.
Smola, A. J., & Bartlett, P. L. (2000). Sparse greedy gaussian process regression. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 619–625). Cambridge, MA: MIT Press.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). Journal of the Royal Statistical Society (Series B), 36, 111–147.
Sundararajan, S., & Keerthi, S. S. (2001). Predictive approaches for choosing hyperparameters in gaussian processes. Neural Computation, 13(5), 1103–1118.
Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Tipping, M. E., & Faul, A. (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In C. M. Bishop & B. J. Frey (Eds.), Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics. San Francisco: Morgan Kaufmann.
Received August 29, 2005; accepted June 1, 2006.
302
Addendum

In "Low-Dimensional Maps Encoding Dynamics in Entorhinal Cortex and Hippocampus" by Dmitri Pervouchine, Theoden Netoff, Horacio Rotstein, John White, Mark Cunningham, Miles Whittington, and Nancy Kopell (Vol. 18, No. 11: 2617–2650), the following paragraph was omitted at the end of Appendix section A.2:

The synaptic gating variable s_j obeys first-order kinetics according to the equation

∂s_j/∂t = α_j (1 − s_j)[1 + tanh((v_j − v_th)/v_sl)] − β_j s_j,

where α_j and β_j are the respective synapse rise and decay rate constants, v_th = 0, and v_sl = 4. The decay rate constants were chosen to achieve the desired synapse decay time (for instance, β_j ≈ 0.2 for τ = 5, and β_j ≈ 0.05 for τ = 20). The rise rate constant α_j = 20 used in the simulations was not accessible in the experiments.
ARTICLE
Communicated by S. Coombes
The Astrocyte as a Gatekeeper of Synaptic Information Transfer

Vladislav Volman
[email protected] School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, 69978 Tel-Aviv, Israel
Eshel Ben-Jacob
[email protected] School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel-Aviv University, 69978 Tel-Aviv, Israel, and Center for Theoretical Biological Physics, University of California at San Diego, La Jolla, CA 92093-0319, U.S.A.
Herbert Levine
[email protected] Center for Theoretical Biological Physics, University of California at San Diego, La Jolla, CA 92093-0319, U.S.A.
We present a simple biophysical model for the coupling between synaptic transmission and the local calcium concentration on an enveloping astrocytic domain. This interaction enables the astrocyte to modulate the information flow from presynaptic to postsynaptic cells in a manner dependent on previous activity at this and other nearby synapses. Our model suggests a novel, testable hypothesis for the spike timing statistics measured for rapidly firing cells in culture experiments.

Neural Computation 19, 303–326 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

In recent years, evidence has been mounting regarding the possible role of glial cells in the dynamics of neural tissue (Volterra & Meldolesi, 2005; Haydon, 2001; Newman, 2003; Takano et al., 2006). For astrocytes in particular, the specific association of processes with synapses and the discovery of two-way astrocyte-neuron communication have demonstrated the inadequacy of the previously held view of a purely supportive role for these glial cells. Instead, future progress requires rethinking how the dynamics of the coupled neuron-glial network can store, recall, and process information. At the level of cell biophysics, some of the mechanisms underlying the so-called tripartite synapse (Araque, Parpura, Sanzgiri, & Haydon, 1999) are becoming clearer. For example, it is now well established that astrocytic mGlu receptors detect synaptic activity and respond via activation of the calcium-induced calcium release pathway, leading to elevated Ca2+ levels. The spread of these levels within a microdomain of one cell can coordinate the activity of disparate synapses that are associated with the same microdomain (Perea & Araque, 2002). Moreover, it might even be possible to transmit information directly from domain to domain and even from astrocyte to astrocyte if the excitation level is strong enough to induce either intracellular or intercellular calcium waves (Cornell-Bell, Finkbeiner, Cooper, & Smith, 1990; Charles, Merrill, Dirksen, & Sanderson, 1991; Cornell-Bell & Finkbeiner, 1991). One sign of the maturity of our understanding is the formulation of semiquantitative models for this aspect of neuron-glial communication (Nadkarni & Jung, 2004; Sneyd, Wetton, Charles, & Sanderson, 1995; Hofer, Venance, & Giaume, 2003). There is also information flow in the opposite direction, from astrocyte to synapse. Direct experimental evidence for this, via the detection of the modulation of synaptic transmission as a function of the state of the glial cells, will be reviewed in more detail below. One of the major goals of this work is to introduce a simple phenomenological model for this interaction. The model will take into account both a deterministic effect of high Ca2+ in the astrocytic process, namely, the reduction of the postsynaptic response to incoming spikes on the presynaptic axon (Araque, Parpura, Sanzgiri, & Haydon, 1998a), and a stochastic effect, namely, the increase in the frequency of observed miniature postsynaptic current events uncorrelated with any input (Araque, Parpura, Sanzgiri, & Haydon, 1998b). There are also direct NMDA-dependent effects on the postsynaptic neuron of astrocyte-emitted factors (Perea & Araque, 2005), which are not considered here. As we will show, the coupling allows the astrocyte to act as a gatekeeper for the synapse.
By this, we mean that the amount of data transmitted across the synapse can be modulated by astrocytic dynamics. These dynamics may be controlled mostly by other synapses, in which case the gatekeeping will depend on dynamics external to the specific synapse under consideration. Alternatively, the dynamics may depend mostly on excitation from the selfsame synapse, in which case the behavior of the entire system is determined self-consistently. Here we focus on the latter possibility and leave for future work the discussion of how this mechanism could lead to multisynaptic coupling. Our ideas regarding the role of the astrocyte offer a new explanation for observations regarding firing patterns in cultured neuronal networks. In particular, spontaneous bursting activity in these networks is regulated by a set of rapidly firing neurons, which we refer to as spikers; these neurons exhibit spiking even during long interburst intervals and hence must have some form of self-consistent self-excitation. We model these neurons as containing astrocyte-mediated self-synapses (autapses) (Segal, 1991, 1994; Bekkers & Stevens, 1991) and show that this hypothesis naturally accounts
for the observed unusual interspike interval distribution. Additional tests of this hypothesis are proposed at the end.

2 Experimental Observations

2.1 Cultured Neuronal Networks. The cultured neuronal networks presented here are self-generated from dissociated cultures of mixed cortical neurons and glial cells drawn from 1-day-old Charles River rats. The dissection, cell dissociation, and recording procedures were previously described in detail (Segev et al., 2002). Briefly, following dissection, neurons are dispersed by enzymatic treatment and mechanical dissociation. Then the cells are homogeneously plated on multielectrode arrays (MEA, Multi-Channel Systems), precoated with poly-L-lysine. The culture medium was DMEM (Sigma), enriched with serum and changed every two days. Plated cultures are placed on the MEA board (B-MEA-1060, Multi Channel Systems) for simultaneous long-term noninvasive recordings of neuronal activity from several neurons at a time. Recorded signals are digitized and stored for off-line analysis on a PC via an A-D board (Microstar DAP) and data acquisition software (Alpha-Map, Alpha Omega Engineering). Noninvasive recording of the network's activity (action potentials) is possible due to the capacitive coupling that some of the neurons form with some of the electrodes. Since typically one electrode can record signals from several neurons, a specially developed spike-sorting algorithm (Hulata, Segev, Shapira, Benveniste, & Ben-Jacob, 2000) is utilized to reconstruct single neuron-specific spike series. Although there are no externally provided guiding stimulations or chemical cues, relatively intense dynamical activity is spontaneously generated within several days. The activity is marked by the formation of synchronized bursting events (SBEs): short (∼200 ms) time windows during which most of the recorded neurons participate in relatively rapid firing (Segev & Ben-Jacob, 2001).
These SBEs are separated by long intervals (several seconds or more) of sporadic neuronal firing of most of the neurons. A few neurons (referred to as spiker neurons) exhibit rapid firing even during the inter-SBE time intervals. These neurons also exhibit much faster firing rates during the SBEs, and their interspike interval distribution is marked by long-tail behavior (see Figure 4).

2.2 Interspike Interval (ISI) Increments Distribution. One of the tools used to compare model results with measured spike data concerns the distribution of increments in the spike times, defined as δ(i) = ISI(i + 1) − ISI(i), i ≥ 1. The distribution of δ(i) will have heavy tails if there is a wide range of interspike intervals and rapid transitions from one type of interval to the next. For example, rapid transitions from bursting events to occasional interburst firings will lead to such a tail. Applying this analysis to the recorded spike data of cultured cortical networks, Segev et al. (2002) found
that distributions of neurons' ISI increments can be well fitted with Levy functions over three decades in time.

3 The Model

In this section we present the mathematical details of the models employed in this work. Readers interested mainly in the conclusions can skip directly to section 4. The basic notion we use is that standard synapse models must be modified to account for the astrocytic modulation, depending, of course, on the calcium level. In turn, the astrocytic calcium level is affected by synaptic activity; for this, we use the Li-Rinzel model, where the IP3 concentration parameter governing the excitability is increased on neurotransmitter release. These ingredients suffice to demonstrate what we mean by gatekeeping. Finally, we apply this model to the case of an autaptic oscillator, which requires the introduction of neuronal dynamics. For this, we chose the Morris-Lecar model as a generic example of a type-I firing system. None of our results would be altered with a different choice as long as we retain the tangent-bifurcation structure, which allows for arbitrarily long interspike intervals.

3.1 TUM Synapse Model. To describe the kinetics of a synaptic terminal, we have used the model of an activity-dependent synapse first introduced by Tsodyks, Uziel, and Markram (2000). In this model, the effective synaptic strength evolves according to the following equations:

dx/dt = z/τrec − u·x·δ(t − tsp)
dy/dt = −y/τin + u·x·δ(t − tsp)
dz/dt = y/τin − z/τrec   (3.1)

Here, x, y, and z are the fractions of synaptic resources in the recovered, active, and inactive states, respectively. For an excitatory glutamatergic synapse, the values attained by these variables can be associated with the dynamics of vesicular glutamate. As an example, the value of y in this formulation will be proportional to the amount of glutamate that is being released during the synaptic event, and the value of x will be proportional to the size of a readily releasable vesicle pool. The time series tsp denotes the arrival times of presynaptic spikes, τin is the characteristic decay time of postsynaptic currents (PSCs), and τrec is the recovery time from synaptic depression. Upon arrival of a spike at the presynaptic terminal at time tsp, a fraction u of available synaptic resources is transferred from the recovered
Table 1: Parameters Used in Simulations.

P0 = 0.5                   κ = 0.5 sec−1          τIP3 = 7 sec
τca = 4 sec                IP3* = 0.16 µM         c1 = 0.185
rIP3 = 7.2 mM sec−1        v2 = 0.11 sec−1        v3 = 0.9 µM sec−1
v1 = 6 sec−1               d1 = 0.13 µM           d2 = 1.049 µM
k3 = 0.1 µM                d5 = 0.08234 µM        a2 = 0.2 µM−1 sec−1
c0 = 2.0 µM                gCa = 1.1 mS cm−2      gK = 2.0 mS cm−2
gL = 0.5 mS cm−2           VCa = 100 mV           VK = −70 mV
VL = −35 mV                V1 = −1 mV             V2 = 15 mV
V3 = 10 mV                 V4 = 14.5 mV           φ = 0.3
Ibase = 0.34 µA cm−2       τd = 10 msec           τrec = 100 msec
u = 0.1                    A = 10.00 µA cm−2
ηmean = 1.2 · 10−3 µA cm−2 σ = 0.1
state to the active state. Once in the active state, synaptic resources rapidly decay to the inactive state, from which they recover within a timescale τrec. Since the typical times are assumed to satisfy τrec ≫ τin, the model predicts the onset of short-term synaptic depression after a period of high-frequency repetitive firing. The onset of depression can be controlled by the variable u, which describes the effective use of synaptic resources by the incoming spike. In the original TUM model, the variable u is taken to be constant for the excitatory postsynaptic neuron; in what follows, we will set u = 0.1. Other parameter choices for these equations as well as for the rest of the model equations are presented in Table 1. To complete the specification, it is assumed that the resulting PSC, arriving at the model neuron's soma through the synapse, depends linearly on the fraction of available synaptic resources. Hence, the total synaptic current seen by a neuron is Isyn(t) = A·y(t), where A stands for an absolute synaptic strength. At this stage, we do not take into account the long-term effects associated with the plasticity of neuronal somata and take the parameter A to be time independent.

3.2 Astrocyte Response. Astrocytes adjacent to synaptic terminals respond to the neuronal action potentials by binding glutamate to their metabotropic glutamate receptors (Porter & McCarthy, 1996). The activation of these receptors then triggers the production of IP3, which consequently serves to modulate the intracellular concentration of calcium ions; the effective rate of IP3 production depends on the amount of transmitter released during the synaptic event. We therefore assume that the production of intracellular IP3 in the astrocyte is given by

d[IP3]/dt = (IP3* − [IP3])/τIP3 + rIP3·y.   (3.2)
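As a concrete illustration, equations 3.1 and 3.2 can be integrated with a simple forward-Euler scheme. The sketch below is our own minimal illustration, not the authors' simulation code: parameter values follow Table 1 (τin = τd, τrec, u, A, τIP3, IP3*, rIP3), while the time step and the spike times are arbitrary choices.

```python
# Forward-Euler sketch of the TUM synapse (equations 3.1) driving IP3
# production in the astrocyte (equation 3.2). Parameter values follow
# Table 1; dt and the spike times are illustrative choices.
tau_in, tau_rec = 0.010, 0.100          # sec
u, A = 0.1, 10.0                        # use fraction; strength (uA/cm^2)
tau_ip3, ip3_star, r_ip3 = 7.0, 0.16, 7.2

dt = 1e-4                               # sec
spike_steps = {5000, 6000, 7000}        # spikes at t = 0.5, 0.6, 0.7 sec

x, y, z, ip3 = 1.0, 0.0, 0.0, ip3_star  # all resources recovered at t = 0
for step in range(int(2.0 / dt)):
    if step in spike_steps:             # delta(t - tsp): move u*x from x to y
        x, y = x - u * x, y + u * x
    dx = z / tau_rec
    dy = -y / tau_in
    dz = y / tau_in - z / tau_rec
    dip3 = (ip3_star - ip3) / tau_ip3 + r_ip3 * y
    x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
    ip3 += dt * dip3

i_syn = A * y                           # postsynaptic current, I_syn = A*y(t)
print(round(x + y + z, 6))              # resources are conserved -> 1.0
```

Feeding a burst of closely spaced spikes into this loop reproduces the depression effect: each successive spike transfers u times a smaller x, so the per-spike PSC amplitude shrinks until x recovers on the timescale τrec, while each release event also bumps the astrocytic IP3 level.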
This equation is similar to the formulation used by Nadkarni and Jung (2004), with some important differences. First, the effective rate of IP3 production depends not on the potential of the neuronal membrane, but on the amount of neurotransmitter that is being released into the synaptic cleft. Hence, as the resources of the synapse are depleted (due to depression), there will be less transmitter released, and therefore IP3 will be produced at lower rates, leading eventually to decay of the calcium concentration. Second, as neurotransmitter is also released during spontaneous synaptic events (noise), the latter will also influence the production of IP3 and the subsequent calcium oscillations.

3.3 Astrocyte. To model the dynamics of a single astrocytic domain, we use the Li-Rinzel model (Li & Rinzel, 1994; Nadkarni & Jung, 2004), which has been specifically developed to take into account the IP3-dependent dynamical changes in the concentration of cytosolic Ca2+. This choice is based on the theoretical studies of Nadkarni and Jung (2004), where it was demonstrated that astrocytic Ca2+ oscillations may account for the spontaneous activity of neurons. The intracellular concentration of Ca2+ in the astrocyte is described by the following set of equations:

d[Ca2+]/dt = −Jchan − Jpump − Jleak   (3.3)

dq/dt = αq(1 − q) − βq·q.   (3.4)
Here, q is the fraction of activated IP3 receptors. The fluxes of currents through the ER membrane are given by the following expressions:

Jchan = c1·v1·m∞³·n∞³·q³·([Ca2+] − [Ca2+]ER)   (3.5)

Jpump = v3·[Ca2+]² / (k3² + [Ca2+]²)   (3.6)

Jleak = c1·v2·([Ca2+] − [Ca2+]ER),   (3.7)

where

m∞ = [IP3] / ([IP3] + d1)   (3.8)

n∞ = [Ca2+] / ([Ca2+] + d5)   (3.9)

αq = a2·d2·([IP3] + d1) / ([IP3] + d3)   (3.10)

βq = a2·[Ca2+].   (3.11)

The reversal Ca2+ concentration ([Ca2+]ER) is obtained after requiring conservation of the overall Ca2+ concentration:

[Ca2+]ER = (c0 − [Ca2+]) / c1.   (3.12)
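The Li-Rinzel update implied by equations 3.3 to 3.12 can be sketched as follows. This is a minimal forward-Euler illustration with the Table 1 parameters; the clamped IP3 level, the initial conditions, and the step size are our own assumptions rather than values from the paper.

```python
# Minimal forward-Euler sketch of the Li-Rinzel astrocyte (equations
# 3.3-3.12) with Table 1 parameters. The clamped IP3 level, initial
# conditions, and step size are illustrative assumptions.
c0, c1 = 2.0, 0.185                     # total Ca (uM); ER/cytosol ratio
v1, v2, v3 = 6.0, 0.11, 0.9             # channel, leak, and pump rates
k3 = 0.1                                # uM, pump activation constant
d1, d2, d3, d5 = 0.13, 1.049, 0.9434, 0.08234   # uM, IP3R constants
a2 = 0.2                                # 1/(uM sec), IP3R binding rate

def li_rinzel_step(ca, q, ip3, dt):
    ca_er = (c0 - ca) / c1                           # eq. 3.12
    m_inf = ip3 / (ip3 + d1)                         # eq. 3.8
    n_inf = ca / (ca + d5)                           # eq. 3.9
    j_chan = c1 * v1 * (m_inf * n_inf * q) ** 3 * (ca - ca_er)  # eq. 3.5
    j_pump = v3 * ca ** 2 / (k3 ** 2 + ca ** 2)      # eq. 3.6
    j_leak = c1 * v2 * (ca - ca_er)                  # eq. 3.7
    alpha_q = a2 * d2 * (ip3 + d1) / (ip3 + d3)      # eq. 3.10
    beta_q = a2 * ca                                 # eq. 3.11
    ca += dt * (-j_chan - j_pump - j_leak)           # eq. 3.3
    q += dt * (alpha_q * (1.0 - q) - beta_q * q)     # eq. 3.4
    return ca, q

ca, q = 0.1, 0.8                        # uM; fraction of open IP3 receptors
trace = []
for _ in range(100_000):                # 100 sec of model time
    ca, q = li_rinzel_step(ca, q, ip3=0.4, dt=1e-3)
    trace.append(ca)
```

With the IP3 level clamped well above its resting value IP3*, the calcium variable is driven far from its low starting level; holding ip3 near IP3* instead lets Ca2+ simply relax back toward rest, which is the threshold-like integrator behavior exploited in section 4.3.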
3.4 Glia-Synapse Interaction. Astrocytes affect synaptic vesicle release in a calcium-dependent manner. Rather than attempt a complete biophysical model of the complex chain of events leading from calcium rise to vesicle release (Gandhi & Stevens, 2003), we proceed in a phenomenological manner. We define a dynamical variable f that phenomenologically will capture this interaction. When the concentration of calcium in its synapse-associated process exceeds a threshold, we assume that the astrocyte emits a finite amount of neurotransmitter into the perisynaptic space, thus altering the state of a nearby synapse; this interaction occurs via glutamate binding to presynaptic mGlu and NMDA receptors (Zhang et al., 2004). As the internal astrocyte resource of neurotransmitter is finite, we include the saturation term (1 − f) in the dynamical equation for f. The final form is

df/dt = −f/τCa2+ + (1 − f)·κ·Θ([Ca2+] − [Ca2+]threshold).   (3.13)
Given this assumption, equations 3.1 should be modified to take this modulation into account. We assume the following simple form:

dx/dt = z/τrec − (1 − f)·u·x·δ(t − tsp) − x·η(f)   (3.14)

dy/dt = −y/τin + (1 − f)·u·x·δ(t − tsp) + x·η(f).   (3.15)

In equations 3.14 and 3.15, η(f) represents a noise term modeling the increased occurrence of mini-PSCs. The fact that a noise increase accompanies an amplitude decrease is partially due to competition for synaptic resources between these two release modes (Otsu et al., 2004). Based on experimental observations, we prescribe that the dependence of η(f) on f is such that the rate of noise occurrence (the frequency of η(f) in a fixed time step) increases with increasing f, but the amplitude distribution (modeled here as a gaussian-distributed variable centered around a positive mean) remains unchanged. For the rate of noise occurrence, we chose the following functional dependence:

P(f) = P0·exp[−((1 − f)/(√2·σ))²],   (3.16)
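A sketch of how equations 3.13 to 3.16 gate the synapse, again under forward Euler. The imposed Ca2+ transient, its threshold value, and the treatment of η(f) as discrete mini events of fixed amplitude ηmean are our own illustrative assumptions; κ, τca, P0, σ, and the synaptic constants come from Table 1.

```python
import math
import random

# Sketch of the glia-gated TUM synapse (equations 3.13-3.16): the gating
# variable f suppresses evoked release and raises the mini-PSC rate P(f).
# The imposed Ca2+ trace, its threshold, and the fixed mini amplitude are
# hypothetical; the named constants follow Table 1.
random.seed(1)
tau_in, tau_rec, u = 0.010, 0.100, 0.1  # sec, sec, use fraction
kappa, tau_ca_f = 0.5, 4.0              # 1/sec, sec (kappa, tau_ca)
P0, sigma, eta_mean = 0.5, 0.1, 1.2e-3
ca_threshold = 0.2                      # uM, hypothetical threshold

def mini_rate(f):                       # eq. 3.16, probability per step
    return P0 * math.exp(-((1.0 - f) / (math.sqrt(2.0) * sigma)) ** 2)

dt = 1e-3
x, y, z, f = 1.0, 0.0, 0.0, 0.0
for step in range(5000):
    ca = 0.4 if step < 2000 else 0.05   # imposed astrocytic Ca2+ trace
    gate = 1.0 if ca > ca_threshold else 0.0    # Theta term of eq. 3.13
    eta = eta_mean if random.random() < mini_rate(f) else 0.0
    if step in (500, 2500):             # two presynaptic spikes
        rel = (1.0 - f) * u * x         # evoked release, gated by f
        x, y = x - rel, y + rel
    x, y, z, f = (x + dt * (z / tau_rec) - x * eta,        # eq. 3.14
                  y - dt * (y / tau_in) + x * eta,         # eq. 3.15
                  z + dt * (y / tau_in - z / tau_rec),
                  f + dt * (-f / tau_ca_f + (1.0 - f) * kappa * gate))
```

The spike at step 2500 arrives while f is still elevated from the earlier Ca2+ transient, so its evoked release (1 − f)·u·x is smaller than that of the first spike: the time-delayed gatekeeping of section 4.3 in miniature.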
with P0 representing the maximal frequency of η(f) in a fixed time step. Note that although both synaptic terminals and astrocytes utilize glutamate for their signaling purposes, we assume the two processes to be independent. In so doing, we rely on existing biophysical experiments demonstrating that whereas a presynaptic terminal releases glutamate in the synaptic cleft, astrocytes selectively target extrasynaptic glutamate receptors (Araque et al., 1998a, 1998b). Hence, synaptic transmission does not interfere with the astrocyte-to-synapse signaling.

3.5 Neuron Model. We describe the neuronal dynamics with a simplified two-component Morris-Lecar model (Morris & Lecar, 1981),

dV/dt = −Iion(V, W) + Iext(t)   (3.17)

dW/dt = φ·(W∞(V) − W(V))/τW(V),   (3.18)

with Iion(V, W) representing the contribution of the internal ionic Ca2+, K+, and leakage currents, with their corresponding channel conductivities gCa, gK, and gL being constant:

Iion(V, W) = gCa·m∞(V)·(V − VCa) + gK·W(V)·(V − VK) + gL·(V − VL).   (3.19)

Iext represents all the external current sources stimulating the neuron, such as signals received through its synapses, glia-derived currents, artificial stimulations, as well as any noise sources. In the absence of any such stimulation, the fraction of open potassium channels, W(V), relaxes toward its limiting curve (nullcline) W∞(V), which is described by the sigmoid function

W∞(V) = (1/2)·[1 + tanh((V − V1)/V2)]   (3.20)

within a characteristic timescale given by

τW(V) = 1/cosh((V − V1)/(2·V2)).   (3.21)
In contrast to this, it is assumed in the Morris-Lecar model that calcium channels are activated immediately. Accordingly, the fraction of open Ca2+ channels obeys the following equation:

m∞(V) = (1/2)·[1 + tanh((V − V3)/V4)].   (3.22)
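The neuron model of equations 3.17 to 3.22 can be integrated in the same spirit. Conductances, reversal potentials, V1 to V4, φ, and Ibase follow Table 1 (with an implicit membrane capacitance of 1, so time is in msec); the step size, initial state, and the zero-crossing spike criterion are our own choices.

```python
import math

# Morris-Lecar neuron (equations 3.17-3.22) driven only by the constant
# background current, i.e., I_ext = I_base with no autaptic input yet.
# Table 1 parameters; dt, initial state, and spike criterion are ours.
g_ca, g_k, g_l = 1.1, 2.0, 0.5          # mS/cm^2
v_ca, v_k, v_l = 100.0, -70.0, -35.0    # mV
v1, v2, v3, v4 = -1.0, 15.0, 10.0, 14.5 # mV
phi, i_base = 0.3, 0.34                 # gate rate scale; uA/cm^2

def m_inf(v):                           # eq. 3.22, instantaneous Ca gate
    return 0.5 * (1.0 + math.tanh((v - v3) / v4))

def w_inf(v):                           # eq. 3.20, K-gate nullcline
    return 0.5 * (1.0 + math.tanh((v - v1) / v2))

def tau_w(v):                           # eq. 3.21
    return 1.0 / math.cosh((v - v1) / (2.0 * v2))

def i_ion(v, w):                        # eq. 3.19
    return (g_ca * m_inf(v) * (v - v_ca)
            + g_k * w * (v - v_k)
            + g_l * (v - v_l))

dt, v, w, n_spikes = 0.01, -40.0, 0.0, 0
for _ in range(500_000):                # 5 sec of model time (t in msec)
    v_new = v + dt * (-i_ion(v, w) + i_base)            # eq. 3.17
    w += dt * phi * (w_inf(v) - w) / tau_w(v)           # eq. 3.18
    if v < 0.0 <= v_new:                # upward zero crossing = spike
        n_spikes += 1
    v = v_new
```

Because Ibase places the neuron just past the tangent (saddle-node) bifurcation, the firing rate is very low; adding the autaptic current Isyn(t) = A·y(t) of section 3.1 to the right-hand side of equation 3.17 is what produces the fast, nearly periodic firing discussed in section 4.4.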
For an isolated neuron endowed with a single autapse, one has Iext(t) = Isyn(t) + Ibase, where Isyn(t) is the current arriving through the self-synapse and Ibase is some constant background current. In this work, we assume that Ibase is such that, when acting alone, it causes the neuron to fire at a very low constant rate. Of course, these two terms enter the equation additively, and the dynamics depends on the total external current. Nonetheless, it is important to separate them, as only one enters through the synapse; it is only this term that is modulated by astrocytic glutamate release and only this term that would be changed by synaptic blockers. As we will note later, the baseline current may also be due to astrocytes, albeit via a direct current into the neuronal soma. In anticipation of a better future understanding of this term, we consider it separately from the constant appearing in the leak current (gL·VL), although there is clearly some redundancy in the way these two terms set the operating point of the neuron.

4 Results

4.1 Synaptic Model. In simple models of neural networks, the synapse is considered to be a passive element that directly transmits information, in the form of arriving spikes on the presynaptic terminal, to postsynaptic currents. It has long been known that more complex synaptic dynamics can affect this transfer. One such effect concerns the finite reservoir of presynaptic vesicle resources and was modeled by Tsodyks, Uziel, and Markram (TUM) (Tsodyks et al., 2000). Spike trains with too high a frequency will be attenuated by a TUM synapse, as there is insufficient recovery from one arrival to the next. To demonstrate this effect, we fed the TUM synaptic model with an actual spike train recorded from a neuron in a cultured network (shown in Figure 1a); the resulting postsynaptic current (PSC) is shown in Figure 1b.
As expected, there is attenuation of the PSC height during time windows with high rates of presynaptic spiking input.

4.2 Effect of Presynaptic Gating. Our goal is to extend the TUM model to include the interaction of the synapse with an astrocytic process imagined to be wrapped around the synaptic cleft. The effects of astrocytes on stimulated synaptic transmission are well established. Araque et al. (1998a) report that astrocyte stimulation reduced the magnitude of
Figure 1: The generic effect of an astrocyte on presynaptic depression, as captured by our phenomenological model (see text for details). To illustrate the effect of presynaptic depression and the astrocyte influence, we feed a model synapse with an input of spikes taken from the recorded activity of a cultured neuronal network (see the text and Segev & Ben-Jacob, 2001 for details). (a) The input sequence of spikes that is fed into the model presynaptic terminal. (b) Each spike arriving at the model presynaptic terminal results in a postsynaptic current (PSC). The strength of the PSC depends on the amount of available synaptic resources, and the synaptic depression effect is clearly observable during spike trains with relatively high frequency. (c) The effect of a periodic gating function, f(t) = 0.5 + f0·sin(ωt), shown in (d). The period of the oscillation, T = 2π/ω = 2 sec, is taken to be compatible with the typical timescales of variations in the intraglial Ca2+ concentration. Note the reduction in the PSC near the maxima of f, along with the elevated baseline resulting from the increase in the rate of spontaneous presynaptic transfer.
action-potential-evoked excitatory and inhibitory synaptic currents by decreasing the probability of evoked transmitter release. Specifically, presynaptic metabotropic glutamate receptors (mGluRs) have been shown to affect stimulated synaptic transmission by regulating presynaptic voltage-gated calcium channels, which eventually leads to a reduction of the calcium flux during the incoming spike and results in a decrease of the amplitude of synaptic transmission. These results are best shown in Figure 8 of their paper, which presents the amplitude of the evoked EPSC both before and after stimulation of an associated astrocyte. Note that we are referring here to "faithful" synapses—those that transmit almost all of the incoming spikes. Effects of astrocytic stimulation on highly stochastic synapses, namely, the increase in fidelity (Kang, Jiang, Goldman, & Nedergaard, 1998), are not studied here. In addition, astrocytes were shown to increase the frequency of spontaneous synaptic events. In detail, Araque et al. (1998b) have shown that
astrocyte stimulation increases the frequency of miniature postsynaptic currents (mPSCs) without modifying their amplitude distribution, suggesting that astrocytes act to increase the probability of vesicular release from the presynaptic terminal. Although the exact mechanism is unknown, this effect is believed to be mediated by NMDA receptors located at the presynaptic terminal. It is important to note that the two kinds of astrocytic influence on the synapse (decrease of the probability of evoked release and increase in the probability of spontaneous release) do not contradict each other. Evoked transmitter release depends on the calcium influx through calcium channels that can be inhibited by the activation of presynaptic mGluRs. On the other hand, the increase in the probability of spontaneous release follows from the activation of presynaptic NMDA channels. In addition, spontaneous activity can deplete the vesicle pool (in terms of either number or filling) and hence directly lower triggered release amplitudes (Otsu et al., 2004). We model these effects by two modifications of the TUM model. First, we introduce a gating function f that modulates the stimulated release in a calcium-dependent manner. This term will cause the synapse to turn off at high calcium. This presynaptic gating effect is demonstrated in Figure 1c, where we show the resulting PSC corresponding to a case in which f is chosen to vary periodically with a timescale consistent with astrocytic calcium dynamics. The effect on the recorded spike train data is quite striking. The second effect, the increase of stochastic release in the absence of any input, is included as an f-dependent noise term in the TUM equations. This will be important as we turn to a self-consistent calculation of the synapse coupled to a dynamical astrocyte.

4.3 The Gatekeeping Effect.
We close the synapse-glia-synapse feedback loop by inclusion of the effect of the presynaptic activity on the intracellular Ca 2+ dynamics in the astrocyte that in turn sets the value of the gating function f . Nadkarni and Jung (2004) have argued that the basic calcium phenomenology in the astrocyte, arising via the glutamate-induced production of I P3 , can be studied using the Li-Rinzel model. What emerges from their work is that the dynamics of the intra-astrocyte Ca 2+ level depends on the intensity of the presynaptic spike train, acting as an information integrator over a timescale on the order of seconds; the level of Ca 2+ in the astrocyte increases according to the summation of the synaptic spikes over time. If the total number of spikes is low, the Ca 2+ concentration in the astrocyte remains below a self-amplification threshold level and simply decays back to its resting level with some characteristic time. However, things change dramatically when a sufficiently intense set of signals arises across the synapse. Now the Ca 2+ concentration overshoots its linear response level, followed by decaying oscillations. Given our results, these high Ca 2+ levels in the astrocyte will in fact attenuate spike information that arrives subsequent to strong bursts of activity. We illustrate this time-delayed gatekeeping (TDGK) effect in
Figure 2: The gatekeeping effect in a glia-gated synapse. (a) The input sequence of spikes, which is composed of several copies of the sequence shown in Figure 1, separated by long quiescent segments. The resulting time series may be viewed as bursts of action potentials arriving at the model presynaptic terminal. The first burst of spikes results in the elevation of the free astrocyte Ca2+ concentration (b), but this elevation alone is not sufficient to evoke an oscillatory response. An additional elevation of Ca2+, leading to the emergence of oscillation, is provided by the second burst of spikes arriving at the presynaptic terminal. Once the astrocytic Ca2+ crosses a predefined threshold, it starts to exert a modulatory influence back on the presynaptic terminal. In the model, this is manifested by the rising dynamics of the gating function (c). Note that as the decay time of the gating function f is on the order of seconds, the astrocyte influence on the presynaptic terminal persists even after the concentration of astrocyte Ca2+ has fallen. This is best seen from (d), where we show the profile of the PSC. The third burst of spikes arriving at the presynaptic terminal is modulated due to the astrocyte, even though the concentration of Ca2+ is relatively low at that time. This modulation extends also to the fourth burst of spikes, which together with the third burst leads again to the oscillatory response of astrocyte Ca2+. Taken together, these results illustrate a temporally nonlocal gatekeeping effect of glial cells.
Figure 2. We constructed a spike train by placing a time delay in between segments of recorded sequences. As can be seen, since the degree of activity during the first two segments exceeds the threshold level, there is attenuation of the late-arriving segments. Thus, the information passed through the synapse is modulated by previously arriving data.

4.4 Autaptic Excitatory Neurons. Our new view of synaptic dynamics will have broad consequences for making sense of neural circuitry. To illustrate this prospect, we turn to the study of an autaptic oscillator (Seung, Lee, Reis, & Tank, 2000), by which we mean an excitatory neuron
that exhibits repeated spiking driven at least in part by self-synapses (Segal, 1991, 1994; Bekkers & Stevens, 1991; Lubke, Markram, Frotscher, & Sakmann, 1996). By including the coupling of a model neuron to our synapse system, we can investigate both the case of an associated astrocyte with externally imposed temporal behavior and the case where the astrocyte dynamics is itself determined by feedback from this particular synapse. Finally, we should be clear that when we refer to one synapse, we are also dealing implicitly with the case of multiple self-synapses, all coupled to the same astrocytic domain, which in turn exhibits correlated dynamics in its processes connecting to these multiple sites. It is important to note that this same modulation can in fact correlate multiple synapses connecting distinct neurons coupled to the same astrocyte. The effect of this new multisynaptic coupling on the spatiotemporal flow of information in a model network will be described elsewhere. We focus on an excitatory neuron modeled with Morris-Lecar dynamics, as described in section 3. We add some external bias current so as to place the neuron in a state slightly beyond the saddle-node bifurcation, where it spontaneously oscillates at a very low frequency in the absence of any synaptic input. We then assume that this neuron has a self-synapse (autapse). An excitatory self-synapse clearly has the possibility of causing a much higher spiking rate than would otherwise be the case; this behavior without any astrocyte influence is shown in Figure 3. The existence of autaptic neurons was originally demonstrated in cultured networks (Segal, 1991, 1994; Bekkers & Stevens, 1991) but has been detected in the intact neocortex as well (Lubke et al., 1996). Importantly, these can be either inhibitory or excitatory. There has been some speculation regarding the role of autapses in memory (Seung et al., 2000), but this is not our concern here.
Are such neurons observed experimentally? In Figure 4 we show a typical raster plot recorded from a cultured neural network grown from a dissociated mixture of glial and neuronal cortical cells taken from 1-day-old Charles River rats (see section 2). The spontaneous activity of the network is marked by synchronized bursting events (SBEs)—short (several hundred ms) periods during which most of the recorded neurons show relatively rapid firing, separated by long (order of seconds) time intervals of sporadic neuronal firing of most of the neurons (Segev & Ben-Jacob, 2001). Only a small fraction of special neurons (termed spiker neurons) also exhibit rapid firing during inter-SBE intervals. These spiker neurons also exhibit much higher firing rates during the SBEs. But the behavior of these rapidly firing neurons does not match that expected of the simple autaptic oscillator. The major differences, as illustrated by comparing Figures 3 and 4, are (1) the existence of long interspike intervals for the spikers, marked by a long-tailed (Lévy) distribution of the increments of the interspike intervals, and (2) the beating or burstlike rate modulation in the temporal ordering of the spike train.
V. Volman, E. Ben-Jacob, and H. Levine
Figure 3: Activity of a model neuron containing the self-synapse (autapse), as modeled by the classical Tsodyks-Uziel-Markram model of synaptic transmission. In this case, it is possible to recover some of the features of cortical rapidly firing neurons, namely, the relatively high-frequency persistent activity. However, the resulting time series of action potentials for such a model neuron, shown in (a), is almost periodic. Due to the self-synapse, a periodic series of spikes results in the periodic pattern for the postsynaptic current, shown in (b), which closes the self-consistency loop by causing the model neuron to generate a periodic time series of spikes. A further difference between the model neuron and cortical rapidly firing neurons is seen upon comparing the corresponding distributions of ISI increments, plotted on a double-logarithmic scale. These distributions, shown in (c), disclose that, contrary to the cortical rapidly firing neurons, the increments distribution for the model neuron with a TUM autapse (diamonds) is gaussian (seen as a stretched parabola on a double-log scale), pointing to the existence of a characteristic timescale. On the other hand, distributions for cortical neurons (circles) decay algebraically and are much broader. The distribution of the model neuron has been vertically shifted for clarity of comparison.
Figure 4: Electrical activity of in-vitro cortical networks. These cultured networks form spontaneously from a dissociated mixture of cortical neurons and glial cells drawn from 1-day-old Charles River rats. The cells are homogeneously spread over a lithographically specified area of Poly-D-Lysine for attachment to the recording electrodes. The activity of a network is marked by the formation of synchronized bursting events (SBEs), short (∼100–400 msec) periods of time during which most of the recorded neurons are active. (a) Raster plot of recorded activity, showing a sample of a few SBEs. The time axis is divided into 0.1 s bins. Each row is a binary bar-code representation of the activity of an individual neuron; the bars mark detection of spikes. Note that while the majority of the recorded neurons fire rapidly mostly during SBEs, some neurons are marked by persistent intense activity (e.g., neuron no. 12). This property supports the notion that the activity of these neurons is autonomous and hence self-amplified. (b) A zoomed view of a sample synchronized bursting event. Note that each neuron has its own pattern of activity during the SBE. To assess the differences in activity between ordinary neurons and neurons that show intense firing between the SBEs, for each neuron we constructed the series of increments of interspike intervals (ISI), defined as δ(i) = ISI(i + 1) − ISI(i), i ≥ 1. The distributions of δ(i), shown in (c), disclose that the dynamics of ordinary neurons (squares) is similar to the dynamics of rapidly firing neurons (circles), up to the timescale of 100 msec, corresponding to the width of a typical SBE. Note that since increments of interspike intervals are analyzed, the increased rate of neuron firing does not necessarily affect the shape of the distribution.
Yet above the characteristic time of 100 msec, the distributions diverge, possibly indicating the existence of additional mechanisms governing the activity of rapidly firing neurons on a longer timescale. Note that for normal neurons, there is another peak at typical interburst intervals (> seconds), not shown here.
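The increment series defined in the caption of Figure 4 is straightforward to compute from a list of spike times. A minimal sketch follows; the function names are ours, and histogramming absolute increments on logarithmic bins is our choice for mimicking the double-log plots:

```python
import numpy as np

def isi_increments(spike_times):
    """Series of interspike-interval increments,
    delta(i) = ISI(i+1) - ISI(i), as defined in the caption of Figure 4."""
    isi = np.diff(np.asarray(spike_times, dtype=float))
    return np.diff(isi)

def increment_histogram(spike_times, bins):
    """Normalized histogram of |delta(ISI)|, suitable for the double-log
    plots of Figures 3-5 (using absolute increments is our choice)."""
    d = np.abs(isi_increments(spike_times))
    counts, edges = np.histogram(d[d > 0], bins=bins)
    return counts / counts.sum(), edges

# a perfectly regular spike train has (numerically) zero increments,
# the signature of the almost-periodic autaptic model of Figure 3
regular = np.arange(0.0, 10.0, 0.1)
deltas = isi_increments(regular)
```

A gaussian increment distribution appears as a stretched parabola on such a double-log plot, whereas the recorded spikers show an algebraic decay.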
Motivated by the above and by the glial gatekeeping effect studied earlier, we proceed to test whether an autaptic oscillator with a glia-regulated self-synapse brings the model into better agreement. In Figure 5 we show that the activity of such a modified model does show the additional modulation. The basic mechanism results from the fact that after a period of rapid firing of the neuron, the astrocyte intracellular Ca2+ concentration (shown in Figure 5b) exceeds the critical threshold for time-delayed attenuation. This then stops the activity and gives rise to large interspike intervals. The distributions shown in Figure 5 are a much better match to the experimental data for time intervals up to 100 msec.
5 Robustness Tests

5.1 Stochastic Li-Rinzel Model. One of the implicit assumptions of our model for astrocyte-synapse interaction is related to the deterministic nature of astrocyte calcium release. It is assumed that in the absence of any IP3 signals from the associated synapses, the astrocyte will stay “silent,” in the sense that there will be no spontaneous Ca2+ events. However, it should be kept in mind that the equations for the calcium channel dynamics used in the Li-Rinzel model in fact describe the collective behavior of large numbers of channels. In reality, experimental evidence indicates that the calcium release channels in astrocytes are spatially organized in small clusters of 20 to 50 channels—the so-called microdomains. These microdomains were found to contain small membrane leaflets (on the order of 10 nm thick), wrapping around the synapses and potentially able to synchronize ensembles of synapses. This finding calls for a new view of astrocytes as cells with multiple functional and structural compartments. The microdomains (within the same astrocyte) have been observed to generate spontaneous Ca2+ signals. As the passage of calcium ions through a single channel is subject to fluctuations, the stochastic aspects can become important for small clusters of channels. Inclusion of stochastic effects can explain the generation of calcium puffs: fast, localized elevations of calcium concentration. Hence, it is important to test the possible effect of stochastic calcium events on the model’s behavior. We achieve this goal by replacing the deterministic Li-Rinzel model with its stochastic version, obtained using a Langevin approximation, as recently described by Shuai and Jung (2003). With the Langevin approach, the equation for the fraction of open calcium channels is modified and takes the following form:
dq/dt = αq (1 − q) − βq q + ξ(t),  (5.1)
Figure 5: The activity of a model neuron containing a glia-gated autapse. The equations of synaptic transmission for this case have been modified to take into account the influence of the synaptically associated astrocyte, as explained in the text. The resulting spike time series, shown in (a), deviates from periodicity due to the slow modulation of the synapse by the adjacent astrocyte. The relatively intense activity at the presynaptic terminal activates astrocyte receptors, which in turn leads to the production of IP3 and subsequent oscillations of the free astrocyte Ca2+ concentration. The period of these oscillations, shown in (b), is much larger than the characteristic time between spikes arriving at the presynaptic terminal. Because the Ca2+ dynamics is oscillatory, so also will be the dynamics of the gating function f, as is seen from (c), and the period of oscillations for f will follow the period of the Ca2+ oscillations. The periodic behavior of f leads to slow periodic modulation of the PSC pattern (shown in (d)), which closes the self-consistency loop by causing the neuron to fire in a burstlike manner. Additional information is obtained after comparison of the distributions of ISI increments, shown in (e). Contrary to results for the model neuron with a simple autapse (see Figure 3c), the distribution for a glia-gated autaptic model neuron (diamonds) now closely follows the distributions of two sample recorded cortical rapidly firing neurons (circles), up to the characteristic time of ∼100 msec, which corresponds to the width of a typical SBE. The heavy tails of the recorded distributions above this characteristic time indicate that network mechanisms are involved in shaping the form of the distribution on longer timescales.
Figure 6: The dynamical behavior of an astrocyte-gated model autaptic neuron, including the stochastic release of calcium from ER of astrocyte. Shown are the results of the simulation when calcium release from intracellular stores is mediated by a cluster of N = 10 channels. The generic form of the spike time series (shown in (a)) does not differ from those obtained for the deterministic model. Namely, even for the stochastic model, the neuron is still firing in a burstlike manner. Although the temporal profile of astrocyte calcium (b) is irregular, the resulting dynamics of the gating function (c) is relatively smooth, stemming from the choice of the gating function dynamics (being an integration over the calcium profile). As a result, the PSC profile (d) does not differ much from the corresponding PSC profile obtained for the deterministic model.
in which the stochastic term ξ(t) has the following properties:

⟨ξ(t)⟩ = 0,  (5.2)

⟨ξ(t) ξ(t′)⟩ = [αq (1 − q) + βq q] δ(t − t′) / N.  (5.3)
In the limit of very large cluster size, N → ∞, the effect of stochastic Ca2+ release becomes insignificant. By contrast, the dynamics of calcium release is greatly modified for small cluster sizes. A typical spike-time series of a glia-gated autaptic neuron, obtained for a cluster size of N = 10 channels, is shown in Figure 6a. Note that while there are considerable fluctuations in the concentration of astrocyte calcium (see Figure 6b), the dynamics of the gating function (see Figure 6c) is less irregular. This follows because our choice of the gating function corresponds to an integration of calcium events. We have also checked that the distribution of interspike intervals is practically unchanged (data not shown). All told, our results indicate that including the stochastic nature of the release of calcium from the astrocyte ER does not affect the dynamics of our model autaptic neuron in any significant way.
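Equations 5.1 to 5.3 can be integrated with a simple Euler-Maruyama step. In the sketch below the rates αq and βq are held constant (assumed values; in the full model they depend on Ca2+ and IP3 through the Li-Rinzel kinetics), which is already enough to exhibit the 1/√N scaling of the channel-number fluctuations:

```python
import numpy as np

def simulate_q(N, alpha=0.2, beta=0.4, T=200.0, dt=0.01, seed=0):
    """Euler-Maruyama integration of eq. 5.1, with noise variance per
    unit time (alpha*(1-q) + beta*q)/N as in eq. 5.3. alpha and beta
    are held constant here (assumed illustrative values)."""
    rng = np.random.default_rng(seed)
    q = alpha / (alpha + beta)        # start at the deterministic fixed point
    out = np.empty(int(T / dt))
    for i in range(len(out)):
        drift = alpha * (1.0 - q) - beta * q
        var = (alpha * (1.0 - q) + beta * q) / N
        q += dt * drift + np.sqrt(var * dt) * rng.standard_normal()
        q = min(max(q, 0.0), 1.0)     # keep the open fraction in [0, 1]
        out[i] = q
    return out

small = simulate_q(N=10)       # strong fluctuations for a small cluster
large = simulate_q(N=10000)    # nearly deterministic for a large cluster
```

The standard deviation of `small` exceeds that of `large` by roughly √(10000/10) ≈ 30, which is why small microdomains can produce calcium puffs while the bulk Li-Rinzel equations stay smooth.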
[Figure 7 plot: probability distribution vs. δ(ISI) [msec], double-log scale; curves for τ_f = 4 sec, κ = 5×10⁻⁴ and τ_f = 40 sec, κ = 1×10⁻⁴.]
Figure 7: Distributions of interspike interval increments for the model of an astrocyte-gated autaptic neuron with slow dynamics of the gating function, as compared with the corresponding distribution for the deterministic Li-Rinzel model. Due to the slow dynamics of the gating function, the transitions between different phases of bursting are blurred, resulting in a weaker tail for the distribution of interspike interval increments.
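The fast-versus-slow gating contrast behind Figure 7 can be illustrated with a toy first-order gating equation. The functional form below (a leaky integrator of a thresholded calcium signal, df/dt = −f/τ_f + κ·H(Ca − Ca_thr)) and all numerical values except τ_f are our assumptions for illustration; the paper's actual gating dynamics is only described qualitatively in this excerpt.

```python
import numpy as np

def gate_response(tau_f, kappa, ca, ca_thr=0.3, dt=0.01):
    """Integrate an assumed gating equation df/dt = -f/tau_f + kappa*H(Ca - ca_thr),
    with H the Heaviside step; times are in seconds."""
    f, out = 0.0, np.empty(len(ca))
    for i, c in enumerate(ca):
        f += dt * (-f / tau_f + (kappa if c > ca_thr else 0.0))
        out[i] = f
    return out

# square calcium pulse: 20 s above threshold, then 20 s below
ca = np.concatenate([np.full(2000, 1.0), np.zeros(2000)])
fast = gate_response(tau_f=4.0, kappa=0.5, ca=ca)    # kappa values illustrative
slow = gate_response(tau_f=40.0, kappa=0.1, ca=ca)
# relative to its steady state kappa*tau_f, the slow gate rises and decays
# less completely, blurring the transitions between bursting phases
```

With τ_f = 40 s the gate traverses only about 40% of its steady-state range during the pulse, which is the "blurring" that weakens the tail of the increment distribution.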
5.2 The Correlation Time of the Gating Function. Another assumption made in our model concerns the dynamics of the gating function. We have assumed a simple first-order differential equation for the dynamics of our phenomenological gating function and have selected timescales that are believed to be consistent with the influence of astrocytes on synaptic terminals. However, because the exact nature of the underlying processes (and corresponding timescales) is unknown, it is important to test the robustness of the model to variations in the gating function dynamics. To do that, we altered the baseline dynamics of the gating function to have a slower characteristic decay time and a slower rate of accumulation; for example, we can set τ_f = 40 sec and κ = 0.1 sec⁻¹. Simulations show that the only effect is a slight blurring of the transition between different phases of the bursting, as would be expected. This can best be detected by looking at the distribution of interspike interval increments for the case of slow gating dynamics. The distribution, shown in Figure 7, has a weaker tail as compared to the distribution obtained for the faster gating dynamics. This result follows because for slower gating, the modulation of the postsynaptic current is weaker. Hence, the transitions from intense firing
Figure 8: The dynamical behavior of an astrocyte-gated model autaptic neuron with a slowly oscillating background current. Shown are the results of the simulation when Ibase ∝ sin(2πt/T), T = 10 sec. The mean level of Ibase is set so as to put the neuron in the quiescent phase for half a period. The resulting spike time series (a) discloses the burstlike firing of the neuron, with the superimposed oscillatory dynamics of the background current. The variations in the concentration of astrocyte calcium (b) are much more temporally localized, and so is the resulting dynamics of the gating function (c). Consequently, the PSC profile (d) strongly reflects the burstlike synaptic transmission efficacy, thus forcing the neuron to fire in a burstlike manner and closing the self-consistency loop.
to low-frequency spiking are less abrupt, resulting in a relatively low proportion of large increments. It is worth remembering that large increments of interspike intervals reflect sudden changes in dynamics, which are eliminated by the blurring. Clearly, the model with fast gating does a better job of fitting the spiker data.

5.3 Time-Dependent Background Current. All of the main results were obtained under the assumption of a constant background current feeding into the neuronal soma, such that when acting alone, this current forces the model neuron to fire at a very low frequency. One may justly argue that there is no such thing as a constant current. Indeed, if the background current is to reflect biological reality, it should possess some dynamics. For example, a better match would be to imagine the background current to be associated with the activity in adjacent astrocytes (see, e.g., Angulo, Kozlov, Charpak, & Audinat, 2004). To test this, we simulated a glia-gated autaptic neuron subject to a slowly oscillating (T = 10 sec) background current. For this case, we found that the behavior of the model is generically the same. Yet now the transitions between the bursting phases are sharper (see Figure 8a). This in turn leads to sharper modulation of the postsynaptic currents (shown in Figure 8d). We can confirm this by noting that the distribution of interspike interval increments has a slightly heavier tail, as compared to the distribution obtained for
the case of constant background current (data not shown). On the other hand, replacing the constant current with the oscillating one introduces a typical frequency not seen in the actual spiker data. This artificial problem will presumably disappear when the background current is determined self-consistently as part of the overall network activity. Similarly, the key to extending the increment distributions to longer timescales seems to be getting the network feedback to the spikers to regulate the interburst timing, which at the moment is too regular. This will be presented in a future publication.

6 Discussion

In this article, we have proposed that the regulation of synaptic transmission by astrocytic calcium dynamics is a critical new component of neural circuitry. We have used existing biophysical experiments to construct a coupled synapse-astrocyte model to illustrate this regulation and explore its consequences for an autaptic oscillator, arguably the most elementary neural circuit. Our results can be compared to data taken from cultured neuron networks. This comparison reveals that the glial gatekeeping effect appears to be necessary for an understanding of the interspike interval distribution of observed rapidly firing spiker neurons, for timescales up to about 100 msec. Of course, many aspects of our modeling are quite simplified as compared to the underlying biophysics. We have investigated the sensitivity of our results to the modification of some of the parameters of our model as well as to the addition of more complex dynamics for the various parts of our system. Our results with regard to the interspike interval are exceedingly robust. This work should be viewed as a step toward understanding the full dynamical consequences brought about by the strong reciprocal couplings between synapses and the glial processes that envelop them. We have focused on the fact that astrocytic emissions shut down synaptic transmission when the activity becomes too high.
This mechanism appears to be a necessary part of the regulation of spiker activity; without it, spikers would fire too often, too regularly. Related work by S. Nadkarni and P. Jung (private communication, July 2005) focuses on a different aspect: that of increased fidelity of synaptic release (for otherwise highly stochastic synapses) due to glia-mediated increases in presynaptic calcium levels. As our working assumption is that the spikers are most likely to be neurons with “faithful” autapses, this effect does not play a role in our attempt to compare to the experimental data. It will of course be necessary to combine these two different pieces to obtain a more complete picture. The application to spikers is just one way in which our new synaptic dynamics may alter our thinking about neural circuits. This particular application is appealing and informative but must at the moment be
considered an untested hypothesis. Future experimental work must test the assumption that spikers have significant excitatory autaptic coupling, that pharmacological blockage of the synaptic current reverts their firing to low-frequency, almost periodic patterns, and that cutting the feedback loop with the enveloping astrocyte eliminates the heavy-tail increment distribution. Work toward achieving these tests is ongoing. In the experimental system, a purported autaptic neuron is part of an active network and would therefore receive input currents from the other neurons in the network. This more complex input would clearly alter the very-long-time interspike interval distribution, especially given the existence of a new interburst timescale in the problem. Similarly, the current approach of adding a constant background current to the neuron is not realistic; the actual background current, due to such processes as glial-generated currents in the cell soma, would again alter the long-time distribution. Preliminary tests have shown that these effects could extend the range of agreement between autaptic oscillator statistics and experimental measurements. Just as the network provides additional input for the spiker, the spiker provides part of the stimulation that leads to the bursting dynamics. Future work will endeavor to create a fully self-consistent network model to explore the overall activity patterns of this system. One issue that needs investigation concerns the role that glia might have in coordinating the action of neighboring synapses. It is well known that a single astrocytic process might contact thousands of synapses; if the calcium excitation spreads from being a local increase in a specific terminus to being a more widespread phenomenon within the glial cell body, neighboring synapses can become dynamically coupled. The role of this extra complexity in shaping the burst structure and time sequence is as yet unknown.

Acknowledgments

We thank Gerald M. Edelman for insightful conversation about the possible role of glia. Eugene Izhikevich, Peter Jung, Suhita Nadkarni, and Itay Baruchi are acknowledged for useful comments and for the critical reading of an earlier version of this manuscript. V. V. thanks the Center for Theoretical Biological Physics for hospitality. This work has been supported in part by the NSF-sponsored Center for Theoretical Biological Physics (grant numbers PHY-0216576 and PHY-0225630) and by the Maguy-Glass Chair in Physics of Complex Systems.

References

Angulo, M., Kozlov, A., Charpak, S., & Audinat, E. (2004). Glutamate released from glial cells synchronizes neuronal activity in the hippocampus. J. Neurosci., 24(31), 6920–6927.
Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1998a). Glutamate-dependent astrocyte modulation of synaptic transmission between cultured hippocampal neurons. Eur. J. Neurosci., 10(6), 2129–2142.
Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1998b). Glutamate-dependent astrocyte modulation of synaptic transmission between cultured hippocampal neurons. J. Neurosci., 18(17), 6822–6829.
Araque, A., Parpura, V., Sanzgiri, R., & Haydon, P. (1999). Tripartite synapses: Glia, the unacknowledged partner. Trends in Neurosci., 22(5), 208–215.
Bekkers, J., & Stevens, C. (1991). Excitatory and inhibitory autaptic currents in isolated hippocampal neurons maintained in cell culture. Proc. Nat. Acad. Sci., 88, 7834–7838.
Charles, A., Merrill, J., Dirksen, E., & Sanderson, M. (1991). Intercellular signaling in glial cells: Calcium waves and oscillations in response to mechanical stimulation and glutamate. Neuron, 6, 983–992.
Cornell-Bell, A., & Finkbeiner, S. (1991). Ca2+ waves in astrocytes. Cell Calcium, 12, 185–204.
Cornell-Bell, A., Finkbeiner, S., Cooper, M., & Smith, S. (1990). Glutamate induces calcium waves in cultured astrocytes: Long-range glial signaling. Science, 247, 470–473.
Gandhi, A., & Stevens, C. (2003). Three modes of synaptic vesicular release revealed by single-vesicle imaging. Nature, 423, 607–613.
Haydon, P. (2001). Glia: Listening and talking to the synapse. Nat. Rev. Neurosci., 2(3), 185–193.
Hofer, T., Venance, L., & Giaume, C. (2003). Control and plasticity of intercellular calcium waves in astrocytes. J. Neurosci., 22, 4850–4859.
Hulata, E., Segev, R., Shapira, Y., Benveniste, M., & Ben-Jacob, E. (2000). Detection and sorting of neural spikes using wavelet packets. Phys. Rev. Lett., 85, 4637–4640.
Kang, J., Jiang, L., Goldman, S., & Nedergaard, M. (1998). Astrocyte-mediated potentiation of inhibitory synaptic transmission. Nat. Neurosci., 1, 683–692.
Li, Y., & Rinzel, J. (1994). Equations for inositol-triphosphate receptor-mediated calcium oscillations derived from a detailed kinetic model: A Hodgkin-Huxley like formalism. J. Theor. Biol., 166, 461–473.
Lubke, J., Markram, H., Frotscher, M., & Sakmann, B. (1996). Frequency and dendritic distributions of autapses established by layer-5 pyramidal neurons in the developing rat cortex. J. Neurosci., 16, 3209–3218.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Nadkarni, S., & Jung, P. (2004). Spontaneous oscillations of dressed neurons: A new mechanism for epilepsy? Phys. Rev. Lett., 91(26).
Newman, E. (2003). New roles for astrocytes: Regulation of synaptic transmission. Trends in Neurosci., 26(10), 536–542.
Otsu, Y., Shahrezaei, V., Li, B., Raymond, L., Delaney, K., & Murphy, T. (2004). Competition between phasic and asynchronous release for recovered synaptic vesicles at developing hippocampal autaptic synapses. J. Neurosci., 24(2), 420–433.
Perea, G., & Araque, A. (2002). Communication between astrocytes and neurons: A complex language. J. Physiol., 96, 199–207.
Perea, G., & Araque, A. (2005). Properties of synaptically evoked astrocyte calcium signals reveal synaptic information processing by astrocytes. J. Neurosci., 25, 2192–2203.
Porter, J., & McCarthy, K. (1996). Hippocampal astrocytes in situ respond to glutamate released from synaptic terminals. J. Neurosci., 16(16), 5073–5081.
Segal, M. (1991). Epileptiform activity in microcultures containing one excitatory hippocampal neuron. J. Neurophysiol., 65, 761–770.
Segal, M. (1994). Endogenous bursts underlie seizurelike activity in solitary excitatory hippocampal neurons in microculture. J. Neurophysiol., 72, 1874–1884.
Segev, R., & Ben-Jacob, E. (2001). Spontaneous synchronized bursting activity in 2D neural networks. Physica A, 302, 64–69.
Segev, R., Benveniste, M., Hulata, E., Cohen, N., Paleski, A., Kapon, E., Shapira, Y., & Ben-Jacob, E. (2002). Long term behavior of lithographically prepared in-vitro neural networks. Phys. Rev. Lett., 88, 118102.
Seung, H., Lee, D., Reis, B., & Tank, D. (2000). The autapse: A simple illustration of short-term analog memory storage by tuned synaptic feedback. J. Comp. Neurosci., 9, 171–185.
Shuai, J., & Jung, P. (2003). Langevin modeling of intracellular calcium dynamics. In M. Falcke & D. Malchow (Eds.), Understanding calcium dynamics—experiments and theory (pp. 231–252). Berlin: Springer.
Sneyd, J., Wetton, B., Charles, A., & Sanderson, M. (1995). Intercellular calcium waves mediated by diffusion of inositol triphosphate: A two-dimensional model. Am. J. Physiology, 268, C1537–C1545.
Takano, T., Tian, G., Peng, W., Lou, N., Libionka, W., Han, X., & Nedergaard, M. (2006). Astrocyte-mediated control of cerebral blood flow. Nat. Neurosci., 9(2), 260–267.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with frequency-dependent synapses. J. Neurosci., 20(RC50), 1–5.
Volterra, A., & Meldolesi, J. (2005). Astrocytes, from brain glue to communication elements: The revolution continues. Nat. Rev. Neurosci., 6, 626–640.
Zhang, Q., Pangrsic, T., Kreft, M., Krzan, M., Li, N., Sul, J., Halassa, M., van Bockstaele, E., Zorec, R., & Haydon, P. (2004). Fusion-related release of glutamate from astrocytes. J. Biol. Chem., 279, 12724–12733.
Received January 6, 2006; accepted May 25, 2006.
LETTER
Communicated by Gert Cauwenberghs
Thermodynamically Equivalent Silicon Models of Voltage-Dependent Ion Channels

Kai M. Hynna
[email protected]
Kwabena Boahen
[email protected]
Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.
We model ion channels in silicon by exploiting similarities between the thermodynamic principles that govern ion channels and those that govern transistors. Using just eight transistors, we replicate—for the first time in silicon—the sigmoidal voltage dependence of activation (or inactivation) and the bell-shaped voltage dependence of its time constant. We derive equations describing the dynamics of our silicon analog and explore its flexibility by varying its parameters. In addition, we validate the design by implementing a channel with a single activation variable. The design’s compactness allows tens of thousands of copies to be built on a single chip, facilitating the study of biologically realistic models of neural computation at the network level in silicon.

1 Neural Models

A key computational component within the neurons of the brain is the ion channel. These channels display a wide range of voltage-dependent responses (Llinas, 1988). Some channels open as the cell depolarizes; others open as the cell hyperpolarizes; a third kind exhibits transient dynamics, opening and then closing in response to changes in the membrane voltage. The voltage dependence of the channel plays a functional role. Cells in the gerbil medial superior olivary complex possess a potassium channel that activates on depolarization and helps in phase-locking the response to incoming auditory information (Svirskis, Kotak, Sanes, & Rinzel, 2002). Thalamic cells possess a hyperpolarization-activated cation current that contributes to rhythmic bursting in thalamic neurons during periods of sleep by depolarizing the cell from hyperpolarized levels (McCormick & Pape, 1990). More ubiquitously, action potential generation is the result of voltage-dependent dynamics of a sodium channel and a delayed rectifier potassium channel (Hodgkin & Huxley, 1952). Researchers use a variety of techniques to study voltage-dependent channels, each possessing distinct advantages and disadvantages.
Neural Computation 19, 327–350 (2007)
© 2007 Massachusetts Institute of Technology
K. Hynna and K. Boahen
Neurobiologists perform experiments on real cells, both in vivo and in vitro; however, while working with real cells eliminates the need for justifying assumptions within a model, limitations in technology restrict recordings to tens of neurons. Computational neuroscientists create computer models of real cells, using simulations to test the function of a channel within a single cell or a network of cells. While providing a great deal of flexibility, computational neuroscientists are often limited by the processing power of the computer and must constantly balance the complexity of the model with practical simulation times. For instance, a Sun Fire 480R takes 20 minutes to simulate 1 second of the 4000-neuron network (M. Shelley, personal communication, 2004) of Tao, Shelley, McLaughlin, and Shapley (2004), just 1 mm2 (a single hypercolumn) of a sublayer (4Cα) of the primary visual cortex (V1). An emerging medium for modeling neural circuits is the silicon chip, a technique at the heart of neuromorphic engineering. To model the brain, neuromorphic engineers use the transistor’s physical properties to create silicon analogs of neural circuits. Rather than build abstractions of neural processing, which make gross simplifications of brain function, the neuromorphic engineer designs circuit components, such as ion channels, from which silicon neurons are built. Silicon is an attractive medium because a single chip can have thousands of heterogeneous silicon neurons that operate in real time. Thus, network phenomena can be studied without waiting hours, or days, for a simulation to run. To date, however, neuromorphic models have not captured the voltage dependence of the ion channel’s temporal dynamics, a problem outstanding since 1991, the year Mahowald and Douglas (1991) published their seminal work on silicon neurons. 
Neuromorphic circuits are constrained by surface area on the silicon die, limiting their complexity, as more complex circuits translate to fewer silicon neurons on a chip. In the face of this trade-off, previous attempts at designing neuromorphic models of voltage-dependent ion channels (Mahowald & Douglas, 1991; Simoni, Cymbalyuk, Sorensen, Calabrese, & DeWeerth, 2004) sacrificed the time constant’s voltage dependence, keeping the time constant fixed. In some cases, however, this nonlinear property is critical. An example is the low-threshold calcium channel in the thalamus’s relay neurons. The time constant for inactivation can vary over an order of magnitude depending on the membrane voltage. This variation defines the relative lengths of the interburst interval (long) and the burst duration (short) when the cell bursts rhythmically. Sodium channels involved in spike generation also possess a voltage-dependent inactivation time constant that varies from a peak of approximately 8 ms just below spike threshold to as fast as 1 ms at the peak of the voltage spike (Hodgkin & Huxley, 1952). Ignoring this variation by fixing the time constant alters the dynamics that shape the action potential. For example, lowering the maximum (peak) time constant would reduce the
Voltage-Dependent Silicon Ion Channel Models
329
size of the voltage spike due to faster inactivation of sodium channels below threshold. Inactivation of these channels is a factor in the failure of action potential propagation in Purkinje cells (Monsivais, Clark, Roth, & Hausser, 2005). On the other hand, increasing the minimum time constant at more depolarized levels—that is, near the peak voltage of the spike—would increase the width of the action potential, as cell repolarization begins once the potassium channels overcome the inactivating sodium channels. A wider spike could influence the cell's behavior through several mechanisms, such as those triggered by increased calcium entry through voltage-dependent calcium channels. In this letter, we present a compact circuit that models the nonlinear dynamics of the ion channel's gating particles. Our circuit is based on linear thermodynamic models of ion channels (Destexhe & Huguenard, 2000), which apply thermodynamic considerations to the gating particle's movement, due to conformational changes of the ion channel protein in an electric field. Similar considerations applied to the transistor make clear that the ion channel and the transistor operate under similar principles. This observation, originally recognized by Carver Mead (1989), allows us to implement the voltage dependence of the ion channel's temporal dynamics, while at the same time using fewer transistors than previous neuromorphic models that do not possess these nonlinear dynamics. With a more compact design, we can incorporate a larger number of silicon neurons on a chip without sacrificing biological realism. The next section provides a brief tutorial on the similarities between the underlying physics of thermodynamic models and that of transistors. In section 3, we derive a circuit that captures the gating particle's dynamics.
In section 4, we derive the equations defining the dynamics of our variable circuit, and section 5 describes the implementation of an ion channel population with a single activation variable. Finally, we discuss the ramifications of this design in the conclusion.

2 Ion Channels and Transistors

Thermodynamic models of ion channels are founded on Hodgkin and Huxley's (empirical) model of the ion channel. A channel model consists of a series of independent gating particles whose binary state—open or closed—determines the channel permeability. A Hodgkin-Huxley (HH) variable represents the probability of a particle being in the open state or, with respect to the channel population, the fraction of gating particles that are open. The kinetics of the variable are simply described by

$$(1 - u) \;\underset{\beta(V)}{\overset{\alpha(V)}{\rightleftharpoons}}\; u, \tag{2.1}$$
K. Hynna and K. Boahen
where α(V) and β(V) define the voltage-dependent transition rates between the states (indicated by the arrows), u is the HH variable, and (1 − u) represents the closed fraction. α(V) and β(V) thus define the voltage-dependent dynamics of the two-state gating particle. At steady state, the total number of gating particles that are opening—that is, the opening flux, which depends on the number of channels closed and the opening rate—is balanced by the total number of gating particles that are closing (the closing flux). Increasing one of the transition rates—through, for example, a shift in the membrane voltage—will increase the respective flow of particles changing state; the system will find a new steady state with new open and closed fractions such that the fluxes again cancel each other out. We can describe these dynamics simply using a differential equation:

$$\frac{du}{dt} = \alpha(V)\,(1 - u) - \beta(V)\,u. \tag{2.2}$$
The first term represents the opening flux, the product of the opening transition rate α(V) and the fraction of particles closed (1 − u). The second term represents the closing flux. Depending on which flux is larger (opening or closing), the fraction of open channels u will increase or decrease accordingly. Equation 2.2 is often expressed in the following form:

$$\frac{du}{dt} = -\frac{1}{\tau_u(V)}\,\bigl(u - u_\infty(V)\bigr), \tag{2.3}$$
where

$$u_\infty(V) = \frac{\alpha(V)}{\alpha(V) + \beta(V)}, \tag{2.4}$$

$$\tau_u(V) = \frac{1}{\alpha(V) + \beta(V)}, \tag{2.5}$$
represent the steady-state level and time constant (respectively) for u. This form is much more intuitive to use, as it describes, for a given membrane voltage, where u will settle and how fast. In addition, these quantities are much easier for neuroscientists to extract from real cells through voltage-clamp experiments. We will come back to the form of equation 2.3 in section 4. For now, we will focus on the dynamics of the gating particle in terms of transition rates. In thermodynamic models, state changes of a gating particle are related to changes in the conformation of the ion channel protein (Hill & Chen, 1972; Destexhe & Huguenard, 2000, 2001). Each state possesses a certain energy
Figure 1: Energy diagram of a reaction. The transition rates between two states depend on the heights of the energy barriers (ΔGC and ΔGO), the differences in energy between the activated state (G∗) and the initial states (GC or GO). Thus, the time constant depends on the height of the energy barriers, and the steady state depends on the difference in energy between the closed and open states (ΔG).
(see Figure 1), dependent on the interactions of the protein molecule with the electric field across the membrane. For a state transition to occur, the molecule must overcome an energy barrier (see Figure 1), defined as the difference in energy between the initial state and an intermediate activated state. The size of the barrier controls the rate of transition between states (Hille, 1992):

$$\alpha(V) = \alpha_0\, e^{-\Delta G_C(V)/RT}, \tag{2.6}$$

$$\beta(V) = \beta_0\, e^{-\Delta G_O(V)/RT}, \tag{2.7}$$
where α0 and β0 are constants representing base transition rates (at zero barrier height), ΔGC(V) and ΔGO(V) are the voltage-dependent energy barriers, R is the gas constant, and T is the temperature in kelvin. Changes in the membrane voltage of the cell, and thus the electric field across the membrane, influence the energies of the protein's conformations differently, changing the sizes of the barriers and altering the transition rates between states. Increasing a barrier decreases the respective transition rate, slowing the dynamics, since fewer proteins will have sufficient energy. The steady state depends on the energy difference between the two states. For a difference of zero and equivalent base transition rates, particles are
equally distributed between the two states. Otherwise, the state with lower energy is the preferred one. The voltage dependence of an energy barrier has many components, both linear and nonlinear. Linear thermodynamic models, as the name implies, assume that the linear voltage dependence dominates. This dependence may be produced by the movement of a monopole or dipole through an electric field (Hill & Chen, 1972; Stevens, 1978). In this situation, the above rate equations simplify to

$$\alpha(V) = A\, e^{-b_1 (V - V_H)/RT}, \tag{2.8}$$

$$\beta(V) = A\, e^{-b_2 (V - V_H)/RT}, \tag{2.9}$$
where VH and A represent the half-activation voltage and rate, while b1 and b2 define the linear relationship between each barrier and the membrane voltage. The magnitude of the linear term depends on such factors as the net movement of charge or net change in charge due to the conformation of the channel protein. Thus, ion channels use structural differences to define different membrane voltage dependencies. While linear thermodynamic models have simple governing equations, they possess a significant flaw: time constants can reach extremely small values at voltages where either α(V) or β(V) becomes large (see equation 2.5), which is unrealistic, since such vanishingly small time constants do not occur in biology. Adding nonlinear terms in the energy expansion of α(V) and β(V) can counter this effect (Destexhe & Huguenard, 2000). Other solutions involve either saturating the transition rate (Willms, Baro, Harris-Warrick, & Guckenheimer, 1999) or using a three-state model (Destexhe & Huguenard, 2001), where the forward and reverse transition rates between two of the states are fixed, effectively setting the maximum transition rate. Linear models, however, bear the closest resemblance to the MOS transistor, which operates under similar thermodynamic principles. Short for metal-oxide-semiconductor, the MOS transistor is named for its structure: a metallic gate (today, a polysilicon gate) atop a thin oxide, which insulates the gate from a semiconductor channel. The channel, part of the body or substrate of the transistor, lies between two heavily doped regions called the source and the drain (see Figure 2). There are two types of MOS transistors: negative or n-type (NMOS) and positive or p-type (PMOS). NMOS transistors possess a drain and a source that are negatively doped—areas where the charge carriers are negatively charged electrons. These two areas exist within a p-type substrate, a positively doped area, where the charge carriers are positively charged holes.
A PMOS transistor consists of a p-type source and drain within an n-type well. While the rest of this discussion focuses on NMOS transistor operation, the same principles apply to PMOS transistors, except that the charge carrier is of the opposite sign.
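The two-state kinetics of equations 2.2 to 2.5 are easy to sketch numerically. The Python fragment below is illustrative only: the rate amplitude, slopes, and half-activation voltage are assumed values, not parameters of any fitted channel. It uses linear-thermodynamic rates of the form of equations 2.8 and 2.9 (with the b/RT factors folded into a single mV⁻¹ slope) and integrates equation 2.2 with forward Euler, which relaxes u to the steady state u∞(V) of equation 2.4:

```python
import math

def alpha(V, A=0.5, b1=-0.04, VH=-60.0):
    # opening rate in the form of eq. 2.8 (V in mV, A in 1/ms);
    # b1 < 0 makes the opening rate grow with depolarization
    return A * math.exp(-b1 * (V - VH))

def beta(V, A=0.5, b2=0.04, VH=-60.0):
    # closing rate in the form of eq. 2.9; b2 > b1 yields an activation variable
    return A * math.exp(-b2 * (V - VH))

def u_inf(V):
    return alpha(V) / (alpha(V) + beta(V))   # eq. 2.4

def tau_u(V):
    return 1.0 / (alpha(V) + beta(V))        # eq. 2.5

def integrate(V, u0=0.0, dt=0.01, t_end=50.0):
    # forward-Euler integration of eq. 2.2: du/dt = alpha(V)(1-u) - beta(V)u
    u = u0
    for _ in range(int(t_end / dt)):
        u += dt * (alpha(V) * (1.0 - u) - beta(V) * u)
    return u
```

Because the fixed point of equation 2.2 is α/(α + β), the integrated value settles at u∞(V) for any clamped voltage, which is exactly the equivalence expressed by equation 2.3.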
Figure 2: MOS transistor. (a) Cross-section of an n-type MOS transistor. The transistor has four terminals: source (S), drain (D), gate (G), and bulk (B), sometimes referred to as the back-gate. (b) Symbols for the two transistor types: NMOS (left) and PMOS (right). The transistor is a symmetric device, and thus the direction of its current—by convention, the flow of positive charges—indicates the drain and the source. In an NMOS, current flows from drain to source, as indicated by the arrow. Conversely, current flows from source to drain in a PMOS.
In the subthreshold regime, charge flows across the channel by diffusion from the source end of the channel, where the density is high, to the drain, where the density is low. Governed by the same laws of thermodynamics that govern protein conformations, the density of charge carriers at the source and drain ends of the channel depends exponentially on the size of the energy barriers there (see Figure 3). These energy barriers exist due to a built-in potential difference, and thus an electric field, between the channel and the source or the drain. Adjusting the voltage at the source, or the drain, changes the charge carriers' energy level. For the NMOS transistor's negatively charged electrons, increasing the source voltage decreases the energy level; hence, the barrier height increases. This decreases the charge density at that end of the channel, as fewer electrons have the energy required to overcome the barrier. The voltage applied to the gate, which influences the potential at the surface of the channel, has the opposite effect: increasing it (e.g., from VG to VG1 in Figure 3) decreases the barrier height—at both ends of the channel. Factoring in the exponential charge density dependence on barrier height yields the relationship between an NMOS transistor's channel current and its terminal voltages (Mead, 1989):

$$I_{ds} = I_{ds0}\left(e^{(\kappa V_{GB} - V_{SB})/U_T} - e^{(\kappa V_{GB} - V_{DB})/U_T}\right), \tag{2.10}$$
where κ describes the relationship between the gate voltage and the potential at the channel surface. UT is called the thermal voltage (25.4 mV at room temperature), and Ids0 is the baseline diffusion current, defined by the barrier introduced when the oppositely doped regions (p-type and n-type) were fabricated. Note that, for clarity, UT will not appear in the
Figure 3: Energy diagram of a transistor. The vertical dimension represents the energy of negative charge carriers (electrons) within an NMOS transistor, while the horizontal dimension represents location within the transistor. φS and φD are the energy barriers faced by electrons attempting to enter the channel from the source and drain, respectively. VD , VS , and VG are the terminal voltages, designated by their subscripts. During transistor operation, VD > VS , and thus φS < φD . VG1 represents another scenario with a higher gate voltage. (Adapted from Mead, 1989.)
remaining transistor current equations, as all transistor voltages from here on are given in units of UT. When VDB exceeds VSB by 4 UT or more, the drain term becomes negligible and is ignored; the transistor is then said to be in saturation. The similarities in the underlying physics of ion channels and transistors allow us to use transistors as thermodynamic isomorphs of ion channels. In both, there is a linear relationship between the energy barrier and the controlling voltage. For the ion channel, either isolated charges or dipoles of the channel protein have to overcome the electric field created by the voltage across the membrane. For the transistor, electrons, or holes, have to overcome the electric field created by the voltage difference between the source, or drain, and the transistor channel. In both instances, the transport of charge across the energy barrier is governed by a Boltzmann distribution, which results in an exponential voltage dependence. In the next section, we use these similarities to design an efficient transistor representation of the gating particle dynamics.

3 Variable Circuit

Based on the discussion from the previous section, it is tempting to think we may be able to use a single transistor to model the gating dynamics of a channel particle completely. However, obtaining the transition rates solves only part of the problem. We still need to multiply the rates with the number of gating particles in each state to obtain the opening and closing fluxes, and then integrate the flux difference to update the particle counts
Figure 4: Channel variable circuit. (a) The voltage uV represents the logarithm of the channel variable u. VO and VC are linearly related to the membrane voltage, with slopes of opposite sign. uH and uL are adjustable bias voltages. (b) Two transistors (N3 and N4) are added to saturate the variable’s opening and closing rates; the bias voltages uτ H and uτ L set the saturation level.
(see equation 2.2). A capacitor can perform the integration if we use charge to represent particle count and current to represent flux. The voltage on the capacitor, which is linearly proportional to its charge, yields the result. The first sign of trouble appears when we attempt to connect a capacitor to the transistor's source (or drain) terminal to perform the integration. As the capacitor integrates the current, the voltage changes, and hence the transistor's barrier height changes. Thus, the barrier height depends on the particle count, which is not the case in biology; gating particles do not (directly) affect the barrier height when they switch state. Our only remaining option, the gate voltage, is unsuitable for defining the barrier height, as it influences the barrier at both ends of the channel identically. α(V) and β(V), however, demonstrate opposite dependencies on the membrane voltage; that is, one increases while the other decreases. We can resolve this conundrum by connecting two transistors to a single capacitor (see Figure 4a). Each transistor defines an energy barrier for one of the transition rates: transistor N1 uses its source and gate voltages (uL and VC, respectively) to define the closing rate, and transistor N2 uses its drain and gate voltages (uH and VO) to define the opening rate (where uH > uL). We integrate the difference in transistor currents on the capacitor Cu to update the particle count. Notice that neither barrier
depends on the capacitor voltage uV. Thus, uV becomes representative of the fraction of open channels; it increases as particles switch to the open state. How do we compute the fluxes from the transition rates? If uV directly represented the particle count, we would take the product of uV and the transition rates. However, we can avoid multiplying altogether if uV represents the logarithm of the open fraction rather than the open fraction itself. uV's dynamics are described by the differential equation

$$C_u \frac{du_V}{dt} = I_{ds0}\, e^{\kappa V_O}\left(e^{-u_V} - e^{-u_H}\right) - I_{ds0}\, e^{\kappa V_C} e^{-u_L} = I_{ds0}\, e^{\kappa V_O - u_H}\left(e^{-(u_V - u_H)} - 1\right) - I_{ds0}\, e^{\kappa V_C - u_L}, \tag{3.1}$$
where Ids0 and κ are transistor parameters (defined in equation 2.10), and VO, VC, uH, uL, and uV are voltages (defined in Figure 4a). We assume N1 remains in saturation during the channel's operation; that is, uV > uL + 4 UT, making the drain voltage's influence negligible. The analogies between equation 3.1 and equation 2.2 become clear when we divide the latter by u. Our barriers—N1's source-gate for the closing rate and N2's drain-gate for the opening rate—correspond to α(V) and β(V), while e−(uV − uH) corresponds to u−1. Thus, our circuit computes (and integrates) the net flux divided by u, the open fraction. Fortuitously, the net flux scaled by the open fraction is exactly what we need to update the fraction's logarithm, since d log(u)/dt = (du/dt)/u. Indeed, substituting uV = log u + uH—our log-domain representation of u—into equation 3.1 yields

$$\frac{Q_u}{u}\frac{du}{dt} = I_{ds0}\, e^{\kappa V_O - u_H}\left(\frac{1}{u} - 1\right) - I_{ds0}\, e^{\kappa V_C - u_L}$$

$$\frac{du}{dt} = \frac{I_{ds0}}{Q_u}\, e^{\kappa V_O - u_H}\,(1 - u) - \frac{I_{ds0}}{Q_u}\, e^{\kappa V_C - u_L}\, u, \tag{3.2}$$
where Qu = Cu UT . If we design VC and VO to be functions of the membrane voltage V, equation 3.2 becomes directly analogous to equation 2.2. In linear thermodynamic models, the transition rates depend exponentially on the membrane voltage. We can realize this by designing VC and VO to be linear functions of V, albeit with slopes of opposite sign. The opposite slopes ensure that as the membrane voltage shifts in one direction, the opening and closing rates will change in opposite directions relative to each other. Thus far in our circuit design, we have not specified whether the variable activates or inactivates as the membrane voltage increases. Recall that for activation, the gating particle opens as the cell depolarizes, whereas for
inactivation, the gating particle opens as the cell hyperpolarizes. In our circuit, whether activation or inactivation occurs depends on how we define VO and VC with respect to V. Increasing VO, and decreasing VC, with V defines an activation variable, as at depolarized levels this results in α(V) > β(V). Conversely, increasing VC, and decreasing VO, with V defines an inactivation variable, as now at depolarized voltages, β(V) > α(V), and the variable will equilibrate in a closed state. Naturally, our circuit has the same limitation that all two-state linear thermodynamic models have: its time constant approaches zero when either VO or VC grows large, as the transition rates α(V) and β(V) become unrealistically large. This shortcoming is easily rectified by imposing an upper limit on the transition rates, as has been done for other thermodynamic models (Willms et al., 1999). We realize this saturation by placing two additional transistors in series with the original two (see Figure 4b). With these transistors, the transition rates become

$$\alpha(V) = \frac{I_{ds0}}{Q_u}\, \frac{e^{\kappa V_O - u_H}}{1 + e^{\kappa (V_O - u_{\tau H})}}, \tag{3.3}$$

$$\beta(V) = \frac{I_{ds0}}{Q_u}\, \frac{e^{\kappa V_C - u_L}}{1 + e^{\kappa (V_C - u_{\tau L})}}, \tag{3.4}$$
where the single exponentials in equation 3.2 are now scaled by an additional exponential term. The voltages uτH and uτL (fixed biases) set the maximum transition rate for opening and closing, respectively. That is, when VO < uτH − 4 UT, α(V) ∝ eκVO, a rate function exponentially dependent on the membrane voltage. But when VO > uτH + 4 UT, α(V) ∝ eκuτH, limiting the transition rate and fixing the minimum time constant for channel opening. The behavior of β(V) is similarly defined by VC's value relative to uτL. In the following section, we explore how the channel variable computed by our circuit changes with the membrane voltage and how quickly it approaches steady state. To do so, we must relate the steady state and time constant to the opening and closing rates and specify how the circuit's opening and closing voltages depend on the membrane voltage.

4 Circuit Operation

To help understand the operation of the channel circuit and the influence of various circuit parameters, we will derive u∞(V) and τu(V) for the circuit in Figure 4b using equations 2.4 and 2.5 and the transistor rate equations (equations 3.3 and 3.4), limiting our presentation to the activation version. The derivation, and the influence of various parameters, is similar for the inactivation version.
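The log-domain trick of section 3 can be checked numerically. The sketch below works in normalized units (voltages in multiples of UT, with Ids0/Qu set to 1); the bias values are assumptions chosen for illustration, not chip settings. It integrates equation 3.1 for uV and confirms that u = e^(uV − uH) settles at the fixed point α/(α + β) implied by equation 3.2:

```python
import math

# Illustrative bias settings in normalized units (voltages in U_T, Ids0/Qu = 1);
# these are assumed values for the sketch, not measurements from the chip.
KAPPA = 0.7
U_H, U_L = 20.0, 4.0

def simulate_uV(VO, VC, uV0, dt=1e-3, t_end=40.0):
    # forward-Euler integration of eq. 3.1 (saturation transistors omitted):
    # duV/dt = e^{kVO - uH} (e^{-(uV - uH)} - 1) - e^{kVC - uL}
    uV = uV0
    for _ in range(int(t_end / dt)):
        duV = math.exp(KAPPA * VO - U_H) * (math.exp(-(uV - U_H)) - 1.0) \
              - math.exp(KAPPA * VC - U_L)
        uV += dt * duV
    return uV

def u_from_uV(uV):
    # the log-domain representation: uV = log(u) + uH
    return math.exp(uV - U_H)

def u_steady(VO, VC):
    # steady state of eq. 3.2: u_inf = alpha / (alpha + beta)
    a = math.exp(KAPPA * VO - U_H)
    b = math.exp(KAPPA * VC - U_L)
    return a / (a + b)
```

The capacitor voltage itself never converges to u; only its exponential does, which is the point of the log-domain representation.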
For the activation version of the channel circuit (see Figure 4b), we define the opening and closing voltages' dependence on the membrane voltage as

$$V_O = \phi_o + \gamma_o V, \tag{4.1}$$

$$V_C = \phi_c - \gamma_c V, \tag{4.2}$$
where φo, γo, φc, and γc are positive constants representing the offsets and slopes for the opening and closing voltages. Additional circuitry is required to define these constants; one example is described in the next section. In this section, however, we will leave the definition as such while we derive the equations for the circuit. Under certain restrictions (see appendix A), u's steady-state level has a sigmoidal voltage dependence:

$$u_\infty(V) = \frac{1}{1 + \exp\left(-\dfrac{V - V_{\mathrm{mid},u}}{V^{*}_{u}}\right)}, \tag{4.3}$$

where

$$V_{\mathrm{mid},u} = \frac{1}{\gamma_o + \gamma_c}\left(\phi_c - \phi_o + \frac{u_H - u_L}{\kappa}\right), \tag{4.4}$$

$$V^{*}_{u} = \frac{1}{\gamma_o + \gamma_c}\, \frac{U_T}{\kappa}. \tag{4.5}$$
Figure 5a shows how the sigmoid arises from the transition rates and, through them, its relationship to the voltage biases. The midpoint of the sigmoid, where the open probability equals one-half, occurs when α(V) = β(V); it will thus shift with any voltage biases (φo, φc, uH, or uL) that scale either of the transition rate currents. For example, increasing uH reduces α(V), shifting the midpoint to higher voltages. The slope of the sigmoid around the midpoint is defined by the slopes γo and γc of VO(V) and VC(V). To obtain the sigmoid shape, we restricted the effect of saturation. It is assumed that the bias voltages uτH and uτL are set such that saturation is negligible in the linear midsegment of the sigmoid (i.e., VO < uτH − 4 UT and VC < uτL − 4 UT when V ∼ Vmid,u). That is why α(V) and β(V) appear to be pure exponentials in Figure 5a. This restriction is reasonable, as saturation is supposed to occur only for large excursions from Vmid,u, where it imposes a lower limit on the time constant. Therefore, under this assumption, the sigmoid lacks any dependence on the biases uτH and uτL.
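Equations 4.3 to 4.5 are straightforward to explore numerically. In the sketch below, κ and UT are set to nominal values and the bias numbers passed in are arbitrary illustrations; it reproduces, for instance, the midpoint shift to higher voltages when uH is raised:

```python
import math

KAPPA, UT = 0.7, 25.4   # nominal transistor slope factor and thermal voltage (mV)

def sigmoid_params(phi_o, gamma_o, phi_c, gamma_c, uH, uL):
    # eqs. 4.4 and 4.5 (all bias voltages in mV)
    V_mid = (phi_c - phi_o + (uH - uL) / KAPPA) / (gamma_o + gamma_c)
    V_star = (UT / KAPPA) / (gamma_o + gamma_c)
    return V_mid, V_star

def u_inf(V, V_mid, V_star):
    # eq. 4.3: sigmoidal steady state
    return 1.0 / (1.0 + math.exp(-(V - V_mid) / V_star))
```

By construction, u∞ equals one-half exactly at Vmid, and raising uH moves Vmid upward, matching the bias dependence described above.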
Figure 5: Steady state and time constants for channel circuit. (a) The variable’s steady-state value (u∞ ) changes sigmoidally with membrane voltage (V), dictated by the ratio of the opening and closing rates (dashed lines). The midpoint occurs when the rates are equal, and hence its horizontal location is affected by the bias voltages (uH and uL ) applied to the circuit (see Figure 4b). (b) The variable’s time constant (τu ) has a bell-shaped dependence on the membrane voltage (V), dictated by the reciprocal of the opening and closing rates (dashed lines). The time constant diverges from these asymptotes at intermediate voltages, where neither rate dominates; it follows the reciprocal of their sum, peaking when the sum is minimized.
Under certain further restrictions (see appendix A), u's time constant has a bell-shaped voltage dependence:

$$\tau_u(V) = \tau_{\min}\left(1 + \frac{1}{\exp\left(\dfrac{V - V_{1u}}{V^{*}_{1u}}\right) + \exp\left(-\dfrac{V - V_{2u}}{V^{*}_{2u}}\right)}\right), \tag{4.6}$$

where

$$V_{1u} = (u_{\tau H} - \phi_o)/\gamma_o, \qquad V^{*}_{1u} = U_T/(\kappa\,\gamma_o),$$

$$V_{2u} = \bigl(\phi_c - u_{\tau H} + (u_H - u_L)/\kappa\bigr)/\gamma_c, \qquad V^{*}_{2u} = U_T/(\kappa\,\gamma_c),$$

and $\tau_{\min} = (Q_u/I_{ds0})\, e^{-(\kappa u_{\tau H} - u_H)}$. Figure 5b shows how the bell shape arises from the transition rates and, through them, its relationship to the voltage biases. For large excursions of the membrane voltage, one transition rate dominates, and the time constant closely follows its inverse. For small excursions, neither rate dominates, and the time constant diverges from the inverses, peaking at the membrane voltage where the sum of the transition rates is minimized.
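Equation 4.6 can be evaluated directly. The snippet below plugs in the fitted parameter values reported in Figure 8 (τmin = 0.0425 ms, V1u = 571.3 mV, V∗1u = 36.9 mV, V2u = 169.8 mV, V∗2u = 71.8 mV) and exhibits the behavior just described: a peak at intermediate voltages and a decay toward τmin at the extremes:

```python
import math

def tau_u(V, tau_min, V1u, V1s, V2u, V2s):
    # eq. 4.6: bell-shaped dependence of the time constant on membrane voltage (mV)
    return tau_min * (1.0 + 1.0 / (math.exp((V - V1u) / V1s)
                                   + math.exp(-(V - V2u) / V2s)))

# fitted values from Figure 8 (time in ms, voltages in mV)
FIT = dict(tau_min=0.0425, V1u=571.3, V1s=36.9, V2u=169.8, V2s=71.8)
```

With these values the time constant near 400 mV exceeds its value at either 300 mV or 500 mV, and at strongly depolarized voltages it approaches τmin, as the saturation transistors enforce.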
To obtain the bell shape, we saturated the opening and closing rates at the same level by setting κuτH − uH = κuτL − uL. Though not strictly necessary, this assumption simplifies the expression for τu(V) by matching the minimum time constants at hyperpolarized and depolarized voltages, yielding the result given in equation 4.6. The bell shape also requires this so-called minimum time constant to be smaller than the peak time constant in the absence of saturation. The free parameters within the circuit—φo, γo, φc, γc, uH, uL, uτH, and uτL—allow for much flexibility in designing a channel. Appendix B provides an expanded discussion on the influence of the various parameters in the equations above. In the following section, we present measurements from a simple activation channel designed using this circuit, which was fabricated in a standard 0.25 µm CMOS process.
5 A Simple Activation Channel

Our goal here is to implement an activating channel to serve as a concrete example and examine its behavior through experiment. We start with the channel variable circuit, which computes the logarithm of the channel variable, and attach its output voltage to the gate of a transistor, which uses the subthreshold regime's exponential current-voltage relationship to invert the logarithm. The current this transistor produces, which is directly proportional to the variable, can be injected directly into a silicon neuron (Hynna & Boahen, 2001) or can be used to define a conductance (Simoni et al., 2004). The actual choice is irrelevant for the purposes of this article, which demonstrates only the channel variable. In addition to the output transistor, we also need circuitry to compute the opening and closing voltages from the membrane voltage. For the opening voltage (VO), we simply use a wire to tie it to the membrane voltage (V), which yields a slope of unity (γo = 1) and an intercept of zero (φo = 0). For the closing voltage (VC), we use four transistors to invert the membrane voltage. The end result is shown in Figure 6. For the voltage inverter circuit we chose, the closing voltage's intercept is φc = κVC0 (set by a bias voltage VC0) and its slope is γc = κ²/(κ + 1) (set by the transistor parameter defined in equation 2.10). Since κ ≈ 0.7, the closing voltage has a shallower slope than the opening voltage, which makes the closing rate change more gradually, skewing the bell curve in the hyperpolarizing direction, as intended for the application in which this circuit was used (Hynna, 2005). This eight-transistor design captures the ion channel's nonlinear dynamics, which we demonstrated by performing voltage-clamp experiments (see Figure 7). As the command voltage (i.e., step size) increases, the output current's time course and final amplitude both change.
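As a rough consistency check, the sigmoid slope predicted by equation 4.5 for this channel can be computed from the inverter's slopes alone. The nominal values κ ≈ 0.7 and UT = 25.4 mV are assumptions here (κ in fact varies somewhat with voltage and across chips); with γo = 1 and γc = κ²/(κ + 1):

```python
# Predicted sigmoid slope (eq. 4.5) for the section-5 channel,
# using gamma_o = 1 and gamma_c = kappa^2 / (kappa + 1) from the inverter.
KAPPA = 0.7    # nominal transistor slope factor (assumed)
UT = 25.4      # thermal voltage in mV at room temperature

gamma_o = 1.0
gamma_c = KAPPA ** 2 / (KAPPA + 1.0)
V_star = (UT / KAPPA) / (gamma_o + gamma_c)   # eq. 4.5, in mV
```

This evaluates to roughly 28 mV, in the neighborhood of the 28.8 mV slope extracted from the measured sigmoid in Figure 8.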
The clustering and speed at low and high voltages are what we would expect from a sigmoidal steady-state dependence with a bell-shaped time constant. The
Figure 6: A simple activating channel. A voltage inverter (N5-8) produces the closing voltage (VC ); a channel variable circuit (N1-3) implements the variable’s dynamics in the log domain (uV ); and an antilog transistor (N4) produces a current (IT ) proportional to the variable. The opening voltage (VO ) is identical to the membrane voltage (V). The series transistor (N2) sets the minimum time constant at depolarized levels. The same circuit can be used to implement an inactivating channel simply by swapping VO and VC .
Figure 7: Channel circuit’s measured voltage-dependent activation. When the membrane voltage is stepped to increasing levels, from the same starting level, the output current becomes increasingly larger, approaching its steady-state amplitudes at varying speeds.
relationship between this output current (IT) and the activation variable, defined as $u = e^{u_V - u_H}$, has the form

$$I_T = u^{\kappa}\, \bar{I}_T. \tag{5.1}$$
Figure 8: Channel circuit's measured sigmoid and bell curve. (a) Dependence of activation on membrane voltage in steady state, captured by sweeping the membrane voltage slowly and recording the normalized current output. (b) Dependence of time constant on membrane voltage, extracted from the curves in Figure 7 by fitting exponentials. Fits (solid lines) are of equations 4.3 and 4.6: Vmid,u = 423.0 mV, V∗u = 28.8 mV, τmin = 0.0425 ms, V1u = 571.3 mV, V∗1u = 36.9 mV, V2u = 169.8 mV, V∗2u = 71.8 mV.
Its maximum value is $\bar{I}_T = e^{\kappa u_H - u_G}$, and its exponent is κ ≈ 0.7 (the same transistor parameter). It is possible to achieve an exponent of unity, or even a square or a cube, but this requires many more transistors. We measured the sigmoidal change in activation directly, by sweeping the membrane voltage slowly, and its bell-shaped time constant indirectly, by fitting the voltage-clamp data in Figure 7 with exponentials. The results are shown in Figure 8; the relationship above was used to obtain u from IT. The solid lines are the fits of equations 4.3 and 4.6, which reasonably capture the behavior in both sets of data. The range in the time constant data is limited by the experimental protocol used. Since we modulated only the step size from a fixed hyperpolarized position, we needed a measurable change in the steady-state output current to be able to measure the temporal dynamics for opening. However, this worked to our benefit: given the range of the fit, there was no need to modify equation 4.6 to allow the time constant to go to zero at hyperpolarized levels (this circuit omits the second saturation transistor in Figure 4b). All of the extracted parameters from the fit are reasonably close to our expectations—based on equations 4.3 and 4.6 and our applied voltage biases—except for V∗2u. For κ ≈ 0.7 and UT = 25.4 mV, we expected V∗2u ≈ 125 mV, but our fit yielded V∗2u ≈ 71.8 mV. There are two possible explanations, not mutually exclusive. First, the fact that equation 4.6 assumes the presence of a saturation transistor, together with the limited data along the left side of the bell curve, may have contributed to the underfitting of that value. Second, κ is not constant within the chip but possesses a voltage dependence. Overall, however, the analysis matches the performance of the circuit reasonably well.
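The procedure behind Figure 8b can be sketched in a few lines: invert equation 5.1 to recover u from the measured current, then fit a single exponential to the rising trace. This is an illustrative reconstruction, not the authors' analysis code; the fit here is a simple log-linear regression through the origin, assuming a trace that rises from zero and has converged by its last sample:

```python
import math

KAPPA = 0.7    # nominal transistor slope factor

def current_to_u(I_T, I_max):
    # invert eq. 5.1 (I_T = u^kappa * I_max) to recover the channel variable
    return (I_T / I_max) ** (1.0 / KAPPA)

def fit_tau(ts, us):
    # fit u(t) = u_inf * (1 - exp(-t/tau)) to a trace rising from zero:
    # -log(1 - u/u_inf) is linear in t with slope 1/tau
    u_inf = us[-1]                       # assumes the trace has converged
    pts = [(t, -math.log(1.0 - u / u_inf))
           for t, u in zip(ts, us) if u < 0.99 * u_inf]
    num = sum(t * y for t, y in pts)
    den = sum(t * t for t, _ in pts)
    return den / num                     # regression slope is 1/tau
```

Applying this to each command voltage in a family of traces like Figure 7 yields one (u∞, τu) pair per voltage, which is what the fits in Figure 8 summarize.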
6 Conclusion

We showed that the transistor is a thermodynamic isomorph of a channel gating particle. The analogy is established by considering the operation of both within the framework of energy models. Both involve the movement of charge within an electric field: for the channel, due to conformational changes of the ion channel protein; for the transistor, due to charge carriers entering the transistor channel. Using this analogy, we designed a compact channel variable circuit. We demonstrated our variable circuit's operation by implementing a simple channel with a single activation variable, showing that the steady state is sigmoidal and the time constant bell shaped. Our measured results, obtained through voltage-clamp experiments, matched our analytical results, derived from knowledge of transistors. Bias voltages applied to the circuit allow us to shift the sigmoid and the bell curve and to set the bell curve's height independently. However, the sigmoid's slope, and hence its location relative to the bell curve, cannot be changed (it is set by the MOS transistor's κ parameter). Our variable circuit is not limited to activation variables: reversing the opening and closing voltages' linear dependence on the membrane voltage turns the circuit into an inactivation variable. In addition, channels that both activate and inactivate are easily modeled by including additional circuitry to multiply the variable circuits' output currents (Simoni et al., 2004; Delbruck, 1991). The change in temporal dynamics of gating particles plays a critical role in some voltage-gated ion channels. As discussed in section 1, the inactivation time constant of T channels in thalamic relay cells changes dramatically, defining properties of the burst response, such as the interburst interval and the length of the burst itself.
Activation time constants are also influential: they can modify the delay with which the channel responds (Zhan, Cox, Rinzel, & Sherman, 1999), an important determinant of a neuron’s temporal precision. Incorporating these nonlinear temporal dynamics into silicon models will yield further insights into neural computation. Equally important in our design, not only were we able to capture the nonlinear dynamics of gating particles, we were able to do so using fewer transistors than previous silicon models. Rather than exploit the parallels between transistors and ion channels, as we did, previous silicon modelers attempted to “linearize” the transistor, to make it approximate a resistor. After designing a circuit to accomplish this, the resistor’s value had to be adjusted dynamically, so more circuitry was added to filter the membrane voltage. The time constant of this filter was kept constant, sacrificing the ion channel’s voltage-dependent nonlinear dynamics for simplicity. We avoided all these complications by recognizing that the transistor is a
344
K. Hynna and K. Boahen
thermodynamic isomorph of the ion channel. Thus, we were able to come up with a compact replica. The size of the circuit is an important consideration within silicon models, as smaller circuits translate to more neurons on a silicon die. To illustrate, a simple silicon neuron (Zaghloul & Boahen, 2004), with a single input synapse, possessing the activation channel from section 5, requires about 330 µm2 of area. This corresponds to around 30,000 neurons on a silicon die, 10 mm2 in area. Adding additional circuits, such as inactivation to the channel, increases the area of the cell design, reducing the size of the population on the chip (assuming, of course, that the total area of the die remains constant). To compensate for larger cell footprints, we can either increase the size of the whole silicon die (which costs money), or we can simply incorporate multiple chips into the system, easily doubling or tripling the network size. And unlike computer simulations, the increase in network sizes comes with minimal cost in performance or “simulation” time. Of course, like all other modeling media, silicon has its own drawbacks. For one, silicon is not forgiving with respect to design flaws. Once the chip has been fabricated, we are limited to manipulating our models using only external voltage biases within our design. This places a great deal of importance on verification of the final design before submitting it for fabrication; the total time from starting the design to receiving the fabricated chip can be on the order of 6 to 12 months. An additional characteristic of the silicon fabrication process is mismatch, a term referring to the variability among fabricated copies of the same design within a silicon chip (Pavasovic, Andreou, & Westgate, 1994). Within an array of silicon neurons, this translates into heterogeneity within the population. 
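The die-budget arithmetic above is easy to reproduce. The 330 µm² footprint and 10 mm² die are the figures quoted in the text; the larger footprint is a hypothetical example of adding circuitry such as inactivation.

```python
def neurons_per_die(footprint_um2, die_mm2=10.0):
    """Neurons that fit on a die, given a per-neuron footprint (um^2)
    and die area (mm^2). 1 mm^2 = 1e6 um^2."""
    return int(die_mm2 * 1e6 / footprint_um2)

print(neurons_per_die(330))   # ~30,000 neurons, as quoted in the text
print(neurons_per_die(660))   # doubling the footprint halves the population
```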
While we can take steps to reduce the variability within an array, generally at the expense of area, this mismatch can be considered a feature, since biology must also cope with variability. When our silicon models reproduce biological phenomena despite this parameter variability, that robustness lends credence to the models. And when we discover network phenomena within our chips, these are likely to be found in biology as well, as they will be robust to biological heterogeneity. With the ion channel design described in this article and our ability to expand our networks at little cost, we have a great deal of potential for building many of the neural systems within the brain, which consists of numerous layers of cells, each possessing its own distinct characteristics. Not only do we have the opportunity to study the role of an ion channel within an individual cell, we also have the potential to study its influence on the dynamics of a population of cells, and hence its role in neural computation.
Appendix A: Derivations

We can use the transition rates for our channel circuit (see equations 3.3 and 3.4) to calculate the voltage dependence of its steady state and time constant. Starting with the steady-state equation, equation 2.4,

u_∞(V) = α(V) / [α(V) + β(V)]

       = [(I_ds0/Q_u) e^{−u_H} / (e^{−κV_O} + e^{−κu_τH})] / [(I_ds0/Q_u) e^{−u_H} / (e^{−κV_O} + e^{−κu_τH}) + (I_ds0/Q_u) e^{−u_L} / (e^{−κV_C} + e^{−κu_τL})]

       = 1 / [1 + e^{u_H − u_L} (e^{−κV_O} + e^{−κu_τH}) / (e^{−κV_C} + e^{−κu_τL})].
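As a sanity check on the algebra, the closed form above can be compared numerically against α/(α + β) computed directly from the rate expressions. The linear forms assumed below for V_O and V_C, and all parameter values, are illustrative assumptions of this sketch, not the chip's fitted values.

```python
import math

KAPPA, UT = 0.7, 25.4                                   # transistor kappa, thermal voltage (mV)
U_H, U_L, U_TAU_H, U_TAU_L = 400.0, 50.0, 700.0, 200.0  # assumed bias voltages (mV)

def rates(V):
    V_O = 1.0 * (V - 400.0)        # assumed opening voltage, slope gamma_o = 1.0
    V_C = -0.5 * (V - 400.0)       # assumed closing voltage, slope gamma_c = 0.5
    alpha = math.exp(-U_H / UT) / (math.exp(-KAPPA * V_O / UT) + math.exp(-KAPPA * U_TAU_H / UT))
    beta = math.exp(-U_L / UT) / (math.exp(-KAPPA * V_C / UT) + math.exp(-KAPPA * U_TAU_L / UT))
    return alpha, beta, V_O, V_C

def u_inf_direct(V):
    a, b, _, _ = rates(V)
    return a / (a + b)

def u_inf_closed(V):
    """The closed form derived above."""
    _, _, V_O, V_C = rates(V)
    ratio = (math.exp(-KAPPA * V_O / UT) + math.exp(-KAPPA * U_TAU_H / UT)) / \
            (math.exp(-KAPPA * V_C / UT) + math.exp(-KAPPA * U_TAU_L / UT))
    return 1.0 / (1.0 + math.exp((U_H - U_L) / UT) * ratio)

# The two expressions agree to machine precision over the whole sweep.
for V in range(-500, 1500, 25):
    assert abs(u_inf_direct(V) - u_inf_closed(V)) < 1e-12
```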
Throughout the linear segment of the sigmoid, we set the voltage biases such that u_τH > V_O + 4U_T and u_τL > V_C + 4U_T. These restrictions essentially marginalize u_τH and u_τL. By the time either of the terms with these two biases becomes significant, the steady state will be sufficiently close to either unity or zero that their influence is negligible. Therefore, we drop the exponential terms with u_τH and u_τL; substituting equations 4.1 and 4.2 yields the desired result, equation 4.3. For the time constant, we substitute equations 3.3 and 3.4 into equation 4.1:

τ_u(V) = 1 / [α(V) + β(V)]

       = (Q_u/I_ds0) · 1 / [e^{−u_H} / (e^{−κV_O} + e^{−κu_τH}) + e^{−u_L} / (e^{−κV_C} + e^{−κu_τL})]

       = (Q_u/I_ds0) · 1 / [1 / (e^{−(κV_O − u_H)} + e^{−(κu_τH − u_H)}) + 1 / (e^{−(κV_C − u_L)} + e^{−(κu_τL − u_L)})].
To equalize the minimum time constant at hyperpolarized and depolarized levels, we establish the following relationship: κu_τH − u_H = κu_τL − u_L. After additional algebraic manipulation, the time constant becomes

τ_u(V) = τ_min [1 + e^{−κ(V_O − u_τH)} e^{−κ(V_C − u_τL)} / (e^{−κ(V_O − u_τH)} + e^{−κ(V_C − u_τL)} + 2) − 1 / (e^{−κ(V_O − u_τH)} + e^{−κ(V_C − u_τL)} + 2)],
where τ_min = (Q_u/I_ds0) e^{−(κu_τH − u_H)} is the minimum time constant. To reduce the expression further, we need to understand the relative magnitudes of the various terms. We can drop the constant in both denominators, as one of the exponentials there will always be significantly larger. Since V_O and V_C have opposite signs for their slopes with respect to V, the sum of the exponentials in the denominator peaks at a membrane voltage somewhere within the middle section of its operational voltage range. As it happens, this peak is close to the midpoint of the sigmoid (see below), where we have required u_τH > V_O + 4U_T and u_τL > V_C + 4U_T. Thus, at the peak, we know the constant is negligible. As the membrane voltage moves away from the peak, in either direction, one of the exponentials will continue to increase while the other decreases. Thus, with the restriction on the bias voltages u_τH and u_τL, the sum of the exponentials will always be much larger than the constant in the denominator. By the same logic, we can disregard the final term, since the sum of the exponentials will always be substantially larger than the numerator, making the fraction negligible over the whole membrane voltage range. With these assumptions, and substituting the dependence of V_O and V_C on V (see equations 4.1 and 4.2), we obtain the desired result, equation 4.6.

Appendix B: Circuit Flexibility

This section is more theoretical in nature, using the steady-state and time-constant equations derived in appendix A to provide insight into how the various parameters influence the two voltage dependencies (steady state and time constant). It is likely of interest only to those who wish to use our approach for modeling ion channels. An important consideration in generating these models is defining the location (i.e., the membrane voltage) and magnitude of the peak time constant.
Unlike the minimum time constant, which is determined simply by the difference between the bias voltages u_τH and u_H (or u_τL and u_L), no bias directly controls the maximum time constant, since it is the point at which the sum of the transition rates is minimized (see Figure 5b). The voltage at which it occurs, however, is easily determined from equation 4.6:

V_τpk = V_u^mid + U_T log(γ_c/γ_o) / [κ (γ_o + γ_c)].    (B.1)
Thus, where the bell curve lies relative to the sigmoid, whose midpoint lies at Vmid u , is determined by the opening and closing voltages’ slopes (γo and γc ). Consequently, changing these slopes is the only way to displace the bell curve relative to the sigmoid. Shifting the bell curve by changing
Figure 9: Varying the closing voltage’s slope (γc ). (a) Changing γc adjusts both the slope and midpoint of the steady-state sigmoid. (b) γc also affects the location and height (relative to the minimum) of the time constant’s bell curve; the change in height (plotted logarithmically) is dramatic. (c, d) Same as in a and b, except that we adjusted the bias voltage uL to compensate for the change in γc , so the sigmoid’s midpoint and the bell curve’s height remain the same. The sigmoid’s slope does change, as it did before, and the bell curve’s location shifts as well, though much less than before. In these plots, φo = 400 mV, uH = 400 mV, uL = 50 mV, uτ H = 700 mV, and γo = 1.0. γc ’s values are 0.5 (thin, solid line), 0.75 (short, dashed line), 1.0 (long, dashed line), and 1.25 (thick, solid line).
a parameter other than γo or γc will automatically shift the sigmoid by the same amount. As we change the opening and closing voltages' slopes (γo and γc), two things happen. First, due to V_u^mid's dependence on these parameters (see equation 4.4), the sigmoid and the bell curve shift together. Second, due to the dependence we just described (see equation B.1), the bell curve shifts relative to the sigmoid. These effects are illustrated in Figures 9a and 9b for γc (γo behaves similarly). To eliminate the first effect while preserving the second, we can compensate for the sigmoid's shift by adjusting the bias voltage uL, which scales the closing rate (see equation 4.2). Consequently, uL also rescales the left part of the bell curve, where the closing rate is dominant, reducing its height relative to the minimum and canceling its shift due to the first effect. This is demonstrated in Figures 9c and 9d: the sigmoid remains fixed, while the bell curve shifts (slightly) due to the second effect. Although the bias voltage uL cancels the shift γc produces in the sigmoid, it does not compensate for the change in the sigmoid's slope. We can shift the sigmoid
Figure 10: Varying the closing voltage’s intercept (φc ). (a) Changing φc shifts the steady-state sigmoid’s midpoint, leaving its slope unaffected. (b) φc also affects the time constant bell curve’s location and height (relative to the minimum). In these plots, φo = 0 mV, uH = 400 mV, uL = 50 mV, uτ H = 700 mV, γo = 1.0, and γc = 0.5. φc ’s values are 400 mV (thin, solid line), 425 mV (short, dashed line), 450 mV (long, dashed line), and 475 mV (thick, solid line).
Figure 11: Setting the bell curve’s location and height independently. (a) Changing the opening and closing voltages’ intercepts (φo and φc ) together shifts the bell curve’s location without affecting its height. The steady-state sigmoid (not shown) moves with the bell curve (see equation B.1). (b) Changing φo and φc in opposite ways increases the height (relative to the minimum) without affecting the location. The steady-state sigmoid (not shown) stays put as well. In these plots, uH = 400 mV, uL = 50 mV, uτ H = 700 mV, γo = 1.0, and γc = 0.5. In both a and b, φc = 400 mV and φo = 0 mV for the thin, solid line. In a, both φc and φo increment by 25 mV from the thin, solid line to the short, dashed line, to the long, dashed line and to the thick, solid line. In b, φc increments and φo decrements by 25 mV from the thin, solid line to the short, dashed line, to the long, dashed line, and to the thick, solid line.
and leave its slope unaffected by changing the opening and closing voltages’ intercepts, as shown in Figure 10 for φc (φo behaves similarly). However, the bell curve shifts by the same amount, and its height changes as well, since φc rescales the closing rate. Thus, it is not possible to shift the bell curve relative to the sigmoid without changing the latter’s slope; this is evident from equations 4.5 and B.1.
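The peak-location relation can be checked numerically. The sketch below evaluates the full time-constant expression from appendix A on a fine voltage grid and compares the location of its maximum against the prediction of equation B.1. The linear forms assumed for V_O and V_C, and all parameter values, are illustrative assumptions of this sketch.

```python
import math

KAPPA, UT = 0.7, 25.4                     # transistor kappa, thermal voltage (mV)
U_H, U_L, U_TAU_H = 400.0, 50.0, 700.0    # assumed bias voltages (mV)
U_TAU_L = (KAPPA * U_TAU_H - U_H + U_L) / KAPPA  # enforce kappa*u_tauH - u_H = kappa*u_tauL - u_L
GAMMA_O, GAMMA_C = 1.0, 0.5

def tau_over_min(V):
    """tau_u / tau_min from the closed form derived in appendix A."""
    V_O = GAMMA_O * (V - 400.0)           # assumed opening-voltage dependence on V
    V_C = -GAMMA_C * (V - 400.0)          # assumed closing-voltage dependence on V
    x = math.exp(-KAPPA * (V_O - U_TAU_H) / UT)
    y = math.exp(-KAPPA * (V_C - U_TAU_L) / UT)
    return 1.0 + (x * y) / (x + y + 2.0) - 1.0 / (x + y + 2.0)

grid = [v * 0.25 for v in range(-2000, 6001)]   # -500 mV .. 1500 mV, 0.25 mV steps
peak_numeric = max(grid, key=tau_over_min)

# Equation B.1: the peak sits at the rate-crossing voltage plus a slope-dependent offset.
V_cross = 400.0 + (U_TAU_H - U_TAU_L) / (GAMMA_O + GAMMA_C)  # where x == y for these forms
peak_pred = V_cross + UT * math.log(GAMMA_C / GAMMA_O) / (KAPPA * (GAMMA_O + GAMMA_C))
```

For these assumed parameters the numeric maximum and the B.1 prediction agree to well within a millivolt, and the curve is a pronounced bell: it rises orders of magnitude above τ_min near the crossing and returns to τ_min at both voltage extremes.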
We can set the bell curve's location and height independently if we change the opening and closing voltages' intercepts by the same amount or by equal and opposite amounts, respectively. Equal changes in the intercepts (φo and φc) shift the opening and closing rate curves by the same amount, thus shifting the bell curve (and the sigmoid) without affecting its height (see Figure 11a), whereas equal and opposite changes shift the opening and closing rate curves apart, leaving the point where they cross at the same location while rescaling the value of the rate there. As a result, the bell curve's height is changed without affecting its location (see Figure 11b).

Choosing values for γo and γc, however, presents a trade-off. These two parameters define the dependence of V_O and V_C on V and are not external biases like the other parameters; rather, they are defined by the fabrication process through the transistor parameter κ. We can define their relationships with κ through the use of different circuits; in our simple activation channel (see section 5), γc = κ²/(κ + 1), as defined by the four transistors that invert the membrane voltage. There are a couple of drawbacks. First, not all values for γo or γc are achievable using only a few transistors; expressed another way, a trade-off must be made between achieving more precise values for γo or γc and using fewer transistors within the design. Second, after fabrication, γo and γc can no longer be modified, as they are defined as functions of the transistor parameter κ. These issues merit special consideration before submitting the chip for fabrication.

References

Delbruck, T. (1991). "Bump" circuits for computing similarity and dissimilarity of analog voltages. In IJCNN-91-Seattle: International Joint Conference on Neural Networks (Vol. 1, pp. 475–479). Piscataway, NJ: IEEE.

Destexhe, A., & Huguenard, J. R. (2000). Nonlinear thermodynamic models of voltage-dependent currents. J. Comput. Neurosci., 9(3), 259–270.

Destexhe, A., & Huguenard, J. R. (2001). Which formalism to use for voltage-dependent conductances? In R. C. Cannon & E. D. Schutter (Eds.), Computational neuroscience: Realistic modeling for experimentalists (pp. 129–157). Boca Raton, FL: CRC.

Hill, T. L., & Chen, Y. (1972). On the theory of ion transport across the nerve membrane. VI. Free energy and activation free energies of conformational change. Proc. Natl. Acad. Sci. U.S.A., 69(7), 1723–1726.

Hille, B. (1992). Ionic channels of excitable membranes (2nd ed.). Sunderland, MA: Sinauer Associates.

Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117(4), 500–544.

Hynna, K. (2005). T channel dynamics in a silicon LGN. Unpublished doctoral dissertation, University of Pennsylvania.
Hynna, K., & Boahen, K. (2001). Space-rate coding in an adaptive silicon neuron. Neural Networks, 14(6–7), 645–656.

Llinas, R. R. (1988). The intrinsic electrophysiological properties of mammalian neurons: Insights into central nervous system function. Science, 242(4886), 1654–1664.

Mahowald, M., & Douglas, R. (1991). A silicon neuron. Nature, 354(6354), 515–518.

McCormick, D. A., & Pape, H. C. (1990). Properties of a hyperpolarization-activated cation current and its role in rhythmic oscillation in thalamic relay neurones. J. Physiol., 431, 291–318.

Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.

Monsivais, P., Clark, B. A., Roth, A., & Hausser, M. (2005). Determinants of action potential propagation in cerebellar Purkinje cell axons. J. Neurosci., 25(2), 464–472.

Pavasovic, A., Andreou, A. G., & Westgate, C. R. (1994). Characterization of subthreshold MOS mismatch in transistors for VLSI systems. J. VLSI Signal Process. Syst., 8(1), 75–85.

Simoni, M. F., Cymbalyuk, G. S., Sorensen, M. E., Calabrese, R. L., & DeWeerth, S. P. (2004). A multiconductance silicon neuron with biologically matched dynamics. IEEE Transactions on Biomedical Engineering, 51(2), 342–354.

Stevens, C. F. (1978). Interactions between intrinsic membrane protein and electric field: An approach to studying nerve excitability. Biophys. J., 22(2), 295–306.

Svirskis, G., Kotak, V., Sanes, D. H., & Rinzel, J. (2002). Enhancement of signal-to-noise ratio and phase locking for small inputs by a low-threshold outward current in auditory neurons. J. Neurosci., 22(24), 11019–11025.

Tao, L., Shelley, M., McLaughlin, D., & Shapley, R. (2004). An egalitarian network model for the emergence of simple and complex cells in visual cortex. Proc. Natl. Acad. Sci. U.S.A., 101(1), 366–371.

Willms, A. R., Baro, D. J., Harris-Warrick, R. M., & Guckenheimer, J. (1999). An improved parameter estimation method for Hodgkin-Huxley models. Journal of Computational Neuroscience, 6(2), 145–168.
Zaghloul, K. A., & Boahen, K. (2004). Optic nerve signals in a neuromorphic chip II: Testing and results. IEEE Transactions on Biomedical Engineering, 51(4), 667–675.

Zhan, X. J., Cox, C. L., Rinzel, J., & Sherman, S. M. (1999). Current clamp and modeling studies of low-threshold calcium spikes in cells of the cat's lateral geniculate nucleus. J. Neurophysiol., 81(5), 2360–2373.
Received October 6, 2005; accepted June 20, 2006.
LETTER
Communicated by Harry Erwin
Spatiotemporal Conversion of Auditory Information for Cochleotopic Mapping

Osamu Hoshino
[email protected] Department of Intelligent Systems Engineering, Ibaraki University, Hitachi, Ibaraki, 316-8511 Japan
Auditory communication signals such as monkey calls are complex FM vocal sounds and, in general, induce action potentials with different timing in the primary auditory cortex. The delay-line scheme is one effective way of detecting such neuronal timing. However, the scheme is not straightforwardly applicable if the time intervals of the signals exceed the latency time of the delay lines. In fact, monkey calls often extend over longer time intervals (hundreds of milliseconds to seconds), beyond the latency times observed in the brain (less than several hundred milliseconds). Here, we propose a cochleotopic map, analogous to the retinotopic map in vision. We show that information about monkey calls could be mapped onto a cochleotopic cortical network as spatiotemporal firing patterns of neurons, which can then be decomposed into simple (linearly sweeping) FM components and integrated into unified percepts by higher cortical networks. We suggest that this spatiotemporal conversion of auditory information may be essential for developing the cochleotopic map, which could serve as the foundation for later processing, such as monkey-call identification by higher cortical areas.

1 Introduction

Frequency modulation (FM) is a critical parameter for constructing auditory communication signals in both humans and monkeys. For example, human speech contains FM sounds called formant transitions that are critical for encoding consonant and vowel combinations such as "ga," "da," and "ba" (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967; Suga, 1995). Monkeys use complex FM sounds, the so-called monkey calls, which are considered to be a precursor of human speech (Poremba et al., 2004; Holden, 2004), to interact socially with other members of the species (Symmes, Newman, Talmage-Riggs, & Lieblich, 1979; Janik, 2000; Tyack, 2000).
Neural Computation 19, 351–370 (2007)    © 2007 Massachusetts Institute of Technology

Auditory FM signals differ in informational structure from other sensory signals in that they are processed in a time-dependent manner (i.e., characterized by time-varying spectral frequencies), while color and orientation
in vision, odorants in olfaction, and tastants in gustation are characterized in most cases in a time-independent manner. Understanding how the auditory cortex encodes and detects the streams of spectral information arising from the temporal structure of FM sounds is one of the most challenging problems in auditory cognitive neuroscience. Auditory signals being sent from receptor neurons enter the primary auditory area (AI). Neurons of the AI are orderly arranged and form the so-called tonotopic map, where tonotopic representation of the cochlea is well preserved (Yost, 1994). Each neuron on the map tends to respond best to its characteristic frequency. Because of the tonotopic organization, these neurons could be activated in a sequential manner and generate action potentials in different timing when stimulated with an FM sound. It is interesting to know how higher cortical areas, to which the AI project, detect the timing of action potentials so that the brain can identify the applied FM sound. One of the effective ways for detecting the timing of action potentials is to use delay lines. Jeffress (1948) proposed a theory for sound localization. The theory is that an interaural time difference, which expresses the azimuthal location of a sound source, could be detected by neurons that receive signals from both ears through distinct delay lines. Suga (1995) proposed a multiple delay line scheme for echo sound detection in bats. Bi and Poo (1999) demonstrated in a cultured hippocampal neuronal network that the timing of action potentials could be detected when relevant delay lines are properly chosen. The timing of action potentials in these processes was relatively short, ranging from submilliseconds to tens of milliseconds, for which the delay line scheme worked well. However, the information about FM sounds contained in monkey calls is in general expressed in longer time intervals of hundreds of milliseconds to seconds. 
The delay-line scheme is unlikely to be applicable to them, because the delay lines observed in the brain span at most several hundred milliseconds (Miller, 1987). This implies that the brain might employ another strategy for identifying individual calls. A speculative cochleotopic map has been proposed in analogy to the retinotopic map in vision (Rauschecker, 1998). The retinotopic map expresses information about the location of a moving bar in the two-dimensional visual space that is projected onto the retina. As the bar moves in the visual space, the map shows a spatiotemporal firing pattern of neurons. The axes of the map indicate the location of the bar in the two-dimensional visual space. In the cochleotopic mapping scheme, the location of neuronal activation moves along the frequency axis as the frequency of an FM sound sweeps. However, the exact auditory variable for the other axis is still unknown. We propose here a hypothetical two-dimensional (frequency axis, propagation axis) cochleotopic neural network (NCO network) for the AI, on which information about FM sounds is mapped in a spatiotemporal manner. The propagation axis is assumed for the unknown axis. Stimulation
with a pure (single-frequency) tone activates a peripheral neuron, whose activity propagates along the propagation axis, or along its isofrequency band. When the NC O network is stimulated with an FM sound, peripheral neurons are sequentially activated, which then propagates along their isofrequency bands. As a consequence, the information about the applied FM sound is expressed as a specific spatiotemporal firing pattern in the NC O network dynamics. Such activity propagation has been reported in the AI. Hess and Scheich (1996) stimulated Mongolian gerbils with pure tones (1 kHz to 16 kHz) and recorded the activity of AI neurons. The researchers found that neuronal activation propagated along isofrequency bands at all frequencies. Taniguchi, Horikawa, Moriyama, and Nasu (1992) stimulated guinea pigs with pure tones (1 kHz to 30 kHz) and recorded the activity of the anterior field (field A) in which tonotopic organization was well preserved. The researchers found that the focal activation beginning in field A propagated in two directions: along isofrequency bands and toward field DC. There has been neurophysiological evidence that FM sounds are precisely detected by auditory neurons. Neurons of the lateral belt (LB) (Rauschecker, 1998) that receives signals from the AI responded to FM sounds. These neurons were relatively organized in an orderly fashion depending on the sweeping rate (between slow and fast) and direction (upward or downward) of the FMs. Based on these experimental findings, we construct a neural network model (NFM ) for the LB to which the NC O network projects. A given FM sound evokes a specific spatiotemporal firing pattern in the NC O network, to which a certain group of NFM neurons (NFM column) responds and identifies the applied FM. It is also well known that LB neurons respond to monkey calls (Rauschecker, 1997). 
Monkey calls as vocal signatures are complex FM sounds and play an important role in identifying individuals, especially when their visual systems are unavailable, as in a luxuriant forest (Symmes et al., 1979). To detect such complex FM sounds, we construct a higher neural network (NI D network) model for the STGr (rostral portion of the superior temporal gyrus) to which the LB projects. The NI D network receives selective projections from the NFM network. When a monkey call is presented to the NC O network, multiple NFM columns are sequentially activated in a specific order. The NI D network integrates the sequence of the dynamic NFM columns, thereby identifying that call. Based on the proposed cochleotopic mapping scheme, we investigate how FM sound information is encoded and detected. Applying to the NC O network simple (linearly sweeping) and complex (monkey call) FM sounds, we record the activities of neurons. By statistically analyzing them, we try to understand the neuronal mechanisms that underlie FM sound information processing in the auditory cortex.
2 Neural Network Model

2.1 Outline of the Model.  The NCO network, modeling the AI, is organized in a tonotopic fashion, as shown in Figure 1A. When the frequency of an applied FM sound sweeps upward, the neuronal activation of the periphery (p1) sweeps from f1 to f40 and then moves along the propagation axis (isofrequency bands). The filled circles schematically indicate a neuronal firing pattern induced by a simple (linearly sweeping) upward FM sound at a certain time after the stimulus onset. The gray circles indicate a neuronal firing pattern induced by a downward FM sound. The spatiotemporal firing pattern of the NCO network expresses combinatorial information about the direction and the sweep rate of the applied FM sound. Neurons of the NFM network, modeling the LB, receive convergent projections from the NCO network (solid and dashed lines) and detect the upward (black circles) and downward (gray circles) FM sounds. We made sets of selective convergent projections from the NCO to the NFM network in order to detect specific linearly sweeping FM sounds.

Neurons within isofrequency bands are connected by excitatory and inhibitory delay lines, as shown in Figure 1B. The excitatory connections (solid lines) from the periphery to the center are the major driving force for the neuronal activation to move along the propagation axis, and the inhibitory connections (dashed lines) were employed to sharpen the spatiotemporal firing patterns. This specific circuitry was used to functionally express the propagation axis, for which evidence in the AI is accumulating (Hess & Scheich, 1996; Taniguchi et al., 1992). For simplicity, there is no connection between isofrequency bands. Neurons within NFM columns are connected with each other via excitatory synapses, and neurons between NFM columns are connected via inhibitory synapses. This circuitry enables each NFM column to respond selectively to a specific linearly sweeping FM sound.
Figures 1C and 1D are schematic drawings of neuronal responses to a linearly sweeping FM sound. When the NC O network is stimulated with an upward FM sound sweeping at a slow (Figure 1C, left), intermediate (Figure 1C, center), or fast rate (Figure 1C, right), the activation area moves from the lower left to upper right (arrows). When a downward FM sound is presented, the activation area moves from the lower right to upper left (see Figure 1D). When the activation area reaches a certain position (gray ellipses), the neurons send action potentials to the NFM network via the selective feedforward projections, and activate the NFM column corresponding to the applied FM (black ellipses). The neuronal activation of the other NC O regions (dashed ellipses) can also be used for the FM detection. Nevertheless, we chose the firing patterns (gray ellipses), because these patterns appear first in the time courses with maximal neurons simultaneously activated. This enables the NFM network to respond reliably and rapidly to the applied FM sounds.
Figure 1: Neural network model. (A) The NC O network is organized in a tonotopic manner. The NFM network receives selective projections from the NC O network. Among NFM columns, c1–c10 and c31–c40 detect FM sounds that sweep linearly in downward and upward directions, respectively. For clarity, only two sets of firing patterns (black and gray circles) and projections (solid and dashed lines) for an upward and a downward FM sound are depicted. (B) Neuronal connections within isofrequency bands of the NC O network. Neurons are connected via excitatory (solid lines) and inhibitory (dashed lines) delay lines, where t denotes a signal transmission delay time. (C) Schematic drawings of the spatiotemporal neuronal responses of the NC O network and those of the NFM network to simple (linearly sweeping) upward FM sounds. Activity patterns for FM sounds that sweep at a slow (left), intermediate (center), and fast (right) rate are shown. Arrows indicate the directions of movements of active areas. (D) Schematic drawings as in C for downward FM sounds.
2.2 Model Description.  The dynamic evolutions of the membrane potentials of neurons of the NCO and NFM networks are defined, respectively, by

τ_CO du^CO_{ki}(t)/dt = −u^CO_{ki}(t) + Σ_{j=1}^{M_D} [w^ex_{ki,k(i−j)} S^CO_{k(i−j)}(t − jΔt) + w^ih_{ki,k(i+j)} S^CO_{k(i+j)}(t − jΔt)],    (2.1)

τ_FM du^FM_i(t)/dt = −u^FM_i(t) + Σ_{k=f1}^{f40} Σ_{j=p1}^{p40} L_{i,kj} S^CO_{kj}(t − Δt_FM) + Σ_{j=1, j≠i}^{M_FM} w^FM_{ij} S^FM_j(t),    (2.2)

where

Prob[S^Y_i(t) = 1] = f_Y(u^Y_i(t))    (Y = CO, FM),
f_Y[u] = 1 / (1 + e^{−η_Y (u − θ_Y)}).    (2.3)
u^CO_{ki}(t) and u^FM_i(t) are the membrane potential of the ith NCO neuron of the kth (k = f1–f40) isofrequency band and that of the ith NFM neuron at time t, respectively. τ_Y (Y = CO, FM) is the decay time of the membrane potential in network N_Y. M_D is the number of excitatory or inhibitory input delay lines that a single NCO neuron receives from other neurons within its isofrequency band. w^ex_{ki,k(i−j)} and w^ih_{ki,k(i+j)} are, respectively, the excitatory and inhibitory synaptic connection strengths from neuron (i − j) to neuron i and from neuron (i + j) to neuron i of the kth isofrequency band. S^CO_{kj}(t − jΔt) = 1 expresses an action potential of the jth NCO neuron of the kth isofrequency band, where jΔt denotes a signal transmission delay time (see Figure 1B). (f1–f40, p1–p40) denote the locations of neurons on the two-dimensional NCO map (see Figure 1A). L_{i,kj} is the strength of the synaptic connection from the jth NCO neuron of the kth isofrequency band to the ith NFM neuron. Δt_FM is the signal transmission delay time from the NCO to the NFM network. M_FM is the number of NFM neurons. w^FM_{ij} is the synaptic connection strength from the jth to the ith NFM neuron, and S^FM_j(t) = 1 expresses an action potential of the jth NFM neuron. {w^FM_{ij}} was set so that neurons within NFM columns (c1–c40; see Figure 1A) mutually excite one another and neurons between NFM columns laterally inhibit one another. η_Y and θ_Y are the steepness and the threshold of the sigmoid function f_Y, respectively. Equation 2.3 defines the probability of firing of neuron i; that is, the
Spatiotemporal Conversion of Auditory Information
probability of S^Y_i(t) = 1 is given by the function f_Y. After firing, the membrane potential of the neuron is reset to 0. FM sound stimuli are applied to the peripheral neurons of the NCO network. The dynamic evolution of the membrane potentials of these neurons is defined by

\tau_{CO}\,\frac{du^{CO}_{k1}(t)}{dt} = -u^{CO}_{k1}(t) + \sum_{j=1}^{M_D} w^{ih}_{k1,k(1+j)}\,S^{CO}_{k(1+j)}(t-j\Delta t) + \alpha I_{k1}(t),    (2.4)
where I_{k1}(t) is the input stimulus to the peripheral neuron of the kth isofrequency band, that is, the neuron located at (k, p1; k = f1–f40; see Figure 1A), and α is the intensity of the input. Note that the peripheral neurons receive only delayed inhibitory inputs from the (1 + j)th neurons, with delays of jΔt, and no delayed excitatory input. Network parameter values are as follows. The numbers of neurons are 40 (f1–f40) × 40 (p1–p40) and 40 (c1–c40) × 12 for the NCO and NFM networks, respectively; τ_CO = 10 ms, τ_FM = 10 ms, θ_Y = 0.7, η_Y = 10.0, and M_D = 3; w^ex_{ki,k(i−j)} = 5.0 and w^ih_{ki,k(i+j)} = −0.5; Δt = 10 ms and Δt_FM = 20 ms. L_{i,kj} was selectively set to either 0.1 or 0, as addressed in section 2.1, so that the specific firing patterns induced in the NCO network dynamics can activate their corresponding NFM columns (see Figures 1C and 1D). w^FM_{ij} = 0.3 and −5.0 within and between NFM columns, respectively. α = 8.0, and I_{k1}(t) = 1 for an input and 0 for no input.

3 Results

3.1 Tuning Property to Simple FM Sounds. We show here how the information about simple (linearly sweeping) FM sounds could be expressed as spatiotemporal firing patterns in the NCO network dynamics. We also show how the auditory information could be transferred to and detected by specific neuronal columns of the NFM network. Response properties (action potential generation) of NFM neurons are compared with those observed experimentally. An upward FM sound sweeping linearly at 20 Hz per ms induces a specific spatiotemporal firing pattern in the NCO network, in which the neuronal activation moves from the lower left toward the upper right (see Figure 2A). When the activity reaches a certain point (time = 200 ms), the active neurons send action potentials to the NFM network and stimulate the corresponding NFM column (arrow) at time ≈ 220 ms.
The difference in activation time between the NCO (200 ms) and NFM (220 ms) networks arises from the difference in signal transmission delay between the two networks, Δt_FM = 20 ms (see equation 2.2).
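The dynamics of equations 2.1 to 2.4 can be sketched in a short simulation of a single isofrequency band. This is a minimal illustration under our own assumptions, not the author's code: the 1 ms Euler step, the 50 ms stimulus duration, and simulating the band in isolation (no NFM feedback) are choices made here for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameter values from section 2.2; step size and stimulus duration assumed.
tau_CO, dt = 10.0, 1.0   # membrane time constant and Euler step (ms)
M_D = 3                  # number of excitatory/inhibitory delay lines
w_ex, w_ih = 5.0, -0.5   # delay-line weights
eta, theta = 10.0, 0.7   # sigmoid steepness and threshold
alpha = 8.0              # input intensity
delta_t = 10             # per-line transmission delay (ms)
N, T = 40, 300           # neurons per band (p1-p40), simulated time (ms)

def f(u):
    """Firing probability, equation 2.3."""
    return 1.0 / (1.0 + np.exp(-eta * (u - theta)))

u = np.zeros(N)
S = np.zeros((T, N))     # spike history, kept for the j*delta_t delay terms

for t in range(T):
    drive = np.zeros(N)
    for j in range(1, M_D + 1):
        t_del = t - j * delta_t
        if t_del < 0:
            continue
        drive[j:] += w_ex * S[t_del, :N - j]  # excitation from neuron i-j (eq. 2.1)
        drive[:N - j] += w_ih * S[t_del, j:]  # inhibition from neuron i+j (eqs. 2.1 and 2.4)
    if t < 50:
        drive[0] += alpha * 1.0               # alpha * I_k1(t): stimulus at the peripheral neuron
    u += (dt / tau_CO) * (-u + drive)
    spikes = rng.random(N) < f(u)
    u[spikes] = 0.0                           # reset after firing
    S[t, spikes] = 1.0

print("spikes fired in the band:", int(S.sum()))
```

With the strong excitatory delay-line weight (5.0), a spike at position i reliably recruits positions i+1 to i+3 one delay step later, which is what produces the propagating activity described above.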
O. Hoshino

[Figure 2 appears here: panels A–C, showing time courses of NCO and NFM activation (100–520 ms) and spike counts/raster plots for upward and downward FM sweeps at 13.3, 20, and 26.7 Hz/ms; see the Figure 2 caption below.]
We assumed f1 = 8 kHz to f40 = 11.9 kHz (see Figure 1A), where the isofrequency bands were placed at an even interval of 100 Hz per band. These frequencies are within the range observed in squirrel monkeys (Symmes et al., 1979) and were employed for investigating how complex FM sounds such as monkey calls could be identified, as will be shown in sections 3.2 and 3.3. Figure 2B presents the total spike counts (top) and raster plots (middle) for the neurons of a given NFM column when stimulated with different upward FM sounds (bottom). The columnar neurons show specific sensitivity to the upward FM sound (20 Hz per ms) but respond less to downward FM sounds (see Figure 2C). This tendency is largely consistent with that observed in macaque monkeys (Rauschecker, 1998). Although the tuning of the NFM columnar neurons to the applied FM sound (sweep rate = 20 Hz per ms; upward) is evident, these neurons also show weak responses to the other upward FM sounds with different sweep rates (see the arrows in Figure 2B). Such weak responsiveness to the "irrelevant" FM sounds is due to the overlapping of NCO firing patterns, as shown schematically in Figure 3. The set of NCO neurons within the solid ellipse, which are simultaneously activated by the FM stimulus (sweep rate = 20 Hz per ms; upward), sends action potentials to the relevant NFM column and maximally activates it (top left). However, the subsets of NCO neurons within the overlapping regions (gray and black) can also be activated, respectively, by the irrelevant upward FM sounds (26.7 and 13.3 Hz per ms). These neurons send a relatively small number of action potentials to the same NFM column, which results in weaker neural responses in the column (bottom left and top right).

3.2 Tuning Property to Complex FM Sounds. We show here how complex FM sounds such as monkey calls could be expressed as specific spatiotemporal firing patterns in the NCO network dynamics. We also show
Figure 2: Response properties of the NCO and NFM networks. (A) Time courses of neuronal activation when stimulated with an upward FM sound sweeping at 20 Hz per ms. The level of neuronal activity is expressed for each neuron as the number of action potentials observed within a time interval (bin = 10 ms). Arrows indicate the responses of NFM neurons to the applied FM sound. (B) Spike counts (top) and raster plots (middle) for the NFM column neurons when stimulated with different upward FM sounds (bottom). The spike count is the number of action potentials of the NFM column neurons observed within a time interval (bin = 10 ms). (C) Spike counts (top) and raster plots (middle) for the NFM column when stimulated with different downward FM sounds (bottom). The NFM column shows specific sensitivity to the upward FM sound sweeping at 20 Hz per ms.
[Figure 3 appears here: schematic of NCO firing patterns (low to high frequency along the propagation axis) evoked by sweeps at 13.3, 20, and 26.7 Hz/ms, with the corresponding NFM spike-count panels; see the Figure 3 caption below.]
Figure 3: Overlapping property of NCO firing patterns. Thin dashed, solid, and thick dashed areas schematically depict the firing patterns generated by FM sounds sweeping at 26.7, 20, and 13.3 Hz per ms, respectively. A certain NFM column is activated maximally by its preferred FM sound (20 Hz per ms) (top left). The overlapping regions (gray and black) are activated not only by the preferred upward FM (20 Hz per ms) but also by the sweeps at 26.7 Hz per ms (gray region) and 13.3 Hz per ms (black region). This results in weaker neuronal responses in the same NFM column (bottom left and top right).
how the information about individual monkey calls can be decomposed into simple (linearly sweeping) FM components. We used artificial isolation peeps (IPs) as monkey calls (Symmes et al., 1979); their pitch profiles are shown in Figure 4A. A specific spatiotemporal firing pattern is induced in the NCO network when it is stimulated with the IP of monkey X, which then activates multiple NFM columns in a sequential manner (see Figure 4B). We observed distinct sequential orders of columnar activation for monkey calls X, Y, and Z. Figure 4C presents the details of the sequences of columnar activation for monkey X (thick solid line), Y (thin solid line), and Z (dashed line), indicating that the IPs are decomposed into specific sequences of simple (linearly sweeping) FM components. In the model, the neurons of a currently active NFM column continue firing, even without any excitatory input, until another dynamic NFM column emerges, that is, until its neurons begin to fire. For example, NFM column c31 is activated at time = 0.22 s (upward arrow) and continues firing (downward arrow) until NFM column c1 begins to fire (at 0.34 s; rightward arrow).
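The decomposition illustrated by Figure 4C amounts to reading off the slope of each piece of an IP's pitch profile. A toy sketch of that reading-off step follows; the piecewise-linear profile below is made up for illustration and is not one of the paper's IPs.

```python
# Decompose a piecewise-linear pitch profile into linearly sweeping FM
# segments (sweep rate and direction), in the spirit of Figure 4C.
profile = [(0.00, 8.0), (0.10, 10.0), (0.22, 9.0), (0.34, 11.0)]  # (time in s, frequency in kHz)

segments = []
for (t0, f0), (t1, f1) in zip(profile, profile[1:]):
    rate = (f1 - f0) / (t1 - t0)  # kHz per s, numerically equal to Hz per ms
    segments.append((t0, t1, round(rate, 2), "up" if rate > 0 else "down"))

for t0, t1, rate, direction in segments:
    print(f"{t0:.2f}-{t1:.2f} s: {rate} Hz/ms, {direction} sweep")
```

Each resulting (rate, direction) pair is what a single NFM column is tuned to, so the profile maps to a column sequence.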
[Figure 4 appears here: panels A–C, showing the IP pitch profiles of monkeys X, Y, and Z, the network activation time courses (129–501 ms), and the NFM-column activation sequences; see the caption below.]

Figure 4: Spatiotemporal conversion of information about isolation peeps (IPs). (A) Profiles of artificial IPs for monkey calls (X, Y, Z) used in the simulations. (B) Time courses of neuronal activation of the NCO and NFM networks induced by the IP of monkey X. (C) Sequences of dynamic NFM columns induced by the IPs of monkey X (thick solid line), Y (thin solid line), and Z (dashed line). Each NFM column, c31–c40 and c1–c10, responds to a specific linearly sweeping FM component involved in the IPs. NFM columns c11–c30 were not assigned to detect FM sounds in this simulation.
This self-generative continuous firing could be mediated by the mutual excitation within NFM columns. In the next section, we try to identify these monkey calls by an integrative process to which the persistent neuronal firing effectively contributes.

3.3 Monkey Call Identification. We showed in the previous section that the information about complex FM sounds, the IPs of monkeys, could be decomposed into simple (linearly sweeping) FM components. This is reminiscent of how early visual systems decompose complex visual objects into simple features such as edges, orientations, and colors. Higher visual areas integrate these features so that the visual images of the objects can be reconstructed as unified percepts. To identify the sequences of dynamic NFM columns, we extended the model by adding a higher (integrative) neural network NID (see Figure 5A). We made selective feedforward projections from the NFM to the NID network (solid lines) in order to integrate the series of FM components. The specific NID column (filled circles; NID) assigned to detect a certain IP receives convergent inputs from the NFM columns (filled circles; NFM) that are sequentially activated by the FM components constituting the IP (filled circles; NCO). The NID network may correspond to a lateral belt (LB) area that is known to respond to monkey calls, or to the rostral portion of the superior temporal gyrus (STGr) to which the LB projects (Rauschecker, 1998), as we assumed here. The dynamic evolution of the membrane potentials of NID neurons is defined by

\tau_{ID}\,\frac{du^{ID}_{i}(t)}{dt} = -u^{ID}_{i}(t) + \sum_{j=1}^{M_{FM}} L^{ID}_{ij}\,S^{FM}_{j}(t-\Delta t_{ID}) + \sum_{j=1\,(j\neq i)}^{M_{ID}} w^{ID}_{ij}\,S^{ID}_{j}(t),    (3.1)
where action potentials are generated according to equation 2.3 and the other parameter values are the same as those for the NFM network (see equation 2.2). As shown in Figure 5B, when the NCO network is stimulated with the IP of monkey X, multiple NFM columns are sequentially activated, which then activate the call-relevant NID column at 367 ms. In this integrative process, the NID column receives consecutive action potentials from the NFM columns, as addressed in the previous section (see Figure 4C). Such continuous activation of the NID column gradually depolarizes its neurons and allows them to fire when their membrane potentials reach threshold, thereby identifying the applied IP. Figure 5C presents the identification process for monkey Z.
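The gradual depolarization described here can be illustrated with a single-neuron reduction of equation 3.1. Collapsing the feedforward term to a constant drive from one active NFM column is our simplification, and the weight value L_ID = 0.1 is assumed by analogy with L_{i,kj}; it is not stated in the text.

```python
# Leaky integration in one N_ID neuron (equation 3.1, feedforward term only).
tau_ID, dt = 10.0, 1.0    # decay time and Euler step (ms)
theta = 0.7               # firing threshold of the sigmoid
L_ID = 0.1                # assumed feedforward weight (by analogy with L_{i,kj})
n_afferents = 12          # one active N_FM column of 12 neurons, all firing

u, trace = 0.0, []
for t in range(100):
    drive = L_ID * n_afferents          # constant feedforward drive = 1.2
    u += (dt / tau_ID) * (-u + drive)   # leaky integration toward the drive
    trace.append(u)

# The potential relaxes toward 1.2 and crosses theta = 0.7 on the way up.
crossing = next(t for t, v in enumerate(trace) if v > theta)
print(f"membrane potential crosses threshold after {crossing + 1} ms")
```

Because the drive (1.2) exceeds the threshold (0.7), sustained columnar input is guaranteed to bring the neuron to firing range, which is the integrative mechanism the text describes.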
[Figure 5 appears here: panels A–C, showing the extended network architecture and the activation time courses (83–509 ms) for monkeys X and Z; see the Figure 5 caption below.]
Figure 5: Monkey call identification. (A) A higher (integrative) neural network NID is added to the original model (see Figure 1A); it integrates the specific sequences of dynamic NFM columns induced by individual IPs. (B, C) Time courses of neuronal activation in the NCO, NFM, and NID networks when stimulated with the IPs of monkeys X (B) and Z (C).
Figure 6 presents how similar monkey calls can be distinguished from each other. The IP of monkey V (see Figure 6A, left) has a spectrogram in its first part (time = 0–0.1 s; solid line) similar to that of monkey A (dashed line). When the NCO network is stimulated with this IP (see Figure 6B), multiple NFM columns are sequentially activated, which then activate the two
[Figure 6 appears here: panels A–C, showing the IP pitch profiles of monkeys V and W and the activation time courses (83–509 ms); see the Figure 6 caption below.]
Figure 6: Identification of monkey calls that have similar auditory spectrograms. (A) Profiles of artificial IPs for monkeys V and W (solid lines). The dashed lines denote the IP of monkey A. (B, C) Time courses of neuronal activation in the NCO, NFM, and NID networks when stimulated with the IPs of monkeys V (B) and W (C).
NID columns corresponding to the IPs of monkeys V and A (at 330 ms). The two dynamic NID columns compete for a while (330–446 ms), and the NID column corresponding to monkey V finally prevails (at 500 ms). The neuronal competition between the two dynamic NID columns arises from the lateral inhibition between NID columns. In contrast, when the IP of monkey W, which has a spectrogram similar to that of monkey A in its last part (0.3–0.4 s; see Figure 6A, right), is presented, the NID column corresponding to the IP of monkey W is selectively activated (at 338 ms) without competition, as shown in Figure 6C. Note that the time required
Spatiotemporal Conversion of Auditory Information
365
to identify monkey W (338 ms; see Figure 6C) is shorter than that for monkey V (500 ms; see Figure 6B), presumably because there is less neuronal competition between dynamic NID columns. These results indicate that when the early part of a call is ambiguous, the spectrogram of the last part can be used for further analysis, although this is more time consuming.

4 Discussion

We have proposed a hypothesis that (1) information about monkey calls could be mapped on a cochleotopic cortical network as spatiotemporal firing patterns, (2) which can then be decomposed into simple (linearly sweeping) FM components and (3) integrated into unified percepts by higher cortical networks. For the cochleotopic two-dimensional map (hypothesis 1), we assumed activity propagation along isofrequency axes (bands) in order to make a distinct spatiotemporal firing pattern for a given monkey call. Imaging studies (Taniguchi et al., 1992; Hess & Scheich, 1996; Song et al., 2005) provided evidence of such activity propagation in the primary auditory cortex (AI). When alternating pure tones of 1 and 8 kHz were presented (Hess & Scheich, 1996), the activity propagation was confined to the low and high isofrequency bands. To our knowledge, the exact formation of spatiotemporal firing patterns under FM sound stimulation has not yet been characterized, so we simply extended this scheme: the neuronal activity propagates along the multiple isofrequency bands corresponding to the tone frequencies constituting the applied FM sound. Actual spatiotemporal firing patterns induced by monkey calls might be rather complex because of the interaction between isofrequency bands or the influences of other brain regions. Accordingly, the information about individual monkey calls could be encoded more precisely in the auditory cortex.
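Returning to the model itself, the lateral-inhibition competition between NID columns described in section 3.3 (Figure 6B) can be caricatured with a two-unit rate sketch. The within-column and between-column weights (+0.3 and −5.0) follow section 2.2; collapsing each column to a single rate unit with constant feedforward drive is our simplifying assumption, not part of the paper's model.

```python
import numpy as np

# Two N_ID columns as rate units competing through lateral inhibition.
tau, dt = 10.0, 1.0
eta, theta = 10.0, 0.7   # sigmoid parameters from section 2.2
w_self, w_lat = 0.3, -5.0  # within-column excitation, between-column inhibition

def f(u):
    """Sigmoid rate function, as in equation 2.3."""
    return 1.0 / (1.0 + np.exp(-eta * (u - theta)))

u = np.zeros(2)
inputs = np.array([1.0, 0.9])  # column 0 receives slightly stronger feedforward drive
for _ in range(500):
    r = f(u)
    # each column excites itself and inhibits the other (r[::-1] swaps the pair)
    drive = inputs + w_self * r + w_lat * r[::-1]
    u += (dt / tau) * (-u + drive)

print("final rates:", f(u).round(3))
```

Because the lateral weight is large and negative, the column with the stronger drive keeps the higher rate while suppressing its competitor, mirroring the competition and eventual prevalence seen for monkey V.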
Nevertheless, the proposed simple cochleotopic map was sufficient to generate distinct spatiotemporal firing patterns and worked well as the foundation for later sound processing, that is, monkey call identification. The delay lines proposed for the propagation axis (see Figure 1A) were our speculation, prompted by a recent experimental study (Song et al., 2005). That study demonstrated that an electrical pulse, applied focally within an isofrequency band, triggered activity propagation along the band similar to tone-evoked activation. When the auditory thalamus was chemically lesioned, the electrically evoked activity in the AI was not affected, but the tone-evoked activity was abolished. Based on these results, it was suggested that intracortical connectivity in the AI enables neuronal activity to propagate along isofrequency bands. The underlying neuronal mechanisms of activity propagation in the AI have not yet been fully understood, but we assumed intracortical connectivity via delay lines (see Figure 1B) to produce the activity propagation. Note that the intracortical delays are relatively short (tens of milliseconds at most)
and thus neurophysiologically plausible, as addressed in section 1. For expressing the information about simple (linearly sweeping) FM components (hypothesis 2), we assumed neurons that respond selectively to the sweep rates and directions of FMs. Neurophysiological studies (Rauschecker, 1997, 1998; Tian & Rauschecker, 1998) demonstrated that many neurons of the lateral belt areas, to which the primary auditory cortex (AI) projects, responded better to more complex stimuli, such as FMs and band-passed noises, than to pure tones. These neurons were highly selective to the rates and directions of FMs. Neurons of the anterolateral (AL) and caudolateral (CL) belt areas responded better to slower and faster FM sweep rates, respectively. Neurons of the posterior auditory field were highly selective for one direction. The detailed organization of these (rate- and direction-selective) neurons has not yet been clearly identified, but we represented them in a simple and functional manner (see NFM; Figure 1A). To integrate the sequence of linearly sweeping FMs expressing a specific monkey call into a unified percept (hypothesis 3), we assumed neurons that respond selectively to the call. Neurophysiological studies (Rauschecker, 1997, 1998) demonstrated that neurons of the lateral belt responded to a certain class of monkey calls. Very few neurons responded to a single call, and most neurons responded to a number of calls. These results imply that the lateral belt is not the final stage of monkey call processing. Higher cortical areas such as the rostral portion of the superior temporal gyrus (STGr) or the prefrontal cortex, or both, might be responsible for monkey call identification (Rauschecker, 1998). The exact cortical areas whose neurons respond selectively to individual monkey calls have not yet been clearly identified, but we assumed such "call-selective" neurons in the STGr (see NID; Figure 5A).
The selective projections from the NFM to the NID network (or from the lateral belt to the STGr) were our speculation, introduced so that the model could identify individual monkey calls by detecting the sequence of FM components. Coincidence detection based on a delay-line scheme, as addressed in section 1, is not applicable to longer signals such as monkey calls (more than 500 ms), because delay lines observed in the brain are at most several hundred milliseconds long. The delay lines proposed here for the propagation axis (NCO; Figure 1B) are shorter ones (tens of milliseconds) that, although speculative, are neurophysiologically plausible. The idea of this study rests on the temporal-to-spatiotemporal conversion of auditory information mediated by such shorter (plausible) delay lines, not on a coincidence detection scheme. In the propagation axis, we assumed delay lines between neighboring neurons ranging from 10 to 30 ms (see section 2.2). This architecture allowed activity in the cochleotopic network (NCO) to propagate along isofrequency bands, as observed in the auditory cortex (Taniguchi et al., 1992; Hess & Scheich, 1996). To our knowledge, such delay lines have not been reported
in the auditory cortex. However, Hess and Scheich (1996) pointed out that activity propagation along isofrequency bands might be closely related to the distribution of response latency in the AI. Tian and Rauschecker (1998) found a response-latency distribution of 23 ± 12 ms in the AI. The range of delays used for the NCO network as an AI area is within this observed range. We assumed activity propagation along isofrequency bands. However, activity propagation across isofrequency bands has also been reported (Taniguchi et al., 1992; Hess & Scheich, 1996), for which interaction between different isofrequency bands might be responsible. In those observations, the neuronal activation propagated in the two (isofrequency and tonotopic-gradient) directions, where the peak activity was shifted along isofrequency bands (Hess & Scheich, 1996). Although a spatiotemporal firing pattern propagating in both directions might contribute to encoding auditory information, presumably in a more precise manner, we modeled the propagation only along the isofrequency bands. Panchev and Wermter (2004) proposed a neural network model that can detect temporal sequences on timescales from hundreds of milliseconds to several seconds. The network consisted of integrate-and-fire neurons with active dendrites and dynamic synapses. The researchers applied the model to recognizing words such as bat, tab, cab, and cat. Each word was expressed as a sequence of phonemes (for example, c → a → t for cat). A spike train of a single input neuron encoded each phoneme of a word, with a 100 ms delay between the onset times of successive phonemes. After training based on spike-timing-dependent plasticity, a single output neuron was able to detect a particular sequence of phonemes, that is, identify a specific word. Their model could be an alternative, especially for the integration network (NID; Figure 5A), which detects the sequences of simple (linearly sweeping) FMs.
Sargolini and colleagues (2006) found evidence of a conjunctive representation of position, direction, and velocity in the entorhinal cortex of rats exploring two-dimensional environments. In the medial entorhinal cortex (MEC), the network of grid cells constituted a spatial coordinate system in which positional information was represented, while head direction cells were responsible for representing head-directional information. Grid cells were co-localized with head direction cells and with conjunctive (grid and head direction) cells, and the running speed of the rat modulated these cells. The researchers suggested that the conjunctive cells might update the representation of spatial location by integrating positional, directional, and velocity information in the grid cell network during navigation. Such a conjunctive representation may be an alternative for the spatiotemporal representation of monkey calls, in which the information about spectral components, sweep rates, and sweep directions, as well as their combinations, may be represented by distinct types of cells in the primary auditory cortex.
In humans, it has been suggested that a voice contains information about not only speech but also an "auditory face" that allows us to identify individuals (Belin, Fecteau, & Bedard, 2004). This is called auditory face perception and is thought to rely on a neurocognitive scheme similar to that proposed for visual face perception. Among the vocal components for human speech processing, formants (Fitch, 1997) and syllables (Belin & Zatorre, 2003) might be candidate components for identifying individuals. We suggest that monkey call identification may also be a kind of auditory face perception, making use of FM components. There is evidence that the inferior colliculus (IC) encodes spectrotemporal acoustic patterns of species-specific calls. For example, Suta, Kvasnak, Popelar, and Syka (2003) investigated the neuronal representation of specific calls in the IC of guinea pigs. Responses of individual IC neurons of anesthetized guinea pigs to four typical calls (purr, chutter, chirp, and whistle) were recorded. A majority of neurons (55% of 124 units) responded to all calls. A small portion of neurons (3%) responded to only one call or did not respond to any of the calls. A time-reversed version of the calls elicited on average a weaker response. The researchers concluded that IC neurons do not respond selectively to specific calls but encode the spectrotemporal acoustic patterns of the calls. Maki and Riquimaroux (2002) recorded responses of IC neurons of gerbils to two distinct FM sounds that had the same spectral components (5–12 kHz) but opposite sweep directions, an upward sweep and a downward sweep. The upward FM generated much stronger responses than the downward FM. The researchers suggested that this directional selectivity to FM sweeps implies that the IC may encode spectrotemporal acoustic patterns of species-specific calls.
These experiments imply that the encoding of spectrotemporal acoustic images of specific calls takes place, in part, in the IC, which presumably makes the spatiotemporal pattern of neuronal activation in the present cochleotopic map more complex. We hope the details will become clear in the near future.

5 Conclusion

In this study, we have proposed a cochleotopic map similar to the retinotopic map in vision. When the cochleotopic (NCO) network was stimulated with a monkey call, the peripheral neurons located on the frequency axis were sequentially activated. The active area moved along the propagation axis, by which the information about the call was mapped as a spatiotemporal firing pattern in the cochleotopic network dynamics. This spatiotemporal conversion enabled the NFM network to decompose the call information into simple (linearly sweeping) FM components, from which the higher network (NID) was able to integrate these components into a unified percept, that is, to identify the call.
We suggest that the information about monkey calls could be mapped on a cochleotopic cortical network as spatiotemporal firing patterns, which can then be decomposed into simple (linearly sweeping) FM components and integrated into unified percepts by higher cortical networks. The spatiotemporal conversion of auditory information may be essential for developing the cochleotopic map, which could serve as the foundation for later processing, that is, monkey call identification by higher cortical areas.

Acknowledgments

I am grateful to Yuishi Iwasaki for productive discussions and to Hiromi Ohta for her encouragement throughout the study. I am also grateful to the reviewers for their valuable comments and suggestions on the earlier draft.

References

Belin, P., Fecteau, S., & Bedard, C. (2004). Thinking the voice: Neural correlates of voice perception. Trends Cogn. Sci., 8, 129–135.
Belin, P., & Zatorre, R. J. (2003). Adaptation to speaker's voice in right anterior temporal lobe. Neuroreport, 14, 2105–2109.
Bi, G. Q., & Poo, M. M. (1999). Distributed synaptic modification in neural networks induced by patterned stimulation. Nature, 401, 792–796.
Fitch, W. T. (1997). Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques. J. Acoust. Soc. Am., 102, 1213–1222.
Hess, A., & Scheich, H. (1996). Optical and FDG mapping of frequency-specific activity in auditory cortex. Neuroreport, 7, 2643–2647.
Holden, C. (2004). The origin of speech. Science, 303, 1316–1319.
Janik, V. M. (2000). Whistle matching in wild bottlenose dolphins (Tursiops truncatus). Science, 289, 1355–1357.
Jeffress, L. A. (1948). A place theory of sound localization. J. Comp. Physiol. Psychol., 41, 35–39.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychol. Rev., 74, 431–461.
Maki, K., & Riquimaroux, H. (2002).
Time-frequency distribution of neuronal activity in the gerbil inferior colliculus responding to auditory stimuli. Neurosci. Lett., 331, 1–4.
Miller, R. (1987). Representation of brief temporal patterns, Hebbian synapses, and the left-hemisphere dominance for phoneme recognition. Psychobiol., 15, 241–247.
Panchev, C., & Wermter, S. (2004). Spike-timing-dependent synaptic plasticity: From single spikes to spike trains. Neurocomputing, 58–60, 365–371.
Poremba, A., Malloy, M., Saunders, R. C., Carson, R. E., Herscovitch, P., & Mishkin, M. (2004). Species-specific calls evoke asymmetric activity in the monkey's temporal poles. Nature, 427, 448–451.
Rauschecker, J. P. (1997). Processing of complex sounds in the auditory cortex of cat, monkey, and man. Acta Otolaryngol. (Stockh.), 532, 34–38.
Rauschecker, J. P. (1998). Parallel processing in the auditory cortex of primates. Audiol. Neuro-Otol., 3, 86–103.
Sargolini, F., Fyhn, M., Hafting, T., McNaughton, B. L., Witter, M. P., Moser, M. B., & Moser, E. I. (2006). Conjunctive representation of position, direction, and velocity in entorhinal cortex. Science, 312, 758–762.
Song, W. J., Kawaguchi, H., Totoki, S., Inoue, Y., Katura, T., Maeda, S., Inagaki, S., Shirasawa, H., & Nishimura, M. (2005). Cortical intrinsic circuits can support activity propagation through an isofrequency strip of the guinea pig primary auditory cortex. Cereb. Cortex, 16, 718–729.
Suga, N. (1995). Processing of auditory information carried by species-specific complex sounds. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 295–313). Cambridge, MA: MIT Press.
Suta, D., Kvasnak, E., Popelar, J., & Syka, J. (2003). Representation of species-specific vocalizations in the inferior colliculus of the guinea pig. J. Neurophysiol., 90, 3794–3808.
Symmes, D., Newman, J. D., Talmage-Riggs, G., & Lieblich, A. K. (1979). Individuality and stability of isolation peeps in squirrel monkeys. Anim. Behav., 27, 1142–1152.
Taniguchi, I., Horikawa, J., Moriyama, T., & Nasu, M. (1992). Spatio-temporal pattern of frequency representation in the auditory cortex of guinea pigs. Neurosci. Lett., 146, 37–40.
Tian, B., & Rauschecker, J. P. (1998). Processing of frequency-modulated sounds in the cat's posterior auditory field. J. Neurophysiol., 79, 2629–2642.
Tyack, P. L. (2000). Dolphins whistle a signature tune. Science, 289, 1310–1311.
Yost, W. A. (1994). The neural response and the auditory code. In W. A. Yost (Ed.), Fundamentals of hearing (pp. 116–133). San Diego, CA: Academic Press.
Received January 19, 2006; accepted June 9, 2006.
LETTER
Communicated by Gal Chechik
Reducing the Variability of Neural Responses: A Computational Theory of Spike-Timing-Dependent Plasticity

Sander M. Bohte
[email protected] Netherlands Centre for Mathematics and Computer Science (CWI), 1098 SJ Amsterdam, The Netherlands
Michael C. Mozer
[email protected] Department of Computer Science, University of Colorado, Boulder, CO, U.S.A.
Experimental studies have observed synaptic potentiation when a presynaptic neuron fires shortly before a postsynaptic neuron and synaptic depression when the presynaptic neuron fires shortly after. The dependence of synaptic modulation on the precise timing of the two action potentials is known as spike-timing-dependent plasticity (STDP). We derive STDP from a simple computational principle: synapses adapt so as to minimize the postsynaptic neuron's response variability to a given presynaptic input, causing the neuron's output to become more reliable in the face of noise. Using an objective function that minimizes response variability and the biophysically realistic spike-response model of Gerstner (2001), we simulate neurophysiological experiments and obtain the characteristic STDP curve along with other phenomena, including the reduction in synaptic plasticity as synaptic efficacy increases. We compare our account to other efforts to derive STDP from computational principles and argue that our account provides the most comprehensive coverage of the phenomena. Thus, reliability of neural response in the face of noise may be a key goal of unsupervised cortical adaptation.

1 Introduction

Experimental studies have observed synaptic potentiation when a presynaptic neuron fires shortly before a postsynaptic neuron and synaptic depression when the presynaptic neuron fires shortly after (Markram, Lübke, Frotscher, & Sakmann, 1997; Bell, Han, Sugawara, & Grant, 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998; Debanne, Gähwiler, & Thompson, 1998; Feldman, 2000; Sjöström, Turrigiano, & Nelson, 2001; Nishiyama, Hong, Mikoshiba, Poo, & Kato, 2000). The dependence of synaptic

Neural Computation 19, 371–403 (2007)
© 2007 Massachusetts Institute of Technology
Figure 1: (A) Measuring STDP experimentally. Presynaptic and postsynaptic spike pairs are repeatedly induced at a fixed interval tpre-post, and the resulting change to the strength of the synapse is assessed. (B) Change in synaptic strength after repeated spike pairing as a function of the difference in time between the presynaptic and postsynaptic spikes. A presynaptic-before-postsynaptic spike induces LTP, and postsynaptic-before-presynaptic induces LTD (data points were obtained by digitizing figures in Zhang et al., 1998). We have superimposed an exponential fit of LTP and LTD.
modulation on the precise timing of the two action potentials, known as spike-timing dependent plasticity (STDP), is depicted in Figure 1. Typically, plasticity is observed only when the presynaptic and postsynaptic spikes occur within a 20 to 30 ms time window, and the transition from potentiation to depression is very rapid. The effects are long lasting and are therefore referred to as long-term potentiation (LTP) and depression (LTD). An important observation is that the relative magnitude of the LTP component of STDP decreases with increased synaptic efficacy between presynaptic and postsynaptic neuron, whereas the magnitude of LTD remains roughly constant (Bi & Poo, 1998). This finding has led to the suggestion that the LTP component of STDP might best be modeled as additive, whereas the LTD component is better modeled as being multiplicative (Kepecs, van Rossum, Song, & Tegnér, 2002). For detailed reviews of STDP, see Bi and Poo (2001), Roberts and Bell (2002), and Dan and Poo (2004). Because these intriguing findings appear to describe a fundamental learning mechanism in the brain, a flurry of models has been developed that focus on different aspects of STDP. A number of studies focus on biochemical models that explain the underlying mechanisms giving rise to STDP (Senn, Markram, & Tsodyks, 2000; Bi, 2002; Karmarkar, Najarian, & Buonomano, 2002; Saudargiene, Porr, & Wörgötter, 2004; Porr & Wörgötter, 2003). Many researchers have also focused on models that explore the consequences of STDP-like learning rules in an ensemble of spiking neurons
(Gerstner, Kempter, van Hemmen, & Wagner, 1996; Kempter, Gerstner, & van Hemmen, 1999, 2001; Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Izhikevich & Desai, 2003; Abbott & Gerstner, 2004; Burkitt, Meffin, & Grayden, 2004; Shon, Rao, & Sejnowski, 2004; Legenstein, Naeger, & Maass, 2005), and a comprehensive review of the different types and conclusions can be found in Porr and Wörgötter (2003). Finally, a recent trend is to propose models that provide fundamental computational justifications for STDP. This article proposes a novel justification, and we explore the consequences of this justification in detail. Most commonly, STDP is viewed as a type of asymmetric Hebbian learning with a temporal dimension. However, this perspective is hardly a fundamental computational rationale, and one would hope that such an intuitively sensible learning rule would emerge from a first-principle computational justification. Several researchers have tried to derive a learning rule yielding STDP from first principles. Dayan and Häusser (2004) show that STDP can be viewed as an optimal noise-removal filter for certain noise distributions. However, even a small variation from these noise distributions yields quite different learning rules, and the noise statistics of biological neurons are unknown. Similarly, Porr and Wörgötter (2003) propose an unsupervised learning rule based on the correlation of bandpass-filtered inputs with the derivative of the output and show that the weight change rule is qualitatively similar to STDP. Hopfield and Brody (2004) derive learning rules that implement ongoing network self-repair. In some circumstances, a qualitative similarity to STDP is found, but the shape of the learning rule depends on both network architecture and task. M. Eisele (private communication, April 2004) has shown that an STDP-like learning rule can be derived from the goal of maintaining the relevant connections in a network.
Rao and Sejnowski (1999, 2001) suggest that STDP may be related to prediction, in particular to temporal difference (TD) learning. They argue that STDP emerges when a neuron attempts to predict its membrane potential at some time t from the potential at time t − Δt. As Dayan (2002) points out, however, temporal difference learning depends on an estimate of the prediction error, which will be very hard to obtain. Rather, a quantity that might be called an activity difference can be computed, and the learning rule is then better characterized as a “correlational learning rule between the stimuli, and the differences in successive outputs” (Dayan, 2002; see also Porr & Wörgötter, 2003, appendix B). Furthermore, Dayan argues that for true prediction, the model has to show that the learning rule works for biologically realistic timescales. The qualitative nature of the modeling makes it unclear whether a quantitative fit can be obtained. Finally, the derived difference rule is inherently unstable, as it does not impose any bounds on synaptic efficacies; also, STDP emerges only for a narrow range of Δt values.
Chechik (2003) relates STDP to information theory via maximization of mutual information between input and output spike trains. This approach derives the LTP portion of STDP but fails to yield the LTD portion. Nonetheless, an information-theoretic approach is quite elegant and has proven valuable in explaining other neural learning phenomena (e.g., Linsker, 1989). The account we describe in this article also exploits an information-theoretic approach. We are not the only ones to appreciate the elegance of information-theoretic accounts. In parallel with a preliminary presentation of our work at the NIPS 2004 conference, two quite similar information-theoretic accounts also appeared (Bell & Parra, 2005; Toyoizumi, Pfister, Aihara, & Gerstner, 2005). It will be easiest to explain the relationship of these accounts to our own once we have presented ours. The computational approaches of Chechik (2003), Dayan and Häusser (2004), and Porr and Wörgötter (2003) are all premised on a rate-based neuron model that disregards the relative timing of spikes. It seems quite odd to argue for STDP using neural firing rate: if spike timing is irrelevant to information transmission, then STDP is likely an artifact and is not central to understanding mechanisms of neural computation. Further, as Dayan and Häusser (2004) note, because STDP is not quite additive in the case of multiple input or output spikes that are near in time (Froemke & Dan, 2002), one should consider interpretations that are based on individual spikes, not aggregates over spike trains. In this letter, we present an alternative theoretical motivation for STDP from a spike-based neuron model that takes the specific times of spikes into account. We conjecture that a fundamental objective of cortical computation is to achieve reliable neural responses, that is, neurons should produce the identical response—in both the number and timing of spikes—given a fixed input spike train.
Reliability is an issue if neurons are affected by noise influences, because noise leads to variability in a neuron’s dynamics and therefore in its response. Minimizing this variability will reduce the effect of noise and will therefore increase the informativeness of the neuron’s output signal. The source of the noise is not important; it could be intrinsic to a neuron (e.g., a time-varying threshold), or it could originate in unmodeled external sources that cause fluctuations in the membrane potential uncorrelated with a particular input. We are not suggesting that increasing neural reliability is the only objective of learning. If it were, a neuron would do well to shut off and give no response regardless of the input. Rather, reliability is but one of many objectives that learning tries to achieve. This form of unsupervised learning must, of course, be complemented by other unsupervised, supervised, and reinforcement learning objectives that allow an organism to achieve its goals and satisfy drives. We return to this issue below and in our conclusions section.
We derive STDP from the following computational principle: synapses adapt so as to minimize the variability in the timing of the spikes of the postsynaptic neuron’s output in response to given presynaptic input spike trains. This variability reduction causes the response of a neuron to become more deterministic and less sensitive to noise, which provides an obvious computational benefit. In our simulations, we follow the methodology of neurophysiological experiments. This approach leads to a detailed fit to key experimental results. We model not only the shape (sign and time course) of the STDP curve, but also the fact that potentiation of a synapse depends on the efficacy of the synapse; it decreases with increased efficacy. In addition to fitting these key STDP phenomena, the model allows us to make predictions regarding the relationship between properties of the neuron and the shape of the STDP curve. The detailed quantitative fit to data makes our work unique among first-principle computational accounts. Before delving into the details of our approach, we give a basic intuition about the approach. Noise in spiking neuron dynamics leads to variability in the number and timing of spikes. Given a particular input, one spike train might be more likely than others, but the output is nondeterministic. By the response variability minimization principle, adaptation should reduce the likelihood of these other possibilities. To be concrete, consider a particular experimental paradigm. In Zhang et al. (1998), a presynaptic neuron is identified with a weak synapse to a postsynaptic neuron, such that this presynaptic input is unlikely to cause the postsynaptic neuron to fire. However, the postsynaptic neuron can be induced to fire via a second presynaptic connection. In a typical trial, the presynaptic neuron is induced to fire a single spike, and with a variable delay, the postsynaptic neuron is also induced to fire (typically) a single spike. 
To increase the likelihood of the observed postsynaptic response, other response possibilities must be suppressed. With presynaptic input preceding the postsynaptic spike, the most likely alternative response is no output spikes at all. Increasing the synaptic connection weight should then reduce the possibility of this alternative response. With presynaptic input following the postsynaptic spike, the most likely alternative response is a second output spike. Decreasing the synaptic connection weight should reduce the possibility of this alternative response. Because both of these alternatives become less likely as the lag between pre- and postsynaptic spikes is increased, one would expect that the magnitude of synaptic plasticity diminishes with the lag, as is observed in the STDP curve. Our approach to reducing response variability given a particular input pattern involves computing the gradient of synaptic weights with respect to a differentiable model of spiking neuron behavior. We use the spike response model (SRM) of Gerstner (2001) with a stochastic threshold, where the stochastic threshold models fluctuations of the membrane potential or
the threshold outside experimental control. For the stochastic SRM, the response probability is differentiable with respect to the synaptic weights, allowing us to calculate the gradient that reduces response variability with respect to the weights. Learning is presumed to take a gradient step to reduce the response variability. In modeling neurophysiological experiments, we demonstrate that this learning rule yields the typical STDP curve. We can predict the relationship between the exact shape of the STDP curve and physiologically measurable parameters, and we show that our results are robust to the choice of the few free parameters of the model. Many important machine learning algorithms in the literature seek local optimizers. It is often the case that the initial conditions, which determine which local optimizer will be found, can be controlled to avoid unwanted local optimizers. For example, with neural networks, weights are initialized near the origin; large initial weights would lead to degenerate solutions. And K-means has many degenerate and suboptimal solutions; consequently, careful initialization of cluster centers is required. In the case of our model’s learning algorithm, the initial conditions also avoid the degenerate local optimizer. These initial conditions correspond to the original weights of the synaptic connections and are constrained by the specific methodology of the experiments that we model: the subthreshold input must have a small but nonzero connection strength, and the suprathreshold input must have a large connection strength (less than 10%, more than 70% probability of activating the target, respectively). Given these conditions, the local optimizer that our learning algorithm discovers is an extremely good fit to the experimental data. 
In parallel with our work, two other groups of authors have proposed explanations of STDP in terms of neurons maximizing an information-theoretic measure for the spike-response model (Bell & Parra, 2005; Toyoizumi et al., 2005). Toyoizumi et al. (2005) maximize the mutual information between the input from a pool of presynaptic neurons and the output of a single postsynaptic neuron, whereas Bell and Parra (2005) maximize sensitivity between a pool of (possibly correlated) presynaptic neurons and a pool of postsynaptic neurons. Bell and Parra use a causal SRM model and do not obtain the LTD component of STDP. As we will show, when the objective function is minimization of (conditional) response variability, obtaining LTD critically depends on a stochastic neural response. In the derivation of Toyoizumi et al. (2005), LTD, which is very weak in magnitude, is attributed to the refractoriness of the spiking neuron (via the autocorrelation function), where they use questionably strong and enduring refractoriness. In our framework, refractoriness suppresses noise in the neuron after spiking, and we show that in our simulations, strong refraction in fact diminishes the LTD component of STDP. Furthermore, the mathematical derivation of Toyoizumi et al. is valid only for an essentially constant membrane potential with small fluctuations, a condition clearly violated in experimental
conditions studied by neurophysiologists. It is unclear whether the derivation would hold under more realistic conditions. Neither of these approaches thus far succeeds in quantitatively modeling specific experimental data with neurobiologically realistic timing parameters, and neither explains the relative reduction of STDP as the synaptic efficacy increases, as we do. Nonetheless, these models make an interesting contrast to ours by suggesting a computational principle of optimization of information transmission, as contrasted with our principle of neural response variability reduction. Experimental tests might be devised to distinguish between these competing theories. In section 2 we describe the sSRM, and in section 3 we derive the minimal entropy gradient. In section 4 we describe the STDP experiment, which we simulate in section 5. We conclude with section 6.

2 The Stochastic Spike Response Model

The spike response model (SRM), defined by Gerstner (2001), is a generic integrate-and-fire model of a spiking neuron that closely corresponds to the behavior of a biological spiking neuron and is characterized in terms of a small set of easily interpretable parameters (Jolivet, Lewis, & Gerstner, 2003; Paninski, Pillow, & Simoncelli, 2005). The standard SRM formulation describes the temporal evolution of the membrane potential based on past neuronal events, specifically as a weighted sum of postsynaptic potentials (PSPs) modulated by reset and threshold effects of previous postsynaptic spiking events. The general idea is depicted in Figure 2; formally (following Gerstner, 2001), the membrane potential u_i(t) of cell i at time t is defined as

u_i(t) = \sum_{f_i \in G_i^t} \eta(t - f_i) + \sum_{j \in \Gamma_i} w_{ij} \sum_{f_j \in G_j^t} \epsilon(t \mid f_j, G_i^t),    (2.1)
where Γ_i is the set of inputs connected to neuron i; G_i^t is the set of times prior to t at which neuron i has spiked, with firing times f_i ∈ G_i^t; w_{ij} is the synaptic weight from neuron j to neuron i; ε(t | f_j, G_i^t) is the PSP in neuron i due to an input spike from neuron j at time f_j given postsynaptic firing history G_i^t; and η(t − f_i) is the refractory response due to the postsynaptic spike at time f_i. To model the postsynaptic potential ε in a leaky integrate-and-fire neuron, a spike of presynaptic neuron j emitted at time f_j generates a postsynaptic current α(t − f_j) for t > f_j. In the absence of postsynaptic firing, this kernel (following Gerstner & Kistler, 2002, eqs. 4.62–4.56, pp. 114–115) can be computed as

\epsilon(t \mid f_j) = \int_{f_j}^{t} \exp\left(-\frac{t - s}{\tau_m}\right) \alpha(s - f_j)\, ds,    (2.2)
Figure 2: Membrane potential u(t) of a neuron as a sum of weighted excitatory PSP kernels due to impinging spikes. Arrival of PSPs marked by arrows. Once the membrane potential reaches threshold, it is reset, and a reset function η is added to model the recovery effects of the threshold.
where τ_m is the decay time of the postsynaptic neuron's membrane potential. Consider an exponentially decaying postsynaptic current α(t) of the form

\alpha(t) = \frac{1}{\tau_s}\, H(t)\, \exp\left(-\frac{t}{\tau_s}\right)    (2.3)
(see Figure 3A), where τ_s is the decay time of the current and H(t) is the Heaviside function. In the absence of postsynaptic firing, this current contributes a postsynaptic potential of the form

\epsilon(t \mid f_j) = \frac{1}{1 - \tau_s/\tau_m} \left[ \exp\left(-\frac{t - f_j}{\tau_m}\right) - \exp\left(-\frac{t - f_j}{\tau_s}\right) \right] H(t - f_j),    (2.4)
with current decay time constant τ_s and membrane decay time constant τ_m. When the postsynaptic neuron fires after the presynaptic spike arrives—at some time f̂_i following the presynaptic spike at time f_j—the membrane potential is reset, and only the remaining synaptic current α(t′) for t′ > f̂_i is integrated in equation 2.2. Following Gerstner, 2001 (section 4.4, equation 1.66), the PSP that takes such postsynaptic firing into account can be written as

\epsilon(t \mid f_j, \hat{f}_i) = \begin{cases} \epsilon(t \mid f_j) & \hat{f}_i < f_j, \\ \exp\left(-\frac{\hat{f}_i - f_j}{\tau_s}\right) \epsilon(t \mid f_j) & \hat{f}_i \ge f_j. \end{cases}    (2.5)
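The PSP kernels of equations 2.3 to 2.5 are easy to evaluate numerically. The sketch below is a minimal illustration, not the paper's code; the values τ_s = 3 ms and τ_m = 10 ms are assumptions taken from the 1–5 ms and 7–15 ms ranges quoted in section 4:

```python
import math

def alpha(t, tau_s=3.0):
    """Exponentially decaying synaptic current (eq. 2.3); t in ms."""
    return math.exp(-t / tau_s) / tau_s if t > 0 else 0.0

def epsilon(t, f_j, tau_s=3.0, tau_m=10.0):
    """PSP kernel in the absence of postsynaptic firing (eq. 2.4)."""
    if t <= f_j:
        return 0.0
    d = t - f_j
    return (math.exp(-d / tau_m) - math.exp(-d / tau_s)) / (1.0 - tau_s / tau_m)

def epsilon_reset(t, f_j, f_hat_i, tau_s=3.0, tau_m=10.0):
    """PSP kernel given one postsynaptic spike at f_hat_i (eq. 2.5):
    when the postsynaptic spike follows the presynaptic one, the kernel
    is attenuated by the synaptic current already spent before the reset."""
    if f_hat_i < f_j:
        return epsilon(t, f_j, tau_s, tau_m)
    return math.exp(-(f_hat_i - f_j) / tau_s) * epsilon(t, f_j, tau_s, tau_m)
```

Integrating `alpha` against the membrane leak, as in equation 2.2, reproduces the double exponential of equation 2.4, which gives a quick consistency check on the kernels.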
Figure 3: (A) The α(t) function: synaptic input modeled as an exponentially decaying current. (B) Postsynaptic potential due to a synaptic input in the absence of postsynaptic firing (solid line), and with the postsynaptic neuron firing once and twice (dotted and dashed lines, respectively; postsynaptic spikes indicated by arrows). (C) Reset function η(t). (D) Spike probability ρ(u) as a function of potential u for different values of the α and β parameters.
This function is depicted in Figure 3B, for the cases when a postsynaptic spike occurs both before and after the presynaptic spike. In principle, this formulation can be expanded to include the postsynaptic neuron firing more than once after the onset of the postsynaptic potential. However, for fast current decay times τs , it is useful to consider only the residual current input for the first postsynaptic spike after onset and assume that any further postsynaptic spiking is modeled by a postsynaptic potential reset to zero from that point on. The reset response η(t) models two phenomena. First, a neuron can be in a refractory period: it simply cannot spike again for about a millisecond after a spiking event. Second, after the emission of a spike, the threshold of the neuron may initially be elevated and then recover to the original value (Kandel, Schwartz, & Jessell, 2000). The SRM models this behavior as negative contributions to the membrane potential (see equation 2.1): with s = t − fˆ i denoting the time since the postsynaptic spike, the refractory
reset function is defined as (Gerstner, 2001):

\eta(s) = \begin{cases} U_{\mathrm{abs}} & 0 < s < \delta_r, \\ U_{\mathrm{abs}} \exp\left(-\frac{s - \delta_r}{\tau_r^f}\right) + U_r \exp\left(-\frac{s}{\tau_r^s}\right) & s \ge \delta_r, \end{cases}    (2.6)
where a large negative impulse U_abs models the absolute refractory period, with duration δ_r; the absolute refractory contribution smoothly resets via a fast-decaying exponential with time constant τ_r^f. The term U_r models the slow exponential recovery of the elevated threshold with time constant τ_r^s. The function η is depicted in Figure 3C. We made a minor modification to the SRM described in Gerstner (2001) by relaxing the constraint that τ_r^s = τ_m and also by smoothing the absolute refractory function (such smoothing is mentioned in Gerstner, but is not explicitly defined). In all simulations, we use δ_r = 1 ms, τ_r^s = 3 ms, and τ_r^f = 0.25 ms (in line with estimates for biological neurons; Kandel et al., 2000; the smoothing parameter was chosen to be fast compared to τ_r^s). The SRM we just described is deterministic. Gerstner (2001) introduces a stochastic variant of the SRM (sSRM) by incorporating the notion of a stochastic firing threshold: given membrane potential u_i(t), the probability of the neuron firing at time t is specified by ρ(u_i(t)). Herrmann and Gerstner (2001) find that for a reasonable escape-rate noise model of the integration of current in real neurons, the probability of firing is small and constant for small potentials, but around a threshold ϑ, the probability increases linearly with the potential. In our simulations, we use such a function,

\rho(v) = \frac{\beta}{\alpha} \left\{ \ln\left[1 + \exp(\alpha(\vartheta - v))\right] - \alpha(\vartheta - v) \right\},    (2.7)
where α determines the abruptness of the constant-to-linear transition in the neighborhood of threshold ϑ and β determines the slope of the linear increase beyond ϑ. This function is depicted in Figure 3D for several values of α and β. We also conducted simulation experiments with sigmoidal and exponential density functions and found no qualitative difference in the results.

3 Minimizing Conditional Entropy

We now derive the rule for adjusting the weight from a presynaptic input neuron j to a postsynaptic neuron i so as to minimize the entropy of i's response given a particular spike train from j. A spike train is described by the set of all times at which a neuron i emitted spikes within some interval between 0 and T, denoted G_i^T. We assume the interval is wide enough that the occurrence of spikes outside
the interval does not influence the state of a neuron within the interval (e.g., through threshold reset effects). This assumption allows us to treat intervals as independent of each other. The set of input spikes received by neuron i during this interval is denoted F_i^T, which is just the union of all output spike trains of the connected presynaptic neurons j: F_i^T = ∪ G_j^T ∀ j ∈ Γ_i. Given input spikes F_i^T, the stochastic nature of neuron i may lead not only to the observed response G_i^T but also to a range of other possibilities. Denote the set of possible responses Ξ_i, where G_i^T ∈ Ξ_i. Further, let the binary variable σ(t) denote the state of the neuron in the time interval [t, t + Δt), where σ(t) = 1 means the neuron spikes and σ(t) = 0 means no spike. A response ξ ∈ Ξ_i is then equivalent to [σ(0), σ(Δt), . . . , σ(T)]. Given a probability density p(ξ) over all possible responses ξ, the differential entropy of neuron i's response conditional on input F_i^T is then defined as

h(i \mid F_i^T) = -\int_{\Xi_i} p(\xi) \log p(\xi)\, d\xi.    (3.1)
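For a time-discretized response, the entropy of equation 3.1 can be evaluated directly by enumerating all binary responses. The sketch below is illustrative only; it assumes each bin spikes independently, with `spike_probs[t]` standing for the per-bin spike probability:

```python
import itertools
import math

def response_entropy(spike_probs):
    """Entropy (nats) of a discretized response distribution (eq. 3.1).
    spike_probs[t] is the probability that sigma(t) = 1 in bin t; bins
    are treated as conditionally independent. Enumerates all 2^n binary
    responses xi and sums -p(xi) log p(xi)."""
    h = 0.0
    for xi in itertools.product((0, 1), repeat=len(spike_probs)):
        p = 1.0
        for s, pt in zip(xi, spike_probs):
            p *= pt if s else 1.0 - pt
        if p > 0.0:
            h -= p * math.log(p)
    return h
```

For independent bins the entropy is additive across bins, which provides a convenient sanity check on the enumeration.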
According to our hypothesis, a neuron adjusts its weights so as to minimize the conditional response variability. Such an adjustment is obtained by performing gradient descent on the weighted likelihood of the response, which corresponds to the conditional entropy, with respect to the weights,

\Delta w_{ij} = -\gamma\, \frac{\partial h(i \mid F_i^T)}{\partial w_{ij}},    (3.2)
with learning rate γ. In this section, we compute the right-hand side of equation 3.2 for an sSRM neuron. Substituting the entropy definition of equation 3.1 into equation 3.2, we obtain

\frac{\partial h(i \mid F_i^T)}{\partial w_{ij}} = -\frac{\partial}{\partial w_{ij}} \int_{\Xi_i} p(\xi) \log p(\xi)\, d\xi = -\int_{\Xi_i} \frac{\partial \log p(\xi)}{\partial w_{ij}}\, p(\xi)\, (\log p(\xi) + 1)\, d\xi.    (3.3)

We closely follow Xie and Seung (2004) to derive ∂ log p(ξ)/∂w_{ij} for a differentiable neuron model firing at times G_i^T. First, we factorize p(ξ):

p(\xi) = \prod_{t=0}^{T} P(\sigma(t) \mid \{\sigma(t'),\ \forall t' < t\}).    (3.4)
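Under the factorization of equation 3.4, the log-probability of a discretized response is a sum of per-bin Bernoulli terms. A small sketch (the bin width `dt` is an illustrative assumption):

```python
import math

def log_likelihood(response, rates, dt=0.001):
    """log p(xi) under the factorization of equation 3.4.
    response[t] is sigma(t) (0 or 1); rates[t] is the spike probability
    density rho(u(t)) in bin t, so the per-bin spike probability is
    rho * dt (valid for sufficiently small dt)."""
    ll = 0.0
    for sigma, r in zip(response, rates):
        p_spike = r * dt
        ll += math.log(p_spike) if sigma else math.log(1.0 - p_spike)
    return ll
```

For a silent response and constant rate ρ, exp(log_likelihood) approaches the survival probability exp(−ρT) of equation 3.10 as dt → 0.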
The states σ(t) are conditionally independent, as the probability for neuron i to fire during [t, t + Δt) is determined by the spike probability density of the membrane potential:

\sigma(t) = \begin{cases} 1 & \text{with probability } p_i = \rho_i(t)\,\Delta t, \\ 0 & \text{with probability } \bar{p}_i = 1 - p_i(t), \end{cases}

with ρ_i(t) shorthand for the spike probability density of the membrane potential, ρ(u_i(t)); this equation holds for sufficiently small Δt (see also Xie & Seung, 2004, for more details). We note further that

\frac{\partial \ln p(\xi)}{\partial w_{ij}} \equiv \frac{1}{p(\xi)} \frac{\partial p(\xi)}{\partial w_{ij}}

and

\frac{\partial \rho_i(t)}{\partial w_{ij}} = \frac{\partial \rho_i(t)}{\partial u_i(t)}\, \frac{\partial u_i(t)}{\partial w_{ij}}.    (3.5)
It is straightforward to derive

\frac{\partial \log p(\xi)}{\partial w_{ij}} = \int_{t=0}^{T} \frac{\partial \rho_i(t)}{\partial u_i(t)}\, \frac{\partial u_i(t)}{\partial w_{ij}}\, \frac{\sum_{f_i \in G_i^T} \delta(t - f_i) - \rho_i(t)}{\rho_i(t)}\, dt
= -\int_{t=0}^{T} \rho_i'(t)\, \epsilon(t \mid f_j, f_i)\, dt + \sum_{f_i \in G_i^T} \frac{\rho_i'(f_i)}{\rho_i(f_i)}\, \epsilon(f_i \mid f_j, f_i),    (3.6)

where ρ_i'(t) ≡ ∂ρ_i(t)/∂u_i(t) and δ(t − f_i) is the Dirac delta, and we use that in the sSRM formulation,

\frac{\partial u_i(t)}{\partial w_{ij}} = \epsilon(t \mid f_j, f_i).

The term ρ_i'(t) in equation 3.6 can be computed for any differentiable spike probability function. In the case of equation 2.7,

\rho_i'(t) = \frac{\beta}{1 + \exp(\alpha(\vartheta - u_i(t)))}.
Substituting our model for ρ_i(t), ρ_i'(t) from equation 2.7 into equation 3.6, we obtain

\frac{\partial \log p(\xi)}{\partial w_{ij}} = -\beta \int_{t=0}^{T} \frac{\epsilon(t \mid f_j, f_i)}{1 + \exp[\alpha(\vartheta - u_i(t))]}\, dt + \sum_{f_i \in G_i^T} \frac{\alpha\, \epsilon(f_i \mid f_j, f_i)}{\{\ln(1 + \exp[\alpha(\vartheta - u_i(f_i))]) - \alpha(\vartheta - u_i(f_i))\}\, (1 + \exp[\alpha(\vartheta - u_i(f_i))])}.    (3.7)
Equation 3.7 can be substituted into equation 3.3, which, when integrated, provides the gradient-descent weight update that implements conditional entropy minimization (see equation 3.2). The hypothesis under exploration is that this gradient-descent weight update yields STDP. Unfortunately, an analytic solution to equation 3.3 (and hence equation 3.2) is not readily obtained. Nonetheless, numerical methods can be used to obtain a solution. We are not suggesting a neuron performs numerical integration of this sort in real time. It would be preposterous to claim biological realism for an instantaneous integration over all possible responses ξ ∈ Ξ_i, as specified by equation 3.3. Consequently, we have a dilemma: What use is a computational theory of STDP if the theory demands intensive computations that could not possibly be performed by a neuron in real time? This dilemma can be circumvented in two ways. First, the resulting learning rule might be cached in some form through evolution so that the computation is not necessary. That is, the solution—the STDP curve itself—may be built into a neuron. As such, our computational theory provides an argument for why neurons have evolved to implement the STDP learning rule. Second, the specific response produced by a neuron on a single trial might be considered a sample from the distribution p(ξ), and the integration in equation 3.3 can be performed by a sampling process over repeated trials; each trial would produce a stochastic gradient step.
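The second, sampling-based reading can be sketched as a single-synapse toy simulation: each trial draws one response from p(ξ) and yields a one-sample estimate of the entropy gradient in equation 3.3. The membrane model below (u = w·ε, with illustrative α, β, ϑ values) is a deliberate simplification for illustration, not the paper's full sSRM:

```python
import math
import random

def rho(u, beta=1.0, alpha=5.0, theta=1.0):
    # escape rate of equation 2.7, rewritten in the numerically stable
    # softplus form rho = (beta/alpha) * ln(1 + exp(alpha*(u - theta)))
    y = alpha * (u - theta)
    return (beta / alpha) * (max(y, 0.0) + math.log1p(math.exp(-abs(y))))

def drho_du(u, beta=1.0, alpha=5.0, theta=1.0):
    # derivative of equation 2.7 with respect to the membrane potential
    return beta / (1.0 + math.exp(alpha * (theta - u)))

def entropy_gradient_sample(w, eps_kernel, dt=0.001, rng=random):
    """One-trial Monte Carlo estimate of dh/dw (equations 3.2-3.3):
    sample a response xi bin by bin, accumulate log p(xi) and
    d log p(xi)/dw, and return -(log p(xi) + 1) * d log p(xi)/dw."""
    logp = dlogp = 0.0
    for e in eps_kernel:              # e = eps(t | f_j) in this time bin
        u = w * e                     # toy single-synapse potential
        r, dr = rho(u), drho_du(u)
        p_spike = min(r * dt, 1.0 - 1e-9)
        if rng.random() < p_spike:    # sigma(t) = 1
            logp += math.log(p_spike)
            dlogp += dr / r * e
        else:                         # sigma(t) = 0
            logp += math.log(1.0 - p_spike)
            dlogp -= dr * dt / (1.0 - p_spike) * e
    return -(logp + 1.0) * dlogp
```

Averaging `entropy_gradient_sample` over many trials approximates the integral in equation 3.3, and Δw = −γ times that average gives the update of equation 3.2.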
3.1 Numerical Computation. In this section, we describe the procedure for numerically evaluating equation 3.2 via Simpson's integration (Hennion, 1962). This integration is performed over the set of possible responses Ξ_i (see equation 3.3) within the time interval [0 . . . T]. The set Ξ_i can be divided into disjoint subsets Ξ_i^n, which contain exactly n spikes: Ξ_i = ∪_n Ξ_i^n.
Using this breakdown,

\frac{\partial h(i \mid F_i^T)}{\partial w_{ij}} = -\int_{\Xi_i} \frac{\partial \log g(\xi)}{\partial w_{ij}}\, g(\xi)\, (\log g(\xi) + 1)\, d\xi = -\sum_{n=0}^{\infty} \int_{\Xi_i^n} \frac{\partial \log g(\xi)}{\partial w_{ij}}\, g(\xi)\, (\log g(\xi) + 1)\, d\xi.    (3.8)
It is illustrative to walk through the alternatives. For n = 0, there is only one response given the input. Assuming the probability of n = 0 spikes is p_0, the n = 0 term of equation 3.8 reads

\frac{\partial h(i \mid F_i^T)}{\partial w_{ij}} = p_0\, (\log p_0 + 1) \int_{t=0}^{T} -\rho_i'(t)\, \epsilon(t \mid f_j, f_i)\, dt.    (3.9)
The probability p_0 is the probability of the neuron not having fired between t = 0 and t = T given inputs F_i^T resulting in membrane potential u_i(t), and hence a probability of firing at time t of ρ(u_i(t)):

p_0 = S[0, T] = \exp\left(-\int_{t=0}^{T} \rho(u_i(t))\, dt\right),    (3.10)
which is equal to the survival function S for a nonhomogeneous Poisson process with probability density ρ(u_i(t)) for t = [0 . . . T]. (We use the inclusive/exclusive notation for S: S(0, T) computes the function excluding the end points; S[0, T] is inclusive.) For n = 1, we must consider all responses containing exactly one output spike: G_i^T = {f_i^1}, f_i^1 ∈ [0, T]. Assuming that neuron i fires only at time f_i^1 with probability p_1(f_i^1), the n = 1 term of equation 3.8 reads
\frac{\partial h(i \mid F_i^T)}{\partial w_{ij}} = \int_{f_i^1=0}^{f_i^1=T} p_1(f_i^1)\, \left(\log p_1(f_i^1) + 1\right) \left[ \int_{t=0}^{T} -\rho_i'(t)\, \epsilon(t \mid f_j, f_i^1)\, dt + \frac{\rho_i'(f_i^1)}{\rho_i(f_i^1)}\, \epsilon(f_i^1 \mid f_j, f_i^1) \right] df_i^1.    (3.11)

The probability p_1(f_i^1) is computed as

p_1(f_i^1) = S[0, f_i^1)\, \rho_i(f_i^1)\, S(f_i^1, T],    (3.12)
where the membrane potential now incorporates one reset at t = f_i^1:

u_i(t) = \eta(t - f_i^1) + \sum_{j \in \Gamma_i} w_{ij} \sum_{f_j \in F_j^t} \epsilon(t \mid f_j, f_i^1).
For n = 2, we must consider all responses containing exactly two output spikes: G_i^T = {f_i^1, f_i^2} for f_i^1, f_i^2 ∈ [0, T]. Assuming that neuron i fires at f_i^1 and f_i^2 with probability p_2(f_i^1, f_i^2), the n = 2 term of equation 3.8 reads

\frac{\partial h(i \mid F_i^T)}{\partial w_{ij}} = \int_{f_i^1=0}^{f_i^1=T} \int_{f_i^2=f_i^1}^{f_i^2=T} p_2(f_i^1, f_i^2)\, \left(\log p_2(f_i^1, f_i^2) + 1\right) \times \left[ \int_{t=0}^{T} -\rho_i'(t)\, \epsilon(t \mid f_j, f_i^1, f_i^2)\, dt + \frac{\rho_i'(f_i^1)}{\rho_i(f_i^1)}\, \epsilon(f_i^1 \mid f_j, f_i^1, f_i^2) + \frac{\rho_i'(f_i^2)}{\rho_i(f_i^2)}\, \epsilon(f_i^2 \mid f_j, f_i^1, f_i^2) \right] df_i^1\, df_i^2.    (3.13)

The probability p_2(f_i^1, f_i^2) can again be expressed in terms of the survival function,

p_2(f_i^1, f_i^2) = S[0, f_i^1)\, \rho_i(f_i^1)\, S(f_i^1, f_i^2)\, \rho_i(f_i^2)\, S(f_i^2, T],    (3.14)

with u_i(t) = η(t − f_i^1) + η(t − f_i^2) + Σ_{j∈Γ_i} w_{ij} Σ_{f_j∈F_j^t} ε(t | f_j, f_i^1, f_i^2). This procedure can be extended for n > 2 following the pattern above. In our simulation of the STDP experiments, the probability of obtaining zero, one, or two spikes already accounted for 99.9% of all possible responses; adding the responses of three spikes (n = 3) got this number up to ≈ 99.999%, which is close to the accuracy of our numerical computation. In practice, we found that taking into account n = 3 had no significant contribution to computing Δw, and we did not compute higher-order terms, as the cumulative probability of these responses was below our numerical precision. For the results we present later, we used only terms n ≤ 2; we demonstrate that this is sufficient in appendix A. In this section, we have replaced an integral over possible spike sequences Ξ_i with an integral over the time of two output spikes, f_i^1 and f_i^2, which we compute numerically.
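The survival-function bookkeeping of equations 3.10 and 3.12 is straightforward to discretize. The sketch below uses a fixed bin width `dt` (an illustrative choice) and, in the test, checks p_0 and the integrated p_1 against a constant-rate Poisson process:

```python
import math

def survival(rho_vals, dt):
    """S over the given bins: probability of emitting no spike when bin t
    has spike probability density rho_vals[t] (equation 3.10)."""
    return math.exp(-sum(r * dt for r in rho_vals))

def p_zero(rho_vals, dt):
    # n = 0: survive the whole interval
    return survival(rho_vals, dt)

def p_one(rho_vals, dt, k):
    """n = 1 with the spike in bin k (equation 3.12): survive up to k,
    spike there (density times dt), then survive the remainder."""
    return survival(rho_vals[:k], dt) * rho_vals[k] * dt * survival(rho_vals[k + 1:], dt)
```

p_2 follows the same pattern with two spike bins and three survival factors, as in equation 3.14.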
Figure 4: Experimental setup of Zhang et al. (1998).
4 Simulation Methodology

We modeled in detail the experiment of Zhang et al. (1998) involving asynchronous costimulation of convergent inputs. In this experiment, depicted in Figure 4, a postsynaptic neuron is identified that has two neurons projecting to it: one weak (subthreshold) and one strong (suprathreshold). The subthreshold input results in depolarization of the postsynaptic neuron, but the depolarization is not strong enough to cause the postsynaptic neuron to spike. The suprathreshold input is strong enough to induce a spike in the postsynaptic neuron. Plasticity of the synapse between the subthreshold input and the postsynaptic neuron is measured as a function of the timing between the subthreshold and postsynaptic neurons' spikes (tpre-post) by varying the intervals between induced spikes in the subthreshold and the suprathreshold inputs (tpre-pre). This measurement yields the well-known STDP curve (see Figure 1B). In most experimental studies of STDP, the postsynaptic neuron is induced to spike not via a suprathreshold neuron, but rather by depolarizing current injection directly into the postsynaptic neuron. To model experiments that induce spiking via current injection, additional assumptions must be made in the spike response model framework. Because these assumptions are not well established in the literature, we have focused on the synaptic input technique of Zhang et al. (1998). In section 5.1, we propose a method for modeling a depolarizing current injection in the spike-response model. The Zhang et al. (1998) experiment imposes four constraints on a simulation: (1) the suprathreshold input alone causes spiking more than 70% of the time; (2) the subthreshold input alone causes spiking less than 10% of the time; (3) synchronous firing of suprathreshold or subthreshold inputs causes LTP if and only if the postsynaptic neuron fires; and (4) the time constants of the excitatory PSPs (EPSPs)—τ_s and τ_m in the sSRM—are
Reducing the Variability of Neural Responses
in the range of 1 to 5 ms and 7 to 15 ms, respectively. These constraints remove many free parameters from our simulation. We do not explicitly model the two input cells; instead, we model the EPSPs they produce. The magnitude of these EPSPs is picked to satisfy the experimental constraints: in most simulations, unless reported otherwise, the suprathreshold EPSP (w_supra) alone causes a spike in the postsynaptic neuron on 85% of trials, and the subthreshold EPSP (w_sub) alone causes a spike on fewer than 0.1% of trials. In our principal re-creation of the experiment (see Figure 5), we added normally distributed variation to w_supra and w_sub to simulate the experimental selection process of finding suitable supra- and subthreshold input pairs, according to w′_supra = w_supra + N(0, σ_supra) and w′_sub = w_sub + N(0, σ_sub) (the random variation was controlled so that pairs stayed within the specified firing-probability ranges). Free parameters of the simulation are ϑ and β in the spike probability function (α can be folded into ϑ) and the magnitudes (u_rs, u_abs) and time constants (τ_rs, τ_abs) of the reset. We can further investigate how the results depend on the exact strengths of the subthreshold and suprathreshold EPSPs. The independent variable of the simulation is t_pre-pre, and we measure the time of the postsynaptic spike to determine t_pre-post. In the experimental protocol, a pair of inputs is repeatedly stimulated at a specific interval t_pre-pre at a low frequency of 0.1 Hz. The weight update for a given t_pre-pre is measured by comparing the size of the EPSC before stimulation and (about) half an hour after stimulation. In terms of our model, this repeated stimulation can be considered as drawing a response ξ from the stochastic conditional response density p(ξ). We estimate the expected weight update under this density for a given t_pre-pre using equation 3.2, approximating the integral by a summation over all time-discretized output responses consisting of 0, 1, or 2 spikes.
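The pair-selection procedure above (Gaussian perturbation of w_supra and w_sub, with rejection of pairs that violate the firing-probability criteria) can be sketched as follows. The `fire_probability` function is a hypothetical stand-in for evaluating the sSRM on a single EPSP, and all numerical values are illustrative assumptions, not the authors' code.

```python
import random

def fire_probability(w):
    """Hypothetical stand-in for the sSRM's probability that a single
    EPSP of magnitude w elicits a postsynaptic spike. A saturating toy
    curve; the real model evaluates the stochastic threshold against
    the membrane potential trajectory."""
    return min(1.0, max(0.0, w - 0.5))

def sample_input_pair(w_supra, w_sub, sigma_supra, sigma_sub):
    """Draw perturbed (supra, sub) weights, redrawing pairs that violate
    the experimental criteria (>70% supra firing, <10% sub firing)."""
    while True:
        ws = w_supra + random.gauss(0.0, sigma_supra)
        wb = w_sub + random.gauss(0.0, sigma_sub)
        if fire_probability(ws) > 0.70 and fire_probability(wb) < 0.10:
            return ws, wb
```

Each accepted pair then serves as one simulated "cell pair" for a single t_pre-pre data point.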
Note that performing the weight update computation like this implicitly assumes that the synaptic efficacies in the experiment do not change much during repeated stimulation; since long-term synaptic changes require the synthesis of, for example, proteins, this seems a reasonable assumption, also reflected in the half hour or so that the experimentalists wait after stimulation before measuring the new synaptic efficacy.
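The truncated sum over discretized responses can be sketched as follows, under the simplifying assumption (ours, not the paper's) that spike bins are independent; `grad` stands in for the per-response weight gradient of equation 3.2, whose exact form is not reproduced here.

```python
from itertools import combinations

def expected_weight_update(p_spike, grad, n_bins, max_spikes=2):
    """Approximate E[dw] = sum over responses xi of p(xi) * grad(xi),
    enumerating all time-discretized responses with at most `max_spikes`
    spikes. p_spike[t] is the probability of a spike in bin t (assumed
    independent across bins, and < 1); grad(xi) is a hypothetical
    per-response gradient."""
    bins = range(n_bins)
    # Probability of the all-silent response.
    p_silent = 1.0
    for t in bins:
        p_silent *= (1.0 - p_spike[t])
    total = 0.0
    for n in range(0, max_spikes + 1):
        for xi in combinations(bins, n):
            p = p_silent
            for t in xi:  # swap the "no spike" factor for the spike one
                p *= p_spike[t] / (1.0 - p_spike[t])
            total += p * grad(xi)
    return total
```

Appendix A argues that stopping at two spikes loses negligible probability mass in the regime of the experiment.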
5 Results

Figure 5A shows an STDP curve produced by the model, obtained by plotting the estimated weight update of equation 3.2 against t_pre-post for fixed supra- and subthreshold inputs. Specifically, we vary the difference in time between subthreshold and suprathreshold inputs (a pre-pre pair), and we compute the expected gradient for the subthreshold input w_sub over all responses of the postsynaptic neuron via equation 3.2. We thus obtain a value of Δw for each t_pre-pre data point; we then compute Δw_sub(%) as the
S. Bohte and M. Mozer
Figure 5: (A) STDP: experimental data (triangles) and model fit (solid line). (B) Added simulation data points with perturbed weights w′_supra and w′_sub (crosses). STDP data redrawn from Zhang et al. (1998). Model parameters: τ_s = 2.5 ms, τ_m = 10 ms; sub- and suprathreshold weight perturbation in B: σ_supra = 0.33 w_supra, σ_sub = 0.1 w_sub. (C) Model fit compared to previous generative models (Chechik, 2003, short dashed line; Toyoizumi et al., 2005, long dashed line; data points and curves were obtained by digitizing figures in the original papers). Free parameters of Chechik (2003) and Toyoizumi et al. (2005) were fit to the experimental data as described in appendix B.
relative percentage change of synaptic efficacy: Δw_sub(%) = Δw/w_sub × 100%. (We set the global learning rate γ in equation 3.2 such that the simulation curve is scaled to match the neurophysiological results; in all other experiments where we use the relative percentage change Δw_sub(%), the same value of γ is used.) For each t_pre-pre, the corresponding value of t_pre-post is determined by calculating for each input pair the average time at which the postsynaptic neuron fires relative to the subthreshold input. Together, this results in a set of (t_pre-post, Δw_sub(%)) data points. The continuous graph in Figure 5A
is obtained by repeating this procedure for fixed supra- and subthreshold weights and connecting the resultant points. In Figure 5B, the supra- and subthreshold weights in the simulation are randomly perturbed for each pre-pre pair, to simulate the fact that in the experiment, different pairs of neurons are selected for each pre-pre pair, leading inevitably to variation in the synaptic strengths. Mild variation of the input weights yields "scattered" data points of the relative weight changes similar to the experimentally observed data. The variation we apply is mild only relative to the synaptic weight distributions observed in vivo (e.g., Song, Sjöström, Reigl, Nelson, & Chklovskii, 2005). However, Zhang et al. (1998) did not sample randomly from synapses in the brain but rather selected synapses that had a particularly narrow range of initial EPSPs to satisfy the criteria for "supra-" and "subthreshold" synapses (see also section 4). Hence, the experimental variance was particularly small (see Figure 1e of Zhang et al., 1998), and our variation of the size of the EPSP is in line with the variation observed in the experimental results of Zhang et al. (1998).

The model produces a good quantitative fit to the experimental data points (triangles), especially compared to the related work discussed in section 1, and robustly obtains the typical LTP and LTD time windows associated with STDP. In Figure 5C, we show our model fit compared to the models of Toyoizumi et al. (2005) and Chechik (2003). Our model obtained the lowest sum-squared error (1.25 versus 1.63 and 3.27, respectively; see appendix B for methods), despite the lack of data in the region t_pre-post = 0, . . . , 10 ms of the Zhang et al. (1998) experiment, where the difference in LTD behavior is most pronounced. The qualitative shape of the STDP curve is robust to settings of the spiking neuron model's parameters, as we will illustrate shortly.
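Footnote 2 notes that our model describes the EPSP with a difference of exponentials (equation 2.4), whereas the comparison models use a single exponential. The contrast between the two kernel shapes can be sketched as follows; the normalization used here is our own assumption.

```python
import math

def epsp_double(t, tau_s=2.5, tau_m=10.0):
    """Difference-of-exponentials EPSP (the form we assume for eq. 2.4;
    the 1/(tau_m - tau_s) normalization is ours, not the paper's)."""
    if t < 0:
        return 0.0
    return (math.exp(-t / tau_m) - math.exp(-t / tau_s)) / (tau_m - tau_s)

def epsp_single(t, tau=10.0):
    """Single-exponential EPSP, as used by Chechik (2003) and
    Toyoizumi et al. (2005)."""
    return math.exp(-t / tau) if t >= 0 else 0.0

# The double-exponential kernel rises smoothly from zero with a finite
# rise time tau_s, instead of jumping to its peak at t = 0.
```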
(Of note for the error comparison above: our spiking neuron model uses a more sophisticated difference of exponentials (see equation 2.4) to describe the EPSP, whereas the spiking neuron models in Toyoizumi et al. (2005) and Chechik (2003) use a single exponential. These other models might be improved using the more sophisticated EPSP function.)

Additionally, we found that the type of spike probability function ρ (exponential, sigmoidal, or linear) is not critical. Our model accounts for an additional finding that has not been explained by alternative theories: the relative magnitude of LTP decreases as the efficacy of the synapse between the subthreshold input and the postsynaptic target neuron increases; in contrast, LTD remains roughly constant (Bi & Poo, 1998). Figure 6A shows this effect in the experiment of Bi and Poo (1998), and Figure 6B shows the corresponding result from our model. We compute the magnitude of LTP and LTD for the peak modulation (i.e., t_pre-post = −5 ms for LTP and t_pre-post = +5 ms for LTD) as the amplitude of
Figure 6: Dependence of LTP and LTD magnitude on efficacy of the subthreshold input. (A) Experimental data redrawn from Bi and Poo (1998). (B) Simulation result.
the subthreshold EPSP is increased. The model’s explanation for this phenomenon is simple: as the synaptic weight increases, its effect saturates, and a small change to the weight does little to alter its influence. Consequently, the gradient of the entropy with respect to the weight goes toward zero. Similar saturation effects are observed in gradient-based learning methods with nonlinear response functions such as backpropagation. As we mentioned earlier, other theories have had difficulty reproducing the typical shape of the LTD component of STDP. In Chechik (2003), the shape is predicted to be near uniform, and in Toyoizumi et al. (2005), the shape depends on the autocorrelation. In our stochastic spike response model, this component arises due to the stochastic variation in the neural response: in the specific STDP experiment, reduction of variability is achieved by reducing the probability of multiple output spikes. To argue for this conclusion, we performed simulations that make our neuron model less variable in various ways, and each of these manipulations results in a reduction in the LTD component of STDP. In Figures 7A and 7B, we make the threshold more deterministic by increasing the values of α and β in the spike probability density function. In Figure 7C, we increase the magnitude of the refractory response η, which will prevent spikes following the initial postsynaptic response. And finally, in Figure 7D, we increase the efficacy of the suprathreshold input, which prevents the postsynaptic neuron’s potential from hovering in the region where the stochasticity of the threshold can induce a spike. Modulation of all of these variables makes the threshold more deterministic and decreases LTD relative to LTP. Our simulation results are robust to biologically realizable variation in the parameters of the sSRM model. For example, time constants of the EPSPs can be varied with no qualitative effect on the STDP curves.
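The saturation argument above can be made concrete with a sigmoidal spike-probability function and a deliberately minimal membrane-potential model u = w · EPSP; both the functional form and the parameter values here are illustrative assumptions, not the paper's exact model.

```python
import math

def spike_prob(u, theta=1.0, beta=4.0):
    """Sigmoidal spike-probability function of the membrane potential u
    (one of the forms the letter considers; parameter values are ours)."""
    return 1.0 / (1.0 + math.exp(-beta * (u - theta)))

def dprob_dw(w, epsp=1.0, h=1e-6):
    """Numerical gradient of the firing probability with respect to the
    synaptic weight, using the minimal potential u = w * epsp."""
    return (spike_prob((w + h) * epsp) - spike_prob((w - h) * epsp)) / (2 * h)

# As the weight grows past threshold, the firing probability saturates
# and the gradient collapses, mirroring the vanishing LTP for strong
# synapses while leaving weights near threshold highly plastic.
```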
Figure 7: Dependence of relative LTP and LTD on (A) the parameter α of the stochastic threshold function, (B) the parameter β of the stochastic threshold function, (C) the magnitude of refraction, η, and (D) efficacy of the suprathreshold synapse, expressed as p(fire|supra), the probability that the postsynaptic neuron will fire when receiving only the suprathreshold input. Larger values of p(fire|supra) correspond to a weaker suprathreshold synapse. In all graphs, the weight gradient for individual curves is normalized to peak LTP for comparison purposes.
Figures 8A and 8B show the effect of manipulating the membrane potential decay time τ_m and the EPSP rise time τ_s, respectively. Note that manipulation of these time constants does predict a systematic effect on STDP curves. Increasing τ_m increases the duration of both the LTP and LTD windows, whereas decreasing τ_s leads to a faster transition from LTP to LTD. Both predictions could be tested experimentally by correlating time constants of individual neurons studied with the time course of their STDP curves.

5.1 Current Injection

We mentioned earlier that in many STDP experiments, an action potential is induced in the postsynaptic neuron not via a suprathreshold presynaptic input, but via a depolarizing current injection. In order to model experiments using current injection, we must
Figure 8: Influence of time constants of the sSRM model on the shape of the STDP curve: (A) varying the membrane potential time-constant τm and (B) varying the EPSP rise time constant τs . In both figures, the magnitude of LTP and LTD has been normalized to 1 for each curve to allow for easy examination of the effect of the manipulation on temporal characteristics of the STDP curves.
characterize the current function and its effect on the postsynaptic neuron. In this section, we make such a proposal framed in terms of the spike response model and report simulation results using current injection. We model the injected current I(t) as a rectangular step function,

I(t) = I_c H(t − f_I) H(f_I + Δ_I − t),    (5.1)

where the current of magnitude I_c is switched on at t = f_I and off at t = f_I + Δ_I. In the Zhang et al. (1998) experiment, Δ_I is 2 ms, a value we adopted for our simulations as well. The resulting postsynaptic potential, ε_c, is

ε_c(t) = (1/τ_m) ∫_0^t exp(−(t − s)/τ_m) I(s) ds.    (5.2)

In the absence of postsynaptic firing, the membrane potential of an integrate-and-fire neuron in response to a step current is (Gerstner, 2001):

ε_c(t | f_I) = I_c (1 − exp[−(t − f_I)/τ_m]).    (5.3)
In the presence of postsynaptic firing at time f̂_i, we assume, as we did previously in equation 2.5, a reset and subsequent integration of the residual current:

ε_c(t | f̂_i) = H(f̂_i − t) (1/τ_m) ∫_0^t exp(−(t − s)/τ_m) I(s) ds + H(t − f̂_i) (1/τ_m) ∫_{f̂_i}^t exp(−(t − s)/τ_m) I(s) ds.    (5.4)

Figure 9: Voltage response of a spiking neuron for a 2 ms current injection in the spike response model. Solid curve: the postsynaptic neuron produces no spike, and the potential due to the injected current decays with the membrane time constant τ_m. Dotted curve: the postsynaptic neuron spikes while the current is still being applied. Dashed curve: the postsynaptic neuron spikes after application of the current has terminated (moment of postsynaptic spiking indicated by arrows).
These ε_c kernels are depicted in Figure 9 for a postsynaptic spike occurring at various times f̂_i. In our simulations, we chose the current magnitude I_c to be large enough to elicit spiking of the target neuron with probability greater than 0.7. Figure 10A shows the STDP curve obtained using the current injection model, for exactly the same model parameter settings used to produce the result based on a suprathreshold synaptic input (depicted in Figure 5A), superimposed on the experimental STDP data obtained by depolarizing current injection in Zhang et al. (1998). Figure 10B additionally superimposes the earlier suprathreshold-input result on the current injection result; the two curves are difficult to distinguish. As in the earlier result, variation of model parameters has little appreciable effect on the model's behavior in the current injection paradigm, suggesting that the choice of current injection versus synaptic input makes little difference to the nature of STDP.
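The current-injection kernels of equations 5.1 through 5.4 can be checked numerically. This sketch normalizes the exponential kernel by 1/τ_m so that the closed form of equation 5.3 is recovered (the paper leaves the constant implicit), and models the reset simply as restarting the integral at the postsynaptic spike time; the parameter values are illustrative.

```python
import math

TAU_M = 10.0  # membrane time constant (ms)

def step_current(t, f_I, delta_I=2.0, I_c=1.0):
    """Rectangular current pulse of eq. 5.1 (on during [f_I, f_I + delta_I])."""
    return I_c if f_I <= t <= f_I + delta_I else 0.0

def eps_c_numeric(t, f_I, reset_time=None, dt=1e-3):
    """Numerical version of eqs. 5.2/5.4: pass the current through an
    exponential kernel, restarting the integral at a postsynaptic spike."""
    t0 = 0.0 if (reset_time is None or t < reset_time) else reset_time
    s, total = t0, 0.0
    while s < t:
        total += math.exp(-(t - s) / TAU_M) * step_current(s, f_I) * dt / TAU_M
        s += dt
    return total

def eps_c_closed(t, f_I, I_c=1.0):
    """Closed form of eq. 5.3, valid while the current is still on."""
    return I_c * (1.0 - math.exp(-(t - f_I) / TAU_M))
```

With no reset, the numerical integral matches the closed form; a reset during the pulse leaves only the residual current contribution, as in the dotted curve of Figure 9.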
Figure 10: (A) STDP curve obtained for SRM with current injection (solid curve) compared with experimental data for depolarizing current injection (circles; redrawn from Zhang et al., 1998). (B) Comparing STDP curves for both current injection (solid curve) and suprathreshold input (dashed curve) models. The same model parameters are used for both curves. Experimental data redrawn from Zhang et al. (1998) for current injection (circles) and suprathreshold input (triangles) paradigms are superimposed.
6 Discussion

In this letter, we explored a fundamental computational principle: that synapses adapt so as to minimize the variability of a neuron's response in the face of noisy inputs, yielding more reliable neural representations. From this principle, instantiated as entropy minimization, we derived the STDP learning curve. Importantly, the simulation methodology we used to derive the curve closely follows the procedure used in neurophysiological experiments (Zhang et al., 1998): assuming variation in sub- and suprathreshold synaptic efficacies from experimental pair to pair even recovers the noisy scattering of efficacy changes. Our simulations furthermore obtain an STDP curve that is robust to model parameters and details of the noise distribution. Our results are critically dependent on the use of Gerstner's stochastic spike response model, whose dynamics are a good approximation to those of a biological spiking neuron. The sSRM has the virtue of being characterized by parameters that are readily related to neural dynamics, and its dynamics are differentiable, so that we can derive a gradient-descent learning rule that minimizes the response variability of a postsynaptic neuron given a particular set of input spikes.

Our model predicts the shape of the STDP curve and how it relates to properties of a neuron's response function. These predictions may be empirically testable if a diverse population of cells can be studied. The predictions include the following. First, the width of the LTD and LTP windows depends on the (excitatory) PSP time constants (see Figures 8A and 8B). Second, the strength of LTD relative to LTP depends on the degree of noise in the neuron's response.

Our model also can characterize the nature of the learning curve for experimental situations that deviate from the boundary conditions of Zhang et al. (1998). In Zhang et al., the subthreshold and suprathreshold inputs produced postsynaptic firing with probability less than .10 and greater than .70, respectively. Our model can predict the consequences of violating these conditions. For example, when the subthreshold input is very strong or the suprathreshold input is very weak, our model produces strictly LTD, that is, anti-Hebbian learning. The consequence of a strong subthreshold input is shown in Figure 6B, and the consequence of a weak suprathreshold input is shown in Figure 7D. Intuitively, this simulation result makes sense because, in the first case, the most likely alternative response of the postsynaptic neuron is to produce more than one spike, and, in the second case, the most likely alternative response is no postsynaptic spike at all. In both cases, synaptic depression reduces the probability of the alternative response. We note that such strictly anti-Hebbian learning has been reported in relation to STDP-type experiments (Roberts & Bell, 2002).

For very noisy thresholds and for weak suprathreshold inputs, our model produces an LTD dip before LTP (see Figure 7D). This dip is in fact also present in the work of Chechik (2003), and we find it intriguing that it is also observed in the experimental results of Nishiyama et al. (2000). The explanation for this dip may be along the same lines as the explanation for the LTD window: given the very noisy threshold, the subthreshold input may occasionally cause spiking, and decreasing its weight would decrease response variability.
This may not be offset by the increase due to its contribution to the spike caused by the suprathreshold input, as it is too early to have much influence. With careful consideration of experimental conditions and neuron parameters, it may be possible to reconcile the somewhat discrepant STDP curves obtained in the literature using our model.

In our model, the transition from LTP to LTD occurs at a slight offset from t_pre-post = 0: if the subthreshold input fires 1 to 2 ms before the postsynaptic neuron fires (on average), then neither potentiation nor depression occurs. This offset of 1 to 2 ms is attributable to the current decay time constant, τ_s. The neurophysiological data are not sufficiently precise to determine the exact offset of the LTP-LTD transition in real neurons. Unfortunately, few experimental data points are recorded near t_pre-post = 0. However, the STDP curve of our model does pass through the one data point in that region (see Figure 5A), so the offset may be a real phenomenon.

The main focus of the simulations in this letter was to replicate the experimental paradigm of Zhang et al. (1998), in which a suprathreshold presynaptic neuron is used to induce the postsynaptic neuron to fire. The Zhang et al. (1998) study is exceptional in that most other experimental studies of STDP use a depolarizing current injection to induce the
postsynaptic neuron to fire. We are not aware of any established model for current injection within the SRM framework; we therefore proposed one in section 5.1. The proposed model is an idealized abstraction of current injection that does not take into account effects like the current onset and offset fluctuations inherent in such experimental methods. Even with these limitations in mind, the current injection model produced STDP curves very similar to the ones obtained by simulating suprathreshold-input-induced postsynaptic firing.

The simulations reported in this letter account for classical STDP experiments in which a single presynaptic spike is paired with a single postsynaptic spike. The same methodology can be applied to model experimental paradigms involving multiple presynaptic or postsynaptic spikes, or both, although the computation involved becomes nontrivial. We are currently engaged in modeling data from the multispike experiments of Froemke and Dan (2002).

We note that one set of simulation results we reported is particularly pertinent for comparing and contrasting our model with the related model of Toyoizumi et al. (2005). The simulations reported in Figure 7 suggest that noise in our model is critical for obtaining the LTD component of STDP and that parameters that reduce noise in the neural response also reduce LTD. We found that increasing the strength of neuronal refraction reduces response variability and therefore diminishes the LTD component of STDP. This notion is also put forward in very recent work by Pfister, Toyoizumi, Barber, and Gerstner (2006), where an STDP-like rule arises from a supervised learning procedure that aims to obtain spikes at times specified by a teacher; the LTD component in that work also depends on the probability of stochastic activity. In sharp contrast, Toyoizumi et al. (2005) suggest that neuronal refraction is responsible for LTD.
Because the two models are quite similar, it seems unlikely that they truly make opposite predictions; the discrepancy may be due to Toyoizumi et al.'s reliance on analytical approximations to solve the mathematical problem at hand, which limits the validity of comparisons between that model and biological experiments.

It is useful to reflect on the philosophy of choosing reduction of spike train variability as a target function, as it so obviously has the degenerate but energy-efficient solution of emitting no spikes at all. The usefulness of our approach clearly relies on the stochastic gradient reaching a local optimum in the likelihood space that does not always correspond to the degenerate solution. We compute the gradient of the input weights with respect to the conditionally independent sequence of response intervals [t, t + Δ]. The gradient approach tries to push the probability of the responses in these intervals to either 0 or 1, irrespective of what the response is (not firing or firing). We find that in the sSRM spiking neuron model, this gradient can be toward either state of each response interval, which can be attributed
to the monotonically increasing spike probability density as a function of the membrane potential. This spike probability density allows neurons to become very reliable by firing spikes only at specific times, at least when starting from a set of input weights that, given the input pattern, is likely to induce a spike in the postsynaptic neuron. Because the target function is the reduction of postsynaptic spike train variability, it does predict that small inputs impinging on a postsynaptic target and causing only occasional firing would, on average, be reduced to zero by the weight updates.

We have modeled the experimental studies in some detail, beyond the level of detail achieved by other researchers investigating STDP. Even a model with an entirely heuristic learning rule has value if it obtains a better fit to the data than other models of similar complexity. Our model has a learning rule that goes beyond heuristic: the learning rule is derived from a computational objective. To some, this objective may not be as exciting as more elaborate objectives like information maximization. As it is, our model stands alone among the contenders in providing a first-principles account of STDP that fits the experimental data extremely well. Might there be a mathematically sexier model? We certainly hope so, but it has not yet been discovered. We reiterate the point that our learning objective is viewed as but one of many objectives operating in parallel. The question remains as to why neurons would respond in such a highly variable way to fixed input spike trains: a more deterministic threshold would eliminate the need for any minimization of response variability.
We can only speculate that the variability in neuronal responses may well serve other objectives, such as exploitation or exploration in reinforcement learning, or the exploitation of stochastic resonance phenomena (e.g., Hahnloser, Sarpeshkar, Mahowald, Douglas, & Seung, 2000). It is interesting to note that minimization of conditional response variability corresponds to one part of the equation that maximizes mutual information. The mutual information I between inputs X and outputs Y is defined as I(X, Y) = H(Y) − H(Y|X). Hence, minimization of the conditional entropy H(Y|X), our objective, together with the unsupervised objective of maximizing the marginal entropy H(Y), maximizes mutual information. The marginal-entropy objective is notoriously hard to compute (see, e.g., Bell & Parra, 2005, for an extensive discussion), whereas, as we have shown, conditional entropy minimization can be computed relatively easily via stochastic gradient descent. Indeed, in this light, it
Figure 11: STDP graphs for Δw computed using the terms n ≤ 2 (solid line) and n ≤ 3 (crossed solid line).
is a virtue of our model that we account for the experimental data with only one component of the mutual information objective (taking the responses in the experimental conditions as the set of responses Y). The relatively simple nature of the experiments that uncovered STDP lacks any interaction with other (output) neurons, and we may speculate that STDP is the degenerate reflection of information maximization in the absence of such interactions. If subsequent work shows that STDP can be explained by mutual information maximization (without the drawbacks of existing work, such as the rate-based treatment of Chechik, 2003, or the unrealistic autocorrelation function and difficulty of relating to biological parameters of Toyoizumi et al., 2005), this work contributes to teasing apart the components of the objective that are necessary and sufficient for explaining the data.

Appendix A: Higher-Order Spike Probabilities

To compute Δw, we stop at n = 2, as in the experimental conditions that we model, the contribution of responses with n > 2 spikes is vanishingly small. We find that the probability of three spikes occurring is typically < 10^−5, and the n = 3 term did not contribute significantly, as shown, for example, in Figure 11. Intuitively, it seems very unlikely that the gradient of the conditional response entropy is dominated by terms that are highly unlikely. This could be the case only if the gradient on the probability of getting three or more spikes were much larger than the gradient on getting, say, two spikes.
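The truncation argument can be illustrated with independent per-bin spike probabilities, a simplification of the sSRM used here only to show how quickly the probability mass decays with spike count; the bin count and per-bin probability are illustrative, not fit to the simulations.

```python
from itertools import combinations
from math import prod

def prob_n_spikes(p_spike, n):
    """Probability of exactly n spikes when each time bin fires
    independently with probability p_spike[t] (a simplification of the
    sSRM, used only to illustrate the truncation argument)."""
    bins = range(len(p_spike))
    total = 0.0
    for fired in combinations(bins, n):
        total += prod(p_spike[t] if t in fired else 1 - p_spike[t]
                      for t in bins)
    return total

# Example regime: five bins with a 1% chance of a spike each. The
# three-spike mass is below 1e-5, justifying truncation at n = 2.
p = [0.01] * 5
```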
Given the setup of the model, with an increasing probability of firing a spike as a function of the membrane potential, it is easy to see that changing a weight will change the probability of obtaining two spikes much more than the probability of obtaining three spikes. Hence, the entropy gradient from the components n ≤ 2 will be (in practice, much) larger than the gradient for the terms n = 3, n = 4, . . .. As we remarked before, in our simulation of the experimental setup, the probability of obtaining three spikes given the input was computed to be < 10^−5; the overall probability was computed at up to 10^−6. The probability of n = 4 was below the precision of our simulation.

Appendix B: Sum-Squared-Error Parameter Fitting

To compare the different STDP models in section 5, we use linear regression in the free parameters to minimize the sum-squared error between the model curves and the experimental data. For m experimental data points {(t_1, Δw_1), . . . , (t_m, Δw_m)} and a model curve Δw = f(t), we report for each model curve the sum-squared error for those values of the free parameters that minimize it:

E² = min_{free parameters} Σ_{i=1}^{m} (Δw_i − f(t_i))².
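When the only free parameter is an overall scale γ, so that f(t_i) = γ g(t_i), the regression has a closed form: γ* = Σ_i Δw_i g_i / Σ_i g_i². The sketch below is ours, with illustrative data rather than the paper's measurements.

```python
def fit_gamma(dw, g):
    """Least-squares scale for a model f(t_i) = gamma * g(t_i):
    gamma* = sum(dw_i * g_i) / sum(g_i ** 2)."""
    num = sum(w * x for w, x in zip(dw, g))
    den = sum(x * x for x in g)
    return num / den

def sse(dw, g, gamma):
    """Sum-squared error of the scaled model against the data."""
    return sum((w - gamma * x) ** 2 for w, x in zip(dw, g))
```

Models with additional free parameters (such as K and the LTD window in Chechik, 2003) require searching over those parameters as well, with γ fit in closed form at each candidate setting.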
For our model, linear regression is performed in the scaling parameter γ in equation 3.2 that relates the gradient obtained with the stated model parameters to the weight change. Where possible for the other models, we set model parameters to correspond to the values observed in the experimental conditions described in section 4. For the model by Chechik (2003), the weight update is computed as the sum of a positive exponential and a negative damping contribution,
Δw = γ [H(−Δt) exp(Δt/τ) − H(Δt + Θ) H(Θ − Δt) K],

where Δt is computed as t_pre − t_post, K denotes the negative damping contribution that is applied over a time window of width Θ before and after the postsynaptic spike, and H(·) denotes the Heaviside function. The time constant τ is related to the decay of the EPSP, and we set it to the same value we use for our model: 10 ms. Linear regression to find the minimal sum-squared error is performed on the free parameters γ, K, and Θ.
In Toyoizumi et al. (2005), the learning rule is the sum of two terms,

Δw = γ [ε²(t) − µ₀ (φ ∗ ε²)(t)],

where ε(t) is the EPSP, modeled as ε(t) ∝ exp(−t/τ), and µ₀(φ ∗ ε²)(t) involves the autocorrelation function φ of the neuron, times the spontaneous neural activity in the absence of input, µ₀. The EPSP decay time constant used in Toyoizumi et al. was already set to τ = 10 ms, and for the two terms in the sum we used the functions described by Figures 2A and 2B of Toyoizumi et al. We performed linear regression on the one free parameter, γ. Note that for this model, we obtain better LTD, and hence a better E², for larger values of µ₀ than those used in Toyoizumi et al. (2005); however, E² then still remains worse than for the other two models, and the spontaneous neural activity becomes unrealistically large.

Acknowledgments

We thank Tony Bell, Lucas Parra, and Gary Cottrell for insightful comments and encouragement. We also thank the anonymous reviewers for constructive feedback, which allowed us to improve the quality of our work and this article. The work of S.M.B. was supported by the Netherlands Organization for Scientific Research (NWO), TALENT S-62 588 and VENI 639.021.203. The work of M.C.M. was supported by National Science Foundation BCS 0339103 and CSE-SMA 0509521.

References

Abbott, L., & Gerstner, W. (2004). Homeostasis and learning through spike-timing dependent plasticity. In D. Hansel, C. Chow, B. Gutkin, & C. Meunier (Eds.), Methods and models in neurophysics: Proceedings of the Les Houches Summer School 2003. Amsterdam: Elsevier.
Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.
Bell, A., & Parra, L. (2005). Maximising sensitivity in a spiking network. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 121–128). Cambridge, MA: MIT Press.
Bi, G.-Q. (2002). Spatiotemporal specificity of synaptic plasticity: Cellular rules and mechanisms. Biol. Cybern., 87, 319–332.
Bi, G.-Q., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18(24), 10464–10472.
Bi, G.-Q., & Poo, M.-M. (2001). Synaptic modification by correlated activity: Hebb's postulate revisited. Ann. Rev. Neurosci., 24, 139–166.
Burkitt, A., Meffin, H., & Grayden, D. (2004). Spike timing-dependent plasticity: The relationship to rate-based learning for models with weight dynamics determined by a stable fixed-point. Neural Computation, 16(5), 885–940.
Chechik, G. (2003). Spike-timing-dependent plasticity and relevant mutual information maximization. Neural Computation, 15, 1481–1510.
Dan, Y., & Poo, M.-M. (2004). Spike timing-dependent plasticity of neural circuits. Neuron, 44, 23–30.
Dayan, P. (2002). Matters temporal. Trends in Cognitive Sciences, 6(3), 105–106.
Dayan, P., & Häusser, M. (2004). Plasticity kernels and temporal statistics. In S. Thrun, L. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Debanne, D., Gähwiler, B., & Thompson, S. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol., 507, 237–247.
Feldman, D. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Froemke, R., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Gerstner, W. (2001). A framework for spiking neuron models: The spike response model. In F. Moss & S. Gielen (Eds.), The handbook of biological physics (Vol. 4, pp. 469–516). Amsterdam: Elsevier.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neural learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Hahnloser, R. H. R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., & Seung, H. S. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405, 947–951.
Hennion, P. E. (1962). Algorithm 84: Simpson's integration. Communications of the ACM, 5(4), 208.
Herrmann, A., & Gerstner, W. (2001).
Noise and the PSTH response to current transients: I. General theory and application to the integrate-and-fire neuron. J. Comp. Neurosci., 11, 135–151. Hopfield, J., & Brody, C. (2004). Learning rules and network repair in spike-timingbased computation. PNAS, 101(1), 337–342. Izhikevich, E., & Desai, N. (2003). Relating STDP to BCM. Neural Computation, 15, 1511–1523. Jolivet, R., Lewis, T., & Gerstner, W. (2003). The spike response model: A framework to predict neuronal spike trains. In O. Kaynak, E. Alpaydin, E. Oja, & L. Yu (Eds.), Proc. Joint International Conference ICANN/ICONIP 2003 (pp. 846–853). Berlin: Springer. Kandel, E. R., Schwartz, J., & Jessell, T. M. (2000). Principles of neural science. New York: McGraw-Hill. Karmarkar, U., Najarian, M., & Buonomano, D. (2002). Mechanisms and significance of spike-timing dependent plasticity. Biol. Cybern., 87, 373–382. Kempter, R., Gerstner, W., & van Hemmen, J. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59(4), 4498–4514.
402
S. Bohte and M. Mozer
Kempter, R., Gerstner, W., & van Hemmen, J. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2742. Kepecs, A., van Rossum, M., Song, S., & Tegner, J. (2002). Spike-timing-dependent plasticity: Common themes and divergent vistas. Biol. Cybern., 87, 446–458. Legenstein, R., Naeger, C., & Maass, W. (2005). What can a neuron learn with spiketiming-dependent plasticity? Neural Computation, 17, 2337–2382. Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1, 402–411. ¨ Markram, H., Lubke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APS and EPSPS. Science, 275, 213–215. Nishiyama, M., Hong, K., Mikoshiba, K., Poo, M.-M., & Kato, K. (2000). Calcium stores regulate the polarity and input specificity of synaptic modification. Nature, 408, 584–588. Paninski, L., Pillow, J., & Simoncelli, E. (2005). Comparing integrate-and-fire models estimated using intracellular and extracellular data. Neurocomputing, 65–66, 379– 385. Pfister, J.-P., Toyoizumi, T., Barber, D., & Gerstner, W. (2006). Optimal spike-timing dependent plasticity for precise action potential firing. Neural Computation, 18, 1318–1348. ¨ otter, ¨ Porr, B., & Worg F. (2003). Isotropic sequence order learning. Neural Computation, 15(4), 831–864. Rao, R., & Sejnowski, T. (1999). Predictive sequence learning in recurrent neocortical ¨ circuits. In S. A. Solla, T. K. Leen, & K. Muller (Eds.), Advances in neural information processing systems, 12 (pp. 164–170). Cambridge, MA: MIT Press. Rao, R., & Sejnowski, T. (2001). Spike-timing-dependent plasticity as temporal difference learning. Neural Computation, 13, 2221–2237. Roberts, P., & Bell, C. (2002). Spike timing dependent synaptic plasticity in biological systems. Biol. Cybern., 87, 392–403. ¨ otter, ¨ Saudargiene, A., Porr, B., & Worg F. (2004). 
How the shape of pre- and postsynaptic signals can influence STDP: A biophysical model. Neural Computation, 16, 595–625. Senn, W., Markram, H., & Tsodyks, M. (2000). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Computation, 13, 35–67. Shon, A., Rao, R., & Sejnowski, T. (2004). Motion detection and prediction through spike-timing dependent plasticity. Network: Comput. Neural Syst., 15, 179–198. ¨ om, ¨ Sjostr P., Turrigiano, G., & Nelson, S. (2001). Rate, timing, and cooperativity jointly determine cortical synpatic plasticity. Neuron, 32, 1149–1164. Song, S., Miller, K., & Abbott, L. (2000). Competitive Hebbian learning through spiketime -dependent synaptic plasticity. Nature Neuroscience, 3, 919–926. ¨ om, ¨ P. J., Reigl, M., Nelson, S., & Chklovskii, D. B. (2005). Highly nonSong, S., Sjostr random features of synaptic connectivity in local cortical circuits. PLoS Biology, 3(3), e68. Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005). Spike-timing dependent plasticity and mutual information maximization for a spiking neuron model. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1409–1416). Cambridge, MA: MIT Press.
Reducing the Variability of Neural Responses
403
van Rossum, R., Bi, G.-Q., & Turrigiano, G. (2000). Stable Hebbian learning from spike time dependent plasticity. J. Neurosci., 20, 8812–8821. Xie, X., & Seung, H. (2004). Learning in neural networks by reinforcement of irregular spiking. Physical Review E, 69, 041909. Zhang, L., Tao, H., Holt, C., Harris, W., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Received April 6, 2005; accepted June 19, 2006.
LETTER
Communicated by Alexandre Pouget
Fast Population Coding

Quentin J. M. Huys
[email protected] Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.
Richard S. Zemel
[email protected] Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3H5
Rama Natarajan
[email protected] Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 3H5
Peter Dayan
[email protected] Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.
Uncertainty coming from the noise in its neurons and the ill-posed nature of many tasks plagues neural computations. Maybe surprisingly, many studies show that the brain manipulates these forms of uncertainty in a probabilistically consistent and normative manner, and there is now a rich theoretical literature on the capabilities of populations of neurons to implement computations in the face of uncertainty. However, one major facet of uncertainty has received comparatively little attention: time. In a dynamic, rapidly changing world, data are only temporarily relevant. Here, we analyze the computational consequences of encoding stimulus trajectories in populations of neurons. For the most obvious, simple, instantaneous encoder, the correlations induced by natural, smooth stimuli engender a decoder that requires access to information that is nonlocal both in time and across neurons. This formally amounts to a ruinous representation. We show that there is an alternative encoder that is computationally and representationally powerful in which each spike contributes independent information; it is independently decodable, in other words. We suggest this as an appropriate foundation for understanding time-varying population codes. Furthermore, we show how adaptation to temporal stimulus statistics emerges directly from the demands of simple decoding.

Neural Computation 19, 404–441 (2007)    © 2007 Massachusetts Institute of Technology
1 Introduction

From the earliest neurophysiological investigations in the cortex, it became apparent that sensory and motor information is represented in the joint activity of large populations of neurons (Barlow, 1953; Georgopoulos, Schwartz, & Kettner, 1983). There are by now substantial ideas and data about how these representations are formed (Rao, Olshausen, & Lewicki, 2002), how information can be decoded from recordings of this activity (Paradiso, 1988; Snippe & Koenderinck, 1992; Seung & Sompolinsky, 1993), and how various sorts of computations, including uncertainty-sensitive, Bayesian optimal statistical processing, can be performed through the medium of feedforward and recurrent connections among the populations (Pouget, Zhang, Deneve, & Latham, 1998; Deneve, Latham, & Pouget, 2001). Critical issues that have emerged from these analyses are the forms of correlations between neurons in the populations, whether these correlations are significant for decoding and computation, and what sorts of prior information are relevant to computations and can be incorporated by such networks.

However, many theoretical investigations into population coding have so far somewhat neglected a major dimension of coding: time. This is despite the beautiful and influential analyses of circumstances in which individual spikes contribute importantly to the representation of rapidly varying stimuli (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Reinagel & Reid, 2000; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Johansson & Birznieks, 2004) and the importance accorded to fast-timescale spiking by some practical investigations into population coding (Wilson & McNaughton, 1993; Schwartz, 1994; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Brown, Frank, Tang, Quirk, & Wilson, 1998). The assumption is often made that encoded objects do not vary quickly with time and that therefore spike counts in the population suffice. Even some approaches that consider fast decoding (Brunel & Nadal, 1998; Van Rullen & Thorpe, 2001) treat stimuli as being discrete and separate rather than as evolving along whole trajectories.

In this letter, we study the generic computational consequences of population coding in time. We analyze decoding in time as a proxy for computation in time, as it is the most comprehensive computation that can be performed (accessing all information present). Decoding therefore constitutes a canonical test (Brown et al., 1998; Zhang et al., 1998). We consider a regime in which stimuli are not static and create sparse trains of spikes. Decoding trajectory information from these population spike trains is thoroughly ill posed, and prior information about what trajectories are likely
comes to play a critical role. We show that optimal decoding with ecological priors formally couples together the spikes, making trajectory inference computationally very hard. We thus consider the prospects for neural populations to recode the information about the trajectory into new sets of spikes that do support simple computations. Phenomena reminiscent of adaptation emerge as a by-product of the maintenance of a computationally advantageous code. We analyze the extension of one of the simplest ideas about population codes for static stimuli (Snippe & Koenderinck, 1992) to the case of trajectories. This links a neurally plausible population encoding model with a naturally realistic gaussian process prior. Unlike some previous work on decoding in time (Brown et al., 1998; Zhang et al., 1998; Smith & Brown, 2003), we do not confine ourselves to recursively specifiable priors and can therefore treat smoother cases. It is these smooth priors that render decoding, and likely other computations, hard and inspire an energy-based (product of experts) recoding (Hinton, 1999; Zemel, Huys, Natarajan, & Dayan, 2005), which makes for readier probabilistic computation.

Section 2 starts with a simple encoding model. It introduces the need for priors, their shape, and analytical results for decoding in time. Section 3 shows how priors determine the form in which information is available to downstream neurons. We show that the decoder corresponding to the simple encoder can be extraordinarily complex, meaning that the encoded information is not readily available to downstream neurons. Finally, section 4 proposes a representation that has comparable power but for which decoding requires vastly less downstream computation.

2 A Gaussian Process Prior Approach

As a motivating example, consider tennis. The player returning a serve has to predict the position of the ball based on data acquired in fractions of seconds.
Experts compensate for the extraordinarily sparse stimulus information with a very rich temporal prior over ball trajectories and thus make predictions that are accurate enough to guarantee many a winning return.

Figure 1 illustrates the setting of this article more formally. It shows an array of neurons with partially overlapping tuning functions that emit spikes in response to a stimulus that varies in time. These could be V1 neurons responding to a target (the tennis ball) as it moves through their receptive fields, or hippocampal neurons with place fields firing as a rat explores an environment. The task is to decode the spikes in time, that is, recover the trajectory of the stimulus (the ball's position, say) based on the spikes, a knowledge of the neuronal tuning functions (cf. Brown et al., 1998; Zhang et al., 1998, for hippocampal examples), and some knowledge about the temporal characteristics of the stimulus (the prior). In Figure 1, the ordinate represents the stimulus space (here one-dimensional for illustrative purposes) and the abscissa, time. Neuron i has preferred stimulus s_i. If it emits a spike ξ_t^i at time t, a dot is drawn at position (t, s_i). The dots in Figure 1 thus represent the spiking activity of the entire population of neurons over time.

Figure 1: The problem: Reconstructing the stimulus as a function of time, given the spikes emitted by a population of neurons. When a neuron with preferred stimulus s_i emits a spike at time t, a black dot is plotted at (t, s_i). A few example tuning functions are shown in gray. The ordinate represents stimulus space, with each neuron being positioned according to its preferred stimulus s_i. The decoding problem is related to fitting a line through these points, which is achievable only if there is prior information about the line to be fitted (e.g., the order of a polynomial fit or the smoothness).

Our aim is to find, for each observation time T, a distribution over likely stimulus values s_T given all the spikes previous to T. This is related to fitting a line representing the trajectory of the stimulus through the points. It is a thoroughly ill-posed problem, for instance, because we are not given any information about the stimulus at all between the spikes. To solve this ill-posed problem, we have to bring in additional knowledge in the form of a prior distribution about likely stimulus trajectories. The prior distribution specifies the temporal characteristics of the trajectories (e.g., how smooth they are) and also whether they live within some constrained part of the stimulus space. Subjects are assumed to possess such prior information ahead of time—for instance, from previous exposures to trajectories (a good tennis player will have seen many serves). To gain analytical insight into the structure of decoding in this temporally rich case, we consider a very simple spiking model p(ξ_t^i | s_t) (cf. Snippe & Koenderinck, 1992, for the static case), augmented with a simple prior over stimulus trajectories p(s). We thereafter follow standard approaches (Zhang et al., 1998; Brown et al., 1998) by performing causal decoding and thus recovering p(s_T | ξ) over the current stimulus s_T at time T given all the J spikes ξ ≡ {ξ_{t_j}^{i_j}}_{j=1}^J at times 0 < {t_j}_{j=1}^J < T in the observation period
([0, T)), emitted by the entire population. Here, i = 1, . . . , N designates the neuron that emitted the spike. To state the problem in mathematical terms, we can write (at least for the case that there is no spike at time T itself)

p(s_T | ξ) ∝ p(s_T) p(ξ | s_T)    (2.1)
           = p(s_T) ∫ ds_𝒯 p(ξ | s_𝒯) p(s_𝒯 | s_T),    (2.2)

where, being slightly notationally sloppy, we are integrating over stimulus trajectories s_𝒯 up to, but not including, time T, but restricted to just those trajectories that end at s_T. Equation 2.2 lays bare the two parts of the definition of the problem. One is the likelihood p(ξ | s_𝒯) of the spikes given the trajectory. This will be assumed to arise from a Poisson-gaussian spiking model. The other is the prior

p(s_T) p(s_𝒯 | s_T) = p(s)    (2.3)
over the trajectories. This will be assumed to be a gaussian process.

2.1 Poisson-Gaussian Spiking Model. We first define the spiking model. Let φ_i(s) be the tuning function of neuron i and assume independent, inhomogeneous, and instantaneous Poisson neurons (Snippe & Koenderinck, 1992; Brown et al., 1998; Barbieri et al., 2004). Let j be an index running over all the spikes in the population, with i(j) reporting the index of the neuron that spiked at time t_j. Then, from the basic definition of an inhomogeneous Poisson process, the likelihood of a particular population spike train ξ given the stimulus trajectory s_𝒯 can be written as

p(ξ | s_𝒯) = ∏_j φ_{i(j)}(s_{t_j}) exp( − Σ_i ∫_0^T dt φ_i(s_t) )    (2.4)
           ∝ ∏_j φ_{i(j)}(s_{t_j}),    (2.5)
assuming that the trajectories are such that we can swap the order of the sum and the integral in the exp(·), that tuning functions are sufficiently dense that the summed spiking rate is constant, independent of the location of the stimulus s_t, and that no two neurons ever fire together.
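As a numerical companion to equations 2.4 and 2.5, the log likelihood of a population spike train can be evaluated on a time grid. This is only an illustrative sketch: it uses the gaussian tuning curves introduced next, and φ_max, σ, and the grid spacing are assumed values, not ones from the text.

```python
import numpy as np

# Sketch of the Poisson likelihood in equation 2.4, with gaussian tuning curves.
# phi_max, sigma, and the time grid are illustrative assumptions.
def log_likelihood(spike_times, spike_ids, s_traj, t_grid, s_pref,
                   phi_max=50.0, sigma=0.1):
    """log p(xi | s): sum of log rates at the spikes minus the integrated rate."""
    # phi_i(s_t) for every neuron i (rows) and grid time t (columns)
    rates = phi_max * np.exp(-(s_traj[None, :] - s_pref[:, None])**2
                             / (2.0 * sigma**2))
    # first term: sum_j log phi_{i(j)}(s_{t_j})
    idx = np.searchsorted(t_grid, spike_times)
    log_rates_at_spikes = np.sum(np.log(rates[spike_ids, idx]))
    # second term: - sum_i integral_0^T phi_i(s_t) dt (rectangle-rule approximation)
    dt = t_grid[1] - t_grid[0]
    integrated_rate = rates.sum() * dt
    return log_rates_at_spikes - integrated_rate
```

With dense tuning curves the integrated-rate term is nearly independent of the trajectory, which is exactly the assumption used to pass from equation 2.4 to equation 2.5.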
Finally, we assume squared-exponential (gaussian) tuning functions,

φ_i(s_{t_j}) = φ_max exp( − (s_{t_j} − s_i)² / (2σ²) ),

where φ_max is the maximal firing rate of a neuron and s_i the ith neuron's preferred stimulus. Combining this with our previous assumptions (see equation 2.5) and completing the square implies that

p(ξ | s_𝒯) ∝ φ_max^J exp( − (s_ξ − θ)ᵀ (s_ξ − θ) / (2σ²) ),    (2.6)
where the spikes from the entire population have been ordered in time; the jth component of both s_ξ and θ corresponds to the jth spike and is, respectively, the stimulus at that spike's time t_j and the preferred stimulus s_i of the neuron that produced it. Note that time is continuous here.

2.2 Gaussian Process Prior. The prior p(s) defines a distribution over stimulus trajectories that are continuous in time. However, p(ξ | s_𝒯) in equation 2.6 depends on only the times t_j at which neurons in the population spike. Thus, in the integral in equation 2.2, we can formally marginalize or integrate out all the nonspiking times, making the key quantity to be defined by the prior to be p(s_ξ, s_T). For a gaussian process (GP), this quantity is a multivariate gaussian, defined by its (J + 1)-dimensional mean vector m and covariance matrix C, which can in general depend on the times t_j. We write the distribution as

p(s_ξ, s_T) ∼ 𝒩(m, C),    C_{t_j t_{j′}} = c exp( − α |t_j − t_{j′}|^ζ ).    (2.7)
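Trajectories can be drawn from this prior by a Cholesky factorization of the covariance matrix; a minimal sketch, in which c, α, the time grid, and the jitter are illustrative assumptions rather than values from the text:

```python
import numpy as np

# Sampling trajectories from the gaussian process prior of equation 2.7.
# The values of c and alpha are illustrative assumptions.
def sample_gp_trajectories(t, c=0.05, alpha=30.0, zeta=2, n_samples=3, seed=0):
    """Draw trajectories with covariance C(t, t') = c * exp(-alpha * |t - t'|**zeta)."""
    rng = np.random.default_rng(seed)
    lags = np.abs(t[:, None] - t[None, :])
    C = c * np.exp(-alpha * lags**zeta)
    # a small jitter keeps the Cholesky factorization numerically stable
    L = np.linalg.cholesky(C + 1e-9 * np.eye(len(t)))
    return (L @ rng.standard_normal((len(t), n_samples))).T

t = np.linspace(0.0, 0.2, 200)
smooth = sample_gp_trajectories(t, zeta=2)  # smooth, non-Markovian samples
rough = sample_gp_trajectories(t, zeta=1)   # rough, OU-like samples
```

The ζ = 2 samples are visibly smooth while the ζ = 1 samples are rough, anticipating the two regimes discussed next.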
The parameter ζ ≥ 0 dictates the smoothness and the correlation structure of the process. If ζ = 0, then the stimulus is assumed to be constant (we sometimes call this the static case). Setting ζ = 1 corresponds to assuming that the stimulus evolves as an Ornstein-Uhlenbeck (OU) or first-order autoregressive process. This is the generative model underlying Kalman filters (Twum-Danso & Brockett, 2001) and generates an autocorrelation with the Fourier spectrum ∼1/ω², often observed experimentally (Atick, 1992; Dong & Atick, 1995; Wang, Liu, Sanchez-Vives, & McCormick, 2003). This can be generalized to nth-order autoregressive processes. Setting ζ = 2 leads to the opposite end of the spectrum, with smooth trajectories that are non-Markovian. The parameter α dictates the temporal extent of the correlations and c their overall size (c also parameterizes the scale of the overall process). Example trajectories drawn from these priors for ζ = {1, 2} are shown in Figure 2. For most of the letter, we will let m = 0. Assuming a GP prior with a particular covariance matrix is exactly equivalent to regularizing the autocorrelation of the trajectory.

Figure 2: Example trajectories drawn from the prior distribution in equation 2.7. (A) Examples for the smooth covariance matrix with ζ = 2. (B) The OU covariance matrix, ζ = 1.

2.3 Posterior. Making these assumptions, we can write down the posterior distribution p(s_T | ξ) analytically by solving equation 2.2. It is a simple gaussian distribution with mean µ(T) and variance ν²(T), given in terms of the tuning function width σ, the vector θ, and the covariance matrix C. All three terms in equation 2.2 are now defined. The conditional distribution p(s_ξ | s_T) is given in terms of the partitioned covariance matrix C,

p(s_ξ | s_T) = 𝒩_{s_ξ}( C_{ξT} C_{TT}^{−1} s_T , C_{ξξ} − C_{ξT} C_{TT}^{−1} C_{Tξ} ),

where C_{ξξ} is the covariance matrix of the stimulus at all the spike times, C_{Tξ} and C_{ξT} are vectors with the cross-covariances between the spike times and the observation time T, and C_{TT} is the marginal (static) stimulus prior at the observation time (constant for the stationary processes considered here). The corresponding partitioning of the matrix C is
C = [ C_{ξξ}  C_{ξT}
      C_{Tξ}  C_{TT} ].    (2.8)
The remaining two terms in equation 2.2 are given by p(s_T) = 𝒩_{s_T}(0, C_{TT}) and equation 2.6. As the integral in equation 2.2 is a convolution of two gaussians, the variances add, and the integral evaluates to

p(ξ | s_T) = 𝒩_θ( C_{ξT} C_{TT}^{−1} s_T , C_{ξξ} − C_{ξT} C_{TT}^{−1} C_{Tξ} + Iσ² ).

Finally, taking a product with p(s_T), renormalizing, and applying the matrix inversion lemmas (see appendix A), we get

µ(T) = k(ξ, T) · θ    (2.9)
ν²(T) = C_{TT} − k(ξ, T) · C_{ξT}    (2.10)
k(ξ, T) = C_{Tξ} (C_{ξξ} + Iσ²)^{−1}.    (2.11)
The mean µ(T) of the gaussian posterior is thus a weighted sum of the preferred stimuli of those neurons that emitted particular spikes. The weights are given by what we term the temporal kernel k(ξ, T). As we will see, the weight given to each spike will depend strongly on the time at which it occurred. A spike that occurred in the distant past will be given small weight. The posterior variance depends only on C and σ². Remember that C depends only on the times of spikes, not the identities of the neurons that fired them. The posterior variance ν², similar to a Kalman filter, thus depends only on when data are observed, not on what data. This depends on the squared-exponential nature of the tuning functions φ, and other tuning functions (e.g., with nonzero baselines) may not lead to this property. However, it will not affect the conclusions reached below. This posterior distribution p(s_T | ξ) is well known in the GP literature as the predictive distribution (MacKay, 2003).

2.4 Structure of the Code. The operations needed to evaluate the posterior p(s_T | ξ) give us insight into the structure of the code and will be analyzed in section 3 for various priors. If the posterior is a function of combinations of spikes, postsynaptic neurons have to have simultaneous access to all those spikes. This point will be critical in temporal codes, as the spikes to which access is required are spread out in time. Only if spikes are interpretable independently can they be forgotten once they have been used for inference. All information the spikes contribute to some future time T′ > T is then contained within p(s_T | ξ). If the posterior depends on combinations of spikes (as will be the case for ecological, smooth priors), information that can be extracted from a spike about times T′ > T is not entirely contained within p(s_T | ξ). As a result, past spikes have to be stored and the posterior recomputed using them—an operation that is nonlocal in time.
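Equations 2.9 to 2.11 translate directly into a few lines of code. The following is only a sketch: the prior hyperparameters c, α, ζ and the tuning width σ are illustrative assumptions, not values from the text.

```python
import numpy as np

# A direct implementation of equations 2.9 to 2.11.  The prior hyperparameters
# c, alpha, zeta and the tuning width sigma are illustrative assumptions.
def decode_posterior(spike_times, theta, T, c=0.05, alpha=30.0, zeta=2, sigma=0.1):
    """Gaussian posterior p(s_T | xi): returns mu(T), nu^2(T), and the kernel k."""
    def cov(a, b):
        return c * np.exp(-alpha * np.abs(a[:, None] - b[None, :])**zeta)
    C_xx = cov(spike_times, spike_times)             # C_{xi xi}
    C_Tx = cov(np.array([T]), spike_times).ravel()   # C_{T xi}
    C_TT = c                                         # stationary prior variance
    # equation 2.11 (a linear solve instead of an explicit matrix inverse)
    k = np.linalg.solve(C_xx + sigma**2 * np.eye(len(spike_times)), C_Tx)
    mu = k @ theta                                   # equation 2.9
    nu2 = C_TT - k @ C_Tx                            # equation 2.10
    return mu, nu2, k
```

Note that the matrix being inverted involves only spike times, so the posterior variance indeed depends on when spikes occurred, not on which neurons fired them.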
We will show that under ecological priors, the posterior depends on spike combinations and is thus complex. Decoding for the simple encoder (the spiking model) is thus hard. In section 4, we will illustrate the type
of computations (“recoding”) a network has to perform to access all the information. This will be equivalent to finding a new, complex encoder in time for which decoding is simple.

3 Effect of the Prior

The effect of the prior manifests itself very clearly in the temporal kernels k(ξ, T) from equation 2.11 and the independence structure of the code. We show this by analyzing a representative set of priors in terms of both the behavior of the temporal kernels and the structure of the code, including priors that generate constant, varyingly rough, and entirely smooth trajectories. (MATLAB example code can be downloaded from http://www.gatsby.ucl.ac.uk/~qhuys/code.html.)

3.1 Constant Stimulus Prior ζ = 0. We first show that our treatment of the time-varying case is an exact generalization of the case in which the stimulus is fixed (does not change relative to the mean m), by rederiving classical results for static stimuli. Snippe and Koenderinck (1992) have shown that the posterior mean and variance (under a flat prior) are given by a weighted spike count,

µ(T) = Σ_i n_i(T) s_i / J(T),    ν²(T) = σ² / J(T),    (3.1)
where n_i(T) = ∫_0^T dt ξ_t^i is the ith neuron's spike count and J(T) = Σ_i n_i(T) is the total population spike count at time T. If we let ζ = 0, the matrix C_{ξξ} = c n nᵀ, where n is a J(T) × 1 vector of ones. Equations 2.9 to 2.11 can then be solved analytically:
[(C_{ξξ} + Iσ²)^{−1}]_{ij} = ( (σ² + cJ(T)) δ_{ij} − c ) / ( σ² (σ² + cJ(T)) )

k(ξ, T) = c / (σ² + cJ(T)) n

µ(T) = c Σ_i n_i(T) s_i / (σ² + cJ(T))

ν²(T) = c σ² / (σ² + cJ(T)),
which is exactly analogous to equation 3.1 with an informative prior. The temporal kernel k(ξ, T) does not decay but is flat, with a magnitude proportional to 1/J(T). The contribution of each neuron to the mean µ(T) is given by its spike count n_i(T). Each spike is given the same weight, which is a sensible approach only if spikes are eternally informative about the stimulus. This is true only if the covariance matrix is flat, which itself implies that the only time-varying component of the stimulus is in the mean m and not the covariance C. If the stimulus is a varying function of time s(t), spikes at time t are informative only about the stimulus at times t′ close to t, and the influence of each spike on the posterior should fade away with time. This is illustrated in Figure 3. Figure 3A shows the present static case, where the stimulus does indeed not move; over time, the posterior p(s_T | ξ) sharpens up around the true value. However, if the stimulus does move, the posterior ends up at the wrong value (see Figure 3B). If the temporal kernel k(ξ, T) decays, this amounts to downweighting spikes observed in the more distant past. In the following, we analyze the behavior of p(s_T | ξ) and the optimal temporal kernel k(ξ, T) for various stimulus autocorrelation functions. Figure 3C shows that a decaying kernel leads to a posterior that widens in between spikes. This is incorrect if the stimulus is static, but Figure 3D shows how such a decaying temporal kernel would, in contrast to Figure 3B, allow p(s_T | ξ) to track the moving stimulus correctly.

Figure 3: Comparison of static and dynamic inference. Throughout, the posterior density p(s_T | ξ) is indicated by gray shading, the spikes are vertical (gray) lines with dots, and the true stimulus is the line at the top of each plot. (A) Static stimulus, constant temporal kernel. (B) Moving stimulus, constant temporal kernel. (C) Static stimulus, decaying temporal kernel. (D) Moving stimulus, decaying temporal kernel. A and D show that only a match between true stimulus statistics and prior allows the posterior to capture the stimulus well.
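The ζ = 0 closed forms above can be checked against a direct evaluation of equations 2.9 to 2.11; a sketch with made-up numbers (c, σ², and the spike pattern are assumptions):

```python
import numpy as np

# Checking the zeta = 0 closed forms against a direct evaluation of
# equations 2.9 to 2.11.  c, sigma^2, and the spikes are illustrative assumptions.
c, sigma2 = 0.05, 0.01
theta = np.array([0.2, 0.4, 0.2, 0.3, 0.2])  # preferred stimuli of the spiking neurons
J = len(theta)
ones = np.ones(J)

# general route: for a constant stimulus, C_{xi xi} = c n n^T and C_{T xi} = c n^T
k_general = np.linalg.solve(c * np.outer(ones, ones) + sigma2 * np.eye(J), c * ones)
mu_general = k_general @ theta
nu2_general = c - k_general @ (c * ones)

# closed forms from the text (a standard Sherman-Morrison computation)
k_closed = c / (sigma2 + c * J) * ones
mu_closed = c * theta.sum() / (sigma2 + c * J)
nu2_closed = c * sigma2 / (sigma2 + c * J)
```

The two routes agree, and the flat, identical weights in k make the eternal informativeness of each spike explicit.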
Figure 4: Posterior distribution p(s_T | ξ) for OU prior. Same representation as in Figure 1. The dashed line shows the actual stimulus trajectory used to generate the spikes, the dots are the spikes, the posterior distribution is in gray scale, and the solid line shows the posterior mean. Between spikes, the posterior mean decays exponentially back toward the mean m (here 0), and the variance approaches the static prior variance C_{TT}.
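Trajectories with the OU covariance (ζ = 1), like the one decoded in Figure 4, can be simulated directly. The sketch below uses the exact one-step update of an OU process, whose stationary covariance matches equation 2.7 with ζ = 1 (cf. equation 3.2 below); all parameter values are illustrative assumptions.

```python
import numpy as np

# Simulating the zeta = 1 (OU) prior of equation 2.7 via the exact one-step update
#   s(t + dt) = exp(-alpha * dt) * s(t) + sqrt(c * (1 - exp(-2 * alpha * dt))) * N,
# which has stationary covariance C(tau) = c * exp(-alpha * |tau|).
# All parameter values here are illustrative assumptions, not values from the text.
rng = np.random.default_rng(1)
c, alpha, dt, n_steps = 0.05, 10.0, 0.01, 100_000

decay = np.exp(-alpha * dt)
noise_sd = np.sqrt(c * (1.0 - decay**2))
s = np.empty(n_steps)
s[0] = rng.normal(0.0, np.sqrt(c))  # start in the stationary state
for i in range(1, n_steps):
    s[i] = decay * s[i - 1] + noise_sd * rng.normal()

def autocov(x, lag):
    """Empirical lag-k autocovariance of a zero-mean series."""
    return np.mean(x[: len(x) - lag] * x[lag:]) if lag > 0 else np.mean(x * x)
```

The empirical autocovariance of the simulated series decays as c·exp(−α|τ|), as the prior requires.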
3.2 Nonsmooth (Ornstein-Uhlenbeck) Prior ζ = 1. Setting ζ = 1 in the definition of the prior (see equation 2.7) corresponds to assuming that the stimulus evolves as a random walk with drift to zero (an OU process):

ds = −(1 − e^{−α}) s(t) dt + √( c (1 − e^{−2α}) dt ) dN(t),    (3.2)
with gaussian noise N(t) ∼ 𝒩(0, 1) and parameters as in equation 2.7. The OU process is the underlying generative process assumed by standard Kalman filters. The simplicity of Kalman-filter-like formulations explains some of its wide applicability and success (Brown et al., 1998; Barbieri et al., 2004). However, as indicated visually by the example trajectories in Figure 2, the rough trajectories this prior favors are not a good model of smooth biological movements (see also section 5). Figure 4 shows a sampled stimulus trajectory, sample spikes generated from it, and the posterior distribution p(s_T | ξ). The mean of the posterior does a good job of tracking the true underlying stimulus trajectory and is never more than two standard deviations away from it. Between spikes, the mean simply moves back to zero (albeit rather slowly given the parameters associated with the figure shown).

Figure 5A displays example temporal kernels k(ξ, T) for inference in this process. They are very close to exponentials (note the logarithmic ordinate).

Figure 5: OU temporal kernels for ζ = 1. (A) Example temporal kernels. The top traces are for lower and the bottom for higher average firing rates. The gray traces show temporal kernels for Poisson spike trains. The components of the vector k(ξ, T) are plotted against the corresponding spike time. The dashed black traces show temporal kernels for regular spike arrivals (metronomic temporal kernels). The true (gray) temporal kernels are relatively tightly bunched around the metronomic temporal kernel. The firing rate affects the slope of the kernel but not its overall scale. (B) The effect of the time since the last spike on the temporal kernel is an overall multiplicative scaling. There is no effect on the slope.

This makes intuitive sense, as an OU process is a first-order Markov process (it can be rewritten as a first-order difference equation). In fact, assuming the spikes arrive regularly (replacing each of the interspike intervals (ISIs) by their average value Δ = (1/J) Σ_j (t_j − t_{j−1}) ∝ 1/φ_max) allows us to write the jth component of k(ξ, T) as

k_j ≈ d_1 λ_1^{j−1},

where d_1 and λ_1 are constants defined in appendix B. For such metronomic spiking, k(ξ, T) is thus really simply a decaying exponential. Somewhat similar expressions can be obtained for the original case of Poisson-distributed ISIs (see appendix B). Figure 5A shows that the metronomic approximation provides a generally good fit, capturing especially the slope of the true temporal kernels, which depends mostly on the correlation length α and the maximal (or average) firing rate φ_max. The remaining quality of the fit is influenced most strongly by the match between Δ and the time since the last spike T − t_J (which takes its effects through C_{Tξ} in equations 2.8 and 2.9 to 2.11). This determines the overall scale of the temporal kernel. The factors influencing the slope of the temporal kernel and its height do not interact greatly; that is, T − t_J does not affect the slope (shape) of the temporal kernel, only its magnitude, as shown in Figure 5B (metronomic temporal kernels are used for clarity, but the argument applies equally to the exact kernel). Conversely, Δ affects mostly the slope. Replacing the true
temporal kernels by metronomic temporal kernels, that is, replacing all ISIs by Δ but keeping the time since the last spike T − t_J, does not greatly degrade p(s_T | ξ) (cf. Figures 6A and 6B).

Figure 6: Comparison between exact and metronomic kernels. Same representation as in Figure 4. (A) Exact posterior p(s_T | ξ). (B) Approximate posterior derived by replacing all ISIs by Δ but keeping T − t_J. This corresponds to approximating the true kernels with the metronomic kernels in Figure 5. The approximation is very close.

The dependence in Figure 5B can be understood by writing out the integrand of equation 2.2 in detail for the OU prior. This factorizes over potentials involving duplets of spikes because, as we show in appendix B, C^{−1} is tridiagonal, implying that the elements of C^{−1} involve only two successive spikes:

p(s_ξ, s_T) ∝ exp( − (1/2) (s_ξ, s_T)ᵀ C^{−1} (s_ξ, s_T) )
            = exp( − (1/2) ( Σ_{j=1}^{J+1} s_{t_j}² C^{−1}_{t_j t_j} + 2 Σ_{j=1}^{J} s_{t_j} C^{−1}_{t_j, t_{j+1}} s_{t_{j+1}} ) )

p(s_ξ, s_T) = ψ(s_T) ∏_{j=1}^{J} ψ(s_{t_j}, s_{t_{j+1}}),    (3.3)
where t_J stands for the time of the last spike, t_{J−1} for the time of the penultimate one, and so on, and the observation time is T = t_{J+1}.
Fast Population Coding
Figure 7: Natural trajectories are smooth. (A) Position of a rat freely exploring a square environment. (B) Covariance function of the position along the ordinate (gray, dashed line) and a quadratic approximation (black, solid line). Note the logarithmic ordinate. The smoothing applied to eliminate artifacts was of a timescale short enough not to interfere with the overall shape of the covariance function.
Note that the last equality implies that the determinant also factors over spike pairs. This means that the integrations over each spike in the main equation 2.2 can be written in a recursive form akin to that used in message-passing algorithms (MacKay, 2003) and the exact Kalman filter.

3.3 Smooth Prior ζ = 2. Setting ζ = 2 in the definition of the prior (see equation 2.7) corresponds to assuming that the stimulus evolves as a non-Markov random walk. Trajectories with this autocovariance function are smooth (Figure 2A shows some sample trajectories generated from the prior) and infinitely differentiable. The smoothness makes it a more ecologically relevant prior for Bayesian decoding from movement-related trajectories than nonsmooth priors, since natural objects (and limbs) move along smooth trajectories rather than jumping. As an example, Figure 7A shows trajectories of a rat exploring a square environment (data kindly provided by Lever, Wills, Cacucci, Burgess, & O'Keefe, 2002). Not only are these natural trajectories smooth, but Figure 7B also shows that a squared exponential covariance function closely approximates the real covariance function.¹
¹ Only the center of the covariance function is shown here. Due to the small size of the environment, the rat runs back and forth along the entire available length, and there are oscillating flanks to the covariance function for delays larger than those shown.
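The Markov/non-Markov distinction above is easy to check numerically. The following sketch (our own illustration, not the paper's code; the spike times, correlation length, and jitter term are arbitrary choices) verifies that the inverse covariance of the ζ = 1 (OU) prior evaluated at irregular spike times is tridiagonal, whereas the inverse for the ζ = 2 (smooth) prior is dense:

```python
import numpy as np

# Hypothetical irregular "spike times" and correlation length
t = np.array([0.05, 0.12, 0.20, 0.33, 0.41, 0.55])
alpha = 0.5

C_ou = np.exp(-np.abs(t[:, None] - t[None, :]) / alpha)      # zeta = 1 (OU)
C_se = np.exp(-((t[:, None] - t[None, :]) / alpha) ** 2)     # zeta = 2 (smooth)

P_ou = np.linalg.inv(C_ou)
P_se = np.linalg.inv(C_se + 1e-6 * np.eye(len(t)))           # jitter for conditioning

def max_off_tridiagonal(P):
    """Largest |entry| more than one diagonal away from the main diagonal."""
    i, j = np.indices(P.shape)
    return np.abs(P[np.abs(i - j) > 1]).max()

print(max_off_tridiagonal(P_ou))   # numerically zero: the OU prior is Markov
print(max_off_tridiagonal(P_se))   # large: the smooth prior couples all spikes
```

The tridiagonal structure holds for arbitrary, irregular spike times, which is exactly what makes the recursive decoder possible.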
Figure 8: Posterior distribution p(sT |ξ ) for the smooth prior. Same representation as in Figure 4. The arrow highlights where the smooth prior uses spike combinations to constrain higher-order statistics of the process, such as velocity, acceleration, and jerk. While the smooth prior correctly predicts that the stimulus will continue away from the mean before returning back, the OU process can predict a decay only back to the mean (see Figure 9). The first spike on the left is the very first spike observed. As the spike history becomes more extensive, the posterior distribution is seen to sharpen up and follow the stimulus accurately.
Figure 8 is the equivalent of Figure 4 for the smooth case and shows the posterior p(s_T |ξ). The main dynamical difference between inference in this smooth case and inference in the OU case is indicated by the arrow in the figure. While the OU process simply decays back to the mean (here, zero for simplicity), the dynamics of the smooth posterior mean are much richer. In the absence of spikes, the mean continues in its current direction for a while before reversing back. As can be seen, this gives a better fit to the underlying stimulus trajectory (the black dashed line) than would otherwise have been achieved. It arises directly from the fact that the correlations extend beyond the last spike (and into the entire past). For comparison, Figure 9 shows the posterior when the wrong prior is used. The stimulus was generated from the smooth prior, but the OU prior was used to infer the posterior. The arrow indicates where the infelicity of the inaccurate posterior is most apparent, falling back to zero instead of predicting that the stimulus will continue to move farther away from zero. In terms of difference equations, the larger extent of correlations intuitively means that the higher-order derivatives of the process are also “constrained” by the covariance C. The simple exponential temporal kernels observed in the OU process cannot give rise to the reversals observed in the smooth process. Figure 10A shows the temporal kernels for the smooth process, which have a distinctively different flavor from the OU temporal kernels (shown in Figure 5),
Figure 9: Posterior distribution p(sT |ξ ) for smooth stimulus but wrongly assuming an OU prior. The posterior is consistently wider than it should be (cf. Figure 8). The arrow points out where the prediction is qualitatively wrong: the OU prior allows for decay back only to zero, unlike the smooth prior. Note also that the beneficial effect of a larger spike history observed in Figure 8 is absent here.
Figure 10: Temporal kernels for the smooth prior. (A) Exact (gray solid) and metronomic (black dashed) temporal kernels for the smooth prior with ζ = 2. The metronomic kernels again provide a close fit. (B) The metronomic temporal kernels change in a complex manner as the observation time T is moved away from the time of the last spike. Unlike in the OU case, this is not just a recursively implementable multiplication. (C) The same qualitative behavior arises for kernels derived from the empirical covariance function of the rat trajectories.
including oscillating terms multiplying the exponential decay. Most important, the oscillating terms allow the weight assigned to a spike to dip below zero; that is, a spike initially signifies proximity of the stimulus to the neuron’s preferred stimulus but later swaps over, signaling that the stimulus is not there anymore. This feature of the temporal kernels gives rise to the reversals seen in the posterior mean.
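The decay-to-mean versus overshoot contrast above can be reproduced with a few lines of Gaussian-process conditioning. In this sketch (our own illustration; the kernels, observation times, and values are hypothetical), the posterior mean is evaluated just beyond the last observation:

```python
import numpy as np

def gp_mean(kern, t_obs, y, t_star):
    # Posterior mean of a zero-mean GP at t_star given noise-free observations:
    # m(t*) = k(t*, t_obs) K^{-1} y
    K = kern(t_obs[:, None] - t_obs[None, :])
    return kern(t_star - t_obs) @ np.linalg.solve(K, y)

alpha = 0.3
ou = lambda dt: np.exp(-np.abs(dt) / alpha)        # zeta = 1 (OU prior)
se = lambda dt: np.exp(-((dt / alpha) ** 2))       # zeta = 2 (smooth prior)

t = np.array([0.00, 0.10, 0.20])                   # hypothetical observation times
y = np.array([0.00, 0.20, 0.40])                   # stimulus moving away from the mean

m_ou = gp_mean(ou, t, y, 0.25)                     # falls back toward the mean (0)
m_se = gp_mean(se, t, y, 0.25)                     # continues past the last value
```

With these numbers, the smooth posterior mean at T = 0.25 exceeds the last observed value, while the OU mean has already dropped below it, mirroring the reversals discussed in the text.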
As in the OU case, the metronomic temporal kernel based on equal ISIs gives a good description of the temporal kernel mostly for spikes in the more distant past. Replacing the true temporal kernels by metronomic temporal kernels (but keeping the exact time since the last spike T − t_J) again does not affect the posterior strongly. Nevertheless, the Kullback-Leibler divergence between the true posterior and the metronomic posterior is larger in the smooth than in the OU case (data not shown), indicating that the exact timing of spikes is more important in the smooth inference. Unlike in the OU case, there is no simple analytical expression for the metronomic temporal kernel (let alone the true temporal kernel). In particular, Figure 10B shows that changing the time since the last observed spike T − t_J does not simply scale the temporal kernel, but also changes the shape of the temporal kernel (it produces a complicated phase shift of the oscillating component). Again, for clarity, the metronomic kernels are used as an illustration, but the argument also applies to the exact kernels. Local structure has complex global consequences in the smooth case, with a single new spike requiring individual reweighting of all past spikes depending on their precise times. By comparison, for the OU process, the reweighting involves multiplication by a single factor. Figure 10C shows that this temporal kernel complexity is also a feature of the temporal kernel derived from the covariance function of the empirical rat trajectories in Figure 7. The fundamental difference between the OU and the smooth temporal kernels arises from the difference in the factorization properties of the prior. Because the inverse of the covariance matrix for ζ ∉ {0, 1}, and specifically for ζ = 2, is dense, it does not factorize over spike combinations and therefore does not allow a recursive form.
To see that a recurrence relation is possible only for the OU prior, which factorizes across duplets of spikes, write (denoting by s_{∖J} the stimulus at the times of all the spikes apart from the last)

p(s_T |ξ) = ∫ ds_J ∫ ds_{∖J} p(s_T, s_J, s_{∖J} |ξ),

by expanding and integrating over the stimulus s_J at the time t_J of the last spike and over s_{∖J},

∝ ∫ ds_J p(s_T, s_J) p(ξ_J |s_J) ∫ ds_{∖J} p(s_{∖J}, ξ_{∖J} |s_T, s_J),

using Bayes' rule and the instantaneity of spiking,

= ∫ ds_J p(s_T, s_J) p(ξ_J |s_J) ∫ ds_{∖J} p(ξ_{∖J} |s_{∖J}) p(s_{∖J} |s_T, s_J),

again because the spikes are instantaneous,

= ∫ ds_J p(s_T, s_J) p(ξ_J |s_J) m_T(s_T, s_J, ξ_{∖J}).   (3.4)
Were m_T(s_T, s_J, ξ_{∖J}) independent of s_T, this would be exactly like a recursive update equation, with p(s_T, s_J) being the transition probability from the last observed spike to the inference time T, p(ξ_J |s_J) being the innovation due to the last observation (the likelihood of the last observed spike), and the message m_T(s_J, ξ_{∖J}) propagating the uncertainty from all the spikes other than the last to the last one. However, for general priors, p(s_{∖J} |s_T, s_J), and therefore also m_T(s_T, s_J, ξ_{∖J}), do depend on s_T, so all spikes have to be used to infer the posterior at each time T. To make m_T independent of s_T, the prior has to be Markov in individual spike timings, with

p(s_{∖J} |s_T, s_J) = p(s_{∖J} |s_J),   (3.5)

which makes

m_T(s_T, s_J, ξ_{∖J}) = ∫ ds_{∖J} p(ξ_{∖J} |s_{∖J}) p(s_{∖J} |s_J)   (3.6)
                      ≡ m_T(s_J, ξ_{∖J}),   (3.7)
which is indeed independent of s_T. So for the OU process, the last message m_T(s_J, ξ_{∖J}) merely needs to be multiplied by the transition probability (see Figure 5B). However, the smooth temporal kernel changes shape in a complex way (corresponding to the dependence of the message m_T(s_T, s_J, ξ_{∖J}) in equation 3.4 on s_T). Again, this means that all spikes have to be kept in memory for full inference. Note, finally, that this conclusion, and the fact that there is a recursive form for the OU process, do not depend on the particular spiking model assumed, verifying the assertion that the choice of squared exponential tuning functions, although mathematically helpful, does not pose limitations on our conclusions.

3.4 Intermediate (Autoregressive) Processes. There are cases intermediate to the smooth and the OU process that allow a partially recursive formulation. For instance, the metronomic OU process can be generalized to an autoregressive model of nth order by writing

s_t = Σ_{i=1}^{n} β_i s_{t−i} + √c η_t.   (3.8)
In this case, the inverse covariance matrix C −1 is (2n + 1)-diagonal (see appendix C), with entries determined directly by the βi . This implies that the posterior factorizes over cliques ψ involving n + 1 spikes (see equation 3.3), and that inference will be Markov in groups of n spikes. Zhang et al. (1998) find that a two-step Bayesian decoder, which is an AR(2) process in our terms, significantly improves decoding hippocampal place cell data.
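The banded structure can be made explicit by building the precision matrix directly from the AR coefficients. The sketch below (our own illustration, which ignores the initial-condition terms for the first n samples; the β values are arbitrary) confirms that the precision of an AR(2) path is 5-diagonal:

```python
import numpy as np

def ar_precision(beta, c, T):
    # For s_t = sum_i beta_i s_{t-i} + sqrt(c) eta_t, the joint log density is
    # -(1/2c) sum_t (s_t - sum_i beta_i s_{t-i})^2, i.e. precision = B^T B / c,
    # where B has ones on the diagonal and -beta_i on the i-th subdiagonal.
    B = np.eye(T)
    for i, b in enumerate(beta, start=1):
        B -= b * np.eye(T, k=-i)
    return B.T @ B / c

P = ar_precision(beta=[0.5, 0.3], c=1.0, T=8)    # an AR(2) example
i, j = np.indices(P.shape)
bandwidth = np.abs(i - j)[np.abs(P) > 1e-12].max()
print(bandwidth)   # 2, so P is (2*2 + 1)-diagonal
```

The entries within the band are simple polynomials in the β_i, which is why inference stays Markov in groups of n spikes.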
Figure 11: Autoregressive processes of increasing order. (A) Samples from processes of order n = {1, 2, 3, 5, 10} from top to bottom. The top process corresponds to an OU process. (B) Metronomic temporal kernels k(ξ , T) corresponding to the processes in A. The different lines (in varying shades of gray) correspond to increasing the observation time T as in Figures 5 and 10.
Figure 11A shows sample trajectories from such processes of increasing order. The coefficient vector β was set here such that the nth difference of the processes evolved as an OU process (see appendix C). The higher the order, the smoother the processes that can be generated, and the more oscillations are apparent in the temporal kernels. The OU and the smooth processes (see section 3.3) are at opposite ends of this spectrum, with tridiagonal and dense inverse matrices, respectively. The higher the order, the greater the complexity of the code. Indeed, the complexity grows exponentially (since groups of n spikes have to be considered and the number of such groups increases exponentially). While natural stimulus trajectories may not be infinitely differentiable, the exponential increase in complexity implies that any smoothness has great potential to render the code complex.

4 Expert Spikes for Efficient Computation

Complex codes, following, for instance, from the assumption of natural smooth priors, render the information inherent in the spikes hard to extract. Efficient computation in time requires access to all encoded information and
thus requires that the complex temporal structure of the code be taken into account. Here, we show that information present in the complex codes can be re-represented using codes that are straightforward to decode and use in key probabilistic computations. Specifically, we propose to decode each spike independently and multiply together the contributions from all spikes. This corresponds to treating each spike as an independent expert in a product of experts (PoE) setting (Hinton, 1999):

p̂(s_T |ξ) = (1/Z(T)) exp( Σ_i Σ_t g_i(s, t) ξ^i_{T−t} ).   (4.1)
That is, each time a spike ξ^i occurs, it contributes its same projection kernel exp(g_i(s, t)) to the posterior distribution p̂(s_T |ξ). To put it another way, for each spike, we add the same stereotyped contribution to the log posterior and then renormalize. From the discussion in the preceding sections, it is immediately apparent that the PoE approximation is a better approximation for the OU case than for the smooth case. In the following, we first derive an approximate analytical expression for separable projection kernels g_i(s, t) = f_i(s)h(t) based on metronomic spikes and the OU prior. We then remove any restrictions and derive nonparametric, nonseparable g_i(s, t) for both the OU and the smooth temporal kernel and show that these still perform better for the OU process than for the smooth process. Finally, we infer a new set of spikes ρ_ξ such that decoding according to the PoE model produces a posterior distribution p̂(s_T |ρ_ξ) that matches the true posterior distribution p(s_T |ξ) well for both OU and smooth priors.

4.1 Approximate Projection Kernels

4.1.1 Metronomic Projection Kernels. Section 3.2 showed that for the OU process, the weight accorded a spike is approximately a decreasing exponential function of the time elapsed since its occurrence and that replacing the true temporal kernels by the metronomic temporal kernels (without fixing the time since the last spike at the mean ISI) gives a qualitatively good approximation (see Figure 4). This suggests writing an approximate distribution with spatiotemporally separable projective kernels,

p̂(s_T |ξ) ∝ Π_i φ_i(s)^{Σ_t ξ^i_{T−t} e^{−βt}}   (4.2)
         = Π_i exp( log(φ_i(s)) Σ_t e^{−βt} ξ^i_{T−t} ),   (4.3)
Figure 12: Separable projection kernel for the OU process: Comparison of the true p(s_T |ξ) (A) and p̂(s_T |ξ) from equation 4.3 (B). The left arrows indicate where the variance of the approximate distribution diverges toward ∞ as T − t_J → ∞ rather than approaching C_TT. The right arrows show the effect of this on the mean of the approximate posterior, which returns to the prior mean m = 0 more rapidly than the true posterior.
to use exactly the form of equation 4.1. We can thus also write

p̂(s_T |A) ∝ Π_i φ_i(s)^{A_i(T)},   (4.4)
where A_i(T) can be seen as an equivalent “activity” of each neuron. The performance of this approximation is shown in Figure 12 for the OU process (see also Zemel et al., 2005). There are a few differences between Figures 4 and 12. Keeping the φ_i(s) as before, the variance of this approximation is ν̂²(T) = σ²/Σ_i A_i(T). As the last observed spike recedes into the past, this approaches infinity (left arrows in Figure 12), and the mean returns to zero (right arrows in Figure 12). This is different from the case of exact inference, which approaches the static prior with variance C_TT. The mean μ̂(T) = Σ_i s_i A_i(T)/Σ_j A_j(T) is always normalized and returns to zero more slowly than the variance increases. This introduces an inaccuracy, since the true OU temporal kernels (shown in Figure 5) are not normalized, Σ_t k_t(ξ, T) < 1, which arises because of the weight given to the spatial prior. For the smooth case, no simple approximation of the form of equation 4.3 is viable. This can be seen, for instance, from the fact that the smooth temporal kernels (see Figure 10) dip below zero (making it tricky to use them in products).
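For gaussian tuning curves φ_i(s) = exp(−(s − s_i)²/2σ²), the mean and variance formulas for the product of experts can be checked numerically. In this sketch (our own; the tuning centers and activities are invented), the posterior is evaluated on a grid and compared with the closed-form expressions:

```python
import numpy as np

sigma2 = 0.1                                  # tuning-curve variance
s_i = np.array([-0.5, 0.0, 0.4])              # hypothetical preferred stimuli
A = np.array([0.3, 1.2, 0.8])                 # hypothetical activities A_i(T)

# Numerical posterior on a grid: p_hat(s) proportional to prod_i phi_i(s)^{A_i}
s = np.linspace(-3.0, 3.0, 20001)
log_p = -(A[:, None] * (s[None, :] - s_i[:, None]) ** 2 / (2 * sigma2)).sum(0)
p = np.exp(log_p - log_p.max())
p /= p.sum()

mean_num = (s * p).sum()
var_num = ((s - mean_num) ** 2 * p).sum()

mean_th = (s_i * A).sum() / A.sum()           # mu_hat = sum_i s_i A_i / sum_j A_j
var_th = sigma2 / A.sum()                     # nu_hat^2 = sigma^2 / sum_i A_i
```

As the total activity Σ_i A_i shrinks, var_th grows without bound, which is the divergence indicated by the left arrows in Figure 12.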
Figure 13: Projection kernels inferred by equation 4.5 for OU (A) and smooth (B) priors. Stimulus trajectories and corresponding population spike trains ξ were generated until the update equations converged (approximately 2 · 10^4 spike trains). Both kernels have the shape of a difference of gaussians at t = 0 and fall off exponentially with time. There is little nonseparable structure in both cases.
4.1.2 Inferring Full Spatiotemporal Projection Kernels g_i(s, t). To apply expression 4.1 to the smooth case, we inferred g_i(s, t) in a nonparametric way by discretizing the time and space over which the distributions are defined and minimizing the Kullback-Leibler divergence between the discretized versions of p(s_T |ξ) and p̂(s_T |ξ) with respect to the projection kernels,

g_i(s, t) ← g_i(s, t) − ε ∇_{g_i(s,t)} D_KL( p(s_T |ξ) || p̂(s_T |ξ) ),   (4.5)

where D_KL( p(s) || q(s) ) = ∫ ds p(s) log( p(s)/q(s) ). Given that our approximation 4.1 is related to restricted Boltzmann machines (RBMs), it is not surprising that the gradient has a form akin to the wake-sleep algorithm (Hinton, Dayan, Frey, & Neal, 1995):

∇_{g_i(s,t)} D_KL( p(s_T |ξ) || p̂(s_T |ξ) ) = Σ_T [ p̂(s_T |ξ) − p(s_T |ξ) ] ξ_i(T − t).   (4.6)
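A toy, single-neuron version of updates 4.5 and 4.6 on a discretized space (our own sketch; the target posterior and spike history are invented, and only a single observation time is used) shows the KL divergence being driven toward zero:

```python
import numpy as np

S, L = 11, 4
s = np.linspace(-1.0, 1.0, S)                      # discretized space
xi = np.array([1.0, 0.0, 1.0, 0.0])                # spike history: xi[t] = spike t bins ago
p = np.exp(-(s - 0.3) ** 2 / 0.3)
p /= p.sum()                                       # invented target posterior p(s_T | xi)

def p_hat(g):
    # p_hat(s_T | xi) proportional to exp(sum_t g(s, t) xi[t]), as in equation 4.1
    u = g @ xi
    e = np.exp(u - u.max())
    return e / e.sum()

g = np.zeros((S, L))                               # projection kernel g(s, t)
eps = 0.5
for _ in range(4000):
    g -= eps * np.outer(p_hat(g) - p, xi)          # the gradient step of equation 4.6

kl = float(np.sum(p * np.log(p / p_hat(g))))       # driven toward zero
```

The update only touches the entries of g that the spike history actually exercises, which is why many stimulus/spike samples are needed to pin down the full kernels in Figure 13.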
Figure 14: Projection kernels are independent of contrast. The left-most panel shows an OU kernel for the same contrast (φmax) as in Figure 13; the contrast is doubled in the middle and quadrupled in the right panel. All these are off-center kernels with the same parameters as used in the other figures. Despite a slight slant toward the mean, the kernels are approximately separable.

Figure 13 shows the projection kernels inferred for the OU prior (see Figure 13A) and the smooth prior (see Figure 13B). Both start, for t = 0, with a spatial profile similar to a difference of gaussians (DOG), and then fall off as exponentials of time. The kernels g_i(s, t) shown here are for neurons i with s_i close to 0, the center of the gaussian prior over the trajectories. The projection kernels shown are for the same parameter settings as Figures 4 and 8, and the faster decay of the smooth projection kernels is due to the shorter correlation timescale. For the OU process, the kernels for neurons i with s_i > 0 become slightly slanted toward −1 over time (and the converse holds for those with s_i < 0) to capture the decay to the mean (zero), which is only a function of the distance from the mean. This effect is noticeable for the OU but very small for the smooth kernels. Figure 14 shows off-center OU kernels inferred for different contrasts (by varying φmax). As can be seen, the kernels are invariant to the contrast, and the slant effect is small. For the parameter range explored here, both projection kernels are approximately separable, indicating that the analytically derived motivation above may be close to optimal and that, in the PoE framework of equation 4.1, separable projection kernels may be the optimal choice even for the smooth prior. However, simply using these projection kernels to interpret the original spikes ξ results in an approximation that is far from perfect, especially in the smooth case. Figure 15 compares the true posterior distribution and that given by the approximation with the above projection kernels. The cost of independent decoding is quantified in Figure 15A using
⟨ (1/T) Σ_t D( p(s_T |ξ) || p̂(s_T |ξ) ) / H( p(s_T |ξ) ) ⟩_{p(s,ξ)},   (4.7)
where H(p) is the entropy of p and the average is over many stimulus trajectories s ∼ N(0, C) and spikes ξ ∼ p(ξ|s). This quantity can also be interpreted as a percentage information loss. It is larger for the smooth than for the OU process, showing that the OU process suffers much less from the approximation than the smooth prior. Visually, there are no gross differences between p(s_T |ξ) and p̂(s_T |ξ) for the OU prior (see Figures 15B and 15D). However, for the smooth prior, the arrows in Figures 15C and 15E indicate areas where a large mismatch is introduced by the independent treatment of the spikes, which discards all information contained in spike combinations. This mismatch is entirely to be expected.

Figure 15: Comparison of the true distribution p(s_T |ξ) and the approximate distribution p̂(s_T |ξ) given by equation 4.1 with projection kernels inferred by equation 4.5 and shown in Figure 13. Organization is the same as in previous figures. (A) ⟨(1/T) Σ_t D( p(s_T |ξ) || p̂(s_T |ξ) )/H( p(s_T |ξ) )⟩_{p(ξ,s)} ± 1 standard deviation for both priors. (B, C) p(s_T |ξ). (D, E) The corresponding p̂(s_T |ξ) for the same spikes. (B, D) A stimulus generated from the OU prior. (C, E) The smooth prior. p̂(s_T |ξ) is a good approximation for the OU prior but fails for the smooth prior. The arrows indicate where the approximation fails fundamentally in a similar way to that shown in Figure 9.
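The per-time-step quantity inside equation 4.7 is easy to state as a helper; this is our own paraphrase for a single time step, with distributions given as arbitrary positive weight vectors:

```python
import numpy as np

def fractional_info_loss(p, q):
    # One term of equation 4.7: KL(p || q) normalized by the entropy of p.
    # p is the true posterior, q the approximate posterior, on a common grid.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    dkl = np.sum(p * np.log(p / q))
    return dkl / (-np.sum(p * np.log(p)))

exact = np.array([0.1, 0.2, 0.4, 0.3])
print(fractional_info_loss(exact, exact))   # 0.0: no loss for a perfect decoder
```

Averaging this ratio over time steps and over stimulus/spike draws gives the quantity plotted in Figure 15A.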
4.2 Recoding: Finding Expert Spikes. The previous section has shown that an independent interpretation of spikes is more costly with the smooth than with the OU prior. In this section, we show that it is possible to find a new set of “expert” spikes ρ, such that each spike can be interpreted independently and the posterior distribution is matched closely for both the OU and the smooth prior. This recoding thus takes spikes ξ that are redundant in a decoding sense and produces a new set of spikes ρ that can be easily used for efficient neural computation because the decoding redundancy has been eliminated. We first infer real-valued activities aξ and then proceed to infer actual spikes ρ. We use neurally implausible methods to infer the new set of spikes ρ. In a companion paper we will explore the capability of neurally plausible spiking networks to do this recoding and to use the resulting simple code for probabilistic computations in time (see also Zemel, Huys, Natarajan, and Dayan, 2004).
Figure 16: Inferring activities A for the OU prior. (A) True posterior p(s_T |ξ). (B) Approximate posterior p̂(s_T |A), which matches arbitrarily well (for this example, ⟨D_KL⟩_T ∼ 10^{−5} and the entropy H_T ∼ 2, making the information loss I ∼ 10^{−5}). (C) Activities A for all neurons. The vertical black lines with dots indicate the original spike times ξ. Each thin line along the gray surface is the “activity” of one neuron as a function of time. There is a small amount of activity away from the spikes, but zeroing this affects the match between p(s_T |ξ) and p̂(s_T |A) only marginally.
4.2.1 Activities. Given a set of projection kernels g_i(s, t) from the previous section, we can go back and infer the optimal activities A ≥ 0 of the neurons by writing

p̂(s_T |A) ∝ exp( Σ_{i,t} A_i(T − t) g_i(s, t) ).   (4.8)
If we let A_i(T − t) = exp(B_i(T − t)) and minimize the Kullback-Leibler divergence from the true posterior with respect to B, we simultaneously enforce A ≥ 0:

B_i(t) ← B_i(t) − ε ∇_{B_i(t)} D_KL( p(s_T |ξ) || p̂(s_T |A) ).   (4.9)
The results of this procedure are shown for both the OU process (see Figure 16) and for the smooth process (see Figure 17). Figures 16A and 17A show the true spikes ξ and the corresponding distribution p(s_T |ξ). Figures 16B and 17B show the approximate distributions p̂(s_T |A) defined in equation 4.8 for the optimal activities A inferred with equation 4.9. The continuous nature of the activation functions means that they can contain as much information as the distribution itself, and indeed we find empirically that arbitrarily close matches are possible (exemplified by the two figures; in both cases ⟨D_KL⟩_T ∼ 10^{−5}). Figures 16C and 17C finally show the inferred activities A. Most important, we see that the inferred activities (one gray line for each neuron) are very sparse (in time and across neurons), suggesting that there might indeed be a set of (zero-one) spikes that leads to a good approximation via equation 4.1. On the other hand, the activities line up closely with the original spikes ξ (vertical black lines with dots), and it may be that the approximations with the original spikes in the previous paragraph already gave us the best possible approximation. For the OU prior (see Figure 16), the activities in spikeless times are extremely small, and zeroing them does not significantly worsen the approximation with p̂(s_T |A). However, for the smooth prior (see Figure 17), there is residual activity between the peaks, the zeroing of which significantly worsens the quality of the approximation by p̂(s_T |A) (data not shown). The very close approximation found here is not surprising. The distribution to be matched and the activities are discretized on the same spatial grid and are both positive, real quantities. As we have not imposed any constraints on A beyond positivity, the entropy of A can match any entropy in the distributions p(s_T |ξ). This, however, will change drastically when the activities are forced to be binary.

Figure 17: Inferring activities A for the smooth prior. (A) True posterior p(s_T |ξ). (B) Approximate posterior p̂(s_T |A), which matches arbitrarily well (for this example, ⟨D_KL⟩_T ∼ 10^{−5} and the entropy H_T ∼ 2, making the information loss I ∼ 10^{−5}). (C) Activities A for all neurons. The vertical black lines with dots indicate the original spike times ξ. Each thin line along the gray surface is the “activity” of one neuron as a function of time. There is a small amount of activity away from the spikes, which allows the approximation p̂(s_T |A) to “bend” between spikes. Unlike in the OU case, zeroing this small activity affects the match between p(s_T |ξ) and p̂(s_T |A) strongly.

4.2.2 Spikes via Simulated Annealing. To check whether there exists in fact a set of spikes ρ_ξ such that decoding according to equation 4.1 results in a posterior distribution p̂(s_T |ρ) that matches p(s_T |ξ) closely for the smooth
prior, we assume, as before, the projection kernels g_i(s, t) inferred above and find a set of spikes ρ that minimizes D_KL( p(s_T |ξ) || p̂(s_T |ρ) ), where p̂(s_T |ρ) is given by

p̂(s_T |ρ) = (1/Z(T)) exp( Σ_i Σ_t g_i(s, t) ρ^i_{T−t} ).   (4.10)

This is the same interpretation we gave the original spikes in equation 4.1. Our aim is thus to find a new set of spikes that satisfies

ρ = arg min_ρ D_KL( p(s_T |ξ) || p̂(s_T |ρ) ).   (4.11)
This is a highly nonconvex discrete problem, so we applied standard simulated annealing techniques.² Figure 18 shows the results. Figure 18A shows the distribution p(s_T |ξ) based on the original spikes. Figure 18B shows p̂(s_T |ρ) using the projection kernels shown in Figure 13. The arrow in Figure 18B indicates where the new set of spikes performs better than the original, independently interpreted, spikes and matches the shifting distribution by adding a new spike. This is a qualitative improvement on what is possible by interpreting the original spikes according to equation 4.1 (see Figure 15) and is most pronounced for the smooth case. From the close match between p(s_T |ξ) and p̂(s_T |A), we expect the overall minimum KL divergence to depend strongly on the projective kernels. Figure 18C shows the effect of scaling the inferred projection kernels g_i(s, t). The increase in accuracy due to an increase in the number of spikes offsets the decrease in accuracy due to the absence of prior information. The more spikes are allowed, the closer this scheme is to the one where rates are allowed. Thus, the prior has here literally been replaced by spikes; the input spikes have been “augmented” with spikes that represent information contributed by past spikes in accordance with a prior over stimulus trajectories. Figure 18D shows relative average KL divergences over 100 sets of new spike trains, for different scalings of the g_i(s, t). In general, the projection kernels found here form an overcomplete basis set. By scaling them down and allowing more spikes, we come closer to the setting in the previous section, where we allowed continuous activities rather than 0-1 spikes. Note the different coding strategy indicated by the arrows, especially in Figure 18C. Here, spikes are positioned such that they take into account what has already been expressed by previous spikes: spikes are positioned with respect to the distribution that has already been encoded. To put it another way, there are explicit relations among the new spikes that are not directly explained by the stimulus itself. Figure 18E shows this more clearly. The black traces show the stimulus-based (“signal”) correlations of the original spikes ξ. The gray lines show the correlations of the recoded spikes ρ. At lag 0 (bottom of the figure), flanks appear in the cross-correlation functions, but at greater lags, the cross-correlations are flatter for the recoded than for the original spikes. Requiring independently decodable spikes has introduced instantaneous correlations and flattened the spatial profile of cross-correlations over time, a sign of adaptation to temporal statistics. Thus, here we find that the maintenance of a simple code results in the emergence of what appear to be adaptive properties.

² From the very strong sensitivity of our simulated annealing results to the procedure used to reduce the temperature, we infer that the optima are not very well separated, with a number of similar sets of spike trains doing approximately equally well. We rendered the procedure more global by evaluating, at each step, the decrease in cost that would accompany switching every spike and accepting one of the best switches probabilistically.

Figure 18: Inferring new spikes for the smooth prior. (A) Original spikes with p(s_T |ξ). (B) p̂(s_T |ρ) with projection kernels g_i(s, t) given by equation 4.5. (C) p̂(s_T |ρ) with scaled projection kernels g_i(s, t)/4. Note the increased firing rate. Arrows are explained in the text. (D) Percentage change in KL divergences between p(s_T |ξ) and p̂(s_T |ρ) for projection kernels scaled by factors of 1/2 and 1/4, relative to the KL divergence of the unscaled kernel. (E) Cross-correlations between neurons for a few different time lags. Black lines: original spikes ξ. Gray lines: recoded spikes ρ for unscaled projection kernels. The autocorrelation has been scaled to unity, except at zero lag, where the autocorrelation was excluded and scaling was performed with respect to the maximal cross-correlation.
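A minimal sketch of the recoding problem 4.11 follows. This is not the paper's implementation: the paper's annealer evaluates the cost change for switching every spike at each step, whereas this toy proposes one random flip per step, and the kernel matrix G is a random stand-in for the g_i(s, t):

```python
import numpy as np

rng = np.random.default_rng(0)

S, L = 9, 6
G = rng.normal(size=(S, L))                       # hypothetical projection kernels

def p_hat(rho):
    # PoE readout of a binary spike vector, as in equation 4.10
    u = G @ rho
    e = np.exp(u - u.max())
    return e / e.sum()

rho_true = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
p = p_hat(rho_true)                               # target realizable by some spike set

def cost(rho):                                    # D_KL(p || p_hat(rho))
    return float(np.sum(p * np.log(p / p_hat(rho))))

rho = np.zeros(L)
c = cost(rho)
best = c
for k in range(2000):
    temp = 0.995 ** k                             # geometric cooling schedule
    cand = rho.copy()
    j = rng.integers(L)
    cand[j] = 1.0 - cand[j]                       # propose a single spike flip
    cc = cost(cand)
    if cc < c or rng.random() < np.exp(-(cc - c) / temp):
        rho, c = cand, cc
        best = min(best, c)
```

Since the target here is generated from an actual binary spike vector, the global optimum of the KL cost is exactly zero; with mismatched or downscaled kernels, the annealer instead trades prior information for extra spikes, as in Figure 18C.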
5 Discussion

Here, we have analyzed the structure of a Bayesian, optimal decoder in a simple, analytically tractable model. The results are a direct generalization of decoding in the static gaussian-Poisson encoding model (Snippe & Koenderink, 1992). We show that the structure of the decoder depends on the prior over stimulus trajectories in time, that realistic priors render decoding hard (nonlocal in time and space), and that an independent code in which information is readily available for computational purposes exists. Finally, we showed that apparently adaptive properties of a coding scheme may result from the requirement of a constant, simple code. We are currently working on a biologically plausible network that approximates this recoding and uses the resulting code for flexible probabilistic computations (see also Zemel et al., 2004). The main innovation in our work is the way we used the gaussian process prior over stimulus trajectories. Figure 7 indicates that the exponential prior with ζ = 2 is a good model of natural movements, as they tend to be smooth. Classically, natural stimuli have been characterized as having autocorrelation structures that fall off exponentially (Dong & Atick, 1995), corresponding to a power spectrum that falls off as a power law function of frequency (∝ 1/ω^b). However, smooth trajectories in time have a much faster (exponential) spectral falloff, ∝ exp(−ω²). Most previous work has assumed priors within the OU class (Brown et al., 1998; Smith & Brown, 2003; Barbieri et al., 2004; Kemere et al., 2004; Gao, Black, Bienenstock, Shoham, & Donoghue, 2002), perhaps because of the recursive formulation of decoding. However, Zhang et al. (1998) use a two-step Bayesian decoder corresponding to a second-order autoregressive process (AR(2)) with coefficients that fall off as squared exponentials (their equation 43).
This two-step decoder is much more competent than a one-step decoder (corresponding to an AR(1) process) on hippocampal place cell data. Of course, more complex priors are also possible. For instance, Kemere et al. (2004) point one way forward. They showed the benefits for decoding from motor cortex spikes of using a rich, modular prior based on separate models for each of the (seven) possible arm movements to be extracted.

In terms of our framework, Figure 15A illustrates the cost of neglecting the prior temporal structure and treating all spikes independently. The differences between inference in the smooth and OU cases (e.g., the overshoot in Figure 8, which is not seen in Figure 4) also indicate qualitatively what information is lost by applying Kalman-filter-like formulations to decoding. The absolute magnitude of this effect depends on the specifics of the true model and so remains an empirical question for psychophysical or physiological tests. If spikes are dense relative to the movement in the stimulus (the likelihood term in equation 2.2 dominates, either via very small noise (low σ) or high firing rates (large φmax)), the contribution of the
Fast Population Coding
433
prior will be small, and approximation by a recursive prior (as in Brown et al., 1998; Zhang et al., 1998) may suffice. However, if spikes are sparse, the prior will be more important and approximations more costly. Finally, in many cases, the correct prior can be acquired only from experience, which itself may be costly. While it is sensible to expect nervous systems to acquire detailed and correct informative priors (Körding & Wolpert, 2004; Körding, Ku, & Wolpert, 2004; Adams, Graf, & Ernst, 2004), it remains to be seen whether the incorporation of informative priors is generally feasible for decoding applications in engineering domains (e.g., brain-machine interfaces).

The historical approach to population coding is based on ideas of Fisher information. The Fisher information arises from notions of asymptotic normality where there are many "data": long spike-counting windows and many neurons. In the asymptotic limit, the posterior distribution is well approximated by a gaussian with width (J I_F)^{−1}, where J is the number of data points, or spikes in our case. This is a linear expansion in which each data point (spike) contributes the same amount, 1/I_F, to the variance of the posterior. We, like others before us (Brunel & Nadal, 1998; Bethge, Rotermund, & Pawelzik, 2002), are interested in the case where the population as a whole has emitted few spikes, as indicated in Figure 1, that is, in regimes far from the asymptotic limit. For us, spikes can contribute varied amounts: some (typically the most recent) very much more than others (the most distant). As an analog of Fisher information, it is possible to study the dependence of the posterior variance ν²(T) on the width σ² of the encoding tuning functions and on the dimensionality. In our simple model, we find results similar to previously reported ones (Snippe & Koenderinck, 1992; Zhang & Sejnowski, 1999) (data not shown).
However, as we are always in the sparse-spike limit, the information per spike is of most relevance, and the posterior variance is strictly increasing in σ, the width of the encoding tuning functions, independent of the dimensionality. If there were dense spiking, the population firing rate (Zhang & Sejnowski, 1999; Silberberg, Bethge, Markram, Pawelzik, & Tsodyks, 2004; DeCharms & Zador, 2000; Knight, 1972) might carry enough information to overwhelm any prior.

Three assumptions about the encoding model merit discussion. First, the bell-shaped form of the tuning functions, asymptoting at 0, is only very roughly realistic. However, the arguments in sections 3.2 and 3.3 about the recursive and nonrecursive structure of the decoder depend on the nature of the prior (ζ = 1 and ζ = 2) and so will generalize. The most fundamental change would be that the variance of the posterior would depend not only on spike timing but also on the relative tuning preferences of the neurons that emit the spikes. Second, we assume an instantaneous relationship between the rate of the inhomogeneous Poisson process and the stimulus. This should be formally straightforward to relax if the dependence on the stimulus history can be approximated by a linear filter (a discrete sum), as is standard for
linear-nonlinear-Poisson-like model neurons (Paninski, 2003). In that case, the likelihood term in equation 2.5 will become a function of the stimulus at a number of times, each of which enters equation 2.2. Each spike will thus contribute as many entries to the covariance matrix as its linear filter extends in time. The extra complexity of decoding in this case is not completely clear.

The third questionable assumption is that spiking is independent across time (the Poisson process assumption) and neurons. The actual degree of independence is, of course, hotly debated (Averbeck et al., 2006; DeCharms & Zador, 2000). In fact, the signal correlations that we assume induce some of the same issues for decoding as do the noise correlations discussed in those references. Thus, our work can be seen as casting this debate in a slightly different light by showing that it is possible to recode correlated spiking in a particular, independently interpretable form. Our companion paper (Natarajan et al., 2006) considers a network- (rather than a simulated-annealing-) based implementation of recoding. Note that Nirenberg, Carcieri, Jacobs, and Latham (2001) have studied independent interpretability (in the spikes of retinal ganglion cells) in a less model-dependent manner.

The advantage of independent interpretability (based on equation 4.1) is not confined to decoding. For instance, combining information from different modalities, as in multisensory integration (Ernst & Banks, 2002; Hillis, Ernst, Banks, & Landy, 2002) or sensorimotor integration (Zemel et al., 2005; Körding & Wolpert, 2004), becomes straightforward and requires only an addition in the log domain, or single-neuron (Chance, Abbott, & Reyes, 2002; Salinas & Abbott, 1996; Poirazi, Brannon, & Mel, 2003a, 2003b) or population (Deneve et al., 2001) multiplication. Recoding produces spike trains that lack temporal redundancy.
It is the exact dynamic analog of efficient coding approaches based on producing population activities lacking, for example, spatial correlations (Srinivasan, Laughlin, & Dubs, 1982; Atick, 1992; Nirenberg et al., 2001). As such, it displays phenomena that are strongly reminiscent of adaptation in the static domain. Note, however, that the rationale for adaptation here is different: it is a by-product of the requirement for a computationally efficient code. This rationale may find application outside our particular domain.

Finally, the probabilistic coding here arises only from the ill-posed nature of recovering the stimulus trajectory from a sparse set of noisy spikes. A more fundamental form of so-called computational uncertainty (Zemel, Dayan, & Pouget, 1998) arises in cases such as the aperture problem (Weiss & Adelson, 1998), when the information in the sensory input is ambiguous in a way that does not depend on noise. Various approaches to computational uncertainty have been suggested in the static case (Anderson, 1994; Barber, Clark, & Anderson, 2003; Zemel et al., 1998; Sahani & Dayan, 2003), but their extension to our dynamic framework remains an open problem.
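As a concrete instance of the cue-combination point above: with independently interpretable codes, fusing two estimates amounts to adding log posteriors, which for gaussians reduces to precision-weighted averaging. A minimal sketch (the numbers are illustrative, not from the text):

```python
def fuse(mu1, var1, mu2, var2):
    """Multiply two gaussian posteriors, i.e., add them in the log domain."""
    prec = 1.0 / var1 + 1.0 / var2         # precisions add
    mu = (mu1 / var1 + mu2 / var2) / prec  # precision-weighted mean
    return mu, 1.0 / prec

# Hypothetical visual and haptic estimates of the same stimulus value
mu, var = fuse(1.0, 0.5, 2.0, 1.0)
print(mu, var)  # fused mean lies between the cues; variance shrinks below both
```

The fused variance is always smaller than either input variance, which is the statistically optimal combination rule studied by Ernst and Banks (2002).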
Appendix A: Matrix Inversion Lemmas

For a matrix partitioned as in equations 2.8, or generally

E = ( A  B
      C  D ),

it can be shown by inverting E that the following equalities hold:

(D − C A^{−1} B)^{−1} = D^{−1} + D^{−1} C (A − B D^{−1} C)^{−1} B D^{−1}   (A.1)

(D − C A^{−1} B)^{−1} C A^{−1} = D^{−1} C (A − B D^{−1} C)^{−1}.   (A.2)

These identities allow us to perform the final multiplication p(ξ|s_T) p(s_T), here written in the log domain, and then to renormalize:

log(p(ξ|s_T) p(s_T)) = −(1/2) [ s_T ( C_TT^{−1} + C_TT^{−1} C_Tξ (C_ξξ − C_ξT C_TT^{−1} C_Tξ + Iσ²)^{−1} C_ξT C_TT^{−1} ) s_T
    − 2 s_T C_TT^{−1} C_Tξ (C_ξξ − C_ξT C_TT^{−1} C_Tξ + Iσ²)^{−1} · θ ] + const,

thus

ν²(T) = ( C_TT^{−1} + C_TT^{−1} C_Tξ (C_ξξ + Iσ² − C_ξT C_TT^{−1} C_Tξ)^{−1} C_ξT C_TT^{−1} )^{−1}.

Making the following substitutions,

A = C_ξξ + Iσ²,   B = C_ξT,   C = C_Tξ,   D = C_TT,

allows us to apply equation A.1 and write out the variance in equation 2.10. The mean, finally, is obtained by writing out

μ(t) = ν²(T) C_TT^{−1} C_Tξ (C_ξξ − C_ξT C_TT^{−1} C_Tξ + Iσ²)^{−1} · θ = (D − C A^{−1} B) D^{−1} C (A − B D^{−1} C)^{−1} · θ

and applying equation A.2 to directly yield equation 2.9.
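The two identities can be checked numerically (a sanity-check sketch, not part of the derivation; the block sizes and the symmetric choice C = Bᵀ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
# Well-conditioned symmetric positive-definite diagonal blocks keep every
# inverse below well defined
A = rng.standard_normal((n, n)); A = A @ A.T + n * np.eye(n)
D = rng.standard_normal((n, n)); D = D @ D.T + n * np.eye(n)
B = 0.5 * rng.standard_normal((n, n))
C = B.T  # symmetric partitioned matrix, as for the covariances in the text

inv = np.linalg.inv
lhs1 = inv(D - C @ inv(A) @ B)                                     # (A.1) left
rhs1 = inv(D) + inv(D) @ C @ inv(A - B @ inv(D) @ C) @ B @ inv(D)  # (A.1) right
lhs2 = inv(D - C @ inv(A) @ B) @ C @ inv(A)                        # (A.2) left
rhs2 = inv(D) @ C @ inv(A - B @ inv(D) @ C)                        # (A.2) right
print(np.allclose(lhs1, rhs1), np.allclose(lhs2, rhs2))  # True True
```

Equation A.1 is a form of the Woodbury identity; A.2 is the associated push-through identity.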
Appendix B: OU Process

Replacing each of the ISIs by their average value Δ̄, we get a Kac-Murdock-Szegő Toeplitz matrix for which the analytical inverse is (Dow, 2003):

C = c ( 1   ρ   ρ²  ρ³
        ρ   1   ρ   ρ²
        ρ²  ρ   1   ρ
        ρ³  ρ²  ρ   1 ),

C^{−1} = 1/(c(1 − ρ²)) (  1    −ρ     0     0
                         −ρ   1+ρ²   −ρ     0
                          0    −ρ   1+ρ²   −ρ
                          0     0    −ρ     1 ),

where ρ = exp(−αΔ̄). Rewriting equation 2.11 as k(ξ, T) = C_Tξ C_ξξ^{−1} · (1/σ²)(C_ξξ^{−1} + I/σ²)^{−1}, we note that C_Tξ C_ξξ^{−1} ≈ δ_{i1}; only the first component of this vector is one, and all others are zero. The second factor involves

A^{−1} = C_ξξ^{−1} + I/σ² = 1/((a − 1)σ²) (  a    −ρ     0     0     0
                                            −ρ   a+ρ²   −ρ     0     0
                                             0    −ρ   a+ρ²   −ρ     0
                                             0     0    −ρ   a+ρ²   −ρ
                                             0     0     0    −ρ     a ),

where a = (c/σ²)(1 − ρ²) + 1. We know A^{−1} A = I. Neglecting the prefactor for a moment, the first row of A (which is the one of interest) therefore has to satisfy the following recurrence relation:

A_{2,1} = (a A_{1,1} − 1)/ρ   (B.1)

A_{k+2,1} = (a/ρ + ρ) A_{k+1,1} − A_{k,1}   (B.2)

A_{N,1}/A_{N−1,1} = ρ/a.   (B.3)

Equation B.2 is a simple two-term linear recurrence equation and can be solved with boundary conditions given by equations B.1 and B.3. The characteristic equation of equation B.2 is r² − (a/ρ + ρ) r + 1 = 0, with real roots

λ_{1,2} = (1/2) [ a/ρ + ρ ± √((a/ρ + ρ)² − 4) ].

Including the boundary conditions leads to the solution

A_{n,1} = d₁ λ₁^{n−1} + d₂ λ₂^{n−1},

d₁ = [ a − λ₁ρ − (a − λ₂ρ) ((a λ₁ − ρ)/(a λ₂ − ρ)) (λ₁/λ₂)^{N−2} ]^{−1},

d₂ = (1 − d₁(a − λ₁ρ)) / (a − λ₂ρ).
One of the eigenvalues will always be greater than 1, the other less than 1, but both are positive. As C_ξξ is symmetric, so are A^{−1} and A, and the first column of A is equal to its first row, which we pick out by premultiplying with C_Tξ C_ξξ^{−1}. This vector A_{1,1:N} is exactly the sum of two exponentials we saw when using regular spikes to infer the temporal kernel k(ξ, T), and the nth component of k(ξ, T), k_n, is given by

k_n = [C_Tξ (C_ξξ + Iσ²)^{−1}]_n = (a − 1)σ² A_{n,1} = (a − 1)σ² (d₁ λ₁^{n−1} + d₂ λ₂^{n−1}).   (B.4)

If λ₁ is the larger eigenvalue, we see that the corresponding coefficient d₁ will be ≈ (λ₂/λ₁)^N, which is very small. The contribution of the larger λ will grow only very slowly and be seen only for the very distant spikes. On the other hand, d₂ will be ≈ 1/(a − λ₂ρ). For all intents and purposes, the temporal kernel will decay exponentially with a negative spike time constant log λ₂. Furthermore, if the second boundary condition (for time 0) is moved to −∞, the result is a pure exponential. Both the analytical and numerical kernels are plotted in Figure 5.
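Under the same assumptions (regular spikes, OU prior), both the near-δ structure of C_Tξ C_ξξ^{−1} and the single-exponential decay of the kernel in equation B.4 can be checked numerically. This is a sketch; the values of c, ρ, σ², and N are illustrative choices, and the decoding time is taken to be one interval after the most recent spike:

```python
import numpy as np

c, rho, sigma2, N = 1.0, 0.8, 0.25, 30
n = np.arange(1, N + 1)
# Kac-Murdock-Szego Toeplitz covariance between the spike-time stimulus values
C_xx = c * rho ** np.abs(n[:, None] - n[None, :])
# Covariance between s_T and each spike value (spike k is k intervals away)
C_Tx = c * rho ** n

# Markov property: C_Tx @ inv(C_xx) has support only on the nearest spike
w = C_Tx @ np.linalg.inv(C_xx)
print(np.round(w[:4], 6))  # only the first component is nonzero (= rho)

# Temporal kernel k_n = [C_Tx (C_xx + I sigma^2)^{-1}]_n, as in equation B.4
k = C_Tx @ np.linalg.inv(C_xx + sigma2 * np.eye(N))
ratios = k[1:20] / k[:19]
print(ratios.min(), ratios.max())  # nearly constant: single-exponential decay
```

The near-constant successive ratios reflect that the d₁ term is negligibly small away from the far boundary, leaving the pure λ₂ exponential.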
Relaxing the assumption of metronomic spiking gives a matrix A^{−1} that is still tridiagonal, but whose elements are no longer equal. Writing matrix C as

C = ( 1    a    ab   abd
      a    1    b    bd
      ab   b    1    d
      abd  bd   d    1 ),

C^{−1} = (  1/(1−a²)            −a/(1−a²)                    0                       0
           −a/(1−a²)   (1−a²b²)/((1−a²)(1−b²))           −b/(1−b²)                   0
               0               −b/(1−b²)        (1−b²d²)/((1−b²)(1−d²))          −d/(1−d²)
               0                   0                     −d/(1−d²)               1/(1−d²) ),   (B.5)

where a = c e^{−α|t₁−t₂|}, b = c e^{−α|t₂−t₃|}, and so on, leads to a set of equations similar to equations B.2 and B.3 but including more terms.

Appendix C: Autoregressive Processes of Second and Higher Order

An nth-order gaussian autoregressive sequence of length T as produced by equation 3.8 can be written as a sample from a multivariate normal distribution in the following way. Let b = [1, −β₁, −β₂, −β₃, . . . , −β_n] and let B_t = [0_t, b, 0_{T−n−t}], where 0_t stands for a vector of zeros of length t. The inverse covariance matrix of the process is given by

C^{−1} = Σ_{t=0}^{T−n−1} B_t B_tᵀ.   (C.1)
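Equation C.1 can be sanity-checked for the simplest case, n = 1 (an AR(1)/OU process with coefficient vector b = [1, −β]); the interior of the resulting matrix should show the tridiagonal structure of the OU inverse covariance from Appendix B, with the boundary rows left to the stationarity correction of equation C.4 below. A sketch with an illustrative β:

```python
import numpy as np

T, beta = 8, 0.8
b = np.array([1.0, -beta])   # n = 1: coefficient vector [1, -beta_1]

Cinv = np.zeros((T, T))
for t in range(T - 1):       # t = 0 .. T - n - 1, with n = 1
    B_t = np.zeros(T)
    B_t[t:t + 2] = b
    Cinv += np.outer(B_t, B_t)

# Interior diagonal 1 + beta^2, off-diagonal -beta: the OU (AR(1)) precision
# structure of Appendix B, up to the 1/(c(1 - rho^2)) prefactor
print(np.round(Cinv[:3, :3], 3))
```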
For the coefficients of b to define a stationary and finite process, C must be Toeplitz. One way of generating a finite process from the b is by letting the nth derivative of the process evolve as an OU process,

s_t^{(n)} = β₀ s_{t−1}^{(n)} + √c η_t,   (C.2)

in which case the coefficients of the vector b are given by

β_i = ⁿC_i (−β₀)^{i−1},   (C.3)

where ⁿC_i is the binomial coefficient. To enforce stationarity, we finally have to perform a subtraction,

C^{−1} = Σ_{t=0}^{T−1} B_t B_tᵀ − Σ_{t′=T−n}^{T} B_{t′}^{−1} (B_{t′}^{−1})ᵀ,   (C.4)
where we abuse notation and B^{−1} stands for B_t^{−1} = [0_t, β_n, β_{n−1}, . . . , β₁, 0_{T−n−t}].

Acknowledgments

We thank Jeff Beck, Sophie Denève, Peter Latham, Máté Lengyel, Liam Paninski, and Peggy Seriès for helpful discussions and for reading versions of the manuscript. This work was supported by the BIBA consortium (Q.H., P.D.), the UCL Medical School M.B./Ph.D. program (Q.H.), the Gatsby Charitable Foundation (P.D.), and the NSERC and CIHR NET programs (R.N., R.Z.).

References

Adams, W. J., Graf, E. W., & Ernst, M. O. (2004). Experience can change the light-from-above prior. Nat. Neurosci., 7(10), 1057–1058.
Anderson, C. H. (1994). Basic elements of biological computational systems. Int. J. Modern Physics C, 5(2), 135–137.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network, 3, 213–251.
Averbeck, B. B., Latham, P. E., & Pouget, A. (2006). Neural correlations, population coding and computation. Nat. Rev. Neurosci., 7(5), 358–366.
Barber, M. J., Clark, J. W., & Anderson, C. H. (2003). Neural representation of probabilistic information. Neural Comp., 15, 1843–1864.
Barbieri, R., Frank, L. M., Nguyen, D. P., Quirk, M. C., Solo, V., Wilson, M. A., & Brown, E. N. (2004). Dynamic analyses of information encoding in neural ensembles. Neural Comp., 16, 277–307.
Barlow, H. B. (1953). Summation and inhibition in the frog's retina. J. Physiol., 137, 69–88.
Bethge, M., Rotermund, D., & Pawelzik, K. (2002). Optimal short-term population coding: When Fisher information fails. Neural Comp., 14(2), 303–319.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252(5014), 1854–1857.
Brown, E. N., Frank, L. M., Tang, D., Quirk, M. C., & Wilson, M. A. (1998). A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. J. Neurosci., 18(18), 7411–7425.
Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Comp., 10, 1731–1757.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35(4), 773–782.
DeCharms, C., & Zador, A. (2000). Neural representation and the cortical code. Annu. Rev. Neurosci., 23, 613–647.
Deneve, S., Latham, P. E., & Pouget, A. (2001). Efficient computation and cue integration with noisy population codes. Nat. Neurosci., 4(8), 826–831.
Dong, D. W., & Atick, J. J. (1995). Statistics of natural time-varying images. Network: Computation in Neural Systems, 6, 345–358.
Dow, M. (2003). Explicit inverses of Toeplitz and associated matrices. Austr. New Zealand Industr. Appl. Math. J. E, 44, E185–E215.
Ernst, M. O., & Banks, M. S. (2002).
Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415(6870), 429–433.
Gao, Y., Black, M. J., Bienenstock, E., Shoham, S., & Donoghue, J. P. (2002). Probabilistic inference of hand motion from neural activity in motor cortex. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1983). Neuronal population coding of movement direction. Science, 233(4771), 1416–1419.
Hillis, J. M., Ernst, M. O., Banks, M. S., & Landy, M. S. (2002). Combining sensory information: Mandatory fusion within, but not between, senses. Science, 298(5598), 1627–1630.
Hinton, G. E. (1999). Products of experts. In Ninth International Conference on Artificial Neural Networks (ICANN 9) (Vol. 1, pp. 1–6). Piscataway, NJ: IEEE.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214), 1158–1161.
Johansson, R. S., & Birznieks, I. (2004). First spikes in ensembles of human tactile afferents code complex spatial fingertip events. Nat. Neurosci., 7, 170–177.
Kemere, C., Santhaman, G., Yu, B. M., Ryu, S., Meng, T., & Shenoy, K. V. (2004). Model-based decoding of reaching movements for prosthetic systems. In Proceedings of the 26th Annual International Conference of the IEEE EMBS (pp. 4524–4528). Piscataway, NJ: IEEE Press.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Körding, K. P., Ku, S. P., & Wolpert, D. M. (2004). Bayesian integration in force estimation. J. Neurophysiol., 92(5), 3161–3165.
Körding, K., & Wolpert, D. M. (2004). Bayesian integration in sensorimotor learning. Nature, 427(6971), 244–247.
Lever, C., Wills, T., Cacucci, F., Burgess, N., & O'Keefe, J. (2002). Long-term plasticity in the hippocampal place cell representation of environmental geometry. Nature, 426, 90–94.
MacKay, D. J. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.
Nirenberg, S., Carcieri, S. M., Jacobs, A. L., & Latham, P. E. (2001). Retinal ganglion cells act largely as independent encoders. Nature, 411, 698–701.
Paninski, L. (2003). Convergence properties of three spike-triggered analysis techniques. Network, 14(3), 437–464.
Paradiso, M. A. (1988). A theory for the use of visual orientation information which exploits the columnar structure of striate cortex. Biol. Cybern., 58(1), 35–49.
Poirazi, P., Brannon, T., & Mel, B. W. (2003a). Arithmetic of subthreshold synaptic summation in a model CA1 pyramidal cell. Neuron, 37, 977–987.
Poirazi, P., Brannon, T., & Mel, B. W. (2003b). Pyramidal neuron as 2-layer neural network. Neuron, 37, 989–999.
Pouget, A., Dayan, P., & Zemel, R. S. (2003). Inference and computation with population codes. Annu. Rev. Neurosci., 26, 318–410.
Pouget, A., Zhang, K., Deneve, S., & Latham, P. E. (1998). Statistically efficient estimation using population coding. Neural Comp., 10(2), 373–401.
Rao, R. P. N., Olshausen, B. A., & Lewicki, M. S. (Eds.) (2002).
Probabilistic models of the brain: Perception and neural function. Cambridge, MA: MIT Press.
Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. J. Neurosci., 20(14), 5392–5400.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Sahani, M., & Dayan, P. (2003). Doubly distributional population codes: Simultaneous representation of uncertainty and multiplicity. Neural Comp., 15, 2255–2279.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proc. Natl. Acad. Sci. USA, 93, 11956–11961.
Schwartz, A. B. (1994). Direct cortical representation of drawing. Science, 265, 540–543.
Seung, S. H., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proc. Natl. Acad. Sci. USA, 90(22), 10749–10753.
Silberberg, G., Bethge, M., Markram, H., Pawelzik, K., & Tsodyks, M. (2004). Dynamics of population rate codes in ensembles of neocortical neurons. J. Neurophysiol., 91(2), 704–709.
Smith, A. C., & Brown, E. N. (2003). Estimating a state-space model from point process observations. Neural Comp., 15, 965–991.
Snippe, H. P., & Koenderinck, J. J. (1992). Discrimination thresholds for channel-coded systems. Biol. Cybern., 66, 543–551.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B Biol. Sci., 216(1253), 427–459.
Twum-Danso, N., & Brockett, R. (2001). Trajectory estimation from place cell data. Neural Networks, 14, 835–844.
Van Rullen, R., & Thorpe, S. J. (2001). Rate coding versus temporal order coding: What the retinal ganglion cells tell the visual cortex. Neural Comp., 13(6), 1255–1283.
Wang, X.-J., Liu, Y., Sanchez-Vives, M. V., & McCormick, D. A. (2003). Adaptation and temporal decorrelation by single neurons in the primary visual cortex. J. Neurophysiol., 89(6), 3279–3293.
Weiss, Y., & Adelson, E. H. (1998). Slow and smooth: A Bayesian theory for the combination of local motion signals in human vision (A. I. Memo 1624). Cambridge, MA: MIT, Artificial Intelligence Laboratory.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261(5124), 1055–1058.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Comp., 10(2), 403–430.
Zemel, R. S., Huys, Q. J. M., Natarajan, R., & Dayan, P. (2005). Probabilistic computation in spiking populations. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1609–1616). Cambridge, MA: MIT Press.
Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol., 79, 1017–1044.
Zhang, K., & Sejnowski, T. J. (1999). Neuronal tuning: To sharpen or to broaden. Neural Comp., 11(1), 75–84.
Received October 17, 2005; accepted June 17, 2006.
LETTER
Communicated by Joshua Gold
The Basal Ganglia and Cortex Implement Optimal Decision Making Between Alternative Actions

Rafal Bogacz
[email protected] Department of Computer Science, University of Bristol, Bristol BS8 1UB, U.K.
Kevin Gurney
[email protected] Department of Psychology, University of Sheffield, Sheffield S10 2TP, U.K.
Neurophysiological studies have identified a number of brain regions critically involved in solving the problem of action selection or decision making. In the case of highly practiced tasks, these regions include cortical areas hypothesized to integrate evidence supporting alternative actions and the basal ganglia, hypothesized to act as a central switch in gating behavioral requests. However, despite our relatively detailed knowledge of basal ganglia biology and its connectivity with the cortex, and despite numerical simulation studies demonstrating selective function, no formal theoretical framework exists that supplies an algorithmic description of these circuits. This article shows how many aspects of the anatomy and physiology of the circuit involving the cortex and basal ganglia are exactly those required to implement the computation defined by an asymptotically optimal statistical test for decision making: the multihypothesis sequential probability ratio test (MSPRT). The resulting model of the basal ganglia provides a new framework for understanding the computation in the basal ganglia during decision making in highly practiced tasks. The predictions of the theory concerning the properties of particular neuronal populations are validated in existing experimental data. Further, we show that this neurobiologically grounded implementation of MSPRT outperforms other candidates for neural decision making, that it is structurally and parametrically robust, and that it can accommodate cortical mechanisms for decision making in a way that complements those in the basal ganglia.

1 Introduction

Recent experimental results have established that both the cortex and the basal ganglia are involved in decision making between alternative actions (Chevalier, Vacher, Deniau, & Desban, 1985; Deniau & Chevalier, 1985; Medina & Reiner, 1995; Redgrave, Prescott, & Gurney, 1999; Schall, 2001; Shadlen

Neural Computation 19, 442–477 (2007)
© 2007 Massachusetts Institute of Technology
The Brain Implements Optimal Decision Making
443
& Newsome, 2001; Smith, Bevan, Shink, & Bolam, 1998). However, it is necessary to distinguish between two phases of developing task competence (Ashby & Spiering, 2004; Shadmehr & Holcomb, 1997). In the acquisition or learning phase, actions appropriate in a given behavioral state (i.e., a combination of ongoing behavior and stimulus) are developed, usually driven by external reward. In this phase, the main difficulty lies in finding the optimal policy: the mapping between states and actions that maximizes reward (Sutton & Barto, 1998). Consistent with this requirement, there is a great deal of evidence that the basal ganglia act as a substrate for reinforcement learning (O'Doherty et al., 2004; Samejima, Ueda, Doya, & Kimura, 2005; Schultz, Dayan, & Montague, 1997), and several models for this process have been proposed (Doya, 2000; Frank, Seeberger, & O'Reilly, 2004; Montague, Dayan, & Sejnowski, 1996).

In contrast, in the proficient phase, the mapping between stimulus and appropriate response is well established, and the requirement is one of performing action selection or decision making. This consists of identifying the current behavioral state and executing the known appropriate action as soon as a certain level of confidence in the identification is reached (Gold & Shadlen, 2001, 2002). There is a great deal of evidence (reviewed below) for the involvement of the basal ganglia in action selection.

We view the hypotheses that the basal ganglia perform reinforcement learning and that they perform action selection as complementary. Thus, at any particular time, the basal ganglia perform action selection proficiently with respect to a suite of alternatives that have already been learned. Subsequent learning phases will modify that suite of alternatives and shape the profile of selections that can be made. In this article, we focus on the neural mechanisms underlying decision making in the proficient phase.
As such, this is the phase under discussion if no qualifier is specified. We return to the relation of our work to task acquisition and the learning phase in section 6.

1.1 Action Selection and Decision Making

Experimental data show that during the decision process in visual discrimination tasks, neurons in cortical areas representing alternative actions gradually increase their firing rate, thereby accumulating evidence supporting these alternatives (Schall, 2001; Shadlen & Newsome, 2001). Hence, the models of decision making based on neurophysiological data (Shadlen & Newsome, 2001; Wang, 2002) assume that there exist connections from neurons representing stimuli to the appropriate cortical neurons representing actions (these connections may develop during the months of training the animals undergo before these experiments). These cortical connections are assumed to encode the stimulus-response mapping. However, even in simple, highly constrained laboratory tasks, there will be more than one possible response, and so there is a problem of action selection in which the representation for the correct response has to take control of the animal's motor plant. In natural, ethological settings, this problem is
444
R. Bogacz and K. Gurney
exacerbated because, under these circumstances, there are usually multiple, complex sensory streams demanding a variety of behaviors.

The problem of action selection was recently addressed by Redgrave et al. (1999), who conceived of it as the resolution of conflict between command centers throughout the brain competing for behavioral expression. These authors examined the problem from a computational perspective and argued that competitions between brain centers vying for expression were best resolved by a central switch examining the urgency or salience of each action request, and that anatomical and physiological evidence pointed to the basal ganglia as the neural substrate for this switch. Thus, the basal ganglia receive widespread input from all over the brain (Parent & Hazrati, 1995a) and, in their quiescent state, send tonic inhibition to midbrain and brain stem targets implicated in executing motor actions, thus blocking cortical control over these actions (Chevalier et al., 1985; Deniau & Chevalier, 1985). Actions are thought to be selected when neurons in the output nuclei have their activity reduced (under control of the rest of the basal ganglia), thereby disinhibiting their targets (Chevalier et al., 1985; Deniau & Chevalier, 1985). The selection hypothesis for the basal ganglia has been tested in biologically realistic computational models in a variety of anatomical contexts by several authors (Brown, Bullock, & Grossberg, 2004; Frank, 2005; Gurney, Prescott, & Redgrave, 2001a, 2001b; Humphries & Gurney, 2002).

In sum, the research reviewed above indicates that during decision making among alternative actions, cortical regions associated with the alternatives integrate evidence supporting each one, and that the basal ganglia act as a central switch by evaluating this evidence and enabling those behavioral requests that are best supported (most salient).

1.2 Scope of the Letter
We have argued that, between bouts of learning (that is, during proficient phases of activity), the primary computational role of the basal ganglia is to act as an action selection mechanism, mediating resolution of the action selection problem by gating behavioral requests. It is this mode of basal ganglia operation that we address in this article. Biologically realistic network models (Gurney et al., 2001a, 2001b) have shown how the basal ganglia could perform the required selection computation. However, such models fail to elucidate possible analytic descriptions of the computation (i.e., selection) that would provide it with a theoretical grounding. This letter provides an analytic description of the function of a circuit involving the cortex and basal ganglia by showing how an optimal abstract decision algorithm maps onto the anatomy and physiology of this circuit.

The main goal of this letter is to provide a new algorithmic framework for understanding the computation in the basal ganglia in the proficient phase. The algorithm relates to computations being performed at the systems level of description of the basal ganglia, that is, considering the circuit to be a
set of interacting neuronal populations described by their overall firing rates (Dayan, 2001). This does not preclude the possibility that computations may be performed at other levels of description (Gurney, Prescott, Wickens, & Redgrave, 2004) dealing with microcircuits, membranes, or molecular signalling pathways. However, as far as computation performed by virtue of the organization of the basal ganglia in toto is concerned, we argue that this will have an integrity apparent at the systems level, while acknowledging that it must ultimately be consistent with lower-level models. In view of these methodological considerations, we therefore deliberately do not attempt to incorporate the overwhelming amount of knowledge available for the basal ganglia at the microanatomical, physiological, and molecular levels of description.

The letter is organized as follows. Section 2 reviews relevant background material concerning the neurobiology and theory of decision making in cortex and basal ganglia. Section 3 deals with the central technical argument and proposes how the basal ganglia may perform action selection in an optimal way. Section 4 shows how specific experimental predictions of the theory are verified by existing data. Section 5 compares the performance of the proposed model against other models of decision making. Section 6 discusses the relation of this work to other theories of action selection and to published experimental data.

2 Review of the Neurobiology and Theory of Decision Making

This section reviews the material critical for an understanding of our model. The theory of optimal decision making in perceptual tasks has hitherto been grounded almost exclusively in cortical mechanisms. In contrast, we develop the theory of decision making in this article with respect to the neural circuit involving both the cortex and the basal ganglia.
Thus, in proposing a neural mechanism for optimal decision making, we link two strands of research: that dealing with putative cortical decision mechanisms and that dealing with action selection in the basal ganglia. Elements from both areas are therefore required background material. First, we present the theory of optimal decision making, and then we review those aspects of basal ganglia anatomy and physiology critical for the model.

2.1 Decision Making and Cortical Integration. The neural basis of decision making in cortex has been studied extensively using single-cell recordings (Britten, Shadlen, Newsome, & Movshon, 1993; Kim & Shadlen, 1999; Schall, 2001). Typically these studies have used a direction-of-motion discrimination task with fields of drifting random dots, with response via saccadic eye movements. During these experiments, the mapping between dot movement direction and required response was kept constant for many weeks of training, so that these studies describe the proficient phase of task acquisition. After stimulus onset, neurons in cortical sensory areas (e.g.,
R. Bogacz and K. Gurney
area MT in the visual motion task) respond if their receptive fields encounter the stimulus and are appropriately tuned to the overall direction of motion (Britten et al., 1993; Kim & Shadlen, 1999). However, the instantaneous firing rates in MT are noisy, probably reflecting the uncertainty inherent in the stimulus and its neural representation. Further, this noise is such that decisions based on the activity of MT neurons at a given moment in time would be inaccurate, because the largest firing rate does not always indicate the direction of coherent motion in the stimulus. Therefore, a statistical interpretation is required. An often-used hypothesis (Gold & Shadlen, 2001, 2002) is that populations of neurons in MT encode evidence for a particular perceptual decision. To formalize this, denote the evidence supporting decision i (i is “left” or “right”) provided at time t by xi(t). Then, under the neural encoding hypothesis, xi(t) corresponds to the total activity of MT neurons selective for direction i at time t. The decision-making process can be defined as one of finding which xi(t) has the highest mean (Gold & Shadlen, 2001, 2002). To solve it, it appears that subsequent cortical areas are invoked to accumulate evidence over time. Thus, in the motion discrimination task, neurons in the lateral intraparietal area (LIP) and frontal eye field (FEF) (which are implicated in the response via saccadic eye movements) gradually increase their firing rate (Schall, 2001; Shadlen & Newsome, 2001) and could therefore be computing

    Y_i(T) = \sum_{t=1}^{T} x_i(t)    (2.1)

over the temporal interval [1, T] (where we assume for simplicity a discrete representation of time). The accumulated evidence Yi(T) may now be used in making a decision about which xi(t) has the highest mean.

2.2 Modeling the Decision Criterion. The above description of cortical integration leaves open a central question: When should a neural mechanism stop the integration and execute the action with the highest accumulated evidence Yi(T)? A simple solution to this problem is to execute an action as soon as any Yi(T) exceeds a certain decision threshold, yielding the so-called race model (Vickers, 1970). However, this model does not perform optimally. For example, in the case of a decision between two alternatives, it is more efficient to compute the difference between the accumulated evidence supporting the two alternatives and execute an action as soon as this difference crosses a positive or a negative decision threshold. This procedure is known as a random walk (Laming, 1968; Stone, 1960) or a diffusion (Ratcliff, 1978) model, and it may be shown to implement a statistical decision test known as the sequential probability ratio test (SPRT) (Barnard, 1946; Wald, 1947). The SPRT is optimal in the following sense: among all decision methods
allowing a certain probability of error, it requires the shortest period of sampling the xi—that is, it minimizes decision time (Wald & Wolfowitz, 1948).

2.3 The MSPRT. For more than two alternatives, there is no single optimal test in the sense that SPRT is optimal for two alternatives, but there are tests that are asymptotically optimal; that is, they minimize decision time for a fixed probability of error when this probability decreases to zero (Dragalin, Tartakovsky, & Veeravalli, 1999). These tests are the so-called multihypothesis SPRTs (MSPRTs) (Baum & Veeravalli, 1994; Dragalin et al., 1999), and for two alternatives, they simplify to the SPRT. While it has been shown that MSPRT may be performed in a two-layer connectionist network (McMillen & Holmes, 2006), the required complexities in this model militate against any obvious implementation in the brain (and, in particular, the cortex). We now introduce the MSPRT (Baum & Veeravalli, 1994). A decision among N alternative actions can be formulated in the same way as for the case of the two alternatives described in section 2.1. That is, it amounts to finding which xi(t) has the highest mean (Gold & Shadlen, 2001, 2002). Let us define a set of N hypotheses Hi such that, roughly speaking, each Hi corresponds to xi(t) having the highest mean. More precisely, we define Hi analogously to its definition for two alternatives (Gold & Shadlen, 2001, 2002); Hi is the hypothesis that the xi(t) come from independent and identically distributed (i.i.d.) normal distributions with mean µ+ and standard deviation σ, while the x_{j≠i}(t) come from i.i.d. normal distributions with mean µ− and standard deviation σ, where µ+ > µ−. Bearing in mind that we are integrating evidence up until some time T, denote the entirety of sensory evidence available up to T by input(T) = {xi(t) : 1 ≤ i ≤ N, 1 ≤ t ≤ T}.
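As a point of reference, the two-alternative SPRT of section 2.2, applied to evidence of this form, can be sketched in a few lines of simulation. All parameter values and names below are our own illustrative choices, not taken from the study:

```python
import random

def sprt_two_alternatives(mu_plus=1.0, mu_minus=0.5, sigma=1.0,
                          threshold=3.0, dt=0.001, seed=0):
    """Two-alternative SPRT as a random walk (section 2.2): accumulate the
    difference of the two evidence streams and stop as soon as it crosses
    +threshold (choose 1) or -threshold (choose 2)."""
    rng = random.Random(seed)
    diff = 0.0   # accumulated difference Y1(T) - Y2(T)
    steps = 0
    while abs(diff) < threshold:
        x1 = rng.gauss(mu_plus * dt, sigma * dt ** 0.5)   # evidence for alternative 1
        x2 = rng.gauss(mu_minus * dt, sigma * dt ** 0.5)  # evidence for alternative 2
        diff += x1 - x2
        steps += 1
    return (1 if diff > 0 else 2), steps * dt  # (choice, decision time in s)
```

Raising the threshold trades longer decision times for a lower error rate; this is exactly the trade-off that the SPRT optimizes.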
The MSPRT (Baum & Veeravalli, 1994) is equivalent to the following decision criterion at time T: for each alternative i, compute the conditional probability of hypothesis Hi given sensory inputs so far, Pi (T) = P(Hi |input(T)), and execute an action as soon as any of the Pi (T) exceeds a certain decision threshold. Hence, sensory information is gathered until the estimated probability of one of the inputs having the highest mean exceeds the decision threshold. Appendix A describes how Pi (T) can be computed on the basis of sensory evidence. In particular, it shows that the logarithm of Pi (T), which we denote by L i (T), is given by
    L_i(T) = y_i(T) - \ln \sum_{k=1}^{N} \exp(y_k(T))    (2.2)
where yi (T) is proportional to the accumulated evidence supporting action i. In particular, yi (T) = g ∗ Yi (T), where Yi (T) is the accumulated evidence
supporting action i (see section 2.1) and g∗ is a constant. We will refer to yi(T) as the salience of action i. Thus, MSPRT implies that an action should be selected as soon as any of the Li(T) exceeds a fixed decision threshold. Equation 2.2 is the basis for mapping MSPRT onto the basal ganglia. However, before we proceed with this process, we describe some of the intuitive properties of MSPRT, as implemented in equation 2.2. The right-hand side of equation 2.2 has two terms. The first term yi(T) is simply the salience and, on its own, describes a race model (Vickers, 1970). This term therefore incorporates information about the absolute size of the salience of the currently “winning” alternative. The second term in equation 2.2 occurs in subsequent analysis throughout the article, and we denote it by S(T), where

    S(T) = \ln \sum_{k=1}^{N} \exp(y_k(T))    (2.3)
The term S(T) includes summation over all alternatives and does not depend on i. S(T) therefore decreases the value of all Li(T) by the same amount, thereby increasing the minimum salience required for an action to be selected. It may therefore be thought of as representing response conflict, because its value is increased by more actions having high salience. In this way, S(T) allows incorporation of information about the difference between the salience of the currently winning alternative and its competitors. The degree of scaling of the salience required for action selection implied by the particular form of S(T) is critical for optimal decision making; it allows a much lower average decision time for fixed accuracy than when the scaling is not present (i.e., race model), as will be shown in section 5.

2.4 Basal Ganglia Connectivity. The basal ganglia connectivity used in our study contains the major pathways known to exist in basal ganglia anatomy and was based on that used in the model of Gurney et al. (2001a). Figure 1A shows this connectivity for the rat in cartoon form (for reviews of basal ganglia anatomy, see Gerfen & Wilson, 1996; Mink, 1996; Smith et al., 1998). Cortex sends excitatory projections to the striatum (Nakano, Kayahara, Tsutsumi, & Ushiro, 2000) and subthalamic nucleus (STN) (Smith et al., 1998). The striatum is the largest basal ganglia nucleus and is divided into two populations of projection neurons differentiated, inter alia, by their anatomical targets and preferential dopamine receptor type (Gerfen & Young, 1988). The neurons in one striatal subpopulation send focused inhibitory projections to the basal ganglia output nuclei: the substantia nigra pars reticulata (SNr) and entopeduncular nucleus (EP) (the homologue of the primate globus pallidus internal segment (GPi)). These striatal neurons are associated with D1-type dopamine receptors (Smith et al., 1998) and,
Figure 1: Comparison of the connectivity of the basal ganglia and a network implementing the multihypothesis sequential probability ratio test (MSPRT). (A) Connectivity of the basal ganglia nuclei and their cortical afferents in the rat (modified from Gurney et al., 2001a). Connections and nuclei denoted by dashed lines are not essential for the implementation of MSPRT. (B) Architecture of the network implementing MSPRT. The equations show expressions calculated by each layer of neurons.
together with their targets (SNr, EP), constitute the so-called “direct pathway” (in section 3 we associate this direct pathway with the first term, yi(T), in equation 2.2). Neurons in the other striatal population are also inhibitory, send focused projections to the globus pallidus (GP) (globus pallidus external segment, or GPe, in primate), and are associated with D2-type dopamine receptors (Smith et al., 1998). Neurons in the STN are glutamatergic and send diffuse excitatory projections to SNr/EP and GP (Parent & Hazrati, 1993, 1995a). The GP sends inhibitory connections to the output nuclei (Bevan, Smith, & Bolam, 1996). This second striatal population therefore gives rise to an indirect pathway to the output nuclei via GP and STN (in section 3, we propose that the nuclei traversed by the indirect pathway are involved in the computation of the term S(T)). The output nuclei send widespread inhibitory connections to the midbrain, brain stem (Faull & Mehler, 1978; Kha et al., 2001), and thalamus (Alexander, DeLong, & Strick, 1986).

2.5 Neuronal Selectivity in the Basal Ganglia. There is much evidence (reviewed below) pointing to a topographic representation of functionality within the basal ganglia and its associated thalamocortical circuitry, leading to the hypothesis that these circuits support a range of discrete channels associated with different actions. This concept will be important for the model; in section 3, we associate each channel with a term Li(T). At the largest scale of organization, Alexander et al. (1986) divided the loops from cortex, through basal ganglia, thalamus, and back to cortex, into five parallel, segregated circuits (cf. Nakano et al., 2000) associated with different functionality. Since this article treats the problem of action selection, we focus on the motor and oculomotor circuits and return to consider information processing in the limbic and two prefrontal circuits in section 6.
Within the motor circuit, studies of awake primates in behavioral tasks have established that all basal ganglia nuclei have somatotopic organization. Thus, Crutcher and DeLong (1984a, 1984b) have shown that neurons selective for arm, leg, and face are located within different parts of the striatum. Further, within each body part, there are clusters of neurons responding selectively before and during movement of individual joints (often only in a single direction) (Crutcher & DeLong, 1984a, 1984b). Similarly, Georgopoulos, DeLong, and Crutcher (1983) found neurons in other basal ganglia nuclei (STN, GP, EP) that were selective to the direction and speed of individual movements. These observations led Alexander et al. (1986) to propose that “the motor circuit may be composed of multiple, parallel subcircuits or channels concerned with movement of individual body parts,” which traverse all nuclei of the basal ganglia. The notion of channels was incorporated into the computational model of Gurney et al. (2001a), who proposed that each action is associated anatomically with a discrete neural population within each nucleus. Channels are therefore defined at the input nuclei (striatum and STN) as populations
innervated by the cortical afferents associated with each action. Channel populations in other nuclei (GP, EP, SNr) are then defined by focused projections from corresponding striatal populations.
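Before turning to the anatomical mapping, it is worth noting that the MSPRT decision rule of equations 2.2 and 2.3 reduces to a few lines of arithmetic on the saliences. The sketch below is our own illustration; the function names and the probability threshold are hypothetical choices, not part of the model specification:

```python
import math

def mspr_levels(y):
    """Equation 2.2: L_i(T) = y_i(T) - ln(sum_k exp(y_k(T))), computed with a
    numerically stable log-sum-exp. y is the list of saliences y_i(T)."""
    m = max(y)
    S = m + math.log(sum(math.exp(yk - m) for yk in y))  # S(T), equation 2.3
    return [yi - S for yi in y]

def mspr_decide(y, p_threshold=0.99):
    """Select action i as soon as P(H_i | input(T)) = exp(L_i(T)) exceeds the
    probability threshold; return the winning index, or None to keep sampling."""
    L = mspr_levels(y)
    best = max(range(len(y)), key=lambda i: L[i])
    return best if L[best] > math.log(p_threshold) else None
```

Because the exp(Li(T)) sum to one, at most one alternative can exceed any probability threshold above 0.5.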
3 The Basal Ganglia Implements Selection Using MSPRT

We now show how the MSPRT test defined by equation 2.2 may be performed in a biologically constrained network model of the basal ganglia. For simplicity of explanation, we first show how equation 2.2 maps onto a model of the basal ganglia including only a subset of the known anatomical connections (we exclude the connections marked by dashed lines in Figure 1A). Subsequently we demonstrate the mapping onto the model with the complete set of connectivity. The mapping between equation 2.2 and the network is shown graphically in Figure 1B. In our decomposition, each channel (see section 2.5) is associated with an action i and a term Li(T) in the MSPRT. Hence, we assume that there is a finite number N of available actions represented in a discrete (or “localist”) fashion (the topic of action representations is dealt with further in section 6). We note first that the Li(T) are always negative or equal to 0: because Li(T) = ln Pi(T) and the Pi(T) are probabilities, by definition Pi(T) ≤ 1, and so ln Pi(T) ≤ ln 1 = 0. Therefore, the Li(T) themselves cannot be represented as firing rates in neuronal populations (since neurons cannot have negative firing rates). This may be overcome by assigning the network output OUTi to −Li(T), that is,
    \mathrm{OUT}_i(T) = -y_i(T) + \ln \sum_{k=1}^{N} \exp(y_k(T))    (3.1)
The decision is now made whenever any output decreases its activity below the threshold. Notice that this is consonant with the supposed action of basal ganglia outputs in performing selection by disinhibition of target structures (Chevalier et al., 1985; Deniau & Chevalier, 1985). As described in section 1, we propose, along with others (Schall, 2001; Shadlen & Newsome, 2001), that quantities like yi (T), representing salience, are computed in cortical regions that project to basal ganglia. In the motion discrimination example (described in section 2.1), yi (T) would be computed in FEF, which is known to innervate the basal ganglia (Parthasarathy, Schall, & Graybiel, 1992). Since yi (T) is the product of the raw accumulated evidence Yi (T) and a scaling factor, g ∗ , we interpret g ∗ as the gain that cortex introduces in computing the salience (Brown et al., 2005). As shown in appendix A, the MSPRT algorithm specifies g ∗ exactly. Thus, there is an
optimal gain:

    g^* = \frac{\mu_+ - \mu_-}{\sigma^2}    (3.2)
where µ+, µ−, σ parameterize the cortical inputs (see section 2.3). We return to the question of parametric robustness with respect to gain later. Equation 3.1, describing the activity of the basal ganglia output nuclei, includes two terms, the first of which we propose is computed within the direct pathway, while the second term is within the pathway traversing STN and GP. The first term in equation 3.1, −yi(T), is an inhibitory component and cannot be supplied by cortex since its efferents are glutamatergic. We argue therefore that one function of the population of GABAergic striatal projection neurons with D1 receptors (see Figure 1A) is to provide an inhibitory copy of the salience signal to the output nuclei. Turning to the second term in equation 3.1, this is S(T) (defined in equation 2.3), which supplies an excitatory contribution to the output nuclei. A key aspect of S(T) is that it involves summing over channels. The source of excitation in the basal ganglia is the STN, which sends diffuse projections to the basal ganglia output nuclei (Parent & Smith, 1987). Thus, each output neuron receives many afferents from widespread sources within STN, and so it is plausible that they are performing a summation over channels. In the network model, this is reflected in the fact that neurons in each channel i of the output nuclei compute the quantity OUTi(T) = −yi(T) + Σ(T), where

    \Sigma(T) = \sum_{i=1}^{N} \mathrm{STN}_i(T)    (3.3)
The model then implements MSPRT if Σ(T) = S(T). We now show, first in outline and then more rigorously, how the form of STNi(T) required in order to ensure Σ(T) = S(T) may be enabled by the interaction between STN and GP and the characteristic transfer functions of their neurons. A first correspondence between equations 2.3 and 3.3 involves summation over channels. Second, since STN receives input yi(T) from the cortex, this suggests that the STN firing rate should be proportional to the exponent of its input. We also propose that the logarithm in equation 2.3 comes from interactions between STN and GP. The log transform may be thought of as a compression of the range of STN activity, plausibly derived from GP inhibition, since this is, in turn, under STN control. Thus, rather than supplying a fixed decrement in STN activity through a fixed level of inhibition, GP increases its inhibition in response to increased activity in STN. We now formalize these requirements, resulting in quantitative predictions about the input-output relations of STN and GP neurons. First, we require that the firing rate of neurons in STN is proportional to an
The Brain Implements Optimal Decision Making
453
exponential function of its inputs:

    \mathrm{STN}_i(T) = \exp(y_i(T) - \mathrm{GP}_i(T))    (3.4)
Since STN projects diffusely to GP (Parent & Hazrati, 1995b), we assume that the STN input to GP channel i is Σ(T) rather than STNi(T). The required log transform is obtained by supposing that the firing rate of GP channel i, GPi(T), is given by

    \mathrm{GP}_i(T) = \Sigma(T) - \ln \Sigma(T)    (3.5)

since, substituting equation 3.5 into equation 3.4, summing over i, and solving for Σ(T) then yields Σ(T) = S(T). In summary, an implementation of MSPRT defined by equations 3.1 to 3.5 may be realized by a subset of basal ganglia anatomy, defined in Figure 1B, if the behavior of neurons in STN and GP follows equations 3.4 and 3.5. As described so far, the model lacks two known pathways within the basal ganglia that were shown in Figure 1A. First, the GP functionality defined in equation 3.5 omits afferents from striatal projection neurons associated with D2-type dopamine receptors. Second, GP projections to the output nuclei have not been included in equation 3.1. It has been proposed that these pathways play a critical role in the learning phase, when they block actions that have been punished (Frank et al., 2004). This function is not included in our model, because we address only the computation in the proficient phase. Appendix B shows that incorporation of these pathways into an anatomically more complete scheme still admits a model of basal ganglia that supports MSPRT. Therefore, the model with all pathways shown in Figure 1A also achieves the optimal performance of the MSPRT.

4 Predicted Requirements for STN and GP Physiology Are Validated by Existing Data

In this section we compare the predictions of equations 3.4 and 3.5, concerning the firing rates of STN and GP neurons as a function of their input, with published experimental data. In order to make this comparison, model variables (e.g., yi(T), STNi(T), GPi(T)) are assumed to be proportional to experimentally observed neuronal firing rates. Note, however, that proportionality constants are not uniquely specified by the model because a change in any such constant for a particular nucleus can be absorbed by rescaling the weights in projections from this nucleus to other areas. (The use of interpathway weights is illustrated in the anatomically more complete model described in appendix B.)
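Returning briefly to the mapping of section 3, its self-consistency is easy to check numerically: with the GP rate set to Σ(T) − ln Σ(T) and the STN rate to exp(yi(T) − GPi(T)), the summed STN output reproduces S(T) exactly, so the output nuclei compute −Li(T). The sketch below is our own illustrative check, not a dynamical simulation of the circuit; it assumes S(T) > 0, which holds whenever the summed exponentiated saliences exceed 1:

```python
import math

def output_nuclei(y):
    """Evaluate the network of Figure 1B at its self-consistent state for
    saliences y_i(T); returns OUT_i(T) = -y_i(T) + Sigma(T) (equation 3.1)."""
    S = math.log(sum(math.exp(yi) for yi in y))       # S(T), equation 2.3
    GP = [S - math.log(S) for _ in y]                 # equation 3.5, with Sigma(T) = S(T)
    STN = [math.exp(yi - g) for yi, g in zip(y, GP)]  # equation 3.4
    Sigma = sum(STN)                                  # equation 3.3
    assert abs(Sigma - S) < 1e-9                      # self-consistency: Sigma(T) = S(T)
    return [-yi + Sigma for yi in y]
```

The channel with the highest salience has the lowest output, consistent with selection by disinhibition of the corresponding target structure.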
The forms for STN and GP functionality given in equations 3.4 and 3.5 were derived on the basis of the known anatomy of the basal ganglia and the
assumption that the network involving cortex and basal ganglia implements MSPRT. Since we did not use the physiological properties of STN and GP neurons in deriving equations 3.4 and 3.5, these equations represent predictions of the model for the physiological properties of STN and GP, thereby providing an independent means for testing the model. These predictions are very strong; in particular, the theory of section 3 implies that the firing rate of STN neurons should be proportional to the exponent of its input. Such a relation is highly unusual in most neural populations. Furthermore, such a relation is very different from the STN input-output relations assumed by other models: Gurney et al. (2001a) assume a piecewise linear relation, while Frank et al. (2005) assume a sigmoid relation. The response properties of STN neurons have been studied extensively (Hallworth, Wilson, & Bevan, 2003; Overton & Greenfield, 1995; Wilson, Weyrick, Terman, Hallworth, & Bevan, 2004). Typically they have nonzero spontaneous firing and can achieve unusually high firing rates. Our proposed exponential form for firing rate as a function of input (see equation 3.4) explains these features since, in the absence of input, the model gives nonzero (unity) output and exp(.) is a rapidly growing function yielding potentially high firing rates. In order to test the prediction of equation 3.4 quantitatively, we fitted exponential functions to firing rate data in the literature. Figure 2H shows the pooled results of this exercise based on two studies (Hallworth et al., 2003; Wilson et al., 2004). The fit to an exponential function is a good one, consistent with the prediction in equation 3.4. Second, the theory makes predictions, defined by equation 3.5, concerning the firing rate of GP. First, we show that the function defined by equation 3.5 is roughly linear if we make the reasonable assumption that N (the number of channels or available actions) is large. 
Thus, since yi (T) > 0, then from equation 2.3, S(T) is bounded below by ln(N), so that S(T) increases with N. Now, for large S(T), S(T) >> ln(S(T)), so that the linear term in equation 3.5 dominates, and GPi (T) becomes an approximately linear function of its input S(T). We therefore predict that GP neurons display a roughly linear relation between input and firing rate, and two studies validate this. Nambu and Llinas (1994) have established that for those GP neurons that are most influential on the population firing rate, their firing rate is indeed well approximated by a linear function of the injected current (see Figure 2I), a result that is in agreement with an earlier study by Kita and Kitai (1991). In any case, a model in which GP neurons obey an exactly linear input firing rate relation departs little in performance from the model in which GP is described by equation 3.5 (see section 5.3). Finally, it is intriguing to note that GP also includes two types of neurons whose input-output properties are logarithmic (Nambu & Llinas, 1994). Thus microcircuits within GP making use of intranucleus inhibitory collaterals (Nambu & Llinas, 1997) could also support the exact computation required by equation 3.5 for MSPRT.
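Exponential fits of the kind used for Figure 2H can be reproduced with ordinary least squares on the logarithm of the firing rate. The sketch below uses fabricated data generated from a known exponential purely to illustrate the fitting procedure; it is not the data set used in the figure:

```python
import math

def fit_exponential(currents, rates):
    """Fit f = a * exp(b * I) by ordinary least squares on ln(f)."""
    xs, ys = currents, [math.log(r) for r in rates]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - b * mx)
    return a, b

# fabricated example: rates generated from f = 5 * exp(0.02 * I)
I = [0.0, 25.0, 50.0, 75.0, 100.0]         # input current (pA, illustrative)
f = [5.0 * math.exp(0.02 * i) for i in I]  # firing rate (Hz)
a, b = fit_exponential(I, f)               # recovers a = 5, b = 0.02
```

For noiseless data the regression recovers the generating parameters exactly; with real recordings the residuals quantify how well the exponential form of equation 3.4 describes the neuron.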
Figure 2: Firing rates f of STN and GP neurons as a function of input current I. (A–D) The panels replot data on the firing rate of STN neurons presented in Hallworth et al. (2003) in Figures 4b, 4f, 12d, and 13d, respectively (control condition). (E–G) The panels replot the data from STN presented in Wilson et al. (2004) in Figures 1c, 2c, and 2f, respectively (control condition). Only firing rates below 135 Hz are shown. Lines show the best fit of the function f = a exp(bI). (H) Scaled data from A–G, (f_j/a, b I_j), plotted on the same axes for all neurons. (I) Number of spikes n produced by a GP neuron of type II (Nambu & Llinas, 1994) in a 242 ms stimulation interval using current injection I. (The data used in this figure, kindly provided by Atsushi Nambu, come from the same neuron analyzed in Figure 5g of Nambu and Llinas, 1994.)
5 Performance of MSPRT Model of the Basal Ganglia and Its Variants

The performance of the algorithmically defined model described in section 3 was investigated in simulation.
5.1 Simulation Methods. In all numerical experiments described in this section, we simulated a decision process between N alternative actions, with plausible parameters describing sensory evidence. The evidence xi(t) was accumulated in integrators Yi in time steps of δt = 1 ms. For the correct alternative i, evidence xi(t) was generated from a normal distribution with mean µ+δt and variance σ²δt, while for the other alternatives, xj(t) was generated from a normal distribution with mean µ−δt and variance σ²δt. It transpires, in fact, that we require only the value of the difference µ+ − µ−, rather than the individual means themselves (see appendix A). This was estimated from a sample participant in experiment 1 of the study of Bogacz, Brown, Moehlis, Holmes, and Cohen (2006) to be µ+ − µ− = 1.41. An estimate of σ was taken from the same experiment to be 0.33. For each set of parameters, a decision threshold was found numerically that resulted in an error rate of 1% ± 0.2% (SE); this search for the threshold was repeated 10 times. For each of these 10 thresholds, the decision time was then found in simulation, and their average was used to construct the data points.

5.2 MSPRT in the Basal Ganglia Outperforms Alternative Decision Mechanisms. It is instructive to see quantitatively how the performance of the MSPRT model compares with that of two other standard models of decision making in the brain: the race model (Vickers, 1970) and a model proposed by Usher and McClelland (2001) (henceforth, the UM model). While the MSPRT has been shown to be asymptotically optimal as the error approaches zero, its performance with finitely large errors has to be evaluated numerically. To do this, we conducted simulations for differing numbers of competing inputs, N, for all three models, with a 1% error rate. Figure 3A shows that the MSPRT consistently outperforms both the UM and race models (especially in the more realistic large-N regime).
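The simulation procedure of section 5.1 can be sketched as follows, using the parameter values quoted there (µ+ − µ− = 1.41, σ = 0.33, δt = 1 ms). The decision threshold below is a fixed illustrative value rather than one calibrated numerically to a 1% error rate, and the function names are our own:

```python
import math
import random

def simulate_mspr(N=10, mu_diff=1.41, sigma=0.33, dt=0.001,
                  p_threshold=0.99, max_steps=100000, seed=1):
    """Simulate an MSPRT decision among N alternatives (alternative 0 correct):
    accumulate Gaussian evidence, form saliences y_i = g* Y_i, and stop when
    some L_i = y_i - ln(sum_k exp(y_k)) exceeds ln(p_threshold)."""
    rng = random.Random(seed)
    g = mu_diff / sigma ** 2                      # optimal gain, equation 3.2
    Y = [0.0] * N
    for step in range(1, max_steps + 1):
        for i in range(N):
            mu = mu_diff * dt if i == 0 else 0.0  # only mu+ - mu- matters
            Y[i] += rng.gauss(mu, sigma * math.sqrt(dt))
        y = [g * Yi for Yi in Y]                  # saliences
        m = max(y)
        S = m + math.log(sum(math.exp(v - m) for v in y))   # S(T), equation 2.3
        i_best = max(range(N), key=lambda i: y[i])
        if y[i_best] - S > math.log(p_threshold):  # L_i exceeds threshold
            return i_best, step * dt               # (choice, decision time in s)
    return None, max_steps * dt
```

With these parameters the decision typically takes a fraction of a second, broadly in line with the decision times reported for N = 10 in Figure 3A.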
This result is in agreement with recent work by McMillen and Holmes (2006) (who also showed another feature in Figure 3A—that as N increases, the performance of the UM model asymptotically approaches that of the race model). For N = 2, the performance of the MSPRT and UM models is very similar since, in this instance, the latter approximates SPRT (Bogacz et al., 2006; Brown et al., 2005).

5.3 The Model Is Parametrically Robust. As previously noted, MSPRT specifies a unique value of the cortical gain parameter g∗ if MSPRT is to be faithfully implemented. We now analyze the performance of the model with values of the gain g ≠ g∗ in a general cortical integrator relation yi(T) = gYi(T) (instead of yi(T) = g∗Yi(T)). Equation 3.2 implies that the optimal value of the gain g∗ depends on the parameters of the inputs to the cortical integrators (µ+, µ−, σ). These parameters are task specific and are therefore unlikely to be known to any neural decision system. It is therefore essential for any biologically realistic implementation of MSPRT
Figure 3: Comparison of decision times (DT) of various models described in the text. Simulation details for the MSPRT model are given in section 5.1. (A) Comparison of the DT of the MSPRT model, the Usher and McClelland (2001) (UM) model, and the race model (Vickers, 1970) for different numbers of alternative actions. The standard error of the mean decision time estimate (SEM) was always lower than 7.3 ms. The inhibition and decay parameters of the UM model were set to 100. The isolated filled triangles show DTs for the linearized GP model with two parameter pairs (g, a) (see section 5.3). The triangle pointing up shows the DT for the default values g = g∗ and a = 1, and the triangle pointing down for the optimal values g = 0.4g∗ and a = 0.84. The isolated square shows the DT for a version of the MSPRT model in which simple cortical integrators are replaced by a UM model with both decay and inhibition parameters equal to 10. (B) Robustness of the MSPRT model under variation in the gain parameter. The solid line shows the dependence of the DT for N = 10 alternatives on the value of the parameter g (expressed via the ratio g/g∗). Error bars indicate SEM. The dashed line shows the decision time of the UM model.
that this mechanism does not significantly deviate from optimality under variation of the gain g from its optimal value, g∗. Figure 3B shows that the model is indeed robust to changes in g. If g > g∗, the decision time does not increase, while if g < g∗, the decision time does increase, but it never exceeds that of the UM model (see appendix C). Hence, even if the parameters of the inputs to the cortical integrators are not known, the performance may be optimized by setting g as high as possible. Turning to the functionality of GP, we evaluated the decrease in performance under an exact linearization of GP with respect to that shown in Figure 3A, for N = 10 alternatives. To do this, suppose the firing rate of a neuron in GP is a linear function of its input from STN with proportionality constant a:

    \mathrm{GP}_i(T) = a \sum_{j=1}^{N} \mathrm{STN}_j(T) = a S(T)    (5.1)
Substituting equation 5.1 into equation 3.4 and summing over i yields
    \ln(S(T)) + a S(T) = \ln \sum_{i=1}^{N} \exp(y_i(T))    (5.2)
Equation 5.2 does not have a closed-form solution for S(T), and so S(T) was found by solving equation 5.2 numerically. Decision times (DT) were contingent on the values of the parameters g and a. With the default values g = g∗ and a = 1, DT = 607 ms (SEM = ±3 ms), which is better than the UM model (DT = 628 ms) or the race model (DT = 676 ms). A parameter search yielded optimal performance, with DT = 545 ms (SEM = ±3 ms), for g = 0.4g∗ and a = 0.84.

5.4 Competition May Occur in Both Cortex and Basal Ganglia. The UM model (as well as the model of cortical decision making in area LIP by Wang, 2002) assumes that cortical integrators not only integrate evidence (as in the MSPRT model) but also actively compete with one another. Appendix D shows that if the cortex performs a computation equivalent to the UM model, then the activity levels of the basal ganglia output nuclei are exactly the same as in the original MSPRT model. As a consequence, the decision times remain the same, and the system as a whole still achieves optimal performance. This is illustrated in Figure 3A, where an isolated open square symbol shows the DT for the model augmented with UM-based cortical processing; this DT is the same as for the original MSPRT model.
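For the linearized GP variant of section 5.3, S(T) must be extracted from equation 5.2 numerically. Because the left-hand side is strictly increasing in S(T), the root is unique and a simple bisection suffices; the sketch below is our own illustration, with the bracketing interval and tolerance chosen for convenience:

```python
import math

def solve_linear_gp(y, a=0.84, tol=1e-10):
    """Solve ln(S) + a*S = ln(sum_i exp(y_i)) (equation 5.2) for S by bisection.
    The left-hand side increases monotonically in S, so the root is unique."""
    rhs = math.log(sum(math.exp(yi) for yi in y))
    # f(lo) < 0 for any rhs > ln(1e-12), and f(hi) = a*exp(rhs) > 0
    lo, hi = 1e-12, math.exp(rhs)
    f = lambda s: math.log(s) + a * s - rhs
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Each bisection step halves the bracket, so the root is located to the requested tolerance in a few dozen iterations regardless of the saliences.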
The Brain Implements Optimal Decision Making
459
6 Discussion

6.1 Summary. Our main result is that a circuit involving cortex and the basal ganglia may be devoted to implementing a powerful (asymptotically optimal) decision mechanism (MSPRT) in a parametrically robust way. Further, our results suggest a division between a core functional anatomy (shown in Figure 1B) and additional pathways (incorporated in the anatomically complete model) that may serve other purposes (e.g., enhancement of robustness and learning) without compromising MSPRT. In addition, the MSPRT model was shown to outperform other decision mechanisms. While the UM and race models avail themselves of simple network implementations, the sophisticated architecture and neural functionality of the basal ganglia appear to have evolved to support the more powerful MSPRT, allowing the brain to make accurate decisions substantially faster than the simpler mechanisms. The model also made several predictions about the physiological properties of STN and GP neurons, which, while consistent with existing data, provide a challenge for further experimental studies to test them in vitro and in vivo with synaptic input.

6.2 Relationship to Other Models of Decision Making

6.2.1 Action Selection in Basal Ganglia. The model described in this article has exactly the same architecture (shown in Figure 1A) as a previous model of action selection in the basal ganglia in the proficient phase (Gurney et al., 2001a). This architecture has been shown before to exhibit appropriate action selection and switching properties in a computational model (Gurney et al., 2001b). The underlying architecture has also been shown to be functionally robust (from a selection perspective) in a variety of settings.
Thus, it has also been shown to perform these functions within the anatomical context supplied by associated thalamocortical loops (Humphries & Gurney, 2002), under the addition of further circuitry intrinsic to the basal ganglia (Gurney, Prescott, et al., 2004), when embodied in a complete, behaving autonomous agent (Prescott, Montes-Gonzalez, Gurney, Humphries, & Redgrave, 2006) and in spiking neuron models (Humphries, Stewart, & Gurney, in press; Stewart, Gurney, & Humphries, 2005), which are constrained by significantly more physiological detail than their systems-level counterparts. The model of Gurney et al. (2001a) and the MSPRT model are consistent in proposing similar functions of individual nuclei. In particular, we suppose here that the GP plays a crucial role in limiting STN activity (via a log transform). This function is similar to that proposed for GP in Gurney et al. (2001a), in which GP automatically limits the excitation of basal ganglia output nuclei in order to allow network mechanisms to perform selection. This work differs from Gurney et al. (2001a) in that it provides an analytic description for the computation performed during the selection, thereby
R. Bogacz and K. Gurney
providing a new framework for understanding why the basal ganglia are organized in the way they are. The function of STN in our model is similar to that posited by Frank (2005), who proposed that it "can dynamically modulate the threshold for executing responses depending on the degree of response conflict present." The novelty of our work lies in specifying precisely how STN should modulate this threshold to optimize the performance.

6.2.2 Bayesian Decision Making. MSPRT can be viewed as a Bayesian method of decision making, as it is based on evaluating conditional probabilities using Bayes' theorem (see appendix A). Recently Yu and Dayan (2005) proposed a Bayesian model of attentional modulation of decision making. In this model, the final layer of an abstract neural network performs a computation equivalent to that accomplished by the outputs of the MSPRT model of the basal ganglia presented here (i.e., each output unit computes exponents of L_i(T)). The novelty of our work lies in showing how this computation may be performed by an identified, biological network of neurons, the basal ganglia.

6.2.3 Cortical Decision Making. From the theoretical perspective, the use of integrated evidence leaves open the possibility that the cortex may operate as the first stage of a two-stage decision process, in which cortical mechanisms of the kind posited in the UM model (for example) provide a first-pass filter for actions with small saliences, thereby preventing these requests from propagating to the basal ganglia for further processing. This possibility was confirmed by mathematical analysis and simulation, where identical results were obtained after incorporation of a first stage consisting of a UM network (rather than the simple integration implied in equation 2.1).

6.2.4 Integration of Evidence. The model presented here assumes that cortical neurons integrate evidence in support of alternative actions.
Several mechanisms have been proposed for how this may occur; for example, Wang (2002) proposed that the integration occurs via excitatory connections in the cortex. However, if the basal ganglia model presented here is to be a universally applicable solution to the problem of action selection, then the appearance of accumulated evidence must be guaranteed at its inputs under all circumstances, irrespective of the specific action request and brain system generating it. Anatomically, the basal ganglia form a component of loops consisting of projections from cortex to basal ganglia, then to thalamus and back to cortex (Alexander & Crutcher, 1990). In a computational study of basal ganglia and cortex, Humphries and Gurney (2002) showed that cortical regions receiving thalamic input had their activity levels amplified beyond those of sensory areas that did not. This raises the intriguing
possibility that the feedback in these loops could serve to bootstrap evidence accumulation in cortex so that no separate mechanism is required.

6.2.5 Reinforcement Learning. As stated in section 1, this article focuses on the proficient phase of task acquisition. However, during learning, it is still necessary to identify the behavioral state (stimulus and ongoing behavior) and represent the stimulus-response mapping in some way. Both processes are aspects of the decision-making process discussed in this article, and so it is useful to speculate how it might be possible to unify the accounts of decision making and learning into a single coherent framework. To develop these ideas, we consider a version of the motion discrimination task described in section 2.1 in which the stimulus-response mapping is constantly modified and hence must be learned by the animal. In this case, it is unlikely that the stimulus-response mapping would be represented in the connections between MT and FEF, because these connections would have to be rapidly and continuously modified. Many models based on experimental data would assume that in this experiment, the stimulus-response mapping would be stored in much more plastic synapses in the prefrontal cortex or the striatum (Doya, 2000; Miller & Cohen, 2001; O'Doherty et al., 2004). This is also the kind of scheme considered by Ashby, Alfonso-Reese, Turken, and Waldron (1998) in the context of category learning with motor response. Here, early stimulus-response mappings are learned in the basal ganglia, while slower consolidation takes place in direct mappings between sensory and motor cortices. Further, we note that behavioral state identification during learning could rely on integration of information, just as it does during the proficient phase in the theory presented here. However, in the case of learning, the integration may occur in regions different from those used during the proficient phase.
For example, Gold and Shadlen (2003) have shown that in the version of the motion discrimination task in which the mapping between stimulus and direction of saccade is not known during the information integration, the integration does not occur in FEF. If a unified framework encompassing action selection and learning could be developed along the lines outlined above, it would simultaneously allow predictions of the probabilities of taking alternative actions, as well as the probability distributions of onsets of action initiations (reaction times). However, before such a unified account is developed, a number of questions must be answered, in particular, where in the brain the evidence supporting alternative actions is computed in the learning phase. To address this question, we look forward to studies of neuronal responses in cortex and basal ganglia in the version of the motion discrimination task described above.

6.2.6 Working Memory. O'Reilly and Frank (2006) have proposed that the basal ganglia gate access to working memory and decide whether a newly presented stimulus should be stored in working memory. According to
Alexander et al.'s (1986) multiple loop scheme, this kind of decision would be performed by the dorsolateral prefrontal circuit. As noted in section 2.5, this letter focuses on motor and oculomotor circuits. It would therefore be interesting to investigate whether the basal ganglia implementing MSPRT could also optimize selection within working memory and, indeed, cognitive selection in general.

6.3 Relationship to Other Experimental Data

6.3.1 Psychological Data. A model of decision making must be consistent with the rich body of psychological data concerning reaction times (RT). Other psychological models are consistent with these data (Ratcliff, Van Zandt, & McKoon, 1999; Usher & McClelland, 2001), and consistency in this respect will therefore not distinguish our model in favor of these alternatives but, rather, is a necessary requirement for its psychological plausibility. Our model is indeed completely consonant with the account of RT data in two-alternative choice paradigms given by the diffusion and SPRT models (e.g., Ratcliff et al., 1999), since, under these circumstances, MSPRT reduces to SPRT. For more than two alternatives, it is interesting to note that the decision time of the MSPRT model (shown in Figure 3A) is approximately proportional to the logarithm of the number of alternatives (cf. McMillen & Holmes, 2006), thus following the experimentally observed Hick's law (Teichner & Krebs, 1974) describing RT as a function of the number of choices. Further support for the basal ganglia as a psychologically plausible response mechanism is provided in recent work by Stafford and Gurney (2004, in press).

6.3.2 The Neural Representation of Actions. In this letter, we assumed for simplicity that there was an anatomically separate channel for each possible action. However, this raises the question of what constitutes a separate action. For example, is moving one's hand 10 cm to the left a different action from moving it 15 cm to the left?
If not, then how are these actions differentiated? If they are different actions (with respect to basal ganglia selection), then the number of actions is potentially infinite, and the basal ganglia are confronted with the seemingly impossible task of representing an infinite number of discrete channels. Neurophysiological data provide a clue to a possible answer to these questions. Georgopoulos et al. (1983) studied neuronal responses in basal ganglia in a task (akin to the example above) in which different stimuli required an animal to move its hand in the same direction with three different amplitudes. They noticed that some hand-selective neurons had activity proportional to movement amplitude, while other neurons had activity inversely proportional to the amplitude (they responded most for short movements). This suggests that although movements of different joints may be represented by the separate neuronal populations (channels), the
fine tuning of the movement is represented in a distributed fashion within a channel. It will therefore be of interest to extend the current theory to incorporate coding by distributed representations. The current MSPRT model describes the activity of a neuronal population selective for each alternative within each nucleus by a single variable (corresponding to activity in a localist unit). Recently Bogacz (2006) has shown how to map linear localist decision networks onto computationally equivalent distributed decision networks and derived the parameters of decision networks with distributed representations implementing SPRT. Although this network mapping process cannot be directly applied to the MSPRT model (because it includes nonlinear processing in STN), it is likely that a similar approach for the particular types of nonlinearities present in the MSPRT model may be developed.

6.3.3 Responses of Striatal Neurons. As noted in section 1, the MSPRT model aims to provide a general framework for understanding computation in the basal ganglia as a whole during action selection in the proficient phase. Therefore, the model does not aim to incorporate all known data on basal ganglia neurons and, in particular, does not aim to explain data relating to the learning phase (during which the striatum is known to play a prominent role). However, it is of interest to see whether the behavior of the population of striatal projection neurons in the proficient phase is consistent with our ideas. In the MSPRT model, we assume, in accordance with experimental observations (Crutcher & DeLong, 1984b; Georgopoulos et al., 1983), that during the proficient phase, the activity of striatal neurons encoding certain actions reflects the activity in corresponding cortical motor or oculomotor regions.
Note that the above assumption does not prevent striatal neurons selective for particular actions from being modulated by expected reward in the proficient phase, as it has been shown that the cortical integrators are modulated by expected reward in the study by Platt and Glimcher (1999), in which the stimulus-response mapping was kept constant for many weeks of training and experiment. We assumed that in the proficient phase, the neurons in striatum selective for actions should reflect the activity of cortical integrators. This predicts that in the motion discrimination task described in section 2.1, striatal neurons selective for alternative directions of eye movements should exhibit gradually increasing firing rates similar to those in the cortex. This prediction may seem to contradict the observation that striatal neurons are bistable (Wilson, 1995), with an active "up" state and an inactive "down" state. However, Okamoto, Isomura, Takada, and Fukai (2005) have shown recently that even if neurons are bistable, then provided the probability of the onset of the up state depends on the magnitude of the input, such neurons may implement information integration, and their activity averaged across trials may increase linearly.
6.3.4 Dopaminergic Modulation. In this article, we do not analyze the influence of dopamine on basal ganglia computation. However, there are two types of dopamine release within the basal ganglia. First, phasic release (brief pulses of dopamine) is associated with salient or unexpected behavioral events. In particular, it has been proposed that the phasic dopamine signal represents a variable in temporal difference reinforcement learning, namely, the reward prediction error (i.e., the difference between the actual and the predicted levels of expected reward) (Montague et al., 1996; Schultz et al., 1997). Since we do not address learning here, we do not consider phasic release further. A second type of dopamine release provides tonic or background levels, severe lowering of which can result in Parkinson's disease (Obeso et al., 2000). A recent modeling study of the basal ganglia circuit (Gurney, Humphries, Wood, Prescott, & Redgrave, 2004) has indicated that tonic levels of dopamine may influence the speed-accuracy trade-off in making responses. This is consistent with experimental results showing that the level of tonic dopamine influences RTs (Amalric & Koob, 1987; Amalric, Moukhles, Nieoullon, & Daszuta, 1995). It will be interesting to determine theoretically to what extent the tonic dopamine level can influence the speed-accuracy trade-off while still preserving the optimality of MSPRT, and whether this mechanism can play a role in finding the speed-accuracy trade-off that maximizes the rate of reward acquisition in tasks including repeating sequences of choices (Bogacz et al., 2006; Simen, Cohen, & Holmes, 2006).

Appendix A: The MSPRT

This section supplies details of the derivation of equation 2.2. Noting the definition in the main text, P_i(T) = P(H_i | input(T)), then, from Bayes' theorem,

P_i(T) = \frac{P(\mathrm{input}(T) \mid H_i)\, P(H_i)}{P(\mathrm{input}(T))}.    (A.1)
Notice that the hypotheses H_i are mutually exclusive (as x_i cannot simultaneously have mean µ+ and µ−). Further, we assume that the set of hypotheses H_i covers all possibilities concerning the distribution of sensory inputs (a standard assumption in statistical testing). Then the denominator of equation A.1 can be written as

P(\mathrm{input}(T)) = \sum_{k=1}^{N} P(\mathrm{input}(T) \wedge H_k),    (A.2)
but from the definition of conditional probability, P(input(T) ∧ H_k) = P(input(T)|H_k)P(H_k), so that equation A.1 can be written as

P_i(T) = \frac{P(\mathrm{input}(T) \mid H_i)\, P(H_i)}{\sum_{k=1}^{N} P(\mathrm{input}(T) \mid H_k)\, P(H_k)}.    (A.3)
We assume that we do not have any prior knowledge about which of the hypotheses is more likely, so all prior probabilities must be equal to each other, that is, P(H_i) = 1/N; hence they cancel in equation A.3, and we obtain the original form of MSPRT given by Baum and Veeravalli (1994):

P_i(T) = \frac{P(\mathrm{input}(T) \mid H_i)}{\sum_{k=1}^{N} P(\mathrm{input}(T) \mid H_k)}.    (A.4)
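For the Gaussian hypotheses of the main text, equation A.4 can be evaluated directly in the log domain (subtracting the maximum log likelihood before exponentiating, for numerical stability). A minimal sketch; the function name and the toy samples are illustrative, not the paper's simulation protocol:

```python
import math

def msprt_posteriors(samples_by_channel, mu_plus, mu_minus, sigma):
    """Posterior P_i(T) of equation A.4 for Gaussian evidence.

    samples_by_channel[i] holds x_i(1..T). Under hypothesis H_i,
    channel i has mean mu_plus and every other channel has mean mu_minus.
    """
    def loglik(x, mu):
        c = math.log(math.sqrt(2 * math.pi) * sigma)
        return sum(-(xt - mu) ** 2 / (2 * sigma ** 2) - c for xt in x)

    n = len(samples_by_channel)
    logL = []
    for i in range(n):
        total = 0.0
        for j, x in enumerate(samples_by_channel):
            total += loglik(x, mu_plus if j == i else mu_minus)
        logL.append(total)
    m = max(logL)                       # log-sum-exp trick for stability
    w = [math.exp(v - m) for v in logL]
    s = sum(w)
    return [wi / s for wi in w]

# Illustrative deterministic evidence strongly favoring channel 0.
p = msprt_posteriors([[1.0] * 20, [0.0] * 20, [0.0] * 20],
                     mu_plus=1.0, mu_minus=0.0, sigma=1.0)
```

In MSPRT, a decision is made as soon as the largest of these posteriors exceeds a fixed threshold.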
We now compute the logarithm of P_i, defined by equation A.4:

L_i(T) = \ln P_i(T) = \ln P(\mathrm{input}(T) \mid H_i) - \ln \sum_{k=1}^{N} \exp(\ln P(\mathrm{input}(T) \mid H_k)).    (A.5)
Equation A.5 already has a form similar to that in equation 2.2. We now show how to obtain equation 2.2 exactly. We first compute the term ln P(input(T)|H_i) that occurs in equation A.5:

\ln P(\mathrm{input}(T) \mid H_i) = \ln \left[ \prod_{t=1}^{T} f_{(\mu^+,\sigma)}(x_i(t)) \prod_{\substack{j=1 \\ j \neq i}}^{N} f_{(\mu^-,\sigma)}(x_j(t)) \right]
= \sum_{t=1}^{T} \ln f_{(\mu^+,\sigma)}(x_i(t)) + \sum_{\substack{j=1 \\ j \neq i}}^{N} \sum_{t=1}^{T} \ln f_{(\mu^-,\sigma)}(x_j(t)),
where f_{(\mu,\sigma)} denotes the probability density function of a normal distribution with mean µ and standard deviation σ. Therefore,

\ln P(\mathrm{input}(T) \mid H_i) = \sum_{t=1}^{T} \left[ -\frac{(x_i(t) - \mu^+)^2}{2\sigma^2} + \ln \frac{1}{\sqrt{2\pi}\,\sigma} \right] + \sum_{\substack{j=1 \\ j \neq i}}^{N} \sum_{t=1}^{T} \left[ -\frac{(x_j(t) - \mu^-)^2}{2\sigma^2} + \ln \frac{1}{\sqrt{2\pi}\,\sigma} \right]

= NT \ln \frac{1}{\sqrt{2\pi}\,\sigma} + \frac{1}{2\sigma^2} \left[ -T \left( (\mu^+)^2 + (N-1)(\mu^-)^2 \right) + \sum_{j=1}^{N} \sum_{t=1}^{T} \left( -x_j(t)^2 + 2\mu^- x_j(t) \right) \right] + \frac{\mu^+ - \mu^-}{\sigma^2} \sum_{t=1}^{T} x_i(t).

The first two terms on the right-hand side do not depend on i, and so, denoting their sum by C, we have

\ln P(\mathrm{input}(T) \mid H_i) = C + g^* Y_i(T),    (A.6)
where g^* = (µ+ − µ−)/σ², and Y_i(T) is defined in equation 2.1. Hence, substituting equation A.6 into A.5, we obtain

L_i(T) = C + g^* Y_i(T) - \ln \left[ \exp(C) \sum_{k=1}^{N} \exp(g^* Y_k(T)) \right] = g^* Y_i(T) - \ln \sum_{k=1}^{N} \exp(g^* Y_k(T)),

which gives equation 2.2.

Appendix B: Basal Ganglia Model, Including All Major Pathways

To accomplish inclusion of all pathways shown in Figure 1A, we introduce a model with weighted connection strengths. For simplicity, we assume that all the excitatory weights are equal to 1 and denote the inhibitory weight from nucleus A to nucleus B by w_{A→B}. Denote the striatal projection neurons whose dopaminergic receptors are predominantly of D1 and D2 types by S1 and S2, respectively. The activity levels of the basal ganglia nuclei in the network of Figure 1A are then given by:

GP_i = \sum_{j=1}^{N} STN_j - \ln \sum_{j=1}^{N} STN_j - w_{S2\to GP}\, y_i    (B.1)

STN_i = \exp(y_i - w_{GP\to STN}\, GP_i)    (B.2)

OUT_i = -w_{S1\to OUT}\, y_i + \sum_{j=1}^{N} STN_j - w_{GP\to OUT}\, GP_i.    (B.3)
We now derive constraints on the weights that must be satisfied so that OUT_i(T) = −L_i(T) as defined by equation 2.2. Substituting equation B.1 into B.2,

STN_i = \exp \left[ y_i (1 + w_{GP\to STN} w_{S2\to GP}) - w_{GP\to STN} \sum_{j=1}^{N} STN_j + w_{GP\to STN} \ln \sum_{j=1}^{N} STN_j \right].
Summing over i and rearranging terms, we get

\sum_{i=1}^{N} STN_i = \left[ \sum_{i=1}^{N} \exp(y_i (1 + w_{GP\to STN} w_{S2\to GP})) \right] \times \exp \left( -w_{GP\to STN} \sum_{i=1}^{N} STN_i \right) \cdot \left( \sum_{i=1}^{N} STN_i \right)^{w_{GP\to STN}}.
Taking the logarithm of both sides and rearranging terms,

\ln \sum_{i=1}^{N} \exp(y_i (1 + w_{GP\to STN} w_{S2\to GP})) = (1 - w_{GP\to STN}) \ln \sum_{i=1}^{N} STN_i + w_{GP\to STN} \sum_{i=1}^{N} STN_i.    (B.4)
Now substitute equation B.1 into B.3:

OUT_i = -y_i (w_{S1\to OUT} - w_{GP\to OUT} w_{S2\to GP}) + (1 - w_{GP\to OUT}) \sum_{j=1}^{N} STN_j + w_{GP\to OUT} \ln \sum_{j=1}^{N} STN_j.    (B.5)
Comparing equations B.4 and B.5, we note that if the following condition is satisfied,

w_{GP\to OUT} = 1 - w_{GP\to STN},    (B.6)
then we can substitute equation B.4 into B.5 and obtain

OUT_i = -y_i (w_{S1\to OUT} - w_{GP\to OUT} w_{S2\to GP}) + \ln \sum_{j=1}^{N} \exp(y_j (1 + w_{GP\to STN} w_{S2\to GP})).    (B.7)
In order to satisfy OUT_i(T) = −L_i(T), the coefficient of y_i outside the logarithm in equation B.7 must equal the gain coefficient inside the exponential, and the gain must be modified accordingly. Thus, the following must be satisfied:

w_{S1\to OUT} - w_{GP\to OUT} w_{S2\to GP} = 1 + w_{GP\to STN} w_{S2\to GP}.    (B.8)
Substituting the constraint B.6 into B.8 gives

w_{S1\to OUT} = 1 + w_{S2\to GP}.    (B.9)
The optimal value of the gain must be modified and becomes

g^* = \frac{1}{1 + w_{GP\to STN} w_{S2\to GP}} \cdot \frac{\mu^+ - \mu^-}{\sigma^2}.
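As a numerical sanity check on constraints B.6 and B.9, one can iterate equations B.1 and B.2 to their fixed point and compare the resulting B.3 output against −L_i computed with the modified gain. A minimal sketch; the weight values, the inputs y, and the damped fixed-point iteration are illustrative assumptions (the appendix does not prescribe a solver for the coupled STN/GP equations):

```python
import math

def check_msprt_identity():
    """Check that the weighted network B.1-B.3, with weights obeying
    constraints B.6 and B.9, yields OUT_i = -L_i with the modified gain."""
    y = [0.1, 0.4, 0.7]                  # illustrative cortical inputs
    w_gp_stn, w_s2_gp = 0.5, 0.2         # illustrative free weights
    w_gp_out = 1.0 - w_gp_stn            # constraint B.6
    w_s1_out = 1.0 + w_s2_gp             # constraint B.9
    gain = 1.0 + w_gp_stn * w_s2_gp      # effective gain of the network

    stn = [1.0] * len(y)
    for _ in range(500):                 # damped iteration of B.1 and B.2
        s = sum(stn)
        gp = [s - math.log(s) - w_s2_gp * yi for yi in y]
        new = [math.exp(yi - w_gp_stn * gpi) for yi, gpi in zip(y, gp)]
        stn = [0.5 * a + 0.5 * b for a, b in zip(stn, new)]
    s = sum(stn)
    gp = [s - math.log(s) - w_s2_gp * yi for yi in y]
    out = [-w_s1_out * yi + s - w_gp_out * gpi for yi, gpi in zip(y, gp)]  # B.3

    lse = math.log(sum(math.exp(gain * yi) for yi in y))
    neg_L = [-(gain * yi - lse) for yi in y]          # -L_i of equation 2.2
    return max(abs(o, ) if False else abs(o - l) for o, l in zip(out, neg_L))
```

At the converged fixed point the residual returned is tiny, confirming the derivation for this parameter choice.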
Note that with the additional pathways, the optimal gain g^* is less than that required without these pathways. In summary, the network in Figure 1A implements MSPRT if the inhibitory weights satisfy constraints B.6 and B.9.

Appendix C: Performance for Nonoptimal Values of Parameter g

This section describes the performance of the basal ganglia model when parameter g has nonoptimal values. Section C.1 shows that if g > g^*, the decision time does not increase since, as g → ∞, the model converges to the other asymptotically optimal test, MSPRTb. Section C.2 shows that if g < g^*, the decision time does increase, but it never exceeds that of the UM model (Usher & McClelland, 2001), because as g → 0, the proposed model converges to an approximation of the UM model.

C.1 Overestimation of Gain. We first describe the other asymptotically optimal test, MSPRTb (Dragalin et al., 1999), and go on to show that as g → ∞, the model approximates MSPRTb. In this test, after each sample,
the following ratios are computed (Dragalin et al., 1999):

RB_i(T) = \frac{P(\mathrm{input}(T) \mid H_i)}{\max_{1 \le k \le N,\, k \neq i} P(\mathrm{input}(T) \mid H_k)}.
The decision is made whenever one of the ratios exceeds a threshold. From equation A.6, the logarithms of the above ratios for the hypotheses defined in the main text become

LB_i(T) = g \left( Y_i(T) - \max_{1 \le k \le N,\, k \neq i} Y_k(T) \right).    (C.1)
The decision is made whenever one of the LB_i(T) exceeds a threshold. Let Y_{m1}(T), Y_{m2}(T), ..., Y_{mi}(T), ..., Y_{mN}(T) be the ordered sequence of Y_i(T), with Y_{m1}(T) the largest member. Then, according to equation C.1, the decision is made whenever Y_{m1}(T) − Y_{m2}(T) exceeds a threshold. We now show that the basal ganglia model in the main text works in just this way, as g → ∞. The log-ratio L_i(T) in equation 2.2 of the main text is equal to

L_i(T) = g Y_i(T) - \ln \left[ \exp(g Y_{m1}(T)) + \exp(g Y_{m2}(T)) + \sum_{j=3}^{N} \exp(g Y_{mj}(T)) \right]

= g Y_i(T) - \ln \left\{ \exp(g Y_{m1}(T)) \left[ 1 + \exp(g (Y_{m2}(T) - Y_{m1}(T))) + \sum_{j=3}^{N} \exp(g (Y_{mj}(T) - Y_{m1}(T))) \right] \right\}

= g Y_i(T) - g Y_{m1}(T) - \ln \left[ 1 + \exp(g (Y_{m2}(T) - Y_{m1}(T))) + \sum_{j=3}^{N} \exp(g (Y_{mj}(T) - Y_{m1}(T))) \right].
A decision will be made when the largest among the L_i(T) exceeds a threshold z, that is, when

g Y_{m1}(T) - g Y_{m1}(T) - \ln \left[ 1 + \exp(g (Y_{m2}(T) - Y_{m1}(T))) + \sum_{j=3}^{N} \exp(g (Y_{mj}(T) - Y_{m1}(T))) \right] > z.
The first two terms cancel. Taking the negative, applying the exp(·) function, and subtracting 1 gives

\exp(g (Y_{m2}(T) - Y_{m1}(T))) + \sum_{j=3}^{N} \exp(g (Y_{mj}(T) - Y_{m1}(T))) < z',
where z' is another threshold. After some manipulation, we have

\exp(g (Y_{m2}(T) - Y_{m1}(T))) \left[ 1 + \sum_{j=3}^{N} \exp(g (Y_{mj}(T) - Y_{m2}(T))) \right] < z'.    (C.2)
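A short numerical sketch makes the limit concrete: as g grows, the bracketed factor in C.2 approaches 1, so the left-hand side behaves like exp(g(Y_{m2} − Y_{m1})) alone. The values of Y below are illustrative:

```python
import math

def lhs_c2(y, g):
    """Left-hand side of inequality C.2 for accumulator values y."""
    ym = sorted(y, reverse=True)                 # ordered sequence Y_m1 >= Y_m2 >= ...
    bracket = 1.0 + sum(math.exp(g * (ym[j] - ym[1])) for j in range(2, len(ym)))
    return math.exp(g * (ym[1] - ym[0])) * bracket

y = [1.0, 0.7, 0.3, 0.1]
# Ratio of the C.2 left-hand side to exp(g*(Y_m2 - Y_m1)): equals the
# bracketed factor, which tends to 1 as g grows.
ratio_small_g = lhs_c2(y, 1.0) / math.exp(1.0 * (0.7 - 1.0))
ratio_large_g = lhs_c2(y, 20.0) / math.exp(20.0 * (0.7 - 1.0))
```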
Since the Y_i(T) are sums of samples from continuous normal distributions, Y_{m2}(T) > Y_{m3}(T) with probability 1 for T > 0. Therefore, as g → ∞, the expressions g(Y_{mj}(T) − Y_{m2}(T)) → −∞, and the content of the square brackets in equation C.2 converges to 1. Thus, taking the logarithm and dividing by −g gives Y_{m1}(T) − Y_{m2}(T) > z'' (where z'' is also a threshold). Hence, as g goes to infinity, the proposed model makes a decision whenever the difference between the two largest Y_i(T) exceeds a threshold, which is equivalent to MSPRTb.

C.2 Underestimation of Gain. In this section we show that as g → 0, the model converges to an approximation of the UM model. This model consists of N mutually inhibiting leaky integrators whose dynamics are described by

\dot{u}_i(T) = x_i(T) - k u_i(T) - w \sum_{\substack{j=1 \\ j \neq i}}^{N} u_j(T), \quad u_i(0) = 0,    (C.3)
where k determines a leakage rate constant and w is the weight of the inhibitory connections between the integrators. McMillen and Holmes (2006) showed that the performance of the UM model is optimized when k = w and both go to infinity. In this case, putting Y_i(T) = \int_0^T x_i(t)\, dt, the decision is made whenever any of the following differences,

LUM_i(T) = Y_i(T) - \frac{1}{N} \sum_{j=1}^{N} Y_j(T),
exceeds a decision threshold (McMillen & Holmes, 2006). We now show that exactly the same criterion for a decision applies in the proposed basal ganglia model when g → 0. For small g, we may approximate the right-hand side of equation 2.2 of the main text by retaining only linear terms in Taylor expansions. Thus, working on the exponential,

L_i(T) \to g Y_i(T) - \ln \sum_{k=1}^{N} (1 + g Y_k(T)) = g Y_i(T) - \ln \left( N + g \sum_{k=1}^{N} Y_k(T) \right),

and then on the ln(·) function,

L_i(T) \to g Y_i(T) - \ln N - \frac{g}{N} \sum_{k=1}^{N} Y_k(T).
A decision will be made whenever any L_i(T) exceeds a threshold z, that is, when

g Y_i(T) - \ln N - \frac{g}{N} \sum_{k=1}^{N} Y_k(T) > z.

Adding ln N to both sides and dividing by g, we get the following condition for a decision:

Y_i(T) - \frac{1}{N} \sum_{k=1}^{N} Y_k(T) > z'.
Hence, as g → 0, the proposed basal ganglia model makes a decision under the same conditions as the UM model with optimal values of inhibition and decay (w = k, and w, k → ∞).

Appendix D: Model with Competing Cortical Integrators

Let us consider the UM model described in section C.2, in which decay is equal to inhibition (w = k; this is one of the conditions required for optimal performance of the UM model, see above) and the integrators have optimal gain. Then equation C.3 becomes

\dot{u}_i(T) = g^* x_i(T) - w \sum_{j=1}^{N} u_j(T), \quad u_i(0) = 0.    (D.1)
Note that in the above case, all the integrators receive exactly the same inhibition; hence, in the limit of the intervals between samples going to 0, the relationship between the variables y_i of the MSPRT model and u_i of the UM model becomes

y_i(T) = u_i(T) + c(T) \quad \forall i,

that is, the variables differ by a term c(T), which, although it varies over time (it will typically be negative), is the same for all integrators. However, we now show that if the same term c(T) is added to the cortical firing rate of all integrators, the activity of the output nuclei does not change:

OUT_i(T) = -(y_i(T) + c(T)) + \ln \sum_{j=1}^{N} \exp(y_j(T) + c(T))

= -y_i(T) - c(T) + \ln \left[ \exp(c(T)) \sum_{j=1}^{N} \exp(y_j(T)) \right]

= -y_i(T) + \ln \sum_{j=1}^{N} \exp(y_j(T)).
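This invariance is also easy to confirm numerically. A minimal sketch (the function name and the values of y and c are illustrative):

```python
import math

def out_nuclei(y):
    """Output-nuclei activity of the core model: OUT_i = -y_i + ln(sum_j exp(y_j))."""
    lse = math.log(sum(math.exp(yj) for yj in y))
    return [-yi + lse for yi in y]

# Adding the same term c(T) to every integrator leaves OUT unchanged,
# exactly as in the derivation above.
y = [0.2, 0.5, 0.9]
shifted = [yi - 0.3 for yi in y]        # common shift c(T) = -0.3
diff = max(abs(a - b) for a, b in zip(out_nuclei(y), out_nuclei(shifted)))
```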
Since the activity of the output nuclei does not change, a version of the MSPRT model in which simple cortical integration is replaced by the UM model of equation D.1 achieves the same decision times (for any fixed error rate) as the original MSPRT model. The above also resolves a correspondence problem that arises in the MSPRT model because it assumes that at the beginning of the decision process y_i = 0, whereas the baseline firing rate of cortical neurons is nonzero.

Acknowledgments

This work was supported by EPSRC grants EP/C516303/1 and EP/C514416/1. We thank Atsushi Nambu for providing data used in Figure 2I. We thank Philip Holmes, Jonathan D. Cohen, Paul Overton, Sander Nieuwenhuis, Eric Shea-Brown, and Tobias Larsen for reading the earlier version of the manuscript and for very useful comments, and Peter Redgrave for discussion.

References

Alexander, G. E., & Crutcher, M. D. (1990). Functional architecture of basal ganglia circuits: Neural substrates of parallel processing. Trends Neurosci., 13(7), 266–271.
Alexander, G. E., DeLong, M. R., & Strick, P. L. (1986). Parallel organization of functionally segregated circuits linking basal ganglia and cortex. Annu. Rev. Neurosci., 9, 357–381. Amalric, M., & Koob, G. F. (1987). Depletion of dopamine in the caudate nucleus but not in nucleus accumbens impairs reaction-time performance in rats. J. Neurosci., 7(7), 2129–2134. Amalric, M., Moukhles, H., Nieoullon, A., & Daszuta, A. (1995). Complex deficits on reaction time performance following bilateral intrastriatal 6-OHDA infusion in the rat. Eur. J. Neurosci., 7(5), 972–980. Ashby, F. G., Alfonso-Reese, L. A., Turken, A. U., & Waldron, E. M. (1998). A neuropsychological theory of multiple systems in category learning. Psychol. Rev., 105, 442–481. Ashby, F. G., & Spiering, B. J. (2004). The neurobiology of category learning. Behav. Cogn. Neurosci. Rev., 3(2), 101–113. Barnard, G. (1946). Sequential tests in industrial statistics. Journal of Royal Statistical Society Supplement, 8, 1–26. Baum, C. W., & Veeravalli, V. V. (1994). A sequential procedure for multihypothesis testing. IEEE Transactions on Information Theory, 40, 1996–2007. Bevan, M. D., Smith, A. D., & Bolam, J. P. (1996). The substantia nigra as a site of synaptic integration of functionally diverse information arising from the ventral pallidum and the globus pallidus in the rat. Neuroscience, 75(1), 5–12. Bogacz, R. (2006). Optimal decision networks with distributed representation. Manuscript submitted for publication. Bogacz, R., Brown, E., Moehlis, J., Holmes, P., & Cohen, J. D. (2006). The physics of optimal decision making: A formal analysis of models of performance in twoalternative forced choice tasks. Psychol. Rev., 113, 700–765. Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1993). Responses of neurons in macaque MT to stochastic motion signals. Vis. Neurosci., 10(6), 1157–1169. Brown, E., Gao, J., Holmes, P., Bogacz, R., Gilzenrat, M., & Cohen, J. D. (2005). 
Simple networks that optimize decisions. International Journal of Bifurcations and Chaos, 15, 803–826. Brown, J. W., Bullock, D., & Grossberg, S. (2004). How laminar frontal cortex and basal ganglia circuits interact to control planned and reactive saccades. Neural Netw., 17(4), 471–510. Chevalier, G., Vacher, S., Deniau, J. M., & Desban, M. (1985). Disinhibition as a basic process in the expression of striatal functions. I. The striato-nigral influence on tecto-spinal/tecto-diencephalic neurons. Brain Res., 334(2), 215–226. Crutcher, M. D., & DeLong, M. R. (1984a). Single cell studies of the primate putamen. I. Functional organization. Exp. Brain Res., 53(2), 233–243. Crutcher, M. D., & DeLong, M. R. (1984b). Single cell studies of the primate putamen. II. Relations to direction of movement and pattern of muscular activity. Exp. Brain Res., 53(2), 244–258. Dayan, P. (2001). Levels of analysis in neural modeling. In R. A. Wilson & F. C. Keil (Eds.) Encyclopedia of cognitive science. London: Macmillan. Deniau, J. M., & Chevalier, G. (1985). Disinhibition as a basic process in the expression of striatal functions. II. The striato-nigral influence on
The Brain Implements Optimal Decision Making
Received December 17, 2005; accepted May 19, 2006.
LETTER
Communicated by Olivier Faugeras
Realistically Coupled Neural Mass Models Can Generate EEG Rhythms

Roberto C. Sotero
[email protected]
Nelson J. Trujillo-Barreto
[email protected]
Yasser Iturria-Medina
[email protected]
Cuban Neuroscience Center, Brain Dynamics Department, Havana, Cuba

Felix Carbonell
[email protected]
Juan C. Jimenez
[email protected]
Instituto de Cibernética Matemática y Física, Departamento de Sistemas Adaptativos, Havana, Cuba
We study the generation of EEG rhythms by means of realistically coupled neural mass models. Previous neural mass models were used to model cortical voxels and the thalamus. Interactions between voxels of the same and other cortical areas and with the thalamus were taken into account. Voxels within the same cortical area were coupled (short-range connections) with both excitatory and inhibitory connections, while coupling between areas (long-range connections) was considered to be excitatory only. Short-range connection strengths were modeled by using a connectivity function depending on the distance between voxels. Coupling strength parameters between areas were defined from empirical anatomical data employing the information obtained from probabilistic paths, which were tracked by water diffusion imaging techniques and used to quantify white matter tracts in the brain. Each cortical voxel was then described by a set of 16 random differential equations, while the thalamus was described by a set of 12 random differential equations. Thus, for analyzing the neuronal dynamics emerging from the interaction of several areas, a large system of differential equations needs to be solved. The sparseness of the estimated anatomical connectivity matrix reduces the number of connection parameters substantially, making the solution of this system faster. Simulations of human brain rhythms were carried out in order to test the model. Physiologically plausible results were obtained based on this anatomically constrained neural mass model. Neural Computation 19, 478–512 (2007)
© 2007 Massachusetts Institute of Technology
Realistic Neural Mass Models
479
1 Introduction

We aim to describe a generative model of EEG rhythms based on anatomically constrained coupling of neural mass models. Although the biophysical mechanisms underlying oscillatory activity in neurons are well studied, the origin and function of global characteristics of large populations of neurons (such as EEG rhythms) are still unknown. In this context, computational models have proven capable of providing insight into this problem. Two modeling approaches have been used to study the behavior of neuronal populations. One approach is based on the study of large computational neuronal networks in which each cell is realistically represented by multiple compartments for the soma, axon, and dendrites, and each synaptic connection is modeled explicitly (Traub & Miles, 1991). A disadvantage of such realistic modeling is that it requires high computational power. For this reason, simplified versions of such models, in which each cell is reduced to a single compartment, have been used (Wang, Golomb, & Rinzel, 1995; Rinzel, Terman, Wang, & Ermentrout, 1998). However, even in this case, the models remain detailed enough that it is difficult to determine the influence of each model parameter on the average network characteristics that are generated. The second approach is based on the use of neural mass models (NMM). In contrast to neuronal networks, NMMs describe the dynamics of cortical columns and brain areas using only a few parameters, but without much detail.
In this approach, spatially averaged magnitudes are assumed to characterize the collective behavior of populations of neurons of a given type instead of modeling single cells and their interactions in a detailed network (Wilson & Cowan, 1972; Lopes da Silva, Hoeks, Smits, & Zetterberg, 1974; Zetterberg, Kristiansson, & Mossberg, 1978; van Rotterdam, Lopes da Silva, van den Ende, Viergever, & Hermans, 1982; Jansen & Rit, 1995; Valdés, Jiménez, Riera, Biscay, & Ozaki, 1999; Wendling, Bellanger, Bartolomei, & Chauvel, 2000; David & Friston, 2003). Despite their simplicity, NMMs have been very helpful in the study of EEG rhythms. For example, in order to study the origin of the alpha rhythm, Lopes da Silva et al. (1974) developed an NMM comprising two interacting populations of neurons, the thalamocortical relay cells and the inhibitory interneurons, which were interconnected by a negative feedback loop. Later, this model was extended to include both positive and negative feedback loops (Zetterberg et al., 1978; Jansen, Zouridakis, & Brandt, 1993). In Jansen and Rit (1995), alpha and beta activities were replicated by using two coupled neural mass models representing cortical areas with delays in the interconnections between them. Wendling et al. (2000) investigated the generation of epileptic spikes by extending Jansen's model to multiple coupled areas but explored the dynamics of only three areas for different values of the connectivity constants. David and Friston (2003) considered that a cortical area comprises several noninteracting neuronal populations that were in turn described by Jansen's models. This model allowed them
to reproduce the whole spectrum of EEG signals by changing the coupling between two of these areas, thus showing the importance of parameters related to the coupling between different brain regions for the resulting dynamics. Due to the lack of information about actual connectivity patterns, previous approaches have been devoted to the study of the temporal evolution of neuronal populations rather than their spatial dynamics. However, the development of techniques such as diffusion weighted magnetic resonance imaging (DWMRI) in the past 20 years makes noninvasive study of the anatomical circuitry of the living human brain possible. This has put extremely valuable information at our disposal and gives us a unique opportunity to formulate models that take into account a more realistic view of the central nervous system. In this article, we extend Jansen's model to characterize the dynamics of a cortical voxel. Building on this, our goal is to study the rhythms obtained when coupling several areas comprising large numbers of interconnected voxels. In this approach, voxels of the same cortical area are assumed to be coupled with excitatory and inhibitory connections (short-range connections), while connections between areas (long-range connections) are assumed to be excitatory only. This is an extension of the model of David and Friston (2003) because interactions between voxels within the same area are taken into account. Additionally, we introduce anatomical constraints on the coupling strength parameters (CSP) between different brain regions. This is accomplished by using CSPs estimated from human DWMRI data (Iturria-Medina, Canales-Rodríguez, Melié-García, & Valdés-Hernández, 2005). The thalamus is also modeled using a single NMM, mutually coupled with the cortical areas.
We show that this anatomically constrained NMM can reproduce the EEG rhythms recorded at the scalp, another step in the validation of NMMs as a valuable tool for describing the activity of large populations of neurons. The letter is organized as follows. In section 2, the NMMs used for a cortical voxel and the thalamus are formulated. Section 3 is devoted to the modeling of short-range and long-range interactions. The integrated thalamocortical model is formulated in section 4. In section 5, the observation model that links the activity in a cortical voxel with the EEG recorded on the scalp surface is presented. Section 6 describes the numerical method used for integrating the large system of random differential equations obtained. Finally, sections 7 and 8 are devoted to a description of the simulations and a discussion of the results.

2 Models for a Cortical Voxel and the Thalamus

In this work we model the neuronal activity in a cortical voxel by extending previous neural mass models that have been used to describe neural dynamics at smaller spatial scales. In particular, Jansen and Rit (1995)
Figure 1: The basic neural mass models used. (A) Model for a cortical voxel. The dynamics is obtained by the interaction of three populations of neurons: excitatory pyramidal cells, excitatory spiny stellate cells, and inhibitory interneurons (smooth stellate cells). Input to the system is modeled by the pulse density p(t). This reduces to Jansen’s model when removing the dashed-line loop. (B) Model for the thalamus. Interaction between one excitatory and one inhibitory subpopulation, representing thalamocortical and reticular-thalamic neurons, respectively.
studied the temporal dynamics of a cortical column by means of the interaction of interconnected populations of excitatory pyramidal and stellate cells as well as inhibitory interneurons. This model (see Figure 1A, except for the dashed-line loop) capitalizes on the basic circuitry of the cortical column (Mountcastle, 1978, 1997). Pyramidal cells receive feedback from
excitatory and inhibitory interneurons, which receive only excitatory input from pyramidal cells. Nevertheless, at the spatial scale of voxels, this simplified circuitry is no longer valid, and more complicated interactions need to be considered. While a cortical column has a diameter of 150 to 300 µm, the length of a voxel usually ranges from 1 to 3 mm, comprising several interconnected columns. At this spatial scale, pyramidal-to-pyramidal connections become increasingly important, accounting for the majority of intracortical fibers (Braitenberg & Schüz, 1998). In our approach, Jansen's model is extended to include this kind of connection by adding a self-excitatory loop for the pyramidal population (the dashed-line loop in Figure 1A). Zetterberg et al. (1978) had already used a generic model of this kind for describing the temporal dynamics of a neuronal mass consisting of two excitatory and one inhibitory population. Although they were able to produce signals resembling EEG background activity, their model had no specific neural substrate. Usually in this type of modeling, the dynamics of each neuronal population is represented by two transformations (Jansen & Rit, 1995; Zetterberg et al., 1978). First, the average pulse density of action potentials entering a population is converted into an average postsynaptic membrane potential by means of linear convolutions with impulse responses h_e(t) and h_i(t) for the excitatory and the inhibitory case, respectively:

h_e(t) = A a t e^{-at}   (2.1)
h_i(t) = B b t e^{-bt}.   (2.2)
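Numerically, the alpha-function kernels of equations 2.1 and 2.2 are straightforward to evaluate. A minimal sketch; the amplitude and rate values below are illustrative choices in the range commonly used for this model family, not values taken from this letter:

```python
import numpy as np

def psp_kernel(t, amp, rate):
    """Alpha-function impulse response of equations 2.1 and 2.2:
    amp * rate * t * exp(-rate * t) for t >= 0, zero before the impulse."""
    t = np.asarray(t, dtype=float)
    return np.where(t >= 0.0, amp * rate * t * np.exp(-rate * t), 0.0)

# Illustrative amplitude/rate values (assumptions, not from this letter):
A, a = 3.25, 100.0    # EPSP amplitude (mV), lumped rate constant (1/s)
B, b = 22.0, 50.0     # IPSP amplitude (mV), lumped rate constant (1/s)

t = np.linspace(0.0, 0.2, 2001)   # 200 ms sampled every 0.1 ms
h_e = psp_kernel(t, A, a)         # excitatory kernel h_e(t)
h_i = psp_kernel(t, B, b)         # inhibitory kernel h_i(t)

# Each kernel peaks at t = 1/rate with height amp/e, which is a quick
# consistency check on any implementation of these equations.
```

The peak location 1/rate makes clear how the lumped constants a and b set the time scale of the synaptic response.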
Parameters A and B represent the maximum amplitudes of the excitatory (EPSP) and inhibitory (IPSP) postsynaptic potentials, respectively, while the lumped parameters a and b depend on passive membrane time constants and other distributed delays in the dendritic network. The second (nonlinear) transformation accounts for the conversion of the average membrane potential of the population into an average pulse density of action potentials and is accomplished by means of a sigmoid function,

S(v) = 2e_0 / (1 + e^{r(v_0 - v)}),   (2.3)
where 2e_0 is the maximum firing rate, v_0 is the postsynaptic potential (PSP) corresponding to firing rate e_0, and the parameter r controls the steepness of the sigmoid function. The interaction between the different populations is characterized by five connectivity constants, c_1, c_2, c_3, c_4, c_5 (see Figure 1A), which represent the average numbers of synaptic contacts. The first four constants have the
same interpretation as in Jansen and Rit (1995) and were related in that work based on physiological arguments:

c_2 = 0.8 c_1,   c_3 = c_4 = 0.25 c_1.   (2.4)
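The sigmoid of equation 2.3 and the constant relations of equation 2.4 are easy to experiment with. In the sketch below, the values e_0 = 2.5 s^-1, v_0 = 6 mV, and r = 0.56 mV^-1 are choices commonly used in Jansen-type models, and c_1 = 135 is likewise only an illustrative value; none of them is taken from this letter:

```python
import numpy as np

def sigmoid(v, e0=2.5, v0=6.0, r=0.56):
    """Potential-to-rate conversion of equation 2.3:
    S(v) = 2*e0 / (1 + exp(r*(v0 - v))).
    Parameter values are common Jansen-type choices, used for illustration."""
    return 2.0 * e0 / (1.0 + np.exp(r * (v0 - v)))

# Equation 2.4 ties four of the connectivity constants to c1
# (c1 = 135 is an assumed illustrative value):
c1 = 135.0
c2, c3, c4 = 0.8 * c1, 0.25 * c1, 0.25 * c1

print(sigmoid(6.0))   # at v = v0 the rate is exactly e0 -> prints 2.5
print(c2, c3, c4)     # prints 108.0 33.75 33.75
```

Note that S saturates at 2e_0 for strong depolarization and approaches zero for strongly hyperpolarized potentials, which is what keeps the population model bounded.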
Similarly, c_5 accounts for interactions between pyramidal cells. Using Laplace transforms, this convolution model can easily be expressed in terms of the following system of differential equations:

\dot{y}_1(t) = y_4(t)
\dot{y}_2(t) = y_5(t)
\dot{y}_3(t) = y_6(t)
\dot{y}_4(t) = Aa {p(t) + c_5 S(y_1(t) - y_2(t)) + c_2 S(c_1 y_3(t))} - 2a y_4(t) - a^2 y_1(t)
\dot{y}_5(t) = Bb c_4 S(c_3 y_3(t)) - 2b y_5(t) - b^2 y_2(t)
\dot{y}_6(t) = Aa S(y_1(t) - y_2(t)) - 2a y_6(t) - a^2 y_3(t),   (2.5)
where y_1(t) and y_3(t) are average EPSPs and y_2(t) is the average IPSP (see Figure 1A). From classical electrophysiology, pyramidal cells are considered to be the main contributors to the EEG measured at the scalp surface. This is known to stem from two basic reasons: (1) the specific location of excitatory and inhibitory synaptic inputs on the pyramidal neuron and (2) the spatial organization of these cells in the form of a palisade oriented perpendicular to the cortex (Niedermeyer & Lopes da Silva, 1987; Nunez, 1981). That is, afferent excitatory connections make contact predominantly on apical dendrites, creating an extracellular negative potential (a current sink due to Na+ entering the cell), while inhibitory synapses reach mainly the soma of the neuron, creating an extracellular positive potential (a current source due to Cl- entering the cell). Thus, at a macroscopic level, the potential field generated by a synchronously activated palisade of neurons behaves like that of a dipole layer and can be considered the basis of EEG generation. Note that in our approach, y_1(t) and y_2(t) are the result of excitatory and inhibitory synapses acting on the pyramidal cell population. Thus, the difference y_1(t) - y_2(t) can, to a fairly good approximation, be taken to be proportional to the EEG signal (Jansen & Rit, 1995). The input to the model in Figure 1A is the process p(t), which represents the basal stochastic activity within the voxel plus the extrinsic input. This extrinsic input does not include afferences from voxels of the same and other cortical areas; rather, it models specific stimulus-dependent input that modulates the activity of the neuronal population. Additionally, our approach capitalizes on specific properties of the thalamic nuclei in order to model the contribution of this subcortical structure,
which is the main relay station of the sensory input on its way to the cortex. That is, the thalamus comprises two major populations of neurons: thalamocortical (TC) excitatory relay neurons and thalamic inhibitory reticular (RE) neurons. Only the axons of TC neurons project to the cortex, while RE neurons form an inhibitory network that surrounds the thalamus and project inhibitory connections to TC neurons exclusively (Destexhe & Sejnowski, 2001). This circuitry allowed us to formulate the mass model shown in Figure 1B, which Lopes da Silva et al. (1974) used for simulating the alpha rhythm. Note that the model is used to describe the dynamics of the thalamus as a whole, without taking into account any subdivision into voxels. The motivation for this was the physical separation of the two types of neuronal populations that form the thalamus. On the other hand, the absence of any preferential orientation of the neurons in the thalamus, together with the depth of this structure, has led to the idea that it does not generate any measurable voltage at the scalp surface. Thus, in our approach, no direct thalamic contribution to the EEG is considered. A more sophisticated model of the thalamic activity could be used to elucidate its actual contribution to the EEG generation, but that would entail a detailed knowledge of its microscopic cytoarchitectonic organization, which we lack at the moment.
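System 2.5 can be integrated directly with a fixed-step scheme to obtain the EEG proxy y_1(t) - y_2(t). The forward-Euler sketch below is only illustrative: the A, a, B, b, and c_1 through c_4 values are common Jansen-type choices, while the self-excitation constant c_5 and the noise statistics of p(t) are arbitrary assumptions, none of them taken from this letter:

```python
import numpy as np

def sigmoid(v, e0=2.5, v0=6.0, r=0.56):
    # Equation 2.3 with commonly used illustrative parameter values.
    return 2.0 * e0 / (1.0 + np.exp(r * (v0 - v)))

def deriv(y, p, A=3.25, a=100.0, B=22.0, b=50.0,
          c1=135.0, c2=108.0, c3=33.75, c4=33.75, c5=10.0):
    """Right-hand side of system 2.5. c2, c3, c4 follow equation 2.4;
    c5 (the added pyramidal self-excitation) is an arbitrary guess."""
    y1, y2, y3, y4, y5, y6 = y
    return np.array([
        y4,
        y5,
        y6,
        A * a * (p + c5 * sigmoid(y1 - y2) + c2 * sigmoid(c1 * y3))
            - 2 * a * y4 - a ** 2 * y1,
        B * b * c4 * sigmoid(c3 * y3) - 2 * b * y5 - b ** 2 * y2,
        A * a * sigmoid(y1 - y2) - 2 * a * y6 - a ** 2 * y3,
    ])

dt, steps = 1e-4, 20000                 # 2 s of activity at a 0.1 ms step
rng = np.random.default_rng(0)
y = np.zeros(6)
eeg = np.empty(steps)
for k in range(steps):
    p = 120.0 + 30.0 * rng.standard_normal()   # noisy pulse density (assumed)
    y = y + dt * deriv(y, p)                   # forward Euler step
    eeg[k] = y[0] - y[1]                       # EEG proxy y1 - y2
```

The resulting trace can then be inspected or Fourier-analyzed; no particular rhythm is claimed for these made-up parameters, and a higher-order integrator would be preferable for production use.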
3 Short-Range versus Long-Range Interactions

If we want to model the dynamics of macroscopic cortical areas, we have to augment the model described above to account for the specific neuronal interactions in the brain. In the cytoarchitectonic organization of the cerebral cortex, two types of connections can be distinguished: short-range and long-range connections (SRC and LRC, respectively). SRCs are made by cortical neurons with local branches that can reach a maximum of about 10 mm (Braitenberg, 1978; Abeles, 1991; Jirsa & Haken, 1997), corresponding to the length of the longest horizontal axon collaterals plus the length of the longest dendrites. LRCs are mainly formed by axons of pyramidal cells that cross the white matter and connect cortical regions that can be up to 100 mm apart (Abeles, 1991; Jirsa & Haken, 1997). Several differences between these two systems can be established, not only in their characteristic lengths but also in their specificity and function (Abeles, 1991; Braitenberg & Schüz, 1998). While SRCs can be described in statistical terms, with an effective connection strength that diminishes exponentially with the distance between neuronal populations, LRCs do not show such a regular and smooth pattern but have a high degree of specificity. Additionally, LRCs terminate preferentially on the apical dendrites of pyramidal cells, while local collaterals terminate more on basal dendrites. In the following sections, we describe how this differentiation of cortical interactions is incorporated into our model.
3.1 Within-Area Interactions. First, SRCs that mediate neuronal interactions within a cortical area will be described. A diagram of the proposed augmented model for a given voxel is shown in Figure 2A. We assume that voxels of the same area are coupled by means of both excitatory and inhibitory connections, where the pyramidal cell population is the target of all the input to the system. Excitatory input comes from the pyramidal and spiny stellate cell populations, with coupling strengths described by connectivity functions that depend on the distance |x_m - x_j| between voxels (Jirsa & Haken, 1996, 1997):

k_{e1}^{mj} = (1 / 2σ_{e1}) e^{-|x_m - x_j| / σ_{e1}}   (3.1)
k_{e2}^{mj} = (1 / 2σ_{e2}) e^{-|x_m - x_j| / σ_{e2}}.   (3.2)

In equations 3.1 and 3.2, k_{e1}^{mj} and k_{e2}^{mj} are the strengths of the connections that pyramidal cells and excitatory interneurons of voxel m make on pyramidal cells of voxel j, respectively. Delays in these connections can be modeled by linear transformations similar to equation 2.1, as in Jansen and Rit (1995). We use h_{d2}(t) and h_{d3}(t), respectively:

h_{d2}(t) = A a_{d2} t e^{-a_{d2} t}   (3.3)
h_{d3}(t) = A a_{d3} t e^{-a_{d3} t}.   (3.4)

Similar equations are formulated for within-area inhibitory connections, which are considered by means of the connection strength k_i^{mj} that inhibitory interneurons (smooth stellate cells) of voxel m exert on pyramidal cells of voxel j, that is,

k_i^{mj} = (1 / 2σ_i) e^{-|x_m - x_j| / σ_i}   (3.5)
h_{d4}(t) = B b_{d4} t e^{-b_{d4} t}.   (3.6)
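The distance-dependent kernels of equations 3.1, 3.2, and 3.5 share one functional form, so a single helper covers all three. In the sketch below, the one-dimensional voxel positions and the decay lengths σ are made-up illustrations, not values from this letter:

```python
import numpy as np

def coupling_matrix(positions, sigma):
    """Exponential distance kernel of equations 3.1, 3.2, and 3.5:
    k[m, j] = exp(-|x_m - x_j| / sigma) / (2 * sigma)."""
    d = np.abs(positions[:, None] - positions[None, :])   # pairwise distances
    return np.exp(-d / sigma) / (2.0 * sigma)

# Voxels placed every 2 mm along one dimension of a cortical area (assumed):
x = np.arange(0.0, 20.0, 2.0)          # positions in mm
k_e1 = coupling_matrix(x, sigma=10.0)  # pyramidal -> pyramidal strengths
k_i = coupling_matrix(x, sigma=2.0)    # assumed shorter inhibitory range

# Strength decays with distance: nearest neighbours couple most strongly.
assert k_e1[0, 1] > k_e1[0, 5]
```

Because the kernel depends only on |x_m - x_j|, the resulting matrices are symmetric, which is a convenient property to check in any implementation.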
3.2 Between-Area Interactions. Interactions between distant cortical areas are determined by LRCs, which in our approach are mediated by pyramidal-to-pyramidal cell connections. There is no clarity about the rules governing the system of LRCs (Braitenberg & Schüz, 1998). Nevertheless, the emergence of new imaging techniques like diffusion weighted magnetic resonance imaging (DWMRI) in the past few years has opened a window to the study of this type of connection in the living human brain. DWMRI is based on the fact that the trajectories followed by water molecules during diffusion reflect the microscopic environment of brain tissues. Nervous fibers can then be characterized by tracing probabilistic paths
Figure 2: The thalamocortical model. (A) Input-output relationship for the neural mass model of a cortical voxel. A cortical voxel receives excitatory afferences from pyramidal cells and excitatory stellate cells of voxels of the same area, from pyramidal cells of other cortical areas, and from TC relay neurons. Inhibitory afferences are received from inhibitory interneurons of voxels of the same area. The pulse density p(t) accounts for the basal stochastic activity within the voxel. (B) Input-output relationship for the neural mass model of the thalamus. TC relay neurons send axonal projections to the cortex, where they make synapses on the three populations of neurons considered. TC cells receive backward connections from cortical pyramidal neurons. The pulse density pth (t) accounts for the background stochastic activity within the thalamus plus stimulus related inputs.
that represent the preferred direction of water diffusion along the nervous fibers in each voxel. Finally, measures of anatomical connectivity reflecting the interrelation between voxels and areas can be defined and calculated based on these paths. In this article, the LRC strength K_{i,n} between two areas i and n will be estimated from actual DWMRI data. Most work in the field of DWMRI has defined measures of between-voxel rather than between-area anatomical connectivity. For example, for characterizing the probability of connection of a point R with a point P, one can use the number of probabilistic paths leaving P and entering R divided by the number of probabilistic paths generated in P (Behrens et al., 2003; Hagmann et al., 2003; Parker, Wheeler-Kingshott, & Barker, 2002; Koch, Norris, & Hund-Georgiadis, 2002). In Tuch (2002), the probability of connection between two points is estimated as the probability of the most probable path joining both points, while in Parker et al. (2002) and Tournier, Calamante, Gadian, and Connelly (2003), it is taken as the probability of the path that minimizes a cost function depending on the diffusion data. However, diffusion data generally show high isotropy within the gray matter, giving poor information about the distribution of nerve fibers. Consequently, in most cases, it is not convenient to trace probabilistic paths within gray matter areas. Another approach for studying connectivity in the brain is to use only the points at the boundaries of gray matter regions, characterizing connection strengths between zones rather than between the individual voxels comprising them. In this line of work, Iturria-Medina, Canales-Rodríguez et al. (2005) and Iturria-Medina, Valdés-Hernández et al. (2005) proposed a model for obtaining anatomical connection strengths (ACS) between brain zones.
In their method, the ACS between two regions A and B (C_{A↔B}) was defined as proportional to the cross-sectional area of the total cylindrical-shaped volume of the nervous fibers shared by these regions. The ACS was then calculated as the area occupied by the connector volume over the surfaces of the connected zones,

C_{A↔B} = A_{T(A,B)} + B_{T(A,B)},   (3.7)
where A_{T(A,B)} and B_{T(A,B)} represent the area defined by the fibers (both afferent and efferent) joining A with B over the surfaces of A and B, respectively. These areas were evaluated from a DWMRI data set by counting, on each surface, the number of target voxels of the generated probabilistic paths that simulate fiber trajectories, under the assumption that those paths are contained inside the connector volume. Notice that methods for reconstructing fiber paths using DWMRI data cannot distinguish between excitatory and inhibitory connections. However, in the proposed model, LRCs are considered to be excitatory only. This assumption is consistent with physiological and anatomical studies demonstrating that pyramidal cells of layer III are the main source of the
488
R. Sotero et al.
feedforward projections to other cortical areas (Gilbert & Wiesel, 1979, 1983; Martin & Whitteridge, 1984; McGuire, Gilbert, Rivlin, & Wiesel, 1991; Douglas & Martin, 2004). Thus, information obtained from DWMRI data can be used as a measure of the CSP (K_{i,n}) between any two given areas i and n. For modeling delays in long-range connections, we use

$$ h_{d1}(t) = A a_{d1}\, t\, e^{-a_{d1} t}. \qquad (3.8) $$
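Convolving an input with the alpha-function kernel of equation 3.8 is equivalent to the second-order form used below in systems 4.1 and 4.2 (the −2a ẏ − a² y terms). The sketch below verifies this equivalence numerically; the forward-Euler integration, unit-impulse approximation, and choice of parameter values are our illustrative assumptions, not the authors' procedure.

```python
import numpy as np

A, a = 3.25, 15.0          # gain (mV) and rate constant a_d1 (1/s), Table 1 values
dt, T = 1e-4, 0.5
t = np.arange(0.0, T, dt)

# Closed-form kernel, equation 3.8: h(t) = A * a * t * exp(-a t)
h = A * a * t * np.exp(-a * t)

# Impulse response of y'' = A*a*u(t) - 2a*y' - a^2*y, integrated by forward
# Euler; the unit impulse is approximated as 1/dt on the first step.
y, z = 0.0, 0.0            # y and its derivative z = y'
y_num = np.empty_like(t)
for k in range(len(t)):
    y_num[k] = y
    u = 1.0 / dt if k == 0 else 0.0
    y, z = y + dt * z, z + dt * (A * a * u - 2 * a * z - a * a * y)

# The two impulse responses agree up to O(dt) discretization error.
err = np.max(np.abs(y_num - h))
```

The kernel peaks at t = 1/a_{d1}, so a_{d1} directly sets the characteristic delay of the connection.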
Like parameters a and b of Jansen's model, the constants a_{d1}, a_{d2}, a_{d3}, and b_{d4} account for passive membrane time constants and other distributed delays in the dendritic network.

4 Thalamocortical Model

In this section we present the final mathematical formulation of the proposed thalamocortical model, with SRCs and LRCs included. As shown in Figure 2A, intra- and intervoxel interactions are described by nine connectivity constants, c_1 to c_9. The first five, c_1 to c_5, have already been defined in section 2. Constant c_6 represents the number of synapses made by pyramidal cells of one voxel on voxels of other areas, c_7 is the number of synapses made by pyramidal cells on voxels of the same area, and constants c_8 and c_9 are the numbers of synapses made by excitatory and inhibitory interneurons, respectively, on pyramidal cells of voxels within the same area. Thus, the dynamics of the neuronal activity in a cortical voxel is described by a system of 16 differential equations:
$$
\begin{aligned}
\dot{y}_k^{nj}(t) &= y_{k+8}^{nj}(t), \qquad k = 1, \ldots, 8,\\
\dot{y}_9^{nj}(t) &= Aa \Big[ c_5 S\big(y_1^{nj}(t) - y_2^{nj}(t)\big) + c_2 S\big(y_3^{nj}(t)\big) + K_{th,n}\, c_{3t}\, S\big(x_4(t)\big)\\
&\quad + \sum_{m=1,\, m \neq j}^{M_n} \Big( k_{e1}^{nm} c_7 S\big(y_6^{nm}(t)\big) + k_{e2}^{nm} c_8 S\big(y_7^{nm}(t)\big) \Big) + \sum_{i=1,\, i \neq n}^{N} K_{i,n} \sum_{m=1}^{M_i} c_6 S\big(y_5^{im}(t)\big) + p^{nj}(t) \Big]\\
&\quad - 2a\, y_9^{nj}(t) - a^2 y_1^{nj}(t),\\
\dot{y}_{10}^{nj}(t) &= Bb \Big[ c_4 S\big(y_4^{nj}(t)\big) + \sum_{m=1,\, m \neq j}^{M_n} k_i^{nm} c_9 S\big(y_8^{nm}(t)\big) \Big] - 2b\, y_{10}^{nj}(t) - b^2 y_2^{nj}(t),\\
\dot{y}_{11}^{nj}(t) &= Aa \Big[ c_1 S\big(y_1^{nj}(t) - y_2^{nj}(t)\big) + K_{th,n}\, c_{4t}\, S\big(x_5(t)\big) \Big] - 2a\, y_{11}^{nj}(t) - a^2 y_3^{nj}(t),\\
\dot{y}_{12}^{nj}(t) &= Aa \Big[ c_3 S\big(y_1^{nj}(t) - y_2^{nj}(t)\big) + K_{th,n}\, c_{5t}\, S\big(x_6(t)\big) \Big] - 2a\, y_{12}^{nj}(t) - a^2 y_4^{nj}(t),\\
\dot{y}_{13}^{nj}(t) &= A a_{d1}\, S\big(y_1^{nj}(t) - y_2^{nj}(t)\big) - 2a_{d1}\, y_{13}^{nj}(t) - a_{d1}^2 y_5^{nj}(t),\\
\dot{y}_{14}^{nj}(t) &= A a_{d2}\, S\big(y_1^{nj}(t) - y_2^{nj}(t)\big) - 2a_{d2}\, y_{14}^{nj}(t) - a_{d2}^2 y_6^{nj}(t),\\
\dot{y}_{15}^{nj}(t) &= A a_{d3}\, S\big(y_3^{nj}(t)\big) - 2a_{d3}\, y_{15}^{nj}(t) - a_{d3}^2 y_7^{nj}(t),\\
\dot{y}_{16}^{nj}(t) &= B b_{d4}\, S\big(y_4^{nj}(t)\big) - 2b_{d4}\, y_{16}^{nj}(t) - b_{d4}^2 y_8^{nj}(t),
\end{aligned} \qquad (4.1)
$$
where the superscript nj refers to voxel j in area n. The system has extrinsic inputs given by connections from other cortical voxels and from the thalamus, as well as an intrinsic input p^{nj}(t) that represents baseline stochastic activity within the voxel. The output is taken as y_1^{nj}(t) − y_2^{nj}(t), which from section 2 can be considered proportional to the generated EEG signal. Note that close voxels separated by the boundary of two neighboring regions are considered to be coupled through LRCs solely, even if they are close enough for the SRCs to be important. This simplification can easily be avoided by adding the relevant terms to the equations for $\dot{y}_9^{nj}$ and $\dot{y}_{10}^{nj}$ in equation 4.1. Nevertheless, in the calculations shown here, all anatomical areas were chosen large enough, and the activation regions within each area were located distant enough from the boundaries, that boundary effects could be neglected. This allowed us to keep the equations simple for the sake of a better understanding of the model.

In the case of the thalamus, the main output of the system is mediated by TC cells, which are assumed to send axonal projections to the three populations of neurons in the cortex: pyramidal, spiny stellate, and smooth stellate cells (see Figure 2). In contrast, only pyramidal cells project back to the thalamus and make synapses on TC neurons. This model is described by the following set of 12 differential equations:

$$
\begin{aligned}
\dot{x}_k(t) &= x_{k+6}(t), \qquad k = 1, \ldots, 6,\\
\dot{x}_7(t) &= A_t a_t \Big[ \sum_{i=1}^{N} K_{th,i} \sum_{m=1}^{M_i} c_6 S\big(y_5^{im}(t)\big) + p_{th}(t) \Big] - 2a_t\, x_7(t) - a_t^2 x_1(t),\\
\dot{x}_8(t) &= B_t b_t\, c_{2t}\, S\big(c_{1t} x_3(t)\big) - 2b_t\, x_8(t) - b_t^2 x_2(t),\\
\dot{x}_9(t) &= A_t a_t\, S\big(x_1(t) - x_2(t)\big) - 2a_t\, x_9(t) - a_t^2 x_3(t),\\
\dot{x}_{10}(t) &= A_t a_{d1t}\, S\big(x_1(t) - x_2(t)\big) - 2a_{d1t}\, x_{10}(t) - a_{d1t}^2 x_4(t),\\
\dot{x}_{11}(t) &= A_t a_{d2t}\, S\big(x_1(t) - x_2(t)\big) - 2a_{d2t}\, x_{11}(t) - a_{d2t}^2 x_5(t),\\
\dot{x}_{12}(t) &= A_t a_{d3t}\, S\big(x_1(t) - x_2(t)\big) - 2a_{d3t}\, x_{12}(t) - a_{d3t}^2 x_6(t).
\end{aligned} \qquad (4.2)
$$
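For intuition about these systems, note that if the SRC, LRC, and thalamic terms are dropped, a single voxel of system 4.1 reduces to a Jansen-Rit-type core. The sketch below simulates such a reduced voxel, assuming the standard Jansen-Rit sigmoid for S of equation 2.3 and borrowing values from Table 1 (alpha-rhythm column); the mean input level and the reduction itself are our illustrative assumptions, not the authors' simulation code.

```python
import numpy as np

# Parameters: Table 1 alpha-rhythm values; sigmoid constants e0, v0, r as in Table 1.
A, B = 3.25, 22.0          # synaptic gains (mV)
a, b = 100.0, 50.0         # rate constants (1/s)
e0, v0, r = 5.0, 6.0, 0.56
c1 = 150.0
c2, c3, c4 = 0.8 * c1, 0.25 * c1, 0.25 * c1

def S(v):
    # Standard Jansen-Rit sigmoid, assumed form of S in equation 2.3.
    return 2.0 * e0 / (1.0 + np.exp(r * (v0 - v)))

rng = np.random.default_rng(0)
dt, T = 1e-3, 5.0
n = int(T / dt)
y = np.zeros(6)            # y0, y1, y2 and their derivatives
eeg = np.empty(n)
for k in range(n):
    p = rng.normal(220.0, 22.0)          # stochastic pulse-density input (assumed level)
    dy = np.array([
        y[3], y[4], y[5],
        A * a * S(y[1] - y[2]) - 2 * a * y[3] - a**2 * y[0],
        A * a * (p + c2 * S(c1 * y[0])) - 2 * a * y[4] - a**2 * y[1],
        B * b * c4 * S(c3 * y[0]) - 2 * b * y[5] - b**2 * y[2],
    ])
    y = y + dt * dy        # forward Euler step
    eeg[k] = y[1] - y[2]   # EEG proxy, as in the text: y1 - y2
```

The sigmoid saturation keeps all state variables bounded, which is why the full coupled systems 4.1 and 4.2 remain well behaved under stochastic input.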
The pulse density input p_th(t) in this case can be modeled as the sum of a stimulus (visual or auditory stimulation, for example) and a basal random activity in the thalamus. K_{th,i} is the CSP between the thalamus and cortical area i and is obtained from the anatomical connectivity matrix as described in section 3.2.

At this point, it is worth making some comments about the mathematical formulation of systems 4.1 and 4.2. The two neural mass models correspond to a mathematical object called a random differential equation (RDE) (Hasminskii, 1980), that is, an equation of the form dx(t) = f(x(t), p(t)) dt, where p(t) is an external stochastic process. Indeed, in equations 4.1 and 4.2, the nonautonomous inputs p^{nj}(t) and p_th(t) are external stochastic processes. Note that for each fixed realization of p^{nj}(t) and p_th(t), systems 4.1 and 4.2 are in fact nonautonomous ordinary differential equations. This approach differs from the neural mass models proposed in Jansen and Rit (1995), Wendling et al. (2000), and David and Friston (2003), which mainly use stochastic differential equations (SDEs), that is, equations of the form dx(t) = f(x(t)) dt + g(x(t)) dW(t), where the differential dW(t) of the Wiener process W(t) is a gaussian white noise. Notice that the mathematical formulations of SDEs and RDEs are quite different (Gard, 1988). However, both types of equations are related in the particular case p(t) = W(t): the RDE dx(t) = f(x(t), p(t)) dt can be transformed into an SDE by adding the equation dp(t) = dW(t). Hence, the framework of RDEs allows us to consider not only the Wiener process but also other types of models for the noise p(t). Besides, numerical calculations can be significantly simplified by
avoiding the elements of stochastic integration theory usually required for SDEs. Finally, the existence and uniqueness of the solution of systems 4.1 and 4.2 follow from classical arguments. That is, given a realization of the stochastic processes p^{nj}(t) and p_th(t), it is easily seen that the right-hand sides of systems 4.1 and 4.2 are given by continuously differentiable functions of their respective arguments (for any choice of the corresponding parameters). Besides, the function S(v(t)) defined in equation 2.3, the only source of nonlinearity present in both systems, satisfies a Lipschitz condition (it has a bounded first derivative). According to theorem 3.1 in Hasminskii (1980), these two conditions (continuity and the Lipschitz condition) are enough to guarantee that our model is well posed (i.e., existence and uniqueness of the solution).

5 EEG Observation Model

We now relate the average membrane potential of pyramidal cells of each voxel to the observed EEG on the scalp surface. The current dipole due to a PSP is approximately (Hämäläinen, Hari, Ilmoniemi, Knuutila, & Lounasmaa, 1993)

$$ j = \frac{\pi}{4}\, d^2 \sigma_{in}\, V = \varepsilon V, \qquad (5.1) $$

$$ \varepsilon = \frac{\pi}{4}\, d^2 \sigma_{in}, \qquad (5.2) $$
where d is the diameter of the dendrite, σ_in is the intracellular conductivity, and V is the change of voltage during a PSP (for a single PSP, j ≈ 20 fAm; Hämäläinen et al., 1993). The equivalent current dipole (ECD) due to N PSPs is then

$$ J = \sum_{i=1}^{N} \alpha_i j_i \approx \varepsilon \sum_{i=1}^{N} \alpha_i V_i, \qquad (5.3) $$
where α_i is +1 for an EPSP and −1 for an IPSP. As discussed in section 2, we can assume that, for one voxel,

$$ y_1(t) - y_2(t) = \sum_{i=1}^{N} \alpha_i V_i \qquad (5.4) $$

and

$$ J(t) = \varepsilon\, \big( y_1(t) - y_2(t) \big). \qquad (5.5) $$
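As a plausibility check on equations 5.1 and 5.2, plugging in typical values recovers the quoted single-PSP dipole. The specific numbers below (dendrite diameter d ≈ 1 µm, intracellular conductivity σ_in ≈ 1 S/m, PSP amplitude V ≈ 25 mV) are our assumptions; only j ≈ 20 fAm is quoted from Hämäläinen et al. (1993).

```python
import math

d = 1e-6        # dendrite diameter (m), assumed
sigma_in = 1.0  # intracellular conductivity (S/m), assumed
V = 25e-3       # voltage change during a PSP (V), assumed

eps = (math.pi / 4.0) * d**2 * sigma_in   # equation 5.2
j = eps * V                               # equation 5.1, in A*m
j_fAm = j * 1e15                          # roughly the quoted ~20 fAm
```

The small size of a single-PSP dipole is why the measurable EEG requires the summation over many synchronous PSPs expressed in equations 5.3 to 5.5.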
On the other hand, from the forward problem of the EEG, we know that the voltage measured at the scalp surface due to a current density distribution is given by

$$ \phi(r_s, t) = \int K(r_s, r_g)\, J(r_g, t)\, d^3 r_g, \qquad (5.6) $$
where the kernel K(r_s, r_g) is the electric lead field, which summarizes the electric and geometric properties of the conducting media (brain, skull, and scalp), and J(r_g, t) is the primary current density (PCD) vector. The indices s and g run over the sensor (electrode) and generator (voxel) spaces, respectively, and t denotes time. The lead field matrix K(r_s, r_g) is commonly known and can be calculated by means of the reciprocity theorem (Plonsey, 1963; Rush & Driscoll, 1969). Thus, the EEG generated at a given array of electrodes distributed over the scalp surface due to the activation of one voxel can easily be calculated by multiplying the lead field by expression 5.5, both evaluated for the given voxel.

6 Numerical Method

As mentioned in section 4, an RDE is just a nonautonomous ordinary differential equation (ODE) coupled with a stochastic process (p(t) and p_th(t) in our model). Thus, in principle, RDEs can be integrated by applying conventional numerical methods for ODEs. However, it is well known that classical ODE methods may introduce "ghost" solutions and numerical instabilities when applied to nonlinear equations, which is critical when dealing with high-dimensional problems, as in our case. On the other hand, very few researchers have systematically studied the properties of such methods for the numerical integration of RDEs (Grüne & Kloeden, 2001; Carbonell, Jimenez, Biscay, & de la Cruz, 2005). In this letter, we choose the local linearization (LL) scheme for RDEs (Carbonell et al., 2005) to solve systems 4.1 and 4.2 (see the appendix for a description of this method). The rationale for this choice is that the LL method improves on the order of convergence and stability properties of conventional numerical integrators for RDEs (Carbonell et al., 2005).
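One step of a local linearization scheme can be written compactly with a matrix exponential of an augmented matrix. The sketch below is the basic autonomous, order-1 variant, not the full RDE scheme of Carbonell et al. (2005) described in the appendix; it is exact on linear systems, which the sanity check exploits.

```python
import numpy as np
from scipy.linalg import expm

def ll_step(f, jac, x, h):
    """One local-linearization step for dx/dt = f(x): exact on the
    linearization of f around x, built from an augmented matrix exponential."""
    d = x.size
    M = np.zeros((d + 1, d + 1))
    M[:d, :d] = jac(x)      # Jacobian block
    M[:d, d] = f(x)         # drift column
    # The top-right block of expm(h*M) equals int_0^h exp(J s) ds @ f(x).
    return x + expm(h * M)[:d, d]

# Sanity check on a linear system dx/dt = L x, where LL reproduces exp(h L) x.
L = np.array([[0.0, 1.0], [-4.0, -2.0]])
f = lambda x: L @ x
jac = lambda x: L
x0 = np.array([1.0, 0.0])
x_ll = ll_step(f, jac, x0, 0.1)
x_exact = expm(0.1 * L) @ x0
```

For the nonlinear systems 4.1 and 4.2, the Jacobian changes at every step, but the same one-step formula applies with `jac` evaluated at the current state.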
Moreover, the LL approach has been key for constructing efficient and stable numerical schemes for the integration and estimation of various classes of random dynamical systems (Shoji & Ozaki, 1998; Prakasa-Rao, 1999; Schurz, 2002).

7 Results

In this section, computational simulations are used to demonstrate the capability of the model to reproduce the temporal dynamics and spectral characteristics of the EEG signal as recorded on the scalp surface. A number of model predictions arising from the results obtained are also shown,
Figure 3: (A) Average anatomical connectivity matrix estimated from diffusion weighted magnetic resonance imaging data of two human subjects. This matrix contains the connection strength between 71 brain areas. These areas were segmented based on the Probabilistic Brain Atlas developed at the Montreal Neurological Institute. The values are normalized between 0 and 1. (B) Random matrix calculated with the same degree of sparseness as that of the anatomical connectivity matrix in A.
and can be further tested experimentally. Additionally, the sensitivity of the results to changes in model parameters is studied.

7.1 Description of the Simulation Method. In the simulations, 71 brain areas were defined based on the Probabilistic MRI Atlas (PMA) produced by the Montreal Neurological Institute (Evans et al., 1993, 1994; Collins, Neelin, Peters, & Evans, 1994; Mazziotta, Toga, Evans, Fox, & Lancaster, 1995). Anatomical connectivity matrices between these areas, corresponding to two human subjects, were then calculated and averaged. The elements of the resulting average matrix were then used as the CSPs for between-areas interactions. As shown in Figure 3, the sparseness of this anatomical connectivity matrix allows for a significant reduction in the number of CSPs and thus reduces the computational cost of solving the large RDE system that describes the model presented here. Initial conditions were set to zero in all simulations, and an integration step size of 5 ms was used. The first 100 points of the simulated signals were discarded in order to avoid transient behavior, and the input p(t) was a gaussian process with mean and standard deviation of 10,000 and 1,000 pulses per second, respectively. The rest of the model parameter values were chosen to vary across space, with values taken from a uniform distribution in the interval between 5% below and 5% above the standard values reported in Table 1. This choice was motivated by the inhomogeneities found in cortical tissue. The motor cortex, for example, has a relatively sparse population of neurons, while sensory cortices tend
Table 1: Values of Model Parameters and Physiological Interpretation.

Parameters with the same value in all simulations:
- A, B (average synaptic gain for cortical voxels): A = 3.25 mV, B = 22 mV
- e_0, ν_0, r (parameters of the nonlinear sigmoid function): e_0 = 5 s^-1, ν_0 = 6 mV, r = 0.56 mV^-1
- σ_e1, σ_e2, σ_i (parameters of the connectivity functions): σ_e1 = 10, σ_e2 = 10, σ_i = 2
- c_6, c_7, c_8, c_9 (average number of synaptic contacts between the neuronal populations of different cortical voxels): c_6 = 200, c_7 = 100, c_8 = 100, c_9 = 100

Parameters with different values for each simulation:
- a, b: average synaptic time constants for the excitatory and inhibitory populations
- a_d1, a_d2, a_d3, b_d4: average time delays on the efferent connections from a population, for cortical voxels
- c_1, c_2 = 0.8c_1, c_3 = c_4 = 0.25c_1, c_5: average number of synaptic contacts between the neuronal populations within a cortical voxel
- A_t, B_t: synaptic gains for the thalamus
- c_1t, c_2t, c_3t, c_4t, c_5t: number of synaptic contacts made by the neuronal populations of the thalamus
- a_t, b_t: average synaptic time constants for the TC and RE populations

Actual values used in each case:
- Delta rhythm: a = 20 s^-1, b = 20 s^-1, a_d1 = 15 s^-1, a_d2 = 20 s^-1, a_d3 = 20 s^-1, b_d4 = 20 s^-1, c_1 = 50
- Theta rhythm: a = 50 s^-1, b = 40 s^-1, a_d1 = 15 s^-1, a_d2 = 50 s^-1, a_d3 = 50 s^-1, b_d4 = 40 s^-1, c_1 = 50
- Alpha rhythm: a = 100 s^-1, b = 50 s^-1, a_d1 = 33 s^-1, a_d2 = 100 s^-1, a_d3 = 100 s^-1, b_d4 = 50 s^-1, c_1 = 150, A_t = 3.25 mV, B_t = 22 mV, c_1t = 50, c_2t = 50, c_3t = 80, c_4t = 100, c_5t = 80, a_t = 100 s^-1, b_t = 40 s^-1
- Beta rhythm: a = 100 s^-1, b = 100 s^-1, a_d1 = 33 s^-1, a_d2 = 100 s^-1, a_d3 = 100 s^-1, b_d4 = 100 s^-1, c_1 = 180
- Gamma rhythm: a = 100 s^-1, b = 100 s^-1, a_d1 = 33 s^-1, a_d2 = 100 s^-1, a_d3 = 100 s^-1, b_d4 = 100 s^-1, c_1 = 250
to be more densely populated than average. Furthermore, even within a given cytoarchitectonic region, cortical thickness and neuronal densities vary considerably (Abeles, 1991). However, despite these differences, it is believed that all cortical regions process information locally according to the same principles. Parameter values will also depend on voxel size. Nevertheless, because it is difficult to establish the exact nature of this dependence, the selection of a plausible range of variation of parameter values (see Table 1) was based on previous modeling studies (Lopes da Silva et al., 1974; Zetterberg et al., 1978; Jansen & Rit, 1995; David & Friston, 2003). Parameters used in previous works, as well as parameters introduced for the first time in this study, were then tuned so that the model could produce the different rhythms observed in the human EEG.

In all simulations, the generator space consists of a 3D regular grid of points that represent the possible generators of the EEG inside the brain, while the measurement space is defined by the array of sensors where the EEG is recorded. In this work, 41,850 grid points (defining voxels of size 4.25 mm × 4.25 mm × 4.25 mm) and a dense array of 120 electrodes are placed in registration with the PMA. The 3D grid is further clipped by the gray matter, which consists of all 71 segmented brain regions (18,905 grid points remaining). The electrode positions were determined by extending and refining the standard 10/20 system (FP1, FP2, F3, F4, C3, C4, P3, P4, O1, O2, F7, F8, T3, T4, T5, T6, Fz, Cz, and Pz). The physical head model constructed in this way allows us to easily compute the electric lead field matrix that relates the PCD inside the brain to the voltage measured at the sensor locations.

7.2 Generation of Alpha Rhythm.
Experimental data from fMRI, as well as solutions of the EEG inverse problem, show that the generators of the different brain rhythms are not localized in the same cortical areas (Goldman, Stern, Engel, & Cohen, 2002; Martínez-Montes, Valdés-Sosa, Miwakeichi, Goldman, & Cohen, 2004). In the case of the alpha rhythm, it has been demonstrated that increased EEG alpha power is correlated with decreased fMRI signal in multiple regions of occipital, superior temporal, inferior frontal, and cingulate cortex, and with increased power in the thalamus and insula (Goldman et al., 2002). In this study, these regions were segmented and subdivided into 12 anatomical areas selected from the PMA (see Table 2), which were further subdivided based on their hemispheric location (left or right). An illustrative visualization of some of these regions, with the corresponding fiber tracts estimated from DWMRI, is shown in Figure 4A. The 24 defined brain areas, together with their corresponding anatomical connectivity values, were used in our model to generate alpha activity; the results are shown in Figure 5. The temporal dynamics and the power spectrum of the signal at the electrodes of maximum amplitude (O1 and O2) show reasonable agreement with experimentally observed EEG (see
Table 2: Brain Areas Selected for Each Rhythm Simulation.

Alpha rhythm:
Cuneus (left and right); lingual gyrus (left and right); lateral occipitotemporal gyrus (left and right); insula (left and right); inferior occipital gyrus (left and right); superior occipital gyrus (left and right); medial occipitotemporal gyrus (left and right); occipital pole (left and right); cingulate region (left and right); inferior frontal gyrus (left and right); superior temporal gyrus (left and right); thalamus (left and right).

Delta, beta, theta, and gamma rhythms:
Medial front-orbital gyrus (left and right); middle frontal gyrus (left and right); precentral gyrus (left and right); lateral front-orbital gyrus (left and right); medial frontal gyrus (left and right); superior frontal gyrus (left and right); inferior frontal gyrus (left and right).
Figure 4: Visualization of the fiber tracts connecting the brain regions used in the simulations as calculated from DWMRI. (A) Alpha rhythm: left and right superior temporal gyri (1, 2); left and right thalamus (3, 4); left and right medial occipitotemporal gyri (5, 6); left and right occipital poles (7, 8). (B) Beta, gamma, delta, and theta rhythms: left and right superior frontal gyri (1, 2); left and right inferior frontal gyri (3, 4).
Figure 5: Simulation of alpha rhythm (8–12 Hz). (A) Temporal dynamics and spectrum of the signal at electrodes O1 and O2. (B) Topographic distribution of the alpha power on the scalp surface. The spatial distribution of the 120 electrodes’ system used in the simulation is also shown.
Figure 5A). The amplitude of the signal varies between 30 and 50 µV for both electrodes, and the maximum of the spectrum was at 9.98 Hz for electrode O1 and 9.90 Hz for electrode O2. These values are in the range of actual recordings. Other qualitative properties, like the waxing and waning characteristic of actual alpha rhythm, as well as interhemispheric symmetry, are also observed. Regarding the topographic distribution of the activity,
Figure 6: Reactivity test simulation. A trapezoidal input of 12 seconds duration and amplitude of 10^6 pulses per second is used to simulate visual input at the thalamic level. The response of the system is a significant reduction in the amplitude of both the signal and the peak of the alpha power, and a shift of the power spectrum toward higher frequencies.
Figure 5B shows that the simulated EEG is maximal over the occipital regions and diminishes in amplitude toward frontal areas, which has also been described for the alpha rhythm. The influence of an external input (stimulus) on the simulated alpha activity was also studied; the results are depicted in Figure 6. The actual alpha rhythm is known to be temporarily blocked by an influx of light (eye opening), a phenomenon known as reactivity. The degree of reactivity varies; the alpha rhythm may be blocked, suppressed, or attenuated with voltage
reduction (Niedermeyer & Lopes da Silva, 1987). In the simulations shown in Figure 6, a trapezoidal input representing visual stimulation, of 12 seconds duration and amplitude of 10^6 pulses per second, was given to the thalamus. As shown in the figure, the effect of this input was to attenuate the EEG signal and shift the whole EEG spectrum toward higher frequencies, with a reduction in the amplitude of the power, which is also observed in actual reactivity tests.

7.3 Generation of Delta, Theta, Beta, and Gamma Rhythms. For the simulation of the delta, theta, beta, and gamma rhythms, 7 brain regions (medial front-orbital gyrus, middle frontal gyrus, precentral gyrus, lateral front-orbital gyrus, medial frontal gyrus, superior frontal gyrus, and inferior frontal gyrus) were selected and subdivided based on their hemispheric location (left or right), for a total of 14 brain areas. Some of these areas, with the corresponding fiber tracts connecting them, are shown in Figure 4B. In these studies, 30 seconds of EEG data were simulated for each rhythm. Figures 7, 8, 9, and 10 show the temporal dynamics and the spectrum of the signal obtained at electrodes Fp1 and Fp2 for the delta, theta, beta, and gamma rhythm simulations, respectively. The amplitudes and spectral frequency bands obtained are within the range of experimentally observed EEG. Note the characteristic interhemispheric asymmetry in the frequency of the peak for the beta rhythm (see Figure 9): the maximum of the spectrum is located at 18.46 Hz for electrode Fp1 and at 16.76 Hz for electrode Fp2. The rest of the simulated rhythms do not show a significant interhemispheric frequency asymmetry. Finally, note that the power spectrum on the scalp surface was maximal at frontal electrodes in all cases and reached minimum amplitudes at occipital electrodes.

7.4 Influence of the Model Parameters on the Generation of the EEG Rhythms.
The parameter values in Table 1 show that by increasing c_1 (and thus c_2, c_3, and c_4) while keeping the other parameters constant, the EEG spectrum shifts toward higher frequencies, going from the beta to the gamma frequency band. On the other hand, keeping c_1 constant, we can switch from the delta to the theta rhythm by increasing the parameters associated with membrane time constants and dendritic tree time delays. However, as there are 32 parameters in the model (excluding the connectivity matrix values), it is difficult to study the influence of each of them on the output of the model. That is, the parameter values presented in Table 1 are not the only ones capable of producing EEG rhythms. Additionally, neural mass models can produce the same EEG rhythm with different sets of parameter values, as shown by David and Friston (2003), who varied the excitatory and inhibitory time constants and obtained regions of parameter space within which the same rhythm was obtained. In this section, we perform similar analyses for parameters a and b.
Figure 7: Simulation of delta rhythm (1–4 Hz). (A) Temporal dynamics and power spectrum of the signal at electrodes Fp1 and Fp2. (B) Topographic distribution of the delta power on the scalp surface. The spatial distribution of the 120 electrodes’ system used in the simulation is also shown.
In the first set of simulations, the areas involved in the alpha rhythm study were selected, and parameters a and b were varied from 10 s^-1 to 500 s^-1 with a step size of 10 s^-1, while the rest of the parameters were kept constant at the values in Table 1 that correspond to the alpha rhythm. In each case, the resulting dynamics was assigned to an EEG
Figure 8: Simulation of theta rhythm (4–8 Hz). (A) Temporal dynamics and power spectrum of the signal at electrodes Fp1 and Fp2. (B) Topographic distribution of the theta power on the scalp surface. The spatial distribution of the 120 electrodes’ system used in the simulation is also shown.
band. A phase diagram is shown in Figure 11A. Two phases are obtained depending on parameter values: one corresponding to alpha rhythms and the other corresponding to noise.
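The assignment of a simulated trace to an EEG band, used to build the phase diagrams of Figure 11, can be sketched by labeling the trace with the frequency of its dominant spectral peak. The band edges follow the figure captions (delta 1-4 Hz, theta 4-8 Hz, alpha 8-12 Hz, beta 12-30 Hz, gamma 30-70 Hz); the peak-picking classifier itself is our assumption, not the authors' exact procedure.

```python
import numpy as np

BANDS = [("delta", 1, 4), ("theta", 4, 8), ("alpha", 8, 12),
         ("beta", 12, 30), ("gamma", 30, 70)]

def classify_rhythm(signal, fs):
    """Label a simulated EEG trace by its dominant spectral peak;
    anything peaking outside 1-70 Hz is treated as noise."""
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    f_peak = freqs[np.argmax(power)]
    for name, lo, hi in BANDS:
        if lo <= f_peak < hi:
            return name
    return "noise"

# A pure 10 Hz oscillation sampled at 200 Hz should be labeled as alpha.
fs = 200.0
t = np.arange(0, 10, 1 / fs)
label = classify_rhythm(np.sin(2 * np.pi * 10 * t), fs)
```

Sweeping a and b, simulating, and applying such a classifier at each grid point yields a phase diagram of the kind shown in Figure 11.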
Figure 9: Simulation of beta rhythm (12–30 Hz). (A) Temporal dynamics and power spectrum of the signal at electrodes Fp1 and Fp2. (B) Topographic distribution of the beta power on the scalp surface. The spatial distribution of the 120 electrodes’ system used in the simulation is also shown.
In the second study, parameters a and b were also varied from 10 s^-1 to 500 s^-1 with a step size of 10 s^-1, but now the brain areas selected were the ones involved in the generation of the delta, theta, beta, and gamma rhythms. As in the previous case, the other parameters were kept constant at the values shown in Table 1 for the beta rhythm. The corresponding phase diagram
Figure 10: Simulation of gamma rhythm (30–70 Hz). (A) Temporal dynamics and power spectrum of the signal at electrodes Fp1 and Fp2. (B) Topographic distribution of the gamma power on the scalp surface. The spatial distribution of the 120 electrodes’ system used in the simulation is also shown.
is shown in Figure 11B. As can be seen, five different phases are obtained, corresponding to the delta, theta, beta, and gamma rhythms, as well as noise. However, the alpha rhythm (the small white spots between the theta and beta areas in Figure 11B) was also obtained for three pairs of values (b, a): (340, 180), (340, 190), and (120, 280).
Figure 11: Phase-space diagram of the average synaptic time constants for the excitatory (a) and inhibitory (b) populations. (A) The chosen regions were the ones used previously in the alpha rhythm simulation. The phase space splits into two regions that separate noise from alpha rhythm. (B) The selected areas were the ones used in the delta, theta, beta, and gamma rhythm simulations. The phase space splits into different regions, one for each rhythm and noise. Nevertheless, for three pairs of values (b, a), alpha rhythm was also obtained (white points between the theta and beta areas). (C) The same areas as in A were selected, but with values extracted from the random matrix of Figure 3B. (D) The same areas as in B, but with values extracted from the random matrix. In C and D, for very large, physiologically unrealistic values of (b, a), several rhythms appear.
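The random control matrix used in Figures 11C and 11D (and shown in Figure 3B), matched in sparseness to the anatomical one, can be generated by keeping the number of nonzero off-diagonal entries fixed while randomizing their positions and values. The construction below is a sketch under that matched-sparseness assumption only; the authors do not specify their exact procedure.

```python
import numpy as np

def random_matched_sparseness(conn, rng):
    """Random symmetric control matrix with the same number of nonzero
    off-diagonal entries (same sparseness) as the input connectivity matrix,
    with values normalized to [0, 1] as in Figure 3."""
    n = conn.shape[0]
    iu = np.triu_indices(n, k=1)
    n_links = int(np.count_nonzero(conn[iu]))
    chosen = rng.choice(iu[0].size, size=n_links, replace=False)
    rand = np.zeros_like(conn)
    rows, cols = iu[0][chosen], iu[1][chosen]
    rand[rows, cols] = rng.uniform(0.0, 1.0, size=n_links)
    return rand + rand.T   # symmetric, like the DWMRI matrix

# Toy 71-area sparse "anatomical" matrix with 300 links, for illustration.
rng = np.random.default_rng(3)
conn = np.zeros((71, 71))
iu = np.triu_indices(71, k=1)
pick = rng.choice(iu[0].size, size=300, replace=False)
conn[iu[0][pick], iu[1][pick]] = 1.0
conn = conn + conn.T
rand = random_matched_sparseness(conn, rng)
```

Because only the sparseness is preserved, any rhythm that survives this control must depend on the specific anatomical placement of the connections, which is the point of the comparison.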
Thus, by changing a or b, or both, the system undergoes phase transitions that allow the same model to generate the different EEG rhythms. Note also that, as expected, in both sets of simulations, noise is obtained for high values of a or b, or both, which correspond to processes decaying very rapidly and thus occurring at very high (nonphysiological) frequencies. One of the main goals of this article is the validation of the use of anatomical connectivity information obtained from DWMRI to model the LRC strengths between multiple coupled neural mass models representing cortical areas, and of the capability of this modeled system to generate different EEG rhythms. So far, we have shown that our model is capable of generating the alpha, delta, theta, beta, and gamma rhythms in a wide range of parameter values. Nevertheless, an important question remains: How important is the
Figure 12: Effect of the reduction of connectivity parameters on the alpha and beta rhythms. CSPs between areas were diminished by 50%, and parameter c_1 was also diminished, to c_1 = 50. The dotted line is the spectrum obtained with the standard values; the solid line, with the reduced parameters.
use of the actual connectivity matrix values? In other words, would it not be possible to generate EEG rhythms with a completely different connectivity matrix? To elucidate this matter, the two sets of simulations previously described were replicated, but using the random connectivity matrix shown in Figure 3B for the coupling between the brain areas involved in the simulations. This random matrix was generated with the same degree of sparseness as the actual connectivity matrix. The results are shown in Figures 11C and 11D. As can be seen, no rhythm emerged in these cases for values of (b, a) in the range from 10 s^-1 to 500 s^-1. Only for extremely high, nonphysiological values (corresponding to average population time constants of less than 1 ms) are some rhythms obtained. An interesting case to analyze is the impact of decreased connection strengths (see section 8). For this, the values of the anatomical connectivity matrix were decreased by 50%, and parameter c_1 was reduced to c_1 = 50, for the alpha and beta rhythm generation models. The results are shown in Figure 12. For both alpha and beta, the peak amplitude of the power spectrum decreases in the corresponding frequency band, while the delta and theta bands are accentuated.

8 Discussion

In this work, previous neural mass models (Lopes da Silva et al., 1974; Zetterberg et al., 1978; Jansen & Rit, 1995) were generalized and combined to account for intravoxel interactions, as well as for short-range and long-range connections representing interactions between cortical voxels within the same area and with voxels of other cortical areas, respectively. The thalamus
was also described by a neural mass model and coupled to the cortical areas. Anatomical information from DWMRI was introduced to constrain the coupling strength parameters between different brain regions. This allowed us to couple multiple brain areas with real anatomical connectivity values and to study the rhythms emerging from these interactions across different spatial scales. Knowing the average depolarization of pyramidal cells in each voxel and the lead field matrix that characterizes the spatial distribution of the electromagnetic field in the head, we obtained the voltage signals recorded at the locations of given electrodes on the scalp surface.

To validate the model, several simulations were carried out. The parameters tuned in each simulation were the time delays of efferent connections and the connectivity constants between the different populations within a voxel. This choice of parameters was guided by their importance in the generation of different rhythms, as demonstrated by previous work in this field (Jansen & Rit, 1995; David & Friston, 2003). The temporal dynamics and the spectrum of the human EEG rhythms were reproduced by the proposed model. The EEG rhythms were also obtained at the right spatial locations. In the case of the alpha rhythm, the activity was maximal at occipital electrodes and diminished toward the frontal electrodes. The delta, theta, beta, and gamma rhythms were maximal at frontal electrodes. The amplitudes of the rhythms were also within the range observed in actual EEG recordings. Interhemispheric symmetry was found in the case of the alpha rhythm, since the difference between the peaks of the EEG spectrum at electrodes O1 and O2 was less than 0.1 Hz. In contrast, the beta rhythm showed interhemispheric asymmetry: there was a difference of 1.7 Hz between the peaks of the EEG spectrum at Fp1 and Fp2. For the delta, theta, and gamma rhythms, interhemispheric symmetry with respect to the peak frequency of the spectrum was obtained.
In clinical practice, no EEG is complete without a reactivity test (Niedermeyer & Lopes da Silva, 1987). For this reason, we carried out a reactivity test simulation, finding that the effect of stimulation was to shift the spectrum toward higher frequencies together with a reduction in amplitude, which also agrees with actual EEG recordings in healthy subjects. Values from a random matrix were also used, instead of the anatomical connectivity matrix, to couple the brain areas. In this case, some rhythms appear only for extremely high values of parameters a and b. However, these values of a and b correspond to average population time constants of less than 1 ms, which does not make much sense from the physiological point of view. Thus, it can be concluded that in the range of physiologically plausible values, no rhythm was obtained when using a nonrealistic connectivity matrix. Finally, the model was used to study the effect of a reduction in all connectivity parameter values, predicting an increase in delta and theta power (see Figure 12), along with a decrease in the alpha and beta peaks. It is
Realistic Neural Mass Models
507
interesting to note that this slowing of oscillatory activity is a prominent functional abnormality that has been reported in EEG and MEG studies of Alzheimer disease (AD) patients (Coben, Danziger, & Berg, 1983; Penttilä, Partanen, Soininen, & Riekkinen, 1985; Schreiter-Gasser, Gasser, & Ziegler, 1993; Berendse, Verbunt, Scheltens, van Dijk, & Jonkman, 2000). On the other hand, it is also recognized that AD patients show a progressive cortical disconnection syndrome due to the loss of large pyramidal neurons in cortical layers III and V (Pearson, Esiri, Hiorns, Wilcock, & Powell, 1985; Morrison, Scherr, Lewis, Campbell, & Bloom, 1986). Relatedly, it has also been found that the total callosal area is significantly reduced in patients suffering from this pathology (Rose et al., 2000; Hampel et al., 1998). According to our model, the shift of the EEG spectrum toward low frequencies found in AD could be associated with the cortical disconnection syndrome, reflected in the decreased values of the coupling strength parameters, although the model does not exclude contributions from other mechanisms. Lumped models are simplifications of actual brain structures, so further improvements could be made by incorporating more realistic models in order to describe a wider range of phenomena. Another limitation of our approach is that DWMRI cannot distinguish between efferent and afferent fibers (this is reflected in the symmetry of the connectivity matrix; see Figure 3A), and connectivity values could thus be overestimated. Moreover, the validity of the connectivity matrix values depends on the assumption that probabilistic paths calculated from the information obtained with DWMRI reflect the actual connections between brain areas. While this letter focused on the EEG forward problem, the next logical step is the estimation of the parameters of this model from data. Previous work has accomplished this task. For example, in Valdes et al.
(1999), the neural mass model described in Zetterberg et al. (1978) was fitted to actual EEG alpha data. In that work, the LL approach was used to discretize Zetterberg’s model, which was then reformulated in a state-space formalism that could be integrated by using a nonlinear Kalman filter scheme. A maximum likelihood procedure was then employed to estimate the model parameters for several EEG data sets. Bayesian inference, dynamic causal modeling (Friston, Harrison, & Penny, 2003; David, Harrison, Kilner, Penny, & Friston, 2004), and genetic algorithms (Jansen, Balaji Kavaipatti, & Markusson, 2001) have also been used for estimating the parameters of neural mass models of the type presented here. A more ambitious goal is the integration of EEG with functional imaging techniques such as functional magnetic resonance imaging and positron emission tomography. The use of this model as a framework in which electrical, hemodynamic, and metabolic activities can be coupled is currently under study and will be the subject of a future publication.
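The LL scheme detailed in the appendix reduces each integration step to one matrix exponential. The step can be sketched as follows (the function and Jacobian names are illustrative; SciPy's `expm` stands in for the Krylov routine recommended in the appendix):

```python
import numpy as np
from scipy.linalg import expm

def ll_step(f, fy, fp, y, p_n, p_np1, h):
    # One local linearization step y_{n+1} = y_n + L exp(C_n h) R, with C_n
    # assembled by blocks from the Jacobians of f and the increment of p(t).
    d = y.size
    C = np.zeros((d + 2, d + 2))
    C[:d, :d] = fy(y, p_n)                        # df/dy, a d x d Jacobian
    C[:d, d] = fp(y, p_n) @ (p_np1 - p_n) / h     # df/dp times dp/dt estimate
    C[:d, d + 1] = f(y, p_n)
    C[d, d + 1] = 1.0
    v = expm(C * h)[:, -1]                        # e^{C_n h} R = last column
    return y + v[:d]                              # L v = first d entries of v

# Example: linear test system y' = -2y with constant p; the LL step is then
# exact, reproducing y(h) = e^{-2h} y(0).
f = lambda y, p: -2.0 * y
fy = lambda y, p: np.array([[-2.0]])
fp = lambda y, p: np.zeros((1, 1))
y1 = ll_step(f, fy, fp, np.array([1.0]), np.zeros(1), np.zeros(1), 0.1)
```

For linear drift the scheme is exact, which makes it a convenient correctness check before applying it to the full nonlinear neural mass equations.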
508
R. Sotero et al.
Appendix: Local Linearization Method for RDEs

Assume that a k-dimensional random process p(t), t ∈ [t0, T], and a nonlinear function f are given, and define the d-dimensional RDE

dy(t) = f(y(t), p(t)) dt,    y(t0) = y0.

Then, given a step size h, the local linearization scheme that solves this equation numerically at the time instants t_n = t0 + nh, n = 0, 1, …, is given by

y_{n+1} = y_n + L e^{C_n h} R,

where L = [I_d  0_{d×2}], R = [0_{1×(d+1)}  1]^T, and the (d + 2) × (d + 2) matrix C_n is defined by blocks as

C_n = [ f_y(y_n, p(t_n))    f_p(y_n, p(t_n)) (p(t_{n+1}) − p(t_n))/h    f(y_n, p(t_n)) ]
      [ 0                   0                                           1              ]
      [ 0                   0                                           0              ].

Here, f_y and f_p denote the derivatives of f with respect to the variables y and p, respectively. Due to the high dimensionality of our problem, the main task in the implementation of the LL scheme is the computation of the matrix exponential e^{C_n h}. The use of Krylov subspace methods (Sidje, 1998) is strongly recommended for that purpose. Indeed, those methods compute the vector v = e^{C_n h} R directly, and the operation Lv then reduces to taking the first d elements of v.

Acknowledgments

We thank Trinidad Virues, Elena Cuspineda, and Thomas Wennekers for helpful comments.

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Behrens, T., Johansen-Berg, H., Woolrich, M. W., Smith, S. M., Wheeler-Kingshott, C., Boulby, P. A., Barker, G. J., Sillery, E. L., Sheehan, K., Ciccarelli, O., Thompson, A. J., Brady, J. M., & Matthews, P. M. (2003). Non-invasive mapping of connections between human thalamus and cortex using diffusion imaging. Nature Neuroscience, 6, 750–757.
Berendse, H. W., Verbunt, J. P., Scheltens, P., van Dijk, B. W., & Jonkman, E. J. (2000). Magnetoencephalographic analysis of cortical activity in Alzheimer’s disease: A pilot study. Clin. Neurophysiol., 111, 604–612.
Braitenberg, V. (1978). Cortical architectonics: General and areal. In M. A. B. Brazier & M. Petche (Eds.), Architectonics of the cerebral cortex (pp. 443–465). New York: Raven Press.
Braitenberg, V., & Schüz, A. (1998). Cortex: Statistics and geometry of neuronal connectivity. Berlin: Springer-Verlag.
Carbonell, F., Jimenez, J. C., Biscay, R. J., & de la Cruz, H. (2005). The local linearization method for numerical integration of random differential equations. BIT Numerical Mathematics, 45(1), 1–14.
Coben, L. A., Danziger, W. L., & Berg, L. (1983). Frequency analysis of the resting awake EEG in mild senile dementia of Alzheimer type. Electroencephalogr. Clin. Neurophysiol., 55, 372–380.
Collins, D. L., Neelin, P., Peters, R. M., & Evans, A. C. (1994). Automatic 3D intersubject registration of MR volumetric data in standardized Talairach space. J. Comp. Assist. Tomog., 18, 192–205.
David, O., & Friston, K. J. (2003). A neural mass model for MEG/EEG: Coupling and neuronal dynamics. NeuroImage, 20, 1743–1755.
David, O., Harrison, L., Kilner, J., Penny, W., & Friston, K. J. (2004). Studying effective connectivity with a neural mass model of evoked MEG/EEG responses. In Proceedings of the 14th International Conference on Biomagnetism BIOMAG 2004 (pp. 136–138). Boston.
Destexhe, A., & Sejnowski, T. J. (2001). Thalamocortical assemblies: How ion channels, single neurons, and large-scale networks organize sleep oscillations. New York: Oxford University Press.
Douglas, R. J., & Martin, K. A. C. (2004). Neuronal circuits of the neocortex. Annu. Rev. Neurosci., 27, 419–451.
Evans, A. C., Collins, D. L., Mills, S. R., Brown, E. D., Kelly, R. L., & Peters, T. M. (1993). 3D statistical neuroanatomical models from 305 MRI volumes. In Proceedings of the IEEE Nuclear Science Symposium and Medical Imaging Conference (pp. 1813–1817). London: MTP Press.
Evans, A. C., Collins, D. L., Neelin, P., MacDonald, D., Kamber, M., & Marrett, T. S. (1994). Three-dimensional correlative imaging: Applications in human brain mapping. In R. Thatcher, M. Hallet, T. Zeffiro, E. R. John, & M. Huerta (Eds.), Functional neuroimaging: Technical foundations (pp. 145–161). New York: Academic Press.
Friston, K. J., Harrison, L., & Penny, W. (2003). Dynamic causal modelling. NeuroImage, 19, 1273–1302.
Gard, T. C. (1988). Introduction to stochastic differential equations. New York: Marcel Dekker.
Gilbert, C. D., & Wiesel, T. N. (1979). Morphology and intracortical projections of functionally characterised neurones in the cat visual cortex. Nature, 280, 120–125.
Gilbert, C. D., & Wiesel, T. N. (1983). Functional organization of the visual cortex. Prog. Brain Res., 58, 209–218.
Goldman, R. I., Stern, J. M., Engel, J., & Cohen, M. S. (2002). Simultaneous EEG and fMRI of the alpha rhythm. NeuroReport, 13(18), 2487–2492.
Grüne, L., & Kloeden, P. E. (2001). Pathwise approximation of random ordinary differential equations. BIT, 41, 711–721.
Hagmann, P., Thiran, J. P., Jonasson, L., Vandergheynst, P., Clarke, S., Maeder, P., & Meuli, R. (2003). DTI mapping of human brain connectivity: Statistical fibre tracking and virtual dissection. NeuroImage, 19, 545–554.
Hämäläinen, M., Hari, R., Ilmoniemi, R. J., Knuutila, J., & Lounasmaa, O. V. (1993). Magnetoencephalography—theory, instrumentation and applications to noninvasive studies of the working human brain. Rev. Modern Phys., 65, 413–497.
Hampel, H., Teipel, S. J., Alexander, G. E., Horwitz, B., Teichberg, D., Schapiro, M. B., & Rapoport, S. I. (1998). Corpus callosum atrophy is a possible indicator of region- and cell type–specific neuronal degeneration in Alzheimer disease: A magnetic resonance imaging analysis. Arch. Neurol., 55, 193–198.
Hasminskii, R. Z. (1980). Stochastic stability of differential equations. Alphen aan den Rijn, The Netherlands: Sijthoff & Noordhoff.
Iturria-Medina, Y., Canales-Rodríguez, E., Melié-García, L., & Valdés-Hernández, P. (2005). Bayesian formulation for fiber tracking. Presented at the 11th Annual Meeting of the Organization for Human Brain Mapping, Toronto, Ontario, Canada.
Iturria-Medina, Y., Valdés-Hernández, P., & Canales-Rodríguez, E. (2005). Measures of anatomical connectivity obtained from neuro diffusion images. Presented at the 11th Annual Meeting of the Organization for Human Brain Mapping, Toronto, Ontario, Canada.
Jansen, B. H., Balaji Kavaipatti, A., & Markusson, O. (2001). Evoked potential enhancement using a neurophysiologically-based model. Methods of Information in Medicine, 40, 338–345.
Jansen, B. H., & Rit, V. G. (1995). Electroencephalogram and visual evoked potential generation in a mathematical model of coupled cortical columns. Biol. Cybern., 73, 357–366.
Jansen, B. H., Zouridakis, G., & Brandt, M. E. (1993). A neurophysiologically-based model of flash visual evoked potentials. Biol. Cybern., 68, 275–283.
Jirsa, V. K., & Haken, H. (1996). Field theory of electromagnetic brain activity. Phys. Rev. Lett., 77(5), 960–963.
Jirsa, V. K., & Haken, H. (1997). A derivation of a macroscopic field theory of the brain from the quasi-microscopic neural dynamics. Physica D, 99, 503–526.
Koch, M. A., Norris, D. G., & Hund-Georgiadis, M. (2002). An investigation of functional and anatomical connectivity using magnetic resonance imaging. NeuroImage, 16, 241–250.
Lopes da Silva, F. H., Hoeks, A., Smits, H., & Zetterberg, L. H. (1974). Model of brain rhythmic activity: The alpha-rhythm of the thalamus. Kybernetik, 15, 27–37.
Martin, K. A. C., & Whitteridge, D. (1984). Form, function and intracortical projections of spiny neurones in the striate visual cortex of the cat. J. Physiol., 353, 463–504.
Martínez-Montes, E., Valdés-Sosa, P. A., Miwakeichi, F., Goldman, R. I., & Cohen, M. S. (2004). Concurrent EEG/fMRI analysis by multiway partial least squares. NeuroImage, 22, 1023–1034.
Mazziotta, J. C., Toga, A., Evans, A. C., Fox, P., & Lancaster, J. (1995). A probabilistic atlas of the human brain: Theory and rationale for its development. NeuroImage, 2, 89–101.
McGuire, B. A., Gilbert, C., Rivlin, P. K., & Wiesel, T. N. (1991). Targets of horizontal connections in macaque primary visual cortex. J. Comp. Neurol., 305, 370–392.
Morrison, J. H., Scherr, S., Lewis, D. A., Campbell, M. J., & Bloom, F. E. (1986). The laminar and regional distribution of neocortical somatostatin and neuritic plaques: Implications for Alzheimer’s disease as a global neocortical disconnection syndrome. In A. B. Scheibel & A. F. Weschler (Eds.), The biological substrates of Alzheimer’s disease (pp. 115–131). New York: Academic Press.
Mountcastle, V. B. (1978). An organizing principle for cerebral function. In G. M. Edelman & V. B. Mountcastle (Eds.), The mindful brain (pp. 7–50). Cambridge, MA: MIT Press.
Mountcastle, V. B. (1997). The columnar organization of the neocortex. Brain, 120, 701–722.
Niedermeyer, E., & Lopes da Silva, F. (1987). Electroencephalography: Basic principles, clinical applications and related fields (2nd ed.). Baltimore: Urban & Schwarzenberg.
Nunez, P. L. (1981). Electric fields of the brain. New York: Oxford University Press.
Parker, G. J., Wheeler-Kingshott, C. A., & Barker, G. J. (2002). Estimating distributed anatomical connectivity using fast marching methods and diffusion tensor imaging. IEEE Transactions on Medical Imaging, 21(5), 505–512.
Pearson, R. C. A., Esiri, M. M., Hiorns, R. W., Wilcock, G. K., & Powell, T. P. S. (1985). Anatomical correlates of the distribution of the pathological changes in the neocortex in Alzheimer’s disease. Proc. Natl. Acad. Sci. USA, 82, 4531–4534.
Penttilä, M., Partanen, J. V., Soininen, H., & Riekkinen, P. J. (1985). Quantitative analysis of occipital EEG in different stages of Alzheimer’s disease. Electroencephalogr. Clin. Neurophysiol., 60, 1–6.
Plonsey, R. (1963). Reciprocity applied to volume conductors and the ECG. IEEE Trans. BioMedical Electronics, 10, 9–12.
Prakasa-Rao, B. L. S. (1999). Statistical inference for diffusion type processes. New York: Oxford University Press.
Rinzel, J., Terman, D., Wang, X.-J., & Ermentrout, B. (1998). Propagating activity patterns in large-scale inhibitory neuronal networks. Science, 279, 1351–1355.
Rose, S. E., Chen, F., Chalk, J. B., Zelaya, F. O., Strugnell, W. E., Benson, M., Semple, J., & Doddrell, D. M. (2000). Loss of connectivity in Alzheimer’s disease: An evaluation of white matter tract integrity with colour coded MR diffusion tensor imaging. J. Neurol. Neurosurg. Psychiatry, 69, 528–530.
Rush, S., & Driscoll, D. A. (1969). EEG electrode sensitivity–an application of reciprocity. IEEE Trans. on Biomed. Eng., 16(1), 15–22.
Schreiter-Gasser, U., Gasser, T., & Ziegler, P. (1993). Quantitative EEG analysis in early onset Alzheimer’s disease: A controlled study. Electroencephalogr. Clin. Neurophysiol., 86, 15–22.
Schurz, H. (2002). Numerical analysis of stochastic differential equations without tears. In D. Kannan & V. Lakshmikantham (Eds.), Handbook of stochastic analysis and applications (pp. 237–358). New York: Marcel Dekker.
Shoji, I., & Ozaki, T. (1998). Estimation for nonlinear stochastic differential equations by a local linearization method. Stochast. Anal. Appl., 16, 733–752.
Sidje, R. B. (1998). EXPOKIT: Software package for computing matrix exponentials. ACM Trans. Math. Software, 24, 130–156.
Tournier, J. D., Calamante, F., Gadian, D. G., & Connelly, A. (2003). Diffusion-weighted magnetic resonance imaging fibre tracking using a front evolution algorithm. NeuroImage, 20, 276–288.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press.
Tuch, D. (2002). Diffusion MRI of complex neural architecture. Unpublished doctoral dissertation, Massachusetts Institute of Technology.
Valdes, P. A., Jiménez, J. C., Riera, J., Biscay, R., & Ozaki, T. (1999). Nonlinear EEG analysis based on a neural mass model. Biol. Cybern., 81, 415–424.
van Rotterdam, A., Lopes da Silva, F. H., van den Ende, J., Viergever, M. A., & Hermans, A. J. (1982). A model of the spatio-temporal characteristics of the alpha rhythm. Bull. Math. Biol., 44, 283–305.
Wang, X.-J., Golomb, D., & Rinzel, J. (1995). Emergent spindle oscillations and intermittent burst firing in a thalamic model: Specific neuronal parameters. Proc. Natl. Acad. Sci. USA, 92, 5577–5581.
Wendling, F., Bellanger, J. J., Bartolomei, F., & Chauvel, P. (2000). Relevance of nonlinear lumped-parameter models in the analysis of depth-EEG epileptic signals. Biol. Cybern., 83, 367–378.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24.
Zetterberg, L. H., Kristiansson, L., & Mossberg, K. (1978). Performance of a model for a local neuron population. Biol. Cybern., 31, 15–26.
Received December 20, 2005; accepted June 9, 2006.
LETTER
Communicated by Aapo Hyvärinen
Dimension Selection for Feature Selection and Dimension Reduction with Principal and Independent Component Analysis

Inge Koch
[email protected] Department of Statistics, School of Mathematics, University of New South Wales, Sydney, NSW 2052 Australia
Kanta Naito
[email protected] Department of Mathematics, Faculty of Science and Engineering, Shimane University, Matsue 690-8504 Japan
This letter is concerned with the problem of selecting the best or most informative dimension for dimension reduction and feature extraction in high-dimensional data. The dimension of the data is reduced by principal component analysis; subsequent application of independent component analysis to the principal component scores determines the most nongaussian directions in the lower-dimensional space. A criterion for choosing the optimal dimension based on bias-adjusted skewness and kurtosis is proposed. This new dimension selector is applied to real data sets and compared to existing methods. Simulation studies for a range of densities show that the proposed method performs well and is more appropriate for nongaussian data than existing methods.

1 Introduction

This letter proposes a dimension selector for the most informative lower-dimensional subspace in feature extraction and dimension-reduction problems. Our methodology is based on separating the informative part of the data from noise in high-dimensional problems: it reduces high-dimensional data and represents them on a smaller number of new directions that contain the interesting information inherent in the original data. Our algorithm combines principal component analysis (PCA) and independent component analysis (ICA) and makes use of objective, data-driven criteria in the selection of the best dimension. A combined PC-IC algorithm has been used in a number of papers, including Prasad, Arcot, and Koch (2005), Gilmour and Koch (2004), and Koch, Marron, and Chen (2005), and is also implicitly referred to in many ICA algorithms (see Hyvärinen, Karhunen, & Oja, 2001). Our approach is motivated

Neural Computation 19, 513–545 (2007)
© 2007 Massachusetts Institute of Technology
514
I. Koch and K. Naito
by Prasad et al. (2005), who proposed a feature selection and extraction method for efficient classification by choosing the best subset of features using PC-IC. Their method resulted in a very efficient algorithm that is also highly competitive in terms of classification accuracy. Their research was driven by finding a smaller number of predictors or features for efficient classification, while our research focuses on the more general problem of dimension reduction in high-dimensional data. Feature selection can be regarded as a special case of dimension reduction: representing or approximating high-dimensional data in terms of a smaller number of new directions that span a lower-dimensional space. PCA alone achieves dimension reduction of high-dimensional data. PCA is well understood (see Anderson, 1985; Mardia, Kent, & Bibby, 1979), easily accessible, and implemented in many software packages. However, the lower-dimensional directions chosen by PCA do not always preserve or represent the structure inherent in the original data. Finding interesting and informative structure in data is one of the two main aims of ICA. ICA can be applied to the original or to PC-reduced lower-dimensional data. In either case, it aims to find new directions, and the number of new directions is at most that of the data that form the input to ICA. Combining PCA and ICA is reasonable because considerable computational effort is required to implement ICA, especially for high-dimensional data; dimension reduction with no, or minimal, loss of information is therefore essential as a step preceding ICA. This is the reason we adopt PCA as a first step prior to ICA. Different approaches to ICA and software packages are available (see Cardoso, 1999; Bach & Jordan, 2002; Amari, 2002). We follow the approach of Hyvärinen et al. (2001) and use their software, FastICA2.4 (Hyvärinen & Oja, 2005).
More specifically, we use their skewness and kurtosis approximations to negentropy, since these provide fast computation as well as easy interpretation. Dimension-reduction techniques generally provide a constructive approach to representing data in terms of new directions in lower-dimensional spaces, with the number of new directions varying from one to the (possibly very high) dimension of the original data. Traditionally in PCA, one looks for an elbow, or one uses the 95% (resp. 99.9%) rule; that is, one uses the smallest dimension whose contribution to total variance is at least 95% (resp. 99.9%). The latter is the route Prasad et al. (2005) have taken, and this choice of dimension clearly lacks objectivity. A number of (gaussian) model-based approaches to dimension reduction have been proposed in the literature over the past few years. We review these approaches in section 5. If the model fits the data, then these methods may be exactly what is needed, and good dimension-reduction solutions can be expected from any of them. Furthermore, in specific applications, prior or additional information can help determine the dimension. A practical adjustment of this kind has been discussed in Beckmann and Smith (2004). However, to our knowledge, there do not appear to exist
Dimension Selection with PC-IC
515
nonparametric or model-free objective approaches that will determine the right number of new directions or the best dimension for a lower-dimensional space. We shall see in section 6 in particular that the gaussian model–based approaches are not adequate in many cases and that better results can be achieved with our nonparametric method. The main purpose of this article is to propose a generic, data-driven, nonparametric approach to dimension selection that is based on theory and objective criteria and requires neither a model nor knowledge of the distribution of the data. Of course, if additional knowledge about the distribution of the data exists, integrating such knowledge is expected to improve performance, but we will not be concerned with such special cases here. Closely related to the selection of this dimension is the choice of the direction vectors in this lower-dimensional subspace. In an ICA-based framework, the natural candidates are those IC directions based on the same objective function as the selector, in our case skewness or kurtosis. Finding those directions is the aim of any good ICA algorithm, so we will not be concerned with this issue here but will focus instead on the choice of the dimension for the lower-dimensional subspace.

We use the following notation throughout this letter. Let X denote a d-dimensional random vector with mean µ and covariance matrix Σ. For k = 3, 4, define the moments

β_k(X) = max_{α ∈ S^{d−1}} B_k(α|X),    (1.1)

where

B_k(α|X) = E[ (α^T (X − µ) / √(α^T Σ α))^k ] − 3δ_{k,4},

S^{d−1} is the unit sphere in R^d, and δ_{k,4} is the Kronecker delta. The moments β_3(X) and β_4(X) are the multivariate skewness and kurtosis of X, respectively, in the sense of Malkovich and Afifi (1973). Note that multivariate skewness was originally defined as the square of β_3(X) above, but many authors since then have used it in the form given here, as this is more in keeping with the one-dimensional version. The multivariate skewness and kurtosis are both defined such that they are zero for gaussian random vectors. For N independent observations X_1, …, X_N, the sample skewness and sample kurtosis are defined as

β̂_k(X_1, …, X_N) = max_{α ∈ S^{d−1}} B̂_k(α|X_1, …, X_N)
for k = 3, 4, where

B̂_k(α|X_1, …, X_N) = (1/N) Σ_{i=1}^{N} (α^T (X_i − X̄) / √(α^T S α))^k − 3δ_{k,4}.    (1.2)
Here, X̄ denotes the sample mean and S the sample covariance matrix. We will base our feature extraction and dimension reduction for high-dimensional data on these β_k's. It should be noted that β_k(X) is not a function of the data but a property of the distribution of X, while β̂_k(X_1, …, X_N) is a random variable. Displaying the arguments X and (X_1, …, X_N) in β_k(X) and β̂_k(X_1, …, X_N), respectively, is necessary to explain our technique in the subsequent discussion. This type of statistic, based on projections and summary statistics, is commonly used in the statistical science community. A typical and powerful tool based on projections is projection pursuit (see Huber, 1985; Jones & Sibson, 1987; Friedman, 1987), in which the nonlinear structure of high-dimensional data is explored through low-dimensional projections. The optimal low-dimensional directions are determined via a projection index that aims to detect nongaussianity. Hence, statistics for testing gaussianity can be based on the projection index, and typical instances are the β_k's above. The projection index proposed in Jones and Sibson (1987) is nothing other than a weighted quadratic combination of β_k's. A close relation between ICA and projection pursuit has been discussed by several authors (see, e.g., Cardoso, 1999).

This letter is organized as follows. Section 2 describes our methodology and algorithm. Details are given in both the population version and the empirical (sample) version. A criterion for selecting the best or most informative dimension is proposed in section 3. The idea behind the well-known Akaike information criterion (AIC) is exploited to develop the proposed criterion. We argue that the gaussian distribution can be seen as the null structure for developing our criterion and present the necessary theoretical results. In section 4, we report the results of applying our selection method to real data sets.
Section 5 discusses and appraises some model-based approaches to dimension reduction. Reports of Monte Carlo simulations and comparisons with the model-based approaches discussed in section 5 are provided in section 6. All proofs of theoretical results are summarized in appendix A.
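The directional statistics of equation 1.2 are straightforward to compute for a given direction; the maximization over the sphere is what FastICA supplies. A minimal NumPy sketch (the random-direction search below is a crude illustrative stand-in for FastICA, not the method used in the letter):

```python
import numpy as np

def B_hat(alpha, X, k):
    # Sample statistic B_k(alpha | X_1, ..., X_N) of equation 1.2: the k-th
    # moment of the standardized projection, minus 3 in the kurtosis case.
    Xc = X - X.mean(axis=0)                       # center at the sample mean
    S = np.cov(X, rowvar=False)                   # sample covariance matrix
    z = Xc @ alpha / np.sqrt(alpha @ S @ alpha)   # standardized projections
    return (z**k).mean() - (3.0 if k == 4 else 0.0)

def beta_hat(X, k, n_dir=500, seed=0):
    # Crude approximation of the maximum over the unit sphere by random
    # search; a real implementation would use an ICA-style optimizer.
    rng = np.random.default_rng(seed)
    dirs = rng.standard_normal((n_dir, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return max(B_hat(a, X, k) for a in dirs)
```

Because the projections are standardized by the sample mean and covariance, the statistic is invariant under affine rescaling of the data, matching the location-scale invariance used later in section 3.3.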
2 PC-IC Algorithm

Our algorithm for dimension reduction and feature extraction is described in this section. It combines PCA and ICA.
2.1 PCA and ICA. ICA can be used to extract features in multivariate structure (Hyvärinen et al., 2001). Feature extraction and testing for gaussianity (normality) with the skewness and kurtosis approximations in ICA were discussed in Prasad et al. (2005) and Koch et al. (2005). The relevant skeleton of the algorithm in those articles contains two steps, reduce and maximize:

Reduce. For 2 ≤ p < d, determine the p-dimensional subspace consisting of the first p principal component directions obtained via PCA.

Maximize. Find the most nongaussian direction vector for the p-dimensional data using ICA.

The usual PCA is applied to high-dimensional data in the reduce step, and the reduced p-dimensional data are obtained as the p-dimensional principal component scores. ICA is applied to these p-dimensional principal component scores in the maximize step to find the most prominent structure. We shall refer to this algorithm as the PC-IC algorithm.

2.2 Algorithm in Detail: Population Version. We describe here the PC-IC algorithm in the population setting. Assume that all components of X have fourth moments. Consider the spectral decomposition Σ = Γ Λ Γ^T, where Λ = diag{λ_1, …, λ_d}, λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d > 0 are the eigenvalues of Σ, and Γ = [γ_1, …, γ_d] is an orthogonal matrix whose ℓth column γ_ℓ is the eigenvector of Σ belonging to λ_ℓ, for ℓ = 1, …, d. For simplicity of notation, we assume that Σ has rank d. Note that our arguments below are not affected if the rank of Σ is r < d; the only change required in that situation is to consider values p ≤ r instead of p < d. Let Γ_p = [γ_1, …, γ_p], and let Λ_p = diag{λ_1, …, λ_p} for p < d. We obtain a dimension reduction of X to p dimensions via the transformation Y = Γ_p^T (X − µ). The vector Y is known as the p-dimensional PC score. To extract features inherent in the original structure of X, the PC-IC algorithm applies FastICA (Hyvärinen et al., 2001) to this Y in the following way. First, Y is sphered:

Z(p) = Λ_p^{−1/2} Y = Λ_p^{−1/2} Γ_p^T (X − µ).

Note that Y, and hence Z(p), are p-dimensional random vectors. The PC-IC algorithm finds the direction α_{o,k} that attains β_k(Z(p)) for k = 3, 4, as
in equation 1.1:

α_{o,k} = arg max_{α ∈ S^{p−1}} B_k(α|Z(p)).    (2.1)
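In practice, Σ, Γ, and Λ are replaced by their sample counterparts (made precise in section 2.3). A minimal sketch of the resulting reduce-and-sphere map, with illustrative names:

```python
import numpy as np

def pc_reduce_sphere(X, p):
    # Empirical version of Z(p): project the centered data onto the first p
    # eigenvectors of the sample covariance and scale each score to unit
    # variance, so the output has identity sample covariance.
    mu = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    lam, G = np.linalg.eigh(S)          # eigh returns ascending eigenvalues
    idx = np.argsort(lam)[::-1]         # reorder to descending
    lam, G = lam[idx], G[:, idx]
    return (X - mu) @ G[:, :p] / np.sqrt(lam[:p])
```

The sphered scores returned here are exactly the input on which the ICA (maximize) step operates.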
2.3 Algorithm in Detail: Empirical Version. The empirical version is based on the sample covariance matrix S of the data X_1, …, X_N. The spectral decomposition S = Γ̂ Λ̂ Γ̂^T, with the eigenvalues in descending order, naturally provides submatrices Γ̂_p and Λ̂_p, defined similarly to Γ_p and Λ_p, respectively. From these we obtain the p-dimensional sphered data

Z(p)_i = Λ̂_p^{−1/2} Γ̂_p^T (X_i − µ̂),    (2.2)

for i = 1, …, N, where µ̂ = X̄. In the empirical version corresponding to equation 2.1, we let IC1_k denote the PC-IC direction that gives rise to β̂_k(Z(p)_1, …, Z(p)_N). So for k = 3, 4,

IC1_k = α̂_{o,k} = arg max_{α ∈ S^{p−1}} B̂_k(α|Z(p)_1, …, Z(p)_N),    (2.3)
with B̂_k(α|Z(p)_1, …, Z(p)_N) as in equation 1.2. We explore the underlying structure of the original data and extract informative features from the projection (α̂_{o,k}^T Z(p)_1, …, α̂_{o,k}^T Z(p)_N) of the reduced data Z(p)_i (i = 1, …, N) onto IC1_k (k = 3, 4). In Prasad et al. (2005), a similar algorithm, based on the first p directions IC1–ICp, was shown to be efficient in extracting features of high-dimensional data. A crucial, and still unanswered, question is that of the choice of p, the dimension onto which the data are projected from their original d dimensions. Prasad et al.'s heuristic procedure for selecting the dimension p is based on the variance contribution in the PCA step: they use the cut-off dimension p_95%, which results in a contribution to variance of at least 95%. This criterion may be reasonable if we analyze multivariate data by PCA only and have no other relevant information. However, a criterion based on the variance contribution appears rather arbitrary and does not seem appropriate in a combined PC-IC analysis, since ICA finds directions by maximizing skewness and kurtosis. Our claim is that the dimension should be selected in accordance with the objective functions, skewness and kurtosis, if dimension reduction and feature extraction are implemented through these quantities.

3 Selection of the Dimension

This section presents the procedure for choosing the best dimension.

3.1 Methodology. The first direction, IC1_k (k = 3, 4), is the most nongaussian direction, at least in theory. This direction and its degree of
nongaussianity will depend on the input data. If the dimension of the input data is too small, ICA may not find interesting directions, since valuable information may have been lost. On the other hand, a large p requires a higher computational effort, and the data may still contain noise or irrelevant information. A compromise between data complexity and available information is therefore required. A naive idea is to select the dimension p as

p̂_k = arg max_{2 ≤ ℓ ≤ p*} β̂_k(Z(ℓ)_1, …, Z(ℓ)_N),

for k = 3, 4, where p* = min{N, d} is the upper bound (or p* = r, where r denotes the rank of the sample covariance matrix). It is easy to see that this selection rule has the trivial solution p̂_k = p* for both k = 3 and k = 4, which implies that this criterion is not meaningful. Therefore, we need to formulate a criterion that includes a penalty term for large dimensions. We exploit the idea behind the AIC for our problem. It is known that the AIC is based on a bias-adjusted version of the log likelihood. Since β̂_k(Z(p)_1, …, Z(p)_N) is an estimator of β_k(Z(p)), it is natural to reduce the bias of this estimator, and we therefore aim to evaluate
βk (Z( p)1 , . . . , Z( p) N ) − βk (Z( p)), Biask ( p) = E for k = 3, 4. The evaluation of this bias now raises the question of which population should be employed. 3.2 Null Structure. What distribution should we assume as the null structure in our setting? We recall that our objective in the ICA step is to pursue a nongaussian structure as measured by skewness or kurtosis. To judge whether the obtained nongaussian structure is real, it is necessary to evaluate the performance of βk (Z( p)1 , . . . , Z( p) N ) under the structure with null skewness and kurtosis. This suggests that the multivariate gaussian distribution itself is one of the right null structures. This kind of argument has been employed by Sun (1991) in conjunction with the significance level in projection pursuit. In what follows, all distributional results will be addressed under the assumption of gaussianity of X. We therefore note that βk (Z( p)) = 0 for k = 3, 4 and for all p ≤ p ∗ under the gaussianity assumption. Further, the gaussianity assumption implies that βk (Z( p)1 , . . . , Z( p) N ) is an estimator of zero, and so
Biask(p) = E[βk(Z(p)1, …, Z(p)N)].

3.3 Theoretical Results. Our problem requires the evaluation of Biask(p) under the assumption that X1, …, XN are independently and identically distributed random vectors from the d-dimensional gaussian
520
I. Koch and K. Naito
distribution Nd(µ, Σ). Since skewness and kurtosis are location-scale invariant, without loss of generality we can assume that µ = 0 and Σ = Id, the identity matrix. In this setting, as N → ∞,
(N/k!) βk(Z(p)1, …, Z(p)N) ⇒ Tk(p)

for any p ≤ p*, where ⇒ denotes convergence in distribution and Tk(p) is the maximum of a zero-mean gaussian random field on S^{p−1} with a suitable covariance function (see Machado, 1983; Kuriki & Takemura, 2001). Hence, for large N,

E[(N/k!) βk(Z(p)1, …, Z(p)N)] ≈ E[Tk(p)]

for k = 3, 4. Kuriki and Takemura (2001) consider the probability P(Tk(p) > a) for a > 0. As this tail probability is difficult to calculate exactly, they give upper and lower bounds for it. We make use of this idea and develop upper and lower bounds for the expected value of Tk(p) for k = 3, 4 by noting that

E[Tk(p)] = ∫_0^∞ P(Tk(p) > a) da.
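The identity E[T] = ∫_0^∞ P(T > a) da holds for any nonnegative random variable. As a quick numerical sanity check of this identity (our illustration only, with an exponential variable standing in for Tk(p)):

```python
import math

# For T ~ Exponential(rate=1): P(T > a) = exp(-a) and E[T] = 1, so the
# tail integral should recover the mean.  Midpoint-rule quadrature.
def tail_integral(survival, upper, steps):
    h = upper / steps
    return h * sum(survival((i + 0.5) * h) for i in range(steps))

approx_mean = tail_integral(lambda a: math.exp(-a), upper=40.0, steps=200000)
```

The truncation at `upper=40.0` discards only a mass of about e^(-40), so the quadrature error dominates and is itself negligible here.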
This leads to our main result.

Theorem 1. For k = 3, 4, we have LBk(p) ≤ E[Tk(p)] ≤ UBk(p). For each dimension p,

LBk(p) = Σ_{e=0, e even}^{p−1} ω_{p−e,k} Λ_{p−e,ρ−p+e}(tan² θk),

UBk(p) = LBk(p) + ((2k − 2)/(3k − 2))^{1/2} E[χ_ρ] [1 − Φ(θk, p)],   (3.1)

where χ_ρ denotes a chi-distributed random variable with ρ degrees of freedom,

ρ = ρ(p, k) = C(p + k − 1, k),
Dimension Selection with PC-IC
θk = cos^{−1}[((2k − 2)/(3k − 2))^{1/2}],

Φ(θk, p) = Σ_{e=0, e even}^{p−1} ω_{p−e,k} B̄_{(p−e)/2, (ρ−p+e)/2}((2k − 2)/(3k − 2)),

ω_{p−e,k} = (−1)^{e/2} k^{e/2} ((k − 1)/k)^{(p−1)/2} Γ((p + 1)/2) / [Γ((p + 1 − e)/2) (e/2)!],
and B̄_{α,β}(·) denotes the upper-tail probability of the beta distribution given by

B̄_{α,β}(a) = ∫_a^1 ξ^{α−1} (1 − ξ)^{β−1} / B(α, β) dξ
with the usual beta function B(α, β). Further, for b > 0 and positive integers m and n,

Λ_{m,n}(b) = μ_m − Ψ_{m,n}(b),   with   μ_m = E[χ_m] = √2 Γ((m + 1)/2) / Γ(m/2),

Ψ_{m,n}(b) = ∫_0^∞ ∫_{a²}^∞ g_m(ξ) Ḡ_n(bξ) dξ da,

where g_m(ξ) is the density function of χ²_m and Ḡ_n(t) = P(χ²_n > t) for t ≥ 0. Expressions for Ψ_{m,n}(b) are

Ψ_{m,2ν}(b) = E[χ_m] (1/√(b + 1))^{m+1} [1 + Σ_{i=1}^{ν−1} (b/(b + 1))^i / (i B(i, (m + 1)/2))],

Ψ_{m,2ν−1}(b) = Ψ_{m,1}(b) + E[χ_m] (1/√(b + 1))^{m+1} × Σ_{i=0}^{ν−2} (1/(i + 1/2)) (b/(b + 1))^{i+1/2} / B(i + 1/2, (m + 1)/2).
3.4 Practical Calculations. Exact evaluation of the upper and lower bounds UBk(p) and LBk(p) in theorem 1 is not easy. Even with Mathematica, the terms Ψ_{p−e,ρ−p+e}(b) in Λ_{p−e,ρ−p+e}(b) are not easily calculated for some combinations of ρ and p (in particular, for the cases ρ − p + e ≡ 3 mod 4),
while the terms ω_{p−e,k} are not a problem. The reason these expressions are difficult to evaluate is that a summation of a large number of terms is required; the range of the individual terms is very large, and the weights are positive and negative. We therefore do not evaluate the formulas for Ψ_{p−e,ρ−p+e}(b) directly but aim to obtain an approximate value. From theorem 3.1 in Kuriki and Takemura (2001), a general form of our target can be expressed as

Λ_{m,n}(b) = ∫_0^∞ ∫_{a²}^∞ g_m(ξ) G_n(bξ) dξ da,

where G_n(·) is the cumulative distribution function of χ²_n. To obtain an approximation, we consider a truncated version of Λ_{m,n}(b), defined as

Λ_{m,n}(b | T) = ∫_0^T ∫_{a²}^∞ g_m(ξ) G_n(bξ) dξ da.
Consider a random variable U, uniformly distributed on an interval [0, T], and let FU denote the cumulative distribution function of U. Furthermore, let Y ∼ χm2 . Assume that Y and U are mutually independent. Then we have
Λ_{m,n}(b | T) = ∫_0^T E_Y[1{Y ≥ a²} G_n(bY)] da
= T E_{U,Y}[1{Y ≥ U²} G_n(bY)]
= T E_Y{G_n(bY) E_U[1{Y ≥ U²}]}
= T E_Y[F_U(√Y) G_n(bY)],

and hence Λ_{m,n}(b | T) can be expressed as an expectation over Y only. Therefore, a Monte Carlo estimate for Λ_{m,n}(b | T) is

Λ̂_{m,n}(b | T) = (T/N) Σ_{j=1}^N F_U(√Y_j) G_n(bY_j),
where the Y1, …, YN are independently and identically distributed random variables from the χ²_m distribution. Obviously, Λ̂_{m,n}(b | T) is an unbiased estimate of Λ_{m,n}(b | T), but not of Λ_{m,n}(b). Moreover, it underestimates Λ_{m,n}(b); this follows from the definition of the truncated version. The mean squared error (MSE) of Λ̂_{m,n}(b | T), MSE[Λ̂_{m,n}(b | T)], is given as follows:
Theorem 2. As T → ∞ and N → ∞,
MSE[Λ̂_{m,n}(b | T)] = O(exp(−T) T^{m−2} + T²/N).

For practical purposes, we suggest using the 100(1 − α)% point of the χ²_m distribution as T_e, which we denote by χ²_{m,α} for each m. (Recall that m = p − e as in theorem 1.) Then, using the expression for the squared bias (see the proof of theorem 2), it follows that

MSE[Λ̂_{m,n}(b | χ²_{m,α})] = O((α/b)² + 1/N).
From theorem 1, the actual value of b is tan² θk, that is, 3/4 for k = 3 and 2/3 for k = 4. If we choose α = 10^{−10}, then α/b ≈ 1.33 × 10^{−10} for k = 3 and α/b ≈ 1.5 × 10^{−10} for k = 4. The term (α/b)² is the quantity that will not vanish, but it can be made as small as required, so it is negligible for practical purposes. Now let

LBk^T(p) = Σ_{e=0, e even}^{p−1} ω_{p−e,k} Λ_{p−e,ρ−p+e}(tan² θk | T_e)

be a truncated version of LBk(p), where the T_e are chosen separately for each e. Then combining the estimates Λ̂_{m,n}(b | T_e) appropriately yields the estimate

L̂Bk(p) = Σ_{e=0, e even}^{p−1} ω_{p−e,k} Λ̂_{p−e,ρ−p+e}(tan² θk | T_e)

of LBk(p | T). Hence, if we use independent samples distributed as χ²_{p−e} for each Λ̂_{p−e,ρ−p+e}, it follows that

MSE[L̂Bk(p | T)] = Σ_{e=0, e even}^{p−1} ω²_{p−e,k} MSE[Λ̂_{p−e,ρ−p+e}(tan² θk | T_e)].
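The truncated Monte Carlo estimate of section 3.4 can be sketched as follows. This is a minimal illustration, not the authors' Matlab/Mathematica code: the symbol and function names are ours, the chi-square CDF is evaluated in closed form for even n only, and T = 50 simply stands in for an extreme upper quantile of the χ²_m distribution.

```python
import math
import random

def chi2_cdf_even(t, n):
    """Chi-square CDF P(chi2_n <= t), closed form valid for even n."""
    s = sum((t / 2.0) ** j / math.factorial(j) for j in range(n // 2))
    return 1.0 - math.exp(-t / 2.0) * s

def truncated_mc_estimate(m, n, b, T, N, rng):
    """Monte Carlo estimate (T/N) * sum_j F_U(sqrt(Y_j)) * G_n(b*Y_j),
    with Y_j ~ chi2_m and F_U the CDF of U ~ Uniform[0, T]."""
    total = 0.0
    for _ in range(N):
        y = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m))  # chi2_m draw
        f_u = min(math.sqrt(y) / T, 1.0)                     # F_U(sqrt(y))
        total += f_u * chi2_cdf_even(b * y, n)
    return T * total / N

# Sanity check: for very large b the CDF factor is ~1 over the bulk of the
# chi2_m distribution, so the estimate approaches E[chi_m].
rng = random.Random(0)
mu_3 = math.sqrt(2.0) * math.gamma(2.0) / math.gamma(1.5)   # E[chi_3]
est = truncated_mc_estimate(m=3, n=4, b=1000.0, T=50.0, N=50000, rng=rng)
```

Since each summand is bounded, the estimator's variance shrinks at the usual 1/N Monte Carlo rate, consistent with the T²/N term in theorem 2.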
3.5 The Proposed Criterion in Practice. The second term on the right-hand side of equation 3.1 can be calculated with sufficient accuracy, for example, with Mathematica. So we define

ÛBk(p) = L̂Bk(p | T) + ((2k − 2)/(3k − 2))^{1/2} E[χ_ρ] [1 − Φ(θk, p)].   (3.2)
The actual values of ÛBk(p) are tabulated in Table 3 in appendix B. Our criterion for choosing the best or most informative dimension is based on the bias-adjusted version of βk given by

Ik(p) = (N/k!) βk(Z(p)1, …, Z(p)N) − ÛBk(p).   (3.3)

For k = 3, 4 the dimension selector is thus given by

pk = arg max_{2 ≤ p ≤ p*} Ik(p),   k = 3, 4.   (3.4)
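A minimal sketch of this selection rule (equations 3.3 and 3.4), assuming the sample values βk(Z(p)1, …, Z(p)N) and the tabulated bounds ÛBk(p) are already available; the function and variable names here are illustrative, not taken from the authors' code:

```python
import math

def pc_ic_dimension(beta_hat, ub_hat, n_samples, k):
    """Bias-adjusted selector: maximize I_k(p) = (N / k!) * beta_k(p) - UB_k(p)
    over the candidate dimensions p = 2, ..., p*.
    beta_hat, ub_hat: dicts mapping p -> value."""
    def criterion(p):
        return n_samples / math.factorial(k) * beta_hat[p] - ub_hat[p]
    return max(beta_hat, key=criterion)

# Toy illustration with made-up numbers: I(2) = 0.67, I(3) = 6.33, I(4) = 5.
beta_hat = {2: 0.10, 3: 0.50, 4: 0.60}
ub_hat = {2: 1.0, 3: 2.0, 4: 5.0}
p3 = pc_ic_dimension(beta_hat, ub_hat, n_samples=100, k=3)
```

The penalty term ÛBk(p) grows with p, which is what removes the trivial solution pk = p* of the unpenalized rule.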
Here, we have chosen ÛBk(p) as an appropriate substitute for E[Tk(p)]. Our calculations show that LBk(p) vanishes very rapidly with increasing dimension, and thus it contains little information, while the extra term in UBk(p) dominates its behavior. Since UBk(p) is an upper bound for E[Tk(p)], Ik(p) is smaller than the corresponding expression with E[Tk(p)] instead of ÛBk(p). It is not clear whether this affects the determination of pk in equation 3.4.

4 Implementation: Real Data

4.1 The Data Sets. We apply the bias-adjusted selector of the best or most informative dimension to a number of real data sets. Apart from the last two, all our data sets are available on the UCI Machine Learning Repository (see Blake & Merz, 1998). References and more information about these data sets are available on their Web site. Some of these data sets have also been used in Prasad et al. (2005) in their feature selection. We calculate only the first IC from the p-dimensional scores, while Prasad et al. obtained and used all p directions IC1, …, ICp for their efficient and accurate classification. The Heroin Market data consists of 17 key indicator data series with observations (counts) collected over 66 consecutive months from January 1997 to December 2002. These data series form part of a broader project funded by the National Drug Law Enforcement Research Foundation (NDLERF). (For a detailed description of these series, see Gilmour & Koch, 2004.) This data set differs from all others used here in that it contains only 17 independent observations, but each observation has 66 features or dimensions. Thus, we are dealing with a HDLSS (high-dimension, low-sample-size) problem. As we shall see, the skewness-based criterion was able to find the most informative dimension, while the kurtosis-based criterion was not. Finally, the Swiss Bank Notes data set stems from Flury and Riedwyl (1988).
It deals with six measurements of 100 real and 100 counterfeit old
Swiss 1000 franc bank notes. A preliminary PCA showed that the first principal component scores are bimodal, and PC1 contributes about 67% to the total variance. In the data sets Bupa Liver, Abalone, Glass, Wine, and Breast Cancer, the different features vary greatly in their range. Features on a large scale dominate the contribution to variance and would normally be picked as outliers in a PC analysis. As our primary interest is the determination of the optimal dimension rather than the specific derived features (see Hastie & Tibshirani, 2002), we used most data sets in both their raw and standardized versions in our PC-IC analysis. The “Comment” column in Table 1 indicates whether the raw data or the standardized data were used. Here, standardization refers to centering with the sample mean and normalizing with the standard deviation, usually across features, with the exception of the Heroin data set, where we standardized each time series, as the time series (rather than the features) were on different scales.

4.2 Calculation of the Best Dimension. Theorem 1 gives an expression for the upper bounds UBk, which are used in the bias-adjusted selector, equation 3.4. The calculations of the optimal dimension were carried out in Matlab 7.0.4. For the PC part, we used pcaSM (see Marron, 2004), and for the IC part we used FastICA2.4 (see Hyvärinen & Oja, 2005). The data sets vary in their dimensions, from 6 to 33, and in the number of independent samples. For most of the data sets, we considered all samples as well as smaller subsets of the samples. Obviously the performance of pk depends on the sample size. We shall see in this and the next section that the larger the sample size, the higher the value of pk. For each data combination, that is, for a data set in its raw or standardized version, with all samples or with a subset of samples, we calculate the p-dimensional sphered data Z(p)i (i = 1, …, N) as in equation 2.2 for p = 2, …, p*, with p* = r ≤ min{N, d}, where r denotes the rank of the covariance matrix S. For the p-dimensional sphered data, we calculate the most nongaussian direction with FastICA and the absolute skewness and absolute kurtosis. FastICA is an iterative technique with randomly chosen starting positions. We observed that different runs of ICA (even for the same input data) can converge to different components or different local optima. We found empirically that 10 runs of IC1 for each data combination were sufficient to guarantee a global skewness or kurtosis maximum of IC1. Using equations 3.3 and 3.4 we determine the best dimension.

4.3 Results and Interpretation. The results of the best dimension selection are given in Table 1. The second column details the dimension of each data combination: N denotes the size of the sample, and d denotes the dimension of the original feature space. The relative contributions to
variance due to the most nongaussian subspace are given in the Percentage Variance column. The last two columns contain the skewness and kurtosis dimension p3 and p4 . Figures in parentheses in the column p4 denote a local maximum at the dimension given in parentheses. If the values for pk differed for skewness and kurtosis, we reported two numbers in the variance column: the first one refers to the variance contribution obtained for the skewness dimension and the second to the contribution due to kurtosis. Table 1 highlights the following:
- For the same size of the sample, the raw data resulted in a lower pk value than the standardized data.
- Smaller samples result in a smaller pk value.
- The p3 and p4 values are mostly identical.
- pk does not appear to be related to the variance contribution of 95%.
How can we interpret the results presented in Table 1? Consider the Swiss Bank Notes. Measurements of six different quantities were made for 200 bank notes, and these six quantities correspond to properties of the bank notes such as their length. Our method selected the first four PC dimensions. This means that these PC combinations are regarded as best in capturing the nongaussian nature of the data. Since they are linear combinations of the original measurements, we learn how to combine the original six measurements in such a way that the dimensionality of the problem is reduced while the most informative structure is preserved. For nongaussian data such as the Swiss Bank Notes, our criterion takes advantage of the nature of the data. For gaussian data, other dimension-selection criteria exist, and we briefly describe two such methods in section 5. Different criteria lead to different solutions and therefore have different interpretations. To conclude this section, we return to the dimension-selection results of Table 1. If the raw data contain outliers or observations on a different scale, a high contribution to variance is obtained with a small number of PC directions. Outlier directions are highly nongaussian, and a low-dimensional space will therefore be the most informative; that is, ICA will find the most nongaussian directions within a low-dimensional space. Subsets of the original data set resulted in a lower pk value than the total sample. This is consistent with the test for gaussianity carried out inside ICA. In the next section, we consider this aspect of the dimension selector further. FastICA mostly results in very similar values for p3 and p4. When the p4 value was clearly bigger than p3, the calculations showed that the kurtosis approximation had a local maximum at p = p3.
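The sphere-then-search computation of section 4.2 can be sketched with NumPy. This is a simplified stand-in, not the authors' pipeline: equation 2.2 is not reproduced in this excerpt, so standard PCA sphering is assumed, and FastICA is replaced by a direct skewness evaluation of a given unit direction.

```python
import numpy as np

def sphered_scores(X, p):
    """Project centered data onto the first p principal axes and rescale so
    that the sample covariance of the scores is the identity (PCA sphering)."""
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]          # largest eigenvalues first
    evals, evecs = evals[order], evecs[:, order]
    return Xc @ evecs[:, :p] / np.sqrt(evals[:p])

def abs_skewness(y):
    """Absolute sample skewness of a one-dimensional projection."""
    y = y - y.mean()
    return abs(np.mean(y ** 3)) / np.mean(y ** 2) ** 1.5

# Toy data: one skewed (exponential) feature mixed with gaussian features.
rng = np.random.default_rng(0)
X = np.column_stack([rng.exponential(0.5, 2000),
                     rng.standard_normal((2000, 3)) @ rng.standard_normal((3, 3))])
Z = sphered_scores(X, p=3)
```

In the paper, a projection-pursuit routine (FastICA) would then search the sphered p-dimensional scores for the direction maximizing `abs_skewness` (or the kurtosis analogue).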
Table 1: Dimension Selection for Real Data.

Data Set       N × d      Comment                            Percentage Variance   p3   p4
Bupa Liver     345 × 6    Raw                                97.2445               3    3
               100 × 6    Raw, first 100 records             97.821                2    2
               345 × 6    Standardized                       95.7750               5    5
               100 × 6    Standardized, first 100 records    95.5136               5    5
Swiss Notes    200 × 6    Raw                                97.3140               4    4
Abalone        4177 × 8   Raw                                100                   7    7
               1000 × 8   Raw, 1000 random records           99.9979; 99.9991      6    7
               100 × 8    Raw, first 100 records             99.9767; 99.9993      3    6 (3)
               4177 × 8   Standardized                       100                   6    7
               1000 × 8   Standardized, 1000 random records  98.3901; 99.2634      4    5
               100 × 8    Standardized, first 100 records    98.1336; 99.7993      4    6 (4)
Glass          214 × 9    Raw                                98.2290; 99.8338      5    6
               214 × 9    Standardized                       99.2726               7    7
Wine           178 × 13   Raw                                99.9997; 100          8    10
               178 × 13   Standardized                       89.3368               7    7
Breast Cancer  569 × 30   Raw                                100                   15   27 (15)
               100 × 30   Raw, first 100 records             99.8967; 100          2    24 (2)
               569 × 30   Standardized                       99.9999; 100          25   26
               100 × 30   Standardized, first 100 records    96.6162; 98.0303      11   13
Ionosphere     351 × 33   Raw                                69.5810; 99.4328      7    31 (7)
               200 × 33   Raw, first 200 records             57.0032; 96.7084      5    25
               100 × 33   Raw, records 101–200               77.3381; 91.9988      9    16 (9)
Heroin market  17 × 66    Raw                                99.9386               2    2
               17 × 66    Standardized                       67.4945               3    3
A comparison of the pk values we obtained with the dimension p95% chosen in Prasad et al. (2005) (which is based on explaining 95% of variance) indicates that contribution to variance is not a good predictor of the best nongaussian dimension. Indeed, other effects, such as the differing scales of measurements, outliers, and sample size, affect the dimension of the most nongaussian space more strongly.

5 PCA-Based Approaches

We describe some model-based approaches to dimension reduction that have been used in the recent literature and examine their applicability. These approaches fall into two main groups:
- Probabilistic principal component analysis
- Bayesian principal component analysis
The approaches have the same starting point: a common underlying model,

X = AV + µ + ε,   (5.1)
where X denotes a d-dimensional vector of observations with mean µ, V denotes a p-dimensional vector of latent variables, and ε ∼ N(0, σ²Id) represents independent gaussian noise. In this framework, p < d, A is a d × p matrix, and both A and p are assumed to be unknown. The aim is to determine the unknown dimension p of the latent variables V. Probabilistic principal component analysis is described in Tipping and Bishop (1999) and Bishop (1999). Their key idea is that the dimension of the latent variables is that value of p which minimizes the estimated prediction error. We denote this solution by p(P). In addition to the assumptions of equation 5.1, we assume the following:

P1: The latent vectors are independent and normal with V ∼ N(0, Ip).
P2: The observations are normal with X ∼ N(µ, AA^T + σ²Id).
P3: The conditional distributions X|V and V|X are normal with means and variances derived from the means and variances of X, V, and ε.

Under assumptions P1 to P3, the estimated prediction error is the negative log likelihood. For a sample of size N from this model, the log likelihood is given by
L = −(N/2) [d log(2π) + log|C| + tr(C^{−1} S)],   (5.2)
where C = AA^T + σ²Id. The second approach, Bayesian principal component analysis, is described in Penny, Roberts, and Everson (2001) and Beckmann and Smith (2004). It is based on the key idea of a Bayesian framework proposed in Minka (2000). We first describe the main ideas of Minka’s method and then discuss how these ideas are used in Penny et al. and Beckmann and Smith. Similar to Tipping and Bishop (1999), Minka (2000) starts with the gaussian model of equation 5.1, which satisfies P1 and P2. Instead of employing the maximum likelihood solution for A, σ, and the mean µ, Minka proposes to use a Bayesian framework that requires evaluation of the conditional probabilities pr(X1, …, XN | A, V, µ, σ²). Minka uses a noninformative prior for the mean µ and Laplace’s method to approximate the conditional density. The final probability is conditioned
on the dimension p of the latent variables V only:

p(X1, …, XN | p) ≈ p(U) (Π_{j=1}^p λ_j)^{−N/2} × σ̂^{−N(d−p)} (2π)^{(m+p)/2} |A_Z|^{−1/2} N^{−p/2},   (5.3)

with m = dp − p(p + 1)/2,

p(U) = 2^{−p} Π_{j=1}^p Γ((d − j + 1)/2) π^{−(d−j+1)/2},

|A_Z| = N Π_{i=1}^p Π_{j=i+1}^d (λ̂_j^{−1} − λ̂_i^{−1})(λ_i − λ_j),

λ̂_i = λ_i for i ≤ p, and λ̂_j = σ̂² for j > p.
The product on the right-hand side of equation 5.3 is used to estimate the likelihood, and the value p (B) that maximizes this product is the solution to the latent variable dimension selection. Penny et al. (2001) use the linear mixing framework of ICA, described in equation 5.1, which deviates from the classical ICA framework in that it includes additive noise. In their model, µ = 0 and assumptions P1 to P3 are replaced by the weaker assumptions B1 and B2: B1: The latent vectors are independent. B2: The observations x conditional on A and V are normal with (X| A, V) ∼ N(AV, σ 2 Id ). For this framework, they extend Tipping and Bishop’s (1999) approach in their section 12.3 by including a rotation matrix Q in the expression for the mixing matrix A, as is usual in ICA. In section 12.4 the authors state that “for gaussian sources the ICA model reduces to PCA”. From there on, they propose the use of Minka’s method to determine the dimension p, but they also discuss “flexible” source models. We will return to this point. The essence of Beckmann and Smith (2004) is Minka’s method; however, the authors take into account the limited data structure and the structure of noise as found in their FMRI applications. The latter refers to preprocessing of the noise, which includes standardizing the noise. For gaussian sources V, they propose the use of Minka’s p (B) , but they differ from Minka by using the mean as an estimate for µ instead of Minka’s prior distribution. After determining p (B) , the authors calculate the ICA rotation matrix Q and the independent sources within a standard ICA framework. This last step
does not appear to relate to the choice of dimension, but it is relevant to the discussion of the applicability of their methodology. Tipping and Bishop (1999) and Minka (2000) propose clear, easy-to-use, and theoretically well-founded methods for selecting a lower dimension for the latent variables. Both articles make it very clear that the sources as well as the noise are gaussian. These properties enable them to combine PCA ideas with gaussian likelihood maximization and result in elegant solutions. The two solutions will, of course, be different, as different problems are solved (probabilistic and Bayesian, respectively). Penny et al. (2001) and Beckmann and Smith (2004) have extended these algorithms to ICA frameworks, that is, to nongaussian sources. Penny et al. offer some discussion regarding a number of other source distributions and flexible source models and give two such examples where reference is made to specific (nongaussian) source distributions and how to accommodate them into Minka’s gaussian framework. This extension will no doubt work in the specific cases but does not lead to a general methodology. In contrast, Beckmann and Smith combine two approaches with inconsistent assumptions: finding the best p (B) by assuming that the sources are gaussian, and then finding p (B) nongaussian sources with standard ICA methodology. Since real data are never purely gaussian, and since, in practice, ICA finds nongaussian structure in gaussian randomly generated data, the method could lead to reasonable results for some data. The question of interest here is whether PCA-based methods and their extensions described in Tipping and Bishop (1999) and Minka (2000) can result in the right dimension reduction without knowledge of the underlying model or distribution. Like the initial approaches of Tipping and Bishop and Minka, the method we propose is generic and supported by theory. 
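For concreteness, the gaussian log likelihood of equation 5.2, which drives the probabilistic PCA selection, can be evaluated directly. This is a sketch only: the numerical routine and its names are ours, and the construction of C = AA^T + σ²Id and the maximization over p follow Tipping and Bishop.

```python
import numpy as np

def gaussian_loglik(S, C, n_samples):
    """Equation 5.2: L = -(N/2) * [ d*log(2*pi) + log|C| + tr(C^{-1} S) ],
    with S the sample covariance and C the model covariance AA^T + sigma^2*I."""
    d = S.shape[0]
    sign, logdet = np.linalg.slogdet(C)
    assert sign > 0, "C must be positive definite"
    return -0.5 * n_samples * (d * np.log(2 * np.pi) + logdet
                               + np.trace(np.linalg.solve(C, S)))

# Simple check: S = C = I_2, N = 10, so L = -(10/2) * (2*log(2*pi) + 0 + 2).
L_val = gaussian_loglik(np.eye(2), np.eye(2), n_samples=10)
```

Using `slogdet` and `solve` rather than an explicit determinant and inverse keeps the evaluation numerically stable for larger d.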
We are not interested in finding modifications for specific cases as suggested by Penny et al. (2001) and Beckmann and Smith (2004). In the following section we compare our method with those of Tipping and Bishop (1999) and Minka (2000). As the simulations will demonstrate, for the realistic nongaussian sources we use, neither of these methods is appropriate, and both are inferior to our method in terms of finding the correct dimension. This result is not surprising, since the assumptions of their models have been violated. These comparisons alone strongly indicate the need for our method, which does not replace the existing methods but complements them.

6 Implementation: Simulation Study

6.1 Nongaussianity, Sample Size, and PC Projections. We test the performance of our approach in simulations based on a mixture of gaussian and nongaussian directions. We determine the number of nongaussian directions with the PC-IC algorithm and FastICA. For comparison, we also
include results obtained with the probabilistic and the Bayesian approaches discussed in section 5. Our theoretical arguments are based on the limit case N → ∞, while in practice, we deal with small and moderate sample sizes. The simulation results are therefore not expected to give a precise answer, especially for smaller sample sizes. A powerful test for nongaussianity is given by the negentropy: for gaussian data, the negentropy is zero, while positive negentropy indicates deviations from gaussianity. The notion of negentropy relies on the joint probability density of the data. Density estimation for multivariate or high-dimensional data becomes more complex as the dimension increases, and good estimates in high dimensions require a large amount of data. Even then, current density estimation methods are not always very accurate. For this reason, approximations to the negentropy are commonly used. FastICA uses the skewness and kurtosis as approximations to the negentropy (referred to as nonlinearities in their framework). Below we discuss the capability of skewness and kurtosis to distinguish gaussian from nongaussian models. This analysis informs our choice of simulation models.

6.1.1 Small Sample Sizes. If the sample size is small, the empirical distribution of nongaussian data may not differ sufficiently from that of a gaussian sample of the same size. This means that in a test for gaussianity, the data appear consistent with the gaussian model and will not be rejected as nongaussian. Thus, a technique such as ICA, which tests the deviation from gaussianity in the new directions, tends to underestimate the number of nongaussian directions, and hence also the dimension of the most nongaussian subspace. For small samples, pk is therefore likely to be smaller than the true number of nongaussian dimensions.

6.1.2 Large Sample Sizes.
If the sample size is large, the model is a better approximation to the theory, and the dimension pk should therefore give an accurate description of the presence of nongaussian structure in the data. In the large sample case, nongaussianity tests are powerful and easily find deviations from gaussianity. Thus, nongaussian directions will appear as nongaussian. FastICA employs a very sensitive test for nongaussianity: indeed, even for samples from the multivariate gaussian distribution, FastICA will find nongaussian directions (see Koch et al., 2005). Thus, for large samples, the number of nongaussian directions ICA finds could be an overestimate. Consequently, pk is likely to be bigger than the true number of nongaussian dimensions. 6.1.3 Skewness as a Test for Nongaussianity. The skewness is based on the third moment of a random vector or sample. Symmetric distributions have a zero skewness. As the skewness is location and scale invariant, the
skewness approximation to negentropy will not be able to differentiate between the gaussian distribution and a distribution that is symmetric about its mean. In particular, ICA-based skewness tests will not be able to differentiate between gaussian and uniform data, or between data from the gaussian and the double exponential distributions, and pk will severely underestimate the nongaussian structure.

6.1.4 Kurtosis as a Test for Nongaussianity. The kurtosis is based on the fourth moment of a random vector or sample. Kurtosis as in equation 1.2 is adjusted to yield zero kurtosis for multivariate gaussian random vectors. For supergaussian distributions, kurtosis is positive, while subgaussian distributions lead to negative kurtosis values. However, the kurtosis test in FastICA is biased toward positive kurtosis, and it is therefore harder to differentiate subgaussian distributions from the gaussian distribution. Consequently, for subgaussian data, pk is smaller than the true number of subgaussian data directions. Indeed, our simulations with the uniform distribution confirmed this bias in the kurtosis estimates.

6.1.5 Support of the Distribution. The gaussian distribution is characterized by its infinite support. Distributions with compact support are therefore clearly distinct from those with infinite support. Neither skewness nor kurtosis tests are able to differentiate on the basis of the support of the distribution. Specifically, the beta distribution with positive and equal parameter values has skewness and kurtosis similar to those of the gaussian distribution. FastICA is not able to distinguish between them. So, again, pk is smaller than the true number of nongaussian data directions.

6.1.6 PC Projections. The first step in the PC-IC algorithm consists in reducing the d-dimensional original data to p-dimensional scores for p = 2, …, r, where r = min{min(d, N), rank(S)}. The PC scores are weighted averages of the original data.
As the dimension d becomes large (d ≥ 30), a central limit effect seems to be noticeable in the projected data; thus, the scores are more gaussian than the original data, and pk will be smaller than the true number of nongaussian directions.
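The limitations discussed in sections 6.1.3 and 6.1.4 are easy to verify numerically with plain sample moments (function names here are ours):

```python
import random

def sample_skewness(y):
    """Third standardized sample moment; ~0 for any symmetric distribution."""
    n = len(y)
    m = sum(y) / n
    m2 = sum((v - m) ** 2 for v in y) / n
    m3 = sum((v - m) ** 3 for v in y) / n
    return m3 / m2 ** 1.5

def excess_kurtosis(y):
    """Fourth standardized sample moment minus 3; 0 in expectation for
    gaussian data, negative for subgaussian data."""
    n = len(y)
    m = sum(y) / n
    m2 = sum((v - m) ** 2 for v in y) / n
    m4 = sum((v - m) ** 4 for v in y) / n
    return m4 / m2 ** 2 - 3.0

# The uniform distribution is symmetric, so its skewness is ~0 and a
# skewness-based test cannot separate it from the gaussian; its excess
# kurtosis is -1.2, a subgaussian (negative) value.
rng = random.Random(0)
uniform = [rng.uniform(-1.0, 1.0) for _ in range(20000)]
skew_u = sample_skewness(uniform)
kurt_u = excess_kurtosis(uniform)
```

The negative sign of `kurt_u` is exactly the regime in which the FastICA kurtosis test is, per the discussion above, at a disadvantage.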
6.2 Simulation Models and Results 6.2.1 Simulation Strategy. The above discussion informs our choice of distributions and our simulation strategy. Since the accuracy of pk depends on the type of distribution as well as the sample size, we demonstrate its performance across a number of distributions and sample sizes and also show its relative accuracy by comparing it for models that differ only by the number of nongaussian dimensions but are the same otherwise.
Table 2: Dimensions d for Simulated Models with dng Nongaussian Dimensions.

d     5   5   7   7   10   10   10   20   20   20   20
dng   3   4   3   5    3    4    6    3    4    5    6
6.2.2 Nongaussian Models and the Mixed Data. The development of our approach and theoretical arguments requires N independent observations X1 , . . . , XN without reference to any model. The only assumption we require is that the observations have fourth moments. Such d-dimensional data correspond best to the generic ICA description: X = AV,
(6.1)
with A the d × d mixing matrix and V a d-dimensional vector of independent sources. This ICA model is different from that in equation 5.1, which forms the bases of the PCA-type approaches. We choose three nongaussian families of distributions, which set them apart from the gaussian distribution in ways that skewness or kurtosis should be able to discern:
- The beta distribution with parameters 5 and 1.2. This distribution has compact support and is asymmetric.
- A mixture of two gaussians consisting of one-quarter of the samples from the gaussian distribution N(−1, 0.49) and three-quarters of the data from the gaussian N(4, 1.44). This distribution is strongly bimodal and asymmetric.
- The exponential distribution with parameter λ = 2. This is a supergaussian distribution that is also asymmetric.
For each of these models, we generate independent source data. We generate N × dng independent random samples from the distribution F, where F denotes any one of the three distributions above, and N × dg independent samples from the multivariate gaussian distribution with zero mean and covariance matrix Idg. Then d = dng + dg denotes the dimension of the independent source data V. In the simulations, we vary d and dng within d. For the sample size N, we use the values N = 20, 50, 100, 200, 500, 1000, 2000. Table 2 summarizes the combinations we use. We apply a d × d mixing matrix A to the independent source data V1, …, VN to obtain the observed or mixed data X1, …, XN given in equation 6.1. We use a class of Higham test matrices (see Davies & Higham, 2000). These matrices are generated in Matlab with the command A = gallery('randcolu', d). The columns in the random matrix A have unit length. The singular values of this matrix can be specified, but we do not use
this option. Matrices of this form are invertible. For all simulations based on a specific dimension d, the same matrix A = A(d) is used.

6.2.3 Approaches for Calculating the Dimension p. We apply the PC-IC algorithm to mixed data X1, …, XN obtained from sources V = V(N, d, dng, F). The independent sources are characterized by d, dng, the nongaussian distribution F, and the sample size N. For p = 2, …, d we determine the most nongaussian IC1 direction by taking the maximum over 10 IC1 repetitions obtained with FastICA2.4. From these values for each p, we determine the pk (k = 3, 4) as in equation 3.3. Our method is based on skewness and kurtosis. We carry out separate simulations for skewness and kurtosis, which result in the values p3 for skewness and p4 for kurtosis. In addition, we apply the probabilistic PCA approach of Tipping and Bishop (1999), which we refer to as PPCA, and the Bayesian approach of Minka (2000), referred to as BPCA, to the data X1, …, XN. For the probabilistic PCA, we calculate the log likelihood of equation 5.2 and find its maximizer p(P). For the Bayesian approach, we find the maximizer p(B) of equation 5.3. In the PCA-based simulations, p ranges over the values {1, …, d − 1}. Our approach considers values p ∈ {2, …, d}, since ICA requires at least a two-dimensional subspace to find a nongaussian direction. As our approach is nonparametric and treats all observations the same way, the flexible sources referred to in Penny et al. (2001) are not appropriate, nor is the preprocessing of the noise as described in Beckmann and Smith (2004) applicable, since there is no additive noise in the model of equation 6.1. Thus, the algorithms we can compare our method with are the probabilistic PCA given in equation 5.2 and the Bayesian PCA given in equation 5.3.

6.2.4 Simulation Results. For 1000 runs with the sources V(N, d, dng, F), we recorded the proportion of times that dimension p was selected.
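The data-generation step of sections 6.2.2 and 6.2.3 can be sketched as follows. Note the hedges: the unit-column random matrix below is a simple stand-in for Matlab's gallery('randcolu', d), whose singular-value structure it does not reproduce, and the exponential sources use rate λ = 2, that is, scale 1/2.

```python
import numpy as np

def make_mixed_data(N, d, dng, rng):
    """Draw N source vectors with dng exponential(lambda=2) coordinates and
    d - dng standard gaussian coordinates, then mix them with a random
    matrix A whose columns have unit length (X = A V, equation 6.1)."""
    V = np.column_stack([rng.exponential(0.5, (N, dng)),
                         rng.standard_normal((N, d - dng))])
    A = rng.standard_normal((d, d))
    A /= np.linalg.norm(A, axis=0)   # normalize each column to unit length
    return V @ A.T, A                # row i of X is A @ V_i

rng = np.random.default_rng(0)
X, A = make_mixed_data(N=500, d=5, dng=3, rng=rng)
```

A single fixed A per dimension d, reused across all runs, mirrors the setup described above.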
Figures 1 to 3 show typical performances for our method and the two PCA-based methods. The pattern is the same in each figure. Each row of subplots presents the results of one method. In the first row, we present our skewness results, referred to as PC-IC skewness in the figures. In Figures 1 and 2, the second row shows our kurtosis results (PC-IC kurtosis). The second-to-last row displays the results of PPCA, and the last row shows the results of BPCA. We display the results for the sample sizes N = 50, 100, 200, 500, and 1000. Each column of subplots displays the results of a fixed sample size; the first column has results for N = 50, the next for N = 100, and the last for N = 1000. The x-axis of each subfigure displays the range of dimensions 1, ..., d, and the y-axis shows the percentage of the 1000 runs in which a particular dimension p was the maximizer. The correct dimension, namely, the number of nongaussian dimensions, is indicated by a square shape in each plot, for example, dimension 4 in Figure 1.
Dimension Selection with PC-IC
Figure 1: Proportion p against dimension for five-dimensional distribution. Sample sizes N = 50, 100, 200, 500, 1000 vary across columns. p with PC-IC skewness (top row), with PC-IC kurtosis (second row), with PPCA (third row), with BPCA (bottom row). Correct dimension indicated by a square, here at p = 4.
Figure 1 displays results for low dimensions. For five-dimensional data, we use the beta distribution for four of the independent sources. The results are displayed in the 20 subfigures of Figure 1. The PC-IC skewness criterion (top row) finds the correct dimension for N = 100 and 200. For larger values of the sample size, p_3 is a little bigger than the true value. This is consistent with our comments about large samples. The results obtained with kurtosis are not as good as those for skewness. Although kurtosis finds the correct dimension for the two largest sample sizes, the percentages for all values of p are quite high. Further, skewness is computationally preferable, as kurtosis calculations take two to four times longer. The third row shows the results from PPCA: for all sample sizes, the maximum occurs at the correct dimension. Finally, the results for BPCA indicate that BPCA finds the correct dimension for the smallest sample size (N = 50), but for all other sample sizes, BPCA has a maximum at p_B = 1. Figure 2 shows the results for 10-dimensional data. Here we use the bimodal distribution in three dimensions. For this moderate dimension, PC-IC skewness (top row) finds the correct dimension for the larger sample sizes and underestimates it by one for the smaller values. This is expected: p_3 tends to increase with sample size. The kurtosis results are
Figure 2: Proportion p against dimension for 10-dimensional distribution with three bimodal dimensions. Sample sizes N = 50, 100, 200, 500, 1000 vary across columns. p with PC-IC skewness (top row), with PC-IC kurtosis (second row), with PPCA (third row), with BPCA (bottom row). Correct dimension indicated by a square, here at p = 3.
very similar in that the correct dimension is underestimated by one for the smaller sample sizes, but again, the skewness results are superior to the kurtosis results. The PPCA plots all have a maximum at 1. Unlike the results for PC-IC skewness, the dimension of the maximum does not change with sample size and indeed is wrong in each case. BPCA shows a trend that was not so clear for the five-dimensional data of Figure 1: for low sample sizes, the maximum of BPCA occurs at the last value of its range, d − 1, and for higher sample sizes, the maximum occurs at 1. Figure 3 shows the results for 20-dimensional data with four bimodal dimensions. As the kurtosis results are inferior to the skewness results, we present only the skewness results here. For the three larger samples, PC-IC skewness determined the correct dimension convincingly. For 20-dimensional data, sample sizes of 50 or 100 are very small, and it is therefore not surprising that the skewness criterion underestimates the dimension. The second row of figures shows that PPCA picked d − 1 = 19 as the dimension of the latent variables for each sample size. This choice is far from the correct choice: four dimensions. As in the results in Figure 2, PPCA appears to be independent of the sample size, but this time, PPCA picks
Figure 3: Proportion p against dimension for 20-dimensional distribution with four bimodal dimensions. Sample sizes N = 50, 100, 200, 500, 1000 vary across columns. p with PC-IC skewness (top row), with PPCA (middle row), with BPCA (bottom row). Correct dimension indicated by a square, here at p = 4.
the other extreme dimension, d − 1, instead of 1 in the 10-dimensional data. We will comment on this discrepancy between these two results in the last paragraph of this section. BPCA shows the same pattern as for the 10-dimensional data of Figure 2: for the two smallest sample sizes, the maximum occurs at the largest dimension (at p_B = 19), and for the two largest sample sizes, the maximum occurs at 1. For N = 200, we note that the peak has now moved to p_B = 3, which happens to be close to the true value. These figures show that PC-IC with the skewness criterion produces good results for nongaussian distributions of diverse dimensions and different sample sizes. PC-IC with skewness is mostly superior to PC-IC with kurtosis in determining the correct dimension. This is a little surprising, but the results are very convincing. Since skewness is calculated more efficiently than kurtosis with FastICA, we recommend the use of PC-IC with skewness. When the dimension of the data is small as in Figure 1, PC-IC with skewness yields correct results for the smaller sample sizes we considered, while for larger samples, the selected dimension can be too large. For small N and moderate or larger dimensions d ≥ 10, p_3 tends to be a little smaller than dng, but as N increases, PC-IC selects the correct dimension. These results are consistent with the observations of section 6.1 and the results in Table 1. A comparison between PC-IC with skewness and the PCA-based methods shows convincingly that our method is clearly superior to either of these two for nongaussian data and models as in equation 6.1. This is not
surprising, since both methods were developed for gaussian data with additive gaussian noise. PPCA appeared to select the correct dimension for the low-dimensional simulations based on the beta distribution. In some sense, this distribution is closer to the normal distribution than the bimodal, so a possible explanation for the good performance of PPCA is that it shows some distributional robustness, that is, PPCA can handle models that are close to the gaussian model for which it has been designed. The bimodal distribution is very different from the gaussian distribution, and it is therefore not surprising that PPCA does not yield such good results. What is curious, however, is the jump from the lowest to the highest dimension that occurred between Figures 2 and 3 or between dimensions 10 and 20. Further simulations (not presented here) showed that PPCA depended strongly on the model, here the class of mixing matrices, so PPCA is not model robust. These simulations indicate that the class of models and distributions for which PPCA works is quite restrictive and thus highlight a serious drawback of PPCA. In contrast, BPCA appears to be model robust and always has a very clear peak where the maximum occurs, but BPCA appears too sensitive to sample size. We observed (but do not show this here) that the selected dimension of the maximum decreases from d − 1 to 1 as the sample size increases from 100 to 120. In summary, the results show that our method produces good results where neither of the PCA-based methods is applicable, and so complements these methods.
7 Conclusion

A criterion for choosing the dimension in dimension-reduction problems has been proposed. The criterion is essentially based on an approximation of bias-adjusted skewness and kurtosis for PC scores. Applications to real data sets reveal that the proposed criterion returns a higher dimension than p_95% based on PC alone. This is to be expected, since nongaussianity cannot be captured by the variance of low-dimensional projections alone, which is what PCA aims for. Our proposed criterion includes the pursuit of nongaussianity as a vital part, and hence it should yield higher dimensions. We have shown in simulation studies that the criterion performs well for simulated structures and performs better than existing PCA-based methods for nongaussian data. The ICA part of the criterion seems to make its performance quite similar to that of testing gaussianity based on skewness and kurtosis, and in fact this has been observed in our simulation studies. Therefore, it is desirable to consider simulations of structures that are used in the literature as alternative hypotheses for testing gaussianity. As mentioned in section 6.1.5, more accurate selection of the dimension would be possible provided we could differentiate between compact and noncompact support of the underlying structure. However, we reserve this problem for future work.
Appendix A: Proof of Theoretical Results

The following lemma is useful to prove theorem 1.

Lemma 1.
(1) For $b \ge 0$, we have
$$\int_0^\infty \bar G_m[(b+1)t^2]\,dt = \frac{E[\chi_m]}{\sqrt{b+1}} = \frac{\sqrt{2}\,\Gamma[(m+1)/2]}{\sqrt{b+1}\,\Gamma(m/2)}.$$
(2) Fix integers $i$ and $m$ and put $k = 2i + m$. For $a, b \ge 0$ the tail probability $\bar G_k$ satisfies
$$\frac{(2b)^i\,\Gamma(k/2)}{(b+1)^{k/2}\,\Gamma(m/2)}\,\bar G_k[(b+1)a^2] = \int_{a^2}^\infty e^{-bt/2}(bt)^i\,g_m(t)\,dt.$$
(3) Fix integers $i$ and $m$ and put $\ell = 2i + 1 + m$. For $a, b \ge 0$ the tail probability $\bar G_\ell$ satisfies
$$\frac{b^{(2i+1)/2}\,\Gamma(\ell/2)}{(b+1)^{\ell/2}\,\Gamma(m/2)}\,\bar G_\ell[(b+1)a^2] = \int_{a^2}^\infty e^{-bt/2}\Big(\frac{bt}{2}\Big)^{(2i+1)/2} g_m(t)\,dt.$$
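The identities of lemma 1 can be sanity-checked numerically. The sketch below is our own verification, not part of the paper: it evaluates the chi-square density $g_m$, the tail probability $\bar G_m$, and the integrals by simple trapezoidal quadrature (hence the loose tolerances), checking the moment identity $E[\chi_m] = \sqrt{2}\,\Gamma[(m+1)/2]/\Gamma(m/2)$ used in part (1) and the closed form of part (2):

```python
import math
import numpy as np

def g(t, m):
    """Chi-square density with m degrees of freedom."""
    return t**(m / 2 - 1) * np.exp(-t / 2) / (2**(m / 2) * math.gamma(m / 2))

def trap(y, x):
    """Trapezoidal rule for samples y on grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def Gbar(x, m, upper=300.0, n=300001):
    """Tail probability P(chi2_m >= x) by quadrature."""
    t = np.linspace(x, upper, n)
    return trap(g(t, m), t)

def E_chi(m, upper=300.0, n=300001):
    """E[chi_m] = integral of sqrt(t) g_m(t) dt."""
    t = np.linspace(1e-9, upper, n)
    return trap(np.sqrt(t) * g(t, m), t)

def K_numeric(i, m, a, b, upper=300.0, n=300001):
    """Right-hand side integral of lemma 1, part (2)."""
    t = np.linspace(a**2, upper, n)
    return trap(np.exp(-b * t / 2) * (b * t)**i * g(t, m), t)

def K_closed(i, m, a, b):
    """Closed form of lemma 1, part (2), with k = 2i + m."""
    k = 2 * i + m
    return ((2 * b)**i * math.gamma(k / 2)
            / ((b + 1)**(k / 2) * math.gamma(m / 2)) * Gbar((b + 1) * a**2, k))
```

For example, E_chi(3) agrees closely with sqrt(2) Γ(2)/Γ(3/2), and K_numeric(1, 3, 0.8, 1.5) matches K_closed(1, 3, 0.8, 1.5).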
Proof of Lemma 1.
1. With $\bar G_m(t^2) = P(\chi^2_m \ge t^2) = P(\chi_m \ge t)$ for $t \ge 0$, using the change of variables $u = t^2$, we find that
$$\int_0^\infty \bar G_m(t^2)\,dt = \int_0^\infty \bar G_m(u)\,\frac{du}{2\sqrt{u}} = \bar G_m(u)\sqrt{u}\,\Big|_0^\infty + \int_0^\infty \sqrt{u}\,g_m(u)\,du = E[\chi_m],$$
since the first partial integrand equals 0. From this equation and applying the change of variable $u = (b+1)t^2$, we get
$$\int_0^\infty \bar G_m[(b+1)t^2]\,dt = \int_0^\infty \bar G_m(u)\,\frac{du}{2\sqrt{u(b+1)}} = \frac{1}{\sqrt{b+1}}\,E[\chi_m] = \frac{\sqrt{2}\,\Gamma[(m+1)/2]}{\sqrt{b+1}\,\Gamma(m/2)}.$$
2. Fix $i$ and $m$. For $a, b \ge 0$, let $K$ denote the integral
$$K \equiv \int_{a^2}^\infty e^{-bt/2}(bt)^i\,g_m(t)\,dt = \int_{a^2}^\infty e^{-(b+1)t/2}\,b^i\,\frac{t^{(2i+m)/2-1}}{2^{m/2}\,\Gamma(m/2)}\,dt = \int_{a^2}^\infty e^{-(b+1)t/2}\,b^i\,t^{(2i+m)/2-1}\,\frac{2^i\,\Gamma[(2i+m)/2]}{2^{(2i+m)/2}\,\Gamma[(2i+m)/2]\,\Gamma(m/2)}\,dt.$$
Put $k = 2i + m$. The change of variable $r = (b+1)t$ and the definition of the tail probability $\bar G_k$ lead to
$$K = \frac{(2b)^i\,\Gamma(k/2)}{\Gamma(m/2)} \int_{a^2}^\infty \frac{e^{-(b+1)t/2}\,t^{k/2-1}}{2^{k/2}\,\Gamma(k/2)}\,dt = \frac{(2b)^i\,\Gamma(k/2)}{\Gamma(m/2)} \int_{(b+1)a^2}^\infty \frac{e^{-r/2}\,r^{k/2-1}}{2^{k/2}\,\Gamma(k/2)\,(b+1)^{k/2}}\,dr = \frac{(2b)^i\,\Gamma(k/2)}{(b+1)^{k/2}\,\Gamma(m/2)}\,\bar G_k[(b+1)a^2].$$
3. Fix $i$ and $m$. For $a, b \ge 0$, let $L$ denote the integral
$$L \equiv \int_{a^2}^\infty e^{-bt/2}\Big(\frac{bt}{2}\Big)^{(2i+1)/2} g_m(t)\,dt = \int_{a^2}^\infty e^{-(b+1)t/2}\,\frac{b^{(2i+1)/2}\,t^{(2i+1+m)/2-1}}{2^{(2i+1+m)/2}\,\Gamma(m/2)}\,dt = \int_{a^2}^\infty e^{-(b+1)t/2}\,b^{(2i+1)/2}\,t^{(2i+1+m)/2-1}\,\frac{\Gamma[(2i+1+m)/2]}{2^{(2i+1+m)/2}\,\Gamma[(2i+1+m)/2]\,\Gamma(m/2)}\,dt.$$
With $\ell = 2i + 1 + m$ and the change of variable $r = (b+1)t$, we get
$$L = \frac{b^{(2i+1)/2}\,\Gamma(\ell/2)}{\Gamma(m/2)} \int_{a^2}^\infty \frac{e^{-(b+1)t/2}\,t^{\ell/2-1}}{2^{\ell/2}\,\Gamma(\ell/2)}\,dt = \frac{b^{(2i+1)/2}\,\Gamma(\ell/2)}{\Gamma(m/2)} \int_{(b+1)a^2}^\infty \frac{e^{-r/2}\,r^{\ell/2-1}}{2^{\ell/2}\,\Gamma(\ell/2)\,(b+1)^{\ell/2}}\,dr = \frac{b^{(2i+1)/2}\,\Gamma(\ell/2)}{(b+1)^{\ell/2}\,\Gamma(m/2)}\,\bar G_\ell[(b+1)a^2].$$
Proof of Theorem 1. From theorems 3.1 and 3.3 of Kuriki and Takemura (2001), we have
$$Q^L_{k,p}(a) \le P(T_k(p) \ge a) \le Q^U_{k,p}(a)$$
for $a > 0$, where
$$Q^L_{k,p}(a) = \sum_{e=0,\ e\ \mathrm{even}}^{p-1} \omega_{p-e,k}\,Q_{p-e,\,\rho-p+e}(a^2, \tan^2\theta_k),$$
$$Q^U_{k,p}(a) = Q^L_{k,p}(a) + \bar G_\rho\big(a^2(1+\tan^2\theta_k)\big)\left(1 - \frac{\mathrm{Vol}(M_{\theta_k})}{\mathrm{Vol}(S^{\rho-1})}\right),$$
$$Q_{m,n}(a, b) = \int_a^\infty g_m(\xi)\,\{1 - \bar G_n(b\xi)\}\,d\xi,$$
$\mathrm{Vol}(M_{\theta_k})$ is the volume of the tube $M_{\theta_k}$, which is fully described in Kuriki and Takemura (2001), and $\mathrm{Vol}(S^{\rho-1})$ is the total volume of $S^{\rho-1}$. Using the expressions of $Q^U_{k,p}$ and $Q^L_{k,p}$, it follows that
$$B^L_k(p) = \sum_{e=0,\ e\ \mathrm{even}}^{p-1} \omega_{p-e,k} \int_0^\infty Q_{p-e,\,\rho-p+e}(a^2, b)\,da \le E[T_k(p)] \le \sum_{e=0,\ e\ \mathrm{even}}^{p-1} \omega_{p-e,k} \int_0^\infty Q_{p-e,\,\rho-p+e}(a^2, b)\,da + \int_0^\infty \left(1 - \frac{\mathrm{Vol}(M_{\theta_c})}{\mathrm{Vol}(S^{\rho-1})}\right)\bar G_\rho\big((1+\tan^2\theta_c)a^2\big)\,da = B^U_k(p).$$
The ratio $\mathrm{Vol}(M_{\theta_c})/\mathrm{Vol}(S^{\rho-1})$ can be expressed using $(\theta_k, p)$ as in lemmas 3.3 and 3.4 in Kuriki and Takemura (2001). Direct but tedious calculations of the integral $Q_{m,n}(a^2, b)$ by means of formulas 18.18 and 18.19 in Johnson, Kotz, and Balakrishnan (1994), and the previous lemma, yield the expressions tabulated in the theorem.

Proof of Theorem 2. First, we note that
$$\mathrm{MSE}\big[\hat\Lambda_{m,n}(b\,|\,T)\big] = E\big[\hat\Lambda_{m,n}(b\,|\,T) - \Lambda_{m,n}(b)\big]^2 = \mathrm{Bias}^2\big[\hat\Lambda_{m,n}(b\,|\,T)\big] + \mathrm{Var}\big[\hat\Lambda_{m,n}(b\,|\,T)\big] = \big[\Lambda_{m,n}(b\,|\,T) - \Lambda_{m,n}(b)\big]^2 + \mathrm{Var}\big[\hat\Lambda_{m,n}(b\,|\,T)\big].$$
Using expression 18.18 in Johnson et al. (1994) again, we have for even $m$ that
$$\big|\Lambda_{m,n}(b) - \Lambda_{m,n}(b\,|\,T)\big| = \int_T^\infty\!\int_{a^2}^\infty g_m(\xi)\,G_n(b\xi)\,d\xi\,da \le \int_T^\infty\!\int_{a^2}^\infty g_m(\xi)\,d\xi\,da = \int_T^\infty \bar G_m(a^2)\,da$$
$$= \sum_{i=0}^{(m/2)-1}\frac{1}{i!}\int_T^\infty \Big(\frac{a^2}{2}\Big)^i \exp\Big(-\frac{a^2}{2}\Big)\,da = \sum_{i=0}^{(m/2)-1}\frac{\Gamma[(2i+1)/2]}{\sqrt{2}\,i!}\,\bar G_{2i+1}(T^2) \le \bar G_{m-1}(T^2)\sum_{i=0}^{(m/2)-1}\frac{\Gamma[(2i+1)/2]}{\sqrt{2}\,i!} = O\big(\exp(-T/2)\,T^{(m/2)-1}\big).$$
The proof for odd $m$ is similar, using equation 18.19 in Johnson et al. (1994). The evaluation of the variance is trivial.

Appendix B: Table of Upper Bounds

The table here (see Table 3) contains the upper bounds $B^U_k(p)$ for dimensions p = 2, ..., 50 and k = 3, 4 as in equation (7) of section 3.5. The kurtosis
Table 3: Skewness and Kurtosis Upper Bounds.

Dimension  UB-skewness  UB-kurtosis    Dimension  UB-skewness  UB-kurtosis
 2           1.5993       1.7536        27         45.6915      128.2290
 3           2.9790       2.8356        28         48.1634      137.4000
 4           3.3430       4.5500        29         50.6783      146.8870
 5           4.4404       6.4576        30         53.2353      156.6900
 6           5.6317       8.6776        31         55.8339      166.8090
 7           6.9225      11.2116        32         58.4734      177.2450
 8           8.2636      14.0606        33         61.1532      187.9970
 9           9.6954      17.2250        34         63.8726      199.0650
10          11.1995      20.7051        35         66.6312      210.4490
11          12.7727      24.5010        36         69.4283      222.1490
12          14.4123      28.6129        37         72.2634      234.1660
13          16.1157      33.0409        38         75.1380      246.4990
14          17.8806      37.7849        39         78.0458      259.1480
15          19.7050      42.8451        40         80.9920      272.1130
16          21.5870      48.2214        41         83.9745      285.3950
17          23.5251      53.9138        42         86.9630      298.9930
18          25.5157      59.9224        43         90.0460      312.9070
19          27.5629      66.2473        44         93.1343      327.1370
20          29.6600      72.8883        45         96.2571      341.6840
21          31.8075      79.8455        46         99.7390      356.5460
22          34.0042      87.1189        47        102.6050      371.7250
23          36.2491      94.7085        48        105.8290      387.2200
24          38.5413     102.6140        49        109.0860      403.0320
25          40.8796     110.8360        50        112.3760      419.1600
26          43.2633     119.3750
values increase much faster than skewness bounds, and both increase at a rate that is faster than the linear rate.

Acknowledgments

We gratefully acknowledge the reviewers' suggestions to include comparisons with probabilistic dimension selection methods.

References

Amari, S.-I. (2002). Independent component analysis (ICA) and method of estimating functions. IEICE Trans. Fundamentals, E85-A(3), 540–547.
Anderson, T. W. (1985). Introduction to multivariate statistical analysis (2nd ed.). New York: Wiley.
Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3, 1–48.
Beckmann, C. F., & Smith, S. M. (2004). Probabilistic independent component analysis for functional magnetic resonance imaging. IEEE Trans. Med. Imaging, 23(2), 137–152.
Bishop, C. M. (1999). Bayesian PCA. In M. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 382–388). Cambridge, MA: MIT Press.
Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. Available online at http://www.ics.uci.edu/∼mlearn/MLRepository.html and http://www.kernel-machines.com/.
Cardoso, J.-F. (1999). High-order contrasts for independent component analysis. Neural Computation, 11(1), 157–192.
Davies, P. I., & Higham, N. J. (2000). Numerically stable generation of correlation matrices and their factors. BIT, 40, 640–651.
Flury, B., & Riedwyl, H. (1988). Multivariate statistics: A practical approach. Cambridge: Cambridge University Press.
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.
Gilmour, S., & Koch, I. (2004). Understanding Australian illicit drug markets: Unsupervised learning with independent component analysis. In Proceedings of the 2004 Intelligent Sensors, Sensor Networks and Information Processing Conference, ISSNIP (pp. 271–276). Melbourne, Australia: ARC Research Network on Sensor Networks.
Hastie, T., & Tibshirani, R. (2002). Independent component analysis through product density estimation. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16 (pp. 649–656). Cambridge, MA: MIT Press.
Huber, P. J. (1985). Projection pursuit (with discussion). Annals of Statistics, 13, 435–475.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., & Oja, E. (2005). FastICA2.4. Available online at http://www.cis.hut.fi/projects/ica/fastica/code/dlcode.html.
Johnson, N. L., Kotz, S., & Balakrishnan, N. (1994). Continuous univariate distributions (2nd ed.). New York: Wiley.
Jones, M. C., & Sibson, R. (1987). What is projection pursuit? Journal of the Royal Statistical Society Series A, 150, 1–36.
Koch, I., Marron, J. S., & Chen, J. (2005). Independent component analysis and simulation of nongaussian populations of kidneys. Manuscript submitted for publication.
Kuriki, S., & Takemura, A. (2001). Tail probabilities of the maxima of multilinear forms and their applications. Annals of Statistics, 29, 328–371.
Machado, S. G. (1983). Two statistics for testing for multivariate normality. Biometrika, 70, 713–718.
Malkovich, J. F., & Afifi, A. A. (1973). On tests for multivariate normality. Journal of the American Statistical Association, 68, 176–179.
Mardia, K. V., Kent, J., & Bibby, J. (1979). Multivariate analysis. Orlando, FL: Academic Press.
Marron, J. S. (2004). pcaSM.m. Available online at http://www.stat.unc.edu/postscript/papers/marron/Matlab7Software/General/.
Minka, T. P. (2000). Automatic choice of dimensionality for PCA. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 598–604). Cambridge, MA: MIT Press.
Penny, W. D., Roberts, S. J., & Everson, R. M. (2001). ICA: Model order selection and dynamic source models. In S. J. Roberts & R. M. Everson (Eds.), Independent component analysis: Principles and practice (pp. 299–314). Cambridge: Cambridge University Press.
Prasad, M., Arcot, S., & Koch, I. (2005). Deriving significant features in high dimensional data with independent component analysis. Manuscript submitted for publication.
Sun, J. (1991). Significance levels in exploratory projection pursuit. Biometrika, 78, 759–769.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal component analysis. J. R. Statist. Soc. B, 61, 611–622.
Received November 21, 2005; accepted June 16, 2006.
LETTER
Communicated by Te-Won Lee
One-Bit-Matching Theorem for ICA, Convex-Concave Programming on Polyhedral Set, and Distribution Approximation for Combinatorics

Lei Xu
[email protected] Department of Computer Science and Engineering, Chinese University of Hong Kong, Shatin, NT, Hong Kong, P. R. C.
According to the proof by Liu, Chiu, and Xu (2004) of the so-called one-bit-matching conjecture (Xu, Cheung, & Amari, 1998a), all the sources can be separated as long as there is a one-to-one same-sign correspondence between the kurtosis signs of all source probability density functions (pdf's) and the kurtosis signs of all model pdf's, which is widely believed and implicitly supported by many empirical studies. However, this proof is made only in a weak sense: the conjecture holds when the global optimal solution of an independent component analysis criterion is reached. Thus, it cannot support the successes of many existing iterative algorithms that usually converge at one of the local optimal solutions. This article presents a new mathematical proof in a strong sense: the conjecture is also true when any one of the local optimal solutions is reached. The proof proceeds by investigating convex-concave programming on a polyhedral set. Theorems are also provided not only on partial separation of sources when there is a partial matching between the kurtosis signs, but also on an interesting duality of maximization and minimization on source separation. Moreover, corollaries are obtained from this duality, with supergaussian sources separated by maximization and subgaussian sources separated by minimization. Also, a corollary is obtained to confirm the symmetric orthogonalization implementation of the kurtosis extreme approach for separating multiple sources in parallel, which works empirically but lacks mathematical proof. Furthermore, a linkage has been set up to combinatorial optimization from a distribution approximation perspective and a Stiefel manifold perspective, with algorithms that guarantee convergence as well as satisfaction of constraints. 1 Introduction Independent component analysis (ICA) aims at blindly separating the independent sources s from an unknown linear mixture x = As via y = Wx.
Neural Computation 19, 546–569 (2007)
© 2007 Massachusetts Institute of Technology

Tong, Inouye, and Liu (1993) showed that y recovers s up to constant scales and a permutation of components when the components of y become
component-wise independent and at most one of them is gaussian. The problem is formalized by Comon (1994) as ICA. Although ICA has been studied from different perspectives, such as the minimum mutual information (MMI) (Bell & Sejnowski, 1995; Amari, Cichocki, & Yang, 1996) and maximum likelihood (ML) (Cardoso, 1998), in the case that W is invertible, all such approaches are equivalent to minimizing the following cost function, D(W) =
$$\int p(y, W)\,\ln \frac{p(y, W)}{\prod_{i=1}^n q(y_i)}\,dy, \tag{1.1}$$
where q (yi ) is the predetermined model probability density function (pdf) and p(y, W) is the distribution on y = Wx. With each model pdf q (yi ) prefixed, however, this approach works only when the components of y are either all subgaussians (Amari et al., 1996) or all supergaussians (Bell & Sejnowski, 1995). To solve this problem, it is suggested that each model pdf q (yi ) is a flexibly adjustable density that is learned together with W, with the help of either a mixture of sigmoid functions that learns the cumulative distribution function (cdf) of each source (Xu, Yang, & Amari, 1996; Xu, Cheung, Yang, & Amari, 1997) or a mixture of parametric pdf’s (Xu, 1997; Xu, Cheung, & Amari, 1998b). A learned parametric mixture–based ICA (LPMICA) algorithm is derived, with successful results on sources that can be either subgaussian or supergaussian, as well as any combination of both types. The mixture model was also adopted in the ICA algorithm by Pearlmutter and Parra (1996), although it did not explicitly target separating the mixed sub- and supergaussian sources. It has also been found that a rough estimate of each source pdf or cdf may be enough for source separation. For instance, a simple sigmoid function such as tanh(x) seems to work well on the supergaussian sources (Bell & Sejnowski, 1995), and a mixture of only two or three gaussians may be enough (Xu et al., 1998b) for the mixed sub- and supergaussian sources. This leads to the so-called one-bit-matching conjecture (Xu et al., 1998a), which states that “all the sources can be separated as long as there is an one-to-one same-sign correspondence between the kurtosis signs of all source pdf’s and the kurtosis signs of all model pdf’s.” This conjecture has been implicitly supported by several other ICA studies (Girolami, 1998; Everson & Roberts, 1999; Lee, Girolami, & Sejnowski, 1999; Welling & Weber, 2001). Cheung and Xu (2000) gave a mathematical analysis for the case involving only two subgaussian sources. 
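As a concrete illustration of the fixed-nonlinearity setting discussed above, the following sketch runs the natural-gradient infomax update W ← W + η(I − tanh(y)yᵀ/N)W, in the spirit of Bell and Sejnowski (1995) and Amari et al. (1996), on two supergaussian (Laplacian) sources. The mixing matrix, learning rate, and iteration count are illustrative choices of ours, not values from the cited articles:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 2, 20000
s = rng.laplace(size=(n, N))                 # supergaussian sources
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                   # illustrative mixing matrix
x = A @ s                                    # observed mixtures x = As

# center and whiten the mixtures
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(x @ x.T / N)
V = E @ np.diag(d**-0.5) @ E.T               # whitening matrix
z = V @ x

# natural-gradient infomax with a fixed tanh nonlinearity
W = np.eye(n)
eta = 0.1
for _ in range(500):
    y = W @ z
    W = W + eta * (np.eye(n) - np.tanh(y) @ y.T / N) @ W

# W V A should approximate a scaled, signed permutation matrix
P = np.abs(W @ V @ A)
print(P.max(axis=1) / P.sum(axis=1))         # close to 1 when separated
```

Each row of |WVA| ends up dominated by a single entry, that is, each output recovers one source up to scale and sign, consistent with the identifiability result of Tong, Inouye, and Liu (1993).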
Amari, Chen, and Cichocki (1997) also studied the stability of an ICA algorithm at the correct separation points via its relation to the nonlinearity φ(y_i) = d ln q_i(y_i)/dy_i, but without touching the circumstance under which the sources can be separated. Recently, the conjecture on multiple sources was proved mathematically in a weak sense (Liu et al., 2004). When only skewness and kurtosis of
sources are considered with Es = 0 and Ess^T = I, and the model pdf's skewness is designed as zero, the problem min_W D(W) of equation 1.1 simplifies to
$$\max_{RR^T=I} J(R), \qquad J(R) = \sum_{i=1}^{n}\sum_{j=1}^{n} r_{ij}^4\,\nu_j^s\,k_i^m, \qquad n \ge 2, \tag{1.2}$$
where R = (r_ij)_{n×n} = WA is an orthonormal matrix, ν_j^s is the kurtosis of the source s_j, and k_i^m is a constant with the same sign as the kurtosis ν_i^m of the model q(y_i).¹ Then it is further proved that the global maximum of equation 1.2 can be reached only by setting R to a permutation matrix up to a certain sign indeterminacy. However, this proof still cannot support the successes of many existing iterative ICA algorithms that typically implement gradient-based local search and thus usually converge to one of the local optimal solutions. In the next section of this article, all the local maxima of equation 1.2 are investigated using special convex-concave programming on a polyhedral set, from which we prove the one-bit-matching conjecture in a strong sense: it is true when any one of the local maxima of equation 1.2 is reached. Theorems have also been provided on separation of sources when there is a partial matching between the kurtosis signs and on an interesting duality of maximization and minimization. Moreover, corollaries are obtained to state that the duality makes it possible to get supergaussian sources by maximization and subgaussian sources by minimization. Another corollary also confirms the symmetric orthogonalization implementation of the kurtosis extreme approach for separating multiple sources in parallel, which works empirically but without mathematical proof (Hyvärinen, Karhunen, & Oja, 2001). In section 3, we discuss that equation 1.2, with R being a permutation matrix up to a certain sign indeterminacy, becomes equivalent to a special example of the following combinatorial optimization:
$$\min_V E_o(V), \quad V = \{v_{ij},\ i = 1, \ldots, N,\ j = 1, \ldots, M\}, \quad \text{subject to}$$
$$C_r: \sum_{j=1}^{M} v_{ij} = 1,\ i = 1, \ldots, N; \qquad C_c: \sum_{i=1}^{N} v_{ij} = 1,\ j = 1, \ldots, M; \qquad C_b: v_{ij}\ \text{takes either 0 or 1}. \tag{1.3}$$
¹ For the specific expression of k_i^m, and for the proof that k_i^m is a constant with the same sign as the kurtosis ν_i^m of the model q(y_i), see theorem 1 of Liu et al. (2004).
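A toy numerical check (ours, not from the article) makes the role of equation 1.2 concrete. With all source kurtoses positive and equal and all k_i^m = 1, one has J(R) = 3 Σ_ij r_ij⁴ with Σ_j r_ij⁴ ≤ (Σ_j r_ij²)² = 1, so a permutation matrix attains the maximum value 3n while a random rotation falls below it:

```python
import numpy as np

def J(R, nu, k):
    """J(R) = sum_ij r_ij^4 nu_j k_i from equation 1.2."""
    return float(np.einsum('ij,j,i->', R**4, nu, k))

rng = np.random.default_rng(0)
n = 4
nu = np.full(n, 3.0)                 # kurtoses of the sources, all positive
k = np.ones(n)                       # model constants with matching signs

perm = np.eye(n)[rng.permutation(n)]              # a permutation matrix
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))  # a random orthonormal matrix

print(J(perm, nu, k), J(Q, nu, k))   # permutation attains the maximum 3n
```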
Figure 1: Convex set and polyhedral set.
This connection suggests investigating combinatorial optimization not only from a distribution approximation perspective of finding a simple distribution to approximate the Gibbs distribution induced from E_o(V), but also from a perspective of gradient flow searching within the Stiefel manifold, with algorithms that guarantee convergence as well as constraint satisfaction.

2 One-Bit-Matching Theorem and Extension

2.1 An Introduction to Convex Programming. To facilitate the mathematical analysis, we briefly introduce convex programming. A set S in R^n is said to be convex if, for x_1 ∈ S and x_2 ∈ S, we have λx_1 + (1 − λ)x_2 ∈ S for any 0 ≤ λ ≤ 1. Shown in Figure 1 are examples of convex sets. As an important special case of convex sets, a set in R^n is called a polyhedral set if it is the intersection of a finite number of closed half-spaces, that is, S = {x : a_i^T x ≤ α_i, for i = 1, ..., m}, where a_i is a nonzero vector and α_i is a scalar for i = 1, ..., m. The second and third images in Figure 1 are two examples. Let S be a nonempty convex set. A vector x ∈ S is called an extreme point of S if x = λx_1 + (1 − λ)x_2 with x_1 ∈ S, x_2 ∈ S, and 0 < λ < 1 implies that x = x_1 = x_2. We denote the set of extreme points by E, illustrated in Figure 1. Let f : S → R, where S is a nonempty convex set in R^n. As shown in Figure 2, the function f is said to be convex on S if
(2.1)
for x_1 ∈ S, x_2 ∈ S and for 0 < λ < 1. The function f is called strictly convex on S if the above inequality holds strictly for each distinct pair x_1 ∈ S, x_2 ∈ S and for 0 < λ < 1. The function f is called concave (strictly concave) on S if −f is convex (strictly convex) on S. Considering an optimization problem min_{x∈S} f(x), if x̄ ∈ S and f(x) ≥ f(x̄) for each x ∈ S, then x̄ is called a global optimal solution. If x̄ ∈ S and if there exists an ε-neighborhood N_ε(x̄) around x̄ such that f(x) ≥ f(x̄) for
Figure 2: Convex and concave function.
each x ∈ S ∩ N_ε(x̄), then x̄ is called a local optimal solution. Similarly, if x̄ ∈ S and if f(x) > f(x̄) for all x ∈ S ∩ N_ε(x̄), x ≠ x̄, for some ε, then x̄ is called a strict local optimal solution. An optimization problem min_{x∈S} f(x) is called a convex programming problem if f is a convex function and S is a convex set.

Lemma 1.
a. Let S be a nonempty open convex set in R^n, and let f : S → R be twice differentiable on S. If its Hessian matrix is positive definite at each point in S, then f is strictly convex.
b. Let S be a nonempty convex set in R^n, and let f : S → R be convex on S. Consider the problem min_{x∈S} f(x), and suppose that x̄ is a local optimal solution to the problem. Then (i) x̄ is a global optimal solution; (ii) if either x̄ is a strict local minimum or f is strictly convex, then x̄ is the unique global optimal solution.
c. Let S be a nonempty compact polyhedral set in R^n, and let f : S → R be a strictly convex function on S. Consider the problem max_{x∈S} f(x). All the local maxima are reached at extreme points of S.

Statements a and b are common knowledge, and statement c is not difficult to understand. As illustrated in Figure 2, assume x̄ is a local maximum but not an extreme point. We may find x_1 ∈ N_ε(x̄), x_2 ∈ N_ε(x̄) such that x̄ = λx_1 + (1 − λ)x_2 for some 0 < λ < 1. It follows from equation 2.1 that f(x̄) < λf(x_1) + (1 − λ)f(x_2) ≤ max[f(x_1), f(x_2)], which contradicts x̄ being a local maximum. At an extreme point x of S, in contrast, x = λx_1 + (1 − λ)x_2 with x_1 ∈ S, x_2 ∈ S, and 0 < λ < 1 implies that x = x_1 = x_2, which does not contradict the definition of a strictly convex function given after equation 2.1. That is, a local maximum can be reached only at one of the extreme points of S. (For details about convex programming, see, e.g., Bazaraa, Sherali, & Shetty, 1993.)

2.2 One-Bit-Matching Theorem. This section aims at showing that every local maximum of J(R) on RR^T = I in equation 1.2 is reached at R,
which is a permutation matrix up to sign indeterminacy at its nonzero elements, as long as there is a one-to-one same-sign correspondence between the kurtosis of all source pdf's and the kurtosis of all model pdf's. The proof proceeds in two key steps. First, we divide the set C of constraint equations RR^T = I into two nonoverlapping sets Cn ∪ Co. That is, for R^T = [r1, r2, . . . , rn] with r_j = [r_1j, r_2j, . . . , r_nj]^T, we have

Cn: Σ_{i=1}^n r_ij^2 = 1, j = 1, . . . , n, for normalization,
Co: r_i^T r_j = 0, i = 1, . . . , n − 1, j = i + 1, . . . , n, for orthogonalization. (2.2)
We find all the local maxima of max s.t. Cn J(R) by proving lemma 2. Second, we consider how these local maxima are affected by adding the constraint set Co. Local maxima of max s.t. Cn J(R) that do not satisfy Co are discarded, and we find that adding Co does not bring any extra local maximum. As a result, we reach our aim in theorem 1.

We let p_ij = r_ij^2 and turn max s.t. Cn J(R) into the following problem:

max_{P∈S} J(P), J(P) = Σ_{i=1}^n Σ_{j=1}^n p_ij^2 ν_j^s k_i^m, P = (p_ij)_{n×n}, n ≥ 2,
S = { p_ij, i, j = 1, . . . , n : Σ_{j=1}^n p_ij = 1 for i = 1, . . . , n, and every p_ij ≥ 0 }, (2.3)

where ν_j^s and k_i^m are the same as in equation 1.2, and S is a convex set, or more precisely a polyhedral set. From every local maximum solution P = [p_ij] of equation 2.3, we can get a subset of local maxima R = [r_ij] of max s.t. Cn J(R) with r_ij = 0 for p_ij = 0 and r_ij = ±1 for p_ij = 1. We stack P into a vector vec[P] of n^2 elements and compute the Hessian H_P with respect to vec[P], resulting in:

H_P is an n^2 × n^2 diagonal matrix with each diagonal element being ν_j^s k_i^m. (2.4)

Thus, whether J(P) is convex can be checked simply via the signs of the products ν_j^s k_i^m. We use E_{n×k} to denote a family of matrices, with each E^{n×k} ∈ E_{n×k} being an n × k matrix in which every row consists of zero elements except that one and only one element is 1.
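The diagonal structure claimed in equation 2.4 can be verified by finite differences; a small sketch with hypothetical values for ν_j^s and k_i^m (note the extra constant factor 2 coming from differentiating p_ij^2):

```python
import numpy as np

# Hypothetical kurtosis values (not from the paper), chosen with mixed signs.
nu_s = np.array([1.5, 0.8, -0.6])   # source kurtoses nu_j^s
k_m  = np.array([2.0, 0.5, -1.0])   # model constants k_i^m
n = len(nu_s)
W = np.outer(k_m, nu_s)             # weight matrix W_ij = k_i^m * nu_j^s

def J(p_vec):
    # J(P) = sum_ij p_ij^2 * nu_j^s * k_i^m, as in equation 2.3.
    return float(np.sum(p_vec.reshape(n, n) ** 2 * W))

# Finite-difference Hessian of J with respect to vec[P].
p0 = np.full(n * n, 1.0 / n)
eps = 1e-4
I = np.eye(n * n)
H = np.empty((n * n, n * n))
for a in range(n * n):
    for b in range(n * n):
        H[a, b] = (J(p0 + eps * (I[a] + I[b])) - J(p0 + eps * I[a])
                   - J(p0 + eps * I[b]) + J(p0)) / eps ** 2

# The Hessian is diagonal, with entries 2 * nu_j^s * k_i^m; equation 2.4
# quotes the entries up to this constant factor 2.
assert np.allclose(H, np.diag(2.0 * W.reshape(-1)), atol=1e-5)
```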
552
L. Xu
Lemma 2. When either ν_i^s > 0, k_i^m > 0, ∀i, or ν_i^s < 0, k_i^m < 0, ∀i, every local maximum of J(P) is reached at a P ∈ E_{n×n}.

Proof. Every ν_j^s k_i^m > 0, and thus it follows from equation 2.4 and lemma 1a that J(P) is strictly convex on the polyhedral set S. It further follows from lemma 1c that all the local maxima of J(P) are reached at the polyhedral set's extreme points, which satisfy Σ_{j=1}^n p_ij = 1 for i = 1, . . . , n; that is, each local maximum P ∈ E_{n×n}.

Lemma 3.
a. Given p = [p_1, . . . , p_n] ∈ P = { p : p_j ≥ 0, j = 1, . . . , n, Σ_{j=1}^n p_j = 1 }, a local maximum of J(p) = Σ_{j=1}^n p_j^2 β_j is reached at:
i. p = [e^k : 0] for k ≥ 1 with β_j > 0, j = 1, . . . , k, and β_j < 0, j = k + 1, . . . , n, where the row vector e^k ∈ E_{1×k} and 0 is a row vector consisting of n − k zeros.²
ii. One p ∈ P that is not of the form [e^k : 0], when β_j < 0, j = 1, . . . , n.³

b. For an unknown 0 < k < n with ν_i^s > 0, k_i^m > 0, i = 1, . . . , k, and ν_i^s < 0, k_i^m < 0, i = k + 1, . . . , n, every local maximum of J(P) in equation 2.3 is reached at

P = [P1, 0; 0, P2], P1 ∈ E_{k×k}, P2 ∈ E_{(n−k)×(n−k)}. (2.5)

Proof.
a. Let J(p) = J^+(p^+) + J^−(p^−) with p = [p^+ : p^−] and

J^+(p^+) = Σ_{j=1}^k p_j^2 β_j, J^−(p^−) = Σ_{j=k+1}^n p_j^2 β_j. (2.6)
J^−(p^−) ≤ 0 is strictly concave by lemma 1a because β_j < 0, and thus it has only one maximum, at p^− = 0. Therefore, all the local maxima of J(p) are reached at [p^+ : 0] and are determined by the local maxima of J^+(p^+) on the polyhedral set Δ = { Σ_{j=1}^k p_j = 1, p_j ≥ 0, j = 1, . . . , k }. It follows from lemma 1a that J^+(p^+) is strictly convex on Δ since β_j > 0. It further follows from lemma 1c that all its local maxima are reached at the extreme points of Δ; that is, each local maximum is reached at p^+ = e^k.

² In the rest of this letter, we use 0 to denote a matrix with all its elements zero, without explicitly indicating its dimension, including the cases of a row vector or a column vector.
³ To illustrate, consider f(x, y) = −ax^2 − by^2, a > 0, b > 0, subject to x + y = 1, x ≥ 0, y ≥ 0. A local maximum of f(x, y) is reached at x = b/(a + b), y = a/(a + b).
Particularly, for β_j < 0, j = 1, . . . , n, it follows from lemma 1a that J(p) = Σ_{j=1}^n p_j^2 β_j is strictly concave. However, its only maximum, at p = 0, is not reachable because of the constraint Σ_{j=1}^n p_j = 1. Instead, the maximum of J(p) subject to this constraint is reached at one p ∈ P that is not of the form [e^k : 0].
b. Notice that the constraint Σ_{j=1}^n p_ij = 1 is imposed only on the ith row p^(i) = [p_i1, . . . , p_in], and that J(P) in equation 2.3 is additive. The problem of finding the local maxima of J(P) s.t. P ∈ S is thus equivalent to separately finding the local maxima of

J(p^(i)) = Σ_{j=1}^n p_ij^2 ν_j^s k_i^m, subject to Σ_{j=1}^n p_ij = 1, p_ij ≥ 0, j = 1, . . . , n. (2.7)

For each i ≤ k, we have β_ij = ν_j^s k_i^m > 0, j = 1, . . . , k, and β_ij < 0, j = k + 1, . . . , n. Therefore, it follows from lemma 3a that p^(i) = [e_k^(i) : 0] with e_k^(i) ∈ E_{1×k}, for i = 1, . . . , k. For each i > k, we have β_ij < 0, j = 1, . . . , k, and β_ij > 0, j = k + 1, . . . , n. Similarly, we have p^(i) = [0 : e_{n−k}^(i)] with e_{n−k}^(i) ∈ E_{1×(n−k)}, for i = k + 1, . . . , n. In summary, we get P given by equation 2.5.

From every local maximum solution P = [p_ij] in lemmas 2 and 3b, all the matrices R = [r_ij] with r_ij = ±√p_ij are local maxima of max s.t. Cn J(R). In other words, all the local maxima of max s.t. Cn J(R) can be summarized as
R = { R = [r_ij] : r_ij = 0 if p_ij = 0, r_ij = ±1 if p_ij = 1 }. (2.8)
We now return to J(R) on RR^T = I by equation 1.2 by adding the orthogonal constraint Co in equation 2.2 to max s.t. Cn J(R). Local maxima in R that do not satisfy Co are discarded. Also, we show that adding Co does not bring any extra local maximum. Along this line, we can prove the following theorem:

Theorem 1. Every local maximum of J(R) on RR^T = I by equation 1.2 is reached at an R that is a permutation matrix up to sign indeterminacy at its nonzero elements, as long as there is a one-to-one same-sign correspondence between the kurtosis of all source pdf's and the kurtosis of all model pdf's.
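Before the proof, theorem 1 can be checked numerically on a small instance: with kurtosis signs matched pairwise (hypothetical values below) and J(R) taken in the r_ij^4-weighted form linked to equation 1.2 through p_ij = r_ij^2, no orthogonal matrix should attain a higher value than the best signed permutation:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
n = 4
# Hypothetical kurtoses with a one-to-one same-sign correspondence:
nu_s = np.array([2.0, 1.0, -0.5, -1.5])   # source kurtoses nu_j^s (assumed)
k_m  = np.array([1.0, 0.5, -1.0, -2.0])   # model constants, signs match pairwise

def J(R):
    # J(R) = sum_ij r_ij^4 * nu_j^s * k_i^m.
    return float(np.sum(R ** 4 * np.outer(k_m, nu_s)))

# Best value over all permutation matrices (sign flips leave r_ij^4 unchanged).
best_perm = max(J(np.eye(n)[list(p)]) for p in permutations(range(n)))

# Random orthogonal matrices (QR of a Gaussian) never beat the best permutation,
# consistent with theorem 1: every local maximum is a signed permutation.
for _ in range(200):
    Q, _unused = np.linalg.qr(rng.normal(size=(n, n)))
    assert J(Q) <= best_perm + 1e-9
```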
Proof. We check, for every local maximum of J(R) s.t. Cn (that is, every solution in R by equation 2.8), whether it satisfies the orthogonal constraint Co in equation 2.2. If not, it is discarded. If it does, it is put into the following solution set,

P = { R : every R ∈ R that satisfies Co }, (2.9)

which is not empty; each P ∈ P is either a permutation matrix or a variant with one or more nonzero elements switched to a negative sign.

We first check whether every R* ∈ P is a local maximum of J(R) on RR^T = I by equation 1.2. Since R* is a local maximum of J(R) s.t. Cn, there is a small enough neighborhood N_{R*} with R* ∈ N_{R*} such that every R′ ∈ N_{R*}, R′ ≠ R*, satisfies Cn and J(R′) < J(R*). If R* also satisfies Co, R* must belong to the intersection N_{R*} ∩ N_{Co}, where N_{Co} denotes the set of all the points that satisfy Co. It further follows from equation 2.2 that N_{R*} ∩ N_{Co} is also a neighborhood of R*, within which we have J(R) < J(R*) for every R ∈ N_{R*} ∩ N_{Co}, R ≠ R*. That is, each R* ∈ P by equation 2.9 is a local maximum of J(R) on RR^T = I by equation 1.2.

Next, we show that adding Co to J(R) s.t. Cn does not create any extra local maximum. Without Co, through the link p_ij = r_ij^2, the problem is equivalent to separately finding the local maxima of J(p^(i)) by equation 2.7 for i = 1, . . . , n, respectively. These individual optimizations may occur within the same n-dimensional space or across several different n-dimensional spaces that may have a certain overlap. With Co added, these individual optimizations are forced to be implemented in n different n-dimensional spaces that are orthogonal to one another. The effect of this process is equivalent to forcing the valid local maxima of J(p^(i)) by equation 2.7 to be located on a more restricted polyhedral set. Without Co, it follows from lemma 3a(i) that the valid local maxima of J(p^(i)) by equation 2.7 are located on the polyhedral set

Δ_i = { p_ij = 0 for ν_j^s k_i^m < 0, p_ij ≥ 0 for ν_j^s k_i^m > 0, and Σ_{p_ij ≥ 0} p_ij = 1 }. (2.10)

With Co added, Δ_i becomes more restrictive, with one or more constraints of type p_ij ≥ 0 further restricted into constraints of type p_ij = 0. Since at least one P by equation 2.5 remains embedded in an n × n nonsingular matrix, for every i there will be at least one j whose corresponding p_ij remains of type p_ij ≥ 0; that is, no situation stated by lemma 3a(ii) occurs. Thus, the valid local maxima of J(p^(i)) by equation 2.7 on the more restricted polyhedral set are merely a subset of those local maxima before imposing the extra restrictions. In other words, no extra local maximum is created.
Summarizing the above two aspects, and noticing that k_i^m has the same sign as the kurtosis ν_i^m of the model density q_i(y_i), the theorem is proved.

Equation 1.2 is obtained from equation 1.1 by considering only the skewness and kurtosis, with model pdf's that have no skewness. In such an approximate sense, all the sources can also be separated by a local-search ICA algorithm (e.g., a gradient-based algorithm) obtained from equation 1.1, as long as there is a one-to-one same-sign correspondence between the kurtosis of all source pdf's and the kurtosis of all model pdf's. Moreover, this approximation can be removed by an ICA algorithm obtained directly from equation 1.2. Under the one-to-one kurtosis sign-matching assumption, we can derive a local search algorithm that is equivalent to maximizing the problem by equation 1.2 directly. A prewhitening is made on the observed samples such that we can consider samples of x with Ex = 0, Exx^T = I. As a result, it follows from I = Exx^T = A Ess^T A^T and Ess^T = I that AA^T = I; that is, A is orthonormal. Thus, an orthonormal W is considered to let y = Wx become independent among its components by

max_{WW^T=I} J(W), J(W) = Σ_{i=1}^n k_i^m ν_i^y, (2.11)

where ν_i^y = E y_i^4 − 3, i = 1, . . . , n, and k_i^m, i = 1, . . . , n, are prespecified constants with the same sign as the kurtosis ν_i^m. We can derive the gradient ∇_W J(W) and then project it onto WW^T = I, which results in an iterative updating algorithm for W in a way similar to equations 3.17 and 3.18 at the end of section 3.3. Such an ICA algorithm actually maximizes the problem by equation 1.2 directly, by noticing y = Wx = WAs = Rs, R = WA, RR^T = I, and thus

ν_i^y = Σ_{j=1}^n r_ij^4 ν_j^s, i = 1, . . . , n. (2.12)
That is, the problem by equation 2.11 is equivalent to the problem by equation 1.2. In other words, under the one-to-one kurtosis sign-matching assumption, it follows from theorem 1 that all the sources can be separated by an ICA algorithm in an exact sense as long as equation 2.12 holds. However, theorem 1 does not tell us how such a kurtosis sign matching is built; this is attempted via equation 1.1 through learning each model pdf q_i(y_i) together with learning W (Xu et al., 1996; Xu et al., 1997; Xu et al., 1998b), as well as in further advances given in Lee et al. (1999) and Welling and Weber (2001), or by equation 103 in Xu (2003). Still, it remains an open problem whether these efforts or the possibility of
developing other techniques can guarantee such a one-to-one kurtosis sign matching, either surely or in some probabilistic sense; this deserves future investigation.

2.3 No Matching and Partial Matching. Next, we consider what happens when the one-to-one kurtosis sign correspondence does not hold. We start with the extreme situation, in the following lemma:

Lemma 4 (no matching). When either ν_i^s > 0, k_i^m < 0, ∀i, or ν_i^s < 0, k_i^m > 0, ∀i, J(P) has only one maximum, which is not in E_{n×n}.

Proof. From equation 2.4 and lemma 1a, J(P) is strictly concave since ν_j^s k_i^m < 0 for every term. Thus, it follows from lemma 1b that it has only one maximum, which is not at the extreme points of S.

Lemma 5 (partial matching). Given two unknown integers k, m with 0 < k < m < n, and provided that ν_i^s > 0, k_i^m > 0, i = 1, . . . , k; ν_i^s k_i^m < 0, i = k + 1, . . . , m; and ν_i^s < 0, k_i^m < 0, i = m + 1, . . . , n, every local maximum of J(P) is reached at P by equation 2.5 with

P1 ∈ E_{m×k}, P2 ∈ E_{(n−m)×(n−k)}, when ν_i^s < 0, k_i^m > 0, i = k + 1, . . . , m;
P1 ∈ E_{k×m}, P2 ∈ E_{(n−k)×(n−m)}, when ν_i^s > 0, k_i^m < 0, i = k + 1, . . . , m. (2.13)
Proof. When ν_i^s < 0, k_i^m > 0, i = k + 1, . . . , m, then for each i ≤ m we have β_ij = ν_j^s k_i^m > 0, j = 1, . . . , k, and β_ij < 0, j = k + 1, . . . , n, and it follows from equation 2.7 and lemma 3a that p^(i) = [e_k^(i) : 0] with e_k^(i) ∈ E_{1×k}, for i = 1, . . . , m. For each i > m, we have β_ij < 0, j = 1, . . . , k, and β_ij > 0, j = k + 1, . . . , n, so that p^(i) = [0 : e_{n−k}^(i)] with e_{n−k}^(i) ∈ E_{1×(n−k)}, for i = m + 1, . . . , n. Thus, we get P with P1 ∈ E_{m×k}, P2 ∈ E_{(n−m)×(n−k)}. The situation ν_i^s > 0, k_i^m < 0, i = k + 1, . . . , m, is just a swap of the positions (i, j); we get P simply by matrix transposition.

Theorem 2. Given two unknown integers k, m with 0 < k < m < n, and provided that ν_i^s > 0, k_i^m > 0, i = 1, . . . , k; ν_i^s k_i^m < 0, i = k + 1, . . . , m; and ν_i^s < 0, k_i^m < 0, i = m + 1, . . . , n, every local maximum of J(R) on RR^T = I
by equation 1.2 is reached at

R = [Π, 0; 0, R̄]

subject to a 2 × 2 block permutation, where Π is a (k + n − m) × (k + n − m) permutation matrix up to sign indeterminacy at its nonzero elements, while R̄ is an (m − k) × (m − k) orthonormal matrix with R̄R̄^T = I, but usually not a permutation matrix up to sign indeterminacy.
Proof. As in the proof of theorem 1, we consider the effect of imposing the orthogonal constraint Co in equation 2.2 on R by equation 2.8, with P by equation 2.5 but P1, P2 by equation 2.13. We give details for the case ν_i^s < 0, k_i^m > 0, i = k + 1, . . . , m; the case ν_i^s > 0, k_i^m < 0, i = k + 1, . . . , m, can be understood simply by matrix transposition.

Although the argument is similar to that of theorem 1, there is a key difference. Considering the row ranks, the rank of P1 ∈ E_{m×k} is k < m, the rank of P2 ∈ E_{(n−m)×(n−k)} is n − m, and thus the rank of P by equation 2.5 is k + n − m < n. Since a permutation matrix remains a permutation matrix after any permutation, without loss of generality we can say that the first k + n − m rows of R have rank k + n − m. Co not only forces the remaining m − k columns of these k + n − m rows to become 0, but also forces every constraint of type p_ij ≥ 0 for ν_j^s k_i^m > 0 in Δ_i by equation 2.10 to become type p_ij = 0 in the remaining m − k rows of R, so that the first k + n − m columns of those m − k rows also become 0. To keep Cn in equation 2.2 satisfied, the normalization has to be imposed on each row of the remaining m − k columns, with all the (m − k) × (m − k) elements featured by ν_j^s k_i^m < 0.

As a result, we get R = [Π, 0; 0, R̄] subject to a 2 × 2 block permutation. As in the theorem statement, Π is a (k + n − m) × (k + n − m) permutation matrix up to sign indeterminacy, while for R̄ we have R̄R̄^T = I from RR^T = I. Moreover, it follows from lemma 3a(ii) that the local maxima of J(p^(i)) by equation 2.7 are reached at a p^(i) not in E_{1×n}; that is, R̄ is usually not a permutation matrix up to sign indeterminacy.

In other words, when there are k + n − m pairs of matching between the kurtosis signs of source and model pdf's, k + n − m sources can be successfully separated with the help of a local-search ICA algorithm, while the remaining m − k sources are not separable.
Suppose that the kurtosis sign of each model is described by a binary random variable ξ_i, with 1 for + and 0 for −; that is, p(ξ_i) = 0.5^{ξ_i} 0.5^{1−ξ_i}. When there are k sources with positive kurtosis signs, there is a probability p(Σ_{i=1}^n ξ_i = k) of having a one-to-one kurtosis sign correspondence, even when the model pdf's are prefixed without knowing the kurtosis signs of the sources. Moreover, even when a one-to-one kurtosis sign correspondence does not hold for all the sources, there will still be n − |ℓ − k| sources recoverable with a probability p(Σ_{i=1}^n ξ_i = ℓ). This explains not only why early ICA studies (Amari et al., 1996; Bell & Sejnowski, 1995) work in some cases while failing in others due to the predetermined model pdf's, but also why some existing heuristic ICA algorithms always work to some extent.

2.4 Maximum Kurtosis versus Minimum Kurtosis. Interestingly, changing the maximization in equations 1.2, 2.3, and 2.11 into minimization leads to similar results, which are summarized in the following lemma and theorem.
Lemma 6.
a. When either ν_i^s > 0, k_i^m > 0, ∀i, or ν_i^s < 0, k_i^m < 0, ∀i, J(P) has only one minimum, which is not in E_{n×n}.
b. When either ν_i^s > 0, k_i^m < 0, ∀i, or ν_i^s < 0, k_i^m > 0, ∀i, every local minimum of J(P) is reached at a P ∈ E_{n×n}.
c. For an unknown 0 < k < n with ν_i^s < 0, k_i^m > 0, i = 1, . . . , k, and ν_i^s > 0, k_i^m < 0, i = k + 1, . . . , n, every local minimum of J(P) is reached at

P = [P1, 0; 0, P2], P1 ∈ E_{k×k}, P2 ∈ E_{(n−k)×(n−k)}.

d. For two unknown integers k, m with 0 < k < m < n, with ν_i^s > 0, k_i^m > 0, i = 1, . . . , k; ν_i^s k_i^m < 0, i = k + 1, . . . , m; and ν_i^s < 0, k_i^m < 0, i = m + 1, . . . , n, every local minimum of J(P) is reached at

P = [0, P1; P2, 0],

with either P1 ∈ E_{k×(n−m)}, P2 ∈ E_{(n−k)×m} when ν_i^s > 0, k_i^m < 0, i = k + 1, . . . , m, or P1 ∈ E_{m×(n−k)}, P2 ∈ E_{(n−m)×k} when ν_i^s < 0, k_i^m > 0, i = k + 1, . . . , m.
Proof. The proof is similar to those of lemmas 2 to 5. The key difference is a shift in focus from the maximization of a convex function on a polyhedral set to the minimization of a concave function on a polyhedral set, with swaps between minimum and maximum, maxima and minima, convex and concave, and positive and negative, respectively. The key point is that lemma 1 remains correct after these swaps.

Similar to theorem 2, from the above lemma we get:

Theorem 3.
a. When either ν_i^s k_i^m < 0, i = 1, . . . , n, or ν_i^s < 0, k_i^m > 0, i = 1, . . . , k, and ν_i^s > 0, k_i^m < 0, i = k + 1, . . . , n, for an unknown 0 < k < n, every local minimum of J(R) on RR^T = I by equation 1.2 is reached at a permutation matrix R up to sign indeterminacy at its nonzero elements.
b. For two unknown integers k, m with 0 < k < m < n, with ν_i^s > 0, k_i^m > 0, i = 1, . . . , k; ν_i^s k_i^m < 0, i = k + 1, . . . , m; and ν_i^s < 0, k_i^m < 0, i = m + 1, . . . , n, every local minimum of J(R) on RR^T = I by equation 1.2 is reached at

R = [Π, 0; 0, R̄]

subject to a 2 × 2 block permutation. When m + k ≥ n, Π is an (n − m + n − k) × (n − m + n − k) permutation matrix up to sign indeterminacy at its nonzero elements, while R̄ is an (m + k − n) × (m + k − n) orthonormal matrix with R̄R̄^T = I, but usually not a permutation matrix up to sign indeterminacy. When m + k < n, Π is a (k + m) × (k + m) permutation matrix up to sign indeterminacy at its nonzero elements, while R̄ is an (n − k − m) × (n − k − m) orthonormal matrix with R̄R̄^T = I, but usually not a permutation matrix up to sign indeterminacy.
In a comparison of theorems 2 and 3, when m + k ≥ n, comparing n − m + n − k with k + n − m, more sources can be separated by minimization
than maximization if k < 0.5n, and by maximization than minimization if k > 0.5n. When m + k < n, comparing k + m with k + n − m, more sources can be separated by minimization than maximization if m > 0.5n, and by maximization than minimization if m < 0.5n.

We further consider the special case k_i^m = 1, ∀i. In this case, equation 1.2 simplifies to

J(R) = Σ_{i=1}^n Σ_{j=1}^n r_ij^4 ν_j^s, n ≥ 2. (2.14)
From theorem 2 at n = m, we can easily obtain the following corollary:

Corollary 1. For an unknown integer 0 < k < n with ν_i^s > 0, i = 1, . . . , k, and ν_i^s < 0, i = k + 1, . . . , n, every local maximum of J(R) on RR^T = I by equation 2.14 is reached at

R = [Π, 0; 0, R̄]

subject to a 2 × 2 block permutation, where Π is a k × k permutation matrix up to sign indeterminacy at its nonzero elements, while R̄ is an (n − k) × (n − k) orthonormal matrix with R̄R̄^T = I, but usually not a permutation matrix up to sign indeterminacy.

Similarly, from theorem 3 we also get:
Corollary 2. For an unknown integer k with 0 < k < n, with ν_i^s > 0, i = 1, . . . , k, and ν_i^s < 0, i = k + 1, . . . , n, every local minimum of J(R) on RR^T = I by equation 2.14 is reached at

R = [R̄, 0; 0, Π]

subject to a 2 × 2 block permutation, where Π is an (n − k) × (n − k) permutation matrix up to sign indeterminacy at its nonzero elements, while R̄ is a k × k orthonormal matrix with R̄R̄^T = I, but usually not a permutation matrix up to sign indeterminacy.

According to corollary 1, the k supergaussian sources are separated by max_{RR^T=I} J(R), while according to corollary 2 the n − k subgaussian sources are separated by min_{RR^T=I} J(R). In implementation, from equation 2.11 we get

J(W) = Σ_{i=1}^n ν_i^y. (2.15)
Then, from max_{WW^T=I} J(W) we get the k supergaussian components, and from min_{WW^T=I} J(W) we get the n − k subgaussian components. Thus, instead of learning a one-to-one kurtosis sign matching, the problem can be turned into one of selecting the supergaussian components of y = Wx with W obtained via max_{WW^T=I} J(W), and of selecting the subgaussian components of y = Wx with W obtained via min_{WW^T=I} J(W). Though we know neither k nor which components of y should be selected, we can pick those
with positive kurtosis signs as supergaussian ones after max_{WW^T=I} J(W), and pick those with negative kurtosis signs as subgaussian ones after min_{WW^T=I} J(W). The reason comes from ν_i^y = Σ_{j=1}^n r_ij^4 ν_j^s and the above corollaries. By corollary 1, the kurtosis of each supergaussian component of y is simply one of ν_j^s > 0, j = 1, . . . , k. Although the kurtosis of each of the other components of y is a weighted combination of ν_j^s < 0, j = k + 1, . . . , n, the kurtosis signs of these all remain negative. Similarly, we can find the subgaussian components according to corollary 2.

Another corollary can be obtained from equation 2.11 by considering the special case k_i^m = sign[ν_i^y], ∀i:

max_{WW^T=I} J(W), J(W) = Σ_{i=1}^n |ν_i^y|. (2.16)
This leads to what is called the kurtosis extreme approach and its extensions (Delfosse & Loubaton, 1995; Moreau & Macchi, 1996; Hyvarinen et al., 2001), where studies began by extracting one source with a vector w and then extracting multiple sources, either by sequentially implementing the one-vector algorithm such that each newly extracted vector is orthogonal to the previous ones, or by implementing the one-vector algorithm on all the vectors of W in parallel together with a symmetric orthogonalization at each iterative step, as suggested by Hyvarinen et al. (2001). In the literature, the success of using one vector w to extract one source has been proved mathematically, and the proof carries over easily to sequentially extracting a new source with its corresponding vector w orthogonal to the subspace spanned by the previous ones. However, this mathematical proof is not applicable to the above symmetric-orthogonalization-based implementation of the one-vector algorithm in parallel over all the vectors of W. Actually, what Hyvarinen et al. (2001) suggested can only ensure convergence of a symmetric-orthogonalization-based algorithm, but cannot guarantee that this local-search iterative algorithm converges to a solution that separates all the sources, though experiments have usually been successful.
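The symmetric orthogonalization step mentioned above is commonly implemented as W ← (WW^T)^{−1/2} W; a minimal SVD-based sketch (one standard realization, not necessarily the exact routine of the cited works):

```python
import numpy as np

def symmetric_orthogonalize(W):
    """Return (W W^T)^(-1/2) W, the closest orthonormal matrix to W."""
    # Via SVD: W = U S V^T  =>  (W W^T)^(-1/2) W = U V^T.
    U, _singular_values, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))
W_orth = symmetric_orthogonalize(W)

# The result satisfies the constraint W W^T = I used in equation 2.11.
assert np.allclose(W_orth @ W_orth.T, np.eye(4), atol=1e-10)
```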
When ν_i^y = Σ_{j=1}^n r_ij^4 ν_j^s holds, which is true only when the prewhitening can be made perfectly, it follows from equation 2.16 that

min_{RR^T=I} J(R), J(R) = Σ_{i=1}^n Σ_{j=1}^n r_ij^4 ν_j^s, (2.17)

which is covered by lemma 2 and theorem 2. Thus, we can directly prove the following corollary:
Corollary 3. As long as ν_i^y = Σ_{j=1}^n r_ij^4 ν_j^s holds, every local minimum of the above J(R) on RR^T = I is reached at a permutation matrix up to sign indeterminacy.

Actually, this provides a mathematical proof of the success of the symmetric-orthogonalization-based implementation of the one-vector algorithm in parallel at separating all the sources.

3 Combinatorial Optimization, Distribution Approximation, and Stiefel Manifold

The combinatorial optimization problem by equation 1.3 has been encountered in various applications and still remains difficult to solve. Many efforts have also been made in the neural networks literature since Hopfield and Tank (1985). As summarized in Xu (2003), these efforts can be roughly classified according to how they deal with C_col, C_row, and C_b. Although almost all the neural-network-motivated approaches are parallel implementable, they share one unfavorable feature: these intuitive approaches have no theoretical guarantee of convergence even to a feasible solution.

Interestingly, focusing on local maxima only, equations 1.2 and 2.3 can be regarded as special examples of the combinatorial optimization problem by equation 1.3, simply by regarding p_ij or r_ij as v_ij. Although such a linkage is not useful for ICA, where we do not need to seek a global optimization, a link from equation 1.3 to 1.2 and even 1.1 leads to two interesting questions. Could we consider the combinatorial optimization by equation 1.3 from the perspective of approximating one distribution by another simple model distribution, as in equation 1.1? Could we replace the constraints C_col, C_row, and C_b by the Stiefel manifold RR^T = I for developing a more effective implementation?

3.1 A Distribution Approximation Perspective of Combinatorial Optimization. Finding a global minimization solution of E_o(V) under a set of constraints is equivalent to finding a global peak of the following Gibbs distribution,

p(V, β) = (1/Z_β) e^{−(1/β) E_o(V)}, Z_β = Σ_V e^{−(1/β) E_o(V)}, (3.1)

subject to the constraints, since max_V p(V, β) is equivalent to max_V ln p(V, β) or min_V E_o(V). Usually this p(V, β) has many local maxima, so it is difficult to get the peak V_p. To avoid this difficulty, we use a simple distribution q(V) to approximate p(V, β) on a domain D_v such that the global peak of q(V) is easy to find and that p(V, β) and q(V) share the
same peak V_p ∈ D_v, where D_v is considered using the following support of p(V, β),

D_ε(β) = { V : p(V, β) > ε }, for a small constant ε > 0, (3.2)

under the control of the parameter β. For a sequence β_0 > β_1 > · · · > β_t, we have D_ε(β_t) ⊂ · · · ⊂ D_ε(β_1) ⊂ D_ε(β_0), and each of these domains includes the global minimization solution of E_o(V), since the equivalence of max_V p(V, β) to min_V E_o(V) is irrelevant to β. Therefore, we can find a sequence q_0(V), q_1(V), . . . , q_t(V) that approximates p(V, β) on the shrinking domain D_ε(β). For a large β_t, p(V, β) has large support, and thus q_t(V) adapts the overall configuration of p(V, β) in the large domain D_ε(β). As β_t reduces, q_t(V) concentrates more and more on adapting the detailed configuration of p(V, β) around the global peak solution V_p ∈ D_ε. As long as β_0 is large enough and β reduces slowly enough toward zero, we can find the global minimization solution of E_o(V). We adopt the following for implementing such a distribution approximation:

min_q KL(q, p), KL(q, p) = Σ_{V∈D_v} q(V) ln [ q(V) / p(V, β) ]. (3.3)
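For a tiny assignment-type cost (a hypothetical 3 × 3 example, not from the letter), the Gibbs distribution of equation 3.1 can be tabulated exhaustively over the feasible permutation matrices, showing how it concentrates on the global minimizer of E_o(V) as β shrinks:

```python
import numpy as np
from itertools import permutations

# Hypothetical cost E_o(V) = sum_ij C_ij v_ij over 3x3 permutation matrices V.
C = np.array([[4.0, 1.0, 3.0],
              [2.0, 0.0, 5.0],
              [3.0, 2.0, 2.0]])

perms = list(permutations(range(3)))
E = np.array([sum(C[i, p[i]] for i in range(3)) for p in perms])

def gibbs(beta):
    # p(V, beta) of equation 3.1, with Z_beta the sum of the weights.
    w = np.exp(-E / beta)
    return w / w.sum()

# As beta shrinks, the distribution peaks on argmin E_o(V).
p_small_beta = gibbs(0.05)
assert np.argmax(p_small_beta) == np.argmin(E)
assert p_small_beta[np.argmin(E)] > 0.99
```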
Alternatively, we can also consider min_p KL(q, p), which leads to a class of Metropolis sampling-based mean-field approaches. (For details, see section II(B) in Xu, 2003.) Here, we consider only equation 3.3, with q(V) in the following simple forms:

q_1(V) = Z_1^{−1} Π_{i,j} e^{v_ij ln q_ij}, Z_1 = Σ_V Π_{i,j} e^{v_ij ln q_ij}, 0 ≤ q_ij < ∞,
q_2(V) = Π_{i,j} q_ij^{v_ij} (1 − q_ij)^{1−v_ij}, 0 ≤ q_ij ≤ 1, (3.4)
and from the constraints in equation 1.3, we have

C_c: Σ_{i=1}^N v_ij = 1, j = 1, . . . , M; C_r: Σ_{j=1}^M v_ij = 1, i = 1, . . . , N;
⟨v_ij⟩ = q_ij Z_ij / Z_1, with Z_ij = Σ_V Π_{k≠i, l≠j} e^{v_kl ln q_kl}, for q_1(V),
⟨v_ij⟩ = q_ij, for q_2(V), (3.5)

where ⟨x⟩ denotes the expectation of the random variable x. When N, M are large, we have Z_ij ≈ Z_1, and thus ⟨v_ij⟩ ≈ q_ij for the case of q_1(V).
Putting equation 3.4 into equation 3.3, and after certain derivations shown in Xu (2003), equation 3.3 becomes equivalent to

min_{q_ij} E({q_ij}), subject to equation 3.5,
E({q_ij}) = (1/β) E_o({q_ij}) + B({q_ij}),
B({q_ij}) = Σ_{ij} q_ij ln q_ij, for q_1(V),
B({q_ij}) = Σ_{ij} [ q_ij ln q_ij + (1 − q_ij) ln(1 − q_ij) ], for q_2(V).
The case of q_1(V) interprets the Lagrange transform approach with the barrier Σ_{i,j} v_ij ln v_ij in Xu (1994) and justifies the intuitive treatment of simply regarding the discrete v_ij as an analog variable in the interval [0, 1]. From this perspective, these analog variables are the parameters of the simple distribution that we use to approximate the Gibbs distribution induced from the cost E_o(V) of the discrete variables. Subsequently, a discrete solution is recovered from these analog parameters q_ij.

Similarly, the case of q_2(V) interprets and justifies the Lagrange transform approach with the barrier Σ_{i,j} [v_ij ln v_ij + (1 − v_ij) ln(1 − v_ij)] in Xu (1995), where this barrier was intuitively argued to be better than the barrier Σ_{i,j} v_ij ln v_ij because it gives a U-shaped curve. Here, this intuitive preference can also be justified from equation 3.5, since there is an approximation Z_ij ≈ Z_1 used for q_1(V) while transforming ⟨v_ij⟩ into q_ij, but no approximation for q_2(V). Moreover, both barriers are the special cases S(v_ij) = v_ij and S(v_ij) = v_ij/(1 − v_ij) of a family of barrier functions equivalent to minimizing the leaking energy in the classical Hopfield network (Xu, 1995).
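The U-shape claim can be checked directly: the q_2 barrier is symmetric about q = 1/2, returns to zero at both endpoints, and has minimum −ln 2, while the q_1 barrier is not symmetric (a small numerical check, with 0 ln 0 taken as 0):

```python
import numpy as np

def barrier_q1(q):
    # q * ln(q), the barrier used for q1(V), with 0*ln(0) taken as 0.
    q = np.asarray(q, dtype=float)
    safe = np.where(q > 0, q, 1.0)
    return np.where(q > 0, q * np.log(safe), 0.0)

def barrier_q2(q):
    # q*ln(q) + (1-q)*ln(1-q), the barrier used for q2(V).
    q = np.asarray(q, dtype=float)
    return barrier_q1(q) + barrier_q1(1.0 - q)

qs = np.linspace(0.0, 1.0, 11)
b1, b2 = barrier_q1(qs), barrier_q2(qs)

# U shape: zero at both endpoints, minimum -ln(2) at q = 1/2, symmetric.
assert b2[0] == 0.0 and b2[-1] == 0.0
assert np.isclose(b2[5], -np.log(2.0))
assert np.allclose(b2, b2[::-1])
# The q1 barrier, by contrast, is not symmetric about q = 1/2.
assert not np.allclose(b1, b1[::-1])
```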
3.2 Lagrange-Enforcing Algorithms. A general iterative procedure was proposed in Xu (1994) and then refined in Xu (1995) for minimizing E({q_ij}) by considering the following Lagrange barrier cost:

E({q_ij}) = (1/β) E_o({q_ij}) + Σ_{j=1}^M λ_j^col ( Σ_{i=1}^N q_ij − 1 ) + Σ_{i=1}^N λ_i^row ( Σ_{j=1}^M q_ij − 1 ) + B({q_ij}),
B({q_ij}) = Σ_{ij} q_ij ln q_ij, for q_1(V),
B({q_ij}) = Σ_{ij} [ q_ij ln q_ij + (1 − q_ij) ln(1 − q_ij) ], for q_2(V). (3.6)
Following the derivation made in Xu (1994, 1995), it follows from ∂E({q_ij})/∂q_ij = 0 that

q_ij^e = e^{−1} a_i b_j exp( −(1/β) ∂E_o({q_ij})/∂q_ij ), for q_1(V),
q_ij^e = a_i b_j exp( −(1/β) ∂E_o({q_ij})/∂q_ij ) / [ 1 + a_i b_j exp( −(1/β) ∂E_o({q_ij})/∂q_ij ) ], for q_2(V);
a_i = exp(λ_i^row), b_j = exp(λ_j^col), (3.7)

when ∂E_o({q_ij})/∂q_ij is irrelevant to q_ij and ∂²E({q_ij})/∂q_ij² > 0. If the other variables are fixed at their old values, E({q_ij}) is minimized at q_ij = q_ij^e given by equation 3.7. Thus, for an η > 0 small enough, we can update either

q_ij^new = q_ij^e or q_ij^new = q_ij^old + η ( q_ij^e − q_ij^old ), (3.8)

which reduces E({q_ij}) monotonically, since each q_ij^e − q_ij^old is a descending direction of E({q_ij}) along the coordinate q_ij. We can see that the constraints C_c, C_r are satisfied by q_ij^new as long as they are satisfied by both q_ij^old and q_ij^e. What remains is to enforce

Σ_{i=1}^N q_ij^e = 1, j = 1, . . . , M; Σ_{j=1}^M q_ij^e = 1, i = 1, . . . , N, (3.9)
which is achieved by an ENFORCING-LAGRANGE iteration loop, for example, by iteratively solving the above N + M nonlinear equations with respect to {a_i}_{i=1}^N, {b_j}_{j=1}^M, or equivalently by finding {a_i}_{i=1}^N, {b_j}_{j=1}^M that reach the global minimum (i.e., zero) of

P({q_ij}) = Σ_{j=1}^M ( Σ_{i=1}^N q_ij^e − 1 )² + Σ_{i=1}^N ( Σ_{j=1}^M q_ij^e − 1 )². (3.10)
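For the q_1(V) case, solving equation 3.9 for a_i, b_j amounts to alternately rescaling the rows and columns of a positive matrix until it is doubly stochastic, driving P({q_ij}) in equation 3.10 to zero. A Sinkhorn-style sketch of such an inner loop (one standard implementation, not necessarily the paper's exact ENFORCING-LAGRANGE routine; the matrix M below is a hypothetical stand-in for exp(−(1/β) ∂E_o/∂q_ij)):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical positive matrix standing in for exp(-(1/beta) * dE_o/dq_ij).
M = np.exp(rng.normal(size=(5, 5)))

def enforcing_loop(M, tol=1e-12, max_iter=10000):
    """Alternately normalize rows and columns (absorbing a_i, b_j into Q)."""
    Q = M.copy()
    P = np.inf
    for _ in range(max_iter):
        Q /= Q.sum(axis=1, keepdims=True)   # enforce row sums = 1
        Q /= Q.sum(axis=0, keepdims=True)   # enforce column sums = 1
        # P({q_ij}) of equation 3.10: squared violations of equation 3.9.
        P = np.sum((Q.sum(axis=0) - 1) ** 2) + np.sum((Q.sum(axis=1) - 1) ** 2)
        if P < tol:
            break
    return Q, P

Q, P = enforcing_loop(M)
assert P < 1e-12   # both constraint sets of equation 3.9 hold to high accuracy
```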
As this ENFORCING-LAGRANGE loop converges, we have

| Σ_{j=1}^M λ_j^col ( Σ_{i=1}^N q_ij − 1 ) + Σ_{i=1}^N λ_i^row ( Σ_{j=1}^M q_ij − 1 ) | < ε (3.11)

for an arbitrarily small ε. Thus, the fact that equation 3.8 can reduce E({q_ij}) monotonically is equivalent to being able to reduce (1/β) E_o({q_ij}) + B({q_ij})
monotonically. When β becomes small enough, this is also equivalent to equation 3.8 reducing E_o({q_ij}) monotonically under equation 3.11, because B({q_ij}) is bounded for q_ij ∈ [0, 1]. In summary, we get the following iterative procedure that guarantees convergence to a feasible solution that is a minimum of E_o({q_ij}) and satisfies C_c and C_r:

Step 0: Initialize {q_ij} such that they satisfy the constraints C_e^c, C_e^r by equation 3.5.

Step 1: Get {q_ij^e} by equation 3.7, and then call an inner ENFORCING-LAGRANGE loop until it converges or terminates according to a given checking criterion on the satisfaction of equation 3.11. This results in a specific setting of a_i, b_j, i = 1, …, N, j = 1, …, M.

Step 2: Update q_ij^old to q_ij^new by equation 3.8, either sequentially or in parallel.

Step 3: Check whether the procedure has converged according to a pregiven criterion. If yes, stop; otherwise, go to step 1.

Based on the obtained {q_ij}, we can get a discrete solution simply by a threshold Th > 0 as follows:

v_ij = 1 if q_ij > Th, and v_ij = 0 otherwise (resolving ties heuristically).    (3.12)

Moreover, we can also consider explicitly the constraints C_e^c and get

v_ij = 1 if j = arg max_k q_ik, and v_ij = 0 otherwise (resolving ties heuristically).    (3.13)
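The procedure above, together with the discretization rules of equations 3.12 and 3.13, can be sketched in a few lines (a minimal illustration, not the author's implementation: a Sinkhorn-style alternating normalization stands in for the inner ENFORCING-LAGRANGE loop, and the square case N = M is assumed so that both constraint sets of equation 3.9 can hold simultaneously):

```python
import numpy as np

def enforce_constraints(q, iters=500, tol=1e-9):
    """Inner loop: drive the row sums and column sums of q toward 1,
    i.e., push the penalty P({q_ij}) of equation 3.10 toward zero."""
    q = np.clip(q, 1e-12, None)
    for _ in range(iters):
        q = q / q.sum(axis=1, keepdims=True)   # rows sum to 1
        q = q / q.sum(axis=0, keepdims=True)   # columns sum to 1
        # penalty P({q_ij}) of equation 3.10
        p = ((q.sum(axis=0) - 1.0) ** 2).sum() + ((q.sum(axis=1) - 1.0) ** 2).sum()
        if p < tol:
            break
    return q

def discretize_argmax(q):
    """Equation 3.13: v_ij = 1 iff j = argmax_k q_ik (ties resolved by argmax order)."""
    v = np.zeros_like(q)
    v[np.arange(q.shape[0]), q.argmax(axis=1)] = 1.0
    return v

rng = np.random.default_rng(0)
q = enforce_constraints(rng.random((4, 4)))
v = discretize_argmax(q)   # a 0/1 matrix with exactly one 1 per row
```

Here each pass of the loop plays the role of one constraint-enforcing sweep; the outer updates of equation 3.8 are omitted.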
Similarly, we can also consider explicitly the constraints C_e^r, as well as both C_e^c and C_e^r.

3.3 Stiefel Gradient Flow Algorithms. Alternatively, the study of equation 1.2 via equation 2.3 motivates another way to handle the constraints C_e^col, C_e^row, and C_b: let v_ij = r_ij², and then use RR^T = I to guarantee the constraints C_e^col and C_e^row, as well as a relaxed version of C_b (i.e., 0 ≤ v_ij ≤ 1). That is, the problem of equation 1.3 becomes
min_{RR^T = I, N ≤ M} E_o( {r_ij²}_{i=1,j=1}^{N,M} ),   R = {r_ij}_{i=1,j=1}^{N,M}.    (3.14)
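A quick numerical check of this parameterization (a sketch, not from the paper; the orthogonal R is generated here via a QR decomposition): with RR^T = I, the matrix V with v_ij = r_ij² automatically has entries in [0, 1] and unit row sums, and in the square case unit column sums as well, so C_e^row, C_e^col, and the relaxed C_b hold by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
# A random orthogonal R (the square case N = M = 5), obtained by QR decomposition.
R, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V = R * R  # the elementwise (Hadamard) square, v_ij = r_ij^2

assert np.allclose(R @ R.T, np.eye(5))    # R R^T = I
assert np.allclose(V.sum(axis=1), 1.0)    # unit row sums: C_e^row
assert np.allclose(V.sum(axis=0), 1.0)    # unit column sums: C_e^col (square case)
assert V.min() >= 0.0 and V.max() <= 1.0  # relaxed C_b: 0 <= v_ij <= 1
```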
566
L. Xu
We consider problems for which

∂²E_o(V) / (∂vec[V] ∂vec[V]^T) is negative definite,    (3.15)
or for which E_o(V) is in a form similar to J(P) in equation 2.3,

E_o(V) = − Σ_{i=1}^{n} Σ_{j=1}^{n} v_ij² a_j b_i,    (3.16)
with a_i > 0, b_i > 0 for i = 1, …, k and a_i < 0, b_i < 0 for i = k + 1, …, n, after an appropriate permutation of either or both of [a_1, …, a_n] and [b_1, …, b_n]. Similar to the study of equation 2.3, maximizing E_o(V) under the constraints C_e^col, C_e^row, and v_ij ≥ 0 implies satisfying C_b. In other words, the solutions of equation 3.14 and equation 1.3 are the same. Thus, we can solve the hard combinatorial optimization problem of equation 1.3 using a gradient flow on the Stiefel manifold RR^T = I to maximize the objective of equation 3.14. At least a local optimal solution of equation 1.3 can be reached, with all the constraints C_e^col, C_e^row, and C_b guaranteed automatically.

To get an appropriate updating flow on the Stiefel manifold RR^T = I, we first compute the gradient ∇_V E_o(V) and then get G_R = ∇_V E_o(V) ◦ R, where ◦ denotes the elementwise (Hadamard) product,

[a_11 a_12; a_21 a_22] ◦ [b_11 b_12; b_21 b_22] = [a_11 b_11  a_12 b_12; a_21 b_21  a_22 b_22].

Given a small disturbance δR of R on RR^T = I, it follows from RR^T = I that the solution of δR R^T + R δR^T = 0 must satisfy

δR = ZR + U(I − R^T R),  where U is a matrix of the same size as R and Z^T = −Z is a skew-symmetric matrix.    (3.17)
From Tr[G_R^T δR] = Tr[G_R^T {ZR + U(I − R^T R)}] = Tr[(G_R R^T)^T Z] + Tr[(G_R (I − R^T R))^T U], taking Z = G_R R^T − R G_R^T and U = G_R (I − R^T R) (for which U(I − R^T R) = U), we get three choices of updating direction:

δR = ZR,  (a)   δR = U,  (b)   δR = ZR + U,  (c)   with the update R^new = R^old + γ_t δR.    (3.18)
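The update of equation 3.18 can be sketched numerically as follows (a minimal illustration, not the author's algorithm: the objective E_o, the step size γ_t, and the QR retraction used to return to the manifold after the Euler step are all illustrative assumptions):

```python
import numpy as np

def stiefel_step(R, grad_V_E, gamma=0.01, choice="c"):
    """One update step for equation 3.14 on the manifold R R^T = I (equation 3.18).
    grad_V_E: gradient of E_o with respect to V, evaluated at V = R ∘ R."""
    G = grad_V_E(R * R) * R   # the paper's G_R; a chain-rule factor of 2 is absorbed into gamma
    Z = G @ R.T - R @ G.T     # skew-symmetric: Z^T = -Z
    U = G @ (np.eye(R.shape[1]) - R.T @ R)  # component outside the row space of R
    dR = {"a": Z @ R, "b": U, "c": Z @ R + U}[choice]
    Rn = R + gamma * dR
    # Pull the Euler step back onto the manifold with a QR retraction
    # (one common choice; the retraction is not specified in the paper).
    Q, _ = np.linalg.qr(Rn.T)
    return Q.T

# Tiny demo with a hypothetical objective whose gradient in V is simply V.
rng = np.random.default_rng(0)
Q0, _ = np.linalg.qr(rng.standard_normal((4, 2)))
R = Q0.T                     # a 2 x 4 matrix with R R^T = I
R = stiefel_step(R, lambda V: V, gamma=0.05)
```

The QR retraction keeps RR^T = I exactly after each step; geodesic alternatives are discussed by Edelman, Arias, and Smith (1998).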
That is, we can use one of the above three choices of δ R as the updating direction of R. A general technique for optimization on the Stiefel manifold
that was discussed in detail by Edelman, Arias, and Smith (1998) can also be adopted for implementing the problem of equation 3.14.

4 Conclusion

The one-to-one kurtosis sign-matching conjecture has been proved in a strong sense: every local maximum of max_{RR^T=I} J(R) in equation 1.2 is reached at a permutation matrix, up to a certain sign indeterminacy, if there is a one-to-one same-sign correspondence between the kurtosis signs of all source pdfs and the kurtosis signs of all model pdfs. That is, all the sources can be separated by a local-search ICA algorithm. Theorems have been provided not only on partial separation of sources when there is a partial matching between the kurtosis signs, but also on an interesting duality of maximization and minimization in source separation. Moreover, corollaries are obtained stating that seeking a one-to-one same-sign correspondence can be replaced by using this duality: supergaussian sources can be separated by maximization, and subgaussian sources can be separated by minimization. Furthermore, a corollary is obtained that provides a mathematical proof of the success of the symmetric-orthogonalization implementation of the kurtosis extreme approach.

There still remain problems for study. First, the success of the efforts based on equation 1.1 (Xu et al., 1996; Xu et al., 1997; Xu et al., 1998b; Lee et al., 1999; Welling & Weber, 2001; Xu, 2003) can be explained by their ability to build up a one-to-one kurtosis sign matching. However, we still need a mathematical analysis to guarantee that these approaches can achieve this matching exactly or in some probabilistic sense. Second, the theoretical guarantee on either the kurtosis extreme approach or the approach of extracting supergaussian sources via maximization and subgaussian sources via minimization holds only when the prewhitening can be made perfectly. A comparison of the two approaches, as well as of approaches based on equation 1.1, remains to be studied.
Also, a comparison of the convergence rates of different ICA algorithms may be worthwhile. Last but not least, the linkage of the problem of equation 1.3 to equations 1.2 and 2.3, as well as to equation 1.1, leads us to both a distribution-approximation and a Stiefel-manifold perspective on combinatorial optimization, with algorithms that guarantee both convergence and satisfaction of the constraints; these also deserve further investigation.

Acknowledgments

The work was done in preparation for a possible RGC earmarked grant (CUHK 417707E) to be supported by the Research Grant Council of the Hong Kong SAR. A preliminary version of this work was presented as a plenary talk at the Second International Symposium of Neural Networks (Xu, 2005).
References

Amari, S., Chen, T.-P., & Cichocki, A. (1997). Stability analysis of adaptive blind source separation. Neural Networks, 10, 1345–1351.
Amari, S. I., Cichocki, A., & Yang, H. (1996). A new learning algorithm for blind separation of sources. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear programming: Theory and algorithms. New York: Wiley.
Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Cardoso, J.-F. (1998). Blind signal separation: Statistical principles. Proc. IEEE, 86(10), 2009–2025.
Cheung, C. C., & Xu, L. (2000). Some global and local convergence analysis on the information-theoretic independent component analysis approach. Neurocomputing, 30, 79–102.
Comon, P. (1994). Independent component analysis: A new concept? Signal Processing, 36, 287–314.
Delfosse, N., & Loubaton, P. (1995). Adaptive blind separation of independent sources: A deflation approach. Signal Processing, 45, 59–83.
Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20, 303–353.
Everson, R., & Roberts, S. (1999). Independent component analysis: A flexible nonlinearity and decorrelating manifold approach. Neural Computation, 11, 1957–1983.
Girolami, M. (1998). An alternative perspective on adaptive independent component analysis algorithms. Neural Computation, 10, 2103–2114.
Hopfield, J. J., & Tank, D. W. (1985). Neural computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152.
Hyvarinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Lee, T. W., Girolami, M., & Sejnowski, T. J. (1999). Independent component analysis using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural Computation, 11, 417–441.
Liu, Z. Y., Chiu, K. C., & Xu, L. (2004). One-bit-matching conjecture for independent component analysis. Neural Computation, 16, 383–399.
Moreau, E., & Macchi, O. (1996). High order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10, 1996.
Pearlmutter, B. A., & Parra, L. C. (1996). A context-sensitive generalization of ICA. In Proc. of Int. Conf. on Neural Information Processing (pp. 151–157). Hong Kong: Springer-Verlag.
Tong, L., Inouye, Y., & Liu, R. (1993). Waveform-preserving blind estimation of multiple independent sources. Signal Processing, 41, 2461–2470.
Welling, M., & Weber, M. (2001). A constrained EM algorithm for independent component analysis. Neural Computation, 13, 677–689.
Xu, L. (1994). Combinatorial optimization neural nets based on a hybrid of Lagrange and transformation approaches. In Proc. of World Congress on Neural Networks (pp. 399–404). San Diego, CA.
Xu, L. (1995). On the hybrid LT combinatorial optimization: New U-shape barrier, sigmoid activation, least leaking energy and maximum entropy. In Proc. of Intl. Conf. on Neural Information Processing (pp. 309–312). Beijing, China.
Xu, L. (1997). Bayesian ying-yang learning-based ICA models. In Proc. 1997 IEEE Signal Processing Soc. Workshop on Neural Networks for Signal Processing VII (pp. 476–485). Piscataway, NJ: IEEE.
Xu, L. (2003). Distribution approximation, combinatorial optimization, and Lagrange-barrier. In Proc. of International Joint Conference on Neural Networks 2003 (pp. 2354–2359). Piscataway, NJ: IEEE.
Xu, L. (2005). One-bit-matching ICA theorem, convex-concave programming, and combinatorial optimization. In Advances in neural networks (pp. 5–20). Berlin: Springer-Verlag.
Xu, L., Cheung, C. C., & Amari, S. I. (1998a). Further results on nonlinearity and separation capability of a linear mixture ICA method and learned LPM. In C. Fyfe (Ed.), Proceedings of the International ICSC Workshop on Independence and Artificial Neural Networks (pp. 39–44). Tenerife, Spain.
Xu, L., Cheung, C. C., & Amari, S. I. (1998b). Learned parametric mixture based ICA algorithm. Neurocomputing, 22, 69–80.
Xu, L., Cheung, C. C., Yang, H. H., & Amari, S. I. (1997). Independent component analysis by the information-theoretic approach with mixture of density. In Proc. of 1997 IEEE-INNS International Joint Conference on Neural Networks (Vol. 3, pp. 1821–1826). Piscataway, NJ: IEEE.
Xu, L., Yang, H. H., & Amari, S. I. (1996, April). Signal source separation by mixtures: Accumulative distribution functions or mixture of bell-shape density distribution functions. Paper presented at the Frontier Forum, Institute of Physical and Chemical Research, Japan.
Received January 18, 2005; accepted June 20, 2006.
LETTER
Communicated by Wei-nian Li
Boundedness and Stability for Integrodifferential Equations Modeling Neural Field with Time Delay Xuyang Lou
[email protected] Baotong Cui
[email protected] Research Center of Control Science and Engineering, Southern Yangtze University, Wuxi, Jiangsu 214122, P.R.C.
In this letter, delayed integrodifferential equations modeling neural field (DIEMNF) are studied. The model of DIEMNF is first established as a modified neural field model. Second, it is proved that if the interconnection of the neural field is symmetric, then every trajectory of the system converges to an equilibrium. The boundedness condition for integrodifferential equations modeling a neural field with delay is obtained. Moreover, when the interconnection is asymmetric, we give a sufficient condition that guarantees that the equilibrium of the DIEMNF is unique and is also a global attractor.

1 Introduction

Neural networks have many applications in pattern recognition, image processing, association, and other areas. Some of these applications require that the equilibrium points of the designed network be stable (see Zhang, Li, & Liao, 2005). Therefore, it is vital to study the stability of neural networks. However, time delays can be encountered in implementing artificial neural networks (see Hopfield, 1984; Roska & Chua, 1992), and these delays are often a source of oscillation and instability in neural networks. This situation has motivated the study of the stability of delayed neural networks. In recent years, the stability of delayed neural networks (DNN) has been investigated by many researchers (e.g., Arik, 2000; Cao & Wang, 2003; Cao, 2000, 2001a, 2001b; Cui & Lou, 2006; Lou & Cui, 2006a, 2006b; Zhang, Wei, & Xu, 2005; Liao, Chen, & Sanchez, 2002; Zeng & Wang, 2006; Lu & Chen, 2006). In addition, a continuum of neurons is useful whenever waves of neural activity have to be modeled (see Wilde, 1997). If the states of the individual neurons cannot be measured, it is often possible to measure the electrical or magnetic fields generated by the waves of activity.
This has furthered neural field theory, which has been studied in past years and has found many applications in different areas, for example, perceptual learning (see Shi, Huang, & Zhang, 2004), traveling waves of excitation (see Daniel & Andreas, 2002), and knowledge learning (see Luo, Wen, & Huang, 2002; Jirsa, Jantzen, Fuchs, & Kelso, 2001). Neural field models can be used to understand the transformation mechanism, dynamical behavior, capability, and limitation of neural network models by studying globally topological and geometrical structures on the parameter spaces of neural networks. This class of models includes many important models arising from mathematical biology and neural networks, which have been intensively and extensively studied, and typically takes the form of integrodifferential equations. For example, Amari (1990, 1991) considers a neural field where infinitely many neurons are continuously placed in a two-dimensional field. Also, neural field models have had an important impact in helping to develop an understanding of the dynamics seen in brain slice preparations (see Liley, Cadusch, & Wright, 1999). Recently, more rigorous results on the existence and uniqueness of an equilibrium point of neural networks with asymmetric or general connections have been achieved (see Chen & Amari, 2001; Yang & Dillon, 1994). It is now necessary to consider the case where time delays exist. Moreover, these results provide the motivation and momentum for us to pursue an investigation of the qualitative behaviors of DIEMNF.

This letter is organized as follows. In section 2, we establish some elementary lemmas that will be used in later sections. In section 3, we give a boundedness condition for DIEMNF. In section 4, we give the conditions for the global stability of DIEMNF. The main results include two parts. First, we use the Lyapunov method to investigate the convergence of DIEMNF with symmetric interconnection; then we give the conditions for the global stability of DIEMNF with asymmetric interconnection, using a different method. Finally, the conclusion is put forward in section 5.

Neural Computation 19, 570–581 (2007)
Let χ = C(Ω) and take ‖f‖ = max_{x∈Ω} |f(x)| for all f ∈ χ; then χ is a Banach space. The set of bounded linear operators mapping χ into itself is briefly denoted by [χ]; in particular, the identity operator I_1 ∈ [χ]. A point λ of the complex plane is called a regular point of an operator A ∈ [χ] if [χ] contains the operator (A − λI_1)^{−1}. The set ρ(A) of all regular points of an operator A is open. Its complement σ(A) is called the spectrum of A, and we define

κ(A) = max_{λ∈σ(A)} Re(λ).
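For a finite-dimensional operator, that is, a matrix A, the quantity κ(A) is simply the largest real part over the eigenvalues of A; a small numerical sketch (not from the letter):

```python
import numpy as np

def spectral_abscissa(A):
    """kappa(A): the maximum of Re(lambda) over the spectrum of A."""
    return np.linalg.eigvals(A).real.max()

A = np.array([[-2.0, 1.0],
              [0.0, -0.5]])
print(spectral_abscissa(A))  # about -0.5: the spectrum lies in the open left half plane
```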
2 System Description and Preliminaries

In this letter we propose the following DIEMNF:

∂u(t, x)/∂t = −a u(t, x) + W(x) g(u(t − τ, x)) + ∫_Ω T(x, y) g(u(t, y)) dy + I(x) = F(u),   |g(u(t, x))| ≤ M,    (2.1)
where Ω = Ω̄_D, the closure of a bounded, open, and connected subset Ω_D of R². Let u(t, x) = [u_1(t, x), u_2(t, x), …, u_n(t, x)]^T represent the state (input) of the system at position x in Ω at time t; I(x) = [I_1(x), I_2(x), …, I_n(x)]^T the external input at position x in Ω; and g(u(t, x)) = [g_1(u(t, x)), g_2(u(t, x)), …, g_n(u(t, x))]^T the output of the system at position x at time t, with range |g(u(t, x))| ≤ M. The typical input-output relation g(·) is a continuous and monotone-increasing function; it can be chosen as a sigmoid, tanh, arctan, or something else. W(x) denotes the n × n connection weight matrix. For all x, y ∈ Ω, T(x, y) represents the n × n interconnection matrix, which connects the neuron at position y to the neuron at position x; a is a positive constant. ξ is called an equilibrium of DIEMNF if F(ξ) = 0. In order to proceed with the analysis, we need to establish some definitions and elementary facts.

Definition 1.
A dynamical system on χ is a mapping

Φ : R⁺ × χ → χ,   R⁺ = [0, ∞),

such that:

1. Φ(·, u) : R⁺ → χ is continuous, for all u ∈ χ.
2. Φ(t, ·) : χ → χ is continuous, for all t ∈ R⁺.
3. Φ(0, u) = u, for all u ∈ χ.
4. Φ(t + s, u) = Φ(t, Φ(s, u)), for all t, s ∈ R⁺, u ∈ χ.

Definition 2. F is Fréchet differentiable at ξ ∈ χ if there exists a bounded linear operator F′_ξ ∈ [χ] such that

lim_{‖h‖→0} ‖F(ξ + h) − F(ξ) − F′_ξ(h)‖ / ‖h‖ = 0.    (2.2)
F′_ξ(h) is the so-called Fréchet derivative of F at ξ.

Lemma 1. Suppose that the spectrum of the bounded linear operator A lies in the interior of the left half plane, so that ‖e^{At}‖ ≤ N_0 e^{−ν_0 t} for some N_0, ν_0 > 0, and suppose the condition ‖F(u)‖ ≤ q ‖u‖ (t ∈ R⁺, ‖u‖ ≤ ρ, q > 0) is satisfied for q < ν_0/N_0. Then du/dt = Au + F(u) has a negative Bohl exponent at zero.

Proof.
See Daleckii and Krein (1974, pp. 285–289) for the proof.
Lemma 2. Let w(t, x) = ∂u(t, x)/∂t, where u is a continuously differentiable function from a compact subset Ω ⊂ R² to R. If ‖w(t)‖ ≤ E e^{−ηt}
for some positive constants E < ∞ and η, then u(t) converges to some ξ in χ.

Proof. Since

u(t₂) − u(t₁) = ∫_{t₁}^{t₂} w(t) dt,

we have

‖u(t₂) − u(t₁)‖ ≤ ∫_{t₁}^{t₂} ‖w(t)‖ dt ≤ ∫_{t₁}^{t₂} E e^{−ηt} dt = (E/η)(e^{−ηt₁} − e^{−ηt₂}).

It follows that for all ε > 0, there exists an N ∈ ℕ such that for t₁ ≥ N, t₂ ≥ N,

(E/η)(e^{−ηt₁} − e^{−ηt₂}) ≤ ε,
where ℕ denotes the set of natural numbers. Then, by the Cauchy convergence theorem (Yoshida, 1980), u(t) → ξ as t → ∞.

Lemma 3. Let u(t, x) be a solution of DIEMNF. If u(t) converges to a limit ξ, then ξ is an equilibrium of DIEMNF.

Proof. Suppose that u(t, x) is a solution of DIEMNF and that Φ(0, u(t)) = u(t) → ξ as t → ∞. Clearly, ∂ξ/∂t = 0. Assume that ξ is not an equilibrium, that is, F(ξ) ≠ 0. Then there exists a sufficiently small h > 0 such that

Φ(h, ξ) ≠ ξ. But
Φ(h, ξ) = Φ(h, lim_{t→∞} Φ(0, u(t))) = lim_{t→∞} Φ(h, Φ(0, u(t))) = lim_{t→∞} Φ(h, u(t)) = ξ.    (2.3)
This is a contradiction, so F(ξ) = 0. Thus, ξ is an equilibrium.

Lemma 4 (Halanay's inequality; Cao & Wang, 2004). Let a > b > 0 and let v(t) be a nonnegative continuous function on [t₀ − τ, t₀] satisfying

D⁺v(t) ≤ −a v(t) + b sup_{t−τ≤s≤t} v(s),   t ≥ t₀,

where τ is a nonnegative constant. Then there exist constants k, λ > 0 such that

v(t) ≤ k e^{−λ(t−t₀)},   t ≥ t₀,
where k = sup_{t₀−τ≤s≤t₀} v(s) and λ is the unique positive solution of the equation λ = a − b e^{λτ}.

3 Boundedness

In this section, we give the boundedness condition for DIEMNF, equation 2.1.

Theorem 1. Let T be continuous on Ω × Ω, with Ω a compact subset. Then any solution u(t) of equation 2.1 is bounded if a > 0.

Proof.
Since

∂u(t, x)/∂t = −a u(t, x) + W(x) g(u(t − τ, x)) + ∫_Ω T(x, y) g(u(t, y)) dy + I(x),    (3.1)

we have

∂/∂t [u(t, x) − (1/a) I(x)] = −a [u(t, x) − (1/a) I(x)] + W(x) g(u(t − τ, x)) + ∫_Ω T(x, y) g(u(t, y)) dy.    (3.2)

Then

u(t, x) − (1/a) I(x) = e^{−a t} [u(0, x) − (1/a) I(x)] + ∫_0^t e^{−a(t−s)} W(x) g(u(s − τ, x)) ds + ∫_0^t e^{−a(t−s)} ∫_Ω T(x, y) g(u(s, y)) dy ds.    (3.3)

So it follows that

u(t, x) = e^{−a t} u(0, x) + (1/a) I(x) (1 − e^{−a t}) + ∫_0^t e^{−a(t−s)} W(x) g(u(s − τ, x)) ds + ∫_0^t e^{−a(t−s)} ∫_Ω T(x, y) g(u(s, y)) dy ds.    (3.4)
Obviously, one obtains

|u(t, x)| ≤ e^{−a t} |u(0, x)| + (1/a) |I(x)| + e^{−a t} ∫_0^t e^{a s} |W(x)| |g(u(s − τ, x))| ds + e^{−a t} ∫_0^t e^{a s} ∫_Ω |T(x, y)| |g(u(s, y))| dy ds.    (3.5)
Since g(u(t, x)) is bounded, that is, |g(u(t, x))| ≤ M, then if a > 0, the right-hand side of equation 3.5 is bounded; that is, ‖u(t)‖ ≤ M̃ < ∞ for all t ∈ R⁺. The proof is complete.

4 Stability of the Equilibrium

In this section, we give the main results of this letter. For the symmetric interconnection case, we prove the convergence of DIEMNF, equation 2.1, and we give a sufficient condition for the global stability of DIEMNF when the interconnection is asymmetric.

4.1 Asymptotic Stability When the Interconnection Is Symmetric. According to lemma 3, there exists an equilibrium u∗(x) of DIEMNF, equation 2.1. In the following, we always set v(t, x) = u(t, x) − u∗(x), where u∗(x) is an equilibrium (F(u∗(x)) = 0) and u(t, x) is an arbitrary solution of DIEMNF. Then

∂v(t, x)/∂t = ∂u(t, x)/∂t − ∂u∗(x)/∂t = F(u) − F(u∗)
= −a v(t, x) + W(x)[g(v(t − τ, x) + u∗(x)) − g(u∗(x))] + ∫_Ω T(x, y)[g(v(t, y) + u∗(y)) − g(u∗(y))] dy
= −a v(t, x) + W(x)[g′(u∗(x)) v(t − τ, x) + o(v(t, x))] + ∫_Ω T(x, y)[g′(u∗(y)) v(t, y) + o(v(t, y))] dy.    (4.1)
By definition 2, the Fréchet derivative of F at u∗(x) is

F′_{u∗}(v(x)) = (−a I + B) v(t, x) + H v(t, x),    (4.2)
where

Bv(x) = ∫_Ω T(x, y) g′(u∗(y)) v(t, y) dy,   Hv(x) = W(x) g′(u∗(x)) v(t − τ, x).    (4.3)
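Discretizing Ω on a grid turns B and H into matrices, so the spectral condition on κ(B + H) used below can be probed numerically (a toy one-dimensional sketch with hypothetical kernels W, T and gain bound g′; the delay in H is ignored here, so this checks only the undelayed part of the spectrum):

```python
import numpy as np

n = 50
x = np.linspace(0.0, 1.0, n)
dy = x[1] - x[0]
a = 1.0                                        # leak constant from equation 2.1
gprime = 0.5                                   # hypothetical bound on g'(u*)
W = 0.1 * np.ones(n)                           # hypothetical local weight W(x)
T = 0.2 * np.exp(-5.0 * np.abs(x[:, None] - x[None, :]))  # hypothetical kernel T(x, y)

B = T * gprime * dy      # quadrature matrix for (Bv)(x) = ∫ T(x,y) g'(u*(y)) v(y) dy
H = np.diag(W * gprime)  # (Hv)(x) = W(x) g'(u*(x)) v(.), with the delay ignored
kappa = np.linalg.eigvals(B + H).real.max()
print(kappa < a)         # True: the discretized criterion kappa(B + H) < a holds here
```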
Theorem 2. The equilibrium u∗(x) is asymptotically stable if κ(B + H) < a, where B and H are the compact operators Bv(x) = ∫_Ω T(x, y) g′(u∗(y)) v(t, y) dy and Hv(x) = W(x) g′(u∗(x)) v(t − τ, x), respectively.

Proof.
Equation 4.1 can be written as

∂v/∂t = F(u∗ + v) = F′_{u∗}(v) + o(v).    (4.4)
By lemma 1, if κ(F′_{u∗}) < 0, then equation 4.4 has a negative Bohl exponent at zero; that is, u∗ is asymptotically stable. From equation 4.2, it follows that
κ(F′_{u∗}) = κ(−a I + B + H) ≤ −a + max_{λ∈σ(B+H)} Re(λ).    (4.5)
And from equation 4.3, it is clear that B and H are compact operators, since T(x, y) g′(u∗(y)) is continuous. Hence, σ(B + H) is a compact subset of ℂ, so max_{λ∈σ(B+H)} Re(λ) exists. Therefore,

κ(F′_{u∗}) < 0 ⇔ κ(B + H) = max_{λ∈σ(B+H)} Re(λ) < a.
Thus, the equilibrium u∗ is asymptotically stable if κ(B + H) < a.

4.2 Global Stability When the Interconnection Is Asymmetric. In this section, we give a condition that guarantees the global stability of DIEMNF, equation 2.1. Here, T(x, y) = T(y, x) is not assumed. In the following, we always set w(t) = du(t)/dt. Then

∂u(t, x)/∂t = −a u(t, x) + W(x) g(u(t − τ, x)) + ∫_Ω T(x, y) g(u(t, y)) dy + I(x)
and

∂w(t, x)/∂t = −a w(t, x) + W(x) g′(u(t − τ, x)) w(t, x) + ∫_Ω T(x, y) g′(u(t, y)) w(t, y) dy.

And we denote

D⁺(‖w(t)‖) = lim sup_{h→0⁺} (‖w(t + h)‖ − ‖w(t)‖) / h.
By using linearization, we derive

w(t + h, x) = w(t, x) + h ∂w(t, x)/∂t + O(h²)
= (1 − a h) w(t, x) + h W(x) g′(u(t − τ, x)) w(t, x) + h ∫_Ω T(x, y) g′(u(t, y)) w(t, y) dy + O(h²)    (4.6)

and

v(t + h, x) = v(t, x) + h ∂v(t, x)/∂t + O(h²)
= (1 − a h) v(t, x) + h W(x)[g(v(t − τ, x) + u∗(x)) − g(u∗(x))] + h ∫_Ω T(x, y)[g(v(t, y) + u∗(y)) − g(u∗(y))] dy + O(h²),    (4.7)
where v(t) = u(t) − u∗ (u∗ is an equilibrium and u(t, x) is an arbitrary solution of DIEMNF, equation 2.1).

Theorem 3.
If there exist two positive constants β, η such that

|g′(x)| < β,   ∀x ∈ Ω,    (4.8)

β ‖W‖ < η,   where ‖W‖ = max_{x∈Ω} |W(x)|,    (4.9)

−a + β ∫_Ω |T(x, y)| dy ≤ −2η,    (4.10)

then system 2.1 has a global attractor, and every solution converges to it.
Proof. From equations 4.9 and 4.10, we have

−a + β ‖W‖ + β ∫_Ω |T(x, y)| dy ≤ −η.    (4.11)
Choose a nonzero solution u(t, x) of DIEMNF, equation 2.1, arbitrarily. From equations 4.6 and 4.11, it follows that

‖w(t + h)‖ = max_{x∈Ω} | w(t, x) + h ∂w(t, x)/∂t + O(h²) |
≤ |1 − a h| max_{x∈Ω} |w(t, x)| + hβ max_{x∈Ω} |W(x)| |w(t, x)| + hβ max_{x∈Ω} ∫_Ω |T(x, y)| |w(t, y)| dy + O(h²).

For h > 0 small enough, 1 − a h > 0. Then we obtain

‖w(t + h)‖ ≤ (1 − a h) ‖w(t)‖ + hβ ‖W‖ ‖w(t)‖ + hβ ∫_Ω |T(x, y)| dy ‖w(t)‖ + O(h²)
≤ (1 − a h) ‖w(t)‖ + hβ · ((a − η)/β) ‖w(t)‖ + O(h²)
= (1 − a h) ‖w(t)‖ + h (a − η) ‖w(t)‖ + O(h²).    (4.12)
Then

D⁺(‖w(t)‖) = lim sup_{h→0⁺} (‖w(t + h)‖ − ‖w(t)‖)/h ≤ lim sup_{h→0⁺} ([1 − a h + h(a − η)] ‖w(t)‖ − ‖w(t)‖)/h = −η ‖w(t)‖,    (4.13)
which implies ‖w(t)‖ ≤ ‖w(0)‖ e^{−ηt}. Then, by lemma 2, u(t) → ξ as t → ∞ for some ξ ∈ χ, and lemma 3 guarantees that this limit, which we denote u∗, is an equilibrium. Second, we show that every solution of DIEMNF, equation 2.1, converges to u∗. Let u(t, x) be any solution and v(t) = u(t) − u∗. From
equation 4.9, it follows that

‖v(t + h)‖ = max_{x∈Ω} | (1 − a h) v(t, x) + h W(x)[g(v(t − τ, x) + u∗(x)) − g(u∗(x))] + h ∫_Ω T(x, y)[g(v(t, y) + u∗(y)) − g(u∗(y))] dy + O(h²) |
≤ |1 − a h| max_{x∈Ω} |v(t, x)| + h max_{x∈Ω} |W(x)| |g(v(t − τ, x) + u∗(x)) − g(u∗(x))| + h max_{x∈Ω} ∫_Ω |T(x, y)| |g(v(t, y) + u∗(y)) − g(u∗(y))| dy + O(h²)
≤ |1 − a h| ‖v(t)‖ + h max_{x∈Ω} |W(x)| |g′(f(x))| |v(t − τ, x)| + h max_{x∈Ω} ∫_Ω |T(x, y)| |g′(f(y))| |v(t, y)| dy + O(h²)

(by the mean value theorem, for some intermediate value f)

≤ |1 − a h| ‖v(t)‖ + hβ ‖W‖ ‖v(t − τ)‖ + hβ ∫_Ω |T(x, y)| dy ‖v(t)‖ + O(h²)
≤ |1 − a h| ‖v(t)‖ + hβ · ((a − η)/β) ‖v(t)‖ + hβ ‖W‖ ‖v(t − τ)‖ + O(h²)
= |1 − a h| ‖v(t)‖ + h(a − η) ‖v(t)‖ + hβ ‖W‖ ‖v(t − τ)‖ + O(h²).    (4.14)
Then

D⁺(‖v(t)‖) = lim sup_{h→0⁺} (‖v(t + h)‖ − ‖v(t)‖)/h
≤ lim sup_{h→0⁺} (1/h) ((1 − a h + h(a − η)) ‖v(t)‖ + hβ ‖W‖ ‖v(t − τ)‖ − ‖v(t)‖)
= −η ‖v(t)‖ + β ‖W‖ ‖v(t − τ)‖.    (4.15)
Since equation 4.9 holds, we have η > β ‖W‖. So from lemma 4, it follows that

‖v(t)‖ ≤ ‖v(0)‖ e^{−ε(t−t₀)},    (4.16)
where ε is the positive solution of ε = η − β ‖W‖ e^{ετ} given by lemma 4. Thus, ‖v(t)‖ → 0 as t → ∞; equivalently, v(t) → 0 as t → ∞. Since v(t, x) = u(t, x) − u∗(x), u(t) → u∗ as t → ∞, and the proof is complete.

5 Conclusion

In this letter, we studied the boundedness and stability of DIEMNF and obtained sufficient conditions for the boundedness of DIEMNF and for its stability when the interconnection is symmetric. Moreover, when the interconnection is asymmetric, we gave a sufficient condition that guarantees that the equilibrium of the DIEMNF is unique, by using a different method.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 60674026) and the Science Foundation of Southern Yangtze University (No. 103000-21050323).

References

Amari, S. I. (1990). Mathematical foundations of neurocomputation. Proc. IEEE, 78, 1443–1463.
Amari, S. I. (1991). Mathematical theory of neural learning. New Generation Computing, 8(4), 281–294.
Arik, S. (2000). Stability analysis of delayed neural networks. IEEE Trans. Circ. Syst. I, 47(7), 1089–1092.
Cao, J. D. (2000). Global exponential stability and existence of periodic solutions of delayed CNNs. Sci. China (Series E), 30(6), 541–549.
Cao, J. D. (2001a). A set of stability criteria for delayed cellular neural networks. IEEE Trans. Circ. Syst. I, 48(4), 494–498.
Cao, J. D. (2001b). Global stability conditions for delayed CNNs. IEEE Trans. Circ. Syst. I, 48, 1330–1333.
Cao, J., & Wang, J. (2003). Global asymptotic stability of a general class of recurrent neural networks with time-varying delays. IEEE Trans. Circ. Syst. I, 50, 34–44.
Cao, J., & Wang, J. (2004). Absolute exponential stability of recurrent neural networks with Lipschitz-continuous activation functions and time delays. Neural Networks, 17(3), 379–390.
Chen, T. P., & Amari, S. I. (2001). Stability of asymmetric Hopfield networks. IEEE Trans. Neural Networks, 12(1), 159–163.
Cui, B. T., & Lou, X. Y. (2006).
Global asymptotic stability of BAM neural networks with distributed delays and reaction-diffusion terms. Chaos, Solitons and Fractals, 27, 1347–1354.
Daleckii, J. L., & Krein, M. G. (1974). Stability of solutions of differential equations in Banach space. Providence, RI: American Mathematical Society.
Daniel, C., & Andreas, V. M. H. (2002). Traveling waves of excitation in neural field models: Equivalence of rate descriptions and integrate-and-fire dynamics. Neural Computation, 14, 1651–1667.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci., 81, 3088–3092.
Jirsa, V. K., Jantzen, K. J., Fuchs, A., & Kelso, J. A. S. (2001). Neural field dynamics on the folded three-dimensional cortical sheet and its forward EEG and MEG. In Information processing in medical imaging (pp. 286–299). Berlin: Springer-Verlag.
Liao, X., Chen, G., & Sanchez, E. N. (2002). LMI-based approach for asymptotic stability analysis of delayed neural networks. IEEE Trans. Circ. Syst. I, 49, 1033–1039.
Liley, D. T. J., Cadusch, P. J., & Wright, J. J. (1999). A continuum theory of electrocortical activity. Neurocomputing, 26–27, 795–800.
Lou, X. Y., & Cui, B. T. (2006a). Absolute exponential stability analysis of delayed bi-directional associative memory neural networks. Chaos, Solitons and Fractals, 31(3), 695–701.
Lou, X. Y., & Cui, B. T. (2006b). New LMI conditions for delay-dependent asymptotic stability of delayed Hopfield neural networks. Neurocomputing, 69(16–18), 2374–2378.
Lu, W. L., & Chen, T. P. (2006). Dynamical behaviors of delayed neural network systems with discontinuous activation functions. Neural Computation, 18(3), 683–708.
Luo, S. W., Wen, J. W., & Huang, H. (2002). Knowledge-increasable learning behaviors research of neural field. In Proceedings of the First International Conference on Machine Learning and Cybernetics (pp. 586–590). Piscataway, NJ: IEEE Press.
Roska, T., & Chua, L. O. (1992). Cellular neural networks with non-linear and delay-type template elements and non-uniform grids. International Journal of Circuit Theory and Applications, 20, 469–482.
Shi, Z. Z., Huang, Y. P., & Zhang, J. (2004). Neural field model for perceptual learning.
In Third IEEE International Conference on Cognitive Informatics (ICCI’04), (pp. 192– 198). Wilde, P. D. (1997). Neural network models. Berlin: Springer-Verlag. Yang, H., & Dillon, T. S. (1994). Exponential stability and oscillation of Hopfield graded response neural network. IEEE Trans. Neural Networks, 5, 719–729. Yoshida, K. (1980). Functional analysis. Berlin: Springer-Verlag. Zeng, Z. G., & Wang, J. (2006). Multiperiodicity and exponential attractivity evoked by periodic external inputs in delayed cellular neural networks. Neural Computation, 18(4), 848–870. Zhang, H. B., Li, C. G., & Liao, X. F. (2005). A note on the robust stability of neural networks with time delay. Chaos, Solitons and Fractals, 25, 357–360. Zhang, Q., Wei, X. P., & Xu, J. (2005). On global exponential stability of delayed cellular neural networks with time-varying delays. Applied Mathematics and Computation, 162, 679–686.
Received April 14, 2006; accepted June 9, 2006.
ARTICLE
Communicated by Maneesh Sahani
Temporal Symmetry in Primary Auditory Cortex: Implications for Cortical Connectivity Jonathan Z. Simon
[email protected] Department of Electrical and Computer Engineering, Department of Biology, Institute for Systems Research, and Program in Neuroscience and Cognitive Sciences, University of Maryland, College Park, MD 20742–3311, U.S.A.
Didier A. Depireux
[email protected] Anatomy and Neurobiology, University of Maryland, Baltimore, MD 21201, U.S.A.
David J. Klein
[email protected] Department of Electrical and Computer Engineering and Institute for Systems Research, University of Maryland, College Park, MD 20742–3311, U.S.A.
Jonathan B. Fritz
[email protected] Institute for Systems Research, University of Maryland, College Park, MD 20742–3311, U.S.A.
Shihab A. Shamma
[email protected] Department of Electrical and Computer Engineering, Institute for Systems Research, and Program in Neuroscience and Cognitive Sciences, University of Maryland, College Park, MD 20742–3311, U.S.A.
Neurons in primary auditory cortex (AI) in the ferret (Mustela putorius) that are well described by their spectrotemporal response field (STRF) are found also to have a distinctive property that we call temporal symmetry. For temporally symmetric neurons, every temporal cross-section of the STRF (impulse response) is given by the same function of time, except for a scaling and a Hilbert rotation. This property held in 85% of neurons (123 out of 145) recorded from awake animals and in 96% of neurons (70 out of 73) recorded from anesthetized animals. This property of temporal symmetry is highly constraining for possible models of functional neural connectivity within and into AI. We find that the simplest models of functional thalamic input, from the ventral medial geniculate body (MGB), into the entry layers of AI are ruled out because they are incompatible with the constraints of the observed temporal symmetry. This is also the case for the simplest models of functional intracortical connectivity. Plausible models that do generate temporal symmetry, from both thalamic and intracortical inputs, are presented. In particular, we propose that two specific characteristics of the thalamocortical interface may be responsible. The first is a temporal mismatch between the fast dynamics of the thalamus and the slow responses of the cortex. The second is that all thalamic inputs into a cortical module (or a cluster of cells) must be restricted to one point of entry (or one cell in the cluster). This latter property implies a lack of correlated horizontal interactions across cortical modules during the STRF measurements. The implications of these insights in the auditory system, and comparisons with similar properties in the visual system, are explored.

Neural Computation 19, 583–638 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

The spectrotemporal response field (STRF) is one measurement used to describe how an auditory neuron responds to a spectrally dynamic sound (Aertsen & Johannesma, 1981b; Eggermont, Aertsen, Hermes, & Johannesma, 1981; Hermes, Aertsen, Johannesma, & Eggermont, 1981; Johannesma & Eggermont, 1983; Smolders, Aertsen, & Johannesma, 1979). It is related to the spectral response area of a neuron, defined roughly as the range of frequencies and intensities of pure tones that elicit excitatory or inhibitory responses (though response area measurements are blind to timing information within the responses). The STRF is a measure of the spectral and dynamic properties of auditory response areas and has been used in a variety of auditory areas in both mammals and birds, using a variety of stimuli ranging from simple (e.g., auditory ripples) to complex (e.g.
natural sounds; Aertsen & Johannesma, 1981a; deCharms, Blake, & Merzenich, 1998; Epping & Eggermont, 1985; Escabi & Schreiner, 2002; Fritz, Shamma, Elhilali, & Klein, 2003; Kowalski, Versnel, & Shamma, 1995; Linden, Liu, Sahani, Schreiner, & Merzenich, 2003; Miller, Escabi, Read, & Schreiner, 2002; Qin, Chimoto, Sakai, & Sato, 2004; Rutkowski, Shackleton, Schnupp, Wallace, & Palmer, 2002; Schafer, Rubsamen, Dorrscheidt, & Knipschild, 1992; Schreiner & Calhoun, 1994; Sen, Theunissen, & Doupe, 2001; Shamma & Versnel, 1995; Shamma, Versnel, & Kowalski, 1995; Theunissen, Sen, & Doupe, 2000; Valentine & Eggermont, 2004; Yeshurun, Wollberg, & Dyn, 1987). The stimuli and techniques, adapted from studies of visual processing (De Valois & De Valois, 1988) and psychoacoustic research (Green, 1986; Hillier, 1991; Summers & Leek, 1994), apply linear and nonlinear systems theory to measure the response area of auditory units. From the systems theoretic point of view, the spike train output (averaged over many trials) is determined by the spectrotemporal profile of a broadband, dynamic,
[Figure 1 appears here: three simulated STRF panels, frequency (kHz, 0.25–8) versus time (0–250 ms), with excitatory (+) and inhibitory (−) regions indicated.]
Figure 1: (a) A simulated fully separable STRF, with spectral and temporal one-dimensional cross-sections as inserts (all horizontal slices have the same profile, as do vertical slices). (b) A simulated high-rank STRF with no particular symmetries. (c) A simulated temporally symmetric and quadrant-separable STRF of rank 2. This STRF's symmetry is not obviously visible. This STRF was created by adding to the STRF in a the same STRF except with the spectral cross-section shifted upward by 3/4 octave and the temporal cross-section Hilbert-rotated (see below) by 30 degrees.
acoustic input (Depireux, Simon, & Shamma, 1998). As can be seen from the example in Figure 1, the STRF quantifies how the neuron determines its firing rate as a function of both spectrum (a vertical cross-section gives firing rate as a function of frequency at some moment in time) and time (a horizontal cross-section gives firing rate as a function of time for a single frequency).

There are also analogs of the STRF in other sensory modalities. The most prominent are in vision, where the STRF (De Valois & De Valois, 1988) partially inspired the study of auditory STRFs, and the somatosensory domain (DiCarlo & Johnson, 1999, 2000, 2002; Ghazanfar & Nicolelis, 1999). Any other sensory modality with the concept of a spatial receptive field or response area can be generalized to include the dimension of time and stands to gain from the methodology (Ghazanfar & Nicolelis, 2001; Linden & Schreiner, 2003).

Using spectrotemporally rich stimuli and systems analysis methods, STRFs of hundreds of units in AI (primary auditory cortex) in the anesthetized and awake ferret have been measured, mapped, and compared to those obtained from one- and two-tone stimuli (Depireux, Simon, Klein, & Shamma, 2001; Klein, Simon, Depireux, & Shamma, 2006). For these techniques to be useful, responses to such broadband stimuli must have a robustly linear component, an assumption that has been investigated, and, in many types of cells, confirmed (Escabi & Schreiner, 2002; Klein et al., 2006; Kowalski, Depireux, & Shamma, 1996; Schnupp, Mrsic-Flogel, & King, 2001; Shamma et al., 1995; Theunissen et al., 2000). Robustness requires at least that when the STRF is measured using stimulus sets
with very different spectrotemporal profiles, the STRF is (approximately) independent of stimulus set (Klein et al., 2006). The most important consequence of linearity, the superposition principle (Papoulis, 1987), requires that the responses to combinations of spectrotemporal envelopes be linearly additive and predictable from the STRF. These results were successfully applied to predictions of responses to spectra composed of multiple moving ripples (Klein et al., 2006; Kowalski et al., 1996). It should be noted that many researchers have also observed a significant proportion of units that are unpredictable or poorly responsive and cannot be described by purely linear assumptions (Theunissen et al., 2000; Ulanovsky, Las, & Nelken, 2003). Nor does a response that is robustly linear disallow nonlinearities: neurons with a substantially linear component typically contain substantial nonlinearities as well, such as having firing rates above the mean with a much larger dynamic range than firing rates below the mean. In particular, response profiles typically have strong, static nonlinearities, and yet their linear response is still robust (van Dijk, Wit, Segenhout, & Tubis, 1994). In this work, we show that those neurons in AI that are well described by STRFs have a special property, which we call temporal symmetry. Temporal symmetry means that all temporal cross-sections of any STRF are the same time function (i.e., impulse response), except for a scaling and a Hilbert rotation (defined below). We further show that temporal symmetry has strong implications for the functional neural connectivity of neurons in AI, in both their thalamic input—from the ventral medial geniculate body (MGB)—and their intracortical inputs. In fact, most simple, otherwise compelling models of functional neural connectivity of neurons in AI are disallowed physiologically because they violate the property of temporal symmetry. 
Other models, still biologically plausible, are suggested that obey the temporal symmetry property. Most of the mathematical treatments discussed in this work arose from the context of linear systems. It is crucial, however, that the linear systems treatment itself lies in the context of the more general nonlinear framework, for example, of Volterra and Wiener (Eggermont, 1993; Rugh, 1981). Thus, although the system has strong nonlinearities in addition to its linearity, as long as the linear component of the overall response is robust, the estimated STRF itself should be robust. This robust STRF, with its property of temporal symmetry, can justifiably be used to strongly constrain models of neural connectivity in AI. To reiterate, the presence of strong nonlinearities is consistent with the presence of a robust linear component, and that robust linear component (here, the STRF) places strong constraints on models of neural connectivity.

2 Methods

2.1 Defining the Spectrotemporal Response Field. In the auditory system, the STRF is a function of both time and frequency, h(t, x), where t is
response time (e.g., in ms) and x = log2(f/f0) is the number of octaves above a reference frequency f0. The firing rate of the neuron r(t) has a linear component r_lin(t) given by the linear, temporal convolution of its STRF, h(t, x), with the spectrotemporal envelope of the stimulus s(t, x):

$$r_{\mathrm{lin}}(t) = \int dt' \int dx\, s(t - t', x)\, h(t', x) = \int dx\, s(t, x) *_t h(t, x), \qquad (2.1)$$
where $*_t$ means convolution in the t-dimension (but not x). The full firing rate r(t) will differ from the linear rate r_lin(t) to the extent that the system is not entirely linear, but the STRF determines all of the firing rate that is linear with respect to the spectrotemporal envelope of the stimulus (Depireux et al., 1998). Crucially, even if the system has strong nonlinearities, so long as the linear properties are robust, r_lin(t) will be consistently determined entirely by the STRF and the spectrotemporal envelope of the stimulus.

There are several straightforward interpretations of the STRF, all ultimately equivalent. Four are presented below. The first two interpret the two-dimensional STRF as a collection of one-dimensional response profiles. The third interprets the entire STRF as the spectrotemporal representation of an optimal (acoustic) stimulus. The fourth is a general qualitative scheme for predicting the response to any broadband stimulus from the STRF.

2.1.1 Spectral Response Field Interpretation. Any STRF cross-section at a single moment in time (i.e., a vertical cross-section) can be interpreted as an instantaneous spectral response field (see Figure 2a, bottom left). In this interpretation, a peak in the response field indicates a high (instantaneous) spike rate when the stimulus has enhanced power in that spectral band. A dip in the response field indicates a low (instantaneous) spike rate when the stimulus has enhanced power in that spectral band (e.g., from side-band inhibition). A cross-section can be examined at any instant in time, so the STRF can be interpreted as a time-evolving response field.

2.1.2 Impulse Response Interpretation. Any STRF cross-section at a single frequency (i.e., a horizontal cross-section) can be interpreted as a narrowband impulse response (see Figure 2a, top right). In this interpretation, the impulse response is the response to the instantaneous presentation of high power in a narrow band. There is a separate impulse response for every frequency channel, so the STRF can be interpreted as a spectrally ordered collection of impulse responses.

2.1.3 Optimal Stimulus Interpretation. The entire STRF can be interpreted as a whole by flipping the time axis and interpreting the new image as
Figure 2: (a) An experimentally measured STRF, with several spectral and temporal one-dimensional cross-sections. (b) The same STRF interpreted as the spectrogram of an optimal stimulus. (c) An intricate stimulus and how different areas of the stimulus spectrogram contribute to the neuron’s firing rate at any given moment (single unit/awake: z004b03-p-tor.a1-2).
proportional to the spectrogram of a stimulus. In this picture, stimulus time evolves (from left to right), growing less negative until it stops at t = 0 (see Figure 2b). The stimulus is optimal in a very specific sense: of all stimuli with the same power, the (linear estimate of the) stimulus that gives the highest
spike rate for that power is proportional to the time-reversed STRF. This is a straightforward result from linear systems theory (see, e.g., deCharms et al., 1998; Papoulis, 1987).

2.1.4 General Qualitative Interpretation. The response of the entire neuron to any broadband stimulus can be estimated by convolving features of the stimulus spectrogram with features of the STRF (see Figure 2c). Regions of the STRF that are positive (excitatory) will contribute positively to the firing rate when the stimulus has enhanced power in that spectral band (enhanced relative to the background stimulus level). Similarly, regions of the STRF that are negative (inhibitory) will contribute negatively to the firing rate when the stimulus has enhanced power in that spectral band. Naturally, regions of the STRF that are positive (excitatory) will also contribute negatively to the firing rate when the stimulus has diminished power in that spectral band (diminished relative to the background stimulus level). Less intuitively, but still naturally, regions of the STRF that are negative (inhibitory) will contribute positively to the firing rate when the stimulus has diminished power in that spectral band (since the tendency to reduce firing rate is itself weakened). The firing rate at any given moment is the sum of all these products, with the appropriate weighting. This last interpretation is really just a verbal description of equation 2.1.

2.2 Measuring the STRF. The STRF can be estimated in many different ways, but all are equivalent to inverting equation 2.1, that is, cross-correlating the full neural response rate r(t) with the spectrotemporal envelope of the stimulus s(t, x). This is also known as spike-triggered averaging. Many types of stimuli can be used to measure an STRF as long as they are sufficiently spectrotemporally rich.
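This inversion can be sketched numerically. The following is a minimal sketch (the grid sizes, kernels, and white-noise stimulus are hypothetical illustrations, not taken from the paper): it simulates a linear neuron via the discretized equation 2.1 and then recovers its STRF by cross-correlating the rate with the stimulus.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_t, n_lag = 20, 20000, 50   # hypothetical grid: channels, time bins, lags

# A toy "true" STRF: damped oscillation in time x gaussian tuning in frequency.
lags = np.arange(n_lag)
f_t = np.exp(-lags / 10.0) * np.sin(2 * np.pi * lags / 25.0)
g_x = np.exp(-0.5 * ((np.arange(n_x) - 10) / 2.0) ** 2)
strf_true = np.outer(g_x, f_t)            # h(x, lag), shape (n_x, n_lag)

# Spectrotemporally white stimulus envelope s(x, t).
s = rng.standard_normal((n_x, n_t))

# Equation 2.1, discretized: r_lin(t) = sum over x and lag of h(x, lag) s(x, t - lag).
r_lin = np.zeros(n_t)
for lag in range(n_lag):
    r_lin[lag:] += strf_true[:, lag] @ s[:, : n_t - lag]

# Reverse correlation (spike-triggered average): for a white stimulus this
# recovers the STRF up to estimation noise.
strf_est = np.empty_like(strf_true)
for lag in range(n_lag):
    strf_est[:, lag] = s[:, : n_t - lag] @ r_lin[lag:] / (n_t - lag)
```

For stimuli that are not spectrotemporally white (e.g., natural sounds), the raw cross-correlation must additionally be normalized by the stimulus autocorrelation.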
Among the stimuli used are auditory ripples (e.g., Depireux et al., 1998; Kowalski et al., 1996; Miller & Schreiner, 2000; Qiu, Schreiner, & Escabi, 2003), auditory m-sequences (Kvale & Schreiner, 1997), random chords (e.g., deCharms et al., 1998; Valentine & Eggermont, 2004), and spectrotemporally rich natural sounds (e.g., Theunissen et al., 2000).

2.3 Relationship to Vision and the Spatiotemporal Response Field. The visual system has neurons that are well characterized by the analogous quantity, the STRF $h(t, \vec{x})$, whose arguments are the two-dimensional angular distance $\vec{x}$ and time t, and whose temporal convolution with a spatiotemporal stimulus $s(t, \vec{x})$ (e.g., drifting contrast gratings) gives the linear firing rate of the cell:

$$r_{\mathrm{lin}}(t) = \int d\vec{x}\, s(t, \vec{x}) *_t h(t, \vec{x}), \qquad (2.2)$$
which is the same as equation 2.1 but with retinotopic position $\vec{x}$ instead of cochleotopic position x. All the methodology, above and below, relevant to STRFs h(t, x) also applies to STRFs $h(t, \vec{x})$, with the following substitutions: $x \to \vec{x}$, $\int dx \to \int d\vec{x}$, and $\Omega x \to \vec{\Omega} \cdot \vec{x}$. The applications below will apply only to the extent that cortical visual processing is comparable to cortical auditory processing and to their respective physiological properties, and, of course, which area within cortex is being characterized. Stimuli used to calculate visual STRFs must have contrast that changes in both space and time. Typical stimuli range from drifting contrast gratings, randomly changing dots or bars, and m-sequences to more complex patterns (see, e.g., De Valois, Cottaris, Mahon, Elfar, & Wilson, 2000; De Valois & De Valois, 1988; Reid, Victor, & Shapley, 1997; Richmond, Optican, & Spitzer, 1990; Sutter, 1992; Victor, 1992). These spatiotemporally rich stimuli can be compared to the spectrotemporally rich auditory stimuli described above (auditory ripples, random chords, and spectrotemporally rich natural sounds). We use the abbreviation STRF to apply to both spectral (auditory) and spatial (visual) cases. Context will make clear to which case it refers.

2.4 Rank and Separability. The rank of a two-dimensional function, such as an STRF or a spectrotemporal modulation transfer function (MTF_ST), captures one aspect of how simple the function is. When a two-dimensional function is the simple product of two one-dimensional functions, that is, h(t, x) = f(t)g(x), this captures an important notion of simplicity. When this occurs, the function is of rank 1. When the sum of two products is required, for example, h(t, x) = f_A(t)g_A(x) + f_B(t)g_B(x), the function is of rank 2.
(In cases of rank 2 and higher, we demand that each temporal function f_i(t) be linearly independent of every other temporal function f_j(t), and the same for the spectral functions g; otherwise, we could have used a smaller number of terms.) A rank 2 function is clearly not as simple as a rank 1 function but nevertheless can be expressed rather concisely. In general, the rank of any two-dimensional function is the minimum number of simple products needed to describe the function. (When the functions are approximated as discrete, the definition of rank is identical to the definition of the algebraic rank of a matrix.) An STRF of rank 1, also called fully separable, can be written

$$h_{FS}(t, x) = f(t)g(x), \qquad (2.3)$$
which has this simple interpretation: the temporal processing of the STRF is performed independent of the spectral processing (and, of course, vice versa). A simple model of a neuron with this property is that its spectral processing is due purely to inputs from presynaptic neurons with a range
of center frequencies, while the temporal processing is due to integration in the soma of all inputs arriving from all dendrites. For many peripheral neurons, this model is a good one. An STRF of rank 1 is also called fully separable because its processing separates cleanly into independent spectral and temporal processing stages. The example STRF in Figure 1a is fully separable, and this can be verified by noting that all spectral (vertical) cross-sections have the same shape (the shape of the spectral function g(x)), differing only in amplitude (and possibly sign). Similarly, all temporal (horizontal) cross-sections have the same shape (the shape of the temporal function f(t)), differing only in amplitude (and possibly sign).

An STRF of rank 2 is somewhat less simple and somewhat less straightforward to interpret:

$$h_{R2}(t, x) = f_A(t)g_A(x) + f_B(t)g_B(x). \qquad (2.4)$$
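These rank definitions are easy to check numerically. In the sketch below (the component functions are hypothetical illustrations, not from the paper), a discretized STRF built from one outer product has matrix rank 1, and the sum of two such products with linearly independent factors has rank 2:

```python
import numpy as np

t = np.linspace(0.0, 0.25, 64)   # 250 ms of response-time lags
x = np.linspace(-2.0, 2.0, 32)   # octaves relative to a reference frequency

# Hypothetical separable components (the A and B pairs are linearly independent).
f_A = np.exp(-t / 0.05) * np.sin(2 * np.pi * 20 * t)
f_B = np.exp(-t / 0.05) * np.cos(2 * np.pi * 20 * t)
g_A = np.exp(-0.5 * (x / 0.5) ** 2)
g_B = x * np.exp(-0.5 * (x / 0.5) ** 2)

# Equation 2.3 (fully separable) and equation 2.4 (rank 2), discretized
# with rows indexed by spectrum and columns by time.
h_fs = np.outer(g_A, f_A)
h_r2 = np.outer(g_A, f_A) + np.outer(g_B, f_B)

print(np.linalg.matrix_rank(h_fs))   # 1
print(np.linalg.matrix_rank(h_r2))   # 2
```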
One interpretation comes from noting that this h_R2(t, x) can be written as the sum of two fully separable STRFs, h_FS^A(t, x) = f_A(t)g_A(x) and h_FS^B(t, x) = f_B(t)g_B(x). This implies a possible, but less than satisfying, interpretation: the neuron has exactly two neural inputs, each of which has a fully separable STRF, and then simply adds them. Below we present more realistic interpretations, consistent with known physiology. An STRF described by a generic two-dimensional function would not necessarily have the same physiologically motivated interpretations or models. An STRF of general rank N can be written

$$h_{RN}(t, x) = \underbrace{f_A(t)g_A(x) + f_B(t)g_B(x) + \cdots + f_Z(t)g_Z(x)}_{N\ \text{terms}}. \qquad (2.5)$$
As the rank of an STRF increases, more and more complexity is permitted. Figure 1b demonstrates a simulated STRF of high rank (though still well localized in time and spectrum). STRFs of this complexity are not seen in AI (Klein et al., 2006). In general, higher rank implies more STRF complexity. Lower rank suggests there is a specific property (constraint) that causes this simplicity.

2.5 Singular Value Decomposition Analysis of the STRF. Singular value decomposition (SVD) is a method that can be applied to any finite-dimensional matrix (e.g., a discretized version of the STRF) to establish both its rank and a unique reexpression of the matrix as the sum of terms whose number is the rank of the matrix (Hansen, 1997; Press, Teukolsky, Vetterling, & Flannery, 1986). The SVD decomposition of a matrix M takes the form

$$M_{ij} = \underbrace{\sigma_A u_{Ai} v_{Aj}^T + \sigma_B u_{Bi} v_{Bj}^T + \cdots + \sigma_Z u_{Zi} v_{Zj}^T}_{N\ \text{terms}}, \qquad (2.6)$$
where N is the rank of the matrix, u and v are vectors normalized to have unit power, and each σ is the term's root mean square (RMS) power. If we discretize the STRF into a finite number of frequencies and time steps, x = {x_i} = (x_1, ..., x_M) and t = {t_j} = (t_1, ..., t_N), so that h(t_j, x_i) = h_ij, we see that

$$h_{ij} = \underbrace{\sigma_A u_A(x_i) v_A(t_j) + \sigma_B u_B(x_i) v_B(t_j) + \cdots + \sigma_Z u_Z(x_i) v_Z(t_j)}_{N\ \text{terms}}, \qquad (2.7)$$
where N is the rank of the STRF, the u and v vectors are normalized to have unit power, and each σ is the RMS power of its term. This is the same as equation 2.5, except that time and frequency have been discretized, and thus the STRF has been discretized also. What makes SVD unique among decompositions is that (1) it automatically orders the terms by decreasing power: σ_A > σ_B > ··· > σ_Z; (2) each column u_A, u_B, ..., u_Z is orthogonal to all the others; and (3) each row v_A^T, v_B^T, ..., v_Z^T is orthogonal to all the others. The mathematical specifics are described well in textbooks (see, e.g., Press et al., 1986) and will not be covered here. Mathematically, SVD is intimately related to principal component analysis (PCA), and both are used for a variety of analytic purposes, including noise reduction (Hansen, 1997).

Since measured STRFs are made with noisy measurements (the noise arising from both neural variability and instrument noise), the true rank of the STRF must be estimated. There are a variety of methods to do this (Stewart, 1993), but all use the same conceptual framework: once the power of the noise is estimated, all SVD components with power greater than the noise can be considered signal, and the number of components satisfying this criterion is the estimate of the rank. This estimate of rank is biased (more noise results in a lower rank estimate), but it has been shown that for the range of signal-to-noise ratios and the STRFs used in this study, noise is not an impediment to measuring high rank (Klein et al., 2006).

SVD also motivates us to recast equation 2.5 into its continuous form,

$$h_{RN}(t, x) = \underbrace{\sigma_A v_A(t) u_A(x) + \sigma_B v_B(t) u_B(x) + \cdots + \sigma_Z v_Z(t) u_Z(x)}_{N\ \text{terms}}, \qquad (2.8)$$

where the u and v functions have unit power and each σ is the RMS power of its term. Compared to equation 2.5, it is more complex but less arbitrary: decompositions of the form of equations 2.3, 2.4, and 2.5 are not unique, since amplitude can be arbitrarily shifted between the temporal and spectral components. In equation 2.8, all amplitude information is explicitly carried within each term by its σ coefficient. When the technique of SVD, which is designed for discrete matrices, is applied to continuous two-dimensional functions, as in the case of equation 2.8, it is called the singular
value expansion (Hansen, 1997). We will go back and forth between the continuous and discretized versions of the STRF without loss of generality (so long as N is finite), depending on which formalism is more beneficial.

2.6 Hilbert Transform and Partial Hilbert Transforms and Rotations. We now discuss the Hilbert transform, a standard tool in signal processing and necessary for the phenomenon of temporal symmetry. The Hilbert transform of a function produces the same function but with all its phase components shifted by 90 degrees. This can be seen in the Fourier domain. For a function f(t) with Fourier transform F(ω), that is,

$$F(\omega) = \mathcal{F}_\omega[f(t)] = \int dt\, f(t)\, e^{-j\omega t}, \qquad f(t) = \mathcal{F}_t^{-1}[F(\omega)] = (2\pi)^{-1} \int d\omega\, F(\omega)\, e^{j\omega t}, \qquad (2.9)$$
the Hilbert transform, designated by H or ∧, is defined by

$$\hat{f}(t) = H[f(t)] = \mathcal{F}_t^{-1}\!\left[\mathrm{sgn}(\omega)\, e^{j\pi/2}\, F(\omega)\right], \qquad (2.10)$$
where $e^{j\pi/2} = j$ is a rotation by 90 degrees in the complex plane (the role of sgn(ω) guarantees that the Hilbert transform of a real function is itself a real function). This rotation of phase by 90 degrees means that the Hilbert transform of any sine wave is a cosine wave, and the Hilbert transform of any cosine wave is the negative sine wave; unlike differentiation, however, the amplitude is unchanged by the operation. An important property of the Hilbert transform is that it is orthogonal to the original function, and yet it still has the same frequency content (aside from the DC component, i.e., its mean, which is zeroed out). f̂(t) is said to be "in quadrature" with f(t); a demonstration is illustrated in Figure 3. For the remainder of this section, we assume that any function f(t) that will be Hilbert transformed has mean zero (or has had its mean subtracted manually). The double application of a Hilbert transform, since applying two successive 90 degree rotations is equivalent to one 180 degree rotation, is just a sign inversion:

$$H[H[f(t)]] = \hat{\hat{f}}(t) = -f(t). \qquad (2.11)$$
It is also useful to define a partial Hilbert transform. A Hilbert transform of a function can be viewed as a 90 degree rotation in a mixing-angle plane, so one can define a partial version of the transform:

$$f^\theta(t) = \sin\theta\, \hat{f}(t) + \cos\theta\, f(t). \qquad (2.12)$$
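A sketch of the full and partial Hilbert transforms using scipy's analytic-signal routine (the test function is an arbitrary illustration). Note that equation 2.10 uses +j·sgn(ω), which is the opposite sign of the convention behind scipy's analytic signal, so the imaginary part must be negated:

```python
import numpy as np
from scipy.signal import hilbert

# A zero-mean test function, per the convention stated in the text.
t = np.linspace(0.0, 1.0, 1024, endpoint=False)
f = np.sin(2 * np.pi * 5 * t) * np.exp(-3 * t)
f = f - f.mean()

# scipy.signal.hilbert returns the analytic signal; its imaginary part is the
# standard-convention Hilbert transform, the negative of equation 2.10's.
fhat = -np.imag(hilbert(f))

def hilbert_rotate(f, theta):
    """Partial Hilbert transform of equation 2.12: sin(theta)*fhat + cos(theta)*f."""
    a = hilbert(f)
    return -np.sin(theta) * np.imag(a) + np.cos(theta) * np.real(a)

# theta = 0 returns f, theta = pi/2 returns fhat, and theta = pi returns -f
# (two successive 90 degree rotations are a sign flip, equation 2.11).
# Sanity check of the convention: the transform of a sine is a cosine.
s5 = np.sin(2 * np.pi * 5 * t)    # -imag(hilbert(s5)) matches cos(2*pi*5*t)
```

As in Figure 3, f and fhat are orthogonal yet share the same magnitude spectrum.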
Figure 3: An example of a function and its Hilbert transform. (a) A function (in black) overlaid with its Hilbert transform (in gray); the two are orthogonal. (b) The magnitude of the Fourier transform of the function (in black) overlaid with the magnitude of the Fourier transform of its Hilbert transform (in gray); they overlap exactly. (c) The phase of the Fourier transform of the function (in black) overlaid with the phase of the Fourier transform of its Hilbert transform (in gray). The difference is exactly ±90 degrees (dashed line).
In this convention, note that f̂(t) = f^{π/2}(t), f(t) = f^0(t), and f̂̂(t) = f^π(t) = −f(t). Thus, a partial Hilbert transform still has the same frequency content as the original function, but its phase "rotation" is not restricted to 90 degrees and can be any angle on the complex plane. Physiological examples of the Hilbert transform have been demonstrated in the visual system and have been named "lagged" cells (De Valois et al., 2000; Humphrey & Weller, 1988; Mastronarde, 1987a, 1987b). These lagged cells are located in the lateral geniculate nucleus (LGN), one of the visual thalamic nuclei. We will continue this nomenclature and call any neuron whose impulse response is the Hilbert transform of another the lagged version of the latter. We will further generalize and call any neuron whose impulse response is the partial Hilbert transform (Hilbert rotation) of another the "partially lagged" version of the latter. Note that the lag is a phase lag, not a time lag. The full and partial Hilbert transform or rotation is not restricted to the time domain and is equally applicable to the spectral domain; for example,
$$H[g(x)] = \hat{g}(x), \qquad (2.13)$$
$$H[H[g(x)]] = \hat{\hat{g}}(x) = -g(x), \qquad (2.14)$$
$$g^\theta(x) = \sin\theta\, \hat{g}(x) + \cos\theta\, g(x). \qquad (2.15)$$
2.7 Temporal Symmetry. An important class of STRFs consists of those for which all temporal cross-sections (i.e., each cross-section at a constant spectral index x_c) of the given STRF are related to each other by a simple scaling, g, and rotation, θ, of the same time function,

$$h(t, x_c) = g_{x_c}\, f^{\theta_{x_c}}(t), \qquad (2.16)$$
where each scaling and rotation can depend on x_c. Since this is then true for all spectral indices x, we call the system temporally symmetric and write it in the functional form

$$h_{TS}(t, x) = g(x)\, f^{\theta(x)}(t). \qquad (2.17)$$
The meaning is still the same: all temporal cross-sections are related to each other by a simple scaling and rotation of the same time function. There is only one function of t, that is, f(t), and Hilbert rotations of it (demonstrated in Figure 4). Using the definition of the Hilbert rotation, equation 2.12, we can reexpress equation 2.17 to explicitly show that a temporally symmetric STRF is rank 2 (i.e., is the sum of two linearly independent product terms):

$$h_{TS}(t, x) = g(x)\, f^{\theta(x)}(t) = g(x)\cos\theta(x)\, f(t) + g(x)\sin\theta(x)\, \hat{f}(t) = f(t)\, g_A(x) + \hat{f}(t)\, g_B(x), \qquad (2.18)$$
where

$$g_A(x) = g(x)\cos\theta(x), \qquad g_B(x) = g(x)\sin\theta(x), \qquad \tan\theta(x) = \frac{g_B(x)}{g_A(x)}, \qquad g^2(x) = g_A^2(x) + g_B^2(x). \qquad (2.19)$$

We will often use the form of equation 2.18, which is completely equivalent to equation 2.17. In equation 2.18 it is explicit that a temporally symmetric STRF has rank 2 and cannot have higher rank.

For systems that are not exactly temporally symmetric but are of rank 2, or for systems that have been truncated by SVD to rank 2, we can define an index of temporal symmetry, η_t. This index ranges from 0 to 1, where η_t = 1 for the temporally symmetric case and η_t = 0 when the two time functions are temporally unrelated. First we put equation 2.18, which is explicitly
Figure 4: (a) The simulated temporally symmetric and quadrant-separable STRF from Figure 1c and five fixed-frequency cross-sections, corresponding to five temporal impulse responses. (b) The same five impulse responses but individually Hilbert-rotated and rescaled. (c) The same Hilbert-rotated and rescaled impulse responses superimposed. The Hilbert rotation phases were calculated by taking the negative phase of the complex correlation coefficient between the analytic signal of each temporal cross-section and the analytic signal of the fourth temporal cross-section.
rank 2, into the form of equation 2.8:

$$h_{R2}(t, x) = \sigma_A v_A(t) u_A(x) + \sigma_B v_B(t) u_B(x). \qquad (2.20)$$

Since the u and v functions have unit power, we define the index of temporal symmetry to be the magnitude of the normalized complex inner product between the two temporal analytic signals (Cohen, 1995),

$$\eta_t = \frac{1}{2}\left| \int \left(v_A(t) + j\hat{v}_A(t)\right)^{*} \left(v_B(t) + j\hat{v}_B(t)\right) dt \right|, \qquad (2.21)$$
where * is the complex conjugate operator. The rank 1 case, since it is automatically temporally symmetric, is also given the value η_t = 1.
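The index η_t can be computed directly from an SVD of a discretized STRF. The sketch below (all component functions are hypothetical illustrations) builds a temporally symmetric STRF per equation 2.18 and a rank-2 control STRF whose two temporal functions are unrelated, then evaluates equation 2.21 on the two leading temporal singular vectors:

```python
import numpy as np
from scipy.signal import hilbert

def temporal_symmetry_index(h):
    """Equation 2.21 evaluated on the two leading SVD temporal components.

    numpy returns unit-norm singular vectors, so each analytic signal has
    squared norm ~2 and the 1/2 prefactor normalizes the index to [0, 1].
    (The magnitude is unaffected by the Hilbert-transform sign convention.)
    """
    _, _, vt = np.linalg.svd(h, full_matrices=False)
    a_A, a_B = hilbert(vt[0]), hilbert(vt[1])
    return 0.5 * abs(np.sum(np.conj(a_A) * a_B))

t = np.linspace(0.0, 0.25, 256, endpoint=False)
x = np.linspace(-2.0, 2.0, 32)

f = np.sin(2 * np.pi * 30 * t) * np.exp(-t / 0.04)
f = f - f.mean()
fhat = np.imag(hilbert(f))                       # a Hilbert rotation of f
f2 = np.sin(2 * np.pi * 55 * t) * np.exp(-t / 0.04)
f2 = f2 - f2.mean()                              # temporally unrelated to f

g_A = np.exp(-0.5 * (x / 0.5) ** 2)
g_B = np.roll(g_A, 6)                            # linearly independent profile

h_ts = np.outer(g_A, f) + np.outer(g_B, fhat)    # equation 2.18: symmetric
h_ns = np.outer(g_A, f) + np.outer(g_B, f2)      # rank 2, not symmetric

print(temporal_symmetry_index(h_ts))   # close to 1
print(temporal_symmetry_index(h_ns))   # well below 1
```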
Temporal symmetry's cousin, spectral symmetry, can be defined analogously:

$$h_{SS}(t, x) = f(t)\, g^{\theta(t)}(x) = f(t)\cos\theta(t)\, g(x) + f(t)\sin\theta(t)\, \hat{g}(x) = f_A(t)\, g(x) + f_B(t)\, \hat{g}(x), \qquad (2.22)$$

where

$$f_A(t) = f(t)\cos\theta(t), \qquad f_B(t) = f(t)\sin\theta(t), \qquad \tan\theta(t) = \frac{f_B(t)}{f_A(t)}, \qquad f^2(t) = f_A^2(t) + f_B^2(t), \qquad (2.23)$$

and

$$\eta_s = \frac{1}{2}\left| \int \left(u_A(x) + j\hat{u}_A(x)\right)^{*} \left(u_B(x) + j\hat{u}_B(x)\right) dx \right|. \qquad (2.24)$$
2.8 Spectrotemporal Modulation Transfer Functions (MTF_ST). Just as any STRF may have the property of temporal or spectral symmetry, it may also have the property of quadrant separability. Quadrant separability is most easily described in terms of the spectrotemporal modulation transfer function (MTF_ST), which is presented here. The STRF, which is two-dimensional, can also be represented by its two-dimensional Fourier transform, or its closely related partner, the MTF_ST:

$$H(w, \Omega) = \mathcal{F}_{w,\Omega}[h(t, -x)] = \int dt \int dx\, h(t, x)\, e^{2\pi j(-wt + \Omega x)}, \qquad (2.25)$$

where w and Ω are the coordinates Fourier-conjugate to t and x, respectively (see Depireux et al., 1998, for sign conventions). Examples are shown in Figure 5. It follows that the inverse Fourier transform of H(w, Ω) gives the STRF of the cell:

$$h(t, x) = \mathcal{F}_{t,-x}^{-1}[H(w, \Omega)]. \qquad (2.26)$$
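The sign conventions of equation 2.25 can be implemented with standard FFTs. This sketch (the random test STRF is purely illustrative) pairs the forward-FFT sign with time and the opposite sign with spectrum, and verifies that for a real-valued STRF the (w, Ω) quadrants come in conjugate pairs:

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_t = 32, 128
h = rng.standard_normal((n_x, n_t))    # any real-valued STRF; rows = x, cols = t

# Equation 2.25 pairs exp(-2*pi*j*w*t) with time and exp(+2*pi*j*Omega*x)
# with spectrum: a forward FFT along t, and an unnormalized inverse FFT
# (opposite sign) along x.
H = np.fft.fft(h, axis=1)              # t -> w
H = np.fft.ifft(H, axis=0) * n_x       # x -> Omega

# For real h, H(-w, -Omega) = conj(H(w, Omega)), so half of the (w, Omega)
# plane determines the other half.
neg_x = (-np.arange(n_x)) % n_x
neg_t = (-np.arange(n_t)) % n_t
H_neg = H[np.ix_(neg_x, neg_t)]
assert np.allclose(H_neg, np.conj(H))
```

Undoing the two transforms (equation 2.26) recovers h exactly, which is a quick consistency check on the sign bookkeeping.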
[Figure 5 appears here: three simulated MTF_ST panels, spectral density (±1.6 cycles/octave) versus modulation rate (±24 Hz), with phase given by hue (0–360 degrees).]
The MTF_ST is a Fourier transform of the STRF and is also used to characterize auditory processing. For example, to the extent that the STRF represents a stimulus that the neuron prefers, the Fourier transform provides an analytical description of the features of that stimulus. Power at low w, which has dimensions of cycles per second or Hz, corresponds to smoother temporal features, or slower temporal evolution. Power at high w corresponds to finer temporal features and faster temporal resolution. Lower
Figure 5: (a) The simulated fully separable MTF_ST generated by the STRF in Figure 1a. Phase is given by hue (scale on right) and amplitude by intensity. Since the STRF is fully separable, the MTF_ST is separable as well in both amplitude (all horizontal slices have the same intensity profiles, as do vertical slices) and phase. Separability leads to a phase profile that is the direct sum of a purely temporally dependent phase and a purely spectrally dependent phase. When the phase is primarily linear, as it is for STRFs well localized in spectrum and time, the phase profile is diagonal, with slope determined by the location in spectrum and time of the STRF. (b) The simulated MTF_ST generated by the high-rank STRF in Figure 1b. Phase as in a. Since there is no particular symmetry in the STRF, there is no particular symmetry in the MTF_ST. To the extent that the STRF is localized in spectrum and time, the phase slope is approximately constant. (c) The simulated MTF_ST generated by the temporally symmetric and quadrant-separable STRF of rank 2. The symmetry is now more visible than in Figure 1. This MTF_ST is somewhat directionally selective: quadrant 2 (characterizing responses to sounds with an upward spectral glide) is strong and clearly separable within the quadrant; quadrant 1 (characterizing responses to sounds with a downward spectral glide) is weak. Quadrants 3 and 4 are the complex conjugates of quadrants 1 and 2, by equation 2.27.
Figure 6: (a) An STRF equal to the sum of two simulated fully separable STRFs, identical to each other except translated in time and spectrum (the one with lower best frequency and shorter delay is displayed in Figure 1a). This results in a (not fully separable) strongly velocity-selective STRF. (b) The MTFST of the STRF in a . The spectrotemporal modulation transfer function is clearly not a vertical column–horizontal row product within each quadrant, and therefore the entire spectrotemporal modulation transfer function cannot be quadrant separable. Because the spectrotemporal modulation transfer function is not quadrant separable, it cannot be temporally symmetric. Phase is given by hue (scale on right); amplitude is given by intensity. (c) An STRF equal to the sum of two simulated temporally symmetric STRFs: identical to each other except translated in time and spectrum (the one with higher best frequency and shorter delay is displayed in Figure 1c). (d) The first ten singular values (from SVD) of the STRF show a rank of 4, which cannot be temporally symmetric (temporal symmetry requires a rank of 2).
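The construction of Figures 6a and 6b can be reproduced numerically. In this sketch (our generic construction, using random components), the sum of two translated copies of a fully separable STRF has rank 2, yet one quadrant block of its transform is not a rank 1 (separable) product:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 64
f1, g1 = rng.standard_normal(n), rng.standard_normal(n)
f2, g2 = np.roll(f1, 5), np.roll(g1, 7)   # same pair, translated in t and in x

h_sep = np.outer(f1, g1)                  # fully separable: rank 1
h_sum = h_sep + np.outer(f2, g2)          # sum of two separable STRFs: rank 2

def quadrant_sep_ratio(h):
    """s2/s1 of one quadrant block of the 2D transform.

    A quadrant-separable transfer function has rank 1 blocks (ratio ~ 0);
    sign conventions only permute which block is called quadrant 1.
    """
    H = np.fft.fft2(h)
    Q = H[1:h.shape[0] // 2, 1:h.shape[1] // 2]
    s = np.linalg.svd(Q, compute_uv=False)
    return s[1] / s[0]

assert quadrant_sep_ratio(h_sep) < 1e-8   # separable STRF: separable quadrants
assert quadrant_sep_ratio(h_sum) > 1e-3   # translated sum: not quadrant separable
```

The translation multiplies each quadrant block by a phase ramp on both axes, so the block becomes a sum of two distinct rank 1 terms rather than a single column–row product.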
(versus higher) Ω, which has dimensions of cycles per octave, corresponds to smoother (versus finer) spectral features, such as broad (versus sharp) peaks or formants (versus harmonics). The four possible sign combinations of w and Ω break the MTF_ST into four quadrants, numbered 1 (w, Ω > 0), 2 (w < 0, Ω > 0), 3 (w, Ω < 0), and 4 (w > 0, Ω < 0). From equation 2.25, it can be seen that H(w, Ω) is a complex-valued function. Because h(t, x) is purely real, the MTF_ST has a complex-conjugate symmetry,

H(−w, −Ω) = H*(w, Ω).
(2.27)
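The conjugate symmetry of equation 2.27 can be checked numerically for any real-valued STRF. A minimal sketch (our construction; a plain 2D FFT is used, since every Fourier sign convention shares this symmetry):

```python
import numpy as np

rng = np.random.default_rng(1)
h = rng.standard_normal((32, 24))   # any real-valued STRF sample

H = np.fft.fft2(h)

# Map bin (k, l) to (-k mod N, -l mod M): reverse both axes, then roll by one
# so that the zero-frequency bin stays in place.
H_neg = np.roll(np.flip(H, axis=(0, 1)), shift=1, axis=(0, 1))

# Equation 2.27: H(-w, -Omega) equals the complex conjugate of H(w, Omega).
assert np.allclose(H_neg, np.conj(H))
```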
Equation 2.27 also holds for the Fourier transform of any real function of t and x. This means that the value of the MTF_ST at any point in quadrant 3 is fully determined by the value at the reflected point in quadrant 1 (and similarly for the pair quadrant 4 and quadrant 2). The MTF_ST, and its properties and interpretations, are discussed in greater detail elsewhere (Depireux et al., 1998, 2006).

2.9 Directionality and Quadrant Separability. When the STRF is separable, the MTF_ST is separable, because the Fourier transform of the STRF is given by the simple product of the Fourier transforms of f(t) and g(x). It was noticed that even when the STRF is not separable, the quadrants of the MTF_ST are still individually separable (Klein et al., 2006; Kowalski et al., 1996), but neither the significance nor the origin of this property was well understood. (See McLean & Palmer, 1994, for the analogous case in vision.) With the discovery of temporal symmetry and the relationship between temporal symmetry and quadrant separability, the significance is now clear, as will be shown. One of the most useful properties of the Fourier representation (the MTF_ST) over the spectrotemporal representation (the STRF) is that in the Fourier representation, the contributions of the response to stimuli with upward- and downward-moving spectral features are explicitly segregated in different quadrants. The response to any downward-moving component is governed entirely by the MTF_ST in quadrant 1, and the response to any upward-moving component is governed entirely by the MTF_ST in quadrant 2. Quadrant separability is a particular generalized symmetry property that an STRF and its MTF_ST may have, but it is obvious only when seen in the MTF_ST domain: within each quadrant, the MTF_ST is separable. For example, in quadrant 1, where both w and Ω are positive, the MTF_ST is the simple product of a horizontal (temporal) function and a vertical (spectral) function.
Similarly, in quadrant 2, where w is negative and Ω is positive, the MTF_ST is the simple product of a different horizontal (temporal) function and a different vertical (spectral) function. An example is shown in
Figure 5c. Quadrant-separable MTF_STs are defined and characterized with more mathematical detail in the appendix. Historically in audition, the property of quadrant separability was noticed when the spectrotemporal modulation transfer function measured in quadrant 1 was separable, and the spectrotemporal modulation transfer function measured in quadrant 2 was also separable, but the two separable functions were not the same (Depireux et al., 2001; Kowalski, Depireux, & Shamma, 1996) (if it is separable in quadrants 1 and 2, then it is automatically separable in quadrants 3 and 4, from equation 2.27). In vision studies, quadrant separability was invoked for the notion of directional selectivity (Watson & Ahumada, 1985). In this work, we argue that quadrant separability in the auditory system is due entirely to temporal symmetry. Quadrant separability is a property of both the STRF and the MTF_ST, though visible only in the MTF_ST. Nevertheless, since the STRF and MTF_ST are just different representations of the same response properties, it is a property that is held (or not) by both. It is shown in the appendix that a quadrant-separable STRF can always be written in the form

h_QS(t, x) = f_A(t)g_A(x) + f̂_B(t)ĝ_B(x) + f̂_A(t)g_B(x) + f_B(t)ĝ_A(x),
(2.28)
which is shown in the appendix to be of rank 4 unless additional symmetries, such as those discussed later, reduce the rank to 2 or 1.

2.10 Quadrant Separability and Temporal Symmetry. Comparing equation 2.28 to equation 2.18, we can see that the temporally symmetric h_TS(t, x) has the same form as h_QS(t, x) for the special case that f_B(t) = 0. Thus, a temporally symmetric STRF is automatically quadrant separable. It is not a generic quadrant-separable STRF, since its rank is not 4 but 2 (by inspection of equation 2.18). It nevertheless possesses the defining property of quadrant separability: each quadrant of its MTF_ST is separately separable (e.g., see Figure 5c). We will use this property below to show that an STRF that is not quadrant separable cannot be temporally symmetric.

2.11 Quadrant Separability and General Symmetries. There are three ways of taking the generic quadrant-separable STRF of rank 4 and finding a generalized symmetry that causes it to be lower rank. The generalized symmetry of temporal Hilbert symmetry has already been discussed. It can be seen by taking the most general form of a quadrant-separable STRF, equation 2.28, and noting that setting f_B(t) = 0 = f̂_B(t), so that only f_A(t) and f̂_A(t) survive as temporal functions, reduces the number of independent components (i.e., the rank) from 4 to 2. The generalized symmetry of spectral symmetry works analogously, since the mathematics is blind to the difference between time and spectrum. It can be seen by noting that setting g_B(x) = 0 = ĝ_B(x), so that only g_A(x) and
ĝ_A(x) survive as spectral functions, also reduces the number of independent components (i.e., the rank) from 4 to 2. This gives the spectrally symmetric, quadrant-separable STRF:

h_SS(t, x) = f_A(t)g_A(x) + f_B(t)ĝ_A(x).
(2.29)
We are not aware of any physiological system in which this generalized symmetry is realized. Finally, a third generalized symmetry is pure directional selectivity. In the case of this generalized symmetry, we set

f_A(t) = f̂_B(t),   g_A(x) = ĝ_B(x),
(2.30)
which, when combined with the identity that the double application of the Hilbert operator is a Hilbert rotation of 180 degrees and so is equivalent to multiplication by −1, gives
h_DS(t, x) = 2[f_A(t)g_A(x) − f̂_A(t)ĝ_A(x)].
(2.31)
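Equation 2.31 is easy to instantiate numerically. In the sketch below (our construction; a discrete circular Hilbert transform plays the role of the Hilbert operator, and the quadrant labels of a plain FFT stand in for the paper's sign convention), the resulting h_DS has rank 2 and its transform occupies only two opposite quadrants, the signature of pure directional selectivity:

```python
import numpy as np

def hilbert_im(v):
    """Discrete circular Hilbert transform (imag part of the analytic signal)."""
    n = v.size
    m = np.zeros(n)
    m[0] = 1.0
    if n % 2 == 0:
        m[n // 2] = 1.0
        m[1:n // 2] = 2.0
    else:
        m[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(v) * m).imag

n = 64
t = np.arange(n)
fA = np.exp(-0.5 * ((t - 20.0) / 6.0) ** 2) * np.cos(2 * np.pi * 6 * t / n)
gA = np.exp(-0.5 * ((t - 40.0) / 8.0) ** 2) * np.cos(2 * np.pi * 4 * t / n)

# h_DS(t, x) = 2 [ f_A(t) g_A(x) - fhat_A(t) ghat_A(x) ]   (equation 2.31)
h_DS = 2.0 * (np.outer(fA, gA) - np.outer(hilbert_im(fA), hilbert_im(gA)))

s = np.linalg.svd(h_DS, compute_uv=False)
assert s[2] < 1e-8 * s[0]                 # rank 2, as stated in the text

# Directional selectivity: all power lies in two opposite quadrants of the
# transform; the mixed block (w > 0, Omega < 0 in FFT ordering) is empty.
H = np.fft.fft2(h_DS)
mixed = np.abs(H[1:n // 2, n // 2 + 1:]) ** 2
assert mixed.sum() < 1e-12 * (np.abs(H) ** 2).sum()
```

Equivalently, h_DS is twice the real part of the outer product of the two analytic signals, which is what confines its spectrum to opposite quadrants.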
This is the case much discussed in the vision literature when the STRF is interpreted as the visual STRF: both temporal and spectral functions are added in quadrature (Adelson & Bergen, 1985; Barlow & Levick, 1965; Borst & Egelhaaf, 1989; Chance, Nelson, & Abbott, 1998; De Valois et al., 2000; Emerson & Gerstein, 1977; Heeger, 1993; Maex & Orban, 1996; McLean & Palmer, 1994; Smith, Snowden, & Milne, 1994; Suarez, Koch, & Douglas, 1995; Watson & Ahumada, 1985). The result is a purely directionally selective response field, and again is rank 2. These three symmetries reduce the rank of a quadrant-separable spectrotemporal modulation transfer function from 4 to 2. In the appendix, it is proven that these are the only symmetries that can reduce the rank to 2, and that a quadrant-separable STRF can never be of rank 3 (the rank 1 case is fully separable). We are not aware of any physiological system that possesses a generic quadrant-separable STRF, that is, of rank 4.

2.12 Counterexamples. After the preceding examples, one might be tempted to believe that all rank 2 STRFs are also quadrant separable, but this can be shown false with a simple counterexample. The form of equation 2.28 demonstrates how difficult it is to make a quadrant-separable transfer function by combining fully separable inputs: half of the terms are required to be very specific functionals of the other half. Any departure leads to total inseparability. For example, an STRF that
is the linear sum of two separable STRFs,

h_R2(t, x) = h_A^FS(t, x) + h_B^FS(t, x),
(2.32)
has a spectrotemporal modulation transfer function that is the linear sum of two separable spectrotemporal modulation transfer functions:

H_R2(w, Ω) = H_A^FS(w, Ω) + H_B^FS(w, Ω).
(2.33)
This is not, in general, the form of a quadrant-separable spectrotemporal modulation transfer function. As an example, the sum of two (almost) identical, fully separable STRFs whose only difference is that one is translated spectrally and temporally with respect to the other gives an STRF that is strongly velocity selective and is not quadrant separable. This is demonstrated in Figure 6b, where the MTF_ST clearly is not quadrant separable. This proves that the STRF in Figure 6a cannot be temporally symmetric. It is also simple to show an example of an STRF that is the sum of two temporally symmetric STRFs but itself is not temporally symmetric. As an example, the sum of two (almost) identical, temporally symmetric STRFs whose only difference is that one is translated spectrally and temporally with respect to the other gives an STRF that has rank 4, not the rank of 2 required by temporal symmetry. This is demonstrated in Figures 6c and 6d.

2.13 Surgery and Animal Preparation. Data were collected from 11 domestic ferrets (Mustela putorius) supplied by Marshall Farms (Rochester, NY). Eight of these ferrets were anesthetized during recording, and full details of the surgical procedures are provided in Shamma, Fleshman, Wiser, and Versnel (1993). These ferrets were anesthetized with sodium pentobarbital (40 mg/kg) and maintained under deep anesthesia during the surgery. Once the recording session started, a combination of ketamine (8 mg/kg/hr), xylazine (1.6 mg/kg/hr), atropine (10 µg/kg/hr), and dexamethasone (40 µg/kg/hr) was given throughout the experiment by continuous intravenous infusion, together with dextrose, 5% in Ringer solution, at a rate of 1 cc/kg/hr, to maintain metabolic stability. The ectosylvian gyrus, which includes the primary auditory cortex, was exposed by craniotomy, and the dura was reflected.
The contralateral ear canal was exposed and partly resected, and a cone-shaped speculum containing a miniature speaker (Sony MDR-E464) was sutured to the meatal stump. The remaining three ferrets were used for awake recordings, with full surgical procedural details in Fritz et al. (2003). In these experiments, ferrets were habituated to lie calmly in a restraining tube for periods of up to 4 to 6 hours. A head-post was surgically implanted on the ferret’s skull (anesthetized with sodium pentobarbital, 40 mg/kg, and maintained under deep
anesthesia during the surgery) and used to hold the animal's head in a stable position during the daily neurophysiological recording sessions. All experimental procedures were approved by the University of Maryland Animal Care and Use Committee and were in accord with NIH Guidelines.

2.14 Recordings, Spike Sorting, and Selection Criteria. Action potentials from single units were recorded using tungsten microelectrodes with 5 to 7 MΩ tip impedances at 1 kHz. In each animal, electrode penetrations were made orthogonal to the cortical surface. In each penetration, cells were typically isolated at depths of 350 to 600 µm corresponding to cortical layers III and IV (Shamma et al., 1993). In four anesthetized animals, neural signals were fed through a window discriminator, and the time of spike occurrence relative to stimulus delivery was stored using a computer. In the other seven animals, the original neural electrical signals were stored for further processing off-line. Using Matlab software designed in-house, action potentials were then manually classified as belonging to one or more single units, and the spike times for each unit were recorded. The action potentials assigned to a single class met the following criteria: (1) the peaks of the spike waveforms exceeded four times the standard deviation of the entire recording; (2) each spike waveform was less than 2 ms in duration and consisted of a clear, positive deflection followed immediately by a negative deflection; (3) the spike waveform classes were not visibly different from each other in amplitude, shape, or time course; (4) the histogram of interspike intervals evidenced a minimum time between spikes (refractory period) of at least 1 ms; and (5) the spike activity persisted throughout the recording session. This procedure occasionally produced units with very low spike counts.
After consulting the distribution of spike counts for all units, units that fired less than half a spike per second were excluded from further analysis since a neuron with such a low spike rate requires longer stimulus durations to analyze. 2.15 Stimuli and STRF Measurement. The stimuli used were temporally orthogonal ripple combinations (TORCs), as described by Klein et al. (2006). TORCs are more complex than individually presented dynamic ripples, which are instances of bandpassed noise whose spectral and temporal envelopes are cosinusoidal and can be thought of as auditory analogs of drifting contrast gratings used in vision studies (Shamma & Versnel, 1995; Shamma et al., 1995). The spectrotemporal envelope of a TORC is composed of sums of the spectrotemporal envelopes of temporally orthogonal dynamic ripples. Temporally orthogonal means that no two ripple components of a given stimulus share the same temporal modulation rate (their temporal correlation is zero); therefore, each component evokes a different frequency in the linear portion of the response. Each TORC spectrotemporal envelope is composed from six dynamic ripples having the same spectral density (in cyc/oct) but different w spanning the range of 4 to 24 Hz. In the
reverse-correlation operation, the 4 Hz response component is orthogonal to all stimulus components besides the 4 Hz ripple, the 8 Hz component is correlated only with the 8 Hz ripple, and so on. Fifteen distinct TORC envelopes are presented, with spectral density ranging from −1.4 cycles per octave to +1.4 cycles per octave in steps of 0.2 cycles per octave. Each of those 15 TORCs is then presented again but with the reverse polarity of its spectrotemporal envelope (the inverse-repeat method) to remove systematic errors due to even-order nonlinearities (Klein et al., 2006; Moller, 1977; Wickesberg & Geisler, 1984). Multiple sweeps were presented for each stimulus. Sweeps of different stimuli, separated by 3 to 6 s of silence, were presented in a pseudorandom order, until a neuron was exposed to between 55 and 110 periods (13.75–27.5 s) of each stimulus. All stimuli had an 8 ms rise and fall time. All stimuli were gated and fed through an equalizer into the earphone. Calibration of the sound delivery system (to obtain a flat frequency response up to 20 kHz) was performed in situ with the use of a 1/8 in Brüel & Kjær 4170 probe microphone. In the anesthetized case, the microphone was inserted into the ear canal through the wall of the speculum to within 5 mm of the tympanic membrane; the speculum and microphone setup closely resembles that suggested by Evans (1979). In the awake case, the stimuli were delivered through inserted earphones that were calibrated in situ at the beginning of each experiment. STRFs were measured by reverse correlation, that is, spike-triggered averaging (Klein, Depireux, Simon, & Shamma, 2000; Klein et al., 2006). Only the sustained portions of the responses were analyzed; the first 250 ms of the poststimulus onset response was not used. To ensure reliable estimates, neurons with STRFs whose estimated signal-to-noise ratio was worse than 2 were excluded (Klein et al., 2006).
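Reverse correlation itself is simple to sketch. The toy example below is entirely synthetic (Gaussian white noise stands in for TORCs, spiking is Bernoulli, and all names are ours): a known spectrotemporal filter is recovered, up to scale, by spike-triggered averaging of the stimulus envelope:

```python
import numpy as np

rng = np.random.default_rng(3)
T, X, L = 50_000, 12, 20                  # time bins, channels, STRF lags
stim = rng.standard_normal((T, X))        # white-noise stand-in for TORCs

# Ground-truth filter: a delayed, frequency-localized kernel.
lags = np.arange(L)
true_strf = np.outer(np.exp(-lags / 4.0) * np.sin(2 * np.pi * lags / 10.0),
                     np.exp(-0.5 * ((np.arange(X) - 5) / 1.5) ** 2))

# Linear drive, rectified into a firing probability, then Bernoulli spikes.
drive = np.zeros(T)
for k in range(L):
    drive[k:] += stim[:T - k] @ true_strf[k]
rate = np.clip(0.05 + 0.1 * drive, 0.0, 1.0)
spikes = rng.random(T) < rate

# Spike-triggered average: for a white stimulus, this is proportional to
# the linear component of the STRF (reverse correlation).
sta = np.zeros_like(true_strf)
idx = np.nonzero(spikes)[0]
idx = idx[idx >= L]
for k in range(L):
    sta[k] = stim[idx - k].mean(axis=0)

r = np.corrcoef(sta.ravel(), true_strf.ravel())[0, 1]
assert r > 0.5   # the STA recovers the shape of the underlying filter
```

The TORC design replaces this brute-force averaging with orthogonal ripple components, but the estimated quantity is the same linear kernel.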
3 Results

3.1 Temporal Properties of the STRF. In an arbitrary STRF, the spectral and temporal dimensions of the response are not necessarily related in any way. For example, the temporal cross-sections (impulse responses) at different frequencies (x) need not be systematically related to each other in any specific manner. However, Figure 7 illustrates an unanticipated result that we found to be prevalent in our data: all temporal cross-sections of a given STRF are related to each other by a simple scaling and rotation of the same time function. For example, if we designate the impulse response at the best frequency (1.2 kHz) to be the function f(t) = f_{θ=0}(t) = f_0(t), then the cross-section at 0.72 kHz is approximately its scaled inverse (≈ −0.50 f(t) = 0.50 f_π(t)); at 0.92 kHz, it is the scaled and lagged version (≈ 0.33 f_{π/2}(t)). A schematic depiction of this STRF response property is shown in Figure 7. This property was defined above, in equation 2.18, to be temporal symmetry. In the following section, we quantify and demonstrate the existence of this response property in almost all of our neurons.
[Figure 7, panels a–c: measured spectrotemporal response fields and fixed-frequency cross-sections. Axes: Time (ms), 0 to 250; Frequency (kHz), 0.125 to 4 (a) and 0.5 to 16 (c); spike-rate color scale, −2000 to 2000.]
In general, a pure scaling (i.e., no rotation, θ = 0) among the temporal cross-sections of the STRF is expected if the STRF is fully separable, that is, if it can be decomposed into one product of a temporal and a spectral function: h_FS(t, x) = f(t)g(x). However, only 50% of our cells can be considered fully separable in the awake population and 67% in the anesthetized population; the remainder are not fully separable. Consequently, this highly constraining and ubiquitous relationship involving a scaling and a rotation, described mathematically by equation 2.18, must imply another basic characteristic of cell responses, as we discuss next.

3.1.1 Rank and Temporal Symmetry in AI. To examine spectrotemporal interactions, we applied SVD analysis to all STRFs derived from AI neurons in our experiments (see Klein et al., 2006, for the method), with results in Table 1. In the awake recordings, the STRF rank was found to be best approximated as rank 1 or 2 for 98% of the neurons and, in the anesthetized case, 97%. That is, the STRF is of the form of either equation 2.3 or equation 2.4. In general, an STRF need not be of such a low rank at all. For example, Figure 1b shows an otherwise plausible simulated high-rank STRF. Does low rank reflect a simplicity to the underlying neural circuitry? In the awake recordings, STRFs of rank 1 constituted 50% of all neurons; in the anesthetized recordings, STRFs of rank 1 constituted 67% of all neurons. These STRFs are fully separable, and hence all temporal cross-sections of a given STRF are automatically related by a simple scaling. For rank 2 STRFs (awake 48%, anesthetized 30%), however, there is no such relationship. In fact, for a rank 2 STRF, expressed as equation 2.4, there is no mathematical need for any particular relationship between its temporal cross-sections, but physiological evidence provides some.
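The rank assignment can be illustrated with a crude SVD-based sketch (our construction and threshold; Klein et al., 2006, describe the actual noise-aware criterion used on the data). Fully separable, temporally symmetric, and summed temporally symmetric STRFs come out as rank 1, 2, and 4, respectively:

```python
import numpy as np

def hilbert_im(v):
    """Discrete circular Hilbert transform (imag part of the analytic signal)."""
    n = v.size
    m = np.zeros(n)
    m[0] = 1.0
    if n % 2 == 0:
        m[n // 2] = 1.0
        m[1:n // 2] = 2.0
    else:
        m[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(v) * m).imag

def effective_rank(h, tol=1e-6):
    """Number of singular values above tol * largest (a crude stand-in for
    the noise-aware criterion of Klein et al., 2006)."""
    s = np.linalg.svd(h, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rng = np.random.default_rng(4)
f, gA, gB = (rng.standard_normal(48) for _ in range(3))

h1 = np.outer(f, gA)                                 # fully separable
h2 = np.outer(f, gA) + np.outer(hilbert_im(f), gB)   # temporally symmetric
h4 = (h2 + np.outer(np.roll(f, 9), np.roll(gA, 3))   # plus a translated copy
         + np.outer(hilbert_im(np.roll(f, 9)), np.roll(gB, 3)))

assert effective_rank(h1) == 1
assert effective_rank(h2) == 2
assert effective_rank(h4) == 4   # sum of two TS STRFs: rank 4 (cf. Figure 6d)
```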
The experimental results for neurons with STRF of rank 2 in Figure 8a highlight a strong relationship between the temporal functions f_A(t) and f_B(t) isolated by the SVD analysis in equation 2.4, and compared via the
Figure 7: Temporal symmetry demonstrated in the experimentally measured STRFs of two example neurons. (a) A rank 2 (not fully separable) STRF and five fixed-frequency cross-sections, giving five temporal impulse responses. (b) The same five impulse responses but Hilbert-rotated and rescaled to have equal power to that of the impulse response at the best frequency (left panel) and the same Hilbert-rotated and rescaled impulse responses superimposed (right panel). The temporal symmetry index for this neuron is η_t = 0.90. (c) Another rank 2 STRF, and five cross-sections (left panel), and the corresponding five impulse responses, Hilbert-rotated, rescaled to have equal power, and superimposed. This temporal symmetry index is η_t = 0.73, illustrating that the temporal symmetry index need not be overly close to unity to demonstrate temporal symmetry (single units/awake: D2-4-03-p-c.a1-3, R2-6-03-p-2-a.a1-2).
Table 1: Population Distributions of STRFs Recorded from All Neurons, According to Animal State (Awake/Anesthetized), STRF Rank, and Temporal Symmetry.

                                                  Awake              Anesthetized
                                             Number   Percent     Number   Percent
  Rank 1                                        72       50          49       67
  Rank 2                                        70       48          22       30
    Temporally symmetric                        51                   21
    Nontemporally symmetric                     19                    1
  Rank 3                                         3        2           2        3
  Total                                        145      100          73      100
  Rank 1 + rank 2 temporally symmetric         123       85          70       96
  Rank 2 nontemporally symmetric + rank 3       22       15           3        4
  Total                                        145      100          73      100
temporal symmetry index defined in equation 2.21. A temporal symmetry index near 1 means that f_A(t) and f_B(t) are not arbitrary, but instead are closely related by a Hilbert transform. Comparing equation 2.4 and the last line of equation 2.18, we see that we must have

f_B(t) = f̂_A(t).
(3.1)
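Equation 3.1 can be checked with a numerical symmetry index. In the sketch below (our normalization choice, patterned after the index of equation 2.24), the index is essentially 1 when f_B is the Hilbert transform of f_A and small for an unrelated f_B:

```python
import numpy as np

def hilbert_im(v):
    """Discrete circular Hilbert transform (imag part of the analytic signal)."""
    n = v.size
    m = np.zeros(n)
    m[0] = 1.0
    if n % 2 == 0:
        m[n // 2] = 1.0
        m[1:n // 2] = 2.0
    else:
        m[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(v) * m).imag

def symmetry_index(fa, fb):
    """|<a, b>| / (|a| |b|) for analytic signals a = fa + j*hilbert(fa), etc.

    A normalized stand-in for equations 2.21/2.24: 1 when fb is a Hilbert
    rotation of fa, near 0 for an unrelated fb."""
    a = fa + 1j * hilbert_im(fa)
    b = fb + 1j * hilbert_im(fb)
    return np.abs(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(5)
t = np.arange(256)
fa = np.exp(-0.5 * ((t - 100) / 25.0) ** 2) * np.cos(2 * np.pi * 12 * t / 256)

assert symmetry_index(fa, hilbert_im(fa)) > 0.99           # f_B = fhat_A
assert symmetry_index(fa, rng.standard_normal(256)) < 0.5  # generic f_B
```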
Temporal symmetry in an STRF is an extremely restrictive property. SVD guarantees that f_B(t) is orthogonal to f_A(t), but does not restrict its frequency content in any substantive way (Stewart, 1990, 1991, 1993). f_B(t) = f̂_A(t) is special, since f̂_A(t) is the only function orthogonal to f_A(t) that has the same frequency content as f_A(t). In this sense, there is only one time function in the STRF, the one characterized by f_A(t). This is not the case in the spectral dimension, where there is no single special spectral function picked out: g_A(x) and g_B(x) are mathematically and physiologically unconstrained: the population distribution for the analogous spectral symmetry index in Figure 8b shows no such close relationship. There is no evidence for spectral symmetry. Note also that fully separable (rank 1) neurons are not included in the population shown in Figure 8, since they are automatically temporally symmetric and spectrally symmetric. Alternatively, we can begin with the STRF in the form of equation 2.17 and explain why the temporal cross-sections in our STRFs exhibit the specific relationship depicted in Figure 7. The impulse response at any x can be thought of as a linear combination of f_A(t) and f̂_A(t), which by equation 2.18 always gives a scaled version of a Hilbert-rotated f_A(t). Another test for the presence of temporal symmetry arises from comparing STRFs approximated by two different means: the first two terms of the singular value expansion versus the first term of the quadrant-separable
[Figure 8, panels a and b: histograms of the temporal symmetry index (a) and the spectral symmetry index (b); index values 0 to 1, counts 0 to 25.]
Figure 8: The population distributions of the symmetry indices for all neurons with rank 2 (not fully separable) STRFs: awake N = 70 (black), and anesthetized N = 22 (gray). (a) The population distributions of the temporal symmetry index. The populations are heavily biased toward high temporal symmetry; cf. the temporal symmetry index of the STRFs in Figure 7, which ranges from 0.73 to 0.90. In the awake population, 51 neurons had a temporal symmetry index greater than 0.65, as did 20 in the anesthetized population. (b) For comparison, the same statistics but for spectral symmetry. The population is spread over the full range of values, despite potential tonotopic arguments for a narrow distribution near 1.
expansion in singular values (see Klein et al., 2006, for details). The former (rank 2 truncation) is of rank 2 by construction. The latter (quadrant-separable truncation), since it is quadrant separable by construction, should be of rank 4 unless the STRF has a symmetry. Since there is no mathematical reason they should give the same result, the near-unity correlation coefficient between the two shown in Figure 9a is experimental evidence that they are identical up to measurement error: the quadrant-separable STRF is actually of rank 2 and therefore possesses a symmetry. The only quadrant-separable STRFs with rank less than 4 must be temporally symmetric, spectrally symmetric, or directionally selective. The results shown in Figure 8b rule out spectral symmetry, and the analogous analysis for the index of directional selectivity (not shown) rules out directional selectivity. Therefore, the STRFs' symmetry must be temporal symmetry. For comparison, Figure 9b shows two other distributions. On the right is the same distribution as above, but for rank 4 truncations instead of rank 2. The correlations decrease, indicating that rank 2 (and hence temporal symmetry) is a better estimate than rank 4 (generic quadrant-separable STRFs are of rank 4, and only symmetric STRFs are of rank 2). The change must be small, since SVD orders contributions to the rank by decreasing power, but it need not have been negative. In the center of Figure 9b is the distribution of the same quantity as in Figure 9a, but with the STRFs permuted: the rank 2 truncation of each STRF is correlated with the quadrant-separable
[Figure 9, panels a and b: correlation coefficient distributions (correlation −1 to 1, counts 0 to 40). (a) Quadrant-separable truncation versus rank 2 truncation. (b) Quadrant-separable truncation versus rank 4 truncation, and permuted truncations (scaled).]
Figure 9: (a) The population distributions of the correlations for all rank 2 neurons between their rank 2 and quadrant-separable estimates: awake (black) and anesthetized (gray). The high correlations indicate that the quadrant-separable truncation may be of rank 2, evidence of temporal symmetry. (b) For comparison, the distributions of the comparable correlations. (Right) The population distributions of the correlations for all neurons between their rank 4 and quadrant-separable estimates: awake (black), and anesthetized (gray). The correlations become worse for both populations, despite the fact that generic quadrant-separable STRFs are of rank 4, providing more evidence that the quadrant-separable STRFs are of rank 2, implying temporal symmetry. (Center) The populations of permuted STRFs: the rank 2 estimate of each STRF is correlated with the quadrant-separable estimate of every other STRF: awake (dark gray hash) and anesthetized (light gray hash). The population is scaled by the ratio of the number of STRFs to the number of permuted STRF pairs.
truncation of every other STRF. This distribution is broadly peaked around 0 (with the population scale normalized to be the same as in Figure 9a). This demonstrates that the skewness of the population toward unity in Figure 9a is not due to potentially confounding factors, such as the STRFs' dominant power in the first ∼100 ms, or having rank 2, or being quadrant separable. Thus, the results shown in Figures 8 and 9 provide evidence that all of these rank 2 neurons are indeed temporally symmetric. However, temporal symmetry is highly restrictive. There are many ways to obtain a rank 2 STRF, and only a very small subset is temporally symmetric, yet almost all AI rank 2 STRFs are. This finding must be a consequence of a fundamental anatomical and physiological constraint on the way AI units are driven by
the spectrotemporally dynamic stimuli. In the remainder of this article, we demonstrate the simplest possible explanations that can give rise to these observed temporal symmetry properties.

3.2 Implications and Interpretation of Temporal Symmetry for Neural Connectivity and Thalamic Inputs to AI. Neurons in AI receive thalamic inputs from ventral MGB, via layers III and IV (see Read, Winer, & Schreiner, 2002, for a recent review; Smith & Populin, 2001). In this section, we analyze the effects of the constraints of temporal symmetry on the temporal and spectral components of the thalamic inputs to an AI neuron. We are not analyzing the cells with fully separable STRFs directly, but it will be shown below that their analysis proceeds almost identically. Note again that the linear equations used in this analysis do not assume that all processing of inputs is linear; rather, they assume that the linear component of the processing is strong and robust, alongside any nonlinear components of the processing. The analysis below shows that there are physiologically reasonable models consistent with temporally symmetric neurons in and throughout AI. The models presented here require two features: that the STRFs of the thalamic inputs be fully separable and that some of the thalamic inputs be lagged (phase shifted), whether at the output of the thalamic neurons themselves or at their synapses onto AI neurons. Ventral MGB neurons possess STRFs consistent with being fully separable (Miller et al., 2002; Miller, Escabi, & Schreiner, 2001; L. Miller, personal communication, October 1999), but there has been no systematic study. Lagged neurons have not been reported in ventral MGB, though they exist in visual thalamus (Saul & Humphrey, 1990), and other lagging mechanisms that may be present in the auditory system are discussed below.
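The two features just listed suffice in a direct numerical sense: summing two fully separable inputs whose temporal functions are in quadrature produces a temporally symmetric, quadrant-separable STRF. A sketch (our construction, with arbitrary spectral weights and a discrete circular Hilbert transform standing in for the lag):

```python
import numpy as np

def hilbert_im(v):
    """Discrete circular Hilbert transform (imag part of the analytic signal)."""
    n = v.size
    m = np.zeros(n)
    m[0] = 1.0
    if n % 2 == 0:
        m[n // 2] = 1.0
        m[1:n // 2] = 2.0
    else:
        m[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(v) * m).imag

n = 64
t = np.arange(n)
fA = np.exp(-0.5 * ((t - 18) / 5.0) ** 2) * np.cos(2 * np.pi * 7 * t / n)
rng = np.random.default_rng(6)
gA, gB = rng.standard_normal(n), rng.standard_normal(n)  # unconstrained spectra

# Two fully separable inputs, the second in temporal quadrature (lagged 90 deg):
h_TS = np.outer(fA, gA) + np.outer(hilbert_im(fA), gB)

s = np.linalg.svd(h_TS, compute_uv=False)
assert s[2] < 1e-8 * s[0]                  # rank 2: the temporally symmetric form

# ... and quadrant separable: each quadrant block of the transform is rank 1.
H = np.fft.fft2(h_TS)
q1 = np.linalg.svd(H[1:n // 2, 1:n // 2], compute_uv=False)
q2 = np.linalg.svd(H[1:n // 2, n // 2 + 1:], compute_uv=False)
assert q1[1] < 1e-6 * q1[0]
assert q2[1] < 1e-6 * q2[0]
```

Within each quadrant, the Hilbert relation collapses the two temporal spectra onto a common row profile, which is why every quadrant block is a single column–row product.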
Alternatively, it is very difficult to construct physiologically reasonable models consistent with temporally symmetric neurons in AI without these two features. We know of no such models, and we were not able to construct any. From this, we are forced to predict that these two features will be found. Independently of whether this occurs, however, some explanation is still needed for the temporal symmetry displayed strongly in AI, and in this section, we provide a reasonable basis.

3.2.1 Simplistic Model of Thalamic Inputs. The last line of equation 2.18,

h_TS(t, x) = f_A(t)g_A(x) + f̂_A(t)g_B(x), (3.2)

explicitly demonstrates that the temporally symmetric STRF is rank 2. A simplistic interpretation of equation 3.2 is that a cell with a temporally symmetric STRF has two fully separable inputs, for example, two cells in ventral MGB. Each of those two input cells has the same temporal
J. Simon, D. Depireux, D. Klein, J. Fritz, and S. Shamma
Figure 10: Schematics depicting simple models represented by equations 3.2 to 3.11. (a) Summing the inputs of two fully separable (FS) cells whose temporal functions are in quadrature (θ = 0, θ = π/2) results in a temporally symmetric (TS) cell (see equation 3.2). (b) Summing the inputs of two fully separable (FS) cells whose temporal functions are in partial quadrature (θ = 0, θ ≠ 0) still results in a temporally symmetric (TS) cell (see equation 3.3). (c) Summing the inputs of many fully separable (FS) cells whose temporal functions are in partial quadrature still results in a temporally symmetric (TS) cell (see equation 3.6). (d) Summing the inputs of many fully separable (FS) cells whose temporal functions are in partial quadrature, with differing impulse responses (at high frequencies), input into a neuron whose somatic impulse response is slow, results in a temporally symmetric (TS) cell (see equation 3.12). (Inset) A spectral schematic of the low-pass nature of the slow somatic impulse response: K_A(f) is the Fourier transform of k_A(t).
processing behavior except that one, whose temporal processing is characterized by \hat{f}_A(t), is the Hilbert transform of (lagged with respect to) the other, whose temporal processing is characterized by f_A(t). Mathematically, \hat{f}(t) is in quadrature with f(t). The two inputs may have different spectral distributions of their own inputs or different synaptic weights as inputs to the cortical cell: g_A(x) ≠ g_B(x). This is shown schematically in Figure 10a. (It is
also consistent with this model that g_A(x) could equal g_B(x), but such a case reduces to the simple instance of a fully separable STRF.) This interpretation is mathematically concise and explicitly demonstrates that the temporally symmetric STRF is rank 2. This is the maximally reduced form, and one may proceed to expand it to demonstrate other forms that its inputs are permitted to take, for example, allowing many inputs from thalamic or even intracortical connections. The goal of this section is to relax the strict decomposition implied by equation 3.2 (and imposed arbitrarily by SVD), to rewrite it in a form that allows for physiologically reasonable inputs and physiologically reasonable somatic processing, and to analyze the restrictions that are imposed on those inputs. The more realistic decompositions below will allow us to reasonably model the thalamic inputs and the somatic processing by identifying them with terms in the decomposition, in a substantially more realistic interpretation than the simple interpretations presented above. The more realistic decompositions have an additional role as well: they contrast with models or decompositions that, while appearing physiologically reasonable otherwise, conflict with the data and so can now be ruled out.

3.2.2 Generalized Simplistic Model of Thalamic Inputs. First, the severity of the full Hilbert transform can be relaxed. The model decomposition can use a partial Hilbert transform f^θ(t) (see equation 2.12) instead of the full Hilbert transform. In equation 3.3, the first term of the decomposition has some temporal impulse response f_A(t) and an arbitrary spectral response field g_C(x). The second term has an impulse response f_A^θ(t) that is some Hilbert rotation of the first impulse response and an arbitrary spectral response field g_D(x),

h^{TS}(t, x) = f_A(t) g_C(x) + f_A^θ(t) g_D(x),   (3.3)
which is equivalent to the previous decomposition in equation 3.2:

h(t, x) = f_A(t) g_C(x) + f_A^θ(t) g_D(x)
= f_A(t) g_C(x) + (\cos θ f_A(t) + \sin θ \hat{f}_A(t)) g_D(x)
= f_A(t) \underbrace{(g_C(x) + \cos θ g_D(x))}_{g_A(x)} + \hat{f}_A(t) \underbrace{\sin θ g_D(x)}_{g_B(x)}
= f_A(t) g_A(x) + \hat{f}_A(t) g_B(x) = h^{TS}(t, x),   (3.4)
for any θ different from zero. This is physiologically more relevant since a partial (θ < 90 degrees) Hilbert transform may be a simpler operation for a neural process to perform than a full transform. The physiological
interpretation of the first line of equation 3.4 is that the two temporal functions of the independent input classes need not be related by a full Hilbert transform (fully lagged cell); it is sufficient that one be a Hilbert rotation of the other (partially lagged cell). This is shown schematically in Figure 10b. It should be noted that exact Hilbert rotations for any θ ≠ 0 are ruled out by causality: it can be shown (e.g., Papoulis, 1987) that the Hilbert transform of a causal filter is acausal. Therefore, we would never expect an exact Hilbert transform (or exact temporal quadrature), only an approximate Hilbert transform. This is exemplified in the visual system, where the Hilbert transform behavior of lagged cells is only a good approximation in the frequency band 1 to 16 Hz (Saul & Humphrey, 1990). In the auditory system we expect similar behavior: the (partial or full) Hilbert rotation will be performed accurately only in the relevant frequency band. To restate, AI models that attempt a decomposition of the form

h(t, x) = f_A(t) g_C(x) + f_B(t) g_D(x),   (3.5)

where f_B(t) ≠ f_A^θ(t) for any θ, are ruled out. This is because they violate the temporal symmetry property found in our STRFs. An example of a rank 2 STRF that is of the form of equation 3.5 but with f_B(t) ≠ f_A^θ(t) was shown in Figure 6a. It is not temporally symmetric and is therefore ruled out as a model.

3.2.3 Multiple Input Model. Continuing to add more physiological realism to the decomposition, we can allow each of the two pathways to be made up of multiple inputs as long as their temporal structure is related, shown schematically in Figure 10c. In equation 3.6, the first sum in the decomposition is made up of a number of individual input components (m = 1, ..., M) that all have the same temporal impulse response f_A(t) but may have individual spectral response fields g_{C_m}(x). The second sum is made up of a number of individual input components (n = 1, ..., N) that each may have different temporal impulse responses f_A^{θ_n}(t), but all are given by some individual Hilbert rotation (θ_n) of the initial impulse response, and individual spectral response fields g_{D_n}(x):
h^{TS}(t, x) = \sum_{m=1}^{M} f_A(t) g_{C_m}(x) + \sum_{n=1}^{N} f_A^{θ_n}(t) g_{D_n}(x).   (3.6)
This decomposition is leading toward one based on the many neural inputs (i.e., N + M, which may be a very large number), each of which may have different spectral response fields and different Hilbert rotations, but must be related in their temporal structure. This is equivalent to the previous
decomposition or to that in equation 3.2:

h(t, x) = \sum_{m=1}^{M} f_A(t) g_{C_m}(x) + \sum_{n=1}^{N} f_A^{θ_n}(t) g_{D_n}(x)
= f_A(t) \sum_{m=1}^{M} g_{C_m}(x) + f_A(t) \sum_{n=1}^{N} \cos θ_n g_{D_n}(x) + \hat{f}_A(t) \sum_{n=1}^{N} \sin θ_n g_{D_n}(x)
= f_A(t) \underbrace{\left( \sum_{m=1}^{M} g_{C_m}(x) + \sum_{n=1}^{N} \cos θ_n g_{D_n}(x) \right)}_{g_A(x)} + \hat{f}_A(t) \underbrace{\sum_{n=1}^{N} \sin θ_n g_{D_n}(x)}_{g_B(x)}
= f_A(t) g_A(x) + \hat{f}_A(t) g_B(x) = h^{TS}(t, x).   (3.7)
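As an illustrative numerical check of equations 3.6 and 3.7 (a NumPy sketch, not part of the original analysis; all signal shapes, sizes, and random seeds here are arbitrary choices), one can build M unlagged and N partially lagged separable components and confirm that their sum collapses to a rank 2, temporally symmetric STRF:

```python
import numpy as np

def hilbert_transform(x):
    """Discrete Hilbert transform via the FFT (imaginary part of the analytic signal)."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[n // 2] = 1.0      # even-length signals assumed
    h[1:n // 2] = 2.0
    return np.imag(np.fft.ifft(np.fft.fft(x) * h))

def hilbert_rotate(f, theta):
    """Partial Hilbert rotation f_A^theta = cos(theta) f_A + sin(theta) \\hat{f}_A."""
    return np.cos(theta) * f + np.sin(theta) * hilbert_transform(f)

rng = np.random.default_rng(0)
nt, nx, M, N = 256, 32, 4, 5
t = np.arange(nt)

# A band-limited temporal kernel f_A(t) (arbitrary illustrative choice)
f_A = np.sin(2 * np.pi * 5 * t / nt) * np.exp(-((t - nt / 2) ** 2) / (2 * 30.0 ** 2))

# M unlagged inputs with arbitrary spectral fields g_Cm, plus N lagged inputs
# with individual rotations theta_n and arbitrary g_Dn, as in equation 3.6
h_strf = np.zeros((nt, nx))
for _ in range(M):
    h_strf += np.outer(f_A, rng.standard_normal(nx))
for theta in rng.uniform(0.2, 1.4, N):
    h_strf += np.outer(hilbert_rotate(f_A, theta), rng.standard_normal(nx))

# Equation 3.7: the sum is f_A g_A + \hat{f}_A g_B, hence rank 2
rank = np.linalg.matrix_rank(h_strf, tol=1e-8 * np.linalg.norm(h_strf))
print(rank)  # 2
```

The rank would rise above 2 as soon as any lagged component used a temporal kernel that is not a Hilbert rotation of f_A, which is the content of the restriction stated around equation 3.8.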
The physiological interpretation of equation 3.6 is that the cortical cell may receive inputs from many thalamic cells, not just two, and those many cells need not be identical to each other. Nor do we require that the spectral response fields be related in any way, though they may be (they will likely have similar best frequencies due to the tonotopic organization of the auditory system, but they may have different spectral response field shapes). The inputs have been broken into two groups corresponding to the two terms of equation 3.3. The first input group consists of M ordinary unlagged inputs. The second input group consists of N lagged inputs. The phase lags may all be the same, or they may be different. To restate, we conclude that models that attempt a decomposition of the form

h^{TS}(t, x) = \sum_{m=1}^{M} f_A(t) g_{C_m}(x) + \sum_{n=1}^{N} f_{B_n}(t) g_{D_n}(x),   (3.8)

where f_{B_n}(t) ≠ f_A^{θ_n}(t) for any θ_n, are ruled out. Such models implicitly violate the temporal symmetry property.

3.2.4 Multiple Fast Input and Slow Output Model. There is another physiologically motivated relaxation we can allow in our progressive
decomposition, to allow the thalamic temporal inputs not to be so carefully matched, as long as the differences do not enter into the cortical computation (e.g., at high frequencies). For instance, we note that the temporal processing of the cell possessing this STRF is well characterized as the convolution of its faster presomatic (multiple input) stage and its slower somatic (aggregate processing) stage. We can thus separate the entire temporal component of the cortical processing, f(t), which is a temporal impulse response, into the convolution of impulse responses from a presomatic stage, k(t), and a somatic stage, k_A(t):

f(t) = k(t) * k_A(t).   (3.9)
The presomatic stage may contain high frequencies (e.g., the output of thalamic ventral MGB neurons) that will not be passed through the slower soma. In filter terminology, the somatic stage is a much lower-frequency bandpass filter than the presomatic stage (the soma acts as an integrator, which is a form of low-pass filtering). The convolution operation is used since the filters (characterized by impulse responses) are in series. The reason for making this separation is that we can allow each input component k_i(t) to be different as long as the difference is zero when passed through the somatic component (Yeshurun, Wollberg, Dyn, & Allon, 1985),

(k_i(t) − k_j(t)) * k_A(t) = 0,   for all i, j,   (3.10)

k_i(t) * k_A(t) = k_j(t) * k_A(t) = f_A(t),   for all i, j,   (3.11)
and still obtain a temporally symmetric STRF. Physiologically, we are allowing the thalamic inputs to have greatly differing temporal impulse responses, so long as all the differences appear at timescales significantly faster than the somatic processing timescale (Creutzfeldt, Hellweg, & Schreiner, 1980; Rouiller, de Ribaupierre, Toros-Morel, & de Ribaupierre, 1981):

h^{TS}(t, x) = \left( \sum_{m=1}^{M} k_{A_m}(t) g_{C_m}(x) + \sum_{n=1}^{N} k_{A_n}^{θ_n}(t) g_{D_n}(x) \right) * k_A(t).   (3.12)
Here the decomposition is broken up into a presomatic input stage and a somatic processing stage, shown schematically in Figure 10d. The somatic processing stage is the final convolution with the somatic (low-pass) temporal impulse response k A(t). The input stage looks very similar to the previous decomposition, except that now all the individual input components may have temporal responses that are different from each other as long as the differences are all for high frequencies (fast timescales), and they remain identical for low frequencies (slow timescales). This is equivalent to
the previous decomposition or to that in equation 3.2:

h(t, x) = \left( \sum_{m=1}^{M} k_{A_m}(t) g_{C_m}(x) + \sum_{n=1}^{N} k_{A_n}^{θ_n}(t) g_{D_n}(x) \right) * k_A(t)
= \sum_{m=1}^{M} (k_{A_m}(t) * k_A(t)) g_{C_m}(x) + \sum_{n=1}^{N} (k_{A_n}^{θ_n}(t) * k_A(t)) g_{D_n}(x)
= \sum_{m=1}^{M} f_A(t) g_{C_m}(x) + \sum_{n=1}^{N} f_A^{θ_n}(t) g_{D_n}(x)
= h^{TS}(t, x).   (3.13)
Note that the Hilbert rotation of the input stage k(t) is conferred on the entire temporal filter f(t), since

k^θ(t) * k_A(t) = (\cos θ k(t) + \sin θ \hat{k}(t)) * k_A(t)
= \cos θ (k(t) * k_A(t)) + \sin θ (\hat{k}(t) * k_A(t))
= \cos θ f_A(t) + \sin θ \hat{f}_A(t) = f_A^θ(t),   (3.14)

which uses the Hilbert transfer property

\hat{k}_1(t) * k_2(t) = H[k_1(t)] * k_2(t) = H[k_1(t) * k_2(t)] = \hat{k}_{12}(t),   (3.15)
where k1 (t) ∗ k2 (t) = k12 (t). That is, once lagging is performed in the thalamus, the effects of lagging continue up the pathway. The physiological interpretation of equation 3.12 is that the thalamic inputs are quite unrestricted in their similarity. We do not require that all thalamic inputs have the same temporal impulse response, plus Hilbert rotations of that same temporal impulse response, as would be implied by equation 3.6. We require that only the low-frequency components of the thalamic impulse response be the same and that there be partial lagging in some subset of the population. High-frequency components of the temporal impulse responses and all aspects of the spectral response fields are completely unrestricted. In fact, high-frequency components of the temporal impulse responses need not even be lagged, since they are filtered out.
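The transfer property in equation 3.15 is easy to verify numerically (an illustrative check, not from the original, using FFT-based circular convolution and an FFT-based Hilbert transform on arbitrary random kernels):

```python
import numpy as np

def hilbert_transform(x):
    """Discrete Hilbert transform via the FFT (even-length real signals)."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[n // 2] = 1.0
    h[1:n // 2] = 2.0
    return np.imag(np.fft.ifft(np.fft.fft(x) * h))

def circ_conv(a, b):
    """Circular convolution via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

rng = np.random.default_rng(0)
k1, k2 = rng.standard_normal(512), rng.standard_normal(512)

lhs = circ_conv(hilbert_transform(k1), k2)    # \hat{k}_1 * k_2
rhs = hilbert_transform(circ_conv(k1, k2))    # H[k_1 * k_2] = \hat{k}_{12}
print(np.allclose(lhs, rhs))  # True
```

In other words, lagging commutes with later filtering: once lagging is performed in the thalamus, the effects of lagging continue up the pathway.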
To restate, models that attempt a decomposition of the form

h^{TS}(t, x) = \left( \sum_{m=1}^{M} k_{A_m}(t) g_{C_m}(x) + \sum_{n=1}^{N} k_{A_n}^{θ_n}(t) g_{D_n}(x) \right) * k_A(t),   (3.16)
where k_i(t) * k_A(t) ≠ k_j(t) * k_A(t) for some i, j, are ruled out. From the signal processing point of view, because all the thalamic inputs k_i(t) and k_j(t), as impulse responses, are so fast compared to the slow somatic filter k_A(t) they are convolved with, they are indistinguishable from a pure impulse δ(t). More rigorously, in the frequency domain, the thalamic input transfer functions (Fourier transform representations of the thalamic input impulse responses) change only gradually over the typical range of the somatic frequencies (a few tens of Hz). A transfer function with approximately constant amplitude and constant phase is approximately the impulse function δ(t). That the thalamic inputs may also contain delays does not change the result: a transfer function with approximately constant amplitude and linear phase is approximately the delayed impulse function δ(t − t_0). Even differential delays among inputs can be tolerated if the differences are not too great, since a relative delay shift of 2 ms changes phase by only 4% of a cycle at 20 Hz, and less at slower frequencies. For instance, differential delays of a few milliseconds that arise from different thalamic inputs or from different dendritic branches of a cortical neuron are too small to break the temporal symmetry. Let us summarize the implications of temporal symmetry for the thalamic inputs to AI: the thalamic inputs are almost unrestricted, especially in their spectral support, but they must have the same low-frequency temporal structure (e.g., approximately constant amplitude and phase linearity up to a few tens of Hz). Any lagged inputs (with θ significantly different from zero but not necessarily near 90 degrees) will cause the cortical STRF to lose full separability but maintain temporal symmetry. This is a prediction of the reasonably simple but still physiologically plausible models presented above.
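The delay-tolerance figure quoted above follows from the fact that a pure delay Δt shifts phase by f · Δt cycles at frequency f; a quick illustrative check:

```python
# Phase shift (in cycles) of a pure delay: delta_phi = f * delta_t
delay_s = 0.002                     # 2 ms relative delay between inputs
for f_hz in (5.0, 10.0, 20.0):
    frac_cycle = f_hz * delay_s
    print(f"{f_hz:g} Hz: {100 * frac_cycle:.0f}% of a cycle")
# at 20 Hz the shift is 4% of a cycle, and less at slower rates
```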
There are no constraints imposed by temporal symmetry on the spectral response fields of the thalamic inputs (despite tonotopy in both ventral MGB and AI), nor are there any constraints on the thalamic input impulse response’s high-frequency (fast) components individually. The main assumptions of the models are that the thalamic inputs are fully separable and that the soma in AI includes a low-pass filter.

3.3 Implications and Interpretation of AI STRF Temporal Properties for Neural Connectivity and Intracortical Inputs to AI. Neurons in AI receive their predominant thalamic (ventral MGB) input in layers III and IV (Smith & Populin, 2001). Other neurons in a given local circuit receive primarily intracortical inputs, and neurons receiving thalamic inputs receive intracortical inputs as well. Temporal symmetry of the STRF places
strong constraints on the nature of these intracortical inputs as well. Unlike the thalamic inputs, which are assumed above to be fully separable, the STRFs of cortical inputs are assumed not to be fully separable but to have temporal symmetry. In this section, we explore the constraints on the temporal and spectral components of the intracortical inputs to an AI neuron.

3.3.1 Adding Cortical Feedforward Connections. The first generalization is that we allow the temporally symmetric AI neuron to project to another AI neuron, which itself receives no other input, shown schematically in Figure 11a. This second neuron is then also temporally symmetric, and its STRF is given by the original STRF of its input, convolved with the intrinsic impulse response of the second neuron:

h^{TS}(t, x) = \left( \sum_{m=1}^{M} k_{A_m}(t) g_{C_m}(x) + \sum_{n=1}^{N} k_{A_n}^{θ_n}(t) g_{D_n}(x) \right) * k_A(t) * k_2(t).   (3.17)

Before we show the next generalization that maintains temporal symmetry, it is instructive to see a model generalization that does not maintain temporal symmetry, and so is disallowed by the physiology. The sum of two temporally symmetric STRFs is not, in general, temporally symmetric:

h_1^{TS}(t, x) = f_{A1}(t) g_{A1}(x) + \hat{f}_{A1}(t) g_{B1}(x)
h_2^{TS}(t, x) = f_{A2}(t) g_{A2}(x) + \hat{f}_{A2}(t) g_{B2}(x)
h(t, x) = h_1^{TS}(t, x) + h_2^{TS}(t, x)
= f_{A1}(t) g_{A1}(x) + \hat{f}_{A1}(t) g_{B1}(x) + f_{A2}(t) g_{A2}(x) + \hat{f}_{A2}(t) g_{B2}(x) ≠ h^{TS}(t, x),   (3.18)
since it generically gives an STRF of rank 4, whereas a temporally symmetric STRF must be rank 2. There do exist configurations in which a special relationship between the several spectral functions does result in temporal symmetry, but they are highly restrictive: for example, when g_{A1}(x) = g_{A2}(x) and g_{B1}(x) = g_{B2}(x). An example of an STRF that is of the form of equation 3.18, where there is no special relationship between the spectral and temporal components of its constituents, is shown in Figure 6c. It is not temporally symmetric and is therefore ruled out as a model.

3.3.2 Adding Cortical Feedback Connections. The next generalization is at a higher level, in that it is the first generalization we consider that has
Figure 11: Schematics depicting models that are more complex. (a) Using the output of a temporally symmetric (TS) neuron as sole input to another neuron results in a temporally symmetric (TS) neuron (see equation 3.17). (b) Feedback from such a temporally symmetric neuron whose sole source is the first temporally symmetric neuron is still self-consistently temporally symmetric (see equation 3.19). (c) Multiple examples of feedback and feedforward: The initial neuron TS 1 provides temporal symmetry to all other neurons in the network due to its role as sole input for the network. All other neurons inherit the temporal symmetry, and the feedback is also self-consistently temporally symmetric.
feedback:

h_1^{TS}(t, x) = \left( \sum_{m=1}^{M} k_{A_m}(t) g_{C_m}(x) + \sum_{n=1}^{N} k_{A_n}^{θ_n}(t) g_{D_n}(x) \right) * k_A(t) + h_2^{TS}(t, x)

h_2^{TS}(t, x) = h_1^{TS}(t, x) * k_2(t).   (3.19)
This is an important generalization, shown schematically in Figure 11b. The proof that the result is temporally symmetric is by causal induction: line two of equation 3.19 guarantees that the STRF of neuron 2 is temporally symmetric with the same spectral functions g A(x) and g B (x) of neuron 1. Then since the result of line 1 is the sum of two temporally symmetric STRFs sharing the same g A(x) and g B (x), the resulting STRF is also temporally symmetric (and shares the same g A(x) and g B (x)). A more rigorous proof follows straightforwardly when using the MTFST instead of the STRF. This illustrates the base of a large class of generalizations: Any network of cortical neurons in one self-contained circuit, including feedforward and feedback connections, will remain temporally symmetric as long as all spectral information arises from a single source that receives all thalamic input. That is, if input into the cortical neural circuit arises, ultimately, from a single source receiving all thalamic input (e.g., one neuron in layer III/IV), then all neurons in that circuit will be temporally symmetric and all will share the same spectral functions g A(x) and g B (x). A substantially more complex example is illustrated schematically in Figure 11c. The converse to this rule of a single source of thalamic inputs to a cortical circuit is that almost any other inputs will break temporal symmetry and are therefore disallowed by the physiology. As shown in equation 3.18, any neuron that receives input from two sources whose temporal symmetries are incompatible will not be temporally symmetric and is therefore disallowed. Essentially, each local circuit must get its input from a single layer III/IV cell. This is a strong claim: it demands that neurons in the local circuit but not in layer III/IV must also be temporally symmetric. 
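The rank argument behind these connectivity constraints (equation 3.18 and the shared-g_A(x), g_B(x) exception) can be illustrated numerically (a NumPy sketch, not from the original; the kernels and seeds are arbitrary):

```python
import numpy as np

def hilbert_transform(x):
    """Discrete Hilbert transform via the FFT (even-length real signals)."""
    n = len(x)
    h = np.zeros(n)
    h[0] = 1.0
    h[n // 2] = 1.0
    h[1:n // 2] = 2.0
    return np.imag(np.fft.ifft(np.fft.fft(x) * h))

def ts_strf(f, g_a, g_b):
    """Temporally symmetric STRF f g_A + \\hat{f} g_B (equation 3.2)."""
    return np.outer(f, g_a) + np.outer(hilbert_transform(f), g_b)

rng = np.random.default_rng(0)
nt, nx = 256, 32
t = np.arange(nt)
f1 = np.sin(2 * np.pi * 4 * t / nt) * np.hanning(nt)   # two different
f2 = np.sin(2 * np.pi * 9 * t / nt) * np.hanning(nt)   # temporal profiles
gA1, gB1, gA2, gB2 = rng.standard_normal((4, nx))

tol = lambda m: 1e-8 * np.linalg.norm(m)
# Unrelated spectral functions: the sum is generically rank 4, not TS (eq. 3.18)
h_sum = ts_strf(f1, gA1, gB1) + ts_strf(f2, gA2, gB2)
# Shared spectral functions g_A, g_B: the sum stays rank 2, hence can be TS
h_shared = ts_strf(f1, gA1, gB1) + ts_strf(f2, gA1, gB1)

rank_sum = np.linalg.matrix_rank(h_sum, tol=tol(h_sum))
rank_shared = np.linalg.matrix_rank(h_shared, tol=tol(h_shared))
print(rank_sum, rank_shared)  # 4 2
```

This mirrors the single-source rule: neurons whose inputs all trace back to one layer III/IV cell inherit a common g_A(x) and g_B(x), so their summed STRFs stay rank 2, while mixing inputs with unrelated spectral functions raises the rank and breaks temporal symmetry.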
Generally, inputs from neighboring or distant circuits will have a different spectral balance of thalamic inputs and would break temporal symmetry if strongly coupled, at least at timescales important to these STRFs (a few to a few tens of Hz). Only if the neighboring cortical cell were to have a compatible temporal symmetry would it be allowed to take part in the circuit without violating temporal symmetry. Fully separable cells in AI are also well explained by all the above models, except without the second, lagged thalamic population. All of the results, generalizations, and restrictions apply equally strongly, since anything that breaks temporal symmetry increases the rank and so does not allow the STRF to be of rank 1 either; and maintaining temporal symmetry is consistent with full separability if there are no lagged inputs to increase the STRF rank from 1 to 2.

4 Discussion

Temporal symmetry is a property that STRFs may possess, but there is not any mathematical reason that they must. Nevertheless, in the awake case, 85% of neurons measured in AI were temporally symmetric (η_t > 0.65),
and 96% in the anesthetized population (including those that are fully separable). This property merits serious attention. Simple models of functional thalamic input to AI in which the temporal functions are unconstrained, or identical but with relative delays of more than a few milliseconds, do not possess the property of temporal symmetry, and so must be ruled out. On the other hand, it is quite possible for a model to achieve temporal symmetry: the only restriction on the thalamic inputs is that their low-pass-filtered versions (at rates below 20 Hz, that is, those commensurate with cortical integration time constants) must be identical. Additionally, some of the ventral MGB inputs must be (at least partially) lagged in order for the cortical neuron to be temporally symmetric without being fully separable. Additional constraints are forced on neurons in AI receiving input from other neurons in AI, whether from simple feedforward connections or as part of a complicated feedback-feedforward network. Simply taking two temporally symmetric neurons as input will not generate a temporally symmetric neuron. Since the temporal profiles of cortical neurons are not identical, the only way to maintain temporal symmetry is to keep the same spectral balance of the lagged and nonlagged components (g_A(x) and g_B(x) in equation 3.2). The only sure way to accomplish this is to restrict all inputs to the local circuit network to a single input neuron, for example, in layer III/IV. Another possibility, not ruled out, is to carefully match the balance of the lagged and nonlagged components even when coming from another column. It seems this would be difficult to match well enough to avoid breaking temporal symmetry, but it is not disallowed by this analysis. It is consistent with the findings of Read et al. (2001), who note that horizontal connections match not only in best frequency but also in spectral bandwidth.
One can rule out the possibility that temporal symmetry is due to a simple interaction between ordinary transmission delays and limited temporal modulation bandwidths. For instance, if the STRF’s temporal bandwidth is narrow, an ordinary time delay (with phase linear in frequency) might appear to be a constant phase lag. If this effect were important, the temporal symmetry parameter should show a negative correlation with temporal bandwidth. In fact, the correlation between temporal bandwidth (of the first singular value temporal cross section) and temporal symmetry index is small and positive: 0.24 for the awake case and 0.21 for the anesthetized case. Of course, the experimental results of Figures 8 and 9 do not show absolute temporal symmetry, only a large degree of it. Similarly, intracortical connections from neighboring or distant circuits are not ruled out absolutely: rather, their strength must be small and their effects more subtle. Indeed, for neurons whose STRF is noisy, the “noise” may be due to inputs from neighboring or distant circuits, whether thalamic or intracortical, that are not time-locked.
In addition, since all the STRFs in this analysis were derived entirely from the steady-state portion of the response, this analysis and its models do not apply to neural connections that are active only during onset. This is a limitation of any study of steady-state responses. The neural connectivity it predicts is a functional connectivity: the connectivity active during the steady state. Nevertheless, the steady-state response is important in auditory processing (e.g., for background sounds and auditory streaming), and it is unlikely that steady-state functional connectivity is completely unrelated to other measures of functional connectivity. It should be noted that the possibility of the thalamic inputs being not fully separable but still temporally symmetric (still with all inputs having the same temporal function) is not strictly ruled out mathematically. It amounts to a rearranging of the terms within the largest parentheses of equation 3.12. There is no physiological evidence against the existence of not fully separable, temporally symmetric neurons in auditory thalamus, with all inputs having the same temporal function. In fact, it is possible to shift the models presented here down one level, using input from fully separable neurons in the inferior colliculus (IC) instead of the thalamus and using thalamocortical loops instead of intracortical feedback. In this case, it would be necessary to find a mechanism that lags (Hilbert rotates) some of the IC inputs. The essential predictions are the same: all input to the circuit must be through a single source, and any exceptions would break the temporal symmetry, but the source has been moved down one level. However, it does add another layer of complexity and of strong assumptions to the models, so this appears less likely to be relevant.
If, however, ventral MGB neurons are found with substantially more complicated STRF structures (e.g., quadrant separable but not temporally symmetric, or even inseparable without any symmetries, both of which are more complicated than all of the AI neurons described here), then the models proposed here would not be able to accommodate them, and another explanation will have to be found for the temporal symmetry found in AI. The specific models proposed by the equations above (e.g., equations 3.3, 3.6, 3.12, 3.17, 3.19) are somewhat more general than the precise descriptive text that follows from them. For instance, lagged cortical cells are allowed in place of the lagged thalamic cells: for example, in the macaque visual system, there are both lagged and nonlagged V1 cortical neurons, in addition to the same two populations in visual thalamus (De Valois et al., 2000). The models above do not distinguish between those cortical inputs and those thalamic inputs. Similarly, the models do not depend on the mechanism by which input neurons are lagged, whether by thalamic inhibition (Mastronarde, 1987a, 1987b) or synaptic depression (Chance et al., 1998); from equation 3.3, the models do not even depend on the sign or magnitude of the phase lag. Nevertheless, this does not invalidate the models’ ease of falsifiability.
The example of a feedback network described by equation 3.19 and depicted in Figure 11b is the simplest of systems that include feedback. The more complex model depicted in Figure 11c serves as an example of a more complex cortical network with feedback, but the class of feedback systems is so much larger than that of feedforward systems that it is beyond the scope of this article to categorize all of them. One important feature of these models is their ease of falsifiability. Most interactions will break temporal symmetry. Any functionally connected pair of neurons in AI with temporally symmetric STRFs, but with incompatible impulse responses and spectral response fields, would invalidate the model. Finding direct evidence is also possible, and we propose one experiment as an example: simultaneous recording from two functionally connected neurons in AI with strongly temporally symmetric STRFs. While presenting steady-state auditory stimuli, electrically stimulate the neuron whose output is the other’s input in a way that is incompatible with temporal symmetry but compatible with the steady-state stimulus. If these models are correct, the STRF of the other neuron will remain well formed but lose temporal symmetry. Additional caveats are in order, delineating the types of functional connections permitted and disallowed by the measured temporal symmetry. The STRF captures many features of the response in AI, but not all. For instance, typical STRFs in AI only extend up to a few tens of Hz (Depireux et al., 2001), since the neurons cannot lock to rates higher than that, yet some spikes within the response show remarkably high temporal precision across different presentations, as fast as a millisecond (see, e.g., Elhilali, Fritz, Klein, Simon, & Shamma, 2004). Since STRFs do not capture effects at this timescale, predictions made from the temporal symmetry of the STRFs cannot place any constraints on functional connections that take effect only at this timescale.
Similarly, since the STRFs are measured only from the sustained response and do not use the onset response, constraints cannot be placed on functional connections that occur only in the first few hundred milliseconds. These results have been obtained from STRFs measured by a single method (using TORC stimuli) and still need to be verified for STRFs measured using different stimuli. Use of the SVD truncation (see equation 2.20) and calculation of the temporal symmetry index (see equation 2.21) is straightforward for STRFs from any source. The main caveat in applying this technique to other data is that the STRFs should be sufficiently error free, especially in systematic error (Klein et al., 2006). Too much error in the STRF estimate leads to the second term in the singular value truncation being dominated by noise, which corrupts the estimate of the temporal symmetry index. How much of this work generalizes beyond AI? In vision, it appears that temporal and spectral symmetry measurements for spatiotemporal response fields are rarely, if ever, made (these measurements would require
the entire spatiotemporal response field to be measured, not just the temporal response at the best spatial frequency, or the converse). It is certainly true that many spatiotemporal response fields show strong directional selectivity, and it is clear from equation 2.31 that an STRF that is quadrant separable and purely directionally selective is also temporally symmetric. The population distributions of the temporal symmetry index, shown in Figure 8a, exhibit some differences between the awake and anesthetized populations, the most pronounced of which is the low-symmetry tail in the awake population. The similarities are more profound than the differences, however: the large majority of STRFs show pronounced temporal symmetry. This is especially striking when compared with the spectral case, shown in Figure 8b, which has a broad distribution in both the awake and anesthetized populations. Let us also restate an earlier point regarding the validity of drawing conclusions from a linear systems analysis of a system known to have strong nonlinearities. Linear systems are a subcategory of those nonlinear systems well described by Volterra and Wiener expansions (Eggermont, 1993; Rugh, 1981). If, in particular, the system is well described by a linear kernel followed by a static nonlinearity, the measured STRF is still proportional to that linear kernel (Bussgang, 1952; Nykamp & Ringach, 2002), since the stimuli used here, though discretized in rate and spectral density, are approximately spherical. If the system’s nonlinearities are more general or the static nonlinearity is sensitive to the residual nonsphericity of the stimuli, then the measured STRF will contain additional stimulus-dependent, nonlinear contributions.
Crucially, it has been shown that those potential higher-order nonlinearities are not dominant (i.e., the linear properties are robust) in ferret AI, since using several stimulus sets with widely differing spectrotemporal properties produces measured STRFs that do not differ substantially from each other (Klein et al., 2006). This implies that the STRF measured here is dominantly proportional to a linear kernel (followed by some static nonlinearity) and that the analysis performed here is valid. To the extent that there is higher-order contamination in the STRF, however, the analysis is affected in unknown ways. For example, there are related neural systems for which the dominant response has not been seen to be linear (Machens, Wehr, & Zador, 2004; Sahani & Linden, 2003). The most immediate other question is whether ventral MGB neurons have fully separable STRFs and whether any have lagged outputs. If neither of these holds true, then it becomes very difficult to explain temporal symmetry, which is so easily broken. Another set of questions revolves around the generation of STRFs themselves. What distinguishes AI neurons that have clean STRFs from those that have noisy STRFs, or none at all? Only some (large) fraction of AI neurons have recognizable STRFs, and only some (large) fraction of those recognizable STRFs have a high signal-to-noise ratio (no specific statistics have been gathered, and a careful
626
J. Simon, D. Depireux, D. Klein, J. Fritz, and S. Shamma
analysis is beyond the scope of this article). What is the cause of this noise, and how is it related to (presumably) non-phase-locked inputs?

Appendix

A.1 Temporal Symmetry Ambiguities. There are ambiguities and symmetries in the decomposition in equation 3.2: h_TS(t, x) = f_A(t) g_A(x) + f̂_A(t) g_B(x). The transformation parameterized by an amplitude and phase (λ, θ),

f_A(t) → λ f_A^θ(t) = λ (cos θ f_A(t) + sin θ f̂_A(t))
f̂_A(t) = f_A^{π/2}(t) → λ f_A^{θ+π/2}(t) = λ (−sin θ f_A(t) + cos θ f̂_A(t))
g_A(x) → λ^{−1} (cos θ g_A(x) + sin θ g_B(x))
g_B(x) → λ^{−1} (−sin θ g_A(x) + cos θ g_B(x)),    (A.1)
leaves equation 3.2 invariant. This indicates a twofold symmetry, which here means that this decomposition is ambiguous (nonunique). The amplitude symmetry, parameterized by λ, demonstrates that the power (strength) of an STRF cannot be ascribed to either the spectral or the temporal cross-section. A standard method of removing this ambiguity is to normalize the cross-sections and include an explicit amplitude,

h_TS(t, x) = A v_A(t) u_A(x) + B v̂_A(t) u_B(x),    (A.2)
where the u_i and v_i are of unit norm. This method is used in SVD. The rotation transformation parameter θ merely establishes which of the possible rotations of f_A^θ(t) we choose to compare against all others. For example, in Figure 7, the temporal cross-section at the best frequency (i.e., the frequency at which the STRF reaches its most extreme value) is chosen to be the standard impulse response against which all others are compared. To do this, we choose θ such that h_TS(t, x_BF) = f_A(t) g_A(x_BF) and λ such that g_A(x_BF) = 1.

A.2 Quadrant-Separable Spectrotemporal Modulation Transfer Functions. For mathematical convenience, we name and define the cross-sectional functions referred to without names above in section 2.9. In quadrant 1, where both w and Ω are positive, H(w, Ω) = F_1^+(w) G_1^+(Ω), where the index 1 refers to quadrant 1 and the + indicates that the function is defined only for positive argument. In quadrant 2, where w is negative and Ω is positive, H(w, Ω) = F_2^{+∗}(−w) G_2^+(Ω), where the index 2 refers to quadrant 2, the + indicates that the function is defined only for positive argument, −w is positive, and the complex conjugation makes later equations significantly simpler. Thus, a quadrant-separable MTF_ST can be expressed (valid in all four quadrants) as

H^QS(w, Ω) = F_1^+(w) Θ(w) G_1^+(Ω) Θ(Ω) + F_2^+(w) Θ(w) G_2^{+∗}(−Ω) Θ(−Ω)
   + F_1^{+∗}(−w) Θ(−w) G_1^{+∗}(−Ω) Θ(−Ω) + F_2^{+∗}(−w) Θ(−w) G_2^+(Ω) Θ(Ω),    (A.3)
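The structure of equation A.3 can be checked numerically: tabulating arbitrary one-sided cross-sections and assembling the four quadrant terms yields a transfer function with the conjugate symmetry H(−w, −Ω) = H*(w, Ω) required for a real STRF. A small sketch (our construction, with random cross-sections; not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
nw, no = 16, 12
w = np.arange(-nw, nw + 1)    # temporal modulation axis (integer grid)
om = np.arange(-no, no + 1)   # spectral modulation axis
W, O = np.meshgrid(w, om, indexing="ij")

# Arbitrary complex one-sided cross-sections, tabulated on |w| and |Omega|.
def one_sided(n):
    return rng.standard_normal(n + 1) + 1j * rng.standard_normal(n + 1)

F1p, F2p = one_sided(nw), one_sided(nw)
G1p, G2p = one_sided(no), one_sided(no)

T = lambda z: (z > 0).astype(float)   # the step function Theta

# Equation A.3: quadrants 1 and 2 are free; quadrants 3 and 4 are their
# complex conjugates.
H = (F1p[np.abs(W)] * T(W) * G1p[np.abs(O)] * T(O)
     + F2p[np.abs(W)] * T(W) * np.conj(G2p[np.abs(O)]) * T(-O)
     + np.conj(F1p[np.abs(W)]) * T(-W) * np.conj(G1p[np.abs(O)]) * T(-O)
     + np.conj(F2p[np.abs(W)]) * T(-W) * G2p[np.abs(O)] * T(O))

# Conjugate symmetry H(-w, -Omega) = H*(w, Omega) holds by construction.
sym_err = np.max(np.abs(H[::-1, ::-1] - np.conj(H)))
print(sym_err)
```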
where Θ(·) is the step function (1 for positive argument and 0 for negative argument). In equation A.3 it is explicit that the spectrotemporal modulation transfer function in quadrants 3 (w < 0, Ω < 0) and 4 (w > 0, Ω < 0) is the complex conjugate of that in quadrants 1 and 2, respectively. This can be written more compactly as

H^QS(w, Ω) = F_1(w) G_1(Ω) + F_2(w) G_2(Ω) − F̂_1(w) Ĝ_1(Ω) + F̂_2(w) Ĝ_2(Ω),    (A.4)

where

F_i(w) = F_i^+(w) Θ(w) + F_i^{+∗}(−w) Θ(−w)
G_i(Ω) = G_i^+(Ω) Θ(Ω) + G_i^{+∗}(−Ω) Θ(−Ω)    (A.5)

are defined for both positive and negative argument (and are complex conjugate symmetric by construction), i = (1, 2), and the Hilbert transform is defined in transfer function space as

F̂(w) = j sgn(w) F(w)
Ĝ(Ω) = j sgn(Ω) G(Ω).    (A.6)
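Two immediate consequences of equations A.5 and A.6 are easy to check numerically: the extended cross-section is conjugate symmetric, and applying the Hilbert transform twice gives −F. A small sketch (our code, operating directly in transfer-function space):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 16
w = np.arange(-n, n + 1)

# Equation A.5: extend a one-sided cross-section to negative argument.
Fp = rng.standard_normal(n + 1) + 1j * rng.standard_normal(n + 1)
F = Fp[np.abs(w)] * (w > 0) + np.conj(Fp[np.abs(w)]) * (w < 0)

# Conjugate symmetry by construction: F(-w) = F*(w).
conj_err = np.max(np.abs(F[::-1] - np.conj(F)))

# Equation A.6: the Hilbert transform in transfer-function space.
hil = lambda X: 1j * np.sign(w) * X

# Applying the transform twice gives -F (the w = 0 bin is zero anyway).
twice_err = np.max(np.abs(hil(hil(F)) + F))
print(conj_err, twice_err)
```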
Equation A.4 demonstrates that a quadrant-separable spectrotemporal modulation transfer function can be decomposed into the sum of four fully separable spectrotemporal modulation transfer functions, where the last two terms are fully determined by the first two and the requirement of quadrant separability. This is the first time we see that a generic quadrant-separable MTF_ST (and its STRF) is rank 4, and not 2. Roughly speaking, this comes from the combination of two independent product terms in quadrants 1 and 2, plus the restriction that the MTF_ST must be complex conjugate symmetric. An equivalent decomposition of a quadrant-separable spectrotemporal modulation transfer function into the sum of four fully separable spectrotemporal modulation transfer functions, equally concise but more amenable to comparison with the definitions motivated earlier, is

H^QS(w, Ω) = F_A(w) G_A(Ω) + F̂_B(w) Ĝ_B(Ω) + F̂_A(w) G_B(Ω) + F_B(w) Ĝ_A(Ω),    (A.7)

where

F_A(w) = (1/√2)(F_1(w) + F_2(w))
F_B(w) = (1/√2)(−F̂_1(w) + F̂_2(w))
G_A(Ω) = (1/√2)(G_1(Ω) + G_2(Ω))
G_B(Ω) = (1/√2)(−Ĝ_1(Ω) + Ĝ_2(Ω)).    (A.8)

Taking the inverse Fourier transform gives the canonical form of the quadrant-separable STRF

h^QS(t, x) = f_A(t) g_A(x) + f̂_B(t) ĝ_B(x) + f̂_A(t) g_B(x) + f_B(t) ĝ_A(x),    (A.9)
which is also equation 2.28.

A.3 Quadrant Separability Ambiguities. There are ambiguities and symmetries in the decomposition in equation A.3 (and, correspondingly, equations A.4 and A.7). The transformation parameterized by a pair of amplitudes and phases (Λ_1, θ_1, Λ_2, θ_2),

F_1^+ → F_1^+ Λ_1 exp(jθ_1)
G_1^+ → G_1^+ Λ_1^{−1} exp(−jθ_1)
F_2^+ → F_2^+ Λ_2 exp(jθ_2)
G_2^+ → G_2^+ Λ_2^{−1} exp(jθ_2)    (A.10)

or, equivalently,

F_1(w) → Λ_1 F_1^{θ_1}(w)
G_1(Ω) → Λ_1^{−1} G_1^{−θ_1}(Ω)
F_2(w) → Λ_2 F_2^{θ_2}(w)
G_2(Ω) → Λ_2^{−1} G_2^{θ_2}(Ω),    (A.11)
leaves equation A.3 invariant, representing an ambiguous (nonunique) decomposition. The spectrotemporal version of the transformation is

f_1(t) → Λ_1 f_1^{θ_1}(t)
g_1(x) → Λ_1^{−1} g_1^{−θ_1}(x)
f_2(t) → Λ_2 f_2^{θ_2}(t)
g_2(x) → Λ_2^{−1} g_2^{θ_2}(x).    (A.12)
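Both the generic rank of this decomposition and its invariance under the transformation of equation A.12 can be verified numerically. The sketch below (our code, not the paper's) builds the spectrotemporal version of equation A.4 from random cross-sections with the DC and Nyquist bins removed, so that the discrete Hilbert transform squares exactly to −1; note that it uses the convention in which g_1 is rotated by −θ_1 but g_2 by +θ_2, which is the choice that leaves the STRF invariant:

```python
import numpy as np

rng = np.random.default_rng(4)

def hil(f):
    """Hilbert transform via the j*sgn transfer function (equation A.6)."""
    W = np.fft.fftfreq(len(f))
    return np.real(np.fft.ifft(1j * np.sign(W) * np.fft.fft(f)))

def rot(f, th):
    """f^theta = cos(theta) f + sin(theta) f-hat."""
    return np.cos(th) * f + np.sin(th) * hil(f)

def rand_sig(n):
    # Random signal with zeroed DC and Nyquist bins, so hil(hil(f)) == -f.
    X = np.fft.fft(rng.standard_normal(n))
    X[0] = 0.0
    X[n // 2] = 0.0
    return np.real(np.fft.ifft(X))

nt, nx = 64, 48
f1, f2 = rand_sig(nt), rand_sig(nt)
g1, g2 = rand_sig(nx), rand_sig(nx)

def strf(f1, g1, f2, g2):
    # Spectrotemporal (inverse Fourier) version of equation A.4.
    return (np.outer(f1, g1) + np.outer(f2, g2)
            - np.outer(hil(f1), hil(g1)) + np.outer(hil(f2), hil(g2)))

h0 = strf(f1, g1, f2, g2)

# Generic quadrant separability: exactly four nonzero singular values.
s = np.linalg.svd(h0, compute_uv=False)

# The transformation of equation A.12 leaves the STRF unchanged.
L1, th1, L2, th2 = 1.7, 0.6, 0.4, -1.1
h1 = strf(L1 * rot(f1, th1), rot(g1, -th1) / L1,
          L2 * rot(f2, th2), rot(g2, th2) / L2)
inv_err = np.max(np.abs(h1 - h0))
print(s[:5] / s[0], inv_err)
```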
The amplitude symmetry, parameterized by Λ_1 and Λ_2, demonstrates that the amplitude (or power) in one quadrant of a spectrotemporal modulation transfer function cannot be ascribed to either the spectral or the temporal cross-section. Again, a standard method of removing this ambiguity is to normalize the cross-sections and include an explicit amplitude:

H^QS_ST(w, Ω) = V_1(w) Λ_1 U_1(Ω) + V_2(w) Λ_2 U_2(Ω)
   − V̂_1(w) Λ_1 Û_1(Ω) + V̂_2(w) Λ_2 Û_2(Ω),    (A.13)

where the U_i and V_i are of unit norm. This method is used in singular value decomposition (SVD). The spectrotemporal version is

h^QS(t, x) = v_1(t) Λ_1 u_1(x) + v_2(t) Λ_2 u_2(x) − v̂_1(t) Λ_1 û_1(x) + v̂_2(t) Λ_2 û_2(x).    (A.14)
The phase ambiguity, parameterized by θ_1 and θ_2, can be useful. By judicious choice of this transformation, it can always be arranged that F_2 be orthogonal to F̂_1 and, simultaneously, G_2 be orthogonal to Ĝ_1. This is useful when performing detailed calculations (e.g., the analytic expression below for the singular values of a quadrant-separable spectrotemporal modulation transfer function) because it explicitly decomposes the spectrotemporal modulation transfer function, viewed as a linear operator, into components that act on orthogonal subspaces.

A.4 Rank of Quadrant-Separable Spectrotemporal Response Fields. This section uses SVD to prove that the rank of a generic quadrant-separable STRF is 4. We begin with equation A.14. In this form, the symmetry generated by Λ_1 and Λ_2 in equation A.11 has already been fixed, but we can still use the symmetry generated by θ_1 and θ_2 in equation A.11 to simplify later calculations. We quantify the nonorthogonality of the terms in
equation A.14 by the four angles,

cos θ_V = v_1 · v_2 = v̂_1 · v̂_2
cos θ_U = u_1 · u_2 = û_1 · û_2
cos ϕ_V = v_1 · v̂_2 = −v̂_1 · v_2
cos ϕ_U = u_1 · û_2 = −û_1 · u_2,    (A.15)

where w_i · w_j := ∫ w_i w_j and the v_i, v̂_i, u_i, and û_i are all of unit norm. By making an explicit transformation with θ_1 and θ_2 from equation A.12 such that

tan(θ_1 − θ_2) = −cos ϕ_V / cos θ_V
tan(θ_1 + θ_2) = cos ϕ_U / cos θ_U,    (A.16)

it can be shown that after the transformation,

v_1 · v̂_2 = 0 = u_1 · û_2,    (A.17)
and there are new values for cos θ_V = v_1 · v_2 and cos θ_U = u_1 · u_2 as well. This step is not necessary, but it reduces the complexity of the following equations. To avoid double counting, we also need to restrict the range of either θ_U or θ_V (just as on a sphere, we let the azimuthal angle run over the entire equator but restrict the polar angle to only half a great circle of longitude). We do this by enforcing that cos θ_U be positive. We then apply SVD to equation A.14 with the constraint of equation A.17. The squares of the singular values λ_1, λ_2, λ_3, and λ_4 are the eigenvalues of the operator formed by taking the inner product of the STRF with itself along the t axis:

∫dt h^QS(t, x) h^QS†(t, x′) = Λ_1² u_1(x) u_1(x′) + Λ_2² u_2(x) u_2(x′)
   + Λ_1 Λ_2 cos θ_V [u_1(x) u_2(x′) + u_2(x) u_1(x′)]
   + Λ_1² û_1(x) û_1(x′) + Λ_2² û_2(x) û_2(x′)
   − Λ_1 Λ_2 cos θ_V [û_1(x) û_2(x′) + û_2(x) û_1(x′)].    (A.18)
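Equation A.18 can be verified directly: construct unit-norm cross-sections satisfying equation A.17, assemble the STRF of equation A.14, and compare the inner product along t with the right-hand side, with cos θ_V measured from the constructed vectors. A numerical sketch (ours; the projection step enforces A.17):

```python
import numpy as np

rng = np.random.default_rng(5)

def hil(f):
    """Hilbert transform via the j*sgn transfer function (equation A.6)."""
    W = np.fft.fftfreq(len(f))
    return np.real(np.fft.ifft(1j * np.sign(W) * np.fft.fft(f)))

def rand_unit(n):
    # Random unit-norm signal with zeroed DC and Nyquist bins.
    X = np.fft.fft(rng.standard_normal(n))
    X[0] = 0.0
    X[n // 2] = 0.0
    f = np.real(np.fft.ifft(X))
    return f / np.linalg.norm(f)

nt, nx = 64, 48
v1, u1 = rand_unit(nt), rand_unit(nx)

# Enforce equation A.17 (v1 . v2-hat = 0, i.e., v2 orthogonal to v1-hat;
# likewise for u2), then renormalize.
v2 = rand_unit(nt); v2 -= (v2 @ hil(v1)) * hil(v1); v2 /= np.linalg.norm(v2)
u2 = rand_unit(nx); u2 -= (u2 @ hil(u1)) * hil(u1); u2 /= np.linalg.norm(u2)

L1, L2 = 1.4, 0.6
h = (np.outer(v1, L1 * u1) + np.outer(v2, L2 * u2)
     - np.outer(hil(v1), L1 * hil(u1)) + np.outer(hil(v2), L2 * hil(u2)))

# Left-hand side of equation A.18: inner product along the t axis.
lhs = h.T @ h

# Right-hand side of equation A.18, with cos(theta_V) measured from v1, v2.
cV = v1 @ v2
rhs = (L1**2 * np.outer(u1, u1) + L2**2 * np.outer(u2, u2)
       + L1 * L2 * cV * (np.outer(u1, u2) + np.outer(u2, u1))
       + L1**2 * np.outer(hil(u1), hil(u1)) + L2**2 * np.outer(hil(u2), hil(u2))
       - L1 * L2 * cV * (np.outer(hil(u1), hil(u2)) + np.outer(hil(u2), hil(u1))))

print(np.max(np.abs(lhs - rhs)))
```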
Each pair of lines of the right-hand side of equation A.18 can then be orthogonalized separately. In fact, after we orthogonalize the first two lines, the last two lines can be orthogonalized by substituting cos θ_V → −cos θ_V and taking the Hilbert transform of each component function. To orthogonalize the first two lines, we look for the
eigenvalues λ² and coefficients c_1 and c_2 such that

ζ(x) = c_1 u_1(x) + c_2 u_2(x)    (A.19)

is the eigenfunction solving the eigenvalue equation,

∫dx′ { Λ_1² u_1(x) u_1(x′) + Λ_2² u_2(x) u_2(x′)
   + Λ_1 Λ_2 cos θ_V [u_1(x) u_2(x′) + u_2(x) u_1(x′)] } ζ(x′) = λ² ζ(x).    (A.20)
Putting equation A.19 into equation A.20 and expanding gives a pair of solutions defined (up to normalization) by the two roots of the quadratic equation

(c_2/c_1)² (Λ_1² cos θ_U + Λ_1 Λ_2 cos θ_V) + (c_2/c_1)(Λ_1² − Λ_2²) − (Λ_2² cos θ_U + Λ_1 Λ_2 cos θ_V) = 0,    (A.21)

and the corresponding two solutions of

λ² = Λ_1² + Λ_1 Λ_2 cos θ_V cos θ_U + (c_2/c_1)(Λ_1² cos θ_U + Λ_1 Λ_2 cos θ_V).    (A.22)
Solving similarly for the eigenfunctions and eigenvalues of the last two lines of the right-hand side of equation A.18 gives the same as equations A.21 and A.22 but with cos θ_V → −cos θ_V. The eigenfunctions in t are given by using the operator formed by taking the inner product of the STRF with itself along the x axis,

∫dx h^QS(t, x) h^QS†(t′, x),    (A.23)
and give the same eigenvalues. Explicitly, the four eigenvalues λ² are given by the four solutions of

λ² = (Λ_1² + Λ_2²)/2 + Λ_1 Λ_2 cos θ_V cos θ_U
   ± √{ [(Λ_1² + Λ_2²)/2]² − Λ_1² Λ_2² (1 − cos²θ_V − cos²θ_U) + (Λ_1² + Λ_2²) Λ_1 Λ_2 cos θ_V cos θ_U }    (A.24)
and the same with cos θ_V → −cos θ_V. The four singular values are given by the four positive square roots of λ². Because the system is explicitly orthogonal, the rank is given by the number of nonzero values of λ². Generically, the four values of λ² are nonzero, with the only exceptions described below.

A.5 Lower-Rank Quadrant-Separable Spectrotemporal Response Fields. This section uses SVD to prove that if a quadrant-separable STRF has rank less than the generic rank of 4, then this is caused by specific symmetries, and the rank must be reduced to 2 (or 1). Label the four eigenvalues λ² as λ²_{A±} for the two solutions to equation A.24 and λ²_{B±} for the two solutions to equation A.24 but with cos θ_V → −cos θ_V. First, we consider the case that neither Λ_1 nor Λ_2 vanishes (we will consider the alternative case later). Define Λ_0 = √(Λ_1 Λ_2) and

α² = (1/2)(Λ_1/Λ_2 + Λ_2/Λ_1),    (A.25)

where it can be shown that α² ≥ 1 for any Λ_1 and Λ_2. Then the SVD eigenvalues can be written as

λ² = Λ_0² { α² + cos θ_V cos θ_U ± √[ (α² + cos θ_V cos θ_U)² − (1 + cos θ_V cos θ_U)² + (cos θ_V + cos θ_U)² ] }    (A.26)

and the same with cos θ_V → −cos θ_V. Note for reference below that since α² ≥ 1, the first term inside the braces of equation A.26 is nonnegative, as is the expression under the square root. It can also be shown that the first term is greater than or equal to the square root expression, guaranteeing that all of λ²_{A±} and λ²_{B±} are nonnegative. The loss of rank (from 4 to a smaller number) corresponds to at least one of the eigenvalues vanishing, since the rank is the number of nonzero eigenvalues. Because of the nonnegativity properties of the components of equation A.26, we always have λ²_{A+} ≥ λ²_{A−} and λ²_{B+} ≥ λ²_{B−}, and so if only one of the eigenvalues vanishes, it must be λ²_{A−} or λ²_{B−}. First, consider the case λ²_{A−} = 0. This can occur only when the first term cancels the square root expression exactly or, since they are both nonnegative, when their squares cancel exactly. This condition, when simplified, can be written as

(1 − cos²θ_V)(1 − cos²θ_U) = 0,    (A.27)

which can occur only when cos²θ_V = 1 or cos²θ_U = 1. The case cos²θ_V = 1 results in only two nonzero eigenvalues, with values λ² = 2Λ_0² (α² ± cos θ_U), and two zero eigenvalues, λ²_{A−} = λ²_{B−} = 0, giving an
STRF of rank 2 (not rank 3, the guess associated with the initial assumption of λ²_{A−} = 0). From equation A.15, this also results in v_2(t) = ±v_1(t), which is necessary and sufficient for temporal symmetry: the STRF has only one temporal function and its Hilbert transform. Analogously, the case cos²θ_U = 1 results in spectral symmetry. There are only two nonzero eigenvalues, with values λ² = 2Λ_0² (α² ± cos θ_V), and two zero eigenvalues, λ²_{A−} = λ²_{B−} = 0, giving an STRF of rank 2. Considering the case λ²_{B−} = 0 leads to identical conclusions. Aside from the degenerate case in which only one nonzero eigenvalue remains (the fully separable case), the only other possibility of low rank arises when either Λ_1 or Λ_2 vanishes. If Λ_1 = 0, then the quadrant-separable STRF of equation A.14 is explicitly in the form of equation 2.31, giving this STRF the symmetry of pure directional selectivity. Equation A.24 simplifies dramatically and results in only two nonzero eigenvalues (both with value Λ_2²), giving a rank 2 STRF. Similarly, if Λ_2 = 0, the STRF is purely directionally selective in the opposite direction, resulting in only two nonzero eigenvalues (both with value Λ_1²), and also rank 2. Thus, we have proven that whenever the rank of a quadrant-separable STRF is less than 4, it must be rank 2 and temporally symmetric; rank 2 and spectrally symmetric; rank 2 and directionally selective; or rank 1 and fully separable.

Acknowledgments

S.A.S. was supported by Office of Naval Research Multidisciplinary University Research Initiative Grant N00014-97-1-0501. D.A.D. was supported by the National Institute on Deafness and Other Communication Disorders, National Institutes of Health, Grant R01 DC005937. We thank Shantanu Ray for technical assistance in electronics and major contributions in customized software design and Sarah Newman and Tamar Vardi for assistance with training animals.
We also thank Alan Saul and three anonymous reviewers from an earlier submission for their helpful and insightful comments.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2(2), 284–299.
Aertsen, A. M., & Johannesma, P. I. (1981a). A comparison of the spectrotemporal sensitivity of auditory neurons to tonal and natural stimuli. Biol. Cybern., 42(2), 145–156.
Aertsen, A. M., & Johannesma, P. I. (1981b). The spectrotemporal receptive field: A functional characteristic of auditory neurons. Biol. Cybern., 42(2), 133–143.
Barlow, H. B., & Levick, W. R. (1965). The mechanism of directionally selective units in rabbit's retina. J. Physiol., 178(3), 477–504.
Borst, A., & Egelhaaf, M. (1989). Principles of visual-motion detection. Trends in Neurosciences, 12(8), 297–306.
Bussgang, J. J. (1952). Crosscorrelation functions of amplitude-distorted gaussian signals. Cambridge, MA: Research Laboratory of Electronics, Massachusetts Institute of Technology.
Chance, F. S., Nelson, S. B., & Abbott, L. F. (1998). Synaptic depression and the temporal response characteristics of V1 cells. J. Neurosci., 18(12), 4785–4799.
Cohen, L. (1995). Time-frequency analysis. Englewood Cliffs, NJ: Prentice Hall.
Creutzfeldt, O., Hellweg, F. C., & Schreiner, C. (1980). Thalamocortical transformation of responses to complex auditory stimuli. Exp. Brain Res., 39(1), 87–104.
De Valois, R. L., Cottaris, N. P., Mahon, L. E., Elfar, S. D., & Wilson, J. A. (2000). Spatial and temporal receptive fields of geniculate and cortical cells and directional selectivity. Vision Res., 40(27), 3685–3702.
De Valois, R. L., & De Valois, K. K. (1988). Spatial vision. New York: Oxford University Press.
deCharms, R. C., Blake, D. T., & Merzenich, M. M. (1998). Optimizing sound features for cortical neurons. Science, 280(5368), 1439–1443.
Depireux, D. A., Simon, J. Z., Klein, D. J., & Shamma, S. A. (2001). Spectrotemporal response field characterization with dynamic ripples in ferret primary auditory cortex. J. Neurophysiol., 85(3), 1220–1234.
Depireux, D. A., Simon, J. Z., & Shamma, S. A. (1998). Measuring the dynamics of neural responses in primary auditory cortex. Comments Theor. Biol., 5(2), 89–118.
DiCarlo, J. J., & Johnson, K. O. (1999). Velocity invariance of receptive field structure in somatosensory cortical area 3b of the alert monkey. J. Neurosci., 19(1), 401–419.
DiCarlo, J. J., & Johnson, K. O. (2000). Spatial and temporal structure of receptive fields in primate somatosensory area 3b: Effects of stimulus scanning direction and orientation. J. Neurosci., 20(1), 495–510.
DiCarlo, J. J., & Johnson, K. O. (2002). Receptive field structure in cortical area 3b of the alert monkey. Behav. Brain Res., 135(1–2), 167–178.
Eggermont, J. J. (1993). Wiener and Volterra analyses applied to the auditory system. Hear. Res., 66(2), 177–201.
Eggermont, J. J., Aertsen, A. M., Hermes, D. J., & Johannesma, P. I. (1981). Spectrotemporal characterization of auditory neurons: Redundant or necessary. Hear. Res., 5(1), 109–121.
Elhilali, M., Fritz, J. B., Klein, D. J., Simon, J. Z., & Shamma, S. A. (2004). Dynamics of precise spike timing in primary auditory cortex. J. Neurosci., 24(5), 1159–1172.
Emerson, R. C., & Gerstein, G. L. (1977). Simple striate neurons in the cat. II. Mechanisms underlying directional asymmetry and directional selectivity. J. Neurophysiol., 40(1), 136–155.
Epping, W. J., & Eggermont, J. J. (1985). Single-unit characteristics in the auditory midbrain of the immobilized grassfrog. Hear. Res., 18(3), 223–243.
Escabi, M. A., & Schreiner, C. E. (2002). Nonlinear spectrotemporal sound analysis by neurons in the auditory midbrain. J. Neurosci., 22(10), 4114–4131.
Evans, E. F. (1979). Single-unit studies of mammalian cochlear nerve. In H. A. Beagley (Ed.), Auditory investigation: The scientific and technological basis (pp. 324–367). New York: Oxford University Press.
Fritz, J., Shamma, S., Elhilali, M., & Klein, D. (2003). Rapid task-related plasticity of spectrotemporal receptive fields in primary auditory cortex. Nat. Neurosci., 6(11), 1216–1223.
Ghazanfar, A. A., & Nicolelis, M. A. (1999). Spatiotemporal properties of layer V neurons of the rat primary somatosensory cortex. Cereb. Cortex, 9(4), 348–361.
Ghazanfar, A. A., & Nicolelis, M. A. (2001). Feature article: The structure and function of dynamic cortical and thalamic receptive fields. Cereb. Cortex, 11(3), 183–193.
Green, D. M. (1986). "Frequency" and the detection of spectral shape change. In B. C. J. Moore & R. D. Patterson (Eds.), Auditory frequency selectivity (pp. 351–359). New York: Plenum Press.
Hansen, P. C. (1997). Rank-deficient and discrete ill-posed problems: Numerical aspects of linear inversion. Philadelphia: SIAM.
Heeger, D. J. (1993). Modeling simple-cell direction selectivity with normalized, half-squared, linear operators. J. Neurophysiol., 70(5), 1885–1898.
Hermes, D. J., Aertsen, A. M., Johannesma, P. I., & Eggermont, J. J. (1981). Spectrotemporal characteristics of single units in the auditory midbrain of the lightly anaesthetised grass frog (Rana temporaria L) investigated with noise stimuli. Hear. Res., 5(2–3), 147–178.
Hillier, D. (1991). Auditory processing of sinusoidal spectral envelopes. St. Louis, MO: Washington University.
Humphrey, A. L., & Weller, R. E. (1988). Functionally distinct groups of X-cells in the lateral geniculate nucleus of the cat. J. Comp. Neurol., 268(3), 429–447.
Johannesma, P. I., & Eggermont, J. J. (1983). Receptive fields of auditory neurons in the frog's midbrain as functional elements for acoustic communications. In J.-P. Ewert, R. R. Capranica, D. Ingle & North Atlantic Treaty Organization, Scientific Affairs Division (Eds.), Advances in vertebrate neuroethology (pp. 901–910). New York: Plenum Press.
Klein, D. J., Depireux, D. A., Simon, J. Z., & Shamma, S. A. (2000). Robust spectrotemporal reverse correlation for the auditory system: Optimizing stimulus design. J. Comput. Neurosci., 9(1), 85–111.
Klein, D. J., Simon, J. Z., Depireux, D. A., & Shamma, S. A. (2006). Stimulus-invariant processing and spectrotemporal reverse correlation in primary auditory cortex. J. Comput. Neurosci., 20(2), 111–136.
Kowalski, N., Depireux, D. A., & Shamma, S. A. (1996). Analysis of dynamic spectra in ferret primary auditory cortex. I. Characteristics of single-unit responses to moving ripple spectra. J. Neurophysiol., 76(5), 3503–3523.
Kowalski, N., Versnel, H., & Shamma, S. A. (1995). Comparison of responses in the anterior and primary auditory fields of the ferret cortex. J. Neurophysiol., 73(4), 1513–1523.
Kvale, M., & Schreiner, C. E. (1997). Short-term adaptation of auditory receptive fields to dynamic stimuli. J. Neurophysiol., 91(2), 604–612.
Linden, J. F., Liu, R. C., Sahani, M., Schreiner, C. E., & Merzenich, M. M. (2003). Spectrotemporal structure of receptive fields in areas AI and AAF of mouse auditory cortex. J. Neurophysiol., 90(4), 2660–2675.
Linden, J. F., & Schreiner, C. E. (2003). Columnar transformations in auditory cortex? A comparison to visual and somatosensory cortices. Cereb. Cortex, 13(1), 83–89.
Machens, C. K., Wehr, M. S., & Zador, A. M. (2004). Linearity of cortical receptive fields measured with natural sounds. J. Neurosci., 24(5), 1089–1100.
Maex, R., & Orban, G. A. (1996). Model circuit of spiking neurons generating directional selectivity in simple cells. J. Neurophysiol., 75(4), 1515–1545.
Mastronarde, D. N. (1987a). Two classes of single-input X-cells in cat lateral geniculate nucleus. I. Receptive-field properties and classification of cells. J. Neurophysiol., 57(2), 357–380.
Mastronarde, D. N. (1987b). Two classes of single-input X-cells in cat lateral geniculate nucleus. II. Retinal inputs and the generation of receptive-field properties. J. Neurophysiol., 57(2), 381–413.
McLean, J., & Palmer, L. A. (1994). Organization of simple cell responses in the three-dimensional (3-D) frequency domain. Vis. Neurosci., 11(2), 295–306.
Miller, L. M., Escabi, M. A., Read, H. L., & Schreiner, C. E. (2002). Spectrotemporal receptive fields in the lemniscal auditory thalamus and cortex. J. Neurophysiol., 87(1), 516–527.
Miller, L. M., Escabi, M. A., & Schreiner, C. E. (2001). Feature selectivity and interneuronal cooperation in the thalamocortical system. J. Neurosci., 21(20), 8136–8144.
Miller, L. M., & Schreiner, C. E. (2000). Stimulus-based state control in the thalamocortical system. J. Neurosci., 20(18), 7011–7016.
Moller, A. R. (1977). Frequency selectivity of single auditory-nerve fibers in response to broadband noise stimuli. J. Acoust. Soc. Am., 62(1), 135–142.
Nykamp, D. Q., & Ringach, D. L. (2002). Full identification of a linear-nonlinear system via cross-correlation analysis. Journal of Vision, 2(1), 1–11.
Papoulis, A. (1987). The Fourier integral and its applications. New York: McGraw-Hill.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1986). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.
Qin, L., Chimoto, S., Sakai, M., & Sato, Y. (2004). Spectral-shape preference of primary auditory cortex neurons in awake cats. Brain Res., 1024(1–2), 167–175.
Qiu, A., Schreiner, C. E., & Escabi, M. A. (2003). Gabor analysis of auditory midbrain receptive fields: Spectrotemporal and binaural composition. J. Neurophysiol., 90(1), 456–476.
Read, H. L., Winer, J. A., & Schreiner, C. E. (2001). Modular organization of intrinsic connections associated with spectral tuning in cat auditory cortex. Proc. Natl. Acad. Sci. U.S.A., 98(14), 8042–8047.
Read, H. L., Winer, J. A., & Schreiner, C. E. (2002). Functional architecture of auditory cortex. Curr. Opin. Neurobiol., 12(4), 433–440.
Reid, R. C., Victor, J. D., & Shapley, R. M. (1997). The use of m-sequences in the analysis of visual neurons: Linear receptive field properties. Vis. Neurosci., 14(6), 1015–1027.
Richmond, B. J., Optican, L. M., & Spitzer, H. (1990). Temporal encoding of two-dimensional patterns by single units in primate primary visual cortex. I. Stimulus-response relations. J. Neurophysiol., 64(2), 351–369.
Rouiller, E., de Ribaupierre, Y., Toros-Morel, A., & de Ribaupierre, F. (1981). Neural coding of repetitive clicks in the medial geniculate body of cat. Hear. Res., 5(1), 81–100.
Rugh, W. J. (1981). Nonlinear system theory: The Volterra/Wiener approach. Baltimore, MD: Johns Hopkins University Press.
Rutkowski, R. G., Shackleton, T. M., Schnupp, J. W., Wallace, M. N., & Palmer, A. R. (2002). Spectrotemporal receptive field properties of single units in the primary, dorsocaudal and ventrorostral auditory cortex of the guinea pig. Audiol. Neurootol., 7(4), 214–227.
Sahani, M., & Linden, J. F. (2003). How linear are auditory cortical responses? In S. Becker, S. Thrun & K. Obermeyer (Eds.), Advances in neural information processing systems, 15 (pp. 109–116). Cambridge, MA: MIT Press.
Saul, A. B., & Humphrey, A. L. (1990). Spatial and temporal response properties of lagged and nonlagged cells in cat lateral geniculate nucleus. J. Neurophysiol., 64(1), 206–224.
Schafer, M., Rubsamen, R., Dorrscheidt, G. J., & Knipschild, M. (1992). Setting complex tasks to single units in the avian auditory forebrain. II. Do we really need natural stimuli to describe neuronal response characteristics? Hear. Res., 57(2), 231–244.
Schnupp, J. W., Mrsic-Flogel, T. D., & King, A. J. (2001). Linear processing of spatial cues in primary auditory cortex. Nature, 414(6860), 200–204.
Schreiner, C. E., & Calhoun, B. M. (1994). Spectral envelope coding in cat primary auditory cortex: Properties of ripple transfer functions. Aud. Neurosci., 1(1), 39–62.
Sen, K., Theunissen, F. E., & Doupe, A. J. (2001). Feature analysis of natural sounds in the songbird auditory forebrain. J. Neurophysiol., 86(3), 1445–1458.
Shamma, S. A., Fleshman, J. W., Wiser, P. R., & Versnel, H. (1993). Organization of response areas in ferret primary auditory cortex. J. Neurophysiol., 69(2), 367–383.
Shamma, S. A., & Versnel, H. (1995). Ripple analysis in ferret primary auditory cortex: II. Prediction of unit responses to arbitrary spectral profiles. Aud. Neurosci., 1(3), 255–270.
Shamma, S. A., Versnel, H., & Kowalski, N. (1995). Ripple analysis in ferret primary auditory cortex: I. Response characteristics of single units to sinusoidally rippled spectra. Aud. Neurosci., 1(3), 233–254.
Smith, A. T., Snowden, R. J., & Milne, A. B. (1994). Is global motion really based on spatial integration of local motion signals? Vision Res., 34(18), 2425–2430.
Smith, P. H., & Populin, L. C. (2001). Fundamental differences between the thalamocortical recipient layers of the cat auditory and visual cortices. J. Comp. Neurol., 436(4), 508–519.
Smolders, J. W. T., Aertsen, A. M. H. J., & Johannesma, P. I. M. (1979). Neural representation of the acoustic biotope: Comparison of the response of auditory neurons to tonal and natural stimuli in the cat. Biol. Cybern., 35(1), 11–20.
Stewart, G. W. (1990). Stochastic perturbation theory. SIAM Review, 32(4), 579–610.
Stewart, G. W. (1991). Perturbation theory for the singular value decomposition. In R. J. Vaccaro (Ed.), SVD and signal processing, II: Algorithms, analysis, and applications. Amsterdam: Elsevier.
Stewart, G. W. (1993). Determining rank in the presence of error. In M. S. Moonen, G. H. Golub, & B. L. R. D. Moor (Eds.), Linear algebra for large scale and real-time applications. Dordrecht: Kluwer.
Suarez, H., Koch, C., & Douglas, R. (1995). Modeling direction selectivity of simple cells in striate visual cortex within the framework of the canonical microcircuit. J. Neurosci., 15(10), 6700–6719.
Summers, V., & Leek, M. R. (1994). The internal representation of spectral contrast in hearing-impaired listeners. J. Acoust. Soc. Am., 95(6), 3518–3528.
Sutter, E. E. (1992). A deterministic approach to nonlinear systems analysis. In R. B. Pinter & B. Nabet (Eds.), Nonlinear vision: Determination of neural receptive fields, function, and networks (pp. 171–220). Boca Raton, FL: CRC Press.
Theunissen, F. E., Sen, K., & Doupe, A. J. (2000). Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds. J. Neurosci., 20(6), 2315–2331.
Ulanovsky, N., Las, L., & Nelken, I. (2003). Processing of low-probability sounds by cortical neurons. Nat. Neurosci., 6(4), 391–398.
Valentine, P. A., & Eggermont, J. J. (2004). Stimulus dependence of spectrotemporal receptive fields in cat primary auditory cortex. Hear. Res., 196(1–2), 119–133.
van Dijk, P., Wit, H. P., Segenhout, J. M., & Tubis, A. (1994). Wiener kernel analysis of inner ear function in the American bullfrog. J. Acoust. Soc. Am., 95(2), 904–919.
Victor, J. D. (1992). Nonlinear systems analysis in vision: Overview of kernel methods. In R. B. Pinter & B. Nabet (Eds.), Nonlinear vision: Determination of neural receptive fields, function, and networks (pp. 1–38). Boca Raton, FL: CRC Press.
Watson, A. B., & Ahumada, A. J., Jr. (1985). Model of human visual-motion sensing. J. Opt. Soc. Am. A, 2(2), 322–341.
Wickesberg, R. E., & Geisler, C. D. (1984). Artifacts in Wiener kernels estimated using gaussian white noise. IEEE Trans. Biomed. Eng., 31(6), 454–461.
Yeshurun, Y., Wollberg, Z., & Dyn, N. (1987). Identification of MGB cells by Volterra kernels II: Towards a functional classification of cells. Biol. Cybern., 56(2–3), 203–208.
Yeshurun, Y., Wollberg, Z., Dyn, N., & Allon, N. (1985). Identification of MGB cells by Volterra kernels. I: Prediction of responses to species specific vocalizations. Biol. Cybern., 51(6), 383–390.
Received January 31, 2005; accepted June 19, 2006.
LETTER
Communicated by Walter Senn
Optimality Model of Unsupervised Spike-Timing-Dependent Plasticity: Synaptic Memory and Weight Distribution

Taro Toyoizumi
[email protected]
School of Computer and Communication Sciences and Brain-Mind Institute, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne EPFL, Switzerland, and Department of Complexity Science and Engineering, Graduate School of Frontier Sciences, University of Tokyo, Tokyo 153-8505, Japan
Jean-Pascal Pfister
[email protected] School of Computer and Communication Sciences and Brain-Mind Institute, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne EPFL, Switzerland
Kazuyuki Aihara
[email protected] Department of Information and Systems, Institute of Industrial Science, University of Tokyo, 153–8505, Tokyo, Japan, and Aihara Complexity Modelling Project, ERATO, JST, Shibuya-ku, Tokyo 151–0064, Japan
Wulfram Gerstner
[email protected] School of Computer and Communication Sciences and Brain-Mind Institute, École Polytechnique Fédérale de Lausanne, CH-1015 Lausanne EPFL, Switzerland
We studied the hypothesis that synaptic dynamics is controlled by three basic principles: (1) synapses adapt their weights so that neurons can effectively transmit information, (2) homeostatic processes stabilize the mean firing rate of the postsynaptic neuron, and (3) weak synapses adapt more slowly than strong ones, while maintenance of strong synapses is costly. Our results show that a synaptic update rule derived from these principles shares features with spike-timing-dependent plasticity, is sensitive to correlations in the input, and is useful for synaptic memory. Moreover, input selectivity (sharply tuned receptive fields) of postsynaptic neurons develops only if stimuli with strong features are presented. Sharply tuned neurons can coexist with unselective ones, and the distribution of synaptic weights can be unimodal or bimodal. The formulation of synaptic dynamics through an optimality criterion provides a simple graphical argument for the stability of synapses, necessary for synaptic memory.
Neural Computation 19, 639–671 (2007)
© 2007 Massachusetts Institute of Technology
1 Introduction

Synaptic changes are thought to be involved in learning, memory, and cortical plasticity, but the exact relation between microscopic synaptic properties and macroscopic functional consequences remains highly controversial. In experimental preparations, synaptic changes can be induced by specific stimulation conditions defined through pre- and postsynaptic firing rates (Bliss & Lomo, 1973; Dudek & Bear, 1992), postsynaptic membrane potential (Kelso, Ganong, & Brown, 1986), calcium entry (Malenka, Kauer, Zucker, & Nicoll, 1988; Lisman, 1989), or spike timing (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 2001). In the theoretical community, conditions for synaptic changes are formulated as synaptic update rules or learning rules (von der Malsburg, 1973; Bienenstock, Cooper, & Munro, 1982; Miller, Keller, & Stryker, 1989) (for reviews, see Gerstner & Kistler, 2002; Dayan & Abbott, 2001; Cooper, Intrator, Blais, & Shouval, 2004), but the exact features that make a synaptic update rule a suitable candidate for cortical plasticity and memory are unclear. From a theoretical point of view, a synaptic learning rule should (1) be sensitive to correlations between pre- and postsynaptic neurons (Hebb, 1949) in order to respond to correlations in the input (Oja, 1982); it should (2) allow neurons to develop input selectivity (e.g., receptive fields) (Bienenstock et al., 1982; Miller et al., 1989) in the presence of strong input features, but (3) the distribution of synaptic strength should remain unimodal otherwise (Gütig, Aharonov, Rotter, & Sompolinsky, 2003). Furthermore, (4) synaptic memories should show a high degree of stability (Fusi, Drew, & Abbott, 2005) and nevertheless remain plastic (Grossberg, 1987). Moreover, experiments suggest that plasticity rules are (5) sensitive to the presynaptic firing rate (Dudek & Bear, 1992), but (6) depend also on the exact timing of the pre- and postsynaptic spikes (Markram et al., 1997; Bi & Poo, 2001).
Many other experimental features could be added to this list, for example, the role of intracellular calcium and of NMDA receptors, but we will not do so (see Bliss & Collingridge, 1993, and Malenka & Nicoll, 1993, for reviews). The items in the above list are not necessarily exclusive, and the relative importance of a given aspect may vary from one subsystem to the next; for example, synaptic memory maintenance might be more important for a long-term memory system than for primary sensory cortices. Nevertheless, all of the above aspects seem to be important features of synaptic plasticity. However, the development of theoretical learning rules that exhibit all of the above properties has posed problems in the past. For example, traditional learning rules that have been proposed as an explanation of receptive field development (Bienenstock et al., 1982; Miller et al., 1989) exhibit a spontaneous separation of synaptic weights into two groups, even if the input shows no or only weak correlations. This is difficult to reconcile with experimental results in visual cortex of young rats, where a
unimodal distribution was found (Sjöström, Turrigiano, & Nelson, 2001). Moreover, model neurons that specialize early in development on one subset of features cannot readily readapt later. Other learning rules, however, that exhibit a unimodal distribution of synaptic weights (Gütig et al., 2003) do not lead to long-term stability of synaptic changes. In this article, we show that all of features 1 to 6 emerge naturally in a theoretical model where we require only a limited number of objectives that will be formulated as postulates. In particular, we study how the conflicting demands on synaptic memory maintenance, plasticity, and the distribution of synaptic weights could be satisfied by our model. Although the postulates are rather general and could be adapted to arbitrary neural systems, we had in mind excitatory synapses in neocortex or hippocampus and exclude inhibitory synapses and synapses in specialized systems such as the calyx of Held in the auditory pathway. Our arguments are based on three postulates: A. Synapses adapt their weights so as to allow neurons to efficiently transmit information. More precisely, we impose a theoretical postulate that the mutual information I between presynaptic spike trains and postsynaptic firing be optimized. Such a postulate stands in the tradition of earlier theoretical work (Linsker, 1989; Bell & Sejnowski, 1995), but is formulated here on the level of spike trains rather than rates. B. Homeostatic processes act on synapses to ensure that the long-term average of the neuronal firing rate becomes close to a target rate that is characteristic for each neuron. Synaptic rescaling and related mechanisms could be a biophysical implementation of homeostasis (Turrigiano & Nelson, 2004). The theoretical reason for such a postulate is that sustained high firing rates are costly from an energetic point of view (Laughlin, de Ruyter van Steveninck, & Anderson, 1998; Levy & Baxter, 2002). C.
C1: Maintenance of strong synapses is costly in terms of biophysical machinery, in particular in view of continued protein synthesis (Fonseca, Nägerl, Morris, & Bonhoeffer, 2004). C2: Synaptic plasticity is slowed for very weak synapses in order to avoid an (implausible) transition from excitatory to inhibitory synapses. Optimality approaches have a long tradition in the theoretical neurosciences and have been used in two different ways. First, optimality approaches allow deriving strict theoretical bounds against which the performance of real neural systems can be compared (Barlow, 1956; Laughlin, 1981; Britten, Shadlen, Newsome, & Movshon, 1992; de Ruyter van Steveninck & Bialek, 1995). Second, they have been used as a conceptual framework, since they allow connecting functional objectives (e.g., "be reliable!") and constraints (e.g., "don't use too much energy!") with electrophysiological properties of single neurons and synapses or neuronal populations (Barlow, 1961; Linsker, 1989; Atick & Redlich, 1990; Levy & Baxter,
2002; Seung, 2003). Our study, a derivation of synaptic update rules from an optimality viewpoint, follows this second, conceptual approach.

2 The Model

2.1 Neuron Model. We simulated a single stochastic point neuron model with N = 100 input synapses. Presynaptic spikes at synapse j are denoted by their arrival times t_j^f and evoke excitatory postsynaptic potentials (EPSPs) with time course exp[−(t − t_j^f)/τ_m] for t ≥ t_j^f, where τ_m = 20 ms is the membrane time constant. Recent experiments have shown that action potentials propagating back into the dendrite can partially suppress EPSPs measured at the soma (Froemke, Poo, & Dan, 2005). Since our model neuron has no spatial structure, we included EPSP suppression by a phenomenological amplitude factor a(t_j^f − t̂) that depends on the time difference between presynaptic spike arrival and the spike trigger time t̂ of the last (somatic) action potential of the postsynaptic neuron. In the absence of EPSP suppression, the amplitude of a single EPSP at synapse j is characterized by its weight w_j and its duration by the membrane time constant τ_m. Summation of the EPSPs caused by presynaptic spike arrival at all 100 explicitly modeled synapses gives the total postsynaptic potential

$$u(t) = u_r + \sum_{j=1}^{N} w_j \sum_f \exp\left[-\left(t - t_j^f\right)/\tau_m\right]$$

[…]

$$C_j(t) = \lim_{\varepsilon \to +0} \int_0^{t+\varepsilon} c_j(t')\, e^{-(t-t')/\tau_C}\, dt'. \qquad (2.18)$$
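The EPSP summation defining u(t) can be illustrated with a short sketch. This is a hedged illustration, not the paper's code: the resting potential u_r = −70 mV is our own assumption, and the suppression factor a(·) is omitted for brevity.

```python
import math

def membrane_potential(t, spike_times, weights, u_r=-70.0, tau_m=20.0):
    """Total postsynaptic potential u(t) as a sum of exponentially
    decaying EPSPs (the suppression factor a(.) is omitted here).
    Times are in ms, potentials in mV; u_r = -70 mV is an assumed
    resting potential, not a value taken from the paper."""
    u = u_r
    for w_j, times_j in zip(weights, spike_times):
        for t_f in times_j:  # spike arrival times t_j^f at synapse j
            if t >= t_f:
                u += w_j * math.exp(-(t - t_f) / tau_m)
    return u
```

A single 4 mV EPSP evoked at t = 0 decays to 4/e mV after one membrane time constant (20 ms).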
With this factor C_j, we find a batch learning rule (i.e., with expectations over the input and output statistics on the right-hand side) of the form

$$\frac{\partial L}{\partial w_j} \approx \left\langle \int_0^T dt \left[ C_j(t) B^{post}(t) - \lambda w_j x_j(t) \right] \right\rangle_{Y,X}. \qquad (2.19)$$

Finally, for slow learning rate α and stationary input statistics, the system becomes self-averaging (i.e., expectations can be dropped due to automatic temporal averaging; Gerstner & Kistler, 2002), so that we arrive at the online gradient learning rule,

$$\frac{dw_j}{dt} = \alpha(w_j) \left[ C_j(t) B^{post}(t) - \lambda w_j x_j(t) \right]. \qquad (2.20)$$
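A minimal Euler discretization of equations 2.18 and 2.20 could look as follows. The weight-dependent learning rate α(w) is a placeholder implementing the idea of postulate C2 (plasticity slowed for weights below roughly 0.2 mV); its soft-threshold form and the value of alpha0 are our own assumptions, and the input signals c_j, B_post, x_j are taken as given.

```python
def step_weight(w, C, c_j, B_post, x_j, dt=1.0,
                alpha0=1e-4, lam=0.026, tau_C=100.0, w_soft=0.2):
    """One Euler step of the online rule (equation 2.20).
    c_j, B_post, x_j are the instantaneous signals defined in the paper;
    their computation is not reproduced here."""
    # Low-pass filter of c_j with time constant tau_C (equation 2.18):
    C = C + dt * (-C / tau_C + c_j)
    # Assumed weight-dependent learning rate: slows smoothly for w << w_soft.
    alpha = alpha0 * w / (w + w_soft)
    # Information/homeostasis drive minus the weight-decay cost term:
    w = w + dt * alpha * (C * B_post - lam * w * x_j)
    return w, C
```

With c_j = B_post = 0 and a presynaptic spike (x_j = 1), the weight decreases by a small amount proportional to λw, the "weight decay" behavior of the cost term.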
See Figure 2 for an illustration of the dynamics of C_j and B^post. The last term has the form of a "weight decay" term common in artificial neural networks (Hertz, Krogh, & Palmer, 1991) and arises from the derivative of the weight-dependent cost term. The parameter λ is set such that dw_j/dt = 0 for large enough |t^pre − t^post| in the STDP in vitro paradigm. A few steps of calculation (see appendix A) yield λ = 0.026. In our simulations, we take τ_C = 100 ms for the cutoff factor in equation 2.18. For a better understanding of the learning dynamics defined in equation 2.20, let us look more closely at Figure 2a. The time course Δw/w of the potentiation has three components: first, a negative jump at the moment
Figure 2: Illustration of the dynamics of C_j (top), B^post (middle), and Δw/w = (w − w_init)/w_init (bottom) during a single pair of pre- and postsynaptic spikes (indicated schematically at the very top). We always start from an initial weight w_init = 4 mV and induce postsynaptic firing at time t^post = 0. (a) Pre-before-post timing with t^pre = −10 ms (solid line) induces a large potentiation, whereas t^pre = −50 ms (dashed line) induces almost no potentiation. (b) Due to the EPSP suppression factor, a post-before-pre timing with t^pre = 10 ms (solid line) induces a large depression, whereas t^pre = 50 ms (dashed line) induces a smaller depression. The marks (circle, cross, square, and diamond) correspond to the weight change due to a single pair of pre- and postsynaptic spikes after weights have converged to their new values. Note that in Figure 3, the corresponding marks indicate the weight change after 60 pairs of pre- and postsynaptic spikes.
of the presynaptic spike induced by the weight decay term; second, a slow increase in the interval between pre- and postsynaptic spike times induced by C_j(t)B^post(t) > 0; and third, a positive jump immediately after the postsynaptic spike induced by the singularity in B^post combined with a positive C_j. As it is practically difficult to calculate $\bar{\rho}(t) = \langle \rho(t) \rangle_{X(t)|Y(t)}$, we estimate $\bar{\rho}$ by the running average of the output spikes, that is,

$$\tau_{\bar{\rho}} \frac{d\bar{\rho}_{est}}{dt} = -\bar{\rho}_{est}(t) + y(t), \qquad (2.21)$$
with τ_ρ̄ = 1 min. This approximation is valid if the characteristics of the stimulus and output spike trains are stationary and uncorrelated. To simulate in vitro STDP experiments, the initial value of ρ̄_est is set equal to the injected pulse frequency. Other, more accurate estimates of ρ̄_est are possible but lead
to no qualitative change of results (data not shown). In the simulations of Figures 4 to 6, the initial synaptic strength is set to w_init = 0.4 ± 0.04 mV.

2.4 Stimulation Paradigm

2.4.1 Simulated STDP In Vitro Paradigm. For the simulations of Figure 3, spike timings t^pre = t_j^f at synapse j and postsynaptic spike times t^post are imposed with a given relative timing t^pre − t^post. For the calculation of the total STDP effect according to a typical in vitro stimulation paradigm, the pairing of pre- and postsynaptic spikes is repeated until 60 spike pairs have been accumulated. Spike pairs are triggered at a frequency of 1 Hz except for Figure 3c, where the stimulation frequency was varied.

2.4.2 Simulated Stochastic Spike Arrival. In most simulations, presynaptic spike arrival was modeled as Poisson spike input at either a fixed rate (homogeneous Poisson process) or a modulated rate (inhomogeneous Poisson process). For example, for the simulations in Figure 6, with a gaussian profile, spike arrival at synapse j is generated by an inhomogeneous Poisson process with the following characteristics. During a segment of 200 ms, the rate is fixed at ν_j = (ν_max − ν_0) exp[−0.01 d(j − k)²] + ν_0, where ν_0 = 1 Hz is the baseline firing rate and d(j − k) is the difference between indices j and k. The value of k denotes the location of the maximum and was reset every 200 ms to a value chosen stochastically between 1 and 100. (As indicated in the main text, presynaptic neurons in Figure 6 were considered to have a ring topology, which has been implemented by evaluating the difference d(j − k) as d(j − k) = min{|j − k|, 100 − |j − k|}.) In the simulations for Figure 5, however, input spike trains were not independent Poisson, but included spike-spike correlations. A correlation index of c = 0.2 implies that between a given pair of synapses, 20% of spikes have identical timing.
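The two stochastic stimulation protocols of section 2.4.2 could be sketched as follows. This is an illustration under stated assumptions: the 1 ms Bernoulli discretization, the default ν_max = 50 Hz, and the common "mother train" construction for correlated input are our own modeling choices, not taken from the paper.

```python
import math
import random

def gaussian_rate(j, k, nu_max=50.0, nu0=1.0, N=100):
    """Rate nu_j = (nu_max - nu0) * exp(-0.01 * d(j-k)^2) + nu0 with the
    ring-topology distance d; nu_max = 50 Hz is an assumed default."""
    d = min(abs(j - k), N - abs(j - k))
    return (nu_max - nu0) * math.exp(-0.01 * d ** 2) + nu0

def poisson_train(rate_hz, duration_ms, dt=1.0):
    """Bernoulli approximation of a Poisson spike train (times in ms)."""
    p = rate_hz * dt / 1000.0
    return [i * dt for i in range(int(duration_ms / dt)) if random.random() < p]

def correlated_trains(n_syn, rate_hz, duration_ms, c, dt=1.0):
    """Trains in which a fraction c of each train's spikes is copied from
    a common 'mother' train, so that roughly 100c percent of spike times
    coincide between any pair of synapses (one possible implementation)."""
    common = poisson_train(c * rate_hz, duration_ms, dt)
    return [sorted(common + poisson_train((1 - c) * rate_hz, duration_ms, dt))
            for _ in range(n_syn)]
```

For c = 0.2 and 10 Hz input, any pair of generated trains shares about 20% of its spike times, as required by the correlation index.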
More generally, for a given value of c within a group of synaptic inputs, 100c percent of spike arrival times are identical at an arbitrary pair of synapses within the group.

3 Results

The mathematical formulation of postulates A, B, and C1 led to an optimality criterion L that was optimized by changing synaptic weights in an uphill direction. In order to include postulate C2, the adaptation speed was made to depend on the current value of the synaptic weight, so that plasticity was significantly slowed for synapses with excitatory postsynaptic potentials (EPSPs) of amplitude less than 0.2 mV (see section 2 for details). As in a previous study based on postulates A and B (Toyoizumi et al., 2005a), the optimization of synaptic weights can be understood as a synaptic
Figure 3: The synaptic update rule of the model shares features with STDP. (a) STDP function (percentage change of EPSP amplitude as a function of t^pre − t^post) determined using 60 pre-and-post spike pairs injected at 1 Hz. The initial EPSP amplitudes are 4 mV (dashed line) and 6 mV (dotted line). Marks (circle, cross, square, and diamond) correspond respectively to t^pre = −50 ms, t^pre = −10 ms, t^pre = 10 ms, and t^pre = 50 ms, also depicted in Figure 2. (b) The percentage change in EPSP amplitude after 60 pre-and-post spike pairs injected at 1 Hz for pre-before-post timing (t^pre − t^post = −10 ms, solid line) and post-before-pre timing (t^pre − t^post = +10 ms, dashed line) as a function of initial EPSP amplitude. Our model results qualitatively resemble experimental data (see Figure 5 in Bi & Poo, 1998). (c) Frequency dependence of the STDP function: spike pairs are presented at frequencies of 0.5 Hz (dashed line), 1 Hz (dotted line), and 2 Hz (dot-dashed line). The STDP function exhibits only a weak sensitivity to the change in stimulation frequency. (d) STDP function for different choices of model parameters. The extension of the synaptic depression zone for pre-after-post timing (t^pre − t^post > 0) depends on the timescale τ_a of EPSP suppression (dot-dashed line, τ_a = 50 ms; dashed line, τ_a = 25 ms). The dotted line shows the STDP function in the absence of the weight-dependent cost term. The STDP function then exhibits a positive offset, indicating that without the cost term, unpaired presynaptic spikes would lead to potentiation, a non-Hebbian form of plasticity.
update rule that depends on presynaptic spike arrival, postsynaptic spike firing, the postsynaptic membrane potential, and the mean firing rate of the postsynaptic neuron. In addition, the synaptic update rule in this study included a term that decreases the synaptic weight upon presynaptic spike arrival by a small amount proportional to the EPSP amplitude (see section 2 for details). This term can be traced back to the additional weight-dependent cost term in equation 2.3 that accounts for postulate C1. In order to study the consequences of the synaptic update rule derived from postulates A, B, and C, we used computer simulations of a model neuron that received presynaptic spike trains at 100 synapses. Each presynaptic spike evoked an EPSP with exponential time course (time constant τ_m = 20 ms). In order to account for dendritic interaction between somatic action potentials and postsynaptic potentials, the amplitude of EPSPs was suppressed immediately after postsynaptic spike firing (Froemke et al., 2005) and recovered with time constant τ_a = 50 ms (see Figure 1a). As a measure of the weight of a synapse j, we used the EPSP amplitude w_j at this synapse in the absence of EPSP suppression. With all synaptic parameters w_j set to a fixed value, the model neuron fired stochastically with a mean firing rate ρ̄ that increases with the presynaptic spike arrival rate (see Figure 1b), has a broad interspike interval distribution (see Figure 1c), and has an autocorrelation function with a trough of 10 to 50 ms that is due to reduced excitability immediately after a spike because of EPSP suppression (see Figure 1d).

3.1 The Learning Rule Exhibits STDP. In a first set of plasticity experiments, we explored the behavior of the model system under a simulated in vitro paradigm as used in typical STDP experiments (Bi & Poo, 1998).
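The pairing paradigm and the EPSP suppression described above can be sketched as follows. The exponential form of the recovery function a(·) is our own assumption; the paper specifies only the recovery time constant τ_a = 50 ms.

```python
import math

def epsp_suppression(delta_ms, tau_a=50.0):
    """Assumed suppression factor a(t_j^f - t_hat): an EPSP arriving just
    after the last postsynaptic spike is suppressed and recovers
    exponentially with time constant tau_a (form is hypothetical)."""
    if delta_ms <= 0.0:  # presynaptic spike before the last postsynaptic spike
        return 1.0
    return 1.0 - math.exp(-delta_ms / tau_a)

def pairing_protocol(dt_pre_post_ms, n_pairs=60, freq_hz=1.0):
    """Spike times for the in vitro STDP paradigm of section 2.4.1:
    n_pairs pre/post pairs at freq_hz with fixed timing difference."""
    period_ms = 1000.0 / freq_hz
    post = [i * period_ms for i in range(n_pairs)]
    pre = [t + dt_pre_post_ms for t in post]
    return pre, post
```

With this form, an EPSP arriving 10 ms after a postsynaptic spike is strongly suppressed and recovers most of its amplitude after a few multiples of τ_a.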
In order to study the influence of the pre- and postsynaptic activity on the changes of weights as predicted by our online learning rule in equation 2.20, we plotted in Figure 2 the postsynaptic factor B^post and the correlation term C_j that both appear on the right-hand side of equation 2.20, together with the induced weight change Δw/w as a function of time. Indeed, the learning rule predicts positive weight changes when the presynaptic spike occurs 10 ms before the postsynaptic one and negative weight changes under reversed timing. For a comparison with experimental results, we used 60 pairs of pre- and postsynaptic spikes applied at a frequency of 1 Hz and recorded the total change Δw in EPSP amplitude. The experiment is repeated with different spike timings, and the result is plotted as a function of the spike timing difference t^pre − t^post. As in experiments (Markram et al., 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998, 2001; Sjöström et al., 2001), we find that synapses are potentiated if presynaptic spikes occur about 10 ms before a postsynaptic action potential, but are depressed if the timing is reversed. Compared to synapses with amplitudes in the range of 1 or 2 mV, synapses that are exceptionally strong show a reduced effect of potentiation
for pre-before-post timing, or even depression (see Figures 3a and 3b), in agreement with experiments on cultured hippocampal neurons (Bi & Poo, 1998). The shape of the STDP function depends only weakly on the stimulation frequency (see Figure 3c), even though a significant reduction of the potentiation amplitude with increasing frequency can be observed. In a recent experimental study (Froemke et al., 2005), a strong correlation between the timescale of EPSP suppression (which was found to depend on dendritic location) and the duration of the long-term depression (LTD) part in the STDP function was observed. Since our model neuron had no spatial structure, we artificially changed the time constant of EPSP suppression in the model equations. We found that indeed only the LTD part of the STDP function was affected, whereas the LTP part remained unchanged (see Figure 3d). In order to study the influence of the weight-dependent cost term in our optimality criterion L, we systematically changed the parameter λ in equation 2.3. For λ = 0, the weight-dependent cost term has no influence, and because the postsynaptic firing rate is close to the desired rate, synaptic plasticity in our model is mainly controlled by information maximization. In this case, synapses with a reasonable EPSP amplitude of one or a few millivolts are always strengthened, even for post-before-pre timing (see Figure 3d, dashed line). This can be intuitively understood since an increase of synaptic weight is always beneficial for information transmission except if spike arrival occurs immediately after a postsynaptic spike. In this case, the postsynaptic neuron is insensitive, so that no information can be transmitted. Nevertheless, information transmission is maximal in a situation where the presynaptic spike occurs just before the postsynaptic one. 
The weight-dependent cost term derived from postulate C is essential to shift the dashed line in Figure 3d to negative values so as to induce synaptic depression in our STDP paradigm. The optimal value of λ = 0.026, which ensures that for large spike timing differences |t^pre − t^post| neither potentiation nor depression occurs, has been estimated from a simple analytical argument (see appendix A).

3.2 Both Unimodal and Bimodal Synapse Distributions Are Stable. Under random spike arrival with a rate of 10 Hz at all 100 synapses, synaptic weights show little variability, with a typical EPSP amplitude in the range of 0.4 mV. This unspecific pattern of synapses stays stable even if 20 out of the 100 synapses are subject to a common rate modulation between 1 and 30 Hz (see Figure 4a). However, if the modulation of presynaptic firing rates becomes strong, the synapses rapidly develop a specific pattern, with large weights at synapses with rate-modulated spike input and weak weights at those synapses that received input at fixed rates (synaptic specialization; see Figures 4a and 4b), making the neuron highly selective to input at one group of synapses (input selectivity). Thus, our synaptic update rule is capable of selecting strong features in the input, but also allows a stable, unspecific
pattern of synapses in case of weak input. This is in contrast to most other Hebbian learning rules, where unspecific patterns of synapses are unstable, so that synaptic weights move spontaneously toward their upper or lower bounds (Miller et al., 1989; Miller & MacKay, 1994; Gerstner, Kempter, van Hemmen, & Wagner, 1996; Kempter, Gerstner, & van Hemmen, 1999; Song, Miller, & Abbott, 2000). After induction of synaptic specialization by strong modulation of presynaptic input, we reduced the rate modulation back to the value that previously led to an unspecific pattern of synapses. We found that the strong synapses remained strong and weak synapses remained weak; synaptic
Figure 4: Potentiation and depression depend on the presynaptic firing rate. Twenty synapses of group 1 receive input with common rate modulation, while the 80 synapses of group 2 (synapse indices 21–100) receive Poisson input at a constant rate of 10 Hz. The spike arrival rate in group 1 switches stochastically every 200 ms between a low rate ν_l = 1 Hz and a high rate ν_h taken as a parameter. (a) Evolution of synaptic weights as a function of time for different amplitudes of rate modulation, that is, ν_h changes from 10 Hz during the first hour to 30 Hz, then to 50 Hz, 30 Hz, and back to 10 Hz. During the first 2 hours of stimulation, an unspecific distribution of synapses remains stable, even though a slight decrease of weights in group 2 can be observed when the stimulus switches to ν_h = 30 Hz. A specialization of the synaptic pattern with large weights for synapses in group 1 is induced during the third hour of stimulation and remains stable thereafter. (b) Top: Mean synaptic weights (same data as in a) of group 1 (w̄_1, blue line) and group 2 (w̄_2, green line). Bottom: The stimulation paradigm ν_h as a function of time. Note that at ν_h = 30 Hz (second and fourth hour of stimulation), both an unspecific pattern of synapses with little difference between w̄_1 and w̄_2 (second hour, top) and a highly specialized pattern (fourth hour, top, large difference between solid and dashed lines) are possible. (c) The value of the objective function L (average value per second of time) in color code as a function of the mean synaptic weight in group 1 (y-axis, w̄_1) and group 2 (x-axis, w̄_2) during stimulation with ν_h = 30 Hz. Two maxima can be perceived: a broad maximum for the unspecific synapse pattern (w̄_2 ≈ w̄_1 ≈ 0.4 mV) and a pronounced elongated maximum for the specialized synapse pattern (w̄_2 ≈ 0; w̄_1 ≈ 0.8 mV). The dashed line indicates a one-dimensional section (see d) through the two-dimensional surface. (d) Objective function L as a function of the difference w̄_2 − w̄_1 between the mean synaptic weights in groups 2 and 1 along the line indicated in c. For ν_h = 30 Hz (dashed line, same data as in c), two maxima are visible: a broad maximum at w̄_2 − w̄_1 ≈ 0 and a narrow but higher maximum corresponding to the specialized synapse pattern at the very left of the graph. For ν_h = 50 Hz (dotted line), the broad maximum at w̄_2 − w̄_1 ≈ 0 disappears, and only the specialized synapse pattern remains, whereas for ν_h = 10 Hz (solid line), the broad maximum of the unspecific synapse pattern dominates.
specialization was stable against a change in the input (see Figure 4b). This result shows that synaptic dynamics exhibits hysteresis, which is an indication of bistability: for the same input, both an unspecific pattern of synapses and synaptic specialization are stable solutions of synaptic plasticity under our learning rule. Indeed, under rate modulation between 1 and 30 Hz for 20 out of the 100 synapses, the objective function L shows two local maxima (see Figure 4c): a sharp maximum corresponding to synaptic specialization (mean EPSP amplitude about 0.8 mV for synapses receiving rate-modulated input and less than 0.1 mV for synapses receiving constant input) and a broader but slightly lower maximum where both groups of synapses have a mean EPSP amplitude in the range of 0.3 to 0.5 mV (see appendix B for details on the method). In additional simulations, we confirmed that both the unspecific pattern of synapses and the selective pattern representing synaptic specialization remained stable over several hours of continued stimulation with rate-modulated input (data not shown). Bistability of selective and unspecific synapse patterns was consistently observed for rates modulated between 1 and 10 Hz or between 1 and 30 Hz, but the unspecific synapse pattern was unstable if the rate was modulated between 1 and 50 Hz, consistent with the weight dependence of our objective function L (see Figure 4d).
3.3 Retention of Synaptic Memories. In order to study synaptic memory retention with our learning rule, we induced synaptic specialization by stimulating 20 out of the 100 synapses with correlated spike input (spike-spike correlation index c = 0.2; see section 2.4 for details). The remaining 80 synapses received uncorrelated Poisson spike input. The mean firing rate (10 Hz) was identical at all synapses. After 60 minutes of correlated input at the group of 20 synapses, the stimulus was switched to uncorrelated spike input at the same rate. We studied how well synaptic specialization was maintained as a function of time after induction (see Figures 5a and 5b). Synaptic specialization was quantified by a bimodality index that compared the distribution of EPSP amplitudes at synapses that received correlated input with those receiving uncorrelated input. For each of the two groups of synapses, we calculated the mean $\bar{w} = \langle w \rangle$ and the variance $\sigma^2 = \langle [w_j - \bar{w}]^2 \rangle$: $\bar{w}_A$ and $\sigma_A^2$ for the group of synapses receiving correlated input and $\bar{w}_B$ and $\sigma_B^2$ for those receiving uncorrelated input. We then approximated the two distributions by gaussian functions. The bimodality index depends on the overlap between the two gaussians and is given by

$$b = 0.5 \left[ \mathrm{erf}\!\left( \frac{\bar{w}_A - \hat{s}}{\sqrt{2}\,\sigma_A} \right) + \mathrm{erf}\!\left( \frac{\hat{s} - \bar{w}_B}{\sqrt{2}\,\sigma_B} \right) \right],$$

where $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-t^2)\, dt$ is the error function and $\hat{s}$ is one of the two crossing points of the two gaussians, such that $\bar{w}_B < \hat{s} < \bar{w}_A$. The two distributions (strong and weak synapses) started to separate within the first 5 minutes and remained well separated even after the correlated memory-inducing stimulus was replaced by a random stimulus (see Figure 5b). In order to study how synaptic memory retention depended on the induction paradigm, the experiment was repeated with different values of the correlation index c that characterizes the spike-spike correlations during the induction period.
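The bimodality index b can be computed directly from the two groups of weights. As a hedged sketch: the grid search for the crossing point ŝ is our own substitute for the analytical solution, and both groups are assumed to have nonzero variance.

```python
import math

def bimodality_index(w_A, w_B):
    """Bimodality index b from gaussian fits to the two weight groups
    (A: correlated input, B: uncorrelated input). Assumes mean(w_B) <
    mean(w_A) and nonzero variance in each group."""
    mean = lambda xs: sum(xs) / len(xs)
    m_A, m_B = mean(w_A), mean(w_B)
    s_A = math.sqrt(mean([(x - m_A) ** 2 for x in w_A]))
    s_B = math.sqrt(mean([(x - m_B) ** 2 for x in w_B]))
    gauss = lambda x, m, s: math.exp(-((x - m) ** 2) / (2 * s * s)) / s
    # Crossing point s_hat of the two gaussian densities with
    # m_B < s_hat < m_A, located by a simple grid search between the means.
    grid = [m_B + (m_A - m_B) * i / 1000.0 for i in range(1, 1000)]
    s_hat = min(grid, key=lambda x: abs(gauss(x, m_A, s_A) - gauss(x, m_B, s_B)))
    return 0.5 * (math.erf((m_A - s_hat) / (math.sqrt(2) * s_A))
                  + math.erf((s_hat - m_B) / (math.sqrt(2) * s_B)))
```

Well-separated groups give b close to 1; strongly overlapping groups give a small b.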
For correlations c < 0.1, the two synaptic distributions are not well separated at the end of the induction period (bimodality index < 0.9), but they are well separated for c ≥ 0.15 (see Figure 5c). Good separation at the end of the induction period alone is, however, not sufficient to guarantee retention of synaptic memory, since for the synaptic distribution induced by stimulation with c = 0.15, specialization breaks down at t = 80 min., that is, after only 20 minutes of memory retention (see Figure 5d). On the other hand, for c = 0.2 and larger, synaptic memory is retained over several hours (only the first hour is plotted in Figure 5d). The long duration of synaptic memory in our model can be explained by the reduced adaptation speed of synapses with weights close to zero (postulate C2). If weak synapses change only slowly because of reduced adaptation speed, strong synapses must stay strong because of homeostatic processes that keep the mean activity of the postsynaptic neuron close to a target value. Moreover, the terms in the online learning rule derived from information maximization favor the bimodal distribution. Reduced adaptation speed of weak synapses could be caused by a cascade of
Optimality Model of Unsupervised Spike-Timing-Dependent Plasticity
[Figure 5, panels a-d, appears here. Axis labels include Synaptic index, Time [min], w [mV], Distribution, Spike-spike correlation, and the specialization (bimodality) index; the legend of panel d distinguishes c = 0.05, c = 0.1, c = 0.15, and c = 0.2.]
Figure 5: Synaptic memory induced by input with spike-spike correlations. (a) Evolution of 100 synaptic weights (vertical axis) as a function of time. During the first 60 minutes, synapses of group A (j = 1, . . . , 20) receive a Poisson stimulation of 10 Hz with correlated spike input (c = 0.2), and those of group B (j = 21, . . . , 100) receive uncorrelated spike input at 10 Hz. (b) Distribution of the EPSP amplitudes across the 100 synapses after t = 5 min (dotted line), t = 60 min (solid line), and t = 120 min (dot-dashed line). The red lines denote group A and the blue ones group B. (c) Mean EPSP amplitude of groups A (solid red line) and B (dashed blue line) at t = 60 min for different values of the correlation c of the input applied to group A. (d) Bimodality index b of the two groups of weights as a function of time. The memory induction by correlated spike input to group A stops at t = 60 min. Memory retention is studied during the following 60 minutes. A bimodality index close to one implies, as for the case with c = 0.2, that synaptic memory is well retained.
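The bimodality index b defined in section 3.3 can be computed directly from the two weight groups. The sketch below is a minimal implementation; the bisection search for the crossing point ŝ of the two gaussian densities is our own choice of numerical method, not specified in the text:

```python
import math

def bimodality_index(w_a, w_b):
    """Bimodality index b from section 3.3: overlap-based separation of two
    weight groups, each approximated by a gaussian.
    w_a: weights of the group that received correlated input (stronger)
    w_b: weights of the group that received uncorrelated input (weaker)"""
    mean = lambda xs: sum(xs) / len(xs)
    m_a, m_b = mean(w_a), mean(w_b)
    s_a = math.sqrt(mean([(x - m_a) ** 2 for x in w_a]))
    s_b = math.sqrt(mean([(x - m_b) ** 2 for x in w_b]))

    def density(x, m, s):
        # gaussian probability density
        return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

    # find the crossing point s_hat with m_b < s_hat < m_a by bisection
    lo, hi = m_b, m_a
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if density(mid, m_a, s_a) > density(mid, m_b, s_b):
            hi = mid
        else:
            lo = mid
    s_hat = 0.5 * (lo + hi)

    return 0.5 * (math.erf((m_a - s_hat) / (math.sqrt(2) * s_a))
                  + math.erf((s_hat - m_b) / (math.sqrt(2) * s_b)))
```

Well-separated groups give b close to one; strongly overlapping groups give a much smaller value.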
intracellular biochemical processing stages with different time constants, as suggested by Fusi, Drew, & Abbott (2005). Thus our synaptic update rule allows for retention of synaptic memories over timescales that are significantly longer than the memory induction time, as necessary for
T. Toyoizumi, J.-P. Pfister, K. Aihara, and W. Gerstner
any memory system. Nevertheless, synaptic memory in our model will eventually decay if random firing of pre- and postsynaptic neurons persists, in agreement with experimental results (Abraham, Logan, Greenwood, & Dragunow, 2002; Zhou, Tao, & Poo, 2003). We note that in the absence of presynaptic activity, the weights remain unchanged, since the decay of synaptic weights is conditioned on presynaptic spike arrival (see equation 2.20).

3.4 Receptive Field Development. Synaptic plasticity is thought to be involved not only in memory (Hebb, 1949), but also in the development of cortical circuits (Hubel & Wiesel, 1962; Katz & Shatz, 1996; von der Malsburg, 1973; Bienenstock et al., 1982; Miller et al., 1989) and, possibly, cortical reorganization (Merzenich, Nelson, Stryker, Cynader, Schoppmann, & Zook, 1984; Buonomano & Merzenich, 1998). To study how our synaptic update rule would behave during development, we used a standard paradigm of input selectivity (Yeung, Shouval, Blais, & Cooper, 2004), which is considered to be a simplified scenario of receptive field development. Our model neuron was stimulated by a gaussian firing rate profile spanning the 100 input synapses (see Figure 6a). The center of the gaussian was shifted every 200 ms to an arbitrarily chosen presynaptic neuron. In order to avoid border effects, neuron number 100 was considered a neighbor of neuron number 1; that is, we can visualize the presynaptic neurons as being located on a ring. Nine postsynaptic neurons with slightly different initial values of synaptic weights received identical input from the same set of 100 presynaptic neurons. During 1 hour of stimulus presentation, six of the nine neurons developed synaptic specialization, leading to input selectivity. The optimal stimulus for these six neurons varies (see Figures 6b and 6c), so that any gaussian stimulus at an arbitrary location excites at least one of the postsynaptic neurons.
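The stimulation paradigm just described (a gaussian rate profile on a ring of 100 presynaptic neurons, recentered on a random neuron every 200 ms) can be sketched as follows. The peak rate of 50 Hz and the group size of 100 are taken from the text; the width of the gaussian is an assumed value, since this section does not specify it:

```python
import math
import random

def ring_rate_profile(center, n_syn=100, peak_rate=50.0, width=10.0):
    """Firing rate (Hz) of each of n_syn presynaptic neurons arranged on a
    ring, for a gaussian profile centered on neuron `center`.  peak_rate and
    n_syn come from the text; `width` is an assumed illustrative value."""
    rates = []
    for j in range(n_syn):
        d = min(abs(j - center), n_syn - abs(j - center))  # ring distance
        rates.append(peak_rate * math.exp(-d * d / (2.0 * width * width)))
    return rates

def stimulus_centers(duration, dt_shift=0.2, n_syn=100, seed=0):
    """Shift the gaussian center to a randomly chosen presynaptic neuron
    every 200 ms, as in the input-selectivity paradigm of Figure 6."""
    rng = random.Random(seed)
    return [rng.randrange(n_syn) for _ in range(int(duration / dt_shift))]
```

The ring distance makes neuron 100 a neighbor of neuron 1, avoiding border effects exactly as described in the text.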
In other words, the six postsynaptic neurons have developed input selectivity with different but partially overlapping receptive fields. The distribution of synaptic weights for the selective neurons is bimodal with a first peak for very weak synapses (EPSP amplitudes less than 0.1 mV) and a second peak around EPSP amplitudes of 0.6 mV (see Figure 6d); the amplitude distribution of the unselective neurons is broader, with a single peak at around 0.4 mV. The number of postsynaptic neurons showing synaptic specialization depends on the total stimulation time and the strength of the stimulus. If the stimulation time is extended to 3 hours instead of 1 hour, all postsynaptic neurons become selective using the same stimulation parameters as before. However, if the maximal presynaptic firing rate at the center of the gaussian is reduced to 40 Hz instead of 50 Hz, only six of nine are selective after 3 hours of stimulation, and with a further reduction of the maximal rate to 30 Hz, only a single neuron is selective after 3 hours of stimulation (data not shown). We hypothesize that the coexistence of unselective and selective
[Figure 6, panels a-d, appears here. Axis labels include ν [Hz], Synaptic index, Output neuron, and w [mV].]
Figure 6: The synaptic update rule leads to input selectivity of the postsynaptic neuron. (a) Gaussian firing rate profile across the 100 presynaptic neurons. The center of the gaussian is shifted randomly every 200 ms. Presynaptic neurons fire stochastically and send their spikes to nine postsynaptic neurons. (b) Evolution of the synaptic weights of the nine postsynaptic neurons. Some neurons become specialized for a certain input pattern in the early phase of learning, others become specialized later, and the last three neurons have not yet become specialized. Since the input spike trains are identical for all nine neurons, the specialization is due to noise in the spike generator of the postsynaptic neurons. (c) Final synaptic weight values of the nine output neurons after 1 hour of stimulus presentation. (d) Distribution of EPSP amplitudes after 1 hour of stimulation for (top) the specialized output neurons 1, 2, . . . , 6; (middle) the nonspecialized neurons 7, 8, 9; and (bottom) all nine output neurons.
neurons during development could explain the broad distribution of EPSP amplitudes seen in some experiments (e.g., Sjöström et al., 2001, in rat visual cortex). For example, if we sample the synaptic distribution across all nine postsynaptic cells, we find the distribution shown at the bottom of Figure 6d. If the number of unspecific neurons were higher, the relative importance of synapses with EPSP amplitudes of less than 0.1 mV would diminish. If the number of specialized neurons increased, the distribution would turn into a clear-cut bimodal one, which would be akin to
sampling an ensemble of two-state synapses with all-or-none potentiation on a synapse-by-synapse basis (Petersen, Malenka, Nicoll, & Hopfield, 1998, in rat hippocampus).

4 Discussion

4.1 What Can We and What Can We Not Expect from Optimality Models? Optimality models can be used to clarify concepts, but they are unable to make specific predictions about molecular implementations. In fact, the synaptic update rule derived in this article shares functional features with STDP and classical long-term potentiation (LTP), but it is blind with respect to interesting questions such as the role of NMDA, kainate, endocannabinoid, or CaMKII in the induction and maintenance of potentiation and depression (Bliss & Collingridge, 1993; Bortolotto, Lauri, Isaac, & Collingridge, 2003; Frey & Morris, 1997; Lisman, 2003; Malenka & Nicoll, 1993; Sjöström, Turrigiano, & Nelson, 2004). If molecular mechanisms are the focus of interest, detailed mechanistic models of synaptic plasticity (Senn, Tsodyks, & Markram, 2001; Yeung et al., 2004) should be preferred. On the other hand, the mere fact that similar forms of LTP or LTD seem to be implemented across various neural systems by different molecular mechanisms leads us to speculate that common functional roles of synapses are potentially more important for understanding synaptic dynamics than the specific way that these functions are implemented. Ideally, optimality approaches such as the one developed in this article should be helpful to put seemingly diverse experimental or theoretical results into a coherent framework. We have listed in section 1 a couple of points, partially linked to experimental results and partially linked to earlier theoretical investigations. Our aim has been to connect these points and trace them back to a small number of basic principles. Let us return to our initial list and discuss the points in the light of the results of the preceding section.

4.2 Correlations.
Hebb (1949) postulated an increase in synaptic coupling in case of repeated coactivation of pre- and postsynaptic neurons as a useful concept for memory storage in recurrent networks. In our optimality framework defined by equation 2.3, correlation-dependent learning is not imposed explicitly but arises from the maximization of information transmission between pre- and postsynaptic neurons. Indeed, information transmission is possible only if there are correlations between pre- and postsynaptic neurons. Information transmission is maximized if these correlations are increased. Gradient ascent of the information term hence leads to a synaptic update rule that is sensitive to correlations between pre- and postsynaptic neurons (see equation 2.20). An increase of synaptic weights enhances these correlations and maximizes information transmission. We emphasize that in contrast to Hebb (1949), we do not invoke memory
formation and recall as a reason for correlation dependence, but information transmission. Similar to other learning rules (Linsker, 1986; Oja, 1982), the sensitivity of our update rule to correlations between pre- and postsynaptic neurons gives the synaptic dynamics a sensitivity to correlations in the input, as demonstrated in Figures 4 to 6.

4.3 Input Selectivity. During cortical development, cortical neurons develop input selectivity, typically quantified as the width of receptive fields (Hubel & Wiesel, 1962). As illustrated in the scenario of Figure 6, our synaptic update rule shows input selectivity and hence stands in the research tradition of many other studies (von der Malsburg, 1973; Bienenstock et al., 1982; Miller et al., 1989). Input selectivity in our model arises through the combination of the correlation sensitivity of synapses discussed above with the homeostatic term D in equation 2.3 (Toyoizumi et al., 2005a). Since the homeostatic term keeps the mean rate of the postsynaptic neuron close to a target rate, it effectively leads to a normalization of the total synaptic input, similar to the sliding threshold mechanism in the Bienenstock-Cooper-Munro rule (Bienenstock et al., 1982). Normalization of the total synaptic input through firing rate stabilization has also been seen in previous STDP models (Song et al., 2000; Kempter et al., 1999; Kempter, Gerstner, & van Hemmen, 2001). Its effect is similar to explicit normalization of synaptic weights, a well-known mechanism to induce input selectivity (von der Malsburg, 1973; Miller & MacKay, 1994), but our model does not need an explicit normalization step. So far we have studied only a single or a small number of independent postsynaptic neurons, but we expect that, as in many other studies (e.g., Erwin, Obermayer, & Schulten, 1995; Song & Abbott, 2001; Cooper et al., 2004), our synaptic update rule would yield feature maps if applied to a network of many weakly interacting cortical neurons.
4.4 Unimodal versus Bimodal Synapse Distributions. In several previous models of rate-based or spike-timing-based synaptic dynamics, synaptic weights evolved always toward a bimodal distribution, with some synapses close to zero and others close to maximal weight (Miller et al., 1989; Miller & MacKay, 1994; Kempter et al., 1999; Song & Abbott, 2001; Toyoizumi et al., 2005a). Thus, synapses specialize on certain features of the input, even if the input has weak or no correlation at all, which seems questionable from a functional point of view and disagrees with experimentally found distributions of EPSP amplitudes in rat visual cortex (Sjöström et al., 2001). As shown in Figures 4 to 6, an unspecific pattern of synapses with a broad distribution of EPSP amplitudes is stable with our synaptic update rule if correlations in the input are weak. Synaptic specialization (bimodal distribution) develops only if synaptic inputs show a high degree of correlations on the level of spikes or firing rates. Thus, neurons specialize only on highly significant input features, not in response to noise. A
similar behavior was noted in two recent models of spike-timing-dependent plasticity (Gütig, Aharonov, Rotter, & Sompolinsky, 2003; Yeung et al., 2004). Going beyond those studies, we also demonstrated stability of the specialized synapse distribution over some time, even if the amount of correlation through rate modulation is reduced after the stimulation that induced synaptic specialization (see Figure 4). Thus, for the same input characteristics, our model can show a unimodal or bimodal distribution of synaptic weights, depending on the stimulation history. Moreover, we note that in the state of bimodal distribution, the EPSP amplitudes at the depressed synapses are so small that in an experimental setting, they could easily remain undetected or be classified as a "silent" synapse (Kullmann, 2003). A large proportion of silent synapses has previously been shown to be consistent with optimal memory storage (Brunel, Hakim, Isope, Nadal, & Barbour, 2004). Furthermore, our results show that in a given population of cells, neurons with specialized synapses can coexist with others that have a broad and unspecific synapse distribution. We speculate that these nonspecialized neurons could then be recruited later for new stimuli, as hypothesized in earlier models of neural networks (Grossberg, 1987). In passing, we note that in models of synaptic plasticity with multiplicative weight dependence, the distribution of weights is always unimodal (Rubin, Lee, & Sompolinsky, 2001; van Rossum, Bi, & Turrigiano, 2000), but in this case, retention of synaptic memories is difficult.

4.5 Synaptic Memory. In the absence of presynaptic input, synapses in our model do not change significantly. Furthermore, our results show that even in the presence of random pre- and postsynaptic firing activity, synaptic memories can be retained over several hours, although a slow decay occurs.
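Postulate C2 (slower adaptation for weak synapses) can be illustrated with a toy simulation. The quartic saturating form of the update-rate factor alpha(w) below is borrowed from the sigma(w̄) expression in appendix B purely as an illustration, and the constant decay rate lam stands in for the average effect of the decay term in equation 2.20; neither is the paper's exact rule:

```python
def alpha(w, w_s=0.5):
    """Weight-dependent update-rate factor: adaptation is slow for weak
    synapses (postulate C2).  The quartic saturating form is an illustrative
    assumption borrowed from the sigma(w) expression in appendix B."""
    return w ** 4 / (w_s ** 4 + w ** 4)

def decay_trace(w0, lam=0.1, dt=0.01, t_end=50.0):
    """Euler integration of dw/dt = -alpha(w) * lam * w, a stand-in for
    slow weight decay under ongoing random pre/postsynaptic activity."""
    w, t = w0, 0.0
    while t < t_end:
        w -= alpha(w) * lam * w * dt
        t += dt
    return w
```

Because alpha(w) is tiny for small w, a weak synapse loses a much smaller fraction of its weight than a strong one over the same period, which is the mechanism behind the slow decay of the unpotentiated group.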
Essential for long-term memory maintenance under random spike arrival is the reduced adaptation speed for small values of synaptic weights as formulated in postulate C2. This postulate is similar in spirit to a recent theory of Fusi et al. (2005), with two important differences. First, we work with a continuum of synaptic states (characterized by the value of w), whereas Fusi et al. assume a cascade of a finite number of discrete internal synaptic states where transitions are unidirectional and characterized by different time constants. The scheme of Fusi et al. guarantees slow (i.e., not exponential) decay of memories, whereas in our case, decay is always exponential even if the time constant is weight dependent. Second, Fusi et al. use a range of different time constants for both the potentiated and the unpotentiated state, whereas in our model, it is sufficient to have a slower time constant for the unpotentiated state only. If the change of unpotentiated synapses is slow compared to homeostatic regulation of the mean firing rate, then the maintenance of the strong synapses is given by homeostasis (and also supported by information maximization). The fact that our model has been formulated in terms of a continuum of synaptic weights was for convenience only. Alternatively it is conceivable
to define a number of internal synaptic states that give rise to binary, or a small number of, synaptic weight values (Fusi, 2002; Fusi et al., 2005; Abarbanel, Talathi, Gibb, & Rabinovich, 2005). The actual number of synaptic states is unknown, with conflicting evidence (Petersen et al., 1998; Lisman, 2003).

4.6 Rate Dependence. Our results show that common rate modulation in one group of synapses strengthens these synapses if the modulation amplitude is strong enough. In contrast, an increase of rates to a fixed value of 40 Hz (without modulation) in one group of synapses, while another group of synapses receives background firing at 10 Hz, does not lead to synaptic specialization, only to a minor readjustment of weights (data not shown). For a comparison with experimental results, it is important to note that rate dependence is typically measured with extracellular stimulation of presynaptic pathways. We interpret repeated extracellular stimulation as a strong and common modulation of spike arrival probability at one group of synapses (as opposed to an increased rate of a homogeneous Poisson process). Under this interpretation, our results are qualitatively consistent with experiments.

4.7 STDP. Our results show that our synaptic update rule shares several features with STDP as found in experiments (Markram et al., 1997; Bi & Poo, 1998, 2001; Sjöström et al., 2001). The timescale of the potentiation part of the STDP function depends in our model on the duration of EPSPs. The timescale of the depression part is determined by the duration of EPSP suppression, in agreement with experiments (Froemke et al., 2005). Our model shows that the relative importance of LTP and LTD depends on the initial value of the synaptic weight in a way similar to that found in experiments (Bi & Poo, 1998). However, for EPSP amplitudes between 0.5 and 2 mV, the LTP part clearly dominates over LTD in our model, which seems to have less experimental support.
Also, the frequency dependence of STDP in our model is less pronounced than in experiments. Models of STDP have previously been formulated on a phenomenological level (Gerstner et al., 1996; Song et al., 2000; Gerstner & Kistler, 2002) or on a molecular level (Lisman, 2003; Senn et al., 2001; Yeung et al., 2004). Only recently have models derived from optimality concepts moved to the center of interest (Toyoizumi et al., 2005a; Toyoizumi, Pfister, Aihara, & Gerstner, 2005b; Bohte & Mozer, 2005; Chechik, 2003). There are important differences between the present model and the existing "optimal" models. Chechik (2003) used information maximization but limited his approach to static input patterns, while we consider arbitrary inputs. Bell and Parra (2005) minimize output entropy, and Bohte and Mozer (2005) maximize spike reliability, whereas we maximize the information between full input and output spike trains. None of these studies considered optimization under homeostatic and maintenance cost constraints.
After we introduced the homeostatic constraint in a previous study, which gave rise to a learning rule with several interesting properties (Toyoizumi et al., 2005a), we realized that this model did not exhibit properties of STDP in an in vitro situation without some additional assumptions. Indeed, as shown in Figure 3d, the STDP function derived from information maximization alone exhibits no depression in the absence of an additional weight-dependent cost term in the optimality function. The weight-dependent cost term introduced in this letter hence plays a crucial role in STDP since it shifts the STDP function to more negative values.

4.8 How Realistic Is a Weight-Dependent Cost Term? The weight-dependent cost term in the optimality criterion L depends quadratically on the value of the weights of all synapses converging onto the same postsynaptic neuron. This turns out to be equivalent to a "decay" term in the synaptic update (see the last term on the right-hand side of equation 2.20). Such decay terms are common in the theoretical literature (Bienenstock et al., 1982; Oja, 1982), but the question is whether such a decay term (leading to a slow depression of synapses) is realistic from a biological point of view. We emphasize that the decay term in our synaptic update rule is proportional to presynaptic activity. Thus, in contrast to existing models in the theoretical literature (Bienenstock et al., 1982; Oja, 1982), a synapse that receives no input is protected against slow decay. The specific form of the "decay" term considered in this article was such that synaptic weights decreased with each spike arrival, but presynaptic activity could also be represented in the decay term by the mean firing rate rather than spike arrival, with no qualitative changes to the results. An important aspect of our cost term is that only synapses that have recently been activated are at risk regarding weight decay.
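The property that only active synapses are at risk of decay can be made concrete in a toy sketch: the weight is reduced only at presynaptic spike arrivals, so a silent input pathway keeps its weight indefinitely. The multiplicative per-spike decrement lam used here is an illustrative stand-in for the decay term of equation 2.20, not the paper's exact rule:

```python
import random

def run_decay(w0, presyn_rate, lam=0.01, duration=600.0, seed=0):
    """Decay gated by presynaptic spikes: the weight is reduced by lam * w
    at each Poisson presynaptic spike arrival (rate presyn_rate, in Hz) and
    is left untouched otherwise."""
    if presyn_rate <= 0.0:
        return w0  # no presynaptic input, no decay
    rng = random.Random(seed)
    w, t = w0, 0.0
    while True:
        t += rng.expovariate(presyn_rate)  # next presynaptic spike
        if t > duration:
            break
        w -= lam * w  # decay conditioned on spike arrival
    return w
```

A synapse with no presynaptic activity retains its weight exactly, while an active pathway decays multiplicatively with the number of spikes it receives.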
We speculate that the weight-dependent cost term could, in a loose sense, be related to the number of plasticity factors that synapses require and compete for during the first hours of synaptic maintenance (Fonseca et al., 2004). According to the synaptic tagging hypothesis (Frey & Morris, 1997), only synapses that have been activated in the recent past compete for plasticity factors, while unpotentiated synapses do not suffer from decay (Fonseca et al., 2004). We emphasize that such a link of our cost term to the competition for plasticity factors is purely hypothetical. Many relevant details of tagging and competition for synaptic maintenance are omitted in our approach. 4.9 Predictions and Experimental Tests. In order to achieve synaptic memory that is stable over several hours, the reduced adaptation speed for weak synapses formulated in postulate C2 turns out to be essential. Thus, an essential assumption of our model is testable: for synapses with extremely
small EPSP amplitudes, in particular silent synapses, the induction of both LTP and LTD should require stronger stimulation or stimulation sustained over longer times, compared to synapses that are of average strength. This aspect is distinct from models (Gütig et al., 2003) that postulate for weak synapses a reduced adaptation speed for depression only, but maximal effect of potentiation. Thus, comparison of LTP induction for silent synapses (Kullmann, 2003) with that for average synapses should allow differentiating between the two models. In an alternative formulation, synaptic memory could also be achieved by making strong synapses resistant to further changes. As an aside, we note that the model of Fusi et al. (2005) assumes a reduced speed of transition between several internal synaptic states, so that a transition would not necessarily be visible as a change in synaptic weight. A second test of our model concerns the pattern of synaptic weights converging on the same postsynaptic neuron. Our results suggest that early in development, most neurons would show an unspecific synapse pattern, that is, a distribution of EPSP amplitudes with a single but broad peak, whereas later, a sizable fraction of neurons would show a pattern of synaptic specialization with some strong synapses and many silent ones, that is, a bimodal distribution of EPSP amplitudes. Ideally the effect would be seen by scanning all the synapses of individual postsynaptic neurons; it remains to be seen if modern imaging and staining methods will allow doing this. Alternatively, by electrophysiological methods, distributions of synaptic strengths could be built up by averaging over many synapses on different neurons (Sjöström et al., 2001). In this case, our model would predict that during development, the histogram of EPSP amplitudes would change in two ways (see Figure 6d): (1) the number of silent synapses increases so that the amplitude of the sharp peak at small EPSP amplitude grows, and (2) the location of the second, broader peak shifts to larger values of the EPSP amplitudes. Furthermore, and in contrast to other models where the unimodal distribution is unstable, the transition to a bimodal distribution depends in our model on the stimulation paradigm.

4.10 Limitations and Extensions. We emphasize that properties of our synaptic update rules have so far been tested only for single neurons in an unsupervised learning paradigm. Extensions are possible in several directions. First, instead of single neurons, a large, recurrent network could be considered. This could, on one side, further our understanding of the model properties in the context of cortical map development (Erwin et al., 1995), and, on the other side, scrutinize the properties of the synaptic update rule as a functional memory in recurrent networks (Amit & Fusi, 1994). Second, instead of unsupervised learning, where the synaptic update rule treats all stimuli alike whether they are behaviorally relevant or not, a reward-based learning scheme could be considered (Dayan & Abbott, 2001; Seung, 2003; Pfister et al., 2006). Behaviorally relevant situations can be taken into
account by optimizing reward instead of information transmission (Schultz, Dayan, & Montague, 1997).

Appendix A: Determination of the Parameter λ

The parameter λ is set to give Δw_j = 0 for large enough |t_pre − t_post| in a simulated STDP in vitro paradigm. In order to find the appropriate value of λ, we separately consider the effects of a presynaptic spike and a postsynaptic one, which is possible since they are assumed to occur at a large temporal distance. Since a postsynaptic spike alone does not change synaptic strength (C_j(t) = 0, always), we choose a λ that gives no synaptic change when a presynaptic spike alone is induced. For a given presynaptic spike at t_j^f, we have

C_j(t) = −g ∫_{t_j^f}^{t} e^{−(t−t′)/τ_C} e^{−(t′−t_j^f)/τ_m} dt′ = −g [τ_C τ_m/(τ_C − τ_m)] (e^{−(t−t_j^f)/τ_C} − e^{−(t−t_j^f)/τ_m}).   (A.1)

Since ρ̃ ≈ ρ_r and ρ̄ ≈ ρ_r in this in vitro setting, the factor B^post in equation 2.14 is approximated as

B^post(t) ≈ −w_j g e^{−(t−t_j^f)/τ_m}.   (A.2)

Hence, we find the effect of a presynaptic spike as

Δw = ∫_0^T (dw_j/dt) dt = w_j g² [τ_C τ_m/(τ_C − τ_m)] [τ_C τ_m/(τ_C + τ_m) − τ_m/2] − λ w_j.   (A.3)

The condition of no synaptic change gives λ = g² [τ_C τ_m/(τ_C − τ_m)] [τ_m τ_C/(τ_m + τ_C) − τ_m/2]. We used this λ in the numerical code.
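The condition Δw = 0 can be checked numerically: integrating B^post(t)C_j(t)/w_j over the pairing window should reproduce the closed-form λ, so that an isolated presynaptic spike leaves the weight unchanged. A minimal sketch, with arbitrary test values for g, τ_m, and τ_C (the presynaptic spike is placed at t_j^f = 0):

```python
import math

def lam_analytic(g, tau_m, tau_c):
    """Closed-form lambda from appendix A."""
    k = tau_c * tau_m / (tau_c - tau_m)
    return g * g * k * (tau_m * tau_c / (tau_m + tau_c) - tau_m / 2.0)

def lam_numeric(g, tau_m, tau_c, t_end=2.0, dt=1e-5):
    """Riemann-sum integral of B_post(t) * C_j(t) / w_j after a single
    presynaptic spike at t = 0, which should equal lambda."""
    k = tau_c * tau_m / (tau_c - tau_m)
    total, t = 0.0, 0.0
    while t < t_end:
        b_post = -g * math.exp(-t / tau_m)  # B_post(t) / w_j, equation A.2
        c_j = -g * k * (math.exp(-t / tau_c) - math.exp(-t / tau_m))  # A.1
        total += b_post * c_j * dt
        t += dt
    return total
```

With this λ, the positive drift accumulated by the pairing of B^post and C_j is exactly canceled by the decay term −λw_j in equation A.3.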
Appendix B: Weight-Dependent Evaluation of the Optimality Criterion

In Figures 4c and 4d the optimality criterion has been evaluated as a function of some artificial weight distribution. Specifically, values of synaptic weights have been chosen stochastically from two gaussian distributions with mean w̄_1 and standard deviation σ_1 for group 1 and w̄_2 and σ_2 for group 2. In order to account for differences in standard deviations due to the
weight-dependent update rate α(w), we chose σ(w̄) = 0.1 mV · w̄⁴/(w_s⁴ + w̄⁴), which gives a variance of synaptic weights in both groups that is consistent with the variance seen in Figure 4a. For a fixed set of synaptic weight values, the network is simulated during a trial time of 30 minutes, while the synaptic update rule is turned off, and the objective function L defined in equation 2.3 is evaluated using ρ̄_est from equation 2.21. The result is divided by the trial time t and plotted in Figures 4c and 4d in units of s⁻¹. The mesh size of mean synaptic strength is 0.04 mV. The one-dimensional plot in Figure 4d is taken along the direction w̄_2 = 0.8 mV − w̄_1.

Acknowledgments

This research is supported by the Japan Society for the Promotion of Science and a Grant-in-Aid for JSPS Fellows 03J11691 (T.T.), the Swiss National Science Foundation 200020-108097/1 (J.P.P.), and a Grant-in-Aid for Scientific Research on Priority Areas 17022012 from MEXT of Japan (K.A.).

References

Abarbanel, H., Talathi, S., Gibb, L., & Rabinovich, M. (2005). Synaptic plasticity with discrete state synapses. Physical Review E, 72, art. no. 031917, Part 1.
Abraham, W., Logan, B., Greenwood, J., & Dragunow, M. (2002). Induction and experience-dependent consolidation of stable long-term potentiation lasting months in the hippocampus. J. Neuroscience, 22, 9626–9634.
Amit, D., & Fusi, S. (1994). Learning in neural networks with material synapses. Neural Computation, 6, 957–982.
Atick, J., & Redlich, A. (1990). Towards a theory of early visual processing. Neural Computation, 4, 559–572.
Barlow, H. (1956). Retinal noise and absolute threshold. J. Opt. Soc. Am., 46, 634–639.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenbluth (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Bell, A. J., & Parra, L. C. (2005). Maximising sensitivity in a spiking network. In L. K. Saul, Y. Weiss, & L.
Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 121–128). Cambridge, MA: MIT Press.
Bell, A., & Sejnowski, T. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bi, G., & Poo, M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bi, G., & Poo, M. (2001). Synaptic modification of correlated activity: Hebb's postulate revisited. Ann. Rev. Neurosci., 24, 139–166.
Bienenstock, E., Cooper, L., & Munro, P. (1982). Theory of the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48.
Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–39.
Bliss, T., & Lomo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol., 232, 351–356.
Bohte, S. M., & Mozer, M. C. (2005). Reducing spike train variability: A computational theory of spike-timing dependent plasticity. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 201–208). Cambridge, MA: MIT Press.
Bortolotto, Z. A., Lauri, S., Isaac, J. T. R., & Collingridge, G. L. (2003). Kainate receptors and the induction of mossy fibre long-term potentiation. Phil. Trans. R. Soc. Lond. B: Biological Sciences, 358, 657–666.
Britten, K., Shadlen, M., Newsome, W., & Movshon, J. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. J. Neuroscience, 12, 4745–4765.
Brunel, N., Hakim, V., Isope, P., Nadal, J.-P., & Barbour, B. (2004). Optimal information storage and the distribution of synaptic weights: Perceptron versus Purkinje cell. Neuron, 43, 745–757.
Buonomano, D., & Merzenich, M. (1998). Cortical plasticity: From synapses to maps. Annual Review of Neuroscience, 21, 149–186.
Chechik, G. (2003). Spike-timing-dependent plasticity and relevant mutual information maximization. Neural Computation, 15, 1481–1510.
Cooper, L., Intrator, N., Blais, B., & Shouval, H. Z. (2004). Theory of cortical plasticity. Singapore: World Scientific.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
de Ruyter van Steveninck, R. R., & Bialek, W. (1995). Reliability and statistical efficiency of a blowfly movement-sensitive neuron. Phil. Trans. R. Soc. Lond. Ser. B, 348, 321–340.
Dudek, S. M., & Bear, M. F. (1992).
Homosynaptic long-term depression in area CA1 of hippocampus and effects of N-methyl-D-aspartate receptor blockade. Proc. Natl. Acad. Sci. USA, 89, 4363–4367. Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Comput., 7, 425–468. Fonseca, R., N¨agerl, U., Morris, R., & Bonhoeffer, T. (2004). Competition for memory: Hippocampal LTP under the regimes of reduced protein synthesis. Neuron, 44, 1011–1020. Frey, U., & Morris, R. (1997). Synaptic tagging and long-term potentiation. Nature, 385, 533–536. Froemke, R. C., Poo, M.-M., & Dan, Y. (2005). Spike-timing-dependent synaptic plasticity depends on dendritic location. Nature, 434, 221–225. Fusi, S. (2002). Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biol. Cybern., 87, 459–470. Fusi, S., Annunziato, M., Badoni, D., Salamon, A., & Amit, D. J. (2000). Spike-driven synaptic plasticity: Theory, simulation, VLSI implementation. Neural Computation, 12, 2227–2258.
Optimality Model of Unsupervised Spike-Timing-Dependent Plasticity
669
Fusi, S., Drew, P., & Abbott, L. (2005). Cascade models of synaptically stored memories. Neuron, 45, 599–611. Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78. Gerstner, W., & Kistler, W. K. (2002). Spiking neuron models. Cambridge: Cambridge University Press. Grossberg, S. (1987). The adaptive brain I. Dordrecht: Elsevier. ¨ Gutig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through non-linear temporally asymmetry Hebbian plasticity. J. Neuroscience, 23, 3697–3714. Hebb, D. O. (1949). The organization of behavior. New York: Wiley. Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley. Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. (London), 160, 106– 154. Katz, L. C., & Shatz, C. J. (1996). Synaptic activity and the construction of cortical circuits. Science, 274, 1133–1138. Kelso, S. R., Ganong, A. H., & Brown, T. H. (1986). Hebbian synapses in hippocampus. Proc. Natl. Acad. Sci. USA, 83, 5326–5330. Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514. Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709– 2741. Kullmann, D. M. (2003). Silent synapses: What are they telling us about long-term potentiation? Phil. Trans. R. Soc. Lond. B: Biological Sciences, 358, 727–733. Laughlin, S. (1981). A simple coding procedure enhances a neuron’s information capacity. Z. Naturforschung, 36, 910–912. Laughlin, S., de Ruyter van Steveninck, R. R., & Anderson, J. (1998). The metabolic cost of neural information. Nature Neuroscience, 1, 36–41. Levy, W., & Baxter, R. (2002). 
Energy-efficient neuronal computation via quantal synaptic failures. J. Neuroscience, 22, 4746–4755. Linsker, R. (1986). From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. USA, 83, 7508–7512. Linsker, R. (1989). How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Computation, 1(3), 402– 411. Lisman, J. (1989). A mechanism for Hebb and anti-Hebb processes underlying learning and memory. Proc. Natl. Acad. Sci. USA, 86, 9574–9578. Lisman, J. (2003). Long-term potentiation: Outstanding questions and attempted synthesis. Phil. Trans. R. Soc. Lond B: Biological Sciences, 358, 829–842. Malenka, R. C., Kauer, J., Zucker, R., & Nicoll, R. A. (1988). Postsynaptic calcium is sufficient for potentiation of hippocampal synaptic transmission. Science, 242, 81–84. Malenka, R. C., & Nicoll, R. A. (1993). NMDA-receptor-dependent plasticity: Multiple forms and mechanisms. Trends Neurosci., 16, 480–487.
670
T. Toyoizumi, J.-P. Pfister, K. Aihara, and W. Gerstner
¨ Markram, H., Lubke, J. L, Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postysnaptic AP and EPSP. Science, 275, 213–215. Merzenich, M., Nelson, R., Stryker, M., Cynader, M., Schoppmann, A., & Zook, J. (1984). Somatosensory cortical map changes following digit amputation in adult monkeys. J. Comparative Neurology, 224, 591–605. Miller, K., Keller, J. B., & Stryker, M. P. (1989). Ocular dominance column development: Analysis and simulation. Science, 245, 605–615. Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126. Oja, E. (1982). A simplified neuron model as a principal component analyzer. J. Mathematical Biology, 15, 267–273. Petersen, C., Malenka, R., Nicoll, R., & Hopfield, J. (1998). All-or-none potentiation of CA3-CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732–4737. Pfister, J.-P., Toyoizumi, T., Barber, D., & Gerstner, W. (2006). Optimal spike-timing dependent plasticity for precise action potential firing in supervised learning. Neural Computation, 18, 1309–1339. Rubin, J., Lee, D. D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Physical Review Letters, 86, 364–367. Schultz, W., Dayan, P., & Montague, R. (1997). A neural substrate for prediction and reward. Science, 275, 1593–1599. Senn, W., Tsodyks, M., & Markram, H. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Computation, 13, 35–67. Seung, H. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40, 1063–1073. ¨ om, ¨ Sjostr P., Turrigiano, G., & Nelson, S. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164. ¨ om, ¨ Sjostr P., Turrigiano, G., & Nelson, S. (2004). Endocannabinoid-dependent neocortical layer-5 LTD in the absence of postsynaptic spiking. J. 
Neurophysiol., 92, 3338–3343. Song, S., & Abbott, L. (2001). Column and map development and cortical re-mapping through spike-timing dependent plasticity. Neuron, 32, 339–350. Song, S., Miller, K., & Abbott, L. (2000). Competitive Hebbian learning through spike-time-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926. Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005a). Generalized Bienenstock-Cooper-Munro rule for spiking neurons that maximizes information transmission. Proc. National Academy Sciences (USA), 102, 5239–5244. Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005b). Spike-timing dependent plasticity and mutual information maximization for a spiking neuron model. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1409–1416). Cambridge, MA: MIT Press. Turrigiano, G., & Nelson, S. (2004). Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5, 97–107. van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neuroscience, 20, 8812–8821. von der Malsburg, C. (1973). Self-organization of orientation selective cells in the striate cortex. Kybernetik, 14, 85–100.
Optimality Model of Unsupervised Spike-Timing-Dependent Plasticity
671
Yeung, L. C., Shouval, H., Blais, B., & Cooper, L. (2004). Synaptic homeostasis and input selectivity follow from a model of calcium dependent plasticity. Proc. Nat. Acad. Sci. USA, 101, 14943–14948. Zhang, L., Tao, H., Holt, C., Harris, W. A., & Poo, M.-M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44. Zhou, Q., Tao, H., & Poo, M. (2003). Reversal and stabilization of synaptic modifications in the developing visual system. Science, 300, 1953–1957.
Received January 13, 2006; accepted June 22, 2006.
LETTER
Communicated by Robert Kass

Nonparametric Modeling of Neural Point Processes via Stochastic Gradient Boosting Regression

Wilson Truccolo
[email protected]
John P. Donoghue∗
[email protected]
Neuroscience Department, Brown University, Providence, RI 02912, U.S.A.
Statistical nonparametric modeling tools that enable the discovery and approximation of functional forms (e.g., tuning functions) relating neural spiking activity to relevant covariates are desirable tools in neuroscience. In this article, we show how stochastic gradient boosting regression can be successfully extended to the modeling of spiking activity data while preserving their point process nature, thus providing a robust nonparametric modeling tool. We formulate stochastic gradient boosting in terms of approximating the conditional intensity function of a point process in discrete time and use the standard likelihood of the process to derive the loss function for the approximation problem. To illustrate the approach, we apply the algorithm to the modeling of primary motor and parietal spiking activity as a function of spiking history and kinematics during a two-dimensional reaching task. Model selection, goodness of fit via the time rescaling theorem, model interpretation via partial dependence plots, ranking of covariates according to their relative importance, and prediction of peri-event time histograms are illustrated and discussed. Additionally, we use the tenfold cross-validated log likelihood of the modeled neural processes (67 cells) to compare the performance of gradient boosting regression to two alternative approaches: standard generalized linear models (GLMs) and Bayesian P-splines with Markov chain Monte Carlo (MCMC) sampling. In our data set, gradient boosting outperformed both Bayesian P-splines (in approximately 90% of the cells) and GLMs (100%). Because of its good performance and computational efficiency, we propose stochastic gradient boosting regression as an off-the-shelf nonparametric tool for initial analyses of large neural data sets (e.g., more than 50 cells; more than 10^5 samples per cell) with corresponding multidimensional covariate spaces (e.g., more than four covariates). In the cases where a
∗ J. P. Donoghue is a cofounder and shareholder in Cyberkinetics, Inc., a neurotechnology company that is developing neural prosthetic devices.
Neural Computation 19, 672–705 (2007)
functional form might be amenable to a more compact representation, gradient boosting might also lead to the discovery of simpler, parametric models.

1 Introduction

Modeling neural spiking activity as a function of covariates, such as spiking history, neural ensemble activity, stimuli, and behavior, is a developing field in neuroscience, with many challenges and problems yet to be overcome. Recent work has provided a solid formulation for the parametric modeling of spiking activity, while preserving its point process nature, based on maximum likelihood analysis (Brillinger, 1988; Chornoboy, Schramm, & Karr, 1988; Brown, Barbieri, Eden, & Frank, 2003; Paninski, 2004; Truccolo, Eden, Fellows, Donoghue, & Brown, 2005; Okatan, Wilson, & Brown, 2005). However, there remains a need for new statistical tools that enable the discovery and approximation of functional forms relating spiking activity and covariates, rather than simply imposing a priori parametric forms. Previous attempts to address this challenge have relied on cardinal cubic spline modeling (Frank, Eden, Solo, Wilson, & Brown, 2002), Zernike polynomials (Barbieri et al., 2002), smoothing splines in the context of generalized additive models (Kass & Ventura, 2001), and Bayesian free-knot spline smoothing (DiMatteo, Genovese, & Kass, 2001; Kaufman, Ventura, & Kass, 2005). While these attempts have successfully generated new insights in their applications, there is still a need for new approaches that combine high fitting performance with computational efficiency in analyses of large neural data sets.
For example, motor neurophysiology studies using multiple multielectrode arrays now permit the simultaneous recording of hundreds of cells, often with more than 10^5 samples per cell, and at the same time generate measurements of behaviorally relevant high-dimensional covariates including arm kinematics in joint space, joint torques, and electromyograms (EMGs) (Hatsopoulos, Joshi, & O’Leary, 2004; Truccolo, Vargas, Philip, & Donoghue, 2005). Off-the-shelf, computationally efficient modeling algorithms that require a minimum amount of fine-tuning, while at the same time maintaining good fit, are required for initial analyses of these data sets. Here, we approach this challenge from a different perspective. Recent developments in the statistical and machine learning communities have provided powerful new tools for nonparametric modeling (Hastie, Tibshirani, & Friedman, 2001; Vapnik, 1998). In the particular case of interest here, this progress originated from the insight gained by relating the nonparametric regression problem to numerical function approximation via gradient descent methods (Breiman, 1997; Friedman, 1999). This insight has led to the development of stochastic gradient boosting regression (Friedman, 2001, 2002). Rather than optimizing in parameter space, as performed in parametric modeling, the approximation problem is formulated
in terms of optimization in function space. Intuitively, stochastic gradient boosting can be seen as a regularized fitting of the residuals in a gradual, iterative fashion. This approach has provided both a new theoretical perspective for understanding nonparametric modeling algorithms such as ensembles of experts and other boosting algorithms,¹ and a general off-the-shelf regression tool (Hastie et al., 2001). Here, we show how stochastic gradient boosting can be successfully extended to the statistical modeling of neural point processes. We introduce the neural point process modeling problem in the next section. Succinctly, a point process joint probability can be completely specified by its conditional intensity function. The modeling problem can thus be expressed in terms of approximating the conditional intensity of the process as a function of the neuron’s own spiking history and other covariates of interest. We then formulate stochastic gradient boosting in the context of conditional intensity function approximation. Because stochastic gradient boosting allows arbitrary (differentiable) loss functions, our application uses the standard likelihood function for general point processes to derive the loss function for the conditional intensity function optimization problem. To illustrate the theory, we apply the algorithm to spike train data recorded from primary motor (M1) and parietal (5d) cortices in a behaving monkey performing a standard center-out reaching task. Goodness-of-fit assessment of the function approximation is done by applying the time-rescaling theorem (Brown et al., 2001). Additional tools for model interpretation and selection are also discussed and illustrated. Finally, we perform an extensive

¹ Most current investigations of boosting approaches originated with the AdaBoost algorithm proposed by Freund and Schapire (1995) for classification problems. The relations between AdaBoost and additive logistic regression were first established by Friedman, Hastie, and Tibshirani (2000). Relations between boosting, maximum likelihood estimation, and Kullback-Leibler divergence were explored by Lebanon and Lafferty (2002) in the context of exponential family models. Breiman (1997, 1999), Friedman (1999, 2001), and Mason, Baxter, Bartlett, and Frean (1999) were perhaps the first to formulate boosting in terms of greedy function approximation via gradient descent and to extend it to several types of loss functions. Consistency analyses of boosting algorithms have been presented by Lugosi and Vayatis (2004) and Zhang (2004). Of particular interest here is the result in Rosset, Zhu, and Hastie (2004) showing that boosting trees approximately (and in some cases exactly) minimize the loss criterion with an L1-norm constraint on the estimated parameter vector. Relations between boosting, L1-constrained regression methods (e.g., the lasso), and sparse representations have also been discussed in Friedman, Hastie, Rosset, Tibshirani, and Zhu (2004) and in Hastie et al. (2001). Additionally, an interpretation of boosting integrating its formulation in terms of both gradient function optimization and margin maximization (Schapire, Freund, Bartlett, & Lee, 1998) has been provided by Rosset et al. (2004). In comparison to other nonparametric approaches, stochastic gradient boosting has been shown (Friedman, 2001) to outperform multivariate adaptive regression splines (MARS; Friedman, 1991). In our exposition and approach to modeling neural point processes (section 3), we have followed Friedman’s statistical perspective.
comparison between stochastic gradient boosting regression and two alternative approaches: standard log-linear models in the parameters (GLMs) and Bayesian P-splines fitted via Markov chain Monte Carlo (MCMC) sampling. We use the tenfold cross-validated log likelihoods of the neural point processes as the criterion for the performance comparison on a 67-cell data set.
2 Neural Point Processes: Stochastic Formulation

A neural point process can be completely characterized by its conditional intensity function (CIF; Daley & Vere-Jones, 2003), defined as

$$\lambda(t \mid x(t)) = \lim_{\Delta \to 0} \frac{P[N(t + \Delta) - N(t) = 1 \mid x(t)]}{\Delta}, \qquad (2.1)$$

where P[· | ·] is a conditional probability and N(t) denotes the number of spikes counted in the time interval (0, t] for t ∈ (0, T]. The term x(t) represents covariates of interest: the spiking history up to (but not including) time t and stimulus-behavioral covariates at varied time lags with respect to the spiking activity. The CIF is a strictly positive function that gives a history-dependent generalization of the rate function of a Poisson process. From equation 2.1, for small Δ, the product λ(t | x(t))Δ gives approximately the neuron’s spiking probability in the time interval (t, t + Δ].

A discrete time representation of the point process is often preferable. To obtain this representation, we choose a large integer K and partition the observation interval (0, T] into K subintervals {(t_{k−1}, t_k]}_{k=1}^{K}, each of length Δ = T K^{−1}. We choose K large so that there is at most one spike per subinterval. Let N_k = N(t_k), N_{1:k} = N_{0:t_k}, and x_k = x(t_k). Because we choose K large, the differences ΔN_k = N_k − N_{k−1} result in a binary time series of zero and one counts, which corresponds to the usual representation of spike trains. For any discrete time neural point process, the joint probability density of the spike train, that is, the joint probability density of J spike times 0 < u_1 < u_2 < ... < u_J ≤ T in (0, T] conditioned on history and covariates, can be expressed as (Daley & Vere-Jones, 2003; Truccolo, Eden et al., 2005)

$$P(N_{1:K} \mid x_{1:K}) = \exp\left\{ \sum_{k=1}^{K} \Delta N_k \log[\lambda(t_k \mid x_k)\Delta] - \sum_{k=1}^{K} \lambda(t_k \mid x_k)\Delta \right\} + o(\Delta^{J}). \qquad (2.2)$$
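As an illustration, the discrete-time log likelihood implied by equation 2.2 is simple to evaluate for a binned spike train. The sketch below (toy data; the function name is ours, not from the article) ignores the o(Δ^J) correction term:

```python
import math

def log_likelihood(dN, lam, delta):
    """Discrete-time point-process log likelihood (cf. equation 2.2):
    sum_k dN_k * log(lam_k * delta) - sum_k lam_k * delta."""
    spike_term = sum(n * math.log(l * delta) for n, l in zip(dN, lam) if n)
    rate_term = sum(l * delta for l in lam)
    return spike_term - rate_term

# Toy example: five 1 ms bins, constant CIF of 10 spikes/s, spikes in bins 2 and 5.
dN = [0, 1, 0, 0, 1]
lam = [10.0] * 5
ll = log_likelihood(dN, lam, 0.001)
```

A model assigning a higher value of this log likelihood to held-out spike trains is the better fit, which is the criterion used in the cross-validated comparisons described above.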
The likelihood for the point process is readily available from equation 2.2.² This probability distribution belongs to the exponential family, with the natural parameter given by

$$\theta_k = \log[\lambda(t_k \mid x_k)\Delta] = F(x_k). \qquad (2.3)$$
The problem we address here is the approximation of F(x) above. We present a new approach to this problem, which consists of treating this approximation as a regularized optimization in function space and using stochastic gradient boosting regression to solve it.

3 Optimization in Function Space and Stochastic Gradient Boosting

Consider the random variables (y, x). In general, y and x = [x_1, ..., x_p] can be any continuous, discrete, or mixed set of variables. In the specific case of a discrete time point process, y is a binary sequence. The target function to be approximated is the function F*(x), which relates the covariate vector x to y while minimizing the expected value of some arbitrary (differentiable) loss function L(y, F(x)) over the joint distribution of all (y, x) variables, that is,

$$F^* = \arg\min_{F} E_{y,x}\,L(y, F(x)) = \arg\min_{F} E_x\big[E_y\big(L(y, F(x)) \mid x\big)\big]. \qquad (3.1)$$
This is an optimization problem in function space. In other words, we consider F(x) evaluated at each point x to be a “parameter” and seek to minimize

$$E_y[L(y, F(x)) \mid x] \qquad (3.2)$$

at each individual x directly with respect to F(x), observing that pointwise minimization of equation 3.2 is equivalent to minimization of equation 3.1. Numerical optimization methods are often required to solve this type of problem. Commonly they involve an iterative procedure where the approximation to F*(x) takes the form

$$F^*(x) \approx \hat{F}(x) = \sum_{m=0}^{M} f_m(x), \qquad (3.3)$$
² In section 7, our data consist of multiple realizations or trials of the neural point process. The extension of equation 2.2 to reflect multiple realizations is straightforward. However, for notational simplicity, we leave the existence of multiple realizations implicit.
where f_0(x) is an initial guess and {f_m(x)}_{m=1}^{M} are successive increments (steps or boosts), each dependent on the preceding steps. In the particular case of function approximation via steepest descent, the mth step has a (steepest-descent) direction defined by the negative gradient vector g_m(x) of the loss function with respect to F(x) and a magnitude ρ_m obtained via line search. Given finite data samples {y_k, x_k}_{k=1}^{K}, smoothing is required to improve generalization of the approximation over nonsampled covariate values. This is achieved by introducing a parametric model at each of the successive steps, such that the approximation in equation 3.3 becomes

$$\hat{F}(x) = \hat{F}_0(x) + \sum_{m=1}^{M} \beta_m h(x; a_m), \qquad (3.4)$$
where h(x; a) is a regressor or base learner, β_m and a_m = {a_1, a_2, ...} are parameters, and F̂_0(x) is a constant. A “greedy stagewise” approach for solving equation 3.1 under the assumptions in equations 3.3 and 3.4 results (see section A.1 in the appendix) in searching, at each iteration step, for the member of the parameterized class h(x; a_m) that produces the vector h_m = {h(x_k; a_m)}_{k=1}^{K} ∈ R^K most parallel to the negative gradient vector, that is, g_m = {g_{mk}(x_k)}_{k=1}^{K} ∈ R^K. For most choices of base learners, the most parallel vector can be obtained by solving the least-squares problem

$$a_m = \arg\min_{a, \beta} \sum_{k=1}^{K} [g_{mk}(x_k) - \beta h(x_k; a)]^2. \qquad (3.5)$$
This gives the direction for the steepest-descent step. If needed, other minimization criteria can be used in place of equation 3.5. The magnitude of the step is obtained by the line search

$$\rho_m = \arg\min_{\rho} \sum_{k=1}^{K} L\big(y_k, \hat{F}_{m-1}(x_k) + \rho h(x_k; a_m)\big). \qquad (3.6)$$

Therefore, the update rule for the approximation at each step becomes

$$\hat{F}_m(x) = \hat{F}_{m-1}(x) + \rho_m h(x; a_m). \qquad (3.7)$$

Fitting the regressor to the negative gradient effectively works as a gradient smoothing. For the case of squared loss functions L(y, F(x)) = ½[y − F(x)]², the negative gradient is simply the usual residual, and the procedure corresponds to residual fitting.
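This residual-fitting loop (equations 3.3 to 3.7) can be sketched in a few lines of Python for the squared-loss case, using a depth-1 regression tree (a “stump”) as the base learner. All function names and data below are ours and purely illustrative; for squared loss, the least-squares node means of the stump already solve the line search of equation 3.6, so ρ_m = 1:

```python
def fit_stump(x, r):
    """Least-squares depth-1 regression tree: choose the split s minimizing the
    within-half squared error of the residuals r; predict each half's mean."""
    best = None
    for s in sorted(set(x))[:-1]:  # the largest value would leave an empty right half
        left = [ri for xi, ri in zip(x, r) if xi <= s]
        right = [ri for xi, ri in zip(x, r) if xi > s]
        cl, cr = sum(left) / len(left), sum(right) / len(right)
        err = (sum((ri - cl) ** 2 for ri in left)
               + sum((ri - cr) ** 2 for ri in right))
        if best is None or err < best[0]:
            best = (err, s, cl, cr)
    _, s, cl, cr = best
    return lambda xi: cl if xi <= s else cr

def boost(x, y, M=50):
    """Gradient boosting with squared loss: each step fits the current residuals."""
    F0 = sum(y) / len(y)                         # initial guess f_0(x): best constant
    F = [F0] * len(x)
    learners = []
    for _ in range(M):
        r = [yk - Fk for yk, Fk in zip(y, F)]    # negative gradient = residuals
        h = fit_stump(x, r)                      # smoothed gradient direction
        learners.append(h)
        F = [Fk + h(xi) for Fk, xi in zip(F, x)] # update rule, equation 3.7
    return lambda xi: F0 + sum(h(xi) for h in learners)

# Toy example: recover a step function from 10 noiseless samples.
x = [i / 9 for i in range(10)]
y = [0.0] * 5 + [1.0] * 5
predict = boost(x, y)
```

The additive ensemble of stumps reproduces the step function; the shrinkage and subsampling refinements of sections 3.1 and 3.2 modify only the update step of this loop.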
Figure 1: Loss function dependence on the number of iterations (trees) and the shrinkage parameter. The negative of the log likelihood for the point process (cell A) is computed on validation data for three different choices of shrinkage value α (0.1, 0.01, and 0.001) and as a function of the number of trees (0–10,000) in the approximation. The dotted line shows the value of the loss function computed on the training data for α = 0.001.
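The model selection illustrated in Figure 1 amounts to tracking a validation loss across boosting iterations and stopping where it is minimized. A schematic sketch, with a hypothetical loss trace and a function name of our own:

```python
def best_iteration(validation_losses):
    """Return the iteration count (1-based) minimizing the validation loss."""
    m = min(range(len(validation_losses)), key=validation_losses.__getitem__)
    return m + 1

# Hypothetical validation-loss trace: decreases, then begins to overfit.
trace = [0.094, 0.090, 0.086, 0.083, 0.082, 0.0825, 0.084]
print(best_iteration(trace))  # 5
```

In practice the trace would be recomputed after each boosting step, so training can simply be halted once the validation loss stops improving.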
3.1 Regularization. In order to improve generalization and avoid overfitting, the approximation can be constrained by the choice of the number of steps or iterations and by weighting the contribution of each base learner. A shrinkage strategy is adopted (Copas, 1983) for the weighting; that is, the update rule at each iteration becomes

$$\hat{F}_m(x) = \hat{F}_{m-1}(x) + \alpha\, \rho_m h(x; a_m), \quad 0 < \alpha \le 1. \qquad (3.8)$$

The actual value for α may depend on the data (see Figure 1), but usually small values around α = 0.001 are chosen. With a fixed shrinkage parameter, an estimate of the optimal number of iterations is obtained by cross-validation on a test data set, where we assess whether additional iteration steps significantly improve the minimization of the loss function. As we mentioned in section 1, consistency and convergence analyses have recently been provided by Lugosi and Vayatis (2004) and Zhang (2004). In addition, Rosset et al. (2004) have shown that boosting trees regression approximately (and in some cases exactly) minimizes its loss criterion with an L1-norm constraint on the estimated parameter vector. Some initial
discussion relating boosting, L1-constrained regression methods, and sparse representation has appeared in Friedman et al. (2004) and Hastie et al. (2001).

3.2 Variance Reduction via Stochastic Subsampling. Bootstrap subsampling has been shown to reduce the variance of iterative approximations such as those used in bagging procedures (Breiman, 1996, 1999). In a similar fashion, stochastic subsampling can be adopted (Friedman, 2002) to reduce the variance of gradient boosting estimates. Specifically, at each iteration, a subsample of the training data is drawn at random (without replacement) from the full training data set. This randomly selected subsample is then used to fit the base learner and compute the model update for the current iteration. Previous analysis suggests that a good subsample size is about half the size of the total available training data (Friedman, 2002).

3.3 Stochastic Gradient Boosting: General Algorithm. Under the general formulation given above, the algorithm for stochastic gradient boosting can be expressed as:

1. Set the initial value F̂_0(x) = arg min_μ Σ_{k=1}^{K} L(y_k, μ).
2. For m = 1 to M do:
3. Select a random subsample of size K̂, {y_k̂, x_k̂}_{k̂=1}^{K̂}, without replacement.
4. Compute the components of the negative gradient vector from the subsample:
   g_{mk̂} = −[∂L(y_k̂, F(x_k̂)) / ∂F(x_k̂)]_{F(x) = F̂_{m−1}(x)},  k̂ = 1, ..., K̂.
5. Fit h(x; a) to the negative gradient by solving a_m = arg min_{a, β} Σ_{k̂=1}^{K̂} [g_{mk̂} − β h(x_k̂; a)]².
6. ρ_m = arg min_ρ Σ_{k̂=1}^{K̂} L(y_k̂, F̂_{m−1}(x_k̂) + ρ h(x_k̂; a_m)).
7. Update F̂_m(x) = F̂_{m−1}(x) + α ρ_m h(x; a_m).

3.4 Choosing the Base Learner: Regression Trees and Interaction Order in the Approximation. Thus far, we have not defined the type of parameterized model h(x; a). In principle, we could boost GLMs, spline models, or many other different models. However, commonly regression trees are
chosen, based on two motivations. First, approximation using an ensemble of regression trees is a simple yet powerful approach. In this case, each base learner h(x; a) is an R-terminal-node regression tree. At each mth iteration, a regression tree partitions the space of all joint values of the covariates x into R disjoint sets or regions {S_{rm}}_{r=1}^{R} and predicts a separate constant value in each of these regions, such that

$$h(x; a_m) = \sum_{r=1}^{R} c_{rm}\, 1_{\{x \in S_{rm}\}}, \qquad (3.9)$$

where c_{rm} is the terminal node mean. Therefore, the parameter a_m specifies the splitting variables, the splitting locations that define the disjoint regions {S_{rm}}_{r=1}^{R}, and the terminal node means of the tree at the mth iteration. To fit a regression tree at each iteration, we need to decide on the splitting variables and split points and also what topology (shape) the tree should have. We follow the CART algorithm (Breiman, Friedman, Olshen, & Stone, 1984). In this case, the partition of the space is restricted to recursive binary partitions (i.e., at each node, we partition the current variable into two disjoint spaces: left and right of the splitting point). Binary splits are preferred to multiway splits because the latter tend to fragment the data too quickly, leaving insufficient data at the next level down. Additionally, multiway splits can be achieved in gradient boosting trees by a series of binary splits, each occurring in one of the iteration steps (Hastie et al., 2001). Finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible, so a greedy algorithm is adopted. Starting with all of the data, consider a splitting variable j and split point s, and define the pair of half-planes

$$S_1(j, s) = \{x \mid x_j \le s\} \quad \text{and} \quad S_2(j, s) = \{x \mid x_j > s\}. \qquad (3.10)$$
Then we seek the splitting variable j and split point s that solve

$$\min_{j, s}\left[\min_{c_1} \sum_{x_k \in S_1(j, s)} (y_k - c_1)^2 + \min_{c_2} \sum_{x_k \in S_2(j, s)} (y_k - c_2)^2\right]. \qquad (3.11)$$

For any choice of j and s, the inner minimization is solved by

$$\hat{c}_1 = \mathrm{ave}(y_k \mid x_k \in S_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \mathrm{ave}(y_k \mid x_k \in S_2(j, s)). \qquad (3.12)$$

The computation of the terminal node mean depends on the distribution adopted for the data (see equation 4.4). For each splitting variable, the determination of the split point s can be done very quickly, and hence by scanning through all of the inputs, determination of the best pair (j, s)
is feasible. Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of them; this process is then repeated on all of the resulting regions.

The second motivation for choosing regression trees is that by controlling the number of terminal nodes, it is also possible to control the interaction order in the approximation. Usually, in order for F̂(x) to approximate the target F*(x) well, it is important to consider how close the interaction order of F̂(x) is required to be to that of F*(x). Interaction order can be thought of as follows. Suppose that the target function can be expressed in terms of the expansion

$$F^*(x) = \sum_{i} F_i(x_i) + \sum_{ij} F_{ij}(x_i, x_j) + \sum_{ijk} F_{ijk}(x_i, x_j, x_k) + \ldots \qquad (3.13)$$
A tree with splits on only one of the covariates falls in the first sum of equation 3.13, a tree with splits on two covariates falls in the second sum, and so on. The highest interaction order is limited by the number p of covariates, and, given the additive nature of the boosting procedure, a tree with R terminal nodes produces a function approximation with interaction order of at most min(R − 1, p). In practice, for a large number of covariates, a good approximation can often be obtained with a much lower order.³ Depending on the case, we can select the interaction order by testing on a single test data set or by an n-fold cross-validation procedure, or fix the order a priori to capture, for example, only second-order interactions among the covariates.

4 Loss Function for Point Process Modeling

In order to apply stochastic gradient boosting to neural point processes, we define the loss function at each iteration step to be the negative of the log likelihood for the point process. For a given CIF model in log scale, log[λ(t_k | x_k)Δ] = F̂(x_k), the loss function becomes

$$L_m\big(\{\Delta N_k, \hat{F}_{m-1}(x_k)\}_{k=1}^{K}\big) = -\sum_{k=1}^{K} \Big( \Delta N_k \hat{F}_{m-1}(x_k) - \exp[\hat{F}_{m-1}(x_k)] \Big), \qquad (4.1)$$
³ See Friedman (2001) for a discussion of the selection of interaction order. Note also that because we are modeling the natural parameter of the point process distribution, log λ, a model containing only the first term in equation 3.13 will already lead to multiplicative effects among the covariates.
where again K is the total number of subintervals or number of samples. The kth component of the negative gradient at each iteration is then gmk
∂ L m Nk , Fˆ m−1 (xk ) =− = exp Fˆ m−1 (xk ) − Nk . ∂ Fˆ m−1 (xk )
(4.2)
The initial value is given by

\hat{F}_0(x) = \log\Big(\frac{1}{K}\sum_{k=1}^{K} N_k\Big),    (4.3)
and the terminal node estimates in the regression tree are given by

c_{rm} = \log\left(\frac{\sum_{k=1}^{K} 1\{x_k \in R_{rm}\}\, N_k}{\sum_{k=1}^{K} 1\{x_k \in R_{rm}\}\, \exp[\hat{F}_{m-1}(x_k)]}\right).    (4.4)
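As an illustration, the update rules of equations 4.1 to 4.4 can be sketched in Python (the paper's implementation was in C). This is a minimal sketch under our own simplifications: it uses depth-1 trees (stumps) instead of R-terminal-node trees, a quantile-based split search, and no stochastic subsampling; all function names are ours, not the paper's.

```python
import numpy as np

def fit_stump(x, g, n_candidates=16):
    """Fit a depth-1 regression tree (stump) to the residuals g by scanning
    candidate split points (quantiles) on each covariate and minimizing the
    within-region squared error."""
    best = (np.inf, 0, 0.0)
    for j in range(x.shape[1]):
        for s in np.quantile(x[:, j], np.linspace(0.05, 0.95, n_candidates)):
            left = x[:, j] <= s
            if left.all() or not left.any():
                continue
            sse = ((g[left] - g[left].mean()) ** 2).sum() + \
                  ((g[~left] - g[~left].mean()) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, s)
    return best[1], best[2]

def boost_point_process(x, N, M=50, alpha=0.2):
    """Gradient boosting for the point-process loss (equation 4.1):
    start from F_0 = log(mean count) (eq. 4.3), fit each stump to the
    negative-gradient residuals N_k - exp(F) (eq. 4.2), and apply the
    terminal-node estimates c = log(sum N / sum exp F) (eq. 4.4),
    shrunk by alpha."""
    F = np.full(len(N), np.log(N.mean()))       # equation 4.3
    for _ in range(M):
        g = N - np.exp(F)                       # negative gradient, eq. 4.2
        j, s = fit_stump(x, g)
        left = x[:, j] <= s
        for region in (left, ~left):
            # terminal-node estimate (eq. 4.4); small epsilon guards log(0)
            c = np.log((N[region].sum() + 1e-12) / np.exp(F[region]).sum())
            F[region] += alpha * c              # shrunken update
    return F
```

Because each shrunken terminal-node step moves the region offset toward its likelihood optimum, the training negative log likelihood is nonincreasing over iterations.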
5 Goodness-of-Fit Analysis via the Time-Rescaling Theorem

Given spike train and covariate time series, we can approximate the CIF λ̂(t|x(t)) = exp{F̂(x(t))} using the stochastic gradient boosting approach just described. Assessing the goodness of fit of the approximation can be implemented by applying the time-rescaling theorem for point processes as follows (Brown et al., 2001; Papangelou, 1972; Girsanov, 1960). We start by computing the rescaled times z_j from the estimated conditional intensity and from the spike times as

z_j = 1 - \exp\left(-\int_{u_j}^{u_{j+1}} \hat{\lambda}(t|x(t))\, dt\right),    (5.1)

for j = 1, . . . , J − 1 spike times. The time-rescaling theorem states that the z_j's will be independent uniformly distributed random variables on the interval [0, 1) if and only if the estimated CIF approximates well the true conditional intensity function of the process. Because the transformation in equation 5.1 is one-to-one, any statistical assessment that measures the agreement between the z_j's and a uniform distribution directly evaluates how well the original model agrees with the spike train data. Additional tests to check the independence of the rescaled times can be formulated (Truccolo, Eden et al., 2005) but will not be employed here. A Kolmogorov-Smirnov (K-S) test can be constructed by ordering the z_j's from smallest to largest and then plotting the values of the cumulative distribution function of the uniform density, defined as b_j = (j − 1/2)/J for j = 1, . . . , J, against the ordered z_j's. We term these plots K-S plots. If the
Nonparametric Modeling of Neural Point Processes
model is correct, then the points should lie on a 45 degree line. Confidence bounds for the degree of agreement between a model and the data may be constructed using the distribution of the K-S statistic (Johnson & Kotz, 1970). For moderate to large sample sizes, the 95% confidence bounds are well approximated (Johnson & Kotz, 1970) by b_j ± 1.36 × J^{−1/2}.

6 Model Interpretation: Relative Importance and Partial Dependence Plots

6.1 Relative Importance of Individual Covariates. Given an approximated CIF, we would like to assess the relative importance or contribution of each covariate. This assessment is particularly important in more exploratory studies, where the set of covariates is large and little a priori knowledge is available about their individual relevance. A measure of relative importance should reflect the relative influence of a variable x_l on the variation of F̂(x). Assessing the influence of a covariate on the variation of F̂(x) by expected partial derivatives is not possible given the piecewise constant approximations provided by regression trees. Instead, we use the relative importance measure of Breiman et al. (1984). We start by computing it for a single tree T_m = h(x; a_m) as

\hat{I}^2_{lm}(T_m) = \sum_{r=1}^{R-1} \hat{i}^2_r\, 1(v(r) = l),    (6.1)
where the summation is over the nonterminal nodes r of the R-terminal-node tree. The term v(r) gives the splitting variable associated with node r in the tree. The term \hat{i}^2_r is the corresponding empirical improvement in squared error achieved by splitting, at the point given by the splitting variable, a region S_r into two left and right subregions (S_{r−}, S_{r+}). The empirical improvement in squared error is given by the weighted squared difference between the constants fit to the left and right subregions. The final relative importance measure with respect to the approximation F̂(x), including the entire collection of regression trees, is obtained by averaging equation 6.1 over all of the trees:

\hat{I}^2_l = \frac{1}{M}\sum_{m=1}^{M} \hat{I}^2_{lm}(T_m).    (6.2)
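Equations 6.1 and 6.2 reduce to a simple accumulate-and-average computation. In the sketch below, each fitted tree is represented as a list of (splitting variable, squared improvement) pairs for its internal nodes; this lightweight representation is our assumption, not the paper's data structure.

```python
import numpy as np

def tree_importance(internal_nodes, p):
    """Equation 6.1 for one tree: sum the squared-error improvements over
    the R-1 internal nodes, accumulated by splitting variable v(r)."""
    I2 = np.zeros(p)
    for v, improvement2 in internal_nodes:   # (splitting variable, i^2)
        I2[v] += improvement2
    return I2

def ensemble_importance(trees, p):
    """Equation 6.2: average the single-tree measures over all M trees."""
    return np.mean([tree_importance(t, p) for t in trees], axis=0)
```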
6.2 Partial Dependence Plots. Beyond assessing the relative importance of individual covariates, we would like to gain insight into the functional form relating response variable and covariates. That might be done by graphically plotting the partial dependence of the approximation on a
small subset of the covariates z_l = {z_1, . . . , z_l} ⊂ {x_1, . . . , x_p}. This partial dependence can be formulated in terms of the expected value of F̂(x) with respect to the probability of the complement subset z_{\l} (Friedman, 2001; see also section A.2 of the appendix). An empirical approximation to this partial dependence is straightforward to obtain in the case of regression trees based on single-variable splits. In this case, the partial dependence of F̂(x) on a specified target variable subset z_l is evaluated given only the tree, without reference to the data. For a specified set of values for the variables z_l, a weighted traversal of the tree is performed. At the root of the tree, a weight of 1 is assigned. For each nonterminal node visited, if its split variable is in the target subset z_l, the appropriate left or right daughter node is visited and the weight is not modified. If the node's split variable is a member of z_{\l}, then both daughters are visited and the current weight is multiplied by the fraction of training observations that went left or right, respectively, at that node. Each terminal node visited during the traversal is assigned the current value of the weight. When the tree traversal is complete, the weighted average of the F̂(x) values over those terminal nodes visited during the traversal provides an empirical estimate of the expected value of F̂(x) with respect to the complement subset. For a collection of M regression trees, the results from the individual trees are simply averaged.
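The weighted traversal above can be written as a short recursion. In this sketch, a tree node is a dict with either a 'value' key (terminal node) or 'var', 'split', 'frac_left', 'left', and 'right' keys; this representation, and the convention that the target subset z is a mapping from variable index to its fixed value, are our assumptions for illustration.

```python
def partial_dependence(node, z, weight=1.0):
    """Weighted tree traversal for partial dependence (Friedman, 2001).
    `z` maps target-subset variable indices to fixed values; at nodes that
    split on a complement variable, both daughters are visited with the
    weight multiplied by the fraction of training data sent each way."""
    if 'value' in node:                      # terminal node: weighted value
        return weight * node['value']
    if node['var'] in z:                     # target variable: one branch
        branch = 'left' if z[node['var']] <= node['split'] else 'right'
        return partial_dependence(node[branch], z, weight)
    # complement variable: visit both daughters with data-fraction weights
    return (partial_dependence(node['left'], z, weight * node['frac_left']) +
            partial_dependence(node['right'], z,
                               weight * (1.0 - node['frac_left'])))
```

For an ensemble, one would simply average `partial_dependence(tree, z)` over the M trees.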
7 Applications: Modeling Spiking Activity as a Function of Spiking History and Kinematics

To illustrate this approach to neural point process modeling, in this section we apply stochastic gradient boosting to spiking activity recorded from two cells, one in M1 and the other in 5d, of a behaving monkey. These two cells are henceforth referred to as cells A and B, respectively. They were chosen based on their rich modulation by intended and executed kinematics. Details of the basic recording hardware and surgical protocols are available elsewhere (Paninski, Fellows, Hatsopoulos, & Donoghue, 2004). Following task training, two Bionic Technologies LLC (BTL, Salt Lake City, UT) 100-electrode silicon arrays were implanted in M1 and 5d areas related to sensorimotor control of the arm. One monkey (M. mulatta) was operantly conditioned to perform a standard center-out radial reaching task with an instructed delay. The monkey learned to use a manipulandum in a horizontal plane. The manipulandum controlled a cursor on a monitor oriented vertically in front of her. Each trial consisted of moving the cursor to one of eight radially displaced targets on the screen. The task started with the monkey moving the manipulandum to a center holding position. Soon after, an instructed delay cue specifying the target for the reach was presented for 1.5 sec. Finally, a go cue, appearing randomly in the time interval 1–2 sec after the disappearance of the instructed delay cue, signaled the monkey to reach for the remembered target. The hand position signal in 2D
was digitized and resampled to 1 KHz. Hand velocities and accelerations were obtained via standard Savitzky-Golay derivatives.

7.1 Covariate List, Data Set, Boosting Meta-Parameters, and Computational Issues. The following covariates were selected. To incorporate simple spiking history effects, we selected the time elapsed since the last spike. This covariate will be denoted by s_{t_k}. To introduce the kinematics effects, we chose position, velocity, and acceleration in 2D. These kinematics covariates are denoted by q(t_k + τ_q), q̇(t_k + τ_q̇), and q̈(t_k + τ_q̈), respectively. When referring to specific components of a kinematics vector, we use subscripts, for example, q = [q_1, q_2]′. The investigation of kinematics effects at multiple time lags for each covariate, although straightforward, is beyond the scope of this article. For this reason, we chose to present the simpler case of single time lags for each covariate. The choice of single optimal time lags was based on mutual information analyses (Paninski et al., 2004). For cell A, the time delays for position, velocity, and acceleration were set to −0.15, 0.25, and 0.25 sec, respectively. For cell B, these delays were set to zero. In order to account for the planning of movement direction during the instructed delay period, we included in the model the covariate target direction, denoted by φ and henceforth referred to as the intended movement direction in a given trial. This is not to be confused with the actual movement direction derived from the velocity vector. Our covariate vector thus consisted of x_{t_k} = [s_{t_k}, φ, q_{t_k+τ_q}, q̇_{t_k+τ_q̇}, q̈_{t_k+τ_q̈}]′, with p = 8 covariates. We analyzed 268 trials, distributed almost uniformly across the eight directions. The recorded neural point process was time discretized by taking Δ = 0.001 sec. Given the known refractory period after spiking, this choice ensured that no more than a single event would be observed in each subinterval of the partition.
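The spiking-history covariate s_{t_k} can be computed directly from the binned spike counts. A minimal sketch follows; the exact conventions (the value assigned in the bin where a spike falls, and the initialization before the first spike) are our assumptions, since the paper does not spell them out.

```python
import numpy as np

def time_since_last_spike(N, dt=0.001, init=np.inf):
    """Covariate s_{t_k}: time elapsed since the last spike at each bin,
    computed from the binary spike-count sequence {N_k}. The value at a
    spike's own bin is the elapsed time up to that bin (i.e., the ISI);
    `init` is the assumed elapsed time before any spike is observed."""
    s = np.empty(len(N), dtype=float)
    elapsed = init
    for k, n in enumerate(N):
        s[k] = elapsed
        elapsed = dt if n > 0 else elapsed + dt
    return s
```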
In each trial, only the data in the time interval [−0.4, 0.5] sec, with respect to the movement onset at time zero, were used. Typical hand movements lasted approximately 0.4 sec, with bell-shaped velocity profiles peaking at around 0.2 sec after movement onset. After time discretization, the number of samples, including data from all of the trials, resulted in K = 241,200 samples. The indexes k for the resulting samples or tuples {y_k, x_k}_{k=1}^K were reordered by a single random permutation. That was done to ensure that the training and validation data sets contained samples that originated from everywhere in the experimental session, not just the beginning or the ending part, for example. The meta-parameters for the boosting algorithm were set as follows. Two-thirds of the data set were separated for model fitting, and the remaining data were taken for validation tests. For the stochastic subsampling at each iteration, half of the model-fitting data set was randomly subsampled without replacement. The number of terminal nodes for the regression trees was set to R = 9, that is, p + 1. The shrinkage coefficient was set to α = 0.001. This choice was based on the analysis of the loss function computed on
the validation data set for different shrinkage parameters and different numbers of regression trees. In the example shown for cell A in Figure 1, at low shrinkage values, for example, α = 0.1, large generalization errors start to occur after just 50 to 100 trees, resulting in poor approximations. Much better performance was obtained for α = 0.001, with improvement on the loss function remaining roughly constant after approximately 11,000 trees. Similar results were obtained for cell B and other cells in the recorded ensemble (not shown). The gradient boosting algorithm was implemented in C. It took about 3 hours of computational time in a 2 GHz dual processor (2 GB RAM) computer to fit a model for a single cell. We used a computer cluster to run the cross-validation analyses presented in the last part of this section and in the next section.

7.2 Approximated Conditional Intensity Function and Goodness-of-Fit Analysis. To provide a glimpse of the type of CIF approximations generated by an ensemble of regression trees, Figure 2 shows an example for cell A. In this case, two main features are observed. First, after each spike, the CIF captures a refractory period (the CIF approximates zero) and a quick rebound excitation. Second, around 200 ms before movement onset, the cell begins to be modulated by the kinematics. This effect peaks about 100 ms prior to movement onset. (See also Figures 4 and 5 for details on the contributions of spiking history and kinematics to the approximation.) With the exception of sharp changes reflecting refractory periods after spiking, the temporal evolution of the estimated CIF is otherwise relatively smooth. Once the CIFs of the two cells were approximated by the ensemble of regression trees, a goodness-of-fit assessment by time rescaling was done on the validation data set. Figure 3 shows that overall, only minor deviations are observed for the quantiles close to 0.5 (cell A) and to 0.25 (cell B).
These deviations could have arisen from a lack of fit of the approximation itself or from the absence in the chosen covariate set of other variables relevant to the modeled spiking activity.

7.3 Relative Importance and Partial Dependence Plots. Table 1 shows the estimated relative importance for each of the covariates. Different covariate rankings were observed for the two cells. For cell A, the most important covariates were spiking history and velocity. For cell B, velocity and position in the vertical axis, together with spiking history, were highest ranked. Partial dependence plots are shown in Figures 4 and 5 to display the marginal effects of subsets of the covariates. Each partial dependence plot shows the contribution of a subset in terms of the approximated function F̂, that is, the CIF in log scale. The partial dependence plots do not include the constant term F̂_0(x) in the approximation. The estimated effects of spiking history in terms of time elapsed since the last spike are shown in Figure 4A. An increase in the partial dependence, implying also an increase in spiking
Figure 2: Example of approximated conditional intensity function. The resulting approximated CIF via stochastic gradient boosting with approximately 11,000 trees (top plot) is shown together with the related spike train data for that time segment (bottom plot). The top plot also shows the magnitudes (arbitrary units) of the position (dotted), velocity (solid), and acceleration (dashed) vectors. This example was taken from cell A, during a reach to a target to the right (0 degrees). Movement onset is at time zero.
Figure 3: K-S goodness-of-fit plots via time rescaling. The K-S plots for cells A and B are shown in plots A and B, respectively. Goodness-of-fit assessment was performed on validation data.
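The rescaling of equation 5.1 and the K-S summary behind these plots can be sketched as follows, with the CIF integral approximated by the trapezoidal rule on a regular time grid (a numerical choice of ours; the paper does not specify its quadrature):

```python
import numpy as np

def rescaled_times(spike_times, cif, t_grid):
    """Rescaled times of equation 5.1: z_j = 1 - exp(-integral of the
    estimated CIF between consecutive spike times u_j, u_{j+1})."""
    increments = np.diff(t_grid) * 0.5 * (cif[1:] + cif[:-1])
    Lam = np.concatenate(([0.0], np.cumsum(increments)))   # cumulative integral
    Lam_at_spikes = np.interp(spike_times, t_grid, Lam)
    return 1.0 - np.exp(-np.diff(Lam_at_spikes))

def ks_statistic(z):
    """K-S plot summary: compare the ordered z_j's with the uniform
    quantiles b_j = (j - 1/2)/J; the 95% confidence bounds are
    approximately +/- 1.36 / sqrt(J)."""
    z = np.sort(np.asarray(z))
    J = len(z)
    b = (np.arange(1, J + 1) - 0.5) / J
    return np.max(np.abs(z - b)), 1.36 / np.sqrt(J)
```

A well-fit model yields a K-S statistic below the 1.36/√J bound, that is, a K-S plot inside the confidence band around the 45 degree line.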
Figure 4: Partial dependence plots: spiking history and intended movement direction. (A) The partial dependence on the time elapsed since the last spike is shown for cell A. (B) The partial dependence on the intended movement direction (eight discrete states) is shown for the same cell.

Table 1: Covariates' Relative Importance

Cell A               Cell B
1. s       32.68     1. q̇_2    25.01
2. q̇_1     28.08     2. q_2    22.57
3. q̇_2     14.99     3. s      18.61
4. q_1      7.17     4. q̈_1     8.98
5. q̈_1      5.49     5. q̇_1     7.41
6. φ        5.29     6. q_1     6.73
7. q_2      3.17     7. q̈_2     6.22
8. q̈_2      3.13     8. φ       4.26
probability, follows after a very short refractory period, just 3 to 4 ms after a spike. This feature is confirmed by the interspike interval histogram of the cell (not shown), which shows a peak around 3 ms confirming the existence of spike doublets (see also Chen & Fetz, 2005). A broader increase in spiking probability appears centered around 30 ms. The contribution of spiking history in cell B has similar features, with the exception of doublets (not shown). Figure 4B shows the partial dependence on the intended movement direction for cell A. This covariate contributed with a constant (throughout the trial) value that was dependent on the direction of the target in a given trial. Its effects can also be clearly seen prior to movement onset, as shown by peri-movement time histograms in Figure 6A. Intended direction effects were minimal for cell B.
Figure 5: Partial dependence plots for kinematics. The partial dependences for position (q_1 and q_2, in cm), velocity (q̇_1 and q̇_2, in cm/s), and acceleration (q̈_1 and q̈_2, in cm/s²) at the specified time lags (see text) for cell A are shown in plots A, B, and C, respectively.
The partial dependencies on kinematics for cell A are summarized in Figure 5. The strongest partial contributions are given by the velocity covariate. For this cell, the velocity tuning narrowed at higher speeds. A dependence of preferred direction on speed, with a shift ranging about 50 degrees, was also observed. The partial dependence on position, though small, showed a nonmonotonic relation with saturation. For acceleration, the functional form was close to sigmoidal.

7.4 Prediction of Peri-Movement Time Histograms Based on Covariate State. In neurophysiology, the analysis of the average neuronal response to stimuli or behavioral events commonly plays an important role in identifying covariates relevant to the neuron's spiking activity. In our example, this average response corresponds to what is usually referred to as the peri-movement time histogram (PMTH). Here we illustrate how to do a simple assessment of the predictive power of the chosen covariates in terms of their ability to predict the time course of this average response. This can
be done by comparing the predicted PMTH, derived from the single-trial approximated CIFs, to the empirical PMTH. The predicted PMTH is therefore a function of the covariates, while the empirical PMTH is simply a time-locked average response that does not explicitly model the covariates’ effects. This comparison between predicted and empirical PMTHs (assuming the empirical PMTH provides a good estimation of the PMTH) can complement the goodness-of-fit assessment using the time-rescaling theorem. Commonly, the time-rescaling-based assessment will show some lack of fit without, on the one hand, specifying a source for the fitting problem or, on the other hand, pointing to what aspects the model captures well despite a lack of fit. The comparison between predicted and empirically estimated PMTHs can then be used as an additional diagnostic tool to check if the modeled covariate effects capture at least the temporal features observed in the peri-event time histograms. The empirical PMTH was obtained by first time-locking the single trial binary sequences {Nk } to the time of hand movement onset and then averaging them across trials. This average was then smoothed by a gaussian (SD = 0.05 sec) kernel, and finally expressed as spiking frequency in Hz units.4 To obtain the predicted PMTH, we first ran stochastic gradient boosting on a model-fitting data set containing all of the trials, with the exception of the trial whose CIF was being estimated. This corresponds to a leave-one-trial-out cross-validation procedure. Then, given the fitted model and observed covariates, we estimated the CIF for the selected trial. This was done for all of the trials. The predicted PMTH was finally obtained by time-locking the estimated single trial CIFs to the onset of movement, and then averaging them across trials. Figure 6 shows both the predicted and empirical PMTHs. The empirical PMTHs show strong pre- and post-movement effects for cell A, but only post-movement effects for cell B. 
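The two PMTH estimates described above can be sketched as follows (array shapes and the 4-SD kernel truncation are our choices; the paper specifies only the gaussian SD of 0.05 sec and the 1 ms bins):

```python
import numpy as np

def empirical_pmth(trials, dt=0.001, sd=0.05):
    """Empirical PMTH: average the time-locked binary sequences {N_k}
    across trials (rows of `trials`), smooth with a gaussian kernel
    (SD in seconds, truncated at 4 SD), and express in Hz."""
    mean_counts = trials.mean(axis=0)            # trials: n_trials x n_bins
    half = int(4 * sd / dt)
    t = np.arange(-half, half + 1) * dt
    kernel = np.exp(-0.5 * (t / sd) ** 2)
    kernel /= kernel.sum()
    return np.convolve(mean_counts, kernel, mode='same') / dt

def predicted_pmth(single_trial_cifs):
    """Predicted PMTH: average the time-locked single-trial estimated
    CIFs (in Hz) across trials."""
    return np.asarray(single_trial_cifs).mean(axis=0)
```

In the paper's procedure, each row passed to `predicted_pmth` is the CIF estimated for a held-out trial under the leave-one-trial-out scheme.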
Overall, predicted and empirical PMTHs agree well for both cells. For these two cells, we found that the selected kinematics and spiking history covariates alone could account for most of the features in the time course of activation. Similar results were obtained for the other cells in the recorded cell ensemble (approximately 60 cells) in the two cortical areas (not shown). These results, together with the goodness-of-fit assessment by time rescaling, suggest that the CIF approximations have explained most of the cells' spiking behavior and that the model captured well the dependence of the time course of the cells' activity on movement parameters. The small disagreements between predicted and empirical PMTHs happening at the extremes of the time interval [−0.4, 0.5] could have arisen from nonmodeled motor preparation and reward effects
4 We could have obtained a better estimate of the "empirical" PMTH, with corresponding confidence intervals, by fitting Bayesian splines to the histogram as a function of time. However, we found this to be unnecessary for these illustration examples.
Figure 6: Prediction of PMTH based on the single-trial approximated conditional intensity functions. The predicted (black) and empirical (gray) PMTHs for cells A and B are shown in plots A and B, respectively. The eight plots show the PMTH conditioned on the eight radial directions for the executed reaches.
or from estimation problems resulting from a lower signal-to-noise ratio at these extremes.

8 Comparison to Other Approaches: GLMs and Bayesian P-Splines

We compare stochastic gradient boosting regression to two alternative approaches: a generalized linear model and Bayesian penalized splines. The GLM framework applied to neural point processes has been described in detail in Truccolo, Eden et al. (2005). Briefly, in the GLM approach, the conditional intensity function was modeled as λ = exp{μ + α_1 x_1 + . . . + α_p x_p}. The Bayesian P-splines approach is described below.
8.1 Bayesian P-Splines. Bayesian P-splines are a state-of-the-art nonparametric regression tool. We choose Bayesian P-splines (Brezger & Lang, 2006) instead of a Bayesian free-knot spline approach using reversible-jump MCMC (DiMatteo et al., 2001) because of their faster computation. Consider the following model for the conditional intensity function,

\log \lambda_i = f_1(x_{i1}) + f_2(x_{i2}) + \ldots + f_p(x_{ip}) + v_i^T \gamma,    (8.1)
where f_j(x_j), j = 1, . . . , p, are smooth functions and v_i^T γ represents the strictly linear part of the predictor. In the Eilers and Marx (1996) approach, it is assumed that the unknown functions can be approximated by M_j = k_j + l B-splines of degree l (usually cubic) with equally spaced knots,

\zeta_{j0} = x_{j,\min} < \zeta_{j1} < \ldots < \zeta_{j,k_j-1} < \zeta_{j,k_j} = x_{j,\max},    (8.2)
over the domain of x_j. Denoting the mth basis function by B_{jm}, we obtain

f_j(x_j) = \sum_{m=1}^{M_j} \beta_{jm} B_{jm}(x_j).    (8.3)
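A B-spline basis of this kind can be evaluated with the standard Cox-de Boor recursion. The sketch below is a generic illustration for equally spaced knots (the P-spline setting) and is not the implementation used by BayesX; the function name is ours.

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """Evaluate all B-spline basis functions of the given degree at the
    points x via the Cox-de Boor recursion, for an (extended) knot
    sequence. Returns the design matrix with columns B_m(x_i)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    # degree-0 bases: indicator of each half-open knot span
    B = [((t[i] <= x) & (x < t[i + 1])).astype(float)
         for i in range(len(t) - 1)]
    for d in range(1, degree + 1):
        B = [
            (np.zeros_like(x) if t[i + d] == t[i]
             else (x - t[i]) / (t[i + d] - t[i]) * B[i])
            +
            (np.zeros_like(x) if t[i + d + 1] == t[i + 1]
             else (t[i + d + 1] - x) / (t[i + d + 1] - t[i + 1]) * B[i + 1])
            for i in range(len(t) - d - 1)
        ]
    return np.column_stack(B)
```

On the interior of the knot range, the cubic basis functions form a partition of unity, which is the property exploited when equation 8.3 is stacked into the design matrices of equation 8.4.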
By defining the n × M_j design matrices X_j with the elements in row i and column m given by X_j(i, m) = B_{jm}(x_{ij}), we can rewrite the predictor in matrix notation as

\log \lambda = X_1 \beta_1 + X_2 \beta_2 + \ldots + X_p \beta_p + V\gamma.    (8.4)
Here β_j = (β_{j1}, . . . , β_{jM_j})^T, j = 1, . . . , p, corresponds to the vectors of unknown regression coefficients. The matrix V is the usual design matrix for linear effects. To overcome the well-known difficulties involved with regression splines, Eilers and Marx (1996) suggested a relatively large number of knots (usually between 20 and 40) to ensure enough flexibility, and introduced a roughness penalty on adjacent regression coefficients to regularize the problem and avoid overfitting. Theoretically, because roughness is controlled by the penalty term, once a minimum number of knots is reached, the fit given by a P-spline should be almost independent of the knot number and location. In their frequentist approach, Eilers and Marx proposed penalties based on squared rth-order differences. Here we choose second-order differences, in analogy to second-order derivatives in smoothing splines. In the Bayesian P-splines approach, the second-order differences are replaced with their stochastic analogs, that is, second-order random walks defined by

\beta_{jm} = 2\beta_{j,m-1} - \beta_{j,m-2} + u_{jm},    (8.5)
with gaussian errors u_{jm} ∼ N(0, σ_j²) and diffuse priors β_{j1}, β_{j2} ∝ const for the initial values.5 While first-order random walks would penalize abrupt jumps β_{jm} − β_{j,m−1} between successive parameters, second-order random walks penalize deviations from the linear trend 2β_{j,m−1} − β_{j,m−2}. The amount of smoothing is controlled by the parameter σ_j², which corresponds to the inverse smoothing parameter in the traditional smoothing spline approach. By defining an additional hyperprior for the variance parameters, the amount of smoothness can be estimated simultaneously with the regression coefficients. We assign the conjugate prior for σ_j², which is an inverse gamma prior with hyperparameters a_j and b_j, that is, σ_j² ∼ IG(a_j, b_j). Based on Brezger and Lang's (2006) experience from extensive simulation studies, we use a_j = b_j = 0.001. Bayesian inference is based on the posterior of the model, which is given by

p(\beta_1, \ldots, \beta_p, \sigma_1^2, \ldots, \sigma_p^2, \gamma \mid y, X) \propto L(y, X, \beta_1, \ldots, \beta_p, \sigma_1^2, \ldots, \sigma_p^2, \gamma)
\times \prod_{j=1}^{p} \frac{1}{(\sigma_j^2)^{rk(K_j)/2}} \exp\Big(-\frac{1}{2\sigma_j^2}\, \beta_j^T K_j \beta_j\Big)
\times \prod_{j=1}^{p} (\sigma_j^2)^{-a_j-1} \exp\Big(-\frac{b_j}{\sigma_j^2}\Big),    (8.6)

where L(·) is the likelihood. The last two terms on the right are the priors for the spline coefficients and for the variance of the random walk, respectively; rk(·) denotes the matrix rank, and the term K_j is a penalty matrix that depends on the prior assumptions about the smoothness of f_j and the type of covariate. This penalty matrix is of the form K_j = D^T D, where D is in our case a second-order difference matrix. Spline parameters and smoothing terms were estimated by MCMC sampling. Good mixing properties of the Markov chain are provided by a sampling scheme that approximates the full conditionals of the regression parameters via IWLS (iteratively weighted least squares) and uses these approximations as proposals in a Metropolis-Hastings algorithm, as proposed by Gamerman (1997).
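The connection between the random walk of equation 8.5 and the penalty matrix K_j = DᵀD in equation 8.6 can be made concrete with a few lines of code: βᵀKβ is exactly the sum of squared second-order differences of the coefficient vector. This is a generic illustration, not BayesX code.

```python
import numpy as np

def second_diff_penalty(M):
    """Penalty matrix K = D^T D built from the (M-2) x M second-order
    difference matrix D, so that beta^T K beta equals the sum of squared
    second differences penalized by the random walk of equation 8.5."""
    D = np.diff(np.eye(M), n=2, axis=0)   # rows like [1, -2, 1, 0, ...]
    return D.T @ D
```

For a nearly linear coefficient vector the penalty is close to zero, which is why second-order random walks leave linear trends unpenalized.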
We set the number of MCMC iterations to 20,000, with the first 5000 iterations discarded as the burn-in period. In order to reduce correlations between the MCMC samples, the chain was thinned by keeping only the samples from every 10 iterations. We visually inspected the resulting Markov chains to verify convergence. Cubic spline bases with 30 uniformly spaced knots were used. In the computation of the Bayesian P-splines, we used the
5 In cases where the estimated function is highly oscillating, the assumption of a global variance parameter may be relaxed by using u jm ∼ N(0, σ j2 /δ jm ), where both σ j2 and δ jm are hyperparameters to be estimated.
publicly available software BayesX developed by Brezger, Kneib, and Lang (2005). Besides featuring sampling schemes especially designed for good MCMC mixing with Bayesian P-splines, this software has been shown to be 60 to 280 times faster than the other well-known alternative for Bayesian inference, the software WinBUGS (Spiegelhalter, Thomas, Best, & Lunn, 2003).

8.2 Cross-Validation Results. We used a 10-fold cross-validation scheme to compare the different modeling approaches. Exactly the same 10 test data sets were used for each modeling approach, and each approach used the same eight covariates described in section 7. We modeled all 67 cells available in the data set. As a comparison criterion, we adopted the average cross-validated log likelihood of the modeled neural process. Because in Bayesian splines the introduction of interaction terms is usually restricted in practice to interaction surfaces (second-order interactions, modeled by the tensor product of B-splines, for example), and because their inclusion would increase dramatically the computational time for these comparisons, we opted not to include higher-order interaction terms in any of the modeling approaches. Therefore, the gradient boosting regression models included trees with splits in a single covariate only. Additionally, to prevent gradient boosting from having an unfair advantage in cross-validation performance, we did not estimate the optimal number of trees from the computed loss function in a small test data subset. Instead, we set the shrinkage to a small value, α = 0.001, and fixed the number of trees to a high number: 40,000 for all of the modeled cells. Hyperparameters for the Bayesian splines and additional parameters for the MCMC sampling were specified as described. For all of the cells, we obtained high acceptance rates in the Markov chains.
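The comparison criterion can be sketched as follows. We assume the discrete-time point-process log likelihood of Truccolo, Eden et al. (2005), Σ_k [N_k log(λ_k Δ) − λ_k Δ], evaluated on held-out data; the function names and the exact form (up to terms constant across models) are ours.

```python
import numpy as np

def pp_log_likelihood(N, lam, dt=0.001):
    """Discrete-time point-process log likelihood on held-out data:
    sum_k [ N_k * log(lam_k * dt) - lam_k * dt ], for spike counts N
    and estimated intensities lam (in Hz) over bins of width dt."""
    lam_dt = lam * dt
    return float(np.sum(N * np.log(lam_dt) - lam_dt))

def delta_ll(N, lam_model, lam_ref, dt=0.001):
    """Delta log likelihood: positive values favor the first model
    (e.g., gradient boosting) over the reference (GLM, P-splines, or
    the constant-rate noise model)."""
    return pp_log_likelihood(N, lam_model, dt) - pp_log_likelihood(N, lam_ref, dt)
```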
For comparison purposes, we also computed the cross-validated log likelihood for a model that consisted of only a constant-mean Poisson process, independent of any of the covariates. We refer to this model as the noise model. Additionally, we used the difference among the cross-validated log likelihoods (Δ log likelihood) from the different models. More specifically, the log likelihoods from the GLM, Bayesian P-splines, or noise models were subtracted from the corresponding log likelihoods of the gradient boosting regression models. In this way, a positive Δ log likelihood means a better performance for the gradient boosting models in the test data sets. Figure 7 summarizes the comparison results over the 67 cells. Overall, the gradient boosting regression models performed much better for all of the 67 cells when compared to the noise and GLM models, as expected, and performed better in about 90% of the cells when compared to Bayesian P-splines. A preliminary analysis contrasting the types of estimated covariate effects on cell A based on gradient boosting regression and Bayesian P-splines is given in Figure 8. We focused on the shapes of the estimated effects rather than on their absolute magnitudes and chose the velocity covariate because, for most of the cells, it is ranked as the most important covariate. Compared
Figure 7: Comparison between the different modeling approaches using cross-validated log likelihood. (Top) The histogram for the differences between the average 10-fold cross-validated log likelihoods from the gradient boosting regression and Bayesian P-splines models. (Middle) The histogram for the differences between the gradient boosting regression and GLM models. (Bottom) The histogram for the differences between the gradient boosting regression and noise models (a Poisson process with constant spiking rate independent of the covariates). Total N is 67 modeled cells. Positive differences (Δ log likelihoods) mean that the gradient boosting regression models performed better in the test data than the comparison modeling approach did.
to the gradient boosting estimate (see Figure 8A), the Bayesian P-spline estimate seems to oversmooth (see Figure 8B), missing many of the structures captured by the gradient boosting regression trees model. The gradient boosting estimate clearly partitions the work space into four regions (movements up and down, left and right), with fine structure especially visible in two of these quadrants. However, this structure could be simply an artifact resulting from the piecewise constant approximation nature of regression trees. Assuming for the moment that this fine structure truly exists, it could then be that a different spline approach not dependent on MCMC sampling would better capture this structure. To address this issue we estimated a P-spline model using the recently developed MGCV algorithm in R (Wood, 2004). Although this algorithm does not provide the full Bayesian inference setup as in Bayesian P-splines, it still allows for the automatic estimation of smoothness parameters. Figure 8C shows the resulting estimate obtained
Figure 8: Comparison of estimated dependence on velocity from Bayesian P-splines, gradient boosting regression, and other spline-based methods. In all plots, the estimated marginal effects of the velocity covariate for cell A are shown as 2D surfaces. (A) Estimated dependence obtained from stochastic gradient boosting. (B) From Bayesian P-splines. (C) Penalized cubic splines (MGCV). (D) Top view of the estimated dependence in A. (E) Top view of the estimated dependence in C. (F) Penalized cubic splines with shrinkage (top view; estimated using MGCV). The range of velocity comprises the actually observed velocities in the fitted data. Velocities are given in cm/s. See the text for details.
Nonparametric Modeling of Neural Point Processes
697
by fitting the model with all covariates and using penalized cubic splines (basis dimension set to 32). Interestingly, the effect for velocity estimated via MGCV is overall closer to the gradient boosting model than the Bayesian P-spline estimate is. However, it pays the high price of significant oscillations (overshoots and undershoots) in certain regions, which probably makes the Bayesian P-spline's smoother approximation preferable. More important, these oscillations appear exactly at regions where there are sizable jumps in the piecewise constant approximation (see Figure 8E). Figure 8F shows another example, now using MGCV with penalized cubic splines with shrinkage. Instead of minimizing the oscillation artifacts, it actually increases them. Similar results were obtained with B-spline bases. (We could not compute the model using thin-plate splines in MGCV because of memory allocation problems.) We therefore conclude that the fine structure estimated by the gradient boosting regression cannot easily be explained away as a pure artifact of the piecewise constant approximation. Instead, this fine structure is likely to reflect features of the true underlying relation between spiking and velocity.

9 Discussion

We have demonstrated how stochastic gradient boosting regression can be successfully extended to the modeling of neural point processes, thus providing a robust nonparametric modeling approach for neural spiking data while preserving their point process nature. Furthermore, our choice of loss function keeps this nonparametric modeling approach close in spirit to the (regularized) maximum likelihood principle. Analysis on validation data of the loss function's dependence on the number of iterations (trees), together with the application of the time-rescaling theorem, provided the approach with both model selection and a rigorous goodness-of-fit assessment.
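As an illustration of that goodness-of-fit step: the time-rescaling theorem maps spike times through the integrated CIF, and under a correct model the rescaled interspike intervals are Exponential(1). The following is a minimal sketch of our own (not the authors' code), assuming the fitted CIF has been evaluated on a regular time grid of bin width `dt`:

```python
import numpy as np

def rescaled_intervals(spike_idx, lam, dt):
    # Time-rescaling: integrate the estimated CIF between successive spikes.
    # Under a correct model the rescaled intervals tau_j are Exponential(1);
    # 1 - exp(-tau_j) maps them to Uniform(0, 1) for a KS comparison.
    Lam = np.cumsum(lam) * dt            # integrated intensity on the grid
    taus = np.diff(Lam[spike_idx])       # rescaled interspike intervals
    return 1.0 - np.exp(-taus)

def ks_statistic(u):
    # Kolmogorov-Smirnov distance of samples u from Uniform(0, 1).
    u = np.sort(u)
    n = len(u)
    upper = (np.arange(1, n + 1) / n - u).max()
    lower = (u - np.arange(0, n) / n).max()
    return max(upper, lower)
```

For a well-fit model the KS statistic stays within the usual confidence band (about 1.36/sqrt(n) at the 95% level); systematic deviations localize where the fitted CIF mis-specifies the data.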
Partial dependence plots and analysis of averaged CIFs were shown to enhance the model's interpretability, which is important in neurophysiological studies of the relation between spiking activity and observed covariates. Our cross-validation analyses showed that gradient boosting regression can perform similarly to, or better than, a state-of-the-art Bayesian regression tool. We speculate that one possible reason for this better performance with respect to Bayesian splines might be the choice of base learners. The preliminary analysis provided in Figure 8 seems to argue in this direction. Additionally, as already pointed out by Friedman (2001), methods that use smooth functions tend to be adversely affected by outliers and wide-tailed distributions. On the other hand, the piecewise constant approximation in regression trees is more robust to these adverse effects. However, although the issue of smooth versus piecewise constant approximation seems to shed some light on why gradient boosting performed better, there are many other and equally important features in
gradient boosting that could also have contributed to this better performance. These features are shrinkage, stochastic subsampling, and especially the iterative regularized fitting of residuals. This makes properly explaining why gradient boosting performed better a nontrivial problem, and a thorough analysis of it is beyond the scope of this article. From a theoretical point of view, we believe that Bayesian splines may provide comparable or higher performance if one spends enough time fine-tuning the prior distributions and the parameters of the MCMC sampling, and selecting a smaller subset of covariates. While this fine-tuning and selection is feasible for smaller selected data subsets, it is precisely what one would like to avoid in initial analyses of large data sets. It is in that sense that we see gradient boosting regression as an off-the-shelf tool. Given its good performance and computational efficiency, we propose stochastic gradient boosting regression as an off-the-shelf modeling tool for the analysis of the large neural data sets currently generated by multiple-array recording technology (e.g., more than 50 cells, number of samples K > 10^5) and corresponding measurements of multidimensional (dimension > 4) covariate spaces. Additionally, while in gradient boosting regression trees we can easily attempt to capture arbitrary high-order interactions among the covariates, spline-based methods are in practice currently limited to second-order interaction effects, usually modeled with tensor products of two univariate spline bases. The introduction of second-order interaction effects by tensor product splines in generalized additive models (GAM) can in practice be infeasible even in the case of low-dimensional covariate spaces. That is because of memory storage problems resulting from the requirement, in current algorithms, that projections of the covariate data onto these tensor product bases be explicitly stored in matrices.
These matrices can be prohibitively large given the typical size of neural point process data sets (K > 10^5). The use of Bayesian P-splines would alleviate the memory storage problem. Currently, however, these methods are not a practical choice because of their computational cost and the difficulty of designing MCMC samplers with good mixing properties in high-dimensional parameter spaces once interaction effects are added. Alternative methods based on kernel regression algorithms (e.g., kernel logistic regression and gaussian process regression) currently face similar difficulties: most available algorithms require the explicit representation of kernel matrices as 2D arrays and are therefore limited to a few hundred thousand samples. Finally, another advantage of stochastic gradient boosting regression is that it can easily handle mixed metric and discrete (categorical) covariates. Mixed covariates present difficulties for GAM algorithms based exclusively on spline bases. Confidence intervals for the approximated functions and partial covariate effects can in principle be obtained by bootstrapping gradient boosting regression models. However, the introduction of bootstrapping would make gradient boosting regression less attractive in terms of computational
efficiency. We recommend the following approach. If more detailed statistical inference based on estimated confidence intervals is required after an initial modeling analysis performed with stochastic gradient boosting regression, one can focus on a selected smaller subset of modeled cells and perhaps lower-dimensional covariate sets. Bootstrapping of gradient boosting models could then be used. Alternatively, one could apply Bayesian splines to this selected smaller data subset in order to assess parameter uncertainty. To address these issues, we are also considering a Bayesian formulation of boosting regression trees (work in preparation). In neuroscience, modeling the relations between neural spiking and stimulus-behavioral covariates has been intrinsically associated with the problem of neural population decoding. In this regard, neural point process decoding based on models fitted through stochastic gradient boosting can be easily implemented using sequential Monte Carlo particle filters (Doucet, de Freitas, & Gordon, 2001; Gao, Black, Bienenstock, Shoham, & Donoghue, 2001; Brockwell, Rojas, & Kass, 2004). The spiking probability in the time interval ((k − 1)Δ, kΔ] given the estimated states, necessary to compute the importance sampling weights for the particles, is simply obtained as P(ΔN_k | x_k) = exp{ΔN_k F̂(x_k) − exp{F̂(x_k)}}. The application of gradient boosting models to other neural decoding approaches that require the computation of partial derivatives of the CIF with respect to covariates (Eden, Frank, Barbieri, Solo, & Brown, 2004) would be problematic because of the piecewise constant nature of the regression trees approximation. We can think of several directions for improving stochastic gradient boosting regression for neural point processes. The algorithm is generic regarding the types of base learners or regressors that can be used at each iteration step.
We plan to investigate the use of penalized low-dimensional B-splines as base learners, for example. Boosted splines could be used in cases where smooth approximations are preferable, and this choice might also result in fewer boosting iterations. With respect to our goal of identifying the features of the functional form relating spiking and covariates, and despite the contribution of partial dependence plots in this regard, model interpretability remains a topic for further development. As mentioned before, one advantage of an ensemble of regression trees is that the modeling of interaction effects among the covariates can be easily implemented. Estimated second-order interaction effects can be easily visualized in 2D plots. However, an analysis of higher-order interaction effects and their contribution to spiking activity is beyond the assessment provided by the partial dependence plots. We are also considering more efficient algorithmic implementations of boosting regression in terms of least angle regression (Efron, Hastie, Johnstone, & Tibshirani, 2004). A comparison to other related nonparametric modeling algorithms, such as Random Forests (Breiman, 2001), is underway.
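Returning to the decoding application mentioned above: under the discrete-time Poisson approximation, the per-bin spiking probability is proportional to exp{ΔN_k F̂(x_k) − exp{F̂(x_k)}}, which yields particle-filter importance weights directly. The following is a minimal sketch of our own (function names are ours; we assume F̂ already absorbs the bin width, so exp{F̂} is the expected count per bin):

```python
import numpy as np

def bin_log_prob(dN, F_hat):
    # Log probability of spike count dN in one time bin under the Poisson
    # approximation P(dN | x) ∝ exp{dN * F_hat(x) - exp{F_hat(x)}}
    # (the dN! term is constant across particles and is dropped).
    return dN * F_hat - np.exp(F_hat)

def update_weights(weights, dN_k, F_hat_particles):
    # Multiply each particle's importance weight by its observation
    # likelihood for the current bin, then renormalize.
    w = weights * np.exp(bin_log_prob(dN_k, F_hat_particles))
    return w / w.sum()
```

When a spike is observed, particles whose state predicts a higher intensity gain weight; when no spike is observed, they lose weight, as expected.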
Our applications to M1 and 5d data in this article were intended as illustrative examples of the algorithm. Refined models will likely require larger sets of covariates covering more extensive spiking history effects, as well as the introduction of kinematic covariates at multiple time lags. In addition, movement covariates in coordinate frames other than the Cartesian coordinates adopted here, and covariates involving dynamic forces, are other crucial topics for investigation. A detailed study of the relations between spiking and movement covariates in these two areas is the topic of a future report. We are also currently applying stochastic gradient boosting to the study of cortico-cortical interactions between primary motor (M1) and parietal (5d) cortices during sensorimotor control of reaching (Truccolo, Vargas et al., 2005).

Appendix

For completeness, we describe in more detail two aspects of stochastic gradient boosting. We follow the exposition in Friedman (2001).

A.1 Gradient Smoothing via “Greedy-Stagewise” Approximation. Given the representation
$$F^*(\mathbf{x}) \approx \hat{F}(\mathbf{x}) = \sum_{m=0}^{M} f_m(\mathbf{x}) \tag{A.1}$$

for the function solving the optimization problem in equation 3.1, and under sufficient regularity conditions such that we can interchange differentiation and integration, each $f_m(\mathbf{x})$ can be implemented as the $m$th step of a steepest-descent gradient optimization,

$$f_m(\mathbf{x}) = \rho_m g_m(\mathbf{x}) = -\rho_m \, E_y\!\left[\left.\frac{\partial L(y, F(\mathbf{x}))}{\partial F(\mathbf{x})}\,\right|\, \mathbf{x}\right]_{F(\mathbf{x}) = \hat{F}_{m-1}(\mathbf{x})}, \tag{A.2}$$

where $g_m$ is the negative gradient determining the (steepest-descent) direction of the step, $\rho_m$ determines the magnitude of the step, and $\hat{F}_{m-1}(\mathbf{x}) = \sum_{j=0}^{m-1} f_j(\mathbf{x})$. Formulated in this simple pointwise optimization manner, this nonparametric approach would result in poor generalization when applied to finite data sets $\{y_k, \mathbf{x}_k\}_{k=1}^{K}$. The approximation would lack generalization when estimating $F^*(\mathbf{x})$ at $\mathbf{x}$-values other than the ones provided by the sample data set. This shortcoming can be dealt with by imposing smoothness on the approximation. Gradient smoothing can be achieved by adopting a parameterized form of $f_m(\mathbf{x})$ and performing parameter
optimization at each of the iteration steps. A common parameterization is to express each $f_m(\mathbf{x})$ in terms of a regressor or base learner $h(\mathbf{x}; \mathbf{a}_m)$, such that $\hat{F}(\mathbf{x}) = \sum_{m=1}^{M} \beta_m h(\mathbf{x}; \mathbf{a}_m)$. The optimization problem then becomes

$$\{\hat{\beta}_m, \hat{\mathbf{a}}_m\}_{m=1}^{M} = \arg\min_{\{\beta_m, \mathbf{a}_m\}_1^M} \sum_{k=1}^{K} L\!\left(y_k, \sum_{m=1}^{M} \beta_m h(\mathbf{x}_k; \mathbf{a}_m)\right). \tag{A.3}$$
Instead of solving the above at once (i.e., for all m = 1, . . . , M simultaneously), a “greedy stagewise” approach is motivated as follows. First, proceeding in an iterative fashion as before, the problem to be solved at each step is
$$(\hat{\beta}_m, \hat{\mathbf{a}}_m) = \arg\min_{\beta, \mathbf{a}} \sum_{k=1}^{K} L\!\left(y_k, \hat{F}_{m-1}(\mathbf{x}_k) + \beta h(\mathbf{x}_k; \mathbf{a})\right), \tag{A.4}$$

with $\hat{F}_{m-1}(\mathbf{x}) = \sum_{j=0}^{m-1} \beta_j h(\mathbf{x}; \mathbf{a}_j)$. Second, the problem can be simplified by noticing that it suffices to find the member of the parameterized class $h(\mathbf{x}; \mathbf{a}_m)$ that produces the vector $\mathbf{h}_m = \{h(\mathbf{x}_k; \mathbf{a}_m)\}_{k=1}^{K}$ most parallel to the negative gradient vector $\mathbf{g}_m = \{g_{mk}(\mathbf{x}_k)\}_{k=1}^{K} \in \mathbb{R}^K$, where the $K$ components of the negative gradient are given by

$$g_{mk}(\mathbf{x}_k) = -\left[\frac{\partial L(y_k, F(\mathbf{x}_k))}{\partial F(\mathbf{x}_k)}\right]_{F(\mathbf{x}) = \hat{F}_{m-1}(\mathbf{x})}, \quad k = 1, \ldots, K. \tag{A.5}$$
Note that the $\mathbf{h}_m$ most parallel to $\mathbf{g}_m$ is also the one most correlated with $g_m(\mathbf{x})$ over the data distribution. For most types of base learners, the most parallel vector can be obtained by solving the least-squares problem

$$\mathbf{a}_m = \arg\min_{\mathbf{a}, \beta} \sum_{k=1}^{K} \left[g_{mk}(\mathbf{x}_k) - \beta h(\mathbf{x}_k; \mathbf{a})\right]^2. \tag{A.6}$$
Other minimization criteria can be used in place of equation A.6 if needed. In summary, by proceeding in a greedy-stagewise manner, the hard gradient smoothing problem in equation A.3 has been replaced by the much simpler problem of least-squares fitting $h(\mathbf{x}; \mathbf{a})$ to the negative gradient, followed by a single parameter optimization for the magnitude of the gradient step $\rho_m$:

$$\rho_m = \arg\min_{\rho} \sum_{k=1}^{K} L\!\left(y_k, \hat{F}_{m-1}(\mathbf{x}_k) + \rho h(\mathbf{x}_k; \mathbf{a}_m)\right). \tag{A.7}$$
A.2 Partial Dependence Plots. Given a data set $\{y_k, \mathbf{x}_k\}_{k=1}^{K}$, with $\mathbf{x} = [x_1, x_2, \ldots, x_p]$, let $\mathbf{z}_l$ be a chosen target subset, of size $l$, of the covariate variables $\mathbf{x}$,

$$\mathbf{z}_l = \{z_1, \ldots, z_l\} \subset \{x_1, \ldots, x_p\}, \tag{A.8}$$

and let $\mathbf{z}_{\backslash l}$ be the complement set, such that

$$\mathbf{z}_{\backslash l} \cup \mathbf{z}_l = \mathbf{x}. \tag{A.9}$$
If one conditions on specific values for the variables in $\mathbf{z}_{\backslash l}$, then $\hat{F}(\mathbf{x})$ can be considered a function only of the variables in the chosen subset $\mathbf{z}_l$:

$$\hat{F}_{\mathbf{z}_{\backslash l}}(\mathbf{z}_l) = \hat{F}(\mathbf{z}_l \mid \mathbf{z}_{\backslash l}). \tag{A.10}$$

In general, the functional form of $\hat{F}_{\mathbf{z}_{\backslash l}}(\mathbf{z}_l)$ will depend on the particular values chosen for $\mathbf{z}_{\backslash l}$. If this dependence is not too strong, then the average function

$$\bar{F}_l(\mathbf{z}_l) = E_{\mathbf{z}_{\backslash l}}\!\left[\hat{F}(\mathbf{x})\right] = \int \hat{F}(\mathbf{z}_l, \mathbf{z}_{\backslash l})\, p_{\backslash l}(\mathbf{z}_{\backslash l})\, d\mathbf{z}_{\backslash l}, \tag{A.11}$$

with

$$p_{\backslash l}(\mathbf{z}_{\backslash l}) = \int p(\mathbf{x})\, d\mathbf{z}_l, \tag{A.12}$$

can represent a useful summary of the partial dependence of $\hat{F}(\mathbf{x})$ on the chosen variable subset $\mathbf{z}_l$. Here $p_{\backslash l}(\mathbf{z}_{\backslash l})$ is the marginal probability density of $\mathbf{z}_{\backslash l}$ (equation A.12), where $p(\mathbf{x})$ is the joint density of all the covariates $\mathbf{x}$. An empirical estimate of this quantity can be obtained from the data. In the case where the dependence of $\hat{F}(\mathbf{x})$ on the chosen covariate subset is additive or multiplicative, the form of $\hat{F}_{\mathbf{z}_{\backslash l}}(\mathbf{z}_l)$ does not depend on the joint values of the complement covariates $\mathbf{z}_{\backslash l}$, and equation A.11 provides a complete description of the variation of $\hat{F}(\mathbf{x})$ with the chosen covariate subset. The approximation of the partial dependence measure based on the reasoning above is simplified in the case of regression trees by the procedure given in section 6.2.
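An empirical estimate of equation A.11 replaces the integral over the marginal density with an average over the observed data rows: clamp the target covariates at each grid value, leave the complement covariates at their observed values, and average the model output. A minimal sketch of our own (the function names are ours; `F_hat` stands for any fitted model evaluated row-wise):

```python
import numpy as np

def partial_dependence(F_hat, X, target_cols, grid):
    # Empirical version of equation A.11: for each setting z of the target
    # covariates z_l, average F_hat over the data rows, which supply the
    # observed values of the complement covariates z_\l.
    pd_values = []
    for z in grid:
        Xz = X.copy()
        Xz[:, target_cols] = z          # clamp the target covariate(s)
        pd_values.append(F_hat(Xz).mean())
    return np.array(pd_values)
```

For an additive model the result reproduces the target covariate's effect up to a constant offset, consistent with the remark above about additive dependence.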
Acknowledgments

W. T. thanks Emery Brown and Uri Eden for discussions on point process theory, Ji Zhu for discussions on boosting algorithms, and Carlos Vargas for the data recordings. This work was supported by ONR NRD-339, NIH NS-25074, and DARPA MDA 972-00-1-0026.

References

Barbieri, R., Frank, L. M., Quirk, M. C., Solo, V., Wilson, M. A., & Brown, E. N. (2002). Construction and analysis of non-gaussian spatial models of neural spiking activity. Neurocomputing, 44–46, 309–314.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (1997). Arcing the edge (Tech. Rep. No. 486). Berkeley: Department of Statistics, University of California, Berkeley.
Breiman, L. (1999). Prediction games and arcing algorithms. Neural Computation, 11(7), 1493–1517.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Breiman, L., Friedman, J. H., Olshen, J. R., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth.
Brezger, A., Kneib, T., & Lang, S. (2005). BayesX: Analyzing Bayesian structured additive regression models. Journal of Statistical Software, 14(11), 1–22.
Brezger, A., & Lang, S. (2006). Generalized structured additive regression based on Bayesian P-splines. Computational Statistics and Data Analysis, 50, 967–991.
Brillinger, D. R. (1988). Maximum likelihood analysis of spike trains of interacting nerve cells. Biological Cybernetics, 59, 189–200.
Brockwell, A. E., Rojas, A. L., & Kass, R. E. (2004). Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91(2), 1899–1907.
Brown, E. N., Barbieri, R., Eden, U. T., & Frank, L. M. (2003). Likelihood methods for neural data analysis. In J. Feng (Ed.), Computational neuroscience: A comprehensive approach. Boca Raton, FL: CRC Press.
Brown, E. N., Barbieri, R., Ventura, V., Kass, R. E., & Frank, L. M. (2001). The time-rescaling theorem and its application to neural spike train data analysis. Neural Computation, 14, 325–346.
Chen, D., & Fetz, E. E. (2005). Characteristic membrane potential trajectories in primate sensorimotor cortex neurons recorded in vivo. Journal of Neurophysiology, 94, 2713–2725.
Chornoboy, E. S., Schramm, L. P., & Karr, A. F. (1988). Maximum likelihood identification of neuronal point process systems. Biological Cybernetics, 59, 265–275.
Copas, J. B. (1983). Regression, prediction, and shrinkage (with discussion). Journal of the Royal Statistical Society, B, 45, 311–354.
Daley, D., & Vere-Jones, D. (2003). An introduction to the theory of point processes. New York: Springer-Verlag.
DiMatteo, I., Genovese, C. R., & Kass, R. E. (2001). Bayesian curve-fitting with free-knot splines. Biometrika, 88, 1055–1071.
Doucet, A., de Freitas, N., & Gordon, N. (2001). Sequential Monte Carlo methods in practice. New York: Springer-Verlag.
Eden, U. T., Frank, L. M., Barbieri, R., Solo, V., & Brown, E. N. (2004). Dynamic analyses of neural encoding by point process adaptive filtering. Neural Computation, 16(5), 971–998.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32(2), 407–499.
Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89–102.
Frank, L. M., Eden, U. T., Solo, V., Wilson, M. A., & Brown, E. N. (2002). Contrasting patterns of receptive field plasticity in the hippocampus and the entorhinal cortex: An adaptive filtering approach. Journal of Neuroscience, 22, 3817–3830.
Freund, Y., & Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory (pp. 23–27).
Friedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics, 19(1), 1–82.
Friedman, J. H. (1999). Stochastic gradient boosting (Tech. Rep.). Palo Alto, CA: Stanford University, Statistics Department.
Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29(5), 1189–1232.
Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis, 38, 367–378.
Friedman, J., Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). Discussion. Annals of Statistics, 32(1), 102–107.
Friedman, J. H., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: A statistical view of boosting (with discussion). Annals of Statistics, 28(2), 337–374.
Gamerman, D. (1997). Sampling from the posterior distribution in generalized linear mixed models. Statistics and Computing, 7, 57–68.
Gao, Y., Black, M. J., Bienenstock, E., Shoham, S., & Donoghue, J. P. (2001).
Probabilistic inference of hand motion from neural activity in motor cortex. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 213–220). Cambridge, MA: MIT Press.
Girsanov, I. V. (1960). On transforming a certain class of stochastic processes by absolutely continuous substitution of measures. Theory of Probability and Its Applications, 5(3), 285–301.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning. New York: Springer-Verlag.
Hatsopoulos, N., Joshi, J., & O'Leary, J. G. (2004). Decoding continuous and discrete motor behaviors using motor and premotor cortical ensembles. Journal of Neurophysiology, 92, 1165–1174.
Johnson, A., & Kotz, S. (1970). Distributions in statistics: Continuous univariate distributions. New York: Wiley.
Kass, R. E., & Ventura, V. (2001). A spike-train probability model. Neural Computation, 13, 1713–1720.
Kaufman, C. G., Ventura, V., & Kass, R. E. (2005). Spline-based non-parametric regression for periodic functions and its application to directional tuning of neurons. Statistics in Medicine, 24(14), 2255–2265.
Lebanon, G., & Lafferty, J. (2002). Boosting and maximum likelihood for exponential models. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 447–454). Cambridge, MA: MIT Press.
Lugosi, G., & Vayatis, N. (2004). On the Bayes-risk consistency of regularized boosting methods. Annals of Statistics, 32(1), 30–55.
Mason, L., Baxter, J., Bartlett, P., & Frean, M. (1999). Boosting algorithms as gradient descent. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 512–518). Cambridge, MA: MIT Press.
Okatan, M., Wilson, M. A., & Brown, E. N. (2005). Analyzing functional connectivity using a network likelihood model of ensemble neural spiking activity. Neural Computation, 17(9), 1927–1961.
Paninski, L. (2004). Maximum likelihood estimation of cascade point-process neural encoding models. Network, 15(4), 243–262.
Paninski, L., Fellows, M. R., Hatsopoulos, N. G., & Donoghue, J. P. (2004). Spatiotemporal tuning of motor neurons for hand position and velocity. Journal of Neurophysiology, 91, 515–532.
Papangelou, F. (1972). Integrability of expected increments of point processes and a related change of scale. Transactions of the American Mathematical Society, 165, 483–506.
Rosset, S., Zhu, J., & Hastie, T. (2004). Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5, 941–973.
Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 26, 1651–1686.
Spiegelhalter, D., Thomas, A., Best, N., & Lunn, D. (2003). WinBUGS user manual (Version 1.4). Cambridge: Medical Research Council Biostatistics Unit.
Truccolo, W., Eden, U. T., Fellows, M. R., Donoghue, J. P., & Brown, E. N. (2005). A point process framework for relating neural spiking activity to spiking history, neural ensemble, and extrinsic covariate effects. Journal of Neurophysiology, 93(2), 1074–1089.
Truccolo, W., Vargas, C., Philip, B., & Donoghue, J. P. (2005). M1-5d statistical interdependencies via dual multi-electrode array recordings. Society for Neuroscience, abstract 981.13, Washington, DC. Available online at http://sfn.scholarone.com/itin2005/index.html.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wood, S. N. (2004). Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association, 99, 673–686.
Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32(1), 56–85.
Received August 16, 2005; accepted July 11, 2006.
LETTER
Communicated by Bard Ermentrout
Synchrony of Neuronal Oscillations Controlled by GABAergic Reversal Potentials

Ho Young Jeong
[email protected] Center for Neural Science, New York University, New York, NY 10003, U.S.A.
Boris Gutkin
[email protected] Group for Neural Theory, DEC, ENS-Paris; Collège de France; and URA 2169 Récepteurs et Cognition, Institut Pasteur, Paris, France
GABAergic synapse reversal potential is controlled by the concentration of chloride. This concentration can change significantly during development and as a function of neuronal activity. Thus, GABA inhibition can be hyperpolarizing, shunting, or partially depolarizing. Previous results pinpointed the conditions under which hyperpolarizing inhibition (or depolarizing excitation) can lead to synchrony of neural oscillators. Here we examine the role of the GABAergic reversal potential in the generation of synchronous oscillations in circuits of neural oscillators. Using weakly coupled oscillator analysis, we show when shunting and partially depolarizing inhibition can produce synchrony, asynchrony, and coexistence of the two. In particular, we show that this depends critically on such factors as the firing rate, the speed of the synapse, spike frequency adaptation, and, most important, the dynamics of spike generation (type I versus type II). We back up our analysis with simulations of small circuits of conductance-based neurons, as well as large-scale networks of neural oscillators. The simulation results are compatible with the analysis: for example, when bistability is predicted analytically, the large-scale network shows clustered states.

1 Introduction

Several studies (Ben-Ari, 2001, 2002) have shown that the effect of GABAergic synapses on the postsynaptic membrane varies according to the developmental stage of neuronal circuits. In the adult brain, the glutamatergic and GABAergic synapses act as the major sources of excitation and inhibition, respectively, through ionotropic ligand-gated channels. The balance between excitation and inhibition is essential to sculpt oscillatory patterns of neural activity such as theta or gamma activity. In contrast to the adult, GABA synaptic neurotransmission in the developing

Neural Computation 19, 706–729 (2007)
© 2007 Massachusetts Institute of Technology
GABA Reversal of Synchrony
707
brain circuits is depolarizing (excitatory) as a result of a high intracellular concentration of chloride ions. This excitatory effect is associated with the early network oscillations (ENOs) in the CA1 region (Garaschuk, Hanse, & Konnerth, 1998) and the giant depolarizing potentials (GDPs) in the CA3 region of the neonatal hippocampus (Ben-Ari, Cherubini, Corradetti, & Gaiarsa, 1989). Both the ENOs and the GDPs reflect underlying network activity with spontaneous oscillatory patterns during early postnatal development. It is believed that GABA-mediated depolarization is a key factor in the development of central nervous structures. GABAergic synaptic channels are permeable to chloride ions, whose concentration controls the GABA synaptic reversal potential. In developing neurons, the reversal potential for chloride (E_Cl) sits at a more depolarized level than the resting membrane potential (E_rest); hence the depolarizing action of the GABA synapse. In the rodent hippocampus, E_Cl decreases with age during the immediate postnatal period and arrives at the typical resting potential of −65 mV after about two weeks (Owens, Boyce, Davis, & Kriegstein, 1996). E_Cl in the adult can be below E_rest but, depending on the activity of the postsynaptic neuron, can ride up toward E_rest, producing so-called shunting inhibition. Synchronous patterns have been recognized as one of the most prominent features of neuronal activity in the brain (see Buzsáki & Draguhn, 2004, for a review). For example, synchrony of neuronal firing has been observed during sensory information processing, such as visual pattern recognition (Gray, König, Engel, & Singer, 1989), odor recognition (MacLeod & Laurent, 1996), and sensorimotor tasks (Abeles, Bergman, Margalit, & Vaadia, 1993).
Several computational studies have revealed that GABA-mediated inhibition is capable of synchronizing the activity in interneuronal networks when the decay time of the inhibitory synaptic current is slow relative to the intrinsic membrane recovery time (Bush & Sejnowski, 1996; Hansel & Mato, 2003; van Vreeswijk, Abbott, & Ermentrout, 1994; Wang, 2003). These results are in good agreement with physiological results showing that GABA-A-receptor-mediated inhibition in hippocampal interneuronal circuits, without the involvement of pyramidal neurons, could give rise to coherent 40 Hz oscillations (Whittington, Traub, & Jefferys, 1995). Also, spontaneous coherent oscillations at 40 Hz have been demonstrated in hippocampal slices with cholinergic modulation, suggesting that synchronization can occur in networks of both interneurons and pyramidal cells, provided that the balance between excitatory and inhibitory synaptic transmission is maintained (Fisahn, Pike, Buhl, & Paulsen, 1998). In the developing neuronal circuits, spontaneous coherent oscillatory patterns of activity such as ENOs and GDPs are clearly present. These behaviors appear at developmental stages when glutamatergic neuronal transmission has not matured, yet the depolarizing GABAergic synapses are fully functional, suggesting that the actions of GABA play a
708
H. Jeong and B. Gutkin
crucial role in generating the neuronal activity in the neonatal hippocampus (Ben-Ari, 2002). However, this appears to be at odds with previous theoretical results pointing out that synchronization of neuronal oscillations is unlikely to be realized by excitation alone (except under specific conditions on the neurons themselves; see below), but is achieved with the help of synaptic inhibition (Ermentrout, 1996; Gerstner & Kistler, 2002; Hansel, Mato, & Meunier, 1995; Neltner & Hansel, 2001). A partial explanation exists for the conditions under which intrinsically oscillating neurons can synchronize through excitation (Crook, Ermentrout, & Bower, 1998; Ermentrout, Pascal, & Gutkin, 2001). The key was the slow voltage-dependent potassium current (the M-current), which has the ability to shift a neuron from a type I to a type II oscillator. In addition, the calcium-dependent potassium current producing the afterhyperpolarization (AHP) current gave rise to synchronization by making neurons insensitive to inputs that would normally desynchronize the circuit. However, previous theoretical analyses considered synaptic inputs not as conductances but as currents. Thus, they could not show how changes in the reversal potential of the GABAergic synapses affect synchronization properties. The analysis presented in such work was based on pairs of mutually coupled neural oscillators. Hence, most of these studies did not consider how a large neuronal network would act in the parameter regions treated by the two-neuron analysis. To address these issues, we consider three representative regimes of the GABAergic synapses in reciprocally coupled neuronal circuits: hyperpolarizing, shunting, and depolarizing. For these regimes, we investigate how synchrony is affected by the adaptation current that is involved in neonatal neurons.
Through both theoretical analysis and numerical simulations, we show that adaptation-induced low firing rates with depolarizing GABAergic synapses may be the key to the formation of developing neuronal circuits through synchronization. Parts of this work have been presented previously in conference proceedings (Jeong & Gutkin, 2005).

2 The Model and Methods

2.1 Neuron Model. In this work, we use a single-compartment conductance-based model that contains the currents required for spiking: a voltage-dependent transient sodium current (I_Na), a delayed rectifier K-current (I_K), and a nonspecific leak current (I_L). In addition, we add a high-threshold calcium current (L current), I_Ca, and a calcium-dependent potassium current, I_AHP, since these currents are associated with generating the neuronal activities in neonatal rat CA3 pyramidal neurons in vitro (Tanabe, Gähwiler, & Gerber, 1998). In particular, a voltage-independent afterhyperpolarization (AHP) current, driven by the L-type calcium channels, produces spike frequency adaptation without changing the bifurcation structure of the neuron model. The model is a simple variant of Traub's
GABA Reversal of Synchrony
neuron model (Traub & Miles, 1991), and its mathematical description and parameters are borrowed from Ermentrout and colleagues (Ermentrout et al., 2001). We use this particular model so as to make direct comparisons with the previously published work (e.g., Ermentrout et al., 2001). The membrane potential V of a neuron obeys the following current-balance equation,

Cm dV/dt + IL + INa + IK + ICa + IAHP = I,    (2.1)
where Cm = 1 µF/cm2 and the leak current IL = gL(V − EL). I is the injected current. The voltage-dependent sodium and potassium currents are described by gating variables x that satisfy first-order kinetics,

dx/dt = αx(V)(1 − x) − βx(V)x,    (2.2)
which can be considered as the mean activation of a large number of channels. The rates αx and βx are functions of the voltage. For the sodium current, INa = ḡNa m³h(V − ENa), where αm(V) = 0.32(V + 54)/(1 − exp(−(V + 54)/4)), βm(V) = 0.28(V + 27)/(exp((V + 27)/5) − 1), αh(V) = 0.128 exp(−(V + 50)/18), and βh(V) = 4/(1 + exp(−(V + 27)/5)). For the delayed rectifier potassium current, IK = ḡK n⁴(V − EK), where αn(V) = 0.032(V + 52)/(1 − exp(−(V + 52)/5)) and βn(V) = 0.5 exp(−(V + 57)/40). The high-threshold calcium current is ICa = ḡCa m∞(V − ECa), where m∞ = 1/(1 + exp(−(V + 25)/2.5)). The voltage-independent calcium-activated potassium current is IAHP = ḡAHP([Ca2+]/([Ca2+] + 1))(V − EK). The change in the intracellular calcium concentration [Ca2+] is given by

d[Ca2+]/dt = −ε ICa − [Ca2+]/τCa,    (2.3)
where ε is proportional to the ratio of the membrane area to the volume beneath the membrane, and τCa is the time constant with which [Ca2+] relaxes. Here, we choose the same parameter values as Wang (1998): ε = 0.002 µM (ms µA)−1 cm2 and τCa = 80 ms. The parameter values for generating spikes are chosen to ensure type I membrane dynamics, since there is some evidence that cortical neurons have such dynamics (Gerstner & Kistler, 2002; Tateno, Harsch, & Robinson, 2004): EK = −100, ENa = 50, EL = −67, ECa = 120 (in mV); ḡK = 80, ḡNa = 100, gL = 0.2, ḡCa = 1 (in mS/cm2). The parameters ḡAHP and I are varied in simulations to control the adaptation and the firing rate of the model. Figure 1 illustrates that the neuron model shows typical type I membrane behavior. As Ermentrout et al. (2001) pointed out, the neuron is in the resting state at low values of injected current and begins to fire repetitively at an
H. Jeong and B. Gutkin
[Figure 1 panels: (A) firing rate versus injected current I (µA/cm2); (B) bifurcation diagram of V (mV) versus I, with stable and unstable branches and the saddle node at I = 0.4917.]
Figure 1: F-I curve and bifurcation diagram. (A) Firing onset occurs at the saddle-node (SN) bifurcation, with zero frequency at onset. Even if the adaptation current is relatively strong (ḡAHP = 2), the neuron fires at an arbitrarily low frequency. The smooth firing-rate curves are fitted to the data obtained at each injected current. Insets illustrate the firing with and without the adaptation current. (B) The bifurcation analysis of the Traub neuron shows that the model has an SN for a bifurcation point when ḡAHP = 1, which is a typical pattern of a type I membrane.
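With all rate functions and parameters specified, the model can be integrated directly. The following sketch (a hypothetical `simulate` helper, not the authors' XPPAUT code) uses forward Euler with the values quoted above, taking the sodium deactivation prefactor as βm = 0.28, the standard Traub–Miles value. It only checks the qualitative type I behavior: quiescence at I = 0 and repetitive firing above the saddle node near I ≈ 0.49 µA/cm².

```python
import math

def simulate(I_ext, g_AHP=0.0, T=400.0, dt=0.01):
    """Forward-Euler integration of equations 2.1-2.3.
    Returns the number of spikes (upward crossings of 0 mV) in T ms."""
    # Reversal potentials (mV) and conductances (mS/cm^2) from the text.
    E_K, E_Na, E_L, E_Ca = -100.0, 50.0, -67.0, 120.0
    g_K, g_Na, g_L, g_Ca = 80.0, 100.0, 0.2, 1.0
    Cm, eps, tau_Ca = 1.0, 0.002, 80.0   # uF/cm^2, Ca influx factor, ms

    # Voltage-dependent rate functions (equation 2.2).
    am = lambda V: 0.32 * (V + 54) / (1 - math.exp(-(V + 54) / 4))
    bm = lambda V: 0.28 * (V + 27) / (math.exp((V + 27) / 5) - 1)
    ah = lambda V: 0.128 * math.exp(-(V + 50) / 18)
    bh = lambda V: 4.0 / (1 + math.exp(-(V + 27) / 5))
    an = lambda V: 0.032 * (V + 52) / (1 - math.exp(-(V + 52) / 5))
    bn = lambda V: 0.5 * math.exp(-(V + 57) / 40)

    V, Ca = -67.0, 0.0
    m = am(V) / (am(V) + bm(V))
    h = ah(V) / (ah(V) + bh(V))
    n = an(V) / (an(V) + bn(V))
    spikes, above = 0, False
    for _ in range(int(T / dt)):
        I_Na = g_Na * m**3 * h * (V - E_Na)
        I_K = g_K * n**4 * (V - E_K)
        I_L = g_L * (V - E_L)
        I_Ca = g_Ca / (1 + math.exp(-(V + 25) / 2.5)) * (V - E_Ca)  # m_inf gating
        I_AHP = g_AHP * Ca / (Ca + 1) * (V - E_K)
        m += dt * (am(V) * (1 - m) - bm(V) * m)
        h += dt * (ah(V) * (1 - h) - bh(V) * h)
        n += dt * (an(V) * (1 - n) - bn(V) * n)
        Ca += dt * (-eps * I_Ca - Ca / tau_Ca)
        V += dt * (I_ext - I_L - I_Na - I_K - I_Ca - I_AHP) / Cm
        if V > 0 and not above:
            spikes += 1
        above = V > 0
    return spikes
```

Switching on adaptation (ḡAHP > 0) lowers the spike count at a given suprathreshold drive without removing the saddle-node onset, as Figure 1A illustrates.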
arbitrarily low frequency (see Figure 1A). The bifurcation diagram shows that the neuron model has a stable limit cycle when a saddle point and a stable node coalesce and disappear (see Figure 1B). Note that this high-threshold adaptation current does not affect the bifurcation structure of the neuron in the equilibrium state, but causes the neuron to fire at a lower rate when a current above the critical value is injected (as compared to the nonadapting case). The adapting IAHP provides a delayed negative feedback on the neuron's excitability through a slow, outward potassium current.

2.2 Synaptic Model. The release of the GABA neurotransmitter that binds to the GABAA receptors is well described by a first-order kinetic model (Destexhe, Mainen, & Sejnowski, 1998), which represents the transitions between the open and closed states of the receptors. If S is defined as the fraction of receptors in the open state, its change is governed by the kinetics of the GABAA receptors and the concentration of transmitter [T],

dS/dt = α[T](1 − S) − βS,    (2.4)
where α and β are the forward and backward rate constants for GABAA receptors. In this work, α is fixed at 2.0 mM−1 ms−1, but β is varied to investigate how the decay time constant influences synchrony. For [T], it is convenient
[Figure 2 panels: (A) S(t) for β = 0.1, 0.5, and 2.0; (B) postsynaptic voltage responses for EGABA = 0, −30, −65, and −80 mV.]
Figure 2: Effects of the synaptic decay time and influence of the reversal potential on the postsynaptic membrane potential. (A) The shape of S changes according to the backward reaction rate β. When β increases, synapses become faster, and the stabilities of the phase dynamics vary. (B) Transient postsynaptic membrane potentials caused by the presynaptic release of GABAergic neurotransmitter. Depending on the reversal potential of the GABAergic synapse, the postsynaptic response is depolarizing (EGABA = 0 and −30 mV), shunting (−65 mV), or hyperpolarizing (−80 mV).
to use a continuous function to transform the presynaptic voltage into the transmitter concentration. If it is assumed that all the reactions involved are relatively fast, they can be considered to be at steady state, and the transform of the presynaptic voltage into the transmitter concentration can be obtained from the stationary relationship

[T](Vpre) = Tmax / (1 + exp(−(Vpre − Vp)/Kp)),    (2.5)
where Tmax is the maximal concentration of transmitter in the synaptic cleft and Vp and Kp determine the threshold and the steepness of the release. We use the values that Ermentrout et al. (2001) chose: Tmax = 1 mM, Vp = −10 mV, and Kp = 10 mV. Note that the synapse speeds up as the backward rate constant β increases (see Figure 2A). The postsynaptic current IGABA is given by

IGABA = ḡGABA S(t)(V − EGABA),    (2.6)
where ḡGABA is the maximal conductance of the GABAA receptors and EGABA their reversal potential. This synaptic model is inserted into the membrane equation of the neuron (see equation 2.1). Once a fast chemical synapse is activated, a rapid and transient change can be observed in the postsynaptic potential. When the synapse is excitatory, the membrane potential rapidly depolarizes and then returns slowly to its resting state, producing an excitatory
postsynaptic potential (EPSP). In contrast, an inhibitory synapse gives rise to a transient hyperpolarization of the postsynaptic membrane, an inhibitory postsynaptic potential (IPSP). Figure 2B illustrates how the reversal potential of the GABAergic synapse influences the postsynaptic membrane potential.

2.3 Phase Reduction. In general, it is difficult to predict the behavior of coupled neurons that have nonlinear properties. However, in the limit of weak coupling, we can reduce the behavior of a pair of coupled neurons to a simple phase model and analyze the stability of its phase-locked solutions (Kuramoto, 1984; Rinzel & Ermentrout, 1998; Ermentrout et al., 2001; Pikovsky, Rosenblum, & Kurths, 2001; Winfree, 2001). If individual neurons with identical properties have stable limit cycles and their coupling is weak enough not to affect the amplitude of the neurons' limit cycle (i.e., it has only a weak effect on the voltage), then the full system can be reduced to a model for the difference between the phases of the neurons. In this section, we recap how to derive the phase model and then apply it to specific conductance-based neuron models. Consider a system governed by ordinary differential equations,

dX/dt = F(X),    (2.7)
and suppose that this system has a stable periodic solution. We now consider an identical pair of oscillators Xi (i = 1, 2) that are weakly coupled: the coupling does not distort the waveforms but perturbs only the relative phase between the coupled oscillators. We write them in the general form

dXi/dt = F(Xi) + gc Gi(X1, X2).    (2.8)
Using the frequencies of the uncoupled oscillators, ωi = 1, we rewrite equation 2.8 in terms of a phase θ on the limit cycle. From the chain rule, we obtain

dθi/dt = Σk (∂θi/∂Xk)[F(Xk) + gc Gk(X1, X2)]    (2.9)
       = ωi + gc Σk (∂θi/∂Xk) Gk(X1, X2),    i = 1, 2.    (2.10)
For 0 < gc ≪ 1, the phases evolve slowly relative to the oscillation period T, so the coupling terms can be averaged over one period, giving

dθi/dt = ωi + gc H(θj − θi),

where H is the averaged interaction function defined below. In terms of the normalized phase difference φ = (θ2 − θ1)/T, the dynamics reduce to

dφ/dt ∝ −Hodd(φ),    Hodd(φ) = [H(φ) − H(−φ)]/2,

so that a phase-locked solution φ∗ satisfies Hodd(φ∗) = 0 and is stable when the slope of Hodd at φ∗ is positive, H′odd(φ∗) > 0. In this article, we define synchrony as a zero phase lag when two identical neurons are coupled. So far, we have dealt with the general forms for the weakly coupled oscillators. For two neurons mutually coupled through GABAA receptors (see equations 2.1–2.6), the averaged interaction function H is given by
H(φ) = (ḡGABA/T) ∫0^T Z(t) S(t + φT)(EGABA − V(t)) dt.    (2.15)
This equation shows that all the changes in the function H arise from three factors: the intrinsic membrane dynamics (through Z and V), the synaptic function S(t), and the synaptic reversal potential.
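Once the PRC Z(t), the synaptic gating S(t), and the voltage V(t) over one unperturbed cycle are known (numerically, in general), equation 2.15 is simply a periodic correlation integral. The sketch below evaluates it with the rectangle rule; the waveforms `Z`, `S`, and `V` are illustrative stand-ins (a nonnegative, type I-like PRC, a 10 ms exponential synapse, a smooth voltage trace), not the model's actual trajectories, and the helper names are assumptions.

```python
import math

T, N = 100.0, 2000          # oscillation period (ms) and grid size, illustrative
ts = [i * T / N for i in range(N)]

# Toy one-cycle waveforms (assumptions for illustration only):
Z = [math.sin(math.pi * t / T) ** 2 for t in ts]                 # nonnegative type I PRC
S = [math.exp(-t / 10.0) for t in ts]                            # synaptic gating, 10 ms decay
V = [-60.0 - 40.0 * math.cos(2 * math.pi * t / T) for t in ts]   # membrane voltage

def H(phi, E_GABA, g=1.0):
    """Averaged interaction function of equation 2.15 (rectangle rule)."""
    acc = 0.0
    for i, t in enumerate(ts):
        j = int(((t + phi * T) % T) / T * N) % N   # index of S at time t + phi*T
        acc += Z[i] * S[j] * (E_GABA - V[i])
    return g * acc / N

def H_odd(phi, E_GABA):
    """Odd part of H: zeros are phase-locked states, stable where the slope is positive."""
    return 0.5 * (H(phi, E_GABA) - H(-phi, E_GABA))
```

Scanning `E_GABA` shifts H, and hence the zeros of `H_odd`, mimicking how the locked phases move in the bifurcation diagrams of section 3.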
2.4 Populations of Globally Coupled Neurons. The analysis of mutual synchronization of two coupled oscillators can be extended to large networks through the simple differential equation

dθk/dt = ωk + (ḡGABA/N) Σj=1..N H(θj − θk),    (2.16)
where N is the number of neurons and H is the interaction function between two neurons defined in equation 2.15. It is generally impossible to obtain the function H analytically, because the response function Z can be determined in closed form only in special cases. Since H is periodic with period T, however, it can be approximated by a Fourier expansion (five components are used here). If the individual neurons are identical and their interactions are homogeneous, then their natural frequencies ω can be set to 1 without loss of generality. Note that large networks can show transitions between the synchronous state and the asynchronous state when the synaptic strength is varied as a control parameter (Kuramoto, 1984). However, we treat the weak coupling limit in order to compare with the two-neuron analysis of synchronization. To characterize the collective behavior of the N neurons, it is convenient to introduce the order parameter (Kuramoto, 1984; Hansel, Mato, & Meunier, 1993)

Zn(t) = (1/N) Σj=1..N exp(inθj(t))    (n = 1, 2),    (2.17)
which measures the degree of phase coherence. The instantaneous degree of coherent behavior in the network can be further described by the squared modulus of Zn(t):

Rn²(t) = (1/N²)[(Σj cos(nθj(t)))² + (Σj sin(nθj(t)))²]    (2.18)
       = (1/N²) Σj,k cos(nθj(t) − nθk(t)),    (2.19)
which is called the synchronization index. From equation 2.19, it is obvious that the network is in the synchronous state when Rn converges to 1 in time, while the completely asynchronous state results when it goes to 0. In particular, a network with two equal-size clusters gives R1 = 0 but R2 = 1. All
numerical simulations and bifurcation analyses are done with the XPPAUT software (Ermentrout, 2002).

3 Results

3.1 Changes in the Synaptic Reversal Potential. We first investigate how changes in the GABA reversal potential affect the synchronization of two type I neural oscillators. Here, the neurons are induced to fire repetitively by application of a constant current input; in this first case, the neurons are nonadapting. In Figure 3, we plot the infinitesimal PRCs and the asymptotic values of the phase difference between two neurons coupled with a GABAergic synapse. As the reversal potential increases, meaning that the synapse becomes partially depolarizing, a pitchfork bifurcation appears at a critical value of the reversal potential. At this point, the antisynchronous state becomes unstable, and two asynchronous states (i.e., states with a nonzero phase lag) emerge. These asynchronous states move closer to the synchronous state as the GABAergic reversal potential increases. This effect is enhanced when the two (here nonadapting) neurons have a low firing rate due to a lower amplitude of the input current (see Figures 3A and 3B), yet even at EGABA = 0, the phase difference remains large (see Figure 3B). When we instead control the firing rate of the neuron with spike-frequency adaptation, the asynchronous states tend to approach the synchronous state, and the phase difference between the two neurons becomes nearly 0 when the reversal potential moves above the cell's resting potential. Figure 3C shows that the slow process associated with the adaptation current plays a crucial role in synchronization. This might result from the partial refractoriness due to the adaptation current, which makes the neurons insensitive to inputs arriving after a spike. At the same firing rate as the nonadapting case (10 Hz), the adaptation current causes the neuron to have a more flattened PRC at the beginning of the firing cycle and leads to near synchrony (see Figures 3B and 3C).
We note that even if the adaptation current is involved, the neuron model is a type I membrane, because the PRC is nonnegative, so that a brief depolarizing injected current does not delay the phase. The bifurcation diagram of the phase difference also shows that in the shunting region, where the reversal potential is nearly equal to the resting potential of the membrane (Erest), the only stable solutions are antiphase (asynchronous), regardless of whether the neuron is adapting. Hyperpolarizing GABAergic synapses can give stable synchronous states in addition to the asynchronous state in the absence of adaptation, leading to bistability of the phase-locked solutions. Neurons with adaptation do not have a stable in-phase locked solution, despite hyperpolarizing GABAergic reversal potentials. This fact seems to differ from previous results (van Vreeswijk et al., 1994; Hansel & Mato, 2003). However, when the synapse is sufficiently slow, the adapting system regains a stable synchronous state (e.g., see Figure 4C). Therefore, the bistability of the phase dynamics can
Figure 3: Both the GABAergic reversal potential and the AHP current affect synchrony. The infinitesimal PRC is shown in the left panels, and the bifurcation diagrams for the phase difference between two neurons as a function of the reversal potential are plotted in the right panels. In the phase-difference diagrams, solid lines denote stable solutions and dashed lines unstable solutions. The resting potential is marked on each diagram. Here we control the firing rate of the model by changing the injected current. (A) 40 Hz without adaptation. (B) 10 Hz without adaptation. (C) 10 Hz with adaptation.
[Figure 4 panels: phase difference (φ/T) versus β for EGABA = 0 mV, −65 mV, and −80 mV.]
Figure 4: Bifurcation diagrams of the phase difference as the control parameter β changes. (A) Depolarizing region (EGABA = 0 mV). (B) Shunting region (−65 mV). (C) Hyperpolarizing region (−80 mV) of the GABAergic synapses.
be considered a general property of two neurons coupled with hyperpolarizing GABAergic synapses. If this analysis is extended to a large neuronal network, the system could break into multicluster states, as previously suggested (van Vreeswijk, 1996).

3.2 Effects of Synaptic Speed on Synchronization. In the previous section, we saw that the shape of the synaptic function plays an important role in controlling the synchronous states for GABAergic synapses. Figure 4 shows how the synchronous behavior of two neurons is affected by the backward rate constant β of the GABAergic synapse. As β increases, the synapse speeds up, and its decay time becomes shorter (see Figure 2A). For depolarizing coupling, the phase difference between the two neurons tends to zero as β increases; however, no value of β leads to exact synchronization. In the shunting region, the stability of the antiphase-locked solutions is independent of the synaptic decay time. Hyperpolarizing GABAergic synapses usually lead to antiphase solutions, but when β decreases below a critical point, corresponding to a sufficiently slow synapse, an additional stable synchronous state occurs (as seen above). As a result, the backward rate constant β changes the stability of the phase difference via a supercritical or subcritical pitchfork bifurcation for the depolarizing and hyperpolarizing GABAergic synapse, respectively. The shunting GABAergic synapse always leads to antisynchronization. The weakly coupled oscillator analysis shows that the stability of the phase difference between two neurons is affected by three factors: the adaptation current, the synaptic function, and the reversal potential (see equation 2.15). Furthermore, numerical simulations of five neurons mutually coupled with GABAergic synapses confirm the results and imply that this two-neuron analysis can be extended to greater numbers of neurons or a large neuronal network (see Figure 5).
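The backward rate β enters the weak-coupling picture only through S(t) inside equation 2.15. A minimal sketch of this mechanism, again with toy waveforms rather than the model's true PRC and voltage trace (all helper names and waveforms here are assumptions for illustration): tabulate Hodd for a given β and reversal potential, then relax the normalized phase difference under dφ/dt ∝ −Hodd(φ) to find the attracting phase lag.

```python
import math

T, N = 100.0, 200                 # period (ms) and grid resolution, illustrative
ts = [i * T / N for i in range(N)]
Z = [math.sin(math.pi * t / T) ** 2 for t in ts]                 # toy nonnegative PRC
V = [-60.0 - 40.0 * math.cos(2 * math.pi * t / T) for t in ts]   # toy voltage trace

def hodd_table(beta, E_GABA, M=200):
    """Tabulate H_odd on M grid points, with S(t) = exp(-beta*t) in equation 2.15."""
    def H(p):
        acc = 0.0
        for i, t in enumerate(ts):
            acc += Z[i] * math.exp(-beta * ((t + p * T) % T)) * (E_GABA - V[i])
        return acc / N
    return [0.5 * (H(k / M) - H(-k / M)) for k in range(M)]

def locked_phase(beta, E_GABA, g=0.002, phi0=0.3, steps=20000, dt=0.1, M=200):
    """Relax dphi/dt = -2*g*H_odd(phi) to its attractor (table lookup)."""
    tab = hodd_table(beta, E_GABA, M)
    phi = phi0
    for _ in range(steps):
        phi = (phi - dt * 2.0 * g * tab[int(phi * M) % M]) % 1.0
    return phi
```

Comparing `locked_phase(0.1, -80.0)` against `locked_phase(2.0, -80.0)` shows how the asymptotic lag shifts with synaptic speed; with these toy waveforms the numbers are qualitative only.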
Figure 5: Numerical simulations show that the analysis for two neurons can be successfully applied to five coupled neurons. With depolarizing GABAergic connections (EGABA = −20 mV), the model leads to asynchrony when the synapses are sufficiently slow (top, β = 0.1). In contrast, the model gives rise to near synchronization when the synapses are fast (bottom, β = 2).
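The network of equation 2.16 together with the order parameters of equations 2.17–2.19 can be sketched in a few lines once H is expanded in Fourier harmonics. The helper below is an assumption-laden toy: phases are in radians (the paper normalizes phases by the period T), and H is taken as a pure sine series rather than the five-component fit to the two-neuron H. It nevertheless reproduces the diagnostic the order parameters are designed to detect: first-harmonic coupling yields full synchrony (R1 ≈ R2 ≈ 1), while a dominant second harmonic locks phases mod π, giving a two-cluster state with R2 ≈ 1 and R1 small.

```python
import cmath, math, random

def order_params(harmonics, N=100, g=1.0, steps=4000, dt=0.01, seed=1):
    """Globally coupled phase network (eq. 2.16) with H(d) = sum_n h_n*sin(n*d).
    Returns R1, R2 of eq. 2.19 at the end of the run."""
    rng = random.Random(seed)
    th = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(N)]
    for _ in range(steps):
        # Z_n = (1/N) sum_j exp(i n th_j)  (eq. 2.17)
        Zc = {n: sum(cmath.exp(1j * n * t) for t in th) / N for n in harmonics}
        # Mean-field identity: (1/N) sum_j sin(n(th_j - th_k)) = Im(Z_n exp(-i n th_k))
        th = [t + dt * (1.0 + g * sum(h * (Zc[n] * cmath.exp(-1j * n * t)).imag
                                      for n, h in harmonics.items()))
              for t in th]
    return tuple(abs(sum(cmath.exp(1j * n * t) for t in th) / N) ** 2 for n in (1, 2))

R1, R2 = order_params({1: 1.0})   # first-harmonic coupling: one cluster
C1, C2 = order_params({2: 1.0})   # second-harmonic coupling: two clusters
```

The mean-field identity in the comment keeps each step O(N) rather than O(N²), which is what makes a 100-neuron run cheap.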
3.3 Large Neuronal Network. Numerical simulations show that the two-neuron PRC analysis successfully predicts the synchronization behavior of five mutually connected neurons. To extend our results to coherent behavior in large-scale networks, we use a globally coupled phase-model network with 100 neurons (see equation 2.16). In the network, the synaptic coupling is based on the interaction function H between two neurons. If we assume that the interactions are homogeneous, the model can be simulated using the approximate interaction function H̃(φ) obtained from the Fourier expansion of the interaction function between two coupled neurons. Under conditions for which two neurons have bistable synchronous and asynchronous solutions, we expect the network to break up into cluster states. Previous results indeed suggested that networks with inhibitory coupling can break up into cluster states (van Vreeswijk, 1996). Figure 6 illustrates that for a large network whose interaction function reflects the presence of adaptation (skewed PRC), excitatory coupling leads
Figure 6: Synchrony and onset of cluster states in large networks. The averaged interaction functions H̃ (solid lines), approximated from the two-neuron phase model H (dotted lines), are plotted in the left panels. With five components in the Fourier expansion, we obtain results compatible with the two-neuron analysis. The evolution of R1 and R2, computed from the complex order parameters, is shown in the right panels. Note that we randomly select the initial condition of the network, since the cluster state depends on it. (A) EGABA = 0 mV with adaptation. (B) EGABA = −80 mV with adaptation. (C) EGABA = −80 mV without adaptation.
[Figure 7 panels: (A) PRC over one firing cycle; (B) bifurcation diagram of V (mV) versus I (µA/cm2), with the stable branch losing stability at the subcritical Hopf point HB = 5.4983.]
Figure 7: The M current yields type II spike-generating dynamics in a neuron model. (A) PRC showing positive and negative regions indicative of type II dynamics. (B) Bifurcation diagram for a type II neuron as a function of injected current. A subcritical Hopf bifurcation leads to the onset of oscillations.
to synchronization (see Figure 6A), but inhibitory coupling leads to desynchronization (see Figure 6B). These behaviors of large neuronal networks are in good agreement with the two-neuron analysis. In addition, we show that the network forms a two-cluster state (see Figure 6C) when the coupling is hyperpolarizing and reflects nonadapting neurons (a PRC without a strong left skew). The order parameters (see equation 2.19) effectively capture the state of the network. When both R1 and R2 converge together to 1 or 0, the neurons are perfectly synchronized or desynchronized, respectively (see Figures 6A and 6B). However, when these two quantities go to different values (see Figure 6C), the network is in a two-cluster state.

3.4 Stability of Phase-Locked States, Type II Neurons. In the previous sections, we have examined how the GABAergic reversal potential influences the stability of the synchronous and asynchronous solutions for a pair, a small circuit, and a large network of neural oscillators. For that analysis, we assumed that the neuron shows type I membrane dynamics, meaning that spiking arises through a saddle-node bifurcation and the PRC is purely positive. However, it was previously shown that type II oscillators have different synchronization properties (Ermentrout, 1996). In particular, type II membranes can synchronize with excitatory (depolarizing) synapses (Ermentrout et al., 2001). The key to this analysis is that the phase-resetting curve has negative as well as positive regions, which makes the zero-phase-lag solution stable. Ermentrout et al. (2001) and Gutkin, Ermentrout, and Reyes (2005) showed that a type I membrane can be changed into type II through a voltage-dependent negative feedback such as a low-threshold slow potassium current IM, as opposed to the high-threshold calcium-mediated IAHP used above. The difference between these two currents lies in their effective threshold. The IM current has a lower effective threshold, while the AHP
Figure 8: Hodd curves for the type II neuron (see above) for different reversal potentials of the synapse; the reversal potentials are marked on the panels. As can be seen, for EGABA near and above Erest, the slope of Hodd at the zero-phase-lag intersection is positive, and thus synchrony is stable. Synchrony loses stability when EGABA drops below −92.1 mV. Insets magnify the zero-phase-lag region. (A) EGABA = −65 mV. (B) EGABA = −70 mV. (C) EGABA = −90 mV. (D) EGABA = −96 mV.
current has a higher effective threshold. Interestingly, the M current affects the bifurcation structure: the stable solution branch now loses stability at a subcritical Hopf bifurcation point (compare Figures 1B and 7B). Since the stable oscillation appears at a nonzero frequency near a subcritical Hopf bifurcation, the M current changes the dynamics from type I to type II. In this case, the PRC is no longer purely positive but changes sign during the firing cycle (see Figure 7A). The low-threshold slow potassium current is IM = ḡM w(V − EK), where dw/dt = (w∞(V) − w)/τw(V), with w∞(V) = 1/(1 + exp(−(V + 35)/10)) and τw(V) = 100/(3.3 exp((V + 35)/20) + exp(−(V + 35)/20)). The parameter ḡM is set to 2 mS/cm2. Figure 8 shows the odd component of the interaction function H at various GABAergic reversal potentials for a type II Traub model
Figure 9: Bifurcation diagram as a function of EGABA for a type II neuron. Synchrony is stable and asynchrony is unstable above the resting potential. In the range near rest (down to −92.1 mV), synchrony and asynchrony coexist; below it, only asynchrony is stable. A dot marks the resting potential.
firing at 30 Hz. A positive (negative) slope of Hodd at a phase-locked solution denotes a stable (unstable) solution. Note that synchrony is stable for a reversal potential of −90 mV but becomes unstable at −96 mV. In Figure 9, the bifurcation diagram is plotted with the GABAergic reversal potential as the control parameter. In contrast to the type I neuron, the synchronous phase-locked solution remains stable for almost all GABAergic reversal potentials and loses stability at EGABA = −92.1 mV. Since hyperpolarizing GABAergic synapses also give rise to a stable antisynchronous solution, the system shows bistability. However, the bistable region disappears when the GABAergic reversal potential falls below the −92 mV value, which differs from the type I pattern. The shunting GABAergic reversal potential also gives rise to bistable synchronous and asynchronous solutions. The depolarizing GABAergic synapse leads to stable synchrony, which fits with the earlier work showing that excitatory coupling leads to synchrony in type II neuronal oscillators. Furthermore, even partially depolarizing GABA can lead to synchrony.
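The stability boundary in Figure 9 can be located numerically from the slope of Hodd at φ = 0. Because equation 2.15 is linear in EGABA, that slope is an affine function of the reversal potential, so a single critical value separates stable from unstable synchrony. The sketch below demonstrates this with toy waveforms (a sign-changing, type II-like PRC and a 5 ms exponential synapse); the critical value it returns is therefore illustrative, not the model's −92.1 mV, and all names are assumptions.

```python
import math

T, N = 33.3, 600                  # ~30 Hz cycle, illustrative
ts = [i * T / N for i in range(N)]
Z = [math.sin(2 * math.pi * t / T) for t in ts]                  # toy type II PRC (changes sign)
V = [-60.0 - 36.0 * math.cos(2 * math.pi * t / T) for t in ts]   # toy voltage trace

def S(t):
    """Toy synaptic gating over one cycle: exponential with 5 ms decay."""
    return math.exp(-(t % T) / 5.0)

def sync_slope(E_GABA, d=1e-3):
    """Slope of H_odd at phi = 0 (equals H'(0)); positive slope => synchrony stable."""
    def H(p):
        return sum(Z[i] * S(t + p * T) * (E_GABA - V[i]) for i, t in enumerate(ts)) / N
    return (H(d) - H(-d)) / (2 * d)

# Because the slope is affine in E_GABA, two evaluations locate the critical
# reversal potential at which synchrony changes stability:
s0, s1 = sync_slope(0.0), sync_slope(-120.0)
E_crit = -120.0 * s0 / (s0 - s1)   # root of the affine function through (0, s0), (-120, s1)
```

Two evaluations of the slope thus suffice; evaluating `sync_slope` just above and below `E_crit` confirms the sign change.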
Figure 10: Examples of neuron pairs (type II neurons) coupled with reciprocal GABA synapses. (A) E GABA = −60 mV. (B) E GABA = −67 mV with two different initial conditions. (C) E GABA = −100 mV.
We further validate our weak-coupling analysis with direct simulations of a small circuit of type II neurons. Figure 10 shows a coupled pair of neurons with the M current. We see that for synapses with a reversal potential of −60 mV, the neurons readily synchronize, while for synapses with EGABA = −67 mV, the cells either fall out of synchrony or synchronize, depending on initial conditions. For EGABA = −100 mV, only antisynchrony is stable. The heuristic explanation for these results is that when a synapse reverses above the minimal voltage attained by the neural membrane, it acts as depolarizing for part of the firing cycle. Given that a cell with sufficient AHP spends a significant part of the firing cycle below the resting potential, synapses that reverse at or near Erest are depolarizing for the vast majority of the duty cycle. Figure 11 illustrates this for the Traub neuron with IM adaptation. Thus, heuristically, we fall back into the generic type II situation: depolarizing coupling synchronizes.

4 Conclusion

In this report, we have investigated the role of the GABAergic reversal potential in controlling the stability of various phase-locked oscillations in coupled neuronal oscillators. The GABAergic reversal potential changes significantly during development, as well as during increases in firing activity in interneuronal subnetworks. Such GABA-driven activity has been proposed to be important in structuring the developing hippocampal as well as neocortical circuitry, particularly through the possible formation of synchronous oscillations and cell assemblies. Here we focused specifically on this GABAergic reversal and showed how it interacts with a number of biophysical parameters to control the emergence of various firing configurations.
Figure 11: Relationship between the GABAergic reversal potential and the membrane voltage during a firing cycle explains why shunting inhibition synchronizes type II neurons. Here we see one firing cycle with three different EGABA values. Light gray marks the portion of the cycle where the GABAergic synapse is depolarizing; dark gray, where it is hyperpolarizing. Erest is at −67 mV, and the minimal voltage attained by the neuron (Emin) is −96 mV. As can be seen, a GABAergic synapse with a reversal potential slightly above the resting membrane potential is depolarizing for virtually all of the firing cycle; hence, it acts as "excitation." A shunting synapse (EGABA = Erest) depolarizes the membrane for more than half of the firing cycle. A synapse with a reversal potential below Emin is hyperpolarizing.
[Figure 11 panels, top to bottom: one firing cycle of V (mV) for EGABA = −60, −67, and −96 mV.]
We have found that hyperpolarizing GABAergic coupling leads either to stable synchrony for type I oscillators without adaptation or to bistability between a synchronous and an asynchronous state. Depolarizing GABAergic synapses, on the other hand, lead to a stable, almost phase-locked state with partial phase shifts (a so-called third mode). This third mode approaches synchrony when the firing rate of the oscillators is low. When adaptation is included, which results in a lower firing rate, hyperpolarizing GABAergic synapses lead to stable asynchrony, while depolarizing GABAergic synapses stabilize the synchronous state. We then showed that for various GABAergic reversal potentials and a given low firing rate, the speed of the synapse changes the stability of the phase-locked states: hyperpolarizing GABA renders the asynchrony stable for relatively slow synapses, and slow depolarizing GABA yields stable synchrony. This confirms previous results (van Vreeswijk, 1996). However, we have found that for shunting GABA (reversal at −65 mV, the resting potential of the model), asynchrony is stable for all parameters examined. Direct simulations confirm this picture. The results of our analysis appear to be in agreement with previous simulation studies of increased GABA reversal potential in interneuronal networks (Wang & Buzsaki, 1996). A recent simulation study of a realistic model of a hippocampal interneuronal network suggests that partially depolarizing (shunting) inhibition in fact helps the robustness of gamma-band coherent oscillations (Vida, Bartos, & Jonas, 2006). This result is intriguing in light of the analysis presented here, and the reasons for the differences may lie in the different coupling regimes considered (weak here versus strong in Vida et al., 2006), in the heterogeneity of the network, or both. Another possibility is the intrinsic firing properties of the neuronal model considered (type I versus type II).
In particular, we found that for partially depolarizing synapses that are sufficiently fast, a solution close to synchrony may appear for type I oscillators (results not shown), whereas for type II oscillators, a reversal potential above rest gives a stable synchronous solution (coexisting with antisynchrony). However, a formal analysis of why the results differ remains a topic of investigation. We have found that neurons coupled with inhibition (hyperpolarizing GABA) can have coexisting synchronous and antiphase solutions when the synapses are sufficiently slow. Since this situation makes the phase model bistable, we predict that a large neuronal network has the potential to develop into a cluster state, depending on the initial condition of the phase dynamics. We then examined the behavior of large-scale networks with synaptic parameters close to those identified in vitro and with various reversal potentials. We have found that the large-scale networks are consistent with the analysis of neuron pairs and the small-circuit simulations. In cases where the Hodd analysis predicts synchrony, the large-scale network shows synchrony; in cases where the small-scale analysis predicts bistability (coexistence of
GABA Reversal of Synchrony
727
synchronous and asynchronous solutions), the large-scale networks break up into multiple antiphase clusters of synchronized neurons. We further found that the dynamics of spike generation can control the stability of synchrony with partially depolarizing GABA synapses. When the spike frequency adaptation in the neuron is due to a low-threshold K-current (I_M), the dynamics of spike generation change from type I to type II. In these neurons, depolarizing and shunting GABA synapses synchronize. We thus show that synchronization due to shunting inhibition may, in principle, be controlled by neuromodulators that alter the spike-generating properties of cortical neurons. As has been previously suggested (Gutkin et al., 2005), acetylcholine (ACh), by blocking the I_M current, can readily convert neurons between type I and type II spike generation. Since a high-ACh condition is concomitant with states of vigilance and cognitive effort, while ACh is low during sleep and rest, we may speculate that shunting inhibition plays a synchronizing role during the latter states as opposed to the former, and perhaps in the lower-frequency band. While our results begin to explore how the changing properties of the GABAergic synapses, in concert with the intrinsic neuronal properties, may control patterns of activity during development, the origin of the global firing patterns seen in development, for example, the bursting activity leading to the giant depolarizing potentials in the hippocampus, remains to be explored. Furthermore, how such patterns lead to the formation of neural circuitry remains a topic for future study.

Acknowledgments

We thank Peter Dayan for fruitful discussion and Brent Doiron for his comments on the manuscript. This work was supported in part by the Gatsby Foundation and CNRS (B.S.G.).

References

Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70(4), 1629–1638.
Ben-Ari, Y. (2001). Developing networks play a similar melody. Trends in Neurosci., 24, 353–360.
Ben-Ari, Y. (2002). Excitatory actions of GABA during development: The nature of the nurture. Nature Rev. Neurosci., 3, 728–739.
Ben-Ari, Y., Cherubini, E., Corradetti, R., & Gaiarsa, J. L. (1989). Giant synaptic potentials in immature rat CA3 hippocampal neurones. J. Physiol., 416(1), 303–325.
Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns in realistic network models. J. Comput. Neurosci., 3(2), 91–110.
728
H. Jeong and B. Gutkin
Buzsaki, G., & Draguhn, A. (2004). Neuronal oscillations in cortical networks. Science, 304, 1926–1929.
Crook, S. M., Ermentrout, B., & Bower, J. M. (1998). Spike frequency adaptation affects the synchronization properties of networks of cortical oscillators. Neural Comp., 10, 837–854.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1998). Kinetic models of synaptic transmission. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comp., 8, 979–1001.
Ermentrout, B. (2002). Simulating, analyzing, and animating dynamical systems. Philadelphia: SIAM.
Ermentrout, B., Pascal, M., & Gutkin, B. (2001). The effects of spike frequency adaptation and negative feedback on the synchronization of neural oscillators. Neural Comp., 13, 1285–1310.
Fisahn, A., Pike, F. G., Buhl, E. H., & Paulsen, O. (1998). Cholinergic induction of network oscillations at 40 Hz in the hippocampus in vitro. Nature, 394, 186–189.
Garaschuk, O., Hanse, E., & Konnerth, A. (1998). Developmental profile and synaptic origin of early network oscillations in the CA1 region of rat neonatal hippocampus. J. Physiol., 507, 219–236.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gutkin, B., Ermentrout, B., & Reyes, A. (2005). Phase response curves determine the responses of neurons to transient inputs. J. Neurophys., 94, 1623–1635.
Hansel, D., & Mato, G. (2003). Asynchronous states and the emergence of synchrony in large networks of interacting excitatory and inhibitory neurons. Neural Comput., 15(1), 1–56.
Hansel, D., Mato, G., & Meunier, C. (1993). Clustering and slow switching in globally coupled phase oscillators. Phys. Rev. E, 48, 3470–3477.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comp., 7, 307–337.
Jeong, H. Y., & Gutkin, B. (2005). Study on the role of GABAergic synapses in synchronization. Neurocomputing, 65–66, 859–868.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. New York: Springer-Verlag.
MacLeod, K., & Laurent, G. (1996). Distinct mechanisms for synchronization and temporal patterning of odor-encoding cell assemblies. Science, 274(5289), 976–979.
Neltner, L., & Hansel, D. (2001). On synchrony of weakly coupled neurons at low firing rate. Neural Comp., 13, 765–774.
Owens, D. F., Boyce, L. H., Davis, M. B. E., & Kriegstein, A. R. (1996). Excitatory GABA responses in embryonic and neonatal cortical slices demonstrated by gramicidin perforated-patch recordings and calcium imaging. J. Neurosci., 16(20), 6414–6423.
Pikovsky, A., Rosenblum, M., & Kurths, J. (2001). Synchronization: A universal concept in nonlinear sciences. Cambridge: Cambridge University Press.
Rinzel, J., & Ermentrout, B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling. Cambridge, MA: MIT Press.
Tanabe, M., Gähwiler, B. H., & Gerber, U. (1998). L-type Ca2+ channels mediate the slow Ca2+-dependent afterhyperpolarization current in rat CA3 pyramidal cells in vitro. J. Neurophysiol., 80(5), 2268–2273.
Tateno, T., Harsch, A., & Robinson, H. P. C. (2004). Threshold firing frequency-current relationships of neurons in rat somatosensory cortex: Type 1 and type 2 dynamics. J. Neurophysiol., 92, 2283–2294.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, B. (1994). When inhibition not excitation synchronizes neural firing. J. Comp. Neurosci., 1, 313–321.
Vida, I., Bartos, M., & Jonas, P. (2006). Shunting inhibition improves robustness of gamma oscillations in hippocampal interneuron networks by homogenizing firing rates. Neuron, 49, 107–117.
Wang, X.-J. (1998). Calcium coding and adaptive temporal computation in cortical pyramidal neurons. J. Neurophysiol., 79(3), 1549–1566.
Wang, X.-J. (2003). Neural oscillations. In L. Nadal (Ed.), Encyclopedia of cognitive science (pp. 272–280). New York: Macmillan.
Wang, X.-J., & Buzsaki, G. (1996). Gamma oscillations by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Winfree, A. T. (2001). The geometry of biological time (2nd ed.). New York: Springer-Verlag.
Received January 3, 2006; accepted July 24, 2006.
LETTER
Communicated by Andrew Barto
Reinforcement Learning State Estimator

Jun Morimoto
[email protected] JST, ICORP, Computational Brain Project, 4-1-8 Honcho, Kawaguchi, Saitama, 332-0012, Japan
Kenji Doya
[email protected] Initial Research Project, Okinawa Institute of Science and Technology, Uruma, Okinawa 904-2234, Japan
Neural Computation 19, 730–756 (2007)
© 2007 Massachusetts Institute of Technology

In this study, we propose a novel use of reinforcement learning for estimating hidden variables and parameters of nonlinear dynamical systems. A critical issue in hidden-state estimation is that we cannot directly observe estimation errors. However, by defining errors of observable variables as a delayed penalty, we can apply a reinforcement learning framework to state estimation problems. Specifically, we derive a method to construct a nonlinear state estimator by finding an appropriate feedback input gain using the policy gradient method. We tested the proposed method on a single pendulum and show that the joint angle variable could be successfully estimated by observing only the angular velocity, and vice versa. In addition, we show that we could acquire a state estimator for the pendulum swing-up task in which a swing-up controller is also acquired simultaneously by reinforcement learning. Furthermore, we demonstrate that it is possible to estimate the dynamics of the pendulum itself while the hidden variables are estimated in the pendulum swing-up task. An application of the proposed method to a two-linked biped model is also presented.

1 Introduction

Reinforcement learning is widely used to find policies that accomplish tasks through trial and error (Sutton & Barto, 1998). However, standard reinforcement learning algorithms like Q-learning can perform badly when the state variables of the environment are not fully observable. In this letter, we propose a dual reinforcement learning paradigm in which an agent learns both to predict the hidden state of the environment and to take actions in the environment. There are three major approaches to dealing with partially observable Markov decision processes (POMDPs): memoryless stochastic policy
learning, memory-based state augmentation, and model-based state prediction. Memoryless policy gradient methods, which achieve better average performance over hidden variables, are used in Jaakkola, Singh, and Jordan (1995), Baird and Moore (1999), Meuleau, Kim, and Kaelbling (2001), and Kimura and Kobayashi (1998). However, even when stochastic policies are used, the learning performance of a policy for POMDPs can be significantly worse than that of a policy acquired for Markov decision processes (MDPs). External memory, which stores the history of state transitions and defines an augmented state space to recover the MDP assumption, is used in Meuleau, Peshkin, Kim, and Kaelbling (1999) and McCallum (1995). However, even with external memory, we cannot always estimate hidden variables from observed variables. For example, we need an infinite impulse response filter such as an integrator to estimate position from velocity, and a simple integrator does not work well if the sensor data include observation noise or if we do not know the initial position. If we have a dynamic model of the environment, it is possible to use a state estimator for POMDPs. If the dynamics is linear and deterministic, we can design an observer (Luenberger, 1971), which is used in control theory to estimate the hidden states. If the dynamics is linear and stochastic, we can use a Kalman filter, which is widely known as the optimal state estimator for linear dynamics with gaussian noise (Kalman & Bucy, 1961). When the state transition probability distribution is nongaussian, the particle filter (Doucet, Godsill, & Andrieu, 2000) is used in many studies (Thrun, 2000; Arulampalam, Maskell, Gordon, & Clapp, 2002), and Wan and van der Merwe (2000) proposed unscented filters for the nongaussian case. In this study, we propose the reinforcement learning state estimator (RLSE), a novel learning method for estimating hidden variables of nonlinear dynamical systems to cope with POMDPs.
Evaluation of state estimation policies is difficult because we cannot observe estimation errors for hidden states. However, by defining errors of observable variables as a penalty in the reinforcement learning framework, state estimation problems can be considered delayed reward problems, and we can apply reinforcement learning to learn policies for hidden-state estimation. We show that the state estimator can be acquired by using the policy gradient method (Kimura & Kobayashi, 1998) and demonstrate that we can acquire the state estimator for a pendulum swing-up task in which a swing-up controller is also acquired simultaneously by reinforcement learning. Although the extended Kalman filter can be used not only for estimating state variables but also for estimating model parameters of the pendulum dynamics (Wan & Nelson, 1997), the extended Kalman filter requires the form of the target dynamics model. In this study, we show that the swing-up controller can be acquired through estimating the hidden variables even in the case that the dynamics model of the pendulum is
not known. We also apply our proposed framework to a biped walking task. In section 2, we introduce the nonlinear state estimation problem. It is well known that optimal control and optimal estimation for linear dynamics are dual problems; that is, we can solve an optimal estimation problem as an optimal control problem of a dual system. We present our idea that the parameters of the state estimator can be derived by solving an optimal control problem even for nonlinear dynamics (Morimoto & Doya, 2002). In section 3, we introduce the reinforcement learning state estimator (RLSE), showing our approach to acquiring the state estimation policy in a reinforcement learning framework. Section 4 contains our simulation results, obtained by applying the proposed method to a single pendulum and a two-linked biped robot model.

2 Nonlinear State Estimator

We consider a state estimator for the nonlinear stochastic dynamics

x(k + 1) = f(x(k), u(k)) + n(k),   n(k) ∼ N(0, Σ),   (2.1)
y(k) = h(x(k)) + v(k),   v(k) ∼ N(0, R),   (2.2)
where x ∈ X ⊂ R^n is the state, u ∈ U ⊂ R^m is the control input, y ∈ Y ⊂ R^l is the observed output, n(k) is the system noise, with N(0, Σ) a normal distribution with zero mean and covariance Σ, and v(k) is the observation noise, with N(0, R) a normal distribution with zero mean and covariance R. We consider a state estimator of the form

x̂(k + 1) = f(x̂(k), u(k)) + a(k),   (2.3)
where x̂ ∈ X ⊂ R^n is the estimated state and a ∈ A ⊂ R^n is a state estimator input that corrects the current estimate using the observation y. In this study, we define the state estimation problem as the problem of finding a stochastic state estimation policy π(x̂, y, a) = P(a|x̂, y) that reduces the estimation error. This estimation problem is a POMDP because we cannot directly access the state x, and actions for the estimation task must be decided from the current observation y(k), which is given by the conditional probability distribution

P(y(k)|x(k)) = (1/√((2π)^l |R|)) exp{−(1/2)(y(k) − h(x(k)))^T R^{−1} (y(k) − h(x(k)))}.   (2.4)
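The observation model of equations 2.2 and 2.4 is straightforward to simulate. The sketch below is our own illustrative scalar (l = 1) example; the output function h and the noise variance R are made-up choices, not values from the letter.

```python
import math
import random

random.seed(0)

# Scalar (l = 1) observation model of eq. 2.2: y = h(x) + v, v ~ N(0, R).
R = 0.5                          # illustrative observation noise variance

def h(x):
    return math.sin(x[0])        # observe a nonlinear function of the angle

def sample_y(x):
    """Draw one noisy observation of the hidden state x (eq. 2.2)."""
    return h(x) + random.gauss(0.0, math.sqrt(R))

def likelihood(y, x):
    """P(y | x), the scalar case of eq. 2.4."""
    err = y - h(x)
    return math.exp(-0.5 * err * err / R) / math.sqrt(2.0 * math.pi * R)

x_true = (0.3, 0.0)              # hidden state (theta, omega)
ys = [sample_y(x_true) for _ in range(20000)]
mean_y = sum(ys) / len(ys)       # concentrates near h(x_true) = sin(0.3)
```

As expected, the sample mean of y approaches h(x), and the likelihood peaks at y = h(x), which is what makes the observable error y − h(x̂) a usable training signal later on.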
The difficulty of finding this policy is that we cannot directly observe the error for the hidden states. For the linear dynamics

x(k + 1) = Ax(k) + Bu(k) + n(k),   (2.5)
y(k) = Cx(k) + v(k),   (2.6)

we can design an observer,

x̂(k + 1) = Ax̂(k) + Bu(k) + L(y(k) − Cx̂(k)),   (2.7)

where the observer gain L is selected to stabilize the error dynamics

e(k + 1) = (A − LC)e(k),   (2.8)
which is derived by subtracting equation 2.7 from equation 2.5 while neglecting the noise input n(k) in equation 2.5, with e(k) = x(k) − x̂(k). In this case, the estimation policy is the deterministic policy a(k) = L(y(k) − Cx̂(k)); that is, π(x̂(k), y(k), a(k)) = P(a(k)|x̂(k), y(k)) = δ(a(k) − L(y(k) − Cx̂(k))), where δ(·) denotes the Dirac delta function. Our solution to the estimation problem is to define the accumulated error of the observable variables as the objective function and to solve the resulting optimal control problem with a reinforcement learning framework. In the next section, we introduce our proposed method for learning the estimation policy.

3 Reinforcement Learning State Estimator

Here we introduce the reinforcement learning state estimator (RLSE), a new method for acquiring a state estimation policy within a reinforcement learning framework. We consider a discrete-time formulation of reinforcement learning with the following system dynamics:

x(k + 1) = f(x(k), u(k)) + n(k),
(3.1)
y(k) = h(x(k)) + v(k),
(3.2)
x̂(k + 1) = f̂(x̂(k), u(k); p(k)) + a(k),   (3.3)
where a ∈ A ⊂ R^n is the output of the state estimation policy (see section 3.2), which is fed into the state estimator, and p denotes the parameter vector of the approximated plant model f̂ and is the output of the model estimation policy (see section 3.3). The reward is defined as

r(x̂, y, a) = e(y − h(x̂)) − c(a),   (3.4)
where e(y − h(x̂)) denotes the reward for correct estimation and c(a) denotes the cost for changing the estimator dynamics (Morimoto & Doya, 2002). The function e takes its maximum value when the size of the error |y − h(x̂)| is zero and, for a fixed direction of the error vector, decreases monotonically as the size of the error grows. The cost function c(a) corresponds to the amplitude of the process noise: with a smaller cost for changing the estimator dynamics, the state estimator is more strongly affected by the observed output y(k). For example, a negative log-likelihood function can serve as both the reward and the cost function (Crassidis & Junkins, 2004). The basic goal is to find a policy π_w(x̂, y, a) = P(a|x̂, y; w) that maximizes the expectation of the discounted accumulated reward,

E{V(k) | π_w} = E{ Σ_{i=k}^∞ γ^{i−k} r(x̂(i + 1), y(i + 1), a(i)) | π_w },   (3.5)
where V(k) is the actual return, w is the parameter vector of the policy π_w, and γ is the discount factor. We therefore essentially solve a smoothing problem rather than a filtering problem, because the future estimation error enters equation 3.5 without explicit backward calculation. Figure 1 shows a schematic diagram of the proposed learning system. The RLSE consists of a state estimator and a reinforcement learning (RL) module that learns an estimation policy π_w. The state estimator takes the control input u and the output a of the RL module, that is, the output of the estimation policy, as input variables, and outputs the estimated state x̂ as the RLSE output. Parameters p of the plant model used in the state estimator and in a controller can also be estimated by the RLSE. The problem of learning a state estimation policy is partially observable because the variables x to be estimated are hidden. We use the policy gradient method proposed by Kimura and Kobayashi (1998) in our RLSE framework. Although the policy gradient method can be applied directly to POMDPs without learning a state estimator for some target problems (Jaakkola et al., 1995; Baird & Moore, 1999; Kimura & Kobayashi, 1998), we will show with the pendulum swing-up problem in the following sections that many tasks require the state estimator. The RLSE can be used even when the order of a plant is known but its form is not, and when there are hidden state variables that we cannot directly observe.

3.1 Gradient Calculation and Policy Parameter Update. In the policy gradient methods, we calculate the gradient direction of the expectation of the actual return with respect to the parameters w of a policy. Kimura and
Figure 1: Reinforcement learning state estimator (RLSE). The RLSE consists of a state estimator and an RL module that learns an estimation policy. The state estimator takes the control input u and the output a of the RL module, that is, the output of the estimation policy, as input variables, and outputs the estimated state x̂ as the RLSE output. Parameters p of the plant model used in the state estimator and in a controller can also be estimated by the RLSE.
Kobayashi (1998) suggested that we can estimate the expectation of the gradient direction as

(∂/∂w) E{V(0) | π_w} ≈ E{ Σ_{k=0}^∞ (V(k) − V̂(x̂(k))) (∂ ln π_w/∂w) | π_w },   (3.6)

where V̂(x̂(k)) is an approximation of the value function V^{π_w}(x̂) = E{V(k) | x̂(k) = x̂, π_w}.

3.1.1 Value Function Approximation. The value function is approximated using a normalized gaussian network (Doya, 2000; see appendix C),

V̂(x̂) = Σ_{i=1}^N v_i b_i(x̂),   (3.7)
where v_i is the ith parameter and N is the number of basis functions (see appendix C). The approximation error of the value function is represented
by the TD error (Sutton & Barto, 1998):

δ(k) = r(k) + γV̂(x̂(k + 1)) − V̂(x̂(k)).   (3.8)

We update the parameters of the value function approximator using the TD(0) method (Sutton & Barto, 1998),

v_i(k + 1) = v_i(k) + αδ(k)b_i(x̂(k)),   (3.9)
where α is the learning rate.

3.1.2 Policy Parameter Update. We update the parameters w of a policy by using the estimated gradient direction in equation 3.6. Kimura and Kobayashi (1998) showed that the gradient direction can be estimated using the TD error,

E{ Σ_{k=0}^∞ (V(k) − V̂(x̂(k))) (∂ ln π_w/∂w) | π_w } = E{ Σ_{k=0}^∞ δ(k)z(k) | π_w },   (3.10)
where z is the eligibility trace of the parameter w. We then update the parameter w as

w(k + 1) = w(k) + βδ(k)z(k),   (3.11)

where the eligibility trace is updated as

z(k) = ηz(k − 1) + (∂ ln π_w/∂w)|_{w=w(k)},   (3.12)

with η the decay factor of the eligibility trace. Equation 3.10 holds if the condition η = γ is satisfied.

3.2 State Estimation Policy. We construct the state estimation policy based on the normal distribution,

π_w(x̂, y, a_j) = (1/(√(2π) σ_j(x̂))) exp( −(a_j − µ_j(x̂, y))^2 / (2σ_j^2(x̂)) ),   (3.13)
where a_j is the jth output of the policy π_w. The mean output µ of the policy π_w is given by a feedback gain matrix L applied to the estimation error of the observable variables, y − h(x̂),

µ_j(x̂, y) = L_j(x̂)(y − h(x̂)),   (3.14)
and the jth row vector L_j is modeled by the normalized gaussian network,

L_j(x̂) = Σ_{i=1}^N w_{ij}^µ b_i(x̂).   (3.15)
Here, w_{ij}^µ ∈ R^l denotes the ith parameter vector for the jth output of the policy π_w, and N is the number of basis functions. We represent the variance σ_j of the policy π_w using a sigmoid function (Kimura & Kobayashi, 1998),

σ_j(x̂) = σ_0 / (1 + exp(−σ_j^w(x̂))),   (3.16)
where σ_0 denotes the scaling parameter, and the state-dependent parameter σ_j^w(x̂) is also modeled by the normalized gaussian network,

σ_j^w(x̂) = Σ_{i=1}^N w_{ij}^σ b_i(x̂),   (3.17)

where w_{ij}^σ denotes the ith parameter for the variance of the jth output. Considering equation 3.13, the eligibilities of the policy parameters are given as

∂ ln π_w/∂w_{ij}^µ = ((a_j − µ_j)/σ_j^2) (y − h(x̂)) b_i(x̂),   (3.18)

∂ ln π_w/∂w_{ij}^σ = (((a_j − µ_j)^2 − σ_j^2)(1 − σ_j/σ_0)/σ_j^2) b_i(x̂).   (3.19)
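As a sanity check, the eligibilities of equations 3.18 and 3.19 can be compared against finite differences of ln π_w. The sketch below uses our own minimal setting, not the letter's implementation: a single basis function b_i(x̂) = 1, one scalar output, and a scalar observation error, so the parameters reduce to two scalars.

```python
import math

SIGMA0 = 1.0   # scaling parameter sigma_0 of eq. 3.16 (illustrative value)

def mu_sigma(w_mu, w_sigma, err):
    mu = w_mu * err                               # eqs. 3.14-3.15 with b = 1
    sigma = SIGMA0 / (1.0 + math.exp(-w_sigma))   # eqs. 3.16-3.17 with b = 1
    return mu, sigma

def log_pi(w_mu, w_sigma, err, a):
    """ln pi_w for the gaussian policy of eq. 3.13."""
    mu, sigma = mu_sigma(w_mu, w_sigma, err)
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - (a - mu) ** 2 / (2.0 * sigma ** 2))

def eligibilities(w_mu, w_sigma, err, a):
    """Analytic eligibilities, eqs. 3.18 and 3.19 with b_i = 1."""
    mu, sigma = mu_sigma(w_mu, w_sigma, err)
    e_mu = (a - mu) / sigma ** 2 * err
    e_sig = ((a - mu) ** 2 - sigma ** 2) * (1.0 - sigma / SIGMA0) / sigma ** 2
    return e_mu, e_sig

# Compare with central finite differences of ln pi.
w_mu, w_sigma, err, a, eps = 0.7, -0.2, 0.5, 0.9, 1e-6
e_mu, e_sig = eligibilities(w_mu, w_sigma, err, a)
fd_mu = (log_pi(w_mu + eps, w_sigma, err, a)
         - log_pi(w_mu - eps, w_sigma, err, a)) / (2 * eps)
fd_sig = (log_pi(w_mu, w_sigma + eps, err, a)
          - log_pi(w_mu, w_sigma - eps, err, a)) / (2 * eps)
```

The analytic and numerical gradients agree, confirming that the sigmoid parameterization of σ_j contributes the factor (1 − σ_j/σ_0) in equation 3.19.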
We update the parameters by applying the update rules in equations 3.11 and 3.12.

3.3 Model Estimation Policy. Since we cannot always have exact information about the target dynamics, it is necessary to estimate the parameters of the dynamics model as well. In this study, we use a reinforcement learning framework based on the policy gradient method (Kimura & Kobayashi, 1998) to estimate the parameters of the dynamics model. The model estimation policy is constructed based on the normal distribution,

π_{w^p}(x̂, p_j) = (1/(√(2π) σ_j^p(x̂))) exp( −(p_j − µ_j^p(x̂))^2 / (2(σ_j^p(x̂))^2) ),   (3.20)
where p_j is the jth output of the policy π_{w^p}. The mean output µ^p of the policy π_{w^p} is given by

µ_j^p(x̂) = Σ_{i=1}^N w_{ij}^{µp} b_i(x̂).   (3.21)
Here, w_{ij}^{µp} denotes the ith parameter for the jth output of the policy π_{w^p}, and N denotes the number of basis functions. We represent the variance σ_j^p of the policy π_{w^p} using a sigmoid function (Kimura & Kobayashi, 1998),

σ_j^p(x̂) = σ_0^p / (1 + exp(−σ_j^{wp}(x̂))),   (3.22)

where σ_0^p denotes the scaling parameter, and the state-dependent parameter σ_j^{wp}(x̂) is also modeled by the normalized gaussian network,

σ_j^{wp}(x̂) = Σ_{i=1}^N w_{ij}^{σp} b_i(x̂),   (3.23)
where w_{ij}^{σp} denotes the ith parameter for the variance of the jth output. Considering equation 3.20, the eligibilities of the policy parameters are given as

∂ ln π_{w^p}/∂w_{ij}^{µp} = ((p_j − µ_j^p)/(σ_j^p)^2) b_i(x̂),   (3.24)

∂ ln π_{w^p}/∂w_{ij}^{σp} = (((p_j − µ_j^p)^2 − (σ_j^p)^2)(1 − σ_j^p/σ_0^p)/(σ_j^p)^2) b_i(x̂).   (3.25)
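Pulling equations 3.3, 3.4, 3.8, 3.9, 3.11, 3.12, and 3.18 together, a single RLSE update can be traced through by hand. The sketch below is a schematic scalar instance of our own, with one basis function, a toy linear plant model, and made-up numbers; it is not the letter's implementation, although the learning-rate values match those reported in section 4.1.

```python
import math

# Learning parameters (the values used in section 4.1 of the letter).
alpha, beta, gamma, eta = 0.5, 0.5, 0.99, 0.95

# Quantities available at step k (illustrative numbers).
x_hat, y = 0.2, 0.5                      # current estimate and observation
f_hat = lambda xh: 0.9 * xh              # toy scalar plant model
h = lambda xh: xh                        # toy output function
w, v, z_prev = 0.3, 0.1, 0.0             # policy param., value param., trace
sigma = 0.5                              # policy standard deviation

mu = w * (y - h(x_hat))                  # policy mean (eqs. 3.14-3.15, b = 1)
a = 0.15                                 # a sampled estimator correction

x_hat_next = f_hat(x_hat) + a            # estimator update (eq. 3.3)
y_next = 0.55                            # next observation
r = math.exp(-(y_next - h(x_hat_next)) ** 2)   # reward on observables (eq. 3.4)

delta = r + gamma * v - v                # TD error (eq. 3.8); V_hat(x_hat) = v
v_new = v + alpha * delta                # TD(0) value update (eq. 3.9)

elig = (a - mu) / sigma ** 2 * (y - h(x_hat))  # eligibility (eq. 3.18, b = 1)
z = eta * z_prev + elig                  # eligibility trace (eq. 3.12)
w_new = w + beta * delta * z             # policy update (eq. 3.11)
```

With these numbers, r ≈ 0.953 and the correction gain w is nudged upward, because the sampled correction moved the estimate toward the observation and was rewarded.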
We update the parameters by applying the update rules in equations 3.11 and 3.12.

4 Simulation

In this section, we demonstrate the performance of the RLSE. First, in section 4.1, we apply the RLSE to a single-pendulum model to display its basic properties and performance. In section 4.2, we apply it to a two-linked biped model, which has more complicated dynamics than the single-pendulum model, to show that the RLSE can be applied to a more difficult problem. In section 4.1, we first show that the RLSE can stabilize the error dynamics (see equation 2.8) of the linearized pendulum model (see section 4.1.1).
Second, we apply it to the pendulum model to estimate the state variables of free-swing movement (see section 4.1.2). Third, we show that it is possible to acquire a controller to swing the pendulum up by using state variables estimated by the RLSE; here, the RLSE is also acquired simultaneously (see sections 4.1.3, 4.1.4, and 4.1.5). In section 4.1.3, we compare swing-up performance among a memoryless policy gradient method, a memory-based policy gradient method, and a value-gradient-based method (Doya, 2000) using the RLSE. We demonstrate that only the value-gradient-based method using the RLSE can acquire a swing-up controller when only the angular velocity can be observed. In section 4.1.4, we show that the state estimator can be acquired even when we do not know a precise parameter of the pendulum; in other words, we show that we can acquire control, state estimation, and model estimation policies simultaneously. Section 4.1.5 shows that we can acquire an estimation policy for the pendulum swing-up task even when the dynamics of the pendulum is unknown. Finally, in section 4.2, we show that the RLSE can be used to estimate state variables of the biped model. The task for the biped robot in this case is walking down a slope without falling over. We could simultaneously acquire both control and estimation policies for the biped walking task.

4.1 Application to a Single Pendulum. The proposed method was applied to the pendulum dynamics

ml^2 ω̇ = −µω − mgl sin θ + T,
(4.1)
where θ is the angle from the bottom position, ω is the angular velocity, T is the input torque, µ = 0.01 is the coefficient of friction, m = 1.0 kg is the weight of the pendulum, l = 1.0 m is the length of the pendulum, and g = 9.8 m/s^2 is the acceleration due to gravity (see Figure 2). We define the state vector as x = (θ, ω)^T and the control output as u = T. Here we consider a one-dimensional output y. Thus, the dynamics is given by the system

x(k + 1) = f(x(k), u(k)) + n(k),   (4.2)
y(k) = Cx(k) + v(k),   (4.3)

where n(k) and v(k) are noise inputs with parameters Σ = diag{0.01, 0.01}Δt and R = 1.0Δt, and the discrete-time dynamics of the pendulum is

f(x(k), u(k)) = x(k) + ∫_{kΔt}^{(k+1)Δt} F(x(s), u_c(s)) ds,   (4.4)
Figure 2: Single pendulum swing-up task. θ: joint angle, T: input torque, l: length of the pendulum, m: mass of the pendulum.
F(x(t), u_c(t)) = [ω, −(g/l) sin θ − (µ/(ml^2)) ω]^T + [0, 1/(ml^2)]^T u_c,   (4.5)
where Δt is the time step of the discrete-time system and u_c is the control input in continuous time. We define the state estimator dynamics as

x̂(k + 1) = f(x̂(k), u(k)) + a(k),   (4.6)

where x̂(k) = (θ̂(k), ω̂(k))^T is the estimated state and a(k) is the output of the estimation policy. We define the reward function for the state estimation task as
r = exp( −(y − Cx̂)^2/σ_r^2 ) − Σ_{j=1}^2 c(a_j) Δt,   (4.7)
where σ_r^2 = 0.25 and c(a_j) is the output cost term. We saturate the output of the policy using the sigmoid function s in equation A.3 (Doya, 2000; see appendix A) with the maximum output a^max = (a_1^max, a_2^max)^T = (5.0, 5.0)^T. We then use the corresponding cost function,

c(a_j) = d ∫_0^{a_j} s^{−1}(u/a_j^max) du,   (4.8)
where d = 0.1 and s(·) is a monotonically increasing function (see appendix A). The learning parameters were set as follows. The learning rate for the value function was α = 0.5, the learning rate for the policy was β = 0.5, the discount factor was γ = 0.99, and the decay factor for the eligibility trace was η = 0.95. A normalized gaussian network with 5 × 5 basis functions over the two-dimensional space x̂ was used for the state estimation policy in equations 3.15 and 3.17 and for the value function in equation 3.7. We located the basis functions on a grid with an even interval in each dimension of the input space (−π ≤ θ ≤ π, −3π ≤ ω ≤ 3π). In the current simulations, the centers are fixed on a grid, which is analogous to the "boxes" approach (Barto, Sutton, & Anderson, 1983) often used in discrete RL. Grid allocation of the basis functions enables efficient calculation of their activation as the outer product of the activation vectors for the individual input variables. We used the fourth-order Runge-Kutta method with a time step of Δt = 0.01 sec to derive the discrete pendulum dynamics in equations 4.4 and 4.6 and put a zero-order hold on the input, u_c(s) = u(k) for kΔt ≤ s < (k + 1)Δt.

4.1.1 Estimation of Free-Swing Trajectories: Linear Pendulum. We first consider a linear problem in order to check whether the poles of the error dynamics learned by the RLSE are inside the unit circle. We therefore consider the pendulum dynamics only near the stable equilibrium point x = (0, 0)^T,

F(x(t), u_c(t)) = [[0, 1], [−g/l, −µ/(ml^2)]] x(t) + [0, 1/(ml^2)]^T u_c(t),   (4.9)
where the discrete pendulum dynamics was derived with equation 4.4. Here, since the RLSE attempts to learn a policy for estimating free-swing trajectories, we always set the control input to zero, u(k) = 0. We consider two types of observation noise with different amplitudes: R = 1.0Δt and R = 10.0Δt. Each trial was started from an initial state x(0) = (θ(0), 0.0)^T and an initial estimate x̂(0) = (θ̂(0), 0.0)^T, where θ(0) was selected from a uniform distribution over −π/4 < θ(0) < π/4, and θ̂(0) was then selected from a uniform distribution around θ(0) over −π/6 + θ(0) < θ̂(0) < π/6 + θ(0). Each trial was terminated after 20 sec. We applied the estimation policy acquired after 100 learning trials to estimate the linear pendulum states and used a deterministic policy, represented by the mean of the acquired stochastic policy, to illustrate the estimation performance. The initial state was set as x(0) = (−π/6, 0.0)^T and x̂(0) = (−π/3, π/2)^T. Figures 3A and 3B show the successful estimation of the linearized pendulum state, and Figure 3C indicates that the poles of the error dynamics stayed within the unit circle; that is, the absolute eigenvalues |λ| of the error dynamics were always less than one, |λ| < 1, during state estimation. Furthermore,
Figure 3: Estimated pendulum states and absolute eigenvalues of error dynamics. (A) Estimated angular velocity ω. (B) Estimated joint angle θ. (C) Time course of absolute eigenvalues |λ(A − LC)|.
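The stability criterion tracked in Figure 3C, |λ(A − LC)| < 1, can be checked directly for a Euler-discretized linearization of equation 4.9. In the sketch below, the observer gain L and the choice C = [1 0] (angle observed) are arbitrary illustrative assumptions of ours; the RLSE learns its gain rather than using a fixed one.

```python
import cmath

# Explicit-Euler discretization of the linearized pendulum (eq. 4.9):
# A = I + dt * [[0, 1], [-g/l, -mu/(m l^2)]].
m, l, g, mu, dt = 1.0, 1.0, 9.8, 0.01, 0.01
A = [[1.0, dt],
     [-dt * g / l, 1.0 - dt * mu / (m * l ** 2)]]
C = [1.0, 0.0]        # observe the joint angle (illustrative choice)
L = [0.2, 0.5]        # an arbitrary stabilizing observer gain (not learned)

# Error-dynamics matrix A - L C of eq. 2.8 (L C is an outer product here).
M = [[A[0][0] - L[0] * C[0], A[0][1] - L[0] * C[1]],
     [A[1][0] - L[1] * C[0], A[1][1] - L[1] * C[1]]]

def eig2(M):
    """Eigenvalues of a 2 x 2 matrix from its characteristic polynomial."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    s = cmath.sqrt(tr * tr - 4.0 * det)
    return (tr + s) / 2.0, (tr - s) / 2.0

lam = eig2(M)
stable = max(abs(lam[0]), abs(lam[1])) < 1.0       # the Figure 3C criterion
open_loop = max(abs(v) for v in eig2(A))           # without correction (L = 0)
```

For this gain, both error-dynamics eigenvalues lie inside the unit circle, while the uncorrected Euler-discretized model drifts slightly outside it, so some corrective feedback is genuinely needed.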
the RLSE acquired different estimation policies for different observation noise levels. The error dynamics had slower convergence, that is, the absolute eigenvalues |λ| were larger, with the greater observation noise (R = 10.0Δt), all results consistent with the properties of the Kalman filter.

4.1.2 Estimation of Free-Swing Trajectories: Nonlinear Pendulum. We now estimate free-swing trajectories of the nonlinear pendulum dynamics, equation 4.4, using the RLSE. Each trial was started from an initial state x(0) = (θ(0), ω(0))^T, where θ(0) was selected from a uniform distribution over −π/2 < θ(0) < π/2, and ω(0) was selected from a uniform distribution over −π/2 < ω(0) < π/2. In addition, each trial was started from an initial estimate x̂(0) = (θ̂(0), ω̂(0))^T, where θ̂(0) was selected from a uniform distribution around θ(0) over −π/4 + θ(0) < θ̂(0) < π/4 + θ(0), and ω̂(0) was selected from a uniform distribution around ω(0) over −π/4 + ω(0) < ω̂(0) < π/4 + ω(0). Each trial was terminated after 20 sec. The following three conditions are considered:

1. Joint angular velocity ω is observed: C = [0 1] in equation 4.3.
2. Joint angle θ is observed: C = [1 0] in equation 4.3.
3. The combined state of joint angle θ and angular velocity ω is observed: C = [1 1] in equation 4.3.

Figure 4 illustrates the learning performance of the RLSE under the three different conditions. We applied the policy acquired after 100 learning trials to estimate free-swing trajectories under each condition, using the deterministic policy represented by the mean of the acquired stochastic policy to demonstrate the estimation performance. Initial states were set as x(0) = (−π/2, 0.0)^T and x̂(0) = (−π, π)^T. The results reveal that the acquired policies successfully estimated the true state trajectories (see Figure 5).
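The noise-free part of the free-swing dynamics (equations 4.4 and 4.5, integrated with the fourth-order Runge-Kutta method and a zero-order hold, as stated in the letter) can be sketched as follows. The energy check at the end is our own sanity test, not part of the letter.

```python
import math

# Pendulum vector field F(x, u_c) from eq. 4.5; x = (theta, omega).
m, l, g, mu = 1.0, 1.0, 9.8, 0.01

def F(x, uc):
    theta, omega = x
    return (omega,
            -(g / l) * math.sin(theta) - mu / (m * l ** 2) * omega
            + uc / (m * l ** 2))

def rk4_step(x, uc, dt):
    """One step of eq. 4.4 with a zero-order hold on the input u_c."""
    k1 = F(x, uc)
    k2 = F((x[0] + 0.5 * dt * k1[0], x[1] + 0.5 * dt * k1[1]), uc)
    k3 = F((x[0] + 0.5 * dt * k2[0], x[1] + 0.5 * dt * k2[1]), uc)
    k4 = F((x[0] + dt * k3[0], x[1] + dt * k3[1]), uc)
    return (x[0] + dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6.0,
            x[1] + dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6.0)

def energy(x):
    """Mechanical energy; it should decay under friction alone (u_c = 0)."""
    return 0.5 * m * l ** 2 * x[1] ** 2 + m * g * l * (1.0 - math.cos(x[0]))

x, dt = (-math.pi / 2, 0.0), 0.01    # free swing from the section 4.1.2 state
E0 = energy(x)
for _ in range(2000):                # one 20 sec trial
    x = rk4_step(x, 0.0, dt)
E_end = energy(x)
```

Friction slowly drains mechanical energy, so the free swing decays without ever passing over the top, which is exactly why the swing-up task of section 4.1.3 requires active torque pumping.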
Reinforcement Learning State Estimator
743
Figure 4: Learning performance of free-swing trajectory estimation. (Average over 10 simulation runs. We filtered the data with a moving average of 10 trials.) If the initial estimated state xˆ (0) is equal to the actual state x(0) and the RLSE perfectly follows an actual state trajectory, the average reward becomes 1 without considering cost term c(a).
4.1.3 Pendulum Swing-Up Task in POMDPs: With Exact Model. Here we acquire the swing-up policies while simultaneously estimating the state variables of the pendulum. First, we assume that we have an exact model of the pendulum. We apply value-gradient-based reinforcement learning (Doya, 2000) to acquire pendulum swing-up policies in POMDPs. The task of swinging up a pendulum (Doya, 2000) is a simple but strongly nonlinear control task. Here we assume that the swing-up policies cannot observe the position of the pendulum θ; only the angular velocity ω can be observed. The reward function for the swing-up control policies is given as

q(x, u) = {−cos θ − ν(u)} Δt, (4.10)
where ν(u) is a cost function for the control variable u (see appendix A). We use the same reward function as in equation 4.7 for the estimation policies. Each trial was started from an initial state x(0) = (θ(0), ω(0))^T, where θ(0) and ω(0) were each drawn from a uniform distribution over (−π/4, π/4). In addition, each trial was started from an initial estimate x̂(0) = (θ̂(0), ω̂(0))^T, where θ̂(0) was drawn from a uniform distribution around θ(0) over (−π/4 + θ(0), θ(0) + π/4), and ω̂(0) from a uniform distribution around ω(0) over (−π/4 + ω(0), ω(0) + π/4). Each trial was terminated after 20 sec. As a measure of the swing-up performance, we defined the time for which the pendulum
744
J. Morimoto and K. Doya
Figure 5: Estimation of free-swing trajectories: (A–C) Observed state. (D–F) Estimated joint angle θ . (G–I) Estimated angular velocity ω. (A, D, G) Joint angle θ is observed. (B, E, H) Joint angular velocity ω is observed. (C, F, I) Combined state of joint angle and angular velocity θ + ω is observed.
stayed up (π − π/4 < |θ| < π + π/4) as t_up. A trial was regarded as successful when t_up > 10 sec. We used the number of trials made before achieving 10 successful trials as the measure of the learning speed and limited the output torque of the controller to u_max = 3.0 N · m. Appendix A explains the implementation of the value-gradient-based policy. Figure 6A shows the learning performance on the pendulum swing-up task. After 85 trials (averaged over 10 simulation runs), the swing-up policies and the state estimation policies were acquired. This result reveals that the RLSE can obtain estimation policies that support the acquisition of pendulum swing-up policies. Several studies (Jaakkola et al., 1995; Baird & Moore, 1999; Kimura & Kobayashi, 1998; Meuleau et al., 1999) suggested that a good memoryless or memory-based policy for POMDPs can be found using the policy gradient method. Here we directly applied the policy gradient method (Kimura & Kobayashi, 1998) to learn the swing-up controller (see appendix B). If full
Figure 6: Learning performance for the pendulum swing-up task (average over 10 simulation runs). (A) Value-gradient polices with RLSE (we filtered the data with a moving average of 10 trials). (B) Direct use of policy gradient (we filtered the data with a moving average of 10 trials).
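The swing-up performance measure t_up plotted in Figure 6 can be sketched as follows (an illustrative reconstruction; the sampling step dt is an assumption):

```python
import numpy as np

# Sketch of the success measure t_up: the total time the pendulum spends near
# the upright position, pi - pi/4 < |theta| < pi + pi/4, with theta measured
# from the hanging-down position.  dt is an assumed sampling step.

def t_up(theta_traj, dt=0.01):
    theta = np.abs(np.asarray(theta_traj))
    up = (theta > np.pi - np.pi / 4) & (theta < np.pi + np.pi / 4)
    return up.sum() * dt

# A trial is successful when the pendulum stays up longer than 10 sec.
success = t_up(np.full(1500, np.pi), dt=0.01) > 10.0  # 15 sec upright
```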
state information (θ, ω) is available to the controller, this policy gradient controller can acquire the swing-up policy within 171 trials. Now we apply this controller to the pendulum swing-up task under the same POMDP assumption: the controller cannot observe the position of the pendulum θ. The solid line in Figure 6B shows the learning performance over 500 trials. This result indicates that the pendulum swing-up policy cannot be acquired by observing only the angular velocity of the pendulum ω using the policy gradient method. We also applied the policy gradient method with external memory. Because the pendulum dynamics is represented by a second-order differential equation, we defined the state space of the controller as (ω(t), ω(t − T))^T, where T = 0.5 sec, a quarter of the period of the linearized pendulum. The dashed line in Figure 6B shows the learning performance over 500 trials. This result also indicates that the pendulum swing-up policy cannot be acquired by observing only the angular velocity of the pendulum ω, even when using the external memory. Figure 7 shows swing-up trajectories using the RLSE from an initial state x(0) = (θ(0), ω(0))^T = (0.0, 0.0)^T and an initial estimate x̂(0) = (θ̂(0), ω̂(0))^T = (−π/2, π)^T.

4.1.4 Pendulum Swing-Up Task in POMDPs: With Modeling Errors. In the previous section, it was assumed that we have an exact model of the pendulum dynamics. However, especially in a real environment, it is difficult to have a perfect model of the target dynamics. Here, we try to acquire an estimation policy while at the same time also learning a model parameter of the pendulum. In other words, we attempt to acquire the swing-up control policy, the state estimation policy, and the model estimation policy
Figure 7: Swing-up trajectories and estimation performance. The dash-dotted line represents the upright position. The initial state x(0) = (θ(0), ω(0))^T = (0.0, 0.0)^T; the initial estimate x̂(0) = (θ̂(0), ω̂(0))^T = (−π/2, π)^T. (A) Estimated joint angle θ̂. (B) Estimated angular velocity ω̂.
simultaneously. A 1 × 1 (that is, only one) basis normalized gaussian network was used as the model estimation policy in equations 3.21 and 3.23. We set the decay factor for the eligibility trace of the model estimation policy as η = 0.90. After 227 trials and 128 trials (averaged over 10 simulation runs), the swing-up policies and the state estimation policies were acquired with the initially incorrect models l = 0.5 m and l = 2.0 m, respectively (see Figure 8A). Figure 8B shows the learning performance of the model estimation policy with the two different modeling errors. The results reveal that the RLSE could acquire the correct model parameter during learning of the swing-up task and the state estimation task. The estimated mean model parameter over 10 simulation runs after 500 trials was l = 1.03 m, with variance 1.05 × 10^−2, for the initially incorrect model l = 0.5 m. With the initially incorrect model l = 2.0 m, the estimated mean model parameter over 10 simulation runs after 500 trials was l = 1.08 m, with variance 0.27 × 10^−2.

4.1.5 Pendulum Swing-Up Task in POMDPs: With Unknown Model. Here we show that the RLSE can be used even when the form of the pendulum dynamics is unknown. We assume only that the pendulum model is an input-affine system and that the input matrix of equation 4.5 is known:

F(x̂(t), u(t)) = (ω, f_ω(x))^T + (0, 1)^T u c. (4.11)
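The input-affine structure of equation 4.11 can be sketched as follows. This is illustrative only: `f_omega_hat` stands in for the output of the model estimation policy, and the input gain `c` is treated as an assumed constant.

```python
import numpy as np

# Sketch of the input-affine model of equation 4.11: the kinematic part
# (theta_dot = omega) and the input vector (0, 1)^T are known; only the drift
# term f_omega(x) is learned.  f_omega_hat and the gain c are assumptions.

def F(x_hat, u, f_omega_hat, c=1.0):
    omega = x_hat[1]
    drift = np.array([omega, f_omega_hat(x_hat)])  # known kinematics + learned drift
    gain = np.array([0.0, 1.0])                    # known input matrix
    return drift + gain * u * c

# Example: a hypothetical learned drift term -sin(theta)
xdot = F(np.array([1.0, 2.0]), u=3.0, f_omega_hat=lambda x: -np.sin(x[0]))
```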
The model estimation policy introduced in section 3.3 can be directly used to estimate the dynamics model of the pendulum. We used the output
Figure 8: Learning the swing-up controller and the state estimation policy with the model estimation policy. l_init represents the initial model parameter of the pendulum length. (A) Learning performance of the swing-up policy when using RLSE (average over 10 simulation runs; we filtered the data with a moving average of 10 trials). (B) Estimated length of the pendulum at each trial.
of the model estimation policy as f_ω(x) ≈ 10.0 × p. A 9 × 4 basis normalized gaussian network was used for the model estimation policy in equations 3.21 and 3.23. We used the learning parameters σ_r = 0.2, α = 1.0, β = 1.0, γ = 0.98, and η = 0.90 for the state estimation policy and η = 0.96 for the model estimation policy. Figure 9A shows the acquired pendulum dynamics. A sinusoidal shape that represents the pendulum dynamics (see equation 4.5) was successfully acquired. Figure 9B illustrates the learning performance of the swing-up policy while estimating the pendulum dynamics and state variables. After 692 trials (averaged over 10 simulation runs), the swing-up policies and the state estimation policies were acquired with an initially unknown model.

4.2 Application to a Biped Model. Finally, the RLSE is applied to a biped model (see Figure 10). The task for the biped robot is to walk down a slope without falling over. The biped dynamics model introduced in Goswami, Thuilot, and Espiau (1996) was employed here. Although it is well known that this two-linked biped model can walk down the slope without any actuation, proper initial angular velocities must be set at each degree of freedom to generate this passive walking pattern. In this study, we try to acquire a controller that can make the biped robot walk even when the initial angular velocities are not suitable for passive walking. We define the state vector as x = (θ_sw, θ_st, ω_sw, ω_st)^T, where θ_sw and θ_st are the leg angles and ω_sw and ω_st are the leg angular velocities, and we define the torque applied at the hip joint, u = T, as the action. We assume that the covariances of the system noise n(k) and the observation noise v(k) are Σ = diag(0.01, 0.01, 0.01, 0.01) Δt and R = diag(0.01, 0.01) Δt, respectively.
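The biped task's noise and observation assumptions can be sketched as follows. Note that the text writes the observation matrix as C = [1 1 0 0] while R is 2 × 2; the 2 × 4 form of C below, selecting the two leg angles as separate outputs, is an assumption made so the dimensions agree.

```python
import numpy as np

# Sketch of the biped observation setup: system noise covariance
# Sigma = diag(0.01, ...) * dt, observation noise covariance R = diag(0.01, 0.01) * dt,
# and only the two leg angles observed.  The 2x4 shape of C is an assumption.

dt = 0.01
Sigma = np.diag([0.01] * 4) * dt
R = np.diag([0.01] * 2) * dt
C = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])   # observe (theta_sw, theta_st) only

rng = np.random.default_rng(0)
x = np.array([0.1, -0.1, 2.0, -0.6])   # (theta_sw, theta_st, omega_sw, omega_st)
y = C @ x + rng.multivariate_normal(np.zeros(2), R)   # noisy leg-angle observation
```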
Figure 9: Learning the state estimation policy with the dynamics of the pendulum in the pendulum swing-up task. (A) Acquired pendulum model. A sinusoidal shape that represents the pendulum dynamics was successfully acquired. (B) Learning performance of the swing-up policy with estimating pendulum dynamics and state variables (average over 10 simulation runs; we filtered the data with a moving average of 10 trials).
Figure 10: Biped model: m H = 10 kg, m = 5 kg, a = b = 0.5 m, φ = 0.03 rad.
This biped model includes a discontinuity when the swing foot touches the ground. Furthermore, we assume that only the leg angles θ_sw and θ_st can be observed; in other words, we consider the linear observation function h(x(k)) = Cx(k) with observation matrix C = [1 1 0 0] in equation 2.2. We also assume that we know the exact biped model. The learning parameters were set as follows: the learning rate for the value function was α = 0.5, the learning rate for the policy was β = 1.0, the discount factor was γ = 0.98, and the decay factor for the eligibility trace was η = 0.97. A 5 × 5 × 10 × 10 basis normalized gaussian network over the four-dimensional space x̂ was used for the state estimation policy in equations 3.15 and 3.17 and for the value function in equation 3.7. We
Figure 11: (A) Learning performance of estimation policy for the biped walking task (average over 10 simulation runs; we filtered the data with a moving average of 10 trials). (B) Learning performance for the biped walking task. The horizontal axis represents the number of walking steps (average over 10 simulation runs; we filtered the data with a moving average of 10 trials).
located the basis functions on a grid with an even interval in each dimension of the input space (−π/4 ≤ θ_sw ≤ π/4, −π/4 ≤ θ_st ≤ π/4, −3π ≤ ω_sw ≤ 3π, −3π ≤ ω_st ≤ 3π). We used the fourth-order Runge-Kutta method with a time step of Δt = 0.01 sec to derive the discrete biped dynamics and used the reward function

r = [ exp( −(y − Cx̂)^T S (y − Cx̂) ) − Σ_{j=1}^{4} c(a_j) ] Δt, (4.12)

where S = diag(1/σ_r^2, 1/σ_r^2) with σ_r^2 = 0.001. We saturate the output of the policy using the sigmoid function in equation A.3 with the maximum outputs a_max = (a_max^1, a_max^2, a_max^3, a_max^4) = (150.0, 150.0, 150.0, 150.0). Again, we used the value-gradient-based policy (see appendix A). A penalty of −0.5 is given to the biped controller when the robot falls over. Each trial was started from an initial state x(0) = (0.0, 0.0, ω_sw(0), ω_st(0))^T, where ω_sw(0) and ω_st(0) were selected from uniform distributions over 2.0 < ω_sw(0) < 2.2 rad/sec and −0.7 < ω_st(0) < −0.5 rad/sec. In addition, each trial was started from an initial estimate x̂(0) = (0.0, 0.0, ω̂_sw(0), ω̂_st(0))^T, where ω̂_sw(0) and ω̂_st(0) were selected from uniform distributions over −π/6 + ω_sw(0) < ω̂_sw(0) < π/6 + ω_sw(0) and −π/6 + ω_st(0) < ω̂_st(0) < π/6 + ω_st(0). Each trial was terminated after 10 sec. A trial was regarded as successful when the number of steps was more than 10. Figure 11A shows the learning performance of the estimation policies. The RLSE successfully acquired estimation policies in the biped walking task. Figure 11B illustrates the learning performance for a biped
Figure 12: Biped walking trajectory and estimation performance. Initial state x = (θ_sw, θ_st, ω_sw, ω_st)^T = (0.0, 0.0, 2.2, −0.7)^T; initial estimate x̂ = (θ̂_sw, θ̂_st, ω̂_sw, ω̂_st)^T = (0.0, 0.0, 2.0, −0.5)^T. (A) Swing leg angle θ_sw. (B) Stance leg angle θ_st. (C) Swing leg angular velocity ω_sw. (D) Stance leg angular velocity ω_st.
walking task. The biped controllers were successfully acquired after 273 trials (averaged over 10 simulation runs). Figure 12 shows the estimation performance for a biped walking trajectory. The RLSE could estimate the state variables even for the biped model. Although slight estimation errors remained in the angular velocities, biped walking policies could still be successfully acquired by using RL. Figure 13A displays the initial performance of the control policy: the biped fell over right after its first step. Figure 13B gives the performance of the control policy acquired after 500 trials. The RLSE could be successfully used to acquire a biped walking controller.

5 Discussion

A conventional way to solve POMDPs is to use a belief state representation, by which the problem can be treated as an MDP in continuous state space (Littman, Cassandra, & Kaelbling, 1995). However, these conventional methods can be applied only to small problems in discrete state space, whereas the RLSE can be applied to problems in continuous state space. Backprop-through-time algorithms (Erdogmus, Genc, & Principe, 2002; Xu, Erdogmus, & Principe, 2005) have been proposed to learn a state estimator. Although this approach can be applied to the state estimation problems
Figure 13: Simultaneous learning of the biped walking controller and the state estimator. The solid line represents the right leg, and the dotted line represents the left leg. Initial state x = (θsw , θst , ωsw , ωst )T = (0.0, 0.0, 2.2, −0.7)T ; initial estimate xˆ = (θˆsw , θˆst , ωˆ sw , ωˆ st )T = (0.0, 0.0, 2.0, −0.5)T . (A) Initial performance. (B) Acquired performance.
in continuous state space, backprop-through-time algorithms can be used only for off-line estimation. RL, on the other hand, can improve the parameters of a state estimator through online updates, since RL reduces the estimation error accumulated over time instead of only the instantaneous estimation error. The application of genetic algorithms to the problem of learning a state estimator was proposed by Porter and Passino (1995). In the genetic algorithm approach, the gain parameter L of the state estimator in equation 3.15 was constant, whereas the RLSE acquires a state-dependent gain L(x̂). A constant gain L can limit the performance of the state estimator when it is applied to nonlinear dynamics. Thau (1973) and Raghavan and Hedrick (1994) proposed nonlinear observer designs that guarantee the convergence of the error dynamics even for a nonlinear environment. However, this approach treats any nonlinear term of the target dynamics as a modeling error relative to a linear dynamics; consequently, its applications are limited. The duality between estimation and control is discussed in Crassidis and Junkins (2004), who showed that the RTS smoother can be derived from optimal control theory. They used the negative log-likelihood function for the
reward and cost functions, that is, e(y − h(x̂)) = −(1/2)(y − h(x̂))^T R^{−1} (y − h(x̂)) and c(a) = (1/2) a^T Σ^{−1} a in equation 3.4. To acquire the RTS smoother, an off-line calculation is needed to derive the costate vector. In our method, we can acquire estimation policies without explicit backward calculation by using the reinforcement learning framework.

6 Conclusion

We proposed a novel use of reinforcement learning, the reinforcement learning state estimator (RLSE), to estimate hidden variables and parameters of nonlinear dynamical systems. The RLSE was applied to the pendulum swing-up task, and simultaneous learning of the RLSE and a swing-up controller, which used the state estimated by the RLSE, was accomplished, whereas the direct use of the policy gradient method could not accomplish this task. We also showed that a swing-up policy could be acquired even when the RLSE initially held an incorrect model of the pendulum or did not know the form of the model. An application of the RLSE to the two-linked biped model was also presented. In this study, we used a gaussian distribution to represent the estimation policy. In the future, it may be possible to use a mixture-of-gaussians representation for the estimation policy to cope with nongaussian state transition probabilities. Because the state estimation policy and the model estimation policy are mutually dependent, future work will involve analysis of our method to show the relationship between the accuracy of state estimation and the model approximation performance.

Appendix A: Value-Gradient-Based Policy for the Pendulum Swing-Up Task

The approximated value function is represented as

V̂_c(x) = Σ_{i=1}^{N_c} v_i^c b_i(x), (A.1)

where v_i^c is the ith parameter of the function approximator and b_i(x) is the normalized gaussian basis function (see appendix C). A 21 × 21 basis normalized gaussian network over the two-dimensional space x̂ was used for the value function approximation in the swing-up task. A 5 × 5 × 17 × 17 basis normalized gaussian network over the four-dimensional space x̂ was used for the value function approximation in the biped walking task.
We use the value-gradient-based policy

u = u_max Φ( (∂V/∂x)(∂f(x, u)/∂u) ), (A.2)

where u_max = 3.0 N · m for the pendulum swing-up task and u_max = 20 N · m for the biped walking task, and the amplitude of the action is limited by the sigmoid function

Φ(x) = (2/π) arctan( (π/2) x ). (A.3)
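Equations A.2 through A.4 can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the numerical trapezoid quadrature for the cost integral is an implementation choice.

```python
import numpy as np

# Sketch of equations A.2-A.4: the action is the value gradient passed through
# the arctan sigmoid Phi, and the control cost nu(u) integrates
# Phi^{-1}(u'/u_max), with Phi^{-1}(z) = (2/pi) * tan((pi/2) z).

def phi(z):
    return (2.0 / np.pi) * np.arctan((np.pi / 2.0) * z)   # equation A.3

def value_gradient_action(dV_dx, df_du, u_max=3.0):
    return u_max * phi(np.dot(dV_dx, df_du))              # equation A.2

def nu(u, u_max=3.0, d_c=0.1, n=1001):                    # equation A.4
    up = np.linspace(0.0, u, n)                           # quadrature grid
    integrand = (2.0 / np.pi) * np.tan((np.pi / 2.0) * up / u_max)
    h = up[1] - up[0] if n > 1 else 0.0
    return d_c * h * (integrand[0] / 2 + integrand[1:-1].sum() + integrand[-1] / 2)
```

Since Φ saturates at ±1, the resulting action always stays within ±u_max.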
This corresponds to defining the cost function in equation 4.10 as

ν(u) = d_c ∫_0^u Φ^{−1}(u'/u_max) du', (A.4)
where we set the parameter for the cost function to d_c = 0.1 for both tasks. The learning rate for the value function was α = 1.0, and the discount factor was γ = 0.99.

Appendix B: Policy Gradient Method for the Pendulum Swing-Up Task

The value function is represented as in equation A.1. We then model the swing-up policy π_{w_c}(x, u) = P(u | x; w_c) as a normal distribution,

π_{w_c}(x, u) = (1 / (√(2π) σ_c(x))) exp( −(u − μ_c(x))^2 / (2 σ_c(x)^2) ), (B.1)
where u is the output of the policy π_{w_c}. We limit the amplitude of the action using the sigmoid function in equation A.3. The mean output μ_c of the policy π_{w_c} is modeled by the normalized gaussian network,

μ_c(x) = Σ_{i=1}^{N_c} w_i^{μ_c} b_i^c(x), (B.2)

where w_i^{μ_c} denotes the parameter for the output of the policy π_{w_c}, N_c denotes the number of basis functions, and b^c(x) is the normalized gaussian basis
function (see appendix C). We represent the variance σ_c of the policy π_{w_c} using a sigmoid function,

σ_c(x̂) = σ_0^c / (1 + exp(−σ_w^c(x̂))), (B.3)

where σ_0^c denotes the scaling parameter, and the state-dependent parameter σ_w^c(x̂) is also modeled by the normalized gaussian network,

σ_w^c(x̂) = Σ_{i=1}^{N_c} w_i^{σ_c} b_i(x̂), (B.4)
where w_i^{σ_c} denotes the ith parameter for the variance of the output. Considering equation B.1, the eligibilities of the policy parameters are given as

∂ ln π_{w_c} / ∂w_i^{μ_c} = ((u − μ_c) / σ_c^2) b_i(x̂), (B.5)

∂ ln π_{w_c} / ∂w_i^{σ_c} = (((u − μ_c)^2 − σ_c^2)(1 − σ_c/σ_0^c) / σ_c^2) b_i(x̂). (B.6)
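The gaussian policy of equations B.1 through B.4 and its eligibilities B.5 and B.6 can be sketched as follows (an illustrative reconstruction; the basis vector b_x and the parameter values are placeholders):

```python
import numpy as np

# Sketch of the gaussian policy (equations B.1-B.4) and its eligibility
# vectors (equations B.5-B.6).  b_x is the normalized gaussian basis vector of
# appendix C; w_mu and w_sigma are policy parameters; sigma0 scales the variance.

def policy(b_x, w_mu, w_sigma, sigma0=1.0):
    mu = np.dot(w_mu, b_x)                                  # equation B.2
    sigma = sigma0 / (1.0 + np.exp(-np.dot(w_sigma, b_x)))  # equations B.3-B.4
    return mu, sigma

def eligibilities(u, mu, sigma, b_x, sigma0=1.0):
    e_mu = (u - mu) / sigma**2 * b_x                                    # B.5
    e_sigma = ((u - mu)**2 - sigma**2) * (1.0 - sigma / sigma0) \
              / sigma**2 * b_x                                          # B.6
    return e_mu, e_sigma

b_x = np.array([0.7, 0.3])
mu, sigma = policy(b_x, w_mu=np.array([1.0, -1.0]), w_sigma=np.zeros(2))
u = np.random.default_rng(0).normal(mu, sigma)   # sample u ~ N(mu, sigma^2), B.1
```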
We update the parameters by using the update rules in equations 3.11 and 3.12. A 21 × 21 basis normalized gaussian network over the two-dimensional space x̂ was used for the policy represented in equations B.2 and B.4. The learning rate for the control policy was β = 1.0, and the decay factor for the eligibility trace was η = 0.90.

Appendix C: Normalized Gaussian Network

Basis functions for the normalized gaussian network are represented as

b_i(x) = ξ_i(x) / Σ_{j=1}^{N} ξ_j(x), (C.1)

where N is the number of basis functions and

ξ_j(x) = exp( −||s_j^T (x − c_j)||^2 ). (C.2)
The vectors c j and s j define the center and the size of the jth basis function, respectively.
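The normalized gaussian basis of equations C.1 and C.2 can be sketched as follows, storing each s_j as a per-dimension inverse-width vector (a diagonal s_j^T); the centers and widths below are illustrative choices.

```python
import numpy as np

# Sketch of the normalized gaussian basis (equations C.1-C.2).

def ngnet_basis(x, centers, widths):
    xi = np.exp(-np.sum((widths * (x - centers)) ** 2, axis=1))  # C.2
    return xi / xi.sum()                                          # C.1

centers = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)   # 5 centers on a 1D grid
widths = np.full((5, 1), 2.0)
b = ngnet_basis(np.array([0.3]), centers, widths)    # sums to 1 by construction
```

A linear readout over such a basis, as in equation A.1, is then just np.dot(v, b).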
Acknowledgments

We thank Hirokazu Tanaka and the reviewers for their helpful information and comments.
References

Arulampalam, M. S., Maskell, S., Gordon, N., & Clapp, T. (2002). A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing, 50(2), 174–188.

Baird, L. C., & Moore, A. W. (1999). Gradient descent for general reinforcement learning. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834–846.

Crassidis, J. L., & Junkins, J. L. (2004). Optimal estimation of dynamic systems. London: Chapman and Hall.

Doucet, A., Godsill, S. J., & Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10, 197–208.

Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219–245.

Erdogmus, D., Genc, A. U., & Principe, J. C. (2002). A neural network perspective to extended Luenberger observers. Institute of Measurement and Control, 35(2), 10–16.

Goswami, A., Thuilot, B., & Espiau, B. (1996). Compass-like biped robot part I: Stability and bifurcation of passive gaits (Tech. Rep. No. RR-2996). Rocquencourt, France: INRIA.

Jaakkola, T., Singh, S. P., & Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 345–352). Cambridge, MA: MIT Press.

Kalman, R. E., & Bucy, R. S. (1961). New results in linear filtering and prediction theory. Trans. ASME, Series D, J. Basic Engineering, 83(1), 95–108.

Kimura, H., & Kobayashi, S. (1998). An analysis of actor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value functions. In Proceedings of the 15th Int. Conf. on Machine Learning (pp. 284–292). San Francisco: Morgan Kaufmann.

Littman, M. L., Cassandra, A. R., & Kaelbling, L. P. (1995). Learning policies for partially observable environments: Scaling up. In A. Prieditis & S. Russell (Eds.), Proceedings of the Twelfth International Conference on Machine Learning (pp. 362–370). San Francisco: Morgan Kaufmann.

Luenberger, D. G. (1971). An introduction to observers. IEEE Trans. Automatic Control, 16, 596–602.

McCallum, R. A. (1995). Reinforcement learning with selective perception and hidden state. Unpublished doctoral dissertation, University of Rochester.

Meuleau, N., Kim, K. E., & Kaelbling, L. P. (2001). Exploration in gradient-based reinforcement learning (Tech. Rep., AI Memo 2001-003). Cambridge, MA: MIT.

Meuleau, N., Peshkin, L., Kim, K., & Kaelbling, L. P. (1999). Learning finite-state controllers for partially observable environments. In Proceedings of the Fifteenth International Conference on Uncertainty in Artificial Intelligence (pp. 427–436). San Francisco: Morgan Kaufmann.

Morimoto, J., & Doya, K. (2002). Development of an observer by using reinforcement learning. In Proc. of the 12th Annual Conference of the Japanese Neural Network Society (pp. 275–278). Tottori, Japan. (In Japanese)

Porter, L. L., & Passino, K. M. (1995). Genetic adaptive observers. Engineering Applications of Artificial Intelligence, 8(3), 261–269.

Raghavan, I. R., & Hedrick, J. K. (1994). Observer design for a class of nonlinear systems. International Journal of Control, 59(2), 515–528.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Thau, F. E. (1973). Observing the state of nonlinear dynamic systems. International Journal of Control, 17(3), 471–479.

Thrun, S. (2000). Monte Carlo POMDPs. In S. A. Solla, T. K. Leen, & K. R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 1064–1070). Cambridge, MA: MIT Press.

Wan, E., & van der Merwe, R. (2000). The unscented Kalman filter for nonlinear estimation. In Proc. of IEEE Symposium 2000 (AS-SPCC). Piscataway, NJ: IEEE.

Wan, E. A., & Nelson, A. T. (1997). Dual Kalman filtering methods for nonlinear prediction, smoothing, and estimation. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 793–799). Cambridge, MA: MIT Press.

Xu, J. W., Erdogmus, D., & Principe, J. C. (2005). Minimum error entropy Luenberger observer. In Proc. of American Control Conference. Piscataway, NJ: IEEE.
Received May 10, 2005; accepted July 17, 2006.
LETTER
Communicated by Yoshua Bengio
Training Recurrent Networks by Evolino

Jürgen Schmidhuber
[email protected] IDSIA, 6928 Manno (Lugano), Switzerland, and TU Munich, 85748 Garching, München, Germany
Daan Wierstra
[email protected] Matteo Gagliolo
[email protected] Faustino Gomez
[email protected] IDSIA, 6928 Manno (Lugano), Switzerland
In recent years, gradient-based LSTM recurrent neural networks (RNNs) solved many previously RNN-unlearnable tasks. Sometimes, however, gradient information is of little use for training RNNs, due to numerous local minima. For such cases, we present a novel method: EVOlution of systems with LINear Outputs (Evolino). Evolino evolves weights to the nonlinear, hidden nodes of RNNs while computing optimal linear mappings from hidden state to output, using methods such as pseudoinverse-based linear regression. If we instead use quadratic programming to maximize the margin, we obtain the first evolutionary recurrent support vector machines. We show that Evolino-based LSTM can solve tasks that Echo State nets (Jaeger, 2004a) cannot and achieves higher accuracy in certain continuous function generation tasks than conventional gradient descent RNNs, including gradient-based LSTM. 1 Introduction Recurrent neural networks (RNNs; Pearlmutter, 1995; Robinson & Fallside, 1987; Rumelhart & McClelland, 1986; Werbos, 1974; Williams, 1989) are mathematical abstractions of biological nervous systems that can perform complex mappings from input sequences to output sequences. In principle, one can wire them up just like microprocessors; hence, RNNs can compute anything a traditional computer can compute (Schmidhuber, 1990). In particular, they can approximate any dynamical system with arbitrary precision (Siegelmann & Sontag, 1991). However, unlike traditional, programmed computers, RNNs learn their behavior from a training set of correct example sequences. As training sequences are fed to the network, the error between the actual and desired network output is minimized using Neural Computation 19, 757–779 (2007)
© 2007 Massachusetts Institute of Technology
758
J. Schmidhuber, D. Wierstra, M. Gagliolo, and F. Gomez
gradient descent, whereby the connection weights are gradually adjusted in the direction that reduces this error most rapidly. Potential applications include adaptive robotics, speech recognition, attentive vision, music composition, and innumerable others where retaining information from arbitrarily far in the past can be critical to making optimal decisions. Recently, echo state networks (ESNs; Jaeger, 2004a) and a similar approach, liquid state machines (Maass, Natschläger, & Markram, 2002), have attracted significant attention. Composed primarily of a large pool of hidden neurons (typically hundreds or thousands) with fixed random weights, ESNs are trained by computing a set of weights from the pool to the output units using fast, linear regression. The idea is that with so many random hidden units, the pool is capable of very rich dynamics that just need to be correctly tapped by setting the output weights appropriately. ESNs have the best-known error rates on the Mackey-Glass time-series prediction task (Jaeger, 2004a). The drawback of ESNs is that the only truly computationally powerful, nonlinear part of the net does not learn, whereas previous supervised, gradient-based learning algorithms for sequence-processing RNNs (Pearlmutter, 1995; Robinson & Fallside, 1987; Schmidhuber, 1992; Werbos, 1988; Williams, 1989) adjust all weights of the net, not just the output weights. Unfortunately, early RNN architectures could not learn to look far back into the past because they made gradients either vanish or blow up exponentially with the size of the time lag (Hochreiter, 1991; Hochreiter, Bengio, Frasconi, & Schmidhuber, 2001). A recent RNN called long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997a), however, overcomes this fundamental problem through a specialized architecture that does not impose any unrealistic bias toward recent events by maintaining constant error flow back through time.
Using gradient-based learning for both linear and nonlinear nodes, LSTM networks can efficiently solve many tasks that were previously unlearnable using RNNs (e.g., Gers & Schmidhuber, 2001; Gers, Schmidhuber, & Cummins, 2000; Gers, Schraudolph, & Schmidhuber, 2002; Graves & Schmidhuber, 2005; Hochreiter & Schmidhuber, 1997a; Pérez-Ortiz, Gers, Eck, & Schmidhuber, 2003; Schmidhuber, Gers, & Eck, 2002). However, even when using LSTM, gradient-based learning algorithms can sometimes yield suboptimal results because rough error surfaces can often lead to inescapable local minima. As we showed (Hochreiter & Schmidhuber, 1997b; Schmidhuber, Hochreiter, & Bengio, 2001), many RNN problems involving long-term dependencies that were considered challenging benchmarks in the 1990s turned out to be trivial in that they could be solved by random weight guessing. That is, these problems were difficult only because learning relied solely on gradient information. There was actually a high density of solutions in the weight space, but the error surface was too rough to be exploited using the local gradient. By repeatedly selecting weights at random, the network does not get
Training Recurrent Networks by Evolino
759
stuck in a local minimum and eventually happens upon one of the plentiful solutions. One popular family of methods that exploits the advantage of random weight guessing in a more efficient and principled way is to search the space of RNN weight matrices (Miglino, Lund, & Nolfi, 1995; Miller, Todd, & Hedge, 1989; Nolfi, Floreano, Miglino, & Mondada, 1994; Sims, 1994; Yamauchi & Beer, 1994; Yao, 1993), using evolutionary algorithms (Holland, 1975; Rechenberg, 1973; Schwefel, 1977). The applicability of such methods is actually broader than that of gradient-based algorithms, since no teacher is required to specify target trajectories for the RNN output nodes. In particular, recent progress has been made with cooperatively coevolving recurrent neurons, each with its own rather small, local search space of possible weight vectors (Gomez, 2003; Moriarty & Miikkulainen, 1996; Potter & De Jong, 1995). This approach can quickly learn to solve difficult reinforcement learning control tasks (Gomez & Schmidhuber, 2005a; Gomez, 2003; Moriarty, 1997), including ones that require use of deep memory (Gomez & Schmidhuber, 2005b). Successfully evolved networks of this type are currently still rather small, with no more than a few hundred weights. At least for supervised applications, however, such methods may be unnecessarily slow, since they do not exploit gradient information about good directions in the search space. To overcome such drawbacks, in what follows, we limit the domain of evolutionary methods to weight vectors of hidden units, while using fast traditional methods for finding optimal linear maps from hidden to output units. We present a general framework for training RNNs called EVOlution of recurrent systems with LINear Outputs (Evolino) (Schmidhuber, Wierstra, & Gomez, 2005; Wierstra, Gomez, & Schmidhuber, 2005).
Evolino evolves weights to the nonlinear, hidden nodes while computing optimal linear mappings from hidden state to output, using methods such as pseudo-inverse-based linear regression (Penrose, 1955) or support vector machines (Vapnik, 1995), depending on the notion of optimality employed. This generalizes methods such as those of Maillard (Maillard & Gueriot, 1997) and Ishii et al. (Ishii, van der Zant, Bečanović, & Plöger, 2004; van der Zant, Bečanović, Ishii, Kobialka, & Plöger, 2004) that evolve radial basis functions and ESNs, respectively. Applied to the LSTM architecture, Evolino can solve tasks that ESNs (Jaeger, 2004a) cannot and achieves higher accuracy in certain continuous function generation tasks than conventional gradient descent RNNs, including gradient-based LSTM (G-LSTM). The next section describes the Evolino framework as well as two specific instances, PI-Evolino (section 2.3), and Evoke (section 2.4), that both combine a cooperative coevolution algorithm called Enforced SubPopulations (section 2.1) with LSTM (section 2.2). In section 3, we apply Evolino to four time series prediction problems, and in section 4 we provide some concluding remarks.
760
J. Schmidhuber, D. Wierstra, M. Gagliolo, and F. Gomez
Figure 1: Evolino network: A recurrent neural network receives sequential inputs u(t) and produces the vector (φ1 , φ2 , . . . , φn ) at every time step t. These values are linearly combined with the weight matrix W to yield the network’s output vector y(t). While the RNN is evolved, the output layer weights are computed using a fast, optimal method such as linear regression or quadratic programming.
2 Evolino

Evolino is a general framework for supervised sequence learning that combines neuroevolution (i.e., the evolution of neural networks) and analytical linear methods that are optimal in some sense, such as linear regression or quadratic programming (see section 2.4). The underlying principle of Evolino is that often a linear model can account for a large number of properties of a problem. Properties that require nonlinearity and recurrence are then dealt with by evolution. Figure 1 illustrates the basic operation of an Evolino network. The output of the network at time t, y(t) ∈ Rm, is computed by the following formulas:

y(t) = Wφ(t),
(2.1)
φ(t) = f (u(t), u(t − 1), . . . , u(0)),
(2.2)
where φ(t) ∈ Rn is the output of a recurrent neural network f (·) and W is a weight matrix. Note that because the networks are recurrent, f (·) is indeed
Training Recurrent Networks by Evolino
761
a function of the entire input history, u(t), u(t − 1), . . . , u(0). In the case of maximum margin classification problems (Vapnik, 1995), we may compute W by quadratic programming. In what follows, however, we focus on mean squared error minimization problems and compute W by linear regression. In order to evolve an f(·) that minimizes the error between y and the correct output, d, of the system being modeled, Evolino does not specify a particular evolutionary algorithm, but rather stipulates only that networks be evaluated using the following two-phase procedure. In the first phase, a training set of sequences obtained from the system, {u^i, d^i}, i = 1..k, each of length l_i, is presented to the network. For each sequence u^i, starting at time t = 0, each input pattern u^i(t) is successively propagated through the recurrent network to produce a vector of activations φ^i(t) that is stored as a row in a (Σ_i l_i) × n matrix Φ. Associated with each φ^i(t) is a target vector d^i(t) in matrix D containing the correct output values for each time step. Once all k sequences have been seen, the output weights W (the output layer in Figure 1) are computed using linear regression from Φ to D. The row vectors in Φ (i.e., the values of each of the n outputs over the entire training set) form a nonorthogonal basis that is combined linearly by W to approximate D. In the second phase, the training set is presented to the network again, but now the inputs are propagated through the recurrent network f(·) and the newly computed output connections to produce predictions y(t). The error in the prediction, or the residual error, is then used as the fitness measure to be minimized by evolution. Alternatively, the error on a previously unseen validation set, or the sum of training and validation error, can be minimized. Neuroevolution is normally applied to reinforcement learning tasks where correct network outputs (i.e., targets) are not known a priori.
Evolino uses neuroevolution for supervised learning to circumvent the problems of gradient-based approaches. In order to obtain the precision required for time-series prediction, we do not try to evolve a network that makes predictions directly. Instead, the network outputs a set of vectors that form a basis for linear regression. The intuition is that finding a sufficiently good basis is easier than trying to find a network that models the system accurately on its own. One possible instantiation of Evolino that we have explored thus far with promising results coevolves the recurrent nodes of LSTM networks using a variant of the enforced subpopulations (ESP) neuroevolution algorithm. The next sections describe ESP, LSTM, and the details of how they are combined in the Evolino framework to form two algorithms: PI-Evolino, which uses the mean squared error optimality criterion, and Evoke, which uses the maximum margin.
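The two-phase evaluation procedure above can be sketched in a few lines. This is a hedged illustration, not the authors' code: a plain tanh RNN stands in for the evolved LSTM network, and the helper names (`rnn_activations`, `evaluate_network`) are our own.

```python
import numpy as np

def rnn_activations(Wh, Win, u_seq):
    """Stand-in recurrent net f(.): a plain tanh RNN producing phi(t) each step.
    (In Evolino proper, this would be the evolved LSTM network.)"""
    h = np.zeros(Wh.shape[0])
    phis = []
    for u in u_seq:
        h = np.tanh(Wh @ h + Win @ u)
        phis.append(h.copy())
    return np.array(phis)                      # shape (l_i, n)

def evaluate_network(Wh, Win, sequences, targets):
    """Two-phase Evolino evaluation: collect activations, fit W, return residual."""
    # Phase 1: stack the phi^i(t) row-wise into Phi and the targets into D.
    Phi = np.vstack([rnn_activations(Wh, Win, u) for u in sequences])
    D = np.vstack(targets)
    # Output weights via the Moore-Penrose pseudoinverse (least-squares optimal).
    # Note: W here maps row vectors, i.e., it is the transpose of W in eq. 2.1.
    W = np.linalg.pinv(Phi) @ D
    # Phase 2: the residual error of the predictions is the fitness to minimize.
    residual = float(np.mean((Phi @ W - D) ** 2))
    return W, residual
```

An outer evolutionary loop would call `evaluate_network` on each candidate weight setting and minimize the returned residual (or, as noted above, a validation-set variant of it).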
2.1 Enforced SubPopulations (ESP). Enforced SubPopulations differs from standard neuroevolution methods in that instead of evolving complete
Figure 2: Enforced subpopulations (ESP). The population of neurons is segregated into subpopulations. Networks are formed by randomly selecting one neuron from each subpopulation. A neuron accumulates a fitness score by adding the fitness of each network in which it participates. The best neurons within each subpopulation are mated to form new neurons. The network shown here is an LSTM network with four memory cells (the triangular shapes).
networks, it coevolves separate subpopulations of network components or neurons (see Figure 2). ESP searches the space of networks indirectly by sampling the possible networks that can be constructed from the subpopulations of neurons. Network evaluations serve to provide a fitness statistic that is used to produce better neurons that can eventually be combined to form a successful network. This cooperative coevolutionary approach is an extension to symbiotic, adaptive neuroevolution (SANE; Moriarty & Miikkulainen, 1996), which also evolves neurons, but in a single population. By using separate subpopulations, ESP accelerates the specialization of neurons into different subfunctions needed to form good networks because members of different evolving subfunction types are prevented from mating. Subpopulations also reduce noise in the neuron fitness measure because each evolving neuron type is guaranteed to be represented in every network that is formed. Both of these features allow ESP to evolve networks more efficiently than SANE (Gomez & Miikkulainen, 1999). ESP normally uses crossover to recombine neurons. However, for the Evolino variant, where fine local search is desirable, ESP uses Cauchy-distributed mutation to produce all new individuals, making the approach in effect an evolution strategy (Schwefel, 1995). More concretely, evolution proceeds as follows:
Training Recurrent Networks by Evolino
763
1. Initialization. The number of hidden units H in the networks that will be evolved is specified, and a subpopulation of n neuron chromosomes is created for each hidden unit. Each chromosome encodes a neuron's input and recurrent connection weights with a string of random real numbers.

2. Evaluation. A neuron is selected at random from each of the H subpopulations and combined to form a recurrent network. The network is evaluated on the task and awarded a fitness score. The score is added to the cumulative fitness of each neuron that participated in the network. This procedure is repeated until each neuron has participated in m evaluations.

3. Reproduction. For each subpopulation, the neurons are ranked by fitness, and the top quarter of the chromosomes, or parents, in each subpopulation are duplicated. The copies, or children, are mutated by adding noise to all of their weight values from the Cauchy distribution f(x) = α / (π(α² + x²)), where the parameter α determines the width of the distribution. The children then replace the lowest-ranking half of their corresponding subpopulation.

4. Repeat. The evaluation–reproduction cycle is repeated until a sufficiently fit network is found.

If during evolution the fitness of the best network evaluated so far does not improve for a predetermined number of generations, a technique called burst mutation is used. The idea of burst mutation is to search the space of modifications to the best solution found so far. When burst mutation is activated, the best neuron in each subpopulation is saved, the other neurons are deleted, and new neurons are created for each subpopulation by adding Cauchy-distributed noise to its saved neuron. Evolution then resumes, but now searching in a neighborhood around the previous best solution. Burst mutation injects new diversity into the subpopulations and allows ESP to continue evolving after the initial subpopulations have converged.
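One ESP evaluate-reproduce cycle can be sketched as follows. This is a simplified illustration under our own assumptions: the function name `esp_generation`, the per-neuron trial budget, and the ranking by mean accumulated fitness are our simplifications of the procedure above, and the Cauchy noise is drawn via the standard inverse-CDF tan trick.

```python
import math
import random

def esp_generation(subpops, fitness_fn, trials=10, alpha=0.1):
    """One ESP evaluate-reproduce cycle (simplified sketch).
    subpops: H subpopulations, each a list of weight-vector chromosomes."""
    H, n = len(subpops), len(subpops[0])
    fit = [[0.0] * n for _ in range(H)]
    cnt = [[0] * n for _ in range(H)]
    # Evaluation: assemble networks from one random neuron per subpopulation;
    # each neuron accumulates the fitness of every network it appears in.
    for _ in range(trials * n):
        picks = [random.randrange(n) for _ in range(H)]
        f = fitness_fn([subpops[h][picks[h]] for h in range(H)])
        for h in range(H):
            fit[h][picks[h]] += f
            cnt[h][picks[h]] += 1
    # Reproduction: rank by mean accumulated fitness; duplicate the top
    # quarter, Cauchy-mutate the copies, and replace the bottom half.
    for h in range(H):
        order = sorted(range(n), reverse=True,
                       key=lambda i: fit[h][i] / cnt[h][i] if cnt[h][i] else float('-inf'))
        ranked = [subpops[h][i] for i in order]
        children = []
        while len(children) < n // 2:
            parent = ranked[random.randrange(max(n // 4, 1))]
            # Cauchy noise via the inverse CDF: alpha * tan(pi * (U - 1/2)).
            children.append([w + alpha * math.tan(math.pi * (random.random() - 0.5))
                             for w in parent])
        subpops[h] = ranked[:n - len(children)] + children
    return subpops
```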
2.2 Long Short-Term Memory. LSTM is a recurrent neural network purposely designed to learn long-term dependencies via gradient descent. The unique feature of the LSTM architecture is the memory cell, which is capable of maintaining its activation indefinitely (see Figure 3). Memory cells consist of a linear unit, which holds the state of the cell, and three gates that can open or close over time. The input gate “protects” a neuron from its input: only when the gate is open can inputs affect the internal state of the neuron. The output gate lets the state out to other parts of the network, and the forget gate enables the state to “leak” activity when it is no longer useful.
Figure 3: Long short-term memory. The figure shows an LSTM memory cell. The cell has an internal state S together with a forget gate (G_F) that determines how much the state is attenuated at each time step. The input gate (G_I) controls access to the cell by the external inputs that are summed into the unit, and the output gate (G_O) controls when and how much the cell fires. Small, dark nodes represent the multiplication function.
The state of cell i is computed by

s_i(t) = net_i(t) g_i^in(t) + g_i^forget(t) s_i(t − 1),    (2.3)

where g^in and g^forget are the activations of the input and forget gates, respectively, and net is the weighted sum of the external inputs (indicated by the Σ in Figure 3):

net_i(t) = h( Σ_j w_ij^cell c_j(t − 1) + Σ_k w_ik^cell u_k(t) ),    (2.4)

where h is usually the identity function and c_j is the output of cell j:

c_j(t) = tanh(g_j^out(t) s_j(t)),    (2.5)
Training Recurrent Networks by Evolino
765
where g_j^out is the output gate of cell j. The amount each gate g_i of memory cell i is open or closed at time t is calculated by

g_i^type(t) = σ( Σ_j w_ij^type c_j(t − 1) + Σ_k w_ik^type u_k(t) ),    (2.6)
where type can be input, output, or forget, and σ is the standard sigmoid function. The gates receive input from the output of other cells c_j and from the external inputs to the network.

2.3 Combining LSTM, ESP, and Pseudoinverse in Evolino. We apply our general Evolino framework to the LSTM architecture, using ESP for evolution and regression for computing linear mappings from hidden state to outputs. ESP coevolves subpopulations of LSTM memory cells instead of standard recurrent neurons (see Figure 2). Each chromosome is a string containing the external input weights and the input, output, and forget gate weights, for a total of 4 × (I + H) weights in each memory cell chromosome, where I is the number of external inputs and H is the number of memory cells in the network. There are four sets of I + H weights because the three gates and the cell itself receive input from outside the cell and from the other cells. Figure 4 shows how the memory cells are encoded in an ESP chromosome. Each chromosome in a subpopulation encodes the connection weights for a cell's input, output, and forget gates, and external inputs. The linear regression method used to compute the output weights (W in equation 2.1) is the Moore-Penrose pseudoinverse method, which is both fast and optimal in the sense that it minimizes the summed squared error (Penrose, 1955) (compare Maillard & Gueriot, 1997, for an application to feedforward RBF nets, and Ishii et al., 2004, for an application to echo state networks). The vector φ(t) consists of both the cell outputs c_i and their internal states s_i, so that the pseudoinverse computes two connection weights for each memory cell. We refer to the connections from internal states to the output units as "output peephole" connections, since they peer into the interior of the cells.
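For concreteness, the cell update of equations 2.3 to 2.6, together with the φ(t) vector handed to the linear readout, can be sketched as follows. This is a minimal NumPy illustration under our own assumptions: the weight layout and function names are ours, and gate biases are omitted.

```python
import numpy as np

def sigma(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cells_step(weights, c_prev, s_prev, u):
    """One time step for a layer of LSTM memory cells, following eqs. 2.3-2.6.
    weights[g] = (W_cell, W_in): recurrent and external-input weights of
    part g, for g in {'cell', 'in', 'out', 'forget'}."""
    def summed(g):
        Wc, Wu = weights[g]
        return Wc @ c_prev + Wu @ u
    g_in = sigma(summed('in'))            # input gates, eq. 2.6
    g_forget = sigma(summed('forget'))    # forget gates, eq. 2.6
    g_out = sigma(summed('out'))          # output gates, eq. 2.6
    net = summed('cell')                  # eq. 2.4, with h = identity
    s = net * g_in + g_forget * s_prev    # internal states, eq. 2.3
    c = np.tanh(g_out * s)                # cell outputs, eq. 2.5
    phi = np.concatenate([c, s])          # what the linear readout sees
    return c, s, phi
```

The returned `phi` concatenates cell outputs and internal states, mirroring the "output peephole" readout described above.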
For continuous function generation, backprojection (or teacher forcing, in standard RNN terminology) is used where the predicted outputs are fed back as inputs in the next time step: φ(t) = f (u(t), y(t − 1), u(t − 1), . . . , y(0), u(0)). During training, the correct target values are backprojected, in effect “clamping” the network’s outputs to the right values. During testing, the network backprojects its own predictions. This technique is also used by ESNs, but whereas ESNs do not change the backprojection connection weights, Evolino evolves them, treating them like any other input to the network. In the experiments described below, backprojection was found useful
Figure 4: Genotype-phenotype mapping. Each chromosome (genotype, left) in a subpopulation encodes the external input, and input, output, and forget gate weights of an LSTM memory cell (phenotype, right). The weights leading out of the state (S) and output (O) units are not encoded in the genotype, but are instead computed at evaluation time by linear regression.
for continuous-function generation tasks, but interferes to some extent with performance in the discrete context-sensitive language task.

2.4 Evoke: Evolino for Recurrent Support Vector Machines. As outlined in section 2, the Evolino framework does not prescribe a particular optimality criterion for computing the output weights. If we replace mean squared error with the maximum margin criterion of support vector machines (SVMs; Vapnik, 1995), the optimal linear output weights can be evaluated using, for example, quadratic programming, as in traditional SVMs. We call this Evolino variant EVOlution of systems with KErnel-based outputs (Evoke; Schmidhuber, Gagliolo, Wierstra, & Gomez, 2006). The Evoke variant of equation 2.1 becomes
y(t) = w_0 + Σ_{i=1}^{k} Σ_{j=0}^{l_i} w_ij K(φ(t), φ^i(j)),    (2.7)
where φ(t) ∈ Rn is, again, the output of the recurrent neural network f(·) at time t (see equation 2.2), K(·, ·) is a predefined kernel function, and the weights w_ij correspond to the k training sequences φ^i, each of length l_i, and are computed with the support vector algorithm. SVMs are powerful regressors and classifiers that make predictions based on a linear combination of kernel basis functions. The kernel maps the input feature space to a higher-dimensional space where the data are linearly separable (in classification) or can be approximated well with a hyperplane (in regression). A limited way of applying existing SVMs to time-series prediction (Mukherjee, Osuna, & Girosi, 1997; Müller et al., 1997) or classification (Salomon et al., 2002) is to build a training set, either by transforming the sequential input into some static domain (e.g., a frequency and phase representation) or by considering restricted, fixed time windows of m sequential input values. One alternative, presented in Shimodaira, Noma, Nakai, and Sagayama (2002), is to average kernel distance between elements of input sequences aligned to m points. Of course, such approaches are bound to fail if there are temporal dependencies exceeding m steps. In a more sophisticated approach by Suykens and Vandewalle (2000), a window of m previous output values is fed back as input to a recurrent model with a fixed kernel. So far, however, there has not been any recurrent SVM that learns to create internal state representations for sequence learning tasks involving time lags of arbitrary length between important input events. For example, consider the task of correctly classifying arbitrary instances of the context-free language a^n b^n (n a's followed by n b's, for arbitrary integers n > 0). For Evoke, the evolved RNN is a preprocessor for a standard SVM kernel. The combination of both can be viewed as an adaptive kernel learning a task-specific distance measure between pairs of input sequences.
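Given support-vector weights from an SVM optimizer, evaluating equation 2.7 is straightforward. The sketch below assumes a gaussian kernel (as used in the experiments later in the text); the data layout and helper names are our own, and the SVM optimization step itself is not shown.

```python
import math

def gaussian_kernel(a, b, width=2.0):
    """K(phi, phi') = exp(-||phi - phi'||^2 / (2 * width^2))."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * width ** 2))

def evoke_output(phi_t, support, w0=0.0, width=2.0):
    """Evaluate equation 2.7: y(t) = w0 + sum_i sum_j w_ij K(phi(t), phi^i(j)).
    support: for each training sequence i, a list of (w_ij, phi_i_j) pairs
    as produced by the support vector algorithm (not shown here)."""
    y = w0
    for seq in support:
        for w_ij, phi_ij in seq:
            y += w_ij * gaussian_kernel(phi_t, phi_ij, width)
    return y
```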
Although Evoke uses SVM methods, it can solve several tasks that traditional SVMs cannot solve even in principle. We will see that it also outperforms recent state-of-the-art RNNs on certain tasks, including echo state networks (ESNs; Jaeger, 2004a) and previous gradient descent RNNs (Hochreiter & Schmidhuber, 1997a; Pearlmutter, 1995; Robinson & Fallside, 1987; Rumelhart & McClelland, 1986; Werbos, 1974; Williams, 1989).

3 Experiments

Experiments with PI-Evolino were carried out on four test problems: context-sensitive languages, multiple superimposed sine waves, the parity problem with display, and the Mackey-Glass time series. The first two were chosen to highlight Evolino's ability to perform well in both discrete and continuous domains, and to solve tasks that neither ESNs (Jaeger, 2004a) nor traditional gradient-descent RNNs (Pearlmutter, 1995; Robinson & Fallside, 1987; Rumelhart & McClelland, 1986; Werbos, 1974; Williams, 1989) can solve well. We also report successful experiments with
Evoke-trained RSVMs on these tasks. The parity problem with display demonstrates PI-Evolino's ability to cope with rough error surfaces that confound gradient-based approaches, including G-LSTM. Although the Mackey-Glass system can be modeled accurately by nonrecurrent systems (Jaeger, 2004a), it was selected to compare PI-Evolino with ESNs, the reference method on this widely used time-series benchmark.

3.1 Context-Sensitive Grammars. Context-sensitive languages are languages that cannot be recognized by deterministic finite-state automata and are therefore more complex in some respects than regular languages. In general, determining whether a string of symbols belongs to a context-sensitive language requires remembering all the symbols in the string seen so far, ruling out the use of nonrecurrent architectures. To compare Evolino-based LSTM with published results for G-LSTM (Gers & Schmidhuber, 2001), we chose the language a^n b^n c^n. The task was implemented using networks with four input units, one for each symbol (a, b, c) plus the start symbol S, and four output units, one for each symbol plus the termination symbol T. Symbol strings were presented sequentially to the network, with each symbol's corresponding input unit set to 1 and the other three set to −1. At each time step, the network must predict the possible symbols that could come next in a legal string. Legal strings in a^n b^n c^n are those in which the number of a's, b's, and c's is equal: for example, ST, SabcT, SaabbccT, and SaaabbbcccT. So for n = 3, the set of input and target values would be:

Input:  S   a   a   a   b  b  b  c  c  c
Target: a/T a/b a/b a/b b  b  c  c  c  T

Evolino-based LSTM networks were evolved using eight different training sets, each containing legal strings with values for n as shown in the first column of Table 1. In the first four sets, n ranges from 1 to k, where k = 10, 20, 30, 40.
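The string and next-symbol target construction can be reproduced mechanically. The following sketch (helper names are our own; the +1/−1 coding follows the text) generates one legal string with its target sets:

```python
def anbncn_example(n):
    """Input symbols and next-symbol target sets for one legal string of
    a^n b^n c^n, following the n = 3 example in the text."""
    inputs = ['S'] + ['a'] * n + ['b'] * n + ['c'] * n
    targets = [{'a', 'T'}]                     # after S: an a, or the empty string
    targets += [{'a', 'b'}] * n                # while reading a's, n is still open
    targets += [{'b'}] * (n - 1) + [{'c'}]     # once b's begin, n is fixed
    targets += [{'c'}] * (n - 1) + [{'T'}]     # the c's, then termination
    return inputs, targets

def encode(symbol, alphabet=('S', 'a', 'b', 'c')):
    """+1/-1 encoding of an input symbol, one input unit per symbol."""
    return [1.0 if symbol == s else -1.0 for s in alphabet]
```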
The second four sets consist of just two training samples each and were intended to test how well the methods could induce the language from a nearly minimal number of examples. LSTM networks were evolved with four memory cells for PI-Evolino and five for Evoke, with random initial values for the weights between −0.1 and 0.1 for Evolino and between −5.0 and 5.0 for Evoke, and the inputs were multiplied by 100. The Cauchy noise parameter α for both mutation and burst mutation was set to 0.00001 for Evolino and to 0.1 for Evoke; 50% of the mutations are kept within these bounds. In keeping with the setup in Gers and Schmidhuber (2001), we added a bias unit to the forget gates and output gates with values of +1.5 and −1.5, respectively. For Evoke, the parameters of the SVM module were chosen heuristically: a gaussian kernel with standard deviation 2.0 and capacity 100.0. Evolino evaluates fitness on
Table 1: Results for the a^n b^n c^n Language.

Training Data    Standard PI-Evolino    Tuned PI-Evolino    Gradient-LSTM
1..10            1..29                  1..53               1..28
1..20            1..67                  1..95               1..66
1..30            1..93                  1..355              1..91
1..40            1..101                 1..804              1..120
10,11            4..14                  3..35               10..11
20,21            13..36                 5..39               17..23
30,31            26..39                 3..305              29..32
40,41            32..53                 1..726              35..45
Notes: The table compares pseudoinverse-based Evolino (PI-Evolino) with gradient-based LSTM (G-LSTM) on the a^n b^n c^n language task. "Standard" refers to Evolino with the parameter settings used for both discrete and continuous domains (a^n b^n c^n and superimposed sine waves). The "tuned" version is biased to the language task: we additionally squash the cell input with the tanh function. The left-most column shows the set of strings used for training in each of the experiments. The other three columns show the set of legal strings to which each method could generalize after 50 generations (3000 evaluations), averaged over 20 runs. The upper training sets contain all strings up to the indicated length. The lower training sets contain only a single pair. PI-Evolino generalizes better than G-LSTM, most notably when trained on only two examples of correct behavior. The G-LSTM results are taken from Gers and Schmidhuber (2001).
the entire training set, but Evoke uses a slightly different way of evaluating fitness: while the training set consists of the first half of the strings, fitness was defined as performance on the second half of the data, the validation set. Evolution was terminated after 50 generations, after which the best network in each simulation was tested. Table 1 compares the results of Evolino-based LSTM, using the pseudoinverse as supervised learning module (PI-Evolino), with those of G-LSTM from Gers and Schmidhuber (2001); standard PI-Evolino uses parameter settings that are a compromise between discrete and continuous domains. If we set h to the tanh function, we obtain tuned PI-Evolino. We never managed to train ESNs to solve this task, presumably because the random prewiring of ESNs rarely represents an algorithm for solving such context-sensitive language problems. The standard PI-Evolino networks had generalization very similar to that of G-LSTM on the 1..k training sets, but slightly better on the two-example training sets. Tuned PI-Evolino showed a dramatic improvement over G-LSTM on all of the training sets, most remarkably on the two-example
Figure 5: Internal state activations. The state activations for the four memory cells of an Evolino network being presented the string a^800 b^800 c^800. The plot clearly shows how some units function as "counters," recording how many a's and b's have been seen. More complex, nonlinear behavior by the gates is not shown.
sets, where it was able to generalize on average to all strings up to n = 726 after being trained on only n = {40, 41}. Evoke's performance was superior for n = 10 and n = 20, generalizing up to n = 257 and n = 374, respectively, but degraded for larger values of n, for which both PI-Evolino and G-LSTM achieved better results. Figure 5 shows the internal states of each of the four memory cells of one of the networks evolved by PI-Evolino while processing a^800 b^800 c^800.

3.2 Parity Problem with Display. The parity problem with display involves classifying sequences consisting of 1's and −1's according to whether the number of 1's is even or odd. The target, which depends on the entire sequence, is a display of 10 × 10 output neurons depicting O for odd and E for even. The display prevents the task from being solved by guessing the network weights (Hochreiter & Schmidhuber, 1997b) and makes the error gradient very rough. We trained PI-Evolino with two memory cells on 50 random sequences of length between 100 and 110. Unlike G-LSTM, which typically cannot solve this task due to a lack of global gradient information, PI-Evolino learned a perfect display classification on a test set within 30 generations in all 20 experiments.

3.3 Mackey-Glass Time-Series Prediction. The Mackey-Glass system (MGS; Mackey & Glass, 1977) is a standard benchmark for chaotic time-series prediction. The system produces an irregular time series that is
generated by the following differential equation: ẏ(t) = αy(t − τ)/(1 + y(t − τ)^β) − γy(t), where the parameters are usually set to α = 0.2, β = 10, γ = 0.1. The system is chaotic whenever the delay τ > 16.8. We use the most common value for the delay, τ = 17. Although the MGS can be modeled accurately using feedforward networks with a time window on the input, we compare PI-Evolino to ESNs (currently the best method for the MGS) in this domain to show its capacity for making precise predictions. We used the same setup in our experiments as in Jaeger (2004a). Networks were evolved in the following way. During the first phase of an evaluation, the network predicts the next function value for 3000 time steps with the benefit of the backprojected target from the previous time step. For the first 100 washout time steps, the vectors φ(t) are not collected; only the φ(t), t = 101..3000, are used to calculate the output weights using the pseudoinverse. During the second phase, the previous target is backprojected only during the washout time, after which the network runs freely by backprojecting its own predictions. The fitness score assigned to the network is the MSE on time steps 101..3000. Networks with 30 memory cells were evolved for 200 generations with a Cauchy noise α of 10^−7. A bias input of 1.0 was added to the network, the backprojection values were scaled by a factor of 0.1, and the cell input was squashed with the tanh function. At the end of an evolutionary run, the best network found was tested by having it predict using the backprojected previous target for the first 3000 steps and then run freely from time step 3001 to 3084.¹ The average NRMSE_84 for PI-Evolino with 30 cells over the 15 runs was 1.9 × 10^−3, compared to 10^−4.2 for ESNs with 1000 neurons (Jaeger, 2004a). The PI-Evolino results are currently the second-best reported so far.
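For readers who want to regenerate the benchmark signal, the delay equation can be integrated with a simple Euler scheme. This is a hedged sketch: the constant-history initialization, step size, and function name are our choices, not those of the paper.

```python
def mackey_glass(n_steps, tau=17.0, alpha=0.2, beta=10.0, gamma=0.1,
                 dt=0.1, y0=1.2):
    """Euler integration of the Mackey-Glass delay equation (sketch):
    dy/dt = alpha * y(t - tau) / (1 + y(t - tau)**beta) - gamma * y(t).
    History before t = 0 is held constant at y0 (a common convention)."""
    delay = int(round(tau / dt))
    y = [y0] * (delay + 1)          # y[-1] is the current value
    out = []
    for _ in range(n_steps):
        y_tau = y[-(delay + 1)]     # the delayed value y(t - tau)
        y_new = y[-1] + dt * (alpha * y_tau / (1.0 + y_tau ** beta)
                              - gamma * y[-1])
        y.append(y_new)
        out.append(y_new)
    return out
```

With a small step size the series settles into the familiar bounded, irregular oscillation; benchmark setups typically subsample the integrated trajectory to unit time steps.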
Figure 6 shows the performance of an Evolino network on the MG time series with even fewer memory cells, after 50 generations. Because this network has fewer parameters, it is unable to achieve the same precision as with 30 neurons, but it demonstrates how Evolino can learn such functions very quickly: in this case, within approximately 3 minutes of CPU time.

3.4 Multiple Superimposed Sine Waves. Learning to generate a sinusoidal signal is a relatively simple task that requires only one bit of memory to indicate whether the current network output is greater or less than the previous output. When sine waves with frequencies that are not integer multiples of each other are superimposed, the resulting signal becomes much harder to predict because its wavelength can be extremely long; there is a large number of time steps before the periodic signal repeats. Generating
¹ The normalized root mean square error (NRMSE_84) 84 steps after the end of the training sequence is the standard comparison measure used for this problem.
Figure 6: Performance of PI-Evolino on the Mackey-Glass time series. The plot shows both the Mackey-Glass system and the prediction made by a typical Evolino-based LSTM network evolved for 50 generations. The obvious difference between the system and the prediction during the first 100 steps is due to the washout time. The inset shows a magnification that illustrates more clearly the deviation between the two curves.

Table 2: PI-Evolino Results for Multiple Superimposed Sine Waves.

Number of Sines    Number of Cells    Training NRMSE    Generalization NRMSE
2                  10                 2.01 × 10^−3      4.15 × 10^−3
3                  15                 2.44 × 10^−3      8.04 × 10^−3
4                  20                 1.51 × 10^−2      1.10 × 10^−1
5                  20                 1.60 × 10^−2      1.66 × 10^−1
Notes: The training NRMSE is calculated on time steps 100 to 400 (i.e., the washout time is not included in the measure). The generalization NRMSE is calculated for time steps 400 to 700 (averaged over 20 experiments).
such a signal accurately without recurrence would require a prohibitively large time-delay window in a feedforward architecture. Jaeger (2004b) reports that echo state networks are unable to learn functions composed of even two superimposed oscillators, in particular sin(0.2x) + sin(0.311x). The reason is that the dynamics of all the neurons in the ESN pool are coupled, while this task requires that the two underlying oscillators be represented separately by the network's internal state. Here we show how Evolino-based LSTM can not only solve the two-sine function mentioned above, but also more complex functions formed by superimposing up to three more sine waves. Each of the functions was constructed as Σ_{i=1}^{n} sin(λ_i x), where n is the number of sine waves and λ_1 = 0.2, λ_2 = 0.311, λ_3 = 0.42, λ_4 = 0.51, and λ_5 = 0.74. For this task, PI-Evolino networks were evolved using the same setup and procedure as for the Mackey-Glass system, except that steps 101..400 were used to calculate the output weights in the first evaluation phase and fitness in the second phase. Also, the inputs were multiplied by 0.01 instead of 100. Again, during the first 100 washout time steps, the vectors φ(t) were not collected for computing the pseudoinverse.
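The target signals themselves are easy to regenerate. A small sketch (helper name ours) constructs the n-sine superposition from the λ values above:

```python
import math

# The frequencies lambda_1..lambda_5 used in the experiments.
LAMBDAS = (0.2, 0.311, 0.42, 0.51, 0.74)

def superimposed_sines(n_sines, n_steps):
    """Target signal sum_{i=1}^{n} sin(lambda_i * x) for x = 0..n_steps-1,
    using the first n_sines frequencies."""
    lams = LAMBDAS[:n_sines]
    return [sum(math.sin(lam * x) for lam in lams) for x in range(n_steps)]
```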
Training Recurrent Networks by Evolino
773
Figure 7: Performance of PI-Evolino on the superimposed sine wave tasks. The plots show the behavior of a typical network produced after a specified number of generations: 50 for the two-, three-, and four-sine functions and 150 for the five-sine function. The first 300 steps of each function, in the left column (training), were used for training. The curves in the right column (generalization) show values predicted by the networks (dashed curves) further into the future versus the corresponding reference signal (solid curves). While the onset of noticeable prediction error occurs earlier as more sines are added, the networks still track the correct behavior for hundreds of time steps, even for the five-sine case.
The first three tasks, n = 2, 3, 4, used subpopulations of size 40, and simulations were run for 50 generations. The five-sine wave task, n = 5, proved much more difficult to learn, requiring a larger subpopulation size of 100, and simulations were allowed to run for 150 generations. At the end of each run, the best network was tested for generalization on data points
774
J. Schmidhuber, D. Wierstra, M. Gagliolo, and F. Gomez
Figure 8: Internal representation of two-sine function. The upper graph shows the output of a PI-Evolino LSTM network with three cells predicting the two-sine function. The three pairs of lower graphs show the output (upper) and output peephole (lower) values of each cell in the network multiplied by their respective (pseudoinverse-generated) output weight. These six signals are added to generate the signal in the upper graph.
from time steps 401..700, making predictions using backprojected previous predictions. For Evoke, a slightly different setting was used, in which networks were evolved to minimize the sum of training and validation error, on points 100..400 and 400..700, respectively, and tested on points 700..1000. The
weight range was set to [−1.0, 1.0], and a gaussian kernel with standard deviation 2.0 and capacity 10.0 was used for the SVM. Table 2 shows the number of memory cells used for each task and the average summed squared error on both the training set and the testing set for the best network found during each evolutionary run of PI-Evolino. Evoke achieved a relatively low generalization NRMSE of 1.03 × 10−2 on the double sines problem, but gave unsatisfactory results for three or more sines. Figure 7 shows the behavior of one of the successful networks from each of the tasks. The column on the left shows the target signal from Table 2 and the output generated by the network on the training set. The column on the right shows the same curves forward in time to show the generalization capability of the networks. For the two-sine function, even after 9000 time steps, the network continues to generate the signal accurately. As more sines are added, the prediction error grows more quickly, but the overall behavior of the signal is still retained. Figure 8 reveals how the two-sine wave is represented internally by a typical PI-Evolino network. For illustration, a less accurate network containing only 3 cells instead of 10 is shown. The upper graph shows the overall output of the network, while the other graphs show the output peephole and output activity of each cell multiplied by the corresponding pseudoinverse-generated output weight. Although the network can generate the function accurately for thousands of time steps, it does not do so by implementing sinusoidal oscillators. Instead, each cell by itself behaves in a manner that is qualitatively similar to the two-sine, but scaled, translated, and phase-shifted. These six separate signals are added together to produce the network output.
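This additive decomposition can be mimicked in a toy sketch (entirely illustrative: the amplitudes, offsets, and phases below are invented, not the evolved network's actual signals). Scaled, translated, phase-shifted copies of the two-sine span the required sine/cosine components, so least-squares output weights can recover the target:

```python
import numpy as np

t = np.arange(400, dtype=float)
two_sine = np.sin(0.2 * t) + np.sin(0.311 * t)

# Hypothetical per-cell signals: scaled, offset, phase-shifted copies of the
# two-sine, qualitatively like the cell outputs in Figure 8 (parameters invented).
params = [(0.5, 0.3, 1.0), (0.3, -0.7, -2.0), (0.2, 1.2, 1.0)]
cells = np.stack(
    [a * (np.sin(0.2 * t + p) + np.sin(0.311 * t + p)) + c for a, p, c in params],
    axis=1,
)

# Output weights combining the per-cell signals, in the spirit of the
# pseudoinverse-generated readout.
w, *_ = np.linalg.lstsq(cells, two_sine, rcond=None)
output = cells @ w
```

Because three distinct phases give three independent constraints (on the sine, cosine, and constant components), the least-squares combination reproduces the target essentially exactly in this toy case.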
4 Conclusion

The human brain is a biological, learning RNN. Previous successes with artificial RNNs have been limited by problems overcome by the LSTM architecture. Its algorithms for shaping not only the linear but also the nonlinear parts allow LSTM to learn to solve tasks unlearnable by standard feedforward nets, support vector machines, hidden Markov models, and previous RNNs. Previous work on LSTM has focused on gradient-based G-LSTM (Gers & Schmidhuber, 2001; Gers et al., 2000, 2002; Graves & Schmidhuber, 2005; Hochreiter & Schmidhuber, 1997a; Pérez-Ortiz et al., 2003; Schmidhuber et al., 2002). Here we introduced the novel Evolino class of supervised learning algorithms for such nets that overcomes certain problems of gradient-based RNNs with local minima. Successfully tested instances with hidden coevolving recurrent neurons use either the pseudoinverse to minimize the MSE of the linear mapping from hidden units to outputs (PI-Evolino), or quadratic programming to maximize the margin.
The latter yields the first evolutionary recurrent SVMs or RSVMs, trained by an Evolino variant called Evoke. In the experiments of our pilot study, RSVMs generally performed better than G-LSTM and previous gradient-based RNNs but typically worse than PI-Evolino. One possible reason for this could be that the kernel mapping of the SVM component induces a more rugged fitness landscape that makes evolutionary search harder. All of the evolved networks were comparatively small, usually featuring fewer than 3000 weights. For large data sets, such as those used in speech recognition, we typically need much larger LSTM networks with on the order of 100,000 weights (Graves & Schmidhuber, 2005). On such problems, we have so far generally obtained better results with G-LSTM than with Evolino. This seems to reaffirm the heuristic that evolution of large parameter sets is often harder than gradient search in such sets. Currently it is unclear when exactly to favor one over the other. Future work will explore hybrids combining G-LSTM and Evolino in an attempt to leverage the best of both worlds. We will also explore ways of improving the performance of Evoke, including the coevolution of SVM kernel parameters. We have barely tapped the set of possible applications of our new approaches; in principle, any learning task that requires some sort of adaptive short-term memory may benefit.

Acknowledgments

This research was partially funded by SNF grant 200020-107534, EU MindRaces project FP6 511931, and a CSEM Alpnach grant to J. S.

References

Gers, F. A., & Schmidhuber, J. (2001). LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6), 1333–1340.
Gers, F. A., Schmidhuber, J., & Cummins, F. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451–2471.
Gers, F. A., Schraudolph, N., & Schmidhuber, J. (2002). Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3, 115–143.
Gomez, F. J. (2003). Robust nonlinear control through neuroevolution. Unpublished doctoral dissertation, University of Texas at Austin.
Gomez, F., & Miikkulainen, R. (1999). Solving non-Markovian control tasks with neuroevolution. In Proceedings of the 16th International Joint Conference on Artificial Intelligence. San Francisco: Morgan Kaufmann.
Gomez, F., & Schmidhuber, J. (2005a). Evolving modular fast-weight networks for control. In W. Duch, J. Kacprzyk, E. Oja, & S. Zadrozny (Eds.), Proceedings of the Fifteenth International Conference on Artificial Neural Networks: ICANN-05 (pp. 383–389). Berlin: Springer.
Gomez, F. J., & Schmidhuber, J. (2005b). Co-evolving recurrent neurons learn deep memory POMDPs. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-05). Berlin: Springer-Verlag.
Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18, 602–610.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München.
Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. In S. C. Kremer & J. F. Kolen (Eds.), A field guide to dynamical recurrent neural networks. Piscataway, NJ: IEEE Press.
Hochreiter, S., & Schmidhuber, J. (1997a). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hochreiter, S., & Schmidhuber, J. (1997b). LSTM can solve hard long time lag problems. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 473–479). Cambridge, MA: MIT Press.
Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press.
Ishii, K., van der Zant, T., Bečanović, V., & Plöger, P. G. (2004). Identification of motion with echo state network. In Proc. IEEE Oceans04 (pp. 1205–1230). Kobe, Japan: IEEE.
Jaeger, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304, 78–80.
Jaeger, H. (2004b). The echo state approach to recurrent neural networks. Available online at http://www.faculty.iu-bremen.de/hjaeger/courses/SeminarSpring04/ESNStandardSlides.pdf.
Maass, W., Natschläger, T., & Markram, H. (2002). A fresh look at real-time computation in generic recurrent neural circuits (Tech. Rep.). Graz: Institute for Theoretical Computer Science, TU Graz.
Mackey, M. C., & Glass, L. (1977). Oscillation and chaos in physiological control systems. Science, 197, 287–289.
Maillard, E. P., & Gueriot, D. (1997). RBF neural network, basis functions and genetic algorithms. In IEEE International Conference on Neural Networks (pp. 2187–2190). Piscataway, NJ: IEEE.
Miglino, O., Lund, H., & Nolfi, S. (1995). Evolving mobile robots in simulated and real environments. Artificial Life, 2(4), 417–434.
Miller, G., Todd, P., & Hedge, S. (1989). Designing neural networks using genetic algorithms. In Proceedings of the 3rd International Conference on Genetic Algorithms (pp. 379–384). San Francisco: Morgan Kaufmann.
Moriarty, D. E. (1997). Symbiotic evolution of neural networks in sequential decision tasks. PhD thesis, Dept. of Computer Sciences, University of Texas at Austin. Technical Report UT-AI97-257.
Moriarty, D. E., & Miikkulainen, R. (1996). Efficient reinforcement learning through symbiotic evolution. Machine Learning, 22, 11–32.
Mukherjee, S., Osuna, E., & Girosi, F. (1997). Nonlinear prediction of chaotic time series using support vector machines. In J. Principe, L. Giles, N. Morgan, &
E. Wilson (Eds.), IEEE Workshop on Neural Networks for Signal Processing VII (p. 511). Piscataway, NJ: IEEE Press.
Müller, K., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1997). Predicting time series with support vector machines. In W. Gerstner, A. Germond, M. Hasler, & J.-O. Nicoud (Eds.), Proceedings of the Seventh International Conference on Artificial Neural Networks: ICANN-97 (pp. 999–1004). Berlin: Springer-Verlag.
Nolfi, S., Floreano, D., Miglino, O., & Mondada, F. (1994). How to evolve autonomous robots: Different approaches in evolutionary robotics. In R. A. Brooks & P. Maes (Eds.), Proceedings of the Fourth International Workshop on the Synthesis and Simulation of Living Systems (Artificial Life IV) (pp. 190–197). Cambridge, MA: MIT Press.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5), 1212–1228.
Penrose, R. (1955). A generalized inverse for matrices. Proceedings of the Cambridge Philosophical Society, 51, 406–413.
Pérez-Ortiz, J. A., Gers, F. A., Eck, D., & Schmidhuber, J. (2003). Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets. Neural Networks, 16(2), 241–250.
Potter, M. A., & De Jong, K. A. (1995). Evolving neural networks with collaborative species. In Proceedings of the 1995 Summer Computer Simulation Conference (pp. 340–345). Ottawa: Society of Computer Simulation.
Rechenberg, I. (1973). Evolutionsstrategie—Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Stuttgart, Germany: Fromman-Holzboog.
Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation network (Tech. Rep. CUED/F-INFENG/TR.1). Cambridge: Cambridge University, Engineering Department.
Rumelhart, D. E., & McClelland, J. L. (Eds.). (1986). Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.
Salomon, J., King, S., & Osborne, M. (2002). Framewise phone classification using support vector machines. In Proceedings of the International Conference on Spoken Language Processing. Denver.
Schmidhuber, J. (1990). Dynamische neuronale Netze und das fundamentale raumzeitliche Lernproblem. Unpublished doctoral dissertation, Technische Universität München.
Schmidhuber, J. (1992). A fixed size storage O(n³) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2), 243–248.
Schmidhuber, J., Gagliolo, M., Wierstra, D., & Gomez, F. (2006). Evolino for recurrent support vector machines. In Proceedings of ESANN'06. Evere, Belgium: d-side.
Schmidhuber, J., Gers, F., & Eck, D. (2002). Learning nonregular languages: A comparison of simple recurrent networks and LSTM. Neural Computation, 14(9), 2039–2041.
Schmidhuber, J., Hochreiter, S., & Bengio, Y. (2001). Evaluating benchmark problems by random guessing. In S. C. Kremer & J. F. Kolen (Eds.), A field guide to dynamical recurrent neural networks. Piscataway, NJ: IEEE Press.
Schmidhuber, J., Wierstra, D., & Gomez, F. J. (2005). Evolino: Hybrid neuroevolution/optimal linear search for sequence prediction. In Proceedings of
the 19th International Joint Conference on Artificial Intelligence (IJCAI). San Francisco: Morgan Kaufmann.
Schwefel, H. P. (1977). Numerische Optimierung von Computer-Modellen. Basel: Birkhäuser.
Schwefel, H. P. (1995). Evolution and optimum seeking. New York: Wiley Interscience.
Shimodaira, H., Noma, K.-I., Nakai, M., & Sagayama, S. (2002). Dynamic time-alignment kernel in support vector machine. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Siegelmann, H. T., & Sontag, E. D. (1991). Turing computability with neural nets. Applied Mathematics Letters, 4(6), 77–80.
Sims, K. (1994). Evolving virtual creatures. In A. Glassner (Ed.), Proceedings of SIGGRAPH '94 (Orlando, Florida, July 1994) (pp. 15–22). New York: ACM Press.
Suykens, J., & Vandewalle, J. (2000). Recurrent least squares support vector machines. IEEE Transactions on Circuits and Systems-I, 47(7), 1109–1114.
van der Zant, T., Bečanović, V., Ishii, K., Kobialka, H.-U., & Plöger, P. G. (2004). Finding good echo state networks to control an underwater robot using evolutionary computations. In Proceedings of the 5th IFAC Symposium on Intelligent Autonomous Vehicles (IAV04). New York: Elsevier.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer.
Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University.
Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339–365.
Wierstra, D., Gomez, F. J., & Schmidhuber, J. (2005). Modeling non-linear dynamical systems with Evolino. In Proc. GECCO 2005. New York: ACM Press.
Williams, R. J. (1989). Complexity of exact gradient computation algorithms for recurrent neural networks (Tech. Rep. NU-CCS-89-27). Boston: Northeastern University, College of Computer Science.
Yamauchi, B. M., & Beer, R. D. (1994). Sequential behavior and learning in evolved dynamical neural networks. Adaptive Behavior, 2(3), 219–246.
Yao, X. (1993). A review of evolutionary artificial neural networks. International Journal of Intelligent Systems, 8, 539–567.
Received December 2, 2005; accepted May 15, 2006.
LETTER
Communicated by Erkki Oja
A Generalized Divergence Measure for Nonnegative Matrix Factorization

Raul Kompass
[email protected]
FU Berlin, Institut für Mathematik und Informatik, 14152 Berlin, Germany
This letter presents a general parametric divergence measure. The metric includes as special cases quadratic error and Kullback-Leibler divergence. A parametric generalization of the two different multiplicative update rules for nonnegative matrix factorization by Lee and Seung (2001) is shown to lead to locally optimal solutions of the nonnegative matrix factorization problem with this new cost function. Numeric simulations demonstrate that the new update rule may improve the quadratic distance convergence speed. A proof of convergence is given that, as in Lee and Seung, uses an auxiliary function known from the expectation-maximization theoretical framework.
1 Introduction Nonnegative matrix factorization (NMF) is a technique of representational learning that consists in finding an approximation X ≈ MY of a matrix X = (xij ), whose elements are nonnegative, by a product of two matrices M = (mik ) and Y = (ykj ), which also are under the constraint of having nonnegative elements. The columns of X may be interpreted as input data, the columns of M as components, and the columns of Y as representations. Because of nonnegativity, there is no known general exact solution for the approximation problem, unlike PCA in the quadratic distance case. The nonnegativity constraint, which for itself is biologically plausible because neural activity has no negative values, leads to an important property of the resulting representations. They tend to be parts based because the positive components may only be scaled and added to approximate the input. This is a main argument for the biological significance of NMF. Lee and Seung (2001) presented two solutions for the NMF approximation problem: two pairs of iteration formulas for Y and M that decrease either the quadratic distance,
$$\|X - MY\|^2 = \sum_{ij} \big( x_{ij} - (MY)_{ij} \big)^2, \qquad (1.1)$$
Neural Computation 19, 780–791 (2007)
© 2007 Massachusetts Institute of Technology
A Generalized Divergence Measure for Nonnegative Matrix Factorization 781
or the Kullback-Leibler divergence,

$$D(X \,\|\, MY) = \sum_{ij} \left[ x_{ij} \log \frac{x_{ij}}{(MY)_{ij}} - x_{ij} + (MY)_{ij} \right]. \qquad (1.2)$$
Their formulas, which are derived with a gradient-descent approach, have the advantage of rapid convergence. This is achieved by diagonal rescaling. The step-size parameter η, which in gradient descent is a rather small constant, is element-wise set to a variable and possibly large value with the new technique. As a result, the addition and subtraction terms of gradient descent occur as numerator and denominator in a multiplicative update formula, and the parameter η is gone. Lee and Seung (2001) proved convergence of the new formulas with an auxiliary function following the same strategy as in the expectation-maximization (EM) theoretical framework (Dempster, Laird, & Rubin, 1977). For the case of quadratic distance, the NMF iteration formulas given by Lee and Seung (2001) in our notation are as follows:

$$y_{a\mu} \leftarrow y_{a\mu}\, \frac{(M^T X)_{a\mu}}{(M^T M Y)_{a\mu}}, \qquad m_{ia} \leftarrow m_{ia}\, \frac{(X Y^T)_{ia}}{(M Y Y^T)_{ia}}. \qquad (1.3)$$
For the minimization of the Kullback-Leibler divergence D(X‖MY), these equations take a somewhat different form:

$$y_{a\mu} \leftarrow y_{a\mu}\, \frac{\sum_i m_{ia}\, x_{i\mu}/(MY)_{i\mu}}{\sum_k m_{ka}}, \qquad m_{ia} \leftarrow m_{ia}\, \frac{\sum_\mu y_{a\mu}\, x_{i\mu}/(MY)_{i\mu}}{\sum_\nu y_{a\nu}}. \qquad (1.4)$$
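For concreteness, the two Lee-Seung update pairs can be written in a few lines of NumPy (a sketch of ours, not reference code; the small `eps` guards against division by zero are an added numerical safeguard, not part of equations 1.3 and 1.4):

```python
import numpy as np

def nmf_quadratic_step(X, M, Y, eps=1e-12):
    """One multiplicative update pair for ||X - MY||^2 (eq. 1.3)."""
    Y = Y * (M.T @ X) / (M.T @ M @ Y + eps)
    M = M * (X @ Y.T) / (M @ Y @ Y.T + eps)
    return M, Y

def nmf_kl_step(X, M, Y, eps=1e-12):
    """One multiplicative update pair for D(X || MY) (eq. 1.4)."""
    Y = Y * (M.T @ (X / (M @ Y + eps))) / (M.sum(axis=0)[:, None] + eps)
    M = M * ((X / (M @ Y + eps)) @ Y.T) / (Y.sum(axis=1)[None, :] + eps)
    return M, Y
```

Iterating either pair from random nonnegative initial matrices monotonically decreases the corresponding cost, per Lee and Seung's convergence result.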
The action of the iteration formulas, 1.3 and 1.4, may be related to different forms of learning known from neural network theory. In the quadratic distance case, equation 1.3, the update rules perform a multiplicative activation and divisive error correction. The update of M is a multiplicative version of Karhunen and Oja's subspace rule for PCA (Oja, 1989). Instead of setting $Y = M^T X$ as in Oja (1989), a similar multiplication and error-correcting division is employed here in the update of Y. In the Kullback-Leibler divergence case, the updates may be understood as an additional action of a divisory form of preintegration lateral inhibition (Spratling & Johnson, 2002) in the multiplicative as well as in the error-correcting divisory part of the update. This leads to replacement of the terms X and MY by X./(MY) and (MY)./(MY) = 1 in the update formulas, where ./ denotes elementwise division. Through this strong form of inhibition, the denominators in equation 1.4 reduce to normalization terms. The error correction does not operate anymore. In biological networks, both forms of learning may act concurrently, and this raises the question of what happens if we reduce the preintegration lateral
782
R. Kompass
inhibition in equation 1.4 so that the resulting denominators still perform some error correction. In the most fortunate scenario, we might obtain even faster convergence than with the NMF formulas, equations 1.3 and 1.4. In the worst case, we might experience a loss of convergence of the iteration. This work gives answers to this question. We present an intermediate version of the NMF update with restricted preintegration lateral inhibition. We also present a general parametric divergence measure that contains the quadratic distance and Kullback-Leibler divergence as special cases. We then show that the intermediate NMF update leads to no increase of the new divergence with the corresponding parameter. Finally, we give numerical examples showing that the new update iterates well and may even lead to faster approximation than the original formulas. The mathematical proofs presented in the appendix use the same approach known from EM theory (Dempster et al., 1977) as those in Lee and Seung (2001).

2 Generalization of the NMF Iteration Formulas

Equations 1.3 and 1.4 differ only in the roles of X and MY within the equations. The term $x_{i\mu}$ in equation 1.3 is substituted by $x_{i\mu}/(MY)_{i\mu}$ in equation 1.4. Likewise, the term $(MY)_{i\mu}$ in equation 1.3 is substituted by 1, which is the same as $(MY)_{i\mu}/(MY)_{i\mu}$. We associate the division by $(MY)_{i\mu}$ with preintegration lateral inhibition (Spratling & Johnson, 2002). We diminish the inhibition by dividing instead by $(MY)_{i\mu}^{1-\alpha}$, where α ∈ [0, 1]. In this way, we obtain the following two generalized parametric NMF update equations:

$$y_{a\mu} \leftarrow y_{a\mu}\, \frac{\sum_i m_{ia}\, x_{i\mu}\, (MY)_{i\mu}^{\alpha-1}}{\sum_k m_{ka}\, (MY)_{k\mu}^{\alpha}}, \qquad m_{ia} \leftarrow m_{ia}\, \frac{\sum_\mu y_{a\mu}\, x_{i\mu}\, (MY)_{i\mu}^{\alpha-1}}{\sum_\nu y_{a\nu}\, (MY)_{i\nu}^{\alpha}}. \qquad (2.1)$$
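A minimal NumPy sketch of one generalized step, assuming equation 2.1 (our code, not the author's; `eps` is an added guard against division by zero). Setting `alpha=1` reproduces the quadratic rule 1.3 and `alpha=0` the Kullback-Leibler rule 1.4:

```python
import numpy as np

def gnmf_step(X, M, Y, alpha, eps=1e-12):
    """One generalized NMF update pair, eq. 2.1."""
    MY = M @ Y + eps
    # Y update: numerator sums m_ia * x_imu * MY^(alpha-1) over i,
    # denominator sums m_ka * MY^alpha over k.
    Y = Y * (M.T @ (X * MY ** (alpha - 1))) / (M.T @ MY ** alpha + eps)
    MY = M @ Y + eps
    # M update: analogous sums over mu and nu.
    M = M * ((X * MY ** (alpha - 1)) @ Y.T) / (MY ** alpha @ Y.T + eps)
    return M, Y
```

One quick check: at `alpha=1` the Y update collapses to `Y * (M.T @ X) / (M.T @ M @ Y)`, which is exactly equation 1.3.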
The new parameter α may be set so that either of the two cost functions for NMF approximation is locally minimized. The quadratic distance case in equation 2.1 corresponds to α = 1. The Kullback-Leibler divergence case corresponds to α = 0. With these generalized NMF (GNMF) formulas, the interesting question now is what happens if we use intermediate values of α ∈ (0, 1) for iteration. We will show that the iteration converges as well and that the resulting Y and M interpolate between the solutions of equations 1.3 and 1.4.

3 The Divergence Measure Dα

Another question that naturally arises at this point is whether there is a divergence measure Dα(X, MY) that interpolates between ‖X − MY‖² and
Figure 1: Generalized distance measure Dα. The panels plot Dα(1, b) over b and α. α = 1 corresponds to quadratic distance; α = 0 corresponds to Kullback-Leibler divergence.
D(X‖MY) so that the GNMF for a certain value of α ∈ (0, 1) leads to a decrease of Dα(X, MY). The search for an interpolating divergence measure may at first be restricted to the case of input dimension n = 1 and number of inputs m = 1. The divergence of higher-dimensional input matrices may then be obtained by summing the divergences of the corresponding elements, as in equations 1.1 and 1.2. We found that such a desired divergence exists. It has the following form:

$$D_\alpha(a, b) := \begin{cases} \dfrac{a\,(a^{\alpha} - b^{\alpha})}{\alpha} + b^{\alpha}(b - a) & : \alpha \in (0, 1] \\[4pt] a\,(\log a - \log b) + (b - a) & : \alpha = 0 \end{cases}. \qquad (3.1)$$

Obviously $D_1(a, b) = (a - b)^2$ and $D_0(a, b) = D(a\|b)$. $D_\alpha(a, b)$ is continuous in α because $\lim_{\alpha \to 0} \frac{a^{\alpha} - b^{\alpha}}{\alpha} = \log a - \log b$. $D_\alpha(a, b)$ is a divergence measure because $D_\alpha(a, a) = 0$ and because, with

$$\frac{\partial}{\partial b} D_\alpha(a, b) = -(\alpha + 1)\left(\frac{a}{b^{1-\alpha}} - b^{\alpha}\right), \qquad (3.2)$$
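For illustration, the elementwise divergence of equation 3.1 is easy to implement and check numerically (a sketch of ours, not reference code):

```python
import numpy as np

def d_alpha(a, b, alpha):
    """Elementwise generalized divergence D_alpha(a, b), eq. 3.1, for a, b > 0."""
    if alpha == 0.0:
        # Kullback-Leibler limit.
        return a * (np.log(a) - np.log(b)) + (b - a)
    return a * (a ** alpha - b ** alpha) / alpha + b ** alpha * (b - a)
```

At `alpha=1` this reduces to the squared difference `(a - b)**2`, and as `alpha` approaches 0 it converges to the Kullback-Leibler term, matching the two special cases stated in the text.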
$D_\alpha(a, b)$ increases as b moves farther away from a, for a, b > 0 and α > −1. Figure 1 exhibits plots of the distance measure $D_\alpha(a, b)$ for certain values of α. The Kullback-Leibler divergence deviates from quadratic distance by its asymmetry. With constant absolute difference |a − b| of a and b, the case b < a results in a higher divergence than the opposite case, b > a. In the new divergence measure Dα, the degree of asymmetry is adjustable by the parameter α. Values of α below 0 lead to an even stronger asymmetry than
in Kullback-Leibler divergence. For α > 1, the asymmetry has the opposite direction. We define

$$D_\alpha(X, MY) := \sum_{ij} D_\alpha\big(x_{ij}, (MY)_{ij}\big). \qquad (3.3)$$
For the gradients of this measure with respect to M and Y, we obtain

$$\frac{\partial}{\partial y_{a\mu}} D_\alpha(X, MY) = -(1+\alpha)\left[\sum_i m_{ia}\,\frac{x_{i\mu}}{(MY)_{i\mu}^{1-\alpha}} - \sum_k m_{ka}\,(MY)_{k\mu}^{\alpha}\right] \qquad (3.4)$$

$$\frac{\partial}{\partial m_{ia}} D_\alpha(X, MY) = -(1+\alpha)\left[\sum_\mu \frac{x_{i\mu}}{(MY)_{i\mu}^{1-\alpha}}\, y_{a\mu} - \sum_\nu (MY)_{i\nu}^{\alpha}\, y_{a\nu}\right]. \qquad (3.5)$$
Ordinary gradient descent might be applied with these expressions. This, however, does not ensure that the elements of Y and M remain positive. In the same way as in Lee and Seung (2001), in the simple gradient-descent additive update rule,

$$y_{a\mu} \leftarrow y_{a\mu} + \eta_{a\mu}\left[\sum_i m_{ia}\,\frac{x_{i\mu}}{(MY)_{i\mu}^{1-\alpha}} - \sum_k m_{ka}\,(MY)_{k\mu}^{\alpha}\right], \qquad (3.6)$$

we therefore set

$$\eta_{a\mu} = \frac{y_{a\mu}}{\sum_k m_{ka}\,(MY)_{k\mu}^{\alpha}}, \qquad (3.7)$$
which leads to the multiplicative update rule for Y in equation 2.1. In a similar manner, the multiplicative update for M is obtained.

Theorem 1. For α ∈ [0, 1], the generalized divergence Dα(X, MY) is nonincreasing under the update rules, equation 2.1. The generalized divergence is invariant under these updates if and only if M and Y are at a stationary point of the divergence.

The proof is given in the appendix.

4 Numerical Results

We performed numerical simulations to test how iteration according to the new update, equation 2.1, proceeds and how the results interpolate between the outcomes of minimization of quadratic distance, equation 1.3, and of
Figure 2: Approximation of GNMF for different values of α. Quadratic versus Kullback-Leibler distance. (Left) USPS data (digits), dim(Y) = 20. (Right) CBCL data (faces), dim(Y) = 49.
Kullback-Leibler divergence, equation 1.4. We investigated the iteration with α varying in 10 steps between 0 and 1 with two data sets: the U.S. Postal Service (USPS) digit data (LeCun et al., 1989) and the CBCL face data, which were also used in the first publication on NMF (MIT Center for Biological and Computational Learning, 2000). The USPS data are given as 16 × 16 = 256-dimensional input, and the CBCL data are given as 19 × 19 pixels, which results in an input dimension of 361. The GNMF of the USPS data was computed with a Y-dimension of r = 20, and the CBCL data were factored with r = 49, as in Lee and Seung (1999). The pixel brightness intensities were linearly scaled to cover the range [0, 1]. The two data sets have different characteristics. The overall pixel brightness intensity of the face data is unimodal, with the mode at about 0.35; for the digit data, it is bimodal because most elements are nearly or exactly zero (black pixels) and there is a mode close to 1 (white). In order to limit the computational cost, we took only the first 1000 samples of either data set. Figure 2 shows the final quadratic distance and Kullback-Leibler divergence of the approximations MY to the input X for the different values of α after 64, 128, 256, or 512 iterations. The distances in the plot (see equations 1.1 and 1.2) are scaled by dividing by the number of elements in X and averaged across 20 runs of GNMF with random initial values of Y and M. Confidence intervals (95%) of means that are not corrected for multiple comparisons are also plotted. The numerical results demonstrate that the generalized update formulas iterate well and lead to meaningful results for intermediate values of α. With many iteration steps (512), there is a gradual transition between quadratic distance and Kullback-Leibler divergence approximation. Values of α close to 0 and 1 lead to considerable improvement of one measure without sacrificing the other.
In an NMF approximation task, where the cost function is a mixture of both divergence measures, the new formulas seem to be clearly
superior to the old ones. We tested this for a very large number of iterations (5000) for α ∈ {0, 0.5, 1} for both data sets. Table 1 displays the results. An intermediate value of α = 1/2 was found to lead to costs of approximately 1.5% to 6% above the absolute minimum for both divergence measures, whereas the standard NMF updates, which correspond to α = 0 or α = 1, optimize one measure but lead to an 11% to 23% increase of the other. With few (128 or fewer) iteration steps, we observe that the best approximation in the quadratic sense is achieved for α in the range [0.3..0.8]. Initially for such parameter values, the iteration seems to converge faster than with α = 1. This could result from the simultaneous action of both forms of error correction discussed in section 1. For α outside the interval [0, 1], the iteration with the above data did not converge.

5 Discussion

We extended the NMF iteration formulas given in Lee and Seung (2001), which were designed to minimize the quadratic distance (see equation 1.1) or Kullback-Leibler distance (see equation 1.2), to a continuous family of iterations with parameter α that locally minimize a new general cost function, equations 3.1 and 3.3. Numerical examples indicate that the new iteration yields good approximation in terms of both original cost functions when the parameter α is set to intermediate values (α ∈ (0.4, 0.8)).
Spratling and Johnson (2002) used a rectified subtraction, which we did not explore in conjunction with the multiplicative NMF update. In general, we believe that the combination of the different forms of learning and error correction deserves further investigation. With our extension, the advantage of simplicity of NMF is preserved. The theoretical treatment of convergence (see the appendix) has become even simpler by being reduced to only one parametric case. The parameter α, a new degree of freedom, determines the asymmetry of the cost function. An asymmetric cost function, like Kullback-Leibler divergence, for example, in pattern recognition, more strongly penalizes misses in the reproduction of input data and better tolerates overshootings of these reproductions. It seems plausible within the context of representation of inputs for further processing in biological vision because missing something in a natural environment may be much more disadvantageous than an overamplification of represented activity. In general, it is not known how asymmetric the costs
Table 1: Cost Functions D_1, D_{1/2}, and D_0 for USPS and CBCL Data after 5000 GNMF Updates with α ∈ {0, 0.5, 1}.

                                  |           USPS              |           CBCL
                                  | α = 0   α = 0.5   α = 1     | α = 0     α = 0.5   α = 1
  Quadratic distance D_1          | 0.0462  0.0421    0.0415    | 0.00261   0.00222   0.00215
    D/D_min − 1                   | 11.4%   1.5%      0         | 21.4%     3.5%      0
  D_{1/2}                         | 0.0556  0.0530    0.0545    | 0.00479   0.00448   0.00473
    D/D_min − 1                   | 4.8%    0         2.8%      | 7.0%      0         5.7%
  Kullback-Leibler divergence D_0 | 0.0726  0.0743    0.0814    | 0.0111    0.0117    0.0136
    D/D_min − 1                   | 0       2.4%      12.2%     | 0         5.1%      22.8%
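The divergence family compared in Table 1 can be evaluated directly. The sketch below uses one common normalization of the α-parameterized divergence (equation 3.1 itself is not reproduced in this excerpt; up to an overall factor 1 + α this form agrees with the cost F(Y) of equation A.3), with the α → 1 and α → 0 limits giving half the quadratic distance and the generalized Kullback-Leibler divergence:

```python
import numpy as np

def d_alpha(X, Y, alpha):
    """Divergence D_alpha between positive arrays (illustrative sketch)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    if alpha == 0:                       # Kullback-Leibler limit
        return np.sum(X * np.log(X / Y) - X + Y)
    if alpha == 1:                       # half the quadratic distance
        return 0.5 * np.sum((X - Y) ** 2)
    return np.sum(X * (X ** alpha - Y ** alpha) / alpha
                  - (X ** (1 + alpha) - Y ** (1 + alpha)) / (1 + alpha))

X, Y = np.array([1.0, 2.0]), np.array([1.5, 1.5])
print(d_alpha(X, Y, 1.0))  # 0.25, i.e. half the squared distance
print(d_alpha(X, X, 0.5))  # 0.0 for a perfect reconstruction
```

The relative excesses D/D_min − 1 in Table 1 are obtained by evaluating all three measures on the factorization returned by each update rule.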
A Generalized Divergence Measure for Nonnegative Matrix Factorization 787
788
R. Kompass
are in a given task. Therefore, the new formulas might help adjust the cost function to an optimal shape when NMF is applied to pattern recognition. Investigation of the dependence of recognition performance on α certainly depends on the methods used for subsequent processing and classification and is therefore beyond the scope of this work. We hope that future research will be dedicated to this problem and to the problem of finding convergent NMF update rules for parameters α outside the interval (0, 1).

Appendix: Convergence Proof of Theorem 1

We closely follow Lee and Seung (2001). We show that for an arbitrary fixed α ∈ (0, 1), the cost function F(Y, M) does not increase in any iteration. First, we prove this for the update of Y. For that purpose, as in the expectation-maximization algorithm (Dempster et al., 1977), we use an auxiliary function.

Definition 1. G(Y, Ỹ) is an auxiliary function for F(Y) if for every Y and Ỹ,

  G(Y, Y) = F(Y)  and  G(Y, Ỹ) ≥ F(Y)   (A.1)

hold.

Lemma 1. With an auxiliary function G(Y, Ỹ), the cost function F(Y) is nonincreasing under the update Y_new = arg min_Y G(Y, Y_old).

Proof.

  F(Y_new) ≤ G(Y_new, Y_old) ≤ G(Y_old, Y_old) = F(Y_old).   (A.2)

The cost function may be written in the following form:

  F(Y) = (1/α) Σ_{ij} x_{ij}^{1+α} − ((1+α)/α) Σ_{ij} x_{ij} (MY)_{ij}^α + Σ_{ij} (MY)_{ij}^{1+α}.   (A.3)

Lemma 2. The EM update rule applied with the function

  G(Y, Ỹ) = (1/α) Σ_{ij} x_{ij}^{1+α}
          − ((1+α)/α) Σ_{ijk} (m_{ik} ỹ_{kj} / (MỸ)_{ij}) x_{ij} ((MỸ)_{ij} m_{ik} y_{kj} / (m_{ik} ỹ_{kj}))^α
          + Σ_{ijk} (y_{kj}^{1+α} / ỹ_{kj}^α) m_{ik} (MỸ)_{ij}^α   (A.4)

leads to the GNMF iteration update in Y given by equation 2.1.

Proof. We have

  ∂/∂y_{aµ} [ ((1+α)/α) Σ_{ijk} (m_{ik} ỹ_{kj} / (MỸ)_{ij}) x_{ij} ((MỸ)_{ij} m_{ik} y_{kj} / (m_{ik} ỹ_{kj}))^α ]
    = (1+α) (ỹ_{aµ} / y_{aµ})^{1−α} Σ_i x_{iµ} (MỸ)_{iµ}^{α−1} m_{ia}   (A.5)

and

  ∂/∂y_{aµ} [ Σ_{ijk} (y_{kj}^{1+α} / ỹ_{kj}^α) m_{ik} (MỸ)_{ij}^α ]
    = (1+α) (y_{aµ}^α / ỹ_{aµ}^α) Σ_k m_{ka} (MỸ)_{kµ}^α.   (A.6)

G(Y, Ỹ) takes its minimum in Y at a point where ∂G(Y, Ỹ)/∂Y = 0. This corresponds to equality of the right sides of equations A.5 and A.6, which simplifies to

  y_{aµ} Σ_k m_{ka} (MỸ)_{kµ}^α = ỹ_{aµ} Σ_i m_{ia} x_{iµ} (MỸ)_{iµ}^{α−1},   (A.7)

which is equivalent to the GNMF iteration in Y given in equation 2.1.

Lemma 3. G(Y, Ỹ) is an auxiliary function for F(Y).

Proof. It is easy to verify that G(Y, Y) = F(Y). It remains to show that G(Y, Ỹ) ≥ F(Y) for all Y and Ỹ. We prove the inequality separately for the summands in G(Y, Ỹ). The first summands are identical.

Because Σ_k m_{ik} ỹ_{kj} / (MỸ)_{ij} = 1, we may interpret Σ_k m_{ik} y_{kj} as a convex combination of the points z_{ik} = (MỸ)_{ij} y_{kj} / ỹ_{kj} with the coefficients λ_{ik} = m_{ik} ỹ_{kj} / (MỸ)_{ij}. For α ∈ (0, 1) the function x ↦ x^α is concave. Therefore, we have

  Σ_k (m_{ik} ỹ_{kj} / (MỸ)_{ij}) ((MỸ)_{ij} m_{ik} y_{kj} / (m_{ik} ỹ_{kj}))^α ≤ (Σ_k m_{ik} y_{kj})^α,   (A.8)

and the inequality of the second summands follows from this.

To prove the inequality for the third summands, we have to show that

  Σ_k m_{ik} (y_{kj}^{1+α} / ỹ_{kj}^α) (MỸ)_{ij}^α ≥ (Σ_k m_{ik} y_{kj}) (Σ_k m_{ik} y_{kj})^α.   (A.9)

Because M and Y may take arbitrary positive values, it is sufficient and necessary to prove the inequality for one fixed index combination (i, j) only. Hence, dropping the fixed indices i and j, we have to show

  Σ_k m_k (y_k^{1+α} / ỹ_k^α) (Σ_k m_k ỹ_k)^α ≥ (Σ_k m_k y_k) (Σ_k m_k y_k)^α.   (A.10)

With the substitutions z_i = ỹ_i / y_i and n_i = m_i y_i, we get

  Σ_k n_k z_k^{−α} (Σ_k n_k z_k)^α ≥ (Σ_k n_k) (Σ_k n_k)^α.   (A.11)

This inequality follows from the convexity of the function x ↦ x^{−α}, because with the coefficients n_k / Σ_l n_l,

  (Σ_k n_k z_k^{−α}) / (Σ_k n_k) ≥ ((Σ_k n_k z_k) / (Σ_k n_k))^{−α}   (A.12)

holds.

To prove that the update of M does not increase D_α, one may use the auxiliary function

  G(M, M̃) = (1/α) Σ_{ij} x_{ij}^{1+α}
           − ((1+α)/α) Σ_{ija} (m̃_{ia} y_{aj} / (M̃Y)_{ij}) x_{ij} ((M̃Y)_{ij} m_{ia} y_{aj} / (m̃_{ia} y_{aj}))^α
           + Σ_{ija} (m_{ia}^{1+α} / m̃_{ia}^α) y_{aj} (M̃Y)_{ij}^α   (A.13)

and proceed in an analogous way.

Acknowledgments

We thank Raul Rojas for helpful discussions and for comments on a draft of this article. This work was supported by DFG grant KO 2265/2-1.
References

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1–38.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
Lee, D. D., & Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 556–562). Cambridge, MA: MIT Press.
MIT Center for Biological and Computational Learning. (2000). CBCL face database #1. Available online at http://cbcl.mit.edu/cbcl/software-datasets/facedata2.html.
Oja, E. (1989). Neural networks, principal components and subspaces. International Journal of Neural Systems, 1(1), 61–68.
Spratling, M. W., & Johnson, M. H. (2002). Pre-integration lateral inhibition enhances unsupervised learning. Neural Computation, 14(9), 2157–2179.
Received January 25, 2006; accepted June 17, 2006.
LETTER
Communicated by Ralf Herbrich
Support Vector Ordinal Regression

Wei Chu
[email protected] Center for Computational Learning Systems, Columbia University, New York, NY 10115, U.S.A.
S. Sathiya Keerthi
[email protected] Yahoo! Research, Media Studios North, Burbank, CA 91504, U.S.A.
In this letter, we propose two new support vector approaches for ordinal regression, which optimize multiple thresholds to define parallel discriminant hyperplanes for the ordinal scales. Both approaches guarantee that the thresholds are properly ordered at the optimal solution. The size of these optimization problems is linear in the number of training samples. The sequential minimal optimization algorithm is adapted for the resulting optimization problems; it is extremely easy to implement and scales efficiently, with a cost that is roughly a quadratic function of the number of examples. The results of numerical experiments on some benchmark and real-world data sets, including applications of ordinal regression to information retrieval, verify the usefulness of these approaches.

1 Introduction

We consider the supervised learning problem of predicting variables of ordinal scale, a setting that bridges metric regression and classification and is referred to as ranking learning or ordinal regression. Ordinal regression arises frequently in social science and information retrieval, where human preferences play a major role. The training samples are labeled by ranks, which exhibit an ordering among the different categories. In contrast to metric regression problems, these ranks are of finite types, and the metric distances between the ranks are not defined. These ranks are also different from the labels of multiple classes in classification problems due to the existence of the ordering information. There are several approaches for tackling ordinal regression problems in the domain of machine learning. The naive idea is to transform the ordinal scales into numerical values and then solve the problem as a standard regression problem. Kramer, Widmer, Pfahringer, and DeGroeve (2001) investigated the use of a regression tree learner in this way. A problem with this approach is that there might be no principled way of devising an

Neural Computation 19, 792–815 (2007)
© 2007 Massachusetts Institute of Technology
Support Vector Ordinal Regression
793
appropriate mapping function since the true metric distances between the ordinal scales are unknown in most of the tasks. Another idea is to decompose the original ordinal regression problem into a set of binary classification tasks. Frank and Hall (2001) converted an ordinal regression problem into nested binary classification problems that encode the ordering of the original ranks and then organized the results of these binary classifiers in some ad hoc way for prediction. It is also possible to formulate the original problem as a large, augmented binary classification problem. Har-Peled, Roth, and Zimak (2002) proposed a constraint classification approach that provides a unified framework for solving ranking and multiclassification problems. Herbrich, Graepel, and Obermayer (2000) applied the principle of structural risk minimization (Vapnik, 1995) to ordinal regression, leading to a new distribution-independent learning algorithm based on a loss function between pairs of ranks. The main difficulty with these two algorithms (Har-Peled et al., 2002; Herbrich et al., 2000) is that the problem size of these formulations is a quadratic function of the training data size. As for sequential learning, Crammer and Singer (2002) proposed a perceptron-based online algorithm for rank prediction, known as the PRank algorithm. Shashua and Levin (2003) generalized the support vector formulation for ordinal regression by finding r − 1 thresholds that divide the real line into r consecutive intervals for the r ordered categories. However, there is a problem with their approach: the ordinal inequalities on the thresholds, b_1 ≤ b_2 ≤ ⋯ ≤ b_{r−1}, are not included in their formulation. This omission may result in disordered thresholds at the solution in some unfortunate cases (see section 4.1 for an example). It is important that a method be free of such failure cases. In this letter, we propose two new approaches for support vector ordinal regression.
The first takes only the adjacent ranks into account in determining the thresholds, exactly as Shashua and Levin (2003) proposed, but we introduce explicit constraints in the problem formulation that enforce the inequalities on the thresholds. The second approach is entirely new; it considers the training samples from all the ranks to determine each threshold. Interestingly, we show that in this second approach, the ordinal inequality constraints on the thresholds are automatically satisfied at the optimal solution, though there are no explicit constraints on these thresholds. For both approaches, the size of the optimization problems is linear in the number of training samples. We show that the popular SMO algorithm (Platt, 1999; Keerthi, Shevade, Bhattacharyya, & Murthy, 2001) for support vector machines can be easily adapted for the two approaches. The resulting algorithms scale efficiently; empirical analysis shows that the cost is roughly a quadratic function of the problem size. Using several benchmark data sets, we demonstrate that the generalization capabilities of the two approaches are much better than that of the naive approach of doing standard regression on the ordinal labels.
794
W. Chu and S. Keerthi
The letter is organized as follows. In section 2, we present the first approach with explicit inequality constraints on the thresholds, derive the optimality conditions for the dual problem, and adapt the sequential minimal optimization (SMO) algorithm for the solution. In section 3 we present the second approach with implicit constraints. In section 4 we do an empirical study to show the scaling properties of the two algorithms and their generalization performance. We conclude in section 5.

Throughout this letter, we use x to denote the input vector of the ordinal regression problem and φ(x) to denote the feature vector in a high-dimensional reproducing kernel Hilbert space (RKHS) related to x by transformation. All computations are done using the reproducing kernel function only, which is defined as

  K(x, x′) = φ(x) · φ(x′),   (1.1)

where · denotes the inner product in the RKHS. Without loss of generality, we consider an ordinal regression problem with r ordered categories and denote these categories as consecutive integers Y = {1, 2, …, r} to keep the known ordering information. In the jth category, where j ∈ Y, the number of training samples is denoted as n_j, and the ith training sample is denoted as x_i^j, where x_i^j ∈ R^d. The total number of training samples, Σ_{j=1}^{r} n_j, is denoted as n. The quantities b_j, j = 1, …, r − 1, denote the r − 1 thresholds.

2 Approach 1: Explicit Constraints on Thresholds

As a powerful computational tool for supervised learning, support vector machines (SVMs) map the input vectors into feature vectors in a high-dimensional RKHS (Vapnik, 1995; Schölkopf & Smola, 2002), where a linear machine is constructed by minimizing a regularized functional. For binary classification (a special case of ordinal regression with r = 2), SVMs find an optimal direction that maps the feature vectors into function values on the real line, and a single optimized threshold is used to divide the real line into two regions for the two classes, respectively. In the setting of ordinal regression, the support vector formulation could attempt to find an optimal mapping direction w and r − 1 thresholds, which define r − 1 parallel discriminant hyperplanes for the r ranks accordingly. For each threshold b_j, Shashua and Levin (2003) suggested considering the samples from the two adjacent categories, j and j + 1, for empirical errors (see Figure 1 for an illustration). More exactly, each sample in the jth category should have a function value that is less than the lower margin b_j − 1; otherwise, w · φ(x_i^j) − (b_j − 1) is the error (denoted as ξ_i^j). Similarly, each sample from the (j + 1)th category should have a function value that is greater than the upper margin b_j + 1; otherwise, (b_j + 1) − w · φ(x_i^{j+1}) is the error (denoted as ξ_i^{∗j+1}).¹

Figure 1: An illustration of the definition of slack variables ξ and ξ∗ for the thresholds. The samples from different ranks, represented as circles filled with different patterns, are mapped by w · φ(x) onto the axis of function values. Note that a sample from rank j + 1 could be counted twice for errors if it is sandwiched by b_{j+1} − 1 and b_j + 1, where b_{j+1} − 1 < b_j + 1, and the samples from ranks j + 2, j − 1, and others never contribute to the threshold b_j.

Shashua and Levin (2003) generalized the primal problem of SVMs to ordinal regression as follows:

  min_{w,b,ξ,ξ∗}  (1/2) w · w + C Σ_{j=1}^{r−1} ( Σ_{i=1}^{n_j} ξ_i^j + Σ_{i=1}^{n_{j+1}} ξ_i^{∗j+1} ),   (2.1)

subject to

  w · φ(x_i^j) − b_j ≤ −1 + ξ_i^j,  ξ_i^j ≥ 0,  for i = 1, …, n_j,
  w · φ(x_i^{j+1}) − b_j ≥ +1 − ξ_i^{∗j+1},  ξ_i^{∗j+1} ≥ 0,  for i = 1, …, n_{j+1},   (2.2)
where j runs over 1, …, r − 1 and C > 0. A problem with the above formulation is that the natural ordinal inequalities on the thresholds, that is, b_1 ≤ b_2 ≤ ⋯ ≤ b_{r−1}, cannot be guaranteed to hold at the solution. To tackle this problem, we explicitly include the
following constraints in equation 2.2:

  b_{j−1} ≤ b_j,  for j = 2, …, r − 1.   (2.3)

¹ The superscript ∗ in ξ_i^{∗j+1} denotes that the error is associated with a sample in the adjacent upper category of the jth threshold.
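To make the objective concrete: for fixed discriminant values f(x) = w · φ(x) and thresholds b, the slacks of equation 2.2 are simple hinge excesses around the margins b_j ∓ 1. The sketch below is our own illustration (names are not from the paper):

```python
import numpy as np

def ordinal_slacks(f_vals, ranks, b):
    """Slack variables of equation 2.2 for given discriminant values.

    For threshold j (1-based), rank-j samples pay xi = max(0, f - (b_j - 1))
    and rank-(j+1) samples pay xi* = max(0, (b_j + 1) - f).
    """
    f_vals, ranks = np.asarray(f_vals, float), np.asarray(ranks)
    xi, xi_star = [], []
    for j, bj in enumerate(b, start=1):
        xi.append(np.maximum(0.0, f_vals[ranks == j] - (bj - 1)))
        xi_star.append(np.maximum(0.0, (bj + 1) - f_vals[ranks == j + 1]))
    return xi, xi_star

f = np.array([-2.0, -0.2, 0.5, 2.0])
ranks = np.array([1, 1, 2, 2])
xi, xi_star = ordinal_slacks(f, ranks, b=[0.0])
print(xi[0], xi_star[0])  # nonzero slacks 0.8 (rank 1) and 0.5 (rank 2)
```

Only samples inside the margin zone of a threshold contribute to the penalty term of equation 2.1.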
2.1 Primal and Dual Problems. By introducing two auxiliary variables b_0 = −∞ and b_r = +∞, the modified primal problem in equations 2.1 to 2.3 can be equivalently written as follows,

  min_{w,b,ξ,ξ∗}  (1/2) w · w + C Σ_{j=1}^{r} Σ_{i=1}^{n_j} ( ξ_i^j + ξ_i^{∗j} ),   (2.4)

subject to

  w · φ(x_i^j) − b_j ≤ −1 + ξ_i^j,  ξ_i^j ≥ 0,  ∀i, j,
  w · φ(x_i^j) − b_{j−1} ≥ +1 − ξ_i^{∗j},  ξ_i^{∗j} ≥ 0,  ∀i, j,
  b_{j−1} ≤ b_j,  ∀j.   (2.5)
The dual problem can be derived by standard Lagrangian techniques. Let α_i^j ≥ 0, γ_i^j ≥ 0, α_i^{∗j} ≥ 0, γ_i^{∗j} ≥ 0, and µ_j ≥ 0 be the Lagrangian multipliers for the inequalities in equation 2.5. The Lagrangian for the primal problem is

  L_e = (1/2) w · w + C Σ_{j=1}^{r} Σ_{i=1}^{n_j} ( ξ_i^j + ξ_i^{∗j} )
      − Σ_{j=1}^{r} Σ_{i=1}^{n_j} α_i^j ( −1 + ξ_i^j − w · φ(x_i^j) + b_j )
      − Σ_{j=1}^{r} Σ_{i=1}^{n_j} α_i^{∗j} ( −1 + ξ_i^{∗j} + w · φ(x_i^j) − b_{j−1} )
      − Σ_{j=1}^{r} Σ_{i=1}^{n_j} γ_i^j ξ_i^j − Σ_{j=1}^{r} Σ_{i=1}^{n_j} γ_i^{∗j} ξ_i^{∗j} − Σ_{j=1}^{r} µ_j ( b_j − b_{j−1} ).   (2.6)

The Karush-Kuhn-Tucker (KKT) conditions for the primal problem require the following to hold:

  ∂L_e/∂w = w − Σ_{j=1}^{r} Σ_{i=1}^{n_j} ( α_i^{∗j} − α_i^j ) φ(x_i^j) = 0,   (2.7)
  ∂L_e/∂ξ_i^j = C − α_i^j − γ_i^j = 0,  ∀i, ∀j,   (2.8)
  ∂L_e/∂ξ_i^{∗j} = C − α_i^{∗j} − γ_i^{∗j} = 0,  ∀i, ∀j,   (2.9)
  ∂L_e/∂b_j = − Σ_{i=1}^{n_j} α_i^j − µ_j + Σ_{i=1}^{n_{j+1}} α_i^{∗j+1} + µ_{j+1} = 0,  ∀j.

Note that the dummy variables associated with b_0 and b_r, that is, µ_1, µ_r, the α_i^{∗1}'s, and the α_i^r's, are always zero. The conditions 2.8 and 2.9 give rise to the constraints 0 ≤ α_i^j ≤ C and 0 ≤ α_i^{∗j} ≤ C, respectively. Let us now apply Wolfe duality theory to the primal problem. By introducing the KKT conditions 2.7 to 2.9 into the Lagrangian, equation 2.6, and applying the kernel trick, equation 1.1, the dual problem becomes a maximization problem involving the Lagrangian multipliers α, α∗, and µ:

  max_{α,α∗,µ}  Σ_{j,i} ( α_i^j + α_i^{∗j} ) − (1/2) Σ_{j,i} Σ_{j′,i′} ( α_i^{∗j} − α_i^j )( α_{i′}^{∗j′} − α_{i′}^{j′} ) K( x_i^j, x_{i′}^{j′} ),   (2.10)

subject to

  0 ≤ α_i^j ≤ C,  ∀i, ∀j,
  0 ≤ α_i^{∗j+1} ≤ C,  ∀i, ∀j,
  Σ_{i=1}^{n_j} α_i^j + µ_j = Σ_{i=1}^{n_{j+1}} α_i^{∗j+1} + µ_{j+1},  ∀j,
  µ_j ≥ 0,  ∀j,   (2.11)

where j runs over 1, …, r − 1. Leaving the dummy variables out of account, the size of the optimization problem is 2n − n_1 − n_r (for α and α∗) plus r − 2 (for µ). The dual problem, equations 2.10 and 2.11, is a convex quadratic programming problem. Once α, α∗, and µ are obtained by solving this problem, w is obtained from equation 2.7. The determination of the b_j's is addressed in the next section. The discriminant function value for a new input vector x is

  f(x) = w · φ(x) = Σ_{j,i} ( α_i^{∗j} − α_i^j ) K( x_i^j, x ).   (2.12)

The predictive ordinal decision function is given by arg min_i { i : f(x) < b_i }.
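The decision rule above simply returns the first threshold that f(x) falls below. A small sketch (our own naming; appending b_r = +∞ catches values above all thresholds):

```python
import numpy as np

def predict_rank(f_x, b):
    """Predict the ordinal rank of a sample from its discriminant value.

    b holds the r-1 ordered thresholds; the rule is
    arg min_i { i : f(x) < b_i } with ranks numbered 1..r.
    """
    b = np.append(np.asarray(b, dtype=float), np.inf)  # b_r = +inf
    return int(np.argmax(f_x < b)) + 1  # first index with f(x) < b_i

b = [-1.0, 0.5, 2.0]           # r = 4 ordered categories
print(predict_rank(-2.0, b))   # below b_1 -> rank 1
print(predict_rank(1.0, b))    # between b_2 and b_3 -> rank 3
print(predict_rank(3.0, b))    # above all thresholds -> rank 4
```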
2.2 Optimality Conditions for the Dual. To derive proper stopping conditions for algorithms that solve the dual problem and also to determine the thresholds b_j, it is important to write down the optimality conditions for the dual. Though the resulting conditions derived below look a bit clumsy because of the notation, the ideas behind them are very much similar to those for the binary SVM classifier case, as the optimization problem is actually composed of r − 1 binary classifiers with a shared mapping direction w and the additional constraint 2.3. The Lagrangian for the dual can be written as follows:

  L_d = (1/2) Σ_{j,i} Σ_{j′,i′} ( α_i^{∗j} − α_i^j )( α_{i′}^{∗j′} − α_{i′}^{j′} ) K( x_i^j, x_{i′}^{j′} ) − Σ_{j,i} ( α_i^j + α_i^{∗j} )
      + Σ_{j=1}^{r−1} β_j ( Σ_{i=1}^{n_j} α_i^j − Σ_{i=1}^{n_{j+1}} α_i^{∗j+1} + µ_j − µ_{j+1} )
      − Σ_{j,i} ( η_i^j α_i^j + η_i^{∗j} α_i^{∗j} ) − Σ_{j,i} ( π_i^j ( C − α_i^j ) + π_i^{∗j} ( C − α_i^{∗j} ) ) − Σ_{j=2}^{r−1} λ_j µ_j,

where the Lagrangian multipliers η_i^j, η_i^{∗j}, π_i^j, π_i^{∗j}, and λ_j are nonnegative, while β_j can take any value. The KKT conditions associated with α and α∗ can be given as follows:

  ∂L_d/∂α_i^j = −f(x_i^j) − 1 − η_i^j + π_i^j + β_j = 0,
  η_i^j ≥ 0,  η_i^j α_i^j = 0,  π_i^j ≥ 0,  π_i^j ( C − α_i^j ) = 0,  for i = 1, …, n_j;
  ∂L_d/∂α_i^{∗j+1} = f(x_i^{j+1}) − 1 − η_i^{∗j+1} + π_i^{∗j+1} − β_j = 0,
  η_i^{∗j+1} ≥ 0,  η_i^{∗j+1} α_i^{∗j+1} = 0,  π_i^{∗j+1} ≥ 0,  π_i^{∗j+1} ( C − α_i^{∗j+1} ) = 0,  for i = 1, …, n_{j+1},   (2.13)

where f(x) is defined as in equation 2.12 and j runs over 1, …, r − 1, while the KKT conditions associated with the µ_j are

  β_j − β_{j−1} − λ_j = 0,  λ_j µ_j = 0,  λ_j ≥ 0,   (2.14)

where j = 2, …, r − 1. The conditions in equation 2.13 can be regrouped into the following six cases:

  Case 1: α_i^j = 0          implies  f(x_i^j) + 1 ≤ β_j.
  Case 2: 0 < α_i^j < C      implies  f(x_i^j) + 1 = β_j.
  Case 3: α_i^j = C          implies  f(x_i^j) + 1 ≥ β_j.
  Case 4: α_i^{∗j+1} = 0      implies  f(x_i^{j+1}) − 1 ≥ β_j.
  Case 5: 0 < α_i^{∗j+1} < C  implies  f(x_i^{j+1}) − 1 = β_j.
  Case 6: α_i^{∗j+1} = C      implies  f(x_i^{j+1}) − 1 ≤ β_j.

These cases yield, for each j, a lower and an upper bound on β_j:

  B̃_low^j = max( { f(x_i^j) + 1 : α_i^j < C } ∪ { f(x_i^{j+1}) − 1 : α_i^{∗j+1} > 0 } ),
  B̃_up^j  = min( { f(x_i^j) + 1 : α_i^j > 0 } ∪ { f(x_i^{j+1}) − 1 : α_i^{∗j+1} < C } ).   (2.15)

The conditions in equation 2.14 additionally require β_{j−1} ≤ β_j, with equality whenever µ_j > 0; the final bounds therefore merge adjacent stages:

  B_low^j = max( B̃_low^j, B_low^{j−1} )  and  B_up^j = min( B̃_up^j, B_up^{j−1} )  if µ_j > 0,
  B_low^j = B̃_low^j  and  B_up^j = B̃_up^j  otherwise.   (2.17)

It should be easy to see from the above sequence of steps that {β_j} and {µ_j} are optimal for the problem in equations 2.10 and 2.11 iff equation 2.16, that is, B_low^j ≤ B_up^j for every j, is satisfied. We introduce a tolerance parameter τ > 0, usually 0.001, to define approximate optimality conditions. The overall stopping condition becomes

  max { B_low^j − B_up^j : j = 1, …, r − 1 } ≤ τ.   (2.18)

From the conditions in equations 2.13 and 2.2, it is easy to see the close relationship between the b_j in the primal problem and the multipliers β_j. In particular, at the optimal solution, β_j and b_j are identical. Thus, b_j can be taken to be any value from the interval [B_low^j, B_up^j]. We can resolve any nonuniqueness by simply taking b_j = (B_low^j + B_up^j)/2. Note that the KKT conditions in equation 2.14, coming from the additional constraints in equation 2.3 that we introduced into Shashua and Levin's (2003) formulation, enforce B_low^{j−1} ≤ B_low^j and B_up^{j−1} ≤ B_up^j at the solution, which guarantees that the thresholds specified in these feasible regions will satisfy the inequality constraints b_{j−1} ≤ b_j; without the constraints in equation 2.3, the thresholds might be disordered at the solution. Here we essentially consider an optimization problem of multiple learning tasks whose individual optimality conditions are coupled together by a joint constraint. We believe this is a nontrivial extension that is useful in other applications. The SMO algorithm (Platt, 1999; Keerthi et al., 2001) can be adapted for the solution of equations 2.10 and 2.11. The details are given in the appendix.
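For a single threshold, the bounds implied by the six cases can be assembled as follows (a sketch with our own names; it assumes each bound set is nonempty and is not the paper's SMO implementation):

```python
import numpy as np

def threshold_interval(f_low, a_low, f_up, a_up, C):
    """Feasible interval [B_low, B_up] for one dual variable beta_j.

    f_low, a_low: discriminant values and multipliers alpha for rank-j
    samples; f_up, a_up: values and starred multipliers for rank-(j+1)
    samples. Lower bounds on beta_j come from samples with alpha < C
    (cases 1-2) and alpha* > 0 (cases 5-6); upper bounds from alpha > 0
    (cases 2-3) and alpha* < C (cases 4-5).
    """
    lows = np.concatenate([f_low[a_low < C] + 1, f_up[a_up > 0] - 1])
    ups = np.concatenate([f_low[a_low > 0] + 1, f_up[a_up < C] - 1])
    return lows.max(), ups.min()

f_low, a_low = np.array([-2.0, -1.5]), np.array([0.0, 0.5])
f_up, a_up = np.array([0.5, 1.0]), np.array([0.5, 0.0])
lo, up = threshold_interval(f_low, a_low, f_up, a_up, C=1.0)
print(lo, up)  # the threshold b_j can be taken as the midpoint
```

In this toy configuration the unbounded support vectors (0 < α < C) pin both bounds to the same value, so the interval collapses to a single point, exactly as cases 2 and 5 dictate.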
Figure 2: An illustration of the new definition of slack variables ξ and ξ∗ that imposes implicit constraints on the thresholds. All the samples are mapped by w · φ(x) onto the axis of function values. Note the term ξ_{3i}^{∗1} in this graph.

3 Approach 2: Implicit Constraints on Thresholds

In this section we present a new approach to support vector ordinal regression. Instead of considering only the empirical errors from the samples of adjacent categories to determine a threshold, we allow the samples in all the categories to contribute errors for each threshold. This kind of loss function was also suggested by Srebro, Rennie, and Jaakkola (2005) for collaborative filtering. A very nice property of this approach is that the ordinal inequalities on the thresholds are satisfied automatically at the optimal solution, in spite of the fact that such constraints on the thresholds are not explicitly included in the new formulation. We give a proof of this property in the following section.

Figure 2 explains the new definition of slack variables ξ and ξ∗. For a threshold b_j, the function values of all the samples from all the lower categories should be less than the lower margin b_j − 1; if that does not hold, then ξ_{ki}^j = w · φ(x_i^k) − (b_j − 1) is taken as the error associated with the sample x_i^k for b_j, where k ≤ j. Similarly, the function values of all the samples from the upper categories should be greater than the upper margin b_j + 1; otherwise, ξ_{ki}^{∗j} = (b_j + 1) − w · φ(x_i^k) is the error associated with the sample x_i^k for b_j, where k > j. Here, the subscript ki denotes that the slack variable is associated with the ith input sample in the kth category; the superscript j denotes that the slack variable is associated with the lower categories of b_j; and the superscript ∗j denotes that the slack variable is associated with the upper categories of b_j.
3.1 Primal Problem. By taking all the errors associated with all r − 1 thresholds into account, the primal problem can be defined as follows:

  min_{w,b,ξ,ξ∗}  (1/2) w · w + C Σ_{j=1}^{r−1} ( Σ_{k=1}^{j} Σ_{i=1}^{n_k} ξ_{ki}^j + Σ_{k=j+1}^{r} Σ_{i=1}^{n_k} ξ_{ki}^{∗j} ),   (3.1)

subject to

  w · φ(x_i^k) − b_j ≤ −1 + ξ_{ki}^j,  ξ_{ki}^j ≥ 0,  for k = 1, …, j and i = 1, …, n_k;
  w · φ(x_i^k) − b_j ≥ +1 − ξ_{ki}^{∗j},  ξ_{ki}^{∗j} ≥ 0,  for k = j + 1, …, r and i = 1, …, n_k,   (3.2)
where j runs over 1, …, r − 1. Note that there are r − 1 inequality constraints for each sample x_i^k (one for each threshold).

To prove the inequalities on the thresholds at the optimal solution, let us consider the situation where w is fixed and only the b_j are optimized. Note that the ξ_{ki}^j and ξ_{ki}^{∗j} are automatically determined once the b_j are given. To eliminate these variables, let us define, for 1 ≤ k ≤ r,

  I_k^low(b) := { i ∈ {1, …, n_k} : w · φ(x_i^k) − b ≥ −1 },
  I_k^up(b)  := { i ∈ {1, …, n_k} : w · φ(x_i^k) − b ≤ 1 }.

It is easy to see that b_j is optimal iff it minimizes the function

  e_j(b) = Σ_{k=1}^{j} Σ_{i ∈ I_k^low(b)} ( w · φ(x_i^k) − b + 1 ) + Σ_{k=j+1}^{r} Σ_{i ∈ I_k^up(b)} ( −w · φ(x_i^k) + b + 1 ).   (3.3)

Let B_j denote the set of all minimizers of e_j(b). By convexity, B_j is a closed interval. Given two intervals B_1 = [c_1, d_1] and B_2 = [c_2, d_2], we say B_1 ≤ B_2 if c_1 ≤ c_2 and d_1 ≤ d_2.

Lemma 1. B_1 ≤ B_2 ≤ ⋯ ≤ B_{r−1}.
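The lemma can be checked numerically with the cost e_j(b) of equation 3.3; the sketch below (our own illustration, not code from the paper) scans a grid of b values for a small synthetic problem and confirms that the minimizers are ordered:

```python
import numpy as np

def e_j(b, f_by_rank, j):
    """Cost e_j(b) of equation 3.3 for threshold j (1-based).

    f_by_rank[k-1] holds the values w.phi(x) of the rank-k samples.
    Samples of rank <= j pay f - b + 1 once f >= b - 1; samples of
    rank > j pay b - f + 1 once f <= b + 1 (hinge form of I_low/I_up).
    """
    lo = np.concatenate(f_by_rank[:j])
    hi = np.concatenate(f_by_rank[j:])
    return np.maximum(lo - b + 1, 0).sum() + np.maximum(b - hi + 1, 0).sum()

# Minimizers of e_1 and e_2 are ordered, as the lemma asserts.
f_by_rank = [np.array([-2.0, -1.5]), np.array([0.0, 0.3]), np.array([2.0, 2.5])]
grid = np.linspace(-4, 4, 801)
b1 = grid[np.argmin([e_j(b, f_by_rank, 1) for b in grid])]
b2 = grid[np.argmin([e_j(b, f_by_rank, 2) for b in grid])]
print(b1 <= b2)  # True
```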
Proof. The "right-side derivative" of e_j with respect to b is

  g_j(b) = − Σ_{k=1}^{j} |I_k^low(b)| + Σ_{k=j+1}^{r} |I_k^up(b)|.   (3.4)

Take any one j, and consider B_j = [c_j, d_j] and B_{j+1} = [c_{j+1}, d_{j+1}]. Suppose c_j > c_{j+1}. Define b_j = c_j and b_{j+1} = c_{j+1}. Since b_{j+1} is strictly to the left of the interval B_j that minimizes e_j, we have g_j(b_{j+1}) < 0. Since b_{j+1} is a minimizer of e_{j+1}, we also have g_{j+1}(b_{j+1}) ≥ 0. Thus, we have g_{j+1}(b_{j+1}) − g_j(b_{j+1}) > 0; also, by equation 3.4, we get

  0 < g_{j+1}(b_{j+1}) − g_j(b_{j+1}) = − |I_{j+1}^low(b_{j+1})| − |I_{j+1}^up(b_{j+1})|,

which is impossible. In a similar way, d_j > d_{j+1} is also not possible. This proves the lemma.

If the optimal b_j are all unique,² then lemma 1 implies that the b_j satisfy the natural ordinal ordering. Even when one or more of the b_j are nonunique, lemma 1 says that there exist choices for the b_j that obey the natural ordering. The fact that the order preservation comes about automatically is interesting and nontrivial; this differs from the PRank algorithm (Crammer & Singer, 2002), where the order preservation on the thresholds is easily brought in by the update rule.

² If, in the primal problem, we regularize the b_j's as well (i.e., include the extra cost term Σ_j b_j²/2), then the b_j's are guaranteed to be unique. Lemma 1 still holds in this case.

It is also worth noting that lemma 1 holds even for an extended problem formulation that allows the use of different costs (different C values) for different misclassifications (class k misclassified as class j can have a C_k^j). In applications such as collaborative filtering, such a problem formulation can be very appropriate; for example, an A-rated movie that is misrated as D may need to be penalized much more than a B-rated movie that is misrated as D. Shashua and Levin's (2003) formulation and its extension given in section 2 of this article do not support such a differential cost structure. This is another good reason in support of the implicit problem formulation of the current section.

3.2 Dual Problem. Let α_{ki}^j ≥ 0, γ_{ki}^j ≥ 0, α_{ki}^{∗j} ≥ 0, and γ_{ki}^{∗j} ≥ 0 be the Lagrangian multipliers for the inequalities in equation 3.2. Using ideas parallel to those in section 2.1, we can show that the dual of equations 3.1 and 3.2 is the following maximization problem that involves only the multipliers α
and α∗:

  max_{α,α∗}  −(1/2) Σ_{k,i} Σ_{k′,i′} ( Σ_{j=1}^{k−1} α_{ki}^{∗j} − Σ_{j=k}^{r−1} α_{ki}^j ) ( Σ_{j=1}^{k′−1} α_{k′i′}^{∗j} − Σ_{j=k′}^{r−1} α_{k′i′}^j ) K( x_i^k, x_{i′}^{k′} )
            + Σ_{k,i} ( Σ_{j=1}^{k−1} α_{ki}^{∗j} + Σ_{j=k}^{r−1} α_{ki}^j ),   (3.5)

subject to

  Σ_{k=1}^{j} Σ_{i=1}^{n_k} α_{ki}^j = Σ_{k=j+1}^{r} Σ_{i=1}^{n_k} α_{ki}^{∗j},  ∀j,
  0 ≤ α_{ki}^j ≤ C,  ∀j and k ≤ j,
  0 ≤ α_{ki}^{∗j} ≤ C,  ∀j and k > j.   (3.6)

The dual problem, equations 3.5 and 3.6, is a convex quadratic programming problem. The size of the optimization problem is (r − 1)n, where n = Σ_{k=1}^{r} n_k is the total number of training samples. The discriminant function value for a new input vector x is

  f(x) = w · φ(x) = Σ_{k,i} ( Σ_{j=1}^{k−1} α_{ki}^{∗j} − Σ_{j=k}^{r−1} α_{ki}^j ) K( x_i^k, x ).   (3.7)

The predictive ordinal decision function is given by arg min_i { i : f(x) < b_i }.

The ideas of adapting the SMO algorithm to equations 3.5 and 3.6 are similar to those in the appendix. The resulting suboptimization problem is analogous to the case of the standard update in the appendix, where only one of the equality constraints from equation 3.6 is involved. Full details of the SMO algorithm are given in our technical report (Chu & Keerthi, 2005).

4 Numerical Experiments

We have implemented the two SMO algorithms for the ordinal regression formulations with explicit constraints (EXC) and implicit constraints (IMC),³ along with the algorithm of Shashua and Levin (2003) for comparison purposes. The function caching technique and the double-loop scheme
³ The source code of the two algorithms, written in ANSI C, can be found online at http://www.gatsby.ucl.ac.uk/∼chuwei/svor.htm.
proposed by Keerthi et al. (2001) have been incorporated in the implementation for efficiency. We begin this section with a simple data set to illustrate the typical behavior of the three algorithms and then empirically study the scaling properties of our algorithms. Then we compare the generalization performance of our algorithms against standard support vector regression on eight benchmark data sets for ordinal regression. The following gaussian kernel was used in these experiments:
  K(x, x′) = exp( −(κ/2) Σ_{ς=1}^{d} ( x_ς − x′_ς )² ),   (4.1)
where κ > 0 and x_ς denotes the ςth element of the input vector x. The tolerance parameter τ was set to 0.001 for all the algorithms. We have utilized two evaluation metrics that quantify the accuracy of predicted ordinal scales {ŷ_1, …, ŷ_t} with respect to true targets {y_1, …, y_t}:

1. Mean absolute error is the average deviation of the prediction from the true target, that is, (1/t) Σ_{i=1}^{t} |ŷ_i − y_i|, in which we treat the ordinal scales as consecutive integers.

2. Mean zero-one error is the fraction of incorrect predictions on individual samples.

4.1 Grading Data Set. The grading data set was used in chapter 4 of Johnson and Albert (1999) as an example of the ordinal regression problem.⁴ There are 30 samples of students' scores. The "SAT-math score" and "grade in prerequisite probability course" of these students are used as input features, and their final grades are taken as the targets. In our experiments, the six students with final grade A or E were not used, and the feature associated with the "grade in prerequisite probability course" was treated as a continuous variable though it had an ordinal scale. In Figure 3, we present the solutions obtained by the three algorithms using the gaussian kernel, equation 4.1, with κ = 0.5 and a regularization factor value of C = 1. In this particular setting, the solution to Shashua and Levin's (2003) formulation has disordered thresholds b_2 < b_1, as shown in Figure 3 (left); the formulation with explicit constraints corrects this disorder and yields equal values for the two thresholds, as shown in Figure 3 (middle).

⁴ The grading data set is available online at http://www.mathworks.com/support/books/book1593.jsp.

4.2 Scaling. In this experiment, we empirically studied how the two SMO algorithms scale with respect to training data size and the number of ordinal scales in the target. The California Housing data set was used in the
Figure 3: Training results of the three algorithms using a gaussian kernel on the grading data set. The discriminant function values are presented as contour graphs indexed by the two thresholds. The circles denote the students with grade D, the dots denote grade C, and the squares denote grade B.
Figure 4: Plots of CPU time versus training data size on log-log scale, indexed by the estimated slopes, respectively. We used the gaussian kernel with κ = 1 and the regularization factor value of C = 100 in the experiment.
scaling experiments.⁵ Twenty-eight training data sets with sizes ranging from 100 to 5,000 were generated by random selection from the original data set. The continuous target variable of the California Housing data was discretized to an ordinal scale by using 5 or 10 equal-frequency bins. Standard support vector regression (SVR) was used as a baseline, in which the ordinal targets were treated as continuous values and ε = 0.1. These data sets were trained by the three algorithms using a gaussian kernel with κ = 1 and a regularization factor value of C = 100. Figure 4 gives plots of the computational costs of the three algorithms as functions of the problem size, for the two cases of 5 and 10 target bins. Our algorithms scale well, with scaling exponents between 2.13 and 2.33, while the scaling exponent of SVR is about 2.40 in this case. This near-quadratic property in scaling

⁵ The California Housing data set can be found online at http://lib.stat.cmu.edu/datasets/.
comes from the sparseness property of SVMs: nonsupport vectors affect the computational cost only mildly. The EXC and IMC algorithms cost more than the SVR approach due to the larger problem size. For large sizes, the cost of EXC is only about two times that of SVR in this case. As expected, we also noticed that the computational cost of IMC depends on r, the number of ordinal scales in the target. The cost for 10 ranks is observed to be roughly five times that for 5 ranks, whereas the cost of EXC is nearly the same for the two cases. These observations are consistent with the sizes of the optimization problems. The problem size of IMC is (r − 1)n (which is heavily influenced by r), while the problem size of EXC is about 2n + r (which depends largely on n only, since we usually have n ≫ r). This factor of efficiency can be a key advantage for the EXC formulation.

4.3 Benchmark Data Sets. Next, we compared the generalization performance of the two approaches against the naive approach of using standard support vector regression (SVR) and the method (SLA) of Shashua and Levin (2003). We collected eight benchmark data sets that were used for metric regression problems.6 The target values were discretized into ordinal quantities using equal-frequency binning. For each data set, we generated two versions by discretizing the target values into 5 or 10 ordinal scales, respectively. We randomly partitioned each data set into training and test splits, as specified in Table 1. The partitioning was repeated 20 times independently. The input vectors were normalized to zero mean and unit variance, coordinate-wise. The gaussian kernel, equation 4.1, was used for all the algorithms. Fivefold cross validation was used to determine the optimal values of the model parameters (the gaussian kernel parameter κ and the regularization factor C) involved in the problem formulations, and the test error was obtained using the optimal model parameters for each formulation.
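The equal-frequency binning used to turn continuous regression targets into ordinal scales can be sketched as follows (our own illustration, not the authors' code):

```python
def equal_frequency_bins(values, n_bins):
    """Discretize a continuous target into ordinal ranks 1..n_bins,
    placing (approximately) the same number of samples in each rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for pos, i in enumerate(order):
        # The integer position along the sorted order determines the bin.
        ranks[i] = pos * n_bins // len(values) + 1
    return ranks
```

With this scheme, every rank receives about len(values)/n_bins samples, so the resulting ordinal classes are balanced by construction.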
The initial search was done on a 7 × 7 coarse grid linearly spaced in the region {(log10 C, log10 κ) | −3 ≤ log10 C ≤ 3, −3 ≤ log10 κ ≤ 3}, followed by a fine search on a 9 × 9 uniform grid linearly spaced by 0.2 in the (log10 C, log10 κ) space. The ordinal targets were treated as continuous values in standard SVR, and the predictions for test cases were rounded to the nearest ordinal scale. The insensitive-zone parameter ε of SVR was fixed at 0.1. The test results of these algorithms are recorded in Tables 1 and 2. It is clear that the generalization capabilities of the three ordinal regression algorithms are better than that of SVR. The performance of Shashua and Levin's method is similar to that of our EXC approach, as expected, since the two formulations are essentially the same. Our ordinal algorithms are comparable on the mean zero-one error, but the results also show that the IMC algorithm yields much more stable results on mean absolute error than the
6 These regression data sets are available online at http://www.liacc.up.pt/∼ltorgo/ Regression/DataSets.html.
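The two-stage search over (log10 C, log10 κ) described above can be sketched as follows (a hypothetical illustration: `cv_error` stands in for the fivefold cross-validation routine, and centering the 9 × 9 fine grid on the best coarse point is our assumption, not stated explicitly in the text):

```python
def grid_search(cv_error, coarse_range=(-3.0, 3.0)):
    """Coarse 7x7 grid over (log10 C, log10 kappa), then a fine 9x9
    grid with spacing 0.2 around the best coarse point."""
    lo, hi = coarse_range
    coarse = [lo + i * (hi - lo) / 6 for i in range(7)]  # 7 points, linearly spaced
    best = min(((c, k) for c in coarse for k in coarse),
               key=lambda p: cv_error(*p))
    fine_c = [best[0] + 0.2 * (i - 4) for i in range(9)]  # 9 points, step 0.2
    fine_k = [best[1] + 0.2 * (i - 4) for i in range(9)]
    return min(((c, k) for c in fine_c for k in fine_k),
               key=lambda p: cv_error(*p))
```

The coarse pass narrows the search to one decade of C and κ; the fine pass then resolves the optimum to within a factor of about 10^0.2 in each parameter.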
Table 1: Test Results of the Three Algorithms Using a Gaussian Kernel.

Mean Zero-One Error:

Data Set      d    Training/Test   SVR            EXC            IMC
Pyrimidines   27   50/24           0.552±0.102    0.525±0.095    0.517±0.086
MachineCPU    6    150/59          0.454±0.054    0.423±0.060    0.431±0.054
Boston        13   300/206         0.354±0.033    0.336±0.033    0.332±0.024
Abalone       8    1000/3177       0.555±0.016∗   0.522±0.015    0.527±0.009
Bank          32   3000/5182       0.590±0.004∗   0.528±0.004    0.537±0.004∗
Computer      21   4000/4182       0.289±0.007∗   0.270±0.006    0.273±0.007
California    8    5000/15640      0.451±0.004∗   0.424±0.003    0.426±0.004
Census        16   6000/16784      0.522±0.005∗   0.473±0.003    0.478±0.004∗

Mean Absolute Error:

Data Set      SVR            EXC            IMC
Pyrimidines   0.673±0.160    0.623±0.120    0.615±0.127
MachineCPU    0.495±0.067    0.458±0.067    0.462±0.062
Boston        0.376±0.033    0.362±0.036    0.357±0.024
Abalone       0.679±0.014∗   0.662±0.005    0.657±0.011
Bank          0.712±0.006∗   0.674±0.006∗   0.661±0.005
Computer      0.306±0.008∗   0.288±0.007    0.289±0.007
California    0.515±0.004∗   0.494±0.004    0.491±0.005
Census        0.619±0.006∗   0.576±0.003    0.573±0.005

Notes: The targets of these benchmark data sets were discretized into five equal-frequency bins. d denotes the input dimension, and "training/test" denotes the partition size. The results are the averages over 20 trials, along with the standard deviation. For each data set, the lowest average value among the results of the three algorithms is the winning entry. Asterisks indicate the cases significantly worse than the winning entry; a p-value threshold of 0.01 in the Wilcoxon rank sum test was used to decide this.
Table 2: Test Results of the Four Algorithms Using a Gaussian Kernel.

Mean Zero-One Error:

Data Set      SVR            SLA            EXC            IMC
Pyrimidines   0.777±0.068∗   0.756±0.073    0.752±0.063    0.719±0.066
MachineCPU    0.693±0.056∗   0.643±0.057    0.661±0.056    0.655±0.045
Boston        0.589±0.025∗   0.561±0.023    0.569±0.025    0.561±0.026
Abalone       0.758±0.017∗   0.739±0.008∗   0.736±0.011    0.732±0.007
Bank          0.786±0.004∗   0.759±0.005∗   0.744±0.005    0.751±0.005∗
Computer      0.494±0.006∗   0.462±0.006    0.462±0.005    0.473±0.005∗
California    0.677±0.003∗   0.640±0.003    0.640±0.003    0.639±0.003
Census        0.735±0.004∗   0.699±0.002    0.699±0.002    0.705±0.002∗

Mean Absolute Error:

Data Set      SVR            SLA            EXC            IMC
Pyrimidines   1.404±0.184    1.400±0.255    1.331±0.193    1.294±0.204
MachineCPU    1.048±0.141    1.002±0.121    0.986±0.127    0.990±0.115
Boston        0.785±0.052    0.765±0.057    0.773±0.049    0.747±0.049
Abalone       1.407±0.021∗   1.389±0.027∗   1.391±0.021∗   1.361±0.013
Bank          1.471±0.010∗   1.414±0.012∗   1.512±0.017∗   1.393±0.011
Computer      0.632±0.011∗   0.597±0.010    0.602±0.009    0.596±0.008
California    1.070±0.008∗   1.068±0.006∗   1.068±0.005∗   1.008±0.005
Census        1.283±0.009∗   1.271±0.007∗   1.270±0.007∗   1.205±0.007

Notes: The partition sizes of these benchmark data sets were the same as in Table 1, but the targets were discretized by 10 equal-frequency bins. The results are the averages over 20 trials, along with the standard deviation. For each data set, the lowest average value among the results of the four algorithms is the winning entry. Asterisks indicate the cases significantly worse than the winning entry; a p-value threshold of 0.01 in the Wilcoxon rank sum test was used to decide this.
EXC algorithm.7 From the viewpoint of the formulations, EXC considers only the worst (most extreme) samples between successive ranks, whereas IMC takes all the samples into account. Thus, outliers may affect the results of EXC significantly, while the results of IMC are relatively more stable in both validation and test.

4.4 Information Retrieval. Ranking learning arises frequently in information retrieval. Hersh, Buckley, Leone, and Hickam (1994) generated the OHSUMED data set,8 which consists of 348,566 references and 106 queries with their respective ranked results. The relevance level of the references with respect to a given textual query was assessed by human experts using a three-rank scale: definitely, possibly, or not relevant. In our experiments, we used the results of query 1, which contains 107 assessed references (19 definitely relevant, 14 possibly relevant, and 74 irrelevant) taken from the database. In order to apply our algorithms, the bag-of-words representation was used to translate these reference documents into vectors. We computed for all documents the vector of term frequency (TF) components scaled by inverse document frequencies (IDF). TF-IDF is a weighting scheme for the bag-of-words representation that gives higher weights to terms that occur rarely across the documents. We used the Rainbow software released by McCallum (1998) to scan the title and abstract of these references for the bag-of-words representation. In the preprocessing, we skipped the terms in the stoplist9 and restricted ourselves to terms that appear in at least 3 of the 107 documents. This results in 462 distinct terms. Each document is represented by its TF-IDF vector with 462 elements. To account for different document lengths, we normalized each document vector to unit length (Joachims, 1998). We randomly selected a subset of the 107 references (with size chosen from {20, 30, . . . , 60}) for training and then tested on the remaining references.
For each size, the random selection was repeated 100 times. The generalization performance of the two support vector algorithms for ordinal regression was compared against the naive approach of using support vector regression. The linear kernel $K(x_i, x_j) = x_i \cdot x_j$ was employed for all three algorithms. The test results of the three algorithms are presented as box plots in Figure 5. In this case, the zero-one error rates are at almost the same level, but the naive approach yields a much worse absolute error rate than the two ordinal SVM algorithms.
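The TF-IDF weighting with unit-length normalization described in section 4.4 can be sketched as follows (our own illustration; the actual preprocessing was done with the Rainbow software):

```python
import math

def tfidf_vectors(docs):
    """docs: list of token lists. Returns unit-length TF-IDF vectors
    as {term: weight} dicts, using IDF = log(N / df(term))."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1       # document frequency
    vectors = []
    for doc in docs:
        vec = {}
        for term in doc:
            vec[term] = vec.get(term, 0.0) + 1.0  # raw term frequency
        for term in vec:
            vec[term] *= math.log(n / df[term])   # scale TF by IDF
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            # Normalize to unit length to account for document length.
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors
```

A term occurring in every document gets IDF = log(1) = 0 and thus carries no weight, which is exactly the down-weighting of uninformative common terms that the scheme is meant to achieve.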
7 As a plausible argument, $\xi_i^j + \xi_i^{*\,j+1}$ in equation 2.1 of EXC is an upper bound on the zero-one error of the ith example, while, in equation 2.17 of IMC, $\sum_{k=1}^{j} \xi_{ki} + \sum_{k=j+1}^{r} \xi_{ki}^{*}$ is an upper bound on the absolute error. Note that in all the examples, we use consecutive integers to represent the ordinal scales.
8 This data set is publicly available online at ftp://medir.ohsu.edu/pub/ohsumed/.
9 The stoplist is the SMART system's list of 524 common words, like the and of.
Figure 5: Test results of the three algorithms on the subset of the OHSUMED data relating to query 1, over 100 trials. The grouped boxes represent the results of SVR (left), EXC (middle), and IMC (right) at different training data sizes. The notched boxes have lines at the lower quartile, median, and upper quartile values. The whiskers are lines extending from each end of the box to the most extreme data value within 1.5·IQR (interquartile range) of the box. Outliers are data with values beyond the ends of the whiskers, which are displayed by dots. The left graph gives mean zero-one error, and the right graph gives mean absolute error.
5 Conclusion

In this article, we proposed two new approaches to support vector ordinal regression that determine r − 1 parallel discriminant hyperplanes for the r ranks by using r − 1 thresholds. The ordinal inequality constraints on the thresholds are imposed explicitly in the first approach and implicitly in the second one. The problem size of the two approaches is linear in the number of training samples. We also designed SMO algorithms that scale only about quadratically with the problem size. The results of numerical experiments verified that the generalization capabilities of these approaches are much better than those of the naive approach of applying standard regression.

Appendix: SMO Algorithm

In this section we adapt the SMO algorithm (Platt, 1999; Keerthi et al., 2001) for the solution of equations 2.10 and 2.11. The key idea of SMO consists of starting with a valid initial point and optimizing only one pair of variables at a time while fixing all the other variables. The suboptimization problem of the two active variables can be solved analytically. Table 3 presents an outline of the SMO implementation for our optimization problem. In order to determine the pair of active variables to optimize, we select the active threshold first. The index of the active threshold is defined as $J = \arg\max_j \{B_{low}^j - B_{up}^j\}$. Let us assume that $B_{low}^J$ and $B_{up}^J$ are actually
Table 3: Basic Framework of the SMO Algorithm for Support Vector Ordinal Regression Using Explicit Threshold Constraints.

Start at a valid point $(\alpha, \alpha^*, \mu)$ that satisfies equation 2.11, and find the current $B_{up}^j$ and $B_{low}^j$ $\forall j$.
SMO Loop:
Do
  1. Determine the active threshold J.
  2. Optimize the pair of active variables and the set $\mu_a$.
  3. Compute $B_{up}^j$ and $B_{low}^j$ $\forall j$ at the new point.
while the optimality condition, equation 2.18, has not been satisfied.
Exit: Return $\alpha$, $\alpha^*$, and b.
defined by $b_{low}^{j_o}$ and $b_{up}^{j_u}$, respectively, and that the two multipliers associated with $b_{low}^{j_o}$ and $b_{up}^{j_u}$ are $\alpha_o$ and $\alpha_u$. The pair of multipliers $(\alpha_o, \alpha_u)$ is optimized from the current point $(\alpha_o^{old}, \alpha_u^{old})$ to reach the new point $(\alpha_o^{new}, \alpha_u^{new})$. It is possible that $j_o \neq j_u$. In this case, named a cross update, more than one equality constraint in equation 2.11 is involved in the optimization, which may update the variable set $\mu_a = \{\mu^{\min\{j_o, j_u\}+1}, \ldots, \mu^{\max\{j_o, j_u\}}\}$, a subset of µ. In the case of $j_o = j_u$, named a standard update, only one equality constraint is involved, and the variables of µ are kept intact, that is, $\mu_a = \emptyset$. These suboptimization problems can be solved analytically, and the detailed formulas for updating are given in the following. The following equality constraints have to be taken into account in the consequent suboptimization problem:
$$\sum_{i=1}^{n^j} \alpha_i^j + \mu^j = \sum_{i=1}^{n^{j+1}} \alpha_i^{*\,j+1} + \mu^{j+1}, \qquad \min\{j_o, j_u\} \le j \le \max\{j_o, j_u\}. \qquad (A.1)$$
Under these constraints, the suboptimization problem for $\alpha_o$, $\alpha_u$, and $\mu_a$ can be written as

$$\min_{\alpha_o,\, \alpha_u,\, \mu_a} \quad \frac{1}{2} \sum_{j,i} \sum_{j',i'} \left(\alpha_i^{*\,j} - \alpha_i^{j}\right)\left(\alpha_{i'}^{*\,j'} - \alpha_{i'}^{j'}\right) K(x_i, x_{i'}) - \alpha_o - \alpha_u, \qquad (A.2)$$
subject to the bound constraints $0 \le \alpha_o \le C$, $0 \le \alpha_u \le C$, and $\mu_a \ge 0$. The equality constraints, equation A.1, can be summarized for $\alpha_o$ and $\alpha_u$ as a linear constraint $\alpha_o + s_o s_u \alpha_u = \alpha_o^{old} + s_o s_u \alpha_u^{old} = \rho$, where

$$s_o = \begin{cases} -1 & \text{if } i_o \in I_{0a}^{j_o} \cup I_2^{j_o} \\ +1 & \text{if } i_o \in I_{0b}^{j_o} \cup I_4^{j_o} \end{cases} \qquad s_u = \begin{cases} -1 & \text{if } i_u \in I_{0a}^{j_u} \cup I_3^{j_u} \\ +1 & \text{if } i_u \in I_{0b}^{j_u} \cup I_1^{j_u} \end{cases} \qquad (A.3)$$
It is quite straightforward to verify that for the above optimization problem, which is restricted to $\alpha_o$, $\alpha_u$, and $\mu_a$, one can use the same derivations that led to equation 2.16 to show that the optimality condition for the restricted problem is nothing but $b_{low}^{j_o} \le b_{up}^{j_u}$. Since we started with this condition being violated, we are guaranteed that the solution of the restricted optimization problem will lead to a strict improvement in the objective function. To solve the restricted optimization problem, first consider the case of a positive-definite kernel matrix. For this case, the unbounded solution can be exactly determined as

$$\alpha_o^{new} = \alpha_o^{old} + s_o \, \Delta\mu, \qquad \alpha_u^{new} = \alpha_u^{old} - s_u \, \Delta\mu, \qquad \mu^{j,\,new} = \mu^{j,\,old} - d_\mu \, \Delta\mu \quad \forall \mu^j \in \mu_a, \qquad (A.4)$$

where

$$d_\mu = \begin{cases} +1 & \text{if } j_o \le j_u \\ -1 & \text{if } j_o > j_u \end{cases} \qquad (A.5)$$

and

$$\Delta\mu = \frac{-f(x_o) + f(x_u) + s_o - s_u}{K(x_o, x_o) + K(x_u, x_u) - 2K(x_o, x_u)}. \qquad (A.6)$$
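The unbounded step of equations A.4 to A.6, together with the clipping toward 0 discussed next, can be illustrated as follows (a simplified sketch under our own naming; it enforces only the box bounds on the two multipliers and omits the $\mu^j \ge 0$ constraints and the cross-update bookkeeping):

```python
def smo_substep(a_o, a_u, s_o, s_u, f_o, f_u, k_oo, k_uu, k_ou, C):
    """One pairwise update: compute the unbounded step (eq. A.6),
    shrink it toward 0 until both multipliers satisfy 0 <= alpha <= C,
    then apply the update of eq. A.4."""
    eta = k_oo + k_uu - 2.0 * k_ou          # denominator; > 0 for a
    step = (-f_o + f_u + s_o - s_u) / eta   # positive-definite kernel
    # Shrink the step so that a_o + s_o*step and a_u - s_u*step both
    # stay inside the box [0, C]; shrinking toward 0 is always feasible.
    for coeff, alpha in ((s_o, a_o), (-s_u, a_u)):
        new = alpha + coeff * step
        if new > C:
            step = (C - alpha) / coeff
        elif new < 0.0:
            step = -alpha / coeff
    return a_o + s_o * step, a_u - s_u * step
```

Because the feasible set in the step variable is an interval containing 0, each clamp only reduces the magnitude of the step, so constraints satisfied earlier in the loop remain satisfied.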
Now we need to check the box-bound constraints on $\alpha_o$ and $\alpha_u$ and the constraints $\mu^j \ge 0$. We have to adjust $\Delta\mu$ toward 0 to draw an out-of-range solution back to the nearest boundary point of the feasible region, and then update the variables accordingly, as in equation A.4. Note that the cross update allows $\alpha_o$ and $\alpha_u$ to be associated with the same sample. For the positive-semidefinite kernel matrix case, the denominator of equation A.6 can vanish. Still, the numerator alone defines the descent direction, which needs to be traversed until the boundary is hit, in order to get the optimal solution. It should be noted that working with just any violating pair of variables does not automatically ensure the convergence of the SMO algorithm. However, since the most violating pair is always chosen, the ideas of Keerthi and
Gilbert (2002) can be adapted to prove the convergence of the SMO algorithm.

Acknowledgments

Part of the work was carried out at the Institute for Pure and Applied Mathematics, University of California, from April to June 2004. W.C. was supported by NIH Grant 1 P01 GM63208. The revision work was partly supported by a research contract from Consolidated Edison. The reviewers' thoughtful comments are gratefully appreciated.

References

Chu, W., & Keerthi, S. S. (2005). New approaches to support vector ordinal regression (Tech. Rep.). Burbank, CA: Yahoo! Research Labs.
Crammer, K., & Singer, Y. (2002). Pranking with ranking. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14 (pp. 641–647). Cambridge, MA: MIT Press.
Frank, E., & Hall, M. (2001). A simple approach to ordinal classification. In Proceedings of the European Conference on Machine Learning (pp. 145–165). Berlin: Springer.
Har-Peled, S., Roth, D., & Zimak, D. (2002). Constraint classification: A new approach to multiclass classification and ranking. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 785–792). Cambridge, MA: MIT Press.
Herbrich, R., Graepel, T., & Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In P. J. Bartlett, B. Schölkopf, D. Schuurmans, & A. J. Smola (Eds.), Advances in large margin classifiers (pp. 115–132). Cambridge, MA: MIT Press.
Hersh, W., Buckley, C., Leone, T., & Hickam, D. (1994). OHSUMED: An interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th Annual ACM SIGIR Conference (pp. 192–201). Berlin: Springer.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In C. Nédellec & C. Rouveirol (Eds.), Proceedings of ECML-98, 10th European Conference on Machine Learning (pp. 137–142). Berlin: Springer-Verlag.
Johnson, V. E., & Albert, J. H.
(1999). Ordinal data modeling (Statistics for social science and public policy). Heidelberg: Springer-Verlag.
Keerthi, S. S., & Gilbert, E. G. (2002). Convergence of a generalized SMO algorithm for SVM classifier design. Machine Learning, 46, 351–360.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13, 637–649.
Kramer, S., Widmer, G., Pfahringer, B., & DeGroeve, M. (2001). Prediction of ordinal classes using regression trees. Fundamenta Informaticae, 47, 1–13.
McCallum, A. (1998). Rainbow. Available online at http://www.cs.umass.edu/∼mccallum/bow/rainbow/.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Shashua, A., & Levin, A. (2003). Ranking with large margin principle: Two approaches. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 937–944). Cambridge, MA: MIT Press.
Srebro, N., Rennie, J. D. M., & Jaakkola, T. (2005). Maximum-margin matrix factorization. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Received November 17, 2005; accepted June 29, 2006.
LETTER
Communicated by Klaus Brinker
Neighborhood Property–Based Pattern Selection for Support Vector Machines

Hyunjung Shin
[email protected] Department of Industrial and Information Systems Engineering, Ajou University, Wonchun-dong, Yeoungtong-gu, 443–749, Suwon, Korea, and Friedrich Miescher Laboratory, Max Planck Society, 72076, Tübingen, Germany
Sungzoon Cho
[email protected] Seoul National University, Shillim-Dong, Kwanak-Gu, 151-744, Seoul, Korea
The support vector machine (SVM) has been spotlighted in the machine learning community because of its theoretical soundness and practical performance. When applied to a large data set, however, it requires a large memory and a long time for training. To cope with the practical difficulty, we propose a pattern selection algorithm based on neighborhood properties. The idea is to select only the patterns that are likely to be located near the decision boundary. Those patterns are expected to be more informative than randomly selected patterns. The experimental results provide promising evidence that it is possible to successfully employ the proposed algorithm ahead of SVM training.

1 Introduction

The support vector machine (SVM) has been spotlighted in the machine learning community because of its theoretical soundness and practical performance. SVM has been highly successful in practical applications as diverse as face detection and recognition, handwritten character and digit recognition, and text detection and categorization (Byun & Lee, 2002; Dumais, 1998; Heisele, Poggio, & Pontil, 2000; Joachims, 2002; Moghaddam & Yang, 2000; Osuna, Freund, & Girosi, 1997; Schölkopf, Burges, & Smola, 1999). However, when applied to a large data set, SVM training can become computationally intractable. In the formulation of SVM quadratic programming (QP), the dimension of the kernel matrix ($M \times M$) is equal to the number of training patterns (M) (Vapnik, 1999). Thus, when the data set is huge, training cannot be finished in a reasonable time, even if the kernel matrix could be loaded into memory. Most standard SVM QP solvers, such as MINOS, CPLEX, LOQO, and the MATLAB QP routines, have a time complexity of $O(M^3)$. And the solvers using decomposition methods have a time complexity of

Neural Computation 19, 816–855 (2007)
© 2007 Massachusetts Institute of Technology
$I \cdot O(Mq + q^3)$, where I is the number of iterations and q is the size of the working set; examples are Chunking, SMO, SVMlight, and SOR (Hearst, Schölkopf, Dumais, Osuna, & Platt, 1997; Joachims, 2002; Schölkopf et al., 1999; Platt, 1999). Needless to say, I increases as M increases. Empirical studies have estimated the run time of common decomposition methods to be proportional to $O(M^p)$, where p varies from approximately 1.7 to 3.0 depending on the problem (Hush, Kelly, Scovel, & Steinwart, 2006; Laskov, 2002). Moreover, SVM requires heavy or repetitive computation to find a satisfactory model: for instance, which kernel is favorable over the others (among RBF, polynomial, sigmoid, and others) and, additionally, how to choose the kernel parameters (width of basis function, polynomial degree, offset, and scale, respectively). There has been a considerable amount of work related to this topic, referred to as kernel learning, hyperkernels, and kernel alignment, for example (see Ong, Smola, & Williamson, 2005, and Sonnenburg, Rätsch, Schäfer, & Schölkopf, 2006). But the most common approaches for parameter selection depend on (cross-)validation. This implies one should train an SVM model multiple times until a proper model is found. One way to circumvent this computational burden might be to select some training patterns in advance. One of the distinguishing merits of SVM theory is that it clarifies which patterns are important to training. These patterns are distributed near the decision boundary, and fully and succinctly define the classification task at hand (Cauwenberghs & Poggio, 2001; Pontil & Verri, 1998; Vapnik, 1999). Furthermore, the subset of support vectors (SVs) is almost identical regardless of which kernel function is chosen for training (Schölkopf, Burges, & Vapnik, 1995). From a computational point of view, it is therefore worth identifying a subset of would-be SVs in preprocessing, and then training the SVM model with the smaller set.
There have been several approaches to pattern selection that reduce the number of training patterns. Lyhyaoui et al. (1999) proposed a 1-nearest-neighbor search from the opposite class after class-wise clustering. But this method presumes that there is no class overlap in the training set when finding the patterns near the decision boundary. Almeida, Braga, and Braga (2000) conducted k-means clustering. A cluster is defined as homogeneous if it consists of patterns from the same class and heterogeneous otherwise. All of the patterns from a homogeneous cluster are replaced by a single centroid pattern, while the patterns from a heterogeneous cluster are all selected. The drawback is that it is not clear how to determine the number of clusters. Koggalage and Halgamuge (2004) also employed clustering to select the patterns from the training set. Their approach is quite similar to that of Almeida et al. (2000): clustering is conducted on the entire training set first, and the patterns that belong to the heterogeneous clusters are chosen. For a homogeneous cluster, on the contrary, the patterns along the rim of the cluster are selected, instead of the centroid as in Almeida et al. (2000). It is a relatively safer approach since, even in homogeneous clusters, patterns near the decision boundary can exist if the cluster's boundary is almost in contact
with the decision boundary. But it has a relative shortcoming as well, in that patterns far away from the decision boundary are also picked as long as they lie along the rim. Furthermore, how to set the radius and define the width of the rim is still unclear. In the meantime, for the reduced SVM (RSVM) of Lee and Mangasarian (2001), Zheng, Lu, Zheng, and Xu (2003) proposed using the centroids of clusters instead of random samples. This is more reasonable, since the centroids are more representative than random samples. However, these clustering-based algorithms have a common weakness: the selected patterns are fully dependent on clustering performance, which can be unstable. Liu and Nakagawa (2001) made a related performance comparison. Sohn and Dagli (2001) suggested a slightly different approach. It utilizes fuzzy class membership through k nearest neighbors. The score of fuzzy class membership is translated as a probability of how deeply a pattern belongs to a class. By these scores, the patterns having a weak probability are eliminated from the training set. However, it overlooks the importance of the patterns near the decision boundary; they are treated the same as outliers (noisy patterns far from the decision boundary). Active learning shares the issue of significant pattern identification with pattern selection (Brinker, 2003; Campbell, Cristianini, & Smola, 2000; Schohn & Cohn, 2000). However, there are substantial differences between them. First, the primary motivation of active learning comes from the high cost of obtaining labeled training patterns, not from that of training itself. For instance, in industrial process modeling, obtaining even a single training pattern may require several days. In e-mail filtering, obtaining training patterns is not expensive, but it takes many hours to label them. In pattern selection, however, the labeled training patterns are assumed to already exist.
Second, active learning alternates between training with a newly introduced pattern and making queries over the next pattern. On the contrary, pattern selection runs only once before training, as a preprocessor. In this letter, we propose the neighborhood property–based pattern selection algorithm (NPPS). The practical time complexity of NPPS is vM, where v is the number of patterns in the overlap region around the decision boundary. We utilize the k nearest neighbors to look around each pattern's periphery. The first neighborhood property is that "a pattern located near the decision boundary tends to have more heterogeneous neighbors in its class membership." The second neighborhood property is that "a noisy pattern tends to belong to a different class from that of its neighbors." And the third neighborhood property is that "the neighbors of a decision boundary pattern tend to be located near the decision boundary as well." The first property is used for identifying the patterns located near the decision boundary. The second property is used for removing the patterns located on the wrong side of the decision boundary. And the third property is used for skipping unnecessary distance calculations, thus accelerating the pattern selection procedure. The letter also describes how to choose k for the
proposed algorithm. Note that this choice has been skipped or dealt with as trivial in other pattern selection methods that employ either k-means clustering or the k-nearest neighbor rule. The remaining part of this letter is organized as follows. Section 2 briefly explains SVM theory, in particular, the patterns that critically affect training. Section 3 presents the proposed method, NPPS, which selects the patterns near the decision boundary. Section 4 explains how to choose the number of neighbors, k, the parameter of the proposed method. Section 5 provides the experimental results on artificial data sets, real-world benchmark data sets, and a real-world marketing data set. We conclude with some future work in section 6.

2 Support Vector Machines and Critical Training Patterns

Support vector machines (SVMs) are a general class of statistical learning architectures that perform structural risk minimization on a nested set structure of separating hyperplanes (Cristianini & Shawe-Taylor, 2000; Schölkopf & Smola, 2002; Vapnik, 1999). Consider a binary classification problem with M patterns $(x_i, y_i)$, $i = 1, \ldots, M$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. Let us assume that patterns with $y_i = 1$ belong to class 1, while those with $y_i = -1$ belong to class 2. SVM training involves solving the following quadratic programming problem, which yields the largest margin $(2/\|w\|)$ between classes:

$$\min_{w,\, b,\, \xi} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{M} \xi_i$$

$$\text{s.t.} \quad y_i\left(w \cdot \Phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, M, \qquad (2.1)$$
where $w \in \mathbb{R}^d$, $b \in \mathbb{R}$ (see Figure 1), and $C \in \mathbb{R}$ is an error tolerance parameter. Equation 2.1 is the most general SVM formulation, allowing both nonseparable and nonlinear cases. The $\xi_i$'s are nonnegative slack variables, which play the role of allowing a certain level of misclassification in the nonseparable case. $\Phi(\cdot)$ is a mapping function for the nonlinear case that projects patterns from the input space into a feature space. This nonlinear mapping is performed implicitly by employing a kernel function $K(x, x')$ to avoid the costly calculation of inner products $\Phi(x) \cdot \Phi(x')$.1 There are three typical

1 Aside from the computational efficiency gained by the dot-product replacement, the kernels provide operational benefits as well: it is easy and natural to work on (or integrate) various types of data (vectors, sequences, text, images, and graphs, for example) and to detect very general types of relations therein.
Figure 1: SVM classification problem: Through a mapping function $\Phi(\cdot)$, the class patterns are linearly separated in a feature space. The patterns determining both margin hyperplanes are outlined. The decision boundary is the hyperplane halfway between the margins.
kernel functions, RBF, polynomial, and sigmoid, in due order:

$$K(x, x') = \exp\left(-\|x - x'\|^2 / 2\sigma^2\right), \quad K(x, x') = (x \cdot x' + 1)^p, \quad K(x, x') = \tanh\left(\rho\,(x \cdot x') + \delta\right). \qquad (2.2)$$
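The three kernels of equation 2.2 are straightforward to transcribe (an illustrative sketch; vectors are plain Python sequences):

```python
import math

def rbf_kernel(x, z, sigma):
    # K(x, z) = exp(-||x - z||^2 / 2 sigma^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def polynomial_kernel(x, z, p):
    # K(x, z) = (x . z + 1)^p
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** p

def sigmoid_kernel(x, z, rho, delta):
    # K(x, z) = tanh(rho (x . z) + delta)
    return math.tanh(rho * sum(a * b for a, b in zip(x, z)) + delta)
```

Each function evaluates the implicit inner product in feature space directly from the input vectors, which is exactly the dot-product replacement mentioned in footnote 1.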
The optimal solution of equation 2.1 yields a decision function of the following form:

$$f(x) = \operatorname{sign}\left(w \cdot \Phi(x) + b\right) = \operatorname{sign}\left(\sum_{i=1}^{M} y_i \alpha_i\, \Phi(x_i) \cdot \Phi(x) + b\right) = \operatorname{sign}\left(\sum_{i=1}^{M} y_i \alpha_i\, K(x_i, x) + b\right), \qquad (2.3)$$
where the $\alpha_i$'s are nonnegative Lagrange multipliers associated with the training patterns. The solutions, the $\alpha_i$'s, are obtained from the dual problem of equation 2.1, which minimizes the convex quadratic objective
function under constraints:

$$\min_{0 \le \alpha_i \le C} W(\alpha_i, b) = \frac{1}{2} \sum_{i,j=1}^{M} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{M} \alpha_i + b \sum_{i=1}^{M} y_i \alpha_i.$$
The first-order conditions on $W(\alpha_i, b)$ reduce to the Karush-Kuhn-Tucker (KKT) conditions:

$$\frac{\partial W(\alpha_i, b)}{\partial \alpha_i} = \sum_{j=1}^{M} y_i y_j K(x_i, x_j)\,\alpha_j + y_i b - 1 = y_i \bar{f}(x_i) - 1 = g_i,$$

$$\frac{\partial W(\alpha_i, b)}{\partial b} = \sum_{j=1}^{M} y_j \alpha_j = 0, \qquad (2.4)$$
where $\bar{f}(\cdot)$ is the function inside the parentheses of sign in equation 2.3. The KKT complementarity condition, equation 2.4, partitions the training pattern set into three categories according to the corresponding $\alpha_i$'s:

(a) $g_i > 0 \;\rightarrow\; \alpha_i = 0$ : irrelevant patterns
(b) $g_i = 0 \;\rightarrow\; 0 < \alpha_i < C$ : margin support vectors
(c) $g_i < 0 \;\rightarrow\; \alpha_i = C$ : error support vectors
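The partition induced by the KKT complementarity condition can be expressed directly (our own sketch; `tol` handles floating-point comparisons, and the alphas and g values are assumed to come from a trained SVM):

```python
def categorize(alphas, gs, C, tol=1e-8):
    """Split training-pattern indices into the three KKT categories:
    irrelevant patterns, margin SVs, and error SVs."""
    irrelevant, margin_svs, error_svs = [], [], []
    for i, (a, g) in enumerate(zip(alphas, gs)):
        if g > tol and a <= tol:                  # (a) g_i > 0, alpha_i = 0
            irrelevant.append(i)
        elif abs(g) <= tol and a < C - tol:       # (b) g_i = 0, 0 < alpha_i < C
            margin_svs.append(i)
        else:                                     # (c) g_i < 0, alpha_i = C
            error_svs.append(i)
    return irrelevant, margin_svs, error_svs
```

Only the indices in the last two lists (the support vectors) contribute to the decision function of equation 2.5.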
Figure 2 illustrates these categories (Cauwenberghs & Poggio, 2001; Pontil & Verri, 1998). The patterns belonging to category a are outside the margins and thus irrelevant to training, while the patterns belonging to categories b and c are critical and directly affect training. They are called support vectors (SVs). The patterns of category b lie strictly on the margin; hence, they are called margin SVs. The patterns of category c lie between the two margins; hence, they are called error SVs, but they are not necessarily misclassified. (Note that there is another type of SV belonging to the category of error SVs: incorrectly labeled patterns that can be located very far from the decision boundary. We regard them as outliers that do not contribute to margin construction. We focus only on the SVs located around the decision boundary, not those located deep in the realm of the opposite class.) Going back to equation 2.3, we can now see that the decision function is a linear combination of kernels on only those critical training patterns (denoted as SVs), because the patterns corresponding to $\alpha_i = 0$ have no influence on the decision result:

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{M} y_i \alpha_i K(x_i, x) + b\right) = \operatorname{sign}\left(\sum_{i \in \text{SVs}} y_i \alpha_i K(x_i, x) + b\right). \qquad (2.5)$$
H. Shin and S. Cho
Figure 2: Three categories of training patterns.
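The partition into the three categories and the reduction in equation 2.5 can be illustrated with a small sketch (hypothetical toy numbers and a linear kernel; the dual solution here is assumed rather than computed, since the point is the bookkeeping, not the training):

```python
# Hypothetical 1-D toy problem with an assumed dual solution.
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [-1, -1, -1, 1, 1, 1]
alpha = [0.0, 0.4, 1.0, 1.0, 0.4, 0.0]
C, b = 1.0, 0.0

def K(u, v):
    """Linear kernel, standing in for K(x_i, x_j)."""
    return u * v

def category(i):
    """The three categories induced by the KKT conditions (Figure 2)."""
    if alpha[i] == 0.0:
        return "irrelevant"
    return "margin SV" if alpha[i] < C else "error SV"

def f(x, indices):
    """The sign argument of equation 2.5, summed over the given indices."""
    return sum(alpha[i] * y[i] * K(X[i], x) for i in indices) + b

svs = [i for i in range(len(X)) if alpha[i] > 0]
# Patterns with alpha_i = 0 drop out: the sum over all patterns equals
# the sum over the support vectors only.
for x in (-1.5, 0.0, 1.5):
    assert abs(f(x, range(len(X))) - f(x, svs)) < 1e-12
```

The identity holds term by term, since every discarded summand is multiplied by a zero multiplier.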
Equation 2.5 leads us to an attempt to reduce the whole training set to a subset of would-be SVs.

3 Neighborhood Property–Based Pattern Selection

To circumvent memory- and time-demanding SVM training, we propose a preprocessing algorithm: the neighborhood property–based pattern selection algorithm (NPPS). The idea of the algorithm is to select only those patterns located around the decision boundary, since they are the ones that contain the most information. In contrast to the usually employed random sampling, this approach can be viewed as informative or intelligent sampling. Figure 3 conceptually shows the difference between NPPS and random sampling in selecting a subset of the training data. NPPS selects the patterns in the region around the decision boundary, while random sampling selects patterns from the whole input space. Obviously, no one knows how close a pattern is to the decision boundary until a classifier is built. However, we can infer the proximity ahead of training by exploiting neighborhood properties. The first neighborhood property is that "a pattern located near the decision boundary tends to have more heterogeneous neighbors in its class membership." The well-known entropy concept can be utilized to measure the heterogeneity of class labels among the k nearest neighbors, and this measure in turn lets us estimate the proximity. Here, we define a measure
Figure 3: NPPS and random sampling select different subsets. (a) NPPS. (b) Random sampling. Open circles and squares are the patterns belonging to class 1 and class 2, respectively. Filled circles and squares are the selected patterns.
by using (negative) entropy,

Neighbors Entropy(x, k) = Σ_{j=1..J} Pj · log_J(1/Pj),
where j indicates a particular class out of J classes, and Pj is defined as kj/k, where kj is the number of the neighbors belonging to class j. In most cases, a pattern with a positive value of Neighbors Entropy(x, k) is close to the decision boundary and is thus selected. Such patterns are likely to be SVs, corresponding to the margin SVs in Figure 2b or the error SVs in Figure 2c. Among the patterns having a positive value of Neighbors Entropy(x, k), noisy patterns are also present. Here, let us define an overlap region as a hypothetical region in feature space shared by both classes, and overlap patterns as the patterns in that region. (We defer the finer definitions of overlap region and overlap patterns until section 4.1.) A set of overlap patterns contains patterns located not only on the right side of the decision boundary but also on the other side of it. The latter are noisy patterns, which should be identified and removed as much as possible, because they are more likely to be error SVs that would be misclassified. To remove these noisy patterns, we use the second neighborhood property: an overlap pattern or an outlier tends to belong to a different class from its neighbors. If a pattern's own label differs from the majority label of its neighbors, it is likely to be incorrectly labeled. The measure Neighbors Match(x, k) is defined as the ratio of neighbors whose label matches that of x,

Neighbors Match(x, k) = |{x′ | label(x′) = label(x), x′ ∈ kNN(x)}| / k,

where kNN(x) is the set of k nearest neighbors of x. Patterns with a small Neighbors Match(x, k) value are likely to be the ones incorrectly labeled. From that point of view, Neighbors Match can be interpreted as a confidence measure for k-nearest-neighbor classification. Only the patterns satisfying both conditions, Neighbors Entropy(x, k) > 0 and Neighbors Match(x, k) ≥ 1/J, are selected. Figure 4 shows a toy example of how patterns are selected by the proposed measures. The numbers scattered in the figure (1, 2, and 3) stand for the class labels of the patterns. We assume J = 3 and k = 6. For representational simplicity, we consider only six patterns, marked by dotted circles. Table 1 presents the values of Pj, Neighbors Entropy, and Neighbors Match for these marked patterns. Let us consider x1 first. x1 is remote from the decision boundary, belongs to class 1, and is thus surrounded by neighbors belonging to class 1. x1 is not selected, since it does not satisfy the Neighbors Entropy condition. Meanwhile, x2 resides in a deep region
Figure 4: A toy example. The numbers scattered in the figure (1, 2, and 3) stand for the class labels of the patterns. The parameters are assumed to be J = 3 and k = 6. For representational simplicity, we consider only 6 patterns (out of 29), marked by dotted circles. Table 1 lists, for each marked pattern, the values of Pj, Neighbors Entropy, and Neighbors Match, and shows which patterns are selected by the proposed measures.
of class 2, but it is a noisy pattern labeled as class 1. Since the class membership of its neighbors is homogeneous (all of them belong to class 2), it too has a zero value of Neighbors Entropy and is therefore excluded from selection. However, the case of x2 differs from that of x1: all of x2's neighbors belong to a different class from x2, while all of x1's neighbors belong to the same class as x1. We can differentiate the two cases by Neighbors Match, which is 1 for x1 and 0 for x2. In either case, patterns remote from the decision boundary are screened out by the proposed measures. On the other hand, the pattern x3 is close to the decision boundary, and the composition of its neighbors shows heterogeneous class membership: two of them belong to class 1, another two belong to class 2, and the rest belong to class 3. This leads to Neighbors Entropy = 1 and Neighbors Match = 1/3, which satisfies both conditions. Similarly, the patterns x4 and x5 are chosen. From the example, we can see that the selected patterns are distributed near the decision boundary and are almost all correctly labeled. However, NPPS takes O(M²) to evaluate the kNNs of M patterns, so the pattern selection process itself can be time-consuming. To accelerate the pattern selection procedure, we use the third neighborhood property: the neighbors of a pattern located near the decision boundary tend to be
Table 1: A Toy Example: Patterns Selected by Neighbors Entropy and Neighbors Match.

          Class Labels of Neighbors                                 Neighbors  Neighbors  Selected
     j∗   j1  j2  j3  j4  j5  j6    P1    P2    P3    Entropy       Match      Patterns
x1   1    1   1   1   1   1   1    6/6   0/6   0/6    0             6/6
x2   1    2   2   2   2   2   2    0/6   6/6   0/6    0             0/6
x3   2    1   1   2   2   3   3    2/6   2/6   2/6    1             2/6        X
x4   3    3   3   2   2   3   1    1/6   2/6   3/6    0.9227        3/6        X
x5   3    3   3   3   3   3   2    0/6   1/6   5/6    0.4100        5/6        X
x6   2    3   3   3   3   3   2    0/6   1/6   5/6    0.4100        1/6

Notes: j∗ is the class label of pattern xi, and jk is the label of its kth nearest neighbor. Refer to Figure 4.
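The two measures and the selection rule can be sketched in a few lines; the neighbor labels below are taken from rows x1 and x3 of Table 1 (J = 3, k = 6):

```python
import math
from collections import Counter

def neighbors_entropy(nbr_labels, J):
    """Entropy (base J) of the class labels among the k nearest neighbors."""
    k = len(nbr_labels)
    return sum((c / k) * math.log(k / c, J)
               for c in Counter(nbr_labels).values())

def neighbors_match(own_label, nbr_labels):
    """Fraction of neighbors whose label matches the pattern's own label."""
    return sum(1 for l in nbr_labels if l == own_label) / len(nbr_labels)

def selected(own_label, nbr_labels, J):
    """NPPS selection rule: positive entropy and match ratio of at least 1/J."""
    return (neighbors_entropy(nbr_labels, J) > 0
            and neighbors_match(own_label, nbr_labels) >= 1 / J)

# x1: remote from the boundary, homogeneous neighborhood -> not selected.
assert neighbors_entropy([1, 1, 1, 1, 1, 1], J=3) == 0
assert not selected(1, [1, 1, 1, 1, 1, 1], J=3)
# x3: near the boundary, maximally heterogeneous neighborhood -> selected.
assert abs(neighbors_entropy([1, 1, 2, 2, 3, 3], J=3) - 1.0) < 1e-12
assert selected(2, [1, 1, 2, 2, 3, 3], J=3)
```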
NPPS(D, k) {
  [0] Initialize D_e^0 with randomly chosen patterns from D. Constants k and J are given.
      Initialize i and the various sets: i ← 0, S_o^0 ← ∅, S_x^0 ← ∅, S^0 ← ∅.
  while D_e^i ≠ ∅ do
    [1] Choose x satisfying [Expanding Criteria]:
        D_o^i ← {x | Neighbors Entropy(x, k) > 0, x ∈ D_e^i},  D_x^i ← D_e^i − D_o^i.
    [2] Select x satisfying [Selecting Criteria]:
        D_s^i ← {x | Neighbors Match(x, k) ≥ 1/J, x ∈ D_o^i}.
    [3] Update the pattern sets:
        S_o^{i+1} ← S_o^i ∪ D_o^i : the expanded,
        S_x^{i+1} ← S_x^i ∪ D_x^i : the nonexpanded,
        S^{i+1} ← S^i ∪ D_s^i : the selected.
    [4] Compute the next evaluation set:
        D_e^{i+1} ← ∪_{x ∈ D_o^i} kNN(x) − (S_o^{i+1} ∪ S_x^{i+1}).
    [5] i ← i + 1.
  end
  return S^i
}

Figure 5: NPPS algorithm.
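A minimal executable sketch of the loop in Figure 5 (brute-force kNN, a hypothetical 2-D grid data set, and a fixed seed set standing in for the random initialization of D_e^0):

```python
import math
from collections import Counter

def knn(i, data, k):
    """Indices of the k nearest neighbors of data[i] (brute-force L1
    distance, ties broken by index)."""
    order = sorted((abs(p[0] - data[i][0]) + abs(p[1] - data[i][1]), j)
                   for j, p in enumerate(data) if j != i)
    return [j for _, j in order[:k]]

def neighbors_entropy(i, data, labels, k, J):
    counts = Counter(labels[j] for j in knn(i, data, k))
    return sum((c / k) * math.log(k / c, J) for c in counts.values())

def neighbors_match(i, data, labels, k):
    return sum(1 for j in knn(i, data, k) if labels[j] == labels[i]) / k

def npps(data, labels, k, J, seed_set):
    evaluate = set(seed_set)              # D_e^0
    s_o, s_x, s = set(), set(), set()     # expanded / nonexpanded / selected
    while evaluate:
        d_o = {i for i in evaluate
               if neighbors_entropy(i, data, labels, k, J) > 0}
        d_x = evaluate - d_o
        d_s = {i for i in d_o if neighbors_match(i, data, labels, k) >= 1 / J}
        s_o |= d_o; s_x |= d_x; s |= d_s
        # Next evaluation set: neighbors of expanded patterns not yet seen.
        evaluate = {j for i in d_o for j in knn(i, data, k)} - (s_o | s_x)
    return s

# Hypothetical data: a 20 x 10 integer grid; class 1 for x0 >= 10, class 2
# otherwise, so the decision boundary runs between columns 9 and 10.
data = [(i % 20, i // 20) for i in range(200)]
labels = [1 if p[0] >= 10 else 2 for p in data]
sel = npps(data, labels, k=4, J=2, seed_set={0, 9, 45})
# Only patterns hugging the boundary are selected; remote seeds are discarded.
```

Note how the expansion step crawls along the boundary: homogeneous seeds are filed as nonexpanded at once, so only the overlap region is ever visited.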
Table 2: Notation.

Symbol    Meaning
D         The original training set, whose cardinality is M
D_e^i     The evaluation set at the ith step
D_o^i     The subset of D_e^i to be expanded: each of its elements computes its k nearest
          neighbors to constitute the next evaluation set, D_e^{i+1}
D_x^i     The subset of D_e^i not to be expanded, D_x^i = D_e^i − D_o^i
D_s^i     The set of patterns selected from D_o^i at the ith step
S_o^i     The accumulated set of expanded patterns, ∪_{j=0..i−1} D_o^j
S_x^i     The accumulated set of nonexpanded patterns, ∪_{j=0..i−1} D_x^j
S^i       The accumulated set of selected patterns, ∪_{j=0..i−1} D_s^j; the last of these,
          S^N, is the reduced training pattern set
kNN(x)    The set of k nearest neighbors of x
located near the decision boundary as well. Assuming this property, one may narrow the scope of computation from all the training patterns to the patterns near the decision boundary. Only the neighbors of a pattern satisfying Neighbors Entropy(x, k) > 0 are evaluated in the next step. This lazy evaluation can reduce the practical time complexity from O(M²) to O(vM), where v is the number of patterns in the overlap region. In most practical problems, v ≪ M holds. In addition, any efficient nearest-neighbor search algorithm can be incorporated into the proposed method to further reduce the time complexity. There is a considerable amount of literature on efficient nearest-neighbor search. Some of it attempts to save distance computation time (Grother, Candela, & Blue, 1997; Short & Fukunaga, 1981), and some attempts to avoid redundant search time (Bentley, 1975; Friedman, Bentley, & Finkel, 1977; Guttman, 1984; Masuyama, Kudo, Toyama, & Shimbo, 1999). Apart from these, there are many sophisticated NN classification algorithms, such as approximated (Arya, Mount, Netanyahu, Silverman, & Wu, 1998; Indyk, 1998), condensed (Hart, 1968), and reduced (Gates, 1972) nearest neighbors. (See also Berchtold, Böhm, Keim, & Kriegel, 1997; Borodin, Ostrovsky, & Rabani, 1999; Ferri, Albert, & Vidal, 1999; Johnson, Ladner, & Riskin, 2000; Kleinberg, 1997; Tsaparas, 1999.) The time complexity analysis for the fast NPPS can be found online at http://www.kyb.mpg.de/publication.html?user=shin and in Shin and Cho (2003a), and a brief empirical result is given in appendix A. The algorithm and related notation are shown in Figure 5 and Table 2.

4 How to Determine the Number of Neighbors

In this section, we briefly introduce a heuristic for determining the value of k. Too large a value of k results in too many patterns being selected.
Consequently, we would achieve little effect from pattern selection. Too small a value of k leads to few patterns being selected, but it may degrade SVM accuracy, since there are more chances to miss important patterns, the would-be support vectors. The dilemma about k is not symmetrical, because a serious loss of accuracy is not likely to be compensated for by the benefit of a downsized training set. This point relates to our idea: the selected pattern set should be large enough to contain at least the patterns in the overlap region. Therefore, we must first estimate how many training patterns reside in the overlap region. Based on that estimate, the next step is to enlarge the value of k until the size of the corresponding set covers the estimate. Under this condition, the minimum value of k will be regarded as optimal unless SVM accuracy degrades. In the following sections, we first identify the overlap region R and the overlap set V. Second, we estimate a lower bound on the cardinality of V. Third, we define the Neighbors Entropy set Bk with respect to the value of k, along with some related properties. Finally, we provide a systematic procedure for determining the value of k.

4.1 Overlap Region and Overlap Set. Consider a two-class classification problem (see Figure 6),

x → C1 if f(x) > 0,
x → C2 if f(x) < 0,  (4.1)
where f(x) is a classifier and f(x) = 0 is the decision boundary. Let us define noisy overlap patterns (NOPs) as the patterns that are located on the wrong side of the decision boundary. They are shown in Figure 6 as squares located above f(x) = 0 and circles located below f(x) = 0. Let R denote the overlap region, a hypothetical region where NOPs reside: the convex hull of the NOPs, enclosed by the dotted lines in Figure 6. Similarly, let us define correct overlap patterns (COPs) as the patterns that are enveloped by the convex hull yet lie on the right class side. Note that R is defined by NOPs, but it contains not only NOPs but also COPs. Let V denote the overlap set, defined as the intersection of D and R (that is, the subset of D that comprises the NOPs and COPs). There are six NOPs and six COPs in Figure 6. The cardinality of V is denoted by v.

4.2 Size Estimation of Overlap Set. Now we estimate v. Let P_R(x) denote the probability that a pattern x falls in the region R. Then we can calculate the expected value of v from the given training set D as

v = M · P_R(x),  (4.2)
Figure 6: Two-class classification problem where the circles belong to class 1 and the squares belong to class 2. The area enclosed by the dotted lines is the overlap region R; the patterns inside it form the overlap set V of NOPs and COPs.
where M is the number of the training patterns, that is, M = |D|. The probability P_R(x) can be dissected class-wise,

P_R(x) = Σ_{j=1..2} P(x ∈ R, Cj),  (4.3)
where P(x ∈ R, Cj) is the joint probability of x belonging to class Cj and lying in R. Note that P_R(x) can also be interpreted as the probability of x being a COP or an NOP. Further, if the region R is divided into

R1 = {x ∈ R | f(x) ≥ 0} and R2 = {x ∈ R | f(x) < 0},  (4.4)

equation 4.3 can be rewritten as

P_R(x) = P(x ∈ R, C1) + P(x ∈ R, C2)
       = P(x ∈ R1 ∪ R2, C1) + P(x ∈ R1 ∪ R2, C2)
       = [P(x ∈ R1, C2) + P(x ∈ R2, C1)]   (a)
       + [P(x ∈ R1, C1) + P(x ∈ R2, C2)].  (b)   (4.5)
In the last row, (a) and (b) denote the probabilities that a pattern located in R is incorrectly and correctly classified, that is, of its being an NOP or a COP, respectively. Since all NOPs are incorrectly classified patterns, (a) can be estimated from the misclassification error rate P_error of the classifier f(x). On the contrary, it is not easy to estimate (b).² Therefore, what we can do in practice is infer a lower bound on it. Generally, the following inequality holds:

P(x ∈ Rj, Cj) ≥ P(x ∈ Rj, Ci),  j ≠ i.  (4.6)

Based on that, equation 4.5 can be simplified to

P_R(x) ≥ 2 P_error,  (4.7)

and the lower bound of v becomes

v ≥ v_LOW = 2 M P_error  (4.8)
from equation 4.2.

4.3 Pattern Set with Positive Neighbors Entropy. Let us define Bk for a given value of k,

Bk = {x | Neighbors Entropy(x, k) > 0, x ∈ D}.  (4.9)

Note that Bk is the union of the D_o^i's, Bk = ∪_{i=0..N} D_o^i, where N is the total number of iterations in the algorithm NPPS (see Figure 5). The following property of Bk leads to a simple procedure.

Lemma 1. A Neighbors Entropy set Bk is a subset of Bk+1:

Bk ⊆ Bk+1,  2 ≤ k ≤ M − 2.  (4.10)
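The containment can also be checked numerically; a sketch on hypothetical 1-D data, using the fact that Neighbors Entropy(x, k) > 0 exactly when the k nearest neighbors of x are not all of one class:

```python
def knn_labels(i, data, labels, k):
    """Labels of the k nearest neighbors of data[i] (1-D, ties by index)."""
    order = sorted((abs(p - data[i]), j) for j, p in enumerate(data) if j != i)
    return [labels[j] for _, j in order[:k]]

def in_B(i, data, labels, k):
    # Neighbors Entropy(x, k) > 0 iff the neighbor labels are heterogeneous.
    return len(set(knn_labels(i, data, labels, k))) > 1

# Hypothetical 1-D data (M = 10) with the two classes interleaved near 0.
data = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.1, 0.4, 0.9, 1.4, 2.1]
labels = [2, 2, 2, 2, 1, 2, 1, 1, 1, 1]

def B(k):
    return {i for i in range(len(data)) if in_B(i, data, labels, k)}

# Lemma 1: B_k is nested in B_{k+1} for 2 <= k <= M - 2.
for k in range(2, len(data) - 2):
    assert B(k) <= B(k + 1)
```

The nesting follows because the (k+1)-nearest-neighbor set contains the k-nearest-neighbor set, so a heterogeneous neighborhood stays heterogeneous when one neighbor is added.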
Proof. Denote P_j^k as the probability that kj out of the k nearest neighbors belong to class Cj. If x ∈ Bk, then Neighbors Entropy(x, k) > 0. A positive Neighbors Entropy is always accompanied by P_j^k = kj/k < 1, ∀j. Therefore, kj < k, ∀j.
² Of course, if R1 and R2 contain roughly the same number of correct and incorrect patterns, the probabilities (a) and (b) become similar. But this assumption will work only when both class distributions follow the uniform distribution.
Adding 1 to both sides yields (kj + 1) < (k + 1), ∀j. Suppose the (k + 1)th nearest neighbor of x belongs to Cj∗. Then (kj∗ + 1) < (k + 1) holds for j∗, while kj < (k + 1) holds for j ≠ j∗. These inequalities lead to P_{j∗}^{k+1} < 1 and P_j^{k+1} < 1 for j ≠ j∗, respectively. As a consequence, x has a positive Neighbors Entropy in the case of the (k + 1)th nearest neighbor as well. Therefore, Neighbors Entropy(x, k + 1) > 0, which indicates x ∈ Bk+1. From lemma 1, it follows that bk, the cardinality of Bk, is an increasing function of k.

4.4 Procedure for Determining the Number of Neighbors. A Bk larger than V merely increases the SVM training time by introducing redundant training patterns. In contrast, a Bk smaller than V could degrade the SVM accuracy. Therefore, our objective is to find the smallest Bk that covers V. Let us define k_min using the lower bound of v from equation 4.8:

k_min = min {k | bk ≥ v_LOW, k ≥ 2}.

From lemma 1, we know that bk is an increasing function of k. Therefore, it is not necessary to evaluate values of k less than k_min. Instead, we check the stabilization of the SVM training error rate P_error^svm only for the k's larger than k_min, increasing the value little by little. The optimal value of k is then chosen as

k∗ = arg min_k { |P_error^svm(k) − P_error^svm(k + 1)| ≤ ε, k ≥ k_min },  (4.11)
where ε is set to a trivial value. Equation 4.11 requires several dry runs of SVM training. However, the runs are not likely to impose a heavy computational burden, since the strong inequality bk ≪ M holds.

x2 > sin(3x1 + 0.8)²,
To make the density near the decision boundary thicker, four different gaussian noises were added along the decision boundary: N(µ, s²I), where µ is an arbitrary point on the decision boundary and s is a gaussian width parameter (s = 0.1, 0.3, 0.8, 1.0). A total of 500 training patterns were generated, including the noise patterns. Figure 11.1b shows the problem. For both problems, Figure 8 presents the relationship between a pattern's distance from the decision boundary and its value of Neighbors Entropy. The closer a pattern is to the decision boundary, the higher its value of Neighbors Entropy. Figures 8a and 8b both confirm that Neighbors Entropy is pertinent for estimating proximity to the decision boundary. Among the patterns with positive Neighbors Entropy values, those that also meet the Neighbors Match condition are selected. They are depicted as filled circles against open ones. Note that the filled circles are distributed nearer the decision boundary. The distribution of the selected patterns is shown in Figures 11.3a and 11.3b. NPPS changes the size as well as the distribution of the training set. Such changes may shift the optimal hyperparameter values that yield the best result. Thus, we first observed the effect of the change in the training set. The SVM performance was measured under various combinations of hyperparameters, (C, σ) ∈ {0.1, 1.0, 10, 100, 1000} × {0.25, 0.5, 1, 2, 3}, where C and σ indicate the misclassification tolerance and the gaussian kernel width, respectively. For each of the artificial problems, 1000 test patterns were generated from distributions statistically identical to the original training set. We compared the test error rates (%) of SVM when trained with all patterns (ALL), with random patterns (RAN), and with the selected patterns (SEL). Figure 9 depicts the test error rate over the hyperparameter variation for the Continuous XOR problem.
The most pronounced feature is the higher sensitivity of SEL to hyperparameter variation compared to ALL. This may be caused by the fact that the patterns selected by NPPS are mostly distributed in the narrow region along the decision boundary. In Figure 9a, for instance, SEL shows a sudden rise in the test error rate for σ larger than a certain value when C is fixed, and similarly, a sharp descent after a certain value of C when σ is fixed. An interesting point is that SEL can always reach
Figure 8: The relationship between distance from the decision boundary and Neighbors Entropy. The closer a pattern is to the decision boundary, the higher its value of Neighbors Entropy. The selected patterns are depicted as filled circles against open ones.
Figure 9: Performance comparison over hyperparameter variation: Continuous XOR problem.
a performance comparable to that of ALL by adjusting the values of the hyperparameters. Now we compare SEL and RAN in Figure 9b. We used the same number of random samples as the number of patterns selected by NPPS. Therefore, there is no difference in the size of the training set, only a difference in data distribution. When compared with SEL, RAN is less
Table 3: Best Result Comparison: ALL vs. RAN vs. SEL.

                              Continuous XOR                        Sine Function
                      ALL        RAN        SEL           ALL        RAN          SEL
(C, σ)                (10, 0.5)  (100, 1)   (100, 0.25)   (10, 0.5)  (0.1, 0.25)  (10, 0.5)
Execution time (sec)  454.83     3.02       3.85          267.76     8.97         8.79
Number of training
  patterns            600        180        180           500        264          264
Number of support
  vectors             167        68         84            250        129          136
Test error (%)        9.67       12.33      9.67          13.33      16.34        12.67
McNemar's test
  (p-value)           —          0.08       1.00          —          0.12         0.89
sensitive to hyperparameter variation, because its patterns are distributed over the whole region. However, note that RAN never performs as well as ALL, since random sampling inevitably misses many patterns of importance to SVM training. See the best result of RAN in Table 3. Figure 10 shows similar results for the sine function problem. Table 3 summarizes the best results of ALL, SEL, and RAN for both problems. First, compared with the training time of ALL, that of SEL is little more than trivial because of the reduced size of the training set. For both artificial problems, a standard QP solver, Gunn's SVM MATLAB toolbox, was used. Considering that model selection always involves multiple trials, each individual reduction in training time can amount to a huge overall saving. Second, compared with RAN, SEL achieved an accuracy on a level similar to the best model of ALL, while RAN could not. To show that there was no significant difference between ALL and SEL, we conducted McNemar's test (Dietterich, 1998). The p-values between ALL and RAN, and between ALL and SEL, are presented in the last row of Table 3. In principle, McNemar's test determines whether classifier A is better than classifier B. A p-value of zero indicates a significant difference between A and B, and a value of one indicates no significant difference. Although the p-value between ALL and RAN is not less than 5% in each problem, one can still compare its degree of difference against the p-value between ALL and SEL in statistical terms. Figure 11 visualizes the experimental results of Table 3 on both problems. The subfigures, 11.1 to 11.3, show the results of ALL, RAN, and SEL, in order. The decision boundary is depicted as a solid line and the margin as a dotted line. Support vectors are outlined. The decision boundary of ALL
Figure 10: Performance comparison over hyperparameter variation: sine function problem.
looks more like that of SEL than that of RAN. That explains why SEL produced results similar to ALL.

5.2 Benchmarking Data Sets. Table 4 presents a summary of the comparison results on various real-world benchmarking data sets (MNIST,
Figure 11: Patterns and SVM decision boundaries. (1a) Continuous XOR, ALL; (1b) sine function, ALL; (2a) Continuous XOR, RAN; (2b) sine function, RAN; (3a) Continuous XOR, SEL; (3b) sine function, SEL. A decision boundary is depicted as a solid line and the margins as dotted lines. Support vectors are outlined. The decision boundary of ALL looks more like that of SEL than that of RAN, which explains why SEL performed similarly to ALL.
Table 4: Empirical Result for Benchmarking Data Sets.

                                   Number of  Number of  Preprocessing  Training      Test Error  McNemar's
                                   Training   Support    Time (Ratio)   Time (Ratio)  Rate (%)    p-value
                                   Patterns   Vectors

Continuous XOR: 1000 test patterns
ALL     C = 10, σ = 0.5              600        167        –              1.00          9.67        –
RAN     C = 100, σ = 1               180         68        –              0.10         12.33        0.08
SEL     C = 100, σ = 0.25, k = 5     180         84        0.20           0.12          9.67        1.00
SVM-KM  C = 1, σ = 0.25, k = 60      308        149        0.78           0.16         10.00        0.69

Sine function: 1000 test patterns
ALL     C = 10, σ = 0.50             500        250        –              1.00         13.33        –
RAN     C = 0.1, σ = 0.25            264        129        –              0.31         16.34        0.12
SEL     C = 10, σ = 0.5, k = 5       264        136        0.26           0.29         12.67        0.89
SVM-KM  C = 50, σ = 0.25, k = 50     335        237        2.04           0.27         14.00        0.81

4 × 4 Checkerboard: 10,000 test patterns
ALL     C = 20, σ = 0.25            1000        172        –              1.00          4.03        –
RAN     C = 50, σ = 0.5              275         75        –              0.42          8.44        0.00
SEL     C = 50, σ = 0.25, k = 4      275        148        0.08           0.40          4.66        0.76
SVM-KM  C = 20, σ = 0.25, k = 100    492        159       49.13           0.33          5.20        0.00

Pima Indian Diabetes: 153 test patterns (5-cv out of 768 patterns)
ALL     C = 100, p = 2               615        330        –              1.00         29.90        –
RAN     C = 1, p = 1                 311        175        –              0.47         31.11        0.61
SEL     C = 100, p = 2, k = 4        311        216        0.13           0.58         30.30        0.87
SVM-KM  C = 100, p = 2, k = 62       561        352        0.48           0.98         28.75        0.72

Wisconsin Breast Cancer: 136 test patterns (5-cv out of 683 patterns)
ALL     C = 5, p = 3                 546         87        –              1.00          6.80        –
RAN     C = 1, p = 1                  96         27        –              0.05         12.33        0.60
SEL     C = 10, p = 3, k = 6          96         41        0.01           0.06          6.70        0.94
SVM-KM  C = 10, p = 3, k = 55        418         85       17.08           0.62         11.02        0.53

MNIST 3–8: 1984 test patterns
ALL     C = 10, p = 5             11,982       1253        –              1.00          0.50        –
RAN     C = 10, p = 5               4089       1024        –              0.15          0.86        0.19
SEL     C = 50, p = 5, k = 50       4089       1024        0.21           0.10          0.45        0.42
SVM-KM  k = 1198                      NA         NA        NA             NA            NA          NA

MNIST 6–8: 1932 test patterns
ALL     C = 10, p = 5             11,769        594        –              1.00          0.25        –
RAN     C = 10, p = 5               1135        185        –              0.03          0.83        0.01
SEL     C = 20, p = 5, k = 50       1135        421        0.20           0.07          0.25        0.67
SVM-KM  k = 1177                      NA         NA        NA             NA            NA          NA

MNIST 9–8: 1983 test patterns
ALL     C = 10, p = 5             11,800        823        –              1.00          0.41        –
RAN     C = 10, p = 5               1997        323        –              0.07          0.99        0.01
SEL     C = 10, p = 5, k = 40       1997        631        0.19           0.09          0.43        0.57
SVM-KM  k = 1180                      NA         NA        NA             NA            NA          NA
1998; UCI, 1995), including the artificial data sets from the previous section. Again, SEL is compared with ALL and RAN, and another method is added as a competitor: SVM-KM (Almeida et al., 2000), a preprocessing method that reduces the training set based on the k-means clustering algorithm. The hyperparameters of SVM were chosen based on the best validation results: C, σ, and p indicate the misclassification tolerance, the width of the RBF kernel, and the order of the polynomial kernel, respectively. The parameter of SEL (NPPS), the number of neighbors k, was determined by the procedure in section 4 and appendix B (see also Shin & Cho, 2003b), and the parameter of SVM-KM, the number of clusters k, was set to 10% of the number of training patterns, as recommended in Almeida et al. (2000). For the MNIST (1998) data set, we chose three binary classification problems; they are known to be the most difficult because of the similar shapes of the digits: for instance, the digit 3 looks similar to the digit 8. Most of the experiments were run on a Pentium 4 at 1.9 GHz with 512 MB RAM; the MNIST problems were run on a Pentium 4 at 3.0 GHz with 1024 MB RAM. A standard QP solver lacks adequate memory capacity to handle the real-world data sets, so we chose an iterative SVM solver, the OSU SVM Classifier Toolbox (Kernel Machines Organization, n.d.), after a brief comparison of scalability with another candidate, RSVM (Lee & Mangasarian, 2001). (Refer to appendix C for a comparison of the OSU SVM Classifier with RSVM.) In Table 4, NA denotes results that are not available. For ease of comparison, the computational time, whether preprocessing or SVM training, is shown as a ratio to the SVM training time of ALL. The two best results per problem are shown in boldface in the test error rate column; similarly, the result statistically most similar to ALL is shown in boldface in the McNemar's p-value column.
The results show that preprocessing by either SEL or SVM-KM is of great benefit to SVM training in reducing the training time. In particular, if multiple runs of SVM are required, the benefit accrues each time. However, SVM-KM took longer than SEL and, worse, ran out of memory on the large-sized problems. Also note that the pairwise test shows that SEL is more similar to ALL than any other method.

5.3 Real-World Marketing Data Set. The proposed algorithm was also applied to a marketing data set from the Direct Marketing Association (n.d.). The data set, DMEF4, has been used in other research (Ha, Cho, & MacLachlan, 2005; Malthouse, 2001, 2002). It is concerned with an upscale gift business that mails general or specialized catalogs to customers. The task is to predict whether a customer will respond to the offering during the test period, September 1992 to December 1992. A customer who received the catalog and buys the product is labeled +1; otherwise, he or she is labeled −1. The training set is based on the period December 1971 to June 1992. There are 101,532 patterns in the data set, each representing the purchase history of a customer. We derived 17 input variables out of the 91 original ones, just as in Ha et al. (2005) and Malthouse (2001).
To show the effectiveness of SEL, we compared it with seven RANs, because random sampling has most commonly been employed when researchers in this field attempt to reduce the size of the training set. Table 5 shows the models: RAN∗ denotes an SVM trained with random samples, where ∗ indicates the percentage of random samples drawn without replacement. Each model was trained and evaluated using fivefold cross-validation. The hyperparameters of SVM were determined from (C, σ) ∈ {0.1, 1, 10, 100, 1000} × {0.25, 0.5, 1, 2, 3}. The number of neighbors (k) for SEL was set to 4. The OSU SVM Classifier was used as the SVM solver (Kernel Machines Organization, n.d.). Typically, a DMEF data set (Direct Marketing Association, n.d.) has a severe class imbalance problem, because the response rate to the retailer's offer is very low: 9.4% for DMEF4. In that case, an ordinary accuracy measure tends to mislead by giving more weight to the heavily represented class. Thus, we used another accuracy measure, the Balanced Correct-classification Rate (BCR), defined as

BCR = (m11/m1) · (m22/m2),
where mi denotes the size of class i and mii is the number of patterns correctly classified into class i. The performance measure is now balanced between the two different class sizes. Figure 12 shows the BCRs of the eight SVM models, and Table 6 presents the details. More patterns result in a higher BCR among the RAN∗'s. However, the training time also increases proportionally to the number of training patterns, peaking at 4820 sec for RAN100. SEL takes only 68 sec, or 129 sec if the NPPS running time is included. Note that one has to perform SVM training several times to find a set of optimal hyperparameters, but NPPS only once, ahead of the whole training procedure. The last column of the table lists the p-value of McNemar's test comparing SEL with each individual RAN∗. There is a statistically significant difference in accuracy between SEL and RAN up to RAN60, but no difference between SEL and RAN80 or RAN100. Overall, SEL achieves almost the same accuracy as RAN80 or RAN100 with only a number of training patterns comparable to RAN10 or RAN20.

6 Conclusions and Discussion

In this letter, we introduced an informative sampling method for the SVM classification task. By preselecting the patterns near the decision boundary, one can relieve the computational burden during SVM training. In the experiments, we compared the performance of the pattern set selected by NPPS (SEL) with that of a random sample set (RAN) and that of the original training set (ALL). We also compared the proposed method with a
Table 5: SVM Models.

Model    (C, σ)       Number of Patterns
RAN05    (100, 1)       4060 (5%)
RAN10    (100, 1)       8121 (10%)
RAN20    (100, 0.5)    16,244 (20%)
RAN40    (100, 0.5)    32,490 (40%)
RAN60    (10, 1)       48,734 (60%)
RAN80    (10, 0.5)     64,980 (80%)
RAN100   (10, 0.5)     81,226 (100%)
SEL      (10, 0.5)      8871 (average)

Notes: RAN∗ denotes an SVM trained with random samples, where ∗ indicates the percentage of random sampling. Note that RAN100 corresponds to ALL. The number of patterns of SEL varies slightly with the given set of each fold of 5-CV and is thus reported as an average.
Neighborhood Property–Based Pattern Selection for SVMs
Figure 12: Balanced Correct-classification Rate (BCR): RAN∗ is depicted as a filled circle, and SEL is represented as a dotted reference line.

Table 6: Empirical Result for a Real-World Marketing Data Set, DMEF4.

Model    Number of Training Patterns   Number of Support Vectors   Training Time (sec)   Correct Rate, BCR (%)   McNemar's Test (p-value)
RAN05    4060      1975     13.72     49.49    0.00
RAN10    8121      4194     56.67     56.17    0.00
RAN20    16,244    7463     149.42    58.05    0.00
RAN40    32,490    14,967   652.11    61.02    0.04
RAN60    48,734    22,193   1622.06   63.86    0.08
RAN80    64,980    28,968   2906.97   64.92    0.64
RAN100   81,226    35,529   4820.06   65.17    0.87
SEL      8871      6624     68.29     64.92    –
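Both summary statistics in Table 6 can be computed directly from a 2 × 2 confusion matrix and the paired disagreement counts. The sketch below is ours, not the authors' code: it assumes the common geometric-mean form of BCR (the exact formula is defined earlier in the letter) and the standard continuity-corrected chi-square version of McNemar's test.

```python
import math

def bcr(m11, m1, m22, m2):
    """Balanced correct-classification rate, here assumed to be the
    geometric mean of the two per-class accuracies m11/m1 and m22/m2."""
    return math.sqrt((m11 / m1) * (m22 / m2))

def mcnemar_p(b, c):
    """McNemar's test on paired classifier disagreements.
    b: patterns model A got right and model B got wrong; c: the reverse.
    Returns the p-value of the continuity-corrected statistic
    chi2 = (|b - c| - 1)^2 / (b + c), chi-square with 1 degree of freedom."""
    if b + c == 0:
        return 1.0
    chi2 = (abs(b - c) - 1.0) ** 2 / (b + c)
    # Survival function of chi-square(1): P(X > chi2) = erfc(sqrt(chi2 / 2))
    return math.erfc(math.sqrt(chi2 / 2.0))
```

A large p-value (as for SEL versus RAN80 or RAN100 in Table 6) means the two classifiers' error patterns are statistically indistinguishable.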
competing method, SVM-KM (Almeida et al., 2000). Through comparisons on synthetic and real-world problems, we empirically validated the efficiency of the proposed algorithm. SEL achieved an accuracy similar to that of ALL, while its computational cost remained similar to that of RAN with 10% or 20% of the samples. When compared with SVM-KM, SEL achieved better computational efficiency.

Here, we would like to address some future work. First, SVM solves a multiclass problem by a divide-and-combine strategy, which divides the multiclass problem into several binary subproblems (e.g., one-versus-others or one-versus-one) and then combines the outputs. This has led us to apply NPPS to binary class problems. However, NPPS can readily be extended to multiclass problems without major modification. Second, NPPS
can also be utilized to reduce the lengthy training time of neural network classifiers. But it is necessary to add extra correct patterns to the selected pattern set to enhance the overlap region near the decision boundary (Choi & Rockett, 2002; Hara & Nakayama, 2000). The rationale is that overlap patterns located on the "wrong" side of the decision boundary would lengthen MLP training: the derivatives of the backpropagated errors are very small when evaluated at those patterns, since they are grouped in a narrow region on either side of the decision boundary. By adding extra correct patterns, the network training converges faster. In the meantime, the idea of adding some randomly selected patterns to SEL may also relieve the higher sensitivity of SEL to hyperparameter variation (see Figure 9). If the high sensitivity is caused by the "narrow" distribution of the selected patterns, it can be relaxed by adding some random patterns from the "overall" input space. But the mixing ratio of the patterns from SEL and from RAN requires further study. Third, the current version of NPPS works for classification problems only and thus is not applicable to regression problems. A straightforward idea for regression would be to use the mean (µ) and variance (σ²) of the k nearest neighbors' outputs. A pattern whose neighborhood has a small value of σ² can be replaced by the µ of its neighbors, including itself; that is, k + 1 patterns can be replaced by one pattern. On the contrary, a pattern with a large value of σ² can be eliminated entirely, since in a regression problem the patterns located away from the major group, such as outliers, are less important to learning. But its neighbors should still be used to explore the next pattern. Similar research based on ensemble neural networks was conducted by Shin and Cho (2001).
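The regression extension outlined above can be prototyped directly. The following is a hypothetical illustration only (the function name, the neighborhood size k, and the variance threshold are ours, not part of NPPS): low-variance neighborhoods collapse into a single mean pattern, and high-variance patterns are dropped as outlier-like.

```python
def condense_for_regression(X, y, k=3, var_threshold=0.01):
    """Sketch of a regression variant of pattern condensing (not the
    authors' algorithm): a pattern whose k nearest neighbors have low
    output variance is replaced, together with those neighbors, by one
    averaged pattern; a pattern with large neighborhood variance is dropped."""
    n = len(X)
    condensed_X, condensed_y = [], []
    absorbed = set()
    for i in range(n):
        if i in absorbed:
            continue
        # k nearest neighbors of X[i] by squared Euclidean distance
        order = sorted(range(n),
                       key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
        group = order[: k + 1]  # the pattern itself plus its k neighbors
        outputs = [y[j] for j in group]
        mean = sum(outputs) / len(outputs)
        var = sum((o - mean) ** 2 for o in outputs) / len(outputs)
        if var <= var_threshold:
            # k + 1 patterns collapse into one averaged pattern
            dim = len(X[i])
            condensed_X.append(tuple(sum(X[j][d] for j in group) / len(group)
                                     for d in range(dim)))
            condensed_y.append(mean)
            absorbed.update(group)
        # else: treat the pattern as an outlier and drop it, leaving its
        # neighbors available to later neighborhoods
    return condensed_X, condensed_y
```

The mixing-ratio and threshold questions raised in the text apply unchanged to this sketch.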
Fourth, an interesting future direction is to extend NPPS to data with concept drift (Aha, Kibler, & Albert, 1991; Bartlett, Ben-David, & Kulkarni, 2000; Helmbold & Long, 1994; Klinkenberg, 2004; Pratt & Tschapek, 2003; Stanley, 2003; Widmer & Kubat, 1996). In concept drift, the decision boundary changes as data arrive in the form of a stream. A naive approach would be to employ a moving window over the data stream and then run NPPS repeatedly. For better results, NPPS should be substantially modified and rigorously tested.

Appendix A: Empirical Complexity Analysis

We empirically show that NPPS runs in time approximately proportional to vM, where M is the number of training patterns and v is the number of overlap patterns (a full theoretical analysis is available in Shin & Cho, 2003a). A total of M patterns, half from each class, were randomly generated from a pair of two-dimensional uniform distributions:
C1 = { x | U( (0 − v/(2M)) < x1 < (1 − v/(2M)), −1 < x2 < 1 ) },

C2 = { x | U( (−1 + v/(2M)) < x1 < (0 + v/(2M)), −1 < x2 < 1 ) }.
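A sketch of the sampling procedure, assuming the bounds as written above (the v/(2M) offset makes the overlap strip |x1| < v/(2M) contain about v patterns in total, since each class places a fraction v/M of its M/2 patterns there):

```python
import random

def sample_classes(M, v, seed=0):
    """Draw M/2 patterns per class from the two overlapping 2D uniforms.
    C1: x1 ~ U(0 - v/(2M), 1 - v/(2M));  C2: x1 ~ U(-1 + v/(2M), 0 + v/(2M));
    x2 ~ U(-1, 1) for both classes."""
    rng = random.Random(seed)
    d = v / (2.0 * M)
    C1 = [(rng.uniform(-d, 1 - d), rng.uniform(-1, 1)) for _ in range(M // 2)]
    C2 = [(rng.uniform(-1 + d, d), rng.uniform(-1, 1)) for _ in range(M // 2)]
    return C1, C2
```

With M = 10,000 and v = 3000, roughly 3000 of the generated patterns fall inside the overlap strip, matching the construction in Figure 13.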
Figure 13: Overlap of two uniform distributions. The dark gray area is the overlap region that contains v patterns. The number of overlap patterns, v, is set to every decile of training set size, M. (a, b, c) v = 0.3M, v = 0.5M, and v = 0.7M, respectively.
We set v to every decile of M: v = 0, 0.1M, 0.2M, . . . , 0.9M, M. Figure 13 shows the distributions for v = 0.3M, v = 0.5M, and v = 0.7M; larger values of v correspond to more overlap patterns. We set out to see how the computation time of NPPS varies with v—in particular, whether the computation time is linearly proportional to v as
Figure 14: Empirical complexity analysis: NPPS with an increasing number of overlap patterns v. (a) Number of selected patterns (%); (b) computation time.
Figure 15: SVM test error rates. (a) Pima Indians Diabetes (k∗ = 4); (b) Wisconsin Breast Cancer (k∗ = 6). The error rate stabilizes at about 30.0% for Pima Indians Diabetes when k ≥ 4 and at about 6.7% for Wisconsin Breast Cancer when k ≥ 6.
Table 7: Empirical Comparison on Different SVM Solvers.

Solver (parameters)           Number of Training Patterns   Time Ratio: Training   Test Error Rate (%)   McNemar's p-Value

Continuous XOR: 1000 test patterns
OSU-SVM (C = 10, σ = 0.5)     600       1.00    9.67    –
RSVM (C = 10, σ = 0.5)        600       1.20    9.97    0.79

Sine function: 1000 test patterns
OSU-SVM (C = 10, σ = 0.50)    500       1.00    13.33   –
RSVM (C = 100, σ = 0.25)      500       0.31    13.46   0.85

4 × 4 checkerboard: 10,000 test patterns
OSU-SVM (C = 20, σ = 0.25)    1000      1.00    4.03    –
RSVM (C = 100, σ = 0.25)      1000      1.87    3.88    0.00

Pima Indians Diabetes: 153 test patterns (5-cv out of 768 patterns)
OSU-SVM (C = 100, p = 2)      615       1.00    29.90   –
RSVM (C = 10, p = 1)          615       0.10    29.64   0.72

Wisconsin Breast Cancer: 136 test patterns (5-cv out of 683 patterns)
OSU-SVM (C = 5, p = 3)        546       1.00    6.80    –
RSVM (C = 10, p = 3)          546       1.65    8.02    0.45

MNIST 3–8: 1984 test patterns
OSU-SVM (C = 10, p = 5)       11,982    1.00    0.50    –
RSVM                          11,982    N/A     N/A     N/A

MNIST 6–8: 1932 test patterns
OSU-SVM (C = 10, p = 5)       11,769    1.00    0.25    –
RSVM                          11,769    N/A     N/A     N/A

MNIST 9–8: 1983 test patterns
OSU-SVM (C = 10, p = 5)       11,800    1.00    0.41    –
RSVM                          11,800    N/A     N/A     N/A
mentioned. Figure 14a shows the number of selected patterns as v grows from 0 up to M = 10,000. In Figure 14b, one can clearly see that the computation time is proportional to v.

Appendix B: Experiments on the Number of Neighbors

Based on the procedure presented in Figure 7, we briefly present experiments on Pima Indians Diabetes and Wisconsin Breast Cancer. According to the procedure, the lower bounds of v were estimated as 393 and 108 from the training error rates, Perror = 32.0% and Perror = 9.9%, respectively. These led to k∗ = 4 for Pima Indians Diabetes and k∗ = 6 for Wisconsin Breast Cancer. (The values of kmin were also 4 and 6, respectively.) Figure 15
shows that the test error rate stabilizes at about 30.0% for Pima Indians Diabetes when k ≥ 4, and at about 6.7% for Wisconsin Breast Cancer when k ≥ 6.

Appendix C: Empirical Comparison of SVM Solvers

We provide a short comparison between OSU-SVM (OSU SVM Classifier, Kernel Machines Organization) and RSVM (Lee & Mangasarian, 2001), which are among the fastest known SVM learning algorithms. All of the experimental settings were as described in section 5.2. The parameter of RSVM, the random sampling ratio, was set to 5% to 10% of the training patterns, following Lee and Mangasarian. In Table 7, the computational time of RSVM is shown as a ratio to the SVM training time of OSU-SVM. The results show that the two algorithms achieve similar accuracy. In training time as well, it is hard to tell which is superior. However, RSVM was not available for the MNIST (1998) problems: it failed to load the kernel matrix into memory (note that RSVM is not an iterative solver). We therefore chose OSU-SVM as the base learner in our experiments because it is relatively more scalable and also parameter free.

Acknowledgments

H.S. was partially supported by the grant for Post Brain Korea 21. S.C. was supported by grant no. R01-2005-000-103900-0 from the Basic Research Program of the Korea Science and Engineering Foundation, Brain Korea 21, and the Engineering Research Institute of SNU.

References

Aha, D., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6(1), 37–66.

Almeida, M. B., Braga, A., & Braga, J. P. (2000). SVM-KM: Speeding SVMs learning with a priori cluster selection and k-means. In Proc. of the 6th Brazilian Symposium on Neural Networks (pp. 162–167). Washington, DC: IEEE Computer Society.

Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., & Wu, A. Y. (1998). An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 45(6), 891–923.
Bartlett, P., Ben-David, S., & Kulkarni, S. (2000). Learning changing concepts by exploiting the structure of change. Machine Learning, 41, 153–174.

Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, 18, 509–517.

Berchtold, S., Böhm, C., Keim, D. A., & Kriegel, H.-P. (1997). A cost model for nearest neighbor search in high-dimensional data space. In Proc. of the Annual ACM Symposium on Principles of Database Systems (pp. 78–86). New York: ACM.
Borodin, A., Ostrovsky, R., & Rabani, Y. (1999). Lower bound for high dimensional nearest neighbor search and related problems. In Proc. of the 31st ACM Symposium on Theory of Computing (pp. 312–321). New York: ACM.

Brinker, K. (2003). Incorporating diversity in active learning with support vector machines. In Proc. of the 20th International Conference on Machine Learning (ICML) (pp. 59–66). Menlo Park, CA: AAAI Press.

Byun, H., & Lee, S. (2002). Applications of support vector machines for pattern recognition: A survey. Lecture Notes in Computer Science, 2388, 213–236.

Campbell, C., Cristianini, N., & Smola, A. J. (2000). Query learning with large margin classifiers. In Proc. of the 17th International Conference on Machine Learning (ICML) (pp. 111–118). San Francisco: Morgan Kaufmann.

Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 409–415). Cambridge, MA: MIT Press.

Choi, S. H., & Rockett, P. (2002). The training of neural classifiers with condensed dataset. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, 32(2), 202–207.

Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge: Cambridge University Press.

Dietterich, T. G. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10, 1895–1923.

Direct Marketing Association. (N.d.). DMEF data set library. Available online at http://www.the-dma.org/dmef/dmefdset.shtml.

Dumais, S. (1998). Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), 21–23.

Ferri, F. J., Albert, J. V., & Vidal, E. (1999). Considerations about sample-size sensitivity of a family of edited nearest-neighbor rules. IEEE Transactions on Systems, Man, and Cybernetics—Part B: Cybernetics, 29(4), 667–672.
Friedman, J. H., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3), 209–226.

Gates, G. W. (1972). The reduced nearest neighbor rule. IEEE Transactions on Information Theory, IT-18(3), 431–433.

Grother, P. J., Candela, G. T., & Blue, J. L. (1997). Fast implementations of nearest neighbor classifiers. Pattern Recognition, 30(3), 459–465.

Guttman, A. (1984). R-trees: A dynamic index structure for spatial searching. In Proc. of the ACM SIGMOD International Conference on Management of Data (pp. 47–57). New York: ACM.

Ha, K., Cho, S., & MacLachlan, D. (2005). Response models based on bagging neural networks. Journal of Interactive Marketing, 19, 17–30.

Hara, K., & Nakayama, K. (2000). A training method with small computation for classification. In Proc. of the IEEE-INNS-ENNS International Joint Conference (Vol. 3, pp. 543–548). Washington, DC: IEEE Computer Society.

Hart, P. E. (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14(3), 515–516.

Hearst, M. A., Schölkopf, B., Dumais, S., Osuna, E., & Platt, J. (1997). Trends and controversies—support vector machines. IEEE Intelligent Systems, 13, 18–28.
Heisele, B., Poggio, T., & Pontil, M. (2000). Face detection in still gray images (Tech. Rep. No. 1687). Cambridge, MA: MIT AI Lab.

Helmbold, D., & Long, P. (1994). Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1), 27–46.

Hush, D., Kelly, P., Scovel, C., & Steinwart, I. (2006). QP algorithms with guaranteed accuracy and run time for support vector machines. Journal of Machine Learning Research, 7, 733–769.

Indyk, P. (1998). On approximate nearest neighbors in non-Euclidean spaces. In Proc. of the IEEE Symposium on Foundations of Computer Science (pp. 148–155). Washington, DC: IEEE Computer Society.

Joachims, T. (2002). Learning to classify text using support vector machines: Methods, theory and algorithms. Norwell, MA: Kluwer.

Johnson, M. H., Ladner, R. E., & Riskin, E. A. (2000). Fast nearest neighbor search of entropy-constrained vector quantization. IEEE Transactions on Image Processing, 9(9), 1435–1437.

Kernel Machines Organization. (N.d.). Available online at http://www.kernelmachines.org/.

Kleinberg, J. M. (1997). Two algorithms for nearest-neighbor search in high dimensions. In Proc. of the 29th ACM Symposium on Theory of Computing (pp. 599–608). New York: ACM.

Klinkenberg, R. (2004). Learning drifting concepts: Example selection vs. example weighting. Intelligent Data Analysis, 8(3), 281–300.

Koggalage, R., & Halgamuge, S. (2004). Reducing the number of training samples for fast support vector machine classification. Neural Information Processing—Letters and Reviews, 2(3), 57–65.

Laskov, P. (2002). Feasible direction decomposition algorithms for training support vector machines. Machine Learning, 46, 315–349.

Lee, Y., & Mangasarian, O. L. (2001). RSVM: Reduced support vector machines. In Proc. of the SIAM International Conference on Data Mining. Available online at http://www.cs.wisc.edu/∼olvi/olvi.html.

Liu, C. L., & Nakagawa, M. (2001). Evaluation of prototype learning algorithms for nearest-neighbor classifier in application to handwritten character recognition. Pattern Recognition, 34, 601–615.

Lyhyaoui, A., Martinez, M., Mora, I., Vazquez, M., Sancho, J., & Figueiras-Vidal, A. R. (1999). Sample selection via clustering to construct support vector-like classifiers. IEEE Transactions on Neural Networks, 10(6), 1474–1481.

Malthouse, E. C. (2001). Assessing the performance of direct marketing models. Journal of Interactive Marketing, 15(1), 49–62.

Malthouse, E. C. (2002). Performance-based variable selection for scoring models. Journal of Interactive Marketing, 16(4), 10–23.

Masuyama, N., Kudo, M., Toyama, J., & Shimbo, M. (1999). Termination conditions for a fast k-nearest neighbor method. In Proc. of the 3rd International Conference on Knowledge-Based Intelligent Information Engineering Systems (pp. 443–446). New York: IEEE Computer Society.

MNIST. (1998). The MNIST database of handwritten digits. Available online at http://yann.lecun.com/exdb/mnist/.
Moghaddam, B., & Yang, M. H. (2000). Gender classification with support vector machines. In Proc. of the International Conference on Pattern Recognition (pp. 306–311). Washington, DC: IEEE Computer Society.

Ong, C. S., Smola, A. J., & Williamson, R. C. (2005). Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6, 1045–1071.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: An application to face detection. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 130–136). New York: IEEE Computer Society.

Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.

Pontil, M., & Verri, A. (1998). Properties of support vector machines. Neural Computation, 10, 955–974.

Pratt, K. B., & Tschapek, G. (2003). Visualizing concept drift. In Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 735–740). New York: ACM.

Schohn, G., & Cohn, D. (2000). Less is more: Active learning with support vector machines. In Proc. of the 17th International Conference on Machine Learning (ICML) (pp. 839–846). San Francisco: Morgan Kaufmann.

Schölkopf, B., Burges, C. J. C., & Vapnik, V. (1995). Extracting support data for a given task. In Proc. of the 1st International Conference on Knowledge Discovery and Data Mining (pp. 252–257). Menlo Park, CA: AAAI Press.

Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.

Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.

Shin, H. J., & Cho, S. (2001). Pattern selection using the bias and variance of ensemble. Journal of the Korean Institute of Industrial Engineers, 28(1), 112–127.

Shin, H. J., & Cho, S. (2003a). Fast pattern selection algorithm for support vector classifiers: Time complexity analysis (pp. 1008–1015). Available online at http://www.kyb.mpg.de/publication.html?user=shin.

Shin, H. J., & Cho, S. (2003b). How many neighbors to consider in pattern preselection for support vector classifiers? In Proc. of the International Joint Conference on Neural Networks (IJCNN) (pp. 565–570). Washington, DC: IEEE Computer Society.

Short, R. D., & Fukunaga, K. (1981). The optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, IT-27(5), 622–627.

Sohn, S., & Dagli, C. H. (2001). Advantages of using fuzzy class memberships in self-organizing map and support vector machines. In Proc. of the International Joint Conference on Neural Networks (IJCNN).

Sonnenburg, S., Rätsch, G., Schäfer, S., & Schölkopf, B. (2006). Large scale multiple kernel learning. Journal of Machine Learning Research, 7, 1531–1565.

Stanley, K. O. (2003). Learning concept drift with a committee of decision trees (Tech. Rep. No. UT-AI-TR-03-302). Austin: Department of Computer Sciences, University of Texas at Austin.

Tsaparas, P. (1999). Nearest neighbor search in multidimensional spaces (Tech. Rep. No. 319-02). Toronto: Department of Computer Science, University of Toronto.
UCI. (1995). UCI repository of machine learning databases and domain theories. Available online at http://www.ics.uci.edu/∼mlearn/.

Vapnik, V. (1999). The nature of statistical learning theory (2nd ed.). Berlin: Springer-Verlag.

Widmer, G., & Kubat, M. (1996). Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1), 69–101.

Zheng, S. F., Lu, X. F., Zheng, N. N., & Xu, W. P. (2003). Unsupervised clustering based reduced support vector machines. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (Vol. 2, pp. 821–824). Washington, DC: IEEE Computer Society.
Received January 6, 2006; accepted July 24, 2006.
LETTER
Communicated by Marco Saerens
Transductive Methods for the Distributed Ensemble Classification Problem

David J. Miller
[email protected]

Siddharth Pal
[email protected]

Department of Electrical Engineering, Pennsylvania State University, University Park, PA 16802-2701
We consider ensemble classification for the case where there is no common labeled training data for jointly designing the individual classifiers and the function that aggregates their decisions. This problem, which we call distributed ensemble classification, applies when individual classifiers operate (perhaps remotely) on different sensing modalities and when combining proprietary or legacy classifiers. The conventional wisdom in this case is to apply fixed rules of combination such as voting methods or rules for aggregating probabilities. Alternatively, we take a transductive approach, optimizing the combining rule for an objective function measured on the unlabeled batch of test data. We propose maximum likelihood (ML) objectives that are shown to yield well-known forms of probabilistic aggregation, albeit with iterative, expectation-maximization-based adjustment to account for mismatch between class priors used by individual classifiers and those reflected in the new data batch. These methods are extensions, for the ensemble case, of the work of Saerens, Latinne, and Decaestecker (2002). We also propose an information-theoretic method that generally outperforms the ML methods, better handles classifier redundancies, and addresses some scenarios where the ML methods are not applicable. This method also handles well the case of missing classes in the test batch. On UC Irvine benchmark data, all our methods give improvements in classification accuracy over the use of fixed rules when there is prior mismatch.

1 Introduction

Ensemble classification systems form ultimate decisions by aggregating the (hard or soft) decisions made by classifiers in a collection or ensemble.
Neural Computation 19, 856–884 (2007)
© 2007 Massachusetts Institute of Technology

This paradigm is often motivated by considerations of accuracy, taking into account factors that affect (stand-alone) classifier performance such as limited training data, local optima of the training objective, and a lack of prior knowledge of the class-discriminating features and their underlying
distributions. Diversity of classifiers in an ensemble—in the feature spaces they use, their structures, their parameter initializations, and their training data—can help to reduce inductive bias associated with a single classifier. Ensembles have been justified, for example, under the assumption of statistically independent decision making (Hansen & Salamon, 1990) and from the standpoint of variance reduction (Tumer & Ghosh, 1996).

We can broadly categorize ensemble techniques according to their method of training. For jointly optimized ensembles, the base classifiers and the aggregation function that combines their decisions are jointly designed, using a common set of training (and possibly validation) data. Such methods include boosting (Freund & Schapire, 1996), mixture of experts (Jacobs, Jordan, Nowlan, & Hinton, 1991), and other techniques (e.g., Dietterich & Kong, 1995). A second category focuses on training the aggregation rule, given fixed classifiers. An example is Kang, Kim, and Kim (1997). Finally, there are techniques that do not involve any training of the aggregation rule. These methods apply fixed mathematical rules to combine decisions. Such techniques include hard decision voting (Hansen & Salamon, 1990), applications of Bayes' rule (Xu, Krzyzak, & Suen, 1992), arithmetic averaging (Breiman, 1996), variants of averaging (Miller & Yan, 1999), robust averaging (Tumer & Ghosh, 2002), and Dempster-Shafer theory (Chen, Wang, & Chi, 1997).

Whereas the first two categories of techniques are usually chosen at the designer's discretion, this latter category, without training optimization of the combiner, is often applied of necessity. There are two primary scenarios for which such training appears to be precluded:

1. Legacy or proprietary classifiers. Different organizations may have built classifiers, each using "in-house" data.
Even if organizations are willing to share data, each classifier may use a different feature space.1 In this case, from company A's data (which was collected with a particular feature space in mind), it may not be possible to extract instances of the input features used by company B's classifier. Thus, in general, it will not be possible to form a common pool of labeled data, as required for supervised combiner learning.

2. Distributed multisensor/multimedia classification. Spatially remote sensors may collect measurements at different times, using either the same or different sensing modalities (Landgrebe, 2005). Each sensor may form its own local training set, without any known correspondence (e.g., between labeled examples in sensor A's database and those in sensor B's). In fact, the two sensors, while taking measurements of objects from the same set of classes, may not have sampled
1 There may be some features in common. Even in this case, we would say the classifiers use different feature spaces.
858
D. Miller and S. Pal
any common examples. Even if there are common examples, solving the correspondence problem to identify these examples appears daunting unless the data come with time stamps. A related application is vowel recognition based on speech and mouth images, with separate training sets.

In the above examples, there are no common labeled data across the ensemble of the kind needed for supervised learning of the combining rule. In this letter, we focus on this case, which we refer to as distributed ensemble classification. This type of problem has been recognized before, in its general form (Basak & Kothari, 2004; Miller & Yan, 1999) and in the context of classification over sensor networks (D'Costa, Ramachandran, & Sayeed, 2004). Basak and Kothari (2004) derive a fixed combining rule by applying Bayes' theorem and taking proper account of redundancies in the feature spaces used by the local classifiers. This approach requires communication among all pairs of local classifiers to identify the features in common. D'Costa et al. (2004) derive a fixed combining rule for the case where the local classifiers' feature vectors are jointly gaussian. We emphasize that Basak and Kothari (2004) and D'Costa et al. (2004) apply fixed combining rules.

The conventional wisdom is that without common labeled training data, one must apply a fixed combining strategy. Alternatively, we will show it is both possible and in general desirable to learn the combining rule. We propose a transductive learning strategy (Joachims, 1999; Miller & Uyar, 1997; Saerens, Latinne, & Decaestecker, 2002), wherein an objective measured over the test data batch is optimized with respect to the batch of class decisions.2 Several such objective functions will be proposed. There are two main reasons that we expect transductive learning may outperform fixed combining.
First, consider the very practical case where the class priors seen (in training data) by individual classifiers do not reflect the true class priors, that is, those seen in the test batch. There are a number of possible reasons for such mismatch:
• For the distributed multisensor domain, the class distribution may be spatially varying; for example, each local region (and hence each local classifier) may "see" examples from only a subset of the classes.
• Rare classes may not be represented proportionately in a training data set.
2 Batch transductive learning jointly yields decisions for all samples in the test batch. Alternatively, to avoid sample (but not processing) delays in decision making, we can instead apply transductive learning sample by sample. In this case, the decision on the current sample xt is made (at least in principle) by transductive learning using the entire batch of test samples seen until now, that is, {x1 , x2 , . . . , xt−1 }. While this learning will yield decisions on all samples in the current batch, only the decision on xt is retained, since decisions on past samples have already been made in this (sample-by-sample) case.
• Class priors may be environment dependent.
• Groups of classes that are highly confusable are difficult to hand-label. Thus, ironically, although it is important to have sufficient examples from them, these classes may be underrepresented in a labeled set.
Saerens et al. (2002) developed a transductive maximum likelihood (ML) method to correct class priors for the case of a single classifier. We extend their approach to address distributed ensembles and demonstrate performance gains over fixed combining when there is mismatch between the true priors and those used by individual classifiers. Moreover, we observe that for the likelihood objectives considered in this work and in Saerens et al. (2002), the EM learning is globally convergent, that is, it finds the (unique) maximum likelihood estimate. This was not recognized in Saerens et al. (2002).

A second motivation for a transductive approach is to avoid unnecessary statistical assumptions implicitly made by using a particular combining rule. We will show that arithmetic averaging and naive Bayes rules emerge as the transductive ML-optimal rules when particular assumptions—conditional independence, in the naive Bayes case—are made about the underlying stochastic model that generated the data. As will be seen in our results, conditional independence in particular is a poor assumption when there is statistical redundancy between the decisions produced by individual classifiers. To avoid making unnecessary assumptions, we also propose an information-theoretic transductive technique for aggregation of distributed ensembles. We have found that this method generally outperforms the ML techniques and, in particular, much better accounts for redundancy between classifiers than the ML methods. This method is also applicable in some scenarios where transductive ML techniques cannot be used.

In section 2, we describe the distributed ensemble problem more fully, stating our assumptions and introducing notation. In section 3, we develop two transductive ML techniques. Section 4 develops our information-theoretic method. Section 5 gives experimental results. The letter concludes with a summary and future work.
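The single-classifier prior-correction EM of Saerens et al. (2002) alternates between rescaling the classifier's training-prior posteriors by the ratio of current to training priors (E-step) and averaging the rescaled posteriors to update the prior estimate (M-step). A self-contained sketch (variable names are ours):

```python
def correct_priors(posteriors, train_priors, n_iter=100):
    """EM prior re-estimation in the style of Saerens et al. (2002).
    posteriors: per-sample class posteriors P[c|x] computed under the
    training priors; train_priors: the priors implicit in training.
    Returns (estimated test priors, corrected per-sample posteriors)."""
    n_classes = len(train_priors)
    priors = list(train_priors)
    for _ in range(n_iter):
        # E-step: rescale each posterior by the ratio of current to
        # training priors, then renormalize over classes
        corrected = []
        for p in posteriors:
            w = [p[c] * priors[c] / train_priors[c] for c in range(n_classes)]
            z = sum(w)
            corrected.append([wc / z for wc in w])
        # M-step: new prior estimate is the average corrected posterior
        priors = [sum(p[c] for p in corrected) / len(corrected)
                  for c in range(n_classes)]
    return priors, corrected
```

On a test batch dominated by one class, the estimated priors move away from the (mismatched) uniform training priors toward the batch's empirical class mix.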
2 Assumptions for Distributed Ensemble Classification

Consider an ensemble with Ne classifiers and an aggregation function for combining their decisions. The transductive ML techniques we develop assume each classifier estimates a posteriori probabilities P_j[C = c | x^(j)], c = 1, . . . , Nc, j = 1, . . . , Ne, with Nc the number of classes and x^(j) the feature vector for the jth classifier. Each classifier has its own labeled training set for building an a posteriori model. The priors reflected in this training set will in general be different for each local classifier and, of course, may differ from the true priors reflected in the test data batch, which are unknown.
Consistent with the proprietary scenario, for example, we assume there is no communication between the individual classifiers for discerning common features. As emphasized before, there is no common labeled training data that could be used for supervised learning of the aggregation function. However, during the operational (use) phase, common data are observed (in a synchronized fashion) across the ensemble; that is, for each test object, a corresponding feature vector is measured by each classifier. Thus, the input to the ensemble system is the concatenated vector x = (x^(1), x^(2), . . . , x^(Ne)), but with classifier j seeing only x^(j). We suppose it is not feasible to directly convey the feature vectors x^(j) to the aggregation function; only the classifier decisions (hard or soft) are conveyed. This is consistent with the proprietary scenario and also with the case of high feature dimensionality (and hence high-bandwidth transmission cost). Each classifier may communicate either soft decisions {P_j[C = c | x^(j)], c = 1, . . . , Nc} or maximum a posteriori (MAP) hard decisions.

A key aspect of our approach is that in making decisions, we perform transductive learning on a batch of test samples. As mentioned, one possibility is to make joint decisions on all samples in a given test set batch, Xtest = {x_i, i = 1, . . . , Ntest}. This entails collecting batches of soft (or hard) decisions from each of the individual classifiers, that is, {{P_j[c | x_i^(j)] ∀c, ∀j}, i = 1, . . . , Ntest}. There will be delays in decision making associated with both decision collection and processing by the aggregation function to produce the ensemble decision batch {{Pe[c | x_i]}, i = 1, . . . , Ntest}. For our methods, these delays grow linearly with Ntest. Alternatively, in the sample-by-sample case (see note 2), there is no delay associated with data (i.e., decision) collection.
Finally, we note an important distinction between the ML methods and the information-theoretic method to be developed, for the case where individual classifiers forward hard decisions. For the ML methods, each local classifier must explicitly produce a posteriori probabilities {P j [c|x ( j) ]}, even though only a hard MAP decision will be conveyed. This is required because local MAP decisions cannot be explicitly corrected to account for new estimated priors unless there is access, at least at the local classifiers, if not at the aggregation function, to local a posteriori probabilities. By contrast, as will be seen, the information-theoretic method can be applied and can effectively correct priors even if each local classifier is an inaccessible black box that produces only hard decisions.
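The prior-correction mechanism referred to above can be illustrated with a small numeric sketch (the numbers are hypothetical): a soft posterior can be reweighted for new priors, and the corrected MAP decision can differ from the original hard decision, which by itself could not have been corrected.

```python
import numpy as np

# Hypothetical illustration: P_new[c|x] is proportional to
# P[c|x] * (pi_new[c] / pi_train[c]).
posterior = np.array([0.55, 0.45])     # P_j[c | x^(j)] under training priors
pi_train = np.array([0.5, 0.5])
pi_new = np.array([0.2, 0.8])          # priors estimated on the test batch

corrected = posterior * (pi_new / pi_train)
corrected /= corrected.sum()

print(posterior.argmax())              # original MAP decision: class 0
print(corrected.argmax())              # corrected MAP decision: class 1
```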
3 Transductive Maximum Likelihood Ensembles

Saerens et al. (2002) considered the case of a single a posteriori model P[c|x]. They effectively assumed this model is reasonably accurate except possibly in the class priors implicitly used. They proposed to reestimate class priors so as to maximize the data likelihood measured on the test
Transductive Methods
set batch. Within the expectation-maximization (EM) learning framework (Dempster, Laird, & Rubin, 1977; McLachlan & Peel, 2000), the revised a posteriori class probabilities, based on the current estimates of the class priors, are computed in the E-step, with the class priors reestimated in the M-step. Here, we extend their approach to address distributed ensembles. We consider two underlying stochastic models, each of which will be shown to be consistent with a particular well-known form of ensemble combining.

3.1 Class-Conditional Feature Independence. We denote the class-conditional density (or probability mass function, pmf, in the discrete case) for the random feature vector X^{(j)}, given C = c, by p_j[x^{(j)} | C = c]. Since each local classifier is required only to produce an a posteriori model P_j[c | x^{(j)}], the density function will in general be unspecified.3 Moreover, even if each classifier is explicitly using a class-conditional density for its feature vector, we are not proposing, as in, for example, Miller and Uyar (1997), to reestimate the parameters of these densities within the transductive learning framework. Regardless, the class-conditional densities do appear in our maximum likelihood formulation. In this section, we assume the local feature vectors {X^{(j)}, j = 1, ..., Ne} are conditionally independent, given the class. This is a standard naive Bayes assumption that is often applied in distributed classification and detection contexts (Varshney, 1997). The associated stochastic data generation for each test sample x_i, assumed to be independently generated, is to first randomly select a class k' based on {P_e[C = k]} and then to randomly generate each local feature vector, given the chosen class k', via the density p_j[x_i^{(j)} | C = k'] ∀j. This stochastic generation defines a special mixture model for generating x, with mixture component k representing class k, k = 1, ..., Nc.
Thus, P_e[C = k] is the prior for component (class) k, with {p_j[x_i^{(j)} | C = k], j = 1, ..., Ne} the component model for component (class) k. For this mixture model, applying Bayes rule, the transductive incomplete-data log likelihood for the unlabeled batch test set, Xtest, is

\[
\log L(X_{\rm test}) = \log L(x_1, \ldots, x_{N_{\rm test}})
= \sum_{i=1}^{N_{\rm test}} \log \sum_{k=1}^{N_c} P_e[C = k] \prod_{j=1}^{N_e} p_j\big[x_i^{(j)} \mid C = k\big] \tag{3.1}
\]
\[
= \sum_{i=1}^{N_{\rm test}} \log \sum_{k=1}^{N_c} P_e[C = k] \prod_{j=1}^{N_e} \frac{P_j\big[C = k \mid x_i^{(j)}\big]\, p_j\big[x_i^{(j)}\big]}{P_j[C = k]} \tag{3.2}
\]
\[
= \sum_{i=1}^{N_{\rm test}} \log \prod_{j=1}^{N_e} p_j\big[x_i^{(j)}\big]
+ \sum_{i=1}^{N_{\rm test}} \log \sum_{k=1}^{N_c} P_e[C = k] \prod_{j=1}^{N_e} \frac{P_j\big[C = k \mid x_i^{(j)}\big]}{P_j[C = k]}, \tag{3.3}
\]

3 It will be known if the a posteriori model is derived from an underlying stochastic model for the feature vector.
with the final form obtained using the fact that \(\prod_{j=1}^{N_e} p_j[x_i^{(j)}]\) has no dependence on k. This is to be maximized over the ensemble class priors {P_e[C = k]}. Here, P_j[C = k | x_i^{(j)}] is local classifier j's posterior, P_j[C = k] is its prior (measured based on its training set), and p_j[x_i^{(j)}] is its feature vector density. Note that p_j[x_i^{(j)}] is in general unknown since each classifier is assumed only to produce class posteriors. However, this is not problematic since \(\sum_{i=1}^{N_{\rm test}} \log \prod_{j=1}^{N_e} p_j[x_i^{(j)}]\) has no dependence on the parameters to be optimized. The optimization is accomplished using an EM algorithm whose defining formulation details are given in the appendix. The algorithm amounts to a sequence of iterations, each consisting of (1) reestimation of the ensemble class posteriors on the test data, given fixed ensemble priors (E-step), followed by (2) reestimation of the ensemble class priors, given fixed posteriors (M-step), applied until a convergence condition is met. Each step in the optimization is guaranteed to ascend in log L. The E-step at iteration t is

\[
P_e^{(t)}[C = k \mid x_i] = \frac{P_e^{(t-1)}[C = k] \prod_{j=1}^{N_e} \frac{P_j[C = k \mid x_i^{(j)}]}{P_j[C = k]}}{\sum_{l=1}^{N_c} P_e^{(t-1)}[C = l] \prod_{j=1}^{N_e} \frac{P_j[C = l \mid x_i^{(j)}]}{P_j[C = l]}}, \tag{3.4}
\]
and the M-step is

\[
P_e^{(t)}[C = k] = \frac{1}{N_{\rm test}} \sum_{i=1}^{N_{\rm test}} P_e^{(t)}[C = k \mid x_i], \tag{3.5}
\]
with the initial prior probabilities given as {P_e^{(0)}[C = k]}. We emphasize that in the E-step, P_j[C = k] and P_j[C = k | x_i^{(j)}], the prior and posterior for the jth classifier, remain unchanged throughout the optimization. Note that the ratio P_e^{(t-1)}[C = k]/P_j[C = k] is effecting correction of the posterior P_j[C = k | x_i^{(j)}] to reflect the current estimates of the ensemble class priors measured on the test batch. Note also that the ratio P_j[C = k | x_i^{(j)}]/P_j[C = k] is proportional to the class-conditional density p_j[x_i^{(j)} | C = k]. Thus, as it must,
equation 3.4 takes the naive Bayes posterior form, that is, the form of a model that assumes class-conditional independence of the feature vectors {X^{(j)}, j = 1, ..., Ne}, with class priors given by the transductive ML-estimated ones. At convergence, the E-step is used to compute the ensemble class a posteriori probabilities. Note that the difference between our approach and standard naive Bayes aggregation is that our method learns and, at convergence, uses corrected priors in the naive Bayes rule, as given in equation 3.4. In fact, the solution after the first E-step (without any prior correction) is the standard naive Bayes solution, which we will refer to as the fixed product rule and compare against in our experimental results. It is important to observe that the objective function, equation 3.1, as well as the objective in Saerens et al. (2002), is concave in the parameters {P_e[C = k]} and thus has a unique maximum. Moreover, convergence for this particular problem (ML estimation of mixture proportions, with the parameters of the component densities held fixed) is addressed in Redner and Walker (1984). There it is shown that under mild assumptions (nonzero initialization of mixing proportions for all components, positivity of the mixture density for each sample, linearly independent mixture components with positive support on the sample domain), the EM algorithm is globally convergent to the unique maximum likelihood estimate. Only local optimality was recognized in Saerens et al. (2002). Our method requires each local classifier to convey its test-batch posterior pmfs to the aggregation function. Moreover, each classifier must also convey its class priors {P_j[C = k]}.

3.2 Censored Feature Vectors. A well-known drawback of the product-based rule, equation 3.4, is its sensitivity to unreliable experts: a single classifier that gives low probability to the true class may cause an ensemble error.
Arithmetic averaging is generally believed to be more robust to unreliable "experts." In deriving rule 3.4 within the transductive ML framework, we modeled stochastic generation of x in a standard way, using conditional independence. In this section, we show that an arithmetic averaging rule emerges when a somewhat unconventional stochastic model is assumed. Imagine that for any given test set object, only one of the individual classifiers receives measurements, with the other feature vectors essentially censored. Each classifier is assumed equally likely to be uncensored. More precisely, the stochastic data generation is as follows. For each test sample, one first randomly selects a class based on {P_e[C = k]}, then the uncensored classifier (based on a uniform pmf over the Ne classifiers), and finally the sample, given the chosen class (k') and classifier (j'), that is, via p_{j'}[x_i^{(j')} | C = k']. For this model, the transductive censored incomplete-data
log likelihood, assuming independent test set samples, is

\[
\log L_{\rm cens}(X_{\rm test}) = \sum_{i=1}^{N_{\rm test}} \log \sum_{k=1}^{N_c} P_e[C = k] \, \frac{1}{N_e} \sum_{j=1}^{N_e} p_j\big[x_i^{(j)} \mid C = k\big]. \tag{3.6}
\]
The appendix casts this ML problem within the EM framework. The ensemble a posteriori class probabilities are again reestimated in the E-step, which now takes the arithmetic averaging form:

\[
P_e^{(t)}[C = k \mid x_i] = \frac{P_e^{(t-1)}[C = k] \, \frac{1}{N_e} \sum_{j=1}^{N_e} \frac{P_j[C = k \mid x_i^{(j)}]}{P_j[C = k]}}{\sum_{l=1}^{N_c} P_e^{(t-1)}[C = l] \, \frac{1}{N_e} \sum_{j=1}^{N_e} \frac{P_j[C = l \mid x_i^{(j)}]}{P_j[C = l]}}. \tag{3.7}
\]
The M-step again computes the new class priors using equation 3.5. Again, the ratio P_e^{(t-1)}[C = k]/P_j[C = k] is effecting class prior correction. At convergence, equation 3.7 computes the ensemble a posteriori probabilities. The solution after the first E-step (before prior correction) in this case is the fixed arithmetic averaging rule, which we will compare against in our results. As in the previous section, this EM algorithm is globally convergent. It is important to emphasize that we are not especially advocating the censored model as a good statistical description that captures particular distributed ensemble scenarios.4 We have simply identified the statistical modeling assumption (random censoring) needed for transductive ML to yield the (popular) arithmetic averaging form of combining. In section 5, we experimentally evaluate this transductive technique, along with the product rule method and an alternative method, developed next.
4 An Information-Theoretic Transductive Ensemble

The aggregation technique developed in this section, which we dub the information-theoretic (IT) method, differs in important respects from the ML methods. First, the IT method does not require that individual classifiers produce a posteriori probabilities. Correction of class priors is effectively achieved even if each local classifier only produces hard decisions. Second, we saw in the previous section that well-known aggregation rules emerge when specific statistical assumptions are made. On the one hand, these rules generalize, providing an ensemble decision given any set of local posteriors {P_j[c | x^{(j)}], j = 1, ..., Ne} associated with a new x. That is, suppose there
4 In particular, in our experiments in section 5, there is no actual censoring of feature vectors.
are two test batches, X1 and X2. If transductive ML learning is applied to X1, assuming the class priors in X2 are the same as in X1, decisions can be made on the examples in X2 without performing additional learning.5 On the other hand, the statistical assumptions may not be warranted; in particular, conditional independence fails when there is significant redundancy between classifiers. As an extreme example, consider an ensemble with Ne − 1 identical weak classifiers (i.e., ones that are perfect copies of each other) and one strong classifier, statistically independent of the rest. Clearly, the naive Bayes rule in equation 3.4 will bias in favor of the weak classifiers and give poor accuracy in this case. What is instead desired is that the aggregation function properly account for redundancies. For this example, it should ideally treat the ensemble as effectively consisting of two (independent) classifiers, one strong and one weak. The IT method does not produce a parametric aggregation function; given a test batch, it simply yields a table of class a posteriori pmf values, with each row specifying the pmf for one sample in the test batch. On the one hand, this means there is no automatic rule of generalization for new samples; decisions on new samples can be made only using a new round of transductive learning. On the other hand, this approach does not make the statistical assumptions required for the ML techniques. The IT method can be applied for the case of a single classifier, as considered in Saerens et al. (2002), and in this case, it will not require the statistical assumptions made there. However, IT is especially advantageous for the ensemble case focused on here, where there is a need to account for classifier redundancies. We will demonstrate that IT properly treats redundancy for the stark type of example described above and more generally outperforms the ML methods.
Finally, we note that there is no explicit prior correction in the IT method because it optimizes directly over test batch decisions rather than over prior parameters. Regardless, IT is experimentally observed to achieve good decision accuracy in cases of large prior mismatch; that is, it achieves effective prior correction. The IT method is inspired by frameworks such as maximum entropy (ME) statistical inference (Jaynes, 1989), which treats learning of probability distributions as a problem of encoding equality constraints on a set of judiciously chosen statistics. ME has been justified on statistical and axiomatic grounds, and the solution consistent with given satisfiable linear constraints is unique. Moreover, there are simple ME parameter learning algorithms with guaranteed convergence, such as iterative scaling techniques (e.g., Della Pietra, Della Pietra, & Lafferty, 1997). In other work (Miller & Pal,

5 In principle, X1 ∪ X2 should be used for transductive learning, or, if the class priors are different in the two batches, separate transductive learning should be performed on each. However, learning complexity may preclude this.
2005) we have explicitly cast standard ensemble combining (where there is common training data across the ensemble) as an ME problem with an iterative scaling solution. Unfortunately, for the transductive setting addressed here, the "right" constraints are not guaranteed to be satisfied. Accordingly, we will take a heuristic optimization approach, albeit for a principled goal, choosing ensemble posteriors on the test batch to agree as closely as possible with constraints measured by the local classifiers, as reflected by an IT measure of distance between probability distributions. Casting learning as a problem of constraint encoding is especially advantageous for achieving robustness to classifier redundancy, as can be understood from the following simple argument. Suppose we have already optimized the test set decisions so as to precisely meet the constraints measured by an ensemble of Ne local classifiers. If we increase the ensemble size by one, adding a classifier that is an identical copy of an existing one in the ensemble, this classifier's constraints are perfectly redundant and already met by the current solution. Thus, this new classifier will have no effect on the solution. As will be demonstrated, the IT method handles classifier redundancy in a fashion quite consistent with this description, achieving accuracy nearly invariant to explicit redundancy in the ensemble.

4.1 Choice of Constraints. In general, one would like to encode as much (accurate) statistical constraint information as possible, to learn the most accurate probability model (Jaynes, 1989), that is, joint probabilities on collections of features (e.g., Della Pietra et al., 1997; Miller & Yan, 2000). In our ensemble setting, the local decision Ĉ_j is a discrete-valued feature. Thus, in principle, we might try encoding joint statistics such as fourth-order pmfs P[C, Ĉ_j, Ĉ_k, Ĉ_l].
However, constraint probabilities are usually estimated by training set frequency counts, with accuracy that decreases with increasing order of the probabilities. This limits the orders used in practice. Moreover, in our distributed setting, with no common labeled data, it is not possible to measure joint statistics involving two or more classifiers. Thus, we are limited to encoding pairwise statistics involving C and individual decisions, Ĉ_j. In prior work (Miller & Yan, 2000), pairwise pmfs involving the class label and individual discrete features were encoded into the learning of ME classifiers, and this was found to achieve good accuracy. In our ensemble setting, using its local training set, each classifier j can measure the pairwise pmf P_con^{(j)}[C, Ĉ_j],6 with "con" indicating this pmf is a constraint. Thus, at first glance, it appears reasonable to loosely cast our transductive learning problem as choosing the ensemble posteriors {P_e[C | x_i], i = 1, ..., Ntest} to satisfy the pairwise pmf constraints contributed by each of the local
6 The superscript indicates that the constraint is measured by classifier j.
classifiers, that is, {P_con^{(j)}[C, Ĉ_j], j = 1, ..., Ne}. However, note that the pairwise pmf P_con^{(j)}[C, Ĉ_j] determines the marginal pmfs P_con^{(j)}[C] and P_con^{(j)}[Ĉ_j]. That is, by seeking to agree with the constraint P_con^{(j)}[C, Ĉ_j], one is also trying to force the ensemble test batch posteriors {P_e[C | x_i]} to agree with the (perhaps erroneous, as well as inconsistent) class priors and class decision priors measured by each local classifier, based on its respective training set.7 Thus, rather than encoding the joint pmfs {P_con^{(j)}[C, Ĉ_j]}, we instead suggest encoding the conditional pmfs {P_con^{(j)}[Ĉ_j | C = c] ∀c}. These constraint pmfs do not specify the class priors, and they specify the pairwise pmf except for the prior. They are thus as close as we can get to encoding pairwise pmfs.8 Note also that the pmfs {P_con^{(j)}[Ĉ_j | C = c] ∀c} define the confusion matrix for classifier j (measured based on its training set). We believe it is quite reasonable to expect these statistics should be preserved as measured on the test batch. In Saerens et al. (2002) and previous work cited there, the confusion matrix was used in a procedure to reestimate class priors for the test batch. These priors were then used to correct posteriors in essentially the same way they are corrected within our ML algorithms. Although it is possible we could also use {P_con^{(j)}[Ĉ_j | C = c] ∀c} in this way, such an approach would be similar to the methods suggested in the previous section and would address only prior correction. We instead use the confusion matrix information from each local classifier to define a set of constraints for transductive learning. This approach will be shown experimentally to be quite effective when there is prior mismatch. Beyond this, though, this method does not make unnecessary statistical assumptions; our (constraint-based) method will be demonstrated to handle classifier redundancy much better than the ML methods.
The constraint probabilities for classifier j are estimated as9

\[
P_{\rm con}^{(j)}[\hat{C}_j = \hat{c} \mid C = c] = \frac{\sum_{i=1:\, c_i = c}^{N_j} P_j\big[\hat{C}_j = \hat{c} \mid x_i^{(j)}\big]}{\sum_{i=1:\, c_i = c}^{N_j} 1}. \tag{4.1}
\]
7 If P_con^{(j)}[C] is mismatched to the true class prior, it is likely that P_con^{(j)}[Ĉ_j] is also mismatched to the true class decision prior (reflected in the test data batch).
8 One might imagine that it is worthwhile to also encode P_con^{(j)}[C | Ĉ_j = ĉ]. However, note that P_con^{(j)}[C = c | Ĉ_j = ĉ] = P_con^{(j)}[Ĉ_j = ĉ | C = c] P_con^{(j)}[C = c] / P_con^{(j)}[Ĉ_j = ĉ]; that is, the only information in P_con^{(j)}[C = c | Ĉ_j = ĉ] relative to P_con^{(j)}[Ĉ_j = ĉ | C = c] is a function of the (mismatched) quantities P_con^{(j)}[C = c] and P_con^{(j)}[Ĉ_j = ĉ]. Moreover, experimentally we have found that encoding P_con^{(j)}[Ĉ_j = ĉ | C = c] gives much better accuracy than encoding P_con^{(j)}[C = c | Ĉ_j = ĉ].
9 It is also possible to improve this estimate via a cross-validation procedure.
Here, X_j ≡ {(x_i^{(j)}, c_i), i = 1, ..., N_j} is the labeled training set for classifier j. The transductive estimate of the constraint 4.1, produced by choosing the ensemble a posteriori pmfs on the test batch Xtest, is

\[
P_e[\hat{C}_j = \hat{c} \mid C = c] = \frac{\sum_{i=1}^{N_{\rm test}} P_e[C = c \mid x_i] \, P_j\big[\hat{C}_j = \hat{c} \mid x_i^{(j)}\big]}{\sum_{i=1}^{N_{\rm test}} P_e[C = c \mid x_i]}. \tag{4.2}
\]
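Equations 4.1 and 4.2 can be sketched as follows; the function names and array shapes are illustrative assumptions, and the training-side estimator assumes every class occurs in the local training set.

```python
import numpy as np

def training_confusion(train_soft, train_labels, Nc):
    """Eq. 4.1: classifier j's constraint pmfs P_con^(j)[C_hat | C = c],
    one row per true class c, estimated on its own labeled training set.
    train_soft   : (Nj, Nc) array of P_j[C_hat = c_hat | x_i^(j)]
    train_labels : (Nj,) integer true labels c_i (every class assumed present)
    """
    con = np.zeros((Nc, Nc))
    for c in range(Nc):
        mask = train_labels == c
        con[c] = train_soft[mask].sum(axis=0) / mask.sum()
    return con

def transductive_confusion(ens_post, test_soft):
    """Eq. 4.2: the confusion pmfs implied by the current ensemble
    posteriors on the test batch.
    ens_post  : (Ntest, Nc) ensemble posteriors P_e[C = c | x_i]
    test_soft : (Ntest, Nc) classifier j's test-batch decisions
    """
    num = ens_post.T @ test_soft                   # [c, c_hat] sums over i
    return num / ens_post.sum(axis=0)[:, None]
```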
4.2 Nonsatisfiability of Constraints. Consider the following two-class example with two classifiers: suppose each classifier makes hard decisions, and let one classifier achieve perfect training accuracy:

\[
P_{\rm con}^{(1)}[\hat{C} = 1 \mid C = 2] = 0, \qquad P_{\rm con}^{(1)}[\hat{C} = 2 \mid C = 1] = 0. \tag{4.3}
\]

For the second classifier, let

\[
P_{\rm con}^{(2)}[\hat{C} = 1 \mid C = 1] = 1, \qquad P_{\rm con}^{(2)}[\hat{C} = 2 \mid C = 2] = 1/2. \tag{4.4}
\]
Suppose the same training set is seen by each local classifier. Let the test batch be the same as the training set, except that it contains additional samples from class 2 that are all correctly classified by both classifiers. In this case, it is easy to see it is not possible to choose {P[C = c | x_i], i = 1, ..., Ntest} to satisfy all the specified conditional constraints. Thus, as one expects, one cannot in general jointly satisfy distributed constraints, gleaned from a collection of local classifiers, by transductive learning.

4.3 Information-Theoretic Learning Approach. Although the given constraints cannot be precisely met in general, in practice we have found they typically can be quite well met. Thus, here we develop a learning approach that seeks to agree with them as much as possible. Since our constraints are on pmfs, we quantify the discrepancy between constraint pmfs and ensemble estimates by the information-theoretic measure of relative entropy (Cover & Thomas, 1991), that is, \(D(P[X] \| Q[X]) = \sum_{x \in S} P[x] \log \frac{P[x]}{Q[x]}\), with S the (common) support set, and with P[·] playing the role of "posterior" and Q[·] the "prior." In our case, the ensemble's estimate is the "posterior" and the constraint pmfs are the "priors." From the previous section, we would ideally like to achieve

\[
P_e[\hat{C}_j \mid C = c] \doteq P_{\rm con}^{(j)}[\hat{C}_j \mid C = c], \quad \forall j, c. \tag{4.5}
\]
It thus appears natural to choose the ensemble posteriors to minimize the sum of cross-entropies:

\[
\tilde{R} = \sum_{j=1}^{N_e} \sum_{c=1}^{N_c} D\big(P_e[\hat{C}_j \mid C = c] \,\big\|\, P_{\rm con}^{(j)}[\hat{C}_j \mid C = c]\big). \tag{4.6}
\]
However, consider the case where the true prior is zero for some class c̃, that is, where this class does not occur in the test batch.10 In this case, the posterior pmfs should be chosen such that \(\sum_{i=1}^{N_{\rm test}} P_e[C = \tilde{c} \mid x_i] = 0\). However, this choice would lead to an improper estimate of P_e[Ĉ_j | C = c̃] in equation 4.2. Choosing the test set posteriors to minimize equation 4.6 forces the choice of a proper P_e[Ĉ_j | C = c̃] ∀j, which is achieved only if \(\sum_{i=1}^{N_{\rm test}} P_e[C = \tilde{c} \mid x_i] > 0\). Thus, for the case of a missing class, minimizing R̃ must yield soft decision inaccuracy (as reflected by \(\sum_{i=1}^{N_{\rm test}} P_e[C = \tilde{c} \mid x_i] > 0\)).

The problem here is that equation 4.5 (and R̃) includes constraints (involving c̃) that should not be imposed. What we desire is a simple way to avoid encoding constraints {P_con^{(j)}[Ĉ_j | C = c̃], ∀j} associated with classes c̃ that do not occur in the test batch (even as we do not a priori know whether there are such missing classes). This is effectively achieved by multiplying
equation 4.5, on both sides, by \(P_e[C = c] = \frac{1}{N_{\rm test}} \sum_{i=1}^{N_{\rm test}} P_e[C = c \mid x_i]\), that is, by expressing the constraints

\[
P_e[\hat{C}_j \mid C = c] \cdot P_e[C = c] \doteq P_{\rm con}^{(j)}[\hat{C}_j \mid C = c] \cdot P_e[C = c], \quad \forall j, c. \tag{4.7}
\]
It may look as though equation 4.7 is imposing the (undesirable) pairwise pmf constraints discussed in section 4.1, since the equation equates two pairwise pmfs. However, equation 4.7 does not impose the local classifier's class prior, P_con^{(j)}[C = c]. It imposes only the conditional pmf P_con^{(j)}[Ĉ_j | C = c]. In particular, note that equation 4.7 is equivalent to the conditional pmf constraints, equation 4.5, for c such that P_e[C = c] > 0 (since in this case P_e[C = c] can be canceled on both sides), but with no constraint imposed when P_e[C = c] = 0; that is, equation 4.7 encodes conditional pmf constraints for all classes estimated to have nonzero priors in the test batch. The effect of multiplying both sides of equation 4.5 by P_e[C = c] is thus to avoid encoding constraints for classes estimated not to be present in the test batch. We encode the objectives 4.7 by choosing the ensemble posterior pmfs on Xtest to minimize the sum of relative entropies over all local
10 Both the particular class and the fact that a class is missing from the test batch are, of course, unknown.
classifiers:

\[
R = \sum_{j=1}^{N_e} \sum_{c=1}^{N_c} \sum_{\hat{c}=1}^{N_c} P_e[\hat{C}_j = \hat{c} \mid C = c] \, P_e[C = c] \, \log \frac{P_e[\hat{C}_j = \hat{c} \mid C = c] \, P_e[C = c]}{P_{\rm con}^{(j)}[\hat{C}_j = \hat{c} \mid C = c] \, P_e[C = c]}
\]
\[
= \sum_{j=1}^{N_e} \sum_{c=1}^{N_c} P_e[C = c] \, D\big(P_e[\hat{C}_j \mid C = c] \,\big\|\, P_{\rm con}^{(j)}[\hat{C}_j \mid C = c]\big). \tag{4.8}
\]
The latter form clearly shows the relationship between R and R̃. Experimentally, we have found that minimizing R̃ gives very similar classification accuracy to minimizing R when there are no missing classes in the test batch. However, minimizing R̃ fails when there are missing classes, as will be shown in section 5. We perform the minimization of R by gradient descent. The partial derivative with respect to a given posterior probability can be shown to be

\[
\frac{\partial R}{\partial P_e[C = k \mid x_m]} = \frac{1}{N_{\rm test}} \sum_{j=1}^{N_e} \sum_{\hat{c}_j = 1}^{N_c} P_j\big[\hat{C}_j = \hat{c}_j \mid x_m^{(j)}\big] \log \frac{P_e[\hat{C}_j = \hat{c}_j \mid C = k]}{P_{\rm con}^{(j)}[\hat{C}_j = \hat{c}_j \mid C = k]} \quad \forall k, m. \tag{4.9}
\]
Note that one way of achieving zero gradient (and an optimal solution) is if the constraints are precisely met. In practice, without loss of generality, but to ensure P_e[C = k | x_m] is preserved as a pmf throughout the optimization, we parameterize it using a softmax function,

\[
P_e[C = k \mid x_m] = \frac{e^{\gamma_{k,m}}}{\sum_{k'} e^{\gamma_{k',m}}}, \tag{4.10}
\]
based on the real-valued parameters {γ_{k,m}}. Applying the chain rule, we have

\[
\frac{\partial R}{\partial \gamma_{m,k}} = \sum_{k'=1}^{N_c} \frac{\partial R}{\partial P_e[C = k' \mid x_m]} \cdot \frac{\partial P_e[C = k' \mid x_m]}{\partial \gamma_{m,k}}, \tag{4.11}
\]
where

\[
\frac{\partial P_e[C = k' \mid x_m]}{\partial \gamma_{m,k}} =
\begin{cases}
P_e[C = k' \mid x_m] \,\big(1 - P_e[C = k' \mid x_m]\big), & k' = k \\[4pt]
-P_e[C = k' \mid x_m] \, P_e[C = k \mid x_m], & k' \neq k.
\end{cases} \tag{4.12}
\]
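One gradient step on R under the softmax parameterization, combining equations 4.2 and 4.9 through 4.12, might be sketched as follows. The shapes are illustrative, and the small epsilon guarding the logarithm is an implementation assumption of this sketch, not something specified in the text.

```python
import numpy as np

def it_gradient_step(gamma, test_soft_list, con_list, step=10.0, eps=1e-12):
    """One descent step on R (eq. 4.8) in the parameters gamma of eq. 4.10.

    gamma          : (Ntest, Nc) real-valued parameters
    test_soft_list : per-classifier (Ntest, Nc) decisions P_j[c_hat | x_i^(j)]
    con_list       : per-classifier (Nc, Nc) constraints P_con^(j)[c_hat | c]
    """
    Ntest, Nc = gamma.shape
    e = np.exp(gamma - gamma.max(axis=1, keepdims=True))
    post = e / e.sum(axis=1, keepdims=True)                   # eq. 4.10
    dR_dp = np.zeros((Ntest, Nc))
    for soft, con in zip(test_soft_list, con_list):
        pe_con = (post.T @ soft) / post.sum(axis=0)[:, None]  # eq. 4.2
        # eq. 4.9: dR / dPe[C = k | x_m]
        dR_dp += soft @ np.log((pe_con + eps) / (con + eps)).T / Ntest
    # chain rule through the softmax, eqs. 4.11-4.12
    grad = post * (dR_dp - (dR_dp * post).sum(axis=1, keepdims=True))
    return gamma - step * grad, post
```

In the experiments (section 5.3), the step size is halved whenever a full step fails to decrease equation 4.8; that outer line-search loop is omitted here.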
We have not determined whether equation 4.8, based on the parameterization, equation 4.10, is convex; there may be multiple local minima. Thus, unlike section 3, where the learning was globally convergent, gradient descent may find only solutions locally optimal with respect to equation 4.8. Finally, note that implementation of this technique requires that each classifier j, in addition to conveying its test batch decisions, also conveys the constraint pmfs {P_con^{(j)}[Ĉ_j | C = c] ∀c} to the aggregation function.

5 Experimental Results

We have performed a number of experiments involving several types of comparisons:

1. Transductive ML learning versus fixed combining (sum and product)
2. Transductive IT learning versus fixed combining
3. Transductive ML versus transductive IT learning
4. Specialized tests to study the effect of explicit classifier redundancy, the performance with strong classifiers (SVMs), the case of missing classes, and the effect of batch size

The data sets, taken from the UC Irvine machine learning repository, are described in Table 1. To simulate a distributed classification environment, we used five local classifiers, each (initially) a naive Bayes classifier working on a randomly selected subset of features.11 Continuous features were modeled by a gaussian distribution and discrete features by a multinomial. For example, Vehicle has 4 classes and 18 continuous features. The first local classifier used 15 randomly selected features, while the second classifier used 14. We performed five replications of twofold cross validation for all data sets except Satellite. For each replication, the data set was divided into two equal-sized sets. In the first fold, one set was used as training and the other as test, with these roles reversed in the second fold.

5.1 Prior Mismatch. We evaluated classification accuracy for varying test set priors. A new test set with given priors was obtained by sampling

11 Thus, for the conditional independence model of section 3.1, both the local classifiers and the combining rule are based on naive Bayes.
Table 1: UC Irvine ML Data Sets That Were Evaluated.

Data Set         Type of Features   Number of Classes   Number of Features   Feature Subspace Size
Satellite        Continuous         5                   33                   [28, 29, 30, 31, 32]
Vehicle          Continuous         4                   18                   [15, 14, 15, 16, 17]
Segment          Continuous         7                   19                   [15, 16, 15, 17, 18]
Liver Disorder   Continuous         2                   6                    [5, 4, 5, 4, 5]
Diabetes         Continuous         2                   7                    [6, 3, 4, 5, 6]
Mushroom         Discrete           2                   22                   [13, 15, 17, 19, 21]
Sonar            Continuous         2                   60                   [52, 54, 56, 58, 59]
with replacement from the original test set. For example, for the first fold of the first replication, denote the training set by S_a and the test set by S_b. Suppose we want to change the test set priors to {m_1, ..., m_{Nc}}, where \(\sum_{j=1}^{N_c} m_j = 1\). Then the new test set Ŝ_b is obtained by sampling with replacement from S_b so as to achieve (or approximately achieve) {m_1, ..., m_{Nc}}. The size of the new test set is chosen the same as that of the original set, S_b. Satellite has separate training and test sets. For this set, rather than performing cross validation, we averaged the test error rate over 10 trials, based on different resamplings of the test data for the given mismatched priors. We quantify the mismatch between test set and local training set priors by the sum of cross-entropies:

\[
M = \sum_{j=1}^{N_e} \sum_{k=1}^{N_c} P_j[C = k] \log \frac{P_j[C = k]}{P_{\rm test}[C = k]}. \tag{5.1}
\]
We evaluated performance varying the value of this mismatch index.

5.2 Initialization. The transductive ML methods used initial priors estimated based on frequency counts taken over all the local training sets. These priors were also used as (fixed) priors by the fixed rule methods. The fixed rule methods are thus equivalent to the solutions obtained by the (corresponding) EM algorithms (see sections 3.1 and 3.2) after the first E-step, before any prior correction is made. The IT method started from a uniform posterior pmf initialization for all test batch samples, via γ_{k,m} = 0 ∀k, m.

5.3 Stopping Criteria. In practice, one must either use a fixed stopping rule or one automatically determined, given each new data set. The desire is that learning should cease without substantial over- or underfitting. In this work, we have applied fixed stopping rules empirically chosen
[Figure 1 appears here: paired panels for "transductive IT learning" (learning objective and % error versus number of iterations) and "transductive ML (sum) learning" (incomplete-data log likelihood L_inc and % error versus number of EM steps).]

Figure 1: Learning objective and the test error rate as learning proceeds for transductive methods on Diabetes with prior mismatch value of 2.55.
for each method. Figure 1 gives example curves illustrating that test set performance is not very sensitive to the choice of stopping rule. We observed little overfitting for any of the methods. For the ML methods, we stop learning when the relative change in log likelihood drops below 1e-3. Since most of the improvement in test error was observed to occur in the first few iterations, this was found to generally stop near the best test set performance. For IT, we used gradient descent starting from a nominal step size (with value 10), taking steps with this size so long as the learning objective decreases. When this is not the case, the step size is halved. Thus, equation 4.8 monotonically decreases with each step taken. We have found that the general trend for the test error rate is (also) nonincreasing behavior with an increasing number of steps. For IT, we stop learning when the relative change in equation 4.8 is less than 1e-4, or after 3000 learning steps. This ensures the algorithm does a sufficient amount of learning. For all the methods, there are cases where generalization error does increase during learning. However, we have observed these increases to be modest.

5.4 Transductive ML Learning Versus Fixed Combining (Sum and Product). Comparison of generalization error rates for transductive ML and fixed combining is given in Table 2. The second column gives the mismatch between training and test set priors. It can be seen that when the mismatch increases, ML methods outperform the fixed rules, with the
Table 2: Generalization Error Rates Based on Five Replications of Twofold Cross Validation, Transductive ML and Fixed Methods.

[Table 2 appears here: for each data set (Satellite, Sonar, Mushroom, Diabetes, Liver Disorder, Segment) at several prior-mismatch values M, error rates for ML and fixed combining under the sum and product rules, together with 5 × 2 cv F-test statistics.]
874 D. Miller and S. Pal
Transductive Methods
875
Table 3: Generalization Error Rates Based on Five Replications of Twofold Cross Validation, Transductive IT and Fixed Combining Methods.

[Table 3 appears here: for each data set (Satellite, Vehicle, Segment, Liver Disorder, Diabetes, Mushroom, Sonar) at several prior-mismatch values M, error rates for IT and for fixed combining (sum and product), together with 5 × 2 cv F-test statistics.]
difference generally growing with mismatch. For example, for Diabetes, when the mismatch is 0.74, the ML (sum/product) error rates are 27.29/26.58, which are lower than the fixed-combining values of 28.91/28.83. Saerens et al. (2002) used a likelihood ratio test to decide whether the a priori probabilities have significantly changed from training to test set; the solutions obtained by their EM learning were retained only if the likelihood ratios exceeded a confidence threshold. We have found it largely unnecessary to apply this type of test for our method. Only in one case of small mismatch value, for Liver Disorder, did use of transductive ML give a noticeably higher test set error rate. That is, use of our approach was typically found to either improve performance or have little effect (in the small-mismatch case).

5.5 Transductive IT Learning Versus Fixed Combining (Sum and Product). Table 3 compares error rates between IT and fixed combining. The IT method outperformed the fixed rules for all data sets, with gains generally increasing with the prior mismatch. As one example of IT's prior
Table 4: Generalization Error Rates Based on Five Replications of Twofold Cross Validation, Transductive IT and ML Methods.

[Table 4 appears here: for each data set at several prior-mismatch values M, error rates for IT and for ML combining (sum and product), together with 5 × 2 cv F-test statistics.]
correction, on Diabetes with M = 2.58, the true priors were [0.2, 0.8], and the transductive estimates were [Pe[C = 0], Pe[C = 1]] = [0.29, 0.71].

5.6 Transductive ML Versus Transductive IT Learning. Table 4 compares all the transductive methods. For all data sets, IT performed better than the ML methods, except on Mushroom for the mismatch values 0.36 and 2.36; the difference in error rates in both cases is ≤ 0.1%. We attribute much of the gain of the IT method to the nonparametric nature of its combining. In encoding the given constraint information, IT does not make the statistical assumptions (presumably unwarranted) of the ML methods. We will shed further light on this point shortly, when we consider specialized tests involving explicit classifier redundancy.

5.7 Statistical Tests. To assess the statistical significance of the comparisons, we used the 5 × 2 cv F test (Alpaydin, 1999). Here, five replications of twofold cross validation are performed. The difference
between the generalization errors of the two classifiers, p_i^(j), is computed, where i is the replication index and j the fold index. The estimated variance s_i^2 of this difference is computed, with the average difference on each replication given by \bar{p}_i = (p_i^{(1)} + p_i^{(2)})/2. The 5 × 2 cv F test statistic,

\[ f = \frac{\sum_{i=1}^{5} \sum_{j=1}^{2} \left( p_i^{(j)} \right)^2}{2 \sum_{i=1}^{5} s_i^2}, \tag{5.2} \]

is approximately F distributed (Alpaydin, 1999) with 10 and 5 degrees of freedom. We can reject the hypothesis that the two algorithms have the same error rate with 95% confidence if f is greater than 4.74. In Table 2, columns 7 and 8 give values of f for comparisons between ML (sum) and the fixed rules (sum/product); columns 9 and 10 compare ML (product) and the fixed rules (sum/product). Note that as the prior mismatch increases, the value of f generally increases. For Sonar, when the mismatch is 1.94, f ≥ 4.74 for all ML methods compared with the fixed rules. In Table 3, columns 6 and 7 compare IT with the fixed combining rules. Except for Liver Disorder, every data set has at least one case where f ≥ 4.74. Moreover, comparing Tables 2 and 3, we see that IT gives statistically significant improvement over the fixed methods both in more cases and at smaller mismatch values than the ML methods: IT gives significant improvement for M < 1.5 on Vehicle, Segment, Diabetes, and Mushroom. In Table 4, we compare IT with the ML methods. Although IT achieves smaller error rates, f ≤ 4.74 except on Mushroom.

5.8 Redundant Classifiers. Table 5 gives results (for a single training/test split) comparing the methods in the presence of explicit classifier redundancy. We selected the second-weakest classifier of the five in the ensemble (in terms of test error) and duplicated it k times, k = 1, 2, 5, and 20. The column under "5" gives the test error for the original ensemble (Ne = 5), and the column "5 + k" the test error for the redundant ensemble (Ne = 5 + k). Note that IT is almost perfectly robust to redundant classifiers, whereas for both ML methods the accuracy degrades as k increases. This experiment exhibits the weakness of the ML methods, and the strength of IT, in handling redundancy.

5.9 Strong Classifiers. In a distributed sensor context, each classifier may be inherently weak (e.g., with computation limited by battery power).
However, when there is freedom to choose the local classifier, one might choose a strong one such as a support vector machine (SVM), also believed to be less sensitive to class prior mismatch than other classifiers. It is thus interesting to assess the value of transductive learning for this choice.
Table 5: Effect of Explicit Classifier Redundancy on Transductive IT and ML Test Set Classification Accuracy.

[Table 5 appears here: for Diabetes and Ionosphere, mismatch values and test error rates for IT, Tml(sum), and Tml(pro), with paired columns for the original ensemble ("5") and the redundant ensemble ("5 + k"), k = 1, 2, 5, 20.]
Table 6: Generalization Error Based on Five Replications of Twofold Cross Validation for IT and Fixed Combining of Linear, Two-Class SVMs.

Data Set     Mismatch   SVM Error (Test)   IT      Fixed (Sum)   5 × 2 cv F Test
Sonar        0.0563     25.08              19.78   20.52         0.69
Sonar        1.3801     27.20              23.10   23.89         0.82
Ionosphere   0.0165     17.20              14.18   15.42         0.968
Ionosphere   2.305      31.28              19.30   26.32         6.148
Diabetes     0.0538     27.59              25.04   29.06         1.968
Diabetes     1.4301     41.43              28.80   45.02         8.148
Table 6 compares the IT method (the only transductive method that works with hard classifier decisions) with fixed averaging (sum) for linear SVMs on two-class tasks. The column "SVM Error (Test)" gives the average test error of the individual SVM classifiers. For Sonar, SVM performance is only weakly sensitive to increases in the mismatch value, with the error increasing from 25.1% to 27.2%, while for the other two data sets, SVM performance is quite sensitive to mismatch. From the table, the advantage of IT over the fixed rules appears to depend on the degree of this sensitivity. The advantage of IT on Sonar is small (especially compared with the results for naive Bayes classifiers in Table 3), while the advantage in the high-mismatch case is statistically significant on both Ionosphere and Diabetes.

5.10 Missing Classes in the Test Batch. We performed two experiments to validate our claim in section 4.3 that the IT method (minimizing R in equation 4.8) handles missing classes in the test batch much better than minimizing R̃ in equation 4.6. On the two-class Diabetes, excluding class 1 from the test set, IT estimated Pe[C = 1] = 0.05 and achieved a test error rate of 2.9%. By contrast, minimizing R̃ gave Pe[C = 1] = 0.51 and a test error rate of 39.62%. On the seven-class Segment, excluding class 1, IT estimated Pe[C = 1] = 0.01, with a test error rate of 14.87%; minimizing R̃ yielded Pe[C = 1] = 0.08 and a test error rate of 30.47%.

5.11 Batch Size. Finally, we assessed how batch size affects performance. Table 7 shows the generalization error for varying test batch size, for IT compared with ML combining (shown in parentheses). While IT improves with increasing batch size, IT performance is comparable to or better than ML when the batch size is at least 30% of the test set. Note also that the ML performance seems largely invariant to batch size.
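As a concrete illustration of the significance test used throughout (equation 5.2, with the 95% threshold f > 4.74), a minimal sketch; the 5 × 2 array of fold-wise error differences is hypothetical:

```python
import numpy as np

def f_statistic(p):
    """5 x 2 cv F statistic (equation 5.2).

    p[i, j] is the difference in generalization error between the two
    classifiers on replication i (of 5) and fold j (of 2)."""
    p = np.asarray(p, dtype=float)
    p_bar = p.mean(axis=1, keepdims=True)      # average difference per replication
    s2 = ((p - p_bar) ** 2).sum(axis=1)        # estimated variance s_i^2
    return (p ** 2).sum() / (2.0 * s2.sum())

# hypothetical error differences from five replications of twofold CV
diffs = [[0.10, 0.20]] * 5
f = f_statistic(diffs)
# reject "same error rate" at 95% confidence if f > 4.74 (F with 10 and 5 dof)
```

For this particular array, f works out to 5.0, which would exceed the 4.74 threshold.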
Table 7: Effect of Batch Size on Generalization Error for Transductive IT Learning, Compared with ML Sum/Product Combining (Shown in Parentheses). Batch size is given as a percentage of the test set.

Data Set         M        10%                   30%                   50%                   80%                   100%
Liver Disorder   4.6e-2   47.54 (46.89/47.64)   44.20 (46.34/47.10)   41.21 (45.26/46.88)   42.64 (46.40/46.15)   43.00 (46.56/46.05)
Diabetes         2.2e-2   25.12 (22.56/22.10)   22.78 (23.44/23.48)   23.28 (22.12/22.52)   23.10 (23.52/23.40)   21.58 (22.10/21.78)
6 Conclusion

We identified the distributed ensemble classification problem, discussed applications in which it arises, and proposed several transductive methods for its solution. All our methods were found to improve on fixed combining rules, with relatively large gains in accuracy achieved in the case of large class prior mismatch. The IT method generally outperformed the ML methods and is the only method suitable when local classifiers produce solely hard decisions. This method gave performance benefits both when the individual classifiers are relatively weak (naive Bayes) and when they are strong (SVMs). The approach was also demonstrated to be much more robust to classifier redundancy than the ML methods, and it handled well the case where classes are missing from the test batch. There are several directions to pursue in the future. First, we did not address practical communications issues, such as quantization of soft decisions and variable delays from different classifiers, which could disrupt the synchronization needed to aggregate decisions. We also did not consider possible information exchange between local classifiers. Second, a theoretical study of the conditions under which the ensemble class priors converge to the true ones could be undertaken. Third, our approaches could be improved by conveying information beyond local decisions. In Miller and Pal (2005), we proposed a method based on mixture modeling for encoding novel constraints involving continuous features, within a maximum entropy framework. This approach could be applied within our distributed ensemble setting, with information based on a local sensor's continuous features (i.e., a posteriori probabilities on mixture component indices) conveyed along with local decisions, and with the associated constraints encoded. Finally, other aggregation rules could be explored; for example, it is straightforward to extend the ML methods to produce the alpha-trimmed mean of Tumer and Ghosh (2002).
It would be interesting to determine whether other combining rules also arise as ML learning solutions.

Appendix: EM Framework for Transductive ML Techniques

For both transductive ML techniques, the data treated as missing within the EM framework are the class labels of the new batch samples. Let Zik = 1 if sample xi (exclusively) belongs to class k, and Zik = 0 otherwise. Then, for the independence case of section 3.1, the complete-data log likelihood is
\[ \log L_c(X_{\mathrm{test}}) = \sum_{i=1}^{N_{\mathrm{test}}} \sum_{k=1}^{N_c} Z_{ik} \left[ \log P_e[C = k] + \sum_{j=1}^{N_e} \log p_j\!\left( x_i^{(j)} \mid C = k;\, \theta_{jk} \right) \right]. \tag{A.1} \]
Here, the parameters that specify the class-conditional models, θjk, even though in general unknown, are introduced for completeness. In the E-step, we compute the expected value of log Lc(Xtest) given the current parameter estimates—that is, given Θ ≡ {{Pe[C = k]}, {θjk}}. This is
\[ E[\log L_c(X_{\mathrm{test}}) \mid \Theta] = \sum_{i=1}^{N_{\mathrm{test}}} \sum_{k=1}^{N_c} E[Z_{ik} \mid \Theta] \left[ \log P_e[C = k] + \sum_{j=1}^{N_e} \log p_j\!\left( x_i^{(j)} \mid C = k;\, \theta_{jk} \right) \right]. \tag{A.2} \]
The expectations are just the ensemble class a posteriori probabilities, which can be derived under the feature vector independence assumption:
\[ E[Z_{ik} \mid \Theta] \equiv P_e[C = k \mid x_i] = \frac{1}{Z}\, P_e[C = k] \prod_{j=1}^{N_e} p_j\!\left( x_i^{(j)} \mid C = k \right) = \frac{1}{Z'}\, P_e[C = k] \prod_{j=1}^{N_e} \frac{P_j\!\left[ C = k \mid x_i^{(j)} \right]}{P_j[C = k]}, \tag{A.3} \]
where Z and Z′ are the proper normalizations, ensuring a valid probability mass function. The E-step is thus specified by equation 3.4. In the M-step, we maximize equation A.2 over the parameters to be adjusted, in this case {Pe[C = k]}, with the E-step probabilities held fixed. This gives the update in equation 3.5. Likewise, for the ML problem in section 3.2, we can derive the expected complete-data censored log likelihood:
\[ E[\log L_{c\text{-}\mathrm{cens}}(X_{\mathrm{test}}) \mid \Theta] = \sum_{i=1}^{N_{\mathrm{test}}} \sum_{k=1}^{N_c} E[Z_{ik} \mid \Theta] \left[ \log P_e[C = k] + \log \frac{1}{N_e} \sum_{j=1}^{N_e} p_j\!\left( x_i^{(j)} \mid C = k;\, \theta_{jk} \right) \right]. \tag{A.4} \]
In this case, the E-step probabilities are derived using
\[ E[Z_{ik} \mid \Theta] = P_e[C = k \mid x_i] = \sum_{j=1}^{N_e} P_e[C = k \mid x_i,\ j\ \text{uncensored}] \cdot \frac{1}{N_e} \]
\[ = \frac{1}{Z} \sum_{j=1}^{N_e} p_j\!\left( x_i^{(j)} \mid C = k \right) P_e[C = k] = \frac{1}{Z}\, P_e[C = k] \sum_{j=1}^{N_e} \frac{P_j\!\left[ C = k \mid x_i^{(j)} \right]}{P_j[C = k]}. \tag{A.5} \]
The E-step is thus specified by equation 3.7. Maximizing equation A.4 with respect to {Pe [C = k]} given fixed E-step probabilities again gives equation 3.5.
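The EM updates above (the product-form E-step of equation A.3 followed by the prior re-estimate of equation 3.5) can be sketched as follows, assuming soft classifier outputs are available; the array names and the toy posteriors are hypothetical:

```python
import numpy as np

def transductive_ml_priors(post, train_priors, n_iter=100):
    """EM re-estimation of the ensemble test-batch class priors.

    post[i, j, k]      = P_j[C = k | x_i^(j)], classifier j's posterior on test sample i
    train_priors[j, k] = P_j[C = k], classifier j's training-set prior
    Returns the ensemble prior estimates Pe[C = k]."""
    n_test, n_e, n_c = post.shape
    pe = np.full(n_c, 1.0 / n_c)                         # uniform initialization
    # prod_j P_j[k | x_i] / P_j[k], fixed across EM iterations
    ratio = np.prod(post / train_priors[None], axis=1)   # shape (n_test, n_c)
    for _ in range(n_iter):
        # E-step (equation A.3): ensemble posteriors, normalized per sample
        q = pe[None] * ratio
        q /= q.sum(axis=1, keepdims=True)
        # M-step (equation 3.5): new priors are the average posteriors
        pe = q.mean(axis=0)
    return pe

# toy batch: one classifier, two classes, most samples confidently class 0
post = np.array([[[0.9, 0.1]]] * 8 + [[[0.2, 0.8]]] * 2)   # shape (10, 1, 2)
train_priors = np.array([[0.5, 0.5]])
pe = transductive_ml_priors(post, train_priors)
```

On this toy batch, the estimated prior for class 0 is pulled well above the uniform initialization, mirroring the prior-correction behavior reported in section 5.5.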
References

Alpaydin, E. (1999). Combined 5 × 2 cv F test for comparing supervised classification learning algorithms. Neural Comp., 11(8), 1885–1892.
Basak, J., & Kothari, R. (2004). A classification paradigm for distributed vertically partitioned data. Neural Comp., 16(7), 1525–1544.
Breiman, L. (1996). Bagging predictors. Mach. Learn., 26, 123–140.
Chen, K., Wang, L., & Chi, H. (1997). Methods of combining multiple classifiers with different features and their applications to text independent speaker identification. Intl. J. of Patt. Rec. and Art. Intell., 11(3), 417–445.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
D'Costa, A., Ramachandran, V., & Sayeed, A. (2004). Distributed classification of gaussian space-time sources in wireless sensor networks. IEEE J. Sel. Areas in Comm., 22, 1026–1036.
Della Pietra, S., Della Pietra, V. J., & Lafferty, J. D. (1997). Inducing features of random fields. IEEE Trans. Patt. Anal. Mach. Intell., 19(4), 380–393.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc., Ser. B, 39, 1–38.
Dietterich, T. G., & Kong, E. B. (1995). Error-correcting output coding corrects bias and variance. In Proceedings of the 12th International Conference on Machine Learning (pp. 313–321). San Francisco: Morgan Kaufmann.
Freund, Y., & Schapire, R. (1996). Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning (pp. 148–156). San Francisco: Morgan Kaufmann.
Hansen, L. K., & Salamon, P. (1990). Neural network ensembles. IEEE Trans. Patt. Anal. Mach. Intell., 12(10), 993–1001.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Comp., 3(1), 79–87.
Jaynes, E. T. (1989). Papers on probability, statistics, and statistical physics (2nd ed.). Norwell, MA: Kluwer.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (pp. 200–209). San Francisco: Morgan Kaufmann.
Kang, H. J., Kim, K., & Kim, J. H. (1997). Optimal approximation of discrete probability distribution with kth-order dependency and its application to combining multiple classifiers. Patt. Rec. Lett., 18, 515–523.
Landgrebe, D. A. (2005). Multispectral land sensing: Where from, where to? IEEE Trans. Geo. Rem. Sens., 43(3), 414–421.
McLachlan, G., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Miller, D. J., & Pal, S. (2005). An extension of iterative scaling for joint decision-level and feature-level fusion in ensemble classification. In 2005 International Workshop on Machine Learning for Signal Processing. New York: IEEE.
Miller, D. J., & Uyar, H. S. (1997). A mixture of experts classifier with learning based on both labeled and unlabeled data. Neural Info. Proc. Sys., 9, 571–577.
Miller, D. J., & Yan, L. (1999). Critic-driven ensemble classification. IEEE Trans. Sig. Proc., 47(10), 2833–2844.
Miller, D. J., & Yan, L. (2000). Approximate maximum entropy joint feature inference consistent with arbitrary lower-order probability constraints: Application to statistical classification. Neural Comp., 12(9), 2175–2207.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood, and the EM algorithm. SIAM Review, 26, 195–239.
Saerens, M., Latinne, P., & Decaestecker, C. (2002). Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Comp., 14(1), 21–41.
Tumer, K., & Ghosh, J. (1996). Error correlation and error reduction in ensemble classifiers. Connection Science, 8(3/4), 385–404.
Tumer, K., & Ghosh, J. (2002). Robust combining of disparate classifiers through order statistics. Patt. Anal. & Apps., 5(2), 189–200.
Varshney, P. (1997). Distributed detection and data fusion. New York: Springer.
Xu, L., Krzyzak, A., & Suen, C. Y. (1992). Methods of combining multiple classifiers and their applications to handwriting recognition. IEEE Trans. Sys., Man, Cyb., 22, 418–435.
Received February 19, 2006; accepted July 11, 2006.
ARTICLE
Communicated by Erkki Oja
Synergies Between Intrinsic and Synaptic Plasticity Mechanisms

Jochen Triesch
[email protected] Frankfurt Institute for Advanced Studies, Johann Wolfgang Goethe University, 60438 Frankfurt am Main, Germany, and Department of Cognitive Science, University of California, San Diego, La Jolla, CA 92093-0515, U.S.A.
We propose a model of intrinsic plasticity for a continuous activation model neuron based on information theory. We then show how intrinsic and synaptic plasticity mechanisms interact and allow the neuron to discover heavy-tailed directions in the input. We also demonstrate that intrinsic plasticity may be an alternative explanation for the sliding threshold postulated in the BCM theory of synaptic plasticity. We present a theoretical analysis of the interaction of intrinsic plasticity with different Hebbian learning rules for the case of clustered inputs. Finally, we perform experiments on the "bars" problem, a popular nonlinear independent component analysis problem.

1 Introduction

Synaptic learning mechanisms have always been a central area of research in neural computation, but there has been relatively little work on intrinsic plasticity (IP) mechanisms, which change a neuron's intrinsic excitability. Such changes have been observed in many species and brain areas (Desai, Rutherford, & Turrigiano, 1999; Zhang & Linden, 2003; Daoudal & Debanne, 2003; Zhang, Hung, Zhu, Xie, & Wang, 2004; Cudmore & Turrigiano, 2004; van Welie, van Hooft, & Wadman, 2004, 2006), and they can be measured as changes to a neuron's frequency-current (f-I) function, such as alterations to the spike threshold or the initial slope of the f-I function. Interestingly, such effects seem to occur on different timescales, with some changes occurring over days (e.g., Desai et al., 1999) and others occurring much faster (Cudmore & Turrigiano, 2004; van Welie et al., 2004, 2006). Despite recent progress, the biophysical implementation of these mechanisms remains somewhat unclear, and the exact computational functions of these forms of plasticity are not well established either. Changes to the intrinsic excitability of neurons may help to shape the dynamics of neural circuits (Marder, Abbott, Turrigiano, Liu, & Golowasch, 1996) in desired ways.
Neural Computation 19, 885–909 (2007)  © 2007 Massachusetts Institute of Technology

It is also reasonable to assume that IP contributes to neural homeostasis by keeping the level of activity of individual neurons and neural circuits in a desirable regime. It is less clear, however, what exactly such a desirable regime may be. It has been proposed that efficient sensory coding can be obtained by adjusting neural responses to match the statistics of signals from the environment in a way that maximizes information transmission. In particular, if a neuron uses all of its possible response levels equally often, then its entropy is maximized (Laughlin, 1981). For visual cortex of cats and macaques, Baddeley, Abbott, Booth, Sengpiel, and Freeman (1998) have observed that individual neurons exhibit an approximately exponential distribution of firing rates in response to stimulation with natural videos. They argued that this may be motivated by the fact that the exponential distribution maximizes entropy for a fixed mean. Thus, neurons may assume exponential firing-rate distributions in order to maximize the information they convey to downstream targets given a fixed average activity level, which corresponds to a certain energy budget. Based on this idea, Stemmler and Koch (1999) developed a model of IP for a Hodgkin-Huxley-style model neuron. They derived a learning rule for the adaptation of a number of voltage-gated channels to match the firing-rate distribution of the unit to a desired distribution. Shin, Koch, and Douglas (1999) adjusted two conductances in electronic circuit analogs of neurons to adapt to the mean and variance of the time-varying somatic input current. In this work, we describe an IP model for a simple continuous-activation model neuron with only two parameters. We then study the interaction of intrinsic and synaptic learning processes and show how a model neuron with IP at the soma and Hebbian learning at the synapses can discover heavy-tailed directions in its input. We analyze the interaction of IP with different Hebb rules including a BCM rule (Bienenstock, Cooper, & Munro, 1982).
We also show how IP may function as an alternative to the sliding threshold mechanism assumed to balance long-term potentiation (LTP) and long-term depression (LTD) in BCM theory. Finally, we demonstrate how a single unit with IP and Hebbian learning discovers an independent component in the "bars" problem, a standard unsupervised learning problem.

2 A Model of Intrinsic Plasticity

Since the long-term goal of our research is to study the effects of IP at the level of large networks, we choose to use a simple rate-coding model neuron. We model IP as changes to the neuron's nonlinear activation function. We consider a model neuron with a parametric sigmoid nonlinearity of the form

\[ y = g_{a,b}(x) = \frac{1}{1 + \exp(-(a x + b))}. \tag{2.1} \]
Here, x denotes the total synaptic current arriving at the soma, and y is the neuron’s firing rate. The parameters a and b change the slope and offset
of the sigmoid nonlinearity. In Triesch (2005b), we proposed an update rule for the parameters of the sigmoid based on the idea of moment matching. This rule took the form of two proportional control laws coupling the offset and slope of the sigmoid to estimates of the first and second moments of the neuron's firing-rate distribution.¹ More recently, we derived a gradient rule that aims to directly minimize the Kullback-Leibler divergence between the neuron's current firing-rate distribution and the optimal exponential distribution (Triesch, 2005a). As derived in the appendix, this ansatz leads to the following discrete learning rule: a ← a + Δa and b ← b + Δb, with

\[ \Delta a = \eta_{\mathrm{IP}} \left( \frac{1}{a} + x - \left( 2 + \frac{1}{\mu} \right) x y + \frac{1}{\mu}\, x y^2 \right), \tag{2.2} \]
\[ \Delta b = \eta_{\mathrm{IP}} \left( 1 - \left( 2 + \frac{1}{\mu} \right) y + \frac{1}{\mu}\, y^2 \right), \tag{2.3} \]
where µ is the neuron's desired mean activity and ηIP is a small learning rate. This rule is very similar to the one derived by Bell and Sejnowski (1995) for a single sigmoid neuron maximizing its entropy. However, our rule has additional terms stemming from the objective of keeping the mean firing rate low. Note that this rule is strictly local. The only quantities used to update the neuron's nonlinear transfer function are x, the total synaptic current arriving at the soma, and the firing rate y. Furthermore, it relies on only the neuron's current activity y and its squared activity y², which might be estimated from the cell's internal calcium concentration.

2.1 Experiments with the Intrinsic Plasticity Model

2.1.1 Behavior for Different Distributions of Total Synaptic Current. To illustrate the behavior of the learning rule, we consider a single model neuron with a fixed distribution of synaptic current fx(x). Figure 1 illustrates the result of adapting the neuron's nonlinear transfer function to three different input distributions: gaussian, uniform, and exponential. We set the unit's desired mean activity to µ = 0.1, a tenth of its maximum activity, to reflect the low firing rates observed in cortex. The following results do not depend critically on the precise value of µ as long as µ is small (µ ≪ 1). In each case, the sigmoid nonlinearity moves close to the optimal nonlinearity that would result in an exactly exponential distribution of the firing rate, resulting in a sparse distribution of firing rates. Note that since the sigmoid nonlinearity bounds activity to be less than one (y < 1), the resulting distribution py(y) necessarily has a truncated tail. Also, since the sigmoid has
1 In this work the sigmoid nonlinearity was parameterized slightly differently, with a representing the inverse slope of the sigmoid.
Figure 1: Dynamics of the IP rule for various input distributions. (a–c) Gaussian input distribution. (a) The phase plane diagram. Arrows indicate the flow field of the system. Dotted lines indicate approximate locations of the nullclines (found numerically). Two example trajectories are exhibited (solid lines), which converge to the stationary point (marked with a circle). (b) The resulting activity distribution (solid curve), which approximates that of an exponential distribution (dotted). (c) The theoretically optimal transfer function (dotted) and the learned sigmoidal transfer function (solid). The gaussian input distribution (dashed, not drawn to scale) is also shown. (d, e) Same as c but for uniform and exponential input distribution. Parameters were µ = 0.1, ηIP = 0.001.
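A minimal numerical check of the IP rule (equations 2.2 and 2.3) on gaussian input, as in Figure 1a to 1c. The desired mean activity µ = 0.1 matches the figure; the learning rate and run length are choices made here for a quick simulation, not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, eta = 0.1, 0.002     # desired mean activity; learning rate (raised for a short run)
a, b = 1.0, 0.0          # slope and offset of the sigmoid (equation 2.1)

rates = []
for _ in range(100_000):
    x = rng.normal()                            # gaussian total synaptic current
    y = 1.0 / (1.0 + np.exp(-(a * x + b)))      # firing rate (equation 2.1)
    # intrinsic plasticity updates (equations 2.2 and 2.3)
    a += eta * (1.0 / a + x - (2.0 + 1.0 / mu) * x * y + x * y * y / mu)
    b += eta * (1.0 - (2.0 + 1.0 / mu) * y + y * y / mu)
    rates.append(y)

mean_rate = np.mean(rates[-20_000:])   # should settle near the desired mu = 0.1
```

The mean firing rate over the final portion of the run settles near µ, with a sparse, roughly exponential distribution of rates, consistent with the behavior shown in the figure.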
only two degrees of freedom, the match is usually not perfect. As can be seen in the figure, large deviations from the optimal transfer function can sometimes be observed where the probability density of the input distribution is low, and this leads to deviations from the desired exponential activity distribution. A closer fit could be obtained with a different nonlinearity g with more free parameters. Since it is unclear, however, what additional degrees of freedom biological neurons have in this respect, we chose to stick with equation 2.1. When a constant input x0 is presented to the unit— f x (x) is described by a δ-function—then the slope parameter a will diverge as the unit attempts to adjust its nonlinearity to match its deterministic output ga b (x0 ) to the desired exponential distribution. Due to simultaneous adjustment of the threshold parameter b, however, the unit’s activity will remain close to µ. The results obtained with equations 2.2 and 2.3 are very similar to the ones from our previous study, which used a set of two simple control laws
Figure 2: Response to sudden deprivation from afferent input. (a) The unit’s firing rate y as a function of the number of presented inputs (time t). We plot the response only to every 10th input. After 104 time steps, the standard deviation of the input signal is reduced by a factor of five. In response, the neuron adjusts its intrinsic excitability to restore its desired exponential firing-rate distribution. (b) The nonlinear activation function before the deprivation and after adaptation to deprivation. (c) The distribution of firing rates before deprivation, immediately after deprivation, and after adaptation to deprivation.
that updated the sigmoid nonlinearity based on estimates of the first and second moments of the neuron's firing-rate distribution (Triesch, 2005b). The current rule seems to lead to faster convergence, however.

2.1.2 Response to Sudden Sensory Deprivation. The proposed form of plasticity of a neuron's intrinsic excitability may help neurons maintain stable firing-rate distributions in the presence of systematic changes to their afferent input. An interesting special case of such a change is sensory deprivation. Here, we model sensory deprivation as a fivefold reduction of the standard deviation σ of a gaussian-distributed total synaptic current x arriving at the soma. Figure 2 shows how the neuron learns to adjust to the sudden change of the distribution of total synaptic current occurring after 10^4 time steps. After a transient phase of low activity, the neuron gradually reestablishes its desired firing-rate level (see Figure 2a) by making its activation function steeper; that is, the neuron becomes more excitable. This is illustrated in Figure 2b, which shows the neuron's activation function
J. Triesch
before the reduction of afferent input and after adjusting to the new input distribution. Figure 2c shows how the distribution of firing rates is affected by the sudden reduction in afferent input and the subsequent adjustment of the neuron’s excitability. After adjustment, the neuron regains its original, approximately exponential firing-rate distribution.

2.2 Discussion of the IP Model. It is interesting to note that equation 2.1 can be viewed as linearly transforming the input x before passing it through a standard sigmoid function. Such a linear rescaling could also be obtained by modifying all synaptic weights (including a “bias” weight) entering the neuron. Thus, there is a redundancy between the neuron’s intrinsic and synaptic degrees of freedom. This “redundancy” of neural degrees of freedom in our model is specific to our choice of the nonlinearity and its parameterization. In fact, the derivation of the learning rule can be generalized to other nonlinearities with different adjustable parameters. The only requirements are that the nonlinearity g be strictly monotonically increasing and differentiable with respect to x, and that the partial derivatives of log(∂y/∂x) with respect to the parameters of g (in the above case, a and b) exist. Nevertheless, there is ample evidence that homeostatic synaptic mechanisms also contribute to keeping a neuron’s activity in a desirable regime (Turrigiano & Nelson, 2000, 2004), especially if the afferent input to the neuron has been reduced (Maffei, Nelson, & Turrigiano, 2004). From a self-organization viewpoint, however, it may be easiest and most effective to implement mechanisms that change the spiking statistics of the neuron directly at the soma—as locally as possible. But since such mechanisms may operate over only a limited range, they may rely on other processes such as synaptic scaling to keep the distribution of currents arriving at the soma in a desired range.
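The homeostatic behavior described above can be sketched in a short simulation. The updates for a and b below are our reconstruction of the gradient rule (equations 2.2 and 2.3 are not reproduced in this excerpt), and all parameter names are ours; treat this as an illustrative sketch rather than the exact published rule.

```python
import numpy as np

def sigmoid(z):
    # Numerically safe logistic function.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

def ip_step(x, a, b, mu, eta):
    """One intrinsic-plasticity update of slope a and threshold b.

    Our reconstruction of the stochastic gradient that pulls the unit's
    output distribution toward an exponential with mean mu.
    """
    y = sigmoid(a * x + b)
    common = 1.0 - (2.0 + 1.0 / mu) * y + y * y / mu
    db = eta * common
    da = eta * (1.0 / a + x * common)
    return a + da, b + db, y

def simulate(n_steps, sigma, mu=0.1, eta=0.01, a=1.0, b=0.0, seed=0):
    rng = np.random.default_rng(seed)
    rates = np.empty(n_steps)
    for t in range(n_steps):
        x = rng.normal(0.0, sigma)      # total synaptic current
        a, b, rates[t] = ip_step(x, a, b, mu, eta)
    return a, b, rates

# Adapt to unit-variance input, then to fivefold-reduced input (cf. Figure 2).
a, b, r1 = simulate(50_000, sigma=1.0)
a2, b2, r2 = simulate(50_000, sigma=0.2, a=a, b=b, seed=1)
print(r1[-10_000:].mean(), r2[-10_000:].mean())  # should both be close to mu = 0.1
```

With updates of this form, b mainly tracks the desired mean rate while a adapts the gain to the input scale, which is why deprivation (smaller σ) ends with a larger a, that is, a steeper activation function.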
In fact, most recent evidence suggests that such synaptic scaling is sensitive to glutamate levels in the extracellular medium (Stellwagen & Malenka, 2006), which are more directly related to the synaptic input x to the neuron than to its firing activity. Thus, the brain may use synaptic scaling to keep the level of input to a neuron in a desired regime while using intrinsic mechanisms to map this input to its own firing activity in a nearly optimal way. Changes in the slope and threshold of a neuron’s f − I curve, as well as combinations of both, have been observed experimentally. Such changes occur through the modification of voltage-gated channels including, for example, Na+ channels (Aizenman, Akerman, Jensen, & Hollis, 2003), K+ channels (van Welie et al., 2006), and Ih conductances (van Welie et al., 2004). Many such modifications seem to depend on the cell’s internal Ca2+ concentration, which is often regarded as an indicator of the neuron’s activity level. Modeling studies have also revealed that concerted changes of IA and Ih could implement the kind of gain changes associated with modification of the a-parameter in our parametric sigmoid (Burdakov, 2005). In sum,
it appears plausible that changes to the f − I function as predicted by our model could be implemented through the modification of voltage-gated channels. The most important prediction of our IP model(s) for experiments is that changes to the intrinsic excitability of a neuron should not be driven only by the average spiking activity of the neuron—the first moment of the neuron’s activity—but should also depend on higher moments. In particular, our gradient rule and our earlier rule based on moment matching (Triesch, 2005b) rely on the second moment of the neuron’s activity. This implies that experimental manipulations that leave a neuron’s average activity constant but change the second moment of its activity will lead to systematic changes in the neuron’s f − I curve. The required estimate of the second moment of the neuron’s firing rate may be implemented by an agent A that binds two Ca2+ ions: A + 2Ca2+ → A*. The concentration of A* could then approximate the square of the current firing rate. Furthermore, we hypothesize that the functional role of homeostatic synaptic plasticity is to maintain the synaptic current arriving at the soma in a desired regime, while the role of IP is to map this distribution of synaptic current onto the desired distribution of spiking activity. In principle, this hypothesis can be tested by any experimental technique that allows independently controlling a neuron’s synaptic input and its firing activity.

3 Interaction of IP with Hebbian Learning

Since IP appears to be a ubiquitous phenomenon in the brain, it is important to study how it interacts with other forms of plasticity, in particular with associative synaptic learning mechanisms. We begin by considering a standard Hebbian learning rule,

Δw = ηHebb u y(u) = ηHebb u ga,b(wT u),
(3.1)
where u is the vector of synaptic inputs to the neuron and ηHebb is a small learning rate. Note that since y is positive, Δwi will always be positive if ui is positive. Thus, for positive inputs ui, the strength of weight wi can only increase. Such uncontrolled weight growth is undesirable, even if the IP counteracts this effect and stabilizes the neuron’s activity. In order to avoid unlimited weight growth, we keep the weight vector fixed at unit length through the multiplicative weight normalization w ← w/||w||. Future work could also incorporate a model of synaptic scaling to keep the input to the neuron in a desired regime (see the discussion in section 2). Since synaptic scaling also seems to function in a multiplicative fashion (e.g., Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998), we expect that this would lead to similar results.
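A minimal sketch of equation 3.1 with the multiplicative normalization (the sigmoid plays the role of ga,b; all names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -50, 50)))

def hebb_step(w, u, a, b, eta_hebb):
    """One Hebbian update (equation 3.1) followed by the multiplicative
    normalization w <- w / ||w|| that keeps the weight vector at unit length."""
    y = sigmoid(a * np.dot(w, u) + b)   # firing rate y = g_ab(w^T u)
    w = w + eta_hebb * u * y            # Delta w = eta_Hebb * u * y(u)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
w = rng.normal(size=10)
w /= np.linalg.norm(w)
for _ in range(1000):
    w = hebb_step(w, rng.normal(size=10), a=1.0, b=-1.0, eta_hebb=0.01)
print(np.linalg.norm(w))  # remains 1 after every update
```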
3.1 Behavior in the Limit of Fast IP. It is instructive to consider the limiting case of fast IP: ηIP ≫ ηHebb. In this case, the IP will have generated an approximately exponential distribution of y(u) before the weight vector w can change much. Note that our discussion is independent of the exact IP mechanism (e.g., the specific learning rule introduced above) as long as the IP mechanism is effective in producing approximately exponential firing-rate distributions. In the case of fast IP, it can be seen from equation 3.1 that the expected value of the weight update, given by E[Δw] = ηHebb E[u y(u)], is an exponentially weighted sum of all the inputs u, with the firing rates y(u) functioning as the exponentially distributed weighting factors. A small number of inputs that produce the highest firing rates will exert a strong pull on the weight vector, while the majority of inputs lead to a small firing rate and change the weight vector only marginally. This mechanism can lead to the discovery of heavy-tailed directions in the input, as illustrated by the following experiment. A careful analysis for clustered inputs is given below. We consider a model neuron with just two inputs. The input distribution is such that there is one heavy-tailed direction and an orthogonal direction without heavy tails. At the same time, the distribution is white—that is, the covariance matrix of the inputs is the 2 × 2 identity matrix. For concreteness, we consider the Laplace band whose pdf is given by

p(u1, u2) = p(u1) p(u2) = (1/(2√6)) exp(−√2 |u1|) if |u2| ≤ √3, and 0 if |u2| > √3.   (3.2)
Along the u1-direction, this distribution has the shape of a Laplacian with exponential tails, while along the u2-direction it is uniform, with no tails at all (see Figure 3a). Note that since this distribution has the identity matrix as its covariance matrix, standard Hebbian learning in a linear unit would perform a random walk in weight space; it cannot discover the heavy-tailed direction. In combination with IP, however, the weight vector is drawn toward the heavy-tailed direction (see Figures 3b and 3c; parameters were µ = 0.1, ηIP = 0.01, ηHebb = 0.001). This process can be understood as follows. Consider the weight vector in an oblique orientation (the thin arrow in Figure 3b). For this weight vector, we plot a sample of inputs (circles), with their sizes representing how strongly they will activate the unit. The inputs leading to the highest firing rates (biggest circles) are more likely to appear to the right of the weight vector, at the far right of the input distribution. This imbalance will cause the weight vector to turn clockwise in the direction of the heavy tail. This process ends when
Figure 3: Discovery of heavy-tailed direction in a model neuron with just two inputs. See the text for details.
the weight vector has aligned with the heavy-tailed u1-direction, which corresponds to an orientation of zero degrees in Figure 3c. We have also experimented with a number of other input distributions and found that the property of seeking out the more heavy-tailed directions is a robust effect. Figure 3d demonstrates this for the case of a white input distribution that is the product of a gaussian along the u2-direction and a Laplacian along the u1-direction. The weight vector aligns with the more heavy-tailed u1-direction (parameters as above). Note that while we have assumed IP to be much faster than Hebbian learning, the same behavior is also exhibited if IP is slower than Hebbian learning. This is a robust effect that holds true in this example and for all experiments discussed in this article (an example is given in Figure 8). This mechanism for finding heavy-tailed directions is closely related to a number of approaches for independent component analysis (ICA) that use nonlinear Hebbian terms in their learning rules. A prominent example is the FastICA algorithm (Hyvärinen & Oja, 1997; Hyvärinen, Karhunen, & Oja, 2001). Since FastICA works with a fixed nonlinearity, it is not surprising that our method finds heavy-tailed directions even when IP is slow. However, one of the benefits of IP is that it drives the neuron toward using an energy-efficient code.
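Equation 3.2 can be sampled by drawing the two coordinates independently; the following sketch (our construction) verifies that the resulting distribution is white while only u1 is heavy-tailed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# Equation 3.2: u1 ~ Laplace with scale 1/sqrt(2) (unit variance),
# u2 ~ uniform on [-sqrt(3), sqrt(3)] (also unit variance).
u1 = rng.laplace(0.0, 1.0 / np.sqrt(2.0), size=n)
u2 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n)

def excess_kurtosis(x):
    # Fourth standardized moment minus 3; zero for a gaussian.
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

print(u1.var(), u2.var(), np.mean(u1 * u2))       # ~1, ~1, ~0: white
print(excess_kurtosis(u1), excess_kurtosis(u2))   # ~+3 (heavy tails) vs ~-1.2 (no tails)
```

Because the covariance matrix is (up to sampling noise) the identity, only the fourth moment distinguishes the two directions, which is exactly the information the nonlinear Hebbian update exploits.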
Figure 4: Illustration of alternative Hebbian learning rules, plotting the functions Ω(y) for three different learning rules. For the covariance and BCM rules, we plot Ω for the thresholds that balance LTP and LTD. For the BCM rule, Ω has been scaled by a factor of five for better visibility. The dotted horizontal line marks the boundary between LTP (Ω(y) > 0) and LTD (Ω(y) < 0). The dashed line shows the relative probability of firing rate y (not drawn to scale).
3.2 Alternative Hebbian Learning Rules. We now discuss the interaction of IP with alternative Hebbian learning rules. We focus on generalizations of the Hebbian learning rule, equation 3.1, of the kind

Δw = ηHebb u Ω(y(u)),
(3.3)
where Ω is a generally nonlinear function. The choice Ωid(y) = y corresponds to the standard Hebbian learning rule from above. Other choices lead to different learning rules, as illustrated in Figure 4. The choice Ωcov(y) = y − θcov is a so-called covariance rule, since it can lead to the discovery of the principal eigenvector of the input covariance matrix in a linear unit (e.g., Dayan & Abbott, 2001). To this end, θcov is set to be an online estimate of the unit’s mean activity. The choice ΩBCM(y) = (y − θBCM) y is a so-called quadratic BCM rule (Cooper, Intrator, Blais, & Shouval, 2004), which automatically stabilizes weight growth. Here, θBCM is usually an online estimate of the unit’s mean squared firing rate. In our analysis, we will keep the thresholds θcov and θBCM fixed, since the IP drives the unit to a fixed firing-rate distribution automatically. Both the covariance and the BCM rule differ from the standard rule by introducing long-term depression (LTD) in addition to long-term potentiation (LTP). In connection with IP as introduced above, it is easy to relate the amount of LTP versus LTD to the thresholds used in the learning rules.
A natural choice for the thresholds is one that balances LTP and LTD. The specific condition for this balance is that Ω(y) is on average zero:2

∫0^∞ (1/µ) exp(−y/µ) Ω(y) dy = 0.   (3.4)
In the case of the covariance rule, this condition leads to θcov = µ; that is, LTD is exhibited for inputs that lead to below-average firing rates, and LTP for above-average firing rates (see Figure 4). For the BCM rule, equation 3.4 leads to θBCM = 2µ. Thus, in this balanced LTP-LTD regime, the BCM rule will exhibit maximal LTD for inputs that correspond to the mean firing rate, and LTP occurs only for inputs that lead to firing rates greater than twice the average firing rate (compare Figure 4). Firing rates in the tail of the exponential distribution have a very strong influence on the weight vector due to the quadratic increase of Ω(y). Empirically, we found that such alternative Hebbian learning rules are also effective in finding heavy-tailed directions when combined with our IP mechanism. Based on the arguments above relating nonlinear Hebbian learning rules to ICA algorithms, this is not surprising. In particular, it has been argued that BCM-type learning rules can discover sparse directions in the absence of any IP (Cooper et al., 2004).

3.3 IP as an Alternative to the Sliding Threshold in BCM Theory. A hallmark of the BCM theory of synaptic plasticity is the idea of a sliding threshold that separates postsynaptic firing rates producing LTP versus LTD (Cooper et al., 2004). LTP occurs only when the postsynaptic activity is above the threshold θBCM, but θBCM is assumed to change as a function of the cell’s activity history. The evidence for such a sliding LTP-LTD threshold seems to be only indirect, however, and IP may present a plausible alternative to the sliding threshold proposed in BCM learning. One of the pieces of evidence put forward as support for the idea of a sliding threshold is data from Kirkwood and colleagues demonstrating that for dark-reared cats, weaker inputs to a neuron are sufficient to produce LTP when compared to normally reared cats (Kirkwood, Rioult, & Bear, 1996).
The threshold stimulation frequency that separates LTD and LTP is smaller for the dark-reared cats, which is consistent with the idea of a sliding threshold on the postsynaptic activity level. But an alternative explanation of these data is that neurons in the dark-reared cats are more excitable such that the same stimulation frequency leads to higher firing levels, and this is
2 This may not be the only sensible choice, however. Another form of balance would result from requiring that half of the inputs induce LTP while the other half induce LTD; that is, we require Ω(µ ln(2)) = 0, where µ ln(2) is the median firing activity of the unit. This choice leads to θcov = θBCM = µ ln(2).
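The balanced thresholds θcov = µ and θBCM = 2µ, as well as the footnote’s median-based alternative, can be verified numerically by evaluating the integral of equation 3.4 (a quick check; names are ours):

```python
import numpy as np

mu = 0.1
y = np.linspace(0.0, 50.0 * mu, 200_001)   # integration grid; tail beyond 50*mu is negligible
p = np.exp(-y / mu) / mu                   # exponential firing-rate density
dy = y[1] - y[0]

def balance(omega):
    # Left-hand side of equation 3.4 for a given Omega(y) (Riemann sum).
    return np.sum(p * omega(y)) * dy

cov = balance(lambda v: v - mu)              # covariance rule with theta_cov = mu
bcm = balance(lambda v: (v - 2.0 * mu) * v)  # BCM rule with theta_BCM = 2*mu
print(cov, bcm)  # both approximately zero: LTP and LTD balance

# Footnote 2: the median of an exponential with mean mu is mu*ln(2).
med = np.median(np.random.default_rng(0).exponential(mu, 1_000_000))
print(med / (mu * np.log(2.0)))  # ratio close to 1
```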
Figure 5: Intrinsic plasticity may provide an alternative explanation to the idea of a sliding threshold as introduced in the BCM theory of Hebbian plasticity. The change of the weight Δw is plotted as a function of the strength of the input driving the neuron. For a neuron that experiences inputs of four times lower variance, the plasticity curve shifts to the left because IP makes the neuron more excitable. Due to this, the unit in the low-variance input environment will exhibit a stronger response and thus a stronger change in weight for the same input.
why weaker inputs can already induce LTP, even for a fixed modification threshold. Figure 5 illustrates this effect. It shows a plasticity curve, where the change of the weight Δw, corresponding to the amount of LTP or LTD, is plotted as a function of the strength of the input to the neuron. We plot this for two neurons that have received different levels of stimulation in the past and have adjusted their activation functions accordingly. Both units receive white, zero-mean gaussian input of different variance. The unit receiving input with four times lower variance corresponds to the situation of the dark-reared cats. Due to the low-variance input, it has assumed a steeper activation function that produces higher outputs for the same input, as shown in Figure 2. As a consequence, its plasticity function is shifted to the left. Thus, shifts in the plasticity function can in principle be explained by IP and do not require the assumption of a sliding threshold. This mechanism may work in conjunction with homeostatic synaptic scaling mechanisms, as mentioned earlier.

3.4 Analysis for Clustered Inputs. A more formal analysis of the interaction between IP and Hebbian learning rules can be performed when we assume that the input distribution is a mixture of N well-separated clusters in a high-dimensional input space. Let the cluster centers be at locations ci, i = 1, . . . , N. We treat only the case where the clusters have equal
Figure 6: Behavior of three Hebbian learning rules for clustered input. The contribution that each input cluster makes to the weight vector is plotted as a function of the cluster number.
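The weighting factors underlying the simple-Hebb curve in Figure 6 can be computed directly from equation 3.6 (a sketch; only the direction of w matters, so the proportionality constant is dropped):

```python
import numpy as np

N = 50
i = np.arange(1, N + 1, dtype=float)

def xlogx(v):
    # Convention from the text: 0 log(0) := 0.
    return np.where(v > 0, v * np.log(np.maximum(v, 1e-300)), 0.0)

# Equation 3.6: f_i proportional to 1 + log(N) - i log(i) + (i - 1) log(i - 1).
f = 1.0 + np.log(N) - xlogx(i) + xlogx(i - 1)

print(f[0], f[-1])             # cluster 1 contributes most; cluster N almost nothing
print(np.all(np.diff(f) < 0))  # contributions decrease monotonically with i
```

By the strict convexity of x log x, the differences f_i − f_{i+1} are positive, so the ordering of the clusters by distance to the final weight vector translates into strictly decreasing contributions, matching the shape plotted in Figure 6.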
prior probabilities. The extension to the general case is straightforward. As shown in the appendix, for the case of the standard Hebbian learning rule, the stationary solutions for the weight vector w will be a weighted sum of the cluster centers ci,

w = Σ_{i=1}^N f_i^Hebb ci,   (3.5)
where the weighting factors f_i^Hebb are given by

f_i^Hebb ∝ 1 + log(N) − i log(i) + (i − 1) log(i − 1),   (3.6)
and we define 0 log(0) ≡ 0. This relation is plotted in Figure 6. Here, the clusters have been ordered such that cluster 1 is the one closest to the final weight vector and cluster N is the one most distant from it. There can be at most N! resulting weight vectors, owing to the number of possible assignments of the f_i^Hebb to the clusters, but in general not all of them are stationary points of the learning dynamics. An interesting result of this analysis is that the final weight vector does not depend on the desired mean activity level µ. In the special case of the cluster centers lying at random locations in a very high-dimensional space, all of the possible stationary weight vectors correspond to heavy-tailed directions. More precisely, the linear projection of the inputs onto the weight vector will have an approximately exponential distribution; that is, its distribution will have a heavier tail than a gaussian (Triesch, 2005b). This formal result complements the informal analysis from above, which
suggested that the combination of IP with Hebbian learning leads to the discovery of heavy-tailed directions in the input. We have performed the same analysis for the case of the alternative Hebbian rules introduced above (see the appendix). For the covariance rule, the contributions of the individual clusters to the weight vector are very similar to the contributions in the simple Hebbian case, as shown in Figure 6, although a few of the weights will actually become slightly negative. For the BCM rule, however, most of the input clusters will contribute to the weight vector with a small negative weight, and only a small fraction of the clusters contributes to the final weight vector with a strong positive weight. The exact size of this fraction depends on the choice of the threshold θBCM. Figure 6 shows the result for the “balanced” choice θBCM = 2µ.

4 The Bars Problem

The bars problem is a standard problem for unsupervised learning architectures (Földiák, 1990). Formally, it is a nonlinear independent component analysis problem. The input domain consists of an N-by-N retina. On this retina, all horizontal and vertical bars (2N in total) can be displayed. The presence or absence of each bar is determined independently, with every bar occurring with the same probability p (in our case, p = 1/N). If a horizontal and a vertical bar overlap, the pixel at the intersection point will be just as bright as any other pixel on the bars rather than twice as bright. This makes the problem a nonlinear ICA problem. Example stimuli from the bars data set are shown in Figure 7a. Note that we normalize input vectors to unit length. The goal of learning in the bars problem is to find the independent sources of the images: the individual bars. Thus, the neural learning system should develop filters that represent the individual bars. We have trained an individual sigmoidal model neuron on the bars input domain.
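A generator for the bar stimuli described above might look as follows (a sketch with our own names; N and p as defined in the text):

```python
import numpy as np

def bars_stimulus(N, rng, p=None):
    """Return one flattened, unit-length N-by-N bars pattern.

    Each of the 2N bars (N horizontal, N vertical) is switched on
    independently with probability p = 1/N; pixels at crossings are as
    bright as any other bar pixel (nonlinear superposition).
    """
    if p is None:
        p = 1.0 / N
    img = np.zeros((N, N))
    img[rng.random(N) < p, :] = 1.0      # horizontal bars
    img[:, rng.random(N) < p] = 1.0      # vertical bars (no doubling at crossings)
    v = img.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v   # normalize nonblank patterns to unit length

rng = np.random.default_rng(0)
x = bars_stimulus(10, rng)
print(x.shape)  # (100,)
```

Setting bar pixels to 1 (rather than summing the bars) implements the nonlinearity at intersections that makes this an ICA problem a linear model cannot solve.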
We used the simple Hebbian rule, equation 3.1, together with the multiplicative weight normalization that keeps the weight vector at unit length. The theoretical analysis above assumed that intrinsic plasticity is much faster than synaptic plasticity. Here, we set the timescale of IP to be identical to the one for synaptic plasticity (ηIP = 0.01, ηHebb = 0.01). As illustrated in Figure 7b, the unit’s weight vector aligns with one of the individual bars as soon as the intrinsic plasticity has pushed the model neuron into a regime where its responses are sparse: the unit has discovered one of the independent sources of the input domain. This result is robust if the desired mean activity µ of the unit is changed over a wide range. If µ is reduced from its default value (1/(2N) = 0.05) over several orders of magnitude (we tried down to 10^−5), the result remains unchanged. However, if µ is increased above about 0.15, the unit will fail to represent an individual bar but will learn a mixture of two or more bars, with different bars being represented with different strengths. Thus, in this example—in contrast to the theoretical result above—the desired mean activity µ does influence the
Figure 7: Single-model neuron discovering a “bar” in the bars problem. (a) Examples of bars stimuli. (b) Activity of the neuron during the learning process and weight vector at five times during learning. (c) Histogram of weight strength of the final weight vector. Parameters were ηIP = 0.01, ηHebb = 0.01.
weight vector that is being learned. The reason is that the IP only imperfectly adjusts the output distribution to the desired exponential shape. As long as µ is small, however, discovery of a single bar is a very robust effect. For example, it still occurs if all the input patterns presented to the system have at least, say, four bars in them (i.e., the unit never sees a single bar in isolation). The discovery of a single bar also does not critically depend on the relative timescales of IP and Hebbian learning. Figure 8 shows results for different relative speeds of IP and Hebbian learning. When IP is faster than Hebbian learning (left graph, ηIP = 0.01, ηHebb = 0.001), a single bar is discovered as soon as the unit’s activity is driven into a heavy-tailed regime. When IP is slower than Hebbian learning (center graph, ηIP = 0.001, ηHebb = 0.01), the same behavior is observed, although convergence is slower. Importantly, this behavior is not observed when IP is absent, as shown in Figure 8 (right). In this experiment, we used a fixed sigmoid nonlinearity with a = 5.0, b = −1.15. These values correspond to the final sigmoid after convergence in Figure 7 (see footnote 3). With this fixed nonlinearity, the unit is unable to discover a bar. We varied the learning rate ηHebb over two orders of magnitude and also ran longer simulations, but the unit never discovered a single bar. This demonstrates that IP contributes more to the learning process than just setting the sigmoid nonlinearity to its final shape. Instead, it shows that the interaction of IP with Hebbian learning over an extended time contributes to the final result. This extended interaction leads to large transient
3 More precisely, in the “converged state,” these parameters still fluctuate. a stays mostly in the interval [4.5, 5.5]. b fluctuates in the interval [−1.0, −1.3].
Figure 8: Learning is robust to changes in the relative timescale of IP versus Hebbian learning. Faster IP (left plot, ηIP = 0.01, ηHebb = 0.001) and slower IP (center plot, ηIP = 0.001, ηHebb = 0.01) both lead to the discovery of a single bar (see final weight vectors shown as small insets). Without any IP (right plot, ηHebb = 0.001), but with a constant nonlinearity that corresponds to the final nonlinearity in Figure 7, no bar is discovered.
deviations of the sigmoid from its final shape. During this transient deformation, the unit’s activity first enters a sparse regime and discovers a bar. This weight pattern is maintained as the unit keeps adjusting its nonlinearity toward the final shape. It should be noted, however, that setting a and b to constant values that are assumed during the transient phase can also lead to successful discovery of a bar. Thus, IP is not strictly necessary for successful learning, but it automatically adjusts the nonlinearity in the desired way and ensures an energy-efficient, approximately exponential activity distribution throughout the learning process.

5 Discussion

We have presented a model of IP rooted in information theory and have shown that the model is effective in maintaining homeostasis of the neuron’s firing-rate distribution by making it approximately exponential. The model predicts that changes of the neuron’s intrinsic excitability should depend not only on the neuron’s average firing rate (first moment or mean) but also on its average squared firing rate (second moment). This mechanism may complement other homeostasis mechanisms such as synaptic scaling (Turrigiano & Nelson, 2000, 2004; Maffei et al., 2004). Recent evidence suggests that homeostatic synaptic scaling may occur in response to changing levels of afferent input to the cell rather than to changes of the cell’s firing activity (Stellwagen & Malenka, 2006). We propose that synaptic scaling tries to keep the synaptic current arriving at the soma in a certain regime, while intrinsic plasticity adjusts the neuron’s nonlinear properties to map this current onto an energy-efficient, sparse firing pattern.

5.1 Connection to Posttraumatic Epileptogenesis. We demonstrated that the proposed IP mechanism contributes to homeostasis by increasing
a neuron’s excitability after a reduction of its afferent input. This effect may play an important role in the genesis of posttraumatic epilepsy. Physiological studies have found increased excitability of pyramidal neurons after trauma (Prince & Tseng, 1993), and modeling studies have shown that such increased excitability, together with increased NMDA conductance, may explain posttraumatic epileptogenesis (Bush, Prince, & Miller, 1999). Our IP model suggests a causal connection between the trauma and the increased excitability. Due to the trauma, neurons in the neighboring tissue lose a fraction of their afferent input, which triggers increased excitability due to IP. This in turn makes the local circuit more prone to generating epileptiform events. This mechanism may be complementary to the putative role of homeostatic synaptic plasticity in posttraumatic epileptogenesis (Houweling, Bazhenov, Timofeev, Steriade, & Sejnowski, 2005). More work is needed to understand the interaction of these mechanisms in detail, with potential implications for clinical applications.

5.2 Role for Learning Efficient Sensory Representations. We have demonstrated that there may be a synergistic relationship between Hebbian synaptic learning and IP mechanisms that change the intrinsic excitability of neurons. The two processes may work together to find heavy-tailed directions in the input space. While our theoretical arguments explained this behavior only in the limiting case of IP being much faster than synaptic plasticity, we found empirically that the effect persists if intrinsic plasticity is as slow as or even much slower than synaptic plasticity. In addition, we have demonstrated how IP may mimic the effect of a sliding threshold for synaptic plasticity as postulated by the BCM theory. New experiments are needed to distinguish between the alternative explanations of a sliding plasticity threshold and a change in excitability with a fixed threshold.
Our work is closely related to other work on finding efficient representations of sensory stimuli in general (Olshausen & Field, 1996; Bell & Sejnowski, 1997) and on doing so with biologically plausible plasticity mechanisms in particular. For example, some authors have used modified versions of standard Hebbian learning rules in order to discover independent components in the input (Blais, Intrator, Shouval, & Cooper, 1998; Hyvärinen & Oja, 1998), and certain independent component analysis techniques also rely on terms that multiply the input signal with a nonlinearly transformed output signal (Hyvärinen & Oja, 1997; Hyvärinen et al., 2001). Thus, our work is closely related to these approaches, but it introduces IP as a mechanism to adapt the nonlinearity online in a biologically viable fashion. To our knowledge, this was ignored in previous work on the topic. Our result from Figure 8 (right) suggests that, at least in certain instances, this online adaptation aids learning. In addition, the IP mechanism ensures an energy-efficient encoding throughout the learning process.
5.3 Coupling with Reward Mechanisms. Good sensory representations must efficiently encode the aspects of the input domain that are relevant for the organism’s behavior, which implies not wasting neural resources on representing irrelevant information. While in the experiments on the bars problem a model neuron learned to represent a single bar at random, it is easy to envision how a simple gating mechanism can lead to learning of relevant heavy-tailed directions in a biologically plausible fashion (e.g., Montague, Dayan, Person, & Sejnowski, 1995; Roelfsema & van Ooyen, 2005). For example, the following modification of the synaptic learning rule allows the discovery of a specific bar that is associated with rewards or punishments:

Δw = ηyxR.
(5.1)
In this equation, R represents a scalar relevance signal identifying situations that may be worth learning about. For example, we may define R ≡ |δ|, where δ represents a temporal difference error (Sutton & Barto, 1998), which has been associated with the activity of dopaminergic neurons. The use of |δ| reflects that both positive and negative events may be worth learning about. If, for example, an unexpected positive or negative reward is given (δ = ±1) whenever a specific bar appears (irrespective of the presence or absence of any other bars), the modified learning rule will discover this specific bar, because the gating signal effectively restricts learning to those stimuli that contain this bar. Such relevance feedback could be the effect of dopamine, as suggested above, but it could also be associated with the action of acetylcholine, which is known to globally influence synaptic plasticity (Pennartz, 1995). It would be interesting to incorporate the ideas presented in this article into an actor-critic framework to study how the emergence of efficient sensory representations may be influenced by task demands.

5.4 Future Work. As a next step, it is important to study the effect of combined intrinsic and Hebbian synaptic plasticity at the network level. While we have shown that intrinsic plasticity can effectively ensure the lifetime sparseness of individual neurons (Willmore & Tolhurst, 2001) in the sense of exponential firing-rate distributions, additional lateral inhibitory interactions in a network are likely necessary to decorrelate units in a population in order to also ensure population sparseness, as observed in the cortex (Vinje & Gallant, 2000). We speculate that the cerebral cortex utilizes a synergistic interaction of intrinsic and synaptic plasticity mechanisms within a sophisticated system of local excitation and inhibition to find sensory codes that exhibit both lifetime and population sparseness (Lehky, Sejnowski, & Desimone, 2005).
Preliminary work has already established that topographic arrangements of simple cell-like receptive fields emerge in a two-dimensional network of interconnected units with simultaneous intrinsic and synaptic learning (Butko & Triesch, 2006).
Intrinsic and Synaptic Plasticity
903
It will also be interesting to study the effects of combining intrinsic and synaptic learning mechanisms in recurrent network models. Recently we have also started to investigate the interaction of IP models for spiking neurons with spike-timing-dependent plasticity (STDP) in recurrent networks and found that the combination of STDP and IP can produce very different network dynamics from those resulting from STDP without IP. We speculate that similar combinations of plasticity mechanisms may be effective in driving recurrent networks into a regime of great computational power (Bertschinger & Natschläger, 2004) for a fixed energy expenditure.

In conclusion, since IP appears to be a ubiquitous phenomenon in the brain, many aspects of neural functioning and plasticity could be profoundly influenced by it. A careful analysis of the interaction of various forms of plasticity in the nervous system is arguably vital for our understanding of the brain's ability to adapt to ever-changing demands during normal development and learning and in response to various disturbances such as trauma or infection. Our analysis of the interaction of Hebbian learning with the proposed IP model is a small step in this direction. In a similar vein, Janowitz and van Rossum (2006) have explored the role of a fast nonhomeostatic IP mechanism for trace conditioning. So far, however, we have seen only a small glimpse of this terra incognita.

Appendix A: Gradient Rule for IP

We describe the distribution of the total synaptic current arriving at the soma, x, by the probability distribution f_x(x). Assuming the neuron's nonnegative firing rate is given by y = g(x), where g is strictly monotonically increasing, the distribution of y is given by
$$ f_y(y) = \frac{f_x(x)}{\partial y / \partial x}. \tag{A.1} $$
We are looking for ways to adjust g so that f_y(y) is approximately exponential; that is, we would like f_y(y) to be "close" to $f_{\exp}(y) = \frac{1}{\mu}\exp(-y/\mu)$. A natural measure is to consider the Kullback-Leibler divergence (KL divergence) between f_y and f_exp:
$$ D \equiv d(f_y \| f_{\exp}) = \int f_y(y) \log\!\left( \frac{f_y(y)}{\frac{1}{\mu}\exp(-y/\mu)} \right) dy \tag{A.2} $$
$$ = \int f_y(y) \log(f_y(y))\, dy - \int f_y(y) \left( -\frac{y}{\mu} - \log\mu \right) dy \tag{A.3} $$
$$ = -H(y) + \frac{1}{\mu} E(y) + \log\mu, \tag{A.4} $$
where H(y) is the entropy of y and E(y) is its expected value. Thus, in order to minimize d(f_y ‖ f_exp), we need to maximize the entropy H(y) while minimizing the expected value E(y). The term log µ does not depend on g and is irrelevant for this minimization. To derive a stochastic gradient-descent rule for IP that strives to minimize equation A.4, we need to take into account the specific form of the parameterized nonlinearity, equation 2.1. To construct the gradient, we calculate the partial derivatives of D with respect to a and b:
$$ \frac{\partial D}{\partial a} = \frac{\partial d(f_y \| f_{\exp})}{\partial a} = -\frac{\partial H}{\partial a} + \frac{1}{\mu} \frac{\partial E(y)}{\partial a} \tag{A.5} $$
$$ = E\!\left[ -\frac{\partial}{\partial a} \log\frac{\partial y}{\partial x} + \frac{1}{\mu} \frac{\partial y}{\partial a} \right] \tag{A.6} $$
$$ = -\frac{1}{a} + E\!\left[ -x + \left( 2 + \frac{1}{\mu} \right) x y - \frac{1}{\mu} x y^2 \right], \tag{A.7} $$
where we have used equation A.1 and moved the differentiation inside the expected value operation to obtain equation A.6 and exploited
$$ \log\frac{\partial y}{\partial x} = \log a + \log y + \log(1-y) \tag{A.8} $$
in the last step. Similarly, for the partial derivative with respect to b, we find:
$$ \frac{\partial D}{\partial b} = \frac{\partial d(f_y \| f_{\exp})}{\partial b} = -\frac{\partial H}{\partial b} + \frac{1}{\mu} \frac{\partial E(y)}{\partial b} \tag{A.9} $$
$$ = -1 + E\!\left[ \left( 2 + \frac{1}{\mu} \right) y - \frac{1}{\mu} y^2 \right]. \tag{A.10} $$
The resulting stochastic gradient-descent rule is given by a ← a + Δa and b ← b + Δb, with:
$$ \Delta a = \eta_{IP} \left( \frac{1}{a} + x - \left( 2 + \frac{1}{\mu} \right) x y + \frac{1}{\mu} x y^2 \right) \tag{A.11} $$
$$ \Delta b = \eta_{IP} \left( 1 - \left( 2 + \frac{1}{\mu} \right) y + \frac{1}{\mu} y^2 \right). \tag{A.12} $$
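The resulting rule can be simulated directly. The sketch below is illustrative only: it assumes the logistic nonlinearity y = 1/(1 + exp(−(ax + b))) of equation 2.1 and uses gaussian current samples as a stand-in input distribution; the function names and parameter values are not from the original.

```python
import numpy as np

def ip_step(x, a, b, mu, eta):
    """One stochastic IP update of the gain a and bias b (equations A.11-A.12)."""
    y = 1.0 / (1.0 + np.exp(-(a * x + b)))   # logistic nonlinearity (equation 2.1)
    da = eta * (1.0 / a + x - (2.0 + 1.0 / mu) * x * y + (1.0 / mu) * x * y ** 2)
    db = eta * (1.0 - (2.0 + 1.0 / mu) * y + (1.0 / mu) * y ** 2)
    return a + da, b + db, y

rng = np.random.default_rng(0)
a, b, mu = 1.0, 0.0, 0.1
rates = []
for _ in range(50000):
    x = rng.normal()                  # stand-in for the total synaptic current
    a, b, y = ip_step(x, a, b, mu, eta=0.01)
    rates.append(y)
mean_rate = float(np.mean(rates[-10000:]))   # settles near the target mean mu
```

Starting from a = 1, b = 0 (mean rate 0.5), the updates drive the firing-rate distribution toward an approximately exponential shape with mean close to µ = 0.1.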
Appendix B: Analysis of Learning with Clustered Inputs

We consider the limiting case of fast IP, and we assume that IP is perfect—it leads to an exactly exponential distribution of activities (Triesch, 2005b). We will first treat the case of only two input clusters and then generalize to N clusters. Let us assume two clusters of inputs at locations c_1 and c_2. Let us also assume that both clusters account for exactly half of the inputs. If the weight vector is slightly closer to one of the two clusters, inputs from this cluster will activate the unit more strongly. Let m = µ ln(2) denote the median of the exponential firing-rate distribution with mean µ. Then inputs from the closer cluster, say c_1, will be responsible for all activities above m, while the inputs from the other cluster will be responsible for all activities below m. Hence, for the standard Hebb rule, the expected value of the weight update Δw will be given by:
$$ \Delta w \approx \alpha c_1 \int_m^{\infty} \frac{y}{\mu} \exp(-y/\mu)\, dy + \alpha c_2 \int_0^{m} \frac{y}{\mu} \exp(-y/\mu)\, dy \tag{B.1} $$
$$ = \frac{\alpha\mu}{2} \left( (1+\ln 2)\, c_1 + (1-\ln 2)\, c_2 \right). \tag{B.2} $$
Taking the multiplicative weight normalization into account, we see that stationary points of the weight vector require that Δw is parallel to w. From this, it follows that the weight vector must converge to either of the following two stationary states:
$$ w = \frac{(1 \pm \ln 2)\, c_1 + (1 \mp \ln 2)\, c_2}{\| (1 \pm \ln 2)\, c_1 + (1 \mp \ln 2)\, c_2 \|}. \tag{B.3} $$
The weight vector moves close to one of the two clusters but does not fully commit to it. In the case of N clusters, the situation is similar. In order to calculate the contribution of the ith closest cluster to the weight vector, we have to consider the inverse of the cumulative distribution function (percent point function) of the exponential firing-rate distribution, which is given by
$$ F_y^{-1}(p) = -\mu \log(1-p). \tag{B.4} $$
(Recall that the median of a random variable y is given by F_y^{-1}(1/2).) Assuming that all clusters contribute the same number of inputs, the cluster associated with the ith highest interval of firing rates is "responsible" for the interval [1 − i/N, 1 − (i−1)/N] of the cumulative distribution function. It
follows that the average amount of weight update associated with inputs from cluster i under the standard Hebb rule is given by
$$ \int_{F_y^{-1}(1-i/N)}^{F_y^{-1}(1-(i-1)/N)} y\, \frac{1}{\mu} \exp(-y/\mu)\, dy, \tag{B.5} $$
which reduces to
$$ \mu \left[ \frac{i}{N} \left( 1 - \log\frac{i}{N} \right) - \frac{i-1}{N} \left( 1 - \log\frac{i-1}{N} \right) \right]. \tag{B.6} $$
This expression determines the relative contribution of cluster i to the weight vector in the steady state. After simplification, we can see that possible stable weight vectors are of the form
$$ w = \sum_{i=1}^{N} f_i^{\mathrm{Hebb}}\, c_i, \tag{B.7} $$
where
$$ f_i^{\mathrm{Hebb}} \propto 1 + \log(N) - i \log(i) + (i-1) \log(i-1) \tag{B.8} $$
is the relative contribution of the ith cluster to the weight vector. Note that for the case of N = 2, this matches the result above (see equation B.3). It is straightforward to generalize this analysis to the alternative Hebbian learning rules introduced in the main text. In order to calculate the expected weight update due to cluster i, we generalize equation B.5 to
$$ \int_{F_y^{-1}(1-i/N)}^{F_y^{-1}(1-(i-1)/N)} \psi(y)\, \frac{1}{\mu} \exp(-y/\mu)\, dy, \tag{B.9} $$
where ψ(y) denotes the postsynaptic factor of the learning rule. For the balanced covariance rule, ψ(y) = y − µ, we obtain
$$ f_i^{\mathrm{cov}} \propto f_i^{\mathrm{Hebb}} - \frac{\mu}{N}. \tag{B.10} $$
For the balanced BCM rule, ψ(y) = y(y − 2µ), we obtain:
$$ f_i^{\mathrm{BCM}} \propto i \log^2\!\frac{i}{N} - (i-1) \log^2\!\frac{i-1}{N}. \tag{B.11} $$
In Figure 6 we plot the normalized f_i such that $\sum_i f_i^2 = 1$.
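These closed forms can be checked numerically. The sketch below (illustrative code; the function names are not from the original) evaluates equation B.8 with the convention 0 · log 0 = 0 and compares it against direct trapezoidal integration of equation B.5 over each quantile band:

```python
import numpy as np

def xlogx(v):
    # convention: 0 * log(0) = 0
    return np.where(v > 0, v * np.log(np.maximum(v, 1e-12)), 0.0)

def f_hebb(N):
    """Relative cluster contributions from the closed form, equation B.8."""
    i = np.arange(1, N + 1, dtype=float)
    f = 1.0 + np.log(N) - xlogx(i) + xlogx(i - 1)
    return f / np.linalg.norm(f)          # normalized so that sum_i f_i^2 = 1

def f_hebb_numeric(N, mu=1.0):
    """Same contributions by integrating equation B.5 over each quantile band."""
    f = np.empty(N)
    for i in range(1, N + 1):
        lo = -mu * np.log(i / N)                                 # F_y^{-1}(1 - i/N)
        hi = -mu * np.log((i - 1) / N) if i > 1 else 50.0 * mu   # tail cutoff at i = 1
        y = np.linspace(lo, hi, 200001)
        g = y * np.exp(-y / mu) / mu
        f[i - 1] = np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(y))   # trapezoid rule
    return f / np.linalg.norm(f)

# N = 2 reproduces the (1 +/- ln 2) weights of equation B.3
ratio = f_hebb(2)[0] / f_hebb(2)[1]       # equals (1 + ln 2) / (1 - ln 2)
```

The normalization to unit Euclidean norm matches the convention used for Figure 6 and makes the analytic and numerical contributions directly comparable.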
Acknowledgments

This work was supported in part by the National Science Foundation under grant NSF IIS-0208451 and the Hertie Foundation. I thank Nicholas Butko, Emanuel Todorov, and Erik Murphy-Chutorian for discussions and Cornelius Weber and two anonymous reviewers for helpful comments on earlier drafts.

References

Aizenman, C., Akerman, C., Jensen, K., & Hollis, T. (2003). Visually driven regulation of intrinsic neuronal excitability improves stimulus detection in vivo. Neuron, 39, 831–842.
Baddeley, R., Abbott, L. F., Booth, M., Sengpiel, F., & Freeman, T. (1998). Responses of neurons in primary and inferior temporal visual cortices to natural scenes. Proc. R. Soc. London, Ser. B, 264, 1775–1783.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Bertschinger, N., & Natschläger, T. (2004). Real-time computation at the edge of chaos in recurrent neural networks. Neural Computation, 16, 1413–1436.
Bienenstock, E., Cooper, L., & Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48.
Blais, B. S., Intrator, N., Shouval, H., & Cooper, L. N. (1998). Receptive field formation in natural scene environments. Neural Computation, 10, 1797–1813.
Burdakov, D. (2005). Gain control by concerted changes in Ia and Ih conductances. Neural Computation, 17, 991–995.
Bush, P., Prince, D., & Miller, K. (1999). Increased pyramidal excitability and NMDA conductance can explain posttraumatic epileptogenesis without disinhibition: A model. Journal of Neurophysiology, 82(4), 1748–1758.
Butko, N. J., & Triesch, J. (2006). Exploring the role of intrinsic plasticity for the learning of sensory representations. In M. Verleysen (Ed.), European Symposium on Artificial Neural Networks (ESANN) (pp. 467–472). Bruges, Belgium: d-side.
Cooper, L., Intrator, N., Blais, B., & Shouval, H. (2004). Theory of cortical plasticity. River Edge, NJ: World Scientific Publishing.
Cudmore, R., & Turrigiano, G. (2004). Long-term potentiation of intrinsic excitability in LV visual cortical neurons. J. Neurophysiol., 92, 341–348.
Daoudal, G., & Debanne, D. (2003). Long-term plasticity of intrinsic excitability: Learning rules and mechanisms. Learning and Memory, 10, 456–465.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Desai, N. S., Rutherford, L. C., & Turrigiano, G. G. (1999). Plasticity in the intrinsic excitability of cortical pyramidal neurons. Nature Neuroscience, 2(6), 515–520.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 165–170.
Houweling, A., Bazhenov, M., Timofeev, I., Steriade, M., & Sejnowski, T. (2005). Homeostatic synaptic plasticity can explain post-traumatic epileptogenesis in chronically isolated neocortex. Cerebral Cortex, 15, 834–845.
Hyvärinen, A., Karhunen, J., & Oja, E. (2001). Independent component analysis. New York: Wiley.
Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9, 1483–1492.
Hyvärinen, A., & Oja, E. (1998). Independent component analysis by general nonlinear Hebbian-like learning rules. Signal Processing, 64(3), 301–313.
Janowitz, M., & van Rossum, M. (2006). Excitability changes that complement Hebbian learning. Network: Computation in Neural Systems, 17(1), 31–41.
Kirkwood, A., Rioult, M., & Bear, M. (1996). Experience-dependent modification of synaptic plasticity in visual cortex. Nature, 381, 526–528.
Laughlin, S. (1981). A simple coding procedure enhances a neuron's information capacity. Z. Naturforsch., 36c, 910–912.
Lehky, S., Sejnowski, T., & Desimone, R. (2005). Selectivity and sparseness in the responses of striate complex cells. Vision Research, 45, 57–73.
Maffei, A., Nelson, S., & Turrigiano, G. (2004). Selective reconfiguration of layer 4 visual cortical circuitry by visual deprivation. Nature Neuroscience, 7(12), 1353–1359.
Marder, E., Abbott, L. F., Turrigiano, G. G., Liu, Z., & Golowasch, J. (1996). Memory from the dynamics of intrinsic membrane currents. Proc. Natl. Acad. Sci. USA, 93, 13481–13486.
Montague, P., Dayan, P., Person, C., & Sejnowski, T. (1995). Bee foraging in uncertain environments using predictive Hebbian learning. Nature, 377, 725–728.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Pennartz, C. (1995). The ascending neuromodulatory systems in learning by reinforcement: Comparing computational conjectures with experimental findings. Brain Res. Rev., 21, 219–245.
Prince, D., & Tseng, G.-F. (1993). Epileptogenesis in chronically injured cortex: In vitro studies. Journal of Neurophysiology, 69, 1276–1291.
Roelfsema, P. R., & van Ooyen, A. (2005). Attention-gated reinforcement learning of internal representations for classification. Neural Computation, 17, 2176–2214.
Shin, J., Koch, C., & Douglas, R. (1999). Adaptive neural coding dependent on the time-varying statistics of the somatic input current. Neural Computation, 11, 1893–1913.
Stellwagen, D., & Malenka, R. C. (2006). Synaptic scaling mediated by glial TNF-α. Nature, 440(7087), 1054–1059.
Stemmler, M., & Koch, C. (1999). How voltage-dependent conductances can adapt to maximize the information encoded by neuronal firing rate. Nature Neuroscience, 2(6), 521–527.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Triesch, J. (2005a). A gradient rule for the plasticity of a neuron's intrinsic excitability. In W. Duch, J. Kacprzyk, E. Oja, & S. Zadrozny (Eds.), Proc. Int. Conf. on Artificial Neural Networks (ICANN) (pp. 65–70). Berlin: Springer.
Triesch, J. (2005b). Synergies between intrinsic and synaptic plasticity in individual model neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Turrigiano, G., Leslie, K., Desai, N., Rutherford, L., & Nelson, S. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
Turrigiano, G., & Nelson, S. (2000). Hebb and homeostasis in neuronal plasticity. Curr. Opin. Neurobiol., 10(3), 358–364.
Turrigiano, G., & Nelson, S. (2004). Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5, 97–107.
van Welie, I., van Hooft, J. A., & Wadman, W. J. (2004). Homeostatic scaling of neuronal excitability by synaptic modulation of somatic hyperpolarization-activated Ih channels. Proc. Nat. Acad. Sci., 101(14), 5123–5128.
van Welie, I., van Hooft, J. A., & Wadman, W. J. (2006). Background activity regulates excitability of rat hippocampal CA1 pyramidal neurons by adaptation of a K+ conductance. J. Neurophysiology, 95, 2007–2012.
Vinje, W. E., & Gallant, J. L. (2000). Sparse coding and decorrelation in primary visual cortex during natural vision. Science, 287, 1273–1276.
Willmore, B., & Tolhurst, D. (2001). Characterizing the sparseness of neural codes. Network: Computation in Neural Systems, 12, 255–270.
Zhang, M., Hung, F., Zhu, Y., Xie, Z., & Wang, J.-H. (2004). Calcium signal-dependent plasticity of neuronal excitability developed postnatally. J. Neurobiol., 61(2), 277–287.
Zhang, W., & Linden, D. J. (2003). The other side of the engram: Experience-driven changes in neuronal intrinsic excitability. Nature Reviews Neuroscience, 4, 885–900.
Received November 11, 2005; accepted April 27, 2006.
LETTER
Communicated by Michael Hasselmo
Distinguishing Causal Interactions in Neural Populations Anil K. Seth
[email protected] Gerald M. Edelman
[email protected] The Neurosciences Institute, San Diego, CA 92121, U.S.A.
We describe a theoretical network analysis that can distinguish statistically causal interactions in population neural activity leading to a specific output. We introduce the concept of a causal core to refer to the set of neuronal interactions that are causally significant for the output, as assessed by Granger causality. Because our approach requires extensive knowledge of neuronal connectivity and dynamics, an illustrative example is provided by analysis of Darwin X, a brain-based device that allows precise recording of the activity of neuronal units during behavior. In Darwin X, a simulated neuronal model of the hippocampus and surrounding cortical areas supports learning of a spatial navigation task in a real environment. Analysis of Darwin X reveals that large repertoires of neuronal interactions contain comparatively small causal cores and that these causal cores become smaller during learning, a finding that may reflect the selection of specific causal pathways from diverse neuronal repertoires.

1 Introduction

How do neuronal interactions in complex nervous systems cause specific outputs, and how are these causal interactions modulated during learning? Addressing these issues requires a theoretical analysis of the principles governing causal interactions in large and richly interconnected neuronal populations during behavior. Here, we describe a novel network analysis that distinguishes causal interactions in neural activity in relation to a given neural reference. We use the term neural reference (NR) to refer to the activity of a particular neuron (e.g., a motor neuron) at a particular time (e.g., when an output is generated). As part of this analysis, we introduce the concept of a causal core to designate those neural interactions that are causally significant for a given NR. This concept provides a novel means of linking neuronal interactions in large populations to specific outputs.
Neural Computation 19, 910–933 (2007)
© 2007 Massachusetts Institute of Technology

Our analysis derives from the perspective that the brain is a selectional system and that neural interactions reflect processes of selection operating on diverse repertoires of neuronal groups (Edelman, 1978, 1987, 1993;
Izhikevich, Gally, & Edelman, 2004; Seth & Baars, 2005). Our approach here elaborates on three features of this view in particular. First, variability in population activity may be a necessary substrate for the formation of diverse and dynamic repertoires of neuronal interactions, rather than a reflection of mere neuronal noise. Second, population activity patterns are interpreted as dynamic, causally effective processes giving rise to specific outputs, and not as information-bearing representations underlying computations. Finally, synaptic plasticity is suggested to contribute to behavioral learning by shaping the selection of causal pathways within populations and not by reinforcing connections among neural representations of sensory and motor variables. We emphasize that selectionist theory provides an interpretive framework for this analysis and is not itself tested by our analysis.

The general framework for the analysis is illustrated in Figure 1. First, an NR is selected from among many possible neuronal events—in this example, by virtue of its relationship to a specific behavioral output (see Figure 1A; yellow neuron at time t). Second, a network of neuronal interactions is identified, incorporating only those interactions that could have potentially contributed to the occurrence of the NR (Krichmar, Nitz, Gally, & Edelman, 2005). This network corresponds to the "context network" of the NR (see Figure 1A; network of green neurons). In order to sample the most salient components for a particular NR, these networks generally reflect a relatively short time period. Finally, a subnetwork, which we define as the causal core, is selected from the context network.
The causal core comprises those interactions that are causally significant, as assessed by Granger causality (see Figure 1B and below; red arrows show causally significant connections), and which form part of a causal chain leading to the NR (see Figure 1C; red arrows indicate the causal core, and the blue arrow indicates a causally significant interaction that is excluded from the core because it is a causal dead-end). The concept of Granger causality is based on prediction: if a signal A causes a signal B, then past values of A should help predict B with greater accuracy than past values of B alone (Granger, 1969). Because the statistical nature of Granger causality requires large sample sizes to make robust inferences, it is assessed over a time period considerably longer than that used for identifying the context network.

In practice, the above analysis requires extensive knowledge of anatomical and functional connectivity during behavior. Unfortunately, such data are not currently available in vivo. Therefore, to provide an illustrative example, we analyze a brain-based device (BBD), a physical device whose behavior in a real-world environment is controlled by a simulated nervous system based on vertebrate neuroanatomy and neurophysiology (Krichmar & Edelman, 2002; Seth, McKinstry, Edelman, & Krichmar, 2004a, 2004b; Krichmar, Seth, Nitz, Fleischer, & Edelman, 2005). Unlike animal models, use of a BBD enables precise recording during behavior of the state of every neuronal unit and every synaptic connection. The BBD analyzed here, Darwin X, incorporates a detailed model of the hippocampus and
Figure 1: Distinguishing causal interactions in neuronal populations. (A) Select a neural reference (NR), that is, the activity of a particular neuron (yellow) at a particular time (t). The context network of the NR corresponds to the network of all coactive and connected precursors, assessed over a short time period (e.g., t, t − 1, . . . , t − 6). (B) Assess the Granger causality significance of each interaction in the context network, based on extended time series of the activities of the corresponding neurons (0 − T; this time period is considerably longer than that used to identify the context network itself). Red arrows indicate causally significant interactions. (C) The causal core of the NR (red arrows) is defined as that subset of the context network that is causally significant for the activity of the corresponding neuron (excluding both noncausal interactions—black arrows—and dead-end causal interactions such as 5→4, indicated in blue). The concept of a causal core provides a novel means of linking neuronal interactions in large populations to specific outputs.
surrounding cortical areas. By integrating visual and self-motion cues, it learns to navigate to an arbitrarily chosen location in a room (Krichmar, Nitz et al., 2005; Krichmar, Seth et al., 2005). To analyze Darwin X, we identified NRs at various stages of the learning process, selecting those corresponding to hippocampal control of motor
output. By recording in detail the responses of Darwin X’s nervous system during behavior, we were able to construct context networks and identify causal cores for the chosen NRs. We found that causal cores generally comprised small subsets of the corresponding context networks. Moreover, causal cores were not composed solely of strong synaptic interactions and were not random samples of a context network. We also found that causal cores became smaller with experience, a process we refer to here as refinement. Although BBDs are necessarily gross simplifications of biological systems, their analysis can provide potentially useful heuristics for the interpretation of neurobiological data (Krichmar & Edelman, 2002). Accordingly, as with previous analyses of Darwin X (Krichmar, Nitz, et al., 2005; Krichmar, Seth, et al., 2005), the results presented here can be related to mammalian hippocampal function. However, our primary purpose here is to elaborate several heuristics for analyzing and interpreting activity patterns within neuronal populations. The timeliness of developing such heuristics is emphasized by recent methodological advances that promise detailed new data revealing anatomical and functional architectures of neuronal populations at microscopic scales (Kelly & Strick, 2004; Konkle & Bielajew, 2004; Raffi & Siegel, 2005; Cossart, Ikegaya, & Yuste, 2005; Ohki, Chung, Kara, & Reid, 2005).
Figure 2: Darwin X in its environment with visual landmarks hung from the walls and with the hidden platform indicated. Darwin X has a CCD camera, which responds to landmarks that appear within its field of view. While Darwin X could not see the hidden platform, it could detect an encounter with the platform by means of a downward-facing IR sensor. This figure has been adapted from (Krichmar, Seth, et al., 2005).
Figure 3: Selected components of Darwin X's simulated nervous system. Each clear ellipse represents a neural area. There were two visual input streams, responding to the color (inferotemporal area IT) and width (parietal area Pr) of visual landmarks on the walls, as well as one odometric input signaling Darwin X's heading (anterior thalamic area ATN). These inputs were reciprocally connected with the hippocampus, which included entorhinal cortical areas ECin and ECout, dentate gyrus DG, and the CA3 and CA1 hippocampal subfields. Plastic connections were updated according to a modified BCM learning rule (Bienenstock et al., 1982). When the value system (area S) was activated by an encounter with the hidden platform, synaptic strengths in value-dependent pathways were further modulated. The number of simulated neuronal units in each area is indicated in gray (e.g., 900 units in IT). Neuronal activities were computed using a mean firing-rate model. This figure has been adapted from Krichmar, Seth, et al. (2005).
We emphasize that this study differs fundamentally from previous analyses of Darwin X (Krichmar, Nitz, et al., 2005; Krichmar, Seth, et al., 2005). One previous study introduced the recursive analysis of anatomical and functional connectivity that is used in this study to identify context networks (Krichmar, Nitz, et al., 2005). However, this previous study did not analyze causal interactions or examine changes in neural dynamics during learning. A second study applied a different Granger causality analysis to small circuits of 8 to 10 neuronal units each (Krichmar, Seth, et al., 2005). However, these circuits were selected without reference to anatomy, and the small size of the selected circuits precluded analysis of population-level neuronal dynamics.
2 Materials and Methods

2.1 Darwin X. Darwin X is the most recent of a series of autonomous brain-based devices (BBDs) that have been developed over the past 12 years (Edelman et al., 1992; Krichmar & Edelman, 2002; Seth et al., 2004a, 2004b; Krichmar, Nitz, et al., 2005; Krichmar, Seth, et al., 2005). Details of the construction and behavior of Darwin X are available elsewhere (Krichmar et al., 2005a, 2005b), and only those necessary for comprehension will be repeated here.

Figure 2 shows the device in its environment, which consisted of an enclosed arena within which paper stripes of different widths and colors were hung on different walls. Darwin X was equipped with a CCD camera, enabling it to detect these stimuli. Figure 3 provides a schematic diagram of the simulated nervous system, in which different sensory streams converged onto a hippocampal region in which there were several levels of signal processing before the processed signals projected divergently back to sensory cortex (Lavenex & Amaral, 2000). The sensory streams included visual "where" signals that respond to stripe width (parietal area Pr), visual "what" signals that respond to stripe color (inferotemporal area IT), and head-direction signals (anterior thalamic area ATN). These areas were reciprocally connected to entorhinal area EC (subdivided into ECin and ECout). Both trisynaptic pathways (EC→DG→CA3→CA1) and perforant shortcuts (e.g., EC→CA1) were included in the simulated hippocampal anatomy (Witter, Naber, & van Haeften, 2000). The full nervous system of Darwin X contained 50 neural areas, approximately 90,000 mean firing-rate neuronal units, and approximately 1.4 million synaptic connections.

Darwin X was trained on a dry version of the Morris water maze task (Morris, 1984). A hidden platform was placed in one quadrant of the arena (see Figure 2). This platform could be detected by Darwin X only when the device was directly overhead.
Each encounter with the platform stimulated a value system that modulated synaptic plasticity at various loci in the simulated nervous system (see Figure 3). Synaptic plasticity in Darwin X was based on a modified BCM learning rule (Bienenstock, Cooper, & Munro, 1982; Krichmar, Seth, et al., 2005) in which synapses between neuronal units with strongly correlated firing rates are potentiated and synapses between neuronal units with weakly correlated firing rates are depressed. As a result of the above features, Darwin X learned to navigate to the platform from arbitrary starting locations. Darwin X was trained over 16 separate trials, each beginning from one of four different starting locations. As reported in detail in Krichmar, Seth, et al. (2005), initially Darwin X searched randomly for the platform, but after about 10 trials, the device reliably took a comparatively direct route to the platform from any starting location.

2.2 Context Networks. To identify context networks in Darwin X's neural activity, we selected 15 neuronal units in area CA1 of the simulated
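Darwin X's exact modified rule is specified in Krichmar, Seth, et al. (2005); purely as a generic illustration of a value-modulated, BCM-style update (all names and parameter values here are hypothetical, and a fixed threshold stands in for BCM's sliding threshold and for the correlation-based modification):

```python
def bcm_update(w, pre, post, theta, eta=0.1, value=1.0):
    """Generic BCM-style synaptic update (illustrative, not Darwin X's exact rule).

    The synapse is potentiated when postsynaptic activity exceeds the
    threshold theta and depressed when it falls below it; `value` scales
    plasticity, standing in for modulation by the value system.
    """
    return w + eta * value * pre * post * (post - theta)

w_pot = bcm_update(0.5, pre=1.0, post=0.8, theta=0.5)   # strong post: potentiation
w_dep = bcm_update(0.5, pre=1.0, post=0.2, theta=0.5)   # weak post: depression
```

Setting `value` above 1 when the platform is encountered mimics the extra modulation of value-dependent pathways described above.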
hippocampus, whose activity directly affected motor output. We refer to these units as reference units. For each reference unit, we then identified various time steps, drawn from a variety of experimental trials, at which the reference unit was active and at which Darwin X changed its direction. This combination of reference units and time steps provided 96 neural references (NRs; see Figure 1A) that spanned the learning process. It is important to emphasize here these distinctions:
- Neuronal unit, the basic element making up all parts of Darwin X's simulated nervous system.
- Reference unit, a neuronal unit in CA1 chosen for its influence on behavior.
- Neural reference, the activity of a particular reference unit at a particular time.

In this analysis, each reference unit was associated with multiple NRs. While it may be natural to select NRs that correspond to behavioral outputs, NRs can in principle correspond to the activity of any neuronal unit at any time.
We then recursively examined the activity of all neuronal units (throughout the simulated nervous system) that led to each NR (see Figure 1A). This procedure has previously been referred to as a backtrace (Krichmar, Nitz, et al., 2005). For each NR, we identified the list of neuronal units that were both anatomically connected to the corresponding reference unit and were active, above a threshold, at the previous time step. The threshold sets the minimum value of the product of presynaptic activity and synaptic strength for a synaptic interaction to be included; it was set to be very low (0.01). A second iteration was carried out by repeating this procedure for each neuronal unit in this list. In total, six successive iterations were carried out for each NR, reflecting six simulation cycles of neural activity, which corresponded to approximately 1.2 s of real time. This limit was chosen in order to isolate the most salient neural interactions for a particular NR while avoiding a combinatorial explosion in the size of successive iterations. The resulting context networks consisted of multiple paths of functional interactions leading, through time, to the corresponding NRs (see Figure 1A).

We emphasize that the term context network is used here to refer specifically to the networks described and does not connote alternative uses of the term context pertaining to, for example, invariances in environmental, emotional, or cognitive conditions (Mackintosh, 1983; Bouton, 2004; Fanselow, 2000; Baars, 1988).

In principle, this procedure could be extended to NRs consisting of multiple coactivated neuronal units, in which case the first iteration of the backtrace would be carried out using a list of neuronal units rather than a single neuronal unit. However, the corresponding context networks would likely increase in size very rapidly, leading to combinatorial problems.
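The backtrace can be sketched as follows (a simplified illustration; the threshold of 0.01 and the six-iteration depth follow the description above, while the data layout and all names are hypothetical):

```python
def backtrace(weights, activity, ref_unit, t_ref, depth=6, theta=0.01):
    """Recursively collect the context network of a neural reference.

    weights[post][pre] : synaptic strength from unit `pre` onto unit `post`
    activity[t][unit]  : firing rate of `unit` at time step t
    Returns the set of edges (pre, t-1, post, t) whose product of
    presynaptic activity and synaptic strength exceeds the threshold theta.
    """
    edges = set()
    frontier = {ref_unit}
    for t in range(t_ref, t_ref - depth, -1):
        new_frontier = set()
        for post in frontier:
            for pre, w in enumerate(weights[post]):
                if w * activity[t - 1][pre] > theta:
                    edges.add((pre, t - 1, post, t))
                    new_frontier.add(pre)
        frontier = new_frontier
    return edges

# toy chain 2 -> 1 -> 0: backtracing unit 0 at t = 2 recovers both links
weights = [[0.0, 0.5, 0.0],   # unit 0 receives from unit 1
           [0.0, 0.0, 0.5],   # unit 1 receives from unit 2
           [0.0, 0.0, 0.0]]
activity = [[1.0, 1.0, 1.0]] * 3
core = backtrace(weights, activity, ref_unit=0, t_ref=2, depth=2)
```

Each iteration only expands the units found active and connected in the previous one, which is what keeps the context network from growing combinatorially at every step.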
2.3 Causal Cores. To identify the causal core of each NR, we applied a Granger causality analysis to its context network (Granger, 1969; Seth, 2005). The concept of Granger causality is based on prediction, and its application usually involves linear regression modeling. To illustrate Granger causality, suppose that the temporal dynamics of two time series, X1(t) and X2(t) (both of length T), can be described by a bivariate autoregressive model:

X1(t) = Σ_{j=1}^{p} A11,j X1(t − j) + Σ_{j=1}^{p} A12,j X2(t − j) + E1(t)

X2(t) = Σ_{j=1}^{p} A21,j X1(t − j) + Σ_{j=1}^{p} A22,j X2(t − j) + E2(t),
where p is the maximum number of lagged observations included in the model (the model order, p < T), A contains the coefficients of the model (i.e., the contributions of each lagged observation to the predicted values of X1(t) and X2(t)), and E1, E2 are residuals (prediction errors) for each time series. If the variance of E1 (or E2) is reduced by the inclusion of the X2 (or X1) terms in the first (or second) equation, then it is said that X2 (or X1) Granger-causes X1 (or X2). In other words, X2 Granger-causes X1 if the coefficients in A12 are jointly significantly different from zero. This can be tested by performing an F-test of the null hypothesis that A12 = 0, given assumptions of covariance stationarity on X1 and X2. The magnitude of a Granger causality interaction can be estimated by the logarithm of the corresponding F-statistic (Geweke, 1982).

We used Granger causality to assess the causal significance of every connection in each context network (see Figure 1B). For each connection, X1 and X2 corresponded to the activity time series of the presynaptic and postsynaptic neuronal units, respectively. To ensure a robust estimation of causal significance, X1 and X2 contained neuronal unit activities for the entire trial in which the corresponding NR was located. The length of these time series ranged from 450 to 4992 time steps (in contrast to the 6 time steps incorporated into each backtrace; see above), and each contained many separate periods of neuronal activity and inactivity. To achieve covariance stationarity, first-order differencing was applied to X1 and X2. The analyzed time series therefore represented changes in neural activity rather than absolute activity levels. After differencing, only 3 of 96 context networks contained more than 1% of time series that were not covariance stationary (Dickey-Fuller test, P < 0.01). These networks were subsequently excluded from the analysis.
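The F-test described above can be illustrated with a minimal regression sketch. The function below is our own construction, not the authors' code: it fits the restricted model (own lags only) and the unrestricted model (own lags plus the other series' lags) by ordinary least squares and forms the standard F-ratio; the lag-matrix layout and degrees of freedom are textbook choices and may differ from the implementation used in the study.

```python
import numpy as np

def granger_f(x_pre, x_post, p=4):
    """F-statistic for the hypothesis 'x_pre Granger-causes x_post' (sketch).

    Restricted model:   x_post(t) regressed on its own p lags.
    Unrestricted model: additionally on p lags of x_pre.
    Returns (F, log F); the logarithm serves as the interaction magnitude.
    """
    x_pre, x_post = np.asarray(x_pre, float), np.asarray(x_post, float)
    T = len(x_post)
    y = x_post[p:]
    # lagged design matrices: column j holds the series shifted by j steps
    own = np.column_stack([x_post[p - j:T - j] for j in range(1, p + 1)])
    other = np.column_stack([x_pre[p - j:T - j] for j in range(1, p + 1)])
    ones = np.ones((T - p, 1))
    Xr = np.hstack([ones, own])            # restricted design
    Xu = np.hstack([ones, own, other])     # unrestricted design
    rss = lambda X: np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2)
    rss_r, rss_u = rss(Xr), rss(Xu)
    dof = (T - p) - (2 * p + 1)            # residual dof, unrestricted model
    F = ((rss_r - rss_u) / p) / (rss_u / dof)
    return F, np.log(F)
```

Applied to a pair where one series drives the other with a one-step lag, the forward F-statistic is large while the reverse one stays near its null value.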
For each connection in each of the 93 remaining context networks, we constructed a bivariate autoregressive model of order p = 4. This model order was chosen according to the Bayesian information criterion (BIC) (Schwarz, 1978) (the mean minimum BIC, computed from all bivariate
A. Seth and G. Edelman
models, was 3.46). To verify that the order-4 models sufficiently described the data, we noted that the mean adjusted residual-sum-square RSSadj was sufficiently high (0.35 ± 0.168). From each bivariate model, we calculated the statistical significance of a Granger causality interaction from the presynaptic unit (X1) to the postsynaptic unit (X2). To take account of the large number of connections in each context network, a Bonferroni correction was applied to the significance thresholds (corrected P < 0.01). The resulting networks of significant Granger causality interactions are referred to here as Granger networks. The causal core of each NR was identified by extracting the subset of the corresponding Granger network consisting of all causally significant connections that led, via other causally significant connections, to the NR (see Figure 1C). We emphasize the importance of this step. In order to be causally relevant to an NR, a causally significant interaction must be part of a chain of causally significant interactions leading to the NR. To assess the robustness of our results, we repeated the Granger causality analysis of each context network using a range of Bonferroni-corrected significance thresholds (P < 0.02 to P < 0.05). The resulting Granger networks and causal cores were very similar in all cases to those obtained using the default threshold (P < 0.01). Robustness to model order p was assessed by repeating the analysis using order-8 models, as indicated by the Akaike information criterion, an alternative to the BIC (Akaike, 1974). The resulting Granger causalities were identical to those identified by the order-4 models.

2.4 Network Measures. Throughout this letter, we use the term network to provide a useful and commensurable representation, in terms of nodes and edges, of both real physical interactions (as in context networks) and virtual statistical interactions (as in Granger networks and causal cores).
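The causal-core extraction described in section 2.3 can be sketched as a reverse-reachability computation over the Granger network. The edge representation and function names below are illustrative assumptions, not the authors' implementation:

```python
def causal_core(granger_edges, reference):
    """Sketch of causal-core extraction (section 2.3); names are ours.

    granger_edges : set of (pre, post) causally significant connections
    reference     : the neuronal unit of the NR
    An edge (u, v) survives only if v is the reference unit or can reach
    it through other significant edges, i.e., if the edge lies on a
    causal chain leading to the NR.
    """
    preds = {}
    for u, v in granger_edges:
        preds.setdefault(v, []).append(u)
    # walk backward from the reference along significant edges
    reach, frontier = {reference}, [reference]
    while frontier:
        for u in preds.get(frontier.pop(), []):
            if u not in reach:
                reach.add(u)
                frontier.append(u)
    # keep edges whose target can reach the reference; dead ends drop out
    return {(u, v) for (u, v) in granger_edges if v in reach}
```

On this sketch, a causally significant edge that dangles off the chains leading to the NR (a dead end) is discarded even though it passed the F-test, which is exactly the step the text emphasizes.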
In both cases, nodes correspond to neuronal units. In context networks, edges correspond to synaptic interactions, weighted by the product of synaptic strength and presynaptic activity. In Granger networks and causal cores, edges correspond to statistically significant causal interactions, weighted by the estimated strength of the causal interaction. In order to measure the similarity between two networks A and B, we define the following network similarity index,

η_AB = 2 f(A ∩ B) / (f(A) + f(B)),

where A ∩ B is the network consisting of the edges jointly present in networks A and B, and the function f(X) returns the size (number of edges) of the network X. This index is equal to 1 if A and B are identical and is equal to 0 if A and B are nonoverlapping. It is important to emphasize that
while topologies can be compared among different network types using the η index, the specific weight assigned to an edge in a context network cannot be directly compared to the corresponding value in a Granger network or causal core.

Networks of each type were visualized using either an energy-minimization method to emphasize network structure or a standardized circular arrangement to aid comparison among networks. In the former case, we used the Pajek program (http://vlado.fmf.uni-lj.si/pub/networks/pajek/), which implements the Kamada-Kawai energy minimization algorithm, with the result that nodes that are strongly interconnected tend to be bunched together. Because this algorithm was applied separately for each visualized network, spatial arrangements cannot be compared between networks.

3 Results

3.1 Causal Cores in Darwin X. Figure 4 shows the context network for an example NR (reference unit 6 during trial 13). The thickness of each line and the size of each arrowhead are determined by the product of synaptic strength and presynaptic activity. Figure 5 shows the corresponding Granger network—the subset of interactions of the context network in Figure 4 calculated to be causally significant. Here, the thickness of each line reflects the strength of the corresponding Granger causality interaction. Figure 6 shows the corresponding causal core—the subnetwork of Figure 5 following removal of all dead-end edges that did not participate in causal chains leading to the NR. As in the Granger network (see Figure 5), the thickness of each line in the causal core reflects the strength of the corresponding Granger causality interaction. Comparing the networks in Figures 4, 5, and 6, it is apparent that the context network is much larger than the Granger network, which is itself much larger than the causal core.
For this example NR, the causal core has several trisynaptic pathways in which signals from sensory cortex causally influence entorhinal cortex, which in turn causally influences dentate gyrus and CA3 before reaching the NR. One example of such a pathway is shown by the gray shading in Figure 6. In the context network (see Figure 4), there is a large central cluster of entorhinal interactions interspersed with entorhinal-CA1 interactions. The Granger network (see Figure 5) has a comparatively sparse architecture, with considerably fewer intra-entorhinal interactions. Intra-entorhinal interactions are almost totally absent in the corresponding causal core (see Figure 6).

Figure 7 shows the average composition and size of all 93 context networks, Granger networks, and causal cores. Consistent with the above observations, context networks were significantly larger than Granger networks, which were themselves significantly larger than causal cores.
Intra-entorhinal interactions dominated context networks, were less prominent in Granger networks, and were comparatively rare in causal cores. The expanded segments of each chart show the components of the trisynaptic and perforant pathways. These components were much more prominent in causal cores than in context networks or Granger networks, suggesting that these pathways had greater causal influence, in relation to the NRs, than entorhinal interactions.

How does the size of a causal core depend on the amount of time that is integrated by a context network? As noted previously (in section 2.2), each context network was constructed by iterating six time steps back from a given NR (a backtrace). Figure 8 shows the ratio of causal core size to the corresponding context network size, for each NR separately, as the backtrace procedure is iterated. As additional time steps were incorporated in the analysis (i.e., as the context network reached further back in time from the NR), the majority of causal cores constituted a progressively smaller fraction of the corresponding context networks. This raises the possibility that in Darwin X, even if the construction of a context network were to be iterated through many more time steps than is feasible in practice, the corresponding causal core may not change substantially.

Can causal cores in Darwin X be identified on the basis of synaptic strengths, without recourse to a statistical causality analysis? To address this question, we examined the relation between synaptic strength and Granger causality for each NR separately, using the network similarity index η defined in section 2.4. From each context network, we selected 100 subnetworks by removing the edges corresponding to the weakest j% of synaptic weights, for each j ∈ [1, 100]. From each subnetwork, we also removed those dead-end edges that did not lead to the NR. We then calculated the network similarity η between each subnetwork and the corresponding causal core.
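The threshold sweep just described, together with the η index of section 2.4, can be sketched as follows. The data layout, the helper names, and the η convention for two empty networks are our own illustrative choices, not the authors' code:

```python
def similarity(edges_a, edges_b):
    """Network similarity index of section 2.4: 2 f(A ∩ B) / (f(A) + f(B))."""
    if not edges_a and not edges_b:
        return 1.0  # convention for two empty networks (our assumption)
    return 2 * len(edges_a & edges_b) / (len(edges_a) + len(edges_b))

def weight_subnetwork(weighted_edges, j, reference):
    """Remove the weakest j% of edges, then prune dead-end edges.

    weighted_edges : dict {(pre, post): weight} for one context network.
    Pruning keeps an edge (u, v) only if v lies on a path to the
    reference unit, mirroring the causal-core extraction step.
    """
    ranked = sorted(weighted_edges, key=weighted_edges.get)
    kept = set(ranked[int(len(ranked) * j / 100):])  # drop weakest j%
    preds = {}
    for u, v in kept:
        preds.setdefault(v, []).append(u)
    reach, frontier = {reference}, [reference]       # reverse reachability
    while frontier:
        for u in preds.get(frontier.pop(), []):
            if u not in reach:
                reach.add(u)
                frontier.append(u)
    return {(u, v) for (u, v) in kept if v in reach}

def similarity_profile(weighted_edges, causal_core_edges, reference):
    """η between each thresholded subnetwork and the causal core."""
    return [similarity(weight_subnetwork(weighted_edges, j, reference),
                       causal_core_edges) for j in range(1, 101)]
```

Plotting the resulting profile against j reproduces the kind of curve shown in Figure 9: a maximum well below 1 would indicate, as reported, that causal cores cannot be recovered from synaptic weights alone.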
The resulting values represent, for a particular NR, the similarity between the causal core and a series of subnetworks identified by narrowing the corresponding context network on the basis of synaptic strength. Figure 9A shows the mean network similarity η across all NRs as a function of j. For low j, η is low, as expected, because the context networks are still large. η is also low for high j, which indicates that causal cores do not consist solely of the strongest synaptic weights. Figure 9B shows the same analysis repeated using synaptic interaction strength (i.e., the product of synaptic weight and presynaptic activity) instead of synaptic weight. The results are very similar. The maximum η averaged across all NRs was approximately 0.3 for both analysis methods, and we found the maximum value of η = 1 for only 1/93 NRs, in the second analysis only. These observations indicate that causal cores cannot in general be identified on the basis of synaptic weights or synaptic interaction strengths. In relation to the above point, it bears emphasizing that whereas a Granger-causality analysis offers a natural threshold, based on
[Figure 4 graphic: network nodes grouped by brain area (Reference, CA1, CA3, DG, ECin, ECout, IT, ATN, Pr)]
Figure 4: Example context network from Darwin X, obtained from CA1 reference unit 6 during trial 13. Each circle represents a single neuronal unit; different colors represent distinct brain areas, as defined in section 2.1 and Figure 3. The thickness of each line reflects the product of synaptic strength and presynaptic activity.
Figure 5: The Granger network corresponding to the context network shown in Figure 4. This network comprises those connections in Figure 4 that are causally significant. The thickness of each line reflects the strength of the Granger causality interaction (the logarithm of the corresponding F-statistic; see section 2.3).
Figure 6: The causal core corresponding to the context network in Figure 4 and the Granger network in Figure 5. This network was derived by removing from the corresponding Granger network all dead-end edges that did not participate in causal chains leading to the NR. The gray shading shows one of the several trisynaptic pathways in the causal core. As in Figure 5, the thickness of each line reflects the strength of the Granger causality interaction.
[Figure 7 charts: mean network sizes: context networks 723, Granger networks 223, causal cores 29; connection types: ATN→EC, IT→EC, Pr→EC, EC→EC, EC→DG, EC→CA3, EC→CA1, DG→CA3, CA3→CA3, CA3→CA1]
Figure 7: Network composition and average network size (number of edges) for context networks, Granger networks, and causal cores. Different colors represent different connection types (see section 2.1 and Figure 3). Expanded segments show components of the trisynaptic pathways (EC→DG, DG→CA3, and CA3→CA1) and perforant pathways (EC→CA1). Context networks had a much higher proportion of intra-entorhinal connections than causal cores.
[Figure 8 plots: ratio size(CC)/size(CX) for nodes (N) and edges (K), versus backtrace time step (−1 to −6)]
Figure 8: The ratio of causal core size to context network size, in terms of connections (K ) and neuronal units (N), for all NRs. This ratio is shown as a function of the iteration depth of the backtrace procedure (see section 2.2).
[Figure 9 plots: network similarity η versus j, based on (A) synaptic weight and (B) synaptic interaction strength]
Figure 9: Network similarity η between causal cores and a series of subnetworks extracted from the corresponding context networks. For each NR, 100 subnetworks were extracted by removing j% of the weakest edges in the corresponding context networks, as well as dead-end edges (see the text for details). (A) Mean η across all 93 NRs for subnetwork extraction based on synaptic weight. (B) Mean η based on synaptic interaction strength. Shaded areas represent standard errors.
statistical significance, for selecting a subnetwork, thresholds based on synaptic weight or synaptic interaction strength are necessarily arbitrary. Comparisons of size and composition would therefore be more difficult to interpret in these latter cases.

3.2 Modulation of Causal Interactions During Learning. We turn now to the modulation of causal interactions during learning of the hidden-platform task by Darwin X. Figure 10 shows causal cores from two reference
[Figure 10 graphic: causal cores for reference units 5 and 10 across several trials, nodes arranged on a circular template by brain area (Reference, CA1, CA3, DG, ECin, ECout, IT, Pr, ATN)]
Figure 10: Causal cores for reference units 5 (top) and 10 (bottom) from a selection of trials spanning the learning period. To facilitate comparison between networks, they are arranged using a circular template in which units from the same brain area are placed next to each other. The causal cores diminished in size over time.
units for several NRs spanning the learning process. To facilitate comparison between networks, and in contrast to Figures 4 to 6, these networks are arranged on a circular template in which neuronal units from the same brain area are placed next to each other (see section 2.4 and Figure 3). Inspection of Figure 10 shows that the causal cores diminished in size over time, mainly due to the pruning of intra-entorhinal interactions. We refer to this reduction in size during learning as refinement.

Causal core refinement could arise in three different ways: (1) as a correlate of reduction in size of the corresponding context networks; (2) as a result of a general reduction in the likelihood with which interactions are causally significant (in which case, Granger networks, but not context networks, would be similarly refined); and (3) as the result of selection of particular causal pathways from a diverse and dynamic repertoire of neural interactions (in which case only causal cores would show refinement). To distinguish these possibilities, Figure 11 shows network sizes as a function of experimental trial for each type of network and separately for each of the 15 reference units. In agreement with the third account, only causal cores showed significant refinement (see the inset in Figure 11C). Further, causal core size appeared to reach an asymptote after about 10 trials, which corresponds to the number of trials required for Darwin X to learn to move more
[Figure 11 plots: network size (K) versus trial for (A) context networks, (B) Granger networks, and (C) causal cores, for all 15 reference units]
Figure 11: (A) Size of context networks as a function of trial number during learning, in terms of number of edges (K ), for all 15 reference units. Each line represents a different reference unit. (B) Sizes of the corresponding Granger networks. (C) Sizes of the corresponding causal cores. Insets of panels B and C show the same data on a magnified scale.
or less directly to the hidden platform from any starting point (Krichmar, Seth, et al., 2005). These observations raise the possibility that causal core refinement may be a correlate of learning at the level of neuronal populations. Since causal cores in Darwin X were not composed solely of the strongest synaptic interactions (see Figure 9) and since neither context networks nor Granger networks showed similar reductions in size during learning (see Figure 11), causal core refinement in Darwin X may best be understood as resulting from the modulation, via synaptic plasticity, of causal pathways within neuronal populations.

4 Discussion

We have developed a theoretical network analysis that can distinguish statistically causal interactions (according to Granger causality) from noncausal interactions in the neural activity leading up to the activity of a particular neuronal unit at a particular time (a neural reference, NR). According to our analysis, activity within neuronal populations may involve the selection of causal interactions from diverse and dynamic repertoires, and learning may involve synaptic changes that modulate selection at the population level. An illustrative application of our analysis to a brain-based device provided an opportunity to explore patterns of statistically causal interactions in large neuronal repertoires during behavior. Causal interactions in Darwin X arose from network dynamics generated by interactions among anatomy, the specific embodiment of Darwin X, and behavior in a structured environment, and therefore could not be inferred directly from design specifications. Our analysis of Darwin X revealed several novel aspects of such
interactions, notably the small size of causal cores as compared to context networks (see Figures 4–7) and the reduction in size (refinement) of causal cores during learning (see Figures 10 and 11). We verified that causal cores did not comprise only the strongest synaptic weights or synaptic interactions and could not in general be identified on the basis of these features (see Figure 9). Because brain-based devices such as Darwin X reflect vertebrate neuroanatomy and neurophysiology, their analysis can provide potentially useful heuristics for the interpretation of neurobiological data (Krichmar & Edelman, 2002; Seth, Sporns, & Krichmar, 2005). Below, we discuss some implications of these results for the interpretation and analysis of population neural activity, for the relation between synaptic plasticity and behavioral learning, and for the dynamic mechanisms underlying hippocampal function. We also discuss limitations, constraints, and possible extensions of the approach.

4.1 Population Activity and Neuronal Variability. A common view of neural population activity is that activity patterns encode information about sensory, motor, or internal state variables (Pouget, Dayan, & Zemel, 2000; Georgopoulos, Schwartz, & Kettner, 1986; Lee, Rohrer, & Sparks, 1988), and that population activity patterns constitute representations that provide substrates for computational operations (Averbeck, Latham, & Pouget, 2006; Harris, 2005). According to this view, mechanisms are required for encoding afferent signals into activity patterns as well as for decoding these activity patterns in order to generate output (Pouget et al., 2000; Seung & Sompolinsky, 1993). An alternative perspective, which accords with the theory that the brain is a selectional system (Edelman, 1978, 1987, 1993), is that population activity provides a dynamic repertoire from which specific causal pathways are selected during behavior and learning.
The analysis here provides a means of identifying these pathways in the form of causal cores. According to this analysis, causal cores are not information-bearing representations and do not require explicit encoding and decoding. Instead, they are dynamic, causally effective processes that give rise to specific outputs. These views offer contrasting interpretations of neuronal variability. Experimental observations of the brain during behavior indicate large variations in neural activity even for performance of the same task, or perception of the same stimulus (Edelman, 1987; Softky & Koch, 1993; Schwartz, 1994; Grobstein, 1994). Often such variability has been treated as an inconvenience and minimized or removed using averaging in order to reveal the underlying activity patterns. From a population coding perspective, attempts have been made to identify whether variation is unavoidable neural noise or, alternatively, is part of a signal transmitted to other neurons (Stein, Gossen, & Jones, 2005; Knoblauch & Palm, 2005; Averbeck et al., 2006) via a neural code (deCharms & Zador, 2000). In
contrast, according to the view presented here, variability in the neural activity leading to a given event is expected and indicates the existence of diverse and dynamic repertoires underlying selection.

4.2 Synaptic Plasticity and Behavioral Learning. It is often assumed that learned behavior depends on the cumulative effects of long-term potentiation (LTP) or depression (LTD) of synapses. However, synaptic plasticity and behavioral learning operate over vastly different temporal (Drew & Abbott, 2006) and spatial scales. Although experimental evidence suggests that behavioral learning can induce LTP and LTD in certain cases (Malenka & Bear, 2004; Dityatev & Bolshakov, 2005), the precise functional contributions of synaptic plasticity to learned behavior have remained unclear. The results here raise the possibility that synaptic plasticity may underlie learning via modulations of causal interactions at the population level, specifically, by giving rise to the refinement of causal cores. Accordingly, the comparatively large size of causal cores early in learning may reflect a situation where selection had not yet acted to consolidate a causal pathway within a large space of possible pathways. Recent experimental evidence is consistent with this selectionist view. For example, Barnes and colleagues sampled populations of sensorimotor striatal projection neurons while rats learned a conditional T-maze task (Barnes, Kubota, Hu, & Graybiel, 2005). They found that variability in task-related spiking activity (as measured by entropy) diminished during training and overtraining. Interestingly, by including extinction and reacquisition phases in the learning paradigm, they showed that this reduction in variability can be reversed.

4.3 Causal Interactions in the Hippocampus. The close correspondence of Darwin X with the neuroanatomy and neurophysiology of the mammalian hippocampus suggests one comparatively specific hypothesis regarding hippocampal function.
Intra-entorhinal interactions, which dominated most context networks, were comparatively rare in Granger networks, and especially in causal cores. This raises the possibility that mammalian entorhinal cortex may play a modulatory role, as opposed to a causally driving role, with respect to neuronal responses in CA1 during learning. By the same token, the prevalence of trisynaptic and perforant pathways within causal cores suggests that these pathways may causally drive CA1 responses. Although these possibilities are difficult explicitly to test in vivo, the notion of causally efficacious trisynaptic and perforant pathways is supported by both anatomy (see Figure 3) and recent experimental and modeling results that suggest a functionally significant perforant pathway (Brun, Otnass, & Molden, 2002; Hasselmo, Bodelon, & Wyble, 2002). In addition, evidence is accumulating that entorhinal cortex can modulate behavioral expression, for example by driving the formation of associations
for long-term memory (Frank & Brown, 2003) and by influencing the expression of conditioned fear in rats (Burwell, Bucci, Sanborn, & Jutras, 2004; Hebert & Dash, 2004). Future work will address in detail further implications of this analysis for theories of hippocampal function.

4.4 Role of Neuronal Inactivity. The operation of any neural system is likely to involve neurons for which momentary inactivity is functionally significant. Our analysis accounts for this aspect of neural dynamics in the identification of causal cores from context networks, but not in the identification of context networks themselves. The reason is that momentary inactivity can be functionally significant only as part of an extended time series. As described in section 2.2, context networks comprise those synaptic interactions that could potentially have influenced a neural event at a particular time. Accordingly, context networks incorporate neural interactions corresponding to active synapses only. By contrast, the assessment of causal significance by Granger causality is based on the relative predictive ability of time series, which requires each time series to exhibit periods of inactivity as well as periods of activity. Therefore, the identification of a causal core from a context network naturally accommodates momentary neuronal inactivity because such inactivity contributes to predictive ability.

4.5 Validating Causal Hypotheses Using Lesions. A useful means of testing causal relations in general is to lesion or perturb elements of the studied system (Keinan, Sandbank, Hilgetag, Meilijson, & Ruppin, 2004). We have shown in previous work that a Granger causality analysis can successfully predict the behavioral consequences of single-neuron lesions within a highly simplified neural network model and is superior in this regard to predictions based on synaptic weights or synaptic interaction strengths (Seth, 2005).
In the analysis here, lesion studies of particular causal cores would be difficult to interpret for the reason that the ongoing behavior of Darwin X depends only minimally on the activity of any single neuronal unit. Moreover, Darwin X is a degenerate system (Edelman & Gally, 2001) in which many different causal circuits are likely to give rise to the same behavior. Having said this, it may be informative for future studies to compare lesions to entire brain regions, including, for example, entorhinal areas and trisynaptic/perforant areas. It bears emphasizing that the present analysis provides causal descriptions in the absence of lesions or exogenous perturbations. The resulting causal cores therefore correspond to the dynamics of the intact system during natural behavior. By contrast, the interpretation of causal inferences based on external interventions is complicated by the fact that the studied system is either no longer intact (for lesions) or may display different behavior (for exogenous perturbations).
4.6 Covariance Stationarity. A requirement for the present analysis is that time series are covariance stationary; i.e., means and variances do not change over time. However, biological time series often exhibit strong autocorrelations that are inconsistent with covariance stationarity. Absence of covariance stationarity can be addressed in several ways. This study employed first-order differencing, which can be effective in removing autocorrelative structure but implies that the resulting time series reflect changes in activity rather than absolute activity levels. An alternative approach is prewhitening (Box, Jenkins, & Reinsel, 1994; Langheim, Leuthold, & Georgopoulos, 2006), which involves fitting a separate univariate regression model to each time series and performing any subsequent multivariate analysis on the residuals only. Finally, since short time series are more likely to be covariance stationary than long time series, a third approach would be to break each time series into several comparatively short segments. Although this approach has the advantage of allowing analysis of time-varying causal influences (Hesse, Möller, Arnold, & Schack, 2003), the reduced number of samples in each time series risks weakening statistical inferences of causality. The present analysis is highly conservative with respect to covariance stationarity for the reason that we included in each time series the maximum number of data points possible. We assessed the robustness of our results to this factor by repeating our analysis after splitting each time series into two equal parts. We found that the causal cores corresponding to each time series segment were nearly identical to the causal cores corresponding to the full series, suggesting that our approach was not excessively conservative.
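As a minimal illustration of the differencing step, the transformation below (using NumPy's `diff`; the wrapper name is ours) turns a trending activity series into a series of changes, removing a linear trend at the cost of one sample:

```python
import numpy as np

def first_difference(x):
    """First-order differencing used to promote covariance stationarity.

    Converts an activity time series into a series of activity changes,
    x[t] - x[t-1]; the result is one sample shorter than the input.
    """
    x = np.asarray(x, dtype=float)
    return np.diff(x)
```

For example, a linearly increasing series becomes a constant series of unit increments, which has a stable mean and variance.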
4.7 Application to Other Neural Simulations. Although our results were generated by analyzing a subset of CA1 neuronal units in a particular brain-based device (Darwin X), it should be emphasized that our analysis can be applied to a broad range of neural simulations. Such applications could usefully assess the generality of the results described here. Some of these results—for example, the particular constitution of context networks and causal cores—may be expected to vary according to the details of the analyzed system. Other results may be more general—for example, the correspondence of causal core refinement with behavioral learning and the small size of causal cores as compared to context networks. Darwin X was based on a mean-firing rate neuronal model, in which the activity of each neuronal unit represented the combined activity of approximately 100 real neurons. Identifying causal cores in real neural systems will require extending our analysis to inferences of causality based on spiking neuronal dynamics, for example by combining point-process models with maximum likelihood models (Okatan, Wilson, & Brown, 2005). Such an extension could usefully be tested on future brain-based devices having spiking neuronal dynamics.
4.8 Application to Empirical Data. More generally, application of our analysis requires the identification of all (or a large fraction) of the connected and coactive precursors to a given NR. At present, this is fully achievable only in a computational model, and only in a brain-based device can an NR be meaningfully associated with real-world behavior. However, recent methodological advances suggest that investigators soon may be able to characterize both anatomical and functional architectures in biological systems at microscopic scales. For example, retrograde transneuronal transport of rabies virus has been used to characterize the microanatomy of connectivity between the basal ganglia and cortex (Kelly & Strick, 2004). Functional microarchitectures have been mapped using metabolic markers for neuronal activity (Konkle & Bielajew, 2004), high-resolution optical imaging (Raffi & Siegel, 2005), the genetic manipulation of neurons to express fluorescent voltage sensors (Ahrens et al., 2004) and the bulk-loading of calcium-sensitive indicators (Cossart et al., 2005; Ohki et al., 2005). Importantly, several of these techniques can be applied to behaving primates (Raffi & Siegel, 2005). These advances reflect a changing experimental emphasis from the blind recording of neurons toward the detailed characterization of specific neurons as their dynamics evolve in time in behaving animals. New analysis methods will be required in order to make sense of the data generated by these new techniques. We suggest that our analysis provides one such method and that the insights that potentially could follow may best be exposed by prior application of the analysis to brain-based devices. Application of the novel experimental methods described above will also permit testing of several predictions of the model, in particular, the small size of causal cores as compared to context networks, and the correspondence between learning and causal core refinement. 
Acknowledgments

We are grateful to Jeffrey Krichmar, Joseph Gally, and Eugene Izhikevich for many constructive discussions. We also benefited from the comments of two anonymous referees. We thank Dr. Krichmar in particular for his contribution to the design and implementation of Darwin X. This research was supported by grant N00014-03-1-0980 from the Office of Naval Research, by DARPA, and by the Neuroscience Research Foundation.

References

Ahrens, K., Hollender, L., Heider, B., Lee, H., Dean, C., Callaway, E., Isacoff, E., & Siegel, R. (2004). Imaging flare, a genetically encoded fluorescent voltage sensor, with two-photon microscopy in live mammalian cortex. Abstr. Soc. Neurosci.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Autom. Control, 19, 716–723.
Averbeck, B., Latham, P., & Pouget, A. (2006). Neural correlations, population coding and computation. Nature Reviews Neuroscience, 7(5), 358–367.
Baars, B. (1988). A cognitive theory of consciousness. Cambridge: Cambridge University Press.
Barnes, T., Kubota, Y., Hu, D., & Graybiel, A. (2005). Activity of striatal neurons reflects dynamic encoding and recoding of procedural memories. Nature, 437, 1158–1161.
Bienenstock, E., Cooper, L., & Munro, P. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in the visual cortex. Journal of Neuroscience, 2(1), 32–48.
Bouton, M. (2004). Context and behavioral processes in extinction. Learning and Memory, 11(5), 485–494.
Box, G., Jenkins, G., & Reinsel, G. (1994). Time series analysis: Forecasting and control (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Brun, V., Otnass, M., & Molden, S. (2002). Place cells and place recognition maintained by direct entorhinal-hippocampal circuitry. Science, 296, 2243–2246.
Burwell, R., Bucci, D., Sanborn, M., & Jutras, M. (2004). Perirhinal and postrhinal contributions to remote memory for context. J. Neurosci., 24(49), 11023–11028.
Cossart, R., Ikegaya, Y., & Yuste, R. (2005). Calcium imaging of cortical network dynamics. Cell Calcium, 37, 451–457.
deCharms, R., & Zador, A. (2000). Neural representation and the cortical code. Annu. Rev. Neurosci., 23, 613–647.
Dityatev, A., & Bolshakov, V. (2005). Amygdala, long-term potentiation, and fear conditioning. Neuroscientist, 11, 75–88.
Drew, P. J., & Abbott, L. F. (2006). Extending the effects of spike-timing-dependent plasticity to behavioral timescales. Proceedings of the National Academy of Sciences, USA, 103(23), 8876–8881.
Edelman, G. (1978). Group selection and phasic re-entrant signalling: A theory of higher brain function. In V. Mountcastle (Ed.), The mindful brain. Cambridge, MA: MIT Press.
Edelman, G. (1987). Neural Darwinism. New York: Basic Books.
Edelman, G. (1993). Selection and reentrant signaling in higher brain function. Neuron, 10, 115–125.
Edelman, G., & Gally, J. (2001). Degeneracy and complexity in biological systems. Proceedings of the National Academy of Sciences, USA, 98(24), 13763–13768.
Edelman, G., Reeke, G., Gall, W., Tononi, G., Williams, D., & Sporns, O. (1992). Synthetic neural modeling applied to a real-world artifact. Proc. Natl. Acad. Sci. USA, 89(15), 7267–7271.
Fanselow, M. (2000). Contextual fear, gestalt memories, and the hippocampus. Behavioral and Brain Research, 110(1–2), 73–81.
Frank, L. M., & Brown, E. (2003). Persistent activity and memory in the entorhinal cortex. Trends Neurosci., 26(8), 400–401.
Georgopoulos, A., Schwartz, A., & Kettner, R. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Geweke, J. (1982). Measurement of linear dependence and feedback between multiple time series. Journal of the American Statistical Association, 77, 304–313.
Granger, C. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37, 424–438.
Grobstein, P. (1994). Variability in brain function and behavior. In V. Ramachandran (Ed.), The encyclopaedia of human behavior (pp. 447–458). Orlando, FL: Academic Press.
Harris, K. (2005). Neural signatures of cell assembly organization. Nature Reviews Neuroscience, 6, 399–407.
Hasselmo, M., Bodelon, C., & Wyble, B. (2002). A proposed function for hippocampal theta rhythm: Separate phases of encoding and retrieval enhance reversal of prior learning. Neural Comput., 14, 793–817.
Hebert, A., & Dash, P. (2004). Plasticity in the entorhinal cortex suppresses memory for contextual fear. J. Neurosci., 24(45), 10111–10116.
Hesse, W., Möller, E., Arnold, M., & Schack, B. (2003). The use of time-variant EEG Granger causality for inspecting directed interdependencies of neural assemblies. Journal of Neuroscience Methods, 124, 27–44.
Izhikevich, E., Gally, J., & Edelman, G. (2004). Spike-timing dynamics of neuronal groups. Cerebral Cortex, 14, 933–944.
Keinan, A., Sandbank, B., Hilgetag, C., Meilijson, I., & Ruppin, E. (2004). Fair attribution of functional contribution in artificial and biological networks. Neural Computation, 16, 1887–1915.
Kelly, R., & Strick, P. (2004). Macro-architecture of basal ganglia loops with the cerebral cortex: Use of rabies virus to reveal multisynaptic circuits. Prog. Brain Res., 143, 449–459.
Knoblauch, A., & Palm, G. (2005). What is signal and what is noise in the brain? Biosystems, 79(1–3), 83–90.
Konkle, A., & Bielajew, C. (2004). Tracing the neuroanatomical profiles of reward pathways with markers of neuronal activation. Rev. Neurosci., 15(6), 383–414.
Krichmar, J., & Edelman, G. (2002). Machine psychology: Autonomous behavior, perceptual categorization and conditioning in a brain-based device. Cerebral Cortex, 12(8), 818–830.
Krichmar, J., Nitz, D., Gally, J., & Edelman, G. (2005). Characterizing functional hippocampal pathways in a brain-based device as it solves a spatial memory task. Proc. Natl. Acad. Sci. USA, 102(6), 2111–2116.
Krichmar, J., Seth, A., Nitz, D., Fleischer, J., & Edelman, G. (2005). Spatial navigation and causal analysis in a brain-based device modeling cortical-hippocampal interactions. Neuroinformatics, 3(3), 197–222.
Langheim, F., Leuthold, A., & Georgopoulos, A. (2006). Synchronous dynamic brain networks revealed by magnetoencephalography. Proceedings of the National Academy of Sciences, USA, 103(2), 455–459.
Lavenex, P., & Amaral, D. (2000). Hippocampal-neocortical interaction: A hierarchy of associativity. Hippocampus, 10, 420–430.
Lee, C., Rohrer, W., & Sparks, D. (1988). Population coding of saccadic eye movements by neurons in the superior colliculus. Nature, 332, 357–360.
Mackintosh, N. (1983). Conditioning and associative learning. New York: Oxford University Press.
Malenka, R., & Bear, M. (2004). LTP and LTD: An embarrassment of riches. Neuron, 44, 5–21.
Morris, R. (1984). Developments of a water-maze procedure for studying spatial learning in the rat. Journal of Neuroscience Methods, 11, 47–60.
Ohki, K., Chung, S., Ch'ng, Y. H., Kara, P., & Reid, R. (2005). Functional imaging with cellular resolution reveals precise microarchitecture in visual cortex. Nature, 433, 597–603.
Okatan, M., Wilson, M., & Brown, E. (2005). Analyzing functional connectivity using a network likelihood model of ensemble neural spiking activity. Neural Computation, 17(9), 1927–1961.
Pouget, A., Dayan, P., & Zemel, R. (2000). Information processing with population codes. Nature Reviews Neuroscience, 1(2), 125–132.
Raffi, M., & Siegel, R. (2005). Functional architecture of spatial attention in the parietal cortex of the behaving monkey. Journal of Neuroscience, 25, 5171–5186.
Schwartz, A. (1994). Direct cortical representation of drawing. Science, 265, 540–542.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Seth, A. (2005). Causal connectivity of evolved neural networks during behavior. Network: Computation in Neural Systems, 16, 35–54.
Seth, A., & Baars, B. (2005). Neural Darwinism and consciousness. Consciousness and Cognition, 14, 140–168.
Seth, A., McKinstry, J., Edelman, G., & Krichmar, J. (2004a). Active sensing of visual and tactile stimuli by brain-based devices. International Journal of Robotics and Automation, 19, 222–238.
Seth, A., McKinstry, J., Edelman, G., & Krichmar, J. (2004b). Visual binding through reentrant connectivity and dynamic synchronization in a brain-based device. Cerebral Cortex, 14, 1185–1189.
Seth, A., Sporns, O., & Krichmar, J. (2005). Neurorobotic models in neuroscience and neuroinformatics. Neuroinformatics, 3(3), 167–170.
Seung, H., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences, USA, 90(22), 10749–10753.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13(1), 334–350.
Stein, R. B., Gossen, E., & Jones, K. (2005). Neuronal variability: Noise or part of the signal? Nat. Rev. Neurosci., 6(5), 389–397.
Witter, M., Naber, P., & van Haeften, T. (2000). Anatomical organization of the parahippocampal-hippocampal network. Ann. N.Y. Acad. Sci., 911, 1–24.
Received December 8, 2005; accepted July 12, 2006.
LETTER
Communicated by Lucas Parra
Model Selection for Convolutive ICA with an Application to Spatiotemporal Analysis of EEG Mads Dyrholm
[email protected] Intelligent Signal Processing Group, Informatics and Mathematical Modelling, Technical University of Denmark, 2800 Lyngby, Denmark
Scott Makeig
[email protected] Swartz Center for Computational Neuroscience, Institute for Neural Computation University of California San Diego, La Jolla CA 92093–0961, U.S.A.
Lars Kai Hansen
[email protected] Intelligent Signal Processing Group, Informatics and Mathematical Modelling, Technical University of Denmark, 2800 Lyngby, Denmark
We present a new algorithm for maximum likelihood convolutive independent component analysis (ICA) in which components are unmixed using stable autoregressive filters determined implicitly by estimating a convolutive model of the mixing process. By introducing a convolutive mixing model for the components, we show how the order of the filters in the model can be correctly detected using Bayesian model selection. We demonstrate a framework for deconvolving a subspace of independent components in electroencephalography (EEG). Initial results suggest that in some cases, convolutive mixing may be a more realistic model for EEG signals than the instantaneous ICA model.

1 Introduction

Motivated by the EEG signal's complex temporal dynamics, we are interested in convolutive independent component analysis (cICA), which in its most basic form concerns reconstruction of L + 1 mixing matrices A_\tau and N component signal vectors (innovations), s_t, of dimension K, combining to form an observed D-dimensional linear convolutive mixture
x_t = \sum_{\tau=0}^{L} A_\tau s_{t-\tau}.   (1.1)
Neural Computation 19, 934–955 (2007)
© 2007 Massachusetts Institute of Technology
That is, cICA models the observed data x as produced by K processes whose time courses are first convolved with fixed, finite-length time filters and then summed in the D sensors. This allows a single component to be expressed in the different sensors with variable delays and frequency characteristics. One common application for this model is the acoustic blind source separation problem, in which sound sources are mixed in a reverberant environment. Simple ICA methods that do not take signal delays into account fail to produce satisfactory results for this problem, which has thus far been the focus of much cICA research (e.g., Lee, Bell, & Orglmeister, 1997; Parra, Spence, & Vries, 1998; Sun & Douglas, 2001; Mitianoudis & Davies, 2003; Anemüller & Kollmeier, 2003). For analysis of human electroencephalographic (EEG) signals recorded from the scalp, ICA has already proven to be a valuable tool for detecting and enhancing relevant source subspace brain signals while suppressing irrelevant noise and artifacts such as those produced by muscle activity and eye blinks (Makeig, Bell, Jung, & Sejnowski, 1996; Jung et al., 2000; Delorme & Makeig, 2004). In conventional ICA, each independent component (IC) is represented as a spatially static projection of cortical activity to the sensors. Results of static ICA decomposition are generally compatible with a view of EEG signals as originating in spatially static cortical domains within which local field potential fluctuations are partially synchronized (Makeig, Enghoff, Jung, & Sejnowski, 2000; Jung et al., 2001; Delorme, Makeig, & Sejnowski, 2002; Makeig, Debener, Onton, & Delorme, 2004; Onton, Delorme, & Makeig, 2005). Modeling EEG data as consisting of convolutive as well as static independent processes allows a richer palette for source modeling, possibly leading to more complete signal independence (Anemüller, Sejnowski, & Makeig, 2003).
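The forward model of equation 1.1 can be sketched numerically. This is a minimal illustration, not the paper's implementation; the dimensions, random filters, and Laplacian innovations are arbitrary choices made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: D sensors, K component processes,
# mixing filters of length L + 1, N time samples.
D, K, L, N = 3, 2, 30, 1000

# Component innovations s_t (K x N) drawn from a Laplacian, and
# random mixing matrices A_tau (one D x K matrix per lag).
s = rng.laplace(size=(K, N))
A = rng.standard_normal(size=(L + 1, D, K)) / (L + 1)

# Equation 1.1: x_t = sum_{tau=0}^{L} A_tau s_{t-tau}
# (sources are taken to be zero before t = 0 in this sketch).
x = np.zeros((D, N))
for tau in range(L + 1):
    x[:, tau:] += A[tau] @ s[:, : N - tau]
```

Each lag shifts the innovation sequence and projects it through its own D-by-K matrix, so a single component reaches the sensors with different delays and filter shapes, as described above.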
In this letter, we present a new cICA decomposition method that, unlike most previous work in the area, operates entirely in the time domain. In the wavelet or DFT (discrete Fourier transform) domain approaches, it is common practice to window and taper the data; hence, the choice of the optimal window size is a problem in practice. In an acoustic application of a DFT-based method, for instance, the window size can be determined based on a listening test; however, an equally elegant solution for application to nonaudible data such as an EEG has yet to be derived. In our new method, the optimal order of the mixing filters has to be determined, and we provide a principled scheme for doing so based on generalization error. The new scheme also makes no assumptions about the nonstationarity of the components, a key assumption in several successful cICA methods (see, e.g., Parra & Spence, 2000; Rahbar, Reilly, & Manton, 2002) whose relevance to EEG is unclear. Previous time domain and DFT domain methods have formulated the problem as one of finding a finite impulse response (FIR) filter that unmixes as in equation 1.2 below (Belouchrani, Meraim, Cardoso, & Moulines, 1997; Choi & Cichocki, 1997; Moulines, Cardoso, & Gassiat,
1997; Lee, Bell, & Lambert, 1997; Attias & Schreiner, 1998; Parra, Spence, & Vries, 1998; Deligne & Gopinath, 2002; Douglas, Cichocki, & Amari, 1999; Comon, Moreau, & Rota, 2001; Sun & Douglas, 2001; Rahbar & Reilly, 2001; Rahbar et al., 2002; Baumann, Köhler, & Orglmeister, 2001; Anemüller & Kollmeier, 2003):

\hat{s}_t = \sum_{\lambda} W_\lambda x_{t-\lambda}.   (1.2)
However, the inverse of the mixing FIR filter modeled in equation 1.2 is, in general, an infinite impulse response (IIR) filter. We thus expect that FIR-based unmixing will require estimation of extended or potentially infinite-length unmixing filters. Our method, by contrast, finds such an unmixing IIR filter implicitly in terms of the mixing model parameters, that is, the A_\tau's in equation 1.1, isolating s_t in equation 1.1 as

\hat{s}_t = A_0^{\#} \left( x_t - \sum_{\tau=1}^{L} A_\tau \hat{s}_{t-\tau} \right),   (1.3)
where A_0^{\#} denotes the Moore-Penrose inverse of A_0. Another advantage of this parameterization is that the A_\tau's allow a separated component to be easily backprojected into the original sensor domain. Other authors have proposed the use of IIR filters for separating convolutive mixtures using the maximum likelihood principle. The unmixing IIR filter, equation 1.3, generalizes that of Torkkola (1996) to allow separation of more than two components. Furthermore, it bears interesting resemblance to that of Choi and Cichocki (1997) and Choi, Amari, Cichocki, and Liu (1999). Though put in different analytical terms, the inverses used there are equivalent to the unmixing IIR, equation 1.3. However, the unique expression, equation 1.3, and its remarkable analytical simplicity, is the key to learning the parameters of the mixing model, equation 1.1, directly.

2 Learning the Mixing Model Parameters

Statistically motivated maximum likelihood approaches for cICA have been proposed (Torkkola, 1996; Pearlmutter & Parra, 1997; Parra, Spence, & Vries, 1997; Moulines et al., 1997; Attias & Schreiner, 1998; Deligne & Gopinath, 2002; Choi et al., 1999; Dyrholm & Hansen, 2004) and are attractive for a number of reasons. First, they force a declaration of statistical assumptions, in particular the assumed distribution of the component signals. Second, a maximum likelihood solution is asymptotically optimal given the assumed observation model and the prior choices for the hidden variables.
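The implicit IIR unmixing of equation 1.3 can be sketched as a direct recursion. This is an illustrative, noise-free example under assumptions of the sketch only: sources vanish for t < 0, and A_0 is a hand-picked, well-conditioned tall matrix so that A_0^# A_0 = I.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, L, N = 3, 2, 3, 300

s = rng.laplace(size=(K, N))
A = rng.standard_normal(size=(L + 1, D, K)) * 0.1
A[0] = np.array([[1.0, 0.2], [0.1, 1.0], [0.3, -0.4]])  # full column rank

# Mix (equation 1.1), sources assumed zero for t < 0.
x = np.zeros((D, N))
for tau in range(L + 1):
    x[:, tau:] += A[tau] @ s[:, : N - tau]

# Unmix with the implicit IIR filter (equation 1.3):
# s_hat_t = A_0^# ( x_t - sum_{tau>=1} A_tau s_hat_{t-tau} )
A0_pinv = np.linalg.pinv(A[0])
s_hat = np.zeros((K, N))
for t in range(N):
    acc = x[:, t].copy()
    for tau in range(1, min(t, L) + 1):
        acc -= A[tau] @ s_hat[:, t - tau]
    s_hat[:, t] = A0_pinv @ acc
```

In this noise-free setting the recursion recovers the sources exactly, by induction over t; with noise or ill-chosen filters, stability of the implied IIR inverse becomes the concern discussed in the text.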
Assuming independent and identically distributed (i.i.d.) component signals and noise-free mixing, the likelihood of the parameters in equation 1.1 given the data is

p(X \mid \{A_\tau\}) = \int \cdots \int \prod_{t=1}^{N} \delta(e_t)\, p(s_t)\, ds_t,   (2.1)
where

e_t = x_t - \sum_{\tau=0}^{L} A_\tau s_{t-\tau}   (2.2)
and \delta(e_t) is the Dirac delta function. In the following derivation, we assume that the number of convolutive processes K does not exceed the dimension D of the data. First, we note that only the Nth term under the product operator in equation 2.1 is a function of s_N. Hence, the s_N-integral may be evaluated first, yielding

p(X \mid \{A_\tau\}) = |A_0^T A_0|^{-1/2} \int \cdots \int p(\hat{s}_N) \prod_{t=1}^{N-1} \delta(e_t)\, p(s_t)\, ds_t,   (2.3)
where integration is over all components at all times except s_N, and

\hat{s}_t = A_0^{\#} \left( x_t - \sum_{\tau=1}^{L} A_\tau u_{t-\tau} \right), \qquad u_n \equiv \begin{cases} s_n & \text{for times } n \text{ not yet integrated out} \\ \hat{s}_n & \text{for times } n \text{ already integrated out.} \end{cases}

Three approaches to the decomposition can be considered:

1. (Rectangular) Perform the decomposition on the full D-dimensional data, estimating K < D components.

2. (Augmented) Perform the decomposition with K set to D, attempting to estimate some extra components.

3. (Diminished) Perform the decomposition with D equal to K, that is, on a K-dimensional subspace projection of the data.
Figure 1: Synthetic MIMO mixing system (six panels, Source 1/2 → Sensor 1/2/3, showing filter impulse responses over 30 lags). Here, two components were convolutively mixed at three sensors. The poles of the mixture (as defined in section 2.1) are all situated within the unit circle; hence, an exact and stable inverse exists in the sense of equation 2.8.
We compared the performance of these three approaches experimentally as a function of signal-to-noise ratio (SNR). First, we created a synthetic mixture: two i.i.d. signals s_1(t) and s_2(t) (with 1 ≤ t ≤ N and N = 30,000) generated from a Laplacian distribution, s_k(t) \sim p(x) = \frac{1}{2} e^{-|x|}, with variance Var{s_k(t)} = 2. These signals were mixed using the filters of length L = 30 shown in Figure 1, producing an overdetermined 3D mixture (D = 3, K = 2). A 3D i.i.d. gaussian noise signal n_t was added to the mixture, x_t = \sigma n_t + \sum_{\tau=0}^{L} A_\tau s_{t-\tau}, with a controlled variance \sigma^2. Next, we investigated how well the three analysis approaches estimated the two components by measuring the correlations between each true component innovation, s_k(t), and the best-correlated estimated component innovation, \hat{s}_k(t).

3.1 Approach 1 (Rectangular). Here, all three data channels were decomposed and the two true components estimated. Figure 2 shows how well the components were estimated at different SNR levels. The quality of the estimation dropped dramatically as SNR decreased. Although our derivation (see section 2) is valid for the overdetermined case (D > K), the
Figure 2: Comparison of separation of the system in Figure 1 using three cICA approaches (Rectangular, Augmented, Diminished); correlation coefficient versus SNR. (A) Estimates of true component activity: correlations with the best-estimated component. (B) Similar correlations for the less-well-estimated component.
validity of the zero-noise assumption proves vital in this case. The explanation for this can be seen in the definitions of the likelihood, equation 2.7, and unmixing filter, equation 2.8. In equation 2.7, any rotation on the columns of A_0 will not influence the determinant term of the likelihood. From equation 2.8, we note that the estimated component vectors \hat{s}_t are found by linear mapping through A_0^{\#}: \mathbb{R}^D \to \mathbb{R}^K. Hence, the prior term in equation 2.7 alone will be responsible for determining a rotation of A_0 that hides as much variance as possible in the null space (\mathbb{R}^{D-K}) of A_0^{\#} in equation 2.8. In an unconstrained optimization scheme, this side effect will be untamed and consequently will hide variance in the null space of A_0^{\#} and achieve an artificially high likelihood while relaxing the effort to make the components independent.

3.2 Approach 2 (Augmented). One solution to the problem with the Rectangular approach above could be to parameterize the null space of A_0^{\#} or, equivalently, the orthogonal complement space of A_0. This can be seen as a special case of the algorithm in which A_0 is D-by-D and A_\tau is D-by-K. With the D − K additional columns of A_0 denoted by B, the model can be written
x_t = B v_t + \sum_{\tau=0}^{L} A_\tau s_{t-\tau},   (3.1)
where v_t and B constitute a low-rank approximation to the noise. Hence, we declare a gaussian prior p.d.f. on v_t. Note that equation 3.1 is a special case of the convolutive model, equation 1.1. In this case, we attempt to estimate the third (noise) component in addition to the two convolutive components. Figure 2 shows how well the components are estimated using this approach for different SNR levels. For the best-estimated component (see Figure 2A), the Augmented approach gave better estimates than the Rectangular or Diminished approaches. This was also the case for the second component (see Figure 2B) at low SNR, but not at high SNR, since in this case the true B was near zero and became improbable under the likelihood model.

3.3 Approach 3 (Diminished). Finally, we investigated the possibility of extracting the two components from a two-dimensional projection of the data. For example, in the EEG experiment later in this letter, we propose projecting the data onto a subspace defined by K instantaneous ICA components. In other situations, PCA might be a more principled approach to dimensionality reduction. However, for the current synthetic experiment, we picked a more arbitrary projection by excluding the third sensor from the decomposition. Figure 2 shows that even in the presence of considerable noise, the separation achieved was not as good as in the Augmented approach. However, the Diminished approach used the lowest number of parameters and hence had the lowest computational complexity. Furthermore, it lacked the peculiarities of the Augmented approach at high SNR. Finally, we note that once the Diminished model has been learned, an estimate of the Rectangular model can be obtained by solving

\langle x_t s_{t-\lambda}^T \rangle = \sum_{\tau} A_\tau \langle s_{t-\tau} s_{t-\lambda}^T \rangle   (3.2)
for A_\tau by regular matrix inversion, using the estimated components and \langle \cdot \rangle = \frac{1}{N}\sum_{t=1}^{N}.

3.4 Summary of the Three Approaches. In the presence of considerable noise, the best separation was obtained by augmenting the model and extracting, from the D-dimensional mixture, K components as well as a (rank D − K) approximation of the noise. However, the Diminished approach had the advantage of lower computational complexity, while the separation it achieved was close to that of the Augmented approach. At very high SNR, the Diminished approach was even slightly better than the Augmented approach. The Rectangular approach had difficulties and should not be considered for use in practice, as the presence of some channel noise may be assumed.
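The recovery step of equation 3.2 can be sketched as a block linear system. In this illustration the true sources stand in for the Diminished-model component estimates (an assumption of the sketch); the lagged correlations for λ = 0..L are stacked and solved by matrix inversion for all A_τ at once.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K, L, N = 3, 2, 4, 40000

# Synthetic ground truth: sources and a full (Rectangular) mixing system.
s = rng.laplace(size=(K, N))
A_true = rng.standard_normal(size=(L + 1, D, K))
x = np.zeros((D, N))
for tau in range(L + 1):
    x[:, tau:] += A_true[tau] @ s[:, : N - tau]

def corr(a, b, lag):
    # Empirical <a_t b_{t-lag}^T>, averaged over the valid range of t.
    n = a.shape[1]
    return a[:, lag:] @ b[:, : n - lag].T / (n - lag)

# Stack equation 3.2 for lambda = 0..L into [A_0 ... A_L] G = R.
B = K * (L + 1)
G = np.zeros((B, B))
R = np.zeros((D, B))
for lam in range(L + 1):
    R[:, lam * K:(lam + 1) * K] = corr(x, s, lam)
    for tau in range(L + 1):
        # <s_{t-tau} s_{t-lam}^T>, written in terms of corr().
        Cs = corr(s, s, lam - tau) if lam >= tau else corr(s, s, tau - lam).T
        G[tau * K:(tau + 1) * K, lam * K:(lam + 1) * K] = Cs

A_flat = R @ np.linalg.inv(G)
A_est = np.stack([A_flat[:, tau * K:(tau + 1) * K] for tau in range(L + 1)])
```

For i.i.d. sources, G is close to a scaled identity, so the solve mainly normalizes each lagged cross-correlation; the estimates approach A_true up to sampling error.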
4 Detecting a Convolutive Mixture

Model selection is a fundamental issue of interest; in particular, detecting the order of L can tell us whether the convolutive mixing model is a better model than the simpler instantaneous mixing model of standard ICA methods. In the framework of Bayesian model selection, models that are too complex are penalized by the Occam factor and will therefore be chosen only if there is a need for their complexity. However, this compelling feature can be disrupted if fundamental assumptions are violated. One such assumption was involved in our derivation of the likelihood, in which we assumed that the components are i.i.d., that is, not autocorrelated. The problem with this assumption is that the likelihood will favor models based not only on achieved independence but on component whiteness as well. A model selection scheme for L that does not take the component autocorrelations into account will therefore be biased upward because models with a larger value for L can absorb more component autocorrelation than models with lower L values. To address this problem, we introduce a model for each of the sources,

s_k(t) = \sum_{\lambda=0}^{M} h_k(\lambda)\, z_k(t-\lambda),   (4.1)
where z_k(t) represents an i.i.d. signal, a whitened version of the component signal. Introducing the K component filters of order M allows us to reduce the value of L, that is, lowering the number of parameters in the model while achieving uniformly better learning for limited data (Dyrholm, Makeig, & Hansen, 2006). We note that some authors of FIR unmixing methods have also used component models (e.g., Pearlmutter & Parra, 1997; Parra et al., 1997; Attias & Schreiner, 1998).

4.1 Learning Component Autocorrelation. The negative log likelihood for the model combining equations 1.1 and 4.1 is given by

\mathcal{L} = N \log |\det A_0| + N \sum_{k} \log |h_k(0)| - \sum_{t=1}^{N} \log p(\hat{z}_t),   (4.2)
where \hat{z}_t is a vector of whitened component signal estimates at time t using an operator that represents the inverse of equation 4.1, and we assume A_0 to be square, as in the Diminished and Augmented approaches above. We can, without loss of generality, set h_k(0) = 1; then,

\mathcal{L} = N \log |\det A_0| - \sum_{t=1}^{N} \log p(\hat{z}_t).   (4.3)
For notational convenience, we introduce the following matrix notation instead of equation 4.1, bundling all components in one matrix equation:

s_t = \sum_{\lambda=0}^{M} H_\lambda z_{t-\lambda},   (4.4)
where the H_\lambda's are diagonal matrices defined by (H_\lambda)_{ii} = h_i(\lambda). To derive an algorithm for learning the component autocorrelations in addition to the mixing model, we modify the equations found in section 2.2 by inserting a third Component model step (see below) between the two steps found there, that is, substituting \hat{z}_t for \hat{s}_t in step 2.

4.1.1 Component Model Step. The inverse component coloring operator is given by

\hat{z}_t = \hat{s}_t - \sum_{\lambda=1}^{M} H_\lambda \hat{z}_{t-\lambda},   (4.5)
and the partial derivatives, which we shall use in step 2, are given by

\frac{\partial (\hat{z}_t)_k}{\partial (A_0^{-1})_{ij}} = \frac{\partial (\hat{s}_t)_k}{\partial (A_0^{-1})_{ij}} - \sum_{\lambda=1}^{M} \left( H_\lambda \frac{\partial \hat{z}_{t-\lambda}}{\partial (A_0^{-1})_{ij}} \right)_k   (4.6)

\frac{\partial (\hat{z}_t)_k}{\partial (A_\tau)_{ij}} = \frac{\partial (\hat{s}_t)_k}{\partial (A_\tau)_{ij}} - \sum_{\lambda=1}^{M} \left( H_\lambda \frac{\partial \hat{z}_{t-\lambda}}{\partial (A_\tau)_{ij}} \right)_k   (4.7)

\frac{\partial (\hat{z}_t)_k}{\partial (H_\lambda)_{ii}} = -\delta(k-i)\, (\hat{z}_{t-\lambda})_i - \sum_{\lambda'=1}^{M} \left( H_{\lambda'} \frac{\partial \hat{z}_{t-\lambda'}}{\partial (H_\lambda)_{ii}} \right)_k.   (4.8)
4.1.2 Step 2 Modified: Gradient of the Cost Function. The gradient of the cost function with respect to A_0^{-1} with the component model invoked is given by

\frac{\partial \mathcal{L}(\{A_\tau\})}{\partial (A_0^{-1})_{ij}} = -N (A_0^T)_{ij} - \sum_{t=1}^{N} \psi_t^T \frac{\partial \hat{z}_t}{\partial (A_0^{-1})_{ij}},   (4.9)

and the gradient with respect to the other mixing matrices is

\frac{\partial \mathcal{L}(\{A_\tau\})}{\partial (A_\tau)_{ij}} = -\sum_{t=1}^{N} \psi_t^T \frac{\partial \hat{z}_t}{\partial (A_\tau)_{ij}},   (4.10)

where \psi_t denotes the score function \partial \log p(z)/\partial z evaluated at \hat{z}_t.
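The -N (A_0^T)_{ij} term in equation 4.9 comes from differentiating N log|det A_0| with respect to W = A_0^{-1}. A finite-difference sanity check of that term (the matrix W and sample count here are arbitrary choices for the check, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
Nsamp = 100
W = rng.standard_normal((3, 3)) + 5 * np.eye(3)   # W plays the role of A_0^{-1}

def term(W):
    # N log|det A_0| expressed in terms of W = A_0^{-1}
    return -Nsamp * np.log(abs(np.linalg.det(W)))

# Analytic gradient, first term of equation 4.9: -N (A_0^T)_{ij}, A_0 = W^{-1}
grad_analytic = -Nsamp * np.linalg.inv(W).T

# Central finite differences over each entry of W
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        grad_fd[i, j] = (term(Wp) - term(Wm)) / (2 * eps)
```

This uses the identity d log|det W| / dW = W^{-T}; the data-dependent score term of equation 4.9 is omitted since only the determinant term is being checked.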
Figure 3: The filters used to produce autocorrelated components (M = 15). (Two panels, Source 1 and Source 2, impulse responses over filter lags 0–15.)
4.2 Protocol for Detecting L. We propose a simple protocol for determining the dimensions (L, M) of the convolutional and component filters. First, expand the convolution without an autofilter (M = 0). This will model the total temporal dependency structure of the system, L_max. The optimal dimension is found by monitoring the Bayes information criterion (BIC) (Schwarz, 1978),

\log p(M \mid X) \approx \log p(X \mid \theta_0, M) - \frac{\dim \theta}{2} \log N,   (4.11)
where M represents a specific choice of model structure (L, M), \theta represents the parameters in the model, \theta_0 are the maximum likelihood parameters, and N is the size of the data set (number of samples). Next, keep the temporal dependency constant, (L + M) = L_max, while expanding the length of the component autofilters M, again monitoring the BIC to determine the optimal choice of L = L_max − M.

4.3 Example: Correctly Rejecting cICA of an Instantaneous Mixture. We now illustrate the importance of the component model and the validity of the protocol for detecting L when dealing with the following fundamental question: Do we learn anything by using convolutive ICA instead of instantaneous ICA? Or, put another way, should L be larger than zero? To produce an instantaneous mixture, we generate two random signals from a Laplace distribution, filter them through the filters of order 15 shown in Figure 3, and mix the two filtered components using an arbitrary square mixing matrix. Figure 4A shows the result of using Bayesian model selection for this mixture without allowing for a filter (M = 0). This corresponds to model selection in a conventional convolutive model. Since the signals are nonwhite, a large L is detected, and the model BIC simply increases as a function of L up to the maximum, here stopped at L = 15. Next (see Figure 4B), we fix L + M = 15. Models with a larger L have at least the same capability as models with lower L, though models with lower L are preferable because they have fewer parameters. By adding the component model, we get the
Figure 4: (A) The result of using Bayesian model selection without allowing for an autofilter (M = 0). Since the signals are nonwhite, the validity of L is unquestioned even at 15 lags (L = 15). (B) We fix L + M = 15 and now get the correct answer: that model information is largest for L = 0, meaning there is no evidence of convolutive mixing. (Axes: Bayes information criterion (/1000) versus number of lags L.)
correct answer in this case: these data contain no evidence of convolutive mixing.

5 Deconvolving an EEG ICA Subspace

We now show by example how cICA can be used to separate the delayed influences of statically defined ICA components on each other, thereby achieving a larger degree of independence in the convolutive component time courses. The procedure described here can be seen as a Diminished approach in which we extract K convolutive components from the D-dimensional data by deconvolving a K-dimensional subspace projection of the data. In Dyrholm, Hansen, Wang, Arendt-Nielsen, & Chen (2004) we used a subspace from principal component analysis (PCA), but as our experiment will show, using ICA for that projection has the benefit that the subspace can be chosen, for example, for physiological interest. As a first test of this approach, we applied convolutive decomposition to 20 minutes of a 71-channel human EEG recording (20 epochs of 1 minute duration), downsampled for numeric convenience to a 50 Hz sampling rate after filtering between 1 and 25 Hz with phase-indifferent FIR filters. First, the recorded (channels-by-times) data matrix (X) was decomposed using extended Infomax ICA (Bell & Sejnowski, 1995; Makeig et al., 1996; Jung et al., 1998; Lee, Girolami, & Sejnowski, 1999; Jung et al., 2001) into 71 maximally independent components whose ("activation") time series were contained
in (components-by-times) matrix S_ICA and whose ("scalp map") projections to the sensors were specified in (channels-by-components) mixing matrix A_ICA, assuming instantaneous linear mixing X = A_ICA S_ICA. Five of the resulting independent components (ICs) were selected for further analysis on the basis of event-related coherence results that showed a transient partial collapse of component independence following the subject button presses (Makeig, Delorme, et al., 2004). More specifically, all five components displayed amplitude activity that was time-locked to the subject button presses, which was evident from looking at ERP images of stimulus-aligned epochs sorted by button-press response time (Makeig et al., 2002). The scalp maps of the five components, that is, the relevant five columns of A_ICA, are shown on the left margin of Figure 7. Next, cICA decomposition was applied to the five component activation time series (the relevant five rows of S_ICA), assuming the model

s_t^ICA = Σ_{τ=0}^{L} A_τ s_{t−τ}^cICA.    (5.1)
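The forward model of equation 5.1 can be sketched numerically. The sizes, kernel order, and innovation distribution below are illustrative assumptions, not the values estimated from the EEG data:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, L = 5, 1000, 10                          # components, samples, kernel order (assumed)
A = rng.normal(scale=0.3, size=(L + 1, K, K))  # convolutive kernels A_tau
s_cc = rng.laplace(size=(K, T))                # hypothetical super-gaussian CC innovations

# Equation 5.1: s_ica(t) = sum over tau of A_tau @ s_cc(t - tau)
s_ica = np.zeros((K, T))
for tau in range(L + 1):
    s_ica[:, tau:] += A[tau] @ s_cc[:, : T - tau]
```

Each static IC activation is thus a sum of filtered CC innovations, which is what lets the CCs capture delayed cross-component structure that a static mixing matrix cannot.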
As a qualified guess of the order L, we applied the approach to estimating L outlined in section 4.2 above to the EEG subspace data. First, we increased the order of the convolutive model L (keeping M = 0) while monitoring the BIC. To produce error bars, we used jackknife resampling (Efron & Tibshirani, 1993). For each value of L, 20 runs of the algorithm were performed, one for each jackknifed epoch; thus, the data in each run consisted of the 19 remaining epochs. Figure 5A shows the mean jackknifed BIC. Clearly, the BIC-optimal order, without an autofilter included, was at least L_max = 40, since some correlations in the data extended to at least 800 ms. Next, we swept the range of possible component model filters M, keeping L + M = 40. Figure 5B shows that L = 10, corresponding to a filter length of 200 ms, proved optimal. Figure 6 shows the 5 × 5 matrix of learned convolutive kernels. Before plotting, we arranged the order of the five output CCs so that the diagonal (CC_i → IC_i) kernels, shown at one-third scale in Figure 6, were dominant. Figure 7 shows the resulting percentage of variance of the contributions from each of the CC innovations to each of the IC activations. As the large diagonal contributions in Figure 7 show, each convolutive CC_j dominated one spatially static IC (IC_j). However, there were clearly significant off-diagonal contributions as well, indicating that spatiotemporal relationships between the static ICA components were captured by the cICA model. To explore the robustness of this result further, we tested for the presence of delayed correlations, first between the static IC activations (s_k^ICA(t)) and then between the learned CC innovations (s_k^cICA(t)). Figure 8 shows, for the
Model Selection in Convolutive ICA
Figure 5: Using the protocol for detecting the order of L for EEG. (A) There are correlations over at least 40 lags in the data, corresponding to 800 ms. (B) By introducing the component model, it turns out that L should be only on the order of 10, corresponding to 200 ms.
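The jackknife-BIC order-selection machinery can be sketched on a toy signal. A least-squares AR fit stands in here for the cICA likelihood, and the signal, epoch count, and orders are illustrative assumptions:

```python
import numpy as np

def ar_bic(x, p):
    """BIC of a least-squares AR(p) fit: T*log(sigma^2) + p*log(T)."""
    y = x[p:]
    T = len(y)
    if p == 0:
        sigma2 = np.mean(y ** 2)
    else:
        X = np.column_stack([x[p - k : len(x) - k] for k in range(1, p + 1)])
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        sigma2 = np.mean((y - X @ a) ** 2)
    return T * np.log(sigma2) + p * np.log(T)

rng = np.random.default_rng(1)
x = np.zeros(4000)                      # toy signal with known order 2
for t in range(2, 4000):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

orders = list(range(8))
epochs = x.reshape(20, 200)
# jackknife: drop one epoch at a time, compute the BIC curve on the rest
bic = np.array([[ar_bic(np.concatenate(np.delete(epochs, j, axis=0)), p)
                 for p in orders] for j in range(20)])
mean_bic = bic.mean(axis=0)             # mean jackknifed BIC, as in Figure 5
best_order = orders[int(np.argmin(mean_bic))]
```

The spread of `bic` over the 20 jackknife runs gives the error bars; the minimum of `mean_bic` plays the role of the selected model order.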
Figure 6: Kernels of the five derived convolutive ICA components (CCs), arranged (in columns) in order of their respective contributions to the five static ICA components (ICs) (rows). Each CC made a dominant contribution to one IC; these were ordered so as to appear on the diagonal. Scaling of the diagonal kernels is one-third that of the off-diagonal kernels.
Figure 7: Percentage variance of five static ICA components (ICs) accounted for by the five derived convolutive components (CCs). The IC scalp maps on the left are shown for interest. Contributions arranged on the diagonal are dominant. Squares represent the (rounded) percentage variance of the IC activation time series accounted for by each CC. Large off-diagonal elements indicate the presence of significant delayed spatiotemporal interactions between the static IC activations.
Figure 8: Predictability of the most predictable ICA component activation (IC3) and cICA component innovation (CC5) from the most predictive other IC and CC component, respectively (IC1, CC3).
most predictable IC and CC, the percentage of their time course variances that was accounted for by linear prediction from the past history (of order r) of the largest contributing remaining ICs or CCs, respectively. As expected from the cICA results, as the prediction order (r) increased, the predictability of the static ICA component activation also increased. For the ICA component activation, 9% of the variance could be explained by linear prediction from the previous 10 time points (200 ms) of another ICA component. The static ICA component time courses were nearly independent only in the sense of zero-order prediction (r = 0), as expected from their derivation. Their lack of independence at other lags is compatible with the cICA results. For the CC innovation, however, the predictability in Figure 8 remained low as r increased, indicating that cICA in fact deconvolved delayed correlations present in the EEG subspace data. Figure 9 shows the power spectral densities for each of the IC activations (in bold traces) along with the two CCs (in thin traces) that, in accordance with Figure 7, contributed the most to the respective IC. Note that the broad alpha band spectral peak in IC1 (the uppermost panel in Figure 9) around 10 Hz has been split between CC1 and CC3. The middle panel shows the distinct spectral contributions of CC1 and CC3 to the double alpha peak in the IC3 spectrum. As expected, the CCs made different spectral contributions to the IC time courses. For example, CC1 made different power spectral density contributions to IC1, IC3, and IC4.
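The delayed-correlation test of Figure 8 amounts to regressing one time course on the past history of another. A sketch on synthetic data follows; the lag-3 coupling is an invented example, not the EEG result:

```python
import numpy as np

def predictability(target, source, r):
    """Percent variance of target explained by linear prediction from
    lags 0..r of source (r = 0 is zero-order, i.e., instantaneous)."""
    T = len(target)
    X = np.column_stack([source[r - k : T - k] for k in range(r + 1)])
    y = target[r:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 100.0 * (1.0 - resid.var() / y.var())

rng = np.random.default_rng(2)
src = rng.normal(size=5000)
tgt = 0.8 * np.roll(src, 3) + 0.5 * rng.normal(size=5000)  # tgt depends on src at lag 3
tgt[:3] = rng.normal(size=3)                               # discard the wrap-around samples

# predictability stays near zero until the true lag enters the regressor set
curve = [predictability(tgt, src, r) for r in range(6)]
```

A curve that stays flat as r grows (as for the CC innovations in Figure 8) is evidence that no delayed linear dependence remains.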
6 Discussion

In general, the usefulness of any blind decomposition method applied to biological time-series data is most likely related to the fit between the assumptions of the algorithm and the underlying physiology and biophysics. Therefore, it is important to consider the physiological basis of the delayed interactions between statically defined independent component time courses we observed here, and the possible physiological significance of the derived convolutive component filters and time courses. Applied to these EEG data, static ICA gave 15 to 20 components that were of physiological interest according to their spatial projections or activation time series, although we were not able to practically deconvolve more than five components here because of numerical complexity. Open questions therefore are how to identify independent component subspaces of interest for cICA decomposition and how efficiently cICA can be performed on larger computer clusters. In the future, convolutive ICA might also be applied usefully to other types of biomedical time-series data that involve stereotyped source movements, which present problems for static ICA decomposition. These might include electrocardiographic (ECG) and brain hemodynamic measures such as diffusion tensor imaging (DTI) (Anemüller et al., 2004).
Figure 9: Power spectrum of the most powerful CC contributions to the five ICs.
References

Anemüller, J., Duann, J.-R., Sejnowski, T. J., & Makeig, S. (2004). Unraveling spatiotemporal dynamics in fMRI recordings using complex ICA. In C. G. Puntonet & A. Prieto (Eds.), Independent component analysis and blind signal separation (pp. 1103–1110). Berlin: Springer-Verlag.
Anemüller, J., & Kollmeier, B. (2003). Adaptive separation of acoustic sources for anechoic conditions: A constrained frequency domain approach. Speech Communication, 39(1–2), 79–95.
Anemüller, J., Sejnowski, T., & Makeig, S. (2003). Complex independent component analysis of frequency-domain EEG data. Neural Networks, 16, 1313–1325.
Attias, H., & Schreiner, C. E. (1998). Blind source separation and deconvolution: The dynamic component analysis algorithm. Neural Computation, 10(6), 1373–1424.
Baumann, W., Kohler, B.-U., Kolossa, D., & Orglmeister, R. (2001). Real time separation of convolutive mixtures. In T.-W. Lee, T.-P. Jung, S. Makeig, & T. J. Sejnowski (Eds.), 3rd International Conference on Independent Component Analysis and Blind Signal Separation (pp. 65–69). San Diego, CA.
Bell, A., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Belouchrani, A., Meraim, K. A., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 42, 434–444.
Cardoso, J.-F., & Pham, D.-T. (2004). Optimization issues in noisy gaussian ICA. In C. G. Puntonet & A. Prieto (Eds.), Independent component analysis and blind signal separation (pp. 41–48). Berlin: Springer-Verlag.
Choi, S., & Cichocki, A. (1997). Blind signal deconvolution by spatio-temporal decorrelation and demixing. In J. Principe, L. Giles, N. Morgan, & E. Wilson (Eds.), Neural networks for signal processing (pp. 426–435). Amelia Island, FL.
Choi, S., Amari, S.-I., Cichocki, A., & Liu, R.-W. (1999). Natural gradient learning with a nonholonomic constraint for blind deconvolution of multiple channels. In Independent Component Analysis and Blind Signal Separation (pp. 371–376). Aussois, France.
Comon, P., Moreau, E., & Rota, L. (2001). Blind separation of convolutive mixtures: A contrast-based joint diagonalization approach. In T.-W. Lee, T.-P. Jung, S. Makeig, & T. J. Sejnowski (Eds.), Independent Component Analysis and Blind Source Separation (pp. 686–691). San Diego, CA.
Deligne, S., & Gopinath, R. (2002). An EM algorithm for convolutive independent component analysis. Neurocomputing, 49, 187–211.
Delorme, A., & Makeig, S. (2004). EEGLAB: An open source toolbox for analysis of single-trial EEG dynamics. Journal of Neuroscience Methods, 134, 9–21.
Delorme, A., Makeig, S., & Sejnowski, T. J. (2002). From single-trial EEG to brain area dynamics. Neurocomputing, 44–46, 1057–1064.
Douglas, S. C., Cichocki, A., & Amari, S. (1999). Self-whitening algorithms for adaptive equalization and deconvolution. IEEE Transactions on Signal Processing, 47, 1161–1165.
Dyrholm, M., & Hansen, L. K. (2004). CICAAR: Convolutive ICA with an autoregressive inverse model. In C. G. Puntonet & A. Prieto (Eds.), Independent Component Analysis and Blind Signal Separation (pp. 594–601). Berlin: Springer-Verlag.
Dyrholm, M., Hansen, L. K., Wang, L., Arendt-Nielsen, L., & Chen, A. C. (2004). Convolutive ICA (c-ICA) captures complex spatio-temporal EEG activity. http://www.imm.dtu.dk/~mad/papers/cICA eeg hbm2004.pdf.
Dyrholm, M., Makeig, S., & Hansen, L. K. (2006). Model structure selection in convolutive mixtures. In J. Rosca, D. Erdogmus, J. C. Príncipe, & S. Haykin (Eds.),
Independent Component Analysis and Blind Signal Separation (pp. 74–81). Berlin: Springer-Verlag.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. London: Chapman & Hall.
Hansen, P. C. (2002). Deconvolution and regularization with Toeplitz matrices. Numerical Algorithms, 29, 323–378.
Jung, T.-P., Humphries, C., Lee, T.-W., Makeig, S., McKeown, M. J., Iragui, V., & Sejnowski, T. J. (1998). Extended ICA removes artifacts from electroencephalographic recordings. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Jung, T.-P., Makeig, S., Humphries, C., Lee, T.-W., McKeown, M. J., Iragui, V., & Sejnowski, T. J. (2000). Removing electroencephalographic artifacts by blind source separation. Psychophysiology, 37, 163–178.
Jung, T.-P., Makeig, S., McKeown, M. J., Bell, A., Lee, T.-W., & Sejnowski, T. J. (2001). Imaging brain dynamics using independent component analysis. Proceedings of the IEEE, 89(7), 1107–1122.
Lee, T.-W., Bell, A. J., & Lambert, R. H. (1997). Blind separation of delayed and convolved sources. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 758–764). Cambridge, MA: MIT Press.
Lee, T.-W., Bell, A. J., & Orglmeister, R. (1997). Blind source separation of real world signals. In International Conference on Neural Networks (pp. 2129–2135). Houston, TX.
Lee, T.-W., Girolami, M., & Sejnowski, T. J. (1999). Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation, 11(2), 417–441.
Makeig, S., Bell, A. J., Jung, T.-P., & Sejnowski, T. J. (1996). Independent component analysis of electroencephalographic data. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 145–151). Cambridge, MA: MIT Press.
Makeig, S., Debener, S., Onton, J., & Delorme, A. (2004). Mining event-related brain dynamics. Trends in Cognitive Sciences, 8(5), 204–210.
Makeig, S., Delorme, A., Westerfield, M., Townsend, J., Courchesne, E., & Sejnowski, T. (2004). Electroencephalographic brain dynamics following visual targets requiring manual responses. PLoS Biology, 2, e176.
Makeig, S., Enghoff, S., Jung, T.-P., & Sejnowski, T. J. (2000). A natural basis for efficient brain-actuated control. IEEE Transactions on Rehabilitation Engineering, 8, 208–211.
Makeig, S., Westerfield, M., Jung, T.-P., Enghoff, S., Townsend, J., Courchesne, E., & Sejnowski, T. J. (2002). Dynamic brain sources of visual evoked responses. Science, 295(5555), 690–694.
Mitianoudis, N., & Davies, M. (2003). Audio source separation of convolutive mixtures. IEEE Transactions on Speech and Audio Processing, 11(5), 489–497.
Moulines, E., Cardoso, J.-F., & Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In International Conference on Acoustics, Speech, and Signal Processing (pp. 3617–3620). Los Alamitos, CA: IEEE Computer Society Press.
Neumaier, A., & Schneider, T. (2001). Estimation of parameters and eigenmodes of multivariate autoregressive models. ACM Transactions on Mathematical Software, 27(1), 27–57.
Nielsen, H. B. (2000). UCMINF—an algorithm for unconstrained, nonlinear optimization (Tech. Rep. IMM-REP-2000-19). Lyngby: Department of Mathematical Modeling, Technical University of Denmark.
Onton, J., Delorme, A., & Makeig, S. (2005). Frontal midline EEG dynamics during working memory. NeuroImage, 27, 342–356.
Parra, L., & Spence, C. (2000). Convolutive blind source separation of nonstationary sources. IEEE Transactions on Speech and Audio Processing, 8, 320–327.
Parra, L., Spence, C., & de Vries, B. (1997). Convolutive source separation and signal modeling with ML. In International Symposium on Intelligent Systems, Reggio Calabria, Italy.
Parra, L., Spence, C., & de Vries, B. (1998). Convolutive blind source separation based on multiple decorrelation. In Neural Networks for Signal Processing Proceedings (pp. 23–32). Cambridge, UK.
Pearlmutter, B. A., & Parra, L. C. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 613–619). Cambridge, MA: MIT Press.
Rahbar, K., & Reilly, J. (2001). Blind source separation of convolved sources by joint approximate diagonalization of cross-spectral density matrices. In International Conference on Acoustics, Speech, and Signal Processing (pp. 2745–2748). Los Alamitos, CA: IEEE Computer Society Press.
Rahbar, K., Reilly, J. P., & Manton, J. H. (2002). A frequency domain approach to blind identification of MIMO FIR systems driven by quasi-stationary signals. In International Conference on Acoustics, Speech, and Signal Processing (pp. 1717–1720). Los Alamitos, CA: IEEE Computer Society Press.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Sun, X., & Douglas, S. (2001). A natural gradient convolutive blind source separation algorithm for speech mixtures. In T.-W. Lee, T.-P. Jung, S. Makeig, & T. J. Sejnowski (Eds.), Independent Component Analysis and Blind Source Separation (pp. 59–64). San Diego, CA.
Torkkola, K. (1996). Blind separation of convolved sources based on information maximization. In Neural Networks for Signal Processing Proceedings (pp. 423–432). Kyoto, Japan.
Received August 2, 2005; accepted July 26, 2006.
LETTER
Communicated by Daniel Amit
Information and Topology in Attractor Neural Networks

D. Dominguez
[email protected]

K. Koroutchev
[email protected]

E. Serrano
[email protected]

F. B. Rodríguez
[email protected]

EPS, Universidad Autónoma de Madrid, 28049 Madrid, Spain
A wide range of networks, including those with small-world topology, can be modeled by the connectivity ratio and randomness of the links. Both learning and attractor abilities of a neural network can be measured by the mutual information (MI) as a function of the load and the overlap between patterns and retrieval states. In this letter, we use MI to search for the optimal topology with regard to the storage and attractor properties of the network in an Amari-Hopfield model. We find that while optimal storage implies an extremely diluted topology, a large basin of attraction leads to moderate levels of connectivity. This optimal topology is related to the clustering and path length of the network. We also build a diagram for the dynamical phases with random or local initial overlap and show that very diluted networks lose their attractor ability.

1 Introduction

Interest in attractor Amari-Hopfield neural networks, originally dealing with fully connected architectures (Amari, 1972; Hopfield, 1982), has been renewed with the study of more realistic topologies (Masuda & Aihara, 2004; Torres, Muñoz, Marro, & Garrido, 2004). Among them, relatively simple small-world (SW) graphs (Watts & Strogatz, 1998) can capture most features of a wide range of realistic social, artificial, or biological networks (Albert & Barabási, 2002; Rolls & Treves, 2004). The SW connectivity properties interpolate between a regular and a completely random network. SW graphs are modeled by only two parameters: the average ratio of links per node, γ, and the ratio of long-range random links, ω. The Amari-Hopfield network, as an information-processing system, stores information through specific configurations of the network activity (i.e., patterns that can be recovered in response to stimuli). Of interest is the load α, the ratio of patterns to links that the network stores.

Neural Computation 19, 956–973 (2007)
© 2007 Massachusetts Institute of Technology

The most
Figure 1: The overlap m (top panels) and the information rate i (bottom panels) versus α for fully connected (γ_FC = 1.0, left) and random extremely diluted (γ_ED = 10^{-4}, right) networks. Theory (solid lines) and simulation (circles).
common measure of the retrieval ability of these networks is the overlap, m, between neuron states and the memorized patterns (Amit, Gutfreund, & Sompolinsky, 1985). This measure is plotted as a function of the load α in the top panels of Figure 1 for the classical Amari-Hopfield fully connected (FC, γ = 1) and random extremely diluted (RED, γ ≪ 1, ω = 1) networks. The left panel of Figure 1 shows that the FC network has a critical capacity at α_c^FC ≈ 0.145: the overlap goes sharply to 0 for loads larger than α_c^FC, defining a well-pronounced phase transition with a small amount of error in the retrieval of patterns (m_c ≈ 0.97). This result was discussed long ago by Amit et al. (1985) and Crisanti, Amit, and Gutfreund (1986). However, for random extremely diluted networks, the transition of m to zero occurs at α_c^RED ≈ 0.64 (Canning & Gardner, 1988), but it is a smooth transition: m_c = 0 (see the right panel of Figure 1). Therefore, the characterization of the system exclusively in terms of m and α_c poses some problems. Less attention has been paid to another parameter of the network: the mutual information between the stored patterns and the neural states, or its derivative, the information ratio (Okada, 1996; Dominguez & Bolle, 1998). The bottom panels of Figure 1 display the information ratio i(α), evaluated from the conditional probability of neuron states and the patterns. When
the network approaches its saturation limit α_c, the neuron states cannot remain close to the patterns because the overlap is usually small. Thus, while the number of patterns increases, the information per stored pattern decreases. Therefore, the information i(α) for all topologies (except the FC) is a nonmonotonic function of the load, which reaches its maximum i_max ≡ i(α_max) at some value α_max ≤ α_c of the load. For the FC case, α_max and α_c coincide, and the critical information is about i_c^FC ≈ 0.13, as can be seen in the left panel of Figure 1. However, for other long-range topologies, such as RED, this does not hold (see the right panel of Figure 1). The RED network has null information at α_c. If one looks for the maximum, one finds i_max^RED ≈ 0.22 for α_max^RED ≈ 0.32. In conclusion, the parameters i_max and α_max are well defined and are more appropriate than m_c and α_c to evaluate the Hopfield network's performance on various topologies. In this letter, we address the problem of searching for the topology that maximizes the information capacity of the network. Using the graph framework, we build networks with a connectivity ratio changing from fully connected (γ = 1) to extremely diluted (γ → 0) and randomness ranging from purely local (ω = 0) to totally random (ω = 1) connectivity. We show that i increases (decreases) with γ for local (random) networks and remains about the same for SW topologies. A question arises about the optimal topology: if the randomness ω is fixed (by physical constraints), which is the best connectivity γ? To our knowledge, no one has previously answered this question. We approach this problem from two scenarios: the stability of the memorized patterns (related to the number of errors induced by crosstalk) and the domain of the attractors (related to the network's ability to associate a corrupted input pattern with a stored pattern). The structure of the letter is as follows.
In section 2, we define the neural dynamics model and review the information measures used in the calculations. The results are shown in section 3, where we study the stationary states by simulation and theory. In section 4 the attractors’ sensitivity to different initial conditions is discussed. We present a diagram for the phases with local and random initial conditions and show the relation between topology and mutual information. The conclusions are drawn in the final section.
2 The Model

2.1 Topology and Dynamics. The network state at a given time t is defined by a set of N binary neurons, σ^t = {σ_i^t ∈ {±1}, i = 1, . . . , N}. The network learns a set of independent patterns {ξ^μ, μ = 1, . . . , P}. Accordingly, each pattern ξ^μ = {ξ_i^μ ∈ {±1}, i = 1, . . . , N} is a set of site-independent random variables, binary and uniformly distributed: p(ξ_i^μ = ±1) = 1/2. The synaptic couplings are J_ij ≡ C_ij W_ij, where C denotes the topology matrix and W are the learning weights.
The topology, defined by the matrix C = {C_ij ∈ {0, 1}}, with C_ii = 0, splits into local and random links. The local links connect each neuron to its K_l nearest neighbors in a closed ring. The random links connect each neuron to K_r others, uniformly distributed along the network. Hence, the network degree is K = K_l + K_r. The network topology is then characterized by two parameters, the connectivity ratio and the randomness ratio, defined respectively by

γ = K/N,    ω = K_r/K,    (2.1)
where ω plays the role of a rewiring probability in the small-world (SW) model (Watts & Strogatz, 1998). The variant used here was proposed by Newman and Watts (1999) and has the advantage of never disconnecting the graph. We use the term small world for the procedure used to build the connectivity graph rather than for the properties of the resulting network (large clustering and small path length). Note that the topology can be implemented by an adjacency list connecting neighbors, so the storage cost of this network is given by the cardinality |J| = N × K. Only long-range networks (K ∝ N, N → ∞) will be considered in this work, whatever ω. The learning algorithm updates W according to the Hebb rule

W_ij^μ = W_ij^{μ−1} + ξ_i^μ ξ_j^μ.    (2.2)
The network starts at W_ij^0 = 0, and after μ = P learning steps, it reaches the value W_ij = Σ_μ ξ_i^μ ξ_j^μ. The learning stage has a slow dynamics, which is stationary on the timescale of the much faster retrieval stage. The neural states σ_i^t ∈ {±1} are updated according to the dynamics

σ_i^{t+1} = sign(h_i^t),    h_i^t ≡ Σ_j J_ij σ_j^t,    i = 1, . . . , N,    (2.3)
where h_i^t is the local field of neuron i at time t. Note that in the case of symmetric synaptic couplings, J_ij = J_ji, an energy function can be defined whose minima are the stable states of the dynamics, equation 2.3.

2.2 Information Measures. The task of the neural channel is to retrieve a pattern (say, ξ ≡ ξ^μ) starting from a neuron state σ^0 that is inside its attractor basin. This is achieved through the network dynamics, equation 2.3. The relevant order parameter is the overlap between the neural states and the pattern at time step t,

m_N^t ≡ (1/N) Σ_i ξ_i σ_i^t.    (2.4)
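Equations 2.2 to 2.4 can be put together in a few lines. The sketch below uses a fully connected toy network (C_ij = 1 for i ≠ j) with assumed sizes N = 500 and P = 10, chosen only so that it runs quickly:

```python
import numpy as np

rng = np.random.default_rng(3)
N, P = 500, 10                              # toy sizes; load alpha = P/N = 0.02
xi = rng.choice([-1, 1], size=(P, N))       # patterns xi^mu

W = xi.T @ xi                               # Hebb weights, eq. 2.2 iterated over mu
np.fill_diagonal(W, 0)
J = W                                       # fully connected case: C_ij = 1 for i != j

sigma = xi[0].copy()                        # start at pattern 0 (initial overlap m = 1)
for _ in range(10):                         # synchronous updates of eq. 2.3
    sigma = np.where(J @ sigma >= 0, 1, -1)

m = (xi[0] * sigma).mean()                  # overlap with the target pattern, eq. 2.4
```

At such a low load the target pattern is a fixed point, so the overlap stays close to 1; raising P toward the critical load degrades retrieval.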
We consider a mean-field network, and the distribution of the states is assumed to be site independent (Newman, Moore, & Watts, 2000). Therefore, according to the law of large numbers, the overlap can be written, for K, N → ∞, as m^t = ⟨σ^t ξ⟩_{σ,ξ}. The brackets represent an average over the joint distribution p(σ, ξ) for a single neuron, understood as an ensemble distribution for the neuron states {σ_i} and pattern {ξ_i} (Dominguez & Bolle, 1998). We calculate the mutual information MI[p(σ; ξ)], a quantity that measures the prediction that an observer at the output, σ, can make about the input, ξ (we drop the time index t). We consider solely nonspatially modulated states; hence, MI factorizes in sites. It reads MI = S[σ] − S[σ|ξ], where the output entropy, S[σ], and the conditional entropy, S[σ|ξ], depend on the distribution p(σ, ξ). This distribution factorizes into the conditional probability p(σ|ξ) = (1 + mσξ)/2 and the input probability p(ξ). In p(σ|ξ), all types of noise in the retrieval process are enclosed (from both the environment and the dynamical process itself). With the above expressions and p(σ) ≡ δ(σ² − 1), the entropies read (Bolle, Dominguez, & Amari, 2000):

S[σ|ξ] = −((1 + m)/2) log₂((1 + m)/2) − ((1 − m)/2) log₂((1 − m)/2),
S[σ] = 1 [bit].    (2.5)
Together with the overlap, one needs a measure of the load, which is the rate of pattern bits per synapse used to store them. Since the synapses and patterns are independent, the load is given by α = |{ξ^μ}|/|J| ≡ P/K. We define the information ratio as

i(α, m) ≡ MI[σ; {ξ^μ}]/|J| = α MI[σ; ξ],    (2.6)

since for independent neurons and patterns, MI[σ; {ξ^μ}] ≡ Σ_{iμ} MI[σ_i; ξ_i^μ].
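Equations 2.5 and 2.6 combine into a one-line information ratio; a minimal sketch:

```python
import numpy as np

def info_ratio(alpha, m):
    """i(alpha, m) = alpha * (S[sigma] - S[sigma|xi]) with S[sigma] = 1 bit
    and the binary conditional entropy of equation 2.5."""
    p = np.clip((1.0 + m) / 2.0, 1e-12, 1.0)
    q = np.clip((1.0 - m) / 2.0, 1e-12, 1.0)
    s_cond = -p * np.log2(p) - q * np.log2(q)
    return alpha * (1.0 - s_cond)

# perfect retrieval carries alpha bits per synapse; zero overlap carries none
i_perfect = info_ratio(0.1, 1.0)
i_zero = info_ratio(0.2, 0.0)
```

This makes the trade-off in the text explicit: growing α multiplies the prefactor, while the accompanying drop in m shrinks the per-pattern information, so i(α) can peak below α_c.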
3 Stationary States

3.1 Simulation. We deal with different architectures: three regimes of dilution, namely the fully connected (FC, γ = 1), moderately connected (MD, 0 < γ < 1), and extremely diluted (ED, γ → 0) symmetric networks. The level of randomness runs from purely local (ω = 0) to completely random (ω = 1) networks. We simulate the dynamics with a fixed total number of connections, |J| = K · N = 3.6 · 10^7, changing the size of the network from N = 6 · 10^3 = K (γ = 1) up to N = 6 · 10^5, K = 60 (γ = 10^{-4}). By comparison, McGraw and Menzinger (2003) use K = 50, N = 5 · 10^3 (γ = 10^{-2}), with |J| = 25 · 10^4, which is far from the asymptotic limit. All runs use a smoothing average over a uniform window along the load axis P = α · N.
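A minimal sketch of constructing the connectivity matrix for given (γ, ω), following the Newman-Watts idea of adding random links to a ring; the construction details below are an illustrative assumption, not necessarily the authors' exact procedure:

```python
import numpy as np

def build_topology(N, gamma, omega, seed=0):
    """Symmetric 0/1 topology matrix C with C_ii = 0 and degree ~ K = gamma*N,
    split into K_l = (1-omega)*K ring links and K_r = omega*K random links."""
    rng = np.random.default_rng(seed)
    K = int(round(gamma * N))
    Kl = int(round((1.0 - omega) * K))
    C = np.zeros((N, N), dtype=np.int8)
    for d in range(1, Kl // 2 + 1):       # local links: d-th neighbors on the ring
        for i in range(N):
            C[i, (i + d) % N] = C[i, (i - d) % N] = 1
    for i in range(N):                    # long-range links, uniformly distributed
        for j in rng.choice(N, size=K - Kl, replace=False):
            if j != i:
                C[i, j] = C[j, i] = 1
    np.fill_diagonal(C, 0)
    return C

C = build_topology(100, 0.2, 0.0)         # purely local network: omega = 0
```

Multiplying such a matrix elementwise with the Hebb weights gives the couplings J_ij = C_ij W_ij used in the simulations.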
Figure 2: Overlap m(α) (top panels) and information rate i(α) (bottom panels) for initial m0 = 1.0, γ = 1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4} (from left to right), and ω = 0.0, 0.2, 1.0 (circles, plus signs, diamonds).
We study the stability properties of the retrieval states, starting the network from a given pattern, with initial overlap m0 = 1, and letting the neurons update until relaxation. The upper panels of Figure 2 plot the overlap after it converges to a fixed point (m*), for each learned pattern. There is a transition from a retrieval phase (R, m* > 0) to a zero phase (Z, m* = 0) at a critical α_c. For the FC network, α_c^FC ≈ 0.14 (Amit et al., 1985; Crisanti et al., 1986). The load α_c for diluted networks depends on ω. For the RED network, α_c^RED ≈ 0.64 (Canning & Gardner, 1988); for the local ED network, α_c^LED ≈ 0.22, as seen in the rightmost panel of Figure 2. By increasing the dilution (decreasing γ), the transition with α becomes smoother, but α_c increases. On the other hand, we observe that the overlap is always larger for random networks than for local networks, increasing with ω for all dilution degrees. Since there is a competition between the critical load α_c and the sharpness of the transition in m, we plot in the bottom panels of Figure 2 the information ratio i(α, m) for the three architectures. It can be seen that i(α, m) is a nonmonotonic function of α that reaches a maximum, i_max(γ, ω) ≡ i(α_max, γ, ω), for a load α_max ≤ α_c, with α_max = α_c only for the discontinuous transition
of the FC network. The maximum i_max increases from the left to the right panel of Figure 2 (γ → 0, ED networks) and from local (ω = 0) to random connections (ω = 1). Overall, the optimal topology with respect to the stability of the retrieval states is the RED network, with i_max ≈ 0.23 at α_max ≈ 0.34 and m_max ≈ 0.87.

3.2 Replica Theory. We consider a symmetric Hebb network with topology described by its connectivity matrix C, with parameters according to equation 2.1. The static thermodynamics of a symmetric Hebb neural network in the thermodynamic limit (N → ∞) is exactly solvable for arbitrary topology, provided that each neuron is equally connected to the others and that a site-independent solution for the fixed point exists. This is the case for the SW topology (Watts & Strogatz, 1998), which we study in detail here. The energy of the symmetric network in the state σ ≡ {σ_i} is

H = −(1/(2Nγ)) Σ_{i,j=1}^N C_ij W_ij σ_i σ_j,    (3.1)
with W_ij given by the Hebb rule, equation 2.2. H completely defines the equilibrium thermodynamics of the system. We look only for spatially independent solutions; therefore, the order parameters are independent of the indexes. Following the literature (Amit et al., 1985; Canning & Gardner, 1988), we define the order parameters as

m^ρ = ⟨(ξ_i ⟨σ_i⟩^ρ)_i⟩_ξ,    (3.2)
q^{ρτ} = ⟨(⟨σ_i⟩^ρ ⟨σ_i⟩^τ)_i⟩_ξ,    τ ≠ ρ,    (3.3)
r^{ρτ} = (γα)^{−1} Σ_ν m_ν^ρ m_ν^τ,    τ ≠ ρ,    (3.4)
where (.)i ≡ N−1 i (.) is a spatial average, . ξ is the pattern average, and . is the thermal average. The ξ is the only pattern with macroscopic overlap; the index ν runs over the rest of the patterns; q and r are the configuration’s correlation between the replicas of σ and mν respectively. Further, we calculate the free energy and get an equation, analogous to equation 3.2 in Canning and Gardner (1988), the extremum in m, r, q of: f=
γ γ γ α + m2 Tr C −1 + αβr (1 − q ) 2 2 2 α Tr ln (1 − β(1 − q )C/N) + 2β
Information and Topology in Attractor Neural Networks
− Tr βq (1 − β(1 − q )C/N)−1 C/N √ 1 Dz ln 2 cosh γβ( αr z + mξ ) ξ , − γβ
963
(3.5)
where β = 1/T is the inverse temperature, \mathbf{1} is the N × N unit matrix, and Dz denotes the gaussian measure. We denote the trace of a matrix as \operatorname{Tr} M \equiv \sum_i M_{ii}, not the trace over the configurations of the system. All terms of the free energy scale well as N → ∞. The only problem is the term \sum_{ij} (C^{-1})_{ij}, for which the sum must yield a 1/γ factor. Note that for the SW connectivity considered here, the degree of each node is γN + O(√N), the largest eigenvalue of C is also γN + O(√N), and each component of the corresponding eigenvector is v^0_i = 1/√N + O(1/N). The term in question can be calculated by decomposing C^{-1} in its eigenbasis,

C^{-1} = \sum_k v^k \lambda_k^{-1} v^{k\dagger},

where λ_k is the kth largest eigenvalue and v^{k\dagger} the transpose of the corresponding eigenvector. Note that λ_0 = γN + O(√N). The approximation terms vanish in the limit N → ∞. By changing the SW rewiring scheme, one could obtain results without approximation terms, but to match the computer simulations, we do not do that. We have

\sum_{ij} (C^{-1})_{ij} = N v^{0\dagger} \cdot C^{-1} \cdot v^0 = N v^{0\dagger} \cdot \Big( \sum_k v^k \lambda_k^{-1} v^{k\dagger} \Big) \cdot v^0 = 1/\gamma.
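The 1/γ value of this sum is easy to check numerically: for any degree-regular graph, the uniform vector is an eigenvector of C with the largest eigenvalue K = γN, so C^{-1} maps it to 1/K and the sum of all entries of C^{-1} equals N/K = 1/γ. A minimal sketch (the network size and the use of a plain ω = 0 ring are our own illustrative choices):

```python
import numpy as np

# Regular ring network: N nodes, each connected to its K/2 nearest
# neighbours on both sides (the omega = 0 limit of the SW topology).
N, K = 101, 10            # gamma = K/N; N chosen so that C is nonsingular
gamma = K / N

C = np.zeros((N, N))
for d in range(1, K // 2 + 1):
    for i in range(N):
        C[i, (i + d) % N] = 1.0
        C[i, (i - d) % N] = 1.0

# Every row sums to K, so the uniform vector is an eigenvector of C with
# eigenvalue K = gamma*N (the largest one).  Hence C^{-1} applied to the
# uniform vector gives 1/K, and
#   sum_ij (C^{-1})_ij = 1^T C^{-1} 1 = N / K = 1 / gamma.
x = np.linalg.solve(C, np.ones(N))     # x = C^{-1} 1
total = np.ones(N) @ x                 # = sum_ij (C^{-1})_ij
print(total, 1.0 / gamma)              # both close to 10.1
```

With rewiring (ω > 0) the degrees are only approximately K, so the identity holds up to the O(√N) corrections mentioned above.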
From this point on, the derivation of the equations for the order parameters is straightforward, and in the thermodynamic limit (N → ∞), one obtains the same result as Canning and Gardner (1988):
m = \Big\langle \int Dz \, \xi \tanh \beta\gamma\big(\sqrt{\alpha r}\, z + m\xi\big) \Big\rangle_{\xi},   (3.6)

q = \Big\langle \int Dz \, \tanh^2 \beta\gamma\big(\sqrt{\alpha r}\, z + m\xi\big) \Big\rangle_{\xi},   (3.7)

r = \frac{q}{\gamma} \operatorname{Tr} \Big[ \big(\mathbf{1} - \beta(1-q)C/N\big)^{-2} \big(C/N\big)^2 \Big].   (3.8)
Since the connectivity is spatially dependent, it is not obvious that site fluctuations cannot give rise to solutions with a local macroscopic dependence of m, that is, spatially dependent solutions. However, when the neurons are unbiased, σ ∈ {±1}, with zero threshold, only spatially symmetric states can be observed in the thermodynamic limit for finite γ > 0, as shown in Koroutchev and Korutcheva (2005). At T → 0, keeping the quantity G ≡ γβ(1 − q) finite, the tanh(·) in equation 3.6 converges to sign(·), tanh²(·) in equation 3.7 behaves as 1 − δ(·), and the expression in equation 3.8 can be expanded in a series in G, giving

m = \operatorname{erf}\big(m/\sqrt{2 r \alpha}\big),   (3.9)

G = \sqrt{2/(\pi r \alpha)} \, e^{-m^2/(2 r \alpha)},   (3.10)

r = \sum_{k=0}^{\infty} (k+1) \, a_k \, G^k,   (3.11)
where a_k ≡ γ Tr[(C/K)^{k+2}]. Note that a_k is the probability of the existence of (possibly self-crossing) cycles of length k + 2 in the connectivity graph. According to this definition, a_0 = 1 because C is symmetric, and a_1 is the probability of having a 3-cycle, that is, the clustering index of the graph. The series converges because G ∈ [0, 1) and a_k ∈ [0, 1]. For random extremely diluted and fully connected networks, one recovers the known results r^RED = 1 and r^FC = 1/(1 − G)², respectively (Bolle, Jongen, & Shim, 1999). The only equation explicitly dependent on the topology of the system is equation 3.11, and the only topological features important for the network equilibrium are the {a_k}. As k → ∞, a_k tends to γ. The evaluation of the Ising model in terms of cycles was studied by Parisi (1988). We observed that, compared with simulations at finite N, equations 3.9 to 3.11 overestimate the role of long loops. In order to close this system of equations, one must estimate the {a_k}, which are topology dependent. This is done in the appendix. In Figure 3 we plot with lines i_max as a function of the dilution γ for different values of the randomness ω, obtained from the theoretical equations 3.9 to 3.11. The results of the simulation for m_{t=0} = 1 are represented in Figure 3 with symbols. A comparison of the theoretical predictions with the simulations yields, in the experimental range we study, no significant deviation from the replica-symmetric solution; therefore, no replica symmetry breaking is expected. Figure 3 shows that theory and simulation agree for most ω > 0, with some fluctuations for ω ∼ 0.

4 Attractors

4.1 Sensitivity to Initial Overlap. The theoretical equations for the stationary states, equations 3.9 to 3.11, account only for the existence of the retrieval (R) solution m > 0. They say nothing about its stability. The zero
[Figure 3 here: i_max versus γ, with curves for ω = 1.0, 0.3, 0.2, 0.1, 0.0; panel title "Theory × Simulation (stability of m > 0, m_0 = 1.0)".]

Figure 3: Maximal information i_max = i(α_max) versus γ. Theory for the stationary states (solid lines) and simulations (symbols) for m_0 = 1.0 and several values of the randomness ω.
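Once the cycle probabilities a_k entering equation 3.11 are known, equations 3.9 to 3.11 can be solved by damped fixed-point iteration. A minimal numerical sketch (the function name, the damping, and the series truncation are our own choices; the fully connected case a_k = 1, for which the series sums to r = 1/(1 − G)², serves as a check):

```python
import math

def solve_T0(alpha, a, kmax=200, iters=500, damp=0.5):
    """Damped fixed-point iteration of the T = 0 equations (3.9)-(3.11).
    a(k) must return the cycle probability a_k of the connectivity graph."""
    m, r, G = 1.0, 1.0, 0.0
    for _ in range(iters):
        # Equation (3.10): G = sqrt(2/(pi r alpha)) exp(-m^2/(2 r alpha))
        G = math.sqrt(2.0 / (math.pi * r * alpha)) * math.exp(-m * m / (2.0 * r * alpha))
        G = min(G, 0.999)  # keep the series (3.11) summable
        # Equation (3.11), truncated at kmax terms
        r_new = sum((k + 1) * a(k) * G ** k for k in range(kmax))
        r = damp * r + (1.0 - damp) * r_new
        # Equation (3.9)
        m = math.erf(m / math.sqrt(2.0 * r * alpha))
    return m, r, G

# Fully connected network: a_k = 1 for all k, so equation (3.11) sums to
# r = 1/(1 - G)^2; alpha = 0.10 is below alpha_c ~ 0.14, so a retrieval
# solution with m close to 1 should be found.
m, r, G = solve_T0(alpha=0.10, a=lambda k: 1.0)
print(m, r, G)
```

Setting a(k) = 1 for k = 0 and 0 otherwise reproduces the RED result r = 1, as expected from the discussion above.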
states (Z), m = 0, are also a solution of these equations, so both R and Z may coexist in some region of the topological parameters γ, ω. In order to study the stability of the attractors, we simulate equation 2.3 and observe how the network behaves under different initial conditions. It is worth noting that no theory for the dynamics of our topological model is yet known (Bolle et al., 1999). To analyze the attractor properties of the retrieval, we let the neuron states start from a configuration σ^0 weakly correlated with a learned pattern, ξ ≡ ξ^μ. If it lies inside the basin of attraction, σ^0 ∈ B(ξ), the pattern will be recovered, σ^t → ξ. First, we choose an initial configuration with a random correlation with the learned pattern, p(σ^0 = ±ξ | ξ) = (1 ± m_0)/2, for all neurons (so we avoid a bias between local and random neighbors). We call this the m_R initial overlap. The retrieval dynamics starts with an overlap m_0 = 0.1 and stops after it converges to a fixed point m*. Usually t_f = 80 sequential updates are enough for retrieval. The information i(α, m; γ, ω) is calculated according to equations 2.5 and 2.6 for m*. The results are depicted in the bottom panels of Figure 4, where i(α) is plotted for different values of γ, ω. Unlike the theoretical results for the storage capacity, moderately diluted (MD) topologies perform best when the attractor properties are considered. We define the optimal topology, for a
[Figure 4 here: five columns for γ = 1, 10^-1, 10^-2, 10^-3, 10^-4; upper row m_L, lower row m_R; i versus α, with curves for ω = 0.0, 0.1, 0.2; panel title "t = 80; |J| = 40M; Sim; m_L × m_R = 0.1".]

Figure 4: Mutual information versus α for the local (upper) and random (lower) initial overlap m_0 = 0.1, with ω = 0.0, 0.1, 0.2.
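Both types of initial overlap used in this section can be generated directly from a stored pattern; a minimal sketch (the helper names are ours; both constructions give an average overlap of m_0):

```python
import random

def initial_mR(xi, m0, rng):
    """Random overlap: each neuron copies its pattern bit xi_i with
    probability (1 + m0)/2, so the expected overlap is m0."""
    return [x if rng.random() < (1.0 + m0) / 2.0 else -x for x in xi]

def initial_mL(xi, m0, rng):
    """Local overlap: the first N*m0 neurons copy the pattern; the rest
    are random +/-1, so the expected overlap is again m0."""
    cut = int(len(xi) * m0)
    return [xi[i] if i < cut else rng.choice((-1, 1)) for i in range(len(xi))]

def overlap(sigma, xi):
    return sum(s * x for s, x in zip(sigma, xi)) / len(xi)

rng = random.Random(0)
xi = [rng.choice((-1, 1)) for _ in range(10000)]   # one stored pattern
mR = overlap(initial_mR(xi, 0.1, rng), xi)
mL = overlap(initial_mL(xi, 0.1, rng), xi)
print(mR, mL)   # both close to 0.1
```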
fixed ω, as the value of γ that yields the most information. Starting with m_0 = 0.1 yields optima i(γ_opt, ω) at moderate dilutions; for instance, with ω = 0.1, γ_opt ∼ 10^-2. Next, we compare m_R with another type of initial distribution. The neurons start with local correlations: σ^0_i = ξ_i for i = 1, . . . , N m_0, and random σ^0_i = ±1 otherwise. We call this the m_L initial overlap. The results are shown in the top panels of Figure 4. The first observation is that the maximum information i_max(γ; ω) shows the same nonmonotonic behavior with γ in the region of ω considered; there is still a moderate γ_opt for which the information i(γ_opt) is optimized. However, the i_max for the m_L overlap are slightly smaller than for the m_R overlap, and the optima shift toward the less diluted topologies (the left panels). For instance, with ω = 0.1, now γ_opt ∼ 10^-1. The comparison between the upper (m_L) and lower (m_R) panels of Figure 4 shows that the initial condition m_R allows for easier retrieval for any ω. Local topologies (ω = 0) are especially sensitive to the type of initial overlap, losing their retrieval abilities for m_L if the connectivity is γ ≤ 10^-2. This sensitivity to the initial conditions can be understood in terms of the basins of attraction. Random topologies have very deep attractors, especially if the network is diluted enough, while regular topologies almost lose their
[Figure 5 here: phase diagram in the (γ, ω) plane; regions labeled R and R,L; criterion i_max > 0.05, m_0 = 0.2.]

Figure 5: Diagram (ω × γ) with the phases R and L, for initial overlap m_0 = 0.2.
retrieval abilities with dilution. However, since the basins become rougher with dilution, the network takes longer to reach the attractor and can be trapped in metastable states. Hence, the competition between depth and roughness is won by the more robust MD networks. The retrieval capability of the network starting from condition m_R or m_L is plotted in Figure 5. We denote by R the phase where retrieval reaches the information i_max ≥ 0.05, starting from m_0 = 0.2 with m_R. The phase L is defined in the same way, but starting with m_L. Efficient retrieval is not possible with the m_L condition for very connected or local topologies. On the other hand, only local diluted topologies exclude retrieval with the m_R initial overlap, shown below the dashed curve in Figure 5. Next we study the role of the symmetry constraints in the synaptic weights. By an asymmetric topology, we mean that no condition C_ij = C_ji is assumed, and the local connections are chosen in only one direction along the network ring. The maximum of the information against the dilution, i_max(γ), is plotted in Figure 6 for several degrees of randomness 0 ≤ ω ≤ 1. The initial overlap is the random condition m_R, with m_0 = 0.1. The left panel shows the results for the symmetric network and the right panel those for the asymmetric network. We observe that the nonmonotonic behavior of i_max(γ) still holds for the asymmetric network, but only for ω ≤ 0.2. The asymmetric topology has more robust attractors than the symmetric topology under strong dilution. The symmetric topology keeps its optimum at moderate dilution, 0 < γ < 1, up to ω ∼ 0.4. Both types of symmetry have the same asymptotic behavior under extreme dilution for random topologies, saturating the information at i_max(γ → 0) ∼ 0.22. Nevertheless, the
[Figure 6 here: i_max [bits] versus γ; N × K = 40M, simulation, m_0^R = 0.1; left panel symmetric, right panel asymmetric; curves for ω = 0.0, 0.1, 0.2, 0.4, 1.0.]

Figure 6: Maximal information i_max = i(α_max) versus γ, with initial condition m_0^R = 0.1. (Left) Symmetric network. (Right) Asymmetric network.
nonmonotonic behavior is qualitatively similar in spite of the rather different topologies.

4.2 Clustering and Mean Path Length. We describe here the topological features of the network as a function of its parameters: the clustering coefficient c (the fraction of a node's neighbors that are neighbors of each other, averaged over nodes) and the mean path length between neurons l (the average distance between nodes, measured as the minimal path length between them). When γ is large, the network has large c, c ∼ 1, and small l, l = O(1), whatever ω is used. When γ is small, then if ω ∼ 0, the network is clustered and has long paths: c = O(γ), l ∼ N/K. If ω ∼ 1, the network becomes random: c ≪ 1 and l ∼ ln N. However, if the randomness is about ω ∼ 0.1, then c = O(γ) but l ∼ ln N, and the network behaves as a small world: clustered but with short paths. The dependence of c, l, and i on the randomness ω is plotted in Figure 7. In all panels, for connectivity γ = 10^-1, 10^-2, 10^-3, we see a decrease of c and l and an increase of i with ω. For these ranges of ω, the path length l has already
[Figure 7 here: three columns for γ = 10^-1, 10^-2, 10^-3; top row l/10 and c/c_0, bottom row i, versus ω; ω = K_r/K, γ = K/N, N × K = 10 × 10^6, T = 0, m_0 = 0.1.]

Figure 7: The maximal information i(α_max) (bottom), clustering coefficient c, and mean path length l (top) versus ω, for γ = 10^-1, 10^-2, 10^-3 (from left to right). Simulations with N · K = 10M, m_0 = 0.1.
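The clustering coefficient c and the mean path length l can be measured directly on sampled graphs with a plain breadth-first search. A minimal sketch; the rewiring scheme below is a Watts-Strogatz-style stand-in for the paper's exact topology, and the sizes are illustrative:

```python
import random
from collections import deque

def sw_graph(n, k, omega, rng):
    """Ring of n nodes, k/2 neighbours per side; each local edge is
    rewired to a random target with probability omega."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            if rng.random() < omega:          # rewire this edge
                j = rng.randrange(n)
                while j == i or j in adj[i]:
                    j = rng.randrange(n)
            adj[i].add(j); adj[j].add(i)
    return adj

def clustering(adj):
    """Average fraction of a node's neighbour pairs that are linked."""
    acc = 0.0
    for i, nb in adj.items():
        nb = list(nb)
        if len(nb) < 2:
            continue
        links = sum(1 for a in range(len(nb)) for b in range(a + 1, len(nb))
                    if nb[b] in adj[nb[a]])
        acc += 2.0 * links / (len(nb) * (len(nb) - 1))
    return acc / len(adj)

def mean_path(adj):
    """Average shortest-path length over all reachable pairs (BFS)."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

rng = random.Random(1)
regular = sw_graph(200, 8, 0.0, rng)
small_w = sw_graph(200, 8, 0.1, rng)
c0, l0 = clustering(regular), mean_path(regular)
c1, l1 = clustering(small_w), mean_path(small_w)
print(c0, l0)   # omega = 0:   c ~ 0.64, l ~ 13 (clustered, long paths)
print(c1, l1)   # omega = 0.1: c still sizable, l much shorter (small world)
```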
decreased to small values, and the networks have entered the SW region. However, in the right panel, there is still a slowdown of l. The clustering c decreases quickly around ω = 0.2, after which the network is random-like. The comparison of the behavior of c (top panels) with i (bottom panels) is in agreement with Kim (2004), which allows us to conclude that the performance of the Hopfield model on various networks can be enhanced by decreasing the clustering coefficient. The region 0.001 ≤ ω ≤ 0.2 is the SW regime for the network we study. Looking at the bottom panels, we see that at the end of the SW regime there is a fast increase of i between 0.05 ≤ ω ≤ 0.20. We conjecture that beyond the SW region, a further increase of the randomness ω is not worth its wiring cost for the little extra information i it gains.

5 Conclusions

We have discussed the information capacity of an attractor neural network as a function of its topology. We calculated the mutual information for an Amari-Hopfield model with Hebbian learning, varying the connectivity (γ) and randomness (ω) parameters, and obtained the maximum with respect to α,
i_max(γ, ω) ≡ i(α_max; γ, ω). The information i_max always increases with ω, but for a fixed ω, an optimal topology γ_opt, in the sense of the information, i_opt ≡ i_max(γ_opt, ω), can be found. We studied the stability and attractor properties. From the point of view of stability, the calculations show that the optimal topology regarding storage is the random extremely diluted (RED) network. Indeed, if no pattern completion is required, the patterns can be stored statically, which is achieved with a vanishing connectivity. As far as the dynamics is concerned (starting from a noisy pattern, m_0 < 1), however, the latter is no longer true: we found that there is a moderate γ_opt that optimizes the retrieval performance whenever 0 ≤ ω < 0.3 (see Figure 6). Moreover, the memory behavior of local diluted networks is damaged even more if they start with a local overlap than if the initial overlap is random. This can be understood in terms of the shape of the attractors. The ED network takes much longer to retrieve than more connected networks do, so the neurons can eventually be trapped in spurious states with vanishing information. We found a relation between the fast increase of information with ω and the SW region of the topology. This implies that it is worth increasing the wiring length only up to the end of the SW zone. In both natural and technological approaches to neural devices, dynamics is an essential issue for information processing, so an optimized topology matters for any practical purpose, even if no attention is paid to wiring or other costs of the random links (Morelli, Abramson, & Kuperman, 2004). We conjecture that the reason for the intermediate optimal γ_opt is a competition between the broadness (larger storage capacity) and the roughness (slower retrieval speed) of the attraction basins.
We believe that the maximization of the information with respect to the topology could be a biological criterion (where nonequilibrium phenomena are relevant) for building real neural networks. We expect that the same dependence can be found for more structured networks and learning rules. More complex initial conditions may also play a role in the retrieval and deserve dedicated study.

Appendix: The Probability of Cycles

The small-world (SW) topology can be defined by the probability of having nodes i and j connected,

P(C_{ij} = 1) \equiv \omega\gamma + (1 - \omega)\, \Theta\big(\gamma N/2 - \|i - j\|\big), \quad \|i - j\| \equiv \min\{(i - j + N) \bmod N,\ (j - i + N) \bmod N\},   (A.1)
where ω is the SW parameter, defined as the rewiring probability in Watts and Strogatz (1998), and Θ is the unit step function. One can estimate a_k by generating random SW graphs of a given size, performing k + 1 random walk steps starting from some arbitrarily chosen origin node, and then counting the trail as successful if there is a link from
that node back to the origin and as failing otherwise. Using this Monte Carlo procedure, we have a Bernoulli process for estimating a_k. However, this estimate suffers finite-size effects that are difficult to quantify, especially for small γ. Therefore, one can first take the limit N → ∞ in equation A.1 and then estimate the probabilities a_k. This continuous topology estimate does not suffer finite-size effects, apart from a larger error in estimating a_k when γ is small. However, the topology in the continuous case is not exactly the same as in the discrete case (the probability of repeating the movement along any given edge of the graph is exactly zero, and the symmetry is only loosely defined in the continuous approximation), so the question of whether the discrete estimate with N → ∞ and the continuous estimate coincide is legitimate. We have checked the discrepancies with simulations, and when γN > 30, there is a very good match between the continuous and the discrete case. Similar questions are discussed by Newman et al. (2000). The conclusion is that the coefficients a_k, k > 0, can be calculated using a modification of the Monte Carlo method originally proposed by Canning and Gardner (1988). Let k be the number of iterations, and consider a particle that at each iteration moves on the network connectivity graph with probabilities defined by the topology of the SW graph (see Figure 8). If the particle at step k is at node j(k), we say that the coordinate of the particle is x(k) ≡ j(k)/N. When N → ∞, x(k) is a continuous variable. If y is a random variable uniformly distributed between −1/2 and 1/2, then at step k + 1 the coordinate of the particle changes according to the rule:
x(k+1) = [x(k) + \gamma y] \bmod 1, with probability 1 − ω,
x(k+1) = [x(k) + y] \bmod 1, with probability ω.   (A.2)
If the particle starts at moment k = 0 from x(0) = 0 and is located in position x(k) at step k, then the probability of having a loop of length k + 1 according to the SW topology is
a_{k-1} = (1 - \omega)\, \theta\big(\gamma/2 - |x(k)|\big) + \omega\gamma,   (A.3)
where it has been assumed that x(k) ∈ [−1/2, 1/2]. From this Bernoulli process one can estimate a_k. (See Koroutchev, Dominguez, Serrano, & Rodriguez, 2004, for an estimate of the precision.) As an alternative to the Bernoulli process, one can derive an expression for a_k, k > 0. Suppose the graph has unit length; then the
Figure 8: Calculation of a_k by a random walk in the SW topology. The network with its connections is represented by the gray circle. The height of the surrounding curve represents the probability of a connection to the neuron at position 0. Movements to distances shorter (longer) than γ/2 are represented by bold (thin) lines. After five steps, the probability of the existence of a cycle of length 6 (a_4) is given by the probability of a connection between x_5 and x_0 (dotted line).
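The Bernoulli process of equations A.2 and A.3 is straightforward to implement; a minimal sketch (the function name and parameter values are our own):

```python
import random

def estimate_ak(k, gamma, omega, trials=200000, rng=random):
    """Monte Carlo estimate of a_k (probability of a cycle of length
    k + 2) using the continuous random walk of equations A.2 and A.3."""
    hits = 0.0
    for _ in range(trials):
        x = 0.0
        for _ in range(k + 1):                  # k + 1 steps from the origin
            y = rng.random() - 0.5              # y uniform in [-1/2, 1/2]
            step = gamma * y if rng.random() < 1.0 - omega else y
            x = (x + step) % 1.0
        if x > 0.5:                             # map x back to [-1/2, 1/2]
            x -= 1.0
        # Equation A.3: probability that the closing link exists
        hits += (1.0 - omega) * (abs(x) <= gamma / 2.0) + omega * gamma
    return hits / trials

rng = random.Random(2)
# For omega = 1 every link is random, so the closing link exists with
# probability gamma and the estimate equals gamma.
a1 = estimate_ak(1, gamma=0.3, omega=1.0, rng=rng)
print(a1)   # 0.3 up to rounding
```

As a second sanity check, for ω = 0 and small γ the estimate of a_1 approaches 3/4, the clustering of a ring lattice in the continuous limit.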
probability of having a loop is

a_k = \gamma \sum_{R=-k}^{k} \int \prod_{i=1}^{k+2} dx_i \, p(x_i) \, \delta\Big(\sum_i x_i - R\Big),   (A.4)
where x_i represents the movement in step i, p(x) is the probability of a link at displacement x, x ∈ [−1/2, 1/2), and R is the number of revolutions performed by the loop. Although this integral can be solved analytically for small k, the calculation of a_k using the Monte Carlo method seems more efficient. The calculation of a_k = a_k(C) closes the system of equations 3.9 to 3.11, making it possible to solve them with respect to (G, m, r) for all values of (ω, γ, α).

Acknowledgments

This work was supported by grants TIC01-572, TIN2004-07676-C01-01, BFI2003-07276, and TIN2004-04363-C03-03 from MCyT, Spain. D.D. acknowledges a Ramón y Cajal grant from MCyT.

References

Albert, R., & Barabasi, A. (2002). Statistical mechanics of complex networks. Rev. Mod. Phys., 74, 47–97.
Amari, S. I. (1972). Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Trans. Comp., C-21, 1197–1206.
Amit, D., Gutfreund, H., & Sompolinsky, H. (1985). Statistical mechanics of neural networks near saturation. Annals of Physics, 173, 30–67.
Bolle, D., Dominguez, D., & Amari, S. (2000). Mutual information of sparsely coded associative memory with self-control and ternary neurons. Neural Networks, 13, 445–462.
Bolle, D., Jongen, G., & Shim, G. (1999). Parallel dynamics of extremely diluted symmetric q-Ising neural networks. Journal of Statistical Physics, 96(3–4), 861–882.
Canning, A., & Gardner, E. (1988). Partially connected models of neural networks. J. Phys. A, 21, 3275–3284.
Crisanti, A., Amit, D., & Gutfreund, H. (1986). Saturation level of the Hopfield model for neural network. Europhysics Lett., 2, 337–341.
Dominguez, D., & Bolle, D. (1998). Self-control in sparsely coded networks. Phys. Rev. Lett., 80, 2961–2964.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A., 79, 2554–2558.
Kim, B. (2004). Performance of networks of artificial neurons: The role of clustering. Phys. Rev. E, 69, 045101.
Koroutchev, K., Dominguez, D., Serrano, E., & Rodriguez, F. (2004). Mutual information and topology 2: Symmetric network. In F. Yin, J. Wang, & C. Guo (Eds.), Advances in neural networks (pp. 20–25). Berlin: Springer.
Koroutchev, K., & Korutcheva, E. (2005). Conditions for the emergence of spatial asymmetric states in attractor neural network. CEJP, 3, 409–419.
Masuda, N., & Aihara, K. (2004). Global and local synchrony of coupled neurons in small-world networks. Biol. Cybernetics, 90, 302–309.
McGraw, P. N., & Menzinger, M. (2003). Topology and computational performance of attractor neural networks. Physical Review E, 68, 047102.
Morelli, L. G., Abramson, G., & Kuperman, M. (2004). Associative memory on a small-world neural network.
European Physical Journal B, 38, 495–500.
Newman, M. E. J., Moore, C., & Watts, D. J. (2000). Mean-field solution of the small-world network model. Physical Review Letters, 84, 3201–3204.
Newman, M. E. J., & Watts, D. J. (1999). Scaling and percolation in the small-world network model. Phys. Rev. E, 60, 7332–7342.
Okada, M. (1996). Notions of associative memory and sparse coding. Neural Networks, 9, 1429–1458.
Parisi, G. (1988). Statistical field theory. Redwood City, CA: Addison-Wesley.
Rolls, E., & Treves, A. (2004). Neural networks and brain function. New York: Oxford University Press.
Torres, J., Muñoz, M., Marro, J., & Garrido, P. (2004). Influence of topology on the performance of a neural network. Neurocomputing, 58–60, 229–234.
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of "small-world" networks. Nature, 393, 440–442.
Received February 3, 2006; accepted June 22, 2006.
LETTER
Communicated by Anders Lansner
Connection Topology Selection in Central Pattern Generators by Maximizing the Gain of Information

Gregory R. Stiesberg
[email protected] Institute for Nonlinear Science, University of California San Diego, La Jolla 92093–0402, U.S.A.
Marcelo Bussotti Reyes
[email protected] Instituto de Física, Universidade de São Paulo, São Paulo 05508–090, Brazil
Pablo Varona
[email protected] Institute for Nonlinear Science, University of California San Diego, La Jolla 92093–0402, U.S.A., and Departamento de Ingeniería Informática, Universidad Autónoma de Madrid, Madrid, Spain
Reynaldo D. Pinto
[email protected] Instituto de Física, Universidade de São Paulo, São Paulo 05508–090, Brazil, and Institute for Nonlinear Science, University of California San Diego, La Jolla 92093–0402, U.S.A.
Ramón Huerta
[email protected] Institute for Nonlinear Science, University of California San Diego, La Jolla 92093–0402, U.S.A.
A study of a general central pattern generator (CPG) is carried out by means of a measure of the gain of information between the number of available topology configurations and the output rhythmic activity. The neurons of the CPG are chaotic Hindmarsh-Rose models that cooperate dynamically to generate either chaotic or regular spatiotemporal patterns. These model neurons are implemented by computer simulations and electronic circuits. Out of a random pool of input configurations, a small subset of them maximizes the gain of information. Two important characteristics of this subset are emphasized: (1) the most regular output activities are chosen, and (2) none of the selected input configurations are networks with open topology. These two principles are observed in living
Neural Computation 19, 974–993 (2007)
© 2007 Massachusetts Institute of Technology
Topology Selection by Maximizing the Gain of Information
975
CPGs as well as in model CPGs that are the most efficient in controlling mechanical tasks, and they are evidence that information-theoretical analysis can be an invaluable tool in the search for general properties of CPGs.

1 Introduction

A central pattern generator is a group of connected neurons that controls the activity of a mechanical device by activating muscle groups. CPGs are responsible for activities like chewing, walking, and swimming (Selverston, 1999a, 1999b; Selverston, Elson, Rabinovich, Huerta, & Abarbanel, 1998; Arshavsky, Deliagina, & Orlovsky, 1997). The intrinsic properties of every neuron in the CPG, together with the connection topology between them, determine the phase relationships in the electrical activity. A group of neurons can generate many different spatiotemporal patterns of activity that can control different types of mechanical devices. Only a subset of the possible activity solutions of the CPG will work for a given mechanical device (see, e.g., Huerta, Sánchez-Montañés, Corbacho, & Sigüenza, 2000). Among the large number of possible spatiotemporal patterns is a subset that efficiently performs the required mechanical functions of the animal. In this work, we intend to determine some generic constraints of a simplified working CPG. Our motivation is previous work (Huerta, Varona, Rabinovich, & Abarbanel, 2001), where it was shown that the best efficiency of a mechanical pump based on the pyloric chamber and controlled by the pyloric CPG of the lobster is achieved by combinations of synaptic connections corresponding to CPGs that oscillate in a regular fashion, all of which have nonopen connection topologies. Here, a nonopen topology is defined as a configuration in which every neuron receives synaptic input from at least one other neuron in the CPG (see, e.g., Selverston & Moulins, 1987; Arshavsky, Beloozerova, Orlovsky, Panchin, & Pavlova, 1985; Getting, 1989).
In this letter, we test the use of Shannon and Weaver information (Shannon & Weaver, 1949) as a global measure of the efficiency in the CPG activity. One of the best-known CPGs is the pyloric CPG of the lobster stomatogastric ganglion (Selverston & Moulins, 1987), where the rhythm is achieved by a combination of the intrinsic properties of the neurons connected by many inhibitory and a few electrical synapses. Most neurons isolated from this CPG have been shown to behave in a chaotic way (Rabinovich, Abarbanel, Huerta, Elson, & Selverston, 1997; Selverston et al., 2000), and many ideas for the role of chaos in the neurons of the CPG have been proposed (Abarbanel et al., 1996; Rabinovich et al., 1997). Here we study the behavior of simple CPGs made of either three or four chaotic neurons connected by inhibitory synapses. For simulating the chaotic neuron elements, we use one of the simplest mathematical models that is able
976
G. Stiesberg et al.
to produce chaotic behavior: the Hindmarsh-Rose (HR) model of a bursting neuron (Hindmarsh & Rose, 1984; Pinto et al., 2000); the synapses are simulated by a first-order kinetic model of neurotransmitter release (Destexhe, Mainen, & Sejnowski, 1994; Sharp, Skinner, & Marder, 1996; Pinto et al., 2001). We also include results from the analysis of CPGs made of three or four electronic neurons (Pinto et al., 2000) based on the HR model (Hindmarsh & Rose, 1984). Our motivation for using analog electronic neurons (ENs) is to include in our experiments more of the elements that a real living CPG is exposed to. Unlike mathematical models, ENs are made of real components that experience fluctuations in their values (because of changes in temperature, for example), in the same way neurons experience fluctuations in their intrinsic properties due to changes in the temperature or in the ionic concentrations of the saline. ENs are also subject to electronic noise that can be compared to the noise that living neurons receive when embedded in living tissue. Experiments with ENs produce results that have to be robust to small variations, such as those that can occur in a living CPG, and they can be used to validate results obtained in ideal systems created by computer simulations. Information theory has been used to study neural systems (e.g., Deco & Schürmann, 1997; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998). However, our approach is different because we consider the input to the CPG to be a set of synaptic and conductance configurations. We disregard the effects of neuromodulation on the endogenous properties of the neurons as proposed by Stemmler and Koch (1999). Basically, we want to address the generation of information by the CPG due to a given set of synaptic connections.
In order to be able to evaluate the mutual information between the input and the output, three definitions should be given: what the input is, what the output is, and what information is being conveyed. First, let us state the definition of the mutual information, MI(I; O) = H(O) − H(O/I), where I and O are the input and the output, respectively, and H represents the entropy, a measure of uncertainty or variability. Roughly, the entropy of a system is a measure of how much information we need in order to predict the system's behavior. A low entropy means that the system is very regular and its behavior can easily be predicted. Conversely, a high entropy means that the system behaves in an irregular way, which makes prediction difficult. Our goal is to maximize the mutual information; in the optimal case, this means that the uncertainty of the output, H(O), is maximal (meaning that the system has a wide range of possible responses), and the uncertainty of the output given the input, H(O/I), is minimal, meaning 0. Stated explicitly, we want the response of the system to be as rich as
Figure 1: What are the input and the output? The concentration of the neuromodulators determines the type of connectivity between the neurons (in real CPGs, they also modify the dynamics of every neuron). The connection topology established by the neuromodulators selects a specific phase relationship of bursting electrical activity between the members of the CPG.
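The definition MI(I; O) = H(O) − H(O/I) can be illustrated on plain (input, output) samples before applying it to burst events; a minimal sketch with toy data (all names and the example samples are ours):

```python
import math
from collections import Counter

def entropy(p):
    """Shannon entropy in bits of a {outcome: probability} dictionary."""
    return -sum(q * math.log2(q) for q in p.values() if q > 0)

def mutual_information(pairs):
    """MI(I; O) = H(O) - H(O/I), estimated from (input, output) samples."""
    n = len(pairs)
    p_o = Counter(o for _, o in pairs)
    p_i = Counter(i for i, _ in pairs)
    H_O = entropy({o: c / n for o, c in p_o.items()})
    H_O_given_I = 0.0
    for i, ci in p_i.items():
        cond = Counter(o for ii, o in pairs if ii == i)
        H_O_given_I += (ci / n) * entropy({o: c / ci for o, c in cond.items()})
    return H_O - H_O_given_I

# Deterministic input-output map: H(O/I) = 0, so MI = H(O) = 1 bit here.
samples = [("g1", "rhythmA"), ("g2", "rhythmB")] * 50
mi = mutual_information(samples)
print(mi)   # 1.0: the input fully determines the output
```

If instead every input produced every output equally often, MI would drop to 0, which is the situation the topology selection criterion is designed to avoid.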
possible, and at the same time we want the response of the system to a specific input to be fully determined by that input. We still need to be more precise about what we call "the input." In most CPGs, the rhythm changes due to neuromodulatory inputs that modify both the synaptic connections between neurons (the connection topology) and the kinetics of the ionic channels in the neurons' membranes (Harris-Warrick et al., 1998; Brodfuehrer, Debski, O'Gara, & Friesen, 1995; Katz, 1998). To reduce the complexity of the problem, in this letter we consider only variations in the synaptic connections. As shown schematically in Figure 1, instead of using the neuromodulator concentrations as the input configuration, we choose the connection topology, with the assumption that there is a function relating the neuromodulator concentration configuration and the connection topology configuration. Therefore, the input we consider is a specific synaptic configuration, which is given by I = {g_ij; 0 ≤ g_ij < g_max, i ≠ j; i, j = 1 . . . N}, where g_max is the maximum synaptic conductance and N is the number of neurons in the CPG. The output is based on the observation that the electrical activity produced by a neuron is mainly transmitted to the muscles when the burst is on. The spikes are carried through the axons to the muscles, which integrate them to produce an action. Also, there is strong evidence that burst duration can encode information in very different kinds of networks (Kepecs & Lisman, 2003; Lisman, 1997). Thus, for our purposes, we will disregard the frequency of the spike firing, considering that the information is basically carried by the burst start and end times. A random set of input configurations is used to integrate the CPG so that the conditional distributions and entropies can be calculated. Then the
978
G. Stiesberg et al.
estimation of the maximum of the mutual information leads us to a subset of configurations in I. In the next few sections, with an approach independent of the one used by Huerta (Huerta et al., 2001), we show that a subset consisting of 15% of the input configurations is sufficient to obtain the maximum of the mutual information, that all of these configurations lead to regular rhythmic activity, and that all of them are nonopen topology configurations. These last properties are widely observed in living CPGs.

2 Mutual Information Estimation

Let us define the variable E(R, t), which can assume the Boolean values 0 (no event) or 1 (presence of an event) at time t. The value of R represents the type of event and ranges from 1 to 2N, where N is the number of neurons in the CPG. Initially, all E(R, t) are set to 0. When neuron i starts a burst at time t*, we set E(R = i, t = t*) = 1, and when neuron i ends a burst at time t*, we set E(R = N + i, t = t*) = 1. To calculate the joint probability, we integrate the model CPG and obtain E(R, t) for a given input: a specific synaptic configuration G_j ∈ I, with j = 1, . . . , M, where M is the cardinality of the set I. Setting t = 0 when neuron 1 (used as a time reference) started its last burst, we define p(R, t|G_j), or p_{G_j}(R, t), as the conditional probability of observing an event of type R at time t for a given input G_j. The conditional probabilities are calculated as follows. We create a counter array named C_{G_j}(R, t), where R has the meaning explained previously and t runs from 0 to a sufficiently large value T. Starting from t = 0, we collect the events in E(R, t) for increasing t:

1. Neuron number 1 is set as the time reference for the rest of the neurons. Every time E(R = 1, t = τ) = 1, the reference time τ is taken.
2. When an event of type i is found at time t, the counter C_{G_j}(R = i, t − τ) is increased by 1.
3. At the end, C_{G_j}(R, t) is normalized to 1 to yield p_{G_j}(R, t).

Examples of some typical p_{G_j}(R, t) obtained for two different input configurations G_j (corresponding to qualitatively different bursting CPGs made of four neurons) are shown in Figure 2. Note that in both cases, the time probabilities for all events (R = 2 to 8) are superimposed. For the regular bursting CPG, p_{G_j}(R, t) presents well-defined peaks that correspond to the times when each event is expected, while for the irregular bursting CPG, p_{G_j}(R, t) is spread over a large extent of time. Once we have calculated the conditional probabilities, we determine the values of the conditional entropies (defined as the variability of the output
Topology Selection by Maximizing the Gain of Information
979
Figure 2: Examples of some typical conditional probabilities obtained in experiments with four analog electronic neuron CPGs. (Top) p_{G_{251}}(R, t) for an irregular bursting CPG. (Bottom) p_{G_{517}}(R, t) for a regular bursting CPG. In each plot, the probabilities of the different events (R = 2–8) were superimposed.
for a given G_j) as follows,

H_{G_j} = - \sum_{t=0}^{T} \sum_{R=1}^{2N} p_{G_j}(R, t) \log_2 p_{G_j}(R, t),
where the joint probability is p(R, t, G_j) = p_{G_j}(R, t) p(G_j), and p(G_j) is the probability of the configuration G_j in the set I. Since during the simulations or experiments M random synaptic configurations G_j are chosen, p(G_j) = 1/M. The marginal probability is

p(R, t) = \sum_{j=1}^{M} p_{G_j}(R, t) p(G_j),
which is used to calculate H(O) as
H(O) = - \sum_{t=0}^{T} \sum_{R=1}^{2N} p(R, t) \log_2 p(R, t),
and the conditional entropy

H(O|I) = \sum_{j=1}^{M} p(G_j) H_{G_j}.
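The estimation procedure of this section can be sketched in a few lines. The following is an illustrative Python sketch (not the code used in the study; the function names and data layout are our own assumptions): it builds the counter array C_{G_j}(R, t) from burst start and end times, normalizes it to obtain p_{G_j}(R, t), and evaluates H_{G_j}, H(O), and MI(I; O) = H(O) − H(O|I).

```python
import numpy as np

def conditional_probability(burst_starts, burst_ends, N, T):
    """Estimate p_Gj(R, t) from burst start/end times (in sampled time bins).

    burst_starts, burst_ends: lists of arrays, one per neuron, of event times.
    Event types are 0-based here: R = i for a start of neuron i, R = N + i
    for an end. Neuron 0's burst starts serve as the reference times tau.
    """
    C = np.zeros((2 * N, T + 1))
    refs = np.sort(burst_starts[0])            # reference times tau (neuron 1)
    for i in range(N):
        for R, times in ((i, burst_starts[i]), (N + i, burst_ends[i])):
            for t in times:
                # latest reference time tau <= t
                k = np.searchsorted(refs, t, side="right") - 1
                if k >= 0 and t - refs[k] <= T:
                    C[R, int(t - refs[k])] += 1
    total = C.sum()
    return C / total if total > 0 else C       # normalized to 1 -> p_Gj(R, t)

def entropy(p):
    """H = -sum p log2 p over all (R, t) bins, ignoring empty bins."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(p_list, x):
    """MI(I;O) = H(O) - H(O|I) for conditional distributions p_list and
    input probabilities x_j = p(G_j)."""
    p_marg = sum(xj * pj for xj, pj in zip(x, p_list))    # p(R, t)
    H_O = entropy(p_marg)
    H_OI = sum(xj * entropy(pj) for xj, pj in zip(x, p_list))
    return H_O - H_OI
```

With M configurations and x_j = p(G_j) = 1/M, `mutual_information` reproduces the quantity that is maximized in the next section.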
3 The Problem: Maximizing the Mutual Information

We have explained all the elements needed to calculate the mutual information between the input and the output and are therefore ready to restate the question that we want to address in this work: Which subset of input configurations belonging to I maximizes the mutual information? To be more specific, what are the values p(G_j) such that MI(I; O) is maximized? In order to maximize the mutual information as a function of p(G_j) (we rename p(G_j) ≡ x_j and H_{G_j} ≡ h_j for convenience), we calculate the gradient of MI(I; O) in the x_j space under the restrictions \sum_j x_j = 1 and x_j ≥ 0. The approach followed to calculate the gradient consists of reducing the search space from M to M − 1 dimensions by substituting x_M = 1 − \sum_{j=1}^{M-1} x_j into the MI(I; O) formula. The gradient expression used is

∂MI(I; O)/∂x_µ = h_M − h_µ + \sum_{R=1}^{2N} \sum_{t=0}^{T} [p_{x_M}(R, t) − p_{x_µ}(R, t)] \log_2 p(R, t),
with µ = 1, . . . , M, h_M = \sum_{i≠µ} h_i, and p_{x_M}(R, t) = \sum_{i≠µ} p_{x_i}(R, t). This gradient helps to maximize

ΔMI(I; O) = \sum_{µ=1}^{M−1} [∂MI(I; O)/∂x_µ] Δx_µ.
It is sufficient to change the values x_µ by using

Δx_µ = α ∂MI(I; O)/∂x_µ,

with α > 0. Another restriction that must be imposed is that if the change of x_µ would result in x_µ < 0 or x_µ > 1, the modification of the values does not take place. The idea of using the gradient is to find the direction in x_µ that corresponds to larger values of the mutual information; we then change x_µ in that direction. The process is repeated several times, for all x_µ, until the gradient approaches zero, close to a maximum. At this point we have the set of x_µ that maximizes the mutual information.
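The update rule above can be sketched as a projected gradient ascent. In this illustrative Python sketch (our own naming; for simplicity, h_M and p_{x_M} are taken to be the entropy and conditional distribution of the eliminated configuration M, and a rejected out-of-bounds update simply leaves x unchanged):

```python
import numpy as np

def _entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mi_gradient_ascent(p_list, alpha=0.05, tol=1e-4, max_iter=5000):
    """Projected gradient ascent on MI(I;O) over x_j = p(G_j), with
    x_M = 1 - sum_{j<M} x_j enforcing the normalization constraint."""
    M = len(p_list)
    h = np.array([_entropy(p) for p in p_list])        # h_j = H_{G_j}
    x = np.full(M, 1.0 / M)                            # start from p(G_j) = 1/M
    for _ in range(max_iter):
        p_marg = sum(xj * pj for xj, pj in zip(x, p_list))       # p(R, t)
        safe = np.where(p_marg > 0, p_marg, 1.0)
        logp = np.log2(safe)
        # dMI/dx_mu = h_M - h_mu + sum [p_M - p_mu] log2 p(R, t)
        grad = np.array([h[-1] - h[mu] + np.sum((p_list[-1] - p_list[mu]) * logp)
                         for mu in range(M - 1)])
        if np.max(np.abs(grad)) < tol:                 # gradient near zero: done
            break
        x_new = x.copy()
        x_new[:M - 1] += alpha * grad                  # Delta x_mu = alpha * grad
        x_new[-1] = 1.0 - x_new[:M - 1].sum()
        if np.all((x_new >= 0.0) & (x_new <= 1.0)):
            x = x_new                                  # accept; else reject update
    return x
```

In the full procedure, this ascent is wrapped in the random restarts described below, since there are multiple local maxima.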
Since there are multiple local maxima, the search has to be combined with random restarts. The set of p(G_j) is randomly set at first (we replace the common value p(G_j) = 1/M from the simulations or experiments by choosing each p(G_j) from a uniform distribution between 0 and 1). The random set of p(G_j) is normalized to 1, and the process of finding new p(G_j)s that maximize the MI described above is repeated until an approximate maximum of the MI is achieved (changes in p(G_j) become smaller than 1%). The small subset of G_js in I found with maximum p(G_j) after maximizing the MI is stored, and new random p(G_j)s are set; the procedure is then repeated. After enough repetitions of this process (about the number of different configurations in I), we can define S as the total set of input configurations found to maximize the mutual information. We then define the subset s(H_{G^-}, H_{G^+}) ⊂ S of chosen configurations that generate a conditional entropy H_{G_j} between the values H_{G^-} and H_{G^+}. We also define the subset i(H_{G^-}, H_{G^+}) ⊂ I of input configurations that produce an entropy value in (H_{G^-}, H_{G^+}). Finally, we can define ζ as

ζ(H_{G^-}, H_{G^+}) = #s(H_{G^-}, H_{G^+}) / #i(H_{G^-}, H_{G^+}),

which represents the percentage of the total number of input configurations generating entropy in a given range that is found to maximize the MI.

4 The Model Neurons and Their Synaptic Connections

The single-neuron model is a modified version of the HR model, which is known to generate chaotic behavior (Hindmarsh & Rose, 1984). The additional coefficients in the equations were introduced for a more convenient implementation of the analog circuit described in the following sections. The model comprises three dynamical variables: a fast subset, x and y, and a slower z:

dx(t)/dt = 4 y(t) + 1.5 x²(t) − 0.25 x³(t) − 2 z(t) + 2 e + I_syn,   (4.1)
dy(t)/dt = b/2 − 0.625 x²(t) − y(t),   (4.2)
(1/µ) dz(t)/dt = −z(t) + 2 [x(t) + 2 d],   (4.3)

where e represents a constant injected (DC) current, and µ is the parameter that controls the time constant of the slow variable. The parameters are
chosen to set the isolated neurons in the chaotic spiking-bursting regime (e = 3.281, µ = 0.0021, b = 1, d = 1.6). I_syn represents the postsynaptic current evoked after the onset of a chemical graded synapse. In this letter, we consider only inhibitory synapses, which are the main type of connection present in the pyloric CPG of the California spiny lobster (Panulirus interruptus) and which dominate in most invertebrate CPGs. The synaptic current has been simulated in a fashion similar to that used in the dynamic clamp technique (Sharp et al., 1996), with minor modifications:

I_syn = −g · r(x_pre) · ϑ(x_post),   (4.4)

where g is the maximal synaptic conductance and r(x_pre) is the synaptic activation variable, determined from the presynaptic activity by

dr/dt = [r_∞(x_pre) − r]/τ_r,   (4.5)
r_∞(x_pre) = [1 + tanh((x_pre + 1.2)/0.9)]/2,   (4.6)

τ_r is the characteristic time constant of the synapse (τ_r ∼ 100), and ϑ(x_post) is a nonlinear function of the membrane potential of the postsynaptic neuron, x_post:
(4.7)
Parameters for ϑ(xpost ) (a = 2.807514 and b = 0.4) were chosen so that this function remained linear for the slow oscillations (subthreshold region for the fast spikes), that is, d x ∼1 dv a ,b g(x) ∼ 0.
(4.8) (4.9)
a ,b
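Equations 4.1 to 4.6 can be integrated as follows. This is an illustrative Python sketch with a fixed-step fourth-order Runge-Kutta integrator, as used in the text; the state layout and the coupling-matrix convention g[i, j] (synapse from neuron j onto neuron i) are our own assumptions:

```python
import numpy as np

# Parameters from the text (chaotic spiking-bursting regime)
E, MU, B, D = 3.281, 0.0021, 1.0, 1.6
TAU_R = 100.0

def derivs(state, g):
    """state: (3N + N*N,) = [x_1..x_N, y_1..y_N, z_1..z_N, r_11..r_NN].
    g: (N, N) matrix of maximal synaptic conductances (g[i, j]: j -> i)."""
    N = g.shape[0]
    x, y, z = state[:N], state[N:2*N], state[2*N:3*N]
    r = state[3*N:].reshape(N, N)
    r_inf = (1.0 + np.tanh((x + 1.2) / 0.9)) / 2.0      # eq. 4.6 (presynaptic)
    theta = 1.0 + np.tanh((x + 2.807514) / 0.4)          # eq. 4.7 (postsynaptic)
    I_syn = -np.sum(g * r, axis=1) * theta               # eq. 4.4, summed inputs
    dx = 4*y + 1.5*x**2 - 0.25*x**3 - 2*z + 2*E + I_syn  # eq. 4.1
    dy = B/2 - 0.625*x**2 - y                            # eq. 4.2
    dz = MU * (-z + 2*(x + 2*D))                         # eq. 4.3
    dr = (r_inf[np.newaxis, :] - r) / TAU_R              # eq. 4.5, r_ij driven by x_j
    return np.concatenate([dx, dy, dz, dr.ravel()])

def rk4_step(state, g, h=0.01):
    """One fourth-order Runge-Kutta step with fixed step 0.01, as in the text."""
    k1 = derivs(state, g)
    k2 = derivs(state + 0.5*h*k1, g)
    k3 = derivs(state + 0.5*h*k2, g)
    k4 = derivs(state + h*k3, g)
    return state + (h/6.0) * (k1 + 2*k2 + 2*k3 + k4)
```

Sampling x_i(t) every two time units from such a run yields the burst events from which p_{G_j}(R, t) is estimated.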
The rule used to set the connectivity pattern in the set I is that, with a probability of 80%, the connection g_{ij} is drawn uniformly from 0.0 to 1.0, and with a probability of 20%, it is set to 0. The connectivity rule is chosen in this way because there is no CPG in any animal in which all the neurons are connected to all the rest. After choosing the connectivity of the CPG, we perform the simulation by numerically integrating equations 4.1 to 4.3, plus those for all the active synapses, described by equation 4.5, using a fourth-order Runge-Kutta
Figure 3: Block diagram of an analog electronic neuron.
algorithm with a fixed step of 0.01 units of time. The bursts are sampled every two units of time to build the conditional probabilities p_{G_j}(R, t).

5 The Electronic Neuron

We also worked with analog electronic circuits developed to reproduce the observed membrane voltage oscillations of isolated biological neurons from the stomatogastric ganglion of the California spiny lobster. These electronic neurons (ENs) have been successfully used to replace some biological neurons in the stomatogastric ganglion, reestablishing the regular pattern of the oscillations of the lobster pyloric CPG (Szücs et al., 2000). The block diagram of the simple analog circuit we use to implement the three-dimensional dynamical system of the EN is shown in Figure 3. This circuit integrates equations 4.1 to 4.3. The circuit uses three integrators, one adder, one inverter, two analog multipliers, and one nonlinear amplifier. We used off-the-shelf general-purpose operational amplifiers (model TL082) to build the integrators, adders, inverters, and nonlinear amplifier, and we used Analog Devices Model AD633 chips as analog multipliers. More details about the analog implementation of the ENs, as well as a four-dimensional EN that is able to reproduce in even more detail the chaotic behavior of the isolated biological neurons, can be found in Pinto et al. (2000). The ENs were also set a priori in chaotic spiking-bursting regimes as described in section 4 but with different values of the control parameters
Figure 4: Dynamic clamp using the DynClamp4 program. (a) Between living biological neurons (BNs). (b) Between ENs. Hybrid circuits can be formed mixing both kinds of configurations of the channels (CH0-CH3).
(e ≈ 1.75, b = 0, µ ≈ 0.0022, d ≈ 0.63). The components used in the ENs have a 10% tolerance, and the parameter values indicated here are averages of the values set in the ENs. In the experiments, we usually fine-tune e, µ, and d around the indicated values to obtain similar chaotic spiking-bursting patterns among the ENs. There is also an integration time constant, introduced by means of a capacitor, that makes the bursts and spikes of the EN have approximately the same average duration as those seen in time series of the membrane potential of living neurons from the lobster CPG. This time constant relates the real time of the EN to the time units of the simulations: t_real = t/450.

5.1 Implementation of the Analog CPGs Using DynClamp4, a Dynamic Clamp Program. Our DynClamp4 dynamic clamp program (Pinto et al., 2001) runs on a Windows NT 4.0 platform on a Pentium III computer with an Axon Digidata 1200A ADC/DAC board. DynClamp4 is an implementation of the dynamic clamp protocol (Sharp et al., 1996) that uses an auxiliary analog demultiplexing circuit, which we also developed, to generate all the possible combinations of chemical and electrical synapses among up to four living or electronic neurons, as well as several Hodgkin-Huxley voltage-dependent conductances. DynClamp4 can be used to connect living neurons, electronic neurons, and hybrid circuits, as shown in Figure 4. The program digitizes the membrane potential of the neurons as it appears at the output of the microelectrode amplifiers; it then calculates the current to be injected into each neuron due to the combination of all synapses and conductances. The voltage signals proportional to these currents are then generated via digital-to-analog converters. This cycle is repeated with
a frequency of more than 5 kHz, so that the neurons are subjected to a continuous current signal, as compared to their intrinsic timescales. In this work, we used a modified version of the DynClamp4 software for the automatic implementation of the random synapses between the ENs. To facilitate this, we needed to tune the nonlinear amplifiers at the output of the ENs to reshape, rescale, and offset the membrane potentials x_i(t) in order to make them resemble the overall and relative amplitudes between slow oscillations and spikes in the signals we have from the biological neurons. The analog ENs were connected by means of DynClamp4, so we could repeat with these analog computers the digital computer simulations performed earlier. We used only the chemical synapses of the DynClamp4 program. The model implemented for these synapses is similar to the one described in section 4 for the computer simulations, but it is a closer approximation to the biological graded chemical synapses present in the lobster CPG. The synaptic current is described by the first-order kinetics (Destexhe et al., 1994; Sharp et al., 1996):

I_syn = g_syn S (V_syn − V_post),
dS/dt = (S_∞ − S) / [(1 − S_∞) τ_s],

where

S_∞(V_pre) = tanh[(V_pre − V_thres)/V_slope] if V_pre > V_thres, and 0 otherwise;
gsyn is the maximal synaptic conductance, S is the instantaneous synaptic activation, S∞ is the steady-state synaptic activation, Vsyn is the synaptic reversal potential, Vpre and Vpost are the presynaptic and postsynaptic voltages, respectively, τs is the time constant for the synaptic decay, Vthres is the synaptic threshold voltage, and Vslope is the synaptic slope voltage. The common parameters we used for all synapses were Vsyn = −80 mV (inhibitory synapses), τs = 100 ms, Vthres = −40 mV, and Vslope = 10 mV. For every trial of synaptic configuration, we chose at random 20% of the synapses to have gsyn = 0, and the other 80% of the synapses were chosen to have gsyn ranging uniformly from 0 to 500 nS. This is similar to what we did in the computer simulations. The output currents generated by the DynClamp4 are in nA, and in the living neurons the microelectrode amplifiers usually use a conversion factor of 50 to 100 mV/nA in their current input. For our ENs, a conversion factor of 200 mV/nA was appropriate to generate the voltage signals to be presented as inputs (Isyn ). For each synaptic configuration, the dynamic clamp cycle was started; then transients were allowed to dissipate for 60 s before acquiring data.
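One cycle of the synaptic-current computation described above can be sketched as follows. This is an illustrative Python sketch, not the DynClamp4 code; a 0.2 ms step stands in for the > 5 kHz cycle, and the small guard term in the denominator is our own addition to avoid division by zero as S_∞ → 1:

```python
import math

# Parameters common to all synapses (from the text)
V_SYN = -80.0    # mV, inhibitory reversal potential
TAU_S = 100.0    # ms, time constant for synaptic decay
V_THRES = -40.0  # mV, synaptic threshold voltage
V_SLOPE = 10.0   # mV, synaptic slope voltage

def s_inf(v_pre):
    """Steady-state activation S_inf(V_pre)."""
    if v_pre > V_THRES:
        return math.tanh((v_pre - V_THRES) / V_SLOPE)
    return 0.0

def clamp_cycle(v_pre, v_post, s, g_syn, dt=0.2):
    """One dynamic clamp cycle: update S by the first-order kinetics
    dS/dt = (S_inf - S) / ((1 - S_inf) tau_s), then compute I_syn.

    Voltages in mV, g_syn in uS, dt in ms; returns (I_syn in nA, new S)."""
    si = s_inf(v_pre)
    s += dt * (si - s) / ((1.0 - si) * TAU_S + 1e-9)  # guard against S_inf -> 1
    s = min(max(s, 0.0), 1.0)
    return g_syn * s * (V_SYN - v_post), s
```

With an inhibitory reversal potential below the membrane potential, the returned current is negative (hyperpolarizing), as expected for the pyloric-type synapses used here.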
Figure 5: (Left) Distribution of inputs with specific values of the conditional entropy for ε = 0.225 in a three-neuron CPG. (Right) Input distribution in a four-neuron CPG for ε = 0.58.
The bursts of each EN were sampled every 10 ms, and a burst was detected when the reshaped membrane potential crossed a threshold of −40 mV. The bursts were recorded for 300 s, and the conditional probabilities p_{G_j}(R, t) for each synaptic configuration were calculated in the same way as described in section 2.

6 Results of the Computer Simulations

We performed simulations of CPGs of three and four chaotic neurons, where the connections between them were chosen at random according to the rules and methods explained in section 4. For each CPG, a pool of 850 synaptic configurations was tested (#I = 850). For each configuration, 30 random initial conditions were used, and a transient time of 10,000 units was discarded before the probability distributions were calculated over 40,000 units of time. The bursts were sampled every 2 units of time, the conditional probabilities and entropies were calculated, and the maximization of MI(I; O) as a function of p(G_j) was carried out. Once all the numerical calculations were finished, the conditional entropies and probabilities were stored. The maximization of the mutual information was performed using the rules explained in previous sections. In Figure 5 (left), one can see the distribution of the conditional entropy values for a three-neuron CPG and a random set I, where all the elements were chosen from a uniform distribution. Most of the input configurations are situated in the range of entropy values between 5.5 and 7.5 bits. Those entropy values represent irregular activity (see a time series in Figure 6), whereas values closer to 2 bits represent regular rhythms. In Figure 5 (right), we show the distribution of the conditional entropy for a four-neuron CPG. In this case, most of the input configurations have conditional entropy between 6 and 9 bits.
Figure 6: (Left) Time series for a configuration g12 = 0.207668, g13 = 0.396304, g21 = 0.058504, g23 = 0, g31 = 0.782209, g32 = 0.452369 that generates a conditional entropy with a value of 6.64 bits. (Right) Time series for the configuration g12 = 0.110004, g13 = 0.253859, g21 = 0.700028, g23 = 0.152461, g31 = 0.772312, g32 = 0 that generates a conditional entropy value of 2.58 bits.
Figure 7: Distribution of the selected input configurations that maximize the information between the input and the rhythmic activity for a three-neuron CPG (dotted) and a four-neuron CPG (solid).
In Figure 7, we can see the ratio ζ between the number of configurations found to maximize the MI and the number of input configurations for CPGs of three and four neurons. Most of the chosen configurations lie in the lower range of the conditional entropy in both cases (ζ vanishes for values about 1 or 2 bits higher than the minimum entropies found), which indicates
that a small subset of input configurations is chosen and that most of them behave in a regular manner. The main results are summarized as follows:

1. A subset of 100 configurations of I accounts for 99% of the configurations that maximize MI(I; O), that is, \sum_{µ∈subset} p(G_µ) ≥ 0.99.
2. The preferred connectivity patterns are those with entropy values between 2 and 3 bits. These configurations produce the most regular CPG activity, as can be seen in Figure 6.
3. All of these configurations have nonopen topologies. Most of the known invertebrate CPGs have nonopen topology connections.

We have defined nonopen topologies as those in which each neuron receives synaptic input from at least one other neuron. The dominance of such topologies is likely explained by the fact that the model neurons are chaotic: any neuron that does not receive input from any other neuron remains chaotic. The chosen configurations in our experiments are typically nonopen topologies, which produce regular spatiotemporal patterns, while the open topology configurations almost always produce a chaotic pattern.

7 Results of the Analog Computations: ENs + Dynamic Clamp

In the analog computations, we first performed our experiments on a CPG composed of three chaotic neurons, with connections between them chosen at random, as explained in the previous sections. A pool of 500 random synaptic configurations was tested (#I = 500). The methods described at the end of section 5 were repeated to obtain the conditional probabilities and the entropies for each synaptic configuration. The maximization of MI(I; O) as a function of p(G_j) was carried out in the same way as for the computer simulations (see section 3). The same procedure was repeated for the CPGs composed of four neurons, but with a pool of 630 synaptic configurations (#I = 630).
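The nonopen-topology criterion used in the summary of section 6 reduces to a simple test on the conductance matrix. A sketch (the function name and the convention g[i, j] for the synapse from neuron j onto neuron i are our own):

```python
import numpy as np

def is_nonopen(g):
    """True if the topology is nonopen: every neuron receives synaptic input
    from at least one other neuron, i.e., every row of the conductance matrix
    has some nonzero off-diagonal entry."""
    g = np.array(g, dtype=float)
    np.fill_diagonal(g, 0.0)               # ignore (absent) self-connections
    return bool(np.all(np.any(g != 0.0, axis=1)))
```

For example, the regular configuration of Figure 6 (right) is nonopen, while any configuration that leaves some neuron without input is open (and, with chaotic model neurons, that neuron remains chaotic).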
Figure 8 shows the distribution of the conditional entropy values for a random set I, where all the elements were chosen from a uniform distribution, for both three- and four-neuron CPGs. Here, too, most of the input configurations are situated in a small but slightly different range of entropy values (between 5.0 and 6.5 bits for three-neuron CPGs and between 5.5 and 7 bits for four-neuron CPGs) as compared to Figure 5. The ranges are shifted by about +1 bit in the simulations because, in the analog experiments, the burst sampling rate is about half the sampling rate used in the simulations (when one converts the units of time). However, these values of entropy still represent irregular activity (as in the time series in Figure 9, for example), and values closer to 3 bits represent regular rhythms. Another difference between the results from the simulations and the analog experiments is the minimum entropy,
Figure 8: (Left) Distribution of inputs with specific values of the conditional entropy for ε = 0.225 in a three-neuron CPG. (Right) Input distribution in a four-neuron CPG for ε = 0.58.
Figure 9: (Left) Time series for a configuration g12 = 239 nS, g13 = 374 nS, g21 = 0 nS, g23 = 277 nS, g31 = 217 nS, g32 = 0 nS that generates a conditional entropy with a value of 6.44 bits. (Right) Time series for a configuration g12 = 277 nS, g13 = 0 nS, g21 = 260 nS, g23 = 257 nS, g31 = 0 nS, g32 = 443 nS that generates a conditional entropy with a value of 3.37 bits.
which is smaller in simulations (from 1 to 2 bits) than in the experiments (from 3 to 4 bits), probably because the fluctuations present in the electronic circuits are much higher than the small numerical noise introduced in the simulations. In order to show that the analog computations also generate the same results pointed out in section 6 for the digital computer simulations, we also calculated the ζ ratio between the number of configurations found to maximize the MI and the number of input configurations for the analog computations of the three- and four-neuron CPGs. As can be seen in Figure 10, in both cases, most of the chosen configurations have a conditional entropy smaller than 5 bits (the limit is again just about 1 or 2 bits
Figure 10: Distribution of the selected input configurations that maximize the information between the input and the rhythmic activity for a three-neuron CPG (dotted line) and four-neuron CPG (solid line).
higher than the minimum entropies found), which indicates again that the subset of configurations chosen behaves in a more regular way.

8 Discussion

One of the main problems faced in the study of CPGs in animals is establishing a relationship between the electrical activity of the CPG and the mechanical device that performs a specific function. In the experiments performed in some animals, the CPG is placed in a dish without connections to the mechanical device, and hence a measure of the function of the CPG cannot be established. If the motor activity is recorded, there are at present no means to introduce electrodes into the neurons of the CPG in the living animal, because it moves and breathes. The question we address is, Can we find a general measure that helps us to bound the potential function of a CPG? In this work, we show how, using computer and electronic models, the maximization of the information gives an approximate answer that bounds the possible solutions that one can find in the CPG. It is remarkable that, regardless of the differences between the computer and electronic implementations, similar results are obtained. This hints at the generality of the phenomenon for simple CPGs made of chaotic neurons.
We must acknowledge that the simplifications imposed on our model CPGs, disregarding intrinsic properties of neurons and including only inhibitory connections, are quite drastic and seriously limit the generality of the results, especially when more complex CPGs, such as those found in vertebrates, are considered. On the other hand, the fact that even our simple model CPGs show properties found in living invertebrate CPGs suggests that the models capture important features. So although these results cannot be claimed to validate the models, they do serve as evidence of the generality of the properties found, at least in small invertebrate CPGs. Several other invertebrate CPGs that also have excitatory synapses present nonopen topologies; see, for example, the CPGs for feeding in Planorbis (Arshavsky et al., 1985), swimming in Tritonia (Getting, 1989), and swimming in Clione (Arshavsky et al., 1998). From an evolutionary perspective, our results suggest that CPGs may develop following some general rules, with the final adjustments performed depending on the specific function to be carried out. From this point of view, the application of information theory to CPG selection would be an invaluable tool for a first filtering of the large number of possible configurations that can occur. A possible set of experiments to test whether nature uses the information-maximization principle would be to let a set of nonspecific neurons develop connections in a culture dish and, once the connections have developed, to check the configuration obtained. The experiments could be repeated as many times as needed to make the statistics meaningful.
The observation that the selected connection topologies are nonopen is also supported in previous work (Huerta et al., 2001), where open topology solutions in the CPG do not provide an efficient function of the pyloric chamber of the lobster. It is interesting to note that information maximization gives a bounded region of configurations where the specificity of a function is included. Finally, we note that the use of electronic model neurons can be a good tool to check the results obtained from numerical simulations when no better experiments are available, since the results found have to be robust to several kinds of fluctuations the ENs are exposed to.
Acknowledgments

Partial support for this work came from the U.S. Department of Energy, Office of Science, and the U.S. Office of Naval Research. R.D.P. and M.B.R. were supported by the Brazilian agency Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP). We are also grateful to A. I. Selverston, M. I. Rabinovich, and H. D. I. Abarbanel for their suggestions.
References

Abarbanel, H. D. I., Huerta, R., Rabinovich, M. I., Rulkov, N., Rowat, P. F., & Selverston, A. I. (1996). Synchronized action of synaptically coupled chaotic model neurons. Neural Computation, 8, 50–65.

Arshavsky, Y. I., Beloozerova, I. N., Orlovsky, G. N., Panchin, Y. V., & Pavlova, G. A. (1985). Control of locomotion in marine mollusc Clione limacina. Experimental Brain Research, 58(2), 255–293.

Arshavsky, Y. I., Deliagina, T. G., & Orlovsky, G. N. (1997). Pattern generation. Current Opinion in Neurobiology, 7(6), 781–789.

Arshavsky, Y. I., Deliagina, T. G., Orlovsky, G. N., Panchin, Y. V., Popova, L. B., & Sadreyev, R. I. (1998). Analysis of the central pattern generator for swimming in the mollusk Clione. Annals of the New York Academy of Sciences, 860, 51–69.

Brodfuehrer, P. D., Debski, E. A., O'Gara, B. A., & Friesen, W. O. (1995). Neuronal control of leech swimming. Journal of Neurobiology, 27(3), 403–418.

Deco, G., & Schürmann, B. (1997). Information transmission and temporal code in central spiking neurons. Physical Review Letters, 79(23), 4697–4700.

Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Computation, 6, 14–18.

Getting, P. A. (1989). Emerging principles governing the operation of neural networks. Annual Review of Neuroscience, 12, 185–204.

Harris-Warrick, R. M., Johnson, B. R., Peck, J. H., Kloppenburg, P., Ayali, A., & Skarbinski, J. (1998). Distributed effects of dopamine modulation in the crustacean pyloric network. Annals of the New York Academy of Sciences, 860, 155–167.

Hindmarsh, J. L., & Rose, R. M. (1984). A model of neuronal bursting using three coupled first order differential equations. Proceedings of the Royal Society of London B, 221, 87–102.

Huerta, R., Sánchez-Montañés, M. A., Corbacho, F., & Sigüenza, J. A. (2000). A central pattern generator to control a pyloric-based system. Biological Cybernetics, 82, 85–94.

Huerta, R., Varona, P., Rabinovich, M. I., & Abarbanel, H. D. I. (2001). Topology selection by chaotic neurons of a pyloric central pattern generator. Biological Cybernetics, 84(1), L1–L8.

Katz, P. S. (1998). Neuromodulation intrinsic to the central pattern generator for escape swimming in Tritonia. Annals of the New York Academy of Sciences, 860, 181–188.

Kepecs, A., & Lisman, J. (2003). Information encoding and computation with spikes and bursts. Network: Computation in Neural Systems, 14(1), 103–118.

Lisman, J. E. (1997). Bursts as a unit of neural information: Making unreliable synapses reliable. Trends in Neurosciences, 20(1), 38–43.

Pinto, R. D., Elson, R. C., Szücs, A., Rabinovich, M. I., Selverston, A. I., & Abarbanel, H. D. I. (2001). Extended dynamic clamp: Controlling up to four neurons using a single desktop computer and interface. Journal of Neuroscience Methods, 108(1), 39–48.

Pinto, R. D., Varona, P., Volkovskii, A. R., Szücs, A., Abarbanel, H. D. I., & Rabinovich, M. I. (2000). Synchronous behavior of two coupled electronic neurons. Physical Review E, 62, 2644–2656.
Rabinovich, M. I., Abarbanel, H. D. I., Huerta, R., Elson, R. E., & Selverston, A. I. (1997). Self-regularization of chaos in neural systems: Experimental and theoretical results. IEEE Transactions on Circuits and Systems I, 44, 997–1005.

Selverston, A. (1999a). What invertebrate circuits have taught us about the brain. Brain Research Bulletin, 50(5–6), 439–440.

Selverston, A. (1999b). General principles of rhythmic motor pattern generation derived from invertebrate CPGs. Progress in Brain Research, 123, 247–257.

Selverston, A., Elson, R., Rabinovich, M., Huerta, R., & Abarbanel, H. (1998). Basic principles for generating motor output in the stomatogastric ganglion. Annals of the New York Academy of Sciences, 860, 35–50.

Selverston, A. I., & Moulins, M. (Eds.). (1987). The crustacean stomatogastric system. Berlin: Springer.

Selverston, A. I., Rabinovich, M. I., Abarbanel, H. D. I., Elson, R. C., Szücs, A., Pinto, R. D., Huerta, R., & Varona, P. (2000). Reliable circuits from irregular neurons: A dynamical approach to understanding central pattern generators. Journal of Physiology (Paris), 94, 357–374.

Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana: University of Illinois Press.

Sharp, A. A., Skinner, F. K., & Marder, E. (1996). Mechanisms of oscillation in dynamic clamp constructed two-cell half-center circuits. Journal of Neurophysiology, 76, 867–883.

Stemmler, M., & Koch, C. (1999). How voltage-dependent conductances can adapt to maximize the information encoded by neuronal firing rate. Nature Neuroscience, 2(6), 521–527.

Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80(1), 197–200.

Szücs, A., Varona, P., Volkovskii, A. R., Abarbanel, H. D. I., Rabinovich, M. I., & Selverston, A. I. (2000). Interacting biological and electronic neurons generate realistic oscillatory rhythms. NeuroReport, 11, 563–569.
Received September 29, 2005; accepted July 11, 2006.
LETTER
Communicated by Barak Pearlmutter
Independent Slow Feature Analysis and Nonlinear Blind Source Separation

Tobias Blaschke
[email protected]
Tiziano Zito
[email protected]
Laurenz Wiskott
[email protected]
Institute for Theoretical Biology, Humboldt University Berlin, Invalidenstraße 43, D-10115 Berlin, Germany
In the linear case, statistical independence is a sufficient criterion for performing blind source separation. In the nonlinear case, however, it leaves an ambiguity in the solutions that has to be resolved by additional criteria. Here we argue that temporal slowness complements statistical independence well and that a combination of the two leads to unique solutions of the nonlinear blind source separation problem. The algorithm we present is a combination of second-order independent component analysis and slow feature analysis and is referred to as independent slow feature analysis. Its performance is demonstrated on nonlinearly mixed music data. We conclude that slowness is indeed a useful complement to statistical independence but that time-delayed second-order moments are only a weak measure of statistical independence.
Neural Computation 19, 994–1021 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

In signal processing, one often has to deal with multivariate data such as a vectorial signal x(t) = [x1(t), . . . , xM(t)]^T. To facilitate the interpretation of such a signal, a useful representation of the data in terms of a linear or nonlinear transformation has to be found; prominent linear examples are the Fourier transform, principal component analysis, and Fisher discriminant analysis. In this letter, we concentrate on blind source separation (BSS), which recovers the signal components (sources) that originally generated an observed mixture. While the linear BSS problem can be solved by resorting to independent component analysis (ICA), a method based on the assumption of mutual independence between the mixed source signal components, this is not possible in the nonlinear case. Some algorithms have been proposed to address this problem, and we mention them below. The objective of this letter is to show that the nonlinear BSS problem can
be solved by combining ICA and slow feature analysis (SFA), a method to find a representation in which signal components vary slowly. After a short introduction to linear BSS and ICA in section 2.1, we present the nonlinear BSS problem and some of the available algorithms in section 2.2. SFA is explained in section 3. We introduce independent slow feature analysis (ISFA) in section 4, a combination of second-order ICA and SFA that can perform nonlinear BSS. In section 5 the algorithm is tested on random and surrogate correlation matrices and then applied to nonlinearly mixed audio data. An analysis of the results reveals that nonlinear BSS can be solved by combining the objectives of statistical independence and slowness, but that time-delayed second-order moments are not a sufficient measure of statistical independence in our case. We conclude with a discussion in section 6.

2 Blind Source Separation and Independent Component Analysis

2.1 Linear BSS and ICA. Let x(t) = [x1(t), . . . , xN(t)]^T be a linear mixture of a source signal s(t) = [s1(t), . . . , sN(t)]^T, defined by

x(t) = As(t),
(2.1)
with an invertible N × N mixing matrix A. The goal of blind source separation (BSS) is to recover the unknown source signal s(t) from the observable x(t) without any prior information. The only assumption is that the source signal components are statistically independent. Given only the observed signal x(t), we want to find a matrix R such that the components of u(t) = Qy(t) = QWx(t) = Rx(t)
(2.2)
are mutually statistically independent. Here we have divided R into two parts. First, a whitening transformation y(t) = Wx(t) with whitening matrix W is applied, resulting in uncorrelated signal components yi(t) with unit variance and zero mean, where we have assumed x(t) and also s(t) to have zero mean. Second, a transformation u(t) = Qy(t) with orthogonal Q (Comon, 1994) results in statistically independent components ui(t). The method of finding a representation of the observed data such that the components are mutually statistically independent is called independent component analysis (ICA). It has been proven that ICA solves the linear BSS problem, apart from the fact that the source signal components can be recovered only up to scaling and permutation (Comon, 1994). A variety of algorithms perform linear ICA and therefore linear BSS. They can be divided into two classes (Cardoso, 2001): (1) independence is achieved by optimizing a criterion that requires higher-order statistics, and (2) the optimization criterion requires autocorrelations or nonstationarity
of the source signal components. For the second class of BSS algorithms, second-order statistics is sufficient (see, e.g., Tong, Liu, Soon, & Huang, 1991). Here we focus on class 2 and use a method introduced by Molgedey and Schuster (1994) based on only second-order statistics. It is based on the minimization of an objective function that can be written as

\[
\Psi_{\mathrm{ICA}}^{\tau}(Q) := \sum_{\substack{i,j=1 \\ i \neq j}}^{N} \left( C_{ij}^{(u)}(\tau) \right)^{2} = \sum_{\substack{i,j=1 \\ i \neq j}}^{N} \left( \sum_{k,l=1}^{N} Q_{ik} Q_{jl} C_{kl}^{(y)}(\tau) \right)^{2}, \qquad (2.3)
\]

operating on the already whitened signal y(t). Here, C_{ij}^{(u)}(τ) is an entry of the symmetrized time-delayed correlation matrix,

\[
C^{(u)}(\tau) := \tfrac{1}{2} \left\langle u(t)\,u(t+\tau)^{T} + u(t+\tau)\,u(t)^{T} \right\rangle, \qquad (2.4)
\]
\[
C_{ij}^{(u)}(\tau) := \tfrac{1}{2} \left\langle u_i(t)\,u_j(t+\tau) + u_i(t+\tau)\,u_j(t) \right\rangle, \qquad (2.5)
\]

and C^{(y)}(τ) is defined correspondingly. Minimization of \Psi_{\mathrm{ICA}}^{\tau} can be understood intuitively as finding an orthogonal matrix Q that diagonalizes the correlation matrix with time delay τ. Since, because of the whitening, the instantaneous correlation matrix, which is simply the covariance matrix, is already diagonal, this results in signal components that are decorrelated instantaneously and at a given time delay τ. This can be sufficient to achieve statistical independence (Tong et al., 1991). Extending this method to several time delays is straightforward and provides greater robustness (see, e.g., Belouchrani, Abed Meraim, Cardoso, & Moulines, 1997; Ziehe & Müller, 1998; and section 5.1).
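Equations 2.3 to 2.5 translate directly into a few lines of linear algebra. The sketch below (function names are ours, not from the letter) estimates the symmetrized time-delayed correlation matrix of a zero-mean signal stored as an N × T array and evaluates the ICA objective for a given orthogonal Q:

```python
import numpy as np

def time_delayed_corr(y, tau):
    """Symmetrized time-delayed correlation matrix C^(y)(tau), equation 2.4."""
    a, b = y[:, :-tau], y[:, tau:]                  # y(t) and y(t + tau)
    return (a @ b.T + b @ a.T) / (2.0 * a.shape[1])

def ica_objective(Q, C_y):
    """Psi_ICA^tau(Q), equation 2.3: sum of squared off-diagonal entries of Q C Q^T."""
    C_u = Q @ C_y @ Q.T
    return np.sum(C_u ** 2) - np.sum(np.diag(C_u) ** 2)

# Toy check: two independent AR(1) processes with different time structure.
rng = np.random.default_rng(0)
T = 20000
s = np.zeros((2, T))
for t in range(1, T):
    s[:, t] = np.array([0.95, 0.5]) * s[:, t - 1] + rng.standard_normal(2)
s = (s - s.mean(axis=1, keepdims=True)) / s.std(axis=1, keepdims=True)

C = time_delayed_corr(s, tau=1)
# For independent sources, C(tau) is already close to diagonal, so the
# identity rotation yields a small ICA objective.
assert ica_objective(np.eye(2), C) < 0.01
```

Minimizing `ica_objective` over orthogonal Q, simultaneously for several delays, is exactly the joint-diagonalization view described in the text.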
2.2 Nonlinear BSS and ICA. An obvious extension to the linear mixing model 2.1 has the form x(t) = F (s(t)),
(2.6)
with a nonlinear function F : R N → R M that maps N-dimensional source vectors s(t) onto M-dimensional signal vectors x(t). The components xi (t) of the observable are a nonlinear mixture of the sources and, as in the linear case, source signal components si (t) are assumed to be mutually statistically independent. Extracting the source signal is possible only if F is an invertible function on the range of s(t), which we will assume from now on.
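As a concrete toy instance of the mixing model 2.6 (the linear part and the tanh warp below are our illustrative choices, not from the letter), one can compose an invertible linear mixture with a strictly monotone componentwise nonlinearity, which keeps F invertible on the range of s(t):

```python
import numpy as np

A = np.array([[1.0, 0.5], [-0.3, 1.0]])           # invertible linear part

def F(s):
    """Invertible nonlinear mixture x = F(s): linear mix, then monotone warp."""
    return np.tanh(A @ s)

rng = np.random.default_rng(3)
s = rng.uniform(-1.0, 1.0, size=(2, 10000))        # independent sources

x = F(s)
# F is invertible on the range of s, so the sources can be recovered exactly
# (here by construction; a BSS algorithm has to do this blindly).
s_rec = np.linalg.solve(A, np.arctanh(x))
assert np.allclose(s_rec, s)
```

This particular F is a postnonlinear (PNL) mixture, one of the constrained classes listed below.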
The equivalence of BSS and ICA in the linear case does not hold in general for a nonlinear function F (Hyvärinen & Pajunen, 1999; Jutten & Karhunen, 2003). For example, given statistically independent components u1(t) and u2(t), any nonlinear functions h1(u1) and h2(u2) also lead to components that are statistically independent. Also, a nonlinear mixture of u1(t) and u2(t) can still have statistically independent components (Jutten & Karhunen, 2003). Thus, in the nonlinear BSS problem, independence is not sufficient to recover the original source signal, and additional assumptions about the mapping F or the source signal are needed to sufficiently constrain the optimization problem. We list some of the known methods:

- Constraints on the mapping F:
  - F is a smooth mapping (Hyvärinen & Pajunen, 1999; Almeida, 2004)
  - F is a postnonlinear (PNL) mapping (Taleb & Jutten, 1997, 1999; Yang, Amari, & Cichocki, 1998; Taleb, 2002; Ziehe, Kawanabe, Harmeling, & Müller, 2003)
- Prior information about the source signal components:
  - Source signal components are bounded (Babaie-Zadeh, Jutten, & Nayebi, 2002)
  - Source signal components have time-delayed autocorrelations (referred to as temporal correlations) (Hosseini & Jutten, 2003)
  - Source signal components exhibit a characteristic time structure (power spectra are pairwise different) (Harmeling, Ziehe, Kawanabe, & Müller, 2003)
2.3 A New Approach. In our approach, we do not make any specific assumption about the mapping F , although the function space available for unmixing will be finite-dimensional in the algorithm, which imposes some limitations on F . Since we employ an ICA method based on time-delayed cross-correlations, we make the implicit assumption that the sources have significantly different temporal structure (power spectra are pairwise different) (cf. Harmeling et al., 2003). We also assume that the sampling rate is high enough that the input signal can be treated as if it were continuous and the time derivative is well approximated by the difference of two successive time points. We have seen that in the nonlinear case, statistical independence alone is not a sufficient criterion for BSS. There are infinitely many nonlinearly distorted versions of one source that are all statistically independent of another source. We propose slowness as a means to resolve this ambiguity and select a good representative from the different versions of a source because nonlinearly distorted versions of a source usually vary more quickly than the source itself. Consider a sinusoidal signal component xi (t) = sin(t) and a second component that is the square of the first
x_j(t) = x_i(t)^2 = 0.5 (1 − cos(2t)). The second component varies more quickly due to the frequency doubling induced by the squaring. We believe this argument can be made more formal, and it can be proven that, given the set of a one-dimensional signal and all its nonlinearly and continuously transformed versions, the slowest signal of the set is either the signal itself or an invertibly transformed version of it (Sprekeler, Zito, & Wiskott, 2007). Considering this, we propose, in order to perform nonlinear BSS, to complement the independence objective of pure ICA with a slowness objective. In the next section, we give a short introduction to slow feature analysis, an algorithm built on the basis of this slowness objective.

3 Slow Feature Analysis

Slow feature analysis (SFA) is a method that extracts slowly varying signals from a given observed signal (Wiskott & Sejnowski, 2002). This section gives a short description of the method as well as a link between SFA and second-order ICA (Blaschke, Berkes, & Wiskott, 2006), which provides the means to find a simple objective function for our nonlinear BSS method. Consider a vectorial input signal x(t) = [x1(t), . . . , xM(t)]^T. The objective of SFA is to find a nonlinear input-output function g(x) = [g1(x), . . . , gL(x)]^T such that the components of u(t) = g(x(t)) vary as slowly as possible. As a measure of slowness, we use the variance of the first derivative, so that a slow signal has on average a small slope. The optimization problem then is as follows. Minimize the objective function

\[
\Delta(u_i) := \left\langle \dot{u}_i^2(t) \right\rangle \qquad (3.1)
\]

successively for each u_i(t) under the constraints

\[
\langle u_i(t) \rangle = 0, \qquad \text{(zero mean)} \qquad (3.2)
\]
\[
\langle u_i(t)^2 \rangle = 1, \qquad \text{(unit variance)} \qquad (3.3)
\]
\[
\langle u_i(t)\,u_j(t) \rangle = 0 \;\; \forall\, j < i, \qquad \text{(decorrelation and order)} \qquad (3.4)
\]

where ⟨·⟩ denotes averaging over time. Constraints 3.2 and 3.3 ensure that the solution will not be the trivial solution u_i(t) = const. Constraint 3.4 provides uncorrelated output signal components and thus guarantees that different components carry different information. To make the optimization problem easier to solve, we consider the components g_i of the input-output function to be a linear combination of a finite set of nonlinear functions. We can then split the optimization procedure into two parts: (1) nonlinear expansion of the input signal x(t) into a high-dimensional feature space and (2) solving the optimization problem in this feature space linearly.
3.1 Nonlinear Expansion. A common method to make nonlinear problems solvable in a linear fashion is nonlinear expansion. The observed signal components xi (t) are mapped into a high-dimensional feature space according to z(t) = h(x(t)).
(3.5)
The dimension L of z(t) is typically much larger than that of the original signal. For instance, if we want to expand into the space of second-degree polynomials, we can apply the mapping

h(x) = [x1, . . . , xM, x1 x1, x1 x2, . . . , xM xM]^T − h_0.

(3.6)

The dimensionality of this feature space is L = M + M(M + 1)/2. The constant vector h_0 is needed to make the expanded signal mean free.

3.2 Solution of the Linear Optimization Problem. Given the nonlinear expansion, the nonlinear input-output function g(x) can be written as

g(x) = Rh(x) = Rz,
(3.7)
where R is an L × L matrix, which is subject to optimization. To simplify the optimization procedure we (1) choose the nonlinearities h (· ) such that z(t) is mean free and (2) first find a transformation y(t) = Wz(t) to obtain mutually decorrelated components yi (t) with zero mean. Matrix W is a whitening matrix as in normal ICA; u(t) = Qy(t) = QWz(t) = Rz(t) = g(x(t)) ,
(3.8)
where y(t) is the nonlinearly expanded and whitened signal. It can be shown (Wiskott & Sejnowski, 2002) that the constraints 3.2 to 3.4 are fulfilled trivially if the transformation Q, subject to learning, is an orthogonal matrix. To solve the optimization problem, we rewrite the slowness objective, equation 3.1, as

\[
\Delta(u_i) = \left\langle (\dot{u}_i(t))^2 \right\rangle = q_i^{T} \left\langle \dot{y}(t)\,\dot{y}(t)^{T} \right\rangle q_i =: q_i^{T} E\, q_i, \qquad (3.9)
\]

where q_i = [Q_{i1}, Q_{i2}, . . . , Q_{iL}]^T is the ith row of Q and E is the matrix ⟨ẏ(t)ẏ(t)^T⟩. For this optimization problem, there exists a unique solution. For i = 1, the optimal weight vector is the normalized eigenvector that corresponds to the smallest eigenvalue of E. The eigenvectors of the next higher eigenvalues produce the next slow components u2(t), u3(t), and so forth. Typically, only the first several of all L possible output components are of interest and selected.
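The full pipeline of sections 3.1 and 3.2 (quadratic expansion, whitening, and eigendecomposition of E = ⟨ẏẏᵀ⟩) fits in a few lines. This is a minimal sketch with names of our choosing; a production implementation would at least regularize the whitening step:

```python
import numpy as np

def quadratic_expansion(x):
    """Monomials of degree 1 and 2 (cf. eq. 3.6), mean removed.
    Input (M, T); output dimension L = M + M(M+1)/2."""
    M, _ = x.shape
    feats = [x[i] for i in range(M)]
    feats += [x[i] * x[j] for i in range(M) for j in range(i, M)]
    z = np.vstack(feats)
    return z - z.mean(axis=1, keepdims=True)

def sfa(x):
    """Minimal quadratic SFA: returns output components u(t), slowest first."""
    z = quadratic_expansion(x)
    # Whitening y = W z: decorrelated, unit-variance components.
    d, U = np.linalg.eigh(z @ z.T / z.shape[1])
    y = np.diag(d ** -0.5) @ U.T @ z
    # Approximate the derivative by forward differences (tau = 1).
    ydot = y[:, 1:] - y[:, :-1]
    E = ydot @ ydot.T / ydot.shape[1]
    # Eigenvectors of E, smallest eigenvalue first, give the rows of Q.
    _, V = np.linalg.eigh(E)
    return V.T @ y

t = np.linspace(0, 4 * np.pi, 5000)
x = np.vstack([np.sin(t), np.sin(11 * t)])
u = sfa(x)
# The slowest extracted component should be strongly correlated with sin(t),
# the slowest signal in the expanded space.
assert abs(np.corrcoef(u[0], np.sin(t))[0, 1]) > 0.9
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, which is why the first row of `V.T` is the slowest direction.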
Finding the eigenvectors is equivalent to finding the transformation Q such that Q^T E Q is diagonal. As described in detail in Blaschke et al. (2006), this leads to an objective function for SFA subject to maximization,

\[
\Psi_{\mathrm{SFA}}^{\tau}(Q) := \sum_{i=1}^{L} \left( C_{ii}^{(u)}(\tau) \right)^{2} = \sum_{i=1}^{L} \left( \sum_{k,l=1}^{L} Q_{ik} Q_{il} C_{kl}^{(y)}(\tau) \right)^{2}, \qquad (3.10)
\]
where τ is a time delay that arises from an approximation of the time derivative. We set τ = 1 because we make the approximation ẏ(t) ≈ y(t + 1) − y(t). To understand equation 3.10 intuitively, we note that slowly varying signal components are easier to predict and should therefore have strong correlations in time. Thus, maximizing the time-delayed autocorrelation produces a slowly varying signal component. Since the trace of C^{(y)}(τ) is preserved under a rotation Q, maximizing the sum over the squared autocorrelations tends to produce a set of most slowly varying signal components at the expense of the other components, which become most quickly varying and are usually discarded. Note the formal similarity between equations 2.3 and 3.10.

4 Independent Slow Feature Analysis

The nonlinear BSS method proposed in this section combines the principle of independence known from linear second-order BSS methods with the principle of slowness as described above. Because of the combination of ICA and SFA, we refer to this method as independent slow feature analysis (ISFA). Second-order ICA tends to make the output components independent, and SFA tends to make them slow. Since we are dealing with a nonlinear mixture, we first compute a nonlinearly expanded signal z(t) = h(x(t)), with h: R^M → R^L being some nonlinear function chosen such that z(t) has zero mean. In a second step, z(t) is whitened to obtain y(t) = Wz(t). Finally, we apply linear ICA combined with linear SFA on y(t) in order to find the output signal u(t), the first R components of which are the estimated source signals, where R is usually much smaller than L, the dimension of the expanded signal. Because of the whitening, we know that ISFA, like ICA and SFA, is solved by finding an orthogonal L × L matrix Q. We write the output signal u(t) as

u(t) = Qy(t) = QWz(t) = QWh(x(t)).
(4.1)
While the u1 (t), . . . , u R (t) are statistically independent and slowly varying, the components u R+1 (t), . . . , u L (t) are more quickly varying and may be statistically dependent on each other as well as on the estimated sources.
The last L − R components of the output signal u(t) are irrelevant for the final result but important during the optimization procedure (see below). To summarize, we have an M-dimensional input x(t), an L-dimensional nonlinearly expanded and whitened y(t), and an L-dimensional output signal u(t). ISFA finds an orthogonal matrix Q such that the first R components of the output signal u(t) are mutually independent and slowly varying. These are the estimated sources.

4.1 Objective Function. To recover R source signal components u_i(t), i = 1, . . . , R, from an L-dimensional expanded and whitened signal y(t), the objective for ISFA with one single time delay τ reads

\[
\Psi_{\mathrm{ISFA}}^{\tau}(u_1, \ldots, u_R) := b_{\mathrm{ICA}}\, \Psi_{\mathrm{ICA}}^{\tau}(u_1, \ldots, u_R) - b_{\mathrm{SFA}}\, \Psi_{\mathrm{SFA}}^{\tau}(u_1, \ldots, u_R)
= b_{\mathrm{ICA}} \sum_{\substack{i,j=1 \\ i \neq j}}^{R} \left( C_{ij}^{(u)}(\tau) \right)^{2} - b_{\mathrm{SFA}} \sum_{i=1}^{R} \left( C_{ii}^{(u)}(\tau) \right)^{2}, \qquad (4.2)
\]

where we simply combine the ICA objective, equation 2.3, and the SFA objective, equation 3.10, for the first R components, weighted by the factors b_ICA and b_SFA, respectively. Note that the ICA and the SFA objectives are usually applied to all components and that in the linear case (and for one time delay τ = 1) they are equivalent (Blaschke et al., 2006). Here, they are applied to an R-dimensional subspace in the L-dimensional expanded space, which makes them different from each other. \Psi_{\mathrm{ISFA}}^{\tau} is to be minimized, which is the reason why the SFA part has a negative sign. In the linear case, it is standard practice to use multiple time delays to stabilize the ICA solution (see, e.g., the kTDSEP algorithm by Harmeling et al., 2003). We will see in sections 5.1 and 5.2 that in our case, multiple time delays are essential to get meaningful solutions. The general expression for the objective of ISFA then reads

\[
\Psi_{\mathrm{ISFA}}(u_1, \ldots, u_R) := b_{\mathrm{ICA}} \sum_{\tau \in T_{\mathrm{ICA}}} \kappa_{\mathrm{ICA}}^{\tau}\, \Psi_{\mathrm{ICA}}^{\tau} - b_{\mathrm{SFA}} \sum_{\tau \in T_{\mathrm{SFA}}} \kappa_{\mathrm{SFA}}^{\tau}\, \Psi_{\mathrm{SFA}}^{\tau}
= b_{\mathrm{ICA}} \sum_{\tau \in T_{\mathrm{ICA}}} \kappa_{\mathrm{ICA}}^{\tau} \sum_{\substack{i,j=1 \\ i \neq j}}^{R} \left( C_{ij}^{(u)}(\tau) \right)^{2} - b_{\mathrm{SFA}} \sum_{\tau \in T_{\mathrm{SFA}}} \kappa_{\mathrm{SFA}}^{\tau} \sum_{i=1}^{R} \left( C_{ii}^{(u)}(\tau) \right)^{2}, \qquad (4.3)
\]

where T_ICA and T_SFA are the sets of time delays for the ICA and SFA objectives, respectively, whereas κ_ICA^τ and κ_SFA^τ are weighting factors for the
corresponding correlation matrices. For simplicity, we will continue the description with only one time delay, based on equation 4.2, and only later provide the full formulation with multiple time delays, based on equation 4.3.

4.2 Optimization Procedure. From equation 4.1, we know that C^{(u)}(τ) in equation 4.2 depends on the orthogonal matrix Q. There are several ways to find the orthogonal matrix that minimizes the objective function. Here we apply successive Givens rotations to obtain Q. A Givens rotation is a rotation around the origin within the plane of two selected components µ and ν and has the matrix form
\[
Q_{ij}^{\mu\nu} :=
\begin{cases}
\cos(\phi) & \text{for } (i, j) \in \{(\mu, \mu), (\nu, \nu)\} \\
-\sin(\phi) & \text{for } (i, j) \in \{(\mu, \nu)\} \\
\sin(\phi) & \text{for } (i, j) \in \{(\nu, \mu)\} \\
\delta_{ij} & \text{otherwise}
\end{cases}
\qquad (4.4)
\]

with Kronecker symbol δ_ij and rotation angle φ. Any orthogonal L × L matrix such as Q can be written as a product of L(L − 1)/2 (or more) Givens rotation matrices Q^{µν} (for the rotation part) and a diagonal matrix with diagonal elements ±1 (for the reflection part). Since reflections do not matter in our case, we consider only the Givens rotations, as is often done in second-order ICA algorithms (e.g., Cardoso & Souloumiac, 1996; note, however, that here ICA is applied to a subspace). Objective 4.2 as a function of a Givens rotation Q^{µν} reads
\[
\Psi_{\mathrm{ISFA}}^{\tau,\mu\nu}(Q^{\mu\nu}) = b_{\mathrm{ICA}} \sum_{\substack{i,j=1 \\ i \neq j}}^{R} \left( \sum_{k,l=1}^{L} Q_{ik}^{\mu\nu} Q_{jl}^{\mu\nu} C_{kl}^{(u')}(\tau) \right)^{2} - b_{\mathrm{SFA}} \sum_{i=1}^{R} \left( \sum_{k,l=1}^{L} Q_{ik}^{\mu\nu} Q_{il}^{\mu\nu} C_{kl}^{(u')}(\tau) \right)^{2}, \qquad (4.5)
\]

where u' is some intermediate signal during the optimization procedure. For each Givens rotation, there exists an angle φ_min with minimal \Psi_{\mathrm{ISFA}}^{\tau,\mu\nu}. Successive application of Givens rotations Q^{µν} with the corresponding rotation angle φ_min leads to the final rotation matrix Q, yielding

\[
C^{(u)}(\tau) = Q^{T} C^{(y)}(\tau)\, Q. \qquad (4.6)
\]

In the ideal case, the upper left R × R submatrix of C^{(u)}(τ) is diagonal with a large trace \sum_{i=1}^{R} C_{ii}^{(u)}(\tau).
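Equations 4.4 and 4.6 in code (a sketch, with names of our choosing): the Givens rotation is built explicitly, and correlation matrices are updated without ever touching the signals themselves:

```python
import numpy as np

def givens(L, mu, nu, phi):
    """Givens rotation matrix Q^{mu,nu} of equation 4.4."""
    Q = np.eye(L)
    Q[mu, mu] = Q[nu, nu] = np.cos(phi)
    Q[mu, nu], Q[nu, mu] = -np.sin(phi), np.sin(phi)
    return Q

def rotate_corr(C, Q):
    """Update of a delayed correlation matrix under the rotation, cf. eq. 4.6."""
    return Q.T @ C @ Q

C = np.diag([3.0, 2.0, 1.0, 0.5])
Q = givens(4, 0, 1, np.pi / 2)
C_rot = rotate_corr(C, Q)
# A rotation by pi/2 in the (0, 1) plane simply swaps the two diagonal entries.
assert np.allclose(np.diag(C_rot), [2.0, 3.0, 1.0, 0.5])
# Rotations are orthogonal, so the trace (total autocorrelation) is preserved.
assert np.isclose(np.trace(C_rot), np.trace(C))
```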
Applying a Givens rotation Q^{µν} in the µν-plane changes all auto- and cross-correlations C_{ij}^{(u')}(τ) with at least one of the indices equal to µ or ν. There exist two invariances under such a transformation:

\[
\left( C_{\mu i}^{(u')}(\tau) \right)^{2} + \left( C_{\nu i}^{(u')}(\tau) \right)^{2} = \text{const} \quad \forall\, i \notin \{\mu, \nu\}, \qquad (4.7)
\]
\[
\left( C_{\mu\mu}^{(u')}(\tau) \right)^{2} + \left( C_{\mu\nu}^{(u')}(\tau) \right)^{2} + \left( C_{\nu\mu}^{(u')}(\tau) \right)^{2} + \left( C_{\nu\nu}^{(u')}(\tau) \right)^{2} = \text{const}. \qquad (4.8)
\]

Assume we want to minimize \Psi_{\mathrm{ISFA}}^{\tau} for a given R, where R denotes the number of signal components we want to extract. Applying a Givens rotation Q^{µν}, we have to distinguish three cases:
- Case 1. Both axes, µ and ν, lie inside the subspace spanned by the first R axes (µ, ν ≤ R) (see Figure 1a). The sum over all squared cross-correlations of all signal components that lie outside the R-dimensional subspace is constant, as is that of all signal components inside the subspace. The former holds because of the first invariance, equation 4.7, and the latter because of the first (equation 4.7) and second (equation 4.8) invariances. There is no interaction between inside and outside; in fact, the objective function is exactly the objective for an ICA algorithm based on second-order statistics, such as TDSEP or SOBI (Ziehe & Müller, 1998; Belouchrani et al., 1997). Blaschke et al. (2006) have shown that this is equivalent to SFA in the case of a single time delay of τ = 1.
- Case 2. Only one axis, without loss of generality µ, lies inside the subspace; the other, ν, lies outside (µ ≤ R < ν) (see Figure 1b). Since one axis of the rotation plane lies outside the subspace, u'_µ in the objective function can be optimized at the expense of the u'_ν outside the subspace. A rotation of π/2, for example, would simply exchange components u'_µ and u'_ν. For instance, according to equation 4.7, (C_{µi}^{(u')})² can be optimized at the expense of (C_{νi}^{(u')})² with i ∈ {1, . . . , R}; according to equation 4.8, (C_{µµ}^{(u')})² can be optimized at the expense of (C_{µν}^{(u')})², (C_{νµ}^{(u')})², and (C_{νν}^{(u')})². This gives the possibility of finding the slowest and most independent components in the whole space spanned by all L axes, in contrast to case 1, where the minimum is searched within the subspace spanned by the first R axes considered in the objective function.
- Case 3. Both axes lie outside the subspace (R < µ, ν) (see Figure 1c). A Givens rotation with the two rotation axes outside the relevant subspace does not affect the objective function and can therefore be disregarded.
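The two invariances 4.7 and 4.8 that underlie this case analysis can be verified numerically for any symmetric matrix (illustrative sketch, our naming):

```python
import numpy as np

def givens(L, mu, nu, phi):
    """Givens rotation matrix of equation 4.4."""
    Q = np.eye(L)
    Q[mu, mu] = Q[nu, nu] = np.cos(phi)
    Q[mu, nu], Q[nu, mu] = -np.sin(phi), np.sin(phi)
    return Q

rng = np.random.default_rng(42)
C = rng.standard_normal((5, 5))
C = (C + C.T) / 2                              # symmetric correlation matrix
mu, nu = 1, 3
for phi in (0.3, 1.1, -0.7):
    G = givens(5, mu, nu, phi)
    Cr = G.T @ C @ G
    for i in range(5):
        if i in (mu, nu):
            continue
        # Eq. 4.7: squared correlations with any index outside the rotation
        # plane are conserved as a pair.
        assert np.isclose(Cr[mu, i]**2 + Cr[nu, i]**2,
                          C[mu, i]**2 + C[nu, i]**2)
    # Eq. 4.8: the four entries within the rotation plane are conserved too.
    plane = lambda M: M[mu, mu]**2 + M[mu, nu]**2 + M[nu, mu]**2 + M[nu, nu]**2
    assert np.isclose(plane(Cr), plane(C))
```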
[Figure 1 appears here, with panels (a) Case 1, (b) Case 2, and (c) Case 3.]

Figure 1: Each square represents a squared cross- or autocorrelation (C_{ij}^{(u')})², where index i (j) denotes the row (column) of the square. Dark squares indicate all entries that are changed by a rotation in the µ-ν plane. L is the dimensionality of the expanded signal u', and R is the number of signal components u'_i(t) subject to optimization. The entries incorporated in the objective function are located in the upper-left corner, as indicated by the dashed line.
To optimize the objective function of ISFA, equation 4.2, we need to calculate the explicit form of the objective function \Psi_{\mathrm{ISFA}}^{\tau,\mu\nu} in equation 4.5. By inserting the Givens rotation matrix 4.4 into the objective function 4.5, and considering the case with multiple time delays, we can write the objective
as a function of the rotation angle φ,

\[
\Psi_{\mathrm{ISFA}}^{\mu\nu}(\phi) = b_{\mathrm{ICA}} \left( e_c + \sum_{\beta=0}^{4} e_\beta \cos^{4-\beta}(\phi) \sin^{\beta}(\phi) \right) - b_{\mathrm{SFA}} \left( d_c + \sum_{\alpha=0}^{4} d_\alpha \cos^{4-\alpha}(\phi) \sin^{\alpha}(\phi) \right), \qquad (4.9)
\]

with constants e_c, e_β and d_c, d_α that depend only on the C_{kl}^{(u')} before rotation. Further simplification (cf. Blaschke & Wiskott, 2004) leads to

\[
\text{Case 1:} \quad \Psi_{\mathrm{ISFA}}^{\mu\nu}(\phi) = A_0 + A_4 \cos(4\phi + \phi_4), \qquad (4.10)
\]
\[
\text{Case 2:} \quad \Psi_{\mathrm{ISFA}}^{\mu\nu}(\phi) = A_0 + A_2 \cos(2\phi + \phi_2) + A_4 \cos(4\phi + \phi_4), \qquad (4.11)
\]

with a single minimum (if, without loss of generality, φ ∈ [−π/2, π/2)), which can be calculated easily. The derivation of equations 4.10 and 4.11 involves various trigonometric identities and, because of its length, is documented in the appendix. The iterative optimization procedure with successive Givens rotations can now be described as follows:
1. Initialize Q' = I and compute C^{(u')}(τ) = C^{(y)}(τ) ∀ τ ∈ T_ICA ∪ T_SFA with equation 2.4, as well as \Psi_{\mathrm{ISFA}} with equation 4.3.
2. Choose a random permutation of the set of axis pairs: P = σ({(µ, ν), with µ ≤ R and µ < ν ≤ L}).
3. Go systematically through all axis pairs in P. For each axis pair:
   (a) Determine the optimal rotation angle φ_min for the selected axes with equation 4.10 or 4.11.
   (b) Compute the Givens rotation matrix Q^{µν}_{φ_min} defined by equation 4.4.
   (c) Update C^{(u')}(τ) using C^{(u')}(τ) → (Q^{µν})^T C^{(u')}(τ) Q^{µν}.
   (d) Update Q' according to Q' → Q^{µν} Q'.
   (e) Back up the previous objective-function value \Psi'_{\mathrm{ISFA}} = \Psi_{\mathrm{ISFA}}.
   (f) Calculate the new objective-function value \Psi_{\mathrm{ISFA}} with equation 4.3, using the updated C^{(u')}(τ) from step 3c.
   (g) Store the relative decrease of the objective-function value, |(\Psi'_{\mathrm{ISFA}} − \Psi_{\mathrm{ISFA}}) / \Psi'_{\mathrm{ISFA}}|.
4. Go to step 2 until the relative decrease of the objective function is smaller than a given threshold for all axis pairs in P.
5. Set Q = Q' and u(t) = Qy(t).
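The procedure of steps 1 to 5 can be sketched as follows. For brevity, this illustration finds each φ_min by a dense grid search over [−π/2, π/2) instead of the closed forms of equations 4.10 and 4.11, and it omits the random restarts; all names are ours:

```python
import numpy as np

def givens(L, mu, nu, phi):
    """Givens rotation Q^{mu,nu} of equation 4.4."""
    Q = np.eye(L)
    Q[mu, mu] = Q[nu, nu] = np.cos(phi)
    Q[mu, nu], Q[nu, mu] = -np.sin(phi), np.sin(phi)
    return Q

def isfa_objective(Cs, R, b_ica=1.0, b_sfa=1.0):
    """Equation 4.3 with unit kappa weights: squared off-diagonal (ICA part)
    minus squared diagonal (SFA part) correlations of the first R components,
    summed over all delayed correlation matrices in Cs."""
    val = 0.0
    for C in Cs:
        sub = C[:R, :R]
        diag = np.diag(sub)
        val += b_ica * (np.sum(sub ** 2) - np.sum(diag ** 2))
        val -= b_sfa * np.sum(diag ** 2)
    return val

def isfa_rotation(Cs, R, sweeps=10, eps=1e-12):
    """Steps 1-5: works on correlation matrices only; returns Q with u = Q y."""
    L = Cs[0].shape[0]
    Cs = [C.copy() for C in Cs]
    Q_total = np.eye(L)
    phis = np.linspace(-np.pi / 2, np.pi / 2, 360, endpoint=False)
    for _ in range(sweeps):
        improved = False
        for mu in range(R):                       # only pairs with mu <= R
            for nu in range(mu + 1, L):
                def value(phi):
                    G = givens(L, mu, nu, phi)
                    return isfa_objective([G.T @ C @ G for C in Cs], R)
                phi_min = min(phis, key=value)    # step 3a (grid search)
                G = givens(L, mu, nu, phi_min)
                old = isfa_objective(Cs, R)
                Cs = [G.T @ C @ G for C in Cs]    # step 3c
                Q_total = G.T @ Q_total           # accumulate the rotation
                new = isfa_objective(Cs, R)
                if abs((old - new) / (abs(old) + 1e-30)) > eps:
                    improved = True
        if not improved:
            break
    return Q_total

# Surrogate-style check (cf. section 5.2): delayed correlations diagonal in a
# hidden basis, obscured by a random orthogonal rotation.
rng = np.random.default_rng(1)
L, R = 4, 2
diags = [np.diag([0.9, 0.8, 0.3, 0.1]), np.diag([0.7, 0.6, 0.2, 0.05])]
A = np.linalg.qr(rng.standard_normal((L, L)))[0]
Cs = [A.T @ D @ A for D in diags]
Q = isfa_rotation(Cs, R)
# Restricted to the first R components, P should be close to a signed
# permutation, i.e., ISFA has undone the hidden rotation.
P = Q @ A.T
absP = np.sort(np.abs(P[:R, :R]).ravel())
assert np.all(absP[-R:] > 0.95) and np.all(absP[:R] < 0.1)
```

The accumulated `Q_total` plays the role of Q' in the listing; as noted in the text, only correlation matrices are updated, so the signal y(t) is needed only at the very end.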
In step 2 it is important to note that the rotation planes of the Givens rotations are selected from the whole L-dimensional space (although we avoid the irrelevant case 3 by requiring µ ≤ R; see Figure 1), whereas the objective function uses information of correlations only among the first R signal components u'_i. Since Q^{µν} is very sparse, the Givens rotation in step 3c does not require a full matrix multiplication but can be computed more efficiently. Note that the algorithm works on the intermediate correlation matrices C^{(u')}(τ), not on the signals themselves; the input signal y(t) is used only in the initialization (step 1) and at the end (step 5), when the output signal u(t) is computed. To circumvent the problem of getting stuck in local optima of the objective function, a random rotation of the outer space (ν > µ > R) can be performed after convergence in step 4, and the algorithm can be restarted at step 2.

5 Results

To evaluate the performance of ISFA, we tested the algorithm first on random matrices to check how many matrices are needed to get meaningful results, then on surrogate matrices to check that the algorithm reliably converges to the global optimum under these ideal conditions, and then on a difficult although low-dimensional mixture of audio data to show how it performs on real data. In order to reduce the problem of local optima, we use SFA as a preprocessing step. That choice follows from the empirical observation that SFA is always able to extract the first source signal. To stabilize the ISFA solutions even further, we typically run the optimization routine once with the first axis fixed and then once more following the procedure described in section 4.2. Throughout this letter, the SFA time-delay set and the weighting factors were as follows:

\[
T_{\mathrm{SFA}} = \{1\}, \qquad (5.1)
\]
\[
\kappa_{\mathrm{SFA}}^{\tau} = 1 \quad \text{for } \tau = 1, \qquad (5.2)
\]
\[
\kappa_{\mathrm{ICA}}^{\tau} = 1 \quad \forall\, \tau \in T_{\mathrm{ICA}}. \qquad (5.3)
\]
This particular choice makes it easy to interpret the ISFA objective function 4.3. The SFA part is the plain SFA objective function of equation 3.10; the ICA part is the plain ICA objective function of equation 2.3, extended to several time delays. If we chose more than one time delay for the SFA part, the interpretation in terms of slowness would become less clear (see Blaschke et al., 2006). T_ICA depends on the experiment (see below).

5.1 Tests with Random Matrices. First, consider only the ICA part of the objective function 4.3. Its purpose is to guarantee statistical independence
of the estimated sources by simultaneously diagonalizing the R × R upper left submatrices of T time-delayed L × L correlation matrices, where T is the number of elements in T_ICA. However, for the ICA term to be useful, we have to take sufficiently many matrices into account so that simultaneous submatrix diagonalization is not trivial. For instance, a single symmetric matrix can always be fully diagonalized by the orthonormal set of its eigenvectors. Thus, for R = L and T = 1, one has to take at least two matrices to avoid this spurious solution, which would be found even if there were no underlying statistically independent sources. To estimate the minimum number of matrices needed, we ran ISFA with b_SFA = 0 on randomly generated symmetric matrices A_τ, τ = 1, . . . , T, for different values of L, R, and T. The subdiagonalization was considered successful if

\[
E := \sqrt{ \left\langle A_{ij}^{2} \right\rangle_{\tau,\, j,\, i > j} },
\]

that is, the square root of the averaged squared nondiagonal terms, was below a threshold E_crit := 10^{−3}. For fixed L and R < L, we typically observe that a high degree of subdiagonalization is possible for T = 2. For T > 2, the subdiagonalization is still possible but to a lower degree with increasing T, until a critical T_crit is reached, at which the degree of subdiagonalization displays a sharp transition where E crosses the threshold E_crit and remains stable after that. The estimated critical numbers of time delays T_crit for L ∈ {9, 20} and different values of R are given in Table 1. In the simulations that follow, we have M = R = 2 and use ISFA3 and ISFA5 (ISFAn refers to ISFA with polynomials of degree n), resulting in L = 9 and L = 20, respectively. From the table, we see that with T = 50, we are well above T_crit in both cases.

Table 1: Critical Number of Time Delays, T_crit, for Different Values of L and R.

                R:    2    3    4    5    6    7    8    9   10   >10
  L = 9,  T_crit:    18    8    5    4    3    2    2    2    –    –
  L = 20, T_crit:    36   19   13    9    7    6    5    4    3    2

5.2 Tests with Surrogate Matrices. To test the performance of ISFA (now including the SFA part) in the absence of noise, finite-size effects, or any other kind of perturbation, we carried out an experiment with T > 1 surrogate matrices, prepared such that they have a unique exact solution (except for permutations). The first matrix, with τ = 1, is fully diagonal, with the diagonal elements ordered by decreasing absolute value, with the exception of the second and last elements, which are swapped. All other T − 1 matrices are random symmetric matrices with a diagonal (R + R_ICA) × (R + R_ICA) upper submatrix. SFA alone sees only the first matrix (cf. equations 5.1 and 5.2) and would favor a solution in which the last component is swapped back into the R × R subspace in place of the
small second component. ICA alone would favor any permutation of the first (R + R_ICA) components equally well, because for any of these permutations, the R × R upper submatrices are all diagonal. In this example, ICA should prevent SFA from swapping the last component into the R × R subspace, and SFA should disambiguate the many equally valid ICA solutions by selecting the largest diagonal elements, that is, the slowest components, in the first matrix. This set of matrices constitutes a fixed point for the ISFA algorithm. If we run ISFA directly on these matrices, we get Q = I. If we now apply a random rotation matrix Q_rand to the set of matrices, we would expect ISFA to find a matrix Q that inverts this rotation and returns the R original first components, but in any arbitrary order. Thus, the R × R submatrix of the product P := QQ_rand should be a permutation matrix for perfect unmixing. We performed 10,000 independent tests with R = 2, R_ICA = 2, L = 9, and T = 50, somewhat imitating the case of two nonlinearly mixed independent sources and an expansion space of all polynomials of degree three. The estimated critical number of matrices T_crit is 18. Using 50 matrices, we rule out any spurious solution, as discussed in section 5.1. As a measure of performance, we used the reconstruction error measure first introduced by Amari, Cichocki, and Yang (1995), in the formulation given in Blaschke and Wiskott (2004):

\[
E = \frac{1}{2R} \left[ \sum_{i=1}^{R} \left( \sum_{j=1}^{R} \frac{|P_{ij}|}{\max_k |P_{ik}|} - 1 \right) + \sum_{j=1}^{R} \left( \sum_{i=1}^{R} \frac{|P_{ij}|}{\max_k |P_{kj}|} - 1 \right) \right]. \qquad (5.4)
\]

An experiment is considered successful if the unmixing error is smaller than 10^{−5}. We found that ISFA always recovered the original components and that this 100% success rate was largely independent of the scaling factors b_ICA and b_SFA, which we therefore set to b_ICA = b_SFA = 1 for this experiment.

5.3 Tests with Twisted Audio Data. In the third experiment, we tested the algorithm on 171 pairs of 19 nonlinearly mixed music excerpts. The sample values of the 19 excerpts were in the range [−1, +1); the mean had an average value of (−10 ± 110) × 10^{−6} (mean ± std); the standard deviation had an average value of 0.16 ± 0.07, with minimum and maximum values of 0.02 and 0.27, respectively. One additional music excerpt was discarded because it had extreme peaks, which led to a strong nonlinear distortion due to the SFA part and low correlations with the source, even though it was in principle extracted correctly. All audio signals were 2^21 = 2,097,152 samples long and had a CD-quality sampling frequency of 44,100 Hz. We
Independent Slow Feature Analysis and Nonlinear BSS
1009
used the nonlinear mixture introduced by Harmeling et al. (2003), defined by

x_1(t) = (s_2(t) + 3 s_1(t) + 6) cos(1.5 π s_1(t)),   (5.5)
x_2(t) = (s_2(t) + 3 s_1(t) + 6) sin(1.5 π s_1(t)).   (5.6)
This is quite an extreme nonlinearity, and the unmixing performance depends strongly on the standard deviation of the sources. For the ICA part of the objective in equation 4.3, we used 50 time delays evenly spaced between 1 and 44,100, corresponding to a timescale of up to 1 second. The number of time delays is greater than the critical number T_crit, which is 18 for an expansion with polynomials of degree three and 36 for polynomials of degree five. In order to evaluate the performance of the algorithm fairly, we used linear regression to check whether the nonlinear mixture was indeed invertible within the available space. Two orthogonal directions were fit within the whitened expanded space to maximize the correlation with the original sources. Within the space of polynomials of degree three, there were a number of cases (51 examples, 30% of the total) where the two sources were not found by linear regression, which means the nonlinear mixture was not invertible within the available expanded space. This is the main reason for failures of ISFA3. Within the space of polynomials of degree five, the mixture was always invertible. The scaling factor b_SFA was kept constant and equal to 1, while b_ICA was manually tuned for each example in order to maximize the correlation between estimated and original sources. For polynomials of degree three, we tested different values of b_ICA equidistant on a logarithmic scale between 0 and 10,000. The number of tested values varied between 5 and 40, depending on how clear and robust the optimum was. For polynomials of degree one and five, we largely adopted the values found for polynomials of degree three; only if the algorithm failed with these values did we retune b_ICA with 20 equidistant values. This tuning resulted in values between 0 and 1000. A source signal is considered to be recovered if the correlation with the estimated source is greater than 0.9.
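The regression-based invertibility check described above can be sketched as follows. This is our own illustrative reconstruction (the function name and the NumPy-based setup are ours, not the authors'): each source is fitted by least squares from a polynomial expansion of the mixture, and a correlation near 1 indicates that the mixture is invertible within the expanded space.

```python
import numpy as np

def regression_check(x, s, degree=3):
    """Fit each source by least squares from a polynomial expansion of the
    mixture x (shape (2, T)) and return the correlations achieved; values
    near 1 mean the mixture is invertible within the expanded space."""
    x1, x2 = x
    # all monomials x1^a * x2^b with 1 <= a + b <= degree (9 features for degree 3)
    feats = [x1**a * x2**b
             for a in range(degree + 1)
             for b in range(degree + 1 - a) if a + b >= 1]
    F = np.stack(feats, axis=1)
    F = F - F.mean(axis=0)
    corrs = []
    for src in s:
        coef, *_ = np.linalg.lstsq(F, src - src.mean(), rcond=None)
        corrs.append(abs(np.corrcoef(F @ coef, src)[0, 1]))
    return corrs

# The spiral mixture of equations 5.5 and 5.6 on synthetic sources:
rng = np.random.default_rng(0)
s = rng.uniform(-0.3, 0.3, size=(2, 5000))
r = s[1] + 3 * s[0] + 6
x = np.stack([r * np.cos(1.5 * np.pi * s[0]),
              r * np.sin(1.5 * np.pi * s[0])])
print(regression_check(x, s, degree=3))
```

For a linear mixture, the check returns correlations of essentially 1, since the expansion contains the linear monomials; for the spiral, the result depends on the source amplitudes, mirroring the failures reported in the text.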
Scatter plots of a successful example are shown in Figure 2, and a summary of the results is given in Table 2. ISFA is able to separate the nonlinearly mixed sources in about 70% of the cases in which unmixing was possible at all. This is remarkable given the extreme nonlinearity of the mixture and a chance level of unmixing of less than 0.01%, as we have verified by numerical simulations. However, there remains a failure rate of about 30%, which is puzzling given the perfect performance on the surrogate matrices (see section 5.2). We investigate this in the next section.

5.4 Analysis of Failure Cases. Why did ISFA fail in about 30% of the cases where a good solution was available by linear regression? The values of the objective function Ψ_ISFA, equation 4.3, and its two parts, Ψ_ICA and Ψ_SFA,
Figure 2: Scatter plot of two sources, their nonlinear mixture, and the estimated sources. (a) Sources. (b) Mixture. (c) Sources estimated by ISFA5 . (d) First source versus estimated first source. (e) Second source versus estimated second source. Correlation coefficients of estimated sources and original sources were 0.996 and 0.998.
give us some information about possible reasons. Consider the following four different cases:

1. In 1 out of the 35 true failures for ISFA3, and never for ISFA5, Ψ_ISFA of the sources estimated by ISFA is greater than Ψ_ISFA of the sources estimated by linear regression. In this case, the algorithm obviously got stuck in a local optimum.

2. In 15 and 26 out of the 35 and 49 true failures for ISFA3 and ISFA5, respectively, Ψ_ISFA of the sources estimated by ISFA is smaller than Ψ_ISFA of the sources estimated by linear regression, but either Ψ_ICA or Ψ_SFA is greater than the corresponding linear-regression value. This indicates that the tuning of the weighting factors b_SFA and b_ICA might not have been fine enough. However, it could also be that there is an abrupt transition between solutions where Ψ_ICA is greater and solutions where Ψ_SFA is greater than the corresponding linear-regression value.

3. In 6 and 3 out of the 35 and 49 true failures for ISFA3 and ISFA5, respectively, Ψ_ICA and Ψ_SFA of the sources estimated by ISFA are both smaller than those of the linear-regression estimate and greater than those of the original sources. Neither a local optimum nor the weighting factors are a plausible cause for the failure in these
Table 2: Reconstruction Performance of Linear Regression and ISFA.

Number of Reconstructed Sources    REG1        REG3          REG5
2                                  5%   (8)    70%  (120)    100% (171)
1                                  54%  (93)   30%  (51)     0%   (0)
0                                  41%  (70)   0%   (0)      0%   (0)

Number of Reconstructed Sources    ISFA1       ISFA3         ISFA5
2                                  5%   (8)    50%  (85)     71%  (122)
1                                  50%  (85)   34%  (59)     18%  (30)
0                                  45%  (78)   16%  (27)     11%  (19)
% correct                          100% (8/8)  71% (85/120)  71% (122/171)
Notes: The upper part shows percentages of cases where both, one, or none of the two sources were recovered by linear regression (supervised) in the original space (REG1 ) or in the expanded space with polynomials of degree three (REG3 ) or five (REG5 ). The lower part shows the same for ISFA (unsupervised except for the tuning of b ICA ). Each entry indicates the percentage (and number) of pairs with respect to the total of 171 pairs. The last line presents the percentage of both sources recovered correctly with respect to the number of mixtures invertible within the available expanded space by linear regression. In the case of two recovered sources, chance level is always smaller than 0.01%.
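The "% correct" line can be reproduced directly from the counts in the table (fully recovered pairs divided by pairs invertible within the expanded space):

```python
# "% correct" = pairs fully recovered by ISFA / pairs invertible by regression
# (counts taken from Table 2: ISFA1 8/8, ISFA3 85/120, ISFA5 122/171)
for recovered, invertible in [(8, 8), (85, 120), (122, 171)]:
    print(f"{100 * recovered / invertible:.0f}%")  # 100%, 71%, 71%
```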
cases. It might be that the expansion was too low-dimensional and that a higher-dimensional expansion would have yielded the correct solution.

4. In 13 and 20 out of the 35 and 49 true failures for ISFA3 and ISFA5, respectively, Ψ_ICA and Ψ_SFA of the sources estimated by ISFA are both smaller than those of the original sources. In this case, the solution found is even better than the original sources in terms of the objective function, which indicates that there is something wrong with the objective function.

It might be possible to eliminate the failures of the first three cases by refining the algorithm, for example, by tuning the weighting factors better or by going to higher-order polynomials, but case 4 is more fundamental and requires reconsidering the objective function itself. In this latter case, the signals extracted by ISFA appear to be both slower and more mutually independent than the original sources. However, scatter plots of the estimated sources reveal that they are not statistically independent at all; often one is largely a function of the other (see Figures 3 and 4). Thus, the ICA part of the objective function is not strong enough to ensure statistical independence of the estimated sources. The cross-correlation functions shown in
Figure 3: Scatter plot of two sources, their nonlinear mixture, and the sources estimated by ISFA in a failure case. (a) Sources. (b) Mixture. (c) Sources estimated by ISFA3 . (d) First source versus estimated first source (correlation coefficient 0.9771). (e) First source versus estimated second source (correlation coefficient 0.0377). (f) Second source versus estimated first source (correlation coefficient 0.0197). (g) Second source versus estimated second source (correlation coefficient 0.1301).
Figure 5 indicate that this problem is not due to the specific choice of the time delays, because the time-delayed cross-correlations of the estimated sources (mean ± std = 0 ± 0.0028) are overall smaller than the ones of the original sources (0 ± 0.0066). Even using different or more time delays, such a data set would have been processed incorrectly. We conclude that any measure of independence based on time-delayed correlations would be insufficient in our context. Figure 5 suggested to us that sources with a large standard deviation of their cross-correlation function might be particularly difficult to separate with our ISFA algorithm. We tested this hypothesis but did not find a significant correlation with the failure cases. For an expansion with polynomials of degree three, even linear regression fails if the standard deviation of the first signal, which goes along the spiral, is large. For polynomials of degree five, linear regression always worked in our examples, but we suspected that separation might still be more difficult for sources with large standard deviation. But again, we did not find a significant correlation with the failure cases. We argue here that the failures must be attributed to the weakness of the ICA term in the objective function. If the SFA term were too weak, it could happen that all output signal components are truly statistically
Figure 4: Scatter plots of the sources estimated by ISFA for some failure cases. It is clear that in these cases, the signal components are not statistically independent, even though the ICA term indicates so.
independent, but at least some of them are too quickly varying, so that they are not correlated to the sources but to some nonlinearly distorted version of the sources; this we did not observe. Also, the success in detecting the failure cases based on higher-order cumulants (see the next section) indicates that the failures are due to the ICA term.

5.5 Unsupervised Detection of Failure Cases. A failure rate of about 30% (or even up to 50% for ISFA3, if one also counts the cases in which even linear regression was not able to recover the sources) is obviously not acceptable, unless one can detect the failure cases in an unsupervised manner. We use the weighted sum over the third- and fourth-order cross-cumulants,

Γ_34(u) := (1/3!) Σ_{ijk ≠ iii} (C^(u)_ijk)^2 + (1/4!) Σ_{ijkl ≠ iiii} (C^(u)_ijkl)^2,   (5.7)

as an independent measure of statistical independence to indicate with high values those cases in which the second-order ICA term has failed to yield independent output signal components. The factors 1/3! and 1/4! arise from an expansion of the Kullback-Leibler divergence in u, which provides a rigorous derivation of this criterion (Comon, 1994; McCullagh, 1987). The receiver operating characteristic (ROC) curves in Figure 6 show that Γ_34(u) is a good measure of success. These tests also included the cases where linear regression was not able to recover the sources. The area under the ROC curves is 0.952 and 0.988 for ISFA3 and ISFA5, respectively.

Figure 5: Cross-correlation functions of a failure case. (a) Cross-correlation function of the original sources. (b) Cross-correlation function of the estimated sources. Same data set as in Figure 3.

6 Conclusion

In the work presented here, we have addressed the problem of nonlinear BSS. It is known that, in contrast to the linear case, statistical independence
Figure 6: ROC curves for the test of successful source separation based on 34 (u), the weighted sum of third- and fourth-order cross-cumulants. The area under the curves is 0.952 and 0.988 for ISFA3 (dashed line) and ISFA5 (solid line), respectively.
alone is not a sufficient criterion for separating sources from a nonlinear mixture; additional criteria are needed to solve the problem of selecting the true sources (or good representatives thereof) from the many possible output signal components that would be statistically independent of other components. We claim here that for source signals with significant autocorrelations for time delay one, temporal slowness is a good criterion to solve this selection problem, because the slow components are those most likely related to the true sources by an invertible transformation; noninvertible transformations would typically lead to more quickly varying components. Based on this assumption, we have derived an objective function that combines a term from second-order independent component analysis (ICA) with a term derived from slow feature analysis (SFA). Optimization of the new objective function is achieved by successive Givens rotations, a method often used in the context of ICA. We refer to the resulting algorithm as independent slow feature analysis (ISFA) to indicate the combination of ICA and SFA. The algorithm is somewhat unusual in that only a small submatrix of large time-delayed correlation matrices is being diagonalized by the Givens rotations (usually the full matrices are being diagonalized). This opens the question of the uniqueness of the solution. Using randomly generated pseudocorrelation matrices, we have found that a minimum number of time delays is needed to obtain unique and meaningful solutions. For instance, if the upper left 2 × 2 submatrix of 9 × 9 matrices has to be diagonalized, at least 18 such matrices are needed to obtain a meaningful solution, which
would be unlikely to emerge by accident. With 17 matrices, on the other hand, good diagonalization can be achieved reliably even for random symmetric matrices. With (sufficiently many) surrogate matrices, structured such that they have a unique solution, we have subsequently verified that the algorithm reliably converges to the correct solution. With tests on quite an extreme nonlinear mixture of two audio signals, we have shown that ISFA is indeed able to perform nonlinear BSS, often with high precision. However, in about 30% of the cases in which the true sources could have been extracted with the nonlinearity used (as verified by regression), ISFA failed to extract them. In many of these cases, the extracted signals were actually better than the original sources in both the SFA term and the ICA term of the objective function. This was a surprising finding for us, since it seems to contradict our basic assumption that a combination of slowness and statistical independence should permit reliable nonlinear BSS. Closer inspection, however, has revealed that the extracted output signal components only appear to be statistically independent in terms of the time-delayed second-order moments but that they are often highly related, as can be seen by visual inspection (see Figure 4) and automatically detected with a measure Γ_34(u) based on higher-order cumulants. This is not a consequence of the particular choice of time delays we have used but would be expected for any general set of time delays, as can be seen from the cross-correlation functions (in Figure 5). We believe that two important conclusions can be drawn from these results. Firstly, the success cases indicate that combining slowness and statistical independence is a promising approach to nonlinear BSS. Secondly, any measure of statistical independence based on (time-delayed) second-order moments is too weak to guarantee statistical independence in our context; it might even be too weak in any context where the dimensionality of the space in which the signal components are searched for is significantly larger than the number of components. For a possible theoretical account of the failure of second-order ICA in our context, consider the following example. Given a symmetrically distributed source s_1, the correlation between, for instance, s_1 and s_1^2 vanishes (Harmeling et al., 2003, sec. 4.1). To the extent that this also holds for time-shifted versions s_1(t) and s_1^2(t + τ) (cf. Harmeling et al., 2003, sec. 5.4), the statistical dependence between s_1 and s_1^2 does not manifest itself in the time-delayed correlations. Thus, second-order ICA cannot be expected to prevent extraction of s_1 and s_1^2 as the estimated sources, which can easily lead to a failure case, say, if s_1^2 is more slowly varying than s_2. A failure rate of 30% would render the algorithm useless if it were not possible to detect the failure cases. We have shown that the measure Γ_34(u), which is based on higher-order cumulants, permits failure detection with high reliability; the area under the ROC curve is greater than 0.95, resulting in true-positive rates of 90% and 94% at false-positive rates of 5% and 10% for ISFA3 and ISFA5, respectively.
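The failure detector of equation 5.7 can be sketched directly. The following is our own NumPy illustration (not the authors' code); it assumes zero-mean, roughly unit-variance signals and uses the standard moment-to-cumulant formula for fourth order:

```python
import numpy as np
from itertools import product

def gamma34(u):
    """Sum of squared third- and fourth-order cross-cumulants, weighted by
    1/3! and 1/4! as in equation 5.7; all-equal index tuples are excluded.
    u: array of shape (R, T), assumed zero-mean."""
    u = np.asarray(u, dtype=float)
    R = u.shape[0]
    C2 = u @ u.T / u.shape[1]                 # second-order moments
    g = 0.0
    for i, j, k in product(range(R), repeat=3):
        if i == j == k:
            continue
        g += np.mean(u[i] * u[j] * u[k])**2 / 6.0
    for i, j, k, l in product(range(R), repeat=4):
        if i == j == k == l:
            continue
        m4 = np.mean(u[i] * u[j] * u[k] * u[l])
        # fourth-order cumulant from moments (zero-mean case)
        c4 = m4 - C2[i, j]*C2[k, l] - C2[i, k]*C2[j, l] - C2[i, l]*C2[j, k]
        g += c4**2 / 24.0
    return g

rng = np.random.default_rng(0)
s = rng.standard_normal((2, 20000))           # independent signals
dep = np.vstack([s[0], s[0]**2 - 1.0])        # second is a function of the first
print(gamma34(s), gamma34(dep))               # small vs. clearly larger
```

The dependent pair (one component a function of the other, as in the failure cases of Figure 4) yields a much larger value than the independent pair, which is exactly the property exploited for unsupervised failure detection.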
It might be possible to use Γ_34(u) not only to detect the failure cases but also to automatically tune the weight b_ICA given b_SFA = 1 and to determine the number of sources. For the former, one could start with a small value of b_ICA, so that only the SFA term is effective and the extracted components might not be independent, and then increase b_ICA, so that the ICA term becomes increasingly effective, until the value of Γ_34(u) drops below a certain threshold. Similarly, for determining the number of sources, one could start by running the algorithm with only two output components to be extracted and successively increase the number of components. One would then stop if adding another component increased Γ_34(u) significantly (which can obviously be detected only a posteriori). More interesting, however, would be to use higher-order cumulants more directly to improve the algorithm. For instance, one could define a new objective function that combines the SFA term used here with an ICA term like Γ_34(u). Given the high reliability with which Γ_34(u) can detect failure cases, we expect better performance from such a new objective function. However, higher-order cumulants are expensive to compute, especially for high-dimensional and long signals, so there is probably a trade-off between reliability and computational complexity. Exploring these possibilities will be the subject of our future research.

Appendix

The definitions of the constants d_n and e_n for the expression of the objective function, equation 4.9, follow directly from the multilinearity of C^(u)_...(τ). They are given in Table 3. Using trigonometry, we can derive simpler objective functions of the form

Case 1: Ψ^μν_ISFA(φ) = a_20 + c_24 cos(4φ) + s_24 sin(4φ),   (A.1)
Case 2: Ψ^μν_ISFA(φ) = a_20 + c_22 cos(2φ) + s_22 sin(2φ) + c_24 cos(4φ) + s_24 sin(4φ),   (A.2)

with constants defined in Table 4. In the next step, these objective functions are further simplified by combining the sine and cosine terms of each frequency into a single cosine term. This results in

Case 1: Ψ^μν_ISFA(φ) = A_0 + A_4 cos(4φ + φ_4),   (A.3)
Case 2: Ψ^μν_ISFA(φ) = A_0 + A_2 cos(2φ + φ_2) + A_4 cos(4φ + φ_4),   (A.4)

with constants defined in Table 5.
Table 3: Constants in Equation 4.9.

[Table 3 gives, separately for case 1 and case 2, the constants d_0, d_1, d_2, d_3, d_4, d_c (SFA part) and e_0, e_1, e_2, e_c (ICA part) as sums over the time delays τ ∈ T_SFA and τ ∈ T_ICA, weighted by κ^τ_SFA and κ^τ_ICA, of products of the time-delayed correlations C^(u)(τ); in case 1, d_3 = d_4 = 0.]
It is easy to see why it is possible to write both objective functions A.3 and A.4 in such a simple form. First, the terms in equation 4.9 are products of at most four sin(φ) and cos(φ) factors, which allows at most a frequency of 4. Second, in case 1, Ψ^μν_ISFA(φ) has a periodicity of π/2 because rotations by multiples of π/2 correspond to a permutation (possibly plus a sign change) of the two components. Since both components are inside the subspace, permutations do not change the objective function, and the objective function has a π/2 periodicity. Thus, we conclude that only frequencies of 0 and 4 can be present in equation A.3. In case 2, since one component lies outside the subspace, an exchange of components will change the objective function, equation A.4. A rotation by multiples of π, however, which results only in a possible sign change, will leave the objective function unchanged, resulting in an objective function with π periodicity and therefore frequencies of 0, 2, and 4.
Table 4: Constants in Equations A.1 and A.2 in Terms of the Constants of Table 3.

Case 1:
  a_20 = (b_ICA/4)(4e_c + e_2 + 3e_0) − (b_SFA/4)(4d_c + d_2 + 3d_0)
  c_24 = (b_ICA/4)(e_0 − e_2) − (b_SFA/4)(d_0 − d_2)
  s_24 = (b_ICA/4)e_1 − (b_SFA/4)d_1

Case 2:
  a_20 = (b_ICA/2)(2e_c + e_0 + e_2) − (b_SFA/8)(8d_c + 3d_0 + d_2 + 3d_4)
  c_22 = (b_ICA/2)(e_0 − e_2) − (b_SFA/2)(d_0 − d_4)
  s_22 = (b_ICA/2)e_1 − (b_SFA/4)(d_1 + d_3)
  c_24 = −(b_SFA/8)(d_0 − d_2 + d_4)
  s_24 = −(b_SFA/8)(d_1 − d_3)
Table 5: Constants in Equations A.3 and A.4 in Terms of the Constants of Table 4.

            Case 1                    Case 2
A_0         a_20                      a_20
A_2         -                         sqrt(c_22^2 + s_22^2)
A_4         sqrt(c_24^2 + s_24^2)     sqrt(c_24^2 + s_24^2)
tan(φ_2)    -                         −s_22/c_22
tan(φ_4)    −s_24/c_24                −s_24/c_24
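The single-cosine form in Table 5 is the standard amplitude-phase identity; a quick numerical check (our own illustration, with arbitrary coefficient values):

```python
import numpy as np

# c*cos(n*phi) + s*sin(n*phi) = A*cos(n*phi + psi)
# with A = sqrt(c^2 + s^2) and tan(psi) = -s/c, as in Table 5.
c, s, n = 0.7, -1.3, 4
A = np.hypot(c, s)
psi = np.arctan2(-s, c)
phi = np.linspace(0.0, 2.0 * np.pi, 1000)
lhs = c * np.cos(n * phi) + s * np.sin(n * phi)
rhs = A * np.cos(n * phi + psi)
print(np.max(np.abs(lhs - rhs)))  # numerically zero
```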
Acknowledgments This work has been supported by the Volkswagen Foundation through a grant to L.W. for a junior research group. We thank Pietro Berkes for carefully going through all constants and Barak Pearlmutter for helpful comments. All examples have been implemented in Python using the Modular Toolkit for Data Processing (freely available online at http://mdptoolkit.sourceforge.net). References Almeida, L. (2004). Linear and nonlinear ICA based on mutual information—the MISEP method. Signal Processing, 84(2), 231–245. Amari, S., Cichocki, A., & Yang, H. (1995). Recurrent neural networks for blind separation of sources. In Proc. of the Int. Symposium on Nonlinear Theory and Its Applications (NOLTA-95) (pp. 37–42). Las Vegas. Available online at http://hawk.ise. chuo-u.ac.jp/NOLTA/95/ad-pro.html. Babaie-Zadeh, M., Jutten, C., & Nayebi, K. (2002). A geometric approach for separating post-nonlinear mixtures. In Proc. of the XI European
Signal Processing Conference (EUSIPCO 2002) (pp. 11–14). Available online at http://www.eurasip.org/content/Eusipco/2002/articles/paper.701/html. Belouchrani, A., Abed Meraim, K., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique based on second order statistics. IEEE Transactions on Signal Processing, 45(2), 434–444. Blaschke, T., Berkes, P., & Wiskott, L. (2006). What is the relation between independent component analysis and slow feature analysis? Neural Computation, 18(10), 2495–2508. Blaschke, T., & Wiskott, L. (2004). CuBICA: Independent component analysis by simultaneous third- and fourth-order cumulant diagonalization. IEEE Transactions on Signal Processing, 52(5), 1250–1256. Cardoso, J.-F. (2001). The three easy routes to independent component analysis: Contrasts and geometry. In Proc. of the 3rd Int. Conference on Independent Component Analysis and Blind Source Separation (ICA 2001). San Diego, CA: Institute for Neural Computation. Cardoso, J.-F., & Souloumiac, A. (1996). Jacobi angles for simultaneous diagonalization. SIAM J. Mat. Anal. Appl., 17(1), 161–164. Comon, P. (1994). Independent component analysis: A new concept? Signal Processing, 36(3), 287–314. Harmeling, S., Ziehe, A., Kawanabe, M., & Müller, K.-R. (2003). Kernel-based nonlinear blind source separation. Neural Computation, 15, 1089–1124. Hosseini, S., & Jutten, C. (2003). On the separability of nonlinear mixtures of temporally correlated sources. IEEE Signal Processing Letters, 10(2), 43–46. Hyvärinen, A., & Pajunen, P. (1999). Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3), 429–439. Jutten, C., & Karhunen, J. (2003). Advances in nonlinear blind source separation. In Proc. of the 4th Int. Symposium on Independent Component Analysis and Blind Signal Separation (pp. 245–256). Nara, Japan. McCullagh, P. (1987). Tensor methods in statistics. London: Chapman and Hall.
Molgedey, L., & Schuster, H. G. (1994). Separation of a mixture of independent signals using time delayed correlations. Physical Review Letters, 72(23), 3634–3637. Sprekeler, H., Zito, T., & Wiskott, L. (2007). Understanding slow feature analysis: A mathematical framework. Humboldt University. Manuscript in preparation. Taleb, A. (2002). A generic framework for blind source separation in structured nonlinear models. IEEE Transactions on Signal Processing, 50(8), 1819–1830. Taleb, A., & Jutten, C. (1997). Nonlinear source separation: The postnonlinear mixtures. In Proc. European Symposium on Artificial Neural Networks, Bruges, Belgium (pp. 279–284). Available online at http://www.dice.ucl.ac.be/esann/proceedings/papers.php?ann-1997. Taleb, A., & Jutten, C. (1999). Source separation in post non linear mixtures. IEEE Transactions on Signal Processing, 47(10), 2807–2820. Tong, L., Liu, R., Soon, V. C., & Huang, Y.-F. (1991). Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems, 38(5), 499–509. Wiskott, L., & Sejnowski, T. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770. Yang, H.-H., Amari, S., & Cichocki, A. (1998). Information-theoretic approach to blind separation of sources in non-linear mixture. Signal Processing, 64(3), 291–300.
Ziehe, A., Kawanabe, M., Harmeling, S., & Müller, K.-R. (2003). Blind separation of post-nonlinear mixtures using linearizing transformations and temporal decorrelation. Journal of Machine Learning Research, 4, 1319–1338. Ziehe, A., & Müller, K.-R. (1998). TDSEP—an efficient algorithm for blind separation using time structure. In Proc. of the 8th Int. Conference on Artificial Neural Networks (ICANN'98) (pp. 675–680). Berlin: Springer Verlag.
Received September 17, 2004; accepted July 21, 2006.
LETTER
Communicated by Laurenz Wiskott
A Maximum-Likelihood Interpretation for Slow Feature Analysis Richard Turner
[email protected] Maneesh Sahani
[email protected] Gatsby Computational Neuroscience Unit, University College London, London, WC1N 3AR, U.K.
The brain extracts useful features from a maelstrom of sensory information, and a fundamental goal of theoretical neuroscience is to work out how it does so. One proposed feature extraction strategy is motivated by the observation that the meaning of sensory data, such as the identity of a moving visual object, is often more persistent than the activation of any single sensory receptor. This notion is embodied in the slow feature analysis (SFA) algorithm, which uses "slowness" as a heuristic by which to extract semantic information from multidimensional time series. Here, we develop a probabilistic interpretation of this algorithm, showing that inference and learning in the limiting case of a suitable probabilistic model yield exactly the results of SFA. Similar equivalences have proved useful in interpreting and extending comparable algorithms such as independent component analysis. For SFA, we use the equivalent probabilistic model as a conceptual springboard with which to motivate several novel extensions to the algorithm.

1 Introduction

The meaning of sensory information often varies more slowly than the activity of low-level sensory receptors. For example, the output of photoreceptors directed toward a swaying tree on a bright windy day may flicker, but the identity and percept of the tree remain invariant. Observations of this sort motivate temporal constancy, or slowness, as a useful learning principle to extract meaningful higher-level descriptions of sensory data (Hinton, 1989; Földiák, 1991; Mitchison, 1991; Stone, 1996).

The slowness learning principle is at the core of the slow feature analysis (SFA) algorithm (Wiskott & Sejnowski, 2002). SFA linearly extracts slowly varying, uncorrelated projections of multidimensional time-series data, ordered by their slowness. When SFA is trained on a nonlinear expansion of a video of natural scene patches, the filter outputs are found to resemble the receptive fields of complex cells (Berkes & Wiskott, 2005).

Neural Computation 19, 1022–1038 (2007)
© 2007 Massachusetts Institute of Technology

Slowness and decorrelation of features (in SFA and similar algorithms, e.g., Kayser, Körding, & König, 2003; Körding, Kayser, Einhäuser, & König, 2004) thus provide an interesting alternative heuristic to sparseness and independence for recovering receptive fields similar to those observed in the visual system (e.g., Olshausen & Field, 1996; Bell & Sejnowski, 1997).

This letter provides a probabilistic interpretation for SFA. Such a perspective has been useful for both principal component analysis (PCA; see Tipping & Bishop, 1999) and, in particular, independent component analysis (ICA; see MacKay, 1999; Pearlmutter & Parra, 1997), motivating many new learning algorithms (covariant, variational; see Miskin, 2000) and generalizations of the model, such as independent factor analysis (Attias, 1999) and gaussian scale mixture priors (Karklin & Lewicki, 2005). More generally, probabilistic models have several desirable features. For instance, they force the tacit assumptions of algorithms to be made explicit, allowing them to be criticized and improved more easily. Furthermore, a number of general tools have been developed for learning and inference in such models, for example, methods to handle missing data (Dempster, Laird, & Rubin, 1977). The probabilistic framework is a powerful one. Fortunately, it is intuitive too, and heuristics such as sparseness and slowness can naturally be translated into probabilistic priors. Indeed, previous work has illustrated how to combine the two approaches into a common probabilistic model (Hyvärinen, Hurri, & Väyrynen, 2003; Hurri & Hyvärinen, 2003). The advantage of the probabilistic approach is that it softens the heuristics and allows them to trade off against one another. One of the contributions of this letter is to place SFA into a proper context within this common framework. In the following, we introduce SFA and provide a geometrical picture of the algorithm.
Next, we develop intuition for a class of models in which maximum-likelihood learning has a similar flavor to SFA, before proving exact equivalence under certain conditions. In the final section, we use this maximum-likelihood interpretation to motivate a number of interesting probabilistic extensions to the SFA algorithm.

2 Slow Feature Analysis

Formally, the SFA algorithm can be defined as follows. Given an M-dimensional time series x_t, t = 1, . . . , T, find M weight vectors {w_m}_{m=1}^M such that the average squared temporal difference of the output signals, ŷ_{m,t} = w_m^T x_t, is minimized:

ŵ_m = arg min_{w_m} ⟨(ŷ_{m,t} − ŷ_{m,t−1})^2⟩,   (2.1)

given the constraint

⟨ŷ_{m,t} ŷ_{n,t}⟩ − ⟨ŷ_{m,t}⟩⟨ŷ_{n,t}⟩ = δ_mn.   (2.2)
1024
R. Turner and M. Sahani
Without loss of generality, we can assume the time series has zero mean, hence $\langle \hat{y}_{m,t} \rangle = 0$. The constraint is important in that it is necessary to prevent the trivial solution $\hat{y}_{m,t} = 0$ and to ensure that the output signals report different aspects of the stimulus: $\hat{y}_{m,t} \neq \hat{y}_{n,t}$. Furthermore, it also imposes an ordering on the solutions. We shall see below that the constraint plays as important a role in shaping the results of SFA as does the minimization itself. In the following, we use $\dot{\hat{y}}$ to represent the time differences of the features, $\hat{y}_t - \hat{y}_{t-1}$, and similarly $\dot{\mathbf{x}}$ for the data time differences. With SFA defined in this way, it can be shown that the optimal weights satisfy the generalized eigenvalue equation,

$$\mathbf{W} \mathbf{A} = \bar{\mathbf{\Omega}}^2 \mathbf{W} \mathbf{B}, \tag{2.3}$$

where we have defined $\hat{\mathbf{y}}_t = \mathbf{W} \mathbf{x}_t$ (that is, the matrix $\mathbf{W}$ collects the weight vectors $\mathbf{w}_m$), $A_{mn} = \langle \dot{x}_m \dot{x}_n \rangle$, $B_{mn} = \langle x_m x_n \rangle$, and $\bar{\Omega}^2_{mn} = \bar{\omega}^2_m \delta_{mn}$. The slow features are ordered from slowest to fastest by the eigenvalues $\bar{\omega}^2_m$. As there are efficient methods for solving the generalized eigenvalue equation, learning as well as inference is very fast.

3 A Geometric Perspective

SFA has a useful geometric interpretation that makes clear connections to other algorithms such as PCA. The constraint of equation 2.2 requires that multiplication by the SFA weight matrix spatially whiten or "sphere" the data. In general, such a sphering matrix is given by the product of the inverse square root of the covariance matrix with an arbitrary orthogonal transform:

$$\mathbf{W} = \mathbf{R} \mathbf{B}^{-1/2}. \tag{3.1}$$
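As a concrete numerical illustration of equation 2.3 (a minimal sketch of our own, not code from this letter; the sinusoidal toy sources and the mixing matrix are arbitrary assumptions), the generalized eigenvalue problem can be solved with standard routines, and the resulting weights automatically satisfy the sphering property of equation 3.1:

```python
import numpy as np
from scipy.linalg import eigh

# Toy data: a slow and a fast sinusoid, linearly mixed (an assumed example).
T = 2000
t = np.arange(T)
sources = np.stack([np.sin(2 * np.pi * t / 400),   # slow source
                    np.sin(2 * np.pi * t / 20)])   # fast source
X = (np.array([[1.0, 0.5], [-0.3, 1.0]]) @ sources).T   # T x M observations
X = X - X.mean(axis=0)                                   # zero mean, as assumed

dX = np.diff(X, axis=0)        # temporal differences
A = dX.T @ dX / (T - 1)        # covariance of the temporal differences
B = X.T @ X / T                # covariance of the data

# Solve A w = omega^2 B w.  eigh returns eigenvalues in ascending order,
# with eigenvectors normalized so that W B W^T = I (the constraint, eq. 2.2).
omega2, V = eigh(A, B)
W = V.T                        # rows are the weight vectors, slowest first
Y = X @ W.T                    # extracted features y_t = W x_t
```

Because `eigh` returns B-orthonormal eigenvectors, the sphering constraint holds by construction here, and the slowest extracted feature recovers the slow sinusoid up to sign and scale.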
Geometrically, as illustrated in Figure 1, the rows of $\mathbf{W}$ are constrained to lie on a hyper-ellipsoid, but we are free to choose their rotation $\mathbf{R}$ (Särelä & Valpola, 2005). PCA makes one particular choice, $\mathbf{R} = \mathbf{I}$, and thus the rows of the principal loading matrix (analogous to $\mathbf{W}$) are parallel to the eigenvectors of $\mathbf{B}$. They may be ordered by the eigenvalues of $\mathbf{B}$, which give the variances of the data in each principal direction. Various ICA algorithms choose a nontrivial rotation $\mathbf{R}$ so as to maximize kurtosis or some other statistic of the projected data. In this case, a natural ordering is given by the value of the projected statistic. By contrast, SFA chooses a rotation based on dynamical information, using the normalized eigenvectors of the covariance matrix of the temporal
Maximum-Likelihood Interpretation for Slow Feature Analysis
Figure 1: The geometrical interpretation of PCA and SFA in two dimensions. The left column illustrates PCA and the right column SFA. (A) The data $x_{1,t}, x_{2,t}$ (black dots) and the temporal differences $\dot{x}_{1,t}, \dot{x}_{2,t}$ (gray dots). (B) Data space, left panel: the covariance of the data illustrated schematically by a black ellipse, $\mathbf{x}^T \mathbf{B} \mathbf{x} = 1$; the principal axes are shown by the black lines. Right: the covariance of the data, and the covariance of the time derivatives (gray dotted ellipse), $\dot{\mathbf{x}}^T \mathbf{A} \dot{\mathbf{x}} = 1$. (C) Sphered (PCA) space resulting from the linear transform $\mathbf{B}^{-T/2}$ (a rotation determined by the eigenvectors of $\mathbf{B}$ and a rescaling by the square root of the corresponding eigenvalues). Left panel: the sphered covariance matrix (black circle), the principal axes of which (black lines) are now of equal length. Right panel: the sphered covariance and the transformed covariance of the time derivatives (gray dotted ellipse), the principal axes of which (gray dotted lines) are not aligned with the axes of the sphered space. (D) SFA space: an additional rotation $\mathbf{R}$ is made to align the axes of the space with the transformed time difference covariance. (E) PCA and SFA weights $\mathbf{W}$ shown in weight space. Other choices for the rotation matrix correspond to points on the ellipsoid chosen to be orthogonal in the metric of the ellipse.
differences of the sphered data.¹ This can be seen by substituting equation 3.1 into equation 2.3, leading to $\mathbf{R} \mathbf{B}^{-1/2} \mathbf{A} \mathbf{B}^{-1/2} = \bar{\mathbf{\Omega}}^2 \mathbf{R}$. Geometrically, SFA corresponds to transforming into the sphered (PCA) space and then rotating again to align the axes of the transformed data with the principal axes of the transformed time differences. The natural ordering is then given by the eigenvalues of the temporal-difference covariance matrix. If each projection $\hat{y}_t$ varies sinusoidally in time, the eigenvalues are the squares of the corresponding frequencies. More generally, the eigenvalues give the average squared frequency of the extracted features, where the average is taken over the normalized spectral power density, that is,

$$\bar{\omega}_n^2 = \frac{\sum_\omega \omega^2\, |\tilde{y}_n(\omega)|^2}{\sum_{\omega'} |\tilde{y}_n(\omega')|^2}, \tag{3.2}$$
where $\tilde{y}_n(\omega)$ is the $\omega$-component of the discrete Fourier transform of the $n$th feature.² The geometric perspective makes clear one important difference between PCA and SFA: PCA is dependent on the relative scales of the measurements, but SFA is not. That is, rescaling a subset of dimensions of the observations $x_{n,1:T}$ alters the eigenvectors of the covariance matrix, and consequently the extracted time series $\mathbf{y}_{1:T}$ that PCA recovers is also changed. Critically, SFA finds an interesting rotation after sphering the data, and so the extracted time series $\mathbf{y}_{1:T}$ that it recovers is insensitive to the rescaling. For this reason, SFA can be used meaningfully with an expanded set of pseudo-observations (whose scale is arbitrary), while PCA cannot. This is a property that SFA shares with other algorithms such as factor analysis (Roweis & Ghahramani, 1999).

4 Toward a Probabilistic Interpretation

The very notion of extracting features based on temporal slowness is captured naturally in the generative modeling framework, and this suggests that SFA might be amenable to such an interpretation. To show the correspondence between the SFA algorithm and maximum-likelihood learning of a particular probabilistic latent variable model (perhaps in a particular limit), we interpret SFA's weights as recognition weights derived from the
¹ However, some variants of ICA are closely related; see Blaschke, Berkes, and Wiskott (2006).
² To prove this result, note that equation 2.3 gives $\mathbf{w}_n^T \mathbf{A} \mathbf{w}_n = \langle \dot{\hat{y}}_n^2 \rangle = \bar{\omega}_n^2 \langle \hat{y}_n^2 \rangle$. By Parseval's theorem, $\sum_t |\hat{y}_{n,t}|^2 = \sum_\omega |\tilde{y}_n(\omega)|^2$ and (using the Fourier transform of the derivative operator) $\sum_t |\dot{\hat{y}}_{n,t}|^2 = \sum_\omega |\omega\, \tilde{y}_n(\omega)|^2$. Substituting these in for the two averages gives equation 3.2.
statistical inverse of a generative model. The output signals then have a natural interpretation as the mean or mode of the posterior distribution over latent variables, given the data. Our goal is to unpack the assumptions implicit in SFA’s cost function and constraints and intuitively build corresponding factors into a simple candidate form of generative model. This will then be formally analyzed to rederive the SFA algorithm under certain conditions. One benefit of this new perspective relates to the hard constraints that had to be introduced to prevent SFA from recovering trivial solutions. In a probabilistic model, softened versions of these constraints might be expected to arise naturally. For example, in SFA, the scale of the output signals must be set by hand to prevent them from being shrunk away by the smoothness objective function. The temporal difference cost function depends on only the recovered signals, and there is no penalty for eliminating information about the measured data. In a probabilistic setting, a soft volume penalty arising from a combination of the normalizing constant of the prior and the information-preserving likelihood term will prevent shrinking in an automatic manner. However, one of the consequences of such a soft constraint is that the scale of each of the latents may not be identical. Fortunately, this ambiguity does not affect the ordering of the solutions. Similarly, the other constraint of SFA—that the output signals be decorrelated—emerges by choosing a decorrelated prior. Indeed, as with probabilistic PCA and ICA, we choose a fully factored prior distribution on the latent sources. Our ultimate choice will be gaussian, so that decorrelation and independence are the same at each time-step, although the factored form also implies that the processes are decorrelated at different time steps. Thus, consideration of the SFA constraints suggests a factored prior over the latent variables. 
The cost function, which penalizes the sum of squared differences between adjacent time steps, further specifies the form of the prior on each latent time series. The squared cost is consistent with an underlying gaussian distribution, while the penalty on adjacent differences suggests a Markov chain structure. A reasonable candidate is thus a one-time-step linear-gaussian dynamical system (or AR(1) autoregressive model):

$$p\left(\mathbf{y}_t \mid \mathbf{y}_{t-1}, \lambda_{1:N}, \sigma^2_{1:N}\right) = \prod_{n=1}^N p\left(y_{n,t} \mid y_{n,t-1}, \lambda_n, \sigma_n^2\right)$$
$$p\left(y_{n,t} \mid y_{n,t-1}, \lambda_n, \sigma_n^2\right) = \mathrm{Norm}\left(\lambda_n y_{n,t-1},\ \sigma_n^2\right)$$
$$p\left(y_{n,1} \mid \sigma_{n,1}^2\right) = \mathrm{Norm}\left(0,\ \sigma_{n,1}^2\right). \tag{4.1}$$
Intuitively, the λn control the strength of the correlations between the latent variables at different time points and therefore their slowness. If λn = 0, then successive variables are uncorrelated, and the processes vary
rapidly. As $\lambda_n \to 1$, the processes become progressively more correlated and hence slower. More generally, if $|\lambda_n| < 1$, then after a long time ($t \to \infty$), the prior distribution of each latent series settles into a stationary state with temporal covariance, or autocorrelation, given by

$$\langle y_{n,t}\, y_{n,t-\tau} \rangle \to \frac{\sigma_n^2}{1 - \lambda_n^2}\, \lambda_n^{|\tau|}. \tag{4.2}$$
The "effective memory" of this prior (defined as the time difference over which correlations fall by $1/e$) is thus $\tau_{\mathrm{eff}} = -1/\log \lambda_n$. At first glance, the choice of $|\lambda_n| < 1$ would seem to suggest that each latent process shrinks toward 0 with time, but this is clearly at odds with the stationary distribution derived above. This apparent conflict is resolved by noting that while the mean of the one-step conditional distribution is indeed smaller than the preceding value, $\langle y_{n,t} \mid y_{n,t-1} \rangle = \lambda_n y_{n,t-1}$, the second moment includes the effects of the innovations process: $\langle y_{n,t}^2 \mid y_{n,t-1} \rangle = \lambda_n^2 y_{n,t-1}^2 + \sigma_n^2$. Thus, if $y_{n,t-1}^2 < \frac{\sigma_n^2}{1 - \lambda_n^2}$, the conditional expected square is larger than $y_{n,t-1}^2$. Indeed, it is clear that the only way to achieve a stationary AR(1) process with nonzero innovations noise is to choose $|\lambda_n| < 1$. Two useful corollaries follow from equation 4.2. First, it is possible to express the prior expectation that each latent process is stationary from the outset (that is, without an initial transient) by choosing its initial distribution to match the long-run variance:

$$\sigma_{n,1}^2 = \frac{\sigma_n^2}{1 - \lambda_n^2}. \tag{4.3}$$
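The stationary statistics of equations 4.2 and 4.3 are easy to check by simulation (a sketch of ours; the values $\lambda = 0.9$ and the unit stationary variance are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.9
T = 200_000
sigma2 = 1 - lam**2          # innovations variance chosen so that the
                             # stationary variance sigma2 / (1 - lam^2) is 1

y = np.empty(T)
y[0] = rng.normal(0, 1.0)    # eq. 4.3: start in the stationary distribution
for t in range(1, T):
    y[t] = lam * y[t - 1] + rng.normal(0, np.sqrt(sigma2))

var_emp = y.var()                    # should approach sigma2 / (1 - lam^2) = 1
ac1_emp = np.mean(y[1:] * y[:-1])    # lag-1 autocovariance, should approach lam
tau_eff = -1 / np.log(lam)           # "effective memory", about 9.5 steps here
```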
This form will be assumed in the following, but it is not essential to the derivation. Second, by choosing $\sigma_n^2 = 1 - \lambda_n^2$, we can set the stationary variance of the prior to one, a fact that we will make use of later. It is important to note here that we have so far discussed the properties of the prior $p(\mathbf{y}_{1:T})$, and that these may generally be quite different from the properties of the posterior $p(\mathbf{y}_{1:T} \mid \mathbf{x}_{1:T})$. In addition to the prior on the latent variables, the complete specification of a generative model requires a (potentially probabilistic) mapping from the latents to the observations. This is constrained by noting that inference in SFA is instantaneous: a feature at time $t$ is derived through a linear combination of only the current observations, without reference to earlier or later observed data. In the general Markov dynamical model, accurate inference requires information from multiple time points. For example, adding a linear-gaussian output mapping to the latent model (see equation 4.1) leads to inference through the well-known Kalman smoothing recursions in time. The only way for instantaneous linear recognition to be optimal is for the mapping from latents to observations to be deterministic and
linear; the matrix of generative weights is then the inverse of the recognition matrix $\mathbf{W}$. It is useful to view this deterministic process as the limit of a probabilistic mapping (Tipping & Bishop, 1999), and a natural choice is

$$p(\mathbf{x}_t \mid \mathbf{y}_t, \mathbf{W}, \sigma_x) = \mathrm{Norm}\left(\mathbf{W}^{-1} \mathbf{y}_t,\ \sigma_x^2 \mathbf{I}\right), \tag{4.4}$$
where the deterministic mapping is recovered as $\sigma_x^2 \to 0$. This completes the specification of the model, which is a member of the linear gaussian state-space models (Roweis & Ghahramani, 1999), in the limit of zero observation noise. We have described why it might have the same flavor as SFA, but it is certainly not clear a priori under what restrictions this equivalence will hold. For instance, we might worry that peculiar settings of the transition dynamics and state noise might be required. Fortunately, these conditions can be derived analytically and are found to be surprisingly unrestrictive. Denoting the parameters $\theta = \{\sigma^2_{1:N}, \lambda_{1:N}, \sigma_x^2, \mathbf{W}\}$, we first form the likelihood function:

$$p(\mathbf{x}_{1:T} \mid \theta) = \int d\mathbf{y}_{1:T} \prod_{t=1}^T p(\mathbf{x}_t \mid \mathbf{y}_t, \mathbf{W}, \sigma_x^2) \times \prod_{n=1}^N \left[ p\left(y_{n,1} \mid \sigma_{n,1}^2\right) \prod_{t=2}^T p\left(y_{n,t} \mid y_{n,t-1}, \lambda_n, \sigma_n^2\right) \right] \tag{4.5}$$

$$= \int d\mathbf{y}_{1:T} \prod_{t=1}^T \delta\left(\mathbf{x}_t - \mathbf{W}^{-1} \mathbf{y}_t\right) \times \prod_{n=1}^N \frac{1}{Z} \exp\left( - \frac{1}{2\sigma_{n,1}^2}\, y_{n,1}^2 - \frac{1}{2\sigma_n^2} \sum_{t=2}^T \left[ y_{n,t} - \lambda_n y_{n,t-1} \right]^2 \right), \tag{4.6}$$

where we have taken the limit $\sigma_x^2 \to 0$ in the likelihood term. Completing the integrals, the log likelihood is

$$\mathcal{L}(\theta) = \log p(\mathbf{x}_{1:T} \mid \theta) \tag{4.7}$$

$$= c + T \log |\det \mathbf{W}| - \sum_{n=1}^N \left\{ \frac{1}{2\sigma_{n,1}^2} \left(\mathbf{w}_n^T \mathbf{x}_1\right)^2 + \frac{1}{2\sigma_n^2} \sum_{t=2}^T \left[ \mathbf{w}_n^T \mathbf{x}_t - \lambda_n \mathbf{w}_n^T \mathbf{x}_{t-1} \right]^2 \right\}, \tag{4.8}$$
where the constant $c$ is independent of the weights. Expanding the square yields

$$\mathcal{L}(\theta) = c + T \log |\det \mathbf{W}| - \sum_{n=1}^N \frac{1}{2\sigma_n^2} \left\{ \frac{\sigma_n^2}{\sigma_{n,1}^2}\, \mathbf{w}_n^T \mathbf{x}_1 \mathbf{x}_1^T \mathbf{w}_n + \mathbf{w}_n^T \sum_{t=2}^T \mathbf{x}_t \mathbf{x}_t^T\, \mathbf{w}_n + \lambda_n^2\, \mathbf{w}_n^T \sum_{t=1}^{T-1} \mathbf{x}_t \mathbf{x}_t^T\, \mathbf{w}_n - \lambda_n\, \mathbf{w}_n^T \sum_{t=2}^T \left( \mathbf{x}_t \mathbf{x}_{t-1}^T + \mathbf{x}_{t-1} \mathbf{x}_t^T \right) \mathbf{w}_n \right\}. \tag{4.9}$$

The following identities are then useful:

$$\sum_{t=2}^T \mathbf{x}_t \mathbf{x}_t^T = T\mathbf{B} - \mathbf{x}_1 \mathbf{x}_1^T \tag{4.10}$$
$$\sum_{t=1}^{T-1} \mathbf{x}_t \mathbf{x}_t^T = T\mathbf{B} - \mathbf{x}_T \mathbf{x}_T^T \tag{4.11}$$
$$\sum_{t=2}^T \left( \mathbf{x}_t \mathbf{x}_{t-1}^T + \mathbf{x}_{t-1} \mathbf{x}_t^T \right) = 2T\mathbf{B} - (T-1)\mathbf{A} - \mathbf{x}_1 \mathbf{x}_1^T - \mathbf{x}_T \mathbf{x}_T^T, \tag{4.12}$$

where we remind the reader,

$$A_{mn} = \langle \dot{x}_m \dot{x}_n \rangle = \frac{1}{T-1} \sum_{t=1}^{T-1} (x_{m,t+1} - x_{m,t})(x_{n,t+1} - x_{n,t})$$
$$B_{mn} = \langle x_m x_n \rangle = \frac{1}{T} \sum_{t=1}^T x_{m,t}\, x_{n,t}.$$

Substituting and collecting terms, this yields

$$\mathcal{L}(\theta) = c + T \log |\det \mathbf{W}| - \sum_{n=1}^N \frac{T}{2\sigma_n^2}\, \mathbf{w}_n^T \left[ \mathbf{B}(1 - \lambda_n)^2 + \frac{T-1}{T}\, \mathbf{A} \lambda_n + \frac{\lambda_n (1 - \lambda_n)}{T} \left( \mathbf{x}_1 \mathbf{x}_1^T + \mathbf{x}_T \mathbf{x}_T^T \right) \right] \mathbf{w}_n. \tag{4.13}$$

As the number of observations increases, $T \to \infty$, the relative contribution from edge effects reduces: $\frac{\lambda_n (1 - \lambda_n)}{T} \left( \mathbf{x}_1 \mathbf{x}_1^T + \mathbf{x}_T \mathbf{x}_T^T \right) \to 0$ and $\frac{T-1}{T} \to 1$.
Therefore, assuming a large number of data points,

$$\mathcal{L}(\theta) \approx c + T \log |\det \mathbf{W}| - \sum_n \frac{T}{2\sigma_n^2}\, \mathbf{w}_n^T \left[ \mathbf{B}(1 - \lambda_n)^2 + \mathbf{A} \lambda_n \right] \mathbf{w}_n \tag{4.14}$$
$$= c + T \log |\det \mathbf{W}| - \frac{T}{2} \operatorname{tr}\left[ \mathbf{W} \mathbf{B} \mathbf{W}^T \mathbf{\Lambda}^{(2)} + \mathbf{W} \mathbf{A} \mathbf{W}^T \mathbf{\Lambda}^{(1)} \right], \tag{4.15}$$

where $\Lambda^{(1)}_{mn} = \delta_{mn} \frac{\lambda_n}{\sigma_n^2}$ and $\Lambda^{(2)}_{mn} = \delta_{mn} \frac{(1 - \lambda_n)^2}{\sigma_n^2}$. Differentiating this expression with respect to the recognition weights (assuming the determinant is not zero) recovers the following condition:

$$\frac{d\mathcal{L}(\theta)}{d\mathbf{W}} \propto \mathbf{W}^{-T} - \left( \mathbf{\Lambda}^{(2)} \mathbf{W} \mathbf{B} + \mathbf{\Lambda}^{(1)} \mathbf{W} \mathbf{A} \right). \tag{4.16}$$

Setting this to 0 and rearranging, we obtain a condition on the maximum:

$$\langle \hat{y}_{m,t}\, \hat{y}_{n,t} \rangle = \frac{\sigma_n^2}{(1 - \lambda_n)^2}\, \delta_{mn} - \frac{\lambda_n}{(1 - \lambda_n)^2}\, \langle \dot{\hat{y}}_{m,t}\, \dot{\hat{y}}_{n,t} \rangle, \tag{4.17}$$
where, as defined earlier, $\hat{y}_{m,t} = \mathbf{w}_m^T \mathbf{x}_t$. Now, as the first term on the right-hand side of equation 4.17 is diagonal, symmetry of the left-hand side requires that, for off-diagonal terms,

$$\frac{\lambda_n}{(1 - \lambda_n)^2}\, \langle \dot{\hat{y}}_{m,t}\, \dot{\hat{y}}_{n,t} \rangle = \frac{\lambda_m}{(1 - \lambda_m)^2}\, \langle \dot{\hat{y}}_{n,t}\, \dot{\hat{y}}_{m,t} \rangle. \tag{4.18}$$
But if we choose $\lambda_m \neq \lambda_n$ for $m \neq n$, this condition can be met only when the covariance matrix of the latent temporal differences is diagonal, which makes the covariance of the latents themselves diagonal by equation 4.17. This implies that we have recovered the SFA directions, but not necessarily the correct scaling. In other words, the most probable output signals are decorrelated but not of equal power; further rescaling is necessary to achieve sphering. This is a consequence of the soft volume constraints and might be a desirable result for reasons discussed earlier. Surprisingly, the maximum-likelihood weights do not depend on the exact setting of the $\lambda_n$, so long as they are all different. If $0 < \lambda_n < 1$ for all $n$, then larger values of $\lambda_n$ correspond to slower latents. This corresponds directly to the ordering of the solutions from SFA. To recover exact equivalence to SFA, another limit is required that corrects the scales. There are several choices, but a natural one is to let $\sigma_n^2 = 1 - \lambda_n^2$ (which fixes the prior covariance of the latent chains to be one, as discussed earlier) and then take the limit $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_M \to 0$,
which implies

$$\langle \hat{y}_{m,t}\, \hat{y}_{n,t} \rangle \to (1 + 2\lambda_n)\, \delta_{mn} - \lambda_n \langle \dot{\hat{y}}_{m,t}\, \dot{\hat{y}}_{n,t} \rangle \to \delta_{mn}. \tag{4.19}$$
Using the geometric intuitions of section 3, at the limit the weights sphere the inputs and therefore lie somewhere on the shell of the hyper-ellipsoid defined by the data covariance. However, as the limit is taken, the weights must be parallel to those of SFA (as we have argued above). Provided the behavior is smooth, as the perturbation vanishes, the weights must drop onto the hyper-ellipsoid at precisely the point specified by SFA. The analytical results presented here have been verified using computer simulations, for instance, using the expectation-maximization (EM; Dempster et al., 1977) algorithm and annealing the output noise and transition matrices appropriately so as to take the approximate limit (see Figure 2, top). However, learning by EM is very slow when the variance of output noise $\sigma_x^2$ is small; this is alleviated somewhat by the annealing process, but other likelihood optimization schemes might be more suitable. In the above analysis, we thought of the inputs $\mathbf{x}_{1:T}$ as observations from some temporal process. However, the argument we have presented is agnostic to the actual source of the inputs. This means, for example, that the recognition model and learning as presented above are formally equivalent to SFA even when the inputs $\mathbf{x}_{1:T}$ themselves are formed from nonlinear combinations of observations (as is usually the case in applications of the SFA algorithm). However, the generative model is not likely to correspond to a particularly sensible selection of assumptions in this case, as will be discussed in the next section.
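The correspondence can also be checked end to end: sample latents from the factored AR(1) prior with distinct $\lambda_n$, mix them through a fixed generative matrix, and confirm that SFA recovers the slow latent. This is a sketch of our own, not the authors' simulation code; the mixing matrix and parameter values are arbitrary assumptions:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(2)
T = 10_000
lams = np.array([0.99, 0.5])          # one slow latent, one fast latent

# Sample the factored AR(1) prior with unit stationary variance.
Y = np.zeros((T, 2))
Y[0] = rng.normal(size=2)
for t in range(1, T):
    Y[t] = lams * Y[t - 1] + rng.normal(size=2) * np.sqrt(1 - lams**2)

G = np.array([[2.0, 1.0], [0.5, -1.5]])   # generative weights W^{-1} (assumed)
X = Y @ G.T                               # noise-free observations, sigma_x -> 0
X = X - X.mean(axis=0)

dX = np.diff(X, axis=0)
A = dX.T @ dX / (T - 1)
B = X.T @ X / T
omega2, V = eigh(A, B)                    # SFA as in section 2
Yhat = X @ V                              # recovered features, slowest first

# The slowest SFA feature should match the slow latent up to sign and scale.
corr = np.corrcoef(Yhat[:, 0], Y[:, 0])[0, 1]
```

As the derivation predicts, the recovered directions coincide with the SFA directions regardless of the exact values of the $\lambda_n$, as long as they are distinct.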
This probabilistic version improves over standard SFA when there is substantial observation noise and if there are missing data, even providing a natural measure of confidence under these conditions (see Figure 2). More radical variants are also interesting. 5.1 Bilinear Extensions. Loosely speaking, there are at least two broad types of slow features in images: object identity (what, or content information) and object location and pose (where, or style information), and there is evidence that there are parallel pathways for the processing of each sort in the brain. Bilinear learning algorithms have recently been developed in which the effects of the two are segregated (Tenenbaum & Freeman, 2000;
Figure 2: Comparisons between slow feature analysis and probabilistic SFA on three data sets drawn from the forward model. Left column: The threedimensional observations. Right column: The slowest latent variable/extracted feature. The dashed line is the actual value of the slowest latent, the gray line is the slowest feature extracted by slow feature analysis, and the gray scale indicates the posterior distribution from the probabilistic model. (Row A) Low observation noise. Both SFA and probabilistic SFA recover the true latents accurately. For each algorithm, we calculate the root mean square (RMS) error between the reconstructed and true value of the latent process. The difference between these errors, divided by the RMS amplitude of the latent process, gives the proportional difference in errors, D. In this case, D = 0.03, indicating that the methods perform similarly at low observation noise levels. (Row B) Higher observation noise. SFA is adversely affected, but the probabilistic algorithm makes reasonable predictions. The difference in the RMS errors is now comparable to the amplitude of the latent signal (D = 1.10). (Row C) As for B, but with missing data. The probabilistic model can predict the intermediate latents in a principled manner, but SFA cannot. Using an ad hoc straight line interpolation of the normal SFA solution, the RMS error over the region of missing data is substantially greater (D = 1.4), indicating the superiority of the probabilistic interpolation.
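Because the model is jointly gaussian, the posterior over the latents used in such comparisons is available in closed form. The following sketch is our own construction (via the banded prior precision matrix, for a single latent chain; all parameter values are illustrative assumptions) and shows how observation noise and missing data can be handled in the way the figure describes:

```python
import numpy as np

def ar1_posterior_mean(X, c, lam, sig2, sx2, observed):
    """Exact posterior mean of one AR(1) latent chain y_{1:T} under
    observations x_t = c * y_t + gaussian noise of variance sx2 per dim.
    `observed` is a boolean mask: missing steps contribute no likelihood."""
    T = X.shape[0]
    J = np.zeros((T, T))                        # posterior precision matrix
    i = np.arange(T)
    J[i, i] = 1.0 / sig2                        # endpoint prior terms
    J[i[1:-1], i[1:-1]] = (1 + lam**2) / sig2   # interior prior terms
    J[i[:-1], i[1:]] = -lam / sig2              # off-diagonal prior terms
    J[i[1:], i[:-1]] = -lam / sig2
    h = np.zeros(T)
    for t in np.where(observed)[0]:
        J[t, t] += (c @ c) / sx2                # likelihood precision
        h[t] += (c @ X[t]) / sx2
    return np.linalg.solve(J, h)

rng = np.random.default_rng(3)
T, lam = 300, 0.98
y = np.zeros(T)
y[0] = rng.normal()                             # stationary start, unit variance
for t in range(1, T):
    y[t] = lam * y[t - 1] + rng.normal() * np.sqrt(1 - lam**2)

c = np.array([1.0, -0.5, 2.0])                  # generative weights (assumed)
X = y[:, None] * c + 0.5 * rng.normal(size=(T, 3))
observed = np.ones(T, dtype=bool)
observed[100:140] = False                       # a gap of missing observations
mu = ar1_posterior_mean(X, c, lam, 1 - lam**2, 0.25, observed)
```

The posterior mean pools information across time through the tridiagonal prior precision, so it both denoises the observed steps and interpolates the missing gap in a principled manner.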
Grimes & Rao, 2005). SFA, by contrast, confounds these two types of signal because it extracts them both into one type of latent variable. A natural extension to the SFA model might therefore augment the continuous “where” latent variables (yn,t ) with a new set of binary “what” latent variables (sn,t ),
forming independent discrete Markov chains with transition matrices $\mathbf{T}^{(n)}$, and combining the two bilinearly:

$$p\left(\mathbf{s}_t \mid \mathbf{s}_{t-1}, \{\mathbf{T}^{(n)}\}_{n=1}^N\right) = \prod_{n=1}^N p\left(s_{n,t} \mid s_{n,t-1}, \mathbf{T}^{(n)}\right) \tag{5.1}$$
$$p\left(s_{n,t} = a \mid s_{n,t-1} = b, \mathbf{T}^{(n)}\right) = T^{(n)}_{ab} \tag{5.2}$$
$$p\left(\mathbf{x}_t \mid \mathbf{y}_t, \mathbf{s}_t, \{\mathbf{g}_{mn}\}_{n=1,m=1}^{N,M}, \sigma_x\right) = \mathrm{Norm}\left( \sum_{mn} \mathbf{g}_{mn}\, y_{n,t}\, s_{m,t},\ \sigma_x^2 \mathbf{I} \right). \tag{5.3}$$
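A sampler for this bilinear model is straightforward to write down (a sketch of ours; the transition matrix, weights, and dimensions below are illustrative assumptions, and the transition matrix is shared across chains only for brevity):

```python
import numpy as np

rng = np.random.default_rng(4)
T, N, M, D = 500, 2, 2, 4          # time steps, N "where" latents, M "what" chains, obs dim
lams = np.array([0.99, 0.8])       # slowness of the continuous latents
P = np.array([[0.98, 0.02],
              [0.02, 0.98]])       # a sticky binary transition matrix T^(m)
G = rng.normal(size=(M, N, D))     # generative weight vectors g_mn
sx = 0.1

y = np.zeros((T, N))
s = np.zeros((T, M), dtype=int)
y[0] = rng.normal(size=N)
s[0] = rng.integers(0, 2, size=M)
for t in range(1, T):
    y[t] = lams * y[t - 1] + rng.normal(size=N) * np.sqrt(1 - lams**2)
    for m in range(M):             # each discrete chain evolves independently
        s[t, m] = rng.choice(2, p=P[s[t - 1, m]])

# Bilinear combination of eq. 5.3: mean_t = sum_{m,n} g_mn y_{n,t} s_{m,t}
mean = np.einsum('tm,tn,mnd->td', s, y, G)
X = mean + sx * rng.normal(size=(T, D))
```

Sampling is easy, but posterior inference over the joint configuration of continuous and binary chains is not, which is why approximate schemes such as variational EM are needed.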
Exact inference in such a model would be intractable, but the maximum likelihood parameters could be approximated, for example, by using variational EM (Neal & Hinton, 1998; Jordan, Ghahramani, Jaakkola, & Saul, 1999).

5.2 Generalized Bilinear Extensions. Traditionally, SFA has been made into a nonlinear algorithm by expanding the observations through a large set of nonlinearities and using these as new inputs. Probabilistic models are not well suited to such an approach for two reasons: (1) they are much slower to train when the dimensionality of the inputs is large, and (2) the expansion introduces complex dependencies between the new observations that a standard probabilistic model cannot capture. We consider the second of these points in more detail next, first noting that an interesting alternative that sidesteps these two issues is to learn a nonlinear mapping from the latents to the observations directly. One straightforward approach would be to use a generalized linear (or neural network) mapping,

$$p\left(\mathbf{x}_t \mid \mathbf{y}_t, \mathbf{s}_t, \{\mathbf{g}_{mn}\}_{n=1,m=1}^{N,M}, \sigma_x\right) = \mathrm{Norm}\left( f\left( \sum_{mn} \mathbf{g}_{mn}\, y_{n,t}\, s_{m,t} \right),\ \sigma_x^2 \mathbf{I} \right), \tag{5.4}$$

for a nonlinear link function $f(\cdot)$ (the extraction dropped the original symbol; $f$ is our notation). This form would also easily accommodate types of observation for which a normal output model would be inappropriate (binary data, for instance). Again, approximations would be required for learning.

5.3 Nonlinear Extensions. The above is a fairly conventional extension to time-series models (e.g., Ghahramani & Roweis, 1999). Further consideration of the second point above motivates an alternative that is more in the spirit of traditional SFA. In the expanded observation space, points corresponding to realizable data lie on low-dimensional manifolds (see Figure 3 for an example). Therefore, a good generative model should assign probability only to these manifolds and not "waste" it on the unrealizable volumes.
Figure 3: A schematic of the product-of-experts model in expanded observation space. The observations are one-dimensional and are expanded to two dimensions: $\phi_1(\tilde{x}_t) = \tilde{x}_t$ and $\phi_2(\tilde{x}_t) = \tilde{x}_t^2$. The generative model assigns nonzero probability off the manifold, as shown by its energy contours (shown at two successive time steps to indicate a slow direction and a fast direction). The local experts have high energy only around a localized section of the realizable manifold. The product of the two types of expert leads to a projected process with slow latents that generates around the manifold. The width around the manifold is shown for illustration purposes: in the limit $\beta \to 0$, the projected process lies precisely on the manifold.
The probabilistic model developed thus far is free to generate points anywhere in the expanded observation space; it cannot capture the structure introduced by the expansion. Fortunately, taking inspiration from product-of-experts models (Hinton, 2002; Osindero, Welling, & Hinton, 2006), we can use our old probabilistic model to produce a new, more sensible model. The idea is to treat the old generative distribution as a global expert that models temporal correlations in the data and then form a new distribution by multiplying its density by $K$ local experts, which constrain the expanded observations to lie on the realizable manifold, and renormalizing. It is important to note that unlike in the usual product-of-experts model, the local experts will not correspond to normalizable distributions in their own right. Generally, product models can be succinctly parameterized via an energy such that $P(\mathbf{w}) = \frac{1}{Z} \exp(-E(\mathbf{w}))$. A sensible energy function can be devised by introducing two new types of latent variables: $\tilde{x}_{m,t}$, which can usefully
1036
R. Turner and M. Sahani
be thought of as perturbations from the observations, and $\phi_{k,t}$, which are perturbations in the expanded observation space:

$$E = \frac{1}{2} \sum_t \left\{ \frac{1}{\sigma_x^2} \sum_m \left( x_{m,t} - \tilde{x}_{m,t} \right)^2 + \beta \sum_k \left[ \phi_{k,t} - \phi_k(\tilde{\mathbf{x}}_t) \right]^2 \right\} - \log P(\mathbf{Y}, \mathbf{\Phi}). \tag{5.5}$$

The first two terms in this energy correspond to the local experts, and the final term is the global, dynamical model expert (alternatively, the old generative model). Intuitively, the global expert assigns lower energy to solutions with slow latent variables $\mathbf{Y}$ and therefore favors them. The local experts are more complicated, but essentially they constrain the observations of the old generative model, $\mathbf{\Phi}$, to lie on the realizable manifold as described below (see also Figure 3). The $\tilde{x}_{m,t}$ tend to be close to the observations $x_{m,t}$ (as this lowers the energy, by an amount depending on the value of $\sigma_x^2$) and can therefore be thought of as perturbations. The $\phi_k(\tilde{\mathbf{x}}_t)$ correspond to the $K$ nonlinearities applied in the expansion (e.g., polynomials). Due to the perturbations, they are constrained to lie on a localized region of the manifold centered on $\phi_k(\mathbf{x}_t)$ in the expanded observation space. The latents $\phi_{k,t}$ tend to be close to the $\phi_k(\tilde{\mathbf{x}}_t)$ and therefore lie around the manifold (with a fuzziness determined by $\beta$, different from the output noise of the dynamical model). Taken together, these terms act as a local expert, vetoing any output from the global expert that does not lie near the realizable manifold. The parameters of this model can be learned using contrastive divergence (Hinton, 2002). What is more, the function $\phi_k(\tilde{\mathbf{x}}_t)$ can be parameterized, by a neural network, for example, and the parameters can also be learned in the same way.

6 Conclusion

We have shown the equivalence between the SFA algorithm and maximum-likelihood learning in a linear gaussian state-space model, with an independent Markovian prior. This perspective inspired a number of variants on the traditional slow feature analysis algorithm.
First, a fully probabilistic version of slow feature analysis was shown to be capable of dealing with both output noise and missing data naturally, using the Bayesian framework. Second, we suggested more speculative probabilistic extensions to further improve this new approach. For example, it is known that the traditional SFA algorithm makes no distinction between the "what" information in natural scenes and the "where" information. We suggest augmenting the continuous latent space with binary gating variables to produce a model capable of representing these two types of information distinctly. Finally, we noted that probabilistic models are not compatible with the standard
kernel method of expanding the observations through a large family of nonlinearities. However, an interesting alternative is to learn the nonlinearities in the model directly. We suggest both a traditional method using neural networks to parameterize the mapping from latent variable to observation and a nontraditional method that learns the inverse mapping and is therefore more in the spirit of SFA. The power of these new models will be explored in future work.

Acknowledgments

We thank Pietro Berkes for helpful discussions and for helping us to clarify the presentation. This work was supported by the Gatsby Charitable Foundation.

References

Attias, H. (1999). Independent factor analysis. Neural Comp., 11(4), 803–851.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Berkes, P., & Wiskott, L. (2005). Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 5(6), 579–602.
Blaschke, T., Berkes, P., & Wiskott, L. (2006). What is the relation between slow feature analysis and independent component analysis? Neural Computation, 18(10), 2495–2508.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society Series B (Methodological), 39, 1–38.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3(2), 194–200.
Ghahramani, Z., & Roweis, S. T. (1999). Learning nonlinear dynamical systems using an EM algorithm. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 599–605). Cambridge, MA: MIT Press.
Grimes, D. B., & Rao, R. P. N. (2005). Bilinear sparse coding for invariant vision. Neural Comput., 17(1), 47–73.
Hinton, G. E. (1989). Connectionist learning procedures. Artificial Intelligence, 40(1–3), 185–234.
Hinton, G. E. (2002).
Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Hurri, J., & Hyvärinen, A. (2003). Temporal and spatiotemporal coherence in simple-cell responses: A generative model of natural image sequences. Network, 14(3), 527–551.
Hyvärinen, A., Hurri, J., & Väyrynen, J. (2003). Bubbles: A unifying framework for low-level statistical properties of natural image sequences. Journal of the Optical Society of America A, Optics, Image Science, and Vision, 20(7), 1237–1252.
Jordan, M. I., Ghahramani, Z., Jaakkola, T., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.
Karklin, Y., & Lewicki, M. S. (2005). A hierarchical Bayesian model for learning nonlinear statistical regularities in nonstationary natural signals. Neural Computation, 17(2), 397–423.
Kayser, C., Körding, K. P., & König, P. (2003). Learning the nonlinearity of neurons from natural visual stimuli. Neural Computation, 15(8), 1751–1759.
Körding, K. P., Kayser, C., Einhäuser, W., & König, P. (2004). How are complex cell properties adapted to the statistics of natural stimuli? Journal of Neurophysiology, 91(1), 206–212.
MacKay, D. J. C. (1999). Maximum likelihood and covariant algorithms for independent component analysis. Unpublished manuscript. Available online at ftp://wol.ra.phy.cam.ac.uk/pub/mackay/ica.ps.gz.
Miskin, J. (2000). Ensemble learning for independent components analysis. Unpublished doctoral dissertation, Cambridge University.
Mitchison, G. (1991). Removing time variation with the anti-Hebbian differential synapse. Neural Computation, 3, 312–320.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–370). Norwell, MA: Kluwer Academic Press.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
Osindero, S., Welling, M., & Hinton, G. E. (2006). Topographic product models applied to natural scene statistics. Neural Computation, 18(2), 381–414.
Pearlmutter, B., & Parra, L. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 613–619). Cambridge, MA: MIT Press.
Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural Computation, 11(2), 305–345.
Särelä, J., & Valpola, H. (2005). Denoising source separation. J. Machine Learning Res., 6, 233–272.
Stone, J. V. (1996). Learning perceptually salient visual parameters using spatiotemporal smoothness constraints. Neural Computation, 8(7), 1463–1492.
Tenenbaum, J. B., & Freeman, W. T. (2000). Separating style and content with bilinear models. Neural Computation, 12(6), 1247–1283.
Tipping, M. E., & Bishop, C. M. (1999). Probabilistic principal components analysis. Journal of the Royal Statistical Society Series B (Methodological), 61(3), 611–622.
Wiskott, L., & Sejnowski, T. J. (2002). Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4), 715–770.
Received May 10, 2006; accepted July 30, 2006.
LETTER
Communicated by Simon Haykin
An Augmented Extended Kalman Filter Algorithm for Complex-Valued Recurrent Neural Networks

Su Lee Goh
[email protected]
Danilo P. Mandic
[email protected]
Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K.
An augmented complex-valued extended Kalman filter (ACEKF) algorithm for the class of nonlinear adaptive filters realized as fully connected recurrent neural networks is introduced. This is achieved based on some recent developments in the so-called augmented complex statistics and the use of general fully complex nonlinear activation functions within the neurons. This makes the ACEKF suitable for processing general complex-valued nonlinear and nonstationary signals and also bivariate signals with strong component correlations. Simulations on benchmark and real-world complex-valued signals support the approach.

1 Introduction

Recent progress in biomedicine, wireless and mobile communications, seismics, and sonar and radar signal processing has brought to light new problems where data models are often complex valued or have a higher-dimensional compact representation (Mandic & Chambers, 2001; Haykin, 1994). To process such signals, research has largely been directed toward extending the results from real-valued adaptive filters to those operating in the complex plane, C. One such algorithm is the complex least mean square (CLMS) algorithm for linear finite impulse response (FIR) adaptive filters, introduced in 1975 (Widrow, McCool, & Ball, 1975). More recently, complex-valued learning algorithms have been introduced for training of neural networks (NNs) (Kim & Adali, 2001; Leung & Haykin, 1991; Treichler, Johnson, & Larimore, 1987; Goh & Mandic, 2004). Fully connected recurrent neural networks (FCRNNs) have been employed as nonlinear adaptive filters1 (Mandic & Chambers, 2001; Medsker
1 Nonlinear autoregressive (NAR) processes can be modeled using feedforward networks, whereas nonlinear autoregressive moving-average (NARMA) processes can be represented using RNNs.

Neural Computation 19, 1039–1055 (2007)
C 2007 Massachusetts Institute of Technology
1040
S. Goh and D. Mandic
& Jain, 2000), where for real-time applications, the real-time recurrent learning (RTRL) algorithm (Williams & Zipser, 1989) has been widely used to train FCRNNs. In the complex domain, a recently proposed fully complex real-time recurrent learning (CRTRL)2 algorithm (Goh & Mandic, 2004; Goh, Popovic, & Mandic, 2004) has been applied to forecasting of the complex-valued wind field. These initial results were promising, but it was also realized that gradient-based learning may experience problems when processing signals with rich dynamics (for instance, intermittent signals). Following the established approaches from the real domain R, a possible solution to this problem may be based on Kalman filters (Puskorius & Feldkamp, 1994), which have been shown to exhibit superior performance in several applications, including state estimation for road navigation (Obradovic, Lenz, & Schupfner, 2004), parameter estimation for time-series modeling, and neural network training (Puskorius & Feldkamp, 1991; Julier & Uhlmann, 2004; Haykin, 2001). Kalman filtering (Kalman, 1960) is known to give the optimal solution to the linear gaussian sequential state estimation problem in the domain of second-order statistics. However, for a nonlinear or nongaussian state tracking problem with possibly an infinite number of states, no such general optimal algorithm exists. Extensions of the class of Kalman filter algorithms that cater to this case are termed the extended Kalman filter (EKF). The EKF (Feldkamp & Puskorius, 1998; Grewal & Andrews, 2001) is based on a truncated Taylor series expansion, which linearizes the system model locally, and the subsequent use of a linear Kalman filter. The EKF algorithms have been used to train temporal neural networks (Obradovic, 1996; Baltersee & Chambers, 1998; Mandic, Baltersee, & Chambers, 1998), with promising results. Recently, EKF training of NNs has been extended to the complex domain (Huang & Chen, 2000).
Notice that in order to design an algorithm suitable for the complex domain, we need a precise mathematical model that describes the evolution of system parameters. Hence, extensions of learning algorithms to the complex domain are not trivial and often involve some constraints, for instance, a simplified model of both complex statistics and complex nonlinearities within neurons. This might be suboptimal for classes of signals with significant correlation between the real and imaginary parts, which affects both the choice of the complex activation function and complex statistics. To tackle this problem, we first provide mathematical foundations for complex-valued second-order statistics and highlight the need to consider
2 The CRTRL algorithm uses a fully complex activation function (AF) that is analytic and bounded almost everywhere in C. In a split-complex AF, the real and imaginary components of the input signal x are separated and fed through the real-valued activation function f^R(x) = f^I(x), x ∈ R. A split-complex activation function is therefore given as f(x) = f^R(Re(x)) + j f^I(Im(x)), for example, f(x) = 1/(1 + e^{−β Re(x)}) + j · 1/(1 + e^{−β Im(x)}).
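To make the footnote concrete, here is a small sketch (our own illustration, not from the paper) contrasting a split-complex activation with a fully complex one, using the logistic and tanh nonlinearities mentioned in the text:

```python
import numpy as np

def split_complex_af(x, beta=1.0):
    # Split-complex AF: real and imaginary parts of x pass independently
    # through the same real-valued logistic, f^R = f^I.
    f = lambda t: 1.0 / (1.0 + np.exp(-beta * t))
    return f(np.real(x)) + 1j * f(np.imag(x))

def fully_complex_af(x, beta=1.0):
    # Fully complex AF (here tanh), analytic almost everywhere in C,
    # so x is treated as a single complex variable.
    return np.tanh(beta * x)

x = 0.3 + 0.4j
print(split_complex_af(x))
print(fully_complex_af(x))
```

The split-complex form can only couple the real and imaginary channels through the input split, which is what limits it for signals with strongly correlated components.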
the so-called augmented statistics in the derivation of such learning algorithms. Next, both the augmented complex-valued Kalman filter (ACKF) and the augmented extended complex-valued Kalman filter (ACEKF) algorithm are derived, whereby the recently developed CRTRL algorithm is used to compute the Jacobian matrix within the ACEKF. For rigor, this is achieved for a fully complex nonlinear activation function of a neuron. The potential of such an ACEKF for training of FCRNNs is analyzed, and a comparison with the CRTRL is provided. The analysis is comprehensive and is supported by simulation examples on benchmark complex-valued nonlinear and colored signals, together with simulations on complex-valued real-world radar and environmental measurements. Finally, possible directions for future research are highlighted.

2 Complex-Valued Second-Order Statistics

Before deriving the ACEKF algorithm for complex FCRNNs, we introduce some important notation from complex second-order statistics. The covariance matrix of two real-valued RVs x and y is defined as (Anton, 2003)

P_xy = cov[x, y] = E[(x − E[x])(y − E[y])^T],
(2.1)
where E[·] represents the statistical expectation operator. Similarly, the four covariance matrices of two complex-valued RVs, x = x^r + jx^i and y = y^r + jy^i, are given by (Neeser & Massey, 1992)

P_{x^r y^r} = cov[x^r, y^r]    P_{x^r y^i} = cov[x^r, y^i]
P_{x^i y^r} = cov[x^i, y^r]    P_{x^i y^i} = cov[x^i, y^i],   (2.2)

where j = √(−1), (·)^T denotes the vector transpose operator, and superscripts (·)^r and (·)^i denote, respectively, the real and imaginary part of a complex number or complex vector. These four real-valued matrices are equivalent to the following two complex-valued matrices, given by (Neeser & Massey, 1992),

P_xy = E[(x − E[x])(y − E[y])^H]
P^ξ_xy = E[(x − E[x])(y − E[y])^T],

where

P_xy = P_{x^r y^r} + P_{x^i y^i} + j(P_{x^i y^r} − P_{x^r y^i})   (2.3)
P^ξ_xy = P_{x^r y^r} − P_{x^i y^i} + j(P_{x^i y^r} + P_{x^r y^i}),   (2.4)
and the symbol (·)^H denotes the Hermitian transpose operator. We can solve for P_{x^r y^r}, P_{x^i y^i}, P_{x^i y^r}, and P_{x^r y^i} to obtain

P_{x^r y^r} = (1/2) R{P_xy + P^ξ_xy}
P_{x^i y^i} = (1/2) R{P_xy − P^ξ_xy}
P_{x^i y^r} = (1/2) I{P_xy + P^ξ_xy}
P_{x^r y^i} = (1/2) I{−P_xy + P^ξ_xy},   (2.5)
where symbols R and I denote, respectively, the real and imaginary part of a complex quantity. It is clear that the four real-valued covariance matrices in equation 2.5 are in a one-to-one relationship with the two complex-valued covariance matrices in equation 2.4. In the literature, nearly always only P_xy is considered and is referred to as the covariance matrix, whereas P^ξ_xy is termed the pseudocovariance matrix.

2.1 Augmented Covariance Matrix. It is often assumed that the theory of complex-valued random vectors (RVs) is no different from that of real RVs, as long as the definition of the covariance matrix of an RV x is changed from E[xx^T] in the real case to E[xx^H] in the complex case (Schreier & Scharf, 2003; Picinbono, 1996). This assumption, however, is not justified, since the covariance matrix E[xx^H] does not completely describe the second-order statistical behavior of x. For complex-valued gaussian random variables, we therefore need to consider both the variable x and its complex conjugate x* in order to make full use of the available statistical information. This additional information is contained in the cross-moments, and to design mathematically well-founded complex learning algorithms, a study of these moments and their implications on learning is required (Neeser & Massey, 1992). We therefore set out to derive a complex-valued Kalman filter that takes into account the augmented complex statistics and thus provides a general framework for nonlinear adaptive filtering in C. To achieve this, instead of a complex RV x, we consider an "augmented" (2n × 1)-dimensional vector x^a = [x, x*]. In addition, for "improper" (see appendix A for more detail) vectors, it is the augmented (2n × 2n)-dimensional covariance matrix P_{x^a x^a} = E[x^a (x^a)^H] (rather than the (n × n)-dimensional matrix P_xx = E[xx^H]) that contains the complete second-order statistical information. Such an augmented covariance matrix is given by (Schreier & Scharf,
2003)

P_{x^a x^a} = E[x^a (x^a)^H] = [ P_xx       P^ξ_xx
                                P^{ξ*}_xx  P*_xx ].   (2.6)
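Equations 2.3 through 2.6 can be checked numerically. The following sketch (our own illustration; the sample data and variable names are invented) estimates the covariance and pseudocovariance of an improper complex random vector from samples, recovers the four real covariance matrices via equation 2.5, and confirms the block structure of the augmented covariance matrix of equation 2.6:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 2
# An improper complex random vector: real and imaginary parts correlated.
a = rng.normal(size=(n, d))
x = a + 1j * (0.8 * a + 0.2 * rng.normal(size=(n, d)))
y = x @ np.array([[0.5, 0.1], [0.0, 1.0]]) + 0.3 * rng.normal(size=(n, d))
x, y = x - x.mean(0), y - y.mean(0)        # center, so cov = plain average

Pxy    = x.T @ y.conj() / n                # covariance, as in eq. 2.3
Pxy_xi = x.T @ y / n                       # pseudocovariance, eq. 2.4

# The four real covariance matrices recovered via eq. 2.5.
Pxryr = 0.5 * np.real(Pxy + Pxy_xi)
Pxiyi = 0.5 * np.real(Pxy - Pxy_xi)
Pxiyr = 0.5 * np.imag(Pxy + Pxy_xi)
Pxryi = 0.5 * np.imag(-Pxy + Pxy_xi)

# Augmented covariance, eq. 2.6: stack x with its conjugate.
xa  = np.hstack([x, x.conj()])
Pa  = xa.T @ xa.conj() / n
Pxx = x.T @ x.conj() / n
print(np.allclose(Pa[:d, :d], Pxx), np.allclose(Pa[:d, d:], x.T @ x / n))
# -> True True
```

The recovery in equation 2.5 is an algebraic identity, so it holds exactly on the samples themselves, not only in expectation.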
Notice that the covariance matrix P_{x^a x^a} is invertible and thus positive definite. Besides P_xx being positive semidefinite (PSD) and P^ξ_xx being symmetric, the Schur complement P_xx − P^ξ_xx (P*_xx)^{−1} P^{ξ*}_xx must be PSD to ensure that P_{x^a x^a} is PSD and thus a valid covariance matrix for x^a (Neeser & Massey, 1992).

3 The Augmented Complex-Valued Kalman Filter Algorithm

Based on the analysis of augmented complex statistics, we first derive a Kalman filter learning algorithm for complex-valued inputs using the augmented states and augmented covariance matrix. Consider a general state-space model given by (Haykin, 2001),

x_{k+1} = F_{k+1} x_k + ω_k
y_k = H_k x_k + ν_k,   (3.1)
where ω_k and ν_k are independent, zero-mean, complex-valued gaussian processes with covariance matrices Q_k and R_k, respectively, and F and H are the transition and measurement matrices. From equation 3.1, the augmented state-space model is obtained as

x^a_{k+1} = F^a_{k+1} x^a_k + ω^a_k
y^a_k = H^a_k x^a_k + ν^a_k,   (3.2)
where x^a_k = [x_k, x*_k], y^a_k = [y_k, y*_k], F^a_k = [F_k, F*_k], H^a_k = [H_k, H*_k], ω^a_k = [ω_k, ω*_k], and ν^a_k = [ν_k, ν*_k]. The augmented covariance matrices of the zero-mean complex-valued gaussian noise processes are denoted respectively by Q^a_k and R^a_k. In the initialization of the algorithm (k = 0), we set

x̂^a_0 = E[x^a_0]
P_0 = E[(x^a_0 − E[x^a_0])(x^a_0 − E[x^a_0])^H].   (3.3)
The computation steps for k = 1, . . . , L are given below (for clarity, we use notation similar to that of Haykin, 2001):
State estimate propagation:

x̂^{a−}_k = F^a_{k,k−1} x̂^a_{k−1}   (3.4)

Error covariance propagation:

P^−_k = F^a_{k,k−1} P_{k−1} (F^a_{k,k−1})^H + Q^a_{k−1}   (3.5)

Kalman gain matrix:

G_k = P^−_k (H^a_k)^H [H^a_k P^−_k (H^a_k)^H + R^a_k]^{−1}   (3.6)

State estimate update:

x̂^a_k = x̂^{a−}_k + G_k (y^a_k − H^a_k x̂^{a−}_k)   (3.7)

Error covariance update:

P_k = (I − G_k H^a_k) P^−_k   (3.8)
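The recursion of equations 3.4 to 3.8 transcribes directly into code. The following is a minimal sketch (our own; the toy matrices are invented), with the (·)^H operations implemented as conjugate transposes:

```python
import numpy as np

def ackf_step(x_hat, P, ya, Fa, Ha, Qa, Ra):
    # State estimate and error covariance propagation, eqs. 3.4-3.5
    x_pred = Fa @ x_hat
    P_pred = Fa @ P @ Fa.conj().T + Qa
    # Kalman gain matrix, eq. 3.6
    S = Ha @ P_pred @ Ha.conj().T + Ra
    G = P_pred @ Ha.conj().T @ np.linalg.inv(S)
    # State estimate and error covariance updates, eqs. 3.7-3.8
    x_new = x_pred + G @ (ya - Ha @ x_pred)
    P_new = (np.eye(P.shape[0]) - G @ Ha) @ P_pred
    return x_new, P_new

# One step on a two-dimensional augmented state [x, x*].
I2 = np.eye(2, dtype=complex)
x_hat, P = np.array([1 + 1j, 1 - 1j]), I2.copy()
ya = np.array([1.2 + 0.9j, 1.2 - 0.9j])
x_hat, P = ackf_step(x_hat, P, ya, 0.9 * I2, I2, 0.01 * I2, 0.1 * I2)
```

In an actual augmented filter, the matrices would carry the block structure of equation 2.6; here they are plain complex arrays purely for illustration.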
This completes the description of the augmented complex-valued Kalman filter.

4 FCRNN Trained with Augmented Extended Complex-Valued Kalman Filter

4.1 FCRNN Architecture. Figure 1 shows an FCRNN, which consists of N neurons with p external inputs and N feedback connections. The network has two distinct layers: the external input-feedback layer and a layer of processing elements. Let y_{l,k} denote the complex-valued output of neuron l = 1, . . . , N at time index k and s_k the (1 × p) external complex-valued input vector. The overall input to the network, u_k, is a concatenation of the vectors y_k, s_k and the bias input (1 + j), and is given by

u_k = [s_{k−1}, . . . , s_{k−p}, 1 + j, y_{1,k−1}, . . . , y_{N,k−1}]^T
u_{n,k} ∈ u_k,  u_{n,k} = u^r_{n,k} + j u^i_{n,k},  n = 1, . . . , p + N + 1.   (4.1)
For the lth neuron, its weights form a (p + N + 1) × 1-dimensional weight vector w^T_l = [w_{l,1}, . . . , w_{l,p+N+1}], l = 1, . . . , N; these are encompassed in the complex-valued weight matrix of the network W = [w_1, . . . , w_N]. The output of every neuron can be expressed as

y_{l,k} = Φ(net_{l,k}),  l = 1, . . . , N,   (4.2)
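The forward pass defined by equation 4.1, equation 4.2, and the net input of equation 4.3 below can be sketched in a few lines (our own illustration; Φ is taken to be the complex tanh used later in section 5):

```python
import numpy as np

def fcrnn_step(W, s_taps, y_prev):
    # u_k: p delayed inputs, the bias 1+j, and the N fed-back outputs (eq. 4.1)
    u = np.concatenate([s_taps, [1 + 1j], y_prev])
    net = W @ u                      # net input to each neuron, eq. 4.3
    return np.tanh(net)              # y_{l,k} = Phi(net_{l,k}), eq. 4.2

N, p = 3, 5                          # the sizes used in section 5
rng = np.random.default_rng(1)
W = 0.1 * (rng.normal(size=(N, p + N + 1))
           + 1j * rng.normal(size=(N, p + N + 1)))
y = fcrnn_step(W, rng.normal(size=p) + 0j, np.zeros(N, dtype=complex))
```

Each call feeds the previous outputs back into the input vector, which is what makes the network fully connected and recurrent.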
Figure 1: A fully connected recurrent neural network for prediction.
where Φ is a complex nonlinear activation function of a neuron and

net_{l,k} = Σ_{n=1}^{p+N+1} w_{l,n} u_{n,k}   (4.3)

is the net input to the lth node at time index k. In order to establish a mathematical framework for Kalman filter-based training of FCRNNs, the dynamical behavior of the FCRNN is described by the state-space model

w^a_{k+1} = w^a_k + ω^a_k   (4.4)
y^a_k = h(w^a_k, u_k) + ν^a_k,   (4.5)
where h is a nonlinear operator associated with the observations, w^a_k is the augmented weight vector, and y^a_k is the augmented output of the network. The underlying idea here is to linearize the state-space model at every time instant k. Once such a local linear model is obtained, the standard ACKF approach can be applied. In practice, the process noise covariance Q^a_k and measurement noise covariance R^a_k matrices might be time varying, but here we assume they are constant. Notice that the Jacobian H^a_k is defined as the set of partial derivatives of the outputs with respect to the weights w^a_k and needs to be updated at every time instant during learning.

4.2 Derivation of the Augmented Complex-Valued Extended Kalman Filter Algorithm. When operating in the complex domain, as with the real EKF, the nonlinear state and measurement equations ought to be linearized about the current state in order to subsequently employ a standard Kalman filter. Since this linearization is dynamical in nature, the performance of such an EKF cannot be assessed beforehand. Indeed, there is a chance that the updated trajectory estimate will be poorer than the nominal one, which leads to inaccuracy in the estimates, causing further error accumulation (Brown & Hwang, 1997). The derivation of the required complex-valued Jacobian matrix is nontrivial, and the linearization can lead to highly unstable performance (Julier & Uhlmann, 2004). Therefore, careful parameter selection is required when using the ACEKF algorithm. With this in mind, we proceed with the derivation of the ACEKF as an extension of the ACKF presented in the previous section. The following expressions summarize the proposed ACEKF:

G_k = P^−_k (H^a_k)^H [H^a_k P^−_k (H^a_k)^H + R^a_k]^{−1}   (4.6)
ŵ^a_k = ŵ^{a−}_k + G_k [y^a_k − h(ŵ^{a−}_k, u_k)]   (4.7)
P_k = (I − G_k H^a_k) P^−_k + Q^a_k.   (4.8)

The ACEKF is initialized by

ŵ^a_0 = E[w^a_0]
P_0 = E[(w^a_0 − E[w^a_0])(w^a_0 − E[w^a_0])^H].   (4.9)
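Given a Jacobian H^a_k, equations 4.6 to 4.8 form a single weight update. The following is a sketch under our own naming, with the Jacobian and the model output h(ŵ, u) supplied externally rather than derived:

```python
import numpy as np

def acekf_update(w, P, ya, y_model, Ha, Qa, Ra):
    S = Ha @ P @ Ha.conj().T + Ra
    G = P @ Ha.conj().T @ np.linalg.inv(S)        # eq. 4.6
    w_new = w + G @ (ya - y_model)                # eq. 4.7, y_model = h(w, u)
    P_new = (np.eye(len(w)) - G @ Ha) @ P + Qa    # eq. 4.8
    return w_new, P_new

# Toy usage: 4 augmented weights, 2 augmented outputs, zero innovation.
w, P = np.zeros(4, dtype=complex), np.eye(4, dtype=complex)
Ha = np.arange(8, dtype=complex).reshape(2, 4)    # placeholder Jacobian
ya = np.array([0.1 + 0.2j, 0.1 - 0.2j])
w_new, P_new = acekf_update(w, P, ya, ya, Ha, 0.01 * np.eye(4), 0.1 * np.eye(2))
```

With a zero innovation (observation equals model output), the weights are left unchanged while the covariance still contracts and then grows by Q^a, which is the intended behavior of equation 4.8.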
For generality, the augmented Jacobian matrix H^a_k is computed using the augmented CRTRL algorithm³ for recurrent networks (Goh & Mandic, 2004), using fully complex nonlinearities within the network (see appendix B). The vector ŵ^a_k denotes the estimate of the augmented state of the system update at step k. The Kalman gain matrix G_k is a function of the estimated error covariance matrix P_k, the Jacobian matrix H^a_k, and a global scaling matrix H^a_k P^−_k (H^a_k)^H + R^a_k.

3 Matrix H^a_k denotes the matrix of partial derivatives of the network's augmented output y^a_k with respect to the weight parameters (see appendix B).

5 Simulations

For the experiments, the nonlinearity at the neuron was chosen to be the complex tanh function,

Φ(x) = (e^{βx} − e^{−βx}) / (e^{βx} + e^{−βx}),   (5.1)
where x ∈ C. The value of the slope of Φ(x) was β = 1. The architecture of the FCRNN (see Figure 1) consisted of N = 3 neurons with a tap input length of p = 5. To support the analysis, we tested the ACEKF on a wide range of signals, including complex linear and complex nonlinear signals. To further illustrate the approach and verify the advantage of using the ACEKF over the CEKF and the CRTRL, additional single-trial experiments were performed on real-world complex-valued wind⁴ and radar⁵ data. In the first experiment, the input signal was a stable complex linear AR(4) process given by

r(k) = 1.79r(k − 1) − 1.85r(k − 2) + 1.27r(k − 3) − 0.41r(k − 4) + n(k),   (5.2)

with complex white gaussian noise (CWGN) n(k) ∼ N(0, 1) as the driving input. The CWGN can be expressed as n(k) = n^r(k) + jn^i(k). The real and imaginary components of CWGN are mutually independent sequences having equal variances, so that σ²_n = σ²_{n^r} + σ²_{n^i}. For the second experiment, we used a complex benchmark nonlinear signal (Narendra & Parthasarathy, 1990),

z(k) = [z²(k − 1)(z(k − 1) + 2.5)] / [1 + z(k − 1) + z²(k − 2)] + r(k − 1).   (5.3)
4 Publicly available online from http://mesonet.agron.iastate.edu/. The wind vector can be expressed in the complex domain C as v(t)e^{jθ(t)} = v_E(t) + jv_N(t). Here, the two wind components, the speed v and direction θ, which are of different natures, are modeled as a single quantity in a complex representation space (Mandic & Goh, 2005).
5 Publicly available online from http://soma.ece.mcmaster.ca/ipix/.
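The driving noise and the AR(4) benchmark of equation 5.2 are straightforward to synthesize. In this sketch (our own), note that the r(k−4) coefficient must enter with a negative sign (−0.41) for the recursion to be stable, which is what the text's description of the process as stable requires:

```python
import numpy as np

rng = np.random.default_rng(2)
L = 2000
# CWGN n(k) = n^r(k) + j n^i(k): independent parts sharing the total variance.
n = (rng.normal(size=L) + 1j * rng.normal(size=L)) / np.sqrt(2)

# Stable complex AR(4) process of equation 5.2.
r = np.zeros(L, dtype=complex)
for k in range(4, L):
    r[k] = (1.79 * r[k - 1] - 1.85 * r[k - 2]
            + 1.27 * r[k - 3] - 0.41 * r[k - 4] + n[k])
```

Flipping the last coefficient to +0.41 places a root of the characteristic polynomial outside the unit circle, and the sequence diverges within a few hundred steps.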
Table 1: Comparison of Prediction Gains Rp, MSE, and SNR for the Various Classes of Signals.

Signal                     Nonlinear   AR4     Wind    Radar
Rp [dB] (ACEKF)            6.24        4.77    12.65   10.58
Rp [dB] (standard CEKF)    5.55        3.98    10.24    9.91
Rp [dB] (CRTRL)            3.76        3.54     6.12    7.22
SNR [dB]                   5           5        3       3
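The prediction gains in the table are computed with the R_p measure defined in equation 5.4 below; as a quick sanity check with invented numbers:

```python
import numpy as np

def prediction_gain_db(x, e):
    # Rp = 10 log10(var(x) / var(e)); np.var handles complex signals too.
    return 10.0 * np.log10(np.var(x) / np.var(e))

x = np.array([1.0, -1.0] * 100)      # unit-variance test signal
e = 0.1 * x                          # prediction error 20 dB below the signal
print(prediction_gain_db(x, e))      # ~ 20 dB
```

A gain of 6.24 dB, as reported for the ACEKF on the nonlinear signal, corresponds to an error variance roughly a factor of 4.2 below the signal variance.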
To assess the performance, the standard prediction gain R_p was employed, given by (Haykin & Li, 1995)

R_p(k) = 10 log₁₀(σ²_x / σ̂²_e)  [dB],   (5.4)
where σ²_x denotes the variance of the input signal x(k), and σ̂²_e denotes the estimated variance of the forward prediction error e(k). Table 1 shows a comparison of the prediction gains R_p [dB] between the ACEKF, the standard complex-valued EKF (CEKF, without the augmented states), and the CRTRL for various classes of signals, together with the signal-to-noise ratio (SNR) for each experiment. In all cases, there was a significant improvement in performance when the ACEKF was employed over the CRTRL and the CEKF. Figures 2 and 3 show a subsegment of the predictions generated by the ACEKF and CRTRL for complex-valued colored (see equation 5.2) and nonlinear (see equation 5.3) signals. The ACEKF outperformed the CRTRL in both cases. To further illustrate the advantage of the ACEKF over the CRTRL, we compared the performance of FCRNNs trained with these algorithms in experiments on real-world radar and wind data. Figure 4 shows the prediction performance of the ACEKF applied to the complex-valued real-world (velocity and angle components) wind signal. Simulation results for radar data are shown in Figure 5. In both cases, the ACEKF outperformed the other algorithms considered (for a quantitative performance comparison, see Table 1).

6 Discussion and Conclusion

The augmented complex-valued extended Kalman filter (ACEKF) has been introduced for nonlinear adaptive filtering in the complex domain. For generality, this has been achieved for nonlinear adaptive filters realized as fully connected recurrent neural networks (FCRNNs), and the ACEKF has been derived using the augmented complex-valued statistics, whereby
[Figure 2 appears here: panels (a) ACEKF Algorithm, (b) CRTRL Algorithm, and (c) Mean Square Error, each plotted against iterations k.]

Figure 2: Comparison of one-step-ahead prediction performance between the ACEKF and CRTRL for the nonlinear signal (see equation 5.3).
[Figure 3 appears here: panels (a) ACEKF Algorithm, (b) CRTRL Algorithm, and (c) Mean Square Error, each plotted against iterations k.]

Figure 3: Comparison of one-step-ahead prediction performance between the ACEKF and CRTRL for the colored signal (see equation 5.2).
[Figure 4 appears here.]

Figure 4: Prediction performance for the complex wind signal using the ACEKF.
[Figure 5 appears here.]

Figure 5: Prediction performance for the complex radar signal using the ACEKF.
a complete second-order statistical description of complex-valued quantities is taken into account. The performance of the ACEKF has been evaluated on benchmark complex-valued nonlinear and colored signals and also on real-life complex-valued wind and radar signals. Experimental results have justified the potential of the ACEKF in nonlinear complex-valued neural adaptive filtering applications. It should be noted that the computational cost of the ACEKF is higher than that of the CRTRL. To mitigate this problem, a decoupled extended Kalman filter (DEKF) algorithm (Puskorius & Feldkamp, 1991) may be used; however, the reduction in computational complexity may lead to poorer performance of the algorithm in the domain of augmented complex statistics. This is because the DEKF is derived from the EKF by approximating the second-order information between
weights so that they belong only to mutually exclusive groups; correlations between weight estimates can then be neglected, which leads to a sparser error covariance matrix P_k. Decoupling methods for processes described by augmented complex statistics are an area for future research, especially if the main concern is the computational complexity of the augmented EKF.

Appendix A: Proper and Improper Complex Random Vectors

A complex RV x is called proper if its pseudocovariance P^ξ_xx vanishes (Neeser & Massey, 1992; Schreier & Scharf, 2003). For a proper complex-valued random vector x, we have

P_{x^r x^r} = P_{x^i x^i},  P_{x^i x^r} = −P^T_{x^i x^r}.   (A.1)
This means that the real and imaginary parts of x are uncorrelated. The vanishing of P^ξ_xx does not imply that the real part of x_m ∈ x and the imaginary part of x_n ∈ x are uncorrelated for m ≠ n. When RVs are proper, the probability density function (PDF) of a gaussian RV takes a form similar to that of the real case. Let x ∈ C^n be a proper complex-valued gaussian random variable with mean µ and nonsingular covariance matrix P_xx. Then the PDF of x is given by (Picinbono, 1998)

f(x) = [1 / det(πP_xx)] e^{−(x−µ)^H P^{−1}_xx (x−µ)},   (A.2)
where P^{−1}_xx is Hermitian and positive definite. For convenience, in many applications complex-valued RVs are treated as proper. However, P^ξ_xx may not necessarily be zero, and the results obtained this way are suboptimal.

Appendix B: An Augmented CRTRL Algorithm for the FCRNN

The augmented CRTRL algorithm for training the FCRNN is briefly described. The cost function of the recurrent network is given by J_k = (1/2) e_k e^H_k. The instantaneous output error is defined as

e_k = d^a_k − y^a_k,
(B.1)
where dak denotes the augmented desired output vector. The augmented weight matrix update becomes wak = −η
∂yak ∂ Jk . a = −ηek ∂wk ∂wak
(B.2)
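Taken at face value, equation B.2 is a one-line update. Below is an illustrative sketch (our own), with the sensitivity matrix ∂y^a/∂w^a supplied externally (in the text it comes from the recursion B.6) and leaving aside the conjugate bookkeeping that a full CR-calculus treatment of complex gradients would add:

```python
import numpy as np

def crtrl_update(w, e, dy_dw, eta=0.01):
    # delta w = -eta dJ/dw = -eta e dy/dw (eqs. B.1-B.2, sketch form)
    return w - eta * (e @ dy_dw)

w = np.zeros(5, dtype=complex)
e = np.array([0.2 - 0.1j, -0.3 + 0.4j])           # d^a - y^a, eq. B.1
dy_dw = np.ones((2, 5)) + 1j * np.ones((2, 5))    # placeholder sensitivities
w = crtrl_update(w, e, dy_dw)
```

When the error vanishes, the update leaves the weights untouched, as expected of a gradient rule.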
We introduce three new matrices—the 2N × (2N + p + 1) matrix Λ^a_{l,k}, the 2N × (2N + p + 1) matrix U_{l,k}, and the 2N × 2N diagonal matrix S_k—as (Haykin, 1994)

Λ^a_{l,k} = ∂y^a_k / ∂w^a_{l,k},  y^a_k = [y_{1,k}, . . . , y_{N,k}, y*_{1,k}, . . . , y*_{N,k}],  l = 1, 2, . . . , N   (B.3)

U_{l,k} = [0, . . . , 0, u_k, 0, . . . , 0]^T  (u_k in the lth row),  l = 1, . . . , N   (B.4)

S_k = diag(Φ′(u^T_k w^a_{1,k}), . . . , Φ′(u^T_k w^a_{N,k}))   (B.5)

Λ^a_{l,k+1} = S*_k (U_{l,k} + (W^a_{f,k})^H Λ^a_{l,k}),   (B.6)
where W_f denotes the set of those entries in W that correspond to the feedback connections.

Acknowledgments

We thank Dragan Obradovic from Siemens Corporate Research for his critical comments and lengthy discussions.

References

Anton, H. (2003). Contemporary linear algebra. New York: Wiley.
Baltersee, J., & Chambers, J. A. (1998). Nonlinear adaptive prediction of speech signals using a pipelined recurrent network. IEEE Transactions on Signal Processing, 46(8), 2207–2216.
Brown, R. B., & Hwang, P. Y. C. (1997). Introduction to random signals and applied Kalman filtering. New York: Wiley.
Feldkamp, L. A., & Puskorius, G. V. (1998). A signal processing framework based on dynamical neural networks with application to problems in adaptation, filtering and classification. Proceedings of the IEEE, 86, 2259–2277.
Goh, S. L., & Mandic, D. P. (2004). A complex-valued RTRL algorithm for recurrent neural networks. Neural Computation, 16(12), 2699–2713.
Goh, S. L., Popovic, D. H., & Mandic, D. P. (2004). Complex-valued estimation of wind profile and wind power. In Proceedings of the 12th IEEE Mediterranean Electrotechnical Conference, MELECON 2004 (Vol. 3, pp. 1037–1040). Piscataway, NJ: IEEE Press.
Grewal, M. S., & Andrews, A. P. (2001). Kalman filtering: Theory and practice. New York: Wiley. Haykin, S. (1994). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall. Haykin, S. (2001). Kalman Filtering and neural networks. New York: Wiley. Haykin, S., & Li, L. (1995). Nonlinear adaptive prediction of nonstationary signals. IEEE Transactions on Signal Processing, 43(2), 526–535. Huang, R. C., & Chen, M. S. (2000). Adaptive equalization using complex-valued multilayered neural network based on the extended Kalman filter. In Proceedings of Signal Processing, International Conference on WCCC-ICSP (Vol. 1, pp. 519–524). Piscataway, NJ: IEEE Press. Julier, S. J., & Uhlmann, J. K. (2004). Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3), 401–422. Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Transaction of the ASME–Journal of Basic Engineering, 82, 35–45. Kim, T., & Adali, T. (2001). Approximation by fully complex MLP using elementary transcendental activation functions. In Proceedings of XI IEEE Workshop on Neural Networks for Signal Processing (pp. 203–212). Piscataway, NJ: IEEE Press. Leung, H., & Haykin, S. (1991). The complex backpropagation algorithm. IEEE Transactions on Signal Processing, 39(9), 2101–2104. Mandic, D., Baltersee, J., & Chambers, J. (1998). Non-linear adaptive prediction of speech with a pipelined recurrent neural network and advanced learning algorithms. In A. Prochazka, J. Uhlir, P. W. Rayner, & N. G. Kingsbury (Eds.), Signal analysis and prediction. Boston: Birkhauser. Mandic, D. P., & Chambers, J. A. (2001). Recurrent neural networks for prediction: Learning algorithms, architectures and stability. New York: Wiley. Mandic, D. P., & Goh, S. L. (2005). Sequential data fusion via vector spaces: Complex modular neural network approach. In Proceedings of the IEEE Workshop on Machine Learning Signal Processing (pp. 147–151). Piscataway, NJ: IEEE Press. 
Medsker, L. R., & Jain, L. (2000). Recurrent neural networks: Design and applications. Boca Raton, FL: CRC Press.
Narendra, K. S., & Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), 4–27.
Neeser, F., & Massey, J. (1992). Proper complex random processes with applications to information theory. IEEE Transactions on Information Theory, 39(4), 1293–1302.
Obradovic, D. (1996). On-line training of recurrent neural networks with continuous topology adaptation. IEEE Transactions on Neural Networks, 7(1), 222–228.
Obradovic, D., Lenz, H., & Schupfner, M. (2004). Sensor fusion in Siemens car navigation system. In Proceedings of the IEEE Signal Processing Society Workshop on Machine Learning for Signal Processing (pp. 655–664). Piscataway, NJ: IEEE Press.
Picinbono, B. (1996). Second-order complex random vectors and normal distributions. IEEE Transactions on Signal Processing, 44(10), 2637–2640.
Picinbono, B. (1998). On circularity. IEEE Transactions on Signal Processing, 42(12), 3473–3482.
Puskorius, G. V., & Feldkamp, L. A. (1991). Decoupled extended Kalman filter training of feedforward layered networks. In IJCNN-91 Seattle International Joint Conference on Neural Networks (Vol. 1, pp. 771–777). Piscataway, NJ: IEEE Press.
Puskorius, G. V., & Feldkamp, L. A. (1994). Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks. IEEE Transactions on Neural Networks, 5(2), 279–297. Schreier, P. J., & Scharf, L. L. (2003). Second-order analysis of improper complex random vectors and processes. IEEE Transactions on Signal Processing, 51(3), 714– 725. Treichler, J. R., Johnson, J. C. R., & Larimore, M. (1987). Theory and design of adaptive filters. New York: Wiley. Widrow, B., McCool, J., & Ball, M. (1975). The complex LMS algorithm. Proceedings of the IEEE, 63, 712–720. Williams, R. J., & Zipser, D. A. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2), 270–280.
Received August 22, 2005; accepted July 9, 2006.
LETTER
Communicated by Terence Kwok
Equilibria of Iterative Softmax and Critical Temperatures for Intermittent Search in Self-Organizing Neural Networks

Peter Tiňo
[email protected] School of Computer Science, University of Birmingham, Birmingham B15 2TT, U.K.
Kwok and Smith (2005) recently proposed a new kind of optimization dynamics using self-organizing neural networks (SONN) driven by softmax weight renormalization. Such dynamics is capable of powerful intermittent search for high-quality solutions in difficult assignment optimization problems. However, the search is sensitive to the temperature setting in the softmax renormalization step. It has been hypothesized that the optimal temperature setting corresponds to the symmetry-breaking bifurcation of equilibria of the renormalization step, when viewed as an autonomous dynamical system called iterative softmax (ISM). We rigorously analyze equilibria of ISM by determining their number, position, and stability types. It is shown that most fixed points exist in the neighborhood of the maximum entropy equilibrium w = (N^{−1}, N^{−1}, . . . , N^{−1}), where N is the ISM dimensionality. We calculate the exact rate of decrease in the number of ISM equilibria as one moves away from w. Bounds on temperatures guaranteeing different stability types of ISM equilibria are also derived. Moreover, we offer analytical approximations to the critical symmetry-breaking bifurcation temperatures that are in good agreement with those found by numerical investigations. So far, the critical temperatures have been determined only by trial-and-error numerical simulations. On a set of N-queens problems for a wide range of problem sizes N, the analytically determined critical temperatures predict the optimal working temperatures for SONN intermittent search very well. It is also shown that no intermittent search can exist in SONN for temperatures greater than one-half.

1 Introduction

Since the pioneering work of Hopfield (1982), adaptation of neural computation techniques to solving difficult combinatorial optimization problems has proved useful in numerous application domains (Smith, 1999).
In particular, a self-organizing neural network (SONN) was proposed as a general methodology for solving 0-1 assignment problems (Smith, 1995). The methodology has been successfully applied in a wide variety of applications, from assembly line sequencing to frequency assignment in Neural Computation 19, 1056–1081 (2007)
© 2007 Massachusetts Institute of Technology
mobile communications (see, e.g., Smith, Palaniswami, & Krishnamoorthy, 1998). The key difference between traditional self-organizing maps (SOM) (Kohonen, 1982) and SONN lies in the notion of the neighborhood of the winning unit (the unit most responsive to the current input). In the Kohonen-type SOM, neural units play the role of codebook vectors in a constrained vector quantization of the input space. Close units on the map represent close regions of the input space. As the map develops, the notion of closeness on the map changes only in scale, while the topology of neighborhood relations remains unchanged. In contrast, SONNs adopt a more flexible notion of the winning unit neighborhood. Neural units represent partial solutions to an assignment optimization problem. The criterion for closeness of two units is their relative performance in solving the optimization task. In this view, the neighborhood of the winning unit contains only relatively successful partial assignment solutions to the optimization task being solved. There is no predetermined neighborhood topology on the neural units. However, both SOM and SONN share the philosophy of neighborhood-based updating of the parameters (weights). The winner unit, as well as units from its neighborhood, are modified to respond better to the current input stimulus. Searching for 0-1 solutions in general assignment optimization problems can be made more effective when performed in a continuous domain, with values in the interval (0, 1) representing partial (soft) assignments (e.g., Kosowsky & Yuille, 1994). Typically the softmax function is employed to ensure that elements within a set of positive parameters sum up to one. The softmax function was incorporated into SONN in Guerrero, Lozano, Smith, Canca, and Kwok (2002) and was used in other optimization frameworks (e.g., in Gold & Rangarajan, 1996; Rangarajan, 2000).
When endowed with a physics-based Boltzmann distribution interpretation, the softmax function contains a free parameter: temperature T (or inverse temperature T −1 ). As the system cools, the assignments become increasingly crisp. Recently interesting observations have been made regarding the appropriate values of the temperature parameter when solving assignment problems with SONN endowed with softmax renormalization of the weight parameters (Kwok & Smith, 2002, 2004). There is a critical temperature T∗ at which the SONN achieves superior performance in terms of both quality and efficiency. It has been suggested that the critical temperature may be closely related to the symmetry-breaking bifurcation of equilibria in the autonomous softmax dynamics (Kwok & Smith, 2003). Of even greater interest is the finding that at T∗ , when the neighborhood size and learning rate are kept fixed at appropriate values, SONN is capable of powerful intermittent search through a multitude of high-quality solutions represented as meta-stable states of the SONN dynamics (Kwok & Smith, 2005). Kwok and Smith (2005) numerically studied global dynamical properties of SONN in the intermittent search mode and argued that such models display
characteristics of systems at the edge of chaos (Langton, 1990; Crutchfield & Young, 1990). In this contribution, we seek to shed more light on the phenomenon of critical temperatures and intermittent search in SONN. In particular, since the critical temperature is closely related to bifurcations of equilibria in autonomous iterative softmax systems (ISM), we rigorously analyze the number, position, and stability types of fixed points of ISM. Moreover, we offer analytical approximations to the critical temperature as a function of the ISM dimensionality. So far, the critical temperatures have been determined only by trial-and-error numerical investigations.

This letter is organized as follows. After a brief introduction to SONN in section 2, we formally define ISM in section 3. The numbers and positions of ISM equilibria are studied in section 4, and their stability is investigated in section 5. Analytical approximations to the critical temperature for SONN intermittent search are derived and verified in section 6. Section 7 summarizes the key findings.

2 Self-Organizing Neural Network with SoftMax Weight Renormalization

In this section we briefly introduce self-organizing neural networks (SONN) endowed with weight renormalization for solving assignment optimization problems (see, e.g., Kwok & Smith, 2005). Consider a finite set of inputs j ∈ J = {1, 2, ..., M} that need to be assigned to outputs i ∈ I = {1, 2, ..., N}, so that a global cost (potential) Q(A) of an assignment A : J → I is minimized. The partial cost of assigning j ∈ J to i ∈ I is denoted by V(i, j). Both the input and output elements j ∈ J and i ∈ I, respectively, are represented through one-hot encoding, that is, there is one input and one output unit exclusively devoted to each input and output element, respectively. The strength of assigning j to i is represented by the weight wi,j ∈ [0, 1].
The SONN algorithm consists of the following steps:

1. Initialize connection weights wi,j, j ∈ J, i ∈ I, to random values around 0.5.
2. Choose at random an input item jc ∈ J, and calculate partial costs V(i, jc), i ∈ I, of all possible assignments of jc.
3. The winner output node i(jc) ∈ I is the one that minimizes V(i, jc), that is, i(jc) = argmin_{i∈I} V(i, jc). The neighborhood B_L(i(jc)) of size L of the winner node i(jc) consists of the L nodes i ≠ i(jc) that yield the smallest partial costs V(i, jc).
4. Weights of nodes i ∈ B_L(i(jc)) get strengthened, and those outside B_L(i(jc)) are left unchanged:

   wi,jc ← wi,jc + η(i)(1 − wi,jc), i ∈ B_L(i(jc)),
where

   η(i) = β exp( −|V(i(jc), jc) − V(i, jc)| / |V(k(jc), jc) − V(i, jc)| ),

and k(jc) = argmax_{i∈I} V(i, jc).

5. Weights wj = (w1,j, w2,j, ..., wN,j)⊤ for each input node j ∈ J are normalized using softmax (here, ⊤ denotes the transpose operator):

   wi,j ← exp(wi,j / T) / Σ_{k=1}^{N} exp(wk,j / T).
6. Repeat from step 2 until all inputs jc ∈ J have been selected (one epoch).

Although the soft assignments wi,j evolve in continuous space, when needed, a 0-1 assignment solution can be produced by imposing A(j) = i if and only if i = argmax_{k∈I} wk,j. A frequently studied (NP-hard) assignment optimization problem in the case of SONN is the N-queens problem (e.g., Kwok & Smith, 2004, 2005): place N queens on an N × N chessboard so that no two queens attack each other. In this case, J = {1, 2, ..., N} and I = {1, 2, ..., N} index the columns and rows, respectively, of the chessboard. The partial cost V(i, j) evaluates the diagonal and column contributions (in the sense of directions on the chessboard) to the global cost Q of placing a queen on column j of row i. (For more details, see Kwok & Smith, 2004, 2005.) Many neural-based approaches and heuristic techniques have been proposed to solve the N-queens problem (e.g., Minton, Johnston, Philips, & Laird, 1992; Tagliarini & Page, 1987). Previous SONN studies of the N-queens problem used the problem as a convenient test-bed for illustrating the optimization capabilities of SONN, without intending to devise the most efficient algorithm. The N-queens problem is used in this study in a similar manner.

Kwok and Smith (2003, 2005) argue that step 5 of the SONN algorithm is crucial for intermittent search by SONN for globally optimal assignment solutions. In particular, they note that the temperatures at which symmetry-breaking bifurcations of equilibria of the renormalization procedure in step 5 occur correspond to the temperatures at which optimal (in terms of both quality and quantity of found solutions) intermittent search takes place. In the following sections, we rigorously analyze equilibria of softmax viewed as an autonomous dynamical system. Our findings will later allow us to formulate analytical approximations to the critical temperatures at which symmetry-breaking bifurcations of the softmax renormalization dynamics take place.
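Steps 1 to 6 can be sketched compactly in code. The following is a minimal, illustrative Python sketch of one SONN epoch under the update rules above; the random cost table, the helper name `partial_cost`, and the parameter values (T, β, L) are hypothetical demonstration choices, not part of the original specification.

```python
import math
import random

def sonn_epoch(w, partial_cost, T=0.2, beta=0.5, L=3):
    """One SONN epoch: steps 2-6 applied once to every input item jc."""
    N, M = len(w), len(w[0])
    for jc in random.sample(range(M), M):             # step 2: each input once, random order
        V = [partial_cost(i, jc) for i in range(N)]
        order = sorted(range(N), key=lambda i: V[i])  # step 3: winner = order[0]
        vmin, vmax = V[order[0]], V[order[-1]]
        for i in order[:L + 1]:                       # winner plus its L-neighborhood
            denom = abs(vmax - V[i]) or 1e-12         # |V(k(jc), jc) - V(i, jc)|
            eta = beta * math.exp(-abs(vmin - V[i]) / denom)
            w[i][jc] += eta * (1.0 - w[i][jc])        # step 4: strengthen
        for j in range(M):                            # step 5: softmax renormalization
            z = sum(math.exp(w[k][j] / T) for k in range(N))
            for i in range(N):
                w[i][j] = math.exp(w[i][j] / T) / z
    return w

# Toy run with a hypothetical random cost table (not the N-queens cost):
random.seed(0)
N = M = 5
cost = [[random.random() for _ in range(M)] for _ in range(N)]
w = [[0.5 + 0.01 * random.random() for _ in range(M)] for _ in range(N)]  # step 1
w = sonn_epoch(w, lambda i, j: cost[i][j])
assert all(abs(sum(w[i][j] for i in range(N)) - 1.0) < 1e-9 for j in range(M))
```

Step 5 couples all output units of a column through the temperature-dependent softmax; this renormalization is exactly the autonomous dynamics analyzed in the following sections.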
3 Iterative Softmax

Denote the (N − 1)-dimensional simplex in R^N by S_{N−1}, that is,

   S_{N−1} = { w = (w1, w2, ..., wN)⊤ ∈ R^N | wi ≥ 0, i = 1, 2, ..., N, and Σ_{i=1}^{N} wi = 1 }.  (3.1)

The interior of S_{N−1} is denoted by S⁰_{N−1}:

   S⁰_{N−1} = { w ∈ R^N | wi > 0, i = 1, 2, ..., N, and Σ_{i=1}^{N} wi = 1 }.  (3.2)
Given a parameter T > 0 (the temperature), the softmax maps R^N into S⁰_{N−1}:

   w ↦ F(w; T) = (F1(w; T), F2(w; T), ..., FN(w; T))⊤,  (3.3)

where

   Fi(w; T) = exp(wi / T) / Σ_{k=1}^{N} exp(wk / T), i = 1, 2, ..., N.  (3.4)

We will denote the common normalization factor of the Fi's by Z(w; T), that is,

   Z(w; T) = Σ_{k=1}^{N} exp(wk / T).  (3.5)
Linearization of F around w ∈ S⁰_{N−1} is given by the Jacobian J(w; T), the (i, j)th element of which reads

   J(w; T)i,j = (1/T) [ δi,j Fi(w; T) − Fi(w; T) Fj(w; T) ],  (3.6)

where δi,j = 1 iff i = j and δi,j = 0 otherwise. The Jacobian is a symmetric N × N matrix.
The softmax map F induces on S⁰_{N−1} a discrete-time dynamics,

   w(t + 1) = F(w(t); T),  (3.7)
sometimes referred to as iterative softmax (ISM). In the next section, we study the relationship between the number and position of equilibria of ISM and the temperature parameter T. We study systems for N ≥ 2.

4 Fixed Points of Iterative Softmax

Recall that w ∈ S⁰_{N−1} is a fixed point (equilibrium) of the ISM dynamics driven by F if w = F(w). It is easy to see that for any temperature setting,¹ there is always at least one fixed point of ISM (see equation 3.7).

Lemma 1. The maximum entropy point w̄ = (N^{−1}, N^{−1}, ..., N^{−1})⊤ ∈ S⁰_{N−1} is a fixed point of ISM, equation 3.7, for any T ∈ R.

Proof. The result directly follows from w̄1 = w̄2 = ··· = w̄N.

There is a strong structure in the fixed points of ISM: coordinates of any fixed point of ISM can take on only up to two distinct values.

Theorem 1. Except for the maximum entropy fixed point w̄ = (N^{−1}, ..., N^{−1})⊤, for all the other fixed points w = (w1, w2, ..., wN)⊤ of ISM, equation 3.7, it holds: wi ∈ {γ1(w; T), γ2(w; T)}, i = 1, 2, ..., N, where γ1(w; T) > N^{−1} and γ2(w; T) < N^{−1} are the two solutions of Z(w; T) · x = exp(x/T).

Proof. We have

   wi = Fi(w; T) = exp(wi / T) / Z(w; T), i = 1, 2, ..., N.

That means

   Z(w; T) · wi = exp(wi / T),

and so all the coordinates of w must lie on the intersection of the line Z(w; T) · x and the exponential function exp(x/T). Since the exponential is a convex function, any line can intersect it in at most two points. Because w ≠ w̄, there must be a coordinate wi of w such that wi ≠ N^{−1}. If wi < N^{−1}, then (since w ∈ S⁰_{N−1}) there exists another coordinate wj of w such that wj > N^{−1}. By the argument above, all of the other coordinates of w must be equal to either wi or wj. In addition, wi, wj are solutions of Z(w; T) · x = exp(x/T). The case wi > N^{−1} can be treated analogously.
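Lemma 1 and theorem 1 are easy to check numerically. The sketch below (plain Python; the choices N = 10 and T = 0.15 are illustrative) verifies that the maximum entropy point is fixed at any temperature, and that an equilibrium reached by iterating the map from a biased starting point carries at most two distinct coordinate values.

```python
import math

def ism_step(w, T):
    """One application of the softmax map F(.; T) (equations 3.3 and 3.4)."""
    e = [math.exp(wi / T) for wi in w]
    z = sum(e)
    return [ei / z for ei in e]

N, T = 10, 0.15
# Lemma 1: the maximum entropy point is fixed at any temperature.
w_bar = [1.0 / N] * N
for temp in (0.05, 0.5, 5.0):
    assert max(abs(a - b) for a, b in zip(ism_step(w_bar, temp), w_bar)) < 1e-12

# Theorem 1: an equilibrium reached from elsewhere has at most two coordinate values.
w = [0.5] + [0.5 / (N - 1)] * (N - 1)          # start biased toward one coordinate
for _ in range(500):
    w = ism_step(w, T)
residual = max(abs(a - b) for a, b in zip(ism_step(w, T), w))
assert residual < 1e-12                        # numerically a fixed point
assert len({round(wi, 8) for wi in w}) <= 2    # gamma_1 (once) and gamma_2 (nine times)
```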
¹ In ISM, T is usually positive, but this claim is true for any T ∈ R.
We will often write the larger of the two fixed-point coordinate values as

   γ1(w; T) = α N^{−1}, α ∈ (1, N).  (4.1)

Theorem 2. Fix α ∈ (1, N), and write γ1 = α N^{−1}. Let ℓ_min be the smallest natural number greater than (α − 1)/γ1. Then, for ℓ ∈ {ℓ_min, ℓ_min + 1, ..., N − 1}, at temperature

   Te(γ1; N, ℓ) = (α − 1) [ −ℓ · ln( 1 − (α − 1)/(ℓγ1) ) ]^{−1},  (4.2)

there exist (N choose ℓ) distinct fixed points of ISM (3.7) with (N − ℓ) coordinates having value γ1 and ℓ coordinates equal to

   γ2 = (1 − γ1(N − ℓ)) / ℓ.  (4.3)
Proof. First, we verify that ℓ_min can never be greater than N − 1. It is sufficient to show that

   (α − 1)/γ1 < N − 1.  (4.4)

Equation 4.4 is equivalent to stating N > α, which is automatically fulfilled since α ∈ (1, N).
Now consider ℓ ∈ {ℓ_min, ℓ_min + 1, ..., N − 1} and a fixed point w ∈ S⁰_{N−1} of ISM having N − ℓ coordinates equal to γ1. Then w has ℓ coordinates of value γ2 < N^{−1}. Because w ∈ S⁰_{N−1}, we have

   (N − ℓ)γ1 + ℓγ2 = 1.

It follows that

   γ2 = (1 − γ1(N − ℓ)) / ℓ.  (4.5)

Note that γ2 must be positive, and so γ1 < (N − ℓ)^{−1}. This is indeed the case, since otherwise w would not lie in S⁰_{N−1}. The normalizing constant Z(w; T) is equal to

   Z(w; T) = (N − ℓ) · exp(γ1 / T) + ℓ · exp(γ2 / T).
Because w is a fixed point of ISM, we have

   γ1 = exp(γ1 / T) / Z(w; T).

Hence,

   γ1 · [ (N − ℓ) + ℓ exp( (γ2 − γ1)/T ) ] = 1.

Consequently,

   γ1 · [ 1 + (ℓ/(N − ℓ)) exp( (1 − α)/(ℓT) ) ] = (N − ℓ)^{−1}.  (4.6)

Equation 4.6 can be reformulated to

   exp( −(α − 1)/(ℓT) ) = 1 − (α − 1)/(ℓγ1).  (4.7)

Since ℓ ≥ ℓ_min > (α − 1)/γ1, the right-hand side of equation 4.7 is always positive. Given the number ℓ of coordinates with value γ2, we can solve for the temperature Te(γ1; N, ℓ) satisfying equation 4.7:

   Te(γ1; N, ℓ) = (α − 1) [ −ℓ · ln( 1 − (α − 1)/(ℓγ1) ) ]^{−1}.  (4.8)

To conclude the proof, we need to show that γ2 in equation 4.5 satisfies the fixed-point condition

   γ2 = exp(γ2 / T) / Z(w; T).  (4.9)

From equation 4.9, we have

   γ2 = [ (N − ℓ) exp( (γ1 − γ2)/T ) + ℓ ]^{−1}.  (4.10)

Denote (γ1 − γ2)/T by ω. Then

   γ1 = [ (N − ℓ) + ℓ exp(−ω) ]^{−1}, γ2 = [ (N − ℓ) exp(ω) + ℓ ]^{−1}.

By equation 4.5,

   γ2 = (1/ℓ) ( 1 − (N − ℓ) / ( (N − ℓ) + ℓ exp(−ω) ) ),

which reduces to equation 4.10.
Figure 1: A geometric interpretation of the temperature Te (γ1 ; N, ). The decreasing exponential function h(x; T) is intersected by the decreasing line g(x). Given a particular number , temperature T can be set to Te (γ1 ; N, ), so that h(x; Te (γ1 ; N, )) intersects g(x) in x0 = −1 . In this case, N = 10, α = 1.5, γ1 = 0.15, min = 4, = 5, Te (γ1 ; N, ) = 0.0910.
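Theorem 2 can be checked against the example quoted in the caption of Figure 1. The sketch below recomputes Te(γ1; N, ℓ) from equation 4.2 for N = 10, α = 1.5, ℓ = 5 and confirms both the quoted temperature 0.0910 and the fixed-point property of the constructed point.

```python
import math

def Te(gamma1, N, ell):
    """Existence temperature of equation 4.2, with alpha = N * gamma1."""
    alpha = N * gamma1
    return (alpha - 1.0) / (-ell * math.log(1.0 - (alpha - 1.0) / (ell * gamma1)))

N, alpha, ell = 10, 1.5, 5
gamma1 = alpha / N                              # 0.15
gamma2 = (1.0 - gamma1 * (N - ell)) / ell       # equation 4.3: 0.05
T = Te(gamma1, N, ell)
assert abs(T - 0.0910) < 5e-4                   # value quoted in the Figure 1 caption

# The constructed point must satisfy w = F(w; T):
w = [gamma1] * (N - ell) + [gamma2] * ell
z = sum(math.exp(wi / T) for wi in w)
Fw = [math.exp(wi / T) / z for wi in w]
assert max(abs(a - b) for a, b in zip(Fw, w)) < 1e-9
```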
By symmetry of the argument, there are (N choose ℓ) possibilities for equilibria of ISM with N − ℓ and ℓ coordinates equal to γ1 and γ2, respectively.
For a geometric interpretation of the temperature Te(γ1; N, ℓ), consider the functions

   h(x; T) = exp( −((α − 1)/T) x )  (4.11)

and

   g(x) = 1 − ((α − 1)/γ1) x,  (4.12)

operating on (0, ∞) and plotted in Figure 1. The decreasing exponential h(x; T) is intersected by the decreasing line g(x) at x0. Given a particular number ℓ of coordinates of value γ2, we can set T to T = Te(γ1; N, ℓ), so that the exponential h(x; Te(γ1; N, ℓ)) intersects g(x) in x0 = ℓ^{−1}.

Corollary 1. Consider the ISM (3.7). For ℓ ∈ {0, 1, ..., N − 1}, define

   αℓ = N / (N − ℓ).
Fix α ∈ (1, N) ∩ [αℓ−1, αℓ), ℓ ∈ {1, ..., N − 1}. Then, by varying the temperature parameter T, the ISM can have

   Q_N(α) = Σ_{k=ℓ}^{N−1} (N choose k)

distinct fixed points.

Proof. By theorem 2, for α ∈ (1, N) ∩ [αℓ−1, αℓ), there are temperature settings (see equation 4.2) such that points w ∈ S⁰_{N−1} with N − k and k coordinates equal to γ1 = α N^{−1} and γ2 (see equation 4.3), respectively, are equilibria of the ISM, k = ℓ, ℓ + 1, ..., N − 1. For this value of α, no other fixed points are possible. This is easily observed, since for α ∈ (1, N) ∩ [αℓ−1, αℓ), the lower bound ℓ_min on the number of smaller coordinates is equal to ℓ.

Since α0 = 1 < α1 < α2 < ··· < αN−1 = N, corollary 1 tells us that the richest structure of equilibria of ISM can be found in a small neighborhood of the maximum entropy fixed point w̄. The fewest fixed points exist in the corner areas of S_{N−1} (αN−2 = N/2 ≤ α < αN−1 = N): only (N choose N−1) = N fixed points are possible—points at the vertexes of S_{N−1}. In addition, we can analytically describe a decreasing trend in the number of possible equilibria as one moves away from w̄. When increasing α crosses αℓ, the number of possible fixed points decreases by (N choose ℓ). As an example, we show in Figure 2 the number of possible fixed points as a function of the larger coordinate γ1 in a 10-dimensional ISM system.
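The count Q_N(α) is a tail sum of binomial coefficients and can be tabulated directly. A small sketch (N = 10, as in Figure 2; the asserted values follow from the formula itself, not from the figure):

```python
from math import comb

def Q(N, ell):
    """Q_N(alpha) for alpha in [alpha_{ell-1}, alpha_ell): a binomial tail sum."""
    return sum(comb(N, k) for k in range(ell, N))

N = 10
# Just above the maximum entropy point (ell = 1): 2^N - 2 possible two-valued points.
assert Q(N, 1) == 2**N - 2 == 1022
# Near the vertexes of the simplex (ell = N - 1): only the N corner equilibria.
assert Q(N, N - 1) == N
# Crossing alpha_ell = N / (N - ell) removes exactly (N choose ell) fixed points.
for ell in range(1, N - 1):
    assert Q(N, ell) - Q(N, ell + 1) == comb(N, ell)
```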
5 Fixed-Point Stability in Iterative Softmax

Lemma 2. Let w ∈ S⁰_{N−1}, w ≠ w̄, be a fixed point of ISM with larger and smaller coordinates equal to γ1 and γ2, respectively. Then γ1 is never further from 1/2 than γ2 is, that is,

   |γ1 − 1/2| ≤ |γ2 − 1/2|.  (5.1)

Proof. Assume first that γ1 ∈ (0, 1/2]. Then, since γ1 > γ2, equation 5.1 is automatically satisfied.
Assume now that γ1 ∈ (1/2, 1). Since w ∈ S⁰_{N−1}, there can be only one coordinate of w with value γ1. Also, γ2 < 1/2. The coordinates must sum up to one, and so

   γ1 = 1 − (N − 1)γ2.
Figure 2: Number of possible fixed points as a function of the larger coordinate γ1 in ISM with N = 10. The farther we move from the maximum entropy equilibrium w̄ = (0.1, 0.1, ..., 0.1)⊤, the fewer fixed points are possible.
For N ≥ 2 we have²

   |γ1 − 1/2| = γ1 − 1/2 = 1/2 − (N − 1)γ2 ≤ 1/2 − γ2 = |γ2 − 1/2|.
Theorem 3. Consider a fixed point w ∈ S⁰_{N−1} of ISM (3.7) with one of its coordinates equal to γ1, N^{−1} ≤ γ1 < 1. Then if

   T > Ts,1(γ1) = 2γ1(1 − γ1),  (5.2)

or if

   T > Ts,2(γ1) = γ1,  (5.3)

the fixed point w is stable.

Proof. The Jacobian J(w; T) of the ISM map, equations 3.3 and 3.4, is given by equation 3.6. The Jacobian is a symmetric matrix, and so the column and row sum norms coincide:

   ‖J(w; T)‖∞ = max_{i=1,2,...,N} Σ_{j=1}^{N} |J(w; T)i,j|.  (5.4)
² Recall that we study systems for N ≥ 2.
The sum of the absolute values in the ith row of the Jacobian is equal to

   Si = (1/T) [ Fi(w; T)(1 − Fi(w; T)) + Fi(w; T) Σ_{j≠i} Fj(w; T) ]  (5.5)
      = (1/T) [ wi(1 − wi) + wi Σ_{j≠i} wj ]  (5.6)
      = (wi/T) [ (1 − wi) + (1 − wi) ]  (5.7)
      = (2/T) wi(1 − wi).  (5.8)

Note that equation 5.6 follows from 5.5 because w is a fixed point of F, that is, w = F(w), and equation 5.7 follows from 5.6 because w ∈ S⁰_{N−1}. We have

   ‖J(w; T)‖∞ = max_{i=1,2,...,N} {Si}  (5.9)
              = (2/T) max_{i=1,2,...,N} q(wi),  (5.10)

where q : [0, 1] → [0, 1/4] is the unimodal function q(x) = x(1 − x) with maximum at x = 1/2. The point x = 1/2 is also its point of symmetry. Hence,

   ‖J(w; T)‖∞ = (2/T) q(w̃),

where w̃ is the coordinate value of w closest to 1/2. By theorem 1, there can be at most two distinct values taken on by the coordinates of w. If w = w̄, all the coordinates are the same, and w̃ = N^{−1}.³ If w ≠ w̄, by lemma 2, it is the bigger of the two values, γ1 > N^{−1}, that is closer to 1/2. Hence, w̃ = γ1. It follows that if one of the coordinates of w is equal to γ1, N^{−1} ≤ γ1 < 1, then ‖J(w; T)‖∞ = 2q(γ1)/T.
Now, if ‖J(w; T)‖∞ < 1, the fixed point w is contractive. In other words, if T > Ts,1(γ1) = 2γ1(1 − γ1), then w is a stable fixed point of ISM (see equation 3.7).
Note that all eigenvalues of the Jacobian J(w; 1) (ISM operating at temperature 1) are upper-bounded by max_{1≤k≤N} Fk(w; 1) (Elfadel & Wyatt, 1994). It follows that all eigenvalues of J(w; T) are upper-bounded by
³ Note that the smaller value γ2 is automatically less than N^{−1}.
max_{1≤k≤N} T^{−1} Fk(w; T). For each fixed point w of ISM operating at temperature T, w = F(w; T), and so by theorem 1, all eigenvalues of J(w; T) are upper-bounded by γ1 T^{−1}. Hence, the spectral radius ρ(J(w; T)) of J(w; T) (the maximal absolute value of the eigenvalues of J(w; T)) is upper-bounded by γ1 T^{−1}. If ρ(J(w; T)) < 1, that is, if T > Ts,2(γ1) = γ1, then w is a stable equilibrium of ISM, equation 3.7.
Since Ts,2(γ1) ≤ Ts,1(γ1) on (N^{−1}, 1/2] and Ts,2(γ1) > Ts,1(γ1) on (1/2, 1), it is useful to combine the two bounds into a single one.

Corollary 2. Consider a fixed point w ∈ S⁰_{N−1} of ISM, equation 3.7, with one of its coordinates equal to γ1, N^{−1} ≤ γ1 < 1. Define

   Ts(γ1) = Ts,2(γ1) if γ1 ∈ [N^{−1}, 1/2), and Ts(γ1) = Ts,1(γ1) if γ1 ∈ [1/2, 1).  (5.11)
Then if T > Ts(γ1), the fixed point w is stable.

Lemma 1 tells us that the maximum entropy point w̄ = (N^{−1}, N^{−1}, ..., N^{−1})⊤ is an equilibrium of ISM, equation 3.7, for any temperature setting T. By theorem 3, for T > Ts(N^{−1}) = N^{−1}, w̄ is guaranteed to be stable. The next theorem claims that for high temperature settings, no fixed points other than w̄ exist, that is, w̄ is the unique equilibrium of ISM.

Theorem 4. For T > 1/2, the maximum entropy point w̄ = (N^{−1}, N^{−1}, ..., N^{−1})⊤ is the unique equilibrium of ISM, equation 3.7.

Proof. For any w ∈ S⁰_{N−1},
   ‖J(w; T)‖∞ = max_{i=1,2,...,N} (1/T) [ Fi(w; T)(1 − Fi(w; T)) + Fi(w; T) Σ_{j≠i} Fj(w; T) ]
              = max_{i=1,2,...,N} (2/T) [ Fi(w; T)(1 − Fi(w; T)) ]
              ≤ (2/T)(1/4) = 1/(2T).  (5.12)

For T > 1/2, the ISM, equation 3.7, is a global contraction. By the Banach fixed point theorem, there exists a unique attractive fixed point of the ISM. But w̄ is always a fixed point of the ISM. It follows that for T > 1/2, the only fixed point of ISM, equation 3.7, is w̄. In addition, w̄ is attractive.
Theorem 4 sharpens the statement by Elfadel and Wyatt (1994) that points of S_{N−1} converge under ISM, equation 3.7, to w̄ as T → ∞.
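The global contraction of theorem 4 is easy to observe numerically: for T > 1/2 each iteration shrinks distances by at least a factor 1/(2T), so every orbit collapses onto w̄. A short sketch with the illustrative values N = 10 and T = 0.6:

```python
import math
import random

def ism_step(w, T):
    """One application of the softmax map F(.; T)."""
    e = [math.exp(wi / T) for wi in w]
    z = sum(e)
    return [ei / z for ei in e]

random.seed(1)
N, T = 10, 0.6             # T > 1/2, so the map contracts by at least 1/(2T) per step
for _ in range(5):         # several random starting points on the simplex
    w = [random.random() for _ in range(N)]
    s = sum(w)
    w = [wi / s for wi in w]
    for _ in range(200):
        w = ism_step(w, T)
    # every orbit collapses onto the maximum entropy point (1/N, ..., 1/N)
    assert max(abs(wi - 1.0 / N) for wi in w) < 1e-12
```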
We now turn our attention to conditions under which fixed points of ISM are not stable.

Theorem 5. Consider a fixed point w ∈ S⁰_{N−1} of ISM, equation 3.7, with one of its coordinates equal to γ1, N^{−1} ≤ γ1 < 1. Let N − ℓ be the number of coordinates of value γ1. Then if

   T < Tu(γ1; N, ℓ) = γ1(2 − Nγ1)(N − ℓ)/(ℓN) + 1/N − 1/(ℓN),  (5.13)

w is not stable.

Proof. Let ρ(J(w; T)) be the spectral radius of the Jacobian J(w; T). If for a fixed point w of ISM, equation 3.7, ρ(J(w; T)) > 1, the fixed point is not stable. Note that

   ρ(J(w; T)) ≥ (1/N) Σ_{i=1}^{N} |λi| ≥ (1/N) |Σ_{i=1}^{N} λi| = |trace(J(w; T))| / N,

where the λi are the eigenvalues of the Jacobian J(w; T). By theorem 2, w ∈ S⁰_{N−1} has to have ℓ coordinates equal to

   γ2 = (1 − γ1(N − ℓ)) / ℓ.

We have

   |trace(J(w; T))| = (1/T) [ (N − ℓ)γ1(1 − γ1) + ℓγ2(1 − γ2) ]
                    = (1/(ℓT)) [ γ1(2 − Nγ1)(N − ℓ) + ℓ − 1 ].

The result follows by imposing |trace(J(w; T))| / N > 1.

Note that by theorem 5, for temperatures T < Tu(N^{−1}; N, N) = N^{−1} − N^{−2} = Ts(N^{−1}) − N^{−2}, the maximum entropy equilibrium w̄ of ISM, equation 3.7, cannot be attractive. This bound can be improved by realizing that the Jacobian J(w̄; T),
Figure 3: Stability regions for equilibria w ≠ w̄ of ISM with N = 10 units as a function of the larger coordinate γ1 and temperature T. The number of coordinates with value γ1 is denoted by N1. Temperatures Ts(γ1) (see equation 5.11) above which equilibria with a larger coordinate equal to γ1 are guaranteed to be stable are shown by a solid bold line. For N1 = N − ℓ ∈ {1, 2, 3, 4}, we show the temperatures Tu(γ1; N, ℓ) (see equation 5.13) below which equilibria with larger coordinate equal to γ1 are guaranteed to be unstable with solid normal lines. Also plotted are temperatures Te(γ1; N, ℓ) (see equation 4.2) at which equilibria with the given number N1 of coordinates of value γ1 exist (dashed lines).
equation 3.6, of ISM, equation 3.7, at w̄ has exactly one zero eigenvalue (corresponding to the eigenvector (1, 1, ..., 1)⊤) and N − 1 eigenvalues equal to (NT)^{−1}. Hence, the spectral radius is equal to

   ρ(J(w̄; T)) = 1/(NT).  (5.14)

It follows that the maximum entropy equilibrium w̄ is attractive if T > N^{−1} and nonattractive if T < N^{−1}.
An illustrative summary of the previous results for equilibria w ≠ w̄ is provided in Figure 3. The ISM has N = 10 units. Coordinates of such fixed points can take on only two possible values, the larger of which we denote by γ1. Temperatures Ts(γ1) (see equation 5.11), above which
equilibria with larger coordinate equal to γ1 are guaranteed to be stable, are shown as the solid bold line. Denote the number of coordinates with value γ1 by N1: N1 = N − ℓ. For N1 = 1, 2, 3, 4, we show the temperatures Tu(γ1; N, ℓ), equation 5.13, below which equilibria with larger coordinate equal to γ1 are guaranteed to be unstable with solid normal lines.⁴ For a given value of γ1, temperatures between Tu(γ1; N, ℓ) and Ts(γ1) potentially correspond to saddle-type equilibria. We also plot the temperatures Te(γ1; N, ℓ), equation 4.2, at which equilibria with the given number N1 of coordinates of value γ1 exist (dashed lines). The figure suggests that most ISM equilibria with more than one larger coordinate γ1, that is, N1 ≥ 2, are of saddle type. Only fixed points with extreme values of γ1 close to N1^{−1} are guaranteed to be unstable. These findings were numerically confirmed by eigenvalue decomposition of the Jacobians of a large number of fixed points corresponding to the dashed lines. Note that apart from the maximum entropy equilibrium w̄, only the extreme equilibria close to the vertexes of the simplex S_{N−1} can be guaranteed by our theory to be stable. This happens when the dashed line for N1 = 1 crosses the solid bold line.

We now show that the character of the partitioning of the (γ1, T)-space with respect to stability regimes of ISM equilibria remains unchanged for other values of N (the dimensionality of ISM).

Lemma 3. Consider equilibria of an N-dimensional (N > 2) ISM, equation 3.7, with at least two larger coordinates of value γ1: ℓ ≤ N − 2 and γ1 ∈ (N^{−1}, (N − ℓ)^{−1}). Then, for the temperature function, equation 4.2,

   Tℓ,N(γ1) = Te(γ1; N, ℓ) = (Nγ1 − 1) [ ℓ · ln( ℓγ1 / (γ1(ℓ − N) + 1) ) ]^{−1},  (5.15)

guaranteeing the existence of such equilibria, it holds:

1. Tℓ,N(N^{−1}) = Ts(N^{−1}).
2. Tℓ,N is concave.
3. Tℓ,N′(N^{−1}) = 1 − N/(2ℓ).
Proof.
1. At γ1 = 1/N, both the numerator and the denominator of equation 5.15 are zero. By L'Hospital's rule,⁵

   lim_{γ1→N^{−1}} Tℓ,N(γ1) = lim_{γ1→N^{−1}} N γ1 [γ1(ℓ − N) + 1] / ℓ = 1/N = Ts(N^{−1}).  (5.16)

⁴ Note that since w ∈ S⁰_{N−1}, γ1 must be smaller than 1/N1.
⁵ We slightly abuse mathematical notation by considering γ1 a continuous variable.
2. The first derivative of Tℓ,N is calculated as

   Tℓ,N′(γ1) = N / ( ℓ ln(·) ) − (Nγ1 − 1) / ( ℓ ln²(·) [γ1²(ℓ − N) + γ1] ),  (5.17)

where, to ease the presentation, ln(·) stands for ln( ℓγ1 / (γ1(ℓ − N) + 1) ). We write

   Tℓ,N′(γ1) = ℓ^{−1} (A(γ1) − B(γ1)),

where

   A(γ1) = N / ln(·) and B(γ1) = (Nγ1 − 1) / ( ln²(·) γ1 [γ1(ℓ − N) + 1] ).

The derivative of A(γ1),

   A′(γ1) = −N / ( ln²(·) γ1 [γ1(ℓ − N) + 1] ),

is negative, as for γ1 ∈ (N^{−1}, (N − ℓ)^{−1}) we have γ1(ℓ − N) + 1 > 0. Denote ln²(·) γ1 [γ1(ℓ − N) + 1] by C. Then

   B′(γ1) = C^{−2} ( D(γ1) ln²(·) + 2 ln(·) ),

where D(γ1) = N(N − ℓ)γ1² − 2(N − ℓ)γ1 + 1. Now, D(γ1) is a convex function with minimum at N^{−1} and D(N^{−1}) = ℓ/N > 0. Also note that since γ1 > N^{−1}, we have

   ℓγ1 / (γ1(ℓ − N) + 1) > 1,

and so ln(·) > 0. From A′(γ1) < 0 and B′(γ1) > 0, it follows that Tℓ,N″(γ1) < 0 and Tℓ,N is concave.

3. The result follows from evaluating lim_{γ1→N^{−1}} Tℓ,N′(γ1) based on equation 5.17. After applying the L'Hospital rule twice, we get

   lim_{γ1→N^{−1}} Tℓ,N′(γ1) = 1 − N/(2ℓ).

We are now ready to prove that for equilibria of ISM, equation 3.7, with at least two larger coordinates γ1 > N^{−1}, the temperatures Te(γ1; N, ℓ) at which they exist are always below the bound Ts(γ1), above which their stability is guaranteed.
Theorem 6. Let N > 2 and ℓ ≤ N − 2. Then, for γ1 ∈ (N^{−1}, (N − ℓ)^{−1}), it holds: Te(γ1; N, ℓ) < Ts(γ1).

Proof. By point 2 of lemma 3, Tℓ,N(γ1) = Te(γ1; N, ℓ) is a concave function and hence can be upper-bounded on (N^{−1}, (N − ℓ)^{−1}) by the tangent line κ(γ1) to Tℓ,N(γ1) at N^{−1}: Tℓ,N(γ1) ≤ κ(γ1), where, by points 1 and 3 of lemma 3,

   κ(γ1) = 1/N + (1 − N/(2ℓ)) (γ1 − 1/N).

Now, Ts(γ1) is a line,

   Ts(γ1) = 1/N + (γ1 − 1/N).

Since N/(2ℓ) > 0, the tangent κ(γ1), which upper-bounds Tℓ,N(γ1), never crosses Ts(γ1) and always stays below it. Hence, on (N^{−1}, (N − ℓ)^{−1}), Te(γ1; N, ℓ) < Ts(γ1).

As an illustrative example, we present an analog of Figure 3 for a 17-dimensional ISM in Figure 4. The number of coordinates with value γ1 is denoted by N1. Temperatures Ts(γ1) (see equation 5.11) above which equilibria with larger coordinate equal to γ1 are guaranteed to be stable are shown with a solid bold line. For N1 = N − ℓ ∈ {1, 2, 3, 4}, we show the temperatures Tu(γ1; N, ℓ) (see equation 5.13) below which equilibria with a larger coordinate equal to γ1 are guaranteed to be unstable with solid normal lines. Temperatures Te(γ1; N, ℓ) (see equation 4.2) at which equilibria with the given number N1 of coordinates of value γ1 exist (dashed lines) are also marked according to the stability type of the corresponding fixed points. The stability types were determined by eigenanalysis of the Jacobians (see equation 3.6) at the fixed points. Stable and unstable equilibria existing at temperature Te(γ1; N, ℓ) are shown as stars and circles, respectively. All unstable equilibria are of saddle type. The horizontal dashed line shows a temperature, numerically determined by Kwok and Smith (2005), at which the attractive equilibria of the 17-dimensional ISM lose stability and the maximum entropy point w̄ remains the only stable fixed point. The position where Ts(γ1) crosses Te(γ1; N, ℓ) for N1 = 1 is marked by a bold circle. Note that no equilibrium with more than one coordinate greater than N^{−1} is stable.
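Theorem 6 can be spot-checked over a grid of admissible (ℓ, γ1) pairs. The sketch below scans a few illustrative dimensionalities and confirms Te(γ1; N, ℓ) < Ts(γ1) throughout; on the relevant range (N^{−1}, (N − ℓ)^{−1}) ⊆ (N^{−1}, 1/2], corollary 2 gives Ts(γ1) = γ1.

```python
import math

def Te(gamma1, N, ell):
    """Existence temperature, equations 4.2 and 5.15, with alpha = N * gamma1."""
    alpha = N * gamma1
    return (alpha - 1.0) / (-ell * math.log(1.0 - (alpha - 1.0) / (ell * gamma1)))

for N in (3, 5, 10, 17):
    for ell in range(1, N - 1):                  # at least two larger coordinates
        lo, hi = 1.0 / N, 1.0 / (N - ell)        # admissible range of gamma1
        for k in range(1, 200):
            gamma1 = lo + (hi - lo) * k / 200.0
            # On (1/N, 1/2], Ts(gamma1) = Ts,2(gamma1) = gamma1 (corollary 2)
            assert Te(gamma1, N, ell) < gamma1
```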
Figure 4: Stability types for equilibria w ≠ w̄ of ISM with N = 17 units as a function of the larger coordinate γ1 and temperature T. The number of coordinates with value γ1 is denoted by N1. Temperatures Ts(γ1) (see equation 5.11) above which equilibria with larger coordinates equal to γ1 are guaranteed to be stable are shown with a solid bold line. For N1 = N − ℓ ∈ {1, 2, 3, 4}, we show the temperatures Tu(γ1; N, ℓ) (see equation 5.13) below which equilibria with a larger coordinate equal to γ1 are guaranteed to be unstable with solid normal lines. Temperatures Te(γ1; N, ℓ) (see equation 4.2) at which equilibria with the given number N1 of coordinates of value γ1 exist (dashed lines) are also marked according to the stability type of the corresponding fixed points. Stable and unstable equilibria existing at temperature Te(γ1; N, ℓ) are shown as stars and circles, respectively. The horizontal dashed line shows a temperature, numerically determined by Kwok and Smith (2005), at which the attractive equilibria of the 17-dimensional ISM lose stability and the maximum entropy point w̄ remains the only stable fixed point. The bold circle marks the position where Ts(γ1) crosses Te(γ1; N, ℓ) for N1 = 1.
6 Critical Temperature for Intermittent Search in SONN with SoftMax Weight Renormalization

It has been hypothesized that ISM provides an underlying driving force behind intermittent search in SONN with softmax weight renormalization (Kwok & Smith, 2003, 2005) (see section 2). Kwok and Smith (2005) argue that the critical temperature at which the intermittent search takes place
Figure 5: Approximations of the bifurcation point based on equation 4.2 and bound, equation 5.11, for increasing ISM dimensionalities N.
corresponds to the bifurcation point of the autonomous ISM dynamics at which the existing equilibria lose stability and only the maximum entropy point w̄ survives as the sole stable equilibrium. The authors numerically determined such bifurcation points for several ISM dimensionalities N. It was reported that the bifurcation temperatures decreased with increasing N. Based on the analysis in section 5, the bifurcation points correspond to the case when the equilibria near the corners of the simplex S_{N−1} (equilibria with only one coordinate with large value γ1) lose stability. Based on the bound Ts(γ1) (see equation 5.11) and the temperatures Te(γ1; N, N − 1) (see equation 4.2) at which such equilibria of ISM exist, this bifurcation point can be approximated by the bold circle in Figure 4. Figure 5 shows the evolution of the approximation of the bifurcation point (based on Te(γ1; N, N − 1), equation 4.2, and the bound Ts(γ1), equation 5.11) for increasing ISM dimensionalities N. The bound, equation 5.11, is independent of N. In accordance with the empirical findings of Kwok and Smith, the bifurcation temperature decreases with increasing N.

We present two approximations to the critical temperature T∗(N) at which the bifurcation occurs. The first one expands Te(γ1; N, N − 1) (see equation 4.2) around γ1⁰ as a second-order polynomial T^{(2)}_{N−1}(γ1). Based on Figure 5, a good choice for γ1⁰ is γ1⁰ = 0.9. An approximation to the critical temperature is then
P. Tiňo
obtained by solving

T_{N−1}^(2)(γ1) = Ts(γ1)

for γ1, and then plugging the solution γ1^(2) back into bound 5.11, that is, calculating Ts(γ1^(2)). We have
Te(γ1; N, N − 1) = (Nγ1 − 1) / ((N − 1) ln((N − 1)γ1 / (1 − γ1))),   (5.18)

Te′(γ1; N, N − 1) = N / ((N − 1) ln((N − 1)γ1 / (1 − γ1))) − (Nγ1 − 1) / ((N − 1)γ1(1 − γ1) ln²((N − 1)γ1 / (1 − γ1))),   (5.19)

and

Te″(γ1; N, N − 1) = −[Nγ1 − 2γ1 + 1 − 2(Nγ1 − 1) ln⁻¹((N − 1)γ1 / (1 − γ1))] / ((N − 1)γ1²(1 − γ1)² ln²((N − 1)γ1 / (1 − γ1))).   (5.20)
By solving

A γ1² + B γ1 + C = 0,   (5.21)

where

A = (1/2) Te″(γ1^0; N, N − 1) + 2,
B = Te′(γ1^0; N, N − 1) − γ1^0 Te″(γ1^0; N, N − 1) − 2,
C = Te(γ1^0; N, N − 1) − γ1^0 Te′(γ1^0; N, N − 1) + (1/2) (γ1^0)² Te″(γ1^0; N, N − 1),

and retaining the solution γ1^(2) compatible with the requirement that w ∈ S_{N−1}, we obtain an analytical approximation T∗^(2)(N) to the critical temperature T∗(N):

T∗^(2)(N) = 2γ1^(2) (1 − γ1^(2)).   (5.22)
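As a quick numerical illustration of this construction (a sketch, not the paper's code; the function names are ours), equations 5.18 to 5.22 can be evaluated directly:

```python
import math

def Te(g, N):
    # Existence temperature of one-hot-like equilibria, equation 5.18.
    L = math.log((N - 1) * g / (1 - g))
    return (N * g - 1) / ((N - 1) * L)

def dTe(g, N):
    # First derivative, equation 5.19.
    L = math.log((N - 1) * g / (1 - g))
    return N / ((N - 1) * L) - (N * g - 1) / ((N - 1) * g * (1 - g) * L ** 2)

def d2Te(g, N):
    # Second derivative, equation 5.20.
    L = math.log((N - 1) * g / (1 - g))
    num = N * g - 2 * g + 1 - 2 * (N * g - 1) / L
    return -num / ((N - 1) * g ** 2 * (1 - g) ** 2 * L ** 2)

def T_star_2(N, g0=0.9):
    # Quadratic approximation: expand Te around g0, equate it to the
    # stability bound Ts(g) = 2 g (1 - g), and solve equation 5.21.
    A = 0.5 * d2Te(g0, N) + 2.0
    B = dTe(g0, N) - g0 * d2Te(g0, N) - 2.0
    C = Te(g0, N) - g0 * dTe(g0, N) + 0.5 * g0 ** 2 * d2Te(g0, N)
    disc = math.sqrt(B * B - 4 * A * C)
    # Retain the root compatible with the simplex (0 < g < 1).
    g = max((-B + disc) / (2 * A), (-B - disc) / (2 * A))
    return 2 * g * (1 - g)          # equation 5.22
```

Evaluating T_star_2 for growing N reproduces the decreasing trend of the bifurcation temperatures reported by Kwok and Smith.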
A cheaper approximation to T∗ (N) can be obtained by realizing that both functions in Figure 5 are almost linear in the neighborhood of their
intersection. First, we approximate the bound Ts(γ1), equation 5.11 (the solid bold line), by the tangent line at (1, 0):

v(γ1) = 2(1 − γ1).   (5.23)
Imposing v(γ1) = Te(γ1; N, N − 1) leads to

y = 2(N − 1) ln(y) + 1,   (5.24)

where

y = (N − 1)γ1 / (1 − γ1).   (5.25)
We expand the right-hand side of equation 5.24 around y0 (corresponding to γ1^0) as

(2(N − 1)/y0) y + 2(N − 1)(ln(y0) − 1) + 1,

and so the approximate solution to equation 5.24 reads

y^(1) = [2(N − 1)(ln(y0) − 1) + 1] / (1 − 2(N − 1)/y0).   (5.26)
From equation 5.25, the corresponding value of γ1 is

γ1^(1) = y^(1) / (N − 1 + y^(1)),   (5.27)

yielding an approximation T∗^(1)(N) to the critical temperature T∗(N):

T∗^(1)(N) = 2γ1^(1) (1 − γ1^(1)).   (5.28)
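The linear approximation is cheap enough to write out in a few lines (again a sketch with our own names; y0 corresponds to γ1^0 through equation 5.25):

```python
import math

def T_star_1(N, g0=0.9):
    # Linear (tangent) approximation to the critical temperature:
    # the bound Ts(g) = 2 g (1 - g) is replaced by its tangent at (1, 0)
    # and equation 5.24 is linearized around y0, equation 5.26.
    y0 = (N - 1) * g0 / (1 - g0)
    y1 = (2 * (N - 1) * (math.log(y0) - 1) + 1) / (1 - 2 * (N - 1) / y0)
    g1 = y1 / (N - 1 + y1)          # equation 5.27
    return 2 * g1 * (1 - g1)        # equation 5.28
```

For N = 10 this gives a value close to the quadratic approximation, and the approximate fixed point y^(1) nearly satisfies equation 5.24.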
(2)
To illustrate the approximations T∗ (N) and T∗ (N) of the critical temperature TB = T∗ (N), numerically found bifurcation temperatures for ISM systems with dimensions between 8 and 30 are shown in Figure 6 as circles. (2) The approximations based on quadratic and linear expansions, T∗ (N) and (1) T∗ (N), are plotted with bold solid and dashed lines, respectively. Also shown are the temperatures 1/N above which the maximum entropy equilibrium w is stable (solid normal line). At bifurcation temperature, w is already stable, and equilibria at vertexes of simplex SN−1 lose stability. The
Figure 6: Analytical approximations of the bifurcation temperature TB = T∗(N) for increasing ISM dimensionalities N. The approximations based on quadratic and linear expansions, T∗^(2)(N) and T∗^(1)(N), are plotted with bold solid and dashed lines, respectively. Numerically found bifurcation temperatures are shown as circles. The best-performing temperatures for intermittent search in the N-queens problems are shown as stars. Also shown are the temperatures 1/N above which the maximum entropy equilibrium w is stable. As Kwok and Smith suggested, there is a marked correspondence between the bifurcation temperatures and the best-performing temperatures in intermittent search by SONN. The analytical solutions T∗^(2)(N) and T∗^(1)(N) appear to approximate the bifurcation temperatures well.
analytical solutions T∗^(2)(N) and T∗^(1)(N) appear to approximate the bifurcation temperatures well. We also numerically determined optimal temperature settings for intermittent search by SONN in the N-queens problems. Following Kwok and Smith (2005), we set the SONN parameter β to β = 0.8.⁶ The optimal neighborhood size increased with the problem size N from L = 2 (for N = 8), through L = 3 (N = 10, 13), L = 4 (N = 15, 17, 20), to L = 5
6 Our experiments confirmed findings by Kwok and Smith that the choice of β is in general not sensitive to N.
(N = 25, 30).⁷ Based on our extensive experimentation, the best-performing temperatures for intermittent search in the N-queens problems are shown as stars in Figure 6. Clearly, as suggested by Kwok and Smith, there is a marked correspondence between the bifurcation temperatures of ISM equilibria and the best-performing temperatures in intermittent search by SONN.
The only equilibria that can be guaranteed to be stable by our theory are the maximum entropy point w and "one-hot" solutions at the vertexes of the (N − 1)-dimensional simplex S_{N−1}. Each such one-hot solution corresponds to one particular assignment of SONN inputs to SONN outputs. At the critical point of losing their stability, the one-hot solutions do not exhibit a strong attractive force in the ISM state space, and SONN weight updates can easily jump from one assignment to another, occasionally being pulled by the neutralizing power of the stable maximum entropy equilibrium w. This mechanism enables rapid intermittent search for good assignment solutions in SONN.
Obviously, much more work is needed to rigorously analyze the intricate influence of the neighborhood size, SONN weight updates, and softmax renormalization in a unified framework. This is a matter for our future work. It is highly encouraging, however, that the analytically obtained approximations of critical temperatures predict the optimal working temperatures for SONN intermittent search very well. So far, the critical temperatures have been determined only by extensive trial-and-error numerical investigations. Note that as a direct consequence of theorem 4, no intermittent search can exist in SONN for temperatures T > 1/2.

7 Conclusions

A recently proposed new kind of optimization dynamics using self-organizing neural networks driven by softmax weight renormalization is capable of powerful intermittent search for high-quality solutions in difficult assignment optimization problems (Kwok & Smith, 2005).
However, such dynamics is sensitive to the temperature setting in the softmax renormalization step. Kwok and Smith (2005) have hypothesized that the optimal temperature setting corresponds to a symmetry-breaking bifurcation of equilibria of the renormalization step, when viewed as an autonomous dynamical system. Following Kwok and Smith (2003, 2005), we call such dynamical systems iterative softmax (ISM).
We have rigorously analyzed equilibria of ISM. In particular, we determined their number, position, and stability types. Most fixed points can be found in the neighborhood of the maximum entropy equilibrium point w = (N⁻¹, N⁻¹, . . . , N⁻¹)ᵀ. We have calculated the exact rate of decrease in
7 The increase of optimal neighborhood size L with increasing problem dimension N is in accordance with findings in Kwok and Smith (2005).
the number of ISM equilibria as one moves away from w. We have derived bounds on temperatures guaranteeing different stability types of ISM equilibria. It appears that most ISM fixed points are of a saddle type. This hypothesis is supported by our extensive numerical experiments (most are not reported here). The only equilibria that can be guaranteed to be stable by our theory are the maximum entropy point w and one-hot solutions at the vertexes of the N − 1 dimensional simplex. We argued that close to the critical bifurcation temperature at which such one-hot equilibria lose stability, the most powerful intermittent search can take place in SONN. Based on temperature bounds guaranteeing stability of one-hot equilibria and temperatures guaranteeing their existence, we were able to derive analytical approximations to the critical bifurcation temperatures that are in good agreement with those found by numerical investigations. So far, the critical temperatures have been determined only by extensive trial-and-error numerical investigations. Moreover, the analytically obtained critical temperatures predict the optimal working temperatures for SONN intermittent search well. We have also shown that no intermittent search can exist in SONN for temperatures greater than one-half.
References

Crutchfield, J. P., & Young, K. (1990). Computation at the onset of chaos. In W. H. Zurek (Ed.), Complexity, entropy, and the physics of information (Vol. 8, pp. 223–269). Reading, MA: Addison-Wesley.
Elfadel, I. M., & Wyatt, J. L. (1994). The softmax nonlinearity: Derivation using statistical mechanics and useful properties as a multiterminal analog circuit element. In J. Cowan, G. Tesauro, & C. L. Giles (Eds.), Advances in neural information processing systems, 6 (pp. 882–887). San Mateo, CA: Morgan Kaufmann.
Gold, S., & Rangarajan, A. (1996). Softmax to softassign: Neural network algorithms for combinatorial optimization. Journal of Artificial Neural Networks, 2, 381–399.
Guerrero, F., Lozano, S., Smith, K. A., Canca, D., & Kwok, T. (2002). Manufacturing cell formation using a new self-organizing neural network. Computers and Industrial Engineering, 42, 377–382.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences USA, 79, 2554–2558.
Kohonen, T. (1982). Self-organizing formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kosowsky, J. J., & Yuille, A. L. (1994). The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks, 7(3), 477–490.
Kwok, T., & Smith, K. A. (2002). Characteristic updating-normalisation dynamics of a self-organising neural network for enhanced combinatorial optimisation. In Proceedings of the 9th International Conference on Neural Information Processing (ICONIP'2002) (Vol. 3, pp. 1146–1152). Piscataway, NJ: IEEE.
Kwok, T., & Smith, K. A. (2003). Performance-enhancing bifurcations in a self-organising neural network. In Computational Methods in Neural Modeling: Proceedings of the 7th International Work-Conference on Artificial and Natural Neural Networks (IWANN'2003) (pp. 390–397). Berlin: Springer-Verlag.
Kwok, T., & Smith, K. A. (2004). A noisy self-organizing neural network with bifurcation dynamics for combinatorial optimization. IEEE Transactions on Neural Networks, 15(1), 84–88.
Kwok, T., & Smith, K. A. (2005). Optimization via intermittency with a self-organizing neural network. Neural Computation, 17, 2454–2481.
Langton, C. G. (1990). Computation at the edge of chaos: Phase transitions and emergent computation. Physica D, 42, 12–37.
Minton, S., Johnston, M. D., Philips, A. B., & Laird, P. (1992). Minimizing conflicts: A heuristic repair method for constraint satisfaction and scheduling problems. Artificial Intelligence, 58(1–3), 161–205.
Rangarajan, A. (2000). Self-annealing and self-annihilation: Unifying deterministic annealing and relaxation labeling. Pattern Recognition, 33(4), 635–649.
Smith, K. A. (1995). Solving the generalized quadratic assignment problem using a self-organizing process. In Proceedings of the IEEE Int. Conf. on Neural Networks (Vol. 4, pp. 1876–1879). Piscataway, NJ: IEEE.
Smith, K. A. (1999). Neural networks for combinatorial optimization: A review of more than a decade of research. INFORMS Journal on Computing, 11(1), 15–34.
Smith, K. A., Palaniswami, M., & Krishnamoorthy, M. (1998). Neural techniques for combinatorial optimization with applications. IEEE Transactions on Neural Networks, 9(6), 1301–1318.
Tagliarini, G. A., & Page, E. W. (1987). Solving constraint satisfaction problems with neural networks. In The First IEEE International Conference on Neural Networks (Vol. 3, pp. 741–747). Piscataway, NJ: IEEE.
Received May 3, 2006; accepted July 19, 2006.
LETTER
Communicated by Olivier Chapelle
Recursive Finite Newton Algorithm for Support Vector Regression in the Primal

Liefeng Bo
[email protected] Ling Wang
[email protected] Licheng Jiao
[email protected] Institute of Intelligent Information Processing, Xidian University, Xi’an 710071, China
Some algorithms in the primal have been recently proposed for training support vector machines. This letter follows those studies and develops a recursive finite Newton algorithm (IHLF-SVR-RFN) for training nonlinear support vector regression. The insensitive Huber loss function and the computation of the Newton step are discussed in detail. Comparisons with LIBSVM 2.82 show that the proposed algorithm gives promising results.

Neural Computation 19, 1082–1096 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Support vector machines (SVMs) (Burges, 1998; Vapnik, 1998; Smola & Schölkopf, 2004) are powerful tools for classification and regression. In the past few years, fast algorithms for SVMs have been an active research direction. Traditionally, SVMs are trained by using decomposition techniques such as SVMlight (Joachims, 1999) and sequential minimal optimization (Platt, 1999; Shevade, Keerthi, Bhattacharyya, & Murthy, 2000; Keerthi, Shevade, Bhattacharyya, & Murthy, 2001), which solve the dual problem by optimizing a small subset of the variables at each iteration. Recently, some algorithms in the primal have been presented for training SVMs. Mangasarian (2002) and Fung and Mangasarian (2003) proposed the finite Newton algorithm and showed that it is rather powerful for linear SVMs. Keerthi and DeCoste (2005) introduced some appropriate techniques to speed up the finite Newton algorithm and named the resulting algorithm the modified finite Newton algorithm. Chapelle (2006) proposed the recursive finite Newton algorithm and showed it to be as efficient as the dual domain method for nonlinear support vector classification. This letter follows these studies and develops a recursive finite Newton algorithm (IHLF-SVR-RFN) for nonlinear SVR. One of its main contributions is to introduce an insensitive Huber loss function that includes several
popular loss functions as its special cases. Comparisons with LIBSVM 2.82 (Fan, Chen, & Lin, 2005) suggest that IHLF-SVR-RFN is as efficient as the dual domain method for nonlinear SVR.
The letter is organized as follows. In section 2, SVR in the primal is introduced. The insensitive Huber loss function and its corresponding optimization problem are proposed in section 3. The recursive finite Newton algorithm is discussed in section 4. Comparisons with LIBSVM 2.82 are reported in section 5. Section 6 presents the conclusions.

2 Support Vector Regression in the Primal

Consider a regression problem with training samples {x_i, y_i}_{i=1}^n, where x_i is the input sample and y_i is the corresponding target. To obtain a linear predictor, SVR solves the following optimization problem:
min_{w,b}  ‖w‖²/2 + C Σ_{i=1}^n (ξ_i^p + ξ̄_i^p)
s.t.  (w · x_i + b) − y_i ≤ ε + ξ_i,
      y_i − (w · x_i + b) ≤ ε + ξ̄_i,
      ξ_i, ξ̄_i ≥ 0,  i = 1, 2, · · · , n.   (2.1)
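For fixed (w, b), each constraint pair pins down its slacks in closed form: the optimal choice is ξ_i = max(r_i − ε, 0) and ξ̄_i = max(−r_i − ε, 0) with r_i = w · x_i + b − y_i, and at most one of the two is nonzero, so ξ_i^p + ξ̄_i^p collapses to max(|r_i| − ε, 0)^p. A minimal numeric check of this reduction (helper names are ours):

```python
def l_eps(r, eps, p):
    # Insensitive loss max(|r| - eps, 0)^p obtained after eliminating slacks.
    return max(abs(r) - eps, 0.0) ** p

def slack_form(r, eps, p):
    # Optimal slacks of problem 2.1 for a fixed residual r; at most one
    # of (xi, xi_bar) is nonzero, so xi^p + xi_bar^p equals l_eps(r).
    xi = max(r - eps, 0.0)
    xi_bar = max(-r - eps, 0.0)
    return xi ** p + xi_bar ** p
```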
Eliminating the slack variables {ξ_i, ξ̄_i}_{i=1}^n and dividing equation 2.1 by the factor C, we get the unconstrained optimization problem,

min_{w,b}  L_ε(w, b) = Σ_{i=1}^n l_ε(w · x_i + b − y_i) + λ‖w‖²,   (2.2)
where λ = 1/(2C) and l_ε(r) = max(|r| − ε, 0)^p. The most popular selections for p are 1 and 2. For convenience of expression, the loss function with p = 1 is referred to as the insensitive linear loss function (ILLF) and that with p = 2 as the insensitive quadratic loss function (IQLF). Nonlinear SVR can be obtained by using a kernel function and an associated reproducing kernel Hilbert space H. The resulting optimization is
min_f  L_ε(f) = Σ_{i=1}^n l_ε(f(x_i) − y_i) + λ‖f‖²_H,   (2.3)
where we have dropped b for simplicity. Our experience shows that the generalization performance of SVR is not affected by this omission. According to the representer theorem (Kimeldorf & Wahba, 1970), the optimal function
for equation 2.3 can be expressed as a linear combination of the kernel functions centered at the training samples,

f(x) = Σ_{i=1}^n β_i k(x, x_i).   (2.4)
Substituting equation 2.4 into 2.3, we have

min_β  L_ε(β) = Σ_{i=1}^n l_ε(Σ_{j=1}^n β_j k(x_i, x_j) − y_i) + λ Σ_{i,j=1}^n β_i β_j k(x_i, x_j).   (2.5)
Introducing the kernel matrix K with K_ij = k(x_i, x_j), and K_i the ith row of K, equation 2.5 can be rewritten as

min_β  L_ε(β) = Σ_{i=1}^n l_ε(K_i β − y_i) + λ β^T K β.   (2.6)
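The step from equation 2.5 to 2.6 is pure bookkeeping, which is easy to confirm numerically (a sketch; the gaussian kernel here anticipates section 5, and all names are ours):

```python
import math

def k_gauss(x, z, gamma=1.0):
    # Gaussian kernel k(x, z) = exp(-gamma * ||x - z||^2).
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def objective_25(beta, X, y, eps, lam, p=2):
    # Equation 2.5: explicit double sums over kernel evaluations.
    n = len(X)
    loss = 0.0
    for i in range(n):
        fi = sum(beta[j] * k_gauss(X[i], X[j]) for j in range(n))
        loss += max(abs(fi - y[i]) - eps, 0.0) ** p
    reg = lam * sum(beta[i] * beta[j] * k_gauss(X[i], X[j])
                    for i in range(n) for j in range(n))
    return loss + reg

def objective_26(beta, X, y, eps, lam, p=2):
    # Equation 2.6: the same objective through the kernel matrix K,
    # using K_i beta for predictions and beta^T K beta as regularizer.
    n = len(X)
    K = [[k_gauss(X[i], X[j]) for j in range(n)] for i in range(n)]
    Kb = [sum(K[i][j] * beta[j] for j in range(n)) for i in range(n)]
    loss = sum(max(abs(Kb[i] - y[i]) - eps, 0.0) ** p for i in range(n))
    return loss + lam * sum(beta[i] * Kb[i] for i in range(n))
```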
As long as l_ε(·) is differentiable, we can optimize equation 2.6 by a gradient descent algorithm.

3 Finite Newton Algorithm for Insensitive Huber Loss Function

The finite Newton algorithm is a Newton algorithm that can be proved to converge in a finite number of steps, and it has been demonstrated to be very efficient for support vector classification (Mangasarian, 2002; Keerthi & DeCoste, 2005; Chapelle, 2006). Here, we focus on the regression case. The finite Newton algorithm is straightforward for the insensitive quadratic loss function by defining the generalized Hessian matrix (Clarke, 1983; Hiriart-Urruty, Strodiot, & Nguyen, 1984; Keerthi & DeCoste, 2005); however, it is not applicable to the insensitive linear loss function, since that loss is not differentiable. Inspired by the Huber loss function (Huber, 1981), we propose an insensitive Huber loss function (IHLF),

l_{ε,Δ}(z) = 0                        if |z| ≤ ε,
             (|z| − ε)²               if ε < |z| < Δ,   (3.1)
             (Δ − ε)(2|z| − Δ − ε)    if |z| ≥ Δ,

whose shape is shown in Figure 1.
Figure 1: Insensitive Huber loss function with ε = 0.5 and Δ = 1.5, and insensitive quadratic loss function with ε = 0.5.
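In code, equation 3.1 is a three-branch function (a sketch with our own names; the checks below exercise the continuity at |z| = Δ and the large-Δ limit discussed below):

```python
def ihlf(z, eps, delta):
    # Insensitive Huber loss, equation 3.1; requires delta > eps.
    az = abs(z)
    if az <= eps:
        return 0.0                                  # dead zone
    if az < delta:
        return (az - eps) ** 2                      # quadratic part
    return (delta - eps) * (2 * az - delta - eps)   # linear part
```

The two outer branches match in both value and slope at |z| = Δ, which is what makes the loss continuously differentiable.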
We emphasize that Δ is strictly greater than ε, ensuring that IHLF is differentiable. Its first-order derivative can be written as

∂l_{ε,Δ}(z)/∂z = 0                       if |z| ≤ ε,
                 2 sign(z)(|z| − ε)      if ε < |z| < Δ,   (3.2)
                 2 sign(z)(Δ − ε)        if |z| ≥ Δ,
where sign(z) is 1 if z ≥ 0; otherwise, sign(z) is −1. The properties of IHLF are controlled by two parameters: ε and Δ. At certain ε and Δ values, we can obtain some familiar loss functions: (1) for ε = 0 and an appropriate Δ, IHLF becomes the Huber loss function (HLF); (2) for ε = 0 and Δ = ∞, IHLF becomes the quadratic (gaussian) loss function (QLF); (3) for ε = 0 and Δ → ε, IHLF approaches the linear (Laplace) loss function (LLF); (4) for 0 < ε < ∞ and Δ = ∞, IHLF becomes the insensitive quadratic loss function (IQLF); and (5) for 0 < ε < ∞ and Δ → ε, IHLF approaches the insensitive linear loss function (ILLF). The profiles of the loss functions derived from equation 3.1 are shown in Figure 2. Introducing IHLF into the optimization problem, equation 2.6, we have the following nonlinear SVR problem:
n i=1
lε, (Ki β − yi ) + λβ Kβ . T
(3.3)
Figure 2: Loss functions derived from equation 3.1.
Let the residual vector be

r(β) = Kβ − y,  r_i(β) = K_i β − y_i.   (3.4)
Defining the sign vector s(β) = [s_1(β), · · · , s_n(β)]^T by

s_i(β) = 1     if ε < r_i(β) < Δ,
         −1    if −Δ < r_i(β) < −ε,   (3.5)
         0     otherwise,
the sign vector s̄(β) = [s̄_1(β), · · · , s̄_n(β)]^T by

s̄_i(β) = 1     if r_i(β) ≥ Δ,
          −1    if r_i(β) ≤ −Δ,   (3.6)
          0     otherwise,
and the active matrix

W(β) = diag(w_1(β), · · · , w_n(β))   (3.7)

by w_i(β) = s_i²(β), we can rewrite equation 3.3 as

min_β  L_{ε,Δ}(β) = Σ_{i=1}^n w_i(β)(|r_i(β)| − ε)² + Σ_{i=1}^n s̄_i²(β)(Δ − ε)(2|r_i(β)| − Δ − ε) + λ β^T K β.   (3.8)
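Equation 3.8 is just equation 3.3 with the loss branches selected by the sign patterns, which the following sketch confirms on a few residuals (names are ours):

```python
def signs(r, eps, delta):
    # Sign vectors s (equation 3.5) and s_bar (equation 3.6).
    s = [1 if eps < ri < delta else (-1 if -delta < ri < -eps else 0)
         for ri in r]
    sb = [1 if ri >= delta else (-1 if ri <= -delta else 0) for ri in r]
    return s, sb

def loss_38(r, eps, delta):
    # Data term of equation 3.8: the quadratic part is gated by
    # w_i = s_i^2 and the linear part by s_bar_i^2.
    s, sb = signs(r, eps, delta)
    quad = sum(si * si * (abs(ri) - eps) ** 2 for si, ri in zip(s, r))
    lin = sum(sbi * sbi * (delta - eps) * (2 * abs(ri) - delta - eps)
              for sbi, ri in zip(sb, r))
    return quad + lin

def loss_direct(r, eps, delta):
    # Same quantity evaluated branch by branch from equation 3.1.
    total = 0.0
    for ri in r:
        az = abs(ri)
        if az <= eps:
            continue
        total += (az - eps) ** 2 if az < delta else \
                 (delta - eps) * (2 * az - delta - eps)
    return total
```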
Rearranging equation 3.8, we can obtain a more compact expression:

min_β  L_{ε,Δ}(β) = r(β)^T W(β) r(β) − 2ε r(β)^T s(β) + 2(Δ − ε) r(β)^T s̄(β) + λ β^T K β + ε^T W(β) ε − (Δ² − ε²) s̄(β)^T s̄(β),   (3.9)

where ε in the term ε^T W(β) ε denotes the n-vector with all entries equal to ε.
Let us consider some basic properties of L_{ε,Δ}(β). First, L_{ε,Δ}(β) is a piecewise quadratic function. Second, L_{ε,Δ}(β) is a convex function, so it has a unique minimizer. Finally, L_{ε,Δ}(β) is continuously differentiable with respect to β. Although L_{ε,Δ}(β) is not twice differentiable, one still can use the finite Newton algorithm by defining the generalized Hessian matrix. The flowchart of the finite Newton algorithm is as follows:

Algorithm 3.1: IHLF-SVR-FN
1. Choose a suitable starting point β⁰ and set k = 0.
2. Check whether β^k is the optimal solution of equation 3.9. If so, stop.
3. Compute the Newton step h.
4. Choose the step size t by the exact line search. Set β^{k+1} = β^k + t h and k = k + 1; go to step 2.

4 Practical Implementation

4.1 Computing the Newton Step. For any given value of β, we say that a point x_i is a support vector if |r_i(β)| > ε. Let sν1 = {i | s_i(β^k) ≠ 0} denote the index set of the support vectors lying in the quadratic part of the loss function, sν2 = {i | s̄_i(β^k) ≠ 0} the index set of the support vectors lying in the linear part of the loss function, and nsν the index set of the nonsupport vectors. Chapelle (2006) has shown that in support vector classification, the Newton step can be computed at a cost of O(n_{sν1}³) rather than O(n³), where
n_{sν1} denotes the number of the support vectors lying in the quadratic part of the loss function. In the following, we will show that a similar conclusion holds for SVR.
The gradient of L_{ε,Δ}(β) with respect to β is

∇L_{ε,Δ}(β) = 2K^T W(β) r(β) − 2ε K^T s(β) + 2(Δ − ε) K^T s̄(β) + 2λKβ.   (4.1)
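Equation 4.1 can be checked against central finite differences of the objective, equation 3.3 (a sketch; all names are ours, and the test residuals are kept away from the breakpoints |r_i| = ε and |r_i| = Δ):

```python
import math

def grad_41(K, beta, y, eps, delta, lam):
    # Gradient of equation 3.3 as given in equation 4.1:
    # 2 K^T (W r - eps s + (delta - eps) s_bar) + 2 lam K beta.
    n = len(y)
    r = [sum(K[i][j] * beta[j] for j in range(n)) - y[i] for i in range(n)]
    s = [1 if eps < ri < delta else (-1 if -delta < ri < -eps else 0)
         for ri in r]
    sb = [1 if ri >= delta else (-1 if ri <= -delta else 0) for ri in r]
    g = []
    for j in range(n):
        gj = sum(2 * K[i][j] * (s[i] * s[i] * r[i] - eps * s[i]
                                + (delta - eps) * sb[i]) for i in range(n))
        gj += 2 * lam * sum(K[j][i] * beta[i] for i in range(n))
        g.append(gj)
    return g

def obj_33(K, beta, y, eps, delta, lam):
    # Objective of equation 3.3 built from the IHLF of equation 3.1.
    n = len(y)
    total = lam * sum(beta[i] * K[i][j] * beta[j]
                      for i in range(n) for j in range(n))
    for i in range(n):
        ri = sum(K[i][j] * beta[j] for j in range(n)) - y[i]
        az = abs(ri)
        if az <= eps:
            continue
        total += (az - eps) ** 2 if az < delta else \
                 (delta - eps) * (2 * az - delta - eps)
    return total
```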
Define the set A by

A = {β ∈ R^n | ∃i, |r_i(β)| = ε or |r_i(β)| = Δ}.   (4.2)
The Hessian exists for β ∉ A. For β ∈ A, it is (arbitrarily) defined as one of its limits. Thus, we have the generalized Hessian,

∇²L_{ε,Δ}(β) = 2K^T W(β) K + 2λK.   (4.3)
The Newton step at the kth iteration is given by

h = −(∇²L_{ε,Δ}(β^k))⁻¹ ∇L_{ε,Δ}(β^k)
  = (W(β^k) K + λI)⁻¹ (W(β^k) y + ε s(β^k) − (Δ − ε) s̄(β^k)) − β^k,   (4.4)

where I is the identity matrix. For convenience of expression, we reorder the training samples such that the first n_{sν1} training samples are the support vectors lying in the quadratic part of the loss function and the training samples from n_{sν1} + 1 to n_{sν1} + n_{sν2} are the support vectors lying in the linear part of the loss function. Thus, equation 4.4 can be rewritten as

[h_{sν1}]   [K_{sν1,sν1} + λI_{sν1,sν1}   K_{sν1,sν2}    K_{sν1,nsν} ]⁻¹ [y_{sν1} + ε s(β^k)_{sν1}]   [β^k_{sν1}]
[h_{sν2}] = [0                            λI_{sν2,sν2}   0           ]   [−(Δ − ε) s̄(β^k)_{sν2} ] − [β^k_{sν2}]   (4.5)
[h_{nsν}]   [0                            0              λI_{nsν,nsν}]   [0                      ]   [β^k_{nsν}]
Together with the matrix inversion result from Zhang (2004), we can simplify equation 4.5 as
k + λI + ε s β y (K ) sν1 ,sν1 sν1 sν1 ,sν1 sν1 k ¯ − ε) K s β ( sν ,sν 1 2 hsν1 sν2 k + − β sν1 . hsν2 = λ hnsν − ( − ε) s¯ β k sν2 k − β sν2 λ −1
(4.6)
−β knsν Equation 4.6 indicates that we can compute the Newton step at a cost of O n3sν1 . Instead of inverting the matrix (Ksν1 ,sν1 + λIsν1 ,sν1 ), we can compute the first part of the Newton step by solving the positive definite system: (Ksν1 ,sν1 + λIsν1 ,sν1 ) hsν1 = ysν1
( − ε) Ksν1 ,sν2 s¯ β k sν2 k + ε s β sν1 + . λ (4.7)
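The block reduction can be verified against the full system of equation 4.4: with given sign patterns, the step returned by equation 4.6 must satisfy (W K + λI)(h + β^k) = W y + ε s − (Δ − ε) s̄. A sketch (the sign vectors are fixed by hand here rather than computed from residuals; all names are ours):

```python
import math

def solve(A, b):
    # Gaussian elimination with partial pivoting for small dense systems.
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k]
                              for k in range(r + 1, n))) / M[r][r]
    return x

def newton_step_46(K, beta, y, s, sb, eps, delta, lam):
    # Equation 4.6: only the sv1 block (quadratic-part support vectors)
    # needs a linear solve; the sv2 and non-sv entries are closed form.
    n = len(y)
    sv1 = [i for i in range(n) if s[i] != 0]
    A = [[K[i][j] + (lam if i == j else 0.0) for j in sv1] for i in sv1]
    rhs = [y[i] + eps * s[i]
           + (delta - eps) / lam * sum(K[i][j] * sb[j] for j in range(n))
           for i in sv1]
    u1 = solve(A, rhs)                  # equation 4.7
    h = [-(delta - eps) * sb[i] / lam - beta[i] for i in range(n)]
    for idx, i in enumerate(sv1):
        h[i] = u1[idx] - beta[i]
    return h
```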
Many methods, such as gaussian elimination, conjugate gradient, and factorization, can be used to find the solution to equation 4.7. In the current implementation, we solve equation 4.7 using the Matlab command "\", which employs Cholesky factorization.

4.2 Exact Line Search. The minimizer of a piecewise-smooth, convex quadratic function can be found by identifying the points at which the second derivative jumps, sorting these points, and successively searching over those points until a point where the slope changes sign is located. (For more details, refer to Madsen & Nielsen, 1990, and Keerthi & DeCoste, 2005.) The complexity of this line search is O(m log(m)), where m is the number of such breakpoints. Since the Newton step is much more expensive, the line search does not add to the complexity of the algorithm.

4.3 Initialization. The initialization of IHLF-SVR-FN is an important problem. If we use a randomly generated β as a starting point, we would have to invert an n × n matrix at the first Newton step. But if we can identify the support vectors ahead of time, we need to invert only an n_{sν1} × n_{sν1} matrix at the first Newton step.
According to Chapelle's suggestion (Chapelle, 2006), we can identify the support vectors by a recursive procedure. We start from a small number of
training samples, train, double the number of training samples, retrain, and so on. In this way, the support vectors are rather well identified, avoiding the inversion of the n × n matrix. To distinguish this algorithm from the original finite Newton algorithm, we call it a recursive finite Newton algorithm (IHLF-SVR-RFN).

4.4 Checking Convergence. Once the support vectors remain unchanged, IHLF-SVR-FN has found the optimal solution. We use this property to check the convergence of IHLF-SVR-FN.

4.5 Finite Convergence and Computational Complexity.

Theorem 1. IHLF-SVR-FN terminates at the global minimum solution of equation 3.9 after a finite number of iterations.

The proof of theorem 1 is very much along the lines of the proof of theorem 1 in Keerthi and DeCoste (2005), which is easily extended to a piecewise quadratic loss function, although it considers only a squared hinge loss function. In IHLF-SVR-RFN, the most time-consuming step is computing the Newton step, whose complexity is O(n(n_{sν1} + n_{sν2}) + n_{sν1}³), where n_{sν2} denotes the number of the support vectors lying in the linear part of the loss function. The first term is the cost of finding the support vectors and the second term the cost of solving the linear system, equation 4.7. The number of iterations, that is, the loops of steps 2 to 4, is usually small, say, 5 to 30.

5 Experiments

In order to verify the effectiveness of IHLF-SVR-RFN and IQLF-SVR-RFN (the insensitive quadratic loss function is used), we compare them with LIBSVM 2.82 on some benchmark data sets. LIBSVM 2.82 is the fastest version of LIBSVM, where second-order information is used to select the working set. The gaussian kernel k(x_i, x_j) = exp(−γ‖x_i − x_j‖₂²) is used to construct nonlinear SVR. All the experiments are run on a personal computer with a 2.4 GHz processor, 1 GB memory, and the Windows XP operating system. (Matlab code for IHLF-SVR-RFN is available online at http://see.xidian.edu.cn/graduate/lfbo/.)
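To make the pipeline concrete, here is a self-contained toy version of the training loop (our own simplification, not the authors' code: plain backtracking replaces the exact line search of section 4.2, the recursive initialization is omitted, and pure Python stands in for Matlab):

```python
import math

def solve(A, b):
    # Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k]
                              for k in range(r + 1, n))) / M[r][r]
    return x

def ihlf(z, eps, delta):
    # Insensitive Huber loss, equation 3.1.
    az = abs(z)
    if az <= eps:
        return 0.0
    if az < delta:
        return (az - eps) ** 2
    return (delta - eps) * (2 * az - delta - eps)

def objective(K, beta, y, eps, delta, lam):
    # Equation 3.3.
    n = len(y)
    Kb = [sum(K[i][j] * beta[j] for j in range(n)) for i in range(n)]
    return (sum(ihlf(Kb[i] - y[i], eps, delta) for i in range(n))
            + lam * sum(beta[i] * Kb[i] for i in range(n)))

def fit(X, y, eps=0.05, delta=0.3, lam=0.3, gamma=2.0, iters=30):
    # Toy IHLF-SVR trainer: Newton direction from equation 4.6,
    # accepted with simple backtracking (not the exact line search).
    n = len(y)
    K = [[math.exp(-gamma * (a - b) ** 2) for b in X] for a in X]
    beta = [0.0] * n
    for _ in range(iters):
        r = [sum(K[i][j] * beta[j] for j in range(n)) - y[i]
             for i in range(n)]
        s = [1 if eps < ri < delta else (-1 if -delta < ri < -eps else 0)
             for ri in r]
        sb = [1 if ri >= delta else (-1 if ri <= -delta else 0) for ri in r]
        sv1 = [i for i in range(n) if s[i] != 0]
        u = [-(delta - eps) * sb[i] / lam for i in range(n)]
        if sv1:
            A = [[K[i][j] + (lam if i == j else 0.0) for j in sv1]
                 for i in sv1]
            rhs = [y[i] + eps * s[i] + (delta - eps) / lam
                   * sum(K[i][j] * sb[j] for j in range(n)) for i in sv1]
            for idx, ui in zip(sv1, solve(A, rhs)):
                u[idx] = ui
        h = [u[i] - beta[i] for i in range(n)]
        if max(abs(hi) for hi in h) < 1e-10:
            break                       # support vectors unchanged
        f0, t = objective(K, beta, y, eps, delta, lam), 1.0
        while t > 1e-6:
            cand = [beta[i] + t * h[i] for i in range(n)]
            if objective(K, cand, y, eps, delta, lam) <= f0:
                beta = cand
                break
            t *= 0.5
        else:
            break                       # no progress along h
    return K, beta
```

On a small smooth target, the loop drives the objective below its value at β = 0; the real algorithm's exact line search and convergence check would sharpen this into the finite termination of theorem 1.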
We use the data sets Abalone, Computer Activity, and House16H (Rasmussen et al., 1996) for our comparisons. The Abalone data set consists of 4177 samples with 8 attributes. The task is to predict the age of abalone from all eight physical measurements. The Computer Activity data set was collected from a Sun Sparcstation 20/712 with 128 Mb of memory running in a multiuser university department. It consists of 8192 samples with 21 attributes. The task is to predict the portion of time that CPUs run in user mode from all 21 attributes. The House16H data set consists of 22,784 samples with 16 attributes. The task is to predict the median price of the houses in a small survey region.

Table 1: Training Time on the Abalone, Computer Activity, and House16H Data Sets.

Data Set            IHLF-SVR-RFN    IQLF-SVR-RFN    LIBSVM 2.82    ε        Δ
Abalone             2.48            4.30            5.99           0.100    0.110
Computer Activity   11.97           18.43           22.48          0.050    0.055
House16H            1640.81         /               653.05         0.050    0.060

5.1 Comparison with LIBSVM 2.82. The Abalone data set is randomly split into 3000 training samples and 1177 test samples, the Computer Activity data set into 5000 training samples and 4192 test samples, and the House16H data set into 20,000 training samples and 2784 test samples. For each training-test pair, the training samples are scaled into the interval [−1, 1], and the test samples are adjusted using the same linear transformation. For the Abalone and Computer Activity data sets, we precompute the entire kernel matrix for all three algorithms.
Since the value of the parameter pair (γ, C) that optimizes the generalization performance is usually not known a priori, the value that is ultimately used to design SVR is usually determined by trying many different values of (γ, C). Thus, it is important to understand how different values, optimal and otherwise, affect training time. Four experiments are performed on the Abalone and Computer Activity data sets.
In the first experiment, we fix the values of ε and Δ shown in Table 1 and search for the optimal parameter pair (γ*, C*) in the set {2⁻⁴/d, 2⁻³/d, · · · , 2³/d, 2⁴/d} × {2⁻³, 2⁻², · · · , 2⁷, 2⁸} that gives the best generalization performance on the test samples, where d is the number of attributes of the samples. Therefore, for each problem, we try 9 × 12 = 108 combinations. The mean training time of the three algorithms over all parameter pairs is reported in Table 1.
In the second experiment, we fix the value of the kernel parameter and investigate the robustness of the three algorithms against the variation of the regularization parameter C. In the third experiment, we fix the value of the regularization parameter and investigate the robustness of the three algorithms against the variation of the kernel parameter γ.
In the fourth experiment, we investigate the influence of the training set size on the training time of the three algorithms. The values of the kernel parameter and the regularization parameter are set to (γ*, C*).
From Table 1, we can see that IHLF-SVR-RFN outperforms IQLF-SVR-RFN and LIBSVM 2.82 in terms of the training time. From Figure 3, we observe that the training time of LIBSVM 2.82 has a sharp increase for large
Figure 3: Variation of the training time with the regularization parameter on the Abalone (left) and Computer Activity (right) data sets. γ is fixed to 2²/d.
Figure 4: Variation of the training time with the kernel parameter on the Abalone (left) and Computer Activity (right) data sets. For the Abalone data set, C is fixed to 2³, and for the Computer Activity data set, C is fixed to 2⁴.
C values, and hence IHLF-SVR-RFN and IQLF-SVR-RFN are much more robust against variation of C than LIBSVM 2.82. From Figure 4, we observe that the training time of IHLF-SVR-RFN and LIBSVM 2.82 increases with an increasing kernel parameter. From Figure 5, we observe that, in terms of the training time, IHLF-SVR-RFN and IQLF-SVR-RFN are superior to LIBSVM 2.82 on the Abalone data set and inferior to LIBSVM 2.82 on the Computer Activity data set using the optimal parameter pair.
For the House16H data set, it is impossible to store the entire kernel matrix because of insufficient memory, so we need to compute the kernel matrix in each Newton step. Since it is computationally prohibitive to search the set of grid values using all the training samples, we use a random subset of the training samples of size 5000 to choose (γ*, C*). The training time of IHLF-SVR-RFN and LIBSVM 2.82 on (γ*, C*) is given in Table 1. Note that
Figure 5: Variation of the training time with the training set size on the Abalone (left) and Computer Activity (right) data sets. For the Abalone data set, (γ, C) is fixed to (2²/d, 2³), and for the Computer Activity data set, (γ, C) is fixed to (2²/d, 2⁴).
IQLF-SVR-RFN cannot run on this data set because of insufficient memory. From Table 1, we can see that LIBSVM 2.82 is faster than IHLF-SVR-RFN on this data set. The main reason is that IHLF-SVR-RFN needs to reevaluate the n × (n_{sν1} + n_{sν2}) kernel matrix in each Newton step. However, we believe that IHLF-SVR-RFN has the advantage for large-scale optimization because the computation of the kernel matrix can be easily parallelized, which could improve the training time significantly.

5.2 Influence of Δ. The purpose of the experiments in this section is to study the influence of Δ on test error and training time. The partition of the training samples and test samples is the same as in section 5.1. The values of the kernel parameter and the regularization parameter are set to (γ*, C*), and the value of ε is the same as in Table 1. The final errors are averaged over 10 random splits of the full data sets.
Figures 6 and 7 show the variation of the test error and the training time with Δ on the Abalone and Computer Activity data sets, respectively. From Figure 6, we observe that the performance of IHLF-SVR-RFN is close to that of LIBSVM 2.82 (the insensitive linear loss function is used) when Δ approaches ε; the performance of IHLF-SVR-RFN is close to that of IQLF-SVR-RFN when Δ is significantly greater than ε. This is consistent with the discussion in section 3. From Figure 7, we observe that the training time of IHLF-SVR-RFN fluctuates with the variation of the Δ value. The main reason is that the training time is affected by the number of support vectors and the Newton step. The former often increases with an increasing Δ, and hence some training time is added for large Δ values; however, the latter often decreases with an increasing Δ, and hence some training time is subtracted for large Δ values.
L. Bo, L. Wang, and L. Jiao
Figure 6: Influence of Δ on the test error on the Abalone (left) and Computer Activity (right) data sets.
Figure 7: Influence of Δ on the training time on the Abalone (left) and Computer Activity (right) data sets.
6 Conclusion and Discussion

In this letter, we present a recursive finite Newton algorithm for training nonlinear SVR. First, we propose the insensitive Huber loss function and show that several popular loss functions are special cases of it. Then we discuss in detail the implementation of IHLF-SVR-RFN. Finally, we verify the effectiveness of the proposed algorithm on three benchmark regression data sets. The computational complexity of IHLF-SVR-RFN can be decreased by some approximation strategies. For example, if the kernel matrix is sparse, the Newton step can be solved more efficiently. We can also approximate the objective function 3.9 by a matching pursuit approach. Keerthi, Chapelle, and DeCoste (2006) implemented this in the classification case, and it is straightforward to generalize it to the regression case.
Recursive Finite Newton Algorithm for SVR
Acknowledgments

We thank the two reviewers for their helpful comments, which greatly improved this letter, and Rainer Stollhoff and Bruce Rogers for their help in proofreading the manuscript. This work was supported by the National Natural Science Foundation of China under grant 60372050 and the National Defense Preresearch Foundation of China under grant A1420060172.

References

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121–167.
Chapelle, O. (2006). Training a support vector machine in the primal (MPI Tech. Rep. No. 147). Tübingen: Max Planck Institute for Biological Cybernetics.
Clarke, F. H. (1983). Optimization and nonsmooth analysis. New York: Wiley.
Fan, R. E., Chen, P. H., & Lin, C. J. (2005). Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6, 1889–1918.
Fung, G., & Mangasarian, O. L. (2003). Finite Newton method for Lagrangian support vector machine classification. Neurocomputing, 55(1–2), 39–55.
Hiriart-Urruty, J. B., Strodiot, J. J., & Nguyen, V. H. (1984). Generalized Hessian matrix and second-order optimality conditions for problems with C^{1,1} data. Applied Mathematics and Optimization, 11, 43–56.
Huber, P. (1981). Robust statistics. New York: Wiley.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Keerthi, S. S., Chapelle, O., & DeCoste, D. (2006). Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7, 1493–1515.
Keerthi, S. S., & DeCoste, D. M. (2005). A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6, 341–361.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., & Murthy, K. R. K. (2001).
Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation, 13(3), 637–649.
Kimeldorf, G. S., & Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41, 495–502.
Madsen, K., & Nielsen, H. B. (1990). Finite algorithms for robust linear regression. BIT, 30(4), 682–699.
Mangasarian, O. L. (2002). A finite Newton method for classification. Optimization Methods and Software, 17(5), 913–929.
Platt, J. (1999). Sequential minimal optimization: A fast algorithm for training support vector machines. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Rasmussen, C., Neal, R., Hinton, G., van Camp, D., Ghahramani, Z., Kustra, R., & Tibshirani, R. (1996). The DELVE manual. Available online at http://www.cs.toronto.edu/~delve.
Shevade, S. K., Keerthi, S. S., Bhattacharyya, C., & Murthy, K. R. K. (2000). Improvements to the SMO algorithm for SVM regression. IEEE Transactions on Neural Networks, 11(5), 1188–1193.
Smola, A. J., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14(3), 199–222.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Zhang, X. D. (2004). Matrix analysis and application. New York: Springer.
Received January 22, 2006; accepted July 19, 2006.
LETTER
Communicated by Antti Honkela
State-Space Models: From the EM Algorithm to a Gradient Approach

Rasmus Kongsgaard Olsson
[email protected]
Kaare Brandt Petersen
[email protected]
Tue Lehn-Schiøler
[email protected]
Technical University of Denmark, 2800 Kongens Lyngby, Denmark
Slow convergence is observed in the EM algorithm for linear state-space models. We propose to circumvent the problem by applying any off-the-shelf quasi-Newton-type optimizer, which operates on the gradient of the log-likelihood function. Such an algorithm is a practical alternative due to the fact that the exact gradient of the log-likelihood function can be computed by recycling components of the expectation-maximization (EM) algorithm. We demonstrate the efficiency of the proposed method in three relevant instances of the linear state-space model. In high signal-to-noise ratios, where EM is particularly prone to converge slowly, we show that gradient-based learning results in a sizable reduction of computation time.

1 Introduction

State-space models are widely applied in cases where the data are generated by some underlying dynamics. Control engineering and speech enhancement are typical examples of applications of state-space models, where the state-space has a clear physical interpretation. Black box modeling constitutes a different type of application: the state-space dynamics have no direct physical interpretation; only the generalization ability of the model matters, that is, the prediction error on unseen data. A fairly general formulation of linear state-space models (without deterministic input) is:

s_t = F s_{t−1} + v_t  (1.1)

x_t = A s_t + n_t,  (1.2)
where equations 1.1 and 1.2 describe the state and observation spaces, respectively. State and observation vectors, s_t and x_t, are random processes

Neural Computation 19, 1097–1111 (2007)
© 2007 Massachusetts Institute of Technology
R. Olsson, K. Petersen, and T. Lehn-Schiøler
driven by independent and identically distributed (i.i.d.) zero-mean gaussian inputs v_t and n_t with covariances Q and R, respectively. Optimization in state-space models by maximizing the log likelihood, L(θ), with respect to the parameters, θ ≡ {Q, R, F, A}, falls into two main categories, based on either gradients (scoring) or expectation maximization (EM). The principal approach to maximum likelihood in state-space models, and more generally in complex models, is to iteratively search the space of θ for the local maximum of L(θ) by taking steps in the direction of the gradient, ∇_θ L(θ). A basic ascent algorithm can be improved by supplying curvature information, such as second-order derivatives or line search. Often, numerical methods are used to compute the gradient and the Hessian, due to the complexity associated with the computation of these quantities. Gupta and Mehra (1974) and Sandell and Yared (1978) give fairly complex recipes for the computation of the analytical gradient in the linear state-space model.

The EM algorithm (Dempster, Laird, & Rubin, 1977) is an important alternative to gradient-based maximum likelihood, partly due to its simplicity and convergence guarantees. It was first applied to the optimization of linear state-space models by Shumway and Stoffer (1982) and Digalakis, Rohlicek, and Ostendorf (1993). A general class of linear gaussian (state-space) models was treated in Roweis and Ghahramani (1999), in which the EM algorithm was the main engine of estimation. In the context of independent component analysis (ICA), the EM algorithm has been applied in, among others, Moulines, Cardoso, and Gassiat (1997) and Højen-Sørensen, Winther, and Hansen (2002). In Olsson and Hansen (2004, 2005), the EM algorithm was applied to the convolutive ICA problem. A number of authors have reported the slow convergence of the EM algorithm. In Redner and Walker (1984), impractically slow convergence in two-component gaussian mixture models is documented.
This critique is, however, moderated by Xu and Jordan (1996). Modifications have been suggested to speed up the basic EM algorithm (see, e.g., McLachlan & Krishnan, 1997). Jamshidian and Jennrich (1997) employ the EM update, g̃_n ≡ θ_{n+1} − θ_n, as a so-called generalized gradient in variations of the quasi-Newton algorithm. The approach taken in Meilijson (1989) differs in that the gradient of the log likelihood is derived from the expectation step (E-step), from a theorem originally shown by Fisher. Subsequently, a Newton-type step can be devised to replace the maximization step (M-step). The main contribution of this letter is to demonstrate that specialized gradient-based optimization software can replace the EM algorithm, at little analytical cost, reusing the components of the EM algorithm itself. This procedure is termed the easy gradient recipe. Furthermore, empirical evidence, supporting the results in Bermond and Cardoso (1999), is presented to demonstrate that the signal-to-noise ratio (SNR) has a dramatic effect on the convergence speed of the EM algorithm. Under certain circumstances,
State-Space Models
such as in high SNR settings, the EM algorithm fails to converge in reasonable time. Three applications of state-space models are investigated: (1) sensor fusion for the black box modeling of the speech-to-face mapping problem, (2) mean field independent component analysis (mean field ICA) for estimating a number of hidden independent sources that have been linearly and instantaneously mixed, and (3) convolutive independent component analysis for convolutive mixtures. In section 2, an introduction to EM and the easy gradient recipe is given. More specifically, the relation between the various acceleration schemes is reviewed in section 2.3. In section 3, the models are stated, and in section 4 simulation results are presented.

2 Theory

Assume a model with observed variables x, state-space variables s, and parameters θ. The calculation of the log likelihood involves an integral over the state-space variables:

L(θ) = ln p(x|θ) = ln ∫ p(x|s, θ) p(s|θ) ds.  (2.1)
The marginalization in equation 2.1 is intractable for most choices of distributions; hence, direct optimization is rarely an option, even in the gaussian case. Therefore, a lower bound, B, is introduced on the log likelihood, which is valid for any choice of probability density function, q(s|φ):

B(θ, φ) = ∫ q(s|φ) ln [p(s, x|θ) / q(s|φ)] ds ≤ ln p(x|θ).  (2.2)
At this point, the problem seems to have been made more complicated, but the lower bound B has a number of appealing properties, which make the original task of finding the parameters easier. One important fact about B becomes clear when we rewrite it using Bayes' theorem:

B(θ, φ) = ln p(x|θ) − KL[q(s|φ) || p(s|x, θ)],  (2.3)
where KL denotes the Kullback-Leibler divergence between the two distributions. In the case that the variational distribution, q, is chosen to be exactly the posterior of the hidden variables, p(s|x, θ), the bound, B, is equal to the log likelihood. For this reason, one often tries to choose the variational distribution flexible enough to include the true posterior and yet simple enough to make the necessary calculations as easy as possible. However, when computational efficiency is the main priority, a simplistic
q is chosen that does not necessarily include p(s|x, θ), and the optimum of B may differ from that of L. Examples of applications involving both types of q are described in sections 3 and 4. The approach is to maximize B with respect to φ in order to make the lower bound as close as possible to the log likelihood and then maximize the bound with respect to the parameters θ. This stepwise maximization can be achieved via the EM algorithm or by applying the easy gradient recipe (see section 2.2).

2.1 The EM Update. The EM algorithm, as formulated in Neal and Hinton (1998), works in a straightforward scheme that is initiated with random values and iterated until suitable convergence is reached:

E: Maximize B(θ, φ) w.r.t. φ, keeping θ fixed.
M: Maximize B(θ, φ) w.r.t. θ, keeping φ fixed.

It is guaranteed that the lower-bound function does not decrease on any combined E- and M-step. Figure 1 illustrates the EM algorithm. The convergence is often slow; for example, the curvature of the bound function, B, might be much higher than that of L, resulting in very conservative parameter updates. As mentioned in section 1, this is particularly a problem in latent variable models with low-power additive noise. Bermond and Cardoso (1999) and Petersen and Winther (2005) demonstrate that the EM update of the parameter A in equation 1.2 scales with the observation noise level, R. That is, as the signal-to-noise ratio increases, the M-step change in A decreases, and more iterations are required to converge.

2.2 The Easy Gradient Recipe. The key idea is to regard the bound, B, as a function of θ only, as opposed to a function of both the parameters θ and the variational parameters φ. As a result, the lower bound can be applied to reformulate the log likelihood,

L(θ) = B(θ, φ*),  (2.4)
where φ* = φ*(θ) satisfies the constraint q(s|φ*) = p(s|x, θ). Comparing with equation 2.3, it is easy to see that φ* maximizes the bound. Since φ* exactly minimizes the KL divergence, the partial derivative of the bound with respect to φ, evaluated at the point φ*, is equal to zero. Therefore, the derivative of B(θ, φ*) is equal to the partial derivative,

dB(θ, φ*)/dθ = ∂B(θ, φ*)/∂θ + (∂B(θ, φ)/∂φ)|_{φ*} ∂φ*/∂θ = ∂B(θ, φ*)/∂θ,  (2.5)
Figure 1: Schematic illustration of lower-bound optimization for a one-dimensional estimation problem, where θ_n and θ_{n+1} are iterates of the standard EM algorithm. The log-likelihood function, L(θ), is bounded from below by the function B(θ, φ*). The bound attains equality to L in θ_n due to the choice of variational distribution: q(s|φ*) = p(s|x, θ_n). Furthermore, in θ_n, the derivatives of the bound and the log likelihood are identical. In many situations, the curvature of B(θ, φ*) is much higher than that of L(θ), leading to small changes in the parameter, θ_{n+1} − θ_n.
and due to the choice of φ*, the derivative of the log likelihood is the partial derivative of the bound,

dL(θ)/dθ = dB(θ, φ*)/dθ = ∂B(θ, φ*)/∂θ,

which can be realized by combining equations 2.4 and 2.5. In this way, exact values and gradients of the true log likelihood can be obtained using the lower bound. This observation is not new; it is essentially the same as that used in Salakhutdinov, Roweis, and Ghahramani (2003) to construct the so-called expected conjugate gradient algorithm (ECG). The novelty of the recipe is, rather, the practical recycling of low-complexity computations carried out in connection with the EM algorithm for a much more efficient optimization using any gradient-based optimizer. This can be expressed in Matlab-style pseudocode, where a function loglikelihood receives as argument the parameter θ and returns L and its gradient dL/dθ:
function [L, dL/dθ] = loglikelihood(θ)
  1. Find φ* such that (∂B/∂φ)|_{φ*} = 0
  2. Calculate L = B(θ, φ*)
  3. Calculate dL/dθ = ∂B(θ, φ*)/∂θ
Step 1, and to some extent step 2, are obtained by performing an E-step, while step 3 requires only a little programming, implementing the gradients used to solve for the M-step. Compared to the EM algorithm, the main advantage is that the function value and gradient can be fed to any gradient-based optimizer, which in most cases substantially improves the convergence properties. In that sense, it is possible to benefit from the speed-ups of advanced gradient-based optimization. In cases when q does not contain p(s|x, θ), the easy gradient recipe will converge to generalized EM solutions. The advantage of formulating the log likelihood using the bound function, B, depends on the task at hand. In the linear state-space model, equations 1.1 and 1.2, a naive computation of dL/dθ involves the somewhat complicated derivatives of the Kalman filter equations with respect to each of the parameters in θ; this is explained in, for example, Gupta and Mehra (1974). Consequently, it leads to a cost of dim[θ] times the cost of one Kalman filter sweep.1 When using the easy gradient recipe, dL/dθ is derived via the gradient of the bound, ∂B/∂θ (see the appendix for a derivation example). These derivatives are often available in connection with the derivation of the M-step. In addition, the computational cost is dominated by steps 1 and 2, which require only a Kalman smoothing, scaling as two Kalman filter sweeps, hence constituting an important reduction of computation time as compared to the naive gradient computation. Sandell and Yared (1978) noted in their investigation of linear state-space models that a reformulation of the problem resulted in a similar reduction of the computational costs.
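The three steps above can be made concrete on a toy model (our example, not one from the letter): a latent scalar s ~ N(0, q) observed as x = s + n with n ~ N(0, r), where r is known and θ = q. The E-step posterior moments yield both L and dL/dθ = ∂B/∂θ, which can then be handed to any gradient-based optimizer (plain gradient ascent stands in for BFGS here):

```python
import numpy as np

rng = np.random.default_rng(0)
r = 0.5                                              # known noise variance
x = rng.normal(0.0, np.sqrt(2.0 + r), size=2000)     # data with q_true = 2

def loglikelihood(q):
    """Return L(q) and dL/dq via the bound, reusing E-step quantities."""
    # Step 1 (E-step): posterior p(s_i | x_i) = N(m_i, v)
    v = 1.0 / (1.0 / q + 1.0 / r)
    m = v * x / r
    # Step 2: L = B(q, phi*); in this conjugate case B equals the exact L
    L = np.sum(-0.5 * np.log(2 * np.pi * (q + r)) - x**2 / (2 * (q + r)))
    # Step 3: dL/dq = partial dB/dq with (m, v) held fixed at the posterior
    dL = np.sum(-0.5 / q + (m**2 + v) / (2 * q**2))
    return L, dL

# sanity check: the bound gradient matches a numerical gradient of L
h = 1e-5
(Lp, _), (Lm, _), (_, g) = loglikelihood(1.0 + h), loglikelihood(1.0 - h), loglikelihood(1.0)
assert abs((Lp - Lm) / (2 * h) - g) < 1e-3 * abs(g) + 1e-6

# feed (L, dL/dq) to a generic optimizer; plain ascent as a stand-in
q = 1.0
for _ in range(2000):
    _, dL = loglikelihood(q)
    q += dL / len(x)
```

For this model the maximum likelihood solution is available in closed form, q̂ = mean(x²) − r, so the recipe's answer can be checked directly; in the general state-space case the E-step is a Kalman smoothing and no closed form exists.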
They have in common the utilization of conjugate gradient or quasi-Newton steps; that is, the inverse Hessian is approximated, for example, using the gradient. Step-lengthening methods, which are often simpler in terms of analytical cost, have been explored in Salakhutdinov and Roweis (2003) and Honkela, Valpola, and Karhunen (2003). They are, however, outside the scope of this presentation.
1. The computational complexity of the Kalman filter is O[N(d_s)^3], where N is the data length.
Table 1: Analytical Costs Associated with a Selection of EM Speed-Ups Based on Newton-Like Updates.

Method                            | M | B′ | B″ | L
EM (Dempster et al., 1977)        | x | -  | -  | -
QN1 (Jamshidian & Jennrich, 1997) | x | -  | -  | -
QN2 (Jamshidian & Jennrich, 1997) | x | x  | -  | x
CG (Jamshidian & Jennrich, 1993)  | x | x  | -  | x
LANGE (Lange, 1995)               | x | x  | x  | x
EGR                               | - | x  | -  | x

Notes: The x's indicate whether the quantity is required by the algorithm. It should be noted that B′, required by the easy gradient recipe (EGR) as well as QN2, CG, and LANGE, often is a by-product of the derivation of M.
In order to conveniently describe the methods, additional notation along the lines of Jamshidian and Jennrich (1997) is introduced: the combined EM operator, which maps the current estimate of the parameters, θ, to the new one, is denoted θ̂ = M(θ). The change in θ due to an EM update then is g̃(θ) = M(θ) − θ. In a sense, g̃(θ) can be regarded as a generalized gradient of an underlying cost function, which has zeros identical to those of g(θ) = dL(θ)/dθ. The Newton-Raphson update can then be devised as

Δθ = −J(θ)^{−1} g̃(θ),

where the Jacobian is defined as J(θ) = M′(θ) − I. A number of methods approximate J(θ)^{−1} or M′(θ) in various ways. For instance, the quasi-Newton method QN1 of Jamshidian and Jennrich (1997) approximates J^{−1}(θ) by employing the Broyden (1965) update. The QN2 algorithm of the same publication performs quasi-Newton optimization on L(θ), but its update direction can be written as an additive correction to the EM step: d = g̃(θ) − g(θ). The correction term is updated by means of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update (see, e.g., Bishop, 1995). A line search is performed in the direction of d so as to maximize L(θ + λd), where λ is the step length. QN2 has the disadvantage compared to QN1 that L(θ) and g(θ) have to be computed in addition to g̃(θ). This is also the case for the conjugate gradient (CG) accelerator of Jamshidian and Jennrich (1993). Similarly, in Lange (1995), the log likelihood, L(θ), is optimized directly through quasi-Newton steps, that is, in the direction of d = −J_L^{−1}(θ) g(θ). However, the iterative adaptation to the Hessian, J_L(θ), involves the computation of B″(θ, φ*), which may require considerable human effort in many applications. In Table 1, the analytical costs associated with the discussed algorithms are summarized.
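The EM map M and its generalized gradient g̃(θ) = M(θ) − θ can be illustrated on a toy scalar model (our example, not one from the letter: s ~ N(0, q), x = s + n, n ~ N(0, r), with θ = q and a single observation). Plain EM iterates the map; a Newton-Raphson update −(M′(θ) − 1)^{−1} g̃(θ), with M′ approximated numerically, reaches the same fixed point in far fewer steps:

```python
import numpy as np

# Toy model: s ~ N(0, q), x = s + n, n ~ N(0, r); theta = q.
x, r = 1.3, 0.5

def M(q):
    # One EM sweep: E-step posterior moments, then M-step update of q.
    v = 1.0 / (1.0 / q + 1.0 / r)
    m = v * x / r
    return m**2 + v

# Plain EM: iterate the map; g(q) = M(q) - q is the generalized gradient.
q = 0.1
for _ in range(500):
    q = M(q)
assert abs(M(q) - q) < 1e-8          # generalized gradient vanishes
assert abs(q - (x**2 - r)) < 1e-6    # ML solution: q_hat = x^2 - r

# Newton-Raphson on the same map: dq = -(M'(q) - 1)^{-1} (M(q) - q).
q, h = 2.0, 1e-6
for _ in range(20):
    Mp = (M(q + h) - M(q - h)) / (2 * h)
    q = q - (M(q) - q) / (Mp - 1.0)
```

The fixed point of M coincides with the zero of g(θ) = dL/dθ, here q̂ = x² − r, which is what Table 1's methods exploit in different ways.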
The easy gradient recipe borrows from QN2 in that quasi-Newton optimization on L(θ) is carried out. The key point here is that the specific quasi-Newton update of QN2 can be replaced by any software package of choice that performs gradient-based optimization. This has the advantage that various highly sophisticated algorithms can be tried for the particular problem at hand. Furthermore, we emphasize, as does Meilijson (1989), that the gradient can be computed from the derivatives involved in solving for the E-step. This means that the advantages of gradient-based optimization are obtained conveniently at little cost. In this letter, a quasi-Newton gradient-based optimizer has been chosen. The implementation of the BFGS algorithm is due to Nielsen (2000) and has built-in line search and trust region monitoring.

3 Models

The EM algorithm and the easy gradient recipe were applied to three models that can all be fitted into the linear state-space framework.

3.1 Kalman Filter-Based Sensor Fusion. The state-space model of equations 1.1 and 1.2 can be used to describe systems where two different types of signals are measured. The signals could be, for example, sound and images, as in Lehn-Schiøler, Hansen, and Larsen (2005), where speech and lip movements were the observables. In this case, the observation equation 1.2 can be split into two parts,

x_t = [x1_t; x2_t] = [A1; A2] s_t + [n1_t; n2_t],
where n1_t ∼ N(0, R1) and n2_t ∼ N(0, R2). The innovation noise in the state-space equation 1.1 is defined as v_t ∼ N(0, Q). In the training phase, the parameters of the system, F, A1, A2, R1, R2, Q, are estimated by maximum likelihood using either EM or a gradient-based method. When the parameters have been learned, the state-space variable s, which represents unknown hidden causes, can be deduced from one of the observations (x1 or x2), and the missing observation can be estimated by mapping from the state-space.

3.2 Mean Field ICA. In independent component analysis (ICA), one tries to separate linearly mixed sources using the assumed statistical independence of the sources. In many cases, elaborate source priors are necessary, which calls for more advanced separation techniques such as mean field ICA. The method, which was introduced in Højen-Sørensen et al. (2002), can handle complicated source priors in an efficient approximate manner.
The model in equation 1.2 is identical to an instantaneous ICA model provided that F = 0 and that p(v_t) is reinterpreted as the (nongaussian) source prior. The basic generative model of instantaneous ICA is

x_t = A s_t + n_t,  (3.1)
where n_t is assumed i.i.d. gaussian and s_t = v_t is assumed distributed according to a factorized prior ∏_i p(v_it), which is independent in both time and dimension. The mean field ICA is only approximately compatible with the easy gradient recipe, since the variational distribution q(s|φ) is not guaranteed to contain the posterior p(s|x, θ). However, acceptable solutions (generalized EM) are retrieved when q is chosen sufficiently flexible.

3.3 Convolutive ICA. Acoustic mixture scenarios are characterized by sound waves emitted by a number of sound sources propagating through the air and arriving at the sensors in delayed and attenuated versions. The instantaneous mixture model of standard ICA, equation 3.1, is clearly insufficient to describe this situation. In convolutive ICA, the signal path (delay and attenuation) is modeled by an FIR filter, that is, a convolution of the source by the impulse responses of the signal path,

x_t = Σ_k C_k s_{t−k} + n_t,  (3.2)
where C_k is the mixing filter matrix. Equation 3.2 and the source independence assumption can be fitted into the state-space formulation of equations 1.1 and 1.2 (see Olsson & Hansen, 2004, 2005) by making the following model choices: (1) the noise inputs v_t and n_t are i.i.d. gaussian; (2) the state vector is augmented to contain time-lagged values, that is, s̄_t ≡ [s_{1,t} s_{1,t−1} . . . s_{2,t} s_{2,t−1} . . . s_{d_s,t} s_{d_s,t−1} . . .]′; and (3) the state-space parameter matrices (e.g., F) are constrained to a special format (certain elements are fixed to 0's and 1's) in order to ensure the independence of the sources mentioned above.

4 Results

The simulations that are documented in this section serve to illustrate the well-known advantage of advanced gradient-based learning over standard EM. Before advancing to the more involved applications described above, the advantage of gradient-based methods over EM will be explored for a one-dimensional linear state-space model: an ARMA(1,1) process. In this case, F and A are scalars, as well as the observation variance R and the
Figure 2: Convergence of EM (dashed) and a gradient-based method (solid) in the ARMA(1,1) model. (a) EM has faster initial convergence than the gradient-based method, but the final part is slow for EM. (b) Zoom-in on the log-likelihood axis. Even after 50 iterations, EM has not reached the same level as the gradient-based method. (c) Convergence of the parameter estimates in terms of squared error relative to the generative parameters.
transition variance Q. Q is fixed to unity to resolve the inherent scale ambiguity of the model. As a consequence, the model has only three parameters. The BFGS optimizer mentioned in section 2 was used. Figure 2 shows the convergence of both the EM algorithm and the gradient-based method. Initially, EM is fast; it rapidly approaches the maximum log likelihood, but slows down as it gets closer to the optimum. The large dynamic range of the log likelihood makes it difficult to ascertain the final increase in the log likelihood; hence, Figure 2b provides a close-up of the log-likelihood scale. Table 2 gives an indication of the importance of the
Table 2: Estimation in the ARMA(1,1) Model.

               | Generative | Gradient | EM 50    | EM ∞
Iterations     |            | 43       | 50       | 1800
Log likelihood |            | −24.3882 | −24.5131 | −24.3883
F              | 0.5000     | 0.4834   | 0.5626   | 0.4859
A              | 0.3000     | 0.2953   | 0.2545   | 0.2940
R              | 0.0100     | 0.0097   | 0.0282   | 0.0103

Notes: The convergence of EM is slow compared to the gradient-based method. Note that after 50 EM iterations, the log likelihood is relatively close to the value achieved at convergence, but the parameter values are far from the generative values.
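The log likelihoods in Table 2 are exactly computable with a few lines of scalar Kalman filtering. The sketch below is our own illustration: it evaluates L(F, A, R) for the model s_t = F s_{t−1} + v_t, x_t = A s_t + n_t with Q = 1, and checks the result against the closed-form joint gaussian density for a length-2 series:

```python
import numpy as np

def kalman_loglik(x, F, A, R, Q=1.0):
    """Scalar Kalman filter log likelihood, stationary prior on s_1."""
    P = Q / (1.0 - F**2)                 # stationary prior variance
    mu, L = 0.0, 0.0
    for xt in x:
        S = A**2 * P + R                 # innovation variance
        e = xt - A * mu                  # innovation
        L += -0.5 * (np.log(2 * np.pi * S) + e**2 / S)
        K = P * A / S                    # Kalman gain
        mu, P = mu + K * e, (1 - K * A) * P   # measurement update
        mu, P = F * mu, F**2 * P + Q          # time update
    return L

# sanity check against the exact joint gaussian density for N = 2
F, A, R = 0.5, 0.3, 0.01
x = np.array([0.2, -0.1])
P = 1.0 / (1 - F**2)
C = A**2 * P * np.array([[1, F], [F, 1]]) + R * np.eye(2)
exact = -np.log(2 * np.pi) - 0.5 * np.log(np.linalg.det(C)) \
        - 0.5 * x @ np.linalg.solve(C, x)
assert abs(kalman_loglik(x, F, A, R) - exact) < 1e-10
```

One pass of this filter is the "function evaluation" whose cost dominates both the EM iterations and the gradient-based iterations compared in the text.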
final increase. After 50 iterations, EM has reached a log-likelihood value of −24.5131, but the parameter values are still far off. After convergence, the log likelihood has increased to −24.3883, which is still slightly worse than that obtained by the gradient-based method, but the parameters are now near the generative values. The BFGS algorithm used 43 iterations and 52 function evaluations. For large-scale state-space models, the computation time is almost entirely dominated by the E-step computation. Hence, a function evaluation costs approximately one E-step. Similar results are obtained when comparing the learning algorithms on the Kalman filter-based sensor fusion, mean field ICA, and convolutive ICA problems. As argued in section 2, it is demonstrated that the number of iterations required by the EM algorithm to converge in state-space-type models critically depends on the SNR. Figure 3 shows the performance of the two methods on the three problems. The relevant comparison measure is computation time, which in the examples is almost entirely dominated by the E-step for EM and by function evaluations (whose cost is also dominated by an E-step) for the gradient-based optimizer. The line search of the latter may require more function evaluations per iteration, but that was most often not the case for the BFGS algorithm of choice. The plots indicate that in the low-noise case, the EM algorithm requires relatively more time to converge, whereas the gradient-based method performs equally well for all noise levels.

5 Conclusion

In applying the EM algorithm to maximum likelihood estimation in state-space models, we find, as have many before us, that it has poor convergence properties in the low noise limit. Often a value "close" to the maximum likelihood is reached in the first few iterations, while the final increase, which is crucial to the accurate estimation of the parameters, requires an excessive number of iterations.
More important, we provide a simple scheme for efficient gradient-based optimization achieved by transformation from the EM formulation; the
Figure 3: Number of function evaluations for EM (dashed) and gradient-based optimization (solid) to reach convergence as a function of signal-to-noise ratio for the three problems. (a) Kalman filter-based sensor fusion. (b) Mean field ICA. (c) Convolutive ICA. The level of convergence was defined as a relative change in log likelihood below 10^−5, at which point the parameters no longer changed significantly. In the case of the EM algorithm, this sometimes occurred in plateau regions of the parameter space.
simple math and programming of the EM algorithm are preserved. Following this recipe, one can get the optimization benefits associated with any advanced gradient-based method. In this way, the tedious, problem-specific analysis of the cost-function topology can be replaced with an off-the-shelf approach. Although the analysis provided in this letter is limited to a set of linear mixture models, it is in fact applicable to any model subject to the EM algorithm, hence constituting a strong and general tool for the part of the neural community that uses the EM algorithm.
Appendix: Gradient Derivation

Here we demonstrate how to obtain a partial derivative of the constrained bound function, B(θ, φ*). The derivation borrows from Shumway and Stoffer (1982). At q(s|φ*) = p(s|x, θ), we can reformulate equation 2.2 as

B(θ, φ*) = ⟨ln [p(s, x|θ) / q(s|φ*)]⟩ = ⟨ln p(s, x|θ)⟩ − ⟨ln q(s|φ*)⟩,

where the expectation ⟨·⟩ is over the posterior q(s|φ*) = p(s|x, θ). Only the first term, J(θ) = ⟨ln p(s, x|θ)⟩, depends on θ. The joint variable distribution factors due to the Markov property of the state-space model,

p(s, x|θ) = p(s_1|θ) ∏_{k=2}^{N} p(s_k|s_{k−1}, θ) ∏_{k=1}^{N} p(x_k|s_k, θ),

where N is the length of the time series and the initial state, s_1, is normally distributed with mean μ_1 and covariance Σ_1. Using the fact that all variables are gaussian, the expected log distributions can be written out as:

J(θ) = −(1/2) { dim(s) ln 2π + ln|Σ_1| + ⟨(s_1 − μ_1)′ Σ_1^{−1} (s_1 − μ_1)⟩
  + (N − 1) [dim(s) ln 2π + ln|Q|] + Σ_{k=2}^{N} ⟨(s_k − F s_{k−1})′ Q^{−1} (s_k − F s_{k−1})⟩
  + N [dim(x) ln 2π + ln|R|] + Σ_{k=1}^{N} ⟨(x_k − A s_k)′ R^{−1} (x_k − A s_k)⟩ }.
The gradient with respect to A is:

∂J(θ)/∂A = −(1/2) Σ_{k=1}^{N} [ 2 R^{−1} A ⟨s_k s_k′⟩ − 2 R^{−1} x_k ⟨s_k′⟩ ]
  = −R^{−1} A Σ_{k=1}^{N} ⟨s_k s_k′⟩ + R^{−1} Σ_{k=1}^{N} x_k ⟨s_k′⟩.
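The expression for ∂J(θ)/∂A can be verified numerically. The sketch below is our own check, with randomly chosen stand-ins for the data and the posterior moments: the expected quadratic term of J is written out with ⟨s_k⟩ and ⟨s_k s_k′⟩ held fixed, and the analytic gradient is compared entrywise with finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
dx, ds, N = 3, 2, 5
R = np.diag(rng.uniform(0.5, 1.5, dx))       # observation noise covariance
Rinv = np.linalg.inv(R)
X = rng.normal(size=(N, dx))                 # observations x_k
Mn = rng.normal(size=(N, ds))                # posterior means <s_k>
S = [np.outer(m, m) + 0.1 * np.eye(ds) for m in Mn]   # <s_k s_k'>

def J_obs(A):
    """Observation part of J(theta), posterior moments held fixed."""
    total = 0.0
    for x, m, Sk in zip(X, Mn, S):
        total += x @ Rinv @ x - 2 * x @ Rinv @ A @ m \
                 + np.trace(A.T @ Rinv @ A @ Sk)
    return -0.5 * total

A = rng.normal(size=(dx, ds))
grad = -Rinv @ A @ sum(S) + Rinv @ sum(np.outer(x, m) for x, m in zip(X, Mn))

# entrywise finite-difference check of the analytic gradient
num, h = np.zeros_like(A), 1e-6
for i in range(dx):
    for j in range(ds):
        E = np.zeros_like(A); E[i, j] = h
        num[i, j] = (J_obs(A + E) - J_obs(A - E)) / (2 * h)
assert np.max(np.abs(num - grad)) < 1e-6
```

In the actual algorithm the moments ⟨s_k⟩ and ⟨s_k s_k′⟩ are of course not free inputs but are produced by the Kalman smoother at φ*.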
It is seen that the gradient depends on the marginal posterior source moments ⟨s_k s_k′⟩ and ⟨s_k⟩, which are provided by the Kalman smoother. The gradients with respect to F and Q furthermore require the marginal moment ⟨s_k s_{k−1}′⟩. The derivation of the gradients for the covariances R and Q must respect the symmetry of these matrices. Furthermore, steps must be taken to ensure positive definiteness following a gradient step. This can be achieved, for example, by adapting Q0 instead of Q, where Q = Q0 Q0′.

Acknowledgments

We thank the Danish Oticon Fonden for supporting this research project in part. Inspiring discussions with Lars Kai Hansen and Ole Winther initiated this work.
References

Bermond, O., & Cardoso, J.-F. (1999). Approximate likelihood for noisy mixtures. In Proceedings of the ICA Conference (pp. 325–330). Aussois, France.
Bishop, C. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Broyden, C. G. (1965). A class of methods for solving nonlinear simultaneous equations. Math. Comput., 19, 577–593.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Digalakis, V., Rohlicek, J., & Ostendorf, M. (1993). ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition. IEEE Transactions on Speech and Audio Processing, 1(4), 431–442.
Gupta, N. K., & Mehra, R. K. (1974). Computational aspects of maximum likelihood estimation and reduction in sensitivity calculations. IEEE Transactions on Automatic Control, AC-19(1), 774–783.
Højen-Sørensen, P. A., Winther, O., & Hansen, L. K. (2002). Mean field approaches to independent component analysis. Neural Computation, 14, 889–918.
Honkela, A., Valpola, H., & Karhunen, J. (2003). Accelerating cyclic update algorithms for parameter estimation by pattern searches. Neural Processing Letters, 17(2), 191–203.
Jamshidian, M., & Jennrich, R. I. (1993). Conjugate gradient acceleration of the EM algorithm. Journal of the American Statistical Association, 88, 221–228.
Jamshidian, M., & Jennrich, R. I. (1997). Acceleration of the EM algorithm using quasi-Newton methods. Journal of the Royal Statistical Society, B, 59(3), 569–587.
Lange, K. (1995). A quasi-Newton acceleration of the EM algorithm. Statistica Sinica, 5, 1–18.
McLachlan, G. J., & Krishnan, T. (1997). The EM algorithm and extensions. New York: Wiley.
Lehn-Schiøler, T., Hansen, L. K., & Larsen, J. (2005). Mapping from speech to images using continuous state space models. In Lecture Notes in Computer Science (Vol. 3361, pp. 136–145). Berlin: Springer.
Meilijson, I. (1989). A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society, B, 51, 127–138.
Moulines, E., Cardoso, J., & Gassiat, E. (1997). Maximum likelihood for blind separation and deconvolution of noisy signals using mixture models. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Proc. ICASSP (Vol. 5, pp. 3617–3620). Cambridge, MA: MIT Press.
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models. Dordrecht: Kluwer.
Nielsen, H. B. (2000). UCMINF—an algorithm for unconstrained nonlinear optimization (Tech. Rep. IMM-Rep-2000-19). Lyngby: Technical University of Denmark.
Olsson, R. K., & Hansen, L. K. (2004). Probabilistic blind deconvolution of nonstationary sources. In 12th European Signal Processing Conference (pp. 1697–1700). Vienna, Austria.
Olsson, R. K., & Hansen, L. K. (2005). A harmonic excitation state-space approach to blind separation of speech. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems (Vol. 17, pp. 993–1000). Cambridge, MA: MIT Press.
Petersen, K. B., & Winther, O. (2005). The EM algorithm in independent component analysis. In International Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ: IEEE.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2), 195–239.
Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural Computation, 11, 305–345.
Salakhutdinov, R., & Roweis, S. (2003). Adaptive overrelaxed bound optimization methods. In International Conference on Machine Learning (pp. 664–671). New York: AAAI Press.
Salakhutdinov, R., Roweis, S. T., & Ghahramani, Z. (2003). Optimization with EM and expectation-conjugate-gradient. In International Conference on Machine Learning (Vol. 20, pp. 672–679). New York: AAAI Press.
Sandell, N. R., & Yared, K. I. (1978). Maximum likelihood identification of state space models for linear dynamic systems (Tech. Rep. ESL-R-814). Cambridge, MA: Electronic Systems Laboratory, MIT.
Shumway, R., & Stoffer, D. (1982). An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Anal., 3(4), 253–264.
Xu, L., & Jordan, M. I. (1996). On convergence properties of the EM algorithm for gaussian mixtures. Neural Computation, 8, 129–151.
Received February 25, 2005; accepted July 20, 2006.
LETTER
Communicated by Shun-ichi Amari
Variational Bayes Solution of Linear Neural Networks and Its Generalization Performance Shinichi Nakajima
[email protected] Nikon Corporation, Kumagaya, Saitama, 360–8559, Japan
Sumio Watanabe
[email protected] Tokyo Institute of Technology, Midori-ku, Yokohama, Kanagawa, 226-8503, Japan
It is well known that in unidentifiable models, the Bayes estimation provides much better generalization performance than the maximum likelihood (ML) estimation. However, its accurate approximation by Markov chain Monte Carlo methods requires huge computational costs. As an alternative, a tractable approximation method, called the variational Bayes (VB) approach, has recently been proposed and has been attracting attention. Its advantage over the expectation maximization (EM) algorithm, often used for realizing the ML estimation, has been experimentally shown in many applications; nevertheless, it has not yet been theoretically shown. In this letter, through analysis of the simplest unidentifiable models, we theoretically show some properties of the VB approach. We first prove that in three-layer linear neural networks, the VB approach is asymptotically equivalent to a positive-part James-Stein type shrinkage estimation. Then we theoretically clarify its free energy, generalization error, and training error. Comparing them with those of the ML estimation and the Bayes estimation, we discuss the advantage of the VB approach. We also show that unlike in the Bayes estimation, the free energy and the generalization error are less simply related to each other and that in typical cases, the VB free energy well approximates the Bayes one, while the VB generalization error significantly differs from the Bayes one.

Neural Computation 19, 1112–1153 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

The conventional learning theory (Cramer, 1949), which provides the mathematical foundation of information criteria, such as Akaike's information criterion (AIC) (Akaike, 1974), the Bayesian information criterion (BIC) (Schwarz, 1978), the minimum description length criterion (MDL) (Rissanen, 1986), and others, describes the asymptotic behavior of the Bayes free energy, the generalization error, the training error, and so on in the
regular statistical models. However, many models used in the field of machine learning, such as neural networks, Bayesian networks, mixture models, and hidden Markov models, are actually classified as unidentifiable models, which have singularities in the parameter spaces. Around the singularities, on which the Fisher information matrix is singular, the log likelihood cannot be approximated by any quadratic form of the parameter. Therefore, neither the distribution of the maximum likelihood estimator nor the Bayes posterior distribution asymptotically converges to the normal distribution, which prevents the conventional learning theory from holding (Hartigan, 1985; Watanabe, 1995; Amari, Park, & Ozeki, 2006; Hagiwara, 2002). Accordingly, the information criteria have no theoretical foundation in unidentifiable models.

Some properties of learning in unidentifiable models have been theoretically clarified. In the maximum likelihood (ML) estimation, which is asymptotically equivalent to the maximum a posteriori (MAP) estimation, the asymptotic behavior of the log-likelihood ratio in some unidentifiable models was analyzed (Hartigan, 1985; Bickel & Chernoff, 1993; Takemura & Kuriki, 1997; Kuriki & Takemura, 2001; Fukumizu, 2003), facilitated by the idea of the locally conic parameterization (Dacunha-Castelle & Gassiat, 1997). It has thus been known that the ML estimation in general provides poor generalization performance, and in the worst cases, the ML estimator diverges. In linear neural networks, on which we focus in this letter, the generalization error was clarified and proved to be greater than that of the regular models whose dimension of the parameter space is the same when the model is redundant to learn the true distribution (Fukumizu, 1999), although the ML estimator remains in a finite region (Baldi & Hornik, 1995).
On the other hand, for the analysis of the Bayes estimation in unidentifiable models, an algebraic geometrical method was developed (Watanabe, 2001a), by which the generalization errors or their upper bounds in some unidentifiable models were clarified and proved to be less than those of the regular models (Watanabe, 2001a; Yamazaki & Watanabe, 2002; Rusakov & Geiger, 2002; Yamazaki & Watanabe, 2003a, 2003b; Aoyagi & Watanabe, 2004); moreover, the generalization error of any model having singularities was proved to be less than that of the regular models when we use a prior distribution having positive values on the singularities (Watanabe, 2001b).

According to the previous work noted, it can be said that in unidentifiable models, the Bayes estimation has the advantage of generalization performance over the ML estimation. However, the Bayes posterior distribution can seldom be exactly realized. Furthermore, Markov chain Monte Carlo (MCMC) methods, often used for approximation of the posterior distribution, require huge computational costs. As an alternative, the variational Bayes (VB) approach, which provides computational tractability, was proposed (Hinton & van Camp, 1993; MacKay, 1995; Attias, 1999; Jaakkola &
Jordan, 2000; Ghahramani & Beal, 2001). In models with latent variables, an iterative algorithm like the expectation maximization (EM) algorithm, which provides the ML estimator (Dempster, Laird, & Rubin, 1977), was derived in the framework of the VB approach (Attias, 1999; Ghahramani & Beal, 2001). Some properties similar to those of the EM algorithm have been proved (Wang & Titterington, 2004), and it has been shown that the VB approach can be considered as a natural gradient descent with respect to the VB free energy (Sato, 2001). However, its advantage of generalization performance over the EM algorithm has not been theoretically shown yet, although it has been experimentally shown in many applications. In addition, although the VB free energies in some models have been clarified (Watanabe & Watanabe, 2006; Hosino, Watanabe, & Watanabe, 2005; Nakano & Watanabe, 2005), they unfortunately provide little information on their generalization errors, unlike in the Bayes estimation. The reason is that the simple relation that holds in the Bayes estimation between the free energy and the generalization error does not hold in the VB approach. (This relation will appear as equation 3.15 in section 3.) To date, the VB generalization error has not been theoretically clarified in any unidentifiable model. Meanwhile, it was surprisingly proved that even in the regular models, the usual ML estimator does not provide optimum generalization performance (Stein, 1956). The James-Stein (JS) shrinkage estimator, which dominates the ML estimator, was subsequently introduced (James & Stein, 1961). (The definition of dominate is given in section 3.3.)
Some work has discussed the relation between the JS estimator and Bayesian learning methods as follows: it was shown that the JS estimator can be derived as the solution of an empirical Bayes (EB) approach, where the prior distribution of a regular model has a hyperparameter as its variance to be estimated based on the marginal likelihood (Efron & Morris, 1973); the similarity between the asymptotic behavior of the generalization error of the Bayes estimation in a simple class of unidentifiable models and that of the JS estimation in the regular models was pointed out (Watanabe & Amari, 2003); and it was proved that in three-layer linear neural networks, a subspace Bayes (SB) approach, an extension of the EB approach where part of the parameters is regarded as a hyperparameter, is asymptotically equivalent to a positive-part JS-type shrinkage estimation, a subspecies of the JS estimation (Nakajima & Watanabe, 2005, 2006). Interestingly, the VB solution, derived in this letter, is similar to the SB solution.

In this letter, we consider the theoretical reason that the VB approach provides good generalization performance through the analysis of the simplest unidentifiable models. We first derive the VB solution of three-layer linear neural networks, which reveals an interesting relation between the VB approach, a rising method in the field of machine learning, and the JS shrinkage estimation, a classic in statistics. Then we clarify its free energy,
generalization error, and training error. The major results of this letter are as follows:

• The VB approach is asymptotically equivalent to a positive-part JS-type shrinkage estimation. Hence, the VB estimator of a necessary component to realize the true distribution is asymptotically equivalent to the ML estimator, while the VB estimator of a redundant component is not equivalent to the ML estimator, even in the asymptotic limit.

• In the VB approach, the asymptotic behavior of the free energy and that of the generalization error are less simply related to each other, unlike in the Bayes estimation.

• In typical cases, the VB free energy well approximates the Bayes one, while the VB generalization error significantly differs from the Bayes one. Moreover, their dependencies on the redundant parameter dimension are different from each other, which can throw doubt on model selection by minimizing the VB free energy to obtain better generalization performance.

• Although the VB free energy is, by definition, never less than the Bayes one, the VB generalization error can be much less than the Bayes one. This, however, does not mean the domination of the VB approach over the Bayes estimation and is consistent with the proved superiority of the Bayes estimation.

• The more different the parameter space dimensions of the different layers are, the smaller the generalization error is. This implies that the advantage of the VB approach over the EM algorithm can be enhanced in mixture models, where the dimension of the upper-layer parameters, which corresponds to the number of the mixing coefficients, is one per component, and the dimension of the lower-layer parameters, which corresponds to the number of the parameters that each component has, is usually more. The VB approach has not only a property similar to the Bayes estimation but also another property similar to the ML estimation.
In section 2, neural networks and linear neural networks are described. The Bayes estimation, the VB approach, and the JS-type shrinkage estimator are introduced in section 3. Then the effects of singularities on generalization performance and the significance of consideration of delicate situations are explained in sections 4 and 5, respectively. The solution of the VB predictive distribution is derived in section 6, and its free energy, generalization error, and training error are clarified in section 7. In section 8, the theoretical values of the generalization properties of the VB approach are compared with those of the ML estimation and of the Bayes estimation. Discussion and conclusions follow in sections 9 and 10, respectively.
2 Linear Neural Networks

Let x ∈ R^M be an input (column) vector, y ∈ R^N an output vector, and w a parameter vector. A neural network model can be described as a parametric family of maps { f(·; w) : R^M → R^N }. A three-layer neural network with H hidden units is defined by

f(x; w) = ∑_{h=1}^{H} b_h ψ(a_h^t x),  (2.1)

where w = {(a_h, b_h) ∈ R^M × R^N ; h = 1, . . . , H} summarizes all the parameters; ψ(·) is an activation function, which is usually a bounded, nondecreasing, antisymmetric, nonlinear function like tanh(·); and t denotes the transpose of a matrix or vector. We denote by N_d(µ, Σ) the d-dimensional normal distribution with average vector µ and covariance matrix Σ, and by N_d(·; µ, Σ) its density function. Assume that the output is observed with a noise subject to N_N(0, Σ). Then the conditional distribution is given by

p(y|x, w) = N_N(y; f(x; w), Σ).  (2.2)

In this letter, we focus on linear neural networks, whose activation functions are linear, as the simplest unidentifiable models. A linear neural network model (LNN) is defined by

f(x; A, B) = BAx,  (2.3)

where A = (a_1, . . . , a_H)^t is an H × M input parameter matrix and B = (b_1, . . . , b_H) is an N × H output parameter matrix. Because the transform (A, B) → (TA, BT^{−1}) does not change the map for any nonsingular H × H matrix T, the parameterization in equation 2.3 has trivial redundancy. Accordingly, the essential dimension of the parameter space is given by

K = H(M + N) − H² = H(L + l) − H²,  (2.4)

where

L = max(M, N),  (2.5)
l = min(M, N).  (2.6)
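The trivial redundancy and the dimension count in equations 2.3 and 2.4 are easy to check numerically; the sketch below uses arbitrary small dimensions chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, H = 5, 4, 2                    # input dim, output dim, hidden units

A = rng.standard_normal((H, M))      # H x M input parameter matrix
B = rng.standard_normal((N, H))      # N x H output parameter matrix
T = rng.standard_normal((H, H))      # a (generically nonsingular) H x H matrix

# (A, B) -> (TA, B T^{-1}) leaves the map B A unchanged: trivial redundancy
assert np.allclose(B @ A, (B @ np.linalg.inv(T)) @ (T @ A))

# essential parameter dimension, eq. 2.4
L, l = max(M, N), min(M, N)
K = H * (M + N) - H ** 2
assert K == H * (L + l) - H ** 2
print(K)   # 14
```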
We assume that H ≤ l throughout this letter. An LNN, also known as a reduced-rank regression model, is not a toy but a useful model in many applications (Reinsel & Velu, 1998). It is used for multivariate factor analysis when the relevant dimension of the factors is unknown, and it is classified as an essentially singular model when 0 < H < l because no transform makes
it regular. Although LNNs are simple, we expect that some phenomena caused by the singularities, revealed in this letter, would be observed in other unidentifiable models as well.

3 Learning Methods

3.1 Bayes Estimation. Let X^n = {x_1, . . . , x_n} and Y^n = {y_1, . . . , y_n} be arbitrary n training samples independently and identically taken from the true distribution q(x, y) = q(x)q(y|x). The marginal conditional likelihood of a model p(y|x, w) is given by

Z(Y^n|X^n) = ∫ φ(w) ∏_{i=1}^{n} p(y_i|x_i, w) dw,  (3.1)

where φ(w) is a prior distribution. The posterior distribution is given by

p(w|X^n, Y^n) = φ(w) ∏_{i=1}^{n} p(y_i|x_i, w) / Z(Y^n|X^n).  (3.2)

The predictive distribution is defined as the average of the model over the posterior distribution:

p(y|x, X^n, Y^n) = ∫ p(y|x, w) p(w|X^n, Y^n) dw.  (3.3)

The free energy, evidence, or stochastic complexity, which is used for model selection or hyperparameter estimation (Efron & Morris, 1973; Akaike, 1980; MacKay, 1992), is defined by

F(Y^n|X^n) = − log Z(Y^n|X^n).  (3.4)

Commonly, the normalized free energy, defined by

F(n) = ⟨F(Y^n|X^n) + log q(Y^n|X^n)⟩_{q(X^n,Y^n)},  (3.5)

where ⟨·⟩_{q(X^n,Y^n)} denotes the expectation value over all sets of n training samples, can be asymptotically expanded as follows:

F(n) = λ′ log n + o(log n),  (3.6)

where λ′ is called the free energy coefficient in this letter. The generalization error, a criterion of generalization performance, and the training error are
defined by

G(n) = ⟨G(X^n, Y^n)⟩_{q(X^n,Y^n)},  (3.7)
T(n) = ⟨T(X^n, Y^n)⟩_{q(X^n,Y^n)},  (3.8)

respectively, where

G(X^n, Y^n) = ∫ q(x)q(y|x) log [q(y|x) / p(y|x, X^n, Y^n)] dx dy  (3.9)

is the Kullback-Leibler (KL) divergence of the predictive distribution from the true distribution, and

T(X^n, Y^n) = n^{−1} ∑_{i=1}^{n} log [q(y_i|x_i) / p(y_i|x_i, X^n, Y^n)]  (3.10)

is the empirical KL divergence. Commonly, the generalization error and the training error can be asymptotically expanded as follows:

G(n) = λ n^{−1} + o(n^{−1}),  (3.11)
T(n) = ν n^{−1} + o(n^{−1}),  (3.12)

where the coefficients of the leading terms, λ and ν, are called the generalization coefficient and the training coefficient, respectively, in this letter. In the regular models, the free energy, the generalization, and the training coefficients are simply related to the parameter dimension K in the Bayes estimation, as well as in the ML estimation, as follows:

2λ′ = 2λ = −2ν = K.  (3.13)

The penalty factor of AIC (Akaike, 1974) is based on the fact that (λ − ν) = K, and the penalty factor of BIC (Schwarz, 1978), as well as that of the MDL (Rissanen, 1986), is based on the fact that 2λ′ = K. However, in unidentifiable models, equation 3.13 does not generally hold, and the coefficients can depend not only on the parameter dimension of the learning machine but also on the true distribution. So it seems to be difficult to propose any information criterion in unidentifiable models; nevertheless, an information criterion, called a singular information criterion, has recently been proposed by using the dependence on the true distribution (Yamazaki, Nagata, & Watanabe, 2005). For this kind of work, clarifying the coefficients is important.
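For a concrete feel for equation 3.13, the relation 2λ = K can be checked by Monte Carlo in a simple regular model. The sketch below is a stand-in example, not a model from this letter: it uses the ML plug-in predictive distribution of a d-dimensional gaussian mean with identity covariance, for which the KL divergence in equation 3.9 reduces to ‖µ̂ − µ∗‖²/2, so G(n) ≈ K/(2n) with K = d.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, trials = 3, 50, 20000        # K = d for the gaussian mean model
mu_true = np.zeros(d)

Z = rng.standard_normal((trials, n, d)) + mu_true
mu_hat = Z.mean(axis=1)            # ML estimator per trial

# KL(N(mu*, I) || N(mu_hat, I)) = ||mu_hat - mu*||^2 / 2 for the plug-in
# predictive; averaging over trials estimates the generalization error G(n)
G = 0.5 * np.mean(np.sum((mu_hat - mu_true) ** 2, axis=1))

print(n * G)   # approximately K/2 = 1.5, consistent with 2*lambda = K
```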
In the (rigorous) Bayes estimation, we can prove that the relation

λ′ = λ  (3.14)

still holds even in unidentifiable models, by using the following well-known relation (Levin, Tishby, & Solla, 1990):

G(n) = F(n + 1) − F(n).  (3.15)
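Equation 3.15 follows in essentially one line from the definitions, because the ratio of successive marginal likelihoods is exactly the Bayes predictive density of the newest sample:

```latex
\frac{Z(Y^{n+1}\mid X^{n+1})}{Z(Y^{n}\mid X^{n})}
  = \frac{\int \phi(w)\,\prod_{i=1}^{n+1} p(y_i\mid x_i, w)\, dw}
         {\int \phi(w)\,\prod_{i=1}^{n} p(y_i\mid x_i, w)\, dw}
  = p(y_{n+1}\mid x_{n+1},\, X^{n}, Y^{n}),
\qquad\text{so, with the normalization of equation 3.5,}
\qquad
F(n+1)-F(n)
  = \Bigl\langle \log \frac{q(y_{n+1}\mid x_{n+1})}
                           {p(y_{n+1}\mid x_{n+1},\, X^{n}, Y^{n})}
    \Bigr\rangle_{q(X^{n+1},\,Y^{n+1})}
  = G(n).
```

The last equality is just the averaged KL divergence of equations 3.7 and 3.9, evaluated at the fresh sample (x_{n+1}, y_{n+1}).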
Therefore, clarifying the Bayes free energy immediately informs us of the asymptotic behavior of the Bayes generalization error.

3.2 Variational Bayes Approach. For an arbitrary trial distribution r(w), Jensen's inequality leads to the following one,

F(Y^n|X^n) ≤ ⟨log [r(w) / (p(Y^n|X^n, w)φ(w))]⟩_{r(w)} = F̄(Y^n|X^n),  (3.16)

where ⟨·⟩_p denotes the expectation value over a distribution p, and F̄(Y^n|X^n) is called the generalized free energy in this letter. In the VB approach, restricting the set of possible distributions for r(w), we minimize the generalized free energy, a functional of r(w). The optimum of the trial distribution, which we denote by r̂(w), is called the VB posterior distribution and substituted for p(w|X^n, Y^n) in equation 3.3. In addition, the minimum value of the generalized free energy, which we denote by F̃(Y^n|X^n), is called the VB free energy, and the expectation value over the VB posterior distribution is called the VB estimator. In the original VB approach, proposed for learning of neural networks, the posterior distribution is restricted to the normal distribution (Hinton & van Camp, 1993). Recently, an iterative algorithm with good tractability has been proposed for models with latent variables, such as mixture models and graphical models, by using an appropriate class of prior distributions and restricting the posterior distribution such that the parameters and the latent variables are independent of each other; it has been attracting attention (Attias, 1999; Ghahramani & Beal, 2001). In LNNs, we can deduce a similar algorithm by restricting the posterior distribution such that the parameters of different layers, as well as those of different components, are independent of each other, as shown below.¹
1 Note that this restriction is similar to the one applied in models with hidden variables in Attias (1999). Therefore, the analysis in this letter also provides insight into the properties of the VB approach in more general unidentifiable models, which is discussed in section 9.4.
Figure 1: VB posterior distribution (right), which is the independent distribution with respect to different layer parameters to be substituted for the Bayes posterior distribution (left).
Assume that the VB posterior distribution factorizes as

r(w) = r(A, B) = r(A)r(B).  (3.17)

Then the generalized free energy is given by

F̄(Y^n|X^n) = ∫ r(A)r(B) log [r(A)r(B) / (p(Y^n|X^n, A, B)φ(A, B))] dA dB,  (3.18)

where dV denotes the integral with respect to all the elements of the matrix V. Using the variational method, we can easily show that the VB posterior distribution satisfies the following relations if we use a prior distribution that factorizes as φ(w) = φ(A, B) = φ(A)φ(B):

r(A) ∝ φ(A) exp⟨log p(Y^n|X^n, A, B)⟩_{r(B)},  (3.19)
r(B) ∝ φ(B) exp⟨log p(Y^n|X^n, A, B)⟩_{r(A)}.  (3.20)
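Equations 3.19 and 3.20 define a coordinate-ascent iteration between r(A) and r(B). As an illustrative sketch only, not the derivation given later in this letter, the following assumes the scalar LNN with M = N = H = 1, where both factors are one-dimensional gaussians; the closed-form updates below follow from substituting the scalar gaussian likelihood into equations 3.19 and 3.20, and the prior variances and data are made up for the demo.

```python
import numpy as np

# Scalar LNN: y = b a x + noise, with r(a) = N(mu_a, s2_a), r(b) = N(mu_b, s2_b).
rng = np.random.default_rng(0)
n, ca2, cb2 = 2000, 100.0, 100.0            # sample size and prior variances (assumed)
a_true, b_true = 0.7, 1.2
x = rng.standard_normal(n)
y = b_true * a_true * x + rng.standard_normal(n)

q = np.mean(x * x)                           # scalar analog of Q in eq. 6.4
r = np.mean(y * x)                           # scalar analog of R in eq. 6.5

mu_a, s2_a, mu_b, s2_b = 1.0, 1.0, 1.0, 1.0
for _ in range(200):
    # update of r(a) from eq. 3.19: precision n <b^2> q + 1/ca2, mean prop. to <b>
    s2_a = 1.0 / (n * (mu_b ** 2 + s2_b) * q + 1.0 / ca2)
    mu_a = n * s2_a * r * mu_b
    # update of r(b) from eq. 3.20, by symmetry
    s2_b = 1.0 / (n * (mu_a ** 2 + s2_a) * q + 1.0 / cb2)
    mu_b = n * s2_b * r * mu_a

print(mu_a * mu_b)   # close to the true map b_true * a_true = 0.84
```

Only the product of the two means is identified, mirroring the trivial redundancy of equation 2.3; the individual scales of mu_a and mu_b depend on the initialization.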
We find from equations 3.19 and 3.20 that the VB posterior distributions are normal if the prior distributions, φ(A) and φ(B), are normal, because the log likelihood of an LNN, log p(Y^n|X^n, A, B), is a biquadratic function of A and B. Figure 1 (right) illustrates the VB posterior distribution. The VB posterior distribution, which is the independent distribution with respect to different layer parameters minimizing the generalized free energy, equation 3.18, is substituted for the Bayes posterior distribution (see Figure 1, left). We then apply another restriction, the independence of the parameters
of different components, to the posterior distribution:

r(A, B) = ∏_{h=1}^{H} r(a_h) r(b_h).  (3.21)

This restriction makes the relations, equations 3.19 and 3.20, decompose into those for each component as follows, if we use a prior distribution that factorizes as φ(A, B) = ∏_{h=1}^{H} φ(a_h)φ(b_h):

r(a_h) ∝ φ(a_h) exp⟨log p(Y^n|X^n, A, B)⟩_{r(A,B)/r(a_h)},  (3.22)
r(b_h) ∝ φ(b_h) exp⟨log p(Y^n|X^n, A, B)⟩_{r(A,B)/r(b_h)},  (3.23)

where ⟨·⟩_{r(A,B)/r(a_h)}, as well as ⟨·⟩_{r(A,B)/r(b_h)}, denotes the expectation value over the trial distribution of the parameters except a_h, as well as that except b_h. In the same way as in the Bayes estimation, the predictive distribution, the normalized free energy, the generalization error, and the training error of the VB approach are given by

p̃(y|x, X^n, Y^n) = ∫ p(y|x, w) r̂(w) dw,  (3.24)
F̃(n) = ⟨F̃(Y^n|X^n) + log q(Y^n|X^n)⟩_{q(X^n,Y^n)} = λ̃′ log n + o(log n),  (3.25)
G̃(n) = ⟨G̃(X^n, Y^n)⟩_{q(X^n,Y^n)} = λ̃ n^{−1} + o(n^{−1}),  (3.26)
T̃(n) = ⟨T̃(X^n, Y^n)⟩_{q(X^n,Y^n)} = ν̃ n^{−1} + o(n^{−1}),  (3.27)

respectively, where

G̃(X^n, Y^n) = ∫ q(x)q(y|x) log [q(y|x) / p̃(y|x, X^n, Y^n)] dx dy,  (3.28)
T̃(X^n, Y^n) = n^{−1} ∑_{i=1}^{n} log [q(y_i|x_i) / p̃(y_i|x_i, X^n, Y^n)].  (3.29)

In general, λ̃′ ≠ λ̃, because the relation corresponding to equation 3.15 does not hold in the VB approach. Accordingly, the VB free energy and the VB generalization error are less simply related to each other, as will be shown in section 8.

3.3 James-Stein Type Shrinkage Estimator. In this section, considering a regular (i.e., identifiable) model, we introduce the James-Stein (JS)-type shrinkage estimator. Suppose that Z^n = {z_1, . . . , z_n} are arbitrary n training samples independently and identically taken from N_d(µ, I_d), where µ
is the mean parameter to be estimated and I_d denotes the d × d identity matrix. We denote by ∗ the true value of a parameter and by a hat an estimator of a parameter throughout this letter. We define the risk as the mean squared error: g_µ̂(µ∗) = ⟨‖µ̂ − µ∗‖²⟩_{q(Z^n)}.² We say that µ̂_α dominates µ̂_β if g_µ̂α(µ∗) ≤ g_µ̂β(µ∗) for arbitrary µ∗ and g_µ̂α(µ∗) < g_µ̂β(µ∗) for a certain µ∗. The usual ML estimator, µ̂_MLE = n^{−1} ∑_{i=1}^{n} z_i, is an efficient estimator that is never dominated by any unbiased estimator. However, the ML estimator has surprisingly been proved to be inadmissible when d ≥ 3; that is, there exists at least one biased estimator that dominates the ML estimator (Stein, 1956). The JS shrinkage estimator, defined as follows, was subsequently introduced (James & Stein, 1961):

µ̂_JS = (1 − χ / (n‖µ̂_MLE‖²)) µ̂_MLE,  (3.30)

where χ = (d − 2) is called the degree of shrinkage in this letter. The estimators expressed by equation 3.30 with arbitrary χ > 0 are called the JS-type shrinkage estimators. We can easily find that a positive-part JS-type shrinkage estimator, defined by

µ̂_PJS = θ(n‖µ̂_MLE‖² > χ)(1 − χ / (n‖µ̂_MLE‖²)) µ̂_MLE = (1 − χ/χ′) µ̂_MLE,  (3.31)

dominates the JS-type shrinkage estimator, equation 3.30, with the same degree of shrinkage, χ. Here, θ(·) is the indicator function of an event; it is equal to one if the event is true and to zero otherwise, and χ′ = max(χ, n‖µ̂_MLE‖²). Note that even in the large sample limit, the shrinkage estimators are not equivalent to the ML estimator if µ∗ = 0, although they are equivalent to the ML estimator otherwise. Also note that the indicator function in equation 3.31 works like a model selection method, which eliminates insignificant parameters, and the shrinkage factor works like a regularizer, which pulls the estimator toward the model with smaller degree of freedom.
² The risk function coincides with twice the generalization error, equation 3.7, when p(z|µ̂) = N_d(z; µ̂, I_d) is regarded as the predictive distribution.
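The two forms in equation 3.31 can be checked numerically; the sketch below draws data at µ∗ = 0, where shrinkage is strongest (the dimension, sample size, and seed are arbitrary):

```python
import numpy as np

def pjs(mu_mle, n, chi):
    """Positive-part JS-type shrinkage estimator, first form of eq. 3.31."""
    s = n * float(np.dot(mu_mle, mu_mle))     # n ||mu_MLE||^2
    return max(0.0, 1.0 - chi / s) * mu_mle   # indicator folded into max(0, .)

rng = np.random.default_rng(2)
d, n = 5, 100
mu_true = np.zeros(d)
chi = d - 2                                   # classic JS degree of shrinkage

Z = rng.standard_normal((n, d)) + mu_true
mu_mle = Z.mean(axis=0)

# second form of eq. 3.31, with chi' = max(chi, n ||mu_MLE||^2)
s = n * float(np.dot(mu_mle, mu_mle))
chi_p = max(chi, s)
assert np.allclose(pjs(mu_mle, n, chi), (1.0 - chi / chi_p) * mu_mle)

# the shrinkage factor lies in [0, 1), so the estimate is never longer than ML
print(np.sum(pjs(mu_mle, n, chi) ** 2), np.sum(mu_mle ** 2))
```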
Figure 2: Singularities of a neural network model.
on them is singular. The shadowed locations in Figure 2 indicate the singularities. We can see in Figure 2 that the model denoted by the singularities has more neighborhoods and state density than any other model denoted by only one point each. When the true model is not on the singularities, they asymptotically do not affect prediction, and therefore the conventional learning theory of the regular models holds. When the true model is on the singularities, they significantly affect generalization performance as follows:

• ML-type effect: In the ML estimation, as well as the MAP estimation, an increase of the neighborhoods of the true distribution leads to an increase of the flexibility of imitating noises and therefore accelerates overfitting.

• Bayesian effect: In the Bayesian learning methods, the large state density of the true distribution increases its weight and therefore suppresses overfitting.
According to the effects above, we can expect the Bayesian learning methods to provide better generalization performance than the ML estimation.
5 Significance of Delicate Situation Analysis

Suppression of overfitting accompanies insensitivity to the true components with small amplitude. There is a trade-off that would be ignored in asymptotic analysis if we considered only situations where the true model is distinctly on the singularities or not. Consider, for example, the positive-part JS-type estimator, equation 3.31, with a very large but finite degree of shrinkage, 1 ≪ χ < ∞. That estimator provides very small generalization error when µ∗ = 0 because of its strong shrinkage and the same
generalization error as that of the ML estimator when μ* = O(1) in the asymptotic limit, because μ̂_PJS = μ̂_MLE + O_p(n^{-1}). Therefore, in an ordinary asymptotic analysis it could seem to be a good learning method. However, such an estimator will not perform well, because it can provide much worse generalization performance in real situations, where the number of training samples, n, is finite. In other words, the ordinary asymptotic analysis ignores the insensitivity to small true components. To consider the case where n is sufficiently large but finite, while retaining the benefit of the asymptotic approximation, we should analyze the delicate situations (Watanabe & Amari, 2003) when 0 < √n μ* < ∞. When we select a learning method, we have to balance the generalization performance in the region where μ* = 0 against that in the region where 0 < √n μ* < ∞. The delicate situations in unidentifiable models should be defined as the ones where the KL divergence of the true distribution from the singularities is comparable to n^{-1}. This kind of analysis is important in model selection problems and in statistical tests with a finite number of samples for the following reasons: first, a few true components with amplitude comparable to n^{-1/2} naturally exist when neither the smallest nor the largest model is selected; and second, whether the selected model involves such components essentially affects generalization performance. In this letter, we discuss the generalization error and the training error in the delicate situations, in addition to the ordinary asymptotic analysis.

6 Theoretical Analysis

6.1 Variational Condition. Assume that the covariance matrix of the noise is known and equal to I_N. Then the conditional distribution of an LNN is given by

p(y|x, A, B) = N_N(y; BAx, I_N).   (6.1)

We use a prior distribution that factorizes as φ(A, B) = φ(A)φ(B), where

φ(A) = N_{HM}(vec(A^t); 0, c_a^2 I_{HM}),   (6.2)

φ(B) = N_{NH}(vec(B); 0, c_b^2 I_{NH}).   (6.3)
Here 0 < c_a^2, c_b^2 < ∞ are constants corresponding to the variances, and vec(·) denotes the vector created from a matrix by stacking its column vectors one below another. Assume that the true conditional distribution is p(y|x, A*, B*), where B*A* is the true map, with rank H* ≤ H. For simplicity, we assume that the input vector is orthonormalized so
Variational Bayes Solution and Generalization
that ∫ x x^t q(x) dx = I_M. Consequently, the central limit theorem leads to the following two equations for the sufficient statistics:

Q(X^n) = n^{-1} Σ_{i=1}^n x_i x_i^t = I_M + O_p(n^{-1/2}),   (6.4)

R(X^n, Y^n) = n^{-1} Σ_{i=1}^n y_i x_i^t = B*A* + O_p(n^{-1/2}),   (6.5)

where Q(X^n) is an M × M symmetric matrix and R(X^n, Y^n) is an N × M matrix. Hereafter, we abbreviate Q(X^n) as Q and R(X^n, Y^n) as R. Let γ_h be the hth largest singular value of the matrix RQ^{-1/2}, ω_{a_h} the corresponding right singular vector, and ω_{b_h} the corresponding left singular vector, where 1 ≤ h ≤ H. We find from equation 6.5 that in the asymptotic limit, the singular values corresponding to the components necessary to realize the true distribution converge to finite values, while those corresponding to the redundant components converge to zero. Therefore, with probability 1, the largest H* singular values correspond to the necessary components, and the others correspond to the redundant components. Combining that with equation 6.4, we have

ω_{b_h}^t R Q^ρ = ω_{b_h}^t R + O_p(n^{-1})   for H* < h ≤ H,   (6.6)
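As a quick sanity check of equations 6.4 and 6.5, one can generate data from a small LNN and verify that the sufficient statistics Q and R concentrate as stated. The sizes and the true map in this sketch are hypothetical toy values, not ones used in the letter:

```python
import random

# Empirical check of equations 6.4-6.5: for data generated from a small LNN,
# Q concentrates around I_M and R around B*A*, both at rate O_p(n^{-1/2}).
# The sizes M, N, n and the rank-1 true map below are hypothetical toy values.
random.seed(0)
M, N, n = 3, 2, 20000

A_true = [[0.5, -0.3, 0.2]]          # A* is 1 x M (H* = 1)
B_true = [[1.0], [-0.8]]             # B* is N x 1
BA_true = [[B_true[i][0] * A_true[0][j] for j in range(M)] for i in range(N)]

Q = [[0.0] * M for _ in range(M)]
R = [[0.0] * M for _ in range(N)]
for _ in range(n):
    x = [random.gauss(0.0, 1.0) for _ in range(M)]   # q(x) standard normal: E[x x^t] = I_M
    y = [sum(BA_true[i][j] * x[j] for j in range(M)) + random.gauss(0.0, 1.0)
         for i in range(N)]                           # y = B*A* x + noise with covariance I_N
    for i in range(M):
        for j in range(M):
            Q[i][j] += x[i] * x[j] / n
    for i in range(N):
        for j in range(M):
            R[i][j] += y[i] * x[j] / n

dev_Q = max(abs(Q[i][j] - (1.0 if i == j else 0.0)) for i in range(M) for j in range(M))
dev_R = max(abs(R[i][j] - BA_true[i][j]) for i in range(N) for j in range(M))
print(dev_Q, dev_R)  # both deviations are of order n^(-1/2)
```

With n = 20000, both maximal deviations come out on the order of 0.01, consistent with the O_p(n^{-1/2}) statements.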
where −∞ < ρ < ∞ is an arbitrary constant. Substituting equations 6.1 to 6.3 into equations 3.22 and 3.23, we find that the posterior distribution can be written as follows:

r(A, B) = Π_{h=1}^H r(a_h) r(b_h),   (6.7)

where

r(a_h) = N_M(a_h; μ_{a_h}, Σ_{a_h}),   (6.8)

r(b_h) = N_N(b_h; μ_{b_h}, Σ_{b_h}).   (6.9)

Given an arbitrary map BA, we can obtain an A with orthogonal row vectors and a B with orthogonal column vectors by using the singular value decomposition, and precisely in that case both of the prior probabilities, equations 6.2 and 6.3, are maximized. Accordingly, {μ_{a_h}; h = 1, ..., H}, as well as {μ_{b_h}; h = 1, ..., H}, of the optimal distribution is a set of mutually orthogonal vectors. Now we can easily obtain the following condition, called the variational condition, by substituting equations 6.7 to 6.9 into equations 3.22 and 3.23:

μ_{a_h} = n Σ_{a_h} R^t μ_{b_h},   (6.10)

Σ_{a_h} = ( n( ‖μ_{b_h}‖^2 + tr(Σ_{b_h}) ) Q + c_a^{-2} I_M )^{-1},   (6.11)
μ_{b_h} = n Σ_{b_h} R μ_{a_h},   (6.12)

Σ_{b_h} = ( n( μ_{a_h}^t Q μ_{a_h} + tr( Σ_{a_h}^{1/2} Q Σ_{a_h}^{1/2} ) ) + c_b^{-2} )^{-1} I_N,   (6.13)

where tr(·) denotes the trace of a matrix. Similarly, by substituting equations 6.7 to 6.9 into equation 3.18, we also obtain the following form of the generalized free energy:

2F̄(Y^n|X^n) = Σ_{h=1}^H { −log( σ_{a_h}^{2M} σ_{b_h}^{2N} ) − 2n μ_{b_h}^t R μ_{a_h} + n( ‖μ_{b_h}‖^2 + tr(Σ_{b_h}) )( μ_{a_h}^t Q μ_{a_h} + tr( Σ_{a_h}^{1/2} Q Σ_{a_h}^{1/2} ) ) } + Σ_{i=1}^n ‖y_i‖^2 + O_p(n^{-1/2}) + const.,   (6.14)

where σ_{a_h}^2 and σ_{b_h}^2 are defined by

Σ_{a_h} = σ_{a_h}^2 ( I_M + O_p(n^{-1/2}) ),   (6.15)

Σ_{b_h} = σ_{b_h}^2 ( 1 + O_p(n^{-1/2}) ) I_N,   (6.16)
respectively.

6.2 Variational Bayes Solution. The variational condition, equations 6.10 to 6.13, can be solved analytically, which leads to the following theorem:

Theorem 1. Let L̂_h = max(L, nγ_h^2). The VB estimator of the map of an LNN is given by

B̂Â = Σ_{h=1}^H ( 1 − L/L̂_h ) ω_{b_h} ω_{b_h}^t R Q^{-1} + O_p(n^{-1}).   (6.17)

The proof is given in the appendix. Moreover, expanding the predictive distribution divided by the true distribution, p̃(y|x, X^n, Y^n)/q(y|x), we obtain the following lemma:

Lemma 1. The predictive distribution of an LNN in the VB approach can be written as follows:

p̃(y|x, X^n, Y^n) = N_N(y; B̂Âx, V̂) + O_p(n^{-3/2}),   (6.18)

where V̂ = I_N + O_p(n^{-1}).
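The per-component factor (1 − L/L̂_h) in theorem 1 is exactly a positive-part shrinkage: it vanishes whenever nγ_h^2 ≤ L and approaches 1 as nγ_h^2 grows. A small sketch (the singular values are hypothetical):

```python
def vb_factor(gamma2, n, L):
    """Per-component factor (1 - L / L^_h) with L^_h = max(L, n * gamma2),
    as in theorem 1."""
    L_hat = max(L, n * gamma2)
    return 1.0 - L / L_hat

def positive_part(gamma2, n, L):
    """Equivalent positive-part form max(0, 1 - L / (n * gamma2))."""
    return max(0.0, 1.0 - L / (n * gamma2))

n, L = 1000, 50
for gamma2 in [0.01, 0.05, 0.2, 1.0, 10.0]:  # hypothetical squared singular values
    f = vb_factor(gamma2, n, L)
    assert abs(f - positive_part(gamma2, n, L)) < 1e-12
    print(gamma2, f)  # 0.0 whenever n * gamma2 <= L; tends to 1 for n * gamma2 >> L
```

In other words, components whose squared singular values stay below L/n are discarded entirely, while strong components are kept almost unshrunk.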
Comparing equation 6.17 with the ML estimator,

B̂Â_MLE = Σ_{h=1}^H ω_{b_h} ω_{b_h}^t R Q^{-1}   (6.19)

(Baldi & Hornik, 1995), we find that the VB estimator of each component is asymptotically equivalent to a positive-part JS-type shrinkage estimator, defined by equation 3.31. However, even if the VB estimator is asymptotically equivalent to the shrinkage estimator, the VB approach could still differ from shrinkage estimation because of the extent of the VB posterior distribution. Lemma 1 states that the VB posterior distribution is sufficiently localized that we can substitute the model at the VB estimator for the VB predictive distribution with asymptotically insignificant impact on generalization performance. Therefore, we conclude that the VB approach is asymptotically equivalent to shrinkage estimation. Note that the variances, c_a^2 and c_b^2, of the prior distributions, equations 6.2 and 6.3, asymptotically have no effect on prediction, and hence on generalization performance, as long as they are positive, finite constants.

7 Generalization Properties

7.1 Free Energy. We can obtain the VB free energy by substituting the VB posterior distribution, derived in the appendix, into equation 6.14, where only the orders of σ̂_{a_h}^2 and σ̂_{b_h}^2 matter, because the leading term of the normalized free energy should be the first one in the curly braces of equation 6.14.

Theorem 2. The normalized free energy of an LNN in the VB approach can be asymptotically expanded as

F̃(n) = λ̃ log n + O(1),

where the free energy coefficient is given by

2λ̃ = H*(L + l) + (H − H*) l.   (7.1)
Proof. We separately consider the necessary components, which imitate the true components with positive singular values, and the redundant ones. For a necessary component, h ≤ H*, equation A.7 implies that both μ̂_{a_h} and μ̂_{b_h} are of order O_p(1). Then we find from equations A.4 and A.6 that both Σ_{a_h} and Σ_{b_h} are of order O_p(n^{-1}). Substituting these facts into equation 6.14 leads to the first term of equation 7.1. For a redundant component, h > H*, we find from the VB solution, equations A.51 to A.62, that the orders of the estimators are as follows:
1. When M > N:

   μ̂_{a_h} = O_p(1), σ̂_{a_h}^2 = O_p(1), μ̂_{b_h} = O_p(n^{-1/2}), σ̂_{b_h}^2 = O_p(n^{-1}).   (7.2)

2. When M = N:

   μ̂_{a_h} = O_p(n^{-1/4}), σ̂_{a_h}^2 = O_p(n^{-1/2}), μ̂_{b_h} = O_p(n^{-1/4}), σ̂_{b_h}^2 = O_p(n^{-1/2}).   (7.3)

3. When M < N:

   μ̂_{a_h} = O_p(n^{-1/2}), σ̂_{a_h}^2 = O_p(n^{-1}), μ̂_{b_h} = O_p(1), σ̂_{b_h}^2 = O_p(1).   (7.4)

Substituting equations 7.2 to 7.4 into equation 6.14 leads to the second term of equation 7.1.

Note that the contribution of the necessary components involves the trivial redundancy (compare the first term of equation 7.1 with equation 2.4), because the independence between A and B prevents the VB posterior distribution from extending along the trivial degeneracy, which is a peculiarity of the LNNs.

7.2 Generalization Error. The generalization error, as well as the training error, of the VB approach can be derived in the same way as that of the SB approach, which results in a similar solution (Nakajima & Watanabe, 2005b, 2006). (See section 9.1.) The existence of the singular value decomposition of any B*A* allows us to assume, without loss of generality, that A* consists of orthogonal row vectors and B* of orthogonal column vectors. Then we find from lemma 1 that the KL divergence, equation 3.28, with a set of n training samples is given by
G̃(X^n, Y^n) = (1/2) ⟨ ‖(B*A* − B̂Â) x‖^2 ⟩_{q(x)} + O_p(n^{-3/2}) = Σ_{h=1}^H G̃_h(X^n, Y^n) + O_p(n^{-3/2}),   (7.5)

where

G̃_h(X^n, Y^n) = (1/2) tr( (b_h* a_h*^t − b̂_h â_h^t)^t (b_h* a_h*^t − b̂_h â_h^t) )   (7.6)

is the contribution of the hth component. We denote by W_d(m, Σ, Ξ) the d-dimensional Wishart distribution with m degrees of freedom, scale matrix Σ, and noncentrality matrix Ξ, and abbreviate the central Wishart distribution as W_d(m, Σ). Remember that θ(·) is the indicator function, introduced in section 3.3.
Theorem 3. The generalization error of an LNN in the VB approach can be asymptotically expanded as

G̃(n) = λ̃ n^{-1} + O(n^{-3/2}),

where the generalization coefficient is given by

2λ̃ = ( H*(L + l) − H*^2 ) + ⟨ Σ_{h=1}^{H−H*} θ(γ_h^2 > L) (1 − L/γ_h^2)^2 γ_h^2 ⟩_{q({γ_h^2})}.   (7.7)

Here, γ_h^2 is the hth largest eigenvalue of a random matrix subject to W_{l−H*}(L − H*, I_{l−H*}), and ⟨·⟩_{q({γ_h^2})} denotes the expectation value over its distribution.

Proof. According to theorem 1, the difference between the VB and the ML estimators of a true component with a positive singular value is of order O_p(n^{-1}). Furthermore, the generalization error of the ML estimator of such a component is the same as that of the regular models, because of its identifiability. Hence, from equation 2.4, we obtain the first term of equation 7.7 as the contribution of the first H* components. On the other hand, we find from equation 6.6 and theorem 1 that for a redundant component, identifying RQ^{-1/2} with R affects the VB estimator only at order O_p(n^{-1}), which, hence, does not affect the generalization coefficient. We say that U is the general diagonalized matrix of an N × M matrix T if T has the following singular value decomposition: T = Ω_b U Ω_a^t, where Ω_a and Ω_b are an M × M and an N × N orthogonal matrix, respectively. Let D be the general diagonalized matrix of R, and D' the (N − H*) × (M − H*) matrix created by removing the first H* columns and rows from D. Then the first H* diagonal elements of D correspond to the positive true singular value components, and D' consists only of noise. Therefore, D' is the general diagonalized matrix of n^{-1/2} R', where R' is an (N − H*) × (M − H*) random matrix whose elements are independently subject to N_1(0, 1), so that R'R'^t is subject to W_{N−H*}(M − H*, I_{N−H*}). The redundant components imitate n^{-1/2} R'. Hence, using theorem 1 and equation 7.6, and noting that the distribution of the (l − H*) largest eigenvalues (the ones not trivially equal to zero) of a random matrix subject to W_{L−H*}(l − H*, I_{L−H*}) is equal to the distribution of the eigenvalues of another random matrix subject to W_{l−H*}(L − H*, I_{l−H*}), we obtain the second term of equation 7.7 as the contribution of the last (H − H*) components.
Thus, we complete the proof of theorem 3. Although the second term of equation 7.7 is not a simple function, it can be calculated numerically with relative ease by creating samples subject to the Wishart distribution. Furthermore, a simpler function approximating that term can be derived in the large-scale limit, when M, N, H, and H* go
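A minimal Monte Carlo sketch of that numerical calculation follows, in pure Python: the sizes (L, l, H*) are hypothetical toy values, and the symmetric eigenvalue solver is a basic cyclic Jacobi iteration rather than a library routine.

```python
import math
import random

def jacobi_eigenvalues(A, sweeps=100):
    """Eigenvalues of a small symmetric matrix via cyclic Jacobi rotations."""
    A = [row[:] for row in A]
    d = len(A)
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
        if off < 1e-20:
            break
        for p in range(d - 1):
            for q in range(p + 1, d):
                if abs(A[p][q]) < 1e-15:
                    continue
                phi = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(phi), math.sin(phi)
                for k in range(d):  # rows: A <- G A
                    Apk, Aqk = A[p][k], A[q][k]
                    A[p][k] = c * Apk - s * Aqk
                    A[q][k] = s * Apk + c * Aqk
                for k in range(d):  # columns: A <- A G^t
                    Akp, Akq = A[k][p], A[k][q]
                    A[k][p] = c * Akp - s * Akq
                    A[k][q] = s * Akp + c * Akq
    return sorted((A[i][i] for i in range(d)), reverse=True)

# Hypothetical toy sizes: L - H* = 5 degrees of freedom, l - H* = 3, L = 5.
random.seed(1)
dof, dim, L, trials = 5, 3, 5.0, 2000
second_term = 0.0
for _ in range(trials):
    G = [[random.gauss(0.0, 1.0) for _ in range(dof)] for _ in range(dim)]
    W = [[sum(G[i][k] * G[j][k] for k in range(dof)) for j in range(dim)]
         for i in range(dim)]            # W ~ W_dim(dof, I), a central Wishart sample
    for g2 in jacobi_eigenvalues(W):     # eigenvalues gamma_h^2
        if g2 > L:                       # theta(gamma_h^2 > L)
            second_term += (1.0 - L / g2) ** 2 * g2 / trials
print(second_term)  # Monte Carlo estimate of the expectation in equation 7.7
```

Because (1 − L/γ^2)^2 γ^2 ≤ γ^2 sample by sample, the estimate is bounded by the mean trace (l − H*)(L − H*) = 15, which illustrates how the shrinkage suppresses the contribution that the ML estimator (which picks up the full Σ γ_h^2) would accumulate.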
to infinity in the same order, in a fashion similar to the analysis of the ML estimation (Fukumizu, 1999). We define the following scalars:

α = (l − H*)/(L − H*),   (7.8)

β = (H − H*)/(l − H*),   (7.9)

κ = L/(L − H*).   (7.10)

Let W be a random matrix subject to W_{l−H*}(L − H*, I_{l−H*}), and {u_1, ..., u_{l−H*}} the eigenvalues of (L − H*)^{-1} W. The measure of the empirical distribution of the eigenvalues is defined by

δ_P = (l − H*)^{-1} ( δ(u_1) + δ(u_2) + ··· + δ(u_{l−H*}) ),   (7.11)

where δ(u) denotes the Dirac measure at u. In the large-scale limit, the measure, equation 7.11, converges almost everywhere to

p(u) du = ( √((u − u_m)(u_M − u)) / (2παu) ) θ(u_m < u < u_M) du,   (7.12)

where u_m = (√α − 1)^2 and u_M = (√α + 1)^2 (Wachter, 1978). Let

(2πα)^{-1} J(u_t; k) = ∫_{u_t}^{∞} u^k p(u) du   (7.13)

be the kth-order moment of the distribution, equation 7.12, where u_t is the lower bound of the integration range. The second term of equation 7.7 consists of terms proportional to the minus-first-, zeroth-, and first-order moments of the eigenvalues. Because only the eigenvalues greater than L among the largest (H − H*) eigenvalues contribute to the generalization error, the moments with the lower bound u_t = max(κ, u_β) should be calculated, where u_β is the β-percentile point of p(u), that is,

β = ∫_{u_β}^{∞} p(u) du = (2πα)^{-1} J(u_β; 0).

Using the transform s = ( u − (u_m + u_M)/2 ) / (2√α), we can calculate the moments and thus obtain the following theorem:

Theorem 4. The VB generalization coefficient of an LNN in the large-scale limit is given by

2λ̃ ∼ ( H*(L + l) − H*^2 ) + ( (L − H*)(l − H*) / (2πα) ) ( J(s_t; 1) − 2κ J(s_t; 0) + κ^2 J(s_t; −1) ),   (7.14)
where

J(s; 1) = 2α( −s √(1 − s^2) + cos^{-1} s ),

J(s; 0) = −2√α √(1 − s^2) + (1 + α) cos^{-1} s − (1 − α) cos^{-1}( ( √α(1 + α)s + 2α ) / ( 2αs + √α(1 + α) ) ),

J(s; −1) = ( (1 + α)/(1 − α) ) cos^{-1}( ( √α(1 + α)s + 2α ) / ( 2αs + √α(1 + α) ) ) − cos^{-1} s + 2√α √(1 − s^2) / ( 2√α s + 1 + α )   (0 < α < 1),

J(s; −1) = 2 √( (1 − s)/(1 + s) ) − cos^{-1} s   (α = 1),

and s_t = max( ( κ − (1 + α) ) / (2√α), J^{-1}(2παβ; 0) ). Here J^{-1}(·; k) denotes the inverse function of J(s; k).

In ordinary asymptotic analysis, one considers only situations in which the amplitude of each component of the true model is zero or distinctly positive, and theorem 3 holds only in such situations. However, as mentioned in section 5, it is important to consider the delicate situations, in which the true map B*A* has tiny but nonnegligible singular values such that 0 < √n γ_h* < ∞. Theorem 1 still holds in such situations if the second term of equation 6.17 is replaced with o_p(n^{-1/2}). We regard H* as the number of distinctly positive true singular values, such that γ_h*^{-1} = o(√n). Without loss of generality, we assume that B*A* is a nonnegative, general diagonal matrix with its diagonal elements arranged in nonincreasing order. Let R* be the true submatrix created by removing the first H* columns and rows from B*A*. Then D', defined in the proof of theorem 3, is the general diagonalized matrix of n^{-1/2} R', where R' is a random matrix such that R'R'^t is subject to W_{N−H*}(M − H*, I_{N−H*}, nR*R*^t). Therefore, we obtain the following theorem:

Theorem 5. The VB generalization coefficient of an LNN in the general situations, when the true map B*A* may have delicate singular values such that 0 < √n γ_h* < ∞, is given by

2λ̃ = ( H*(L + l) − H*^2 ) + Σ_{h=H*+1}^{H} n γ_h*^2 + ⟨ Σ_{h=1}^{H−H*} θ(γ_h'^2 > L) { (1 − L/γ_h'^2)^2 γ_h'^2 − 2(1 − L/γ_h'^2) γ_h' ω_{b_h}'^t √n R* ω_{a_h}' } ⟩_{q(R')},   (7.15)
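The closed-form moment functions J(s; k) appearing in theorem 4 can be checked against direct numerical integration. From equations 7.12 and 7.13, under the transform u = 1 + α + 2√α t, the kth moment becomes J(s; k) = 4α ∫_s^1 (1 + α + 2√α t)^{k−1} √(1 − t^2) dt. A sketch (the test values of α and s are arbitrary, and only the 0 < α < 1 branch is implemented):

```python
import math

def J_closed(s, k, alpha):
    """Closed-form J(s; k) for 0 < alpha < 1, as given below theorem 4."""
    ra = math.sqrt(alpha)
    frac = (ra * (1 + alpha) * s + 2 * alpha) / (2 * alpha * s + ra * (1 + alpha))
    if k == 1:
        return 2 * alpha * (-s * math.sqrt(1 - s * s) + math.acos(s))
    if k == 0:
        return (-2 * ra * math.sqrt(1 - s * s) + (1 + alpha) * math.acos(s)
                - (1 - alpha) * math.acos(frac))
    if k == -1:
        return ((1 + alpha) / (1 - alpha) * math.acos(frac) - math.acos(s)
                + 2 * ra * math.sqrt(1 - s * s) / (2 * ra * s + 1 + alpha))
    raise ValueError(k)

def J_numeric(s, k, alpha, steps=100000):
    """J(s; k) = 4 alpha * integral_s^1 (1 + alpha + 2 sqrt(alpha) t)^(k-1)
    sqrt(1 - t^2) dt, evaluated with the midpoint rule."""
    ra = math.sqrt(alpha)
    h = (1.0 - s) / steps
    total = 0.0
    for i in range(steps):
        t = s + (i + 0.5) * h
        total += (1 + alpha + 2 * ra * t) ** (k - 1) * math.sqrt(1 - t * t)
    return 4 * alpha * total * h

alpha = 0.6
for s in (-0.5, 0.0, 0.4):
    for k in (-1, 0, 1):
        assert abs(J_closed(s, k, alpha) - J_numeric(s, k, alpha)) < 1e-4
# Full-range checks: J(-1; 0) = 2 pi alpha (total mass is 1) and
# J(-1; 1) = 2 pi alpha (the limiting mean eigenvalue is 1).
assert abs(J_closed(-1.0, 0, alpha) - 2 * math.pi * alpha) < 1e-9
assert abs(J_closed(-1.0, 1, alpha) - 2 * math.pi * alpha) < 1e-9
print("closed forms match quadrature")
```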
where γ_h', ω_{a_h}', and ω_{b_h}' are the hth largest singular value of R', the corresponding right singular vector, and the corresponding left singular vector, respectively, and ⟨·⟩_{q(R')} denotes the expectation value over the distribution of R'.

7.3 Training Error. Lemma 1 implies that the empirical KL divergence, equation 3.29, with a set of n training samples is given by

T̃(X^n, Y^n) = −(1/2) tr( (B*A* − B̂Â_MLE)^t (B*A* − B̂Â_MLE) − (B̂Â − B̂Â_MLE)^t (B̂Â − B̂Â_MLE) ) + O_p(n^{-3/2}).   (7.16)
In the same way as in the analysis of the generalization error, we obtain the following theorems.

Theorem 6. The training error of an LNN in the VB approach can be asymptotically expanded as

T̃(n) = ν̃ n^{-1} + O(n^{-3/2}),

where the training coefficient is given by

2ν̃ = −( H*(L + l) − H*^2 ) − ⟨ Σ_{h=1}^{H−H*} θ(γ_h^2 > L) (1 − L/γ_h^2)(1 + L/γ_h^2) γ_h^2 ⟩_{q({γ_h^2})}.   (7.17)
Theorem 7. The VB training coefficient of an LNN in the large-scale limit is given by

2ν̃ ∼ −( H*(L + l) − H*^2 ) − ( (L − H*)(l − H*) / (2πα) ) ( J(s_t; 1) − κ^2 J(s_t; −1) ).   (7.18)
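Theorems 4 and 7 are straightforward to evaluate numerically. The sketch below computes the large-scale generalization and training coefficients by direct quadrature of the limiting density, equation 7.12, instead of the closed-form J functions; the sizes L, l, H, H* are hypothetical example values:

```python
import math

# Large-scale evaluation of theorems 4 and 7 by numerical quadrature of the
# limiting eigenvalue density, equation 7.12. The sizes are hypothetical.
L, l, H, Hs = 50, 30, 20, 0   # Hs stands for H*
alpha = (l - Hs) / (L - Hs)   # equation 7.8
beta = (H - Hs) / (l - Hs)    # equation 7.9
kappa = L / (L - Hs)          # equation 7.10
u_m = (math.sqrt(alpha) - 1.0) ** 2
u_M = (math.sqrt(alpha) + 1.0) ** 2

def density(u):
    """Limiting density p(u) of equation 7.12."""
    if u <= u_m or u >= u_M:
        return 0.0
    return math.sqrt((u - u_m) * (u_M - u)) / (2.0 * math.pi * alpha * u)

def moment(u_t, k, steps=100000):
    """(2 pi alpha)^(-1) J(u_t; k): k-th moment of p(u) above u_t (midpoint rule)."""
    lo = max(u_t, u_m)
    if lo >= u_M:
        return 0.0
    h = (u_M - lo) / steps
    return sum(density(lo + (i + 0.5) * h) * (lo + (i + 0.5) * h) ** k
               for i in range(steps)) * h

# beta-percentile point u_beta: solve moment(u, 0) = beta by bisection.
lo, hi = u_m, u_M
for _ in range(40):
    mid = 0.5 * (lo + hi)
    if moment(mid, 0, steps=20000) > beta:
        lo = mid
    else:
        hi = mid
u_t = max(kappa, 0.5 * (lo + hi))

scale = (L - Hs) * (l - Hs)   # the factor 2 pi alpha cancels against J
gen = Hs * (L + l) - Hs**2 + scale * (moment(u_t, 1) - 2 * kappa * moment(u_t, 0)
                                      + kappa**2 * moment(u_t, -1))    # 2 lambda~
train = -(Hs * (L + l) - Hs**2) - scale * (moment(u_t, 1)
                                           - kappa**2 * moment(u_t, -1))  # 2 nu~
print(gen, train)
```

As expected, the generalization term is positive (the integrand is scale times (√u − κ/√u)^2 when κ = 1) and the training term is negative.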
Theorem 8. The VB training coefficient of an LNN in the general situations, when the true map B*A* may have delicate singular values such that 0 < √n γ_h* < ∞, is given by

2ν̃ = −( H*(L + l) − H*^2 ) + Σ_{h=H*+1}^{H} n γ_h*^2 − ⟨ Σ_{h=1}^{H−H*} θ(γ_h'^2 > L) (1 − L/γ_h'^2)(1 + L/γ_h'^2) γ_h'^2 ⟩_{q(R')}.   (7.19)
[Plot: 2λ/K versus H; curves for the VB, Bayes, and regular-model coefficients.]

Figure 3: Free energy: L = 50, l = 30, H = 1, ..., 30, and H* = 0. The VB and the Bayes free energies almost coincide with each other.
8 Theoretical Values

8.1 Free Energy. Figure 3 shows the free energy coefficients of the LNNs where L = 50 and l = 30, on the assumption that the true rank is equal to zero, H* = 0. The horizontal axis indicates the rank of the learner, H = 1, ..., 30. The vertical axis indicates the coefficients normalized by half of the essential parameter dimension, K, given by equation 2.4. The lines correspond to the free energy coefficient of the VB approach, given by theorem 2, to that of the Bayes estimation, clarified in Aoyagi and Watanabe (2004), and to that of the regular models, respectively. (By "regular models," we denote not LNNs but the regular models with the same parameter dimension, K, as the LNN specified by the horizontal axis.) Note that the lines of the VB approach and of the Bayes estimation almost coincide with each other in Figure 3. Figure 4, in which the VB free energy also approximates the Bayes one well, similarly shows the coefficients of LNNs where L = 80, l = 1, ..., 80, indicated by the horizontal axis, and H = 1, on the assumption that H* = 0. The next two figures show cases in which the VB free energy does not approximate the Bayes one well: Figure 5 shows the coefficients of LNNs where L = l = 80 and H = 1, ..., 80, indicated by the horizontal axis, on the assumption that H* = 0, and Figure 6 shows the dependence on the true rank, H* = 1, ..., 20, of the coefficients of an LNN with L = 50, l = 30, and H = 20 units. Figures 7 to 9 show cases similar to Figure 4 but with H = 10, 20, and 40, respectively. We conclude that the VB free energy behaves similarly to the
[Plot: 2λ/K versus l; curves for the VB, Bayes, and regular-model coefficients.]

Figure 4: Free energy: L = 80, l = 1, ..., 80, H = 1, and H* = 0. The VB and the Bayes free energies coincide with each other, as in Figure 3.
[Plot: 2λ/K versus H; curves for the VB, Bayes, and regular-model coefficients.]

Figure 5: Free energy: L = l = 80, H = 1, ..., 80, and H* = 0.
Bayes one in general, and approximates it well when l and L are not almost equal to each other or when H ≪ l. In Figure 6, the VB free energy behaves strangely and approximates the Bayes one poorly when H* is large; this can, however, be regarded as a peculiarity of the LNNs, which have the trivial redundancy, rather than as a property of the VB approach in general models. (See the last paragraph of section 7.1.)
[Plot: 2λ/K versus H*; curves for the VB, Bayes, and regular-model coefficients.]

Figure 6: Free energy: L = 50, l = 30, H = 20, and H* = 1, ..., 20.
[Plot: 2λ/K versus l; curves for the VB, Bayes, and regular-model coefficients.]

Figure 7: Free energy: L = 80, l = 1, ..., 80, H = 10, and H* = 0.
8.2 Generalization Error and Training Error. Figures 10 to 13 show the generalization and the training coefficients under conditions identical to those of Figures 3 to 6, respectively. The lines in the positive region correspond to the generalization coefficient of the VB approach, clarified in this letter, to that of the ML estimation, clarified in Fukumizu (1999), to that of the Bayes estimation, clarified in Aoyagi and Watanabe (2004) (the Bayes generalization coefficient is identical to the Bayes free energy coefficient, as noted in the last paragraph of section 3.1), and to that of
[Plot: 2λ/K versus l; curves for the VB, Bayes, and regular-model coefficients.]

Figure 8: Free energy: L = 80, l = 1, ..., 80, H = 20, and H* = 0.
[Plot: 2λ/K versus l; curves for the VB, Bayes, and regular-model coefficients.]

Figure 9: Free energy: L = 80, l = 1, ..., 80, H = 40, and H* = 0.
the regular models, respectively; the lines in the negative region correspond to the training coefficient of the VB approach, to that of the ML estimation, and to that of the regular models, respectively. (By "the regular models," we here mean those of both the ML estimation and the Bayes estimation in the regular models.) Unfortunately, the Bayes training error has not been clarified yet. The results in Figures 10 to 13 have been calculated in the large-scale approximation, that is, by using theorems
[Plot: 2λ/K (positive region) and 2ν/K (negative region) versus H; curves for the VB, ML, Bayes, and regular-model coefficients.]

Figure 10: Generalization error (in the positive region) and training error (in the negative region): L = 50, l = 30, H = 1, ..., 30, and H* = 0.
[Plot: 2λ/K (positive region) and 2ν/K (negative region) versus l; curves for the VB, ML, Bayes, and regular-model coefficients.]

Figure 11: Generalization error and training error: L = 80, l = 1, ..., 80, H = 1, and H* = 0.
4 and 7. We have also calculated them numerically by using theorems 3 and 6, and found that the two sets of results almost coincide, so that they can hardly be distinguished. We see in Figures 10 to 13 that the VB approach provides good generalization performance, comparable to, and in some cases better than, the Bayes estimation. We also see that the VB generalization coefficient significantly differs from the Bayes one even in the cases where the VB free energy well
[Plot: 2λ/K (positive region) and 2ν/K (negative region) versus H; curves for the VB, ML, Bayes, and regular-model coefficients.]

Figure 12: Generalization error and training error: L = l = 80, H = 1, ..., 80, and H* = 0.
[Plot: 2λ/K (positive region) and 2ν/K (negative region) versus H*; curves for the VB, ML, Bayes, and regular-model coefficients.]

Figure 13: Generalization error and training error: L = 50, l = 30, H = 20, and H* = 1, ..., 20.
approximates the Bayes one. (Compare, for example, Figures 3 and 4 with the positive regions of Figures 10 and 11.) Furthermore, we see in Figures 5 and 12 that when H ≪ l, the VB approach provides much worse generalization performance than the Bayes estimation, while the VB free energy approximates the Bayes one well; and that when H ∼ l, the VB approach provides much better generalization performance, while the VB free energy is significantly larger than the Bayes one. This can throw doubt on model
selection by minimizing the VB free energy to obtain better generalization performance. (See section 9.2.) We see in Figures 10 and 12 that the VB generalization coefficient depends on H, similarly to the ML one. Moreover, we see in Figure 12 that when H = 1, the VB generalization coefficient exceeds that of the regular models, which the Bayes one never exceeds (Watanabe, 2001b). The reason is that, because of the asymptotic equivalence to shrinkage estimation, the VB approach approximates the ML estimation, and hence provides poor generalization performance, in models where the ML estimator of a redundant component goes out of the effective range of shrinkage, that is, becomes much greater than √(L/n). When (l − H*) ≫ (H − H*), the (H − H*) largest eigenvalues of a random matrix subject to W_{l−H*}(L − H*, I_{l−H*}) are much greater than L, because of selection from a large number of random variables subject to a distribution with noncompact support. Therefore, the eigenvalues {γ_h^2} in theorem 3 go out of the effective range of shrinkage. It can be said that the VB approach has not only a property similar to the Bayes estimation (i.e., suppression of overfitting by the JS-type shrinkage), but also another property similar to the ML estimation (acceleration of overfitting by selection of the largest singular values of a random matrix). Figure 13 shows that in this LNN, the VB approach provides no worse generalization performance than the Bayes estimation, regardless of H*. This, however, does not mean that the VB approach dominates the Bayes estimation; the consideration of delicate situations in the following rules out such domination. Using theorems 5 and 8, we can numerically calculate the VB, as well as the ML (see footnote 6), generalization and training coefficients in the delicate situations, when the true distribution is near the singularities.
Figure 14 shows the coefficients of an LNN where L = 50, l = 30, and H = 20, on the assumption that the true map consists of H* = 5 distinctly positive components, 10 delicate components whose singular values are identical to each other, and five null components. The horizontal axis indicates √n γ*, where γ_h* = γ* for h = 6, ..., 15. The Bayes generalization error in the delicate situations was clarified in Watanabe and Amari (2003), but unfortunately only in single-output (SO) LNNs, that is, those with N = H = 1 (see footnote 7). Figure 15 shows the coefficients of an SOLNN with M = 5 input units, on the assumption that H* = 0 and that the one true singular value, indicated by the horizontal axis, is delicate. We see that in some delicate
Footnote 6: All of the theorems in this letter except theorem 2 can be modified for the ML estimation by letting L = 0.

Footnote 7: An SOLNN is regarded as a regular model from the viewpoint of the ML estimation, because the transform b_1 a_1 → w, where w ∈ R^M, makes the model identifiable, and hence the ML generalization coefficient is identical to that of the regular models, as shown in Figure 15. Nevertheless, Figure 15 shows that an SOLNN has a property of unidentifiable models from the viewpoint of the Bayesian learning methods.
[Plot: 2λ/K (positive region) and 2ν/K (negative region) versus √n γ*; curves for the VB, ML, and regular-model coefficients.]

Figure 14: Delicate situations.
[Plot: 2λ/K (positive region) and 2ν/K (negative region) versus √n γ*; curves for the VB, ML (= regular-model), and Bayes coefficients.]

Figure 15: Delicate situations in a single-output LNN.
situations, the VB approach provides worse generalization performance than the Bayes estimation, although this is one of the cases in which the VB approach seemed to dominate the Bayes estimation when delicate situations were not considered. Note that the ordinary asymptotic cases H* = 5 and H* = 15 (in Figure 13) correspond to the cases √n γ* = 0 and √n γ* → ∞ in Figure 14, respectively. It is expected that, also in Figure 14, there must be a region where the VB approach provides greater generalization error than the Bayes estimation, because the Bayes estimation is admissible. (See section 9.3.) In other words, Figure 13 shows the
generalization error variation when the number, H*, of positive true components increases one by one, and in that case the VB approach never provides worse generalization performance than the Bayes estimation for any H*; while Figure 14 shows the generalization error variation when the amplitudes of 10 true components increase simultaneously, and in that case the VB approach is expected to provide worse generalization performance in some region.

9 Discussion

9.1 Relation to Subspace Bayes Approach. Interestingly, the solution of the VB approach is similar to that of the subspace Bayes (SB) approach (Nakajima & Watanabe, 2005b, 2006). The VB solution is actually asymptotically equivalent to that of the MIP version of the SB approach, the empirical Bayes (EB) approach where the output parameter matrix is regarded as a hyperparameter, when M ≥ N, and to that of the MOP version, the EB approach where the input parameter matrix is regarded as a hyperparameter, when M ≤ N. The form of the posterior distribution of the parameters corresponding to the redundant components suggests the reason for this equivalence. The SB posterior distribution extends with variance of order O_p(1) in the space of the parameters, that is, the ones not regarded as hyperparameters, while the VB posterior distribution extends with variance of order O_p(1) in the larger-dimension parameter space, either the input one or the output one, as we can find in the proof of theorem 2. So the similarity between the solutions is natural, and it can be said that the VB approach automatically selects the larger-dimension space as the marginalized space, and that the dimension of the space in which the singularities allow the posterior distribution to extend is essentially related to the degree of shrinkage.
Figure 16 illustrates the posterior distribution of the MIP version of the SB approach and that of the VB approach for a redundant component when M > N (see footnote 8). The relation between the EB approach and the JS estimation was discussed in Efron and Morris (1973), as mentioned in section 1. In this letter, we have discovered the relation among the VB approach, the EB approach, and the JS-type shrinkage estimation.

9.2 Model Selection in VB Approach. In the framework of the Bayes estimation, the marginal likelihood, equation 3.1, can be regarded as the likelihood of the ensemble of models. Following the concept of likelihood in statistics, we can accept that the ensemble of models with the maximum marginal likelihood is the most likely. Thus, the model selection method minimizing the free energy, equation 3.4, was proposed (Efron & Morris,
Footnote 8: Note that the SB posterior distribution is the delta function with respect to b_h, so it has infinite values in reality.
[Diagram: posteriors over (a_h, b_h). The SB (MIP) posterior has width O_p(1) along a_h and is a delta function in b_h; the VB posterior has width O_p(1) along a_h and width O_p(n^{-1/2}) along b_h.]

Figure 16: SB (MIP) posterior (left) and VB posterior (right) for H* < h ≤ H when M > N.
1973; Schwarz, 1978; Akaike, 1980; MacKay, 1992). In the VB approach, the method in which the model with the minimum VB free energy is selected was proposed (Attias, 1999). Although it is known that the model with the minimum free energy does not necessarily provide the minimum generalization error, even in the Bayes estimation and even in regular models, we usually expect that the selected model will also provide a small generalization error. However, we have shown in this letter that the free energy and the generalization error are less simply related to each other in the VB approach than in the Bayes estimation. This can imply that the increase of the free energy caused by overfitting with redundant components indicates the corresponding increase of the generalization error less accurately in the VB approach than in the Bayes estimation, and it throws doubt on this model selection method.

9.3 Consistency with Superiority of Bayes Estimation. The Bayes estimation is said to be superior to any other learning method in a certain sense. Consider the case in which we apply a learning method to many applications, where the true parameter w* is different in each application and subject to q(w*). Note that the true distribution in each application is written as q(y|x) = p(y|x, w*). Although we have been abbreviating it (except in section 3.3), the generalization error, equation 3.7, naturally depends on the true distribution, so we here denote the dependence explicitly as G(n; w*). Let

Ḡ(n) = ⟨ G(n; w*) ⟩_{q(w*)}   (9.1)

be the average generalization error over q(w*).
Proposition 1. When we know and use the true prior distribution, q(w*), the Bayes estimation minimizes the average generalization error over q(w*); that is,

Ḡ(n) ≤ Ḡ_Other(n),   (9.2)

where Ḡ_Other(n) denotes the average generalization error of an arbitrary learning method.

However, we consider situations in which we need model selection, that is, in which we do not know even the true dimension of the parameter space. Accordingly, it can happen that an approximation method of the Bayes estimation provides better generalization performance than the Bayes estimation itself, as shown in this letter. On the other hand, proposition 1 naturally leads to the admissibility of the Bayes estimation: there is no learning method dominating the Bayes estimation. The analysis of the delicate situations, in the last paragraph of section 8.2, has consistently ruled out domination of the VB approach over the Bayes estimation.

9.4 Features of VB Approach. To summarize, we can say that the VB approach has the two aspects caused by the ML-type effect and the Bayesian effect of the singularities, itemized in section 4. In LNNs, the ML-type effect is observed as the acceleration of overfitting by selection of the largest singular values of a random matrix, while the Bayesian effect is observed as the JS-type shrinkage. We have also shown in Figures 11 and 12 that one of the most preferable properties of the Bayes estimation, 2λ ≤ K, does not hold in the VB approach; that is, 2λ̃ can exceed the parameter dimension, K, and hence the generalization coefficient of the regular models. We conjecture that the two aspects are essentially caused by the localization of the posterior distribution, which we can see in Figure 1. Because of the independence between the parameters of different layers, the posterior distribution cannot extend so as to fill uniformly the narrow low-energy (high-likelihood) region around the singularities, which changes the strength of the Bayesian effect, sometimes increasing and sometimes decreasing it. This restriction on the VB posterior distribution affects the generalization performance.
1144

S. Nakajima and S. Watanabe

However, the variation of the restriction is limited if we want to obtain a simple iterative algorithm. The independence between the parameters of different layers is especially indispensable for simplifying the calculation. Actually, the restriction applied in models with hidden variables, such as mixture models and hidden Markov models, is very similar to the one applied to LNNs in this letter. In Attias (1999), the VB posterior distribution is restricted such that the parameters and the hidden variables are independent of each other. That restriction results in the independence between the parameters of different layers and consequently leads to the localization of the posterior distribution. In fact, the order of the extent of the VB posterior distribution in mixture models, when we use a prior distribution having positive values on the singularities, has recently been derived and is similar to that in LNNs (Watanabe & Watanabe, 2005). So we expect that the two aspects of the VB approach would hold in more general unidentifiable models. In addition, in models with hidden variables, the other restriction that we applied in LNNs (the independence between the parameters of different components) is also guaranteed when we introduce the hidden variables and substitute the likelihood of the complete data for the original marginal likelihood, in the same way as in the derivation of the EM algorithm (Dempster et al., 1977; Attias, 1999).

Theorem 1 says that increasing the smaller dimension, l, does not increase the degree of shrinkage unless L = l, but theorem 3 says that it increases the number of the random variables (i.e., the number of the eigenvalues of a random matrix subject to the Wishart distribution W_{l-H∗}(L - H∗, I_{l-H∗})) among which the redundant components of the VB estimator choose. Therefore, we can say that the VB generalization error is very small when the difference between the parameter space dimensions of the different layers is very large, that is, when l/L ≪ 1. So we conjecture that the advantage of the VB approach over the EM algorithm would be enhanced in mixture models, where the dimension of the upper-layer parameters (the number of the mixing coefficients) is one per component and that of the lower-layer parameters (the number of the parameters that each component has) is usually more. However, in general, nonlinearity increases the ML-type effect of singularities. In (general) neural networks, equation 2.1, for example, we expect that the nonlinearity of the activation function, ψ(·), would extend the range of basis selection and hence increase the generalization error.
Because an LNN can be embedded in a finite-dimensional regular model, the ML-type effect of the singularities is relatively soft, and its ML estimator stays in a finite region. On the other hand, the ML estimator can diverge in some unidentifiable models. An important question to be solved is how the VB approach behaves in models where the ML estimator diverges because of the ML-type effect, that is, whether the Bayesian shrinkage effect keeps the VB estimator within a finite region. As mentioned in section 1, the VB free energies in some unidentifiable models have been clarified (Watanabe & Watanabe, 2005; Hosino et al., 2005; Nakano & Watanabe, 2005) and shown to behave similarly to the VB free energy in LNNs, except for the peculiarity of LNNs caused by the trivial redundancy, the strange H∗ dependence discussed in the last paragraph of section 8.1. The fact, shown in this letter, that in LNNs the VB free energy and the VB generalization error are less simply related to each other implies that, also in other models, we cannot expect the VB approach to provide generalization performance similar to the Bayes estimation, even if it provides similar free energy.
Variational Bayes Solution and Generalization
1145
10 Conclusions and Future Work

We have proved that in three-layer linear neural networks, the variational Bayes (VB) approach is asymptotically equivalent to a positive-part James-Stein-type shrinkage estimation, and we have theoretically clarified its free energy, generalization error, and training error. We have thus concluded that the VB approach has not only a property similar to the Bayes estimation but also another property similar to the maximum likelihood estimation. We have also shown that, unlike in the Bayes estimation, the free energy and the generalization error are less simply related to each other, and we have discussed the relation between the VB approach and the subspace Bayes (SB) approach. Analysis of the VB generalization performance of other unidentifiable models is future work, as is consideration of the relation between the VB approach and the SB approach in other models. We think that any similarity between the approaches may hold not only in LNNs.

Appendix A: Proof of Theorem 1

Both tr(Σ_{a_h}) and tr(Σ_{b_h}) are of no greater order than 1 because of the prior distributions, equations 6.2 and 6.3. Neither is equal to zero; otherwise, the generalized free energy, equation 6.14, diverges to infinity with finite n. Consequently, the generalized free energy diverges to infinity when either μ_{a_h} or μ_{b_h} goes to infinity. Hence, the optimum values of μ_{a_h}, μ_{b_h}, Σ_{a_h}, and Σ_{b_h} necessarily satisfy the variational condition, equations 6.10 to 6.13. Combining equations 6.10 and 6.12, we have

\hat{\mu}_{a_h} = n^2 \hat{\Sigma}_{a_h} R^t \hat{\Sigma}_{b_h} R \hat{\mu}_{a_h},   (A.1)

\hat{\mu}_{b_h} = n^2 \hat{\Sigma}_{b_h} R \hat{\Sigma}_{a_h} R^t \hat{\mu}_{b_h},   (A.2)
and thus find that μ̂_{a_h} and μ̂_{b_h} are an eigenvector of R^t R and an eigenvector of R R^t, respectively, or μ̂_{a_h} = μ̂_{b_h} = 0. Hereafter, separately considering the necessary components, which imitate the true ones with positive singular values, and the redundant components, we will find the solution of the variational condition that minimizes the generalized free energy, equation 6.14.

For a necessary component, h ≤ H∗, the (observed) singular value of R is of order O_p(1). Hence, the free energy, equation 6.14, can be minimized only when both μ̂_{a_h} and μ̂_{b_h} are of order O_p(1). Therefore, the variational condition, equations 6.10 to 6.13, is approximated as follows:

\hat{\mu}_{a_h} = \|\hat{\mu}_{b_h}\|^{-2} Q^{-1} R^t \hat{\mu}_{b_h} + O_p(n^{-1}),   (A.3)

\hat{\Sigma}_{a_h} = n^{-1} \|\hat{\mu}_{b_h}\|^{-2} Q^{-1} + O_p(n^{-2}),   (A.4)
\hat{\mu}_{b_h} = (\hat{\mu}_{a_h}^t Q \hat{\mu}_{a_h})^{-1} R \hat{\mu}_{a_h} + O_p(n^{-1}),   (A.5)

\hat{\Sigma}_{b_h} = n^{-1} (\hat{\mu}_{a_h}^t Q \hat{\mu}_{a_h})^{-1} I_N + O_p(n^{-2}).   (A.6)
Thus, we obtain the VB estimator of the necessary component:

\hat{\mu}_{b_h} \hat{\mu}_{a_h}^t = \omega_{b_h} \omega_{b_h}^t R Q^{-1} + O_p(n^{-1}).   (A.7)
On the other hand, for a redundant component, h > H∗, equations 6.6, 6.15, and 6.16 allow us to approximate the variational condition, equations 6.10 to 6.13, as follows:

\hat{\mu}_{a_h} = n \hat{\sigma}_{a_h}^2 R^t \hat{\mu}_{b_h} (1 + O_p(n^{-1/2})),   (A.8)

\hat{\sigma}_{a_h}^2 = \left( n(\|\hat{\mu}_{b_h}\|^2 + N \hat{\sigma}_{b_h}^2) + c_a^{-2} \right)^{-1},   (A.9)

\hat{\mu}_{b_h} = n \hat{\sigma}_{b_h}^2 R \hat{\mu}_{a_h} (1 + O_p(n^{-1/2})),   (A.10)

\hat{\sigma}_{b_h}^2 = \left( n(\|\hat{\mu}_{a_h}\|^2 + M \hat{\sigma}_{a_h}^2) + c_b^{-2} \right)^{-1}.   (A.11)
Combining equations A.9 and A.11, we have

\hat{\sigma}_{a_h}^2 = \frac{-(n\hat{\eta}_h^2 - (M-N)) + \sqrt{(n\hat{\eta}_h^2 + M + N)^2 - 4MN}}{2nM(\|\hat{\mu}_{b_h}\|^2 + n^{-1} c_a^{-2})},   (A.12)

\hat{\sigma}_{b_h}^2 = \frac{-(n\hat{\eta}_h^2 + (M-N)) + \sqrt{(n\hat{\eta}_h^2 + M + N)^2 - 4MN}}{2nN(\|\hat{\mu}_{a_h}\|^2 + n^{-1} c_b^{-2})},   (A.13)

where

\hat{\eta}_h^2 = \left( \|\hat{\mu}_{a_h}\|^2 + \frac{c_b^{-2}}{n} \right) \left( \|\hat{\mu}_{b_h}\|^2 + \frac{c_a^{-2}}{n} \right).   (A.14)
We consider two possible solutions:

1. Such that μ̂_{a_h} = μ̂_{b_h} = 0: Obviously, the following equations satisfy equations A.8 and A.10:

\hat{\mu}_{a_h} = 0,   (A.15)

\hat{\mu}_{b_h} = 0.   (A.16)
Substituting equations A.15 and A.16 into equations A.12 and A.13, we easily find the following solution:

(a) When M > N:

\hat{\sigma}_{a_h}^2 = \frac{M-N}{M c_a^{-2}} + O_p(n^{-1}),   (A.17)

\hat{\sigma}_{b_h}^2 = \frac{c_a^{-2}}{n(M-N)} + O_p(n^{-2}).   (A.18)

(b) When M = N:

\hat{\sigma}_{a_h}^2 = \frac{c_a}{c_b \sqrt{nM}} + O_p(n^{-1}),   (A.19)

\hat{\sigma}_{b_h}^2 = \frac{c_b}{c_a \sqrt{nM}} + O_p(n^{-1}).   (A.20)

(c) When M < N:

\hat{\sigma}_{a_h}^2 = \frac{c_b^{-2}}{n(N-M)} + O_p(n^{-2}),   (A.21)

\hat{\sigma}_{b_h}^2 = \frac{N-M}{N c_b^{-2}} + O_p(n^{-1}).   (A.22)
2. Such that \|\hat{\mu}_{a_h}\|, \|\hat{\mu}_{b_h}\| > 0: Combining equations A.8 and A.10, we find that μ̂_{a_h} and μ̂_{b_h} are the right and the corresponding left singular vectors of R, respectively. Let the hth largest singular value component of R correspond to the hth component of the estimator. Then we have

\hat{\mu}_{a_h} = \|\hat{\mu}_{a_h}\| \, \omega_{a_h},   (A.23)

\hat{\mu}_{b_h} = \|\hat{\mu}_{b_h}\| \, \omega_{b_h}.   (A.24)
Substituting equations A.12, A.13, A.23, and A.24 into equations A.8 and A.10, we have

\frac{2MN}{n\gamma_h^2} = n\hat{\eta}_h^2 + M + N - \sqrt{(n\hat{\eta}_h^2 + M + N)^2 - 4MN} + O_p(n^{-1/2}),   (A.25)

(M c_a^{-2} \hat{\delta}_h - N c_b^{-2} \hat{\delta}_h^{-1})(1 + O_p(n^{-1/2})) = n(M-N)(\gamma_h - \hat{\gamma}_h),   (A.26)

where

\hat{\gamma}_h = \|\hat{\mu}_{a_h}\| \, \|\hat{\mu}_{b_h}\|,   (A.27)

\hat{\delta}_h = \frac{\|\hat{\mu}_{a_h}\|}{\|\hat{\mu}_{b_h}\|}.   (A.28)

Equation A.25 implies that \xi = (n\gamma_h^2)^{-1} is a solution of the following equation:

MN\xi^2 - (n\hat{\eta}_h^2 + M + N)\xi + 1 = O_p(n^{-1/2}).   (A.29)
Thus, we find that equation A.25 has the solution below when and only when n\gamma_h^2 \ge L:

\hat{\eta}_h^2 = \left(1 - \frac{M}{n\gamma_h^2}\right)\left(1 - \frac{N}{n\gamma_h^2}\right)\gamma_h^2 + O_p(n^{-3/2}).   (A.30)

By using equations A.27 and A.28, the definition of \hat{\eta}_h, equation A.14, can be written as follows:

\hat{\eta}_h^2 = \left(1 + \frac{c_b^{-2}}{n\hat{\gamma}_h \hat{\delta}_h}\right)\left(1 + \frac{c_a^{-2}}{n\hat{\gamma}_h \hat{\delta}_h^{-1}}\right)\hat{\gamma}_h^2.   (A.31)

Hereafter, we will find the simultaneous solution of equations A.26, A.30, and A.31 in order to obtain the solution that exists only when n\gamma_h^2 \ge L. Equations A.30 and A.31 imply that \gamma_h^2 > \hat{\eta}_h^2 > \hat{\gamma}_h^2 and that (\gamma_h^2 - \hat{\gamma}_h^2) is of order O_p(n^{-1}). We then find that (\gamma_h - \hat{\gamma}_h) is of order O_p(n^{-1/2}) because \gamma_h is of order O_p(n^{-1/2}).

(a) When M > N: In this case, equation A.26 can be approximated as follows:

\hat{\delta}_h = \frac{n(M-N)(\gamma_h - \hat{\gamma}_h)}{M c_a^{-2}} + O_p(n^{-1/2}).   (A.32)

Substituting equations A.30 and A.32 into equation A.31, we have

\hat{\gamma}_h^2 + \frac{(M-N)(\gamma_h - \hat{\gamma}_h)}{M}\hat{\gamma}_h - \left(1 - \frac{M}{n\gamma_h^2}\right)\left(1 - \frac{N}{n\gamma_h^2}\right)\gamma_h^2 = O_p(n^{-2}),   (A.33)

and hence

\left(\hat{\gamma}_h - \left(1 - \frac{M}{n\gamma_h^2}\right)\gamma_h\right)\left(\hat{\gamma}_h + \frac{M}{N}\left(1 - \frac{N}{n\gamma_h^2}\right)\gamma_h\right) = O_p(n^{-2}).   (A.34)

Therefore, we obtain the following solution with respect to \hat{\gamma}_h and \hat{\delta}_h:

\hat{\gamma}_h = \left(1 - \frac{M}{n\gamma_h^2}\right)\gamma_h + O_p(n^{-3/2}),   (A.35)

\hat{\delta}_h = \frac{M-N}{c_a^{-2}\gamma_h} + O_p(n^{-1/2}).   (A.36)

We thus obtain the following solution:

\hat{\mu}_{a_h} = \left(\left(1 - \frac{M}{n\gamma_h^2}\right)\frac{M-N}{c_a^{-2}}\right)^{1/2}\omega_{a_h} + O_p(n^{-1}),   (A.37)

\hat{\sigma}_{a_h}^2 = \frac{M-N}{c_a^{-2}\, n\gamma_h^2} + O_p(n^{-1}),   (A.38)

\hat{\mu}_{b_h} = \left(\left(1 - \frac{M}{n\gamma_h^2}\right)\frac{c_a^{-2}}{M-N}\right)^{1/2}\gamma_h \, \omega_{b_h} + O_p(n^{-1}),   (A.39)

\hat{\sigma}_{b_h}^2 = \frac{c_a^{-2}}{n(M-N)} + O_p(n^{-2}).   (A.40)
(b) When M = N: In this case, we find from equation A.26 that

\hat{\delta}_h = \frac{c_a}{c_b}.   (A.41)

Substituting equation A.41 and then equation A.30 into equation A.31, we have

\hat{\gamma}_h = \hat{\eta}_h + O_p(n^{-1}) = \left(1 - \frac{M}{n\gamma_h^2}\right)\gamma_h + O_p(n^{-1}).   (A.42)

We thus obtain the following solution:

\hat{\mu}_{a_h} = \left(\frac{c_a}{c_b}\left(1 - \frac{M}{n\gamma_h^2}\right)\gamma_h\right)^{1/2}\omega_{a_h} + O_p(n^{-1}),   (A.43)

\hat{\sigma}_{a_h}^2 = \frac{c_a}{c_b\, n\gamma_h} + O_p(n^{-1}),   (A.44)

\hat{\mu}_{b_h} = \left(\frac{c_b}{c_a}\left(1 - \frac{M}{n\gamma_h^2}\right)\gamma_h\right)^{1/2}\omega_{b_h} + O_p(n^{-1}),   (A.45)

\hat{\sigma}_{b_h}^2 = \frac{c_b}{c_a\, n\gamma_h} + O_p(n^{-1}).   (A.46)

(c) When M < N: In exactly the same way as the case when M > N, we obtain the following solution:

\hat{\mu}_{a_h} = \left(\left(1 - \frac{N}{n\gamma_h^2}\right)\frac{c_b^{-2}}{N-M}\right)^{1/2}\gamma_h \, \omega_{a_h} + O_p(n^{-1}),   (A.47)

\hat{\sigma}_{a_h}^2 = \frac{c_b^{-2}}{n(N-M)} + O_p(n^{-2}),   (A.48)

\hat{\mu}_{b_h} = \left(\left(1 - \frac{N}{n\gamma_h^2}\right)\frac{N-M}{c_b^{-2}}\right)^{1/2}\omega_{b_h} + O_p(n^{-1}),   (A.49)

\hat{\sigma}_{b_h}^2 = \frac{N-M}{c_b^{-2}\, n\gamma_h^2} + O_p(n^{-1}).   (A.50)
We can find that, when the solution such that \|\hat{\mu}_{a_h}\|, \|\hat{\mu}_{b_h}\| > 0, equations A.37 to A.40 and A.43 to A.50, exists, it makes the VB free energy, equation 6.14, smaller than the solution such that \hat{\mu}_{a_h} = \hat{\mu}_{b_h} = 0, equations A.15 to A.22. Hence, we arrive at the following solution:

1. When M > N,

\hat{\mu}_{a_h} = \left(\left(1 - \frac{L}{L_h}\right)\frac{L-l}{c_a^{-2}}\right)^{1/2}\omega_{a_h} + O_p(n^{-1}),   (A.51)

\hat{\sigma}_{a_h}^2 = \frac{L-l}{c_a^{-2} L_h} + O_p(n^{-1}),   (A.52)

\hat{\mu}_{b_h} = \left(\left(1 - \frac{L}{L_h}\right)\frac{c_a^{-2}}{L-l}\right)^{1/2}\gamma_h \, \omega_{b_h} + O_p(n^{-1}),   (A.53)

\hat{\sigma}_{b_h}^2 = \frac{c_a^{-2}}{n(L-l)} + O_p(n^{-2}).   (A.54)

2. When M = N,

\hat{\mu}_{a_h} = \left(\frac{c_a}{c_b}\left(1 - \frac{L}{L_h}\right)\gamma_h\right)^{1/2}\omega_{a_h} + O_p(n^{-1}),   (A.55)

\hat{\sigma}_{a_h}^2 = \frac{c_a}{c_b \sqrt{n L_h}} + O_p(n^{-1}),   (A.56)

\hat{\mu}_{b_h} = \left(\frac{c_b}{c_a}\left(1 - \frac{L}{L_h}\right)\gamma_h\right)^{1/2}\omega_{b_h} + O_p(n^{-1}),   (A.57)

\hat{\sigma}_{b_h}^2 = \frac{c_b}{c_a \sqrt{n L_h}} + O_p(n^{-1}).   (A.58)

3. When M < N,

\hat{\mu}_{a_h} = \left(\left(1 - \frac{L}{L_h}\right)\frac{c_b^{-2}}{L-l}\right)^{1/2}\gamma_h \, \omega_{a_h} + O_p(n^{-1}),   (A.59)

\hat{\sigma}_{a_h}^2 = \frac{c_b^{-2}}{n(L-l)} + O_p(n^{-2}),   (A.60)

\hat{\mu}_{b_h} = \left(\left(1 - \frac{L}{L_h}\right)\frac{L-l}{c_b^{-2}}\right)^{1/2}\omega_{b_h} + O_p(n^{-1}),   (A.61)

\hat{\sigma}_{b_h}^2 = \frac{L-l}{c_b^{-2} L_h} + O_p(n^{-1}).   (A.62)

Selecting the largest singular value components minimizes the free energy, equation 6.14. Hence, combining equation A.7 with the fact that L L_h^{-1} = O_p(n^{-1}) for the necessary components, and the solution above with equation 6.6, we obtain the VB estimator in theorem 1.
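The redundant-component solution just derived has, by equation A.35, the form of a positive-part shrinkage of each observed singular value, γ̂_h = max(0, 1 − L/(nγ_h²)) γ_h, the positive branch existing only when nγ_h² ≥ L. The sketch below is our own minimal numerical illustration of that shrinkage rule only (it drops the c_a, c_b corrections and is not a full implementation of theorem 1):

```python
import numpy as np

def vb_shrink_singular_values(gamma, n, L):
    """Positive-part shrinkage of observed singular values, as in equation A.35:
    gamma_hat = max(0, 1 - L / (n * gamma**2)) * gamma.
    Components with n * gamma**2 < L are set exactly to zero."""
    gamma = np.asarray(gamma, dtype=float)
    return np.maximum(0.0, 1.0 - L / (n * gamma ** 2)) * gamma

n, L = 1000, 30
gamma = np.array([1.0, 0.5, 0.2, 0.1])   # hypothetical observed singular values
print(vb_shrink_singular_values(gamma, n, L))
```

Large singular values pass through almost unchanged, while components below the threshold nγ² = L are discarded, which is the "basis selection" aspect discussed in section 9.4.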
Acknowledgments

We thank the anonymous reviewers for the meaningful comments that polished this letter. We also thank Kazuo Ushida, Masahiro Nei, and Nobutaka Magome of Nikon Corporation for encouragement to research this subject.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. on Automatic Control, 19(6), 716–723.
Akaike, H. (1980). Likelihood and Bayes procedure. In J. M. Bernardo (Ed.), Bayesian statistics (pp. 143–166). Valencia, Spain: University Press.
Amari, S., Park, H., & Ozeki, T. (2006). Singularities affect dynamics of learning in neuromanifolds. Neural Computation, 18, 1007–1065.
Aoyagi, M., & Watanabe, S. (2004). Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks, 18, 924–933.
Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In Proc. of the Conference on Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann.
Baldi, P. F., & Hornik, K. (1995). Learning in linear neural networks: A survey. IEEE Trans. on Neural Networks, 6(4), 837–858.
Bickel, P., & Chernoff, H. (1993). Asymptotic distribution of the likelihood ratio statistic in a prototypical non regular problem. In J. K. Ghosh, S. K. Mitra, K. R. Parthasarathy, & B. L. S. Prakasa Rao (Eds.), Statistics and probability: A Raghu Raj Bahadur festschrift (pp. 83–96). New York: Wiley.
Cramer, H. (1949). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Dacunha-Castelle, D., & Gassiat, E. (1997). Testing in locally conic models, and application to mixture models. Probability and Statistics, 1, 285–317.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statistical Society, 39B, 1–38.
Efron, B., & Morris, C. (1973). Stein's estimation rule and its competitors—an empirical Bayes approach. J. Am. Stat. Assoc., 68, 117–130.
Fukumizu, K. (1999).
Generalization error of linear neural networks in unidentifiable cases. In Proceedings of the Tenth International Conference on Algorithmic Learning Theory (pp. 51–62). New York: Springer.
Fukumizu, K. (2003). Likelihood ratio of unidentifiable models and multilayer neural networks. Annals of Statistics, 31(3), 833–851.
Ghahramani, Z., & Beal, M. J. (2001). Graphical models and variational methods. In M. Opper & D. Saad (Eds.), Advanced mean field methods (pp. 161–177). Cambridge, MA: MIT Press.
Hagiwara, K. (2002). On the problem in model selection of neural network regression in overrealizable scenario. Neural Computation, 14, 1979–2002.
Hartigan, J. A. (1985). A failure of likelihood ratio asymptotics for normal mixtures. In Proc. of the Berkeley Conference in Honor of J. Neyman and J. Kiefer (pp. 807–810). Monterey, CA: Wadsworth.
Hinton, G. E., & van Camp, D. (1993). Keeping neural networks simple by minimizing the description length of the weights. In Proc. of the Conference on Computational Learning Theory (pp. 5–13). New York: ACM.
Hosino, T., Watanabe, K., & Watanabe, S. (2005). Stochastic complexity of variational Bayesian hidden Markov models. In Proc. of the International Joint Conference on Neural Networks.
Jaakkola, T. S., & Jordan, M. I. (2000). Bayesian parameter estimation via variational methods. Statistics and Computing, 10, 25–37.
James, W., & Stein, C. (1961). Estimation with quadratic loss. In Proc. of the 4th Berkeley Symp. on Math. Stat. and Prob. (pp. 361–379). Berkeley: University of California Press.
Kuriki, S., & Takemura, A. (2001). Tail probabilities of the maxima of multilinear forms and their applications. Annals of Statistics, 29(2), 328–371.
Levin, E., Tishby, N., & Solla, S. A. (1990). A statistical approach to learning and generalization in layered neural networks. Proc. IEEE, 78, 1568–1674.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Computation, 4(2), 415–447.
MacKay, D. J. C. (1995). Developments in probabilistic modeling with neural networks—ensemble learning. In Proc. of the 3rd Ann. Symp. on Neural Networks (pp. 191–198). New York: Springer-Verlag.
Nakajima, S., & Watanabe, S. (2005). Generalization error of linear neural networks in an empirical Bayes approach. In Proc. of IJCAI (pp. 804–810). Edinburgh, U.K.
Nakajima, S., & Watanabe, S. (2006). Generalization performance of subspace Bayes approach in linear neural networks. IEICE Trans., E89-D(3), 1128–1138.
Nakano, N., & Watanabe, S. (2005). Stochastic complexity of layered neural networks in mean field approximation. In Proc. of ICONIP. Taipei, Taiwan.
Reinsel, G. C., & Velu, R. P. (1998). Multivariate reduced-rank regression. New York: Springer.
Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14(3), 1080–1100.
Rusakov, D., & Geiger, D. (2002).
Asymptotic model selection for naive Bayesian networks. In Proc. of Conference on Uncertainty in Artificial Intelligence (pp. 438–445). San Francisco: Morgan Kaufmann.
Sato, M. (2001). Online model selection based on the variational Bayes. Neural Computation, 13, 1649–1681.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461–464.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proc. of the 3rd Berkeley Symp. on Math. Stat. and Prob. (pp. 197–206).
Takemura, A., & Kuriki, S. (1997). Weights of chi-bar-square distribution for smooth or piecewise smooth cone alternatives. Annals of Statistics, 25(6), 2368–2387.
Wang, B., & Titterington, D. M. (2004). Convergence and asymptotic normality of variational Bayesian approximations for exponential family models with missing values. In Proc. of Conference on Uncertainty in Artificial Intelligence (pp. 577–584).
Watanabe, K., & Watanabe, S. (2006). Stochastic complexities of gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7, 625–644.
Watanabe, S. (1995). A generalized Bayesian framework for neural networks with singular Fisher information matrices. In Proc. of the International Symposium on Nonlinear Theory and Its Applications (Vol. 2, pp. 207–210).
Watanabe, S. (2001a). Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13(4), 899–933.
Watanabe, S. (2001b). Algebraic information geometry for learning machines with singularities. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 329–336). Cambridge, MA: MIT Press.
Watanabe, S., & Amari, S. (2003). Learning coefficients of layered models when the true distribution mismatches the singularities. Neural Computation, 15, 1013–1033.
Wachter, K. W. (1978). The strong limits of random matrix spectra for sample matrices of independent elements. Ann. Prob., 6, 1–18.
Yamazaki, K., & Watanabe, S. (2002). Resolution of singularities in mixture models and its stochastic complexity. In Proc. of the 9th International Conference on Neural Information Processing (pp. 1355–1359). Singapore.
Yamazaki, K., & Watanabe, S. (2003a). Stochastic complexities of hidden Markov models. In Proc. of Neural Networks for Signal Processing XIII (NNSP) (pp. 179–188). Toulouse, France.
Yamazaki, K., & Watanabe, S. (2003b). Stochastic complexity of Bayesian networks. In Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (pp. 592–599). San Francisco: Morgan Kaufmann.
Yamazaki, K., Nagata, K., & Watanabe, S. (2005). A new method of model selection based on learning coefficient. In Proceedings of the International Symposium on Nonlinear Theory and Its Applications (pp. 389–392). Bruges, Belgium.
Received October 7, 2005; accepted July 27, 2006.
LETTER
Communicated by L´eon Bottou
Training a Support Vector Machine in the Primal

Olivier Chapelle∗
[email protected] Max Planck Institute for Biological Cybernetics, 72076 Tuebingen, Germany
Most literature on support vector machines (SVMs) concentrates on the dual optimization problem. In this letter, we point out that the primal problem can also be solved efficiently for both linear and nonlinear SVMs and that there is no reason for ignoring this possibility. On the contrary, from the primal point of view, new families of algorithms for large-scale SVM training can be investigated.

1 Introduction

The vast majority of textbooks and articles introducing support vector machines (SVMs) first state the primal optimization problem and then go directly to the dual formulation (Vapnik, 1998; Burges, 1998; Cristianini & Shawe-Taylor, 2000; Schölkopf & Smola, 2002). A reader could easily obtain the impression that this is the only possible way to train an SVM. In this letter, we reveal this as being a misconception and show that someone unaware of duality theory could train an SVM. Primal optimizations of linear SVMs have already been studied by Keerthi and DeCoste (2005) and Mangasarian (2002). One of the main contributions of this letter is to complement those studies to include the nonlinear case.¹ Our goal is not to claim that primal optimization is better than dual, but merely to show that they are two equivalent ways of reaching the same result. Also, we will show that when the goal is to find an approximate solution, primal optimization is superior.

Given a training set \{(x_i, y_i)\}_{1 \le i \le n}, x_i \in \mathbb{R}^d, y_i \in \{+1, -1\}, recall that the primal SVM optimization problem is usually written as

\min_{w,b} \|w\|^2 + C \sum_{i=1}^{n} \xi_i^p \quad \text{under constraints} \quad y_i(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0,   (1.1)

∗ Now at Yahoo! Research. Email: [email protected]
¹ Primal optimization of nonlinear SVMs has also been proposed in Lee and Mangasarian (2001b, section 4), but with a different regularizer.
Neural Computation 19, 1155–1178 (2007)
C 2007 Massachusetts Institute of Technology
1156
O. Chapelle
where p is either 1 (hinge loss) or 2 (quadratic loss). At this point in the literature, there are usually two main reasons mentioned for solving this problem in the dual:

1. The duality theory provides a convenient way to deal with the constraints.
2. The dual optimization problem can be written in terms of dot products, thereby making it possible to use kernel functions.

We will demonstrate in section 3 that those two reasons are not a limitation for solving the problem in the primal, mainly by writing the optimization problem as an unconstrained one and by using the representer theorem. In section 4, we will see that performing a Newton optimization in the primal yields exactly the same computational complexity as optimizing the dual; that will be validated experimentally in section 5. Finally, possible advantages of a primal optimization are presented in section 6. We will start with some general discussion about primal and dual optimization.

2 Links Between Primal and Dual Optimization

Primal and dual optimization have strong connections, and we illustrate some of them through the example of regularized least squares (RLS). Given a matrix X \in \mathbb{R}^{n \times d} representing the coordinates of n points in d dimensions and a target vector y \in \mathbb{R}^n, the primal RLS problem can be written as

\min_{w \in \mathbb{R}^d} \; \lambda w^\top w + \|Xw - y\|^2,   (2.1)

where λ is the regularization parameter. This objective function is minimized for w = (X^\top X + \lambda I)^{-1} X^\top y, and its minimum is

y^\top y - y^\top X (X^\top X + \lambda I)^{-1} X^\top y.   (2.2)

Introducing slack variables \xi = Xw - y, the dual optimization problem is

\max_{\alpha \in \mathbb{R}^n} \; 2\alpha^\top y - \frac{1}{\lambda} \alpha^\top (X X^\top + \lambda I)\alpha.   (2.3)

The dual is maximized for \alpha = \lambda (X X^\top + \lambda I)^{-1} y, and its maximum is

\lambda y^\top (X X^\top + \lambda I)^{-1} y.   (2.4)

The primal solution is then given by the KKT condition,

w = \frac{1}{\lambda} X^\top \alpha.   (2.5)

Now we relate the inverses of X X^\top + \lambda I and X^\top X + \lambda I thanks to the Woodbury formula (Golub & Loan, 1996):

\lambda (X X^\top + \lambda I)^{-1} = I - X(\lambda I + X^\top X)^{-1} X^\top.   (2.6)
With this equality, we recover that the primal, equation 2.2, and dual, equation 2.4, optimal values are the same: the duality gap is zero.

Let us now analyze the computational complexity of primal and dual optimization. The primal requires the computation and inversion of the matrix (X^\top X + \lambda I), which is O(nd² + d³). The dual deals with the matrix (X X^\top + \lambda I), which requires O(dn² + n³) operations to compute and invert. It is often argued that one should solve either the primal or the dual optimization problem depending on whether n is larger or smaller than d, resulting in an O(max(n, d) min(n, d)²) complexity. But this argument does not really hold, because one can always use equation 2.6 in case the matrix to invert is too big. So for both primal and dual optimization, the complexity is O(max(n, d) min(n, d)²).

The difference between primal and dual optimization comes when computing approximate solutions. Let us optimize both the primal, equation 2.1, and dual, equation 2.3, objective functions by conjugate gradient and see how the primal objective function decreases as a function of the number of conjugate gradient steps. For the dual optimization, an approximate dual solution is converted to an approximate primal one by using the KKT condition, equation 2.5. Intuitively, the primal optimization should be superior because it directly minimizes the quantity we are interested in. Figure 1 confirms this intuition. In some cases, there is no difference between primal and dual optimization (left), but in other cases, the dual optimization can be much slower to converge (right). In appendix A, we try to analyze this phenomenon by looking at the primal objective value after one conjugate gradient step. We show that the primal optimization always yields a lower value than the dual optimization, and we quantify the difference.

The conclusion from this analysis is that even though primal and dual optimization are equivalent, in terms of both the solution and time complexity, when it comes to approximate solutions, primal optimization is superior because it is more focused on minimizing what we are interested in: the primal objective function.
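These identities are easy to check numerically. The sketch below (Python/NumPy, with our own variable names) solves the RLS problem both in the primal and in the dual and verifies that the optimal values, equations 2.2 and 2.4, coincide and that the KKT condition, equation 2.5, maps the dual solution back to the primal one:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 100, 10, 1.0
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Primal solution, w = (X'X + lam*I)^-1 X'y, and primal optimum (eq. 2.2).
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
primal_opt = y @ y - y @ X @ w

# Dual solution, alpha = lam*(XX' + lam*I)^-1 y, and dual optimum (eq. 2.4).
alpha = lam * np.linalg.solve(X @ X.T + lam * np.eye(n), y)
dual_opt = lam * y @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

assert np.isclose(primal_opt, dual_opt)     # zero duality gap
assert np.allclose(w, X.T @ alpha / lam)    # KKT condition, eq. 2.5
```

The same check with n and d swapped exercises the Woodbury identity, equation 2.6, in the regime where the other matrix is the cheaper one to invert.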
Figure 1: Plots of the primal suboptimality, equations 2.1 and 2.2, for primal and dual optimization by conjugate gradient (pcg in Matlab). n points are drawn randomly from a spherical gaussian distribution in d dimensions. The targets are also randomly generated according to a gaussian distribution. λ is fixed to 1. (Left) n = 10 and d = 100. (Right) n = 100 and d = 10.
Figure 2: SVM loss function, L(y, t) = max(0, 1 − yt)^p for p = 1 and 2.
3 Primal Objective Function

Coming back to support vector machines, let us rewrite equation 1.1 as an unconstrained optimization problem:

\|w\|^2 + C \sum_{i=1}^{n} L(y_i, \; w \cdot x_i + b),   (3.1)

with L(y, t) = max(0, 1 − yt)^p (see Figure 2). More generally, L could be any loss function.
Let us now consider nonlinear SVMs with a kernel function k and an associated reproducing kernel Hilbert space H. The optimization problem, equation 3.1, becomes

\min_{f \in \mathcal{H}} \; \lambda \|f\|_{\mathcal{H}}^2 + \sum_{i=1}^{n} L(y_i, f(x_i)),   (3.2)

where we have made a change of variable by introducing the regularization parameter λ = 1/C. We have also dropped the offset b for simplicity. However, all the algebra presented below can be extended easily to take it into account (see appendix B).

Suppose now that the loss function L is differentiable with respect to its second argument. Using the reproducing property f(x_i) = \langle f, k(x_i, \cdot) \rangle_{\mathcal{H}}, we can differentiate equation 3.2 with respect to f, and at the optimal solution f∗, the gradient vanishes, yielding

2\lambda f^* + \sum_{i=1}^{n} \frac{\partial L}{\partial t}(y_i, f^*(x_i)) \, k(x_i, \cdot) = 0,   (3.3)

where ∂L/∂t is the partial derivative of L(y, t) with respect to its second argument. This implies that the optimal function can be written as a linear combination of kernel functions evaluated at the training samples. This result is also known as the representer theorem (Kimeldorf & Wahba, 1970). Thus, we seek a solution of the form

f(x) = \sum_{i=1}^{n} \beta_i k(x_i, x).   (3.4)

We denote those coefficients β_i and not α_i, as in the standard SVM literature, to stress that they should not be interpreted as Lagrange multipliers. Let us express equation 3.2 in terms of β_i,

\lambda \sum_{i,j=1}^{n} \beta_i \beta_j k(x_i, x_j) + \sum_{i=1}^{n} L\Big(y_i, \sum_{j=1}^{n} k(x_i, x_j)\beta_j\Big),   (3.5)

where we used the kernel reproducing property in \|f\|_{\mathcal{H}}^2 = \sum_{i,j=1}^{n} \beta_i \beta_j \langle k(x_i, \cdot), k(x_j, \cdot) \rangle_{\mathcal{H}} = \sum_{i,j=1}^{n} \beta_i \beta_j k(x_i, x_j). Introducing the kernel matrix K with K_{ij} = k(x_i, x_j) and K_i the ith column of K, equation 3.5 can be rewritten as

\lambda \beta^\top K \beta + \sum_{i=1}^{n} L(y_i, K_i^\top \beta).   (3.6)
As long as L is differentiable, we can optimize equation 3.6 by gradient descent. Note that this is an unconstrained optimization problem.

4 Newton Optimization

The unconstrained objective function, equation 3.6, can be minimized using a variety of optimization techniques such as conjugate gradient. Here we will consider only Newton optimization, as the similarities with dual optimization will then appear clearly. We will focus on two loss functions: the quadratic penalization of the training errors (see Figure 2) and a differentiable approximation to the linear penalization, the Huber loss.

4.1 Quadratic Loss. Let us start with the easier case: the L₂ penalization of the training errors, L(y_i, f(x_i)) = max(0, 1 − y_i f(x_i))². For a given value of the vector β, we say that a point x_i is a support vector if y_i f(x_i) < 1, that is, if the loss on this point is nonzero. Note that this definition of support vector is different from β_i ≠ 0.² Let us reorder the training points such that the first n_sv points are support vectors. Finally, let I⁰ be the n × n diagonal matrix with the first n_sv entries being 1 and the others 0,

I^0 \equiv \mathrm{diag}(\underbrace{1, \ldots, 1}_{n_{sv}}, \underbrace{0, \ldots, 0}_{n - n_{sv}}).

The gradient of equation 3.6 with respect to β is

\nabla = 2\lambda K \beta + \sum_{i=1}^{n_{sv}} K_i \frac{\partial L}{\partial t}(y_i, K_i^\top \beta)
      = 2\lambda K \beta + 2 \sum_{i=1}^{n_{sv}} K_i y_i (y_i K_i^\top \beta - 1)
      = 2(\lambda K \beta + K I^0 (K\beta - Y)),   (4.1)

² From equation 3.3, it turns out at the optimal solution that the sets {i, β_i ≠ 0} and {i, y_i f(x_i) < 1} will be the same. To avoid confusion, we could have defined this latter as the set of error vectors.
and the Hessian is H = 2(λK + K I 0 K ).
(4.2)
Each Newton step consists of the following update, β ← β − γ H −1 ∇, where γ is the step size found by line search or backtracking (Boyd & Vandenberghe, 2004). In our experiments, we noticed that the default value of γ = 1 did not result in any convergence problem, and in the rest of this section we consider only this value. However, to enjoy the theorical properties concerning the convergence of this algorithm, backtracking is necessary. Combining equations 4.1 and 4.2 as ∇ = Hβ − 2K I 0 Y, we find that after the update, β = (λK + K I 0 K )−1 K I 0 Y −1 0 I Y. = λIn + I 0 K
(4.3)
Note that we have assumed that K (and thus the Hessian) is invertible. If K is not invertible, then the expansion is not unique (even though the solution is), and equation 4.3 will produce one of the possible expansions of the solution. To avoid these problems, let us simply assume that an infinitesimally small ridge has been added to K.

Let I_p denote the identity matrix of size p × p and K_sv the first n_sv rows and columns of K, that is, the submatrix corresponding to the support vectors. Using the fact that the lower-left block of λI_n + I⁰K is 0, the inverse of this matrix can be computed easily, and the update equation 4.3 finally turns out to be

β = [(λI_{n_sv} + K_sv)⁻¹, 0; 0, 0] Y = [(λI_{n_sv} + K_sv)⁻¹ Y_sv; 0].    (4.4)
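The two forms in equation 4.3 and the block form 4.4 are easy to verify numerically. In this sketch (our own toy construction, not from the letter), K is a random symmetric positive definite matrix, and the points are ordered so that the first n_sv play the role of support vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
n, nsv = 12, 5
A = rng.normal(size=(n, n))
K = A @ A.T + n * np.eye(n)           # random SPD "kernel" matrix, so K is invertible
Y = rng.choice([-1.0, 1.0], size=n)
lam = 0.3
# Points ordered so that the first nsv are the support vectors
I0 = np.diag([1.0] * nsv + [0.0] * (n - nsv))

# Equation 4.3, both forms
b1 = np.linalg.solve(lam * K + K @ I0 @ K, K @ I0 @ Y)
b2 = np.linalg.solve(lam * np.eye(n) + I0 @ K, I0 @ Y)
assert np.allclose(b1, b2)

# Equation 4.4: block closed form, zero outside the support vectors
b3 = np.zeros(n)
b3[:nsv] = np.linalg.solve(lam * np.eye(nsv) + K[:nsv, :nsv], Y[:nsv])
assert np.allclose(b2, b3)
```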
4.1.1 Link with Dual Optimization. Update rule 4.4 is not surprising if one has a look at the SVM dual optimization problem:

max_α  Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j (K_ij + λδ_ij),

under the constraints α_i ≥ 0.
Consider the optimal solution: the gradient with respect to all α_i > 0 (the support vectors) must be 0,

1 − diag(Y_sv)(K_sv + λI_{n_sv}) diag(Y_sv) α = 0,

where diag(Y) stands for the diagonal matrix whose diagonal is the vector Y. Thus, up to a sign difference, the solutions found by minimizing the primal and maximizing the dual are the same: β_i = y_i α_i.

4.1.2 Complexity Analysis. Only a couple of iterations are usually necessary to reach the solution (rarely more than five), and this number seems independent of n. The overall complexity is thus the complexity of one Newton step, which is O(n n_sv + n_sv³). Indeed, the first term corresponds to finding the support vectors (i.e., the points for which y_i f(x_i) < 1), and the second term is the cost of inverting the matrix K_sv + λI_{n_sv}. It turns out that this is the same complexity as in standard SVM learning (dual maximization), since those two steps are also necessary there. Of course, n_sv is not known in advance, and this complexity analysis is an a posteriori one. In the worst case, the complexity is O(n³).

It is important to note that, in general, this time complexity is also a lower bound (for the exact computation of the SVM solution). Chunking and decomposition methods, for instance (Joachims, 1999; Osuna, Freund, & Girosi, 1997), do not help, since there is fundamentally a linear system of size n_sv to be solved.³ Chunking is useful only when the K_sv matrix cannot fit in memory.

4.2 Huber and Hinge Loss. The hinge loss used in SVMs is not differentiable. We propose to use a differentiable approximation of it, inspired by the Huber loss (see Figure 3):
L(y, t) =
  0                     if yt > 1 + h,
  (1 + h − yt)²/(4h)    if |1 − yt| ≤ h,
  1 − yt                if yt < 1 − h,    (4.5)
where h is a parameter to choose, typically between 0.01 and 0.5. Note that we are not minimizing the hinge loss itself, but this does not matter, since, from a machine learning point of view, there is no reason to prefer the hinge loss anyway. If one really wants to approach the hinge loss solution, one can let h → 0 (similar to Lee & Mangasarian, 2001b).

The derivation of the Newton step follows the same line as for the L2 loss, and we will not go into the details. The algebra is just a bit more

3. We considered here that solving a linear system (in either the primal or the dual) takes cubic time. This time complexity can, however, be improved.
Figure 3: The Huber loss is a differentiable approximation of the hinge loss. The plot is equation 4.5 with h = 0.5.
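A quick numerical check of equation 4.5 (an illustrative sketch; the function name is ours, not from the letter): the three pieces agree in value, and in slope, at the junctions yt = 1 − h and yt = 1 + h, so the loss is indeed differentiable everywhere.

```python
import numpy as np

def huber_hinge(yt, h=0.5):
    """The differentiable hinge approximation of equation 4.5."""
    yt = np.asarray(yt, dtype=float)
    return np.where(yt > 1 + h, 0.0,
           np.where(yt < 1 - h, 1 - yt,
                    (1 + h - yt) ** 2 / (4 * h)))

h = 0.5
# Value continuity: quadratic piece gives (2h)^2/(4h) = h at yt = 1-h (linear piece: 1-(1-h) = h)
assert np.isclose(huber_hinge(1 - h, h), h)
assert np.isclose(huber_hinge(1 + h, h), 0.0)
# Slope continuity at yt = 1-h: derivative of the quadratic piece is -(1+h-yt)/(2h) = -1 there,
# matching the slope of the linear piece
eps = 1e-7
slope = (huber_hinge(1 - h + eps, h) - huber_hinge(1 - h - eps, h)) / (2 * eps)
assert np.isclose(slope, -1.0, atol=1e-3)
```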
complicated because there are three different parts in the loss (and thus three different categories of points):

- n_sv of them are in the quadratic part of the loss.
- n_sv̄ are in the linear part of the loss. We will call this category of points the support vectors at bound, in reference to dual optimization, where the Lagrange multipliers associated with those points are at the upper bound C.
- The rest of the points have zero loss.
We reorder the training set in such a way that the points are grouped in these three categories. Let I¹ be the diagonal matrix with the first n_sv entries equal to 0, followed by n_sv̄ entries equal to 1 (and 0 for the rest). The gradient is

∇ = 2λKβ + K I⁰(Kβ − (1 + h)Y)/(2h) − K I¹Y,

and the Hessian is

H = 2λK + K I⁰K/(2h).

Thus,

∇ = Hβ − K [(1 + h)/(2h) I⁰ + I¹] Y,
and the new β is

β = [2λI_n + I⁰K/(2h)]⁻¹ [(1 + h)/(2h) I⁰ + I¹] Y
  ≡ [β_sv; β_sv̄; 0]
  = [(4hλI_{n_sv} + K_sv)⁻¹ ((1 + h)Y_sv − K_{sv,sv̄} Y_sv̄/(2λ)); Y_sv̄/(2λ); 0].    (4.6)

Again, one can see the link with dual optimization: letting h → 0, the primal and the dual solutions are the same, β_i = y_i α_i. This is obvious for the points in the linear part of the loss (with C = 1/(2λ)). For the points that are right on the margin, their output is equal to their label, K_sv β_sv + K_{sv,sv̄} β_sv̄ = Y_sv. But since β_sv̄ = Y_sv̄/(2λ),

β_sv = K_sv⁻¹ (Y_sv − K_{sv,sv̄} Y_sv̄/(2λ)),
which is the same equation as the first block of equation 4.6 when h → 0.

4.2.1 Complexity Analysis. Similar to the quadratic loss case, the complexity is O(n_sv³ + n(n_sv + n_sv̄)). The n_sv + n_sv̄ factor is the complexity of computing the output of one training point (the number of nonzero elements in the vector β). Again, the complexity of dual optimization is the same, since both steps (solving a linear system of size n_sv and computing the outputs of all the points) are required.

4.3 Other Losses. Some other losses have been proposed to approximate the SVM hinge loss (Lee & Mangasarian, 2001b; Zhang, Jin, Yang, & Hauptmann, 2003; Zhu & Hastie, 2005). However, none of them has a linear part, and the overall complexity is O(n³), which can be much larger than the complexity of standard SVM training. More generally, the size of the linear system to solve is equal to n_sv, the number of training points for which ∂²L/∂t²(y_i, f(x_i)) ≠ 0. If there are some large linear parts in the loss function, this number might be much smaller than n, resulting in a significant speedup compared to the standard O(n³) cost.

5 Experiments

The experiments in this section aim at checking that primal and dual optimization of a nonlinear SVM have similar time complexities. However,
for linear SVMs, the primal optimization is definitely superior (Keerthi & DeCoste, 2005), as illustrated below.

Some Matlab code for the quadratic penalization of the errors, taking into account the bias b, is available online at http://www.kyb.tuebingen.mpg.de/bs/people/chapelle/primal.

5.1 Linear SVM. In the case of quadratic penalization of the training errors, the gradient of the objective function, equation 3.1, is

∇ = 2w + 2C Σ_{i∈sv} (w · x_i − y_i) x_i,

and (up to a factor of 2 that does not affect the Newton step) the Hessian is

H = I_d + C Σ_{i∈sv} x_i x_i^⊤.
The computation of the Hessian takes O(d² n_sv) operations, and its inversion O(d³). When the number of dimensions is relatively small compared to the number of training samples, it is advantageous to optimize directly on w rather than on the expansion coefficients. In the case where d is large but the data are sparse, the Hessian should not be built explicitly. Instead, the linear system H⁻¹∇ can be solved efficiently by conjugate gradient (Keerthi & DeCoste, 2005).

A training time comparison on the Adult data set (Platt, 1998) is presented in Figure 4. As expected, the training time is linear for our primal implementation, but the scaling exponent is 2.2 for the dual implementation of LIBSVM (comparable to the 1.9 reported in Platt, 1998). This exponent can be explained as follows: n_sv is very small (Platt, 1998, Table 12.3), and n_sv̄ grows linearly with n (the misclassified training points). So for this data set, the O(n_sv³ + n(n_sv + n_sv̄)) complexity turns out to be about O(n²). It is noteworthy that for this experiment, the number of Newton steps required to reach the exact solution was seven. More generally, this algorithm is usually extremely fast for linear SVMs.

5.2 L2 Loss. We now compare primal and dual optimization for nonlinear SVMs. To avoid problems of memory management and kernel caching, and to make the time comparison as straightforward as possible, we decided to precompute the entire kernel matrix. For this reason, the Adult data set used in the previous section is not suitable, because it would be difficult to fit the kernel matrix in memory (about 8 GB). Instead, we used the USPS data set, consisting of 7291 training examples. The problem was made binary by classifying digits 0 to 4 versus 5 to 9. An RBF kernel with σ = 8 was chosen. We consider in this section the hard margin SVM, by fixing λ to a very small value: 10⁻⁸.
Figure 4: Time comparison of LIBSVM (an implementation of SMO; Platt, 1998) and direct Newton optimization on the normal vector w.
The training for the primal optimization is performed as follows (see Algorithm 1): we start from a small number of training samples, train, double the number of samples, retrain, and so on. In this way, the set of support vectors is rather well identified (otherwise, we would have to invert an n × n matrix in the first Newton step).

Algorithm 1: SVM Primal Training by Newton Optimization.

Function: β = PRIMALSVM(K, Y, λ)
  n ← length(Y)   {number of training points}
  if n > 1000 then
    n₂ ← n/2   {train first on a subset to estimate the decision boundary}
    β ← PRIMALSVM(K_{1..n₂,1..n₂}, Y_{1..n₂}, λ)
    sv ← nonzero components of β
  else
    sv ← {1, …, n}
  end if
  repeat
    β_sv ← (K_sv + λI_{n_sv})⁻¹ Y_sv
    other components of β ← 0
    sv ← indices i such that y_i [Kβ]_i < 1
  until sv has not changed
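Algorithm 1 translates almost line for line into NumPy. The sketch below is illustrative (the function name, toy data, and kernel parameters are ours, not from the letter); it uses the update of equation 4.4 and stops when the support-vector set no longer changes.

```python
import numpy as np

def primal_svm(K, Y, lam, max_iter=100):
    """Algorithm 1: recursive SVM primal training by Newton optimization (L2 loss)."""
    n = len(Y)
    if n > 1000:
        n2 = n // 2
        # Train first on a subset to estimate the decision boundary
        beta = np.zeros(n)
        beta[:n2] = primal_svm(K[:n2, :n2], Y[:n2], lam)
        sv = np.flatnonzero(beta)
    else:
        sv = np.arange(n)
    beta = np.zeros(n)
    for _ in range(max_iter):
        beta[:] = 0.0
        beta[sv] = np.linalg.solve(K[np.ix_(sv, sv)] + lam * np.eye(len(sv)), Y[sv])
        new_sv = np.flatnonzero(Y * (K @ beta) < 1)   # points with nonzero loss
        if np.array_equal(new_sv, sv):
            return beta
        sv = new_sv
    raise RuntimeError("support-vector set did not stabilize")

# Toy usage: two gaussian blobs, RBF kernel with sigma = 1 (all choices illustrative)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(30, 2)), rng.normal(1, 1, size=(30, 2))])
Y = np.repeat([-1.0, 1.0], 30)
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2)
beta = primal_svm(K, Y, lam=0.1)
```

At the returned fixed point, non-support vectors have zero coefficient, and the gradient of equation 4.1 vanishes exactly (up to the linear-solver precision).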
The time comparison is plotted in Figure 5: the running times for primal and dual training are almost the same. Moreover, they are directly proportional to n_sv³, which turns out to be the dominating term in the O(n n_sv + n_sv³) time complexity. In this problem, n_sv grows approximately like √n. This
Figure 5: With the L2 penalization of the slacks, the parallel between dual optimization and primal Newton optimization is striking: the training times are almost the same (and scale as O(n_sv³)). Note that both solutions are exactly the same.
seems to be in contradiction with the result of Steinwart (2003), which states that the number of support vectors grows linearly with the training set size. However, this result holds only for noisy problems, and the USPS data set has a very small noise level.

5.3 Huber Loss. We perform the same experiments as in the previous section, but introduce noise in the labels: the labels of a randomly chosen 10% of the points have been flipped. In this kind of situation, the L2 loss is not well suited, because it penalizes the noisy examples too much. However, if the noise were, for instance, gaussian in the inputs, the L2 loss would have been very appropriate.

We study the time complexity and the test error as the parameter h varies. Note that the solution will not usually be exactly the same as for a standard hinge loss SVM (it will be the case only in the limit h → 0). The regularization parameter λ was set to 1/8, which corresponds to the best test performance.

In the experiments described below, a line search was performed in order to make Newton converge more quickly. This means that instead of using equation 4.6 to update β, the following step was made,

β ← β − γ H⁻¹∇,

where γ ∈ [0, 1] is found by 1D minimization. This additional line search does not increase the complexity, since it is just O(n). For the L2 loss described in the previous section, this line search was not necessary, and full Newton steps were taken (γ = 1).
Figure 6: Influence of h on the test error (left, n = 500) and the training time (right, n = 1000).
Figure 7: When h increases, more points are in the quadratic part of the loss (n_sv increases and n_sv̄ decreases). The dashed lines are the LIBSVM solution. For this plot, n = 1000.
5.3.1 Influence of h. As expected, the left-hand side of Figure 6 shows that the test error is relatively unaffected by the value of h as long as it is not too large. For h = 1, the loss looks more like the L 2 loss, which is inappropriate for the kind of noise we generated. Concerning the time complexity (right-hand side of Figure 6), there seems to be an optimal range for h. When h is too small, the problem is highly non-quadratic (because most of the loss function is linear), and a lot of Newton steps are necessary. On the other hand, when h is large, nsv increases, and since the complexity is mainly in O(n3sv ), the training time increases (see Figure 7).
Figure 8: Time comparison between LIBSVM and Newton optimization. Here n_sv has been computed from LIBSVM (note that the solutions are not exactly the same). For this plot, h = 2⁻⁵.
5.3.2 Time Comparison with LIBSVM. Figure 8 presents a time comparison of both optimization methods for different training set sizes. As for the quadratic loss, the time complexity is O(n_sv³). However, unlike in Figure 5, the constant factor for the LIBSVM training time is better. This is probably because the loss function is far from quadratic, so the Newton optimization requires more steps to converge (on the order of 30). But we believe that this factor can be improved by not inverting the Hessian from scratch at each iteration, or by using a more direct optimizer such as conjugate gradient descent.
6 Advantages of Primal Optimization

As explained throughout this letter, primal and dual optimization are very similar, and it is not surprising that they lead to the same computational complexity, O(n n_sv + n_sv³). So is there a reason to use one rather than the other?

We believe that primal optimization might have advantages for large-scale optimization. Indeed, when the number of training points is large, the number of support vectors is also typically large, and it becomes intractable to compute the exact solution. For this reason, one has to resort to approximations (Bordes, Ertekin, Weston, & Bottou, 2005; Bakır, Bottou, & Weston, 2005; Collobert, Bengio, & Bengio, 2002; Tsang, Kwok, & Cheung, 2005). But introducing approximations in the dual may not be wise. There is indeed no guarantee that an approximate dual solution yields a good approximate primal solution. Since what we are eventually interested in is a good primal
objective function value, it is more straightforward to minimize it directly (see the discussion at the end of section 2). Below are some examples of approximation strategies for primal minimization. One can probably come up with many more, but our goal is just to give a flavor of what can be done in the primal.

6.1 Conjugate Gradient. One could directly minimize equation 3.6 by conjugate gradient descent. For the squared loss without regularizer, this approach has been investigated in Ong (2005). The hope is that on a lot of problems, a reasonable solution can be obtained with only a couple of gradient steps. In the dual, this strategy is hazardous: there is no guarantee that an approximate dual solution corresponds to a reasonable primal solution. We have indeed shown in section 2 and appendix A that, for a given number of conjugate gradient steps, the primal objective function is lower when optimizing the primal than when optimizing the dual.

However, one needs to keep in mind that the performance of a conjugate gradient optimization strongly depends on the parameterization of the problem. In section 2, we analyzed an optimization in terms of the primal variable w, whereas in the rest of the letter, we used the reparameterization, equation 3.4, with the vector β. The convergence rate of conjugate gradient depends on the condition number of the Hessian (Shewchuk, 1994). For an optimization on β, it is roughly equal to the condition number of K² (see equation 4.2), while for an optimization on w, it is the condition number of K. So optimizing on β could be much slower. There is fortunately an easy fix to this problem: preconditioning by K. In general, preconditioning by a matrix M requires being able to compute M⁻¹∇ efficiently, where ∇ is the gradient (Shewchuk, 1994). But in our case, it turns out that one can factor K out of the expression of the gradient, equation 4.1, and the computation of K⁻¹∇ becomes trivial.
With this preconditioning (which comes at no extra computational cost), the convergence rate is now the same as for the optimization on w. In fact, in the case of the regularized least squares of section 2, one can show that the conjugate gradient steps for the optimization on w and for the optimization on β with preconditioning are identical. Pseudocode for optimizing equation 3.6 using the Fletcher-Reeves update and this preconditioning is given in Algorithm 2. Note that in this pseudocode, ∇ is exactly the gradient given in equation 4.1, but "divided" by K.

Finally, we would like to point out that Algorithm 2 has another interesting interpretation: it is equivalent to performing a conjugate gradient minimization on w (see equation 3.1), while maintaining the solution in terms of β, such that w = Σ_i β_i x_i. This is possible because at each step of the algorithm, the gradient (with respect to w) is always in the span of the training points. More precisely, the gradient of equation 3.1 with respect to w is Σ_i (∇_new)_i x_i.
Figure 9: Optimization of the objective function, equation 3.6, by conjugate gradient for different values of the kernel width.
Algorithm 2: Optimization of Equation 3.6 (with the L2 loss) by Preconditioned Conjugate Gradients.

Let β = 0 and d = ∇_old = −Y
repeat
  Let t* be the minimizer of equation 3.6 on the line β + td
  β ← β + t* d
  Let o = Kβ − Y and sv = {i : y_i [Kβ]_i < 1}. Update I⁰.
  ∇_new ← 2(λβ + I⁰o)
  d ← −∇_new + (∇_new^⊤ K ∇_new)/(∇_old^⊤ K ∇_old) d
  ∇_old ← ∇_new
until ‖∇_new‖ ≤ ε
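Algorithm 2 can be sketched in NumPy as follows (our transcription; the toy data, the bisection line search, and all names are illustrative, not from the letter). The gradient used is the true gradient of equation 3.6 "divided" by K, so the Fletcher-Reeves ratio is taken in the K-inner product.

```python
import numpy as np

def pcg_primal(K, Y, lam, iters=50, tol=1e-10):
    """Preconditioned conjugate gradients on equation 3.6 with the L2 loss."""
    beta = np.zeros(len(Y))

    def grad(b):                       # gradient of eq. 3.6 "divided" by K
        o = K @ b - Y
        I0 = (Y * (K @ b) < 1).astype(float)
        return 2 * (lam * b + I0 * o)

    def line_min(b, d):
        # 1D minimization along b + t d by bisection on the directional
        # derivative of eq. 3.6, which is d' K grad (monotone, by convexity)
        def dphi(t):
            return d @ (K @ grad(b + t * d))
        lo, hi = -1.0, 1.0
        while dphi(lo) > 0:
            lo *= 2
        while dphi(hi) < 0:
            hi *= 2
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if dphi(mid) < 0:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    g = grad(beta)
    d = -g
    for _ in range(iters):
        beta = beta + line_min(beta, d) * d
        g_new = grad(beta)
        if np.linalg.norm(g_new) < tol:
            break
        # Fletcher-Reeves ratio in the K-inner product (the preconditioner)
        d = -g_new + (g_new @ (K @ g_new)) / (g @ (K @ g)) * d
        g = g_new
    return beta

# Toy usage (illustrative data and parameters)
rng = np.random.default_rng(1)
n = 25
X = rng.normal(size=(n, 2))
Y = np.where(X[:, 0] > 0, 1.0, -1.0)
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2)
lam = 0.1
beta = pcg_primal(K, Y, lam)

def obj(b):
    f = K @ b
    return lam * b @ K @ b + np.sum(np.maximum(0.0, 1 - Y * f) ** 2)
```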
Let us now turn to an empirical study of the conjugate gradient behavior. As in the previous section, we considered the 7291 training examples of the USPS data set. We monitored the test error as a function of the number of conjugate gradient iterations. It can be seen in Figure 9 that:

- A relatively small number of iterations (between 10 and 100) is enough to reach a good solution. Note that the test errors at the right of the figure (corresponding to 128 iterations) are the same as for a fully trained SVM: the objective values have almost converged at this point.
- The convergence rate depends, through the condition number, on the bandwidth σ. For σ = 1, K is very similar to the identity matrix, and one step is enough to get close to the optimal solution. However, the test error is not so good for this value of σ, and one should set it to 2 or 4.
It is noteworthy that the preconditioning discussed above is really helpful. For instance, without it, the test error is still 1.84% after 256 iterations with σ = 4.

Finally, note that each conjugate gradient step requires the computation of Kβ (see equation 4.1), which takes O(n²) operations. If one wants to avoid recomputing the kernel matrix at each step, the memory requirement is also O(n²). Both the time and the memory requirements could probably be improved to O(n_sv²) per conjugate gradient iteration by working directly on the linear system, equation 4.6. But this complexity is probably still too high for large-scale problems. Here are some cases where the matrix-vector multiplication can be done more efficiently:
- Sparse kernel. If a compactly supported RBF kernel (Schaback, 1995; Fasshauer, 2005) is used, the kernel matrix K is sparse, and the time and memory complexities are then proportional to the number of nonzero elements in K. Also, when the bandwidth of the gaussian RBF kernel is small, the kernel matrix can be well approximated by a sparse matrix.
- Low rank. Whenever the kernel matrix is (approximately) low rank, one can write K ≈ AA^⊤, where A ∈ ℝ^{n×p} can be found through an incomplete Cholesky decomposition in O(np²) operations. The complexity of each conjugate gradient iteration is then O(np). This idea has been used in Fine and Scheinberg (2001) in the context of SVM training, but the authors considered only dual optimization. Note that the kernel matrix is usually low rank when the bandwidth of the gaussian RBF kernel is large.
- Fast multipole methods. Generalizing both cases above, fast multipole methods and KD-trees provide an efficient way of computing the multiplication of an RBF kernel matrix with a vector (Greengard & Rokhlin, 1987; Gray & Moore, 2000; Yang, Duraiswami, & Davis, 2004; de Freitas, Wang, Mahdaviani, & Lang, 2005; Shen, Ng, & Seeger, 2006). These methods have been successfully applied to kernel ridge regression and gaussian processes, but they do not seem able to handle high-dimensional data (see Lang, Klaas, & de Freitas, 2005, for an empirical study of the time and memory requirements of these methods).
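The low-rank case is easy to see in code. The sketch below (illustrative; we use a truncated eigendecomposition as a stand-in for the incomplete Cholesky factorization mentioned above) replaces the O(n²) product Kv by the O(np) product A(A^⊤v); the error is bounded by the largest discarded eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 20
X = rng.normal(size=(n, 2))
# Large-bandwidth RBF kernel: approximately low rank
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / (2 * 5.0 ** 2))

# Rank-p factorization K ≈ A A^T (eigendecomposition here; an incomplete
# Cholesky would give the same kind of factor in O(n p^2) operations)
w, U = np.linalg.eigh(K)                 # eigenvalues in ascending order
A = U[:, -p:] * np.sqrt(np.maximum(w[-p:], 0.0))

v = rng.normal(size=n)
fast = A @ (A.T @ v)                     # O(np) matrix-vector product
# ||K - A A^T||_2 equals the largest discarded eigenvalue, so the error obeys
assert np.linalg.norm(fast - K @ v) <= w[-(p + 1)] * np.linalg.norm(v) + 1e-9
```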
6.2 Reduced Expansion. Instead of optimizing on a vector β of length n, one can choose a small subset of the training points on which to expand the solution, and optimize only those weights. More precisely, the same objective function, equation 3.2, is considered, but unlike equation 3.4, it is optimized over the subset of functions of the form

f(x) = Σ_{i∈S} β_i k(x_i, x),    (6.1)
where S is a subset of the training set. This approach has been pursued in Keerthi, Chapelle, and DeCoste (2006), where the set S is greedily constructed, and in Lee and Mangasarian (2001a), where S is selected randomly. If S contains k elements, these methods have a complexity of O(nk²) and a memory requirement of O(nk).

6.3 Model Selection. Another advantage of primal optimization arises when some hyperparameters are optimized on the training cost function (Chapelle, Vapnik, Bousquet, & Mukherjee, 2002; Grandvalet & Canu, 2002). If θ is a set of hyperparameters and α the dual variables, the standard way of learning θ is to solve a min-max problem (remember that the maximum of the dual equals the minimum of the primal),

min_θ max_α Dual(α, θ),

by alternating between minimization on θ and maximization on α (see, e.g., Grandvalet & Canu, 2002, for the special case of learning scaling factors). But if the primal is minimized, a joint optimization on β and θ can be carried out, which is likely to be much faster. Finally, to compute an approximate leave-one-out error, the matrix K_sv + λI_{n_sv} needs to be inverted (Chapelle et al., 2002), but after a Newton optimization, this inverse is already available in equation 4.4.

7 Conclusion

In this letter, we have studied the primal optimization of nonlinear SVMs and derived the update rules for a Newton optimization. From these formulas, it appears clearly that there are strong similarities between primal and dual optimization. Also, the corresponding implementation is very simple and does not require any optimization libraries.

The historical reasons that most of the research in the past decade has been about dual optimization are unclear. We believe that it is because SVMs were introduced in their hard margin formulation (Boser, Guyon, & Vapnik, 1992), for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account (Chapelle, Weston, Bottou, & Vapnik, 2000).

We do not pretend that primal optimization is better in general; our main motivation was to point out that primal and dual are two sides of the same coin and that there is no reason to always look at the same side. And by looking at the primal side, some new algorithms for finding approximate solutions emerge naturally. We believe that an approximate primal solution
is in general superior to a dual one, since an approximate dual solution can yield a primal one that is arbitrarily bad. In addition to all the possibilities for approximate solutions mentioned in this letter, primal optimization also offers the advantage of tuning the hyperparameters simultaneously, by performing a joint optimization on parameters and hyperparameters.

Appendix A: Primal Suboptimality

Let us define the following quantities:

A = y^⊤y,  B = y^⊤XX^⊤y,  C = y^⊤XX^⊤XX^⊤y.

After one gradient step with exact line search on the primal objective function, we have w = B/(C + λB) X^⊤y, and the primal value, equation 2.1, is

(1/2) A − (1/2) B²/(C + λB).

For the dual optimization, after one gradient step, α = A/(B + λA) y, and by equation 2.5, w = A/(B + λA) X^⊤y. The primal value is then

(1/2) A + (1/2) (A/(B + λA))² (C + λB) − AB/(B + λA).

The difference between these two quantities is

(1/2) (A/(B + λA))² (C + λB) − AB/(B + λA) + (1/2) B²/(C + λB) = (1/2) (B² − AC)² / ((B + λA)²(C + λB)) ≥ 0.

This proves that if one does only one gradient step, one should do it on the primal instead of the dual, because one will get a lower primal value this way.

Now note that by the Cauchy-Schwarz inequality, B² ≤ AC, with equality only if XX^⊤y and y are aligned. In that case, the above expression is zero: the primal and dual steps are equally efficient. That is what happens on the left side of Figure 1: when n ≪ d, since X has been generated according to a gaussian distribution, XX^⊤ ≈ dI, and the vectors XX^⊤y and y are almost aligned.
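The suboptimality identity above is easy to confirm numerically. In this sketch (random data and our own variable names, purely illustrative), the two primal values and the closed-form difference are compared.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 30, 10
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
lam = 0.7

A = y @ y
B = y @ X @ X.T @ y
C = y @ X @ X.T @ X @ X.T @ y

# Primal value after one exact-line-search gradient step on the primal
primal_after_primal = 0.5 * A - 0.5 * B ** 2 / (C + lam * B)
# Primal value after one exact-line-search gradient step on the dual
primal_after_dual = (0.5 * A
                     + 0.5 * (A / (B + lam * A)) ** 2 * (C + lam * B)
                     - A * B / (B + lam * A))

diff = primal_after_dual - primal_after_primal
closed_form = 0.5 * (B ** 2 - A * C) ** 2 / ((B + lam * A) ** 2 * (C + lam * B))
assert diff >= -1e-9                 # the primal step is never worse
assert np.isclose(diff, closed_form) # matches the closed-form difference
```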
Appendix B: Optimization with an Offset

We now consider a joint optimization on (b, β) of the function

f(x) = Σ_{i=1}^n β_i k(x_i, x) + b.

The augmented Hessian (see equation 4.2) is

H = 2 [1^⊤I⁰1, 1^⊤I⁰K; K I⁰1, λK + K I⁰K],

where 1 should be understood as a vector of all 1s. This can be decomposed as

H = 2 [−λ, 1^⊤; 0, K] [0, 1^⊤; I⁰1, λI + I⁰K].

Now the gradient is

∇ = H [b; β] − 2 [1^⊤; K] I⁰Y,

and the equivalent of the update equation, equation 4.3, is

[b; β] = [0, 1^⊤; I⁰1, λI + I⁰K]⁻¹ [−λ, 1^⊤; 0, K]⁻¹ [1^⊤; K] I⁰Y
       = [0, 1^⊤; I⁰1, λI + I⁰K]⁻¹ [0; I⁰Y].

So instead of solving equation 4.4, one solves

[b; β_sv] = [0, 1^⊤; 1, λI_{n_sv} + K_sv]⁻¹ [0; Y_sv].
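The offset system above can be checked directly: its first row forces Σ_i β_i = 0, and the remaining rows force b + λβ_i + [K_sv β]_i = y_i on the support vectors. A sketch (random SPD block, our own construction):

```python
import numpy as np

rng = np.random.default_rng(5)
nsv = 8
G = rng.normal(size=(nsv, nsv))
Ksv = G @ G.T + nsv * np.eye(nsv)      # SPD kernel block on the support vectors
Ysv = rng.choice([-1.0, 1.0], size=nsv)
lam = 0.2

# Assemble the bordered system [0, 1^T; 1, lam*I + Ksv]
M = np.zeros((nsv + 1, nsv + 1))
M[0, 1:] = 1.0
M[1:, 0] = 1.0
M[1:, 1:] = lam * np.eye(nsv) + Ksv
sol = np.linalg.solve(M, np.concatenate(([0.0], Ysv)))
b, beta = sol[0], sol[1:]

assert np.isclose(beta.sum(), 0.0, atol=1e-8)   # first row: 1^T beta = 0
f = Ksv @ beta + b                               # outputs on the support vectors
assert np.allclose(f, Ysv - lam * beta)          # rows 2..: b + (lam*I + Ksv) beta = Ysv
```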
Acknowledgments

We are grateful to Adam Kowalczyk and Sathiya Keerthi for helpful comments.
References

Bakır, G., Bottou, L., & Weston, J. (2005). Breaking SVM complexity with cross training. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 81–88). Cambridge, MA: MIT Press.
Bordes, A., Ertekin, S., Weston, J., & Bottou, L. (2005). Fast kernel classifiers with online and active learning (Tech. Rep.). Princeton, NJ: NEC Research Institute. Available online at http://leon.bottou.com/publications/pdf/huller3.pdf.
Boser, B., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (pp. 144–152). New York: ACM.
Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.
Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121–167.
Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131–159.
Chapelle, O., Weston, J., Bottou, L., & Vapnik, V. (2000). Vicinal risk minimization. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Collobert, R., Bengio, S., & Bengio, Y. (2002). A parallel mixture of SVMs for very large scale problems. Neural Computation, 14(5), 1105–1114.
Cristianini, N., & Shawe-Taylor, J. (2000). An introduction to support vector machines. Cambridge: Cambridge University Press.
de Freitas, N., Wang, Y., Mahdaviani, M., & Lang, D. (2005). Fast Krylov methods for N-body learning. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18 (pp. 251–258). Cambridge, MA: MIT Press.
Fasshauer, G. (2005). Meshfree methods. In M. Rieth & W. Schommers (Eds.), Handbook of theoretical and computational nanotechnology. Valencia, CA: American Scientific Publishers.
Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 243–264.
Golub, G., & Van Loan, C. (1996). Matrix computations (3rd ed.). Baltimore: Johns Hopkins University Press.
Grandvalet, Y., & Canu, S. (2002). Adaptive scaling for feature selection in SVMs. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Neural information processing systems, 15. Cambridge, MA: MIT Press.
Gray, A., & Moore, A. (2000). "N-body" problems in statistical learning. In T. K. Leen, T. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 521–527). Cambridge, MA: MIT Press.
Greengard, L., & Rokhlin, V. (1987). A fast algorithm for particle simulations. Journal of Computational Physics, 73, 325–348.
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods – Support vector learning. Cambridge, MA: MIT Press.
Keerthi, S., Chapelle, O., & DeCoste, D. (2006). Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 7, 1493–1515.
Keerthi, S. S., & DeCoste, D. M. (2005). A modified finite Newton method for fast solution of large scale linear SVMs. Journal of Machine Learning Research, 6, 341–361.
Kimeldorf, G. S., & Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41, 495–502.
Lang, D., Klaas, M., & de Freitas, N. (2005). Empirical testing of fast kernel density estimation algorithms (Tech. Rep. UBC TR-2005-03). Vancouver: University of British Columbia.
Lee, Y. J., & Mangasarian, O. L. (2001a). RSVM: Reduced support vector machines. In Proceedings of the SIAM International Conference on Data Mining. Philadelphia: SIAM.
Lee, Y. J., & Mangasarian, O. L. (2001b). SSVM: A smooth support vector machine for classification. Computational Optimization and Applications, 20(1), 5–22.
Mangasarian, O. L. (2002). A finite Newton method for classification. Optimization Methods and Software, 17, 913–929.
Ong, C. S. (2005). Kernels: Regularization and optimization. Unpublished doctoral dissertation, Australian National University.
Osuna, E., Freund, R., & Girosi, F. (1997). Support vector machines: Training and applications (Tech. Rep. AIM-1602). Cambridge, MA: MIT Artificial Intelligence Laboratory.
Platt, J. (1998). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods – Support vector learning. Cambridge, MA: MIT Press.
Schaback, R. (1995). Creating surfaces from scattered data using radial basis functions. In M. Daehlen, T. Lyche, & L. Schumaker (Eds.), Mathematical methods for curves and surfaces (pp. 477–496). Nashville, TN: Vanderbilt University Press.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Shen, Y., Ng, A., & Seeger, M. (2006). Fast gaussian process regression using KD-trees. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18. Cambridge, MA: MIT Press.
Shewchuk, J. R. (1994). An introduction to the conjugate gradient method without the agonizing pain (Tech. Rep. CMU-CS-94-125). Pittsburgh, PA: School of Computer Science, Carnegie Mellon University.
Steinwart, I. (2003). Sparseness of support vector machines. Journal of Machine Learning Research, 4, 1071–1105.
Tsang, I. W., Kwok, J. T., & Cheung, P. M. (2005). Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6, 363–392.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Yang, C., Duraiswami, R., & Davis, L. (2004). Efficient kernel machines using the improved fast Gauss transform. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in neural information processing systems, 16. Cambridge, MA: MIT Press.
Zhang, J., Jin, R., Yang, Y., & Hauptmann, A. G. (2003). Modified logistic regression: An approximation to SVM and its applications in large-scale text categorization.
1178
O. Chapelle
In Proceedings of the 20th International Conference on Machine Learning. Menlo Park, CA: AAAI Press. Zhu, J., & Hastie, T. (2005). Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14, 185–205.
Received April 11, 2006; accepted August 23, 2006.
LETTER
Communicated by Michael DeWeese
Dynamics of Nonlinear Feedback Control

H. P. Snippe
[email protected] Department of Neurobiophysics, University of Groningen, Groningen, The Netherlands
J. H. van Hateren
[email protected] Department of Neurobiophysics, University of Groningen, and Netherlands Institute for Neuroscience, KNAW-NIN, Amsterdam, The Netherlands
Feedback control in neural systems is ubiquitous. Here we study the mathematics of nonlinear feedback control. We compare models in which the input is multiplied by a dynamic gain (multiplicative control) with models in which the input is divided by a dynamic attenuation (divisive control). The gain signal (resp. the attenuation signal) is obtained through a concatenation of an instantaneous nonlinearity and a linear low-pass filter operating on the output of the feedback loop. For input steps, the dynamics of gain and attenuation can be very different, depending on the mathematical form of the nonlinearity and the ordering of the nonlinearity and the filtering in the feedback loop. Further, the dynamics of feedback control can be strongly asymmetrical for increment versus decrement steps of the input. Nevertheless, for each of the models studied, the nonlinearity in the feedback loop can be chosen such that immediately after an input step, the dynamics of feedback control is symmetric with respect to increments versus decrements. Finally, we study the dynamics of the output of the control loops and find conditions under which overshoots and undershoots of the output relative to the steady-state output occur when the models are stimulated with low-pass filtered steps. For small steps at the input, overshoots and undershoots of the output do not occur when the filtering in the control path is faster than the low-pass filtering at the input. For large steps at the input, however, results depend on the model, and for some of the models, multiple overshoots and undershoots can occur even with a fast control path.

Neural Computation 19, 1179–1214 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

In natural environments, sensory inputs typically have a large dynamic range. For instance, the luminance input to the visual system varies over multiple log units not only between day and night, but also on much shorter
timescales (van Hateren, 1997). Likewise, the range of auditory intensities is very large, with changes in intensity often occurring on short timescales (Escabí, Miller, Read, & Schreiner, 2003). This strong dynamic nature of the input poses a problem to sensory systems because both the sensory transducers and the subsequent neurons have a limited dynamic range. Hence, to fit the input into these processing units, some form of dynamic compression is required. In this letter, we study systems in which a compression of the dynamic range is attained through an operation of gain control. Thus, the models for gain control we study operate such that, at least in their steady state, they yield a reduction in the range of their output relative to their input range. We study two types of gain control. The first type is multiplicative gain control in which an input I (t) is multiplied by a dynamic gain g(t), which yields an output

O(t) = g(t)I (t).   (1.1)
When the gain g decreases for increasing values of the input I , the dynamic range of the output O is compressed relative to the dynamic range of the input I . The second type of control that we study is divisive gain control, in which an input I (t) is divided by an attenuation signal a (t), which yields an output

O(t) = I (t)/a (t).   (1.2)
When the attenuation a increases with increasing values of the input I , the dynamic range of O is compressed relative to the dynamic range of I . Obviously, multiplicative and divisive gain control are closely related. When the attenuation signal a (t) and the gain signal g(t) are related as a (t) = 1/g(t), the outputs O(t) of equations 1.1 and 1.2 are identical. However, as we show in section 2, for many control systems, even when a = 1/g in steady state, the dynamic behavior of a (t) and g(t) is not related through a simple inversion a (t) = 1/g(t). As a consequence, the outputs of multiplicative and divisive gain control can be qualitatively different. Gain control in neural systems is ubiquitous (Shapley & Enroth-Cugell, 1984; Schwartz & Simoncelli, 2001); thus, it is important to understand its dynamics. For instance, what is the speed of adaptation for systems with gain control? Can we expect asymmetries in adaptation after stimulation with increments and decrements of the input (DeWeese & Zador, 1998; Snippe, Poot, & van Hateren, 2004; Hosoya, Baccus, & Meister, 2005)? Do systems with gain control have monotonic outputs when stimulated with simple monotonic inputs, or should we expect overshoots, undershoots, or oscillations in their output? By focusing on simple models for gain control,
we can start to answer these questions. Since gain control occurs at many levels in the neural system (Abbott, Varela, Sen, & Nelson, 1997; Salinas & Thier, 2000), the inputs I (t) in equations 1.1 and 1.2 are not necessarily sensory inputs; they could also represent the internal input to a process of gain control anywhere within the neural system. The control signals g(t) in equation 1.1 and a (t) in equation 1.2 could be generated from either the input signal I (t) (feedforward control) or the output O(t) of the control loop (feedback control). In this letter, we concentrate on feedback control, since this appears to be the most common type of control in neural systems (Marmarelis, 1991; Torre, Ashmore, Lamb, & Menini, 1995; Calvert, Ho, Lefebvre, & Arshavsky, 1998; Crevier & Meister, 1998). A second reason to concentrate on feedback control is that the dynamics of a feedback loop is often less immediately obvious than the dynamics of a feedforward loop, especially when the control loop is nonlinear. Although the dynamics of linear, subtractive feedback control is well known (e.g., van de Vegte, 1990; Franklin, Powell & Emami-Naeini, 1994), less work has been done on the dynamics of multiplicative and divisive feedback control. Here we provide a mathematical analysis of such nonlinear feedback control. The letter is organized as follows. In section 2, we define four feedback models—two with multiplicative control and the other two with divisive control. The structure of these models is simple—about as simple as possible for nonlinear feedback. This has been done on purpose, since it allows a mathematical analysis of the dynamics of these models rather than relying on numerical simulations. The results of this analysis should help to understand the behavior of more involved models that are too complex for an exact mathematical analysis. In section 3, we study the stability of the feedback models for a variety of inputs. 
In section 4, we study the dynamics of the control signals when the models are stimulated with step inputs, and in section 5, we show that these control signals can have strong asymmetries with respect to increment versus decrement steps in the input. In section 6, we focus on the output O(t) of the models rather than on the control signals g(t) and a (t). In particular, we investigate under what circumstances overshoots or oscillations in the output can occur when the models are stimulated with simple, monotonic inputs. Finally, in section 7 we restate the most important conclusions. To avoid disturbing the flow of the arguments, two appendixes contain longer mathematical derivations of the results.

2 Models

We analyze the dynamics of the four models for feedback control shown in Figure 1. Two of the models (MNL and MLN ; see Figure 1, top row) have a multiplicative gain control; the other two models (DNL and DLN ; see Figure 1, bottom row) have a divisive gain control. For each of the models,
Figure 1: Models for nonlinear feedback control. Models (A) MNL and (B) MLN have multiplicative control in which the output O(t) equals the input I (t) multiplied by a gain signal g(t). Models (C) DNL and (D) DLN have divisive control in which the output O(t) equals the input I (t) divided by an attenuation signal a (t). The feedback paths consist of a concatenation of a first-order low-pass filtering with time constant τ and an instantaneous nonlinearity ( f and h). Note that in models MNL and DNL , the nonlinearity precedes the low-pass filtering in the feedback path, whereas in models MLN and DLN , the low-pass filtering precedes the nonlinearity.
the control loop consists of a concatenation of an instantaneous nonlinearity (functions f and h) and a linear low-pass filter L Pτ of first order with time constant τ . For the two models on the left in Figure 1 (MNL and DNL ), the control loop consists of a nonlinearity followed by a low-pass filter L Pτ , whereas for the two models on the right in Figure 1 (MLN and DLN ), the control loop consists of a low-pass filter L Pτ , followed by a nonlinearity. Models of the general structure shown in Figure 1 have been shown to explain various aspects of processing in the visual system. For instance, Crevier and Meister (1998) analyzed a multiplicative feedback model of the type of Figure 1B to account for period doubling in the electroretinogram response to periodic flashes. A spatiotemporal extension of this model (Berry, Brivanlou, Jordan, & Meister, 1999) explained retinal responses of salamanders and rabbits to flashed and moving bars (see also Wilke et al., 2001). Snippe et al. (2004) showed that divisive feedback models of a type similar to Figure 1C can explain the asymmetric dynamics of gain control after increments and decrements of contrast. Divisive feedback has also been used to model the dynamics of adaptation in the auditory system (Dau, Püschel, & Kohlrausch, 1996). For simplicity, in the analysis of the models in
Figure 1, we assume that the inputs I (t) are nonnegative: I (t) ≥ 0. This is reasonable, for example, for models of light adaptation in which the input signal (luminance) is nonnegative. To describe contrast-gain control (e.g., Berry et al., 1999), the input signal I (t) = C(t)s(t) can be written as the product of a carrier signal s(t) and a contrast envelope C(t), with C(t) ≥ 0. Because the carrier signal s(t) can be negative, the input I (t) does not satisfy I (t) ≥ 0 in this case. However, for contrast-gain control, the effects of the carrier s(t) are expected to be largely demodulated in the control loop (Snippe, Poot, & van Hateren, 2000). Thus, the dynamics of the control loop is governed mainly by the dynamics of the contrast envelope C(t), which does satisfy nonnegativity: C(t) ≥ 0. Hence, also for contrast-gain control, the assumption that the input is nonnegative is not a severe restriction. Due to the first-order filtering in the models of Figure 1, the dynamics of these models are described by first-order differential equations. (See van Hateren, 2005, for a discussion on low-pass filtering in biological systems, and examples of biochemical processes that can be described as low-pass filters.) In general, the output y(t) of a first-order low-pass filter with time constant τ is related to the input x(t) to the filter through a first-order differential equation τ dy/dt = x − y (Oppenheim, Willsky, & Young, 1983). Thus, the dynamics of the gain signal g(t) in model MNL (see Figure 1A) is described by the first-order differential equation,
τ dg/dt = f(O) − g = f(gI) − g,   (2.1)
since the input to the low-pass filter in the control loop of model MNL is f (O), and the argument O of the nonlinearity f equals g I , due to the multiplicative nature of the gain control of model MNL . Note that due to this multiplicative operation, equation 2.1 is a nonlinear differential equation even when the function f is linear. The dynamics of the gain signal g(t) for a given input I (t) is obtained by solving equation 2.1 (either analytically or numerically). From the solution g(t) for the gain, the output O(t) can be obtained through O(t) = g(t)I (t). When the input I is constant, I = Is , the system attains its steady state, g = gs and O = Os . The input-output relation for the steady state can be obtained by setting the time derivative dg/dt in equation 2.1 equal to zero, yielding gs = f (Os ). With Os = gs Is , hence gs = Os /Is , this yields the steady-state relation Os /Is = f (Os ), hence Is = Os / f (Os ). To obtain a system in which the dynamic range of Os is reduced relative to the dynamic range of Is , we require that the steady-state gain gs = f (Os ) is a decreasing function of its argument, that is, the function f (x) must satisfy d f (x)/d x < 0. Note that this requirement is sufficient to guarantee a monotonic steady-state relation Is = Os / f (Os ): with increasing Os , f (Os ) decreases, hence Is increases.
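This self-consistency of the steady state is easy to check numerically. The sketch below is our own illustration (not code from the letter): it forward-Euler-integrates equation 2.1 for a constant input and verifies that the gain settles on the solution of gs = f(gs Is). The linear nonlinearity f(x) = 1 − x used here is the example that reappears in section 2.1:

```python
def simulate_mnl(I, f, tau=1.0, g0=1.0, dt=1e-3, T=20.0):
    """Forward-Euler integration of tau dg/dt = f(g*I) - g (model MNL,
    equation 2.1) for a constant input I, starting from gain g0."""
    g = g0
    for _ in range(int(T / dt)):
        g += dt / tau * (f(g * I) - g)
    return g

# With f(x) = 1 - x the steady state gs = f(gs*Is) gives gs = 1/(1 + Is),
# so the output Os = gs*Is follows the Naka-Rushton relation Is/(1 + Is).
f = lambda x: 1.0 - x
for Is in (0.25, 1.0, 4.0):
    gs = simulate_mnl(Is, f)
    assert abs(gs - 1.0 / (1.0 + Is)) < 1e-6
    assert abs(gs * Is - Is / (1.0 + Is)) < 1e-6
```

Note that the Euler iteration has the exact steady state as its fixed point, so the final error is set only by the integration time T, not by the step size dt.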
For model DNL (see Figure 1C), the dynamics of the attenuation signal a (t) is described by the differential equation

τ da/dt = h(O) − a = h(I/a) − a.   (2.2)
From the solution a (t) of equation 2.2 for a given input I (t), the output O(t) of model DNL is obtained as O(t) = I (t)/a (t). In steady state (da /dt = 0), equation 2.2 yields a s = h(Os ). With Os = Is /a s , hence a s = Is /Os , this gives the steady-state relation Is = Os h(Os ). To obtain a reduction of the dynamic range of Os relative to the dynamic range of Is , we require that the steady-state attenuation a s = h(Os ) is an increasing function of its argument, that is, dh(x)/dx > 0. This guarantees a monotonic steady-state relation Is = Os h(Os ). Note that this steady-state relation becomes identical to that of the multiplicative model MNL when the nonlinearities f (O) of model MNL and h(O) of model DNL are related as h(O) = 1/ f (O). A differential equation that describes the dynamics of model MLN (see Figure 1B) can be obtained for the output γ (t) of the low-pass filter L Pτ in the feedback loop,

τ dγ/dt = O − γ = gI − γ = f(γ)I − γ,   (2.3)
because the gain g is related to γ as g = f (γ ). From the solution γ (t) of equation 2.3 for a given input I (t), one obtains the dynamic gain g(t) as g(t) = f (γ (t)), and thus the output O(t) = g(t)I (t). In steady state (dγ /dt = 0), equation 2.3 yields γs = Os = f (Os )Is , hence Is = Os / f (Os ), which is identical to the steady-state behavior of model MNL . As in model MNL , for model MLN we require that f be a decreasing function of its argument. Finally, the dynamics of model DLN (see Figure 1D) follows from the differential equation for the output α(t) of the low-pass filter in model DLN :

τ dα/dt = O − α = I/a − α = I/h(α) − α.   (2.4)
From the solution α(t) of this differential equation, one obtains the attenuation a (t) = h(α(t)), and thus the output O(t) = I (t)/a (t). In steady state (dα/dt = 0), equation 2.4 produces the steady-state input-output relation Is = Os h(Os ), which is identical to the steady-state equation obtained for model DNL . As in model DNL , we require that h be an increasing function of its argument.
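The agreement of the steady states can be checked directly. The sketch below is our own illustration; the pair f(x) = 1/(1 + x) and h(x) = 1 + x = 1/f(x) is just an example choice. It integrates the three distinct dynamics of equations 2.1, 2.3, and 2.4 (plus equation 2.2) to steady state and confirms that every output solves the same relation Is = Os/f(Os) = Os h(Os):

```python
# Example choice: f(x) = 1/(1+x), hence h(x) = 1/f(x) = 1+x, so the shared
# steady-state relation Is = Os*(1+Os) gives Os = 1 when Is = 2.
f = lambda x: 1.0 / (1.0 + x)
h = lambda x: 1.0 + x
tau, dt, Is = 1.0, 1e-3, 2.0

g, a, gam, al = 1.0, 1.0, 1.0, 1.0
for _ in range(int(40.0 / dt)):
    g   += dt / tau * (f(g * Is) - g)            # model MNL, eq. 2.1
    a   += dt / tau * (h(Is / a) - a)            # model DNL, eq. 2.2
    gam += dt / tau * (f(gam) * Is - gam)        # model MLN, eq. 2.3
    al  += dt / tau * (Is / h(al) - al)          # model DLN, eq. 2.4

O_mnl, O_dnl, O_mln, O_dln = g * Is, Is / a, f(gam) * Is, Is / h(al)
for O in (O_mnl, O_dnl, O_mln, O_dln):
    assert abs(O - 1.0) < 1e-6                   # all four agree in steady state
```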
2.1 The Nonlinearities f (x) and h(x) Need Not Be Positive for All x. Although for simplicity we assumed that the input I is nonnegative, I ≥ 0, a similar assumption for the nonlinearities f and h in the feedback loops of the models in Figure 1 would be too restrictive, since this would leave out viable and interesting models for gain control. Thus, we allow the functions f (x) in multiplicative control and h(x) in divisive control to be negative for certain ranges of their argument x. Since f (x) is assumed to be decreasing, this means that we allow functions f (x) that are positive in a range 0 < x < x0 , become zero at x = x0 , and are negative for x > x0 , with d f /d x < 0 over the entire range 0 < x < ∞. An example of such a function f would be a linearly decreasing function f (x) = k(x0 − x), with k and x0 positive constants. For a linear function f (x), the ordering of f and the lowpass filtering in Figure 1 is arbitrary, and models MNL and MLN become identical. Solving the steady-state relation Is = Os / f (Os ) with respect to Os for this f yields Os = x0 k Is /(1 + k Is ), a Naka-Rushton relationship (Heeger, 1992). Note that x0 represents the largest steady-state output Os that can be obtained in this case and that f (Os ) ≥ 0 for all steady inputs Is . In a dynamic situation, however, after sudden increases in the input I (t), the output O(t) can become larger than x0 , and f (O) is temporarily negative. This negative value of the function f helps to quickly reduce the gain g (through equation 2.1), which reduces the output O back to values below x0 . A comparison of the dynamics of adaptation of such a multiplicative feedback system (with k = x0 = 1) after increments and decrements of the input is shown in Figure 2A. Note that despite the linearity of f , the dynamics is asymmetric for increments versus decrements. 
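The transient excursion of the output above x0 is easy to reproduce. The sketch below is our own illustration with the parameter choices k = x0 = 1 and the increment step of Figure 2A; immediately after the step the output jumps to g·I = 3.2 > x0 and then relaxes back below x0:

```python
# Model MNL with the linear feedback nonlinearity f(x) = k*(x0 - x); with
# k = x0 = 1 this is the example of Figure 2A.  From Is = Os/f(Os), the
# steady-state output Os = x0*k*Is/(1 + k*Is) is always below x0.
k, x0, tau, dt = 1.0, 1.0, 1.0, 1e-3
f = lambda x: k * (x0 - x)

g = 0.8                        # fully adapted to Is = 1/4: Os = 0.2, gs = f(Os) = 0.8
I = 4.0                        # increment step 1/4 -> 4 at t = 0
O_trace = []
for _ in range(int(20.0 / dt)):
    O_trace.append(g * I)
    g += dt / tau * (f(g * I) - g)

assert O_trace[0] > x0                                       # transient: O(0+) = 3.2
assert abs(O_trace[-1] - x0 * k * I / (1 + k * I)) < 1e-6    # settles at 0.8 < x0
assert max(O_trace) == O_trace[0]                            # then decays monotonically
```

While O > x0, f(O) is negative, which is exactly what pulls the gain down quickly in equation 2.1.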
Allowing negative values for f (x) in multiplicative control can thus be beneficial in that it speeds up adaptation after increases in I (t). Nevertheless, in comparing multiplicative and divisive control, some care is required when we allow f (x) in multiplicative control to be zero for x = x0 and negative for x > x0 . This means that the corresponding h(x) = 1/ f (x) in divisive control is monotonically increasing up to x = x0 , where it becomes infinite. For x > x0 , h(x) becomes negative, and thus h(x) is nonmonotonic in the range 0 < x < ∞. For model DLN (see Figure 1D), this is actually not a problem, since for this model, x < x0 also in dynamic situations. An example of adaptation for such a model DLN (again with a Naka-Rushton steady state) is shown in Figure 2B. On the other hand, for model DNL (see Figure 1C), the discontinuity in the function h(x) could become a problem for strong increments at the input. The solution to this problem is to assume that the output O(t) for a divisive control system with such an h(x) in the feedback path remains smaller than x0 also in dynamic situations, so h(x) remains positive at all t. It can be shown that this is indeed what happens when the input I (t) to the system is a continuous function of time, as will always be the case in practice. An example of adaptation in such a model DNL (again with a Naka-Rushton steady state) is shown in Figure 2C. Note the strong asymmetry of the dynamics of this model for increments versus decrements of the input.
Figure 2: (A) Dynamics of the gain g(t) in a multiplicative feedback control (with time constant τ = 1) with a linear function f (x) = 1 − x in the feedback path, which yields a Naka-Rushton relation Os = Is /(1 + Is ) for the steady state. The solid line is g(t) in response to an increment step of I (from 1/4 to 4). The dashed line is g(t) after a decrement step of I (from 4 to 1/4). For linear f (x), models MNL and MLN are identical. (B) Dynamics of the attenuation a (t) of model DLN (see Figure 1D; τ = 1) when h(α) = 1/(1 − α), which yields a Naka-Rushton steady state Os = Is /(1 + Is ). Input steps are identical to A. (C) Dynamics of a (t) of model DNL (see Figure 1C; τ = 1) when h(O) = 1/(1 − O). Steady state of this model is identical to A and B, but the dynamics is very different. (D) Dynamics of a (t) of model DNL (see Figure 1C; τ = 1) when h(O) = 1 − (1/O), which yields Os = 1 + Is in steady state.
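The increment/decrement asymmetry visible in Figure 2A has a simple origin: with f(x) = 1 − x, equation 2.1 becomes τ dg/dt = 1 − (1 + I)g, so the effective relaxation time is τ/(1 + I) and depends on the new input level. A minimal sketch (our own illustration) confirms that adaptation is faster after the increment:

```python
# Model MNL with f(x) = 1 - x and tau = 1, as in Figure 2A.
tau, dt = 1.0, 1e-4

def half_time(I, g0):
    """Time for the gain to cover half the distance to its new steady state."""
    gs = 1.0 / (1.0 + I)                    # steady state: gs = f(gs*I)
    g, t = g0, 0.0
    while abs(g - gs) > 0.5 * abs(g0 - gs):
        g += dt / tau * ((1.0 - g * I) - g)  # tau dg/dt = f(g*I) - g
        t += dt
    return t

t_inc = half_time(4.0, 0.8)    # increment step 1/4 -> 4 (adapted gain 0.8)
t_dec = half_time(0.25, 0.2)   # decrement step 4 -> 1/4 (adapted gain 0.2)
assert t_inc < t_dec           # adaptation is faster after the increment
```

Analytically the half-times are ln(2)·τ/(1 + I), about 0.14 for the increment versus 0.55 for the decrement, matching the simulation.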
Likewise, in divisive control we allow increasing functions h (x) that are negative for x < x0 and positive for x > x0 . For such systems, steady-state output Os is never smaller than x0 ; hence, h(Os ) ≥ 0. When Is = 0, Os = x0 . Thus, for such systems, x0 can be thought of as a spontaneous activity. After large decrements of I (t), the output O(t) can temporarily be smaller than this spontaneous activity x0 of the system; hence, h(O) < 0, which speeds up adaptation according to equation 2.2. An example is shown in Figure 2D. For the corresponding multiplicative control, with f (x) = 1/h(x), f (O) becomes very large when the output O approaches the spontaneous
activity x0 , and for continuous inputs I (t), this effect is sufficiently strong to keep O(t) > x0 at all times t.

2.2 Identity of Models MLN and DLN . Equations 2.3 and 2.4 yield identical dynamics when the nonlinearities f in equation 2.3 and h in equation 2.4 are related as h = 1/ f . This simply says that multiplying an input I with a gain g to obtain an output O = g I is identical to dividing I by the inverse gain (attenuation) 1/g: O = I /[1/g]. Thus, when the nonlinear functions f and h are related as h = 1/ f , model MLN and model DLN in Figure 1 yield identical responses if they are stimulated with the same inputs. Figures 2A and 2B provide an example: the dynamic attenuation a (t) of model DLN in Figure 2B equals the inverse of the dynamic gain g(t) of model MLN in Figure 2A, that is, a (t) = 1/g(t).

2.3 Duality of Models MNL and DNL . Likewise, models MNL and DNL are closely related, but in this case, the relationship amounts to a duality rather than an identity. For the duality, the nonlinearities f in equation 2.1 and h in equation 2.2 have to be related through an inversion of their argument: h(x) = f (1/x). In that case, when model DNL is stimulated with an input I_D (t) = 1/I_M (t) that is the inverse of the input I_M (t) to model MNL , the dynamics 2.2 for the attenuation a (t) becomes

τ da/dt = h(I_D/a) − a = f(a/I_D) − a = f(I_M a) − a,   (2.5)
which is identical to the dynamics of g(t) in equation 2.1. Thus, under these conditions, the dynamic attenuation a (t) of model DNL equals the dynamic gain g(t) of model MNL : a (t) = g(t). Then the output O_D (t) of model DNL is related to the output O_M (t) of model MNL through an inversion: O_D = I_D/a = 1/[I_M g] = 1/O_M . Figures 2A and 2D provide an example of this duality of models DNL and MNL . The attenuation a (t) in Figure 2D after an increment step of the input I (from 1/4 to 4) equals the gain g(t) in Figure 2A after a decrement step of I (from 4 to 1/4). Likewise, a (t) in Figure 2D after a decrement step of I equals g(t) after an increment step in Figure 2A. Thus, when the nonlinearities f and h are related through an inversion of their argument, h(x) = f (1/x), the models MNL and DNL are dual in the following sense: when model MNL transforms an input I (t) into an output O(t), model DNL will transform the input 1/I (t) into the output 1/O(t). Of course, the duality also works in the other direction, from model DNL to model MNL . Apart from the theoretical interest of this duality between multiplicative and divisive control, the duality can be useful in applications. When the mathematics or the statistics of 1/I (t) are simpler than those of the input I (t) itself, it makes sense to do the calculations in the dual model. Stimulating the dual model with an input 1/I (t) and inverting the resulting output of the dual model yields the output O(t) of the original model.
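The duality can also be verified numerically. In the sketch below (our own illustration; the pair f(x) = 1/x², so that h(x) = f(1/x) = x², is just an example choice) model MNL is driven with an arbitrary positive input and model DNL with its inverse, and the attenuation tracks the gain, a(t) = g(t), at every integration step:

```python
import numpy as np

# Duality check: model MNL with f, model DNL with h(x) = f(1/x), driven by
# the inverted input I_D(t) = 1/I_M(t); then a(t) = g(t) at all times.
f = lambda x: 1.0 / x**2          # decreasing, as required for model MNL
h = lambda x: f(1.0 / x)          # = x**2, increasing, as required for DNL

tau, dt, n = 3.0, 1e-3, 10000
t = np.arange(n) * dt
I_M = 1.0 + 0.5 * np.sin(t)       # an arbitrary positive input
I_D = 1.0 / I_M

g, a = 2.0, 2.0                   # matched initial conditions
for i in range(n):
    g += dt / tau * (f(g * I_M[i]) - g)     # eq. 2.1 for model MNL
    a += dt / tau * (h(I_D[i] / a) - a)     # eq. 2.2 for model DNL

assert abs(a - g) < 1e-9          # identical dynamics (up to rounding)
# The outputs are then each other's inverse: O_D = I_D/a = 1/(I_M*g) = 1/O_M.
```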
3 Stability

As is well known, feedback can lead to instabilities (Bechhoefer, 2005). Here we show that this is not the case for the feedback systems studied in this letter. The basic reason for this is that the low-pass filters in the feedback loops of the schemes in Figure 1 are first order, that is, they are described by the first-order differential equations 2.1 to 2.4. A first-order low-pass filter has an exponential pulse response that starts to operate as soon as a disturbance occurs, that is, without the delay that could otherwise yield instabilities (Bechhoefer, 2005).

3.1 The Models Are Stable for Constant Input. To see more formally that the feedback schemes studied in this letter are stable, first assume that the input signal I (t) is constant, I (t) = Is . For model MNL , the corresponding steady-state gain gs is obtained by setting the time derivative dg/dt in equation 2.1 to zero; hence, f (gs Is ) − gs = 0. Now assume that at some moment, the actual gain g is larger than the steady-state gain gs , g > gs . Since the function f is a decreasing function of its argument, we have f (g Is ) < f (gs Is ), and therefore f (g Is ) − g < f (gs Is ) − gs = 0. Hence, it follows from equation 2.1 that dg/dt < 0. Thus, g will decrease from its initial value (which is larger than gs ) toward the steady-state value gs . Likewise, when g is initially smaller than gs , g will increase toward the steady-state value gs . Hence, for a constant input I , any deviations from steady state will always decrease over time; that is, for constant input I , the model MNL is stable. Similar reasoning shows that for constant input, the other models in Figure 1 are also stable.

3.2 The Models Are Stable for Arbitrary, Bounded Input. Now assume that the input I (t) is not necessarily constant, but that it is bounded, say, between a lower bound I− and an upper bound I+ , that is, at all times t: I− ≤ I (t) ≤ I+ .
For such bounded inputs, the systems studied here cannot become unstable; their control signals (and hence their outputs) will remain finite. For instance, consider model DNL , in which the dynamics of the control (attenuation) signal a is described by equation 2.2. Now take any moment t0 in time. At that moment, the input to the system is I0 = I (t0 ), to which corresponds a steady-state value a s for which (from equation 2.2) h(I0 /a s ) − a s = 0. Assume that at the moment t0 , the actual control signal a (t0 ) is larger than this steady-state value a s . Then, since the function h(x) is an increasing function of its argument x, and hence h(I /a ) decreases with increasing a , we can be sure that when a > a s , h(I0 /a ) − a < h(I0 /a s ) − a s = 0. Thus, from equation 2.2, da /dt < 0, that is, an attenuation signal that is too large (relative to the instantaneous steady state) is guaranteed to decrease. Similarly, if at any moment the attenuation signal a is too small (relative to the instantaneous steady state), it will increase. In particular, if at some initial time t0 the attenuation signal a would be larger than the
Figure 3: Stability of feedback control. Different initial values for the control signal (dashed lines) converge to identical solutions. Shown is an example of model DNL , with h(O) = O^2 and τ = 3. Input I (t) = Ī(1 + sin ωt), with Ī = ω = 1. For constant input I (t) = Is , steady-state attenuation a_s = Is^(2/3) in this model. The continuous line is a plot of the instantaneous steady state a_s(t) ≡ (I (t))^(2/3), which can be compared with the dynamics of the actual attenuation a (t). Note that da/dt < 0 when a is above the instantaneous steady state, da/dt > 0 when a is below steady state, and da/dt = 0 whenever a (t) crosses steady state.
steady-state value a s+ corresponding to the upper bound I+ , a (t) is guaranteed to decrease monotonically until it becomes smaller than a s+ . Likewise, if at time t0 the attenuation signal a is smaller than the steady-state value a s− corresponding to the lower bound I− , a (t) will increase monotonically until it becomes larger than a s− . And if initially a s− < a (t0 ) < a s+ , then a s− < a (t) < a s+ for all times t > t0 . Figure 3 illustrates this behavior for a specific choice of the nonlinearity, h(x) = x 2 . 3.3 Monotonic Input Yields Monotonic Control Signals. The reasoning we used above to demonstrate the stability of the feedback schemes of Figure 1 can also be used to derive general qualitative properties of the control signals in these schemes. For instance, assume that the input I (t) is a monotonically increasing function of time for times t > t0 and that at time t0 , the attenuation signal a (t0 ) of model DNL is not larger than the steady-state level a s corresponding to the input I (t0 ), a (t0 ) ≤ a s . Then we can be sure that the divisive control signal a (t) is also monotonically increasing, without any overshoots or oscillations. If a (t) would be nonmonotonic, there would be times t1 > t0 for which da /dt < 0. According to equation 2.2, this means that at that time t1 , the attenuation signal has to be larger than the steady-state corresponding to the input I (t1 ) at time t1 . But this means that a (t) must have crossed its steady state at some moment t between t0 and t1 . But at that hypothetical moment, according to equation 2.2, da /dt = 0. Thus, a crossing of a (t) with its instantaneous steady-state value a s (t) would be possible
Figure 4: Attenuation a (t) (dashed line) of the model of Figure 3, when I (t) is monotonic (a series of steps in this example). The attenuation cannot overshoot its steady-state solution a_s = I^(2/3) (continuous line), since da/dt becomes 0 when a approaches steady state. Hence, when a (t) is initially below this line, it cannot escape from the region below the continuous line. Since da/dt > 0 when a < a_s , a monotonic input I (t) yields a monotonic control signal a (t).
only if dI/dt < 0 at that moment (as can be observed in Figure 3), contrary to the assumption of a monotonically increasing I (t). Figure 4 illustrates the behavior of a (t) for a specific choice of I (t) and the nonlinearity h, but the conclusion is general. For the systems in Figure 1, monotonic inputs I (t) yield control signals a (t), respectively g(t), that are monotonic; they do not overshoot or oscillate.

4 Speed of Adaptation After Steps in the Input

Here we look at the speed of adaptation of the models in Figure 1 immediately after a step in the input I (t). Steps in the input occur in natural stimuli (e.g., Griffin, Lillholm, & Nielsen, 2004) and are often used to study the dynamics of adaptation (e.g., Crawford, 1947; Thorson & Biederman-Thorson, 1974; Snippe et al., 2000; Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001; Rieke, 2001). Without loss of generality, the moment of the step in the input is chosen as t = 0. For an increment step, the input at t = 0 switches from a value I− to a higher value I+ > I− . We assume that for t < 0, the input has been at I− for a sufficiently long time such that at time t = 0, the system is in the steady state corresponding to input I− , that is, the system is fully adapted to I− . When the input step at t = 0 occurs, the system starts to adapt to the new input I+ . Likewise, for a decrement step I+ → I− at time t = 0, we assume that at t = 0, the system is fully adapted to the input I+ . We compare the initial speed of adaptation for the models in Figure 1 immediately after increment steps I− → I+ and decrement steps I+ → I− . Speed of adaptation is quantified using equations 2.1 to 2.4, which
give the speed with which the control loops in Figure 1 adjust after the step in the input.

4.1 Comparison of Multiplicative and Divisive Control

4.1.1 Models MLN and DLN Have Identical Dynamics. As noted in section 2.2, the multiplicative model MLN in Figure 1B and the divisive model DLN in Figure 1D become identical when the nonlinearities f and h in the control paths are related as h = 1/ f . When f and h are related in this way, the attenuation signal a (t) in model DLN is identical to the inverse 1/g(t) of the gain signal g(t) in model MLN . Thus, the dynamics of a (t) and 1/g(t) are identical: da/dt = d[1/g]/dt, which means that the speed of adaptation in models MLN and DLN is identical.

4.1.2 Large Differences in Dynamics for Models MNL and DNL . We find large differences when we compare adaptation in models MNL (see Figure 1A) and DNL (see Figure 1C) in which the nonlinearities f , respectively h, precede the low-pass filtering in the feedback loop. It is still the case that we can equate the steady-state behavior of models MNL and DNL by choosing h = 1/ f , which yields a_s = 1/g_s . However, even in this case, a (t) and 1/g(t) after a step in the input I (t) are different. For model DNL , the dynamics of a (t) is given by equation 2.2. For model MNL , the dynamics of g(t) is given by equation 2.1, which yields for the inverse gain 1/g(t) immediately after the step in I (t):

τ d[1/g]/dt = −(τ/g_s^2) dg/dt = −f(g_s I)/g_s^2 + 1/g_s = a_s − a_s^2/h(I/a_s),   (4.1)
in which we used the relation f = 1/h, and the fact that immediately after the step in I(t), the gain g and the attenuation a still have their steady-state values, and hence are related as g_s = 1/a_s. Comparing result 4.1 with the result τ da/dt = h(I/a_s) − a_s of equation 2.2 immediately after a step in the input I, we find

τ da/dt − τ d[1/g]/dt = h(I/a_s) − 2a_s + a_s²/h(I/a_s) = [h(I/a_s) − a_s]² / h(I/a_s).   (4.2)
After an increment step in I(t) (when da/dt, d[1/g]/dt, and h(I/a_s) are all positive), it follows from equation 4.2 that da/dt > d[1/g]/dt. Thus, immediately after an increment step of the input, adaptation in the divisive feedback model DNL is faster than adaptation in the multiplicative feedback model MNL. In fact, this is true not only immediately after the increment step, but also at later times t, in the sense that for all t > 0, a(t) > 1/g(t).
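This inequality can be checked numerically. The following is a minimal Euler sketch, not the authors' code, using the parameters of Figure 5 (h(O) = 1/f(O) = O², τ = 3, increment step from I = 1 to I = 8); all variable names are ours:

```python
# Compare the attenuation a(t) of model DNL with the inverse gain 1/g(t)
# of the matched model MNL after an increment step in the input.
# Assumed dynamics (quoted in the text): model DNL obeys equation 2.2,
# tau*da/dt = h(I/a) - a; model MNL obeys equation 2.1, tau*dg/dt = f(g*I) - g.

TAU = 3.0
I_MINUS, I_PLUS = 1.0, 8.0
DT, T_END = 1e-3, 6.0

def h(O):            # nonlinearity of the divisive loop
    return O ** 2

def f(O):            # matched multiplicative nonlinearity, f = 1/h
    return 1.0 / h(O)

# Both models start fully adapted to I_MINUS: a_s = 1/g_s = I**(2/3) = 1.
a = I_MINUS ** (2.0 / 3.0)
g = 1.0 / a

a_trace, inv_g_trace = [], []
t = 0.0
while t <= T_END:
    a_trace.append(a)
    inv_g_trace.append(1.0 / g)
    a += DT * (h(I_PLUS / a) - a) / TAU       # model DNL (equation 2.2)
    g += DT * (f(g * I_PLUS) - g) / TAU       # model MNL (equation 2.1)
    t += DT

a_final = I_PLUS ** (2.0 / 3.0)   # common steady state, a_s = 1/g_s = 4
```

In this run a(t) approaches the common steady state a_s = 4 much earlier than 1/g(t), and a(t) > 1/g(t) holds at every sampled time t > 0, consistent with the proof given below.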
Figure 5: Dynamics of adaptation for the divisive model DNL (dashed line) and the multiplicative model MNL (dotted line), when h(O) = 1/f(O) = O² and τ = 3. Input I(t) is an increment step from 1 to 8 at time t = 0 (A) and a decrement step from 8 to 1 (B). Both models have identical steady state, a_s = 1/g_s = I^(2/3) (continuous line), but strongly different dynamics.
Thus, after an increment step, the multiplicative gain control never catches up with the divisive gain control. The proof of this assertion is easy: suppose that at some time t = t*, the multiplicative control had just caught up with the divisive control, that is, suppose that at t = t*, a(t*) = 1/g(t*), with a(t) > 1/g(t) for 0 < t < t*. Then it follows from equation 4.2 evaluated at t = t* that da/dt > d[1/g]/dt immediately before t = t*; thus, a < 1/g immediately before t = t*. This contradicts the original assertion that a(t) > 1/g(t) for 0 < t < t*; hence a(t) > 1/g(t) for all t > 0. Thus, after an increment step in I(t), adaptation in the divisive control system DNL is consistently faster than adaptation in the multiplicative control system MNL. On the other hand, after a decrement step in I(t) (when both da/dt and d[1/g]/dt are negative), equation 4.2 yields |d[1/g]/dt| > |da/dt|, assuming that h(I/a_s) > 0 after the step. In fact, the assumption h(I/a_s) > 0 is not really needed in this case, since we argued in section 2.1 that when h(x) becomes negative in a divisive control, the gain g in the corresponding multiplicative control with f(x) = 1/h(x) changes very rapidly in order to keep the output O(t) in its required range; hence, |d[1/g]/dt| > |da/dt| also when h(I/a_s) < 0. Thus, after a decrement step, adaptation in the multiplicative model MNL is faster than adaptation in the divisive model DNL. An example is given in Figure 5 using a quadratic nonlinearity h(O) = O² for the divisive control, and hence f(O) = 1/h(O) = 1/O² for the multiplicative control, with identical time constant τ of the low-pass filtering in the control loops. The example shows that the differences in speed of adaptation between multiplicative and divisive control can be large.

4.2 The Order of the Nonlinearity and the Low-Pass Filtering in the Feedback Can Have Strong Effects

In the previous section, we showed
that the dynamics of divisive and multiplicative control can be very different, even under conditions where the steady-state behaviors of these systems are identical. Here we show that also for a specific choice of control, either divisive or multiplicative, the order of the two operations, nonlinearity and low-pass filtering, in the feedback loops of Figure 1 can have a strong effect on the dynamics of the control system. First, we look at divisive gain control, and compare model DNL (see Figure 1C), in which the nonlinearity h precedes the low-pass filtering in the feedback loop, with model DLN (see Figure 1D), in which the nonlinearity h occurs after the low-pass filtering. If the function h(x) is linear, h(x) = c + kx, models DNL and DLN are identical, since the order of two linear operations (here, low-pass filtering and h(x) = c + kx) has no effect on the outcome. Thus, for linear h(x), models DNL and DLN have identical dynamics. This, however, is not the case when h(x) is nonlinear. As will be argued below, in that case, the speed of adaptation of model DNL relative to model DLN is determined by the curvature of h(x), that is, by its second-order derivative d²h(x)/dx². If h(x) is an expansive function of x (i.e., if it has a positive second-order derivative everywhere), adaptation after an increment step I− → I+ is faster in model DNL than in model DLN. For model DNL, speed of adaptation is governed by equation 2.2; immediately after the increment step,

τ da/dt = h(I+/a−) − a− = h(I+/a−) − h(I−/a−)   (Model DNL),   (4.3)
where a− is the steady-state attenuation corresponding to input I−, and we have used the fact that in steady state, a− = h(I−/a−). On the other hand, for model DLN, adaptation is governed by equation 2.4, which for the attenuation signal a = h(α) yields

τ da/dt = dh(x)/dx |x=I−/a− · τ dα/dt = dh(x)/dx |x=I−/a− · (I+/a− − α−)
        = dh(x)/dx |x=I−/a− · (I+/a− − I−/a−)   (Model DLN),   (4.4)
where in the last step, we used the fact that in steady state, α− = I−/a−. Because of the factor dh/dx in equation 4.4, adaptation in model DLN is governed by a linear extrapolation of the function h(x). As illustrated in Figure 6, for an expansive function h(x), that is, when d²h(x)/dx² > 0, the outcome of equation 4.3 is larger than the outcome of equation 4.4. Thus, immediately after an increment step, adaptation in model DNL is faster than in model DLN. The intuition for this difference in behavior of the two models is that in model DNL, the overshoot in the output O(t) immediately after the increment is accentuated by an expansive function h(O), which
Figure 6: Comparison of adaptation in models DNL and DLN immediately after an increment step I− → I+ of the input. For model DNL , the dynamics of adaptation immediately after the step is governed by the difference of the values of the nonlinearity h(O) at the outputs before and after the input step. For model DLN , the dynamics of adaptation immediately after the step is governed by a linear extrapolation of h (dotted line). When, as in this example, the function h(O) has positive curvature, adaptation after an increment step is faster in model DNL than in model DLN .
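The comparison in Figure 6 can be reproduced numerically. The following is a minimal Euler sketch under our own assumed parameters (expansive h(x) = x², τ = 1, increment step from I = 1 to I = 8); variable names are ours, not the paper's:

```python
# Compare adaptation in model DNL (nonlinearity before the low-pass
# filter) and model DLN (nonlinearity after the filter) with the same
# expansive h.  Assumed dynamics, as quoted in the text:
#   model DNL: tau*da/dt     = h(I/a) - a      (equation 2.2)
#   model DLN: tau*dalpha/dt = I/a - alpha,  a = h(alpha)  (equation 2.4)

TAU = 1.0
I_PLUS = 8.0
DT, T_END = 1e-3, 4.0

def h(x):
    return x ** 2

# Fully adapted to I = 1: a solves a = h(1/a), so a = 1; for DLN the
# filter state is alpha = 1/a = 1.
a_dnl = 1.0
alpha = 1.0

a_dnl_trace, a_dln_trace = [], []
for _ in range(int(T_END / DT)):
    a_dln = h(alpha)
    a_dnl_trace.append(a_dnl)
    a_dln_trace.append(a_dln)
    a_dnl += DT * (h(I_PLUS / a_dnl) - a_dnl) / TAU
    alpha += DT * (I_PLUS / a_dln - alpha) / TAU

a_final = I_PLUS ** (2.0 / 3.0)   # both converge to a = I**(2/3) = 4
```

Since h is expansive, the attenuation of model DNL stays ahead of that of model DLN throughout the transient, as the text argues.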
provides a powerful drive to the low-pass filter, and hence fast adaptation. As in section 4.1, it is easy to prove that the difference in adaptation persists for all t > 0; after an increment step in the input, adaptation in model DNL is consistently faster than adaptation in model DLN when h(x) is expansive. On the other hand, when h(x) is compressive, that is, when d²h(x)/dx² < 0, the linear extrapolation of equation 4.4 exceeds the result of equation 4.3, and adaptation in model DLN after an increment step is faster than adaptation in model DNL. After a decrement step I+ → I−, it works just the other way around: when h(x) is expansive, adaptation in model DLN is faster than in model DNL, and when h(x) is compressive, adaptation in model DNL is faster than in model DLN. For multiplicative gain control, similar results hold when comparing adaptation in model MNL (see Figure 1A) with adaptation in model MLN (see Figure 1B). Models MNL and MLN are identical in their dynamics when f(x) is a linear function f(x) = d − kx (with k > 0 to obtain range compression). When f(x) is a nonlinear function, the dynamics of adaptation in models MNL and MLN differ, and again the sign of the difference in speed of adaptation is governed by the curvature of f(x), that is, by its second-order derivative d²f(x)/dx². If d²f(x)/dx² > 0, adaptation after an increment step I− → I+ is faster in model MLN than in model MNL, whereas after a decrement step I+ → I−, adaptation is faster in model MNL than in model MLN. Figure 7 gives an example and shows that the difference in speed of adaptation can be substantial. If d²f(x)/dx² < 0, the situation
Figure 7: Comparison of adaptation in model MNL (dashed lines) and model MLN (dotted lines) when the nonlinearity in the feedback loop is f(x) = 1/x² and τ = 3. The input I(t) is an increment step from 1 to 8 (A) and a decrement step from 8 to 1 (B). Note that models MNL and MLN have identical steady state g_s = I^(−2/3) (continuous lines) but strongly different dynamics.
reverses, and adaptation after an increment step is faster in model MNL than in model MLN, whereas adaptation after a decrement step is faster in model MLN than in model MNL.

5 Asymmetries of Adaptation After Increments and Decrements of the Input

In a linear system, responses after increments and decrements of the input are equally fast. The systems we study here, however, are fundamentally nonlinear, and the speed with which they adapt to increment steps can be strongly different from the speed with which they adapt to decrement steps. For the models of Figure 1, such strong asymmetries of adaptation can be observed in Figures 5 and 7 by comparing the different panels. For contrast gain control, adaptation for an ideal observer has been predicted to be faster after increments of contrast than after decrements of contrast (DeWeese & Zador, 1998). This prediction has been confirmed for human subjects (Snippe & van Hateren, 2003; Snippe et al., 2004). This type of asymmetry can be observed in Figure 5 for model DNL and in Figure 7 for model MLN. Similarly, faster adaptation after contrast increments than after decrements has been found in ganglion cells of salamanders and rabbits (Smirnakis, Berry, Warland, Bialek, & Meister, 1997), in salamander bipolar cells (Rieke, 2001), and in motion-sensitive cells in the visual system of the fly (Fairhall et al., 2001). Here we study these increment and decrement asymmetries for the models in Figure 1. For each of the models of Figure 1, the model will adapt faster to increment stimuli for some nonlinearities in the feedback path, whereas for other nonlinearities, the model will adapt faster to decrement stimuli. However, despite their nonlinear nature, for
each of the models in Figure 1, there is a class of nonlinearities in the feedback path such that adaptation is initially equally fast after an increment step I− → I+ and after a decrement step I+ → I− of the input.

5.1 Increment and Decrement Symmetry for Model DLN Is Obtained When h(x) = c exp(kx)

For model DLN (see Figure 1D), the speed of adaptation immediately after an increment step I− → I+ of the input follows from equation 4.4; thus,

τ da/dt = τ dh(α)/dt = τ (dh(α)/dα)(dα/dt) = (dh(α)/dα)[O − α] = (dh(α)/dα)[I+/h(α) − I−/h(α)]
        = (d ln h(α)/dα) · [I+ − I−].   (5.1)
In the last step of equation 5.1, we have used the equality (1/h(x)) dh(x)/dx = d ln h(x)/dx, and this derivative is evaluated at the steady-state value of α corresponding to the input I− before the increment step. After a decrement step I+ → I−, a similar derivation as in equation 5.1 shows that the speed of adaptation immediately after the step follows from τ da/dt = d ln h(α)/dα · [I− − I+], where the derivative d ln h(α)/dα is now evaluated at the steady-state value of α corresponding to the input I+ before the decrement step. Thus, in model DLN, the speed of adaptation da/dt immediately after increment and decrement steps will be identical (except for the sign, [I+ − I−] versus [I− − I+]) if the derivative d ln h(α)/dα is identical at the points α corresponding to steady state at input I−, respectively, I+. This will be the case for any pair of values I+, I− of the input I if the nonlinearity h(x) is such that d ln h(x)/dx = k, a constant (which must be chosen positive to obtain gain control); hence ln h(x) = kx + ln c, with c another positive constant. Thus, for model DLN, an exponential nonlinearity h(x) = c exp(kx) will yield equally fast adaptation immediately after increment and decrement steps. Due to the nonlinear nature of the model, this equality will not hold for adaptation at later times after the steps, but still it is remarkable that for this nonlinear model, there is a unique shape of the nonlinearity h(x) (i.e., an exponential) that yields symmetric increment and decrement adaptation immediately after the steps for any pair I+, I− of input levels. Moreover, the structure of equation 5.1 for adaptation immediately after steps in the input leads to a simple rule for the sign of any increment or decrement asymmetries of adaptation in model DLN when h(x) is not an exponential function: the critical feature of h(x) is the behavior of d ln h(x)/dx as a function of x.
If d ln h(x)/dx increases with x (i.e., when ln h(x) is an expansive function, d² ln h(x)/dx² > 0), adaptation immediately after a decrement step will be faster than adaptation after an increment step. This is because when ln h(x) is expansive, d ln h(x)/dx will be larger (hence, adaptation faster, according to equation 5.1) at the high
steady-state level x+ that is present immediately before a decrement step than at the low steady-state level x− that is present immediately before an increment step.

5.2 Increment and Decrement Symmetry for Model MLN Is Obtained When f(x) = √(c − kx)

For model MLN (see Figure 1B), the nonlinearity f(x) governs the (a)symmetry of adaptation after increments versus decrements. Using equation 2.3, the speed of adaptation of the gain signal g = f(γ) immediately after an increment step I− → I+ in the input is

τ dg/dt = τ df(γ)/dt = τ (df(γ)/dγ)(dγ/dt) = (df(γ)/dγ)[O − γ] = (df(γ)/dγ)[f(γ)I+ − f(γ)I−]
        = (1/2)(d[f(γ)]²/dγ) · [I+ − I−],   (5.2)
with the derivative d[f(γ)]²/dγ evaluated at the steady-state value of γ that corresponds to the input I− immediately before the input step. After a decrement step I+ → I−, the speed of adaptation immediately after the step is minus the result 5.2, but with the derivative now evaluated at the steady state corresponding to I+. Thus, the speed of adaptation immediately after increment and decrement steps is identical when d[f(x)]²/dx = −k, a constant (the minus sign indicates that k should be chosen positive to obtain multiplicative gain control). From this follows [f(x)]² = c − kx (with c another positive constant), hence f(x) = √(c − kx). For other choices of f(x), asymmetric adaptation of g(t) occurs after increments versus decrements, and the sign of the asymmetry is governed by the curvature (i.e., the second-order derivative) of [f(x)]². If [f(x)]² has positive curvature (i.e., when d²[f(x)]²/dx² > 0), adaptation of g(t) is faster after increment steps than after decrement steps, whereas the reverse holds when [f(x)]² has negative curvature. Due to the identity of models MLN and DLN noted in section 2.1, the results derived above for model DLN can be immediately translated into results for model MLN, and vice versa. For instance, symmetry of initial adaptation in the divisive model DLN was obtained for a nonlinearity h(x) = c exp(kx), which corresponds to f(x) = 1/h(x) = (1/c) exp(−kx) in the equivalent multiplicative model MLN. Since da/dt = d(1/g)/dt = −(1/g²) dg/dt, for this choice of f(x), the speed of adaptation of the gain g divided by the square of this gain is symmetric after increment and decrement steps of the input. Likewise, a choice h(x) = 1/√(c − kx) leads to a divisive model DLN in which (1/a²) da/dt is symmetric after increment and decrement steps.
Finally, we ask if there is a nonlinearity f(x) = 1/h(x) such that there is symmetry of adaptation after increments and decrements of d ln g/dt = (1/g) dg/dt = −(1/a) da/dt = −d ln a/dt, that is, for the gain (or, equivalently, attenuation) signal plotted on a logarithmic axis. In fact,
a linear function f(x) = c − kx attains this, as can be immediately proved from equation 5.2. As was seen in section 2.1, a linear function f(x) = c − kx (or, equivalently, h(x) = 1/(c − kx) in model DLN) leads to a Naka-Rushton relation Os(Is) = c Is/(1 + k Is) for the steady-state output (Heeger, 1992).

5.3 Increment and Decrement Symmetry for Model MNL Is Obtained When f(x) = c ln(k/x), and for Model DNL When h(x) = c ln(kx)

Also for the models MNL (see Figure 1A) and DNL (see Figure 1C), in which the nonlinearities in the feedback loops precede the low-pass filtering, choices exist for these nonlinearities such that adaptation immediately after a step input is symmetric for increment versus decrement steps. Contrary to equations 5.1 and 5.2 for models DLN, respectively MLN, which have a differential expression on their right-hand side (and can immediately be solved), symmetry for models MNL and DNL leads to functional equations, which can be much harder to solve (Aczel & Dhombres, 1989). In appendix A we show how to derive these functional equations and how to solve them by transforming the functional equations into differential equations. Table 1 includes the results for the functions f(x) = c ln(k/x) of model MNL and h(x) = c ln(kx) of model DNL that yield increment and decrement symmetry of adaptation immediately after the input steps. When plotted on axes such that these resulting functions yield straight lines, the resulting asymmetries of adaptation for other functions f and h can be judged by their curvature.

6 Dynamics of the Output

So far we have concentrated on the dynamics of the control signals, a(t) and g(t), of the models in Figure 1. This is a sensible approach for applications in psychophysics, where the control signal a(t), or g(t), determines the detectability of a test signal that is superimposed on a background I(t) (Snippe et al., 2004).
Also in a neurophysiological context, control signals such as a(t) and g(t) will be important determinants of neural detectability, signal-to-noise ratio, and information transmission. Nevertheless, in physiological experiments, one often directly observes the output signal O(t) but not the control signal a(t) or g(t). Here we study the dynamical behavior of the output O(t) of the control loops in Figure 1. In particular, we are interested in whether, for monotonic input I(t), the output O(t) will also be monotonic, or whether O(t) can have overshoots or oscillations when I(t) is monotonic. Obviously, when I(t) increases very fast (e.g., step-wise), O(t) will have an overshoot, because the control signals in Figure 1 cannot completely follow these rapid changes in I(t) due to the low-pass filtering in the feedback loops. Likewise, when I(t) decreases very fast, O(t) will have an undershoot. Hence, to obtain simple (monotonic) behavior of O(t), restrictions on dI(t)/dt are needed. A reasonable choice to obtain such a restriction is to suppose, as illustrated in Figure 8, that the input I(t) to the gain control
Table 1: Feedback Nonlinearities Yielding Symmetry.

Model | dg/dt      | (1/g) dg/dt | (1/g²) dg/dt
MNL   | c ln(k/x)  | c − kx      | c − kx²
MLN   | √(c − kx)  | c − kx      | (1/c) exp(−kx)

Model | da/dt      | (1/a) da/dt | (1/a²) da/dt
DNL   | c ln(kx)   | c − k/x     | c − k/x²
DLN   | c exp(kx)  | 1/(c − kx)  | 1/√(c − kx)

Notes: Mathematical form of the nonlinearities in the feedback paths of the models in Figure 1 such that adaptation is equally fast (i.e., symmetric) immediately after increment steps I− → I+ and decrement steps I+ → I− of the input. The second column shows the nonlinearities necessary to obtain increment and decrement symmetry of the gain g(t) (for models MNL and MLN), respectively the attenuation a(t) (for models DNL and DLN). The third column shows the nonlinearities for which one obtains symmetry on a logarithmic scale for the gain, respectively the attenuation. Note that since d ln g/dt = (1/g) dg/dt (and similarly for a(t)), this corresponds to the case where the speed of adaptation of g, respectively a, relative to its initial value is symmetric. The fourth column shows the nonlinearities that yield symmetry when the control signal g(t) in models MNL and MLN is plotted as an attenuation a(t) = 1/g(t), and likewise when a(t) in models DNL and DLN is plotted as a gain g(t) = 1/a(t). Since d(1/g)/dt = −(1/g²) dg/dt (and similarly for a(t)), this corresponds to the case where the speed of adaptation of g, respectively a, relative to the square of its initial value is symmetric. The parameters c and k should be chosen positive.
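The entries of Table 1 can be checked numerically. The following is a minimal sketch (parameter values and names are ours) for the da/dt column of model DLN: with h(x) = c exp(kx), the initial speed da/dt after the increment step I− → I+ and after the decrement step I+ → I− are equal in magnitude (both equal k(I+ − I−)/τ), whereas for a power law h(x) = x² they are not:

```python
import math

TAU, I_LO, I_HI = 1.0, 1.0, 4.0

def steady_attenuation(h, I, lo=1e-3, hi=100.0):
    """Solve a = h(I/a) for the steady-state attenuation by bisection.

    Assumes F(a) = a - h(I/a) is increasing in a, which holds for the
    increasing nonlinearities used here."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid - h(I / mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def initial_rate(h, dh, I_old, I_new):
    """da/dt of model DLN just after a step I_old -> I_new.

    In steady state alpha = I_old/a; equation 2.4 then gives
    tau*dalpha/dt = I_new/a - I_old/a, and da/dt = dh(alpha)*dalpha/dt."""
    a = steady_attenuation(h, I_old)
    alpha = I_old / a
    return dh(alpha) * (I_new - I_old) / (a * TAU)

C, K = 1.0, 0.5
h_exp = lambda x: C * math.exp(K * x)
dh_exp = lambda x: C * K * math.exp(K * x)
h_sq = lambda x: x ** 2
dh_sq = lambda x: 2.0 * x

up_exp = initial_rate(h_exp, dh_exp, I_LO, I_HI)    # after increment
down_exp = initial_rate(h_exp, dh_exp, I_HI, I_LO)  # after decrement
up_sq = initial_rate(h_sq, dh_sq, I_LO, I_HI)
down_sq = initial_rate(h_sq, dh_sq, I_HI, I_LO)
```

For the exponential, both speeds come out as ±k(I+ − I−)/τ independently of the operating point, as derived in section 5.1; for h(x) = x² the decrement speed is markedly smaller in magnitude than the increment speed.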
Figure 8: In section 6, the input I (t) to the gain control (here illustrated for model DNL ) is a low-pass filtered version (time constant τ I ) of a signal J (t) that steps from J = J 0 to J = J 1 at time t = 0.
is a low-pass filtered version of a signal J (t). Since all neural systems incorporate low-pass filtering, for instance, associated with membrane time constants, this is a reasonable choice. In this section, we assume that J (t) is a step function J 0 → J 1 at time t = 0. In fact, it can be shown that when the output O(t) of the gain control is monotonic for such a step J (t), it will certainly be monotonic when J (t) would be a smoother function than a step (i.e., when there would be more low-pass filtering before the gain control). The actual input I (t) to the gain control loop is assumed to be the result of applying a first-order low-pass filter with time constant τ I to the step
function J(t); that is, for t > 0, we assume

τ_I dI/dt = J1 − I.   (6.1)
The resulting input I(t) to the gain controls of Figure 1 can be written explicitly as

I(t) = J1 + (J0 − J1) exp(−t/τ_I).   (6.2)
The question we pose here is, What can we expect for the dynamics of the output O(t) when the models of Figure 1 are stimulated with the input, equation 6.2? It turns out that the qualitative nature of O(t) depends critically on the value of the time constant τ_I of the filtering of the input in equation 6.1 relative to the time constant τ of the low-pass filtering in the feedback controls of Figure 1. In the next section, we study the case where the input steps are small (i.e., J1 − J0 ≪ J0), in which case the models of Figure 1 can be linearized. In sections 6.2 to 6.4, we study the full nonlinear case.

6.1 Linearized Models Have Monotonic Output When τ ≤ τ_I

For explicitness, we apply linearization to model DNL (see Figure 1C), but the other three models in Figure 1 can be linearized in the same way, and in fact the linearized forms of the four models in Figure 1 are identical. To apply linearization, we assume

I(t) = [1 + i(t)] I_s,   (6.3)
O(t) = [1 + o(t)] O_s,   (6.4)

and

a(t) = [1 + η(t)] a_s,   (6.5)

with i(t), o(t), and η(t) all small relative to 1. Note that i(t), o(t), and η(t) are dimensionless and will be referred to below as input modulation, output modulation, and control modulation, respectively. The constant steady-state values I_s, O_s, and a_s are related through O_s = I_s/a_s and a_s = h(O_s). The dynamic signals i(t), o(t), and η(t) are related through 1 + o(t) = (1 + i(t))/(1 + η(t)). Using the fact that these signals are small relative to 1, 1 + o = (1 + i)/(1 + η) ≈ (1 + i)(1 − η) = 1 + i − η − iη ≈ 1 + i − η, and thus

o(t) = i(t) − η(t),   (6.6)
that is, the linearized system behaves as a subtractive feedback in which the control modulation η(t) is subtracted from the input modulation i(t) to obtain the output modulation o(t). In appendix B, we prove that when the input modulation i(t) is a low-pass filtered step function, as in equation 6.2, the output modulation o(t) has an overshoot when the time constant τ of the low-pass filter in the control path is larger than the time constant τ_I of the filtering at the input. In this case, the control modulation η(t) cannot follow the increase in the input modulation i(t) fast enough, and the resulting output modulation o(t) has an overshoot relative to its final steady state. On the other hand, when τ ≤ τ_I, the filter in the feedback path is fast enough to follow the input modulation, and the output modulation o(t) will approach its new steady state in a monotonic way, without an overshoot.

6.2 Output of Model DNL Resembles the Output of the Linearized Model

As shown in appendix B, when the linearized models are stimulated with low-pass filtered steps, the qualitative dynamics of the output of these models is quite simple: the sum of a constant and two exponentials. Depending on the time constants of the low-pass filters, the output is either monotonic, or it has a single overshoot (respectively, undershoot for a decrement step). In this section we show that, remarkably, for model DNL (see Figure 1C), this qualitative dynamics persists also for large input steps, independent of the precise form of the nonlinearity h in the feedback path. For the other models in Figure 1, however, qualitatively new features in the output can occur, depending on the size and sign of the input steps and on the form of the nonlinearities in the feedback loops. When stimulated with an input I(t), the output O(t) of model DNL can be written as O(t) = I(t)/a(t). Differentiating this expression with respect to time yields

dO/dt = (1/a²)[a dI/dt − I da/dt] = (1/a²)[a (J1 − I)/τ_I − I (h(O) − a)/τ]
      = (1/a)[(J1 − I)/τ_I − (Oh(O) − I)/τ],   (6.7)
where we used equations 6.1 and 2.2. Now if O(t) has an overshoot, dO/dt = 0 at some finite time t where the overshoot reaches its maximum value. Setting dO/dt = 0 in equation 6.7 yields

Oh(O) = (τ/τ_I) J1 + (1 − τ/τ_I) I.   (6.8)
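The consequences of equation 6.8 for the full nonlinear model can be checked numerically. The following is a minimal Euler sketch of model DNL driven by the low-pass filtered step of equation 6.2, under our own assumed parameters (h(O) = O², J0 = 1, J1 = 4, τ_I = 1; variable names are ours): for τ < τ_I the output rises monotonically to its final level, while for τ > τ_I it shows a single overshoot, in line with the phase-plane argument of Figure 9.

```python
# Model DNL driven by a low-pass filtered step:
#   tau_I * dI/dt = J1 - I            (equation 6.1)
#   tau   * da/dt = h(I/a) - a        (equation 2.2)
# with output O = I/a.

DT, T_END = 1e-3, 15.0
TAU_I, J0, J1 = 1.0, 1.0, 4.0

def h(O):
    return O ** 2

def simulate_output(tau):
    I = J0
    a = J0 ** (2.0 / 3.0)        # fully adapted to J0: a**3 = I**2
    outputs = []
    for _ in range(int(T_END / DT)):
        outputs.append(I / a)
        I += DT * (J1 - I) / TAU_I
        a += DT * (h(I / a) - a) / tau
    return outputs

O_final = J1 ** (1.0 / 3.0)      # final steady state: O*h(O) = O**3 = J1

O_fast = simulate_output(0.5)    # tau < tau_I: monotonic approach
O_slow = simulate_output(2.0)    # tau > tau_I: single overshoot
```

With τ = 0.5 the trace never exceeds its final value O = J1^(1/3) ≈ 1.59, whereas with τ = 2 it rises well above that level before relaxing back.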
When plotted in an input-output diagram with the input I on the horizontal axis and Oh(O) on the vertical axis (such that steady-state, Is = Os h(Os ), is a straight line through the origin), equation 6.8 yields for any value of τ/τ I
Figure 9: (A) Dynamics of the output O of model DNL (with h(O) = O²) stimulated with a low-pass filtered step (time constant τ_I) when τ < τ_I (τ = 0.5 τ_I in this example). The horizontal axis is the input I from equation 6.2 (with J0 = 1 and J1 = 4). The vertical axis is a nonlinear transformation Oh(O) = O³ of the output O of the model, such that steady state, O_s³ = I_s (dotted line), is a straight line through the origin. According to equation 6.8, the line dO/dt = 0 (dashed line) also is a straight line in these coordinates, with positive slope when τ < τ_I. Hence, the actual (nonlinearly transformed) output O³(I) (continuous line) cannot cross this line and must remain in the region below the dashed line, where dO/dt > 0; hence O(t) is monotonic. (B) As A, but now with τ = 2 τ_I, such that the line of equation 6.8, dO/dt = 0 (dashed line), has negative slope and therefore can be crossed by the actual output O³(I) (continuous curve labeled "large step"). In fact, a crossing always occurs, since the output cannot cross the output of the linearized model valid for small steps (continuous curve labeled "small step"), which has an overshoot when τ > τ_I, as derived in appendix B. Thus, O(t) has a single overshoot, and it has a single undershoot (not shown in the figure) after a decrement step in the input.
a straight line (see Figure 9). This line separates the region above the line, where dO/dt < 0, from the region below it, where dO/dt > 0. Note that the line where dO/dt = 0 (given by equation 6.8) crosses the steady-state line, I_s = O_s h(O_s), when I = J1. Now consider the case that τ/τ_I < 1. Then the line of equation 6.8 has positive slope, and immediately after an increment step the system lies below this line, so that initially dO/dt > 0. Thus, for t > 0 the output O(t) initially increases. And in fact it will continue to increase (toward its final steady-state level where Oh(O) = I = J1) for all t > 0. This is because the output O(t) cannot cross the line where dO/dt = 0, since on this line, dO/dt = 0, but for I < J1, according to equation 6.1, dI/dt > 0. Hence, if O(t) approaches the line dO/dt = 0, it will be pushed back into the region where dO/dt > 0. Thus, when τ < τ_I, O(t)
approaches its final steady state in a monotonic way, without an overshoot (illustrated in Figure 9A for the case h(O) = O²). Likewise, when the step J0 → J1 at the input is a decrement step (i.e., when J1 < J0), the output O(t) will decrease monotonically and reach its final steady state without an undershoot. The situation is very different, however, when τ/τ_I > 1, when the low-pass filter in the feedback loop is slower than the filtering at the input. In that case, the straight line of equation 6.8 has a negative slope, and now the output O(t) can cross the line dO/dt = 0, because now if O(t) approaches the line dO/dt = 0, it will be pulled into the region with dO/dt < 0 (see Figure 9B). In fact, from the small-signal analysis performed in appendix B, it follows that such a crossing must occur. In appendix B we showed that when τ/τ_I > 1, for small steps at the input, the resulting output attains its final steady state with an overshoot (the curve labeled "small step" in Figure 9B). But the output for a large step (the curve labeled "large step" in Figure 9B) cannot cross the output for the small step (since both outputs are described by the same first-order differential equation 6.7, implying that their trajectories must be identical if they have a point in common). The result, as illustrated in Figure 9B, is that also for a large step, O(t) will have exactly one overshoot before it reaches steady state for t → ∞. Likewise, after a decrement step (i.e., when 0 < J1 < J0 in equation 6.2), O(t) will have exactly one undershoot (without any subsequent overshoot) when τ/τ_I > 1. Thus, remarkably, the qualitative dynamics of the output O(t) of the nonlinear divisive model DNL is identical to the behavior of its linearized version analyzed in appendix B.

6.3 The Output of Model MNL Can Have Overshoots Also When τ < τ_I
The qualitative dynamics of the output O(t) of the multiplicative model MNL (see Figure 1A), however, can differ from the linear analysis of appendix B. When stimulated with input I(t), this model yields an output O(t) = I(t) g(t); hence, using equations 6.1 and 2.1,

dO/dt = g dI/dt + I dg/dt = g (J1 − I)/τ_I + I (f(O) − g)/τ
      = g[(J1 − I)/τ_I − (I − I² f(O)/O)/τ].   (6.9)
Setting dO/dt = 0 in equation 6.9 yields

O/f(O) = I² / [I (τ/τ_I + 1) − J1 τ/τ_I],   (6.10)
which, contrary to the corresponding equation 6.8 for model DNL , is a nonlinear equation in I . The behavior of equation 6.10 is illustrated for a number
Figure 10: Dynamics of the output O of model MNL when stimulated with the low-pass filtered step I (t) of equation 6.2, with τ I = 1. Vertical axis O/ f (O) is a (monotonic) nonlinear transformation of the output such that steady state (the dotted line) Os / f (Os ) = Is is a straight line through the origin. Dashed lines are plots, for a number of values of τ , of equation 6.10, where d O/dt = 0. Note that, contrary to Figure 9 for model DNL , these lines are not straight, and have a sharp upward turn at some value I < J 1 . The response O/ f (O) (continuous line, obtained with f (O) = 1/O, J 0 = 0.04, J 1 = 1, and τ = 0.5) weaves its way through the line d O/dt = 0 for τ = 0.5. As a consequence, O(t) has an overshoot and a subsequent undershoot before it reaches its final steady state.
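The overshoot-plus-undershoot behavior of Figure 10 can be reproduced numerically. The following is a minimal Euler sketch of model MNL with the Figure 10 parameters (f(O) = 1/O, J0 = 0.04, J1 = 1, τ = 0.5, τ_I = 1); variable names are ours, not the paper's:

```python
# Model MNL driven by a low-pass filtered step in the regime tau < tau_I:
#   tau_I * dI/dt = J1 - I            (equation 6.1)
#   tau   * dg/dt = f(g*I) - g        (equation 2.1)
# with output O = g*I.  The output first overshoots and then undershoots
# its final steady state O = sqrt(J1) = 1.

DT, T_END = 1e-3, 12.0
TAU, TAU_I, J0, J1 = 0.5, 1.0, 0.04, 1.0

def f(O):
    return 1.0 / O

I = J0
g = J0 ** -0.5          # fully adapted to J0: g = f(g*I) gives g = I**(-1/2)
O_trace = []
for _ in range(int(T_END / DT)):
    O_trace.append(g * I)
    I += DT * (J1 - I) / TAU_I
    g += DT * (f(g * I) - g) / TAU

O_final = J1 ** 0.5                   # final steady state O = sqrt(J1) = 1
peak_index = max(range(len(O_trace)), key=O_trace.__getitem__)
overshoot = max(O_trace) - O_final
undershoot = O_final - min(O_trace[peak_index:])
```

Even though the feedback filter is faster than the input filter (τ < τ_I), the output weaves through the dO/dt = 0 curve of equation 6.10: it rises above 1, dips below 1, and only then settles, in contrast to model DNL, whose output is monotonic in this regime.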
of values of τ/τ I by the dashed curves in Figure 10. When τ > τ I , that is, when the filter in the feedback loop is slower than the filtering of the input, the structure of the lines in Figure 10 is such that the qualitative behavior of O(t) is identical to the behavior for τ > τ I of the output of model DNL . After an increment step at the input, O(t) has a single overshoot, and after a decrement step, O(t) has a single undershoot. For an increment step, this result follows using the reasoning for model DNL in Figure 9B, since in both Figure 9B and Figure 10, the lines d O/dt = 0 are monotonically decreasing when I < J 1 and τ > τ I . For I > J 1 the lines d O/dt = 0 are nonmonotonic when τ > τ I ; they have a minimum at a finite value of I and then increase with increasing I . However, this does not affect the qualitative behavior of O(t) since O(t) cannot cross this increasing part of the curve d O/dt = 0. Thus, when τ > τ I the output O(t) has a single undershoot after a decrement step at the input. Now consider the case that τ < τ I —the case that the filter in the feedback path is faster than the filter at the input. After a decrement step of the input, the output O(t) is qualitatively identical to the output of model DNL and will reach steady state without an undershoot. This follows from the fact that the curves d O/dt = 0 in Figure 10 are monotonically increasing when I > J 1 and τ < τ I ; hence, the qualitative dynamics for O(t) in this case is identical to that obtained for model DNL . After an increment step, however, now O(t) can be qualitatively different from the output of
Dynamics of Nonlinear Feedback Control
1205
Figure 11: Output O(t) of model DLN with τ = 1 and a nonlinearity h(x) = 1 + x + 0.8 sin(π x)/π , which is monotonic but has multiple changes in the sign of its second-order derivative. Input I (t) is a step from 0 to 110, low-pass filtered with a time constant τ I = 1. For this model, a simple input (a saturating exponential) generates a complicated output with multiple oscillations.
model DNL, because the curves dO/dt = 0 in Figure 10 now have a more complicated structure than in Figure 9A, with a sharp upturn for small values of I. As a consequence, whereas model DNL reaches steady state in a monotonic fashion when τ < τ_I, the output O(t) of model MNL develops a "nose" for sufficiently large steps at the input. That is, after large steps of the input I(t), the output O(t) of model MNL has an initial overshoot, followed by a subsequent undershoot, as shown by the continuous curve in Figure 10.

6.4 Simple Input Can Yield a Complicated Output in Models DLN and MLN. Model DLN (or, equivalently, model MLN) can have an output O(t) that is still more complicated than the output of model MNL shown in Figure 10. When the nonlinearity h(x) (though monotonic) has sign changes in its second-order derivative d²h/dx², the output O(t) of model DLN can have oscillatory features when the model is stimulated with the simple low-pass filtered step of equation 6.2. An example is shown in Figure 11.

7 Discussion

In this letter, we studied the dynamics of the four feedback models for gain control shown in Figure 1. Two of the models (MNL and MLN) have a multiplicative gain control, whereas the other two models (DNL and DLN) have a divisive gain control. The structure of all four models is very similar in that the feedback loop for each consists of a concatenation of an instantaneous nonlinearity and a first-order low-pass filter. As shown
in sections 2.2 and 2.3, two of these models (MLN and DLN ) are actually identical, and the other two models (MNL and DNL ) are connected by a duality. Despite these relationships, however, the dynamics of models MNL , DNL , and MLN (c.q. DLN ) can be very different for a given input I (t). For instance, in section 4, we showed that the dynamics of adaptation of the models differs after steps in the input (see Figures 5 and 7), and in section 5 we showed that the models differ in their asymmetries of adaptation after increments and decrements of the input. Finally, in section 6, we showed that for simple band-limited inputs I (t), the outputs O(t) of the different models can be qualitatively different. Thus, subtle differences in the control structure of feedback models for gain control can have strong effects on their dynamics. As shown in section 6, the input-output behavior of model DNL is qualitatively the simplest, especially when τ < τ I (see Figure 9A), because in that case, a monotonic input I (t) yields a monotonic output O(t). Thus, when a nonmonotonic output O(t) occurs in model DNL , the corresponding input I (t) also must have been nonmonotonic. On the other hand, even when τ < τ I , model MNL can have a nonmonotonic output O(t) when the input I (t) was monotonic, as is shown in Figure 10. Finally, model DLN (and model MLN ) can give complicated outputs even when the input is simple. Hence, for this model, it can be difficult to make a qualitative estimate of the input when observing the output. For instance, it would be hard to discriminate the output seen for this model in Figure 11 from an output generated by this model when stimulated by an input that genuinely oscillates. It appears therefore that model DNL is usually a safer design than the models MNL and (especially) MLN or DLN , since as a general principle in information processing, one should try to avoid introducing spurious structure that is not present at the input (Koenderink, 1984). 
The structural complexity of the models studied in this letter has been kept to a minimum. The great advantage of this approach is that we can obtain general results using an exact mathematical analysis. It would be very hard to derive conclusions of similar generality if one had to resort to numerical simulations, which might be necessary for more complex models. Nonetheless, the results derived for the simple models studied here can aid in the analysis of more complex models. For instance, the results derived here can speed the search for realistic complex models by restricting the huge search space of the many possible complex models to more manageable proportions. A realistic strategy for this would be first to investigate the steady-state behavior of the system under study, which determines the mathematical form of the nonlinearity in the system (Chichilnisky, 2001; Snippe et al., 2004). Given the steady state, one can then study the dynamics of the system (e.g., increment and decrement asymmetry and overshoots of the output) to determine which type of model (e.g., divisive or multiplicative) is appropriate to explain the experimental data.
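This discrimination strategy can be illustrated with a direct simulation. The sketch below is our own illustration (not code from the letter): it integrates both gain-control structures with forward Euler, using the example nonlinearities of section 6 (h(O) = O² for model DNL, f(O) = 1/O for model MNL) and the low-pass filtered input step of equation 6.2 with J_0 = 0.04, J_1 = 1, τ_I = 1, and τ = 0.5. With τ < τ_I, the divisive output rises monotonically, whereas the multiplicative output overshoots its final value (the "nose" of Figure 10):

```python
import numpy as np

def lowpass_step(t, J0=0.04, J1=1.0, tau_I=1.0):
    # Low-pass filtered input step of equation 6.2.
    return J1 + (J0 - J1) * np.exp(-t / tau_I)

def simulate(model, T=15.0, dt=1e-3, tau=0.5):
    # Forward-Euler integration of the feedback state (a for DNL, g for MNL).
    n = int(T / dt)
    t = np.arange(n) * dt
    I = lowpass_step(t)
    O = np.empty(n)
    if model == "DNL":
        # O = I/a, tau*da/dt = h(O) - a, with h(O) = O**2.
        a = 0.04 ** (2.0 / 3.0)     # steady state at I = J0: a = J0**(2/3)
        for k in range(n):
            O[k] = I[k] / a
            a += dt / tau * (O[k] ** 2 - a)
    else:
        # MNL: O = I*g, tau*dg/dt = f(O) - g, with f(O) = 1/O.
        g = 1.0 / np.sqrt(0.04)     # steady state at I = J0: g = 1/sqrt(J0)
        for k in range(n):
            O[k] = I[k] * g
            g += dt / tau * (1.0 / O[k] - g)
    return O

O_dnl = simulate("DNL")
O_mnl = simulate("MNL")

# Divisive model: monotonic rise; multiplicative model: overshoot above the
# final value before settling.
print(O_dnl[-1], O_mnl.max(), O_mnl[-1])
```

Which of these signatures appears in a measured step response then points to the appropriate control structure.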
Figure 12: An example of nonlinear (divisive) control (A) where the nonlinearity and the filtering shape not only the control signal a (t) but also directly the output O(t). The model is equivalent to the model shown at the right (B). Hence, the dynamics of a (t) = O(t) of this model can be studied with the methods developed for model DNL in Figure 1.
There are several simple extensions and alternatives of the models studied here that are amenable to analysis with the methods developed here. For instance, the filtering and nonlinearities in the feedback path in Figure 1 could also be placed at a point before the feedback loop branches off from the output. As shown in Figure 12, this in fact does not affect the dynamics of the gain signal, but it changes the dynamics of the output, which now becomes identical to the gain. Finally, the filtering operations could be generalized. For instance, one could consider higher-order low-pass filters, time delays in the filters, or high-pass filtering of the output (Berry et al., 1999). It is well known that with increasing values of the input (e.g., luminance or contrast), neural systems not only decrease their gain but also adapt their temporal filtering (e.g., Sperling & Sondhi, 1968; Victor, 1987; Benardete, Kaplan, & Knight, 1992; Carandini & Heeger, 1994; van Hateren, 2005). It will be interesting to see to what extent the tools developed in this letter can also be used to analyze such more general systems.

Appendix A: Increment and Decrement Symmetry for Models DNL and MNL

Adaptation in model DNL (see Figure 1C) is described by

τ da/dt = h(O) − a = h(I/a) − a.  (A.1)
For a constant input I_s, steady state is attained when da/dt = 0, thus, when h(I_s/a_s) = a_s. Hence, in steady state I_s = a_s h^(-1)(a_s) = O_s h(O_s), where h^(-1) is the inverse of the function h, and we have used the fact that in steady state a_s = h(O_s). Now consider an increment step I_− → I_+ of the input. For the input I_− the steady-state equation is I_− = a_− h^(-1)(a_−) = O_− h(O_−), and for the input I_+ it is I_+ = a_+ h^(-1)(a_+) = O_+ h(O_+). Immediately after the increment step, when the input is I_+ but the attenuation a is still at its old
adaptation level a_−,

τ da_inc/dt = h(I_+/a_−) − a_− = h(I_+/a_−) − h(I_−/a_−) = h(O_+ h(O_+)/h(O_−)) − h(O_−).  (A.2)
Likewise, immediately after a decrement step I_+ → I_−,

τ da_dec/dt = h(I_−/a_+) − a_+ = h(O_− h(O_−)/h(O_+)) − h(O_+).  (A.3)
For symmetry of adaptation, we equate da_inc/dt and −da_dec/dt (taking into account that a decreases after a decrement step). Writing O_+ = y and O_− = x, this yields

h(y h(y)/h(x)) + h(x h(x)/h(y)) = h(y) + h(x),  (A.4)
a functional equation for h in terms of the two variables x and y (Aczel & Dhombres, 1989). It is easy to verify that equation A.4 is solved by the function h(x) = c + k ln x (with c and k constants; k > 0 for gain control), but how does one obtain such a solution? A method to solve A.4 is to set y = x + ε and make a Taylor expansion of the resulting expression. Since, as shown below, terms up to first order in the expansion cancel, it is necessary to expand to second order in ε. An identical result can be obtained by differentiating equation A.4 twice with respect to y, and putting y = x after differentiating. Taylor-expanding the right-hand side of equation A.4 yields

h(x + ε) + h(x) = 2h(x) + ε dh(x)/dx + (1/2)ε² d²h(x)/dx² = 2h + εh' + (1/2)ε²h'',  (A.5)

where for conciseness we write dh/dx = h' and d²h/dx² = h'', and drop the argument x everywhere. Likewise, the left-hand side of equation A.4 yields
h((x + ε)(h + εh' + ε²h''/2)/h) + h(xh/(h + εh' + ε²h''/2))
= h(x + ε(1 + xh'/h) + ε²(h'/h + xh''/(2h))) + h(x − εxh'/h + ε²x((h')²/h² − h''/(2h)))
= 2h + εh' + ε²[ ((h')²/h)(1 + xh'/h) + (h''/2)(1 + xh'/h)² + x²(h')²h''/(2h²) ].  (A.6)

Equating A.6 and A.5, only the terms in ε² remain:

ε²[ (h')²/h + (1/2)h'' + xh'h''/h + (xh'/h)²h'' + x(h')³/h² ] = (1/2)ε²h''.  (A.7)
Cancelling the right-hand side of equation A.7 with the second term on the left-hand side of equation A.7, and multiplying with h²/(ε²h'), yields

hh' + xhh'' + x²h'h'' + x(h')² = 0,  (A.8)

which can be simplified to

[h + xh'] · [h' + xh''] = [xh]' · [xh']' = 0.  (A.9)
The first factor in equation A.9, [xh]' = 0, has a solution h(x) = k/x that decreases with increasing x, contrary to the assumptions in section 2. However, the second factor in equation A.9, [xh']' = 0, does yield a result for divisive gain control. Because [xh']' = 0 means xh' = c with c a constant (positive for divisive gain control), it follows that h' = c/x, thus h(x) = c ln(kx), with k another positive constant. Note that when x < 1/k, h(x) = c ln(kx) becomes negative, which speeds up adaptation after large decrements of the input. However, in steady state, I_s = O_s h(O_s) = O_s c ln(k O_s) > 0, hence h(O_s) = c ln(k O_s) > 0. The method used here to derive a functional equation and transform it into a differential equation can also be used to find nonlinearities f(x) that yield symmetric adaptation immediately after increment and decrement steps of the input in the multiplicative model MNL. Likewise,
nonlinearities can be found for which (1/a)da/dt = d ln a/dt (or, equivalently, (1/g)dg/dt = d ln g/dt) is symmetric immediately after increment and decrement steps. Results are collected in Table 1.

Appendix B: Dynamics of the Output for the Linearized Models

To obtain an explicit expression for the output modulation o(t) (see equation 6.6), we linearize the dynamics of the feedback loop of model DNL, equation 2.2, which yields

τ da(t)/dt = h(O(t)) − a(t) = h(O_s) + O_s h'(O_s)o(t) − a_s − a_s η(t) = O_s h'(O_s)o(t) − h(O_s)η(t).  (B.1)
In the first step of equation B.1, we used the steady-state relation a_s = h(O_s), and defined h'(O_s) = dh(O_s)/dO_s. Using equation 6.5 and the steady-state relation a_s = h(O_s), the left-hand side of equation B.1 can be rewritten as τ h(O_s) dη(t)/dt. Using equation 6.6, we can rewrite equation B.1 as

τ dη(t)/dt = Q i(t) − (1 + Q) η(t),  (B.2)
where we define

Q ≡ O_s h'(O_s)/h(O_s).  (B.3)
Note that in general, the parameter Q in equation B.3 depends on the steady-state output O_s, Q = Q(O_s). For conciseness, we have suppressed this dependence in our notation. Dividing equation B.2 by 1 + Q yields

τ_Q dη(t)/dt = [Q/(1 + Q)] i(t) − η(t),  (B.4)
where we defined τ_Q = τ/(1 + Q). Equation B.4 says that for the linearized model, the subtractive control modulation η(t) is a low-pass filtered version of the input modulation i(t), in which the filter has a gain Q/(1 + Q) and a time constant τ_Q = τ/(1 + Q) (Bechhoefer, 2005). Equation B.4 can be solved explicitly when the input modulation i(t) is a low-pass filtered step of size s, that is, when for t ≥ 0,

i(t) = s[1 − exp(−t/τ_I)],  (B.5)
and for t < 0, i(t) = 0. The solution of the linear equation B.4 (representing a first-order low-pass filter) can be written as the convolution of i(t) with a
low-pass filter with time constant τ_Q and gain Q/(1 + Q):

η(t) = [Q/((1 + Q)τ_Q)] ∫_0^t i(t') exp(−(t − t')/τ_Q) dt',  (B.6)
where the fact that i(t') = 0 for t' < 0 has been used for the lower bound of the integral. Substituting equation B.5 into equation B.6, the integral is elementary and yields

η(t) = [sQ/(1 + Q)] [ 1 + (τ_Q/(τ_I − τ_Q)) exp(−t/τ_Q) − (τ_I/(τ_I − τ_Q)) exp(−t/τ_I) ]  when τ_I ≠ τ_Q,  (B.7)
and

η(t) = [sQ/(1 + Q)] [ 1 − (1 + t/τ_Q) exp(−t/τ_Q) ]  when τ_I = τ_Q.  (B.8)
First, we study the case τ_I ≠ τ_Q. For this case, using equations 6.6, B.5, and B.7, the output modulation o(t) of the gain control is

o(t) = i(t) − η(t) = s/(1 + Q) − [s(τ_I − τ)/((1 + Q)(τ_I − τ_Q))] exp(−t/τ_I) − [sQτ_Q/((1 + Q)(τ_I − τ_Q))] exp(−t/τ_Q),  (B.9)
where in the second term of equation B.9, we used the relation τ = (1 + Q)τ_Q. Now when τ < τ_I (i.e., when the low-pass filter in the control loop is faster than the filter at the input), the prefactors of the exponential functions in equation B.9 are both positive, and hence o(t) is a monotonically increasing function. Thus, when τ < τ_I, the output modulation o(t) reaches its final value s/(1 + Q) in a monotonic way, without any overshoots or oscillations. When τ > τ_I (i.e., when the low-pass filter in the control loop is slower than the filter at the input), two situations can occur, depending on the value of the parameter Q in equation B.3: either τ_Q = τ/(1 + Q) < τ_I, or τ_Q > τ_I. If τ_Q > τ_I, then at large times t, the exponential function exp(−t/τ_Q) in equation B.9 will dominate the exponential function exp(−t/τ_I). However, τ_Q > τ_I means that τ_I − τ_Q < 0; hence, in this case, the contribution of exp(−t/τ_Q) to the result B.9 will be positive, which means that for large t, the output modulation o(t) in equation B.9 approaches its final value s/(1 + Q) from above. This, however, can be the case only when o(t) has
an overshoot. Likewise, when τ_Q < τ_I, the exponential exp(−t/τ_I) dominates exp(−t/τ_Q) for large t. However, in this case, the contribution of exp(−t/τ_I) to o(t) is positive, as can be easily verified. Hence, when τ > τ_I, the output modulation o(t) always has an overshoot, independent of the precise value of τ_Q relative to τ_I. One can easily check that o(t) also has a single overshoot when η(t) is described by equation B.8, that is, when τ_I = τ_Q exactly (and hence τ = (1 + Q)τ_Q > τ_I).

Acknowledgments

This research is supported by the Netherlands Organization for Scientific Research (NWO/ALW).

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Aczel, J., & Dhombres, J. (1989). Functional equations in several variables. Cambridge: Cambridge University Press.
Bechhoefer, J. (2005). Feedback for physicists: A tutorial essay on control. Reviews of Modern Physics, 77, 783–836.
Benardete, E. A., Kaplan, E., & Knight, B. W. (1992). Contrast gain control in the primate retina: P cells are not X-like, some M cells are. Visual Neurosci., 8, 483–486.
Berry, M. J., Brivanlou, I. H., Jordan, T. A., & Meister, M. (1999). Anticipation of moving stimuli by the retina. Nature, 398, 334–338.
Calvert, P. D., Ho, T. W., Lefebvre, Y. M., & Arshavsky, V. Y. (1998). Onset of feedback reactions underlying vertebrate rod photoreceptor light adaptation. J. Gen. Physiol., 111, 39–51.
Carandini, M., & Heeger, D. J. (1994). Summation and division by neurons in primate visual cortex. Science, 264, 1333–1336.
Chichilnisky, E. J. (2001). A simple white noise analysis of neuronal light responses. Network, 12, 199–213.
Crawford, B. H. (1947). Visual adaptation in relation to brief conditioning stimuli. Proceedings of the Royal Society B, 134, 283–302.
Crevier, D. W., & Meister, M. (1998). Synchronous period-doubling in flicker vision of salamander and man. J. Neurophys., 79, 1869–1878.
Dau, T., Püschel, D., & Kohlrausch, A. (1996). A quantitative model of the "effective" signal processing in the auditory system. I. Model structure. J. Acoust. Soc. Am., 99, 3615–3622.
DeWeese, M., & Zador, A. (1998). Asymmetric dynamics in optimal variance adaptation. Neural Computation, 10, 1179–1202.
Escabí, M. A., Miller, L. M., Read, H. L., & Schreiner, C. E. (2003). Naturalistic auditory contrast improves spectrotemporal coding in the cat inferior colliculus. J. Neurosci., 23, 11489–11504.
Fairhall, A. L., Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Efficiency and ambiguity in an adaptive neural code. Nature, 412, 787–792.
Franklin, G. F., Powell, J. D., & Emami-Naeini, A. (1994). Feedback control of dynamical systems (3rd ed.). Reading, MA: Addison-Wesley.
Griffin, L. D., Lillholm, M., & Nielsen, M. (2004). Natural image profiles are most likely to be step edges. Vision Res., 44, 407–421.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neurosci., 9, 181–197.
Hosoya, T., Baccus, S. A., & Meister, M. (2005). Dynamic predictive coding by the retina. Nature, 436, 71–77.
Koenderink, J. J. (1984). The structure of images. Biological Cybernetics, 50, 363–370.
Marmarelis, V. Z. (1991). Wiener analysis of nonlinear feedback in sensory systems. Annals of Biomedical Engineering, 19, 345–382.
Oppenheim, A. V., Willsky, A. S., & Young, I. T. (1983). Signals and systems. Englewood Cliffs, NJ: Prentice Hall.
Rieke, F. (2001). Temporal contrast adaptation in salamander bipolar cells. J. Neurosci., 21, 9445–9454.
Salinas, E., & Thier, P. (2000). Gain modulation: A major computational principle of the central nervous system. Neuron, 27, 15–21.
Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nat. Neurosci., 4, 819–825.
Shapley, R., & Enroth-Cugell, C. (1984). Visual adaptation and retinal gain controls. Progr. Retinal Res., 3, 263–346.
Smirnakis, S. M., Berry, M. J., Warland, D. K., Bialek, W., & Meister, M. (1997). Adaptation of retinal processing to image contrast and spatial scale. Nature, 386, 69–73.
Snippe, H. P., Poot, L., & van Hateren, J. H. (2000). A temporal model for early vision that explains detection thresholds for light pulses on flickering backgrounds. Visual Neurosci., 17, 449–462.
Snippe, H. P., Poot, L., & van Hateren, J. H. (2004). Asymmetric dynamics of adaptation after onset and offset of flicker. J. Vision, 4, 1–12.
Snippe, H.
P., & van Hateren, J. H. (2003). Recovery from contrast adaptation matches ideal-observer predictions. J. Opt. Soc. Am. A, 20, 1321–1330. Sperling, G., & Sondhi, M. M. (1968). Model for luminance discrimination and flicker detection. J. Opt. Soc. Am., 58, 1133–1145. Thorson, J., & Biederman-Thorson, M. (1974). Distributed relaxation processes in sensory adaptation. Science, 183, 161–172. Torre, V., Ashmore, J. F., Lamb, T. D., & Menini, A. (1995). Transduction and adaptation in sensory receptor cells. J. Neurosci., 15, 7757–7768. van de Vegte, J. (1990). Feedback control systems (2nd ed.). Englewood Cliffs, NJ: Prentice Hall. van Hateren, J. H. (1997). Processing of natural time series of intensities by the visual system of the blowfly. Vision Res., 23, 3407–3416. van Hateren, J. H. (2005). A cellular and molecular model of response kinetics and adaptation in primate cones and horizontal cells. J. Vision, 5, 331–347.
Victor, J. D. (1987). The dynamics of the cat retinal X cell centre. J. Physiol., 386, 219–246.
Wilke, S. D., Thiel, A., Eurich, C. W., Greschner, M., Bongard, M., Ammermüller, J., & Schwegler, H. (2001). Population coding of motion patterns in the early visual system. J. Comp. Physiol. A, 187, 549–558.
Received August 12, 2005; accepted August 1, 2006.
LETTER
Communicated by Bard Ermentrout
Interspike Interval Statistics in the Stochastic Hodgkin-Huxley Model: Coexistence of Gamma Frequency Bursts and Highly Irregular Firing Peter Rowat
[email protected] Institute for Neural Computation, University of California at San Diego, La Jolla, CA 92093, U.S.A.
When the classical Hodgkin-Huxley equations are simulated with Na- and K-channel noise and constant applied current, the distribution of interspike intervals is bimodal: one part is an exponential tail, as often assumed, while the other is a narrow gaussian peak centered at a short interspike interval value. The gaussian arises from bursts of spikes in the gamma-frequency range, the tail from the interburst intervals, giving overall an extraordinarily high coefficient of variation—up to 2.5 for 180,000 Na channels when I ≈ 7 µA/cm². Since neurons with a bimodal ISI distribution are common, it may be a useful model for any neuron with class II firing. The underlying mechanism is due to a subcritical Hopf bifurcation, together with a switching region in phase space where a fixed point is very close to a system limit cycle. This mechanism may be present in many different classes of neurons and may contribute to widely observed highly irregular neural spiking.

1 Introduction

The variability of spike times in response to repeated identical inputs is a prominent characteristic of the nervous system that many investigators have observed (Werner & Mountcastle, 1963; Pfeiffer & Kiang, 1965; Nakahama, Suzuki, Yamamoto, & Aikawa, 1968; Noda & Adey, 1970; Lamarre, Filion, & Cordeau, 1971; Bassant, 1976; Burns & Webb, 1976; Whitsel, Schreiner, & Essick, 1977; Dean, 1981; Softky & Koch, 1993; Holt, Softky, Koch, & Douglas, 1996; Shadlen & Newsome, 1998) and analyzed from many different approaches (Stein, 1965; Wilbur & Rinzel, 1983; Softky & Koch, 1993; Mainen & Sejnowski, 1995; Troyer & Miller, 1997; Gutkin & Ermentrout, 1998; Stevens & Zador, 1998; Destexhe, Rudolph, Fellous, & Sejnowski, 2001; Mann-Metzer & Yarom, 2002; Rudolph & Destexhe, 2002; Brette, 2003; Stiefel, Englitz, & Sejnowski, 2006).
Neural Computation 19, 1215–1250 (2007)
© 2007 Massachusetts Institute of Technology

1216
P. Rowat

Although synaptic noise may account for some of this variability, another significant source is noise from stochastic ion channel activity, which can affect neuronal threshold and spike time reliability (Sigworth, 1980; White, Rubinstein, & Kay, 2000). Indeed, in certain small neurons, a single channel opening can initiate an action potential (Lynch & Barry, 1989). Channel noise has been shown to be significant or essential in other neural systems (White, Klink, Alonso, & Kay, 1998; Dorval & White, 2005; Kole, Hallerman, & Stuart, 2006), including bursting neurons, where Rowat and Elson (2004) showed that the addition of channel noise to a model cell qualitatively reproduced certain burst statistics of an identified biological neuron. In this letter, we show that an isolated model neuron, described by the stochastic Hodgkin-Huxley (HH) equations, generates intrinsic spike-time variability with a characteristic bimodal distribution of interspike intervals (ISIs): a gaussian peak at the shortest intervals in addition to an exponential tail. This bimodal distribution has been observed in the firing of a variety of biological neurons (Gerstein & Kiang, 1960; Nakahama et al., 1968; Lamarre et al., 1971; Ruggero, 1973; Armstrong-James, 1975; Bassant, 1976; Burns & Webb, 1976; Siebler, Koller, Stichel, Muller, & Freund, 1993; de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Duchamp-Viret, Kostal, Chaput, Lansky, & Rospars, 2005). Moreover, in the stochastic HH model as in biology, the location of the initial peak lies in the gamma-frequency range (40–100 Hz). Put another way, gamma-frequency firing is a favored behavior of the stochastic HH model. In their deterministic form, the HH equations (Hodgkin & Huxley, 1952), originally introduced to describe action potentials in the squid giant axon, remain a foundation stone of modern neuroscience, and the formalism introduced by this model—also known as conductance based—is widely used today (Meunier & Segev, 2002).
The basic outlines of the HH equations' dynamics in the physiological range are well known (e.g., Hassard, 1978; Troy, 1978; Best, 1979; Rinzel & Miller, 1980; Labouriau, 1985; Guckenheimer & Oliva, 2002), and much of what remains unknown in the deterministic case has been illuminated by Guckenheimer and Labouriau (1993). Many neurons, including several cortical types (Markram, Toledo-Rodriguez, Wang, Gupta, Silberberg, & Caizhi, 2004; Tateno, Harsch, & Robinson, 2004), can be classified as Hodgkin's class II (Hodgkin, 1948)—a type of neuron that, as applied current is increased in steps, begins spiking with positive (nonzero) frequency. Class II behavior arises in most cases from the presence of a subcritical Hopf bifurcation. The HH model is a four-dimensional system, and several reductions to a two-dimensional system have been introduced. Of the latter, the Fitzhugh-Nagumo model is perhaps the best known (Fitzhugh, 1961) and has been heavily investigated as a prototype excitable system in biological, chemical, and physical contexts (Lindner, Garcia-Ojalvo, Neiman, & Schimansky-Geier, 2004). While not an HH model reduction, the Morris-Lecar neuron model is a conductance-based model of voltage oscillations in the barnacle giant muscle fiber (Morris & Lecar, 1981). It has been analyzed extensively (Rinzel & Ermentrout, 1998) and is often used as a good, qualitatively quite accurate two-dimensional model of neuronal spiking.
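To make the link between a bimodal ISI distribution and a high CV concrete, the following sketch is our own illustration; the mixture parameters (80% gaussian intraburst ISIs of 15 ± 2 ms, 20% exponential interburst ISIs with mean 500 ms) are invented for illustration and are not the simulation values reported in this letter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative mixture: with probability p an intraburst ISI from a narrow
# gaussian in the gamma range, otherwise a long exponential interburst ISI.
p, mu_burst, sd_burst, mu_tail = 0.8, 15.0, 2.0, 500.0   # all in ms

n = 1_000_000
from_burst = rng.random(n) < p
isi = np.where(from_burst,
               rng.normal(mu_burst, sd_burst, n),
               rng.exponential(mu_tail, n))

cv_sample = isi.std() / isi.mean()

# Closed form for the same mixture, from E[X] and E[X^2] of the components.
m = p * mu_burst + (1 - p) * mu_tail
ex2 = p * (mu_burst**2 + sd_burst**2) + (1 - p) * 2 * mu_tail**2
cv_exact = np.sqrt(ex2 - m**2) / m

print(cv_sample, cv_exact)
```

Even a modest exponential tail pushes the CV of such a mixture well above 1, consistent with the high values quoted in the abstract.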
ISI Statistics in the Stochastic Hodgkin-Huxley Model
1217
Table 1: Hodgkin-Huxley Parameters Used in Simulations.

C      Membrane capacitance             1 µF/cm²
VL     Leak reversal potential          −54.4 mV
gL     Leak conductance                 0.3 mS/cm²
VK     Potassium reversal potential     −77 mV
ḡK     Maximal potassium conductance    36 mS/cm²
ρK     Potassium channel density        18 channels/µm²
NK     Number of potassium channels     ρK × area (see equation 2.8)
VNa    Sodium reversal potential        50 mV
ḡNa    Maximal sodium conductance       120 mS/cm²
ρNa    Sodium channel density           60 channels/µm²
NNa    Number of sodium channels        ρNa × area (see equation 2.8)
Using a stochastic version of this reduced two-dimensional model, we propose a simple dynamical explanation for the characteristic bimodal distribution of ISIs observed in the four-dimensional stochastic HH model. These dynamics are likely to apply to the spontaneous spiking of most class II neurons, even those with complex arrays of ionic currents. Our findings indicate that a combination of irregular firing and gamma-frequency bursts can be attributed to the effects of channel noise intrinsic to a neuron.

2 Methods

2.1 The Stochastic Hodgkin-Huxley Model and Simulation. The current conservation equation for the HH model is

C dV/dt = I − [g_Na(V − V_Na) + g_K(V − V_K) + g_L(V − V_L)],  (2.1)
where C is the membrane capacitance, V is the membrane potential, I is the applied current, g_Na is the sodium conductance, g_K is the potassium conductance, g_L is the leak conductance, V_Na and V_K are the sodium and potassium reversal potentials, and V_L is the leak reversal potential. The parameter values used are given in Table 1. The Na and K conductances are written as

g_Na = m³h ḡ_Na,   g_K = n⁴ ḡ_K,  (2.2)
where the gating variables m, h, n satisfy the equations

dx/dt = α_x(V)(1 − x) − β_x(V)x,  for x = m, h, n.  (2.3)
The rate functions α_x, β_x for x = m, h, n are as follows:

α_m(V) = 0.1(V + 40)/(1 − e^(−(V+40)/10)),     β_m(V) = 4 e^(−(V+65)/18),  (2.4)
α_h(V) = 0.07 e^(−(V+65)/20),                  β_h(V) = 1/(1 + e^(−(V+35)/10)),  (2.5)
α_n(V) = 0.01(V + 55)/(1 − e^(−(V+55)/10)),    β_n(V) = 0.125 e^(−(V+65)/80).  (2.6)
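Before introducing channel noise, equations 2.1 to 2.6 with the parameters of Table 1 can be integrated directly to confirm tonic spiking. The sketch below is our own plain forward-Euler integration (the integration scheme and the choice I = 10 µA/cm², a value in the repetitive-firing regime, are ours, not taken from the letter):

```python
import numpy as np

# Rate functions, equations 2.4-2.6 (V in mV, rates in 1/ms).
def a_m(V):
    if abs(V + 40.0) < 1e-7:       # removable singularity; limit is 1.0
        return 1.0
    return 0.1 * (V + 40.0) / (1.0 - np.exp(-(V + 40.0) / 10.0))
def b_m(V): return 4.0 * np.exp(-(V + 65.0) / 18.0)
def a_h(V): return 0.07 * np.exp(-(V + 65.0) / 20.0)
def b_h(V): return 1.0 / (1.0 + np.exp(-(V + 35.0) / 10.0))
def a_n(V):
    if abs(V + 55.0) < 1e-7:       # removable singularity; limit is 0.1
        return 0.1
    return 0.01 * (V + 55.0) / (1.0 - np.exp(-(V + 55.0) / 10.0))
def b_n(V): return 0.125 * np.exp(-(V + 65.0) / 80.0)

# Table 1 parameters.
C, I = 1.0, 10.0                       # µF/cm², µA/cm²
gNa, gK, gL = 120.0, 36.0, 0.3         # mS/cm²
VNa, VK, VL = 50.0, -77.0, -54.4       # mV

dt, T = 0.01, 100.0                    # ms
V = -65.0
m = a_m(V) / (a_m(V) + b_m(V))         # gates start at steady state
h = a_h(V) / (a_h(V) + b_h(V))
n = a_n(V) / (a_n(V) + b_n(V))

Vtrace = []
for _ in range(int(T / dt)):
    INa = gNa * m**3 * h * (V - VNa)
    IK = gK * n**4 * (V - VK)
    IL = gL * (V - VL)
    V += dt / C * (I - INa - IK - IL)  # equation 2.1
    m += dt * (a_m(V) * (1 - m) - b_m(V) * m)   # equation 2.3
    h += dt * (a_h(V) * (1 - h) - b_h(V) * h)
    n += dt * (a_n(V) * (1 - n) - b_n(V) * n)
    Vtrace.append(V)

Vtrace = np.array(Vtrace)
spikes = int(np.sum((Vtrace[1:] > 0) & (Vtrace[:-1] <= 0)))  # 0 mV crossings
print(spikes, Vtrace.max())
```

The stochastic version described next replaces the gating equations by explicit channel-state bookkeeping but leaves the membrane equation unchanged.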
In the stochastic HH model, the gating equations 2.3 are not used, and we proceed as follows. The voltage-dependent conductances for the sodium and potassium currents are given by

g_Na = (Q_Na/N_Na) ḡ_Na,   g_K = (Q_K/N_K) ḡ_K,  (2.7)
where N_Na is the total number of sodium channels, N_K the total number of potassium channels, Q_Na and Q_K are the numbers of sodium and potassium channels in the open state, and ḡ_Na and ḡ_K are the maximum sodium and potassium conductances. N_Na and N_K are given by

N_Na = ρ_Na × area,   N_K = ρ_K × area,  (2.8)
where ρ Na and ρ K are the sodium and potassium channel densities. In this letter, ρ Na and ρ K are fixed. Thus, when the membrane area is changed, the total number of channels, NNa and NK , is changed. Each potassium channel has four identical gates. Each gate is either open or closed, and the channel conducts potassium current only when all four gates are open. Each potassium gate is a Markov process with voltagedependent opening and closing rates given by the classical HodgkinHuxley functions αn (V) and βn (V) and the kinetic scheme for the operation of a single potassium channel is the following:
n0 ⇌ n1 ⇌ n2 ⇌ n3 ⇌ n4,  (2.9)

with gate-opening rates 4α_n, 3α_n, 2α_n, α_n (left to right) and gate-closing rates β_n, 2β_n, 3β_n, 4β_n (right to left).
Here, n4 is the state in which all four gates are open. For each state transition, for example, n0 → n1, the function labeling it (here, 4αn ) is the transition rate. A sodium channel has three activation (m-) gates and one inactivation (h-) gate, and conducts sodium current only when all four gates are open. With the opening and closing rates αm (V), βm (V), αh (V), and βh (V) for the
m-gates and the h-gate, the associated Markov kinetic scheme is then
m0h0 ⇌ m1h0 ⇌ m2h0 ⇌ m3h0
  ⇅       ⇅       ⇅       ⇅
m0h1 ⇌ m1h1 ⇌ m2h1 ⇌ m3h1  (2.10)

with m-gate opening rates 3α_m, 2α_m, α_m (left to right in each row), m-gate closing rates β_m, 2β_m, 3β_m (right to left), and with the vertical transitions mih0 → mih1 occurring at rate α_h and mih1 → mih0 at rate β_h.
A channel is in state mihj if i m-gates and j h-gates are open, 0 ≤ i ≤ 3, j = 0, 1. A sodium channel is open only when it is in state m3h1. The stochastic Hodgkin-Huxley model consists of equations 2.1, 2.4, 2.5, 2.6, and 2.7; N_K copies of equation 2.9; and N_Na copies of equation 2.10. The brute-force algorithm keeps track of the states of all the sodium and potassium channels, computationally a hugely inefficient scheme. The first time one of the (N_Na + N_K) channels changes state, the membrane equation, 2.1, is integrated up to that instant in time, and the process is repeated. An exact, and much better, algorithm (Gillespie, 1977; Chow & White, 1996) does not track the state of every channel but instead tracks the number of channels in each state, as follows. In schemes 2.9 and 2.10, there are 13 different channel states x, for x = n0, n1, ..., n4, m0h0, ..., m3h1, and 28 transitions y with rates {R_y(V) : y = n0 → n1, ..., n3 → n4, m0h0 → m1h0, ..., m2h1 → m1h1, ..., m3h0 → m3h1}—for example, R_n1→n2 = 3α_n. For each x, N_x is the number of channels in state x. The algorithm tracks the global state (N_n0, N_n1, ..., N_m3h1), where N_n0 + N_n1 + N_n2 + N_n3 + N_n4 = N_K; N_m0h0 + · · · + N_m3h1 = N_Na; Q_K = N_n4; and Q_Na = N_m3h1. The algorithm has five steps:

Step 1: Select a transition time t_tr from the current global state (N_n0, N_n1, ..., N_m3h1) to its immediate successor.
Step 1a: Update the transition rates {R_y(V)}.
Step 1b: For each z = 1, 2, ..., 28, compute R_z^tot = N_source(y) R_y(V), where z = ord(y), z = 1, 2, ..., 28, is a fixed ordering of the transitions y = n0 → n1, ..., m3h0 → m3h1.
Step 1c: Let λ = R1^tot + R2^tot + · · · + R28^tot, the global transition rate out of the current global state. The waiting time t until the next transition then has the probability density

λ e^{−λ(t−t0)} .
(2.11)
Step 1d: Let ttr = ln(r1^{−1})/λ, with r1 a pseudorandom number in [0,1]; ttr is then a sample from the PDF of equation 2.11.

Step 2: Integrate equation 2.1 from t0 to t = t0 + ttr to get a new value of V.

Step 3: Select the transition taken. Using r2, a pseudorandom number in [0,1], find µ ∈ [1, 28] such that R1^tot + · · · + R(µ−1)^tot < λ r2 ≤ R1^tot + · · · + Rµ^tot.

Step 4: Update the state as determined by transition µ. For example, if Rµ^tot corresponds to the transition m2h1 → m1h1, then add 1 to Nm1h1 and subtract 1 from Nm2h1.

Step 5: Set t0 = t and repeat steps 1 to 4.

2.2 The Stochastic Hodgkin-Huxley Model in Voltage-Clamp Mode. Voltage clamp is an experimental technique used by physiologists; we use this mode to collect channel-noise statistics. Fix V = V0 and use the algorithm just described, but omit step 2. Step 1a can also be taken out of the loop and executed once at initialization. In all HH voltage-clamp runs, the first 20 ms were discarded to avoid transients.

2.3 The Morris-Lecar Model. The deterministic Morris-Lecar model is

C dV/dt = −ḡCa m∞(V)(V − VCa) − ḡK w(V − VK) − gL(V − VL) + I

(2.12)

dw/dt = φ (w∞(V) − w)/τw(V),

(2.13)
where

m∞(V) = 0.5 [1 + tanh((V − V1)/V2)],
w∞(V) = 0.5 [1 + tanh((V − V3)/V4)],

(2.14)

and

τw(V) = 1/cosh((V − V3)/(2 V4)).

(2.15)
Parameter values were V1 = −1.2, V2 = 18, V3 = 2, V4 = 30, ḡCa = 4.4, ḡK = 8.0, gL = 2, VK = −84, VL = −60, VCa = 120, C = 20 µF/cm2, and φ = 0.04. The injected current I was varied. The stochastic Morris-Lecar model is derived from the deterministic Morris-Lecar model in the same manner as the stochastic HH model is derived from the deterministic HH model. The K+ gating variable w is replaced by a population of NK channels, each having a single gate with two states, open and closed. We introduce the membrane area, Area, with channel density ρK, for which

NK = ρK Area
(2.16)
holds. The opening and closing transition rates are given by functions αw(V) and βw(V). These are obtained by rewriting equation 2.13 in the form

dw/dt = αw(V)(1 − w) − βw(V) w,
(2.17)
where, after a little manipulation and using equations 2.14 and 2.15,

αw(V) = 0.5 φ cosh((V − V3)/(2 V4)) [1 + tanh((V − V3)/V4)]
βw(V) = 0.5 φ cosh((V − V3)/(2 V4)) [1 − tanh((V − V3)/V4)].

(2.18)

Now we retain only equation 2.12 and replace equation 2.13 (or its alternate form, 2.17) by the two-state kinetic scheme
closed ⇄ open, with opening rate αw(V) and closing rate βw(V),

(2.19)
where αw(V) and βw(V) are given by equations 2.18. We used ρK = 20 channels/µm2, typically with area = 50 µm2, so the number of potassium channels was NK = 1000. The stochastic Morris-Lecar model in voltage-clamp mode is obtained as for the Hodgkin-Huxley model. Six different random number generators were tested, and the resulting histograms compared for I = 0. Since no significant differences were found, all subsequent simulations were done with the Mersenne Twister (Matsumoto & Nishimura, 1998). Most runs were done using double precision (64 bits), but with the larger areas of over 1000 µm2 (hence, larger channel numbers and very small times between Markov state changes), we changed to long double precision (128 bits).
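As an illustration, in voltage-clamp mode (section 2.2) the five-step algorithm of section 2.1 collapses to a few lines for the two-state K+ channels of the stochastic Morris-Lecar model. The sketch below is illustrative rather than the code used for the figures; the rate functions follow the decomposition of equation 2.17, and the clamp voltage V0 = V3 is chosen so that the open fraction should settle near w∞(V0) = 0.5.

```python
import numpy as np

# Morris-Lecar K+ gate parameters (section 2.3)
V3, V4, phi = 2.0, 30.0, 0.04

def alpha_w(V):  # opening rate
    return 0.5 * phi * np.cosh((V - V3) / (2 * V4)) * (1 + np.tanh((V - V3) / V4))

def beta_w(V):   # closing rate
    return 0.5 * phi * np.cosh((V - V3) / (2 * V4)) * (1 - np.tanh((V - V3) / V4))

def gillespie_clamp(V0, n_channels=1000, t_max=5000.0, seed=0):
    """Voltage-clamped Gillespie simulation of N two-state K+ channels.

    Returns the time-averaged open fraction, which should approach the
    deterministic steady state w_inf(V0) = alpha/(alpha + beta)."""
    rng = np.random.default_rng(seed)
    a, b = alpha_w(V0), beta_w(V0)      # rates are constant under clamp: step 1a runs once
    n_open, t, open_time = 0, 0.0, 0.0  # open_time accumulates the integral of n_open dt
    while t < t_max:
        lam = (n_channels - n_open) * a + n_open * b     # step 1c: global transition rate
        ttr = np.log(1.0 / (1.0 - rng.random())) / lam   # step 1d: exponential waiting time
        open_time += n_open * min(ttr, t_max - t)
        t += ttr
        # step 3: pick an opening or a closing in proportion to its total rate
        if rng.random() * lam < (n_channels - n_open) * a:
            n_open += 1      # step 4: one channel opens
        else:
            n_open -= 1      # step 4: one channel closes
    return open_time / (t_max * n_channels)

print(gillespie_clamp(V0=2.0))  # close to w_inf = 0.5 at V0 = V3
```

Because the rates are voltage independent under clamp, no membrane integration (step 2) is needed, which is exactly the simplification described in section 2.2.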
2.4 Hodgkin-Huxley Phase Space and Q Space. In the deterministic HH model, phase space is the set of 4-tuples (V, m, h, n). In the stochastic HH model, the natural space to work with is the space of 3-tuples (V, QNa, QK), where QNa = Nm3h1 is the number of open Na channels and QK = Nn4 is the number of open K channels. Knowing NNa and NK, we can move from (V, m, h, n) space to (V, QNa, QK) space via

QNa = m^3 h NNa, and QK = n^4 NK,
(2.20)
by equation 2.7. There is no single 1-1 map from (V, QNa, QK) back to (V, m, h, n) space. One choice is the following, assuming the channel-state numbers Nm0h1, Nm1h1, Nm2h1, and Nm3h0 are known. For the single-gate open probability n, assuming symmetry among the gates, use

n = (QK/NK)^{1/4}.

For sodium channels, by diagram 2.10, we use

h = (Nm0h1 + Nm1h1 + Nm2h1 + QNa)/NNa

and

m = ((QNa + Nm3h0)/NNa)^{1/3}.
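The forward map of equation 2.20 and the inverse just described can be checked in a few lines. The channel-state counts used below are expected (fractional) occupancies for illustrative gate values, not simulated integer counts; with consistent occupancies the round trip is exact.

```python
from math import comb

def gates_to_Q(m, h, n, N_Na, N_K):
    """Forward map, equation 2.20: numbers of open Na and K channels."""
    return m**3 * h * N_Na, n**4 * N_K

def Q_to_gates(Q_Na, Q_K, N_Na, N_K, N_m0h1, N_m1h1, N_m2h1, N_m3h0):
    """One choice of inverse map, assuming the listed channel-state counts are known."""
    n = (Q_K / N_K) ** 0.25
    h = (N_m0h1 + N_m1h1 + N_m2h1 + Q_Na) / N_Na
    m = ((Q_Na + N_m3h0) / N_Na) ** (1.0 / 3.0)
    return m, h, n

# Round trip on expected occupancies for m = 0.2, h = 0.6, n = 0.4:
m, h, n, N_Na, N_K = 0.2, 0.6, 0.4, 24000, 7200
occ = lambda i, j: comb(3, i) * m**i * (1 - m)**(3 - i) * (h if j else 1 - h) * N_Na
Q_Na, Q_K = gates_to_Q(m, h, n, N_Na, N_K)
print(Q_to_gates(Q_Na, Q_K, N_Na, N_K, occ(0, 1), occ(1, 1), occ(2, 1), occ(3, 0)))
# recovers (0.2, 0.6, 0.4) up to floating-point error
```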
3 Results

When the ISI time series used for Figure 1A is converted to hertz and the frequency histogram plotted, two significant peaks appear (see Figure 1B); the inset shows the original case with I = 0 and A = 100 µm2 (Chow & White, 1996), with a barely noticeable second peak. In both histograms, the second peak is in the gamma-frequency range (40–100 Hz). This suggested plotting the frequency histograms for a wide range of input currents (see Figure 2), which revealed that the right-hand peak frequency increased linearly with input current. Next, we plotted the right-hand peak frequency against input current for two areas, A = 100 and A = 400 µm2 (see Figure 3, top two traces). On the same graph, the frequency of the deterministic Hodgkin-Huxley equations (Rinzel & Miller, 1980) is plotted, showing that for I ≥ 7 µA/cm2, the right-hand peaks in the frequency histogram closely track, at a slightly higher frequency, the deterministic frequency.
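The sliding-bin histograms used throughout the figures can be sketched as follows. The ISI values below are placeholders, not data from the model, and the light smoothing applied to the published histograms is omitted.

```python
import numpy as np

def sliding_histogram(values, bin_width, step, v_min, v_max):
    """Slide a bin of fixed width along the axis in small steps, recording
    the count at each position, normalized so the curve has unit area."""
    centers = np.arange(v_min, v_max, step)
    counts = np.array([np.sum(np.abs(values - c) <= bin_width / 2) for c in centers],
                      dtype=float)
    return centers, counts / (counts.sum() * step)

# Instantaneous frequency in Hz from ISIs in ms, as in Figure 1B
isis_ms = np.array([18.0, 20.0, 250.0, 17.0, 19.0, 300.0])   # placeholder values
freqs_hz = 1000.0 / isis_ms
centers, density = sliding_histogram(freqs_hz, bin_width=2.0, step=0.2,
                                     v_min=0.0, v_max=80.0)
```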
Figure 1: A prominent initial peak in ISI histograms generates a peak in the frequency histogram in the 40–100 Hz range. (A) Smoothed sliding histogram of ISIs generated by the stochastic HH model with I = 3 µA/cm2 and area A = 400 µm2 (corresponding channel numbers NNa = 24,000, NK = 7200). The histogram was generated by sliding a bin of width 2 ms along the data in steps of 0.2 ms and then smoothing it slightly. For most histograms, n = 30,000 data points were used, except when the mean ISI was higher than a few seconds, when n ranged as low as 3000. Inset: Area = 100 µm2, I = 0, the case considered by Chow and White (1996). All histogram bins are normalized so each curve is an approximate probability density function. (B) Smoothed sliding histograms of the frequencies obtained from the ISI data sets used in A, where frequency = 1000/ISI.
Figure 2: Frequency histograms for input current I = 3, 4, . . . ,10 µA/cm2 , A = 400 µm2 . The numerals indicate the current corresponding to each histogram curve.
Figure 3: Comparison of right-hand peaks of the frequency histograms in Figure 2 with the frequency of the deterministic Hodgkin-Huxley model. Squares: right-hand peaks plotted against input current I for area = 400 µm2. Circles: right-hand peaks for area A = 100 µm2. The solid curve is the deterministic frequency. The left and right vertical dotted lines are where the deterministic frequency drops to 0 as I decreases (left line) or jumps up from 0 as I increases (right line). For the meaning of Iν and I1, see the text and Figure 6.
Figure 4: Sample traces for different input currents I . Area = 400 µm2 , except for the bottom trace, which has area = 2000 µm2 . Numerals at the left give the input current I (µA/cm2 ).
Figure 4 shows membrane potential traces underlying the ISIs used for the histograms of Figures 1 to 3, and Figure 5 shows basic statistics (mean, coefficient of variation) of the ISIs. The grouping of spikes into gamma-frequency bursts suggested the unlikely possibility that the stochastic HH model was in fact bursty. We disproved this by taking a 30,000-long ISI time series and examining the distribution of gamma-frequency bursts in 10,000 shuffles. The numbers of pairs, triples, 4-tuples, and 5-tuples of short ISIs in the original time series were not significantly different from their overall distributions in the 10,000 shuffles. This was done twice. We conclude that the distribution of short and longer ISIs within any one ISI time series is random. This has also been found in experimental data (Stiefel et al., 2006).

4 Mechanism

The frequency plots in Figure 3 strongly suggest that the right-hand peaks in the frequency histograms in Figure 2—and hence the initial narrow
Figure 5: ISI statistics as a function of input current. (A) Mean ISI for each I. (B) Coefficient of variation (CV). Note the log scale for means. Circles, A = 100 µm2; squares, A = 400 µm2; diamonds, A = 1000 µm2; triangles, A = 2000 µm2; inverted triangles, A = 3000 µm2. The vertical dashed lines at I = Iν and I1 are important values in the subcritical Hopf bifurcation (see Figures 6 and 3). For the vertical arrow, see Figure 6.
gaussian peak in the ISI histogram—are linked to the spiking frequency of the deterministic HH equations. If this is indeed the case, what mechanism underlies the linear continuation of the frequency plot below Iν for the stochastic HH model? Here we describe mechanisms that underlie the tracking of the deterministic frequencies above Iν and the linear extension of the peak noise-induced frequencies to lower input currents, I < Iν, where the deterministic HH model is quiescent. From Figure 1A, an obvious proposal is that there are two processes present—one generating the gaussian initial peak and the other the exponential tail—together with a mechanism for switching between them. The gaussian peak could be generated by consecutive noisy traversals of a limit cycle (LC). Commonly, an exponential distribution is generated by a random walk with an absorbing barrier; thus, the second process requires a spike to be followed by a random walk to an absorbing barrier, followed by another spike. The second process could therefore be generated by a system whose state is suddenly transported to the vicinity of a fixed point and then undergoes a random walk out to an unstable limit cycle (ULC), which is followed by another spike, that is, another orbit of a limit cycle. In a dynamical system, both processes
ISI Statistics in the Stochastic Hodgkin-Huxley Model
1227
would be present if the phase portrait has a stable limit cycle (SLC), generating the first process, around a stable fixed point (SFP), with an unstable limit cycle between them—as required in two dimensions by the Poincaré-Bendixson theorem (see Figure 7A)—generating the second process. We will often use vicinity to mean "basin of attraction." What mechanism could act to switch between these processes? Noise could perturb a trajectory from the vicinity of the SLC across the ULC and into the vicinity of the SFP, an inward switch. When the random walk outward from the SFP crosses the ULC into the vicinity of the SLC, an outward switch has occurred. When the first spike is emitted on entering the vicinity of the SLC, an ISI is generated that marks the end of a spiraling random walk and hence contributes to the exponential distribution. Subsequent traversals of the SLC without crossing the ULC result in short ISIs that contribute to the initial gaussian peak. This description is strictly true only in a two-dimensional system, since the "inside" or "outside" of a limit cycle is not defined in higher dimensions. However, it can be generalized to higher dimensions if we replace "unstable limit cycle" by the "attractor basin boundary" (ABB), a lower-dimensional manifold lying between the basins of attraction of the stable fixed point and the stable limit cycle. It is well known that the HH equations undergo a subcritical Hopf bifurcation as input current is increased (Hassard, 1978; Troy, 1978; Rinzel & Miller, 1980). The bifurcation diagram in Figure 6 (Guckenheimer & Oliva, 2002) shows how the voltage coordinate of the fixed point, and the maximum and minimum voltages reached by the stable and unstable limit cycles, change as input current I is increased. The subcritical Hopf bifurcation occurs at input current I1, while Iν, Iν < I1, is the current where the stable and unstable limit cycles merge. No limit cycle is present when I < Iν.
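Before analyzing the three current ranges in detail, the two-process proposal can be made concrete with a toy ISI generator; all numbers below are illustrative and are not derived from the HH model.

```python
import numpy as np

def two_process_isis(n_spikes=30000, period=18.0, jitter=1.5,
                     p_in=0.1, p_out=0.02, seed=0):
    """Toy ISIs from the proposed two-process mechanism.

    After each spike the trajectory stays near the SLC with probability
    1 - p_in, giving a short, gaussian-distributed ISI (one noisy
    limit-cycle traversal).  With probability p_in an inward switch
    occurs and the state random-walks near the SFP, escaping across the
    basin boundary with probability p_out per cycle, so the extra delay
    is geometric and the long-ISI tail is approximately exponential."""
    rng = np.random.default_rng(seed)
    isis = np.empty(n_spikes)
    for i in range(n_spikes):
        isi = max(rng.normal(period, jitter), 0.0)   # one traversal of the SLC
        if rng.random() < p_in:                      # inward switch
            isi += period * rng.geometric(p_out)     # cycles spent near the SFP
        isis[i] = isi
    return isis

isis = two_process_isis()
# A histogram of `isis` has a narrow gaussian peak near `period` plus a
# long exponential-like tail, qualitatively matching Figure 1A.
```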
For I between Iν and I1 (case A in Figure 6), there are a stable LC, an unstable LC, and a stable fixed point, except for a small interval J of length at most 0.12 µA/cm2, lying strictly between Iν and I1, which has two extra unstable LCs. Thus, for values of I in the middle interval of the subcritical Hopf bifurcation (Iν < I < I1), outside of the small subinterval J, the phase portrait has exactly the structure shown in Figure 7A for an ISI histogram of the form in Figure 1A. Could the transitions from fixed point to stable limit cycle, and vice versa, ever occur as a result of channel noise? A cursory glance at case A of Figure 6 suggests that the distance between the SFP and the ULC might be too large compared to the noise amplitude. However, on remembering that the ULC and SLC are four-dimensional orbits around the SFP and that the voltage ranges of both the unstable and stable limit cycles include the fixed-point voltage, the question becomes: Are there portions of the SLC where the minimal distance between the SFP and the SLC is small enough that channel-noise-induced perturbations can plausibly cause the state to switch from one attractor basin to the other? There are two requirements. First, the minimal distance
Figure 6: Bifurcation diagram of the deterministic HH model, for parameter I. Solid line, stable fixed point (SFP). Long-dashed line, unstable fixed point (UFP). Dot-dashed lines, extreme values of V on the stable limit cycle (SLC). Short-dashed lines, extreme V values on the unstable limit cycle (ULC). The S-bend in the ULC curve above the vertical arrow implies that three ULCs exist over a small range of input current. The vertical dashed lines at I = I1 , Iν divide the range of I into three cases: A, B, and C. The subcritical Hopf bifurcation occurs at I = I1 and the stable and unstable limit cycles merge at I = Iν . Adapted, with permission, from Guckenheimer & Oliva (2002).
δ1 from the SFP to the ABB must be small enough to permit a noise-induced "out" transition from the SFP vicinity across the ABB into the SLC vicinity, with nonzero probability. Second, the minimal distance δ2 from the SLC to the ABB must be small enough for a noise-induced "in" transition into the SFP basin. It is not necessary for the two minima to occur at the same point or region of the ABB. Sketches in Figures 7B and 7C illustrate this point in two dimensions. A detailed description of this mechanism will not be given for the four-dimensional HH model. Instead, the mechanisms are illustrated in a simpler, two-dimensional case: the class II Morris-Lecar model (Rinzel & Ermentrout, 1998), which also undergoes a subcritical Hopf bifurcation. Here the values for Iν and I1 are approximately 88.3 and 93.85 µA/cm2. We constructed a stochastic version of the Morris-Lecar equations, 2.12 to 2.15, and obtained ISI distributions similar to those we obtained for the
Figure 7: Phase plane sketches in region A (Iν < I < I1; see Figure 6). Dot, stable fixed point (SFP); continuous circle, stable limit cycle (SLC); dashed circle, unstable limit cycle (ULC). (A) Minimal phase portrait implied by the ISI histogram in Figure 1A. The distances between SFP and ULC and between ULC and SLC are large relative to the noise amplitude. (B) There is a single switching region where the distances between SFP, SLC, and ULC are of the same order as the noise. Thus, noise can easily perturb a trajectory from the vicinity of the fixed point over the ULC and "out" to the vicinity of the SLC and, vice versa, perturb a trajectory near the SLC "in" to the vicinity of the SFP. (C) The switching region for an "in" movement of a trajectory could be different from the switching region for an "out" movement.
stochastic HH model. Compare Figure 8A with Figure 1A, Figure 8B with Figure 2, and Figure 8C with Figure 3. Qualitatively, the match is quite good. It might be objected that the Morris-Lecar model does not have a J-interval with more than two limit cycles. We do not think this an important objection, because in our use of the HH model we have not seen any indication of unusual ISI statistics when the injected current I is within the J-interval. We conclude that the global dynamics of the Morris-Lecar model should serve as a good, or at least valuable, guide to the dynamics of the stochastic HH model that underlie the ISI distributions of Figure 1. Case A: Iν < I < I1. In Figure 9A(1), the small dotted rectangle shows that the stable fixed point and the stable and unstable limit cycles are very close together for a short section of the ULC. Therefore, in this region of phase space, noise with amplitude of the same order as the distance between the SFP and the SLC will frequently cause the trajectory to switch from the SLC vicinity to the stable fixed-point vicinity and vice versa. A region with this property will be referred to as a switching region. Figure 9A(2) shows an example of noisy switching, and Figure 9A(3) the corresponding voltage trace. Once the trajectory switches to the inside of the ULC, in the absence of further noise, it spirals in toward the SFP. Consider a fixed radius from the SFP through the closest point on the ULC and extending out beyond
Figure 8: ISI data from the stochastic Morris-Lecar model has similar qualitative properties to ISI data from the stochastic HH model. (A) ISI histogram for I = 82.0. Area = 50 µm2 (1000 K+ channels). The long exponential tail, cut off at 500 ms, extends out to 10,000 ms. (B) Superimposed frequency histograms for I = 82, 84, . . . , 98. Area = 50 µm2 . (C) Plot of the right-hand peaks in B (circles) compared with the frequency of the deterministic Morris-Lecar model. Iν ≈ 88.29 and I1 ≈ 93.86.
the ULC. Due to noise, successive intersections of the spiraling trajectory with this radius generate a one-dimensional random walk that terminates when the next intersection with the radius lies outside the ULC, in the SLC attractor basin. Now, in the absence of another perturbation back inside the ULC, the trajectory rejoins the SLC and emits another spike. Case B: I1 < I. Here there is no ULC, only an SLC and a UFP (see Figure 9B(1)). However, the fixed point is only weakly unstable, so if the trajectory is in its vicinity, it remains close for a significant duration, while the trajectory describes a slowly expanding spiral (see Figure 9B(1), inset). See, for example, the initial segment of the voltage trace in Figure 9B(3) before the first spike. Thus, noise on w keeps the state close to this fixed point for prolonged periods until eventually the outward-expanding spiral "wins" (see Figure 9B(2))—the random walk is terminated—and the trajectory returns to the SLC, whose next spike generates an ISI contributing to the negative exponential tail in the ISI PDF. Subsequent noisy ISIs contribute to the gamma-frequency peak in the frequency PDF. Case C: I < Iν. Now there is no limit cycle, just a single SFP. There are, however, remnants of the limit cycle, termed "ruts," so called because trajectories starting from a wide range of initial conditions converge onto a rut. Figure 9C(1) shows two ruts in the deterministic Morris-Lecar phase space and several trajectories converging onto each one. In a rut, the vector field is strongly contracting transverse to the trajectory. In the region of phase space to the right of the fixed point, there is no transverse contraction, so initially separate trajectories remain parallel and separate (see Figures 9C(1) and 9C(2)).
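The escape process invoked in cases A and B (a radial coordinate that random-walks near the fixed point until it is absorbed at the basin boundary) can be simulated directly; the drift and noise parameters below are illustrative only.

```python
import numpy as np

def escape_times(r_barrier=1.0, pull=0.1, sigma=0.2, n_walks=2000, seed=0):
    """First-passage times (in cycles) of a random walk to an absorbing barrier.

    Once per cycle the radius r relaxes toward the fixed point (r = 0)
    and receives a noise kick; the walk ends when r crosses the unstable
    limit cycle (or basin boundary) at r = r_barrier."""
    rng = np.random.default_rng(seed)
    times = np.empty(n_walks, dtype=int)
    for i in range(n_walks):
        r, t = 0.0, 0
        while r < r_barrier:
            r = abs((1.0 - pull) * r + sigma * rng.normal())  # one cycle of the spiral
            t += 1
        times[i] = t
    return times

times = escape_times()
# Once the walk has relaxed, the per-cycle escape hazard is roughly
# constant, so the escape-time tail is approximately geometric and the
# resulting long-ISI tail approximately exponential.
```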
In a similar case (I = 88.2 instead of 82.0), Tateno and Pakdaman (2004) show the distribution of the ML dynamics in phase space: ruts appear as small ridges in approximately the same position, while in the nonrut (noncontraction) areas, the ridges disappear because the distribution is spread out. Any trajectory starting more than about 0.03 vertically below the fixed point emits a spike and enters the rut that constitutes the downstroke of the MLE spike. The arrow in the inset graph shows the location of the "threshold" between trajectories that do and do not generate spikes.

4.1 Where Is the Switching Region in the Hodgkin-Huxley Model? The mechanisms just described for the two-dimensional (2D) stochastic Morris-Lecar model generalize, approximately, to the 4D stochastic HH model. An exact description is beyond the scope of this letter, in part because 4D objects are difficult to visualize, but also because in 4D, the boundary between two attractor basins (e.g., between an SFP and an SLC) is a 3D manifold, which is very likely to be fractal (McDonald, Grebogi, Ott, & Yorke, 1985; Guckenheimer & Oliva, 2002) in some range of I. Another difficulty in case A of the HH subcritical Hopf bifurcation (see Figure 6) is this: the ULC is a 1D object embedded in the 3D boundary between the basins of attraction of the SFP and the SLC, but a trajectory switching from
one basin to the other might not pass close to the ULC, despite appearances in Figure 6. Corresponding to case C in the Morris-Lecar model (see Figure 9C), Figure 10 shows the projection of a noise-free HH trajectory (I = 6 µA/cm2) onto the (m, n)-plane—one of six possible 2D projections—that emits one spike and then spirals into the SFP. The projection of the switching region where small perturbations can generate a spike is enclosed in the ellipse in the inset. There is no ABB, since there is only one attractor. However, the dashed line segment indicates something similar: the (2D projection of the) boundary between trajectories that emit a full-blown spike and those continuing with small-amplitude spirals into the SFP. The dashed line segment in Figure 10 is like a threshold. However, the concept of a threshold is problematic for a system with more than one variable, because the hidden variables may not have their expected values (see Guckenheimer & Oliva, 2002, for a rigorous definition). If the system is in the vicinity of the SFP with coordinates (V0, m0, h0, n0) but previous perturbations have caused (m, h, n) to momentarily have values different from (m0, h0, n0), then the effective V threshold to generate a spike may now be very different. This is seen experimentally: in cortical neurons, the previous voltage behavior can affect the spike threshold by as much as 10 mV (Azouz & Gray, 2000). Also, in a very small I-interval within the S-turns in the ULC amplitude plot (see the thick arrow in Figure 6), there is a difficulty of a different kind: at I ≈ 7.9219, Guckenheimer and Oliva (2002) found a chaotic invariant set and conjectured that there are regions of phase space where every neighborhood, however small, contains two kinds of states:
Figure 9: Switching regions in Morris-Lecar phase portraits. The parameters are as in Rinzel and Ermentrout (1998), subcritical Hopf bifurcation. For one instance of each case A, B, C (see Figure 6), we show (1) the phase portrait of the deterministic (noise-free) system, with the switching region shown by a small dotted rectangle and expanded as an inset graph; (2) a noisy trajectory overlaid on the noise-free phase portrait; and (3) a noisy voltage trace corresponding to the noisy trajectory in graph 2. In each case, graphs 1 and 2 have identical scales. Scale bars in graph 1: horizontal bars: main graph, 10 mV; inset graph, 1 mV. Vertical bars: main graph, 0.1; inset graph, 0.01. Case A: I = 90 µA/cm2, which has an SFP, an SLC, and a ULC. Case B: I = 98 µA/cm2; a UFP and an SLC. The solid line shows a trajectory spiraling out from near the UFP to join the SLC; the latter is shown by dots at equal time intervals. Case C: I = 82 µA/cm2. Here there is only a stable fixed point and no LC, but "ruts" in phase space enable noise-induced spikes to occur. Segments of nine trajectories are shown, starting from points very close to the SFP. Six fall into the upper rut and emit a spike, while the remaining three immediately spiral into the SFP. The small arrow in the inset graph indicates the threshold between spiking and nonspiking initial points.
Figure 10: Hopf case C in the Hodgkin-Huxley model with I = 6 µA/cm2 , so there is no LC. (A) Projection of deterministic trajectory onto the (m, n) phaseplane. (B) Expansion of the region around SFP. The dashed curve indicates the approximate location of the attractor basin boundary or “threshold”; the dotted ellipse indicates the switching region in this projection.
those leading to action potentials and those leading to the stable fixed point. If true, then, mathematically, no threshold can be defined for this value of the injected current. In order to truly locate a switching region for case A of the Hopf bifurcation, one must compute the ABB between the SFP and the SLC and then measure two minimal distances: δ1 from the SFP out to the ABB, and δ2 inward from the SLC to the ABB. The endings of δ1 and δ2 in the ABB may be close, but with current knowledge, they could be at different locations (see the sketch in Figure 7C). If the latter, then switching in and out of the vicinity of the SFP will occur at different phases of the spiking cycle. Cases B and C require different definitions of the switching region, which will not be attempted. These difficulties are avoided by a simplistic approach. We computed the minimal distance between the fixed point and the stable limit cycle in (V, m, h, n)-space. It is also useful to look at this in the three-dimensional (V, QNa, QK)-, or Q-, space (see Figure 11). If these distances are small enough, then basin switching can be induced by small-amplitude noise. Since no LC exists in case C, only ruts, these plots do not go below Iν. See the location of Iν in Figures 3 and 6.
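The computation behind Figure 11 reduces to the following, given the fixed point and one sampled period of the limit cycle (the orbit samples would come from integrating the deterministic equations; none are reproduced here).

```python
import numpy as np

def min_distance_to_orbit(fixed_point, orbit):
    """Minimal Euclidean distance from a point to a sampled orbit.

    fixed_point: array of shape (d,); orbit: array of shape (n_samples, d),
    e.g. one period of the limit cycle sampled at small time steps."""
    return float(np.min(np.linalg.norm(orbit - fixed_point, axis=1)))

def to_Q_space(states, N_Na=24000, N_K=7200):
    """Map sampled (V, m, h, n) states to (V, Q_Na, Q_K) via equation 2.20."""
    V, m, h, n = states.T
    return np.column_stack([V, m**3 * h * N_Na, n**4 * N_K])

# The two curves of Figure 11 are then
#   min_distance_to_orbit(fp_4d, orbit_4d)                (four-dimensional space)
#   min_distance_to_orbit(to_Q_space(fp_4d[None])[0],
#                         to_Q_space(orbit_4d))           (Q-space)
# evaluated at each value of the injected current I.
```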
Figure 11: These two curves track the closest approach of the HH stable limit cycle to the fixed point for 6.5 ≤ I ≤ 15, in two deterministic phase spaces. The lower trace uses the standard four-dimensional space (V, m, h, n), while the upper uses the three-dimensional space (V, Q Na , Q K ) where Q Na = m3 h × NNa , and Q K = n4 × NK . Area = 400 µm2 .
As the input current increases from Iν to 15, the minimal distance in (V, m, h, n)-space from the FP to the SLC increases from 0.04 to nearly 0.07, while the minimal distance in Q-space increases from under 7 to over 12. In Q-space, "minimal distance X" means X channel openings or closings in the direction of the fixed point or the SLC, depending on whether the trajectory is going "in" or "out." If one supposes that the ABB or ULC lies approximately halfway,1 then only X/2 channel changes are needed. Thus, the expected probability of basin switching goes down as I increases, in agreement with the sequence of traces in Figure 4. More exactly, the probability of switching "in" from the SLC vicinity to the FP vicinity becomes lower, while the probability of an "out" switch decreases less, due to the weakening of the FP's stability. With I = 7 µA/cm2, the distance from the SFP to points on the SLC during one spiking cycle is shown in Figures 12B and 12C. The distance in (V, m, h, n)-space is dominated by the voltage V, since 0 ≤ m, h, n ≤ 1, whereas the range of V is over 100 mV. Thus, the minimal distance occurs at two downward spikes created when V crosses the SFP rest potential
1 Here I assume that basic dynamical features such as limit cycles in (V, m, h, n)-space map into similar features in Q-space.
Figure 12: (A) The voltage trace for the deterministic HH equations with I = 7 µA/cm2 (thick line). The dotted line is a trace from the stochastic HH model for the same I and with area = 400 µm2, but shifted up for clarity. (B) The Euclidean distance in (V, m, h, n) space from the fixed point (FP) to the stable limit cycle as two spikes occur. (C) The thick line is the distance to the stable limit cycle in the three-dimensional Q-space (V, QNa, QK), where QNa = m^3 h NNa and QK = n^4 NK. The dotted line is the distance in Q-space from the FP to the Q-space trajectory whose V-component is shown dotted in A. This trace has been shifted up. The arrow-headed line indicates an interval of over 6 ms where the distance in Q-space is under 20. The origin for time coincides with the peak of the first spike. The minimum distance in Q-space from the FP to the dotted trajectory is 3.24.
V0 . In Q-space, the range of V is negligible relative to the Q Na and Q K ranges (see Figure 13), and quite different locations of the minimal distance are found. The minimum distance is 6.8 at 9 ms into the 18 ms spike cycle, with another minimum of 8.5 channel steps at 14 ms; moreover, the Q-distance remains below 20 between these minima. Overall this gives a 6 ms interval with Q-distance always under 20. Most switches between SLC and SFP occur in this interval; consequently, the approximate location of the switching region is known.
Figure 13: Comparison of deterministic and stochastic trajectories in Q-space. The thick gray line shows the deterministic LC in (V, QNa, QK) phase space when I = 7 µA/cm2, and the thin black line is a path from the stochastic HH model. The area was 400 µm2, so NNa = 24,000 and NK = 7200. The fixed point, shown by a circled cross, is visible in the expanded views D, E, and F. In the (V, QNa) plane, D, the fixed point appears to be on the limit cycle, not inside, but this is merely due to the orthogonal projection being used. The stochastic trajectory used here is an extension of the one shown in Figure 12.
Figures 12A and 12C show the voltage and the distance to the SFP, respectively, for two spikes generated by the stochastic HH model. Most of the jitter occurs in the arrow-headed interval, as expected, and the trajectory
comes within 3.24 of the SFP, half the minimum distance of the deterministic trajectory. It is not trapped in the SFP vicinity but instead continues to cycle.

4.2 ISI and Channel Noise Statistics. Figure 5 shows how the overall mean and CV of the ISIs vary with current and membrane area. Since there is no geometry associated with the model, an increase in area simply means an increase in the number of channels, which in turn means the channel noise has a lower standard deviation and a higher frequency (see Figure 14). For I > I1, the noise-free dynamics is continuous spiking, so smaller-amplitude noise means a smaller probability of leaving the SLC or of spending any significant time near the weakly unstable FP. Since the intervals between continuous spiking (ICSs) become fewer and shorter, the CV quickly goes to zero as area increases. In contrast, as I decreases below I1, the ICSs can become arbitrarily long, as can be seen when the area is high (A = 3000), resulting in a very high CV. For I < Iν, the noise-free dynamics rests at the SFP, so the CV curves reflect the introduction of single spikes into a primarily nonspiking system. For case A, Iν < I < I1, the two opposing tendencies result in CV curves whose maxima increase with increased area (reduced noise amplitude/increased noise frequency). From Figure 5B, it is tempting to speculate that as the noise amplitude tends to zero, the CV curve tends toward a delta function centered on the region of the bifurcation diagram where chaos occurs and no threshold can be defined. This is not the case. The probability density

ρ(t) = α δ(t) + (1 − α) λ e^{−λt}
(4.1)
with λ > 0 and 0 < α < 1, where δ(t) is the Dirac delta, has a coefficient of variation given by [(1 + α)/(1 − α)]^(1/2), which is always larger than one and becomes arbitrarily large as α approaches 1 (Wilbur & Rinzel, 1983). If we replace the initial gaussian peaks in the ISI histograms by tall Dirac delta functions, the probability density of equation 4.1 is a rough match to the histograms. So it is to be expected that for some combinations of area and current, the ISI histograms will have arbitrarily large CV.

4.3 Characteristics of Channel Noise. It is difficult to give a concise description of the channel noise occurring in a spiking neuron. First, the time between changes in the number of open channels varies by nearly four orders of magnitude when area = 400 µm2, from a minimum of less than 10^−6 ms at the action potential peak to a maximum of about 0.06 ms, 3 to 5 ms after the deepest hyperpolarization. Second, the size of the Na-current passing when a single channel opens or closes varies according to the distance from the Na+ reversal potential. Thus, the magnitude of a single Na-channel current varies by a factor of six. It is smallest at the
ISI Statistics in the Stochastic Hodgkin-Huxley Model
Figure 14: Channel noise statistics when the stochastic HH model is held in voltage clamp. In A and B, the voltage was fixed at −60.78 mV (the steady-state value when I = 7 µA/cm2), and the CV and frequency of the Na- and K-activation are plotted as a function of area. The data in A are fitted with curves of the form a/√Area − b. In C, D, E, and F, the area = 400 µm2, while the potential varies from −75 to +31 mV. (C) Frequency. (D) Mean activation. (E) Standard deviation. (F) CV. In D, the mean Na-activation is magnified ×100. In E, the standard deviation of the Na-activation is magnified ×10.
peak and largest at the lowest potentials. Similar considerations apply to the K-current. We sidestepped the issue of characterizing channel noise in a spike cycle by adopting an experimental technique: the voltage clamp. For each membrane area, we clamped the neuron at −60.78 mV, the steady-state potential for I = 7 µA/cm2, and then recorded the Na- and K-activations and computed their coefficient of variation (CV) and frequency. Figures 14A and 14B show the CV and frequency as a function of area—hence, of NNa or of NK. As expected, the CV decreases as the inverse square root of the area, and the frequency increases linearly. In Figures 14C to 14F, the area is fixed, and the potential ranges from −75 to +31 mV. Figure 14C shows the frequency, while Figure 14D plots the mean activations, which exactly match the deterministic activations: m^3 h for the Na- and n^4 for the K-current. The curves in Figures 14C, 14D, and 14F depend only on the rate functions {αx, βx : x = m, h, n} in equations 2.4 to 2.6, but the relationship remains to be elucidated.

4.4 Oscillatory Histograms. In Figure 1A, there is a small subsidiary peak after the initial peak, which is seen in all ISI histograms we have examined. In some cases, a third, poorly defined peak can be discerned, and when large numbers of channels are used, as in Figure 15, at least three subsidiary peaks can be seen (but note the logarithmic scale in Figure 15B, used to make these very small peaks visible). On a gross scale, these subsidiary peaks are simply part of the exponential tail. On a finer scale, they stand out and can be understood in terms of our proposed mechanism. The initial peak arises from bursts of consecutive traversals of the SLC. The second peak arises when the trajectory leaves the SLC vicinity after one traversal and spirals round in the vicinity of the FP for one cycle; as the next cycle starts, it switches back to the SLC vicinity, thus creating an ISI roughly twice as long as the period of the SLC.
Compared to the first two peaks, the probability of a second spike occurring at time T = 1.5 × (SLC period) is small; this results in the significant dip between the first two peaks. That this dip does not reach zero is due to noise-induced phase randomization, which presumably increases with time, causing the subsidiary peaks to be smoothed out as the time since the last spike increases. The Na-channel numbers in Figure 15 are 60,000, 120,000, and 180,000. The CV for A = 3000 is 2.8.

4.5 Gaussian Noise Has the Same Effect as Channel Noise. We integrated the deterministic HH equations with a gaussian noise term added to equation 2.1. In Figure 16, the histogram obtained with gaussian noise and the histogram obtained with channel noise in Figure 1, both with I = 3 µA/cm2, are plotted together. This shows that the effects of channel noise and gaussian noise are essentially indistinguishable.
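The mechanism just described can be caricatured in a few lines: take each ISI to span 1 + k SLC periods, where k = 0 during a burst and k is geometrically distributed when the trajectory lingers near the FP, with gaussian jitter accumulating over the cycles. This is only a sketch under invented numbers (the period T, skip probability, and jitter below are not fitted to the HH model), but it reproduces the tall initial peak, the subsidiary peaks with a deep dip between them, and a CV above one:

```python
import numpy as np

rng = np.random.default_rng(0)

T = 15.0        # assumed SLC period, ms (gamma range)
p_burst = 0.77  # probability that a cycle stays on the SLC (k = 0)
jitter = 1.5    # ms of phase jitter per traversal

n = 100_000
# k extra cycles spent near the FP: 0 within a burst, otherwise
# geometric with mean 4 cycles.
lingers = rng.random(n) >= p_burst
k = np.where(lingers, rng.geometric(0.25, size=n), 0)
isi = (1 + k) * T + rng.normal(0.0, jitter * np.sqrt(1.0 + k))

counts, _ = np.histogram(isi, bins=np.arange(0.0, 121.0, 1.0))
cv = isi.std() / isi.mean()
print(f"CV = {cv:.2f}")
print("peak at T:", counts[15], " dip:", counts[22], " peak at 2T:", counts[30])
```

With these numbers the histogram has modes near T and 2T separated by a deep dip, and the CV comes out above one, in the range reported for large areas.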
ISI Statistics in the Stochastic Hodgkin-Huxley Model
Figure 15: ISI histograms for large areas (1000, 2000, 3000 µm2), shown on a linear scale in A and a logarithmic scale in B.
Figure 16: An ISI histogram nearly identical to the histogram of Figure 1A is obtained when gaussian noise is added to the HH current balance equation, 2.1. Noise with standard deviation 0.1 was used, with integration time step 0.01 ms. This histogram curve, in gray, is overlaid on the curve from Figure 1A, in black. The inset graph shows the gray curve again on a log scale.
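The recipe of section 4.5 (adding a gaussian noise term to the current-balance equation and integrating with a fixed time step) amounts to an Euler-Maruyama scheme. A minimal sketch, using the FitzHugh-Nagumo equations as a stand-in excitable model rather than the full HH system (the parameters are the textbook ones; the noise amplitude is invented):

```python
import numpy as np

rng = np.random.default_rng(1)

def fhn_spike_count(sigma, dt=0.05, steps=100_000, a=0.7, b=0.8, eps=0.08):
    """Euler-Maruyama integration of FitzHugh-Nagumo with gaussian noise
    added to the voltage equation; returns the number of upward crossings
    of v = 0 (i.e., noise-induced spikes)."""
    v, w = -1.2, -0.625          # near the resting fixed point (I = 0)
    spikes, above = 0, False
    for _ in range(steps):
        dv = (v - v**3 / 3.0 - w) * dt + sigma * np.sqrt(dt) * rng.normal()
        dw = eps * (v + a - b * w) * dt
        v, w = v + dv, w + dw
        if v > 0.0 and not above:
            spikes += 1
        above = v > 0.0
    return spikes

print("no noise:", fhn_spike_count(0.0))    # excitable rest: no spikes
print("with noise:", fhn_spike_count(0.4))  # noise-induced spiking
```

In the excitable regime the deterministic system sits at rest, and spikes appear only once the noise term occasionally pushes the voltage over the knee of the nullcline, just as channel noise does in the stochastic HH model.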
4.6 On the Existence of Ruts. Every SLC has a neighborhood—that is, an annulus in a 2D system or a tube in a 3D system—where the vector field transverse to the SLC is contracting everywhere, with different strengths at different phases. A ULC has a neighborhood where the local vector field at each point of the ULC is expanding everywhere. When a bifurcation occurs and the SLC no longer exists, this means that somewhere along the SLC, the width of the neighborhood of contraction has shrunk to zero. In the geographical view of a dynamical system that depends continuously on parameters—in which an SFP is a hollow and a UFP is a rounded peak—there is a “valley” at every point of a SLC, but as a parameter I passes through a bifurcation value, at some point of the SLC the valley opens out and becomes flat. Except in a completely symmetric system, which is unlikely to occur in applications, there will still be regions close to where the SLC used to be, with transverse contracting neighborhoods still in existence. These are the ruts. In this argument, we appeal only to the continuous dependence of the vector field on the parameter I and to the disappearance of a SLC, not to the width of the interval [Iν, I1]. The sudden disappearance of an SLC as I decreases is sufficient to ensure the existence of ruts, independent of the width of the region with two or more attractors.

5 Discussion

We have shown that the stochastic HH model generates a bimodal ISI distribution and described in qualitative terms the underlying dynamical mechanism. Although the description is based on a two-dimensional model, our extensive modeling data suggest that the same dynamical mechanism will be present in more complex models, provided a subcritical Hopf bifurcation and a switching region are present. More formally, we conjecture that a bimodal ISI distribution is generated in any noisy system that passes through a subcritical Hopf bifurcation with a switching region.
The noise does not need to be channel noise; gaussian noise will suffice (see Figure 16). Our simulations have been done with the stochastic HH equations, whose parameters are based on squid, a cold-blooded creature. The dynamical mechanisms described are quite general and can be expected to apply to any class II neuron, not just the HH model or the squid axon. Thus, any such neuron will have a bimodal ISI distribution in the presence of channel or synaptic noise. In particular, it should apply to all class II cortical cells. Experimentally, bimodal ISI histograms are common. They have been observed in many mammalian neurons, either spontaneously or in the presence of stimuli: spontaneous activity of a ventro-basal neuron during sleep (Nakahama et al., 1968), ventrolateral thalamic neurons in slow-wave and fast-wave sleep (Lamarre et al., 1971), auditory nerve fiber responses to white noise (Ruggero, 1973), adult rat primary somatosensory cortex (Armstrong-James, 1975), pyramidal cells in rabbit (Bassant, 1976), cat
auditory cortex awake in the light (Burns & Webb, 1976) and either spontaneous or in response to clicks (Gerstein & Kiang, 1960; Kiang, 1965), cultured hippocampal cells (Siebler et al., 1993), and rat olfactory receptor neurons (Duchamp-Viret et al., 2005). Recordings that look very similar to the traces in Figure 4, but without accompanying ISI histograms, have been observed in cortical fast-spiking cells (Cauli, Audinat, Lambolez, Angulo, Ropert, Tsuzuki, Hestrin, & Rossier, 1997), stellate cells in medial entorhinal cortex (Klink & Alonso, 1993), and stuttering and irregular-spiking inhibitory cortical interneurons (Markram et al., 2004). In all these observations, the initial gaussian peak is, or would be if the histogram were constructed, in the range of 10 to 30 ms, roughly the gamma-frequency range. A periodic histogram similar to Figure 15 was observed in thalamic somatic sensory neurons (Poggio & Viernstein, 1964). The spontaneous activity of frog olfactory neurons generates similar bimodal histograms on a much slower timescale (3–5 Hz) (Rospars, Lansky, Vaillant, Duchamp-Viret, & Duchamp, 1994), while a motion-sensitive neuron in the fly generates a bimodal histogram on a much faster timescale (150 Hz) (de Ruyter van Steveninck et al., 1997). Guttman and Barnhill (1970), using space-clamped squid axon treated with low calcium, observed “skip runs,” which look very similar to our model traces (see Figure 4), but they were unable to explain them by modeling. Guttman, Lewis, and Rinzel (1980), again with squid axon bathed in low calcium, showed the coexistence of two states in the squid axon—repetitive firing and subthreshold oscillations—and used small, well-chosen perturbations to annihilate repetitive firing in both the axon and the model. This confirms the region of bistability in the HH bifurcation diagram (see Figure 6).
Gerstein and Mandelbrot (1964) used a random walk model to generate ISI histograms with a non-simple-exponential tail that fits data from several types of neurons, but not from those under investigation here, which generate bimodal ISI histograms. Figure 17 shows an ISI histogram that closely matches the form of the ISI histogram in Figure 1A. It was recorded by Kiang (1965) from an auditory nerve fiber in an anesthetized cat and has a timescale approximately eight times faster than the squid axon's. Note that the characteristic dip in the histogram between the initial peak and the exponential tail is clearly visible. While on the one hand it appears from the literature that bimodal ISI histograms occur in many places in the nervous system, on the other hand we have shown that with the introduction of a source of stochasticity, the HH equations generate bimodal ISI histograms of the same form, that is, a high initial peak followed by an exponential tail. It seems natural, then, to suggest that the stochastic HH model could be used to model the firing properties of the spike initiation zone of other classes of neurons as well as the squid axon. Obviously, the HH
Figure 17: An ISI histogram recorded from spontaneous firing of an auditory nerve fiber in cat. (Reproduced, with permission, from Kiang, 1965.) Qualitatively, this has the same form as the histograms generated by the HH model (see Figure 1). However, the frequency associated with the initial peak is much higher—approximately 300 to 500 Hz.
parameters would need considerable changes to generate spikes on the timescales of the frog olfactory neurons or the motion-sensitive neurons in the fly. In related work, Yu and Lewis (1989) showed that the response of the HH model could be linearized by the presence of noise: using high-amplitude noise, they showed that the simulation data in Figure 3 extend linearly to near zero frequency. Tateno and Pakdaman (2004) investigated the noisy class II Morris-Lecar model and described the phase-space distribution in the same regimes we investigated; however, ISI distributions were not studied and are not easily deduced from the phase-space distributions. Lee, Neiman, and Kim (1998) studied coherence resonance in the HH model with noisy synaptic inputs and computed a coherence measure from the gaussian peak of the ISI histogram. Tiesinga, Jose, and Sejnowski (2000) investigated, in the deterministic HH model, the effect of different kinds of input noise on the CV and other measures of the output spike train. The work presented here, on the other hand, investigates the effect of intrinsic neuronal channel noise on the distribution of the output ISIs. Analysis of stochastic dynamical systems (Berglund & Gentz, 2003; Su, Rubin, & Terman, 2004) can give strict estimates of, for example, the time spent in the neighborhood of the unstable fixed point before switching back to the SLC, as in our Hopf case B, but the statement of such an estimate would take us too far afield. In an article to which this one could be considered an extension, Schneidman, Freedman, and Segev (1998) showed that a stochastic HH model qualitatively reproduced the different reliability and precision characteristics of
spike firing in response to DC and fluctuating input in neocortical neurons found by Mainen and Sejnowski (1995). ISI histograms were not examined. In an approximate analysis, Chow and White (1996) derived an exponential distribution for the ISIs generated by channel noise in the subthreshold HH equations. Their result bolstered the common practice in modeling studies of assuming an exponential distribution for spontaneous ISIs. Our work was initiated when we repeated their simulations and, using much longer ISI time series, found the initial peak shown in the inset of Figure 1A—the case they studied. It is hoped that modelers of neural networks will take into account the often bimodal character of the ISI distribution. Are the channel noise levels used in this study realistic in any sense? Typically, the conductance, γNa, of single biological Na-channels lies between 2 and 20 pS (Hille, 2001). The population conductance of our model neuron, gNa, was 120 mS/cm2, implying a channel density of 10^9 to 10^10 /cm2. Assuming that a spike initiation zone might occupy 10^3 to 10^4 µm2 (10^−5 to 10^−4 cm2), one comes to a projected Na-channel number, NNa ≈ 10^4 to 10^6, whereas the maximum number used in our simulations was 180,000 = 1.8 × 10^5. These are, of course, crude approximations. They serve only to argue that our noise levels are not necessarily unreasonable. In many biological neurons, the actual number of Na and K channels occupying the site of spike initiation remains unknown.

6 Summary

We have shown by simulation that the stochastic HH model generates bimodal ISI histograms with a prominent initial peak in the gamma-frequency range and, often, a coefficient of variation greater than 1 in physiological ranges. We introduced a stochastic version of the Morris-Lecar model that uses perhaps the simplest possible kind of channel noise, which may prove useful in the analysis of channel noise and its effects.
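The channel-count estimate in the discussion above is simple enough to write out explicitly. The single-channel conductances and membrane areas below are the ranges quoted in the text, used here only for an order-of-magnitude check:

```python
# Order-of-magnitude Na-channel count: density = g_Na / gamma_Na,
# then N_Na = density * area of the spike initiation zone.
g_na = 120e-3                      # S/cm^2, population conductance
gamma_range = (2e-12, 20e-12)      # S, single-channel conductance (Hille, 2001)
area_range = (1e-5, 1e-4)          # cm^2, assumed spike-initiation zone

n_lo = g_na / gamma_range[1] * area_range[0]   # fewest channels
n_hi = g_na / gamma_range[0] * area_range[1]   # most channels
print(f"N_Na roughly between {n_lo:.0e} and {n_hi:.0e}")
```

The result (about 6 × 10^4 to 6 × 10^6) brackets the text's 10^4 to 10^6 and the maximum of 1.8 × 10^5 used in the simulations.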
We used this Morris-Lecar model to give a simple description of the dynamical mechanisms underlying the generation of bimodal ISI histograms and gave numerical evidence that the dynamical mechanism is the same in the Morris-Lecar and HH stochastic models. We conjecture that the same mechanism arises in any neuron model, however complex, provided only that it undergoes a subcritical Hopf bifurcation—is class II—and has a switching region. We pointed out that class II neurons and neurons with bimodal ISI histograms are widespread in the nervous system, in particular in cortex and sensory systems, and suggested that the same underlying dynamics may be present. The high CV often associated with bimodal histograms may contribute to the highly irregular firing of cortical cells (Softky & Koch, 1993), in particular to the firing of the stuttering and irregular-firing classes
of inhibitory cortical interneurons (Markram et al., 2004). Although we worked here with channel noise, the same properties of the ISI distribution arise with gaussian noise. We conclude that the properties of the stochastic HH model make it a prime candidate for models of mammalian neural networks.

Acknowledgments

I thank Robert Elson for advice at an early stage, Javier Movellan for the use of his cluster of Mac G5s, Jochen Triesch for a crucial early comment, Terry Sejnowski for encouragement, and two referees for constructive criticism. Above all, I thank my partner, Nona, for her unflagging support and keeping it all together.

References

Armstrong-James, M. (1975). Functional status and columnar organization of single cells responding to cutaneous stimulation in neonatal rat somatosensory cortex S1. J. Physiol., 246, 501–538.
Azouz, R., & Gray, C. (2000). Dynamic spike threshold reveals a mechanism for synaptic coincidence detection in cortical neurons in vivo. Proc. Natl. Acad. Sci., 97(14), 8110–8115.
Bassant, M. (1976). Analyse statistique de l'activité des cellules pyramidales de l'hippocampe dorsal du lapin. Electroencephal. Clinical Neurophysiology, 40, 585–603.
Berglund, N., & Gentz, B. (2003). Geometric singular perturbation theory for stochastic differential equations. J. Differential Equations, 191, 1–54.
Best, E. (1979). Null space in the Hodgkin-Huxley equations. Biophysical J., 27, 87–104.
Brette, R. (2003). Reliability of spike timing is a general property of spiking model neurons. Neural Computation, 15, 279–308.
Burns, B., & Webb, A. (1976). The spontaneous activity of neurones in the cat's cerebral cortex. Proc. R. Soc. Lond. B, 194, 211–223.
Cauli, B., Audinat, E., Lambolez, B., Angulo, M. C., Ropert, N., Tsuzuki, K., Hestrin, S., & Rossier, J. (1997). Molecular and physiological diversity of cortical nonpyramidal cells. J. Neuroscience, 17(10), 3894–3906.
Chow, C. C., & White, J. A. (1996). Spontaneous action potentials due to channel fluctuations. Biophysical J., 71, 3013–3021.
de Ruyter van Steveninck, R., Lewen, G., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
Dean, A. (1981). The variability of discharge of simple cells in the cat striate cortex. Experimental Brain Research, 44, 437–440.
Destexhe, A., Rudolph, M., Fellous, J., & Sejnowski, T. J. (2001). Fluctuating synaptic conductances recreate in vivo–like activity in neocortical neurons. Neuroscience, 107(1), 13–24.
Dorval, A., & White, J. (2005). Channel noise is essential for perithreshold oscillations in entorhinal stellate neurons. J. Neurosci., 25(43), 10025–10028.
Duchamp-Viret, P., Kostal, L., Chaput, M., Lansky, P., & Rospars, J.-P. (2005). Patterns of spontaneous activity in single rat olfactory receptor neurons are different in normally breathing and tracheotomized animals. J. Neurobiology, 65(2), 97–114.
Fitzhugh, R. (1961). Impulses and physiological states in models of nerve membrane. Biophys. J., 1, 445–466.
Gerstein, G., & Kiang, N. Y.-S. (1960). Approach to the quantitative analysis of electrophysiological data from single neurons. Biophys. J., 6, 15–28.
Gerstein, G., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. J. Physical Chemistry, 81(25), 2340–2361.
Guckenheimer, J., & Labouriau, I. S. (1993). Bifurcation of the Hodgkin-Huxley equations: A new twist. Bull. Math. Biol., 55, 937–952.
Guckenheimer, J., & Oliva, R. A. (2002). Chaos in the Hodgkin-Huxley model. SIAM J. Applied Dynamical Systems, 1(1), 105–114.
Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Computation, 10(5), 1047–1065.
Guttman, R., & Barnhill, R. (1970). Oscillation and repetitive firing in squid axons. Journal of General Physiology, 55, 104–118.
Guttman, R., Lewis, S., & Rinzel, J. (1980). Control of repetitive firing in squid axon membrane as a model for a neuroneoscillator. J. Physiology, 305, 377–395.
Hassard, B. (1978). Bifurcation of periodic solutions of the Hodgkin-Huxley model for the squid giant axon. J. Theor. Biology, 71, 401–420.
Hille, B. (2001). Ion channels of excitable membranes (3rd ed.). Sunderland, MA: Sinauer.
Hodgkin, A. (1948). The local electrical changes associated with repetitive action in a non-medullated axon. J. Physiol., 107, 165–181.
Hodgkin, A., & Huxley, A. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Holt, G., Softky, W., Koch, C., & Douglas, R. J. (1996). Comparison of discharge variability in vitro and in vivo in cat visual cortex neurons. J. Neurophysiol., 75(5), 1806–1814.
Kiang, N. Y.-S. (1965). Discharge patterns of single fibers in the cat's auditory nerve. Cambridge, MA: MIT Press.
Klink, R., & Alonso, A. (1993). Ionic mechanisms for the subthreshold oscillations and differential electroresponsiveness of medial entorhinal cortex layer II neurons. J. Neurophysiol., 70(1), 144–157.
Kole, M., Hallerman, S., & Stuart, G. (2006). Single Ih channels in pyramidal neuron dendrites: Properties, distribution, and impact on action potential output. J. Neurosci., 26(6), 1677–1687.
Labouriau, I. S. (1985). Degenerate Hopf bifurcation and nerve impulse. SIAM J. Math. Anal., 16(6), 1121–1133.
Lamarre, Y., Filion, M., & Cordeau, J. P. (1971). Neuronal discharges of the ventral nucleus of the thalamus during sleep and wakefulness in the cat. I. Spontaneous activity. Experimental Brain Research, 12, 480–498.
Lee, S.-G., Neiman, A., & Kim, S. (1998). Coherence resonance in a Hodgkin-Huxley neuron. Physical Review E, 57(3), 3292–3297.
Lindner, B., Garcia-Ojalvo, J., Neiman, A., & Schimansky-Geier, L. (2004). Effects of noise in excitable systems. Physics Reports, 392(6), 321–424.
Lynch, J., & Barry, P. (1989). Action potentials initiated by single channels opening in a small neuron (rat olfactory receptor). Biophys. J., 55, 755–768.
Mainen, Z., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Mann-Metzer, P., & Yarom, Y. (2002). Jittery trains induced by synaptic-like currents in cerebellar inhibitory interneurons. J. Neurophysiol., 87, 149–156.
Markram, H., Toledo-Rodriguez, M., Wang, Y., Gupta, A., Silberberg, G., & Caizhi, W. (2004). Interneurons of the neocortical inhibitory system. Nature Reviews Neuroscience, 5, 793–807.
Matsumoto, M., & Nishimura, T. (1998). Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. on Modeling and Computer Simulation, 8(1), 3–30.
McDonald, S. W., Grebogi, C., Ott, E., & Yorke, J. A. (1985). Fractal basin boundaries. Physica D, 17, 125–153.
Meunier, C., & Segev, I. (2002). Playing the devil's advocate: Is the Hodgkin-Huxley model useful? TINS, 25(11), 558–563.
Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophys. J., 35, 193–213.
Nakahama, H., Suzuki, H., Yamamoto, M., & Aikawa, S. (1968). A statistical analysis of spontaneous activity of central single neurons. Physiology and Behavior, 3(5), 745–752.
Noda, H., & Adey, R. (1970). Firing variability in cat association cortex during sleep and wakefulness. Brain Research, 18, 513–526.
Pfeiffer, R., & Kiang, N. Y.-S. (1965). Spike discharge patterns of spontaneous and continuously stimulated activity in the cochlear nucleus of anaesthetized cats. Biophysical Journal, 5, 301–316.
Poggio, G., & Viernstein, J. (1964). Time series analysis of impulse sequences of thalamic somatic sensory neurons. J. Neurophysiol., 27, 517–545.
Rinzel, J., & Ermentrout, G. B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 251–291). Cambridge, MA: MIT Press.
Rinzel, J., & Miller, R. (1980). Numerical calculation of stable and unstable periodic solutions to the Hodgkin-Huxley equations. Mathematical Biosciences, 49, 27–59.
Rospars, J.-P., Lansky, P., Vaillant, J., Duchamp-Viret, P., & Duchamp, A. (1994). Spontaneous activity of first- and second-order neurons in the frog olfactory system. Brain Research, 662, 31–44.
Rowat, P., & Elson, R. (2004). State-dependent effects of Na-channel noise on neuronal burst generation. J. Comput. Neuroscience, 16, 87–112.
Rudolph, M., & Destexhe, A. (2002). Point-conductance models of cortical neurons with high discharge variability. Neurocomputing, 44–46, 147–152.
Ruggero, M. (1973). Response to noise of auditory nerve fibers in the squirrel monkey. J. Neurophysiol., 36, 569–587.
Schneidman, E., Freedman, B., & Segev, I. (1998). Ion channel stochasticity may be critical in determining the reliability and precision of spike timing. Neural Computation, 10(7), 1679–1703.
Shadlen, M., & Newsome, W. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neuroscience, 18(10), 3870–3896.
Siebler, M., Koller, H., Stichel, C. C., Muller, H. W., & Freund, H. J. (1993). Spontaneous activity and recurrent inhibition in cultured hippocampal networks. Synapse, 14, 206–213.
Sigworth, F. (1980). The variance of sodium current fluctuations at the node of Ranvier. J. Physiol., 307, 97–129.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neuroscience, 13(1), 334–350.
Stein, R. (1965). A theoretical analysis of neuronal variability. Biophys. J., 5, 173–194.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 1(3), 210–217.
Stiefel, K. M., Englitz, B., & Sejnowski, T. J. (2006). The irregular firing of cortical interneurons in vitro is due to fast K+ kinetics. Manuscript in preparation.
Su, J., Rubin, J., & Terman, D. (2004). Effects of noise on elliptic bursters. Nonlinearity, 17, 133–157.
Tateno, T., Harsch, A., & Robinson, H. P. C. (2004). Threshold firing frequency-current relationships of neurons in rat somatosensory cortex: Type 1 and type 2 dynamics. J. Neurophysiol., 92, 2283–2294.
Tateno, T., & Pakdaman, K. (2004). Random dynamics of the Morris-Lecar model. Chaos, 14(3), 511–530.
Tiesinga, P., Jose, J., & Sejnowski, T. J. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Review E, 62(6), 8413–8419.
Troy, W. C. (1978). The bifurcation of periodic solutions in the Hodgkin-Huxley equations. Quarterly Journal of Applied Mathematics, 36, 73–83.
Troyer, T., & Miller, K. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Computation, 9(5), 971–983.
Werner, G., & Mountcastle, V. (1963). The variability of central neural activity in a sensory system, and its implications for the central reflection of sensory events. J. Neurophysiol., 26, 958–977.
White, J. A., Klink, R., Alonso, A., & Kay, A. R. (1998). Noise from voltage-gated ion channels may influence neuronal dynamics in the entorhinal cortex. J. Neurophysiol., 80, 262–269.
White, J. A., Rubinstein, J. T., & Kay, A. R. (2000). Channel noise in neurons. Trends in Neurosciences, 23(3), 131–137.
Whitsel, B., Schreiner, R., & Essick, G. K. (1977). An analysis of variability in somatosensory cortical neuron discharge. J. Neurophysiol., 40(3), 589–607.
Wilbur, W., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distribution. J. Theoretical Biology, 105, 345–368.
Yu, X., & Lewis, E. R. (1989). Studies with spike initiators: Linearization by noise allows continuous signal modulation in neural networks. IEEE Trans. Biomedical Eng., 36(1), 36–43.
Received February 1, 2006; accepted July 24, 2006.
LETTER
Communicated by Christoph Börgers
The Firing of an Excitable Neuron in the Presence of Stochastic Trains of Strong Synaptic Inputs

Jonathan Rubin
[email protected] Department of Mathematics, University of Pittsburgh, Pittsburgh, PA 15260 U.S.A.
Krešimir Josić
[email protected] Department of Mathematics, University of Houston, Houston, TX 77204-3008, U.S.A.
We consider a fast-slow excitable system subject to a stochastic excitatory input train and show that under general conditions, its long-term behavior is captured by an irreducible Markov chain with a limiting distribution. This limiting distribution allows for the analytical calculation of the system’s probability of firing in response to each input, the expected number of response failures between firings, and the distribution of slow variable values between firings. Moreover, using this approach, it is possible to understand why the system will not have a stationary distribution and why Monte Carlo simulations do not converge under certain conditions. The analytical calculations involved can be performed whenever the distribution of interexcitation intervals and the recovery dynamics of the slow variable are known. The method can be extended to other models that feature a single variable that builds up to a threshold where an instantaneous spike and reset occur. We also discuss how the Markov chain analysis generalizes to any pair of input trains, excitatory or inhibitory and synaptic or not, such that the frequencies of the two trains are sufficiently different from each other. We illustrate this analysis on a model thalamocortical (TC) cell subject to two example distributions of excitatory synaptic inputs in the cases of constant and rhythmic inhibition. The analysis shows a drastic drop in the likelihood of firing just after inhibitory onset in the case of rhythmic inhibition, relative even to the case of elevated but constant inhibition. This observation provides support for a possible mechanism for the induction of motor symptoms in Parkinson’s disease and for their relief by deep brain stimulation, analyzed in Rubin and Terman (2004).
Neural Computation 19, 1251–1294 (2007)
C 2007 Massachusetts Institute of Technology
1 Introduction There has been substantial discussion of the roles of excitatory and inhibitory synaptic inputs in driving or modulating neuronal firing. Computational analysis of this issue generally considers a neuron awash in a sea of synaptic bombardment (Somers et al., 1998; van Vreeswijk & Sompolinsky, 1998; De Schutter, 1999; Tiesinga, Jose, & Sejnowski, 2000; Tiesinga, 2005; Chance, Abbott, & Reyes, 2002; Tiesinga & Sejnowski, 2004; Huertas, Groff, & Smith, 2005). In this work, we also investigate the impact of synaptic inputs on the firing of a neuron, but with a focus on the effects of single inputs within stochastic trains. This investigation is motivated by consideration of thalamocortical relay (TC) cells, under the hypothesis that such cells are configured to reliably relay individual excitatory inputs, arising from either strong, isolated synaptic signals or tightly synchronized sets of synaptic signals, during states of attentive wakefulness, yet are also modulated by inhibitory input streams (Smith & Sherman, 2002). This viewpoint leads to the question of how different patterns of inhibitory modulation affect the relationship between the activity of a neuron and a stochastic excitatory input train that it receives. The main goal of this letter is to introduce and illustrate a mathematical approach to the analysis of this relationship. Our approach applies to general excitable systems with separation of timescales, including a single slow variable and fast onset and offset of inputs. These ideas directly generalize to other neuronal models featuring a slow buildup of potential interrupted by instantaneous spikes and resets. We harness these features to reduce system dynamics to a one-dimensional map on the slow variable. 
Each iteration of the map corresponds to the time interval from the arrival of one excitatory input to the arrival of the next excitatory input (Othmer & Watanabe, 1994; Xie, Othmer, & Watanabe, 1996; Ichinose, Aihara, & Judd, 1998; Othmer & Xie, 1999; Coombes & Osbaldestin, 2000). From this map, under the assumption of a bounded excitatory input rate, we derive a transition process based on the evolution of the slow variable between inputs. In general, such a process would be non-Markovian, because the transition probabilities would depend on the arrival times of all inputs the cell had received since it last spiked. However, we derive an irreducible Markov chain by defining states that are indexed by both slow variable values and numbers of inputs received since last firing. In the case of a constant inhibition level, we prove the key result that under rather general conditions, this Markov chain is aperiodic and hence has a limiting distribution. This limiting distribution can be computed from the distribution of input arrival times. Once obtained, it can be used to deduce much about the firing statistics of the driven cell, including the probability that the cell will fire in response to a given excitatory input, the expected number of response failures that the cell will experience between firings, and the distribution of slow variable values
Firing in the Presence of Stochastic Synaptic Inputs
attained after any fixed number of unsuccessful inputs arriving between firings. We emphasize that, a priori, it is not certain that such a limiting distribution exists. Therefore, Monte Carlo methods, which rely on the assumed existence of a limit, may fail to converge and hence may give misleading results about firing statistics and the distribution of slow variable values at input arrival times. In addition to providing conditions that guarantee the existence of a limiting distribution for the Markov chain, our analysis shows how convergence failure can arise in the case of a stochastic train of excitatory inputs that is close to periodic. When our existence conditions are satisfied, our approach yields an analytical calculation of the limiting distribution, eliminating the need for simulations as well as the complications of transients and slow convergence, and the eigenvalues of the transition matrix used to compute the limiting distribution give information about convergence rate. Moreover, this approach provides a framework for the analysis of how changes in the statistics of the inputs affect the output of a cell. Finally, the Markov chain analysis allows for the identification of bifurcation events in which variation of model parameters can lead to abrupt changes that affect long-term statistics, although we do not pursue this in detail in this work (see Doi, Inoue, & Kumagai, 1998; Tateno & Jimbo, 2000, which we comment on in section 9). We discuss the Markov chain approach in the particular case of constant inhibition, which may be zero or nonzero, and extend it to the case of inhibition that undergoes abrupt switches between two different levels, at a lower frequency than that of the excitatory signal. 
These choices are motivated by the analysis of TC cell relay reliability in the face of variations in inhibitory basal ganglia outputs that arise in Parkinson’s disease (PD) and under deep brain stimulation (DBS), applied to combat the motor symptoms of PD. An important point emerging from experimental results is that parkinsonian changes induce rhythmicity in inhibitory basal ganglia outputs (Nini, Feingold, Slovin, & Bergman, 1995; Magnin, Morel, & Jeanmonod, 2000; Raz, Vaadia, & Bergman, 2000; Brown et al., 2001), while DBS regularizes these outputs, albeit at higher-than-normal levels (Anderson, Postupna, & Ruffo, 2003; Hashimoto, Elder, Okun, Patrick, & Vitek, 2003). In recent work, Rubin and Terman (2004) provided computational and analytical support for the hypothesis that, given a TC cell that can respond reliably to excitatory inputs under normal conditions, the parkinsonian introduction of repetitive, fairly abrupt switches between high and low levels of inhibition to the TC cell will disrupt reliable relay. On the other hand, regularized basal ganglia activity, even if it leads to unusually high levels of inhibition, can restore reliable TC responses. While their computational results focused on finite time simulations of two TC models with particular choices of excitatory and inhibitory input strengths, frequencies, and durations, the general framework presented here can be used to analyze how a model TC cell responds to any train of fast excitatory inputs in
the presence of inhibition that stochastically makes abrupt jumps between two levels. Furthermore, the results of this letter yield more detailed statistical information about the firing of the driven TC cells, particularly just after the onset of inhibition. The letter is organized as follows. In section 2, we review the way in which the dynamics of a fast-slow excitable system under pulsatile drive can be reduced to a map. In section 3, we use these ideas to construct a Markov chain that captures the relevant aspects of the dynamics of the system. We consider the existence of limiting densities for the constructed Markov chain and their interpretation in section 4. Further, we discuss the application of these ideas to responses of a population of cells, as well as their extension to populations with heterogeneities and to related models, in section 5. The specific example of determining the reliability of a reduced model TC cell under various types of inputs is addressed, as an application of the theory, in sections 6 and 7, while an explicit connection to PD and DBS is made in section 8. Some details of the calculations underlying the results of these sections are contained in the appendixes. Section 9 provides a summary of our results, a discussion of their relation to past work, and some ideas on possible generalization and future directions. In the two examples that we present, one with a uniform and one with a normal distribution of excitatory inputs, a significant decrease in the responsiveness of the model TC cell and a significant increase in its likelihood of being found far from firing threshold result after the onset of the inhibitory phase of a time-varying inhibitory input, relative to the cases of high or low but constant inhibition. These findings are in agreement with Rubin and Terman (2004) and, based on the generality of our approach, appear to represent general characteristics of the TC model with a single slow variable. 
2 Reduction of the Dynamics to Maps of the Interval

In this section we introduce a general relaxation oscillator subject to synaptic input. Under certain assumptions on the form of the input, the dynamics of the oscillator is accurately captured by a map of an interval.

2.1 A General Relaxation Oscillator. Consider a model system of the form

    v' = f(v, w) + I(v, t)
    w' = ε g(v, w),    (2.1)

where 0 < ε ≪ 1. In the neuronal context, the fast variable v models the voltage, while the slow variable w typically models different conductances (Rubin & Terman, 2002). The input to the oscillator is modeled by the term
Figure 1: The nullclines of system 2.1 under the assumptions discussed in the text. The upper and lower v-nullclines correspond to s_exc = 0 (excitation off) and s_exc = 1 (excitation on), respectively.
I(v, t). We will assume that if I(v, t) = C for a constant C in a range of interest, then the v-nullcline, given implicitly by f(v, w) = −C, has a three-branched, or N-like, shape (see Figure 1). We will refer to the different branches of this nullcline as the left, middle, and right branch, respectively. Under assumptions on the form of f and g that typically hold in practice (Rubin & Terman, 2002), it is well known that if the w-nullcline, given implicitly by g(v, w) = 0, intersects the v-nullcline at the middle branch, then equation 2.1 has oscillatory solutions, known as relaxation oscillations. If the w-nullcline meets the v-nullcline at the left branch, then there are no oscillatory solutions, but the system is excitable. In this case, a kick in the positive v direction can trigger an extended excursion (a spike). In the following, we consider the case when the two nullclines intersect on the left branch of the v-nullcline in the absence of input. In the work presented here, equation 2.1 will be used as a model of a neuronal cell, and I(v, t) will model synaptic input, so that

    I(v, t) = −g_exc s_exc(t)(v − v_exc) − g_inh s_inh(t)(v − v_inh),    (2.2)

where v − v_exc < 0 and v − v_inh > 0 over the relevant range of v-values. The two terms in the sum represent effects of excitatory and inhibitory currents, respectively. We will assume that the synaptic variables s_exc(t) and s_inh(t) switch between values of 0 (off) and 1 (on) instantaneously and independently (Somers & Kopell, 1993; Othmer & Watanabe, 1994). This is a reasonable approximation in a situation where the input cells fire action potentials of stereotypical duration, sufficiently widely separated in time to allow for synaptic decay to a small level between inputs, and where the
synaptic onset and offset rates are rapid. We need not consider other aspects of the dynamics of the neurons supplying inputs to the model cell, as only the input timing will be relevant. If the magnitude of I is not too large, then each of the four possible choices for (s_exc, s_inh) yields a different v-nullcline of system 2.1, each of which has an N-like shape. For simplicity, we first consider the case s_inh = 0, so that the cell receives only excitatory input. We label the resulting nullclines as N^i, with i ∈ {E, 0} corresponding to the values of s_exc = 1 and s_exc = 0, respectively. The case when s_inh ≠ 0 can be treated similarly, resulting in two additional nullclines, N^{I+E} and N^I. We refer to the left (right) branch of each v-nullcline N^i as the silent (active) phase. Each left branch terminates where it coalesces with the middle branch in a saddle node bifurcation of equilibria for the v-equation. We denote these bifurcation points, or left knees, by (v_LK^i, w_LK^i), and we denote the analogous right knees as (v_RK^i, w_RK^i), with i ∈ {E, 0} as above. Since v − v_exc < 0, increasing s_exc lowers the v-nullcline in the (v, w) phase plane. See Figure 1 for an example of the arrangement of the nullclines and the different knees. We have assumed that system 2.1 is excitable when excitation is off, with a stable critical point (v_FP^0, w_FP^0) on the left branch of N^0. All solutions of equation 2.1 will therefore approach (v_FP^0, w_FP^0) in the absence of input. We assume that the critical point (v_FP^E, w_FP^E) lies on the middle branch of N^E, so that the system is oscillatory in the presence of a constant excitatory input. This second assumption is not essential, and the analysis is easily generalized. Note that in the singular limit, if w ∈ (w_LK^E, w_FP^0], then an excitatory input results in a large excursion of v, corresponding to a spike.
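To make this setup concrete, the excitable regime just described can be sketched numerically. The snippet below instantiates system 2.1 with input 2.2 using a FitzHugh-Nagumo-style choice of f and g; the functional forms and every parameter value are invented for illustration and are not taken from this letter.

```python
# Hypothetical FitzHugh-Nagumo-style instance of system 2.1 with input 2.2;
# the cubic f, linear g, and all parameter values are illustrative choices.
EPS = 0.05                 # timescale separation, 0 < eps << 1
G_EXC, V_EXC = 0.5, 1.5    # excitatory conductance and reversal potential

def f(v, w):
    return v - v**3 / 3 - w        # v-nullcline f = -I has the N-like shape

def g(v, w):
    return v + 0.7 - 0.8 * w       # nullclines cross on the left branch

def simulate(input_times, t_star=5.0, t_end=150.0, dt=0.01):
    """Integrate v' = f + I, w' = eps*g with s_exc = 1 on [t_i, t_i + t*)."""
    v, w = -1.1994, -0.6243        # rest: the stable critical point on N^0
    spikes, above = [], False
    for k in range(int(t_end / dt)):
        t = k * dt
        s_exc = 1.0 if any(ti <= t < ti + t_star for ti in input_times) else 0.0
        I = -G_EXC * s_exc * (v - V_EXC)   # equation 2.2 with s_inh = 0
        v += dt * (f(v, w) + I)            # fast variable
        w += dt * EPS * g(v, w)            # slow variable
        if v > 1.0 and not above:          # upward crossing = spike excursion
            spikes.append(t)
        above = v > 1.0
    return spikes
```

In this sketch, simulate([]) produces no spikes, while a single brief pulse, simulate([10.0]), elicits a full excursion to the active phase, after which the cell returns to rest.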
If excitatory inputs are brief, then any interesting dynamics is the consequence of synaptic inputs that result in such spikes. In remark 4, we comment further on the necessity of the assumptions made here.

2.2 Timing of the Synaptic Inputs. We denote the times at which the excitatory synaptic inputs occur by t_i, and we assume that each input has the same duration, which we call t*. We will assume that t_{i+1} − t_i > t*, so that the inputs do not overlap, and that the inputs are of equal strength, such that

    s_exc(t) = 1 if t_i < t < t_i + t*,
               0 otherwise.
We comment on trains of inputs with variable durations and amplitudes in remarks 4 and 5. Of fundamental interest in the following analysis will be the timing between subsequent excitatory inputs, which gives rise to the distribution of interexcitatory intervals T_i = t_{i+1} − t_i. We will assume that the T_i are
Figure 2: Schematic representation of a typical trajectory starting on the left branch of N^0 with w = w_0, where w_0 is in the interval j = (w_LK^E, w_FP^0), and ending on the same branch after time T (left). M_T(w_0) is defined as the w coordinate of the end point of the trajectory (right). Note that the trajectory from w̃_LK^E itself fails to reach the active phase, such that w̃_LK^E · t* = w_LK^E and M_T(w̃_LK^E) > w_LK^E.
independent and identically distributed random variables, with corresponding density ρ(t). Excitation that is T-periodic in time is a special case, corresponding to the singular distribution ρ(t) = δ(t − T).

2.3 Reduction of the Dynamics to a Map. For the purposes of analysis, we consider that the active phase of the cell is very fast compared to the silent phase. More specifically, we treat the neuron as a three-timescale system. The first timescale, on the order of tenths of milliseconds, governs the evolution of the fast variable (the voltage) during spike onset and offset. The second, intermediate timescale, on the order of milliseconds, governs the evolution of the slow variable on the right-hand branch, while the third, slow timescale, on the order of tens of milliseconds, governs its evolution on the left-hand branch. We assume that the duration of excitation is greater than or equal to the duration of the active phase (Stuart & Häusser, 2001). Correspondingly, we measure input duration on the slowest timescale. Using ideas of singular perturbation theory, these simplifications allow us to reduce the response of an excitable system to a one-dimensional map on the slow variable w in the singular limit. Similar maps have been introduced previously (Othmer & Watanabe, 1994), in the setting of two rather than three timescales. Since we measure the duration of excitation on the slow timescale, cells that spike are reinjected at w_RK^E, and the only accessible points on the left branch of N^0 lie in the interval I = [w_RK^E, w_FP^0) (see Figure 2). A common dynamical systems convention is to denote the solution obtained by integrating a differential equation for time t, from initial condition x_0, by x_0 · t. Using this convention, we denote the w coordinate at time t, of a point on the trajectory with initial condition (v_0, w_0), by w_0 · t.
We now define a map M_T : I → I by M_T(w_0) = w_0 · T, where it is assumed that an excitatory input is received at time t = 0, at the start of the map cycle, and
no other excitatory inputs are received between times 0 and T. We claim that in the singular limit, there is no ambiguity in the definition of the map M_T. Indeed, in the singular limit, the initial point (v_0, w_0) can be assumed to lie on the left branch of the nullcline N^0. Since the evolution on the right branch of N^E occurs on the intermediate timescale and the time T exceeds the duration of excitation t*, the trajectory starting at (v_0, w_0) again ends on the left branch of N^0 at time T. We emphasize that when we repeatedly iterate a map M_T or compose a sequence of maps M_{T_i}, we assume for consistency that an input arrives at the start of each map cycle and that no other inputs arrive during each cycle. Figure 2 (left) shows a schematic depiction of part of a typical trajectory, starting on N^0 with w = w_0, during an interval of length T following an excitatory input. The trajectory jumps to the right nullcline after excitation is received. The jumps between the branches occur on the fast timescale (triple arrows), while the evolution on the right and left nullclines is intermediate and slow, respectively (double and single arrows). A possible form of the resulting map M_T is shown on the right. Note that this map has three branches. Let w̃_LK^E ≤ w_LK^E denote the infimum of the set of w-values from which a cell will reach the active phase, when given an excitatory input of duration t*. Orbits with initial conditions in the interval j = (w_LK^E, w_FP^0) will jump to the active phase as soon as excitation is received. Meanwhile, orbits with initial conditions in n = [w_RK^E, w̃_LK^E] will be blocked from reaching the active phase, by the left branch of the N^E nullcline. Intermediate initial conditions, in g = (w̃_LK^E, w_LK^E], will yield trajectories that jump to N^E and then make a second jump, to the active phase, after a time 0 ≤ t < t*.
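The structure of M_T can likewise be sketched in code. The following minimal example takes t* = 0 (so the interval g is absent) and assumes, purely for illustration, that w relaxes exponentially toward w_FP^0 on the left branch of N^0; all constants are hypothetical.

```python
import math

# A minimal sketch of M_T in the instantaneous-input limit t* = 0, with an
# invented exponential relaxation law for the slow variable.
W_FP0 = 1.0     # stable fixed point w_FP^0 on the left branch of N^0
W_LK_E = 0.8    # left knee w_LK^E of N^E: firing threshold at input times
W_RK_E = 0.2    # right knee w_RK^E: reinjection point after a spike
TAU = 10.0      # relaxation time constant on the left branch

def drift(w0, t):
    """w0 . t: evolution on the left branch of N^0 for time t."""
    return W_FP0 + (w0 - W_FP0) * math.exp(-t / TAU)

def M(w0, T):
    """One map cycle: an input at t = 0 elicits a spike iff w0 > w_LK^E;
    firing resets w to w_RK^E; either way w then drifts for time T."""
    fired = w0 > W_LK_E
    return drift(W_RK_E if fired else w0, T), fired
```

In this sketch, every initial condition in j = (W_LK_E, W_FP0) has the same image W_RK_E · T under M, while on n = [W_RK_E, W_LK_E] the map contracts toward W_FP0.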
In the singular limit, the region j is compressed to a single point under M_T, so that M_T(j) = M_T(w_RK^E) = w_RK^E · T. Since the times of entry into the active phase vary continuously across initial conditions in g, M_T(g) has a nonzero slope. Finally, note that on n, the map M_T has slope less than one, corresponding to the fact that w' decreases as w approaches the fixed point w_FP^0 along the left branch of N^0.

Remark 1. The definition of the map M_T(w) can be extended to the case when the system is excitable rather than oscillatory under constant excitation, that is, when the point (v_FP^E, w_FP^E) is on the left branch of N^E. In this case, M_T(w) < w may occur for w near w_FP^E if T is close to t*, due to the stability of the fixed point (v_FP^E, w_FP^E) on N^E.

2.4 Periodic Excitation. Although the case of periodic excitation has been analyzed in much detail elsewhere, we review it here since the analysis shares a common framework with the developments in the subsequent sections. First, consider the limit as the duration of excitation goes to zero, with respect to the slow timescale, such that w̃_LK^E ↑ w_LK^E and the middle branch
Figure 3: Examples of the map M_T. (Left) An example with t* = 0, for which, after a finite number of iterations, all initial conditions will be mapped to a periodic orbit of period N = 4. (Right) An example with t* > 0, with M_T^1(w̃_LK^{E+}) defined as the limit of M_T^1(w) as w decreases to w̃_LK^E. This example exhibits contraction of the interval g, since M_T^5(g) ⊂ M_T^1(g) and |M_T^5(w_LK^E) − M_T^5(w̃_LK^{E+})| < |M_T^1(w_LK^E) − M_T^1(w̃_LK^{E+})|. Note that in both cases, M_T^5(w_LK^E) = M_T^1(w_LK^E).
of M_T is eliminated. As shown in the left panel of Figure 3, the fact that j = (w_LK^E, w_FP^0) is contracted to a single point under M_T implies that all points w_0 ∈ I get mapped to a periodic orbit of M_T in a finite number of iterations. A simple analysis shows that there exists a natural number N such that a population of cells with initial conditions distributed on I will form N synchronous clusters under N applications of the map M_T. The periodic orbit is obtained by applying N iterations of M_T to M_T(w_RK^E), and it consists of the points {M_T^i(w_RK^E)}_{i=1}^N, where M_T^N(w_RK^E) ∈ j (see Figure 3). Every trajectory is absorbed into this orbit, possibly after an initial transient. This clustered state persists away from the singular limit. More precisely, consider the following intervals, or bins,

    j_i = (M_T^{−i}(w_LK^E), M_T^{−(i−1)}(w_LK^E)],    i = 1, 2, . . .    (2.3)

Note that the ith iterate of j_i under M_T is contained in j. Therefore, since M_T^i(j_i) ⊂ j, it follows that M_T^{i+1}(j_i) = M_T(w_RK^E) or, more generally, M_T^l(j_i) = M_T^{l−i}(w_RK^E) for l > i. We can interpret this as follows. Consider a collection of identical oscillators subject to identical input, under the assumption that all oscillators have initial conditions in j_i. This collection will get mapped to the interval j just prior to the ith input. It will then respond to the ith input by firing in unison and will form a synchronous cluster after i + 1 excitatory inputs, since the interval j collapses to a single point after the cells fire. If the bin j_i contains a fraction q of the initial conditions, then this fraction of cells will fire at the ith input, as well as on every Nth input after that.
Therefore, without knowing the distribution of initial conditions, it is not possible to know what fraction of the cell population will respond to a given input. Consider the two extreme examples: if all cells have initial conditions lying in one bin, the population of cells will respond only to every Nth input. On the other hand, if initially every bin j_i contains some fraction of cells, then each input will induce a response by some fraction of the population. Now consider the effect of t* > 0, corresponding to a nonzero duration of excitation. As long as t* is sufficiently small, or the slope of the middle branch of M_T is sufficiently shallow, M_T will be contracting on g, as shown in the right panel of Figure 3. In this case, a similar clustered state arises away from the t* = 0 limit as well. Moreover, the definitions and clustering phenomenon described here carry over identically to the case of periodic excitation with inhibition held on at any constant level. In that case, the system will evolve on the nullclines N^I and N^{I+E} in the singular limit. The bins are defined in terms of the points w_LK^{I+E} and w_RK^{I+E}, rather than w_LK^E and w_RK^E as above. We will show in the next section that under general conditions, the situation can be quite different when the times between onsets of successive excitatory inputs, which we will call interexcitation intervals (IEIs), are random and the possibility of sufficiently long IEIs exists.

Remark 2. The map M_T resembles a time-T map of the voltage obtained from an integrate-and-fire system. In our case, however, the map is defined on the slow conductance w. If an excitatory input fails to elicit a spike and g(v, w) depends weakly on v, then the input may have little effect on the slow conductance. This is unlike the typical integrate-and-fire model, in which excitatory inputs move the cell closer to threshold.
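The absorption of all initial conditions into an N-periodic firing pattern under periodic excitation can be checked directly. The sketch below uses a hypothetical exponentially relaxing slow variable (all constants invented); with these numbers, five drift steps are needed for w_RK^E to cross the firing threshold, so after a transient each cell fires to every fifth input.

```python
import math

# Hypothetical singular-limit cell (invented constants): w relaxes toward
# W_FP0 on the left branch; an instantaneous input (t* = 0) elicits a spike
# iff w > W_LK_E, with reset to W_RK_E; the excitation is T-periodic.
W_FP0, W_LK_E, W_RK_E, TAU = 1.0, 0.8, 0.2, 10.0
T = 3.0

def step(w):
    """One application of M_T: input at the cycle start, then drift for T."""
    fired = w > W_LK_E
    if fired:
        w = W_RK_E
    return W_FP0 + (w - W_FP0) * math.exp(-T / TAU), fired

# Evolve a population of initial conditions, recording who fires to each input.
cells = [W_RK_E + k * (W_FP0 - W_RK_E) / 50 for k in range(50)]
history = []
for _ in range(40):
    stepped = [step(w) for w in cells]
    cells = [w for w, _ in stepped]
    history.append([fired for _, fired in stepped])

# With these constants, w_RK^E . kT first exceeds W_LK_E at k = 5, so N = 5:
# after a transient, every cell fires to exactly every fifth input.
```

Different cells may belong to different clusters (fire to different inputs), but each cell's firing becomes strictly 5-periodic, as in the clustering argument above.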
3 The Construction of a Markov Chain

We next analyze the long-term statistical behavior of a cell that receives excitatory input that is not periodic in time. Our goals are to determine the probability that the cell fires in response to each subsequent input since it last fired and the number of inputs the cell is expected to receive between firings or, equivalently, the average number of failures before a spike occurs. Our results can be interpreted in the context of a population of cells as well, and this is discussed in section 5. A key point in our approach is that, as noted in section 2, all firing events lead to reinjection at w_RK^E. Therefore, the system has only limited memory, since after a cell fires, all information about its prior behavior is erased. This allows us to describe the evolution of the variable w through the interval [w_RK^E, w_FP^0) as a Markov process with a finite number of states. The steps in this process will be demarcated by the arrival times of excitatory inputs. Each IEI is defined as the time from the onset of one excitatory input to the
Figure 4: Bins I_j are defined by intervals of values of the slow variable w. Note that larger subscripts label intervals of larger w values. Iterations of w_RK^E for integer multiples of the minimal IEI time S are used to define most bins, as in equation 3.1, whereas bins I_{N−1} and I_N are defined using w_LK^E and w_FP^0, as in equation 3.2.
onset of the next. The length of this interval includes the duration of the input that occurs at the start of the IEI. We assume that the lengths of the IEIs, which we denote by T, are independent and identically distributed random variables with density ρ. Fundamental in our analysis is the assumption that the support of ρ takes the form [S, U], where 0 < S < U < ∞. As long as the frequency of the cells providing excitatory inputs is bounded, this assumption is not unreasonable. If ρ satisfies this assumption, then the long-term behavior of a population of cells is accurately captured by the asymptotics of a corresponding Markov chain with finitely many states. We show that under certain conditions, this Markov chain is aperiodic and irreducible and thus has a limiting distribution. This distribution can be used to describe the firing statistics of the cell.

3.1 The States of the Markov Chain. We start by again assuming that the cell receives only excitatory input. To simplify the exposition, we now assume that the input is instantaneous on the slow timescale, so that it causes a cell to spike if and only if w > w_LK^E. In the singular limit, such an input has no effect on the slow variable w of a cell unless a spike is evoked. This assumption will be relaxed subsequently. In the case of periodic excitation, we considered bins defined by backward iterations of w_LK^E, as in equation 2.3. For notational reasons, it is now more convenient to consider forward iterations of w_RK^E to form the bins used in the definition of the Markov chain (see Figure 4). Let N be the smallest number such that w_RK^E · NS > w_LK^E, where S > 0 is the lower bound of
the support of ρ mentioned above. Therefore, N is the maximal number of excitatory inputs that a cell starting at any point w_0 ∈ [w_RK^E, w_FP^0] can receive before firing. Set

    I_k = [w_RK^E · kS, w_RK^E · (k + 1)S)    if 1 ≤ k ≤ N − 2,    (3.1)

and define two additional bins:

    I_{N−1} = [w_RK^E · (N − 1)S, w_LK^E)
    I_N = [w_LK^E, w_FP^0) = j.    (3.2)
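For a hypothetical cell with an exponentially relaxing slow variable (all constants invented for illustration), the integer N and the bins of equations 3.1 and 3.2 can be constructed as follows.

```python
import math

# Construction of the bins of equations 3.1-3.2 for a hypothetical cell whose
# slow variable relaxes exponentially on the left branch; every constant here
# is invented for illustration.
W_FP0, W_LK_E, W_RK_E, TAU = 1.0, 0.8, 0.2, 10.0
S = 3.0    # lower bound of the support of the IEI density rho

def drift(w0, t):
    """w0 . t: evolution on the left branch of N^0 for time t."""
    return W_FP0 + (w0 - W_FP0) * math.exp(-t / TAU)

# N is the smallest integer with w_RK^E . NS > w_LK^E
N = 1
while drift(W_RK_E, N * S) <= W_LK_E:
    N += 1

# I_k = [w_RK^E . kS, w_RK^E . (k+1)S) for 1 <= k <= N-2 (equation 3.1);
# I_{N-1} = [w_RK^E . (N-1)S, w_LK^E), I_N = [w_LK^E, w_FP^0) (equation 3.2)
bins = [(drift(W_RK_E, k * S), drift(W_RK_E, (k + 1) * S))
        for k in range(1, N - 1)]
bins.append((drift(W_RK_E, (N - 1) * S), W_LK_E))
bins.append((W_LK_E, W_FP0))
```

With these constants N = 5, and the resulting bins tile [w_RK^E · S, w_FP^0) contiguously, as in Figure 4.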
If T is a random variable, then the map M_T transfers ensembles of cells between bins. This process is non-Markovian, in that transition probabilities depend on the full set of IEIs that have occurred since a cell last spiked, and not just on current bin membership. That is, a cell that enters a bin after fewer inputs will be more likely to lie in the lower part of the bin than a cell that enters the same bin after more inputs. Hence, the number of inputs that a cell has received since its last spike can significantly affect its transition probabilities between bins. Therefore, to form a Markov chain from M_T, a further refinement is needed. To that effect, we define Markov chain states (I_k, l) as follows. A cell is in state (I_k, l) if w ∈ I_k and l complete IEIs have passed since the cell last fired. Note that by construction, the first of these IEIs (l = 1) actually corresponds to the duration of the input that made the cell fire, together with the pause between that input and the next, since the actual firing and reset occur instantaneously relative to the slow timescale. As in the case of periodic excitation, the analysis carries over directly to the case of constant inhibition. In this case, the nullclines N^I and N^{I+E} are used instead of N^0 and N^E above. The remainder of the above construction is identical, with the superscripts I and I + E replacing the superscripts 0 and E, respectively.

Remark 3. For the remainder of this section, we continue to denote the states of the Markov chain by (I_k, l), to emphasize that the first index refers to an interval of slow variable values. For simplicity, we use (k, l) instead of (I_k, l) in the examples of the analysis that appear later in the letter.

Remark 4. Up to this point, we have made several assumptions that can in fact be weakened in the following ways:
- For ε = 0, the actual jump-up threshold lies below w_LK^E and is given by the minimal level of w for which an input current pushes the point (v, w) on the left branch of N^0 into the active phase. As long as all inputs are the same (otherwise, see the next point and remark 5), this value is uniquely defined and can be used to replace w_LK^E in the definitions of I_{N−1} and I_N in equation 3.2.

- If the excitatory input is not instantaneous in time, then the bins have to be adjusted. For instance, the top bin will include all cells that fire by jumping to the left branch of N^E and then reaching w_LK^E with the input still on. Each IEI will consist of the duration of the excitatory input beyond reset (see the next point), plus the time until the arrival of the next input. The distribution of IEIs must be adjusted accordingly. These steps are incorporated in the examples presented in section 6.

- As stated in section 2, when inputs are not instantaneous, input duration is measured on the slowest timescale. It is possible, however, that the input duration, in milliseconds, is only slightly longer than the active phase, which transpires on our intermediate timescale. What we refer to here as the input duration is therefore really the duration of the part of the input that extends beyond the cell's return to the silent phase, which we measure on the slow timescale.

- If the input turns on and off gradually, then the analysis becomes more complicated, although additional simplifying assumptions could reduce this complexity.
3.2 Transition Probabilities. To complete the definition of the Markov chain, we need to compute the transition probabilities between the different states. The transitions occur at times at which the cell receives excitatory inputs. One way to think about this probability is to imagine a large pool of noninteracting cells, each of which receives a different train of inputs with IEIs chosen from the same distribution. To start, assume that each cell is in bin I_N and receives its next excitatory input at slow time 0. By the definition of I_N, these inputs cause the cells to spike, and they get reinjected at w_RK^E, still at slow time 0. The assumption that all cells are reset to the same point is essential in the definition of the Markov chain. Recall that we have defined the length T of an IEI as the time between the onsets of two subsequent inputs and that these times are independent and drawn from the distribution ρ. Focus on a particular cell, and denote the length of its next IEI by T_1. Just before the cell's next input arrives, it will be at w_RK^E · T_1. Similarly, if we check the locations of all other cells after their respective IEIs, we will find that the population of cells is distributed in some interval starting at w_RK^E · S, as shown in Figure 5. The transition probability from the state (I_N, k) to the state (I_j, 1) equals the fraction of the population of cells that are in bin I_j at the ends of their respective IEIs. This fraction is independent of the number of inputs k a cell received before firing, by our assumption that all information about prior dynamics is erased once a cell has fired. Let us return to the cell at w_RK^E · T_1, and let T_2 denote its second IEI since time 0, also chosen from the distribution ρ. After this IEI, the example cell will be at w_RK^E · (T_1 + T_2). The fraction of the population lying in bin I_j after one IEI since time 0 and in bin I_k after two IEIs since time 0 is
Figure 5: An example of the state of the population of cells that start in I_N and immediately receive inputs and fire. After one subsequent IEI, the cells are at w_RK^E · T, where T is distributed according to ρ. Therefore, all cells lie in an interval bounded below by M_S^1(w_RK^E) = w_RK^E · S, as shown in the left-most part of the figure. For this example, we have assumed that supp(ρ) ⊂ [S, 2S), such that all cells are below M_S^2(w_RK^E) = w_RK^E · 2S, as shown. After a second IEI, each cell is at w_RK^E · (T_1 + T_2), where both T_1 and T_2 are chosen from the distribution ρ. After a third IEI, some cells will lie above w_LK^E, such that they fire to their next inputs, while others will not. The distributions of cells after the third and fourth IEIs are the right-most two distributions shown above.
the transition probability between the states (I j , 1) and (Ik , 2). This process can be continued to compute all the transition probabilities. In the example shown in Figure 5, we have assumed that supp(ρ) ⊂ [S, 2S) and that there are five accessible states, which we order as (I1 , 1), (I2 , 2), (I3 , 3), (I4 , 3), and (I4 , 4). Hence, N = 4, and the matrix of transition probabilities has the form
$$A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & * & * & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad (3.3)$$
where A_{ij} gives the transition probability from the ith state in the list to the jth state in the list. If initial conditions are selected randomly, then some cells may initially lie in transient states that cannot be reached subsequent to reset from w_RK^E. We neglect such states in the Markov chain, since including them does not affect the statistics we consider. We are now ready to compute the transition probabilities for the general Markov chain that we have defined. We first compute transition
Firing in the Presence of Stochastic Synaptic Inputs
1265
probabilities between the states (I_j, l) and (I_k, l+1) for l ≤ j < k ≤ N. Consider independent and identically distributed random variables T_i, each with probability density ρ. Let σ_l = Σ_{i=1}^{l} T_i denote the sum of the first l of these random variables. The transition probability between the states (I_j, l) and (I_k, l+1) is given by

$$p_{(j,l)\to(k,l+1)} = P[(I_k, l+1) \mid (I_j, l)] = P\big[\sigma_{l+1} \in [kS, (k+1)S) \,\big|\, \sigma_l \in [jS, (j+1)S)\big]. \qquad (3.4)$$
Since the probability density ρ^{(l)} of the sum σ_l can be computed recursively by the convolution formula,

$$\rho^{(l)}(t) = \int \rho^{(l-1)}(u)\,\rho(t-u)\,du,$$
the conditional probabilities in the expression above can be evaluated. In particular, since

$$P\big[\sigma_{l+1} \in [z, z+\Delta z] \;\&\; \sigma_l \in [w, w+\Delta w]\big] \approx \rho^{(l)}(w)\,\rho(z-w)\,\Delta w\,\Delta z,$$

it follows that

$$p_{(j,l)\to(k,l+1)} = \frac{\int_{kS}^{(k+1)S} \int_{jS}^{(j+1)S} \rho^{(l)}(w)\,\rho(z-w)\,dw\,dz}{\int_{jS}^{(j+1)S} \rho^{(l)}(z)\,dz}. \qquad (3.5)$$
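As a sketch (not taken from the paper's appendixes), equation 3.5 can be evaluated numerically for l = 1, where ρ^{(1)} = ρ. The bin size S, the uniform support [A_LO, A_HI], and the helper name `p_transition` below are illustrative assumptions:

```python
# Hypothetical setup: bins of size S = 30, uniform IEI density on [30, 70].
S = 30.0
A_LO, A_HI = 30.0, 70.0

def rho(t):
    """Uniform IEI density on [A_LO, A_HI]."""
    return 1.0 / (A_HI - A_LO) if A_LO <= t <= A_HI else 0.0

def p_transition(j, k, h=0.1):
    """Riemann-sum approximation of p_{(j,1)->(k,2)} in eq. (3.5).

    For l = 1 the numerator is a double integral of rho(w) * rho(z - w)
    over w in [jS, (j+1)S), z in [kS, (k+1)S); the denominator
    normalizes by the probability that sigma_1 lies in bin I_j."""
    n = int(S / h)
    num = 0.0
    for wi in range(n):
        w = j * S + wi * h
        rw = rho(w)
        if rw == 0.0:
            continue
        for zi in range(n):
            z = k * S + zi * h
            num += rw * rho(z - w) * h * h
    den = sum(rho(j * S + zi * h) * h for zi in range(n))
    return num / den if den > 0 else 0.0

# A cell with sigma_1 in bin I_1 = [S, 2S) must land in some bin after its
# second IEI, so the transition probabilities over k should sum to one.
probs = [p_transition(1, k) for k in range(2, 5)]
```

The same double-integration pattern applies for l > 1 once ρ^{(l)} has been built up by numerical convolution.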
Next, we define the transition probabilities from the states (I_N, l). By our assumption, a cell in one of these states fires when it receives an excitatory input. Therefore, the next state must be of the form (I_j, 1). As discussed above, once a cell fires, it has no memory of the number of excitatory inputs it received prior to this event. Therefore, the transition probability from (I_N, l) to (I_j, 1) is the same for all l. This transition probability can be obtained as

$$p_{(N,l)\to(j,1)} = P[(I_j, 1) \mid (I_N, l)] = \int_{jS}^{(j+1)S} \rho(t)\,dt.$$
Since no transitions other than the ones described above are possible, this completes the definition of the Markov chain. As a final step, let I_{k_i} denote the highest bin, with respect to values of the slow variable, that can be reached through i IEIs after reset. To form the transition matrix A for the Markov chain, we simply order the states of the Markov chain in the list (I_1, 1), (I_2, 1), …, (I_{k_1}, 1), (I_2, 2), (I_3, 2), …, (I_{k_2}, 2), …, (I_N, N).
The transition from the mth state to the nth state in the list is taken to be the (m, n) element of the matrix A, as was done in matrix 3.3 in the above example.

Remark 5. As mentioned in remark 4, the threshold w_LK^E in equation 3.2 depends on input amplitude and duration. If input amplitudes are not constant, then there will no longer be a single value N such that a cell fires to its next input if and only if it is in a state of the form (I_N, j), no matter how I_N is defined. In such a case, we can, for example, use the threshold defined for the maximal relevant input amplitude to define I_N. Then the probability distribution of input amplitudes can be used to compute probabilities p_{(N,l)→(N,l+1)} and p_{(N,l)→(j,1)}. For some range of l values, p_{(N,l)→(N,l+1)} will be nonzero, and there will be a maximal value l such that no cell requires more than l inputs to fire, and hence (I_N, l) is the last state in the chain.

4 Limiting Distributions of the Markov Chain and Their Interpretation

We next consider the long-term behavior of the Markov chains defined above and interpret this behavior in terms of the original fast-slow system. For a finite-state Markov chain with M states and transition matrix A, the probability distribution π = (π_1, …, π_M) is a stationary distribution if Σ_i π_i A_{i,j} = π_j for all j. The stationary distribution π is a limiting distribution if

$$\lim_{n\to\infty} \{A^n\}_{i,j} = \pi_j.$$
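A limiting distribution can be observed numerically by raising A to a high power and checking that all rows agree. The sketch below uses the five-state example chain of section 3 with a hypothetical branching probability 0.3/0.7 in its second row; pure-Python matrix multiplication keeps it self-contained:

```python
# Hypothetical 5-state transition matrix: states ordered as
# (I1,1), (I2,2), (I3,3), (I4,3), (I4,4); the 0.7/0.3 split is assumed.
A = [[0.0, 1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.7, 0.3, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0, 0.0]]

def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Compute A^300: for an aperiodic, irreducible chain, every row of A^n
# converges to the same limiting distribution pi.
An = A
for _ in range(299):
    An = matmul(An, A)
pi = An[0]
```

For this chain the stationary distribution can also be found by hand (π = (1, 1, 0.7, 0.3, 0.7)/3.7), which matches the numerically converged rows.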
A Markov chain is irreducible if for any two states i and j, there exists a finite n such that {A^n}_{i,j} > 0. In other words, there is a nonzero probability of transition between any two states in a finite number of steps. An irreducible finite-state Markov chain has a unique stationary distribution (Hoel, Port, & Stone, 1972). The period d_i of the state i in a Markov chain is the greatest common divisor of the set {n | {A^n}_{i,i} > 0}. For an irreducible chain, all states have the same period d. Such a chain is called aperiodic if d = 1 and periodic if d > 1. A key point is made in the following theorem:

Theorem 1 (Hoel et al., 1972, p. 73). For an aperiodic, irreducible Markov chain, the stationary distribution is a limiting distribution.

4.1 Conditions for the Existence of a Limiting Distribution. We next show that under very general conditions, the Markov chain constructed in the previous section is irreducible and aperiodic and therefore has a limiting distribution. First, recall that the chain was constructed to include precisely those states that can be reached with nonzero probability in a finite number of steps after a cell is reset. Since a cell in any state will fire and
be reset after a finite number of steps, this implies that there is a nonzero probability of transition from any state to any other state in the chain in a finite number of steps. Therefore, the Markov chain under consideration is always irreducible. We next consider conditions under which the Markov chain is aperiodic. It is sufficient to start with a continuum ensemble of cells at w_RK^E. If these cells are subject to different realizations of the input, and a nonzero fraction of cells occupies every state after a finite number of inputs, then the transition matrix is aperiodic. To start, note that if ρ is supported on a single point T, so that the input is periodic, then the Markov chain will be periodic. Each point M_{jT}(w_RK^E) is contained in bin I_j. Therefore, the states of the Markov chain are {(I_i, i)}_{i=1}^N. In the case depicted in Figure 3, for example, the transition matrix has the form

$$A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}. \qquad (4.1)$$

Consider next what happens as the support of ρ is enlarged to [T, T + δ] in this example. For small δ, there is no change in the structure of the transition matrix. For some δ_0 sufficiently large, however, we have M_{3(T+δ_0)}(w_RK^E) = w_LK^E, and when δ > δ_0, as shown in Figure 5, a fraction of the cells will be in state (I_3, 3) after three IEIs, while another fraction will be in state (I_4, 3). When their next excitatory input arrives, the cells in state (I_4, 3) will fire and transition to (I_1, 1), whereas the cells in (I_3, 3) will require two more inputs to fire. Therefore, in addition to the states (I_1, 1), (I_2, 2), (I_3, 3), (I_4, 4) that were relevant in the periodic case, the new state (I_4, 3) becomes part of the Markov chain. The transition matrix between these five states for δ > δ_0 has the form
$$A = \begin{pmatrix} 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1-\varepsilon(\delta) & \varepsilon(\delta) & 0 \\ 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix}, \qquad (4.2)$$
where ε(δ0 ) = 0, ε increases with δ, and the states are ordered as (I1 , 1), (I2 , 2), (I3 , 3), (I4 , 3), and (I4 , 4). It can be checked directly that this transition matrix is aperiodic. This is also a consequence of the following general theorem: Theorem 2. Suppose that the support of the distribution ρ of IEIs (including input durations) is the interval [S, U]. The transition matrix defined in section 3
is aperiodic if and only if M_{iU}(w_RK^E) > w_LK^E for an integer 0 < i < N, where N is defined by M_{(N−1)S}(w_RK^E) < w_LK^E < M_{NS}(w_RK^E), as in section 3.
Corollary 1. The Markov chain defined above has a limiting distribution if and only if M_{iU}(w_RK^E) > w_LK^E for an integer 0 < i < N, where N is described in theorem 2.

Remark 6. In the example discussed above, as the support of ρ widens, the periodic transition matrix in equation 4.1 is replaced by the aperiodic matrix, equation 4.2. Since the limiting behavior of periodic and aperiodic Markov chains is very different, the system can be thought of as undergoing a bifurcation as δ is increased past δ_0.

Proof. If M_{iU}(w_RK^E) ≤ w_LK^E for all integers 0 < i < N, then it can be checked directly that the only states that are achievable from an initial condition M_S(w_RK^E) have the form (I_i, i) for 0 < i ≤ N. Therefore, the transition matrix is of size N × N and has the form of matrix 4.1, with ones on the superdiagonal in all rows except the last, which has a one in its first column. Thus, the Markov chain is periodic. Assume instead that M_{iU}(w_RK^E) > w_LK^E for an integer 0 < i < N. Consider a continuum of cells that have just spiked and been reset to w_RK^E and are receiving independent realizations of the input. First, note that after i IEIs following the spike, there will be a nonzero fraction of cells in all states (I_k, i), k ≥ i, that are part of the Markov chain. The condition M_{iU}(w_RK^E) > w_LK^E and the fact that the support of ρ is an interval imply that some cells will fire on the ith, (i + 1)st, …, and Nth inputs after the one that causes their initial reset. Correspondingly, there will be cells in all states (I_k, 1), k ≥ 1, in the chain after the ith input, cells in all states (I_k, 2), k ≥ 2, and (I_k, 1), k ≥ 1, in the chain after the (i + 1)st input, and so on, until all states of the form (I_k, j) with k ≥ j and j ∈ {1, …, N − i + 1} are nonempty after N inputs.
Similarly, some cells will fire again to input number 2i, with some other cells firing to each input from 2i + 1 up to 2N, and after 2N inputs, all states (I_k, j) with k ≥ j and j ∈ {1, …, 2N − 2i + 1} in the chain will be nonempty. Continuing this argument inductively shows that all states are necessarily occupied just before the arrival of the (bN)th input after reset (i.e., after bN IEIs), where b = ⌈(N − 1)/(N − i)⌉ is the least integer greater than or equal to (N − 1)/(N − i), such that bN is bounded above by (N − 1)N, since i ≤ N − 1. This establishes theorem 2. Finally, since we established previously that the Markov chain is irreducible, theorems 1 and 2 imply that it has a limiting distribution whenever M_{iU}(w_RK^E) > w_LK^E for an integer 0 < i < N. If this condition fails, then the Markov chain is periodic and therefore has no limiting distribution. Hence, corollary 1 holds as well.
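The period of a state can be checked numerically as the greatest common divisor of its return times. The sketch below compares a 4-state periodic matrix of the form 4.1 with a 5-state aperiodic matrix of the form 4.2 (using a hypothetical ε = 0.3); `period` is our own helper name:

```python
import math

def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def period(A, i, max_n=50):
    """gcd of the return times n <= max_n with (A^n)_{ii} > 0."""
    g = 0
    An = A
    for n in range(1, max_n + 1):
        if An[i][i] > 0:
            g = math.gcd(g, n)
        An = matmul(An, A)
    return g

# Matrix of the form (4.1): a deterministic 4-cycle.
A_periodic = [[0, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0]]

# Matrix of the form (4.2) with hypothetical epsilon = 0.3: return times
# of 3 and 4 steps coexist, and gcd(3, 4) = 1.
A_aperiodic = [[0.0, 1.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.7, 0.3, 0.0],
               [0.0, 0.0, 0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0, 0.0, 0.0],
               [1.0, 0.0, 0.0, 0.0, 0.0]]
```

This mirrors the bifurcation noted in remark 6: widening supp(ρ) past δ_0 drops the period from 4 to 1.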
4.2 Interpretation of the Invariant Density. The limiting distribution Q of the derived Markov chain has two interpretations (Taylor & Karlin, 1998). Assume that a cell is subjected to excitatory input for a long time. If w is recorded after each IEI, just prior to the moment when the cell receives an excitatory input, then Q[(I_j, l)] is the fraction of recordings observed to fall inside (I_j, l). Consequently, the total mass of the distribution Q that lies in the states (I_N, j), namely, Σ_{j=1}^{N} Q[(I_N, j)], is the probability that a cell will respond to the next excitatory input by firing. Similarly, we can think of Q[(I_j, l)] / Σ_{k=l}^{N} Q[(I_k, l)] as the probability that a cell will be found to have w ∈ I_j just prior to receiving its lth excitatory input since it fired last. Note that the firing probability Σ_{j=1}^{N} Q[(I_N, j)] is not the reciprocal of the average number of failures before a spike. However, the average number of failures, call it E_f, can be computed once the Q[(I_j, l)] are known. Clearly,

$$E_f = \sum_{j=1}^{N-1} j\,F_j, \qquad (4.3)$$
where each F_j denotes the probability that exactly j failures occur. Note that j failures occur precisely when j + 1 conditions are met. That is, for each i = 1, …, j, just prior to the ith input since the last firing, the trajectory falls into a state of the form (I_{k_i}, i), where i ≤ k_i < N. Further, the trajectory ends in state (I_N, j + 1) before the (j + 1)st input, to which the cell responds by firing. The quantity

$$\frac{\sum_{k=i}^{N-1} Q(I_k, i)}{\sum_{k=i}^{N} Q(I_k, i)}$$

equals the probability that just before the ith input arrives, a cell will lie in a state (I_k, i) with k < N. Thus, the failure probabilities are given by

$$F_j = \frac{\sum_{k=1}^{N-1} Q(I_k, 1)}{\sum_{k=1}^{N} Q(I_k, 1)} \cdot \frac{\sum_{k=2}^{N-1} Q(I_k, 2)}{\sum_{k=2}^{N} Q(I_k, 2)} \cdots \frac{\sum_{k=j}^{N-1} Q(I_k, j)}{\sum_{k=j}^{N} Q(I_k, j)} \cdot \frac{Q(I_N, j+1)}{\sum_{k=j+1}^{N} Q(I_k, j+1)}. \qquad (4.4)$$
Of course, while some of the Q(I_k, j) may be zero, the exclusion of transient states in our Markov chain construction ensures that all of the sums in the denominators of equation 4.4 will be positive for every population of cells.
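Equations 4.3 and 4.4 translate directly into a short routine. In the sketch below, the Q values are the limiting-distribution entries v^0 reported for the uniform-IEI TC example of section 6.1 (N = 3, states (1,1), (2,1), (2,2), (3,2), (3,3)); `failure_prob` is our own helper name:

```python
# Q maps states (bin k, input count i) to limiting probabilities; values
# taken from the v^0 vector quoted in section 6.1.
N = 3
Q = {(1, 1): .3404, (2, 1): .1135, (2, 2): .0922,
     (3, 2): .3617, (3, 3): .0922}

def q(k, i):
    """Q(I_k, i), treating absent states as zero mass."""
    return Q.get((k, i), 0.0)

def failure_prob(j):
    """F_j of eq. (4.4): exactly j failed inputs, then a spike."""
    p = 1.0
    for i in range(1, j + 1):  # the first j inputs all fail (k < N)
        denom = sum(q(k, i) for k in range(i, N + 1))
        p *= sum(q(k, i) for k in range(i, N)) / denom
    denom = sum(q(k, j + 1) for k in range(j + 1, N + 1))
    return p * q(N, j + 1) / denom  # the (j+1)st input succeeds

# Expected number of failures between spikes, eq. (4.3).
E_f = sum(j * failure_prob(j) for j in range(1, N))
```

With these Q values the routine reproduces the E_f ≈ 1.20 quoted in section 6.1, and the F_j sum to one as they must.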
Finally, we note that the eigenvalues of the transition matrix can be used to quantify the rate at which the system will converge to its limiting distribution, when one exists. This information may be useful even if Monte Carlo simulations will be used.

5 Extensions of the Markov Chain Approach

The ideas introduced in the previous section are applicable in many other settings. Here we outline extensions to heterogeneous populations of excitable cells subject to identical periodic input and to general excitable systems and related models.

5.1 Populations and Heterogeneity. The limiting distribution computed from a Markov chain cannot be used directly to determine what fraction of a population of identical cells subject to identical stochastic trains of excitatory inputs will respond to a given input. As noted earlier, it is necessary to specify the distribution of initial conditions to compute this fraction. On the other hand, the limiting distribution contains information about a population of identical cells subject to different realizations of the stochastic train of excitatory inputs (Taylor & Karlin, 1998). In this case, all of the quantities discussed in the previous section can be reinterpreted with respect to the firing statistics of a population of cells. For example, the fraction of cells responding with a spike to a given excitatory input equals the probability of a single cell firing to a particular excitatory input, that is, Σ_{j=1}^{N} Q[(I_N, j)]. We can also apply similar ideas to the case of a heterogeneous population of excitable cells all subject to the same periodic input, say, of period T. Heterogeneity in intrinsic dynamics may lead to different rates of evolution for different cells in the silent phase, but this disparity can be eliminated by rescaling time separately for each cell, so that all cells evolve at the same unit speed in the silent phase.
As a result of this rescaling, the heterogeneity in the population will become a heterogeneity in firing thresholds. Denote the resulting distribution of thresholds by φ, such that for any t > 0, ∫_0^t φ(τ) dτ gives the fraction of cells with thresholds below t. There will exist a maximal nonnegative integer m and a minimal positive integer n > m such that the support of the distribution φ is contained in the interval [mT, nT]. Thus, for i ≥ 1, δ_i = ∫_{(i−1)T}^{iT} φ(τ) dτ gives the fraction of cells that will fire in response to the ith input, which is nonzero for i = m + 1, …, n. We can define Markov chain states I_j, for j = 1, …, n, by stating that a cell is in state I_j if it has received j − 1 inputs since it last fired. The transition probability P(j, j+1) from state I_j to state I_{j+1} is 1 if j ≤ m; if j = m + 1, then P(m+1, m+2) = 1 − δ_{m+1}, while the probability of transition from state I_{m+1} to state I_1 is P(m+1, 1) = δ_{m+1}; and if m + 1 < j ≤ n − 1, then P(j, j+1) is given by the proportion of those cells that have not fired to
the first j − 1 inputs that also do not fire to the jth input, namely,

$$P(j, j+1) = 1 - \gamma_j := 1 - \frac{\delta_j}{1 - \delta_{j-1} - \delta_{j-2} - \cdots - \delta_{m+1}} = \frac{1 - \delta_j - \delta_{j-1} - \cdots - \delta_{m+1}}{1 - \delta_{j-1} - \delta_{j-2} - \cdots - \delta_{m+1}},$$

while P(j, 1) = γ_j = δ_j / (1 − δ_{j−1} − … − δ_{m+1}). Finally, the transition probability from state I_n to state I_1 is 1, and all other transitions have zero probability. If we set γ_{m+1} = δ_{m+1}, then the transition matrix takes the form
$$A = \begin{pmatrix}
0 & 1 & 0 & \cdots & & & 0 \\
0 & 0 & 1 & \cdots & & & 0 \\
\vdots & & & \ddots & & & \vdots \\
\gamma_{m+1} & 0 & \cdots & & 1-\gamma_{m+1} & \cdots & 0 \\
\vdots & & & & & \ddots & \vdots \\
\gamma_{n-1} & 0 & \cdots & & & & 1-\gamma_{n-1} \\
1 & 0 & \cdots & & & & 0
\end{pmatrix},$$

where each entry 1 − γ_j occupies the superdiagonal position (j, j + 1).
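This matrix can be assembled mechanically from the firing fractions δ_i. The sketch below uses hypothetical values (m = 1, n = 4); `build_matrix` is our own helper name:

```python
def build_matrix(delta, m, n):
    """Build the n x n heterogeneous-population transition matrix.

    delta maps i -> delta_i, the fraction of cells firing to the ith
    input (nonzero for i = m+1..n). Rows/columns are states I_1..I_n."""
    A = [[0.0] * n for _ in range(n)]
    for j in range(1, n):              # states I_1 .. I_{n-1}
        if j <= m:
            A[j - 1][j] = 1.0          # below threshold: advance for sure
        else:
            # gamma_j = delta_j / (1 - delta_{j-1} - ... - delta_{m+1})
            tail = 1.0 - sum(delta.get(i, 0.0) for i in range(m + 1, j))
            gamma = delta.get(j, 0.0) / tail
            A[j - 1][0] = gamma        # fire and return to I_1
            A[j - 1][j] = 1.0 - gamma  # fail and advance to I_{j+1}
    A[n - 1][0] = 1.0                  # a cell in I_n always fires
    return A

# Hypothetical firing fractions summing to one.
delta = {2: 0.3, 3: 0.5, 4: 0.2}
A = build_matrix(delta, m=1, n=4)
```

Each row sums to one by construction, and the first-column entries are exactly the γ_j of the text.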
As long as m < n − 1, such that the distribution φ has support on an interval of length greater than T, not all cells will fire together. Under this condition, there exists i < n such that δ_i ≠ 0 and, correspondingly, γ_i ≠ 0. In this case, the proof of theorem 2 immediately generalizes to show that there exists a sufficiently large number of iterations N such that {A^r}_{i,i} > 0 for all r > N, which implies that A is aperiodic and the Markov chain has a limiting distribution. This result can be interpreted as follows. If the heterogeneity is weak, then cells starting with the same initial condition will always respond to the same inputs in a train, giving an unreliable population response. With a stronger degree of heterogeneity, the population response will disperse, eventually every input will evoke a response from some nonempty subset of the cells, and the statistics of the population response will be given by the limiting distribution of the Markov chain.

5.2 General Excitable Systems and Related Models. An excitable system can be characterized by the existence of a stable rest state and a threshold. In such a system, when a perturbation pushes a trajectory from the rest state across the threshold, the trajectory makes a significant excursion before returning to a small neighborhood of the rest state. Suppose that an n-dimensional excitable system receives transient inputs of a stereotyped
form. If we select a time T, then we can define a map M_T(u) having as its domain a set of starting states u for the system, each of which is an n-dimensional vector. In theory, bins in n-dimensional space could be defined to implement the Markov chain approach for an n-dimensional map. Indexing these bins efficiently and computing transition probabilities between them could become problematic when n is not small, however. On the other hand, the Markov chain approach discussed in the previous section can be applied directly to systems for which the images M_T(u) = u(T) lie (approximately) on an interval in the phase space of the system, and which are reset to a fixed value u_reset after crossing threshold. The assorted versions of the integrate-and-fire (IF) model (leaky IF, quadratic IF, exponential IF, and so on) in the subthreshold regime satisfy both of these conditions, as do the lighthouse model (Haken, 2002; Chow & Coombes, 2006) and the Kuramoto (1984) model on S¹. There is an important difference between the excitable systems considered in detail here and the IF model, however. When the results derived for excitable systems are applied to neuronal models, the Markov chain will be defined using a slowly changing ionic conductance, or an associated activation or inactivation, while in the case of the IF model, it would be defined in terms of the voltage. As a consequence, for the IF case, one would need to take into account the jumps in voltage due to synaptic inputs when defining the states of the Markov chain and computing the transition probabilities. Moreover, an additional difficulty arises from the fact that voltage will decrease after the application of an input that pulls the model above its intrinsic rest state but fails to cause a threshold crossing. This nonmonotonicity will complicate bin definitions. A similar issue will arise even in an excitable system with one slow variable w if the system remains excitable while its input is on.
In this case, w converges toward w_FP^E whenever w ∈ (w_FP^E, w_LK^E) on N^E.

6 An Example: The TC Model

A prototypical representative of the class of models to which the analysis outlined in the preceding sections is applicable is a model for a thalamocortical relay (TC) cell relevant for the study of Parkinson's disease (PD) and deep brain stimulation (DBS). The TC model that we consider takes a similar form to the reduced TC model in Rubin and Terman (2004):

$$\begin{aligned} C_m v' &= -I_L - I_T - g_{exc}\, s_{exc}\, (v - v_{exc}) - g_{inh}\, s_{inh}\, (v - v_{inh}) \\ w' &= \phi\, (w_\infty(v) - w)/\tau(v). \end{aligned} \qquad (6.1)$$
Here, the leak current I_L = g_L (v − v_L) and the T-type or low-threshold calcium current I_T = g_T m_∞(v) w (v − v_Ca), with parameter values C_m = 1 µF/cm², g_L = 1.5 mS/cm², v_L = −68 mV, g_T = 5 mS/cm², v_Ca = 90 mV,
g_exc = 0.08 mS/cm², v_exc = 0 mV, g_inh = 0.12 mS/cm², v_inh = −85 mV, φ = 3.5, and functions m_∞(v) = (1 + exp(−(v + 35)/7.4))^{−1}, w_∞(v) = (1 + exp((v + 61)/9))^{−1}, and τ(v) = 10 + 400/(1 + exp((v + 50)/3)). The values of s_exc and s_inh will be determined by stochastic processes discussed below, and we made an additional modification to the model to make it oscillatory in the presence of excitation, which is discussed in remark 8. The assumptions that we make regarding the presence of three timescales in the dynamics are not unreasonable in this model, as shown in Rubin and Terman (2004) and Stone (2004). In this section, we present the analytical results of the Markov chain approach for this model for two particular IEI distributions, one uniform and one normal, both with inhibition held off for all time and with inhibition held on for all time (further details appear in appendixes A and B). We compare the stationary distributions found analytically with those found by numerical simulations and find good agreement. Indeed, the Markov chain analysis can be used to check under what conditions a limiting distribution exists, so that numerical simulations will converge, and how long they need to be run to yield accurate results.

6.1 Uniform IEI Distribution. Consider system 6.1 with constant inhibition, which may be on or off. We take the IEIs to be distributed uniformly on [30, 70] msec, except for the first such interval after each reset, which is chosen from a uniform distribution on [20, 60] msec (see remark 7 below). Using the bin definitions 3.1 and 3.2, with this minor adjustment to I_1, we have I_1 = [w_RK^E · 20, w_RK^E · 50), I_2 = [w_RK^E · 50, w_RK^E · 80), …, I_{N−1} = [w_RK^E · (20 + (N − 2) · 30), w_LK^E], I_N = (w_LK^E, w_FP^0).
For the default parameters of system 6.1, inhibition held at s_inh = 0, and the excitation characteristics described, simulation of the TC model, equation 6.1, for one passage through the silent phase shows that N = 3, while with inhibition at s_inh = 1, we find N = 5. With s_inh = 0, the states of the Markov chain are (1, 1), (2, 1), (2, 2), (3, 2), (3, 3), with bin boundaries defined from the one-time simulation. The transition matrix for this case, call it P^0, is computed analytically in appendix A and appears in equation A.1. The unit-dominant eigenvector of (P^0)^T gives the limiting distribution,

$$v^0 = [.3404\ \ .1135\ \ .0922\ \ .3617\ \ .0922]^T. \qquad (6.2)$$
As discussed in section 4, the values of v^0 represent the likelihood that a cell is found in a given state, if state membership is recorded just prior to the onset of an excitatory input. For comparison, we simulated a single cell, modeled by equation 6.1, with the technical modification mentioned in remark 8 below, over 70 sec, after an initial transient of 10 sec. This
simulation yielded the vector

v^0_num = [.3276  .1289  .0878  .3629  .0878]^T

of proportions of inputs during which the cell belonged to each relevant state, which agree nicely with the analytically computed expectations v^0 in equation 6.2. Over the entire simulation, the cell never failed to fire to three consecutive inputs. Grouping our analytical values by bins indicates that just before onset of excitation, on 34.04% of the observations, we expect w ∈ [w_RK^E · 20, w_RK^E · 50); on 20.57% of the observations, we expect w ∈ [w_RK^E · 50, w_RK^E · 75.5); and on 45.39% of the observations, we expect w > w_RK^E · 75.5. In particular, this implies that a cell will fire to roughly 45% of its inputs for these choices of parameters. Further, from equations 4.3 and 4.4 with Q values given by entries of v^0, it is expected that a successful input will be followed by 1.20 inputs that fail to elicit a spike, before the next successful input occurs; this is in good agreement with direct simulation results showing a mean of 1.19 unsuccessful inputs between successful ones. As discussed in section 5, these results can also be interpreted in terms of a population of identical cells as long as the cells receive independent realizations of the input. With nonzero constant inhibition, given by s_inh = 1, the states of the Markov chain are (1,1), (2,1), (2,2), (3,2), (3,3), (4,2), (4,3), (4,4), (5,2), (5,3), (5,4), and (5,5). The transition matrix, P^I, is computed analytically in appendix B and appears in equation B.1. In this case, (P^I)^T has the unit-dominant eigenvector v^I, which is shown below together with the distribution of states listed in vector v^I_num. The latter are obtained from direct numerical simulation for 80 sec, with a transient consisting of the first 10 sec removed from consideration:

v^I     = [.2290  .0763  .0859  .1622  .0215  .0569  .0624  .0005  .0004  .2211  .0833  .0005]^T
v^I_num = [.2272  .0784  .0726  .1927  .0194  .0388  .0669  .0022  .0007  .2171  .0820  .0002]^T.
Grouping the states (I_5, j) reveals that a cell will fire in response to about 30% of its inputs. This failure rate is higher than in the case of s_inh = 0, which is to be expected because inhibition changes the locations of w_LK^E and w_RK^E, so that the time to evolve from w_RK^E to w_LK^E is longer with inhibition on than with inhibition off. Similarly, based on v^I, the case of s_inh = 1 has a substantially higher expected number of failed inputs between spike-inducing inputs, namely, E_f = 1(.0013) + 2(.7240) + 3(.2731) + 4(.0016) = 2.28 from equations 4.3 and 4.4, compared to the 1.20 expected in the case of s_inh = 0. Direct numerical simulation gives a similar estimate of E_f = 2.35 in this case.
Remark 7. In the above calculation, each time interval from the offset of one input to the onset of the next was chosen from a uniform distribution on [20, 60] msec. Suppose that we were to fix the total duration of each excitatory input at 10 msec. Since ε ≠ 0 in simulations, even with this constant input duration, the part of the input duration remaining after the cell is reset from the active phase would vary after different spiking events. This variation can be handled easily, as discussed in remark 4, and results in a distribution of IEIs T with support on [20 + x, 70] msec, where x is the minimal duration of input remaining after reset. However, to maintain a uniform IEI distribution, we chose to adjust the simulations by simply turning off the input each time a firing cell is reset or, equivalently, setting the input duration equal to 0 on the slow timescale. As a result, the distribution for the first IEI after reset was supported on [20, 60] msec rather than on [30, 70] msec.

Remark 8. For consistency with section 3, we modified model 6.1 to make the cell oscillatory in the presence of excitation. To do this, we simplified the dynamics to the form w' = (w_∞ − w)/τ_w whenever v < −55, for constants w_∞ = .61 and τ_w = 407, which were chosen to give nice bin boundary values. We also made a related technical adjustment to simplify the presentation of these example results, related to the nonzero duration of excitation. If an excitatory input arrives but fails to make a cell fire, then the cell jumps from the left branch of N^0 to the left branch of N^E, and it evolves on this branch of N^E while the input is on. The rates of flow on the left branches of N^0 and N^E may differ, however. Thus, the time a cell takes to travel from one value of w to another in the silent phase may depend on how many inputs the cell receives during the passage.
This can easily be taken into account in the above calculations by making appropriate adjustments to bin boundaries, but this would clutter the exposition. Hence, we instead adjusted the flow in our simulations of equation 6.1 in this example so that the w dynamics was identical on the left branches of N^0 and N^E. In theory, the way that we did this introduces the possibility that solutions may escape from the left branch of N^0 in the vicinity of w_LK^0 and fire without receiving input, since w_LK^0 < w_∞. However, in practice, this does not occur because w_LK^0 is sufficiently large, relative to the IEIs, that it is never reached.

6.2 Normal, or Other, IEI Distributions. If IEIs are taken from a nonuniform distribution, we can still use equation 3.5 to obtain the transition matrix elements analytically, albeit with numerical evaluation of the integrals that arise. These integrals are analogous to those given in appendixes A and B for the uniform case. Once the transition matrix is obtained, the limiting distribution for the Markov chain is computed as the unit-dominant eigenvector of its transpose, as previously. For example, consider IEIs of the form T = t_ref + X, where t_ref is a fixed constant and X is selected from a truncated normal distribution. In
particular, suppose that t_ref = 20 msec, that X̃ is selected from a normal distribution with mean 20 msec and standard deviation 10 msec, and that

$$X = \begin{cases} 0, & \tilde X < 0, \\ \tilde X / M_X, & 0 \le \tilde X \le 40, \\ 40, & 40 < \tilde X, \end{cases} \qquad (6.3)$$
where M_X is a constant correction factor chosen so that the density of X integrates to one over ℝ. This is a reasonable choice that keeps IEIs within the bounds present in the example in the previous section. For this example, with other simulation parameters fixed as in the previous section, including s_inh = 0 and an input duration of 10 msec, and the same adjustment mentioned in remark 8, we obtained the transition matrix P_n^0 given in equation A.2, corresponding to states (1, 1), (2, 1), (2, 2), (3, 2), (3, 3). The dominant eigenvector v_n^0 of (P_n^0)^T matches closely with a vector (v_n^0)_num obtained from direct counting of state membership in the last 90 sec of a 100 sec numerical simulation:

v_n^0       = [.4073  .0670  .0594  .4108  .0594]^T
(v_n^0)_num = [.3883  .0812  .0625  .4070  .0610]^T.
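An IEI of this form is easy to generate in simulation. The sketch below clips a normal sample to [0, 40] as in equation 6.3, omitting the correction factor (i.e., assuming M_X = 1 for simplicity); the names `sample_iei` and `T_REF` are our own:

```python
import random

random.seed(0)      # fixed seed so the run is reproducible
T_REF = 20.0        # t_ref, in msec

def sample_iei():
    """Draw T = t_ref + X with X a normal sample clipped to [0, 40]."""
    x = random.gauss(20.0, 10.0)   # mean 20 msec, s.d. 10 msec
    x = min(max(x, 0.0), 40.0)     # clip to [0, 40], per eq. (6.3)
    return T_REF + x

ieis = [sample_iei() for _ in range(20000)]
```

Because the clipping is symmetric about the mean, the sampled IEIs stay in [20, 60] msec with mean near 40 msec, matching the bounds of the uniform example in section 6.1.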
Here, the subscript n simply indicates the use of a normal distribution of IEIs. Based on these results, we expect that a cell will respond to approximately 47% of the inputs, with an average of 1.15 failures between successful spikes from equations 4.3 and 4.4. This agrees nicely with the estimate 1.13 that we obtained from direct numerical simulations. For completeness, we conclude with the results of an analogous calculation done with s_inh = 1. This yields the transition matrix P_n^I given in equation B.2, corresponding to states (1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (4, 2), (4, 3), (5, 2), (5, 3), (5, 4). The corresponding dominant eigenvector v_n^I, and an example vector (v_n^I)_num of state occupancy probabilities from direct simulations, are

v_n^I       = [.2638  .0438  .0672  .2234  .0072  .0169  .0701  .2133  .0942]^T
(v_n^I)_num = [.2587  .0505  .0697  .2132  .0121  .0263  .0637  .2329  .0728]^T,
where we have omitted the (5, 2) entry since the probability of occupancy there is less than 0.5 × 10^{−5}. These results imply that a cell with s_inh = 1 will respond to approximately 31% of excitatory inputs and is thus less reliable than a cell with s_inh = 0. The average number of failures between spikes is 2.27, from equations 4.3 and 4.4, which is similar to the 2.24 obtained from direct simulations and exceeds that found with s_inh = 0.
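The ≈31% figure can be checked directly from the interpretation in section 4.2: the firing probability is the total limiting-distribution mass in the states (I_N, j). The sketch below groups the v_n^I entries quoted above by state (with the negligible (5, 2) entry omitted, as in the text):

```python
# State list and limiting-distribution values for the s_inh = 1,
# normal-IEI case of section 6.2; (5, 2) is omitted as its mass
# is below 0.5e-5.
states = [(1, 1), (2, 1), (2, 2), (3, 2), (3, 3),
          (4, 2), (4, 3), (5, 3), (5, 4)]
v_nI = [.2638, .0438, .0672, .2234, .0072, .0169, .0701, .2133, .0942]

N = 5  # highest bin: a cell in (I_5, j) fires to its next input
fire_prob = sum(q for (k, _), q in zip(states, v_nI) if k == N)
```

Summing the (I_5, j) entries gives roughly 0.31, the firing probability quoted in the text.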
7 Inhibitory and Excitatory Inputs

We next consider the more complex case in which the postsynaptic cell receives a stochastic excitatory input train while subjected to modulation by an inhibitory input. The corresponding analysis illustrates how switches between any pair of epochs with different state transition characteristics can be handled naturally within the Markov chain framework through products of transformation matrices. In particular, the same type of analysis could be done if both input trains were excitatory, with different statistics. Moreover, in the next section of the letter, the example that we present will be discussed in connection with a possible mechanism for the therapeutic effectiveness of DBS.
7.1 Transition Matrices. In this section, we assume that for a system of the form 2.1 with input given by equation 2.2, the function sinh(t) turns on and off abruptly, with a relatively long period between transitions. In the previous section, we discussed how to derive the transition matrices P^I and P^0 for the case of constant inhibition of any level. Our next goal is to derive transition matrices P^{0→I} and P^{I→0}, encoding the probabilities of passing between various states during the onset and offset of inhibition, respectively. Using these matrices, we can form a transition matrix for the time from one inhibitory offset (or onset) to the next, by matrix multiplication. One complication in this derivation is that the bin boundaries differ between the cases of zero and nonzero inhibition. Even with the assumption that the rate of evolution in the silent phase is independent of input level, differences in bin boundaries remain due to differences in right knee positions, which lead to different starting points in the silent phase. Similarly, differences in left knee positions lead to different cutoffs for firing. To make this explicit, in the following analysis, let {S_j^0}_{j=1}^{N_0} denote the ordered states for the Markov chain formed when inhibition is off, which we call the inhibition-off Markov chain, and let {S_j^I}_{j=1}^{N_I} denote the states for the Markov chain defined when inhibition is on, which we call the inhibition-on Markov chain. We use the notation (I_k^0, l) or (I_k^I, l) for the states in these collections. A second difficulty is that, due to the stochasticity of the input trains, a change in inhibition level may occur at any time relative to the start of the particular IEI during which it happens, or even at a time when excitation is on. An example of how the resulting bin membership depends on the inhibition onset time is illustrated schematically in Figure 6.
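The first difficulty, mismatched bin boundaries between the two partitions, is at bottom a re-binning problem: the same slow-variable value can fall in different bins under the two partitions, or below the inhibition-on partition entirely. A small illustration with hypothetical boundary values (not taken from the model):

```python
import bisect

def bin_index(w, boundaries):
    """Return the 1-based index of the bin containing w.
    boundaries[k] is the lower edge of bin k+1; values below
    boundaries[0] get index 0, i.e., they lie below the partition."""
    return bisect.bisect_right(boundaries, w)

# Hypothetical lower bin edges (slow-variable values) for the
# inhibition-off and inhibition-on partitions; the on-partition
# starts higher because inhibition shifts the knees to higher w.
bounds_off = [0.00, 0.08, 0.15, 0.21]   # edges of I_1^0 .. I_4^0
bounds_on  = [0.05, 0.12, 0.18, 0.24]   # edges of I_1^I .. I_4^I

w = 0.16
k_off = bin_index(w, bounds_off)   # bin in the inhibition-off partition
k_on  = bin_index(w, bounds_on)    # bin in the inhibition-on partition
print(k_off, k_on)                 # the same w lands in different bins
```

With these (made-up) edges, w = 0.16 lies in bin 3 of the off-partition but bin 2 of the on-partition, and a value such as w = 0.02 falls below the on-partition altogether, the situation that later forces extra bins to be appended.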
To handle both of these issues, we define the transition matrix P^{0→I} for the onset of inhibition in the following way. First, note that there will be a final iteration of the inhibition-off Markov chain, before inhibition arrives, after which the cell will lie in a state S_u^0 for some u. During the next IEI, inhibition arrives. At the end of this IEI, the cell will lie in a state S_v^I for some v. For i = 1, . . . , N_0 and j = 1, . . . , N_I, let the (i, j) entry of P^{0→I} denote the
Figure 6: A schematic depiction of the two difficulties encountered when defining the matrix P^{0→I}. (Left) An ensemble of cells A in state (I_2^0, l) is mapped to two bins, I_2^I and I_3^I, when inhibition turns on. (Right) The same ensemble gets mapped only to the bin I_3^I if inhibition turns on a time T later than in the left panel.
probability that u = i and v = j. Recall that the states of the inhibition-off Markov chain, by definition, take the form (I_k^0, l), where l gives a count of IEIs since the last reset; thus, many elements of P^{0→I} will be zero. There are several important advantages to basing the cycle length on excitatory input times only, rather than on the time when inhibition turns on. This definition of P^{0→I} renders irrelevant the bins into which an ensemble is mapped by the onset of inhibition itself (e.g., Figure 6), since bin membership is not checked when inhibition is turned on but is instead checked at the end of the IEI during which the inhibitory onset occurs. Moreover, in this formulation, it does not matter whether inhibition turns on while the excitation is still on or after it is off, if we continue to assume that the rate of silent phase evolution is input independent (as assumed in remark 8). Whether the last excitatory input that arrives before the onset of inhibition causes the cell to fire, or fails to do so, is determined by the knee positions without inhibition, and if a firing occurs, instantaneous reset to w_RK^E follows. We neglect the probability zero case of excitation and inhibition turning on at precisely the same moment, and therefore if a cell fires, then inhibition will always turn on after the cell is reset. As previously, let w_LK^{I+E}, w_RK^{I+E} denote the knees of the nullcline N^{I+E} corresponding to the case when both excitation and inhibition are on. One additional complication may arise due to the fact that w_RK^E < w_RK^{I+E}. That is, at the onset of inhibition, w may lie below the lower boundary of the inhibition-on partition. To account for this possibility requires the inclusion of additional bins in this partition, with a corresponding adjustment of the matrix P^I to allow for matrix multiplication.
We define the transition matrix P^{I→0}, for the offset of inhibition, analogously to P^{0→I}. That is, there will be a final iteration of the inhibition-on Markov chain, before inhibition turns off, after which the cell will lie in a state S_u^I. During the next IEI, inhibition turns off. At the end of this IEI, the cell will lie in a state S_v^0. The (i, j) entry of P^{I→0} denotes the probability that u = i and v = j. The complication in this case is that postinhibitory rebound (PIR) may occur if, when the offset of inhibition occurs, either excitation is off and w > w_LK^0 or excitation is still on and w > w_LK^E. Rebound leads to reinjection into the silent phase, followed by evolution there. However, the duration of this evolution, from the time of reinjection to the time that the next excitatory input arrives, depends on when the inhibition turned off relative to the time of the previous input, which complicates calculations. Assuming that P^{0→I} and P^{I→0} can be computed, the appropriate transition matrix for the case of excitatory and inhibitory input trains is obtained by multiplication of the transposes of the separately computed transition matrices P^0, P^I, P^{0→I}, and P^{I→0}. Specifically, let [M]^T denote the transpose of matrix M, and let (M)^n denote M to the nth power. Suppose that n IEIs occur in the absence of inhibition, that inhibition arrives during the next IEI, that m additional IEIs occur with inhibition on, and that inhibition turns off again on the next IEI. In this case, the (j, i) entry of the matrix [P^{I→0}]^T ([P^I]^T)^m [P^{0→I}]^T ([P^0]^T)^n gives the probability that a cell that starts in state S_i^0 of the inhibition-off partition ends up in state S_j^0 after the sequence of n + m + 2 IEIs described above.
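The product construction just described can be sketched generically. The 2-state matrices and cycle-length probabilities below are hypothetical stand-ins (the real P^0, P^I, P^{0→I}, P^{I→0} would come from equation 3.5), chosen only to show the bookkeeping, including the mixture over (m, n) used in the next paragraph:

```python
# Composite transition matrix for one inhibition off/on cycle,
# built as a probability-weighted mixture of matrix products.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def T(A):  # transpose
    return [list(row) for row in zip(*A)]

def matpow(A, n):
    R = [[float(i == j) for j in range(len(A))] for i in range(len(A))]
    for _ in range(n):
        R = matmul(R, A)
    return R

# Hypothetical 2-state stochastic matrices (rows sum to 1).
P0  = [[0.3, 0.7], [0.6, 0.4]]    # inhibition-off chain
PI  = [[0.8, 0.2], [0.5, 0.5]]    # inhibition-on chain
P0I = [[0.9, 0.1], [0.2, 0.8]]    # onset of inhibition
PI0 = [[0.7, 0.3], [0.4, 0.6]]    # offset of inhibition

# c[(m, n)]: hypothetical probability of m on-IEIs and n off-IEIs.
c = {(0, 0): 0.1, (1, 1): 0.5, (2, 1): 0.4}

n_states = 2
M_mix = [[0.0] * n_states for _ in range(n_states)]
for (m, n), cmn in c.items():
    term = matmul(matmul(T(PI0), matpow(T(PI), m)),
                  matmul(T(P0I), matpow(T(P0), n)))
    for i in range(n_states):
        for j in range(n_states):
            M_mix[i][j] += cmn * term[i][j]

# Each transposed stochastic factor has unit column sums, so the
# mixture does too; its unit-dominant eigenvector then gives the
# limiting occupancy probabilities at inhibitory offsets.
print([round(sum(M_mix[i][j] for i in range(n_states)), 6)
       for j in range(n_states)])
```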
Of course, when the IEIs and the durations of inhibitory on and off periods are selected randomly from distributions, the probabilities that different numbers (m, n) of excitatory inputs arrive during these periods are also random; in appendix C, these are calculated for the example of uniform distributions. In the general discussion given here, let us assume that m, n ≥ 0 are bounded above by M, N < ∞, respectively. Suppose that over a long time period, we check the cell's state membership at the first onset of excitation following each inhibitory offset. We would like to claim that in the asymptotic limit, the proportion of these trials for which a cell will belong to each inhibition-off state is given by the entries of the unit-dominant eigenvector of the matrix

Σ_{m=0}^{M} Σ_{n=0}^{N} c_{m,n} [P^{I→0}]^T ([P^I]^T)^m [P^{0→I}]^T ([P^0]^T)^n,     (7.1)
where each coefficient c_{m,n} denotes the probability of occurrence of the corresponding exponent pair (m, n), some of which may be 0. Substantiating
this claim necessitates justifying whether this eigenvector exists and truly represents a limiting distribution. This will be true if the matrix in equation 7.1 is irreducible and aperiodic, as discussed in section 4. Recall that the proof of theorem 2 gives an upper bound b for the number of inputs after which occupancy of all states is guaranteed, under the assumptions of the theorem, for constant inhibition. Now, let b_I, b_0 denote the respective upper bounds for the matrices P^I, P^0, based on the IEI distribution. A sufficient pair of conditions to guarantee the existence of a limiting distribution for equation 7.1 is that if the probability p(m) > 0, then m ≥ b_I, while if p(n) > 0, then n ≥ b_0. If these conditions are violated, the states of the system may still tend to some limiting distribution, particularly if p(m), p(n) are substantial for some values above the respective bounds, but this must be verified numerically. Similarly, the unit-dominant eigenvector of the matrix

Σ_{m=1}^{M} Σ_{n=1}^{N} c_{m,n} [P^{0→I}]^T ([P^0]^T)^n [P^{I→0}]^T ([P^I]^T)^m,     (7.2)
if it exists and represents a limiting distribution, gives the expected proportions of trials for which a cell will belong to each inhibition-on state (I_k^I, l) at the moment of arrival of the first excitatory input after inhibition turns on. The dominant eigenvectors of the matrices

Σ_{m=1}^{M} Σ_{n=1}^{N} c_{m,n} ([P^0]^T)^n [P^{I→0}]^T ([P^I]^T)^m [P^{0→I}]^T

and

Σ_{m=1}^{M} Σ_{n=1}^{N} c_{m,n} ([P^I]^T)^m [P^{0→I}]^T ([P^0]^T)^n [P^{I→0}]^T
have similar interpretations.

Remark 9. In fact, justifying the existence of the limiting distribution for the matrix in equation 7.2 requires an additional technical adjustment, relative to that for matrix 7.1. The added complication arises because the fixed point of N^I is higher than that of N^0, which may lead to a violation of irreducibility. This is simply a technical point and can be handled by replacing ([P^I]^T)^m by the product of a sequence of nonidentical matrices that incorporate successively larger numbers of states, assuming m is not too small.

7.2 TC Cell Example Revisited: Uniform Distributions. Consider again the TC model equation 6.1. We focused on inhibitory onset and computed the transition matrices from equation 7.2 analytically, using
equation 3.5 as in the constant inhibition cases discussed in appendixes A and B, under the assumptions that the intervals from excitatory offsets to onsets are selected from a uniform distribution on [20,60] msec, that the excitatory duration is fixed at 10 msec, and that the durations of both the inhibitory inputs and the time intervals between inhibitory inputs are selected from a uniform distribution on [125,175] msec. In this example, the coefficients c_{m,n} = p(m)p(n) in equation 7.2 can also be computed analytically, and we discuss some details of this calculation in appendix C. After the p(m) are computed, the dominant eigenvector v^{0→I} of matrix 7.2 can be obtained. We compare v^{0→I} to its counterpart, v_num^{0→I}, generated numerically from 301 inhibitory onsets during the last 90 sec of a 100 sec simulation:

v^{0→I} = [.3530 .0801 .1598 .2558 .0333 .0677 .0363 0 0 .0140 0 0]^T
v_num^{0→I} = [.3089 .1030 .0897 .2724 .0199 .0432 .0698 0 0 .0664 .0066 0]^T.
The states here are exactly those listed in section 6 for the uniform IEI distribution with sinh = 1: (1,1), (2,1), (2,2), (3,2), (3,3), (4,2), (4,3), (4,4), (5,2), (5,3), (5,4), and (5,5). Note that the zero entries here signify values less than 0.5 × 10^−5. These states are included in the chain because they are reached after inhibition stays on sufficiently long, since inhibition shifts the v-nullcline to higher w-values; however, the probability of membership in these states just after inhibition turns on is negligible. The agreement between v^{0→I} and v_num^{0→I} is fairly good, although not as good as in the case of constant inhibition in section 6. This is presumably due to errors introduced by some minor simplifying assumptions that we made, which are discussed in remark 10. The distribution v^{0→I} indicates that a cell will be expected to fire well under 10% of the time to the first input that arrives just after the onset of inhibition. This low number fits the prediction of phase plane and bifurcation analysis, which implies that the onset of inhibition interferes with responsiveness to the subsequent excitatory input (Rubin & Terman, 2004). Moreover, the full distribution v^{0→I} characterizes the expected behavior of the cell after this first excitatory input. In particular, note that the cell will be in the (1,1) state, corresponding to the lowest slow variable values, on the arrival of over 30% of such inputs, which is a significant increase over the likelihood of membership in this state in the constant inhibition sinh = 1 case. Hence, the compromise of responsiveness by rhythmic inhibition will endure beyond the first excitation after inhibitory onset.

Remark 10. Recall that PIR, or rebound, refers to the firing of a spike immediately on the removal of inhibition.
Although we focus on equation 7.2, and hence on the limiting distribution v^{0→I} corresponding to the onset of inhibition, in the above example we still must consider PIR in the calculation of the elements of the matrix P^{I→0}, which appears in equation 7.2. Since the possibility of rebounding in general, and in particular of rebounding
and reaching I_2 before the arrival of the next excitation, is relatively small for our parameter values, we make the simplifying assumption that after PIR, a cell can reach only I_1 before the next excitation arrives. To maintain tractability, we also neglect the fact that w_RK^{I+E} > w_RK^E, based on the fact that the right knees are fairly close for ginh = 0.12. These approximations do introduce some error into our results.

7.3 TC Cell Example Revisited: Normal Distributions. For comparison to direct numerical simulation, we also calculated the transition matrices and dominant eigenvector when IEIs were selected from the truncated normal distribution described earlier (see equation 6.3), with inhibition on and off durations selected from a similar truncated normal distribution. The distribution for on and off intervals of inhibition was supported on [125,175] msec and had mean 150 msec, as in the uniform distribution example. To compute the relevant dominant eigenvector, as done above in the uniform distribution case, we need the coefficients c_{m,n}, which we computed as c_{m,n} = p(m)p(n) from the numerically obtained probabilities p(1) = .0014, p(2) = .1625, p(3) = .6763, p(4) = .1625, p(5) = .0014, with p(j) = 0 for j ≥ 6. In this case, we used transition probabilities obtained from long-time simulations to compute the transition matrices P_n^{I→0}, P_n^{0→I}, which we do not display here, although these could have been computed from equation 3.5 as well. The dominant eigenvector v_n^{0→I}, corresponding to state occupancy at the arrival time of the first excitatory input after inhibitory onset, and an example of the distribution of states (v_n^{0→I})_num obtained from the last 90 sec of a 100 sec simulation are

v_n^{0→I} = [.3354 .0463 .1233 .3685 .0083 .0394 .0350 .0439 0]^T
(v_n^{0→I})_num = [.3984 .0549 .1071 .3132 .0082 .0302 .0384 .0467 .0027]^T,

for states (1,1), (2,1), (2,2), (3,2), (3,3), (4,2), (4,3), (5,3), (5,4), with the subscript n for normal, as previously. These results show that a cell can be expected to fire to the first excitatory input after the onset of inhibition less than 5% of the time, and it will lie in the (1,1) state on the arrival of over 30% of such inputs, as also seen in the uniform distribution example.

8 Explicit Connection to Parkinsonian Reliability Failure

In PD, the inhibitory output from the basal ganglia may become rhythmic. DBS eliminates this rhythmicity, leading to inhibition from the basal ganglia that is elevated but, when summed over all inputs to a cell, is roughly constant, possibly with fast oscillations around a high constant level. A possible mechanism for the induction of motor symptoms in PD and for their amelioration by DBS, analyzed in Rubin and Terman (2004), is that
rhythmic basal ganglia inhibitory outputs periodically compromise TC response reliability, while the regularization of these outputs by DBS restores responsiveness. The drastic drop in the number of TC cells expected to fire, and the accumulation of cells in states far from firing threshold, found just after inhibitory onset in the case of rhythmic inhibition in our examples, relative even to the case of elevated but constant inhibition, offers a strong demonstration of the feasibility of this idea. The approach that we have taken to obtain these results is based on limiting distributions and hence eliminates the possibility of spurious outcomes due to transient effects in simulations. Suppose that we adjust parameters to an extreme case, so that TC cells fire in response to every excitatory input both when sinh = 0 and when sinh = 1. In terms of the slow variable, these conditions become

w_RK^E · S > w_LK^E     (8.1)

and

w_RK^{I+E} · S > w_LK^{I+E}.
Even with these strong conditions, we still find some failures when inhibition turns on and off rhythmically. In this case, we can read off from v^{0→I} the probabilities with which a cell will experience each possible number of failures due to inhibitory onset, as seen in the examples discussed above. In particular, even if no failures occur when inhibition is held at a constant level, there may be multiple failures following inhibitory onset if the difference w_RK^{I+E} − w_RK^E is large. Inhibitory offset may lead to response failure as well, through PIR. Cells that rebound when inhibition turns off are reset to w_RK^E. The assumption of reliability in the inhibition-off case implies that inequality 8.1 holds. However, the time from the offset of inhibition until the arrival of the next excitatory input may be less than S, since inhibition may turn off at any moment in the IEI. Therefore, the time t* from PIR to the next arrival of excitatory input may be less than S, and a response failure can follow PIR. Finally, after such a failure, relay will proceed reliably under assumption 8.1, since a cell will lie at w_RK^E · t* > w_RK^E after its first failure. In summary, even under the assumption of perfect TC relay reliability in the constant inhibition case, multiple relay failures can occur following inhibitory onsets, while inhibitory offsets can induce PIR, possibly followed by a single relay failure.

9 Discussion

We have considered a fast-slow excitable system subject to a stochastic excitatory input train, and we have shown how to derive an irreducible
Markov chain that can be used analytically to compute the system's firing probability for each input, the expected number of response failures between firings, and the distribution of slow variable values between firings, in the infinite-time limit. We have illustrated this analysis on a model TC cell subject to a uniform or a truncated normal distribution of excitatory synaptic inputs, in the cases of constant inhibition and of inhibition that switches abruptly between two levels. The analysis generalizes to any pair of input trains, excitatory or inhibitory and synaptic or not, that switch, with distinct frequencies, between discrete on and off states. In such cases, an appropriate transition matrix, analogous to equation 7.1, can always be derived. This approach can also be extended to other models, such as the Haken (2002) or Kuramoto (1984) models, featuring a single variable that builds up to a threshold, mediates an instantaneous spike, and experiences a reset, possibly followed by a refractory period. In this vein, we expect that extension to integrate-and-fire-type models should be possible, but the details of handling nonmonotone changes in voltage would need to be worked out. In the TC cell case, our results generalize earlier findings suggesting how the modulation of inhibitory outputs of the basal ganglia can compromise TC responsiveness to excitatory inputs, with possible relevance to PD and DBS (Rubin & Terman, 2004). The method used here goes beyond the direct counting of failed responses to a sequence of inputs that was done previously, by deriving information about the complete limiting distribution of states in the TC cell Markov chain. More generally, basal ganglia output areas in rat show abrupt firing rate fluctuations on the timescale of seconds to minutes even in non-parkinsonian states (Ruskin et al., 1999), and the ideas we introduced could be used to consider how different fluctuation characteristics affect TC reliability.
TC cells are also inhibited by thalamic reticular (RE) cells, which are targets of excitatory corticothalamic inputs. Cortical oscillations, particularly abrupt transitions between cortical up and down states (Steriade, Nunez, & Amzica, 1993a; Cowan & Wilson, 1994), could naturally lead to jumps in the levels of inhibition from RE cells to TC cells, providing an alternative source for the type of inhibition that we consider. In this context, our results provide a way to quantify the expected extent of the transient loss of thalamic relay reliability during the transitions between up and down states of different depths, as well as the likelihood that transitions from down to up will be signaled by PIR thalamic bursts (Steriade, Nunez, & Amzica, 1993b). Huertas et al. (2005) have also considered the relay properties of TC cells, specifically those in the dorsal lateral geniculate nucleus, in the presence of RE inhibition, using simulations of an integrate-and-fire-or-burst model with an oscillatory driving input based on retinal ganglion cell activity. In their simulations, as found here and in previous work such as Rubin and Terman (2004), rhythmic inhibition to TC cells led to TC bursts and a failure of TC cells to respond to excitatory signals, although TC-RE interactions in their model gave rhythmic TC bursting
phase-locked to the driving stimulus, which would not be present for the types of synaptic drive we consider. We have proved that under general conditions, the Markov chain we derived is irreducible and aperiodic and hence has a limiting distribution, which can be computed analytically from equation 3.5, with a one-time simulation to establish bin boundaries. This finding contrasts with Monte Carlo simulations, in which convergence properties cannot be forecast and transient effects may be a concern. Our analysis also explains why a limiting distribution will not exist when these conditions break down, and hence these results, even without the full calculations, can be used to decide whether Monte Carlo simulations are a reasonable option. It is interesting to consider the minimal requirements for the application of our Markov chain ideas. Application of this approach will be possible whenever a driven system can be characterized as having a single (or single dominant) slow recovery process or other variable that builds up to a firing threshold in the silent phase, the transition probabilities between states in the silent phase can be estimated from the behavior of this variable, and the statistics of the input to the cell are known. Consideration of the minimal requirements for the experimental characterization of such transition probabilities for a neuron, possibly along the lines of phase response curve estimation, remains an interesting avenue for future work. Our work is related to two earlier studies in which a Markov operator was derived to track transitions between oscillator phases, relative to sinusoidal forcing, at successive jump-up (Doi et al., 1998) or threshold-crossing (Tateno & Jimbo, 2000) times, in the presence of noise. Neither of these works, however, used a Markov chain to track transitions linked to successive input arrivals but not to firing events.
Moreover, neither arrived at analytically computable transition probabilities or proved the existence of a limiting distribution, as we have done. A nice feature of the previous letters was the numerical tracking of subdominant eigenvalues to identify bifurcations, defined in a stochastic sense, relating to changes in mode locking. The approach that we have presented could also be used to study bifurcations. For example, as mentioned in remark 6, a change in the range of IEIs can cause the transition probability between a pair of states to switch from zero to nonzero, which may abruptly change the existence, or the nature, of the corresponding limiting distribution. As a related alternative approach, one could try to analyze the long-term behavior of a cell by studying the random dynamical system w_{n+1} = M_T(w_n), where the interinput interval T is a random variable with a known density (Lasota & Mackey, 1994). If it could be found, the limiting density for the variable w could be used to compute the statistics of interest for the cell. However, finding this limiting density is in general quite difficult. The Markov chain approach that we have presented can in fact be viewed as a discretization of this procedure. As a result, we recover a coarse-grained
version of the full limiting density for w, which is sufficient to determine the statistics of interest. In a series of letters (Othmer & Watanabe, 1994; Xie et al., 1996; Othmer & Xie, 1999), Othmer and collaborators studied the application of step-function forcing to a piecewise linear excitable system similar to those that we consider. In their work, which focused on the existence of mode-locked (harmonic or subharmonic) solutions and on chaos, they used a map and defined states as we do, but their states were based on the positions of the knees of nullclines and their projections to other nullclines, rather than on the flow along branches of nullclines, and they did not consider a Markov chain for transitions between states. Moreover, their analysis was restricted to periodic forcing, whereas ours accommodates, and indeed is particularly well suited for, stochasticity in input timing. LoFaro and Kopell (1999) also used 1D maps to study a forced excitable system, but in their work, the excitable system was a neuron mutually coupled via inhibitory synapses to an oscillatory cell, and the map was a singular Poincaré map, with each iteration corresponding to the time between jumps to the active phase. Similarly, Keener, Hoppensteadt, and Rinzel (1981) and subsequent authors have used firing time maps to study mode locking in integrate-and-fire and related models with periodic stimuli. Alternatively, other works have used 1D maps based on interinput intervals to study mode-locked, quasi-periodic, and chaotic responses of excitable systems to periodic, purely excitatory inputs (Coombes & Osbaldestin, 2000; Ichinose et al., 1998). Clearly, the transformation of some combination of excitatory and inhibitory synaptic inputs into postsynaptic neuronal responses is a fundamental operation present within any nontrivial nervous system. As a result, various manifestations of this transformation have been studied, computationally and experimentally, by many researchers.
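The coarse-graining of the random map w_{n+1} = M_T(w_n) mentioned above can be made concrete on a toy build-to-threshold map. All parameter values here are hypothetical; the point is only the discretization step and its consistency with the empirical occupancy:

```python
import random

random.seed(0)

# Toy random dynamical system: the slow variable builds up by a random
# increment T at each input and wraps (fires and resets) at threshold 1.
def M_T(w, T):
    w += T
    return w - 1.0 if w >= 1.0 else w

K = 10                                  # bins discretizing [0, 1)
counts = [[0] * K for _ in range(K)]
w = 0.0
for _ in range(200_000):
    i = min(int(w * K), K - 1)
    w = M_T(w, random.uniform(0.2, 0.6))
    j = min(int(w * K), K - 1)
    counts[i][j] += 1

# Row-normalize the counts into a coarse-grained transition matrix.
row_sums = [sum(row) for row in counts]
P = [[c / max(row_sums[i], 1) for c in row] for i, row in enumerate(counts)]

# Stationary distribution of the binned chain by power iteration.
v = [1.0 / K] * K
for _ in range(500):
    v = [sum(P[i][j] * v[i] for i in range(K)) for j in range(K)]
    s = sum(v)
    v = [x / s for x in v]

# Compare with the empirical bin occupancy from the same run; the
# discretized chain's limiting density tracks the true one closely.
total = sum(row_sums)
occ = [sum(counts[i][j] for i in range(K)) / total for j in range(K)]
err = sum(abs(a - b) for a, b in zip(v, occ))
print(err < 0.05)
```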
In our work, as in Smith and Sherman (2002), we consider the excitatory input stream as a drive to the postsynaptic cell and the inhibitory input as a modulation that alters the cell’s responsiveness to its drive. We neglect such intriguing effects as long-term synaptic scaling (Desai, Cudmore, Nelson, & Turrigiano, 2002), changes in intrinsic excitability (Aizenman, Akerman, Jensen, & Cline, 2003), and changes in the balance of excitation and inhibition (Somers et al., 1998), which could become relevant in the asymptotic limit, as well as the effect of attention (Tiesinga, 2005). Moreover, we assume that successful thalamic relay consists of single-spike responses to an input train, neglecting the idea that by virtue of their stereotyped form and reliability, thalamic bursts may serve a relay function (Person & Perkel, 2005; Babadi, 2005). Other authors have considered how variations in intrinsic properties of postsynaptic cells determine the input characteristics that induce these cells to spike most reliably (Fellous et al., 2001; Schreiber, Fellous, Tiesinga, & Sejnowski, 2004) and how neuronal processing varies under more general changes in input spike patterns than the ones that
we have considered here (e.g., Hunter & Milton, 2003; Tiesinga & Toups, 2005). While these issues are beyond the scope of this work, our approach can accommodate input trains that vary stochastically in a variety of ways (e.g., see remarks 3 and 4), and hence it may be well suited for the future exploration of such issues.

Appendix A: TC Cell with sinh = 0

With sinh = 0, a cell that has just spiked is guaranteed to spike again after at most N = 3 subsequent inputs. Although the intervals I_k are defined in terms of intervals of the slow variable w, clearly there is also a particular time interval associated with each I_k (see remark 11 below). In the particular model, equation 6.1, with sinh = 0, these time intervals are [20, 50), [50, 75.5), [75.5, ∞), where w_RK^E · 75.5 ≈ w_LK^E. In practice, we used a simulation protocol, rather than determination of the left knee of N^E directly from equation 6.1, to compute the value 75.5. That is, we found 75.5 to be the minimum value of t such that, given an initial condition with w = w_RK^E · t and with v at the corresponding position on the left branch of N^0, the model cell would fire in response to an excitatory input of duration 10 msec. This adjustment represents the modification to w_LK^E discussed in remark 4 and is therefore consistent with the rest of the analysis that we present. Using these time intervals allows us to compute the transition probabilities between states (I_k, j) for j, k = 1, . . . , 3, with j ≤ k. Let p_{(k1,j1)→(k2,j2)} = P[(I_{k2}, j2) | (I_{k1}, j1)]. To compute these probabilities, we start with the fate of a cell just after firing, computing p_{(3,j)→(k,1)} for each relevant (j, k). First, note that since all cells fire from a state of the form (I_3, j) for some j ≤ 3, and all firing cells are reset together to w_RK^E regardless of where they fired from, p_{(3,j)→(k,1)} is independent of j.
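The geometric probabilities computed in this appendix can be cross-checked by Monte Carlo. Below is a sketch assuming, as in this example, that the gaps T_i from one input's offset to the next input's onset are uniform on [20, 60] msec and that each input lasts 10 msec; the two estimates should approach the analytic values 3/4 and 2601/9600 ≈ 0.271 derived in the text.

```python
import random

random.seed(1)

# Monte Carlo cross-check of two geometric transition probabilities.
N = 400_000
n_T1_in_I1 = 0        # T1 in [20, 50): cell lands in I_1 after reset
n_cond = 0            # additionally, T1 + T2 + 10 in [50, 75.5)
for _ in range(N):
    T1 = random.uniform(20.0, 60.0)
    T2 = random.uniform(20.0, 60.0)
    if T1 < 50.0:
        n_T1_in_I1 += 1
        if 50.0 <= T1 + T2 + 10.0 < 75.5:
            n_cond += 1

p_reset_to_I1 = n_T1_in_I1 / N        # analytically 3/4
p_11_to_22 = n_cond / n_T1_in_I1      # analytically 2601/9600
print(p_reset_to_I1, p_11_to_22)
```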
In this example, with ginh = 0.12, we have p_{(3,j)→(1,1)} = P[T_1 ∈ [20, 50)], where T_1 denotes the time from reset to the onset of the next excitatory input. Further, P[T_1 ∈ [20, 50)] = 3/4, since T_1 is selected from a uniform distribution on [20,60]. Similarly, p_{(3,j)→(2,1)} = 1/4 and p_{(3,j)→(k,1)} = 0 for all k > 2. Next, we seek values for p_{(1,1)→(k,2)} for each k ≥ 2 and p_{(2,1)→(k,2)} for each k ≥ 3. We have p_{(1,1)→(2,2)} = P[T_1 + T_2 + 10 ∈ [50, 75.5) | T_1 ∈ [20, 50)], where T_2 + 10 denotes the second IEI, with the input duration of 10 msec written out explicitly. This can be computed as the ratio of two areas, either geometrically or by integration, leading to the result that p_{(1,1)→(2,2)} = 2601/9600. Since N = 3 for sinh = 0, it follows that p_{(1,1)→(3,2)} = 6999/9600,
and that p_{(2,1)→(3,2)} = p_{(2,2)→(3,3)} = 1. This completes the calculation for sinh = 0. The corresponding transition matrix for sinh = 0 is thus

P =
[ 0        0        2601/9600   6999/9600   0
  0        0        0           1           0
  0        0        0           0           1
  3/4      1/4      0           0           0
  3/4      1/4      0           0           0 ],     (A.1)

where the states of the Markov chain are (1,1), (2,1), (2,2), (3,2), (3,3). Similar calculations yield the transition matrix

P_n =
[ 0        0        .1474   .8526   0
  0        0        0       1       0
  0        0        0       0       1
  .8576    .1424    0       0       0
  .8576    .1424    0       0       0 ]     (A.2)
for states (1,1), (2,1), (2,2), (3,2), (3,3) in the case of normally distributed IEIs, also discussed in section 6. Remark 11. The time intervals associated with each Ik are independent of the number of inputs received since the last spike, and hence uniquely defined, under the assumption of input-independent evolution of w (see remark 8). Without this assumption, to each bin, we could associate different time intervals based on the number of inputs received since the last spike. Once this is done, the calculations proceed as discussed here, with appropriate adjustment of the limits of integration in equation 3.5. Appendix B: TC Cell with sinh = 1 The case sinh = 1 requires more analytical computations than the case sinh = 0, since N = 5 for sinh = 1. In this case, p(5, j)→(k,1) are identical to p(3, j)→(k,1) , computed for sinh = 0 above, and are nonzero only for k = 1, 2. The states for sinh = 1 correspond to times [20,50), [50,80), [80,110), [110,128), [128, ∞) (see remark 11). Similar to the previous case, with the input duration of 10 msec again included explicitly, p(1,1)→(2,2) = P[T1 + T2 + 10 ∈ [50, 80)|T1 ∈ [20, 50)] = 3/8,
with p(1,1)→(3,2) = 1/2 and p(1,1)→(4,2) = 1/8 by analogous calculations. Along the same lines, p(2,1)→(3,2) = P[T1 + T2 + 10 ∈ [80, 110)|T1 ∈ [50, 60)] = 5/8, while p(2,1)→(4,2) = 37/100 and p(2,1)→(5,2) = 1/200. As we proceed further, certain probability calculations become more involved, because membership in a state may be achieved by more than one path. For example, we see already that a cell may reach state (I3 , 2) from (I1 , 1) or from (I2 , 1). Thus, continuing to use Ti to denote the times between the offset of one input and the onset of the next, we have p(3,2)→(4,3) = P[T1 + T2 + T3 + 20 ∈ [110, 128)|T1 ∈ [20, 50) and T1 + T2 + 10 ∈ [80, 110)] + P[T1 + T2 + T3 + 20 ∈ [110, 128)|T1 ∈ [50, 60) and T1 + T2 + 10 ∈ [80, 110)] = 2123/14,250. The full set of calculations reveals that the transition matrix P I for sinh = 1 has the form
$$
P^{I} = \begin{pmatrix}
0 & 0 & \tfrac{3}{8} & \tfrac{1}{2} & 0 & \tfrac{1}{8} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & \tfrac{5}{8} & 0 & \tfrac{37}{100} & 0 & 0 & \tfrac{1}{200} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & \tfrac{6011}{13500} & 0 & \tfrac{2057}{6750} & 0 & 0 & \tfrac{1}{4} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & \tfrac{2123}{14250} & 0 & 0 & \tfrac{12127}{14250} & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & \tfrac{243}{10000} & 0 & 0 & \tfrac{9757}{10000} & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
\tfrac{3}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\tfrac{3}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\tfrac{3}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
\tfrac{3}{4} & \tfrac{1}{4} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix} \qquad (B.1)
$$
where the states of the Markov chain are (1,1), (2,1), (2,2), (3,2), (3,3), (4,2), (4,3), (4,4), (5,2), (5,3), (5,4), and (5,5).
Similar calculations yield the transition matrix
$$
P_n^{I} = \begin{pmatrix}
0 & 0 & .2548 & .7249 & 0 & .0203 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & .7352 & 0 & .2642 & 0 & .0006 & 0 & 0 \\
0 & 0 & 0 & 0 & .1074 & 0 & .5612 & 0 & .3314 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & .1448 & 0 & .8552 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
.8576 & .1424 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
.8576 & .1424 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
.8576 & .1424 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{pmatrix} \qquad (B.2)
$$
for states (1, 1), (2, 1), (2, 2), (3, 2), (3, 3), (4, 2), (4, 3), (5, 2), (5, 3), (5, 4) in the case of normally distributed IEIs, also discussed in section 6.

Appendix C: Coefficients for Matrix 7.1 in the Case of Time-Dependent Inhibition and Excitation

To calculate the coefficients c_{m,n} = p(m)p(n) analytically, we compute p(1), which is the probability that exactly one excitatory input arrives during an epoch of constant inhibition, and then we compute p(m) for each m = 2, . . . , 6 as the probability of at most m inputs arriving minus the probability of at most m − 1 inputs arriving. We stop at m = 6, since at most six excitatory inputs can arrive during 175 msec, given the IEIs and excitation duration that we consider. Let t = 0 denote the time of inhibitory offset and let TI ∈ [125, 175] denote the duration of the ensuing time interval on which inhibition remains off. Define T−1 as the time from the last excitatory onset before t = 0 to the first excitatory onset after t = 0. Let T1 denote the time of this first excitatory onset. Given that each excitation endures for 10 msec, T−1 ∈ [30, 70], while T1 ∈ [0, T−1] (see Figure 7). Finally, let T2 denote the IEI following the end of the excitatory input that occurs at time T1. Since T1 < TI must hold, the value of p(1) is the probability that T1 + T2 + 10 > TI. Thus,

$$
p(1) = \frac{\displaystyle \int_{125}^{175} \int_{30}^{70} \int_{0}^{T_{-1}} \int_{\min(T_I - T_1 - 10,\, 60)}^{60} dT_2\, dT_1\, dT_{-1}\, dT_I}{\displaystyle \int_{125}^{175} \int_{30}^{70} \int_{0}^{T_{-1}} \int_{20}^{60} dT_2\, dT_1\, dT_{-1}\, dT_I} = \frac{27}{51200}.
$$
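The exact value can be cross-checked by Monte Carlo, a sketch of which follows (variable names are ours): sampling (TI, T−1, T1, T2) uniformly over the region of integration, with rejection enforcing T1 ≤ T−1, and counting the fraction of samples in which the second excitatory onset falls after inhibition resumes.

```python
import numpy as np

# Sketch: Monte Carlo check of p(1) = 27/51200 ≈ 5.3e-4.  The four times are
# drawn uniformly over the region of the quadruple integral above; proposing
# T1 on [0, 70] and keeping only T1 <= T-1 gives a uniform density on the
# wedge-shaped (T-1, T1) region, matching the unnormalized integrand.
rng = np.random.default_rng(0)
N = 1_000_000
TI  = rng.uniform(125, 175, N)   # duration with inhibition off
Tm1 = rng.uniform(30, 70, N)     # spacing of the straddling excitatory onsets
T1  = rng.uniform(0, 70, N)      # proposed first onset after t = 0
T2  = rng.uniform(20, 60, N)     # IEI after the first input ends
keep = T1 <= Tm1
p1 = np.mean(T1[keep] + T2[keep] + 10 > TI[keep])
print(p1)   # fluctuates around 27/51200
```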
For m > 1, each quantity p(m) is given as the ratio of two multiple integrals as well, with one additional nested integral in each, relative to those used to calculate p(m − 1). Since these integrals can be evaluated exactly, we obtain exact fractional representations for each, but to save
Figure 7: Notation for the calculation of the probabilities p(m). Note that the actual number of excitatory inputs arriving during the interval of constant inhibition will be between one and six.
space, we simply give decimal approximations here: p(2) ≈ .1985, p(3) ≈ .6075, p(4) ≈ .1875, p(5) ≈ .0060, p(6) ≈ 4.731 × 10−6 . Acknowledgments We received partial support for this work from National Science Foundation grants DMS-0414023 to J.R., and DMS-0244529, ATM-0417867 and a GEAR grant from the University of Houston to K.J. We thank the anonymous reviewers for comments that helped improve the clarity of the letter.
References Aizenman, C. D., Akerman, C. J., Jensen, K. R., & Cline, H. T. (2003). Visually driven regulation of intrinsic neuronal excitability improves stimulus detection in vivo. Neuron, 39(5), 831–842. Anderson, M., Postpuna, N., & Ruffo, M. (2003). Effects of high-frequency stimulation in the internal globus pallidus on the activity of thalamic neurons in the awake monkey. J. Neurophysiol., 89, 1150–1160. Babadi, B. (2005). Bursting as an effective relay mode in a minimal thalamic model. J. Comput. Neurosci., 18, 229–243. Brown, P., Oliviero, A., Mazzone, P., Insola, A., Tonali, P., & Lazzaro, V. D. (2001). Dopamine dependency of oscillations between subthalamic nucleus and pallidum in Parkinson’s disease. J. Neurosci., 21, 1033–1038. Chance, F., Abbott, L., & Reyes, A. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782. Chow, C., & Coombes, S. (2006). Existence and wandering of bumps in a spiking neural network model. SIAM J. Dyn. Syst., 5, 552–574.
Coombes, S., & Osbaldestin, A. (2000). Period-adding bifurcations and chaos in a periodically stimulated excitable neural relaxation oscillator. Phys. Rev. E, 62, 4057–4066. Cowan, R. L., & Wilson, C. J. (1994). Spontaneous firing patterns and axonal projections of single corticostriatal neurons in the rat medial agranular cortex. J. Neurophysiol., 71(1), 17–32. De Schutter, E. (1999). Using realistic models to study synaptic integration in cerebellar Purkinje cells. Rev. Neurosci., 10, 233–245. Desai, N. S., Cudmore, R. H., Nelson, S. B., & Turrigiano, G. G. (2002). Critical periods for experience-dependent synaptic scaling in visual cortex. Nat. Neurosci., 5(8), 783–789. Doi, S., Inoue, J., & Kumagai, S. (1998). Spectral analysis of stochastic phase lockings and stochastic bifurcations in the sinusoidally forced van der Pol oscillator with additive noise. J. Stat. Phys., 90(5/6), 1107–1127. Fellous, J. M., Houweling, A. R., Modi, R. H., Rao, R. P., Tiesinga, P. H., & Sejnowski, T. J. (2001). Frequency dependence of spike timing reliability in cortical pyramidal cells and interneurons. J. Neurophysiol., 85(4), 1782–1787. Haken, H. (2002). Brain dynamics: Synchronization and activity patterns in pulse-coupled neural nets with delays and noise. Berlin: Springer-Verlag. Hashimoto, T., Elder, C., Okun, M., Patrick, S., & Vitek, J. (2003). Stimulation of the subthalamic nucleus changes the firing pattern of pallidal neurons. J. Neurosci., 23, 1916–1923. Hoel, P. G., Port, S. C., & Stone, C. J. (1972). Introduction to stochastic processes. Boston: Houghton Mifflin. Huertas, M., Groff, J., & Smith, G. (2005). Feedback inhibition and throughput properties of an integrate-and-fire-or-burst network model of retinogeniculate transmission. J. Comp. Neurosci., 19, 147–180. Hunter, J. D., & Milton, J. G. (2003). Amplitude and frequency dependence of spike timing: Implications for dynamic regulation. J. Neurophysiol., 90(1), 387– 394. Ichinose, N., Aihara, K., & Judd, K. (1998). 
Extending the concept of isochrons from oscillatory to excitable systems for modeling excitable neurons. Int. J. Bif. Chaos, 8(12), 2375–2385. Keener, J. P., Hoppensteadt, F. C., & Rinzel, J. (1981). Integrate-and-fire models of nerve membrane response to oscillatory input. SIAM J. Appl. Math., 41(3), 503–517. Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. Berlin: Springer-Verlag. Lasota, A., & Mackey, M. C. (1994). Chaos, fractals, and noise: Stochastic aspects of dynamics (2nd ed.). New York: Springer-Verlag. LoFaro, T., & Kopell, N. (1999). Timing regulation in a network reduced from voltage-gated equations to a one-dimensional map. J. Math. Biol., 38(6), 479–533. Magnin, M., Morel, A., & Jeanmonod, D. (2000). Single-unit analysis of the pallidum, thalamus, and subthalamic nucleus in parkinsonian patients. Neuroscience, 96, 549–564. Nini, A., Feingold, A., Slovin, H., & Bergman, H. (1995). Neurons in the globus pallidus do not show correlated activity in the normal monkey, but phase-locked
oscillations appear in the MPTP model of parkinsonism. J. Neurophysiol., 74, 1800– 1805. Othmer, H., & Watanabe, M. (1994). Resonance in excitable systems under stepfunction forcing. I. Harmonic solutions. Adv. Math. Sci. Appl., 4, 399–441. Othmer, H., & Xie, M. (1999). Subharmonic resonance and chaos in forced excitable systems. J. Math. Biol., 39, 139–171. Person, A. L., & Perkel, D. J. (2005). Unitary IPSPs drive precise thalamic spiking in a circuit required for learning. Neuron, 46(1), 129–140. Raz, A., Vaadia, E., & Bergman, H. (2000). Firing patterns and correlations of spontaneous discharge of pallidal neurons in the normal and tremulous 1-methyl-4phenyl-1,2,3,6 tetrahydropyridine vervet model of parkinsonism. J. Neurosci., 20, 8559–8571. Rubin, J. E., & Terman, D. (2002). Geometric singular perturbation analysis of neuronal dynamics. In B. Fiedler (Ed.), Handbook of dynamical systems (Vol. 2, pp. 93–146). Amsterdam: North-Holland. Rubin, J. E., & Terman, D. (2004). High frequency stimulation of the subthalamic nucleus eliminates pathological thalamic rhythmicity in a computational model. J. Comput. Neurosci., 16, 211–235. Ruskin, D. N., Bergstrom, D. A., Kaneoke, Y., Patel, B. N., Twery, M. J., & Walters, J. R. (1999). Multisecond oscillations in firing rate in the basal ganglia: Robust modulation by dopamine receptor activation and anesthesia. J. Neurophysiol., 81(5), 2046–2055. Schreiber, S., Fellous, J.-M., Tiesinga, P., & Sejnowski, T. J. (2004). Influence of ionic conductances on spike timing reliability of cortical neurons for suprathreshold rhythmic inputs. J. Neurophysiol., 91(1), 194–205. Smith, G., & Sherman, S. (2002). Detectability of excitatory versus inhibitory drive in an integrate-and-fire-or-burst thalamocortical relay neuron model. J. Neurosci., 22, 10242–10250. Somers, D., & Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biol. Cybern., 68, 393–407. 
Somers, D., Todorov, E., Siapas, A., Toth, L., Kim, D., & Sur, M. (1998). A local circuit approach to understanding integration of long-range inputs in primary visual cortex. Cerebral Cortex, 8, 204–217. Steriade, M., Nunez, A., & Amzica, F. (1993a). A novel slow (< 1 Hz) oscillation of neocortical neurons in vivo: Depolarizing and hyperpolarizing components. J. Neurosci., 13(8), 3252–3265. Steriade, M., Nunez, A., & Amzica, F. (1993b). Intracellular analysis of relations between the slow (< 1 Hz) neocortical oscillation and other sleep rhythms of the electroencephalogram. J. Neurosci., 13(8), 3266–3283. Stone, M. (2004). Varied stimulation of a relaxation oscillator. Unpublished master's thesis, University of Houston. Stuart, G., & Häusser, M. (2001). Dendritic coincidence detection of EPSPs and action potentials. Nat. Neurosci., 4, 63–71. Tateno, T., & Jimbo, Y. (2000). Stochastic mode-locking for a noisy integrate-and-fire oscillator. Phys. Lett. A, 271, 227–236. Taylor, H. M., & Karlin, S. (1998). An introduction to stochastic modeling (3rd ed.). San Diego, CA: Academic Press.
Tiesinga, P. (2005). Stimulus competition by inhibitory interference. Neural Comput., 17, 2421–2453. Tiesinga, P., Jose, J., & Sejnowski, T. (2000). Comparison of current-driven and conductance-driven neocortical model neurons with Hodgkin-Huxley voltage-gated channels. Phys. Rev. E, 62, 8413–8419. Tiesinga, P., & Sejnowski, T. (2004). Rapid temporal modulation of synchrony by competition in cortical interneuron networks. Neural Comput., 16, 251–275. Tiesinga, P. H. E., & Toups, J. V. (2005). The possible role of spike patterns in cortical information processing. J. Comput. Neurosci., 18(3), 275–286. van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1371. Xie, M., Othmer, H., & Watanabe, M. (1996). Resonance in excitable systems under step-function forcing. II. Subharmonic solutions and persistence. Physica D, 98, 75–110.
Received November 9, 2005; accepted August 18, 2006.
LETTER
Communicated by Emilio Salinas
Optimal Coding Predicts Attentional Modulation of Activity in Neural Systems

Santiago Jaramillo
[email protected]

Barak A. Pearlmutter
[email protected]

Hamilton Institute, NUI Maynooth, County Kildare, Ireland
Neuronal activity in response to a fixed stimulus has been shown to change as a function of attentional state, implying that the neural code also changes with attention. We propose an information-theoretic account of such modulation: that the nervous system adapts to optimally encode sensory stimuli while taking into account the changing relevance of different features. We show using computer simulation that such modulation emerges in a coding system informed about the uneven relevance of the input features. We present a simple feedforward model that learns a covert attention mechanism, given input patterns and coding fidelity requirements. After optimization, the system gains the ability to reorganize its computational resources (and coding strategy) depending on the incoming attentional signal, without the need of multiplicative interaction or explicit gating mechanisms between units. The modulation of activity for different attentional states matches that observed in a variety of selective attention experiments. This model predicts that the shape of the attentional modulation function can be strongly stimulus dependent. The general principle presented here accounts for attentional modulation of neural activity without relying on special-purpose architectural mechanisms dedicated to attention. This principle applies to different attentional goals, and its implications are relevant for all modalities in which attentional phenomena are observed. 1 Introduction Adaptation in the nervous system occurs at several timescales, from the fast changes in spiking patterns to the long-term effects of development and learning. One of the most intriguing types of adaptation is the modification of neural coding strategies related to selective attention. 
Neural Computation 19, 1295–1312 (2007)
© 2007 Massachusetts Institute of Technology

The fact that attention can be covertly shifted to different aspects of a stimulus to improve their detection and discrimination (Downing, 1988) indicates that the nervous system is not just filtering out features in a fixed manner, but
is adapting its function according to some objective. This adaptation is expressed as an attentional-state dependent modulation of neural activity. Attentional modulation of activity has been characterized at different levels, from whole brain (Heinze et al., 1994; Hopfinger, Buonocore, & Mangun, 2000; Corbetta & Shulman, 2002) to single cell (Moran & Desimone, 1985; Luck, Chelazzi, Hillyard, & Desimone, 1997; Treue & Maunsell, 1999; McAdams & Maunsell, 1999). In contrast, theoretical accounts of these phenomena are still a matter of debate and would greatly benefit from models that relate these effects to general coding principles. Traditional accounts of receptive field formation assume equal relevance of all features of a stimulus. Attempts to include the adaptability necessary for dealing with uneven relevance (e.g. for attention-related phenomena) usually include mechanisms tailored to the precise phenomenology, such as gating or shifting circuitry (Olshausen, Anderson, & Van Essen, 1993; Deco & Zihl, 2001; Heinke & Humphreys, 2003). A different approach is to find high-level principles that give rise to the phenomena observed during changing attentional state. The literature provides a few examples of this approach, namely, models in which various types of uncertainty are included in a framework of optimal inference and learning (Dayan & Zemel, 1999; Dayan, Kakade, & Montague, 2000; Yu & Dayan, 2005; Rao, 2005). Our proposal here is in some sense less radical than the assumption that attention subserves optimal inference in that we account for the sensory codes in question using the optimal coding framework, a framework that has been successfully applied to a variety of nonattentional sensory coding phenomena (Atick, 1992). We hypothesize that the nervous system uses an optimal code for representing sensory signals and that fidelity requirements of the code change based on top-down information. 
These conditions imply shifts in the neural code and changes in response properties of single neurons that together can be regarded as a global resource allocation. Using computer simulation, we explore the efficient representation of sensory input under the assumption of limited capacity and changing nonuniform fidelity requirements. These simulations show that (1) the modulation of activity of the simulated units matches that observed in animals during selective attention tasks; (2) the magnitude of this modulation depends on the capacity of the neural circuit; (3) the behavior of a single neuron cannot be well characterized by measurements of attentional modulation of only a single sensory stimulus; (4) modulation of coding strategies does not require special mechanisms—it is possible to obtain dramatic modulation even when signals informing the system about fidelity requirements enter the system in a fashion indistinguishable from sensory signals; and (5) even a simple feedforward network can perform sophisticated and dramatic reallocation of processing resources.
Figure 1: Reallocation of resources was observed when attentional signal changed. (A) Example of reconstruction of a single input pattern and four different attentional states as indicated by the dashed circles. Error is lower for attended locations. (B) Structure of the feedforward network used in the simulations. For each unit, attentional inputs (after training) are indistinguishable from sensory inputs. (C) Mean over patterns of position-specific error for one attentional state, compared to a system with no attentional signal. The white region in the image on the right indicates lower error when the system makes use of the attentional signal.
2 Methods

2.1 Network Structure. An autoassociative network was constructed consisting of five layers connected in a feedforward fashion (see Figure 1B), following Jaramillo and Pearlmutter (2004). The number of units in each layer was 256–20–10–20–256, respectively. Each unit received inputs from all units in the previous layer, in addition to two attentional signals, displayed as a single arrow per layer in Figure 1B. This attentional input was the same for all layers, and its role depended on the cost function to be minimized, as explained below. There were no lateral connections within units in a layer. The activity in the input and output layers was represented as an image of 16×16 pixels.

2.2 Unit Model. A firing rate model was used in which the output of each unit was calculated as the weighted sum of the inputs passed through a saturating nonlinearity, as follows:

$$
r_i = s\Big(\sum_j w_{ij}\, r_j + b_i\Big) = s\Big(\sum_{j'} w_{ij'}\, r_{j'} + \sum_{j''} w_{ij''}\, r_{j''} + b_i\Big), \qquad (2.1)
$$

where the sums on the right-hand side are over all units from the previous layer (indexed by j′) and the attentional inputs (indexed by j′′). The saturating function was s(x) = a tanh(bx), with a = 1.716, b = 0.667. The activity of unit i is denoted r_i, and the parameters w_ij correspond to the strength of the connection
from unit j to unit i. Note that each unit i also included a bias term b_i. The connection strength w_ij from unit j to unit i was real valued and unbounded.

2.3 Stimulus Statistics. The set of patterns used during optimization consisted of 20,000 monochromatic 16×16 pixel images. Pixel intensity values had zero mean and standard deviation σ = 1/3. The images were created by convolving (filtering) white gaussian noise images with a rotationally symmetric 2D gaussian with σ_filter = 2. Edge effects were avoided by extracting only the 16×16 center of the resulting image. These images were then scaled to have the desired variance.

2.4 Attentional Input. The attentional signal consisted of a two-element vector with elements in the range [−1, 1]. For each optimization step, this input was randomly drawn from a uniform distribution over the possible range. After optimization, this signal enters each unit in the same fashion as signals from other layers. It is only through the optimization process that its role is defined.

2.5 Optimization. The optimization process consisted of finding the set of weights and bias parameters that minimize the cost function E = ⟨E_p⟩, where the error for one input pattern and attentional state is

$$
E_p = \sum_k c_k(p)\, \big( y_k(p) - d_k(p) \big)^2, \qquad (2.2)
$$

in which k indexes locations in the 16×16 grids holding the stimulus and its reconstruction, c_k(p) defines the importance of that particular location (analogous to the intensity of an attentional spotlight), y_k(p) is the output of the network, d_k(p) is the desired output, which in our case is the same as the input, and p represents the complete information coming into the system at one point in time (the input image as well as the top-down attentional signal). The expectation ⟨·⟩ is taken over p. The gradient was calculated using backpropagation (Rumelhart, Hinton, & Williams, 1986) of the weighted error defined in equation 2.2, and optimization used online gradient descent with a weight decay term of 10⁻⁶ and a learning rate η = 0.005. All weights were plastic during learning, and the attention coefficients in the penalty function formed a simple soft mask,

$$
c_k(p) = \frac{1}{1 + \| k - a(p) \|^2 / m^2}, \qquad (2.3)
$$

with a(p) being the attentional input (in our case, a two-dimensional vector representing the center of attention) and k being a location in the plane. The
width of the attentional spotlight was set by m, which was held constant at m = 12 in our simulations. Noise was injected into the bottleneck units (before the nonlinearity) during optimization in order to limit the capacity of the system. The noise level was σNB = 0.1, except in Figures 7C and 7D, in which additional values in the range σNB ∈ [0.0125, 1.6] were also used. 2.6 Testing Performance. The performance of the system at encoding and decoding the input was measured using independently generated random patterns. For performance comparison (see Figure 1C), a network that used a flat penalty function c k (p) = 1 was also trained. The error in Figures 1 and 7 was the absolute difference between input and output pixel values. 2.7 Finding Preferred Stimuli. To calculate the stimuli that maximally drive each unit in the bottleneck layer for a particular attentional state, we first generated 106 random white gaussian images and found the activation of the unit of interest for each of these patterns (keeping the attentional state fixed). The preferred stimulus was defined as the average of these random patterns weighted by the activity produced in the unit of interest. The antipreferred stimulus was defined as the negative of the preferred stimulus. 3 Results The task of the network of Figure 1B was to encode its input (a monochromatic image) into a compressed representation and decode that representation, with minimal error. An additional input (a two-element vector denoted in Figure 1B as A) informed the system about the current attentional state. Attentional states differed in that selected regions were preferentially reconstructed with higher quality than the rest, but these regions were not explicitly represented by any signal after optimization. Each unit computed its output as the weighted sum of its inputs passed through a saturating nonlinearity. 
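Two analytic ingredients of these methods, the saturating nonlinearity of equation 2.1 and the soft penalty mask of equation 2.3, can be sketched in a few lines. Function names are ours, and for simplicity we place the attention center directly in pixel coordinates, whereas the paper's attentional signal lives in [−1, 1]² and is mapped to the grid by the network itself.

```python
import numpy as np

# Sketch of equation 2.1's nonlinearity and equation 2.3's soft mask
# (our helper names; the pixel-coordinate attention center is a simplification).

def s(x, a=1.716, b=0.667):
    """Saturating nonlinearity applied to each unit's summed input."""
    return a * np.tanh(b * x)

def soft_mask(center, m=12.0, size=16):
    """c_k = 1 / (1 + ||k - center||^2 / m^2) at every grid location k."""
    ys, xs = np.mgrid[0:size, 0:size]
    d2 = (xs - center[0]) ** 2 + (ys - center[1]) ** 2
    return 1.0 / (1.0 + d2 / m ** 2)

c = soft_mask(center=(4.0, 12.0))   # attend near one corner of the image
print(c.max(), c.min())             # mask peaks (at 1) at the attended location
```

With m = 12 the mask falls off gently, so even the far corner of the 16×16 grid retains roughly a third of the peak penalty weight, consistent with a broad spotlight.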
The results presented in the following sections correspond to measurements on the system after the optimization procedure has found the appropriate connection strengths. Any modulation is due only to changes in activity, not to changes in the structure or connection strengths of the network. 3.1 Reallocation of Resources Emerged Naturally. First, it was verified that some regions of the input image were in fact reconstructed better than others depending on the attentional state. An example of this uneven performance is presented in Figure 1A. Here, the input image remained fixed as the attentional input changed. The dashed circles indicate those regions for which preferential reconstruction was requested. We should keep in mind that the attentional input consisted of only two additional values (which
Figure 2: Preferred stimuli were dependent on attentional state. Gray-scale images represent the preferred stimulus for each unit in the bottleneck layer (columns) for three attentional states (rows), as indicated by the dashed circles. Higher contrast was observed on attended regions.
the network will interpret as the center of attention) and that these signals entered each layer the same way as feedforward (sensory) inputs. The regions represented by the dashed circles were not explicitly represented by any signal in the network during this simulation (only during optimization, through the coefficients c k in equation 2.2). As expected, lower error was achieved for regions closer to the center of attention. After confirming that the system’s processing was dependent on attentional state, we asked how these results compared to those from a classical model in which there is no informing signal. Figure 1C compares the mean reconstruction performance of a system that ignores the informing signal with a system attending to the bottom-left corner of the input image. The difference shows that reconstruction was better for attended regions but worse for unattended ones. 3.2 Preferred Stimuli of Bottleneck Units Was Dependent on Attentional State. Results presented in this section concern the units from the bottleneck layer. Activity from these units represents a lossy encoding of the input image. Figure 2 shows the stimulus that maximally drove each of these units at different attentional states, as indicated by the dashed circles. Results are clearly dependent on attentional state, with preferred stimuli containing sharper edges in regions closer to the center of attention. 3.3 Activity of Bottleneck Units Was Modulated by Nonsensory Signals. How was the activity in the bottleneck neurons affected by the attentional signals? Figure 3 shows maps of activity created by fixing the stimulus and moving the center of attention around the image. The intensity at each point in the map represents the activity of one unit in the bottleneck, normalized in the range of activities observed for that particular stimulus. The bars between the two rows of maps represent the range of activities achieved in each condition. The pattern representing activity modulation
Figure 3: Activity was modulated by the attentional signal. Gray-scale images represent the activity of each bottleneck unit for a fixed stimulus as the center of attention is directed to different regions of the input space. Two different stimuli (top and bottom rows) are used for comparison. Maps are scaled according to the range of activities observed for that particular stimulus. The bars between the two rows display the range of activity for each of the two conditions with respect to the absolute limits of the activity of the units.
was clearly different for each unit. Furthermore, the modulation of a unit’s activity was dependent on the stimulus. For example, for unit U1, higher changes in activity occurred when attention shifted from left to right or from top to bottom, depending on the stimulus. This effect occurred even when, in the case of U1, the activity ranges were very similar for the two conditions.
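The preferred-stimulus maps underlying Figures 2 and 3 come from the activity-weighted averaging procedure of section 2.7. A minimal sketch follows, with a toy linear-nonlinear unit standing in for the trained network; the filter `w`, the reduced image count (20,000 rather than the paper's 10⁶), and the unit itself are our assumptions for illustration.

```python
import numpy as np

# Sketch of section 2.7's preferred-stimulus estimate: average many white
# gaussian noise images, weighted by the activity each evokes in the unit of
# interest, with the attentional state held fixed.  The "unit" here is a toy
# stand-in (fixed random filter through the saturating nonlinearity), not the
# paper's trained network.
rng = np.random.default_rng(1)
w = rng.standard_normal((16, 16))             # stand-in receptive field

imgs = rng.standard_normal((20_000, 16, 16))  # white gaussian probe images
drive = np.tensordot(imgs, w, axes=([1, 2], [0, 1]))
acts = 1.716 * np.tanh(0.667 * drive)         # unit activity per image

# Activity-weighted average of the probes; its negative is "antipreferred".
preferred = (acts[:, None, None] * imgs).mean(axis=0)
antipreferred = -preferred
```

For this toy unit the estimate recovers (up to noise and a scale factor) the underlying filter, which is the same logic by which the procedure exposes attention-dependent receptive fields in the trained network.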
3.4 The Model Reproduced Measurements from Visual Cortex. We related these findings to experimentally observed modulation of neural activity during selective attention tasks. First, we replicated the analysis presented in Treue and Maunsell (1999), in which combinations of preferred and antipreferred stimuli were presented inside the receptive field (RF) of a cell and activity was measured as attention changed between these two regions. Figure 4A shows the response from a neuron in the medial superior temporal (MST) area. The preferred stimulus for this neuron was a pattern of dots moving in one direction, indicated by the upward-pointing arrow. The attended stimulus is indicated by the dashed ellipse. The histograms in gray show the firing rate of the cell as a function of time, and mean responses are indicated by solid horizontal lines for each condition. Overlaid, we show the response of one neuron from our model (dotted lines) under similar conditions. The stimuli consisted of combinations of the left and right halves from the preferred and antipreferred stimuli, calculated with attention directed to the center of the input space. The maximal firing rate for the model neuron was set to 50 spikes per second to obtain comparable magnitudes. The modulation of activity of the model neuron matched that of the visual neuron. In particular, when attention was shifted from preferred to nonpreferred features of the same stimulus, activity decreased dramatically.
Figure 4: Activity modulation matched experimental measurements from area MST. (A) Neuronal response to two stimuli inside the receptive field of the cell—one preferred (arrow up) and one antipreferred (arrow down). Attentional focus is indicated by a dashed ellipse. The histograms in gray show the firing rate of the cell as a function of time. Mean responses are indicated by solid horizontal lines (MST cell) and dotted lines (model unit) for each condition. (B) Scatter plot showing the response to antipreferred stimuli versus the response to preferred stimuli for each unit. Points above the diagonal indicate higher activity when attending to the preferred stimulus. (Reproduced from Treue & Maunsell, 1999.)
The scatter plot in Figure 4B shows the firing rates for neurons in areas MT and MST when attention was directed toward the preferred stimulus (y-axis) versus the firing rates obtained while attending the antipreferred stimulus (x-axis). Points above the diagonal indicate higher activity when attending the preferred stimulus. Values for all model units, indicated by stars, fall above the diagonal and within the range of experimentally observed values. For this plot, the range of the s(·) function was scaled and shifted to 0 to 100 spikes per second. Further exploration of these effects is shown in Figure 5. Responses to the four combinations of preferred and antipreferred half-stimuli are presented, first for a single unit and then for all units in the bottleneck layer. Figure 5A shows the stimuli and corresponding attention maps for unit U2. The stimuli consisted of combinations of the left and right halves from the preferred and antipreferred stimuli. Attention maps showed a clear change in the activity of the unit as attention was shifted from right to left. The simulation also showed that changes in features far from the attended region have a smaller effect on activity than changes presented at the attended location (compare the dark bars of Figure 5B). As shown in Figure 4, when attention was shifted from preferred to nonpreferred features of the same stimulus, activity decreased dramatically. In contrast, the effect of attention when both halves were preferred or antipreferred was very small. These observations were consistent for all units (see Figure 5C).
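The rescaling used for the comparison in Figure 4B can be illustrated with a short sketch. The specific squashing function (tanh) and the activation values below are illustrative assumptions; the letter specifies only that the range of the s(·) function was mapped to 0 to 100 spikes per second:

```python
import math

def to_firing_rate(s_value, r_min=0.0, r_max=100.0):
    """Affinely map a squashing-function output in [-1, 1] to a
    firing rate in [r_min, r_max] spikes per second."""
    return r_min + (r_max - r_min) * (s_value + 1.0) / 2.0

# Hypothetical bottleneck activations under two attentional states.
a_attend_pref = 0.6    # attention directed to the preferred half-stimulus
a_attend_null = -0.4   # attention directed to the antipreferred half-stimulus

r_pref = to_firing_rate(math.tanh(a_attend_pref))
r_null = to_firing_rate(math.tanh(a_attend_null))
# A point (r_null, r_pref) above the diagonal of Figure 4B corresponds
# to r_pref > r_null, that is, higher activity when attending the
# preferred features.
```

Any such monotone, bounded output mapping leaves the rank order of responses unchanged, so the above-diagonal structure of Figure 4B does not depend on the particular scaling chosen.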
Optimal Coding Predicts Attentional Modulation
Figure 5: All model units showed similar activity modulation. (A) Combination of preferred (+) and antipreferred (−) stimuli for unit U2 (top) and attentional modulation of activity in U2 for these inputs (bottom). (B) Comparison of activity in U2 for two attentional states, as indicated by the dashed ellipses. (C) Changes in activity for each unit in the bottleneck as attention is shifted from right to left. Bars show means across units.
The model was also compared to experiments in which the responses of cells in area V4 were measured for four attentional conditions, while a bar of fixed orientation was displaced inside the receptive field of the cell (Connor, Gallant, Preddie, & Van Essen, 1996; Connor, Preddie, Gallant, &
[Figure 6 panels: response rate (spikes/sec) for the V4 neuron and normalized response for the model neuron, plotted against bar position (positions 1–5, spacing = 1.0°) for each attentional condition.]
Figure 6: Activity modulation matched experimental measurements from area V4. (A) Response of one V4 neuron to a bar stimulus placed at five different positions inside the receptive field, as indicated by a dashed circle. Attention was directed to one of the four circles outside the receptive field. (Reproduced from Connor et al., 1996). (B) Response of one model neuron as a region of a nonpreferred stimulus is replaced by the preferred stimulus in five different locations. Attention is directed to the border of the input space as indicated by the circles.
Van Essen, 1997). Figure 6A shows the response of one V4 cell. These plots contain features that are common in attentional modulation:
- The response of the cell for a fixed stimulus depends on the attended location.
- The stimulus that elicits the strongest response depends on attention.
- The cell's response when attention is fixed depends on the stimulus position.
- This dependence on stimulus position differs between attentional states—for instance, the left and right panels in Figure 6A display different trends as the stimulus position is changed.
Figure 6B shows the results from a model neuron under similar conditions. In this case, the input image is composed of a nonpreferred stimulus with a small region belonging to the preferred stimulus at different positions, indicated by the numbers 1 to 5. Attention is directed to the borders of the image, as indicated by the circles. The model exhibited all features observed in the experimental data. Two shift indexes were also calculated for each neuron, after Connor et al. (1997). The fractional shift measures the proportion of total response that shifted from one side of the RF to the other when attention is shifted. This index is bounded between −1 and 1, with a positive value indicating shifts in the direction of attention. Connor et al. (1997) report
[Figure 7 panels: (A, C) activity modulation (%); (B, D) average error for the attended and unattended sides, plotted against the number of bottleneck units (A, B) and the bottleneck noise level, log σNB (C, D).]
Figure 7: Magnitude of attentional modulation depends on system capacity. (A) Modulation of activity in response to random stimuli as attention was shifted from left to right, for networks with different numbers of units in the bottleneck. Open gray circles represent the mean modulation for each unit, and solid circles show the mean over all units, with bars indicating the standard error. (B) Mean reconstruction error for the attended and unattended sides. (C, D) Same as A and B but using 10 bottleneck units and instead varying the amount of noise injected into the bottleneck units during training.
mean values of 0.16 or 0.26 depending on whether five or seven bar positions were in use. All our model neurons had a positive fractional shift, with a mean value of 0.22. The second index, the peak shift, measures the distance between positions generating maximum responses. Mean reported experimental values were 10% or 25% of the RF size, depending on whether five or seven bar positions were in use. Our model neurons had nonnegative peak shifts with a mean value of 25% of the total position variation.

3.5 Magnitude of Attentional Modulation Depended on Capacity. The mean modulation of activity of bottleneck units in the model as attention was shifted from left to right is shown in Figure 7, with Figure 7A showing how this varies with the number of bottleneck units and Figure 7C showing how this varies with the amount of injected noise. Gray open circles correspond to the absolute value of the modulation
averaged over 1000 random test patterns, for each unit in the bottleneck. The solid circles show the mean across units, with error bars indicating the standard error. Figures 7B and 7D show the corresponding reconstruction error for each set of parameters, plotting the mean errors for the attended versus unattended halves of the sensory input. Note that noise was added to the bottleneck units only during optimization, and for this reason changes in the modulation magnitude cannot be attributed to noise in the measurements. The magnitude of the modulation decreased as the number of bottleneck units increased. As expected, errors in the ignored region were higher than those in the attended region, but decreased as the number of bottleneck units was increased, gradually closing the gap. As the noise level in the bottleneck units was increased, the magnitude of both the modulation and the reconstruction error increased. The error for the attended region increased faster than for the unattended region. These two plots display an abrupt transition as the noise level becomes very high: the trend of the activity modulation, as well as the error values, changes, with the error values for the attended and unattended regions becoming equal.

4 Discussion

The results show that when a neural system is optimized to encode its inputs with changing fidelity requirements, the code developed exhibits modulatory phenomena of the sort that have been observed in the visual systems of animals engaged in selective attention tasks. These modulatory phenomena, which constitute a reallocation of resources, emerge even when the encoder is a very simple homogeneous feedforward neural model.

4.1 Limitations of the Simulation. The simulations incorporated a number of simplifications, most of which were made for ease of exposition or computational efficiency:

1. During optimization, the penalty function was set to a single attentional spotlight.
This is not a requirement of the general model, and other functions could be used to define the performance demands. For example, nonspatial goals could be incorporated by requiring higher-fidelity reconstruction of some particular feature regardless of its spatial location.

2. To allow convenient visual display, simulated stimuli were visual patterns. Efficient representations of stimuli and attentional modulation phenomena are present in many (if not all) modalities, and the model explored here could be applied equally well to nonvisual and nonspatial modalities.

3. The simulation allowed synapses from a single neuron to be both excitatory and inhibitory. This is not a common feature in biological neural systems, where a combination of excitatory and inhibitory neurons could achieve the same functionality.

4. The simulation is limited to explorations inside the receptive field of a cell. Furthermore, as a consequence of the simulation's simplicity, it cannot make detailed predictions about the form of the RFs. Extended and more complicated models would be necessary to predict attentional modulation of realistic RFs. For instance, incorporating sparsity constraints into the optimization procedure should give rise to more realistic localized RFs (Olshausen & Field, 1996), which would allow the prediction of attentional modulation of RF shape and location.

5. The capacity in the bottleneck was restricted by limiting the number of neurons, with each neuron's capacity limited by an intrinsic noise term. Given that the primary sensory cortex typically has greater than 10^3 times more neurons than the sensory sheet it represents (e.g., in humans, 10^6 retinal ganglion cells versus over 10^9 V1 pyramidal neurons), one is led to question the biological relevance of this capacity restriction. It is important to note that this is not an issue with the model of attention but rather with the optimal coding framework it builds on. One potential resolution of this issue is that energetic constraints may drive the nervous system to use efficient codes even in areas where there is an abundance of neurons and even where their principal role is computation rather than coding.

6. The data displayed were collected after the optimization procedure had been run and the connection strengths fixed at their optimal values. In nature, we might expect this type of adaptation to be continuous rather than being confined to a period of training.

7. To allow standard effective learning algorithms to be applied, the simulation was at the level of firing rates rather than spikes, and short-term plasticity was not included.

8.
The goal of the network in the simulations was to find an efficient representation for the stimulus. Biological networks likely transform stimuli not to merely compressed representations but to representations that serve to inform future actions.

The limitations presented above are mostly specific to the current simulation rather than to the general model proposed in this letter or to the predictions it makes. Stimulus-driven attentional phenomena, and the undefined origin of the attentional signal, are some of the limitations that elaborations of the model might address. Despite these simplifications, the simulation exhibited many phenomena observed experimentally and makes novel concrete, testable predictions about the modulation of neural activity by attentional processes. The fact that the model, even with these simplifications, accounts for a broad range of previous measurements and makes novel robust predictions is, to our mind, a strength rather than a weakness.
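The core optimization described above — reconstruction of the sensory input through a small, noisy bottleneck, with errors weighted by an attentional spotlight — can be sketched as follows. The network sizes, spotlight profile, and noise level here are illustrative assumptions, not the values used in the letter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes and noise level are illustrative only.
n_in, n_bottleneck = 20, 5
W_enc = rng.normal(0.0, 0.1, (n_bottleneck, n_in))   # encoder weights
W_dec = rng.normal(0.0, 0.1, (n_in, n_bottleneck))   # decoder weights
sigma_nb = 0.1   # noise injected into the bottleneck during training only

def spotlight(n, center, width=4.0):
    """Attentional penalty profile: reconstruction errors near the
    attended location are weighted more heavily."""
    x = np.arange(n)
    return 0.1 + np.exp(-0.5 * ((x - center) / width) ** 2)

def weighted_error(v, center):
    """One forward pass with noise injected into the bottleneck,
    followed by the spotlight-weighted squared reconstruction error."""
    h = np.tanh(W_enc @ v) + sigma_nb * rng.normal(size=n_bottleneck)
    v_hat = np.tanh(W_dec @ h)
    return np.sum(spotlight(len(v), center) * (v - v_hat) ** 2)

v = rng.uniform(-1.0, 1.0, n_in)          # a random stimulus
err_left = weighted_error(v, center=4)     # attention on the left
err_right = weighted_error(v, center=15)   # attention on the right
# Minimizing this loss over many stimuli and spotlight positions (e.g.,
# by backpropagation) yields units whose activity depends on the
# attended location, as in Figures 4 to 7.
```

Because the spotlight reweights the cost rather than the input, any attentional modulation of the trained units' responses emerges from the optimization itself, with no gating circuitry in the architecture.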
4.2 Consistency with Electrophysiological Data. The predictions of this model are consistent with observations from electrophysiological recordings in which the output of neurons in the visual system is modulated by the attentional state of the subject. This modulation suggests that the characterization of neural response should go beyond the traditional stimulus-response dependency, usually represented as a receptive field or tuning curve. Treue and Maunsell (1999) showed that when combining preferred and nonpreferred moving stimuli inside the receptive field of a cell in monkey area MST, the activity of the cell was higher when attention was directed to the region containing the preferred direction of motion. These results are exhibited by the model (see Figures 4 and 5). Results from the model are also consistent with those of Connor et al. (1997). Experimental measurements of the modulation of neuronal response for different sets of stimuli and attentional conditions qualitatively matched those from model neurons. Furthermore, when setting the maximum firing rate parameter to biologically plausible values, the model produced modulations that lie in the ranges observed experimentally. We would make one observation regarding the shift indices of Connor et al. (1997): it may be misleading to interpret the partial shift values merely as a tendency to have increased activity when a bar is presented at the attended edge of the RF, as compared to the activity when a bar is presented at the opposite edge of the RF—a property exhibited by the model of Rao (2005). For instance, even when all units show positive indices, some may show higher activity for a bar in position 1 compared to position 5 in both (left and right) attentional conditions. What the index measures is the proportion of activity that shifts when attention changes. Implicit in the measurements of Treue and Maunsell (1999) and Connor et al. 
(1997) is the fact that the attentional modulation is dependent on the stimulus: for example, attention moving from left to right inside the RF of a cell will have an increasing or decreasing effect on the neuron's firing rate depending on the relative location of preferred and nonpreferred features. Examples of this dependency in our simulation are shown in Figure 3. This phenomenon implies that attentional modulation cannot be fully characterized using a single stimulus. Treue and Maunsell (1999) also observed that directing attention toward the RF of a cell can both enhance and reduce responses; for example, when the animal was attending to the antipreferred of two directions in the RF, the response was below the one evoked by the same stimulation when attention was directed outside the RF. While this phenomenon could not be tested directly using our simulation, our results suggest a clear reduction in activity when attention is directed to the antipreferred region compared to other regions (see Figure 5). This is not obvious from models in which attention simply amplifies activity for attended locations. Furthermore, the model predicts that stimulus changes in the attended location generally
produce a larger change in activity than stimulus changes in unattended locations (see Figure 5B). Results from our model are consistent with other experimental results not analyzed in detail in this letter. For instance, the response of cells in areas V2 and V4 is attenuated when a nonpreferred feature is placed inside the receptive field, compared to a preferred feature alone. If attention is directed to the preferred feature, activity is restored (Reynolds, Chelazzi, & Desimone, 1999). Our model exhibits the same phenomenon, as suggested by Figure 5. Another example relates to the type of modulation that attentional signals produce on neural activity. Preliminary results suggest that this model can display apparent multiplicative effects without explicit multiplicative interactions. A more complex model that exhibits localized RFs, as discussed above, would be necessary to obtain conditions comparable to those from the electrophysiological recordings discussed in McAdams and Maunsell (1999).

4.3 Comparison to Other Modeling Approaches. Selective attention researchers have suggested a wide variety of models with different predictive power, most of them presenting accounts of behavioral phenomena rather than explicit predictions on the modulation of neural activity (Mozer & Sitton, 1998; Deco & Zihl, 2001; Heinke & Humphreys, 2003). An influential early modeling study that made concrete predictions regarding attentional changes of neural activity was developed in Olshausen et al. (1993). In that study, control neurons dynamically modified the synaptic strengths of the connections in a model network of the ventral visual pathway. The network selectively routed information into higher cortical areas, producing invariant representations of visual objects. This model predicted changes in position and size of receptive fields as attention was shifted or rescaled. These phenomena are partially supported by results from Connor et al. (1997).
Their model also qualitatively matches modulation effects observed by Moran and Desimone (1985) with stimuli inside and outside the classical receptive field of V4 neurons. In comparison to our model, in which attentional modulation emerges from the nonlinearity of the units and general objective of the network, their model obtained modulatory effects by explicitly modulating the synaptic strengths of the connections. Their model also used a spatially localized connectivity pattern, which gave rise to localized RFs, thus allowing for comparison of attention directed inside versus outside the RF. More recent studies incorporate principles of statistical inference into models of attention. For instance, Yu and Dayan (2005) and Rao (2005) present networks that implement Bayesian integration of sensory inputs and priors and replicate behavioral as well as electrophysiological measurements. In these studies, spatial attention is equated to prior information on the location of the features of interest. The Bayesian inference approach to modeling attention should be regarded as complementary to that taken
here. The transformations performed by the model units in this work are defined by the solution to an optimal coding problem, and under certain conditions, these computations would be equivalent to those in inference-based networks. In fact, coding, statistical modeling of distributions, and inference from partial data are, mathematically speaking, very closely related.

4.4 Mechanisms That Subserve Attentional Modulation. Local gain modulation, a common tool in mechanistic models of attentional modulation, is not necessarily in opposition to our approach. What we suggest is that events traditionally described as local gain modulation subserve a global reallocation of resources, which is the strategy the nervous system has evolved for approaching optimal performance given its constraints. The mechanisms of attentional modulation have traditionally been posited to be changes in synaptic efficacy or modulation of presynaptic terminals (Olshausen et al., 1993). Here we show that the overall effects of this modulation can appear in a network of saturating units without any changes in synaptic strength, being controlled only by the activity itself. This is consistent with the notion that a network optimized for efficient coding under shifting fidelity requirements will integrate whatever information is available about the current requirements by pressing into service any available mechanism. In other words, it is possible that there is no way to anatomically distinguish neuronal mechanisms that support coding from mechanisms that support shifts in coding. While it could be argued that evidence of multiple sites of integration in cortical neurons (Larkum, Zhu, & Sakmann, 1999) is inconsistent with this idea, the results above show that attentional modulation of the neural code is not sufficient to explain the functional role of inputs at different cortical layers, instead suggesting that these mechanisms may be more important for learning or other processes.
4.5 Main Predictions. The fact that our simulation shows modulation effects consistent with physiological recordings suggests that we should not necessarily expect explicit gating circuitry in neural systems responsible for attentional phenomena. Furthermore, the informing signals do not have to explicitly represent the attentional space, that is, a spatial attention effect is not necessarily mediated by a topographic input. Our model strongly predicts that a neuron’s preferred stimulus will depend on attentional state. Moreover, the behavior of a single neuron in this model cannot be well characterized by measurements of attentional modulation of only a single sensory stimulus. This prediction is consistent with experimental results discussed above, but it should be possible to test it more explicitly using currently available experimental techniques.
The model also suggests that stronger modulations are expected when the complexity of the input grows relative to the capacity of the system.

5 Conclusion

The model presented here accounts for attentional modulation of neural response in a framework that includes both attention and receptive field formation, and as a consequence of an underlying normative principle (optimal coding) rather than by tuning a complex special-purpose architecture. The model shows that reallocation of resources can emerge even in a simple feedforward network and challenges the traditional characterization of neural activity. These results are consistent with the notion that attentional modulation is not, at its root, due to specific local architectural features but is rather a ubiquitous phenomenon to be expected in any system with shifting fidelity requirements.

Acknowledgments

This work was supported by U.S. NSF PCASE 97-02-311, the MIND Institute, an equipment grant from Intel, a gift from the NEC Research Institute, Science Foundation Ireland grant 00/PI.1/C067, and the Higher Education Authority of Ireland (An tÚdarás Um Ard-Oideachas).

References

Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network: Computation in Neural Systems, 3(2), 213–251.
Connor, C. E., Gallant, J. L., Preddie, D. C., & Van Essen, D. C. (1996). Responses in area V4 depend on the spatial relationship between stimulus and attention. J. Neurophysiol., 75(3), 1306–1308.
Connor, C. E., Preddie, D. C., Gallant, J. L., & Van Essen, D. C. (1997). Spatial attention effects in macaque area V4. J. Neurosci., 17(9), 3201–3214.
Corbetta, M., & Shulman, G. L. (2002). Control of goal-directed and stimulus-driven attention in the brain. Nat. Rev. Neurosci., 3(3), 201–215.
Dayan, P., Kakade, S., & Montague, P. R. (2000). Learning and selective attention. Nature Neuroscience, 3, 1218–1223.
Dayan, P., & Zemel, R. (1999). Statistical models and sensory attention. In D.
Willshaw & A. Murray (Eds.), Proceedings of the Ninth International Conference on Artificial Neural Networks (Vol. 2, pp. 1017–1022). Piscataway, NJ: IEEE.
Deco, G., & Zihl, J. (2001). A neurodynamical model of visual attention: Feedback enhancement of spatial resolution in a hierarchical system. Computational Neuroscience, 10(3), 231–253.
Downing, C. (1988). Expectancy and visual-spatial attention: Effects on perceptual quality. J. Exp. Psychol. Hum. Percept. Perform., 14(2), 188–202.
Heinke, D., & Humphreys, G. W. (2003). Attention, spatial representation, and visual neglect: Simulating emergent attention and spatial memory in the selective attention for identification model (SAIM). Psychol. Rev., 110(1), 29–87.
Heinze, H. J., Luck, S. J., Münte, T. F., Gös, A., Mangun, G. R., & Hillyard, S. A. (1994). Attention to adjacent and separate positions in space: An electrophysiological analysis. Percept. Psychophys., 56(1), 42–52.
Hopfinger, J. B., Buonocore, M. H., & Mangun, G. R. (2000). The neural mechanisms of top-down attentional control. Nature Neuroscience, 3(3), 284–291.
Jaramillo, S., & Pearlmutter, B. A. (2004). A normative model of attention: Receptive field modulation. Neurocomputing, 58–60, 613–618.
Larkum, M. E., Zhu, J. J., & Sakmann, B. (1999). A new cellular mechanism for coupling inputs arriving at different cortical layers. Nature, 398(6725), 338–341.
Luck, S. J., Chelazzi, L., Hillyard, S. A., & Desimone, R. (1997). Neural mechanisms of spatial selective attention in areas V1, V2, and V4 of macaque visual cortex. J. Neurophysiol., 77(1), 24–42.
McAdams, C. J., & Maunsell, J. H. R. (1999). Effects of attention on orientation-tuning functions of single neurons in macaque cortical area V4. J. Neurosci., 19(1), 431–441.
Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229(4715), 782–784.
Mozer, M. C., & Sitton, M. (1998). Computational modeling of spatial attention. In H. Pashler (Ed.), Attention (pp. 341–393). New York: Psychology Press.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., Anderson, C. H., & Van Essen, D. C. (1993). A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. J. Neurosci., 13(11), 4700–4719.
Rao, R. (2005).
Bayesian inference and attentional modulation in the visual cortex. Neuroreport, 16(16), 1843–1848.
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. J. Neurosci., 19(5), 1736–1753.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Treue, S., & Maunsell, J. H. R. (1999). Effects of attention on the processing of motion in macaque middle temporal and medial superior temporal visual cortical areas. J. Neurosci., 19(17), 7591–7602.
Yu, A., & Dayan, P. (2005). Uncertainty, neuromodulation, and attention. Neuron, 46(4), 681–692.
Received December 7, 2005; accepted November 27, 2006.
LETTER
Communicated by Raj Rao
Efficient Computation Based on Stochastic Spikes Udo Ernst
[email protected] David Rotermund
[email protected] Klaus Pawelzik
[email protected] Institute for Theoretical Neurophysics, Otto-Hahn-Allee, 28359, Bremen, Germany
Neural Computation 19, 1313–1343 (2007)
© 2007 Massachusetts Institute of Technology

The speed and reliability of mammalian perception indicate that cortical computations can rely on very few action potentials per involved neuron. Together with the stochasticity of single-spike events in cortex, this appears to imply that large populations of redundant neurons are needed for rapid computations with action potentials. Here we demonstrate that very fast and precise computations can be realized also in small networks of stochastically spiking neurons. We present a generative network model for which we derive biologically plausible algorithms that perform spike-by-spike updates of the neuron's internal states and adaptation of its synaptic weights from maximizing the likelihood of the observed spike patterns. Paradigmatic computational tasks demonstrate the online performance and learning efficiency of our framework. The potential relevance of our approach as a model for cortical computation is discussed.

1 Introduction

Experimental work has demonstrated that perception can be a very fast process. As an example, it has been shown that human brains take only 150 ms to decide if a natural image contains an animal (Thorpe, Fize, & Marlot, 1996). From typical firing frequencies of cortical neurons (about 15–45 Hz) and from the number of different processing stages involved in this task (about 10), it was concluded that about one to three spikes per neuron and per processing stage are sufficient to robustly achieve this performance. Another example for the exceptional speed of our brain has recently been found in contour integration. In these experiments, macaque monkeys were trained to correctly detect aligned edge elements within a field of randomly oriented distracter edges from presentations lasting 30 to 50 ms before a mask appeared (Mandon & Kreiter, 2005). Even if contour integration is performed locally within the primary visual cortex, this computation can rely on only very few spikes per neuron. In masking
experiments, it has been shown that grouping processes involved in Gestalt perception are capable of binding features that were presented only for 20 to 30 ms (Herzog & Fahle, 2002). In all these examples, the nervous system depends on spikes as signals that appear to be emitted stochastically and with a high degree of variability (Shadlen & Newsome, 1998). These observations raise the question of which neuronal algorithms might underlie the computational processes that enable fast perception and recognition. To explain the high speed of categorization in humans, a rank-order code was proposed (Thorpe, Delorme, & van Rullen, 2001; van Rullen & Thorpe, 2001), which in principle allows for a coding of natural images using only one spike per neuron. This approach assumes that the latency of spikes from the retinal ganglion cells is inversely proportional to the luminance of an image patch (Berry, Warland, & Meister, 1997). While this method displays robustness against static pixel noise (Delorme & Thorpe, 2001), it may fail when the spiking process is not strictly deterministic. In addition, real-life situations require the brain to cope with nonstationary inputs that have no well-defined starting and ending time, which calls into question whether a rank-order code is available in the brain.1 Alternative frameworks for explaining the speed of cortical computation explicitly use stochastic spikes for transmitting information. The corresponding investigations identified several coding principles allowing for a high velocity of signal transmission. Examples include massive population rate codes with and without a balance of excitation and inhibition (Panzeri, Treves, Schultz, & Rolls, 1999; Gerstner & Kistler, 2002; Fourcaud & Brunel, 2002) or codes where the signal is coded into the variance of inputs onto a large population of neurons (Silberberg, Bethge, Markram, Pawelzik, & Tsodyks, 2004).
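The latency scheme underlying the rank-order code — spike latency inversely proportional to local luminance, with the stimulus read out from the firing order alone — can be sketched as follows; the decoding by rank is a simplified illustration, and the constant k and the noise-free latencies are assumptions:

```python
# Rank-order (latency) code: brighter image patches fire earlier, and the
# downstream readout uses only the order of first spikes.
def latencies(luminances, k=1.0, eps=1e-9):
    """Latency of each neuron's first spike, inversely proportional to
    the luminance of its image patch (deterministic, noise-free)."""
    return [k / (lum + eps) for lum in luminances]

def rank_order(luminances):
    """Indices of neurons sorted by firing order (earliest spike first)."""
    lat = latencies(luminances)
    return sorted(range(len(lat)), key=lambda i: lat[i])

# A toy image patch: the brightest pixel (index 2) spikes first.
order = rank_order([0.2, 0.5, 0.9, 0.1])  # -> [2, 1, 0, 3]
```

The fragility noted in the text is visible here: any jitter added to the deterministic latencies can swap ranks between neurons with similar luminances, corrupting the code.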
In addition, the shape of neuronal tuning curves might be optimized to facilitate fast information coding. Depending on the decoding time available, either binary tuning curves (for short times) or gradual tuning curves (for long times) were found to be optimal (Bethge, Rotermund, & Pawelzik, 2003a, 2003b). These approaches provide no conclusive explanation of fast perception for two reasons. First, the work on population coding largely ignores the problem of processing information with respect to a certain task. Second, it is doubtful if the massive numbers of neurons required for accurate information transmission are available in the brain. Therefore, we focused our research on a plausible neuronal online algorithm for small networks, which might explain fast and precise computations despite a high degree of stochasticity in the spiking process. Generative models represent a framework that is well suited for describing probabilistic computation. In technical contexts, generative models 1
For the early visual system, saccades or input from the reticular formation could provide a reference signal for latency coding. However, it is doubtful if such a signal can be found in higher brain areas.
Efficient Computation Based on Stochastic Spikes
have been used successfully, for example, for the deconvolution of noisy, blurred images (Richardson, 1972; Lucy, 1974). Since information processing in the brain, too, seems to be stochastic in nature, generative models are also promising candidates for modeling neural computation. In many practical applications and, because firing rates are positive, possibly also in the brain, generative models are subject to nonnegativity constraints on their parameters and variables. The subcategory of update algorithms for these constrained models was termed nonnegative matrix factorization (NNMF) by Lee and Seung (1999). The objective of NNMF is to iteratively factorize scenes (images) $V_\mu$ into a linear additive superposition $H_\mu$ of elementary features W such that $V_\mu \approx W H_\mu$. For this well-defined problem, multiplicative update and learning algorithms have been derived and analyzed for different noise models and various further constraints on the parameters W and the internal representation $H_\mu$ (Lee & Seung, 2000). Generative models can be seen as algorithmic realizations of a principle first stated by Helmholtz, which hypothesizes that the brain attains internal states that are maximally predictive of the sensory inputs. While this idea has often been used in phenomenological models to explain psychophysical and neurobiological evidence, it is tempting to apply it also to neural subsystems in the sense that a network should evolve into a state that is the best explanation of all its inputs. This perspective on neural systems might appear rather philosophical, but it turns out to be a very useful framework for understanding neuronal computation. Generative models also cover deterministic computations by being able to predict the missing part of a typical but incomplete input pattern from its internal representation (for a schematic example, see Figure 1). In the general nondeterministic case, generative models are obviously related to Bayesian estimation.
If input signals originate from different sensory modalities, a generative model can perform optimal multimodal integration, and the internal states will then represent optimal Bayesian estimates of underlying features (Deneve, Latham, & Pouget, 2001; see Figure 1). Recent experimental evidence demonstrates that the brain can indeed adopt a Bayesian strategy when integrating two sources of information. In one example, tactile and visual information about the width of a bar is integrated in an optimal manner (Ernst & Banks, 2002). In another example, uncertain information about the position of a virtual hand is evaluated, incorporating prior knowledge about a random shift in the hand position (Körding & Wolpert, 2004). In addition, motion processing in area MT has been successfully described by a Bayesian approach, reproducing the neurons' firing properties in response to various moving random dot patterns (Koechlin, Anton, & Burnod, 1999). Taken together, generative models provide a very general and plausible framework for investigations of neural computation. Nevertheless, it is an open question whether it is possible to implement a biologically plausible generative network able to cope with the high stochasticity of spike trains, while at the same time being as fast as possible in
U. Ernst, D. Rotermund, and K. Pawelzik
Figure 1: Example schema of the general framework investigated in this letter. A scene V composed as a superposition of different features or objects is coded into stochastic spike trains S by the sensory system. In this example, the scene comprises a house, a tree, and a horse, together with their spoken labels. With each incoming spike at one input channel s, the model updates the internal variables h(i) of the hidden nodes i. Normally the generative network uses the likelihoods p(s|i) to evolve its internal representation H toward the best explanation, which maximizes the likelihood of the observation S. In our example, the network seeks the optimal, combined explanation of the spikes from the audiovisual input stream (black arrows). An alternative application of a generative model is to infer functional dependencies in its input. In every natural environment, certain features have a high likelihood of co-occurrence because they describe two different aspects of a single object. In this example, the impinging signals can be divided into two parts, $S_V$ and $S_A$, which represent the input and output arguments of a function $S_A = f(S_V)$. In our example, the function f describes meaningful combinations of visual and auditory input. Presenting only the visual part $S_V$ to an appropriate generative network will lead to an explanation $H^*$ (gray arrows to the right). This internal representation $H^*$ in turn can generate a sharp distribution over the function values, centered at the correct auditory counterpart $S_A$ to the input $S_V$. In this sense, the model computes the function f, mapping the correct spoken label to each of the visual objects (gray arrows to the left).
integrating the incoming information into a suitable internal representation of the external world. In this letter, we investigate a generative network designed to solve a typical factorization problem with nonnegativity constraints. In contrast to previous work using noise-free analog input patterns (Lee & Seung, 1999), our network receives a stochastic sequence of input spikes from a scene Vµ . We derive a novel algorithm that updates the internal representation using only a single spike per iteration step. It turns out to be surprisingly simple and efficient and can be complemented by an online learning
algorithm for the model parameters that can be interpreted as a Hebbian rule with weight decay. Furthermore, we modify the basic framework of a generative model for implementing deterministic computations of arbitrary functions, which we show to be closely linked to optimal multimodal integration. The dynamics and performance of this approach are demonstrated by applying it to two illustrative cases: a network trained with Boolean functions demonstrates the learnability and errorless performance for arbitrarily complex functions, and a small network for handwritten digit recognition demonstrates that less than one stochastic spike per input neuron can suffice to achieve impressive performance in a visual task, exceeding that of the nearest-neighbor classifier (explained in appendix D). We further use the Boolean parity function to demonstrate that suitable combinations of our basic model can perform efficient spike-based computation in hierarchical networks. Taken together, our framework provides a novel approach for explaining fast perception with stochastic spikes in a generative network model of the brain. This letter is organized as follows. In section 2, we introduce our model, together with a definition of the dynamics of its stochastic sensory input, and derive and discuss the algorithms for the rapid update of internal state variables and model parameters. In section 3, we present the results from our simulations of the two paradigmatic cases and the extension of the framework to hierarchical networks. This contribution concludes with a discussion of different probabilistic interpretations of the model and an evaluation of the biological plausibility of our model within the context of realistic neuronal dynamics and synaptic learning. Mathematical details are described in the appendixes.
2 A Spike-Based Generative Model

Every natural environment composed of various objects leads to an abundance of spikes that networks in the brain receive from the sensors or from other brain regions. A generative network should infer (or recognize) the typical objects or features that underlie (i.e., generate or predict) an observed spike train (see Figure 1). In doing so, the brain must combine prior knowledge about typical features and their frequency of occurrence with previously observed signals and input from other networks. This prior knowledge has to be acquired during earlier learning phases, in which typical correlations in the stream of information from all sources should be extracted in order to increase recognition speed and performance in a later task.

2.1 Basic Model. We consider a structurally simple realization of a generative model as a network consisting of a layer of input nodes indexed by s = 1, . . . , S connected via weights $w_{s,i}$ to a layer of hidden nodes indexed by i = 1, . . . , H. The hidden nodes hold the model's internal states $h_{\mu,i}$ (see Figure 1). The connection weights represent information about the input
statistics when typical features i are part of a scene µ. Following the framework of Lee and Seung (1999), the model is supposed to explain an observed scene $V_\mu$ (see note 2) as a linear, nonnegative superposition of the features W via

$$v_{\mu,s} \approx \sum_{i=1}^{H} w_{s,i}\, h_{\mu,i}. \qquad (2.1)$$
At first glance, this approach seems limited because of its linearity constraint on the decomposition of an observation. Nevertheless, such a framework has been successfully applied to many problems in natural image processing with analog input patterns (Hoyer, 2004; Olshausen & Field, 1996; Lee & Seung, 1999), and it allows comparing our novel spike-based stochastic model directly to known analog deterministic algorithms. We demonstrate that this approach is sufficiently complex to allow computation of arbitrary deterministic functions with few stochastic neurons in the brain. Furthermore, the mathematical derivation of spike-based information processing can also be applied to nonlinear generative networks using similar techniques.

2.2 From Poisson to Bernoulli Processes. Because of the high degree of stochasticity of spike events, networks in the brain receive information about a scene µ through stochastic spike trains impinging on the input nodes s. In our model, we assume that spike trains are generated with rates $R_\mu \propto V_\mu$ from independent Poissonian point processes. In this case, the number of spikes counted in each channel within a given time interval provides sufficient statistics: it carries the maximal amount of information about $V_\mu$ available from the spike observations. For setting up the model, we reformulate the rates $r_{\mu,s}$ in terms of the probability $\tilde{p}_\mu(s)$ of receiving the next spike in channel s. A straightforward calculation yields

$$\tilde{p}_\mu(s) = r_{\mu,s} \Big/ \sum_{s'=1}^{S} r_{\mu,s'} = v_{\mu,s} \Big/ \sum_{s'=1}^{S} v_{\mu,s'}. \qquad (2.2)$$
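As a concrete illustration (our own sketch in Python, not part of the original model specification; the function names are ours), the normalization of equation 2.2 and the resulting spike-by-spike event stream can be written as:

```python
import numpy as np

def event_probabilities(rates):
    """Equation 2.2: normalize the per-channel rates r_{mu,s} into the
    probability p~_mu(s) that the next spike arrives in channel s."""
    rates = np.asarray(rates, dtype=float)
    return rates / rates.sum()

def draw_spike_sequence(p_tilde, T, rng):
    """Draw T spike events from the Bernoulli (categorical) process;
    each event is simply the index of the channel that spiked."""
    return rng.choice(len(p_tilde), size=T, p=p_tilde)
```

Because only the channel indices of the events matter, this description is invariant to the total input rate, exactly as stated in the text.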
The representation of the scene µ through these event probabilities has the convenient properties that it is invariant with respect to the total rate
2 The following notation will be used throughout this letter. Vectors or matrices X are in bold, and their single components $x_i$ are in italics. Braces combine components into a vector or matrix, as in $X = \{x_i\}_{i=1,\dots,N} = \{x_i\}$. Examples are $W = \{w_{s,i}\}_{s=1,\dots,S;\,i=1,\dots,H} = \{w_{s,i}\}$ and $H = \{h_i\}_{i=1,\dots,H} = \{h_i\}$. Brackets select individual entries from a vector or matrix via $[X]_i = x_i$.
and ignores the true timing of the spikes. It changes the description of the spike statistics from Poissonian point processes to the more tractable Bernoulli processes and identifies each spike event with the index of the channel in which the spike occurred. These properties will strongly simplify the subsequent construction of our model. Instead of real time, we now use the running number of spike events, which we here denote by t. Note that this spike-by-spike clocking implies that real time is on average proportional to $t \big/ \sum_{s=1}^{S} r_{\mu,s}$. Thus, global knowledge about the total input rate allows us to estimate the real time elapsed during t events but is not necessary for deriving the following algorithms.

2.3 From Deterministic to Probabilistic Decomposition. Similar to the transformation of $v_{\mu,s}$ into $\tilde{p}_\mu(s)$, the weights $w_{s,i}$ are proportional to the probability $p(s|i)$ of an input spike in channel s given that the scene µ is composed of the single feature i,

$$p(s|i) = w_{s,i} \Big/ \sum_{s'=1}^{S} w_{s',i}. \qquad (2.3)$$
Reformulating the input scene $V_\mu$ in terms of the (normalized) firing probabilities $\tilde{p}_\mu(s)$ as described in section 2.2 transforms the factorization problem, equation 2.1, into the expression

$$\tilde{p}_\mu(s) \approx \frac{\sum_{i=1}^{H} p(s|i)\, h_{\mu,i}}{\sum_{s'=1}^{S} \sum_{i'=1}^{H} p(s'|i')\, h_{\mu,i'}} = \sum_{i=1}^{H} p(s|i)\, \frac{h_{\mu,i}}{\sum_{i'=1}^{H} h_{\mu,i'}}. \qquad (2.4)$$

It is obvious that a simple variable transformation $h_\mu(i) = h_{\mu,i} \big/ \sum_{i'=1}^{H} h_{\mu,i'}$ reduces this problem to

$$\tilde{p}_\mu(s) \approx \sum_{i=1}^{H} p(s|i)\, h_\mu(i) = p_\mu(s), \qquad (2.5)$$

with the nonnegativity and normalization constraints

$$p(s|i) > 0, \quad \sum_{s=1}^{S} p(s|i) = 1, \quad h_\mu(i) > 0, \quad \sum_{i=1}^{H} h_\mu(i) = 1. \qquad (2.6)$$
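Under the constraints of equation 2.6, the model's prediction (equation 2.5) is simply a mixture of the columns of the weight matrix. A minimal sketch (our own illustration, with p(s|i) stored as an S × H array):

```python
import numpy as np

def predict(p_sgi, h):
    """Equation 2.5: p_mu(s) = sum_i p(s|i) h_mu(i).
    p_sgi has shape (S, H) with columns summing to one; h has shape (H,)
    and sums to one, so the prediction is itself normalized over channels."""
    return p_sgi @ h
```

Because both factors are normalized, the output is automatically a probability distribution over the input channels, which is what allows the multiplicative updates of equations 2.18 through 2.21 to preserve normalization.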
The new variables $h_\mu(i)$ denoting the normalized superposition coefficients have a corresponding stochastic interpretation. To explain the next action
potential arriving at one input node s, the generative model assumes a doubly stochastic process underlying spike generation. First, a specific feature or cause i is drawn from the probability distribution $h_\mu(i)$. In a second step, a spike is drawn from the corresponding conditional probability distribution $p(s|i)$ and observed in channel s. While at first glance this interpretation seems rather academic, it is a natural description of the physical process by which a visual scene composed of superimposed features creates photons impinging on retinal ganglion cells. While we interpret the conditional probabilities $p(s|i)$ to be related to synaptic weights connecting input neuron s with hidden neuron i, the assumption of a continuous internal state variable present in each neuron goes somewhat beyond usual network models, in which only inputs and activities equivalent to output signals are allowed to represent the network's state. Since real neurons are spatially extended objects with large numbers of internal state variables, we believe it is well justified to hypothesize that some of them parameterize neuronal excitability rather independently of the actual activity. The characteristic dynamics of the internal state variables derived in the next section may then serve as specific predictions of our model (see also section 4).

2.4 Estimation and Learning Spike by Spike. Bayesian estimation implies that prior information about the internal state should be multiplied with the conditional probability of observing the actual external input, given the actual internal state. Therefore, generative models with iterative Bayesian algorithms appear to be suitable candidates for a comprehensive, abstract model of the function and dynamics of networks in the brain (Mumford, 2002; Lee & Mumford, 2003; Rao, 2004).
In the case of perception subjected to time constraints, prior knowledge about the causes underlying the previous sensory input has to be integrated online as efficiently as possible while the sequence of spikes arrives at the different input channels. As we already pointed out, previous models largely neglect this realistic condition by using analog, noise-free input patterns in each iteration step instead of only a single spike. We begin to derive update algorithms for the internal states and synaptic weights by considering stochastic ensembles of exactly T incoming spikes for a scene µ, which can be written as the temporally ordered sequences of input channels at which these spikes arrive,

$$\mathbf{S}^T_\mu = \left\{ s^t_\mu \right\}_{t=1,\dots,T}. \qquad (2.7)$$
We now seek an iterative algorithm that with every spike encountered tends to maximize the likelihood

$$P\left(\{\mathbf{S}^T_\mu\} \,\big|\, \{h_\mu(i)\}, \{p(s|i)\}\right) \qquad (2.8)$$
of observing the ensemble of sequences $\{\mathbf{S}^T_\mu\}_{\mu=1,\dots,M}$ over the space of all internal states $\{h_\mu(i)\}$ and model parameters $\{p(s|i)\}$. With

$$\hat{p}_\mu(s) = \frac{1}{T} \sum_{t=1}^{T} \delta_{s,\, s^t_\mu} \qquad (2.9)$$
denoting the relative number of spikes in channel s in one observation sequence µ, this likelihood is given by

$$P = \prod_{\mu=1}^{M} \frac{T!}{\prod_{s=1}^{S} \left(T \hat{p}_\mu(s)\right)!} \prod_{s=1}^{S} p_\mu(s)^{T \hat{p}_\mu(s)}. \qquad (2.10)$$
$p_\mu(s)$ is defined as in equation 2.5 (see note 3). The standard approach to this optimization problem is to minimize the negative logarithm of the likelihood P,

$$L = -\frac{1}{T} \log P = C - \sum_{\mu=1}^{M} \sum_{s=1}^{S} \hat{p}_\mu(s) \log p_\mu(s), \qquad (2.11)$$
under the constraints mentioned in equation 2.6. C is a constant that does not depend on the optimization variables. If there is no prior knowledge about the typical features of which the scenes µ are composed, the generative network will have to do both: estimate the $h_\mu(i)$ and learn the likeliest set of features $p(s|i)$ in order to explain the ensemble of observed spike sequences. As soon as suitable $p(s|i)$ have been found and can be held constant, minimizing L turns into a convex optimization problem for the $h_\mu(i)$ only. Unfortunately, serial algorithms updating their representations with each incoming spike from one input pattern cannot be deduced directly from the known update equations that work on all µ = 1, . . . , M full observation sequences $\mathbf{S}^T_\mu$ in parallel (Lee & Seung, 1999). The reasons for this will become clear during the following derivations, in which we seek rapid update rules for the $h_\mu(i)$'s and $p(s|i)$'s minimizing L that are based on a single spike observation $s^t_\mu$ in each iteration step t. In particular, we aim at a multiplicative update rule, because in many cases (Lantéri, Roche, & Aime, 2002), multiplicative algorithms are known to converge very fast when compared to simple additive gradient methods. Unfortunately, we were not able to find the fastest algorithm possible by
3 Note that $p(s)$, $\hat{p}(s)$, and $\tilde{p}(s)$ all denote different variables. While $\tilde{p}(s)$ describes the real but unknown input statistics of a scene, $\hat{p}(s)$ denotes a particular stochastic realization of this scene, measured as the relative spike count at the input nodes s. The approximation of $\hat{p}(s)$ by a linear superposition of elementary features is then denoted by $p(s)$.
deriving it directly from L, and we believe that it cannot be done for the general case. For devising a multiplicative rule by means of a gradient descent, we first compute the derivatives $-\nabla L$ of the log likelihood,

$$-\frac{\partial L}{\partial h_\mu(i)} = \sum_{s=1}^{S} \frac{\hat{p}_\mu(s)}{p_\mu(s)}\, p(s|i) \qquad (2.12)$$

$$-\frac{\partial L}{\partial p(s|i)} = \sum_{\mu=1}^{M} h_\mu(i) \sum_{s'=1}^{S} \frac{\hat{p}_\mu(s')}{p_\mu(s')}\, \delta_{s',s} = \sum_{\mu=1}^{M} h_\mu(i)\, \frac{\hat{p}_\mu(s)}{p_\mu(s)}. \qquad (2.13)$$
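To make the objective and its gradients concrete, the following sketch (ours, not from the paper) evaluates L up to the constant C and the gradient of equation 2.12 for a single pattern; a finite-difference comparison is a useful sanity check:

```python
import numpy as np

def neg_log_likelihood(p_hat, p_sgi, h):
    """Equation 2.11 for one pattern, dropping the constant C:
    L = -sum_s p_hat(s) log p_mu(s), with p_mu(s) = sum_i p(s|i) h(i)."""
    p_mu = p_sgi @ h
    return -np.sum(p_hat * np.log(p_mu))

def grad_h(p_hat, p_sgi, h):
    """Equation 2.12: -dL/dh(i) = sum_s [p_hat(s)/p_mu(s)] p(s|i)."""
    p_mu = p_sgi @ h
    return (p_hat / p_mu) @ p_sgi
```

Comparing `grad_h` against a numerical derivative of `neg_log_likelihood` confirms the sign convention: the returned vector is the *negative* gradient, i.e., an ascent direction for the likelihood.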
For unconstrained problems, $D \nabla L$ is a valid descent direction if D is a positive definite matrix (Bertsekas, 1995). With z denoting the iteration step, the gradient descent in its most general form reads

$$\tilde{h}^{z+1}_\mu(i) = h^z_\mu(i) - \gamma_h \left[ D^z_h \nabla_{h_\mu} L^z \right]_i \qquad (2.14)$$

$$\tilde{p}^{z+1}(s|i) = p^z(s|i) - \gamma_p \left[ D^z_p \nabla_p L^z \right]_{si}. \qquad (2.15)$$
In our case, we choose $D^z_h$ and $D^z_p$ as diagonal matrices with entries $[D^z_h]_{i,i} = h^z(i)$ and $[D^z_p]_{si,si} = p^z(s|i)$. $D^z_h$ and $D^z_p$ explicitly depend on the update step z. We further introduced the update constants $\gamma_h$ (estimation rate) and $\gamma_p$ (learning rate). Since the update is always positive, we have to satisfy the normalization constraints only, which is done by a postupdate normalization scheme:

$$h^{z+1}_\mu(i) = \frac{\tilde{h}^{z+1}_\mu(i)}{\sum_{j=1}^{H} \tilde{h}^{z+1}_\mu(j)} \qquad (2.16)$$

$$p^{z+1}(s|i) = \frac{\tilde{p}^{z+1}(s|i)}{\sum_{s'=1}^{S} \tilde{p}^{z+1}(s'|i)}. \qquad (2.17)$$
Combining update and normalization and substituting $\varepsilon = \gamma_h / (1 + \gamma_h)$ and $\gamma = \gamma_p$ yields the equations

$$h^{z+1}_\mu(i) = h^z_\mu(i) \left[ (1 - \varepsilon) + \varepsilon \sum_{s=1}^{S} \frac{\hat{p}_\mu(s)}{p^z_\mu(s)}\, p(s|i) \right] \qquad (2.18)$$

$$p^{z+1}(s|i) = p^z(s|i)\, \frac{1 + \gamma \sum_{\mu=1}^{M} h_\mu(i)\, \dfrac{\hat{p}_\mu(s)}{p^z_\mu(s)}}{1 + \gamma \sum_{\mu=1}^{M} h_\mu(i) \sum_{s'=1}^{S} \dfrac{\hat{p}_\mu(s')}{p^z_\mu(s')}\, p^z(s'|i)}. \qquad (2.19)$$
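For a single pattern, the combined update of equation 2.18 can be written compactly; note that it preserves the normalization of h without an explicit renormalization step (our own sketch, with `eps` standing for ε):

```python
import numpy as np

def update_h_full(h, p_sgi, p_hat, eps):
    """Equation 2.18 for one pattern: multiplicative reweighting of h by the
    likelihood ratio p_hat(s)/p_mu(s), mixed with the identity via eps."""
    p_mu = p_sgi @ h                     # model prediction, equation 2.5
    gain = (p_hat / p_mu) @ p_sgi        # sum_s [p_hat(s)/p_mu(s)] p(s|i)
    return h * ((1.0 - eps) + eps * gain)
```

Since the weighted gain satisfies $\sum_i h(i)\,\mathrm{gain}(i) = \sum_s \hat{p}(s) = 1$, the updated state again sums to one, which is why the normalization of equation 2.16 is absorbed into the $(1-\varepsilon)$ mixing.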
These equations still use the full sequences $\mathbf{S}^T_\mu$ for all µ = 1, . . . , M patterns to do one update step from z to z + 1. If a given collection of T spikes is considered a fixed input vector, equations 2.18 and 2.19 have the same fixed points as the corresponding NNMF algorithms (Lee & Seung, 2000), which can be derived using expectation-maximization (EM), and for $\varepsilon \to 0$ and $\gamma \to \infty$ they become equivalent. However, at one instant in real time, only one scene µ is presented, and only one or very few spikes are observed. In order to convert equations 2.18 and 2.19 to an online update rule, we restrict the algorithms to one pattern µ at a time. Furthermore, we assume that from this pattern, only one spike observed at time t enters the update equations for h or p. This procedure is equivalent to dropping the summations over µ, replacing z by t, and inserting $\delta_{s,s^t}$ instead of the full pattern $\hat{p}_\mu(s)$. The corresponding update equations (online estimation and online learning) resulting from equations 2.18 and 2.19 then read

$$h^{t+1}(i) = h^t(i) \left[ (1 - \varepsilon) + \varepsilon\, \frac{p(s^t|i)}{p^t(s^t)} \right] \qquad (2.20)$$

$$p^{t+1}(s|i) = p^t(s|i) \left[ 1 + \frac{\gamma\, h(i)}{p^t(s^t) + \gamma\, h(i)\, p^t(s^t|i)} \left( \delta_{s,s^t} - p^t(s^t|i) \right) \right]. \qquad (2.21)$$

The algorithms in equation 2.20 (spike-by-spike online estimation) and equation 2.21 (spike-by-spike online learning) are our main results. Equations 2.20 and 2.21 are online algorithms, in contrast to equations 2.18 and 2.19 and the NNMF learning rules. Their update relies on the extreme case of only one spike, but it is straightforward to construct mixed versions that take several spikes into account in each iteration step (this is done by choosing T in the definition, equation 2.9, of $\hat{p}_\mu$ equal to the number of spikes that should be considered in one step). Our online versions have finite memory, and for stationary input patterns, fluctuations of the hidden states will in general remain finite due to the stochastic drive.
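A direct transcription of the two online rules into Python (our sketch; `p_sgi` stores p(s|i) as an S × H array, and `eps` and `gamma` correspond to ε and γ):

```python
import numpy as np

def sbs_online_estimate(h, p_sgi, s_t, eps):
    """Equation 2.20: one spike in channel s_t reweights the internal state.
    The update preserves sum_i h(i) = 1 by construction."""
    p_t = p_sgi[s_t] @ h                        # p^t(s^t), equation 2.5
    return h * ((1.0 - eps) + eps * p_sgi[s_t] / p_t)

def sbs_online_learn(h, p_sgi, s_t, gamma):
    """Equation 2.21: each weight column p(.|i) is pulled toward the observed
    spike in proportion to h(i); columns remain normalized and positive."""
    p_t = p_sgi[s_t] @ h
    coeff = gamma * h / (p_t + gamma * h * p_sgi[s_t])
    delta = np.zeros(p_sgi.shape[0])
    delta[s_t] = 1.0
    return p_sgi * (1.0 + coeff * (delta[:, None] - p_sgi[s_t]))
```

Both rules are multiplicative, so positivity is automatic, and the column sums of the weight matrix stay at one because the correction term $(\delta_{s,s^t} - p^t(s^t|i))$ averages to zero under each column.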
Therefore, both algorithms can at most achieve approximate maximization of the likelihood for the full input spike sequence. In the next section, we demonstrate by means of simple examples that with suitably chosen update parameters $\varepsilon \in [0, 1]$ and $\gamma \in [0, \infty]$, this approximate convergence is sufficient to achieve high computational performance.

2.5 Simplified Algorithm with Batch Learning. The online learning equation can be transformed to reduce computational complexity. In this case, learning of the $p(s|i)$ is assumed to take place on a much larger timescale than the estimation process of the internal state h. In reality, learning is a slow process extending over the (repeated) presentation of many training scenes µ. Equation 2.21 hereby imposes a high computational load during online learning. This load can be significantly reduced by assuming that on a fast timescale, the hidden representations
$h^t_\mu$ for each µ, µ = 1, . . . , M, are computed with the online estimation algorithm, equation 2.20, for T spikes and time steps each. After these computations, now on a much slower timescale, the time-averaged hidden activities $\langle h \rangle_\mu$ (the states $h^t_\mu$ averaged over the last update steps up to t = T) are used in parallel for a single update step of the conditional probabilities $p(s|i)$ according to equation 2.19. A suitable multiplicative algorithm for this update has been proposed by Lee and Seung (1999), which reads

$$\tilde{p}^z(s|i) = p^z(s|i) \sum_{\mu=1}^{M} \frac{\hat{p}_\mu(s)\, \langle h \rangle_\mu(i)}{\sum_{i'=1}^{H} p^z(s|i')\, \langle h \rangle_\mu(i')} \qquad (2.22)$$

$$p^{z+1}(s|i) = \frac{\tilde{p}^z(s|i)}{\sum_{s'=1}^{S} \tilde{p}^z(s'|i)}. \qquad (2.23)$$
The parameter z = 1, . . . , Z counts the learning steps and is identical to the update step used in equations 2.18 and 2.19. With M different scenes (stimuli) in total, one update of the $p(s|i)$ is made every T M spikes. Equations 2.22 and 2.23, together with equation 2.20, define the spike-by-spike batch learning algorithm (SbS-batch). Note that equations 2.22 and 2.23 can be derived directly from equation 2.19 by substituting the average $\langle h \rangle$ for h and by choosing the maximum update constant $\gamma \to \infty$.

3 Results

In this section, we demonstrate that the SbS-online and SbS-batch algorithms optimize suitable representations of scene ensembles by predicting the spikes arriving at the input nodes. For learning Boolean functions, our scene ensemble will consist of all input and output bit patterns of the respective function. For learning handwritten digits, the scene ensemble will be represented by the pixel patterns of the digit images, together with their correct classifications from a training data set. The performance of the learned representation will be evaluated in a subsequent computation or classification task on a test data set. Results will be compared to those of a simple nearest-neighbor classifier (abbreviated NN classifier; for details, see appendix D) and to the NNMF algorithm by Lee and Seung (1999).

3.1 A Simple Example. First, we illustrate characteristic properties of the spike-by-spike update of internal states and the resulting coding in our network using a minimal model. For this purpose, we consider a highly overcomplete situation where S = 2 input neurons project to H = 100 hidden neurons. The optimization problem is therefore underconstrained, and accordingly, the NNMF model of Lee and Seung converges to different internal
states H for different initial conditions. It appears that the algorithm prefers mixtures of almost all available features W, as shown in Figure 4A. In contrast, the sparse solution H found by our online update algorithm turns out to be unique for almost all initial conditions (see Figure 4B). The degree of sparseness is tightly linked to the parameter $\varepsilon$: the larger $\varepsilon$, the sparser the internal representation. With $\varepsilon = 1$, only one hidden state will be active: the one with the maximum likelihood of explaining the input sequence. In the example shown in Figure 4B, the high value of $\varepsilon$ causes the representation to concentrate on the feature vector that is, in this case, the most natural explanation of the input spike statistics.

3.2 Preprocessing, Training, and Classification. The simulations are carried out in two stages: a training run and a test run. Before we present results on learning and performance for specific computations, we briefly describe how the data are constructed, preprocessed, and partitioned into training and test sets.

3.2.1 Training. For learning a classification or computation task, each of the $M_{tr}$ training scenes $V^{tr}_\mu$ comprises a pattern or image, together with its correct classification into one of $S_c$ classes. First, the average is removed from each pattern (or image). Since firing rates are always positive and because this transformation leads to negative pattern values, it is necessary to distribute the corresponding values over twice the number of channels. Positive values are directly assigned to the odd channels, while the negative values are rectified and assigned to the even channels. Second, the classification is represented by a vector with $S_c$ entries, with only one nonzero entry at the position corresponding to the correct class. Finally, this vector and the processed pattern are weighted and combined to form a single, normalized vector $V^{tr}_\mu$ from which the input spikes are drawn.
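The mean-removal and sign-splitting step just described can be sketched as follows (our illustration; we take "odd channels" to mean positions 1, 3, 5, . . . in 1-based counting, i.e., even indices of a 0-based array, which is an assumption about the channel ordering):

```python
import numpy as np

def split_signed(pattern):
    """Remove the mean of a pattern, then distribute the values over twice
    as many channels: positive parts go to the odd channels, rectified
    negative parts to the even channels, so all inputs stay nonnegative."""
    x = np.asarray(pattern, dtype=float)
    x = x - x.mean()
    out = np.zeros(2 * x.size)
    out[0::2] = np.maximum(x, 0.0)    # odd channels (1-based): positive part
    out[1::2] = np.maximum(-x, 0.0)   # even channels: rectified negative part
    return out
```

After this step, each original pixel contributes to exactly one of its two channels, and the result can be normalized and combined with the class vector to form $V^{tr}_\mu$.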
The mathematical details of this procedure are described in appendixes A and B. On $V^{tr}_\mu$, the network now learns a suitable representation, including the correct match between patterns and classes, in an unsupervised manner. This procedure is schematically explained in Figure 1 and employs the network structure sketched in Figure 2.

3.2.2 Test. For testing the trained generative network, the $M_{ts}$ test scenes $V^{ts}_\mu$ consist of only the transformed, positive input patterns, without the class information (details are described in appendix C; see also Figure 3). The correct class $c^{ts}_\mu$ is compared to the prediction $\hat{c}_\mu$ inferred from the internal states of the generative model. This method is described in Figure 1 as an inference procedure with partial sensory input and employs the network structure sketched in Figure 3.

3.3 Boolean Functions. As a proof of principle, we used the problem of learning and executing Boolean functions. Computation of arbitrary
Figure 2: Spike-by-spike network for training a classification task. The training patterns with $S_p$ components, together with their correct classification into one of $S_c$ classes, are presented as randomly drawn spike trains to $S = S_p + S_c$ input nodes during learning. The network thereby finds a suitable representation $p(s|i)$ of the input ensemble and estimates an internal state $h(i)$ for each input pattern according to either the SbS online or the SbS batch algorithm.
Figure 3: Spike-by-spike network for classification of patterns. The test patterns are presented to the $S_p$ input nodes, and an appropriate internal state $h(i)$ explaining the actual pattern is estimated by the SbS algorithm. From the internal state, a classification $q(c)$ is inferred and compared to the correct classification.
Boolean functions in networks that receive only stochastic input is particularly nasty. The spike noise needs to be canceled perfectly, because Boolean functions are nonsmooth in the sense that flipping one input bit can result in a complete change of the output value. If a network is capable of performing the required mapping, this has interesting consequences for multimodal integration and attention, which will be explained in section 4. In any case, efficient computation of Boolean functions would demonstrate that in principle, every possible computation, including the well-known XOR problem, can be rapidly performed in small networks with stochastic spikes. For our simulations, we randomly selected Boolean functions of N = 5 input bits mapped to one output bit, leading to $M_{tr} = M_{ts} = 2^N = 32$
input-output patterns for one specific function (see note 4). Consequently, a number of $H = 2^N$ hidden units should suffice to represent one Boolean function of N bits. In this case, each weight vector $p(s|i)$, comprising the conditional probabilities for a fixed i, matches one of the $M_{ts}$ test patterns. Consequently, for each of the input patterns, only one of the hidden units h should be active (winner-takes-all network). Alternatively, it might be possible that fewer than $2^N$ hidden nodes in the network are sufficient to perfectly represent a specific Boolean function. In this case, each input pattern can be represented by a mixture of active hidden nodes (soft competition network). As we considered only one output bit, there are only $S_c = 2$ different classes to which an input pattern can be assigned. For details of the simulation, see appendix E.
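To make the setup concrete, the $2^N$ input-output patterns of a Boolean function can be enumerated as follows (our sketch; the encoding of each bit as an on/off channel pair is one plausible choice for keeping the patterns nonnegative and is not taken from appendix E):

```python
import numpy as np
from itertools import product

def boolean_patterns(f, n):
    """Enumerate all 2^n input-output patterns of an n-bit Boolean function f.
    Each bit (including the output bit) is coded as an (on, off) channel pair."""
    patterns = []
    for bits in product([0, 1], repeat=n):
        row = []
        for b in list(bits) + [int(f(*bits))]:
            row += [b, 1 - b]                  # (on, off) pair per bit
        patterns.append(row)
    return np.array(patterns, dtype=float)
```

For the XOR function of two bits, this yields $2^2 = 4$ patterns with $2 \cdot (2 + 1) = 6$ channels each.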
3.4 Handwritten Digits. As a third example, we selected the problem of handwritten digit recognition in order to demonstrate that real-world estimation problems can be solved efficiently and very fast by estimation with single spikes. We used the U.S. Postal Service data set, which serves as a benchmark for classification algorithms and allows comparison with human performance. For parameters of our simulations, see appendix F. 3.4.1 Performance of the Network. Figure 7 shows the classification error of the network versus the mean number of spikes per input node. It turns out
4 We used the same set of patterns for both training and classification. Note, however, that the realization of the patterns was different in both runs because the patterns are represented by a finite number of input spikes, which are randomly drawn in each repetition of a presentation.
1328
U. Ernst, D. Rotermund, and K. Pawelzik
Figure 4: (A) Dynamics of the internal representation H in dependence on the iteration step z for the NNMF model by Lee and Seung (1999) and (B) for the spike-by-spike network in dependence on the number t of incoming spikes. The networks comprise S = 2 input nodes and H = 100 hidden nodes, with the weights chosen as p(1|i) = (i + 1)/H and p(2|i) = 1 − p(1|i). The initial state was random but identical for both models, and the input was given by the vector V = {0.42, 0.58}. While the NNMF model converges to a broad mixture of all feature vectors p(s|i), the SbS network converges to a sparse state where the internal representation peaks around the feature vector p(s|42) = {0.42, 0.58} and accurately reflects the input rate distribution. The update constant ε was 0.5. In addition, the right plot of B shows that the Kullback-Leibler (KL) divergence, averaged over 1000 runs with different V, decreases with iteration time, thus demonstrating convergence of our algorithm.
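The toy example of this caption can be reproduced in a few lines. Since equation 2.20 is not part of this excerpt, the sketch below uses an update consistent with the description in the text (an update term proportional to h(i) p(s|i), normalized so that h remains a probability distribution); it is an illustration under that assumption, not the authors' exact code:

```python
import numpy as np

rng = np.random.default_rng(0)

S, H = 2, 100
eps = 0.5                             # update constant, as in the caption

# Weights as in the caption: p(1|i) = (i + 1)/H, p(2|i) = 1 - p(1|i)
p = np.zeros((S, H))
p[0] = (np.arange(H) + 1.0) / H
p[1] = 1.0 - p[0]

V = np.array([0.42, 0.58])            # input rate distribution
h = np.full(H, 1.0 / H)               # random in the caption; uniform here

for _ in range(2000):
    s = rng.choice(S, p=V)            # draw one input spike
    drive = h * p[s]                  # update term proportional to h(i) p(s|i)
    h = (h + eps * drive / drive.sum()) / (1.0 + eps)   # h stays normalized

# h concentrates around the feature matching V: p(1|i) = 0.42 at i = 41 (0-based)
print(np.argmax(h), round(float((h * p[0]).sum()), 3))
```

After enough spikes, the reconstruction Σ_i h(i) p(s|i) approximates the input rate distribution V, which is the convergence property the caption describes.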
Figure 5: Mean classification error for Boolean functions of five input bits and one output bit, in dependence on the number of nodes H in the hidden layer of the network. The SbS online learning (solid line) clearly outperforms the noiseless NNMF algorithm (dashed-dotted line), which barely approaches 20% error.
Efficient Computation Based on Stochastic Spikes
Figure 6: Error rate e of the SbS online algorithm (solid line) for the task in Figure 5 in comparison to the error of an NN classifier (dotted line). The SbS online algorithm is as fast and as accurate as the NN classifier, which in this case represents an optimal lookup of function values. Note that a mean of three spikes per input neuron is sufficient to achieve perfect classification.
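The NN baseline used in this comparison (defined in appendix D) amounts to a quadratic-distance lookup over all stored training patterns. A minimal sketch with invented toy patterns:

```python
import numpy as np

def nn_classify(test_patterns, train_patterns, train_labels):
    """Assign each test pattern the class of the training pattern with
    the smallest quadratic distance ||V_tr - V_ts||^2 (appendix D)."""
    test = np.atleast_2d(np.asarray(test_patterns, dtype=float))
    train = np.atleast_2d(np.asarray(train_patterns, dtype=float))
    # pairwise squared distances: (n_test, n_train)
    d = ((train[None, :, :] - test[:, None, :]) ** 2).sum(axis=2)
    return np.asarray(train_labels)[d.argmin(axis=1)]

train = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1])
print(nn_classify([[0.1, 0.2], [0.9, 0.7]], train, labels))   # [0 1]
```

Its accuracy comes at the cost of storing the entire training set as codebook vectors, which is exactly the trade-off discussed in the text.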
that the NN classifier is remarkably good, reaching 6.1% error on the test set after a mean of about one spike per input node. This performance comes, however, at the price of using the full Mtr = 7291 training patterns for classification. In contrast, a network with only H = 500 nodes optimized on the training set with the batch algorithm achieves a superior performance with one spike per input neuron. An increase in classification speed and quality can be obtained using an annealing strategy by adapting (cooling) ε during learning and estimation of the internal states: the corresponding curve (the dotted line in Figure 7) shows that with only H = 100, the classification is nearly as fast and as good as the NN classifier with its 7291 stored patterns. Figure 8 shows the complete weight set for the case SbS with H = 500 and ε = 0.1. About half of the weights consist of templates, which, however, become combined with particular feature weights during classification runs and are required to achieve the classification performance shown. With the same parameters, we computed the classification error of the network in dependence on H, which is shown in Figure 9. Using H = 25 hidden nodes is already sufficient to obtain an error that is only twice as high as the error of the NN classifier using the full set of 7291 pattern templates. In addition, the error curve shows no signatures of overfitting, as it decreases monotonically even for very large networks with up to H = 7290 nodes.

3.4.2 Robustness Against Noise. While our algorithm by construction is robust against noise in the spiking process, the question is whether this
Figure 7: Mean classification error e for the USPS database shown for different numbers of neurons in the hidden layer, in dependence on the number of input spikes (parameters as in Figure 8). The dashed line shows the error of an NN classifier in comparison. The SbS algorithm with 500 hidden neurons (solid line) is considerably slower but exceeds the performance of the NN classifier. If learning and classification are performed with an adaptive estimation constant ε ∝ t^(−1/3) (dotted line), the performance with only 100 hidden neurons approaches the speed and performance of the NN classifier. The weights in this example were trained with the batch learning rule.
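The cooling schedule from the caption can be written down directly; only the proportionality ε ∝ t^(−1/3) is given, so the scale eps0 below is an assumption:

```python
def annealed_eps(t, eps0=1.0):
    """Estimation constant cooled as eps ∝ t**(-1/3); t counts incoming
    spikes, shifted by one so the schedule is defined at t = 0.
    The overall scale eps0 is an assumed parameter."""
    return eps0 * float(t + 1) ** (-1.0 / 3.0)

print(annealed_eps(0), annealed_eps(7))   # 1.0 and ~0.5 (since 8**(-1/3) = 0.5)
```

Early spikes then move the internal state quickly, while later spikes refine it, which is the speed/accuracy trade-off the annealing strategy exploits.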
remains true for other types of noise. To address this issue, we subjected the digit patterns to two types of perturbations. First, we occluded an increasing number of vertical or horizontal lines in the digit patterns (starting with the two center lines), and second, we added an increasing amount of noise to each pixel in a pattern. Figure 10A shows the classification performance in dependence on the number of covered lines. Up to a number of six lines, the classification error stays below 20 percent. Figure 10B demonstrates that the pixel noise must increase to a considerable amount, η = 0.35, in order to raise the error above 20%.

3.5 Hierarchical Networks. Finally, we demonstrate that our basic networks can be combined by using the hidden variables h(i) as probabilities for sending out spikes, that is, by turning them into input neurons of subsequent networks. As a simple example, we present a hierarchical network that solves the parity problem, which requires the output to be one whenever an odd number of input neurons is active and zero otherwise. While this problem can be solved trivially with H = 2^N hidden neurons, a hierarchical topology of simple SbS
Figure 8: Conditional probabilities (weight vectors) p*(s|i) for the SbS batch learning algorithm using H = 500 hidden nodes. The vectors i for even and odd input nodes s are combined and individually rescaled to a gray value g_i(s) ∈ [−1, 1], and then displayed in a 16 × 16 pixel raster. The g_i(s) were computed as g_i(s) = g̃_i(s) / max_{s'}{|g̃_i(s')|} with g̃_i(s) = p*(2s − 1|i) − p*(2s|i). Parameters for batch learning were chosen as described in appendix F.
Figure 9: Mean classification error e for the USPS database in dependence on the number of hidden nodes after 10 spikes per input node. The weights are attained by the batch learning rule, and the other parameters are as chosen for Figure 8.
Figure 10: Error rate of the SbS algorithm for (A) digit patterns partially occluded by gray bars and (B) digit patterns subjected to a varying amount η of pixel noise (weights and parameters as in Figure 8). In A, the gray bars indicate the error rate under a horizontal occlusion of pixel rows (for details, see appendix F), while black bars indicate the error under a vertical occlusion of pixel columns in dependence on the number of occluded rows or columns. For fewer than four occluded pixel rows or columns and for a noise level less than η = 0.25, the recognition error remains below 10%.
networks can drastically reduce the size of the network. This network proves that our approach can be generalized in a self-consistent manner, using the output of one simple module as a meaningful input to another simple module. Therefore, one can conclude that SbS networks can be combined for iterative computations. Together with our results on Boolean functions, spike-by-spike networks are thus able to compute arbitrarily complex functions with any required precision in a finite amount of time.

3.5.1 Parity Function Network. The binary parity function is 1 if the number of input bits equal to 1 is odd and 0 otherwise. For an arbitrary number of input bits, the parity function is easily computed by a hierarchical tree of XOR modules. For our simulations, we choose N = 16 input bits, thus having Sp = 2N = 32 on-off input channels and two output classification nodes (see Figure 11). The full network consists of 15 spike-by-spike XOR modules, whose weights were set up manually (a value of 0.5 was assigned to all connections drawn in Figure 11). In every module, input-output nodes are shown as white discs and hidden nodes as gray discs. The hidden node activities h(i) and output variables p̂(k, n) of one XOR module are updated by equations 2.20 and 2.32 with each incoming spike. Spikes are elicited either by the external input or by the activities p̂(k, n) of an internal output node n, which is at the same time the input node for the next XOR module. With probability p_I = 0.05, a spike is drawn according to the probability distribution of the actual input bit pattern and sent to one of the 8 XOR modules in the input layer. With probability p_H = p̂(k, n)(1 − p_I)/14, a spike is drawn from one
Figure 11: Structure of a hierarchical network constructed by hand for solving the parity function with 16 input bits. The network consists of 15 XOR modules linked together. Gray discs denote the hidden-layer nodes of the single modules, and open circles denote the input-output nodes. Connections between the nodes are marked with solid lines (all connections set to a weight of 0.5). The arrow indicates the output nodes that compute the parity subfunction for the first eight input bits (from left).
of the output nodes of the 14 XOR modules in the lower layers of the hierarchy and passed on to the next XOR module. If the network performs the intended computation perfectly, by construction each pair of input-output nodes should display a well-defined activity for each input pattern. As an example, consider the two output nodes marked by an arrow in Figure 11. If the number of bits of value 1 within the first 8 bits of the input pattern is even, the output variable of the left node should take a lower value (ideally 0) than the output variable of the right node (ideally 1), thus computing the parity function for the first 8 input bits. Figure 12 displays the corresponding errors of the network for the output layer and for the input-output nodes in the hidden layers in dependence on the number of spikes arriving in the input layer. All errors go down to zero, but on different timescales. The errors in the lower layers of the hierarchy clearly decrease earlier than those in the higher layers. This behavior is to be expected from the structure of the network: only when the input from the (m − 1)th layer is correct can the output error of the mth layer decrease to 0.

4 Summary and Discussion

We presented a generative framework for neural computation that relies on spike signals given by Poissonian point processes. In our framework, probabilities for eliciting spikes capture neural activities, and synaptic weights are correspondingly related to conditional probabilities to observe spikes given a putative elementary cause. While our scheme is equivalent to
Figure 12: Classification error of the hierarchical SbS network displayed in Figure 11 in dependence on the mean number of spikes per input node (solid line). In a hierarchical network, the accuracy of the output of a higher layer relies on the accuracy of the output of the lower layers. This dependence is demonstrated by the error curves for the first hidden layer (dashed line), the second hidden layer (dashed-dotted line), and the third hidden layer (dotted line). Correspondingly, the curves show that the error in the nth layer falls below 10% about 1 to 2 spikes per input node later than in the (n − 1)th layer.
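The XOR-tree decomposition that the hierarchical network implements can be summarized as a reference computation (this is the target function, not the spiking network itself):

```python
def parity(bits):
    """Parity via a tree of XOR modules (cf. Figure 11): XOR adjacent
    pairs and repeat until one value remains. Assumes len(bits) is a
    power of two, e.g. 16 as in the simulations."""
    layer = list(bits)
    while len(layer) > 1:
        layer = [a ^ b for a, b in zip(layer[0::2], layer[1::2])]
    return layer[0]

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
print(parity(bits), sum(bits) % 2)   # tree result equals sum mod 2
```

For 16 bits the tree has 8 + 4 + 2 + 1 = 15 XOR nodes, matching the 15 SbS modules of the network in Figure 11.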
nonnegative matrix factorization (NNMF) when used for the asymptotic case where mean rates represent the signals, the spike-by-spike online update for single-action potentials leads to a fundamentally different dynamics of hidden states. In particular, for overcomplete representations, our fast algorithm favors sparse hidden-layer representations in contrast to the basic algorithms for NNMF derived in Lee and Seung (1999). While ad hoc sparseness conditions for NNMF have been discussed in Hoyer (2004), in our framework they emerge from the requirement of rapid convergence. A different idea to link sparseness to fast visual processing was put forward by Perrinet, Samuelides, and Thorpe (2004). Here, the idea is to subtract with each spike an orthogonal projection of the best matching feature from the input scene (matching pursuit). This method has been shown to need very few spikes to achieve good reconstruction of natural scenes. We demonstrated with simple paradigmatic examples that very small networks using our dynamics can rapidly and accurately perform complex computations using only a few stochastic spikes per input neuron for achieving maximal performance. For a proof of concept, we considered learning and computation of random Boolean functions. Boolean functions
are particularly challenging because a different input in one channel (bit) may switch an arbitrary function of the remaining bits to an arbitrary different function. In a neural environment, this dynamical property allows changing a computation performed in a certain brain area depending on a specific context. This context can be interpreted, for example, as attentional priming, additional sensory cues, or information from other brain areas. For Boolean functions with N input bits, 2^N hidden nodes are in general required for a complete representation. However, we found that optimal performance is in most cases possible with fewer than these 2^N hidden nodes. This reduction is due to the capability of the algorithm to find and represent redundancies in an effective manner. The error vanished completely, in dramatic contrast to the algorithm for NNMF by Lee and Seung (1999), which failed for this task. It is possible that the stochasticity of the spikes effectively implements a Monte Carlo sampling over the posterior, which helps to find a better solution than NNMF does. This is an interpretation put forward by Hoyer and Hyvärinen (2003). The ability of our network to learn and compute Boolean functions shows that highly nonlinear, nonsmooth functions can be computed efficiently on the basis of very few spikes, which underlines the universality of our framework. Applications of our model with H ≈ 500 hidden nodes to handwritten digits demonstrate that, on average, fewer than one spike per input neuron leads to recognition rates exceeding those of the nearest-neighbor classifier that used the full training set with more than 7000 patterns. This high speed of recognition appears to be a general property of our framework and depends little on the input stimulus ensemble. We also studied our model with natural images, thereby confirming these results (data not shown).
Training the network with natural images by the batch learning rule leads to weight vectors p(s|i) whose appearance strongly depends on the value of ε. For very small ε, the weight vectors approximate orthogonal pixel basis functions, specializing in explaining the spikes of a few neighboring input channels. Consequently, the representation of one natural image is a linear, nonsparse superposition of these basis functions. In contrast, values of ε near one lead to weight vectors approximating templates of image patches. The representation then gets maximally sparse by activating only one hidden node whose weight vector most closely matches the presented input pattern. In between these extreme cases, the weight vectors have strong similarities to the receptive fields shown in Olshausen and Field (1996). However, our interpretation of the weight vectors learned with a fixed ε is one of a representation that is optimized to explain any input pattern with an average number of spikes proportional to 1/ε. In summary, our results challenge the notion that the high speed of visual perception found in human and monkey psychophysics (Thorpe et al., 1996; Mandon & Kreiter, 2005) can be explained only by assuming
that the exact timing of individual spikes conveys the information about the stimulus (van Rullen & Thorpe, 2001). The recognition of handwritten digits was also used to show that the framework is very robust to noise and data corruption. By construction, our framework is invariant to scalings of overall intensities. This property has been found in primary visual cortex, where the tuning of neurons is invariant to stimulus contrast (Skottun, Bradley, Sclar, Ohzawa, & Freeman, 1987), which gives a hint of the potential relevance of our approach for explaining cortical computation. At first sight, the equations for update and online learning, equations 2.20 and 2.21, might appear biologically unrealistic. However, we expect that the algorithms can be implemented using only known properties of real neural circuits. This implementation is the subject of our ongoing studies, but we give an outline of the underlying ideas in the following paragraph. First, we assume that the hidden variable h_t(i) is represented in the excitability of a neuron or, more realistically, in the excitability of a neuronal group or population (Wilson & Cowan, 1972), which would be equal to the average over all membrane potentials. Second, we observe that the update term in equation 2.20 is proportional to h_t(i) p(s_t|i). One could interpret this term as a spike delivered over a synapse with efficacy p(s_t|i), which is multiplied with the excitability h_t(i). In fact, such multiplicative interactions have been observed in real neurons (Chance, Abbott, & Reyes, 2002). Spikes in the populations representing the h_t(i) will be elicited with a probability proportional to h_t(i).
One inhibitory population would suffice to receive and average those spikes and to feed them back to the other populations as a divisive normalization, which is also a well-known neural operation that has been extensively studied in, for example, the early visual cortical areas (Heeger, 1992; Carandini & Heeger, 1994). Taken together, this outline demonstrates that a biophysically plausible implementation of our algorithm is within reach and will in its final form be described by coupled differential equations defined in real time instead of using event timing in discrete update algorithms. A different approach to spike-by-spike estimation was studied by Denève (2005), who derived a differential equation updating an estimate of the log-likelihood ratio for the presence or absence of a stimulus in the receptive field of a Bayesian neuron. This study is very interesting because it takes the temporal dynamics of a stimulus into account, although in its current state only for the presence or absence of a single cause and not for mixtures of causes. Denève also demonstrated that the corresponding differential equation is, under various approximations, similar to that of an integrate-and-fire neuron. For the biological plausibility of online learning, equation 2.21, we observe that the weight change is essentially given by two terms: one that is Hebbian and another that describes weight decay proportional to
postsynaptic activity:

Δp(s|i) ∝ h(i) p(s|i) (δ_{s,s_t} − p(s_t|i)).   (4.1)
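Both terms of this rule are easy to verify numerically. The sketch below applies equation 4.1 once with toy dimensions (γ and the sizes are our choices, cf. appendix E) and checks that the weight toward the observed channel grows while each column of p(s|i) stays normalized, since the two terms cancel when summed over s:

```python
import numpy as np

rng = np.random.default_rng(1)
S, H, gamma = 4, 3, 0.01        # toy sizes; gamma is the learning constant

p = rng.random((S, H))
p /= p.sum(axis=0)              # columns normalized: sum_s p(s|i) = 1
h = rng.random(H)
h /= h.sum()

s_t = 2                         # channel of the observed input spike
delta = np.zeros(S)
delta[s_t] = 1.0
dp = gamma * h * p * (delta[:, None] - p[s_t])   # equation 4.1, per weight
p += dp

print(p.sum(axis=0))            # still 1 for every i: sum_s dp(s|i) = 0
```

The Hebbian term δ_{s,s_t} increases the weight for the observed channel, and the decay term −p(s_t|i) shrinks all weights of strongly active units, which keeps the columns on the probability simplex without any explicit renormalization.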
In previous approaches, simple cell responses were explained from ecological requirements (Barlow, 1960), like statistical independence of hidden node activations (Bell & Sejnowski, 1997). For natural image data, it was argued that this requirement is equivalent to sparseness constraints on the activities (Olshausen & Field, 1996). A different justification for sparseness may be found in the principle of minimal energy consumption (Levy & Baxter, 1996). In frameworks like ours, in which the input data are modeled by a linear superposition of basis vectors, however, imposing sparseness conditions on the coefficients right from the beginning appears rather ad hoc. In contrast, our work points toward a new explanation of sparseness as a consequence of convergence time constraints. In our algorithm 2.20, residual noise of the hidden variable estimations, caused by nonvanishing values of the update speed parameter ε, induces sparseness of the hidden node activations. The relation of speed and sparseness becomes clear when considering the update algorithm with the extreme value ε = 1. In this case, asymptotically only the hidden neuron obtaining the largest input will remain active. Here, sparseness becomes maximal, and we effectively have a winner-takes-all network. With ε < 1, sparseness is still enforced; however, the sparseness of the hidden representation is dominated by the need to explain the data. These dependencies may also be interpreted in a different fashion: the parameter ε introduces a timescale that effectively determines how strongly the observation of an input spike changes the internal representation. Small ε increases the memory for previous spikes; thus, the internal representation can be adjusted more accurately to an input pattern at the cost of a larger observation time. It is obvious that a more accurate representation in general needs more basic features to explain the input data.
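The link between ε and sparseness can be probed with the same assumed update dynamics as before (an update term proportional to h(i) p(s|i), kept normalized); counting the units that still carry noticeable probability mass after a fixed number of spikes illustrates the winner-takes-all tendency of large ε:

```python
import numpy as np

def run_sbs(eps, n_spikes=200, H=50, seed=3):
    """Hidden-state dynamics under the assumed normalized update;
    graded weights p(1|i) = (i + 1)/H as in the Figure 4 example."""
    rng = np.random.default_rng(seed)
    p1 = (np.arange(H) + 1.0) / H
    p = np.vstack([p1, 1.0 - p1])
    V = np.array([0.42, 0.58])
    h = np.full(H, 1.0 / H)
    for _ in range(n_spikes):
        s = rng.choice(2, p=V)
        drive = h * p[s]
        h = (h + eps * drive / drive.sum()) / (1.0 + eps)
    return h

n_active = lambda h: int((h > 0.01).sum())
print(n_active(run_sbs(1.0)), n_active(run_sbs(0.05)))   # large eps: fewer active units
```

With ε = 1 the state collapses onto a near winner-takes-all representation after a few spikes, while ε = 0.05 leaves the mass spread over many similar features for the same number of spikes, trading speed for accuracy as described above.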
In contrast, with a large ε, a sparse representation of a few features suffices to explain an input pattern that has a high potential variability due to the small number of observed spikes. Taking all these observations together, our framework might provide a novel basis for understanding neural computation. Coming from a rather technical background, where similar algorithms were introduced to achieve blind source separation and blind deconvolution (often referred to as the cocktail party problem), we have shown that our approach is quite universal and can be used to perform general computations efficiently. In particular, it can implement multimodal neural integration by taking spike sequences from different sensory modalities or other brain regions into account (Denève et al., 2001). Also, it is straightforward to extend our approach to include statistical expectations, in particular about tasks and attention. We believe
that future investigations along these lines will allow formulating specific predictions of our framework that can be tested in neurophysiological and psychophysical experiments.

Appendix A: Pattern Preprocessing

To avoid negative firing rates, the raw image or bit patterns B_μ with N components each were preprocessed to form the positive input patterns B⁺_μ finally applied to the network. In a first step, the b_{μ,s}, s = 1, …, N, were normalized by subtracting the individual mean value ⟨b_μ⟩ = (1/N) Σ_{s'=1}^{N} b_{μ,s'},

b̃_{μ,s} = b_{μ,s} − ⟨b_μ⟩.   (A.1)

Then the N components were duplicated and assigned pairwise to the even and uneven input node pairs, yielding S = 2N nonnegative components b⁺_{μ,s} according to the expressions

b⁺_{μ,2s−1} = b̃_{μ,s} for b̃_{μ,s} > 0, and 0 otherwise,   (A.2)

b⁺_{μ,2s} = −b̃_{μ,s} for b̃_{μ,s} ≤ 0, and 0 otherwise.   (A.3)
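The preprocessing of equations A.1 to A.3 is a mean subtraction followed by an on-off split. A compact sketch (0-based indexing, so the 1-based channels 2s − 1 become the even positions):

```python
import numpy as np

def preprocess(b):
    """Subtract the pattern mean (A.1), then route positive parts to
    the on-channels and sign-flipped negative parts to the off-channels
    (A.2, A.3), analogous to LGN on- and off-cells."""
    b = np.asarray(b, dtype=float)
    bt = b - b.mean()                   # (A.1)
    out = np.zeros(2 * b.size)
    out[0::2] = np.maximum(bt, 0.0)     # (A.2): 1-based channels 2s - 1
    out[1::2] = np.maximum(-bt, 0.0)    # (A.3): 1-based channels 2s
    return out

print(preprocess([1.0, -1.0, 0.0, 0.0]))   # [1. 0. 0. 1. 0. 0. 0. 0.]
```

Every output component is nonnegative, and each input value activates exactly one of the two channels of its pair, so the result can be used directly as firing rates.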
This preprocessing is motivated by the properties of early visual processing in the brain: splitting the mean-value-corrected input into negative and positive values resembles the analysis of visual stimuli by on- and off-cells in the lateral geniculate nucleus (LGN).

Appendix B: Training Procedures

For the training procedure (see Figure 2), the S input nodes are split into two sets of sizes Sp (indexed by s = 1, …, Sp) and Sc (indexed by s = Sp + 1, …, M = Sp + Sc). Correspondingly, there also exist two sets of conditional probabilities p(s|i). Each item in the training set comprises a nonnegative input pattern B^{tr+}_μ together with its correct classification c^{tr}_μ. The pattern B^{tr+}_μ is applied to the first set of Sp input nodes, while the correct classification activates the c^{tr}_μ-th node of the second set of input nodes. Pattern and classification inputs are in addition weighted by a factor λ. This factor regulates the ratio of the mean number of spikes caused by the two types of input; λ thus allows balancing the input and output arguments of a function to be learned by the network. Using these assignments, the final training scenes V^{tr}_μ are given by the expressions

v^{tr}_{μ,s} = λ b^{tr}_{μ,s} / Σ_{s'=1}^{Sp} b^{tr}_{μ,s'}   for s ∈ [1, Sp]   (B.1)

and

v^{tr}_{μ,s} = (1 − λ) δ_{s−Sp, c^{tr}_μ}   for s ∈ [Sp + 1, Sp + Sc].   (B.2)
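Composing a training scene per (B.1) and (B.2) is a two-step assignment; the resulting vector is a probability distribution from which training spikes can be drawn. A minimal sketch with invented numbers (0-based class index):

```python
import numpy as np

def training_scene(b, c, Sp, Sc, lam):
    """(B.1): normalized pattern, weighted by lambda, on the first Sp
    nodes; (B.2): class indicator, weighted by (1 - lambda), on the
    last Sc nodes. c is the 0-based class index here."""
    v = np.zeros(Sp + Sc)
    v[:Sp] = lam * np.asarray(b, dtype=float) / np.sum(b)
    v[Sp + c] = 1.0 - lam
    return v

v = training_scene([2.0, 0.0, 2.0, 0.0], c=1, Sp=4, Sc=2, lam=0.5)
print(v, v.sum())   # pattern part carries lam, class part 1 - lam; total 1
```

Because the pattern part sums to λ and the class indicator carries 1 − λ, a fraction λ of the training spikes stems from the pattern and 1 − λ from the classification input, which is exactly the balancing role of λ described above.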
During training, spikes are drawn randomly from V^{tr}_μ according to equation 2.2, while both h(i) and p(s|i) are updated with the SbS online or SbS batch algorithms.

Appendix C: Classification and Computation Procedures

For the classification run, input scenes V^{ts}_μ are composed solely of the preprocessed patterns B^{ts+}_μ via v^{ts}_{μ,s} = b^{ts+}_{μ,s} / Σ_{s'=1}^{Sp} b^{ts+}_{μ,s'} (see Figure 3). The first set of conditional probabilities p(s|i) from the learning procedure, with s = 1, …, Sp, is renormalized, yielding the pattern weights

p*(s|i) = p(s|i) / Σ_{s'=1}^{Sp} p(s'|i).   (C.1)

For each presented pattern or scene μ, now only the internal states h_μ(i), not the weights p(s|i), are updated using equation 2.20. The remaining Sc weight vectors are transposed and normalized, yielding the classification weights

p̂(i|c) = p(c + Sp|i) / Σ_{c'=1}^{Sc} p(c' + Sp|i)   for c = 1, …, Sc.   (C.2)

From h_μ(i) and p̂(i|c), the quantities q_μ(c) = Σ_{i=1}^{H} p̂(i|c) h_μ(i) are computed. These q_μ(c) denote the probabilities for each of the Mts test patterns V^{ts}_μ to belong to the class c. From the q_μ(c), a prediction ĉ^t_μ for the classification is computed and updated within each time step t,

ĉ^t_μ = argmax_c q^t_μ(c).   (C.3)

The mean classification error e^t over all Mts test patterns is finally computed as

e^t = 1 − (1/Mts) Σ_{μ=1}^{Mts} δ_{c^{ts}_μ, ĉ^t_μ}.   (C.4)
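The readout of equations C.2 and C.3 reduces to a matrix-vector product. A sketch with invented class weights (two hidden units, two classes):

```python
import numpy as np

def readout(h, p_class):
    """p_class[c, i] = p(c + Sp | i). Normalizing over classes per hidden
    unit gives p_hat(i|c) as in (C.2); q(c) = sum_i p_hat(i|c) h(i) and
    the argmax gives the prediction (C.3)."""
    p_hat = p_class / p_class.sum(axis=0, keepdims=True)   # (C.2)
    q = p_hat @ h                                          # class probabilities
    return int(np.argmax(q)), q                            # (C.3)

h = np.array([0.8, 0.2])                  # internal state after some spikes
p_class = np.array([[0.9, 0.1],
                    [0.1, 0.9]])          # unit 0 votes class 0, unit 1 class 1
c_hat, q = readout(h, p_class)
print(c_hat, q)
```

Since Σ_c p̂(i|c) = 1 for every hidden unit, the q(c) always sum to one, so they can be read directly as class probabilities.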
Appendix D: Definition of the Nearest-Neighbor Classifier

For a comparison of our learning and reconstruction algorithms to standard approaches, we decided to use a nearest-neighbor classifier. This classifier is easy to implement and yields relatively good performance at the price of using the whole training data set as codebook vectors. In our simulations, a test pattern V^{ts}_μ is classified by searching for the training pattern V^{tr}_μ with the smallest quadratic distance ‖V^{tr}_μ − V^{ts}_μ‖² to the test pattern. The classification is then done by assuming that both patterns V^{ts}_μ and V^{tr}_μ belong to the same class c^{tr}_μ.

Appendix E: Details and Parameters for the Computation of Boolean Functions

From all possible 2^32 Boolean functions of five input bits and one output bit, Fo = 100 Boolean functions were chosen randomly for the SbS network (with online learning). By splitting the input arguments into on-off channels as described in appendix A, we therefore have S = Sp + Sc = 12 input nodes. During SbS with online learning, the 32 patterns were presented sequentially with 4620 spikes per pattern and with λ = 5/6. This procedure was repeated Z = 20 times. Before each learning step and for each different pattern μ, the h's were reset to a constant value of h^0_μ(i) = 1/H. Before the first learning step, the p^z(s|i) were uniformly initialized with p^0(s|i) = 1/S. After learning, the h's were again reset to h^0_μ, and each input pattern was presented for T = 1000 spikes for testing the classification performance. Learning constants were chosen as ε = 1000/1001 and γ = 0.01, unless stated otherwise.

Appendix F: Details and Parameters for the Classification of Handwritten Digits

The database of handwritten digits from the U.S. Postal Service contains Mtr = 7291 training and Mts = 2007 test patterns, each one consisting of N = 16 × 16 gray scale values (pixels) ranging from −1 (black) to 1 (white).
According to the data preprocessing described before, the network then comprises Sp = 512 pattern input channels and Sc = 10 classification input channels for the training. Using the SbS algorithm with batch learning, one learning step included the parallel presentation of all training patterns with λ = 0.5, ε = 0.1, T = 4620, and Z = 20, similar to the procedure employed with the Boolean functions. During classification, each digit pattern was presented for T = 10,000 spikes. In order to test the robustness of the algorithms, the recognition was made more difficult in two different ways. The first challenge was
introduced by an occlusion of a variable number of rows or columns. Starting from the center columns or rows of the digit pattern, pixel values of whole rows and columns were set to an intermediate value of 0. For example, a vertical occlusion of six columns on a pattern of size 16 × 16 was realized by affecting the pixel values of columns 6 to 11 in all rows 1 to 16. A horizontal occlusion of four rows was realized by setting all pixel values of rows 7 to 10 of all columns 1 to 16 to a value of 0. The second challenge was introduced by superimposing random noise on the digit pattern. In detail, for each digit to be noisified, random pixel values b^{rnd}_{μ,s} were drawn from a uniform distribution between −1 and 1. By means of a parameter η ∈ [0, 1], the original pattern was combined with the noise pattern to a new input pattern b^{new}_{μ,s} according to the rule

b^{new}_{μ,s} = (1 − η) b_{μ,s} + η b^{rnd}_{μ,s}.   (F.1)
The parameter η regulates the amount of noise on the original pattern: η = 0 represents the noise-free original pattern, and η = 1 denotes the case when the original pattern has been fully replaced by the noise pattern.

Acknowledgments

Discussions with Matthias Bethge and Misha Tsodyks were very helpful for clarifying basic notions of our framework. This work was supported by the German Science Foundation (DFG, project B5 in SFB 517) and the Ministry of Research and Education (BMBF, DIP METACOMP).

References

Barlow, H. (1960). The coding of sensory messages. Cambridge: Cambridge University Press.
Bell, A., & Sejnowski, T. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37(23), 3327–3338.
Berry, M., Warland, D., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci., 94, 5411–5416.
Bertsekas, D. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Bethge, M., Rotermund, D., & Pawelzik, K. (2003a). Optimal neural rate coding leads to bimodal firing rate distributions. Network: Comput. Neural Syst., 14, 303–319.
Bethge, M., Rotermund, D., & Pawelzik, K. (2003b). A second order phase transition in neural rate coding: Binary encoding is optimal for rapid signal transmission. Phys. Rev. Lett., 90(8), 088104.
Carandini, M., & Heeger, D. (1994). Summation and division by neurons in primate visual cortex. Science, 264, 1333–1336.
Chance, F. S., Abbott, L. F., & Reyes, A. D. (2002). Gain modulation from background synaptic input. Neuron, 35, 773–782.
Delorme, A., & Thorpe, S. J. (2001). Face identification using one spike per neuron: Resistance to image degradations. Neural Networks, 14, 795–803.
Denève, S. (2005). Bayesian inference in spiking neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 353–360). Cambridge, MA: MIT Press.
Denève, S., Latham, P. E., & Pouget, A. (2001). Efficient computation and cue integration with noisy population codes. Nature Neuroscience, 4(8), 826–831.
Ernst, M. O., & Banks, M. S. (2002). Humans integrate visual and haptic information in a statistically optimal fashion. Nature, 415, 429–433.
Fourcaud, N., & Brunel, N. (2002). Dynamics of the firing probability of noisy integrate-and-fire neurons. Neural Computation, 14, 2057–2110.
Gerstner, W., & Kistler, W. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Heeger, D. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–198.
Herzog, M. H., & Fahle, M. (2002). Effects of grouping in contextual modulation. Nature, 415, 433–436.
Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5, 1457–1469.
Hoyer, P. O., & Hyvärinen, A. (2003). Interpreting neural response variability as Monte Carlo sampling of the posterior. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 277–284). Cambridge, MA: MIT Press.
Koechlin, E., Anton, J. L., & Burnod, Y. (1999). Bayesian inference in populations of cortical neurons: A model of motion integration and segmentation in area MT. Biol. Cyb., 80, 25–44.
Körding, K. P., & Wolpert, D. M. (2004). Bayesian integration in sensorimotor learning. Nature, 427, 244–247.
Lantéri, H., Roche, M., & Aime, C. (2002). Penalized maximum likelihood image restoration with positivity constraints: Multiplicative algorithms. Inverse Problems, 18, 1397–1419.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788–791.
Lee, D. D., & Seung, H. S. (2000). Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), Advances in neural information processing systems, 13 (pp. 556–562). Cambridge, MA: MIT Press.
Lee, T., & Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A, 20(7), 1434–1448.
Levy, W., & Baxter, R. (1996). Energy-efficient neural codes. Neur. Comp., 8, 531–543.
Lucy, L. (1974). An iterative technique for the rectification of observed distributions. Astron. J., 79, 745–754.
Mandon, S., & Kreiter, A. K. (2005). Rapid contour integration in macaque monkeys. Vision Research, 45, 291–300.
Mumford, D. (2002). Pattern theory: The mathematics of perception. In L. Tatsien (Ed.), International Congress of Mathematicians, Beijing, China (Vol. III, pp. 401–422). Singapore: World Scientific.
Olshausen, B., & Field, D. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Panzeri, S., Treves, A., Schultz, S., & Rolls, E. (1999). On decoding the responses of a population of neurons from short time epochs. Neural Computation, 11, 1553–1577.
Perrinet, L., Samuelides, M., & Thorpe, S. (2004). Coding static natural images using spiking event times: Do neurons cooperate? IEEE Trans. Neur. Networks, 15, 1164–1175.
Rao, R. P. (2004). Bayesian computation in recurrent neural circuits. Neural Computation, 16, 1–38.
Richardson, W. (1972). Bayesian-based iterative method of image restoration. J. Opt. Soc. Am., 62, 55–59.
Shadlen, M., & Newsome, W. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation and information coding. J. Neurosci., 18(10), 3870–3896.
Silberberg, G., Bethge, M., Markram, H., Pawelzik, K., & Tsodyks, M. (2004). Dynamics of population rate codes in ensembles of neocortical neurons. Journal of Neurophysiology, 91, 704–709.
Skottun, B., Bradley, A., Sclar, G., Ohzawa, I., & Freeman, R. (1987). The effects of contrast on visual orientation and spatial frequency discrimination: A comparison of single cells and behaviour. J. Neurophysiol., 57, 773–786.
Thorpe, S., Delorme, A., & van Rullen, R. (2001). Spike-based strategies for rapid processing. Neural Networks, 14, 715–725.
Thorpe, S. J., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
van Rullen, R., & Thorpe, S. J. (2001). Rate coding versus temporal order coding: What the retinal ganglion cell tells the visual cortex. Neural Computation, 13, 1255–1283.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Kybernetik, 12, 1–24.
Received February 11, 2005; accepted September 1, 2006.
LETTER
Communicated by Raj Rao
Exact Inferences in a Neural Implementation of a Hidden Markov Model

Jeffrey M. Beck
[email protected]

Alexandre Pouget
[email protected]

Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627, U.S.A.
From first principles, we derive a quadratic, nonlinear, first-order dynamical system capable of performing exact Bayes-Markov inferences for a wide class of biologically plausible stimulus-dependent patterns of activity while simultaneously providing an online estimate of model performance. This is accomplished by constructing a dynamical system that has solutions proportional to the probability distribution over the stimulus space, but with a constant of proportionality adjusted to provide a local estimate of the probability of the recent observations of stimulus-dependent activity given the model parameters. Next, we transform this exact equation to generate nonlinear equations for the exact evolution of the log likelihood and of log-likelihood ratios, and show that when the input has low amplitude, linear rate models for both the likelihood and the log-likelihood functions follow naturally from these equations. We use these four explicit representations of the probability distribution to argue that, in contrast to the arguments of previous work, the dynamical system for the exact evolution of the likelihood (as opposed to the log likelihood or log-likelihood ratios) not only can be mapped onto a biologically plausible network but is also more consistent with physiological observations.

1 Introduction

It is becoming increasingly clear that some of the computations performed by the nervous system are akin to Bayesian inferences, whether in the context of sensory perception, reasoning, or motor control (Carpenter & Williams, 1995; Gold & Shadlen, 2001; Knill & Pouget, 2004). However, the neural mechanisms of these inferences remain unclear. Here, we explore one particular neural implementation of Bayesian inferences for hidden Markov models (HMMs). HMMs are particularly appealing because they can be used to model a wide variety of problems involving dynamic stimuli, such as cue integration, decision making, and motor control.
Neural Computation 19, 1344–1361 (2007)
© 2007 Massachusetts Institute of Technology

The goal is to understand how a neural network can infer Pr(θ(t) | v(t), . . . , v(0)), where
θ(t) is a time-varying stimulus and {v(t), . . . , v(0)} is a set of observations or measurements related to θ(t); for example, θ(t) could be the position of a moving object inferred from a sequence of images {v(t), . . . , v(0)}. Consistent with the work of others, we will limit ourselves to the consideration of recurrent networks for which the firing rate u_i(t) of neuron i is related to the probability that the stimulus takes on some value θ_i through some monotonically increasing function,

    u_i(t) = f( Pr(θ(t) = θ_i | v(s) ∀ s < t) ),  (1.1)
where f(x) is typically equal to the identity, a convolution (Anderson, 1994), or a log transform (Rao, 2004; Yu & Dayan, 2005; Denève, 2005). Exact solutions to this problem have recently been derived for a single neuron with a binary variable θ (Denève, 2005) and for multiple neurons with a static variable θ (Koechlin, Anton, & Burnod, 1999). However, for nontrivial, dynamic θ(t) and a population of neurons, the only available solution relies on the assumption that the log of a sum can be approximated by the sum of logs (Rao, 2004). In particular, this study showed that when f(x) = log(x), a linear dynamical system of the form

    du/dt = −u + Wu + Mv,  (1.2)
where W is the recurrent connectivity matrix and M is the feedforward connectivity matrix, approximates Bayesian inferences for such an HMM once the weight matrix W has been adjusted by a numerical fit. Here we present a simple, rigorous, and systematic derivation of a quadratic, nonlinear, first-order dynamical system that has parameters that may be linked directly to Bayes' rule and has solutions that give the exact probability that a Markov stimulus currently takes on a particular value given the entire history of the observed pattern of activity. Furthermore, it will be shown that such a system follows solely from the assumptions that the stimulus is Markov and that the stimulus-dependent pattern of activity consists of a collection of independent Poisson processes with stimulus-dependent (and nonzero) rates. A companion equation for the evolution of the log likelihood of the observed stimulus-dependent activity given the model parameters will also be derived. We argue that this equation provides the proper objective function for a Bayes-linked learning rule and indicates that the goal of such a network is, in fact, the mean prediction of the input patterns. Moreover, we will demonstrate that the rate of increase of the probability of the data given model parameters can be incorporated into the amplitude of the output probability distribution in a manner that allows for an online estimate of model performance. For the purposes of comparison with previous work, we then generate the associated nonlinear evolution equations for the log-likelihood and
log-likelihood ratios. The assumption of low-amplitude input will then be applied to the log-likelihood equations to demonstrate that the recurrent weights of the linearized model may be calculated directly, as opposed to by a numerical fit as in Rao (2004). We then show that the assumption of low-amplitude input may also be applied to the nonlinear equation for the evolution of the likelihood function to obtain an equivalent linear dynamical system with an associated Hebbian learning rule. Finally, we discuss the biological plausibility of the various dynamical systems for Bayesian inference that we have generated and argue the case for a neural implementation that directly represents the amplitude-adjusted probability density function, as opposed to a representation in the log-likelihood or log-likelihood ratio domain.

2 Model

To demonstrate the distinction between our approach and that of Rao (2004) and Denève (2005), we begin by replicating their derivation of a discrete dynamical system that performs exact Bayesian inferences. We begin with a formulation of Bayes' rule that calculates the probability that at time step n, a particular stimulus (θ_i) is being presented given the total history of all previous stimulus-dependent activity (v^n, v^{n−1}, . . . , v^0). Thus, if we denote θ^n = θ_i as θ_i^n, we may write the conditional probability that the stimulus takes on some value given the history of the activity pattern as

    Pr(θ_i^n | v^n, v^{n−1}, . . . , v^0)
      = [ Pr(v^n | θ_i^n, v^{n−1}, . . . , v^0) / Pr(v^n | v^{n−1}, . . . , v^0) ]
        × Σ_j Pr(θ_i^n | θ_j^{n−1}, v^{n−1}, . . . , v^0) Pr(θ_j^{n−1} | v^{n−1}, . . . , v^0).  (2.1)
The assumptions of a first-order Markov process for the stimulus and conditional independence of v^n given θ_i^n (see Figure 1) imply that a discrete dynamical system for the quantity

    u_i^n = Pr(θ_i^n | v^n, v^{n−1}, . . . , v^0)  (2.2)

is given by

    u_i^n = [ Pr(v^n | θ_i^n) / Pr(v^n | v^{n−1}, . . . , v^0) ] Σ_j Pr(θ_i^n | θ_j^{n−1}) u_j^{n−1}.  (2.3)
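Equation 2.3 is the familiar normalized forward recursion for an HMM, since the denominator Pr(v^n | v^{n−1}, . . . , v^0) is simply whatever constant makes u^n sum to one. As a minimal sketch (in Python, with a made-up three-state chain and arbitrary emission likelihoods standing in for Pr(v^n | θ_i^n)), one update step of equation 2.3 looks like:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 3                                  # number of stimulus values theta_i

# Transition matrix T[i, j] = Pr(theta_i^n | theta_j^{n-1}); columns sum to 1.
T = rng.random((K, K))
T /= T.sum(axis=0)

u = np.full(K, 1.0 / K)                # u^{n-1}: posterior after the last step

# Emission likelihoods Pr(v^n | theta_i^n) for the current observation
# (arbitrary positive numbers standing in for a real observation model).
lik = np.array([0.2, 1.5, 0.6])

# Eq. 2.3: multiply the one-step prediction by the likelihood; the
# denominator Pr(v^n | v^{n-1}, ..., v^0) is just the normalizer.
u_new = lik * (T @ u)
u_new /= u_new.sum()

print(u_new)                           # updated posterior; sums to one
```

All names here (T, lik, the seed) are illustrative; the recursion itself is equation 2.3 verbatim.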
At this point, Rao (2004) takes the natural log of equation 2.3 so that dependence on the noisy input v^n comes in as an additive driving force
[Figure 1: three hidden states θ^{n−1} → θ^n → θ^{n+1} in a chain, each with an arrow to its observation v^{n−1}, v^n, v^{n+1}.]
Figure 1: Graphical representation of a hidden Markov model. Arrows indicate conditional independence.
rather than a multiplicative one. Unfortunately, this causes the transition probabilities to enter the equation through a term that is the natural log of a sum. This nonlinearity is eliminated through a numerical parameter fit. It is then assumed that both the additive driving term and the resulting linear approximation to the transition probability term are of order Δt, so that a first-order linear differential equation may be obtained. In contrast, here we start with the goal of obtaining a temporally continuous first-order dynamical system and explicitly delineate the assumptions necessary to come to this conclusion. In particular, we begin by noting that for a discrete system of the form

    u^n = h(u^{n−1}, v^n, Δt),  (2.4)

a first-order dynamical system can be obtained when

    h(u^{n−1}, v^n, Δt) = u^{n−1} + Δt h_1(u^{n−1}, v^n) + O(Δt²).  (2.5)
It can be shown that to put equation 2.3 into this form, we need only make the reasonable assumptions that (1) the probability of transitioning from state j to state i ≠ j is proportional to the discrete time interval Δt and that (2) the ability of a particular pattern of activity to demonstrate a preference for a given stimulus is also proportional to Δt, the time interval over which the pattern of activity is observed. Written in terms of the relevant probability distributions, assumption 1 is given by

    Pr(θ_i^n | θ_j^{n−1}) = δ_ij + Δt w_ij,  (2.6)

where δ_ij is the Kronecker delta, Σ_i w_ij = 0, and w_ij > 0 for all i ≠ j. Similarly, assumption 2 is given by

    Pr(θ_i^n | v^n) / Pr(θ_i^n) = 1 + Δt d_i^n + O(Δt²),  (2.7)
where the d_i^n = d_i(v^n) are subject to the condition d^n · b = 0 for all n, where b_i = Pr(θ_i^n) gives the prior or bias in the stimulus. Note that d_i^n can be thought of as the growth rate of the conditional preference of the probability of θ_i^n given v^n. Clearly, assumption 1 is just a restatement of the assumption that the stimulus is governed by a continuous-time Markov process and thus has Poisson transition times. Assumption 2 is slightly more restrictive. It is easy to show that any set of discrete-time HMMs that converges to a continuous HMM as Δt goes to zero has the property that the left-hand side of equation 2.7 has a Taylor expansion in Δt. However, it is not generally the case that the first term of this expansion is equal to one, as is required to put equation 2.3 into the form of equation 2.5. Fortunately, a wide class of biologically plausible stimulus-dependent patterns of activity has this property. Specifically, a layer of cortex modeled by a set of independent Poisson neurons with stimulus-dependent rates has this property, as does a layer of cortex modeled by a covariant gaussian distribution associated with a dense set of tuning curves. (See appendix A for an example of the Poisson case.) Regardless, equations 2.6 and 2.7 can be used with equation 2.3 after another application of Bayes' rule,
    Pr(v^n | θ_i^n) = Pr(θ_i^n | v^n) Pr(v^n) / Pr(θ_i^n),  (2.8)

to yield

    Pr(v^n | v^{n−1} . . . v^0) u^n = Pr(v^n) (I + Δt D^n)(I + Δt W) u^{n−1} + O(Δt²),  (2.9)

where D^n is a diagonal matrix with diagonal entries given by the d_i^n defined by equation 2.7, W is the input-independent, recurrent weight matrix with elements given by transition probabilities, and I is the identity matrix. To obtain an expression for Pr(v^n | v^{n−1} . . . v^0), we simply take the dot product of equation 2.9 with 1, the vector composed entirely of ones, while recalling that 1 · u^n = 1 · u^{n−1} = 1 since both u^n and u^{n−1} represent probability distributions. This yields

    Pr(v^n | v^{n−1} . . . v^0) = Pr(v^n) [1 + Δt 1 · (D^n + W) u^{n−1}] + O(Δt²).  (2.10)
Then, taking the ratio of equations 2.9 and 2.10, we obtain

    u^n = [ u^{n−1} + Δt (D^n + W) u^{n−1} + O(Δt²) ] / [ 1 + Δt 1 · (D^n + W) u^{n−1} + O(Δt²) ],  (2.11)
where the denominator corresponds to a form of divisive normalization (Heeger, 1992; Nelson, 1994), which, in this particular context, is required to ensure that u(t) is a probability distribution. Expanding this equation to first order in Δt, we obtain

    u^n = u^{n−1} + Δt { (D^n + W) u^{n−1} − [1 · (D^n + W) u^{n−1}] u^{n−1} } + O(Δt²).  (2.12)
Taking the limit as Δt goes to zero, we find that

    du/dt = D(t) u + W u − (d(t) · u) u,  (2.13)
where we have used the definition of the matrix D(t) and the fact that 1ᵀW = 0ᵀ to simplify the uniform inhibitory term. Recalling that D(t) is a function of the time-dependent stimulus v(t), we see that we have obtained a first-order quadratic nonlinear recurrent network for the exact computation of Bayesian inferences. Additionally, we note that the quadratic inhibitory term in equation 2.13 is a result of the divisive normalization in equation 2.11 and has the same effect in both equations. For example, when D(t) is constant, the solution to equation 2.13 is

    u(t) = exp[(D + W)t] u_0 / { (1 − 1 · u_0) + 1 · exp[(D + W)t] u_0 }.  (2.14)
This expression is similar to the divisive normalization reported in cortical circuits (Heeger, 1992; Nelson, 1994).

3 Estimating Model Performance

Solutions to equation 2.13 have the property that 1 · u(t) = 1 for all time. As noted in previous work (Sahani & Dayan, 2003), this property, while useful in the context of the above derivation, is not a necessary feature of the activity of a recurrent network that directly represents a probability distribution, and the amplitude may instead be used to represent some other variable. In this context, the natural choice would be to use the amplitude to represent an estimate of model performance. The performance of the model can be assessed at any given time by computing the likelihood of the data under the current model parameters M. A procedure similar to that used above may be used to generate an
expression for the evolution of this quantity, leading to

    L^n = Pr(v^n, v^{n−1} . . . v^0 | M) = Pr(v^n | v^{n−1} . . . v^0, M) L^{n−1}.  (3.1)
Using equation 2.9 to evaluate the conditional dependence of the current activity pattern on the history of the activity pattern, we see that

    L^n = L^{n−1} Pr(v^n | M)(1 + Δt d^n · u^{n−1}).  (3.2)
Taking the natural log, expanding to first order in Δt, and eliminating terms that are independent of model parameters yields

    L*(t) − L*(0) = ∫_0^t d(s) · u(s) ds,  (3.3)
where L*(t) is a function optimized by the same model parameters that optimize L(t) whenever the model parameters accurately represent the average stimulus-independent and time-independent behavior of the input patterns, that is, when terms of the form Pr(v^n | M) may be neglected. (See appendixes A and B for details.) Thus, the expected value of d · u provides the proper objective function for evaluating the performance of the model, while a local average of d · u provides a local estimate of model performance. Our goal is to find a differential equation over a new variable ũ, which has a solution proportional to the solution of equation 2.13, ũ(t) = α(t) u(t), and for which α(t) can be related to a local average of d · u. This can be achieved by suitably adjusting the term responsible for divisive normalization. In particular, a dynamical system of the form

    dũ/dt = (D(t) + W) ũ − f(α) ũ  (3.4)
has a solution given by ũ(t) = α(t) u(t), where α(t) obeys

    d log(α)/dt = d · u − f(α).  (3.5)
Thus, if f(α) is a monotonically increasing function, this quantity is attracted to a local average of d · u, as desired. In particular, if f(α) = β log(α), then β log(α) is a local average of d · u. Moreover, since the effective time constant of equation 3.5 is given by the inverse of the parameter β, we may adjust the effective time window over which the "local" average of model performance is taken simply by manipulating this parameter. Of course, our choice of f(α), the coefficient of this divisive normalization term in equation 3.4, is fairly arbitrary. Indeed, if we wish to restrict
ourselves to dynamical systems with quadratic nonlinearities, then we should choose f(α) to be linear and obtain

    dũ/dt = (D(t) + W + βI) ũ − γ (1 · ũ) ũ,  (3.6)
which has solutions given by ũ = α(t) u but with α(t) obeying the equation

    dα/dt = α (d · u + β − γα),  (3.7)
where β and γ may be used to adjust the effective time constant of integration for the evolution of the amplitude.

4 Special Case: Log Probability

The equations we have derived so far are general in the sense that they make no assumption about the mapping between the activity of the neurons and the encoded distributions (i.e., the function f(x) in equation 1.1). We now consider special cases that have been investigated in other studies and show that our exact equation may be reduced to yield the results of previous work when identical assumptions are made. For instance, Denève (2005) has recently obtained a set of equations for exact Bayesian inferences for binary Markov random variables using neurons encoding the log-likelihood ratio: f(x) = log(x) − log(1 − x). To determine the multidimensional analog, we first assume that f(x) = log(x) and transform equation 3.6 to obtain

    dφ̃_i/dt = d_i(t) + exp(−φ̃_i) Σ_j w_ij exp(φ̃_j) − Σ_j exp(φ̃_j),  (4.1)
where φ̃_i = log(ũ_i) = log(α) + log(u_i). This equation can be further simplified by considering the complete set of likelihood ratios defined as φ_ij = log(u_i) − log(u_j). This yields the multidimensional equivalent of the binary case derived by Denève (2005):

    dφ_ij/dt = d_i − d_j + Σ_k [ w_ik exp(φ_ki) − w_jk exp(φ_kj) ].  (4.2)
A simple linear transformation allows an equation of this form to be mapped onto a biologically plausible network that functions by linearly summing inputs under the assumption that the action of a dendrite is to exponentiate the rate of the presynaptic neuron.
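For a two-state stimulus, equation 4.2 reduces to a single scalar log-odds equation of the form derived by Denève (2005). The sketch below (Python, with hypothetical transition rates w12 and w21 and a piecewise-constant toy evidence signal; none of these values come from the text) integrates that scalar equation alongside equation 2.13 and checks that φ_12 tracks log(u_1/u_2):

```python
import numpy as np

dt, steps = 1e-3, 3000
w12, w21 = 0.5, 0.8                     # hypothetical transition rates

# Eq. 2.13 for K = 2; columns of W sum to zero, as required.
W = np.array([[-w21, w12],
              [w21, -w12]])
u = np.array([0.5, 0.5])                # flat prior

phi = 0.0                               # log-odds phi_12 = log(u1/u2), eq. 4.2
for n in range(steps):
    s = 1.0 if n < steps // 2 else -1.0  # toy evidence flips halfway through
    d1, d2 = s, -s
    # Eq. 4.2 for K = 2: the k-sum collapses to two exponential terms.
    phi += dt * (d1 - d2
                 + w12 * (1.0 + np.exp(-phi))
                 - w21 * (1.0 + np.exp(phi)))
    d = np.array([d1, d2])
    u += dt * (d * u + W @ u - (d @ u) * u)   # eq. 2.13, forward Euler

print(phi, np.log(u[0] / u[1]))         # the two log-odds estimates agree
```

The small residual difference between the two trajectories is the Euler discretization error of two equivalent ODEs.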
5 Linearized Equations

A related study (Rao, 2004) considered a similar log probability encoding (f(x) = log(x)). However, since many models of neural activity assume that a linear dynamical system can be used to model neural activity (Dayan & Abbott, 2001), Rao generated a linear differential equation for Bayesian inference by numerically approximating the log of a sum as a sum of logs. Unfortunately, as equation 4.1 indicates, Bayes-Markov inference cannot always be reduced to a linear dynamical system under a log-encoding scheme, and while this approach worked well enough for several cases of interest, that study did not make clear the conditions under which the approximation is likely to hold. Here, we address this issue by considering the assumptions necessary to linearize equations 2.13, 4.1, or 4.2. As shown in appendix C, an equivalent linearization follows naturally from the assumption that the input has a low amplitude, that is, |d(t)| ≪ 1. For equation 4.1, this implies that |φ̃_i − φ̃_j − (log(b_i) − log(b_j))| ≪ 1, and, similar to Rao (2004), this leads to a linear dynamical system of the form

    dφ̃_i^1/dt = d_i(t) + Σ_j w_ij* (φ̃_j^1 − log(b_j)) − γ Σ_j b_j (φ̃_j^1 − log(b_j)),  (5.1)
but with recurrent weights given by

    w_ij* = w_ij (b_j / b_i)  (5.2)
rather than by a numerical fit. Note also that the parameter γ > 0 does not affect the output probability distribution but is necessary to ensure the linear stability of solutions. Alternatively, we note that the same assumption necessary to linearize the equation in the log domain can be applied to equation 2.13, leading to a compellingly simple equation for an approximate probability distribution,

    du_i^1/dt = b_i d_i(t) + Σ_j w_ij u_j^1.  (5.3)
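Equation 5.3 can be checked numerically against equation 2.13. In the sketch below (Python; the rate matrix W is synthetic, the bias b is taken to be the stationary distribution of W, consistent with its role as the prior, and the evidence vector is random but low amplitude and constructed to satisfy d · b = 0), the exact and linearized trajectories stay close:

```python
import numpy as np

rng = np.random.default_rng(2)
K, dt, steps = 3, 1e-3, 4000

# Transition-rate matrix W (columns sum to zero) and its stationary
# distribution, which plays the role of the bias vector b.
W = rng.random((K, K))
np.fill_diagonal(W, 0.0)
W -= np.diag(W.sum(axis=0))
vals, vecs = np.linalg.eig(W)
b = np.real(vecs[:, np.argmin(np.abs(vals))])
b /= b.sum()                           # normalized, all-positive stationary vector

d = rng.standard_normal(K)
d -= (d @ b) / (b @ b) * b             # enforce the condition d . b = 0
d *= 0.05 / np.abs(d).max()            # low amplitude: |d| << 1

u_exact = b.copy()                     # eq. 2.13, started at the prior
u_lin = b.copy()                       # eq. 5.3, same start

for _ in range(steps):
    u_exact += dt * (d * u_exact + W @ u_exact - (d @ u_exact) * u_exact)
    u_lin += dt * (b * d + W @ u_lin)

print(np.abs(u_exact - u_lin).max())   # small discrepancy in this regime
```

Increasing the amplitude of d pushes the system out of the regime where the linearization is valid, and the discrepancy grows accordingly.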
This implies that the approximation used in Rao (2004) is valid in the same regime as the simple linear model in the probability domain given by equation 5.3. Furthermore, when the driving function d_i(t) is obtained by linearly summing input rates, this equation takes the form of a typical rate model
for neural activity (Dayan & Abbott, 2001),

    dũ^1/dt = BMv(t) + Wũ^1 + γ b (1 − 1 · ũ^1),  (5.4)
where B is the diagonal matrix given by the bias vector, so that the matrix BM has the property that 1ᵀBM = 0ᵀ. Once again, γ > 0 is an adjustable parameter that does not affect the output probability distribution but is necessary to ensure that the solution is asymptotically stable and, in the absence of inputs, equal to the bias vector. Moreover, equation 3.3 implies that when the stimulus is unbiased, maximum likelihood parameter estimation for this equation has an objective function given by

    d · u = u^1 · Mv,  (5.5)
which we recognize as the quantity maximized by a constrained Hebbian learning rule associated with such a simple linear model of neural activity (Dayan & Abbott, 2001). Once again, note that as in the log-likelihood ratio case, linearizing eliminates the possibility of using the amplitude to encode model performance.

6 Discussion

We have presented a systematic technique for the generation of a system of equations that track the evolution of a dynamical neural network for the case of hidden (and hierarchical hidden) Markov models. This technique was shown to benefit from being exact, requiring no parameter fit, and retaining a simple interpretation of recurrent weights: excitatory recurrent weights are transition probabilities. Moreover, we have demonstrated that since the amplitude of a population that directly encodes a probability distribution is arbitrary, we can use it to incorporate an online estimate of model performance. For the purposes of comparison with previous work, we then generated equations for the evolution of the log likelihood and log-likelihood ratios, as well as linear approximations to the dynamical systems that calculate the likelihood and log-likelihood functions.

We now consider the biological plausibility of Bayesian inferences in the likelihood and log-likelihood domains. One advantage of the likelihood implementation we have proposed is that exact Bayesian inference for an HMM requires only divisive normalization and dendritic multiplication, both of which have been observed in neurons (Heeger, 1992; Pena & Konishi, 2001; Gabbiani, Krapp, Koch, & Laurent, 2002). In contrast, much previous work has argued for the implementation of Bayesian inferences in the log probability domain, typically by arguing that such implementations do not require that neurons perform any complicated actions such as
multiplication. Unfortunately, this is not the case. While it is true that the log transformation eliminates the multiplicative terms associated with the feedforward terms, equation 4.1 clearly demonstrates that for tasks more complicated than binary decision making, this transformation can result in a dynamical system that has an even more complicated nonlinear interaction associated with the recurrent terms. Rao (2004) resolved this issue through the generation of a linear dynamical system that approximates the full nonlinear system. However, the derivation of equation 5.1 demonstrates that the assumption necessary to make this type of approximation valid in the log-likelihood domain could just as easily have been applied in the likelihood domain to obtain an equally plausible linear dynamical system.

Additionally, an implementation of Bayesian inferences for an HMM capable of incorporating prior knowledge on short timescales would require a network capable of dynamically adjusting the recurrent connection strengths. In both the likelihood and log-likelihood domains, the recurrent connection strengths are multiplied by the synaptic input, so that if priors are to be included, then even the linear equations may ultimately require neurons capable of performing a multiplicative operation. Thus, it seems that there is both little distinction between, and little utility of, linearization in the likelihood and log-likelihood domains. This may be why the log of the sum approximation has been dropped in more recent work (Rao, 2005), which identifies the attracting state of the membrane potential of a neuron with the log of the equation for Bayesian inference (see equation 2.3). The firing rate of such a neuron is then obtained by exponentiating this quantity so that the firing rate is approximately proportional to the likelihood function.
The log of sums is implemented by a dendritic filtering function that applies a log transform to a linear combination of the input rates. In the work presented here, we are agnostic as to the interpretation and evolution of the membrane potential and simply remark that, if neuronal dynamics can implement the model described in Rao (2005) for inputs generated by an HMM (as opposed to the static stimulus that is analyzed in that work), then the resulting rate dynamics should approximate the exact evolution equation for the likelihood function, equation 2.13, derived above. Moreover, we note that the utility of this work lies in the demonstration that the consideration of a dynamic as opposed to a static stimulus actually simplifies the network dynamics required to implement Bayesian inference.

Finally, we argue that simply representing the likelihood function is inefficient and suggest that the ability of equation 3.6 to represent model performance through the mean activity of the network can be used to explain one of the experimental discrepancies of the Bayesian information integration model used by Shadlen and Newsome (2001) to model the behavior of LIP neurons. In that work, a weakly tuned binary stimulus (left or right motion) was shown, and neurons in LIP were observed to initially exhibit a linear increase or decrease in activity depending on their direction selectivity.
Figure 2: Predictions for the four explicit representations of the probability distribution for a binary decision task. Solid lines represent right-selective neurons, while dashed lines represent left-selective neurons. In the likelihood and log-likelihood ratio cases, left-selective neurons decrease in activity at the same rate as right-selective neurons increase. In the log-likelihood case, the decrease is larger than the increase. Only when the amplitude is used to encode model performance does the increase in activity of the right-selective neurons exceed the decrease in the left-selective neurons, as is observed in Shadlen and Newsome (2001).
However, the linear decrease was always observed to be of lesser absolute slope than the linear increase, indicating that mean activity increases. As noted in that work, this is incompatible with the interpretation of the activity of left and right selective neurons as representative of log-likelihood ratios. In particular, if these LIP neurons are tracking log-likelihood ratios or potential functions or any linearized quantity that is a function of the probability, then the mean activity should not change (see Figure 2c). This is also the case when these LIP neurons are assumed to represent the likelihood function directly (see Figure 2a). Similarly, when the mean activity represents the log likelihood, the concavity of the log function implies that the mean activity should actually decrease (see Figure 2b). In contrast, if the amplitude of the population (or in this case the average activity of the
pair of neurons) also encodes the likelihood of the recent data given the model parameters, then we would expect an increase in mean activity to accompany the presentation of a familiar stimulus (see Figure 2d). To our knowledge, this is the only model that explicitly represents the probability distribution and accounts for this feature of the relevant experimental data in LIP. Thus, we conclude that the implementation of exact Bayes-Markov inferences for biologically plausible stimulus-dependent patterns of activity in the likelihood domain may be amplitude modulated to provide an online representation of model performance in a manner consistent with both biological constraints and physiological measurements.

Appendix A: Conditional Preference

The only departure from a standard hidden Markov model that was required to yield an exact implementation of Bayesian inference through a first-order differential equation was the requirement that the conditional preference have a Taylor expansion in Δt, given by

    Pr(v^n | θ_i^n) / Pr(v^n) = Pr(θ_i^n | v^n) / Pr(θ_i^n) = 1 + Δt d_i^n + O(Δt²).  (A.1)
By considering a series of discrete-time hidden Markov models that converge to a continuous-time HMM, it is possible to show that a Taylor expansion in Δt necessarily exists, but that it is not generally the case that the first term of the expansion is one. This additional restriction holds only when a single snapshot of the stimulus-dependent pattern of activity is noninformative, that is, when Pr(θ_i^n | v, Δt = 0) = Pr(θ_i^n). Fortunately, when the stimulus-dependent pattern of activity is composed of independent asynchronous events, such an event occurs with zero probability in a time window of zero width. This implies that zero-width time windows are noninformative. As an example, consider a stimulus-dependent pattern of activity that is a collection of independent Poisson processes with stimulus-dependent (and nonzero) rates,
    Pr(v^n | θ, Δt) = Π_i (Δt f_i(θ))^{Δt v_i^n} exp(−Δt f_i(θ)) / (Δt v_i^n)!,  (A.2)
where v_i^n is now thought of as a discrete estimate of the rate of neuron i obtained by summing Poisson spikes in a time window of length Δt and f_i(θ) is the tuning curve of that neuron. For such a distribution,
equations 2.7 and A.1 take the form

$$\frac{\Pr(v^n \mid \theta, \Delta t)}{\Pr(v^n \mid \Delta t)} = \prod_i \frac{(\Delta t\, f_i(\theta))^{\Delta t\, v_i^n} \exp(-\Delta t\, f_i(\theta))}{(\Delta t\, \bar f_i)^{\Delta t\, v_i^n} \exp(-\Delta t\, \bar f_i)}.$$

(A.3)
The log of this equation yields

$$\log \frac{\Pr(v^n \mid \theta, \Delta t)}{\Pr(v^n \mid \Delta t)} = \Delta t \sum_i \left[ v_i^n \log \frac{f_i(\theta)}{\bar f_i} - f_i(\theta) + \bar f_i \right],$$
(A.4)
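Because the factorial terms cancel between numerator and denominator, equation A.4 is in fact an exact identity whenever $\Delta t\, v_i^n$ is an integer spike count. A minimal Python check (the function names and numerical values are ours, purely illustrative):

```python
import math

def log_ratio_exact(v, f_theta, f_bar, dt):
    """log[Pr(v|theta,dt) / Pr(v|dt)] for independent Poisson counts k_i = dt*v_i;
    the factorial terms cancel between numerator and denominator."""
    total = 0.0
    for vi, fi, fbi in zip(v, f_theta, f_bar):
        k = dt * vi                       # spike count in the window (integer here)
        total += k * math.log(dt * fi) - dt * fi
        total -= k * math.log(dt * fbi) - dt * fbi
    return total

def log_ratio_expansion(v, f_theta, f_bar, dt):
    """Right-hand side of equation A.4: dt * sum_i [v_i log(f_i/fbar_i) - f_i + fbar_i]."""
    return dt * sum(vi * math.log(fi / fbi) - fi + fbi
                    for vi, fi, fbi in zip(v, f_theta, f_bar))

dt = 0.01
v = [200.0, 100.0, 300.0]                 # rate estimates, chosen so dt*v is integral
f_theta = [180.0, 120.0, 310.0]           # tuning-curve rates f_i(theta), all nonzero
f_bar = [150.0, 150.0, 150.0]             # stimulus-averaged rates
assert abs(log_ratio_exact(v, f_theta, f_bar, dt)
           - log_ratio_expansion(v, f_theta, f_bar, dt)) < 1e-9
```

The two sides agree to machine precision, confirming that the expansion stops at first order here.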
which is proportional to $\Delta t$ since $v_i^n$ is an estimate of rate (as opposed to spike count) and may be considered an order one quantity. Exponentiation of this equation then leads to the desired expansion in $\Delta t$, provided that $f_i(\theta)$ is nonzero for all values of the stimulus.

Appendix B: Estimating Model Performance

To generate an expression for the evolution of the probability of the input pattern of activity given the model parameters $M$, we recall that

$$L^n = \Pr(v^n, v^{n-1}, \ldots, v^0 \mid M) = \Pr(v^n \mid v^{n-1}, \ldots, v^0, M)\, L^{n-1}.$$
(B.1)
Using equation 2.9 to evaluate the conditional dependence of the current activity pattern on the history of the activity pattern, we see that

$$L^n = L^{n-1} \Pr(v^n \mid M)\,(1 + \Delta t\, \vec d^{\,n} \cdot \vec u^{\,n-1}),$$
(B.2)
where we have used the fact that $W^T \mathbf{1} = 0$. Taking the natural log and expanding to order $\Delta t$ yields

$$\log(L^n) = \log(L^{n-1}) + \log(\Pr(v^n \mid M)) + \Delta t\, \vec d^{\,n} \cdot \vec u^{\,n-1}$$
(B.3)
or, equivalently,

$$\log(L^N) = \log(L^0) + \sum_{j=1}^{N} \log(\Pr(v^j \mid M)) + \sum_{j=1}^{N} \Delta t\, \vec d^{\,j} \cdot \vec u^{\,j-1}.$$

(B.4)
Written in this way, we note that the first sum in equation B.4 represents the probability of the stimulus-dependent activity under the assumption that each interval is independent. However, for a hidden Markov model, each stimulus-dependent pattern of activity is independent only when conditioned on the stimulus; therefore, this term represents the log likelihood
of the stimulus-dependent pattern of activity under the assumption that the pattern is independent of the time course of the stimulus. This implies that if the model parameters accurately characterize the average behavior of the stimulus-dependent pattern of activity, but not necessarily the actual stimulus dependence, then this term may be neglected, giving rise to the expression in equation 2.17. Once again a consideration of the independent Poisson case is helpful. In this case, equation A.3 implies that the first sum in equation B.4 has terms of the form

$$\log(\Pr(v^n \mid M, \Delta t)) = \Delta t \sum_i \left( v_i^n \log \bar f_i - \bar f_i \right) + \sum_i \left[ \Delta t\, v_i^n \log(\Delta t) - \sum_{j=1}^{\Delta t\, v_i^n} \log(j) \right].$$

(B.5)
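The split in equation B.5 into a model-dependent first line and a model-independent second line is again an exact identity for Poisson counts. A small Python sketch, with illustrative rates of our own choosing:

```python
import math

def log_pmf_poisson(v, f_bar, dt):
    """Exact log Poisson pmf of counts k_i = dt*v_i with means dt*fbar_i."""
    out = 0.0
    for vi, fbi in zip(v, f_bar):
        k = round(dt * vi)
        out += k * math.log(dt * fbi) - dt * fbi - math.lgamma(k + 1)
    return out

def log_pmf_split(v, f_bar, dt):
    """Equation B.5: a model-dependent first line plus a model-independent second."""
    line1 = dt * sum(vi * math.log(fbi) - fbi for vi, fbi in zip(v, f_bar))
    line2 = sum(round(dt * vi) * math.log(dt)
                - sum(math.log(j) for j in range(1, round(dt * vi) + 1))
                for vi in v)
    return line1 + line2

dt, v, f_bar = 0.01, [200.0, 400.0, 100.0], [150.0, 220.0, 90.0]
assert abs(log_pmf_poisson(v, f_bar, dt) - log_pmf_split(v, f_bar, dt)) < 1e-9
```

Only the first line depends on the $\bar f_i$, which is why a maximum likelihood estimate may drop the second.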
However, only terms that contain $\bar f_i$ are affected by the model parameters, so that a maximum likelihood estimate may ignore the second line of equation B.5. Thus maximizing $L$ is equivalent to maximizing $L^*$ if $L^*(T)$ is given by

$$L^*(T) = L^*(0) + \int_0^T \vec d(s) \cdot \vec u(s)\, ds + \int_0^T \sum_i \left( v_i(s) \log \bar f_i - \bar f_i \right) ds.$$

(B.6)
Finally, if the model parameters do not affect the stimulus-independent mean rates of the input neurons $\bar f_i$, then we may conclude that

$$\frac{dL^*}{dt} = \vec d \cdot \vec u,$$
(B.7)
as desired.

Appendix C: Linearizations

To generate a linear dynamical system that approximates a Bayes-Markov network, we could linearize any of the exact equations 2.13, 3.6, 4.1, or 4.2. To demonstrate that such linearizations follow naturally from the assumption of weakly tuned input, we now assume that $D(t) = \varepsilon D^1(t)$ and $u_i(t) = b_i + \varepsilon u_i^1(t) + O(\varepsilon^2)$. Inserting this ansatz into equation 2.14 yields

$$\frac{du_i^1}{dt} = b_i d_i^1(t) + \sum_j w_{ij} u_j^1 + O(\varepsilon).$$
(C.1)
Equation 5.4 is obtained by noting that the recurrent weight matrix is singular, with the left and right eigenvectors given by $\mathbf{1}$ and $\vec b$, respectively. Since the dynamical system is linear, an addition of a recurrent term proportional to $(\mathbf{1} \cdot \vec u)\vec b$ and a driving term proportional to $\vec b$ will have the effect of adding a vector proportional to $\vec b$ to the solution, while enhancing the linear stability of the solution. For the purposes of comparison with Rao (2004), we also choose to linearize equation 4.1. First, we note that the uniform inhibitory term in equation 4.1 can be replaced by an arbitrary function without affecting the resulting probability distribution, since

$$u_i(t) = \frac{\exp(\phi_i(t))}{\sum_j \exp(\phi_j(t))} = \frac{\exp(\phi_i(t) + \chi(t))}{\sum_j \exp(\phi_j(t) + \chi(t))}$$

(C.2)
for any $\chi(t)$. Thus, a linear dynamical system that approximates a Bayes-Markov network is concerned only with the accuracy of the linear approximation to the nonlinearity, which contributes the excitatory term, that is, the term given by

$$\sum_j w_{ij} \exp(\tilde\phi_j - \tilde\phi_i).$$

(C.3)
When $D$ is of order $\varepsilon$,

$$\tilde\phi_j - \tilde\phi_i = \phi_j - \phi_i \approx \log(b_j) - \log(b_i) + \varepsilon \frac{u_j^1}{b_j} - \varepsilon \frac{u_i^1}{b_i},$$

(C.4)
which implies that

$$\sum_j w_{ij} \exp(\tilde\phi_j - \tilde\phi_i) \approx \sum_j w_{ij} \frac{b_j}{b_i} \left( 1 + \varepsilon \frac{u_j^1}{b_j} - \varepsilon \frac{u_i^1}{b_i} \right) = \varepsilon \sum_j w_{ij} \frac{u_j^1}{b_i},$$

(C.5)
where we have used the fact that

$$\sum_j w_{ij} b_j = 0.$$

(C.6)
Defining $\tilde\phi_j^1$ so that

$$\varepsilon \frac{u_j^1}{b_j} = \tilde\phi_j^1 - \log(b_j) - \log(\alpha(t)),$$

(C.7)
we may combine equation 5.1 with the approximation given by equation C.5 to obtain the linearized equation

$$\frac{d\tilde\phi_i^1}{dt} = d_i(t) + \beta + \sum_j w_{ij}^* \tilde\phi_j^1 - \sum_j w_{ij}^* \log(b_j),$$

(C.8)
with

$$w_{ij}^* = w_{ij} \frac{b_j}{b_i}.$$

(C.9)
Recalling that any uniform term may be added to an equation that tracks the log likelihood, we obtain equation 5.1 by simply adding an arbitrary uniform term given by a sum of the $\tilde\phi_j$'s.

Acknowledgments

J.B. was supported by NIH training grant T32-MH19942, and A.P. was supported by NSF grant 0346785.

References

Anderson, C. (1994). Neurobiological computational systems. In J. M. Zurada, R. J. Marks, & C. J. Robinson (Eds.), Computational intelligence imitating life. New York: IEEE Press.
Carpenter, R. H., & Williams, M. L. (1995). Neural computation of log likelihood in control of saccadic eye movements. Nature, 377(6544), 59–62.
Dayan, P., & Abbott, L. F. (2001). Theoretical neuroscience. Cambridge, MA: MIT Press.
Denève, S. (2005). Bayesian inferences in spiking neurons. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 353–360). Cambridge, MA: MIT Press.
Gabbiani, F., Krapp, H. G., Koch, C., & Laurent, G. (2002). Multiplicative computation in a visual neuron sensitive to looming. Nature, 420(6913), 320–324.
Gold, J. I., & Shadlen, M. N. (2001). Neural computations that underlie decisions about sensory stimuli. Trends in Cognitive Sciences, 5, 10–16.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–197.
Knill, D. C., & Pouget, A. (2004). The Bayesian brain: The role of uncertainty in neural coding and computation. Trends Neurosci., 27(12), 712–719.
Koechlin, E., Anton, J. L., & Burnod, Y. (1999). Bayesian inference in populations of cortical neurons: A model of motion integration and segmentation in area MT. Biol. Cybern., 80(1), 25–44.
Nelson, M. E. (1994). A mechanism for neuronal gain control by descending pathways. Neural Computation, 6, 242–254.
Peña, J. L., & Konishi, M. (2001). Auditory spatial receptive fields created by multiplication. Science, 292(5515), 249–252.
Rao, R. P. (2004). Bayesian computation in recurrent neural circuits. Neural Comput., 16(1), 1–38.
Rao, R. P. (2005). Bayesian inference and attention modulation in the visual cortex. Neuroreport, 16(16), 1843–1848.
Sahani, M., & Dayan, P. (2003). Doubly distributional population codes: Simultaneous representation of uncertainty and multiplicity. Neural Comput., 15(10), 2255–2279.
Shadlen, M. N., & Newsome, W. T. (2001). Neural basis of a perceptual decision in the parietal cortex (area LIP) of the rhesus monkey. J. Neurophysiol., 86(4), 1916–1936.
Yu, A. J., & Dayan, P. (2005). Inference, attention, and decision in a Bayesian neural architecture. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Received June 10, 2005; accepted August 18, 2006.
LETTER
Communicated by Walter Senn
Multispike Interactions in a Stochastic Model of Spike-Timing-Dependent Plasticity

Peter A. Appleby
[email protected] Institute for Theoretical Biology, Humboldt-Universität zu Berlin, Invalidenstraße 43, D-10115 Berlin, Germany
Terry Elliott
[email protected] Department of Electronics and Computer Science, University of Southampton, Highfield, Southampton, SO17 1BJ U.K.
Recently we presented a stochastic, ensemble-based model of spike-timing-dependent plasticity. In this model, single synapses do not exhibit plasticity depending on the exact timing of pre- and postsynaptic spikes, but spike-timing-dependent plasticity emerges only at the temporal or synaptic ensemble level. We showed that such a model reproduces a variety of experimental results in a natural way, without the introduction of various, ad hoc nonlinearities characteristic of some alternative models. Our previous study was restricted to an examination, analytically, of two-spike interactions, while higher-order, multispike interactions were only briefly examined numerically. Here we derive exact, analytical results for the general n-spike interaction functions in our model. Our results form the basis for a detailed examination, performed elsewhere, of the significant differences between these functions and the implications these differences have for the presence, or otherwise, of stable, competitive dynamics in our model.

1 Introduction

The dependence of at least some forms of synaptic plasticity on the relative timing of pre- and postsynaptic spiking activity is now a well-established experimental result (see, e.g., Bi & Poo, 1998; Zhang, Tao, Holt, Harris, & Poo, 1998; Froemke & Dan, 2002). Spike-timing-dependent plasticity (STDP) has mostly been examined from the perspective of the interaction between exactly two spikes, one presynaptic and the other postsynaptic, although spike triplet and spike quadruplet interactions have begun to be studied too (Froemke & Dan, 2002). For two-spike interactions, it is usually found that a presynaptic spike followed by a postsynaptic spike leads to a potentiation of synaptic efficacy, while reversing the order of the two spikes leads to

Neural Computation 19, 1362–1399 (2007)
© 2007 Massachusetts Institute of Technology
Multispike Interactions in a Stochastic Model of STDP
1363
a depression of synaptic efficacy (but see Bell, Han, Sugawara, & Grant, 1997). The extent of potentiation or depression depends on the interspike interval, with shorter intervals generally leading to greater changes. Beyond an interspike interval of several tens of milliseconds, no change is observed. Despite the fact that the experimental protocols tend to study multiple spike pairings in order to observe a statistically significant change in synaptic efficacy and that changes are usually measured over the entire synaptic connection between a pair of coupled neurons (that is, not at single synapses), the theoretical literature has assumed that the double exponential-like rule characteristic of STDP is obeyed by every individual synapse for every single spike pairing. The STDP rule has been studied directly to examine its computational consequences (Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000), and biophysical models of the dynamics of single synapses have been built that seek to understand the mechanisms that could give rise to the STDP rule (Castellani, Quinlan, Cooper, & Shouval, 2001; Senn, Markram, & Tsodyks, 2001; Karmarkar & Buonomano, 2002; Shouval, Bear, & Cooper, 2002). Much attention has also focused on understanding the conditions under which a Bienenstock-Cooper-Munro (BCM)-like (Bienenstock, Cooper, & Munro, 1982) rule can emerge from STDP (Senn et al., 2001; Izhikevich & Desai, 2003; Burkitt, Meffin, & Grayden, 2004), and how to extend the two-spike STDP rule to accommodate the data from spike triplet and quadruplet interactions (Froemke & Dan, 2002). Previously we presented a stochastic, ensemble-based view of STDP in which STDP is an emergent property of neurons at the temporal or synaptic ensemble level (Appleby & Elliott, 2005). We showed that it is possible to build a model in which single synapses respond to spike pairs by adjusting their synaptic efficacies in a fixed, all-or-none fashion (cf. Petersen, Malenka, Nicoll, & Hopfield, 1998; O'Connor, Wittenberg, & Wang, 2005). If the second spike arrives within a time window that is stochastically determined, then a fixed level of potentiation or depression will occur, but if the second spike arrives too late, no change in efficacy will take place. Critically, any changes that do take place do not depend on the time of the second spike relative to the first. Because the window size is stochastic, different synapses can respond to the same spike pairing differently, and the same synapse can respond to multiple presentations of the same pairing differently. Only at this temporal or synaptic ensemble level does the overall synaptic connection between a pair of neurons exhibit STDP. Due to the overall structure of the proposed model, we found that the experimental data on spike triplet interactions (Froemke & Dan, 2002) were automatically reproduced, and we also showed that a BCM-like learning rule can emerge from the two-spike interactions. Here, we derive exact, analytical results for the general multispike or n-spike interaction functions, for any n ≥ 2, averaged over all possible sequences of n spikes in our model. This is achieved by extending our previous
model slightly and then finally taking a limit that reduces it to its original form. Although we derive exact results for the n-spike interaction function, we show that the asymptotic, large n form of this function is of particular interest, not least because the finite n form rapidly converges to the asymptotic limit, so that the limit is achieved within just a few tens of spikes. These results are required for an analysis, performed elsewhere (Appleby & Elliott, 2006), of the differences between the two-spike and the general n-spike, n > 2, interaction functions, this analysis showing that, at least in our model, the two-spike interaction function’s form is entirely different from the general form observed for all other n-spike interaction functions.
2 Summary of Model and Its Extension

We first provide a brief summary of our earlier model, the derivation of the general two-spike interaction function from it (Appleby & Elliott, 2005), and its extension to a slightly more general form that will facilitate the derivation of general multispike interaction functions. We proposed that all synapses that constitute the overall connection between a pair of synaptically coupled neurons can each exist, independent of the other synapses, in one of three functional states. The resting state of a synapse is labeled the OFF state. Following a presynaptic spike, the synapse undergoes a transition from the OFF state to another state, the POT state, while following a postsynaptic spike, the synapse undergoes a transition from the OFF state to the DEP state. Once in the POT state, further presynaptic spiking has no effect on the synapse's state, and similarly, once in the DEP state, further postsynaptic spiking has no effect. (We shall modify this assumption in order to generalize the model somewhat.) In the absence of postsynaptic (presynaptic) spiking, the synapse stochastically returns to the OFF state from the POT (DEP) state. These stochastic transitions are governed by separate probability density functions (pdfs), $f_+(t)$ for the POT → OFF transition, and $f_-(t)$ for the DEP → OFF transition; here, the variable $t$ denotes time. However, if the synapse is in the POT (DEP) state and a postsynaptic (presynaptic) spike occurs, then the synapse immediately returns to the OFF state but also triggers potentiation (depression) by a fixed amount $A_+$ ($A_-$). Thus, the timing of the second, pre- or postsynaptic spike relative to the first spike does not affect the degree of change in synaptic strength: the strength changes in an all-or-none fashion, with no change occurring if the second spike arrives too late, after the synapse has stochastically returned to the OFF state.
Because the transitions are stochastic, however, different synapses respond to the same spike pairing differently, and the same synapse responds to multiple presentations of the same spike pairing differently. Only at this synaptic or temporal ensemble level does the overall connection strength between the two neurons respect a spike-timing-dependent form of synaptic plasticity, while
individual synapses obey a much simpler, double step-function response when presented with a single spike pair. To calculate the average response to a spike pair, we assume that preand postsynaptic spiking are independent Poisson processes of rates λπ and λ p , respectively. Because they are independent, the combined spike train, ignoring the distinction between pre- and postsynaptic events, is also Poisson, with combined rate β = λπ + λ p . Defining λπ = λπ /β and λp = λ p /β, the probability that any particular spike from this combined sequence is presynaptic (postsynaptic) is then just λπ (λp ). The interspike interval between spikes is governed by an exponential probability distribution, with pdf f T (t) = β exp(−βt). We shall label presynaptic (postsynaptic) spikes by the symbol π ( p) for notational convenience and also refer to π spikes and p spikes throughout. Consider the sequence π p. The π spike moves the synapse from OFF to POT. The probability that the synapse is still in the POT state when the p spike arrives is P + (t) =
∞ t
dt f + (t ),
(2.1)
where $t$ is the interspike interval. The average change in synaptic strength triggered by the $\pi p$ pairing is then just

$$S_{\pi p}(t) = +A_+ P^+(t).$$

(2.2)

Similarly, for a $p\pi$ pairing, we have

$$S_{p\pi}(t) = -A_- P^-(t),$$

(2.3)
where $P^-(t)$ is identical to $P^+(t)$ except that $f_+$ is replaced by $f_-$ in the integral. These two results are conditional expectation values, conditional on a given spike pairing of a given interspike interval. The average response to any isolated spike pair of any interspike interval is obtained by weighting the conditional expectation values by the probability of each particular spike pair and integrating out the time of the first spike and the interspike interval according to their common pdf, $f_T(t)$. Hence,

$$S_2 = \sum_{i,j \in \{\pi, p\}} \hat\lambda_i \hat\lambda_j \int_0^\infty dt_0\, f_T(t_0) \int_0^\infty dt_1\, f_T(t_1)\, S_{ij}(t_1),$$

(2.4)
where, of course, $S_{\pi\pi}(t) = S_{pp}(t) = 0$, and $S_2$ denotes the expected response to an average two-spike pairing. In the remainder of this letter, we shall be principally concerned with the derivation of an expression for $S_n$, the expected response to an average $n$-spike sequence.
1366
P. Appleby and T. Elliott
We previously took the pdfs $f_\pm(t)$ to be governed by integer-order gamma distributions, so that

$$f_\pm(t) = \frac{(t/\tau_\pm)^{\nu_\pm - 1}}{(\nu_\pm - 1)!} \frac{1}{\tau_\pm} \exp(-t/\tau_\pm),$$

(2.5)
where $\tau_\pm$ are the characteristic timescales for the stochastic processes and $\nu_\pm$ are their orders. Notice that if $\nu_\pm = 1$, then these are just simple exponential distributions. It is then easy to see that

$$P^\pm(t) = e_{\nu_\pm}(t/\tau_\pm) \exp(-t/\tau_\pm),$$

(2.6)

where $e_\nu(x) = \sum_{i=0}^{\nu-1} x^i/i!$ is the truncated exponential series. Defining

$$K_1^\pm(\lambda) = 1 - \frac{1}{(1 + \tau_\pm \lambda)^{\nu_\pm}},$$

(2.7)

we then have

$$S_2 = \hat\lambda_\pi \hat\lambda_p \left[ A_+ K_1^+(\beta) - A_- K_1^-(\beta) \right].$$

(2.8)
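Equation 2.7 can be checked directly against the defining average $K_1(\beta) = \beta \int_0^\infty e^{-\beta t}\, P^\pm(t)\, dt$, and equation 2.8 then follows. A Python sketch with illustrative parameter values of our own (the rates, timescales, and amplitudes are not taken from the letter):

```python
import math

def trunc_exp(x, nu):
    """Truncated exponential series e_nu(x) = sum_{i<nu} x^i / i! (equation 2.6)."""
    return sum(x**i / math.factorial(i) for i in range(nu))

def P_surv(t, tau, nu):
    """Survival probability of the POT or DEP state, P(t) = e_nu(t/tau) exp(-t/tau)."""
    return trunc_exp(t / tau, nu) * math.exp(-t / tau)

def K1_closed(beta, tau, nu):
    """Equation 2.7: K_1(beta) = 1 - 1/(1 + tau*beta)^nu."""
    return 1.0 - 1.0 / (1.0 + tau * beta) ** nu

def K1_numeric(beta, tau, nu, T=1.0, steps=100_000):
    """K_1(beta) = beta * int_0^inf exp(-beta t) P(t) dt, by the trapezoid rule."""
    h = T / steps
    f = lambda t: math.exp(-beta * t) * P_surv(t, tau, nu)
    total = 0.5 * (f(0.0) + f(T)) + sum(f(k * h) for k in range(1, steps))
    return beta * h * total

lam_pi, lam_p = 20.0, 15.0          # illustrative pre/postsynaptic rates
beta = lam_pi + lam_p
tau_p, nu_p = 0.020, 2              # POT decay: order-2 gamma, 20 ms timescale
tau_m, nu_m = 0.030, 1              # DEP decay: exponential, 30 ms timescale
A_plus, A_minus = 1.0, 0.8          # all-or-none change amplitudes

assert abs(K1_closed(beta, tau_p, nu_p) - K1_numeric(beta, tau_p, nu_p)) < 1e-6
assert abs(K1_closed(beta, tau_m, nu_m) - K1_numeric(beta, tau_m, nu_m)) < 1e-6

# Equation 2.8: expected response to an average isolated spike pair.
lhat_pi, lhat_p = lam_pi / beta, lam_p / beta
S2 = lhat_pi * lhat_p * (A_plus * K1_closed(beta, tau_p, nu_p)
                         - A_minus * K1_closed(beta, tau_m, nu_m))
```

The sign of `S2` is set by the balance between $A_+ K_1^+$ and $A_- K_1^-$, just as equation 2.8 states.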
We now generalize the above model slightly in order to facilitate the calculation of Sn . We assumed that if a synapse is in the POT (DEP) state, then further presynaptic (postsynaptic) spiking has no effect. We now propose instead that after r additional presynaptic (postsynaptic) spikes, the stochastic processes governing the return of the synapse from the POT (DEP) state to the OFF state are reset, provided that the synapse has not returned to the OFF state at any time after the first spike moved it into the POT (DEP) state. The overall result of this is to increase the likelihood that potentiation or depression will occur by essentially discarding the spike history of the synapse after r + 1 identical spikes and starting afresh. For ν± = 1, however, the exponential distribution is memoryless, so such a system should exhibit dynamics independent of the value of r . Strictly, we should allow differing numbers of spikes to reset the POT and DEP states, but it is easy to generalize our results to this case. Furthermore, we are particularly interested in the properties of the r = 1 and r → ∞ models, the latter being equivalent to the original form of our model. Arguably, these r = 1 and r → ∞ models are more biologically plausible than models with other values of r : either a spike of the appropriate sort resets the stochastic process or it does not, and other values of r would require us to postulate some form of spike counting machinery at the synapse. We therefore assume that r additional spikes of the appropriate kind reset both stochastic processes for DEP → OFF and POT → OFF, and we refer to this generalized model as the r -reset model. We call the 1-reset model
the resetting model and the $\infty$-reset model the nonresetting model. Our strategy in much of what follows is to extract $n$-spike results for the small $r$ models, because $r$ limits the possible extent of spike interactions in a long sequence of spikes. The behavior for large $r$ will then be clear, and the results will fall out directly.

3 Preliminaries

It is convenient, before deriving general results for $S_n$, to derive some preliminary results that will be of use throughout the remainder of the letter. Consider first a sequence of $n$ identical spikes, say, $\pi$ spikes, followed by a different spike, say, a $p$ spike, giving a total of $n + 1$ spikes, the first spike occurring at time $t_0$, and let the subsequent interspike intervals be $t_1, \ldots, t_n$, so that $t_i$, $i > 0$, is the time between the $i$th spike and the $(i+1)$th spike. The time of the $i$th spike is then $\sum_{j=0}^{i-1} t_j$, for any $i$. We assume that the synapse is initially in the OFF state, so that the first $\pi$ spike at time $t_0$ moves the synapse into the POT state. Let the multiargument function $P^+_{ON}(t_1, \ldots, t_n)$ be the probability that the synapse is in the POT state ($P^-_{ON}$ for the DEP state and $p$ spikes) at time $\sum_{i=0}^{n} t_i$, when the final spike arrives; clearly $P^\pm_{ON}$ do not depend on the time of the first spike, $t_0$, as this merely moves the synapse from the OFF state to the POT or DEP states. For the moment, we do not consider resetting. If, when the second spike arrives, the synapse is OFF due to a stochastic decay, then this second spike moves the synapse back to POT, and the probability, given OFF at time $t_0 + t_1$, that the synapse is in the POT state at time $\sum_{i=0}^{n} t_i$ is just $P^+_{ON}(t_2, \ldots, t_n)$. The probability of being in the OFF state at time $t_0 + t_1$ is just $1 - P^+(t_1)$, from the previous section. The probability of being in the POT state at time $t_0 + t_1$ is obviously $P^+(t_1)$, so we can write

$$P^+_{ON}(t_1, \ldots, t_n) = [1 - P^+(t_1)]\, P^+_{ON}(t_2, \ldots, t_n) + P^+(t_1)\, P^+_{ON}(t_2, \ldots, t_n \mid \text{POT at } t_0 + t_1),$$
(3.1)
where $P^+_{ON}(t_2, \ldots, t_n \mid \text{POT at } t_0 + t_1)$ is the conditional probability that the synapse is in the POT state at time $\sum_{i=0}^{n} t_i$ given that it was still in the POT state when the second spike arrived at time $t_0 + t_1$. We now repeat the above argument, conditioning on whether the synapse is still in the POT state when the third spike arrives at time $t_0 + t_1 + t_2$, given that it was still in the POT state when the second spike arrived at time $t_0 + t_1$. Suppose that the synapse is in the OFF state at time $t_0 + t_1 + t_2$ given this sequence. Then the third spike sends the synapse back to POT. The probability that the synapse is in the OFF state at time $t_0 + t_1 + t_2$ given that it was in the POT state at time $t_0 + t_1$ is $1 - P^+(t_1 + t_2 \mid \text{POT at } t_0 + t_1)$, and the probability that it is still in the POT state is $P^+(t_1 + t_2 \mid \text{POT at } t_0 + t_1)$.
Hence, we can now write

$$\begin{aligned} P^+_{ON}(t_1, \ldots, t_n) = {} & [1 - P^+(t_1)]\, P^+_{ON}(t_2, \ldots, t_n) \\ & + [P^+(t_1) - P^+(t_1 + t_2)]\, P^+_{ON}(t_3, \ldots, t_n) \\ & + P^+(t_1 + t_2)\, P^+_{ON}(t_3, \ldots, t_n \mid \text{POT at } t_0 + t_1 + t_2), \end{aligned}$$

(3.2)
where we have used the definition of the conditional probability,

$$P^+(t_1 + t_2 \mid \text{POT at } t_0 + t_1) = \frac{P^+(\text{POT at } t_0 + t_1 + t_2 \ \& \ \text{POT at } t_0 + t_1)}{P^+(\text{POT at } t_0 + t_1)} = \frac{P^+(t_1 + t_2)}{P^+(t_1)}.$$

(3.3)
Thus, repeating the above argument for all subsequent spikes, we see that

$$\begin{aligned} P^+_{ON}(t_1, \ldots, t_n) = {} & [1 - P^+(t_1)]\, P^+_{ON}(t_2, \ldots, t_n) \\ & + [P^+(t_1) - P^+(t_1 + t_2)]\, P^+_{ON}(t_3, \ldots, t_n) \\ & + \cdots \\ & + [P^+(t_1 + \cdots + t_{i-1}) - P^+(t_1 + \cdots + t_i)]\, P^+_{ON}(t_{i+1}, \ldots, t_n) \\ & + \cdots \\ & + [P^+(t_1 + \cdots + t_{n-2}) - P^+(t_1 + \cdots + t_{n-1})]\, P^+_{ON}(t_n) \\ & + P^+(t_1 + \cdots + t_n), \end{aligned}$$

(3.4)
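The recursion in equation 3.4 is straightforward to implement. A Python sketch (the timescale and intervals are illustrative choices of ours); note that for $\nu = 1$ the decay process is memoryless, so $P^+_{ON}$ collapses to $P^+$ of the final interspike interval alone, which provides a convenient sanity check:

```python
import math

def trunc_exp(x, nu):
    return sum(x**i / math.factorial(i) for i in range(nu))

def P_plus(t, tau=0.020, nu=2):
    """Survival probability of the POT state after a time t (equation 2.6)."""
    return trunc_exp(t / tau, nu) * math.exp(-t / tau)

def P_on(ts, P):
    """Equation 3.4 (nonresetting): probability of being in POT when the final
    spike arrives, for identical spikes with interspike intervals ts."""
    n = len(ts)
    if n == 1:
        return P(ts[0])
    total = P(sum(ts))             # synapse survived in POT for the whole sequence
    prev, cum = 1.0, 0.0
    for i in range(1, n):          # first stochastic decay fell in the i-th interval
        cum += ts[i - 1]
        cur = P(cum)
        total += (prev - cur) * P_on(ts[i:], P)
        prev = cur
    return total

ts = [0.010, 0.025, 0.005, 0.040]  # illustrative interspike intervals (s)
assert 0.0 <= P_on(ts, P_plus) <= 1.0

# For nu = 1 the decay is exponential, hence memoryless, and only the final
# interspike interval matters: P_on reduces to P(ts[-1]).
P_exp = lambda t: math.exp(-t / 0.020)
assert abs(P_on(ts, P_exp) - P_exp(ts[-1])) < 1e-12
```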
where clearly $P^+_{ON}(t) \equiv P^+(t)$. The conditional expectation value for the extent of potentiation or depression induced by this sequence of $n$ identical spikes followed by a different spike is just $+A_+ P^+_{ON}(t_1, \ldots, t_n)$ or $-A_- P^-_{ON}(t_1, \ldots, t_n)$, respectively. As with the two-spike calculation, we convert these into unconditional expectations by integrating out the interspike intervals according to their pdfs and weighting by the spike sequences' probabilities. We do this for the general distribution $P_{ON}$, not specifying whether $P^+_{ON}$ or $P^-_{ON}$, and hence we do not at this point weight by the spike sequence probabilities, as these differ for the two possible cases. Because the combined pre- and postsynaptic spike sequence is Poisson with rate $\beta = \lambda_\pi + \lambda_p$, fortunately the interspike pdf is always the same, although in general, for non-Poisson processes, this will not be true. We take equation 3.4 and integrate $t_1, \ldots, t_n$ over their entire ranges, $t_i \in [0, \infty)$, having weighted by $\beta^n \exp(-\beta \sum_{j=1}^{n} t_j)$. Notice that the
integration over the first spike time, $t_0$, always produces a result of unity, by the definition of the pdf. Defining

$$T_l(\beta) = \beta^l \int_0^\infty dt_1 \cdots \int_0^\infty dt_l \exp\Bigl(-\beta \sum_{j=1}^{l} t_j\Bigr)\, P_{ON}(t_1, \ldots, t_l)$$

(3.5)
and

$$K_l(\beta) = \beta^l \int_0^\infty dt_1 \cdots \int_0^\infty dt_l \exp\Bigl(-\beta \sum_{j=1}^{l} t_j\Bigr)\, P(t_1 + \cdots + t_l) = \beta^l \int_0^\infty dt\, \frac{t^{l-1}}{(l-1)!}\, e^{-\beta t} P(t)$$

(3.6)
both for $l \geq 1$, and defining $K_0(\beta) \equiv 1$, we then have

$$T_n = K_n + \sum_{i=1}^{n-1} T_{n-i} (K_{i-1} - K_i),$$

(3.7)
where, unless explicitly indicated otherwise in what follows, the argument of the $T_i$'s or $K_i$'s is always $\beta$. Using the gamma pdfs for $P(t)$ above, it is straightforward although tedious to show that

$$K_l^\pm(\beta) = \left( \frac{\tau_\pm \beta}{1 + \tau_\pm \beta} \right)^l \sum_{i=0}^{\nu_\pm - 1} \binom{i + l - 1}{l - 1} \frac{1}{(1 + \tau_\pm \beta)^i}.$$

(3.8)
The K i ’s will play an important role throughout. We may now consider how the possibility of spikes resetting the stochastic processes affects the above. In the r -reset model, if r + 1 < n, then after the (r + 1)th identical spike, the process, if it has never deactivated during these identical spike events, is reset. This has the effect of making subsequent probabilities independent of any of the events that occurred before the (r + 1)th identical spike arrived and causes subsequent conditional probabilities to cease to depend on times earlier than t0 + · · · + tr . For concreteness, consider r = 1. In moving between equations 3.1 and + 3.2, we had to expand PON (t2 , . . . , tn |P OT at t0 + t1 ) by conditioning on the state of the synapse when the third spike arrived. However, for r = 1, the second π spike at time t0 + t1 resets the stochastic process because it is given that the synapse has stayed in the POT state since the first π spike at time t0 moved the synapse into the POT state. Thus, subsequent changes in synaptic strength are independent of the fact that the synapse did not move
to the OFF state between the first and second $\pi$ spikes, because the second $\pi$ spike restarted the stochastic process afresh. Thus,

$$P^+_{ON}(t_2, \ldots, t_n \mid \text{POT at } t_0 + t_1) = P^+_{ON}(t_2, \ldots, t_n) \quad \text{only for } r = 1.$$

(3.9)
Hence, equation 3.1 becomes $P^+_{ON}(t_1, \ldots, t_n) = P^+_{ON}(t_2, \ldots, t_n)$, and of course we finally arrive at $P^+_{ON}(t_1, \ldots, t_n) = \cdots = P^+_{ON}(t_n) \equiv P^+(t_n)$. This equation means that for $r = 1$, the final spike in a long sequence of identical spikes is the only spike that counts. Hence, for $r = 1$, we have that $T_l = T_1 = K_1$, for any $l$, by the definition of the $T_i$'s in equation 3.5. Putting $T_l = K_1$ into equation 3.7, which we now regard as an equation defining the $K_i$'s in terms of the $T_i$'s, it is easy to see that for this $r = 1$ case, $K_l = K_1^l$. Thus, we extract the one-reset model from the nonresetting model by replacing $K_l$ by $K_1^l$ in any equations in which it occurs. This reflects the fact that the one-reset model replaces $P(t_1 + \cdots + t_l)$ by $\prod_{i=1}^{l} P(t_i)$. Similar considerations show that the $r$-reset model replaces $K_{rl+s}$, $0 \leq s < r$, by $K_r^l K_s$, so any results that we obtain for the nonresetting model can be converted into those for the $r$-reset model by the substitution $K_{rl+s} \to K_r^l K_s$. In general, the finite $r$ models are useful precisely because they limit multispike interactions to at most $r + 1$ spikes, and this enables us, in the next section, to set up a recurrence relation in $S_n$.
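The claim that $K_l = K_1^l$ for the 1-reset model can be checked by solving equation 3.7 for the $K_i$'s with every $T_l$ set equal to $K_1$. A short Python sketch (the value of $K_1$ is an arbitrary illustration):

```python
def K_sequence_one_reset(K1, n_max):
    """Solve equation 3.7 for the K_l's after setting every T_l = K_1, as is
    appropriate for the 1-reset model; K_0 = 1 by definition."""
    K = [1.0, K1]
    for n in range(2, n_max + 1):
        # Equation 3.7 with T_j = K_1:  K_1 = K_n + K_1 * sum_{i=1}^{n-1} (K_{i-1} - K_i)
        s = sum(K[i - 1] - K[i] for i in range(1, n))
        K.append(K1 - K1 * s)
    return K

K1 = 0.4                                  # arbitrary illustrative value
K = K_sequence_one_reset(K1, 8)
assert all(abs(K[l] - K1**l) < 1e-12 for l in range(9))
```

The telescoping sum collapses to $K_0 - K_{n-1}$, which is what forces $K_n = K_1 K_{n-1}$ and hence $K_l = K_1^l$.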
4 Multispike Recurrence Relations

We now derive some recurrence relations for $S_n$ for the one-, two- and three-reset models, and then, having developed intuitions based on these small $r$ cases, we can proceed to derive the results for the arbitrary $r$-reset model.

4.1 1-Reset Model. Let $\Omega_n(t_0, t_1, \ldots, t_{n-1})$ denote an arbitrary sequence of $n$ spikes with interspike intervals $t_i$, $i > 0$, and $t_0$ the time of the first spike, and let the function $W_{\Omega_n}(t_0, \ldots, t_{n-1})$ denote the pdf and spike probability weighting for this sequence that is required to transform conditional expectation values into unconditional expectation values. If $\Omega_n$ contains $m$ $\pi$ spikes and $n - m$ $p$ spikes, then

$$W_{\Omega_n}(t_0, \ldots, t_{n-1}) = \hat\lambda_\pi^m \hat\lambda_p^{n-m}\, \beta^n \exp[-\beta(t_0 + \cdots + t_{n-1})].$$
(4.1)
Let $S_{\Omega_n}(t_1, \ldots, t_{n-1})$ be the conditional expectation value for the change in synaptic strength induced by the sequence $\Omega_n$. This is not a function of the first spike time, $t_0$, but only of the interspike intervals. Consider a sequence of $n + 2$ spikes, the first two of which are either both $\pi$ spikes or both $p$ spikes. For $r = 1$, the first spike is essentially irrelevant because the second, identical spike either reactivates the same process if
the synapse is in the OFF state when it arrives at time $t_0 + t_1$ or resets the process if the synapse is still in the POT state. We therefore see that

$$S_{\pi\pi\Omega_n}(t_1, \ldots, t_{n+1}) = S_{\pi\Omega_n}(t_2, \ldots, t_{n+1}),$$

(4.2)

$$S_{pp\Omega_n}(t_1, \ldots, t_{n+1}) = S_{p\Omega_n}(t_2, \ldots, t_{n+1}).$$

(4.3)
Now consider the spike sequence $\pi p\, \Omega_n(t_0, \ldots, t_{n+1})$. If the synapse has returned to the OFF state before the second, $p$ spike arrives, then the first spike is irrelevant. If the synapse is still in the POT state when the $p$ spike arrives, then there will be a potentiation by an amount $A_+$, and the remainder of the spike sequence $\Omega_n(t_2, \ldots, t_{n+1})$ contributes to $S_{\pi p \Omega_n}(t_1, \ldots, t_{n+1})$ independent of these first two spikes. Hence, we have that

$$S_{\pi p \Omega_n}(t_1, \ldots, t_{n+1}) = [1 - P^+(t_1)]\, S_{p\Omega_n}(t_2, \ldots, t_{n+1}) + P^+(t_1)\, [S_{\Omega_n}(t_3, \ldots, t_{n+1}) + A_+].$$

(4.4)

Similarly, we have

$$S_{p\pi\Omega_n}(t_1, \ldots, t_{n+1}) = [1 - P^-(t_1)]\, S_{\pi\Omega_n}(t_2, \ldots, t_{n+1}) + P^-(t_1)\, [S_{\Omega_n}(t_3, \ldots, t_{n+1}) - A_-].$$

(4.5)
Defining

$$S_{\Omega_n} = \int_0^\infty dt_0 \cdots \int_0^\infty dt_{n-1}\, W_{\Omega_n}(t_0, \ldots, t_{n-1})\, S_{\Omega_n}(t_1, \ldots, t_{n-1}),$$

(4.6)
where the absence of temporal arguments on $S_{\Omega_n}$ indicates that interspike intervals have been integrated out according to their pdfs and spike probabilities included, equations 4.2 and 4.3 become

$$S_{\pi\pi\Omega_n} = \hat\lambda_\pi S_{\pi\Omega_n},$$

(4.7)

$$S_{pp\Omega_n} = \hat\lambda_p S_{p\Omega_n},$$

(4.8)

where the factors of $\hat\lambda_\pi$ and $\hat\lambda_p$ arise because the dependence of $W_{\pi\pi\Omega_n}$ or $W_{pp\Omega_n}$ on the first spike factors out and the remainder of $W_{\pi\Omega_n}$ or $W_{p\Omega_n}$ is absorbed into the definition of $S_{\pi\Omega_n}$ or $S_{p\Omega_n}$. Similarly,

$$S_{\pi p\Omega_n} = \hat\lambda_\pi (1 - K_1^+)\, S_{p\Omega_n} + \hat\lambda_\pi \hat\lambda_p K_1^+ (S_{\Omega_n} + A_+ F_{\Omega_n}),$$

(4.9)

$$S_{p\pi\Omega_n} = \hat\lambda_p (1 - K_1^-)\, S_{\pi\Omega_n} + \hat\lambda_\pi \hat\lambda_p K_1^- (S_{\Omega_n} - A_- F_{\Omega_n}),$$

(4.10)
where

F_{Σn} = ∫₀^∞ dt2 ⋯ ∫₀^∞ dt_{n+1} W_{Σn}(t2, ..., t_{n+1}) = λπ^m λp^{n−m}   (4.11)
if the spike sequence Σn contains m π spikes and n − m p spikes.

We now sum over all possible spike sequences Σn, of which there are 2^n. Let Σ_{Σn} denote this sum, and define S_n = Σ_{Σn} S_{Σn}, S^π_{n+1} = Σ_{Σn} S_{πΣn}, S^{ππ}_{n+2} = Σ_{Σn} S_{ππΣn}, and so on, in an obvious notation. Then we have that

S^{ππ}_{n+2} = λπ S^π_{n+1},   (4.12)
S^{πp}_{n+2} = λπ(1 − K1+) S^p_{n+1} + λπ λp K1+ (S_n + A+),   (4.13)
S^{pπ}_{n+2} = λp(1 − K1−) S^π_{n+1} + λπ λp K1− (S_n − A−),   (4.14)
S^{pp}_{n+2} = λp S^p_{n+1},   (4.15)

where we have used the binomial theorem, Σ_{Σn} F_{Σn} ≡ 1. Since S^π_{n+2} ≡ S^{ππ}_{n+2} + S^{πp}_{n+2} and S^p_{n+2} ≡ S^{pπ}_{n+2} + S^{pp}_{n+2} (by summing over all possible second spikes), and S_n ≡ S^π_n + S^p_n, we finally obtain the coupled recurrence relations:

S^π_{n+2} = +λπ λp A+ K1+ + λπ λp K1+ S_n + λπ S_{n+1} − λπ K1+ S^p_{n+1},   (4.16)
S^p_{n+2} = −λπ λp A− K1− + λπ λp K1− S_n + λp S_{n+1} − λp K1− S^π_{n+1}.   (4.17)
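The coupled recurrences in equations 4.16 and 4.17 are straightforward to iterate numerically. The sketch below does so from vanishing initial conditions (as used later for the exact solution); all parameter values are arbitrary illustrative choices, not taken from the text, and it simply confirms that the per-spike change Sn/n settles to a constant:

```python
# Iterate the coupled 1-reset recurrences (eqs. 4.16 and 4.17).
# Parameter values below are illustrative only.
lam_pi, lam_p = 0.6, 0.4        # probabilities of a pi or p spike (lam_pi + lam_p = 1)
K1p, K1m = 0.5, 0.4             # K_1^+ and K_1^-
Ap, Am = 1.0, 0.95              # A_+ and A_-

N = 2000
S_pi = [0.0, 0.0]               # S^pi_0, S^pi_1
S_p = [0.0, 0.0]                # S^p_0,  S^p_1
for n in range(N - 1):
    S_n = S_pi[n] + S_p[n]
    S_n1 = S_pi[n + 1] + S_p[n + 1]
    # eq. 4.16
    S_pi.append(lam_pi * lam_p * Ap * K1p + lam_pi * lam_p * K1p * S_n
                + lam_pi * S_n1 - lam_pi * K1p * S_p[n + 1])
    # eq. 4.17
    S_p.append(-lam_pi * lam_p * Am * K1m + lam_pi * lam_p * K1m * S_n
               + lam_p * S_n1 - lam_p * K1m * S_pi[n + 1])

per_spike_1000 = (S_pi[1000] + S_p[1000]) / 1000
per_spike_2000 = (S_pi[2000] + S_p[2000]) / 2000
```

The two per-spike values agree closely, consistent with the linear growth of Sn established analytically in section 5.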
These recurrence relations represent our final form for the 1-reset model.

4.2 2-Reset Model. For the 1-reset model, we had to look at specific spike patterns extending out to two spikes. For the 2-reset model, we must go out to three spikes, because only the third spike in a sequence of three identical spikes possesses the capacity to reset the stochastic processes. Consider S_{πππΣn}(t1, ..., t_{n+2}). As usual, we condition on the state of the synapse as the second and third π spikes arrive. If the synapse is in the OFF state when the second π spike arrives, we obtain a subsequent change governed by S_{ππΣn}(t2, ..., t_{n+2}). If the synapse is still in the POT state when the second spike arrives, then we look at the state when the third spike arrives. If the synapse is in the OFF state at this time, we obtain a subsequent change governed by S_{πΣn}(t3, ..., t_{n+2}). However, if the synapse is still in the POT state, then the third spike resets the process, and subsequent changes are still governed by S_{πΣn}(t3, ..., t_{n+2}). Hence, two conditioning steps lead to

S_{πππΣn}(t1, ..., t_{n+2}) = P+(t1) S_{ππΣn}(t2, ..., t_{n+2} | POT at t0 + t1) + [1 − P+(t1)] S_{ππΣn}(t2, ..., t_{n+2}),   (4.18)
where the conditional change

S_{ππΣn}(t2, ..., t_{n+2} | POT at t0 + t1) = P+(t1 + t2 | POT at t0 + t1) S_{πΣn}(t3, ..., t_{n+2}) + [1 − P+(t1 + t2 | POT at t0 + t1)] S_{πΣn}(t3, ..., t_{n+2}) = S_{πΣn}(t3, ..., t_{n+2}),   (4.19)

so that

S_{πππΣn}(t1, ..., t_{n+2}) = P+(t1) S_{πΣn}(t3, ..., t_{n+2}) + [1 − P+(t1)] S_{ππΣn}(t2, ..., t_{n+2}).   (4.20)
This cancellation between the conditional probabilities in the final conditioning step is characteristic of all the r-reset models and is due precisely to the stochastic process resetting. Similarly, we obtain

S_{ππpΣn}(t1, ..., t_{n+2}) = P_ON^+(t1, t2)[S_{Σn}(t4, ..., t_{n+2}) + A+] + [1 − P_ON^+(t1, t2)] S_{pΣn}(t3, ..., t_{n+2}),   (4.21)
S_{πpπΣn}(t1, ..., t_{n+2}) = P+(t1)[S_{πΣn}(t3, ..., t_{n+2}) + A+] + [1 − P+(t1)] S_{pπΣn}(t2, ..., t_{n+2}),   (4.22)
S_{πppΣn}(t1, ..., t_{n+2}) = P+(t1)[S_{pΣn}(t3, ..., t_{n+2}) + A+] + [1 − P+(t1)] S_{ppΣn}(t2, ..., t_{n+2}),   (4.23)
and the remaining four expressions with a leading p spike follow by symmetry. Integrating out the spike times and summing over all spike sequences Σn as for the 1-reset model, we have

S^{πππ}_{n+3} = λπ(1 − K1+) S^{ππ}_{n+2} + λπ² K1+ S^π_{n+1},   (4.24)
S^{ππp}_{n+3} = λπ²(1 − T2+) S^p_{n+1} + λπ² λp T2+ (S_n + A+),   (4.25)
S^{πpπ}_{n+3} = λπ(1 − T1+) S^{pπ}_{n+2} + λπ λp T1+ (S^π_{n+1} + λπ A+),   (4.26)
S^{πpp}_{n+3} = λπ(1 − T1+) S^{pp}_{n+2} + λπ λp T1+ (S^p_{n+1} + λp A+).   (4.27)

Since S^{ππ}_{n+3} = S^{πππ}_{n+3} + S^{ππp}_{n+3} and S^{πp}_{n+3} = S^{πpπ}_{n+3} + S^{πpp}_{n+3}, and so on, we obtain

S^{ππ}_{n+3} = λπ²(1 − T2+) S^p_{n+1} + λπ² λp T2+ (S_n + A+) + λπ² K1+ S^π_{n+1} + λπ(1 − K1+) S^{ππ}_{n+2}   (4.28)
and

S^{πp}_{n+3} = λπ(1 − T1+) S^p_{n+2} + λπ λp T1+ (S_{n+1} + A+).   (4.29)
The trick to simplify these equations further is to observe that S^{ππ}_{n+2} = S^π_{n+2} − S^{πp}_{n+2} and use equation 4.29 to rewrite S^{πp}_{n+2}. Since this equation has no terms in S_n, this trick does not increase the order of the resulting recurrence relations, but does reduce them from a coupled four-dimensional system to a more manageable, coupled two-dimensional system. Hence,

S^π_{n+3} = λπ² λp [T2+ − T1+(1 − K1+)](S_n + A+) + λπ²[1 − T2+ − (1 − T1+)(1 − K1+)] S^p_{n+1} + λπ² K1+ S^π_{n+1} + λπ λp T1+ (S_{n+1} + A+) + λπ(1 − T1+) S_{n+2}.   (4.30)
A symmetrical version of this equation holds for S^p_{n+3}. We observe that T2+ − T1+(1 − K1+) = K2+ and 1 − T2+ − (1 − T1+)(1 − K1+) = K1+ − K2+, and the overall multiplier of A+ is λp(λπ K1+ + λπ² K2+). So, defining Q2+ = λπ K1+ + λπ² K2+ and Q2− = λp K1− + λp² K2−, we obtain the coupled system

S^π_{n+3} = +λp A+ Q2+ + λπ² λp K2+ S_n + λπ K1+ S_{n+1} − λπ² K2+ S^p_{n+1} + λπ(1 − K1+) S_{n+2},   (4.31)
S^p_{n+3} = −λπ A− Q2− + λπ λp² K2− S_n + λp K1− S_{n+1} − λp² K2− S^π_{n+1} + λp(1 − K1−) S_{n+2}.   (4.32)
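The two identities quoted above can be checked mechanically. The sketch below takes arbitrary illustrative values for T1+ and T2+, computes K1+ and K2+ from the recurrence connecting the T and K functions (the form of equation 3.7 used later in simplifying X3, namely Kj+ = Tj+ − Σ_{i=1}^{j−1}(K_{i−1}+ − Ki+) T_{j−i}+, with K0+ ≡ 1), and verifies both combinations:

```python
# Verify T2+ - T1+(1 - K1+) = K2+ and
#        1 - T2+ - (1 - T1+)(1 - K1+) = K1+ - K2+,
# given the recurrence relating the T and K functions.
T1, T2 = 0.37, 0.22             # illustrative values only

K0 = 1.0                        # K_0^+ is defined to be 1
K1 = T1                         # j = 1: the correction sum is empty
K2 = T2 - (K0 - K1) * T1        # j = 2 instance of the recurrence

lhs1 = T2 - T1 * (1.0 - K1)
lhs2 = 1.0 - T2 - (1.0 - T1) * (1.0 - K1)
```

Both identities hold exactly for any choice of T1+ and T2+, which is why the reduction to the coupled system in equations 4.31 and 4.32 goes through.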
These coupled recurrence relations represent our final form for the 2-reset model.

4.3 3-Reset Model. Now we go out to the first four spikes in the sequence because the fourth identical spike possesses the capacity to reset the stochastic processes. Three conditioning steps and the usual weighting by W give

S^{ππππ}_{n+4} = λπ(1 − K1+) S^{πππ}_{n+3} + λπ²(K1+ − K2+) S^{ππ}_{n+2} + λπ³ K2+ S^π_{n+1}.   (4.33)

The four equations for S^{πpππ}_{n+4}, S^{πpπp}_{n+4}, S^{πppπ}_{n+4}, and S^{πppp}_{n+4} all sum to give

S^{πp}_{n+4} = λπ λp T1+ (S_{n+2} + A+) + λπ(1 − T1+) S^p_{n+3},   (4.34)
while those for S^{ππpπ}_{n+4} and S^{ππpp}_{n+4} sum to give

S^{ππp}_{n+4} = λπ² λp T2+ (S_{n+1} + A+) + λπ²(1 − T2+) S^p_{n+2},   (4.35)

and we also have

S^{πππp}_{n+4} = λπ³ λp T3+ (S_n + A+) + λπ³(1 − T3+) S^p_{n+1}.   (4.36)
In these four equations, for reasons that will become clear in the r-reset model, we maintain a rigorous distinction between T1+ and K1+ even though they are identical. We see that these last three equations are, in fact, instances of the generic equation

S^{π^i p}_m = λπ^i λp Ti+ (S_{m−i−1} + A+) + λπ^i (1 − Ti+) S^p_{m−i},   (4.37)
for m > i, where π^i p means i π spikes followed by a p spike. This equation follows easily, in fact, from conditioning on whether the synapse is in the POT state when the first p spike arrives, leading to the Ti+ term, or in the OFF state when the first p spike arrives, leading to the 1 − Ti+ term, and then, as usual, integrating out interspike intervals and summing over all subsequent spikes. By observing that

S^π_m = S^{π^i}_m + Σ_{j=1}^{i−1} S^{π^j p}_m,   (4.38)

we can use this equation to eliminate all occurrences of S^{π^i}_m terms, i > 1, from equation 4.33 and then use equation 4.37 to eliminate all occurrences of S^{π^i p}_m terms, i > 0, leaving terms in S^π_m and S^p_m only. Defining Q3+ = λπ K1+ + λπ² K2+ + λπ³ K3+, after some algebra, we finally obtain
S^π_{n+4} = λπ(1 − K1+) S_{n+3} + (λπ² K1+ − λπ² K2+) S_{n+2} + λπ² K2+ S_{n+1} − λπ³ K3+ S^p_{n+1} + λp λπ³ K3+ S_n + λp A+ Q3+,   (4.39)

with a similar equation for S^p_{n+4}. It is convenient to define K̃i+ = λπ^i Ki+ and K̃i− = λp^i Ki−. Then Qi± = Σ_{j=1}^{i} K̃j±, and we may finally write

S^π_{n+4} = +λp A+ Q3+ + λp K̃3+ S_n + K̃2+ S_{n+1} − K̃3+ S^p_{n+1} + (K̃1+ − K̃2+) S_{n+2} + (λπ − K̃1+) S_{n+3},   (4.40)
S^p_{n+4} = −λπ A− Q3− + λπ K̃3− S_n + K̃2− S_{n+1} − K̃3− S^π_{n+1} + (K̃1− − K̃2−) S_{n+2} + (λp − K̃1−) S_{n+3}.   (4.41)
These coupled recurrence relations represent our final form for the 3-reset model.

4.4 r-Reset Model. Having built some intuition for small r, we can now proceed to the general r case. The standard conditioning argument gives

S^{π^{r+1}}_{n+r+1} = Σ_{i=0}^{r−1} (λπ K̃i+ − K̃_{i+1}+) S^{π^{r−i}}_{n+r−i} + K̃r+ S^π_{n+1},   (4.42)
where K̃0± ≡ 1, and we still have the generic results in equations 4.37 and 4.38. It is, of course, equation 4.42 that always ensures that the resulting recurrence relation in S^π_m and S^p_m is of order r + 1 and hence a finite system of coupled equations. Pulling out the i = r − 1 term from the sum and then writing j = i + 1, we have the slightly more convenient form

S^{π^{r+1}}_{n+r+1} = Σ_{j=1}^{r−1} (λπ K̃_{j−1}+ − K̃j+) S^{π^{r+1−j}}_{n+r+1−j} + λπ K̃_{r−1}+ S^π_{n+1}.   (4.43)
Using equation 4.38 to eliminate terms in S^{π^i}_m, i > 1, and then equation 4.37 to eliminate terms in S^{π^i p}_m, i > 0, we obtain the apparently rather unpromising equation

S^π_{n+r+1} = X1 + λp A+ X2 + λp X3 + X4,   (4.44)

where

X1 = λπ K̃_{r−1}+ S^π_{n+1} + Σ_{j=1}^{r−1} (λπ K̃_{j−1}+ − K̃j+) S^π_{n+r+1−j},   (4.45)
X2 = Σ_{j=1}^{r} λπ^j Tj+ − Σ_{j=1}^{r−1} Σ_{i=1}^{r−j} λπ^{i+j} (K_{j−1}+ − Kj+) Ti+,   (4.46)

X3 = Σ_{j=1}^{r} λπ^j Tj+ S_{n+r−j} − Σ_{j=1}^{r−1} Σ_{i=1}^{r−j} λπ^{i+j} (K_{j−1}+ − Kj+) Ti+ S_{n+r−i−j},   (4.47)
X4 = Σ_{j=1}^{r} λπ^j (1 − Tj+) S^p_{n+r+1−j} − Σ_{j=1}^{r−1} Σ_{i=1}^{r−j} λπ^{i+j} (K_{j−1}+ − Kj+)(1 − Ti+) S^p_{n+r+1−i−j}.   (4.48)
However, the quantities X2, X3, and X4 simplify dramatically. Consider, for example, X3:

X3 = Σ_{j=1}^{r} λπ^j Tj+ S_{n+r−j} − Σ_{j=1}^{r−1} Σ_{i=1}^{r−j} λπ^{i+j} (K_{j−1}+ − Kj+) Ti+ S_{n+r−i−j}
   = Σ_{j=1}^{r} λπ^j Tj+ S_{n+r−j} − Σ_{k=2}^{r} λπ^k S_{n+r−k} Σ_{l=1}^{k−1} (K_{l−1}+ − Kl+) T_{k−l}+
   = λπ T1+ S_{n+r−1} + Σ_{j=2}^{r} λπ^j S_{n+r−j} [Tj+ − Σ_{i=1}^{j−1} (K_{i−1}+ − Ki+) T_{j−i}+]
   ≡ λπ T1+ S_{n+r−1} + Σ_{j=2}^{r} λπ^j S_{n+r−j} Kj+
   = Σ_{j=1}^{r} K̃j+ S_{n+r−j},   (4.49)
where the penultimate line follows from the recurrence relation defining Tj+ in equation 3.7. This was the reason for rigorously distinguishing between T1 and K 1 , even though they are identical, so that we could immediately use equation 3.7 without any further manipulation. The key step in this simplification is to rewrite the double sum by defining i + j to be k and summing over its range. Identical manipulations reduce X2 and X4 , giving
X2 = Σ_{j=1}^{r} K̃j+ ≡ Qr+,   (4.50)

X4 = Σ_{j=1}^{r} (λπ K̃_{j−1}+ − K̃j+) S^p_{n+r+1−j}.   (4.51)
Thus, finally we have

S^π_{n+r+1} = λp A+ Qr+ + λπ K̃_{r−1}+ S^π_{n+1} + Σ_{j=1}^{r−1} (λπ K̃_{j−1}+ − K̃j+) S^π_{n+r+1−j} + λp Σ_{j=1}^{r} K̃j+ S_{n+r−j} + Σ_{j=1}^{r} (λπ K̃_{j−1}+ − K̃j+) S^p_{n+r+1−j}.   (4.52)
This expression is valid for all r ≥ 1. A similar expression follows by symmetry for S^p_{n+r+1}. After a little further manipulation, we obtain, for r ≥ 2,

S^π_{n+r+1} = +λp A+ Qr+ + λp K̃r+ S_n + K̃_{r−1}+ S_{n+1} − K̃r+ S^p_{n+1} + Σ_{j=1}^{r−2} (K̃j+ − K̃_{j+1}+) S_{n+r−j} + (λπ − K̃1+) S_{n+r},   (4.53)
S^p_{n+r+1} = −λπ A− Qr− + λπ K̃r− S_n + K̃_{r−1}− S_{n+1} − K̃r− S^π_{n+1} + Σ_{j=1}^{r−2} (K̃j− − K̃_{j+1}−) S_{n+r−j} + (λp − K̃1−) S_{n+r},   (4.54)
where, of course, Σ_{j=1}^{r−2} ≡ 0 for r = 2.

Notice that the equations for S^π_{n+r+1} and S^p_{n+r+1} involve terms in S_m only, except for the single terms S^p_{n+1} and S^π_{n+1}. These two terms thus prevent us from simply adding the equations for S^π_{n+r+1} and S^p_{n+r+1} and hence obtaining a single, uncoupled (r + 1)th-order recurrence relation for S_m. The above coupled recurrence relations represent our final form, and we now turn to obtaining the exact solution for r = 1 and the asymptotic large n solution for r > 1.

5 Solution of Recurrence Relations

We now solve the 1-reset model exactly for any n, then determine the leading-order, asymptotic solutions for large n for the r-reset, r > 1, model.

5.1 Exact Solution of the 1-Reset Model. The 1-reset model has the second-order coupled recurrence relations given by equations 4.16 and 4.17. We are not interested in how the separate components S^π_n and S^p_n evolve but only in their sum, S_n = S^π_n + S^p_n. We have that

S_{n+2} = (λp A+ Q1+ − λπ A− Q1−) + (λp K̃1+ + λπ K̃1−) S_n + S_{n+1} − (K̃1+ S^p_{n+1} + K̃1− S^π_{n+1})   (5.1)
and also

K̃1+ S^p_{n+2} + K̃1− S^π_{n+2} = (λp A+ K̃1− Q1+ − λπ A− K̃1+ Q1−) + K̃1+ K̃1− S_n + (λp K̃1+ + λπ K̃1−) S_{n+1} − K̃1+ K̃1− S_{n+1}.   (5.2)
At the cost of increasing the order of the recurrence relation, we can thus obtain a relation purely in S_m, to give

(S_{n+3} − S_{n+2}) − K̃1+ K̃1− (S_{n+1} − S_n) = Ã1,   (5.3)

where we have defined the inhomogeneous part of this equation, Ã1, by

Ã1 = λp A+ (1 − K̃1−) Q1+ − λπ A− (1 − K̃1+) Q1−.   (5.4)
The characteristic equation for the homogeneous part of the equation has the three roots +1 and ±√(K̃1+ K̃1−). Since 0 ≤ K1± < 1 for finite β and 0 ≤ λπ, λp ≤ 1, these last two roots are real and have moduli less than unity. The general solution of this equation is then

S_n = n Ã1/(1 − K̃1+ K̃1−) + (K̃1+ K̃1−)^{n/2} [D1 + (−1)^n D2] + D3,   (5.5)

where the constants D1, D2, and D3 can be determined by requiring that S0 = 0, S1 = 0, and S2 = λp A+ K̃1+ − λπ A− K̃1−. Because K̃1+ K̃1− < 1, the asymptotic large n form of this solution is then just

(1/n) S_n ∼ Ã1/(1 − K̃1+ K̃1−).   (5.6)
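The collapsed third-order recurrence of equation 5.3 gives a quick numerical cross-check of equation 5.6. In the sketch below, the values of λπ, λp, K1±, and A± are arbitrary illustrative choices:

```python
# Iterate eq. 5.3, S_{n+3} = S_{n+2} + Kt1p*Kt1m*(S_{n+1} - S_n) + A1t,
# from S0 = S1 = 0 and S2 = lam_p*Ap*Kt1p - lam_pi*Am*Kt1m,
# and compare S_n/n against the asymptote of eq. 5.6.
lam_pi, lam_p = 0.6, 0.4        # illustrative spike probabilities
Ap, Am = 1.0, 0.95
Kt1p = lam_pi * 0.5             # \tilde{K}_1^+ = lam_pi * K_1^+
Kt1m = lam_p * 0.4              # \tilde{K}_1^- = lam_p * K_1^-
Q1p, Q1m = Kt1p, Kt1m           # Q_1^{+,-} = \tilde{K}_1^{+,-}

A1t = lam_p * Ap * (1 - Kt1m) * Q1p - lam_pi * Am * (1 - Kt1p) * Q1m
S = [0.0, 0.0, lam_p * Ap * Kt1p - lam_pi * Am * Kt1m]
for n in range(5000):
    S.append(S[-1] + Kt1p * Kt1m * (S[-2] - S[-3]) + A1t)

asymptote = A1t / (1 - Kt1p * Kt1m)
per_spike = S[5000] / 5000
```

The decaying homogeneous roots ±√(K̃1+K̃1−) make the convergence to the asymptote very fast, anticipating the n ≈ 10 convergence seen numerically in section 8.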
The solution for large enough n thus essentially scales linearly with n, and this is exactly what we should expect: a typical train of 2n spikes should induce, on average, twice as much change in synaptic strength as a typical train of n spikes, for n sufficiently large. Of course, real neurons would not be expected to scale in this manner because synaptic strengths are presumably constrained and thus saturate at upper or lower limits.

5.2 Asymptotic Solution of the r-Reset Model. As for the 1-reset model, we can write

S_{n+r+1} = (λp A+ Qr+ − λπ A− Qr−) + (λp K̃r+ + λπ K̃r−) S_n + (K̃_{r−1}+ + K̃_{r−1}−) S_{n+1} + (1 − K̃1+ − K̃1−) S_{n+r} + Σ_{j=1}^{r−2} (K̃j+ + K̃j− − K̃_{j+1}+ − K̃_{j+1}−) S_{n+r−j} − (K̃r+ S^p_{n+1} + K̃r− S^π_{n+1})   (5.7)
and

K̃r+ S^p_{n+r+1} + K̃r− S^π_{n+r+1} = (λp A+ K̃r− Qr+ − λπ A− K̃r+ Qr−) + K̃r+ K̃r− S_n + (K̃_{r−1}+ K̃r− + K̃_{r−1}− K̃r+ − K̃r+ K̃r−) S_{n+1} + Σ_{j=1}^{r−2} [K̃r− (K̃j+ − K̃_{j+1}+) + K̃r+ (K̃j− − K̃_{j+1}−)] S_{n+r−j} + (λp K̃r+ + λπ K̃r−) S_{n+r} − (K̃1+ K̃r− + K̃1− K̃r+) S_{n+r}.   (5.8)
Hence, defining

Ãr = λp A+ (1 − K̃r−) Qr+ − λπ A− (1 − K̃r+) Qr−,   (5.9)
after some algebra we have

S_{n+2r+1} − S_{n+2r} = Ãr + K̃r+ K̃r− (S_{n+1} − S_n) − Σ_{j=1}^{r−1} (K̃j+ + K̃j−)(S_{n+2r+1−j} − S_{n+2r−j}) + Σ_{j=1}^{r−1} (K̃j+ K̃r− + K̃j− K̃r+)(S_{n+r+1−j} − S_{n+r−j}),   (5.10)
where we have written this equation so that it is clear that there is a +1 solution of the characteristic equation for the homogeneous form of the recurrence relation. With Σ_{j=1}^{r−1} ≡ 0 for r = 1, this equation is valid for all r ≥ 1.

We define Un = S_{n+1} − S_n, with S1 ≡ 0, so that S_{n+1} = Σ_{i=1}^{n} Ui. Equation 5.10 then implies that the Ui's satisfy the recurrence relation

Σ_{j=0}^{r} Λj U_{n+2r−j} − Σ_{j=0}^{r} (K̃j+ K̃r− + K̃j− K̃r+) U_{n+r−j} = Ãr − K̃r+ K̃r− Un,   (5.11)
where Λj = K̃j+ + K̃j−, j ≥ 1, and we define Λ0 ≡ 1. These Ui's will play an important role later in the construction of exact solutions for Sn in the nonresetting model. The characteristic equation for the homogeneous part of equation 5.11 is, after a little rearrangement,

θ^{2r} + Σ_{j=1}^{r−1} Λj θ^{2r−j} − Σ_{j=1}^{r−1} (K̃j+ K̃r− + K̃j− K̃r+) θ^{r−j} − K̃r+ K̃r− = 0.   (5.12)
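Equation 5.12 can be probed numerically for small r. For r = 2 it reduces to θ⁴ + Λ1 θ³ − (K̃1+ K̃2− + K̃1− K̃2+) θ − K̃2+ K̃2− = 0; the sketch below builds this polynomial for illustrative K values (not taken from the text) and confirms that all four roots lie inside the unit circle, as the argument below requires:

```python
import numpy as np

# Characteristic polynomial of eq. 5.12 for r = 2, with illustrative
# values K1 = 0.5, K2 = 0.25 and lam_pi = lam_p = 0.5, so that
# \tilde{K}_j = lam^j * K_j for both spike types.
K1t = 0.5 * 0.5                 # \tilde{K}_1^{+,-} = 0.25
K2t = 0.25 * 0.25               # \tilde{K}_2^{+,-} = 0.0625
Lam1 = K1t + K1t                # Lambda_1 = \tilde{K}_1^+ + \tilde{K}_1^-

# theta^4 + Lam1*theta^3 - (K1t*K2t + K1t*K2t)*theta - K2t*K2t = 0
coeffs = [1.0, Lam1, 0.0, -2.0 * K1t * K2t, -K2t * K2t]
theta = np.roots(coeffs)
max_modulus = max(abs(theta))
```

For these values the roots are well inside the unit circle, consistent with the general bound established next from the linear growth of Sn.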
This equation has 2r roots, which we denote by θi, i = 1, ..., 2r. To obtain the asymptotic behavior of Sn for general r, we need to bound |θi| for all i. We can prove that the roots θi have moduli not exceeding unity by showing that Sn grows no faster than linearly in n. Consider an arbitrary sequence of spikes, Σn(t0, ..., t_{n−1}). Any π spike could induce a change of −A− in synaptic efficacy if it triggers a DEP → OFF transition; otherwise, it induces no change. Similarly, any p spike could induce a change of +A+ if it triggers a POT → OFF transition; otherwise, it induces no change. There are at most n π spikes and at most n p spikes in the sequence. Hence, we must have −nA− ≤ S_{Σn}(t0, ..., t_{n−1}) ≤ +nA+, for any sequence Σn with any interspike intervals t0, ..., t_{n−1}. The averaged function Sn must therefore also respect these bounds, so that

−A− ≤ (1/n) Sn ≤ +A+.   (5.13)
Thus, Sn grows no faster than linearly in n, so there are no roots θi having moduli exceeding unity. Furthermore, there are no at-least-thrice-repeated unit modulus roots, as these would induce at least quadratic terms in n. We also exclude twice-repeated unit modulus roots not equal to unity because such roots would contribute linearly growing, oscillatory terms to Sn, which are incompatible with our model's dynamics. Finally, it is easy to see that equation 5.12 has precisely one positive (real) root, and this root is strictly less than one. In summary, all the roots θi satisfy |θi| ≤ 1, any that satisfy |θi| = 1 are unrepeated, and there is no θi ≡ 1 root. Hence, the asymptotic form of Sn is dominated by its particular solution, since the +1 root of the characteristic equation of equation 5.10 in the presence of the constant inhomogeneous term induces a particular solution that grows linearly with n. Thus, we finally obtain the asymptotic large n solution

(1/n) Sn ∼ [λp A+ (1 − K̃r−) Qr+ − λπ A− (1 − K̃r+) Qr−] / [(1 − K̃r+)(1 − K̃r−) + (1 − K̃r−) Qr+ + (1 − K̃r+) Qr−],   (5.14)
valid for any r . We shall see later that the asymptotic solution is good even for n ≈ 10. The formal limit r → ∞ is beset by various technical difficulties associated with the limit and the resulting sum over a potentially infinite number of polynomial roots. However, in the usual spirit of applied mathematics, we shall ignore these possible difficulties and just assume that the limit of the finite r asymptotic solution is the asymptotic solution of the nonresetting model (and confirm this numerically). Now, K˜ r → 0 as r → ∞, and we require Q± = limr →∞ Qr± . We have, for example,
Q+ =
∞
λπ K l+ (β) l
l=1 ∞
∞
tl−1 −βt + e P (t) (l − 1)! 0 l=1 ∞ ∞ (λπ t)l dt = λπ e −βt P + (t) l! 0 l=0 ∞ dt e −λ p t P + (t) = λπ =
λlπ
dt
0
λπ + Q+ = K (λ p ). λp 1
(5.15)
Notice that this derivation makes no use of our particular choice for f+ and therefore is valid for any pdf f+. Similarly, Q− = (λp/λπ) K1−(λπ) (independent of f−), so for the nonresetting or r → ∞ model, we obtain the particularly simple expression

(1/n) Sn ∼ [λπ A+ K1+(λp) − λp A− K1−(λπ)] / [1 + (λp/λπ) K1−(λπ) + (λπ/λp) K1+(λp)].   (5.16)
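Equation 5.16 is cheap to evaluate once K1± are known. As a concrete illustration, suppose the survival functions were simple exponentials, P±(t) = e^{−t/τ±} — an assumption made here purely for the sketch, not the model's actual choice — so that K1±(λ) = λτ±/(λτ± + 1). The bound of equation 5.13 must then hold for the asymptotic per-spike change:

```python
# Evaluate the nonresetting asymptote of eq. 5.16 under the
# illustrative assumption P(t) = exp(-t/tau), which gives
# K1(lam) = lam * tau / (lam * tau + 1).
def K1(lam, tau):
    return lam * tau / (lam * tau + 1.0)

lam_pi, lam_p = 0.6, 0.4        # illustrative spike probabilities
tau_p, tau_m = 2.0, 3.0         # illustrative survival time constants
Ap, Am = 1.0, 0.95

num = lam_pi * Ap * K1(lam_p, tau_p) - lam_p * Am * K1(lam_pi, tau_m)
den = 1.0 + (lam_p / lam_pi) * K1(lam_pi, tau_m) + (lam_pi / lam_p) * K1(lam_p, tau_p)
per_spike = num / den
```

Since λπ K1+(λp) ≤ (λπ/λp) K1+(λp) for λp ≤ 1 (and similarly for the depressing term), the numerator is bounded by ±max(A+, A−) times the denominator, so the computed value necessarily respects equation 5.13.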
Notice that the limits λπ → 0 or λp → 0 are well defined since K1(λ) → 0 as λ → 0 at least as fast as λ.

6 1-Transition Processes

We now seek to provide a deeper understanding of the asymptotic solutions derived above. This will permit us to construct an exact solution for Sn in the nonresetting model in the next section.
We rewrite the general asymptotic solution in the slightly more transparent form

(1/n) Sn ∼ T1/N1,   (6.1)

where

T1 = λp A+ Qr+/(1 − K̃r+) − λπ A− Qr−/(1 − K̃r−),   (6.2)
N1 = 1 + Qr+/(1 − K̃r+) + Qr−/(1 − K̃r−),   (6.3)
and we work henceforth, for convenience, only with the nonresetting, r → ∞ model, so that

T1 = λp A+ Q+ − λπ A− Q−,   (6.4)
N1 = 1 + Q+ + Q−.   (6.5)
We specifically do not sum the terms in Q± to obtain, for example, Q+ = (λπ/λp) K1+(λp), as this obscures the reduction of this model to the r-reset model and also renders opaque a natural interpretation of the terms in T1 and N1. Notice that for the reduction to the r-reset model, we have, for example,

Q+ = Σ_{i=1}^{∞} K̃i+ → Qr+ [1 + K̃r+ + (K̃r+)² + ⋯] = Qr+/(1 − K̃r+),   (6.6)
so we see immediately how terms like Qr+/(1 − K̃r+) arise from Q+ and therefore no longer need to consider the general r-reset model.

We define T1+ = +λp A+ Q+ and T1− = −λπ A− Q−, so that T1 = T1+ + T1−. What processes give rise to the terms that occur, for example, in T1+? We have that

T1+ = λp A+ Σ_{i=1}^{∞} λπ^i Ki+.   (6.7)
We know that a Ki+ term arises from a sequence of i π spikes in which the synapse never returns to the OFF state once placed in the POT state by the first π spike. That is, Ki+ arises from the probability that the synapse never returns to the OFF state during the whole time interval [t0, Σ_{j=0}^{i} tj), which is P+(t1 + ⋯ + ti). If the synapse is in the POT state when a final p spike arrives at time Σ_{j=0}^{i} tj, then a change in synaptic strength by an amount A+ will occur. Thus, the term λp A+ K̃i+ is obtained from the spike sequence π^i p in which no stochastic transitions back to OFF ever occur and
in which the only transition back to OFF is induced by the final p spike. An identical argument holds for the term −λπ A− K̃i− for the p^i π sequence. T1+ is thus obtained by summing over all spike sequences πp, π²p, π³p, ..., in each of which there is only one transition back to the OFF state, inducing potentiation by an amount A+. This is similarly so for T1−. We refer to such processes as 1-transition processes. T1 is therefore the expected change in synaptic strength induced by all 1-transition processes.

Another 1-transition process is of the form π^i γ, where the γ represents the first stochastic transition back to the OFF state. Of course, such a process induces no change in synaptic strength and so does not contribute to T1. What is the expected number of spikes in all possible 1-transition processes, π^i p, p^i π, π^i γ, and p^i γ, any i > 0? The expected number of spikes in all the sequences π^i p is just

N_p+ = λp Σ_{i=1}^{∞} (i + 1) K̃i+,   (6.8)
and similarly for p^i π sequences, we have

N_π− = λπ Σ_{i=1}^{∞} (i + 1) K̃i−.   (6.9)
For the sequence π^i γ, the probability that the first stochastic transition back to the OFF state occurs after the ith π spike is P+(t1 + ⋯ + t_{i−1}) − P+(t1 + ⋯ + ti), i > 1, leading to K_{i−1}+ − Ki+, with K0+ ≡ 1, i > 0. Hence, the expected number of spikes in all the sequences π^i γ is

N_γ+ = Σ_{i=1}^{∞} i (λπ K̃_{i−1}+ − K̃i+),   (6.10)
and in all p^i γ sequences, we have

N_γ− = Σ_{i=1}^{∞} i (λp K̃_{i−1}− − K̃i−).   (6.11)
Now,

N_p+ + N_γ+ = Σ_{i=1}^{∞} [i λπ K̃_{i−1}+ − i K̃i+ + λp (i + 1) K̃i+]
            = λπ + Σ_{i=1}^{∞} [(i + 1) K̃i+ − i K̃i+]
            = λπ + Q+.   (6.12)
Similarly, N_π− + N_γ− = λp + Q−, and so the expected number of spikes in all 1-transitions is just 1 + Q+ + Q−, which is precisely N1, the denominator in the asymptotic expression for Sn. T1 is therefore the expected change in synaptic strength in all 1-transition processes, and N1 is the expected number of spikes in all 1-transition processes.

Consider some general sequence of n spikes. This sequence will decompose, by definition, into a chain of successive 1-transition processes defined by the synapse returning to the OFF state at the end of each 1-transition process. How many 1-transition processes do we expect, on average, to occur in a sequence of n spikes? Because the average length of a 1-transition process is N1 spikes, there are on average n/N1 1-transitions in a chain of n spikes. Each such 1-transition process induces, on average, a change in synaptic strength by an amount T1. The total change induced by n spikes is thus expected to be T1 × (n/N1), but this should just be approximately Sn, the expected change induced by n spikes. When n is sufficiently large, the statistical error should become negligible, so we should have

Sn ∼ n T1/N1,   (6.13)
which is, of course, equation 6.1. It is a remarkable fact that the asymptotic solutions that were derived somewhat laboriously from the recurrence relations can nevertheless be extracted so quickly by carving up an arbitrary sequence of spikes at its natural articulation points, these being the times at which the synapse has returned to the OFF state. This is possible because of the underlying Markovian nature of 1-transition processes: once the synapse has returned to the OFF state, its subsequent evolution is independent of its history at that point. It is therefore natural to think in terms of a chain of 1-transition processes, the end point of each of which is the OFF state.

7 Exact Solutions for Sn in the Nonresetting Model

We consider again the nonresetting model, r → ∞. Inspired by the above 1-transition analysis, we explicitly calculate Sn for small n by decomposing Sn into terms representing the final potentiating or depressing 1-transition process in the sequence of n spikes. Notice that the final 1-transition process need not terminate on the final spike of the sequence; it could terminate before, with the remaining spikes not forming a potentiating or depressing 1-transition process. Lengthy and tedious calculation then reveals that the A+-dependent parts of Sn, which we denote by Sn+, are given by

S2+ = λp A+ F0 K̃1+,
S3+ = λp A+ [F0 K̃2+ + (F0 + F1) K̃1+],
S4+ = λp A+ [F0 K̃3+ + (F0 + F1) K̃2+ + (F0 + F1 + F2) K̃1+],

where F0 = 1, F1 = F0 − Λ1, and F2 = F1 − (Λ2 − Λ1²). The expressions for the A−-dependent parts are given by the usual transformations of the A+-dependent parts: λπ ↔ λp and + ↔ −. Notice that F0, F1, and F2 are invariant under these transformations, and so the Sn− have identical Fi coefficients to those for the Sn+. In general, by this final 1-transition process decomposition, we know that we must be able to write

S_{n+1} = Σ_{i=1}^{n} G_{n−i} (λp A+ K̃i+ − λπ A− K̃i−),   (7.1)
where the λp A+ K̃i+ term represents the final 1-transition process contribution from a spike sequence π^i p, and −λπ A− K̃i− that from a p^i π sequence. They must have the same coefficient, G_{n−i}, because G_{n−i} essentially counts all the possible ways in which we arrive at the OFF state before the final 1-transition process occurs, and this must be independent of whether that final process is π^i p or p^i π, by the Markovian property of 1-transitions. Moreover, although Gi will be a function of the K̃j±, it must be a symmetric function of them, so that Gi → Gi under λπ ↔ λp and + ↔ −, because for any sequence Σi that arrives at the OFF state after i spikes, there is a complementary sequence Σ̄i that also arrives at the OFF state after i spikes, where Σ̄i is obtained from Σi by π ↔ p, that is, by replacing every π spike by a p spike and every p spike by a π spike. So the expansion in equation 7.1 is generic, exploiting the importance of the 1-transition processes. It remains to determine the coefficients Gi. Since Un = S_{n+1} − Sn, we have the expansion

Un = Σ_{i=1}^{n} F_{n−i} (λp A+ K̃i+ − λπ A− K̃i−),   (7.2)
where F0 = G0 = 1 and Fi = Gi − G_{i−1}, i > 0, or Gi = Σ_{j=0}^{i} Fj. So, equivalently, we can determine instead the coefficients Fi rather than Gi. Although we are now interested only in the nonresetting model, because we know how to extract the r-reset model from it, a determination of the coefficients Fi depends critically on the recurrence relations for the r-reset models. In fact, we can explicitly calculate the sequence of functions Sn, n = 2, 3, 4, ..., for the nonresetting model in an incremental manner from the 1-, 2-, 3-, ..., reset models in the following way. Consider S2. The sequences that could possibly contribute to the function are ππ, πp, pπ, and pp. The two sequences πp and pπ never probe any resetting properties because we need a chain of at least two identical spikes for this. Although
the sequences ππ and pp do probe 1-resetting properties, they give exactly zero contribution to the function. Hence, the function S2 is independent of r for r ≥ 1, and we can therefore use the r = 1 recurrence relations in equations 4.16 and 4.17, with n = 0 and S0 = 0 and S1 = 0 to calculate it. Now consider S3 . Eight sequences could possibly contribute to it. Only πππ and ppp could probe 2-resetting properties, but again they give zero contribution. All the other sequences probe at most one-resetting properties. Thus, S3 is independent of r for r ≥ 2. We can therefore use the r = 2 form of the recurrence relations in equations 4.53 and 4.54, with n = 0 and the initial values S0 = 0, S1 = 0 and S2 calculated from the r = 1 recurrence relations, to calculate S3 . Similarly, S4 is independent of r for r ≥ 3, and we can use the r = 3 form of equations 4.53 and 4.54 with n = 0 to calculate it, using the forms of Si , i ≤ 3, already established. In general, therefore, we see that Sm+1 is independent of r for r ≥ m because m + 1 spikes cannot probe the m-resetting properties with nonzero contributions, so we can use the r = m form of equations 4.53 and 4.54 with n = 0 and all the previously established forms of Si , i ≤ m, to calculate Sm+1 . The important thing to notice about this procedure is that we always set S0 = 0 and S1 = 0. This means that the corresponding terms in equations 4.53 and 4.54 with n = 0 do not contribute to the calculation of Sm+1 for r ≥ p p π m. But it was precisely the S1π and S1 terms (the Sn+1 and Sn+1 terms for nonzero n) in particular that prevented us before from simply adding equations 4.53 and 4.54 to obtain an (r + 1)th-order recurrence relation entirely in Sm . So, now with S0 = 0 and S1 = 0 (and, in particular, p S1π = 0 and S1 = 0), we are free to add the equations, obtaining
S_{m+1} = (λp A+ Qm+ − λπ A− Qm−) + Σ_{i=0}^{m−2} (Λi − Λ_{i+1}) S_{m−i},   (7.3)

valid for m ≥ 1 with Σ_{i=0}^{m−2} ≡ 0 for m = 1, and we have used Λ0 ≡ 1. After a little rearrangement, we have

Σ_{i=0}^{m−1} Λi U_{m−i} = λp A+ Qm+ − λπ A− Qm−,   (7.4)
where we have used the definition Ui = Si+1 − Si . This recurrence relation succinctly encodes the procedure described above and thus determines the form of Sm+1 for an r -reset model in which r ≥ m. Taking the limit r → ∞ to give the nonresetting model, equation 7.4 is valid for any m ≥ 1 and represents a simple recurrence relation to calculate the form of Um , any m, in the nonresetting model.
Substituting the expansion in equation 7.2 into the recurrence relation in equation 7.4 and looking at, say, only the A+-dependent piece, we have

Σ_{i=0}^{m−1} Λi Σ_{j=1}^{m−i} F_{m−i−j} K̃j+ = Qm+,   (7.5)

and we can rearrange the double sum to give

Σ_{i=1}^{m} K̃i+ Σ_{j=0}^{m−i} Λj F_{m−i−j} = Qm+.   (7.6)
Since Qm+ = Σ_{i=1}^{m} K̃i+, we see that equation 7.2 is a solution of equation 7.4 provided that the Fi satisfy the recurrence relation

Σ_{i=0}^{n} Fi Λ_{n−i} = 1,   (7.7)

for any n. With F0 = Λ0 = 1, we can compute the first few Fi's, giving agreement with those calculated explicitly above. We now write Fi = Σ_{j=0}^{i} Cj, with C0 ≡ 1, so that F_{i+1} = Fi + C_{i+1}. By subtracting Σ_{i=0}^{n} Fi Λ_{n−i} from Σ_{i=0}^{n+1} Fi Λ_{n+1−i}, we see that the Ci must satisfy the recurrence relation

Σ_{i=0}^{n} Ci Λ_{n−i} = 0,   (7.8)

for n ≥ 1, with C0 ≡ 1. It is easy to see that the Ci are generated by the function

Σ_{i=0}^{∞} Ci x^i = 1 / (Σ_{i=0}^{∞} Λi x^i).
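The recurrences 7.7 and 7.8, together with the partial-sum relation Fi = Σ_{j≤i} Cj, can be verified directly by solving each for its highest-index term. The Λj values below are illustrative only:

```python
# Solve eq. 7.7 for F_n and eq. 7.8 for C_n, then check F_n = sum_{j<=n} C_j.
Lam = [1.0] + [0.5 ** j for j in range(1, 13)]  # Lambda_0 = 1; the rest illustrative

N = 12
F = [1.0]                        # F_0 = 1
C = [1.0]                        # C_0 = 1
for n in range(1, N + 1):
    # eq. 7.7: sum_{i=0}^{n} F_i Lam_{n-i} = 1  =>  solve for F_n (Lam_0 = 1)
    F.append(1.0 - sum(F[i] * Lam[n - i] for i in range(n)))
    # eq. 7.8: sum_{i=0}^{n} C_i Lam_{n-i} = 0  =>  solve for C_n
    C.append(-sum(C[i] * Lam[n - i] for i in range(n)))

partial_sums = [sum(C[: n + 1]) for n in range(N + 1)]
max_err = max(abs(F[n] - partial_sums[n]) for n in range(N + 1))
```

Summing equation 7.8 over all orders up to n telescopes to equation 7.7, which is why the partial sums of the Ci reproduce the Fi exactly.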
Summarizing these various results, we have that

Un = Σ_{i=1}^{n} F_{n−i} (λp A+ K̃i+ − λπ A− K̃i−),
S_{n+1} = Σ_{i=1}^{n} G_{n−i} (λp A+ K̃i+ − λπ A− K̃i−),   (7.9)
where Gi = Σ_{j=0}^{i} Fj and Fi = Σ_{j=0}^{i} Cj, and the Ci's are generated by the function 1/(Σ_{i=0}^{∞} Λi x^i). By simple rearrangements of the sums in these two representations of Un and S_{n+1}, we also have

Un = Σ_{i=1}^{n} C_{n−i} (λp A+ Qi+ − λπ A− Qi−),
S_{n+1} = Σ_{i=1}^{n} F_{n−i} (λp A+ Qi+ − λπ A− Qi−).
These equivalent expressions represent the exact solution of the nonresetting model for any n. Of course, the r-reset model is obtained by the standard replacement K_{lr+s} → K_r^l K_s throughout.

8 Analytical and Simulated Results

In this section, we turn to a comparison between the analytical results derived above and numerical results obtained by computational simulation of the r-reset model. We are particularly concerned with establishing how rapidly (1/n) Sn converges to the analytical asymptotic results; establishing that the nonresetting model's asymptotic result matches a large n simulation of the same model, confirming indeed that the r → ∞ limit of the r-reset model's asymptotic result really is the asymptotic result of the nonresetting model; and establishing how rapidly the r-reset model tends to the nonresetting model as r → ∞.

We run simulations of the stochastic model as set out in earlier work (Appleby & Elliott, 2005), with the standard set of parameters used there:

A+ = 1.00
A− = 0.95
τ+ = 13.3 ms
τ− = 20.0 ms
ν+ = 3
ν− = 3
These parameters were chosen because they reproduce a large variety of the experimental results available in the literature, including the replication of the spike triplet results (Froemke & Dan, 2002). These parameters also guarantee the presence of both a depressing and a potentiating regime in the function S2, so that depression and potentiation can both occur at the ensemble level. In most of the simulations below, each simulation is run over exactly 10^7 spikes. Thus, in obtaining the numerical form of (1/n) Sn, we perform 10^7/n separate runs over n spikes and average over all these runs.
P. Appleby and T. Elliott
[Figure 1 appears here: change in synaptic strength per spike versus presynaptic firing rate (Hz).]
Figure 1: Analytical and numerical values of $\frac{1}{n}S_n$ plotted as a function of $\lambda_\pi$ for the one-reset model for various values of $n$. The solid lines show the numerical results for $n = 2, 4, 8, 16, 32$, and $64$ spikes, while the corresponding dotted lines show the exact, analytical result. The $n = 2$ pair of lines corresponds to the bottom pair, $n = 4$ the next pair up, and so on, the function $\frac{1}{n}S_n$ increasing as a function of $n$ for fixed $\lambda_\pi$. The dashed line shows the exact asymptotic limit, $n \to \infty$.
Of course, as $n$ increases, there are fewer runs over which to average, but there is more averaging within each sequence of $n$ spikes. This means that the statistical error should be roughly equivalent for any value of $n$, this being the reason for keeping the total number of spikes fixed at $10^7$. $S_n$ is, of course, a function of both $\lambda_\pi$ and $\lambda_p$. Previously, we set

$$\lambda_p = \begin{cases} \lambda_\pi - \theta & \text{for } \lambda_\pi \ge \theta \\ 0 & \text{otherwise,} \end{cases} \tag{8.1}$$

as a simple model of postsynaptic firing with a threshold set at $\theta = 5$ Hz. This gives a cross section through the $\lambda_\pi$–$\lambda_p$ plane. We will also show contour plots of $\frac{1}{n}S_n$ in the $\lambda_\pi$–$\lambda_p$ plane. In Figure 1, we investigate how rapidly $\frac{1}{n}S_n$ approaches the asymptotic limit derived above. We use the $r = 1$ reset model for convenience, since we also have very simple, exact solutions against which to compare the numerical results. We see very rapid convergence to the asymptotic limit, so that even $n = 8$ gives the limit to within a few percent. This means that after $n \approx 10$ spikes, we expect to see the scaling property observed earlier for the one-reset model, so that doubling the spike train length merely doubles
[Figure 2 appears here: change in synaptic strength per spike versus presynaptic firing rate (Hz).]
Figure 2: Numerical values of $\frac{1}{n}S_n$ plotted as a function of $\lambda_\pi$ for the nonresetting model for various values of $n$. The solid lines again show the numerical results for $n = 2, 4, 8, 16, 32$, and $64$ spikes, with the bottom line corresponding to $n = 2$, the next line up $n = 4$, and so on, $\frac{1}{n}S_n$ exhibiting the same monotonicity as a function of $n$ for fixed $\lambda_\pi$ as in the one-reset model. The dashed line shows the exact, analytical result for the asymptotic limit of the nonresetting model.
the expected change in synaptic efficacy, adding no new qualitative features to the dynamics.

The results for the nonresetting model are shown in Figure 2. Here we do not plot the corresponding finite-$n$ analytical results, but we show the analytical asymptotic limit. This figure demonstrates that the $r \to \infty$ limit of the asymptotic result for the $r$-reset model is indeed the asymptotic result for the nonresetting model.

We address the question of how rapidly the $r$-reset model converges to the nonresetting model as $r \to \infty$ in Figure 3. We see that even the 4-reset model is extremely close to the nonresetting model, and the 8-reset model is virtually indistinguishable from it. For $\lambda_\pi \approx \lambda_p$, the probabilities that any given spike is presynaptic or postsynaptic are both approximately 0.5. Thus, the probability of a long sequence of identical spikes in this case is maximally suppressed, and the contributions of such sequences to $S_n$ are weighted by these small probabilities. However, to resolve the difference between the $r$-reset model and the $(r+1)$-reset model for large $r$, we require precisely such long sequences of identical spikes. Hence, we should not be too surprised by this rapid convergence of the $r$-reset model to the nonresetting model.

In Figures 4, 5, and 6 we show contour plots for the analytical form of $\frac{1}{n}S_n$ for the one-reset model for $n = 2$, $n = 3$, and the asymptotic limit $n \to \infty$, respectively. Contour plots for other values of $n$ are all very similar
[Figure 3 appears here: change in synaptic strength per spike versus presynaptic firing rate (Hz).]
Figure 3: The analytical asymptotic limit of $\frac{1}{n}S_n$ for the $r$-reset model plotted as a function of $\lambda_\pi$ for various values of $r$. Shown are results for $r = 1, 2, 4, 8, 16$, and the nonresetting model. The top curve corresponds to $r = 1$, the next one down $r = 2$, and so on, $\frac{1}{n}S_n$ monotonically decreasing as a function of $r$ for fixed $\lambda_\pi$. The curves for the 8-reset, 16-reset, and nonresetting models are virtually identical, almost superimposed on top of each other.
to those for $n = 3$ and $n \to \infty$. These trends are also observed, for example, for the nonresetting model. We see a very significant difference between the $n = 2$ landscape and that for all other values of $n$. This shows that although there tends to be an emphasis in the literature on the form of the two-spike rule, the multispike rules for $n > 2$, at least in our model, exhibit qualitatively very different dynamics. We explore these differences elsewhere, showing that they have important consequences for the presence and nature of competitive dynamics in these models (Appleby & Elliott, 2006).

9 Discussion

In this letter, we have extended our earlier ensemble-based, stochastic model of STDP in order to derive exact, analytical results for general $n$-spike interactions for $n > 2$. By extending the model to include stochastic process resetting, we essentially constructed a mathematical ladder that we could ascend in order to arrive at exact results for the nonresetting model. Once there, knowing how to extract the general $r$-reset model from the nonresetting model, we could then, to paraphrase Wittgenstein, kick away the ladder of resetting. As suggested earlier, the most biologically plausible forms of these various resetting models are, arguably, the 1-reset model and the nonresetting model. Other finite-$r$ forms with $r \neq 1$ require us to postulate
[Figure 4 appears here: postsynaptic firing rate (Hz) versus presynaptic firing rate (Hz).]
Figure 4: A contour plot of the function $\frac{1}{n}S_n$ for the 1-reset model, with $n = 2$, in the $\lambda_\pi$–$\lambda_p$ plane. Because the form of $S_2$ is independent of $r$, this is a contour plot of equation 2.8. Black areas represent minimum values and white areas maximum values, and shades of gray interpolate between these extremes. The minimum value on this partial plane is $-0.0122$, and the maximum value is $+0.0059$. The zero contour intersects both axes at values $\lambda_\pi, \lambda_p \approx 92$ Hz.
some kind of spike-counting machinery at individual synapses. Yet one of the original motivations for considering an ensemble-based model of STDP was an attempt to remove some of the computational burdens imposed on single synapses by the assumption that they can, at some appropriate level, compute and represent complicated, temporally dependent quantities and adjust their strengths accordingly.

Although the general results for $\frac{1}{n}S_n$ for all but the 1-reset model are somewhat cumbersome, in fact we find rapid convergence of $\frac{1}{n}S_n$ to its asymptotic, large-$n$ form, so that even by approximately 10 spikes, $\frac{1}{n}S_n$ is within just a few percent of the limiting result. Furthermore, in terms of the general form of the function $S_n$ in the $\lambda_\pi$–$\lambda_p$ plane, we find no qualitative changes in the structure of the landscape for any $n \ge 3$, at least for the parameter set used here. These two facts lead us to suppose that any properties exhibited by the asymptotic form of the $n$-spike function will also be exhibited by all the $n$-spike functions for $n \ge 3$. Of course, we expect quantitative differences, but the overall dynamics, as determined by the overall structure of the function, should be similar.
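The rapid convergence of the $r$-reset model to the nonresetting model, discussed in section 8, has a simple probabilistic reading that is easy to quantify. For independent Poisson pre- and postsynaptic trains, each successive spike is presynaptic with probability $\lambda_\pi/(\lambda_\pi + \lambda_p)$, so a run of $r$ identical spikes (which is what distinguishes the $r$-reset from the $(r+1)$-reset model) occurs with exponentially small probability. A minimal sketch:

```python
def run_probability(r, lam_pi, lam_p):
    """Probability that the next r spikes are all presynaptic, for
    independent Poisson pre/post trains: each spike is presynaptic
    with probability lam_pi / (lam_pi + lam_p)."""
    p = lam_pi / (lam_pi + lam_p)
    return p ** r
```

At $\lambda_\pi = \lambda_p$, a run of length 8 already has probability $0.5^8 \approx 0.004$, consistent with the 8-reset and nonresetting models being nearly indistinguishable.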
[Figure 5 appears here: postsynaptic firing rate (Hz) versus presynaptic firing rate (Hz).]
Figure 5: A contour plot of the function $\frac{1}{n}S_n$ for the 1-reset model, with $n = 3$, in the $\lambda_\pi$–$\lambda_p$ plane. The minimum value on this partial plane is $-0.0284$, and the maximum value is $+0.0342$. The zero contour intersects the $\lambda_\pi$ axis at $\lambda_\pi \approx 14$ Hz.
In our earlier work (Appleby & Elliott, 2005), we considered purely numerical simulations of $n$-spike interactions for $n > 2$. There, we looked specifically at the cross section of the $\lambda_\pi$–$\lambda_p$ landscape corresponding to equation 8.1 and found little quantitative difference between the 2-spike and the general $n$-spike results, all forms exhibiting a BCM-like profile with depression at low presynaptic rates switching over to potentiation at higher presynaptic rates. This observation reflects the highly specific cut through the $\lambda_\pi$–$\lambda_p$ plane. Other cuts reveal profound differences between the 2-spike and the general $n$-spike results, as seen in Figures 4, 5, and 6.

The particular model of postsynaptic firing in equation 8.1 might be expected to be fairly reasonable (modulo issues of saturation, etc.) for a target cell innervated by precisely one afferent, so that the postsynaptic response follows the presynaptic input closely. However, for a target cell innervated by multiple afferents, it is entirely possible for there to be large differences between some of the presynaptic rates and the postsynaptic rate, because the postsynaptic rate will reflect all the inputs. We are then led to examine the whole $\lambda_\pi$–$\lambda_p$ plane, which reveals these differences. Indeed, it is these differences between the 2-spike and the general $n$-spike interaction functions, observed initially numerically, that motivated our derivation of the above analytical results for the $n$-spike function. As we
[Figure 6 appears here: postsynaptic firing rate (Hz) versus presynaptic firing rate (Hz).]
Figure 6: A contour plot of the function $\frac{1}{n}S_n$ for the 1-reset model, in the asymptotic limit $n \to \infty$, in the $\lambda_\pi$–$\lambda_p$ plane. The minimum value on this partial plane is $-0.1071$, and the maximum value is $+0.1148$. The zero contour intersects the $\lambda_\pi$ axis at $\lambda_\pi \approx 8$ Hz.
discuss elsewhere (Appleby & Elliott, 2006), the 2-spike interaction function S2 is, in fact, pathological, and without additional constraints or modifications, it is noncompetitive in nature. However, the general n-spike interaction function in our model for n > 2 leads to stable, competitive dynamics all by itself, without any additional constraints or modifications. This is a remarkable difference, all the more so because much of the experimental and theoretical literature on STDP is devoted to a study of the interaction between only two spikes. We thus wanted to understand this difference mathematically, and so needed to derive general, analytical results for the n-spike interaction function. The focus of this letter has been this rather lengthy derivation. In deriving the results above, we have assumed independent pre- and postsynaptic Poisson firing at fixed rates λπ and λ p . Two major issues arise from this assumption. First, synaptic plasticity will typically induce changes in the rate of postsynaptic firing, for fixed presynaptic rate, contrary to our assumption of fixed postsynaptic rate. Our assumption of fixed preand postsynaptic rates corresponds, however, to the standard experimental protocol in which the pre- and postsynaptic neurons are clamped and induced to fire at fixed rates, regardless of any associated changes in synaptic strength. Our results therefore represent exact predictions about the
multispike rules that may be observed under such experimental protocols. Furthermore, provided that the induced changes in synaptic strength in the unclamped situation are small compared to the overall synaptic strength, or, equivalently, provided that the change in the postsynaptic firing rate is small, then any discrepancy at second order between our fixed-rate rules and the rules allowing for variable rates will be negligible. Hence, our fixed-rate rules may be employed over sufficiently short epochs of firing, even though those epochs may result in small changes in rates, as we show elsewhere (Appleby & Elliott, 2006). Although we derive above asymptotic limits for our rules that entail a consideration of an infinite number of spikes (and thus infinite time), we have shown that the asymptotic limit is achieved very rapidly, within approximately 10 spikes. Presumably, 10 spikes are typically not enough to induce a large change in the postsynaptic firing rate.

Second, of course, postsynaptic spiking events are not really independent of presynaptic spiking events, contrary to our assumption. However, for a sufficiently large number of synaptic inputs from multiple afferents, the dependence of a postsynaptic spiking event on any single presynaptic spiking event will be negligible, and, to a first approximation, we can regard them as independent (although, of course, the rates will not be independent). Moreover, in replacing the postsynaptic Poisson neuron in our study with a full numerical implementation of an integrate-and-fire neuron, we find that the synaptic dynamics exhibited by the multispike rules in the independent Poisson case are qualitatively unchanged in the integrate-and-fire case (unpublished observations). Hence, the assumption of independence does not have major consequences for our results.
Having derived explicit expressions for the n-spike interaction functions, we may directly compare our switch model to competing models of STDP and the underlying experimental data. A variety of experiments have demonstrated that a biphasic, STDP-like curve apparently governs spike-pair interactions (Bell et al., 1997; Bi & Poo, 1998; Zhang et al., 1998). Reproduction of this 2-spike result has been adopted by the community as the basic test of the validity of a model of STDP. In keeping with this view, models of STDP, our own included, are often initially formulated as descriptions of 2-spike interactions, and we therefore expect to find that all such models are consistent with these results. Accordingly, we find that our switch model produces predictions comparable to other models of STDP when we consider the interaction of precisely two spikes (Song et al., 2000; van Rossum et al., 2000; Castellani et al., 2001; Senn et al., 2001; Karmarkar & Buonomano, 2002; Shouval et al., 2002). In addition, we find a requirement under our model that depression dominates potentiation overall, which has empirically been shown to be a requirement in other models of STDP (Song et al., 2000). Although various qualitative differences exist between the various STDP curves produced under different models, the quality of experimental data, in particular the level of noise around the
transition from potentiation to depression as the temporal ordering of pre- and postsynaptic spiking is reversed, makes it difficult to distinguish one model from another on the basis of 2-spike interactions alone. We are thus led to examine the predictions of the different models of STDP when we consider more than two spikes. It is in this multispike domain that we find significant differences between the predictions of our model of STDP and those of competing models. In other models of STDP, a multispike train is (often implicitly) considered to be broken down into a series of spike pairs that are then summed in a linear manner according to the underlying STDP rule. This is not the case in our switch model. Instead, we find that a multispike train carries with it higher-order spike interactions that are not present if the train is broken down into a linear sum of spike pairs. These higher-order spike interactions, which are not explicitly built into the model but arise naturally from its intrinsic structure, bestow the 3-spike function with a very different character compared to that of the 2-spike function. We find that these higher-order interactions are sufficient to explain the spike-triplet data of Froemke and Dan (2002) (Appleby & Elliott, 2005), a result that other models of STDP can achieve only by the introduction of additional nonlinearities such as spike suppression (Froemke & Dan, 2002).

Even in retrospect, it is astonishing that many of our results can be derived exactly. This is partly due to the use of a stochastic model with intrinsic Markovian properties and partly because the model is actually quite simple in its overall structure. Simplicity and analytical tractability are generally powerful virtues of mathematical models, even when those models are known to be only very crude approximations of the underlying dynamics being studied.
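For concreteness, the pair-based decomposition contrasted above can be written out explicitly: under it, the total change for a spike train is just the linear sum of the canonical exponential STDP window over all pre/post spike pairs. The window parameters below reuse the values from section 8; the function itself is our illustration of what other models (often implicitly) assume, not part of the switch model:

```python
import numpy as np

def pair_based_change(pre_times, post_times,
                      A_plus=1.00, A_minus=0.95,
                      tau_plus=13.3, tau_minus=20.0):
    """Linear pair-based summation of the exponential STDP window.

    Times in ms. Pre-before-post pairs potentiate; post-before-pre
    pairs depress. No higher-order interactions by construction.
    """
    dw = 0.0
    for tp in pre_times:
        for tq in post_times:
            dt = tq - tp               # post minus pre
            if dt > 0:
                dw += A_plus * np.exp(-dt / tau_plus)
            elif dt < 0:
                dw -= A_minus * np.exp(dt / tau_minus)
    return dw
```

Any triplet effect in such a scheme is, by construction, the sum of its constituent pair effects, which is precisely the property the switch model does not share.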
It is their capacity for explicit, mathematical understanding that makes such models so useful and permits the development of general intuitions about the manner in which the models behave. In contrast, more complicated models are often ill understood, and resort to purely numerical simulation is frequently necessary even at the very first step. Here we have a trade-off between understanding and the accuracy of any predictions resulting from the two types of model. In a mature subject such as physics, the simple models have been built, studied, and used to develop intuitions, and more complicated models can then be constructed that permit highly detailed (if poorly understood) predictions to be made about fundamental processes. In neurobiology, however, our general belief is that theoretical studies are still in their infancy, and therefore models still need to be simple and, above all, mathematically tractable in order to seek to develop the necessary conceptual framework with which more complicated systems can then be studied. The n-spike interaction functions derived above are unconditional expectation values that represent, in essence, rate-based learning rules in which the underlying spiking processes have been integrated out and summed over. Such rate-based rules are, as we described them before (Appleby &
Elliott, 2005), doubly emergent, because they depend on the emergence of STDP at the synaptic or temporal ensemble level in our model, and then this STDP is itself averaged over time and over many spike patterns, giving rise to the rate-based behavior. Of course, the spike-based behavior is more fundamental, and in particular it more accurately describes the behavior of the neuron or synapse over short time courses and exhibits the transition to rate-based-like behavior over longer time courses. Nevertheless, if we are interested in studying, for example, specifically long-term plasticity phenomena (such as the development of cortical structures like ocular dominance columns), it is, we would argue, more appropriate to analyze the behavior in terms of the emergent, rate-based rules. There is a temptation to believe that a model is inadequate if it does not capture every last detail of all known processes. However, finding the appropriate level of approximation for studying a system is often a much more powerful strategy, reducing complexity and increasing understanding. For example, it would be pointless to argue that fluids are not really incompressible and are actually made of molecules, atoms, and, ultimately, quarks and leptons, and seek to replace the Navier-Stokes equations with equations in terms of quantum fields. Such a fine-grained analysis is unnecessary precisely because of emergence. We therefore wonder whether the emergent rate-based rules in our model are not also a more appropriate level of description for understanding certain forms of synaptic plasticity. In conclusion, we have modified our earlier model of STDP in order to derive exact, analytical results for the n-spike interaction function. Elsewhere (Appleby & Elliott, 2006), we focus on the profound differences between the 2-spike interaction function and higher-order interaction functions and the consequences for the presence of stable, competitive dynamics in our model. 
Acknowledgments

P.A.A. thanks the University of Southampton for the support of a studentship. T.E. thanks the Royal Society for the support of a University Research Fellowship during the initial stages of this work.

References

Appleby, P. A., & Elliott, T. (2005). Synaptic and temporal ensemble interpretation of spike timing dependent plasticity. Neural Comput., 17, 2316–2336.

Appleby, P. A., & Elliott, T. (2006). Stable competitive dynamics emerge from multispike interactions in a stochastic model of spike-timing-dependent plasticity. Neural Comput., 18, 2414–2464.

Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.
Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.

Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2, 32–48.

Burkitt, A. N., Meffin, H., & Grayden, D. B. (2004). Spike-timing-dependent plasticity: The relationship to rate-based learning for models with weight dynamics determined by a stable fixed point. Neural Comput., 16, 885–940.

Castellani, G. C., Quinlan, E. M., Cooper, L. N., & Shouval, H. Z. (2001). A biophysical model of bidirectional synaptic plasticity: Dependence on AMPA and NMDA receptors. Proc. Natl. Acad. Sci. USA, 98, 12772–12777.

Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.

Izhikevich, E. M., & Desai, N. S. (2003). Relating STDP to BCM. Neural Comput., 15, 1511–1523.

Karmarkar, U. R., & Buonomano, D. V. (2002). A model of spike-timing dependent plasticity: One or two coincidence detectors? J. Neurophysiol., 88, 507–513.

O'Conner, D. H., Wittenberg, G. M., & Wang, S. S.-H. (2005). Graded bidirectional synaptic plasticity is composed of switch-like unitary events. Proc. Natl. Acad. Sci. USA, 102, 9679–9684.

Petersen, C. C. H., Malenka, R. C., Nicoll, R. A., & Hopfield, J. J. (1998). All-or-none potentiation at CA3-CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732–4737.

Senn, W., Markram, H., & Tsodyks, M. (2001). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Comput., 13, 35–67.

Shouval, H. Z., Bear, M. F., & Cooper, L. N. (2002). A unified model of NMDA receptor-dependent bidirectional synaptic plasticity. Proc. Natl. Acad. Sci. USA, 99, 10831–10836.

Song, S., Miller, K., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3, 919–926.

van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20, 8812–8821.

Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Received September 7, 2005; accepted August 2, 2006.
LETTER
Communicated by Yaser Yacoob
3D Periodic Human Motion Reconstruction from 2D Motion Sequences

Zonghua Zhang
[email protected]

Nikolaus F. Troje
[email protected]

BioMotionLab, Department of Psychology, Queen's University, Kingston, Ontario K7M 3N6, Canada
Neural Computation 19, 1400–1421 (2007)
© 2007 Massachusetts Institute of Technology

We present and evaluate a method of reconstructing three-dimensional (3D) periodic human motion from two-dimensional (2D) motion sequences. Using Fourier decomposition, we construct a compact representation for periodic human motion. A low-dimensional linear motion model is learned from a training set of 3D Fourier representations by means of principal component analysis. Two-dimensional test data are projected onto this model with two approaches: least-squares minimization and calculation of a maximum a posteriori probability using Bayes' rule. We present two different experiments in which both approaches are applied to 2D data obtained from 3D walking sequences projected onto a plane. In the first experiment, we assume the viewpoint is known. In the second experiment, the horizontal viewpoint is unknown and is recovered from the 2D motion data. The results demonstrate that by using the linear model, not only can missing motion data be reconstructed, but unknown view angles for 2D test data can also be retrieved.

1 Introduction

Human motion contains a wealth of information about the actions, intentions, emotions, and personality traits of a person, and human motion analysis has widespread applications in surveillance, computer games, sports, rehabilitation, and biomechanics. Since the human body is a complex articulated geometry overlaid with deformable tissues and skin, motion analysis is a challenging problem for artificial vision systems. General surveys of the many studies on human motion analysis can be found in recent review articles (Gavrila, 1999; Aggarwal & Cai, 1999; Buxton, 2003; Wang, Hu, & Tan, 2003; Moeslund & Granum, 2001; Aggarwal, 2003; Dariush, 2003). The existing approaches to human motion analysis can be roughly divided into two categories: model-based methods and model-free methods. In the model-based methods, an a priori human model is used to represent the observed
subjects; in model-free methods, the motion information is derived directly from a sequence of images. The main drawback of model-free methods is that they are usually designed to work with images taken from a known viewpoint. Model-based approaches support viewpoint-independent processing and have the potential to generalize across multiple viewpoints (Moeslund & Granum, 2001; Cunado, Nixon, & Carter, 2003; Wang, Tan, Ning, & Hu, 2003; Jepson, Fleet, & El-Maraghi, 2003; Ning, Tan, Wang, & Hu, 2004).

Most of the existing research (Gavrila, 1999; Aggarwal & Cai, 1999; Buxton, 2003; Wang, Hu, & Tan, 2003; Moeslund & Granum, 2001; Aggarwal, 2003; Dariush, 2003; Cunado et al., 2003; Wang, Tan et al., 2003; Jepson et al., 2003; Ning et al., 2004) has focused on the problem of tracking and recognizing human activities through motion sequences. In this context, the problem of reconstructing 3D human motion from 2D motion sequences has received increasing attention (Rosales, Siddiqui, Alon, & Sclaroff, 2001; Sminchisescu & Triggs, 2003; Urtasun & Fua, 2004a, 2004b; Bowden, 2000; Bowden, Mitchell, & Sarhadi, 2000; Ong & Gong, 2002; Yacoob & Black, 1999). The importance of 3D motion reconstruction stems from applications such as surveillance and monitoring, human body animation, and 3D human-computer interaction. Unlike 2D motion, which is highly view dependent, 3D human motion can provide robust recognition and identification. However, existing systems need multiple cameras to obtain 3D motion information. If 3D motion can be reconstructed from a single camera viewpoint, many potential applications open up. For instance, using 3D motion reconstruction, one could create a virtual actor from archival footage of a movie star, a difficult task for even the most skilled modelers and animators.
3D motion reconstruction could be used to track human body activities in real time (Arikan & Forsyth, 2002; Grochow, Martin, Hertzmann, & Popović, 2004; Chai & Hodgins, 2005; Agarwal & Triggs, 2004; Yacoob & Black, 1999). Such a system may be used as a new and effective human-computer interface for virtual reality applications. In the model of Kakadiaris and Metaxas (2000), motion estimation of human movement was obtained from multiple cameras. Bowden and colleagues (Bowden, 2000; Bowden et al., 2000) used a statistical model to reconstruct 3D postures from monocular image sequences. Just as a 3D face can be reconstructed from a single image using a morphable model (Blanz & Vetter, 2003), they reconstructed the 3D structure of a subject from a single view of its outline. Ong and Gong (2002) discussed three main issues in the linear combination method: choosing the examples to build a model, learning the spatiotemporal constraints on the coefficients, and estimating the coefficients. They applied their method to track moving 3D skeletons of humans. These models (Bowden, 2000; Bowden et al., 2000; Ong & Gong, 2002) are based on separate poses and do not use the temporal
information that connects them. The result is a series of reconstructed 3D human postures. Using principal component analysis (PCA), Yacoob and Black (1999) built a parameterized model for image sequences to model and recognize activity. Urtasun and Fua (2004a) presented a motion model to track the human body and then (Urtasun & Fua, 2004b) extended it to characterize and recognize people by their activities. Leventon and Freeman (1998) and Howe, Leventon, and Freeman (1999) studied reconstruction of human motion from image sequences using Bayes' rule, which is in many respects similar to our approach. However, they did not present a quantitative evaluation of the 3D motion reconstructions corresponding to the missing dimension. For viewpoint reconstruction from motion data, some researchers (Giese & Poggio, 2000; Agarwal & Triggs, 2004; Ren, Shakhnarovich, Hodgins, Pfister, & Viola, 2005) quantitatively evaluated the performance of their motion models.

Incorporating temporal data into model-based methods requires correspondence-based representations, which separate the overall information into range-specific information and domain-specific information (Ramsay & Silverman, 1997; Troje, 2002a). In the case of biological motion data, range-specific information refers to the state of the actor at a given time in terms of the location of a number of feature points. Domain-specific information refers to when a given position occurs. Yacoob and Black (1999) addressed temporal correspondence in their walking data by assuming that all examples had been temporally aligned. Urtasun and Fua (2004a, 2004b) chose one walking cycle with the same number of samples. Giese and Poggio (2000) presented a learning-based approach for the representation of complex motion patterns based on linear combinations of prototypical motion sequences. Troje (2002a) developed a framework that transformed biological motion data into a linear representation using PCA.
This representation was used to construct a sex classifier with a reasonable classification performance. These authors (Troje, 2002a; Giese & Poggio, 2000) pointed out that finding spatiotemporal correspondences between motion sequences is the key issue for the development of efficient models that perform well on reconstruction, recognition, and classification tasks. Establishing spatiotemporal correspondences and using them to register the data to a common prototype is a prerequisite for designing a generative linear motion model. Our input data consist of the motion trajectories of discrete marker points, that is, spatial correspondence is basically solved. Since we are working with periodic movements—normal walking—temporal correspondence can be established by a simple linear time warp, which is defined in terms of the frequency and the phase of the walking data. This simple linear warping function can get much more complex when dealing with nonperiodic movement, and suggestions on how to expand our approach to other movements are discussed below.
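A minimal sketch of such a linear time warp, assuming the fundamental frequency and phase of the gait cycle have already been estimated (the function and parameter names here are ours, not the paper's): each frame is assigned a normalized phase in $[0, 1)$, and the trajectory is resampled onto a fixed phase grid so that all walkers become directly comparable.

```python
import numpy as np

def normalize_cycle(signal, fps, freq_hz, phase, n_samples=64):
    """Resample one coordinate of a periodic trajectory onto a fixed
    grid of normalized gait-cycle phases.

    signal: per-frame values; fps: capture rate; freq_hz, phase:
    estimated gait frequency and phase offset (in cycles).
    """
    t = np.arange(len(signal)) / fps
    cycle_phase = (freq_hz * t + phase) % 1.0        # phase in [0, 1)
    order = np.argsort(cycle_phase)                  # np.interp needs sorted x
    grid = np.arange(n_samples) / n_samples
    return np.interp(grid, cycle_phase[order], np.asarray(signal)[order])
```

Because the warp is linear in time, it reduces temporal correspondence to estimating just two numbers per walker (frequency and phase), which is what makes periodic movements such a convenient starting point.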
In this letter, human walking is chosen as an example to study 3D periodic motion reconstruction from 2D motion sequences. On the one hand, locomotion patterns such as walking and running are periodic and highly stereotyped. On the other hand, walking contains information about the individual actor. Keeping the body upright and balanced during locomotion requires a high level of interaction of the central nervous system, sensory systems (including the proprioceptive, vestibular, and visual systems), and motor control systems. The solutions to the problem of generating a stable gait depend on the masses and dimensions of the particular body and its parts and are therefore highly individualized. Human walking is thus characterized not only by abundant similarities but also by stylistic variations.

Given a set of walking data represented in terms of a motion model, the similarities are represented by the average motion pattern, while the variations are expressed in terms of the covariance matrix. Principal component analysis can be used to find a low-dimensional, orthonormal basis system that efficiently spans a motion space. Individual walking patterns are approximated in terms of this orthonormal basis system. Here, we use PCA to create a representation that is able to capture the redundancy in gait patterns in an efficient and compact way, and we test the resulting model's ability to reconstruct missing parts of the full representation.

The vision problem is not the primary purpose of this letter, and we make two assumptions to stay focused on the problem of 3D reconstruction. First, we assume that the problem of tracking feature points in the 2D image plane is solved. We represent human motion in terms of trajectories of a series of discrete markers located at the main joints of the body. Figure 1 illustrates the locations of these markers. Second, 2D motion sequences are assumed to be orthographic projections of 3D motion.
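The model-building step described here can be sketched with a few lines of linear algebra, using synthetic data in place of the walker database (the function names are illustrative, not from the paper): stack each walker's feature vector (e.g., Fourier coefficients of all marker trajectories) into a row, center the data, and take the top-$k$ principal axes.

```python
import numpy as np

def fit_pca_model(X, k):
    """X: (walkers, features). Return the mean pattern and the top-k
    principal axes (rows of Vt from the SVD of the centered data)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def project(x, mean, basis):
    """Approximate a new pattern x in the k-dimensional model space;
    return the reconstruction and the model coefficients."""
    coeffs = basis @ (x - mean)
    return mean + basis.T @ coeffs, coeffs
```

The mean captures the shared structure of gait; the coefficients along the principal axes capture the stylistic variation of the individual walker.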
Neither assumption is critical to the general idea outlined here. In particular, the tracking problem can be treated completely independently, and many researchers are working on it (Jepson et al., 2003; Ning et al., 2004; Urtasun & Fua, 2004a, 2004b; Karaulova, Hall, & Marshall, 2002). We systematically develop a theory for recovering 3D walking data from 2D motion sequences and evaluate it using a cross-validation procedure. Single 2D test samples are generated by orthographically projecting a walker from our 3D database. We then construct a model based on the remaining walkers and project the test sample onto the model. The resulting 3D reconstruction is then evaluated by comparing it to the original 3D data of this walker. This procedure is iterated through the whole data set. Section 2 briefly describes data acquisition with a marker-based motion capture system. The details of the linear motion model and 3D motion reconstruction are given in section 3. We run our algorithm and evaluate the reconstructions in section 4. Finally, conclusions and possible future extensions are discussed in section 5.
Z. Zhang and N. Troje
Figure 1: Human motion representation by joints. The 15 virtual markers are located at the major joints of the body (shoulders, elbows, wrists, hips, knees, ankles), the sternum, the center of the pelvis, and the center of the head.
2 Walking Data Acquisition

Eighty participants—32 males and 48 females—served as subjects to acquire walking data. Their ages ranged from 13 to 59 years, and the average age was 26 years. Participants wore swimming suits, and a set of 41 retroreflective markers was attached to their bodies. Participants were requested to walk on a treadmill, and they adjusted the speed of the belt to the rate that felt most comfortable. The 3D trajectories of the markers were recorded using an optical motion capture system (Vicon; Oxford Metrics, Oxford, UK) equipped with nine CCD high-speed cameras. The system tracked the markers with a submillimeter spatial resolution and a sampling rate of 120 Hz. From the trajectories of these markers, we computed the location of 15 virtual markers according to a standard biomechanical model (BodyBuilder, Oxford Metrics), as illustrated in Figure 1. The virtual markers were located at the major joints of the body (shoulders, elbows, wrists, hips, knees, ankles), the sternum, the center of the pelvis, and the center of the head (for more details, see Troje, 2002a).
3 Motion Reconstruction

In our 3D motion reconstruction method, first a linear motion model is constructed from Fourier representations of human examples by PCA, and then the missing motion data are reconstructed from the linear motion model using two approaches: least-squares minimization by pseudoinverse, and calculation of a maximum a posteriori (MAP) estimate by using Bayes' rule.

3.1 Linear Motion Model. The collected walking data can be regarded as a set of time series of postures p_r(t), t = 1, 2, ..., T_r, r = 1, 2, ..., R, represented by the major joints, where R is the number of walkers and T_r is the number of sampled postures for walker r. Because each joint has three coordinates, the representation of a posture p_r(t) consisting of 15 joints is a 45-dimensional vector. Human walking can be efficiently described in terms of low-order Fourier series (Unuma, Anjyo, & Takeuchi, 1995; Troje, 2002b). A compact representation for a particular walker consists of the average posture (p_0), the characteristic postures of the fundamental frequency (p_1, q_1) and the second harmonic (p_2, q_2) of a discrete Fourier expansion, and the fundamental frequency (ω) to characterize this walker:

p(t) = p_0 + p_1 \sin(\omega t) + q_1 \cos(\omega t) + p_2 \sin(2\omega t) + q_2 \cos(2\omega t) + \mathrm{err}.  (3.1)

The power carried by the residual term err is less than 3% of the overall power of the input data, and we discard it from further computations. Since the average posture and each of the characteristic postures are 45-dimensional vectors, the dimensionality of a 3D Fourier representation at this stage is 45 × 5 = 225. For each specific walking pattern p_r(t), we obtain a 3D Fourier representation w_r:

w_r = (p_{0,r}, p_{1,r}, q_{1,r}, p_{2,r}, q_{2,r}).  (3.2)
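As a concrete illustration of equation 3.1, the five Fourier coefficients of a single joint coordinate can be recovered by linear least squares once the fundamental frequency is known. This is a minimal sketch with synthetic data; the function name, the gait frequency, and the coefficient values are illustrative, not from the paper.

```python
import numpy as np

def fit_fourier(x, t, omega):
    """Least-squares fit of p0 + p1*sin(wt) + q1*cos(wt) + p2*sin(2wt) + q2*cos(2wt)."""
    A = np.column_stack([
        np.ones_like(t),
        np.sin(omega * t), np.cos(omega * t),
        np.sin(2 * omega * t), np.cos(2 * omega * t),
    ])
    coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
    return coeffs  # [p0, p1, q1, p2, q2]

t = np.arange(0, 10, 1 / 120.0)      # ten seconds sampled at 120 Hz (section 2)
omega = 2 * np.pi * 0.9              # assumed fundamental gait frequency, 0.9 Hz
true = np.array([3.0, 1.5, -0.5, 0.2, 0.1])
x = (true[0] + true[1] * np.sin(omega * t) + true[2] * np.cos(omega * t)
     + true[3] * np.sin(2 * omega * t) + true[4] * np.cos(2 * omega * t))
est = fit_fourier(x, t, omega)       # recovers [p0, p1, q1, p2, q2]
```

Stacking such fits for all 45 coordinates yields the 225-dimensional representation of equation 3.2.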
The advantage of Fourier representation is that it allows us to apply linear operations and easily determine temporal correspondence. Temporal information is included in the frequency and phase. After computing the frequency and phase of a walking series by Fourier analysis, we can represent the walking series with zero phase and frequency-independent characteristic postures, as shown in equation 3.2. Every 3D Fourier representation of a walker can be treated as a point in a 225-dimensional linear space. PCA is applied to all the Fourier representations in order to learn the principal motion variations and reduce
dimensionality further. To do so, all of the Fourier representations are concatenated into a matrix W, with each column containing one walker w_r, comprising the parameters p_{0,r}, p_{1,r}, q_{1,r}, p_{2,r}, and q_{2,r} stacked vertically. Computing PCA on the matrix W results in a decomposition of each parameter set w of a walker into an average walker \bar{w} and a series of orthogonal eigenwalkers e_1, ..., e_N:

w = \bar{w} + \sum_{n=1}^{N} k_n e_n,  (3.3)
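The decomposition in equation 3.3 can be sketched numerically with an SVD of the centered data matrix. The data here are random placeholders standing in for real Fourier representations; shapes follow the text (225-dimensional walkers, R = 80, N = 30).

```python
import numpy as np

# PCA decomposition of the walker matrix W (one 225-dim walker per column).
rng = np.random.default_rng(0)
R, D = 80, 225
W = rng.normal(size=(D, R))                      # placeholder walker data

w_bar = W.mean(axis=1, keepdims=True)            # average walker
U, s, _ = np.linalg.svd(W - w_bar, full_matrices=False)
N = 30                                           # number of eigenwalkers kept
E = U[:, :N]                                     # orthonormal eigenwalkers e_1..e_N

w = W[:, [0]]                                    # one walker
k = E.T @ (w - w_bar)                            # coefficients k_n (projections)
w_approx = w_bar + E @ k                         # truncated equation 3.3
```

The truncation error is the part of w − w̄ outside the span of the first N eigenwalkers.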
Here \bar{w} = (1/R) \sum_{r=1}^{R} w_r denotes the average value of all the R columns, and k_n is the projection onto eigenwalker e_n. A linear motion model is spanned by the first N eigenwalkers e_1, ..., e_N, which represent the principal variations. The exact dimensionality N of the model depends on the required accuracy of the approximation but will generally be much smaller than R. The Fourier representation of a walker can now be represented as a linear combination of eigenwalkers by using the obtained coefficients k_1, ..., k_N.

3.2 Reconstruction. We denote a 2D motion sequence as \hat{p}(t), t = 1, 2, ..., T, which is represented in terms of its discrete Fourier components:

\hat{w} = (\hat{p}_0, \hat{p}_1, \hat{q}_1, \hat{p}_2, \hat{q}_2).  (3.4)
The average posture \hat{p}_0 and the characteristic postures \hat{p}_1, \hat{q}_1, \hat{p}_2, and \hat{q}_2 contain only 2D joint position information. These postures are concatenated into a column vector \hat{w} with 30 × 5 = 150 entries. We call this a 2D Fourier representation. Supposing a projection matrix C relates the 2D Fourier representation and its 3D Fourier representation, reconstructing the full 3D motion means finding the right solution w to the equation

\hat{w} = C w,  (3.5)
where C : 225 → 150 is the projection matrix. At this point we assume C is known. Later we consider C a function of an unknown horizontal viewpoint. Equation 3.5 is an underdetermined equation system and can be solved only if additional constraints can be formulated. We constrain the possible solution to be a linear combination of eigenwalkers in the motion model outlined in the previous section, so the problem is how to calculate a set of coefficients k = [k1 , . . . , k N ]. The solution w is a linear combination of eigenwalkers e 1 , . . . , e N with the obtained coefficients as the corresponding weights. Two calculation approaches are explored to find these coefficients.
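For an orthographic profile view, C simply selects the two in-plane coordinates of every joint in every characteristic posture. The sketch below assumes each 45-dimensional posture block stores joints as (x, y, z) triples and that the image plane keeps x and z while dropping the depth coordinate y; this memory layout is an assumption for illustration, not specified in the paper.

```python
import numpy as np

def projection_matrix(n_joints=15, n_blocks=5):
    """Selection matrix C mapping 225-dim 3D representations to 150-dim 2D ones."""
    keep = []
    for b in range(n_blocks):              # blocks p0, p1, q1, p2, q2
        for j in range(n_joints):
            base = b * 3 * n_joints + 3 * j
            keep += [base, base + 2]       # keep x and z, drop depth y (assumed)
    C = np.zeros((len(keep), 3 * n_joints * n_blocks))
    C[np.arange(len(keep)), keep] = 1.0
    return C

C = projection_matrix()
print(C.shape)  # (150, 225), matching the 150- and 225-dim representations
```

For oblique viewpoints, the rows mixing x and y depend on the view angle, as spelled out in appendix B.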
3.2.1 Approach I. Since a 3D Fourier representation of the walker w can be represented as a linear combination of eigenwalkers with a set of coefficients k_n, substituting equation 3.3 into equation 3.5 gives

\hat{w} = C\left(\bar{w} + \sum_{n=1}^{N} k_n e_n\right).  (3.6)
Denoting \hat{\bar{w}} = C\bar{w} and \hat{e}_n = C e_n, equation 3.6 can be rewritten as

\hat{w} - \hat{\bar{w}} = \sum_{n=1}^{N} k_n \hat{e}_n,  (3.7)
or, using matrix notation, as

\hat{w} - \hat{\bar{w}} = \hat{E} k.  (3.8)
Equation 3.8 contains 150 equations with N unknown coefficients k = [k_1, ..., k_N]. N is smaller than R, the number of walkers; for our calculations, we used a value of N = 30. Equation 3.8 is therefore an overdetermined linear equation system. We approximate a solution according to a least-squares criterion using the pseudoinverse:

k = (\hat{E}^T \hat{E})^{-1} \hat{E}^T (\hat{w} - \hat{\bar{w}}).  (3.9)
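Equation 3.9 is the standard normal-equations solution. A minimal sketch with synthetic data (the matrix sizes follow the text; the values are random placeholders):

```python
import numpy as np

# Approach I: least-squares coefficients via the pseudoinverse (equation 3.9).
rng = np.random.default_rng(1)
E_hat = rng.normal(size=(150, 30))        # projected eigenwalkers, 150 x N
k_true = rng.normal(size=30)
residual = E_hat @ k_true                 # plays the role of w_hat - w_bar_hat

k = np.linalg.inv(E_hat.T @ E_hat) @ E_hat.T @ residual   # equation 3.9
# Equivalently, and more stably: k = np.linalg.pinv(E_hat) @ residual
```

Because the synthetic residual lies exactly in the span of the eigenwalkers, the true coefficients are recovered.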
3.2.2 Approach II. In actual situations, due to measurement noise in the 2D motion sequences and the incompleteness of the training examples, the coefficients obtained from least-squares minimization may cause the 3D reconstruction to lie far beyond the range of the training data. In order to avoid this overfitting, Bayes' rule is used to make a trade-off between quality of match and prior probability. According to Bayes' rule, the posterior probability is

p(k \mid \hat{w}) \propto p(k)\, p(\hat{w} \mid k).  (3.10)
The prior probability p(k) reflects prior knowledge of the possible values of the coefficients. It can be calculated from the above linear motion model. Assuming a normal distribution along each of the eigenvectors, the prior probability is

p(k) \propto \prod_{i=1}^{N} \exp\left(-k_i^2/(2\lambda_i)\right) = \exp\left(-\sum_{i=1}^{N} k_i^2/(2\lambda_i)\right),  (3.11)
where \lambda_i, i = 1, ..., N, is the eigenvalue corresponding to the eigenwalker e_i. Because the variations along the eigenvectors are uncorrelated within the set of walkers, the product form can be used. Our goal is to determine a set of coefficients k = [k_1, ..., k_N]^T for the 2D Fourier representation \hat{w} with maximum probability in the 3D linear space. We assume that each dimension of the 2D Fourier representation \hat{w} is subject to uncorrelated gaussian noise with a variance σ². Then the likelihood of measuring \hat{w} is given by

p(\hat{w} \mid k) \propto \exp\left(-\|\hat{w} - \hat{\bar{w}} - \hat{E}k\|^2 / (2\sigma^2)\right),  (3.12)
where \hat{\bar{w}} = C\bar{w} is the projection of the average motion and \hat{E} = CE denotes the projection of the eigenwalkers. According to equation 3.10, the posterior probability is

p(k \mid \hat{w}) \propto \exp\left(-\|\hat{w} - \hat{\bar{w}} - \hat{E}k\|^2/(2\sigma^2) - \sum_{i=1}^{N} k_i^2/(2\lambda_i)\right).  (3.13)
The maximization of equation 3.13, corresponding to a maximum a posteriori (MAP) estimate, can be found by minimizing the negative log posterior:

k_{MAP} = \arg\min_k \left\{\|\hat{w} - \hat{\bar{w}} - \hat{E}k\|^2/(2\sigma^2) + \sum_{i=1}^{N} k_i^2/(2\lambda_i)\right\}.  (3.14)
The optimal estimate k_{opt} is then calculated as (see appendix A for computational details)

k_{opt} = k_{MAP} = \left(\mathrm{diag}\left(\sigma^2/\lambda\right) + \hat{E}^T \hat{E}\right)^{-1} \hat{E}^T (\hat{w} - \hat{\bar{w}}),  (3.15)
where \mathrm{diag}(x) is an operation producing a square matrix whose diagonal elements are the elements of the vector x. When the eigenvalue \lambda_i is large, the effect of the variance σ² on the coefficient k_i in the optimal estimate is smaller than for coefficients with small eigenvalues. After the coefficients k are estimated using one of the two proposed approaches, the missing information \tilde{w} = (\tilde{p}_0, \tilde{p}_1, \tilde{q}_1, \tilde{p}_2, \tilde{q}_2) of the Fourier representation of a 2D walker is synthesized as the corresponding linear combination of the 3D eigenwalkers e_1, ..., e_N. The reconstructed Fourier representation w is the combination of the 2D Fourier representation \hat{w} and the reconstructed \tilde{w}:

w = ([\hat{p}_0, \tilde{p}_0], [\hat{p}_1, \tilde{p}_1], [\hat{q}_1, \tilde{q}_1], [\hat{p}_2, \tilde{p}_2], [\hat{q}_2, \tilde{q}_2]).  (3.16)
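The MAP solution of equation 3.15 is a ridge-regularized least-squares solve: the σ²/λ_i terms shrink coefficients along low-variance eigenwalkers. A minimal sketch with synthetic data (the eigenvalue spectrum and residual are placeholders; only σ² = 70 mm² is taken from section 4.2):

```python
import numpy as np

# Approach II: MAP coefficients via the regularized normal equations (eq. 3.15).
rng = np.random.default_rng(2)
N = 30
E_hat = rng.normal(size=(150, N))         # projected eigenwalkers
lam = np.linspace(5.0, 0.1, N)            # eigenvalues lambda_i (illustrative)
sigma2 = 70.0                             # noise variance from section 4.2 (mm^2)
residual = rng.normal(size=150)           # w_hat - w_bar_hat (synthetic)

A = np.diag(sigma2 / lam) + E_hat.T @ E_hat
k_map = np.linalg.solve(A, E_hat.T @ residual)       # equation 3.15
k_ls = np.linalg.lstsq(E_hat, residual, rcond=None)[0]  # plain approach I
```

At k_map the gradient of the objective in equation 3.14 vanishes, which is exactly the stationarity condition derived in appendix A.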
3D periodic human motion is obtained from the reconstructed Fourier representation w by using equation 3.1.

4 Experiments

Using walking data acquired with a marker-based motion capture system, as described in section 2, we conducted two experiments. 2D test sequences were created by orthographic projection of the 3D walkers onto a vertical plane. We used viewpoints that ranged from the left profile view to the frontal view and from there to the right profile view in 1 degree steps. In the first experiment, we tried to reconstruct the missing, third dimension assuming that the view angle of the 2D test walker was known. In the second experiment, we assumed the horizontal viewpoint to be unknown and tested whether we could retrieve it together with the 3D motion. In both experiments, we used a complete leave-one-out cross-validation procedure: one walker is set aside for testing when creating the linear motion model and later projected onto it and evaluated. This is then repeated for every walker in the data set.

4.1 Model Construction. Fourier representations were created separately for each walker. On average, the fundamental frequency accounted for 91.9% of the total variance, and the second harmonic accounted for another 6.0%, which meant that the first two harmonics explained 97.9% of the overall postural variance of a walker. The set of all Fourier representations was then submitted to a PCA. The first 30 principal components accounted for 95.5% of the variance of all the Fourier representations, and they were chosen as eigenwalkers in the linear motion model (see Figure 2).
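The variance accounting behind Figure 2 follows directly from the PCA eigenvalues. A sketch with synthetic, anisotropic data (the data matrix and its spectrum are placeholders, so the printed percentage is not the paper's 95.5%):

```python
import numpy as np

# Cumulative fraction of variance explained by the leading principal components.
rng = np.random.default_rng(3)
W = rng.normal(size=(225, 80)) * np.linspace(3.0, 0.1, 225)[:, None]  # anisotropic
Wc = W - W.mean(axis=1, keepdims=True)           # center across walkers
s = np.linalg.svd(Wc, compute_uv=False)          # singular values
var = s ** 2 / (s ** 2).sum()                    # per-component variance fraction
cumvar = np.cumsum(var)
print(f"first 30 PCs explain {100 * cumvar[29]:.1f}% of the variance")
```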
4.2 Motion Reconstruction. For approach II, in order to determine the MAP for a given 2D motion sequence, we calculated the optimal variance σ² (see equation 3.15). Since the 2D motion sequence was the orthographic projection of a 3D walker onto a vertical plane, the actual motion in the missing dimension was known and could be compared directly to the reconstructed data. We define an absolute error for each joint, comparing the reconstructed data with the original data, by
E_{abs}(j) = \frac{1}{T} \sum_{t=1}^{T} \left(p^j(t) - \tilde{p}^j(t)\right)^2,  (4.1)
where p^j(t) and \tilde{p}^j(t) are the original and reconstructed data for the jth (j = 1, 2, ..., 15) joint in the missing dimension. The absolute reconstruction error of one walker is the average value of the absolute errors for the
Figure 2: The cumulative variance covered by the principal components computed across 80 walkers.
15 joints:

E_a = \frac{1}{15} \sum_{j=1}^{15} E_{abs}(j).  (4.2)
We calculated the average reconstruction errors for the 80 walkers for variances (σ² in equation 3.12) ranging from 1 to 625 mm². Figure 3 illustrates the results. The optimal variance for all the walkers was about σ² = 70 mm². This value was used for approach II in all subsequent experiments. To calculate a 3D Fourier representation, we projected the 2D Fourier representation onto the linear motion model and obtained a set of coefficients by the two proposed approaches. In order to qualitatively evaluate the reconstruction, we displayed all the subjects reconstructed from three viewing angles: 0, 30, and 90 degrees. For each demonstration, the original and reconstructed motion data were superimposed and displayed from a direction orthogonal to the direction along which motion data were missing. Figure 4 illustrates the original and the reconstructed motion sequences for one subject from the three viewing angles. This figure shows five equally spaced postures in one motion cycle. The original and the reconstructed walking sequences appear very similar in the corresponding postures, and there are no obvious differences between the two calculation approaches.
Figure 3: The effect of variance σ² on average reconstruction error.
The same results were inspected visually in all 240 demonstrations; no obvious differences could be seen. In order to provide a more quantitative evaluation, equation 4.2 was used to calculate the absolute reconstruction error. From different viewpoints, the 2D projections have different variances, and therefore a relative reconstruction error is also defined for each joint,

E_{rel}(j) = \frac{1}{T\sigma^2} \sum_{t=1}^{T} \left(p^j(t) - \tilde{p}^j(t)\right)^2,  (4.3)
where σ² is the average overall variance of the 15 joints in the missing dimension. The relative reconstruction error of one walker is the average value of the relative errors for the 15 joints:

E_r = \frac{1}{15} \sum_{j=1}^{15} E_{rel}(j).  (4.4)
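Equations 4.1 through 4.4 can be collected into one evaluation routine. A sketch with synthetic trajectories; note that equations 4.1 and 4.3 are reconstructed here without a square root, as suggested by their shared structure (E_rel = E_abs/σ²), which is an interpretation of the garbled source rather than a certainty.

```python
import numpy as np

def reconstruction_errors(p, p_tilde):
    """Per-walker errors; p, p_tilde have shape (T, 15): original vs. reconstructed."""
    T = p.shape[0]
    e_abs = ((p - p_tilde) ** 2).sum(axis=0) / T     # equation 4.1, per joint
    sigma2 = p.var(axis=0).mean()                    # avg variance of the 15 joints
    e_rel = e_abs / sigma2                           # equation 4.3, per joint
    return e_abs.mean(), e_rel.mean()                # equations 4.2 and 4.4

rng = np.random.default_rng(4)
p = rng.normal(size=(240, 15))                       # two seconds at 120 Hz
E_a, E_r = reconstruction_errors(p, p + 0.1 * rng.normal(size=p.shape))
```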
The absolute and relative reconstruction errors were calculated for all walkers and for different viewpoints using the two approaches. 3D walking data were reconstructed at 1 degree intervals from the left profile view to the right profile view. Figure 5 shows absolute average errors, relative average errors, and overall variance of the missing dimension as a function
Figure 4: Original and reconstructed walking data from three viewing angles (0, 30, and 90 degrees, corresponding to the top, middle, and bottom rows, respectively). The dashed lines and the solid lines are the original and the reconstructed postures in the motion sequence, respectively. (a) Approach I. (b) Approach II.
of horizontal viewpoint (−90 to 90 degrees). The solid and dashed curves in Figures 5a and 5b correspond to approach I and approach II, respectively. When prior probability is involved in the calculation, the error of the 3D motion reconstruction is about 30% smaller than that of approach I. As can be seen from Figure 5a, the absolute reconstruction errors differ between the negative and positive oblique viewpoints, which implies that human walking is slightly asymmetric. The small asymmetry of the variances in Figure 5c also confirms this point.
Figure 5: Illustration of the average reconstruction errors from viewpoints between −90 and 90 degrees for 80 walkers. (a) Absolute errors by the two approaches. (b) Relative errors by the two approaches. (c) Variance.
Figure 5b illustrates that reconstruction based on frontal and oblique viewpoints produces small relative reconstruction errors, while reconstruction from the profile view generates large relative reconstruction errors. The main reason for that is the fact that the overall variance that needs to be recovered from the profile view is relatively small anyway (see Figure 5c), and a reasonable absolute reconstruction error becomes a large relative error. The poor reconstruction in profile views and the good reconstruction in frontal views fit with data on human performance in biological motion recognition tasks (Troje, 2002a; Mather & Murdoch, 1994).

4.3 Viewpoint Reconstruction. While we assumed the projection matrix C (see equation 3.5) to be known in the first experiment, in the second experiment we include the horizontal viewpoint as an unknown variable subject to reconstruction. Assuming that a 2D walking sequence has a constant walking direction and that the viewpoint is rotated only about the vertical axis (that is, it is a horizontal viewpoint), we can estimate the view angle using the rotated average posture \hat{\bar{w}}(\alpha) and the rotated eigenwalkers \hat{E}(\alpha) and then retrieve the missing motion data. Equations 3.8 and 3.14 can be restated as

(\alpha_{opt}, k_{opt}) = \arg\min_{\alpha,k} \|\hat{w} - \hat{\bar{w}}(\alpha) - \hat{E}(\alpha)k\|,  (4.5)
(\alpha_{opt}, k_{opt}) = \arg\min_{\alpha,k} \left\{\|\hat{w} - \hat{\bar{w}}(\alpha) - \hat{E}(\alpha)k\|^2/(2\sigma^2) + \sum_{i=1}^{N} k_i^2/(2\lambda_i)\right\}.  (4.6)

The optimum solution of this minimization problem over the coefficients k and the view angle α can be found by solving a nonlinear overdetermined system. Details can be found in appendix B. The missing data are the linear combination of eigenwalkers using the optimal coefficients k_{opt}. As in the first experiment, a leave-one-out cross-validation procedure was applied to all the walkers. The average reconstructed angles from different viewpoints by the two approaches are illustrated in Figure 6. The experimental results show that we can precisely obtain the view angles from motion sequences with unknown viewpoints. Here, Bayesian inference does not provide an advantage. It is possible to identify and recognize human gait from 2D image sequences with different viewpoints by using the obtained view angle and missing data.

5 Conclusions and Future Work

We have investigated and evaluated the problem of reconstructing 3D periodic human motion from 2D motion sequences. A linear motion model, based on a linear combination of eigenwalkers, was constructed from Fourier representations of walking examples using PCA. The Fourier representation of a particular walker is a compact description of human walking: not only does it allow us to find temporal correspondence by adjusting frequency and phase, but it also reduces computational expenditure and storage requirements. Projecting 2D motion data onto this model yields a set of coefficients and, for test data with unknown viewpoints, the view angle. Two calculation approaches were explored to determine the coefficients. One was based on least-squares minimization using the pseudoinverse; the other used the prior probability of the training examples to calculate a MAP estimate by Bayes' rule. Experiments and evaluations were made on walking data. The results and quantified error analysis showed that the proposed method can reasonably reconstruct 3D human walking data from 2D motion sequences.
We also verified that the linear motion model can estimate parameters such as the view angle from 2D motion sequences with an unknown constant walking direction. Applying Bayes' rule to the 3D motion reconstruction improved the reconstruction performance. We developed and tested the proposed framework on a relatively simple data set: discrete marker trajectories obtained from motion capture data of walking human subjects. We also made a series of assumptions that simplified algorithmic and computational demands. In particular, we assumed that the projection matrix was either completely known or that only
Figure 6: Illustration of the average reconstructed view angles from viewpoints between −90 and 90 degrees for 80 walkers. The solid line shows the average reconstructed angles, and the dashed lines show the average standard deviations. (a) Approach I. (b) Approach II.
a single parameter, the horizontal viewpoint, had to be recovered. Using the proposed framework to process real-world video data would require a number of additional steps and expansions to the model. The computer vision problem of tracking joint locations in video footage remains challenging. Markerless tracking, however, has become a huge research field, and promising advances have been made along several lines. A discussion of the area is beyond the scope of this letter, but we are positive that solutions to the problem will be available in the near future. Another constraint that could be relaxed in future versions of this work is the knowledge about the projection matrix. The fact that the single parameter that we looked at here, the horizontal viewing angle, was recovered effortlessly and with high accuracy shows that the redundancy in the input data is still very high. Larger numbers of unknown parameters would require more sophisticated optimization procedures, but we consider it very likely that the system would stay overdetermined, providing enough constraints to find a solution. Recently, Chai and Hodgins (2005) simultaneously extracted the root position and orientation from vision data captured from two cameras. There is another challenge when we switch from periodic actions like walking and running to nonperiodic movements. Establishing temporal correspondence across our data set was easy: mapping one sequence onto another simply required scaling and translation in time. Establishing correspondence between nonperiodic movements generally requires nonlinear time warping.
A number of different methods are available to achieve this task, ranging from simple landmark registration to dynamic programming and the employment of hidden Markov models (see, e.g., Ramsay & Silverman, 1997), time-warping algorithms (Bruderlin & Williams, 1995), spatiotemporal correspondence (Giese & Poggio, 2000; Mezger, Ilg, & Giese, 2005), and registration curves (Kovar & Gleicher, 2003). The main purpose of this work was to emphasize the richness of the redundancy inherent in motion data and to suggest ways to employ this redundancy to recover missing data. In our example, the missing data were the third dimension, which gets lost when projecting the 3D motion data onto a 2D image plane. The same approach, however, could be used to recover markers that become occluded by other parts of the body or even to generate new marker trajectories at locations where a real marker cannot be placed (for instance, inside the body).

Appendix A: Solving an Overdetermined System for Coefficients

In order to calculate the optimal estimate k, we consider the objective

E = \|\hat{w} - \hat{\bar{w}} - \hat{E}k\|^2/(2\sigma^2) + \sum_{i=1}^{N} k_i^2/(2\lambda_i).  (A.1)
It can be written as

E = \left(\|\hat{w} - \hat{\bar{w}}\|^2 - 2\langle k, \hat{E}^T(\hat{w} - \hat{\bar{w}})\rangle + \langle k, \hat{E}^T\hat{E}k\rangle\right)/(2\sigma^2) + \sum_{i=1}^{N} k_i^2/(2\lambda_i).  (A.2)

At the optimum,

0 = \nabla E = \left(-2\hat{E}^T(\hat{w} - \hat{\bar{w}}) + 2\hat{E}^T\hat{E}k\right)/(2\sigma^2) + \mathrm{diag}(1/\lambda)\,k,  (A.3)

so

\hat{E}^T\hat{E}k + \sigma^2\,\mathrm{diag}(1/\lambda)\,k = \hat{E}^T(\hat{w} - \hat{\bar{w}}).  (A.4)

The solution is

k = \left(\mathrm{diag}\left(\sigma^2/\lambda\right) + \hat{E}^T\hat{E}\right)^{-1}\hat{E}^T(\hat{w} - \hat{\bar{w}}).  (A.5)
Appendix B: Solving an Overdetermined System for Coefficients and Angle

For solving equation 4.5, we define an objective function whose return value is a 150-dimensional vector. Every element of this vector is

\mathrm{diff}(i) = \hat{w}(i) - \hat{\bar{w}}(i)(\alpha) - \hat{E}(i)(\alpha)\cdot k, \quad i = 1, \ldots, 150,  (B.1)

where \hat{E}(i)(\alpha) is a row vector whose length equals the number of used eigenwalkers. The first 75 elements of this vector can be calculated by

\mathrm{diff}(i) = \hat{w}(i) - \cos\alpha \times (\bar{w}_x + E_x \cdot k) - \sin\alpha \times (\bar{w}_y + E_y \cdot k),  (B.2)

where i = 1, ..., 75, and \bar{w}_x, E_x and \bar{w}_y, E_y are the corresponding x- and y-coordinates of the average walker and the eigenwalkers, respectively. The remaining 75 elements can be calculated by

\mathrm{diff}(i) = \hat{w}(i) - \bar{w}_z - E_z \cdot k,  (B.3)

where i = 76, ..., 150, and \bar{w}_z and E_z are the corresponding z-coordinates of the average walker and the eigenwalkers, respectively.
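The viewpoint search over equations B.2 and B.3 can be sketched as a grid search: for each candidate angle, project the model, solve for the coefficients, and keep the angle with the smallest residual. All data below are synthetic, and plain least squares stands in for the regularized solve of equation 3.15, for brevity.

```python
import numpy as np

rng = np.random.default_rng(6)
J, N = 75, 10                          # entries per image axis, eigenwalkers
w_bar = rng.normal(size=(3, J))        # x, y, z rows of the average walker
E3 = rng.normal(size=(3, J, N))        # x, y, z rows of the eigenwalkers

def project(alpha, k):
    """2D observation at angle alpha with coefficients k (equations B.2, B.3)."""
    horiz = (np.cos(alpha) * (w_bar[0] + E3[0] @ k)
             + np.sin(alpha) * (w_bar[1] + E3[1] @ k))
    vert = w_bar[2] + E3[2] @ k
    return np.concatenate([horiz, vert])

alpha_true = np.deg2rad(30.0)
k_true = rng.normal(size=N)
w_obs = project(alpha_true, k_true)    # noiseless synthetic observation

best_alpha, best_res = None, np.inf
for a in np.deg2rad(np.arange(0.0, 181.0)):      # the 181 candidate angles
    E_hat = np.concatenate([np.cos(a) * E3[0] + np.sin(a) * E3[1], E3[2]])
    wb_hat = np.concatenate([np.cos(a) * w_bar[0] + np.sin(a) * w_bar[1], w_bar[2]])
    k = np.linalg.lstsq(E_hat, w_obs - wb_hat, rcond=None)[0]
    res = np.linalg.norm(w_obs - wb_hat - E_hat @ k)
    if res < best_res:
        best_alpha, best_res = a, res
```

With noiseless data the residual vanishes at the true angle, so the grid search recovers it exactly up to the 1-degree grid spacing.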
Finally, the obtained 150-dimensional vector is passed to the subroutine lsqnonlin in Matlab 6.5. The “large-scale: trust-region reflective Newton” optimization algorithm was used in this subroutine. The returned values are the optimum coefficients and the viewing angle. For equation 4.6, we calculate the coefficients for viewing angles from 0 to 180 degrees by using equation 3.15 and then substitute them into equation 4.6. We determine the optimum coefficients and view angle by comparing the 181 obtained values.

Acknowledgments

We gratefully acknowledge the generous financial support of the German Volkswagen Foundation and the Canada Foundation for Innovation. We thank Tobias Otto for his help in collecting the motion data and Daniel Saunders for many discussions and his tremendous help in the final editing of the manuscript. The helpful comments from the anonymous reviewers are gratefully acknowledged; they greatly helped us improve this letter.

References

Aggarwal, A., & Triggs, B. (2004). Learning to track 3D human motion from silhouettes. In Proceedings of the 21st International Conference on Machine Learning. Madison, WI: Omni Press.
Aggarwal, J. (2003). Problems, ongoing research and future directions in motion research. Machine Vision and Applications, 14, 199–201.
Aggarwal, J., & Cai, Q. (1999). Human motion analysis: A review. Computer Vision and Image Understanding, 73, 428–440.
Arikan, O., & Forsyth, D. (2002). Interactive motion generation from examples. ACM Transactions on Graphics, 21(3), 483–490.
Blanz, V., & Vetter, T. (2003). Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1063–1074.
Bowden, R. (2000). Learning statistical models of human motion. Paper presented at the IEEE Workshop on Human Modeling Analysis and Synthesis, CVPR 2000, South Carolina.
Bowden, R., Mitchell, T., & Sarhadi, M. (2000).
Non-linear statistical models for the 3D reconstruction of human pose and motion from monocular image sequences. Image and Vision Computing, 18, 729–737.
Bruderlin, A., & Williams, L. (1995). Motion signal processing. Proceedings of SIGGRAPH, 95, 97–104.
Buxton, H. (2003). Learning and understanding dynamic scene activity: A review. Image and Vision Computing, 21, 125–136.
Chai, J., & Hodgins, J. (2005). Performance animation from low-dimensional control signals. ACM Transactions on Graphics, 24(3), 686–696.
Cunado, D., Nixon, M., & Carter, J. (2003). Automatic extraction and description of human gait models for recognition purposes. Computer Vision and Image Understanding, 90, 1–41.
Dariush, B. (2003). Human motion analysis for biomechanics and biomedicine. Machine Vision and Applications, 14, 202–205.
Gavrila, D. (1999). The visual analysis of human motion: A survey. Computer Vision and Image Understanding, 73, 82–98.
Giese, M., & Poggio, T. (2000). Morphable models for the analysis and synthesis of complex motion patterns. International Journal of Computer Vision, 38, 59–73.
Grochow, K., Martin, S., Hertzmann, A., & Popović, Z. (2004). Style-based inverse kinematics. ACM Transactions on Graphics, 23(3), 522–531.
Howe, N., Leventon, M., & Freeman, W. (1999). Bayesian reconstruction of 3D human motion from single-camera video. In S. Solla, T. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 820–826). Cambridge, MA: MIT Press.
Jepson, A., Fleet, D., & El-Maraghi, T. (2003). Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1296–1311.
Kakadiaris, I., & Metaxas, D. (2000). Model-based estimation of 3D human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 1453–1459.
Karaulova, I., Hall, P., & Marshall, A. (2002). Tracking people in three dimensions using a hierarchical model of dynamics. Image and Vision Computing, 20, 691–700.
Kovar, L., & Gleicher, M. (2003). Flexible automatic motion blending with registration curves. In Proceedings of the Symposium on Computer Animation (pp. 214–224). Aire-la-Ville, Switzerland: Eurographics Association.
Leventon, M., & Freeman, W. (1998). Bayesian estimation of 3D human motion from an image sequence (Tech. Rep. No. 98-06). Cambridge, MA: Mitsubishi Electric Research Laboratory.
Mather, G., & Murdoch, L. (1994). Gender discrimination in biological motion displays based on dynamic cues. Proceedings of the Royal Society of London, Series B: Biological Sciences, 258, 273–279.
Mezger, J., Ilg, W., & Giese, M. (2005).
Trajectory synthesis by hierarchical spatiotemporal correspondence: Comparison of different methods. In Proceedings of the ACM Symposium on Applied Perception in Graphics and Visualization (pp. 25–32). New York: ACM Press.
Moeslund, T., & Granum, E. (2001). A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81, 231–268.
Ning, H. Z., Tan, T. N., Wang, L., & Hu, W. M. (2004). Kinematics-based tracking of human walking in monocular video sequences. Image and Vision Computing, 22, 429–441.
Ong, E., & Gong, S. (2002). The dynamics of linear combinations: Tracking 3D skeletons of human subjects. Image and Vision Computing, 20, 397–414.
Ramsay, J., & Silverman, B. (1997). Functional data analysis. New York: Springer.
Ren, L., Shakhnarovich, G., Hodgins, J., Pfister, H., & Viola, P. (2005). Learning silhouette features for control of human motion. ACM Transactions on Graphics, 24, 1303–1331.
Rosales, R., Siddiqui, M., Alon, J., & Sclaroff, S. (2001). Estimating 3D body pose using uncalibrated cameras. In Proceedings of the Computer Vision and Pattern Recognition Conference (Vol. 1, pp. 821–827). Los Alamitos, CA: IEEE Computer Society Press.
Sminchisescu, C., & Triggs, B. (2003). Kinematic jump processes for monocular 3D human tracking. In Proceedings of the Computer Vision and Pattern Recognition Conference (Vol. 1, pp. 69–76). Los Alamitos, CA: IEEE Computer Society Press.
Troje, N. F. (2002a). Decomposing biological motion: A framework for analysis and synthesis of human gait patterns. Journal of Vision, 2, 371–387.
Troje, N. F. (2002b). The little difference: Fourier based gender classification from biological motion. In R. P. Würtz & M. Lappe (Eds.), Dynamic perception (pp. 115–120). Berlin: Aka Press.
Unuma, M., Anjyo, K., & Takeuchi, R. (1995). Fourier principles for emotion-based human figure animation. Proceedings of SIGGRAPH, 95, 91–96.
Urtasun, R., & Fua, P. (2004a). 3D human body tracking using deterministic temporal motion models. In Proceedings of the Eighth European Conference on Computer Vision. Berlin: Springer-Verlag.
Urtasun, R., & Fua, P. (2004b). 3D tracking for gait characterization and recognition. In Proceedings of the Sixth International Conference on Automatic Face and Gesture Recognition (pp. 17–22). Los Alamitos, CA: IEEE Computer Society Press.
Wang, L., Hu, W. M., & Tan, T. N. (2003). Recent developments in human motion analysis. Pattern Recognition, 36, 585–601.
Wang, L., Tan, T. N., Ning, H. Z., & Hu, W. M. (2003). Silhouette analysis-based gait recognition for human identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1505–1518.
Yacoob, Y., & Black, M. (1999). Parameterized modeling and recognition of activities. Computer Vision and Image Understanding, 73, 232–247.
Received April 25, 2005; accepted August 16, 2006.
LETTER
Communicated by Thomas Vogl
A New Backpropagation Learning Algorithm for Layered Neural Networks with Nondifferentiable Units Takahumi Oohori
[email protected] Department of Information Design, Hokkaido Institute of Technology, Sapporo 006-8585, Japan
Hidenori Naganuma
[email protected] Division of Electrical Engineering, Graduate School of Engineering, Hokkaido Institute of Technology, Sapporo 006-8585, Japan
Kazuhisa Watanabe
[email protected] Department of Information Network Engineering, Hokkaido Institute of Technology, Sapporo 006-8585, Japan
We propose a digital version of the backpropagation algorithm (DBP) for three-layered neural networks with nondifferentiable binary units. This approach feeds teacher signals to both the middle and output layers, whereas with a simple perceptron, they are given only to the output layer. The additional teacher signals enable the DBP to update the coupling weights not only between the middle and output layers but also between the input and middle layers. A neural network based on DBP learning is fast and easy to implement in hardware. Simulation results for several linearly nonseparable problems such as XOR demonstrate that the DBP performs favorably when compared to the conventional approaches. Furthermore, in large-scale networks, simulation results indicate that the DBP provides high performance. 1 Introduction The three-layered perceptron proposed by Rosenblatt (referred to as a simple perceptron, SP) has provided research results that have become the foundation for understanding the learning possibilities of neural networks (Rosenblatt, 1961; Minsky & Papert, 1969). It has been theoretically proven that the SP can classify categories in finite learning iterations if the output of the middle layer is linearly separable. However, this means that the network cannot learn if at least one category among the input pattern set is linearly nonseparable. Neural Computation 19, 1422–1435 (2007)
© 2007 Massachusetts Institute of Technology
A New Digital-Type Backpropagation Algorithm
1423
Any linearly nonseparable pattern set becomes separable through random transformations between the input and middle layers if the number of middle-layer units is sufficiently large (Amari, 1978). In general, however, a large number of units are needed in the middle layer for its output to become linearly separable. To make learning technologically feasible, it is necessary to restrict the number of middle-layer units. However, because the SP cannot update the coupling weights between the input and middle layers, it is unrealistic to expect a great improvement in learning performance (Minsky & Papert, 1969). Rumelhart, Hinton, and Williams (1986) proposed an analog backpropagation algorithm (ABP), which replaces the binary units in the middle and output layers with analog, differentiable units. Since the output error propagates backward (from the output layer to the input layer), the ABP can update the coupling weights between the input and middle layers. However, this approach retains a few problems. For a linearly nonseparable pattern set, the ABP converges to a local minimum and stops learning for some initial coupling weights. Even when it converges to the global minimum, the ABP requires a large number of iterations for learning (Jacobs, 1988; Vogl, Mangis, Rigler, Zink, & Alkon, 1988). Moreover, this algorithm cannot directly solve a digital problem where the middle-layer output is intrinsically binary, such as the internal state inference of an automaton and vector quantization (Zeng, Goodman, & Smyth, 1993; Watanabe, Furukawa, Mitani, & Oohori, 1997). This letter proposes a digital version of the backpropagation algorithm (called digital BP, DBP) for three-layered neural networks with nondifferentiable binary units. For the DBP, teacher signals are given to both the middle and output layers, whereas for the SP they are given only to the output layer.
The DBP is therefore able to update the coupling weights not only between the middle and output layers, but also between the input and middle layers. A neural network based on the DBP learning model is fast and easy to implement in hardware. We have applied the proposed method to several linearly nonseparable problems and compared the DBP’s learning performances with that of the ABP. Furthermore, we have applied the algorithm to a large-scale network and determined the learning performance. 2 Digital Backpropagation This letter proposes a new method of backpropagation for three-layered neural networks with nondifferentiable binary units in the middle and output layers (Oohori, Kamada, Kobayashi, Konishi, & Watanabe, 2000). In digital neural networks, it is not possible to propagate the error from the output layer, through the middle layer, and to the input layer because of the nondifferentiable units in the middle layer. Thus, the coupling weights
T. Oohori, H. Naganuma, and K. Watanabe

Figure 1: Learning of digital backpropagation. (A) In case of single output. (B) In case of multiple outputs. Both panels show a three-layered network: linear input-layer units O_i^(p); step-function middle-layer units O_j^(p) with teacher signals T_j^(p); step-function output-layer units O_k^(p) with teacher signals T_k^(p); coupling weights W_ji between the input and middle layers and W_kj between the middle and output layers.
between the input and middle layers cannot be updated. As a result, the SP has difficulty learning linearly nonseparable problems. The proposed DBP is able to update the coupling weights not only between the middle and output layers but also between the input and middle layers by giving teacher signals to the middle layer. This serves to decrease the error of the output layer. The following sections describe the DBP, focusing on how to set the teacher signals for the middle layer for both single and multiple outputs.

2.1 In Case of Single Output. In digital neural networks (see Figure 1A), for each input pattern p, the output and the error in the output layer are calculated. The coupling weights between the middle and output layers are updated based on the delta rule in the same manner as with the SP:

\[ W_{kj} = W_{kj} - \alpha \left( O_k^{(p)} - T_k^{(p)} \right) O_j^{(p)}. \tag{2.1} \]

Similarly, if a teacher signal T_j^{(p)} can be set for the middle layer, the coupling weights between the input and middle layers can be updated by the following delta rule:

\[ W_{ji} = W_{ji} - \alpha \left( O_j^{(p)} - T_j^{(p)} \right) O_i^{(p)}. \tag{2.2} \]

Next, we need to calculate the teacher signal T_j^{(p)} for the middle-layer units. If the coupling weights W_{kj} between the middle and output layers are fixed temporarily after W_{kj} is updated, it is preferable to set the
middle-layer unit output O_j^{(p)} to a value that decreases the output-layer error. Since the activation of the kth output unit is

\[ U_k^{(p)} = \sum_j W_{kj} O_j^{(p)}, \tag{2.3} \]

if W_{kj} is positive, the activation increases when O_j^{(p)} is changed from 0 to 1 and decreases when O_j^{(p)} is changed from 1 to 0. If W_{kj} is negative, the activation decreases and increases contrary to the change in O_j^{(p)}. Therefore, to activate the output-layer unit when the output O_k^{(p)} = 0 and the teacher signal T_k^{(p)} = 1 (O_k^{(p)} - T_k^{(p)} = -1), the teacher signal T_j^{(p)} of the middle layer is set as follows:

\[ \text{if } W_{kj} > 0 \text{ then } T_j^{(p)} = 1, \tag{2.4} \]

\[ \text{if } W_{kj} < 0 \text{ then } T_j^{(p)} = 0. \tag{2.5} \]

In order to inactivate the output-layer unit when the output O_k^{(p)} = 1 and the teacher signal T_k^{(p)} = 0 (O_k^{(p)} - T_k^{(p)} = 1), set T_j^{(p)} as follows:

\[ \text{if } W_{kj} > 0 \text{ then } T_j^{(p)} = 0, \tag{2.6} \]

\[ \text{if } W_{kj} < 0 \text{ then } T_j^{(p)} = 1. \tag{2.7} \]

Combining equations 2.5 and 2.6 and equations 2.4 and 2.7, the following equations are obtained, respectively:

\[ \text{if } \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj} > 0 \text{ then } T_j^{(p)} = 0, \tag{2.8} \]

\[ \text{if } \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj} < 0 \text{ then } T_j^{(p)} = 1. \tag{2.9} \]

When O_k^{(p)} = T_k^{(p)} and/or W_{kj} = 0, the coupling weights between the input and middle layers need not be updated, as shown in equations 2.1 and 2.3. Therefore, the teacher signal is set to the existing output of the middle-layer unit to prevent updating of the coupling weights between the input and middle layers, namely,

\[ \text{if } \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj} = 0 \text{ then } T_j^{(p)} = O_j^{(p)}. \tag{2.10} \]

Based on these principles, the teacher signals given to the middle layer are summarized in Table 1 (in case of a single output). Since S_j^{(p)} = S_{jk}^{(p)} = (O_k^{(p)} - T_k^{(p)}) W_{kj} is the value that forms the basis for the teacher signal
Table 1: Calculating the Teacher Signal for the Middle Layer (Single Output).

    O_k^(p)   T_k^(p)   W_kj       S_j^(p) = S_jk^(p) = (O_k^(p) - T_k^(p)) W_kj   T_j^(p)
    0         1         Positive   Negative                                        1
    0         1         Negative   Positive                                        0
    1         0         Positive   Positive                                        0
    1         0         Negative   Negative                                        1
    0         0         –          0                                               O_j^(p)
    1         1         –          0                                               O_j^(p)
of the middle-layer units, we refer to it as the decision factor for the teacher signal.

2.2 In Case of Multiple Outputs. When dealing with multiple outputs (see Figure 1B), the coupling weights between the middle and output layers are updated by the following delta rule in a manner similar to that used by the SP:

\[ W_{kj} = W_{kj} - \alpha \left( O_k^{(p)} - T_k^{(p)} \right) O_j^{(p)}. \tag{2.11} \]

To update the coupling weights between the input and middle layers, the decision factor of the teacher signal is calculated as

\[ S_{jk}^{(p)} = \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj}. \tag{2.12} \]

The teacher signal T_j^{(p)} for the middle layer may be set by the sign of S_{jk}^{(p)}, and the coupling weights between the input and middle layers may be updated by the following delta rule:

\[ W_{ji} = W_{ji} - \alpha \left( O_j^{(p)} - T_j^{(p)} \right) O_i^{(p)}. \tag{2.13} \]

However, because there are two or more units in the output layer, a mutual competition occurs when the sign of S_{jk}^{(p)} differs across units. To resolve this, we define the sum S_j^{(p)} of S_{jk}^{(p)} over k as the decision factor for the teacher signal when there are multiple outputs:

\[ S_j^{(p)} = \sum_k \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj}. \tag{2.14} \]
Table 2: Calculating Teacher Signals for the Middle Layer (Two Outputs).

    S_j1^(p)   S_j2^(p)   S_j^(p) = S_j1^(p) + S_j2^(p)   T_j^(p)
    Positive   Negative   Positive                        0
    Positive   Negative   Negative                        1
    Positive   Negative   0                               O_j^(p)
    Negative   Positive   Positive                        0
    Negative   Positive   Negative                        1
    Negative   Positive   0                               O_j^(p)
    Positive   Positive   Positive                        0
    Negative   Negative   Negative                        1
Now the teacher signal of the middle layer is set by the following equations:

\[ \text{if } S_j^{(p)} > 0 \text{ then } T_j^{(p)} = 0, \tag{2.15} \]

\[ \text{if } S_j^{(p)} < 0 \text{ then } T_j^{(p)} = 1. \tag{2.16} \]

When the S_{jk}^{(p)} are balanced and S_j^{(p)} = 0, the coupling weights between the input and middle layers need not be updated, that is,

\[ \text{if } S_j^{(p)} = 0 \text{ then } T_j^{(p)} = O_j^{(p)}. \tag{2.17} \]
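The single- and multiple-output rules above reduce to one computation: form the decision factor S_j^{(p)} and threshold it on its sign. A minimal NumPy sketch of this computation (function and variable names are ours, not from the letter):

```python
import numpy as np

def middle_teacher_signals(O_mid, O_out, T_out, W_out):
    """Digital teacher signals for the middle layer (equations 2.14-2.17).

    O_mid: binary middle-layer outputs O_j^(p) for pattern p, shape (J,)
    O_out: binary output-layer outputs O_k^(p), shape (K,)
    T_out: binary output-layer teacher signals T_k^(p), shape (K,)
    W_out: middle-to-output coupling weights W_kj, shape (K, J)
    """
    # Decision factor S_j = sum_k (O_k - T_k) W_kj   (equation 2.14)
    S = (O_out - T_out) @ W_out
    # Sign of S_j selects the digital teacher signal (equations 2.15-2.17):
    # S_j > 0 -> 0, S_j < 0 -> 1, S_j = 0 -> keep the current output O_j
    return np.where(S > 0, 0, np.where(S < 0, 1, O_mid))
```

With K = 1 the sum has a single term and the function reproduces Table 1; with K > 1 competing output units are resolved by the summed decision factor, reproducing Table 2.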
The teacher signals given to the middle-layer units, based on this discussion, are summarized in Table 2 for the case of two outputs.

2.3 Improvement of the Algorithm. Figure 2 shows a program analysis diagram (PAD) for the DBP algorithm. In this figure, the decision factor S_j^{(p)} and the middle-layer teacher signal T_j^{(p)} are calculated using the current O_k^{(p)} and the newly updated W_{kj}. If the middle-layer teacher signals T_j^{(p)} are instead calculated using a newly recalculated O_k^{(p)}, faster convergence can be expected. The recalculation of O_k^{(p)} using the newly updated W_{kj} is shown in the dotted box in Figure 2.

3 Comparison with Previous Learning Methods

3.1 Comparison with the ABP. Rumelhart et al. (1986) proposed the backpropagation algorithm (analog BP, or ABP) with analog and differentiable units. Equations 3.1 and 2.11 show the updating of the coupling weights between the middle and output layers by the ABP and the DBP, respectively, for
Figure 2: PAD for DBP.
comparison:

(ABP) \[ W_{kj} = W_{kj} - \alpha \left( O_k^{(p)} - T_k^{(p)} \right) f'(U_k^{(p)}) \, O_j^{(p)}, \tag{3.1} \]

(DBP) \[ W_{kj} = W_{kj} - \alpha \left( O_k^{(p)} - T_k^{(p)} \right) O_j^{(p)}, \tag{2.11} \]
where f is the input-output function of the units in the middle and output layers; the sigmoid function is used. In the case of the ABP, with backward propagation of the output-layer error through f, the term f'(U_k^{(p)}) appears as shown in equation 3.1. For the DBP, this term is absent. f'(U_k^{(p)}) is small in the neighborhood of both the error maximum (O_k^{(p)} ≅ 1 - T_k^{(p)}) and the error minimum (O_k^{(p)} ≅ T_k^{(p)}). With the DBP, whenever an error occurs (O_k^{(p)} ≠ T_k^{(p)}), the weight is updated by a constant amount α, independent of the activation U_k^{(p)}. Next, the equations for updating the coupling weights between the input and middle layers in the ABP and the DBP are simplified by using the
teacher-signal decision factors, S̄_j^{(p)} and S_j^{(p)}, respectively:

ABP:

\[ \bar{S}_j^{(p)} = \sum_k \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj} \, f'(U_k^{(p)}), \tag{3.2} \]

\[ W_{ji} = W_{ji} - \alpha \, \bar{S}_j^{(p)} f'(U_j^{(p)}) \, O_i^{(p)}. \tag{3.3} \]

DBP:

\[ S_j^{(p)} = \sum_k \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj}, \tag{3.4} \]

\[ W_{ji} = W_{ji} - \alpha \, \frac{S_j^{(p)}}{|S_j^{(p)}|} \, O_i^{(p)}. \tag{3.5} \]
In equation 3.3 of the ABP, both the factor f'(U_j^{(p)}) for backward propagation from the middle layer and the decision factor S̄_j^{(p)} of the teacher signal, calculated in the output layer, appear. As with the output layer, when the output O_j^{(p)} of the middle layer is close to 0 or 1, f'(U_j^{(p)}) becomes almost 0, and W_{ji} is updated slowly. With the DBP, however, as shown in equation 3.5, W_{ji} is updated by a constant amount α, and learning is fast.

3.2 Comparison with the MADALINE. The MADALINE, proposed by Widrow, Winter, and Baxter (1988), is a three-layered neural network with nondifferentiable binary units in the middle and output layers. It has the same structure as both the SP and the DBP proposed in this letter. During learning with the MADALINE, the coupling weights between the middle and output layers are updated by the delta rule until the error no longer decreases for all input patterns. Afterward, for each input pattern, the output of the middle-layer unit whose activation is closest to 0 is reversed. If the output error decreases, the coupling weights between the input and middle layers are updated by the delta rule. Otherwise, the reversal is undone, and the output of the unit whose activation is next closest to 0 is reversed. After single units have been tried, pairs of units are reversed, one combination after another, up to a predetermined number. The MADALINE's learning algorithm is thus based on trial and error. When the number of middle-layer units increases, the algorithm must choose, from a large number of combinations, which units to reverse in order to decrease the output error. Due to this combinatorial explosion, the computational time increases tremendously, and the method is not practical. With the DBP, on the other hand, the teacher signals for all middle-layer units are set so as to decrease the output error. Because the coupling weights
between the input and middle layers are updated concurrently under the control of the teacher signals, this combinatorial explosion does not occur. Although the DBP's convergence has not yet been theoretically proven, a large number of simulation experiments have demonstrated good convergence.

3.3 Comparison with the Dual Learning Method. Application of the dual learning method proposed by Yamada, Kuroyanagi, and Iwata (2004) to the digital three-layered perceptron uses the following delta rules to update the coupling weights not only between the middle and output layers but also between the input and middle layers:

\[ W_{kj} = W_{kj} - \alpha \left( O_k^{(p)} - T_k^{(p)} \right) O_j^{(p)}, \tag{3.6} \]

\[ W_{ji} = W_{ji} - \alpha \left( O_j^{(p)} - T_j^{(p)} \right) O_i^{(p)}, \tag{3.7} \]
where the middle-layer teacher signal is calculated as an analog value:

\[ T_j^{(p)} = O_j^{(p)} - \alpha \sum_k \left( O_k^{(p)} - T_k^{(p)} \right) W_{kj}. \tag{3.8} \]
Yamada et al. (2004) proposed that the teacher signal T_j^{(p)} be clipped when it exceeds the range [0, 1], that is,

\[ T_j^{(p)} = \begin{cases} 0, & T_j^{(p)} < 0, \\ T_j^{(p)}, & 0 \le T_j^{(p)} \le 1, \\ 1, & T_j^{(p)} > 1. \end{cases} \tag{3.9} \]
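For contrast with the fully digital teacher signals of the DBP, the clipped analog teacher signal of equations 3.8 and 3.9 can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def dual_learning_teacher(O_mid, O_out, T_out, W_out, alpha):
    """Analog middle-layer teacher signal of equation 3.8,
    clipped to the range [0, 1] as in equation 3.9.

    O_mid: middle-layer outputs O_j^(p), shape (J,)
    O_out, T_out: output-layer outputs and teachers, shape (K,)
    W_out: middle-to-output weights W_kj, shape (K, J)
    """
    # T_j = O_j - alpha * sum_k (O_k - T_k) W_kj   (equation 3.8)
    T_mid = O_mid - alpha * ((O_out - T_out) @ W_out)
    return np.clip(T_mid, 0.0, 1.0)          # equation 3.9
```

Note that the result is a real number in [0, 1], whereas the DBP's teacher signal is always 0, 1, or the current binary output O_j^(p).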
Numerical experiments demonstrate that the performance of dual learning using equations 3.8 and 3.9 is comparable to that of both the ABP and the DBP. However, in the dual learning method the middle-layer teacher signals are not digital but analog, whereas in the DBP the teacher signals for the middle layer are digital, as shown in equations 2.14 to 2.17. Thus, the proposed DBP is a completely digital type of learning, including the teacher signals.

4 Numerical Experiments

4.1 Preliminary Experiments with the DBP. We tested the DBP on the XOR problem, using a network consisting of two input units, two middle units, and one output unit. The learning coefficient α of the DBP and ABP was
A New Digital-Type Backpropagation Algorithm
1431
Figure 3: Percentage of trials in which the network converges successfully for XOR, plotted against the learning coefficient (10^{-4} to 10^{2}, log scale) for DBP1, DBP2, and ABP.
varied from 10^{-4} to 10^{2}. A hundred different initializations of the coupling weights were generated using a uniform distribution between −1.0 and 1.0. The maximum number of learning iterations was set to 10^{2}/α in the DBP and to 10^{5}/α in the ABP, inversely proportional to the learning coefficient α. Learning was considered complete in the DBP when the error between the output and the teacher reached 0 or the number of iterations exceeded the maximum, and in the ABP when the mean square error between the output and the teacher signal fell below 10^{-3} or the number of iterations exceeded the maximum. In the experiment, the learning coefficient α was divided by the number of patterns (P_max) in order to suppress the influence of P_max in each problem. Figures 3 and 4 show the percentage of trials in which the network converged successfully and the average number of iterations required to converge for both the DBP and the ABP. In these figures, DBP1 and DBP2 represent the DBP without and with output recalculation, respectively. Figure 3 shows a slight improvement in the learning performance of the DBP when the output is recalculated. In all methods, performance decreases greatly when the learning coefficient α exceeds a certain limit. Figure 4 shows that learning accelerates as α increases. Therefore, it is preferable to set the learning coefficient α to as large a value as possible within the range where the percentage of successful trials does not decrease. A similar tendency is observed in other problems; therefore, in the following simulations, we have used α = 0.002 (for the DBP) and α = 2 (for the ABP) and included the output recalculation in the DBP.
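To make the whole procedure concrete, the following sketch trains a DBP1-style network (no output recalculation) on XOR. This is our reconstruction, not the authors' code: the bias units, the random seed, and the larger learning coefficient are our choices for brevity, and, as Figure 3 indicates, whether a given run converges depends on the initial weights (the letter reports 83% successful trials for XOR with the DBP).

```python
import numpy as np

rng = np.random.default_rng(0)
step = lambda u: (u >= 0).astype(float)   # nondifferentiable binary unit

# XOR training set: two inputs, one output
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])

alpha = 0.1                               # larger than the letter's 0.002, for brevity
W_ji = rng.uniform(-1, 1, (2, 3))         # input -> middle (last column: bias weight)
W_kj = rng.uniform(-1, 1, (1, 3))         # middle -> output (last column: bias weight)

def forward(x):
    O_j = step(W_ji @ np.append(x, 1.0))      # middle layer
    O_k = step(W_kj @ np.append(O_j, 1.0))    # output layer
    return O_j, O_k

total_err = 0.0
for it in range(2000):
    total_err = 0.0
    for x, t in zip(X, T):
        O_j, O_k = forward(x)
        # Delta rule for middle -> output weights (equation 2.11)
        W_kj -= alpha * np.outer(O_k - t, np.append(O_j, 1.0))
        # Digital teacher signal for the middle layer, using the freshly
        # updated W_kj (equations 2.14-2.17); bias column excluded
        S = (O_k - t) @ W_kj[:, :-1]
        T_j = np.where(S > 0, 0.0, np.where(S < 0, 1.0, O_j))
        # Delta rule for input -> middle weights (equation 2.13)
        W_ji -= alpha * np.outer(O_j - T_j, np.append(x, 1.0))
        total_err += float(np.abs(O_k - t).sum())
    if total_err == 0.0:                  # all four patterns correct
        break
```

DBP2 would differ only in recomputing O_k with the updated W_kj before forming the decision factor S.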
Figure 4: Average number of iterations required to converge for XOR, plotted against the learning coefficient for DBP1, DBP2, and ABP (log scale, 10 to 10^{6} iterations).
4.2 Experiment for Comparison with the Conventional BP (ABP). The next experiment presents several linearly nonseparable problems to both the DBP and the ABP. The learning coefficient α was 0.002 for the DBP and 2 for the ABP. The maximum number of learning iterations was set to 50,000 for both methods. The network had two units in the middle layer for two-input problems and three units for three-input problems. The other parameters were the same as in the preliminary experiments. The problems used in this experiment were the XOR (one output), the half adder (two outputs), the parity check (one output), the counting problem (three or four outputs), and the symmetry judgment (one output). Tables 3 and 4 show the percentage of trials in which the network converged successfully and the average number of iterations required to converge, respectively. Table 3 shows that the percentage of successful trials for the DBP was higher than for the ABP in all two-input problems and in the three-input counting problem. The DBP performed better than the ABP on multiple-output problems, such as the half adder and the counting problems, though the comparison varied considerably depending on the nature of each problem. Table 4 shows that the average number of iterations required by the DBP to converge was smaller than that of the ABP, except for the symmetry judgment problem. Moreover, the average number of iterations required for convergence tends to decrease when the percentage of successful trials is high, except for the parity check problem.

4.3 Experiments on More Realistic Problems. In order to observe the learning characteristics for a large-scale network, we presented a character
Table 3: Comparison of Percentages of Successful Trials.

                                            ABP    DBP
    Two inputs      XOR                     79     83
                    Half adder              86     100
                    Counting                97     100
    Three inputs    Parity check            100    91
                    Symmetry judgment       99     91
                    Counting                68     98
Table 4: Comparison of Average Iterations to Converge.

                                            ABP       DBP
    Two inputs      XOR                     3153      1881
                    Half adder              5069      3259
                    Counting                6047      2966
    Three inputs    Parity check            4254      3762
                    Symmetry judgment       2629      3169
                    Counting                10,883    7208
Figure 5: Binary input data for character recognition: the 26 characters of the alphabet, A through Z, each presented as a 25-pixel binary pattern.
recognition problem to the DBP. The network consisted of 25 units in the input layer and 26 units in the output layer, one for each character. The input data consisted of the 26 characters of the alphabet, as shown in Figure 5. The learning coefficient was set to 0.02, and the maximum number of learning iterations was set to 50,000. A hundred different initializations of the coupling weights were generated in the same fashion as described in sections 4.1 and 4.2. The number of units in the middle layer was varied from 4 to 10. Table 5 shows the percentage of trials in which the network converged
Table 5: Result of the Character Recognition Problem with 25 Inputs.

    Number of Units     Percentage of         Average Iterations
    in Middle Layer     Successful Trials     to Converge
    4                   0                     NA
    5                   24                    2881
    6                   100                   2233
    7                   100                   2212
    8                   100                   2378
    9                   100                   2387
    10                  100                   2392
successfully and the average number of iterations required to converge. With 4 middle-layer units, the network cannot learn all 26 characters, because the maximum number of learnable patterns for 4 binary units is 16. When the number of units is increased to 5, the percentage of successful trials remains small. This is because some similar characters, such as O and G or E and F, are mapped to the same pattern in the middle layer. With 6 or more middle-layer units, we see an effective classification in the middle layer. We also see that the average number of iterations does not depend on the number of units in the middle layer.

5 Conclusion

This letter proposed a digital version of the backpropagation algorithm for three-layered neural networks with nondifferentiable binary units. In the DBP, teacher signals are given to both the middle and the output layers, while in a simple perceptron they are given only to the output layer. The DBP is thereby able to update the coupling weights not only between the middle and output layers but also between the input and middle layers. Simulations of several problems produced the following results:
- The learning performance of the proposed DBP compares favorably with that of the ABP, though the comparison depends on the particular linearly nonseparable problem.
- Hardware implementations are simplified (replacing the learning results with logic units) because the network consists of only digital units.
- When the proposed DBP is applied to a large-scale network, it can learn quickly with high accuracy.
The following problems need to be resolved in the future:
- A formal proof of convergence of the algorithm and a clarification of its convergence conditions
- Extension to still larger-scale networks, increasing the number of layers and the number of units in the middle layer
- Application of the algorithm to intrinsically binary problems, such as the internal state inference of automata and vector quantization
Acknowledgments

We thank the reviewers of this letter for their insightful comments.

References

Amari, S. (1978). Mathematical theory of nerve nets. Tokyo: Sangyotosho. (In Japanese.)
Jacobs, R. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1, 295–307.
Minsky, M., & Papert, S. (1969). Perceptrons. Cambridge, MA: MIT Press.
Oohori, T., Kamada, S., Kobayashi, Y., Konishi, Y., & Watanabe, K. (2000). A new error back propagation learning algorithm for a layered neural network with nondifferentiable units (Tech. Rep. NC99-85). Tokyo: Institute of Electronics, Information, and Communication Engineers.
Rosenblatt, F. (1961). Principles of neurodynamics. Washington, DC: Spartan Books.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., & Alkon, D. L. (1988). Accelerating the convergence of the backpropagation method. Biological Cybernetics, 59, 257–263.
Watanabe, K., Furukawa, T., Mitani, M., & Oohori, T. (1997). Fluctuation-driven learning for feedforward neural networks. IEICE Transactions on Information and Systems, J80-D-II, 1249–1256.
Widrow, B., Winter, R. G., & Baxter, R. A. (1988). Layered neural nets for pattern recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-36, 1109–1118.
Yamada, K., Kuroyanagi, S., & Iwata, A. (2004). A supervised learning method using duality in the artificial neuron model. IEICE Transactions on Information and Systems, J87-D-II, 399–406.
Zeng, Z., Goodman, R. M., & Smyth, P. (1993). Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5, 976–990.
Received August 31, 2005; accepted August 10, 2006.
LETTER
Communicated by Walter Senn
Spike-Timing-Dependent Plasticity in Balanced Random Networks Abigail Morrison
[email protected] Computational Neuroscience Group, RIKEN Brain Science Institute, Wako City, Saitama 351-0198, Japan
Ad Aertsen
[email protected] Neurobiology and Biophysics, Institute of Biology III, Albert-Ludwigs-University, 79104 Freiburg, Germany
Markus Diesmann
[email protected] Computational Neuroscience Group, RIKEN Brain Science Institute, Wako City, Saitama 351-0198, Japan
The balanced random network model attracts considerable interest because it explains the irregular spiking activity at low rates and large membrane potential fluctuations exhibited by cortical neurons in vivo. In this article, we investigate to what extent this model is also compatible with the experimentally observed phenomenon of spike-timing-dependent plasticity (STDP). Confronted with the plethora of theoretical models for STDP available, we reexamine the experimental data. On this basis, we propose a novel STDP update rule, with a multiplicative dependence on the synaptic weight for depression, and a power law dependence for potentiation. We show that this rule, when implemented in large, balanced networks of realistic connectivity and sparseness, is compatible with the asynchronous irregular activity regime. The resultant equilibrium weight distribution is unimodal with fluctuating individual weight trajectories and does not exhibit development of structure. We investigate the robustness of our results with respect to the relative strength of depression. We introduce synchronous stimulation to a group of neurons and demonstrate that the decoupling of this group from the rest of the network is so severe that it cannot effectively control the spiking of other neurons, even those with the highest convergence from this group.
Neural Computation 19, 1437–1467 (2007)
© 2007 Massachusetts Institute of Technology
1438
A. Morrison, A. Aertsen, and M. Diesmann
1 Introduction

In order to understand cognitive processing in the cortex, it is necessary to have a solid understanding of its dynamics. The balanced random network model has been able to provide some insight into this matter, as it can account for the large fluctuations in membrane potential and the irregular firing at low rates observed in vivo for cortical neurons. Many aspects of this model have been investigated and are now well understood (van Vreeswijk & Sompolinsky, 1996, 1998; Brunel & Hakim, 1999; Brunel, 2000). Until now, studies have mainly focused on the activity dynamics of such networks under the assumption that the strength of a synapse remains constant. However, recent research indicates that the strength of a synapse varies with respect to the relative timing of pre- and postsynaptic spikes (Markram, Lübke, Frotscher, & Sakmann, 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1998; Debanne, Gähwiler, & Thompson, 1998; Sjöström, Turrigiano, & Nelson, 2001; Froemke & Dan, 2002; Wang, Gerkin, Nauen, & Bi, 2005), a phenomenon known as spike-timing-dependent plasticity (STDP). This has prompted intense theoretical interest (see, e.g., Kempter, Gerstner, & van Hemmen, 1999; Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Kistler & van Hemmen, 2000; Rubin, Lee, & Sompolinsky, 2001; Kempter, Gerstner, & van Hemmen, 2001; Gütig, Aharonov, Rotter, & Sompolinsky, 2003; Izhikevich & Desai, 2003; Burkitt, Meffin, & Grayden, 2004; Guyonneau, VanRullen, & Thorpe, 2005), but these studies have thus far tended to focus on the development of the synaptic weights of a single neuron in the absence of the activity dynamics of a recurrent network.
Such network studies as there are (Hertz & Prügel-Bennett, 1996; Levy, Horn, Meilijson, & Ruppin, 2001; Izhikevich, Gally, & Edelman, 2004; Iglesias, Eriksson, Grize, Tomassini, & Villa, 2005) have shown divergent results, such as the development of strongly connected neuronal groups in Izhikevich et al. (2004) and systematic synaptic pruning in Iglesias et al. (2005). Additionally, these studies all assume a level of connectivity orders of magnitude below that seen in cortical networks (Braitenberg & Schüz, 1998). The issue of connectivity is significant, as STDP is sensitive to correlation in neuron activity. Reducing the number of incoming synapses per neuron and scaling up the synaptic strength to compensate will inevitably influence this correlation. To ensure that the observed development of synaptic weights is not an artifact of low connectivity, it is necessary to examine the behavior of high-connectivity networks. In this article, we investigate the activity dynamics and the development of the synaptic weight distribution for recurrent networks with biologically realistic levels of connectivity and sparseness. This necessitates the use of distributed computing techniques for simulation, previously developed in Morrison, Mehring, Geisel, Aertsen, and Diesmann (2005). As a starting position, we use the well-understood balanced random network model (van Vreeswijk
STDP in Balanced Random Networks
1439
& Sompolinsky, 1996, 1998; Brunel & Hakim, 1999; Brunel, 2000), and extend it to incorporate STDP in its excitatory-excitatory synapses. The model of STDP implemented is based on a reexamination of the experimental data of Bi and Poo (1998), resulting in a novel power law update rule. The network is investigated both for the case in which no structured external stimulus is applied and for the case in which a group of neurons is subjected to an irregular synchronous stimulus. In section 2 we reexamine the experimental data for STDP and propose a well-constrained power law model for the synaptic weight update. A consideration of rate perturbation further constrains the model to an all-to-all, rather than nearest-neighbor, spike pairing scheme. The neuron and network models are introduced in section 3, and the activity dynamics of a static balanced random network is reviewed. The network is extended to incorporate STDP in section 4. In section 4.1, we demonstrate the mutual equilibrium of weight and activity dynamics in a plastic network with no structured external stimulus and show that no structure is developed. This finding turns out to be robust across changes to the network connectivity, spike pairing scheme, and formulation of the STDP update rule. We examine the behavior of the network if the STDP model exhibits depression that over- or undercompensates for the correlation in network activity. In section 4.2, we apply a synchronous stimulus to a group of neurons and investigate its potential to induce development of structure. We show that a strong stimulus causes the network to enter a pathological state, whereas the structure-enhancing effects of a weaker stimulus are counteracted by the decoupling of the stimulated group from the rest of the network. These results are discussed in section 5, as are ways in which the investigated models might be adapted to enable stimulus-driven development of structure.
An algorithmic template for implementing STDP in network simulations that is compatible with distributed computing is given in the appendix.

2 A Novel Power Law Model of STDP

There are many theoretical models of STDP in circulation. The two main sources of disparity are the weight dependence of the weight changes—for example, additive (Song et al., 2000), multiplicative (Rubin et al., 2001), or somewhere in between (Gütig et al., 2003)—and which pairs of spikes contribute to plasticity—for example, all-to-all (Gerstner, Kempter, van Hemmen, & Wagner, 1996) and nearest neighbor (Izhikevich et al., 2004). Other variants include some degree of activity dependence, for example, suppression (Froemke & Dan, 2002) or postsynaptic activity-dependent homeostasis (van Rossum et al., 2000). It has already been demonstrated that variations in the formulation of the STDP update rules can lead to qualitatively different results (Rubin et al., 2001; Gütig et al., 2003; Izhikevich & Desai, 2003; Burkitt et al., 2004), but there is as yet no consensus. Therefore, finding a formulation of STDP describing the experimental data as well as possible
A. Morrison, A. Aertsen, and M. Diesmann
Figure 1: Weight and time dependency of change for the Bi and Poo (1998) protocol. (A) Absolute change in excitatory postsynaptic current (EPSC) amplitude as a function of initial EPSC amplitude in log-log representation for potentiation where the presynaptic spike precedes the postsynaptic spike by 2.3 to 8.3 ms (plus signs) and depression where the postsynaptic spike preceded the presynaptic spike by 3.4 to 23.6 ms (circles). The lower black line is a linear fit to the depression data: slope = −1, offset = 2. The upper black line is a linear fit to the potentiation data assuming a slope = 0.4, resulting in an offset = 1.7. The pale gray line shows an additive STDP prediction for the data, and the dark gray curve shows a multiplicative STDP prediction, with wmax = 3000 pA. (B) Percentage change in EPSC amplitude as a function of spike timing. Curves show power law prediction for the data with µ = 0.4, τ = 20 ms, λ = 0.1, and α = 0.11 for different initial EPSC amplitudes: 17 pA, pale gray curve; 50 pA, medium gray curve; 100 pA, dark gray curve. All data from Bi and Poo (1998).
is still a relevant issue. Here we neglect the activity dependence and investigate how the weight dependence of potentiation and depression can best be characterized and what is an appropriate spike pairing scheme to apply.

2.1 Weight Dependence of STDP. Figure 1A shows the dependence of synaptic weight change on initial synapse strength for the experimental data from Bi and Poo (1998). The difference between this plot and the classical plot (Figure 5 in Bi and Poo) is that here, the absolute, rather than relative, change in the synaptic strength is plotted and a double logarithmic representation is used. The exponent of the dependence can be obtained from the slope of a linear fit to the data. In the case of depression, neglecting the largest outlier results in an exponent of −1. We will therefore assume a purely multiplicative update rule: w_− ∝ w. In the case of potentiation, depending on the treatment of the largest outlier, an exponent in the range 0.3 to 0.5 results. This suggests a power law rule of the form w_+ ∝ w^µ, for which we shall assume µ = 0.4. This novel rule fits the data better than either additive (w_+ = c) or multiplicative (w_+ ∝ w_max − w) rules, shown
in Figure 1A for comparison. Although all the rules can be fitted to give similar results for synapses in the 30 to 100 pA range, the greatest difference in prediction is for the behavior of very weak synapses. A multiplicative rule predicts a greater absolute potentiation for synapses close to 0 pA than that obtained for synapses in the 30 to 100 pA range, and an additive rule predicts the same absolute potentiation regardless of synaptic strength. In contrast to these rules, the power law rule predicts very small absolute increases when very weak synapses are potentiated.

2.2 Constraining the Model Parameters. Having established the weight dependence for the change in synaptic strength over the entire 60-pair protocol, we now constrain the model parameters. The update rule underlying the synaptic modifications over the course of the protocol has a weight and a time component; specifically, we assume it takes the form

$$w_+ = \lambda_{60}\, w_0^{1-\mu}\, w^{\mu}\, e^{-|t|/\tau} \quad \text{if } t > 0 \qquad (2.1)$$
$$w_- = -\lambda_{60}\, \alpha\, w\, e^{-|t|/\tau} \quad \text{if } t < 0,$$
where τ is the time constant, λ60 is the learning rate over the whole protocol, w0 is a reference weight, and α scales the strength of depressing increments with respect to the strength of potentiating increments. We take t to be the difference between the postsynaptic and presynaptic spikes arriving at the synapse, that is, after the delays due to axonal propagation and backpropagation, as suggested by Debanne et al. (1998). If the spikes arrive exactly synchronously, t = 0, no alteration is made to the weight. We start by fitting the potentiation data from Figure 1A to the theoretical form of equation 2.1. We have:

$$\log(w_+) = \mu \log(w) + (1-\mu)\log(w_0) + \log\left(\lambda_{60}\, e^{-|t|/\tau}\right). \qquad (2.2)$$
Without loss of generality, we set w0 = 1 pA, and substitute for t the mean time interval t = 6.3 ms of the potentiation data. The value of log(λ60 e^(−|t|/τ)) is thus the offset of the linear fit to the potentiation data. Assuming τ = 20 ms in line with other theoretical work on STDP (e.g., Song et al., 2000; van Rossum et al., 2000), this results in the value λ60 = 7.5. However, to implement STDP in a simulation, we need to know the value of the learning rate for just one pair, λ, instead of the learning rate over 60 pairs, λ60. This can be done numerically: by substituting w_+ = w in equation 2.2 and solving for w, the value of the weight that would experience 100% potentiation in this protocol can be calculated as

$$w^{(0)} = w_0\, e^{\frac{\log(\lambda_{60}) - |t|/\tau}{1-\mu}} = 17\ \text{pA}.$$

We start at
this initial weight, and apply the single pair update rule 60 times:

$$w^{(1)} = w^{(0)} + \lambda\, w_0^{1-\mu}\, \bigl(w^{(0)}\bigr)^{\mu}\, e^{-|t|/\tau}$$
$$w^{(2)} = w^{(1)} + \lambda\, w_0^{1-\mu}\, \bigl(w^{(1)}\bigr)^{\mu}\, e^{-|t|/\tau}$$
$$\vdots$$
$$w^{(60)} = w^{(59)} + \lambda\, w_0^{1-\mu}\, \bigl(w^{(59)}\bigr)^{\mu}\, e^{-|t|/\tau}.$$
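This iteration can be checked numerically. The following is a minimal Python sketch (variable and function names are our own), using the single-pair learning rate λ = 0.1 whose value the text derives from the doubling requirement:

```python
import math

# Section 2.2 parameters: mu = 0.4, tau = 20 ms, w0 = 1 pA,
# mean pairing interval t = 6.3 ms (all names are ours).
MU, TAU, W0, T = 0.4, 20.0, 1.0, 6.3

def potentiate(w, lam):
    """Single-pair power law potentiation increment (pA)."""
    return lam * W0 ** (1 - MU) * w ** MU * math.exp(-T / TAU)

w = 17.0                       # w(0): the weight that doubles over the protocol
for _ in range(60):            # apply the single-pair rule 60 times
    w += potentiate(w, lam=0.1)
# w is now roughly twice the initial 17 pA
```

Because the increment grows with w^µ, the 60 small updates compound slightly faster than a constant-increment rule would.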
A value of λ = 0.1 results in w^(60) = 2w^(0) = 34 pA, as required. As the range of time intervals used to produce the depression data is too large to have much confidence in the offset of the linear fit, we determined α in an analogous fashion to give 40% depression for t = 6.3 ms. For the rest of this article, unless otherwise stated, we will use the following update rule for a pair of spikes:

$$w_+ = \lambda\, w_0^{1-\mu}\, w^{\mu}\, e^{-|t|/\tau} \quad \text{if } t > 0 \qquad (2.3)$$
$$w_- = -\lambda\, \alpha\, w\, e^{-|t|/\tau} \quad \text{if } t < 0,$$
with µ = 0.4, τ = 20 ms, w0 = 1 pA, λ = 0.1, and α = 0.11. The fit of this update rule to the time-dependence data of Bi and Poo (1998) is depicted in Figure 1B for three different initial EPSC amplitudes. This produces a similar window function to that reported in studies on cortical rather than hippocampal plasticity (e.g., Froemke & Dan, 2002), where a maximum potentiation of 103% and a maximum depression of 51% were reported. Although the above calculation assumes that the time constants for potentiation and depression are equal, τ+ = τ− = τ, it is trivial to adjust the parameters to accommodate τ+ < τ−, as reported in several experimental studies (e.g., Feldman, 2000; Froemke & Dan, 2002). An appropriate efficient algorithmic implementation of STDP suitable for distributed computing is given in the appendix.

2.3 Spike Pairing Scheme. A self-consistent rate is a necessary condition for stability in the balanced random network model. It is known that different spike pairing schemes result in qualitatively different weight dynamics as a function of postsynaptic rate (Kempter et al., 2001; Izhikevich & Desai, 2003; Burkitt et al., 2004). Therefore, to establish which pairs of spikes should be considered for the synaptic weight update, we consider the effects of different spike pairing schemes when a self-consistent rate is perturbed. We implemented a nearest-neighbor scheme, whereby a presynaptic spike is paired with only the last postsynaptic spike to effect depression, and a postsynaptic spike is paired with only the last presynaptic spike to effect
potentiation. Conversely, in an all-to-all scheme, a presynaptic spike is paired with all previous postsynaptic spikes to effect depression, and a postsynaptic spike is paired with all previous presynaptic spikes to effect potentiation. For this investigation, 1000 neurons were provided with input designed to match the network model described in section 3, namely, 9000 independent excitatory and 2250 independent inhibitory Poisson processes at 7.7 Hz to model input from the local network, and a further 9000 independent excitatory Poisson processes at 2.32 Hz to model external excitation. (For all synaptic and neuronal parameters, refer to section 3.) Of the excitatory inputs modeling local network input, 1000 were mediated by synapses implementing the power law STDP update rules described above; the rest were static. The 1 × 10^6 plastic synapses were initialized from a gaussian distribution with a mean of 45.61 pA and a standard deviation of 4.0 pA. In the case of the all-to-all pairing scheme, we set α = 0.1021 and λ = 0.0973, and in the case of the nearest-neighbor scheme, α = 0.0976 and λ = 0.116. This choice of parameters results in a firing rate of 7.7 Hz and a stable, unimodal synaptic weight distribution, measured across all the plastic synapses, with a mean of 45.5 pA and a standard deviation of 4.0 pA. A unimodal distribution was expected for the power law formulation of STDP, as bimodal distributions have been reported only for additive or near-additive rules (see Gütig et al., 2003) with very weak or no weight dependence, whereas the power law formulation has a strong dependence on the initial synaptic strength. To perturb the self-consistent rate, a current was injected into the neurons from 50 s onward. A current of 30 pA increased the firing rate by approximately 2 Hz; a current of −35 pA decreased it by the same amount.
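The two pairing schemes can be illustrated with a small offline sketch (our own, not the simulator code used in the article; all names are assumptions). It accumulates the equation 2.3 increments over explicit spike-time lists; the schemes differ only in how many earlier partners each spike is paired with:

```python
import math

MU, TAU, W0, LAM, ALPHA = 0.4, 20.0, 1.0, 0.1, 0.11  # section 2 parameters

def _pot(w, dt):   # potentiation increment for post-after-pre interval dt > 0
    return LAM * W0 ** (1 - MU) * w ** MU * math.exp(-dt / TAU)

def _dep(w, dt):   # depression increment for pre-after-post interval dt > 0
    return -LAM * ALPHA * w * math.exp(-dt / TAU)

def weight_change(w, pre, post, scheme="all_to_all"):
    """Total weight change for sorted pre/post spike-time lists (ms)."""
    dw = 0.0
    for tp in post:                  # pair each post spike with earlier pre spikes
        partners = [t for t in pre if t < tp]
        if scheme == "nearest_neighbor":
            partners = partners[-1:]     # keep only the most recent pre spike
        dw += sum(_pot(w, tp - t) for t in partners)
    for tp in pre:                   # pair each pre spike with earlier post spikes
        partners = [t for t in post if t < tp]
        if scheme == "nearest_neighbor":
            partners = partners[-1:]     # keep only the most recent post spike
        dw += sum(_dep(w, tp - t) for t in partners)
    return dw
```

For a single spike pair the two schemes coincide; with longer trains, the all-to-all scheme accumulates the additional long-interval pairings that underlie its different rate dependence.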
For all configurations of spike pairing scheme and injected current, the unimodal nature of the distribution was conserved. As can be seen in Figure 2, the choice of pairing scheme has a significant effect on the consequent weight development. If the postsynaptic rate is increased, an all-to-all pairing scheme results in a net decrease of synaptic weights (see Figure 2B), whereas a nearest-neighbor scheme results in a net increase (see Figure 2A). Conversely, if the postsynaptic rate is decreased with respect to the input rate, an all-to-all scheme increases the mean synaptic weight (see Figure 2B), whereas a nearest-neighbor scheme decreases it (see Figure 2A). In other words, the all-to-all scheme acts like a restoring force, counteracting the rate disparity, whereas the nearest-neighbor scheme acts in such a direction as to magnify any rate disparity. These results do not depend on the time constants for depression and potentiation being equal; results obtained for τ− = 34 ms and τ+ = 14 ms, as reported by Froemke and Dan (2002), were qualitatively the same (all-to-all scheme: α = 0.042, λ = 0.0973; nearest-neighbor scheme: α = 0.0458, λ = 0.116; data not shown). In the absence of some sliding threshold mechanism (e.g., Bienenstock, Cooper, & Munro, 1982) or postsynaptic homeostasis (see Turrigiano, Leslie,
Figure 2: Development of mean synaptic weight for (A) nearest-neighbor and (B) all-to-all spike pairing schemes, averaged over 1 × 10^6 synapses. From 50 s onward, a current is injected into the postsynaptic neurons to increase their rate (upward triangles) or decrease it (downward triangles) by 2 Hz. Dashed curves show mean weight development when no current is injected. See the text for further details of the simulation.
Desai, Rutherford, & Nelson, 1998), an all-to-all spike pairing scheme is clearly better suited to situations where a self-consistent rate is required and shall be used for the remainder of this letter.

3 Network Model

We consider networks of current-based integrate-and-fire neurons. For each neuron, the dynamics of the membrane potential V is

$$\dot{V} = -\frac{V}{\tau_m} + \frac{I}{C},$$

where τm is the membrane time constant, C is the capacitance of the membrane, and I is the input current to the neuron. The current arises as a superposition of the synaptic currents and any external current. The synaptic current Is due to one incoming spike is represented as an α-function:

$$I_s(t) = w\, \frac{e\, t}{\tau_\alpha}\, e^{-t/\tau_\alpha},$$

where w is the peak value of the current and τα is the rise time. When the membrane potential reaches a given threshold value Θ, the neuron emits a spike and the membrane potential is clamped to zero for an absolute refractory period τr. The values for these parameters used in this article, unless otherwise stated, are: τm: 10 ms, C: 250 pF,
Θ: 20 mV, τr: 0.5 ms, w: 45.61 pA, τα: 0.33 ms. The parameters are chosen such that the generated postsynaptic potentials (PSPs) have a rise time of 1.7 ms and a half-width of 8.5 ms (e.g., Fetz, Toyama, & Smith, 1991). To avoid transients in the dynamics at the beginning of the simulation, the initial membrane potentials are drawn from a gaussian distribution with a mean of 5.7 mV and a standard deviation of 7.2 mV. The network model was derived from Brunel (2000), where the larger number of excitatory neurons is balanced by increasing the strength of the inhibitory connections. In order to realize biologically realistic values for both the number of connections per neuron (≈ 10^4) and the connection probability (≈ 0.1) in the same network, it is necessary to consider networks of the order of 10^5 neurons, approximately the number of neurons in a cubic millimeter of cortex (Braitenberg & Schüz, 1998). Specifically, we simulated networks with 90,000 excitatory and 22,500 inhibitory neurons. In the static network, each neuron receives 9000 connections from excitatory and 2250 connections from inhibitory neurons, whereby the peak current of an inhibitory synapse is larger than that of an excitatory synapse by a factor of −5. A neuron is not permitted to have a connection to itself. The total number of synapses in the network is 1.27 × 10^9. Additionally, each neuron receives an independent Poissonian stimulus corresponding to 9000 excitatory spike trains, each at 2.32 Hz. The strength of these external excitatory connections is the same as that of the local excitatory connections. The propagation delay between neurons is 1.5 ms, and the networks are simulated with a computational resolution of 0.1 ms.
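A minimal forward-Euler sketch of this neuron and input model follows (our own illustration with the parameter values above; the article's actual simulations use the distributed techniques of Morrison et al., 2005, not this code):

```python
import math

# Neuron parameters from section 3 (names are ours)
TAU_M, C = 10.0, 250.0     # membrane time constant (ms), capacitance (pF)
THETA, TAU_R = 20.0, 0.5   # threshold (mV), absolute refractory period (ms)
W, TAU_A = 45.61, 0.33     # synaptic peak current (pA), rise time (ms)
DT = 0.1                   # computational resolution (ms)

def alpha_current(t, w=W):
    """alpha-function current of one spike arriving at t = 0; peaks at w when t = tau_alpha."""
    return w * math.e * t / TAU_A * math.exp(-t / TAU_A) if t > 0 else 0.0

def simulate(arrivals, t_max=20.0):
    """Integrate V' = -V/tau_m + I/C and return emitted spike times (ms)."""
    v, refractory, spikes = 0.0, 0.0, []
    for k in range(int(t_max / DT)):
        t = k * DT
        if refractory > 0:
            refractory -= DT
            v = 0.0                    # clamped at zero while refractory
            continue
        i_syn = sum(alpha_current(t - ta) for ta in arrivals)
        v += DT * (-v / TAU_M + i_syn / C)
        if v >= THETA:
            spikes.append(t)
            v, refractory = 0.0, TAU_R
    return spikes
```

A single 45.61 pA input deflects the membrane by only a fraction of a millivolt, so many near-coincident inputs are needed to reach the 20 mV threshold.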
The activity of the network described above, if all synapses are static, is in the asynchronous irregular (AI) regime (Brunel, 2000), whereby each neuron fires irregularly at a low rate (coefficient of variation for the interspike interval CVISI = 0.91, rate ν = 7.8 Hz), and the network activity is not characterized by synchronous events. However, even in the AI regime, the network is not devoid of synchronous activity. In general, both slow and fast oscillations can be excited (Brunel & Hakim, 1999; Brunel, 2000; Tetzlaff, Morrison, Timme, & Diesmann, 2005). The slow oscillations are eradicated by the use of a noisy external stimulus and drawing the initial membrane potentials from a distribution. The fast oscillations are somewhat more tenacious and are clearly visible in the cross-correlogram in Figure 3. They cause a high variability in the spike count, which can be quantified by the Fano factor, calculated by binning the spike times, and dividing the variance of the spike count by the mean. In this article, all Fano factors are calculated by binning the spike trains from 1000 neurons (unless otherwise stated), using a bin size of 3 ms, which resulted in a mean spike count of at least 10 spikes per bin for all the data sets investigated.
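The Fano factor computation described here can be sketched as follows (a minimal helper of our own; the function name is an assumption):

```python
import numpy as np

def fano_factor(spike_times, t_start, t_stop, bin_size=3.0):
    """Fano factor of the pooled spike count: variance/mean over time bins.

    spike_times: concatenated spike times (ms) of the recorded neurons
    (1000 in the article); bin_size = 3 ms, as used in the article.
    """
    edges = np.arange(t_start, t_stop + bin_size, bin_size)
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts.var() / counts.mean()
```

For a Poisson process the expected value is 1; values well above 1, such as the FSC = 7.6 reported below, indicate excess synchrony.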
Figure 3: Cross-correlogram of the spiking activity in the static balanced random network, averaged over 500 pairs of neurons and 50 s, using a bin size of 0.1 ms. This is calculated as the cross-covariance normalized by the square root of the product of the variances. As can be seen in the inset, the crosscorrelogram is shifted to the right by the synaptic propagation delay of 1.5 ms, as it is recorded from the perspective of the synapse, and the delay is entirely dendritic (see the appendix).
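The normalization described in the caption ("cross-covariance normalized by the square root of the product of the variances") can be sketched as follows (our own implementation; the function name is an assumption):

```python
import numpy as np

def cross_correlogram(train_a, train_b, t_stop, bin_size=0.1, max_lag=75.0):
    """Normalized cross-correlogram of two spike trains (times in ms)."""
    edges = np.arange(0.0, t_stop + bin_size, bin_size)
    a = np.histogram(train_a, bins=edges)[0].astype(float)
    b = np.histogram(train_b, bins=edges)[0].astype(float)
    a -= a.mean()                 # cross-covariance: subtract the means
    b -= b.mean()
    norm = np.sqrt((a * a).sum() * (b * b).sum())
    n = int(round(max_lag / bin_size))
    lags = np.arange(-n, n + 1) * bin_size
    # dot products of the overlapping segments at each lag
    cc = np.array([np.dot(a[max(k, 0):len(a) + min(k, 0)],
                          b[max(-k, 0):len(b) + min(-k, 0)]) / norm
                   for k in range(-n, n + 1)])
    return lags, cc
```

With this normalization, a train correlated against itself gives exactly 1 at zero lag.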
The above network has a Fano factor for the spike count of FSC = 7.6, significantly higher than the value of 1 that is characteristic of Poissonian statistics. In the case of an idealized network with pure Poissonian firing statistics, where the excitatory synapses exhibit STDP as described in section 2, the fixed point of the resulting synaptic weight distribution can easily be determined:

$$w^{*} = w_0\, \alpha_p^{1/(\mu-1)}.$$

Setting w* = 45.61 pA in line with the static network described above, this determines αp = 0.1. However, as will be demonstrated in section 4, to retain the activity dynamics of the static network, a slightly larger α > αp is required to compensate for the structured raw cross-correlation induced by the fast oscillations.
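The fixed-point relation can be verified in a few lines (our own sketch; the balance condition assumes equal pre- and postsynaptic rates and symmetric spike-timing statistics, as in the idealized Poissonian case):

```python
# At the fixed point, expected power law potentiation balances expected
# multiplicative depression:
#   w0**(1 - mu) * w**mu = alpha_p * w   =>   w* = w0 * alpha_p**(1 / (mu - 1))
MU, W0 = 0.4, 1.0   # section 2 parameters (w0 in pA)

def fixed_point(alpha_p):
    return W0 * alpha_p ** (1.0 / (MU - 1.0))

w_star = fixed_point(0.1)   # about 46 pA, close to the static weight 45.61 pA
```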
4 Dynamics of Network Activity and Weight Distribution

4.1 Unstimulated Network. Replacing the excitatory-excitatory connections in the static network with synapses following the power law
Figure 4: Synaptic weight distribution and correlation in the plastic balanced recurrent network with α = 1.057αp. (A) The percentage area difference of the weight distribution from the final distribution (at 2000 s) as a function of time. The data were recorded from the outgoing excitatory-excitatory synapses of 900 neurons: 8.1 × 10^6 synapses. (B) Histogram of the equilibrium synaptic weight distribution, averaged over 10 samples in the final 500 s. The black curve indicates the gaussian distribution, with the mean and standard deviation obtained from the histogram (µw = 45.65, σw = 3.99). (C) Cross-correlogram of the spiking activity, averaged over 500 pairs of neurons and 50 s, using a bin size of 0.1 ms. The recording was made after the weight distribution had stabilized (400 to 450 s). (D) Difference between the cross-correlograms of the plastic network and a static network using synaptic strengths drawn from the equilibrium distribution shown in (B). Cross-correlograms calculated as in Figure 3.
STDP update rules given by equation 2.3, a stable network configuration is achieved for α = 1.057αp . The resultant weight distribution settles within 200 s to approximately gaussian (µw = 45.65, σw = 3.99; see Figures 4A and 4B). The activity of the plastic network is close to the activity of the static network described in section 3: a slightly higher rate of 8.8 Hz is observed. The other spike statistics are also comparable (CVISI = 0.88, FSC = 8.5). Some
difference is to be expected, as in the static network all excitatory-excitatory synapses have the same weight, whereas in the plastic network, the distribution of weights contributes to the fluctuations of the membrane potential and so increases the rate. Indeed, if the weights in the static network are chosen from a gaussian distribution with the same mean and standard deviation as the equilibrium distribution, the activity statistics are very similar (ν = 8.9 Hz, CVISI = 0.9, and FSC = 8.5). For the rest of this letter, the static network using this distribution of synaptic weights will be considered a control case. The similarity of the dynamics is also demonstrated by the cross-correlogram in Figure 4C. The cross-correlogram has qualitatively the same shape as that of the static network, and the difference of the two (see Figure 4D) shows no structure whatsoever. Therefore, we can conclude that the two networks exhibit the same activity dynamics and that the power law STDP update rule described in section 2 is compatible with balanced recurrent networks in the AI regime. However, we do not claim that this property is unique to this STDP model. It would seem likely that any STDP formulation that results in a unimodal equilibrium distribution of synaptic weights would also be compatible, at least for some parameter range. This is confirmed for a formulation of the Gütig et al. (2003) update rules in the next section.

4.1.1 Survival of Strong Synapses. One of the main sources of interest in STDP is its apparent ability to create neuronal assemblies (e.g., Izhikevich et al., 2004). However, searching for a stable structure in a network with 8.1 × 10^8 excitatory-excitatory synapses presents a significant data analysis problem in itself. To discover whether such assemblies are self-organizing in the balanced recurrent network, we consider the persistence of the strongest synapses in the network.
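The survival analysis that follows can be sketched as below (our own reading of the procedure; we assume a synapse counts as surviving only while it remains above threshold at every recorded snapshot):

```python
import numpy as np

def survival_counts(snapshots, threshold):
    """Number of initially strong synapses still strong at each later snapshot.

    snapshots: array of shape (n_snapshots, n_synapses), one row per
    recording (every 5 s in the article); threshold: e.g. mu + 1.3*sigma
    of the equilibrium weight distribution.
    """
    alive = snapshots[0] > threshold      # initial strong population
    counts = []
    for row in snapshots[1:]:
        alive &= row > threshold          # must remain above threshold
        counts.append(int(alive.sum()))
    return counts
```

Plotting these counts against time yields the survival curves of Figure 5.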
If structure were developing, strong synapses should stay strong or become even stronger. During the course of the simulation, the outgoing plastic synapses of 900 neurons were monitored every 5 s, and those that were above a threshold of µ + 1.3σ (50.8 pA) of the equilibrium distribution shown in Figure 4 were recorded (approximately the strongest 10%). The persistence of strong synapses for various different initial times is shown in Figure 5A. For a group of synapses that are strong at a given time, the number that remain strong decays exponentially with a time constant of approximately 60 s, and almost none of an initial group remains strong after 800 s. No synapses remained strong for longer than 1000 s, although the simulation ran for 2000 s. This behavior was not systematically dependent on the time at which the initial population was defined, as is demonstrated by the similar courses of the curves in Figure 5A. These results suggest that structure is not developing in the balanced recurrent network. The survival statistics are qualitatively different for a low-connectivity network, as can be seen in Figure 5B. Here, the number of neurons in the
Figure 5: Persistence of strong synapses. (A) High-connectivity network. From a total population of 8.1 × 10^6 synapses (see Figure 4) and for a given initial time, approximately the 10% strongest synapses are recorded. The number of these synapses that remain strong is plotted as a function of survival time. The shading of the lines indicates the initial times: 100 s (palest gray curve) in steps of 200 s until 1500 s (black curve). (B) Low-connectivity network. As in A, but with a total population of 8.1 × 10^4 synapses and an earliest initial time of 500 s (palest gray curve).
network was reduced to 900 excitatory and 225 inhibitory neurons while retaining a connection probability of 0.1. To compensate for the lower number of synaptic inputs, the rate of the external input was increased to 34.66 Hz, the peak current of the static excitatory synapses was increased by a factor of 4 to 182.44 pA, and the peak current of the inhibitory synapses was a factor of −18 larger than the excitatory synapses. In the case of the plastic excitatory synapses, the scaling factor of 4 was applied postsynaptically, so for the purposes of the STDP update, the synaptic weights stayed in the range of approximately 30 to 70 pA. This permitted values of α = 1.109αp and λ = 0.1 to be used, resulting in an STDP window function as close as possible to that used for the full-sized network, and much the same activity statistics (ν = 7.9 Hz, CVISI = 0.91, FSC = 8.6 (900 neurons)). The equilibrium weight distribution was unimodal, with a mean of 45.52 pA and a standard deviation of 6.52 pA. In this case, strong synapses are much more stable, with some remaining strong for thousands of seconds. The decay of the number of strong synapses does not vary systematically with the time at which the initial population was defined, as was the case for the full-sized network. However, the decay is no longer well described by an exponential function, but by a power law with an exponent of −1.02. Despite the longer endurance of strong synapses, it still could not be said that structure is spontaneously
developing in the network, as although the number of strong synapses did not quite decay to zero over the 5000 s measuring period, the remainder comprises less than 0.2% of the total number of plastic synapses. Similar results are obtained when using the nearest-neighbor spike pairing scheme described in section 2.3, resulting in a unimodal distribution with a mean of 45.6 pA, a standard deviation of 6.62 pA, and power law survival statistics with an exponent of −0.93 (data not shown). The results are also robust with respect to the formulation of the STDP update rules: power law STDP with asymmetric time constants (τ− = 34 ms, τ+ = 14 ms, α = 0.048) and Gütig et al. (2003) STDP (wmax = 100 pA, λ = 0.005, µ = 0.4, and α = 1.188) produce unimodal distributions, with means of 45.7 pA and 45.57 pA, standard deviations of 7.13 pA and 5.17 pA, respectively, and power law survival statistics with exponents of −0.62 and −1.3, respectively (data not shown). These results show that the connectivity of a network has a significant effect on the stability of individual weights, as strong synapses in the low-connectivity networks persist for much longer periods than in the network with biologically realistic connectivity. Further, they demonstrate that the nondevelopment of structure is quite a robust observation. It is not restricted to high-connectivity networks and does not depend critically on either our specific formulation of the STDP update rules or the spike pairing scheme.

4.1.2 Sensitivity to Scaling of Depression. The value of α = 1.057αp was determined through a tuning process to give a mean synaptic weight close to the weight of excitatory synapses in the static network, and thus a similar self-consistent firing rate. This raises the question of what the effect on the network activity dynamics would be if α is chosen too high, leading to a net depression of the synapses, or too low, leading to a net potentiation.
Clearly, any net change of the synaptic weights will lead to a change in the correlation structure, which will affect the weights, and so on. A network can be stable only if its weight distribution and correlation structure are compatible, as in the network described above. How easy is this to accomplish? We found that if α is chosen 2% higher—α = 1.078αp —the network does indeed settle to a new equilibrium, with a near-gaussian weight distribution at a slightly lower mean (µw = 45.33 pA) and a slightly increased standard deviation (σw = 4.1 pA). The network firing rate is, of course, lower, ν = 3.4 Hz, and the activity is still in the AI regime (CVISI = 0.92, FSC = 3.9). Interestingly, the network is in a more asynchronous state than either the heterogeneous static network or the plastic network with α = 1.057αp (FSC = 8.5), as reflected in its cross-correlogram (see Figure 6). The central and side peaks are smaller in amplitude and less well defined, and the difference in the cross-correlograms of this network and the static network (see Figure 6, inset) shows clear structure around τ = 0. This lower degree of synchrony is presumably due to the increased dominance of inhibition in the local recurrent network as the excitatory weights decrease.
Figure 6: Cross-correlogram of the spiking activity in the plastic balanced random network with α = 1.078 αp , averaged over 500 pairs of neurons and 50 s, using a bin size of 0.1 ms. The recording was made after the weight distribution had stabilized (950–1000 s). The inset shows the difference in the cross-correlograms for the plastic and the heterogeneous static networks. Cross-correlograms calculated as in Figure 3.
If α is chosen 2% lower—α = 1.035 αp —a very different picture emerges. Up to 24 s, the mean of the synaptic weight distribution slowly increases to 45.83 pA, and its standard deviation decreases to 3.85 pA, retaining its neargaussian form. Over this period, the network firing rate increases smoothly from 8 Hz to about 27 Hz. Between 24 s and 26 s, there is a sudden transition. The network firing rate increases rapidly to 200 Hz (see Figure 7, inset), and the synaptic weight distribution splinters into clusters (see Figure 7, main panel). Simultaneously, the network activity dynamics undergoes a transition. The global activity changes from asynchronous irregular firing (see Figure 8A) to strong peaks of activity interspersed with periods of silence (see Figure 8B). Within each peak, a highly regular pattern of activity builds up within a few milliseconds and then decays abruptly, to be replaced with a different pattern of activity (see Figure 8, lower panels). These results show that α is a critical parameter for the configuration of network and plasticity model investigated here and that the value of α = 1.057αp is quite close to the bifurcation point. For values of α greater than this, the network is stable, albeit with weaker synaptic weights and consequently lower firing rates. Note that even extremely weak weights will
Figure 7: Weight distribution and rate development in the plastic balanced recurrent network with α = 1.035αp. The gray histogram shows the distribution of weights for the outgoing excitatory-excitatory synapses of 1000 neurons (i.e., 9 × 10^6 synapses) after 24 s. The black histogram shows the weight distribution for the same group of synapses after 30 s. The inset shows the average rate of these neurons as a function of time.
not extinguish the network activity due to the external input. For values of α lower than the bifurcation point, the network is unstable and leaves the asynchronous irregular regime.

4.2 Stimulated Network. In section 4.1.1 it was demonstrated that if an appropriate scaling of the plasticity window is chosen such that a self-consistent rate ensues, no structure develops spontaneously. In this section, we investigate whether structure can be induced in the network by generating systematic positive correlation in the network. The most obvious way to accomplish this is to induce a group of neurons to fire synchronously. The greater the number of inputs a neuron has from this group, the more likely it is to fire shortly after the synchronous event.

4.2.1 Stimulus Protocol. A stimulus is applied to a group of neurons (N = 500) at random intervals. At each stimulation event, a rectangular pulse current is injected into each neuron, whereby the injection time for each neuron is drawn from a gaussian distribution (σ = 0.5 ms). In this way, there is no systematic relationship between the firing times of the stimulated group during a stimulus event beyond that given by the gaussian. Due to the random connectivity of the network, the number of connections Ksynch
STDP in Balanced Random Networks

Figure 8: Activity dynamics in the plastic balanced recurrent network with α = 1.035αp. (A) Instantaneous rate as a function of time between 22 s and 23 s for a sample of 1000 neurons with a bin size of 0.1 ms. (B) As in A, but between 29 s and 30 s. Arrows indicate time periods treated in lower panels. (C) Instantaneous rate as a function of time between 29.1 s and 29.106 s. (D) Corresponding raster plot for 100 neurons. (E) As in C, but for the time period 29.7 s to 29.706 s. (F) Corresponding raster plot; same neurons as in D.
each neuron receives from the stimulated group is binomially distributed with a mean of 50. The higher the convergence a neuron has from this group, the more likely it is to be affected by the stimulation. We therefore define a high-convergence group with respect to the stimulated group, such
Figure 9: Effect of stimulation on the network. Raster plot for 100 arbitrarily indexed neurons from the following populations: (A) stimulated group (N = 500), (B) high-convergence group (N = 568, K synch ≥ 69), (C) random-convergence group (N = 1000). The stimulus consists of a rectangular pulse current of amplitude 6840 pA and duration 0.1 ms at irregular intervals with a frequency of 3 Hz. For each stimulation event, the injection times for each neuron are drawn from a gaussian distribution with a standard deviation of 0.5 ms.
that K synch ≥ 69. This value was chosen as it results in a high-convergence group of approximately the same size as the stimulated group (N = 568). The spiking activity for the stimulated group (see Figure 9A), the high-convergence group (see Figure 9B), and a group of neurons (N = 1000) with random convergences (see Figure 9C) shows that although the stimulus has a strong effect within the stimulated group, its initial effect on the high-convergence group is only moderate and its effect on the random-convergence group weaker still.

4.2.2 Activity Development of Stimulus-Driven Network. Initially, it seems as if the stimulus has a catastrophic effect on the stability of the network. In Figure 10A, a sudden rate transition can be seen at 224 s. All three recorded neuron populations reach rates of over 200 Hz. A closer inspection of the spiking activity at the transition point (see Figures 10C and 10D) yields some insight. A strong burst in the stimulated group triggers an oscillation
Figure 10: Activity development in the stimulated plastic network: (A) Average rate as a function of time (bin size 1 s) for stimulated group, pale gray curve; high-convergence group, dark gray curve; random-convergence group, solid black curve. The dashed black line indicates the equilibrium rate for the unstimulated plastic network. (B) As in A, but with no connections permitted within the stimulated group. (C) Raster plot for stimulated group during activity state transition, arbitrary neuron indices. (D) Raster plot for random-convergence group during activity state transition, arbitrary neuron indices.
in the random-convergence group (which is representative of the activity in the whole network). This synfire explosion was first reported by Mehring, Hehl, Kubo, Diesmann, & Aertsen (2003). Here, however, the effect is magnified rather than dying away, as the network correlation and the synaptic strengths have a positive feedback relationship through the STDP rule. The emergent activity is characterized by bursts of strongly patterned activity separated by periods of no activity, as was also observed for the unstimulated network with α = 1.035αp (see Figure 8). Quite apart from its consequences for the network activity, the triggering of the synfire explosion is in itself an interesting effect, given that the
stimulated group initially had only a weak effect on the network (compare Figures 9A and 9C). This is partly due to the implementation of synaptic delays. As neurons were connected randomly, the neurons in the stimulated group also receive input from other neurons within that group. Since the synaptic delay was implemented as dendritic rather than axonal, when two neurons fire simultaneously, the presynaptic spike arrives at the synapse before the postsynaptic spike. This causes a systematic increase in the weights of the intragroup synapses when a synchronous stimulus is repeatedly applied. Although in general the response to the stimulus within the group is weakened, as will be discussed below, this increase of weights can occasionally cause an amplification of the response. In this case, more of its neurons fire within a few milliseconds, which naturally has a stronger effect on the rest of the network. If the network is connected such that there are no connections between neurons belonging to the stimulated group, the amplification does not take place, and no synfire explosion is triggered for a stimulated group of this size (see Figure 10B). However, increasing the group size triggers a synfire explosion without the amplifying effects of the intragroup synapses: for a group size of 600 stimulated neurons, this occurs after 249 s; for a group size of 700, after 127 s; and for a group size of 800, after only 7 s (data not shown). Irrespective of whether a synfire explosion was triggered, a notable feature of the network behavior is the significant reduction in the rate of the stimulated group. This change in rate can be explained with reference to the changes in the synaptic weight distributions (see Figure 11A). The positive correlation between the stimulated group and the high-convergence and random-convergence groups corresponds to a systematic negative correlation from the perspective of the incoming synapses of the stimulated group.
As a consequence, the means of the distributions of synaptic weights from the random-convergence and high-convergence groups to the stimulated group decrease from the mean weight determined in section 4.1 by 13.3% and 21.1%, respectively. This represents a significant reduction in the input received by the neurons of the stimulated group and causes them to fire much less frequently. Note that this effect can be generalized to any stimulus that creates a systematic positive correlation between a stimulated group and another group of neurons receiving input from it. This decoupling of the stimulated group from the rest of the network counteracts the development of structure observed in the development of the weights of the outgoing synapses of the stimulated group. The means of the distributions from the stimulated group to the random-convergence and high-convergence groups increase from the mean weight value determined in section 4.1 by 13.8% and 25.5%, respectively. Over the same period, the mean of the synaptic weights unconnected with the stimulated group increased by just 0.2%. Moreover, a clear relationship can be seen between the degree of convergence a given neuron has from the stimulated group and the mean weight of those synapses (see Figure 11B). Up to about the
Figure 11: Development of synaptic weights in the stimulated plastic network, with no connections within the stimulated group. (A) Distribution of synaptic weights between different neuron populations after 400 s: stimulated group to high-convergence group, solid black curve; stimulated group to random-convergence group, dashed black curve; high-convergence group to stimulated group, solid dark gray curve; random-convergence group to stimulated group, dashed dark gray curve. The pale gray curve indicates the distribution of synaptic weights for synapses unconnected to the stimulated group. (B) Average synaptic weight after 400 s as a function of convergence from the stimulated group (dots) and the high-convergence group (crosses). The vertical black line indicates the cut-off point for membership in the high-convergence group (K synch ≥ 69), and the gray line is a linear fit to the data below this point.
convergence chosen as the cut-off point for membership in the high-convergence group, the dependence of the average synaptic weight on the degree of convergence is well described by a linear relationship. Interestingly, beyond this point, the average synaptic weight increases much faster than the linear prediction. Although this suggests that a development of functional structure is occurring, such that the volley of near-synchronous spikes is reliably transferred from the stimulated group to the high-convergence group, the development of the weights of the outgoing synapses of the high-convergence group does not support this. Even after 400 s of repeated stimulation, no increase of the average synaptic weight as a function of the convergence from the high-convergence group can be observed. Clearly, the high-convergence group does not echo the stimulus reliably enough to have a corresponding effect on its own downstream high-convergence group, even though the synaptic weights between the stimulated group and the high-convergence group have increased so significantly. The failure of the stimulated group to drive the high-convergence group effectively, despite the increased synaptic weights, can also be seen in the cross-correlogram of their spiking activity in Figure 12. Ideally, we would
Figure 12: Expected and actual correlation development. (A) Initial correlation: cross-correlation in the static network between the stimulated group and an external group with the same convergence with respect to the stimulated group and the rest of the network as the high-convergence group; synaptic weights drawn from distribution shown in Figure 4B. Averaged over 500 pairs of neurons and 50 s, using a bin size of 0.1 ms. (B) Expected final correlation: as in A, but synaptic weights for the external group are drawn from the appropriate distributions shown in Figure 11A. (C) Actual final correlation: cross-correlation between the stimulated group and the high-convergence group in the plastic network after stabilization of rates. Averaged over 500 pairs of neurons between 300 to 400 s with a bin size of 0.1 ms. Cross-correlograms calculated as in Figure 3.
like to compare the cross-correlation between the stimulated group and the high-convergence group at the beginning and end of the simulation. However, the rates are changing too quickly at the beginning for a cross-correlation to be meaningful. In order to gain some insight into the initial cross-correlation between the stimulated and the high-convergence groups, the static network with heterogeneous weights was used to provide input to an external group of 500 neurons. This group was connected such that it had the same convergence with respect to the stimulated group and the
rest of the network as the high-convergence group, with synaptic weights drawn from the equilibrium distribution shown in Figure 4B. The static network was then stimulated as described in section 4.2.1, to give a reasonable idea of the cross-correlation structure between the stimulated and high-convergence groups in the plastic network at the beginning of the simulation. This is depicted in Figure 12A. To gain insight into how the cross-correlation should have changed as a result of the increased synaptic weights between the stimulated group and the high-convergence group, the experiment was repeated with synaptic weights for the external group drawn from the appropriate distributions shown in Figure 11A. This gives a reasonable idea of the expected cross-correlation at the end of the 400 s simulation, under the assumption that the rate of the stimulated group retains its initial value. The expected cross-correlation is shown in Figure 12B, and a significant increase from the initial cross-correlation is observable. However, the actual cross-correlation between the stimulated and high-convergence groups at the end of the plastic network simulation, shown in Figure 12C, instead indicates a significant reduction in correlation from the initial condition. This suggests that although the initial strong correlation between the stimulated and high-convergence groups causes a systematic increase in the synaptic weights, which should increase the reliability of transferral of the near-synchronous volley of spikes and thus induce development in the outgoing synapses of the high-convergence group, this effect is entirely counteracted by the decoupling of the stimulated group from the rest of the network, which lowers the mean membrane potential of the stimulated group and thereby its responsiveness to the stimulus.

5 Discussion

5.1 Compatibility of STDP with the Balanced Random Network Model.
We have shown that in a network with biologically realistic connectivity and sparseness, the weight dynamics of excitatory STDP synapses and the network dynamics can reach a mutual equilibrium. The equilibrium weight distribution is unimodal, and the network dynamics is that of a balanced random network in the asynchronous irregular regime. Specifically, the dynamics is almost identical to that of a static network with weights drawn from the equilibrium distribution of the plastic network. No spontaneous development of structure can be observed because, although the weight distribution is stable, the weights of individual synapses are not. A strong synapse does not stay strong indefinitely, but decays with a characteristic time constant and is replaced by other, previously weak, synapses. In other words, the number of strong synapses remains essentially constant, but the membership of this group is permanently in flux. Smaller networks with lower connectivity did not exhibit spontaneous development of structure either, although strong synapses persisted much longer, with a power law rather than exponential decay. The qualitatively different survival
statistics demonstrate the importance of investigating STDP in networks with biologically realistic connectivity. These results are in contrast to those of Izhikevich et al. (2004), where the development of neuronal groups was observed, but also in contrast to those of Iglesias et al. (2005), where a principal feature was the systematic pruning of the majority of synapses. It is difficult to determine exactly why the results are so divergent, as there are so many differences in the network, neuron, and plasticity models used. However, we have demonstrated the robustness of our findings. Neither a change in the connectivity in the network, nor a change in the spike pairing scheme, nor even in the specific formulation of STDP altered the essential finding that structure does not spontaneously develop in the balanced random networks studied. In order to achieve a stable weight distribution and network activity with the desired mean weight and activity statistics, it was necessary to adjust the strength of depressing increments, while keeping the strength of potentiating increments constant, to compensate for the nonvanishing cross-correlation brought about by network oscillations. This tuning amounts to a 6% increase in the strength of depressing increments, compared to that required to maintain the desired mean weight with purely Poissonian firing statistics. Interestingly, this adjustment brings the strength of a depressing increment closer to the value of α = 0.11 determined from the experimental data in section 2.2: the analytical value for Poissonian statistics is 8% lower, the adjusted value only 3% lower than the experimental value. If the strength of depressing increments is increased further in the direction of the experimental value, we show that a new stable state is found at a lower mean weight and with a lower degree of synchrony. In contrast, if the strength is decreased, no new stable state is reached within the asynchronous irregular regime.
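The quoted percentages are mutually consistent, as a short back-of-the-envelope check confirms (all numbers are taken from the text; only the arithmetic is added here):

```python
alpha_exp = 0.11                        # depression strength from the Bi & Poo reanalysis (section 2.2)
alpha_poisson = 0.92 * alpha_exp        # analytical value for Poissonian statistics: 8% below experiment
alpha_adjusted = 1.057 * alpha_poisson  # tuned value, a ~6% increase over the Poisson value

# The adjusted value ends up only ~3% below the experimental one.
gap = 1.0 - alpha_adjusted / alpha_exp
print(round(100 * gap, 1))  # 2.8 (%), i.e. "only 3% lower"
```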
After an initial smooth rate increase, a sudden transition can be observed in the activity dynamics and the weight distribution. The activity is characterized by bursts of strongly patterned activity at high rates, interspersed with network silence. The weight distribution, which remains unimodal until the transition, splits into several distinct clusters.

5.2 Effects of Induced Synchronous Activity. We investigated the effects of irregular but synchronous stimuli on a group of neurons within the stable plastic network. If the stimulus is too strong, due to amplification within the group or simply because the group size is too large, a synfire explosion is triggered. Unlike the behavior observed for a static network (Mehring et al., 2003), this explosion does not die out but drives the network into a pathological state. If the stimulus is not strong enough to trigger a synfire explosion, the weights of the outgoing synapses of the stimulated group develop as expected: a net increase is seen, particularly for synapses where the postsynaptic neuron has a high degree of convergence from the stimulated group. However, the causal relationship between the activity
of the stimulated group and the rest of the network is an acausal relationship when viewed from the perspective of the incoming synapses of the stimulated group. These synapses are therefore systematically weakened, which leads to a dramatic fall in the firing rate of the stimulated group. This loss of background input reduces their response to the stimulation, with the end result that the correlation between the stimulated group and the high-convergence group is, in fact, less after 400 s than at the beginning of the simulation, despite the increase in weights. No functional structure is developed under these conditions: the high-convergence group does not echo the stimulus strongly enough to strengthen the synapses of its own downstream high-convergence group.

5.3 Power Law Formulation of STDP. The general framework for normalized STDP update rules established by Gütig et al. (2003) has four free parameters: the maximum weight wmax, the asymmetry parameter α, the step size λ, and the exponent of the weight dependence µ. The time intensity of the simulation study is such that a systematic investigation of all these parameters was unfeasible; therefore, a reexamination of the experimental data of Bi and Poo (1998) was undertaken in order to constrain this framework as tightly as possible. However, when the absolute rather than relative changes in EPSC are considered as a function of the initial EPSC amplitude, a power law relationship emerges that cannot be expressed within the Gütig et al. framework. This novel update rule is a better fit to the data than either additive (Song et al., 2000; van Rossum et al., 2000; Izhikevich et al., 2004) or multiplicative rules (Rubin et al., 2001), or anything in between (Gütig et al., 2003). A property of our formulation of STDP is that the absolute potentiation of very weak synapses is small, rather than maximal (as for multiplicative rules) or equal to the absolute potentiation of a stronger synapse (as for additive rules).
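This distinguishing property can be illustrated numerically. In the sketch below, the power-law potentiation takes the form F+(w) = λ w0^(1−µ) w^µ; this functional form and the parameter values, as well as the additive and soft-bound multiplicative comparison rules, are our illustrative assumptions (the appendix of this article specifies only the depression update F−(w) = λαw):

```python
lam, mu, w0 = 0.1, 0.4, 45.0  # illustrative step size, exponent, and reference weight [pA]

def dw_power_law(w):
    # power-law potentiation: the absolute increment grows sublinearly with w,
    # so it is small (not maximal) for very weak synapses
    return lam * w0 ** (1 - mu) * w ** mu

def dw_additive(w):
    # additive rule: the same absolute increment regardless of w
    return lam * w0

def dw_multiplicative(w, w_max=300.0):
    # soft-bound multiplicative rule: the increment is largest for the weakest synapses
    return lam * (w_max - w)

weak, strong = 0.5, 60.0
print(dw_power_law(weak) < dw_additive(weak))               # True: weak-synapse potentiation is small
print(dw_multiplicative(weak) > dw_multiplicative(strong))  # True: multiplicative is maximal when weak
```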
This property is a clear prediction that can be tested experimentally to distinguish between the available models more conclusively. We combined this formulation of the STDP update rules with an all-to-all spike pairing scheme, which we showed has a tendency to correct rate disparities, in contrast with a nearest-neighbor scheme, which, in the absence of additional stabilization mechanisms, tends to amplify any such disparities.

5.4 Limitations and Perspectives. Inevitably, this study was limited by the long duration of the simulations. The weight dynamics involve timescales up to three orders of magnitude longer than the activity dynamics in a static network: the transient of the weight distribution investigated in section 4.1 lasted about 200 s, whereas the transient in activity dynamics before a stable asynchronous irregular state is established in a static network is more like 200 ms. This is the main problem with such simulations, rather than the increased computational complexity of implementing plastic synapses; this slows our network simulations by a factor of less than
10. Currently, a 1000 s simulation of the unstimulated network at 8.8 Hz requires about 60 hours of computational time on our PC cluster (20 × 2 processors, AMD Opteron 2.4 GHz, Dolphin/Scali interconnect). Simulating networks with reduced connectivity and scaled synapses ameliorated the problem, but at the cost of changing the survival statistics. It has yet to be established to what extent the network can be reduced from its biologically realistic scale while maintaining qualitatively the same behavior. Regardless of whether this can be achieved, there is a clear need for analytical approaches that simultaneously account for plasticity and activity dynamics. Given these limitations, there are nonetheless clear indications of how this work can be extended to investigate STDP in a balanced recurrent network more systematically. The discovery that the implementation of delays as purely dendritic can trigger a synfire explosion by strengthening the synapses within a stimulated group suggests that the distribution of propagation delay between the axon and the dendrite is potentially a crucial parameter for the network dynamics. Future work will need to determine to what extent the pathological state exhibited by the network as a result of a synfire explosion or insufficiently strong depression is affected by an increased proportion of axonal delay. Moreover, the possibility that the patterns observed in these states are at least partially artifacts of discrete time step simulation cannot be ruled out. We are currently investigating efficient methods of incorporating continuous spike times within a discrete time simulation scheme (Morrison, Straube, Plesser, & Diesmann, 2007), which should allow clarification of this issue. Another major area for development is the relative simplicity of the plasticity model used. 
It is, of course, necessary to start with simple models; otherwise, when confronted with complex behavior, it is impossible to say which aspect of the model is causing what. However, recent research reveals a rich variety of plasticity expression, such as suppression (Froemke & Dan, 2002), activity-dependent postsynaptic homeostasis (Turrigiano et al., 1998), and nonlinear interactions between potentiation and depression (Wang et al., 2005). These effects may turn out to have fundamental consequences for network behavior. The model used also incorporates no upper bound on synaptic strength, which seems unbiological. However, as no upper bound could be extracted from the experimental data of Bi and Poo (1998), further investigation is required to establish a reasonable saturation behavior for the power law update rule proposed here. If structure is to develop in a balanced random network as a response to synchronous stimuli, the stimulated group must continue to receive enough background input so that its response to the stimulus does not diminish over time. To achieve this, one or both of the model components in this study must be adjusted. A natural adjustment to the plasticity model would be to incorporate postsynaptic activity-dependent homeostasis as used in van Rossum et al. (2000). This would increase the weights of the incoming synapses of the stimulated group as a response to its falling rate.
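As a rough illustration of such a homeostatic adjustment, the sketch below multiplicatively scales a neuron's incoming weights toward a target rate. The rule, its name, and its parameters are hypothetical, loosely in the spirit of van Rossum et al. (2000), not a component of the model studied here:

```python
def homeostatic_scale(incoming_weights, rate_hz, target_hz, eta=0.01):
    """Multiplicatively scale all incoming weights of a neuron: up when its
    firing rate is below target, down when it is above (illustrative rule)."""
    factor = 1.0 + eta * (target_hz - rate_hz)
    return [w * factor for w in incoming_weights]

# A stimulated-group neuron whose rate has fallen below the network
# equilibrium rate would have its incoming weights gradually restored:
w_in = [45.0, 50.0, 40.0]
print(homeostatic_scale(w_in, rate_hz=3.0, target_hz=8.8))  # all weights scaled up by 5.8%
```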
Alternatively, the connectivity in the network model could be adjusted. First, in the model used here, all propagation delays are identical, and each neuron has the same number of incoming synapses. Introducing heterogeneities in these quantities has been shown to reduce network oscillations (Brunel, 2000; Tetzlaff et al., 2005), which might prevent the network from entering a pathological state under stronger stimulation than was possible to use here. Second, a connection pattern resulting in a long-tailed rather than binomial convergence distribution for the stimulated group would lessen the acausal correlation between the embedding network and the stimulated group, thus reducing the drop in the mean weight of its incoming synapses.

Appendix: Algorithmic Implementation of STDP

The following implementation assumes that a synapse that connects the axon of neuron i to the dendrite of neuron j is stored in a list belonging to i. In the case of a distributed application, this list is assumed to be physically on the machine of the postsynaptic neuron, j (see Morrison et al., 2005). Each list of postsynaptic targets maintains two variables: t_old, the time of the last presynaptic spike, and K_+, which describes the current time-dependent weighting of the STDP update rule for potentiation. Each synapse maintains its current weight w_ji and its synaptic delay d_i = d_i^A + d_i^D, which is composed of its axonal delay d_i^A and its dendritic delay d_i^D, such that d_i^D ≥ d_i^A. Note that these propagation delays are taken into account when calculating the difference in time between presynaptic and postsynaptic spikes: spike timing is calculated from the perspective of the synapse and not the soma (see Debanne et al., 1998). Additionally, the synapse is equipped with two functions F_+(w) and F_−(w), which perform the weight-dependent update for potentiation and depression, respectively. For example, in the case of power law STDP, F_−(w) = λαw.
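The synapse-centered timing convention can be checked with a small numeric sketch (the function name and delay values below are ours, for illustration). It also shows why purely dendritic delays put two simultaneously firing neurons on the potentiating branch of the STDP window, the intragroup amplification effect noted in section 4.2.2:

```python
def delta_t_at_synapse(t_pre, t_post, d_axonal, d_dendritic):
    """Spike-time difference from the synapse's perspective: the presynaptic
    spike arrives after the axonal delay, the postsynaptic spike after the
    dendritic (back-propagation) delay."""
    return (t_post + d_dendritic) - (t_pre + d_axonal)

# Two neurons of the stimulated group firing simultaneously (t_pre == t_post):
# with a purely dendritic delay the synapse sees pre before post, so the
# intragroup synapse falls on the potentiating branch of the STDP window.
dt = delta_t_at_synapse(t_pre=10.0, t_post=10.0, d_axonal=0.0, d_dendritic=1.5)
print(dt)  # 1.5 (ms), > 0: systematic potentiation of intragroup synapses
```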
The target list performs its weight updates and sends spikes to its postsynaptic neurons when its spike method is invoked for a presynaptic spike at time t:

spike(t):
    for each postsynaptic neuron j:
        history ← j.get_history(t_old + d_j^A − d_j^D, t + d_j^A − d_j^D)
        for each spike t_j in history:
            Δt ← (t_j + d_j^D) − (t_old + d_j^A)
            if Δt ≠ 0:
                w_j ← w_j + F_+(w_j) · K_+ · exp(−Δt/τ)
        K_− ← j.get_K_value(t + d_j^A − d_j^D)
        w_j ← w_j − F_−(w_j) · K_−
        send spike (w_j, d_j) to neuron j
    K_+ ← K_+ · exp(−(t − t_old)/τ) + 1
    t_old ← t
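The update K_+ ← K_+ · exp(−(t − t_old)/τ) + 1 maintains, in a single variable, the sum of exponentially decaying contributions from all previous presynaptic spikes, as required by the all-to-all pairing scheme. A short check (function names and the value of τ are ours, for illustration) confirms that the incremental update and the explicit sum agree:

```python
import math

TAU = 20.0  # STDP time constant [ms], illustrative value

def trace_incremental(spike_times):
    """Trace after the last spike, maintained as in the appendix:
    K <- K * exp(-(t - t_old)/tau) + 1 at each presynaptic spike."""
    K, t_old = 0.0, None
    for t in spike_times:
        K = 1.0 if t_old is None else K * math.exp(-(t - t_old) / TAU) + 1.0
        t_old = t
    return K

def trace_explicit(spike_times):
    """Equivalent explicit sum of exponentials over all past spikes."""
    t_last = spike_times[-1]
    return sum(math.exp(-(t_last - t) / TAU) for t in spike_times)

spikes = [3.0, 11.5, 12.0, 40.0]
print(abs(trace_incremental(spikes) - trace_explicit(spikes)) < 1e-12)  # True
```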
The postsynaptic neuron j maintains the variables t_old, containing the time of its last spike, K_−, which describes the current time-dependent weighting of the STDP update rule for depression, and N_syn, the number of incoming STDP synapses. In addition, a dynamic data structure spike_register is required, which stores spike information in the form of tuples (t_sp, K_sp, counter_sp), where t_sp is the spike time, K_sp is the value of K_− at time t_sp, and counter_sp counts how many times the spike information has been accessed by synapses. When neuron j spikes at time t, the spike register has to be updated:

update_register(t):
    K_− ← K_− · exp(−(t − t_old)/τ) + 1
    while length of spike_register ≥ 1 and counter_sp of first element ≥ N_syn:
        remove first element
    store (t, K_−, 0) as last element
    t_old ← t
To enable the synapses to access this information, two other methods have to be available: get_history(t1, t2), which returns the spike times recorded in the spike register in the range (t1, t2], and get_K_value(t), which returns the value of K_− for the time t. They may be implemented as follows:

get_history(t1, t2):
    history ← empty list
    starting at the beginning of spike_register:
        while t_sp ≤ t1:
            move to next element
        while t_sp ≤ t2:
            append t_sp to history
            counter_sp ← counter_sp + 1
            move to next element
    return history

get_K_value(t):
    starting at the end of spike_register:
        while t_sp ≥ t:
            move to previous element
    return K_sp · exp(−(t − t_sp)/τ)
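The spike register bookkeeping described above can be condensed into a small Python sketch (class and attribute names are ours, for illustration); it shows how an entry whose counter has reached N_syn, and has therefore been seen by every incoming synapse, is discarded at the next postsynaptic spike:

```python
import math
from collections import deque

class SpikeRegister:
    """Minimal model of the postsynaptic spike register: entries are
    [t_sp, K_sp, counter_sp]; an entry accessed by all n_syn incoming
    synapses is pruned at the next postsynaptic spike."""
    def __init__(self, n_syn, tau=20.0):
        self.entries = deque()
        self.n_syn, self.tau = n_syn, tau
        self.K, self.t_old = 0.0, 0.0

    def update_register(self, t):
        # called when the postsynaptic neuron spikes at time t
        self.K = self.K * math.exp(-(t - self.t_old) / self.tau) + 1.0
        while self.entries and self.entries[0][2] >= self.n_syn:
            self.entries.popleft()  # spike already seen by every synapse
        self.entries.append([t, self.K, 0])
        self.t_old = t

    def get_history(self, t1, t2):
        # spike times in (t1, t2]; each access increments the entry's counter
        out = []
        for e in self.entries:
            if t1 < e[0] <= t2:
                out.append(e[0])
                e[2] += 1
        return out

reg = SpikeRegister(n_syn=2)
reg.update_register(5.0)
reg.get_history(0.0, 10.0)  # first synapse reads the spike at t = 5
reg.get_history(0.0, 10.0)  # second synapse reads it; counter reaches n_syn
reg.update_register(20.0)   # the fully read entry is now discarded
print(len(reg.entries))     # 1 (only the spike at t = 20 remains)
```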
This implementation permits a wide class of STDP models to be simulated in a distributed application. Maintaining a spike register means that a neuron does not require knowledge of the locations of its incoming
synapses in order to update their weights when it spikes; synaptic weights are updated only when a presynaptic spike occurs. Using the K_± variables to implement the sums of exponential functions and tracking counter_sp, the number of times synapses have accessed a particular spike, allows spike information that is no longer relevant to be discarded, thus keeping the number of retained spikes to a minimum. This means that the memory required does not increase for longer simulations, and the list operations in get_history remain fast. This template algorithm can be trivially extended to models that require clipping of the weights at upper and lower boundaries or where the presynaptic and postsynaptic time constants differ. Different spike pairing schemes can be investigated by changing the way the K_± variables are updated. With more modifications, it can also be extended to more complex models such as the suppression model put forward by Froemke and Dan (2002). The biggest limitation of the algorithm is the requirement that d_i^D ≥ d_i^A. This is necessary to ensure causality is not violated: the postsynaptic neuron must not be required to give information on its future state. In this article, for simplicity, the assumption was made that the delay is purely dendritic: d_i^D = d_i, d_i^A = 0. Using the above algorithm, other distributions of the delay between the dendrite and the axon can be investigated up to the limit of d_i^D = d_i^A = d_i/2. This limitation can be circumvented, but at the cost of more complicated techniques, essentially queueing the event until the postsynaptic neuron has reached a state where the required information is available.

Acknowledgments

This work was carried out when A.M. and M.D. were based at the Bernstein Center for Computational Neuroscience, Albert-Ludwigs-University, Freiburg. We are particularly grateful to Guo-Qiang Bi and Mu-ming Poo for providing us with their original data.
We acknowledge constructive discussions with the members of the NEST initiative (www.nest-initiative.org). This work would not have been possible without a dedicated grant for a high-performance computer cluster, HBFG Chapter 1423 Title 812 59, and the exertions of Martin Walter and Ulrich Gehring of the Freiburg University Computer Center. This work was partially funded by DAAD 313-PPPN4-lk, DIP F1.2, and BMBF Grant 01GQ0420 to the Bernstein Center for Computational Neuroscience Freiburg.

References

Bi, G.-q., & Poo, M.-m. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2(1), 32–48.
Braitenberg, V., & Schüz, A. (1998). Cortex: Statistics and geometry of neuronal connectivity (2nd ed.). Berlin: Springer-Verlag.
Brunel, N. (2000). Dynamics of sparsely connected networks of excitatory and inhibitory spiking neurons. J. Comput. Neurosci., 8(3), 183–208.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comput., 11(7), 1621–1671.
Burkitt, A. N., Meffin, H., & Grayden, D. B. (2004). Spike-timing-dependent plasticity: The relationship to rate-based learning for models with weight dynamics determined by a stable fixed point. Neural Comput., 16, 885–940.
Debanne, D., Gähwiler, B. H., & Thompson, S. M. (1998). Long-term synaptic plasticity between pairs of individual CA3 pyramidal cells in rat hippocampal slice cultures. J. Physiol. (Lond.), 507, 237–247.
Feldman, D. E. (2000). Timing-based LTP and LTD at vertical inputs to layer II/III pyramidal cells in rat barrel cortex. Neuron, 27, 45–56.
Fetz, E., Toyama, K., & Smith, W. (1991). Synaptic interactions between cortical neurons. In A. Peters (Ed.), Cerebral cortex (Vol. 9, pp. 1–47). New York: Plenum.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416(6879), 433–438.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 383, 76–78.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J. Neurosci., 23(9), 3697–3714.
Guyonneau, R., VanRullen, R., & Thorpe, S. J. (2005). Neurons tune to the earliest spikes through STDP. Neural Comput., 17, 859–879.
Hertz, J., & Prügel-Bennett, A. (1996). Learning synfire chains by self-organization. Network: Comput. Neural Systems, 7(2), 357–363.
Iglesias, J., Eriksson, J., Grize, F., Tomassini, M., & Villa, A. (2005). Dynamics of pruning in simulated large-scale spiking neural networks. Biosystems, 79, 11–20.
Izhikevich, E. M., & Desai, N. S. (2003). Relating STDP to BCM. Neural Comput., 15, 1511–1523.
Izhikevich, E. M., Gally, J. A., & Edelman, G. M. (2004). Spike-timing dynamics of neuronal groups. Cereb. Cortex, 14, 933–944.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Phys. Rev. E, 59, 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Comput., 12, 2709–2742.
Kistler, W. M., & van Hemmen, J. L. (2000). Modeling synaptic plasticity in conjunction with the timing of pre- and postsynaptic action potentials. Neural Comput., 12, 385–405.
Levy, N., Horn, D., Meilijson, I., & Ruppin, E. (2001). Distributed synchrony in a cell assembly of spiking neurons. Neural Networks, 14, 815–824.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
STDP in Balanced Random Networks
1467
Mehring, C., Hehl, U., Kubo, M., Diesmann, M., & Aertsen, A. (2003). Activity dynamics and propagation of synchronous spiking in locally connected random networks. Biol. Cybern., 88(5), 395–408.
Morrison, A., Mehring, C., Geisel, T., Aertsen, A., & Diesmann, M. (2005). Advancing the boundaries of high connectivity network simulation with distributed computing. Neural Comput., 17(8), 1776–1801.
Morrison, A., Straube, S., Plesser, H. E., & Diesmann, M. (2007). Exact subthreshold integration with continuous spike times in discrete time neural network simulations. Neural Comput., 19(1), 47–79.
Rubin, J., Lee, D., & Sompolinsky, H. (2001). Equilibrium properties of temporally asymmetric Hebbian plasticity. Phys. Rev. Lett., 86, 364–367.
Sjöström, P., Turrigiano, G., & Nelson, S. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3(9), 919–926.
Tetzlaff, T., Morrison, A., Timme, M., & Diesmann, M. (2005). Heterogeneity breaks global synchrony in large networks. In Proceedings of the 30th Göttingen Neurobiology Conference. Neuroforum (Supp. 1), 206b.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 892–896.
van Rossum, M. C. W., Bi, G.-Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20(23), 8812–8821.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1371.
Wang, H.-X., Gerkin, R. C., Nauen, D. W., & Bi, G.-Q. (2005). Coactivation and timing-dependent integration of synaptic potentiation and depression. Nat. Neurosci., 8(2), 187–193.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37–44.
Received December 18, 2005; accepted September 1, 2006.
LETTER
Communicated by Gal Chechik
Reinforcement Learning Through Modulation of Spike-Timing-Dependent Synaptic Plasticity Răzvan V. Florian
[email protected] Center for Cognitive and Neural Studies (Coneural), 400504 Cluj-Napoca, Romania, and Babeş-Bolyai University, Institute for Interdisciplinary Experimental Research, 400271 Cluj-Napoca, Romania
The persistent modification of synaptic efficacy as a function of the relative timing of pre- and postsynaptic spikes is a phenomenon known as spike-timing-dependent plasticity (STDP). Here we show that the modulation of STDP by a global reward signal leads to reinforcement learning. We first derive analytically learning rules involving reward-modulated spike-timing-dependent synaptic and intrinsic plasticity, by applying a reinforcement learning algorithm to the stochastic spike response model of spiking neurons. These rules have several features in common with plasticity mechanisms experimentally found in the brain. We then demonstrate in simulations of networks of integrate-and-fire neurons the efficacy of two simple learning rules involving modulated STDP. One rule is a direct extension of the standard STDP model (modulated STDP), and the other one involves an eligibility trace stored at each synapse that keeps a decaying memory of the relations between recent pairs of pre- and postsynaptic spikes (modulated STDP with eligibility trace). This latter rule permits learning even if the reward signal is delayed. The proposed rules are able to solve the XOR problem with both rate-coded and temporally coded input and to learn a target output firing-rate pattern. These learning rules are biologically plausible, may be used for training generic artificial spiking neural networks, regardless of the neural model used, and suggest the experimental investigation in animals of the existence of reward-modulated STDP.

1 Introduction

The dependence of synaptic changes on the relative timing of pre- and postsynaptic action potentials has been experimentally observed in biological neural systems (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Dan & Poo, 2004).
Neural Computation 19, 1468–1502 (2007)
© 2007 Massachusetts Institute of Technology

A typical example of spike-timing-dependent plasticity (STDP) is given by the potentiation of a synapse when the postsynaptic spike follows the presynaptic spike within a time window of a few tens of milliseconds and the depression of the synapse when the order of the spikes is reversed. This type of STDP is sometimes called Hebbian because it is consistent with the original postulate of Hebb, which predicted the strengthening of a synapse when the presynaptic neuron causes the postsynaptic neuron to fire. It is also antisymmetric, because the sign of synaptic changes varies with the sign of the relative spike timing. Experiments have also found synapses with anti-Hebbian STDP (where the sign of the changes is reversed, in comparison to Hebbian STDP), as well as synapses with symmetric STDP (Dan & Poo, 1992; Bell, Han, Sugawara, & Grant, 1997; Egger, Feldmeyer, & Sakmann, 1999; Roberts & Bell, 2002). Theoretical studies have mostly focused on the computational properties of Hebbian STDP and have shown its function in neural homeostasis and in unsupervised and supervised learning. This mechanism can regulate both the rate and the variability of postsynaptic firing and may induce competition between afferent synapses (Kempter, Gerstner, & van Hemmen, 1999, 2001; Song, Miller, & Abbott, 2000). Hebbian STDP can also lead to unsupervised learning and prediction of sequences (Roberts, 1999; Rao & Sejnowski, 2001). Plasticity rules similar to Hebbian STDP were derived theoretically by optimizing the mutual information between the presynaptic input and the activity of the postsynaptic neuron (Toyoizumi, Pfister, Aihara, & Gerstner, 2005; Bell & Parra, 2005; Chechik, 2003), minimizing the postsynaptic neuron's variability to a given input (Bohte & Mozer, 2005), optimizing the likelihood of postsynaptic firing at one or several desired firing times (Pfister, Toyoizumi, Barber, & Gerstner, 2006), or self-repairing a classifier network (Hopfield & Brody, 2004). It was also shown that by clamping the postsynaptic neuron to a target signal, Hebbian STDP can lead, under certain conditions, to learning a particular spike pattern (Legenstein, Naeger, & Maass, 2005).
Anti-Hebbian STDP is, at first glance, not as interesting as the Hebbian mechanism, as it leads, by itself, to an overall depression of the synapses toward zero efficacy (Abbott & Gerstner, 2005). Hebbian STDP has a particular sensitivity to causality: if a presynaptic neuron contributes to the firing of the postsynaptic neuron, the plasticity mechanism will strengthen the synapse, and thus the presynaptic neuron will become more effective in causing the postsynaptic neuron to fire. This mechanism can lead a network to associate a stable output with a particular input. But let us imagine that causal relationships are reinforced only when they lead to something good (e.g., when the agent controlled by the neural network receives a positive reward), while causal relationships that lead to failure are weakened, to avoid erroneous behavior. The synapse should then feature Hebbian STDP when the reward is positive and anti-Hebbian STDP when the reward is negative. In this case, the neural network may learn to associate a particular input not with an arbitrary output, determined, for example, by the initial state of the network, but with a desirable output, as determined by the reward. Alternatively, we may consider the theoretical result that under certain conditions, Hebbian STDP minimizes the postsynaptic neuron's variability
to a given presynaptic input (Bohte & Mozer, 2005). An analysis analogous to the one performed in that study can show that anti-Hebbian STDP maximizes variability. This kind of influence of Hebbian/anti-Hebbian STDP on variability has also been observed at the network level in simulations (Daucé, Soula, & Beslon, 2005; Soula, Alwan, & Beslon, 2005). By having Hebbian STDP when the network receives positive reward, the variability of the output is reduced, and the network can exploit the particular configuration that led to positive reward. By having anti-Hebbian STDP when the network receives negative reward, the variability of the network's behavior is increased, and the network can thus explore various strategies until it finds one that leads to positive reward. It is thus tempting to verify whether the modulation of STDP by a reward signal can indeed lead to reinforcement learning. The exploration of this mechanism is the subject of this letter. This hypothesis is supported by a series of studies on reinforcement learning in nonspiking artificial neural networks, where the learning mechanisms are qualitatively similar to reward-modulated STDP, strengthening synapses when reward correlates with both presynaptic and postsynaptic activity. These studies have focused on networks composed of binary stochastic elements (Barto, 1985; Barto & Anandan, 1985; Barto & Anderson, 1985; Barto & Jordan, 1987; Mazzoni, Andersen, & Jordan, 1991; Pouget, Deffayet, & Sejnowski, 1995; Williams, 1992; Bartlett & Baxter, 1999b, 2000b) or threshold gates (Alstrøm & Stassinopoulos, 1995; Stassinopoulos & Bak, 1995, 1996). These previous results are promising and inspiring, but they do not investigate biologically plausible reward-modulated STDP; they use memoryless neurons and work in discrete time.
Another study showed that reinforcement learning can be obtained by correlating fluctuations in irregular spiking with a reward signal in networks composed of neurons firing Poisson spike trains (Xie & Seung, 2004). The resulting learning rule qualitatively resembles reward-modulated STDP as well. However, the results of that study depend strongly on the Poisson characteristic of the neurons. Also, this learning model presumes that neurons respond instantaneously to their input by modulating their firing rate. This partly ignores the memory of the neural membrane potential, an important characteristic of spiking neural models. Reinforcement learning has also been achieved in spiking neural networks by reinforcement of stochastic synaptic transmission (Seung, 2003). This biologically plausible learning mechanism has several features in common with our proposed mechanism, as we will show later, but it is not directly related to STDP. Another existing reinforcement learning algorithm for spiking neural networks requires a particular feedforward network architecture with a fixed number of layers and is likewise not related to STDP (Takita, Osana, & Hagiwara, 2001; Takita & Hagiwara, 2002, 2005).
Two other previous studies seem to consider STDP as a reinforcement learning mechanism, but in fact they do not. Strösslin and Gerstner (2003) developed a model for spatial learning and navigation based on reinforcement learning, inspired by the hypothesis that eligibility traces are implemented using dopamine-modulated STDP. However, the model does not use spiking neurons and represents just the continuous firing rates of the neurons. The learning mechanism resembles modulated STDP qualitatively by strengthening synapses when reward correlates with both presynaptic and postsynaptic activity, as in some other studies already mentioned. Rao and Sejnowski (2001) have shown that a temporal difference (TD) rule used in conjunction with dendritic backpropagating action potentials reproduces Hebbian STDP. TD learning is often associated with reinforcement learning, where it is used to predict the value function (Sutton & Barto, 1998). But in general, TD methods are used for prediction problems and thus implement a form of supervised learning (Sutton, 1988). In Rao and Sejnowski's study (2001), the neuron learns to predict its own membrane potential; hence, their setup can be interpreted as a form of unsupervised learning, since the neuron provides its own teaching signal. In any case, that work does not study STDP as a reinforcement learning mechanism, as there is no external reward signal. In a previous preliminary study, we analytically derived learning rules involving modulated STDP for networks of probabilistic integrate-and-fire neurons and tested them, together with some generalizations, in simulations in a biologically inspired context (Florian, 2005). We have also previously studied the effects of Hebbian and anti-Hebbian STDP in oscillatory neural networks (Florian & Mureşan, 2006). Here we study in more depth reward-modulated STDP and its efficacy for reinforcement learning.
We first derive analytically, in section 2, learning rules involving reward-modulated spike-timing-dependent synaptic and intrinsic plasticity by applying a reinforcement learning algorithm to the stochastic spike response model of spiking neurons, and we discuss the relationship of these rules to other reinforcement learning algorithms for spiking neural networks. We then introduce, in section 3, two simpler learning rules based on modulated STDP and study them in simulations, in section 4, to test their efficacy for solving some benchmark problems and to explore their properties. The results are discussed in section 5. In the course of our study, we found that two other groups have independently observed, through simulations, the reinforcement learning properties of modulated STDP. Soula, Alwan, and Beslon (2004, 2005) have trained a robot controlled by a spiking neural network to avoid obstacles by applying Hebbian STDP when the robot moves forward and anti-Hebbian STDP when the robot hits an obstacle. Farries and Fairhall (2005a, 2005b) have studied a two-layer feedforward network where the input units are treated as independent inhomogeneous Poisson processes and the output units are single-compartment, conductance-based models. The network learned to
have a particular pattern of output activity through modulation of STDP by a scalar evaluation of performance. This latter study has been published in abstract form only, and thus the details of the work are not known. In contrast to these studies, in this letter we also present an analytical justification for introducing reward-modulated STDP as a reinforcement learning mechanism, and we also study a form of modulated STDP that includes an eligibility trace, which permits learning even with delayed reward.

2 Analytical Derivation of a Reinforcement Learning Algorithm for Spiking Neural Networks

In order to motivate the introduction of reward-modulated STDP as a reinforcement learning mechanism, we derive analytically a reinforcement learning algorithm for spiking neural networks.

2.1 Derivation of the Basic Learning Rule.

The algorithm is derived as an application of the OLPOMDP reinforcement learning algorithm (Baxter, Bartlett, & Weaver, 2001; Baxter, Weaver, & Bartlett, 1999), an online variant of the GPOMDP algorithm (Bartlett & Baxter, 1999a; Baxter & Bartlett, 2001). GPOMDP assumes that the interaction of an agent with an environment is a partially observable Markov decision process (POMDP) and that the agent chooses actions according to a probabilistic policy μ that depends on a vector w of several real parameters. GPOMDP was derived analytically by considering an approximation to the gradient, with respect to the parameters w, of the long-term average of the external reward received by the agent. Results related to the convergence of OLPOMDP to local maxima have been obtained (Bartlett & Baxter, 2000a; Marbach & Tsitsiklis, 1999, 2000). It was shown that applying the algorithm to a system of interacting agents that seek to maximize the same reward signal r(t) is equivalent to applying the algorithm independently to each agent i (Bartlett & Baxter, 1999b, 2000b).
OLPOMDP suggests that the parameters w_ij (components of the vector w_i of agent i) should evolve according to

\[ w_{ij}(t+\delta t) = w_{ij}(t) + \gamma^0\, r(t+\delta t)\, z^0_{ij}(t+\delta t) \tag{2.1} \]
\[ z^0_{ij}(t+\delta t) = \beta\, z^0_{ij}(t) + \zeta_{ij}(t) \tag{2.2} \]
\[ \zeta_{ij}(t) = \frac{1}{\mu^i_{f_i}(t)} \frac{\partial \mu^i_{f_i}(t)}{\partial w_{ij}}, \tag{2.3} \]
where δt is the duration of a time step (the system evolves in discrete time), the learning rate γ^0 is a small, constant parameter, z^0 is an eligibility trace (Sutton & Barto, 1998), and ζ denotes the change of z^0 resulting from
the activity in the last time step, μ^i_{f_i}(t) is the policy-determined probability that agent i chooses action f_i(t), and f_i(t) is the action actually chosen at time t. The discount factor β is a parameter that can take values between 0 and 1. To accommodate future developments, we made the following changes to standard OLPOMDP notation:

\[ \beta = \exp(-\delta t/\tau_z) \tag{2.4} \]
\[ \gamma^0 = \gamma\, \delta t/\tau_z \tag{2.5} \]
\[ z^0_{ij}(t) = z_{ij}(t)\, \tau_z, \tag{2.6} \]
with τ_z being the time constant for the exponential decay of z. We thus have

\[ w_{ij}(t+\delta t) = w_{ij}(t) + \gamma\, \delta t\, r(t+\delta t)\, z_{ij}(t+\delta t) \tag{2.7} \]
\[ z_{ij}(t+\delta t) = \beta\, z_{ij}(t) + \zeta_{ij}(t)/\tau_z. \tag{2.8} \]
The parameters β and γ^0 should be chosen such that δt/(1 − β) and δt/γ^0 are large compared to the mixing time τ_m of the system (Baxter et al., 2001; Baxter & Bartlett, 2001; Bartlett & Baxter, 2000a, 2000c). With our notation and for δt ≪ τ_z, these conditions are met if τ_z is large compared to τ_m and if 0 < γ < 1. The mixing time can be defined rigorously for a Markov process and can be thought of as the time from the occurrence of an action until the effects of that action have died away.
We apply this algorithm to the case of a neural network. We consider that at each time step t, a neuron i either fires (f_i(t) = 1) with probability p_i(t) or does not fire (f_i(t) = 0) with probability 1 − p_i(t). The neurons are connected through plastic synapses with efficacies w_ij(t), where i is the index of the postsynaptic neuron. The efficacies w_ij can be positive or negative (corresponding to excitatory and inhibitory synapses, respectively). The global reward signal r(t) is broadcast to all synapses. By considering each neuron i as an independent agent and the firing and nonfiring probabilities of the neuron as determining the policy μ^i of the corresponding agent (μ^i_1 = p_i, μ^i_0 = 1 − p_i), we get the following form of ζ_ij:

\[
\zeta_{ij}(t) = \begin{cases} \dfrac{1}{p_i(t)}\, \dfrac{\partial p_i(t)}{\partial w_{ij}}, & \text{if } f_i(t) = 1 \\[2ex] -\dfrac{1}{1-p_i(t)}\, \dfrac{\partial p_i(t)}{\partial w_{ij}}, & \text{if } f_i(t) = 0. \end{cases} \tag{2.9}
\]
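Equations 2.7 to 2.9 can be exercised directly on the memoryless binary stochastic neuron for which they are stated. The following is a minimal sketch, not part of the derivation: the sigmoid policy, the constants, and the reward function (reward +1 for firing, −1 for silence) are illustrative assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def olpomdp_step(w, z, x, reward_fn, dt=0.001, tau_z=0.2, gamma=20.0):
    """One discrete-time step of equations 2.7-2.9 for a single memoryless
    stochastic neuron whose firing probability is p = sigmoid(w . x)."""
    p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    fired = random.random() < p
    r = reward_fn(fired)                      # r(t + dt): reward following this action
    beta = math.exp(-dt / tau_z)
    for j in range(len(w)):
        dp_dw = p * (1.0 - p) * x[j]          # dp/dw_j for the sigmoid policy
        # equation 2.9: zeta depends on whether the neuron fired
        zeta = dp_dw / p if fired else -dp_dw / (1.0 - p)
        # equation 2.8: the eligibility trace decays and accumulates zeta / tau_z
        z[j] = beta * z[j] + zeta / tau_z
        # equation 2.7: the weight change is the reward-gated eligibility trace
        w[j] += gamma * dt * r * z[j]
    return p

random.seed(1)
w, z, x = [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]
for _ in range(2000):
    # reward the neuron for firing, punish silence
    p = olpomdp_step(w, z, x, lambda f: 1.0 if f else -1.0)
```

Because the reward is positive exactly when the neuron fires, the expected weight drift is positive and the firing probability rises, on average, from its initial value of 0.5; inverting the reward sign would drive it down instead.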
Equation 2.9, together with equations 2.7 and 2.8, establishes a plasticity rule that updates the synapses so as to optimize the long-term average of the reward received by the network. Thus far, we have followed a derivation also performed by Bartlett and Baxter (1999b, 2000b) for networks of memoryless binary stochastic units, although they called them spiking neurons. Unlike these studies, we consider here spiking neurons where the membrane potential keeps a decaying memory of past inputs, as in real neurons. In particular, we consider the spike response model (SRM) for neurons, which reproduces with high accuracy the dynamics of the complex Hodgkin-Huxley neural model while being amenable to analytical treatment (Gerstner, 2001; Gerstner & Kistler, 2002). According to the SRM, each neuron is characterized by its membrane potential u, defined as

\[
u_i(t) = \eta_i(t - \hat t_i) + \sum_j w_{ij} \sum_{t_j^f \in \mathcal{F}_j^t} \varepsilon_{ij}\!\left(t - \hat t_i,\, t - t_j^f\right), \tag{2.10}
\]
where \(\hat t_i\) is the time of the last spike of neuron i, η_i is the refractory response due to this last spike, t_j^f are the moments of the spikes of presynaptic neuron j emitted prior to t, and w_ij ε_ij(t − \(\hat t_i\), t − t_j^f) is the postsynaptic potential induced in neuron i by an input spike from neuron j at time t_j^f. The first sum runs over all presynaptic neurons, and the last sum runs over all spikes of neuron j prior to t (represented by the set F_j^t). We consider that the neuron has a noisy threshold and that it fires stochastically, according to the escape noise model (Gerstner & Kistler, 2002). The neuron fires in the interval δt with probability p_i(t) = ρ_i(u_i(t) − θ_i) δt, where ρ_i is a probability density, also called firing intensity, and θ_i is the firing threshold of the neuron. We note

\[ \rho_i' = \frac{\partial \rho_i}{\partial (u_i - \theta_i)}, \tag{2.11} \]

and we have

\[ \frac{\partial p_i(t)}{\partial w_{ij}} = \rho_i'(t)\, \frac{\partial u_i(t)}{\partial w_{ij}}\, \delta t. \tag{2.12} \]

From equation 2.10 we get

\[ \frac{\partial u_i(t)}{\partial w_{ij}} = \sum_{t_j^f \in \mathcal{F}_j^t} \varepsilon_{ij}\!\left(t - \hat t_i,\, t - t_j^f\right); \tag{2.13} \]
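The SRM quantities in equations 2.10 to 2.13 can be sketched in a few lines. The exponential choices for the refractory response η and the kernel ε below are illustrative assumptions only; the derivation requires merely that some such functions exist (Gerstner & Kistler, 2002).

```python
import math

def eta(s, eta0=5.0, tau_refr=0.01):
    # refractory hyperpolarization following the neuron's own last spike
    return -eta0 * math.exp(-s / tau_refr) if s >= 0 else 0.0

def eps(s_hat, s, tau_m=0.02):
    # postsynaptic potential for an input spike s seconds in the past;
    # s_hat (time since the last postsynaptic spike) is unused in this
    # simple choice but kept to match the kernel's arguments in eq. 2.10
    return math.exp(-s / tau_m) if s >= 0 else 0.0

def membrane_potential(t, t_hat, w, pre_spikes):
    """Equation 2.10: u_i(t) = eta_i(t - t_hat_i) + sum_j w_ij sum_f eps_ij(...)."""
    u = eta(t - t_hat)
    for w_ij, spikes_j in zip(w, pre_spikes):
        u += w_ij * sum(eps(t - t_hat, t - t_f) for t_f in spikes_j if t_f < t)
    return u

def du_dw(t, t_hat, pre_spikes, j):
    """Equation 2.13: the gradient w.r.t. w_ij is the summed kernel of synapse j."""
    return sum(eps(t - t_hat, t - t_f) for t_f in pre_spikes[j] if t_f < t)
```

Because u is linear in the weights, equation 2.13 can be verified exactly against a finite difference of `membrane_potential`.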
hence, equation 2.9 can be rewritten as

\[
\zeta_{ij}(t) = \begin{cases} \dfrac{\rho_i'(t)}{\rho_i(t)} \displaystyle\sum_{t_j^f \in \mathcal{F}_j^t} \varepsilon_{ij}\!\left(t - \hat t_i,\, t - t_j^f\right), & \text{if } f_i(t) = 1 \\[2ex] -\dfrac{\rho_i'(t)\,\delta t}{1 - \rho_i(t)\,\delta t} \displaystyle\sum_{t_j^f \in \mathcal{F}_j^t} \varepsilon_{ij}\!\left(t - \hat t_i,\, t - t_j^f\right), & \text{if } f_i(t) = 0. \end{cases} \tag{2.14}
\]
Together with equations 2.7 and 2.8, this defines the plasticity rule for reinforcement learning. We consider that δt ≪ τ_z, and we rewrite equation 2.8 as

\[
\tau_z\, \frac{z_{ij}(t+\delta t) - z_{ij}(t)}{\delta t} = -z_{ij}(t) + \xi_{ij}(t) + O\!\left(\frac{\delta t}{\tau_z}\right) z_{ij}(t), \tag{2.15}
\]

where we noted

\[ \xi_{ij}(t) = \frac{\zeta_{ij}(t)}{\delta t}. \tag{2.16} \]
By taking the limit δt → 0 and using equations 2.7, 2.14, 2.15, and 2.16, we finally get the reinforcement learning rule for continuous time:

\[ \frac{dw_{ij}(t)}{dt} = \gamma\, r(t)\, z_{ij}(t) \tag{2.17} \]
\[ \tau_z\, \frac{dz_{ij}(t)}{dt} = -z_{ij}(t) + \xi_{ij}(t) \tag{2.18} \]
\[ \xi_{ij}(t) = \left(\frac{\Phi_i(t)}{\rho_i(t)} - 1\right) \rho_i'(t) \sum_{t_j^f \in \mathcal{F}_j^t} \varepsilon_{ij}\!\left(t - \hat t_i,\, t - t_j^f\right), \tag{2.19} \]
where \(\Phi_i(t) = \sum_{\mathcal{F}_i} \delta(t - t_i^f)\) represents the spike train of the postsynaptic neuron as a sum of Dirac functions. We see that when a postsynaptic spike emitted at t_i^f follows a presynaptic spike emitted at t_j^f, z undergoes a sudden growth at t_i^f with

\[
\Delta z = \frac{1}{\tau_z}\, \frac{\rho_i'\!\left(t_i^f\right)}{\rho_i\!\left(t_i^f\right)}\, \varepsilon_{ij}\!\left(t_i^f - \hat t_i,\, t_i^f - t_j^f\right), \tag{2.20}
\]
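The continuous-time rule of equations 2.17 to 2.20 can be integrated numerically. The sketch below is a simplified illustration only: it holds the firing intensity ρ and its derivative ρ' constant (in a full simulation both would follow the membrane potential), uses an exponential ε kernel, and treats the Dirac part of equation 2.19 as the jump of equation 2.20.

```python
import math

def run_trace(pre_spikes, post_spikes, T=0.1, dt=1e-4,
              tau_z=0.05, tau_m=0.02, rho=10.0, rho_prime=5.0,
              gamma=0.01, r=1.0):
    """Euler integration of eqs. 2.17-2.18 with xi from eq. 2.19 for one
    synapse; rho and rho' are held constant here for readability."""
    w, z = 0.0, 0.0
    history = []
    post = set(round(s / dt) for s in post_spikes)
    t = 0.0
    while t < T:
        step = round(t / dt)
        # summed epsilon kernel over presynaptic spikes (the sum in eq. 2.19)
        kernel = sum(math.exp(-(t - tf) / tau_m) for tf in pre_spikes if tf <= t)
        xi = -rho_prime * kernel              # nonassociative depression part
        z += dt * (-z + xi) / tau_z           # eq. 2.18
        if step in post:                      # Dirac part of eq. 2.19, i.e.
            z += (rho_prime / rho) * kernel / tau_z   # the jump of eq. 2.20
        w += dt * gamma * r * z               # eq. 2.17
        history.append((t, z, w))
        t += dt
    return history

hist = run_trace(pre_spikes=[0.01], post_spikes=[0.02])
```

The trace drifts downward after the presynaptic spike (the nonassociative depression) and jumps upward at the post-after-pre pairing, exactly the two effects discussed in the text.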
which will decay exponentially with a time constant τ_z. Over the long term, after the complete decay, this event leads to a total synaptic change

\[
\Delta w = \int_{t_i^f}^{\infty} \gamma\, r(t)\, \Delta z\, \exp\!\left(-\frac{t - t_i^f}{\tau_z}\right) dt. \tag{2.21}
\]
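As a quick numerical sanity check of equation 2.21 with constant reward (all constants below are arbitrary illustrative values), the integral of the decaying trace reduces to γ r Δz τ_z:

```python
import math

# Numerical check of eq. 2.21 for constant reward r: the total weight change
# produced by one eligibility jump Delta z equals gamma * r * Delta z * tau_z.
gamma, r, delta_z, tau_z = 0.1, 1.0, 0.3, 0.5
dt = 1e-4
total, t = 0.0, 0.0
while t < 20 * tau_z:           # integrate until the trace has fully decayed
    total += gamma * r * delta_z * math.exp(-t / tau_z) * dt
    t += dt
closed_form = gamma * r * delta_z * tau_z
```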
If r is constant, this is Δw = γ r Δz τ_z. Hence, if r is positive, the algorithm suggests that the synapse should undergo a spike-timing-dependent potentiation, as in experimentally observed Hebbian STDP. We always have ρ ≥ 0, since it is a probability density, and ρ' ≥ 0, since the firing probability increases with a higher membrane potential. The spike-timing dependence of the potentiation depends on ε and is approximately exponential, as in experimentally observed STDP, since ε models the exponential decay of the postsynaptic potential after a presynaptic spike (for details, see Gerstner & Kistler, 2002). The amplitude of the potentiation is also modulated by ρ_i'(t_i^f)/ρ_i(t_i^f), in contrast to standard STDP models, but consistent with the experimentally observed dependence of synaptic modification not only on relative spike timings but also on interspike intervals (Froemke & Dan, 2002), on which ρ_i depends indirectly. Unlike in Hebbian STDP, the depression of z suggested by the algorithm (and the ensuing synaptic depression, if r is positive) is nonassociative, as each presynaptic spike decreases z continuously. In general, the algorithm suggests that synaptic changes are modulated by the reinforcement signal r(t). For example, if the reinforcement is negative, a potentiation of z due to a postsynaptic spike following presynaptic spikes will lead to a depression of the synaptic efficacy w. In the case of constant negative reinforcement, the algorithm implies associative spike-timing-dependent depression and nonassociative potentiation of the synaptic efficacy. This type of plasticity has been experimentally discovered in the neural system of the electric fish (Han, Grant, & Bell, 2000).

2.2 A Neural Model for Bidirectional Associative Plasticity for Reinforcement Learning.
Since typical experimentally observed STDP involves only associative plasticity, it is interesting to investigate under which conditions a reinforcement learning algorithm based on OLPOMDP would lead to associative changes determined by both pre-after-post and post-after-pre spike pairs. It turns out that this happens when presynaptic neurons homeostatically adapt their firing thresholds to keep the average activity of the postsynaptic neurons constant. To see this, let us consider that not only p_i but also the firing probability p_j of presynaptic neuron j depends on the efficacy w_ij of the synapse. We apply the OLPOMDP reinforcement learning algorithm to an agent
formed by neuron i and all presynaptic neurons j that connect to i. As in the previous case, the policy is parameterized by the synaptic efficacies w_ij. However, the policy should now establish the firing probabilities for all presynaptic neurons, as well as for the postsynaptic neuron. Since the firing probabilities depend only on neuronal potentials and thresholds, each neuron decides independently whether it fires. Hence, the probability of a given firing pattern (f_i, {f_j}) factorizes as \(\mu_{f_i,\{f_j\}} = \mu^i_{f_i} \prod_j \mu^j_{f_j}\), where \(\mu^k_{f_k}\) is the probability that neuron k chooses action f_k ∈ {0, 1}. Only \(\mu^i_{f_i}\) and \(\mu^j_{f_j}\) depend on w_ij, and thus

\[
\zeta_{ij}(t) = \frac{1}{\mu_{f_i,\{f_j\}}} \frac{\partial \mu_{f_i,\{f_j\}}}{\partial w_{ij}} = \frac{1}{\mu^i_{f_i} \prod_j \mu^j_{f_j}} \frac{\partial\!\left( \mu^i_{f_i} \prod_j \mu^j_{f_j} \right)}{\partial w_{ij}} = \frac{1}{\mu^i_{f_i}\, \mu^j_{f_j}} \frac{\partial\!\left( \mu^i_{f_i}\, \mu^j_{f_j} \right)}{\partial w_{ij}}. \tag{2.22}
\]
The particular form of ζ_ij depends on the various possibilities for the actions taken by neurons i and j:

\[
\mu^i_{f_i}\, \mu^j_{f_j} = \begin{cases} (1 - p_i)(1 - p_j), & \text{if } f_i = f_j = 0 \\ p_i\, (1 - p_j), & \text{if } f_i = 1 \text{ and } f_j = 0 \\ (1 - p_i)\, p_j, & \text{if } f_i = 0 \text{ and } f_j = 1. \end{cases} \tag{2.23}
\]
We do not consider the case f_i = f_j = 1, since there is a negligibly small probability of having simultaneous spiking in a very small interval δt. After performing the calculation of ζ_ij for the various cases and considering that δt → 0, by analogy with equation 2.9 and the calculation from the previous section, we get

\[
\xi_{ij}(t) = \left(\frac{\Phi_i(t)}{\rho_i(t)} - 1\right) \frac{\partial \rho_i(t)}{\partial w_{ij}} + \left(\frac{\Phi_j(t)}{\rho_j(t)} - 1\right) \frac{\partial \rho_j(t)}{\partial w_{ij}}. \tag{2.24}
\]
Let us consider that the firing threshold of neuron j has the following dependence:

\[
\theta_j(t) = \bar\theta_j\!\left( \sum_k w_{kj} \sum_{t_k^f \in \mathcal{F}_k^t} \chi_j\!\left(t - t_k^f\right) \right), \tag{2.25}
\]
where \(\bar\theta_j\) is an arbitrary continuous increasing function, the first sum runs over all postsynaptic neurons to which neuron j projects, and the second
sum runs over all spikes prior to t of postsynaptic neuron k. χ_j is a decaying kernel; for example, χ_j(t − t_k^f) = exp(−(t − t_k^f)/τ_{θ_j}), which means that θ_j depends on a weighted estimate of the firing rate of postsynaptic neurons, computed using an exponential kernel with time constant τ_{θ_j}. The proposed dependence of presynaptic firing thresholds means that when the activity of postsynaptic neurons is high, neuron j increases its firing threshold in order to contribute less to their firing and decrease their activity. The activity of each neuron k is weighted by the synaptic efficacy w_kj, since an increase in the firing threshold and a subsequent decrease of the activity of neuron j will have a bigger effect on postsynaptic neurons to which synaptic connections are strong. This is biologically plausible, since there are many neural mechanisms that ensure homeostasis (Turrigiano & Nelson, 2004) and plasticity of intrinsic excitability (Daoudal & Debanne, 2003; Zhang & Linden, 2003), including mechanisms that regulate the excitability of presynaptic neurons as a function of postsynaptic activity (Nick & Ribera, 2000; Ganguly, Kiss, & Poo, 2000; Li, Lu, Wu, Duan, & Poo, 2004). The firing intensity of neuron j is ρ_j(t) = ρ_j(u_j(t) − θ_j(t)). We have

\[
\frac{\partial \rho_j(t)}{\partial w_{ij}} = -\rho_j'(t)\, \frac{\partial \theta_j(t)}{\partial w_{ij}} = -\rho_j'(t)\, \bar\theta_j'(t) \sum_{t_i^f \in \mathcal{F}_i^t} \chi_j\!\left(t - t_i^f\right). \tag{2.26}
\]
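The homeostatic threshold of equation 2.25 is easy to sketch. The linear choice of the increasing function \(\bar\theta_j\) and the constants below are purely illustrative assumptions:

```python
import math

def theta_j(t, theta_bar, w_post, post_spikes, tau_theta=0.05):
    """Equation 2.25 with the exponential kernel chi suggested in the text;
    theta_bar may be any continuous increasing function."""
    drive = 0.0
    for w_kj, spikes_k in zip(w_post, post_spikes):
        drive += w_kj * sum(math.exp(-(t - t_f) / tau_theta)
                            for t_f in spikes_k if t_f <= t)
    return theta_bar(drive)

# a linear increasing theta_bar, purely illustrative
tb = lambda x: 1.0 + 0.5 * x
quiet = theta_j(0.1, tb, [1.0], [[]])            # no postsynaptic activity
active = theta_j(0.1, tb, [1.0], [[0.08, 0.09]])  # two recent postsynaptic spikes
```

As the text describes, higher postsynaptic activity raises the presynaptic threshold, so `active` exceeds `quiet`.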
We finally have

\[
\xi_{ij}(t) = \left(\frac{\Phi_i(t)}{\rho_i(t)} - 1\right) \rho_i'(t) \sum_{t_j^f \in \mathcal{F}_j^t} \varepsilon_{ij}\!\left(t - \hat t_i,\, t - t_j^f\right) + \left(-\frac{\Phi_j(t)}{\rho_j(t)} + 1\right) \rho_j'(t)\, \bar\theta_j'(t) \sum_{t_i^f \in \mathcal{F}_i^t} \chi_j\!\left(t - t_i^f\right), \tag{2.27}
\]
which together with equations 2.17 and 2.18 defines the plasticity rule in this case. We have the same associative spike-timing-dependent potentiation of z as in the previous case, but now we also have an associative spike-timing-dependent depression of z. The depression corresponding to a presynaptic spike at t_j^f following a postsynaptic spike at t_i^f is

\[
\Delta z = -\frac{1}{\tau_z}\, \frac{\rho_j'\!\left(t_j^f\right)}{\rho_j\!\left(t_j^f\right)}\, \bar\theta_j'\!\left(t_j^f\right)\, \chi_j\!\left(t_j^f - t_i^f\right), \tag{2.28}
\]
and the dependence on the relative spike timing is given by χ_j. The variation of z is negative because ρ_j' and \(\bar\theta_j'\) are positive, since they are derivatives of increasing functions, ρ_j is positive, since it is a probability density, and χ_j is positive by definition.
We have thus shown that a bidirectional associative spike-timing-dependent plasticity mechanism can result from a reinforcement learning algorithm applied to a spiking neural model with homeostatic control of the average postsynaptic activity. As previously, the synaptic changes are modulated by the global reinforcement signal r(t). As in the previous case, we also have nonassociative changes of z: each presynaptic spike depresses z continuously, while each postsynaptic spike potentiates z continuously. However, depending on the parameters, the nonassociative changes may be much smaller than the spike-timing-dependent ones. For example, the magnitude of the nonassociative depression induced by presynaptic spikes is modulated by the derivative (with respect to the difference between the membrane potential and the firing threshold) of the postsynaptic firing intensity, ρ_i'. This usually has a nonnegligible value when the membrane potential is high (see the forms of the ρ function in Gerstner & Kistler, 2002; Bohte, 2004), and thus a postsynaptic spike is likely to follow. When this happens, z is potentiated and the effect of the depression may be overcome.

2.3 Reinforcement Learning and Spike-Timing-Dependent Intrinsic Plasticity.

In this section, we consider that the firing threshold θ_i of the postsynaptic neuron is also an adaptable parameter that contributes to learning, like the w_ij parameters. There is ample evidence that this is the case in the brain (Daoudal & Debanne, 2003; Zhang & Linden, 2003). The firing threshold will change according to
\[ \frac{d\theta_i(t)}{dt} = \gamma_\theta\, r(t)\, z_i^\theta(t) \tag{2.29} \]
\[ \tau_{z_\theta}\, \frac{dz_i^\theta(t)}{dt} = -z_i^\theta(t) + \xi_i^\theta(t) \tag{2.30} \]
\[
\xi_i^\theta(t) = \lim_{\delta t \to 0} \begin{cases} \dfrac{1}{\delta t}\, \dfrac{1}{p_i(t)}\, \dfrac{\partial p_i(t)}{\partial \theta_i}, & \text{if } f_i(t) = 1 \\[2ex] -\dfrac{1}{\delta t}\, \dfrac{1}{1 - p_i(t)}\, \dfrac{\partial p_i(t)}{\partial \theta_i}, & \text{if } f_i(t) = 0, \end{cases} \tag{2.31}
\]
by analogy with equations 2.17, 2.18, 2.9, and 2.19. Since ∂p_i(t)/∂θ_i = −ρ_i'(t) δt, we have

\[
\xi_i^\theta(t) = \left(-\frac{\Phi_i(t)}{\rho_i(t)} + 1\right) \rho_i'(t). \tag{2.32}
\]
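Equations 2.29 to 2.32 can also be integrated numerically. As before, this is a simplified sketch that holds ρ and ρ' constant (illustrative assumption): each emitted spike makes the trace z^θ jump by −(ρ'/ρ)/τ_{z_θ}, while between spikes the trace relaxes toward +ρ'.

```python
import math

def run_threshold(post_spikes, T=0.1, dt=1e-4, tau_zt=0.05,
                  rho=10.0, rho_prime=5.0, gamma_t=0.01, r=1.0):
    """Euler integration of eqs. 2.29-2.32 for one neuron, with constant
    rho and rho' for readability."""
    theta, z = 1.0, 0.0
    post = set(round(s / dt) for s in post_spikes)
    hist = []
    t = 0.0
    while t < T:
        step = round(t / dt)
        z += dt * (-z + rho_prime) / tau_zt    # smooth part of eq. 2.32
        if step in post:
            z -= (rho_prime / rho) / tau_zt    # Dirac part: a spike lowers z_theta
        theta += dt * gamma_t * r * z          # eq. 2.29
        hist.append((t, z, theta))
        t += dt
    return hist

hist = run_threshold(post_spikes=[0.05])
```

With positive reward, the threshold creeps upward between spikes and drops right after the spike, matching the excitability dynamics described above.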
Hence, for positive reward, spikes emitted by a neuron lead to a decrease of the firing threshold of the neuron (an increase of excitability). This is consistent with experimental results showing that neural excitability increases after bursts (Aizenman & Linden, 2000; Cudmore & Turrigiano, 2004; Daoudal & Debanne, 2003). Between spikes, the firing threshold increases continuously (but very slowly when the firing intensity, and thus also its derivative with respect to the difference between the membrane potential and the firing threshold, is low). This reinforcement learning algorithm using spike-timing-dependent intrinsic plasticity is complementary to the two other algorithms previously described and can work simultaneously with any of them.

2.4 Relationship to Other Reinforcement Learning Algorithms for Spiking Neural Networks.

It can be shown that the algorithms proposed here share a common analytical background with the other two existing reinforcement learning algorithms for generic spiking neural networks (Seung, 2003; Xie & Seung, 2004). Seung (2003) applies OLPOMDP by considering that the agent is a synapse instead of a neuron, as we did. The action of the agent is the release of a neurotransmitter vesicle, instead of the spiking of the neuron, and the parameter that is optimized is one that controls the release of the vesicle instead of the synaptic connections to the neuron. The result is a learning algorithm that is biologically plausible but for which no experimental evidence exists yet. Xie and Seung (2004) do not model in detail the integrative characteristics of the neural membrane potential and consider that neurons respond instantaneously to inputs by changing the firing rate of their Poisson spike train. Their study derives an episodic algorithm that is similar to GPOMDP and extends it to an online algorithm similar to OLPOMDP without any justification.
The equations determining online learning—equations 16 and 17 in Xie and Seung (2004)—are identical, after adapting notation, to ours: equations 2.17, 2.18, and 2.19. By reinterpreting the current-discharge function f in Xie and Seung (2004) as the firing intensity ρ, and the synaptic current h ij as the postsynaptic kernel ε ij, we can see that the algorithm of Xie and Seung is mathematically equivalent to the algorithm derived, more accurately, here. However, our different interpretation and implementation of the mathematical framework permit making better connections to experimentally observed STDP, as well as a straightforward generalization and application to neural models commonly used in simulations, which the Xie and Seung algorithm does not permit because of its dependence on the memoryless Poisson neural model. The common analytical background of all these algorithms suggests that their learning performance should be similar.

Pfister et al. (2006) developed a theory of supervised learning for spiking neurons that leads to STDP and can be reinterpreted in the context
Reinforcement Learning Through Modulation of STDP
1481
of reinforcement learning. They studied in detail only particular episodic learning scenarios, not the general online case approached in our letter. However, the analytical background is again the same, since in their work synaptic changes depend on a quantity (equation 7 in their article) that is the integral over the learning episode of our ξij(t), equation 2.19. It is thus not surprising that some of their results are similar to ours: they find a negative offset of the STDP function that corresponds to the nonassociative depression suggested by our algorithm for positive reward, and they derive associative spike-timing-dependent depression as the consequence of a homeostatic mechanism, as we did in a different framework.

3 Modulation of STDP by Reward

The derivations in the previous sections show that reinforcement learning algorithms that involve reward-modulated STDP can be justified analytically. We have also previously tested one of the derived algorithms in simulation, in a biologically inspired experiment, to demonstrate its efficacy in practice (Florian, 2005). However, these algorithms involve both associative spike-timing-dependent synaptic changes and nonassociative ones, unlike standard STDP (Bi & Poo, 1998; Song et al., 2000), but in agreement with certain forms of STDP observed experimentally (Han et al., 2000). Moreover, the particular dynamics of the synapses depends on the form of the firing intensity ρ, a function for which there exists no experimentally justified model. For these reasons, we chose to study in simulations some simplified forms of the analytically derived algorithms that extend the standard STDP model in a straightforward way. The learning rules that we propose can be applied to any type of spiking neural model; their applicability is not restricted to probabilistic models or to the SRM that we previously used in the analytical derivation.
The proposed rules are only inspired by the analytically derived and experimentally observed rules and do not follow directly from them. In what follows, we drop the dependence of plasticity on the firing intensity (which means that the learning mechanism can also be applied to the deterministic neural models commonly used in simulations), we consider that both potentiation and depression of z are associative and spike timing dependent (without the homeostatic mechanism introduced in section 2.2), we use an exponential dependence of plasticity on the relative spike timings, and we consider that the effects of different spike pairs are additive, as in previous studies (Song et al., 2000; Abbott & Nelson, 2000). The resulting learning rule is determined by equations 2.17 and 2.18, which we repeat below for convenience, and an equation for the dynamics of ξ that takes these simplifications into account:

dwij(t)/dt = γ r(t) zij(t)
(3.1)
τz dzij(t)/dt = −zij(t) + ξij(t)

(3.2)

ξij(t) = Ψi(t) A+ Σ_{f ∈ F_j^t} exp(−(t − t_j^f)/τ+) + Ψj(t) A− Σ_{f ∈ F_i^t} exp(−(t − t_i^f)/τ−),

(3.3)
where τ± and A± are constant parameters. The time constants τ± determine the ranges of interspike intervals over which synaptic changes occur. According to the standard antisymmetric STDP model, A+ is positive and A− is negative. We call the learning rule determined by the above equations modulated STDP with eligibility trace (MSTDPET). The learning rule can be simplified further by dropping the eligibility trace. In this case, we have a simple modulation by the reward r(t) of standard STDP:

dwij(t)/dt = γ r(t) ξij(t),
(3.4)
where ξij(t) is given by equation 3.3. We call this learning rule modulated STDP (MSTDP). Recall that with our notation, the standard STDP model is given by

dwij(t)/dt = γ ξij(t).
(3.5)
For all these learning rules, it is useful to introduce a variable Pij+ that tracks the influence of presynaptic spikes and a variable Pij− that tracks the influence of postsynaptic spikes, instead of keeping in memory, in computer simulations, all past spikes and repeatedly summing their effects (Song et al., 2000). These variables may have biochemical counterparts in biological neurons. The dynamics of ξ and P± is then given by

ξij(t) = Pij+ Ψi(t) + Pij− Ψj(t)
(3.6)
dPij+/dt = −Pij+/τ+ + A+ Ψj(t)
(3.7)
dPij−/dt = −Pij−/τ− + A− Ψi(t).
(3.8)
If τ± and A± are identical for all synapses, as is the case in our simulations, then Pij+ does not depend on i and Pij− does not depend on j, but we keep the two indices for clarity.
If we simulate the network in discrete time with a time step δt, the dynamics of the synapses is defined, for MSTDPET, by equations 2.7 and 2.8, or, for MSTDP, by wij (t + δt) = wij (t) + γ r (t + δt) ζij (t),
(3.9)
and by ζij (t) = Pij+ (t) f i (t) + Pij− (t) f j (t)
(3.10)
Pij+ (t) = Pij+ (t − δt) exp(−δt/τ+ ) + A+ f j (t)
(3.11)
Pij− (t) = Pij− (t − δt) exp(−δt/τ− ) + A− f i (t),
(3.12)
where fi(t) is 1 if neuron i has fired at time step t and 0 otherwise. The dynamics of the various variables is illustrated in Figure 1. As for the standard STDP model, we also apply bounds to the synapses, in addition to the dynamics specified by the learning rules.

4 Simulations

4.1 Methods. In the following simulations, we used networks composed of integrate-and-fire neurons with resting potential ur = −70 mV, firing threshold θ = −54 mV, reset potential equal to the resting potential, and decay time constant τ = 20 ms (parameters from Gütig, Aharonov, Rotter, & Sompolinsky, 2003, and similar to those in Song et al., 2000). The network’s dynamics was simulated in discrete time with a time step δt = 1 ms. The synaptic weights wij represented the total increase of the postsynaptic membrane potential caused by a single presynaptic spike; this increase was considered to take place during the time step following the spike. Thus, the dynamics of the neurons’ membrane potential was given by

ui(t) = ur + [ui(t − δt) − ur] exp(−δt/τ) + Σj wij fj(t − δt).

(4.1)
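As an illustration, one discrete-time update of these dynamics (equations 3.9–3.12 together with the membrane equation 4.1 and its threshold-and-reset rule described in the text) might be sketched as follows. This is a minimal sketch, not the simulation code used in the letter: the number of synapses, the Poisson input rate, and the random reward signal are placeholders chosen only to make the fragment runnable, and the eligibility-trace update is one plausible discretization of equation 3.2.

```python
import math
import random

# Parameters from section 4.1 (mV and ms)
u_r, theta, tau = -70.0, -54.0, 20.0        # LIF rest, threshold, membrane constant
tau_plus = tau_minus = 20.0                 # STDP time constants tau_+ and tau_-
tau_z, A_plus, A_minus = 25.0, 1.0, -1.0    # eligibility decay, STDP amplitudes
dt, gamma = 1.0, 0.1
N = 50                                      # number of input synapses (arbitrary)

random.seed(1)
w = [1.0] * N                # synaptic weights (mV), hard-bounded to [0, 5]
z = [0.0] * N                # eligibility traces
P_plus = [0.0] * N           # presynaptic spike traces, one per synapse
P_minus = 0.0                # postsynaptic spike trace (shared: same A-, tau-)
u = u_r
spikes = 0

for step in range(1000):
    f = [1 if random.random() < 0.02 else 0 for _ in range(N)]   # ~20 Hz Poisson inputs
    # Equation 4.1: leaky integration plus the previous step's synaptic input
    u = u_r + (u - u_r) * math.exp(-dt / tau) + sum(wj * fj for wj, fj in zip(w, f))
    f_i = 1 if u > theta else 0
    if f_i:
        u, spikes = u_r, spikes + 1          # reset after a postsynaptic spike
    P_minus = P_minus * math.exp(-dt / tau_minus) + A_minus * f_i   # equation 3.12
    r = random.choice([-1.0, 1.0])           # placeholder reward signal
    for j in range(N):
        P_plus[j] = P_plus[j] * math.exp(-dt / tau_plus) + A_plus * f[j]  # eq. 3.11
        zeta = P_plus[j] * f_i + P_minus * f[j]                           # eq. 3.10
        # MSTDPET: z low-pass filters zeta (a discretization of equation 3.2);
        # MSTDP would instead apply w[j] += gamma * r * zeta directly (eq. 3.9)
        z[j] = z[j] * math.exp(-dt / tau_z) + zeta
        w[j] = min(5.0, max(0.0, w[j] + gamma * r * z[j]))       # hard bounds
```

With the random reward used here, the weights perform a bounded random walk; in the experiments of section 4, the reward is instead tied to task performance.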
If the membrane potential surpassed the firing threshold θ, it was reset to ur. Axons propagated spikes with no delays. More biologically plausible simulations of reward-modulated STDP were presented elsewhere (Florian, 2005); here, we aim to present the effects of the proposed learning rules using minimalist setups. We used τ+ = τ− = 20 ms; the value of γ varied from experiment to experiment. Except where specified, we used τz = 25 ms, A+ = 1, and A− = −1.

Figure 1: Illustration of the dynamics of the variables used by MSTDP and MSTDPET and the effects of these rules and of STDP on the synaptic strength for sample spike trains and reward (see equations 2.7, 2.8, and 3.9–3.12). We used γ = 0.2. The values of the other parameters are listed in section 4.1.

4.2 Solving the XOR Problem: Rate-Coded Input. We first show the efficacy of the proposed rules for learning XOR computation. Although this is not a biologically relevant task, it is a classic benchmark problem for artificial neural network training. This problem was also approached using other reinforcement learning methods for generic spiking neural networks (Seung, 2003; Xie & Seung, 2004). XOR computation consists of performing the following mapping between two binary inputs and one binary output: {0, 0} → 0; {0, 1} → 1; {1, 0} → 1; {1, 1} → 0. We considered a setup similar to the one in Seung (2003). We simulated a feedforward neural network with 60 input neurons, 60 hidden neurons, and 1 output neuron. Each layer was fully connected to the next layer. Binary inputs and outputs were coded by the firing rates of the corresponding neurons. The first half of the input neurons represented the first input and the rest the second input. The input 1 was represented by a Poisson spike train at 40 Hz, and the input 0 was represented by no spiking. The training was accomplished by presenting the inputs and then delivering the appropriate reward or punishment to the synapses,
according to the activity of the output neuron. In each learning epoch, the four input patterns were presented for 500 ms each, in random order. During the presentation of each input pattern, if the correct output was 1, the network received a reward r = 1 for each output spike emitted and 0 otherwise. This rewarded the network for having a high output firing rate. If the correct output was 0, the network received a negative reward (punishment) r = −1 for each output spike and 0 otherwise. The corresponding reward was delivered during the time step following the output spike.

Fifty percent of the input neurons coding for either input were inhibitory; the rest were excitatory. The synaptic weights wij were hard-bounded between 0 and 5 mV (for excitatory synapses) or between −5 mV and 0 (for inhibitory synapses). Initial synaptic weights were generated randomly within the specified bounds. We considered that the network learned the XOR function if, at the end of an experiment, the output firing rate for the input pattern {1, 1} was lower than the output rates for the patterns {0, 1} or {1, 0} (the output firing rate for the input {0, 0} was obviously always 0). Each experiment had 200 learning epochs (presentations of the four input patterns), corresponding to 400 s of simulated time.

We investigated learning with both MSTDP (using γ = 0.1 mV) and MSTDPET (using γ = 0.625 mV). We found that both rules were efficient for learning the XOR function (see Figure 2). The synapses changed so as to increase the reward received by the network, by minimizing the firing rate of the output neuron for the input pattern {1, 1} and maximizing the same rate for {0, 1} and {1, 0}. With MSTDP, the network learned the task in 99.1% of 1000 experiments, and with MSTDPET, learning was achieved in 98.2% of the experiments. The performance in XOR computation remained constant, after learning, if the reward signal was removed and the synapses were fixed.
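Read literally, the reward schedule for this task can be sketched as a small function; this is our illustrative reading of the text, not the code used for the experiments, and the names are ours.

```python
def xor_reward(correct_output, output_spiked):
    """Reward delivered one time step after each output spike.

    correct_output: desired binary output (0 or 1) for the current pattern.
    output_spiked:  True if the output neuron fired in the previous time step.
    """
    if not output_spiked:
        return 0.0
    return 1.0 if correct_output == 1 else -1.0

# The four input patterns and their XOR targets
xor_patterns = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
```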
4.3 Solving the XOR Problem: Temporally Coded Input. Since in the proposed learning rules synaptic changes depend on the precise timing of spikes, we investigated whether the XOR problem can also be learned if the input is coded temporally. We used a fully connected feedforward network with 2 input neurons, 20 hidden neurons, and 1 output neuron, a setup similar to the one in Xie and Seung (2004). The input signals 0 and 1 were coded by two distinct spike trains of 500 ms in length, randomly generated at the beginning of each experiment by distributing 50 spikes uniformly within this interval. Thus, to underline the distinction with rate coding, all inputs had the same average firing rate of 100 Hz. Since the number of neurons was low, we did not divide them into excitatory and inhibitory ones and allowed the weights of the synapses between the input and the hidden layer to take both signs. These weights were bounded between −10 mV and 10 mV. The weights of the synapses between the hidden layer and the output neuron were bounded between 0 and 10 mV. As before, each experiment consisted of 200 learning epochs, corresponding
Figure 2: Learning XOR computation with rate-coded input with MSTDP and MSTDPET. (a) Evolution during learning of the total reward received in a learning epoch: average over experiments and a random sample experiment. (b) Average firing rate of the output neuron after learning (total number of spikes emitted during the presentation of each pattern in the last learning epoch divided by the duration of the pattern): average over experiments and standard deviation.
to 400 s of simulated time. We used γ = 0.01 mV for experiments with MSTDP and γ = 0.25 mV for MSTDPET. With MSTDP, the network learned the XOR function in 89.7% of 1000 experiments, while with MSTDPET, learning was achieved in 99.5% of the experiments (see Figure 3). The network learned to solve the task by responding to the synchrony of the two input neurons. We verified that after learning, the network solved the task not only for the input patterns used during learning but for any pair of random input signals generated in the same way.

4.4 Learning a Target Firing-Rate Pattern. We also verified the capacity of the proposed learning rules to allow a network to learn a given output pattern, coded by the individual firing rates of the output neurons. During learning, the network received a reward r = 1 when the distance d between the target pattern and the actual one decreased and a punishment r = −1 when the distance increased:

r(t + δt) = sign(d(t − δt) − d(t)).
Figure 3: Learning XOR computation with temporally coded input with MSTDP and MSTDPET. (a) Evolution during learning of the total reward received in a learning epoch: average over experiments and a random sample experiment. (b) Average firing rate of the output neuron after learning (total number of spikes emitted during the presentation of each pattern in the last learning epoch divided by the duration of the pattern): average over experiments and standard deviation.
We used a feedforward network with 100 input neurons directly connected to 100 output neurons. Each output neuron received axons from all input neurons. Synaptic weights were bounded between 0 and wmax = 1.25 mV. The input neurons fired Poisson spike trains with constant, random firing rates, generated uniformly, for each neuron, between 0 and 50 Hz. The target firing rates νi0 of individual output neurons were also generated randomly at the beginning of each experiment, uniformly between 20 and 100 Hz. The actual firing rates νi of the output neurons were measured using a leaky integrator with time constant τν = 2 s:

νi(t) = νi(t − δt) exp(−δt/τν) + (1/τν) fi(t).

(4.2)
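For illustration, the rate estimate of equation 4.2, together with the distance measure, sign reward, and learning efficacy used for this task (sections 4.4 and 4.5), could be computed as below. This is a sketch under our reading of the text; the function and variable names are ours.

```python
import math

dt = 1.0            # time step (ms)
tau_nu = 2000.0     # rate-estimator time constant, 2 s expressed in ms

def update_rates(nu, spikes):
    """Equation 4.2: leaky-integrator estimate of each output neuron's firing rate."""
    decay = math.exp(-dt / tau_nu)
    return [n * decay + s / tau_nu for n, s in zip(nu, spikes)]

def distance(nu, nu_target):
    """Euclidean distance between the actual and target firing-rate patterns."""
    return math.sqrt(sum((n - n0) ** 2 for n, n0 in zip(nu, nu_target)))

def reward(d_now, d_prev):
    """+1 when the distance just decreased, -1 when it increased, 0 otherwise."""
    return (d_prev > d_now) - (d_prev < d_now)

def efficacy(d_t, d_t0):
    """Learning efficacy e(t) = 1 - d(t)/d(t0) (section 4.5)."""
    return 1.0 - d_t / d_t0
```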
This is equivalent to performing a weighted estimate of the firing rate with an exponential kernel. We measured the total distance between the current output firing-rate pattern and the target one as d(t) = √( Σi [νi(t) − νi0]² ). Learning began once the output rates had stabilized after the initial transients.

We performed experiments with both MSTDP (γ = 0.001 mV) and MSTDPET (γ = 0.05 mV). Both learning rules allowed the network to learn the given output pattern, with very similar results. During learning, the distance between the target output pattern and the actual one decreased rapidly (in about 30 s) and then remained close to 0. Figure 4 illustrates learning with MSTDP.

Figure 4: Learning a target firing-rate pattern with MSTDP (sample experiment). (a) Evolution during learning of the distance between the target pattern and the actual output pattern. (b) Evolution during learning of the firing rates of two sample output neurons (solid lines) toward their target output rates (dotted lines).

4.5 Exploring the Learning Mechanism. In order to study in more detail the properties of the proposed learning rules, we repeated the learning of a target firing-rate pattern experiment, described above, while varying the parameters defining the rules. To characterize learning speed, we considered that convergence is achieved at time tc if the distance d(t) between the actual output and the target output did not become lower than d(tc) for t between tc and tc + τc, where τc was either 2 s, 10 s, or 20 s, depending on the experiment. We measured learning efficacy as e(t) = 1 − d(t)/d(t0), where d(t0) is the distance at the beginning of learning; the learning efficacy at convergence is ec = e(tc). The best learning efficacy possible is 1, and a value of 0 indicates no learning. A negative value indicates that the distance between the actual firing-rate pattern and the target has increased with respect to the case when the synapses have random values.

4.5.1 Learning Rate Influence. Increasing the learning rate γ results in faster convergence, but as γ increases, its effect on learning speed becomes
Figure 5: Influence of learning rate γ on learning a target firing-rate pattern with delayed reward with MSTDP. (a) Convergence time tc as a function of γ . (b) Learning efficacy at convergence e c as a function of γ .
less and less important. The learning efficacy at convergence decreases linearly with higher learning rates. Figure 5 illustrates these effects for MSTDP (with τc = 10 s), but the behavior is similar for MSTDPET.

4.5.2 Nonantisymmetric STDP. We have investigated the effects on learning performance of changing the form of the STDP window (function) used by the learning rules by varying the values of the A± parameters. This permitted assessing the contributions to learning of the two parts of the STDP window (corresponding to positive and negative delays, respectively, between pre- and postsynaptic spikes). It also permitted exploring the properties of reward-modulated symmetric STDP. We repeated the learning of a target firing pattern experiment with various values of the A± parameters. Figure 6 presents the learning efficacy after 25 s of learning. It can be observed that for MSTDP, reinforcement learning is possible in all cases where a postsynaptic spike following a presynaptic spike strengthens the synapse when the reward is positive and depresses the synapse when the reward is negative, thus reinforcing causality. When A+ = 1, MSTDP led to reinforcement learning regardless of the value of A−. For MSTDPET, only modulation of standard antisymmetric Hebbian STDP (A+ = 1, A− = −1) led to learning. Many other values of the A± parameters maximized the distance between the output and the target pattern (see Figure 6) instead of minimizing it, which was the task to be learned. These results suggest that the causal nature of STDP is an important factor for the efficacy of the proposed rules.

4.5.3 Reward Baseline. In previous experiments, the reward was either 1 or −1, thus having a 0 baseline. We repeated the learning of a target firing pattern experiment while varying the reward baseline r0, giving the reward r(t + δt) = r0 + sign(d(t − δt) − d(t)). Experiments showed that a
Figure 6: Learning a target firing-rate pattern with delayed reward with modified MSTDP and MSTDPET. Learning efficacy e at 25 s after the beginning of learning as a function of A+ and A− .
zero baseline is most efficient, as expected. The efficiency of MSTDP did not change much over a relatively large interval of reward baselines around 0. The efficiency of MSTDPET proved to be more sensitive to having a 0 baseline, but any positive baseline still led to learning. This suggests that for both MSTDP and MSTDPET, reinforcement learning can benefit from the unsupervised learning properties of standard STDP, achieved when r is positive. A positive baseline that ensures that r is always positive is also biologically relevant, since changes in the sign of STDP have not yet been observed experimentally and may not be possible in the brain. Modified MSTDP (with A− being 0 or 1), which was shown above to allow learning for a 0 baseline, is the most sensitive to departures from this baseline, with slight deviations driving the learning efficacy to negative values (i.e., the distance is maximized instead of minimized; see Figure 7). The poor performance of modified MSTDP at positive r0 is caused by the tendency of all synapses to take the maximum allowed values. At negative r0, the distribution of synapses is biased more toward small values for modified MSTDP than for normal MSTDP. The relatively poor performance of MSTDPET at negative r0 is due to a bias of the distribution of synapses toward large values. The synaptic distribution after learning, in the cases where learning is efficient, is almost uniform, with MSTDP and modified MSTDP also having additional small peaks at the extremities of the allowed interval.

4.5.4 Delayed Reward. For the tasks presented thus far, MSTDPET had a performance comparable to MSTDP or poorer. The advantage of using an eligibility trace appears when learning a task where the reward is delivered
Figure 7: Learning a target firing-rate pattern with MSTDP, MSTDPET, and modified MSTDP. Learning efficacy e at 25 s after the beginning of learning as a function of reward baseline r0 .
to the network with a certain delay σ with regard to the state or the output of the network that caused it. To properly investigate the effects of reward delay on MSTDPET learning performance for various values of τz, we had to factor out the effects of the variation of τz itself. A higher τz leads to smaller synaptic changes, since z decays more slowly and, if reward varies from time step to time step and the probabilities of positive and negative reward are approximately equal, their effects on the synapse, having opposite signs, almost cancel out. Hence, varying τz changes the learning efficacy and the convergence time, much as varying the learning rate does, as described previously. In order to compensate for this effect, we varied the learning rate γ as a function of τz such that the convergence time of MSTDPET was constant at reward delay σ = 0 and approximately equal to the convergence time of MSTDP for γ = 0.001, also at σ = 0. We obtained the corresponding values of γ by numerically solving the equation tc^MSTDPET(γ) = tc^MSTDP using Ridder’s method (Press, Teukolsky, Vetterling, & Flannery, 1992). For determining convergence, we used τc = 2 s. The dependence of the obtained values of γ on τz is illustrated in Figure 8c.

MSTDPET continued to allow the network to learn, while MSTDP and modified MSTDP (A− ∈ {−1, 0, 1}) failed in this case to solve the task, even for small delays. MSTDP and modified MSTDP led to learning in our setup only if reward was not delayed. In this case, changes in the strength of the ij synapse were determined by the product r(t + δt) ζij(t) of the reward r(t + δt) that corresponded to the network’s activity (postsynaptic spikes) at time t and of the variable ζij(t), which was nonzero only if there
Figure 8: Learning a target firing-rate pattern with delayed reward with MSTDP and MSTDPET. (a, b) Learning efficacy e at 25 s after the beginning of learning, as a function of reward delay σ and eligibility trace decay time constant τz . (a) Dependence on σ (standard deviation is not shown for all data sets, for clarity). (b) Dependence on τz (MSTDPET only). (c) The values of the learning rate γ , used in the MSTDPET experiments, which compensate at 0 delay for the variation of τz .
were pre- or postsynaptic spikes during the same time step t. Even a small extra delay σ = δt of the reward impeded learning. Figure 8a illustrates the performance of both learning rules as a function of reward delay σ for learning a target firing pattern. Figure 8b illustrates the performance of MSTDPET as a function of τz , for several values of reward delay σ . The
value of τz that yields the best performance for a particular delay increases monotonically and very quickly with the delay. For zero delay, or for values of τz larger than optimal, performance decreases for higher τz. Performance was estimated by measuring the average (over 100 experiments) of the learning efficacy e at 25 s after the beginning of learning. The eligibility trace keeps a decaying memory of the relationships between the most recent pairs of pre- and postsynaptic spikes. It is thus understandable that MSTDPET permits learning even with delayed reward, as long as the delay is smaller than τz. It can be seen from Figure 8 that, at least for the studied setup, learning efficacy degrades significantly for large delays, even if they are much smaller than τz, and there is no significant learning for delays larger than about 7 ms. The maximum delay for which learning is possible may be larger for other setups.

4.5.5 Continuous Reward. In previous experiments, reward varied from time step to time step, at a timescale of δt = 1 ms. In animals, the internal reward signal (e.g., a neuromodulator) varies more slowly. To investigate learning performance in such a case, we used a continuous reward, varying on a timescale τr, with dynamics given by

r(t + δt) = r(t) exp(−δt/τr) + (1/τr) sign(d(t − δt − σ) − d(t − σ)).

(4.3)
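A sketch of this filtered reward (equation 4.3), with names of our choosing:

```python
import math

def filtered_reward(r_prev, d_before, d_then, tau_r, dt=1.0):
    """Equation 4.3: reward low-pass filtered on a timescale tau_r (ms).

    d_before and d_then are the distances at times t - dt - sigma and
    t - sigma, so the driving term is positive when the (delayed by sigma)
    distance to the target pattern decreased.
    """
    sign = (d_before > d_then) - (d_before < d_then)
    return r_prev * math.exp(-dt / tau_r) + sign / tau_r
```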
Both MSTDP and MSTDPET continue to work when reward is continuous and not delayed. Learning efficacy decreases monotonically with higher τr (see Figures 9a and 9b). As in the case of a discrete reward signal, MSTDPET allows learning even with small reward delays σ (see Figure 9c), while MSTDP does not lead to learning when reward is delayed.

4.5.6 Scaling of Learning with Network Size. To explore how the performance of the proposed learning mechanisms scales with network size, we varied systematically the number of input and output neurons (Ni and No, respectively). The maximum synaptic efficacies were set to wmax = 1.25 · (Ni/100) mV in order to keep the average input per postsynaptic neuron constant between setups with different numbers of input neurons. When No varies with Ni constant, learning efficacy at convergence ec diminishes and convergence time tc increases with higher No, because the difficulty of the task increases. When Ni varies with No constant, convergence time decreases with higher Ni, because the difficulty of the task remains constant while the number of ways in which the task can be solved increases, due to the higher number of incoming synapses per output neuron. The learning efficacy nevertheless diminishes with higher Ni when No is constant.
Figure 9: Learning driven by a continuous reward signal with a timescale τr . (a, b) Learning performance as a function of τr . (a) MSTDP. (b) MSTDPET. (c) Learning performance of MSTDPET as a function of reward delay σ (τz = τr = 25 ms). All experiments used γ = 0.05 mV.
When both Ni and No vary at the same time (Ni = No = N), the convergence time decreases slightly for higher network sizes (see Figure 10a). Again, learning efficacy diminishes with higher N, which means that one cannot train networks of arbitrary size with this method (see Figure 10b). The behavior is similar for both MSTDP and MSTDPET, but for MSTDPET, the decrease of learning efficacy for larger networks is more accentuated. By considering that the dependence of ec on N is linear for N ≥ 600 and extrapolating from the available data, we find that learning is not possible (ec ≈ 0) for networks with N > Nmax, where Nmax is about 4690 for MSTDP and 1560 for MSTDPET, for the setup and the parameters considered here. For a given configuration of parameters, learning efficacy can be improved at the expense of learning time by decreasing the learning rate γ (as discussed above; see Figure 5).
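The extrapolation described above amounts to a least-squares line fit followed by a zero crossing. A sketch follows; the efficacy values in the example are purely hypothetical placeholders, not the measured data of the letter.

```python
from statistics import mean

def n_max(sizes, efficacies):
    """Fit e_c = a*N + b to the given points and return the N where e_c reaches 0."""
    nm, em = mean(sizes), mean(efficacies)
    a = sum((n - nm) * (e - em) for n, e in zip(sizes, efficacies)) \
        / sum((n - nm) ** 2 for n in sizes)
    b = em - a * nm
    return -b / a

# Hypothetical data: efficacy falling linearly from 0.8 at N = 600 to 0.2 at N = 1200
print(n_max([600, 900, 1200], [0.8, 0.5, 0.2]))   # ~ 1400
```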
Figure 10: Scaling with network size of learning a target firing-rate pattern, with MSTDP and MSTDPET. (a) Convergence time tc as a function of network size Ni = No = N. (b) Learning efficacy at convergence e c as a function of N.
4.5.7 Reward-Modulated Classical Hebbian Learning. To further investigate the importance of causal STDP as a component of the proposed learning rules, we have also studied the properties of classical (rate-dependent) Hebbian plasticity modulated by reward, while still using spiking neural networks. Under certain conditions (Xie & Seung, 2000), classical Hebbian plasticity can be approximated by modified STDP with a symmetric plasticity window, A+ = A− = 1. In this case, according to the notation in Xie and Seung (2000), we have β0 > 0 and β1 = 0, and thus plasticity depends (approximately) on the product of the pre- and postsynaptic firing rates, not on their temporal derivatives. Reward-modulated modified STDP with A+ = A− = 1 was studied above. A more direct approximation of modulated rate-dependent Hebbian learning with spiking neurons is the following learning rule:

dwij(t)/dt = γ r(t) Pij+(t) Pij−(t),
(4.4)
where Pij± are computed as for MSTDP and MSTDPET, with A+ = A− = 1. It can be seen that this rule is a reward-modulated version of the classical Hebbian rule, since Pij+ is an estimate of the firing rate of the presynaptic neuron j, and Pij− is an estimate of the firing rate of the postsynaptic neuron i, both using an exponential kernel. Reward can take both signs, and thus this rule allows both potentiation and depression of synapses; an extra mechanism for depression is not necessary to avoid synaptic explosion, as it is for classic Hebbian learning. We repeated the experiments presented above using this learning rule but could not achieve learning for values of γ between 10^−6 and 1 mV.
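One Euler-step sketch of this rule (equation 4.4), assuming the same exponential trace updates as MSTDP; the function and parameter names are ours.

```python
import math

tau_plus = tau_minus = 20.0   # trace time constants (ms)
A_plus = A_minus = 1.0        # symmetric window, as in equation 4.4
dt = 1.0

def hebbian_step(w, r, P_plus, P_minus, f_pre, f_post, gamma=0.001):
    """dw/dt = gamma * r * P+ * P- (equation 4.4), one discrete time step.

    P+ and P- are exponential-kernel estimates of the pre- and postsynaptic
    firing rates (equations 3.11-3.12 with A+ = A- = 1); f_pre and f_post are
    the binary spike indicators for the current step.
    """
    P_plus = P_plus * math.exp(-dt / tau_plus) + A_plus * f_pre
    P_minus = P_minus * math.exp(-dt / tau_minus) + A_minus * f_post
    w = w + gamma * r * P_plus * P_minus * dt
    return w, P_plus, P_minus
```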
4.5.8 Sensitivity to Parameters. The results of the simulations presented above are quite robust with respect to the parameters used for the learning rules. The only parameter that we had to tune from experiment to experiment was the learning rate γ, which determined the magnitude of synaptic changes. A learning rate that is too small does not change the synapses rapidly enough for the effects of learning to be observed within a given experiment duration, while a learning rate that is too large pushes the synapses toward the bounds, thus limiting the network’s behavior. Tuning the learning rate is needed for any kind of learning rule involving synaptic plasticity. The magnitude of the time constant τz of the decay of the eligibility trace was not essential in experiments without delayed reward, although large values degrade learning performance (see Figure 8b). For the relative magnitudes of positive and negative rewards, we used a natural choice (setting them equal) and did not have to tune this ratio, as in Xie and Seung (2004), to achieve learning. The absolute magnitudes of the reward and of A± are not relevant, as their effect on the synapses is scaled by γ.

5 Discussion

We have derived several versions of a reinforcement learning algorithm for spiking neural networks by applying an abstract reinforcement learning algorithm to the spike response model of spiking neurons. The resulting learning rules are similar to spike-timing-dependent synaptic and intrinsic plasticity mechanisms experimentally observed in animals but involve an extra modulation of the synaptic and excitability changes by the reward. We have also studied in computer simulations the properties of two learning rules, MSTDP and MSTDPET, that preserve the main features of the rules derived analytically while being simpler, and that are also extensions of the standard model of STDP. We have shown that the modulation of STDP by a global reward signal leads to reinforcement learning.
Reinforcement Learning Through Modulation of STDP
1497

An eligibility trace that keeps a decaying memory of the effects of recent spike pairings allows learning when the reward is delayed. The causal nature of the STDP window seems to be an important factor for the learning performance of the proposed learning rules. The continuity between the proposed reinforcement learning mechanism and experimentally observed STDP makes it biologically plausible and also establishes a continuity between reinforcement learning and the unsupervised learning capabilities of STDP. The introduction of the eligibility trace z does not contradict what is currently known about biological STDP, as it simply implies that synaptic changes are not instantaneous but are implemented through the generation of a set of biochemical substances that decay exponentially after generation. A new feature is the modulatory effect of the reward signal r. This may be implemented in the brain by a neuromodulator. For example, dopamine carries a short-latency
reward signal indicating the difference between actual and predicted rewards (Schultz, 2002), which fits well with our learning model based on continuous reward-modulated plasticity. It is known that dopamine and acetylcholine modulate classical (firing-rate-dependent) long-term potentiation and depression of synapses (Seamans & Yang, 2004; Huang, Simpson, Kellendonk, & Kandel, 2004; Thiel, Friston, & Dolan, 2002). The modulation of STDP by neuromodulators is supported by the discovery of the amplification of spike-timing-dependent potentiation in hippocampal CA1 pyramidal neurons by the activation of beta-adrenergic receptors (Lin, Min, Chiu, & Yang, 2003, Figure 6F). As speculated before (Xie & Seung, 2004), it may be that other studies failed to detect the influence of neuromodulators on STDP because the reward circuitry may not have worked during the experiments and the reward signal may have been fixed to a given value. In this case, according to the proposed learning rules, for excitatory synapses, if the reward is frozen to a positive value during the experiment, it leads to Hebbian STDP and otherwise to anti-Hebbian STDP. The studied learning rules may be used in applications for training generic artificial spiking neural networks, and they suggest the experimental investigation in animals of the existence of reward-modulated STDP.

Acknowledgments

We thank the reviewers, Raul C. Mureşan, Walter Senn, and Sergiu Paşca, for useful advice. This work was supported by Arxia SRL and by a grant of the Romanian government (MEdC-ANCS).

References

Abbott, L. F., & Gerstner, W. (2005). Homeostasis and learning through spike-timing dependent plasticity. In B. Gutkin, D. Hansel, C. Meunier, J. Dalibard, & C. Chow (Eds.), Methods and models in neurophysics: Proceedings of the Les Houches Summer School 2003. Amsterdam: Elsevier Science.
Abbott, L. F., & Nelson, S. B. (2000). Synaptic plasticity: Taming the beast. Nature Neuroscience, 3, 1178–1183.
Aizenman, C. D., & Linden, D. J. (2000). Rapid, synaptically driven increases in the intrinsic excitability of cerebellar deep nuclear neurons. Nature Neuroscience, 3(2), 109–111.
Alstrøm, P., & Stassinopoulos, D. (1995). Versatility and adaptive performance. Physical Review E, 51(5), 5027–5032.
Bartlett, P. L., & Baxter, J. (1999a). Direct gradient-based reinforcement learning: I. Gradient estimation algorithms (Tech. Rep.). Canberra: Australian National University, Research School of Information Sciences and Engineering.
Bartlett, P. L., & Baxter, J. (1999b). Hebbian synaptic modifications in spiking neurons that learn (Tech. Rep.). Canberra: Australian National University, Research School of Information Sciences and Engineering.
1498
R. Florian
Bartlett, P., & Baxter, J. (2000a). Stochastic optimization of controlled partially observable Markov decision processes. In Proceedings of the 39th IEEE Conference on Decision and Control. Piscataway, NJ: IEEE.
Bartlett, P. L., & Baxter, J. (2000b). A biologically plausible and locally optimal learning algorithm for spiking neurons. Available online at http://arp.anu.edu.au/ftp/papers/jon/brains.pdf.gz.
Bartlett, P. L., & Baxter, J. (2000c). Estimation and approximation bounds for gradient-based reinforcement learning. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory (pp. 133–141). San Francisco: Morgan Kaufmann.
Barto, A. G. (1985). Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4, 229–256.
Barto, A., & Anandan, P. (1985). Pattern-recognizing stochastic learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15(3), 360–375.
Barto, A. G., & Anderson, C. W. (1985). Structural learning in connectionist systems. In Proceedings of the Seventh Annual Conference of the Cognitive Science Society. Mahwah, NJ: Erlbaum.
Barto, A. G., & Jordan, M. I. (1987). Gradient following without back-propagation in layered networks. In M. Caudill & C. Butler (Eds.), Proceedings of the First IEEE Annual Conference on Neural Networks (Vol. 2, pp. 629–636). Piscataway, NJ: IEEE.
Baxter, J., & Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319–350.
Baxter, J., Bartlett, P. L., & Weaver, L. (2001). Experiments with infinite-horizon, policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 351–381.
Baxter, J., Weaver, L., & Bartlett, P. L. (1999). Direct gradient-based reinforcement learning: II. Gradient ascent algorithms and experiments (Tech. Rep.). Canberra: Australian National University, Research School of Information Sciences and Engineering.
Bell, A. J., & Parra, L. C. (2005). Maximising sensitivity in a spiking network. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387(6630), 278–281.
Bi, G.-Q., & Poo, M.-M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience, 18(24), 10464–10472.
Bohte, S. M. (2004). The evidence for neural information processing with precise spike-times: A survey. Natural Computing, 3(2), 195–206.
Bohte, S. M., & Mozer, M. C. (2005). Reducing spike train variability: A computational theory of spike-timing dependent plasticity. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press.
Chechik, G. (2003). Spike time dependent plasticity and information maximization. Neural Computation, 15, 1481–1510.
Cudmore, R. H., & Turrigiano, G. G. (2004). Long-term potentiation of intrinsic excitability in LV visual cortical neurons. Journal of Neurophysiology, 92, 341–348.
Dan, Y., & Poo, M.-M. (1992). Hebbian depression of isolated neuromuscular synapses in vitro. Science, 256(5063), 1570–1573.
Dan, Y., & Poo, M.-M. (2004). Spike timing-dependent plasticity of neural circuits. Neuron, 44, 23–30.
Daoudal, G., & Debanne, D. (2003). Long-term plasticity of intrinsic excitability: Learning rules and mechanisms. Learning and Memory, 10, 456–465.
Daucé, E., Soula, H., & Beslon, G. (2005). Learning methods for dynamic neural networks. In Proceedings of the 2005 International Symposium on Nonlinear Theory and Its Applications (NOLTA 2005). Bruges, Belgium.
Egger, V., Feldmeyer, D., & Sakmann, B. (1999). Coincidence detection and changes of synaptic efficacy in spiny stellate neurons in rat barrel cortex. Nature Neuroscience, 2(12), 1098–1105.
Farries, M. A., & Fairhall, A. L. (2005a). Reinforcement learning with modulated spike timing-dependent plasticity. Poster presented at the Computational and Systems Neuroscience Conference (COSYNE 2005). Available online at http://www.cosyne.org/climages/d/dy/COSYNE05_Abstracts.pdf.
Farries, M. A., & Fairhall, A. L. (2005b). Reinforcement learning with modulated spike timing-dependent plasticity. Program No. 384.3. 2005 Abstract Viewer/Itinerary Planner. Washington, DC: Society for Neuroscience. Available online at http://sfn.scholarone.com/itin2005/main.html?new_page_id=126&abstract_id=5526&p_num=384.3&is_tec.
Florian, R. V. (2005). A reinforcement learning algorithm for spiking neural networks. In D. Zaharie, D. Petcu, V. Negru, T. Jebelean, G. Ciobanu, A. Cicortaş, A. Abraham, & M. Paprzycki (Eds.), Proceedings of the Seventh International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (pp. 299–306). Los Alamitos, CA: IEEE Computer Society.
Florian, R. V., & Mureşan, R. C. (2006). Phase precession and recession with STDP and anti-STDP. In S. Kollias, A. Stafylopatis, W. Duch, & E. Oja (Eds.), Artificial Neural Networks – ICANN 2006. 16th International Conference, Athens, Greece, September 10–14, 2006. Proceedings, Part I. Berlin: Springer.
Froemke, R. C., & Dan, Y. (2002). Spike-timing-dependent synaptic modification induced by natural spike trains. Nature, 416, 433–438.
Ganguly, K., Kiss, L., & Poo, M. (2000). Enhancement of presynaptic neuronal excitability by correlated presynaptic and postsynaptic spiking. Nature Neuroscience, 3(10), 1018–1026.
Gerstner, W. (2001). A framework for spiking neuron models: The spike response model. In F. Moss & S. Gielen (Eds.), The handbook of biological physics (Vol. 4, pp. 469–516). Amsterdam: Elsevier Science.
Gerstner, W., & Kistler, W. M. (2002). Spiking neuron models. Cambridge: Cambridge University Press.
Gütig, R., Aharonov, R., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. Journal of Neuroscience, 23(9), 3697–3714.
Han, V. Z., Grant, K., & Bell, C. C. (2000). Reversible associative depression and nonassociative potentiation at a parallel fiber synapse. Neuron, 27, 611–622.
Hopfield, J. J., & Brody, C. D. (2004). Learning rules and network repair in spike-timing-based computation networks. Proceedings of the National Academy of Sciences, 101(1), 337–342.
Huang, Y.-Y., Simpson, E., Kellendonk, C., & Kandel, E. R. (2004). Genetic evidence for the bidirectional modulation of synaptic plasticity in the prefrontal cortex by D1 receptors. Proceedings of the National Academy of Sciences, 101(9), 3236–3241.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (1999). Hebbian learning and spiking neurons. Physical Review E, 59(4), 4498–4514.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Computation, 13, 2709–2742.
Legenstein, R., Naeger, C., & Maass, W. (2005). What can a neuron learn with spike-timing-dependent plasticity? Neural Computation, 17, 2337–2382.
Li, C., Lu, J., Wu, C., Duan, S., & Poo, M. (2004). Bidirectional modification of presynaptic neuronal excitability accompanying spike timing-dependent synaptic plasticity. Neuron, 41, 257–268.
Lin, Y., Min, M., Chiu, T., & Yang, H. (2003). Enhancement of associative long-term potentiation by activation of beta-adrenergic receptors at CA1 synapses in rat hippocampal slices. Journal of Neuroscience, 23(10), 4173–4181.
Marbach, P., & Tsitsiklis, J. N. (1999). Simulation-based optimization of Markov reward processes: Implementation issues. In Proceedings of the 38th Conference on Decision and Control. Piscataway, NJ: IEEE.
Marbach, P., & Tsitsiklis, J. N. (2000). Approximate gradient methods in policy-space optimization of Markov reward processes. Discrete Event Dynamic Systems: Theory and Applications, 13(1–2), 111–148.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297), 213–215.
Mazzoni, P., Andersen, R. A., & Jordan, M. I. (1991). A more biologically plausible learning rule for neural networks. Proceedings of the National Academy of Sciences of the USA, 88, 4433–4437.
Nick, T. A., & Ribera, A. B. (2000). Synaptic activity modulates presynaptic excitability. Nature Neuroscience, 3(2), 142–149.
Pfister, J.-P., Toyoizumi, T., Barber, D., & Gerstner, W. (2006). Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural Computation, 18(6), 1318–1348.
Pouget, A., Deffayet, C., & Sejnowski, T. J. (1995). Reinforcement learning predicts the site of plasticity for auditory remapping in the barn owl. In B. Fritzke, G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7. Cambridge, MA: MIT Press.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
Rao, R. P. N., & Sejnowski, T. J. (2001). Spike-timing-dependent Hebbian plasticity as temporal difference learning. Neural Computation, 13(10), 2221–2237.
Roberts, P. (1999). Computational consequences of temporally asymmetric learning rules: I. Differential Hebbian learning. Journal of Computational Neuroscience, 7, 235–246.
Roberts, P. D., & Bell, C. C. (2002). Spike timing dependent synaptic plasticity in biological systems. Biological Cybernetics, 87(5–6), 392–403.
Schultz, W. (2002). Getting formal with dopamine and reward. Neuron, 36, 241–263.
Seamans, J. K., & Yang, C. R. (2004). The principal features and mechanisms of dopamine modulation in the prefrontal cortex. Progress in Neurobiology, 74, 1–57.
Seung, H. S. (2003). Learning in spiking neural networks by reinforcement of stochastic synaptic transmission. Neuron, 40(6), 1063–1073.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nature Neuroscience, 3, 919–926.
Soula, H., Alwan, A., & Beslon, G. (2004). Obstacle avoidance learning in a spiking neural network. Poster presented at Last Minute Results of Simulation of Adaptive Behavior, Los Angeles, CA. Available online at http://www.koredump.org/hed/abstract_sab2004.pdf.
Soula, H., Alwan, A., & Beslon, G. (2005). Learning at the edge of chaos: Temporal coupling of spiking neuron controller of autonomous robotic. In Proceedings of AAAI Spring Symposia on Developmental Robotics. Menlo Park, CA: AAAI Press.
Stassinopoulos, D., & Bak, P. (1995). Democratic reinforcement: A principle for brain function. Physical Review E, 51, 5033–5039.
Stassinopoulos, D., & Bak, P. (1996). Democratic reinforcement: Learning via self-organization. Available online at http://arxiv.org/abs/cond-mat/9601113.
Strösslin, T., & Gerstner, W. (2003). Reinforcement learning in continuous state and action space. Available online at http://lenpe7.epfl.ch/∼stroessl/publications/StrosslinGe03.pdf.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Takita, K., & Hagiwara, M. (2002). A pulse neural network learning algorithm for POMDP environment. In Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN 2002) (pp. 1643–1648). Piscataway, NJ: IEEE.
Takita, K., & Hagiwara, M. (2005). A pulse neural network reinforcement learning algorithm for partially observable Markov decision processes. Systems and Computers in Japan, 36(3), 42–52.
Takita, K., Osana, Y., & Hagiwara, M. (2001). Reinforcement learning algorithm with network extension for pulse neural network. Transactions of the Institute of Electrical Engineers of Japan, 121-C(10), 1634–1640.
Thiel, C. M., Friston, K. J., & Dolan, R. J. (2002). Cholinergic modulation of experience-dependent plasticity in human auditory cortex. Neuron, 35, 567–574.
Toyoizumi, T., Pfister, J.-P., Aihara, K., & Gerstner, W. (2005). Spike-timing dependent plasticity and mutual information maximization for a spiking neuron model. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 1409–1416). Cambridge, MA: MIT Press.
Turrigiano, G. G., & Nelson, S. B. (2004). Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5, 97–107.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
Xie, X., & Seung, H. S. (2000). Spike-based learning rules and stabilization of persistent neural activity. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Xie, X., & Seung, H. S. (2004). Learning in neural networks by reinforcement of irregular spiking. Physical Review E, 69, 041909.
Zhang, W., & Linden, D. J. (2003). The other side of the engram: Experience-driven changes in neuronal intrinsic excitability. Nature Reviews Neuroscience, 4, 885–900.
Received June 14, 2005; accepted September 6, 2006.
LETTER
Communicated by Robert Kass
A Method for Selecting the Bin Size of a Time Histogram

Hideaki Shimazaki
[email protected]
Shigeru Shinomoto
[email protected]
Department of Physics, Kyoto University, Kyoto 606-8502, Japan
The time histogram method is the most basic tool for capturing a time-dependent rate of neuronal spikes. Generally in the neurophysiological literature, the bin size that critically determines the goodness of the fit of the time histogram to the underlying spike rate has been subjectively selected by individual researchers. Here, we propose a method for objectively selecting the bin size from the spike count statistics alone, so that the resulting bar or line graph time histogram best represents the unknown underlying spike rate. For a small number of spike sequences generated from a modestly fluctuating rate, the optimal bin size may diverge, indicating that any time histogram is likely to capture a spurious rate. Given a paucity of data, the method presented here can nevertheless suggest how many experimental trials should be added in order to obtain a meaningful time-dependent histogram with the required accuracy.

1 Introduction

Neurophysiological studies are based on the idea that information is transmitted between cortical neurons by spikes (Johnson, 1996; Dayan & Abbott, 2001). A number of filtering algorithms have been proposed for estimating the instantaneous activity of an individual neuron or the joint activity of multiple neurons (DiMatteo, Genovese, & Kass, 2001; Wiener & Richmond, 2002; Sanger, 2002; Kass, Ventura, & Cai, 2003; Brockwell, Rojas, & Kass, 2004; Kass, Ventura, & Brown, 2005; Brown, Kass, & Mitra, 2004). The most basic and frequently used tool for spike rate estimation is the time histogram method. For instance, one aligns spike sequences at the onset of stimuli repeatedly applied to an animal and describes the response of a single neuron with a peristimulus time histogram (PSTH) or the responses of multiple neurons with a joint PSTH (Adrian, 1928; Gerstein & Kiang, 1960; Gerstein & Perkel, 1969; Abeles, 1982). The shape of a PSTH is largely dependent on the choice of the bin size.
Neural Computation 19, 1503–1527 (2007)

With a bin size that is too large, one cannot represent the time-dependent spike rate. And with a bin size that is too small, the time histogram
© 2007 Massachusetts Institute of Technology
fluctuates greatly, and one cannot discern the underlying spike rate. There is an appropriate bin size for each set of spike sequences, which is based on the goodness of the fit of the PSTH to the underlying spike rate. For most previously published PSTHs, however, the bin size has been subjectively selected by the authors. For data points distributed compactly, there are classical theories about how the optimal bin size scales with the total number of data points n. It was proven that the optimal bin size scales as \(n^{-1/3}\) with regard to the bar-graph density estimator (Révész, 1968; Scott, 1979). It was recently found that for two types of infinitely long spike sequences, whose rates fluctuate either smoothly or jaggedly, the optimal bin sizes exhibit different scaling relations with respect to the number of sequences, timescale, and amplitude of rate modulation (Koyama & Shinomoto, 2004). Though interesting, the scaling relations are valid only for a large amount of data and are of limited use in selecting a bin size. We devised a method of selecting the bin size of a time histogram from the spike data. In the course of our study, we realized that a theory on the empirical choice of the histogram bin size for a probability density function was presented by Rudemo (1982). Although applicable to a Poisson point process, this theory appears to have rarely been used by neurophysiologists in analyses of PSTHs. In the actual procedure of neurophysiological experiments, the number of trials (spike sequences) plays an important role in determining the resolution of a PSTH and thus in designing experiments. Therefore, it is preferable to have a theory that accords with the common protocol of neurophysiological experiments, in which a stimulus is repeated to extract a signal from a neuron.
Given a set of experimental data, we wish to not only determine the optimal bin size, but also estimate how many more experimental trials should be performed in order to obtain a resolution we deem sufficient. For a small number of spike sequences derived from a modestly fluctuating rate, the estimated optimal bin size may diverge, implying that by constructing a PSTH, it is likely that one obtains spurious results for the spike rate estimation (Koyama & Shinomoto, 2004). Because a shortage of data underlies this divergence, one can carry out more experiments to obtain a reliable rate estimation. Our method can suggest how many sequences should be added in order to obtain a meaningful time histogram with the required accuracy. As an application of this method, we also show that the scaling relations of the optimal bin size that appears for a large number of spike sequences can be examined from a relatively small amount of data. The degree of the smoothness of an underlying rate process can be estimated by this method. In addition to a bar graph (piecewise constant) time histogram, we also designed a method for creating a line graph (piecewise linear) time histogram, which is superior to a bar graph in the goodness of the fit to the underlying spike rate and in comparing multiple responses to different stimulus conditions.
These empirical methods for the bin size selection for a bar and a line graph histogram, estimation of the number of sequences required for the histogram, and estimation of the scaling exponents of the optimal bin size were corroborated by theoretical analysis derived for a generic stochastic rate process. In the next section, we develop the bar graph (peristimulus) time histogram (Bar-PSTH) method, which is the most frequently used PSTH. In the appendix, we develop the line graph (peristimulus) time histogram (Line-PSTH) method.

2 Optimization of the Bar Graph Time Histogram

We consider sequences of spikes repeatedly recorded from a single neuron under identical experimental conditions. A recent analysis revealed that in vivo spike trains are not simply random, but possess interspike interval distributions intrinsic and specific to individual neurons (Shinomoto, Shima, & Tanji, 2003; Shinomoto, Miyazaki, Tamura, & Fujita, 2005). However, spikes accumulated from a large number of spike trains are for the most part mutually independent and can be regarded as being derived from a time-dependent Poisson point process (Snyder, 1975; Daley & Vere-Jones, 1988; Kass et al., 2005). It would be natural to assess the goodness of the fit of the estimator λ̂_t to the underlying spike rate λ_t over the total observation period T by the mean integrated squared error (MISE),
\[
\mathrm{MISE} \equiv \frac{1}{T}\int_0^{T} E\big(\hat\lambda_t - \lambda_t\big)^2\,dt, \tag{2.1}
\]
where E refers to the expectation over different realizations of point events, given λ_t. We begin with a bar graph time histogram as λ̂_t and explore a method to select the bin size that minimizes the MISE. The difficulty of the problem comes from the fact that the underlying spike rate λ_t is not known. A bar graph time histogram is constructed simply by counting the number of spikes that belong to each bin of width Δ. For an observation period T, we obtain N = T/Δ intervals. The number of spikes accumulated from all n sequences in the ith interval is counted as k_i. The bar height at the ith bin is given as k_i/(nΔ). Figure 1 shows the schematic diagram for the construction of a bar graph time histogram. Given a bin of width Δ, the expected height of a bar graph for t ∈ [0, Δ] is the time-averaged rate,
\[
\theta = \frac{1}{\Delta}\int_0^{\Delta} \lambda_t\,dt. \tag{2.2}
\]
Figure 1: Bar-PSTH. (A) An underlying spike rate, λ_t. The horizontal bars indicate the time-averaged rates θ for individual bins of width Δ. (B) Sequences of spikes derived from the underlying rate. (C) A time histogram for the sample sequences of spikes. The estimated rate θ̂ is the total number of spikes k that entered each bin, divided by the number of sequences n and the bin size Δ.
The total number of spikes k from n spike sequences that enter this bin of width Δ obeys the Poisson distribution,
\[
p(k \mid n\Delta\theta) = \frac{(n\Delta\theta)^k}{k!}\,e^{-n\Delta\theta}, \tag{2.3}
\]
whose expectation is nΔθ. The unbiased estimator for θ is denoted as θ̂ = k/(nΔ), which is the empirical height of the bar graph for t ∈ [0, Δ].

2.1 Selection of the Bin Size. By segmenting the total observation period T into N intervals of size Δ, the MISE defined in equation 2.1 can be rewritten as
\[
\mathrm{MISE} = \frac{1}{\Delta}\int_0^{\Delta} \frac{1}{N}\sum_{i=1}^{N} E\big(\hat\theta_i - \lambda_{t+(i-1)\Delta}\big)^2\,dt, \tag{2.4}
\]
where θ̂_i ≡ k_i/(nΔ). Hereafter, we denote the average over the segmented rate λ_{t+(i−1)Δ} as an average over an ensemble of (segmented) rate functions
{λ_t} defined on an interval of t ∈ [0, Δ], as
\[
\mathrm{MISE} = \left\langle \frac{1}{\Delta}\int_0^{\Delta} E\big(\hat\theta - \lambda_t\big)^2\,dt \right\rangle. \tag{2.5}
\]
The expectation E refers to the average over the spike count, or θ̂ = k/(nΔ), given a rate function λ_t, or its mean value θ. The MISE can be decomposed into two parts:
\[
\mathrm{MISE} = \big\langle E(\hat\theta-\theta)^2 \big\rangle + \left\langle \frac{1}{\Delta}\int_0^{\Delta} (\lambda_t-\theta)^2\,dt \right\rangle. \tag{2.6}
\]
The first and second terms are, respectively, the stochastic fluctuation of the estimator θ̂ around the expected mean rate θ and the averaged temporal fluctuation of λ_t around its mean θ over an interval of length Δ. The second term of equation 2.6 can be decomposed further into two parts:
\[
\frac{1}{\Delta}\int_0^{\Delta} \big\langle(\lambda_t - \langle\theta\rangle + \langle\theta\rangle - \theta)^2\big\rangle\,dt
= \frac{1}{\Delta}\int_0^{\Delta} \big\langle(\lambda_t-\langle\theta\rangle)^2\big\rangle\,dt - \big\langle(\theta-\langle\theta\rangle)^2\big\rangle. \tag{2.7}
\]
The first term in equation 2.7 represents the mean squared fluctuation of the underlying rate λ_t from the mean rate ⟨θ⟩ and is independent of the bin size Δ, because
\[
\frac{1}{\Delta}\int_0^{\Delta} \big\langle(\lambda_t-\langle\theta\rangle)^2\big\rangle\,dt = \frac{1}{T}\int_0^{T} (\lambda_t-\langle\theta\rangle)^2\,dt. \tag{2.8}
\]
We define a cost function by subtracting this term from the original MISE,
\[
C_n(\Delta) \equiv \mathrm{MISE} - \frac{1}{T}\int_0^{T} (\lambda_t-\langle\theta\rangle)^2\,dt
= \big\langle E(\hat\theta-\theta)^2 \big\rangle - \big\langle(\theta-\langle\theta\rangle)^2\big\rangle. \tag{2.9}
\]
This cost function corresponds to the risk function in Rudemo (1982, equation 2.3), obtained by direct decomposition of the MISE. The second term in equation 2.9 represents the temporal fluctuation of the expected mean rate θ for individual intervals of period Δ. As the expected mean rate θ is not an observable quantity, we have to replace the fluctuation of the expected mean rate with that of the observable estimator θ̂. Using the decomposition
rule for an unbiased estimator (E θ̂ = θ),
\[
\big\langle E(\hat\theta-\langle E\hat\theta\rangle)^2 \big\rangle = \big\langle E(\hat\theta-\theta)^2 \big\rangle + \big\langle(\theta-\langle\theta\rangle)^2\big\rangle, \tag{2.10}
\]
the cost function is transformed into
\[
C_n(\Delta) = 2\big\langle E(\hat\theta-\theta)^2 \big\rangle - \big\langle E(\hat\theta-\langle E\hat\theta\rangle)^2 \big\rangle. \tag{2.11}
\]
Due to the assumed Poisson nature of the point process, the number of spikes k counted in each bin obeys a Poisson distribution; the variance of k is equal to the mean. For the estimated rate defined as θ̂ = k/(nΔ), this variance-mean relation corresponds to
\[
E(\hat\theta-\theta)^2 = \frac{1}{n\Delta}\,E\hat\theta. \tag{2.12}
\]
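Both properties of the estimator, its unbiasedness (E θ̂ = θ) and the variance-mean relation of equation 2.12, can be checked numerically for a constant-rate Poisson process. The sketch below uses only the Python standard library (a Knuth-style Poisson sampler); the parameter values are illustrative.

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's algorithm: multiply uniforms until the product drops below e^{-lam}."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

# Constant rate theta, n sequences, bin width delta: the spike count in a bin
# is Poisson with mean n*delta*theta, and the estimator is theta_hat = k/(n*delta).
rng = random.Random(0)
n, delta, theta = 50, 0.1, 30.0
estimates = [poisson_sample(n * delta * theta, rng) / (n * delta)
             for _ in range(10000)]
mean = sum(estimates) / len(estimates)                          # ~ theta
var = sum((x - mean) ** 2 for x in estimates) / len(estimates)  # ~ theta/(n*delta)
```

The sample mean of θ̂ approaches θ, and its sample variance approaches E θ̂/(nΔ), as equation 2.12 states.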
By incorporating equation 2.12 into equation 2.11, the cost function is given as a function of the estimator θ̂,
\[
C_n(\Delta) = \frac{2}{n\Delta}\big\langle E\hat\theta \big\rangle - \big\langle E(\hat\theta-\langle E\hat\theta\rangle)^2 \big\rangle. \tag{2.13}
\]
The optimal bin size is obtained by minimizing the cost function C_n(Δ), as
\[
\Delta^* \equiv \arg\min_{\Delta} C_n(\Delta). \tag{2.14}
\]
By replacing the expectation with the sample spike count, the cost function (see equation 2.13) is converted into this useful recipe:

Algorithm 1: A Method for Bin Size Selection for a Bar-PSTH
i. Divide the observation period T into N bins of width Δ, and count the number of spikes k_i from all n sequences that enter the ith bin.
ii. Construct the mean and variance of the number of spikes {k_i} as
\[
\bar{k} \equiv \frac{1}{N}\sum_{i=1}^{N} k_i, \qquad v \equiv \frac{1}{N}\sum_{i=1}^{N} (k_i-\bar{k})^2.
\]
iii. Compute the cost function:
\[
C_n(\Delta) = \frac{2\bar{k}-v}{(n\Delta)^2}.
\]
iv. Repeat i through iii while changing the bin size Δ to search for Δ* that minimizes C_n(Δ).
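Algorithm 1 translates directly into code. The following Python sketch is an illustrative implementation (the function names and the candidate-grid search are our own choices, not part of the original recipe); it assumes T is a multiple of the bin width.

```python
def bar_psth_cost(spike_times, n, T, delta):
    """Cost C_n(delta) = (2*k_mean - k_var)/(n*delta)^2 for one bin width
    (steps i-iii of algorithm 1). spike_times: spikes pooled over all n
    sequences; T: observation period, assumed a multiple of delta."""
    N = int(T / delta)                  # number of bins
    counts = [0] * N
    for t in spike_times:
        counts[min(int(t / delta), N - 1)] += 1
    k_mean = sum(counts) / N
    k_var = sum((k - k_mean) ** 2 for k in counts) / N
    return (2.0 * k_mean - k_var) / (n * delta) ** 2

def optimal_bin_size(spike_times, n, T, candidates):
    """Step iv: scan candidate bin widths and return the minimizer of C_n(delta)."""
    return min(candidates, key=lambda d: bar_psth_cost(spike_times, n, T, d))
```

A toy check: with n = 2, T = 4, and spikes producing bin counts (1, 2, 3, 2) at Δ = 1, one has k̄ = 2, v = 0.5, and C = (4 − 0.5)/4 = 0.875, which the function reproduces.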
2.2 Extrapolation of the Cost Function. With the method developed in the preceding section, we can determine the optimal bin size for a given set of experimental data. In this section, we develop a method to estimate how the optimal bin size decreases when more experimental trials are added to the data set: given n sequences, the method provides the cost function for m (≠ n) sequences. Assume that we are in possession of n spike sequences. The fluctuation of the expected mean rate ⟨(θ − ⟨θ⟩)²⟩ in equation 2.10 is replaced with the empirical fluctuation of the time histogram θ̂_n, using the decomposition rule for the unbiased estimator θ̂_n satisfying E θ̂_n = θ,
\[
\big\langle E(\hat\theta_n-\langle E\hat\theta_n\rangle)^2 \big\rangle = \big\langle E(\hat\theta_n-\theta)^2 \big\rangle + \big\langle(\theta-\langle\theta\rangle)^2\big\rangle. \tag{2.15}
\]
The expected cost function for m sequences can thus be obtained by substituting the above equation into equation 2.9, yielding
\[
C_m(\Delta\,|\,n) = \big\langle E(\hat\theta_m-\theta)^2 \big\rangle + \big\langle E(\hat\theta_n-\theta)^2 \big\rangle - \big\langle E(\hat\theta_n-\langle E\hat\theta_n\rangle)^2 \big\rangle. \tag{2.16}
\]
Using the variance-mean relation for a Poisson distribution, equation 2.12, and
\[
E(\hat\theta_m-\theta)^2 = \frac{1}{m\Delta}\,E\hat\theta_m = \frac{1}{m\Delta}\,E\hat\theta_n, \tag{2.17}
\]
we obtain
\[
C_m(\Delta\,|\,n) = \left(\frac{1}{m}-\frac{1}{n}\right)\frac{1}{\Delta}\big\langle E\hat\theta_n\big\rangle + C_n(\Delta), \tag{2.18}
\]
where C_n(Δ) is the original cost function (see equation 2.13) computed using the estimators θ̂_n. By replacing the expectation with sample spike count averages, the cost function for m sequences can be extrapolated with this formula, using the sample mean k̄ and variance v of the numbers of spikes, given n sequences and the bin size Δ. The extrapolation method is summarized as algorithm 2:

Algorithm 2: A Method for Extrapolating the Cost Function for a Bar-PSTH
A. Construct the extrapolated cost function,
\[
C_m(\Delta\,|\,n) = \left(\frac{1}{m}-\frac{1}{n}\right)\frac{\bar{k}}{n\Delta^2} + C_n(\Delta),
\]
using the sample mean k̄ and variance v of the number of spikes obtained from n sequences of spikes. C_n(Δ) is the cost function computed for n sequences of spikes with algorithm 1.
B. Search for Δ*_m that minimizes C_m(Δ|n).
C. Repeat A and B while changing m, and plot 1/Δ*_m versus 1/m to search for the critical value 1/m = 1/n̂_c above which 1/Δ*_m practically vanishes.

It may come to pass that the original cost function C_n(Δ) computed for n spike sequences does not have a minimum or has a minimum at a bin size comparable to the observation period T. Because of the paucity of data, one may consider carrying out more experiments to obtain a reliable rate estimation. The critical number of sequences n_c above which the cost function has a finite bin size Δ* may be estimated in the following manner. With a large Δ, the cost function can be expanded as
\[
C_n(\Delta) \sim \mu\left(\frac{1}{n}-\frac{1}{n_c}\right)\frac{1}{\Delta} + u\,\frac{1}{\Delta^2}, \tag{2.19}
\]
where we have introduced n_c and u, which are independent of n. The optimal bin size undergoes a phase transition from a vanishing 1/Δ* for n < n_c to a finite 1/Δ* for n > n_c. The inverse optimal bin size is expanded in the vicinity of n_c as 1/Δ* ∝ (1/n − 1/n_c). We can estimate the critical value n̂_c (see Figure 3) by applying this asymptotic relation to the set of Δ*_m estimated from C_m(Δ|n) for various values of m:
\[
\frac{1}{\Delta^*_m} \propto \frac{1}{m}-\frac{1}{\hat{n}_c}. \tag{2.20}
\]
The minimum number of sequences required for the construction of a Bar-PSTH is estimated from n̂_c. It should be noted that there are cases in which the optimal bin size exhibits a discontinuous divergence from a finite value. Even in such cases, the plot of {1/m, 1/Δ*_m} could be useful in exploring the discontinuous transition from nonvanishing values of 1/Δ* to practically vanishing values. In the opposite extreme, with a sufficiently large number of spike sequences, our method selects a small bin size. It is known that the optimal bin size exhibits a power law scaling with respect to the number of sequences n. The exponent of the scaling relation depends on the smoothness of the underlying rate (Koyama & Shinomoto, 2004). Given a large number of spike sequences, the method for extrapolating the cost function can also be used to estimate the optimal bin size for a larger number of spike sequences m > n, and further to estimate the scaling exponent representing the smoothness of the underlying rate.
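Algorithm 2 is equally direct to implement. The sketch below (function names and structure are our own; it recomputes the bin counts so as to be self-contained) evaluates the extrapolated cost C_m(Δ|n) and its minimizer over a candidate grid:

```python
def extrapolated_cost(spike_times, n, m, T, delta):
    """Step A of algorithm 2:
    C_m(delta|n) = (1/m - 1/n) * k_mean/(n*delta^2) + C_n(delta),
    where C_n(delta) = (2*k_mean - k_var)/(n*delta)^2 (algorithm 1)."""
    N = int(T / delta)
    counts = [0] * N
    for t in spike_times:
        counts[min(int(t / delta), N - 1)] += 1
    k_mean = sum(counts) / N
    k_var = sum((k - k_mean) ** 2 for k in counts) / N
    c_n = (2.0 * k_mean - k_var) / (n * delta) ** 2
    return (1.0 / m - 1.0 / n) * k_mean / (n * delta ** 2) + c_n

def extrapolated_optimum(spike_times, n, m, T, candidates):
    """Step B: bin width minimizing the extrapolated cost for m sequences."""
    return min(candidates, key=lambda d: extrapolated_cost(spike_times, n, m, T, d))
```

For m = n the correction term vanishes and the extrapolated cost reduces to C_n(Δ); for m > n the correction is negative, lowering the cost as equation 2.18 predicts.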
The Bin Size of a Time Histogram
1511
2.3 Theoretical Analysis on the Optimal Bin Size. To verify the empirical methods, we obtain the theoretical cost function of a Bar-PSTH directly from a process with a known underlying rate. Note that this theoretical cost function is not available in real experimental conditions, in which the underlying rate is not known. The estimator θ̂ ≡ k/(nΔ) is a uniformly minimum variance unbiased estimator (UMVUE) of θ, which achieves the lower bound of the Cramér-Rao inequality (Blahut, 1987; Cover & Thomas, 1991),

E(\hat{\theta} - \theta)^2 = \left[ -\sum_{k=0}^{\infty} p(k \mid n\Delta\theta)\, \frac{\partial^2 \log p(k \mid n\Delta\theta)}{\partial \theta^2} \right]^{-1} = \frac{\theta}{n\Delta}.   (2.21)
Inserting this into equation 2.9, the cost function is represented as

C_n(\Delta) = \frac{\langle\theta\rangle}{n\Delta} - \left\langle (\theta - \langle\theta\rangle)^2 \right\rangle = \frac{\mu}{n\Delta} - \frac{1}{\Delta^2} \int_0^\Delta \!\! \int_0^\Delta \phi(t_1 - t_2)\, dt_1\, dt_2,   (2.22)
where μ = ⟨θ⟩ is the mean rate, and φ(t₁ − t₂) = ⟨(λ_{t₁} − μ)(λ_{t₂} − μ)⟩ is the autocorrelation function of the rate fluctuation, λ_t − μ. To obtain the last equation, we used

\left\langle (\theta - \langle\theta\rangle)^2 \right\rangle = \left\langle \left( \frac{1}{\Delta} \int_0^\Delta (\lambda_t - \mu)\, dt \right)^2 \right\rangle = \frac{1}{\Delta^2} \int_0^\Delta \!\! \int_0^\Delta \left\langle (\lambda_{t_1} - \mu)(\lambda_{t_2} - \mu) \right\rangle dt_1\, dt_2.   (2.23)
The cost function with a large bin size can be rewritten as

C_n(\Delta) = \frac{\mu}{n\Delta} - \frac{1}{\Delta^2} \int_{-\Delta}^{\Delta} (\Delta - |t|)\, \phi(t)\, dt \sim \frac{\mu}{n\Delta} - \frac{1}{\Delta} \int_{-\infty}^{\infty} \phi(t)\, dt + \frac{1}{\Delta^2} \int_{-\infty}^{\infty} |t|\, \phi(t)\, dt,   (2.24)
based on the symmetry φ(t) = φ(−t) for a stationary process. Equation 2.24 can be identified with equation 2.19 with parameters

n_c = \mu \Big/ \int_{-\infty}^{\infty} \phi(t)\, dt,   (2.25)
1512
H. Shimazaki and S. Shinomoto
u = \int_{-\infty}^{\infty} |t|\, \phi(t)\, dt.   (2.26)
For the process giving the correlation of the form φ(t) = σ²e^{−|t|/τ} and the process giving a gaussian correlation φ(t) = σ²e^{−t²/τ²} used in the simulation in the next section, the critical numbers are obtained as n_c = μ/(2σ²τ) and n_c = μ/(σ²τ√π), respectively. Based on the same theoretical cost function, equation 2.22, we can also derive scaling relations of the optimal bin size, which is achievable with a large number of spike sequences. With a small Δ, the correlation of the rate fluctuation φ(t) can be expanded as φ(t) = φ(0) + φ'(0⁺)|t| + (1/2)φ''(0)t² + O(|t|³), and we obtain an expansion of equation 2.22 with respect to Δ,

C_n(\Delta) = \frac{\mu}{n\Delta} - \phi(0) - \frac{1}{3}\phi'(0^+)\,\Delta - \frac{1}{12}\phi''(0)\,\Delta^2 + O(\Delta^3).   (2.27)
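The two closed forms of n_c quoted above can be checked against equation 2.25 by direct numerical integration; a sketch with assumed parameter values:

```python
import numpy as np

mu, sigma, tau = 30.0, 10.0, 0.1          # assumed rate statistics
dt = 1e-4
t = np.arange(-5.0, 5.0, dt)              # wide grid; both correlations decay fast

phi_exp = sigma**2 * np.exp(-np.abs(t) / tau)    # exponential correlation
phi_gauss = sigma**2 * np.exp(-t**2 / tau**2)    # gaussian correlation

# n_c = mu / integral of phi(t) dt (equation 2.25), by Riemann sum
nc_exp = mu / (phi_exp.sum() * dt)
nc_gauss = mu / (phi_gauss.sum() * dt)

nc_exp_closed = mu / (2 * sigma**2 * tau)
nc_gauss_closed = mu / (sigma**2 * tau * np.sqrt(np.pi))
```

Both numerical values agree with the closed forms to well under one percent on this grid.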
The optimal bin size Δ* is obtained from dC_n(Δ)/dΔ = 0. For a rate that fluctuates smoothly in time, the correlation function is a smooth function of t, resulting in φ'(0) = 0 due to the symmetry φ(t) = φ(−t). In this case, we obtain the scaling relation

\Delta^* \sim \left( \frac{6\mu}{-\phi''(0)\, n} \right)^{1/3}.   (2.28)
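As a sanity check, a grid minimization of the truncated expansion 2.27 reproduces equation 2.28. A sketch for the gaussian correlation φ(t) = σ²e^{−t²/τ²} (for which φ'(0⁺) = 0 and φ''(0) = −2σ²/τ²), with assumed parameter values:

```python
import numpy as np

mu, sigma, tau, n = 30.0, 10.0, 0.1, 100
phi2 = -2 * sigma**2 / tau**2            # phi''(0) for the gaussian correlation

# Truncated cost of equation 2.27 with phi'(0+) = 0; the constant -phi(0)
# does not affect the location of the minimum.
delta = np.linspace(1e-4, 0.2, 20000)
cost = mu / (n * delta) - phi2 * delta**2 / 12.0
delta_star_grid = delta[np.argmin(cost)]

delta_star_closed = (6 * mu / (-phi2 * n)) ** (1.0 / 3.0)   # equation 2.28
```

With these values the minimizer is near 0.045 s, matching the closed form to the grid resolution.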
For a rate that fluctuates in a zigzag pattern, in which the correlation of the rate fluctuation has a cusp at t = 0 (φ'(0⁺) < 0), we obtain the scaling relation

\Delta^* \sim \left( -\frac{3\mu}{\phi'(0^+)\, n} \right)^{1/2},   (2.29)
by ignoring the second-order term. Examples that exhibit this type of scaling are the Ornstein-Uhlenbeck process and the random telegraph process (Kubo, Toda, & Hashitsume, 1985; Gardiner, 1985). The scaling relations, equation 2.28 and equation 2.29, are the generalization of the ones found in Koyama and Shinomoto (2004) for two specific time-dependent Poisson processes.

3 Application of the Method to Spike Data

In this section, we apply the methods for the bar graph (peristimulus) time histogram (Bar-PSTH) and the line graph (peristimulus) time histogram (Line-PSTH) to (1) spike sequences (point events) generated by the simulations of time-dependent Poisson processes and (2) spike sequences recorded
from a neuron in area MT. The methods for a Line-PSTH are summarized in the appendix.

3.1 Application to Simulation Data. We applied the method for estimating the optimal bin size of a Bar-PSTH to a set of spike sequences derived from a time-dependent Poisson process. Figure 2A shows the empirical cost function C_n(Δ) computed from a set of n spike sequences. The empirical cost function is in good agreement with the theoretical cost function computed from the mean spike rate μ and the correlation φ(t) of the rate fluctuation, according to equation 2.22. In Figure 2D, a time histogram with the optimal bin size is compared with those with nonoptimal bin sizes, demonstrating the effectiveness of optimizing the bin size. Algorithm 2 provides an estimate of how many sequences should be added for the construction of a Bar-PSTH with the resolution we deem sufficient. In this method, the cost function of m sequences is obtained by modifying the original cost function, C_n(Δ), computed from the spike count statistics of n sequences of spikes. In subsequent applications of this extrapolation method, the original cost function, C_n(Δ), was obtained by averaging over the initial partitioning positions. We applied this extrapolation method for a Bar-PSTH to a set of spike sequences derived from the smoothly regulated Poisson process. Figure 3A depicts the extrapolated cost function C_m(Δ|n) for several values of m, computed from a given set of n (= 30) spike sequences. Figure 3B represents the dependence of the inverse optimal bin size 1/Δ*_m on 1/m. The inverse optimal bin size, 1/Δ*_m, stays near 0 for 1/m > 1/n̂_c and departs from 0 linearly in 1/m for 1/m ≤ 1/n̂_c. By fitting the linear function, we estimated the critical number of sequences n̂_c. In Figure 3C, the critical number of sequences n̂_c estimated from a smaller or larger n is compared to the theoretical value of n_c computed from equation 2.25.
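Algorithm 1 appears earlier in the article and is not reproduced in this excerpt; the sketch below implements the bin-size search it describes, using the cost function built from the mean k̄ and variance v of the pooled spike counts, C_n(Δ) = (2k̄ − v)/(nΔ)². The sinusoidal rate and all parameter values are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n Poisson spike sequences with a sinusoidally modulated rate
n, T = 50, 20.0
peak = 40.0                                   # upper bound on the rate

def rate(t):
    return 30.0 + 10.0 * np.sin(2 * np.pi * t / 2.0)   # spikes/s

# Thinning: draw candidate spikes at the peak rate, keep each with
# probability rate(t)/peak; spikes are pooled over the n sequences.
cand = rng.uniform(0.0, T, rng.poisson(peak * T * n))
spikes = cand[rng.uniform(0.0, peak, cand.size) < rate(cand)]

def cost(delta):
    """C_n(Delta) = (2*kbar - v) / (n*Delta)^2 from pooled bin counts."""
    edges = delta * np.arange(int(T / delta) + 1)
    k = np.histogram(spikes, edges)[0]
    return (2 * k.mean() - k.var()) / (n * delta) ** 2

deltas = np.linspace(0.02, 1.0, 99)
delta_star = min(deltas, key=cost)
```

For this smooth rate the selected bin size falls well inside the examined range, far from both the finest and the coarsest candidate widths.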
It is noteworthy that the estimated n̂_c approximates the theoretical n_c well even from a fairly small number of spike sequences, with which the estimated optimal bin size diverges. We also constructed a method for selecting the bin size of a Line-PSTH, which is summarized as algorithm 3 in the appendix. Figure 4 compares the optimal Bar-PSTH and the optimal Line-PSTH obtained from the same set of spike sequences, demonstrating the superiority of the Line-PSTH to the Bar-PSTH in the sense of the MISE. In addition, the Line-PSTH is suitable for comparing multiple time-dependent rates, as is the case for filtering methods. The extrapolation method for a Line-PSTH is summarized as algorithm 4 in the appendix. With the extrapolation method for a Bar-PSTH (see algorithm 2), one can also estimate how much the optimal bin size decreases (i.e., the resolution increases) with the number of spike sequences. Figure 5A shows log-log plots of the optimal bin sizes Δ*_m versus m with respect to two rate processes that fluctuate either smoothly according to the stochastic process characterized by the mean spike rate μ and the correlation of the rate
Figure 2: Construction of the optimal Bar-PSTH from spike sequences. (A) Dots: An empirical cost function, C_n(Δ), computed from spike data according to algorithm 1. Here, 50 sequences of spikes for an interval of 20 [s] are derived from a time-dependent Poisson process characterized by the mean rate μ and the correlation of the rate fluctuation φ(t) = σ²e^{−t²/τ²}, with μ = 30 [spikes/s], σ = 10 [spikes/s], and τ = 0.1 [s]. Solid line: The theoretical cost function computed directly from the underlying fluctuating rate using equation 2.22. (B) The underlying fluctuating rate λ_t. (C) Spike sequences derived from the rate. (D) Time histograms made using three types of bin sizes: too small, optimal, and too large.
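A fluctuating rate with an approximately gaussian correlation σ²e^{−t²/τ²}, as used in this simulation, can be synthesized by smoothing white noise with a gaussian kernel of width s = τ/2 (the kernel's autocorrelation is then a gaussian of width τ). A sketch; the discretization and the clipping at zero are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, tau = 30.0, 10.0, 0.1
T, dt = 20.0, 0.001
t = np.arange(0.0, T, dt)

# Smooth white noise with a gaussian kernel of width s = tau/2; the
# autocorrelation of the filtered noise is proportional to exp(-t^2/tau^2).
s = tau / 2.0
tk = np.arange(-4 * s, 4 * s + dt, dt)
kernel = np.exp(-tk**2 / (2 * s**2))

noise = rng.standard_normal(t.size + tk.size - 1)
z = np.convolve(noise, kernel, mode="valid")     # length == t.size
z = (z - z.mean()) / z.std()                     # standardize to unit variance

rate = np.clip(mu + sigma * z, 0.0, None)        # rate must stay nonnegative
spike_counts = rng.poisson(rate * dt)            # one Poisson spike sequence
```

Repeating the last line n times yields the n sequences used to build the histogram; the empirical autocorrelation of z at lag τ is close to e⁻¹, as intended.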
process φ(t) = σ²e^{−t²/τ²} or in a zigzag pattern according to the Ornstein-Uhlenbeck process characterized by the mean rate μ and the correlation of the rate process φ(t) = σ²e^{−|t|/τ}. These plots exhibit power law scaling relations with distinctly different scaling exponents. The estimated exponents (−0.34 ± 0.04 for the smooth rate process and −0.56 ± 0.04 for the zigzag rate process) are close to the exponents of m^{−1/3} and m^{−1/2} that were obtained analytically as in equations 2.28 and 2.29 (see also Koyama & Shinomoto, 2004). In this way, we can estimate the degree of smoothness of the underlying rate from a reasonable amount of spike data. With the extrapolation method for a Line-PSTH (see algorithm 4 in the appendix), the scaling relations for a Line-PSTH can be examined in a
Figure 3: Extrapolation method. (A) Extrapolated cost functions C_m(Δ|n) of a Bar-PSTH for several values of m, computed from 30 sequences of spikes derived from a Poisson process characterized by the mean rate μ and the correlation of the rate fluctuation φ(t) = σ²e^{−t²/τ²}, with μ = 30, σ = 2, and τ = 0.1. The modestly fluctuating rate with σ = 2 is used, as compared with the rate with σ = 10 used in Figure 2. (B) The inverse optimal bin size 1/Δ*_m plotted against 1/m. The solid line is a linear function fitted to the data. (C) The critical number of sequences n̂_c extrapolated from a smaller or larger number of spike sequences. The horizontal axis represents the number of spike sequences n used to obtain the extrapolated cost function C_m(Δ|n). The vertical axis represents the extrapolated critical number n̂_c. The dashed lines represent the theoretical value n_c computed using equation 2.25 and a diagonal.
similar manner. Figure 5B represents the optimal bin sizes computed for two rate processes that fluctuate either smoothly or in a zigzag pattern. The estimated exponents are −0.24 ± 0.04 for the smooth rate process and −0.50 ± 0.05 for the zigzag rate process. The exponents obtained by the extrapolation method are similar to the analytically obtained exponents, m^{−1/5} and m^{−1/2}, respectively (see equations A.12 and A.13). Note that for the smoothly fluctuating rate process, the scaling relation for the Line-PSTH is m^{−1/5}, whereas the scaling relation for a Bar-PSTH is m^{−1/3}. In contrast, for the rate process that fluctuates jaggedly, the exponents of the scaling relations for both a Bar-PSTH and a Line-PSTH are m^{−1/2}.
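The exponent estimates quoted here come from straight-line fits in log-log coordinates; a minimal sketch with hypothetical (m, Δ*_m) pairs following the m^{−1/3} law of a Bar-PSTH with a smooth rate:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical optimal bin sizes obeying Delta*_m ~ c * m^(-1/3), with a
# small multiplicative scatter standing in for estimation noise.
m = np.array([50.0, 100.0, 200.0, 500.0])
delta_star = 0.5 * m ** (-1.0 / 3.0) * np.exp(0.02 * rng.standard_normal(4))

# The scaling exponent is the slope of log(Delta*_m) against log(m).
slope, intercept = np.polyfit(np.log(m), np.log(delta_star), 1)
```

The fitted slope recovers the exponent −1/3 up to the injected scatter, mirroring the regression coefficients reported for Figure 5.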
Figure 4: Comparison of the optimal Bar-PSTH and the optimal Line-PSTH. The spike data used are the same as the data used in Figure 2. (A) The optimal Bar-PSTH constructed from the spike sequences derived from an underlying fluctuating rate process (dashed line). (B) The optimal Line-PSTH constructed from the same set of spike sequences.
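The comparison of Figure 4 can be reproduced in miniature by estimating a known rate with a bar graph and with its line-graph interpolation, then comparing mean squared errors against the true rate. A sketch; the rate, bin size, and all parameter values are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, delta = 50, 20.0, 0.2
t = np.linspace(0.0, T, 20001)
dt = t[1] - t[0]
true_rate = 30.0 + 10.0 * np.sin(np.pi * t)        # smooth underlying rate

# Poisson counts per fine time step, pooled over the n sequences
counts = rng.poisson(n * true_rate * dt)

edges = delta * np.arange(int(T / delta) + 1)
k = np.histogram(np.repeat(t, counts), edges)[0]
theta_hat = k / (n * delta)                        # bar heights
centers = 0.5 * (edges[:-1] + edges[1:])

bar_est = theta_hat[np.clip((t / delta).astype(int), 0, k.size - 1)]
line_est = np.interp(t, centers, theta_hat)

mse_bar = np.mean((bar_est - true_rate) ** 2)
mse_line = np.mean((line_est - true_rate) ** 2)
```

For a smooth rate the line graph has both smaller bias within each bin and smaller stochastic fluctuation, so its mean squared error comes out below that of the bar graph, consistent with the figure.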
Figure 5: Scaling relations for the Bar-PSTH and the Line-PSTH. The optimal bin sizes Δ*_m are estimated from the extrapolated cost function C_m(Δ|n), computed from 100 sequences of spikes. The optimal bin sizes Δ*_m versus m are shown in log-log scale. Two types of rate processes were examined: one that fluctuates smoothly, which is characterized by the correlation of the rate fluctuation φ(t) = σ²e^{−t²/τ²}, and the other that fluctuates jaggedly, which is characterized by the correlation of the rate fluctuation φ(t) = σ²e^{−|t|/τ}. The parameter values are the same as the ones used in Figure 2. (A) Log-log plots of the optimal bin size with respect to m for a Bar-PSTH. Lines are fitted to the data in an interval of m ∈ [50, 500], whose regression coefficients, −0.34 ± 0.04 and −0.56 ± 0.04, correspond to the scaling exponents for a Bar-PSTH. (B) Log-log plots of the optimal bin size with respect to m for a Line-PSTH. Lines are fitted to the data in an interval of m ∈ [50, 500], whose regression coefficients, −0.24 ± 0.04 and −0.50 ± 0.05, correspond to the scaling exponents for a Line-PSTH.
3.2 Application to Experimental Data. We also applied our method for optimizing a Bar-PSTH to publicly available neuronal spike data (Britten, Shadlen, Newsome, & Movshon, 2004). The details of the experimental methods are described in Newsome, Britten, and Movshon (1989) and Britten, Shadlen, Newsome, and Movshon (1992). It was reported that neurons in area MT exhibit highly reproducible temporal responses to identical visual stimuli (Bair & Koch, 1996). The estimation of a time-dependent rate by a PSTH is, however, sensitive to the choice of the bin size. It would therefore be preferable for such an appraisal of reproducibility to be tested with our objective method of optimizing the bin size. To make a reliable estimate of the optimal bin size from spike sequences of a short observation period, we took an average of the cost functions computed under different partitioning positions. We examine here the data recorded from a neuron under the repeated application of a random dot visual stimulus with 3.2% coherent motion (w052, nsa2004.1, Britten et al., 2004). The method for a Bar-PSTH was applied to the data, with close attention paid to how the optimal bin size changes with the number of spike sequences sampled. Figure 6 depicts the results for the first n = 5, the first n = 20, and the total n = 50 sequences of spikes. The optimal bin size practically diverges (Δ* = 1000 [ms]) for the n = 5 sequences, implying that with this small amount of data, a discussion about the time-dependent response does not make sense, as long as we rely on the histogram method. Even at this stage, it is possible to extrapolate the cost function by means of algorithm 2. The critical number of trials, above which the optimal bin size is finite, was estimated as n̂_c ≈ 12. By increasing the number of sequences to n = 20, we obtained the optimal bin size Δ* = 33 [ms], implying that a discussion about the neuronal response is possible based on this amount of data.
The critical number of trials estimated from this data set (n = 20) is n̂_c ≈ 10. The optimal bin size for the total 50 sequences is Δ* = 23 [ms]. The critical number of sequences estimated from the total 50 sequences is n̂_c ≈ 12. Algorithm 1 confirms that the temporal rate modulation can be discussed with a sufficiently large number of sequences, such as n = 20 or n = 50. With algorithm 2, the three sets of sequences with n = 5, 20, and 50 suggest the critical number of sequences to be n̂_c ≈ 10–12.

4 Discussion

We have developed a method for selecting the bin size, so that the Bar-PSTH (see section 2) or the Line-PSTH (see the appendix) best represents the (unknown) underlying spike rate. The suitability of this method was demonstrated by applying it not only to model spike sequences generated by time-dependent Poisson processes, but also to real spike sequences recorded from cortical area MT of a monkey.
Figure 6: The Bar-PSTHs for the spike sequences recorded from an MT neuron (w052 in nsa2004.1; Britten et al., 2004). Left: The cost function (averaged over different partitioning positions). Right: Spike sequences sampled and the optimal Bar-PSTH for the first 1 [s] of the 2 [s] recording period. (A) The first 5 spike sequences. (B) The first 20 spike sequences, which include the 5 sequences of A. (C) The complete 50 spike sequences, which include the 20 sequences of B.
For a small number of spike sequences derived from a modestly fluctuating rate, the cost function does not have a minimum (C_n(T) < C_n(Δ) for any Δ < T), or has a minimum at a bin size that is comparable to the observation period T, implying that the time-dependent rate cannot be captured by means of the time histogram method. Our method can nevertheless extrapolate the cost function for any number of spike sequences and suggest how many sequences should be added in order to obtain a meaningful time histogram with the resolution we require. The model data and real data illustrated that the optimal bin size may diverge (Δ* = O(T)), and even under such conditions, our extrapolation method works reasonably well.
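In software, this divergence can be detected by checking where the grid minimizer of the cost function falls; the 0.5T threshold below is our choice, not the paper's:

```python
import numpy as np

def optimal_bin_size(deltas, costs, T):
    """Return the minimizing bin size, or None when the minimum sits at a
    bin size of order T (no useful interior minimum, Delta* = O(T))."""
    deltas, costs = np.asarray(deltas), np.asarray(costs)
    d = deltas[np.argmin(costs)]
    return None if d >= 0.5 * T else d   # treat Delta* = O(T) as divergent
```

For a monotonically decreasing cost function the minimizer is the largest bin size examined and `None` is returned, signaling that more sequences are needed.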
In this study, we adopted the mean integrated squared error (MISE) as the measure of the goodness of fit of a PSTH to the underlying spike rate. There have been studies of density estimation based on other criteria, such as the Kullback-Leibler divergence (Hall, 1990) and the Hellinger distance (Kanazawa, 1993). To our knowledge, however, there are as yet no practical algorithms based on these criteria that optimize the bin size solely from raw data. It would be interesting to explore practical methods based on other criteria for estimating the time-dependent rate and to compare them with the MISE criterion. Use of a regular (equal) bin size is a constraint particular to our method. The regular histogram is suitable for stationary time-dependent rates. In practical applications, however, the rate does not necessarily fluctuate in a stationary fashion but can change abruptly during a localized period of time. In such a situation, a histogram with variable bin width can fit the underlying rate better than a regular histogram can. In order to make a better estimate of the underlying rate in the sense of the MISE, it is desirable to develop an adaptive method that adjusts the bin size over time. It is also desirable to develop the optimization method for the Bar-PSTH and the Line-PSTH into a method for higher-order spline fittings that can be compared with filtering methods. Nevertheless, as long as the Bar-PSTHs or the Line-PSTHs are used as conventional rate-estimation tools, our method for selecting the bin size should be used for their construction.

Appendix: Optimization of the Line Graph Time Histogram

A line graph can be constructed by simply connecting the top centers of adjacent bar graphs of height k_i/(nΔ). Figure 7 schematically shows the construction of a line graph time histogram. For the same set of spike sequences, the optimal bin size (window size) of a Line-PSTH is, however, different from that of a Bar-PSTH.
We develop here a method for selecting the bin size for a Line-PSTH. The expected heights of adjacent bar graphs for intervals of [−Δ, 0] and [0, Δ] are

\theta^- \equiv \frac{1}{\Delta} \int_{-\Delta}^{0} \lambda_t\, dt, \qquad \theta^+ \equiv \frac{1}{\Delta} \int_{0}^{\Delta} \lambda_t\, dt.
The expected line graph L_t in an interval of [−Δ/2, Δ/2] is a line connecting the top centers of these bar graphs,

L_t = \frac{\theta^+ + \theta^-}{2} + \frac{\theta^+ - \theta^-}{\Delta}\, t.   (A.1)
Figure 7: Line-PSTH. (A) An underlying spike rate, λ_t. The horizontal bars indicate the original bar graphs θ⁻ and θ⁺ for adjacent intervals [−Δ, 0] and [0, Δ]. The expected line graph L_t is a line connecting the two top centers of adjacent expected bar graphs, (−Δ/2, θ⁻) and (Δ/2, θ⁺). (B) Sequences of spikes derived from the underlying rate. (C) A time histogram for the sample sequences of spikes. The empirical line graph L̂_t is a line connecting the two top centers of adjacent empirical bar graphs, (−Δ/2, θ̂⁻) and (Δ/2, θ̂⁺).
The unbiased estimators of the original bar graphs are θ̂⁻ = k⁻/(nΔ) and θ̂⁺ = k⁺/(nΔ), where k⁻ and k⁺ are the numbers of spikes from n spike sequences that enter the intervals [−Δ, 0] and [0, Δ], respectively. The empirical line graph L̂_t in an interval of [−Δ/2, Δ/2] is a line connecting the two top centers of adjacent empirical bar graphs,

\hat{L}_t = \frac{\hat{\theta}^+ + \hat{\theta}^-}{2} + \frac{\hat{\theta}^+ - \hat{\theta}^-}{\Delta}\, t.   (A.2)
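Equations A.1 and A.2 state that the line graph linearly interpolates the bar heights between bin centers, so constructing an empirical Line-PSTH reduces to one-dimensional interpolation. A sketch with assumed counts:

```python
import numpy as np

# Empirical bar heights theta_hat_i = k_i / (n * Delta) at the bin centers.
n, delta = 50, 0.1
k = np.array([12, 30, 55, 48, 20])                 # pooled counts per bin (assumed)
theta_hat = k / (n * delta)
centers = delta * (np.arange(k.size) + 0.5)

# The empirical line graph (equation A.2) connects adjacent bin centers.
t = np.linspace(centers[0], centers[-1], 1001)
line_psth = np.interp(t, centers, theta_hat)
```

At each bin center the line graph equals the bar height, and at the midpoint between two centers it equals the average of the two heights, exactly as equation A.2 prescribes.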
A.1 Selection of the Bin Size. The time average of the MISE (see equation 2.1) is rewritten by the average over the N segmented rates,

\mathrm{MISE} = \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} \left\langle E(\hat{L}_t - \lambda_t)^2 \right\rangle dt.   (A.3)
The MISE can be decomposed into two parts:

\mathrm{MISE} = \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} \left\langle E(\hat{L}_t - L_t)^2 \right\rangle dt + \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} \left\langle (\lambda_t - L_t)^2 \right\rangle dt.   (A.4)
The first term of equation A.4 is the stochastic fluctuation of the empirical linear estimator L̂_t due to the stochastic point events, which can be computed as

\frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} \left\langle E(\hat{L}_t - L_t)^2 \right\rangle dt = \frac{2}{3} \left\langle E(\hat{\theta}^+ - \theta^+)^2 \right\rangle.   (A.5)
Here we have used the relations E(θ̂⁺ − θ⁺)(θ̂⁻ − θ⁻) = 0 and E(θ̂⁺ − θ⁺)² = E(θ̂⁻ − θ⁻)². The second term of equation A.4 is the temporal fluctuation of λ_t around the expected linear estimator L_t. We expand the second term by inserting μ = ⟨θ⁻⟩ = ⟨θ⁺⟩ and obtain

\frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} \left\langle (\lambda_t - L_t)^2 \right\rangle dt
= \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} \left\langle (\lambda_t - \mu)^2 \right\rangle dt - \frac{2}{\Delta} \int_{-\Delta/2}^{\Delta/2} \left\langle (\lambda_t - \mu)(L_t - \mu) \right\rangle dt + \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} \left\langle (L_t - \mu)^2 \right\rangle dt
= \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} \left\langle (\lambda_t - \mu)^2 \right\rangle dt - 2\left\langle (\theta^+ - \mu)(\theta^0 - \mu) \right\rangle - 2\left\langle (\theta^+ - \mu)\,\theta^* \right\rangle + \frac{2}{3}\left\langle (\theta^+ - \mu)^2 \right\rangle + \frac{1}{3}\left\langle (\theta^+ - \mu)(\theta^- - \mu) \right\rangle,   (A.6)

where

\theta^0 \equiv \frac{1}{\Delta} \int_{-\Delta/2}^{\Delta/2} \lambda_t\, dt, \qquad \theta^* \equiv \frac{2}{\Delta^2} \int_{-\Delta/2}^{\Delta/2} t\,\lambda_t\, dt.
1522
H. Shimazaki and S. Shinomoto
To obtain the above equation, we have used the relations ⟨θ⁻⟩ = ⟨θ⁺⟩ = ⟨θ⁰⟩ = μ, ⟨θ*⟩ = 0, ⟨(θ⁺ − μ)²⟩ = ⟨(θ⁻ − μ)²⟩, ⟨(θ⁺ − μ)(θ⁰ − μ)⟩ = ⟨(θ⁻ − μ)(θ⁰ − μ)⟩, and ⟨(θ⁺ − μ)θ*⟩ = −⟨(θ⁻ − μ)θ*⟩. The first term of equation A.6 represents a mean squared fluctuation of the underlying rate and is independent of the choice of the bin size (see equation 2.8). We introduce the cost function by subtracting the variance of the underlying rate from the original MISE:

C_n(\Delta) \equiv \mathrm{MISE} - \frac{1}{T} \int_0^T (\lambda_t - \mu)^2\, dt
= \frac{2}{3} \left\langle E(\hat{\theta}^+ - \theta^+)^2 \right\rangle - 2\left\langle (\theta^+ - \mu)(\theta^0 - \mu) \right\rangle - 2\left\langle (\theta^+ - \mu)\,\theta^* \right\rangle + \frac{2}{3}\left\langle (\theta^+ - \mu)^2 \right\rangle + \frac{1}{3}\left\langle (\theta^+ - \mu)(\theta^- - \mu) \right\rangle.   (A.7)
The first term of equation A.7 can be estimated from the data, using the variance-mean relation, equation 2.12. The last four terms of equation A.7 are the covariances of the expected rates θ⁻, θ⁺, θ⁰, and θ*, which are not observables. We can estimate them by using the covariance decomposition rule for unbiased estimators:

\left\langle (\theta^+ - \langle\theta^+\rangle)(\theta^p - \langle\theta^p\rangle) \right\rangle = \left\langle E[(\hat{\theta}^+ - E\hat{\theta}^+)(\hat{\theta}^p - E\hat{\theta}^p)] \right\rangle - \left\langle E[(\hat{\theta}^+ - \theta^+)(\hat{\theta}^p - \theta^p)] \right\rangle,   (A.8)
where p denotes −, +, 0, or ∗. The first term of equation A.8 can be computed directly from the data, using the relations ⟨Eθ̂⁻⟩ = ⟨Eθ̂⁺⟩ = ⟨Eθ̂⁰⟩ = μ and ⟨Eθ̂*⟩ = 0. Unlike the Bar-PSTH, however, the mean-variance relation for the Poisson statistics is not directly applicable to the second term. We suggest estimating these covariance terms from multiple sample sequences, as in algorithm 3:

Algorithm 3: A Method for Bin Size Selection for a Line-PSTH

i. Divide the observation period T into N + 1 bins of width Δ. Count the number of spikes k_i(j) that enter the ith bin from the jth sequence. Define k_i^{(−)}(j) ≡ k_i(j) and k_i^{(+)}(j) ≡ k_{i+1}(j) (i = 1, 2, . . . , N). Divide the period [Δ/2, T − Δ/2] into N bins of width Δ. Count the number of spikes k_i^{(0)}(j) that enter the ith bin from the jth sequence. In each bin, compute k_i^{(∗)}(j) ≡ (2/Δ) Σ_ℓ t_i^{(ℓ)}(j), where t_i^{(ℓ)}(j) is the time of the ℓth spike that enters the ith bin, measured from the center of the bin.

ii. Sum up k_i^{(p)}(j) for all sequences: k_i^{(p)} ≡ Σ_{j=1}^{n} k_i^{(p)}(j), where p = {−, +, 0, ∗}.
The Bin Size of a Time Histogram
1523
Average those spike counts with respect to all the bins: k̄^{(p)} ≡ (1/N) Σ_{i=1}^{N} k_i^{(p)}.

Compute the covariances of k_i^{(+)} and k_i^{(p)}:

c^{(+,p)} \equiv \frac{1}{N} \sum_{i=1}^{N} \left( k_i^{(+)} - \bar{k}^{(+)} \right)\left( k_i^{(p)} - \bar{k}^{(p)} \right).

Compute the covariances of k_i^{(+)}(j) and k_i^{(p)}(j) with respect to sequences and average over time:

\bar{c}^{(+,p)} \equiv \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n-1} \sum_{j=1}^{n} \left( k_i^{(+)}(j) - \frac{k_i^{(+)}}{n} \right)\left( k_i^{(p)}(j) - \frac{k_i^{(p)}}{n} \right).

Finally, compute

\sigma^{(+,p)} \equiv \frac{c^{(+,p)}}{(n\Delta)^2} - \frac{\bar{c}^{(+,p)}}{n\Delta^2}.

iii. Compute the cost function:

C_n(\Delta) = \frac{2}{3} \frac{\bar{k}^{(+)}}{(n\Delta)^2} - 2\sigma^{(+,0)} - 2\sigma^{(+,*)} + \frac{2}{3}\sigma^{(+,+)} + \frac{1}{3}\sigma^{(+,-)}.
iv. Repeat i through iii while changing Δ to search for Δ* that minimizes C_n(Δ).

A.2 Extrapolation of the Cost Function. As in the case of the Bar-PSTH, the cost function for any m sequences of spike trains can be extrapolated using the variance-mean relation for the Poisson statistics,

C_m(\Delta \mid n) = \frac{2}{3} \left( \frac{1}{m} - \frac{1}{n} \right) \frac{\langle E\hat{\theta}_n \rangle}{\Delta} + C_n(\Delta),   (A.9)
where C_n(Δ) is the original cost function (see equation A.7). With the original cost function C_n(Δ), we can easily estimate a cost function for m sequences with algorithm 4:
1524
H. Shimazaki and S. Shinomoto
Algorithm 4: A Method for Extrapolating the Cost Function for a Line-PSTH

A. Construct the extrapolated cost function,

C_m(\Delta \mid n) = \frac{2}{3} \left( \frac{1}{m} - \frac{1}{n} \right) \frac{\bar{k}^{(+)}}{n\Delta^2} + C_n(\Delta),
where C_n(Δ) is the cost function for the line graph time histogram computed for n sequences of spikes with algorithm 3.

B. Search for Δ*_m that minimizes C_m(Δ|n).

C. Repeat A and B while changing m, and plot 1/Δ*_m versus 1/m to search for the critical value 1/m = 1/n̂_c above which 1/Δ*_m practically vanishes.

A.3 Scaling Relations of the Optimal Bin Sizes. We can obtain a theoretical cost function of a Line-PSTH directly from the mean spike rate μ and the correlation φ(t) of the rate fluctuation. According to the mean-variance relation based on the Cramér-Rao (in)equality (see equation 2.21), the cost function, equation A.7, is given by

C_n(\Delta) = \frac{2\mu}{3n\Delta} - \frac{2}{\Delta^2} \int_0^\Delta \!\! \int_{-\Delta/2}^{\Delta/2} \left( 1 + \frac{2t_2}{\Delta} \right) \phi(t_1 - t_2)\, dt_2\, dt_1 + \frac{2}{3\Delta^2} \int_0^\Delta \!\! \int_0^\Delta \phi(t_1 - t_2)\, dt_1\, dt_2 + \frac{1}{3\Delta^2} \int_0^\Delta \!\! \int_{-\Delta}^{0} \phi(t_1 - t_2)\, dt_2\, dt_1.   (A.10)
By expanding the correlation of the rate fluctuation,

\phi(t) = \phi(0) + \phi'(0^+)|t| + \frac{1}{2}\phi''(0)\,t^2 + \frac{1}{6}\phi'''(0^+)|t|^3 + \frac{1}{24}\phi''''(0)\,t^4 + O(|t|^5),

we obtain an expansion of equation A.10 with respect to Δ,

C_n(\Delta) = \frac{2\mu}{3n\Delta} - \phi(0) - \frac{37}{144}\phi'(0^+)\,\Delta + \frac{181}{5760}\phi'''(0^+)\,\Delta^3 + \frac{49}{2880}\phi''''(0)\,\Delta^4 + O(\Delta^5).   (A.11)
Unlike the Bar-PSTH, the line graph successfully approximates the original rate to first order in Δ, and therefore the O(Δ²) term in the cost function vanishes for a Line-PSTH.
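For a correlation with a cusp at zero, only the 1/Δ and linear terms of the expansion above matter for small Δ; minimizing these two terms analytically gives Δ*² = −96μ/(37φ'(0⁺)n). A numerical sketch with assumed values for the exponential correlation φ(t) = σ²e^{−|t|/τ}, for which φ'(0⁺) = −σ²/τ:

```python
import numpy as np

mu, n = 30.0, 100
sigma, tau = 10.0, 0.1
phi1 = -sigma**2 / tau                  # phi'(0+) for the exponential correlation

# Leading terms of the expansion of the Line-PSTH cost (constant -phi(0) dropped)
delta = np.linspace(1e-4, 0.2, 20000)
cost = 2 * mu / (3 * n * delta) - (37.0 / 144.0) * phi1 * delta
delta_star_grid = delta[np.argmin(cost)]

delta_star_closed = np.sqrt(-96 * mu / (37 * phi1 * n))
```

The grid minimizer agrees with the closed-form expression to the grid resolution.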
The optimal bin size Δ* is obtained from dC_n(Δ)/dΔ = 0. For a rate process that fluctuates smoothly in time, the correlation function is a smooth function, resulting in φ'(0) = 0 and φ'''(0) = 0 due to the symmetry φ(t) = φ(−t), and we obtain the scaling relation

\Delta^* \sim \left( \frac{1280\,\mu}{181\,\phi''''(0)\, n} \right)^{1/5}.   (A.12)
The exponent −1/5 for a Line-PSTH is different from the exponent −1/3 for a Bar-PSTH (see equation 2.28). If the correlation of the rate fluctuation has a cusp at t = 0 (φ'(0⁺) < 0), we obtain the scaling relation

\Delta^* \sim \left( -\frac{96\,\mu}{37\,\phi'(0^+)\, n} \right)^{1/2},   (A.13)
by ignoring the higher-order terms. The exponent −1/2 for a Line-PSTH is the same as the exponent for a Bar-PSTH (see equation 2.29).

Acknowledgments

We thank Shinsuke Koyama, Takeaki Shimokawa, and Kensuke Arai for helpful discussions. We also acknowledge K. H. Britten, M. N. Shadlen, W. T. Newsome, and J. A. Movshon, who have made their data available to the public. This study is supported in part by Grants-in-Aid for Scientific Research to S.S. from the Ministry of Education, Culture, Sports, Science and Technology of Japan (16300068, 18020015) and the 21st Century COE Center for Diversity and Universality in Physics. H.S. is supported by the Research Fellowship of the Japan Society for the Promotion of Science for Young Scientists.

References

Abeles, M. (1982). Quantification, smoothing, and confidence limits for single-units' histograms. Journal of Neuroscience Methods, 5(4), 317–325.
Adrian, E. (1928). The basis of sensation: The action of the sense organs. New York: Norton.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Computation, 8(6), 1185–1202.
Blahut, R. (1987). Principles and practice of information theory. Reading, MA: Addison-Wesley.
Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (1992). The analysis of visual motion: A comparison of neuronal and psychophysical performance. Journal of Neuroscience, 12(12), 4745–4765.
Britten, K. H., Shadlen, M. N., Newsome, W. T., & Movshon, J. A. (2004). Responses of single neurons in macaque MT/V5 as a function of motion coherence in stochastic dot stimuli. Neural Signal Archive, nsa2004.1. Available online at http://www.neuralsignal.org.
Brockwell, A. E., Rojas, A. L., & Kass, R. E. (2004). Recursive Bayesian decoding of motor cortical signals by particle filtering. Journal of Neurophysiology, 91(4), 1899–1907.
Brown, E. N., Kass, R. E., & Mitra, P. P. (2004). Multiple neural spike train data analysis: State-of-the-art and future challenges. Nature Neuroscience, 7(5), 456–461.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Daley, D., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer-Verlag.
Dayan, P., & Abbott, L. (2001). Theoretical neuroscience: Computational and mathematical modeling of neural systems. Cambridge, MA: MIT Press.
DiMatteo, I., Genovese, C. R., & Kass, R. E. (2001). Bayesian curve-fitting with free-knot splines. Biometrika, 88(4), 1055–1071.
Gardiner, C. (1985). Handbook of stochastic methods: For physics, chemistry and the natural sciences. Berlin: Springer-Verlag.
Gerstein, G. L., & Kiang, N. Y. S. (1960). An approach to the quantitative analysis of electrophysiological data from single neurons. Biophysical Journal, 1(1), 15–28.
Gerstein, G. L., & Perkel, D. H. (1969). Simultaneously recorded trains of action potentials: Analysis and functional interpretation. Science, 164(3881), 828–830.
Hall, P. (1990). Akaike's information criterion and Kullback-Leibler loss for histogram density estimation. Probability Theory and Related Fields, 85(4), 449–467.
Johnson, D. H. (1996). Point process models of single-neuron discharges. Journal of Computational Neuroscience, 3(4), 275–299.
Kanazawa, Y. (1993). Hellinger distance and Akaike information criterion for the histogram. Statistics and Probability Letters, 17(4), 293–298.
Kass, R. E., Ventura, V., & Brown, E. N. (2005). Statistical issues in the analysis of neuronal data. Journal of Neurophysiology, 94(1), 8–25.
Kass, R. E., Ventura, V., & Cai, C. (2003). Statistical smoothing of neuronal data. Network: Computation in Neural Systems, 14(1), 5–15.
Koyama, S., & Shinomoto, S. (2004). Histogram bin width selection for time-dependent Poisson processes. Journal of Physics A: Mathematical and General, 37(29), 7255–7265.
Kubo, R., Toda, M., & Hashitsume, N. (1985). Nonequilibrium statistical mechanics. Berlin: Springer-Verlag.
Newsome, W. T., Britten, K. H., & Movshon, J. A. (1989). Neuronal correlates of a perceptual decision. Nature, 341(6237), 52–54.
Révész, P. (1968). The laws of large numbers. New York: Academic Press.
Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9(2), 65–78.
Sanger, T. D. (2002). Decoding neural spike trains: Calculating the probability that a spike train and an external signal are related. Journal of Neurophysiology, 87(3), 1659–1663.
Scott, D. W. (1979). Optimal and data-based histograms. Biometrika, 66(3), 605–610.
Shinomoto, S., Miyazaki, Y., Tamura, H., & Fujita, I. (2005). Regional and laminar differences in in vivo firing patterns of primate cortical neurons. Journal of Neurophysiology, 94(1), 567–575.
Shinomoto, S., Shima, K., & Tanji, J. (2003). Differences in spiking patterns among cortical neurons. Neural Computation, 15(12), 2823–2842.
Snyder, D. (1975). Random point processes. New York: Wiley.
Wiener, M. C., & Richmond, B. J. (2002). Model based decoding of spike trains. Biosystems, 67(1–3), 295–300.
Received November 17, 2005; accepted September 27, 2006.
LETTER
Communicated by Kiri Wagstaff
Penalized Probabilistic Clustering

Zhengdong Lu
[email protected]
Todd K. Leen
[email protected]
Department of Computer Science and Engineering, OGI School of Science and Engineering, Oregon Health and Science University, Beaverton, OR 97006, U.S.A.
While clustering is usually an unsupervised operation, there are circumstances in which we believe (with varying degrees of certainty) that items A and B should be assigned to the same cluster, while items A and C should not. We would like such pairwise relations to influence cluster assignments of out-of-sample data in a manner consistent with the prior knowledge expressed in the training set. Our starting point is probabilistic clustering based on gaussian mixture models (GMM) of the data distribution. We express clustering preferences in a prior distribution over assignments of data points to clusters. This prior penalizes cluster assignments according to the degree to which they violate the preferences. The model parameters are fit with the expectation-maximization (EM) algorithm. Our model provides a flexible framework that encompasses several other semisupervised clustering models as special cases. Experiments on artificial and real-world problems show that our model can consistently improve clustering results when pairwise relations are incorporated. The experiments also demonstrate the superiority of our model to other semisupervised clustering methods in handling noisy pairwise relations.

1 Introduction

While clustering is usually executed completely unsupervised, there are circumstances in which we have prior belief (with varying degrees of certainty) that pairs of samples should (or should not) be assigned to the same cluster. These pairwise relations are less informative than direct labeling of the samples, but are often considerably easier to obtain. Indeed, there are many occasions when the pairwise relations can be directly derived from expert knowledge or common sense. Our interest in such problems was kindled when we tried to manually segment a satellite image by grouping small image clips from the image.
Penalized Probabilistic Clustering
Neural Computation 19, 1528–1567 (2007)
© 2007 Massachusetts Institute of Technology

It is often hard to assign the image clips to different “groups” since we do not clearly know the characteristics of each group or even how many classes
we should have. In contrast, it is much easier to compare two image clips and decide how much they look alike and thus how likely they are to be in one cluster. Another interesting example is word sense disambiguation. Ambiguous words like plant tend to exhibit only one meaning in one discourse (Yarowsky, 1995). In other words, two plant's in the same discourse probably should be assigned to the same class of sense. This fact is very useful in training unsupervised word sense disambiguation models. The third example is in information retrieval. Cohn, Caruana, and McCallum (2003) suggested that in creating a document taxonomy, the expert critique is often in the form “these two documents shouldn't be in the same cluster.” The last example is continuity: the fact that neighboring pairs of samples in a time series or in an image are likely to belong to the same class of object is also a source of clustering preferences (Theiler & Gisler, 1997; Ambroise, Dang, & Govaert, 1997). We would like these preferences to be incorporated into the cluster structure so that the assignment of out-of-sample data to clusters captures the concepts that give rise to the preferences expressed in the training data. Some work has been done on adapting traditional clustering methods, such as K-means, to incorporate pairwise relations (Wagstaff, Cardie, Rogers, & Schroedl, 2001; Basu, Banerjee, & Mooney, 2002; Klein, Kamvar, & Manning, 2002). These models are based on hard clustering, and the clustering preferences are expressed as hard pairwise constraints that must be satisfied. Wagstaff (2002) and Basu et al. (2004) extended their models to deal with soft pairwise constraints, where each constraint is assigned a weight. The performance of those constrained K-means algorithms is often not satisfactory, largely due to the inability of K-means to model a nonspherical data distribution within each class.
Shental, Bar-Hillel, Hertz, and Weinshall (2003) proposed a gaussian mixture model (GMM) for clustering that incorporates hard pairwise constraints. However, the model cannot be naturally generalized to soft constraints, which are appropriate when our knowledge consists only of clustering preferences or carries significant uncertainty. Motivated in part to remedy this deficiency, Law, Topchy, and Jain (2004, 2005) proposed another GMM-based model to incorporate soft constraints. In their model, virtual groups are created for samples that are supposed to be in one class. The uncertainty in the pairwise relations is then expressed as the soft membership of samples in the virtual group. This modeling strategy is cumbersome when samples are shared by different virtual groups. Moreover, it cannot handle the prior knowledge that two samples are in different clusters. Other efforts to make use of the pairwise relations include changing the metric in feature space in favor of the specified relations (Cohn et al., 2003; Xing, Ng, Jordan, & Russell, 2003) or combining metric learning with constrained clustering (Bilenko, Basu, & Mooney, 2004). This letter is a detailed exposition and extension of our previously reported work (Lu & Leen, 2005). We propose a soft clustering algorithm based
Z. Lu and T. Leen
on GMM that expresses clustering preferences (in the form of pairwise relations) in the prior probability on assignments of data points to clusters. Our algorithm naturally accommodates both hard constraints and soft preferences. In our framework, the preferences are expressed as a Bayesian prior probability that pairs of points should (or should not) be assigned to the same cluster. After training with the expectation-maximization (EM) algorithm, the information expressed as a prior on the cluster assignment of the training data is successfully encoded in the means, covariances, and cluster priors of the GMM. Hence, the model generalizes in a way consistent with the prior knowledge. We call the algorithm penalized probabilistic clustering (PPC). Experiments on artificial and real-world data sets demonstrate that PPC can consistently improve the clustering result by incorporating reliable prior knowledge.

The letter is organized as follows. In section 2, we introduce the model and give an artificial example for illustration. In section 3, we discuss the computational complexity and propose two approximation methods for cases where the computation is intractable. Section 4 gives a detailed analysis of the connection of PPC to several other semisupervised clustering models. In section 5, we present the experiments of PPC on a variety of problems, as well as the comparison of PPC to several other semisupervised clustering methods. Section 6 summarizes the letter and points out future research.

2 Model

Penalized probabilistic clustering (PPC) begins with a standard M-component GMM,

P(x|\Theta) = \sum_{k=1}^{M} \pi_k P(x|\theta_k),

with the parameter vector \Theta = (\pi_1, \ldots, \pi_M, \theta_1, \ldots, \theta_M). Here, \pi_k and \theta_k are the prior probability and parameters of the kth gaussian component, respectively. We augment the data set X = \{x_i\}, i = 1, \ldots, N with latent cluster assignments Z = \{z(x_i)\}, i = 1, \ldots, N to form the familiar complete data (X, Z). The complete data likelihood is

P(X, Z|\Theta) = P(X|Z, \Theta) P(Z|\Theta),    (2.1)

where P(X|Z, \Theta) is the probability of X conditioned on Z:

P(X|Z, \Theta) = \prod_{i=1}^{N} P(x_i|\theta_{z_i}).    (2.2)
2.1 Prior Distribution on Cluster Assignments. We incorporate our clustering preferences by manipulating the prior probability P(Z|\Theta). In the standard gaussian mixture model, the prior distribution on cluster assignments Z is trivial:

P(Z|\Theta) = \prod_{i=1}^{N} \pi_{z_i}.    (2.3)

We incorporate our clustering preferences through a weighting function g(Z) that has large values when the assignment of data points to clusters Z conforms to our preferences, and low values when Z conflicts with our preferences. Hence, we write

P(Z|\Theta, G) \equiv \frac{\left(\prod_i \pi_{z_i}\right) g(Z)}{\sum_Z \left(\prod_j \pi_{z_j}\right) g(Z)} = \frac{1}{\Omega} \prod_i \pi_{z_i} \, g(Z),    (2.4)
where \Omega = \sum_Z \left(\prod_j \pi_{z_j}\right) g(Z) is the normalization constant. The likelihood of the data, given a specific cluster assignment Z, is independent of the cluster assignment preferences, so the complete data likelihood is

P(X, Z|\Theta, G) = P(X|Z, \Theta) P(Z|\Theta, G).    (2.5)

From equations 2.1 through 2.5, the complete data likelihood is

P(X, Z|\Theta, G) = P(X|Z, \Theta) \frac{1}{\Omega} \prod_i \pi_{z_i} \, g(Z) = \frac{1}{\Omega} P(X, Z|\Theta) g(Z),    (2.6)

where P(X, Z|\Theta) is the complete data likelihood for a standard GMM. To distinguish the penalized likelihood from the standard likelihood, we introduce the notation P_s(\cdot) for the standard likelihood and P_p(\cdot) for the penalized likelihood. The data likelihood is the sum of the complete data likelihood over all possible Z, that is, L(X|\Theta) = P_p(X|\Theta, G) = \sum_Z P_p(X, Z|\Theta, G), which can be maximized with the EM algorithm. Once the model parameters are fit, we do soft clustering of new data according to the posterior probabilities P_s(k|x, \Theta). (Note that cluster assignment preferences are not expressed for the new data, only for the training data.)

2.2 Pairwise Relations. Pairwise relations provide a special case of the framework discussed above. We specify two types of pairwise relations:
- Link: two samples should be assigned to the same cluster.
- Do-not-link: two samples should be assigned to different clusters.
The weighting factor given to the cluster assignment configuration Z is

g(Z) = \exp\left(\sum_{i \neq j} W^p_{ij} \, \delta(z_i, z_j)\right),    (2.7)

where \delta is the Kronecker delta function and W^p_{ij} is the weight associated with the sample pair (x_i, x_j). This weight satisfies

W^p_{ij} \in (-\infty, \infty), \quad W^p_{ij} = W^p_{ji}.

The weight W^p_{ij} reflects our preference for assigning x_i and x_j to one cluster. We use positive W^p_{ij} when we prefer to assign x_i and x_j to one cluster (link) and negative W^p_{ij} when we prefer to assign them to different clusters (do-not-link). The absolute value |W^p_{ij}| reflects the strength of the preference. The prior probability with the pairwise relations is

P(Z|\Theta, G) = \frac{1}{\Omega} \prod_i \pi_{z_i} \exp\left(\sum_{i \neq j} W^p_{ij} \, \delta(z_i, z_j)\right).    (2.8)
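To make equations 2.4, 2.7, and 2.8 concrete, here is a minimal Python sketch (our illustration, not the authors' code) that evaluates the weighting function g(Z) and the normalized penalized prior by brute-force enumeration of all M^N assignments; it is feasible only for very small N.

```python
import itertools
import math

def g(Z, W):
    """Weighting factor of eq. 2.7: exp of the sum over ordered pairs i != j
    of W[i][j] * delta(z_i, z_j); a symmetric W is counted twice."""
    n = len(Z)
    s = sum(W[i][j] for i in range(n) for j in range(n)
            if i != j and Z[i] == Z[j])
    return math.exp(s)

def penalized_prior(Z, pi, W):
    """P(Z | Theta, G) of eq. 2.4, with the normalization Omega computed by
    enumerating every assignment (tiny N only)."""
    M, n = len(pi), len(Z)

    def unnorm(assign):
        p = 1.0
        for z in assign:
            p *= pi[z]          # product of component priors, eq. 2.3
        return p * g(assign, W)

    omega = sum(unnorm(a) for a in itertools.product(range(M), repeat=n))
    return unnorm(tuple(Z)) / omega
```

With W = 0 the prior reduces to the product of the \pi_{z_i} (equation 2.3); a positive link weight raises the prior probability of same-cluster assignments at the expense of the others.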
It appears that the g(Z) in equation 2.7 changes asymmetrically under link and do-not-link violations: when a link is satisfied, g(Z) increases; when a do-not-link is violated, g(Z) decreases. Nevertheless, the prior probability given by equation 2.8 decreases under both types of violation. The PPC model is clearly connected to the standard GMM and to the constrained clustering model proposed by Shental et al. (2003). We shall show that both models can be viewed as special cases of PPC with particular W^p. The connection between PPC and other semisupervised clustering models is less straightforward and will be discussed in section 4. If W^p_{ij} = 0, we have no prior knowledge on the assignment relevancy of x_i and x_j. When W^p_{ij} = 0 for all pairs (i, j), we have g(Z) = 1; hence, the complete data likelihood reduces to the standard one:

P_p(X, Z|\Theta, G) = \frac{1}{\Omega} P_s(X, Z|\Theta) g(Z) = P_s(X, Z|\Theta).    (2.9)
In the other extreme, with |W^p_{ij}| \to \infty, assignments Z that violate the pairwise relation between x_i and x_j have zero prior probability, since for those assignments

P_p(Z|\Theta, G) = \frac{\prod_k \pi_{z_k} \prod_{i \neq j} \exp\left(W^p_{ij} \, \delta(z_i, z_j)\right)}{\sum_Z \prod_l \pi_{z_l} \prod_{m \neq n} \exp\left(W^p_{mn} \, \delta(z_m, z_n)\right)} \to 0.

Such relations become hard constraints, while relations with |W^p_{ij}| < \infty are called soft preferences. When all the specified pairwise relations are hard constraints, the complete data likelihood becomes

P_p(X, Z|\Theta, G) = \frac{1}{\Omega} \prod_{(i,j) \in \mathcal{L}} \delta(z_i, z_j) \prod_{(i,j) \in \mathcal{N}} \left(1 - \delta(z_i, z_j)\right) \prod_{i=1}^{N} \pi_{z_i} P_s(x_i|\theta_{z_i}),    (2.10)

where \mathcal{L} is the set of linked sample pairs and \mathcal{N} is the set of do-not-link sample pairs. It is straightforward to verify that equation 2.10 is essentially the same as the complete data likelihood given by Shental et al. (2003). Therefore, the model proposed by Shental et al. (2003) is equivalent to PPC with hard constraints. In appendix A, we give a detailed derivation of equation 2.10 and of the equivalence of the two models. When only hard constraints are available, we simply implement PPC based on equation 2.10. In the remainder of this letter, we use W^p to denote the prior knowledge on pairwise relations, that is,

P_p(X, Z|\Theta, G) \equiv P_p(X, Z|\Theta, W^p) = \frac{1}{\Omega} \exp\left(\sum_{i \neq j} W^p_{ij} \, \delta(z_i, z_j)\right) P_s(X, Z|\Theta).    (2.11)
2.3 Model Fitting. We use the EM algorithm (Dempster, Laird, & Rubin, 1977) to fit the model parameters \Theta:

\Theta^* = \arg\max_\Theta L(X|\Theta, W^p).

The expectation step (E-step) and maximization step (M-step) are

E-step: Q(\Theta, \Theta^{(t-1)}) = E_{Z|X}\left(\log P_p(X, Z|\Theta, W^p) \mid X, \Theta^{(t-1)}, W^p\right)
M-step: \Theta^{(t)} = \arg\max_\Theta Q(\Theta, \Theta^{(t-1)}).

In the M-step, the optimal mean and covariance matrix of each component are

\mu_k = \frac{\sum_{j=1}^{N} x_j \, P_p(k|x_j, \Theta^{(t-1)}, W^p)}{\sum_{j=1}^{N} P_p(k|x_j, \Theta^{(t-1)}, W^p)}

\Sigma_k = \frac{\sum_{j=1}^{N} P_p(k|x_j, \Theta^{(t-1)}, W^p)\,(x_j - \mu_k)(x_j - \mu_k)^T}{\sum_{j=1}^{N} P_p(k|x_j, \Theta^{(t-1)}, W^p)}.
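These updates are ordinary responsibility-weighted averages; only the responsibilities P_p(k|x_j, \cdot) differ from the standard GMM M-step. A NumPy sketch (function name and array shapes are our own assumptions):

```python
import numpy as np

def m_step_gaussians(X, R):
    """M-step mean and covariance updates of section 2.3.

    X: (N, d) data matrix; R: (N, M) responsibilities P_p(k | x_j, Theta, W^p).
    Returns mu of shape (M, d) and cov of shape (M, d, d)."""
    M = R.shape[1]
    d = X.shape[1]
    Nk = R.sum(axis=0)                 # effective count per component
    mu = (R.T @ X) / Nk[:, None]       # responsibility-weighted means
    cov = np.empty((M, d, d))
    for k in range(M):
        D = X - mu[k]                  # data centered on component k
        cov[k] = (R[:, k, None] * D).T @ D / Nk[k]
    return mu, cov
```

Feeding in standard-GMM responsibilities recovers the usual EM updates; the PPC-specific work is entirely in how R is computed (section 3).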
The update of the prior probability of each component is more difficult due to the normalizing constant in the data likelihood,

\Omega = \sum_Z \prod_{k=1}^{N} \pi_{z_k} \exp\left(\sum_{i \neq j} W^p_{ij} \, \delta(z_i, z_j)\right).    (2.12)

We need to find

\pi^* \equiv \{\pi_1, \ldots, \pi_M\} = \arg\max_\pi \sum_{l=1}^{M} \sum_{i=1}^{N} \log \pi_l \, P_p(l|x_i, \Theta^{(t-1)}, W^p) - \log \Omega(\pi),    (2.13)

which, unfortunately, does not have a closed-form solution in general.¹ In this letter, we use a rather crude approximation of the optimal \pi instead. First, we estimate the values of \log \Omega(\pi) on a grid H = \{\hat{\pi}^n\} on the simplex defined by

\sum_{k=1}^{M} \pi_k = 1, \quad \pi_k \geq 0.

Then in each M-step, we calculate the value of \sum_{l=1}^{M} \sum_{i=1}^{N} \log \hat{\pi}^n_l \, P_p(l|x_i, \Theta^{(t-1)}, W^p) - \log \Omega(\hat{\pi}^n) for each node \hat{\pi}^n \in H and find the node \hat{\pi}^* that maximizes the function defined in equation 2.13:

\hat{\pi}^* = \arg\max_{\hat{\pi}^n \in H} \sum_{l=1}^{M} \sum_{i=1}^{N} \log \hat{\pi}^n_l \, P_p(l|x_i, \Theta^{(t-1)}, W^p) - \log \Omega(\hat{\pi}^n).    (2.14)
We use \hat{\pi}^* as the approximate solution of equation 2.13. In this letter, the resolution of the grid is set to 0.01. Although this works very well for all the experiments in this letter, we note that the search over the grid will be fairly slow for M > 5. Shental, Bar-Hillel, Hertz, and Weinshall (2004) proposed finding the optimal \pi using gradient descent and approximating \Omega(\pi) by pretending that all specified relations are nonoverlapping (see section 3.1). Although this method was originally designed for hard constraints, it can be easily adapted for PPC. This will not be covered in this letter.
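The grid approximation of equation 2.14 can be sketched as follows. The helper names are hypothetical, and log_Omega is supplied by the caller (for example, precomputed on the grid, or by brute-force enumeration on toy problems):

```python
import itertools
import math
import numpy as np

def simplex_grid(M, step):
    """All nodes of the probability simplex whose coordinates are multiples
    of `step` (stars-and-bars enumeration of compositions)."""
    n = round(1.0 / step)
    for cuts in itertools.combinations(range(n + M - 1), M - 1):
        parts, prev = [], -1
        for c in cuts:
            parts.append(c - prev - 1)
            prev = c
        parts.append(n + M - 2 - prev)
        yield [p * step for p in parts]

def grid_search_pi(R, log_Omega, step=0.25):
    """Approximate M-step for pi (eq. 2.14): evaluate the penalized objective
    at every interior grid node and keep the best.

    R: (N, M) responsibilities; log_Omega: callable pi -> log Omega(pi)."""
    best_pi, best_val = None, -math.inf
    for pi in simplex_grid(R.shape[1], step):
        if min(pi) <= 0.0:
            continue  # skip boundary nodes to avoid log(0)
        val = float(np.sum(R * np.log(pi))) - log_Omega(pi)
        if val > best_val:
            best_pi, best_val = pi, val
    return best_pi
```

With log_Omega identically zero this reduces to the standard GMM update \pi_l \propto \sum_i R_{il}, up to the grid resolution, which is a convenient sanity check.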
¹ Shental et al. (2003) pointed out that with a different sampling assumption, a closed-form solution for equation 2.13 exists when only hard links are available. See section 4.
Figure 1: The influence of constraint weight on model fitting. (a) Artificial data set. (b) Links (solid lines) and do-not-links (dashed line). (c, d) The probability density contour of two possible fitted models.
It is critically important to note that with a nontrivial W^p, the assignment independence is broken,

P_p(z_i, z_j|x_i, x_j, \Theta, W^p) \neq P_p(z_i|x_i, \Theta, W^p) \, P_p(z_j|x_j, \Theta, W^p),

which means that the posterior estimation of each sample cannot be done separately. This fact brings an extra computational problem and will be discussed in section 3.

2.4 Selecting the Constraint Weights

2.4.1 Example: How the Weight W^p_{ij} Affects Clustering. The weight matrix W^p is crucial to the performance of PPC. Here we give an example demonstrating how the weight of pairwise relations affects the clustering process. Figure 1a shows two-dimensional data from two classes, as indicated by the symbols. Besides the data set, we also have 20 pairs correctly labeled as links and do-not-links, as shown in Figure 1b. We try to fit the data set with a two-component GMM. Figures 1c and 1d give the density contours of the two possible models on the data. Without any pairwise relations specified, we have essentially an equal chance of getting each. After incorporating pairwise relations, the EM optimization process is biased to
the correct one. The weights of the pairwise relations are given as follows:

W^p_{ij} = \begin{cases} w & \text{if } (x_i, x_j) \text{ is linked} \\ -w & \text{if } (x_i, x_j) \text{ is do-not-linked} \\ 0 & \text{otherwise,} \end{cases}

where w \geq 0 measures the certainty of all specified pairwise relations. In Figure 2, we give three runs with the same initial model parameters but different weights given to the specified pairwise relations. For each run, we give snapshots of the model after 1, 3, 5, and 20 EM iterations.

Figure 2: The contour of the probability density fit on the data with different weights given to the pairwise relations. Top row: w = 0; middle row: w = 1.3; bottom row: w = 3.

The first row is the run with w = 0 (the standard GMM). The search ends up with a model that violates our prior knowledge of class membership. The middle row is the run with w set to 1.3. With the same poor initial condition, the model fitting process still converges to the wrong model, although at a slower pace. In the bottom row, we increase w to 3. This time, the model converges to the one we intend.

2.4.2 Choosing the Weight W^p Based on Prior Knowledge. On some occasions, we can translate our prior belief in the relations into the weight
W^p. Here we assume that the pairwise relations are labeled by an oracle but contaminated by flipping noise before they are delivered to us. For each labeled pair (x_i, x_j), there is thus a certainty value 0.5 \leq \gamma_{ij} \leq 1 equal to the probability that the pairwise relation was not flipped, that is, that the label is correct.² Our prior knowledge then includes the specified pairwise relations and their certainty values \Gamma = \{\gamma_{ij}\}. This prior knowledge can be approximately encoded into the weights W^p by letting

W^p_{ij} = \begin{cases} \frac{1}{2} \log\left(\frac{\gamma_{ij}}{1 - \gamma_{ij}}\right) & (x_i, x_j) \text{ is specified as linked} \\ -\frac{1}{2} \log\left(\frac{\gamma_{ij}}{1 - \gamma_{ij}}\right) & (x_i, x_j) \text{ is specified as do-not-linked} \\ 0 & \text{otherwise.} \end{cases}    (2.15)
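Equation 2.15 is a one-line log-odds map; here is a sketch (function name hypothetical). A certainty of \gamma = 0.5 carries no information and maps to weight 0, while \gamma \to 1 approaches a hard constraint.

```python
import math

def weight_from_certainty(gamma, linked):
    """Eq. 2.15: map a certainty value 0.5 <= gamma < 1 to a pairwise weight.
    `linked` is True for a link, False for a do-not-link."""
    w = 0.5 * math.log(gamma / (1.0 - gamma))  # half the log-odds of the label being correct
    return w if linked else -w
```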
The details of the derivation are in appendix B. It is obvious from equation 2.15 that for a specified pairwise relation (x_i, x_j), the greater the certainty value \gamma_{ij}, the greater the absolute value of the weight W^p_{ij}. Note that the weight designed this way is not necessarily optimal in terms of classification accuracy, as will be demonstrated by experiment in section 5.1. The reason is twofold. First, equation 2.15 is derived from a (possibly crude) approximation. Second, gaussian mixture models as classifiers are often considerably biased from the true class distribution of the data. As a result, even if the PPC prior P(Z|\Theta, W^p) faithfully reflects the truth, it does not necessarily lead to the best classification accuracy. Nevertheless, equation 2.15 gives good initial guidance for choosing the weight. Our experiments in section 5.1 show that this design often yields classification accuracy superior to simply using hard constraints or ignoring the pairwise relations (the standard GMM). One use of this weighting scheme is when pairwise relations are labeled by domain experts and the certainty values are given at the same time. We might also estimate the flipping noise parameters from historical data or available statistics. For example, we can derive soft pairwise relations based on spatial or temporal continuity among samples. That is, we add soft links to all adjacent pairs of samples, with the flipping noise accounting for the adjacent pairs that are actually not in one class. We further assume that the flipping noise of each pair follows the same distribution. Accordingly, we assign a uniform weight w > 0 to all adjacent pairs. Let q denote the probability that the label on an adjacent pair is flipped. We might be able to estimate q from labeled instances of a similar problem, for example, segmented images or time series. The maximum likelihood (ML) estimation
² We consider only certainty values \gamma_{ij} > 0.5, because a pairwise relation with certainty \gamma_{ij} < 0.5 can be equivalently treated as its opposite relation with certainty 1 - \gamma_{ij}.
of q is given by simple statistics:

\tilde{q} = \frac{\text{Number of adjacent pairs that are not in the same class}}{\text{Number of all adjacent pairs}}.
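This ratio is trivial to compute; a sketch with a hypothetical function name:

```python
def estimate_flip_rate(labels, adjacent_pairs):
    """ML estimate of the flip probability q: the fraction of adjacent pairs
    whose members actually belong to different classes."""
    flipped = sum(labels[i] != labels[j] for i, j in adjacent_pairs)
    return flipped / len(adjacent_pairs)
```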
We give an application of this idea in section 5.2.

3 Computing the Cluster Posterior

The M-step requires the cluster membership posterior. Computing this posterior is simple for the standard GMM, since each data point x_i can be assigned to a cluster independent of the other data points, and we have the familiar cluster origin posterior P_s(z_i = k|x_i, \Theta). The pairwise constraints bring extra relevancy in assignment among the samples involved. From equation 2.11, if W^p_{ij} \neq 0, then P_p(z_i, z_j|x_i, x_j, \Theta, W^p) \neq P_p(z_i|x_i, \Theta, W^p) \, P_p(z_j|x_j, \Theta, W^p). Consequently, the posterior probabilities of x_i and x_j cannot be estimated separately. This relevancy in assignment can be formalized as follows:
Definition. If W^p_{ij} \neq 0, we say there is direct assignment relevancy between x_i and x_j, denoted by x_i R_d x_j. If P_p(z_i, z_j|x_i, x_j, \Theta, W^p) \neq P_p(z_i|x_i, \Theta, W^p) \, P_p(z_j|x_j, \Theta, W^p), we say there is assignment relevancy between x_i and x_j, denoted by x_i R_a x_j.

It is clear that R_a is reflexive, symmetric, and transitive. Hence, R_a is an equivalence relation. It can be shown that R_a is the transitive closure of R_d. In other words, two samples have the assignment relevancy relation R_a if they can be connected by a path consisting of R_d relations, as illustrated in Figure 3. We call each equivalence class associated with R_a a clique. It is clear that cliques are the smallest sets of samples whose posterior probabilities can be calculated independently. When calculating posterior probabilities, all samples within a clique need to be considered together. In a clique T with size |T|, the posterior probability of a given sample x_i \in T is calculated by marginalizing the posterior over the entire clique,

P_p(z_i = k|X, \Theta, W^p) = \sum_{Z_T : z_i = k} P_p(Z_T|X_T, \Theta, W^p),

with the posterior on the clique given by

P_p(Z_T|X_T, \Theta, W^p) = \frac{P_p(Z_T, X_T|\Theta, W^p)}{P_p(X_T|\Theta, W^p)} = \frac{P_p(Z_T, X_T|\Theta, W^p)}{\sum_{Z_T} P_p(Z_T, X_T|\Theta, W^p)}.
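The clique marginalization can be sketched by explicit enumeration of the M^{|T|} joint assignments. This is our illustration, not the authors' implementation; log_joint[j][k] stands for log(\pi_k P_s(x_j|\theta_k)) for the jth sample of the clique, and W is the weight submatrix restricted to the clique.

```python
import itertools
import math

def clique_posterior(i, log_joint, W):
    """Exact posterior P_p(z_i = k | X_T, Theta, W^p) for sample i of a clique T,
    by summing the joint over all M^{|T|} assignments (section 3)."""
    T = len(log_joint)
    M = len(log_joint[0])
    weight = {}  # joint assignment -> unnormalized probability
    for Z in itertools.product(range(M), repeat=T):
        s = sum(log_joint[j][Z[j]] for j in range(T))
        # pairwise term of eq. 2.7, over ordered pairs within the clique
        s += sum(W[a][b] for a in range(T) for b in range(T)
                 if a != b and Z[a] == Z[b])
        weight[Z] = math.exp(s)
    total = sum(weight.values())
    return [sum(v for Z, v in weight.items() if Z[i] == k) / total
            for k in range(M)]
```

The exponential cost in |T| is exactly why the paper restricts attention to small cliques or falls back on the approximations of sections 3.2 and 3.3.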
Figure 3: (a) Links (solid line) and do-not-links (dotted line) among six samples. (b) Direct assignment relevancy Rd (solid line) translated from links in a. (c) Equivalence classes defined by assignment relevancy Ra , denoted by shading.
Figure 4: (a) Overlapping pairwise relations, with links (solid line) and do-not-links (dotted line). (b) Nonoverlapping pairwise relations. (c) Only hard links.
Exact calculation of the posterior probability of a sample in a clique T requires time complexity O(M^{|T|}), where M is the number of components in the mixture model. This calculation can get prohibitively expensive if |T| is large (e.g., 50) for any model size M \geq 2. Hence, small cliques are required to make the marginalization computationally reasonable.

3.1 Two Special Cases with Easy Inference. The inference is clearly easy when we limit ourselves to small cliques. Specifically, when |T| \leq 2, the pairwise relations are nonoverlapping, as illustrated in Figures 4a and 4b. With nonoverlapping constraints, the posterior probability for the whole data set can be given in closed form with O(N) time complexity. Moreover, the evaluation of the normalization factor \Omega(\pi) is simple:

\Omega(\pi) = \left(\sum_{k=1}^{M} \pi_k^2\right)^{N_L} \left(1 - \sum_{k=1}^{M} \pi_k^2\right)^{N_N},

where N_L and N_N are, respectively, the number of links and do-not-links. The optimization of \pi in the M-step can thus be achieved with little cost. Sometimes nonoverlapping relations are a natural choice: they can be generated by picking sample pairs from the data set and labeling the relations without replacement. More generally, we can avoid the expensive
computation in posterior inference by breaking large cliques into small ones. To do this, we need to deliberately ignore some links or do-not-links. In section 5.2, experiment 3 is an application of this idea.

The second simplifying situation is when we have only hard links (W^p_{ij} = +\infty or 0), as illustrated in Figure 4c. In this case, the posterior probability for each sample must be exactly the same as for the others in the same clique, so a clique can be treated as a single sample. That is, assume x_i is in clique T. We then have

P_p(z_i = k|x_i, \Theta, W^p) = P_p(Z_T = k|x_T, \Theta, W^p)
= \frac{P_p(x_T, Z_T = k|\Theta, W^p)}{\sum_{k'} P_p(x_T, Z_T = k'|\Theta, W^p)}
= \frac{P_s(x_T, Z_T = k|\Theta)}{\sum_{k'} P_s(x_T, Z_T = k'|\Theta)}
= \frac{\prod_{j \in T} \pi_k P_s(x_j|\theta_k)}{\sum_{k'} \left(\prod_{j \in T} \pi_{k'} P_s(x_j|\theta_{k'})\right)}.

Similar ideas have been proposed independently by Wagstaff et al. (2001), Shental et al. (2003), and Bilenko et al. (2004). This case is useful when we are sure that a group of samples comes from one source (Shental et al., 2003). For more general cases, where exact inference is computationally prohibitive, we propose to use Gibbs sampling (Neal, 1993) and the mean field approximation (Jaakkola, 2004) to estimate the posterior probability. These are discussed in sections 3.2 and 3.3.

3.2 Estimation with Gibbs Sampling. For fixed \Theta, finding P_p(Z|\Theta, W^p) is a typical inference problem for graphical models. Techniques for approximate inference developed for graphical models can also be used here. In this section, we use Gibbs sampling to estimate the posterior probability in each EM iteration. In Gibbs sampling, we estimate P_p(z_i|X, \Theta, W^p) as a sample mean,

P_p(z_i = k|X, \Theta, W^p) = E(\delta(z_i, k)|X, \Theta, W^p) \approx \frac{1}{S} \sum_{t=1}^{S} \delta\left(z_i^{(t)}, k\right),

where the sum is over a sequence of S samples from P(Z|X, \Theta, G) generated by the Gibbs MCMC. The tth sample in the sequence is generated by the usual Gibbs sampling technique:
- Pick z_1^{(t)} from the distribution P_p(z_1|z_2^{(t-1)}, z_3^{(t-1)}, \ldots, z_N^{(t-1)}, X, W^p, \Theta).
- Pick z_2^{(t)} from the distribution P_p(z_2|z_1^{(t)}, z_3^{(t-1)}, \ldots, z_N^{(t-1)}, X, W^p, \Theta).
- \ldots
- Pick z_N^{(t)} from the distribution P_p(z_N|z_1^{(t)}, z_2^{(t)}, \ldots, z_{N-1}^{(t)}, X, W^p, \Theta).
For pairwise relations it is helpful to introduce some notation. Let Z_{-i} denote an assignment of data points to clusters that leaves out the assignment of x_i. Let U(i) be the indices of the set of samples that participate in a pairwise relation with sample x_i, U(i) = \{j : W^p_{ij} \neq 0\}. Then we have

P_p(z_i|Z_{-i}, X, \Theta, W^p) \propto P_s(x_i, z_i|\Theta) \prod_{j \in U(i)} \exp\left(2 W^p_{ij} \, \delta(z_i, z_j)\right).    (3.1)
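One Gibbs pass over equation 3.1 can be sketched as follows (our illustration; the factor 2 appears because equation 2.7 sums over ordered pairs, so a symmetric W^p_{ij} contributes twice):

```python
import math
import random

def gibbs_sweep(Z, log_unary, W, rng):
    """One Gibbs pass (section 3.2): resample each z_i in turn from eq. 3.1,
    P_p(z_i | Z_-i) proportional to P_s(x_i, z_i) * exp(2 * sum_j W_ij * delta(z_i, z_j)).

    Z: current assignment list, modified in place;
    log_unary[i][k] = log P_s(x_i, z_i = k); W: symmetric weight matrix."""
    N, M = len(Z), len(log_unary[0])
    for i in range(N):
        logits = [log_unary[i][k]
                  + 2.0 * sum(W[i][j] for j in range(N) if j != i and Z[j] == k)
                  for k in range(M)]
        mx = max(logits)                              # for numerical stability
        weights = [math.exp(v - mx) for v in logits]
        Z[i] = rng.choices(range(M), weights=weights)[0]
    return Z
```

Averaging indicator functions of Z over many sweeps (after burn-in) gives the Monte Carlo estimate of P_p(z_i = k|X, \Theta, W^p) above.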
The time complexity of each Gibbs sampling pass is O(NnM), where n is the maximum number of pairwise relations a sample can be involved in. When W^p is sparse, the size of U(i) is small. Thus, calculating P_p(z_i|Z_{-i}, X, \Theta, W^p) is fairly cheap, and Gibbs sampling can effectively estimate the posterior probability.

3.3 Estimation with Mean Field Approximation. Another approach to posterior estimation is to use mean field theory (Jaakkola, 2004; Lange, Law, Jain, & Buhmann, 2005). Instead of directly evaluating the intractable P_p(Z|X, \Theta, W^p), we try to find a tractable mean field approximation Q(Z). To find a Q(Z) close to the true posterior probability P_p(Z|X, \Theta, W^p), we minimize the Kullback-Leibler divergence between them,

\min_Q \mathrm{KL}\left(Q(Z) \,\|\, P_p(Z|X, \Theta, W^p)\right),    (3.2)

which can be recast as

\max_Q \left[H(Q) + E_Q\{\log P_p(Z|X, \Theta, W^p)\}\right],    (3.3)
where E_Q\{\cdot\} denotes the expectation with respect to Q. The simplest family of variational distributions is one in which all the latent variables \{z_i\} are independent of each other:

Q(Z) = \prod_{i=1}^{N} Q_i(z_i).    (3.4)
With this Q(Z), the optimization problem in equation 3.3 does not have a closed-form solution and is not a convex problem. Instead, a locally optimal Q can be found iteratively with the following update equations,

Q_i(z_i) \leftarrow \frac{1}{\Omega_i} \exp\left(E_Q\{\log P_p(Z|X, \Theta, W^p) \mid z_i\}\right),    (3.5)

for all i and z_i \in \{1, 2, \ldots, M\}. Here \Omega_i = \sum_{z_i} \exp\left(E_Q\{\log P_p(Z|X, \Theta, W^p) \mid z_i\}\right) is the local normalization constant. For the PPC model, we have

\exp\left(E_Q\{\log P_p(Z|X, \Theta, W^p) \mid z_i\}\right) = P_s(z_i|x_i, \Theta) \exp\left(\sum_{j \neq i} W^p_{ij} \, Q_j(z_i)\right).
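The fixed-point iteration of equation 3.5 can be sketched as follows (our illustration; post_unary stands for the standard-GMM posteriors P_s(k|x_i, \Theta), which also serve as the initialization):

```python
import math

def mean_field(post_unary, W, iters=50):
    """Iterate the mean field equation (3.5):
    Q_i(k) proportional to P_s(k|x_i, Theta) * exp(sum_{j != i} W_ij * Q_j(k)).

    post_unary: (N, M) nested lists of standard-GMM posteriors;
    W: (N, N) pairwise weight matrix. Returns the converged Q."""
    N, M = len(post_unary), len(post_unary[0])
    Q = [row[:] for row in post_unary]   # start Q at the unary posteriors
    for _ in range(iters):
        for i in range(N):
            new = []
            for k in range(M):
                field = sum(W[i][j] * Q[j][k] for j in range(N) if j != i)
                new.append(post_unary[i][k] * math.exp(field))
            s = sum(new)                 # local normalization Omega_i
            Q[i] = [v / s for v in new]
    return Q
```

With W = 0 the update leaves the unary posteriors unchanged; a positive link weight pulls an uncertain sample toward the cluster of a confidently assigned neighbor.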
Equation 3.5, collectively for all i, is the mean field equation. Evaluation of the mean field equations requires at most O(NnM) time complexity, which is the same as the time complexity of one Gibbs sampling pass. Successive updates of equation 3.5 converge to a local optimum of equation 3.3. In our experiments, convergence usually occurs after about 20 iterations, which is many fewer than the number of passes required for Gibbs sampling.

4 Related Models

Prior to our work, several authors proposed constrained clustering models based on K-means, including the seminal work by Wagstaff and colleagues (Wagstaff et al., 2001; Wagstaff, 2002) and its successors (Basu et al., 2002; Basu, Bilenko, & Mooney, 2004; Bilenko et al., 2004). These models generally fall into two classes. The first class of algorithms (Wagstaff et al., 2001; Basu et al., 2002) keeps the original K-means cost function (reconstruction error) while confining the cluster assignments to be consistent with the specified pairwise relations. The problem can be cast as the following constrained optimization problem,

\min_{Z, \mu} \sum_{i=1}^{N} \|x_i - \mu_{z_i}\|^2

subject to

z_i = z_j \quad \text{if } (x_i, x_j) \in \mathcal{L}
z_i \neq z_j \quad \text{if } (x_i, x_j) \in \mathcal{N},

where \mu = \{\mu_1, \ldots, \mu_M\} are the cluster centers. In the second class of algorithms, cluster assignments that violate the pairwise relations are allowed
but will be penalized. They employ a modified cost function (Basu et al., 2004):

J(\mu, Z) = \frac{1}{2} \sum_{i=1}^{N} \|x_i - \mu_{z_i}\|^2 + \sum_{(i,j) \in \mathcal{L}} a_{ij} \, \mathbb{1}(z_i \neq z_j) + \sum_{(i,j) \in \mathcal{N}} b_{ij} \, \mathbb{1}(z_i = z_j),    (4.1)

where a_{ij} is the penalty for violating the link between (x_i, x_j) and b_{ij} is the penalty when the violated pairwise relation is a do-not-link. It can be shown that both classes of algorithms are encompassed by PPC as special cases: the particular PPC model we consider has spherical gaussian components with radii shrunk to zero while the weight matrix W^p is scaled up appropriately. The details are in appendix C. One weakness shared by the semisupervised K-means algorithms is the limited capability of K-means to model complex data distributions. If the data in one class are far from spherical, it may take a great number of pairwise relations to achieve reasonable classification accuracy (Wagstaff, 2002). Another serious problem lies in the optimization strategy employed by those algorithms to find the optimal assignment within each EM iteration. Due to the extra dependency brought by the pairwise relations, finding the optimal assignment of samples to clusters is not trivial. Evaluating every potential assignment requires O(M^{|T|}) time complexity, where |T| denotes the size of the biggest clique, which is prohibitively expensive when |T| is big. The greedy search used by these algorithms can return only local optima (Basu et al., 2002, 2004), and the sequential assignment strategy employed by Wagstaff et al. (2001) may lead to a situation in which a sample cannot be assigned to any cluster because of conflicts with previously assigned samples. To remedy the limited capability of constrained K-means, several authors proposed probabilistic models based on gaussian mixture models. The models proposed by Shental et al. (2003, 2004) address the situation where the pairwise relations are hard constraints. The authors partition the whole data set into a number of (maximal) “chunklets” consisting of samples that are (hard) linked to each other.³ Shental et al. (2003, 2004) discuss two sampling assumptions:
- Assumption 1: Chunklet X_i is generated identically and independently from component k with prior \pi_k (Shental et al., 2004), and the complete data likelihood is

P(X, Z|\Theta, E) = \frac{1}{\Omega} \prod_{(i,j) \in \mathcal{N}} \left(1 - \delta(z_i, z_j)\right) \cdot \prod_{l=1}^{L} \left\{\pi_{z_l} \prod_{x_i \in X_l} P_s(x_i|\theta_{z_l})\right\},    (4.2)

³ If a sample is not linked to any other samples, it comprises a chunklet by itself.
where E denotes the specified constraints.

- Assumption 2: Chunklet X_i is generated from component k with prior \propto \pi_k^{|X_i|}, where |X_i| is the number of samples in X_i (Shental et al., 2004). The complete data likelihood is

P(X, Z|\Theta, E) = \frac{1}{\Omega} \prod_{(i,j) \in \mathcal{N}} \left(1 - \delta(z_i, z_j)\right) \cdot \prod_{l=1}^{L} \left\{\pi_{z_l}^{|X_l|} \prod_{x_i \in X_l} P_s(x_i|\theta_{z_l})\right\}    (4.3)

= \frac{1}{\Omega} \prod_{(i,j) \in \mathcal{L}} \delta(z_i, z_j) \prod_{(i,j) \in \mathcal{N}} \left(1 - \delta(z_i, z_j)\right) \prod_{i=1}^{N} \pi_{z_i} P_s(x_i|\theta_{z_i}).    (4.4)

In appendix A we show that when using assumption 2, this model (as expressed in equations 4.3 and 4.4) is equivalent to PPC with only hard constraints (as expressed in equation 2.10). Shental et al. (2004) suggested that assumption 1 might be appropriate, for example, when chunklets are generated from temporal continuity. When pairwise relations are generated by labeling sample pairs picked from a data set, assumption 2 might be more reasonable. Assumption 1 allows a closed-form solution in the M-step (including the solution for \pi) in each EM iteration (Shental et al., 2004). An empirical comparison of the two sampling assumptions is given in section 5.

To incorporate the uncertainty associated with pairwise relations, Law et al. (2004, 2005) proposed to use soft group constraints. To model a link between a sample pair (x_i, x_j), they create a group l and express the strength of the link as the membership of x_i and x_j in group l. This strategy works well for some simple situations, for example, when the pairwise relations are nonoverlapping (as defined in section 3.1). However, it is awkward if samples are shared by multiple groups, which is unavoidable when samples are commonly involved in multiple relations. Another serious drawback of the group constraints model is its inability to model do-not-links. Because of these limitations, we omit the empirical comparison of this model to PPC in the following section.

5 Experiments

The experiments section consists of two parts. In section 5.1, we examine the way the number of constraints affects the clustering results. For each clustering task in this section, we generate artificial pairwise relations based on class labels. In section 5.2, we address real-world problems, where the constraints are derived from our prior knowledge. Also in this section,
Penalized Probabilistic Clustering
1545
we demonstrate the approaches to reducing computational complexity described in section 3. The following abbreviations will be used: soft-PPC is PPC with soft constraints; hard-PPC is PPC with hard constraints (implemented based on equation 2.10); soft-CKmeans is K-means with soft constraints (Basu et al., 2004); and hard-CKmeans is K-means with hard constraints (Wagstaff et al., 2001). The gaussian mixture model with hard constraints (Shental et al., 2003, 2004) will be referred to as constrained-EM.

5.1 Artificial Constraints. In this section, we discuss the influence of pairwise relations on PPC's clustering and compare the results to other semisupervised clustering models. This section presents two experiments. In experiment 1, we consider only correct pairwise relations, as an example of authentic knowledge. Accordingly, we use hard constraints in clustering. In experiment 2, we discuss the situation where pairwise relations contain significant error. We evaluate the performance of soft-PPC and test the weight design strategy described in section 2.4. The results are compared to hard-PPC and other semisupervised clustering models.

5.1.1 Constraint Selection. To avoid the computational burden, we limit our discussion to nonoverlapping pairwise relations in experiments 1 and 2. As discussed in section 3.1, nonoverlapping pairwise relations, hard or soft, allow a fast solution of the maximization step in each EM iteration. The pairwise relations are generated as follows. We randomly pick two samples from the training set without replacement. If the two have the same class label, we add a link constraint between them; otherwise, we add a do-not-link constraint. Note that the application of PPC is not limited to the nonoverlapping cases. In section 5.2, we discuss more complicated real-world problems where overlapping constraints are necessary, and we also present approaches to solve the computational problems.

5.1.2 Performance Evaluation.
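The constraint-sampling scheme described above can be sketched as follows; the function and argument names are ours, not from the original implementation.

```python
import random

def make_pairwise_relations(labels, n_pairs, seed=0):
    """Sample nonoverlapping pairwise relations from class labels.

    Samples are drawn without replacement, so no two relations
    share an endpoint. Same class -> link; different -> do-not-link.
    """
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)                      # random order = pick without replacement
    links, do_not_links = [], []
    while len(links) + len(do_not_links) < n_pairs and len(idx) >= 2:
        i, j = idx.pop(), idx.pop()
        if labels[i] == labels[j]:
            links.append((i, j))          # same class label -> link
        else:
            do_not_links.append((i, j))   # different labels -> do-not-link
    return links, do_not_links
```

Because endpoints are never reused, the resulting relations are nonoverlapping in the sense of section 3.1, so the M-step shortcut applies.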
We try PPC (with the number of components equal to the number of classes) with various numbers of pairwise relations. For each clustering result, a confusion matrix is built to compare it to the true labeling. The classification accuracy is calculated as the ratio of the sum of the diagonal elements to the number of all samples.

5.1.3 Experiment 1: Artificial Hard Constraints. This experiment is designed to answer two questions: how the number of pairwise relations affects the clustering result, and whether the information in the relations has been successfully encoded into the trained model. To answer the second question, we examine the out-of-sample classification of the gaussian mixture model fit with the aid of the pairwise relations. Toward this end, we divide each data set into a training set (90% of the data) and a held-out test set (10% of the data). Pairwise relations are generated among samples in
1546
Z. Lu and T. Leen
Figure 5: The artificial data sets. (a) Data set 1. (b) Data set 2. (c) Data set 3.
the training set. After the density model is fit on the training set and the pairwise relations, it is applied to the test set. Since the classification on the test set is decided solely by the fit gaussian mixture model, it reflects the influence of the pairwise relations on the trained model. For comparison, we also give results for two other constrained clustering methods: (1) hard-CKmeans (Wagstaff et al., 2001), for which the accuracy on the test set is given by nearest-neighbor classification with the cluster centers fit on the training set, and (2) constrained-EM (Shental et al., 2004). Since we show in appendix A that constrained-EM with sampling assumption 2 (see section 4) is equivalent to hard-PPC, we need only consider constrained-EM with sampling assumption 1. The reported classification accuracy is averaged over 100 different realizations of pairwise relations. The three two-dimensional artificial data sets shown in Figure 5 are designed to highlight PPC's superior modeling flexibility over constrained K-means.4 In each example, there are 200 samples in each class. It is clear from Figure 5 that for all three problems, the data in each class are nongaussian, so, not surprisingly, standard K-means and GMM do not return satisfactory clustering results. Figure 6 compares the clustering results of hard-PPC and hard-CKmeans with various numbers of pairwise relations. As shown in Figure 6, the accuracy of hard-PPC improves significantly when pairwise relations are incorporated. After enough pairwise relations are added, we finally reach close to 100% accuracy on the training data. On the test set, although no pairwise relation is available, we observe significantly improved accuracy as well. For hard-CKmeans, we do not observe any substantial accuracy improvement on the training set or the test set. The classification task of data set 1 is relatively easy for the gaussian mixture and difficult for K-means.
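The accuracy measure of section 5.1.2 can be sketched as follows. Because cluster indices are arbitrary, this sketch makes the cluster-to-class matching explicit by searching over relabelings, a detail the text leaves implicit; the function name is ours.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_ids, n_classes):
    """Accuracy from a confusion matrix: relabel clusters to classes
    (best permutation) and take the diagonal sum over the sample count."""
    # confusion[c][k] = number of samples of class c placed in cluster k
    confusion = [[0] * n_classes for _ in range(n_classes)]
    for c, k in zip(true_labels, cluster_ids):
        confusion[c][k] += 1
    n = len(true_labels)
    # search over cluster-to-class relabelings for the best diagonal sum
    best = max(sum(confusion[c][perm[c]] for c in range(n_classes))
               for perm in permutations(range(n_classes)))
    return best / n
```

The exhaustive permutation search is fine for the small numbers of classes used here; for many classes a Hungarian-style assignment would be the usual substitute.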
The classification accuracy of hard-PPC climbs

4 Some authors (Xing et al., 2003; Cohn et al., 2003; Bilenko et al., 2004) combined standard or constrained K-means with metric learning based on pairwise relations and reported improved classification accuracy. This will not be discussed in this letter.
Figure 6: The performance of hard-PPC and hard-CKmeans with various numbers of relations, on (a) data set 1, (b) data set 2, and (c) data set 3. trn ppc: accuracy on the training set with hard-PPC; tst ppc: accuracy on the test set from the GMM trained by hard-PPC; trn cop: accuracy on the training set with hard-CKmeans; tst cop: accuracy of K-means on the test set.
from 60% to close to 100% (on both the training and test sets) after 70 pairwise relations, whereas the accuracy of hard-CKmeans remains below 60% even with 100 relations. On data set 2, the hard-PPC accuracy improves from 75% to close to 95% on the training set and stops at around 90% on the test set. This divergence happens because the two classes in data set 2 overlap and thus defy a perfect GMM classifier. Data set 3 is the most difficult since it is highly nongaussian. It takes over 100 pairs for hard-PPC to reach 95% accuracy, whereas hard-CKmeans never reaches 55%. The comparison of PPC and constrained-EM is presented in a way that highlights the difference between the classification accuracies of the two methods. Basically, we record the classifications from PPC and constrained-EM with the same pairwise relations and initial condition and then calculate ΔAccuracy = (classification accuracy by PPC) − (classification accuracy by CEM) on both the training and test sets. In Figure 7, we report the mean and standard deviation of ΔAccuracy estimated over 100 different realizations of pairwise relations. From Figure 7, the difference between PPC and constrained-EM is indistinguishable when the number of relations is small, while PPC is slightly but consistently better than constrained-EM when relations are abundant. We perform the same experiments on three UCI data sets: the Iris data set has 150 samples and three classes, with 50 samples in each class; the Waveform data set has 5000 samples and three classes, with around 1700 samples in each class; and the Pendigits data set has four classes (digits 0, 6, 8, 9), each with 750 samples. The results are summarized in
Figure 7: Comparison of PPC and constrained-EM on artificial data: (a) data set 1, (b) data set 2, (c) data set 3 (top: training set; bottom: test set). For each number of pairwise relations, we show the mean of ΔAccuracy ± standard deviation estimated over 100 random realizations of pairwise relations.
Figure 8: Performance of PPC on UCI data sets with various numbers of relations: (a) Iris data, (b) Waveform data, (c) Pendigits data.
Figure 9: Comparison of PPC and constrained-EM on UCI data sets: (a) Iris, (b) Waveform, (c) Pendigits (top: training set; bottom: test set). For each number of pairwise relations, we show the mean of ΔAccuracy ± standard deviation estimated over 100 random realizations of pairwise relations.
Figures 8 and 9. As indicated by Figure 8, hard-PPC can consistently improve its clustering accuracy on the training set when more pairwise constraints are added; also, the effect brought by constraints generalizes to the test set. In contrast, as in the artificial data set case, the increase of
accuracy from hard-CKmeans is much less salient than that of hard-PPC. Figure 9 shows that hard-PPC is slightly better than constrained-EM, especially when the number of constraints is large.

5.1.4 Experiment 2: Artificial Soft Constraints. In this experiment, we evaluate the performance of soft-PPC when the specified pairwise relations contain substantial error. The results are compared to hard-PPC, soft-CKmeans, and hard-CKmeans. The artificial constraints are generated in the same way as in the previous experiment. The flipping noise is realized by randomly flipping each pairwise relation with a certain probability q ≤ 0.5. For the soft-PPC model, the weight W^p_ij of each specified pairwise relation is given as follows:

W^p_ij = (1/2) log((1 − q)/q)    if (z_i, z_j) is specified as a link,
W^p_ij = −(1/2) log((1 − q)/q)   if (z_i, z_j) is specified as a do-not-link.   (5.1)

We use w to denote the absolute value of the weight for nontrivial pairs. For soft-PPC, we have w = (1/2) log((1 − q)/q). For soft-CKmeans, we give equal weights to all specified constraints. Because there is no guiding rule in the literature on how to choose the weight for the soft-CKmeans model, we simply use the weight that yields the highest classification accuracy. We present the results on the three artificial data sets and three UCI data sets used in experiment 1. Unlike experiment 1, we use all available data in clustering. On each data set, we randomly generate a number of nonoverlapping pairwise relations so that 50% of the data are involved. In this experiment, we try two different noise levels, with q set to 0.15 and 0.3. Figure 10 compares the classification accuracies given by the maximum likelihood (ML) solutions of different models.5 The accuracy for each model is averaged over 20 random realizations of pairwise relations. On all data sets except artificial data set 3, soft-PPC with the designed weight gives higher accuracy than hard-PPC (w = ∞) and standard GMM (w = 0) at both noise levels. On artificial data set 3, when q = 0.3, hard-PPC gives the best classification accuracy.6 Soft-PPC gives clearly superior classification accuracy to the K-means models on all six data sets, even though the weight of soft-CKmeans is optimized. Figure 10 also shows that it can be harmful to use hard constraints when pairwise relations
5 We choose the solution with the highest data likelihood among 100 runs with different random initializations. For the K-means models, including soft-CKmeans and hard-CKmeans, we use the solution with the smallest value of the cost function.
6 Further experiments show that on these data, soft-PPC with the optimal w (larger than the one suggested by equation 5.1) is still slightly better than hard-PPC.
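The weight rule of equation 5.1 can be sketched as follows; the function names are illustrative, not from the original implementation.

```python
import math

def soft_weight(q):
    """Magnitude of the soft-PPC weight under flipping noise q
    (equation 5.1): w = 0.5 * log((1 - q) / q)."""
    return 0.5 * math.log((1.0 - q) / q)

def relation_weight(kind, q):
    """Signed weight for a specified relation: positive for a link,
    negative for a do-not-link."""
    w = soft_weight(q)
    return w if kind == "link" else -w
```

Note the sensible limiting behavior: as q → 0 (noise-free relations), w → ∞ and the soft preference approaches a hard constraint; at q = 0.5 (pure noise), w = 0 and the relation is ignored.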
Figure 10: Classification accuracy with noisy pairwise relations. We use all the data in clustering. (A) Standard GMM. (B) Soft-PPC. (C) Hard-PPC. (D) Standard K-means. (E) Soft-CKmeans with optimal weight. (F) Hard-CKmeans.
are noisy, especially when the noise is significant. Indeed, as shown by Figures 10d and 10f, hard-PPC can yield accuracy even worse than standard GMM.

5.2 Real-World Problems. In this section, we present two examples where pairwise constraints come from domain experts or common sense. Both examples concern image segmentation based on gaussian mixture models. In the first problem, experiment 3, hard pairwise relations are derived from image labeling done by a domain expert. In the second problem, soft pairwise relations are generated based on spatial continuity.

5.2.1 Experiment 3: Hard Do-Not-Links from Partial Class Information. The experiment in this section shows the application of pairwise constraints to partial class information. For example, consider a problem with six classes A, B, . . . , F. The classes are grouped into several class sets C1 = {A, B, C}, C2 = {D, E}, C3 = {F}. The samples are partially labeled in the sense that we are told which class set a sample is from but not which specific class it is from. We can logically derive a do-not-link constraint between any pair of samples known to belong to different class sets, while no link constraint can be derived if a class set has more than one class in it.
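The derivation of do-not-links from class-set labels can be sketched as follows (the function name is ours):

```python
def do_not_links_from_class_sets(set_id):
    """Derive hard do-not-links from partial labels: two samples known
    to lie in different class sets cannot share a cluster.
    set_id[i] is the class-set index of sample i."""
    pairs = []
    n = len(set_id)
    for i in range(n):
        for j in range(i + 1, n):
            if set_id[i] != set_id[j]:
                pairs.append((i, j))
    return pairs
```

In practice (as in experiment 3 below), one subsamples these pairs rather than using all of them, since the number of cross-set pairs grows quadratically.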
Figure 11: (a) Gray-scale image from spectral channel 1. (b) Partial labeling given by experts: black pixels denote the nonsnow area, and white pixels denote the snow area. Clustering results of standard GMM (c) and PPC (d). Panels c and d are colored according to the image blocks' assignments.
Figure 11a is a 120 × 400 region of the Greenland ice sheet from the NASA Langley DAAC (Srivastava, Oza, & Stroeve, 2005).7 Each pixel has intensities from seven spectral bands. This region is labeled into a snow area and a nonsnow area, as indicated in Figure 11b. The snow area may contain samples from several classes of interest: ice, melting snow, and dry snow, while the nonsnow area can be bare land, water, or cloud. The labeling from the expert contains incomplete but useful information for further segmentation of the image. To segment the image, we first divide it into 5 × 5 × 7 blocks (175-dimensional vectors). We use the first 50 principal components as feature vectors. Our goal is then to segment the image into (typically more than two) areas by clustering those feature vectors. With PPC, we can encode the partial class information into do-not-links.
7 We use the first seven moderate resolution imaging spectroradiometer (MODIS) channels with bandwidths as follows (in nm): channel 1: 620–670; channel 2: 841–876; channel 3: 459–479; channel 4: 545–565; channel 5: 1230–1250; channel 6: 1628–1652; channel 7: 2105–2155.
For hard-PPC, we use half of the data samples for training and the rest for testing. Hard do-not-link constraints (only on the training set) are generated as follows. For each block in the nonsnow area, we randomly choose (without replacement) six blocks from the snow area to build do-not-link constraints. By doing this, we obtain cliques of size seven (1 nonsnow block + 6 snow blocks). As in section 5.1, we apply the model fit with hard-PPC to the test set and combine the clustering results on both sets into a complete picture. Clearly, the clustering task is nontrivial for any M > 2. Typical clustering results of a three-component standard GMM and three-component PPC are shown in Figures 11c and 11d, respectively. Standard GMM gives a clustering that is clearly in disagreement with the human labeling in Figure 11b. The hard-PPC segmentation makes far fewer misassignments of snow areas (tagged white and gray) to nonsnow (black) than does the GMM. The hard-PPC segmentation properly labels almost all of the nonsnow regions as nonsnow. Furthermore, the segmentation of the snow areas into the two classes (not labeled) tagged white and gray in Figure 11d reflects subtle differences in the snow regions captured by the gray-scale image from spectral channel 1, as shown in Figure 11a.

5.2.2 Experiment 4: Soft Links from Continuity. In this section, we present an example where soft constraints come from continuity. As in the previous experiment, we perform image segmentation based on clustering. The image is divided into blocks and rearranged into feature vectors. We use a GMM to model those feature vectors, with each gaussian component representing one texture. However, standard GMM often fails to give good segmentations because it cannot make use of the spatial continuity of the image, which is essential in many image segmentation models, such as random fields (Bouman & Shapiro, 1994).
In our algorithm, the spatial continuity is incorporated as soft link preferences with uniform weight between each block and its neighbors. As described in section 2.4, the weight w of the soft links can be given as

w = (1/2) log((1 − q)/q),   (5.2)

where q is the ratio of softly linked adjacent pairs that are not in the same class. Usually q is given by an expert or estimated from segmentation results of similar images. In this experiment, we assume we already know the ratio q, which is calculated from the label of the image. The complete data likelihood is

Pp(X, Z|Θ, W^p) = (1/Z) Ps(X, Z|Θ) ∏_i ∏_{j∈U(i)} exp(w δ(z_i, z_j)),   (5.3)
Figure 12: (a) Texture combination. (b) Clustering result of standard GMM. (c) Clustering result of soft-PPC with Gibbs sampling. (d) Clustering result of soft-PPC with mean field approximation. Panels b to d are shaded according to the blocks assignments to clusters.
where U(i) denotes the set of neighbors of the ith block. The EM algorithm can be roughly interpreted as iterating two steps: (1) estimating the texture description (the parameters of the mixture model) based on the segmentation and (2) segmenting the image based on the texture description given by step 1. Since the exact calculation of the posterior probability is intractable due to the large clique containing all samples, we have to resort to approximation methods. In this experiment, both Gibbs sampling (see section 3.2) and the mean field approximation (see section 3.3) are used for posterior estimation. For Gibbs sampling, equation 3.1 reduces to

Pp(z_i | Z_{−i}, X, Θ, W^p) ∝ Ps(x_i, z_i | Θ) ∏_{j∈U(i)} exp(2w δ(z_i, z_j)).

The mean field equation, 3.5, reduces to

Q_i(z_i) ← (1/Z_i) Ps(x_i, z_i | Θ) ∏_{j∈U(i)} exp(2w Q_j(z_i)).
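A minimal sketch of this reduced mean-field update, assuming tabulated log joints and our own function signature:

```python
import math

def mean_field_step(Q, log_joint, neighbors, w):
    """One synchronous mean-field sweep for uniform soft links of weight w.

    Q[i][k]         current variational posterior of block i, component k
    log_joint[i][k] log Ps(x_i, z_i = k | Theta)
    neighbors[i]    indices of the blocks softly linked to block i
    """
    new_Q = []
    for i in range(len(Q)):
        # log of: Ps(x_i, k | Theta) * prod_{j in U(i)} exp(2 w Q_j(k))
        scores = [log_joint[i][k] + 2.0 * w * sum(Q[j][k] for j in neighbors[i])
                  for k in range(len(Q[i]))]
        m = max(scores)                              # stabilize the softmax
        exp_s = [math.exp(s - m) for s in scores]
        z = sum(exp_s)                               # per-block normalizer Z_i
        new_Q.append([e / z for e in exp_s])
    return new_Q
```

Iterating this sweep to a fixed point gives the approximate posteriors used in place of the intractable exact E-step.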
The image shown in Figure 12a is built from four Brodatz textures.8 This image is divided into 7 × 7 blocks that are then rearranged into 49-dimensional vectors. We use those vectors' first five principal components as the associated feature vectors. A typical clustering result of a four-component standard GMM is shown in Figure 12b. For soft-PPC, soft links with the weight w calculated from equation 5.2 are added between each block and its four neighbors. Figures 12c and 12d show the clustering results of four-component soft-PPC with Gibbs sampling and the mean field approximation, respectively. One run with Gibbs sampling takes around 160 minutes on a PC with a Pentium 4,
8 Downloaded from http://sipi.usc.edu/services/database/Database.html, April 2004.
2.0 GHz processor, whereas the algorithm using the mean field approximation takes only 3.1 minutes. Although the mean field approximation is about 50 times faster than Gibbs sampling, the clustering results are comparable according to Figure 12. Compared to the result given by standard GMM, soft-PPC with both approximation methods achieves significantly better segmentation after incorporating spatial continuity.
6 Conclusion and Future Work

We have proposed a probabilistic clustering model that incorporates prior knowledge in the form of pairwise relations between samples. Unlike previous work in semisupervised clustering, our model formulates clustering preferences as a Bayesian prior over the assignment of data points to clusters and so naturally accommodates both hard constraints and soft preferences. Unlike many semisupervised learning methods addressing labeled subsets (Szummer & Jaakkola, 2001; Zhou & Schölkopf, 2004; Zhu, Lafferty, & Ghahramani, 2003), PPC returns a fitted parametric density model and thus can deal with unseen data. Experiments on different data sets have shown that pairwise relations can consistently improve the performance of the clustering process.

Despite its success, PPC has limitations. First, PPC often needs a substantial proportion of samples involved in pairwise relations to give good results. Indeed, if we keep the number of relations fixed and keep adding samples without any new relations, the algorithm finally degenerates into unsupervised learning (clustering). To overcome this, one can instead build a semisupervised model based on discriminative models such as a neural network or gaussian process classifier and use the pairwise relations in the form of hints (Sill & Abu-Mostafa, 1996) or observations (Zhu et al., 2003). Second, since PPC is based on the gaussian mixture model, it works well when the data in each class can be approximated by a gaussian distribution. When this condition is not satisfied, PPC can lead to poor results. One way to alleviate this situation is to use multiple clusters to model one class, an interesting direction for future exploration. Third, although our design of the weight matrix W^p works well on some data sets, it is not clear how to set the weights in more general situations.

In this letter, we implement hard constraints using equation 2.10.
Alternatively, we can approximate hard constraints by using large |Wi j | for every constrained pair (xi , x j ). Indeed, from equations 2.5 to 2.8, when a constraint with large weight is violated in assignments Z, the prior probability P(Z|, W p ) will be close to zero. The value of P(Z|, W p ) with such a Z can be made arbitrarily small by increasing the corresponding weight. This is convenient when we want to model soft and hard relations at the same time. This situation is not covered in this letter, but remains an interesting direction for future exploration.
To address the computational difficulty caused by large cliques, we proposed two approximation methods: Gibbs sampling and the mean field approximation. We also observed that Gibbs sampling can be fairly slow for large cliques. One way to address this problem is to use fewer sampling passes (and thus a cruder approximate inference) in the early phase of EM training and gradually increase the number of sampling passes (for a finer approximation) as EM approaches convergence. By doing this, we may be able to achieve a much faster algorithm without sacrificing too much precision. For the mean field approximation, the bias brought by the independence assumption among the Q_i(·) could be severe for some problems. We can ameliorate this, as suggested by Jaakkola (2004), by retaining more of the substructure of the original graphical model (for PPC, it is expressed in W^p) while still keeping the computation tractable.

Appendix A: Hard Constraints Limit

In this appendix, we prove that when |W^p_ij| → ∞ for each specified pair (x_i, x_j), the complete likelihood of PPC can be written as in equation 2.10 and is thus equivalent to the model proposed by Shental et al. (2003). In the model proposed by Shental et al. (2003), the complete likelihood is written as

P(X, Z|Θ, E) = (1/Z) ∏_{c_i} δ_{y_{c_i}} ∏_{a_1^i ≠ a_2^i} (1 − δ(y_{a_1^i}, y_{a_2^i})) ∏_{i=1}^N Ps(z_i|Θ) Ps(x_i|z_i, Θ)

             = (1/Z) ∏_{c_i} δ_{y_{c_i}} ∏_{a_1^i ≠ a_2^i} (1 − δ(y_{a_1^i}, y_{a_2^i})) Ps(X, Z|Θ),

where E stands for the pairwise constraints, δ_{y_{c_i}} is 1 iff all the points in the chunklet c_i (the clique of samples connected by only hard links) have the same label, and (a_1^i, a_2^i) is the index of the ith sample pair with a hard do-not-link between them. This is equivalent to

P(X, Z|Θ, E) = (1/Z) Ps(X, Z|Θ) if Z satisfies all the constraints, and 0 otherwise.   (A.1)

In the corresponding PPC model with hard constraints, we have

W^p_ij = +∞ if i and j are linked, −∞ if i and j are do-not-linked, and 0 if there is no relation.   (A.2)
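The hard-constraints limit can also be checked numerically on a toy prior. The sketch below is our own (it links only the pair (0, 1) with a finite weight w) and shows the normalized prior mass of link-violating assignments vanishing as w grows; as in the proof below, each unordered pair is counted twice in the product over i ≠ j.

```python
import math
from itertools import product

def ppc_prior(Z, pi, W):
    """Unnormalized PPC prior: prod_k pi[z_k] * prod_{i != j} exp(W[i][j] * delta)."""
    p = 1.0
    for z in Z:
        p *= pi[z]
    n = len(Z)
    for i in range(n):
        for j in range(n):
            if i != j and Z[i] == Z[j]:   # delta(z_i, z_j) = 1
                p *= math.exp(W[i][j])
    return p

def violating_mass(pi, W, n, M):
    """Normalized prior mass of assignments violating the link (0, 1)."""
    total, bad = 0.0, 0.0
    for Z in product(range(M), repeat=n):
        p = ppc_prior(Z, pi, W)
        total += p
        if Z[0] != Z[1]:                  # the link (0, 1) is violated
            bad += p
    return bad / total
```

With w = 0 the link has no effect; with a large w, the violating mass is O(exp(−2w)), matching the bound in equation A.3.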
According to equations 2.5 and A.1, to prove P(X, Z|Θ, E) = Pp(X, Z|Θ, W^p), we need only prove that Pp(Z|Θ, W^p) = 0 for all Z that violate the constraints, that is,

Pp(Z|Θ, W^p) = [ ∏_k π_{z_k} ∏_{i≠j} exp(W^p_ij δ(z_i, z_j)) ] / [ Σ_{Z'} ∏_l π_{z'_l} ∏_{m≠n} exp(W^p_mn δ(z'_m, z'_n)) ] = 0.

First, let us assume that Z violates one link between the pair (α, β) (W^p_αβ = +∞). We have

z_α ≠ z_β ⇒ δ(z_α, z_β) = 0 ⇒ exp(W^p_αβ δ(z_α, z_β)) = 1.

We assume the constraints are consistent; in other words, there is at least one Z that satisfies all the constraints. We denote one such Z by Z*. We also assume each component has a positive prior probability. It is then straightforward to show that Pp(Z*|Θ, W^p) > 0, and it is easy to show that

Pp(Z|Θ, W^p) = [ ∏_k π_{z_k} ∏_{i≠j} exp(W^p_ij δ(z_i, z_j)) ] / [ Σ_{Z'} ∏_l π_{z'_l} ∏_{m≠n} exp(W^p_mn δ(z'_m, z'_n)) ]

≤ [ ∏_k π_{z_k} ∏_{i≠j} exp(W^p_ij δ(z_i, z_j)) ] / [ ∏_k π_{z*_k} ∏_{i≠j} exp(W^p_ij δ(z*_i, z*_j)) ]

= [ ∏_k π_{z_k} / π_{z*_k} ] · [ ∏_{(i,j)≠(α,β)} exp(W^p_ij δ(z_i, z_j)) / exp(W^p_ij δ(z*_i, z*_j)) ] · [ exp(2W^p_αβ δ(z_α, z_β)) / exp(2W^p_αβ δ(z*_α, z*_β)) ]

= [ ∏_k π_{z_k} / π_{z*_k} ] · [ ∏_{(i,j)≠(α,β)} exp(W^p_ij δ(z_i, z_j)) / exp(W^p_ij δ(z*_i, z*_j)) ] · 1 / exp(2W^p_αβ δ(z*_α, z*_β)),

where the factor 2 arises because each unordered pair is counted twice in the product over i ≠ j, and the last equality uses δ(z_α, z_β) = 0. Since Z* satisfies all the constraints, we must have

∏_{(i,j)≠(α,β)} exp(W^p_ij δ(z_i, z_j)) / exp(W^p_ij δ(z*_i, z*_j)) ≤ 1.

So we have

Pp(Z|Θ, W^p) ≤ [ ∏_k π_{z_k} / π_{z*_k} ] · 1 / exp(2W^p_αβ δ(z*_α, z*_β)).

When W^p_αβ → +∞, noting that δ(z*_α, z*_β) = 1, we have

1 / exp(2W^p_αβ δ(z*_α, z*_β)) → 0,

and then

Pp(Z|Θ, W^p) ≤ [ ∏_k π_{z_k} / π_{z*_k} ] · 1 / exp(2W^p_αβ δ(z*_α, z*_β)) → 0.   (A.3)
The do-not-link case can be proved in a similar way.

Appendix B: Prior for Noisy Pairwise Relations

In this appendix, we show how to derive the weight W from the certainty value γ_ij for each pair (x_i, x_j). Let E denote the original (noise-free) labeled pairwise relations and Ẽ the noisy version delivered to us. If we knew the original pairwise relations E, we would only have to consider the cluster assignments that are consistent with E and neglect the others; that is, the prior probability of Z would be

P(Z|Θ, E) = (1/Z_E) Ps(Z|Θ) if Z is consistent with E, and 0 otherwise,

where Z_E is the normalization constant for E: Z_E = Σ_{Z consistent with E} Ps(Z|Θ). Since we know Ẽ and the associated certainty values Γ = {γ_ij},
we know

P(Z|Θ, Ẽ, Γ) = Σ_E P(Z|Θ, E, Ẽ, Γ) P(E|Ẽ, Γ)   (B.1)

             = Σ_E P(Z|Θ, E) P(E|Ẽ, Γ).   (B.2)

Let E(Z) denote the unique E that is consistent with Z. From equation B.2, we know

P(Z|Θ, Ẽ, Γ) = P(Z|Θ, E(Z)) P(E(Z)|Ẽ, Γ) = (1/Z_{E(Z)}) Ps(Z|Θ) P(E(Z)|Ẽ, Γ).

If we ignore the variation of Z_E over E, we get an approximation of P(Z|Θ, Ẽ, Γ), denoted Pa(Z|Θ, Ẽ, Γ):

Pa(Z|Θ, Ẽ, Γ) = (1/Z_a) Ps(Z|Θ) P(E(Z)|Ẽ, Γ)

             = (1/Z_a) Ps(Z|Θ) ∏_{i<j} γ_ij^{H_ij(Ẽ, z_i, z_j)} (1 − γ_ij)^{1 − H_ij(Ẽ, z_i, z_j)},

where Z_a is the new normalization constant, Z_a = Σ_Z Ps(Z|Θ) P(E(Z)|Ẽ, Γ), and

H_ij(Ẽ, z_i, z_j) = 1 if (z_i, z_j) is consistent with Ẽ, and 0 otherwise.

We argue that Pa(Z|Θ, Ẽ, Γ) is equal to a PPC prior probability Pp(Z|Θ, W^p) with

W^p_ij = (1/2) log(γ_ij/(1 − γ_ij))    if (z_i, z_j) is specified as must-linked in Ẽ,
W^p_ij = −(1/2) log(γ_ij/(1 − γ_ij))   if (z_i, z_j) is specified as cannot-linked in Ẽ,
W^p_ij = 0 otherwise.   (B.3)

This can be easily proven by verifying

Pp(Z|Θ, W^p) / Pa(Z|Θ, Ẽ, Γ) = (Z_a/Z_w) ∏_{i<j, W^p_ij ≠ 0} γ_ij^{(sign(W^p_ij)−1)/2} (1 − γ_ij)^{−(sign(W^p_ij)+1)/2} = constant.
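Equation B.3 can be sanity-checked numerically. The sketch below is our own and assumes, as in appendix A, that the PPC prior runs over ordered pairs i ≠ j, so each unordered pair contributes exp(2Wδ); it verifies that the normalized PPC prior matches the noisy-relation prior for a two-sample must-link.

```python
import math
from itertools import product

def check_prior_equivalence(gamma, pi, M=2):
    """Max pointwise gap between the normalized PPC prior with
    W = 0.5*log(gamma/(1-gamma)) on a must-linked pair and the
    noisy-relation prior of appendix B, for two samples."""
    w = 0.5 * math.log(gamma / (1.0 - gamma))
    pp, pa = [], []
    for z1, z2 in product(range(M), repeat=2):
        base = pi[z1] * pi[z2]            # Ps(Z | Theta) up to a constant
        d = 1.0 if z1 == z2 else 0.0      # delta(z1, z2); H = d for a link
        pp.append(base * math.exp(2.0 * w * d))   # pair counted twice
        pa.append(base * gamma ** d * (1.0 - gamma) ** (1.0 - d))
    sp, sa = sum(pp), sum(pa)
    return max(abs(p / sp - a / sa) for p, a in zip(pp, pa))
```

The per-assignment ratio of the two unnormalized priors is the constant 1/(1 − γ), so after normalization the distributions coincide exactly.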
Since both Pa(Z|Θ, Ẽ, Γ) and Pp(Z|Θ, W^p) are normalized, we know Pa(Z|Θ, Ẽ, Γ) = Pp(Z|Θ, W^p).

Appendix C: From PPC to Constrained K-Means

In this appendix, we show how to derive K-means models with soft and hard constraints from PPC.

C.1 From PPC to K-Means with Soft Constraints. The adopted cost function for K-means with soft constraints is

J(µ, Z) = Σ_{i=1}^N ||x_i − µ_{z_i}||²/2 + Σ_{(i,j)∈L} a_ij 𝟙(z_i ≠ z_j) + Σ_{(i,j)∈N} b_ij 𝟙(z_i = z_j),   (C.1)

where µ_k is the center of the kth cluster. Equation C.1 can be rewritten as

J(µ, Z) = Σ_{i=1}^N ||x_i − µ_{z_i}||²/2 − Σ_{ij} W^p_ij δ(z_i, z_j) + C,   (C.2)

with C = Σ_{(i,j)∈L} a_ij a constant and

W^p_ij = a_ij if (i, j) ∈ L, −b_ij if (i, j) ∈ N, and 0 otherwise.   (C.3)

The clustering process consists of minimizing the cost function J(µ, Z) over both the model parameters µ = {µ_1, µ_2, . . . , µ_M} and the cluster assignment Z = {z_1, z_2, . . . , z_N}. The optimization is usually done iteratively with a modified Linde-Buzo-Gray (LBG) algorithm. Assume we have the PPC model with the matrix W^p the same as in equation C.2. We further constrain each gaussian component to be spherical with radius σ. The complete data likelihood for the PPC model is

P(X, Z|Θ, W^p) = (1/Z) ∏_{i=1}^N π_{z_i} ∏_{i=1}^N exp(−||x_i − µ_{z_i}||²/(2σ²)) ∏_{mn} exp(W^p_mn δ(z_m, z_n)),   (C.4)
where Z is the normalizing constant and µ_k is the mean of the kth gaussian component. To build its connection to the cost function in equation C.2, we consider the following scaling:

σ → ασ,   W^p_ij → W^p_ij/α².   (C.5)

The complete data likelihood with the scaling parameter α is

Pα(X, Z|Θ, W^p) = (1/Z(α)) ∏_{i=1}^N π_{z_i} ∏_{i=1}^N exp(−||x_i − µ_{z_i}||²/(2α²σ²)) ∏_{mn} exp((W^p_mn/α²) δ(z_m, z_n)).   (C.6)

It can be shown that when α → 0, the maximum data likelihood will dominate the data likelihood:

lim_{α→0} [ max_Z Pα(X, Z|Θ, W^p) / Σ_Z Pα(X, Z|Θ, W^p) ] = 1.   (C.7)

To prove equation C.7, we first show that when α is small enough, we have

arg max_Z Pα(X, Z|Θ, W^p) = Z* ≡ arg min_Z [ Σ_{i=1}^N ||x_i − µ_{z_i}||²/2 − Σ_{mn} W^p_mn δ(z_m, z_n) ].   (C.8)

Proof of equation C.8. Assume Z' is any cluster assignment different from Z*. We only need to show that when α is small enough,

Pα(X, Z*|Θ, W^p) > Pα(X, Z'|Θ, W^p).   (C.9)

To prove equation C.9, we notice that

log Pα(X, Z*|Θ, W^p) − log Pα(X, Z'|Θ, W^p)
  = Σ_{i=1}^N ( log π_{z*_i} − log π_{z'_i} ) + (1/α²) [ Σ_{i=1}^N ( ||x_i − µ_{z'_i}||²/2 − ||x_i − µ_{z*_i}||²/2 ) − Σ_{mn} W^p_mn ( δ(z'_m, z'_n) − δ(z*_m, z*_n) ) ].   (C.10)
Since Z* = arg min_Z [ Σ_{i=1}^N ||x_i − µ_{z_i}||²/2 − Σ_{mn} W^p_mn δ(z_m, z_n) ], we have

Σ_{i=1}^N ( ||x_i − µ_{z'_i}||²/2 − ||x_i − µ_{z*_i}||²/2 ) − Σ_{mn} W^p_mn ( δ(z'_m, z'_n) − δ(z*_m, z*_n) ) > 0.   (C.11)

Let ε = Σ_{i=1}^N ( ||x_i − µ_{z'_i}||²/2 − ||x_i − µ_{z*_i}||²/2 ) − Σ_{mn} W^p_mn ( δ(z'_m, z'_n) − δ(z*_m, z*_n) ). We can see that when α is small enough,

log Pα(X, Z*|Θ, W^p) − log Pα(X, Z'|Θ, W^p) = Σ_{i=1}^N ( log π_{z*_i} − log π_{z'_i} ) + ε/α² > 0.   (C.12)

It is obvious from equation C.12 that for any Z' different from Z*,

lim_{α→0} [ log Pα(X, Z*|Θ, W^p) − log Pα(X, Z'|Θ, W^p) ] = lim_{α→0} [ Σ_{i=1}^N ( log π_{z*_i} − log π_{z'_i} ) + ε/α² ] = +∞,

or equivalently,

lim_{α→0} Pα(X, Z'|Θ, W^p) / Pα(X, Z*|Θ, W^p) = 0,   (C.13)

which proves equation C.7. As a result of equation C.7, when optimizing the model parameters Θ, we can equivalently maximize max_Z Pα(X, Z|Θ, W^p) over Θ. It is then a joint optimization problem: max_{Θ,Z} Pα(X, Z|Θ, W^p).
Following the same thought, we find that the soft posterior probability of each sample (as in a conventional mixture model) becomes a hard membership (as in K-means). This fact can be proved as follows. The posterior probability of sample x_i belonging to component k is

Pα(z_i = k | X, Θ, W^p) = [ Σ_{Z: z_i = k} Pα(X, Z|Θ, W^p) ] / [ Σ_Z Pα(X, Z|Θ, W^p) ].

From equation C.7, it is easy to see that

lim_{α→0} Pα(z_i = k | X, Θ, W^p) = 1 if z*_i = k, and 0 otherwise.   (C.14)

The negative logarithm of the complete likelihood Pα is then

Jα(Θ, Z) = −log Pα(X, Z|Θ, W^p)

        = −Σ_{i=1}^N log π_{z_i} + Σ_{i=1}^N ||x_i − µ_{z_i}||²/(2α²) − Σ_{mn} (W^p_mn/α²) δ(z_m, z_n) + log Z(α)

        = −Σ_{i=1}^N log π_{z_i} + (1/α²) [ Σ_{i=1}^N ||x_i − µ_{z_i}||²/2 − Σ_{mn} W^p_mn δ(z_m, z_n) ] + C,

where C = log Z(α) is a constant. It is obvious that when α → 0, we can neglect the term −Σ_{i=1}^N log π_{z_i}. Hence, the only model parameters left to adjust are the gaussian means µ. We only have to consider the new cost function

J̃α(µ, Z) = (1/α²) [ Σ_{i=1}^N ||x_i − µ_{z_i}||²/2 − Σ_{mn} W^p_mn δ(z_m, z_n) ],   (C.15)
the optimization of which is obviously equivalent to equation C.1. So we can conclude that when α → 0 in equation C.5, the PPC model shown in equation 3.4 becomes a K-means model with soft constraints.

C.2 From PPC to K-Means with Hard Constraints (COPK-Means). COPK-means is a hard clustering algorithm with hard constraints. The goal is to find a set of cluster centers μ and a clustering result Z that minimize the cost function
\sum_{i=1}^N \|x_i - \mu_{z_i}\|^2,   (C.16)

subject to the constraints

z_i = z_j \quad \text{if } (x_i, x_j) \in L,   (C.17)
z_i \neq z_j \quad \text{if } (x_i, x_j) \in N.   (C.18)
Assume we have the PPC model with soft relations represented by the matrix W^p such that

W^p_{ij} = \begin{cases} w & (x_i, x_j) \in L \\ -w & (x_i, x_j) \in N \\ 0 & \text{otherwise,} \end{cases}   (C.19)

where w > 0. We further constrain each gaussian component to be spherical with radius σ. The complete-data likelihood for the PPC model is

P(X, Z \mid \Theta, W^p) = \Big[ \prod_{i=1}^N \pi_{z_i} \exp\Big( -\frac{\|x_i - \mu_{z_i}\|^2}{2\sigma^2} \Big) \Big] \prod_{(m,n) \in L} \exp\big( w\, \delta(z_m, z_n) \big) \prod_{(m',n') \in N} \exp\big( -w\, \delta(z_{m'}, z_{n'}) \big),   (C.20)
where μ_k is the mean of the kth gaussian component. There are infinitely many ways to obtain equations C.16 to C.18 from equation C.20, but we consider the following scaling with factor β:

\sigma \to \beta\sigma, \qquad W^p_{ij} \to W^p_{ij} / \beta^3.   (C.21)

The complete-data likelihood with the scaled parameters is

P_\beta(X, Z \mid \Theta, W^p) = C(\beta) \Big[ \prod_{i=1}^N \pi_{z_i} \exp\Big( -\frac{\|x_i - \mu_{z_i}\|^2}{2\beta^2\sigma^2} \Big) \Big] \prod_{(m,n) \in L} \exp\Big( \frac{w}{\beta^3}\, \delta(z_m, z_n) \Big) \prod_{(m',n') \in N} \exp\Big( -\frac{w}{\beta^3}\, \delta(z_{m'}, z_{n'}) \Big),   (C.22)

where C(β) is a normalization factor depending only on β. As established in the previous section, when β → 0, the maximum-likelihood assignment dominates the total likelihood:

\lim_{\beta \to 0} \frac{ \max_Z P_\beta(X, Z \mid \Theta, W^p) }{ \sum_Z P_\beta(X, Z \mid \Theta, W^p) } = 1.

As a result, when optimizing the model parameters Θ, we can equivalently maximize \max_Z P_\beta(X, Z \mid \Theta, W^p). Also, the soft posterior probability (as in the conventional mixture model) becomes a hard membership (as in K-means).
The negative logarithm of the complete likelihood P_\beta is then

J_\beta(\Theta, Z) = -\sum_{i=1}^N \log \pi_{z_i} + \frac{1}{\beta^2} \sum_{i=1}^N \frac{\|x_i - \mu_{z_i}\|^2}{2} + \frac{1}{\beta^3} \Big[ \sum_{(m',n') \in N} w\, \delta(z_{m'}, z_{n'}) - \sum_{(m,n) \in L} w\, \delta(z_m, z_n) \Big] + C,   (C.23)

where C is a constant that depends only on β. It is obvious that when β → 0, we can neglect the term -\sum_{i=1}^N \log \pi_{z_i}. Hence, we only have to consider the new cost function
\tilde J_\beta(\mu, Z) = \frac{1}{\beta^2} \Big[ \sum_{i=1}^N \frac{\|x_i - \mu_{z_i}\|^2}{2} + \frac{1}{\beta} \Big( \sum_{(m',n') \in N} w\, \delta(z_{m'}, z_{n'}) - \sum_{(m,n) \in L} w\, \delta(z_m, z_n) \Big) \Big],   (C.24)
the minimization of which is equivalent to minimizing the following, since we can neglect the constant factor 1/β²:

\tilde J'_\beta(\mu, Z) = \sum_{i=1}^N \frac{\|x_i - \mu_{z_i}\|^2}{2} + \frac{w}{\beta} J_c(Z),   (C.25)

where J_c(Z) = \sum_{(m',n') \in N} \delta(z_{m'}, z_{n'}) - \sum_{(m,n) \in L} \delta(z_m, z_n) is the cost term arising from the pairwise constraints.

Let S_Z = \{ Z \mid z_i = z_j \text{ if } W^p_{ij} > 0;\; z_i \neq z_j \text{ if } W^p_{ij} < 0 \}. We assume the pairwise relations are consistent, that is, S_Z \neq \emptyset. Obviously, all Z in S_Z achieve the same minimum value of the term J_c(Z); that is,

J_c(Z) = J_c(Z') \quad \forall\, Z \in S_Z,\; Z' \in S_Z,
J_c(Z) < J_c(Z') \quad \forall\, Z \in S_Z,\; Z' \notin S_Z.

It is obvious that when β → 0, any Z that minimizes \tilde J_\beta(\mu, Z) must be in S_Z. So the minimization can finally be cast into the form

\min_{Z, \mu} \sum_{i=1}^N \|x_i - \mu_{z_i}\|^2 \quad \text{subject to } Z \in S_Z,
which is equivalent to equations C.16 to C.18. So we can conclude that when β → 0 in equation C.21, the PPC model shown in equation C.20 becomes a K-means model with hard constraints.

Acknowledgments

We thank Ashok Srivastava of NASA Ames Research Center for providing satellite image data and the reviewers for comments leading to a stronger letter. This work was funded by NASA Collaborative Agreement NCC 21264 and NSF grant OCI-0121475.
References

Ambroise, C., Dang, M., & Govaert, G. (1997). Clustering of spatial data by the EM algorithm. In A. Soares, J. Gómez-Hernández, & R. Froidevaux (Eds.), Geostatistics for environmental applications (Vol. 3, pp. 493–504). Norwell, MA: Kluwer.
Basu, S., Bannerjee, A., & Mooney, R. (2002). Semi-supervised clustering by seeding. In C. Sammut & A. Hoffmann (Eds.), Proceedings of the Nineteenth International Conference on Machine Learning (pp. 19–26). San Francisco: Morgan Kaufmann.
Basu, S., Bilenko, M., & Mooney, R. J. (2004). A probabilistic framework for semi-supervised clustering. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 59–68). New York: ACM.
Bilenko, M., Basu, S., & Mooney, R. (2004). Integrating constraints and metric learning in semi-supervised clustering. In C. Brodley (Ed.), Proceedings of the Twenty-First International Conference on Machine Learning (pp. 11–18). New York: ACM.
Bouman, C., & Shapiro, M. (1994). A multiscale random field model for Bayesian image segmentation. IEEE Transactions on Image Processing, 3, 162–177.
Cohn, D., Caruana, R., & McCallum, A. (2003). Semi-supervised clustering with user feedback (Tech. Rep. TR2003-1892). Ithaca, NY: Cornell University.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.
Jaakkola, T. (2004). Tutorial on variational approximation methods. In D. Saad & M. Opper (Eds.), Advanced mean field methods: Theory and practice (pp. 129–160). Cambridge, MA: MIT Press.
Klein, D., Kamvar, S., & Manning, C. (2002). From instance level to space-level constraints: Making the most of prior knowledge in data clustering. In C. Sammut & A. Hoffmann (Eds.), Proceedings of the Nineteenth International Conference on Machine Learning (pp. 307–313). San Francisco: Morgan Kaufmann.
Lange, T., Law, M., Jain, A., & Buhmann, J. (2005). Learning with constrained and unlabelled data. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (pp. 730–737). Los Alamitos, CA: IEEE Computer Society Press.
Law, M., Topchy, A., & Jain, A. (2004). Clustering with soft and group constraints. In A. Fred, T. Caelli, R. Duin, A. Campilho, & D. Ridder (Eds.), Joint IAPR International Workshop on Syntactical and Structural Pattern Recognition and Statistical Pattern Recognition (pp. 662–670). Berlin: Springer-Verlag.
Law, M., Topchy, A., & Jain, A. (2005). Model-based clustering with probabilistic constraints. In Proceedings of SIAM Data Mining (pp. 641–645). Philadelphia: SIAM.
Lu, Z., & Leen, T. (2005). Semi-supervised learning with penalized probabilistic clustering. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17 (pp. 849–856). Cambridge, MA: MIT Press.
Neal, R. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. CRG-TR-93-1). Toronto: Computer Science Department, University of Toronto.
Shental, N., Bar-Hillel, A., Hertz, T., & Weinshall, D. (2003). Computing gaussian mixture models with EM using side-information (Tech. Rep. 2003-43). Jerusalem: Leibniz Center for Research in Computer Science.
Shental, N., Bar-Hillel, A., Hertz, T., & Weinshall, D. (2004). Computing gaussian mixture models with EM using equivalence constraints. In L. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 16 (pp. 505–512). Cambridge, MA: MIT Press.
Sill, J., & Abu-Mostafa, Y. (1996). Monotonicity hints. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 634–640). Cambridge, MA: MIT Press.
Srivastava, A., Oza, N. C., & Stroeve, J. (2005). Virtual sensors: Using data mining techniques to efficiently estimate remote sensing spectra. IEEE Transactions on Geoscience and Remote Sensing, 43(3), 590–599.
Szummer, M., & Jaakkola, T. (2001). Partially labeled classification with Markov random walks. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. Cambridge, MA: MIT Press.
Theiler, J., & Gisler, G. (1997). A contiguity-enhanced K-means clustering algorithm for unsupervised multispectral image segmentation. In B. Javidi & D. Psaltis (Eds.), Proceedings of SPIE (Vol. 3159, pp. 108–118). Bellingham, WA: SPIE.
Wagstaff, K. (2002). Intelligent clustering with instance-level constraints. Unpublished doctoral dissertation, Cornell University.
Wagstaff, K., Cardie, C., Rogers, S., & Schroedl, S. (2001). Constrained K-means clustering with background knowledge. In C. Brodley & A. Danyluk (Eds.), Proceedings of the Eighteenth International Conference on Machine Learning (pp. 577–584). San Francisco: Morgan Kaufmann.
Xing, E., Ng, A., Jordan, M., & Russell, S. (2003). Distance metric learning with applications to clustering with side information. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15 (pp. 505–512). Cambridge, MA: MIT Press.
Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (pp. 189–196). Morristown, NJ: Association for Computational Linguistics.
Zhou, D., & Schölkopf, B. (2004). Learning from labeled and unlabeled data using random walks. In C. Rasmussen, H. Bülthoff, M. Giese, & B. Schölkopf (Eds.), 26th Annual Meeting of the German Association for Pattern Recognition (DAGM 2004). Berlin: Springer.
Zhu, X., Lafferty, J., & Ghahramani, Z. (2003). Semi-supervised learning: From Gaussian fields to gaussian processes (Tech. Rep. CMU-CS-03-175). Pittsburgh, PA: Computer Science Department, CMU.
Received October 21, 2005; accepted July 27, 2006.
LETTER
Communicated by Jan Karbowski
Robustness of Connectionist Swimming Controllers Against Random Variation in Neural Connections

Jimmy Or
[email protected] RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, Saitama, Japan 351-0198
The ability to achieve high swimming speed and efficiency is very important to both the real lamprey and its robotic implementation. In previous studies, we used evolutionary algorithms to evolve biologically plausible connectionist swimming controllers for a simulated lamprey. This letter investigates the robustness and optimality of the best evolved controllers, as well as of the biological controller hand-crafted by Ekeberg. Comparing how each controller responds to random variation in its intrasegmental or intersegmental weights allows estimates of robustness to be made. We test each controller's robustness at the excitation level that corresponds to either its maximum swimming speed or its maximum efficiency, by randomly varying the segmental connection weights (and, in some experiments, also the intersegmental couplings) over a range of noise levels. Interestingly, although the swimming performance (in terms of maximum speed and efficiency) of the Ekeberg biological controller is not as good as that of the artificially evolved controllers, it is relatively robust against noise in the neural networks. This suggests that natural evolution has produced a swimming controller that is good enough to survive in the real world. Our findings could inspire neurobiologists to conduct physiological experiments to gain a better understanding of how neural connectivity affects behavior. The results can also be applied to control an artificial lamprey in simulation, and possibly a robotic one as well.

1 Introduction

Advances have been made in understanding animal motor control in recent years due to better physiological measurement techniques, higher-density microelectrodes, and faster computers for simulating the neural mechanisms that underlie behavior. However, due to the complexity of the nervous system, we are still far from completely understanding the neural control of higher vertebrates such as humans.
Neural Computation 19, 1568–1588 (2007)
© 2007 Massachusetts Institute of Technology

The lamprey has been chosen for study by several neurobiologists because it is relatively easy to analyze: first, because while it has a brain stem and spinal cord with all the basic vertebrate features, the number of neurons in each category is an order
of magnitude fewer than in other vertebrates; and second, because it has a simple swimming gait. Findings relating to this prototype vertebrate can thus provide a better understanding of vertebrate motor control in general.

According to Ferrar, Williams, and Bowtell (1993), robustness of neural networks against stochastic variation is an important property for biological systems. Over the past 14 years, several researchers have studied different aspects of the robustness of model lamprey central pattern generators (CPGs), some at the segmental level (Hellgren, Grillner, & Lansner, 1992; Ferrar et al., 1993) and others at the intersegmental level (Wadden, Hellgren, Lansner, & Grillner, 1997; Ijspeert, 1998; Or & Hallam, 2000; Or, 2002). Hellgren et al. (1992) used a detailed model of the lamprey segmental network presented in Ekeberg, Wallén, Brodin, and Lansner (1991) and Wallén et al. (1992) to study how variation in cell and synaptic parameters affects the burst pattern. According to Hellgren et al. (1992), this model consists of a population of interneurons whose cellular properties, such as size and membrane conductance (including voltage-dependent ion channels), are taken into consideration. The model is detailed enough to capture all experimentally established important neural properties, yet simple enough to allow the simulation of several neurons. Hellgren et al. found that the robustness of the burst patterns, as well as the working frequency range, can be increased by incorporating random variation in certain cells and in the synaptic parameters. Furthermore, the population model they used is capable of producing stable burst activity over a large range of oscillation frequencies even without the lateral inhibitory interneurons (LINs).

Ferrar et al. (1993) later studied the robustness of the lamprey segmental network proposed by Buchanan and Grillner (1987). Unlike the model investigated by Hellgren et al. (1992), the neurons in this model are less detailed, and each of them represents an entire population of neurons of the same type. Ferrar and colleagues observed how the neural characteristics of the network (such as cycle duration, burst proportion, and left-right phase lag) change when it is subjected to two different noise conditions: (1) long-timescale noise at the synaptic connections from the driver cells (the brain stem in Ekeberg's model) to all the other cells and (2) short-timescale noise at all synaptic connections. In the first test, the synaptic strengths were varied from 50% to 100%, only at the beginning of the simulations. In the second test, the synaptic strengths varied from 0% to 200% at each time step. Ferrar and colleagues found that the network was less affected by noise on the short timescale because its effect tended to average out over a cycle. They also found that increasing the number of neurons in the network increased its robustness despite the influence of noise.

Given that a controller that achieves the highest swimming speed or efficiency but cannot tolerate random noise in its networks is undesirable in the real world, we investigate how these two variables are affected when random variations are applied to both segmental and
intersegmental neural connections. The controllers under investigation are Ekeberg's biological model and three other biologically plausible connectionist models evolved by genetic algorithms (GAs). In this letter, we conduct four experiments that investigate how varying the segmental or intersegmental connections affects the maximum swimming speed and efficiency.

2 Background

2.1 Neural Model of the Segmental Oscillator. We implemented the lamprey segmental oscillator based on physiological data presented in Ekeberg (1993). In the model, each segmental oscillator consists of eight neurons of four types. Each neuron is modeled as a leaky integrator with a saturating transfer function. The output u (∈ [0, 1]) is the mean firing frequency of the population that the unit represents. It is calculated using the following set of formulas:

\dot\xi_+ = \frac{1}{\tau_D} \Big( \sum_{i \in \Psi_+} u_i w_i - \xi_+ \Big)   (2.1)

\dot\xi_- = \frac{1}{\tau_D} \Big( \sum_{i \in \Psi_-} u_i w_i - \xi_- \Big)   (2.2)

\dot\vartheta = \frac{1}{\tau_A} (u - \vartheta)   (2.3)

u = \begin{cases} 1 - \exp\{(\Theta - \xi_+)\Gamma\} - \xi_- - \mu\vartheta & (u > 0) \\ 0 & (u \le 0), \end{cases}   (2.4)
where w_i represents the synaptic weights, and Ψ₊ and Ψ₋ represent the groups of presynaptic excitatory and inhibitory neurons, respectively. ξ₊ and ξ₋ can be considered the delayed dendritic responses (or mean membrane potentials) to excitatory and inhibitory inputs, respectively, and ϑ represents the frequency adaptation observed in real neurons. The values of the neural timing parameters (τ_D, τ_A) and the gain/threshold parameters (Γ, Θ, μ), and those of the connection weights, are set up in such a way that the simulation results from the model agree with physiological observations.

As observed in the real lamprey, 100 copies of interconnected segmental oscillators are connected to form the entire swimming CPG. The brain stem stimulates all neurons within the neural networks with the same amount of excitation (here called global excitation); sufficient stimulation results in oscillations of each individual segment at a frequency that depends on the strength of this global excitation signal. Neurons located at the most rostral segments of the CPG receive an additional amount of so-called extra excitation. The effect of this, interacting with the intersegmental coupling, is to induce a roughly equal relative phase lag between successive segments along the CPG, with the result that caudally traveling waves of neural activity appear.¹ The global excitation controls the amplitude (mean firing frequency) of the motoneuron outputs as well as the oscillation frequency of the CPG. The extra excitation alters the intersegmental phase lag largely independently of the global excitation. (For details of the neuron parameters and the Ekeberg model, refer to Ekeberg, 1993.) Note that in our implementation, the model does not contain mechanosensory feedback.

2.2 A Brief Description of the Controllers Under Investigation. In this letter, we investigate the behavior of four connectionist swimming controllers created in previous studies:

1. The biological controller, hand-crafted by Ekeberg (1993) on the basis of physiological data.
2. Controller 2, evolved by Ijspeert, Hallam, and Willshaw (1999), which uses Ekeberg's segmental oscillator together with intersegmental couplings evolved by a GA.
3. The efficient controller, the run9 controller obtained through evolution using the bigmin approach (Or, Hallam, Willshaw, & Ijspeert, 2002).
4. The hybrid robust controller, obtained through an evolution that takes into consideration the robustness of swimming speed against variation in body scale (Or, 2002).

We implemented our biological controller based on the data presented in Ekeberg (1993). For the rest of the controllers, we used evolutionary algorithms to evolve the neural couplings. Note that although the connections between the neurons within these controllers are different (see Table 1), the neurons that underlie the segmental oscillator are the same. Also, all four controllers are used to drive the same mechanical lamprey model. Given that the details of how the swimming controllers were evolved have been presented in previous studies, only a brief description is provided here. All three evolved controllers are created using a real-number GA with mutation and crossover operators.
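As a rough illustration of the leaky-integrator unit of equations 2.1 to 2.4, the sketch below integrates one unit with a forward-Euler step under constant excitatory drive. The parameter values (τ_D, τ_A, Γ, Θ, μ) are placeholders chosen for illustration, not Ekeberg's calibrated constants, and the presynaptic sums Σ u_i w_i are collapsed into two scalar drive terms.

```python
import math

def unit_step(xi_p, xi_m, theta, drive_p, drive_m, dt,
              tau_d=0.05, tau_a=0.4, gamma=1.8, thresh=0.1, mu=0.3):
    """One Euler step of the leaky-integrator unit (equations 2.1-2.4).
    drive_p / drive_m are the summed weighted inputs from the excitatory
    and inhibitory presynaptic groups. All parameter values are
    illustrative placeholders, not the calibrated Ekeberg constants."""
    xi_p += dt / tau_d * (drive_p - xi_p)          # equation 2.1
    xi_m += dt / tau_d * (drive_m - xi_m)          # equation 2.2
    u = 1.0 - math.exp((thresh - xi_p) * gamma) - xi_m - mu * theta  # eq. 2.4
    u = max(u, 0.0)                                # output is non-negative
    theta += dt / tau_a * (u - theta)              # equation 2.3 (adaptation)
    return xi_p, xi_m, theta, u

# Constant excitatory drive: the output rises, then settles lower as the
# adaptation variable theta builds up.
xi_p = xi_m = theta = 0.0
outputs = []
for _ in range(2000):
    xi_p, xi_m, theta, u = unit_step(xi_p, xi_m, theta,
                                     drive_p=1.0, drive_m=0.0, dt=0.001)
    outputs.append(u)
```

With these placeholder values, the output stays in [0, 1], overshoots, and then relaxes toward a lower steady state, which is the qualitative frequency-adaptation behavior the model is designed to capture.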
Chromosomes with genes of real numbers (between 0 and 1) are used to encode the connection weights. The fitness functions used to evolve these controllers share four basic fitness factors that reward controllers able to produce the following features:

1. Generate stable and regular oscillations across the entire CPG.
2. Increase the frequency of oscillation when the global excitation from the brain stem is increased.
1 In the case of the real lamprey, asymmetrical connections from the CIN neurons to neurons in the caudal direction cause the neural wave to travel from head to tail.
3. Increase the phase lag when the extra excitation to the neurons near the head segments is increased.
4. Modulate a wide range of swimming speeds by varying either the oscillation frequency or the wavelength of the body undulation.

Table 1: Connection Weight Matrix for the Biological and the Three Evolved Controllers.

From   To     Biological    Controller 2   Efficient
EINl   MNl    1 [5, 5]      1 [4, 1]       1 [4, 9]
EINl   EINl   0.4 [2, 2]    0.4 [5, 3]     0.4 [3, 3]
EINl   LINl   13 [5, 5]     13 [4, 1]      13.0 [4, 2]
EINl   CINl   3 [2, 2]      3 [11, 2]      3 [11, 1]
LINl   CINl   −1 [5, 5]     −1 [3, 9]      −1 [4, 7]
CINl   MNr    −2 [5, 5]     −2 [11, 11]    −2 [4, 4]
CINl   EINr   −2 [1, 10]    −2 [1, 0]      −2 [5, 5]
CINl   LINr   −1 [1, 10]    −1 [9, 4]      −1 [11, 5]
CINl   CINr   −2 [1, 10]    −2 [0, 5]      −2 [5, 7]

Hybrid robust controller (separately evolved segmental oscillator; its seven connections, in the order given): 14.6 [12, 11], −4.2 [8, 4], −4.8 [7, 0], −0.6 [4, 6], −2.5 [0, 5], −2.5 [8, 10], −1.3 [1, 5].

Notes: Positive numbers indicate excitatory connections, and negative numbers indicate inhibitory connections. Left and right neurons are indicated by l and r. Due to symmetry between the left and right sides, the connection weights of the neurons on the right are not shown. The number of extensions to the neighboring segments is shown in brackets: the first entry is the extension in the rostral direction, and the second is the extension in the caudal direction.

Ijspeert et al. (1999) used only these criteria to evolve controller 2. When we evolved the efficient controller, we added an additional fitness factor that encourages high swimming efficiency. During the evolutions, we recorded the minimum and maximum achievable swimming efficiencies of each controller across a range of global and extra excitation inputs from the brain stem. Controllers that are able to achieve a large minimum swimming efficiency and exhibit the four features mentioned above are rewarded more. The reason for evolving controllers based on a large minimum (rather than maximum) efficiency is that this approach implicitly forces all the measured efficiencies of the controller to be good, as the GA tries to pull up the worst efficiency each controller can achieve. Hence, the evolved
controllers under this bigmin approach are able to achieve higher swimming efficiency. For the hybrid robust controller, we use a two-stage evolution. In the first stage, we evolve the intrasegmental connections between the neurons within the same segment. We then select an oscillator that is able to generate a wide range of oscillation frequencies and motoneuron outputs as the prototype segment. In the second stage, 100 copies of this segmental oscillator are created, and we use a GA to evolve the intersegmental couplings for the entire swimming controller. Our fitness function is built on top of the one presented in Ijspeert et al. (1999). It has an additional fitness factor that rewards controllers with the least change in swimming speed at different body scales (normal, normal ±20%). The best evolved controller is called the hybrid robust controller.

2.3 Mechanical Model. We developed a 2D mechanical simulator to study how the muscular activity induced by the model neural networks affects swimming performance. The mechanical model of the lamprey approximates the size and shape of the real animal. The entire length of the model lamprey is 0.3 m. It has an elliptical cross section of constant height (30 mm) and variable width ranging from 1.7 mm to 20 mm. To describe the position and the shape of the body over time, a chain of links forming the midline is used. Just like the Ekeberg model, our model lamprey consists of 10 rigid body links with nine joints of one degree of freedom each. Each link is represented by the coordinates of its center of mass and the angle between it and the horizontal x-axis. On each side of the body, muscles connect each link to its immediate neighbors. The muscles are modeled as a combination of springs and dampers, and the outputs from the motoneurons control the spring constants of the corresponding muscles. Since there are 100 neural segments in the entire swimming CPG, each mechanical link corresponds to 10 neural segments.
The torque at each joint is controlled by the averaged motoneuron outputs of 10 consecutive neural segments along the CPG, from segments 6 to 95. The reason for skipping the 5 segments at each end of the CPG is that neurons at the head and the tail receive fewer inputs. When extra excitation is given to neurons near the head segments, the asymmetrical connections of some of the neurons in the caudal direction result in neural waves traveling along the body from head to tail. The successive contraction of muscles creates a mechanical wave, which in turn generates inertial forces from the surrounding water that propel the lamprey forward. (For more details on our mechanical model, refer to Or et al., 2002.)

2.4 Calculating Swimming Efficiency. To calculate the swimming efficiency, we used the ratio of forward swimming speed (U) to backward mechanical wave speed (V) (Williams, 1986; Carling, Williams, & Bowtell, 1998; Sfakiotakis, Lane, & Davis, 1999). The efficiency and mechanical wave
speed are defined as follows:

e = \frac{U}{V}   (2.5)

V = \frac{\lambda}{T}   (2.6)

where U, V, λ, and T are the swimming speed (m/s), the backward mechanical wave speed (m/s), the mechanical wavelength (m), and the mechanical period (s), respectively. Since both the neural and mechanical simulations are unstable at the beginning, the lamprey can end up swimming straight at any angle. The first step in computing the mechanical wave parameters is therefore to rotate the original coordinate system so that the fish swims parallel to the x-axis. Following the transformation, we calculated the mechanical periods of body links 2 to 5. (The mechanical period of each body link can be calculated from the difference between the time instants at which two consecutive wave crests pass through the same link.) If the periods of three or more of the mechanical links are defined, the mechanical wavelength can be calculated using the method described by Videler (1993). Note that due to strange head and tail movements caused by the reduction in neural connections at the ends of the swimming CPG, the mechanical period of the second link, T₂, is used in equation 2.6 to calculate the mechanical wave speed. Equation 2.5 can then be used to compute the swimming efficiency. (For more details on the algorithms used to calculate the swimming efficiency and forward speed, refer to Or, 2002.)

3 Investigation of the Effect of Varying Neural Connectivity on the Maximum Swimming Speed and Efficiency

3.1 Description of Experiments. To test the robustness of the controllers under random variation in connection weights, we conducted the following experiments on each controller and then compared their performance in terms of either maximum achievable speed or efficiency. For a controller to be robust, its maximum achievable speed or efficiency should be similar to the speed or efficiency obtained when noise is not present:²
• Experiment 1: Robustness of swimming speed against random variation in segmental connection weights. For each controller, the excitation combination that corresponds to the maximum swimming speed is chosen as the brain stem input. The connection weights between neurons of a prototype segment are subjected to random variation (with uniform distribution) of ±5% to ±25% in steps of 5%. One hundred copies of this segmental oscillator are then combined to form the swimming CPG (with the number of extensions to neighboring segments unchanged). Note that for each amount of variation, 100 runs of neural and mechanical simulations (each with a different seed for the random number generator) are performed to obtain the average maximum speed over all 100 runs. Thus, for each controller, five averaged speeds are computed across the amounts of variation under investigation.

• Experiment 2: Robustness of swimming speed against random variation in segmental connection weights and intersegmental couplings. This is the same as experiment 1, except that, in addition, the numbers of rostral and caudal extensions are subjected to random variation. Each extension can be varied by up to three segments.

• Experiment 3: Robustness of swimming efficiency against random variation in segmental connection weights. This is the same as experiment 1, except that the excitation combination corresponding to the maximum swimming efficiency, rather than the maximum swimming speed, is chosen as the brain stem input. At each amount of variation, 100 runs of neural and mechanical simulations are performed to obtain an average of all 100 maximum efficiency values. Thus, for each controller, five averaged efficiency values are computed across the noise range under investigation.

• Experiment 4: Robustness of swimming efficiency against random variation in segmental connection weights and intersegmental couplings. This is the same as experiment 3, except that, in addition, the numbers of rostral and caudal extensions are subjected to random variation.

² Although the term noise generally refers to random variation over time, here we use it to mean random variation in neural connections applied only at the beginning of the simulations.
Note that as each calibrated connection strength is based on the total number of connections to neurons of neighboring segments, the same controllers are subjected to more variation in experiments 2 and 4 than in experiments 1 and 3.

3.2 Methods. Each of the experiments was conducted in the following stages:
• Stage 1: Random variation of connection weights. We use chromosomes with genes of real numbers to represent neural connections. After the chromosomes that encode the segmental and intersegmental connections of a controller are loaded, the connection matrices (one for the segmental connection weights and two others for the rostral and caudal extensions) are formed. Depending on the experiment, either the former matrix alone or all three matrices are subjected to random variation. Note that random variation is applied only at the beginning of each simulation. To avoid possible turning while the lamprey swims, left-right symmetry of the neural connections is maintained.

• Stage 2: Neural simulation of the swimming CPG. The excitation combination corresponding to the maximum swimming speed or efficiency is chosen as the brain stem input, and neural simulations are performed. The neural parameters (amplitude, frequency, and phase lag) of the outputs from the simulated swimming CPG are recorded.

• Stage 3: Mechanical simulation of the lamprey swimming in a 2D environment. At this stage, mechanical simulations are performed with the neural waves obtained in stage 2. The swimming speeds and efficiencies are recorded.

• Stage 4: Calculation of the percentage change in speed or efficiency. Once the averaged speed or efficiency at each amount of variation is calculated for each controller, the corresponding percentage change in speed or efficiency, with respect to the original values (without noise), is calculated.

• Stage 5: Comparative analysis of the swimming controllers within the same experiment.
As mentioned, random noise is added to the neural networks after the connection matrices are formed. Each connection weight is modified using the following equation:

new weight = original weight + original weight × rand × noise range,   (3.1)

where rand is a random number uniformly distributed within [−0.5, 0.5] and the noise range can be any value between 0.1 and 0.5 in steps of 0.1. This corresponds to randomly varying the connection weight by ±5% to ±25% (in steps of 5%). Using a similar equation, each intersegmental extension can be varied by up to ±3 segments.³ Table 2 shows how the noise range is related to the maximum amount of variation in connection weights and intersegmental extensions. Note that as 100 copies of the same segmental oscillator are used to form the entire swimming CPG, all segments are subjected to the same amount of noise. In other words, the extensions from each source neuron to the corresponding target neuron in its rostral or caudal neighbors are the same regardless of its location along the CPG.
3 Since the weights are real numbers, each calculated number of extensions is converted into an integer value by truncating the digits after the decimal point. The result is then assigned to either the rostral or the caudal extension matrix.
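Equation 3.1 and the accompanying extension rule can be sketched as follows. This is an illustrative Python rendering only; NumPy, the seed, and the function names are assumptions, not part of the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def perturb_weights(weights, noise_range):
    """Eq. 3.1: new weight = w + w * rand * noise_range, with rand ~ U[-0.5, 0.5].

    A noise range of 0.1 to 0.5 gives a maximum weight change of +/-5% to +/-25%.
    """
    rand = rng.uniform(-0.5, 0.5, size=np.shape(weights))
    return weights + weights * rand * noise_range

def perturb_extensions(extensions, noise_range):
    """Same rule for intersegmental extensions, truncated to integers (footnote 3)."""
    rand = rng.uniform(-0.5, 0.5, size=np.shape(extensions))
    return np.trunc(extensions + extensions * rand * noise_range).astype(int)
```

For example, with a noise range of 0.5, an extension spanning 12 segments can change by at most 12 × 0.25 = 3 segments, matching the ±3 bound quoted in Table 2.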
Robustness of Connectionist Swimming Controllers
1577
Table 2: Conversion from the Noise Range to the Maximum Range of Possible Variation in Connection Weights and Intersegmental Extensions (in Brackets).

Noise Range (%)   Maximum Variation in Connection Weights (%) [Intersegmental Extensions]
10                ±5 [0]
20                ±10 [±1]
30                ±15 [±1]
40                ±20 [±2]
50                ±25 [±3]
Table 3: Maximum Speed and Efficiency Achieved by the Different Controllers Without Noise in the CPGs.

Controller                 Speed (m/s) at Excitation Combination   Efficiency at Excitation Combination
Biological controller      0.45 (0.6, 200%)                        0.58 (0.3, 150%)
Controller 2               0.49 (0.9, 40%)                         0.59 (0.4, 50%)
Hybrid robust controller   0.49 (0.9, 0%)                          0.69 (0.6, 80%)
Efficient controller       0.51 (0.9, 40%)                         0.76 (0.4, 100%)

Notes: The excitation combinations at which the maximum speeds and efficiencies occur are given in parentheses. The first number represents global excitation, and the second number represents extra excitation from the brain stem.
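The stage 4 measure used throughout the experiments, the percentage change relative to the no-noise baseline, averaged over runs with different random seeds, can be sketched as follows (the function names are illustrative):

```python
def percent_change(noisy_value, baseline):
    """Stage 4: relative change in speed or efficiency w.r.t. the no-noise value."""
    return 100.0 * (noisy_value - baseline) / baseline

def mean_percent_change(noisy_values, baseline):
    """Average over runs with different random seeds (100 runs in the experiments)."""
    return sum(percent_change(v, baseline) for v in noisy_values) / len(noisy_values)
```

A negative result means the perturbed controller swims slower (or less efficiently) than its original, unperturbed configuration.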
3.3 Results and Discussion. Figure 1 shows how the speeds of the four controllers are affected at each given amount of random variation (noise range), while Figure 2 shows how the efficiency of the controllers is affected. For comparison purposes, the maximum speed and efficiency values achieved by the controllers when there is no noise are shown in Table 3. Note that because we are interested in how much damage or gain a particular controller receives when random changes are made relative to its original form, we measure the relative change in swim performance. Hence, percentages rather than absolute values are used. A bigger percentage drop in speed or efficiency means that the controller's original neural configuration is better optimized for high speed or efficiency (i.e., it sits at the peak of a hill in the fitness landscape commonly referred to in the GA literature) than that of the other controllers. Note that if the absolute speed or efficiency of this controller is still better than that of the rest of the controllers, its peak is relatively higher.

3.3.1 Experiment 1: Robustness of Swimming Speed Against Random Variation in Segmental Connection Weights. The top graph in Figure 1 shows
1578
J. Or
how varying the segmental connection weights affects the swimming speed. For all controllers, it appears that increasing the variation in weight tends to decrease the swimming speed. This tendency implies the possibility that the neural configurations of the four controllers are already nearly optimized for high-speed swimming. The speed of controller 2 changes the least: with ±25% variation in segmental connection weights, its speed drops by only 13.3%. As for the biological controller and the efficient controller, the maximum drops in their speed are 34.7% and 26.3%, respectively. The hybrid robust controller performs the worst. Even with only ±5% variation in weight, its swimming speed drops by 22.8% (which is worse than the worst performance of controller 2). As the amount of variation increases, the speed of the hybrid robust controller continues to drop; at ±25% variation in segmental connection weights, its speed drops by more than half. The percentage change in the speed curve for this controller indicates that noise in the segmental connection weights alone has a big negative impact on swimming speed. To summarize, in terms of robustness in speed against noise in segmental connections, controller 2 is the most robust. It is followed by the efficient controller, the biological controller, and then the hybrid robust controller. Although the biological controller and the hybrid robust controller are the least robust, their neural configurations are more optimized for high-speed swimming than those of the other two controllers. The percentage speed change of each controller, with respect to the maximum speed obtained when there is no noise in the segmental connection weights (as reported in Table 3), is shown in Table 4.

3.3.2 Experiment 2: Robustness of Swimming Speed Against Random Variation in Segmental Connection Weights and Intersegmental Couplings.
The bottom graph in Figure 1 shows how varying the segmental connection weights and intersegmental couplings affects the swimming speed. For all controllers, swimming speed tends to decrease with the presence of noise. Note that at ±5% noise in both segmental and intersegmental connections, there is no change in speed for the biological controller. As the variation continues to increase, the speed of this controller drops monotonically. The swimming speed achieved by the rest of the controllers drops immediately with the presence of noise. Controller 2 has its speed drop by less than one-fifth in the worst case, which is reasonably good. One interesting point to note is that
Figure 1: Percentage change in the averaged speed with respect to maximum speed obtained with the same CPGs but without noise. (Top) Effect of noise in segmental connection weight only. (Bottom) Effect of noise in both segmental connection weights and intersegmental extensions. Each averaged speed is the average of 100 runs with different random seeds. The error bars correspond to standard errors.
Table 4: Percentage Change in Speed (%) for All Controllers in Experiment 1.

Controller                  ±5%     ±10%    ±15%    ±20%    ±25%
Biological controller       −4.2    −13.6   −17.8   −21.4   −34.7
Controller 2                −0.2    −2.7    −6.5    −9.6    −13.3
Hybrid robust controller    −22.8   −39.6   −41.5   −52.6   −51.2
Efficient controller        −2.4    −7.5    −17.4   −15.4   −26.3

Note: Each column heading gives the maximum range of possible variation in segmental connection weights.
Table 5: Percentage Change in Speed (%) for All Controllers in Experiment 2.

Controller                  ±5%     ±10%    ±15%    ±20%    ±25%
Biological controller       0.0     −9.1    −14.7   −26.5   −34.7
Controller 2                −10.4   −10.6   −14.9   −16.9   −19.8
Hybrid robust controller    −39.2   −41.5   −45.7   −45.7   −48.6
Efficient controller        −6.9    −13.5   −15.2   −19.2   −30.3

Note: Each column heading gives the maximum range of possible variation in both segmental connection weights and intersegmental couplings.
as in experiment 1, the speed of this controller stays fairly constant across different noise ranges. Hence, controller 2 is the most robust in swimming speed against stochastic variation in any kind of neuronal connection. This may be due to its unique neural organization, which allows more flexibility than the other controllers. As for the hybrid robust controller, it has the greatest reduction in speed. Even with ±5% variation, its swimming speed drops by about 40%, which is worse than the worst case of any other controller. Except for the 30% drop in speed at ±25%, the efficient controller behaves similarly to controller 2. The percentage speed change of each controller, with respect to the maximum speeds reported in Table 3 (obtained when there is no noise in the CPGs) is shown in Table 5.
Figure 2: Percentage change in the averaged efficiency, with respect to maximum efficiency obtained with the same CPGs without noise. (Top) Effect of noise in segmental connection weights alone. (Bottom) Effect of noise in both segmental connection weights and intersegmental extensions. Each averaged efficiency is the average of 100 runs with different random seeds. The error bars correspond to standard errors.
Table 6: Percentage Change in Efficiency (%) for All Controllers in Experiment 3.

Controller                  ±5%     ±10%    ±15%    ±20%    ±25%
Biological controller       −0.7    −2.7    −7.6    −10.1   −11.7
Controller 2                −0.7    −1.5    −2.7    −5.0    −10.2
Hybrid robust controller    −36.9   −36.8   −32.5   −28.7   −35.8
Efficient controller        −22.2   −26.8   −29.9   −33.9   −31.2

Note: Each column heading gives the maximum range of possible variation in segmental connection weights.
Table 7: Percentage Change in Efficiency (%) for All Controllers in Experiment 4.

Controller                  ±5%     ±10%    ±15%    ±20%    ±25%
Biological controller       −6.7    −8.2    −10.7   −19.1   −20.8
Controller 2                −4.4    −6.4    −7.9    −10.1   −14.4
Hybrid robust controller    −46.6   −43.6   −41.6   −42.4   −45.3
Efficient controller        −33.9   −30.5   −31.6   −22.1   −31.8

Note: Each column heading gives the maximum range of possible variation in segmental connection weights and intersegmental couplings.
3.3.3 Experiment 3: Robustness of Swimming Efficiency Against Random Variation in Segmental Connection Weights. The top graph in Figure 2 shows that the efficiencies of the biological controller and especially controller 2 are extremely stable against variation in segmental connection weights. Although the percentage drop in efficiency increases with the amount of variation in weight, the absolute efficiency at each noise range for these two controllers is similar to that of the corresponding originals. Specifically, the efficiency of the biological controller drops by only about 12% in the worst case, while that of controller 2 drops by about 10% at most. The extremely good performance of these two controllers indicates that their efficiencies are less affected by noise in the segmental connections than those of the others. This is good for the lamprey, because robustness in efficiency is more important than robustness in swimming speed. However, the hybrid robust controller and the efficient controller have a relatively poor tolerance for noise. With only ±5% variation in segmental connection weights, their efficiencies drop by about 37% and 22%, respectively. This is worse than the
Figure 3: Changes in neural characteristics and speed caused by noise in experiments 1 (left) and 2 (right). The error bars correspond to standard errors.
[Figure 3 panels not reproduced: amplitude, frequency (Hz), phase lag (%), and speed (m/s) at different noise levels, plotted against the magnitude of the maximum allowable variation in connection weight (%), for experiment 1 (noise in segmental connections) and experiment 2 (noise in both segmental and intersegmental connections), for the four controllers.]
worst performance of the other two controllers. However, as the amount of noise continues to increase, the percentage change in efficiency of these two controllers remains fairly stable (with about 10% change across the extremes of the noise range). Note that the percentage drop in efficiency for the hybrid robust controller generally decreases as the amount of variation increases. From the results of experiment 1, we know that the speed of this controller drops as the amount of variation in the segmental connections increases. Since efficiency is defined as the ratio of U to V, this implies that the mechanical wave speed (V) drops faster than the swimming speed (U). Given that the mechanical wave speed is defined as the ratio of the mechanical wavelength to the mechanical period, and that this period is fixed as long as the neural wave remains the same, the mechanical wavelength must be dropping rapidly with the increase in noise. The percentage efficiency change of each controller, with respect to the maximum efficiency obtained when there is no noise in the segmental connection weights, is shown in Table 6.

3.3.4 Experiment 4: Robustness of Swimming Efficiency Against Random Variation in Segmental Connection Weights and Intersegmental Couplings. The bottom graph in Figure 2 shows that both the biological controller and controller 2 are fairly robust in efficiency against the presence of noise in both segmental and intersegmental connections. However, their absolute efficiencies are worse than they were in experiment 3. As the amount of variation increases, controller 2 becomes more robust than the biological controller. The percentage drops in efficiency for controller 2 and the biological controller at a maximum possible change of ±25% in segmental connection weights and intersegmental couplings are at most about 15% and 21%, respectively. Given that these percentage drops are relatively low, they are acceptable.
As in experiment 3, the hybrid robust controller and the efficient controller behave similar to each other. When compared with the biological controller and controller 2, their absolute performances are much worse. Even at only ±5% variation, their efficiencies drop to 46.6% and 33.9%, respectively. The performance of the hybrid robust controller is the worst overall. Its drop in efficiency at each noise range is almost half of the original. Note that after the initial drop at ±5% variation in any neural connection, the efficiencies of both the hybrid robust controller and the efficient controller stay fairly constant. The variation in efficiency is less than 5% for the hybrid robust controller and about 3% for the efficient controller (except at ±20% variation in connections).
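The efficiency measure used in these experiments, the ratio of swimming speed U to mechanical wave speed V (itself the mechanical wavelength divided by the period), can be written as a small helper; the function and parameter names are illustrative:

```python
def swimming_efficiency(u, wavelength, period):
    """Efficiency = U / V, where the mechanical wave speed V = wavelength / period."""
    v = wavelength / period
    return u / v
```

This makes the experiment 3 argument concrete: if noise shortens the mechanical wavelength (so V falls) faster than it slows the body (U), the ratio U/V drops less than the speed itself, which is what is observed for the hybrid robust controller.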
Figure 4: Changes in neural characteristics and efficiency caused by noise in experiments 3 (left) and 4 (right). The error bars correspond to standard errors.
[Figure 4 panels not reproduced: amplitude, frequency (Hz), phase lag (%), and efficiency at different noise levels, plotted against the magnitude of the maximum allowable variation in connection weight (%), for experiment 3 (noise in segmental connections) and experiment 4 (noise in both segmental and intersegmental connections), for the four controllers.]
The percentage efficiency change of each controller with respect to the maximum efficiency obtained when there is no noise in the CPGs is shown in Table 7.

4 General Discussion

Except for the efficient controller, each controller performs similarly across the different experiments. In general, the four controllers under investigation fall into three categories. In the first class, the absolute speed or efficiency of the biological controller and controller 2 at each amount of variation (noise range) is close to that of the originals. This means that these controllers sit on top of a plateau in the fitness landscape. The hybrid robust controller, however, forms a class of its own, as its absolute speed or efficiency drops significantly even when there is only a small variation of ±5% in any kind of neural connection. In other words, this controller sits on a narrow peak in the fitness landscape. The third case is the efficient controller, which behaves differently depending on the experiment. In terms of robustness in swimming speed (experiments 1 and 2), it behaves similarly to both the biological controller and controller 2, but its behavior in terms of robustness in swimming efficiency (experiments 3 and 4) is somewhere between the two other classes of controllers. Controller 2 performs best in all the experiments. Its swimming speed and efficiency stay relatively constant across the different amounts of variation under investigation. It is thus the most robust in both speed and efficiency against noise in the networks. As for the hybrid robust controller, its performance is the worst in all the experiments described. In other words, this controller is the most unstable when noise is present in the network system.
Having considered the neural characteristics (amplitude, frequency, and phase lag) and the swimming performance of each controller across the different experiments and amounts of variation (see Figures 3 and 4), we have observed that the amplitude of the motoneuron outputs from each controller in each experiment tends to be the least affected by the random seed (i.e., each set of 100 runs gives similar amplitudes). It is followed by the frequency of oscillations, the phase lag, and the swimming speed or efficiency. It could be that since the measurements of these parameters depend on an increasing number of other factors, the variations add up. Hence, even with the same amount of variation, the amplitude is less affected than the others. Also, as the amount of variation in connection weights increases, the standard error in both the neural characteristics and the swimming performance tends to increase as well.

5 Conclusion

In this letter, we investigated how changes in neural connectivity affect swimming speed and efficiency. We observed that controller 2 is the
Robustness of Connectionist Swimming Controllers
1587
most robust across the different experiments. It is followed by either the biological controller or the efficient controller (depending on the experiment). The performance of the evolved hybrid robust controller is the worst. Most important, although the absolute swimming performance of the biological controller in terms of maximum speed and efficiency is not as good as that of the artificially evolved controllers (it is actually already quite good), it is relatively robust against noise in the system. The robustness of the model CPGs might reflect the robustness of the real biological circuit, acquired through natural evolution. Through interactions with neurobiologists, we could gain a better understanding of how neural connectivity affects behavior.

Acknowledgments

I thank my doctoral thesis committee at the University of Edinburgh for their comments on the earlier versions of this letter. Thanks to Shun-ichi Amari at the RIKEN Brain Science Institute for his kind support.
Received January 16, 2006; accepted August 3, 2006.
LETTER
Communicated by Hu Sanqing
A Measurement Fusion Method for Nonlinear System Identification Using a Cooperative Learning Algorithm

Youshen Xia
[email protected]
College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350002, China
Mohamed S. Kamel
[email protected]
Department of Electrical and Computer Engineering, University of Waterloo, Waterloo N2L 3G1, Canada
Identification of a general nonlinear noisy system, viewed as the estimation of a predictor function, is studied in this article. A measurement fusion method for the predictor function estimate is proposed. In the proposed scheme, observed data are first fused by using an optimal fusion technique, and the optimal fused data are then incorporated into a nonlinear function estimator based on a robust least squares support vector machine (LS-SVM). A cooperative learning algorithm is proposed to implement the proposed measurement fusion method. Compared with related identification methods, the proposed method can minimize both the approximation error and the noise error. The performance analysis shows that the proposed optimal measurement fusion function estimate has a smaller mean square error than the LS-SVM function estimate. Moreover, the proposed cooperative learning algorithm can converge globally to the optimal measurement fusion function estimate. Finally, the proposed measurement fusion method is applied to ARMA signal and spatial temporal signal modeling. Experimental results show that the proposed measurement fusion method can provide a more accurate model.

Neural Computation 19, 1589–1632 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Identification of linear and nonlinear systems has received considerable attention and has been widely applied in engineering applications, including signal and image processing, communications, and control (Tekalp, Kaufman, & Woods, 1985; Abed-Meraim, Qui, & Hua, 1997; Ljung, 1999; Leung, Hennessey, & Drosopoulos, 2000). The objective is to find a good estimate of the unknown system model using measurements of the system's input and output. The problem of system identification can be described in different ways: input-output form (Lo, 1994), state-space equations (Chen & Narendra, 2004), and predictor forms (Roll, Nazin, & Ljung, 2005). In this letter, we focus on identifying nonlinear systems in the predictor form. Many estimation and prediction techniques for modeling nonlinear systems have been developed from noise-free input and noise-free output measurements (Ljung, 1999; Sjoberg et al., 1995). The well-known kernel method is a basic function approximation technique (Härdle, 1990). Because the kernel method has the asymptotic approximation property, it has become a foundation of many approximation approaches. For example, polynomial approaches, including local approximation techniques, are known to approximate arbitrarily well any continuous input-output mapping under relatively mild conditions (Lee & Mathews, 1993; Fan & Gijbels, 1996; Koukoulas & Kalouptsidis, 2000). In addition to their universal approximation capability, neural networks, such as multilayer perceptron (MLP) and radial basis function (RBF) networks, can provide parallel approximation techniques (Hornik, Stinchcombe, & White, 1989; Chen & Billings, 1992; Kadirkamanathan & Niranjan, 1993; Mhaskar & Hahm, 1997). To satisfy the requirement of fast identification in practical applications (Schoukens, Nemeth, Crama, Rolain, & Pintelon, 2003), nonasymptotic methods, in which the function estimate minimizes a cost function, have been studied. The popular support vector machine (SVM) method has been applied to regression approximation (Müller et al., 1997; Mukherjee, Osuna, & Girosi, 1997; Smola & Schölkopf, 1998; Rojo-Alvarez, Martinez-Ramon, Prado-Complido, Artes-Rodriguez, & Figueiras-Vidal, 2004). This method makes use of an ε-insensitive cost function so that the parameter learning problem becomes equivalent to a quadratic convex programming problem.
As a good alternative, a least-squares support vector machine (LS-SVM) for nonlinear system modeling employs a quadratic cost function, which allows the parameter learning problem to be solved easily through a set of linear equations (Suykens, Gestel, Brabanter, Moor, & Vandewalle, 2002). Recently, an interesting approach with a unified framework, called direct weight optimization (DWO), for nonlinear system identification at a given data point was presented (Roll et al., 2005). The parameter learning problem of the DWO method is a nonlinear constrained optimization problem. From the computational complexity viewpoint, the LS-SVM method is the best among the above approaches. However, the SVM function estimator minimizes a regularized risk function, while the LS-SVM function estimator minimizes a least-squares regularized function. The DWO method minimizes an upper bound of the mean square error. Thus, the DWO method may achieve better accuracy, since it minimizes an approximate mean square cost function. In addition, the cross-validation method can obtain an optimal regularization parameter to enhance the estimation accuracy of the SVM and LS-SVM methods, but its computation is very complex, and the resulting computational time is greatly increased. In practical circumstances, signals may be contaminated by noise, and the presence of measurement noise is also very common. Thus, the signal
data that the system receives are not perfect. As a result, the accuracy of the function estimate is affected not only by the approximation error but also by the noise error. The identification of nonlinear systems with measurement noise has been considered, but existing identification methods mainly try to reduce the approximation error without directly considering the effect of the noise error on the accuracy of the function estimate, since noisy data are usually included in their function estimates. The purpose of this letter is to develop a new estimation method for identifying a general nonlinear noisy system by minimizing both the approximation error and the measurement noise error. Based on an optimal fusion technique, we first construct optimal fused data from the measurement data so that the measurement noise error is minimized. We then pass the optimal fused data to a nonlinear function estimator, based on a robust LS-SVM, that minimizes the approximation error. Moreover, we show that the proposed optimal measurement fusion function estimator has a smaller mean square error than the LS-SVM function estimator. Another contribution of this article is a novel cooperative learning algorithm to implement the proposed measurement fusion method. In addition to low computational complexity, the proposed algorithm converges globally to the optimal measurement fusion function estimate. Finally, we apply the proposed measurement fusion method to ARMA signal and spatial temporal signal modeling. Computed results show that the proposed cooperative learning algorithm indeed gives a more accurate solution than the related identification algorithms. This letter is organized as follows. In section 2, the problem of nonlinear system identification is discussed.
In section 3, a measurement fusion method for nonlinear noisy system identification is proposed by using an optimal fusion technique and a robust LS-SVM technique. In section 4, a cooperative learning algorithm is presented. In section 5, the improved performance of the proposed method and the global convergence of the proposed cooperative learning algorithm are analyzed. In section 6, the proposed measurement fusion method is applied to the modeling of both ARMA signals and spatial temporal signals. In section 7, several illustrative examples are provided. Concluding remarks are given in section 8.
2 Problem and Model Estimates

We are concerned with the following discrete-time noisy system,

x(t) = g_0(ϕ_x(t)) + w(t),
y(t) = x(t) + n(t),  (2.1)
1592
Y. Xia and M. Kamel
where g_0(·) is the unknown function, ϕ_x(t) is the regression vector defined by ϕ_x(t) = [x(t − 1), . . . , x(t − p), u(t − 1), . . . , u(t − q)]^T, u(t) and x(t) are, respectively, the input and output signals of the system, p and q are the orders of the system, y(t) is the observed data, w(t) is driving white noise with zero mean, and n(t) is measurement white noise with zero mean. In the special case where p and q denote the corresponding maximum delays, the noisy system, equation 2.1, becomes a nonlinear autoregressive moving average (NARMA) system (Ljung, 1999). When g_0 is linear and q = 0, the noisy system, equation 2.1, is an AR signal system in the presence of noise. The problem of system identification is as follows: based on data received from the noisy system, equation 2.1, find a smooth function estimate g of g_0 such that the mean square error (MSE) criterion,

E[(g(ϕ_y(t)) − g_0(ϕ_x(t)))^2],  (2.2)

is made as small as possible, where ϕ_y(t) = [y(t − 1), . . . , y(t − p), u(t − 1), . . . , u(t − q)]^T. Now

g(ϕ_y(t)) − g_0(ϕ_x(t)) = g(ϕ_y(t)) − g(ϕ_x(t)) + g(ϕ_x(t)) − g_0(ϕ_x(t)).

Note that the noise n(t) is usually assumed to be independent of the signal processes x(t) and u(t). Then

E[(g(ϕ_y(t)) − g_0(ϕ_x(t)))^2] = E[(g(ϕ_y(t)) − g(ϕ_x(t)))^2] + E[(g(ϕ_x(t)) − g_0(ϕ_x(t)))^2].  (2.3)

It can be seen that the second term of the MSE mainly comes from the approximation error. Note that g(ϕ_y(t)) − g(ϕ_x(t)) ≈ ∇g(ϕ_x(t))^T (ϕ_y(t) − ϕ_x(t)) = O(ϕ_n(t)), where ∇g(ϕ_x(t)) denotes the gradient of g and ϕ_n(t) = [n(t − 1), . . . , n(t − p), 0, . . . , 0]^T. So the first term of the MSE mainly comes from the measurement noise error. Let V_1 = E[(g(ϕ_y(t)) − g(ϕ_x(t)))^2] and V_2 = E[(g(ϕ_x(t)) − g_0(ϕ_x(t)))^2]. Then minimizing E[(g(ϕ_y(t)) − g_0(ϕ_x(t)))^2] can be converted into minimizing V_1 and V_2 separately. Since the true regression vector ϕ_x(t) and g_0 are unknown, V_2 is not computable; a useful alternative in practice is to minimize E[(g(ϕ_y(t)) − y(t))^2]. Similarly, since V_1 is not computable like V_2, we will convert the minimization of V_1 into the minimization of the measurement noise error. Most existing identification methods concentrate on minimizing the approximation error without directly considering the effect of the noise error on the accuracy of the function estimate. For example, the neural network and local polynomial approaches can, in theory, reduce the approximation error to zero as the number of measurements grows. Among nonasymptotic approaches, the support vector machine (SVM) method minimizes the following regularized risk function,

E_1(α) = (1/N) Σ_{t=1}^{N} L_ε(g(ϕ_y(t)) − y(t)) + β‖α‖_2^2,  (2.4)
and its function estimate is given by

g(ϕ_y(t)) = α_0 + Σ_{k=1}^{N} α_k f_k(ϕ_y(t)),  (2.5)

where N is the number of measurements, α = [α_1, . . . , α_N]^T, ‖·‖_2 denotes the l_2 norm, L_ε(g(ϕ_y(t)) − y(t)) is an ε-insensitive cost function, f_k(ϕ_y(t)) is a kernel function, and ϕ_y(t) is a regression vector including noise. The weight coefficients {α_k} and α_0 are obtained by solving a quadratic convex program. The LS-SVM method minimizes another regularized quadratic function,

E_2(α) = (1/N) Σ_{t=1}^{N} (g(ϕ_y(t)) − y(t))^2 + β‖α‖_2^2,  (2.6)
where the function estimate g(ϕ_y(t)) has the same structure as equation 2.5, but the weight coefficients are obtained by solving linear equations. Directly based on the observed data y(t), a direct weight optimization (DWO) method was presented for the function estimate

ĝ(ϕ*) = a_0 + Σ_{t=1}^{N} a_t y(t)  (2.7)

at a given point ϕ*. The function estimate of the DWO method minimizes an upper bound of the MSE, and its weight parameters {a_t} optimize the nonlinear objective function

E_3(a) = (1/4)(Σ_{t=1}^{N} |a_t| ‖ϕ_y(t) − ϕ*‖_2^2)^2 + σ_v^2 Σ_{t=1}^{N} a_t^2.  (2.8)
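The three criteria above differ mainly in their loss terms. As a rough sketch (the residuals and the value ε = 0.5 below are made up for illustration, not taken from the letter), the ε-insensitive loss of equation 2.4 ignores small errors, while the squared loss of equation 2.6 penalizes every deviation:

```python
# Sketch: epsilon-insensitive loss (SVM, eq. 2.4) vs. squared loss (LS-SVM, eq. 2.6).
# Residuals and epsilon are hypothetical.

def eps_insensitive(r, eps=0.5):
    """L_eps(r) = max(0, |r| - eps): zero inside the eps-tube."""
    return max(0.0, abs(r) - eps)

def squared(r):
    """Squared loss used by the LS-SVM criterion."""
    return r * r

residuals = [0.1, 0.4, 0.6, 1.5]
for r in residuals:
    print(f"r={r:4.1f}  L_eps={eps_insensitive(r):4.2f}  squared={squared(r):4.2f}")
```

Residuals with |r| ≤ ε contribute nothing to the SVM criterion but still contribute quadratically to the LS-SVM criterion, which is one reason the latter is sensitive to noise and outliers.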
The identification of nonlinear systems with measurement noise was considered in these works, but the methods include the observed data in their function estimates. As a result, the measurement noise included in the observed data may cause a large error in the function estimate, since the noise error is not minimized. Different from the above identification methods, we propose a novel measurement fusion method for estimating g_0 by minimizing E[(g(ϕ_y(t)) − y(t))^2] and the measurement noise error separately. More exactly, based on an optimal data fusion technique, in the next section we reconstruct optimal fused data {z*(t)} from a set of multisensor observed data {y_i(t)} such that

E[(g(ϕ_z(t)) − g(ϕ_x(t)))^2],  (2.9)

which may be viewed as an alternative to V_1, is made as small as possible, where ϕ_z(t) = [z(t − 1), . . . , z(t − p), u(t − 1), . . . , u(t − q)]^T. We then pass the optimal fused data {z*(t)} to a robust LS-SVM function estimator, so that the measurement fusion function estimator is given by

g(ϕ_{z*}(t)) = α_0 + Σ_{k=1}^{N} α_k f_k(ϕ_{z*}(t)),
where ϕ_{z*}(t) = [z*(t − 1), . . . , z*(t − p), u(t − 1), . . . , u(t − q)]^T. There are three main advantages of our method. First, the optimal fused data {z*(t)} are shown to have a smaller MSE than the observed data {y_i(t)}. Second, the proposed measurement fusion method is shown to improve on the MSE of the LS-SVM method. Third, the proposed measurement fusion method can be effectively implemented by a new cooperative learning algorithm.

3 Measurement Fusion Method

In this section, we first study optimal fused data and a robust LS-SVM, and then propose a measurement fusion method for the optimal measurement fusion function estimate.

3.1 Optimal Fused Data. With the availability of multiple sensory devices, data fusion has received considerable attention in recent years (Hall & Llinas, 1997). Data fusion is the process of integrating complementary information from multisensor data; its main potential advantage is that, by integrating complementary information from multiple sensors, the uncertainty of the fused information is minimized (Varshney, 1997).
Consider multisensor observed data with τ (τ ≥ 2) sensors,

y_i(t) = x(t) + n_i(t), (i = 1, . . . , τ; t = 1, . . . , N),  (3.1)
where N is the number of sensor measurements, x(t) = g_0(ϕ_x(t)) + w(t), y_i(t) is the ith sensor measurement, and n_i(t) represents the measurement gaussian noise of the ith sensor, which is assumed to have zero mean and to be uncorrelated with x(t). Such multisensor observed data could be collected by multiple independent observations or by multiple sensory devices with synchronized measurements. Let the fused data be a linear combination of the τ sensor measurements,

z(t) = Σ_{i=1}^{τ} w_i y_i(t),

where {w_i} are fusion coefficients with Σ_{i=1}^{τ} w_i = 1. In order to minimize the measurement noise error, we find the optimal fused data {z(t)}_{t=1}^{N} satisfying

minimize  E[(g(ϕ_z(t)) − g(ϕ_x(t)))^2]
subject to  Σ_{i=1}^{τ} w_i = 1,  (3.2)
where ϕ_z(t) = [z(t − 1), . . . , z(t − p), u(t − 1), . . . , u(t − q)]^T. Note that E[(g(ϕ_z(t)) − g(ϕ_x(t)))^2] ≈ δ_0 E[‖ϕ_z(t) − ϕ_x(t)‖_2^2], where δ_0 > 0 is a constant. Then finding the optimal fused data can be converted into finding an optimal fusion solution of the following:

minimize  E[‖ϕ_z(t) − ϕ_x(t)‖_2^2]
subject to  Σ_{i=1}^{τ} w_i = 1.  (3.3)

Since E[‖ϕ_z(t) − ϕ_x(t)‖_2^2] = E[‖ϕ_z(t)‖_2^2] + E[‖ϕ_x(t)‖_2^2] − 2E[ϕ_z(t)^T ϕ_x(t)] and x(t) is uncorrelated with n_i(t),

E[ϕ_z(t)^T ϕ_x(t)] = E[(Σ_{i=1}^{τ} w_i ϕ_{y_i}(t))^T ϕ_x(t)] = E[‖ϕ_x(t)‖_2^2].
It follows that E[‖ϕ_z(t) − ϕ_x(t)‖_2^2] = E[‖ϕ_z(t)‖_2^2] − E[‖ϕ_x(t)‖_2^2]. Thus, equation 3.3 can be rewritten as

minimize  E[‖ϕ_z(t)‖_2^2]
subject to  Σ_{i=1}^{τ} w_i = 1.  (3.4)

Note that

E[‖ϕ_z(t)‖_2^2] = Σ_{k=1}^{p} E[z^2(t − k)] + Σ_{k=1}^{q} E[u^2(t − k)].

Then, when {y_i(t)} is wide-sense stationary (all of its first- and second-order moments are finite and independent of the sample index), solving equation 3.4 is equivalent to solving the following simple optimization problem:

minimize  E[z^2(t)]
subject to  Σ_{i=1}^{τ} w_i = 1.  (3.5)
So our optimal fused data are now given by z*(t) = Σ_{i=1}^{τ} w_i* y_i(t), where {w_i*} is the optimal solution of equation 3.5. We will show in section 5 that the optimal fused data are unbiased and have a smaller mean square error than the observed data.

3.2 Robust Least Squares Support Vector Machine. Support vector machines (SVMs), motivated by statistical learning theory, have received considerable attention in the field of machine learning and engineering applications, including regression approximation (Mukherjee et al., 1997; Smola & Schölkopf, 1998; Rojo-Alvarez et al., 2004). In order to speed up convergence, an interesting alternative, the LS-SVM, was recently developed by Suykens, Gestel, et al. (2002). The LS-SVM approach has an advantage over the SVM in that it can easily be implemented by an online algorithm. Consider the problem of approximating the set of data

{(y_i, x_i)}_{i=1}^{r}, x_i ∈ R^n, y_i ∈ R,  (3.6)
with a regression function of the form

g(x) = Σ_{i=1}^{M} ŵ_i φ_i(x) + b,  (3.7)

where {φ_i(x)}_{i=1}^{M} are feature functions defined in a high-dimensional space R^M (possibly an infinite-dimensional space), and {ŵ_i} and b are weight coefficients of the regression function g(x). The LS-SVM function estimate minimizes the following optimization problem in primal weight space:

minimize  J(ŵ, e) = (1/2) Σ_{k=1}^{M} ŵ_k^2 + γ Σ_{k=1}^{r} e_k^2
subject to  y_k = g(x_k) + e_k (k = 1, . . . , r).  (3.8)
Define the Lagrangian function of equation 3.8 as

L(α, b, e, ŵ) = J(ŵ, e) − Σ_{k=1}^{r} α_k {ŵ^T φ(x_k) + b + e_k − y_k},

where ŵ = [ŵ_1, . . . , ŵ_M]^T and φ(x_k) = [φ_1(x_k), . . . , φ_M(x_k)]^T. The conditions for optimality (Bazaraa, Sherali, & Shetty, 1993) imply that

ŵ = Σ_{k=1}^{r} α_k φ(x_k),  Σ_{k=1}^{r} α_k = 0,  α_k = γ e_k,  ŵ^T φ(x_k) + b + e_k = y_k (k = 1, . . . , r).

Substituting ŵ = Σ_{k=1}^{r} α_k φ(x_k) into equation 3.7, we have the LS-SVM function estimate

g(x) = Σ_{i=1}^{r} α_i φ(x)^T φ(x_i) + b,

where the weight coefficients α_k and b satisfy the following linear equations,

(Q + (1/γ) I_1) α + b e_r − y = 0,
Σ_{k=1}^{r} α_k = 0,  (3.9)

where α = [α_1, . . . , α_r]^T, e_r = [1, . . . , 1]^T ∈ R^r, y = [y_1, . . . , y_r]^T, I_1 ∈ R^{r×r} is a unit matrix, and Q = (φ(x_i)^T φ(x_j)) ∈ R^{r×r}. According to Mercer's
theorem, one can further choose a kernel function K(x, z) satisfying K(x, z) = φ(x)^T φ(z), so that the LS-SVM function estimate can be represented as

g(x) = Σ_{i=1}^{r} α_i K(x_i, x) + b.  (3.10)
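Equations 3.9 and 3.10 can be exercised on toy data. The sketch below is our own illustration (an RBF kernel and hypothetical parameter values γ = 10^3 and kernel width σ = 0.5, fitting a sine curve): it assembles the linear system of equation 3.9, solves it with a dense solver, and evaluates the estimate of equation 3.10 at the training points:

```python
import numpy as np

# Toy training data: y = sin(x) sampled on a grid (illustrative only).
x = np.linspace(0.0, 2 * np.pi, 20)
y = np.sin(x)
r, gamma, sigma = len(x), 1e3, 0.5

# RBF kernel matrix Q_ij = K(x_i, x_j) = exp(-(x_i - x_j)^2 / (2 sigma^2)).
Q = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma**2))

# Linear system of eq. 3.9: [[Q + I/gamma, e_r], [e_r^T, 0]] [alpha; b] = [y; 0].
e_r = np.ones(r)
A = np.block([[Q + np.eye(r) / gamma, e_r[:, None]],
              [e_r[None, :], np.zeros((1, 1))]])
sol = np.linalg.solve(A, np.append(y, 0.0))
alpha, b = sol[:r], sol[r]

# LS-SVM estimate of eq. 3.10 evaluated at the training points.
g = Q @ alpha + b
print("sum(alpha) =", alpha.sum())           # constraint of eq. 3.9
print("max training error =", np.max(np.abs(g - y)))
```

Because α_k = γ e_k, the training residuals equal α_k/γ: a larger γ gives a tighter fit at the cost of robustness, which motivates the modification discussed next.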
It is well known that γ is a regularization parameter. If γ is made small, then ‖ŵ‖_2^2 becomes important and the error terms e_k may be large; hence γ should be large for a small approximation error. In that case, however, the least squares (LS) cost function dominates and the estimate is not robust. To overcome this drawback concerning robustness, Suykens, Brabanter, Lukas, and Vandewalle (2002) presented a modification of the robust LS-SVM function estimate that weights the error variables, such that the robust estimates of α_k and b are solutions of the following linear equations,

(Q + D_γ) α + b e_r − y = 0,
Σ_{k=1}^{r} α_k = 0,  (3.11)

where D_γ = diag(1/(λ_1 γ), . . . , 1/(λ_r γ)) and {λ_k} are weighting factors based on the error variables {e_k}. Here we propose an alternative modification of the robust LS-SVM function estimate. From γ e_k = α_k, we see that the Lagrange multipliers {α_k} are directly proportional to the error terms e_k, where e_k indicates the difference between y_k and g(x_k). So, to reduce the impact of outliers on the estimates, a bound on the error e_k should be imposed, that is, |e_k| ≤ ε; as a result, |α_k| ≤ γ ε (k = 1, . . . , r). Thus, the proposed robust estimates of α_k and b are solutions of the following piecewise equations,

f(α − (Q + (1/γ) I_1) α − b e_r + y) − α = 0,
Σ_{k=1}^{r} α_k = 0,  (3.12)

where f(u) = [f(u_1), . . . , f(u_r)]^T and, for i = 1, . . . , r,

f(u_i) = −γ ε if u_i < −γ ε;  f(u_i) = u_i if −γ ε ≤ u_i ≤ γ ε;  f(u_i) = γ ε if u_i > γ ε.
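One way to sanity-check equation 3.12 is to note that when the clipping level γε exceeds every |α_k|, the standard LS-SVM solution of equation 3.9 is a fixed point of the piecewise system. A small self-contained check (our own toy data with a linear kernel and hypothetical parameters γ = 100, ε = 10):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data and a linear kernel: Q_ij = x_i^T x_j (illustrative only).
X = rng.normal(size=(10, 2))
y = X @ np.array([1.0, -0.5]) + 0.01 * rng.normal(size=10)
r, gamma, eps = len(y), 100.0, 10.0
Q = X @ X.T
e_r = np.ones(r)

# Solve the LS-SVM linear system of eq. 3.9 for (alpha, b).
A = np.block([[Q + np.eye(r) / gamma, e_r[:, None]],
              [e_r[None, :], np.zeros((1, 1))]])
sol = np.linalg.solve(A, np.append(y, 0.0))
alpha, b = sol[:r], sol[r]

# Piecewise map of eq. 3.12: u = alpha - (Q + I/gamma) alpha - b e_r + y, then clip.
u = alpha - (Q + np.eye(r) / gamma) @ alpha - b * e_r + y
f_u = np.clip(u, -gamma * eps, gamma * eps)

# Since (Q + I/gamma) alpha + b e_r - y = 0 here, u = alpha; with gamma*eps
# above max|alpha_k|, f(u) = alpha and the residual of eq. 3.12 vanishes.
print("residual =", np.max(np.abs(f_u - alpha)))
```

With a tighter bound γε, the clipping caps the multipliers, and hence the errors e_k = α_k/γ, which is what makes the estimate robust to outliers.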
It is easy to see that the solution of the piecewise equation 3.12 satisfies |α_i| ≤ γ ε, and thus |e_k| ≤ ε. Compared with the robust estimate of Suykens, Brabanter, et al. (2002), the proposed robust estimate needs only the two parameters γ and ε and does not need information about the unknown error variables {e_k}. Moreover, when ε = +∞, the piecewise equations 3.12 reduce to the linear equations 3.9. So the present modification can be viewed as a significant generalization of the LS-SVM method.

3.3 Measurement Fusion Method. We now propose a measurement fusion method as follows.

Step 1: Compute the optimal weight vector w* by solving equation 3.5, and let the optimal fused data be z*(t) = y_t^T w*, where y_t = [y_1(t), . . . , y_τ(t)]^T.

Step 2: Compute the optimal weight coefficients of the robust LS-SVM function estimate by solving equation 3.12.

Step 3: Incorporate the optimal fused data into the robust LS-SVM function estimate.

Remark 1. According to the ergodic theorem, we see that

E[z^2(t)] = lim_{N→+∞} (1/N) Σ_{t=1}^{N} z^2(t) = lim_{N→+∞} (1/N) Σ_{t=1}^{N} (Σ_{i=1}^{τ} w_i y_i(t))^2.
So, in practice, finding the optimal solution of equation 3.5 can be replaced by solving the following optimization problem:

minimize  (1/N) Σ_{t=1}^{N} (Σ_{i=1}^{τ} w_i y_i(t))^2
subject to  Σ_{i=1}^{τ} w_i = 1.  (3.13)
When the sample variance matrix B = (1/N) Σ_{t=1}^{N} y_t y_t^T is nonsingular, the above optimization problem has the closed-form solution w* = B^{-1} e / (e^T B^{-1} e), where e = [1, . . . , 1]^T ∈ R^τ. However, B may be singular or nearly singular owing to dependence among the sensor data, in which case the closed-form solution gives a poor estimate. To overcome the singularity, we introduce another, equivalent formulation. By the well-known projection theorem (Kinderlehrer & Stampacchia, 1980), w* = [w_1*, . . . , w_τ*]^T is the optimal solution of equation 3.13 if and only if w* satisfies

w* = (I − (1/τ) e e^T)(I − B) w* + e/τ,

where e ∈ R^τ, I ∈ R^{τ×τ} is the unit matrix, and y_t = [y_1(t), . . . , y_τ(t)]^T. Furthermore, the above equation can be written as

(τ B + e e^T (I − β B)) w* = (1 − β) e e^T B w* + e, ∀β ∈ R.

When 0 < β < 1, the matrix (τ B + e e^T (I − β B)) is nonsingular, and we have

w* = (1 − β)(τ B + e e^T (I − β B))^{-1} e e^T B w* + (τ B + e e^T (I − β B))^{-1} e.  (3.14)
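As an illustration (our own toy second-moment matrix and the arbitrary choice β = 0.5), the closed-form weights and the singularity-robust form of equation 3.14 agree: writing equation 3.14 as the linear system (I − (1 − β) P e e^T B) w = P e, with P = (τB + e e^T(I − βB))^{-1}, and solving it reproduces w* = B^{-1}e/(e^T B^{-1} e):

```python
import numpy as np

rng = np.random.default_rng(2)
tau, beta = 4, 0.5

# A toy symmetric positive definite second-moment matrix B (illustrative only).
M = rng.normal(size=(tau, tau))
B = M @ M.T + 0.1 * np.eye(tau)
e = np.ones(tau)
I = np.eye(tau)

# Closed-form optimal fusion weights: w* = B^{-1} e / (e^T B^{-1} e).
Binv_e = np.linalg.solve(B, e)
w_closed = Binv_e / (e @ Binv_e)

# Eq. 3.14 rearranged as a linear system in w.
P = np.linalg.inv(tau * B + np.outer(e, e) @ (I - beta * B))
w_314 = np.linalg.solve(I - (1 - beta) * P @ np.outer(e, e) @ B, P @ e)

print("w_closed =", w_closed)
print("w_314    =", w_314)
print("sum of weights =", w_314.sum())
```

The advantage of the second form is that it stays well posed when B itself is nearly singular, since only (τB + e e^T(I − βB)) is inverted.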
Remark 2. In addition to the case of multiple-sensor observed data, we can create suboptimal fused data directly from single-sensor observed data {y(t)}. Given a sensor number τ ≥ 1, we can combine a set of adjacent data of y(t) such that the suboptimal fused data z*(t) = Σ_{k=−τ}^{τ} w_k* y(t + δk) satisfy

minimize  (1/N) Σ_{t=1}^{N} (Σ_{k=−τ}^{τ} w_k y(t + δk))^2
subject to  Σ_{k=−τ}^{τ} w_k = 1,  (3.15)

where 0 < δ ≤ 1. The suboptimal fused data may still reduce the noise error because, when δτ is small enough, we have y(t + δk) ≈ x(t) + n_k(t) (k = −τ, . . . , τ).
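A minimal sketch of the idea behind remark 2 (our own synthetic signal; uniform weights w_k = 1/(2τ+1) instead of the optimal ones, and δ = 1): averaging adjacent samples of a slowly varying noisy signal already reduces the measurement noise error:

```python
import numpy as np

rng = np.random.default_rng(3)
N, tau = 2000, 2  # window uses 2*tau + 1 adjacent samples

# Slowly varying signal plus measurement noise (both hypothetical).
t = np.arange(N)
x = np.sin(2 * np.pi * t / 500.0)
y = x + 0.3 * rng.normal(size=N)

# Suboptimal fused data with uniform weights (a crude stand-in for eq. 3.15).
w = np.full(2 * tau + 1, 1.0 / (2 * tau + 1))
z = np.convolve(y, w, mode="same")

# Compare noise errors away from the boundaries.
sl = slice(tau, N - tau)
mse_raw = np.mean((y[sl] - x[sl]) ** 2)
mse_fused = np.mean((z[sl] - x[sl]) ** 2)
print("raw MSE   =", mse_raw)
print("fused MSE =", mse_fused)
```

With uniform weights, the noise variance drops roughly by the window length, at the cost of a small bias because x(t + δk) ≠ x(t) exactly.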
where 0 < δ ≤ 1. The suboptimal fused data may still reduce the noise error because when δτ is small enough, we have y(t + δk) ≈ x(t) + nk (t) (k = −τ, . . . , τ ). Remark 3. Xia (1996), developed a continuous-time recurrent neural network for solving linear piecewise equations 3.12 as follows, d dt
α(t) b(t)
=
ˆ 0 + I1 )( f (u(t)) − α(t)) − er erT α(t) (Q erT f (u(t)) − 2erT α(t)
,
(3.16)
ˆ 0 = Q + 1 I1 . According to the where u = α − (Q + γ1 I1 )α − ber + y and Q γ well-known Euler method, a discrete-time iterative algorithm for solving
equation 3.12 can be the following,

α^(k+1) = α^(k) − h_k [(Q̂_0 + I_1)(f(u^(k)) − α^(k)) − e_r e_r^T α^(k)],
b^(k+1) = b^(k) − h_k [e_r^T f(u^(k)) − 2 e_r^T α^(k)],
where h_k > 0 is a learning step length.

4 Cooperative Learning Algorithm

In this section, we present a cooperative learning algorithm to implement the proposed measurement fusion method. Using remarks 1 and 3, we can derive learning algorithms A and B.

Learning Algorithm A for Optimal Fused Data. Given the sensor number τ, the measurement number N, the initial weight vector w^(0), 0 < β < 1, and δ_1 > 0:

0. Compute A_τ = (1 − β) q_τ e^T B and q_τ = P_τ e, where e is defined in equation 3.14 and

B = (1/N) Σ_{t=1}^{N} y_t y_t^T,  P_τ = (τ B + e e^T (I − β B))^{-1},  y_t = [y_1(t), . . . , y_τ(t)]^T.

i. Compute

w^(k) = A_τ w^(k−1) + q_τ.  (4.1)

ii. If ‖w^(k) − w^(k−1)‖_2 ≤ δ_1, then go to iii. Otherwise, let k = k + 1 and go to i.

iii. Compute the optimal fused data:

z^(k)(t) = Σ_{i=1}^{τ} w_i^(k) y_i(t), (t = 1, . . . , N).
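Learning algorithm A is a plain fixed-point iteration whose fixed point is the closed-form weight vector. A worked sketch (our own toy example: τ = 3 uncorrelated sensors with variances 1, 2, and 4, so B = diag(1, 2, 4) and the exact optimum is w* = (4/7, 2/7, 1/7); β = 0.5 is an arbitrary choice):

```python
import numpy as np

tau, beta = 3, 0.5
B = np.diag([1.0, 2.0, 4.0])  # toy sample second-moment matrix (hypothetical)
e = np.ones(tau)
I = np.eye(tau)

# Step 0 of learning algorithm A.
P = np.linalg.inv(tau * B + np.outer(e, e) @ (I - beta * B))
q = P @ e
A = (1 - beta) * np.outer(q, e) @ B

# Steps i-ii: iterate w^(k) = A w^(k-1) + q until convergence.
w = np.zeros(tau)
for _ in range(100):
    w_new = A @ w + q
    if np.linalg.norm(w_new - w) <= 1e-12:
        w = w_new
        break
    w = w_new

print("iterated w  =", w)
print("closed form =", np.linalg.solve(B, e) / (e @ np.linalg.solve(B, e)))
```

Here A_τ is rank one with eigenvalue c/(1 + c), where c = (1 − β)τ/(e^T B^{-1} e) > 0, so the eigenvalue lies in (0, 1) and the iteration converges geometrically, consistent with the exponential bound of theorem 3.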
Learning Algorithm B for Robust LS-SVM Function Estimate. A kernel function K(·, ·), simulation parameters γ and ε, the initial weight coefficients (α^(0), b^(0)), and δ_2 > 0 are given, together with data samples {y(t), u(t)}_{t=1}^{r} ⊂ {y_i(t), u(t)}_{t=1}^{N}.
i. Compute ϕ_y(t) = [y(t − 1), y(t − 2), . . . , y(t − p), u(t − 1), . . . , u(t − q)]^T, and compute Q̂ = (Q + I_1/γ)/s, ŷ = y/s, and e_1 = e_r/s, where Q, y, and e_r are defined in equation 3.12 and

s = ‖ [Q + I_1/γ, e_r; −e_r^T, 0] ‖_2.

ii. Compute u^(k) = (I_1 − Q̂) α^(k) − b^(k) e_1 + ŷ and

f(u^(k)) = [f(u_1^(k)), . . . , f(u_r^(k))]^T,

where f(u_i) is defined in equation 3.12.

iii. Compute

α^(k+1) = α^(k) − h [(Q̂ + I_1)(f(u^(k)) − α^(k)) − e_1 e_1^T α^(k)],
b^(k+1) = b^(k) − h [e_1^T f(u^(k)) − 2 e_1^T α^(k)],  (4.2)

where h > 0 is a fixed step length.

iv. Compute r(α^(k), b^(k)) = ‖f(u^(k)) − α^(k)‖_2^2 + (e_1^T α^(k))^2. If r(α^(k), b^(k)) ≤ δ_2, then output (α^(k), b^(k)). Otherwise, let k = k + 1 and go to step ii.

Cooperative Learning Algorithm.

Step 1: Find the optimal fused data {z*(t)} by learning algorithm A.

Step 2: Find the optimal weights {α_i*} and b* by learning algorithm B.

Step 3: Compute the optimal measurement fusion function estimate, given by

g(ϕ_{z*}(t)) = Σ_{i=1}^{r} α_i* K(ϕ_y(i), ϕ_{z*}(t)) + b*, (t = 1, . . . , N),  (4.3)
where ϕ_{z*}(t) = [z*(t − 1), z*(t − 2), . . . , z*(t − p), u(t − 1), . . . , u(t − q)]^T.

Although learning algorithms A and B are mutually independent, the cooperative learning algorithm can perform the two learning algorithms in parallel and combine their output results with a feedforward network, which implements equation 4.3. Figure 1 displays a block diagram of the proposed cooperative learning algorithm.

Figure 1: Cooperative learning algorithm for nonlinear system identification.

Compared with conventional individual learning algorithms for system identification, the proposed cooperative learning algorithm adaptively combines two individual learning algorithms to obtain an improved performance under the MSE criterion. Compared with the LS-SVM function estimate, the proposed optimal measurement fusion function estimate has a smaller MSE.

Remark 4. In general, the RBF kernel function is a reasonable first choice since it can handle the nonlinear case. (For a detailed explanation, see Hsu, Chang, & Lin, 2003.)

5 Performance Analysis

In this section, we prove that the proposed optimal measurement fusion function estimate has an improved performance and that the proposed cooperative learning algorithm converges globally to the optimal measurement fusion function estimate.

Theorem 1. Assume that {n_i(t)} is a stationary gaussian process with zero mean. Then the optimal fused data are unbiased and have a smaller MSE than the observed data {y_i(t)}.
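Before the formal proof, an empirical illustration of theorem 1 (our own synthetic setup: τ = 5 sensors with unequal, hypothetical noise variances; the fusion weights are computed from the sample second-moment matrix as in remark 1):

```python
import numpy as np

rng = np.random.default_rng(4)
tau, N = 5, 5000

# Hypothetical signal and per-sensor gaussian measurement noise.
x = np.sin(2 * np.pi * np.arange(N) / 200.0)
noise_std = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
Y = x[None, :] + noise_std[:, None] * rng.normal(size=(tau, N))  # y_i(t)

# Optimal fusion weights from the sample second moment B (remark 1).
B = (Y @ Y.T) / N
e = np.ones(tau)
Binv_e = np.linalg.solve(B, e)
w = Binv_e / (e @ Binv_e)

# Fused data z*(t) = sum_i w_i* y_i(t).
z = w @ Y

mse_sensors = np.mean((Y - x[None, :]) ** 2, axis=1)
mse_fused = np.mean((z - x) ** 2)
print("per-sensor MSE:", mse_sensors)
print("fused MSE:     ", mse_fused)
```

Because Σ_i w_i = 1, the signal passes through the fusion unchanged while the independent noise terms partially cancel, so the fused MSE falls below even the best single sensor's MSE.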
Proof. Let the optimal fused data be

z*(t) = Σ_{i=1}^{τ} w_i* y_i(t),

where {w_i*} is the optimal fusion solution of the following:

minimize  E[(Σ_{i=1}^{τ} w_i y_i(t))^2]
subject to  Σ_{i=1}^{τ} w_i = 1.  (5.1)
Then

E[z*(t)] = Σ_{i=1}^{τ} w_i* E[y_i(t)] = Σ_{i=1}^{τ} w_i* E[x(t)] = E[x(t)].
Thus, the optimal fused data are unbiased. We now show that z*(t) has a smaller MSE than y_i(t). On one side,

E[(z*(t) − x(t))^2] = E[z*(t)^2] + E[x^2(t)] − 2E[z*(t) x(t)].

Since E[y_i(t) x(t)] = E[x(t) x(t)] + E[n_i(t) x(t)] = E[x^2(t)],

E[z*(t) x(t)] = Σ_{i=1}^{τ} w_i* E[y_i(t) x(t)] = Σ_{i=1}^{τ} w_i* E[x^2(t)] = E[x^2(t)].

Then E[(z*(t) − x(t))^2] = E[z*(t)^2] − E[x^2(t)] and E[z*(t)^2] = E[(y_t^T w*)^2] = (w*)^T E[y_t y_t^T] w*, where y_t = [y_1(t), . . . , y_τ(t)]^T. Since w* is the optimal fusion solution,

(w*)^T E[y_t y_t^T] w* = 1/(e^T R^{-1} e),

where e = [1, . . . , 1]^T ∈ R^τ, R = E[y_t y_t^T], and w* = [w_1*, . . . , w_τ*]^T. Note that

e^T R^{-1} e ≥ e^T e λ_min(R^{-1}) = e^T e / λ_max(R).
It follows that

1/(e^T R^{-1} e) ≤ λ_max(R)/(e^T e) = λ_max(R)/τ.

Let {λ_i}_{i=1}^{τ} be the eigenvalues of R. Then, when τ ≥ 2, λ_max(R) < Σ_{i=1}^{τ} λ_i since R is symmetric and positive definite. Thus,

1/(e^T R^{-1} e) ≤ λ_max(R)/τ < (1/τ) Σ_{i=1}^{τ} λ_i = (1/τ) Σ_{i=1}^{τ} E[y_i^2(t)].

Hence, E[(z*(t) − x(t))^2]
0 such that |z^(k)(t) − z*(t)| ≤ η e^{−µk}, ∀k ≥ 1.

Proof. The proof is given in the appendix.

Theorem 4. Let 0 < h < 1. Learning algorithm B can converge globally to the optimal weights of the robust LS-SVM function estimate.

Proof. The proof is given in the appendix.

With theorems 2, 3, and 4, we conclude that the proposed cooperative learning algorithm can converge globally to an optimal measurement fusion function estimate with a smaller MSE than the LS-SVM function estimate.

Remark 5. Finally, it should be pointed out that the proposed measurement fusion method can be applied to other function estimates, such as the DWO function estimate, to reduce the MSE.

6 Two Applications

In this section, we apply the proposed measurement fusion method to ARMA signal modeling and spatial temporal signal modeling.

6.1 Application to ARMA Modeling. ARMA models are widely used in many fields, such as signal and image processing and digital communications (Tekalp et al., 1985; Tekalp, 1995). Consider the following discrete-time linear system,

x(t) = Σ_{i=1}^{p} a_i x(t − i) + Σ_{j=1}^{q} b_j u(t − j) + w(t),  (6.1)
where {a_i} and {b_j} are the unknown AR and MA parameters of the system, p and q are the known model orders, the signals x(t) and u(t) are the output and input of the system, and w(t) is driving white noise with zero mean. Assume that a sensor system is employed and {y_i(t)} is a set of noise-corrupted measurements of x(t):

y_i(t) = x(t) + n_i(t), (i = 1, . . . , τ),
where n_i(t) is additive gaussian white noise with zero mean and τ is the sensor number. Our task is to estimate the ARMA model, including the unknown AR and MA parameter vectors, based on the noisy measurements {y_i(t)} and {u(t)}. Note that the ARMA model, equation 6.1, can be rewritten in the form of equation 2.1 as

x(t) = θ^T ϕ_x(t) + w(t),

where ϕ_x(t) = [x(t − 1), . . . , x(t − p), u(t − 1), . . . , u(t − q)]^T, θ = [a^T, b^T]^T, a = [a_1, . . . , a_p]^T, and b = [b_1, . . . , b_q]^T. Thus, we take the linear kernel function in the robust LS-SVM function estimate. We choose a set of data points {y_1(t), ϕ_y(t)}_{t=1}^{r} as the training set, where ϕ_y(t) = [y_1(t − 1), . . . , y_1(t − p), u(t − 1), . . . , u(t − q)]^T. By learning algorithm B, we find the optimal weight coefficients {α_k*}, where b is set to zero since there is no constant term in the ARMA model. The robust LS-SVM function estimate is then expressed as

g(ϕ_y(t)) = Σ_{k=1}^{r} α_k* ϕ_y(k)^T ϕ_y(t),  (6.2)

and the parameters to be estimated are

θ* = [a*^T, b*^T]^T = Σ_{k=1}^{r} α_k* ϕ_y(k).  (6.3)
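A compact sketch of equations 6.2 and 6.3 on synthetic data (our own toy AR(2) system with coefficients (0.5, 0.2), clean measurements, r = 200 training rows, and a hypothetical γ = 10^3; with a linear kernel the dual solve is ordinary kernel ridge regression, and θ* = Σ_k α_k* ϕ_y(k) recovers the parameter vector):

```python
import numpy as np

rng = np.random.default_rng(5)
theta_true = np.array([0.5, 0.2])  # toy AR(2) coefficients (hypothetical)
N, p, gamma = 500, 2, 1e3

# Simulate x(t) = 0.5 x(t-1) + 0.2 x(t-2) + w(t).
x = np.zeros(N)
for t in range(p, N):
    x[t] = theta_true @ x[t - p:t][::-1] + 0.3 * rng.normal()

# Regression vectors phi(t) = [x(t-1), x(t-2)]^T and targets (r training rows).
r = 200
Phi = np.column_stack([x[p - 1:p - 1 + r], x[p - 2:p - 2 + r]])
y = x[p:p + r]

# Linear kernel: Q = Phi Phi^T; solve (Q + I/gamma) alpha = y with b = 0 (eq. 6.2).
Q = Phi @ Phi.T
alpha = np.linalg.solve(Q + np.eye(r) / gamma, y)

# Eq. 6.3: theta* = sum_k alpha_k phi(k).
theta_hat = Phi.T @ alpha
print("theta_hat =", theta_hat)
```

By the push-through identity, Φ^T(ΦΦ^T + I/γ)^{-1} y = (Φ^TΦ + I/γ)^{-1} Φ^T y, so θ* coincides with the primal ridge estimate.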
We next use the fusion technique to improve the ARMA modeling. Let y_t = [y_1(t), . . . , y_τ(t)]^T and B = (1/N) Σ_{t=1}^{N} y_t y_t^T. By learning algorithm A, we can get the optimal fused data as follows:

z*(t) = Σ_{k=1}^{τ} w_k* y_k(t), (t = 1, . . . , N).

Incorporating the optimal fused data {z*(t)} into equation 6.2, we have the optimal measurement fusion function estimate,

g(ϕ_{z*}(t)) = Σ_{k=1}^{r} α_k* ϕ_y(k)^T ϕ_{z*}(t),  (6.4)
where ϕ_{z*}(t) = [z*(t − 1), . . . , z*(t − p), u(t − 1), . . . , u(t − q)]^T. As an immediate result of theorems 1 and 2, we have the following corollary.

Corollary 1. The optimal measurement fusion function estimate in equation 6.4 has a smaller MSE than that in equation 6.2.

For robust ARMA system identification, a support vector machine method was developed by Rojo-Alvarez et al. (2004). This approach uses an ε-insensitive cost function, so that the parameter learning problem is equivalent to a quadratic convex programming problem. By comparison, the parameter learning problem here is a piecewise equation, which generalizes the parameter learning problem of the LS-SVM. The resulting function estimate can be shown to have a smaller MSE than the existing SVM and LS-SVM function estimates.

6.2 Application to Spatial Temporal Modeling. The problem of spatial temporal modeling has become of special interest. It arises in many subjects, such as radio communications, radio astronomy, radar surveillance, wireless communications, and other fields (Tekalp, 1995; Cressie & Majure, 1997). For instance, radar backscatter from a sea surface is basically a spatial temporal phenomenon (Haykin & Puthusserypady, 1997). The scattered signals come from a moving surface, and thus spatial effects in nearby areas cannot be ignored for accurate radar detection. A spatial temporal dynamical model can be described by the following equations:

x(t, i) = h(x(t − 1, i)),
c(t, i) = x(t, i) + n(t, i),  (6.5)
where h is an unknown nonlinear function, x(t, i) is the state trajectory of the dynamical system, c(t, i) is the observation, n(t, i) is the measurement noise with zero mean, t is the discrete-time step, and i is the lattice point or the spatial position. x(t − 1, i) is a state vector for a temporal process. Our goal is to obtain an underlying nonlinear spatial temporal predictor from the noisy measurement c(t, i). To apply the proposed measurement fusion method to spatial temporal modeling, we first reconstruct the state vector as the time-delayed vector at
the ith cell, x̂(t − 1, i) = [x(t − 1, i), . . . , x(t − p, i)]^T, where p is the embedding dimension in time. Then spatial temporal dynamic modeling is reformulated as the identification problem defined in equation 2.1. We next take a sensor number τ ≥ 1 such that {c(t, i + j), j = −τ, . . . , τ} are adjacent observed signals of c(t, i) in the spatial domain. According to the result given by Xia, Leung, and Chan (2006), we see that

c(t, i + j) ≈ c(t, i) + v_{i+j}(t),  (6.6)

where v_{i+j}(t) is a zero-mean process. We combine these adjacent observed signals by using learning algorithm A, where y_t = [c(t, i − τ), . . . , c(t, i), . . . , c(t, i + τ)]^T. Let the optimal fused data be

z*(t, i) = Σ_{j=−τ}^{τ} w_j* c(t, i + j),

where {w_j*}_{j=−τ}^{τ} are the optimal weight coefficients. We then choose a set of data points in a temporal process, {c(t, i), c(t, i)}_{t=1}^{r}, with the delay vector c(t, i) = [c(t − 1, i), . . . , c(t − p, i)]^T, as the training set. By learning algorithm B, we can obtain the optimal weight coefficients, and the robust LS-SVM predictor can be represented by

g(c(t, i)) = Σ_{k=1}^{r} α_k* K(c(k, i), c(t, i)) + b*.  (6.7)

Finally, after passing the optimal fused data to the robust LS-SVM predictor, we have the optimal measurement fusion predictor,

g(z*(t, i)) = Σ_{k=1}^{r} α_k* K(c(k, i), z*(t, i)) + b*,  (6.8)

where z*(t, i) = Σ_{j=−τ}^{τ} w_j* c(t, i + j) and c(t, i + j) = [c(t − 1, i + j), . . . , c(t − p, i + j)]^T. As an immediate result of theorems 1 and 2, we have the following corollary:
Corollary 2. The optimal measurement fusion predictor in equation 6.8 has a smaller MSE than that in equation 6.7.

Recently, an RBF-CML predictor for the modeling of sea clutter was developed (Leung et al., 2000). For spatial temporal modeling, two prediction fusion methods were developed by using high-order cumulants (Xia et al., 2006; Xia & Leung, 2006). Compared with the RBF-CML predictor, the proposed measurement fusion predictor extends the RBF-CML predictor to general spatial temporal modeling. Moreover, unlike the learning parameters in the RBF-CML predictor, our learning parameters {w_j*} and {α_k*} are optimal solutions of the mean square cost function and the quadratic cost function, respectively, and the proposed measurement fusion predictor can be shown to have an improved performance in the MSE. Compared with the two prediction fusion methods, the proposed measurement fusion method can be simply implemented by the proposed cooperative learning algorithm with a low computational complexity. Finally, it should be noted that one earlier spatial temporal method, called kriged Kalman filter modeling, was developed by Mardia, Goodall, Redfern, and Alonso (1998). The optimal partial prediction is built by the well-known method of kriging, and the temporal effects are analyzed by using the models underlying the Kalman filtering method (Sahu & Mardia, 1999). The space-time Kalman filtering approach of Wikle and Cressie (1999) was then proposed to reduce the number of dimensions within the kriged Kalman filter model framework. Kriging is a common interpolation technique for making spatial inferences at locations, and thus the kriged Kalman filter method differs from the presented method, which is based on the function approximation technique.

7 Illustrative Examples

In this section, we give several illustrative examples to demonstrate the effectiveness of the proposed algorithms. The simulation is conducted in Matlab.

7.1 Example 1.
Consider the following model system, given in Lu, Ju, and Chon (2001),

x(t) = 0.35x(t − 1) + 0.32x(t − 2) + 0.1x(t − 3) − 0.54u(t) + 0.34u(t − 1) + 0.23u(t − 2) + 0.21u(t − 3),  (7.1)
where u(t) is a white gaussian process with variance 0.5. Assume that {y_i(t)} is a set of noise-corrupted measurements of x(t),

y_i(t) = x(t) + n_i(t), (i = 1, . . . , τ),
where n_i(t) is measurement white gaussian noise with zero mean and variance 0.1. By using u(t), we collect 500 data samples, which are divided into a training set of 200 samples and a testing set of 300 samples. To perform learning algorithm B, we let the regression vector be ϕ_y(t) = [y(t − 1), . . . , y(t − 3), u(t), . . . , u(t − 3)]^T and take the simulation parameters r = 200, γ = 100, and ε = 0.5. In view of equation 6.3, we have the following parameter estimate,

θ̂ = [0.3525, 0.3179, 0.0959, −0.5396, 0.3392, 0.2289, 0.2076]^T,  (7.2)

with l_2 norm error 0.0064. We compare the proposed method with the affine geometry method (Lu et al., 2001) and the improved least squares method (Zheng, 1999). The affine geometry method gives a parameter estimate whose l_2 norm error is 0.08, and the improved least squares method gives a parameter estimate whose l_2 norm error is 0.016. Thus, the parameter estimate obtained here has a smaller norm error. We next consider system modeling. From equation 6.2, we see that the robust LS-SVM function estimate is given by g(ϕ_y(t)) = θ̂^T ϕ_y(t). Let the initial weight be zero, and perform learning algorithm A for the optimal fused data, where τ = 8. The optimal measurement fusion function estimate is thus g(ϕ_{z*}(t)) = θ̂^T ϕ_{z*}(t), where ϕ_{z*}(t) = [z*(t − 1), . . . , z*(t − 3), u(t), . . . , u(t − 3)]^T. Figure 2 shows the variation of the predictive signal from 200 to 500 using the proposed function estimate. By comparison, Figure 3 shows the variation of the predictive signal from 200 to 500 using the LS-SVM function estimate with the same simulation parameters. Figure 4 displays the absolute error between the original signal and the predictive signals given by the two function estimates. From Figure 4, we can see that the proposed function estimate has much smaller errors than the LS-SVM function estimate.

7.2 Example 2. Consider the nonlinear ARMA model system,
x(t) = 0.32x(t − 1) + 0.3x(t − 2) − 0.18x(t − 1)u(t − 1) − 0.2u(t − 2)^2 + 0.8u(t) − 0.23u(t − 2) + 0.53u(t − 3),
(7.3)
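As a concrete illustration, the data-generation step of this example (the state x(t) of equation 7.3 plus τ noisy measurement channels yi(t) = x(t) + ni(t)) can be sketched as follows. This is a sketch, not code from the letter: the zero initial conditions and the random seed are assumptions made here, while the variances (0.4 for u(t), 0.5 for the measurement noise) and τ = 8 follow the text.

```python
import math
import random

def simulate_example2(T=250, tau=8, seed=0):
    """Simulate x(t) from equation 7.3 plus tau noisy measurement
    channels y_i(t) = x(t) + n_i(t).  Zero initial conditions are
    assumed; u(t) has variance 0.4, n_i(t) has variance 0.5."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, math.sqrt(0.4)) for _ in range(T)]
    x = [0.0] * T
    for t in range(3, T):
        x[t] = (0.32 * x[t - 1] + 0.3 * x[t - 2]
                - 0.18 * x[t - 1] * u[t - 1] - 0.2 * u[t - 2] ** 2
                + 0.8 * u[t] - 0.23 * u[t - 2] + 0.53 * u[t - 3])
    y = [[x[t] + rng.gauss(0.0, math.sqrt(0.5)) for t in range(T)]
         for _ in range(tau)]
    return u, x, y

u, x, y = simulate_example2()
```

Following the text, the first 100 samples of each yi(t) would then form the training set and the remaining 150 the testing set.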
Y. Xia and M. Kamel

Figure 2: The variation of two signals versus the number of measurements in example 1. The solid line corresponds to the true signal and the dotted line to the signal by the measurement fusion predictor.
Figure 3: The variation of two signals versus the number of measurements in example 1. The solid line corresponds to the true signal and the dotted line to the signal by the LS-SVM predictor.
Figure 4: The absolute error variation of two-signal prediction in example 1. The solid line corresponds to the measurement fusion predictor and the dotted line to the LS-SVM predictor.
where u(t) is a white gaussian process with variance 0.4. Assume that {yi(t)} is a set of noise-corrupted measurements of x(t): yi(t) = x(t) + ni(t), (i = 1, . . . , τ), where ni(t) is measurement white gaussian noise with zero mean and variance 0.5. We collect 250 data samples by using u(t), divided into a training set of 100 data samples and a testing set of 150 data samples. To perform the proposed cooperative learning algorithm, we take the regression vector ϕy(t) = [y(t − 1), y(t − 2), u(t), u(t − 1), u(t − 2), u(t − 3)]T. We choose the gaussian RBF function as a kernel function and let r = 200, γ = 200, and ε = 0.5 in learning algorithm B. We take τ = 8 in learning algorithm A with the zero initial weight.

Figure 5 shows the variation of the predictive signal from 100 to 250 using the proposed function estimate. By comparison, Figure 6 shows the variation of the predictive signal from 100 to 250 using the LS-SVM function estimate with the same simulation parameters. Figure 7 displays the absolute error between the original signal and the predictive signal given by the two function estimates. It can be seen from
Figure 5: The variation of two signals versus the number of measurements in example 2. The solid line corresponds to the true signal and the dotted line to the signal by the measurement fusion predictor.
Figure 6: The variation of two signals versus the number of measurements in example 2. The solid line corresponds to the true signal and the dotted line to the signal by the LS-SVM predictor.
Figure 7: The absolute error variation of two-signal prediction in example 2. The solid line corresponds to the measurement fusion predictor and the dotted line to the LS-SVM predictor.
Figure 7 that the proposed function estimate has much smaller errors than the LS-SVM function estimate. Furthermore, to evaluate the fit between the simulated output and the measured output, we use the measure percentage given in Roll et al. (2005). The computed result was obtained by averaging 100 independent Monte Carlo simulations. The LS-SVM method has a fit of 51.8% (±1.8%), while the proposed method gives a better fit of 55.4% (±1.6%).

7.3 Example 3. Consider a nonlinear benchmark system given by Narendra and Li (1996). The system is defined in state-space form by

x1(t + 1) = (x1(t)/(1 + x1(t)^2) + 1) sin(x2(t)),
x2(t + 1) = x2(t) cos(x2(t)) + x1(t) exp(−(x1(t)^2 + x2(t)^2)/8) + u(t)^3/(1 + u(t)^2 + 0.5 cos(x1(t) + x2(t))),
y(t + 1) = x1(t + 1)/(1 + 0.5 sin(x2(t + 1))) + x2(t + 1)/(1 + 0.5 sin(x1(t + 1))) + n(t),    (7.4)

where n(t) is a white gaussian noise with variance 0.1 and the states are assumed not to be measurable. We collect 300 data samples by using a uniformly distributed random input u(t) ∈ [−2.5, 2.5]. Also, to validate the
Figure 8: The variation of two signals versus the number of measurements in example 3. The solid line corresponds to the true signal and the dotted line to the signal by the measurement fusion predictor.
learning model, the input signal u(t) = sin(2πt/10) + sin(2πt/25), t = 1, . . . , 200,
was used. We take the regression vector as follows: ϕ y (t) = [y(t − 1), y(t − 2), y(t − 3), u(t − 1), u(t − 2), u(t − 3)]T .
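For reference, one step of the benchmark system in equation 7.4 can be written out directly. The zero initial state below is an assumption made here for illustration, and the measurement noise n(t) is omitted; the validation input is the sinusoidal signal given in the text.

```python
import math

def narendra_li_step(x1, x2, u):
    """One step of the benchmark system of equation 7.4 (state update
    and noise-free output; the noise n(t) is added separately)."""
    x1_next = (x1 / (1.0 + x1 ** 2) + 1.0) * math.sin(x2)
    x2_next = (x2 * math.cos(x2)
               + x1 * math.exp(-(x1 ** 2 + x2 ** 2) / 8.0)
               + u ** 3 / (1.0 + u ** 2 + 0.5 * math.cos(x1 + x2)))
    y_next = (x1_next / (1.0 + 0.5 * math.sin(x2_next))
              + x2_next / (1.0 + 0.5 * math.sin(x1_next)))
    return x1_next, x2_next, y_next

# Validation input: u(t) = sin(2*pi*t/10) + sin(2*pi*t/25), t = 1..200,
# starting from an assumed zero initial state.
x1 = x2 = 0.0
ys = []
for t in range(1, 201):
    u = math.sin(2 * math.pi * t / 10) + math.sin(2 * math.pi * t / 25)
    x1, x2, yv = narendra_li_step(x1, x2, u)
    ys.append(yv)
```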
To perform learning algorithm A for the optimal fused data, we consider z(t) = Σ_{i=−4}^{4} w_i y(t + 0.2i), and let the initial weight be zero. To perform learning algorithm B, we take r = 200, γ = 100, and ε = 1, and choose the gaussian RBF function as a kernel function. Figure 8 shows the variation of the predictive signal using the proposed function estimate. By comparison, Figure 9 shows the variation of the predictive signal using the LS-SVM function estimate with the same simulation parameters. Figure 10 displays the absolute error between the original signal and the predictive signal given by the two function estimates. It is easy to see that the function estimate here has smaller errors than the LS-SVM function estimate. Moreover, the LS-SVM method gives the computed result of fit 45.3% (±2.3%), while the proposed
Figure 9: The variation of two signals versus the number of measurements in example 3. The solid line corresponds to the true signal and the dotted line to the signal by the LS-SVM predictor.
Figure 10: The absolute error variation of two-signal prediction in example 3. The solid line corresponds to the proposed measurement fusion predictor and the dotted line to the LS-SVM predictor.
Table 1: Computed Results of the Measure Percentage by Two Methods at Different Regularization Parameters in Example 3.

γ          2^−2    2       2^4     2^7     2^10    2^13
LS-SVM     0.57    0.61    0.55    0.50    0.47    0.46
Proposed   0.61    0.66    0.59    0.55    0.52    0.50
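The γ values in Table 1 form an exponentially growing grid, the approximate cross-validation strategy discussed below in the text. A minimal sketch of this selection loop, where fit_score is a hypothetical stand-in (not part of the letter) for training a predictor at a given γ and computing its fit percentage:

```python
def select_gamma(fit_score, exponents=(-2, 1, 4, 7, 10, 13)):
    """Evaluate a fit criterion over the exponentially growing
    regularization grid of Table 1 and return the best gamma."""
    grid = [2.0 ** e for e in exponents]
    scores = {g: fit_score(g) for g in grid}
    best = max(scores, key=scores.get)  # gamma with the largest fit
    return best, scores
```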
method gives a better fit of 51.7% (±2.1%). Furthermore, the computed result by the proposed method can be compared with the ones obtained by the DWO method and the neural network with 20 hidden sigmoidal units given in Roll et al. (2005): the DWO method gave a 49.7% fit, and the neural network with 20 hidden sigmoidal units reached a 47.1% fit.

Finally, we compare the results of the measure percentage of both the LS-SVM method and the proposed method by choosing exponentially growing sequences of γ, which may be viewed as an approximate cross-validation procedure to identify good regularization parameters (Hsu et al., 2003). Computed results of the measure percentage of the two methods are listed in Table 1. It can be seen that the proposed method gives better results than the LS-SVM method.

7.4 Example 4. Consider real-life sea clutter data (Leung et al., 2000). The radar used for data collection is a coherent dual-polarized X-band radar. The radar site was located at 44°36.72'N, 63°25.41'W on a cliff at Osborne Head, Nova Scotia, Canada, facing the Atlantic Ocean, about 30 m above mean sea level, with an open ocean view of about 130 degrees. The database contains a variety of targets, including small boats and beachball targets. Two data sets are used in this study to investigate the spatial temporal dynamics of sea clutter (Drosopoulos, 1994). They are both staring data (ST1 and ST2), where the antenna is aimed at a single azimuth at all times. Their original radar images are plotted in Figures 11a and 12a, respectively. Here, the y-axis denotes the range, and the x-axis represents the time. When visible, the targets appear in the staring images as a dark horizontal line. Since the target was not anchored, it fluctuates with the waves, causing it to be easier to detect at some times than at others. The target, which is a boat, appears as a solid dark line in the image without any breaking.
To perform the proposed cooperative learning algorithm, we choose the simulation parameters defined in equation 6.8 as p = 10, r = 100, γ = 100, and ε = 1, and we take the gaussian RBF function as a kernel function. For the measurement fusion, we fuse five neighboring cells in the spatial domain. The resulting images are plotted in Figures 11b and 12b, respectively.

We now perform a comparison among the proposed measurement fusion predictor, the robust LS-SVM predictor, and the LS-SVM predictor. Figure 13 displays the mean square prediction errors (MSE) of ST1, where the solid line corresponds to the robust LS-SVM predictor and the
Figure 11: (a) Radar image of the staring data set 1 (ST1). (b) Prediction radar image of ST1.
dashed line to the proposed measurement fusion predictor. Figure 15 displays the mean square prediction errors (MSE) of ST2, where the solid line corresponds to the robust LS-SVM predictor and the dashed line to the proposed measurement fusion predictor. Figure 14 displays the MSE of ST1, where the solid line corresponds to the robust LS-SVM predictor with ε = 1, the dot-dashed line corresponds to the robust LS-SVM predictor with ε = 16, and the dashed line to the LS-SVM predictor (ε = +∞). Figure 16 displays the MSE of ST2, where
Figure 12: (a) Radar image of the staring data set 2 (ST2). (b) Prediction radar image of ST2.
the solid line corresponds to the robust LS-SVM predictor with ε = 1, the dot-dashed line corresponds to the robust LS-SVM predictor with ε = 16, and the dashed line to the LS-SVM predictor. Based on the obtained results, we see that the proposed measurement fusion predictor produces a better model than the robust LS-SVM predictor and the LS-SVM predictor. This is because the measurement fusion approach can successfully reduce the MSE by incorporating the extra spatial information in the model. This observation matches our theoretical analysis.
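As a rough sketch of the two ingredients of this comparison, equal-weight fusion of five neighboring range cells (a simplified stand-in for the optimal weights produced by learning algorithm A, which this code does not reproduce) and the per-range-cell MSE curves of Figures 13 to 16 can be written as:

```python
def fuse_neighbors(image, k=5):
    """Equal-weight average over k neighboring range cells (rows);
    a simplified stand-in for the learned fusion weights of
    learning algorithm A."""
    half = k // 2
    fused = []
    for r in range(len(image)):
        lo, hi = max(0, r - half), min(len(image), r + half + 1)
        rows = image[lo:hi]
        fused.append([sum(col) / len(rows) for col in zip(*rows)])
    return fused

def mse_per_cell(true_image, pred_image):
    """Mean square prediction error for each range cell, as plotted
    against range in Figures 13 to 16."""
    return [sum((t - p) ** 2 for t, p in zip(trow, prow)) / len(trow)
            for trow, prow in zip(true_image, pred_image)]
```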
Figure 13: Prediction error of ST1 data. The solid line corresponds to the robust LS-SVM method with parameter ε = 1 and the dotted line to the measurement fusion method.
Figure 14: Prediction error of ST1 data. The dotted line corresponds to the LS-SVM method, the dot-dashed line corresponds to the robust LS-SVM method with parameter ε = 16, and the solid line to the robust LS-SVM method with parameter ε = 1.
Figure 15: Prediction error of ST2 data. The solid line corresponds to the robust LS-SVM method with parameter ε = 1 and the dotted line to the measurement fusion method.
Figure 16: Prediction error of ST2 data. The dotted line corresponds to the LS-SVM method, the dot-dashed line corresponds to the robust LS-SVM method with parameter ε = 16, and the solid line to the robust LS-SVM method with parameter ε = 1.
8 Conclusion

In this letter, we developed a novel measurement fusion method for identifying a general nonlinear noisy system. Our method has three main features. First, we created optimal fused data with a smaller MSE than the observed data. Second, a measurement fusion function estimate combines the optimal fused data and a robust LS-SVM function estimate. Theoretically, the proposed optimal measurement fusion estimator has improved performance, with a smaller mean square error than existing function estimators without optimal measurement fusion. Third, the proposed measurement fusion method is effectively implemented by a new cooperative learning algorithm, which consists of two parallel learning algorithms and a feedforward network. We showed that the proposed cooperative learning algorithm can converge globally to the optimal measurement fusion estimate. Two applications, to ARMA signal modeling and to spatial temporal signal modeling, were illustrated. Experimental results confirmed that the proposed measurement fusion method can provide a more accurate model than several existing identification methods.

Appendix: Proofs of Theorems 3 and 4

A.1 Proof of Theorem 3. First, the matrix (τ B + ee^T (I − β B)) is nonsingular when 0 < β < 1. Next, let G = (1 − β) Pτ ee^T B, where Pτ = (τ B + ee^T (I − β B))^{−1}. Then

ρ(G) ≤ ||G||_2 ≤ |1 − β| ρ(e^T B Pτ e) ≤ |1 − β| |e^T B Pτ e|.

Now,

B Pτ = (τ I + ee^T (I − β B) B^{−1})^{−1}
     = (1/τ) (I + (1/τ) ee^T (I − β B) B^{−1})^{−1}
     = (1/τ) (I + (1/τ) ee^T (B^{−1} − β I))^{−1}.

Using the inverse matrix formulation, we have

(I + (1/τ) ee^T (B^{−1} − β I))^{−1} = I − (1/τ) ee^T (B^{−1} − β I)/(1 + θ),

where

θ = (1/τ) e^T (B^{−1} − β I) e = (1/τ) e^T B^{−1} e − β.

Note that e^T e = τ. It follows that

e^T B Pτ e = 1 − θ/(1 + θ) = 1/(1 + θ).

Then

ρ(G) ≤ ||G||_2 ≤ |1 − β|/|1 + θ| = |1 − β|/|1 − β + (1/τ) e^T B^{−1} e| < 1,

since e^T B^{−1} e > 0 and 0 < β < 1. So the sequence {w(k)} converges globally to a point w∗.

We next show that w∗ is an optimal fusion solution of equation 3.13. We easily see that w∗ satisfies

w∗ = (1 − β) Pτ ee^T B w∗ + Pτ e.

Then Pτ^{−1} w∗ = (1 − β) ee^T B w∗ + e. That is,

(τ B + ee^T (I − β B)) w∗ = (1 − β) ee^T B w∗ + e.

Thus, (τ B + ee^T) w∗ = ee^T B w∗ + e. It follows that

w∗ = (I − (1/τ) ee^T)(I − B) w∗ + e/τ

and

e^T w∗ = e^T [(I − (1/τ) ee^T)(I − B) w∗ + e/τ] = 1.

By the well-known projection theorem (Kinderlehrer & Stampacchia, 1980), we see that w∗ is an optimal fusion solution of equation 3.13. Finally, we have

w(k) − w∗ = G (w(k − 1) − w∗) = G^k (w(0) − w∗).

Let µ = −ln(η), where

η = |1 − β|/|1 − β + (1/τ) e^T B^{−1} e| < 1.

Then µ > 0. From the fact that e^{−µ} = η, it follows that

||w(k) − w∗||_2 ≤ ||G^k||_2 ||w(0) − w∗||_2 ≤ ||w(0) − w∗||_2 e^{−µk}.

It follows that

|z^{(k)}(t) − z∗(t)| = |Σ_{i=1}^{τ} w_i^{(k)} y_i(t) − Σ_{i=1}^{τ} w_i^∗ y_i(t)| ≤ max_t {||Y(t)||_2} ||w(k) − w∗||_2 ≤ max_t {||Y(t)||_2} ||w(0) − w∗||_2 e^{−µk},
where Y(t) = [y1(t), . . . , yτ(t)]^T. Thus, the proposed learning algorithm A converges exponentially to the optimal fused data.

A.2 Proof of Theorem 4. It is enough to prove that the proposed learning algorithm B converges globally to the optimal weights of the robust LS-SVM function estimate. Let z∗ = (α∗, b∗) denote the optimal weights of the robust LS-SVM function estimate, and let

F(α, b) = [Q̂, e1; −e1^T, 0] [α; b] − [y; 0],

where Q̂ and e1 are defined in equation 4.2. According to the projection theorem (Kinderlehrer & Stampacchia, 1980), we see that for any z ∈ R^{r+1},

(z − π(z))^T (π(z) − z∗) ≥ 0,

where π(z) = [f(z1), . . . , f(zr), z_{r+1}]^T and f(zi) is defined in equation 3.12.

Let z = z(k) − F(α(k), b(k)), where z(k) = [α(k), b(k)]^T is generated by learning algorithm B, and for brevity write v(k) = z(k) − f(z(k) − F(α(k), b(k))). Then

(z(k) − F(α(k), b(k)) − f(z(k) − F(α(k), b(k))))^T (f(z(k) − F(α(k), b(k))) − z∗) ≥ 0.

Thus,

(v(k) − (F(α(k), b(k)) − F(α∗, b∗)))^T (f(z(k) − F(α(k), b(k))) − z∗) ≥ 0.

It follows that

v(k)^T (F(α(k), b(k)) − F(α∗, b∗) + z(k) − z∗) ≥ ||v(k)||_2^2 + (z(k) − z∗)^T (F(α(k), b(k)) − F(α∗, b∗)).

It can be seen that

F(α(k), b(k)) − F(α∗, b∗) = [Q̂, e1; −e1^T, 0] [α(k) − α∗; b(k) − b∗].

Then, writing M = [Q̂ + I1, e1; −e1^T, 1],

v(k)^T M (z(k) − z∗) ≥ ||v(k)||_2^2 + (z(k) − z∗)^T [Q̂, e1; −e1^T, 0] (z(k) − z∗)
= ||v(k)||_2^2 + (α(k) − α∗)^T Q̂ (α(k) − α∗)
≥ ||v(k)||_2^2 + η ||α(k) − α∗||_2^2,    (A.1)

where η > 0. Now, by the definition of learning algorithm B,

z(k+1) = z(k) − h M v(k).    (A.2)

Then, using equation A.1, we have

||z(k+1) − z∗||_2^2 = ||z(k) − z∗||_2^2 + h^2 ||M v(k)||_2^2 − 2h (z(k) − z∗)^T M v(k)
≤ ||z(k) − z∗||_2^2 + h^2 ||M v(k)||_2^2 − 2h ||v(k)||_2^2 − 2hη ||α(k) − α∗||_2^2.

Note that ||M||_2^2 ≤ 2. It follows that

||z(k+1) − z∗||_2^2 ≤ ||z(k) − z∗||_2^2 + 2h^2 ||v(k)||_2^2 − 2h ||v(k)||_2^2 − 2hη ||α(k) − α∗||_2^2
≤ ||z(k) − z∗||_2^2 − 2h(1 − h) ||v(k)||_2^2 − 2hη ||α(k) − α∗||_2^2,

where h(1 − h) > 0. Similar to the analysis given in Xia, Leung, and Bossé (2002), we see that the sequence {z(k)} converges to a point ẑ = [α̂, b̂]^T. From equation A.2, we see that

ẑ = ẑ − h M (ẑ − f(ẑ − F(α̂, b̂))).

That is, ẑ = f(ẑ − F(α̂, b̂)). Thus, ẑ is a solution of the piecewise equations 3.12 and hence gives the optimal weight coefficients of the robust LS-SVM function estimate.

Acknowledgments

We thank the editor and reviewers for their encouragement and valued comments, which helped in improving the quality of the letter.

References

Abed-Meraim, K., Qui, W. Z., & Hua, Y. B. (1997). Blind system identification. Proceedings of the IEEE, 85, 1310–1322.
Bazaraa, M. S., Sherali, H. D., & Shetty, C. M. (1993). Nonlinear programming: Theory and algorithms. Hoboken, NJ: Wiley.
Chen, L. J., & Narendra, K. S. (2004). Identification and control of a nonlinear discrete-time system based on its linearization: A unified framework. IEEE Transactions on Neural Networks, 15, 663–673.
Chen, S., & Billings, S. A. (1992). Neural networks for nonlinear dynamic system modelling and identification. International Journal of Control, 56, 319–346.
Cressie, N., & Majure, J. J. (1997). Spatial-temporal statistical modeling of livestock waste in streams. Journal of Agricultural, Biological, and Environmental Statistics, 2, 24–47.
Drosopoulos, A. (1994). Description of the OHGR database (Tech. Note 94-14). Ottawa, Ont.: Defence Research Establishment.
Fan, J., & Gijbels, I. (1996). Local polynomial modelling and its applications. London: Chapman and Hall.
Hall, D. L., & Llinas, J. (1997). An introduction to multisensor data fusion. Proceedings of the IEEE, 85, 6–23.
Härdle, W. (1990). Applied nonparametric regression. Cambridge: Cambridge University Press.
Haykin, S., & Puthusserypady, S. (1997). Chaotic dynamics of sea clutter. Chaos, 7, 777–802.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Hsu, C.-W., Chang, C. C., & Lin, C. J. (2003). A practical guide to support vector classification (Tech. Rep.). Taiwan: Department of Computer Science and Information Engineering, National Taiwan University.
Kadirkamanathan, V., & Niranjan, M. (1993). A function estimation approach to sequential learning with neural networks. Neural Computation, 5, 954–975.
Kinderlehrer, D., & Stampacchia, G. (1980). An introduction to variational inequalities and their applications. New York: Academic Press.
Koukoulas, P., & Kalouptsidis, N. (2000). Second-order Volterra system identification. IEEE Transactions on Signal Processing, 48, 3574–3577.
Lee, J., & Mathews, V. J. (1993). A fast recursive least squares adaptive second-order Volterra filter and its performance analysis. IEEE Transactions on Signal Processing, 41, 1087–1097.
Leung, H., Hennessey, G., & Drosopoulos, A. (2000). Signal detection using the radial basis function coupled map lattice. IEEE Transactions on Neural Networks, 11, 1133–1151.
Ljung, L. (1999). System identification: Theory for the user (2nd ed.). Upper Saddle River, NJ: Prentice-Hall.
Lo, J. T. (1994). Synthetic approach to optimal filtering. IEEE Transactions on Neural Networks, 5, 803–811.
Lu, S., Ju, K. H., & Chon, K. H. (2001). A new algorithm for linear and nonlinear ARMA model parameter estimation using affine geometry. IEEE Transactions on Biomedical Engineering, 48, 1116–1122.
Mardia, K. V., Goodall, C., Redfern, E. J., & Alonso, F. J. (1998). The kriged Kalman filter. Test, 7, 217–285.
Mhaskar, H. N., & Hahm, N. (1997). Neural networks for functional approximation and system identification. Neural Computation, 9, 143–159.
Mukherjee, S., Osuna, E., & Girosi, F. (1997). Nonlinear prediction of chaotic time series using support vector machines. In Proc. of IEEE NNSP (pp. 24–26). Piscataway, NJ: IEEE Press.
Müller, K. R., Smola, A. J., Rätsch, G., Schölkopf, B. S., Kohlmorgen, J., & Vapnik, V. (1997). Predicting time series with support vector machines. In Proceedings of ICANN. Berlin: Springer.
Narendra, K. S., & Li, S. M. (1996). Neural networks in control systems. In P. Smolensky, M. C. Mozer, & D. E. Rumelhart (Eds.), Mathematical perspectives on neural networks (pp. 374–394). Mahwah, NJ: Erlbaum.
Rojo-Alvarez, J. L., Martinez-Ramon, M., de Prado-Cumplido, M., Artes-Rodriguez, A., & Figueiras-Vidal, A. R. (2004). Support vector method for robust ARMA system identification. IEEE Transactions on Signal Processing, 52, 155–164.
Roll, J., Nazin, A., & Ljung, L. (2005). Nonlinear system identification via direct weight optimization. Automatica, 41, 475–490.
Sahu, S. K., & Mardia, K. V. (1999). A Bayesian kriged Kalman model for short-term forecasting of air pollution levels. Applied Statistics, 54, 223–244.
Schoukens, J., Nemeth, J. G., Crama, P., Rolain, Y., & Pintelon, R. (2003). Fast approximate identification of nonlinear systems. Automatica, 39, 1267–1274.
Sjöberg, J., Zhang, Q. H., Ljung, L., Benveniste, A., Delyon, B., Glorennec, P. Y., Hjalmarsson, H., & Juditsky, A. (1995). Nonlinear black-box modeling in system identification: A unified overview. Automatica, 31, 1691–1724.
Smola, A. J., & Schölkopf, B. S. (1998). On a kernel-based method for pattern recognition, regression, approximation, and operator inversion. Algorithmica, 22, 211–231.
Suykens, J. A. K., Gestel, T. V., Brabanter, J. D., Moor, B. D., & Vandewalle, J. (2002). Least squares support vector machines. Singapore: World Scientific.
Suykens, J. A. K., Brabanter, J., Lukas, L., & Vandewalle, J. (2002). Weighted least squares support vector machines: Robustness and sparse approximation. Neurocomputing, 48, 85–105.
Tekalp, A. M. (1995). Digital video processing. Upper Saddle River, NJ: Prentice Hall.
Tekalp, A. M., Kaufman, H., & Woods, J. W. (1985). Fast recursive estimation of the parameters of a space-varying autoregressive image model. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33, 469–472.
Varshney, P. K. (1997). Multisensor data fusion. Journal of Electronics and Communication Engineering, 9, 245–253.
Wikle, C. K., & Cressie, N. (1999). A dimension-reduced approach to space-time Kalman filtering. Biometrika, 86, 815–829.
Xia, Y. S. (1996). A new neural network for solving linear and quadratic programming problems. IEEE Transactions on Neural Networks, 7, 1544–1547.
Xia, Y. S., & Leung, H. (2006). Spatial temporal prediction based on optimal fusion. IEEE Transactions on Neural Networks, 17, 975–988.
Xia, Y. S., Leung, H., & Bossé, E. (2002). Neural data fusion algorithms based on a linearly constrained least square method. IEEE Transactions on Neural Networks, 13, 320–329.
Xia, Y. S., Leung, H., & Chan, H. (2006). A prediction fusion method for reconstructing spatial temporal dynamics using support vector machines. IEEE Transactions on Circuits and Systems II: Express Briefs, 53, 62–66.
Zheng, W. X. (1999). A least-squares based method for autoregressive signals in the presence of noise. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 46, 81–85.
Received December 16, 2005; accepted August 23, 2006.
LETTER
Communicated by John Platt
Efficient Computation and Model Selection for the Support Vector Regression Lacey Gunter
[email protected] Ji Zhu
[email protected] Department of Statistics, University of Michigan, Ann Arbor, MI 48109, U.S.A.
In this letter, we derive an algorithm that computes the entire solution path of the support vector regression (SVR). We also propose an unbiased estimate for the degrees of freedom of the SVR model, which allows convenient selection of the regularization parameter.

1 Introduction

The support vector regression (SVR) is a popular tool for function estimation problems, and it has been widely used in many applications in the past decade, such as signal processing (Vapnik, Golowich, & Smola, 1996), time series prediction (Müller et al., 1997), and neural decoding (Shpigelman, Crammer, Paz, Vaadia, & Singer, 2004). In this letter, we focus on the regularization parameter of the SVR and propose an efficient algorithm that computes the entire regularized solution path. We also propose an unbiased estimate for the degrees of freedom of the SVR, which allows convenient selection of the regularization parameter.

We briefly introduce the SVR and refer readers to Vapnik (1995) and Smola and Schölkopf (2004) for a detailed tutorial. Suppose we have a set of training data (x1, y1), . . . , (xn, yn), where the input xi ∈ R^p is a vector with p predictor variables (attributes), and the output yi ∈ R denotes the response (target value). The standard criterion for fitting a linear ε-SVR is:

min_{β0,β} (1/2) ||β||^2 + C Σ_{i=1}^{n} (ξi + δi),    (1.1)
subject to yi − β0 − β^T xi ≤ ε + ξi,
           β0 + β^T xi − yi ≤ ε + δi,
           ξi, δi ≥ 0, i = 1, . . . , n.

The idea is to disregard errors as long as they are less than ε, and the ξi, δi are nonnegative slack variables that allow for deviations larger than ε. C is

Neural Computation 19, 1633–1655 (2007)
© 2007 Massachusetts Institute of Technology
L. Gunter and J. Zhu

Figure 1: The ε-insensitive loss function. Depending on the values of (yi − fi), the data points can be divided into five sets: left, right, center, left elbow, and right elbow.
a cost parameter that controls the trade-off between the flatness of the fitted model and the amount up to which deviations larger than ε are tolerated.

Alternatively, we can formulate the problem using a loss + penalty criterion (Vapnik, 1995; Smola & Schölkopf, 2004):

min_{β0,β} Σ_{i=1}^{n} |yi − β0 − β^T xi|_ε + (λ/2) β^T β,    (1.2)

where |r|_ε is the so-called ε-insensitive loss function:

|r|_ε = 0 if |r| ≤ ε, and |r|_ε = |r| − ε otherwise.

Figure 1 plots the loss function. Notice that it has two nondifferentiable points at ±ε. The penalty is the L2-norm of the coefficient vector, the same as that used in ridge regression (Hoerl & Kennard, 1970). The regularization parameter λ in equation 1.2 corresponds to 1/C, with C as in equation 1.1, and it controls the trade-off between the ε-insensitive loss and the complexity of the fitted model.
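A direct transcription of the ε-insensitive loss (with ε written as eps):

```python
def eps_insensitive(r, eps):
    """The epsilon-insensitive loss of equation 1.2: zero inside the
    tube |r| <= eps, and linear with unit slope outside it."""
    return max(abs(r) - eps, 0.0)
```

Its two nondifferentiable points at ±eps are what give rise to the elbow sets used throughout the letter.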
In practice, one often maps x onto a high- (often infinite) dimensional reproducing kernel Hilbert space (RKHS) and fits a nonlinear kernel SVR model (Vapnik, 1995; Smola & Schölkopf, 2004):

min_{f ∈ H_K} Σ_{i=1}^{n} |yi − f(xi)|_ε + (λ/2) ||f||^2_{H_K},    (1.3)

where H_K is a structured RKHS generated by a positive definite kernel K(x, x'). This includes the entire family of smoothing splines and additive and interaction spline models (Wahba, 1990). Some other popular choices of K(·, ·) in practice are

dth degree polynomial: K(x, x') = (1 + ⟨x, x'⟩)^d,
Radial basis: K(x, x') = exp(−||x − x'||^2/(2σ^2)),

where d and σ are prespecified parameters. Using the representer theorem (Kimeldorf & Wahba, 1971), the solution to equation 1.3 has a finite form:

f(x) = β0 + (1/λ) Σ_{i=1}^{n} θi K(x, xi).    (1.4)

Notice that we write f(x) in a way that involves λ explicitly, and we will see later that θi ∈ [−1, 1]. Given the format of the solution 1.4, we can rewrite equation 1.3 in a finite form:

min_{β0,θ} Σ_{i=1}^{n} |yi − β0 − (1/λ) Σ_{i'=1}^{n} θ_{i'} K(xi, x_{i'})|_ε + (1/(2λ)) Σ_{i=1}^{n} Σ_{i'=1}^{n} θi θ_{i'} K(xi, x_{i'}),    (1.5)

which we will focus on for the rest of the letter.

Both equations 1.2 and 1.5 can be transformed into a quadratic programming problem; hence, most commercially available packages can be used to solve the SVR. In the past, many specific algorithms for the SVR have been developed, for example, interior point algorithms (Vanderbei, 1994; Smola & Schölkopf, 2004), subset selection algorithms (Osuna, Freund, & Girosi, 1997; Joachims, 1999), and sequential minimal optimization (Platt, 1999; Keerthi, Shevade, Bhattacharyya, & Murthy, 1999; Smola & Schölkopf, 2004). All these algorithms solve the SVR for a prefixed regularization parameter λ (or equivalently C). As in any other smoothing problem, to get a fitted model that performs well on future data (i.e., has small generalization error), the choice of the regularization parameter λ is critical.
In practice, people usually prespecify a grid of values for λ that cover a wide range and then use either cross-validation (Stone, 1974) or an upper bound on the generalization error (Chang & Lin, 2005) to select a value for λ. In this letter, we make two main contributions:

- We show that the solution θ(λ) is piecewise linear as a function of λ, and we derive an efficient algorithm that computes the exact entire solution path {θ(λ), 0 ≤ λ ≤ ∞}, ranging from the least regularized model to the most regularized model.
- Using the framework of Stein's unbiased risk estimation (SURE) theory (Stein, 1981), we propose an unbiased estimate for the degrees of freedom of the SVR model. Specifically, we consider the number of data points that are exactly on the two elbows (see Figure 1): yi − fi = ±ε. This estimate allows convenient selection of the regularization parameter λ.

We acknowledge that some of these results are inspired by one of the authors' earlier work in the support vector machine setting (Hastie, Rosset, Tibshirani, & Zhu, 2004).

Before delving into the technical details, we illustrate the concept of piecewise linearity of the solution path with a simple example. We generate 10 training observations using the famous sinc(·) function:
y=
sin(π x) + e, πx
(1.6)
where x is distributed as Uniform(−2, 2), and e is distributed as Normal(0, 0.2^2). We use the SVR with a one-dimensional spline kernel (Wahba, 1990),

K(x, x') = 1 + k1(x)k1(x') + k2(x)k2(x') − k4(|x − x'|),    (1.7)

where k1(t) = t − 1/2, k2 = (k1^2 − 1/12)/2, and k4 = (k1^4 − k1^2/2 + 7/240)/24. Figure 2 shows a subset of the piecewise linear solution path θ(λ) as a function of λ.

The rest of the letter is organized as follows. In section 2, we derive the algorithm that computes the entire solution path of the SVR. In section 3, we propose an unbiased estimate for the degrees of freedom of the SVR, which can be used to select the regularization parameter λ. In section 4, we present numerical results on both simulation data and real-world data. We conclude with a discussion section.
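Equation 1.7 can be transcribed directly; the helper functions below follow the definitions of k1, k2, and k4 given above:

```python
def k1(t):
    return t - 0.5

def k2(t):
    return (k1(t) ** 2 - 1.0 / 12) / 2

def k4(t):
    return (k1(t) ** 4 - k1(t) ** 2 / 2 + 7.0 / 240) / 24

def spline_kernel(x, xp):
    """One-dimensional spline kernel of equation 1.7 (Wahba, 1990)."""
    return 1 + k1(x) * k1(xp) + k2(x) * k2(xp) - k4(abs(x - xp))
```

Since k4 depends only on |x − x'|, the kernel is symmetric in its two arguments, as a kernel must be.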
Figure 2: A subset of the solution path θ(λ) as a function of λ. Different line types correspond to θi(λ) of different data points, and all paths are piecewise linear.
2 Algorithm

2.1 Problem Setup. Criterion 1.5 can be rewritten in an equivalent way:

min_{β0,θ} Σ_{i=1}^{n} (ξi + δi) + (1/(2λ)) θ^T K θ,
subject to −(δi + ε) ≤ yi − f(xi) ≤ (ξi + ε),
           ξi, δi ≥ 0, i = 1, . . . , n,

where

f(xi) = β0 + (1/λ) Σ_{i'=1}^{n} θ_{i'} K(xi, x_{i'}), i = 1, . . . , n,

and K = [K(xi, x_{i'})] is the n × n kernel matrix.
For the rest of the letter, we assume the kernel matrix K is positive definite. Then the above setting gives us the Lagrangian primal function:

L_P: (1/(2λ)) θ^T K θ + Σ_{i=1}^{n} (ξi + δi) + Σ_{i=1}^{n} αi (yi − f(xi) − ξi − ε) − Σ_{i=1}^{n} γi (yi − f(xi) + δi + ε) − Σ_{i=1}^{n} ρi ξi − Σ_{i=1}^{n} τi δi.
Setting the derivatives to zero, we arrive at

∂/∂θ :   θi = αi − γi    (2.1)
∂/∂β0 :  Σ_{i=1}^n αi = Σ_{i=1}^n γi    (2.2)
∂/∂ξi :  αi = 1 − ρi    (2.3)
∂/∂δi :  γi = 1 − τi,    (2.4)
where the Karush-Kuhn-Tucker conditions are

αi (yi − f(xi) − ξi − ε) = 0    (2.5)
γi (yi − f(xi) + δi + ε) = 0    (2.6)
ρi ξi = 0    (2.7)
τi δi = 0.    (2.8)
Since the Lagrange multipliers must be nonnegative, we can conclude from equations 2.3 and 2.4 that both 0 ≤ αi ≤ 1 and 0 ≤ γi ≤ 1. We also see from equations 2.5 and 2.6 that if αi is positive, then γi must be zero, and vice versa. These lead to the following relationships:

yi − f(xi) > ε        ⇒  αi = 1,       ξi > 0,  γi = 0,       δi = 0
yi − f(xi) < −ε       ⇒  αi = 0,       ξi = 0,  γi = 1,       δi > 0
yi − f(xi) ∈ (−ε, ε)  ⇒  αi = 0,       ξi = 0,  γi = 0,       δi = 0
yi − f(xi) = ε        ⇒  αi ∈ [0, 1],  ξi = 0,  γi = 0,       δi = 0
yi − f(xi) = −ε       ⇒  αi = 0,       ξi = 0,  γi ∈ [0, 1],  δi = 0.

Notice from equation 2.1 that for every λ, θi is equal to (αi − γi). Hence, using these relationships, we can define the following sets that will be used later when we calculate the regularization path of the SVR:
R = {i : yi − f(xi) > ε, θi = 1} (right of the elbows)
ER = {i : yi − f(xi) = ε, 0 ≤ θi ≤ 1} (right elbow)
C = {i : −ε < yi − f(xi) < ε, θi = 0} (center)
EL = {i : yi − f(xi) = −ε, −1 ≤ θi ≤ 0} (left elbow)
L = {i : yi − f(xi) < −ε, θi = −1} (left of the elbows).
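Given the residuals yi − f(xi), this five-way partition is straightforward to compute. The sketch below uses our own naming and introduces a small numerical tolerance for the elbow equality tests, which the paper does not specify:

```python
def partition_sets(residuals, eps, tol=1e-8):
    """Partition points into R, ER, C, EL, L from residuals r_i = y_i - f(x_i)."""
    sets = {"R": [], "ER": [], "C": [], "EL": [], "L": []}
    for i, r in enumerate(residuals):
        if abs(r - eps) <= tol:
            sets["ER"].append(i)   # right elbow: y_i - f(x_i) = eps
        elif abs(r + eps) <= tol:
            sets["EL"].append(i)   # left elbow:  y_i - f(x_i) = -eps
        elif r > eps:
            sets["R"].append(i)    # theta_i = 1
        elif r < -eps:
            sets["L"].append(i)    # theta_i = -1
        else:
            sets["C"].append(i)    # theta_i = 0
    return sets
```

For example, with residuals [0.3, 0.1, −0.1, 0.05, −0.3] and ε = 0.1, the points fall into R, ER, EL, C, and L, respectively.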
For points in R, L, and C, the values of θi are known; therefore, the algorithm will focus on points resting at the two elbows ER and EL.

2.2 Initialization. Initially, when λ = ∞, we can see from equation 1.4 that f(x) = β0. We can determine the value of β0 via a simple one-dimensional optimization. For simplicity, we focus on the case that all the values of yi are distinct and, furthermore, the initial sets ER and EL have at most one point combined (which is the usual situation). In this case, β0 will not be unique, and each of the θi will be either −1 or 1. This is so because we can easily shift β0 a little to the left or the right such that our θi do not change and we still get the same error. Since β0 is not unique, we can focus on one particular solution path, for example, by always setting β0 equal to one of its boundary values (thus keeping one point at an elbow). As λ decreases, the range of β0 shrinks, and at the same time, one or more points come toward the elbow. When a second point reaches an elbow, the solution of β0 becomes unique, and the algorithm proceeds from there. We note that this initialization is different from that of classification. In classification, quadratic programming is needed to solve for the initial solution, which can be computationally expensive (Hastie et al., 2004). However, this is not the case for regression. In the case of regression, the output y is continuous; hence, it is almost certain that no two yi's will be in ER and EL simultaneously when λ = ∞.

2.3 The Path. The algorithm focuses on the sets of points ER and EL. These points have either f(xi) = yi − ε with θi ∈ [0, 1] or f(xi) = yi + ε with θi ∈ [−1, 0]. As we follow the path, we will examine these sets until one or both of them change, at which time we will say an event has occurred. Thus, events can be categorized as:

1. The initial event, for which two points must enter the elbow(s).
2. A point from R has just entered ER, with θi initially 1.
3. A point from L has just entered EL, with θi initially −1.
4. A point from C has just entered ER, with θi initially 0.
5. A point from C has just entered EL, with θi initially 0.
6. One or more points in ER and/or EL have just left the elbow(s) to join either R, L, or C, with θi initially 1, −1, or 0, respectively.
Until another event has occurred, all sets will remain the same. As a point passes through ER or EL, its respective θi must change from 1 → 0 or −1 → 0 or vice versa. Relying on the fact that f(xi) = yi − ε or f(xi) = yi + ε for all points in ER or EL, respectively, we can calculate θi for these points. We use the superscript ℓ to index the sets above immediately after the ℓth event has occurred, and let θi^ℓ, β0^ℓ, and λ^ℓ be the parameter values immediately after the ℓth event. Also let f^ℓ be the function at this time. We define for convenience β0,λ = λ · β0 and hence β0,λ^ℓ = λ^ℓ · β0^ℓ. Then, since

f(x) = (1/λ) [β0,λ + Σ_{i=1}^n θi K(x, xi)],

for λ^{ℓ+1} < λ < λ^ℓ, we can write

f(x) = [f(x) − (λ^ℓ/λ) f^ℓ(x)] + (λ^ℓ/λ) f^ℓ(x)
     = (1/λ) [β0,λ − β0,λ^ℓ + Σ_{i ∈ (ER^ℓ ∪ EL^ℓ)} (θi − θi^ℓ) K(x, xi) + λ^ℓ f^ℓ(x)],

where the reduction occurs in the second line since the θi's are fixed for all points in R^ℓ, L^ℓ, and C^ℓ, and all points remain in their respective sets. Define |ER^ℓ| = nER and |EL^ℓ| = nEL, so for the nER + nEL points staying at the elbows, we have

yk − ε = (1/λ) [β0,λ − β0,λ^ℓ + Σ_{i ∈ (ER^ℓ ∪ EL^ℓ)} (θi − θi^ℓ) K(xk, xi) + λ^ℓ f^ℓ(xk)], ∀k ∈ ER^ℓ,
ym + ε = (1/λ) [β0,λ − β0,λ^ℓ + Σ_{i ∈ (ER^ℓ ∪ EL^ℓ)} (θi − θi^ℓ) K(xm, xi) + λ^ℓ f^ℓ(xm)], ∀m ∈ EL^ℓ.

To simplify, let νi = θi − θi^ℓ, i ∈ (ER^ℓ ∪ EL^ℓ), and ν0 = β0,λ − β0,λ^ℓ. Then

ν0 + Σ_{i ∈ (ER^ℓ ∪ EL^ℓ)} νi K(xk, xi) = (λ − λ^ℓ)(yk − ε), ∀k ∈ ER^ℓ,
ν0 + Σ_{i ∈ (ER^ℓ ∪ EL^ℓ)} νi K(xm, xi) = (λ − λ^ℓ)(ym + ε), ∀m ∈ EL^ℓ.
Also, by condition 2.2, we have

Σ_{i ∈ (ER^ℓ ∪ EL^ℓ)} νi = 0.
This gives us nER + nEL + 1 linear equations we can use to solve for the nER + nEL + 1 unknown variables νi and ν0. Now define K^ℓ to be a (nER + nEL) × (nER + nEL) matrix whose first nER rows contain the entries K(xk, xi), k ∈ ER^ℓ, i ∈ (ER^ℓ ∪ EL^ℓ), and whose remaining nEL rows contain the entries K(xm, xi), m ∈ EL^ℓ, i ∈ (ER^ℓ ∪ EL^ℓ); let ν denote the vector whose first nER components are νk, k ∈ ER^ℓ, and whose remaining nEL components are νm, m ∈ EL^ℓ. Finally, let y^ℓ be a (nER + nEL) vector with first nER entries (yk − ε) and last nEL entries (ym + ε). Using these, we have the following two equations:

ν0 1 + K^ℓ ν = (λ − λ^ℓ) y^ℓ,    (2.9)
1ᵀ ν = 0.    (2.10)

Simplifying further, if we let (in block notation)

A^ℓ = [0, 1ᵀ; 1, K^ℓ],  ν_0 = [ν0; ν],  and  y_0^ℓ = [0; y^ℓ],

then equations 2.9 and 2.10 can be combined as A^ℓ ν_0 = (λ − λ^ℓ) y_0^ℓ. Then, if A^ℓ has full rank, we can define

b^ℓ = (A^ℓ)^{−1} y_0^ℓ,    (2.11)

to give us

θi = θi^ℓ + (λ − λ^ℓ) b_i^ℓ, ∀i ∈ (ER^ℓ ∪ EL^ℓ),    (2.12)
β0,λ = β0,λ^ℓ + (λ − λ^ℓ) b_0^ℓ.    (2.13)
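One path step can be sketched as follows: build the bordered matrix A^ℓ from the elbow-set kernel block, solve equation 2.11, and advance θ and β0,λ linearly via equations 2.12 and 2.13. This is an illustrative sketch under our own naming, not the authors' implementation:

```python
import numpy as np

def path_step(K_elbow, y_elbow, theta_elbow, beta0_lam, lam_ell, lam):
    """Advance the elbow thetas and beta_{0,lambda} from lam_ell to lam."""
    m = len(y_elbow)
    # Bordered system A = [0, 1^T; 1, K_elbow] of size (m+1) x (m+1)
    A = np.zeros((m + 1, m + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K_elbow
    y0 = np.concatenate(([0.0], y_elbow))
    b = np.linalg.solve(A, y0)            # b = A^{-1} y0, equation 2.11
    d = lam - lam_ell
    theta_new = theta_elbow + d * b[1:]   # equation 2.12
    beta0_new = beta0_lam + d * b[0]      # equation 2.13
    return theta_new, beta0_new, b
```

Note that the constraint 1ᵀν = 0 (equation 2.10) is enforced automatically by the first row of the bordered system, so the updated θ's change by amounts that sum to zero.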
Thus, for λ^{ℓ+1} < λ < λ^ℓ, the θi and β0,λ proceed linearly in λ. Also,

f(x) = (λ^ℓ/λ) [f^ℓ(x) − g^ℓ(x)] + g^ℓ(x),    (2.14)

where

g^ℓ(x) = b_0^ℓ + Σ_{i ∈ (ER^ℓ ∪ EL^ℓ)} b_i^ℓ K(x, xi).

Given λ^ℓ, equations 2.12 and 2.14 allow us to compute λ^{ℓ+1}, the λ at which the next event will occur. This will be the largest λ less than λ^ℓ such that either θk for k ∈ ER^ℓ reaches 0 or 1, or θm for m ∈ EL^ℓ reaches 0 or −1, or one of the points in R^ℓ, L^ℓ, or C^ℓ reaches an elbow. The latter event will occur for a point j when

λ = λ^ℓ (f^ℓ(xj) − g^ℓ(xj)) / (yj − g^ℓ(xj) − ε), ∀j ∈ (R^ℓ ∪ C^ℓ),

and

λ = λ^ℓ (f^ℓ(xj) − g^ℓ(xj)) / (yj − g^ℓ(xj) + ε), ∀j ∈ (L^ℓ ∪ C^ℓ).
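The event search can be sketched as follows: solve θi(λ) = target for each elbow point using equation 2.12, pool those candidates with the outside-point values from the two formulas above, and take the largest admissible λ below λ^ℓ. Helper names are ours:

```python
def elbow_crossing_lambdas(theta_ell, b_elbow, targets, lam_ell):
    """Candidate lambdas where an elbow theta_i hits a bound, from
    theta_i(lam) = theta_i^ell + (lam - lam_ell) * b_i (equation 2.12)."""
    out = []
    for th, bi in zip(theta_ell, b_elbow):
        for t in targets:
            if bi != 0.0:
                out.append(lam_ell + (t - th) / bi)
    return out

def next_event_lambda(candidates, lam_ell):
    """The next event point is the largest candidate strictly below lam_ell."""
    valid = [lam for lam in candidates if 0.0 < lam < lam_ell]
    return max(valid) if valid else 0.0   # terminate as lambda approaches 0
```

For instance, an elbow point with θ^ℓ = 0.5 and slope b = 1.0 would hit θ = 0 at λ = λ^ℓ − 0.5 and θ = 1 at λ = λ^ℓ + 0.5; only the former lies below λ^ℓ and is a valid next event.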
We terminate the algorithm either when the sets R and L become empty or when λ has become sufficiently close to zero. In the latter case, we must have f − g^ℓ sufficiently small as well.

2.4 Computational Cost. The major computational cost for updating the solutions at any event involves two things: solving the system of (nER + nEL + 1) linear equations and computing g^ℓ(x). The former takes O((nER + nEL)²) calculations by using inverse updating and downdating, since the elbow sets usually differ by only one point between consecutive events, and the latter requires O(n(nER + nEL)) computations. According to our experience, the total number of steps taken by the algorithm is on average some small multiple c of n. Letting m be the average size of (ER ∪ EL), the approximate computational cost of the algorithm is O(cn²m + nm²). We make a note that errors can accumulate in the inverse updating and downdating as the number of steps increases. To get around this problem, we may directly compute the inverse using equation 2.11 after every several steps. As we will see in section 4, for a fixed underlying true function, the average size of (ER ∪ EL) does not seem to increase with n, so the direct computation of equation 2.11 does not add much extra computational cost.

3 Degrees of Freedom

It is well known that an appropriate value of λ is crucial for the performance of the fitted model in any smoothing problem. One advantage of computing
the entire solution path is to facilitate the selection of the regularization parameter. Degrees of freedom is an informative measure of the complexity of a fitted model. In this section, we propose an unbiased estimate for the degrees of freedom of the SVR, which allows convenient selection of the regularization parameter λ. Since the usual goal of regression analysis is to minimize the predicted squared-error loss, we study the degrees of freedom using Stein's unbiased risk estimation (SURE) theory (Stein, 1981). Given x, we assume y is generated according to a homoskedastic model, that is, the errors have a common variance:

y ~ (µ(x), σ²),

where µ is the true mean and σ² is the common variance. Notice that there is no restriction on the distributional form of y. Then the degrees of freedom of a fitted model f(x) can be defined as

df(f) = Σ_{i=1}^n cov(f(xi), yi)/σ².

Stein showed that under mild conditions, the quantity

Σ_{i=1}^n ∂f(xi)/∂yi    (3.1)

is an unbiased estimate of df(f). Later, Efron (1986) proposed the concept of expected optimism based on equation 3.1, and Ye (1998) developed Monte Carlo methods to estimate equation 3.1 for general modeling procedures. Meyer and Woodroofe (2000) discussed equation 3.1 in shape-restricted regression and also argued that it provided a measure of the effective dimension. (For detailed discussion and complete references, see Efron, 2004.) Notice that equation 3.1 measures the sum of the sensitivity of each fitted value with respect to the corresponding observed value. It turns out that in the case of SVR, for every fixed λ, Σ_{i=1}^n ∂fi/∂yi has an extremely simple formula:

d̂f ≡ Σ_{i=1}^n ∂fi/∂yi = |ER| + |EL|.    (3.2)

Therefore, |ER| + |EL| is a convenient unbiased estimate for the degrees of freedom of f(x).
In applying equation 3.2 to select the regularization parameter λ, we plug it into the generalized cross-validation (GCV) criterion (Craven & Wahba, 1979) for model selection:

Σ_{i=1}^n (yi − f(xi))² / (1 − d̂f/n)².    (3.3)
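In code, evaluating the criterion at a candidate λ is immediate once the fitted values and d̂f = |ER| + |EL| are available. The sketch below uses our own naming and the unnormalized form of equation 3.3; an extra 1/n factor would not change the minimizing λ:

```python
def gcv(y, fitted, df_hat):
    """GCV criterion of equation 3.3, with df_hat = |ER| + |EL| (equation 3.2)."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return rss / (1.0 - df_hat / n) ** 2
```

Along the computed path, one would evaluate gcv(...) at each event point and keep the λ with the smallest value.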
The GCV criterion is an approximation to the leave-one-out cross-validation criterion

Σ_{i=1}^n (yi − f^{−i}(xi))²,

where f^{−i}(xi) is the fit at xi with the ith point removed. The GCV can be mathematically justified in that, asymptotically, it minimizes the mean squared error for estimation of f. We shall not provide details but instead refer readers to O'Sullivan (1985). The advantages of this criterion are that it does not assume a known σ², and it avoids cross-validation, which is computationally more intensive. In practice, one can first use the path algorithm in section 2 to compute the entire solution path and then identify the value of λ that minimizes the GCV criterion.

We outline the proof of equation 3.2 in this section; the details are in the appendix. We make a note that the theory in this section considers the scenario when λ and x1, …, xn are fixed, while y1, …, yn can change; the proof relies closely on our algorithm in section 2 and follows the spirit of Zou, Hastie, and Tibshirani (2005). As we have seen in section 2, for a fixed response vector y = (y1, …, yn)ᵀ, there is a sequence of λ's, ∞ = λ^0 > λ^1 > λ^2 > ··· > λ^L = 0, such that in the interior of any interval (λ^{ℓ+1}, λ^ℓ), the sets R, L, C, ER, and EL are constant with respect to λ. These sets change only at each λ^ℓ. We thus define these λ^ℓ's as event points.

Lemma 1. For any fixed λ > 0, the set of y = (y1, …, yn)ᵀ such that λ is an event point is a finite collection of hyperplanes in Rⁿ. Denote this set as N_λ. Then for any y ∈ Rⁿ\N_λ, λ is not an event point.

Notice that N_λ is a null set, and Rⁿ\N_λ is of full measure.

Lemma 2. For any fixed λ > 0, θ_λ is a continuous function of y.

Lemma 3. For any fixed λ > 0 and any y ∈ Rⁿ\N_λ, the sets R, L, C, ER, and EL are locally constant with respect to y.
Theorem 1. For any fixed λ > 0 and any y ∈ Rⁿ\N_λ, we have the divergence formula

Σ_{i=1}^n ∂fi/∂yi = |ER| + |EL|.

We note that the theory applies when λ is not an event point. Specifically, in practice, for a given y, the theory applies when λ ≠ λ^0, λ^1, …, λ^L; when λ = λ^ℓ, we may continue to use |ER| + |EL|, or d̂f at λ^ℓ−, as an estimate of d̂f at λ^ℓ.

4 Numerical Results

In this section, we demonstrate our path algorithm and the selection of λ using the GCV criterion on both simulation data and real-world data.

4.1 Computational Cost. We first compare the computational cost of the path algorithm and that of the LIBSVM (Chang & Lin, 2001). We used the sinc(·) function, equation 1.6, from section 1. Both algorithms have been implemented in the R programming language, and the comparison was done on a Linux server with two AMD Opteron 244 processors running at 1.8 GHz with a 1 MB cache each and 2 GB of memory. The setup of the function and the error distribution are similar to those in section 1. We used the one-dimensional spline kernel, equation 1.7, and we generated n = 50, 100, 200, 400, 800 training observations from the sinc(·) function. For each simulation data set, we first ran our path algorithm to compute the entire solution path and retrieved the sequence of event points, λ^0 > λ^1 > λ^2 > ··· > λ^L; then, for each λ^ℓ, we ran the LIBSVM to get the corresponding solution. Elapsed computing times (in seconds) were recorded carefully using the system.time() function in R. We repeated this 30 times and computed the average elapsed times and their corresponding standard errors. The results are summarized in Table 1. The number of steps along the path (that is, the number of event points L) and the average size of the elbow, |ER ∪ EL|, were also recorded and are summarized in Table 1. As we can see, in terms of the elapsed time in computing the entire solution path, our path algorithm dominates the LIBSVM.
We can also see that the number of steps increases linearly with the size of the training data, and interestingly, the average elbow size does not seem to change much when the size of the training data increases. We note that these timings should be treated with great caution, as they can be sensitive to details of implementation (e.g., different overheads on I/O). We also note that the LIBSVM was implemented using the cold start scheme: for every value of λ , the training was done from scratch. In recent years, interest has grown in warm start algorithms that can save computing
Table 1: Computational Cost.

n     Path            LIBSVM           Number of Steps   |ER ∪ EL|
50    0.172 (0.030)   4.517 (4.044)    104.3 (16.6)      8.8 (0.8)
100   0.398 (0.057)   20.13 (29.49)    209.6 (28.1)      9.6 (0.6)
200   1.018 (0.114)   54.00 (30.62)    430.1 (46.2)      9.8 (0.4)
400   2.205 (0.180)   219.3 (239.8)    830.8 (44.0)      10.0 (0.6)
800   6.072 (0.594)   1123.1 (917.9)   1681.4 (122.7)    10.4 (0.4)

Notes: The first column, n, is the size of the training data; the second column, Path, is the elapsed time (in seconds) for computing the entire solution path using our path algorithm; the third column, LIBSVM, is the total elapsed time (in seconds) for computing all the solutions at the event points along the solution path (using the cold start scheme); the fourth column, Number of Steps, is the number of event points; the fifth column, |ER ∪ EL|, is the average elbow size at the event points. All results are averages of 30 independent simulations. The numbers in parentheses are the corresponding standard errors.
Table 2: Average Elapsed Time of the LIBSVM for a Single Value of λ^ℓ.

n     LIBSVM
50    0.042 (0.036)
100   0.090 (0.110)
200   0.123 (0.063)
400   0.261 (0.269)
800   0.656 (0.510)
time by starting from an appropriate initial solution (Gondzio & Grothey, 2001; Ma, Theiler, & Perkins, 2003). To further study the computational cost of our path algorithm, we also computed the average elapsed time of the LIBSVM for a single value of λ^ℓ, defined as follows:

Average elapsed time for a single value of λ^ℓ = Total elapsed time / Number of steps.

The results are summarized in Table 2. It takes our path algorithm about 4 to 10 times as long to compute the entire solution path as it takes the LIBSVM to compute a single solution. We note that these timings should be treated with some caution: for the LIBSVM, the computing time varies significantly for different values of λ, and it is very likely that in the total elapsed time for the LIBSVM, much computing time was wasted on useless (too big or too small) λ values. Finally, we have no intention of arguing that our path algorithm is the ultimate SVR computing tool, or that it dominates the LIBSVM; in fact, according to Table 2, it is very likely that even in terms of total elapsed time, the LIBSVM with a warm start scheme can be comparable to or even faster than the path algorithm. The advantage of our path algorithm is to give
a full presentation of the solution path without knowing the locations of the event points a priori. We hope our numerical results make our path algorithm an interesting and useful addition to the toolbox for computing the SVR.

4.2 Simulated Data. Next we demonstrate the selection of λ using the GCV criterion. For simulated data, we consider both additive and multiplicative kernels built from the one-dimensional spline kernel, equation 1.7, which are, respectively,

K(x, x′) = Σ_{j=1}^p K(x_j, x′_j)  and  K(x, x′) = Π_{j=1}^p K(x_j, x′_j).
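A sketch (with our own helper names) of how a one-dimensional base kernel k(s, t) is lifted to p dimensions in these two ways:

```python
import math

def additive_kernel(k, x, xp):
    """Sum of the base kernel over coordinates: K(x, x') = sum_j k(x_j, x'_j)."""
    return sum(k(s, t) for s, t in zip(x, xp))

def multiplicative_kernel(k, x, xp):
    """Product of the base kernel over coordinates: K(x, x') = prod_j k(x_j, x'_j)."""
    return math.prod(k(s, t) for s, t in zip(x, xp))
```

Either construction yields a valid kernel whenever the base kernel is positive definite, since sums and products of positive definite kernels are positive definite.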
Simulations were based on the following four functions found in Friedman (1991):

1. One-dimensional sinc(·) function:

   f(x) = sin(πx)/(πx) + e_1,

   where x is distributed as Uniform(−2, 2).

2. Five-dimensional additive model:

   f(x) = 0.1 e^{4x_1} + 4/(1 + e^{−20(x_2 − 0.5)}) + 3x_3 + 2x_4 + x_5 + e_2,

   where x_1, …, x_5 are distributed as Uniform(0, 1).

3. Alternating current series circuit:

   a. Impedance:

      f(R, ω, L, C) = [R² + (ωL − 1/(ωC))²]^{1/2} + e_3;

   b. Phase angle:

      f(R, ω, L, C) = tan^{−1}[(ωL − 1/(ωC))/R] + e_4;

   where (R, ω, L, C) are distributed uniformly in (0, 100) × (40π, 560π) × (0, 1) × (1, 11).

The errors e_i are distributed as N(0, σ_i²), where σ_1 = 0.19, σ_2 = 1, σ_3 = 218.5, and σ_4 = 0.18. We generated 300 training observations from each function, along with 10,000 validation observations and 10,000 test observations. For the first two simulations, we used the additive one-dimensional spline kernel, and for the second two simulations, the multiplicative one-dimensional spline kernel. We then found the λ that minimized the GCV criterion, equation 3.3.
Table 3: Simulation Results on Four Different Functions.

            Prediction Error                       Degrees of Freedom
Function    Gold Standard      GCV                 Gold Standard   GCV
1           0.0390 (0.0012)    0.0394 (0.0015)     21.9 (4.0)      17.2 (3.7)
2           1.116 (0.030)      1.135 (0.037)       24.9 (4.5)      22.2 (5.1)
3a          50779 (1885)       51143 (1941)        16.6 (3.3)      13.4 (3.2)
3b          0.0464 (0.0019)    0.0474 (0.0024)     41.9 (8.0)      40.8 (13.4)
The validation set was used to select the gold standard λ, which minimized the prediction error. Using these λ's, we calculated the prediction error with the test data for each criterion. If the fitted function is f(x), the prediction error is defined as

Prediction error = (1/10,000) Σ_{i=1}^{10,000} (yi − f(xi))².

We repeated this 30 times and computed the average prediction errors and their corresponding standard errors. We also compared the degrees of freedom selected by the two methods. The results are summarized in Table 3. As we can see, in terms of prediction accuracy, the GCV criterion performs close to the gold standard, and the GCV tends to select a slightly simpler model than the gold standard.

4.3 Real Data. For demonstration on real-world data, we consider the eight benchmark data sets used in Chang and Lin (2005) (available online at http://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/). Table 4 contains a summary of these data sets. To be consistent with their analysis, we also used the radial basis kernel, with the same values for σ² and ε as specified in Chang and Lin (2005). For each data set, we randomly split the data into training and test sets, with the training set comprising 80% of the data. We calculated the path and used GCV to select the best λ. This λ was used to get a prediction error on the test set. We also computed the prediction error on the test set for each λ and found the minimum prediction error. Notice that in this case, the test set serves more like a separate validation set for selecting a value of λ, and unlike in the simulation study, where there is another test set, this minimum prediction error is overly optimistic. We repeated this process 30 times and computed the average prediction errors and their corresponding standard errors. We also computed the degrees of freedom selected by the two different methods. The results are summarized in Table 5. In terms of prediction accuracy, the GCV performs close to the overly optimistic minimum, and once again, we observe that the GCV tends to select a slightly simpler model than the validation method.
Table 4: Summary of the Eight Benchmark Data Sets.

Data Set    n      p    ln(σ²)   ln(ε)
pyrim       74     27   3.4      −6.6
triazines   186    60   4.0      −4.7
mpg         392    7    0.4      −1.7
housing     566    13   1.4      −1.7
add10       1000   10   2.9      −1.2
cpusmall    1000   12   1.7      −2.1
spacega     1000   6    1.1      −4.6
abalone     1000   8    1.2      −2.1
Table 5: Real Data Results.

            Prediction Error                       Degrees of Freedom
Data Set    "Minimum"          GCV                 "Minimum"       GCV
pyrim       0.0037 (0.0034)    0.0067 (0.0071)     31.2 (11.3)     40.1 (9.0)
triazines   0.0185 (0.0073)    0.0247 (0.0083)     46.9 (37.7)     32.3 (36.2)
mpg         6.85 (1.92)        7.37 (2.38)         94.7 (20.7)     74.8 (10.0)
housing     9.87 (3.12)        10.84 (3.69)        226.3 (51.7)    186.2 (30.1)
add10       1.77 (0.20)        1.82 (0.20)         348.1 (35.1)    317.3 (38.8)
cpusmall    26.65 (10.22)      27.60 (10.28)       305.4 (53.9)    277.4 (29.8)
spacega     0.0119 (0.0015)    0.0124 (0.0015)     138.0 (35.2)    121.8 (27.5)
abalone     4.26 (1.04)        4.35 (1.05)         72.1 (20.4)     57.4 (21.7)
5 Discussion

In this letter, we have proposed an efficient algorithm that computes the entire regularization path of the SVR. The general technique employed is known as parametric programming via active sets in the convex optimization literature (Allgower & Georg, 1993). The closest work we have seen in the literature is the use of similar techniques in incremental learning for the SVM (DeCoste & Wagstaff, 2000) and the SVR (Ma et al., 2003). These authors, however, do not construct exact paths as we do, but rather focus on updating and downdating the solutions as more (or fewer) data arrive. We have also proposed an unbiased estimate for the degrees of freedom of the SVR model, which allows convenient selection of the regularization parameter λ. Using the solution path algorithm and the proposed degrees of freedom, the GCV criterion seems to work sufficiently well on both simulated data and real-world data. Due to the difficulty of also selecting the best ε for the SVR, an alternative algorithm exists that automatically adjusts the value of ε, called the ν-SVR (Schölkopf, Smola, Williamson, & Bartlett, 2000; Chang & Lin, 2002). The
linear ν-SVR can be written as

min_{β0, β, ε}  (1/2) ‖β‖² + C (νε + (1/n) Σ_{i=1}^n (ξi + δi))

subject to

yi − β0 − βᵀxi ≤ ε + ξi,
β0 + βᵀxi − yi ≤ ε + δi,
ξi, δi ≥ 0,  i = 1, …, n,

where ν > 0 is prespecified. In this scenario, ε is treated as another free parameter. There are at least two interesting directions in which our work can be extended: (1) Using arguments similar to those for β0 in our above algorithm, one can show that ε is piecewise linear in 1/λ, and its path can be calculated similarly. (2) As Schölkopf et al. (2000) point out, ν is an upper bound on the fraction of errors and a lower bound on the fraction of support vectors. Therefore, it is worthwhile to consider the solution path as a function of ν and to develop a corresponding efficient algorithm.

Appendix: Proofs

A.1 Proof of Lemma 1. For any fixed λ > 0, suppose R, L, C, ER, and EL are given. Then we have

(1/λ) [β0,λ + Σ_{i ∈ (ER ∪ EL)} θi K(xk, xi) − Σ_{i ∈ L} K(xk, xi) + Σ_{i ∈ R} K(xk, xi)] = yk − ε, ∀k ∈ ER;    (A.1)

(1/λ) [β0,λ + Σ_{i ∈ (ER ∪ EL)} θi K(xm, xi) − Σ_{i ∈ L} K(xm, xi) + Σ_{i ∈ R} K(xm, xi)] = ym + ε, ∀m ∈ EL;    (A.2)

Σ_{i ∈ (ER ∪ EL)} θi − nL + nR = 0.    (A.3)
These can be reexpressed as
[0, 1ᵀ; 1, K_E] [β0,λ; θ_E] = [b; λ y_E − a],

where K_E is a (nER + nEL) × (nER + nEL) matrix with entries equal to K(xi, xi′), i, i′ ∈ (ER ∪ EL); θ_E and y_E are vectors of length (nER + nEL), with elements equal to θi and yi, i ∈ (ER ∪ EL), respectively; a is also a vector of length (nER + nEL), with elements equal to −Σ_{m ∈ L} K(xi, xm) + Σ_{k ∈ R} K(xi, xk) ± λε, i ∈ (ER ∪ EL); and b is the scalar b = nL − nR. Notice that once λ, R, L, C, ER, and EL are fixed, K_E, a, and b are also fixed. Then β0,λ and θ_E can be expressed as

[β0,λ; θ_E] = K̃ [b; λ y_E − a],

where

K̃ = [0, 1ᵀ; 1, K_E]^{−1}.
Notice that β0,λ and θ_E are affine in y_E. Now, corresponding to the six events listed at the beginning of section 2.3, if λ is an event point, one of the following conditions has to be satisfied:

1. θi = 0, ∃i ∈ (ER ∪ EL)
2. θk = 1, ∃k ∈ ER
3. θm = −1, ∃m ∈ EL
4. yj = (1/λ) [β0,λ + Σ_{i ∈ (ER ∪ EL)} θi K(xj, xi) − Σ_{i ∈ L} K(xj, xi) + Σ_{i ∈ R} K(xj, xi)] + ε, ∃j ∈ (R ∪ C)
5. yj = (1/λ) [β0,λ + Σ_{i ∈ (ER ∪ EL)} θi K(xj, xi) − Σ_{i ∈ L} K(xj, xi) + Σ_{i ∈ R} K(xj, xi)] − ε, ∃j ∈ (L ∪ C).

For any fixed λ, R, L, C, ER, and EL, each of the above conditions defines a hyperplane of y in Rⁿ. Taking into account all possible combinations of R, L, C, ER, and EL, the set of y such that λ is an event point is a collection of a finite number of hyperplanes.

A.2 Proof of Lemma 2. For any fixed λ > 0 and any fixed y⁰ ∈ Rⁿ, we wish to show that if a sequence y^m converges to y⁰, then θ(y^m) converges to θ(y⁰). Since the θ(y^m) are bounded, it is equivalent to show that every converging subsequence, say θ(y^{m_k}), converges to θ(y⁰). Suppose θ(y^{m_k}) converges to θ^∞; we will show θ^∞ = θ(y⁰). Denote the objective function in equation 1.5 as h(θ(y), y), and let

Δh(θ(y), y, y′) = h(θ(y), y) − h(θ(y), y′).
Then we have

h(θ(y⁰), y⁰) = h(θ(y⁰), y^{m_k}) + Δh(θ(y⁰), y⁰, y^{m_k})
             ≥ h(θ(y^{m_k}), y^{m_k}) + Δh(θ(y⁰), y⁰, y^{m_k})
             = h(θ(y^{m_k}), y⁰) + Δh(θ(y^{m_k}), y^{m_k}, y⁰) + Δh(θ(y⁰), y⁰, y^{m_k}).    (A.4)

Using the fact that |a| − |b| ≤ |a − b| and y^{m_k} → y⁰, it is easy to show that for large enough m_k, we have

Δh(θ(y^{m_k}), y^{m_k}, y⁰) + Δh(θ(y⁰), y⁰, y^{m_k}) ≤ c ‖y⁰ − y^{m_k}‖₁,

where c > 0 is a constant. Furthermore, using y^{m_k} → y⁰ and θ(y^{m_k}) → θ^∞, we reduce equation A.4 to

h(θ(y⁰), y⁰) ≥ h(θ^∞, y⁰).

Since θ(y⁰) is the unique minimizer of h(θ, y⁰), we have θ^∞ = θ(y⁰).

A.3 Proof of Lemma 3. For any fixed λ > 0 and any fixed y⁰ ∈ Rⁿ\N_λ, since Rⁿ\N_λ is an open set, we can always find a small enough η > 0 such that Ball(y⁰, η) ⊂ Rⁿ\N_λ. So λ is not an event point for any y ∈ Ball(y⁰, η). We claim that if η is small enough, the sets R, L, C, ER, and EL stay the same for all y ∈ Ball(y⁰, η). Consider y and y⁰. Let R^y, L^y, C^y, ER^y, EL^y, R⁰, L⁰, C⁰, ER⁰, EL⁰ denote the corresponding sets, and θ^y, f^y, θ⁰, f⁰ the corresponding fits. For any i ∈ ER⁰, since λ is not an event point, we have 0 < θi⁰ < 1. Therefore, by continuity, we also have 0 < θi^y < 1, i ∈ ER⁰, for y close enough to y⁰; or, equivalently, ER⁰ ⊆ ER^y, ∀y ∈ Ball(y⁰, η) for small enough η. The same applies to EL⁰ and EL^y as well. Similarly, for any i ∈ R⁰, since yi⁰ − f⁰(xi) > ε, again by continuity, we have yi − f^y(xi) > ε for y close enough to y⁰; or, equivalently, R⁰ ⊆ R^y, ∀y ∈ Ball(y⁰, η) for small enough η. The same applies to L⁰, L^y and C⁰, C^y as well. Overall, we then must have ER⁰ = ER^y, EL⁰ = EL^y, R⁰ = R^y, L⁰ = L^y, and C⁰ = C^y for all y ∈ Ball(y⁰, η) when η is small enough.

A.4 Proof of Theorem 1. Using lemma 3, we know that there exists η > 0 such that for all y ∈ Ball(y, η), the sets R, L, C, ER, and EL stay the same.
This implies that for points in (ER ∪ EL), we have fi = yi + ε or fi = yi − ε when y ∈ Ball(y, η); hence,

∂fi/∂yi = 1, i ∈ (ER ∪ EL).
Furthermore, from equations A.1 to A.3, we can see that for points in R, L, and C, their θi's are fixed at either 1, −1, or 0, and the other θi's are determined by y_E. Hence,

∂fi/∂yi = 0, i ∈ (R ∪ L ∪ C).

Overall, we have

Σ_{i=1}^n ∂fi/∂yi = |ER| + |EL|.
Acknowledgments We thank the two referees for their thoughtful and useful comments. Their comments have helped us improve the letter greatly. We also thank Trevor Hastie, Saharon Rosset, Rob Tibshirani, and Hui Zou for their encouragement. L. G. and J. Z. are partially supported by grant DMS-0505432 from the National Science Foundation. References Allgower, E., & Georg, K. (1993). Continuation and path following. Acta Numerica, 2, 1–64. Chang, C. C., & Lin, C. J. (2001). LIBSVM: A library for support vector machines. Taipei: Department of Computer Science and Information Engineering, National Taiwan University. Available online at http://www.csie.ntu.edu.tw/cjlin/libsvm/. Chang, C. C., & Lin, C. J. (2002). Training ν-support vector regression: Theory and algorithms. Neural Computation, 14(8), 1959–1977. Chang, M. W., & Lin, C. J. (2005). Leave-one-out bounds for support vector regression model selection. Neural Computation, 17, 1188–1222. Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline function. Numerical Mathematics, 31, 377–403. DeCoste, D., & Wagstaff, K. (2000). Alpha seeding for support vector machines. In Proceedings of Sixth ACM SIGKDD. New York: ACM Press. Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461–470. Efron, B. (2004). The estimation of prediction error: Covariance penalties and crossvalidation. Journal of the American Statistical Association, 99, 619–632. Friedman, J. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19, 1–67. Gondzio, J., & Grothey, A. (2001). Reoptimization with the primal-dual interior point method. SIAM Journal on Optimization, 13, 842–864.
Hastie, T., Rosset, S., Tibshirani, R., & Zhu, J. (2004). The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5, 1391–1415. Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(3), 55–67. Joachims, T. (1999). Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 169–184). Cambridge, MA: MIT Press. Keerthi, S., Shevade, S., Bhattacharyya, C., & Murthy, K. (1999). Improvements to Platt's SMO algorithm for SVM classifier design (Tech. Rep. CD-99-14, Mechanical and Production Engineering). Singapore: National University of Singapore. Kimeldorf, G., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33, 82–95. Ma, J., Theiler, J., & Perkins, S. (2003). Accurate on-line support vector regression. Neural Computation, 15, 2683–2703. Meyer, M., & Woodroofe, M. (2000). On the degrees of freedom in shape-restricted regression. Annals of Statistics, 28, 1083–1104. Müller, K., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1997). Predicting time series with support vector machines. In Proceedings of the Seventh International Conference on Artificial Neural Networks (pp. 999–1004). Berlin: Springer-Verlag. O'Sullivan, F. (1985). Discussion of "Some aspects of the spline smoothing approach to nonparametric curve fitting" by B. W. Silverman. Journal of the Royal Statistical Society, Series B, 36, 111–147. Osuna, E., Freund, R., & Girosi, F. (1997). An improved training algorithm for support vector machines. In Proceedings of the IEEE Neural Networks for Signal Processing VII Workshop (pp. 276–285). Piscataway, NJ: IEEE Press. Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press. Schölkopf, B., Smola, A., Williamson, R., & Bartlett, P. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245. Shpigelman, L., Crammer, K., Paz, R., Vaadia, E., & Singer, Y. (2004). A temporal kernel-based model for tracking hand movements from neural activities. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems, 17. Cambridge, MA: MIT Press. Smola, A., & Schölkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199–222. Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Annals of Statistics, 9(6), 1135–1151. Stone, M. (1974). Cross-validation choice and assessment of statistical predictors (with discussion). Journal of the Royal Statistical Society, Series B, 36, 111–147. Vanderbei, R. (1994). LOQO: An interior point code for quadratic programming (Tech. Rep. SOR-94-15, Statistics and Operations Research). Princeton, NJ: Princeton University. Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Efficient Computation and Model Selection for SVR
Received October 17, 2005; accepted August 1, 2006.
LETTER
Communicated by Sushmita Mitra
A Novel Generic Hebbian Ordering-Based Fuzzy Rule Base Reduction Approach to Mamdani Neuro-Fuzzy System Feng Liu
[email protected] Chai Quek
[email protected] Geok See Ng
[email protected] Center for Computational Intelligence, Nanyang Technological University, School of Computer Engineering, Singapore 639798
There are two important issues in neuro-fuzzy modeling: (1) interpretability, the ability to describe the behavior of the system in an interpretable way, and (2) accuracy, the ability to approximate the outcome of the system accurately. As these two objectives usually exert contradictory requirements on the neuro-fuzzy model, a compromise has to be made. This letter proposes a novel rule reduction algorithm, namely Hebb rule reduction, together with an iterative tuning process, to balance interpretability and accuracy. The Hebb rule reduction algorithm uses Hebbian ordering, which represents the degree of coverage of the samples by a rule, as an importance measure of each rule in order to merge the membership functions and hence reduce the number of rules. Similar membership functions (MFs) are merged under a specified similarity measure in order of Hebbian importance, and the resultant equivalent rules are deleted from the rule base. Among a set of equivalent rules, the rule with the highest Hebbian importance is retained. The MFs are tuned through the least mean square (LMS) algorithm to reduce the modeling error. The tuning of the MFs and the reduction of the rules proceed iteratively to achieve a balance between interpretability and accuracy. Three published data sets by Nakanishi (Nakanishi, Turksen, & Sugeno, 1993), the Pat synthetic data set (Pal, Mitra, & Mitra, 2003), and the traffic flow density prediction data set are used as benchmarks to demonstrate the effectiveness of the proposed method. Good interpretability, as well as high modeling accuracy, is derived simultaneously, and the method is suitably benchmarked against other well-established neuro-fuzzy models.

1 Introduction

The hybridization of neural networks and fuzzy systems, called neuro-fuzzy systems, has become a popular research focus (Lin, 1996). It takes

Neural Computation 19, 1656–1680 (2007)
© 2007 Massachusetts Institute of Technology
A Hebbian-Based Mamdani Neuro-Fuzzy Rule Reduction Approach
advantage of the low-level learning ability of neural networks and the high-level reasoning ability of fuzzy systems. As a universal approximator (Castro, 1995), this approach resolves the modeling problem using a linguistic model consisting of a set of if-then fuzzy rules instead of a complex mathematical description. The identification of interpretable fuzzy rules is one of the most important issues in the design of a neuro-fuzzy system. Neuro-fuzzy modeling usually comes with two contradictory requirements: interpretability and accuracy. The fuzzy modeling research is divided into two main areas: linguistic fuzzy modeling (LFM) and precise fuzzy modeling (PFM) (Casillas, Cordon, Herrera, & Magdalena, 2003). The former focuses mainly on the interpretability issue and is usually implemented by a Mamdani model (Mamdani & Assilian, 1975), while the latter aims to devise a fuzzy model with good accuracy and is often realized by a Takagi-Sugeno-Kang (TSK) model (Takagi & Sugeno, 1985; Sugeno & Kang, 1988). This letter focuses on a delicate balance between interpretability and accuracy in linguistic fuzzy modeling. Interpretability is usually improved in the following ways:
- Input variable selection and reduction. As the number of input variables increases, the homogeneous partitioning of the input and output spaces causes the size of the rule base to grow exponentially. This has been much addressed in the literature (Silipo & Berthold, 2000; Gonzalez & Perez, 2001; Casillas, Cordon, Del Jesus, & Herrera, 2001; Chakraborty & Pal, 2004).
- Fuzzy rule selection, reduction, and merger. To derive a consistent and effective rule base, redundant, erroneous, unimportant, and conflicting rules should be removed. Various methods have been proposed in the literature (Setnes, Babuska, Kaymak, & Van Nauta Lemke, 1998; Yen & Wang, 1999; Cordon & Herrera, 2000; Krone & Taeger, 2001; Tung & Quek, 2002).
- Use of other rule expressions. Disjunctive normal form rules (Gonzalez & Perez, 1998), exception rules (Krone & Kiendl, 1994), union-rule configuration (Combs & Andrews, 1998), and others have attempted to derive a compact rule base.
Yen and Wang (1999) used singular value decomposition (SVD) and QR with column pivoting to determine the order of importance of the fuzzy rules in the TSK model. The rules corresponding to small singular values in the firing matrix are considered less important or irrelevant and may be removed from the rule base. However, Setnes and Babuska (2001) showed that the SVD does not produce an importance ordering of the fuzzy rules and that only the QR process is supported. The rule reduction in Setnes et al. (1998) is realized by the merger of fuzzy sets. When two fuzzy sets in one input dimension are similar under certain discrete measures of similarity, they are merged into one fuzzy set,
F. Liu, C. Quek, & G. Ng
which may result in equivalent rules in the rule base. If a set of rules is equivalent, with the same conditions and consequences, only one of them is preserved. This method cleanly separates the fuzzy sets and reduces the number of rules simultaneously, which seems very promising. However, it considers all rules to be of equal importance. It may thus delete some important rules and cause considerable degradation in accuracy. Furthermore, in this method, the merger and reduction process occurs only once, at the end of the formulation of the rule base, and the accuracy requirement may limit its degree of reduction. Ang and Quek (2005) used rough set theory, based on the POP learning algorithm (Ang, Quek, & Pasquier, 2003; Quek & Zhou, 1999), to improve the interpretability of a neuro-fuzzy system. It performs feature selection through the reduction of attributes and extends the reduction to rules without redundant attributes. Optimization methods like the genetic algorithm (GA) have also been used to deal with the interpretability issue of fuzzy modeling (Casillas, Cordon, Del Jesus, & Herrera, 2005). With multiple objectives, both accuracy and interpretability are compromised through the evolutionary process. Banerjee, Mitra, and Pal (1998), Mitra, Mitra, and Pal (2001), and Pal, Mitra, and Mitra (2003) used rough set theory for extracting dependency rules from a real-valued attribute table consisting of fuzzy membership values. At the same time, they employed a GA to tune the fuzzification parameters and the network weights and structure simultaneously. Besides the methods mentioned above, there are many other techniques for neuro-fuzzy rule generation in the soft computing framework (Mitra & Hayashi, 2000). This letter proposes an iterative process to reduce the rule base, achieving good interpretability while maintaining a high level of modeling accuracy. It is based on the Hebbian ordering of the rules and is called Hebb rule reduction.
The proposed Hebb rule reduction process reduces the rule base by merging the fuzzy sets in each dimension, with the merging order determined by Hebbian ordering. Hebbian ordering is used to represent the importance of the rules, where a higher Hebbian ordering indicates a larger coverage of the training points by a given rule. The rules with higher importance are more likely to be preserved. The least mean square (LMS) algorithm is used to tune the membership functions of the rule base after each reduction step. Interpretability and accuracy are balanced using this iterative process. This letter is organized as follows. Section 2 describes the details of the proposed Hebb rule reduction process, experiments are presented in section 3 to demonstrate the use of the proposed method, and section 4 presents the conclusion.

2 Hebbian-Based Rule Reduction Algorithm

In the fuzzy system, the crisp inputs are first fuzzified to fuzzy inputs and subsequently transformed into fuzzy outputs through a set of fuzzy rules,
Figure 1: Structure of five-layer fuzzy neural network.
which has the following form (Mamdani type):

R_l: if x_1 is A and x_2 is B, then y_1 is C.   (2.1)
The fuzzy outputs are finally defuzzified into crisp outputs. To exploit the low-level learning ability of the neural network, a five-layer neural network is deployed, as shown in Figure 1. It consists of the input layer, the condition layer (which performs the fuzzification), the rule node layer (each node of which denotes a fuzzy rule), the consequence layer (which performs the defuzzification), and the output layer. The crisp input and output vectors are represented as X^T = [x_1, x_2, \ldots, x_i, \ldots, x_{n_1}] and Y^T = [y_1, y_2, \ldots, y_m, \ldots, y_{n_5}], respectively. The terms n_1, n_2, n_3, n_4, n_5 denote the numbers of neurons in the input, condition, rule node, consequence, and output layers, respectively, with n_2 = \sum_{i=1}^{n_1} L_i and n_4 = \sum_{m=1}^{n_5} T_m, where L_i and T_m are the numbers of input and output linguistic labels for each dimension. In this letter, the gaussian membership function is used in the condition and consequence layers to perform the fuzzification and defuzzification. The centroids and widths of the MFs are denoted as (c^{II}_{i,j}, \delta^{II}_{i,j}) (the ith input dimension and the jth MF) and (c^{IV}_{l,m}, \delta^{IV}_{l,m}) (the mth output dimension and the lth MF) for the two layers. Denoting by iL_k(i) the input label in the ith input dimension of the kth rule, and by oL_k(m) the output label in the mth output dimension of the kth rule, the final output of the mth output dimension, denoted o_m^V, can be expressed as

o_m^V = \frac{\sum_{k=1}^{n_3} \bigl(c^{IV}_{oL_k(m),m}/(\delta^{IV}_{oL_k(m),m})^2\bigr)\, f_k^{III}}{\sum_{k=1}^{n_3} f_k^{III}/(\delta^{IV}_{oL_k(m),m})^2},   (2.2)
Figure 2: Flowchart of the proposed rule generation, reduction, and tuning process.
where

f_k^{III} = \prod_{i=1}^{n_1} \exp\!\left[-\frac{(x_i - c^{II}_{i,iL_k(i)})^2}{(\delta^{II}_{i,iL_k(i)})^2}\right]

is called the firing strength of the input point X^T. There are two important issues in neuro-fuzzy modeling: interpretability and accuracy. Interpretability refers to the capability of the fuzzy model to express the behavior of the system in an understandable way, and accuracy refers to the capability of the fuzzy model to represent the system faithfully (Casillas, Cordon, Herrera, & Magdalena, 2003). Interpretability and accuracy are usually pursued as contradictory objectives and can drift poles apart as the system complexity increases. When the membership functions of the rules are tuned to diminish the modeling error, the interpretability of the rules may be degraded, as the fuzzy sets can drift closer to each other and may end up overlapping (Setnes et al., 1998). Thus, a balance between interpretability and accuracy is desirable. The purpose of this letter is to identify and devise a way to obtain low modeling error while maintaining high interpretability. The proposed method consists of three phases: initial rule generation (P1), iterative rule reduction and refinement (P2), and membership function tuning (P3), as shown in Figure 2. In the P1 phase, fuzzy rules are formulated to cover all the training samples. The numbers of input and output labels in the neuro-fuzzy model are equal to the number of fuzzy rules. There will be an excessive set of adjustable parameters, and the resultant fuzzy sets will have large areas of overlap. These will be reduced in the P2 phase.
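As a concrete reading of equation 2.2 and the firing-strength definition, the inference pass can be sketched as follows. This is a minimal sketch with invented toy centroids and widths; `firing_strength` and `defuzzify` are our own names, not the letter's.

```python
import math

def firing_strength(x, centers, widths):
    # f_k^III: product of gaussian memberships over the input dimensions
    return math.prod(
        math.exp(-((xi - c) ** 2) / (d ** 2))
        for xi, c, d in zip(x, centers, widths)
    )

def defuzzify(f, out_centers, out_widths):
    # o_m^V = sum_k (c_k / d_k^2) f_k / sum_k (f_k / d_k^2), as in equation 2.2
    num = sum(c / d ** 2 * fk for fk, c, d in zip(f, out_centers, out_widths))
    den = sum(fk / d ** 2 for fk, d in zip(f, out_widths))
    return num / den

# Two toy rules over a 2-D input, one output dimension (all values invented)
rule_centers = [[1.0, 2.0], [3.0, 4.0]]
rule_widths = [[0.5, 0.5], [0.5, 0.5]]
out_c, out_d = [0.0, 1.0], [1.0, 1.0]

x = [1.0, 2.0]  # sits exactly on rule 1's centroids, so its firing strength is 1
f = [firing_strength(x, c, d) for c, d in zip(rule_centers, rule_widths)]
y = defuzzify(f, out_c, out_d)  # dominated by rule 1's consequent centroid 0.0
```

Because the input sits on the first rule's centroids, the defuzzified output is pulled almost entirely toward that rule's consequent.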
In the P2 phase, an iterative tuning process is employed. It consists of two subphases: rule reduction (P2.1) and membership function (MF) refinement (P2.2). The objective of this phase is to improve the interpretability of the model while simultaneously maintaining high modeling accuracy. In the rule reduction subphase, membership functions are merged according to the degree of their overlap, and redundant and conflicting rules are deleted at the same time. A merging scheme based on the Hebbian ordering of rules is proposed. There are three procedures in the P2.1 phase: rank the rules (P2.1.1), merge the MFs (P2.1.2), and reduce the redundant and conflicting rules (P2.1.3). The system first ranks all the rules according to the importance of each rule, which is realized here by the Hebbian ordering. Then the MFs of each variable are merged in accordance with a set of criteria, which will be defined later. Finally, if any equivalent rules have the same condition and consequence parts as others, only one of them is preserved, and if any conflicting rules have the same condition but a different consequence part from others, the ones with lower importance are removed. If only one MF of an input feature variable remains, that feature is removed. In the P2.2 phase, the LMS algorithm is employed to tune the centers and widths of the membership functions to reduce the modeling error. As there may still be unsatisfactory overlaps between MFs, the process iterates back to phase P2.1 to reduce the MFs and the rules. The stopping criterion for the iteration is specified as follows. Denote the training error after the ith round of rule reduction (P2.1) and MF refinement (P2.2) as E^i_train. If i exceeds the maximum number of iterations i_max, or E^{i+1}_train > \eta E^i_train, then restore the rule set obtained just after the ith iteration and go to the next phase. Here \eta is a human-defined parameter that controls the reduction of the system complexity, based on the fact that if the number of rules falls below what is necessary, the training error will become much larger than before. The third phase is the finer tuning of the MFs to achieve a high level of accuracy. There is no further reduction of the MFs and rules, so there are more updating epochs and a smaller learning rate than in the MF refinement (P2.2) of the second phase.
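The P2 stopping logic can be sketched as a loop over a hypothetical reduce-and-refine callable; the function name and the toy error trajectory below are our own inventions, not the letter's code.

```python
def iterate_p2(reduce_and_refine, i_max=10, eta=1.5):
    """Run rule reduction + MF refinement until the training error jumps
    (E^{i+1} > eta * E^i) or i_max is reached; return the errors kept so far.
    `reduce_and_refine(i)` returns the training error after iteration i."""
    errors = []
    for i in range(i_max):
        e = reduce_and_refine(i)
        if errors and e > eta * errors[-1]:
            break  # restore the previous rule set (not shown) and stop
        errors.append(e)
    return errors

# Toy error trajectory: reduction helps until the rule base becomes too small
trajectory = [0.50, 0.40, 0.35, 0.90, 0.95]
errors = iterate_p2(lambda i: trajectory[i], i_max=5, eta=1.5)
```

With eta = 1.5, the jump from 0.35 to 0.90 trips the criterion, so the loop stops before accepting the over-reduced rule base.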
2.1 Initial Rule Base Formulation. During the rule initialization phase, whenever a new data sample is presented, a new rule node is created if there are no rules in the rule base or if the largest firing strength among the existing rules falls below a specified threshold \theta. This threshold controls the coverage of the data space by the rules and affects the number of initial rules: the larger the threshold, the more rules are generated, and vice versa. When a new rule node is created, a neuron is inserted into the condition and consequence layers for each input and output dimension. The centroids of the newly generated MFs are set to the data sample, and the widths of the MFs are set to a predefined value in proportion to the scale of each dimension. The flowchart of the procedure is shown in Figure 3.
Figure 3: Flowchart of the rule initialization algorithm.
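The thresholded rule creation described above can be sketched as follows; \theta, the default width, and the helper names are our assumptions, not the letter's implementation.

```python
import math

def firing(x, rule):
    # Product of gaussian memberships of x under one rule's (centers, widths)
    centers, widths = rule
    return math.prod(
        math.exp(-((xi - c) ** 2) / (d ** 2))
        for xi, c, d in zip(x, centers, widths)
    )

def init_rules(samples, theta=0.5, width=1.0):
    # Create a new rule whenever no existing rule fires above theta
    rules = []
    for x in samples:
        if not rules or max(firing(x, r) for r in rules) < theta:
            rules.append((list(x), [width] * len(x)))  # centroid at the sample
    return rules

# The second sample is covered by the first rule; the third sample is not
rules = init_rules([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]], theta=0.5)
```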
2.2 Hebbian-Based MF and Rule Reduction. Excessive MFs and rules are generated during the rule initialization phase, and they should be reduced so as to give each fuzzy set a clear linguistic meaning and to decrease the complexity of the model. In fuzzy modeling, each fuzzy rule corresponds to a subregion of the decision space. Some rules lying in a well-populated region may represent many samples and have a great deal of influence on the final result, while other rules may occasionally be generated by noise and become redundant in the rule base. Thus, the rules are not all of equal importance. This section introduces the Hebbian importance of fuzzy rules and explains how it is used to merge the fuzzy sets. Once the structure of a rule is determined, each training sample (X_i^T, Y_i^T) is fed into the input and output layers simultaneously. The input X_i^T is used to produce the firing strength of the kth rule node, while Y_i^T is fed into the consequence layer to derive the membership values at the output label nodes. If the input-output samples repeatedly fire a rule, as measured by the product of the firing strength and the output membership values, and the accumulated strength surpasses that of other rules, this indicates the existence of such a rule. Viewed another way, the accumulated strength exerted by the sample pairs reflects the degree to which the rule covers them. A rule that covers the samples to a higher degree has greater influence on the modeling accuracy: when the MFs of such a rule are merged, or the rule itself is deleted, the fuzzy inference results change more, significantly affecting the modeling accuracy. Thus, such a rule is of greater importance. To achieve a good balance between interpretability and accuracy when MFs and rules are reduced, it is desirable to preserve all such efficient and effective rules.
Let us define the degree of coverage of the ith data point (X_i^T, Y_i^T) by the kth rule as the product of its firing strength and the output membership values:

c_{k,i} = f_k^{III} \prod_{m=1}^{n_5} \mu^{IV}_{oL_k(m),m}(y_{i,m}).   (2.3)
The Hebbian importance of the kth rule is defined as the sum of the degrees of coverage over all the training samples:

I_k = \sum_{i=1}^{n} c_{k,i}.   (2.4)
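Equations 2.3 and 2.4 amount to a weighted count of how strongly a rule is activated across the training set; a minimal sketch under invented toy data (function names are ours):

```python
import math

def gauss(v, c, d):
    # Gaussian output membership value
    return math.exp(-((v - c) ** 2) / (d ** 2))

def coverage(f_k, y, out_centers, out_widths):
    # c_{k,i} = f_k^III * prod_m mu^IV(y_{i,m})   (equation 2.3)
    return f_k * math.prod(
        gauss(ym, c, d) for ym, c, d in zip(y, out_centers, out_widths)
    )

def hebbian_importance(firings, ys, out_centers, out_widths):
    # I_k = sum_i c_{k,i}   (equation 2.4)
    return sum(coverage(f, y, out_centers, out_widths)
               for f, y in zip(firings, ys))

# Toy example: one rule, two samples, one output dimension
firings = [1.0, 0.5]              # f_k^III for each sample
ys = [[2.0], [2.0]]               # outputs sit on the consequent centroid
I = hebbian_importance(firings, ys, [2.0], [1.0])
```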
The fuzzy rules can then be sorted by their importance, defined by equation 2.4, in decreasing order. This is called the Hebbian ordering of the rules. At the beginning of rule reduction, the original rule set, denoted by R, contains all the sorted rules, and the reduced rule set, denoted by R', is empty. The rules in R are presented successively according to their Hebbian order. If there is no rule in the reduced rule set, the rule from R is added directly into R'; otherwise, the fuzzy set in each dimension of the rule is added or merged into the fuzzy sets in R' according to a set of criteria defined below. All of the newly added or merged fuzzy sets are then linked together to formulate a new rule in the reduced rule set R'. During the merging process, some fuzzy sets may come to be shared by several rules. Changing a fuzzy set shared by many rules is equivalent to modifying all of these rules simultaneously, which may greatly influence the performance of the system, while a less shared fuzzy set affects the performance only locally. Thus, a highly shared fuzzy set is of higher importance than one that is less shared. Denote the importance of a fuzzy set F as \hat{I}_F; this value changes during the merging process. Each time the kth rule in the original rule set R is presented, its fuzzy sets A_i and B_j (i.e., for the kth rule, "if x_i is A_i, then y_j is B_j") are assigned the same importance as their associated rule:

M1: \hat{I}_{A_i} = \hat{I}_{B_j} = I_k   (2.5)
for i = 1, \ldots, n_1 and j = 1, \ldots, n_5. For each input dimension i, among the fuzzy sets of this dimension in the reduced rule set R', the fuzzy set A_i' with the maximum degree of overlap with A_i is selected (the degree of overlap is defined in equation 2.9). If the maximum degree of overlap satisfies a specified criterion, the fuzzy sets
will be merged into a new fuzzy set A_i''; otherwise, the fuzzy set A_i is added directly into the reduced rule set of this dimension. The centroids and widths of A_i and A_i', (c_{A_i}, \delta_{A_i}) and (c_{A_i'}, \delta_{A_i'}), respectively, are merged into (c_{A_i''}, \delta_{A_i''}) using M2:

c_{A_i''} = \frac{\hat{I}_{A_i} c_{A_i} + \hat{I}_{A_i'} c_{A_i'}}{\hat{I}_{A_i} + \hat{I}_{A_i'}}   (2.6)

\delta_{A_i''} = \frac{\hat{I}_{A_i} \delta_{A_i} + \hat{I}_{A_i'} \delta_{A_i'}}{\hat{I}_{A_i} + \hat{I}_{A_i'}}.   (2.7)

The importance of the new fuzzy set A_i'' is given by M3:

\hat{I}_{A_i''} = \hat{I}_{A_i} + \hat{I}_{A_i'}.   (2.8)
In other words, the centroid and width of the resultant fuzzy set are the weighted averages of those of the two fuzzy sets in accordance with their importance, and the importance of the final fuzzy set is the sum of the importance of the two. The newly generated fuzzy set A_i'' then replaces the previous fuzzy set A_i'. For each output dimension, the fuzzy set B_j is either added directly or merged into the fuzzy sets of this output dimension in the reduced rule set, using the same process as for the input dimensions. Finally, the newly added or merged fuzzy sets in all the dimensions are linked together to formulate a fuzzy rule in the reduced rule set R'. Given fuzzy sets A and B, the degree of overlap of A by B is defined as

S_A(B) = \frac{|A \cap B|}{|A|}.   (2.9)

As the gaussian membership function is used in the proposed method, the overlap measure can be derived from the centers and widths of the MFs. For fuzzy sets A and B with membership functions \mu_A(x) = \exp\{-(x - c_A)^2/\delta_A^2\} and \mu_B(x) = \exp\{-(x - c_B)^2/\delta_B^2\}, respectively, and assuming c_A \ge c_B, |A| and |A \cap B| can be expressed as (Juang & Lin, 1998)

|A| = \sqrt{\pi}\,\delta_A   (2.10)

|A \cap B| = \frac{1}{2}\left[\frac{h^2(c_B - c_A + \sqrt{\pi}(\delta_A + \delta_B))}{\sqrt{\pi}(\delta_A + \delta_B)} + \frac{h^2(c_B - c_A + \sqrt{\pi}(\delta_A - \delta_B))}{\sqrt{\pi}(\delta_B - \delta_A)} + \frac{h^2(c_B - c_A - \sqrt{\pi}(\delta_A - \delta_B))}{\sqrt{\pi}(\delta_A - \delta_B)}\right]   (2.11)
where h(x) = max {0, x}.
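The overlap measure of equations 2.9 to 2.11 and the importance-weighted merge of equations 2.6 to 2.8 can be sketched together as follows. This is a sketch under the stated assumption c_A >= c_B; the equal-width terms are zeroed to avoid the 0/0 limit, and the function names are ours.

```python
import math

def h(x):
    # h(x) = max{0, x}, as in the text
    return max(0.0, x)

def overlap(c_a, d_a, c_b, d_b):
    # S_A(B) = |A ∩ B| / |A| for gaussian sets (equations 2.9-2.11),
    # assuming c_a >= c_b; equal-width terms are skipped to avoid 0/0
    sp = math.sqrt(math.pi)
    inter = h(c_b - c_a + sp * (d_a + d_b)) ** 2 / (sp * (d_a + d_b))
    if d_a != d_b:
        inter += h(c_b - c_a + sp * (d_a - d_b)) ** 2 / (sp * (d_b - d_a))
        inter += h(c_b - c_a - sp * (d_a - d_b)) ** 2 / (sp * (d_a - d_b))
    return 0.5 * inter / (sp * d_a)

def merge(c_a, d_a, i_a, c_b, d_b, i_b):
    # Importance-weighted merge of two fuzzy sets (equations 2.6-2.8)
    i_new = i_a + i_b
    return ((i_a * c_a + i_b * c_b) / i_new,
            (i_a * d_a + i_b * d_b) / i_new,
            i_new)

s_same = overlap(0.0, 1.0, 0.0, 1.0)   # identical sets: full overlap
s_far = overlap(10.0, 1.0, 0.0, 1.0)   # well-separated sets: no overlap
c, d, i = merge(0.0, 1.0, 2.0, 1.5, 1.0, 1.0)
```

The merged centroid is pulled toward the set with the larger importance, and the new importance accumulates, as M3 specifies.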
The merging criterion is that if S_A(B) > \lambda or S_B(A) > \lambda, the two sets will be merged, where \lambda is a specified threshold that determines the maximum degree of overlap between fuzzy sets that the system can tolerate. A higher \lambda may increase the accuracy but degrades the interpretability, while a lower \lambda may cause a larger number of rules to be reduced, with the risk that too few rules remain to maintain high modeling accuracy. We choose \lambda = 0.5 in this letter to balance accuracy and interpretability. After all the rules have been presented, the following steps are executed to remove features and to delete redundant and conflicting rules in the reduced rule set R':

S1: If there is only one membership function within one dimension, this dimension (feature) is removed.

S2: If any rule has the same conditions and consequences as another, it is removed.

S3: If any rules conflict, having equivalent conditions but different consequences, the one with the highest degree of importance is preserved, and the others are deleted.

Finally, the original rule set is replaced with the reduced rule set. The flowchart of the algorithm is shown in Figure 4. An example follows. At the start, there are four rules in the original rule set R, shown in Figure 5. After rule 1 is presented, it is added into the reduced rule set R' (see Figure 6). After the second rule is presented, the two fuzzy sets in the second input dimension, A_2^{(2)} and A_2^{(1)}, are so similar that they are merged into a new A_2^{(1)}. In the first input dimension and the output dimension, the fuzzy sets are well separated and thus are retained. They are shown in Figure 7. After the third rule is presented, A_1^{(3)} and A_1^{(1)} are merged into a new A_1^{(1)}, A_2^{(3)} and A_2^{(1)} are merged into a new A_2^{(1)}, and B_1^{(3)} and B_1^{(2)} are merged into a new B_1^{(2)}, as shown in Figure 8. After the fourth rule is presented, A_1^{(4)} and A_1^{(1)} are merged into a new A_1^{(1)}, A_2^{(4)} and A_2^{(1)} are merged into a new A_2^{(1)}, and B_1^{(4)} and B_1^{(1)} are merged into a new B_1^{(1)}, as shown in Figure 9. After all the rules have been presented, we can see, first, that only one fuzzy set remains in the second input dimension, so this input feature is removed (S1). Second, rules 1 and 4 have the same conditions and consequence; they are equivalent, and one of them is deleted (S2). In this example, rule 1 is preserved because it has higher importance. Finally, rules 1 and 3 have the same condition but different consequences, so they conflict, and one of them has to be removed to improve the interpretability of the system. As I_1 > I_3, rule 1 is preserved, and rule 3 is deleted (S3). In the end, the four rules are reduced to two rules, and the input feature x_2 is removed.
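Steps S2 and S3 reduce to a de-duplication keyed on the condition part that keeps the most important rule; a sketch with invented rule tuples (not the letter's data structures):

```python
def reduce_rules(rules):
    """rules: list of (condition, consequence, importance) tuples.
    S2/S3: among rules sharing a condition, keep only the most important.
    This covers both equivalent rules (same condition and consequence)
    and conflicting rules (same condition, different consequence)."""
    best = {}
    for cond, cons, imp in rules:
        if cond not in best or imp > best[cond][1]:
            best[cond] = (cons, imp)
    return [(cond, cons, imp) for cond, (cons, imp) in best.items()]

# Rules 1 and 4 are equivalent, rules 1 and 3 conflict; rule 1 ranks highest
rules = [
    (("Low", "Medium"), "Medium", 4.0),   # rule 1
    (("High", "Medium"), "Low", 3.0),     # rule 2
    (("Low", "Medium"), "High", 2.0),     # rule 3: conflicts with rule 1
    (("Low", "Medium"), "Medium", 1.0),   # rule 4: equivalent to rule 1
]
reduced = reduce_rules(rules)             # two rules survive
```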
Figure 4: Flowchart of the rule reduction algorithm.
2.3 Parameter Tuning. The LMS algorithm is used to tune the parameters of the neuro-fuzzy system. The error function used is

E = \frac{1}{2} \sum_{m=1}^{n_5} \bigl(o_m^V - y_m\bigr)^2 = \frac{1}{2} \sum_{m=1}^{n_5} e_m^2,   (2.12)

where e_m = o_m^V - y_m.
Figure 5: Original rule set R.
Figure 6: Reduced rule set R after the presentation of rule 1.
Each parameter in the fuzzy neural network is then adjusted according to the partial derivative of E with respect to that parameter:

\Delta c^{IV}_{oL_k(m),m} = -\frac{\partial E}{\partial c^{IV}_{oL_k(m),m}} = -e_m \frac{\partial o_m^V}{\partial c^{IV}_{oL_k(m),m}} = -e_m \frac{f_k^{III}/(\delta^{IV}_{oL_k(m),m})^2}{\sum_{k'=1}^{n_3} f_{k'}^{III}/(\delta^{IV}_{oL_{k'}(m),m})^2}   (2.13)

\Delta \delta^{IV}_{oL_k(m),m} = -\frac{\partial E}{\partial \delta^{IV}_{oL_k(m),m}} = -e_m \frac{\partial o_m^V}{\partial \delta^{IV}_{oL_k(m),m}}
= -2 e_m f_k^{III} \frac{\sum_{k'=1}^{n_3} c^{IV}_{oL_{k'}(m),m} f_{k'}^{III}/(\delta^{IV}_{oL_{k'}(m),m})^2 - c^{IV}_{oL_k(m),m} \sum_{k'=1}^{n_3} f_{k'}^{III}/(\delta^{IV}_{oL_{k'}(m),m})^2}{(\delta^{IV}_{oL_k(m),m})^3 \Bigl(\sum_{k'=1}^{n_3} f_{k'}^{III}/(\delta^{IV}_{oL_{k'}(m),m})^2\Bigr)^2}   (2.14)
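A reconstructed gradient such as equation 2.13 can be sanity-checked against a central finite difference of equation 2.2; a sketch with made-up parameter values (symbols follow equation 2.2, names are ours):

```python
def output(cs, ds, fs):
    # o^V = sum_k (c_k / d_k^2) f_k / sum_k (f_k / d_k^2), as in equation 2.2
    num = sum(c / d ** 2 * f for c, d, f in zip(cs, ds, fs))
    den = sum(f / d ** 2 for d, f in zip(ds, fs))
    return num / den

def grad_c(cs, ds, fs, k):
    # Analytic derivative of the output with respect to c_k
    # (equation 2.13 without the leading -e_m factor)
    den = sum(f / d ** 2 for d, f in zip(ds, fs))
    return (fs[k] / ds[k] ** 2) / den

cs, ds, fs = [0.5, 2.0], [1.0, 0.7], [0.8, 0.3]   # invented values
eps = 1e-6
numeric = (output([cs[0] + eps, cs[1]], ds, fs)
           - output([cs[0] - eps, cs[1]], ds, fs)) / (2 * eps)
analytic = grad_c(cs, ds, fs, 0)
```

Because the output is linear in each consequent centroid, the two estimates agree to floating-point precision.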
Figure 7: Reduced rule set R after the presentation of rule 2.
Figure 8: Reduced rule set R after the presentation of rule 3.
Figure 9: Reduced rule set R after the presentation of rule 4.
For the condition-layer parameters, the chain rule passes through the input membership function \mu^{II}_{i,iL_k(i)}:

\Delta c^{II}_{i,iL_k(i)} = -\frac{\partial E}{\partial c^{II}_{i,iL_k(i)}} = -\frac{\partial E}{\partial o_m^V}\,\frac{\partial o_m^V}{\partial \mu^{II}_{i,iL_k(i)}}\,\frac{\partial \mu^{II}_{i,iL_k(i)}}{\partial c^{II}_{i,iL_k(i)}}
= -2 e_m f_k^{III} \frac{\Bigl(c^{IV}_{oL_k(m),m} \sum_{k'=1}^{n_3} f_{k'}^{III}/(\delta^{IV}_{oL_{k'}(m),m})^2 - \sum_{k'=1}^{n_3} c^{IV}_{oL_{k'}(m),m} f_{k'}^{III}/(\delta^{IV}_{oL_{k'}(m),m})^2\Bigr)\bigl(x_i - c^{II}_{i,iL_k(i)}\bigr)}{(\delta^{IV}_{oL_k(m),m})^2\,(\delta^{II}_{i,iL_k(i)})^2\,\Bigl(\sum_{k'=1}^{n_3} f_{k'}^{III}/(\delta^{IV}_{oL_{k'}(m),m})^2\Bigr)^2}   (2.15)

\Delta \delta^{II}_{i,iL_k(i)} = -\frac{\partial E}{\partial \delta^{II}_{i,iL_k(i)}} = -\frac{\partial E}{\partial o_m^V}\,\frac{\partial o_m^V}{\partial \mu^{II}_{i,iL_k(i)}}\,\frac{\partial \mu^{II}_{i,iL_k(i)}}{\partial \delta^{II}_{i,iL_k(i)}}
= -2 e_m f_k^{III} \frac{\Bigl(c^{IV}_{oL_k(m),m} \sum_{k'=1}^{n_3} f_{k'}^{III}/(\delta^{IV}_{oL_{k'}(m),m})^2 - \sum_{k'=1}^{n_3} c^{IV}_{oL_{k'}(m),m} f_{k'}^{III}/(\delta^{IV}_{oL_{k'}(m),m})^2\Bigr)\bigl(x_i - c^{II}_{i,iL_k(i)}\bigr)^2}{(\delta^{IV}_{oL_k(m),m})^2\,(\delta^{II}_{i,iL_k(i)})^3\,\Bigl(\sum_{k'=1}^{n_3} f_{k'}^{III}/(\delta^{IV}_{oL_{k'}(m),m})^2\Bigr)^2}   (2.16)

3 Experimental Results

Three sets of experiments were performed to evaluate the proposed Hebb rule reduction algorithm: (1) three data sets from Nakanishi (Nakanishi et al., 1993; Sugeno & Yasukawa, 1993), (2) the Pat synthetic pattern classification data set (Pal et al., 2003), and (3) highway traffic flow density prediction.

3.1 Nakanishi Data Set. In this section, the proposed learning method for a neuro-fuzzy system is evaluated using benchmark experiments on three data sets: the human operation of a chemical plant, a nonlinear system, and the daily price of a stock in a stock market. The complete data sets have been published in Nakanishi et al. (1993). They were used there to evaluate the following reasoning methods: Mamdani's version of Zadeh's compositional rule of inference (CRI), Turksen's version of interval-valued CRI, Turksen's versions of the point-valued and interval-valued analogical approximate reasoning scheme (AARS), and Sugeno's position and gradient types of reasoning (Nakanishi et al., 1993). These data sets were also used to evaluate qualitative modeling based on fuzzy logic (Sugeno & Yasukawa, 1993). Each of the three data sets is split into three groups: A, B, and C (Nakanishi et al., 1993). In the following experiments, groups A and B are used as the training sets, and group C is used as the testing set. The experimental results are compared against (1) the pseudo outer-product-based fuzzy neural network (POPFNN) (Ang et al., 2003); (2) its modified version, the rough-set-based pseudo outer-product fuzzy rule identification algorithm (RSPOP) (Ang & Quek, 2005); (3) the adaptive-network-based fuzzy inference system (ANFIS) (Jang, 1993); (4) evolving fuzzy neural networks (EFuNN) (Kasabov, 2001); and (5) the dynamic evolving neural fuzzy inference system (DENFIS) (Kasabov & Song, 2002). To evaluate the performance of these systems, both the mean squared error (MSE) and the correlation coefficient (R) are used in the experiments.
The comparative analysis is presented in section 3.1.4.

3.1.1 A Nonlinear System. The nonlinear system is described by

y = \bigl(1 + x_1^{-2} + x_2^{1.5}\bigr)^2,  1 \le x_1, x_2 \le 5.   (3.1)
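Under this reading of equation 3.1 (the sign of the x_2 exponent is partly garbled in our copy; +1.5 matches the monotonic x_2 dependence described in the text and in Table 2), the target function can be sketched as:

```python
def target(x1, x2):
    # y = (1 + x1^-2 + x2^1.5)^2 on 1 <= x1, x2 <= 5 (our reading of eq. 3.1)
    return (1.0 + x1 ** -2 + x2 ** 1.5) ** 2

# y decreases in x1 and increases in x2 on the stated domain
lo_x1, hi_x1 = target(1.0, 3.0), target(5.0, 3.0)
lo_x2, hi_x2 = target(3.0, 1.0), target(3.0, 5.0)
```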
[Figure 10: Membership functions for the nonlinear system modeling problem. Panel (a) shows the membership function u(x1) over input x1, panel (b) shows u(x2) over input x2, and panel (c) shows u(y1) over output y1. Each panel contains the three gaussian labels T(1) Low, T(2) Medium, and T(3) High.]
The data set consists of four input variables and one output variable, where only x_1 and x_2 are useful in the modeling, while x_3 and x_4 are irrelevant variables. Figure 10 and Table 1 show the shapes, centroids, and widths of the derived gaussian membership functions. For this data set, the feature selection method in Nakanishi et al. (1993) discards the input features x_3 and x_4, while RSPOP (Ang & Quek, 2005) discards x_3 and partially discards x_4. Our proposed method likewise discards the two irrelevant variables x_3 and x_4. There are three fuzzy sets for each of the input features x_1 and x_2 and for the output variable y_1 (see Figure 10). In total, seven rules are generated, as shown in Table 2. To illustrate the interpretability, take the first two rules as an example. The first rule states that if x_1 is low and x_2 is medium, then y_1 is medium; the second states that if x_1 is high and x_2 is medium, then y_1 is low. From equation 3.1, a higher x_1 decreases y and a lower x_1 increases y, whereas a higher x_2 increases y and a lower x_2 decreases y. These two rules capture this behavior: with x_2 held at the same fuzzy label, the output y_1 varies inversely with x_1.
Table 1: Centroid and Width of the Membership Functions for the Nonlinear System Modeling Problem.

Label               Low     Medium   High
x_1: c_1^II         1.219   3.020    4.605
     \delta_1^II    1.900   1.155    0.428
x_2: c_2^II         1.061   3.803    4.962
     \delta_2^II    0.747   1.953    0.379
y:   c^IV           1.337   3.321    5.050
     \delta^IV      0.816   1.875    1.873
Table 2: Fuzzy Rules for the Nonlinear System Modeling Problem.

Rule   x1       x2       y
1      Low      Medium   Medium
2      High     Medium   Low
3      Low      High     High
4      Medium   Medium   Low
5      Medium   High     Medium
6      Medium   Low      Low
7      Low      Low      Low
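For illustration, the seven rules of Table 2 can be evaluated with a generic Mamdani-style inference step: the min t-norm for the antecedents and a weighted-centroid (height) defuzzification over the output centroids of Table 1. These operator choices are assumptions made for this sketch, not necessarily the exact formulation of the proposed model.

```python
import math

def mf(x, c, d):
    # Gaussian membership (assumed form, see Table 1 for parameters).
    return math.exp(-((x - c) ** 2) / (2 * d ** 2))

# Table 1 parameters: label -> (centroid, width).
X1 = {"Low": (1.219, 1.900), "Medium": (3.020, 1.155), "High": (4.605, 0.428)}
X2 = {"Low": (1.061, 0.747), "Medium": (3.803, 1.953), "High": (4.962, 0.379)}
Y_CENTROID = {"Low": 1.337, "Medium": 3.321, "High": 5.050}

# Table 2 rule base: (x1 label, x2 label, y label).
RULES = [("Low", "Medium", "Medium"), ("High", "Medium", "Low"),
         ("Low", "High", "High"), ("Medium", "Medium", "Low"),
         ("Medium", "High", "Medium"), ("Medium", "Low", "Low"),
         ("Low", "Low", "Low")]

def infer(x1, x2):
    """Firing strength of each rule is the min of its antecedent memberships;
    output is the firing-strength-weighted average of output centroids."""
    num = den = 0.0
    for a1, a2, out in RULES:
        w = min(mf(x1, *X1[a1]), mf(x2, *X2[a2]))
        num += w * Y_CENTROID[out]
        den += w
    return num / den
```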
Table 3: Variables in the Data Set of the Human Operation of a Chemical Plant.

x1   Monomer concentration
x2   Change of monomer concentration
x3   Monomer flow rate
x4   Local temperatures inside the plant
x5   Local temperatures inside the plant
y    Set point for monomer flow rate
3.1.2 Human Operation of a Chemical Plant. The problem is to model the human operation of a chemical plant. There are five input variables and one output variable, as shown in Table 3. The feature selection method in Nakanishi et al. (1993) discards the inputs x2, x4, and x5, whereas RSPOP discards the inputs x1, x2, and x5. In our proposed method, among the five input dimensions x1 to x5, only x3 is preserved; the others are discarded. There are three fuzzy sets in both the input x3 and the output y. Three rules are generated by the system.

3.1.3 Daily Price of a Stock in the Stock Market. This problem is to predict the stock price using various features of a stock in a stock market.
Table 4: Variables in the Data Set of Daily Stock Price Predictions.

x1    Past change of moving average over a middle period
x2    Present change of moving average over a middle period
x3    Past separation ratio with respect to moving average over a middle period
x4    Present separation ratio with respect to moving average over a middle period
x5    Present change of moving average over a short period
x6    Past change of price, for instance, change on one day before
x7    Present change of price
x8    Past separation ratio with respect to moving average over a short period
x9    Present change of moving average over a long period
x10   Present separation ratio with respect to moving average over a short period
y     Prediction of stock price
Table 5: Consolidated Experimental Results on Nakanishi Data Sets.

                       Chemical Plant        Nonlinear System    Stock Prediction
Model                  MSE          R        MSE      R          MSE       R
Hebb rule reduction    2.423×10⁴    0.998    0.185    0.911      15.138    0.947
POPFNN                 5.630×10⁵    0.946    0.270    0.877      76.221    0.733
RSPOP                  2.124×10⁵    0.983    0.383    0.856      24.859    0.922
Sugeno (P-G)           1.931×10⁶    0.990    0.467    0.845      168.90    0.700
Mamdani                6.580×10⁵    0.937    0.862    0.490      40.84     0.865
Turksen (IVCRI)        2.581×10⁵    0.993    0.706    0.609      93.02     0.661
ANFIS                  2.968×10⁶    0.780    0.286    0.853      38.062    0.875
EFuNN                  7.247×10⁵    0.946    0.566    0.720      72.542    0.756
DENFIS                 5.240×10⁴    0.995    0.411    0.805      69.824    0.810
There are 10 input variables and 1 output variable in the data set, summarized in Table 4. The proposed method discards the input variables x2 and x5; the feature selection method in Nakanishi et al. (1993) discards the inputs x1, x2, x3, x6, x7, x9, and x10; and RSPOP discards the inputs x1, x2, x3, x6, and x10. Twenty fuzzy rules are identified by the proposed method.

3.1.4 Discussion. Table 5 shows the consolidated experimental results on the Nakanishi data sets in terms of MSE and Pearson's correlation coefficient (R). On the problem of modeling the human operation of a chemical plant, although only one variable is retained and three fuzzy rules are generated by the proposed method, its modeling error (MSE) and accuracy (R) are the best among the nine models. Good interpretability and accuracy are achieved simultaneously (see Tables 5 and 6). On the modeling of a nonlinear system, the proposed method correctly identifies the useful variables and outperforms the others on both measures. On the prediction of stock price, as described before, the inputs x1, x2, x3, x6, x7, x9, and x10 are discarded by the feature selection method in Nakanishi et al. (1993), and the
Table 6: Benchmark Against RSPOP.

                                             Chemical    Nonlinear   Stock
Data Set                                     Plant       System      Prediction
Hebb rule reduction
  Number of rules before reduction           30          25          50
  Number of rules after reduction            3           7           20
  Rule reduction (%)                         90.0        64.0        60.0
  MSE                                        2.423×10⁴   0.185       15.138
  R                                          0.998       0.911       0.947
RSPOP
  Number of rules before reduction           24          22          50
  Number of rules after reduction            14          17          29
  Rule reduction (%)                         41.7        22.7        14.0
  MSE                                        2.124×10⁵   0.383       24.859
  R                                          0.983       0.856       0.922
Improvement (Hebb rule reduction vs. RSPOP)
  Rule reduction (%)                         48.3        41.3        46.0
  MSE (%)                                    88.6        51.7        39.1
  R (%)                                      1.5         6.0         2.6
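The two measures reported in Tables 5 and 6, the mean squared error (MSE) and Pearson's correlation coefficient (R), together with the relative MSE improvement figures, can be computed as in this minimal sketch (plain Python; not the authors' implementation):

```python
import math

def mse(y_true, y_pred):
    # Mean squared prediction error.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson_r(y_true, y_pred):
    # Pearson correlation coefficient between targets and predictions.
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return cov / (st * sp)

# Relative MSE improvement as reported in Table 6, e.g. the chemical plant
# data set: Hebb rule reduction 2.423e4 versus RSPOP 2.124e5.
improvement = 100 * (1 - 2.423e4 / 2.124e5)
```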
inputs x1, x2, x3, x6, and x10 are discarded by RSPOP. In the proposed model, x2 and x5 are discarded. Input x2 is discarded by all three methods, but neither of the other two considered input x5 a redundant feature. Although the proposed model uses a larger set of attributes, it achieves superior performance compared to the other models, indicating that some variables discarded by the other models are still useful in the prediction and cannot be discarded. As both the RSPOP model and the proposed Hebb rule reduction model employ a Hebbian-like learning method and reduce the rule set and attribute set simultaneously, with the aim of pursuing a balance between accuracy and interpretability, they are contrasted in Table 6. The proposed model begins with a rule set that is larger than RSPOP's for the first two data sets and of equal size for the third. After tuning and pruning, Hebb rule reduction ends up with a much smaller rule set than RSPOP's for all three data sets; the proportion of rules removed by the proposed method is correspondingly larger. For the MSE, Hebb rule reduction outperforms RSPOP with improvements of 88.6%, 51.7%, and 39.1% for the three data sets, respectively. For R, the improvements are 1.5%, 6.0%, and 2.6%, respectively. It is safe to conclude that the proposed Hebb rule reduction achieves better performance than RSPOP in both interpretability and accuracy.

3.2 Pat Synthetic Pattern Classification. The Pat synthetic data set consists of 880 two-dimensional pattern points, depicted in Figure 11. There are three linearly nonseparable classes. The figure is marked with classes 1 (c1) and 2 (c2), while class 3 (c3) corresponds to the background region. In this
[Figure 11 omitted from extraction: a scatter plot of the 880 points in the F1-F2 plane, with regions for classes c1 and c2 and the background class c3.]

Figure 11: Synthetic data set Pat.

Table 7: Comparison of Performance of the Different Models.

Model                      Accuracy %   Kappa %   Number of Rules   Confusion
Hebb rule reduction        84.24        73.12     28                0.67
Modular rough-fuzzy MLP    70.31        71.80     8                 1.1
EFuNN                      82.25        70.22     88                1.0
RSPOP                      55.86        22.44     16                1.3
experiment, 10% of the data are used as the training set, and the remaining data are used as the testing set. The modular rough-fuzzy MLP (Mitra et al., 2001; Pal et al., 2003), EFuNN, and RSPOP are compared to the proposed Hebb rule reduction system. Quantitative measures are used to evaluate the performance of the rules (Pal et al., 2003): (1) accuracy, the percentage of correct classifications on the testing set; (2) kappa, the coefficient of agreement, which measures the relationship of beyond-chance agreement to expected disagreement; (3) the number of rules; and (4) confusion, a measure quantifying the goal that confusion should be restricted to a minimum number of classes. The performance of the different models is shown in Table 7. The proposed Hebb rule reduction outperforms the other benchmarked models. Its accuracy and kappa are the highest among the models, which indicates its good capability for the classification task. Its confusion measure is also the lowest, which means that it has the fewest classes between which confusion occurs. However, to achieve such superior performance, the proposed Hebb rule reduction needs more rules than the rough-fuzzy MLP and RSPOP.
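Of the four measures, accuracy and kappa can be derived from a confusion matrix. The sketch below uses the standard Cohen's kappa definition, which is assumed here to match the coefficient of agreement used in the comparison; the matrix values are toy numbers, not those of the experiment.

```python
def accuracy_and_kappa(confusion):
    """confusion[i][j] = count of class-i samples predicted as class j.
    Returns (observed accuracy, Cohen's kappa)."""
    n = sum(sum(row) for row in confusion)
    k = len(confusion)
    p_o = sum(confusion[i][i] for i in range(k)) / n          # observed agreement
    row = [sum(confusion[i]) for i in range(k)]
    col = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    p_e = sum(row[i] * col[i] for i in range(k)) / (n * n)    # chance agreement
    return p_o, (p_o - p_e) / (1 - p_e)

cm = [[50, 5, 5], [4, 60, 6], [6, 4, 60]]   # toy 3-class confusion matrix
acc, kappa = accuracy_and_kappa(cm)
```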
[Figure 12 omitted from extraction: normalized traffic density of lanes L1, L2, and L3 at PIE site 29, plotted against normalized time over six days (Thursday 05/9 through Tuesday 10/9), with the cross-validation divisions cv1, cv2, and cv3 marked.]

Figure 12: Plotting of the traffic flow density data of three lanes and the division of the cross-validation groups.
3.3 Highway Traffic Flow Density Prediction. This experiment seeks to evaluate the performance of the proposed Hebb rule reduction learning method on the highway traffic flow density data set. The raw traffic flow data for the benchmark test were obtained from Tan (1997). The raw data were collected at site 29, located at exit 15 along the eastbound Pan Island Expressway (PIE) in Singapore, using loop detectors embedded beneath the road surface. There are five lanes at the site: two exit lanes and three straight lanes for the main traffic. In this experiment, only the traffic flow data for the three straight lanes are employed. There are four input attributes in this data set: the time and the traffic density of the three lanes. The traffic flow trend at the site is modeled by the proposed Hebb rule reduction neuro-fuzzy model, and the trained network is used to predict traffic density for time t + τ, where τ = 5, 10, 30, 45, and 60 minutes. Figure 12 shows the plotting of the traffic flow density data for the three straight lanes spanning a period of six days from September 5 to September 10, 1996. The data set is divided into three cross-validation groups of training and testing sets, designated CV1, CV2, and CV3, the same as in Tung and Quek (2002) and Ang and Quek (2005). The Pearson correlation coefficient (R) and the MSE are used to evaluate the performance. The proposed method is compared to RSPOP, POPFNN, GenSo, EFuNN, and DENFIS. The consolidated experimental results are shown in Figure 13. (The proposed Hebb rule reduction method is abbreviated as Hebb-R-R.) Figures 13a, 13c, and 13e show the average R of the three cross-validation groups, whereas Figures 13b, 13d, and 13f show the average MSE for them. The proposed Hebb rule reduction method shows superior performance to the other models: it achieves the highest
[Figure 13 omitted from extraction: six panels comparing Hebb-R-R, RSPOP, POPFNN, GenSoFNN, EFuNN, and DENFIS over time intervals from 0 to 60 minutes; panels (a), (c), and (e) plot the average R and panels (b), (d), and (f) plot the average MSE.]
Figure 13: Average R and MSE of the Prediction in the Three Cross-Validation Groups.
Table 8: Average MSE, R, and Number of Rules.

Neuro-Fuzzy Models     Average MSE   Average R   Average Number of Rules
Hebb rule reduction    0.114         0.864       8.1
RSPOP                  0.146         0.834       14.4
POPFNN                 0.173         0.814       40.0
GenSoFNN               0.164         0.813       50.0
EFuNN                  0.189         0.798       234.5
DENFIS                 0.153         0.831       9.7
average R and the lowest average MSE for lanes 1 to 3 over all intervals τ = 5, 15, 30, 45, and 60. Table 8 shows the average R, MSE, and number of rules identified over all the time intervals of the three lanes. The proposed Hebb rule reduction method derives on average only 8.1 fuzzy rules, the lowest among these models, while its modeling accuracy is superior. This indicates that the proposed method yields a good balance of accuracy and interpretability.

4 Conclusion

This letter proposes a novel method for balancing interpretability and the demand for modeling accuracy in a neuro-fuzzy system. Interpretability is improved through a judicious reduction of the membership functions, fuzzy rules, and attributes, and accuracy is boosted through the least mean square (LMS) learning algorithm, which tunes the rules and MFs to the most appropriate locations in the feature space. The Hebbian ordering is proposed to represent the importance of each rule. A rule with a higher Hebbian ordering has a greater degree of coverage of the sample points, contributes more toward the modeling of the data, and is more likely to be preserved. Membership functions are merged according to the Hebbian importance of their associated rules. The resultant equivalent rules are deleted, and attributes with only one MF are removed. The problem of rule conflict is resolved by retaining the rules of higher importance. The proposed membership function merger process not only reduces similar membership functions but also preserves the more informative rules (the rules of high Hebbian ordering), enabling the subsequent LMS learning process to achieve superior accuracy. The proposed method has these key strengths:
• An intuitive, consistent, and compact rule base can be extracted from the trained neuro-fuzzy model, with a reduced set of attributes, clearly separated membership functions in each dimension, and, most important, a small number of judiciously derived rules.

• The neuro-fuzzy model does not need a clustering method like LVQ (Kohonen, 1982) to specify the number and positions of the membership functions in each dimension prior to the identification of rules, unlike the models in Tung and Quek (2002), Ang et al. (2003), and Ang and Quek (2005).

• A good balance of interpretability and accuracy can be obtained through the proposed iterative processing by Hebb rule reduction and the LMS learning algorithm.
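As a loose illustration of the pruning idea summarized above, the sketch below ranks rules by an accumulated Hebbian weight and resolves antecedent conflicts in favor of the more important rule. The weight values, the keep fraction, and the conflict rule shown here are hypothetical choices for the sketch, not the paper's actual criteria.

```python
def prune_rules(rules, hebbian_weight, keep_fraction=0.5):
    """Keep the rules with the largest Hebbian weight (importance).
    hebbian_weight maps a rule to its accumulated firing strength over the data.
    Conflicting rules (same antecedent, different consequent) are resolved
    in favor of the higher-ranked rule."""
    ranked = sorted(rules, key=hebbian_weight, reverse=True)
    kept, seen_antecedents = [], set()
    for rule in ranked[: max(1, int(len(ranked) * keep_fraction))]:
        antecedent = rule[:-1]                      # all but the consequent
        if antecedent not in seen_antecedents:      # resolve conflicts by rank
            seen_antecedents.add(antecedent)
            kept.append(rule)
    return kept

# Toy rule base: (x1 label, x2 label, consequent); weights are illustrative.
rules = [("Low", "Medium", "Medium"), ("Low", "Medium", "Low"),
         ("High", "Low", "Low"), ("Medium", "High", "High")]
weights = {rules[0]: 9.0, rules[1]: 2.0, rules[2]: 5.0, rules[3]: 1.0}
kept = prune_rules(rules, lambda r: weights[r], keep_fraction=0.75)
```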
The performance of the proposed Hebb rule reduction method is evaluated using five data sets: (1) human operation of a chemical plant, (2) a nonlinear system, (3) daily price of a stock in a stock market, (4) Pat synthetic pattern classification, and (5) highway traffic flow density prediction. The experimental results show consistently good performance by the proposed method when benchmarked against other established neuro-fuzzy models.

References

Ang, K. K., & Quek, C. (2005). RSPOP: Rough set-based pseudo outer-product fuzzy rule identification algorithm. Neural Computation, 17(1), 205–243.
Ang, K. K., Quek, C., & Pasquier, M. (2003). POPFNN-CRI(S): Pseudo outer product-based fuzzy neural network using the compositional rule of inference and singleton fuzzifier. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 33(6), 838–849.
Banerjee, M., Mitra, S., & Pal, S. K. (1998). Rough fuzzy MLP: Knowledge encoding and classification. IEEE Transactions on Neural Networks, 9(6), 1203–1216.
Casillas, J., Cordon, O., Del Jesus, M. J., & Herrera, F. (2001). Genetic feature selection in a fuzzy rule-based classification system learning process for high dimensional problems. Information Sciences, 136(1–4), 169–191.
Casillas, J., Cordon, O., Del Jesus, M. J., & Herrera, F. (2005). Genetic tuning of fuzzy rule deep structures preserving interpretability and its interaction with fuzzy rule set reduction. IEEE Transactions on Fuzzy Systems, 13, 13–29.
Casillas, J., Cordon, O., Herrera, F., & Magdalena, L. (2003). Interpretability improvements to find the balance interpretability-accuracy in fuzzy modeling: An overview. In J. Casillas, O. Cordon, F. Herrera, & L. Magdalena (Eds.), Interpretability issues in fuzzy modeling (pp. 3–22). New York: Springer.
Castro, J. L. (1995). Fuzzy logic controllers are universal approximators. IEEE Transactions on Systems, Man, and Cybernetics, 25(4), 629–635.
Chakraborty, D., & Pal, R. N. (2004). A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification. IEEE Transactions on Neural Networks, 15(1), 110–123.
Combs, W. E., & Andrews, J. E. (1998). Combinatorial rule explosion eliminated by a fuzzy rule configuration. IEEE Transactions on Fuzzy Systems, 6(1), 1–11.
Cordon, O., & Herrera, F. (2000). A proposal for improving the accuracy of linguistic modeling. IEEE Transactions on Fuzzy Systems, 8(3), 335–344.
Gonzalez, A., & Perez, R. (1998). Completeness and consistency conditions for learning fuzzy rules. Fuzzy Sets and Systems, 96(1), 37–51.
Gonzalez, A., & Perez, R. (2001). Selection of relevant features in a fuzzy genetic learning algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 31(3), 417–425.
Jang, J.-S. R. (1993). ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man, and Cybernetics, 23(3), 665–685.
Juang, C. F., & Lin, C. T. (1998). An online self-constructing neural fuzzy inference network and its applications. IEEE Transactions on Fuzzy Systems, 6(1), 12–32.
Kasabov, N. K. (2001). Evolving fuzzy neural networks for supervised/unsupervised online knowledge-based learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 31(6), 902–918.
Kasabov, N. K., & Song, Q. (2002). DENFIS: Dynamic evolving neural-fuzzy inference system and its application for time-series prediction. IEEE Transactions on Fuzzy Systems, 10(2), 144–154.
Kohonen, T. K. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.
Krone, A., & Kiendl, H. (1994). Automatic generation of positive and negative rules for two-way fuzzy controllers. In Proceedings of the Second European Congress on Intelligent Techniques and Soft Computing (pp. 438–447). Aachen, Germany: Verlag Mainz.
Krone, A., & Taeger, H. (2001). Data-based fuzzy rule test for fuzzy modeling. Fuzzy Sets and Systems, 123(3), 343–358.
Lin, C. T. (1996). Neural fuzzy systems: A neuro-fuzzy synergism to intelligent systems. Upper Saddle River, NJ: Prentice Hall.
Mamdani, E. H., & Assilian, S. (1975). An experiment in linguistic synthesis with a fuzzy logic controller. International Journal of Man-Machine Studies, 7(1), 1–13.
Mitra, S., & Hayashi, Y. (2000). Neuro-fuzzy rule generation: Survey in soft computing framework. IEEE Transactions on Neural Networks, 11(3), 748–768.
Mitra, S., Mitra, P., & Pal, S. K. (2001). Evolutionary modular design of rough knowledge-based network using fuzzy attributes. Neurocomputing, 36(1), 45–66.
Nakanishi, H., Turksen, I. B., & Sugeno, M. (1993). A review and comparison of six reasoning methods. Fuzzy Sets and Systems, 57(3), 257–294.
Pal, S. K., Mitra, S., & Mitra, P. (2003). Rough fuzzy MLP: Modular evolution, rule generation and evaluation. IEEE Transactions on Knowledge and Data Engineering, 15(1), 14–25.
Quek, C., & Zhou, R. W. (1999). POPFNN-AAR(S): A pseudo outer-product based fuzzy neural network. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29, 859–870.
Setnes, M., & Babuska, R. (2001). Rule base reduction: Some comments on the use of orthogonal transforms. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 31(2), 199–206.
Setnes, M., Babuska, R., Kaymak, U., & Van Nauta Lemke, H. R. (1998). Similarity measures in fuzzy rule base simplification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 28(3), 376–386.
Silipo, R., & Berthold, M. (2000). Input features impact on fuzzy decision processes. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 30(6), 821–834.
Sugeno, M., & Kang, G. T. (1988). Structure identification of fuzzy model. Fuzzy Sets and Systems, 28(1), 15–33.
Sugeno, M., & Yasukawa, T. (1993). A fuzzy-logic-based approach to qualitative modeling. IEEE Transactions on Fuzzy Systems, 1(1), 7–31.
Takagi, T., & Sugeno, M. (1985). Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man, and Cybernetics, 15(1), 116–132.
Tan, G. K. (1997). Feasibility of predicting congestion states with neural networks (Tech. Rep.). Singapore: Nanyang Technological University.
Tung, W. L., & Quek, C. (2002). GenSoFNN: A generic self-organizing fuzzy neural network. IEEE Transactions on Neural Networks, 13(5), 1075–1086.
Yen, J., & Wang, L. X. (1999). Simplifying fuzzy rule-based models using orthogonal transformation methods. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29(1), 13–24.
Received January 11, 2006; accepted August 2, 2006.
Erratum

Kalman Filter Control Embedded into the Reinforcement Learning Framework

István Szita and András Lőrincz

Neural Computation, 16:491–499 (2004)

The theorem on p. 495 should read as follows: If the system (F, H) is observable, Σ_0 ≥ Σ* (or Σ̂_0 ≥ Σ̂* in the case of Sarsa), and there exists an L such that ||F + GL|| ≤ 1/√(1 − p), then there exists a series of learning rates α_t such that 0 < α_t ≤ 1/||x_t||⁴, Σ_t α_t = ∞, Σ_t α_t² < ∞, and it can be computed online. For all sequences of learning rates satisfying these requirements, the combined TD-KF methods converge to the optimal policy with probability 1, both for state-value and for action-value estimations.

Technical report http://arxiv.org/abs/cs/0306120 has been updated accordingly. It contains the details of the proof.

Acknowledgment

We would like to thank László Gerencsér for calling our attention to a mistake in the previous version of the convergence proof.
ARTICLE
Communicated by Jonathan Victor
Estimating Information Rates with Confidence Intervals in Neural Spike Trains

Jonathon Shlens
[email protected] Salk Institute, La Jolla, CA 92037, and Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093, U.S.A.
Matthew B. Kennel
[email protected] Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093, U.S.A.
Henry D. I. Abarbanel
[email protected] Institute for Nonlinear Science, University of California, San Diego, La Jolla, CA 92093, U.S.A., and Department of Physics and Marine Physical Laboratory, Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA 92093, U.S.A.
E. J. Chichilnisky
[email protected] Salk Institute, La Jolla, CA 92037, U.S.A.
Information theory provides a natural set of statistics to quantify the amount of knowledge a neuron conveys about a stimulus. A related work (Kennel, Shlens, Abarbanel, & Chichilnisky, 2005) demonstrated how to reliably estimate, with a Bayesian confidence interval, the entropy rate from a discrete, observed time series. We extend this method to measure the rate of novel information that a neural spike train encodes about a stimulus—the average and specific mutual information rates. Our estimator makes few assumptions about the underlying neural dynamics, shows excellent performance in experimentally relevant regimes, and uniquely provides confidence intervals bounding the range of information rates compatible with the observed spike train. We validate this estimator with simulations of spike trains and highlight how stimulus parameters affect its convergence in bias and variance. Finally, we apply these ideas to a recording from a guinea pig retinal ganglion cell and compare results to a simple linear decoder.
Neural Computation 19, 1683–1719 (2007)
© 2007 Massachusetts Institute of Technology
1 Introduction

How neurons encode and process information about the sensory environment is an essential focus of systems neuroscience. While much progress has been made (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Borst & Theunissen, 1999), many fundamental questions remain. Which features of the sensory world do neurons extract (de Ruyter van Steveninck & Bialek, 1988; Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Dimitrov, Miller, Gedeon, Aldworth, & Parker, 2003; Paninski, 2003b; Aguera y Arcas & Fairhall, 2003; Sharpee, Rust, & Bialek, 2004)? How do correlations within a spike train (Mainen & Sejnowski, 1995; Berry, Warland, & Meister, 1997; Reinagel & Reid, 2002; Fellous, Tiesinga, Thomas, & Sejnowski, 2004; Abarbanel & Talathi, 2006) or between spike trains of different cells (Mastronarde, 1989; Meister, Lagnado, & Baylor, 1995; Usrey & Reid, 1999) convey information about the stimulus (Gawne & Richmond, 1993; Dimitrov et al., 2003; Rieke et al., 1997; Brenner, Strong, Koberle, Bialek, & de Ruyter van Steveninck, 2000; Warland, Reinagel, & Meister, 1997; Stanley, Li, & Dan, 1999; Gat & Tishby, 1999; Reich, Mechler, & Victor, 2001; Schnitzer & Meister, 2003; Nirenberg & Latham, 2003; Schneidman, Bialek, & Berry, 2003)? Are the tuning properties of sensory neurons optimized for their natural environment (Rieke, Bodnar, & Bialek, 1995; Lewen, Bialek, & de Ruyter van Steveninck, 2001; Simoncelli & Olshausen, 2001)? How does adaptation optimize information flow in sensory neurons (Brenner, Bialek, & de Ruyter van Steveninck, 2000; Fairhall, Lewen, Bialek, & de Ruyter van Steveninck, 2001; Meister & Berry, 1999)? Addressing such questions requires a reliable measure of how much information about the sensory world is contained in a spike train. The average mutual information rate is a general measure of nonlinear correlation between dynamic sensory stimuli and neural responses (Shannon, 1948; Cover & Thomas, 1991).
It does not assume a specific model for interpreting a spike train, yet it bounds the amount of information available to any method for reading or decoding a spike train. This quantity allows one to measure in a meaningful way how efficiently neural circuits encode the sensory stimulus (Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998) or evaluate on an absolute scale the performance of any decoding mechanism, whether biological or artificial (Buracas, Zador, DeWeese, & Albright, 1998; Warland et al., 1997). A major obstacle to the broad application of mutual information measurements to neural coding problems is that estimating mutual information from experimental data sets is difficult. First, large data requirements of naive estimation techniques restrict the use of information-theoretic analyses to preparations with long-duration recordings. Judging the convergence and bias of conventional estimators of entropy and mutual information is difficult (Treves & Panzeri, 1995; Paninski, 2003a), making the application
of these tools particularly problematic in limited data sets. Second, common methods for estimating information rates require a heuristic judgement of the duration over which the stimulus and response histories affect the response (Theunissen & Miller, 1995). This subjective assessment can largely dominate the final estimate (Kennel, Shlens, Abarbanel, & Chichilnisky, 2005). Third, applying this heuristic precludes an objective estimate of a confidence interval for the information rate. In this article, we address all of these shortcomings. We introduce a new estimator of information rates for the analysis of neural spike trains that (1) converges quickly to the true value in finite data sets, (2) requires no heuristic judgements, and (3) alone among all other methods provides accurate confidence intervals bounding the range of information rates from sources that most likely generated the observed data. The last issue is important because it permits one to compare information rates in a statistically meaningful manner. We begin by discussing how to represent a spike train as a time series, and examine what the relevant quantities are for characterizing the information in this representation. We introduce an estimator of information rate, derived from Kennel et al. (2005), and briefly review major issues in estimating these quantities and their associated confidence intervals in neural data. We validate this estimator in simulation and discuss how stimulus parameters in an experiment affect bias and variance in the estimate. Finally, we test this approach on spike data from a guinea pig retinal ganglion cell, calculating the information rate and coding efficiency with confidence intervals, and comparing derived bounds to the performance of a simple linear decoder.

2 Quantifying Information in a Spike Train

We begin with several assumptions about how a spike train represents information. These assumptions ought to be as encompassing as possible, balanced by computational tractability, in order to handle two essential features of neural systems:

1. Spike trains are lists of stereotyped action potentials occurring in continuous time.

2. Spike trains are generated by a dynamical system with temporal dependencies.

In this setting, only the spike times and the number of action potentials in a response can convey information about the stimulus. We represent the spike train as the output of a symbolic dynamical system evolving in time (Lind & Marcus, 1996; Ott, 2002; Gilmore & Lefranc, 2002). By symbolic, we mean that the representation of the spike train is a sequence of integers with a finite alphabet. We select a short time
window δt, termed the bin size, and count the number of spikes in each time bin (MacKay & McCulloch, 1952; Strong et al., 1998; see also Rapp et al., 1994). This is indicated in Figures 1d and 1g. The resulting time course of the observed response is represented by a symbol stream of integers R = {r_1, r_2, . . . , r_N}, where each integer r_i is the number of spikes that occur between times (i − 1)δt and iδt. Typically, δt is chosen such that each r_i is binary. The bin size is the effective temporal resolution of the spike train (Theunissen & Miller, 1995; Reinagel & Reid, 2000). This coarse graining of the spike train assumes that the relevant timescale of neural dynamics is larger than the bin size. As a realization of the underlying symbol source, the symbol stream is now amenable to tools from symbolic dynamics and data compression (Willems, Shtarkov, & Tjalkens, 1995; Kennel et al., 2005). Information theory provides a framework for quantifying the transmission of information in any communications channel (Shannon, 1948; Cover & Thomas, 1991; MacKay, 2003; Fano, 1961). In neural coding, the sensory stimulus S and neural response R are considered the respective input to and output of a lossy communications channel, represented probabilistically by
Figure 1: Block diagram of the entire estimation procedure, divided into the two larger tasks of estimating the total entropy rate (upper-right box) (see section 5.1). Simulate spike trains and bin at a particular temporal resolution in order to determine appropriate stimulus parameters (a–b). Perform the experiment and bin the spike train at the same temporal resolution (c–d). (e) Generate N_MC samples from the continuous distribution P(h(R) | data) (see section 3.1). Estimate the noise entropy rate by recording multiple repetitions of a stimulus (f) and binning the resulting spike train (g) (see section 3.2). (h) Calculate P(h(R|t) | data) using the adapted algorithm to also condition on phase time t. The result is N_S sampled distributions, where t and i label the time and sample index in h_i(t). (i) Selecting L to be several times the autocorrelation width suffices, although more sophisticated algorithms exist (see appendix A). (j) Generate a stationary bootstrap B to capture the autocorrelation in h(R|t) by acting as a surrogate for the integer time indices t (see appendix A). (k) Draw multiple integers J from a uniform probability distribution on {1, . . . , N_MC} (see section 3.3). (l) Select a set of samples {h_i(t)} using B and J for the time and sample indices and average. Repeat this procedure to generate N_MC samples from P(h(R|S) | data), each time generating new sets B and J of time and sample indices (see equation 3.4). (m) Subtract off the bias discovered in the bootstrap resampling (see equation 3.5). (n) Subtract individual samples of entropy rates to generate samples of P(I(S; R) | data). The mean and quantiles of P(I(S; R) | data) provide the final estimate and confidence interval bounding the range of likely average mutual information rates (see section 3.5).
Estimating Information Rates in Neural Spike Trains
1687
P(R|S) (for reviews, see Rieke et al., 1997, and Borst & Theunissen, 1999).1 In our framework, the stimulus could be a continuous or discrete signal, but the response will always be a stream of discrete integers. The goal is to quantify the nonlinear correlation or statistical dependence between the stimulus and spike train response.
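The binning step described above is simple to make concrete; a minimal sketch (the function name and spike times are illustrative, not from the paper):

```python
import numpy as np

def bin_spike_train(spike_times, t_start, t_end, dt):
    """Discretize spike times (seconds) into integer counts r_i.

    Bin i covers [t_start + (i-1)*dt, t_start + i*dt); if dt is chosen
    small enough, every count is 0 or 1 and the stream R is binary.
    """
    n_bins = int(round((t_end - t_start) / dt))
    edges = t_start + dt * np.arange(n_bins + 1)
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts

# Three spikes binned at dt = 4 ms; the last two fall in the same bin,
# so this dt would be too coarse for a strictly binary stream.
R = bin_spike_train([0.001, 0.0105, 0.0115], t_start=0.0, t_end=0.02, dt=0.004)
# R -> [1, 0, 2, 0, 0]
```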
1 P(X) denotes the probability distribution of the random variable X, while P(X = x) denotes the probability that X = x. For convenience, we sometimes abbreviate the latter as P(x), where X is implied.
J. Shlens, M. Kennel, H. Abarbanel, and E. Chichilnisky
The classical functional for measuring statistical dependence between two distributions is the mutual information log_2 [P(s, r) / (P(s) P(r))] (Fano, 1961), where P(s) and P(r) are the probabilities of observing stimulus s ∈ S and response r ∈ R, and P(s, r) is the joint probability for these events. One usually deals with the average of this statistic over the joint distribution (Cover & Thomas, 1991),

I(S; R) = \sum_{s \in S,\, r \in R} P(s, r) \log_2 \frac{P(s, r)}{P(s) P(r)}.
I(S; R) is the average mutual information and answers the question, How much, in bits, on average, do we learn about the observation r when we observe s?2 The average mutual information can also be expressed as a difference in the entropy H[·] of the distributions, I(S; R) = H(S) − H(S|R) = H(R) − H(R|S), where the entropy is defined as

H(S) = -\sum_{s \in S} P(s) \log_2 P(s),
and similarly for H(R). The average conditional entropy is H(S|R) = \sum_{r \in R} P(r) H(S|R = r), with

H(S|R = r) = -\sum_{s \in S} P(s|r) \log_2 P(s|r),

where P(s|r) = P(s, r)/P(r).
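Both forms are easy to check numerically on a toy joint distribution (the 2 × 2 table below is hypothetical, not data from the paper):

```python
import numpy as np

# Hypothetical joint distribution P(s, r): two stimuli (rows), two responses
# (columns), chosen only to make the identities concrete.
P_sr = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

P_s = P_sr.sum(axis=1)          # marginal P(s)
P_r = P_sr.sum(axis=0)          # marginal P(r)

# Average mutual information via the Kullback-Leibler form.
I_kl = sum(P_sr[i, j] * np.log2(P_sr[i, j] / (P_s[i] * P_r[j]))
           for i in range(2) for j in range(2))

# The same quantity as a difference of entropies, I = H(R) - H(R|S).
H_r = -sum(p * np.log2(p) for p in P_r)
H_r_given_s = -sum(P_sr[i, j] * np.log2(P_sr[i, j] / P_s[i])
                   for i in range(2) for j in range(2))
I_ent = H_r - H_r_given_s
# Both evaluate to about 0.278 bits for this table.
```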
The benefit of this formulation is that it is easy to generalize this quantity to time series with temporal dependencies. In a non-Poisson spike train R, successive symbols rt−1 , rt , . . . are not statistically independent. The stationary distribution associated with a symbol stream is the conditional probability distribution P(rt |rt−1 , . . . , rt−D ), in which the symbol rt observed at time t depends on D previous symbols. D is called the conditioning depth or the order of a Markov process. In a dynamical system, the conditioning depth is potentially infinite (Hilborn, 2000;
2 A second interpretation is to recognize that I(S; R) measures the Kullback-Leibler divergence between the joint distribution and the product of the marginals and thus measures how inefficient the product of the marginal distribution is at encoding the joint distribution (Cover & Thomas, 1991).
Ott, 2002), though in practice, measurements depend on a finite number of previous symbols. In the limiting case, we define the entropy rate,

h(R) = \frac{1}{\delta t} \lim_{D \to \infty} H\left[P(r_t | r_{t-1}, \ldots, r_{t-D})\right],   (2.1)

with

H\left[P(r_t | r_{t-1}, \ldots, r_{t-D})\right] = -\sum_{r_t, \ldots, r_{t-D}} P(r_t, \ldots, r_{t-D}) \log_2 P(r_t | r_{t-1}, \ldots, r_{t-D}).
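A naive plug-in estimate of this quantity at finite conditioning depth D can be sketched as follows; it is shown only to make equation 2.1 concrete, and it carries exactly the small-sample biases discussed later:

```python
import numpy as np
from collections import Counter

def plugin_entropy_rate(R, D, dt):
    """Naive plug-in estimate of h(R) at conditioning depth D, in bits/s.

    Uses H(r_t | r_{t-1..t-D}) = H(blocks of length D+1) - H(blocks of
    length D), with block probabilities estimated by counting.
    """
    def block_entropy(length):
        if length == 0:
            return 0.0
        counts = Counter(tuple(R[i:i + length])
                         for i in range(len(R) - length + 1))
        p = np.array(list(counts.values())) / sum(counts.values())
        return float(-(p * np.log2(p)).sum())

    return (block_entropy(D + 1) - block_entropy(D)) / dt

# Sanity check on an i.i.d. fair binary stream: the conditional entropy is
# 1 bit per symbol, so the rate should be close to 1/dt = 250 bits/s.
rng = np.random.default_rng(0)
R = rng.integers(0, 2, size=100000)
h = plugin_entropy_rate(R, D=2, dt=0.004)
```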
h(R) has units of bits per second. The entropy rate quantifies the uncertainty of the next symbol given knowledge of the infinite prior history of the underlying symbol source (Cover & Thomas, 1991; Kennel et al., 2005). In a neuroscience context, the stimulus S and neural response R are two dynamical time series. The generalization of the average mutual information is the average mutual information rate expressed as the difference of entropy rates,

I(S; R) = h(R) - h(R|S) = h(S) - h(S|R).   (2.2)
This quantity expresses the average rate of novel information (bits per second) that newly observed responses provide about the stimulus time series. It is symmetric in the response and stimulus. The average information rate can mask the contribution of specific stimuli or responses that are particularly informative because it is an average over all pairs of stimuli and responses. It is of interest, then, to quantify the information attributed to a single stimulus or neural response (DeWeese & Meister, 1999; Brenner, Strong et al., 2000; Butts, 2003; Bezzi, Samengo, Leutgeb, & Mizumori, 2002),

I_{sp}(s) \equiv h(R) - h(R | S = s),   (2.3)
I_{sp}(r) \equiv h(S) - h(S | R = r).
These specific information rates quantify the amount of information in a single momentary stimulus or neural response. They are natural decompositions of the average mutual information rate,

I(S; R) = \sum_{r \in R} P(r) I_{sp}(r) = \sum_{s \in S} P(s) I_{sp}(s).
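The decomposition, and the fact that individual terms can be negative, can be checked with plain entropies standing in for entropy rates (the 2 × 2 joint distribution is hypothetical):

```python
import numpy as np

# Hypothetical joint P(s, r): two stimuli (rows), two responses (columns).
P_sr = np.array([[0.45, 0.05],
                 [0.25, 0.25]])
P_s = P_sr.sum(axis=1)
P_r = P_sr.sum(axis=0)

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

H_R = entropy(P_r)

# Specific information for each stimulus: I_sp(s) = H(R) - H(R | S = s).
I_sp = np.array([H_R - entropy(P_sr[i] / P_s[i]) for i in range(2)])

# Weighting the specific terms by P(s) recovers the average mutual
# information; here the second stimulus carries negative specific
# information because it makes the response less predictable.
I_avg = float((P_s * I_sp).sum())
```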
The specific information rates need not be positive, although the average mutual information rate is always nonnegative (Fano, 1961; DeWeese & Meister, 1999). We now discuss how to estimate these quantities from an
experimental data set by extending estimators of entropy rate from previous work.

3 Estimating Information in a Spike Train

The definition of average mutual information rate, equation 2.2, provides two alternatives for calculating the information rate in a neural spike train. One can estimate the information rate by examining the reduction in uncertainty by conditioning in stimulus space (Bialek et al., 1991) or response space (Strong et al., 1998). Stimulus space can be high dimensional and continuous, and thus difficult to sample experimentally. Therefore, we select the alternate formulation, I(S; R) = h(R) − h(R|S), because we can estimate the information rate by measuring solely the neural responses in a suitably designed experiment. In this framework, h(R) is termed the total entropy rate, h_total, and h(R|S) the noise entropy rate, h_noise (Strong et al., 1998). We estimate h_total and h_noise from separate data sets recorded under different stimulus presentations and combine these results to estimate the information rate I(S; R) = h_total − h_noise. Each step of this entire procedure is outlined in Figure 1.

3.1 Estimating Total Entropy Rates. The total entropy rate h_total is the average rate of uncertainty observed in a neural spike train response over a (stationary) stimulus distribution. Experimentally, this means that we present a stimulus to a neuron to sample all possible spiking patterns over a long duration Nδt. Typically, Nδt ranges from several hundred to several thousand seconds (Borst & Theunissen, 1999). The recorded spike train is discretized at a small bin size δt, producing a list of integer spike counts R = {r1, r2, . . . , rN}, where the subscript indexes time. These two steps are diagrammed in Figures 1c and 1d. We have discussed in detail how to estimate the entropy rate of a finite-length symbol stream in related work (Kennel et al., 2005), but we review these ideas briefly here.
A paramount goal of entropy rate estimation is to determine the appropriate conditioning depth D of the underlying symbol source.3 We must balance the selection of D as diagrammed in Figure 2 between two opposing biases on the estimate:
- Unresolved temporal dependencies due to finite D
- Undersampled probability distributions due to finite N
3 The word length (Strong et al., 1998; Borst & Theunissen, 1999; Reinagel & Reid, 2000) is roughly equivalent to the conditioning depth D, but see Kennel et al. (2005).
Figure 2: Selecting the conditioning depth D can dominate entropy rate estimation. This pedagogical figure, reproduced from Kennel et al. (2005), demonstrates how temporal complexity and small data sizes affect entropy rate estimation in a discrete Markov process (D = 10) and a symbolized logistic map (D = ∞). Both symbolic systems use binary alphabets and have a maximum possible entropy rate of 1 bit per symbol. For O(10^6) and O(10^5) simulated symbols (left and right columns, respectively), the entropy rate was estimated as a function of conditioning depth, ĥ(D), using a naive entropy estimator (Miller & Madow, 1954). The dashed line is the true entropy rate calculated analytically. A common heuristic is to identify the final estimate of entropy rate as the plateau on this graph (Strong et al., 1998; Schürmann & Grassberger, 1996). This qualitative assessment is not reliable and can fail in the limit of small data sets (right column) and complex temporal dynamics (top row) (Kennel et al., 2005).
This requires selecting the appropriate probabilistic model complexity for a finite data set, a classic issue in statistical inference and machine learning (Duda, Hart, & Stork, 2001). We resolve this problem by using a model weighting technique termed context tree weighting, originally developed for lossless data compression (Willems et al., 1995; Kennel & Mees, 2002; London, Schreibman, Hausser, Larkum, & Segev, 2002; Kennel et al., 2005), following the spirit of the
minimum description length principle (Rissanen, 1989). This weighting, along with local Bayesian entropy estimators (Wolpert & Wolf, 1995; Nemenman, Shafee, & Bialek, 2002), yields a direct estimator of the entropy rate of R, ĥ_total. We consider the range of plausible entropy rates compatible with the finite data set, not just a single point estimate. We take the Bayesian perspective and view the estimate ĥ_total as the expectation of a posterior distribution of entropy rates most consistent with the observed data,

\hat h_{total} = \int h_{total} \, P(h_{total} \mid \text{data}) \, dh_{total}.
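Given Monte Carlo samples from this posterior, the summary reduces to a mean and two quantiles; a sketch with synthetic stand-in samples (in the real pipeline the samples come from the context-tree algorithm, not a normal distribution):

```python
import numpy as np

# Stand-in for the N_MC entropy rate samples {h*_total} drawn from
# P(h_total | data); values in bits/s, purely illustrative.
rng = np.random.default_rng(1)
N_MC = 4000
h_samples = rng.normal(loc=110.0, scale=2.0, size=N_MC)

h_total_hat = h_samples.mean()                          # posterior-mean estimate
ci_low, ci_high = np.quantile(h_samples, [0.05, 0.95])  # 90% Bayesian interval
```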
The width of P(h_total | data) is a Bayesian confidence interval about its estimate. A wide (narrow) distribution implies large (small) uncertainty about the estimate. The numerical algorithm presented in Kennel et al. (2005) yields N_MC numerical samples {h*_total} of the entropy rate drawn from P(h_total | data) using Monte Carlo techniques (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Hastings, 1970; Gamerman, 1997). This step is diagrammed by Figure 1e. The mean of these N_MC samples is the estimate of ĥ_total, and the 5% and 95% quantiles of this empirical distribution give a 90% confidence interval.

3.2 Estimating Noise Entropy Rates. The noise entropy rate h_noise measures the reliability of a neuron in response to a stimulus. This term quantifies the amount of uncertainty in the response that is not attributable to the stimulus. This entropy rate can also be difficult to estimate from a finite data set. In this section we discuss how to extend the algorithm presented in Kennel et al. (2005) for estimating h(R) to provide an estimator for h(R|S). The quantity h_noise is a conditional entropy rate and thus a weighted average over all conditional entropies,

h(R|S) = \frac{1}{\delta t} \sum_{r_{hist}, s_{hist}} P(r_{hist}, s_{hist}) \, H\left[P(r_t | r_{hist}, s_{hist})\right] = \left\langle h(R|S = s_{hist}) \right\rangle_S,   (3.1)
where hist denotes all previous values up to but not including time t. Thus, rhist and shist denote conditioning histories of response and stimulus space, respectively. To calculate this quantity, we must average over all possible stimuli (i.e., stimulus histories shist ). This can be experimentally infeasible because stimulus space might be continuous and high dimensional. A clever trick to circumvent this problem is to design an experiment where one replays the same time-varying stimulus segment over multiple
trials (Mainen & Sejnowski, 1995; Bair & Koch, 1996; de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Strong et al., 1998). Typically, a stimulus of duration 5 to 30 seconds is replayed for dozens to hundreds of trials. The notion of phase time as used here refers to the time elapsed since the beginning of the trial. We index the phase time (measured in units of the time bin size δt) by an integer t ∈ [1, . . . , NS]. With NR repeated stimulus presentations, this generates a raster plot (see Figure 1f or Figure 8), in which a neuron has received practically the same stimulus history s_hist at each phase time t. In other words, we equate each moment in phase time t with a particular stimulus history s_hist. We additionally assume that each repetition of the stimulus is sufficiently long so that each trial is reasonably ergodic. This means

h(R|S) = \left\langle h(R|S = s_{hist}) \right\rangle_S = \lim_{N_S \to \infty} \left\langle h(R|t) \right\rangle_{t=1}^{N_S}.   (3.2)
We emphasize that this assumption might not hold when NS is small or the stimulus contains temporal correlations, as commonly observed in natural stimuli (Simoncelli & Olshausen, 2001). In practice, for each moment in phase time t we have observed NR draws from the unknown conditional distribution P(rt | rhist, shist). In other words, each presentation has conditioned the response on the same (unspecified) stimulus history shist. The stimulus-conditioned response is the very distribution in equation 3.1 of which we need to calculate the entropy. The entropy of this distribution, averaged over stimulus space (or phase time), is the noise entropy.4 We use the same algorithm as previously used for ĥ_total as a component, this time forming a new context-tree-based estimate at each phase time t from the NR histories. The observations or draws of this conditional distribution are read off going "downward" in a raster plot as diagrammed in Figure 3, but the conditioning history runs back in phase time as usual. Each of the NS trees is processed in Figure 1h, following Kennel et al. (2005), providing NMC samples from P(h_noise(t) | data), the Bayesian posterior for the instantaneous noise entropy rate. We denote the jth numerical sample
4 We note that the definition of the noise entropy rate differs from the method commonly used in practice (Strong et al., 1998). The average noise entropy rate is by definition the average over stimulus space of all individual noise entropy rates h(R|S = s). The direct method calculates the average entropy across stimulus space for varying conditioning depths; subsequently, it derives an entropy rate estimate from the average entropies. This out-of-order procedure in practice avoids many complicated extrapolations (Kennel et al., 2005).
Figure 3: A diagram demonstrating how to estimate P(rt | rhist , shist ) and, correspondingly, the noise entropy rate hˆ noise (t), conditioned on a particular stimulus. Each row is a discretized spike train response to a repeated stimulus. The set of NR time series ending prior to some moment in phase time t shows the conditioning histories rhist , explicitly, and shist , implicitly. With NR samples of histories, we use the context tree to predict each rt from its past history and estimate hˆ noise (t) = H [P(rt |rhist , shist )], the instantaneous noise entropy rate.
of P(h_noise(t) | data) as h*_j(t). Averaging over those samples yields

\hat h_{noise}(t) = \frac{1}{N_{MC}} \sum_{j=1}^{N_{MC}} h^*_j(t).

The estimated average noise entropy rate is then the average over all phase times,

\hat h_{noise} = \frac{1}{N_S} \sum_{t=1}^{N_S} \hat h_{noise}(t).   (3.3)
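Stripped of the history conditioning and the Bayesian sampling that the context tree supplies, the column-wise estimate behind equation 3.3 can be sketched as:

```python
import numpy as np

def noise_entropy_rate(raster, dt):
    """Plug-in sketch of the noise entropy rate from an (N_R x N_S) raster.

    For each phase time t, estimate P(r_t | s_hist) from the N_R counts in
    column t (history conditioning omitted for brevity), take its entropy
    as h_noise(t) in bits/s, and average over phase time.
    """
    n_trials, n_bins = raster.shape
    h_t = np.empty(n_bins)
    for t in range(n_bins):
        _, counts = np.unique(raster[:, t], return_counts=True)
        p = counts / n_trials
        h_t[t] = -(p * np.log2(p)).sum() / dt
    return h_t.mean(), h_t

# Perfectly reliable columns give h_noise(t) = 0; a 50/50 column gives
# 1 bit per bin, i.e. 250 bits/s at dt = 4 ms.
raster = np.array([[1, 0, 1],
                   [1, 0, 0],
                   [1, 0, 1],
                   [1, 0, 0]])
h_noise, h_t = noise_entropy_rate(raster, dt=0.004)
```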
3.3 Estimating the Distribution of Noise Entropy Rates. In addition to the estimate hˆ noise , we also wish to estimate the distribution of likely values for the noise entropy, P(h noise | data). Percentiles of this quantity will provide confidence intervals for hˆ noise . Thus far, we have only computed samples from P(h noise (t) | data) for every point in phase time. How do we appropriately combine these NS sampled distributions into an overall distribution to provide for an error bar on hˆ noise ? At this point, one must make a philosophical decision about what types of variability are intended to be represented by the confidence interval. The proper procedure depends on
the intended use and interpretation of the estimated value. We identify two sources of variability, which conceivably ought to be represented in a useful confidence interval: 1. Variability due to estimation error point-wise in phase time, because only a finite number of repeats were recorded 2. Variability across the stimulus process itself, because only a finite duration stimulus was presented The first source of variability quantifies uncertainty in the noise entropy rate pointwise in phase time due to finite samples (NR ). Within typical data sets (NR ranges from 50–1000 repetitions), the width of this distribution, P( hˆ noise (t) | data), tends to be narrow, and thus the variability attributable to point-wise estimation error is small (see, for instance, Figure 8, bottom). The second source of variability is more subtle and often not explicitly recognized. Most often, we want to estimate h(R|S) from the data, where S is the stimulus process, not the individual stimulus time series actually used in the experiment. In principle, to estimate this variability, we would repeat our experiment with many stimulus time series, each with many repeats; the variability across the ensemble of rasters (where each raster corresponds to a unique stimulus time series) would reflect the variability in the stimulus process. In practice, long-duration recordings are difficult, and thus only a single raster plot is measured in response to a single instance from the stimulus process. Thus, the goal set forth is to estimate the variability across an ensemble of raster plots from just a single raster plot. We employ a bootstrap procedure, detailed below, to achieve this goal. Because the underlying stimulus process is (presumed to be) unobserved, the variability in the stimulus process is instead reflected in the range of spike patterns observed across phase time in a raster plot (see Figure 4a, top panel). 
In practice, we find this source of variability quite large because a neuron can fluctuate from nearly deterministic silence (h_noise ≈ 0) to bursting activity (h_noise ≫ 0) (see Figure 4a, top panel at t = 200 and 400 ms, respectively), depending on the stimulus. In order to estimate the effect of variation across the stimulus process, we use a generic time-series bootstrap of the observed effect of stimulus variation on ĥ_noise(t) to estimate the expected consequences on ĥ_noise. We find ensembles of noise entropy rates based on resampled rasters, as if there were a new stimulus composed of randomized segments of the original stimulus. A bootstrapped sample is
h^*_{boot} = \frac{1}{N_S} \sum_{t=1}^{N_S} h^*_{J_t}(B_t),   (3.4)
Figure 4: Effects of correlation and spike rate on bootstrap procedure on a short repeated stimulus segment from experimental data set (δt = 4 ms; see section 5). (a) Top: original raster plot. Middle: raster with phase time indices B resampled uniformly with replacement. Bottom: raster with phase times indices B resampled using stationary bootstrap procedure. (b) Conditioning on a constant spike rate. Left: noise entropy samples h ∗boot versus firing rate r ∗ over bootstrapped samples, demonstrating typical linear dependence (dashed line). Right: noise entropy samples corrected using linear Ansatz. The width of the distribution in h ∗boot (vertical dispersion) is substantially lowered using this procedure.
where, for each sample h ∗boot , Bt is a draw of time indices in [1 . . . NS ] (details discussed below) and J t is a uniform random draw from the sample indices [1, NMC ] with replacement. Procedurally, we draw a random bootstrap time series B = {B1 , . . . , B NS } of time indices and, at each of those moments, draw
one of the NMC samples from the point-wise estimated posterior and average them to give one bootstrap sample, h*_boot.5 We employ the basic bootstrap interval (Davison et al., 1997) to remove bias discovered during the resampling procedure. The basic bootstrap interval presumes that the distribution of ĥ_noise − h_noise (where h_noise is unknown) may be identified with the distribution of h*_boot − ĥ_noise (where h*_boot has been sampled by the bootstrap). Samples of our estimate for the distribution of the noise entropy rate, h*_noise, are computed using equations 3.3 and 3.4:

h^*_{noise} = \hat h_{noise} - (h^*_{boot} - \hat h_{noise}) = 2 \hat h_{noise} - h^*_{boot}.   (3.5)
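Equations 3.4 and 3.5 amount to the following resampling loop; a sketch in which `draw_indices` is a placeholder for the phase-time draw (a naive i.i.d. draw is used here for brevity; the text argues this must be replaced by a stationary bootstrap):

```python
import numpy as np

def bootstrap_noise_samples(h_mc, n_boot, rng, draw_indices):
    """Sketch of equations 3.4-3.5: resample phase times and MC samples.

    h_mc is an (N_S x N_MC) array with h_mc[t, j] = h*_j(t), the j-th
    Monte Carlo sample of the noise entropy rate at phase time t.
    draw_indices(rng, N_S) returns a bootstrap time series B of indices.
    """
    n_s, n_mc = h_mc.shape
    h_hat = h_mc.mean()                        # \hat h_noise (equation 3.3)
    h_boot = np.empty(n_boot)
    for b in range(n_boot):
        B = draw_indices(rng, n_s)             # bootstrap phase times
        J = rng.integers(0, n_mc, size=n_s)    # uniform MC sample indices
        h_boot[b] = h_mc[B, J].mean()          # equation 3.4
    return 2.0 * h_hat - h_boot                # equation 3.5 (basic bootstrap)

# Naive i.i.d. phase-time draw, shown only as a placeholder.
iid_draw = lambda rng, n: rng.integers(0, n, size=n)

rng = np.random.default_rng(2)
h_mc = rng.normal(50.0, 5.0, size=(200, 100))  # illustrative samples (bits/s)
h_noise_samples = bootstrap_noise_samples(h_mc, 500, rng, iid_draw)
```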
These samples are a hybrid Bayesian and bootstrap estimate of the distribution of h_noise, and quantiles of h*_noise provide confidence intervals for the noise entropy. The appropriate procedure for choosing B is subtle because the neural response contains significant temporal correlations across phase time. These correlations may be due to refractory neural dynamics as well as the effect of nearly overlapping stimuli within small shifts of phase time. These correlations in phase time create correlations between ĥ_noise(t) and ĥ_noise(t + 1) and are reflected in the raster plot by the thickness of each burst of activity (or silence) in Figure 4a, top panel. A naive bootstrapping procedure for B is to uniformly draw random time indices (with replacement) from [1, NS]. This standard procedure would result in creating the raster plot in Figure 4a, middle panel. Clearly, this procedure fails to capture the autocorrelation structure evident in Figure 4a, top panel. Correspondingly, when significant response autocorrelation exists, a naive bootstrap that presumes point-wise independence in phase time will profoundly underestimate the true variance of ĥ_noise across the underlying stimulus process. To account for this significant correlation, we propose drawing bootstrapped phase times using a procedure developed by Politis and Romano for correlated time series (Politis & Romano, 1994; Politis & White, 2004). The Politis-Romano (PR) bootstrap procedure creates surrogate time indices B consisting of varying-length blocks of consecutive phase times (each block starting at a random phase time), whose purpose is to capture most of the autocorrelation structure in ĥ_noise(t) (Figures 1i and 1j; see appendix A for details). This correlation structure can be quite large at small bin sizes or with stimuli exhibiting long temporal correlations (e.g., natural stimuli).
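The index-generation step of the stationary bootstrap can be sketched as follows (assuming, per appendix A, a mean block length set by the autocorrelation width):

```python
import numpy as np

def stationary_bootstrap_indices(rng, n, mean_block_len):
    """Politis-Romano stationary bootstrap: surrogate phase-time indices.

    Indices come in blocks of consecutive times (wrapping at n). At each
    step a new block starts at a random phase with probability
    p = 1/mean_block_len, so block lengths are geometric with the given mean.
    """
    p = 1.0 / mean_block_len
    B = np.empty(n, dtype=int)
    B[0] = rng.integers(0, n)
    for t in range(1, n):
        if rng.random() < p:
            B[t] = rng.integers(0, n)      # start a new block
        else:
            B[t] = (B[t - 1] + 1) % n      # continue the current block
    return B

rng = np.random.default_rng(3)
B = stationary_bootstrap_indices(rng, n=1000, mean_block_len=20)
# Consecutive indices mostly advance by 1, preserving local autocorrelation,
# unlike the naive i.i.d. draw.
frac_consecutive = float(np.mean((B[1:] - B[:-1]) % 1000 == 1))
```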
Returning to Figure 4a, the phase time indices B are now resampled using the PR bootstrap. This bootstrap procedure captures the autocorrelation structure evident in the response (Figure 4a, bottom panel). With this bootstrap, the confidence interval calculated as quantiles on equation 3.5 better
5 Note that if one did not want to include the stimulus variation in the confidence interval, one would select Bt = t.
captures the variability due to the underlying stimulus process. The consequences of this source of variability are discussed further in the application of this estimator.

3.4 Conditioning Noise Entropy Rate on a Fixed Spike Rate. Entropy rates are significantly correlated with the spike rate. For example, periods of silence in a raster plot correspond to ĥ_noise(t) = 0, reflecting practically deterministic activity (Kennel et al., 2005). Thus, a sample h*_boot will, by random chance, contain phase time indices B with fewer or greater periods of absolute silence, significantly affecting its value. Figure 4b (left panel) demonstrates this empirically linear, systematic effect. This source of variability is artificially introduced by the resampling procedure. Samples drawn with larger (smaller) proportions of phase times B with spiking activity exhibit higher (lower) entropy rates. We may null out this source of variability by adapting the resampling procedure. Conceptually, we are now estimating h(R|S, rate = r_0), where r_0 is the empirically estimated average spike rate. For each moment in phase time t, we estimate the instantaneous spike rate r(t). For each bootstrap sample h*_boot, there is a corresponding set of phase time indices B and spike rate r* = (1/N_S) Σ_t r(B_t). We use least squares to fit a linear Ansatz between the sample spike rate and entropy rate, h*_boot ≈ α + β(r* − r_0), where {α, β} are determined with an ordinary least squares fit (see Figure 4b, left panel, dashed line). Each entropy rate sample is corrected as if it had spike rate r_0,

h^{**}_{boot} = h^*_{boot} + \beta (r_0 - r^*).   (3.6)
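The correction of equation 3.6 is a one-line shift after an ordinary least squares fit; a sketch with synthetic bootstrap samples given a built-in linear rate dependence:

```python
import numpy as np

def rate_corrected_samples(h_boot, r_boot, r0):
    """Equation 3.6: null out the spike-rate component of bootstrap spread.

    Fit h*_boot ~ alpha + beta * (r* - r0) by least squares, then shift
    each sample to the value it would have had at the reference rate r0.
    """
    beta, alpha = np.polyfit(r_boot - r0, h_boot, 1)  # slope, intercept
    return h_boot + beta * (r0 - r_boot)

# Synthetic bootstrap samples: entropy rate rises linearly with the
# per-sample spike rate, plus a small residual scatter.
rng = np.random.default_rng(4)
r0 = 8.0                                    # reference firing rate (Hz)
r_boot = r0 + rng.normal(0.0, 1.0, 2000)    # per-sample bootstrap rates
h_boot = 30.0 + 3.0 * (r_boot - r0) + rng.normal(0.0, 0.2, 2000)

h_corrected = rate_corrected_samples(h_boot, r_boot, r0)
# The corrected samples show much smaller spread than the raw ones.
```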
These corrected samples can be plugged into equation 3.5, replacing h*_boot. This procedure corrects for variability introduced by the bootstrap procedure, which is seen primarily through the altered spike rate, and effectively shrinks the confidence interval of the noise entropy distribution when this effect is removed.

3.5 Estimating Mutual Information Rates. We now have all of the elements necessary to estimate the specific and average information rates. We remind the reader that the ergodic assumption in section 3.2 equated a specific stimulus history with a particular moment in phase time (i.e., s_hist ∼ t). Because the total and noise entropy rate distributions are estimated from separately recorded experimental data sets, we can simulate samples of the likely information rates by drawing independent samples from the posterior distributions and taking their difference,

I^* = h^*_{total} - h^*_{noise},   (3.7)
I^*_{sp}(s) = h^*_{total} - h^*_{noise}(t).   (3.8)
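The final differencing step of equation 3.7 simply pairs independent draws from the two posteriors; with synthetic stand-in samples:

```python
import numpy as np

rng = np.random.default_rng(5)
N_MC = 4000

# Stand-ins for posterior samples of the total and (bias-corrected) noise
# entropy rates, estimated from separately recorded data sets; the normal
# distributions are illustrative only.
h_total_samples = rng.normal(110.0, 2.0, N_MC)
h_noise_samples = rng.normal(50.0, 3.0, N_MC)

# Equation 3.7: samples of the information rate posterior.
I_samples = h_total_samples - h_noise_samples

I_hat = I_samples.mean()                    # estimated information rate (bits/s)
ci = np.quantile(I_samples, [0.05, 0.95])   # 90% confidence interval
```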
This step is diagrammed in Figure 1n. The mean of these samples gives the estimated information rate, and the central portion gives a mixed Bayesian and bootstrap confidence interval on the range of information rates consistent with the data. We now test these ideas in simulation and highlight results on an experimental neural data set.

4 Testing the Estimator with Simulation

In previous work (Kennel et al., 2005), we validated and compared the performance of our estimator for the total entropy rate using simulations of nonlinear dynamical systems. In summary, we found that our estimate was asymptotically unbiased and converged reliably and rapidly compared to other estimators (Strong et al., 1998; Lempel & Ziv, 1976; Amigo, Szczepanski, Wajnryb, & Sanchez-Vives, 2004; Kennel & Mees, 2002; London et al., 2002; Kontoyiannis, Algoet, Suhov, & Wyner, 1998). Furthermore, the size of the confidence interval provided by the entropy rate estimator well matched the variation in entropy rate under sample fluctuations. We now examine the convergence of our estimators and the confidence intervals of ĥ_total and ĥ_noise using a simulation of an inhomogeneous Poisson neuron with an absolute and relative refractory period (Nemenman, Bialek, & de Ruyter van Steveninck, 2004; Victor, 2002). The parameters of this model have been chosen to approximate the statistics of a real neuron binned at δt = 4 ms; see appendix B for details. The goals of these simulations are to (1) validate our estimates of ĥ_total and ĥ_noise and their associated confidence intervals and (2) judge the amount of data required for a good estimate in real neural data.

4.1 Testing Convergence. We examined the convergence of ĥ_total using an inhomogeneous refractory Poisson simulation whose parameters are fit to model the temporal dynamics of a recorded neuron (see appendix B). In previous work we determined that this estimator is asymptotically unbiased (Kennel et al., 2005).
We judge the convergence and consistency of the estimator about an approximate truth (Nδt = 4000 s, NR = 3000 trials, NS δt = 32 s; see also Nemenman et al., 2004) by examining the estimator bias and variance. For each duration Nδt, we generated 100 independent data sets and computed hˆ total for each data set. In Figure 5a, the estimator bias is the difference between approximate truth and the mean of these 100 independent estimates. The estimator variance is the square of the accompanying error bar, and it measures the intrinsic fluctuations in this random process. We emphasize that this error bar is not the estimated confidence interval, but rather is the actual posterior distribution of entropy rates, which we will estimate later. Over increasing Nδt, the estimator variance decreases, and within 125 s of data, the bias and variance (normalized by hˆ total ) are, respectively, less than 0.2% and 1.3%, suggesting a well-converged estimate.
Figure 5: Convergence of entropy rate estimates on a simulated refractory Poisson neuron. Shown are ensembles over 100 different data sets; error bars give sample standard deviations over the data set ensembles. Dashed line is approximate truth (Nδt = 4000 s, NR = 3000 trials, NS δt = 32 s). (a) Convergence of hˆ total in N. (b) Convergence of hˆ noise in NS for a fixed number of trials NR = 400. (c) Convergence of hˆ noise in NR for a fixed trial duration NS δt = 8 s.
Examining the convergence of hˆ noise is more subtle because the estimator contains two parameters, NR and NS . Because the convergence is difficult to visualize in multiple dimensions, we instead examine the convergence along each parameter while holding the other fixed at a reasonable value.
In Figure 5b, we again generated 100 independent data sets. This time each data set is an entire raster plot with NS varied and NR = 400 trials fixed. Each data set contains a new instantiation of the random process, as well as a new driving firing rate (λ2) over NSδt seconds (see appendix B). The bias is the difference between approximate truth and the average of these 100 estimates. The estimator variance is again the width of the posterior distribution of entropy rates, although the posterior distribution of ĥ_noise is calculated by applying the stationary bootstrap (see section 3.3) to account for correlations in time. The estimator bias remains small (< 1% of ĥ_noise), yet the estimator variance shrinks as NS increases.6 Finally, in Figure 5c, we judge the convergence of ĥ_noise over NR while holding NSδt = 8 s fixed. In this panel, each data set is a new instantiation of the random process but retains the same firing rate (λ1). Unlike Figure 5b, we do not select a new firing rate time series, because we are interested in showing the significant effect of NR on bias, and drawing new firing rate time series would partially obscure this effect by submerging it in additional variance. The estimator bias is profound at small NR, dominating the magnitude of the variance (Paninski, 2003a); however, by NR = 400 trials, the bias is well contained within the variance of the estimate. Thus, we find that NR dominates the bias in ĥ_noise as compared to NS. In other words, NR dominates the bias, while NS dominates the variance, almost independently. Thus, variations across phase time (or stimulus histories s_hist), but not the error bar for individual ĥ_noise(t), explain most of the variance in ĥ_noise. We return to this point in section 6 in discussing how to select experimental parameters for estimating these quantities.

4.2 Testing the Confidence Interval.
We now examine how well the confidence interval, which can be estimated from a single data set, can replicate the actual variation of the underlying estimate seen in a large ensemble of new data sets. To test the confidence intervals for ĥ_total, we again generate 100 independent data sets with a fixed Nδt. For each data set, we calculate ĥ_total and the associated single-trial confidence interval. We plot these estimates as circles and error bars sorted by increasing ĥ_total in Figure 6a (Nδt = 125 s). First, we note that approximate truth (dashed line) is contained within 90% of the individual confidence intervals. This indicates that the estimated confidence interval is well calibrated. To summarize these results, we calculate the following. We consider the true distribution of this statistic (or posterior distribution of ĥ_total) to be the distribution of ĥ_total (circles). The central 90% quantile of this posterior

6 As an aside, notice that we are combining a Bayesian and frequentist bootstrap method for the estimator and then testing it in a purely frequentist fashion (many draws from the source). The size and location of a Bayesian error bar need not match exactly the result from a frequentist-style experiment, but the general compatibility of the results here gives reassurance that there are no peculiar anomalies.
1702
J. Shlens, M. Kennel, H. Abarbanel, and E. Chichilnisky
Figure 6: Testing the confidence interval of hˆ total . (a) Individual estimates and associated 90% confidence intervals of 100 independent data sets of length Nδt = 125 s (circles). Average of the individually estimated confidence intervals (triangle). The central 90% quantile of the distribution of hˆ total (circles), which measures the actual posterior of h total (square). The horizontal dashed line is approximate truth (Nδt = 4000 s). (b) Comparison between the average lower and upper confidence interval limits (triangle) and the posterior distribution (square) across Nδt. Across varying data set lengths, confidence intervals are well calibrated and engulf approximate truth.
distribution is denoted by the leftmost square (see Figure 6a). We compare this error bar to the average lower and upper limits of the 100 individual confidence intervals (see Figure 6a, triangle). In Figure 6a, the average upper and lower limits contain approximate truth and match the estimated posterior distribution. We use these two statistics as a summary of the calibration and compare them across varying Nδt (see Figure 6b): the average confidence interval matches the true posterior distribution and engulfs truth. We repeat this same procedure for examining the confidence interval in hˆ noise across NS and NR (see Figure 7). In Figure 7a, we again draw a new instantiation of the random process and the firing rate (λ2) for every sample,
Figure 7: Testing the confidence interval of hˆ noise across variations in NS and NR . (a) This figure follows Figure 6. Each circle is an independent data set (raster) with NS δt = 8 s and NR = 400 trials. (b) Comparison between average lower and upper confidence interval limits (triangle) and posterior distribution (square) across NS δt with NR = 400 fixed. (c) Comparison between average lower and upper confidence interval limits (triangle) and posterior distribution (square) across NR with NS δt = 8 s fixed. Across varying NS and NR , confidence intervals are well calibrated and engulf approximate truth (dashed line, NR = 3000 trials, NS δt = 32 s).
and we make use of the stationary bootstrap to account for correlations within hˆ noise (t). In Figure 7c, we repeat the same procedure, this time varying NR while keeping NS δt = 8 s. This is in contrast to Figure 5, where we did not draw new firing rate time series. Previously, the purpose was only to demonstrate that bias depends primarily on NR; the additional variability induced by draws of new time series from the firing rate process would have been irrelevant and would have obscured the effect we intended to demonstrate. Now, however, the magnitude of the fluctuations, and whether our estimation procedure can account for them, is the central object of interest, so it is appropriate to include this source of variability in the data-generating process. In summary, we find that the hybrid Bayesian-frequentist confidence interval is well calibrated for hˆ noise across a range of NR and NS in our example.

5 Results

We now bring these tools to bear on neural data from a guinea pig retinal ganglion cell (RGC). These data were recorded extracellularly with a multielectrode array (for experimental details, see Chichilnisky & Kalmar, 2003). Full-field binary monitor flicker at 120 Hz was presented for a long duration (Nδt = 1000 s) and over repeated trials (NS δt = 9.5 s, NR = 660 trials) to sample the total and noise entropy rates, respectively. We isolated spike times of individual RGCs using a simple voltage threshold procedure and associated distinct clusters in a spike waveform feature space with individual neurons (Frechette et al., 2005). We present this analysis in full on a single RGC in Figures 8 to 11. We selected the spike times of one well-isolated ON-type RGC (contamination rate < 0.02; Uzzell & Chichilnisky, 2004), binned at 4 ms, which appeared relatively stationary over the duration of the experiment (see Figure 8).
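The 4 ms binning step can be sketched as follows. Here `bin_spike_times` is a hypothetical helper, not the paper's code, and the spike times are invented for illustration; only the bin size (δt = 4 ms) comes from the text.

```python
import numpy as np

def bin_spike_times(spike_times, dt=0.004, duration=None):
    """Discretize spike times (in seconds) into a binary sequence at
    resolution dt: 1 if a bin contains at least one spike, else 0."""
    if duration is None:
        duration = spike_times.max() + dt
    n_bins = int(np.ceil(duration / dt))
    counts, _ = np.histogram(spike_times, bins=n_bins, range=(0.0, n_bins * dt))
    return (counts > 0).astype(int)

# Example: four spikes over 32 ms become eight 4 ms binary "letters"
words = bin_spike_times(np.array([0.001, 0.0105, 0.011, 0.030]), duration=0.032)
```

Note that two spikes falling in the same 4 ms bin (here 10.5 ms and 11 ms) collapse to a single 1, which is the information lost by choosing a coarse bin size.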
We qualitatively judged stationarity by measuring fluctuations in the firing rate (7.9 ± 0.9 Hz) and the spike-triggered average (data not shown). In spite of this selective screening, it should be emphasized that, as in most experimental recordings, some degree of nonstationarity is inevitable. This nonstationarity could result from changes in experimental conditions (e.g., temperature, cell death) or from adaptation over longer timescales (Fairhall et al., 2001; Baccus & Meister, 2002). Any of these variations could result in an increased noise entropy rate, a lack of convergence in our estimates, or an underestimation of the confidence interval. Although we attempted to mitigate these effects by careful control of experimental conditions, this issue is of potential concern because nonstationarity violates a central assumption in the definition of an entropy rate.

5.1 Entropy Rates. Figure 9 examines the convergence of the entropy rate across increasing durations of data. Only a single data set is available;
Figure 8: ON-type guinea pig retinal ganglion cell. A small portion of the raster from repeated trials of a stimulus. The conservation of spike times across trials suggests that the neuron responds reliably and precisely to repeated presentations of the stimulus. The lower plot shows the specific information rate of the retinal ganglion cell. The black dot is the estimate, and the gray shadow is the 95% confidence interval. The average mutual information rate is 15.2 bits per second.
thus, a complete examination of convergence and confidence intervals of the estimator across ensembles of data sets is not possible. As a rough approximation, we instead artificially divide our single long data set into K independent, smaller data sets. In Figure 9, this analysis is performed on the total and noise entropy rates for K ≤ 8 fractions of the original data set. The final estimate using the complete data set (K = 1) is displayed with the horizontal dashed line. Individual subsampled estimates (circles) converge to the final estimate as the amount of data (N and NR, respectively) increases. Single-trial 95% confidence intervals engulf the final estimate the vast majority of the time, indicating that the confidence interval provides a
Figure 9: Convergence of entropy rate estimates (circles; error bar is 95% confidence interval). Each group of data points is an estimate from K ≤ 8 independent data sets subsampled from the complete data set. Horizontal lines are entropy rate estimates from our estimator (dashed) and from the direct method (dotted) (Strong et al., 1998). (a) Total entropy rate estimates across subsamples of N. (b) Noise entropy rate estimates across subsamples of NR . Note that single-trial confidence intervals from subsampled data sets engulf the final estimate using the complete data set.
reasonable bound on the range of potential estimates, assuming more data were available. For the noise entropy rate (see Figure 9b), the confidence interval is calculated using the stationary bootstrap procedure with the firing rate correction. The reason for selecting this type of confidence interval is explored in Figure 10 by examining estimates from data sets subsampled across NS. The ergodicity assumption, equation 3.2, requires that each trial be long enough to explore a reasonable range of response space, providing a good estimate of P(R|S) and consequently of hˆ noise. This assumption should be reflected not only in the estimate (circle) but also in the confidence interval, such that different repeated presentations (i.e., subsampled NS) do not substantially differ in their values. Indeed, for the standard bootstrap procedure (see
Figure 10: Confidence intervals of noise entropy rate estimation (symbols follow Figure 9). Estimates use three different methods for calculating 95% confidence intervals. Each panel uses varying fractions K = 1, 2, 3, 4 of the complete data set to generate independent estimates. (a) Stationary bootstrap procedure. (b) Standard bootstrap procedure for phase times B. (c) Stationary bootstrap procedure correcting for spike rate. Note expanded axes in b.
Figure 10b), the confidence interval would indicate otherwise. This suggests that for short stimulus presentations, the ergodicity assumption is incorrect. In contrast, the correlated bootstrap procedure (see Figure 10a) properly accounts for this uncertainty, as the error bars indicate that these independent samples are not statistically different. Furthermore, this result holds even when removing the systematic fluctuations in the firing rate introduced by the bootstrap procedure (see Figure 10c). Several features give us encouragement that the entropy rate estimates are fairly well converged. First, diagnostics based on complexity measures introduced in Kennel et al. (2005) are well converged, providing a reasonable first check (data not shown); the convergence of these diagnostics is necessary but not sufficient for a well-converged entropy rate estimate. The strongest evidence for convergence comes from the simulations in the previous section: a model neuron with similar temporal dynamics, binned at the same temporal resolution, provides well-converged hˆ noise and hˆ total estimates within the total recording time of this neuron.
H(S) > H(R) implies that even if the neuron were noise free (i.e., H(R|S) = 0), at a resolution of 4 ms, the neuron would not contain a large enough repertoire of spike patterns to losslessly transmit the stimulus. The stimulus efficiency I (S;R) = 12.7 ± [0.9, 0.9]% characterizes how h(S) much of the stimulus the spike train transmits. The low stimulus efficiency suggests several possibilities: the neuron failed to transmit all of the information about the stimulus, or the bin size of the spike train is too large to extract all of the information about the stimulus. The absolute refractory period of the neuron sets a rough guide to the relevant timescale for the bin size. Recent work has suggested I (S; R) must be calculated for bin sizes smaller than the refractory period in order to extract all information about a stimulus within a spike train (Strong et al., 1998; Reinagel & Reid, 2000; Liu, Tzonev, Rebrik, & Miller, 2001). However, estimating information rates 7 Because our error bars can be asymmetric, we abbreviate lower and upper portions of the error bar as the first and second items within the brackets.
Figure 11: Reconstruction of a binary stimulus. The left panel shows a short segment of the displayed light intensity waveform of the full-field stimulus S (black line) and the corresponding RGC spike train (below). We calculate the reconstructed stimulus Sˆ (black dots) by convolving the discretized spike train (δt = 4 ms) with a filter (right panel) and thresholding the resulting value to a high or low stimulus intensity. The filter is the reverse correlation of the spike train with the stimulus divided by the autocorrelation of the response (Rieke et al., 1997). This simple decoding procedure recovers at least 3.3 bits per second, or 21.9 ± [1.5, 1.7]% of the available information.
at smaller bin sizes requires larger effective conditioning depths D and thus entails larger data requirements (Kennel et al., 2005). Most important, the response kinetics of ON-type guinea pig RGCs are slow compared to the refresh interval of the stimulus (see Figure 11). Thus, we hypothesize that the stimulus is probably flickering too rapidly for the RGC to respond accurately. Average information rates mask the contributions of individual stimuli and responses. Recent work has investigated several extensions of information rates to measure how well particular stimuli are encoded (DeWeese & Meister, 1999; Brenner, Strong et al., 2000; Butts, 2003). A straightforward extension of our estimator is to estimate the specific information rate Isp (s), the contribution of individual stimuli. For a small segment of the raster in Figure 8, we plotted Isp (s) for comparison. The gray shadow is the 95% confidence interval and is largely dominated by the uncertainty in hˆ noise. Stimuli eliciting silence provide the largest amount of information about the response because the neural response is reliably silent (Butts, 2003). During periods of silence, h(R|t) = hˆ noise (t) = 0, implying that the specific information rate plateaus at Isp (s) = h(R) − 0 = h total .
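The efficiency figures quoted earlier in this section are simple ratios of the estimated rates. As a quick arithmetic check using the rounded values reported above (small discrepancies against the quoted percentages are expected, since the published figures were computed from unrounded estimates):

```python
I_SR = 15.2    # average information rate I(S; R), bits/s
h_R = 46.5     # total response entropy rate h(R), bits/s
h_S = 120.0    # stimulus entropy rate h(S): 1 bit per frame at 120 Hz

response_efficiency = I_SR / h_R   # fraction of response capacity exploited (~0.33)
stimulus_efficiency = I_SR / h_S   # fraction of the stimulus transmitted (~0.13)
```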
During periods of bursting activity, hˆ noise > hˆ total, giving a negative specific information rate, Isp (s) < 0. Negative specific information is not only possible but necessary to maintain additivity, a central assumption of information theory (Shannon, 1948; Brillouin, 1956; Fano, 1961; DeWeese & Meister, 1999; Abarbanel, Masuda, Rabinovich, & Tumer, 2001). The sign and distribution of specific information rates reflect the distribution of stimuli selected in the experiment.

5.3 Comparison to a Decoder. A principal feature of the average mutual information rate I(S; R) is that it neither involves any reconstruction of the stimulus nor requires any specific model for interpreting a spike train. However, I(S; R) does bound the performance of any hypothesized decoding algorithm Sˆ = F(R), whether biological or artificial, meaning that it measures the performance an optimal decoder could achieve at a temporal resolution δt. To illustrate this point, we compare I(S; R) to a simple decoding procedure based on linear reconstruction. This decoding procedure reconstructs from the spike train R an estimate of the stimulus waveform Sˆ = {sˆ1 , sˆ2 , . . .}, where the subscript indexes time. The performance of this artificial decoder can be judged on an absolute scale relative to the available information, I(S; R) (Buracas et al., 1998; Warland et al., 1997; Abarbanel & Talathi, 2006). Because the original stimulus waveform is binary, we constructed a simple decoder F by convolving the spike train with a filter and thresholding the resulting value to a high or low stimulus value. Figure 11 shows a segment of the resulting reconstruction. We calculated the filter using the first half of the data (Nδt = 500 s); all subsequent calculations used the second half of the data set.
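A toy version of this filter-and-threshold decoder can be sketched on synthetic data. The spiking model, lag, and probabilities below are invented for illustration; the least-squares fit plays the role of the reverse-correlation filter normalized by the response autocorrelation, and the threshold maps the filtered output to a high or low stimulus level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary stimulus and a toy response: the "cell" fires with higher
# probability a few bins after a high-intensity frame (parameters illustrative).
n, lag = 20000, 3
stim = rng.choice([-1.0, 1.0], size=n)
p_spike = np.where(np.roll(stim, lag) > 0, 0.4, 0.05)
spikes = (rng.random(n) < p_spike).astype(float)

def fit_decoding_filter(spikes, stim, n_taps=10):
    """Least-squares linear filter mapping windows of the spike train to the
    stimulus; the normal equations are the discrete analog of reverse
    correlation divided by the response autocorrelation."""
    X = np.column_stack([np.roll(spikes, -k) for k in range(n_taps)])
    w, *_ = np.linalg.lstsq(X, stim, rcond=None)
    return X, w

X, w = fit_decoding_filter(spikes, stim)
recon = np.where(X @ w > 0, 1.0, -1.0)   # threshold to high/low intensity
accuracy = np.mean(recon == stim)
```

Because the synthetic cell responds at a fixed lag, the fitted filter concentrates its weight on that tap, and the thresholded reconstruction predicts the stimulus well above chance.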
The filter (Figure 11, right panel) is the optimal causal linear estimate of the stimulus given the spike train, calculated by reverse-correlating the spike train with the stimulus and dividing by the autocorrelation of the response (Rieke et al., 1997). As a preliminary check, we can compare the ability of the decoder F to reconstruct the stimulus against the best possible predictive power dictated by the information rate. We define the error as a binary random variable E ≡ δ(S = Sˆ), where 1 and 0 indicate a correct and an incorrect prediction, respectively. Fano's inequality (Cover & Thomas, 1991) guarantees that for a binary stimulus distribution, the probability of an error P(E) is implicitly bounded below by the equation

H(E) ≥ δt h(S|R) = δt [h(S) − I(S; R)].        (5.1)
The entropy rate of the stimulus, h(S), is 120 bits per second by experimental design. The difference between h(S) and I(S; R), the latter estimated from data, gives a lower bound on the entropy rate of the error. Solving equation 5.1 for P(E) with our results gives a lower bound P(E) ≥ 37.8%. Empirically, we find our simple decoder failed to predict the stimulus 40.5% of the time. We also characterize the performance of the decoder in absolute terms by examining the reduction in stimulus space to calculate a lower bound on the information attributed to the decoder, I(S; F(R)) = h(S) − h(S|F(R)). We can overestimate h(S|F(R)) by assuming the stimulus estimate Sˆ is identically and independently distributed (i.i.d.). This assumption ensures that our estimate of I(S; F(R)) is a conservative lower bound:

I(S; F(R)) = h(S) − (1/δt) H(s_t | s_hist, sˆ_t, sˆ_hist)
           = h(S) − (1/δt) H(s_t | sˆ_t, sˆ_hist)
           ≥ h(S) − (1/δt) H(s_t | sˆ_t)
           = h(S) − (1/δt) H(S | Sˆ),                        (5.2)
where we have recognized that S is i.i.d., and hist denotes all previous values in time up to but not including time t. By the data processing inequality, we know that I (S; F (R)) ≤ I (S; R) with equality if and only if F captures all of the information contained in R (or is a sufficient statistic of R). For this simple decoding procedure, we recover at least 3.3 bits per second, or 21.9 ± [1.5, 1.7]% of the available information. Building a better decoding procedure through dimensional reduction of stimulus space or other biological priors could recover a greater percentage of this available information and decrease the probability of reconstruction error (Bialek et al., 1991; Aguera y Arcas & Fairhall, 2003; Simoncelli, Paninski, Pillow, & Schwartz, 2004; Abarbanel & Talathi, 2006).

6 Discussion

We have discussed how to estimate information rates in neural spike trains with a confidence interval by extending an estimator introduced in a related
work (Kennel et al., 2005).8 We apply a time series bootstrap procedure to estimate the uncertainty in the noise entropy under the (usually unrecognized) assumption that the presented stimulus is but a single example from some underlying distribution. We combined these ideas to calculate a pointwise specific information rate Isp (s) and an average information rate I(S; R) in an experimental data set. We calculated the stimulus and response efficiencies to characterize how well the neuron conveys information about the stimulus and how well the neuron exploits its capacity to send information. Finally, we compared I(S; R) to the performance of a simple decoder based on linear reconstruction. Testing these ideas in simulation, we developed an empirical means of judging convergence in real spike trains. Importantly, we showed in simulation for noise entropy estimation that bias is dominated by the number of trials NR, while variance is independently dominated by the duration of each trial NS. This bias-variance trade-off suggests a systematic strategy for determining experimental parameters for estimating the noise entropy rate. First, determine the minimal number of trials for convergence in simulation for a selected bin size (see Figures 1a and 1b). Second, given a fixed experiment duration, maximize the duration of each stimulus presentation in order to minimize the estimator variance. Ideally, the duration of each presentation should be far larger than the correlation length of the neural response to ensure ergodicity (see equation 3.2). Finally, on real spike trains with inherent nonstationarity, our estimator appears convergent, and the confidence interval captures the range of reasonable variation assuming the source were stationary. Mitigating the effects of nonstationarity is thus a focus of our experimental design and a topic for future analysis.
Our estimator is a methodological advance because it (1) empirically converges quickly to the entropy rate (Kennel et al., 2005), (2) removes heuristic judgments of the conditioning depth D (Schurmann & Grassberger, 1996; Strong et al., 1998), and (3) provides confidence intervals about its estimate. The first point is imperative for any estimator because it minimizes data requirements, thereby increasing confidence in its accuracy. The second point is a special case of judging the appropriate model complexity for a neural spike train: it removes heuristics and subjective assessments often relied on in practice in the application of the direct method (Reinagel & Reid, 2000; Lewen et al., 2001). These heuristics either ignore the model selection problem or reject an estimate post hoc based on inappropriate convergence behavior. For example, in the application of the direct method (Strong et al., 1998), physical arguments suggest that a linear extrapolation within a selected scaling region should provide a good estimate of the entropy rate provided enough data
8 Source code, as well as a Matlab interface for both Bayesian entropy rate and information rate estimation, is available online at http://www.snl.salk.edu/∼shlens/infotheory.html.
samples exist. Thus, a common post hoc criterion is to reject this extrapolation if deviations from linearity become too large. One difficulty with this approach (aside from the subjective assessment of a scaling region and a linearity criterion) is that no estimate of the entropy rate is provided if this post hoc criterion is not met. In other words, if a linear extrapolation is poor, then no estimate is provided at all. A potential example of this failure can be observed in the upper-right panel of Figure 2, where deviations from a linear extrapolation would be large. We view this shortcoming as fundamental because it skirts the larger problem of selecting the appropriate model complexity. Selecting the appropriate probabilistic model complexity for a spike train allows one to generalize or sample the range of plausible entropy rates associated with the spike train, thus fulfilling the third point above. The computation of confidence intervals for information rates is unique to our estimator. A confidence interval provides a means for comparing information rates in a statistically significant manner. The lack of a reliable, heuristic-free estimator with an associated error bar has complicated the application of information theory in empirical data analysis. Immediate applications of this estimator include several avenues of research where the lack of a reliable estimator has made analysis difficult. This technique can be applied generally to estimate information in any situation with a symbolic observable and the opportunity to repeat stimuli. We now discuss immediate applications in neuroscience. One important application is to measure the temporal precision of spike times by comparing information rates across high temporal resolutions (Strong et al., 1998; Reinagel & Reid, 2000; Liu et al., 2001).
The selection of the optimal bin size (or any discretization; Rapp et al., 1994) to represent a spike train remains an open question, potentially resolved through model selection criteria (Rissanen, 1989). A second avenue is to explore the correlation structure between neurons by comparing information rates across multiple neurons (Puchalla, Schneidman, Harris, & Berry, 2005; Dan, Alonso, Usrey, & Reid, 1998; Gat & Tishby, 1999; Reich et al., 2001; Schneidman et al., 2003). Furthermore, the extension of this estimator to calculate specific information quantities (Bezzi et al., 2002; DeWeese & Meister, 1999; Brenner, Strong et al., 2000; Butts, 2003) suggests the possibility of actively searching through stimulus space for features that elicit high temporal precision in single neurons or unique correlational structure between multiple neurons. The benefits of this new estimator arise from having a justified theory to select the appropriate model complexity for the estimated probabilistic spiking response P(R|S). Although our model of spiking activity P(R|S) is statistical and not biophysical, this model does highlight how dynamics restrict the production of uncertainty in any neuron and, with further calculation, provides an upper bound to judge the quality of any decoding mechanism—whether experimentally derived or downstream in sensory processing.
Appendix A: The Politis-Romano (PR) Bootstrap

The PR bootstrap is a numerical procedure used to generate correlated bootstrap samples from a time series (Politis & Romano, 1994; Davison et al., 1997). In this appendix, Q(t) is a time series discretely sampled at t = [1, . . . , tmax]. The bootstrap random variable B = (B1, B2, . . .) is a list of time indexes that preserves some of the autocorrelation in Q(t). The first value B1 is chosen at random from the integer time indexes. Given an already selected time index Bi, with probability p = 1/L (or with p = 1 if Bi = tmax), we choose Bi+1 randomly and uniformly; otherwise, Bi+1 = Bi + 1. This procedure generates a bootstrap draw B composed of varying-length blocks of successive time indexes. The bootstrap resample of the time series is thus Q(B1), Q(B2), . . . The lengths of these blocks are exponentially distributed with mean length L. L is a free parameter chosen to reflect the duration of the correlation structure in the time series Q(t). In the statistical literature, the PR method is often called the stationary bootstrap because the bootstrap process generating B is statistically stationary, as opposed to the fixed-L block bootstrap methods previously used, which are only cyclically stationary. Several algorithms select L automatically given an observed time series (Politis & White, 2004). In practice, a simple but common heuristic is to select L to be a few times the width of the autocorrelation peak of Q(t). In the raster plots, L is set to twice the width of the autocorrelation, across phase time, of the spike rate r(t) or hˆ noise(t). Empirically, our results do not seem to depend significantly on the specific choice of L within a wide range of reasonable values. We have also used the software in Politis and White (2004) to estimate L, which results in no appreciable change in the results presented here.
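The index-generation step above can be sketched directly (0-based indexes here; the example series and the choice L = 20 are illustrative):

```python
import numpy as np

def stationary_bootstrap_indices(t_max, L, rng):
    """One Politis-Romano draw B of t_max time indexes: blocks of consecutive
    indexes whose lengths are geometrically distributed with mean L."""
    B = np.empty(t_max, dtype=int)
    B[0] = rng.integers(t_max)
    for i in range(1, t_max):
        # Start a new block with probability 1/L, or when the series end is hit.
        if B[i - 1] == t_max - 1 or rng.random() < 1.0 / L:
            B[i] = rng.integers(t_max)
        else:
            B[i] = B[i - 1] + 1
    return B

rng = np.random.default_rng(0)
Q = np.sin(0.1 * np.arange(1000))                 # a correlated example series
B = stationary_bootstrap_indices(Q.size, L=20, rng=rng)
resample = Q[B]                                    # one correlated bootstrap sample
```

Because blocks restart only with probability 1/L, most consecutive entries of B differ by exactly one, so short-range correlations in Q survive in the resample.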
Appendix B: Simulation of a Refractory Poisson Neuron

In section 4, we use a single model neuron to judge the convergence and quality of our entropy rate estimates. Our model neuron is an inhomogeneous Poisson process with an absolute and a relative refractory period. The goal of this model is solely to provide an approximation of the temporal dynamics of a neuron (but see Berry & Meister, 1998; Keat, Reinagel, Reid, & Meister, 2001; Pillow & Simoncelli, 2003; Paninski, Pillow, & Simoncelli, 2004). The parameters that must be specified are the relative (and absolute) refractory periods and the firing rate over time. We matched the absolute refractory period of a real neuron (4 ms) by identifying the smallest interspike interval in the autocorrelation of the spike train. We matched the relative refractory period of the neuron using a sigmoidal recovery function fit to the firing statistics of the spike train (Uzzell & Chichilnisky, 2004).
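This model can be sketched as follows. The sigmoidal recovery time constant and the rate profile below are illustrative placeholders; in the text, the recovery function is fit to the measured firing statistics.

```python
import numpy as np

def refractory_poisson(rate, dt=0.004, t_abs=0.004, tau=0.004, rng=None):
    """Inhomogeneous Poisson spike generator with an absolute refractory
    period t_abs and a sigmoidal relative-refractory recovery function.
    (Sketch only: tau and the recovery shape are assumptions.)"""
    rng = rng or np.random.default_rng(0)
    spikes = np.zeros(rate.size, dtype=int)
    last_bin = -10**9                           # effectively "no previous spike"
    for t in range(rate.size):
        elapsed = (t - last_bin) * dt           # time since the last spike
        if elapsed <= t_abs:
            continue                            # absolutely refractory: no spike
        # Sigmoidal recovery: spiking probability is suppressed just after a
        # spike and recovers toward 1 with time constant tau.
        recovery = 1.0 / (1.0 + np.exp(-(elapsed - t_abs - tau) / tau))
        if rng.random() < min(1.0, rate[t] * dt * recovery):
            spikes[t] = 1
            last_bin = t
    return spikes

rate = 8.0 + 6.0 * np.sin(0.01 * np.arange(2500))   # time-varying rate in Hz
spikes = refractory_poisson(rate)
```

By construction, no two spikes can fall within the absolute refractory period, so the simulated interspike intervals are always at least two 4 ms bins apart.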
We approximated the firing rate using two separate methods. In the first method, we measured the probability of firing in the PSTH of a cell in response to 385 repetitions of an identical stimulus 8 seconds in duration. We termed this firing rate time series λ1. The firing rate λ1 is effectively a probability of a spike within the bin size δt. As can be seen from a raster plot, this quantity varies significantly with the stimulus history. Because λ1 has a limited duration, we generated a second, artificial firing rate λ2 of arbitrary length. We generated λ2 by applying the correlated bootstrap method to λ1 (see appendix A), preserving the correlation structure of λ1. We used λ2 to test the estimator convergence over long durations of time.

Acknowledgments

This work was supported by NSF IGERT training grant DGE-0333451 and La Jolla Interfaces in the Sciences (J.S.) and a Sloan Research Fellowship (E.J.C.). We thank J. Victor and P. Reinagel for valuable discussions; our reviewers for substantive comments and feedback; and colleagues at the Institute for Nonlinear Science, the Santa Cruz Institute for Particle Physics, and the Systems Neurobiology Laboratory (Salk Institute) for tremendous feedback and technical assistance.
References

Abarbanel, H. D. I., Masuda, N., Rabinovich, M. I., & Tumer, E. (2001). Distribution of mutual information. Physics Letters A, 281, 368–373.
Abarbanel, H. D., & Talathi, S. S. (2006). Neural circuitry for recognizing interspike interval sequences. Physical Review Letters, 96, 148104.
Aguera y Arcas, B., & Fairhall, A. (2003). Computation in a single neuron: Hodgkin and Huxley revisited. Neural Computation, 15, 1715–1749.
Amigo, J. M., Szczepanski, J., Wajnryb, E., & Sanchez-Vives, M. (2004). Estimating the entropy rate of spike trains via Lempel-Ziv complexity. Neural Computation, 16, 717–736.
Baccus, S. A., & Meister, M. (2002). Fast and slow contrast adaptation in retinal circuitry. Neuron, 36, 909–919.
Bair, W., & Koch, C. (1996). Temporal precision of spike trains in extrastriate cortex of the behaving macaque monkey. Neural Computation, 8, 1185–1202.
Berry, M. J., & Meister, M. (1998). Refractoriness and neural precision. Journal of Neuroscience, 18, 2200–2211.
Berry, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proceedings of the National Academy of Sciences, 94, 5411–5416.
Bezzi, M., Samengo, I., Leutgeb, S., & Mizumori, S. J. Y. (2002). Measuring information spatial densities. Neural Computation, 14, 405–420.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Borst, A., & Theunissen, F. (1999). Information theory and neural coding. Nature Neuroscience, 2, 947–957.
Brenner, N., Bialek, W., & de Ruyter van Steveninck, R. (2000). Adaptive rescaling maximizes information transmission. Neuron, 26, 695–702.
Brenner, N., Strong, S. P., Koberle, R., Bialek, W., & de Ruyter van Steveninck, R. (2000). Synergy in a neural code. Neural Computation, 12, 1531–1552.
Brillouin, L. (1956). Science and information theory. Orlando, FL: Academic Press.
Buracas, G., Zador, A., DeWeese, M., & Albright, T. (1998). Efficient discrimination of temporal patterns by motion-sensitive neurons in primate visual cortex. Neuron, 20, 959–969.
Butts, D. (2003). How much information is associated with a particular stimulus? Network, 14, 177–187.
Chichilnisky, E. J., & Kalmar, R. (2003). Temporal resolution of ensemble visual motion signals in primate retina. Journal of Neuroscience, 23, 6681–6689.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Dan, Y., Alonso, J. M., Usrey, W. M., & Reid, R. C. (1998). Coding of visual information by precisely correlated spikes in the lateral geniculate nucleus. Nature Neuroscience, 1, 501–507.
Davison, A., Hinkley, D., Gill, R., Ripley, B., Ross, S., Stein, M., & Williams, D. (1997). Bootstrap methods and their applications. Cambridge: Cambridge University Press.
de Ruyter van Steveninck, R., & Bialek, W. (1988). Real-time performance of a movement sensitive neuron in the blowfly visual system: Coding and information transfer in short spike sequences. Proceedings of the Royal Society of London, Series B, 234, 379–414.
de Ruyter van Steveninck, R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
DeWeese, M., & Meister, M. (1999). How to measure the information gained from one symbol. Network, 10, 325–340.
Dimitrov, A. G., Miller, J. P., Gedeon, T., Aldworth, Z., & Parker, A. E. (2003). Analysis of neural coding through quantization with an information-based distortion measure. Network: Computation in Neural Systems, 14, 151–176.
Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Pattern classification. New York: Wiley.
Fairhall, A., Lewen, G., Bialek, W., & de Ruyter van Steveninck, R. (2001). Efficiency and ambiguity in an adaptive neural code. Nature, 412, 787–792.
Fano, R. M. (1961). Transmission of information: A statistical theory of communications. Cambridge, MA: MIT Press.
Fellous, J. M., Tiesinga, P. H. E., Thomas, P. J., & Sejnowski, T. J. (2004). Discovering spike patterns in neuronal responses. Journal of Neuroscience, 24, 2989–3001.
Frechette, E. S., Sher, A., Grivich, M. I., Petrusca, D., Litke, A. M., & Chichilnisky, E. J. (2005). Fidelity of the ensemble code for visual motion in primate retina. Journal of Neurophysiology, 94, 119–135.
Gamerman, D. (1997). Markov chain Monte Carlo: Stochastic simulation of Bayesian inference. New York: CRC Press.
Estimating Information Rates in Neural Spike Trains
1717
Gat, I., & Tishby, N. (1999). Synergy and redundancy among brain cells of behaving monkeys. In H. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 465–471). Cambridge, MA: MIT Press. Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? Journal of Neuroscience, 13, 2758–2771. Gilmore, R., & Lefranc, M. (2002). The topology of chaos: Alice in stretch and squeeze land. New York: Wiley. Hastings, W. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrica, 57, 97. Hilborn, R. (2000). Chaos and nonlinear dynamics: An introduction for scientists and engineers. New York: Oxford University Press. Keat, J., Reinagel, P., Reid, R. C., & Meister, M. (2001). Predicting every spike: A model for the responses of visual neurons. Neuron, 30, 803–817. Kennel, M. B., & Mees, A. I. (2002). Context-tree modeling of observed symbolic dynamics. Physical Review E, 66, 056209. Kennel, M., Shlens, J., Abarbanel, H. D. I., & Chichilnisky, E. J. (2005). Estimating entropy rates with Bayesian confidence intervals. Neural Computation, 7, 1531– 1576. Kontoyiannis, I., Algoet, P., Suhov, Y., & Wyner, A. (1998). Nonparametric entropy estimation for stationary processes and random fields with applications to English text. IEEE Transactions in Information Theory, 44, 1319–1327. Lempel, A., & Ziv, J. (1976). On the complexity of finite sequences. IEEE Transactions in Information Theory, 22, 75–81. Lewen, G. D., Bialek, W., & de Ruyter van Steveninck, R. R. (2001). Neural coding of naturalistic motion stimuli. Network: Computation in Neural Systems, 12, 317– 329. Lind, D., & Marcus, B. (1996). Symbolic dynamics and coding. Cambridge: Cambridge University Press. Liu, R. C., Tzonev, S., Rebrik, S., & Miller, K. D. (2001). Variability and information in a neural code of the cat lateral geniculate nucleus. 
Journal of Neurophysiology, 86, 2789–2806. London, M., Schreibman, A., Hausser, M., Larkum, M., & Segev, I. (2002). The information efficacy of a synapse. Nature Neuroscience, 5, 332–340. MacKay, D. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press. MacKay, D., & McCulloch, W. (1952). The limiting information capacity of a neuronal link. Bulletin of Mathematical Biophysics, 14, 127–135. Mainen, Z., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506. Mastronarde, D. N. (1989). Correlated firing of retinal ganglion cells. Trends in Neuroscience, 12, 75–80. Meister, M., & Berry, M. J. (1999). The neural code of the retina. Neuron, 22, 435–450. Meister, M., Lagnado, L., & Baylor, D. A. (1995). Concerted signaling by retinal ganglion cells. Science, 270, 1207–1210. Metropolis, N., Rosenbluth, M., Rosenbluth, A., Teller, M., & Teller, A. (1953). Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087.
1718
J. Shlens, M. Kennel, H. Abarbanel, and E. Chichilnisky
Miller, G., & Madow, W. (1954). On the maximum likelihood estimate of the ShannonWiener measure of information. Air Force Cambridge Research Center Technical Report, 75, 54. Nemenman, I., Bialek, W., & de Ruyter van Steveninck, R. (2004). Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69. Nemenman, I., Shafee, F., & Bialek, W. (2002). Entropy and inference, revisited. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processsing systems, 14. Cambridge, MA: Cambridge University Press. Nirenberg, S., & Latham, P. (2003). Decoding neuronal spike trains: How important are correlations? Proceedings of the National Academies of Science, 100, 7348–7353. Ott, E. (2002). Chaos in dynamical systems. Cambridge: Cambridge University Press. Paninski, L. (2003a). Estmation of entropy and mutual information. Neural Computation, 15, 1191–1254. Paninski, L. (2003b). Convergence properties of three spike-triggered analysis techinques. Network: Computation in Neural Systems, 14, 437–464. Paninski, L., Pillow, J., & Simoncelli, E. (2004). Maximum likelihood estimation of stochastic integrate-and-fire neural model. Neural Computation, 16, 2533–2561. Pillow, J., & Simoncelli, E. (2003). Biases in white noise analysis due to non-Poisson spike generation. Neurocomputing, 52, 109–115. Politis, D. N., & Romano, J. P. (1994). The stationary bootstrap. J. Amer. Stat. Assoc., 89, 1303–1313. Politis, D. N., & White, H. (2004). Automatic block-length selection for dependent bootstrap. Econometric Review, 23, 53–70. Puchalla, J., Schneidman, E., Harris, R., & Berry R. J. II (2005). Redundancy in the population code of the retina. Neuron, 46, 493–504. Rapp, P. E., Zimmerman, I. D., Vining, E. P., Cohen, N., Albano, A. M., & JimhezMontaiio, M. A. (1994). The algorithmic complexity of neural spike trains increases during focal seizures. Journal of Neuroscience, 14, 4731–4739. Reich, D. 
S., Mechler, F., & Victor, J. (2001). Information in nearby cortical neurons. Science, 294, 2566–2568. Reinagel, P., & Reid, R. C. (2000). Temporal coding of visual information in the thalamus. Journal of Neuroscience, 20, 5392–5400. Reinagel, P., & Reid, R. C. (2002). Precise firing events are conserved across neurons. Journal of Neuroscience, 22, 6837–6841. Rieke, F., Bodnar, D., & Bialek, W. (1995). Naturalistic stimuli increase the rate and efficiency of information transmission by primary auditory afferents. Proceedings of the Royal Society of London B: Biological Sciences, 262, 259–265. Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press. Rissanen, J. (1989). Stochastic complexity in statistical inquiry. Singapore: World Scientific. Schneidman, E., Bialek, W., & Berry, M. J. (2003). Synergy, redundancy, and independence in population codes. Journal of Neuroscience, 23, 11539–11553. Schnitzer, M. J., & Meister, M. (2003). Multineuronal firing patterns in the signal from eye to brain. Neuron, 37, 499–511.
Estimating Information Rates in Neural Spike Trains
1719
Schurmann, T., & Grassberger, P. (1996). Entropy estimation of symbol sequences. Chaos, 6, 414–427. Shannon, C. (1948). A mathematical theory of communication. Bell Systems Technology Journal, 27, 379–423. Sharpee, T., Rust, N., & Bialek, W. (2004). Anaylzing neural responses to natural signals: Maximally informative dimensions. Neural Computation, 16, 223–250. Simoncelli, E. P., & Olhausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1215. Simoncelli, E. P., Paninski, L., Pillow, J., & Schwartz, O. (2004). Characterization of neural responses with stochastic stimuli. In M. Gazzaniga (Ed.), The cognitive neurosciences (3rd ed.). Cambridge, MA: MIT Press. Stanley, G. B., Li, F. F., & Dan, Y. (1999). Reconstruction of natural scenes from ensemble responses in the lateral geniculate nucleus. Journal of Neuroscience, 19, 8036–8042. Strong, S. P., Koberle, R., de Ruyter van Steveninck, R., & Bialek, W. (1998). Entropy and information in neural spike trains. Physical Review Letters, 80, 197–200. Theunissen, F., & Miller, J. (1995). Temporal encoding in the nervous system: A rigorous definition. Journal of Computational Neuroscience, 2, 149–162. Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Computation, 7, 399–407. Usrey, W. M., & Reid, R. C. (1999). Synchronous activity in the visual system. Annual Reviews in Physiology, 61, 435–456. Uzzell, V. J., & Chichilnisky, E. J. (2004). Precision of spike trains in primate retinal ganglion cells. Journal of Neurophysiology, 92, 780–789. Victor, J. D. (2002). Binless strategies for estimation of information from neural data. Physical Review E, 66, 051903. Warland, D., Reinagel, P., & Meister, M. (1997). Decoding visual information from a population of retinal ganglion cells. Journal of Neurophysiology, 78, 2336–2350. Willems, F. M. J., Shtarkov, Y. M., & Tjalkens, T. (1995). 
The context tree weighting method: Basic properties. IEEE Transactions in Information Theory, IT-41, 653–664. Wolpert, D., & Wolf, D. (1995). Estimating functions of probability distributions from a finite set of samples. Physical Review E, 52, 6841–6854.
Received April 12, 2006; accepted October 4, 2006.
LETTER
Communicated by Jonathan Victor
Generation of Synthetic Spike Trains with Defined Pairwise Correlations

Ernst Niebur
[email protected] Krieger Mind/Brain Institute and Department of Neuroscience, Johns Hopkins University, Baltimore, MD 21218, U.S.A.
Recent technological advances as well as progress in the theoretical understanding of neural systems have created a need for synthetic spike trains with controlled mean rate and pairwise cross-correlation. This report introduces and analyzes a novel algorithm for the generation of discretized spike trains with arbitrary mean rates and controlled cross-correlation. Pairs of spike trains with any pairwise correlation can be generated, and higher-order correlations are compatible with common synaptic input. Relations between allowable mean rates and correlations within a population are discussed. The algorithm is highly efficient, its complexity increasing only linearly with the number of spike trains generated; the cost per cross-correlated pair therefore decreases with population size.
Neural Computation 19, 1720–1738 (2007)
© 2007 Massachusetts Institute of Technology

1 Introduction

Understanding any natural or artificial system is considerably furthered by the possibility of not only observing it but also manipulating it, that is, of modifying the system's state and observing the resulting changes in behavior. In neuroscience, recent technological advances have substantially increased the extent to which the microstate of nervous systems can be controlled. While electrical microstimulation has been one of the standard tools of neurophysiology for some time (for a recent review, see Cohen & Newsome, 2004), new technology allows the stimulation of substantially larger numbers of sites (see, e.g., Normann, Maynard, Rousche, & Warren, 1999; Sekirnjak et al., 2006). In addition, newly developed optical methods based on uncaging of neurotransmitters allow the activation of up to tens of thousands of locations per second with high spatial and temporal resolution (Callaway & Katz, 1993; Shepherd, Pologruto, & Svoboda, 2003; Boucsein, Nawrot, Rotter, Aertsen, & Heck, 2005; Shoham, O'Connor, Sarkisov, & Wang, 2005; Miesenbock & Kevrekidis, 2005; Thompson et al., 2005). Realizing the full potential of these techniques will require the injection of sets of artificial spike trains with controlled rates and correlations between them. There is thus a need for efficient and systematic methods
that allow generating not just individual spike trains but large numbers of them and the possibility of varying their rates and correlation structure. A similar need arises in theoretical neuroscience. While much of the earlier work in the field was based on the mean-rate coding assumption originating in the peripheral nervous system (Adrian & Zotterman, 1926), that is, the hypothesis that neural processing uses only the information contained in the mean number of spikes averaged over a few hundred milliseconds, this concept is being questioned in increasingly stronger terms (Abeles, 1982; Singer, 1999). A substantial number of current models explores the consequences of the temporal coding hypothesis, that is, that the nervous system does not discard all information in neural spike trains other than the mean activity level. It has been proposed that neural information may be represented by controlled changes in the autocorrelation function, as in the large number of models of oscillation coding (e.g., Gray & Singer, 1989; Niebur, Koch, & Rosin, 1993), or in cross-correlation functions, as in models studying synchrony (e.g., Niebur & Koch, 1994; Tiesinga & Sejnowski, 2004), or a combination of both. Thus, both the development of quantitative neural models that rely on correlated activity and of experimental techniques that allow selectively exciting neural tissue in multiple sites require the availability of test data. The subject of this letter is the generation of sets of synthetic spike trains with controlled rates and cross-correlations. A finite pairwise correlation within a population of N model neurons can be obtained by generating a set of N_2 random numbers with suitable distribution (e.g., Poisson) and assigning them randomly to the N spike trains. If N_2 < N, some states are necessarily identical, which introduces correlations between spike trains (Destexhe & Pare, 1999).
While this method is efficient, the mean pairwise correlation is not parametrically controlled (only indirectly via the ratio N_2/N), and it is the same for all pairs of neurons. Other methods were based on manipulating stochastic processes by either subtracting events (Ogata, 1981) or adding events (Stroeve & Gielen, 2001). Different from these earlier methods, the algorithm developed here generates spike trains whose mean rates as well as cross-correlations between each pair of spike trains are free parameters that can be selected independently, subject to constraints discussed below. The algorithm developed here is a generalization of our earlier work (Mikula & Niebur, 2003a, 2003b, 2004, 2005) in that it allows differing rates for the spike trains rather than requiring that all rates are identical. Furthermore, the cross-correlation between any two of these spike trains can be selected anywhere between the cases of independent spike trains (minimal correlation), on one hand, and identical spike trains (maximal correlation), on the other, subject to limitations discussed below. Rather than advancing to the general case immediately, we first introduce, in sections 2 and 3, the simple and intuitive case of considering two spike trains only. In sections 4 and 5, we generalize to an arbitrary number
of spike trains. Different from the earlier case of identical firing rates in all spike trains, not all firing rates and cross-correlations can be combined; it is impossible to have strong correlations in a set of spike trains with vastly different firing rates. The resulting constraints are studied in section 6.

2 Generating Two Spike Trains

Our goal is to provide a constructive method for generating two stochastic processes (we use the terms spike train and stochastic point process interchangeably), x and y, each consisting of a series of binary events (0, or no spike in bin, and 1, or one spike in bin). We assume that spike trains are memoryless, that is, all their time bins are independent, and we can therefore analyze them separately (this assumption can be relaxed; see section 7). These are Bernoulli processes, with 1 occurring in process x with probability^1 r_x = ⟨x⟩, with 0 ≤ r_x ≤ 1 (the expectation value of obtaining a 1; see equation 3.2), and 0 occurring with probability 1 − ⟨x⟩. Likewise, in process y, 1 occurs with probability r_y = ⟨y⟩, with 0 ≤ r_y ≤ 1, and 0 with probability 1 − ⟨y⟩. We note that for most practical purposes, the firing probability in any given bin is low, r_x ≪ 1, in which case the Bernoulli process can be approximated by a Poisson process. The pairwise correlation of the two processes is C_xy, the expectation value of the probability of obtaining coincident spikes with rate correction (see equation 3.4). In order to construct the two processes x and y, we follow the appendix in Mikula and Niebur (2003b). We start out with three independent Bernoulli processes that have either a 1 or a 0 in each bin and are binomially distributed. For one of the spike trains, referred to as reference spike train r in the following (because it is left unchanged), the probability of having a 1 in any given bin is p; thus, a 0 is found with probability (1 − p). The other two spike trains, x and y, are constructed as follows.
For spike train x, we start with a memoryless two-state stochastic process in each bin of which a 1 is found with probability p_x and a 0 with probability (1 − p_x). We then switch, with a probability √q, the state in x to that in r. Note that in general (in all cases other than p = p_x or q = 0), the mean rate of the resulting spike train will be neither p nor p_x but an intermediate value to be computed later, in equation 3.2. The same procedure is applied to spike train y, with p_x replaced everywhere by p_y. Table 1 shows the complete probability table for all combinations of a reference spike train (r, first column) and two spike trains x and y (second and third columns). Determining the probabilities P_{rxy} (fourth column) is straightforward. For the purpose of illustration, we walk through the computation
1. We use the terms firing probability and mean firing rate interchangeably since they are proportional, the factor of proportionality being the bin width, a constant. See also section 7.
Table 1: Probability Table for All States of the Reference Spike Train (s_r) and Two Spike Trains s_x and s_y.

s_r  s_x  s_y  P_{rxy}
0    0    0    (1 − p)[(1 − p_x) + p_x √q][(1 − p_y) + p_y √q]
0    1    0    (1 − p) p_x (1 − √q)[(1 − p_y) + p_y √q]
0    0    1    (1 − p)[(1 − p_x) + p_x √q] p_y (1 − √q)
0    1    1    (1 − p) p_x p_y (1 − √q)²
1    0    0    p (1 − p_x)(1 − p_y)(1 − √q)²
1    1    0    p [p_x + (1 − p_x) √q](1 − p_y)(1 − √q)
1    0    1    p (1 − p_x)(1 − √q)[p_y + (1 − p_y) √q]
1    1    1    p [p_x + (1 − p_x) √q][p_y + (1 − p_y) √q]

Notes: The reference spike train has firing probability p, and the original spike trains x and y have firing probabilities p_x and p_y, respectively. In columns 1–3, 1 indicates a spike in the respective spike train, and 0 stands for no spike.
of the first probability, P_{000}. It is obtained by observing that the probability of having no spike in the reference spike train is (1 − p), since the probability of obtaining a spike is p. This gives rise to the first factor, (1 − p). Given the absence of a spike in the reference spike train, there are two possibilities that spike train x also does not contain a spike in this bin. The first is that it starts out with no spike. This happens with probability (1 − p_x), the first term in the square bracket. This is independent of the state of the reference spike train since, whether a switch occurs or not, this spike train will always have no spike (switching would not change the state since, by assumption, the reference spike train does not contain a spike). The second possibility occurs if the state in spike train x is switched to that in the reference spike train (0) even though originally (before switching) this state was 1. Since the latter occurs with probability p_x and the switching with probability √q, this additional probability is p_x √q, which completes the derivation of the first square bracket in P_{000} (note that these two events, having a spike occur in spike train x and switching the state to that of the reference spike train, are independent; therefore, their probabilities multiply to obtain the probability of the compound event). Finally, the same considerations apply to the second spike train (y), which therefore contributes a factor equal to that obtained for x but with all p_x replaced by p_y. The total probability is the product of those for the states of r, x, and y. The other cases in Table 1 are derived analogously.

3 Properties of Two Spike Trains

Having obtained Table 1, it is straightforward (although somewhat tedious) to verify the normalization condition:

Σ_{i=0,1} Σ_{j=0,1} Σ_{k=0,1} P_{ijk} = 1.   (3.1)
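As a numerical cross-check of Table 1 and equations 3.1 to 3.4, the eight probabilities can be tabulated and the moments computed directly. The following is a sketch, not from the paper; function and variable names are ours:

```python
import math
from itertools import product

def table1_probs(p, px, py, q):
    """Return {(s_r, s_x, s_y): probability} for the eight rows of Table 1."""
    s = math.sqrt(q)
    def cond(pi, sr, si):
        # Probability of state si in a constructed train with base rate pi,
        # given reference state sr (state is copied from r with probability sqrt(q)).
        if sr == 0:
            return (1 - pi) + pi * s if si == 0 else pi * (1 - s)
        return (1 - pi) * (1 - s) if si == 0 else pi + (1 - pi) * s
    return {(sr, sx, sy):
            ((1 - p) if sr == 0 else p) * cond(px, sr, sx) * cond(py, sr, sy)
            for sr, sx, sy in product((0, 1), repeat=3)}

p, px, py, q = 0.5, 0.2, 0.4, 0.25
P = table1_probs(p, px, py, q)
total = sum(P.values())                                   # equation 3.1
rx = sum(v for s, v in P.items() if s[1] == 1)            # equation 3.2
ry = sum(v for s, v in P.items() if s[2] == 1)            # equation 3.3
cxy = sum(v for s, v in P.items()
          if s[1] == 1 and s[2] == 1) - rx * ry           # equation 3.4
```

With these parameters, total is 1, rx = p_x + √q (p − p_x) = 0.35, ry = 0.45, and cxy = (p − p²) q = 0.0625, in agreement with equations 3.1 to 3.4.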
We now proceed to compute the mean firing rate of the spike trains. Direct substitution yields

r_x := ⟨x⟩ := Σ_{i=0,1} Σ_{k=0,1} P_{i1k} = p_x + √q (p − p_x).   (3.2)

In this equation, the first and second equalities are the definition of the mean rate, and the last one is the result of direct substitution of terms from Table 1 and straightforward simplification. As anticipated, the mean rate ⟨x⟩ is influenced by p and p_x. Of course, for q = 0, we obtain ⟨x⟩ = p_x, and for q = 1, we have ⟨x⟩ = p. The analogous result is obtained for y:

r_y := ⟨y⟩ := Σ_{i=0,1} Σ_{k=0,1} P_{ik1} = p_y + √q (p − p_y).   (3.3)

The rate-corrected cross-correlation (or covariance) between spike trains x and y is

C_xy := ⟨xy⟩ − ⟨x⟩⟨y⟩ := Σ_{i=0,1} P_{i11} − (Σ_{i,k=0,1} P_{i1k}) (Σ_{i,k=0,1} P_{ik1}) = (p − p²) q,   (3.4)

where the first and second equalities are definitions, and the third one is obtained after substitution from Table 1 and simplification. We observe that C_xy is independent of p_x and p_y. Furthermore, C_xy attains a maximum value of q/4 at p = 1/2 and decreases monotonically for larger and smaller p, to reach zero for both p = 0 and p = 1. Of course, C_xy is identically zero for all p if q = 0. For q = 1, all three spike trains are identical and have spike rate p and correlation C_xy = p − p², which is the autocorrelation of a binomial process with probability p. The latter can also be computed directly from Table 1 as the variance of the reference spike train,

var_r := C_rr := ⟨(r − ⟨r⟩)²⟩ = p − p²,   (3.5)
where the same notation as that introduced in equation 3.4 was used. In practice, rather than specifying the rates p_x and p_y of the uncorrelated stochastic processes and their correlation √q with the reference spike train, we are interested in generating spike trains with specified mean rates and cross-correlation. That is, ⟨x⟩, ⟨y⟩, and C_xy are given quantities, and we have to determine the parameters p_x, p_y, and q given these values. By solving equations 3.2 to 3.4 for p_x, p_y, and q, we find that a pair of spike trains with means ⟨x⟩, ⟨y⟩ and cross-correlation C_xy is obtained by choosing p_x, p_y, and q as

p_x = (⟨x⟩ − p √q) / (1 − √q)   (3.6)

p_y = (⟨y⟩ − p √q) / (1 − √q)   (3.7)

q = C_xy / (p − p²),   (3.8)
where p, the rate of the reference spike train, is formally a free parameter. There are limitations on the choice of this parameter (e.g., p − p² ≥ C_xy to ensure q ≤ 1 in equation 3.8), which will be discussed in the more general case of arbitrary numbers of spike trains in section 6. Note that these limitations do not apply if all firing rates are identical (⟨x⟩ = ⟨y⟩ = p), since p_x and p_y do not depend on q in this case (Mikula & Niebur, 2003a).

4 Arbitrary Number of Spike Trains

For N spike trains, there are 2^N possible states in each bin. For nonzero q, all these states are entangled because each randomly chosen pair has to have the same mean covariance. Despite the apparent complexity of the problem, we can compute the probability of each of the states as the sum of two terms, as follows. The trick is that we specify only the correlation with the reference spike train r; the pairwise correlation is then obtained from expressions like equation 3.4. For instance, Table 1 shows that the state without a spike in either x or y has a total probability of

{(1 − p)[(1 − p_x) + p_x √q][(1 − p_y) + p_y √q]} + {p (1 − p_x)(1 − p_y)(1 − √q)²},

where the terms in the first and second braces correspond to the cases without and with a spike in r, respectively. The algorithm to generate N correlated spike trains is thus as follows. We start by generating N uncorrelated, memoryless stochastic point processes (i.e., they have independent time bins) with binomial firing statistics in each bin. As before, each is a Bernoulli process, and process i, with i ∈ {1, …, N}, has a mean firing rate (probability of having state 1) of p_i, 0 ≤ p_i ≤ 1. One additional process with the same statistics and with mean rate p is generated and referred to, as before, as the reference spike train r. To generate the set of N correlated spike trains, we switch, with probability √q, the state of uncorrelated spike train i to that of spike train r.
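The switching construction just described can be sketched in a few lines. This is a minimal illustration under our own naming, using only the standard library:

```python
import math
import random

def correlated_trains(p_list, p, q, n_bins, seed=0):
    """Generate len(p_list) binary spike trains with common pairwise
    covariance (p - p^2) q: draw a reference train with rate p and one
    independent Bernoulli train per rate p_i, then copy the reference
    state into train i with probability sqrt(q) in each bin."""
    rng = random.Random(seed)
    s = math.sqrt(q)
    ref = [1 if rng.random() < p else 0 for _ in range(n_bins)]
    trains = []
    for pi in p_list:
        trains.append([ref[b] if rng.random() < s
                       else (1 if rng.random() < pi else 0)
                       for b in range(n_bins)])
    return trains
```

For example, with p = 0.3, q = 0.16, and p_i = (0.1, 0.5) over 2 × 10⁵ bins, the empirical rates come out near p_i + √q (p − p_i) = 0.18 and 0.42, and the empirical covariance near (p − p²) q ≈ 0.034, as equations 4.2 and 4.5 below predict.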
Using the notation from the previous paragraph, let (s_1, s_2, …, s_N) be the state of the N spike trains
Table 2: Probability of Obtaining State s_i in Process i, Depending on the State of the Reference Process.

s_r  s_i  P^{(i)}_{s_r s_i}         P^{(i)}(s_r, s_i)
0    0    (1 − p_i) + p_i √q        (1 − p)[(1 − p_i) + p_i √q]
1    0    (1 − p_i)(1 − √q)         p (1 − p_i)(1 − √q)
0    1    p_i (1 − √q)              (1 − p) p_i (1 − √q)
1    1    p_i + (1 − p_i) √q        p [p_i + (1 − p_i) √q]

Notes: Column 1: state of the reference process. Column 2: state of process i. The last column shows the probability of obtaining the combination of state s_r in the reference process and state s_i in process i; the third column shows only the component of this probability that is independent of p. The first factor in P^{(i)}(s_r, s_i) stems from the reference spike train r; the second factor comes from spike train i and depends on both its own state and that of r. Note that, to avoid proliferation of symbols, the probabilities in column 4 are distinguished from the expressions in column 3 (which are not probabilities but terms used in equation 4.1 and beyond) by subscript notation in column 3 and parentheses in column 4.
in a given time bin. The probability of finding the system in this state is then

P(s_1, s_2, …, s_N) = (1 − p) ∏_{i=1}^{N} P^{(i)}_{0 s_i} + p ∏_{i=1}^{N} P^{(i)}_{1 s_i},   (4.1)

where P^{(i)}_{0 s_i} and P^{(i)}_{1 s_i} are the terms from the third column of Table 2. We will now show that equations 3.6 to 3.8 generalize immediately to the case of arbitrary N. The mean rate of process i is defined as the expectation value of this process having state 1, that is,

r_i := P(s_i = 1) = (1 − p)(P^{(1)}_{00} + P^{(1)}_{01})(P^{(2)}_{00} + P^{(2)}_{01}) ⋯ P^{(i)}_{01} ⋯ (P^{(N)}_{00} + P^{(N)}_{01}) + p (P^{(1)}_{10} + P^{(1)}_{11})(P^{(2)}_{10} + P^{(2)}_{11}) ⋯ P^{(i)}_{11} ⋯ (P^{(N)}_{10} + P^{(N)}_{11}).

The sums within each parenthesis in this equation are again unity, and we obtain

r_i = (1 − p) P^{(i)}_{01} + p P^{(i)}_{11} = p_i + √q (p − p_i),   (4.2)

where the second equality is obtained using column 3 in Table 2. Equation 4.2 is the generalization of equation 3.2 for spike train i.
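For small N, equation 4.1 can be evaluated exhaustively over all 2^N joint states, which verifies the normalization and equation 4.2 numerically. A sketch (helper names are ours, not from the paper):

```python
import math
from itertools import product

def joint_prob(states, p, p_list, q):
    """Equation 4.1: probability of the joint state (s_1, ..., s_N) in one bin."""
    s = math.sqrt(q)
    def cond(pi, sr, si):  # third column of Table 2
        if sr == 0:
            return (1 - pi) + pi * s if si == 0 else pi * (1 - s)
        return (1 - pi) * (1 - s) if si == 0 else pi + (1 - pi) * s
    return ((1 - p) * math.prod(cond(pi, 0, si) for pi, si in zip(p_list, states))
            + p * math.prod(cond(pi, 1, si) for pi, si in zip(p_list, states)))

p, q, p_list = 0.4, 0.25, [0.1, 0.3, 0.8]
P = {st: joint_prob(st, p, p_list, q)
     for st in product((0, 1), repeat=len(p_list))}
rates = [sum(v for st, v in P.items() if st[i] == 1)
         for i in range(len(p_list))]
cov01 = (sum(v for st, v in P.items() if st[0] == st[1] == 1)
         - rates[0] * rates[1])
```

Here the probabilities sum to 1, the marginal rates reproduce equation 4.2 (e.g., r_1 = 0.1 + 0.5 · 0.3 = 0.25), and cov01 equals (p − p²) q = 0.06, anticipating equation 4.5 below.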
The correlation function is computed analogously. By definition, the correlation function of processes i and k is the expectation value of both processes having state 1, corrected by the product of their rates:

C_ik := P(s_i = 1, s_k = 1) − r_i r_k = (1 − p)(P^{(1)}_{00} + P^{(1)}_{01}) ⋯ P^{(i)}_{01} ⋯ P^{(k)}_{01} ⋯ (P^{(N)}_{00} + P^{(N)}_{01}) + p (P^{(1)}_{10} + P^{(1)}_{11}) ⋯ P^{(i)}_{11} ⋯ P^{(k)}_{11} ⋯ (P^{(N)}_{10} + P^{(N)}_{11}) − r_i r_k.   (4.3)

The sums in the parentheses being unity, we are left with

C_ik = (1 − p) P^{(i)}_{01} P^{(k)}_{01} + p P^{(i)}_{11} P^{(k)}_{11} − r_i r_k.   (4.4)

Using again P^{(i)}_{01} and P^{(i)}_{11} from Table 2 and equation 4.2, we obtain, after simplification,

C_ik = (p − p²) q,   (4.5)
which is the generalization of equation 3.4. In practice, the mean rates r_i for i ∈ {1, …, N} of the correlated processes and their pairwise correlation C are given, and we need to determine the parameters p_i and q of the uncorrelated stochastic processes. We therefore solve equations 4.2 and 4.5 for these parameters to obtain

p_i = (r_i − p √q) / (1 − √q)   (4.6)

q = C / (p − p²),   (4.7)
which are the generalizations of equations 3.6 to 3.8. Section 6 discusses constraints on the choice of these parameters.

5 Arbitrary Cross-Correlations

So far we have assumed that all cross-correlations are identical. To generalize to arbitrary covariances, rather than changing the state in all stochastic processes with the same probability √q, the state in process i is changed to that of the reference process with probability √q_i, which can be different for each i. The table of probabilities is then just as in Table 2 but with √q replaced by √q_i everywhere. Straightforward algebra along the lines of equations 4.1 to 4.5 shows that the mean firing rate of process i is unchanged and given as in equation 4.2, and that the mean cross-correlation function
between processes i and k, as defined in equation 4.3, is

C_ik = (p − p²) √(q_i q_k).   (5.1)
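Equation 5.1 can be checked with a small Monte Carlo sketch of the per-train switching construction (function name and parameter choices are ours):

```python
import math
import random

def correlated_trains_hetero(p_list, q_list, p, n_bins, seed=0):
    """As in section 4, but train i copies the reference state with its
    own probability sqrt(q_i), giving pairwise covariances
    (p - p^2) sqrt(q_i q_k), as in equation 5.1."""
    rng = random.Random(seed)
    ref = [1 if rng.random() < p else 0 for _ in range(n_bins)]
    trains = []
    for pi, qi in zip(p_list, q_list):
        si = math.sqrt(qi)
        trains.append([ref[b] if rng.random() < si
                       else (1 if rng.random() < pi else 0)
                       for b in range(n_bins)])
    return trains
```

With p = 0.5, q = (0.16, 0.36), and p_i = (0.3, 0.7), the predicted covariance is (p − p²) √(q_1 q_2) = 0.25 · 0.24 = 0.06, which the empirical estimate approaches for long trains.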
Of course, in the special case of q_i = q_k, this reduces to equation 4.5. The mean rates are obtained as in equation 4.6, with √q replaced by √q_i in the numerator and denominator. Likewise, given a set of cross-correlations (C_ij = C_ji; i, j = 1, …, N), the corresponding parameters q_i are computed from equation 5.1. In the most general case, when all C_ij are different, half of the N parameters q_i can be chosen freely (subject to constraints similar to those discussed in section 6), and the other N/2 are determined by equation 5.1. We note in passing that another way to add structure to the correlation at the population level is to add additional reference spike trains.

6 Constraints on Rates and Correlations

While arbitrary values for q ∈ [0, 1] and p ∈ [0, 1] can be chosen if the mean rates of all spike trains are identical (Mikula & Niebur, 2003a), a set of conditions between p and q needs to be satisfied when mean rates differ (for simplicity, we discuss the case of equal pairwise cross-correlation only; the generalization discussed in section 5 does not pose any particular difficulties but complicates the notation). This is intuitively understood by the fact that too large a spread of firing rates does not allow enforcing strong pairwise correlation on all pairs. Formally, the constraint manifests itself in the requirement that q in equation 4.7 and p_i in equation 4.6 need to be well-defined probabilities with 0 ≤ q, p_i ≤ 1. Equation 4.7 requires^2

C ≤ p − p².   (6.1)
We solve for p to obtain

1/2 − √(1/4 − C) ≤ p ≤ 1/2 + √(1/4 − C).   (6.2)
Equation 4.6 yields

r_i − p √q ≥ 0   (6.3)

r_i − p √q ≤ 1 − √q.   (6.4)

2. As noted at the end of section 3, these constraints do not apply if all mean rates are identical, p_i ≡ p ∀ i = 1, …, N, since in that case p_i = r_i = p in equation 4.6, and the equation becomes an identity. The same is true for the trivial cases of q = 0 or q = 1.
Inequalities 6.3 and 6.4 can be combined as

1 − (1 − r_i)/√q ≤ p ≤ r_i/√q.   (6.5)
For a given p and q, the right inequality in 6.5 determines the smallest admissible rate r_i as

r_min = p √q,   (6.6)

and the left inequality determines the largest admissible rate,

r_max = 1 − (1 − p) √q.   (6.7)
In the following, we consider the practically more important reverse question: for a given set of r_i, can we find (at least) one p such that the algorithm can be applied, and if so, what are the allowable values for p? For any given r_i and q < 1, a solution of inequality 6.5 exists because the left-hand side can be rewritten as r_i/√q − (1/√q − 1), which is smaller than the right-hand side since, by assumption, q < 1 (the case q = 1 is trivial, leading to N identical spike trains with rate p). The condition does, however, establish a constraint on the set of mean rates r_i that can be generated with this procedure, because if their spread becomes too large, it is not possible to find a p that satisfies both inequalities in 6.5 for all i.
From the second of the inequalities 6.5 we have p ≤ √riq and thus p 2 ≤ qi , bearing in mind that the positive solution of this equation will be the correct one since p is a probability. Replacing q in this inequality by using equation 3.8, we obtain p 2 ≤ p≤
ri2 ( p− p2 ) C
and after division by p (= 0),
ri2 . C + ri2
(6.8)
Thus, p must be smaller than the expression on the right for all ri . For C = 0, the condition is fulfilled for all probabilities p, but for increasing C, it becomes more and more stringent. Using again equation 3.8, we rewrite the first inequality in equation 6.5 as 1 − ri 1− √ p − p2 ≤ p C
(6.9)
1730
E. Niebur
or

p + A√(p − p²) − 1 ≥ 0,  (6.10)

where A := (1 − r_i)/√C is a positive constant. At equality, this is a nonlinear equation of the type

p = 1 − A√(p − p²),  (6.11)
whose solution depends, through A, on r_i and C. In the relevant interval p ∈ ]0, 1[, the right-hand side of this equation is a U-shaped function with exactly one crossing with the identity. This fixed point is given by p = C[(1 − r_i)² + C]⁻¹. Although such a p can always be found for any given pair of values (r_i, C), it is possible that the intervals given by conditions 6.8 and 6.9 are nonoverlapping for a given C and a range of values [r_i,min, r_i,max]. In this case, no solution with the specified set of mean rates and pairwise correlation C between all pairs exists. Only if at least one p can be found that simultaneously fulfills equations 6.8 and 6.9 can the algorithm be applied.

Figure 1 shows the solutions of inequalities 6.5 as a function of r_i. Only those values of p that lie between the two lines corresponding to the left inequality (solid line) and the right inequality (dashed line) fulfill both inequalities and are therefore acceptable solutions. If the pairwise correlation is relatively small, as in Figure 1a with C = 0.02, a suitable p value can be chosen for all but the most extreme values of r_i, and in this case there is usually a large range of suitable p values. For large correlation, for example, C = 0.2 as shown in Figure 1b (note that the maximum of completely correlated, i.e., identical, spike trains is attained for C = 0.25), the constraints become more stringent, and the spread of allowable values for p is zero for a substantial subset of r_i (those below and above the crossing points). For rates in those regions, it is impossible to generate sets of correlated spike trains with unequal mean rates; only sets with r_i = p ∀i are allowed. The shaded part of the abscissa (in Figures 1a and 1b) is an example of a range of firing rates [r_i,min, r_i,max].
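The two feasibility conditions can be checked numerically. The sketch below is our own helper, not part of the original paper: it encodes inequality 6.8 as a per-rate upper bound on p, the fixed point p = C[(1 − r_i)² + C]⁻¹ of equation 6.11 as the per-rate lower bound, and intersects the bounds over all rates.

```python
def p_interval(rates, C):
    """Return the interval (lo, hi) of admissible p values, or None if empty.

    Upper bound per rate: p <= r**2 / (C + r**2)       (inequality 6.8)
    Lower bound per rate: p >= C / ((1 - r)**2 + C)    (fixed point of 6.11)
    """
    lo = max(C / ((1.0 - r) ** 2 + C) for r in rates)
    hi = min(r ** 2 / (C + r ** 2) for r in rates)
    return (lo, hi) if lo <= hi else None
```

For C = 0.02 and rates spanning [0.3, 0.4], the resulting interval contains p = 0.58, the value used later for Figure 2; for C = 0.2 and rates spanning [0.4, 0.6], the interval is empty, consistent with the discussion of Figure 1.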
The projection onto the ordinate shows that for small correlation (see Figure 1a, C = 0.02), a large range of p values can be chosen (shaded part of the ordinate), while for strong correlation (C = 0.2, in Figure 1b), the range of possible p values is much smaller. For an even larger range, for example, if [ri,min , ri,max ] = [0.4, 0.6], it is impossible to generate spike trains with cross-correlation C = 0.2 between each pair of the set. Figure 2 shows that spike trains computed with the described algorithm have the desired properties. The upper panels of Figure 2 show examples of 300 bins each of 100 spike trains, with, respectively, low correlation
Figure 1: Solutions of the inequalities 6.5 as a function of r_i for C = 0.02 (a) and C = 0.2 (b). The solid lines show the limits of the solution (i.e., at equality) for the left inequality; the dashed lines are the equivalent for the right inequality. Acceptable p values are all those between the two curves, that is, above the solid lines and below the dashed lines. For a chosen range of mean rates (shaded on the abscissa), the range of acceptable p values (shaded on the ordinate) is significantly smaller for C = 0.2 than for C = 0.02. The spread of acceptable p values is zero for rates r_i outside the range where the curves cross.
(C = 0.02) in the upper left panel and high correlation (C = 0.2) in the upper right. Even at C = 0.02, spike trains are clearly correlated and, as expected, correlation is very strong at C = 0.2. The lower left panel of Figure 2 demonstrates that the actual firing rates, computed as the expectation value of the number of events in equation 3.2 and plotted on the ordinate, scatter closely around the selected values (on the abscissa). For illustration, we generated 20 spike trains with 1000 bins each, with firing rates varying in equidistant steps over [0.3, 0.4] for C = 0.02 and over [0.5, 0.6] for C = 0.2 (these intervals were chosen simply so we could show both data sets in one plot). In both cases, p = 0.58 was chosen, which is within the allowable range in Figure 1. The bottom right panel of Figure 2 is a plot of the cross-correlation values between all these 20 spike trains (that is, the expectation value computed in equation 3.4) plotted as a function of the parameter C. Again, it is seen that the cross-correlation of the simulated spike trains scatters around the chosen value.

7 Discussion

A thorough understanding of biological neural systems requires the development of quantitative models, both experimental and theoretical. This is of particular importance for understanding neural microcircuitry, one of the most complex and at the same time practically and philosophically important structures known. Both theoretical (computational) and experimental models of neural tissue need a principled way of stimulating the model
Figure 2: Properties of simulated spike trains. (Upper left) Raster plot of 100 spike trains with p = 0.2 and C = 0.02. (Upper right) Same, but C = 0.2. (Bottom left) Measured rates of 40 simulated spike trains plotted against nominal rates. For illustration, rates are shown for 20 spike trains with equidistant rates in the range [0.3, 0.4] for C = 0.02 (+ signs) and from the range [0.5, 0.6] for C = 0.2 (x signs). (Lower right) Average cross-correlation between spike trains plotted as a function of C. It can be seen that the observed cross-correlation, computed from equation 3.4, scatters around the nominal value.
substrate, in vitro or in silico. Recent reports demonstrate substantial progress in overcoming the technical difficulties of delivering stimuli with high spatial and temporal resolution into the tissue. While the simplest stimulation patterns might be independently distributed random processes, the next stages of investigation will surely demand more structured patterns. This letter introduces an efficient and straightforward algorithm for constructing excitation patterns consisting of stochastic processes with controlled mean activity and cross-correlations between all pairs of processes. The algorithm is highly efficient, requiring on the order of N × B operations, where N is the number of spike trains and B the number of time bins per spike train. Remarkably, its complexity is thus linear in N, even though the number of cross-correlations generated grows as O(N²). Furthermore, the algorithm lends itself to parallelization down to the finest grain (individual spike trains and individual bins).
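The generation scheme itself can be sketched in a few lines. This is a reconstruction from the equations quoted in sections 3 through 6, not the author's code: each bin of train i copies a hidden Bernoulli(p) reference train with probability √q, where q = C/(p − p²) from equation 3.8, and is otherwise an independent Bernoulli draw with probability p_i' = (r_i − p√q)/(1 − √q), the quantity constrained by inequalities 6.3 and 6.4. This construction yields a pairwise covariance of exactly q(p − p²) = C between any two trains, and its cost is O(N × B), as stated above.

```python
import random

def correlated_trains(rates, p, C, n_bins, seed=0):
    """Binary spike trains with mean rates `rates` and pairwise covariance C.

    Copy-from-reference sketch: q = C / (p - p**2); each bin of train i
    copies the hidden reference bin with probability sqrt(q), otherwise
    it is an independent Bernoulli draw with probability p_i'.
    """
    rng = random.Random(seed)
    q = C / (p - p * p)                       # equation 3.8 solved for q
    sq = q ** 0.5
    # per-train independent rate, constrained by inequalities 6.3/6.4
    pi = [(r - p * sq) / (1.0 - sq) for r in rates]
    assert all(0.0 <= x <= 1.0 for x in pi), "rates violate section 6 constraints"
    trains = [[0] * n_bins for _ in rates]
    for t in range(n_bins):
        ref = 1 if rng.random() < p else 0    # hidden reference train
        for i, x in enumerate(pi):
            if rng.random() < sq:             # copy the reference bin
                trains[i][t] = ref
            else:                             # independent Bernoulli draw
                trains[i][t] = 1 if rng.random() < x else 0
    return trains
```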
An attractive alternative approach for generating correlated sets of model spike trains is based on maximum entropy principles. The basic idea (perhaps most clearly described by Jaynes, 1957, but going back to Gibbs and Laplace) is that the most rational course of action is to avoid making assumptions in the absence of information. Applied to the question of spike train statistics, it has been argued that spike trains with controlled second-order correlations should be generated with maximum entropy in all higher orders. This means that all but second-order correlations must be minimal, since higher-order correlations would introduce structure and therefore lower-than-maximum entropy. Algorithms for spike train generation and analysis based on maximum entropy principles (e.g., Bohte, Spekreijse, & Roelfsema, 2000; Nakahara & Amari, 2002) are powerful methods for characterizing ensembles of spike trains if higher-order correlations are either known to vanish, if the structure of the higher-order correlations is known (and can thus be taken into account), or if absolutely no information about them is available. However, none of these three cases is the situation considered here, namely, when we want to generate spike trains with correlational structure similar to that found in biological neural systems. Although we assume that only second-order correlations are explicitly known, as is usually the case (some exceptions in which higher-order correlations are observed and analyzed can be found in Gerstein & Clark, 1964; Abeles & Goldstein, 1977; Abeles & Gerstein, 1988; Abeles, 1991; Martignon, von Hasseln, Grün, Aertsen, & Palm, 1995; Riehle, Grün, Diesmann, & Aertsen, 1997), and that we have no detailed information about higher-order correlations, we do know that the spike trains are generated in a highly connected neural network.
For this reason, it is unlikely that no correlations exist between them at any order higher than second, which is the assumption made by maximum entropy methods. Applying maximum entropy methods with vanishing higher-order terms in this situation would then violate the basic rationale for using these methods. Paraphrasing Jaynes (1957), the goal is to generate a probability distribution that correctly represents our state of knowledge about the system, or to find a probability assignment that avoids bias while agreeing with whatever information is given. It is always possible to use a maximum entropy approach, enforcing vanishing higher-order correlations, in a case in which insufficient information about higher-order correlations is available. In this study, we adopt a different strategy, with the explicit goal of taking into account all available information, including the likely presence of higher-order correlations. Although we have no detailed knowledge about those correlations in the absence of sufficient experimental data, we at least know that such correlations are likely to exist in a hierarchically connected complex network. The solution adopted in this study is to generate spike trains whose higher-order correlations correspond to those found in neural systems
whose correlations are due to common input. Although this is by no means the only possible source of correlations (examples of others being direct and indirect interactions between neurons, as was observed more than a generation ago; Moore, Segundo, Perkel, & Gerstein, 1970), common input is widely recognized to be an important source of correlation in spike trains. The spike trains generated by the methods developed here are thus particularly suited as models of neural activity where neurons are influenced by spiking input from a common source. While we have focused on pairwise cross-correlations in this work, it is possible to make the mean rates of all processes time variant (by adjusting the parameters p_i) and thus also to vary their autocorrelations. This allows, for instance, modeling absolute and relative refractory periods by temporarily lowering the spike rate of a process after a spike has occurred in that process. Figure 3 shows an example of correlation functions for a process with an absolute refractory period of two bins. Relative refractory periods are implemented in the same way but with a finite lowered probability of firing after each spike. The formulas for the computation of rates and correlations, equations 4.6 and 4.7, remain unchanged, with N reduced by the number of neurons that are in the absolute refractory period, and with r_i in equation 4.6 replaced by a lowered rate (subject to the constraints in section 6) for neurons in the relative refractory period. Figure 3 shows, as expected, vanishing entries in the autocorrelation function (squares) during the absolute refractory period, which is reflected in smaller cross-correlation (crosses) because of the finite correlation between the spike trains.
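A minimal sketch of the absolute-refractory variant (our own illustration; function and parameter names are ours, not the paper's): the per-bin firing probability of a process is forced to zero for `t_ref` bins after each of its spikes. A relative refractory period would use a lowered, nonzero probability instead.

```python
import random

def refractory_train(p, n_bins, t_ref=2, seed=0):
    """Bernoulli spike train with an absolute refractory period of t_ref bins:
    the firing probability is set to zero for t_ref bins after each spike."""
    rng = random.Random(seed)
    train = []
    silent = 0                       # remaining refractory bins
    for _ in range(n_bins):
        if silent > 0:
            train.append(0)
            silent -= 1
        elif rng.random() < p:
            train.append(1)
            silent = t_ref
        else:
            train.append(0)
    return train
```

For p = 0.2 and t_ref = 2 bins, renewal arguments give a mean rate of p/(1 + p·t_ref) ≈ 0.14, consistent with the drop from 0.2 to about 0.14 reported in the text.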
The figure also shows that refractoriness decreases the mean firing rate (from 0.2 to about 0.14 in this case) and leads to an increase of the autocorrelation function immediately following the refractory period, with a corresponding peak in the power spectrum (Bair, Koch, Newsome, & Britten, 1994; Franklin & Bair, 1995). This peak, more noticeable for higher firing rates where higher-order peaks also can be observed (data not shown), is again reflected in attenuated form in the cross-correlation function. We note that the occurrence of these structures is a matter of principle (Bair et al., 1994; Franklin & Bair, 1995) and not due to the specifics of our implementation. The possibility of varying the firing rates independently (rather than for all spike trains in unison) and the resulting requirement that bins are independent alleviates limitations of our older algorithm (Mikula & Niebur, 2003a). Experimentally observed cross-correlograms frequently have broad maxima rather than a sharp peak only in the center bin (as in, for example, Mikula & Niebur, 2003a; Destexhe & Pare, 1999). Spike jitter can be introduced by allowing transitions into a given bin of a spike train not only from the corresponding bin in the reference spike train but also (with smaller probability) from the neighboring bins. The probabilities for the states are computed as in equation 4.1 and, for a stationary situation (i.e., pi and jitter distribution independent of time), the rates (see equation 4.2) are unchanged. The width of the jitter distribution will determine the width of the
Figure 3: Correlation functions with refractory period. Shown are the average autocorrelation functions (squares) of 20 spike trains and the average crosscorrelation functions (crosses) between them, with an absolute refractory period of 2 bins. All processes have mean rates of p = 0.2 and correlation between spike trains is C = 0.1. The refractory period introduces a depression not only in the autocorrelation but, because of the correlation between spike trains, also in the cross-correlation. Also note the increase in correlation immediately following the refractory period in both functions.
resulting correlation function. Some kind of integral over the peak width is usually employed as a measure of synchrony, and the one-bin definition of synchrony (see equation 4.3) is replaced by a more complex expression that involves this measure. As a general observation, we established in section 6 that there are combinations of rates and pairwise cross-correlations that cannot be achieved, namely, high cross-correlations and, simultaneously, a large spread in the rates of the individual spike trains. Within the limits described in sections 5 and 6, pairwise cross-correlations and mean firing rates can be chosen freely. A free parameter in the procedure is the bin width, which determines the relation between the probability of the 1 event in the stochastic process and the mean firing rate of the simulated spike train: mean firing rate = probability / bin width (see also note 1). The only formal requirement in the choice of the bin width is that there cannot be more than one event in any given bin, and the actual choice will depend on constraints provided by the system under study. For computational models, a suitable bin size could be the inverse of the maximal rate at which the neurons are known to fire;
for the generation of artificial spike trains to be used in microstimulation, it could be the inverse of the maximal stimulation frequency. The spike trains introduced here can be used in a variety of situations that we have explored previously with spike trains with identical rates, for example, to study the effect of spike-timing-dependent plasticity (Mikula & Niebur, 2003b), the interplay of excitation and inhibition (Mikula & Niebur, 2004), and in feedforward networks of coincidence detectors (Mikula & Niebur, 2005). The ease of generating large numbers of synthetic spike trains with a highly efficient algorithm should prove to be of use in a large number of applications. On request, the author will make available sets of spike trains with the described statistics.

Acknowledgments

I thank Emery Brown for discussions. The work was supported by NIH grants NS43188-01A1, 5R01EY016281-02, and R01-NS40596.
References

Abeles, M. (1982). Local cortical circuits. New York: Springer-Verlag.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., & Gerstein, G. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. J. Neurophysiol., 60, 909–924.
Abeles, M., & Goldstein, M. (1977). Multispike train analysis. Proc. IEEE, 65, 762–773.
Adrian, E. D., & Zotterman, Y. (1926). The impulses produced by sensory nerve endings. Part 2: The response of a single end organ. J. Physiol., 61, 151–171.
Bair, W., Koch, C., Newsome, W., & Britten, K. (1994). Power spectrum analysis of bursting cells in area MT in the behaving monkey. J. Neuroscience, 14(5), 2870–2892.
Bohte, S. M., Spekreijse, H., & Roelfsema, P. R. (2000). The effects of pairwise and higher-order correlations on the firing rate of a postsynaptic neuron. Neural Computation, 12, 153–179.
Boucsein, C., Nawrot, M., Rotter, S., Aertsen, A., & Heck, D. (2005). Controlling synaptic input patterns in vitro by dynamic photo stimulation. J. Neurophysiology, 94(4), 2948–2958.
Callaway, E. M., & Katz, L. C. (1993). Photostimulation using caged glutamate reveals functional circuitry in living brain slice. Proc. Nat. Acad. Sci., USA, 90, 7661–7665.
Cohen, M. R., & Newsome, W. T. (2004). What electrical microstimulation has revealed about the neural basis of cognition. Curr. Opin. Neurobiol., 14(2), 169–177.
Destexhe, A., & Pare, D. (1999). Impact of network activity on the integrative properties of neocortical pyramidal neurons in vivo. J. Neurophysiology, 81, 1531–1547.
Franklin, J., & Bair, W. (1995). The effect of a refractory period on the power spectrum of neuronal discharge. SIAM J. Appl. Math., 55(4), 1074–1093.
Gerstein, G., & Clark, W. (1964). Simultaneous studies of firing patterns in several neurons. Science, 143, 1325–1327.
Gray, C., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Nat. Acad. Sci., USA, 86, 1698–1702.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106(4), 620–630.
Martignon, L., von Hasseln, H., Grün, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events in a group of neurons. Biological Cybernetics, 73(1), 69–81.
Miesenböck, G., & Kevrekidis, I. G. (2005). Optical imaging and control of genetically designated neurons in functioning circuits. Annu. Rev. Neurosci., 28, 533–563.
Mikula, S., & Niebur, E. (2003a). The effects of input rate and synchrony on a coincidence detector: Analytical solution. Neural Computation, 15, 539–547.
Mikula, S., & Niebur, E. (2003b). Synaptic depression leads to nonmonotonic frequency dependence in the coincidence detector. Neural Computation, 15(10), 2339–2358.
Mikula, S., & Niebur, E. (2004). Correlated inhibitory and excitatory inputs to the coincidence detector: Analytical solution. IEEE Transactions on Neural Networks, 15(5), 957–962.
Mikula, S., & Niebur, E. (2005). Rate and synchrony in feedforward networks of coincidence detectors: Analytical solution. Neural Computation, 14(4), 881–902.
Moore, G., Segundo, J., Perkel, D., & Gerstein, G. (1970). Statistical signs of synaptic interactions in neurons. Biophysical Journal, 10, 876–900.
Nakahara, H., & Amari, S. (2002). Information-geometric measure for neural spikes. Neural Comput., 14(10), 2269–2316.
Niebur, E., & Koch, C. (1994). A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons. Journal of Computational Neuroscience, 1(1), 141–158.
Niebur, E., Koch, C., & Rosin, C. (1993). An oscillation-based model for the neural basis of attention. Vision Research, 33, 2789–2802.
Normann, R. A., Maynard, E. M., Rousche, P. J., & Warren, D. J. (1999). A neural interface for a cortical vision prosthesis. Vision Research, 39(15), 2577–2587.
Ogata, Y. (1981). On Lewis simulation method for point processes. IEEE Transactions on Information Theory, 27, 23–31.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Sekirnjak, C., Hottowy, P., Sher, A., Dabrowski, W., Litke, A. M., & Chichilnisky, E. J. (2006). Electrical stimulation of mammalian retinal cells with multielectrode arrays. J. Neurophysiology, 95(6), 3311–3327.
Shepherd, G. M., Pologruto, T. A., & Svoboda, K. (2003). Circuit analysis of experience-dependent plasticity in the developing rat barrel cortex. Neuron, 38(2), 277–290.
Shoham, A., O'Connor, D. H., Sarkisov, D. V., & Wang, S. S. (2005). Rapid neurotransmitter uncaging in spatially defined patterns. Nature Methods, 2(11), 837–843.
Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24, 49–65.
Stroeve, S., & Gielen, S. (2001). Correlation between uncoupled conductance-based integrate-and-fire neurons due to common and synchronous presynaptic firing. Neural Computation, 13, 2005–2029.
Thompson, S. M., Kao, J. P., Kramer, R. H., Poskanzer, K. E., Silver, R. A., Digregorio, D., & Wang, S. S. (2005). Flashy science: Controlling neural function with light. J. Neurosci., 25(45), 10358–10365.
Tiesinga, P. H. E., & Sejnowski, T. J. (2004). Rapid temporal modulation of synchrony by competition in cortical interneuron networks. Neural Computation, 16, 251–275.
Received February 2, 2006; accepted November 29, 2006.
LETTER
Communicated by Michelle Rudolph
Input-Driven Oscillations in Networks with Excitatory and Inhibitory Neurons with Dynamic Synapses

Daniele Marinazzo
[email protected]
Department of Biophysics, Radboud University of Nijmegen, 6525 EZ Nijmegen, The Netherlands; TIRES—Center of Innovative Technologies for Signal Detection and Processing and Dipartimento Interateneo di Fisica, Università di Bari, 70125 Bari, Italy; and Istituto Nazionale di Fisica Nucleare, Sezione di Bari, 70125 Bari, Italy
Hilbert J. Kappen
[email protected] Stan C. A. M. Gielen
[email protected] Department of Physics, Radboud University of Nijmegen, 6525 EZ Nijmegen, The Netherlands
Previous work has shown that networks of neurons with two coupled layers of excitatory and inhibitory neurons can reveal oscillatory activity. For example, Börgers and Kopell (2003) have shown that oscillations occur when the excitatory neurons receive a sufficiently large input. A constant drive to the excitatory neurons is sufficient for oscillatory activity. Other studies (Doiron, Chacron, Maler, Longtin, & Bastian, 2003; Doiron, Lindner, Longtin, Maler, & Bastian, 2004) have shown that networks of neurons with two coupled layers of excitatory and inhibitory neurons reveal oscillatory activity only if the excitatory neurons receive correlated input, regardless of the amount of excitatory input. In this study, we show that these apparently contradictory results can be explained by the behavior of a single model operating in different regimes of parameter space. Moreover, we show that adding dynamic synapses in the inhibitory feedback loop provides robust network behavior over a broad range of stimulus intensities, in contrast to previous models. A remarkable property of the introduction of dynamic synapses is that the activity of the network reveals synchronized oscillatory components in the case of correlated input but also reflects the temporal behavior of the input signal to the excitatory neurons. This allows the network to encode simultaneously both the temporal characteristics of the input and the presence of spatial correlations in the input.
Neural Computation 19, 1739–1765 (2007)
C 2007 Massachusetts Institute of Technology
1740
D. Marinazzo, H. Kappen, and S. Gielen
1 Introduction The central nervous system can display a wide spectrum of spatially synchronized, rhythmic oscillatory patterns of activity with frequencies in the range from 0.5 Hz (δ rhythm), 20 Hz (β rhythm), to 40 to 80 Hz (γ rhythm) and even higher up to 200 Hz (for a review, see Gray, 1994). In the past two decades, evidence has been presented that synchronized activity and temporal correlation are fundamental tools for encoding and exchanging information for neuronal information processing in the brain (Singer & Gray, 1995; Singer, 1999; Reynolds & Desimone, 1999). In particular, it has been suggested that clusters of cells organize spontaneously into flexible groups of neurons with similar firing rates, but with a different temporal correlation structure. However, despite the fact that synchronization of groups of neurons has been the subject of intense research efforts in many studies, the functional role of synchronized activity is still a topic of debate (Gray, 1994; Fries, 2005). A longstanding, fundamental question is when synchrony can emerge and how cell properties, synaptic interactions, and network architecture interact to determine the nature of synchronous states in a large neural network. Most theoretical studies have investigated synchronization of neuronal activity in fully connected or sparsely connected networks under assumptions of weak or strong coupling (see, e.g., Ariaratnam & Strogatz, 2001; Mirollo & Strogatz, 1990; Ernst, Pawelzik, & Geisel, 1995; Hansel & Mato, 1993; Hansel, Mato, & Meunier, 1995; van Vreeswijk, 2000). Several experimental studies have shown that the grouping of neurons with synchronized firing depends on stimulus properties and on instruction to the subject or on attention (Reynolds, Chelazzi, & Desimone, 1999; Schoffelen, Oostenveld, & Fries, 2005). 
Therefore, both network properties and stimulus-driven feedforward (bottom-up) and feedback (top-down) mechanisms are involved and should be incorporated in an explanatory model. These concepts are compatible with neurophysiological studies, which have suggested that synchronous oscillatory activity may be entrained by synchronous rhythmic inhibition originating from fast-spiking inhibitory interneurons (Buzsáki, Leung, & Vanderwolf, 1983; Lytton & Sejnowski, 1991), which has been an important reason to study networks with excitatory and inhibitory neurons. Based on this insight, two different models have been proposed in the literature to explain the emergence of synchronized neuronal activity. Doiron, Lindner, Longtin, Maler, and Bastian (2004) presented a network with a layer of excitatory neurons with stochastic input and delayed inhibitory feedback of the summed activity of the excitatory neurons to the excitatory neurons. This model revealed the remarkable property that oscillations were observed in the excitatory neurons in response to spatially correlated stimuli but not to uncorrelated input. It was the spatial correlation in the stimuli, not the total power of the stimulus, that was essential for the network to display oscillatory activity. The frequency of oscillation was determined by
Oscillations in Networks of Spiking Neurons
1741
the effective delay in the feedback loop and was not related to the temporal properties of the spatially correlated input. Although the architecture with interacting populations of excitatory and inhibitory neurons in the models by Doiron, Chacron, Maler, Longtin, and Bastian (2003) and Doiron et al. (2004) was similar to the architecture proposed by Börgers and Kopell (2003), the behavior of the latter model, which was proposed to explain oscillations in the gamma range (25–80 Hz), was different from that proposed by Doiron et al. The Börgers and Kopell model consists of two interacting layers of excitatory and inhibitory neurons without a pure time delay in the feedback loop. This model assumes a considerably longer time constant for the decay of inhibition than for the decay of excitation. It reveals synchronous rhythmic spiking due to a constant excitatory drive to the excitatory neurons. The excitatory neurons provide diverging input to the inhibitory cells, which inhibit (by divergent feedback to the excitatory neurons, where each inhibitory cell projects to multiple excitatory cells) and thereby synchronize the excitatory cells. With the proper parameters, this network starts to oscillate as soon as the power of the external input to the excitatory neurons is sufficiently large, regardless of spatial correlations in the input. This result may seem at odds with the results of Doiron et al. (2003), since Doiron et al. describe that oscillations occur only for spatially correlated input, regardless of the intensity of the external input. In our study, we present a more general model and demonstrate that the models by Doiron et al. (2003, 2004) and by Börgers and Kopell (2003) are special cases of a more general model, which reveals qualitatively different behavior in different regimes of parameter space. In order to understand the reasons for the apparent discrepancy, we used the same architecture as Doiron et al. and Börgers and Kopell, with a layer of excitatory neurons and a layer of inhibitory neurons. External input is provided to the excitatory neurons, which feed their output to the inhibitory neurons. In addition to the external input, the excitatory neurons receive feedback from the inhibitory neurons. Regarding the type of neuron, Börgers and Kopell used the theta model (Ermentrout & Kopell, 1986), whereas Doiron et al. (2003, 2004) used the leaky integrate-and-fire (LIF) model. The LIF model is an approximation of the more physiological Hodgkin-Huxley model, but the standard Hodgkin-Huxley neuron can also be well approximated by the theta model (see, e.g., Ermentrout & Kopell, 1986; Hoppensteadt & Izhikevich, 1997; Gutkin & Ermentrout, 1998). In this study we chose the LIF neuron because Lindner et al. (Lindner & Schimansky-Geier, 2001; Lindner, Schimansky-Geier, & Longtin, 2002; Lindner, Doiron, & Longtin, 2005) provided a method for analytically approximating the dynamics of the LIF neuron. This allowed us to derive analytical expressions that could be compared with the results of numerical analyses to explain the oscillatory behavior of the network. We also ran some simulations using the standard Hodgkin-Huxley neuron, which gave the same results.
In this study we add dynamic synapses (Tsodyks & Markram, 1997) to the model. Experimental studies have shown that the efficacy with which a synapse transmits an action potential depends on the recent history of the synapse itself. The model by Tsodyks, Pawelzik, and Markram (1998) for dynamic synapses incorporates the property that presynaptic activity gives rise to depletion of neurotransmitter. This makes the synapse less effective when the rate of presynaptic activity arriving at the synapse increases. This dynamical behavior, called short-term depression, has been explained and modeled in detail (Tsodyks et al., 1998). Tsodyks, Uziel, and Markram (2000) and Pantic, Torres, and Kappen (2003) have explained how temporal coding, synchronization, and coincidence detection are greatly affected by a time-varying synaptic strength. Based on these results in the literature, we hypothesized that incorporating dynamic synapses should make the network behavior robust over a relatively large range of input characteristics. Incorporating dynamic synapses also gave another surprising result. The model by Doiron et al. (2004) produced a resonance for spatially correlated input only when the feedback gain was relatively small. With dynamic synapses, this behavior became robust over a large range of feedback gains. Moreover, the model with dynamic synapses both shows a resonance peak indicating spatial correlation in the input and preserves the temporal characteristics of the spatially correlated input. This is a novel finding, which is important for modeling sensory (e.g., visual) information processing in combination with rhythmic activity (such as θ-, β-, and γ-rhythms) in cortex.
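The short-term depression mechanism referenced above can be sketched as follows. This is a simplified, depression-only reading of the Tsodyks et al. (1998) model (no facilitation; function name and parameter values are illustrative): a resource variable x recovers exponentially toward 1 with time constant tau_rec, each presynaptic spike consumes a fraction U of the available resource, and the effective amplitude of a spike is U·x.

```python
import math

def depressing_synapse(spike_times, tau_rec=0.8, U=0.5):
    """Return the effective amplitude U*x at each presynaptic spike time.

    Between spikes, the resource x recovers exponentially toward 1 with
    time constant tau_rec; at each spike, the used fraction U*x is
    subtracted (depression-only sketch of Tsodyks et al., 1998).
    """
    x = 1.0
    last_t = None
    amplitudes = []
    for t in spike_times:
        if last_t is not None:
            dt = t - last_t
            x = 1.0 - (1.0 - x) * math.exp(-dt / tau_rec)  # recovery
        amplitudes.append(U * x)
        x -= U * x                                          # depletion
        last_t = t
    return amplitudes
```

Driving the synapse at a high presynaptic rate makes the per-spike amplitudes decay toward a steady-state value, which is the rate-dependent attenuation exploited in the network model below.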
2 Model Description

The basic architecture of our model is very similar to that of Doiron et al. (2004) and is described in Figure 1: a population of leaky integrate-and-fire (LIF) neurons labeled 1, 2, . . . , N_E, with external inputs s_1, . . . , s_{N_E}. In this study, N_E was equal to 100. The output of these neurons provides excitatory input to another LIF neuron, which provides inhibitory feedback with gain g to the excitatory neurons. In some of our simulations, the single inhibitory LIF neuron was replaced by a set of 20 LIF neurons, which project by inhibitory synapses with strength g/20 to each of the excitatory neurons. One of the differences between our model and that of Doiron et al. (2003, 2004) is that the inhibitory feedback in the model by Doiron et al. (2004) was obtained by convolution of a linear summation of action potentials of the excitatory neurons, whereas we used the spike output of an LIF neuron (or of 20 LIF neurons) as inhibitory feedback to the excitatory neurons. As we will show later (see Figures 2 and 3), this affects the responses of the model only to a minor extent, as long as the membrane time constant of the leaky integrate-and-fire neuron is relatively long.
Oscillations in Networks of Spiking Neurons
1743
Figure 1: Model architecture describing the external input s_i to the excitatory neurons. The output of the excitatory neurons projects to the inhibitory neuron with an effective input current I_EI. The output of the inhibitory neuron is fed back to the excitatory neurons by the inhibitory current I_IE.
The dynamics of the membrane potential V of the LIF neurons satisfies

dV(t)/dt = −(V(t) − V_r) + I(t),   (2.1)
where time is measured in units of the membrane time constant τ_m and the membrane resistance is normalized to one. Every time the potential of the jth neuron reaches the threshold value V_th, a spike is fired. The potential is then reset to the rest potential V_r and remains clamped at this value for an absolute refractory period τ_ref. As in Doiron et al. (2003, 2004), we have set V_r = 0, V_th = 1, and τ_ref = 3 ms. Each excitatory neuron j (j ∈ {1, . . . , N_E}) receives an input I_j(t),

I_j(t) = µ + η_j(t) + σ(√(1 − c) ξ_j(t) + √c ξ_G(t)),   (2.2)
which consists of a constant base current µ, internal gaussian white noise η_j(t) with intensity D, and a stimulus s_j(t) = σ(√(1 − c) ξ_j(t) + √c ξ_G(t)), where ξ_j(t) and ξ_G(t) are both gaussian white noise with zero mean and unit power. Varying c increases or decreases the degree of spatial correlation of the external stimuli, while the total input power to each neuron remains constant. In addition to this input, the excitatory neurons receive
inhibitory feedback (see Figure 1), which will be defined in equation 2.4 and the text following that equation. The series of spikes s_j^E(t) of neuron j in this layer of input neurons provides excitatory input to the inhibitory LIF neuron, which has a constant base current µ. The input current I_EI to the inhibitory neurons due to input from the excitatory neurons is given by the convolution of the sum of the spike trains of the excitatory neurons and a standard α-function:

I_EI(t) = (K_{τ_E} ∗ ST_E)(t) = ∫_0^∞ ST_E(t − τ) α²τ e^{−ατ} dτ,   (2.3)
where ST_E(t) = Σ_{j=1}^{N_E} s_j^E(t) is the sum of the spike trains of all excitatory neurons. The base current to the neurons is too small to reach threshold. Therefore, the inhibitory neuron fires only in response to excitatory spike input. Just like the excitatory neurons, the inhibitory LIF neuron has a constant base current µ and internal noise η(t) with the same variance as that for the excitatory neurons. In addition, it receives the input I_EI(t), as defined by equation 2.3. The action potentials s_I(t) of the inhibitory neuron provide an inhibitory synaptic current to the excitatory neurons after a time delay τ_D. The shape of the postsynaptic potential in the excitatory neurons due to input from the inhibitory neuron will be represented in further equations by K_{τ_I}, which will be defined after the introduction of equation 2.4. For the case of dynamic synapses, the effective synaptic strength is governed by three variables obeying the following equations (Tsodyks & Markram, 1997):

dx/dt = z/τ_rec − U x s_I(t − τ_D)
dy/dt = −y/τ_in + U x s_I(t − τ_D)
dz/dt = y/τ_in − z/τ_rec   (2.4)
where x, y, and z are the fractions of synaptic resources in the recovered, active, and inactive states, respectively. Without any spike input, all neurotransmitter is recovered, and the fraction of available neurotransmitter is one: x(t) = 1. After each spike arriving at the synapse, a fraction U of the available (recovered) neurotransmitter is released. Note that the spike input from the inhibitory neuron s_I(t) is delayed by τ_D. The fraction y of active neurotransmitter is then inactivated into the inactive state z. τ_in is the time constant of the inactivation process, and τ_rec is the recovery time constant for conversion of the inactive state back to the recovered state. With dynamic synapses,
the total postsynaptic current I_IE(t) is proportional to the fraction of neurotransmitter in the active state, that is, I_IE(t) = g y(t) (g < 0 because of the inhibitory feedback). The total input to the excitatory neuron j is defined by I_j(t) + I_IE(t), where I_j(t) is defined in equation 2.2. With this model, we can easily change to a situation with static synapses by setting x(t) = 1 for all t (or, equivalently, τ_rec → 0). Note that equation 2.4 includes an exponential decay for the postsynaptic potential with a time constant τ_in in response to each presynaptic action potential. This is true for the case of both dynamic and static synapses. We used the following parameter values in the model simulations: N_E = 100; τ_m = 6 ms for all neurons except when explicitly mentioned otherwise; base current µ = 0.5; intensity of the internal gaussian white noise D = 0.08; σ = 0.4; α = 18 ms⁻¹; τ_in = 3 ms; τ_rec = 800 ms; and time delay in the feedback loop τ_D = 6 ms. All simulations were integrated using an Euler scheme with a time step of 0.05 ms. Along similar lines as outlined in Doiron et al. (2004; see the appendix), we obtain the following expression for the spike train power spectral density of the excitatory neurons when feedback is provided by an LIF neuron:

⟨s_j^E(ω) s_j^{E∗}(ω)⟩ = ⟨s_{0,j}^E(ω) s_{0,j}^{E∗}(ω)⟩ + σ²|A^E(ω)|² + cσ²|A^E(ω)|²
  × [2ℜ(A^E(ω) g e^{−iωτ_D} K_{τ_I}(ω) A^I(ω) K_{τ_E}(ω)) − |A^E(ω) g K_{τ_I}(ω) A^I(ω) K_{τ_E}(ω)|²] / |1 − A^E(ω) g e^{−iωτ_D} K_{τ_I}(ω) A^I(ω) K_{τ_E}(ω)|².   (2.5)
Here, s_j^E(ω) and s_{0,j}^E(ω) represent the spike train in the presence and absence, respectively, of the external stimulus and feedback. K_{τ_E}(ω) and K_{τ_I}(ω) represent the postsynaptic dynamics defined in equations 2.3 and 2.4, respectively, in the frequency domain, and A^E(ω) and A^I(ω) represent the susceptibility in the frequency domain with respect to the input of the excitatory neurons and the inhibitory neuron, respectively (for details, see Doiron et al., 2004; Lindner & Schimansky-Geier, 2001; Lindner et al., 2002). ℜ(C) represents the real part of the complex number C. This result was obtained under the assumptions that ⟨s_{0,j}^E(ω) s_k^{E∗}(ω)⟩ = ⟨s_{0,j}^E(ω) I_j^∗(ω)⟩ = ⟨s_j^E(ω) ξ_k^∗(ω)⟩ = 0 for j ≠ k, and in the limit N → ∞, so that terms of order 1/N and higher can be neglected. A derivation of equation 2.5 is given in the appendix. Resonance is obtained when the denominator of the last term in equation 2.5 is minimal, which depends mainly on the effective time delay (i.e., the sum of the pure time delay τ_D and the frequency-dependent phase shifts due to the dynamics A^E(ω) and A^I(ω) of the LIF neurons and the synaptic functions K_{τ_E}(ω) and K_{τ_I}(ω)) in the feedback loop.
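As an illustration, the model of equations 2.1 to 2.4 can be integrated numerically with the parameter values listed above. The sketch below is only a minimal reconstruction of our own: the release fraction U, the random seed, and all variable names are assumptions not given in this excerpt, and the calibration of the noise intensity D to the paper's nondimensional time units is approximate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameters from the text (times in ms). U is assumed, not given in this excerpt.
NE, tau_m = 100, 6.0
mu, D, sigma, c = 0.5, 0.08, 0.4, 1.0
Vr, Vth, tau_ref = 0.0, 1.0, 3.0
alpha, tau_in, tau_rec, tau_D, g, U = 18.0, 3.0, 800.0, 6.0, -1.2, 0.5
dt, T = 0.05, 500.0
steps, delay = int(T / dt), int(tau_D / dt)

V = rng.uniform(Vr, Vth, NE)      # excitatory membrane potentials
refr = np.zeros(NE)               # remaining refractory time per neuron
VI, refrI = 0.0, 0.0              # inhibitory LIF neuron
u, k = 0.0, 0.0                   # two-stage filter realizing the kernel a^2*t*exp(-a*t)
x, y = 1.0, 0.0                   # synaptic resources (z = 1 - x - y), eq. 2.4
sI_buf = np.zeros(delay + 1)      # delay line for the inhibitory spike train
spike_counts = np.zeros(steps)
sq = np.sqrt(dt)

for t in range(steps):
    # External input, eq. 2.2: private noise plus a partly shared stimulus.
    # Euler-Maruyama 1/sqrt(dt) scaling; exact calibration of D is not attempted.
    xiG = rng.standard_normal()
    noise = (np.sqrt(2 * D) * rng.standard_normal(NE)
             + sigma * (np.sqrt(1 - c) * rng.standard_normal(NE)
                        + np.sqrt(c) * xiG)) / sq
    # Excitatory LIF update, eq. 2.1 (time rescaled from units of tau_m to ms)
    dV = (-(V - Vr) + mu + noise + g * y) * (dt / tau_m)
    V = np.where(refr > 0, Vr, V + dV)
    refr = np.maximum(refr - dt, 0.0)
    fired = V >= Vth
    V[fired] = Vr
    refr[fired] = tau_ref
    nE = fired.sum()
    spike_counts[t] = nE
    # Alpha-function synapse onto the inhibitory neuron, eq. 2.3
    u += -dt * alpha * u + alpha ** 2 * nE
    k += dt * (-alpha * k + u)
    # Inhibitory LIF neuron: same base current and noise variance
    sI = 0.0
    if refrI > 0:
        VI, refrI = Vr, refrI - dt
    else:
        VI += (-(VI - Vr) + mu + np.sqrt(2 * D) * rng.standard_normal() / sq
               + k) * (dt / tau_m)
        if VI >= Vth:
            VI, refrI, sI = Vr, tau_ref, 1.0
    sI_buf = np.roll(sI_buf, 1)
    sI_buf[0] = sI
    sId = sI_buf[-1]              # inhibitory spike delayed by tau_D
    # Dynamic synapse, eq. 2.4, driven by the delayed inhibitory spike train
    rel = U * x * sId
    z = 1.0 - x - y
    x += dt * z / tau_rec - rel
    y += dt * (-y / tau_in) + rel

rate = spike_counts.sum() / NE / (T / 1000.0)   # mean excitatory rate (Hz)
```

Setting c between 0 and 1 in this sketch interpolates between private and fully shared stimuli, and static synapses correspond to clamping x at 1.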
Figure 2: Spike train power spectra (defined by equation 2.5) of an excitatory neuron for spatially correlated (c = 1) and uncorrelated (c = 0) input when inhibition is provided by an LIF neuron (filled symbols) and for correlated input when inhibition is provided by a linear unit (LIN, open symbols), as in Doiron et al. (2003) for g = −1.2. All other parameters as described in the text.
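Spectra such as those in Figure 2 can be estimated from simulated spike trains with a standard averaged-periodogram procedure. The sketch below is our own illustration (function name and normalization conventions are assumptions, not the authors' code); as a sanity check, a Poisson spike train yields a flat spectrum at its mean rate.

```python
import numpy as np

def spike_train_psd(spike_counts, dt_ms, seg_len=4096):
    """Estimate the power spectral density (in spikes^2/s) of a binned
    spike train by averaging periodograms over non-overlapping segments
    (a basic Welch-style estimate)."""
    dt = dt_ms / 1000.0                             # bin width in seconds
    rate = np.asarray(spike_counts, float) / dt     # instantaneous rate signal
    nseg = len(rate) // seg_len
    segs = rate[: nseg * seg_len].reshape(nseg, seg_len)
    segs = segs - segs.mean(axis=1, keepdims=True)  # remove the DC (mean-rate) peak
    spec = np.fft.rfft(segs, axis=1)
    psd = (np.abs(spec) ** 2).mean(axis=0) * dt / seg_len
    freqs = np.fft.rfftfreq(seg_len, d=dt)
    return freqs, psd

# Sanity check: a Poisson spike train has a flat spectrum at its mean rate.
rng = np.random.default_rng(1)
dt_ms, r0 = 0.5, 20.0                               # 0.5 ms bins, 20 Hz
counts = rng.random(200000) < r0 * dt_ms / 1000.0
f, S = spike_train_psd(counts, dt_ms)               # S away from DC hovers near 20
```

A resonance of the kind shown in Figure 2 would appear as a peak of S near 40 Hz rising above this flat baseline.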
3 Results

We start by analyzing the model by Doiron et al. (2003, 2004) and demonstrate that replacing the linear summation in the model by an LIF neuron affects the behavior of this model only to a minor extent. We then explore the behavior of the model for various gain values of the inhibitory feedback loop and show that we obtain the behavior of the model by Börgers and Kopell (2003) for large feedback gains. We discuss the effect of feedback gain for both static and dynamic synapses. Figure 2 shows the results of computer simulations for the spike train power spectral density of an excitatory neuron from the network with delayed feedback described in Figure 1 in the case of fully spatially correlated (c = 1) and fully uncorrelated (c = 0) input for a feedback gain g = −1.2. Since the statistics of action potential firing are the same for all excitatory neurons, it suffices to show the spectra for one neuron. For uncorrelated input (c = 0), the last term in the expression of equation 2.5 is zero, and we obtain a spectrum that increases slightly in the range from zero to 120 Hz. The input signal is gaussian noise with a flat spectrum, and the slight increase in spike train power spectral density reflects the dynamics of the
Figure 3: Spike train power spectra of an excitatory neuron for spatially correlated input when inhibition is provided by a single LIF neuron (dashed line) or a population of 20 inhibitory LIF neurons (solid line). All other parameters as in Figure 2.
excitatory LIF neurons and the effect of feedback. When the gaussian noise to the input units is fully correlated (c = 1), the results show a clear resonance near 40 Hz. The resonance is absent without spatial correlation. For reference, the power spectrum of an excitatory neuron from the model with linear summation, as in Doiron et al. (2004), is shown for the case of fully correlated input. The peak in the power spectrum falls at a slightly lower frequency for the model with the inhibitory LIF neuron (filled circles, our model), which is due to the additional delay stemming from the dynamics of the inhibitory LIF neuron. Otherwise, extensive simulations show that replacing the linear summation in the model by Doiron et al. (2004) by an LIF neuron gives very similar results, as long as the time constant of the inhibitory LIF neuron is sufficiently large (further explained in Figure 8). The results in Figure 2 were obtained with a single inhibitory neuron, as in the study by Doiron et al. (2004). Adding more inhibitory neurons did not affect the results, as shown in Figure 3. The reason is that when the model starts to oscillate (as pointed out by Börgers and Kopell, 2003, who did their simulations with a population of inhibitory neurons), the activity of the
inhibitory neurons synchronizes. In that case, a mean-field approximation is allowed, which explains why the population of inhibitory neurons can be replaced by a single inhibitory neuron. In further simulations, we use one inhibitory neuron. The behavior of the network depends on the membrane time constant of the inhibitory neuron and the feedback gain. Varying these two quantities has similar effects: a smaller membrane time constant gives rise to a higher firing rate of the inhibitory neuron, leading to a higher effective gain of the feedback. There is another effect related to the value of the membrane time constant τ_{m,I} of the inhibitory LIF neuron: when it is small, the inhibitory neuron behaves as a coincidence detector, whereas for large values of τ_{m,I}, the neuron becomes an integrator (in agreement with Rudolph & Destexhe, 2003). In the former case, the neuron spikes only when a sufficiently large number of input spikes arrives within a time window small compared to the membrane time constant. If the interval between input spikes is large relative to the membrane time constant, the effect of each single spike decays, and the neuron does not spike. The effects of the feedback gain and the membrane time constant will be discussed separately in the following. The fact that the network oscillations in Figure 2 do not depend on the power of the stimulus differs from the results of Börgers and Kopell (2003), who report that a sufficiently large input power is required to cause oscillations. This apparent discrepancy can be understood if the feedback gain g is increased (more negative values). Large values of |g| lead to network oscillations in the case of uncorrelated input (see Figure 4, left panels). Each spike of the inhibitory neuron (lower left panel) inhibits the excitatory neurons after a delay of about 15 ms in the feedback loop. The explanation for the apparent discrepancies with the results by Doiron et al.
(2003, 2004) in the case of uncorrelated input is that equation 2.5 was obtained using linear response theory for the LIF neuron, as outlined by Doiron et al. (2003) and Lindner, Doiron, et al. (2005). For large, synchronized inputs, and when the inhibitory neuron acts like a coincidence detector, the linear response approximation for the LIF neuron is no longer valid, and the behavior changes from that illustrated in Figure 2 into the behavior reported by Börgers and Kopell (2003). Therefore, the model in Figure 1 can reproduce the results of both Doiron et al. (2004) and Börgers and Kopell (2003) for different values of the feedback gain g. It is worth stressing that the feedback gain values are not excessive or unrealistic, since the value of |g| always remains within the limits set by the self-consistent equation for the effective base current, µ_eff = µ − g r_0(µ_eff), with

r_0(µ_eff) = [τ_r + √π ∫_{(µ_eff − v_T)/√D}^{(µ_eff − v_R)/√D} e^{x²} erfc(x) dx]⁻¹,   (3.1)
Figure 4: Superposition of spikes of all excitatory neurons (top panels) and of the inhibitory neuron (bottom panels) in case of uncorrelated gaussian white noise input with static (left) and dynamic synapses (right) for g = −3.6. Note that with static synapses, each action potential of the inhibitory neuron is followed by complete inhibition of the excitatory neurons after a time delay of about 15 ms.
where erfc is the complementary error function. (For more detailed information about the relevance of the effective base current and the derivation of r_0(µ_eff), see Lindner, Doiron, et al., 2005.) At this point it is useful to consider the effect of dynamic synapses. As is evident from Figure 4 (right panels), the presence of dynamic inhibitory synapses prevents the complete silencing of the excitatory neurons observed for static synapses in the case of a high feedback gain. We thus want to evaluate the effect of the feedback gain on the one hand and of the presence of dynamic synapses on the other, and explore how these two parameters change the capacity of the system to detect spatially correlated input. To do so, we measure the power of the spike train around the resonance frequency. The behavior of the model is explained in more detail in Figure 5, which shows the power spectra of spiking of the excitatory neurons for the model with static synapses (left panels) and for dynamic synapses (right panels) for the case of uncorrelated gaussian white noise input (c = 0, upper panels) and for correlated input (c = 1, lower panels) for two values of feedback
Figure 5: Spectra of the excitatory neurons for uncorrelated input (c = 0; upper panels) and for correlated input (c = 1; lower panels) for the model with static synapses (left panels) and dynamic synapses (right panels) for two different gains in the inhibitory feedback loop: g = −2 (solid line), g = −4 (dashed line). The input signal was gaussian white noise with a flat frequency spectrum up to 120 Hz. The units along the vertical axis are spikes²/s. However, in order to compare the shape of the spectra for static and dynamic synapses and for correlated and uncorrelated input, the spectra were normalized.
gain. Variations in feedback gain cause changes in the firing rate of the excitatory and inhibitory neurons. Therefore, the power spectral density of the responses of the excitatory neurons (in spikes² per s, as in Figure 2) differs for different feedback gains. In Figures 5 and 8, we normalized the power spectral density to allow a better comparison of the shape of the spectra. The variations in firing rate are discussed later (see Figure 11). When the input is correlated (lower panels), a resonance peak appears near 40 Hz for both the static and dynamic synapses for g = −2 and g = −4. For uncorrelated noise with g = −2, the model does not show a significant resonance peak for either static or dynamic synapses. Therefore, the model is able to detect spatially correlated input for g = −2 with both static and dynamic synapses. However, when the feedback gain is increased to g = −4, the network with static synapses resonates also in the absence of
Figure 6: Power spectra of the excitatory neurons for the model with dynamic synapses for different values of the input correlation c. Gain in the inhibitory feedback loop: g = −2. The input signal was gaussian white noise with a flat frequency spectrum up to 120 Hz. In order to compare the shape of the spectra, the spectra were normalized.
spatially correlated input (upper left panel). These results demonstrate that we can reproduce the results of Doiron et al. (2003, 2004) for static synapses as long as the feedback gain is relatively small, and obtain the results of Börgers and Kopell (2003) for relatively high feedback gains. When dynamic synapses are included in the model, the model behavior does not depend on feedback gain, since there is no resonance for physiological feedback gains when the input is uncorrelated (upper-right panel). If the input is correlated, there is a peak near 40 Hz as soon as there is feedback in the model. For simplicity, we have considered only the extreme cases of fully correlated (c = 1) or fully uncorrelated (c = 0) input in Figure 5. Values of c between 0 and 1 yield results that are intermediate between the two extremes described above. This is illustrated in Figure 6, which clearly shows that the resonance peak decreases with decreasing values of c. Notice that for c = 1 we find a peak near 40 Hz, but a decrease in the power
spectrum in the range between 50 and 80 Hz. This can be understood from the fact that the term g e^{−iωτ_D} K_{τ_E}(ω) K_{τ_I}(ω) A^E(ω) A^I(ω) in the denominator of the last term in equation 2.5 contains a factor e^{−iωτ_D} related to the time delay τ_D in the feedback loop. When this term is maximal, it causes a peak in the spectrum near 40 Hz. For higher frequencies, the term e^{−iωτ_D} changes sign, which, together with the frequency dependence of the susceptibility A and the synaptic transfer function K, changes the sign of g e^{−iωτ_D} K_{τ_E}(ω) K_{τ_I}(ω) A^E(ω) A^I(ω), causing a decrease in the spectrum. For our choice of time constants (membrane time constant 6 ms for excitatory and inhibitory neurons), a resonance such as the peak in Figure 2 may or may not occur, depending on the strength of the feedback coupling, the characteristics of the input signal, and the dynamics of the synapse. The amplitude of the resonance can be quantified by introducing the statistic

Δ_{f_1,f_2} = ∫_{f_1}^{f_2} S(f) df,

where S(f) is the power spectrum of the spike train of the excitatory neurons. As in Doiron et al. (2003, 2004), we quantify the resonance by Δ_FB = Δ_{30,50} − Δ_{2,22}, comparing the value of Δ near the resonance frequency to the value of Δ in a low-frequency region. Figure 7 (upper panel) shows that for small values of the feedback gain (−2 < g < 0), the value of Δ_FB increases with |g| for fully correlated input to the excitatory neurons for both static and dynamic synapses, indicating increasing peak values of oscillations in the presence of fully correlated input. However, Δ_FB remains small for uncorrelated input as long as −2 < g < 0. For g < −2, the network also starts to oscillate for uncorrelated input (c = 0) in the case of static synapses; the spike train power near the resonance frequency then rapidly reaches the same value as that for the correlated input case. This is not the case for the model with dynamic synapses, which shows a very robust behavior for both uncorrelated and correlated input over the whole range of feedback gains studied. To quantify this even better, we introduce a discrimination index,

ρ_{0−1} = [Δ_FB(c = 1) − Δ_FB(c = 0)] / [Δ_FB(c = 1) + Δ_FB(c = 0)].   (3.2)
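These band-power statistics are straightforward to compute from an estimated spectrum. The sketch below is our own illustration of the definitions of Δ_{f1,f2}, Δ_FB, and ρ_{0−1} (function names and the toy spectra are assumptions, not the authors' code):

```python
import numpy as np

def band_power(freqs, S, f1, f2):
    """Delta_{f1,f2}: the spectrum S(f) integrated over the band [f1, f2].
    Assumes a uniformly spaced frequency grid."""
    m = (freqs >= f1) & (freqs <= f2)
    return np.sum(S[m]) * (freqs[1] - freqs[0])

def delta_FB(freqs, S):
    """Delta_FB = Delta_{30,50} - Delta_{2,22}: resonance band minus low band."""
    return band_power(freqs, S, 30.0, 50.0) - band_power(freqs, S, 2.0, 22.0)

def rho_01(freqs, S_c1, S_c0):
    """Discrimination index rho_{0-1} of equation 3.2."""
    d1, d0 = delta_FB(freqs, S_c1), delta_FB(freqs, S_c0)
    return (d1 - d0) / (d1 + d0)

# Toy spectra: a flat spectrum (no resonance) versus one with a 40 Hz bump
f = np.linspace(0.0, 120.0, 1201)
S_flat = np.ones_like(f)
S_peak = 1.0 + 4.0 * np.exp(-((f - 40.0) ** 2) / (2.0 * 5.0 ** 2))
rho = rho_01(f, S_peak, S_flat)    # close to 1: clear discrimination
```

For the flat spectrum, Δ_FB is (near) zero because the two bands have equal width, so a resonating c = 1 spectrum against a flat c = 0 spectrum drives ρ_{0−1} toward one.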
This index indicates to what extent the network is able to distinguish between a fully spatially correlated and a fully uncorrelated input. The lower panel of Figure 7 shows that dynamic synapses (filled symbols) guarantee a clear and constant discrimination for all feedback gain values, whereas the model with static synapses (open symbols) detects the correlated input only at intermediate feedback gains (range from −1 to −2) but cannot distinguish correlated and uncorrelated inputs for feedback gains g < −2. It has been shown (Whittington, Traub, & Jeffreys, 1995; Traub, Whittington, Colling, Buzsaki, & Jeffreys, 1996) that gamma oscillations in hippocampal interneurons are sensitive to the decay time constant of the
Figure 7: (Top) Power Δ_FB = Δ_{30,50} − Δ_{2,22} near the resonance peak (30–50 Hz) relative to that in the frequency range between 2 and 22 Hz in the presence of correlated (c = 1; circles) and uncorrelated (c = 0; triangles) input, with static synapses (SS; open symbols) and dynamic synapses (DS; filled symbols). (Bottom) Ability to discriminate correlated and uncorrelated input, ρ_{0−1} (see equation 3.2), for static (SS, open symbols) and dynamic (DS, filled symbols) synapses as a function of the inhibitory gain.
GABA_A synapse, in the sense that the oscillation frequency decreases for increasing values of that time constant. This is compatible with the fact that the oscillation frequency in the models by Börgers and Kopell (2003) and by Doiron et al. (2003, 2004) depends on the effective time delay in the feedback loop, which includes the pure time delay (absent in the Börgers and Kopell model), the dynamics of the synapses, and the dynamics of the inhibitory LIF neurons. Since changing the time constant of the inhibitory LIF neuron changes both the effective delay in the feedback loop and the behavioral characteristics of the LIF neuron (integrator versus coincidence detector), we have analyzed the behavior of the model for various time constants τ_{m,I} of the inhibitory LIF neuron. As mentioned before, the neurons in the network start to oscillate, with synchronized firing of the excitatory neurons, when the membrane time constant of the inhibitory neuron is small. This synchronization is independent of the nature of the external stimulus as long as the intensity is
Figure 8: Ability to discriminate correlated and uncorrelated input ρ0−1 (see equation 3.2) for static (SS, open symbols) and dynamic (DS, filled symbols) synapses versus the membrane time constant of the inhibitory neuron, with g = −1.2.
sufficiently large and the time constant τ_{m,I} is small (see Figure 8). For time constants larger than 15 ms, the linear approximation of the inhibitory LIF neuron is valid, and the network behaves as in Doiron et al. (2003, 2004). When dynamic synapses are included in the model, the separation between correlated and uncorrelated input is also preserved for small values of τ_{m,I}; the index ρ_{0−1}, which represents the ability to discriminate between correlated and uncorrelated input, is constant for all values of the membrane time constant τ_{m,I}. Another novel aspect of the model with dynamic synapses relates to the ability to preserve the frequency content of the input signal. In the study by Börgers and Kopell (2003), the oscillation frequency depends only on the effective time delay in the feedback loop between excitatory and inhibitory neurons, and the oscillation is present as long as the power of the input signal is sufficiently large. In the model by Doiron et al. (2003), the presence of oscillations depends on the presence of spatially correlated input, and the oscillation frequency depends on the effective delay in the feedback loop, not the temporal properties of the input. With dynamic synapses, the system also preserves the temporal information contained in the input. This is illustrated in Figure 9, which shows the spectra of the spike responses of the excitatory neurons in response to bandpass filtered gaussian noise with a center frequency at 60 Hz, a frequency well above the resonance peak near 40 Hz. The bandpass filtered noise is obtained by filtering gaussian white noise with a flat spectrum up to 150 Hz with a second-order Butterworth bandpass filter with passband between 53 and 67 Hz. After filtering in the
Figure 9: Spectra of the excitatory neurons for uncorrelated input (c = 0; upper panels) and for correlated input (c = 1; lower panels) for the model with static synapses (left panels) and dynamic synapses (right panels) for three different gains in the inhibitory feedback loop: g = 0 (solid line), g = −2 (dashed line), and g = −4 (dotted line). The input signal was bandpass filtered gaussian white noise centered at 60 Hz. The units along the vertical axis are spikes²/s. In order to compare the shape of the spectra, the spectra were normalized.
forward direction, the filtered sequence is reversed to ensure a zero-phase shift. The power of the bandpass filtered signal is normalized such that the power of the input signal is the same in all conditions. Without feedback (g = 0) the solid lines show the spectra of the responses to the bandpass filtered noise with a clear peak near 60 Hz. For the static synapses (left-hand panels), the amplitude of the peak near 60 Hz decreases with increasing feedback gain. If the input is spatially correlated, a resonance peak appears near 40 Hz, consistent with the results of Doiron et al. (2003, 2004) and with the results in Figure 5 and Figure 7, which show that a resonance peak is found for intermediate feedback gains only if the input is correlated. If the bandpass filtered noise at the input is spatially uncorrelated (c = 0; upper left panel), a peak near 40 Hz also appears for
the high feedback gain of −4, in agreement with our previous results (see Figure 5). If the bandpass filtered noise for the excitatory input units is spatially correlated (c = 1; lower left panel), the peak near 60 Hz disappears and a peak near 40 Hz appears for a feedback gain of −4. The spectra of the spike output of the excitatory neurons for the model with dynamic synapses differ from those for the static synapses in two respects. The first difference is that the spectra for the model with dynamic synapses are almost identical for the uncorrelated bandpass filtered noise for all feedback gain values (c = 0, upper right panel). The almost identical shape of the spectra reflects the robustness of the model behavior with dynamic synapses. If the bandpass filtered noise is correlated (c = 1, lower right panel), a peak appears near 40 Hz, but in addition, the peak at the input frequency near 60 Hz is preserved. This finding reflects that the model with dynamic synapses has two important advantages over the model with static synapses: the responses of the excitatory neurons are robust to changes in feedback gain, and the spectra preserve the spectrum of the input, in addition to the peak in the spectrum indicating correlated input. In order to explore the effect of the feedback gain on the spike responses near the input frequency, we have explored the ratio
Δ_{52.5,67.5}/Δ_{72.5,87.5}, which relates the power of the excitatory neurons near the bandpass filtered input at 60 Hz to the power near 80 Hz, a region where the power spectrum is not affected by feedback or resonance. A high value of this ratio reflects large response amplitudes near the input frequency of 60 Hz, whereas small values indicate that the input signal is not present in the spike responses. With the feedback gain equal to zero, there is no difference between static and dynamic synapses (see Figure 10). As illustrated in Figures 5, 7, and 9, the ability to discriminate between spatially correlated and uncorrelated input is about the same for small feedback gains for dynamic and static synapses. For the range of feedback gains between −1 and −2, we reproduce the behavior reported by Doiron et al. (2004) for static synapses: there is a resonance peak for spatially correlated input (see Figure 9, lower left panel), but the characteristics of the input signal are almost lost. This explains why, for static synapses, the ratio Δ_{52.5,67.5}/Δ_{72.5,87.5} is smaller in the range −2 < g < −1 than for g = 0. For values of g < −2, the system with static synapses starts to oscillate at 40 Hz regardless of the spatial correlation of the input (see Figures 5, 7, and 9), which explains the decrease of Δ_{52.5,67.5}/Δ_{72.5,87.5} to values near 1 for static synapses. Quite remarkably, the ratio Δ_{52.5,67.5}/Δ_{72.5,87.5} remains approximately constant for the dynamic synapses for all values of g, indicating preservation of the input characteristics in the firing rate of the excitatory neurons. The robustness of the model responses with dynamic synapses is reflected not only in the spectra (see Figures 8, 9, and 10) but also in the firing rates of the excitatory and inhibitory neurons in the model. This is illustrated in Figure 11, which shows the firing rate of the excitatory (E) and inhibitory
Figure 10: Discrimination index ρ_IF for the input frequency near 60 Hz in the responses of the excitatory neurons, defined as the power around 60 Hz divided by the power around 80 Hz (Δ_{52.5,67.5}/Δ_{72.5,87.5}), for static (open symbols, SS) and dynamic (filled symbols, DS) synapses.
Figure 11: Firing rate of excitatory (E) and inhibitory (I) neurons for the models with static and dynamic synapses. Data at the left (right) correspond to the model with an inhibitory LIF neuron with a leak time constant of 6 (18) ms. Except for the time constant of the inhibitory neuron, static versus dynamic synapses, and the feedback gain, all parameters were the same for all conditions.
(I) neurons for the models with static and dynamic synapses for the various feedback gains and for two different time constants of the inhibitory neurons (6 and 18 ms). For the static synapses, the firing rate of the excitatory neurons decreases with increasing feedback gain. The difference in firing rate for g = 0 and g = −4 is typically about 30%. A similar decrease in firing
rate is observed for the inhibitory neurons. However, for the model with dynamic synapses, the firing rates of the excitatory and inhibitory neurons depend only weakly on the feedback gain. Note that this phenomenon is independent of the time constant of the inhibitory neurons.

4 Discussion

The main results of this study can be summarized in two conclusions. The first is that the different results reported by Börgers and Kopell (2003) and by Doiron et al. (2003, 2004), who used models with the same architecture but with qualitatively different behavior, can be explained by the responses of a single model with different feedback gains. The simulations presented here show that the same architecture with pools of excitatory and inhibitory neurons can display different behavior when the parameter values are in different ranges. For small and intermediate feedback gains, the model starts to oscillate if the input to the excitatory input units is spatially correlated. Without correlation, the resonance peak in the spike responses is absent. The resonance in the model reported by Börgers and Kopell (2003), which appears as soon as the input to the excitatory neurons is sufficiently large (regardless of correlation between the inputs to the neurons), is obtained if the feedback gain is sufficiently large, in agreement with the conditions for resonance in their study. Another difference between the models by Börgers and Kopell (2003) and Doiron et al. (2003, 2004) is that the latter has only one inhibitory unit, whereas Börgers and Kopell have a population of inhibitory neurons. In this study we have shown that the behavior of the model by Doiron et al. (2003, 2004) does not change if the linear unit, which sums the spikes of the excitatory neurons and feeds the feedback loop, is replaced by an LIF neuron. Figure 3 shows that the results are not affected either when several inhibitory neurons are added in parallel.
Borgers and Kopell (2005) also report that the qualitative behavior of the model is the same if several inhibitory neurons are added. In their study, suppression of excitatory cells can occur for asynchronously firing inhibitory cells. In our study, the inhibitory cells fire in synchrony for spatially correlated input (c = 1) and for high feedback gains. For c = 0 and for small feedback gains, the inhibitory neurons in our model do not fire in synchrony, and neither do the excitatory cells. However, if there are several inhibitory neurons, the network models starts oscillations at slightly higher feedback gains than for a single inhibitory neuron. In summary, the behavior of the model is qualitatively the same for a single inhibitory neuron and for the case of several inhibitory neurons. The second conclusion of our study is that replacing synapses with a constant efficacy by dynamic synapses has two major implications: the model behavior becomes robust for changes in feedback gain and in time constant of the inhibitory LIF neuron. The resonance peak appears only for correlated input stimuli. Moreover, the model with dynamic synapses preserves
Oscillations in Networks of Spiking Neurons
1759
the characteristics of the input in the spike responses. From a biological perspective, this is highly relevant, since it allows transmission of the temporal properties of the input signal while the resonance provides a label indicating whether the input is spatially correlated. In visual cortex cells, responses reflect the temporal properties of the stimulus in the receptive field but in addition can reveal gamma-frequency components (see, e.g., Reynolds & Desimone, 1999; Roelfsema, Lamme, & Spekreijse, 2004). In addition, several studies (e.g., Baker, Pinches, & Lemon, 2003; Schoffelen et al., 2005) have shown that multiple EEG rhythms can occur simultaneously, which is in contradiction with the behavior of some models in the literature (see, e.g., Börgers & Kopell, 2003; Doiron et al., 2004), which reveal oscillations at only one particular frequency. The finding that the activity of the network can reflect both the temporal characteristics of the input signal (the frequency of input modulations), which drives the network, and synchronous oscillations at one particular frequency is new. Previous studies modeling neurons as coupled oscillators have shown that when the width of the distribution of the intrinsic frequencies is small compared to the coupling strength, all oscillators become phase-locked (Ariaratnam & Strogatz, 2001). When the distribution of intrinsic frequencies becomes larger, the network can reveal a broad spectrum of dynamics, such as frequency locking at a single frequency, incoherent firing with a broad range of frequencies, oscillator death, and hybrid solutions that combine two or more of these states, depending on the coupling strength. Tsodyks, Mitkov, and Sompolinsky (1993) studied a system of globally coupled oscillators with pulse interactions (mimicking coupling by presynaptic action potentials). In such a system, the completely phase-locked state is unstable to weak disorder.
For a small inhomogeneity, the oscillators are divided into two populations, one of which exhibits phase-locked periodic behavior, whereas the other consists of unlocked oscillators, which exhibit aperiodic patterns that are only partially synchronized. In this context it is important to stress that the spectra shown in Figures 5, 6, and 9 are representative of all excitatory neurons. That is, the oscillations and the input characteristics can be found in each neuron and do not reflect a partitioning within the pool of excitatory neurons. The novel finding that a group of neurons can code both the correlation between parallel neuronal input channels and the temporal characteristics of input signals is relevant for several reasons. As explained in section 1, the phenomenon of oscillations in the activity of groups of neurons has long been known. However, its functional significance is still a subject of speculation (see, e.g., Fries, 2005). Moreover, the neuronal mechanisms responsible for the initiation of oscillatory activity are barely known. It is well known that cells in visual cortex can fire in synchrony at frequencies near 40 to 60 Hz and at the same time can reproduce the temporal dynamics of changes in the light in the receptive field. For example, Fries, Reynolds, Rorie, and Desimone (2001) report typical changes in firing
rate at a latency of about 25 ms in cat primary visual cortex, related to onset and offset of a visual stimulus in the receptive field of these neurons. At the same time, they find correlated oscillatory activity at frequencies in the range between 40 and 70 Hz. This clearly illustrates that the activity of cells in cat primary visual cortex simultaneously reflects temporal changes in the light falling in the receptive field as well as correlated oscillatory activity at 40 to 70 Hz. Similarly, cells in visual areas V2 and V4 reflect the dynamics of stimuli in the receptive field but at the same time reveal correlated activity in the γ-range (Reynolds & Desimone, 1999). As far as we know, our study is the first to show that the temporal dynamics of activity in a population of neurons can reflect both changes in the stimulus that drives the cells and oscillatory activity. van Vreeswijk and Hansel (2001) studied the emergence of synchronized burst activity in networks of neurons with spike adaptation. They showed that networks of tonically firing adapting excitatory neurons can evolve to a state where the neurons burst in a synchronized manner. These authors showed that synchronized bursting is robust against inhomogeneities, sparseness of the connectivity, and noise. In networks of two populations, one excitatory and one inhibitory, they found that decreasing the inhibitory feedback can cause the network to switch from a tonically active, asynchronous state to the synchronized bursting state. In our study, we used dynamic synapses instead of the spike adaptation employed by van Vreeswijk and Hansel (2001). Another difference is that the latter study did not consider external input but studied tonically active neurons instead. Our study is therefore a further extension of the work of van Vreeswijk and Hansel, showing the effect of external input on the stable states of a network with excitatory and inhibitory neurons coupled by dynamic synapses.
In the model shown in Figure 1, the dynamic synapses were implemented in the feedback (i.e., the I → E) path. If the firing rate of the inhibitory neuron increases, the efficacy of the dynamic synapses decreases, which acts like an automatic gain control. This partly explains why the responses of the model with dynamic synapses hardly depend on the feedback loop gain. This finding is in agreement with Pantic et al. (2003), who demonstrated that dynamic synapses can perform good coincidence detection over a large range of frequencies for a single suitably chosen threshold value, whereas static synapses require an adaptive, frequency-dependent threshold value for good detection. In effect, the adaptation is done by the synapses themselves. Dynamic synapses in the feedforward loop have also been discussed in several studies. The common message of these studies (see, e.g., Natschläger, Maass, & Zador, 2001; Senn, Segev, & Tsodyks, 1998; Pantic et al., 2003) is that dynamic synapses in the E → I path are very effective in allowing correct coincidence detection for a wide range of firing rates of the excitatory neurons without changing the base current to the inhibitory neurons.
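The gain-control effect of depressing synapses can be illustrated with a minimal sketch in the spirit of the Tsodyks-Markram model. The utilization parameter U and recovery time constant tau_rec below are made-up illustrative values, not the parameters of the model in this paper:

```python
import math

# Minimal sketch of rate-dependent depression in a Tsodyks-Markram-style
# synapse. U (utilization) and tau_rec (recovery time constant, in s) are
# illustrative values, not those used in this paper's model.
def steady_state_efficacy(rate_hz, U=0.5, tau_rec=0.8):
    """Per-spike efficacy U*x at steady state, for regular firing at rate_hz."""
    dt = 1.0 / rate_hz                        # interspike interval (s)
    recover = 1.0 - math.exp(-dt / tau_rec)   # fraction recovered between spikes
    x = 1.0                                   # available synaptic resources
    for _ in range(1000):                     # iterate the map to its fixed point
        x *= (1.0 - U)                        # depletion by one spike
        x += (1.0 - x) * recover              # recovery until the next spike
    return U * x

low, high = steady_state_efficacy(5.0), steady_state_efficacy(50.0)
assert high < low                        # efficacy drops as the rate grows
assert high * 50.0 < 2.0 * (low * 5.0)   # total drive grows far less than 10x
```

Because the per-spike efficacy scales roughly as the inverse of the presynaptic rate at high rates, the product of efficacy and rate (the net feedback drive) becomes nearly rate independent, which is the automatic gain control invoked above.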
Initially it came as a surprise that the model shown in Figure 1 revealed oscillations also for uncorrelated noise (c = 0) at large feedback gains. The last term in equation 2.5 is zero for c = 0. The explanation for the unexpected resonance is that equation 2.5 was derived using linear-response theory (see Doiron et al., 2004, and Lindner, Chacron, & Longtin, 2005, where the assumptions and mathematical analyses are explained in detail). However, the linear approximation of an LIF neuron is valid only for relatively long time constants of the neuron, where it operates mainly as an integrator. This approximation is no longer valid when the time constant of the LIF neuron is short, transforming the neuron into a coincidence detector. Similarly, a high feedback gain gives rise to full inhibition of the excitatory neurons after each action potential of the inhibitory neuron (see Figure 4, left panel), followed by a release of excitatory output. The large inhibitory input to the excitatory neurons synchronizes the excitatory neurons (precisely as in Börgers & Kopell, 2003), leading to a burst of activity to the inhibitory neuron. For such volleys of synchronized input, the linear approximation for the LIF neuron also fails.

Appendix

This appendix gives a derivation of equation 2.5. It mainly follows the lines set out in Doiron et al. (2004) and in Lindner, Chacron et al. (2005) and Lindner, Doiron et al. (2005). The input to the excitatory neurons is split into two parts. The first part consists of the base current $\mu$, the internal neuron-specific noise $\eta_i(t)$, and the time-independent mean of the feedback ($gS_{TI}(t)$ with $g < 0$) from the inhibitory neuron to the excitatory neurons. The second part consists of the external input $\sigma\left(\sqrt{1-c}\,\xi_j(t) + \sqrt{c}\,\xi_G(t)\right)$ and the time-dependent part of the feedback from the inhibitory neurons.
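The decomposition of the external input into private and shared Gaussian components can be written down directly. In the following sketch (function names are ours, not from the paper), the construction $\sigma(\sqrt{1-c}\,\xi_j + \sqrt{c}\,\xi_G)$ yields inputs to different neurons whose pairwise correlation coefficient is approximately c:

```python
import random

def correlated_inputs(n_neurons, n_steps, c, sigma=1.0, seed=0):
    """sigma*(sqrt(1-c)*xi_j(t) + sqrt(c)*xi_G(t)): private plus global noise."""
    rng = random.Random(seed)
    a, b = (1.0 - c) ** 0.5, c ** 0.5
    out = [[0.0] * n_steps for _ in range(n_neurons)]
    for t in range(n_steps):
        xi_g = rng.gauss(0.0, 1.0)              # global noise, shared by all
        for j in range(n_neurons):
            xi_j = rng.gauss(0.0, 1.0)          # private noise of neuron j
            out[j][t] = sigma * (a * xi_j + b * xi_g)
    return out

def corrcoef(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sx = sum((u - mx) ** 2 for u in x) ** 0.5
    sy = sum((v - my) ** 2 for v in y) ** 0.5
    return sxy / (sx * sy)

inputs = correlated_inputs(2, 200000, c=0.3)
assert abs(corrcoef(inputs[0], inputs[1]) - 0.3) < 0.02   # empirical corr ~ c
```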
Using a linear-response approximation for a leaky integrate-and-fire neuron, with susceptibilities $A^E(\omega)$ and $A^I(\omega)$ (see Lindner & Schimansky-Geier, 2001; Lindner, Schimansky-Geier, & Longtin, 2002) for the excitatory and inhibitory neurons, respectively, the spectrum of the spike train of the excitatory neuron is given by
$$
s_j^E(\omega) = s_{0,j}^E(\omega) + A^E(\omega)\left[\, I_j(\omega) + \frac{g}{N}\, e^{-i\omega\tau_D}\, K_\tau^I(\omega)\, s^I(\omega) \right],
\tag{A.1}
$$
where $s_j^E(\omega)$ and $s_{0,j}^E(\omega)$ represent the spike train of the jth excitatory neuron in the presence and absence, respectively, of the external stimulus and feedback. $K_\tau^I(\omega)$ represents the postsynaptic dynamics of the synapses from the inhibitory to the excitatory neurons, as defined in equation 2.4, in the frequency domain, and $A^E(\omega)$ and $A^I(\omega)$ represent the susceptibility in the frequency domain with respect to the input of the excitatory neurons and the inhibitory neurons, respectively (for details, see Doiron et al., 2004;
Lindner & Schimansky-Geier, 2001; Lindner et al., 2002). The term $e^{-i\omega\tau_D}$ is the Fourier representation of the time delay in the inhibitory feedback loop. For the activity of the inhibitory neuron in the frequency domain, we obtain
$$
s^I(\omega) = A^I(\omega)\, K_\tau^E(\omega) \sum_{j=1}^{N_E} s_j^E(\omega).
\tag{A.2}
$$
Inserting equation A.2 into equation A.1, and with $I_j(\omega)$ the Fourier transform of $I_j(t) = \mu + \eta_j(t) + \sigma\left(\sqrt{1-c}\,\xi_j(t) + \sqrt{c}\,\xi_G(t)\right)$, gives for the power spectrum of the spike train of the jth excitatory neuron

$$
\begin{aligned}
\left\langle \tilde{s}_j^E \tilde{s}_j^{E*} \right\rangle
&= \left\langle \tilde{s}_{0,j}^E \tilde{s}_{0,j}^{E*} \right\rangle
 + \tilde{A}^E \left\langle \tilde{I}_j\, \tilde{s}_{0,j}^{E*} \right\rangle
 + \tilde{A}^{E*} \left\langle \tilde{I}_j^{*}\, \tilde{s}_{0,j}^{E} \right\rangle \\
&\quad + \frac{g}{N}\, \tilde{A}^E e^{-i\omega\tau_D}\, \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E
   \left\langle \tilde{s}_{0,j}^{E*} \sum_{k=1}^{N_E} \tilde{s}_k^E \right\rangle
 + \frac{g}{N}\, \tilde{A}^{E*} e^{i\omega\tau_D}\, \tilde{K}_\tau^{I*} \tilde{A}^{I*} \tilde{K}_\tau^{E*}
   \left\langle \tilde{s}_{0,j}^{E} \sum_{k=1}^{N_E} \tilde{s}_k^{E*} \right\rangle \\
&\quad + \left|\tilde{A}^E\right|^2 \left\langle \left|\tilde{I}_j\right|^2 \right\rangle
 + \frac{g}{N}\, \left|\tilde{A}^E\right|^2 e^{-i\omega\tau_D}\, \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E
   \left\langle \tilde{I}_j^{*} \sum_{k=1}^{N_E} \tilde{s}_k^E \right\rangle \\
&\quad + \frac{g}{N}\, \left|\tilde{A}^E\right|^2 e^{i\omega\tau_D}\, \tilde{K}_\tau^{I*} \tilde{A}^{I*} \tilde{K}_\tau^{E*}
   \left\langle \tilde{I}_j \sum_{k=1}^{N_E} \tilde{s}_k^{E*} \right\rangle
 + \frac{g^2}{N^2}\, \left|\tilde{A}^E\right|^2 \left|\tilde{K}_\tau^I\right|^2 \left|\tilde{A}^I\right|^2 \left|\tilde{K}_\tau^E\right|^2
   \sum_{k,l}^{N_E} \left\langle \tilde{s}_k^E \tilde{s}_l^{E*} \right\rangle,
\end{aligned}
\tag{A.3}
$$
where $\tilde{x}$ represents the Fourier transform of $x$. Using that $\left\langle s_{0,j}^E(\omega)\, s_k^{E*}(\omega)\right\rangle = \left\langle s_{0,j}^E(\omega)\, I_j^{*}(\omega)\right\rangle = \left\langle s_j^E(\omega)\, \xi_k^{*}(\omega)\right\rangle = \left\langle \xi_G(\omega)\, \xi_k^{*}(\omega)\right\rangle = 0$ for $j \neq k$ explains why the second and third terms on the right-hand side of the equal sign are equal to zero. Substitution of equation A.2 in A.1 and solving for $\sum_{j=1}^{N_E} s_j^E(\omega)$ gives

$$
\sum_{k=1}^{N_E} \tilde{s}_k^E
= \frac{\displaystyle \sum_{j=1}^{N_E} \tilde{s}_{0,j}^E + \tilde{A}^E \sum_{j=1}^{N_E} \tilde{I}_j}
       {1 - \tilde{A}^E\, g\, e^{-i\omega\tau_D}\, \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E}.
\tag{A.4}
$$
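The structure of this solution, the summed drive divided by one minus the loop gain, can be sanity-checked numerically: iterating the feedback relation S = S0 + A·I + G·S converges, for loop gain |G| < 1, to exactly (S0 + A·I)/(1 − G). All numerical values below are made up for illustration:

```python
# Numerical check of the self-consistent solve behind equation A.4:
# iterating S <- S0 + AI + G*S converges (for |G| < 1) to (S0 + AI)/(1 - G).
G = 0.4 - 0.3j           # loop gain at some frequency, |G| < 1 (made-up value)
S0 = 1.2 + 0.5j          # summed unperturbed spike trains (made-up value)
AI = -0.7 + 0.2j         # A^E times the summed inputs (made-up value)
S = 0.0 + 0.0j
for _ in range(200):     # fixed-point iteration of the feedback loop
    S = S0 + AI + G * S
closed_form = (S0 + AI) / (1 - G)
assert abs(S - closed_form) < 1e-9
```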
Substitution of equation A.4 in A.3, using that $\left\langle s_{0,j}^E(\omega)\, s_k^{E*}(\omega)\right\rangle = \left\langle s_{0,j}^E(\omega)\, I_j^{*}(\omega)\right\rangle = \left\langle s_j^E(\omega)\, \xi_k^{*}(\omega)\right\rangle = \left\langle \xi_G(\omega)\, \xi_k^{*}(\omega)\right\rangle = 0$ for $j \neq k$, and taking $N \to \infty$ so as to neglect terms of order $1/N$ and higher, we obtain for the power spectrum of the spike train of the excitatory neuron

$$
\left\langle s_j^E(\omega)\, s_j^{E*}(\omega) \right\rangle
= \left\langle s_{0,j}^E(\omega)\, s_{0,j}^{E*}(\omega) \right\rangle
+ \sigma^2 \left|A^E(\omega)\right|^2 + \cdots
+ c\,\sigma^2 \left|A^E(\omega)\right|^2\,
\frac{2\,\mathrm{Re}\!\left[\tilde{A}^E g\, e^{-i\omega\tau_D}\, \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E\right]
      - \left|\tilde{A}^E g\, \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E\right|^2}
     {\left|1 - \tilde{A}^E g\, e^{-i\omega\tau_D}\, \tilde{K}_\tau^I \tilde{A}^I \tilde{K}_\tau^E\right|^2}.
$$
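The resulting spectrum expression can be evaluated numerically. The sketch below uses made-up first-order low-pass forms for the susceptibilities and synaptic kernels and arbitrary parameter values (none of these are the model's actual functions); it checks the property used in section 4: for c = 0 the feedback term vanishes, so the spectrum no longer depends on the feedback gain.

```python
import math, cmath

def spectrum(f_hz, c, g=-2.0, tau_d=0.01, tau_m=0.005, tau_s=0.005,
             s0=1.0, sigma=1.0):
    """S0 + sigma^2|A|^2 + c*sigma^2|A|^2 * (2 Re G - |G|^2) / |1 - G|^2.

    A(w) and K(w) are stand-in first-order low-pass filters (illustrative
    only); G is the open-loop gain A * g * exp(-i w tau_d) * K * A * K."""
    w = 2.0 * math.pi * f_hz
    A = 1.0 / (1.0 + 1j * w * tau_m)       # stand-in susceptibility
    K = 1.0 / (1.0 + 1j * w * tau_s)       # stand-in synaptic kernel
    G = A * g * cmath.exp(-1j * w * tau_d) * K * A * K
    feedback = (2.0 * G.real - abs(G) ** 2) / abs(1.0 - G) ** 2
    return s0 + sigma**2 * abs(A)**2 + c * sigma**2 * abs(A)**2 * feedback

# c = 0: the last term vanishes, so the spectrum is independent of g.
assert abs(spectrum(40.0, c=0.0, g=-2.0) - spectrum(40.0, c=0.0, g=0.0)) < 1e-12
# c > 0: the delayed inhibitory feedback reshapes the spectrum.
assert abs(spectrum(40.0, c=1.0, g=-2.0) - spectrum(40.0, c=0.0, g=-2.0)) > 1e-6
```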
Acknowledgments

We acknowledge support from the Lorentz Center (http://www.lc.leidenuniv.nl/) for organizing a workshop, which allowed us to discuss the results of this study with André Longtin and Nancy Kopell. We thank Magteld Zeitler for helpful discussions.

References

Ariaratnam, J. T., & Strogatz, S. H. (2001). Phase diagram for the Winfree model of coupled nonlinear oscillators. Phys. Rev. Lett., 86, 4278–4281.
Baker, S. N., Pinches, E. M., & Lemon, R. N. (2003). Synchronization in monkey motor cortex during a precision grip task. II: Effect of oscillatory activity on corticospinal output. J. Neurophysiol., 89, 1941–1953.
Börgers, C., & Kopell, N. (2003). Synchronization in networks of excitatory and inhibitory neurons with sparse, random connectivity. Neural Comp., 15, 509–538.
Buzsáki, G., Leung, L. W. S., & Vanderwolf, C. H. (1983). Cellular bases of hippocampal EEG in the behaving rat. Brain Res. Rev., 6, 139–171.
Doiron, B., Chacron, M. J., Maler, L., Longtin, A., & Bastian, J. (2003). Inhibitory feedback required for network oscillatory responses to communication but not prey stimuli. Nature, 421, 539–543.
Doiron, B., Lindner, B., Longtin, A., Maler, L., & Bastian, J. (2004). Oscillatory activity in electrosensory neurons increases with the spatial correlation of the stochastic input stimulus. Phys. Rev. Lett., 93, 048101–048104.
Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM J. Appl. Math., 46, 233–253.
Ernst, U., Pawelzik, K., & Geisel, T. (1995). Synchronization induced by temporal delays in pulse-coupled oscillators. Phys. Rev. Lett., 74, 1570–1573.
Fries, P. (2005). A mechanism for cognitive dynamics: Neuronal communication through neuronal coherence. Trends in Cognitive Sciences, 9, 474–480.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291, 1560–1563.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comp. Neurosci., 1, 11–38.
Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Comp., 10, 1047–1065.
Hansel, D., & Mato, G. (1993). Patterns of synchrony in a heterogeneous Hodgkin-Huxley neural network with weak coupling. Physica A, 200, 662–669.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comp., 7, 307–337.
Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.
Lindner, B., Chacron, M. J., & Longtin, A. (2005). Integrate-and-fire neurons with threshold noise: A tractable model of how interspike interval correlations affect neuronal signal transmission. Phys. Rev. E, 72, 021911.
Lindner, B., Doiron, B., & Longtin, A. (2005). Theory of oscillatory firing induced by spatially correlated noise and delayed inhibitory feedback. Phys. Rev. E, 72, 061919.
Lindner, B., & Schimansky-Geier, L. (2001). Transmission of noise coded versus additive signals through a neuronal ensemble. Phys. Rev. Lett., 86, 2934–2937.
Lindner, B., Schimansky-Geier, L., & Longtin, A. (2002). Maximizing spike train coherence or incoherence in the leaky integrate-and-fire model. Phys. Rev. E, 66, 031916.
Lytton, W. W., & Sejnowski, T. J. (1991). Simulations of cortical pyramidal neurons synchronized by inhibitory interneurons. J. Neurophysiol., 66, 1059–1079.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662.
Natschläger, T., Maass, W., & Zador, A. (2001). Efficient temporal processing with biologically realistic dynamic synapses. Network: Computation in Neural Systems, 12, 75–87.
Pantic, L., Torres, J. J., & Kappen, H. J. (2003). Coincidence detection with dynamic synapses. Network: Computation in Neural Systems, 14(1), 17–33.
Reynolds, J. H., Chelazzi, L., & Desimone, R. (1999). Competitive mechanisms subserve attention in macaque areas V2 and V4. J. Neurosci., 19, 1736–1753.
Reynolds, J. H., & Desimone, R. (1999). The role of neural mechanisms of attention in solving the binding problem. Neuron, 24, 19–29.
Roelfsema, P., Lamme, V. A. F., & Spekreijse, H. (2004). Synchrony and covariation of firing rates in the primary visual cortex during contour grouping. Nature Neurosci., 7, 982–991.
Rudolph, M., & Destexhe, A. (2003). Tuning neocortical pyramidal neurons between integrators and coincidence detectors. J. Comp. Neurosci., 14, 239–251.
Schoffelen, J. M., Oostenveld, R., & Fries, P. (2005). Neuronal coherence as a mechanism of effective corticospinal interaction. Science, 308, 111–113.
Senn, W., Segev, I., & Tsodyks, M. (1998). Reading neuronal synchrony with depressing synapses. Neural Comp., 10, 815–819.
Singer, W. (1999). Neuronal synchrony: A versatile code for the definition of relations? Neuron, 24(1), 49–65.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Traub, R. D., Whittington, M. A., Colling, S. B., Buzsáki, G., & Jefferys, J. G. R. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol. (London), 493, 471–484.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Pattern of synchrony in inhomogeneous networks of oscillators with pulse interactions. Phys. Rev. Lett., 71, 1280–1283.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Comp., 10, 821–835.
Tsodyks, M., Uziel, A., & Markram, H. (2000). Synchrony generation in recurrent networks with frequency-dependent synapses. J. Neurosci., 20(RC50), 1–5.
van Vreeswijk, C. (2000). Analysis of the asynchronous state in networks of strongly coupled oscillators. Phys. Rev. Lett., 84, 5110–5113.
van Vreeswijk, C., & Hansel, D. (2001). Patterns of synchrony in neural networks with spike adaptation. Neural Comp., 13, 959–992.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Received January 30, 2006; accepted November 8, 2006.
LETTER
Communicated by Andreas Nieder
Extracting Number-Selective Responses from Coherent Oscillations in a Computer Model

Jeremy A. Miller
[email protected]

Garrett T. Kenyon
[email protected]

Los Alamos National Laboratory, Los Alamos, New Mexico 87544, U.S.A.
Cortical neurons selective for numerosity may underlie an innate number sense in both animals and humans. We hypothesize that the number-selective responses of cortical neurons may in part be extracted from coherent, object-specific oscillations. Here, indirect evidence for this hypothesis is obtained by analyzing the numerosity information encoded by coherent oscillations in artificially generated spike trains. Several experiments report that gamma-band oscillations evoked by the same object remain coherent, whereas oscillations evoked by separate objects are uncorrelated. Because the oscillations arising from separate objects would add in random phase to the total power summed across all stimulated neurons, we postulated that the total gamma activity, normalized by the number of spikes, should fall roughly as the square root of the number of objects in the scene, thereby implicitly encoding numerosity. To test the hypothesis, we examined the normalized gamma activity in multiunit spike trains, 50 to 1000 msec in duration, produced by a model feedback circuit previously shown to generate realistic coherent oscillations. In response to images containing different numbers of objects, regardless of their shape, size, or shading, the normalized gamma activity followed a square-root-of-n rule as long as the separation between objects was sufficiently large and their relative size and contrast differences were not too great. Arrays of winner-take-all numerosity detectors, each responding to normalized gamma activity within a particular band, exhibited tuning curves consistent with behavioral data. We conclude that coherent oscillations in principle could contribute to the number-selective responses of cortical neurons, although many critical issues await experimental resolution.

1 Introduction

Both animals and humans possess an innate number sense that has been characterized in a variety of behavioral experiments (Dehaene, 1997, 2001; Nieder, 2005).
Neural Computation 19, 1766–1797 (2007)
C 2007 Massachusetts Institute of Technology

Monkeys, for example, can distinguish between images
containing different numbers of objects regardless of their shape, size, or distribution, demonstrating an ability to utilize an abstract sense of number independent of nonnumerical factors, such as the total area of illumination or total integrated contour length (Brannon & Terrace, 1998; Nieder & Miller, 2004a). Likewise, preverbal human infants respond specifically to changes in numerosity (Feigenson, Dehaene, & Spelke, 2004), and the ability to perform numerosity discriminations under controlled conditions has been documented across a wide range of animal taxa, including birds, rats, dolphins, cats, and primates (Davis & Perusse, 1988; Nieder, 2005). Adult humans also possess an innate number sense, distinct from arithmetical computation, that can be revealed by measuring performance on numerosity discrimination tasks in which explicit counting is prevented, either by appropriate experimental protocols (Cordes, Gelman, Gallistel, & Whalen, 2001; Whalen, Gallistel, & Gelman, 1999) or by studying cultures that lack numerically precise words for all but the first few integers (Pica, Lemer, Izard, & Dehaene, 2004). The evolution of an innate number sense is clearly adaptive, as it provides, for example, a basis for making optimal foraging, fight-or-flight, and mating decisions (e.g., “Which branch has more fruit?” or “Which troupe has more individuals?”). In both animals and humans, performance on numerosity discrimination tasks typically depends on the ratio of the quantities being compared, so that 2 versus 4, 4 versus 8, 8 versus 16, and so on are all distinguished with approximately equal levels of accuracy. Similarly, behavioral experiments exhibit both a prominent size effect (e.g., 3 versus 4 can be distinguished more reliably than 4 versus 5) as well as a prominent distance effect (e.g., 3 versus 5 can be distinguished more reliably than 3 versus 4). 
Combined with data showing that the latency of such judgments depends only weakly on numerosity, the above results indicate that animals and humans, especially preverbal human infants, utilize an innate, qualitative sense of number in which individual items are processed in parallel without resort to explicit counting or reliance on the formal mathematical concept of an integer. Recordings from single neurons in the monkey prefrontal and posterior parietal cortex have identified cells that respond selectively to a preferred number of visual items, usually between one and five, independent of nonnumerical factors such as shape, total area, or boundary length, or the distribution of the individual objects in the scene (Nieder, Freedman, & Miller, 2002; Nieder & Miller, 2003, 2004b). Similar to behavioral measures, number-selective neurons exhibit both a size effect—tuning curves become broader as their preferred numerosity increases (e.g., the cells respond to a wider range of numerosities)—and a distance effect—the overlap between tuning curves falls off with increasing numerical separation. The response properties of number-selective cortical neurons make them attractive candidates for mediating the innate number sense documented in related behavioral studies.
1768
J. Miller and G. Kenyon
Here, we explore the hypothesis that coherent, object-specific oscillations implicitly encode numerosity in a manner that might contribute to the number-selective responses of cortical neurons. Neurons in the early visual system often exhibit coherent oscillations at frequencies within the gamma or upper gamma band, here defined broadly as between 40 and 120 Hz (Castelo-Branco, Neuenschwander, & Singer, 1998; Gray, König, Engel, & Singer, 1989; Gray & Singer, 1989; Kreiter & Singer, 1996; Neuenschwander, Castelo-Branco, & Singer, 1999; Neuenschwander & Singer, 1996; Singer & Gray, 1995). Typically, gamma-band oscillations recorded from visual neurons responding to the same object are strongly correlated, even though their common phase drifts randomly over time, as revealed by the absence of periodic structure in stimulus-locked measures, such as the peristimulus time histogram (PSTH), as well as by the declining amplitude of successive periodic side bands in autocorrelograms that otherwise exhibit strong oscillations, an indication that phase coherence is lost after several cycles. Moreover, the oscillatory responses evoked by different objects do not remain phase-locked, so that the cross-correlogram between two recording sites may be flat even though the autocorrelograms recorded at each site show strong periodic modulations at roughly equal frequencies. It follows from the object-specific nature of coherent oscillations that if only a single item is present in a scene, all of the stimulated cells will be modulated in phase, and the average spectral amplitude in the gamma band, computed from the summed activity across all stimulated neurons, will be maximal when normalized by the total number of spikes. If the scene instead contains two objects of similar size and luminance, there will be two sources of oscillatory activity whose relative phase will vary randomly over time.
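The consequence of this random relative phase for population measures can be checked with a short numerical sketch (ours, not the authors' analysis): averaging n unit phasors with independent random phases gives an RMS magnitude of 1/√n, which is the square-root-of-n reduction developed in the next paragraph.

```python
import cmath, math, random

def rms_normalized_gamma(n_objects, trials=20000, seed=1):
    """RMS of |(1/n) * sum of n unit phasors with random phases| over trials.

    Each phasor stands for the gamma-band component contributed by one
    object; the phases are independent because oscillations evoked by
    distinct objects are uncorrelated."""
    rng = random.Random(seed)
    total_power = 0.0
    for _ in range(trials):
        s = sum(cmath.exp(1j * rng.uniform(0.0, 2.0 * math.pi))
                for _ in range(n_objects))
        total_power += abs(s / n_objects) ** 2
    return (total_power / trials) ** 0.5

assert abs(rms_normalized_gamma(1) - 1.0) < 1e-9    # one object: coherent
assert abs(rms_normalized_gamma(4) - 0.5) < 0.02    # ~ 1/sqrt(4)
assert abs(rms_normalized_gamma(16) - 0.25) < 0.02  # ~ 1/sqrt(16)
```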
Thus, in the case of two objects, the total normalized power in the gamma band, assuming the sum includes all stimulated neurons, will be reduced by approximately the square root of 2. Likewise, for n distinct objects of similar size, the normalized gamma power will be reduced by approximately the square root of n (i.e., the magnitude of the average of n complex numbers possessing equal magnitudes and random phases). If it is assumed that all distinct visual items, regardless of their shape, orientation, shading, or distribution, evoke coherent, object-specific oscillations, the resulting estimate of numerosity will be automatically independent of such nonnumerical factors. Although the status of gamma activity in normal visual processing remains hotly debated (Sejnowski & Paulsen, 2006; Shadlen & Movshon, 1999; Singer & Gray, 1995), we may nonetheless ask to what extent numerosity can be encoded by coherent, object-specific oscillations. To investigate the numerosity encoded by coherent oscillations, we analyzed the normalized gamma power from multiunit spike trains obtained from a circuit model previously shown to generate realistic spatiotemporal correlations (Kenyon, Harvey, Stephens, & Theiler, 2004; Kenyon et al., 2003; Kenyon, Theiler, George, Travis, & Marshak, 2004; Kenyon, Travis et al., 2004). To obtain similar measurements from an experimental preparation
would have required recording simultaneously from large numbers of oscillatory neurons all of the same type and distributed across the visual field. Our theoretical study used computer-generated spike trains to investigate the feasibility of the hypothesis that gamma-band oscillations can, in principle, encode numerosity as a prelude to a more rigorous, yet technically difficult experimental study. Our results suggest that the normalized gamma activity, summed over all stimulated neurons and divided by the total spike count, permits single-trial discriminations to be made between images containing different numbers of objects. Because the circuit model was uniform and isotropic and produced object-specific, coherent oscillations in response to all sufficiently large, contiguous features, the resulting numerosity information was encoded in a manner that was inherently independent of the shape, orientation, shading, and distribution of the individual items in the scene. Previous attempts to compute numerosity using artificial neural networks implicitly utilized separate nodes for every enumerable object at each image location, either by starting with an explicit location map or by limiting the implementation to a 1D retina in which all possible lengths could be accounted for by appropriate winner-take-all filters (Dehaene & Changeux, 1993; Verguts & Fias, 2004). Our approach is fundamentally different. By utilizing the numerosity information encoded by stimulus-specific, coherent oscillations, it was possible to achieve invariance with respect to nonnumerical factors, such as size, shape, shading, and distribution, in a fully 2D model without introducing a combinatorially large number of additional elements. Finally, to compare our results to experimental measurements, we constructed an array of winner-take-all numerosity detectors that used only the normalized gamma activity present on single trials. 
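A toy version of such a winner-take-all readout (our construction, not the detector array used in the paper) assigns each detector k the band of normalized gamma amplitudes nearest its preferred value 1/√k and lets the best-matching detector win:

```python
# Toy winner-take-all numerosity readout (illustrative only): detector k
# prefers the normalized gamma amplitude 1/sqrt(k); the detector whose
# preferred amplitude is nearest to the measured amplitude wins.

def winning_numerosity(gamma_amplitude, max_n=8):
    preferred = [(k, 1.0 / k ** 0.5) for k in range(1, max_n + 1)]
    # Winner-take-all: nearest preferred amplitude determines the winner.
    return min(preferred, key=lambda kv: abs(kv[1] - gamma_amplitude))[0]

assert winning_numerosity(1.0) == 1    # fully coherent: one object
assert winning_numerosity(0.52) == 4   # near 1/sqrt(4) = 0.5
assert winning_numerosity(0.34) == 8   # near 1/sqrt(8) ~ 0.354
```

Note that the spacing between 1/√k and 1/√(k+1) shrinks as k grows, so equal-amplitude noise produces broader tuning at larger preferred numerosities, qualitatively consistent with the size effect described in section 1.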
The resulting theoretical tuning curves were consistent with behavioral data, supporting the hypothesis that gamma-band oscillations can, in principle, provide an implicit representation of numerosity that might contribute to the innate number sense shared by animals and humans.

2 Methods

2.1 Retinal Model. A negative feedback circuit, originally developed to account for high-frequency oscillations between retinal ganglion cells (Kenyon et al., 2003), was used to generate artificial spike trains with realistic spatiotemporal correlations in a manner consistent with known anatomy and physiology. Coherent oscillations generated by the model were of similar amplitude, frequency, and duration to those measured experimentally (Kenyon et al., 2003). Moreover, the gamma-band oscillations between model neurons were strongly size dependent and remained coherent only within contiguously stimulated regions (Kenyon, Harvey et al., 2004; Kenyon, Travis et al., 2004), thereby capturing important aspects of the object-specific, spatiotemporal correlations between neurons in the cat
retina (Neuenschwander et al., 1999). The circuit model thus provided a reasonable basis for estimating the numerosity information encoded by coherent oscillations in the central nervous system. Because details of the circuit model have been presented previously, only an abbreviated description is provided below (see also Figure 1). Input to the model was conveyed by an array of external currents proportional to the pixel-by-pixel gray-scale value of a two-dimensional image. These external currents directly stimulated the model bipolar cells and approximated their light-modulated synaptic input from cone photoreceptors. The bipolar cells produced excitatory postsynaptic potentials in both ganglion cells and amacrine cells according to a random process (Freed, 2000). The axon-bearing amacrine cells were electrically coupled to neighboring ganglion cells and to each other, and made strong inhibitory connections onto the surrounding ganglion cells and axon-bearing amacrine cells. This feedback circuit produced robust, physiologically realistic oscillations in response to large stimuli. When several ganglion cells were activated by a stimulus, they in turn activated neighboring axon-bearing amacrine cells via gap junctions (Dacey & Brace, 1992; Jacoby, Stafford, Kouyama, & Marshak, 1996; Vaney, 1994). The stimulated cells were then hyperpolarized by the ensuing wave of axon-mediated inhibition, thus setting up the next cycle of the oscillation. Spike generation in ganglion cells was modeled as a leaky integrate-and-fire process with a membrane time constant of 5 msec, consistent with published physiological data (O’Brien, Isayama, Richardson, & Berson, 2002). The model also contained local nonspiking amacrine cells that generated randomly distributed inhibitory postsynaptic potentials that helped make spontaneous firing asynchronous and acted to increase
Figure 1: Schematic. (a,b) Circuit diagram. The model contained five distinct cell types—bipolar (BP), small amacrine (SA), large amacrine (LA), polyaxonal amacrine (PA), and ganglion cells (GC)—whose implementation was based on the anatomy and physiology of the motion processing pathway in the cat inner retina. (a) Excitation, feedforward, and feedback inhibition. BPs (relay neurons) were activated directly by the stimulus and in turn excited the other cell types (open triangles). Both the SAs and LAs provided feedforward inhibition to the GCs (closed circles), while all three amacrine cell types (retinal interneurons) made inhibitory feedback synapses onto the BPs. (b) Serial inhibition. Interactions between retinal interneurons were arranged as a negative feedback loop. (c) Oscillatory circuit. Local excitation, via gap junctions, elicited long-range, axon-mediated inhibition, producing realistic coherent oscillations. (d) Winnertake-all numerosity detectors. Normalized gamma activity, a measure of the coherent oscillations across all stimulated cells, is postulated to decline as the inverse square root of the number of objects. Bands representing two and four objects are illustrated. The normalized gamma activity present on any given trial was guaranteed to fall within the range of one and only one numerosity detector.
Number-Selective Responses
1771
the overall dynamic range, but were not otherwise critical for generating oscillatory spike trains. The behavior of the model in response to a variety of stimuli, the robustness of the model with respect to both numerical and physiological parameters, and the connection of the model to anatomical and electrophysiological data, are described in detail elsewhere (Kenyon, Harvey et al., 2004; Kenyon & Marshak, 1998; Kenyon et al., 2003; Kenyon, Theiler et al., 2004; Kenyon, Travis et al., 2004). Some model parameters were modified from previously published values. Specifically, all axon-mediated feedback interactions were assigned a fixed synaptic delay of 2 msec, and a finite transmission delay, corresponding to an action potential conduction velocity of 4 ganglion cells per time step (∼1 mm/msec), was incorporated. The time constant of the axon-bearing amacrine cells was also increased slightly, from 5 to 7.5 msec. By increasing the overall delay of the axon-mediated negative feedback loop, the above changes tended to strengthen the coherent oscillations in the retinal model, but this effect was compensated for by reducing the strength of the feedback inhibition onto the model ganglion cells by a factor of 0.44. Overall, the above changes made no qualitative difference in the behavior of the coherent oscillations in the dynamic feedback circuit but were implemented to make the model more realistic and to provide a better fit to the high-frequency resonance in the responses of cat Y ganglion cells to rapidly modulated stimuli (Frishman, Freeman, Troy, Schweitzer-Tong, & Enroth-Cugell, 1987; Miller, Denning, George, Marshak, & Kenyon, 2006).

2.2 Common Input Model.
To control for the possibility that the dynamics of our retinal circuit model, which depended on long-range inhibitory feedback from spiking amacrine cells, could have introduced spurious higher-order correlations that might have unintentionally augmented the numerosity information encoded by coherent oscillations, we generated a second set of artificial spike trains in which the model ganglion cells were simply passive recipients of a common oscillatory input that uniformly modulated the firing rates of all neurons responding to the same contiguous stimulus. Starting with a frequency spectrum approximated by a single Lorentzian function, a best fit was obtained to the normalized Fourier amplitudes produced by the circuit model in response to an 8 × 8 stimulus. The specific form of the Lorentzian function used was (c/π)(0.5a)/((f − b)² + 0.25a²), where f is the frequency and the three parameters a, b, and c, roughly corresponding to the width, location, and amplitude of the peak, were determined empirically to be 12.9844 Hz, 73.4968 Hz, and 6.4584, respectively. To obtain a time-dependent oscillatory signal with realistic temporal correlations, we performed an inverse Fourier transform by assigning random phases, chosen uniformly between 0 and 2π, to the individual frequency components. Parallel sets of artificial spike trains with realistic spatiotemporal correlations were then constructed by applying the same time-dependent oscillatory signal to an 8 × 8 array of Poisson spike generators that simultaneously modulated their instantaneous firing rates.
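As a concrete illustration, the construction of this common oscillatory input can be sketched as follows. The Lorentzian parameters are those quoted above; the time step, record length, modulation depth (`gain`), and array size are illustrative assumptions on our part rather than values taken from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Lorentzian fit to the circuit model's normalized spectrum (values from the text)
a, b, c = 12.9844, 73.4968, 6.4584          # width (Hz), location (Hz), amplitude

dt = 0.001                                   # 1 msec bins (assumed)
n = 1000                                     # 1 sec record (assumed)
freqs = np.fft.rfftfreq(n, d=dt)

# Lorentzian amplitude at each frequency component
amp = (c / np.pi) * (0.5 * a) / ((freqs - b) ** 2 + 0.25 * a ** 2)

# Random phases, uniform on [0, 2*pi), followed by an inverse Fourier transform
phases = rng.uniform(0.0, 2.0 * np.pi, size=freqs.size)
signal = np.fft.irfft(amp * np.exp(1j * phases), n=n)

# Common drive: 31 Hz baseline plus the rescaled oscillation (depth is assumed)
gain = 20.0 / signal.std()
rate = np.clip(31.0 + gain * signal, 0.0, None)          # firing rate in Hz

# 8 x 8 array of Poisson generators sharing the same instantaneous rate
spikes = rng.random((64, n)) < rate[None, :] * dt
```

Repeating the procedure with an independent set of random phases yields a second, uncorrelated array, as was done above for each distinct object.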
A constant offset was added to the oscillatory signal to fix the baseline firing rate of the Poisson spike generators at the same value exhibited by ganglion cells in the retinal model, here equal to 31 Hz, well within the range of observed resting firing rates for Y ganglion cells in the cat retina (Passaglia, Enroth-Cugell, & Troy, 2001). Each distinct object was represented by a separate 8 × 8 array of Poisson spike generators, obtained by repeating the above process using an independent distribution of random phase assignments, ensuring that the gamma-band oscillations arising from separate stimuli were uncorrelated. Additional mathematical details of a related procedure for generating artificial spike trains via common input modulation are described elsewhere (Kenyon, Theiler et al., 2004).

2.3 Stimulation and Data Analysis. Both natural and computer-generated images containing various numbers of distinct objects, encompassing a range of sizes, shapes, shadings, and separations, were presented to the circuit model. Multiunit spike trains were constructed by combining the output of all stimulated output neurons into a single time series. For natural images, a binary mask was used to determine which neurons to include in the multiunit spike train. Nearly identical results to those presented here were obtained when an activity-dependent mask, which selected only those cells that fired a suprathreshold number of spikes over a brief time interval, was used instead of a fixed mask, but the fixed mask allowed more controlled comparisons across stimulus conditions. In most experiments, stimuli were presented for 200 msec prior to collecting data, so that only steady-state activity during the plateau portion of the response was considered. Multiunit spike trains obtained during the plateau portion of the response were typically 1000 msec, but shorter durations, corresponding to 125, 250, and 500 msec, were examined as well.
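The spectral analysis applied to these multiunit spike trains (described next) reduces each trial to a scalar normalized gamma activity. A minimal sketch, with an invented helper name, and with the maximum lag fixed at 200 msec rather than tuned per stimulus condition as in the text:

```python
import numpy as np

def normalized_gamma(multiunit, dt=0.001, max_lag=200, band=(65.0, 85.0)):
    """Normalized gamma activity of a binned multiunit spike train.

    Pipeline: autocorrelation -> Fourier transform (power spectrum) ->
    divide by the total number of spikes (power in the DC band) ->
    square root (normalized amplitudes) -> average over the gamma band.
    """
    x = np.asarray(multiunit, dtype=float)
    lags = np.arange(-max_lag, max_lag + 1)
    ac = np.array([np.dot(x[: x.size - abs(k)], x[abs(k):]) for k in lags])
    amps = np.sqrt(np.abs(np.fft.rfft(ac)) / x.sum())
    freqs = np.fft.rfftfreq(lags.size, d=dt)
    sel = (freqs >= band[0]) & (freqs <= band[1])
    return float(amps[sel].mean())
```

Applied to a 75 Hz rate-modulated Poisson multiunit train, this measure comes out well above the value for an unmodulated train with the same mean rate.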
In one set of experiments, we also considered the numerosity encoded by gamma activity during the transient portion of the response, using multiunit spike trains that included only the first 50 msec following stimulus onset. In all cases, single-trial frequency spectra were obtained as follows: the autocorrelation function was first computed directly from the multiunit spike train and then Fourier transformed, yielding a conventional power spectrum. The result was then divided by the total number of spikes (power in the DC band) to control for variations in the combined area of stimulation. Finally, the square root was taken of each frequency component, so that the final values reflected normalized spectral amplitudes. For each stimulus condition, corresponding to different numbers of objects in the input image, means and standard deviations of the normalized frequency spectra were computed by analyzing the results from 25 independent trials. The peak amplitude in the upper gamma band typically fell between the discrete steps along the frequency axis. To compensate, the maximum lag value of the autocorrelation function was adjusted for each stimulus condition so that the spectral peak was maximal. This latter procedure had no qualitative impact on our results or conclusions but was conducted to ameliorate aliasing effects due to the
use of discrete frequency components. Finally, to obtain a scalar estimate of the normalized gamma activity, the normalized spectral amplitudes were averaged from 65 to 85 Hz, a range that approximately enclosed the peak oscillatory response. The predicted gamma activity for multiple objects was determined by dividing the normalized gamma activity produced in response to a single object by the square root of the number of objects in the corresponding input image. Comparisons to behavioral data were facilitated by constructing artificial winner-take-all numerosity detectors that responded to single-trial estimates of normalized gamma activity falling in a given span (see Figure 1d). Input ranges were nonoverlapping, such that one and only one detector responded to each stimulus presentation, corresponding to the node whose target normalized gamma activity was closest to the actual normalized gamma activity present on that trial. The probability of discharge for each detector in response to images containing various numbers of objects was computed from the mean and standard deviation of the normalized gamma activity under each stimulus condition, under the assumption that all measured quantities were gaussian distributed. The target normalized gamma activity for each numerosity detector was determined separately in each experiment by the mean value produced by images containing the preferred number of items.

3 Results

To determine whether coherent oscillations could, in principle, encode numerosity, we stimulated the model circuit with computer-generated images containing different numbers of identical square spots, with each spot covering an array of 8 × 8 output cells (see Figure 2a).

Figure 2: Numerical discriminations mediated by coherent oscillations. (a) A circuit model for producing realistic, object-specific oscillations was stimulated by square spots of equal intensity, each covering an array of 8 × 8 output neurons. (b) Frequency spectra computed from multiunit spike trains, 1000 msec in duration and which included the plateau responses of all stimulated neurons, exhibited maximal gamma activity (after normalizing by the total number of spikes) when only one object was present and declined steadily as the number of objects increased. (c) Black bars: Normalized gamma activity computed from the average spectral amplitude between 65 and 85 Hz on single stimulus trials. Error bars denote one standard deviation of the corresponding distribution. Gray bars: Prediction of the square-root-of-n rule. (d) Solid lines: Percentage activation of winner-take-all numerosity detectors. Only one detector, whose target level was closest to the normalized gamma activity, was active on any given trial. Dashed lines: Tuning curves measured behaviorally from monkeys performing a numerical discrimination task (Nieder & Miller, 2004a). Model and behavioral tuning curves exhibited similar features.

Normalized frequency
spectra, computed from multiunit spike trains 1000 msec in duration obtained during the plateau portion of the response and consisting of all stimulated cells, exhibited different amounts of gamma activity depending on the number of objects in the image, being maximal when only one object was present and declining as the number of objects increased (see Figure 2b). To obtain a scalar estimate of gamma activity, the spectral amplitude was averaged over the response peak, 65 to 85 Hz. The mean normalized gamma activity declined as the number of objects increased in a manner well predicted by the square-root-of-n rule (see Figure 2c). Error bars denote one standard deviation of the corresponding distribution, not the standard deviation of the mean. There was good discrimination between images containing either one or two objects, as the corresponding means were well outside the associated error bars, but the overlap between neighboring distributions grew steadily as more objects were added, consistent with the size effect observed in a variety of animal and human studies (Nieder, 2005). To quantify the numerosity information encoded by coherent oscillations, we constructed an array of winner-take-all numerosity detectors that responded to the normalized gamma activity present on single stimulus trials. One and only one detector was activated by each stimulus presentation, corresponding to the element whose target activity, defined as the mean response to images containing the preferred number of objects, was closest to the normalized gamma activity present on that trial. Using the means and standard deviations of the normalized gamma activity for each stimulus condition, percentage activation rates were determined for each detector in response to images containing from one to four objects (see Figure 2d, solid lines). When only one object was present in the input image, the 1-object detector was always activated, corresponding to 0% false negatives. 
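Under the gaussian assumption stated in section 2, the detector activation probabilities can be computed in closed form. A sketch, in which the target activities follow the square-root-of-n decline but the numerical values of the means and standard deviations are purely illustrative (they are not the fitted model values):

```python
import numpy as np
from math import erf, sqrt

# Illustrative target activities (sqrt-n decline) and trial-to-trial SDs
targets = 0.30 / np.sqrt(np.array([1.0, 2.0, 3.0, 4.0]))
sds = np.array([0.02, 0.02, 0.02, 0.02])

def activation_probs(n_objects):
    """P(each winner-take-all detector wins) for images with n_objects items.

    The winner is the detector whose target is closest to the single-trial
    normalized gamma activity, so each detector owns the interval between
    the midpoints separating its target from its neighbors; with gaussian
    trial-to-trial variability, its probability is the gaussian mass there.
    """
    mu, sd = targets[n_objects - 1], sds[n_objects - 1]
    mids = 0.5 * (targets[:-1] + targets[1:])        # decision boundaries
    edges = np.concatenate(([np.inf], mids, [-np.inf]))
    cdf = lambda z: 0.5 * (1.0 + erf((z - mu) / (sd * sqrt(2.0))))
    return np.array([cdf(edges[i]) - cdf(edges[i + 1])
                     for i in range(targets.size)])
```

With these toy numbers, the correct detector wins most often, and the overlap between neighboring detectors grows with numerosity, qualitatively reproducing the size effect.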
In response to images containing two, three, or four objects, the activation rates for the 2-, 3-, and 4-object detectors were 84%, 67%, and 45%, respectively, indicating an increasing percentage of false negatives for detectors preferring larger numerosities. Likewise, the percentage of false positives, given by the integrated response over all detectors other than the correct node, increased with preferred numerosity as well. The theoretical tuning curves computed from the single-trial normalized gamma activity (see Figure 2d, solid lines) agreed quite well with previously published behavioral data (see Figure 2d, dashed lines; Nieder & Miller, 2004a). Monkeys were trained to report whether the number of items in a test stimulus, given by the value along the abscissa, equaled the number of items in the target image, with each curve representing a different preferred numerosity. The behavioral and model data both exhibited strong size and distance effects. Tuning curves became broader as the preferred numerosity increased, and the overlap between pairs of tuning curves tended to zero as the absolute difference in their preferred numerosities became larger. Because our theoretical analysis involved no fine-tuning to fit the experimental results—the
parameters of the circuit model having been established previously on the basis of electrophysiological recordings—the close correspondence between the model and behavioral data suggests that the encoding of numerosity may be a robust property of coherent, stimulus-specific oscillations. As an additional test of robustness, we obtained qualitatively similar results after repeating the above experiment using the same basic circuit but with slightly different, previously published parameter values (see section 2), confirming that the encoding of numerosity by coherent oscillations did not require precise tuning of the model. The innate number sense, as characterized both behaviorally and electrophysiologically, depends only on numerosity and is independent of the specific properties of the individual objects in the scene, such as their shape or location. We asked whether normalized gamma activity obtained from the model circuit encoded numerosity in a manner that was likewise robust against nonnumerical factors (see Figure 3). The first set of input images contained various numbers of objects encompassing different shapes, sizes, and orientations, such that the total stimulated area remained invariant (see Figure 3a). This control was necessary to ensure that the normalized gamma activity was in fact specific for numerosity, and not simply a decreasing function of the total stimulated area. As a second, complementary control, we used images in which object size increased in proportion to the number of items present, so that the total stimulated area grew supralinearly (see Figure 3b).
Third, to ensure that high-contrast edges were not required for normalized gamma activity to encode numerosity, we repeated the basic experimental protocol using objects that stimulated the same number of cells but possessed continuously shaded boundaries, falling off over one-quarter cycle of a cosine function so that stimulus intensity was described by a series of square, isoluminant contours with the same total integrated strength as the standard 8 × 8 spot (see Figure 3c). For both the binary and gray-scale images, the normalized gamma activity encoded numerosity in a manner independent of object-specific properties (see Figure 3d), consistent with the innate number sense characterized experimentally. In addition, the decline in normalized gamma activity as a function of the number of objects was well predicted by a square-root-of-n rule. In the above experiments, only stimulated cells contributed to the normalized gamma activity. The restriction to stimulated cells was necessary because when all cells were included in the calculation, coherent oscillations no longer provided a useful measure of numerosity, probably because the oscillatory signals arising from the subset of stimulated neurons were overwhelmed by the baseline of ongoing activity. Background firing rates in the model were rather high, approximately 31 Hz, consistent with experimentally measured values (Passaglia et al., 2001). However, oscillatory activity in the retina can propagate to the visual cortex (Castelo-Branco et al., 1998), where the background firing rates of projection neurons may be much lower.

Figure 3: Invariance of numerosity encoding. (a) Constant area. Individual object sizes were decreased so as to keep the total stimulated area invariant. (b) Supralinear growth. Object sizes increased in proportion to the number of objects in the image. (c) Shaded boundaries. Sharp, high-contrast edges were replaced by smooth gradations. (d) Normalized gamma activity encoded numerosity in a manner consistent with a square-root-of-n rule regardless of object size, shape, or shading.

Thus, the extraction of numerosity from coherent oscillations
at points further along the visual processing pathway may not require the elimination of unstimulated cells from the analysis. Alternatively, it may be possible to identify stimulated cells from the same data used to estimate numerosity. As a preliminary test of this possibility, we reanalyzed the responses to the gray-scale objects (see Figure 3c) using an activity-dependent mask, whose threshold was chosen to best discriminate between stimulated and unstimulated cells based on 100 msec spike trains taken from the middle of each trial, to determine which individual units to include in the analysis. In this simple test, we obtained virtually identical results to those shown above, since the activity-dependent mask rarely differed by more than one or two cells from the fixed binary mask specified a priori. Although
a detailed account of how numerosity might be extracted physiologically is beyond the scope of this study, an activity-dependent mask might be implemented physiologically by facilitating synapses (Beierlein & Connors, 2002) or dendritic nonlinearities (Polsky, Mel, & Schiller, 2004). Behavioral experiments indicate that numerical discriminations do not depend on the absolute difference in the number of items but rather on their ratio (Nieder, 2005; Nieder & Miller, 2003). To determine whether coherent oscillations encode numerosity in a manner consistent with the ratio law observed experimentally, we examined the discrimination mediated by normalized gamma activity in response to computer-generated images containing 1, 2, 4, 8, and 16 identical square spots, each covering an array of 8 × 8 model ganglion cells (see Figure 4). As in the above experiments, frequency spectra were computed from multiunit spike trains that included all stimulated cells. Spectral amplitudes in the upper gamma band, normalized by the total number of spikes, continued to decrease as the number of objects was raised, up to and including the maximum numerosity tested (see Figure 4a). As a function of the number of objects, the mean normalized spectral amplitude between 65 and 85 Hz (gamma activity) adhered quite closely to a square-root-of-n rule, even for images containing up to 16 items (see Figure 4b). Any discrimination procedure based on the percentage change in normalized gamma activity will therefore depend only on the ratio of the number of items being compared. Plotted on a logarithmic scale, model numerosity detectors exhibited symmetrical tuning curves whose half-width at half-maximum (HWHM) was approximately constant (see Figure 4c). The total area under each tuning curve was always unity; the apparent differences were linear distortions due to the logarithmic scale.
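The link between the square-root-of-n rule and a ratio law can be made explicit: if normalized gamma activity falls as g(n) = g1/sqrt(n), the fractional change between two numerosities depends only on their ratio. A toy check (the value of g1 is arbitrary and cancels):

```python
import numpy as np

g1 = 0.30  # single-object normalized gamma activity (arbitrary scale)

def gamma_activity(n):
    """Square-root-of-n rule for the normalized gamma activity."""
    return g1 / np.sqrt(n)

def rel_drop(n, m):
    """Fractional drop in normalized gamma activity going from n to m items."""
    return (gamma_activity(n) - gamma_activity(m)) / gamma_activity(n)

# 1 vs. 2 and 8 vs. 16 share the ratio 1:2, so the fractional drop is
# identical (1 - 1/sqrt(2) in both cases): any discrimination based on
# percentage change therefore depends only on the ratio of numerosities.
```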
To the extent that the HWHM remains linearly proportional to the number of items, numerosity discrimination would be expected to obey a ratio law. We also asked whether the retinal model could encode the numerosity of more realistic gray-scale images, using a photograph that contained five airplanes on a tarmac (see Figure 5a). A somewhat artificial image was employed for these experiments to achieve good object specificity. To simulate both single- and multiple-plane images, we summed the spikes from either one or five distinct groups of selected ganglion cells, indicated by the corresponding binary masks (see Figures 5b and 5c). As with computer-generated images, the normalized frequency spectra computed from the multiunit spike trains showed a clear effect of numerosity, decreasing substantially when the data consisted of responses to five as opposed to a single airplane (see Figure 5d). In addition, the normalized gamma activity computed from the multiunit spike train containing responses to all five planes fell well within the error bars of the value predicted from a simple square-root-of-n rule (see Figure 5e). To ensure that the numerosity encoded in the multiunit spike train corresponding to a single plane was not confounded by the presence of additional items, we repeated the experiment using a new image in which the other four planes were digitally removed,
Figure 4: Ratio law. The circuit model was stimulated with images containing 1, 2, 4, 8, or 16 identical square spots, each covering an 8 × 8 array of output neurons. (a) Normalized frequency spectra. Gamma-band activity declined monotonically for numerosities up to 16. (b) Normalized gamma activity (black bars) closely adhered to a square-root-of-n rule (gray bars). (c) Percentage activation of winner-take-all numerosity detectors. On a logarithmic scale, tuning curves separated by powers of 2 were all approximately of the same width and had similar overlap with their neighbors, consistent with the ratio law observed experimentally.
Figure 5: Naturalistic stimuli. (a) Input image containing five airplanes on a tarmac. It was necessary to use a somewhat artificial image for this experiment to achieve adequate segmentation. (b) Binary mask used to select a multiunit spike train corresponding to a single plane. (c) Mask used to select a multiunit spike train that included responses to five separate planes. (d) Normalized frequency spectra showed clear effects of numerosity. (e) Normalized gamma activity provided good discrimination between multiunit spike trains containing one plane versus five planes, consistent with the prediction of a square-root-of-n rule.
obtaining equivalent results (not shown). These findings suggest that coherent oscillations may encode numerosity even in response to nonidealized gray-scale images. The above experiments all used an analysis window of 1000 msec during the plateau portion of the response. To be of maximum behavioral significance, however, the numerosity information encoded by coherent oscillations should be available on shorter timescales as well. We therefore used the circuit model to investigate how numerosity information encoded by gamma activity during the plateau portion of the response might degrade
as the size of the analysis window or, equivalently, the length of the multiunit spike train record was reduced (see Figure 6). The mean normalized gamma activity was roughly constant for analysis window sizes down to approximately 200 msec and then began rising steadily as the spike train duration was decreased still further, reflecting an overall increase in broadband noise (see Figure 6a). The standard deviation in the distribution of single-trial normalized gamma activities also increased steadily as the spike train duration was reduced (see Figure 6a, error bars). The degradation of numerosity encoding with smaller window sizes was reflected in the corresponding tuning curves, which became broader as the analysis time was decreased (see Figure 6b). However, considerable numerosity information was still present even at the shortest analysis times tested. In particular, reasonable numerical discrimination during the plateau portion of the response was still possible even for analysis times of only 125 msec. So far we have considered only the sustained portion of the multiunit spike train recorded after the initial response transients have settled to a stable plateau level. One advantage of focusing on the oscillations present during the asymptotic portion of the response is that correlations due to stimulus coordination become negligible after approximately 100 msec, as revealed by a nearly flat shift predictor once the plateau period has been reached (Kenyon et al., 2003). When instead we analyzed multiunit spike trains recorded during the transient portion of the response, the predicted performance on numerical discrimination tasks was greatly impaired (data not shown). Although the model predicted large-amplitude oscillations during the response transient, there was very little specificity between objects due to strong stimulus coordination.
However, the model consisted of an unrealistically homogeneous population whose onset latencies were all identical, potentially exacerbating the problem of stimulus coordination during the response transient. When a gaussian spread of onset latencies, with a width of ±5 msec, was imposed on the model input cells, reasonable performance on the numerical discrimination task was obtained (see Figure 7). Normalized spectra showed a clear trend toward lower amplitudes as the number of objects was increased (see Figure 7a). The peaks became broader in part due to the relatively low resolution (20 Hz) at which individual frequency components could be resolved using 50 msec analysis windows. Normalized gamma activity also showed a clear decline with increasing numerosity (see Figure 7b), although the widths of the corresponding distributions, indicated by the attached error bars, were rather large due to the reduced signal-to-noise in numerical estimates based on 50 msec multiunit spike records. Similarly, the theoretical tuning curves predicted by a bank of winner-take-all numerosity detectors were much broader than in the standard case (see Figure 7c), although some level of numerosity discrimination was still clearly evident. Our results suggest that as long as there is a sufficient spread of onset latencies to support adequate object specificity during the transient portion of the response, coherent
Figure 6: Effect of analysis window size. (a) Normalized gamma activity versus spike train duration as a function of numerosity. Regardless of the analysis window size, normalized gamma activity declined with increasing numerosity. Distribution widths (error bars) increased for shorter spike trains. Baseline gamma activity also increased as the analysis window was reduced below approximately 200 msec. (b) Percentage activation of artificial numerosity detectors. Tuning curves became broader as the analysis time was decreased, although considerable numerosity information was still present even for analysis windows of 250 msec.
Figure 7: Rapid encoding of numerosity during the response transient. Multiunit spike trains in response to images containing one to four identical square spots but limited in duration to the first 50 msec following stimulus onset. In order to reduce stimulus coordination, onsets were jittered by an average of ±5 msec independently for each input cell. (a) Normalized frequency spectra. Amplitudes declined with increasing numerosity. (b) Normalized gamma activity declined with numerosity in a manner roughly consistent with a square-root-of-n rule. Distribution widths (error bars) were very large, reflecting reduced signal-to-noise. (c) Percentage activation of winner-take-all numerosity detectors. Tuning curves, though broad, nonetheless indicated numerosity discriminations could be supported by coherent oscillations during the response transient.
oscillations could begin to encode numerosity very shortly after stimulus onset, consistent with the measured onset latencies of number-selective cortical neurons, which are on the order of 100 msec (Nieder & Miller, 2004b). Previous analysis of the circuit model suggests that coherent oscillations between separate objects will become phase locked as their spacing is reduced (Kenyon, Travis et al., 2004). The resulting breakdown in object specificity will necessarily confound the numerosity information encoded by coherent oscillations. To quantify this effect, we examined how the normalized gamma activity was affected as a group of four objects, each covering an 8 × 8 array of model ganglion cells, was moved closer together (see Figure 8a). As expected, at high packing densities, the normalized gamma activity became strongly dependent on the interobject spacing, with the transition between object-specific and nonspecific activity occurring at separations approximately three to five times the distance between neighboring output neurons (see Figure 8b). However, even at the highest densities examined, the four objects still could be distinguished from a single object on the basis of normalized gamma activity alone, although the exact numerosity was no longer represented unambiguously. Thus, even though strict specificity broke down as the objects were packed closer than approximately five interneuron spacings, significant numerosity information was still retained even for very small separations. Moreover, as long as the objects in the image were sufficiently well separated, the numerosity encoded by the normalized gamma activity in the circuit model was independent of their spatial distribution. The above experiments all employed objects possessing the same maximum stimulus intensity. Would coherent oscillations continue to encode numerosity at stimulus intensities corresponding to different levels of luminance contrast? 
Computer-generated images containing from one to four identical objects, with each again covering an array of 8 × 8 output neurons, were presented to the circuit model with the maximum stimulus intensity set to 1/2 or 3/2 times the standard value (see Figure 9, columns 1 and 2, respectively). For a single object, the averaged plateau firing rates were 50, 62, and 87 Hz, corresponding to stimulus intensities of 1/2, 1, and 3/2, respectively, relative to a baseline firing rate of 31 Hz. Despite dividing by the total number of events in the multiunit spike train, the normalized spectral amplitudes were not contrast invariant (see Figure 9a; compare columns 1 and 2). Coherent oscillations in the circuit model are more sensitive to contrast than are the firing rates (Kenyon, Theiler et al., 2004), possibly contributing to the lack of contrast invariance. Nonetheless, for any given contrast, the normalized spectral amplitudes still declined steadily with increasing object number. Likewise, the normalized gamma activity followed a square-root-of-n rule over a threefold range of stimulus intensities (see Figure 9b). In addition, the numerosity discrimination exhibited by the winner-take-all detectors was also roughly invariant with respect to stimulus intensity, as revealed by the similarity of the corresponding tuning
J. Miller and G. Kenyon
Figure 8: Effects of object spacing. (a) The spacing between four identical objects, each covering an 8 × 8 array of output neurons, was reduced from 8 to 0, with 1 corresponding to the distance between two neighboring output neurons. (b) Normalized gamma activity increased as the spacing between objects was reduced. The oscillations evoked by separate objects became partially phase-locked as they were moved closer together, obscuring numerosity information, although discrimination between one and four objects was still possible even at the smallest nonzero separations tested.
curves (see Figure 9c). These results imply that coherent, object-specific oscillations can encode the numerosity of distinct visual items at various luminance contrasts, as long as the differences in the relative luminance contrast between objects in the same image are not too large. Otherwise, some mechanism for contrast normalization, or for making the image representation more binary, would be required to apply the model as currently formulated. We are unaware of any behavioral experiments that explicitly address the issue of how relative luminance contrast affects innate numerical comparisons. It was also useful to consider how numerosity estimates based on normalized gamma activity became degraded as the disparity in the relative sizes of the different objects in the image was made progressively larger. The model circuit was presented with sets of images containing various numbers of objects with equal intensity but whose relative areas differed by a factor of up to 16 (see Figure 10). We expected that the strong size dependence of the coherent oscillations, coupled with surround inhibition, would act to partially preserve numerosity information in the normalized gamma activity even when the image contained objects of different sizes. For input images consisting of one relatively large spot, covering a 16 × 16 array of model ganglion cells, in combination with either one, two, or three much smaller spots, each covering only a 4 × 4 array of model ganglion cells, the numerosity information conveyed by the normalized gamma activity was seriously confounded (see Figure 10a). Some numerosity information was still preserved, as it was possible to partially distinguish between one versus four objects (see Figure 10a, right column), but overall such large discrepancies in relative size had a major impact on the accuracy of numerical comparisons. 
The predicted performance on numerosity discrimination tasks improved dramatically as the difference in the relative sizes of the individual objects was reduced. For example, the numerosity encoded by gamma activity in response to images containing objects whose relative size differed by a factor of 4 supported discriminations comparable to that obtained when relative size differed by only a factor of 2 (see Figure 10b versus Figure 10c). Although performance on numerosity discrimination tasks, when based solely on the normalized gamma activity, clearly degraded as the size disparity between the individual objects became very large, coherent oscillations continued to encode potentially useful numerosity information over an approximately fourfold range of relative object sizes. We are unaware of any behavioral experiments that explicitly examine the issue of how numerical discriminations are affected by large differences in relative object size, but our results clearly predict a significant deterioration in task performance as such disparities become exceedingly large. Finally, to increase the generality of our conclusions, we considered the possibility that the numerosity encoded by normalized gamma activity in the retinal model was influenced by simulation artifacts. Although the
[Figure 9 appears here: panels (a–c), plotted separately for stimulus intensities of 1/2 and 3/2, show normalized frequency spectra, normalized gamma activity versus number of objects, and percentage activation of the winner-take-all numerosity detectors.]
pairwise correlations between model neurons have been shown to be consistent with those measured experimentally, our simulations, based as they were on complex, nonlinear feedback dynamics, might have introduced spurious higher-order correlations that unintentionally favored the extraction of numerosity from multiunit spike trains over single stimulus trials. As a control, we constructed a new set of multiunit spike trains based on a much simpler dynamical process: a common oscillatory modulation applied to an array of identical Poisson spike generators. To model stimuli containing n objects, we constructed n separate 8 × 8 arrays of Poisson spike generators, with each 8 × 8 array driven by an independent source of common modulatory input. This procedure yielded a set of n multiunit spike trains with realistic gamma-band oscillations that were mutually uncorrelated. As with the data generated by the more complex circuit model, the multiunit spike trains produced by common modulatory input supported reasonable numerosity discrimination (see Figure 11). Indeed, the diminished normalized spectral amplitudes with increasing numerosity (see Figure 11a), the corresponding decline in normalized gamma activity (see Figure 11b, black bars), the general adherence to a square-root-of-n rule (see Figure 11b, shaded bars), and the average responses of the numerosity detectors to single-stimulus presentations (see Figure 11c) were all nearly identical to corresponding measures obtained from the dynamic feedback model. However, because the correlations between the separate Poisson spike generators were due entirely to a common modulatory input, the possibility of higher-order artifacts being present in these experiments was greatly reduced.
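The common-input control described above can be sketched directly. The sketch below (rates, modulation depth, and band edges are illustrative choices, not the authors' exact parameters) builds a gamma-band modulatory signal by assigning random phases to band-limited Fourier components and uses it to drive an 8 × 8 array of identical Poisson generators; repeated calls yield mutually uncorrelated multiunit trains, one per "object":

```python
import numpy as np

def gamma_band_modulation(n_bins, dt, rng, f_lo=40.0, f_hi=100.0):
    """Smooth modulatory signal with power confined to the gamma band,
    obtained by randomizing the phases of the band's Fourier components."""
    freqs = np.fft.rfftfreq(n_bins, dt)
    spectrum = np.zeros(freqs.size, dtype=complex)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    spectrum[band] = np.exp(2j * np.pi * rng.random(band.sum()))
    mod = np.fft.irfft(spectrum, n_bins)
    return mod / np.max(np.abs(mod))  # unit peak; zero mean (no DC term)

def multiunit_spike_train(rng, n_cells=64, rate=50.0, depth=0.5,
                          duration=1.0, dt=0.001):
    """One 8 x 8 array of identical Poisson generators sharing a single
    oscillatory modulation; returns the summed spike count per time bin."""
    n_bins = int(round(duration / dt))
    mod = gamma_band_modulation(n_bins, dt, rng)
    p = np.clip(rate * dt * (1.0 + depth * mod), 0.0, 1.0)  # per-bin spike prob.
    spikes = rng.random((n_cells, n_bins)) < p[None, :]
    return spikes.sum(axis=0)

rng = np.random.default_rng(1)
# Four "objects" -> four mutually uncorrelated multiunit spike trains.
trains = [multiunit_spike_train(rng) for _ in range(4)]
```

Because each array draws its modulation from an independent phase randomization, correlations within a train arise solely from the shared modulatory input, with no higher-order structure beyond it.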
4 Discussion

To test the hypothesis that coherent oscillations can implicitly encode numerosity, images containing different numbers of objects, encompassing various shapes, sizes, and shadings, as well as different packing densities, intensities, and areas of illumination, were used to stimulate a model feedback circuit that generated gamma-band activity consistent with experimental data. As a function of the number of objects in the input image,
Figure 9: Numerosity discrimination at different stimulus intensities. Left column: Stimulus intensity = 1/2 (fraction of standard value). Right column: Stimulus intensity = 3/2. (a) Normalized frequency spectra. Although Fourier amplitudes were not contrast invariant, they still declined monotonically with increasing numerosity regardless of the stimulus intensity employed. (b) Normalized gamma activity followed a square-root-of-n rule over a threefold range of stimulus intensities. (c) Percentage activation of winner-take-all numerosity detectors. Tuning curves were roughly independent of stimulus intensity.
[Figure 10 appears here: rows (a–c) correspond to large-to-small object size ratios of 16:1, 4:1, and 2:1; left column, normalized gamma activity versus number of objects; right column, percentage activation of the winner-take-all numerosity detectors.]
Figure 10: Numerosity discriminations involving stimuli of different relative size. All images contained one large object, covering a 16 × 16 array of output neurons, plus zero to three smaller objects. Relative sizes illustrated by insets at the end of each row. (a) Top row: Ratio of large to small object size: 16:1. (b) Middle row: Ratio reduced to 4:1. (c) Bottom row: Ratio reduced to 2:1. Left column: Normalized gamma activity. The square-root-of-n rule was strongly violated at very large ratios of relative object size but was in reasonable accord with the data for relative size ratios less than approximately 4:1. Right column: Percentage activation of winner-take-all (Polsky et al., 2004) numerosity detectors. Tuning curves became reasonably distinct for relative size ratios less than approximately 4:1.
the normalized gamma activity measured during the plateau portion of the response followed a square-root-of-n rule rather closely, consistent with the expectation that the gamma-band oscillations arising from within a single object add in phase, whereas the oscillations arising from separate objects add with random phase. The square-root-of-n rule was found to be generally robust to the shape, size, shading, contrast, and distribution of the individual objects with several notable exceptions. Decreasing the separation between objects confounded numerosity estimates at high packing densities, as the oscillations evoked by separate objects became partially coherent (Kenyon, Travis et al., 2004). The inability of the feedback circuit to completely segment cluttered natural scenes, especially those containing large, overlapping objects, makes it unlikely that the model could have been successfully applied to more realistic images. Likewise, numerosity comparisons involving sets of objects encompassing very different relative sizes or very different relative contrasts (here modeled as stimulus intensity) were also confounded, as the brightest and largest objects could, if the relative disparity was sufficiently great, dominate the normalized gamma activity. Finally, using shorter analysis windows reduced the signal-to-noise ratio and thus led to degraded numerosity discrimination. Nonetheless, except in the most limiting cases, it was usually possible to discriminate between one and several objects based on the normalized gamma activity alone, virtually regardless of nonnumerical factors. To quantitatively assess the numerosity information encoded by coherent, object-specific oscillations, we constructed winner-take-all numerosity detectors based on the normalized gamma activity integrated across all stimulated neurons. 
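The winner-take-all readout just introduced amounts to a nearest-target classifier: each detector stores the mean normalized gamma activity evoked by its preferred numerosity, and the detector whose stored target lies closest to the single-trial measurement is activated. A minimal sketch (the training values below are hypothetical, chosen to fall off roughly as 1/√n):

```python
from typing import Dict, List

def train_detectors(training: Dict[int, List[float]]) -> Dict[int, float]:
    """Target for each numerosity: the mean normalized gamma activity
    evoked by images containing that number of objects."""
    return {n: sum(vals) / len(vals) for n, vals in training.items()}

def winner_take_all(activity: float, targets: Dict[int, float]) -> int:
    """Activate the single detector whose target lies nearest to the
    normalized gamma activity measured on this trial."""
    return min(targets, key=lambda n: abs(targets[n] - activity))

# Hypothetical training trials: activity falling off roughly as 1/sqrt(n).
training = {n: [n ** -0.5 + d for d in (-0.01, 0.0, 0.01)] for n in (1, 2, 3, 4)}
targets = train_detectors(training)
print(winner_take_all(0.98, targets))  # activity near 1/sqrt(1): detector 1 wins
```

Because the targets crowd together as n grows (1/√n flattens), nearby numerosities become harder to separate, which is one way to see the size and distance effects described below.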
An individual numerosity detector was activated whenever the normalized gamma activity on that trial fell nearest to its target value, given by the mean gamma activity evoked by images containing its preferred number of objects. The accuracy of these detectors followed the same general trends for numerosity discrimination observed behaviorally (Nieder & Miller, 2003). As the number of objects increased, the tuning curves broadened, so that it was progressively more difficult to distinguish sets of objects differing by a fixed number of items. Likewise, sets of objects became more distinguishable as their numerical separation increased. These two effects, known as the size and distance effects, respectively, were both clearly evident in our results and suggest that coherent, object-specific oscillations can encode numerosity in a manner consistent with behavioral data. Previous attempts to account for an innate number sense have used one-dimensional neural networks based on arrays of position and size filters (Dehaene & Changeux, 1993). However, to implement an analogous two-dimensional model, large numbers of additional shape and orientation filters, representing all possible 2D objects, would presumably be required. Coherent, object-specific oscillations, on the other hand, can in principle represent numerosity regardless of the shape, size, or locations of the
[Figure 11 appears here: (a) normalized frequency spectra for one to four objects; (b) normalized gamma activity versus number of objects; (c) percentage activation of the winner-take-all numerosity detectors.]
individual visual items. Our proposed mechanism for encoding numerosity, based on normalized gamma activity, thus eliminates the need to allocate large numbers of additional neurons in order to represent every possible object to be enumerated. Other modeling efforts have shown how number-selective responses might be learned de novo by employing an appropriate
plasticity rule (Verguts & Fias, 2004), but these experiments assumed as input a “location map” in which objects were represented a priori as single points, leaving open the question of how numerosity detectors might be constructed from low-level, retinotopic inputs. A completely different approach to an innate number sense is represented by accumulator models, which posit that objects are enumerated sequentially, with the tally being summed into a noisy short-term buffer (Gallistel & Gelman, 2000). Both behavioral and electrophysiological experiments, however, have cast doubt on accumulator models (Nieder, 2005). The response latencies of number-selective cortical neurons, as well as those measured behaviorally, suggest that numerosity is processed in parallel rather than sequentially. Moreover, the response curves recorded from number-selective cortical neurons suggest that numerosity is represented by a population code distributed across multiple broadly tuned elements. Our findings suggest that coherent oscillations encode numerosity in a form that can be extracted from multiunit spike trains over physiologically meaningful timescales. One advantage of using normalized gamma activity for encoding numerosity is that the complexity of the model does not increase as a function of either the number of objects to be detected or the number of possible combinations of object size, shape, or distribution. By employing numerosity detectors based purely on the normalized gamma activity, we could naturally take advantage of the translational and rotational invariance of low-level visual circuitry, thus automatically generalizing over a wide range of nonnumerical properties.
The issue of how numerosity information encoded by coherent oscillations might be extracted by downstream elements was not addressed in this study, although it has been suggested that increased gamma activity or synchrony could differentially activate target structures (Alonso, Usrey, & Reid, 1996; Kenyon, Fetz, & Puff, 1990; Usrey, Alonso, & Reid, 2000). In addition, known physiological mechanisms such as facilitating synapses (Beierlein &
Figure 11: Common input model. To control for the possibility that our simulations introduced spurious higher-order correlations that inadvertently enhanced numerosity encoding, control sets of artificial multiunit spike trains were constructed by applying a common oscillatory modulation to separate 8 × 8 arrays of identical Poisson spike generators. Separate, uncorrelated modulatory signals, produced by randomizing the phases of the individual Fourier components, were applied to each 8 × 8 array, corresponding to distinct visual objects. (a) Normalized frequency spectra. (b) Normalized gamma activity. (c) Percentage activation of winner-take-all numerosity detectors. Coherent, object-specific, gamma-band oscillations produced by the common input model encoded numerosity in a manner that was very similar to the dynamic feedback circuit model.
Connors, 2002), nonlinear dendritic processing (Polsky et al., 2004), and dynamical resonances (Miller et al., 2006; Frishman et al., 1987; Rager & Singer, 1998; Vigh, Solessio, Morgans, & Lasater, 2003) might confer a differential sensitivity to oscillatory input. Innate numerical competency extends well beyond the few examples considered here. Rather than being limited to simultaneously presented visual items, the number sense characterized extensively in animals and humans encompasses multiple sensory and motor modalities (Dehaene, 1997), such as the ability to respond to a particular number of tones (Meck & Church, 1983) or to generate a specified number of bar presses (Mechner & Guevrekian, 1962). Coherent oscillations arising early in the visual processing pathway can at best account for only a narrow aspect of this much wider phenomenon. Nonetheless, we may ask to what extent the mechanisms illustrated by our restricted model might be expanded to include a broader range of behaviors. Although our results were obtained using a model of the inner retina, gamma-band oscillations of cortical origin (Castelo-Branco et al., 1998; Gray et al., 1989; Gray & Singer, 1989; Kreiter & Singer, 1996; Singer & Gray, 1995) could likely encode similar, and indeed much higher-quality, numerosity information. The superior processing resources available to the cerebral cortex could presumably generate coherent oscillations that were much more object specific. In particular, by utilizing additional segmentation cues, such as shape, texture, color, depth, and motion, cortically generated coherent oscillations might encode numerosity much more reliably than was true of our comparatively simple feedback circuit. 
From this point of view, the much higher frequencies of the coherent oscillations arising in the retina versus those arising in the cerebral cortex (Castelo-Branco et al., 1998) suggest a possible strategy for distinguishing between numerosity information encoded at these two very distinct processing stages. Although their status remains speculative, theoretical models have also